
InternVLA-M1

A spatially guided vision-language-action (VLA) framework for generalist robot policies, developed by Shanghai AI Lab.

Overview

InternVLA-M1 uses a two-stage training pipeline: (1) spatial grounding pre-training on 2.3M spatial reasoning samples teaches the model "where to act," and (2) spatially guided action post-training teaches it "how to act." The framework is modular and extensible, with dual supervision across the grounding and action objectives.
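
To make the two-stage, dual-supervision idea concrete, below is a minimal PyTorch-style sketch. Every name, dimension, and loss choice here is an illustrative assumption, not the actual InternVLA-M1 code: a grounding head predicts a 2D "where to act" point, and an action head conditions on that prediction, so both objectives can be supervised jointly.

```python
# Hypothetical sketch of spatially guided action prediction.
# Class names, dimensions, and losses are illustrative assumptions,
# not the InternVLA-M1 implementation.
import torch
import torch.nn as nn

class SpatialGrounder(nn.Module):
    """Stage 1 idea: predict a 2D point ("where to act") from pooled features."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2)
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, feat_dim) pooled vision-language features
        return self.head(image_feats)  # (batch, 2) normalized x, y

class SpatiallyGuidedPolicy(nn.Module):
    """Stage 2 idea: condition the action head on the grounded location."""
    def __init__(self, feat_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.grounder = SpatialGrounder(feat_dim)
        self.action_head = nn.Sequential(
            nn.Linear(feat_dim + 2, 128), nn.ReLU(), nn.Linear(128, action_dim)
        )

    def forward(self, image_feats: torch.Tensor):
        where = self.grounder(image_feats)  # "where to act"
        how = self.action_head(torch.cat([image_feats, where], dim=-1))
        return where, how  # dual outputs enable dual supervision

# Toy training step: a grounding loss plus an action loss on random data.
policy = SpatiallyGuidedPolicy()
feats = torch.randn(4, 256)
where, how = policy(feats)
loss = nn.functional.mse_loss(where, torch.rand(4, 2)) \
     + nn.functional.mse_loss(how, torch.randn(4, 7))
loss.backward()
```

The point of the sketch is the conditioning pattern: the action head never sees the image features alone, only together with the grounder's prediction, which is one simple way "where to act" can guide "how to act."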

Benchmarks

  • SimplerEnv Google Robot: 76.0% (visual matching), 80.7% (variant aggregation); WidowX: 71.7%
  • LIBERO: 95.9% average success rate
  • +14.6% improvement on SimplerEnv; +20.6% on unseen objects with synthetic data co-training

Citation

See the project site for BibTeX and paper references.