
InternVLA-M1

A spatially guided vision-language-action (VLA) framework for generalist robot policies, developed by Shanghai AI Lab.

Overview

InternVLA-M1 uses a two-stage training pipeline: (1) spatial grounding pre-training on 2.3M spatial reasoning samples teaches the model "where to act," and (2) spatially guided action post-training teaches it "how to act." The framework is modular and extensible, with dual supervision across the grounding and action objectives.
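
To make the two-stage, dual-supervision idea concrete, below is a minimal PyTorch-style sketch. Every name, dimension, and loss choice here is an illustrative assumption, not the actual InternVLA-M1 code: a grounding head predicts a 2D "where to act" point, and an action head conditions on that prediction, so both objectives can be supervised jointly.

```python
# Hypothetical sketch of spatially guided action prediction.
# Class names, dimensions, and losses are illustrative assumptions,
# not the InternVLA-M1 implementation.
import torch
import torch.nn as nn

class SpatialGrounder(nn.Module):
    """Stage 1 idea: predict a 2D point ("where to act") from pooled features."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2)
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, feat_dim) pooled vision-language features
        return self.head(image_feats)  # (batch, 2) normalized x, y

class SpatiallyGuidedPolicy(nn.Module):
    """Stage 2 idea: condition the action head on the grounded location."""
    def __init__(self, feat_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.grounder = SpatialGrounder(feat_dim)
        self.action_head = nn.Sequential(
            nn.Linear(feat_dim + 2, 128), nn.ReLU(), nn.Linear(128, action_dim)
        )

    def forward(self, image_feats: torch.Tensor):
        where = self.grounder(image_feats)  # "where to act"
        how = self.action_head(torch.cat([image_feats, where], dim=-1))
        return where, how  # dual outputs enable dual supervision

# Toy training step: a grounding loss plus an action loss on random data.
policy = SpatiallyGuidedPolicy()
feats = torch.randn(4, 256)
where, how = policy(feats)
loss = nn.functional.mse_loss(where, torch.rand(4, 2)) \
     + nn.functional.mse_loss(how, torch.randn(4, 7))
loss.backward()
```

The point of the sketch is the conditioning pattern: the action head never sees the image features alone, only together with the grounder's prediction, which is one simple way "where to act" can guide "how to act."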

Benchmarks

  • SimplerEnv Google Robot: 76.0% (visual matching), 80.7% (variant aggregation); WidowX: 71.7%
  • LIBERO: 95.9% average success rate
  • +14.6% improvement on SimplerEnv; +20.6% on unseen objects with synthetic data co-training

Citation

See the project site for BibTeX and paper references.