
OpenVLA

An open-source vision-language-action model for robotic manipulation. Stanford, UC Berkeley, Toyota Research Institute, Google DeepMind, MIT.

Overview

OpenVLA is a 7B-parameter vision-language-action (VLA) model trained on 970K real-world robot demonstrations from the Open X-Embodiment dataset. It pairs a Llama 2 language backbone with fused visual features from DINOv2 and SigLIP encoders, and it outperforms the 55B-parameter RT-2-X by 16.5% in absolute task success rate while using 7× fewer parameters.
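
For reference, here is a minimal inference sketch in Python. It follows the usage pattern published with the released openvla/openvla-7b checkpoint on Hugging Face (a custom processor plus a predict_action method loaded via trust_remote_code); the image path, instruction, and unnorm_key value are illustrative placeholders.

```python
# Minimal OpenVLA inference sketch; paths and the instruction are placeholders.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

image = Image.open("observation.png")  # current camera frame (placeholder path)
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
# predict_action decodes discrete action tokens and de-normalizes them using
# statistics from the chosen training mixture (here, the Bridge data).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-DoF end-effector delta: x, y, z, roll, pitch, yaw, gripper
```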

Architecture & Training

  • 7B parameters
  • Llama 2 backbone + DINOv2/SigLIP visual encoder
  • 970K demos from Open X-Embodiment
  • Multi-robot, zero-shot transfer
  • LoRA fine-tuning on consumer GPUs (see the sketch after this list)
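
The last bullet can be made concrete with Hugging Face PEFT. This is a hedged illustration, not the repository's exact fine-tuning script; the rank, alpha, and target-module settings are assumptions loosely based on the paper's reported recipe (low-rank adapters over all linear layers), and the training loop is omitted.

```python
# Hedged LoRA setup sketch using Hugging Face PEFT; hyperparameters are
# assumptions approximating the paper's setup, not an official recipe.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=32,                         # low-rank adapter dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # wrap every linear layer with an adapter
    init_lora_weights="gaussian",
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()  # only a small fraction of the 7B weights train
```

From here, the wrapped model trains with a standard supervised next-token objective on action tokens, and merge_and_unload() folds the adapters back into a single checkpoint for deployment.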

Official Links

  • Project site: https://openvla.github.io
  • Code: https://github.com/openvla/openvla
  • Weights: https://huggingface.co/openvla

Citation

Published at CoRL 2024. See the project site for BibTeX.