OpenVLA
An open-source vision-language-action model for robotic manipulation, developed by researchers at Stanford, UC Berkeley, Toyota Research Institute (TRI), Google DeepMind, and MIT.
Overview
OpenVLA is a 7B-parameter vision-language-action (VLA) model trained on 970K real-world robot demonstrations from the Open X-Embodiment dataset. It pairs a Llama 2 language backbone with fused visual encoders (DINOv2 + SigLIP) and outperforms the 55B-parameter RT-2-X by 16.5% in absolute task success rate while using 7× fewer parameters.
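A VLA model of this kind predicts robot actions as discrete tokens: each continuous action dimension is binned, and the language model emits one bin token per dimension. The sketch below illustrates that (de)tokenization idea in isolation; the bin count of 256 matches common VLA practice, but the action bounds and 7-DoF shape are illustrative assumptions, not OpenVLA's exact configuration.

```python
import numpy as np

N_BINS = 256  # bins per action dimension (typical for VLA action tokenizers)

def tokenize_action(action, low, high):
    """Map continuous action values to discrete bin indices in [0, N_BINS-1]."""
    action = np.clip(action, low, high)
    frac = (action - low) / (high - low)               # normalize to [0, 1]
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def detokenize_action(tokens, low, high):
    """Map bin indices back to continuous values at each bin's center."""
    centers = (tokens + 0.5) / N_BINS                  # bin centers in [0, 1]
    return low + centers * (high - low)

# Assumed 7-DoF action (e.g. end-effector deltas + gripper) bounded in [-1, 1].
low, high = np.full(7, -1.0), np.full(7, 1.0)
a = np.array([0.1, -0.3, 0.8, 0.0, 0.5, -0.9, 1.0])

tokens = tokenize_action(a, low, high)
recovered = detokenize_action(tokens, low, high)

# Round-trip error is bounded by half a bin width per dimension.
assert np.all(np.abs(recovered - a) <= (high - low) / N_BINS)
```

Discretizing actions this way lets a pretrained language model treat control as next-token prediction, so the same transformer decodes both text and motor commands.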
Architecture & Training
- 7B parameters
- Llama 2 language backbone with fused DINOv2 + SigLIP visual encoders
- Trained on 970K robot demonstrations from the Open X-Embodiment dataset
- Controls multiple robot embodiments out of the box, including in zero-shot settings
- Parameter-efficient adaptation to new tasks via LoRA fine-tuning on consumer GPUs
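The LoRA fine-tuning mentioned above works by freezing the pretrained weights and training only a low-rank additive update. The numpy sketch below shows the core mechanism; the layer sizes, rank, and scaling values are illustrative assumptions, not OpenVLA's actual hyperparameters.

```python
import numpy as np

# Minimal sketch of LoRA: instead of updating a full weight matrix W
# (d_out x d_in), train a low-rank update B @ A with rank r << min(d_out, d_in).
# Dimensions below are typical LLM layer sizes, chosen for illustration.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 4096, 4096, 32, 64

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection (init 0)

def lora_forward(x):
    """Forward pass with the low-rank adapter: (W + (alpha/r) * B @ A) @ x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

# With B initialized to zero, the adapted layer starts identical to the base.
x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"trainable params: {lora_params:,} vs {full_params:,}")
```

Because only `A` and `B` are trained (here 64× fewer parameters than the full matrix), the optimizer state and gradients fit comfortably in consumer-GPU memory, which is what makes fine-tuning a 7B model practical outside a datacenter.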
Official Links
- openvla.github.io — Project site
- github.com/openvla/openvla — Code & training
- Hugging Face: openvla — Model checkpoints
Citation
Published at CoRL 2024. See the project site for BibTeX.