OpenVLA
An open-source vision-language-action model for robotic manipulation, developed by researchers at Stanford, UC Berkeley, Toyota Research Institute (TRI), Google DeepMind, and MIT.
Overview
OpenVLA is a 7B-parameter vision-language-action (VLA) model trained on 970K real-world robot demonstrations from the Open X-Embodiment dataset. It pairs a Llama 2 language backbone with fused visual encoders (DINOv2 + SigLIP) and outperforms the 55B-parameter RT-2-X by 16.5% in absolute task success rate while using 7× fewer parameters.
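A VLA model of this kind predicts robot actions as discrete tokens: each continuous action dimension is binned, and the language model emits one bin token per dimension. The sketch below illustrates that (de)tokenization idea in isolation; the bin count of 256 matches common VLA practice, but the action bounds and 7-DoF shape are illustrative assumptions, not OpenVLA's exact configuration.

```python
import numpy as np

N_BINS = 256  # bins per action dimension (typical for VLA action tokenizers)

def tokenize_action(action, low, high):
    """Map continuous action values to discrete bin indices in [0, N_BINS-1]."""
    action = np.clip(action, low, high)
    frac = (action - low) / (high - low)               # normalize to [0, 1]
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def detokenize_action(tokens, low, high):
    """Map bin indices back to continuous values at each bin's center."""
    centers = (tokens + 0.5) / N_BINS                  # bin centers in [0, 1]
    return low + centers * (high - low)

# Assumed 7-DoF action (e.g. end-effector deltas + gripper) bounded in [-1, 1].
low, high = np.full(7, -1.0), np.full(7, 1.0)
a = np.array([0.1, -0.3, 0.8, 0.0, 0.5, -0.9, 1.0])

tokens = tokenize_action(a, low, high)
recovered = detokenize_action(tokens, low, high)

# Round-trip error is bounded by half a bin width per dimension.
assert np.all(np.abs(recovered - a) <= (high - low) / N_BINS)
```

Discretizing actions this way lets a pretrained language model treat control as next-token prediction, so the same transformer decodes both text and motor commands.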
Architecture & Training
- 7B parameters
- Llama 2 language backbone with fused DINOv2 + SigLIP visual encoders
- Trained on 970K robot demonstrations from the Open X-Embodiment dataset
- Controls multiple robot embodiments out of the box, including in zero-shot settings
- Parameter-efficient adaptation to new tasks via LoRA fine-tuning on consumer GPUs
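The LoRA fine-tuning mentioned above works by freezing the pretrained weights and training only a low-rank additive update. The numpy sketch below shows the core mechanism; the layer sizes, rank, and scaling values are illustrative assumptions, not OpenVLA's actual hyperparameters.

```python
import numpy as np

# Minimal sketch of LoRA: instead of updating a full weight matrix W
# (d_out x d_in), train a low-rank update B @ A with rank r << min(d_out, d_in).
# Dimensions below are typical LLM layer sizes, chosen for illustration.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 4096, 4096, 32, 64

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection (init 0)

def lora_forward(x):
    """Forward pass with the low-rank adapter: (W + (alpha/r) * B @ A) @ x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

# With B initialized to zero, the adapted layer starts identical to the base.
x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"trainable params: {lora_params:,} vs {full_params:,}")
```

Because only `A` and `B` are trained (here 64× fewer parameters than the full matrix), the optimizer state and gradients fit comfortably in consumer-GPU memory, which is what makes fine-tuning a 7B model practical outside a datacenter.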
Official Links
- openvla.github.io — Project site
- github.com/openvla/openvla — Code & training
- Hugging Face: openvla — Model checkpoints
Citation
Published at CoRL 2024. See the project site for BibTeX.