
VLA & VLM

Vision-Language-Action and Vision-Language Models — language-conditioned robot control.

What Are VLA and VLM?

VLM (Vision-Language Model) — Multimodal models that jointly understand images and text. Used for captioning, visual question answering (VQA), and grounding.

VLA (Vision-Language-Action) — VLMs extended to output robot actions. Take camera images + a language instruction, output control commands (e.g., end-effector or joint deltas plus a gripper command). Enable "pick up the red block" style control.
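The interface can be sketched minimally: a policy maps (image, instruction) to a fixed-shape action. The class and field names below are hypothetical placeholders, not any real model's API; real VLAs run a vision-language backbone and decode action tokens instead of returning a constant.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    # A common VLA output convention: 6 end-effector deltas + 1 gripper command.
    deltas: list = field(default_factory=lambda: [0.0] * 6)
    gripper: float = 0.0  # 0.0 = open, 1.0 = closed

class DummyVLA:
    """Hypothetical stand-in for a VLA policy (illustration only)."""
    def predict(self, image, instruction: str) -> Action:
        # A real model would encode the image and instruction jointly;
        # here we just return a fixed-shape placeholder action.
        return Action()

policy = DummyVLA()
act = policy.predict(image=None, instruction="pick up the red block")
print(len(act.deltas), act.gripper)  # → 6 0.0
```

In a real deployment this predict call runs in a closed loop: each predicted action is executed, a new camera frame is captured, and the model is queried again.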

Key Models

  • OpenVLA — 7B-parameter open-source VLA, trained on ~970K robot demonstrations from Open X-Embodiment
  • RT-2 / RT-X — Google DeepMind's VLA family; RT-2 builds on web-scale VLM backbones
  • Octo — Generalist transformer policy with a diffusion action head and language conditioning
  • RoboFlamingo — OpenFlamingo-based VLM adapted for robot manipulation

Related Resources