
VLA & VLM

Vision-Language-Action and Vision-Language Models — language-conditioned robot control.

What Are VLA and VLM?

VLM (Vision-Language Model) — Multimodal models that jointly understand images and text. Used for captioning, visual question answering (VQA), and grounding.

VLA (Vision-Language-Action) — VLMs extended to output robot actions. Take camera images + a language instruction, output control commands (e.g., end-effector or joint deltas plus a gripper command). Enable "pick up the red block" style control.
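The interface can be sketched minimally: a policy maps (image, instruction) to a fixed-shape action. The class and field names below are hypothetical placeholders, not any real model's API; real VLAs run a vision-language backbone and decode action tokens instead of returning a constant.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    # A common VLA output convention: 6 end-effector deltas + 1 gripper command.
    deltas: list = field(default_factory=lambda: [0.0] * 6)
    gripper: float = 0.0  # 0.0 = open, 1.0 = closed

class DummyVLA:
    """Hypothetical stand-in for a VLA policy (illustration only)."""
    def predict(self, image, instruction: str) -> Action:
        # A real model would encode the image and instruction jointly;
        # here we just return a fixed-shape placeholder action.
        return Action()

policy = DummyVLA()
act = policy.predict(image=None, instruction="pick up the red block")
print(len(act.deltas), act.gripper)  # → 6 0.0
```

In a real deployment this predict call runs in a closed loop: each predicted action is executed, a new camera frame is captured, and the model is queried again.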

Key Models

  • OpenVLA — 7B-parameter open-source VLA, trained on ~970K robot demonstrations from Open X-Embodiment
  • RT-2 / RT-X — Google DeepMind's VLA family; RT-2 builds on web-scale VLM backbones
  • Octo — Generalist transformer policy with a diffusion action head and language conditioning
  • RoboFlamingo — OpenFlamingo-based VLM adapted for robot manipulation

Related Resources