多模态大模型崛起预言成真！现在断言：VLA技术五年内必火，引领AI新潮流！

Vision Language Action（VLA）旨在通过融合计算机视觉（CV）、自然语言处理（NLP）和机器人学，使智能体能够理解视觉信息、解析语言指令，并生成物理世界中的动作序列，实现从感知到执行的闭环能力。本质上VLA是多模态大模型的一个子集。

全栈大佬！

892人浏览 · 2025-06-10 11:37:36

全栈大佬！ · 2025-06-10 11:37:36 发布

Vision Language Action通过多模态融合与端到端架构，正在推动人工智能从“感知智能”迈向“具身智能”。其核心技术如预训练模型、分层策略、实时优化等，已在机器人、自动驾驶等领域展现出巨大潜力。尽管面临数据、算力、安全等挑战，随着多模态大模型与硬件技术的进步，VLA有望在未来五年内实现大规模商用，重塑人机交互的未来。

VLA模型的发展

技术架构与核心组件

VLA模型通常包含三个核心模块：

多模态编码器：
- 视觉编码器：处理图像或视频输入，提取场景特征（如目标类别、空间关系）。常用技术包括卷积神经网络（CNN）、视觉Transformer（ViT）及Segment Anything Model（SAM）等。
- 语言编码器：解析自然语言指令，生成语义表示。大语言模型（LLMs）如GPT-4、PaLM-E在此类任务中表现突出，能够理解复杂指令并生成逻辑推理链。
动作解码器：
- 将多模态编码后的信息映射为机器人或车辆的控制指令。例如，自动驾驶中的VLA模型可直接输出轨迹规划（如转向、加速），而机器人控制模型则生成关节角度或末端执行器位姿。
闭环交互机制：
- 通过强化学习（RL）或模仿学习（IL）优化动作策略，结合实时视觉反馈调整执行过程。例如，Google的RT-2模型通过共同微调（Co-finetuning）互联网图文数据与机器人动作数据，显著提升泛化能力。

关键技术与研究方向

1. 多模态融合与泛化能力

预训练与微调策略：
- 基于预训练视觉语言模型（VLM）如CLIP、Flamingo，通过微调机器人或自动驾驶数据实现多模态对齐。例如，DeepMind的RT-2在VLM基础上引入机器人运动轨迹数据，直接输出可执行动作。
- 分阶段训练（如ChatVLA的Phased Alignment Training）可避免虚假遗忘，即在训练控制任务时保留原有的视觉-语言对齐能力。
模型架构创新：
- 混合专家（MoE）架构：如ChatVLA通过共享注意力层与独立MLP层设计，减少任务干扰，同时支持多模态理解与机器人控制。
- 并行解码与动作分块：OpenVLA-OFT采用并行解码替代自回归生成，结合动作分块技术，将推理速度提升26倍，适用于高频控制场景。

2. 具身智能与任务执行

分层策略设计：
- 高层任务规划器：将长期任务分解为子任务（如“整理房间”→“找到玩具”→“放入箱子”），常用LLM或代码生成逻辑。
- 低层控制策略：生成具体动作序列，如基于扩散模型的连续动作生成（Diffusion Policy）或逆运动学计算。
零样本与少样本学习：
- PaLM-E等模型通过大模型的“世界知识”实现未训练任务的执行，例如用香蕉击倒积木等非常规操作。

3. 实时性与硬件优化

高效推理技术：
- 模型压缩（如知识蒸馏）和硬件适配（如地平线征程6芯片）可提升实时响应速度。例如，征程6的BPU架构针对VLA优化，能效比提升3倍。
- 并行解码与动作分块技术（如OpenVLA-OFT）在保证精度的同时，显著降低延迟。

4. 可解释性与安全性

思维链（CoT）推理：
- EMMA等自动驾驶模型通过生成自然语言解释（如“行人横穿马路，车辆减速等待”），增强决策透明度。
安全机制设计：
- 对抗训练与动态安全边界（如自动驾驶中的紧急制动策略）可减少误操作风险。

典型应用场景

1. 机器人与工业自动化

家庭服务机器人：如Helix模型支持多机器人协作处理未知物体，通过双系统架构实现实时控制与复杂决策。
工业机械臂：宝马工厂的机械臂通过语音指令切换工具，亚马逊AGV机器人结合视觉与语言导航分拣货物。

2. 自动驾驶

端到端控制：元戎启行的VLA量产车型基于英伟达Thor芯片，直接从摄像头输入生成驾驶指令，处理潮汐车道、施工绕行等复杂场景。
多模态决策：Google的EMMA模型将驾驶任务转化为视觉问答，同时输出轨迹、目标检测结果及推理解释，提升系统鲁棒性。

3. 医疗与特殊环境

手术机器人：VLA模型可结合医学影像与语音指令，实现精准器械操作（如腹腔镜手术）。
灾难救援：无人机或地面机器人通过视觉与语言指令在危险环境中执行搜救任务。

挑战与未来趋势

1. 当前挑战

数据稀缺性：真实机器人或自动驾驶数据收集成本高，且分布差异大（如不同光照、路况）。
实时性与算力瓶颈：多模态大模型的高参数需求与车端芯片性能限制（如英伟达Orin-X算力不足）形成矛盾。
动作可靠性：物理世界的不确定性（如物体动态变化）可能导致动作失败，需结合闭环反馈优化。

2. 研究趋势

通用具身智能：探索终身学习与跨任务迁移，如RoboCat通过自改进机制持续提升机器人操作能力。
大模型与物理交互融合：结合GPT-4、Gemini等模型的推理能力与机器人运动规划，实现更复杂的任务执行。
伦理与安全研究：制定VLA系统的伦理规范，解决隐私保护、责任归属等问题。

VLA热门项目

https://github.com/yueen-ma/Awesome-VLA/blob/main/README.md
https://github.com/openvla/openvla

Definitions

Generalized VLA
Input: state, instruction.
Output: action.
Large VLA
A special type of generalized VLA that is adapted from large VLMs. (Same as VLA defined by RT-2.)

Taxonomy

Components of VLA

Reinforcement Learning

DT: "Decision Transformer: Reinforcement Learning via Sequence Modeling", NeurIPS, 2021 [Paper][Code]

Trajectory Transformer: "Offline Reinforcement Learning as One Big Sequence Modeling Problem", NeurIPS, 2021 [Paper][Code]

SEED: "Primitive Skill-based Robot Learning from Human Evaluative Feedback", IROS, 2023 [Paper][Code]

Reflexion: "Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS, 2023 [Paper][Code]

Pretrained Visual Representations

"Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 [Paper][Website][Code]

MVP: "Real-World Robot Learning with Masked Visual Pre-training", CoRL, 2022 [Paper][Website][Code]

Voltron: "Language-Driven Representation Learning for Robotics", RSS, 2023 [Paper]

VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", NeurIPS, 2023 [Paper][Website][Code]

"The (Un)surprising Effectiveness of Pre-Trained Vision Models for Control", ICML, 2022 [Paper]

R3M: "R3M: A Universal Visual Representation for Robot Manipulation", CoRL, 2022 [Paper][Website][Code]

VIP: "VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training", ICLR, 2023 [Paper][Website][Code]

DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", Trans. Mach. Learn. Res., 2023 [Paper][Code]

RPT: "Robot Learning with Sensorimotor Pre-training", CoRL, 2023 [Paper][Website]

I-JEPA: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", CVPR, 2023 [Paper]

Theia: "Theia: Distilling Diverse Vision Foundation Models for Robot Learning", CoRL, 2024 [Paper]
HRP: "HRP: Human Affordances for Robotic Pre-Training", RSS, 2024 [Paper][Website][Code]

Video Representations

F3RM: "Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation", CoRL, 2023 [Paper][Website][Code]

PhysGaussian: "PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics", CVPR, 2024 [Paper][Website][Code]

UniGS: "UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting", ICLR, 2025 [Paper][Code]

That Sounds Right: "That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation", CoRL, 2023 [Paper][Code]

Dynamics Learning

MaskDP: "Masked Autoencoding for Scalable and Generalizable Decision Making", NeurIPS, 2022 [Paper][Code]

PACT: "PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training", IROS, 2023 [Paper]

GR-1: "Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", ICLR, 2024 [Paper]

SMART: "SMART: Self-supervised Multi-task pretrAining with contRol Transformers", ICLR, 2023 [Paper]

MIDAS: "Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning", ICML, 2024 [Paper][Website]

Vi-PRoM: "Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods", IROS, 2023 [Paper][Website]

VPT: "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos", NeurIPS, 2022 [Paper]

World Models

"A Path Towards Autonomous Machine Intelligence", OpenReview, 2022 [Paper]

DreamerV1: "Dream to Control: Learning Behaviors by Latent Imagination", ICLR, 2020 [Paper]

DreamerV2: "Mastering Atari with Discrete World Models", ICLR, 2021 [Paper]

DreamerV3: "Mastering Diverse Domains through World Models", arXiv, Jan 2023 [Paper]

DayDreamer: "DayDreamer: World Models for Physical Robot Learning", CoRL, 2022 [Paper]

TWM: "Transformer-based World Models Are Happy With 100k Interactions", ICLR, 2023 [Paper]

IRIS: "Transformers are Sample-Efficient World Models", ICLR, 2023 [Paper][Code]

LLM-induced World Models

DECKARD: "Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling", ICML, 2023 [Paper][Website][Code]

LLM-MCTS: "Large Language Models as Commonsense Knowledge for Large-Scale Task Planning", NeurIPS, 2023 [Paper]

RAP: "Reasoning with Language Model is Planning with World Model", EMNLP, 2023 [Paper]

LLM+P: "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency", arXiv, Apr 2023 [Paper][Code]

LLM-DM: "Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning", NeurIPS, 2023 [Paper][Website][Code]

Visual World Models

E2WM: "Language Models Meet World Models: Embodied Experiences Enhance Language Models", NeurIPS, 2023 [Paper][Code]

Genie: "Genie: Generative Interactive Environments", ICML, 2024 [Paper][Website]

3D-VLA: "3D-VLA: A 3D Vision-Language-Action Generative World Model", ICML, 2024 [Paper][Code]

UniSim: "Learning Interactive Real-World Simulators", ICLR, 2024 [Paper][Code]

Reasoning

ThinkBot: "ThinkBot: Embodied Instruction Following with Thought Chain Reasoning", arXiv, Dec 2023 [Paper]

ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper]

RAT: "RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation", arXiv, Mar 2024 [Paper]

ECoT: "Robotic Control via Embodied Chain-of-Thought Reasoning", arXiv, Jul 2024 [Paper]

Tree-Planner: "Tree-Planner: Efficient Close-loop Task Planning with Large Language Models", ICLR, 2024 [Paper]

OpenVLA: "OpenVLA: An Open-Source Vision-Language-Action Model", arXiv, Jun 2024 [Paper]

Low-level Control Policies

Non-Transformer Control Policies

Transporter Networks: "Transporter Networks: Rearranging the Visual World for Robotic Manipulation", CoRL, 2020 [Paper]

CLIPort: "CLIPort: What and Where Pathways for Robotic Manipulation", CoRL, 2021 [Paper][Website][Code]

BC-Z: "BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning", CoRL, 2021 [Paper][Website][Code]

HULC: "What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data", arXiv, Apr 2022 [Paper][Website][Code]

HULC++: "Grounding Language with Visual Affordances over Unstructured Data", ICRA, 2023 [Paper][Website][Paper]

MCIL: "Language Conditioned Imitation Learning over Unstructured Data", Robotics: Science and Systems, 2021 [Paper][Website][Paper]

UniPi: "Learning Universal Policies via Text-Guided Video Generation", NeurIPS, 2023 [Paper][Website]

Transformer-based Control Policies

RoboFlamingo: "Vision-Language Foundation Models as Effective Robot Imitators", arXiv, Jan 2025 [Paper][Website][Code]

ACT: "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", Robotics: Science and Systems, 2023 [Paper]

RoboCat: "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation", arXiv, Mar 2021 [Paper]

Gato: "A Generalist Agent", Trans. Mach. Learn. Res., 2022 [Paper]

RT-Trajectory: "RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches", ICLR, 2023 [Paper]

Q-Transformer: "Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions", arXiv, Sep 2023 [Paper]

Interactive Language: "Interactive Language: Talking to Robots in Real Time", arXiv, Oct 2022 [Paper]

RT-1: "RT-1: Robotics Transformer for Real-World Control at Scale", RSS, 2023 [Paper][Website]

MT-ACT: "RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking", ICRA, 2024 [Paper][Code][Code]

Hiveformer: "Instruction-driven history-aware policies for robotic manipulations", CoRL, 2022 [Paper][Website][Code]

Control Policies for Multimodal Instructions

VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", arXiv, Oct 2022 [Paper]

MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", CoRL, 2023 [Paper]

Control Policies with 3D Vision

VER: "Volumetric Environment Representation for Vision-Language Navigation", CVPR, 2024 [Paper][Code]

RVT: "RVT: Robotic View Transformer for 3D Object Manipulation", CoRL, 2023 [Paper]

RVT-2: "RVT-2: Learning Precise Manipulation from Few Demonstrations", arXiv, Jun 2024 [Paper]

RoboUniView: "RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulaiton", [Code]

PerAct: "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation", CoRL, 2022 [Paper]

Act3D: "Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation", CoRL, 2023 [Paper][Website][Code]

Diffusion-based Control Policies

MDT: "Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals", Robotics: Science and Systems, 2024 [Paper][Website][Code]

RDT-1B: "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation", arXiv, Oct 2024 [Paper][Website][Code]

Diffusion Policy: "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", Robotics: Science and Systems, 2023 [Paper][Website][Code]

Octo: "Octo: An Open-Source Generalist Robot Policy", Robotics: Science and Systems, 2024 [Paper][Website][Code]

SUDD: "Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition", CoRL, 2023 [Paper][Code]

Diffusion-based Control Policies with 3D Vision

3D Diffuser Actor: "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations", arXiv, Feb 2024 [Paper][Code]

DP3: "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations", Proceedings of Robotics: Science and Systems (RSS), 2024 [Paper][Website][Code]

Control Policies for Motion Planning

VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", CoRL, 2023 [Paper][Website][Code]

Language costs: "Correcting Robot Plans with Natural Language Feedback", Robotics: Science and Systems, 2022 [Paper][Website]

RoboTAP: "RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation", ICRA, 2024 [Paper][Website]

Control Policies with Point-based Action

ReKep: "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation", arXiv, Sep 2024 [Paper][Website][Code]

RoboPoint: "RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics", arXiv, Jun 2024 [Paper][Website][Code]

PIVOT: "PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs", ICML, 2024 [Paper][Website]

Large VLA

RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL, 2023 [Paper][Website]

RT-H: "RT-H: Action Hierarchies Using Language", Robotics: Science and Systems, 2024 [Paper][Website]

RT-X, OXE: "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", arXiv, Oct 2023 [Paper][Website][Code]

OpenVLA: "OpenVLA: An Open-Source Vision-Language-Action Model", CoRL, 2024 [Paper][Website][Code]

π0: "π0: A Vision-Language-Action Flow Model for General Robot Control", arXiv, Oct 2024 [Paper][Website]

Task Planners

Monolithic Task Planners

Grounded Task Planners

(SL)^3: "Skill Induction and Planning with Latent Language", ACL, 2022 [Paper]

Translated <LM>: "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents", ICML, 2022 [Paper][Code]

SayCan: "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", CoRL, 2022 [Paper][Website][Code]

End-to-end Task Planners

EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", NeurIPS, 2023 [Paper][Code]

PaLM-E: "PaLM-E: An Embodied Multimodal Language Model", ICML, 2023 [Paper][Website]

End-to-end Task Planners with 3D Vision

MultiPLY: "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World", CVPR, 2024 [Paper]

3D-LLM: "3D-LLM: Injecting the 3D World into Large Language Models", NeurIPS, 2023 [Paper][Website]

LEO: "An Embodied Generalist Agent in 3D World", ICML, 2024 [Paper][Website][Code]

ShapeLLM: "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction", ECCV, 2024 [Paper][Website][Code]

Modular Task Planners

Language-based Task Planners

ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper][Website][Code]

Socratic Models: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 [Paper]

LID: "Pre-Trained Language Models for Interactive Decision-Making", NeurIPS, 2022 [Paper][Website][Code]

Inner Monologue: "Inner Monologue: Embodied Reasoning through Planning with Language Models", arXiv, Jul 2022 [Paper][Website]

LLM-Planner: "LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models", ICCV, 2023 [Paper][Website][Website]

Code-based Task Planners

ChatGPT for Robotics: "ChatGPT for Robotics: Design Principles and Model Abilities", IEEE Access, 2023 [Paper][Website][Code]

DEPS: "Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents", arXiv, Feb 2023 [Paper][Code]

ConceptGraphs: "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning", ICRA, 2023 [Paper][Website][Code]

CaP: "Code as Policies: Language Model Programs for Embodied Control", ICRA, 2023 [Paper][Website][Code]

ProgPrompt: "ProgPrompt: Generating Situated Robot Task Plans using Large Language Models", ICRA, 2023 [Paper][Website][Code]

COME-robot: "Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V", arXiv, Apr 2024 [Paper][Website]

Related Surveys

"Foundation Models in Robotics: Applications, Challenges, and the Future", arXiv, Dec 2023 [Paper]

"Real-World Robot Applications of Foundation Models: A Review", arXiv, Feb 2024 [Paper]

"Large Language Models for Robotics: Opportunities, Challenges, and Perspectives", arXiv, Jan 2024 [Paper]

"Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis", arXiv, Dec 2023 [Paper]

"Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, May 2024 [Paper]

多模态大模型的发展

多模态大模型的爆发前期主要集中在2018年至2021年，这一阶段是技术积累与关键突破的交汇期，为后续爆发奠定了核心基础。以下是关键时间节点与技术进展的梳理：

1. 技术奠基期（2018-2019）

Transformer架构的普及

2017年，Google提出的Transformer架构彻底改变了深度学习的范式。其自注意力机制解决了长序列依赖问题，为多模态模型的跨模态对齐提供了理论基础。2018年，OpenAI和Google分别发布GPT-1与BERT，标志着预训练大模型成为自然语言处理的主流。这一时期，研究者开始探索如何将Transformer扩展到视觉领域。

早期多模态模型的尝试

ViLBERT（2019）：首次将视觉和语言模态通过Transformer进行联合建模，支持图像问答和图文检索任务。
LXMERT（2019）：提出跨模态预训练框架，通过对比学习对齐图像区域与文本片段，显著提升视觉语言任务性能。
UNITER（2019）：在大规模图文数据上预训练，通过掩码语言建模和图像-文本匹配任务增强跨模态理解能力。

这一阶段的模型虽然参数规模较小（通常在百亿级以下），但验证了多模态预训练的可行性，为后续大模型提供了方法论参考。

2. 技术突破期（2020-2021）

多模态预训练的规模化

CLIP（2021）：OpenAI提出的对比学习框架，通过4亿图文对预训练，实现了图像与文本的深度对齐。其零样本迁移能力（如通过文本描述直接分类图像）颠覆了传统视觉任务的范式。
DALL-E（2021）：基于GPT-3架构，首次实现从文本生成高分辨率图像的能力，展示了多模态生成任务的潜力。
Flamingo（2021）：通过冻结预训练的视觉编码器和语言模型，仅微调门控Transformer模块，实现了多模态上下文学习，例如通过示例直接解决视觉数学问题。