What matters in building vision–language–action models for generalist robots

Li, Xinghang; Li, Peiyan; Qian, Long; Liu, Minghuan; Wang, Dong; Liu, Jirong; Kang, Bingyi; Ma, Xiao; Wang, Xinlong; Guo, Di; Kong, Tao; Zhang, Hanbo; Liu, Huaping

doi:10.1038/s42256-025-01168-7

Article
Published: 11 February 2026

Nature Machine Intelligence volume 8, pages 158–172 (2026)

This article has been updated

A preprint version of the article is available at arXiv.

Abstract

To utilize foundation vision–language models (VLMs) for robotic tasks and motion planning, the community has proposed different methods for injecting action components into VLMs to build vision–language–action models (VLAs). Here we identify the key factors that significantly influence the performance of VLAs on robot manipulation problems and focus on three essential design choices: which backbone to select, how to formulate the VLA architecture and when to add cross-embodiment data. The results explain why we prefer VLAs and lead us to develop a new family of VLAs, RoboVLMs, which require very little manual design and achieve new state-of-the-art performance in three simulation tasks and in real-world experiments. Through extensive experiments, which include over 8 VLM backbones, 4 policy architectures and over 600 distinct experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integration of new VLMs and free combination of various design choices, is made public to facilitate future research. We open-source all details, including code, models, datasets and toolkits, along with detailed training and evaluation recipes, at robovlms.github.io.

**Fig. 1: Strategies and categorization of robot policies.**

**Fig. 2: Illustration of the key ingredients and proposed unified VLA framework.**

**Fig. 3: Illustration of the involved simulations and real-world benchmarks.**

**Fig. 4: The experimental results for RoboVLMs in simulations and in the real world.**

Data availability

The datasets used in this study are available via GitHub at https://github.com/mees/calvin (CALVIN), https://github.com/google-deepmind/open_x_embodiment (OXE) and https://huggingface.co/datasets/robovlms/bytedance_robot_benchmark_20 (BDRBench20).

Code availability

The source code of this study is available via GitHub at https://github.com/Robot-VLAs/RoboVLMs. It is also available via Zenodo at https://zenodo.org/records/17757179 (ref. ⁵⁵).

Change history

16 February 2026
In the version of the article initially published, the affiliations of Di Guo and Hanbo Zhang were switched and have now been amended so that Di Guo is affiliated with the Beijing University of Posts and Telecommunications, Beijing, China and Hanbo Zhang with the National University of Singapore, Singapore, Singapore. This correction has been made to the HTML and PDF versions of the article.

References

Bousmalis, K. et al. Robocat: a self-improving foundation agent for robotic manipulation. Transactions on Machine Learning Research (ed. Walter, M.) (TMLR, 2024).
Brohan, A. et al. Rt-2: vision–language–action models transfer web knowledge to robotic control. In Conference on Robot Learning (eds Tan, J. et al.) 2165–2183 (PMLR, 2023).
Black, K. et al. π₀: a vision–language–action flow model for general robot control. Preprint at https://arxiv.org/pdf/2410.24164 (2024).
O’Neill, A. et al. Open X-Embodiment: robotic learning datasets and rt-x models. In 2024 IEEE International Conference on Robotics and Automation (ICRA) (ed. O’Malley, M. K.) 6892–6903 (IEEE, 2024).
Liu, H., Guo, D. & Cangelosi, A. Embodied intelligence: a synergy of morphology, action, perception and learning. ACM Comput. Surv. 57, 1–36 (2025).
Kim, M. J. et al. Openvla: an open-source vision–language–action model. In Conference on Robot Learning (eds Agrawal, P. et al.) 2679–2713 (PMLR, 2025).
Li, X. et al. Vision–language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2024).
Ghosh, D. et al. Octo: an open-source generalist robot policy. In Proc. Robotics: Science and Systems 090 (RSS, 2024).
Wu, H. et al. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2024).
Nair, S., Rajeswaran, A., Kumar, V., Finn, C. & Gupta, A. R3m: a universal visual representation for robot manipulation. In Conference on Robot Learning (eds Liu, K. et al.) 892–909 (PMLR, 2023).
Jiang, Y. et al. Vima: general robot manipulation with multimodal prompts. In Proc. Machine Learning Research 14975–15022 (PMLR, 2023).
Zhen, H. et al. 3d-vla: A 3D vision–language–action generative world model. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 61229–61245 (ICML, 2024).
Zhou, Z., Zhu, Y., Wen, J., Shen, C. & Xu, Y. Vision–language–action model with open-world embodied reasoning from pretrained knowledge. Preprint at https://arxiv.org/pdf/2505.21906 (2025).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning 1126–1135 (PMLR, 2017).
Mees, O., Hermann, L., Rosete-Beas, E. & Burgard, W. Calvin: a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics Autom. Lett. 7, 7327–7334 (2022).
Radosavovic, I. et al. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning (eds Liu, K. et al.) 416–426 (PMLR, 2023).
Peng, Z. et al. Kosmos-2: grounding multimodal large language models to the world. Preprint at https://arxiv.org/pdf/2306.14824 (2023).
Beyer, L. et al. Paligemma: a versatile 3b VLM for transfer. Preprint at https://arxiv.org/pdf/2407.07726 (2024).
Torne, M. et al. Reconciling reality through simulation: a real-to-sim-to-real approach for robust manipulation. Preprint at https://arxiv.org/pdf/2403.03949 (2024).
Li, X. et al. Evaluating real-world robot manipulation policies in simulation. In Conference on Robot Learning (eds Agrawal, P. et al.) 3705–3728 (PMLR, 2025).
Brohan, A. et al. Rt-1: robotics transformer for real-world control at scale. In Proc. Robotics Science and Systems XIX 025 (RSS, 2023).
Walke, H. et al. Bridgedata v2: a dataset for robot learning at scale. In Conference on Robot Learning (eds Tan, J. et al.) 1723–1736 (PMLR, 2023).
Cheang, C.-L. et al. Gr-2: a generative video-language-action model with web-scale knowledge for robot manipulation. Preprint at https://arxiv.org/pdf/2410.06158 (2024).
Li, P. et al. Gr-mg: leveraging partially annotated data via multi-modal goal conditioned policy. IEEE Robotics and Automation Letters (eds Asfour, A. et al.) 1912–1919 (IEEE, 2025).
Zhao, W., Queralta, J. P. & Westerlund, T. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. 2020 IEEE symposium series on Computational Intelligence (SSCI) 737–744 (eds Abbass, H. et al.) (IEEE, 2020).
Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 35, 23716–23736 (2022).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 36, 34892–34916 (2024).
Bai, J. et al. Qwen-vl: a frontier large vision–language model with versatile abilities. Preprint at https://arxiv.org/pdf/2308.12966 (2023).
Vikhyat. Moondream, tiny vision language model. GitHub https://github.com/vikhyat/moondream (2024).
Unum-cloud. Uform: pocket-sized multimodal ai for content understanding and generation. GitHub https://huggingface.co/unum-cloud/uform-gen2-qwen-500m (2024).
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M. & Le, M. Flow matching for generative modeling. The Eleventh International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2023).
Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. Proc. Robotics: Science and Systems XIX, 016 (RSS, 2023).
Shazeer, N. et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR, 2017).
Intelligence, P. et al. A vision–language–action model with open-world generalization. Preprint at https://arxiv.org/pdf/2505.21906 (2025).
Dosovitskiy, A. An image is worth 16 × 16 words: transformers for image recognition at scale. In International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2021).
Jaegle, A. et al. Perceiver: general perception with iterative attention. In International Conference on Machine Learning 4651–4664 (PMLR, 2021).
Liu, J. et al. Robomamba: multimodal state space model for efficient robot reasoning and manipulation. Adv. Neural Inf. Process. Syst. 37, 40085–40110 (2024).
Nagrani, A. et al. Attention bottlenecks for multimodal fusion. Adv. Neural Inf. Process. Syst. 34, 14200–14213 (2021).
Xu, H. et al. VLM: task-agnostic video-language model pre-training for video understanding. Findings of the Association for Computational Linguistics 4227–4239 (ACL-IJCNLP, 2021).
Wang, P. et al. Ofa: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. International Conference on Machine Learning (eds Chaudhuri, K. et al.) 23318–23340 (PMLR, 2022).
Yang, Z. et al. The dawn of lmms: preliminary explorations with GPT-4v(ision). Preprint at https://arxiv.org/pdf/2309.17421 (2023).
Jang, E. et al. Bc-z: Zero-shot task generalization with robotic imitation learning. Conference on Robot Learning (eds Faust, A. et al.) 991–1002 (PMLR, 2022).
Ke, T.-W., Gkanatsios, N. & Fragkiadaki, K. 3d diffuser actor: policy diffusion with 3d scene representations. Conference on Robot Learning (eds Agrawal, P. et al.) 1949–1974 (PMLR, 2025).
Ye, S. et al. Latent action pretraining from videos. The Thirteenth International Conference on Learning Representations (eds Yue, Y. et al.) 90629–90655 (ICLR, 2025).
Zawalski, M. et al. Robotic control via embodied chain-of-thought reasoning. Conference on Robot Learning (eds Agrawal, A. et al.) 3157–3181 (PMLR, 2025).
Reed, S. et al. A generalist agent. Transactions on Machine Learning Research (eds Larochelle, H. et al.) 1–42 (ML Research Foundation, 2022).
Medsker, L. R., Jain, L. et al. Recurrent neural networks. Des. Appl. 5, 2 (2001).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/pdf/1412.3555 (2014).
Kawakami, K. Supervised Sequence Labelling with Recurrent Neural Networks. Ph.D. thesis, Technical University of Munich (2008).
Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Floridi, L. & Chiriatti, M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 30, 681–694 (2020).
Chi, C. et al. Diffusion policy: visuomotor policy learning via action diffusion. In Proc. Robotics: Science and Systems XIX 026 (RSS, 2023).
Liu, F. et al. Robouniview: Visual-language model with unified view representation for robotic manipulation. Preprint at https://arxiv.org/pdf/2406.18977 (2024).
Yue, Y. et al. Deer-vla: dynamic inference of multimodal large language models for efficient robot execution. Adv. Neural Inf. Process. Syst. 37, 56619–56643 (2024).
Li, X. et al. What matters in building vision–language–action models for generalist robots (codebase). Zenodo https://zenodo.org/records/17757179 (2025).

Acknowledgements

This work was jointly supported by the National Natural Science Fund under grant nos 62025304 and 62120106005, Beijing Natural Science Foundation under grant no. L253006 and Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China under grant no. JYB2025XDXM109. We thank all the members of the robotics research team at ByteDance Research for their assistance in real-world data collection, setup design, robot maintenance and experiments. M.L. is supported by the ByteDance Scholarship. We also want to thank @YouJiacheng for his active and instructive discussion on X.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, China
Xinghang Li, Dong Wang & Huaping Liu
ByteDance Research, Beijing, China
Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma & Tao Kong
Beijing Academy of Artificial Intelligence, Beijing, China
Xinghang Li, Long Qian & Xinlong Wang
CASIA MAIS-NLPR, Beijing, China
Peiyan Li
Shanghai Jiao Tong University, Shanghai, China
Minghuan Liu & Jirong Liu
Beijing University of Posts and Telecommunications, Beijing, China
Di Guo
National University of Singapore, Singapore, Singapore
Hanbo Zhang

Contributions

Project Leads: H.L., X.L., H.Z. and M.L. Methodology and codebase: X.L. Model training and evaluation (experimental design, implementation): X.L., L.Q., H.Z., D.W., M.L., X.M. and J.L. Real-robot deployment and experiments: X.L. and P.L. Logic, figures, visualizations and writing: X.L., M.L., H.Z., L.Q., B.K., X.M., P.L., J.L., D.G., H.L. and T.K. Advising: H.L., T.K., H.Z., X.M., B.K., D.G. and X.W.

Corresponding authors

Correspondence to Tao Kong, Hanbo Zhang or Huaping Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Self-correction in real-world experiments.

Visualization of rollouts in which the best-performing VLA built with RoboVLMs exhibits an emergent self-correction ability. For instance, in the Open The Oven task, the robot's first attempt does not reach the oven handle, and it adjusts the end-effector position to relocate the handle on the second attempt. Note that the training dataset does not contain this kind of data.

Extended Data Fig. 2 The architectures of MoE and considered VLAs.

(a) Illustration of the mixture-of-experts (MoE) structure. In the original VLAs, vision–language tokens and action tokens share the weights of the original VLM feed-forward network (FFN). In the MoE structure, vision–language tokens and action tokens have separate query, key and value projection layers for self-attention and separate feed-forward networks. Action tokens interact with vision–language tokens only through self-attention. This structure preserves the original parameters and forward process of the VLM and is claimed to benefit the generalization of the resulting VLAs. (b) Illustration of the considered VLA formulations, including several popular designs. For example, RoboFlamingo is a Policy-Head-Continuous-type VLA, RT-2 and OpenVLA correspond to the One-Step-Discrete-Action-type VLA, and Octo and GR correspond to the Interleaved-Continuous-Action-type VLA with a fixed window size.
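The routing described in panel (a) can be sketched in a few lines of plain Python. This is a minimal illustration, not the RoboVLMs implementation: the names `make_expert` and `moe_layer`, the single attention head, the ReLU FFN, the omission of layer normalization and the random weights are all assumptions made for brevity. The attention mask (vision–language tokens never attend to action tokens) is also an assumption, chosen because it is one way to leave the VLM forward pass unchanged, as the caption requires.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(dim, hidden, rng):
    """Hypothetical expert: its own q/k/v projections and a two-layer FFN."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"q": mat(dim, dim), "k": mat(dim, dim), "v": mat(dim, dim),
            "ffn_in": mat(hidden, dim), "ffn_out": mat(dim, hidden)}

def moe_layer(tokens, kinds, experts):
    """One single-head layer; kinds[i] is "vl" or "act" and picks the expert."""
    dim = len(tokens[0])
    # Each token is projected with the weights of its own expert.
    q = [matvec(experts[t]["q"], x) for x, t in zip(tokens, kinds)]
    k = [matvec(experts[t]["k"], x) for x, t in zip(tokens, kinds)]
    v = [matvec(experts[t]["v"], x) for x, t in zip(tokens, kinds)]
    out = []
    for i, x in enumerate(tokens):
        # Assumed mask: VL tokens see only VL tokens; action tokens see all.
        visible = [j for j in range(len(tokens))
                   if kinds[i] == "act" or kinds[j] == "vl"]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
                  for j in visible]
        weights = softmax(scores)
        attended = [sum(w * v[j][d] for w, j in zip(weights, visible))
                    for d in range(dim)]
        h = [a + b for a, b in zip(x, attended)]  # residual connection
        expert = experts[kinds[i]]                # separate FFN per token type
        hidden = [max(0.0, z) for z in matvec(expert["ffn_in"], h)]
        ffn = matvec(expert["ffn_out"], hidden)
        out.append([a + b for a, b in zip(h, ffn)])
    return out

rng = random.Random(0)
experts = {"vl": make_expert(4, 8, rng), "act": make_expert(4, 8, rng)}
vl_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
act_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
kinds = ["vl"] * 3 + ["act"] * 2

out_a = moe_layer(vl_tokens + act_tokens, kinds, experts)
# Swapping the action tokens must leave the vision-language outputs untouched.
other_act = [[1.0] * 4, [-1.0] * 4]
out_b = moe_layer(vl_tokens + other_act, kinds, experts)
print(out_a[:3] == out_b[:3])
```

Under the assumed mask, the vision–language outputs depend only on the vision–language tokens and the `"vl"` expert, so the VLM weights and forward pass are left as they were, while the action tokens read the vision–language context through the shared attention step.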

Extended Data Fig. 3 Illustration of cross embodiment training configurations for Bridge and Google Robot.

Illustration of training configurations for Bridge and Google Robot. The red crosses (✗) denote the excluded training stages, and the small icons represent the datasets used, corresponding to those shown at the top of the figure. Taking Bridge Post Train as an example, two stages are employed. Stage 1: co-training with cross-embodiment data, the Bridge V2 dataset, the RT1 target dataset and RT1 extra data. Stage 2: post-training refinement using only the Bridge V2 dataset.
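The two-stage Bridge Post Train example in this caption can be written down as a small data structure. The sketch below is illustrative only: the `Stage` class, the helper names and the dataset labels are hypothetical, taken from the caption wording rather than from the RoboVLMs codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    datasets: tuple  # dataset labels as worded in the caption

# Bridge Post Train as described in the caption: co-train, then refine.
BRIDGE_POST_TRAIN = (
    Stage("co-train", ("cross-embodiment data", "Bridge V2",
                       "RT1 target", "RT1 extra")),
    Stage("post-train", ("Bridge V2",)),
)

def datasets_seen(recipe):
    """Return every dataset used by any stage, in first-seen order."""
    seen = []
    for stage in recipe:
        for name in stage.datasets:
            if name not in seen:
                seen.append(name)
    return seen

def final_stage_datasets(recipe):
    """Datasets used in the last stage, i.e. the refinement target."""
    return list(recipe[-1].datasets)

print(datasets_seen(BRIDGE_POST_TRAIN))
print(final_stage_datasets(BRIDGE_POST_TRAIN))
```

Representing a recipe as an ordered tuple of stages makes the excluded stages in the figure easy to express: a single-stage recipe is simply a tuple of length one.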

Extended Data Fig. 4 Ablation of cross embodiment training results on SimplerEnv.

We evaluate four different training recipes. On the WidowX+Bridge environments, we test (1) Bridge Finetune, which finetunes the VLA directly on the full Bridge datasets (tested tasks not included); (2) OXE Co-Train, which co-trains the VLA on the OXE dataset; and (3) Post-Train, which trains the OXE co-trained VLA on the Bridge datasets. On the Google Robot environments, we test (1) RT-Partial Finetune, which finetunes the VLA on the tested RT tasks only; (2) RT Finetune, which finetunes the VLA on the full RT dataset (tested tasks included); (3) OXE Co-Train; and (4) Post-Train, which adds a post-training stage on the tested RT tasks.

Extended Data Fig. 5 Few-shot learning on CALVIN.

The effect of cross-embodiment pre-training on OXE datasets for few-shot learning.

Extended Data Table 1 Chapter questions and findings

Extended Data Table 2 Comparison with baselines on CALVIN

Extended Data Table 3 The detailed performance of RoboVLMs on SimplerEnv

Extended Data Table 4 The performance of VLAs implemented with different formulations and training data scales

Extended Data Table 5 The performance of the built VLAs based on VLMs with different image token numbers and VL pre-train data scales

Supplementary information

Supplementary Information

Supplementary Figs. 1–6, Tables 1–4 and discussion.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, X., Li, P., Qian, L. et al. What matters in building vision–language–action models for generalist robots. Nat Mach Intell 8, 158–172 (2026). https://doi.org/10.1038/s42256-025-01168-7

Received: 05 January 2025
Accepted: 03 December 2025
Published: 11 February 2026
Version of record: 11 February 2026
Issue date: February 2026
DOI: https://doi.org/10.1038/s42256-025-01168-7