What matters in building vision–language–action models for generalist robots

This article has been updated

A preprint version of the article is available at arXiv.

Abstract

To utilize foundation vision–language models (VLMs) for robotic tasks and motion planning, the community has proposed different methods for injecting action components into VLMs to build vision–language–action models (VLAs). Here we identify the key factors that significantly influence the performance of VLAs on robot manipulation problems and focus on three essential design choices: which backbone to select, how to formulate the VLA architecture and when to add cross-embodiment data. The results explain why we prefer VLAs and lead us to develop a new family of VLAs, RoboVLMs, which require very little manual design and achieve state-of-the-art performance in three simulation tasks and real-world experiments. Through extensive experiments, which cover over 8 VLM backbones, 4 policy architectures and over 600 distinct experiments, we provide a detailed guidebook for the future design of VLAs. In addition, the highly flexible RoboVLMs framework, which supports easy integration of new VLMs and free combination of various design choices, is made public to facilitate future research. We open-source all details, including code, models, datasets and toolkits, along with detailed training and evaluation recipes, at robovlms.github.io.

Fig. 1: Strategies and categorization of robot policies.
Fig. 2: Illustration of the key ingredients and proposed unified VLA framework.
Fig. 3: Illustration of the involved simulations and real-world benchmarks.
Fig. 4: The experimental results for RoboVLMs in simulations and in the real world.

Data availability

The datasets used in this study are available via GitHub at https://github.com/mees/calvin (CALVIN), https://github.com/google-deepmind/open_x_embodiment (OXE) and https://huggingface.co/datasets/robovlms/bytedance_robot_benchmark_20 (BDRBench20).

Code availability

The source code of this study is available via GitHub at https://github.com/Robot-VLAs/RoboVLMs. It is also available via Zenodo at https://zenodo.org/records/17757179 (ref. 55).

Change history

  • 16 February 2026

    In the version of the article initially published, the affiliations of Di Guo and Hanbo Zhang were switched. They have now been amended so that Di Guo is affiliated with the Beijing University of Posts and Telecommunications, Beijing, China, and Hanbo Zhang with the National University of Singapore, Singapore, Singapore. This correction has been made to the HTML and PDF versions of the article.

References

  1. Bousmalis, K. et al. Robocat: a self-improving foundation agent for robotic manipulation. Transactions on Machine Learning Research (ed. Walter, M.) (TMLR, 2024).

  2. Brohan, A. et al. Rt-2: vision–language–action models transfer web knowledge to robotic control. In Conference on Robot Learning (eds Tan, J. et al.) 2165–2183 (PMLR, 2023).

  3. Black, K. et al. π0: a vision–language–action flow model for general robot control. Preprint at https://arxiv.org/pdf/2410.24164 (2024).

  4. O’Neill, A. et al. Open X-Embodiment: robotic learning datasets and rt-x models. In 2024 IEEE International Conference on Robotics and Automation (ICRA) (ed. O’Malley, M. K.) 6892–6903 (IEEE, 2024).

  5. Liu, H., Guo, D. & Cangelosi, A. Embodied intelligence: a synergy of morphology, action, perception and learning. ACM Comput. Surv. 57, 1–36 (2025).

  6. Kim, M. J. et al. Openvla: an open-source vision–language–action model. In Conference on Robot Learning (eds Agrawal, P. et al.) 2679–2713 (PMLR, 2025).

  7. Li, X. et al. Vision–language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2024).

  8. Ghosh, D. et al. Octo: an open-source generalist robot policy. In Proc. Robotics: Science and Systems 090 (RSS, 2024).

  9. Wu, H. et al. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2024).

  10. Nair, S., Rajeswaran, A., Kumar, V., Finn, C. & Gupta, A. R3m: a universal visual representation for robot manipulation. In Conference on Robot Learning (eds Liu, K. et al.) 892–909 (PMLR, 2023).

  11. Jiang, Y. et al. Vima: general robot manipulation with multimodal prompts. In Proc. Machine Learning Research 14975–15022 (PMLR, 2023).

  12. Zhen, H. et al. 3d-vla: a 3D vision–language–action generative world model. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 61229–61245 (ICML, 2024).

  13. Zhou, Z., Zhu, Y., Wen, J., Shen, C. & Xu, Y. Vision–language–action model with open-world embodied reasoning from pretrained knowledge. Preprint at https://arxiv.org/pdf/2505.21906 (2025).

  14. Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning 1126–1135 (PMLR, 2017).

  15. Mees, O., Hermann, L., Rosete-Beas, E. & Burgard, W. Calvin: a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics Autom. Lett. 7, 7327–7334 (2022).

  16. Radosavovic, I. et al. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning (eds Liu, K. et al.) 416–426 (PMLR, 2023).

  17. Peng, Z. et al. Kosmos-2: grounding multimodal large language models to the world. Preprint at https://arxiv.org/pdf/2306.14824 (2023).

  18. Beyer, L. et al. Paligemma: a versatile 3b VLM for transfer. Preprint at https://arxiv.org/pdf/2407.07726 (2024).

  19. Torne, M. et al. Reconciling reality through simulation: a real-to-sim-to-real approach for robust manipulation. Preprint at https://arxiv.org/pdf/2403.03949 (2024).

  20. Li, X. et al. Evaluating real-world robot manipulation policies in simulation. In Conference on Robot Learning (eds Agrawal, P. et al.) 3705–3728 (PMLR, 2025).

  21. Brohan, A. et al. Rt-1: robotics transformer for real-world control at scale. In Proc. Robotics Science and Systems XIX 025 (RSS, 2023).

  22. Walke, H. et al. Bridgedata v2: a dataset for robot learning at scale. In Conference on Robot Learning (eds Tan, J. et al.) 1723–1736 (PMLR, 2023).

  23. Cheang, C.-L. et al. Gr-2: a generative video-language-action model with web-scale knowledge for robot manipulation. Preprint at https://arxiv.org/pdf/2410.06158 (2024).

  24. Li, P. et al. Gr-mg: leveraging partially annotated data via multi-modal goal conditioned policy. IEEE Robotics and Automation Letters (eds Asfour, A. et al.) 1912–1919 (IEEE, 2025).

  25. Zhao, W., Queralta, J. P. & Westerlund, T. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI) (eds Abbass, H. et al.) 737–744 (IEEE, 2020).

  26. Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 35, 23716–23736 (2022).

  27. Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 36, 34892–34916 (2024).

  28. Bai, J. et al. Qwen-vl: a frontier large vision–language model with versatile abilities. Preprint at https://arxiv.org/pdf/2308.12966 (2023).

  29. Vikhyat. Moondream, tiny vision language model. GitHub https://github.com/vikhyat/moondream (2024).

  30. Unum-cloud. Uform: pocket-sized multimodal AI for content understanding and generation. Hugging Face https://huggingface.co/unum-cloud/uform-gen2-qwen-500m (2024).

  31. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M. & Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2023).

  32. Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. In Proc. Robotics: Science and Systems XIX 016 (RSS, 2023).

  33. Shazeer, N. et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR, 2017).

  34. Intelligence, P. et al. A vision–language–action model with open-world generalization. Preprint at https://arxiv.org/pdf/2505.21906 (2025).

  35. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2021).

  36. Jaegle, A. et al. Perceiver: general perception with iterative attention. In International Conference on Machine Learning 4651–4664 (PMLR, 2021).

  37. Liu, J. et al. Robomamba: multimodal state space model for efficient robot reasoning and manipulation. Adv. Neural Inf. Process. Syst. 37, 40085–40110 (2024).

  38. Nagrani, A. et al. Attention bottlenecks for multimodal fusion. Adv. Neural Inf. Process. Syst. 34, 14200–14213 (2021).

  39. Xu, H. et al. VLM: task-agnostic video-language model pre-training for video understanding. In Findings of the Association for Computational Linguistics 4227–4239 (ACL-IJCNLP, 2021).

  40. Wang, P. et al. Ofa: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning (eds Chaudhuri, K. et al.) 23318–23340 (PMLR, 2022).

  41. Yang, Z. et al. The dawn of lmms: preliminary explorations with GPT-4v(ision). Preprint at https://arxiv.org/pdf/2309.17421 (2023).

  42. Jang, E. et al. Bc-z: zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning (eds Faust, A. et al.) 991–1002 (PMLR, 2022).

  43. Ke, T.-W., Gkanatsios, N. & Fragkiadaki, K. 3d diffuser actor: policy diffusion with 3d scene representations. In Conference on Robot Learning (eds Agrawal, P. et al.) 1949–1974 (PMLR, 2025).

  44. Ye, S. et al. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations (eds Yue, Y. et al.) 90629–90655 (ICLR, 2025).

  45. Zawalski, M. et al. Robotic control via embodied chain-of-thought reasoning. In Conference on Robot Learning (eds Agrawal, A. et al.) 3157–3181 (PMLR, 2025).

  46. Reed, S. et al. A generalist agent. Transactions on Machine Learning Research (eds Larochelle, H. et al.) 1–42 (ML Research Foundation, 2022).

  47. Medsker, L. R., Jain, L. et al. Recurrent neural networks. Des. Appl. 5, 2 (2001).

  48. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/pdf/1412.3555 (2014).

  49. Kawakami, K. Supervised Sequence Labelling with Recurrent Neural Networks. Ph.D. thesis, Technical University of Munich (2008).

  50. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).

  51. Floridi, L. & Chiriatti, M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 30, 681–694 (2020).

  52. Chi, C. et al. Diffusion policy: visuomotor policy learning via action diffusion. In Proc. Robotics: Science and Systems XIX 026 (RSS, 2023).

  53. Liu, F. et al. Robouniview: visual-language model with unified view representation for robotic manipulation. Preprint at https://arxiv.org/pdf/2406.18977 (2024).

  54. Yue, Y. et al. Deer-vla: dynamic inference of multimodal large language models for efficient robot execution. Adv. Neural Inf. Process. Syst. 37, 56619–56643 (2024).

  55. Li, X. et al. What matters in building vision–language–action models for generalist robots (codebase). Zenodo https://zenodo.org/records/17757179 (2025).

Acknowledgements

This work was jointly supported by the National Natural Science Fund under grant nos 62025304 and 62120106005, Beijing Natural Science Foundation under grant no. L253006 and Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China under grant no. JYB2025XDXM109. We thank all the members of the robotics research team at ByteDance Research for their assistance in real-world data collection, setup design, robot maintenance and experiments. M.L. is supported by the ByteDance Scholarship. We also want to thank @YouJiacheng for his active and instructive discussion on X.

Author information

Contributions

Project Leads: H.L., X.L., H.Z. and M.L. Methodology and codebase: X.L. Model training and evaluation (experimental design, implementation): X.L., L.Q., H.Z., D.W., M.L., X.M. and J.L. Real-robot deployment and experiments: X.L. and P.L. Logic, figures, visualizations and writing: X.L., M.L., H.Z., L.Q., B.K., X.M., P.L., J.L., D.G., H.L. and T.K. Advising: H.L., T.K., H.Z., X.M., B.K., D.G. and X.W.

Corresponding authors

Correspondence to Tao Kong, Hanbo Zhang or Huaping Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Self-correction in real-world experiments.

Visualization of rollouts in which the best-setting VLA built with RoboVLMs shows an emergent ability to self-correct. For instance, in the Open The Oven task, the robot's first attempt does not reach the oven handle, and it adjusts the end-effector position to relocate the handle on the second attempt. Note that the training dataset does not contain this kind of data.

Extended Data Fig. 2 The architectures of MoE and considered VLAs.

(a) Illustration of the mixture-of-experts (MoE) structure. In the original VLAs, vision–language tokens and action tokens share the weights of the original VLM feed-forward network (FFN). In the MoE structure, vision–language tokens and action tokens have separate query, key and value projection layers for self-attention, as well as separate feed-forward networks. Action tokens interact with vision–language tokens only through self-attention. This MoE structure preserves the original parameters and forward process of the VLM, and is claimed to benefit the generalization of the resulting VLAs. (b) Illustration of the considered VLA formulations, including several popular designs. For example, RoboFlamingo is a Policy-Head-Continuous-type VLA, whereas RT-2 and OpenVLA correspond to the One-Step-Discrete-Action-type VLA. Octo and GR correspond to the Interleaved-Continuous-Action-type VLA with a fixed window size.
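The routing described in panel (a) can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the RoboVLMs implementation: the names `Expert` and `moe_block` are hypothetical, layer normalization and multi-head attention are omitted, and the attention mask (vision–language tokens do not attend to action tokens) is an assumption made here so that the vision–language outputs stay identical to those of the unmodified backbone.

```python
import math
import random


def rand_matrix(rng, rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Expert:
    """One parameter set: query/key/value projections plus a two-layer FFN."""

    def __init__(self, rng, dim, hidden):
        self.wq = rand_matrix(rng, dim, dim)
        self.wk = rand_matrix(rng, dim, dim)
        self.wv = rand_matrix(rng, dim, dim)
        self.w1 = rand_matrix(rng, hidden, dim)
        self.w2 = rand_matrix(rng, dim, hidden)

    def ffn(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w1, x)]  # ReLU
        return matvec(self.w2, hidden)


def moe_block(tokens, is_action, vl_expert, action_expert):
    """Apply one block; is_action[i] selects which expert owns token i."""
    experts = [action_expert if flag else vl_expert for flag in is_action]
    dim = len(tokens[0])
    # Each token is projected with the weights of its own expert.
    q = [matvec(e.wq, t) for e, t in zip(experts, tokens)]
    k = [matvec(e.wk, t) for e, t in zip(experts, tokens)]
    v = [matvec(e.wv, t) for e, t in zip(experts, tokens)]
    outputs = []
    for i, token in enumerate(tokens):
        # Assumed mask: action tokens see every token, whereas
        # vision-language tokens see only vision-language tokens.
        visible = [j for j, flag in enumerate(is_action)
                   if is_action[i] or not flag]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
                  for j in visible]
        weights = softmax(scores)
        attended = [sum(w * v[j][d] for w, j in zip(weights, visible))
                    for d in range(dim)]
        hidden = [t + a for t, a in zip(token, attended)]  # residual
        update = experts[i].ffn(hidden)                    # expert FFN
        outputs.append([h + u for h, u in zip(hidden, update)])
    return outputs
```

Under the assumed mask, the action expert reads the vision–language keys and values through the shared attention step, but nothing it computes flows back into the vision–language stream.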

Extended Data Fig. 3 Illustration of cross embodiment training configurations for Bridge and Google Robot.

Illustration of training configurations for Bridge and Google Robot. The red crosses denote excluded training stages, and the small icons represent the datasets used, corresponding to those shown above in the figure. Taking Bridge Post Train as an example, two stages are employed. Stage 1: co-training with cross-embodiment data, the Bridge V2 dataset, the RT1 target dataset and RT1 extra data. Stage 2: post-training refinement using only the Bridge V2 dataset.
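The two-stage Bridge Post Train example above can be written down as a small configuration structure. The sketch below is illustrative only: the `Stage` and `Recipe` classes and the stage names are hypothetical and are not taken from the RoboVLMs codebase, and the dataset labels are copied from the caption rather than from any real loader.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage and the dataset labels it draws from."""
    name: str
    datasets: Tuple[str, ...]


@dataclass(frozen=True)
class Recipe:
    """An ordered sequence of stages; an excluded stage is simply left out."""
    name: str
    stages: Tuple[Stage, ...]

    def datasets_seen(self):
        """Dataset labels in first-seen order across all stages."""
        seen = []
        for stage in self.stages:
            for label in stage.datasets:
                if label not in seen:
                    seen.append(label)
        return seen


# Hypothetical encoding of the Bridge Post Train example from the caption.
BRIDGE_POST_TRAIN = Recipe(
    name="Bridge Post Train",
    stages=(
        Stage("co-train", ("cross-embodiment data", "Bridge V2",
                           "RT1 target", "RT1 extra")),
        Stage("post-train", ("Bridge V2",)),
    ),
)

for index, stage in enumerate(BRIDGE_POST_TRAIN.stages, start=1):
    print(f"Stage {index} ({stage.name}): {', '.join(stage.datasets)}")
```

Representing an excluded stage by omission keeps each recipe a plain ordered tuple, so recipes that differ only in which stages they run can be compared directly.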

Extended Data Fig. 4 Ablation of cross embodiment training results on SimplerEnv.

We evaluate four different training recipes. On the WidowX+Bridge environments, we test (1) Bridge Finetune, which finetunes the VLA directly on the full Bridge datasets (tested tasks not included); (2) OXE Co-Train, which co-trains the VLA on the OXE dataset; and (3) Post-Train, which trains the OXE co-trained VLA on the Bridge datasets. On the Google Robot environments, we test (1) RT-Partial Finetune, which finetunes the VLA on the tested RT tasks only; (2) RT Finetune, which finetunes the VLA on the full RT dataset (tested tasks included); (3) OXE Co-Train; and (4) Post-Train on the tested RT tasks.

Extended Data Fig. 5 Few-shot learning on CALVIN.

The effect of cross-embodiment pre-training on OXE datasets for few-shot learning.

Extended Data Table 1 Chapter questions and findings
Extended Data Table 2 Comparison with baselines on CALVIN
Extended Data Table 3 The detailed performance of RoboVLMs on SimplerEnv
Extended Data Table 4 The performance of VLAs implemented with different formulations and training data scales
Extended Data Table 5 The performance of the built VLAs based on VLMs with different image token numbers and VL pre-train data scales

Supplementary information

Supplementary Information

Supplementary Figs. 1–6, Tables 1–4 and discussion.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, X., Li, P., Qian, L. et al. What matters in building vision–language–action models for generalist robots. Nat Mach Intell 8, 158–172 (2026). https://doi.org/10.1038/s42256-025-01168-7

  • DOI: https://doi.org/10.1038/s42256-025-01168-7
