Abstract
To utilize foundation vision–language models (VLMs) for robotic tasks and motion planning, the community has proposed different methods for injecting action components into VLMs to build vision–language–action models (VLAs). Here we identify the key factors that significantly influence the performance of VLAs on robot manipulation problems and focus on three essential design choices: which backbone to select, how to formulate the VLA architecture and when to add cross-embodiment data. The results explain why we prefer VLAs and motivate a new family of VLAs, RoboVLMs, which require very little manual design and achieve new state-of-the-art performance in three simulation tasks and in real-world experiments. Through extensive experiments covering over 8 VLM backbones, 4 policy architectures and over 600 distinct experiments, we provide a detailed guidebook for the future design of VLAs. In addition, the highly flexible RoboVLMs framework, which supports easy integration of new VLMs and free combination of design choices, is made public to facilitate future research. We open-source all details, including code, models, datasets and toolkits, along with detailed training and evaluation recipes, at robovlms.github.io.
Data availability
The datasets used in this study are available via GitHub at https://github.com/mees/calvin (CALVIN), https://github.com/google-deepmind/open_x_embodiment (OXE) and https://huggingface.co/datasets/robovlms/bytedance_robot_benchmark_20 (BDRBench20).
Code availability
The source code of this study is available via GitHub at https://github.com/Robot-VLAs/RoboVLMs. It is also available via Zenodo at https://zenodo.org/records/17757179 (ref. 55).
Change history
16 February 2026
In the version of the article initially published, the affiliations of Di Guo and Hanbo Zhang were switched and have now been amended so that Di Guo is affiliated with the Beijing University of Posts and Telecommunications, Beijing, China and Hanbo Zhang with the National University of Singapore, Singapore, Singapore. This correction has been made to the HTML and PDF versions of the article.
References
Bousmalis, K. et al. Robocat: a self-improving foundation agent for robotic manipulation. Transactions on Machine Learning Research (ed. Walter, M.) (TMLR, 2024).
Brohan, A. et al. Rt-2: vision–language–action models transfer web knowledge to robotic control. In Conference on Robot Learning (eds Tan, J. et al.) 2165–2183 (PMLR, 2023).
Black, K. et al. π0: a vision–language–action flow model for general robot control. Preprint at https://arxiv.org/pdf/2410.24164 (2024).
O’Neill, A. et al. Open X-Embodiment: robotic learning datasets and rt-x models. In 2024 IEEE International Conference on Robotics and Automation (ICRA) (ed. O’Malley, M. K.) 6892–6903 (IEEE, 2024).
Liu, H., Guo, D. & Cangelosi, A. Embodied intelligence: a synergy of morphology, action, perception and learning. ACM Comput. Surv. 57, 1–36 (2025).
Kim, M. J. et al. Openvla: an open-source vision–language–action model. In Conference on Robot Learning (eds Agrawal, P. et al.) 2679–2713 (PMLR, 2025).
Li, X. et al. Vision–language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2024).
Ghosh, D. et al. Octo: an open-source generalist robot policy. In Proc. Robotics: Science and Systems 090 (RSS, 2024).
Wu, H. et al. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2024).
Nair, S., Rajeswaran, A., Kumar, V., Finn, C. & Gupta, A. R3m: a universal visual representation for robot manipulation. In Conference on Robot Learning (eds Liu, K. et al.) 892–909 (PMLR, 2023).
Jiang, Y. et al. Vima: general robot manipulation with multimodal prompts. In Proc. Machine Learning Research 14975–15022 (PMLR, 2023).
Zhen, H. et al. 3d-vla: a 3D vision–language–action generative world model. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 61229–61245 (ICML, 2024).
Zhou, Z., Zhu, Y., Wen, J., Shen, C. & Xu, Y. Vision–language–action model with open-world embodied reasoning from pretrained knowledge. Preprint at https://arxiv.org/pdf/2505.21906 (2025).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning 1126–1135 (PMLR, 2017).
Mees, O., Hermann, L., Rosete-Beas, E. & Burgard, W. Calvin: a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics Autom. Lett. 7, 7327–7334 (2022).
Radosavovic, I. et al. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning (eds Liu, K. et al.) 416–426 (PMLR, 2023).
Peng, Z. et al. Kosmos-2: grounding multimodal large language models to the world. Preprint at https://arxiv.org/pdf/2306.14824 (2023).
Beyer, L. et al. Paligemma: a versatile 3b VLM for transfer. Preprint at https://arxiv.org/pdf/2407.07726 (2024).
Torne, M. et al. Reconciling reality through simulation: a real-to-sim-to-real approach for robust manipulation. Preprint at https://arxiv.org/pdf/2403.03949 (2024).
Li, X. et al. Evaluating real-world robot manipulation policies in simulation. In Conference on Robot Learning (eds Agrawal, P. et al.) 3705–3728 (PMLR, 2025).
Brohan, A. et al. Rt-1: robotics transformer for real-world control at scale. In Proc. Robotics Science and Systems XIX 025 (RSS, 2023).
Walke, H. et al. Bridgedata v2: a dataset for robot learning at scale. In Conference on Robot Learning (eds Tan, J. et al.) 1723–1736 (PMLR, 2023).
Cheang, C.-L. et al. Gr-2: a generative video-language-action model with web-scale knowledge for robot manipulation. Preprint at https://arxiv.org/pdf/2410.06158 (2024).
Li, P. et al. Gr-mg: leveraging partially annotated data via multi-modal goal conditioned policy. IEEE Robotics and Automation Letters (eds Asfour, A. et al.) 1912–1919 (IEEE, 2025).
Zhao, W., Queralta, J. P. & Westerlund, T. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI) (eds Abbass, H. et al.) 737–744 (IEEE, 2020).
Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 35, 23716–23736 (2022).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 36, 34892–34916 (2024).
Bai, J. et al. Qwen-vl: a frontier large vision–language model with versatile abilities. Preprint at https://arxiv.org/pdf/2308.12966 (2023).
Vikhyat. Moondream, tiny vision language model. GitHub https://github.com/vikhyat/moondream (2024).
Unum-cloud. Uform: pocket-sized multimodal ai for content understanding and generation. Hugging Face https://huggingface.co/unum-cloud/uform-gen2-qwen-500m (2024).
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M. & Le, M. Flow matching for generative modeling. The Eleventh International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2023).
Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. Proc. Robotics: Science and Systems XIX, 016 (RSS, 2023).
Shazeer, N. et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR, 2017).
Intelligence, P. et al. A vision–language–action model with open-world generalization. Preprint at https://arxiv.org/pdf/2505.21906 (2025).
Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In International Conference on Learning Representations (eds Kim, B. et al.) (ICLR, 2021).
Jaegle, A. et al. Perceiver: general perception with iterative attention. In International Conference on Machine Learning 4651–4664 (PMLR, 2021).
Liu, J. et al. Robomamba: multimodal state space model for efficient robot reasoning and manipulation. Adv. Neural Inf. Process. Syst. 37, 40085–40110 (2024).
Nagrani, A. et al. Attention bottlenecks for multimodal fusion. Adv. Neural Inf. Process. Syst. 34, 14200–14213 (2021).
Xu, H. et al. VLM: task-agnostic video-language model pre-training for video understanding. Findings of the Association for Computational Linguistics 4227–4239 (ACL-IJCNLP, 2021).
Wang, P. et al. Ofa: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. International Conference on Machine Learning (eds Chaudhuri, K. et al.) 23318–23340 (PMLR, 2022).
Yang, Z. et al. The dawn of lmms: preliminary explorations with GPT-4v(ision). Preprint at https://arxiv.org/pdf/2309.17421 (2023).
Jang, E. et al. Bc-z: zero-shot task generalization with robotic imitation learning. Conference on Robot Learning (eds Faust, A. et al.) 991–1002 (PMLR, 2022).
Ke, T.-W., Gkanatsios, N. & Fragkiadaki, K. 3d diffuser actor: policy diffusion with 3d scene representations. Conference on Robot Learning (eds Agrawal, P. et al.) 1949–1974 (PMLR, 2025).
Ye, S. et al. Latent action pretraining from videos. The Thirteenth International Conference on Learning Representations (eds Yue, Y. et al.) 90629–90655 (ICLR, 2025).
Zawalski, M. et al. Robotic control via embodied chain-of-thought reasoning. Conference on Robot Learning (eds Agrawal, A. et al.) 3157–3181 (PMLR, 2025).
Reed, S. et al. A generalist agent. Transactions on Machine Learning Research (eds Larochelle, H. et al.) 1–42 (ML Research Foundation, 2022).
Medsker, L. R., Jain, L. et al. Recurrent neural networks. Des. Appl. 5, 2 (2001).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/pdf/1412.3555 (2014).
Kawakami, K. Supervised Sequence Labelling with Recurrent Neural Networks. Ph.D. thesis, Technical University of Munich (2008).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Floridi, L. & Chiriatti, M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 30, 681–694 (2020).
Chi, C. et al. Diffusion policy: visuomotor policy learning via action diffusion. In Proc. Robotics: Science and Systems XIX 026 (RSS, 2023).
Liu, F. et al. Robouniview: visual-language model with unified view representation for robotic manipulation. Preprint at https://arxiv.org/pdf/2406.18977 (2024).
Yue, Y. et al. Deer-vla: dynamic inference of multimodal large language models for efficient robot execution. Adv. Neural Inf. Process. Syst. 37, 56619–56643 (2024).
Li, X. et al. What matters in building vision–language–action models for generalist robots (codebase). Zenodo https://zenodo.org/records/17757179 (2025).
Acknowledgements
This work was jointly supported by the National Natural Science Fund under grant nos 62025304 and 62120106005, Beijing Natural Science Foundation under grant no. L253006 and Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China under grant no. JYB2025XDXM109. We thank all the members of the robotics research team at ByteDance Research for their assistance in real-world data collection, setup design, robot maintenance and experiments. M.L. is supported by the ByteDance Scholarship. We also want to thank @YouJiacheng for his active and instructive discussion on X.
Author information
Contributions
Project Leads: H.L., X.L., H.Z. and M.L. Methodology and codebase: X.L. Model training and evaluation (experimental design, implementation): X.L., L.Q., H.Z., D.W., M.L., X.M. and J.L. Real-robot deployment and experiments: X.L. and P.L. Logic, figures, visualizations and writing: X.L., M.L., H.Z., L.Q., B.K., X.M., P.L., J.L., D.G., H.L. and T.K. Advising: H.L., T.K., H.Z., X.M., B.K., D.G. and X.W.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Self-correction in real-world experiments.
Rollouts showing that the best-performing VLA built with RoboVLMs exhibits an emergent self-correction ability. For instance, in the Open The Oven task, the robot's first attempt does not reach the oven handle, and it adjusts the end-effector position to relocate the handle on the second attempt. Note that the training dataset does not contain this kind of data.
Extended Data Fig. 2 The architectures of MoE and considered VLAs.
(a) Illustration of the mixture-of-experts (MoE) structure. In the original VLAs, vision–language tokens and action tokens share the weights of the original VLM feed-forward network (FFN). In the MoE structure, vision–language tokens and action tokens have separate query, key and value projection layers for self-attention, as well as separate feed-forward networks. Action tokens interact with vision–language tokens only through self-attention. This structure preserves the original parameters and forward process of the VLM and is claimed to benefit the generalization of the resulting VLAs. (b) Illustration of the VLA formulations considered, including several popular designs. For example, RoboFlamingo is a Policy-Head-Continuous-type VLA, RT-2 and OpenVLA correspond to the One-Step-Discrete-Action-type VLA, and Octo and GR correspond to the Interleaved-Continuous-Action-type VLA with a fixed window size.
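The token routing described in panel (a) can be illustrated with a minimal, dependency-free sketch. It is not the RoboVLMs implementation: the names `Expert` and `moe_block`, the single attention head, the ReLU FFN, the random weights and the omission of layer normalization are all illustrative assumptions. The attention mask (vision–language tokens attend only to vision–language tokens) is likewise an assumption, chosen because it is one way to keep the VLM forward process unchanged.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Expert:
    """One parameter set: Q/K/V projections plus a two-layer FFN (hypothetical)."""

    def __init__(self, rng, dim, hidden):
        self.wq = rand_matrix(rng, dim, dim)
        self.wk = rand_matrix(rng, dim, dim)
        self.wv = rand_matrix(rng, dim, dim)
        self.w1 = rand_matrix(rng, hidden, dim)
        self.w2 = rand_matrix(rng, dim, hidden)

    def ffn(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w1, x)]  # ReLU
        return matvec(self.w2, hidden)


def moe_block(tokens, kinds, experts):
    """Run one block; kinds[i] is 'vl' or 'act' and selects the expert weights."""
    dim = len(tokens[0])
    # Each token is projected by the expert that matches its own type.
    queries = [matvec(experts[k].wq, t) for t, k in zip(tokens, kinds)]
    keys = [matvec(experts[k].wk, t) for t, k in zip(tokens, kinds)]
    values = [matvec(experts[k].wv, t) for t, k in zip(tokens, kinds)]
    outputs = []
    for i, kind in enumerate(kinds):
        # Assumed mask: action tokens see everything, VL tokens see only VL tokens.
        visible = [j for j, other in enumerate(kinds) if kind == "act" or other == "vl"]
        scores = [
            sum(a * b for a, b in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in visible
        ]
        weights = softmax(scores)
        attended = [
            sum(w * values[j][d] for w, j in zip(weights, visible)) for d in range(dim)
        ]
        hidden = [t + a for t, a in zip(tokens[i], attended)]  # residual
        update = experts[kind].ffn(hidden)  # type-specific FFN
        outputs.append([h + u for h, u in zip(hidden, update)])
    return outputs


rng = random.Random(0)
experts = {"vl": Expert(rng, dim=4, hidden=8), "act": Expert(rng, dim=4, hidden=8)}
vl_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
act_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
mixed = moe_block(vl_tokens + act_tokens, ["vl"] * 3 + ["act"] * 2, experts)
vl_only = moe_block(vl_tokens, ["vl"] * 3, experts)
```

Under the assumed mask, the outputs for the vision–language tokens in `mixed` match those in `vl_only`, so adding action tokens leaves the vision–language pathway untouched while the action tokens still read from it through attention.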
Extended Data Fig. 3 Illustration of cross-embodiment training configurations for Bridge and Google Robot.
The red crosses (✗) denote excluded training stages, and the small icons represent the datasets used, corresponding to those shown at the top of the figure. Taking Bridge Post Train as an example, two stages are employed. Stage 1: co-training with cross-embodiment data, the Bridge V2 dataset, the RT1 target dataset and RT1 extra data. Stage 2: post-training refinement using only the Bridge V2 dataset.
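As a reading aid, the two-stage Bridge Post Train recipe from this caption can be written down as a small configuration sketch. The `Stage` class, the stage names and the dataset labels are hypothetical identifiers chosen for illustration; they are not names from the RoboVLMs codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage and the datasets it mixes (labels are hypothetical)."""
    name: str
    datasets: Tuple[str, ...]


# Bridge Post Train, as described in the caption: co-train, then refine on Bridge V2.
BRIDGE_POST_TRAIN = [
    Stage("co_train", ("cross_embodiment", "bridge_v2", "rt1_target", "rt1_extra")),
    Stage("post_train", ("bridge_v2",)),
]


def training_schedule(stages: List[Stage]) -> List[Tuple[int, str, str]]:
    """Flatten a recipe into (stage number, stage name, dataset) rows in run order."""
    return [
        (index, stage.name, dataset)
        for index, stage in enumerate(stages, start=1)
        for dataset in stage.datasets
    ]


for row in training_schedule(BRIDGE_POST_TRAIN):
    print(row)
```

A recipe with an excluded stage (a red cross in the figure) would simply omit the corresponding `Stage` entry from the list.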
Extended Data Fig. 4 Ablation of cross-embodiment training results on SimplerEnv.
We evaluate four different training recipes. On the WidowX+Bridge environments, we test: (1) Bridge Finetune, which finetunes the VLA directly on the full Bridge datasets (tested tasks not included); (2) OXE Co-Train, which co-trains the VLA on the OXE dataset; and (3) Post-Train, which trains the OXE co-trained VLA on the Bridge datasets. On the Google Robot environments, we test: (1) RT-Partial Finetune, which finetunes the VLA on the tested RT tasks only; (2) RT Finetune, which finetunes the VLA on the full RT dataset (tested tasks included); (3) OXE Co-Train; and (4) Post-Train, which further trains the co-trained VLA on the tested RT tasks.
Extended Data Fig. 5 Few-shot learning on CALVIN.
The effect of cross-embodiment pre-training on OXE datasets for few-shot learning.
Supplementary information
Supplementary Information
Supplementary Figs. 1–6, Tables 1–4 and discussion.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, X., Li, P., Qian, L. et al. What matters in building vision–language–action models for generalist robots. Nat Mach Intell 8, 158–172 (2026). https://doi.org/10.1038/s42256-025-01168-7
DOI: https://doi.org/10.1038/s42256-025-01168-7


