A benchmark, MaCBench, is developed for evaluating the scientific knowledge of vision language models (VLMs). Evaluation of leading VLMs reveals that they excel at basic scientific tasks such as equipment identification, but struggle with spatial reasoning and multistep analysis, a limitation for autonomous scientific discovery.

References
Durante, Z. et al. Agent AI: Surveying the horizons of multimodal interaction. Preprint at https://arxiv.org/abs/2401.03568 (2024). This preprint discusses agent-based multimodal intelligence.
Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023). This paper reports a chemical agent that autonomously performs reactions.
Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024). This paper reports an LLM agent that uses chemical tools.
Chollet, F. On the measure of intelligence. Preprint at https://arxiv.org/abs/1911.01547 (2019). This preprint discusses how to evaluate the ‘intelligence’ of models and to compare with that of humans.
Mirza, A. et al. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat. Chem. 17, 1027–1034 (2025). This article describes a framework for the evaluation of the chemical capabilities of LLMs.
This is a summary of: Alampara, N. et al. Probing the limitations of multimodal language models for chemistry and materials research. Nat. Comput. Sci. https://doi.org/10.1038/s43588-025-00836-3 (2025).
Cite this article
Vision language models excel at perception but struggle with scientific reasoning. Nat. Comput. Sci. 5, 852–853 (2025). https://doi.org/10.1038/s43588-025-00871-0