Abstract
Artificial intelligence is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models and vision language models now assist in experiment design and procedural guidance, but their ‘illusion of understanding’ may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment and consequence prediction across 765 multiple-choice questions and 404 realistic laboratory scenarios, encompassing 3,128 open-ended tasks. Evaluations of 19 advanced large language models and vision language models show that no model surpasses 70% accuracy on hazard identification. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying artificial intelligence systems in real laboratory settings.
Data availability
The benchmark dataset is available from Huggingface (https://huggingface.co/datasets/yujunzhou/LabSafety_Bench) with https://doi.org/10.57967/hf/6723 (ref. 38). The project website is available via GitHub at https://github.com/YujunZhou/LabSafety-Bench. Source data are provided with this paper.
Code availability
The source code for the LabSafety Bench framework and all evaluation scripts are publicly available via GitHub at https://github.com/YujunZhou/LabSafety-Bench. The version of the code used for this study (v1.0.0) is permanently archived and available via Zenodo at https://doi.org/10.5281/zenodo.17019500 (ref. 39).
References
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Callaway, E. Chemistry Nobel goes to developers of AlphaFold AI that predicts protein structures. Nature 634, 525–526 (2024).
O’Donnell, J., Heaven, W.D. & Heikkilä, M. What’s next for AI in 2025. MIT Technology Review 1109188/whats-next-for-ai-in-2025 (2025).
Messeri, L. & Crockett, M. Artificial intelligence and illusions of understanding in scientific research. Nature 627, 49–58 (2024).
OpenAI o1 system card. OpenAI https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf (accessed 15 September 2024).
Achiam, J. et al. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://doi.org/10.48550/arXiv.2307.09288 (2023).
Logg, J. M., Minson, J. A. & Moore, D. A. Algorithm appreciation: people prefer algorithmic to human judgment. Organ. Behav. Hum. Decis. Process. 151, 90–103 (2019).
Sloman, S. A. & Rabb, N. Your understanding is my understanding: evidence for a community of knowledge. Psychol. Sci. 27, 1451–1460 (2016).
Ménard, A. D. & Trant, J. F. A review and critique of academic lab safety research. Nat. Chem. 12, 17–25 (2020).
Wu, K., Jin, X. & Wang, X. Determining university students’ familiarity and understanding of laboratory safety knowledge—a case study. J. Chem. Educ. 98, 434–438 (2020).
Ali, L. et al. Development of YOLOv5-based real-time smart monitoring system for increasing lab safety awareness in educational institutions. Sensors 22, 8820 (2022).
Camel, V. et al. Open digital educational resources for self-training chemistry lab safety rules. J. Chem. Educ. 98, 208–217 (2020).
Kim, J. G., Jo, H. J. & Roh, Y. H. Analysis of accidents in chemistry/chemical engineering laboratories in Korea. Process Saf. Prog. https://doi.org/10.1002/prs.12528 (2023).
Incident reporting rule submission information and data. CSB https://www.csb.gov/news/incident-report-rule-form-/ (2024).
LSI: memorial wall—killed in lab accident. Laboratory Safety Institute https://www.labsafety.org/memorial-wall (2023).
Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
Latif, E., Parasuraman, R. & Zhai, X. PhysicsAssistant: an LLM-powered interactive learning robot for physics lab investigations. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) 864–871 (IEEE, 2024).
M. Bran, A. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
Guo, T. et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2023/file/bbb330189ce02be00cf7346167028ab1-Paper-Datasets_and_Benchmarks.pdf (NeurIPS, 2023).
Luo, X. et al. Large language models surpass human experts in predicting neuroscience results. Nat. Hum. Behav. 9, 305–315 (2024).
Jones, N. ‘In awe’: scientists impressed by latest ChatGPT model o1. Nature 634, 275–276 (2024).
Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 2, 1–55 (2025).
Sun, L. et al. Scieval: a multi-level large language model evaluation benchmark for scientific research. In Proc. AAAI Conference on Artificial Intelligence, Vol. 38, 19053–19061 (AAAI Press, 2024).
Cai, H. et al. SciAssess: benchmarking LLM proficiency in scientific literature analysis. In Findings of the Association for Computational Linguistics: NAACL 2025, 2335–2357 (eds. Chiruzzo, L., Ritter, A. & Wang, L.) (Association for Computational Linguistics, Albuquerque, New Mexico, 2025).
Laboratory safety guidance. OSHA https://www.osha.gov/sites/default/files/publications/OSHA3404laboratory-safety-guidance.pdf (2011).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020) https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf (NeurIPS, 2020).
Safety in academic chemistry laboratories: best practices for first- and second-year university students (American Chemical Society, Joint Board Council Committee On Chemical Safety, 2017).
WHO. Laboratory Biosafety Manual 5, 1–109 (World Health Organization, 2003).
UW radiation safety manual. University of Washington https://www.ehs.washington.edu/system/files/resources/RSManualBinder.pdf (2003).
OSHA factsheet: laboratory safety biosafety cabinets (BSCs). OSHA https://www.osha.gov/sites/default/files/publications/OSHAfactsheet-laboratory-safety-biosafety-cabinets.pdf (OSHA, 2011).
OSHA quickfacts: OSHA laboratory safety cryogens and dry ice. OSHA https://www.osha.gov/sites/default/files/publications/OSHAquickfacts-lab-safety-cryogens-dryice.pdf (2011).
Xu, C. et al. WizardLM: empowering large language models to follow complex instructions. Preprint at https://doi.org/10.48550/arXiv.2304.12244 (2023).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://openreview.net/pdf?id=_VjQlMeSB_J (NeurIPS, 2022).
Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023).
Gu, J. et al. A survey on LLM-as-a-judge. Preprint at https://arxiv.org/abs/2411.15594 (2024).
Zheng, L. et al. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proc. 37th International Conference on Neural Information Processing Systems 46595–46623 (Curran, 2023).
Zhou, Y. LabSafety Bench (revision f23d0e3). Hugging Face https://huggingface.co/datasets/yujunzhou/LabSafety_Bench (2025).
Zhou, Y. et al. LabSafety-Bench: benchmarking LLMs on safety issues in scientific labs. Zenodo https://doi.org/10.5281/zenodo.17019500 (2025).
Acknowledgements
This work was supported by the ND-IBM Tech Ethics Lab. Support for this project was also provided by the National Science Foundation under the NSF Center for Computer Assisted Synthesis (C-CAS; grant no. CHE-2202693). K.G. and X.Z. were supported by C-CAS, and Y.Z. and Y.H. were supported by the ND-IBM Tech Ethics Lab. We gratefully acknowledge the Risk Management and Safety team at the University of Notre Dame for their valuable guidance on laboratory safety. We thank K. Ruley Haase for her contributions during the data collection phase, and we are especially grateful to the University of Notre Dame students who participated in the human evaluation.
Author information
Authors and Affiliations
Contributions
Y.Z., X.Z., P.-Y.C. and T.G. conceived the research. Y.Z., J.Y., Z.E., B.G., A.B. and S.S. curated the data. Y.Z., X.Z., Y.H., K.G., Z.L., P.-Y.C., T.G., W.G., N.M. and N.V.C. wrote the manuscript. All authors contributed to improving the manuscript and approved the submission.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Nicolò Sabetta, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Design and Composition of the LabSafety Bench Benchmark.
a, Our proposed taxonomy of lab safety. b, Key statistics of LabSafety Bench. c, The overall workflow of benchmark MCQ curation.
Supplementary information
Supplementary Information
Supplementary Figs. 1–11, Discussion and Tables 1 and 2.
Supplementary Data 1
Source data for supplementary figures.
Supplementary Data 2
Source data for supplementary figures.
Supplementary Data 3
Source data for supplementary figures.
Source data
Source Data Fig. 1
The results table for Fig. 1f.
Source Data Fig. 2
The results table for all subfigures.
Source Data Fig. 3
The results table for all subfigures.
Source Data Fig. 5
The results table for all subfigures.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, Y., Yang, J., Huang, Y. et al. Benchmarking large language models on safety risks in scientific laboratories. Nat Mach Intell 8, 20–31 (2026). https://doi.org/10.1038/s42256-025-01152-1