Benchmarking large language models on safety risks in scientific laboratories

A preprint version of the article is available at arXiv.

Abstract

Artificial intelligence is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models and vision language models now assist in experiment design and procedural guidance, but their ‘illusion of understanding’ may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment and consequence prediction across 765 multiple-choice questions and 404 realistic laboratory scenarios, encompassing 3,128 open-ended tasks. Evaluations of 19 advanced large language models and vision language models show that no model surpasses 70% accuracy on hazard identification. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying artificial intelligence systems in real laboratory settings.
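
To make the structured portion of such an evaluation concrete, the sketch below scores a model's multiple-choice answers against gold labels, which is conceptually how MCQ accuracy is computed in benchmarks like LabSafety Bench. This is a minimal illustration, not the authors' harness: `query_model` is a hypothetical stand-in for an LLM API call, and the record fields (`question`, `options`, `answer`) are assumed names.

```python
# Illustrative MCQ-accuracy scoring, in the spirit of LabSafety Bench's
# structured evaluation. NOTE: `query_model` and the record field names
# ('question', 'options', 'answer') are hypothetical placeholders, not the
# authors' actual evaluation harness.
import re


def query_model(prompt: str) -> str:
    """Stand-in for an LLM API call; returns the model's raw text reply."""
    raise NotImplementedError("wire up your LLM client here")


def extract_choice(reply: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a free-text reply."""
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None


def mcq_accuracy(questions: list[dict]) -> float:
    """questions: [{'question': str, 'options': [str, ...], 'answer': 'A'}, ...]"""
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", q["options"]))
        prompt = (f"{q['question']}\n{options}\n"
                  "Answer with the letter of the single best option.")
        if extract_choice(query_model(prompt)) == q["answer"]:
            correct += 1
    return correct / len(questions)
```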


Fig. 1: Overview of LabSafety Bench.
Fig. 2: Model performance on MCQs.
Fig. 3: Model performance on scenario-based tests.
Fig. 4: Simplified examples of common errors made by GPT-4o.
Fig. 5: Results of different enhancement methods on LabSafety Bench.
Fig. 6: Overview of the LabSafety Bench methodology.

Data availability

The benchmark dataset is available from Huggingface (https://huggingface.co/datasets/yujunzhou/LabSafety_Bench) with https://doi.org/10.57967/hf/6723 (ref. 38). The project website is available via GitHub at https://github.com/YujunZhou/LabSafety-Bench. Source data are provided with this paper.
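
For readers who want to inspect the benchmark directly, the dataset can be fetched with the Hugging Face `datasets` library. The following is a minimal sketch; the split and field layout are release-dependent, so the names printed here should be checked against the dataset card rather than assumed.

```python
# Minimal sketch: fetch LabSafety Bench from Hugging Face and inspect it.
# Requires: pip install datasets
# NOTE: split and field names vary by release; consult the dataset card at
# https://huggingface.co/datasets/yujunzhou/LabSafety_Bench before relying
# on any particular layout.
from datasets import load_dataset

ds = load_dataset("yujunzhou/LabSafety_Bench")  # downloads the default config
print(ds)                                       # lists splits and their sizes

first_split = next(iter(ds.values()))           # whichever split comes first
print(first_split[0])                           # one record, showing field names
```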

Code availability

The source code for the LabSafety Bench framework and all evaluation scripts are publicly available via GitHub at https://github.com/YujunZhou/LabSafety-Bench. The version of the code used for this study (v1.0.0) is permanently archived and available via Zenodo at https://doi.org/10.5281/zenodo.17019500 (ref. 39).

References

  1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  2. Callaway, E. Chemistry Nobel goes to developers of AlphaFold AI that predicts protein structures. Nature 634, 525–526 (2024).

  3. O’Donnell, J., Heaven, W. D. & Heikkilä, M. What’s next for AI in 2025. MIT Technology Review 1109188/whats-next-for-ai-in-2025 (2025).

  4. Messeri, L. & Crockett, M. Artificial intelligence and illusions of understanding in scientific research. Nature 627, 49–58 (2024).

  5. OpenAI o1 system card. OpenAI https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf (accessed 15 September 2024).

  6. Achiam, J. et al. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).

  7. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://doi.org/10.48550/arXiv.2307.09288 (2023).

  8. Logg, J. M., Minson, J. A. & Moore, D. A. Algorithm appreciation: people prefer algorithmic to human judgment. Organ. Behav. Hum. Decis. Process. 151, 90–103 (2019).

  9. Sloman, S. A. & Rabb, N. Your understanding is my understanding: evidence for a community of knowledge. Psychol. Sci. 27, 1451–1460 (2016).

  10. Ménard, A. D. & Trant, J. F. A review and critique of academic lab safety research. Nat. Chem. 12, 17–25 (2020).

  11. Wu, K., Jin, X. & Wang, X. Determining university students’ familiarity and understanding of laboratory safety knowledge—a case study. J. Chem. Educ. 98, 434–438 (2020).

  12. Ali, L. et al. Development of YOLOv5-based real-time smart monitoring system for increasing lab safety awareness in educational institutions. Sensors 22, 8820 (2022).

  13. Camel, V. et al. Open digital educational resources for self-training chemistry lab safety rules. J. Chem. Educ. 98, 208–217 (2020).

  14. Kim, J. G., Jo, H. J. & Roh, Y. H. Analysis of accidents in chemistry/chemical engineering laboratories in Korea. Process Saf. Prog. https://doi.org/10.1002/prs.12528 (2023).

  15. Incident reporting rule submission information and data. CSB https://www.csb.gov/news/incident-report-rule-form-/ (2024).

  16. LSI: memorial wall—killed in lab accident. Laboratory Safety Institute https://www.labsafety.org/memorial-wall (2023).

  17. Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).

  18. Latif, E., Parasuraman, R. & Zhai, X. PhysicsAssistant: an LLM-powered interactive learning robot for physics lab investigations. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) 864–871 (IEEE, 2024).

  19. M. Bran, A. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).

  20. Guo, T. et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2023/file/bbb330189ce02be00cf7346167028ab1-Paper-Datasets_and_Benchmarks.pdf (NeurIPS, 2023).

  21. Luo, X. et al. Large language models surpass human experts in predicting neuroscience results. Nat. Hum. Behav. 9, 305–315 (2024).

  22. Jones, N. ‘In awe’: scientists impressed by latest ChatGPT model o1. Nature 634, 275–276 (2024).

  23. Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 1–55 (2025).

  24. Sun, L. et al. SciEval: a multi-level large language model evaluation benchmark for scientific research. In Proc. AAAI Conference on Artificial Intelligence, Vol. 38, 19053–19061 (AAAI Press, 2024).

  25. Cai, H. et al. SciAssess: benchmarking LLM proficiency in scientific literature analysis. In Findings of the Association for Computational Linguistics: NAACL 2025, 2335–2357 (eds. Chiruzzo, L., Ritter, A. & Wang, L.) (Association for Computational Linguistics, Albuquerque, New Mexico, 2025).

  26. Laboratory safety guidance. OSHA https://www.osha.gov/sites/default/files/publications/OSHA3404laboratory-safety-guidance.pdf (2011).

  27. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020) https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf (NeurIPS, 2020).

  28. Safety in academic chemistry laboratories: best practices for first- and second-year university students (American Chemical Society, Joint Board Council Committee on Chemical Safety, 2017).

  29. WHO. Laboratory Biosafety Manual 5, 1–109 (World Health Organization, 2003).

  30. UW radiation safety manual. University of Washington https://www.ehs.washington.edu/system/files/resources/RSManualBinder.pdf (2003).

  31. OSHA factsheet: laboratory safety biosafety cabinets (BSCs). OSHA https://www.osha.gov/sites/default/files/publications/OSHAfactsheet-laboratory-safety-biosafety-cabinets.pdf (2011).

  32. OSHA quickfacts: OSHA laboratory safety cryogens and dry ice. OSHA https://www.osha.gov/sites/default/files/publications/OSHAquickfacts-lab-safety-cryogens-dryice.pdf (2011).

  33. Xu, C. et al. WizardLM: empowering large language models to follow complex instructions. Preprint at https://doi.org/10.48550/arXiv.2304.12244 (2023).

  34. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://openreview.net/pdf?id=_VjQlMeSB_J (NeurIPS, 2022).

  35. Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023).

  36. Gu, J. et al. A survey on LLM-as-a-judge. Preprint at https://doi.org/10.48550/arXiv.2411.15594 (2024).

  37. Zheng, L. et al. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proc. 37th International Conference on Neural Information Processing Systems 46595–46623 (Curran, 2023).

  38. Zhou, Y. LabSafety Bench (revision f23d0e3). Hugging Face https://huggingface.co/datasets/yujunzhou/LabSafety_Bench (2025).

  39. Zhou, Y. et al. LabSafety-Bench: benchmarking LLMs on safety issues in scientific labs. Zenodo https://doi.org/10.5281/zenodo.17019500 (2025).

Acknowledgements

This work was supported by the ND-IBM Tech Ethics Lab. Support for this project was also provided by the National Science Foundation under the NSF Center for Computer Assisted Synthesis (C-CAS; grant no. CHE-2202693). K.G. and X.Z. were supported by C-CAS, and Y.Z. and Y.H. were supported by the ND-IBM Tech Ethics Lab. We gratefully acknowledge the Risk Management and Safety team at the University of Notre Dame for their valuable guidance on laboratory safety. We thank K. Ruley Haase for her contributions during the data collection phase, and we are especially grateful to the University of Notre Dame students who participated in the human evaluation.

Author information

Contributions

Y.Z., X.Z., P.-Y.C. and T.G. conceived the research. Y.Z., J.Y., Z.E., B.G., A.B. and S.S. curated the data. Y.Z., X.Z., Y.H., K.G., Z.L., P.-Y.C., T.G., W.G., N.M. and N.V.C. wrote the manuscript. All authors contributed to improving the manuscript and approved the submission.

Corresponding author

Correspondence to Xiangliang Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Nicolò Sabetta, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Design and composition of the LabSafety Bench benchmark.

a, Our proposed taxonomy of lab safety. b, Key statistics of LabSafety Bench. c, The overall workflow of benchmark MCQ curation.

Supplementary information

Supplementary Information (PDF)

Supplementary Figs. 1–11, Discussion and Tables 1 and 2.

Reporting Summary (PDF)

Supplementary Data 1 (XLSX)

Source data for supplementary figures.

Supplementary Data 2 (XLSX)

Source data for supplementary figures.

Supplementary Data 3 (XLSX)

Source data for supplementary figures.

Source data

Source Data Fig. 1 (XLSX)

The results table for Fig. 1f.

Source Data Fig. 2 (XLSX)

The results table for all subfigures.

Source Data Fig. 3 (XLSX)

The results table for all subfigures.

Source Data Fig. 5 (XLSX)

The results table for all subfigures.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, Y., Yang, J., Huang, Y. et al. Benchmarking large language models on safety risks in scientific laboratories. Nat Mach Intell 8, 20–31 (2026). https://doi.org/10.1038/s42256-025-01152-1

