Benchmarking large language models on safety risks in scientific laboratories

A preprint version of the article is available at arXiv.

Abstract

Artificial intelligence is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models and vision language models now assist in experiment design and procedural guidance, but their ‘illusion of understanding’ may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment and consequence prediction across 765 multiple-choice questions and 404 realistic laboratory scenarios, encompassing 3,128 open-ended tasks. Evaluations of 19 advanced large language models and vision language models show that no model surpasses 70% accuracy on hazard identification. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying artificial intelligence systems in real laboratory settings.
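
To make the structured portion of such an evaluation concrete, the sketch below scores a model's multiple-choice answers against gold labels, which is conceptually how MCQ accuracy is computed in benchmarks like LabSafety Bench. This is a minimal illustration, not the authors' harness: `query_model` is a hypothetical stand-in for an LLM API call, and the record fields (`question`, `options`, `answer`) are assumed names.

```python
# Illustrative MCQ-accuracy scoring, in the spirit of LabSafety Bench's
# structured evaluation. NOTE: `query_model` and the record field names
# ('question', 'options', 'answer') are hypothetical placeholders, not the
# authors' actual evaluation harness.
import re


def query_model(prompt: str) -> str:
    """Stand-in for an LLM API call; returns the model's raw text reply."""
    raise NotImplementedError("wire up your LLM client here")


def extract_choice(reply: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a free-text reply."""
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None


def mcq_accuracy(questions: list[dict]) -> float:
    """questions: [{'question': str, 'options': [str, ...], 'answer': 'A'}, ...]"""
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", q["options"]))
        prompt = (f"{q['question']}\n{options}\n"
                  "Answer with the letter of the single best option.")
        if extract_choice(query_model(prompt)) == q["answer"]:
            correct += 1
    return correct / len(questions)
```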


Fig. 1: Overview of LabSafety Bench.
Fig. 2: Model performance on MCQs.
Fig. 3: Model performance on scenario-based tests.
Fig. 4: Simplified examples of common errors made by GPT-4o.
Fig. 5: Results of different enhancement methods on LabSafety Bench.
Fig. 6: Overview of the LabSafety Bench methodology.

Data availability

The benchmark dataset is available from Huggingface (https://huggingface.co/datasets/yujunzhou/LabSafety_Bench) with https://doi.org/10.57967/hf/6723 (ref. 38). The project website is available via GitHub at https://github.com/YujunZhou/LabSafety-Bench. Source data are provided with this paper.
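
For readers who want to inspect the benchmark directly, the dataset can be fetched with the Hugging Face `datasets` library. The following is a minimal sketch; the split and field layout are release-dependent, so the names printed here should be checked against the dataset card rather than assumed.

```python
# Minimal sketch: fetch LabSafety Bench from Hugging Face and inspect it.
# Requires: pip install datasets
# NOTE: split and field names vary by release; consult the dataset card at
# https://huggingface.co/datasets/yujunzhou/LabSafety_Bench before relying
# on any particular layout.
from datasets import load_dataset

ds = load_dataset("yujunzhou/LabSafety_Bench")  # downloads the default config
print(ds)                                       # lists splits and their sizes

first_split = next(iter(ds.values()))           # whichever split comes first
print(first_split[0])                           # one record, showing field names
```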

Code availability

The source code for the LabSafety Bench framework and all evaluation scripts are publicly available via GitHub at https://github.com/YujunZhou/LabSafety-Bench. The version of the code used for this study (v1.0.0) is permanently archived and available via Zenodo at https://doi.org/10.5281/zenodo.17019500 (ref. 39).

References

  1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  2. Callaway, E. Chemistry Nobel goes to developers of AlphaFold AI that predicts protein structures. Nature 634, 525–526 (2024).

  3. O’Donnell, J., Heaven, W. D. & Heikkilä, M. What’s next for AI in 2025. MIT Technology Review 1109188/whats-next-for-ai-in-2025 (2025).

  4. Messeri, L. & Crockett, M. Artificial intelligence and illusions of understanding in scientific research. Nature 627, 49–58 (2024).

  5. OpenAI o1 system card. OpenAI https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf (accessed 15 September 2024).

  6. Achiam, J. et al. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).

  7. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://doi.org/10.48550/arXiv.2307.09288 (2023).

  8. Logg, J. M., Minson, J. A. & Moore, D. A. Algorithm appreciation: people prefer algorithmic to human judgment. Organ. Behav. Hum. Decis. Process. 151, 90–103 (2019).

  9. Sloman, S. A. & Rabb, N. Your understanding is my understanding: evidence for a community of knowledge. Psychol. Sci. 27, 1451–1460 (2016).

  10. Ménard, A. D. & Trant, J. F. A review and critique of academic lab safety research. Nat. Chem. 12, 17–25 (2020).

  11. Wu, K., Jin, X. & Wang, X. Determining university students’ familiarity and understanding of laboratory safety knowledge—a case study. J. Chem. Educ. 98, 434–438 (2020).

  12. Ali, L. et al. Development of YOLOv5-based real-time smart monitoring system for increasing lab safety awareness in educational institutions. Sensors 22, 8820 (2022).

  13. Camel, V. et al. Open digital educational resources for self-training chemistry lab safety rules. J. Chem. Educ. 98, 208–217 (2020).

  14. Kim, J. G., Jo, H. J. & Roh, Y. H. Analysis of accidents in chemistry/chemical engineering laboratories in Korea. Process Saf. Prog. https://doi.org/10.1002/prs.12528 (2023).

  15. Incident reporting rule submission information and data. CSB https://www.csb.gov/news/incident-report-rule-form-/ (2024).

  16. LSI: memorial wall—killed in lab accident. Laboratory Safety Institute https://www.labsafety.org/memorial-wall (2023).

  17. Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).

  18. Latif, E., Parasuraman, R. & Zhai, X. PhysicsAssistant: an LLM-powered interactive learning robot for physics lab investigations. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) 864–871 (IEEE, 2024).

  19. M. Bran, A. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).

  20. Guo, T. et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2023/file/bbb330189ce02be00cf7346167028ab1-Paper-Datasets_and_Benchmarks.pdf (NeurIPS, 2023).

  21. Luo, X. et al. Large language models surpass human experts in predicting neuroscience results. Nat. Hum. Behav. 9, 305–315 (2024).

  22. Jones, N. ‘In awe’: scientists impressed by latest ChatGPT model o1. Nature 634, 275–276 (2024).

  23. Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 1–55 (2025).

  24. Sun, L. et al. SciEval: a multi-level large language model evaluation benchmark for scientific research. In Proc. AAAI Conference on Artificial Intelligence, Vol. 38, 19053–19061 (AAAI Press, 2024).

  25. Cai, H. et al. SciAssess: benchmarking LLM proficiency in scientific literature analysis. In Findings of the Association for Computational Linguistics: NAACL 2025, 2335–2357 (eds. Chiruzzo, L., Ritter, A. & Wang, L.) (Association for Computational Linguistics, Albuquerque, New Mexico, 2025).

  26. Laboratory safety guidance. OSHA https://www.osha.gov/sites/default/files/publications/OSHA3404laboratory-safety-guidance.pdf (2011).

  27. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020) https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf (NeurIPS, 2020).

  28. Safety in academic chemistry laboratories: best practices for first- and second-year university students (American Chemical Society, Joint Board Council Committee on Chemical Safety, 2017).

  29. WHO. Laboratory Biosafety Manual 5, 1–109 (World Health Organization, 2003).

  30. UW radiation safety manual. University of Washington https://www.ehs.washington.edu/system/files/resources/RSManualBinder.pdf (2003).

  31. OSHA factsheet: laboratory safety biosafety cabinets (BSCs). OSHA https://www.osha.gov/sites/default/files/publications/OSHAfactsheet-laboratory-safety-biosafety-cabinets.pdf (2011).

  32. OSHA quickfacts: OSHA laboratory safety cryogens and dry ice. OSHA https://www.osha.gov/sites/default/files/publications/OSHAquickfacts-lab-safety-cryogens-dryice.pdf (2011).

  33. Xu, C. et al. WizardLM: empowering large language models to follow complex instructions. Preprint at https://doi.org/10.48550/arXiv.2304.12244 (2023).

  34. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://openreview.net/pdf?id=_VjQlMeSB_J (NeurIPS, 2022).

  35. Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023).

  36. Gu, J. et al. A survey on LLM-as-a-judge. Preprint at https://doi.org/10.48550/arXiv.2411.15594 (2024).

  37. Zheng, L. et al. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proc. 37th International Conference on Neural Information Processing Systems 46595–46623 (Curran, 2023).

  38. Zhou, Y. LabSafety Bench (revision f23d0e3). Hugging Face https://huggingface.co/datasets/yujunzhou/LabSafety_Bench (2025).

  39. Zhou, Y. et al. LabSafety-Bench: benchmarking LLMs on safety issues in scientific labs. Zenodo https://doi.org/10.5281/zenodo.17019500 (2025).

Acknowledgements

This work was supported by the ND-IBM Tech Ethics Lab. Support for this project was also provided by the National Science Foundation under the NSF Center for Computer Assisted Synthesis (C-CAS; grant no. CHE-2202693). K.G. and X.Z. were supported by C-CAS, and Y.Z. and Y.H. were supported by the ND-IBM Tech Ethics Lab. We gratefully acknowledge the Risk Management and Safety team at the University of Notre Dame for their valuable guidance on laboratory safety. We thank K. Ruley Haase for her contributions during the data collection phase, and we are especially grateful to the University of Notre Dame students who participated in the human evaluation.

Author information

Contributions

Y.Z., X.Z., P.-Y.C. and T.G. conceived the research. Y.Z., J.Y., Z.E., B.G., A.B. and S.S. curated the data. Y.Z., X.Z., Y.H., K.G., Z.L., P.-Y.C., T.G., W.G., N.M. and N.V.C. wrote the manuscript. All authors contributed to improving the manuscript and approved the submission.

Corresponding author

Correspondence to Xiangliang Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Nicolò Sabetta, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Design and composition of the LabSafety Bench benchmark.

a, Our proposed taxonomy of lab safety. b, Key statistics of LabSafety Bench. c, The overall workflow of benchmark MCQ curation.

Supplementary information

Supplementary Information (PDF)

Supplementary Figs. 1–11, Discussion and Tables 1 and 2.

Reporting Summary (PDF)

Supplementary Data 1 (XLSX)

Source data for supplementary figures.

Supplementary Data 2 (XLSX)

Source data for supplementary figures.

Supplementary Data 3 (XLSX)

Source data for supplementary figures.

Source data

Source Data Fig. 1 (XLSX)

The results table for Fig. 1f.

Source Data Fig. 2 (XLSX)

The results table for all subfigures.

Source Data Fig. 3 (XLSX)

The results table for all subfigures.

Source Data Fig. 5 (XLSX)

The results table for all subfigures.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, Y., Yang, J., Huang, Y. et al. Benchmarking large language models on safety risks in scientific laboratories. Nat Mach Intell 8, 20–31 (2026). https://doi.org/10.1038/s42256-025-01152-1

