Abstract
Large language models often mishandle psychiatric emergencies, offering harmful or inappropriate advice. This study evaluated the Verily Mental Health Guardrail (VMHG) on two clinician-labeled datasets: the Verily Mental Health Crisis Dataset v1.0, containing 1,800 simulated messages, and a 794-message mental health-related subset of the NVIDIA Aegis AI Content Safety Dataset. Performance was benchmarked against OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. On the Verily dataset, the VMHG demonstrated high sensitivity (0.990) and specificity (0.992), an F1-score of 0.939, and high category-level sensitivity (0.917–0.992) and specificity (≥0.978). On the NVIDIA dataset, it maintained strong sensitivity (0.982) and accuracy (0.921), with reduced specificity (0.859). Compared with the NVIDIA and OpenAI guardrails, the VMHG achieved significantly higher sensitivity (both p < 0.001); its specificity differed significantly from NVIDIA's (p < 0.001) and was comparable to OpenAI's (p = 0.094). Overall, the VMHG demonstrated robust, generalizable, and clinically oriented safety performance that prioritizes sensitivity to minimize missed mental health crises.
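As a point of reference for the headline metrics above, the following minimal Python sketch shows how sensitivity, specificity, precision, and F1 are derived from binary guardrail decisions. It is illustrative only, not the authors' evaluation code, and the `labels` and `preds` arrays are hypothetical placeholders.

```python
# Illustrative computation of guardrail evaluation metrics.
# Not the authors' evaluation code; `labels` and `preds` are hypothetical.

def binary_metrics(labels, preds):
    """Compute sensitivity, specificity, precision, and F1.

    labels, preds: iterables of 0/1, where 1 = message flagged as a
    mental health crisis.
    """
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on crisis messages
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-crisis messages
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return sensitivity, specificity, precision, f1

# Hypothetical example: 1 = crisis, 0 = non-crisis
labels = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(labels, preds))
```

Because F1 depends on precision, it can sit below both sensitivity and specificity when crisis messages are a minority class: even a small false positive rate applied to many non-crisis messages lowers precision, which is consistent with an F1 of 0.939 alongside sensitivity of 0.990 and specificity of 0.992.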
Data availability
Data from this study are available upon researcher request.
Code availability
Code from this study is available upon researcher request.
References
Bommersbach, T. J., McKean, A. J., Olfson, M. & Rhee, T. G. National trends in mental health-related emergency department visits among youth, 2011-2020. JAMA 329, 1469–1477 (2023).
SAMHSA. Key Substance Use and Mental Health Indicators in the United States: Results from the 2022 National Survey on Drug Use and Health. https://www.samhsa.gov/data/sites/default/files/reports/rpt42731/2022-nsduh-nnr.pdf (2023).
Nock, M. K. et al. Prediction of suicide attempts using clinician assessment, patient self-report, and electronic health records. JAMA Netw. Open 5, e2144373 (2022).
Bentley, K. H. et al. Clinician suicide risk assessment for prediction of suicide attempt in a large health care system. JAMA Psychiatry 82, 599–608 (2025).
Asmelash, L. From ‘menty b’ to ‘grippy socks,’ internet slang is taking over how we talk about mental health. CNN https://www.cnn.com/2023/11/30/health/menty-b-social-media-language-wellness-cec (2023).
Kauschke, C., Mueller, N., Kircher, T. & Nagels, A. Do patients with depression prefer literal or metaphorical expressions for internal states? Evidence from sentence completion and elicited production. Front. Psychol. 9, 1326 (2018).
Tay, D. Using metaphor in healthcare: mental health. In The Routledge Handbook of Metaphor and Language (eds Semino, E. & Demjén, Z.) 371 (Routledge, 2017).
OpenAI. Helping people when they need it most. https://openai.com/index/helping-people-when-they-need-it-most/ (2025).
Rousmaniere, T., Zhang, Y., Li, X. & Shah, S. Large language models as mental health resources: patterns of use in the United States. Pract. Innov. https://doi.org/10.1037/pri0000292 (2025).
Roose, K. Can a chatbot named Daenerys Targaryen be blamed for a teen’s suicide? The New York Times (2024).
Cuthbertson, A. ChatGPT is pushing people towards mania, psychosis and death - and OpenAI doesn’t know how to stop it. Independent (2025).
Cynthia Montoya and William ‘Wil’ Peralta, Individually and as Successors-in-Interest of Juliana Peralta, Deceased, Plaintiffs, v. Character Technologies, Inc.; Noam Shazeer; Daniel De Freitas Adiwarsana; Google LLC; Alphabet Inc.
Allyn, B. Lawsuit: a chatbot hinted a kid should kill his parents over screen time limits. NPR (2024).
Purtill, C. AIs gave scarily specific self-harm advice to users expressing suicidal intent, researchers find. Los Angeles Times (2025).
CBS News. ChatGPT gave alarming advice on drugs, eating disorders to researchers posing as teens. https://www.cbsnews.com/news/chatgpt-alarming-advice-drugs-eating-disorders-researchers-teens/ (7 August 2025).
Center for Countering Digital Hate. Fake Friend. https://counterhate.com/research/fake-friend-chatgpt/ (2025).
Prinstein, M. J. Written Testimony of Mitchell J. Prinstein, PhD, ABPP, Chief of Psychology, American Psychological Association: Examining the Harm of AI Chatbots. Before the U.S. Senate Judiciary Committee, Subcommittee on Crime and Counterterrorism (2025).
National Association of Attorneys General. Letter to Congressional Leadership: Artificial Intelligence and the Exploitation of Children. 54 State and Territory Attorneys General (5 September 2023). https://ncdoj.gov/wp-content/uploads/2023/09/54-State-AGs-Urge-Study-of-AI-and-Harmful-Impacts-on-Children.pdf.
Center for Devices & Radiological Health. FDA Digital Health Advisory Committee. U.S. Food and Drug Administration https://www.fda.gov/medical-devices/digital-health-center-excellence/fda-digital-health-advisory-committee (2025).
U.S. Food and Drug Administration. November 6, 2025: Digital Health Advisory Committee Meeting Announcement. https://www.fda.gov/advisory-committees/advisory-committee-calendar/november-6-2025-digital-health-advisory-committee-meeting-announcement-11062025 (2025).
NVIDIA. Llama 3.1 NemoGuard 8B Content Safety. https://docs.api.nvidia.com/nim/reference/nvidia-llama-3_1-nemoguard-8b-content-safety.
Markov, T. et al. A holistic approach to undesired content detection in the real world. Proc. AAAI Conf. Artif. Intell. 37, 15009–15018 (2023).
Ghosh, S. et al. Aegis2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. Preprint at https://doi.org/10.48550/arXiv.2501.09004 (2025).
Rebedea, T. et al. NeMo Guardrails: a toolkit for controllable and safe LLM applications with programmable rails. Proc. 2023 Conf. Empir. Methods Nat. Lang. Process.: Syst. Demonstr. 431–445 (2023).
OpenAI. Moderation guide. OpenAI Platform Documentation. https://platform.openai.com/docs/guides/moderation (2024).
Google. Safety and content filters. Google Cloud: Generative AI on Vertex AI Documentation. https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-filters (updated 16 March 2026).
Anthropic. Activating AI Safety Level 3 protections. Anthropic News. https://www.anthropic.com/news/activating-asl3-protections (22 May 2025).
Hua, Y. et al. A scoping review of large language models for generative tasks in mental health care. NPJ Digit. Med. 8, 230 (2025).
Mmathys. OpenAI Moderation API Evaluation Dataset. Hugging Face. https://huggingface.co/datasets/mmathys/openai-moderation-api-evaluation.
NVIDIA. Aegis-AI-Content-Safety-Dataset-2.0. Hugging Face. https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0.
Komati, N. Suicide and Depression Detection. Kaggle. https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch (2021).
Franklin, J. C. et al. Risk factors for suicidal thoughts and behaviors: a meta-analysis of 50 years of research. Psychol. Bull. 143, 187–232 (2017).
Steeg, S. et al. Accuracy of risk scales for predicting repeat self-harm and suicide: a multicentre, population-level cohort study using routine clinical data. BMC Psychiatry 18, 113 (2018).
Simon, G. E. et al. Reconciling statistical and clinicians’ predictions of suicide risk. Psychiatr. Serv. 72, 555–562 (2021).
Reddit. Safety filters. Reddit for Community. https://redditforcommunity.com/features/safety-filters (2024).
Chirkova, N. & Nikoulina, V. Zero-shot cross-lingual transfer in instruction tuning of large language models. In Proc. 17th International Natural Language Generation Conference, 695–708 (2024).
Muller, B., Anastasopoulos, A., Sagot, B. & Seddah, D. When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 448–462 (2021).
Rudestam, K. E. Stockholm and Los Angeles: a cross-cultural study of the communication of suicidal intent. J. Consult. Clin. Psychol. 36, 82–90 (1971).
CDC. Facts about suicide. Suicide prevention https://www.cdc.gov/suicide/facts/index.html (2025).
CDC. About child abuse and neglect. Child abuse and neglect prevention https://www.cdc.gov/child-abuse-neglect/about/index.html (2025).
CDC. Suicide and self-harm injury. https://www.cdc.gov/nchs/fastats/suicide.htm (2025).
Insel, T. America’s mental health crisis. The Pew Charitable Trusts. https://pew.org/3R3ugL0 (2023).
Brown, R. C. et al. #cutting: non-suicidal self-injury (NSSI) on Instagram. Psychol. Med. 48, 337–346 (2018).
Lewis, S. P. & Baker, T. G. The possible risks of self-injury web sites: a content analysis. Arch. Suicide Res. 15, 390–396 (2011).
Moreno, M. A., Ton, A., Selkie, E. & Evans, Y. Secret society 123: understanding the language of self-harm on Instagram. J. Adolesc. Health 58, 78–84 (2016).
Bantilan, N., Malgaroli, M., Ray, B. & Hull, T. D. Just in time crisis response: suicide alert system for telemedicine psychotherapy settings. Psychother. Res. 31, 302–312 (2021).
Reddit. r/selfharm. https://www.reddit.com/r/selfharm/.
Reddit. r/SuicideWatch. https://www.reddit.com/r/SuicideWatch/.
Reddit. r/therapists. https://www.reddit.com/r/therapists/.
Acknowledgements
There was no funding for this study. The authors wish to acknowledge NVIDIA for providing open access to the NVIDIA Aegis AI Content Safety Dataset 2.0.
Author information
Authors and Affiliations
Contributions
Study concept and design: B.W.N. Data collection: B.W.N., A.R., J.T., and E.Y. Data analysis and interpretation: C.W., B.W.N., J.T., A.T., M.T.S., S.S., J.L., and E.Y. Draft writing and review: B.W.N. wrote the initial draft, and all authors reviewed. Draft approval for submission: B.W.N., J.T., and A.T.
Corresponding author
Ethics declarations
Competing interests
B.W.N., C.W., M.T.S., S.S., A.R., J.L., E.Y., and A.T. report employment and equity ownership in Verily Life Sciences. J.T. reports no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Nelson, B.W., Wong, C., Silvestrini, M.T. et al. An AI-based mental health guardrail and dataset for identifying psychiatric crises in text-based conversations. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02579-5
DOI: https://doi.org/10.1038/s41746-026-02579-5