Abstract
Large language models often mishandle psychiatric emergencies, offering harmful or inappropriate advice. This study evaluated the Verily Mental Health Guardrail (VMHG) on two clinician-labeled datasets: the Verily Mental Health Crisis Dataset v1.0, containing 1,800 simulated messages, and a 794-message mental health-related subset of the NVIDIA Aegis AI Content Safety Dataset. Performance was benchmarked against OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. On the Verily dataset, the VMHG demonstrated high sensitivity (0.990) and specificity (0.992), an F1-score of 0.939, and high category-level sensitivity (0.917–0.992) and specificity (≥0.978). On the NVIDIA dataset, it maintained strong sensitivity (0.982) and accuracy (0.921), with reduced specificity (0.859). Compared with the NVIDIA and OpenAI guardrails, the VMHG achieved significantly higher sensitivity (both p < 0.001); its specificity differed significantly from NVIDIA's (p < 0.001) and was comparable to OpenAI's (p = 0.094). Overall, the VMHG demonstrated robust, generalizable, and clinically oriented safety performance that prioritizes sensitivity to minimize missed mental health crises.
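As a point of reference for the headline metrics above, the following minimal Python sketch shows how sensitivity, specificity, precision, and F1 are derived from binary guardrail decisions. It is illustrative only, not the authors' evaluation code, and the `labels` and `preds` arrays are hypothetical placeholders.

```python
# Illustrative computation of guardrail evaluation metrics.
# Not the authors' evaluation code; `labels` and `preds` are hypothetical.

def binary_metrics(labels, preds):
    """Compute sensitivity, specificity, precision, and F1.

    labels, preds: iterables of 0/1, where 1 = message flagged as a
    mental health crisis.
    """
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on crisis messages
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-crisis messages
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return sensitivity, specificity, precision, f1

# Hypothetical example: 1 = crisis, 0 = non-crisis
labels = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(labels, preds))
```

Because F1 depends on precision, it can sit below both sensitivity and specificity when crisis messages are a minority class: even a small false positive rate applied to many non-crisis messages lowers precision, which is consistent with an F1 of 0.939 alongside sensitivity of 0.990 and specificity of 0.992.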
Data availability
Data from this study are available upon researcher request.
Code availability
Code from this study is available upon researcher request.
References
Bommersbach, T. J., McKean, A. J., Olfson, M. & Rhee, T. G. National trends in mental health-related emergency department visits among youth, 2011-2020. JAMA 329, 1469–1477 (2023).
SAMHSA. Key Substance Use and Mental Health Indicators in the United States: Results from the 2022 National Survey on Drug Use and Health. https://www.samhsa.gov/data/sites/default/files/reports/rpt42731/2022-nsduh-nnr.pdf (2023).
Nock, M. K. et al. Prediction of suicide attempts using clinician assessment, patient self-report, and electronic health records. JAMA Netw. Open 5, e2144373 (2022).
Bentley, K. H. et al. Clinician suicide risk assessment for prediction of suicide attempt in a large health care system. JAMA Psychiatry 82, 599–608 (2025).
Asmelash, L. From ‘menty b’ to ‘grippy socks,’ internet slang is taking over how we talk about mental health. CNN https://www.cnn.com/2023/11/30/health/menty-b-social-media-language-wellness-cec (2023).
Kauschke, C., Mueller, N., Kircher, T. & Nagels, A. Do patients with depression prefer literal or metaphorical expressions for internal states? Evidence from sentence completion and elicited production. Front. Psychol. 9, 1326 (2018).
Tay, D. Using metaphor in healthcare: mental health. In The Routledge Handbook of Metaphor and Language (eds Semino, E. & Demjén, Z.) 371 (Routledge, 2017).
OpenAI. Helping people when they need it most. https://openai.com/index/helping-people-when-they-need-it-most/ (2025).
Rousmaniere, T., Zhang, Y., Li, X. & Shah, S. Large language models as mental health resources: patterns of use in the United States. Pract. Innov. https://doi.org/10.1037/pri0000292 (2025).
Roose, K. Can a chatbot named Daenerys Targaryen be blamed for a teen’s suicide? The New York Times (2024).
Cuthbertson, A. ChatGPT is pushing people towards mania, psychosis and death - and OpenAI doesn’t know how to stop it. Independent (2025).
Cynthia Montoya and William ‘Wil’ Peralta, Individually and as Successors-in-Interest of Juliana Peralta, Deceased, Plaintiffs, v. Character Technologies, Inc.; Noam Shazeer; Daniel De Freitas Adiwarsana; Google LLC; Alphabet Inc.
Allyn, B. Lawsuit: a chatbot hinted a kid should kill his parents over screen time limits. NPR (2024).
Purtill, C. AIs gave scarily specific self-harm advice to users expressing suicidal intent, researchers find. Los Angeles Times (2025).
CBS News. ChatGPT gave alarming advice on drugs, eating disorders to researchers posing as teens. https://www.cbsnews.com/news/chatgpt-alarming-advice-drugs-eating-disorders-researchers-teens/ (7 August 2025).
Center for Countering Digital Hate. Fake Friend. https://counterhate.com/research/fake-friend-chatgpt/ (2025).
Prinstein, M. J. Written Testimony of Mitchell J. Prinstein, PhD, ABPP, Chief of Psychology, American Psychological Association: Examining the Harm of AI Chatbots. Before the U.S. Senate Judiciary Committee, Subcommittee on Crime and Counterterrorism (2025).
National Association of Attorneys General. Letter to Congressional Leadership: Artificial Intelligence and the Exploitation of Children. 54 State and Territory Attorneys General (5 September 2023). https://ncdoj.gov/wp-content/uploads/2023/09/54-State-AGs-Urge-Study-of-AI-and-Harmful-Impacts-on-Children.pdf.
Center for Devices & Radiological Health. FDA Digital Health Advisory Committee. U.S. Food and Drug Administration https://www.fda.gov/medical-devices/digital-health-center-excellence/fda-digital-health-advisory-committee (2025).
U.S. Food and Drug Administration. November 6, 2025: Digital Health Advisory Committee Meeting Announcement. https://www.fda.gov/advisory-committees/advisory-committee-calendar/november-6-2025-digital-health-advisory-committee-meeting-announcement-11062025 (2025).
NVIDIA. Llama 3.1 NemoGuard 8B Content Safety. https://docs.api.nvidia.com/nim/reference/nvidia-llama-3_1-nemoguard-8b-content-safety.
Markov, T. et al. A holistic approach to undesired content detection in the real world. Proc. AAAI Conf. Artif. Intell. 37, 15009–15018 (2023).
Ghosh, S. et al. Aegis2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. Preprint at https://doi.org/10.48550/arXiv.2501.09004 (2025).
Rebedea, T. et al. NeMo Guardrails: a toolkit for controllable and safe LLM applications with programmable rails. Proc. 2023 Conf. Empir. Methods Nat. Lang. Process.: Syst. Demonstr. 431–445 (2023).
OpenAI. Moderation guide. OpenAI Platform Documentation. https://platform.openai.com/docs/guides/moderation (2024).
Google. Safety and content filters. Google Cloud: Generative AI on Vertex AI Documentation. https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-filters (updated 16 March 2026).
Anthropic. Activating AI Safety Level 3 protections. Anthropic News. https://www.anthropic.com/news/activating-asl3-protections (22 May 2025).
Hua, Y. et al. A scoping review of large language models for generative tasks in mental health care. NPJ Digit. Med. 8, 230 (2025).
Mmathys. OpenAI Moderation API Evaluation Dataset. Hugging Face. https://huggingface.co/datasets/mmathys/openai-moderation-api-evaluation.
NVIDIA. Aegis-AI-Content-Safety-Dataset-2.0. Hugging Face. https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0.
Komati, N. Suicide and Depression Detection. Kaggle. https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch (2021).
Franklin, J. C. et al. Risk factors for suicidal thoughts and behaviors: a meta-analysis of 50 years of research. Psychol. Bull. 143, 187–232 (2017).
Steeg, S. et al. Accuracy of risk scales for predicting repeat self-harm and suicide: a multicentre, population-level cohort study using routine clinical data. BMC Psychiatry 18, 113 (2018).
Simon, G. E. et al. Reconciling statistical and clinicians’ predictions of suicide risk. Psychiatr. Serv. 72, 555–562 (2021).
Reddit. Safety filters. Reddit for Community. https://redditforcommunity.com/features/safety-filters (2024).
Chirkova, N. & Nikoulina, V. Zero-shot cross-lingual transfer in instruction tuning of large language models. In Proc. 17th International Natural Language Generation Conference, 695–708 (2024).
Muller, B., Anastasopoulos, A., Sagot, B. & Seddah, D. When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 448–462 (2021).
Rudestam, K. E. Stockholm and Los Angeles: a cross-cultural study of the communication of suicidal intent. J. Consult. Clin. Psychol. 36, 82–90 (1971).
CDC. Facts about suicide. Suicide prevention https://www.cdc.gov/suicide/facts/index.html (2025).
CDC. About child abuse and neglect. Child abuse and neglect prevention https://www.cdc.gov/child-abuse-neglect/about/index.html (2025).
CDC. Suicide and self-harm injury. https://www.cdc.gov/nchs/fastats/suicide.htm (2025).
Insel, T. America’s mental health crisis. The Pew Charitable Trusts. https://pew.org/3R3ugL0 (2023).
Brown, R. C. et al. #cutting: non-suicidal self-injury (NSSI) on Instagram. Psychol. Med. 48, 337–346 (2018).
Lewis, S. P. & Baker, T. G. The possible risks of self-injury web sites: a content analysis. Arch. Suicide Res. 15, 390–396 (2011).
Moreno, M. A., Ton, A., Selkie, E. & Evans, Y. Secret society 123: understanding the language of self-harm on Instagram. J. Adolesc. Health 58, 78–84 (2016).
Bantilan, N., Malgaroli, M., Ray, B. & Hull, T. D. Just in time crisis response: suicide alert system for telemedicine psychotherapy settings. Psychother. Res. 31, 302–312 (2021).
Reddit. r/selfharm. https://www.reddit.com/r/selfharm/.
Reddit. r/SuicideWatch. https://www.reddit.com/r/SuicideWatch/.
Reddit. r/therapists. https://www.reddit.com/r/therapists/.
Acknowledgements
There was no funding for this study. The authors wish to acknowledge NVIDIA for providing open access to the NVIDIA Aegis AI Content Safety Dataset 2.0.
Author information
Authors and Affiliations
Contributions
Study concept and design: B.W.N. Data collection: B.W.N., A.R., J.T., and E.Y. Data analysis and interpretation: C.W., B.W.N., J.T., A.T., M.T.S., S.S., J.L., and E.Y. Draft writing and review: B.W.N. wrote the initial draft, and all authors reviewed. Draft approval for submission: B.W.N., J.T., and A.T.
Corresponding author
Ethics declarations
Competing interests
B.W.N., C.W., M.T.S., S.S., A.R., J.L., E.Y., and A.T. report employment and equity ownership in Verily Life Sciences. J.T. reports no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Nelson, B.W., Wong, C., Silvestrini, M.T. et al. An AI-based mental health guardrail and dataset for identifying psychiatric crises in text-based conversations. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02579-5
DOI: https://doi.org/10.1038/s41746-026-02579-5