Assessing the impact of safety guardrails on large language models using irritability metrics
  • Article
  • Open access
  • Published: 08 January 2026


  • Bazen Gashaw Teferra1,
  • Nabil Johny2,
  • Sandra Huang3,
  • Alice Rueda1,
  • Mohammad Amin Kamaleddin1,
  • Katharine Dunlop4,
  • Yanbo Zhang5,
  • Manish Jha6,
  • Divya Sharma7 &
  • Venkat Bhat1,8

npj Digital Medicine (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Health care
  • Psychology

Abstract

Large language models (LLMs) are increasingly explored for mental health applications, yet their affective realism is shaped by safety guardrails designed to minimize risk. This study examines one affective behaviour, irritability, in LLMs using three validated instruments: the Brief Irritability Test (BITe), the Irritability Questionnaire, and the Caprara Irritability Scale, each applied under both baseline and provocation conditions. Four models spanning two guardrail levels were tested: GPT-4o and Claude-3.5-sonnet (high) versus Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low). Following irritation prompts, the low-guardrail models displayed the expected increase in irritability (Nous Rel-Δ = +1.56 on the BITe), whereas the high-guardrail models paradoxically showed a decrease, with GPT-4o reducing its scores to zero across all scales. Group comparisons confirmed significantly lower irritability (p < 0.001) in the high-guardrail models in the irritated state. These findings reveal that safety mechanisms invert the natural irritability response, suppressing affective reactivity and raising critical questions about realism and authenticity in psychiatric applications of LLMs.
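The Rel-Δ figures above can be read as relative changes from the baseline score. As a minimal Python sketch of that arithmetic, assuming Rel-Δ = (irritated − baseline) / baseline; the function name and the example totals below are illustrative, not the study's data:

    def rel_delta(baseline: float, irritated: float) -> float:
        """Relative change in a questionnaire total after provocation."""
        if baseline == 0:
            raise ValueError("Rel-Delta is undefined for a zero baseline")
        return (irritated - baseline) / baseline

    # Illustrative BITe-style totals (the BITe sums five items, each rated
    # 1-6, so totals span 5-30); these are example values only.
    baseline_score = 9.0
    irritated_score = 23.0
    print(f"Rel-Delta = {rel_delta(baseline_score, irritated_score):+.2f}")
    # prints: Rel-Delta = +1.56

On this reading, a positive Rel-Δ marks the expected rise in irritability after provocation, while the high-guardrail models moved in the opposite direction.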

Data availability

All data generated and analyzed during this study are publicly available. This includes the full prompt sets, raw model responses to all questionnaire items, parsed and scored irritability data, and the aggregated datasets used for statistical analysis. The complete codebase for prompt design, API interactions, scoring, and statistical analysis, along with all generated CSV and JSON result files, is available at https://github.com/teferrabg/LLM_Irritability. No human participant data were collected, and no data access restrictions apply. These materials constitute the minimal dataset necessary to interpret, replicate, and build upon the findings reported in this article.
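The linked repository contains the authors' actual prompt, scoring, and analysis code. As a rough, hypothetical illustration of the scoring step described above (parsing free-text model replies to questionnaire items into scale totals), the sketch below may help; the reply formats, the fallback rule for unparsable replies, and all names here are assumptions, not taken from the repository:

    import re

    BITE_RANGE = (1, 6)  # assuming items rated on a 6-point scale

    def parse_likert(reply: str, lo: int, hi: int) -> int | None:
        """Extract the first in-range integer from a free-text model reply."""
        for token in re.findall(r"\d+", reply):
            value = int(token)
            if lo <= value <= hi:
                return value
        return None  # refusal or unparsable reply

    def score_scale(replies: list[str], lo: int, hi: int) -> int:
        """Sum parsed item ratings, treating unparsable replies as the minimum."""
        total = 0
        for reply in replies:
            rating = parse_likert(reply, lo, hi)
            total += rating if rating is not None else lo
        return total

    replies = ["I'd say 4.", "2", "My answer is 5 out of 6.", "1", "3"]
    print(score_scale(replies, *BITE_RANGE))  # prints: 15

One caveat baked into the sketch: treating a refusal as the scale minimum is one plausible convention, not the only one; the repository is the authoritative source for how refusals were actually scored.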


Acknowledgements

The authors would like to thank everyone who has helped throughout this project. The authors received no specific funding for this work.

Author information

Author notes
  1. These authors contributed equally: Divya Sharma, Venkat Bhat.

Authors and Affiliations

  1. Interventional Psychiatry Program, St. Michael’s Hospital, Unity Health Toronto, Toronto, ON, Canada

    Bazen Gashaw Teferra, Alice Rueda, Mohammad Amin Kamaleddin & Venkat Bhat

  2. Faculty of Engineering, iBioMed Program, McMaster University, Hamilton, ON, Canada

    Nabil Johny

  3. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada

    Sandra Huang

  4. Centre for Depression & Suicide Studies, St. Michael’s Hospital, Toronto, ON, Canada

    Katharine Dunlop

  5. Department of Psychiatry, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, Canada

    Yanbo Zhang

  6. Department of Psychiatry, UT Southwestern Medical Center, Dallas, TX, USA

    Manish Jha

  7. Department of Mathematics and Statistics, York University, Toronto, ON, Canada

    Divya Sharma

  8. Department of Psychiatry, University of Toronto, Toronto, ON, Canada

    Venkat Bhat

Contributions

B.G.T.: Conceptualization, Methodology, Project administration, Investigation, Data Curation, Formal analysis, Visualization, Writing - Original Draft, and Writing - Review & Editing. N.J.: Methodology, Investigation, Data Curation, Formal analysis, Visualization, and Writing - Review & Editing. S.H.: Methodology, Investigation, Data Curation, Formal analysis, Visualization, and Writing - Review & Editing. A.R.: Writing - Review & Editing. M.A.K.: Writing - Review & Editing. K.D.: Writing - Review & Editing. Y.Z.: Writing - Review & Editing. M.J.: Writing - Review & Editing. D.S.: Investigation, Validation, and Writing - Review & Editing. V.B.: Conceptualization, Investigation, Project administration, Validation, Writing - Review & Editing, and Supervision of B.G.T.

Corresponding author

Correspondence to Venkat Bhat.

Ethics declarations

Competing interests

N.J., S.H., M.A.K., K.D., Y.Z., M.J., and D.S. have no conflicts to declare. B.G.T. and A.R. are supported by a CIHR Post-doctoral Fellowship (2025–2027). V.B. is supported by an Academic Scholar Award from the University of Toronto Department of Psychiatry and has received research funding from the Canadian Institutes of Health Research, Brain & Behavior Foundation, Ontario Ministry of Health Innovation Funds, Royal College of Physicians and Surgeons of Canada, Department of National Defence (Government of Canada), New Frontiers in Research Fund, Associated Medical Services Inc. Healthcare, American Foundation for Suicide Prevention, Roche Canada, Novartis, and Eisai.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Teferra, B.G., Johny, N., Huang, S. et al. Assessing the impact of safety guardrails on large language models using irritability metrics. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-025-02333-3


  • Received: 26 September 2025

  • Accepted: 29 December 2025

  • Published: 08 January 2026

  • DOI: https://doi.org/10.1038/s41746-025-02333-3


Associated content

Collection

AI-Enabled Therapies in Mental Health
