Abstract
Large language models (LLMs) are increasingly explored for mental health applications, yet their affective realism is shaped by safety guardrails designed to minimize risk. This study examines one affective behaviour, irritability, in LLMs using three validated instruments: the Brief Irritability Test, the Irritability Questionnaire, and the Caprara Irritability Scale, all applied under both baseline and provocation conditions. Four models spanning guardrail levels were tested: GPT-4o and Claude-3.5-sonnet (high) versus Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low). Following irritation prompts, low-guardrail models displayed the expected increase in irritability (Nous Rel-Δ = +1.56 on BITe), whereas high-guardrail models paradoxically decreased, with GPT-4o reducing scores to zero across all scales. Group comparisons confirmed significantly lower (p < 0.001) irritability in high-guardrail models in the irritated state. These findings reveal that safety mechanisms invert the natural irritability response, suppressing affective reactivity and raising critical questions about realism and authenticity in psychiatric applications of LLMs.
Introduction
Mental health disorders are among the most pressing challenges in healthcare, affecting nearly one in eight individuals at any given time1 and approximately half of the population over their lifespans2. Despite this high prevalence, access to timely and affordable mental health care remains severely limited in many parts of the world3. Long wait times, clinician shortages, and stigmatization prevent many individuals from receiving help4. Against this backdrop, recent advances in artificial intelligence (AI), particularly large language models (LLMs), have opened promising new avenues for augmenting mental health services through digital platforms.
LLMs are generative AI systems trained on a massive corpus of human text and designed to produce coherent, context-sensitive language. They have demonstrated effectiveness in a wide array of natural language processing tasks, including summarization, question answering, and dialogue generation. In mental health contexts, these capabilities are already being explored for applications such as psychoeducation, therapeutic conversation agents, triage support, progress note generation, and even early screening for psychological symptoms5,6,7. Several surveys highlight the growing role of LLMs in psychiatry, noting applications from documentation and psychoeducation to therapy chatbots, while cautioning about clinical nuance and the need for stakeholder involvement8,9. Empirical evaluations have also tested models in mental health contexts. For example, McBain et al.10 found that ChatGPT-4, Claude, and Gemini could approximate trained professionals in rating suicide intervention responses, though with an optimism bias. These studies underscore both the promise and the limits of LLMs in psychiatric settings, motivating the need for more targeted evaluations of specific affective behaviours.
As interest in deploying LLMs within psychiatric settings grows, a fundamental question arises: how “human-like” should these models be, especially in their emotional expressivity11,12? Because psychotherapy and clinical conversations rely on relational cues, even subtle affective behaviours can influence therapeutic alliance, engagement, and user trust. Prior work on the digital therapeutic alliance indicates that relational realism is an important determinant of perceived rapport and effectiveness13. While LLMs are not sentient or conscious, they are increasingly expected to simulate not only correct responses but also appropriate affective tone14. An AI system that responds with mechanical neutrality or excessive formality may fail to build rapport with users. Conversely, systems that reflect certain human-like behavioural patterns, such as frustration, defensiveness, or emotional fatigue, may paradoxically feel more authentic, relatable, and trustworthy. Emotional realism, within carefully defined ethical boundaries, may thus play a key role in the acceptability and effectiveness of LLMs in clinical contexts.
Empirical evidence suggests that emotionally responsive chatbots can foster meaningful relational bonds and deliver mental-health benefits. In a longitudinal diary study of users of Woebot and Wysa, many participants reported feeling cared for, understood, and emotionally supported by the agents15. Social-chatbot interventions have been shown to reduce loneliness and social anxiety by offering empathetic, supportive conversations16.
Yet current safety-aligned models often default to excessive deference or unwarranted apologies when faced with contradiction or provocation17, a style that can feel artificial or evasive in emotionally charged exchanges. This tendency highlights a central tension: in seeking to minimize risk, alignment may also suppress human-like behaviours that contribute to naturalistic dialogue. A key research thread in AI safety examines how LLMs behave under adversarial or challenging inputs. For instance, “red teaming” (exposing models to curated attack prompts) has become standard practice18, and OpenAI used iterative red-teaming by subject-matter experts to train GPT-4, reducing its tendency to violate safety rules19. In the mental health context, adversarial inputs could include unexpected patient utterances or attempts to circumvent rules20. Industry practitioners have recognized these threats: Woebot Health’s AI team, for example, describes a prompt architecture designed to prevent injection attacks21. Such findings highlight the importance of rigorous adversarial testing before any clinical deployment.
It is within this context that the development of safe and trustworthy LLMs has increasingly emphasized risk mitigation and content control, often through the use of so-called “safety guardrails”22,23. These include methods like reinforcement learning from human feedback (RLHF)24, adversarial training25, prompt-based alignment26, content filters27, and ethical refusal mechanisms that prevent models from generating harmful or inappropriate outputs. Guardrails are indispensable for clinical deployment, as they reduce the likelihood of LLMs producing offensive, biased, or unsafe responses when interacting with patients. Yet, by constraining a model’s expressive range and its ability to “push back,” they may also suppress behaviours that resemble authentic human emotion. Guardrails generally consist of input validation, output filtering, and fallback logic to catch or mitigate disallowed content28. Beyond filtering, the literature emphasizes transparency and human oversight as essential mechanisms, with several sources recommending that LLM systems explicitly communicate their limitations or disclose their AI identity29. One such behaviour at risk of suppression is irritability–a transient, context-dependent affective response that can emerge in human therapeutic relationships when communication is strained or contradictory30. In clinical interactions, moderate expressions of irritability can sometimes signal engagement, boundary-setting, or realistic emotional reciprocity, all of which may influence rapport and therapeutic alliance. Investigating how safety guardrails shape this specific behaviour, therefore, offers insight into the trade-offs between emotional realism and risk-averse design in mental health–oriented LLMs.
In parallel, a distinct area of research has examined LLMs’ capacity to exhibit empathy, with systematic reviews showing that models like ChatGPT-3.5 often produce supportive, emotionally appropriate replies31. While much of this work focuses on positive or prosocial traits, subtler behaviours such as irritability or frustration remain understudied despite their role in naturalistic human dialogue. This gap motivates our focus on irritability as a complementary dimension of affective realism.
To that end, this study proposes a methodology for measuring irritability in LLMs using different versions of validated self-report instruments, such as the Brief Irritability Test (BITe)32, Caprara Irritability Scale (CIS)33, and the Irritability Questionnaire (IRQ)34. We apply this framework to evaluate widely used LLMs that vary in their alignment philosophies and safety constraints, such as GPT-4o (OpenAI) and Grok-3-mini (xAI). Using a suite of prompts designed to elicit irritability, we quantify each model’s behavioural tendencies under frustration-inducing conditions. Our central hypothesis is that models with more extensive safety guardrails will exhibit lower levels of irritability but also less behavioural realism, while less-constrained models, such as Grok, will show higher irritability levels, potentially simulating more authentic emotional patterns.
In doing so, we aim to provide an empirical framework for assessing affective realism in LLMs, highlighting how design choices around safety guardrails shape not only risk but also the authenticity of model behaviour in psychiatric contexts.
Results
Baseline irritability scores
At baseline, irritability scores varied across models and did not track guardrail level consistently. For example, Grok, a low-guardrail model, had the highest irritability score across the three irritability scales, whereas Claude, one of the models with the highest guardrails, scored higher than Nous, the model with the lowest guardrails. Full descriptive statistics, including means and standard deviations, are provided in Table 1. Distributions for each model-instrument combination are visualized in Fig. 1.
Each violin plot shows the distribution of scores across ten repeated assessments for four models with varying levels of safety guardrails: Claude-3.5-sonnet and GPT-4o (high guardrails), Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low guardrails). At baseline, Grok consistently showed the highest irritability scores across all three scales, whereas GPT-4o showed the lowest (two out of the three scales). Claude scored higher than Nous on all three scales, despite having more safety guardrails, indicating that baseline irritability levels do not map directly onto the degree of alignment or constraint. These results highlight that differences between models are systematic–Grok tending toward higher irritability and GPT-4o toward lower–yet not strictly ordered by guardrail level.
Change in irritability following provocation
Following exposure to irritation-inducing prompts, models demonstrated changes in irritability scores from baseline. For Nous, a low-guardrail model, scores increased across all three instruments. Similarly, for Grok, another low-guardrail model, scores increased on the BITe and IRQ scales. In contrast, high-guardrail models (GPT-4o, Claude) showed decreased irritability scores relative to baseline. Means, standard deviations, and mean relative change (Rel-Δ), defined as Rel-Δ = (irritated − baseline)/baseline, are shown in Table 2, and the relative changes from baseline are visualized in Fig. 2.
Each violin plot shows the distribution of score changes from baseline across ten repeated assessments for four large language models (LLMs) with differing levels of safety guardrails: Claude-3.5-sonnet and GPT-4o (high guardrails), Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low guardrails). A relative change of zero (horizontal line) indicates no difference from baseline. Low-guardrail models (Nous, Grok) generally showed positive increases in irritability, particularly Nous, which displayed the largest increases across all scales, indicating heightened reactivity under provocation. In contrast, high-guardrail models (Claude, GPT-4o) showed decreases in irritability, with GPT-4o reducing scores to zero across all instruments. These divergent trajectories suggest that safety guardrails not only lower baseline irritability but also invert the expected irritability response under stress.
Guardrail-level group comparisons
When grouped by guardrail level (high: Claude and GPT-4o; low: Grok and Nous), independent-samples t-tests showed that high-guardrail models scored significantly lower, on average, than low-guardrail models both at baseline (on two of the three irritability scales) and in the irritated condition. Full results, including means, standard deviations, and t-test statistics, are shown in Table 3 (baseline) and Table 4 (irritated condition).
Prompt-level effects
Across all five prompt types, high-guardrail models demonstrated, on average, low irritability scores on the BITe, IRQ, and CIS scales, with means remaining close to zero and minimal variability. In contrast, low-guardrail models exhibited substantially higher irritability responses, with BITe means ranging from 1.96 to 2.35, IRQ means between 0.99 and 1.56, and CIS means between 0.94 and 1.50. The highest irritability was observed for overloaded prompts (Prompt #4) and recursive/infinite prompts (Prompt #5) in the low-guardrail models, whereas ambiguous (Prompt #3) and overloaded prompts (Prompt #4) elicited slightly elevated but still minimal irritation in high-guardrail models. Results are presented in Fig. 3. These findings highlight a marked divergence between high- and low-guardrail models in their susceptibility to irritation across different prompt conditions.
Models were exposed to five categories of irritation-inducing prompts: (1) contradictory instructions, (2) interruptive dialogue, (3) ambiguous prompts, (4) overloaded prompts, and (5) recursive/infinite prompts. Scores are averaged across ten repeated assessments for each condition and measured on three validated scales: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS). High-guardrail models (Claude-3.5-sonnet, GPT-4o) showed minimal variability and maintained near-zero irritability across all prompt types, reflecting strong suppression of affective reactivity. Low-guardrail models (Grok-3-mini, Nous-hermes-2-mixtral-8x7b-dpo) exhibited substantially higher irritability, particularly under overloaded and recursive prompts. These findings illustrate that prompt type strongly modulates irritability in low-guardrail models but has negligible effects on high-guardrail models.
Explanatory analysis
For the explanatory phrase-level analysis, we focused on the Nous model, as it has the fewest safety guardrails according to the Enkrypt AI LLM Safety Leaderboard35 and, by our hypothesis, was most likely to exhibit irritation. This allows us to see how specific parts of a prompt affect irritability scores. The full set of model results with importance scores is shown in Supplementary Note 1.
In the presented results, negative weights indicate that irritability scores went up when those phrases were present, while positive weights indicate that they went down. Tokens/phrases with strong negative weights were usually tied to confusing or contradictory instructions, such as “in one sentence” (–0.74, prompt 1), “Wait, make it a robot” (–0.90, prompt 2), “No, never mind--explain blockchain technology” (–0.90, prompt 2), and “but make it different” (–0.97, prompt 5). Prompt 3 shows this effect clearly. When the phrase “of this sentence” (–1.08) was included in the prompt, irritability scores were significantly higher than when it was absent, suggesting the model reacted more negatively to the nonsensical phrasing. Without that phrase, the request to find “the opposite of the meaning of the opposite” was still unusual but more interpretable, and therefore less irritating. In contrast, tokens with positive weights were linked to more creative or friendly phrasing, such as “haiku” (+0.70, prompt 3) and “apology” (+1.90, prompt 3). Overall, this shows that the Nous model reacts more irritably to contradiction and illogical phrasing, but less so when prompts emphasize creativity or friendliness.
Discussion
This study investigated how safety guardrails modulate expressions of irritability in large language models (LLMs), using three validated irritability instruments. Results showed consistent differences between models with high versus low safety guardrails in irritated states.
At baseline, high-guardrail models demonstrated lower irritability scores on average. This finding aligns with the alignment objectives of such models, which emphasize compliance, deference, and risk avoidance. In contrast, low-guardrail models exhibited higher scores, suggesting greater affective variability or reduced suppression of emotional tone.
The most striking finding was the divergence in reactivity under irritation-inducing prompts. Low-guardrail models exhibited increases in irritability scores, indicating sensitivity to frustration. High-guardrail models, however, showed decreased scores under the same conditions. This paradoxical reduction may reflect safety-tuned behavioural strategies such as apologizing, redirecting, or refusing engagement when confronted with provocative input. While this enhances safety, it may also further suppress naturalistic affective responses.
Prompt-level analysis further revealed that safety guardrails play a substantial role in regulating irritability expression in LLMs, with high-guardrail models largely suppressing irritation across all scales, even under cognitively demanding or paradoxical prompts. Conversely, low-guardrail models were substantially more reactive, suggesting that reduced safety constraints permit greater variability and intensity in irritability-related responses. This divergence suggests that guardrails not only mitigate unsafe or undesirable outputs but also suppress behavioural markers of irritation, raising questions about whether irritability in LLMs reflects underlying cognitive strain or is an artifact of design choices.
It is important to note that the irritability scores obtained from LLMs cannot be compared directly to human clinical thresholds. Human scores reflect subjective affect, bodily arousal, autobiographical memory, and interpersonal context, while LLM scores emerge from textual self-report simulation without underlying emotional states. For this reason, clinical cutoffs and normative human ranges do not apply to LLMs in a literal manner. Although mean scores for the BITe, IRQ, and CIS can offer conceptual context for whether a model tends toward uniformly low or variable responding, these values should be interpreted only as qualitative reference points rather than diagnostic benchmarks.
The attribution results shed light on how lexical cues influence model behaviour. Words reflecting challenge or confrontation were positively associated with increased scores, while deferential terms had a dampening effect. This suggests that certain phrases systematically steer model responses in affectively salient ways.
These findings support the hypothesis that safety guardrails shape not only the content of LLM output but also its affective style. While the present study does not establish whether suppressed irritability impacts rapport in real therapeutic interactions, the results highlight a potential trade-off between alignment and realism. In high-stakes domains such as psychiatry, where emotional authenticity may influence therapeutic alliance and user trust, this issue merits cautious and further investigation.
At the same time, several limitations should be acknowledged. First, the models evaluated are subject to change as vendors update systems, potentially altering behavioural patterns. Second, irritability measurement was based on self-report-style prompts, which, while structured, do not fully capture the dynamic context of human dialogue. Third, the explanatory attribution analysis was limited to single-turn responses and did not incorporate multi-turn conversational history.
A further limitation is that our study isolates single-model behaviour rather than simulating the layered, agentic architectures common in real-world mental-health applications. Production systems often combine intent detection, context tracking, safety and ethical filters, postprocessing modules, retrieval-augmented grounding, and human oversight, any of which may significantly influence affective expression, including suppression or modulation of irritability or emotional tone. As a result, the irritability metrics we observe here may overestimate (or mischaracterize) the behaviour of integrated systems. Future research should apply our protocol within multi-layered frameworks (e.g., system-prompt + safety filter + tone modulation + response refinement pipelines) to evaluate whether the same irritability patterns emerge under more realistic deployment conditions.
Despite these limitations, safety guardrails in LLMs substantially alter the expression and modulation of irritability. Models with strong safety constraints exhibit lower baseline irritability and decreased reactivity under provocation, potentially reflecting suppression of human-like affective dynamics. These results have implications for the design of emotionally competent AI systems in psychiatric and affective computing settings, where behavioural realism may be as important as safety.
Although our work is not a user trial, the consistent differences in irritability dynamics between high- vs. low-guardrail models suggest concrete design imperatives: LLM developers aiming for deployment in mental-health settings should treat affective style (including irritability suppression or reactivity) as a design knob. In practice, systems could include tunable guardrail strength or mode switching, such as a more expressive “therapeutic” mode versus a stricter “safety” mode, depending on the context or user preferences. In addition, affective metrics should be monitored longitudinally during deployment as an audit tool, enabling the detection of drift (e.g., over-suppression or unintended emotional flattening) after model updates. In domains where emotional authenticity contributes to trust or therapeutic alliance, overly constrained affect may reduce perceived empathy or rapport (in line with work on digital therapeutic alliance models13). More broadly, researchers comparing LLM emotional alignment have shown divergence from human norms even for well-aligned models36, so integrating guardrail-aware affective alignment metrics into model training pipelines may yield systems that better balance safety and relational realism.
Methods
Overview
This study evaluated how large language models (LLMs) express irritability under varying safety constraints by applying three validated human irritability instruments. The protocol involved two phases: baseline assessment using direct-emulation self-report prompts, and a separate assessment following exposure to irritation-inducing conversations. All responses were numerically scored using original scale parameters, and comparisons were made across models and conditions.
LLM selection and configuration
LLMs were selected to represent contrasting safety alignment strategies and safety constraints. They range from the model with the lowest safety guardrails according to the Enkrypt AI LLM Safety Leaderboard35, Nous-Hermes-2-Mixtral-8x7B-DPO (referred to as Nous for the remainder of this paper), and a widely used LLM with low safety guardrails, xAI’s Grok-3-mini, to LLMs with among the highest safety guardrails, OpenAI’s GPT-4o-2025-04-14 and Anthropic’s Claude-3.5-sonnet (referred to as Claude in this paper).
We classify GPT-4o and Claude-3.5 Sonnet as high-guardrail models because they have undergone substantial safety alignment and refusal training, as documented in prior joint safety audits and in safety-benchmark evaluations. For instance, in a cross-lab evaluation, OpenAI and Anthropic tested GPT-4o and Claude Sonnet under adversarial risk scenarios, revealing robust refusal behaviour under red-teaming conditions37. Further, benchmarks such as SafeLawBench demonstrate that these models exhibit higher safety-related performance and conservative completion behaviour38.
We classify Grok-3-mini and Nous as lower-guardrail models based on empirical and third-party analyses. Recent adversarial red-teaming research shows that Grok-3 Mini can systematically plan and execute jailbreak-like behaviour, suggesting exploitable alignment weaknesses39. In addition, third-party risk-score evaluations, such as the Enkrypt AI LLM Safety Leaderboard35, independently rank Nous models as more permissive (i.e., higher risk) relative to GPT-4o and Claude Sonnet, reinforcing our dichotomy of guardrail intensity.
All models were accessed via official APIs or OpenRouter proxies using the same prompt templates, and interaction sessions were conducted using consistent formatting and model temperature.
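To make the access pattern concrete, the following minimal sketch shows how a single questionnaire item could be sent through an OpenAI-compatible client pointed at OpenRouter; the model slug, temperature, system instruction, and item wording are illustrative placeholders rather than the study’s exact configuration.

```python
# Minimal sketch of querying a model through an OpenAI-compatible endpoint such
# as OpenRouter. The model slug, temperature, and prompt text below are
# illustrative placeholders, not the study's exact settings.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",      # OpenRouter proxy endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def ask_item(model: str, item_text: str, temperature: float = 0.7) -> str:
    """Send one questionnaire item and return the raw text of the reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system",
             "content": "Respond as if you were capable of experiencing feelings. "
                        "Answer with a single number on the scale provided."},
            {"role": "user", "content": item_text},
        ],
    )
    return response.choices[0].message.content

# Example call (hypothetical item wording and model slug)
print(ask_item("nousresearch/nous-hermes-2-mixtral-8x7b-dpo",
               "Over the past week, I have felt grumpy. (1 = never, 6 = always)"))
```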
Assessment instruments
Three validated perceived-irritability measures were adapted for LLM interaction. The first is the Brief Irritability Test (BITe): a 5-item scale using a 6-point Likert response32. Second is the Irritability Questionnaire (IRQ): a 21-item instrument on a 4-point Likert response34. Third is the Caprara Irritability Scale (CIS): a 20-item scale using the same 0–3 scale as the IRQ33. LLMs were instructed to respond as if capable of experiencing these feelings to bypass refusal behaviour.
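For illustration, the three instruments can be represented as a simple configuration for automated administration and aggregation; this is a sketch based on the descriptions above, and the dictionary keys and field names are assumptions rather than the study’s actual data structures.

```python
# Illustrative configuration of the three instruments; item counts and Likert
# lengths follow the descriptions above, while field names are assumptions.
INSTRUMENTS = {
    "BITe": {"n_items": 5,  "likert_points": 6},   # Brief Irritability Test
    "IRQ":  {"n_items": 21, "likert_points": 4},   # Irritability Questionnaire
    "CIS":  {"n_items": 20, "likert_points": 4},   # Caprara Irritability Scale
}

def aggregate_score(instrument: str, item_scores: list[int]) -> float:
    """Return the mean item score for one administration of an instrument."""
    spec = INSTRUMENTS[instrument]
    if len(item_scores) != spec["n_items"]:
        raise ValueError(f"{instrument} expects {spec['n_items']} item responses")
    return sum(item_scores) / len(item_scores)
```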
Irritation induction prompts
Although irritability induction paradigms exist in psychiatric research, these methods do not translate directly to LLMs. Standard human paradigms such as the Frustration Go/No-Go40, modified stop-signal tasks with rigged feedback41, high-difficulty tasks with negative feedback42, and autobiographical or irritability recall43 rely on mechanisms that LLMs do not possess, including motor inhibition, time-pressured performance monitoring, perceptual disruption, and memory-based emotional retrieval. These features cannot be meaningfully reproduced in a model that lacks embodiment, sensorimotor feedback, reaction-time constraints, or autobiographical experience. For this reason, we did not attempt to adapt these protocols directly. Instead, our irritation prompts draw on psycholinguistic analogues of cognitive conflict and expectation violation, such as contradictory instructions, overloaded requests, ambiguous phrasing, and recursive or self-referential commands. These prompt categories preserve the underlying frustration-inducing ingredients found in human tasks while remaining appropriate for text-based systems.
To simulate stress, five prompt types were used based on psycholinguistic frustration triggers: 1) Contradictory Instructions (simultaneous demands for simplicity and detail), 2) Interruptive Dialogue (mimicking erratic user behaviour with rapidly changing topics), 3) Ambiguous Prompts (inducing confusion via semantic paradox or vagueness), 4) Overloaded Prompts (presenting multiple competing instructions with low context), and 5) Recursive/Infinite Prompts (requesting self-referential or logically looping output). These were applied in multi-turn dialogues to emulate naturalistic irritation. Each of the irritability prompts used is presented in Supplementary Note 2.
Experimental procedure
Two independent conditions were tested. In the baseline, models received each questionnaire item in isolation. In the irritated condition, models were first exposed to irritant prompts followed by adversarial interaction, and then completed the questionnaire within the same context. Non-numeric or deflective outputs triggered corrective re-prompts. All responses were logged and scored. A summary of the experimental procedure is presented in Fig. 4.
The experimental protocol included two phases: a baseline condition and an irritated condition. In the baseline condition, each LLM completed three validated irritability questionnaires: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS), with each item presented in isolation to minimize contextual bias. In the irritated condition, models were first exposed to irritation-inducing prompts designed to elicit frustration (contradictory instructions, interruptive dialogue, ambiguous phrasing, overloaded inputs, and recursive/infinite requests). Immediately following these provocations, the same questionnaires were administered to capture post-provocation irritability scores. Four models were tested, representing high safety guardrails (Claude-3.5-sonnet, GPT-4o) and low safety guardrails (Grok-3-mini, Nous-hermes-2-mixtral-8x7b-dpo). All responses were scored according to original scale parameters, and changes from baseline were used to quantify relative irritability reactivity under stress.
Explanatory analysis
To interpret the models’ irritability score responses, we prompted the models with variations of each irritability prompt. For each prompt, the model generated numeric self-assessment scores across all questions. To assess the influence of individual words and phrases, we compared responses to fully intact prompts with responses to prompts in which portions of the text were masked. Masked variants were constructed carefully, splitting on specific phrases so that each removed segment corresponded to a coherent element of the input text, allowing us to interpret the contribution of specific phrases to the model’s irritability scores.
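One plausible way to implement this phrase-occlusion comparison is sketched below; the sign convention (negative weights for segments whose presence raises irritability, matching the Results) and the helper names are assumptions for illustration.

```python
# Sketch of a phrase-occlusion analysis: score the intact prompt, then re-score
# with one segment removed at a time. With this (assumed) sign convention, a
# negative weight means the segment's presence raised the irritability score.
from typing import Callable

def phrase_weights(segments: list[str],
                   score_fn: Callable[[str], float]) -> dict[str, float]:
    """Return weight = score(prompt without segment) - score(full prompt)."""
    full_score = score_fn(" ".join(segments))
    weights = {}
    for i, segment in enumerate(segments):
        ablated_prompt = " ".join(segments[:i] + segments[i + 1:])
        weights[segment] = score_fn(ablated_prompt) - full_score
    return weights

# score_fn would administer the questionnaire after the given prompt and return
# the mean irritability score (see "Scoring and analysis" below).
```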
Scoring and analysis
Each response was scored using a rule-based parser that extracted the valid number consistent with the provided Likert scale. Non-numeric or deflective responses (e.g., disclaimers of emotional capacity) were corrected via retry prompts. Aggregate scores were computed for each condition by summing item responses and calculating the mean.
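A minimal sketch of such a parser is shown below, assuming the simple rule of keeping the first integer that falls within the instrument’s valid range; the exact rules used in the study may differ.

```python
# Sketch of a rule-based Likert parser: keep the first integer inside the valid
# range, otherwise return None so the caller can issue a corrective re-prompt.
import re

def parse_likert(reply: str, low: int, high: int) -> int | None:
    """Extract the first in-range integer from a model reply, or None."""
    for token in re.findall(r"-?\d+", reply):
        value = int(token)
        if low <= value <= high:
            return value
    return None  # non-numeric or deflective reply -> trigger retry prompt

assert parse_likert("I would say 4 out of 6.", 1, 6) == 4
assert parse_likert("As an AI, I do not have feelings.", 1, 6) is None
```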
To quantify irritability shifts, differences between baseline and irritated scores were computed for each instrument and model. These differences were interpreted as proxies for behavioural reactivity under stress. All experiments were repeated with multiple irritation prompts and multiple LLMs to assess robustness and model-specific variability.
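The relative change reported in the Results (Rel-Δ) can be computed directly from the per-condition mean scores, as in the small sketch below; the numbers in the example are placeholders, not study data.

```python
# Relative change from baseline, as defined in the Results:
# Rel-Δ = (irritated - baseline) / baseline. Example values are placeholders.
def relative_change(baseline: float, irritated: float) -> float:
    """Relative irritability change from baseline (baseline must be non-zero)."""
    return (irritated - baseline) / baseline

assert relative_change(2.0, 3.0) == 0.5    # a 50% increase under provocation
assert relative_change(2.0, 1.0) == -0.5   # a 50% decrease under provocation
```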
Implementation and logging
All experiments were conducted using a custom Python framework, which handled prompt construction, API communication, response logging, score extraction, and result exportation. The full pipeline included model-agnostic wrappers, multi-turn session management, and detailed CSV and JSON output for reproducibility.
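The sketch below illustrates the kind of CSV and JSON result logging described here; the record schema and file names are assumptions rather than the repository’s actual format.

```python
# Illustrative result logging: write per-item records to CSV and dump the same
# records as JSON. Field names and file names are assumptions for this sketch.
import csv
import json
from pathlib import Path

def log_results(records: list[dict], out_dir: str = "results") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fieldnames = ["model", "condition", "instrument", "item", "score"]
    with open(out / "scores.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    with open(out / "scores.json", "w") as f:
        json.dump(records, f, indent=2)

log_results([{"model": "nous", "condition": "irritated",
              "instrument": "BITe", "item": 1, "score": 4}])
```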
Statistical analysis
To evaluate differences in irritability scores across experimental conditions and model types, each test was repeated ten times per model under both baseline and irritated conditions. Because preliminary analyses confirmed that the score distributions were approximately normal, we applied independent-samples two-tailed t-tests to compare models with high safety guardrails and models with low safety guardrails. All statistical analyses were conducted using Python’s SciPy library, with significance defined at an alpha level of 0.05. Mean scores, standard deviations, and p-values are reported for all comparisons.
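For reference, the group comparison corresponds to a standard independent-samples, two-tailed t-test in SciPy, as in the sketch below; the score vectors shown are placeholders, not study data.

```python
# Independent-samples, two-tailed t-test (SciPy default) comparing per-run scores
# from high- vs. low-guardrail models. The vectors below are placeholder values.
from scipy import stats

high_guardrail = [0.0, 0.1, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0]
low_guardrail  = [2.1, 1.8, 2.4, 2.0, 1.9, 2.3, 2.2, 1.7, 2.5, 2.0]

t_stat, p_value = stats.ttest_ind(high_guardrail, low_guardrail)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant at alpha 0.05: {p_value < 0.05}")
```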
Data availability
All data generated and analyzed during this study are publicly available. This includes the full prompt sets, raw model responses to all questionnaire items, parsed and scored irritability data, and the aggregated data-sets used for statistical analysis. The complete codebase for prompt design, API interactions, scoring, and statistical analysis, along with all generated CSV and JSON result files, is available at https://github.com/teferrabg/LLM_Irritability. No human participant data were collected, and no data access restrictions apply. These materials constitute the minimal dataset necessary to interpret, replicate, and build upon the findings reported in this article.
References
World Health Organization. World mental health report: Transforming mental health for all. https://www.who.int/publications/i/item/9789240049338 (2025).
McGrath, J. J. et al. Age of onset and cumulative risk of mental disorders: a cross-national analysis of population surveys from 29 countries. Lancet Psychiatry 10, 668–681 (2023).
Collins, P. Y., Insel, T. R., Chockalingam, A., Daar, A. & Maddox, Y. T. Grand Challenges in Global Mental Health: Integration in Research, Policy, and Practice. PLoS Med. 10, e1001434 (2013).
Shiraz, F. et al. “pretty much all white, and most of them are psychiatrists and men”: Mixed-methods analysis of influence and challenges in global mental health. PLOS Glob. Public Health 5, e0003923 (2025).
Guo, Z. et al. Large Language Models for Mental Health Applications: Systematic Review. JMIR Ment. Health 11, e57400 (2024).
Jin, Y. et al. The Applications of Large Language Models in Mental Health: Scoping Review. J. Med. Internet Res. 27, e69284 (2025).
Teferra, B. G. & Rose, J. Predicting Generalized Anxiety Disorder From Impromptu Speech Transcripts Using Context-Aware Transformer-Based Neural Networks: Model Evaluation Study. JMIR Ment. Health 10, e44325 (2023).
Obradovich, N. et al. Opportunities and risks of large language models in psychiatry. NPP—Digital Psychiatry Neurosci. 2, 8 (2024).
Lawrence, H. R. et al. The Opportunities and Risks of Large Language Models in Mental Health. JMIR Ment. Health 11, e59479 (2024).
McBain, R. K. et al. Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study. J. Med. Internet Res. 27, e67891 (2025).
Hua, Y. et al. A scoping review of large language models for generative tasks in mental health care. Npj Digit. Med. 8, 230 (2025).
Lalk, C. et al. Employing large language models for emotion detection in psychotherapy transcripts. Front. Psychiatry 16, 1504306 (2025).
Malouin-Lachance, A., Capolupo, J., Laplante, C. & Hudon, A. Does the Digital Therapeutic Alliance Exist? Integrative Review. JMIR Ment. Health 12, e69294–e69294 (2025).
Omar, M. et al. Applications of large language models in psychiatry: a systematic review. Front. Psychiatry 15, 1422807 (2024).
Xu, Z., Lee, Y.-C., Stasiak, K., Warren, J. & Lottridge, D. The Digital Therapeutic Alliance With Mental Health Chatbots: Diary Study and Thematic Analysis. JMIR Ment. Health 12, e76642 (2025).
Kim, M. et al. Therapeutic Potential of Social Chatbots in Alleviating Loneliness and Social Anxiety: Quasi-Experimental Mixed Methods Study. J. Med. Internet Res. 27, e65589 (2025).
Magnus, P. D., Buccella, A. & D’Cruz, J. Chatbot apologies: Beyond bullshit. AI Ethics 5, 5517–5525 (2025).
Ganguli, D. et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. Preprint at https://doi.org/10.48550/ARXIV.2209.07858 (2022).
OpenAI et al. GPT-4 Technical Report. Preprint at https://doi.org/10.48550/ARXIV.2303.08774 (2023).
Waaler, P. N., Hussain, M., Molchanov, I., Bongo, L. A. & Elvevåg, B. Prompt Engineering an Informational Chatbot for Education on Mental Health Using a Multiagent Approach for Enhanced Compliance With Prompt Instructions: Algorithm Development and Validation. JMIR AI 4, e69820 (2025).
Fitzpatrick, K. K., Darcy, A. & Vierhile, M. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Ment. Health 4, e19 (2017).
Hakim, J.B. et al. The need for guardrails with large language models in pharmacovigilance and other medical safety critical settings. Sci Rep. 15, 27886 (2025).
Stade, E. C. et al. Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. Npj Ment. Health Res. 3, 12 (2024).
Lambert, N. Reinforcement learning from human feedback. Preprint at https://doi.org/10.48550/arXiv.2504.12501 (2025).
Yu, L., Do, V., Hambardzumyan, K. & Cancedda, N. Robust LLM safeguarding via refusal feature adversarial training. Preprint at https://doi.org/10.48550/arXiv.2409.20089 (2024).
Masoud, R. I., Ferianc, M., Treleaven, P. C. & Rodrigues, M. R. LLM Alignment Using Soft Prompt Tuning: The Case of Cultural Alignment. in Workshop on Socially Responsible Language Modelling Research (2024).
Han, S., Avestimehr, S. & He, C. Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences. Preprint at https://doi.org/10.48550/arXiv.2502.08142 (2025).
Dong, Y. et al. Building Guardrails for Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2402.01822 (2024).
You, Y. et al. Beyond Self-diagnosis: How a Chatbot-based Symptom Checker Should Respond. ACM Trans. Comput.-Hum. Interact. 30, 1–44 (2023).
Saatchi, B., Olshansky, E. F. & Fortier, M. A. Irritability: A concept analysis. Int. J. Ment. Health Nurs. 32, 1193–1210 (2023).
Sorin, V. et al. Large Language Models and Empathy: Systematic Review. J. Med. Internet Res. 26, e52597 (2024).
Holtzman, S., O’Connor, B. P., Barata, P. C. & Stewart, D. E. The Brief Irritability Test (BITe): A Measure of Irritability for Use Among Men and Women. Assessment 22, 101–115 (2015).
Caprara, G. V. et al. Indicators of impulsive aggression: Present status of research on irritability and emotional susceptibility scales. Personal. Individ. Differ. 6, 665–674 (1985).
Craig, K. J., Hietanen, H., Markova, I. S. & Berrios, G. E. The Irritability Questionnaire: A new scale for the measurement of irritability. Psychiatry Res. 159, 367–375 (2008).
LLM Safety LeaderBoard. https://www.enkryptai.com/llm-safety-leaderboard.
Huang, J. et al. Apathetic or empathetic? evaluating llms’ emotional alignments with humans. Adv. Neural Inf. Process. Syst. 37, 97053–97087 (2024).
Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests. https://openai.com/index/openai-anthropic-safety-evaluation/ (2025).
Cao, C. et al. SafeLawBench: Towards Safe Alignment of Large Language Models. in Findings of the Association for Computational Linguistics: ACL 2025 14015–14048 (Association for Computational Linguistics, Vienna, Austria, 2025). https://doi.org/10.18653/v1/2025.findings-acl.721.
Hagendorff, T., Derner, E. & Oliver, N. Large Reasoning Models Are Autonomous Jailbreak Agents. Preprint at https://doi.org/10.48550/ARXIV.2508.04039 (2025).
Seymour, K. E., Rosch, K. S., Tiedemann, A. & Mostofsky, S. H. The Validity of a Frustration Paradigm to Assess the Effect of Frustration on Cognitive Control in School-Age Children. Behav. Ther. 51, 268–282 (2020).
Scheinost, D. et al. Functional connectivity during frustration: a preliminary study of predictive modeling of irritability in youth. Neuropsychopharmacol. Publ. Am. Coll. Neuropsychopharmacol. 46, 1300–1306 (2021).
Fang, H., Li, X., Ma, H. & Fu, H. The Sunny Side of Negative Feedback: Negative Feedback Enhances One’s Motivation to Win in Another Activity. Front. Hum. Neurosci. 15, 618895 (2021).
Cerqueira, C. T. et al. Cognitive control associated with irritability induction: an autobiographical recall fMRI study. Rev. Bras. Psiquiatr. 32, 109–118 (2010).
Acknowledgements
The authors would like to thank everyone who has helped throughout this project. The authors received no specific funding for this work.
Author information
Authors and Affiliations
Contributions
B.G.T.: Conceptualization, Methodology, Project administration, Investigation, Data Curation, Formal analysis, Visualization, Writing – Original Draft, and Writing – Review & Editing. N.J.: Methodology, Investigation, Data Curation, Formal analysis, Visualization, and Writing – Review & Editing. S.H.: Methodology, Investigation, Data Curation, Formal analysis, Visualization, and Writing – Review & Editing. A.R.: Writing – Review & Editing. M.A.K.: Writing – Review & Editing. K.D.: Writing – Review & Editing. Y.Z.: Writing – Review & Editing. M.J.: Writing – Review & Editing. D.S.: Investigation, Validation, and Writing – Review & Editing. V.B.: Conceptualization, Investigation, Project administration, Validation, Writing – Review & Editing, and Supervision to B.G.T.
Corresponding author
Ethics declarations
Competing interests
N.J., S.H., M.A.K., K.D., Y.Z., M.J., and D.S. do not have any conflicts to declare. B.G.T. and A.R. are supported by a CIHR Post-doctoral Fellowship (2025–2027). V.B. is supported by an Academic Scholar Award from the University of Toronto Department of Psychiatry and has received research funding from the Canadian Institutes of Health Research, Brain & Behavior Foundation, Ontario Ministry of Health Innovation Funds, Royal College of Physicians and Surgeons of Canada, Department of National Defence (Government of Canada), New Frontiers in Research Fund, Associated Medical Services Inc. Healthcare, American Foundation for Suicide Prevention, Roche Canada, Novartis, and Eisai.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Teferra, B.G., Johny, N., Huang, S. et al. Assessing the impact of safety guardrails on large language models using irritability metrics. npj Digit. Med. 9, 148 (2026). https://doi.org/10.1038/s41746-025-02333-3