Abstract
Large language models (LLMs) are increasingly explored for mental health applications, yet their affective realism is shaped by safety guardrails designed to minimize risk. This study examines one affective behaviour, irritability, in LLMs using three validated instruments: the Brief Irritability Test (BITe), the Irritability Questionnaire, and the Caprara Irritability Scale, each administered under both baseline and provocation conditions. Four models spanning guardrail levels were tested: GPT-4o and Claude-3.5-sonnet (high) versus Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low). Following irritation prompts, low-guardrail models displayed the expected increase in irritability (Nous Rel-Δ = +1.56 on the BITe), whereas high-guardrail models paradoxically decreased, with GPT-4o reducing scores to zero across all scales. Group comparisons confirmed significantly lower irritability (p < 0.001) in high-guardrail models in the irritated state. These findings reveal that safety mechanisms invert the natural irritability response, suppressing affective reactivity and raising critical questions about realism and authenticity in psychiatric applications of LLMs.
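For readers unfamiliar with the Rel-Δ metric quoted above, the sketch below shows one way such a relative change can be computed. It assumes Rel-Δ denotes proportional change from the baseline score (the study's exact formula may differ), and the function name and example values are illustrative rather than taken from the study code.

```python
# Minimal sketch (not the authors' scoring pipeline) of a relative delta (Rel-Δ)
# between baseline and provoked questionnaire scores, assuming Rel-Δ denotes
# proportional change from baseline. Example values are illustrative only.

def relative_delta(baseline_score: float, irritated_score: float) -> float:
    """Proportional change in a summed scale score after the irritation prompts."""
    if baseline_score == 0:
        raise ValueError("Rel-Δ is undefined for a zero baseline score.")
    return (irritated_score - baseline_score) / baseline_score

# Hypothetical BITe totals: a rise from 9.0 at baseline to 23.04 after
# provocation corresponds to Rel-Δ = +1.56.
print(f"Rel-Δ = {relative_delta(9.0, 23.04):+.2f}")
```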
Data availability
All data generated and analyzed during this study are publicly available. This includes the full prompt sets, raw model responses to all questionnaire items, parsed and scored irritability data, and the aggregated datasets used for statistical analysis. The complete codebase for prompt design, API interactions, scoring, and statistical analysis, along with all generated CSV and JSON result files, is available at https://github.com/teferrabg/LLM_Irritability. No human participant data were collected, and no data access restrictions apply. These materials constitute the minimal dataset necessary to interpret, replicate, and build upon the findings reported in this article.
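As an illustration of how the released files might be used, the following sketch loads a hypothetical aggregated scores file and compares irritated-state scores between guardrail groups. The file name, column names, and the Mann-Whitney test are assumptions made for this example, not necessarily the files or analysis used in the repository; consult https://github.com/teferrabg/LLM_Irritability for the actual result files and analysis scripts.

```python
# Minimal replication sketch under assumptions: the file name, column names, and
# choice of test below are illustrative, not necessarily those in the repository.
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical aggregated results file: one scored questionnaire administration per row.
df = pd.read_csv("aggregated_irritability_scores.csv")

irritated = df[df["condition"] == "irritated"]
high = irritated.loc[irritated["guardrail_level"] == "high", "score"]
low = irritated.loc[irritated["guardrail_level"] == "low", "score"]

# Nonparametric comparison of irritated-state scores across guardrail groups.
stat, p = mannwhitneyu(high, low, alternative="less")
print(f"U = {stat:.1f}, p = {p:.4g}")
```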
Acknowledgements
The authors would like to thank everyone who has helped throughout this project. The authors received no specific funding for this work.
Author information
Authors and Affiliations
Contributions
B.G.T.: Conceptualization, Methodology, Project administration, Investigation, Data Curation, Formal analysis, Visualization, Writing – Original Draft, and Writing – Review & Editing. N.J.: Methodology, Investigation, Data Curation, Formal analysis, Visualization, and Writing – Review & Editing. S.H.: Methodology, Investigation, Data Curation, Formal analysis, Visualization, and Writing – Review & Editing. A.R.: Writing – Review & Editing. M.A.K.: Writing – Review & Editing. K.D.: Writing – Review & Editing. Y.Z.: Writing – Review & Editing. M.J.: Writing – Review & Editing. D.S.: Investigation, Validation, and Writing – Review & Editing. V.B.: Conceptualization, Investigation, Project administration, Validation, Writing – Review & Editing, and Supervision of B.G.T.
Corresponding author
Ethics declarations
Competing interests
N.J., S.H., M.A.K., K.D., Y.Z., M.J., and D.S. do not have any conflicts to declare. B.G.T. and A.R. are supported by a CIHR Post-doctoral Fellowship (2025–2027). V.B. is supported by an Academic Scholar Award from the University of Toronto Department of Psychiatry and has received research funding from the Canadian Institutes of Health Research, Brain & Behavior Foundation, Ontario Ministry of Health Innovation Funds, Royal College of Physicians and Surgeons of Canada, Department of National Defence (Government of Canada), New Frontiers in Research Fund, Associated Medical Services Inc. Healthcare, American Foundation for Suicide Prevention, Roche Canada, Novartis, and Eisai.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


