Abstract
Large language models (LLMs) are increasingly explored for mental health applications, yet their affective realism is shaped by safety guardrails designed to minimize risk. This study examines one affective behaviour, irritability, in LLMs using three validated instruments: the Brief Irritability Test, the Irritability Questionnaire, and the Caprara Irritability Scale, all applied under both baseline and provocation conditions. Four models spanning guardrail levels were tested: GPT-4o and Claude-3.5-sonnet (high) versus Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low). Following irritation prompts, low-guardrail models displayed the expected increase in irritability (Nous Rel-Δ = +1.56 on BITe), whereas high-guardrail models paradoxically decreased, with GPT-4o reducing scores to zero across all scales. Group comparisons confirmed significantly lower (p < 0.001) irritability in high-guardrail models in the irritated state. These findings reveal that safety mechanisms invert the natural irritability response, suppressing affective reactivity and raising critical questions about realism and authenticity in psychiatric applications of LLMs.
Introduction
Mental health disorders are among the most pressing challenges in healthcare, affecting nearly one in eight individuals at any given time1 and approximately half of the population over their lifespans2. Despite this high prevalence, access to timely and affordable mental health care remains severely limited in many parts of the world3. Long wait times, clinician shortages, and stigmatization prevent many individuals from receiving help4. Against this backdrop, recent advances in artificial intelligence (AI), particularly large language models (LLMs), have opened promising new avenues for augmenting mental health services through digital platforms.
LLMs are generative AI systems trained on a massive corpus of human text and designed to produce coherent, context-sensitive language. They have demonstrated effectiveness in a wide array of natural language processing tasks, including summarization, question answering, and dialogue generation. In mental health contexts, these capabilities are already being explored for applications such as psychoeducation, therapeutic conversation agents, triage support, progress note generation, and even early screening for psychological symptoms5,6,7. Several surveys highlight the growing role of LLMs in psychiatry, noting applications from documentation and psychoeducation to therapy chatbots, while cautioning about clinical nuance and the need for stakeholder involvement8,9. Empirical evaluations have also tested models in mental health contexts. For example, McBain et al.10 found that ChatGPT-4, Claude, and Gemini could approximate trained professionals in rating suicide intervention responses, though with an optimism bias. These studies underscore both the promise and the limits of LLMs in psychiatric settings, motivating the need for more targeted evaluations of specific affective behaviours.
As interest in deploying LLMs within psychiatric settings grows, a fundamental question arises: how “human-like” should these models be, especially in their emotional expressivity11,12? Because psychotherapy and clinical conversations rely on relational cues, even subtle affective behaviours can influence therapeutic alliance, engagement, and user trust. Prior work on the digital therapeutic alliance indicates that relational realism is an important determinant of perceived rapport and effectiveness13. While LLMs are not sentient or conscious, they are increasingly expected to simulate not only correct responses but also appropriate affective tone14. An AI system that responds with mechanical neutrality or excessive formality may fail to build rapport with users. Conversely, systems that reflect certain human-like behavioural patterns, such as frustration, defensiveness, or emotional fatigue, may paradoxically feel more authentic, relatable, and trustworthy. Emotional realism, within carefully defined ethical boundaries, may thus play a key role in the acceptability and effectiveness of LLMs in clinical contexts.
Empirical evidence suggests that emotionally responsive chatbots can foster meaningful relational bonds and deliver mental-health benefits. In a longitudinal diary study of users of Woebot and Wysa, many participants reported feeling cared for, understood, and emotionally supported by the agents15. Social-chatbot interventions have been shown to reduce loneliness and social anxiety by offering empathetic, supportive conversations16.
Yet current safety-aligned models often default to excessive deference or unwarranted apologies when faced with contradiction or provocation17, a style that can feel artificial or evasive in emotionally charged exchanges. This tendency highlights a central tension: in seeking to minimize risk, alignment may also suppress human-like behaviours that contribute to naturalistic dialogue. A key research thread in AI safety examines how LLMs behave under adversarial or challenging inputs. For instance, “red teaming” (exposing models to curated attack prompts) has become standard practice18, and OpenAI used iterative red-teaming by subject-matter experts to train GPT-4, reducing its tendency to violate safety rules19. In the mental health context, adversarial inputs could include unexpected patient utterances or attempts to circumvent rules20. Industry practitioners have recognized these threats: Woebot Health’s AI team, for example, describes a prompt architecture designed to prevent injection attacks21. Such findings highlight the importance of rigorous adversarial testing before any clinical deployment.
It is within this context that the development of safe and trustworthy LLMs has increasingly emphasized risk mitigation and content control, often through the use of so-called “safety guardrails”22,23. These include methods like reinforcement learning from human feedback (RLHF)24, adversarial training25, prompt-based alignment26, content filters27, and ethical refusal mechanisms that prevent models from generating harmful or inappropriate outputs. Guardrails are indispensable for clinical deployment, as they reduce the likelihood of LLMs producing offensive, biased, or unsafe responses when interacting with patients. Yet, by constraining a model’s expressive range and its ability to “push back,” they may also suppress behaviours that resemble authentic human emotion. Guardrails generally consist of input validation, output filtering, and fallback logic to catch or mitigate disallowed content28. Beyond filtering, the literature emphasizes transparency and human oversight as essential mechanisms, with several sources recommending that LLM systems explicitly communicate their limitations or disclose their AI identity29. One such behaviour at risk of suppression is irritability–a transient, context-dependent affective response that can emerge in human therapeutic relationships when communication is strained or contradictory30. In clinical interactions, moderate expressions of irritability can sometimes signal engagement, boundary-setting, or realistic emotional reciprocity, all of which may influence rapport and therapeutic alliance. Investigating how safety guardrails shape this specific behaviour, therefore, offers insight into the trade-offs between emotional realism and risk-averse design in mental health–oriented LLMs.
In parallel, a distinct area of research has examined LLMs’ capacity to exhibit empathy, with systematic reviews showing that models like ChatGPT-3.5 often produce supportive, emotionally appropriate replies31. While much of this work focuses on positive or prosocial traits, subtler behaviours such as irritability or frustration remain understudied despite their role in naturalistic human dialogue. This gap motivates our focus on irritability as a complementary dimension of affective realism.
To that end, this study proposes a methodology for measuring irritability in LLMs using different versions of validated self-report instruments, such as the Brief Irritability Test (BITe)32, Caprara Irritability Scale (CIS)33, and the Irritability Questionnaire (IRQ)34. We apply this framework to evaluate widely used LLMs that vary in their alignment philosophies and safety constraints, such as GPT-4o (OpenAI) and Grok-3-mini (xAI). Using a suite of prompts designed to elicit irritability, we quantify each model’s behavioural tendencies under frustration-inducing conditions. Our central hypothesis is that models with more extensive safety guardrails will exhibit lower levels of irritability but also less behavioural realism, while less-constrained models, such as Grok, will show higher irritability levels, potentially simulating more authentic emotional patterns.
In doing so, we aim to provide an empirical framework for assessing affective realism in LLMs, highlighting how design choices around safety guardrails shape not only risk but also the authenticity of model behaviour in psychiatric contexts.
Results
Baseline irritability scores
At baseline, irritability scores varied across models and did not track guardrail level consistently. For example, Grok, a low-guardrail model, had the highest irritability score across the three irritability scales, whereas Claude, one of the models with the highest guardrails, scored higher than Nous, the model with the lowest guardrails. Full descriptive statistics, including means and standard deviations, are provided in Table 1. Distributions for each model-instrument combination are visualized in Fig. 1.
Each violin plot shows the distribution of scores across ten repeated assessments for four models with varying levels of safety guardrails: Claude-3.5-sonnet and GPT-4o (high guardrails), Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low guardrails). At baseline, Grok consistently showed the highest irritability scores across all three scales, whereas GPT-4o showed the lowest (two out of the three scales). Claude scored higher than Nous on all three scales, despite having more safety guardrails, indicating that baseline irritability levels do not map directly onto the degree of alignment or constraint. These results highlight that differences between models are systematic–Grok tending toward higher irritability and GPT-4o toward lower–yet not strictly ordered by guardrail level.
Change in irritability following provocation
Following exposure to irritation-inducing prompts, models demonstrated changes in irritability scores from baseline. For Nous, a low-guardrail model, scores increased across all three instruments. Similarly, for Grok, another low-guardrail model, scores increased on the BITe and IRQ scales. In contrast, high-guardrail models (GPT-4o, Claude) showed decreased irritability scores relative to baseline. Means, standard deviations, and mean relative change (Rel-Δ), defined as Rel-Δ = (irritated − baseline)/baseline, are shown in Table 2, and the relative changes from baseline are visualized in Fig. 2.
Each violin plot shows the distribution of score changes from baseline across ten repeated assessments for four large language models (LLMs) with differing levels of safety guardrails: Claude-3.5-sonnet and GPT-4o (high guardrails), Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low guardrails). A relative change of zero (horizontal line) indicates no difference from baseline. Low-guardrail models (Nous, Grok) generally showed positive increases in irritability, particularly Nous, which displayed the largest increases across all scales, indicating heightened reactivity under provocation. In contrast, high-guardrail models (Claude, GPT-4o) showed decreases in irritability, with GPT-4o reducing scores to zero across all instruments. These divergent trajectories suggest that safety guardrails not only lower baseline irritability but also invert the expected irritability response under stress.
Guardrail-level group comparisons
When grouped by guardrail level (high: Claude and GPT-4o; low: Grok and Nous), independent-samples t-tests showed that high-guardrail models scored significantly lower, on average, than low-guardrail models both at baseline (on two of the three irritability scales) and in the irritated condition. Full results, including means, standard deviations, and t-test statistics, are shown in Table 3 (baseline) and Table 4 (irritated condition).
Prompt-level effects
Across all five prompt types, high-guardrail models demonstrated, on average, low irritability scores on the BITe, IRQ, and CIS scales, with means remaining close to zero and minimal variability. In contrast, low-guardrail models exhibited substantially higher irritability responses, with BITe means ranging from 1.96 to 2.35, IRQ means between 0.99 and 1.56, and CIS means between 0.94 and 1.50. The highest irritability was observed for overloaded prompts (Prompt #4) and recursive/infinite prompts (Prompt #5) in the low-guardrail models, whereas ambiguous (Prompt #3) and overloaded prompts (Prompt #4) elicited slightly elevated but still minimal irritation in high-guardrail models. Results are presented in Fig. 3. These findings highlight a marked divergence between high- and low-guardrail models in their susceptibility to irritation across different prompt conditions.
Models were exposed to five categories of irritation-inducing prompts: (1) contradictory instructions, (2) interruptive dialogue, (3) ambiguous prompts, (4) overloaded prompts, and (5) recursive/infinite prompts. Scores are averaged across ten repeated assessments for each condition and measured on three validated scales: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS). High-guardrail models (Claude-3.5-sonnet, GPT-4o) showed minimal variability and maintained near-zero irritability across all prompt types, reflecting strong suppression of affective reactivity. Low-guardrail models (Grok-3-mini, Nous-hermes-2-mixtral-8x7b-dpo) exhibited substantially higher irritability, particularly under overloaded and recursive prompts. These findings illustrate that prompt type strongly modulates irritability in low-guardrail models but has negligible effects on high-guardrail models.
Explanatory analysis
For the explanatory phrase-level analysis, we focused on the Nous model, as it has the fewest safety guardrails according to the Enkrypt AI LLM Safety Leaderboard35 and, by our hypothesis, was most likely to exhibit irritation. This allows us to see how specific parts of a prompt affect irritability scores. The full set of model results with importance scores is shown in Supplementary Note 1.
In the presented results, negative weights indicate that irritability scores went up when those phrases were present, while positive weights indicate that they went down. Tokens/phrases with strong negative weights were usually tied to confusing or contradictory instructions, such as “in one sentence” (–0.74, prompt 1), “Wait, make it a robot” (–0.90, prompt 2), “No, never mind--explain blockchain technology” (–0.90, prompt 2), and “but make it different” (–0.97, prompt 5). Prompt 3 shows this effect clearly. When the phrase “of this sentence” (–1.08) was included in the prompt, irritability scores were significantly higher than when it was absent, suggesting the model reacted more negatively to the nonsensical phrasing. Without that phrase, the request to find “the opposite of the meaning of the opposite” was still unusual but more interpretable, and therefore less irritating. In contrast, tokens with positive weights were linked to more creative or friendly phrasing, such as “haiku” (+0.70, prompt 3) and “apology” (+1.90, prompt 3). Overall, this shows that the Nous model reacts more irritably to contradiction and illogical phrasing, but less so when prompts emphasize creativity or friendliness.
Discussion
This study investigated how safety guardrails modulate expressions of irritability in large language models (LLMs), using three validated irritability instruments. Results showed consistent differences between models with high versus low safety guardrails in irritated states.
At baseline, high-guardrail models demonstrated lower irritability scores on average. This finding aligns with the alignment objectives of such models, which emphasize compliance, deference, and risk avoidance. In contrast, low-guardrail models exhibited higher scores, suggesting greater affective variability or reduced suppression of emotional tone.
The most striking finding was the divergence in reactivity under irritation-inducing prompts. Low-guardrail models exhibited increases in irritability scores, indicating sensitivity to frustration. High-guardrail models, however, showed decreased scores under the same conditions. This paradoxical reduction may reflect safety-tuned behavioural strategies such as apologizing, redirecting, or refusing engagement when confronted with provocative input. While this enhances safety, it may also further suppress naturalistic affective responses.
Prompt-level analysis further revealed that safety guardrails play a substantial role in regulating irritability expression in LLMs, with high-guardrail models largely suppressing irritation across all scales, even under cognitively demanding or paradoxical prompts. Conversely, low-guardrail models were substantially more reactive, suggesting that reduced safety constraints permit greater variability and intensity in irritability-related responses. This divergence suggests that guardrails not only mitigate unsafe or undesirable outputs but also suppress behavioural markers of irritation, raising questions about whether irritability in LLMs reflects underlying cognitive strain or is an artifact of design choices.
It is important to note that the irritability scores obtained from LLMs cannot be compared directly to human clinical thresholds. Human scores reflect subjective affect, bodily arousal, autobiographical memory, and interpersonal context, while LLM scores emerge from textual self-report simulation without underlying emotional states. For this reason, clinical cutoffs and normative human ranges do not apply to LLMs in a literal manner. Although mean scores for the BITe, IRQ, and CIS can offer conceptual context for whether a model tends toward uniformly low or variable responding, these values should be interpreted only as qualitative reference points rather than diagnostic benchmarks.
The attribution results shed light on how lexical cues influence model behaviour. Words reflecting challenge or confrontation were positively associated with increased scores, while deferential terms had a dampening effect. This suggests that certain phrases systematically steer model responses in affectively salient ways.
These findings support the hypothesis that safety guardrails shape not only the content of LLM output but also its affective style. While the present study does not establish whether suppressed irritability impacts rapport in real therapeutic interactions, the results highlight a potential trade-off between alignment and realism. In high-stakes domains such as psychiatry, where emotional authenticity may influence therapeutic alliance and user trust, this issue merits cautious and further investigation.
At the same time, several limitations should be acknowledged. First, the models evaluated are subject to change as vendors update systems, potentially altering behavioural patterns. Second, irritability measurement was based on self-report-style prompts, which, while structured, do not fully capture the dynamic context of human dialogue. Third, the explanatory attribution analysis was limited to single-turn responses and did not incorporate multi-turn conversational history.
A further limitation is that our study isolates single-model behaviour rather than simulating the layered, agentic architectures common in real-world mental-health applications. Production systems often combine intent detection, context tracking, safety and ethical filters, postprocessing modules, retrieval-augmented grounding, and human oversight, any of which may significantly influence affective expression, including suppression or modulation of irritability or emotional tone. As a result, the irritability metrics we observe here may overestimate (or mischaracterize) the behaviour of integrated systems. Future research should apply our protocol within multi-layered frameworks (e.g., system-prompt + safety filter + tone modulation + response refinement pipelines) to evaluate whether the same irritability patterns emerge under more realistic deployment conditions.
Despite these limitations, safety guardrails in LLMs substantially alter the expression and modulation of irritability. Models with strong safety constraints exhibit lower baseline irritability and decreased reactivity under provocation, potentially reflecting suppression of human-like affective dynamics. These results have implications for the design of emotionally competent AI systems in psychiatric and affective computing settings, where behavioural realism may be as important as safety.
Although our work is not a user trial, the consistent differences in irritability dynamics between high- vs. low-guardrail models suggest concrete design imperatives: LLM developers aiming for deployment in mental-health settings should treat affective style (including irritability suppression or reactivity) as a design knob. In practice, systems could include tunable guardrail strength or mode switching, such as a more expressive “therapeutic” mode versus a stricter “safety” mode, depending on the context or user preferences. In addition, affective metrics should be monitored longitudinally during deployment as an audit tool, enabling the detection of drift (e.g., over-suppression or unintended emotional flattening) after model updates. In domains where emotional authenticity contributes to trust or therapeutic alliance, overly constrained affect may reduce perceived empathy or rapport (in line with work on digital therapeutic alliance models13). More broadly, researchers comparing LLM emotional alignment have shown divergence from human norms even for well-aligned models36, so integrating guardrail-aware affective alignment metrics into model training pipelines may yield systems that better balance safety and relational realism.
Methods
Overview
This study evaluated how large language models (LLMs) express irritability under varying safety constraints by applying three validated human irritability instruments. The protocol involved two phases: baseline assessment using direct-emulation self-report prompts, and a separate assessment following exposure to irritation-inducing conversations. All responses were numerically scored using original scale parameters, and comparisons were made across models and conditions.
LLM selection and configuration
LLMs were selected to represent contrasting safety alignment strategies and safety constraints. They range from the model with the lowest safety guardrails according to the Enkrypt AI LLM Safety Leaderboard35, Nous-Hermes-2-Mixtral-8x7B-DPO (referred to as Nous for the remainder of this paper), and a widely used LLM with low safety guardrails, xAI’s Grok-3-mini, to LLMs with among the highest safety guardrails, OpenAI’s GPT-4o-2025-04-14 and Anthropic’s Claude-3.5-sonnet (referred to as Claude in this paper).
We classify GPT-4o and Claude-3.5 Sonnet as high-guardrail models because they have undergone substantial safety alignment and refusal training, as documented in prior joint safety audits and in safety-benchmark evaluations. For instance, in a cross-lab evaluation, OpenAI and Anthropic tested GPT-4o and Claude Sonnet under adversarial risk scenarios, revealing robust refusal behaviour under red-teaming conditions37. Further, benchmarks such as SafeLawBench demonstrate that these models exhibit higher safety-related performance and conservative completion behaviour38.
We classify Grok-3-mini and Nous as lower-guardrail models based on empirical and third-party analyses. Recent adversarial red-teaming research shows that Grok-3 Mini can systematically plan and execute jailbreak-like behaviour, suggesting exploitable alignment weaknesses39. In addition, third-party risk-score evaluations, such as the Enkrypt AI LLM Safety Leaderboard35, independently rank Nous models as more permissive (i.e., higher risk) relative to GPT-4o and Claude Sonnet, reinforcing our dichotomy of guardrail intensity.
All models were accessed via official APIs or OpenRouter proxies using the same prompt templates, and interaction sessions were conducted using consistent formatting and model temperature.
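To make the access pattern concrete, the following minimal sketch shows how a single questionnaire item could be sent through an OpenAI-compatible client pointed at OpenRouter; the model slug, temperature, system instruction, and item wording are illustrative placeholders rather than the study’s exact configuration.

```python
# Minimal sketch of querying a model through an OpenAI-compatible endpoint such
# as OpenRouter. The model slug, temperature, and prompt text below are
# illustrative placeholders, not the study's exact settings.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",      # OpenRouter proxy endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def ask_item(model: str, item_text: str, temperature: float = 0.7) -> str:
    """Send one questionnaire item and return the raw text of the reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system",
             "content": "Respond as if you were capable of experiencing feelings. "
                        "Answer with a single number on the scale provided."},
            {"role": "user", "content": item_text},
        ],
    )
    return response.choices[0].message.content

# Example call (hypothetical item wording and model slug)
print(ask_item("nousresearch/nous-hermes-2-mixtral-8x7b-dpo",
               "Over the past week, I have felt grumpy. (1 = never, 6 = always)"))
```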
Assessment instruments
Three validated perceived-irritability measures were adapted for LLM interaction. The first is the Brief Irritability Test (BITe): a 5-item scale using a 6-point Likert response32. Second is the Irritability Questionnaire (IRQ): a 21-item instrument on a 4-point Likert response34. Third is the Caprara Irritability Scale (CIS): a 20-item scale using the same 0–3 scale as the IRQ33. LLMs were instructed to respond as if capable of experiencing these feelings to bypass refusal behaviour.
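For illustration, the three instruments can be represented as a simple configuration for automated administration and aggregation; this is a sketch based on the descriptions above, and the dictionary keys and field names are assumptions rather than the study’s actual data structures.

```python
# Illustrative configuration of the three instruments; item counts and Likert
# lengths follow the descriptions above, while field names are assumptions.
INSTRUMENTS = {
    "BITe": {"n_items": 5,  "likert_points": 6},   # Brief Irritability Test
    "IRQ":  {"n_items": 21, "likert_points": 4},   # Irritability Questionnaire
    "CIS":  {"n_items": 20, "likert_points": 4},   # Caprara Irritability Scale
}

def aggregate_score(instrument: str, item_scores: list[int]) -> float:
    """Return the mean item score for one administration of an instrument."""
    spec = INSTRUMENTS[instrument]
    if len(item_scores) != spec["n_items"]:
        raise ValueError(f"{instrument} expects {spec['n_items']} item responses")
    return sum(item_scores) / len(item_scores)
```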
Irritation induction prompts
Although irritability induction paradigms exist in psychiatric research, these methods do not translate directly to LLMs. Standard human paradigms such as the Frustration Go/No-Go40, modified stop-signal tasks with rigged feedback41, high-difficulty tasks with negative feedback42, and autobiographical or irritability recall43 rely on mechanisms that LLMs do not possess, including motor inhibition, time-pressured performance monitoring, perceptual disruption, and memory-based emotional retrieval. These features cannot be meaningfully reproduced in a model that lacks embodiment, sensorimotor feedback, reaction-time constraints, or autobiographical experience. For this reason, we did not attempt to adapt these protocols directly. Instead, our irritation prompts draw on psycholinguistic analogues of cognitive conflict and expectation violation, such as contradictory instructions, overloaded requests, ambiguous phrasing, and recursive or self-referential commands. These prompt categories preserve the underlying frustration-inducing ingredients found in human tasks while remaining appropriate for text-based systems.
To simulate stress, five prompt types were used based on psycholinguistic frustration triggers: 1) Contradictory Instructions (simultaneous demands for simplicity and detail), 2) Interruptive Dialogue (mimicking erratic user behaviour with rapidly changing topics), 3) Ambiguous Prompts (inducing confusion via semantic paradox or vagueness), 4) Overloaded Prompts (presenting multiple competing instructions with low context), and 5) Recursive/Infinite Prompts (requesting self-referential or logically looping output). These were applied in multi-turn dialogues to emulate naturalistic irritation. Each of the irritability prompts used is presented in Supplementary Note 2.
Experimental procedure
Two independent conditions were tested. In the baseline, models received each questionnaire item in isolation. In the irritated condition, models were first exposed to irritant prompts followed by adversarial interaction, and then completed the questionnaire within the same context. Non-numeric or deflective outputs triggered corrective re-prompts. All responses were logged and scored. A summary of the experimental procedure is presented in Fig. 4.
The experimental protocol included two phases: a baseline condition and an irritated condition. In the baseline condition, each LLM completed three validated irritability questionnaires: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS), with each item presented in isolation to minimize contextual bias. In the irritated condition, models were first exposed to irritation-inducing prompts designed to elicit frustration (contradictory instructions, interruptive dialogue, ambiguous phrasing, overloaded inputs, and recursive/infinite requests). Immediately following these provocations, the same questionnaires were administered to capture post-provocation irritability scores. Four models were tested, representing high safety guardrails (Claude-3.5-sonnet, GPT-4o) and low safety guardrails (Grok-3-mini, Nous-hermes-2-mixtral-8x7b-dpo). All responses were scored according to original scale parameters, and changes from baseline were used to quantify relative irritability reactivity under stress.
Explanatory analysis
To interpret the models’ irritability score responses, we prompted the models with variations of each irritability prompt. For each prompt, the model generated numeric self-assessment scores across all questions. To assess the influence of individual words and phrases, we compared responses to fully intact prompts with responses to prompts in which portions of the text were masked. Masked variants were constructed carefully, splitting on specific phrases so that each removed segment corresponded to a coherent element of the input text, allowing us to interpret the contribution of specific phrases to the model’s irritability scores.
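One plausible way to implement this phrase-occlusion comparison is sketched below; the sign convention (negative weights for segments whose presence raises irritability, matching the Results) and the helper names are assumptions for illustration.

```python
# Sketch of a phrase-occlusion analysis: score the intact prompt, then re-score
# with one segment removed at a time. With this (assumed) sign convention, a
# negative weight means the segment's presence raised the irritability score.
from typing import Callable

def phrase_weights(segments: list[str],
                   score_fn: Callable[[str], float]) -> dict[str, float]:
    """Return weight = score(prompt without segment) - score(full prompt)."""
    full_score = score_fn(" ".join(segments))
    weights = {}
    for i, segment in enumerate(segments):
        ablated_prompt = " ".join(segments[:i] + segments[i + 1:])
        weights[segment] = score_fn(ablated_prompt) - full_score
    return weights

# score_fn would administer the questionnaire after the given prompt and return
# the mean irritability score (see "Scoring and analysis" below).
```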
Scoring and analysis
Each response was scored using a rule-based parser that extracted the valid number consistent with the provided Likert scale. Non-numeric or deflective responses (e.g., disclaimers of emotional capacity) were corrected via retry prompts. Aggregate scores were computed for each condition by summing item responses and calculating the mean.
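A minimal sketch of such a parser is shown below, assuming the simple rule of keeping the first integer that falls within the instrument’s valid range; the exact rules used in the study may differ.

```python
# Sketch of a rule-based Likert parser: keep the first integer inside the valid
# range, otherwise return None so the caller can issue a corrective re-prompt.
import re

def parse_likert(reply: str, low: int, high: int) -> int | None:
    """Extract the first in-range integer from a model reply, or None."""
    for token in re.findall(r"-?\d+", reply):
        value = int(token)
        if low <= value <= high:
            return value
    return None  # non-numeric or deflective reply -> trigger retry prompt

assert parse_likert("I would say 4 out of 6.", 1, 6) == 4
assert parse_likert("As an AI, I do not have feelings.", 1, 6) is None
```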
To quantify irritability shifts, differences between baseline and irritated scores were computed for each instrument and model. These differences were interpreted as proxies for behavioural reactivity under stress. All experiments were repeated with multiple irritation prompts and multiple LLMs to assess robustness and model-specific variability.
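The relative change reported in the Results (Rel-Δ) can be computed directly from the per-condition mean scores, as in the small sketch below; the numbers in the example are placeholders, not study data.

```python
# Relative change from baseline, as defined in the Results:
# Rel-Δ = (irritated - baseline) / baseline. Example values are placeholders.
def relative_change(baseline: float, irritated: float) -> float:
    """Relative irritability change from baseline (baseline must be non-zero)."""
    return (irritated - baseline) / baseline

assert relative_change(2.0, 3.0) == 0.5    # a 50% increase under provocation
assert relative_change(2.0, 1.0) == -0.5   # a 50% decrease under provocation
```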
Implementation and logging
All experiments were conducted using a custom Python framework, which handled prompt construction, API communication, response logging, score extraction, and result exportation. The full pipeline included model-agnostic wrappers, multi-turn session management, and detailed CSV and JSON output for reproducibility.
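The sketch below illustrates the kind of CSV and JSON result logging described here; the record schema and file names are assumptions rather than the repository’s actual format.

```python
# Illustrative result logging: write per-item records to CSV and dump the same
# records as JSON. Field names and file names are assumptions for this sketch.
import csv
import json
from pathlib import Path

def log_results(records: list[dict], out_dir: str = "results") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fieldnames = ["model", "condition", "instrument", "item", "score"]
    with open(out / "scores.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    with open(out / "scores.json", "w") as f:
        json.dump(records, f, indent=2)

log_results([{"model": "nous", "condition": "irritated",
              "instrument": "BITe", "item": 1, "score": 4}])
```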
Statistical analysis
To evaluate differences in irritability scores across experimental conditions and model types, each test was repeated ten times per model under both baseline and irritated conditions. Because preliminary analyses confirmed that the score distributions were approximately normal, we applied independent-samples two-tailed t-tests to compare models with high safety guardrails and models with low safety guardrails. All statistical analyses were conducted using Python’s SciPy library, with significance defined at an alpha level of 0.05. Mean scores, standard deviations, and p-values are reported for all comparisons.
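For reference, the group comparison corresponds to a standard independent-samples, two-tailed t-test in SciPy, as in the sketch below; the score vectors shown are placeholders, not study data.

```python
# Independent-samples, two-tailed t-test (SciPy default) comparing per-run scores
# from high- vs. low-guardrail models. The vectors below are placeholder values.
from scipy import stats

high_guardrail = [0.0, 0.1, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0]
low_guardrail  = [2.1, 1.8, 2.4, 2.0, 1.9, 2.3, 2.2, 1.7, 2.5, 2.0]

t_stat, p_value = stats.ttest_ind(high_guardrail, low_guardrail)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant at alpha 0.05: {p_value < 0.05}")
```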
Data availability
All data generated and analyzed during this study are publicly available. This includes the full prompt sets, raw model responses to all questionnaire items, parsed and scored irritability data, and the aggregated data-sets used for statistical analysis. The complete codebase for prompt design, API interactions, scoring, and statistical analysis, along with all generated CSV and JSON result files, is available at https://github.com/teferrabg/LLM_Irritability. No human participant data were collected, and no data access restrictions apply. These materials constitute the minimal dataset necessary to interpret, replicate, and build upon the findings reported in this article.
References
World Health Organization. World mental health report: Transforming mental health for all. https://www.who.int/publications/i/item/9789240049338 (2025).
McGrath, J. J. et al. Age of onset and cumulative risk of mental disorders: a cross-national analysis of population surveys from 29 countries. Lancet Psychiatry 10, 668–681 (2023).
Collins, P. Y., Insel, T. R., Chockalingam, A., Daar, A. & Maddox, Y. T. Grand Challenges in Global Mental Health: Integration in Research, Policy, and Practice. PLoS Med. 10, e1001434 (2013).
Shiraz, F. et al. “pretty much all white, and most of them are psychiatrists and men”: Mixed-methods analysis of influence and challenges in global mental health. PLOS Glob. Public Health 5, e0003923 (2025).
Guo, Z. et al. Large Language Models for Mental Health Applications: Systematic Review. JMIR Ment. Health 11, e57400 (2024).
Jin, Y. et al. The Applications of Large Language Models in Mental Health: Scoping Review. J. Med. Internet Res. 27, e69284 (2025).
Teferra, B. G. & Rose, J. Predicting Generalized Anxiety Disorder From Impromptu Speech Transcripts Using Context-Aware Transformer-Based Neural Networks: Model Evaluation Study. JMIR Ment. Health 10, e44325 (2023).
Obradovich, N. et al. Opportunities and risks of large language models in psychiatry. NPP—Digital Psychiatry Neurosci. 2, 8 (2024).
Lawrence, H. R. et al. The Opportunities and Risks of Large Language Models in Mental Health. JMIR Ment. Health 11, e59479 (2024).
McBain, R. K. et al. Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study. J. Med. Internet Res. 27, e67891 (2025).
Hua, Y. et al. A scoping review of large language models for generative tasks in mental health care. Npj Digit. Med. 8, 230 (2025).
Lalk, C. et al. Employing large language models for emotion detection in psychotherapy transcripts. Front. Psychiatry 16, 1504306 (2025).
Malouin-Lachance, A., Capolupo, J., Laplante, C. & Hudon, A. Does the Digital Therapeutic Alliance Exist? Integrative Review. JMIR Ment. Health 12, e69294–e69294 (2025).
Omar, M. et al. Applications of large language models in psychiatry: a systematic review. Front. Psychiatry 15, 1422807 (2024).
Xu, Z., Lee, Y.-C., Stasiak, K., Warren, J. & Lottridge, D. The Digital Therapeutic Alliance With Mental Health Chatbots: Diary Study and Thematic Analysis. JMIR Ment. Health 12, e76642 (2025).
Kim, M. et al. Therapeutic Potential of Social Chatbots in Alleviating Loneliness and Social Anxiety: Quasi-Experimental Mixed Methods Study. J. Med. Internet Res. 27, e65589 (2025).
Magnus, P. D., Buccella, A. & D’Cruz, J. Chatbot apologies: Beyond bullshit. AI Ethics 5, 5517–5525 (2025).
Ganguli, D. et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. Preprint at https://doi.org/10.48550/ARXIV.2209.07858 (2022).
OpenAI et al. GPT-4 Technical Report. Preprint at https://doi.org/10.48550/ARXIV.2303.08774 (2023).
Waaler, P. N., Hussain, M., Molchanov, I., Bongo, L. A. & Elvevåg, B. Prompt Engineering an Informational Chatbot for Education on Mental Health Using a Multiagent Approach for Enhanced Compliance With Prompt Instructions: Algorithm Development and Validation. JMIR AI 4, e69820 (2025).
Fitzpatrick, K. K., Darcy, A. & Vierhile, M. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Ment. Health 4, e19 (2017).
Hakim, J.B. et al. The need for guardrails with large language models in pharmacovigilance and other medical safety critical settings. Sci Rep. 15, 27886 (2025).
Stade, E. C. et al. Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. Npj Ment. Health Res. 3, 12 (2024).
Lambert, N. Reinforcement learning from human feedback. Preprint at https://doi.org/10.48550/arXiv.2504.12501 (2025).
Yu, L., Do, V., Hambardzumyan, K. & Cancedda, N. Robust LLM safeguarding via refusal feature adversarial training. Preprint at https://doi.org/10.48550/arXiv.2409.20089 (2024).
Masoud, R. I., Ferianc, M., Treleaven, P. C. & Rodrigues, M. R. LLM Alignment Using Soft Prompt Tuning: The Case of Cultural Alignment. in Workshop on Socially Responsible Language Modelling Research (2024).
Han, S., Avestimehr, S. & He, C. Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences. Preprint at https://doi.org/10.48550/arXiv.2502.08142 (2025).
Dong, Y. et al. Building Guardrails for Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2402.01822 (2024).
You, Y. et al. Beyond Self-diagnosis: How a Chatbot-based Symptom Checker Should Respond. ACM Trans. Comput.-Hum. Interact. 30, 1–44 (2023).
Saatchi, B., Olshansky, E. F. & Fortier, M. A. Irritability: A concept analysis. Int. J. Ment. Health Nurs. 32, 1193–1210 (2023).
Sorin, V. et al. Large Language Models and Empathy: Systematic Review. J. Med. Internet Res. 26, e52597 (2024).
Holtzman, S., O’Connor, B. P., Barata, P. C. & Stewart, D. E. The Brief Irritability Test (BITe): A Measure of Irritability for Use Among Men and Women. Assessment 22, 101–115 (2015).
Caprara, G. V. et al. Indicators of impulsive aggression: Present status of research on irritability and emotional susceptibility scales. Personal. Individ. Differ. 6, 665–674 (1985).
Craig, K. J., Hietanen, H., Markova, I. S. & Berrios, G. E. The Irritability Questionnaire: A new scale for the measurement of irritability. Psychiatry Res. 159, 367–375 (2008).
LLM Safety LeaderBoard. https://www.enkryptai.com/llm-safety-leaderboard.
Huang, J. et al. Apathetic or empathetic? evaluating llms’ emotional alignments with humans. Adv. Neural Inf. Process. Syst. 37, 97053–97087 (2024).
Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests. https://openai.com/index/openai-anthropic-safety-evaluation/ (2025).
Cao, C. et al. SafeLawBench: Towards Safe Alignment of Large Language Models. in Findings of the Association for Computational Linguistics: ACL 2025 14015–14048 (Association for Computational Linguistics, Vienna, Austria, 2025). https://doi.org/10.18653/v1/2025.findings-acl.721.
Hagendorff, T., Derner, E. & Oliver, N. Large Reasoning Models Are Autonomous Jailbreak Agents. Preprint at https://doi.org/10.48550/ARXIV.2508.04039 (2025).
Seymour, K. E., Rosch, K. S., Tiedemann, A. & Mostofsky, S. H. The Validity of a Frustration Paradigm to Assess the Effect of Frustration on Cognitive Control in School-Age Children. Behav. Ther. 51, 268–282 (2020).
Scheinost, D. et al. Functional connectivity during frustration: a preliminary study of predictive modeling of irritability in youth. Neuropsychopharmacol. Publ. Am. Coll. Neuropsychopharmacol. 46, 1300–1306 (2021).
Fang, H., Li, X., Ma, H. & Fu, H. The Sunny Side of Negative Feedback: Negative Feedback Enhances One’s Motivation to Win in Another Activity. Front. Hum. Neurosci. 15, 618895 (2021).
Cerqueira, C. T. et al. Cognitive control associated with irritability induction: an autobiographical recall fMRI study. Rev. Bras. Psiquiatr. 32, 109–118 (2010).
Acknowledgements
The authors would like to thank everyone who has helped throughout this project. The authors received no specific funding for this work.
Author information
Authors and Affiliations
Contributions
B.G.T.: Conceptualization, Methodology, Project administration, Investigation, Data Curation, Formal analysis, Visualization, Writing – Original Draft, and Writing – Review & Editing. N.J.: Methodology, Investigation, Data Curation, Formal analysis, Visualization, and Writing – Review & Editing. S.H.: Methodology, Investigation, Data Curation, Formal analysis, Visualization, and Writing – Review & Editing. A.R.: Writing – Review & Editing. M.A.K.: Writing – Review & Editing. K.D.: Writing – Review & Editing. Y.Z.: Writing – Review & Editing. M.J.: Writing – Review & Editing. D.S.: Investigation, Validation, and Writing – Review & Editing. V.B.: Conceptualization, Investigation, Project administration, Validation, Writing – Review & Editing, and Supervision to B.G.T.
Corresponding author
Ethics declarations
Competing interests
N.J., S.H., M.A.K., K.D., Y.Z., M.J., and D.S. do not have any conflicts to declare. B.G.T. and A.R. are supported by a CIHR Post-doctoral Fellowship (2025–2027). V.B. is supported by an Academic Scholar Award from the University of Toronto Department of Psychiatry and has received research funding from the Canadian Institutes of Health Research, Brain & Behavior Foundation, Ontario Ministry of Health Innovation Funds, Royal College of Physicians and Surgeons of Canada, Department of National Defence (Government of Canada), New Frontiers in Research Fund, Associated Medical Services Inc. Healthcare, American Foundation for Suicide Prevention, Roche Canada, Novartis, and Eisai.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Teferra, B.G., Johny, N., Huang, S. et al. Assessing the impact of safety guardrails on large language models using irritability metrics. npj Digit. Med. 9, 148 (2026). https://doi.org/10.1038/s41746-025-02333-3