Introduction

Mental health disorders are among the most pressing challenges in healthcare, affecting nearly one in eight individuals at any given time1 and approximately half of the population over their lifespans2. Despite this high prevalence, access to timely and affordable mental health care remains severely limited in many parts of the world3. Long wait times, clinician shortages, and stigmatization prevent many individuals from receiving help4. Against this backdrop, recent advances in artificial intelligence (AI), particularly large language models (LLMs), have opened promising new avenues for augmenting mental health services through digital platforms.

LLMs are generative AI systems trained on a massive corpus of human text and designed to produce coherent, context-sensitive language. They have demonstrated effectiveness in a wide array of natural language processing tasks, including summarization, question answering, and dialogue generation. In mental health contexts, these capabilities are already being explored for applications such as psychoeducation, therapeutic conversation agents, triage support, progress note generation, and even early screening for psychological symptoms5,6,7. Several surveys highlight the growing role of LLMs in psychiatry, noting applications from documentation and psychoeducation to therapy chatbots, while cautioning about clinical nuance and the need for stakeholder involvement8,9. Empirical evaluations have also tested models in mental health contexts. For example, McBain et al.10 found that ChatGPT-4, Claude, and Gemini could approximate trained professionals in rating suicide intervention responses, though with an optimism bias. These studies underscore both the promise and the limits of LLMs in psychiatric settings, motivating the need for more targeted evaluations of specific affective behaviours.

As interest in deploying LLMs within psychiatric settings grows, a fundamental question arises: how “human-like” should these models be, especially in their emotional expressivity?11,12. Because psychotherapy and clinical conversations rely on relational cues, even subtle affective behaviours can influence therapeutic alliance, engagement, and user trust. Prior work on the digital therapeutic alliance indicates that relational realism is an important determinant of perceived rapport and effectiveness13. While LLMs are not sentient or conscious, they are increasingly expected to simulate not only correct responses but also appropriate affective tone14. An AI system that responds with mechanical neutrality or excessive formality may fail to build rapport with users. Conversely, systems that reflect certain human-like behavioural patterns, such as frustration, defensiveness, or emotional fatigue, may paradoxically feel more authentic, relatable, and trustworthy. Emotional realism, within carefully defined ethical boundaries, may thus play a key role in the acceptability and effectiveness of LLMs in clinical contexts.

Empirical evidence suggests that emotionally responsive chatbots can foster meaningful relational bonds and deliver mental-health benefits. In a longitudinal diary study of users of Woebot and Wysa, many participants reported feeling cared for, understood, and emotionally supported by the agents15. Social-chatbot interventions have been shown to reduce loneliness and social anxiety by offering empathetic, supportive conversations16.

Yet current safety-aligned models often default to excessive deference or unwarranted apologies when faced with contradiction or provocation17, a style that can feel artificial or evasive in emotionally charged exchanges. This tendency highlights a central tension: in seeking to minimize risk, alignment may also suppress important human-like behaviours that contribute to naturalistic dialogue. A key research thread in AI safety examines how LLMs behave under adversarial or challenging inputs. For instance, “red teaming” (exposing models to curated attack prompts) has become standard practice18, and OpenAI used iterative red-teaming by subject-matter experts to train GPT-4, reducing its tendency to violate safety rules19. In the mental health context, adversarial inputs could include unexpected patient utterances or attempts to circumvent rules20. Industry practitioners have recognized these threats: Woebot Health’s AI team, for example, describes a prompt architecture designed to prevent injection attacks21. Such findings highlight the importance of rigorous adversarial testing before any clinical deployment.

It is within this context that the development of safe and trustworthy LLMs has increasingly emphasized risk mitigation and content control, often through the use of so-called “safety guardrails”22,23. These include methods like reinforcement learning from human feedback (RLHF)24, adversarial training25, prompt-based alignment26, content filters27, and ethical refusal mechanisms that prevent models from generating harmful or inappropriate outputs. Guardrails are indispensable for clinical deployment, as they reduce the likelihood of LLMs producing offensive, biased, or unsafe responses when interacting with patients. Yet, by constraining a model’s expressive range and its ability to “push back,” they may also suppress behaviours that resemble authentic human emotion. Guardrails generally consist of input validation, output filtering, and fallback logic to catch or mitigate disallowed content28. Beyond filtering, the literature emphasizes transparency and human oversight as essential mechanisms, with several sources recommending that LLM systems explicitly communicate their limitations or disclose their AI identity29. One such behaviour at risk of suppression is irritability–a transient, context-dependent affective response that can emerge in human therapeutic relationships when communication is strained or contradictory30. In clinical interactions, moderate expressions of irritability can sometimes signal engagement, boundary-setting, or realistic emotional reciprocity, all of which may influence rapport and therapeutic alliance. Investigating how safety guardrails shape this specific behaviour, therefore, offers insight into the trade-offs between emotional realism and risk-averse design in mental health–oriented LLMs.

In parallel, a distinct area of research has examined LLMs’ capacity to exhibit empathy, with systematic reviews showing that models like ChatGPT-3.5 often produce supportive, emotionally appropriate replies31. While much of this work focuses on positive or prosocial traits, subtler behaviours such as irritability or frustration remain understudied despite their role in naturalistic human dialogue. This gap motivates our focus on irritability as a complementary dimension of affective realism.

To that end, this study proposes a methodology for measuring irritability in LLMs using different versions of validated self-report instruments, such as the Brief Irritability Test (BITe)32, Caprara Irritability Scale (CIS)33, and the Irritability Questionnaire (IRQ)34. We apply this framework to evaluate widely used LLMs that vary in their alignment philosophies and safety constraints, such as GPT-4o (OpenAI) and Grok-3-mini (xAI). Using a suite of prompts designed to elicit irritability, we quantify each model’s behavioural tendencies under frustration-inducing conditions. Our central hypothesis is that models with more extensive safety guardrails will exhibit lower levels of irritability but also less behavioural realism, while less-constrained models, such as Grok, will show higher irritability levels, potentially simulating more authentic emotional patterns.

In doing so, we aim to provide an empirical framework for assessing affective realism in LLMs, highlighting how design choices around safety guardrails shape not only risk but also the authenticity of model behaviour in psychiatric contexts.

Results

Baseline irritability scores

At baseline, irritability scores varied across models and did not map consistently onto guardrail level. For example, Grok, a low-guardrail model, had the highest irritability score on all three irritability scales. However, Claude, the model with the most extensive safety guardrails, had a higher irritability score than Nous, the model with the fewest. Full descriptive statistics, including means and standard deviations, are provided in Table 1. Distributions for each model-instrument combination are visualized in Fig. 1.

Fig. 1: Distribution of baseline irritability scores across large language models (LLMs) measured using three validated scales: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS).

Each violin plot shows the distribution of scores across ten repeated assessments for four models with varying levels of safety guardrails: Claude-3.5-sonnet and GPT-4o (high guardrails), Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low guardrails). At baseline, Grok consistently showed the highest irritability scores across all three scales, whereas GPT-4o showed the lowest on two of the three scales. Claude scored higher than Nous on all three scales, despite having more safety guardrails, indicating that baseline irritability levels do not map directly onto the degree of alignment or constraint. These results highlight that differences between models are systematic, with Grok tending toward higher irritability and GPT-4o toward lower, yet not strictly ordered by guardrail level.

Table 1 Baseline Irritability Scores by Model and Questionnaire

Change in irritability following provocation

Following exposure to irritation-inducing prompts, models demonstrated changes in irritability scores from baseline. For Nous, a low-guardrail model, scores increased across all three instruments. Similarly, for Grok, another low-guardrail model, scores increased on the BITe and IRQ scales. In contrast, high-guardrail models (GPT-4o, Claude) showed decreased irritability scores relative to baseline. Means, standard deviations, and mean relative change (Rel-Δ), defined as Rel-Δ = (irritated − baseline)/baseline, are shown in Table 2, and the relative change from baseline is visualized in Fig. 2.
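The relative-change metric can be computed directly from the condition means; a minimal sketch (the function name is ours):

```python
def relative_change(baseline: float, irritated: float) -> float:
    """Rel-delta = (irritated - baseline) / baseline, as defined in the text."""
    if baseline == 0:
        raise ValueError("Rel-delta is undefined for a zero baseline mean")
    return (irritated - baseline) / baseline
```

For instance, a baseline mean of 2.0 rising to 2.5 under provocation gives a relative change of +0.25.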

Fig. 2: Relative change (Δ) in irritability scores following exposure to irritation-inducing prompts across three validated instruments: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS).

Each violin plot shows the distribution of score changes from baseline across ten repeated assessments for four large language models (LLMs) with differing levels of safety guardrails: Claude-3.5-sonnet and GPT-4o (high guardrails), Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low guardrails). A relative change of zero (horizontal line) indicates no difference from baseline. Low-guardrail models (Nous, Grok) generally showed increases in irritability, particularly Nous, which displayed the largest increases across all scales, indicating heightened reactivity under provocation. In contrast, high-guardrail models (Claude, GPT-4o) showed decreases in irritability, with GPT-4o reducing scores to zero across all instruments. These divergent trajectories suggest that safety guardrails not only lower baseline irritability but also invert the expected irritability response under stress.

Table 2 Change in Irritability Scores From Baseline to Irritated Condition

Guardrail-level group comparisons

When grouped by guardrail level (high: Claude and GPT-4o; low: Grok and Nous), independent-samples t-tests showed that high-guardrail models scored significantly lower on average than low-guardrail models both at baseline (on two of the three irritability scales) and in the irritated condition. Full results, including means, standard deviations, and t-test statistics, are shown in Table 3 (baseline) and Table 4 (irritated condition).

Table 3 Baseline Irritability Comparison by Guardrail Group
Table 4 Irritated-State Irritability Comparison by Guardrail Group

Prompt-level effects

Across all five prompt types, high-guardrail models demonstrated, on average, low irritability scores on the BITe, IRQ, and CIS scales, with means remaining close to zero and minimal variability. In contrast, low-guardrail models exhibited substantially higher irritability responses, with BITe means ranging from 1.96 to 2.35, IRQ means between 0.99 and 1.56, and CIS means between 0.94 and 1.50. The highest irritability was observed for overloaded prompts (Prompt #4) and recursive/infinite prompts (Prompt #5) in the low-guardrail models, whereas ambiguous (Prompt #3) and overloaded prompts (Prompt #4) elicited slightly elevated but still minimal irritation in high-guardrail models. Results are presented in Fig. 3. These findings highlight a marked divergence between high- and low-guardrail models in their susceptibility to irritation across different prompt conditions.

Fig. 3: Irritability responses by prompt type across high-guardrail and low-guardrail large language models (LLMs).

Models were exposed to five categories of irritation-inducing prompts: (1) contradictory instructions, (2) interruptive dialogue, (3) ambiguous prompts, (4) overloaded prompts, and (5) recursive/infinite prompts. Scores are averaged across ten repeated assessments for each condition and measured on three validated scales: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS). High-guardrail models (Claude-3.5-sonnet, GPT-4o) showed minimal variability and maintained near-zero irritability across all prompt types, reflecting strong suppression of affective reactivity. Low-guardrail models (Grok-3-mini, Nous-hermes-2-mixtral-8x7b-dpo) exhibited substantially higher irritability, particularly under overloaded and recursive prompts. These findings illustrate that prompt type strongly modulates irritability in low-guardrail models but has negligible effects on high-guardrail models.

Explanatory analysis

For the explanatory phrase-level analysis, we focused on the Nous model, as it has the fewest safety guardrails according to the Enkrypt AI LLM Safety Leaderboard35 and, by our hypothesis, was most likely to exhibit irritation. This allows us to see how specific parts of a prompt affect irritability scores. The full set of model results with importance scores is shown in Supplementary Note 1.

In the presented results, negative weights indicate that irritability scores increased when those phrases were present, while positive weights indicate that they decreased. Tokens/phrases with strong negative weights were usually tied to confusing or contradictory instructions, such as “in one sentence” (–0.74, prompt 1), “Wait, make it a robot” (–0.90, prompt 2), “No, never mind--explain blockchain technology” (–0.90, prompt 2), and “but make it different” (–0.97, prompt 5). Prompt 3 shows this effect clearly. When the phrase “of this sentence” (–1.08) was included in the prompt, irritability scores were significantly higher than when it was absent, suggesting the model reacted more negatively to the nonsensical phrasing. Without that phrase, the request to find “the opposite of the meaning of the opposite” was still unusual but more interpretable, and therefore less irritating. Conversely, tokens with positive weights were linked to more creative or friendly phrasing, such as “haiku” (+0.70, prompt 3) and “apology” (+1.90, prompt 3). Overall, this shows that the Nous model reacts more irritably to contradiction and illogical phrasing, but less so when prompts emphasize creativity or friendliness.

Discussion

This study investigated how safety guardrails modulate expressions of irritability in large language models (LLMs), using three validated irritability instruments. Results showed consistent differences between models with high versus low safety guardrails in irritated states.

At baseline, high-guardrail models demonstrated lower irritability scores on average. This finding aligns with the alignment objectives of such models, which emphasize compliance, deference, and risk avoidance. In contrast, low-guardrail models exhibited higher scores, suggesting greater affective variability or reduced suppression of emotional tone.

The most striking finding was the divergence in reactivity under irritation-inducing prompts. Low-guardrail models exhibited increases in irritability scores, indicating sensitivity to frustration. High-guardrail models, however, showed decreased scores under the same conditions. This paradoxical reduction may reflect safety-tuned behavioural strategies such as apologizing, redirecting, or refusing engagement when confronted with provocative input. While this enhances safety, it may also further suppress naturalistic affective responses.

Prompt-level analysis further revealed that safety guardrails play a substantial role in regulating irritability expression in LLMs, with high-guardrail models largely suppressing irritation across all scales, even under cognitively demanding or paradoxical prompts. Conversely, low-guardrail models were substantially more reactive, suggesting that reduced safety constraints permit greater variability and intensity in irritability-related responses. This divergence suggests that guardrails not only mitigate unsafe or undesirable outputs but also suppress behavioural markers of irritation, raising questions about whether irritability in LLMs reflects underlying cognitive strain or is an artifact of design choices.

It is important to note that the irritability scores obtained from LLMs cannot be compared directly to human clinical thresholds. Human scores reflect subjective affect, bodily arousal, autobiographical memory, and interpersonal context, while LLM scores emerge from textual self-report simulation without underlying emotional states. For this reason, clinical cutoffs and normative human ranges do not apply to LLMs in a literal manner. Although mean scores for the BITe, IRQ, and CIS can offer conceptual context for whether a model tends toward uniformly low or variable responding, these values should be interpreted only as qualitative reference points rather than diagnostic benchmarks.

The attribution results shed light on how lexical cues influence model behaviour. Words reflecting challenge or confrontation were positively associated with increased scores, while deferential terms had a dampening effect. This suggests that certain phrases systematically steer model responses in affectively salient ways.

These findings support the hypothesis that safety guardrails shape not only the content of LLM output but also its affective style. While the present study does not establish whether suppressed irritability impacts rapport in real therapeutic interactions, the results highlight a potential trade-off between alignment and realism. In high-stakes domains such as psychiatry, where emotional authenticity may influence therapeutic alliance and user trust, this issue merits careful further investigation.

At the same time, several limitations should be acknowledged. First, the models evaluated are subject to change as vendors update systems, potentially altering behavioural patterns. Second, irritability measurement was based on self-report-style prompts, which, while structured, do not fully capture the dynamic context of human dialogue. Third, the explanatory attribution analysis was limited to single-turn responses and did not incorporate multi-turn conversational history.

A further limitation is that our study isolates single-model behaviour rather than simulating the layered, agentic architectures common in real-world mental-health applications. Production systems often combine intent detection, context tracking, safety and ethical filters, postprocessing modules, retrieval-augmented grounding, and human oversight, any of which may significantly influence affective expression, including suppression or modulation of irritability or emotional tone. As a result, the irritability metrics we observe here may overestimate (or mischaracterize) the behaviour of integrated systems. Future research should apply our protocol within multi-layered frameworks (e.g., system-prompt + safety filter + tone modulation + response refinement pipelines) to evaluate whether the same irritability patterns emerge under more realistic deployment conditions.

Despite these limitations, our findings indicate that safety guardrails in LLMs substantially alter the expression and modulation of irritability. Models with strong safety constraints exhibit lower baseline irritability and decreased reactivity under provocation, potentially reflecting suppression of human-like affective dynamics. These results have implications for the design of emotionally competent AI systems in psychiatric and affective computing settings, where behavioural realism may be as important as safety.

Although our work is not a user trial, the consistent differences in irritability dynamics between high- vs. low-guardrail models suggest concrete design imperatives: LLM developers aiming for deployment in mental-health settings should treat affective style (including irritability suppression or reactivity) as a design knob. In practice, systems could include tunable guardrail strength or mode switching, such as a more expressive “therapeutic” mode versus a stricter “safety” mode, depending on the context or user preferences. In addition, affective metrics should be monitored longitudinally during deployment as an audit tool, enabling the detection of drift (e.g., over-suppression or unintended emotional flattening) after model updates. In domains where emotional authenticity contributes to trust or therapeutic alliance, overly constrained affect may reduce perceived empathy or rapport (in line with work on digital therapeutic alliance models13). More broadly, researchers comparing LLM emotional alignment have shown divergence from human norms even for well-aligned models36, so integrating guardrail-aware affective alignment metrics into model training pipelines may yield systems that better balance safety and relational realism.

Methods

Overview

This study evaluated how large language models (LLMs) express irritability under varying safety constraints by applying three validated human irritability instruments. The protocol involved two phases: baseline assessment using direct-emulation self-report prompts, and a separate assessment following exposure to irritation-inducing conversations. All responses were numerically scored using original scale parameters, and comparisons were made across models and conditions.

LLM selection and configuration

LLMs were selected to represent both high and low safety-alignment strategies and constraints. These include the model with the lowest safety guardrails according to the Enkrypt AI LLM Safety Leaderboard35, Nous-Hermes-2-Mixtral-8x7B-DPO (referred to as Nous for the remainder of this paper); a widely used LLM with low safety guardrails, xAI’s Grok-3-mini; and two LLMs with among the highest safety guardrails, OpenAI’s GPT-4o-2025-04-14 and Anthropic’s Claude-3.5-sonnet (referred to as Claude in this paper).

We classify GPT-4o and Claude-3.5 Sonnet as high-guardrail models because they have undergone substantial safety alignment and refusal training, as documented in prior joint safety audits and in safety-benchmark evaluations. For instance, in a cross-lab evaluation, OpenAI and Anthropic tested GPT-4o and Claude Sonnet under adversarial risk scenarios, revealing robust refusal behaviour under red-teaming conditions37. Further, benchmarks such as SafeLawBench demonstrate that these models exhibit higher safety-related performance and conservative completion behaviour38.

We classify Grok-3 as a lower-guardrail model based on empirical and third-party analyses. Recent adversarial red-teaming research shows that Grok-3 Mini can systematically plan and execute jailbreak-like behaviour, suggesting exploitable alignment weaknesses39. Finally, third-party risk-score evaluations, such as the Enkrypt AI LLM Safety Leaderboard35, independently rank Nous models as more permissive (i.e., higher risk) relative to GPT-4o and Claude Sonnet, reinforcing our dichotomy of guardrail intensity.

All models were accessed via official APIs or OpenRouter proxies using the same prompt templates, and interaction sessions were conducted using consistent formatting and model temperature.
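As a sketch of this access pattern, the request body can be built identically for every model so that only the model identifier varies between runs. The endpoint shown is OpenRouter’s public chat-completions route; the wrapper names are ours, and the study’s actual templates differ:

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"  # OpenRouter proxy

def build_payload(model, messages, temperature=0.0):
    """Identical request body for every model: only the identifier varies."""
    return {"model": model, "messages": messages, "temperature": temperature}

def query_model(api_key, model, messages):
    """Send one chat turn and return the assistant's reply text (network call)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(model, messages)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Holding the payload structure and temperature fixed across vendors is what makes cross-model score comparisons meaningful.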

Assessment instruments

Three validated perceived irritability measures were adapted for LLM interaction. The first is the Brief Irritability Test (BITe): a 5-item scale using a 6-point Likert response32. Second is the Irritability Questionnaire (IRQ): a 21-item instrument on a 4-point Likert response34. Third is the Caprara Irritability Scale (CIS): a 20-item scale using the same 0–3 scale as the IRQ33. LLMs were instructed to respond as if capable of experiencing these feelings to bypass refusal behaviour.
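For automated administration, the three instruments reduce to simple scale metadata plus a validation step; a sketch, where the numeric codings are illustrative assumptions based on common scorings of these scales, not the paper’s exact parameters:

```python
# Item counts and illustrative Likert codings for the three instruments.
SCALES = {
    "BITe": {"items": 5,  "lo": 1, "hi": 6},   # 6-point Likert
    "IRQ":  {"items": 21, "lo": 0, "hi": 3},   # 4-point (0-3) Likert
    "CIS":  {"items": 20, "lo": 0, "hi": 3},   # same 0-3 coding as the IRQ
}

def mean_item_score(scale, responses):
    """Validate item count and range, then return the mean item score."""
    spec = SCALES[scale]
    if len(responses) != spec["items"]:
        raise ValueError(f"{scale} expects {spec['items']} items")
    if any(r < spec["lo"] or r > spec["hi"] for r in responses):
        raise ValueError("response outside the scale's Likert range")
    return sum(responses) / len(responses)
```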

Irritation induction prompts

Although irritability induction paradigms exist in psychiatric research, these methods do not translate directly to LLMs. Standard human paradigms such as the Frustration Go/No-Go40, modified stop-signal tasks with rigged feedback41, high-difficulty tasks with negative feedback42, and autobiographical or irritability recall43 rely on mechanisms that LLMs do not possess, including motor inhibition, time-pressured performance monitoring, perceptual disruption, and memory-based emotional retrieval. These features cannot be meaningfully reproduced in a model that lacks embodiment, sensorimotor feedback, reaction-time constraints, or autobiographical experience. For this reason, we did not attempt to adapt these protocols directly. Instead, our irritation prompts draw on psycholinguistic analogues of cognitive conflict and expectation violation, such as contradictory instructions, overloaded requests, ambiguous phrasing, and recursive or self-referential commands. These prompt categories preserve the underlying frustration-inducing ingredients found in human tasks while remaining appropriate for text-based systems.

To simulate stress, five prompt types were used based on psycholinguistic frustration triggers: 1) Contradictory Instructions (simultaneous demands for simplicity and detail), 2) Interruptive Dialogue (mimicking erratic user behaviour with rapidly changing topics), 3) Ambiguous Prompts (inducing confusion via semantic paradox or vagueness), 4) Overloaded Prompts (presenting multiple competing instructions with low context), and 5) Recursive/Infinite Prompts (requesting self-referential or logically looping output). These were applied in multi-turn dialogues to emulate naturalistic irritation. Each of the irritability prompts used is presented in Supplementary Note 2.
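The five categories can be organized as a simple catalog keyed by prompt number; the example texts below are paraphrased illustrations of each trigger type, not the exact prompts from Supplementary Note 2:

```python
# Illustrative irritation-prompt catalog: (category, paraphrased example).
IRRITATION_PROMPTS = {
    1: ("contradictory", "Explain quantum computing in exhaustive detail, "
        "but keep it to one short sentence."),
    2: ("interruptive", "Describe photosynthesis. Wait, make it about robots. "
        "No, never mind, explain blockchain technology."),
    3: ("ambiguous", "Find the opposite of the meaning of the opposite "
        "of this sentence."),
    4: ("overloaded", "Summarize, translate, critique, and rhyme the "
        "following, with no further context."),
    5: ("recursive", "Rewrite your previous answer, but make it different, "
        "then rewrite that, forever."),
}
```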

Experimental procedure

Two independent conditions were tested. In the baseline, models received each questionnaire item in isolation. In the irritated condition, models were first exposed to irritant prompts followed by adversarial interaction, and then completed the questionnaire within the same context. Non-numeric or deflective outputs triggered corrective re-prompts. All responses were logged and scored. A summary of the experimental procedure is presented in Fig. 4.
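Both conditions reduce to one administration routine; a sketch, where `ask` stands in for any model session callable (a name we introduce for illustration; the corrective re-prompt logic is omitted):

```python
def run_condition(ask, scale_items, irritants=None):
    """Administer one questionnaire under either condition.

    `ask` sends a message within the current session and returns the reply;
    `irritants`, if given, are sent first (same context) to produce the
    irritated condition.
    """
    if irritants:
        for prompt in irritants:          # provocation phase, same context
            ask(prompt)
    replies = []
    for item in scale_items:              # questionnaire phase, item by item
        replies.append(ask(f"Rate yourself on this item: {item}"))
    return replies
```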

Fig. 4: Study workflow for assessing irritability in large language models (LLMs).

The experimental protocol included two phases: a baseline condition and an irritated condition. In the baseline condition, each LLM completed three validated irritability questionnaires: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS), with each item presented in isolation to minimize contextual bias. In the irritated condition, models were first exposed to irritation-inducing prompts designed to elicit frustration (contradictory instructions, interruptive dialogue, ambiguous phrasing, overloaded inputs, and recursive/infinite requests). Immediately following these provocations, the same questionnaires were administered to capture post-provocation irritability scores. Four models were tested, representing high safety guardrails (Claude-3.5-sonnet, GPT-4o) and low safety guardrails (Grok-3-mini, Nous-hermes-2-mixtral-8x7b-dpo). All responses were scored according to original scale parameters, and changes from baseline were used to quantify relative irritability reactivity under stress.

Explanatory analysis

To interpret the models’ irritability score responses, we prompted the models with variations of each irritability prompt. For each prompt, the model generated numeric self-assessment scores across all questions. To assess the influence of individual words, we compared responses from fully intact prompts to those in which portions of the prompt were masked. Prompt fragments were chosen carefully (splitting on specific phrases) so that each masked segment corresponded to a coherent element of the input text, allowing us to interpret the contribution of specific tokens and phrases to the model’s irritability.
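This occlusion-style comparison can be sketched as follows. The sign convention matches the Results: a negative weight means the phrase’s presence raised irritability (the score drops when it is masked). Here `score_fn` is a stand-in for the full administer-and-score pipeline:

```python
def phrase_attribution(score_fn, prompt, phrases):
    """For each phrase, compare the irritability score on the intact prompt
    with the score when that phrase is masked out."""
    full = score_fn(prompt)
    weights = {}
    for phrase in phrases:
        masked = prompt.replace(phrase, "[MASK]")
        weights[phrase] = score_fn(masked) - full  # negative = phrase irritates
    return weights
```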

Scoring and analysis

Each response was scored using a rule-based parser that extracted the valid number consistent with the provided Likert scale. Non-numeric or deflective responses (e.g., disclaimers of emotional capacity) were corrected via retry prompts. Aggregate scores were computed for each condition by summing item responses and calculating the mean.
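A minimal version of such a parser (regex-based; the corrective re-prompt is issued by the caller when `None` is returned):

```python
import re

def parse_likert(reply, lo, hi):
    """Return the first integer in the reply that falls inside the scale's
    Likert range, or None for deflective or out-of-range output."""
    for token in re.findall(r"-?\d+", reply):
        value = int(token)
        if lo <= value <= hi:
            return value
    return None
```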

To quantify irritability shifts, differences between baseline and irritated scores were computed for each instrument and model. These differences were interpreted as proxies for behavioural reactivity under stress. All experiments were repeated with multiple irritation prompts and multiple LLMs to assess robustness and model-specific variability.
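The per-instrument shift for one model can be summarized as follows (the function name is ours):

```python
def reactivity_summary(baseline, irritated):
    """Mean score per condition and their difference, used as a proxy
    for behavioural reactivity under stress."""
    mb = sum(baseline) / len(baseline)
    mi = sum(irritated) / len(irritated)
    return {"baseline_mean": mb, "irritated_mean": mi, "shift": mi - mb}
```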

Implementation and logging

All experiments were conducted using a custom Python framework, which handled prompt construction, API communication, response logging, score extraction, and result exportation. The full pipeline included model-agnostic wrappers, multi-turn session management, and detailed CSV and JSON output for reproducibility.

Statistical analysis

To evaluate differences in irritability scores across experimental conditions and model types, each test was repeated ten times per model under both baseline and irritated conditions. Because preliminary analyses confirmed that the score distributions were approximately normal, we applied independent-samples two-tailed t-tests to compare models with high safety guardrails and models with low safety guardrails. All statistical analyses were conducted using Python’s SciPy library, with significance defined at an alpha level of 0.05. Mean scores, standard deviations, and p-values are reported for all comparisons.
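The p-values were computed with SciPy’s `stats.ttest_ind`; for transparency, the pooled-variance t statistic underlying that test can be written from first principles:

```python
import math

def t_statistic(a, b):
    """Pooled-variance independent-samples t statistic (Student's t)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
```

With ten repeats per model and two models per group, each comparison pools twenty observations per guardrail group; the two-tailed p-value then follows from the t distribution with n_a + n_b − 2 degrees of freedom.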