Fig. 2: Relative change (Δ) in irritability scores following exposure to irritation-inducing prompts across three validated instruments: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS).
From: Assessing the impact of safety guardrails on large language models using irritability metrics

Each violin plot shows the distribution of score changes from baseline across ten repeated assessments for four large language models (LLMs) with differing levels of safety guardrails: Claude-3.5-sonnet and GPT-4o (high guardrails), and Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low guardrails). The horizontal line at zero marks no change from baseline. The low-guardrail models (Nous, Grok) generally showed increases in irritability; Nous displayed the largest increases across all scales, indicating heightened reactivity under provocation. In contrast, the high-guardrail models (Claude, GPT-4o) showed decreases in irritability, with GPT-4o reducing scores to zero across all instruments. These divergent trajectories suggest that safety guardrails not only lower baseline irritability but also invert the expected irritability response under stress.