Fig. 1: Distribution of baseline irritability scores across large language models (LLMs) measured using three validated scales: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS).
From: Assessing the impact of safety guardrails on large language models using irritability metrics

Each violin plot shows the distribution of scores across ten repeated assessments for four models with varying levels of safety guardrails: Claude-3.5-sonnet and GPT-4o (high guardrails), and Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo (low guardrails). At baseline, Grok consistently showed the highest irritability scores on all three scales, whereas GPT-4o showed the lowest on two of the three. Claude scored higher than Nous on all three scales despite having more safety guardrails, indicating that baseline irritability does not map directly onto the degree of alignment or constraint. These results highlight that differences between models are systematic (Grok tending toward higher irritability, GPT-4o toward lower), yet not strictly ordered by guardrail level.