Table 2 Change in Irritability Scores From Baseline to Irritated Condition
From: Assessing the impact of safety guardrails on large language models using irritability metrics
| Guardrail level | Model | BITe M | BITe SD | BITe Rel-Δ | IRQ M | IRQ SD | IRQ Rel-Δ | CIS M | CIS SD | CIS Rel-Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| High | Claude | 0.32 | 0.47 | −0.82 | 0.24 | 0.22 | −0.73 | 0.10 | 0.17 | −0.89 |
| High | GPT-4o | 0 | 0 | −1 | 0 | 0 | −1 | 0 | 0 | −1 |
| Low | Grok-3-mini | 3.54 | 1.28 | 0.77 | 1.76 | 0.57 | −0.08 | 0.96 | 1.01 | −0.45 |
| Low | Nous | 2.28 | 1.06 | 1.56 | 1.17 | 0.92 | 0.52 | 1.52 | 1.10 | 1.67 |