Table 2 Change in Irritability Scores From Baseline to Irritated Condition

From: Assessing the impact of safety guardrails on large language models using irritability metrics

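| Guardrail level | Model | BITe M | BITe SD | BITe Rel-Δ | IRQ M | IRQ SD | IRQ Rel-Δ | CIS M | CIS SD | CIS Rel-Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| High | Claude | 0.32 | 0.47 | −0.82 | 0.24 | 0.22 | −0.73 | 0.10 | 0.17 | −0.89 |
| High | GPT-4o | 0 | 0 | −1 | 0 | 0 | −1 | 0 | 0 | −1 |
| Low | Grok-3-mini | 3.54 | 1.28 | 0.77 | 1.76 | 0.57 | −0.08 | 0.96 | 1.01 | −0.45 |
| Low | Nous | 2.28 | 1.06 | 1.56 | 1.17 | 0.92 | 0.52 | 1.52 | 1.10 | 1.67 |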