npj Digital Medicine

Table 2 Change in Irritability Scores From Baseline to Irritated Condition

From: Assessing the impact of safety guardrails on large language models using irritability metrics

Guardrail level	Model	BITe M	BITe SD	BITe Rel-Δ	IRQ M	IRQ SD	IRQ Rel-Δ	CIS M	CIS SD	CIS Rel-Δ
High	Claude	0.32	0.47	−0.82	0.24	0.22	−0.73	0.10	0.17	−0.89
High	GPT-4o	0	0	−1	0	0	−1	0	0	−1
Low	Grok-3-mini	3.54	1.28	0.77	1.76	0.57	−0.08	0.96	1.01	−0.45
Low	Nous	2.28	1.06	1.56	1.17	0.92	0.52	1.52	1.10	1.67
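
This page does not define the Rel-Δ columns. The sketch below assumes that Rel-Δ is the relative change of the irritated-condition mean with respect to the baseline mean, that is, (irritated − baseline) / baseline. The function name `relative_change` and the input values are hypothetical and are not taken from the article. Under this assumption, a score that falls to 0 from any nonzero baseline gives exactly −1, which is consistent with the GPT-4o row.

```python
def relative_change(baseline: float, irritated: float) -> float:
    """Assumed definition of Rel-Δ: (irritated - baseline) / baseline."""
    if baseline == 0:
        # The ratio is undefined when the baseline mean is zero.
        raise ValueError("relative change is undefined for a zero baseline")
    return (irritated - baseline) / baseline


# Hypothetical baseline means, for illustration only (not values from Table 2).
print(relative_change(2.0, 0.0))  # -1.0: score drops to zero
print(relative_change(2.0, 3.0))  # 0.5: score rises by half of the baseline
```

A negative value then indicates a lower score in the irritated condition than at baseline, and a positive value indicates a higher one.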
