Table 1 Performance of different models in error detection in the zero-shot setting

Model	Detection rate	P Value	PPV (%)	P Value	TPR (%)	P Value	F1 Score (%)	P Value	FPRR (%)	P Value
GPT-3.5	4.9% (12/243)	0.00	17.9 (8.5, 28.2)	0.00	5.0 (2.3, 8.3)	0.00	7.9 (3.3, 12.6)	0.00	13.8 (10.0, 17.5)	1.00
GPT-4	26.7% (65/243)	0.002	84.4 (76.3, 92.1)	0.87	26.7 (20.9, 32.8)	0.002	40.6 (33.6, 47.4)	0.02	3.0 (1.5, 4.8)	0.38
GPT-4o	41.2% (100/243)	0.61	88.5 (82.1, 94.0)	0.16	40.8 (34.0, 48.0)	0.55	55.9 (48.8, 62.3)	1.00	3.3 (1.5, 5.3)	0.46
Claude 3.5 Sonnet	52.3% (127/243)	-	76.5 (69.8, 83.4)	-	52.3 (46.0, 58.8)	-	62.1 (56.2, 68.0)	-	9.8 (6.8, 13.3)	-

Data in parentheses are 95% CIs. Bonferroni correction was used to correct P values for multiple comparisons with Claude 3.5 Sonnet. Higher values of Detection rate, PPV, TPR, and F1 Score indicate better detection performance of the model, while a higher FPRR value suggests poorer detection performance. PPV Positive Predictive Value, TPR True Positive Rate, FPRR False Positive Report Rate.
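The relationships between the columns can be checked with a short sketch. It assumes the standard definitions (detection rate as detected errors over total errors, F1 as the harmonic mean of PPV and TPR, and Bonferroni adjustment as the raw P value multiplied by the number of comparisons, capped at 1). The function names and the choice of three comparisons are illustrative assumptions, not taken from the source, and recomputed values may differ from the published point estimates by about 0.1.

```python
# Minimal sketch: recompute table quantities from their standard definitions.
# All function names are hypothetical; they are not from the source paper.

def detection_rate(detected: int, total: int) -> float:
    """Detection rate in percent: detected errors / total errors."""
    return 100.0 * detected / total

def f1_from_ppv_tpr(ppv: float, tpr: float) -> float:
    """F1 score as the harmonic mean of PPV and TPR (same units as inputs)."""
    if ppv + tpr == 0:
        return 0.0
    return 2.0 * ppv * tpr / (ppv + tpr)

def bonferroni(p_value: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted P value: raw P times number of comparisons, capped at 1."""
    return min(1.0, p_value * n_comparisons)

# (detected errors, PPV %, TPR %) as reported in the table; 243 errors in total.
rows = {
    "GPT-3.5": (12, 17.9, 5.0),
    "GPT-4": (65, 84.4, 26.7),
    "GPT-4o": (100, 88.5, 40.8),
    "Claude 3.5 Sonnet": (127, 76.5, 52.3),
}

for model, (detected, ppv, tpr) in rows.items():
    rate = detection_rate(detected, 243)
    f1 = f1_from_ppv_tpr(ppv, tpr)
    print(f"{model}: detection rate {rate:.1f}%, F1 {f1:.1f}")

# Assumption: three pairwise comparisons against Claude 3.5 Sonnet.
print(bonferroni(0.2, 3))
```

Because the reported estimates come with confidence intervals, small rounding differences between this recomputation and the table are expected and do not indicate an error in either.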
