Table 1 Performance of different models in error detection in zero-shot setting

From: The use of large language models in detecting Chinese ultrasound report errors

| Model | Detection rate | P Value | PPV | P Value | TPR | P Value | F1 Score | P Value | FPRR | P Value |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | 4.9% (12/243) | 0.00 | 17.9 (8.5, 28.2) | 0.00 | 5.0 (2.3, 8.3) | 0.00 | 7.9 (3.3, 12.6) | 0.00 | 13.8 (10.0, 17.5) | 1.00 |
| GPT-4 | 26.7% (65/243) | 0.002 | 84.4 (76.3, 92.1) | 0.87 | 26.7 (20.9, 32.8) | 0.002 | 40.6 (33.6, 47.4) | 0.02 | 3.0 (1.5, 4.8) | 0.38 |
| GPT-4o | 41.2% (100/243) | 0.61 | 88.5 (82.1, 94.0) | 0.16 | 40.8 (34.0, 48.0) | 0.55 | 55.9 (48.8, 62.3) | 1.00 | 3.3 (1.5, 5.3) | 0.46 |
| Claude 3.5 Sonnet | 52.3% (127/243) | - | 76.5 (69.8, 83.4) | - | 52.3 (46.0, 58.8) | - | 62.1 (56.2, 68.0) | - | 9.8 (6.8, 13.3) | - |

  1. Data in parentheses are 95% CIs. Bonferroni correction was used to correct P values for multiple comparisons with Claude 3.5 Sonnet. Higher values of Detection rate, PPV, TPR, and F1 Score indicate better detection performance, while a higher FPRR value indicates poorer detection performance. PPV: Positive Predictive Value; TPR: True Positive Rate; FPRR: False Positive Report Rate.
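
The arithmetic behind some of these columns can be checked with a short sketch. The Python below is illustrative only and is not the authors' analysis code: the function names are hypothetical, F1 is recomputed as the harmonic mean of the rounded PPV and TPR point estimates from Table 1 (so the last digit may differ slightly from the published value), and the Bonferroni helper assumes three pairwise comparisons against Claude 3.5 Sonnet, which the table implies but does not state explicitly.

```python
def detection_rate(detected: int, total: int) -> float:
    """Percentage of known errors that the model flagged."""
    return 100.0 * detected / total


def f1_score(ppv: float, tpr: float) -> float:
    """Harmonic mean of PPV and TPR (both given on the same scale, here percent)."""
    if ppv + tpr == 0:
        return 0.0
    return 2.0 * ppv * tpr / (ppv + tpr)


def bonferroni(p_raw: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted P value: raw P times the number of comparisons, capped at 1."""
    return min(1.0, p_raw * n_comparisons)


# Point estimates copied from Table 1: (detected, total, PPV, TPR).
table_1 = {
    "GPT-3.5": (12, 243, 17.9, 5.0),
    "GPT-4": (65, 243, 84.4, 26.7),
    "GPT-4o": (100, 243, 88.5, 40.8),
    "Claude 3.5 Sonnet": (127, 243, 76.5, 52.3),
}

for model, (detected, total, ppv, tpr) in table_1.items():
    rate = detection_rate(detected, total)
    f1 = f1_score(ppv, tpr)
    print(f"{model}: detection rate {rate:.1f}%, F1 {f1:.1f}")

# Hypothetical raw P value, used only to show the adjustment; the table
# reports adjusted values and does not list the raw ones.
print(bonferroni(0.2, n_comparisons=3))
```

Recomputing in this way is a quick consistency check on a transcribed table: a detection rate or F1 value that disagrees by more than rounding error usually points to a mistyped cell.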