Table 1 Performance of different models in error detection in the zero-shot setting
From: The use of large language models in detecting Chinese ultrasound report errors
Model | Detection rate | P Value | PPV | P Value | TPR | P Value | F1 Score | P Value | FPRR | P Value |
---|---|---|---|---|---|---|---|---|---|---|
GPT-3.5 | 4.9% (12/243) | 0.00 | 17.9 (8.5, 28.2) | 0.00 | 5.0 (2.3, 8.3) | 0.00 | 7.9 (3.3, 12.6) | 0.00 | 13.8 (10.0, 17.5) | 1.00 |
GPT-4 | 26.7% (65/243) | 0.002 | 84.4 (76.3, 92.1) | 0.87 | 26.7 (20.9, 32.8) | 0.002 | 40.6 (33.6, 47.4) | 0.02 | 3.0 (1.5, 4.8) | 0.38 |
GPT-4o | 41.2% (100/243) | 0.61 | 88.5 (82.1, 94.0) | 0.16 | 40.8 (34.0, 48.0) | 0.55 | 55.9 (48.8, 62.3) | 1.00 | 3.3 (1.5, 5.3) | 0.46 |
Claude 3.5 Sonnet | 52.3% (127/243) | - | 76.5 (69.8, 83.4) | - | 52.3 (46.0, 58.8) | - | 62.1 (56.2, 68.0) | - | 9.8 (6.8, 13.3) | - |
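
As a reading aid, the point estimates in Table 1 can be sanity-checked with a few lines of Python. The sketch below assumes the standard definition of the F1 score as the harmonic mean of PPV (precision) and TPR (recall), and recomputes the detection rate from the counts shown in parentheses. The names `TABLE_1`, `detection_rate`, and `f1_from_ppv_tpr` are illustrative and not taken from the paper. The table does not state how its estimates and intervals were derived, so small rounding-level differences from the reported values are to be expected.

```python
# Point estimates copied from Table 1 (all values in percent).
# The intervals in parentheses are omitted here.
TABLE_1 = {
    "GPT-3.5": {"detected": 12, "ppv": 17.9, "tpr": 5.0, "f1": 7.9},
    "GPT-4": {"detected": 65, "ppv": 84.4, "tpr": 26.7, "f1": 40.6},
    "GPT-4o": {"detected": 100, "ppv": 88.5, "tpr": 40.8, "f1": 55.9},
    "Claude 3.5 Sonnet": {"detected": 127, "ppv": 76.5, "tpr": 52.3, "f1": 62.1},
}

TOTAL_ERRORS = 243  # denominator shown in the "Detection rate" column


def detection_rate(detected: int, total: int = TOTAL_ERRORS) -> float:
    """Share of errors detected, in percent."""
    return 100.0 * detected / total


def f1_from_ppv_tpr(ppv: float, tpr: float) -> float:
    """Harmonic mean of PPV and TPR (assumed standard F1 definition)."""
    if ppv + tpr == 0:
        return 0.0
    return 2.0 * ppv * tpr / (ppv + tpr)


if __name__ == "__main__":
    for model, row in TABLE_1.items():
        rate = detection_rate(row["detected"])
        f1 = f1_from_ppv_tpr(row["ppv"], row["tpr"])
        print(
            f"{model}: detection rate {rate:.1f}%, "
            f"recomputed F1 {f1:.1f} (reported {row['f1']})"
        )
```

Under these assumptions the recomputed detection rates match the table to one decimal place, and the recomputed F1 scores agree with the reported ones to within about 0.1 percentage points.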