Table 3 Overall performance of five popular public LLMs for adjudication of clinical events
Model | Raw agreement, % (95% CI) | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Positive predictive value, % (95% CI) | Negative predictive value, % (95% CI) |
|---|---|---|---|---|---|
DeepSeek-v3 (2024_12_26) | 82.5 (81.5–83.5) | 94.4 (93.4–95.4) | 78.3 (76.8–79.8) | 75.0 (73.2–76.8) | 95.3 (94.4–96.2) |
GPT-3.5-turbo (2025_01_25) | 82.5 (81.4–83.5) | 91.3 (90.1–92.6) | 93.3 (92.3–94.2) | 90.3 (89.0–91.6) | 94.0 (93.0–94.9) |
GPT-4o (2024_11_20) | 85.7 (84.6–86.6) | 91.5 (90.1–92.7) | 95.7 (95.0–96.5) | 93.6 (92.5–94.7) | 94.2 (93.3–95.0) |
Claude-3.5-Sonnet (2024_10_22) | 86.1 (85.1–87.0) | 95.7 (94.7–96.5) | 96.1 (95.3–96.8) | 94.4 (93.3–95.4) | 97.0 (96.3–97.6) |
Gemini-2.0-Pro (2025_02_05) | 84.0 (83.0–85.0) | 93.2 (92.0–94.2) | 96.7 (96.0–97.4) | 95.1 (94.1–96.0) | 95.4 (94.6–96.2) |
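The metrics reported in Table 3 follow the standard confusion-matrix definitions for a binary adjudication task (LLM event call vs. reference adjudication). As an illustrative sketch only — not the authors' analysis code, and the choice of the Wilson score interval for the 95% CIs is an assumption — these quantities can be computed from the four cell counts as follows:

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a proportion k/n (z=1.96 for 95%)."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def adjudication_metrics(tp, fp, fn, tn):
    """Table 3-style metrics from confusion-matrix counts.

    tp/fp/fn/tn are hypothetical counts: true/false positives and
    negatives of the LLM call against the reference adjudication.
    Returns {metric: (point estimate, (ci_low, ci_high))} as proportions.
    """
    n = tp + fp + fn + tn
    return {
        "raw_agreement": ((tp + tn) / n, wilson_ci(tp + tn, n)),
        "sensitivity":   (tp / (tp + fn), wilson_ci(tp, tp + fn)),
        "specificity":   (tn / (tn + fp), wilson_ci(tn, tn + fp)),
        "ppv":           (tp / (tp + fp), wilson_ci(tp, tp + fp)),
        "npv":           (tn / (tn + fn), wilson_ci(tn, tn + fn)),
    }
```

Multiplying each proportion and its interval bounds by 100 yields percentages in the format of the table.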