Table 3 Overall performance of five popular public LLMs for adjudication of clinical events

From: A large language model for clinical outcome adjudication from telephone follow-up interviews: a secondary analysis of a multicenter randomized clinical trial

| Model (version date) | Raw agreement, % (95% CI) | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Positive predictive value, % (95% CI) | Negative predictive value, % (95% CI) |
|---|---|---|---|---|---|
| DeepSeek-v3 (2024_12_26) | 82.5 (81.5–83.5) | 94.4 (93.4–95.4) | 78.3 (76.8–79.8) | 75.0 (73.2–76.8) | 95.3 (94.4–96.2) |
| GPT-3.5-turbo (2025_01_25) | 82.5 (81.4–83.5) | 91.3 (90.1–92.6) | 93.3 (92.3–94.2) | 90.3 (89.0–91.6) | 94.0 (93.0–94.9) |
| GPT-4o (2024_11_20) | 85.7 (84.6–86.6) | 91.5 (90.1–92.7) | 95.7 (95.0–96.5) | 93.6 (92.5–94.7) | 94.2 (93.3–95.0) |
| Claude-3.5-sonnet (2024_10_22) | 86.1 (85.1–87.0) | 95.7 (94.7–96.5) | 96.1 (95.3–96.8) | 94.4 (93.3–95.4) | 97.0 (96.3–97.6) |
| Gemini-2.0-pro (2025_02_05) | 84.0 (83.0–85.0) | 93.2 (92.0–94.2) | 96.7 (96.0–97.4) | 95.1 (94.1–96.0) | 95.4 (94.6–96.2) |

CI, confidence interval; GPT, generative pretrained transformer.
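The five metrics reported for each model follow directly from a 2×2 confusion matrix of LLM adjudications against the reference adjudication. A minimal sketch of those definitions is below; the counts used here are hypothetical for illustration only, since the article's underlying event counts are not shown in this table.

```python
# Sketch: the five Table 3 metrics as functions of a 2x2 confusion matrix
# (LLM adjudication vs. reference standard). Counts are hypothetical.

def adjudication_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return the five metrics from Table 3, as percentages."""
    total = tp + fp + tn + fn
    return {
        "raw_agreement": 100 * (tp + tn) / total,  # overall concordance
        "sensitivity": 100 * tp / (tp + fn),       # true positive rate
        "specificity": 100 * tn / (tn + fp),       # true negative rate
        "ppv": 100 * tp / (tp + fp),               # positive predictive value
        "npv": 100 * tn / (tn + fn),               # negative predictive value
    }

# Hypothetical counts for demonstration only
metrics = adjudication_metrics(tp=90, fp=10, tn=880, fn=20)
print({k: round(v, 1) for k, v in metrics.items()})
```

Note that PPV and NPV, unlike sensitivity and specificity, depend on the prevalence of positive events in the cohort, so they are not directly comparable across trials with different event rates.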