Table 3 Overall performance of five popular public LLMs for adjudication of clinical events

From: A large language model for clinical outcome adjudication from telephone follow-up interviews: a secondary analysis of a multicenter randomized clinical trial

| Model (version date) | Raw agreement, % (95% CI) | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Positive predictive value, % (95% CI) | Negative predictive value, % (95% CI) |
|---|---|---|---|---|---|
| DeepSeek-v3 (2024_12_26) | 82.5 (81.5–83.5) | 94.4 (93.4–95.4) | 78.3 (76.8–79.8) | 75.0 (73.2–76.8) | 95.3 (94.4–96.2) |
| GPT-3.5-turbo (2025_01_25) | 82.5 (81.4–83.5) | 91.3 (90.1–92.6) | 93.3 (92.3–94.2) | 90.3 (89.0–91.6) | 94.0 (93.0–94.9) |
| GPT-4o (2024_11_20) | 85.7 (84.6–86.6) | 91.5 (90.1–92.7) | 95.7 (95.0–96.5) | 93.6 (92.5–94.7) | 94.2 (93.3–95.0) |
| Claude-3.5-sonnet (2024_10_22) | 86.1 (85.1–87.0) | 95.7 (94.7–96.5) | 96.1 (95.3–96.8) | 94.4 (93.3–95.4) | 97.0 (96.3–97.6) |
| Gemini-2.0-pro (2025_02_05) | 84.0 (83.0–85.0) | 93.2 (92.0–94.2) | 96.7 (96.0–97.4) | 95.1 (94.1–96.0) | 95.4 (94.6–96.2) |

CI, confidence interval; GPT, generative pretrained transformer.
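The five metrics reported for each model follow directly from a 2×2 confusion matrix of LLM adjudications against the reference adjudication. A minimal sketch of those definitions is below; the counts used here are hypothetical for illustration only, since the article's underlying event counts are not shown in this table.

```python
# Sketch: the five Table 3 metrics as functions of a 2x2 confusion matrix
# (LLM adjudication vs. reference standard). Counts are hypothetical.

def adjudication_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return the five metrics from Table 3, as percentages."""
    total = tp + fp + tn + fn
    return {
        "raw_agreement": 100 * (tp + tn) / total,  # overall concordance
        "sensitivity": 100 * tp / (tp + fn),       # true positive rate
        "specificity": 100 * tn / (tn + fp),       # true negative rate
        "ppv": 100 * tp / (tp + fp),               # positive predictive value
        "npv": 100 * tn / (tn + fn),               # negative predictive value
    }

# Hypothetical counts for demonstration only
metrics = adjudication_metrics(tp=90, fp=10, tn=880, fn=20)
print({k: round(v, 1) for k, v in metrics.items()})
```

Note that PPV and NPV, unlike sensitivity and specificity, depend on the prevalence of positive events in the cohort, so they are not directly comparable across trials with different event rates.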