Table 1 Diagnostic performance of LLMs in four rounds

From: Multiple large language models versus experienced physicians in diagnosing challenging cases with gastrointestinal symptoms

Model	Average number of diagnoses (SD^a)	Coverage rate (%, 95% CI^a)	Accuracy (%, 95% CI^a)	Krippendorff’s Alpha
GPT-3.5t	4.4 (1.0)	22.8 (18.1–28.2)	6.0 (3.7–9.5)	0.643
Round 1	4.6 (1.1)	25.4 (15.9–37.5)	4.5 (1.2–12.7)
Round 2	4.4 (1.1)	19.4 (11.2–31.0)	7.5 (2.9–16.7)
Round 3	4.4 (0.9)	17.9 (10.0–29.3)	3.0 (0.6–10.5)
Round 4	4.2 (1.0)	28.4 (18.4–40.7)	9.0 (3.8–18.6)
GPT-4o	5.3 (1.0)	64.2 (58.2–69.8)	42.9 (37.0–49.0)	0.664
Round 1	5.4 (1.0)	56.7 (44.1–68.6)	35.8 (24.8–48.4)
Round 2	5.4 (1.0)	62.7 (50.1–73.9)	43.3 (31.4–55.9)
Round 3	5.3 (0.9)	68.7 (56.2–79.1)	47.8 (35.6–60.2)
Round 4	5.2 (0.9)	68.7 (56.2–79.1)	44.8 (32.8–57.3)
Gemini-1.0-pro	5.3 (1.3)	30.6 (25.3–36.4)	16.4 (12.4–21.4)	0.555
Round 1	5.3 (1.5)	31.3 (20.9–43.8)	17.9 (10.0–29.3)
Round 2	5.4 (1.3)	32.8 (22.2–45.4)	19.4 (11.2–31.0)
Round 3	5.3 (1.3)	26.9 (17.1–39.1)	11.9 (5.7–22.3)
Round 4	5.2 (1.2)	31.3 (20.9–43.8)	16.4 (8.9–27.6)
Gemini-1.5-pro	5.2 (0.9)	53.4 (47.3–59.3)	24.6 (19.8–30.2)	0.720
Round 1	5.2 (0.8)	53.7 (41.3–65.8)	22.4 (13.5–34.3)
Round 2	5.2 (0.8)	53.7 (41.3–65.8)	28.4 (18.4–40.7)
Round 3	5.3 (0.8)	52.2 (39.8–64.4)	23.9 (14.7–35.9)
Round 4	5.1 (0.9)	53.7 (41.3–65.8)	23.9 (14.7–35.9)
Claude-2.1	4.6 (0.8)	40.3 (34.5–46.4)	22.4 (17.7–27.8)	0.656
Round 1	4.6 (0.9)	38.8 (27.4–51.4)	20.9 (12.3–32.6)
Round 2	4.6 (0.8)	41.8 (30.1–54.4)	23.9 (14.7–35.9)
Round 3	4.5 (0.8)	43.3 (31.4–55.9)	22.4 (13.5–34.3)
Round 4	4.5 (0.7)	37.3 (26.1–49.9)	22.4 (13.5–34.3)
Claude 3 Opus	4.8 (0.6)	66.4 (60.5–71.9)	44.4 (38.5–50.5)	0.697
Round 1	4.8 (0.7)	64.2 (51.6–75.2)	47.8 (35.6–60.2)
Round 2	4.8 (0.6)	68.7 (56.2–79.1)	41.8 (30.1–54.4)
Round 3	4.9 (0.6)	65.7 (53.1–76.5)	44.8 (32.8–57.3)
Round 4	4.8 (0.5)	67.2 (54.6–77.8)	43.3 (31.4–55.9)
Claude 3.5 Sonnet	5.6 (0.7)	76.1 (70.6–80.9)	48.9 (42.9–54.9)	0.780
Round 1	5.5 (0.6)	77.6 (65.7–86.5)	47.8 (35.6–60.2)
Round 2	5.6 (0.7)	77.6 (65.7–86.5)	49.3 (37.0–61.6)
Round 3	5.7 (0.7)	77.6 (65.7–86.5)	52.2 (39.8–64.4)
Round 4	5.7 (0.7)	71.6 (59.3–81.6)	46.3 (34.2–58.7)

^aSD standard deviation, CI confidence interval.
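
The table reports each percentage with a 95% CI but does not say which interval method was used. As a rough illustration only, the sketch below computes a Wilson score interval for a binomial proportion. The choice of the Wilson method, the function name `wilson_interval`, and the example counts (17 of 67 cases) are assumptions made for illustration; the counts are inferred from the reported percentages, not stated in the table, and the resulting bounds may differ slightly from the published ones.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: no continuity correction; z=1.96 gives roughly 95% coverage.
    Returns (lower, upper) as proportions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 17 covered cases out of 67 (inferred, not from the table).
low, high = wilson_interval(17, 67)
print(f"{100 * 17 / 67:.1f}% ({100 * low:.1f}-{100 * high:.1f})")
```

Other common choices, such as the Clopper-Pearson exact interval or a continuity-corrected Wilson interval, give somewhat wider bounds for the same counts.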
