Table 1 Diagnostic performance of LLMs in four rounds

From: Multiple large language models versus experienced physicians in diagnosing challenging cases with gastrointestinal symptoms

| Model | Average number of diagnoses (SDᵃ) | Coverage rate (%, 95% CIᵃ) | Accuracy (%, 95% CIᵃ) | Krippendorff’s Alpha |
| --- | --- | --- | --- | --- |
| GPT-3.5t | 4.4 (1.0) | 22.8 (18.1–28.2) | 6.0 (3.7–9.5) | 0.643 |
| Round 1 | 4.6 (1.1) | 25.4 (15.9–37.5) | 4.5 (1.2–12.7) | |
| Round 2 | 4.4 (1.1) | 19.4 (11.2–31.0) | 7.5 (2.9–16.7) | |
| Round 3 | 4.4 (0.9) | 17.9 (10.0–29.3) | 3.0 (0.6–10.5) | |
| Round 4 | 4.2 (1.0) | 28.4 (18.4–40.7) | 9.0 (3.8–18.6) | |
| GPT-4o | 5.3 (1.0) | 64.2 (58.2–69.8) | 42.9 (37.0–49.0) | 0.664 |
| Round 1 | 5.4 (1.0) | 56.7 (44.1–68.6) | 35.8 (24.8–48.4) | |
| Round 2 | 5.4 (1.0) | 62.7 (50.1–73.9) | 43.3 (31.4–55.9) | |
| Round 3 | 5.3 (0.9) | 68.7 (56.2–79.1) | 47.8 (35.6–60.2) | |
| Round 4 | 5.2 (0.9) | 68.7 (56.2–79.1) | 44.8 (32.8–57.3) | |
| Gemini-1.0-pro | 5.3 (1.3) | 30.6 (25.3–36.4) | 16.4 (12.4–21.4) | 0.555 |
| Round 1 | 5.3 (1.5) | 31.3 (20.9–43.8) | 17.9 (10.0–29.3) | |
| Round 2 | 5.4 (1.3) | 32.8 (22.2–45.4) | 19.4 (11.2–31.0) | |
| Round 3 | 5.3 (1.3) | 26.9 (17.1–39.1) | 11.9 (5.7–22.3) | |
| Round 4 | 5.2 (1.2) | 31.3 (20.9–43.8) | 16.4 (8.9–27.6) | |
| Gemini-1.5-pro | 5.2 (0.9) | 53.4 (47.3–59.3) | 24.6 (19.8–30.2) | 0.720 |
| Round 1 | 5.2 (0.8) | 53.7 (41.3–65.8) | 22.4 (13.5–34.3) | |
| Round 2 | 5.2 (0.8) | 53.7 (41.3–65.8) | 28.4 (18.4–40.7) | |
| Round 3 | 5.3 (0.8) | 52.2 (39.8–64.4) | 23.9 (14.7–35.9) | |
| Round 4 | 5.1 (0.9) | 53.7 (41.3–65.8) | 23.9 (14.7–35.9) | |
| Claude-2.1 | 4.6 (0.8) | 40.3 (34.5–46.4) | 22.4 (17.7–27.8) | 0.656 |
| Round 1 | 4.6 (0.9) | 38.8 (27.4–51.4) | 20.9 (12.3–32.6) | |
| Round 2 | 4.6 (0.8) | 41.8 (30.1–54.4) | 23.9 (14.7–35.9) | |
| Round 3 | 4.5 (0.8) | 43.3 (31.4–55.9) | 22.4 (13.5–34.3) | |
| Round 4 | 4.5 (0.7) | 37.3 (26.1–49.9) | 22.4 (13.5–34.3) | |
| Claude 3 Opus | 4.8 (0.6) | 66.4 (60.5–71.9) | 44.4 (38.5–50.5) | 0.697 |
| Round 1 | 4.8 (0.7) | 64.2 (51.6–75.2) | 47.8 (35.6–60.2) | |
| Round 2 | 4.8 (0.6) | 68.7 (56.2–79.1) | 41.8 (30.1–54.4) | |
| Round 3 | 4.9 (0.6) | 65.7 (53.1–76.5) | 44.8 (32.8–57.3) | |
| Round 4 | 4.8 (0.5) | 67.2 (54.6–77.8) | 43.3 (31.4–55.9) | |
| Claude 3.5 Sonnet | 5.6 (0.7) | 76.1 (70.6–80.9) | 48.9 (42.9–54.9) | 0.780 |
| Round 1 | 5.5 (0.6) | 77.6 (65.7–86.5) | 47.8 (35.6–60.2) | |
| Round 2 | 5.6 (0.7) | 77.6 (65.7–86.5) | 49.3 (37.0–61.6) | |
| Round 3 | 5.7 (0.7) | 77.6 (65.7–86.5) | 52.2 (39.8–64.4) | |
| Round 4 | 5.7 (0.7) | 71.6 (59.3–81.6) | 46.3 (34.2–58.7) | |

ᵃ SD: standard deviation; CI: confidence interval.
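
The table does not state how its 95% confidence intervals were computed. As a rough illustration of how a rate with an interval can be derived from case counts, the following is a minimal sketch using the Wilson score interval without continuity correction. The choice of method, the function name `wilson_interval`, and the example counts (17 of 67) are assumptions made for illustration only, so the printed bounds should not be expected to match the published ones exactly.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (no continuity correction) for a binomial proportion.

    Assumption: this is one common CI method, not necessarily the one
    used to produce the table above.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts chosen for illustration: 17 correct out of 67 cases.
hits, cases = 17, 67
low, high = wilson_interval(hits, cases)
print(f"{100 * hits / cases:.1f} ({100 * low:.1f}-{100 * high:.1f})")
```

Other interval methods (for example, exact Clopper-Pearson or a continuity-corrected Wilson interval) give slightly different bounds for the same counts, which is one reason a recomputed interval can differ from a published one in the last digit.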