Table 1 Accuracy of LLMs according to the input types.

From: Evaluating diagnostic accuracy of large language models in neuroradiology cases using image inputs from JAMA neurology and JAMA clinical challenges

 

| Model | Temperature | Original quiz, text with image | Rephrased quiz, text with image | Rephrased quiz, text only | P-value (a) | P-value (b) |
|---|---|---|---|---|---|---|
| GPT-4v | T0 | 62.5% (35/56) | 64.3% (36/56) | 67.9% (38/56) | 0.763 | 0.313 |
| GPT-4v | T0.5 | 64.3% (36/56) | 62.5% (35/56) | 67.9% (38/56) | 0.739 | 0.173 |
| GPT-4v | T1 | 66.1% (37/56) | 67.9% (38/56) | 66.1% (37/56) | 0.763 | 0.313 |
| GPT-4o | T0 | 75.0% (42/56) | 67.9% (38/56) | 62.5% (35/56) | 0.150 | 0.251 |
| GPT-4o | T0.5 | 76.8% (43/56) | 67.9% (38/56) | 62.5% (35/56) | 0.051 | 0.313 |
| GPT-4o | T1 | 75.0% (42/56) | 66.1% (37/56) | 66.1% (37/56) | 0.124 | 1.000 |
| Gemini 1.5 pro | T0 | 58.9% (33/56) | 64.3% (36/56) | 62.5% (35/56) | 0.489 | 0.763 |
| Gemini 1.5 pro | T0.5 | 57.1% (32/56) | 64.3% (36/56) | 62.5% (35/56) | 0.342 | 0.763 |
| Gemini 1.5 pro | T1 | 58.9% (33/56) | 66.1% (37/56) | 64.3% (36/56) | 0.391 | 0.739 |
| Gemini Flash | T0 | 53.6% (30/56) | 64.3% (36/56) | 51.8% (29/56) | **0.027** | **0.005** |
| Gemini Flash | T0.5 | 51.8% (29/56) | 64.3% (36/56) | 51.8% (29/56) | **0.014** | **0.005** |
| Gemini Flash | T1 | 55.4% (31/56) | 64.3% (36/56) | 53.6% (30/56) | 0.051 | **0.010** |
| Claude 3.0 | T0 | 75.0% (42/56) | 73.2% (41/56) | 69.6% (39/56) | 0.654 | 0.478 |
| Claude 3.0 | T0.5 | 73.2% (41/56) | 69.6% (39/56) | 75.0% (42/56) | 0.411 | 0.563 |
| Claude 3.0 | T1 | 75.0% (42/56) | 75.0% (42/56) | 67.9% (38/56) | 1.000 | 0.199 |
| Claude 3.5 | T0 | 80.4% (45/56) | 76.8% (43/56) | 76.8% (43/56) | 0.313 | 1.000 |
| Claude 3.5 | T0.5 | 80.4% (45/56) | 76.8% (43/56) | 76.8% (43/56) | 0.313 | 1.000 |
| Claude 3.5 | T1 | 80.4% (45/56) | 76.8% (43/56) | 76.8% (43/56) | 0.313 | 1.000 |

  1. GPT-4v: GPT-4 turbo with vision; GPT-4o: GPT-4 omni; LLM: large language model; T0: temperature 0; T0.5: temperature 0.5; T1: temperature 1. Differences in accuracy were evaluated using generalized estimating equations.
  2. (a) P-value for the comparison between the original quiz and the rephrased quiz, both composed of text and images. (b) P-value for the comparison between the rephrased quiz composed of text and images and the rephrased quiz composed of text only. P < .05 was considered statistically significant.
  3. Significant values (P < .05) are shown in bold.
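
Each accuracy in Table 1 is the number of correct answers divided by the 56 cases, rounded to one decimal place. The following is a minimal Python sketch of that arithmetic for a few cells copied from the table; the names `reported` and `accuracy_percent` are illustrative and not from the source. It does not reproduce the P-values, which come from generalized estimating equations and would need the per-case results that the table does not provide.

```python
# Number of cases per condition, as shown in the denominators of Table 1.
TOTAL_CASES = 56

# A few cells copied from Table 1: (correct answers, percentage as printed).
# The dictionary name and key layout are illustrative assumptions.
reported = {
    ("GPT-4v", "T0", "original quiz, text with image"): (35, "62.5%"),
    ("GPT-4o", "T0.5", "original quiz, text with image"): (43, "76.8%"),
    ("Gemini Flash", "T0", "rephrased quiz, text only"): (29, "51.8%"),
    ("Claude 3.5", "T0", "original quiz, text with image"): (45, "80.4%"),
}


def accuracy_percent(correct: int, total: int = TOTAL_CASES) -> str:
    """Format correct/total as a percentage with one decimal place."""
    return f"{100 * correct / total:.1f}%"


for (model, temperature, quiz), (correct, printed) in reported.items():
    computed = accuracy_percent(correct)
    status = "matches" if computed == printed else "differs"
    print(f"{model} {temperature} ({quiz}): {computed} {status} the table")
```

The same check can be extended to every cell by adding the remaining counts to the dictionary.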