Table 1. Accuracy of LLMs according to input type.

Model and temperature	Original quiz text with image	Rephrased quiz text with image	Rephrased quiz text only	P-value^a	P-value^b
GPT-4v
T0	62.5% (35/56)	64.3% (36/56)	67.9% (38/56)	0.763	0.313
T0.5	64.3% (36/56)	62.5% (35/56)	67.9% (38/56)	0.739	0.173
T1	66.1% (37/56)	67.9% (38/56)	66.1% (37/56)	0.763	0.313
GPT-4o
T0	75.0% (42/56)	67.9% (38/56)	62.5% (35/56)	0.150	0.251
T0.5	76.8% (43/56)	67.9% (38/56)	62.5% (35/56)	0.051	0.313
T1	75.0% (42/56)	66.1% (37/56)	66.1% (37/56)	0.124	1.000
Gemini 1.5 pro
T0	58.9% (33/56)	64.3% (36/56)	62.5% (35/56)	0.489	0.763
T0.5	57.1% (32/56)	64.3% (36/56)	62.5% (35/56)	0.342	0.763
T1	58.9% (33/56)	66.1% (37/56)	64.3% (36/56)	0.391	0.739
Gemini Flash
T0	53.6% (30/56)	64.3% (36/56)	51.8% (29/56)	0.027	0.005
T0.5	51.8% (29/56)	64.3% (36/56)	51.8% (29/56)	0.014	0.005
T1	55.4% (31/56)	64.3% (36/56)	53.6% (30/56)	0.051	0.010
Claude 3.0
T0	75.0% (42/56)	73.2% (41/56)	69.6% (39/56)	0.654	0.478
T0.5	73.2% (41/56)	69.6% (39/56)	75.0% (42/56)	0.411	0.563
T1	75.0% (42/56)	75.0% (42/56)	67.9% (38/56)	1.000	0.199
Claude 3.5
T0	80.4% (45/56)	76.8% (43/56)	76.8% (43/56)	0.313	1.000
T0.5	80.4% (45/56)	76.8% (43/56)	76.8% (43/56)	0.313	1.000
T1	80.4% (45/56)	76.8% (43/56)	76.8% (43/56)	0.313	1.000

GPT-4v: GPT-4 turbo with vision; GPT-4o: GPT-4 omni; LLM: large language model; T0: temperature 0; T0.5: temperature 0.5; T1: temperature 1. Differences in accuracy were tested using generalized estimating equations.
^a P-value for the comparison between the original and rephrased quizzes, both presented with text and images. ^b P-value for the comparison between the rephrased quiz with text and images and the rephrased quiz with text only. P < .05 was considered statistically significant.
Significant values (P < .05) are shown in bold.
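
Each accuracy cell in Table 1 is the number of correct answers divided by the 56 quiz items. The following is a minimal Python sketch that reproduces the cell format and the percentage-point change between input types. The helper names `accuracy_cell` and `point_change` are hypothetical, not from the study. The sketch does not reproduce the P-values, which would require the per-question results used in the generalized estimating equations.

```python
def accuracy_cell(correct: int, total: int = 56) -> str:
    """Format a cell as in Table 1, e.g. '62.5% (35/56)'."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{100 * correct / total:.1f}% ({correct}/{total})"


def point_change(correct_a: int, correct_b: int, total: int = 56) -> float:
    """Percentage-point change in accuracy from condition A to condition B."""
    return round(100 * (correct_b - correct_a) / total, 1)


# Correct-answer counts copied from the T0 rows of Table 1, in column order:
# (original text + image, rephrased text + image, rephrased text only).
t0_counts = {
    "GPT-4o": (42, 38, 35),
    "Gemini Flash": (30, 36, 29),
    "Claude 3.5": (45, 43, 43),
}

for model, counts in t0_counts.items():
    cells = [accuracy_cell(c) for c in counts]
    # Change when the image is removed from the rephrased quiz.
    delta = point_change(counts[1], counts[2])
    print(model, cells, f"image removed: {delta:+.1f} points")
```

For Gemini Flash at T0, for example, removing the image from the rephrased quiz lowers accuracy from 36/56 to 29/56, a drop of 12.5 percentage points, which corresponds to one of the significant comparisons in the table.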
