Table 1. Accuracy of LLMs by input type.
Model and temperature (T) | Original quiz text with image | Rephrased quiz text with image | Rephrased quiz text only | P-value<sup>a</sup> | P-value<sup>b</sup> |
|---|---|---|---|---|---|
GPT-4v | |||||
T0 | 62.5% (35/56) | 64.3% (36/56) | 67.9% (38/56) | 0.763 | 0.313 |
T0.5 | 64.3% (36/56) | 62.5% (35/56) | 67.9% (38/56) | 0.739 | 0.173 |
T1 | 66.1% (37/56) | 67.9% (38/56) | 66.1% (37/56) | 0.763 | 0.313 |
GPT-4o | |||||
T0 | 75.0% (42/56) | 67.9% (38/56) | 62.5% (35/56) | 0.150 | 0.251 |
T0.5 | 76.8% (43/56) | 67.9% (38/56) | 62.5% (35/56) | 0.051 | 0.313 |
T1 | 75.0% (42/56) | 66.1% (37/56) | 66.1% (37/56) | 0.124 | 1.000 |
Gemini 1.5 pro | |||||
T0 | 58.9% (33/56) | 64.3% (36/56) | 62.5% (35/56) | 0.489 | 0.763 |
T0.5 | 57.1% (32/56) | 64.3% (36/56) | 62.5% (35/56) | 0.342 | 0.763 |
T1 | 58.9% (33/56) | 66.1% (37/56) | 64.3% (36/56) | 0.391 | 0.739 |
Gemini Flash | |||||
T0 | 53.6% (30/56) | 64.3% (36/56) | 51.8% (29/56) | 0.027 | 0.005 |
T0.5 | 51.8% (29/56) | 64.3% (36/56) | 51.8% (29/56) | 0.014 | 0.005 |
T1 | 55.4% (31/56) | 64.3% (36/56) | 53.6% (30/56) | 0.051 | 0.010 |
Claude 3.0 | |||||
T0 | 75.0% (42/56) | 73.2% (41/56) | 69.6% (39/56) | 0.654 | 0.478 |
T0.5 | 73.2% (41/56) | 69.6% (39/56) | 75.0% (42/56) | 0.411 | 0.563 |
T1 | 75.0% (42/56) | 75.0% (42/56) | 67.9% (38/56) | 1.000 | 0.199 |
Claude 3.5 | |||||
T0 | 80.4% (45/56) | 76.8% (43/56) | 76.8% (43/56) | 0.313 | 1.000 |
T0.5 | 80.4% (45/56) | 76.8% (43/56) | 76.8% (43/56) | 0.313 | 1.000 |
T1 | 80.4% (45/56) | 76.8% (43/56) | 76.8% (43/56) | 0.313 | 1.000 |
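
Each accuracy cell is the number of correct answers out of 56 quiz items, written as a percentage rounded to one decimal place. The short sketch below recomputes a few cells from their counts as a consistency check. The helper name `accuracy_cell` and the `rows` dictionary are illustrative assumptions, not part of the study's analysis. The P-values cannot be recomputed this way, because they depend on per-question pairings that the table does not report.

```python
# Minimal sketch: rebuild "xx.x% (k/n)" cells from raw counts.
# `accuracy_cell` and `rows` are hypothetical names used only for illustration.

TOTAL_ITEMS = 56  # every cell in the table uses 56 quiz items


def accuracy_cell(correct: int, total: int = TOTAL_ITEMS) -> str:
    """Format a count as the table does, e.g. 35 -> '62.5% (35/56)'."""
    return f"{100 * correct / total:.1f}% ({correct}/{total})"


# Counts copied from the table (T0 rows): original text with image,
# rephrased text with image, rephrased text only.
rows = {
    "GPT-4v T0": (35, 36, 38),
    "Claude 3.5 T0": (45, 43, 43),
}

for label, counts in rows.items():
    print(label, " | ".join(accuracy_cell(c) for c in counts))
```

Running it reproduces the corresponding cells of the table, which confirms that the percentages and counts shown there agree.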