Table 2 The performance of international LLMs in answering questions.

From: Comparative performance of Chinese and international large language models on the Chinese radiology attending physician qualification examination

| | ChatGPT-4o | Gemini 2.0 Pro | Grok3 | χ² | P |
|---|---|---|---|---|---|
| Total | 317/400 (79.3) | 326/400 (81.5) | 269/335 (80.3) | 0.642 | 0.733 |
| Units of questions | | | | | |
| Unit 1 | 80/100 (80) | 78/100 (78) | 83/100 (83) | 0.802 | 0.670 |
| Unit 2 | 80/100 (80) | 90/100 (90) | 82/97 (84.5) | 3.900 | 0.142 |
| Unit 3 | 77/100 (77) | 85/100 (85) | 77/100 (77) | 2.634 | 0.268 |
| Unit 4 | 69/100 (69) | 73/100 (73) | 27/38 (71.1) | 0.389 | 0.823 |
| Text questions/image questions | | | | | |
| Text questions | 260/335 (77.6) | 281/335 (83.9) | 269/335 (80.3) | 4.238 | 0.120 |
| Image questions | 46/65 (70.8) | 45/65 (69.2) | - | 0.037 | 0.848 |
| Types of questions | | | | | |
| A1 | 196/254 (77.2) | 209/254 (82.3) | 202/252 (80.3) | 2.089 | 0.352 |
| A2-4 | 58/78 (74.4) | 61/78 (78.2) | 27/33 (81.8) | 0.803 | 0.669 |
| B | 37/41 (90.2) | 39/41 (95.1) | 34/41 (82.9) | 3.269 | 0.195 |
| C | 15/27 (55.6) | 17/27 (63) | 4/9 (44.4) | 0.994 | 0.608 |
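
The χ² and P columns can be checked from the counts with a short sketch. It reads each cell as correct answers/total questions (%), and it assumes the statistics come from a Pearson chi-square test of independence without continuity correction; the table does not name the test, so this is an assumption. The function names `chi_square_2xk` and `chi_square_p_value` are illustrative and do not come from the paper. Only the Python standard library is used, so the P-value helper covers just the two degrees of freedom that occur in this table (1 when two models are compared, 2 when three are).

```python
import math


def chi_square_2xk(correct, totals):
    """Pearson chi-square statistic for a 2 x k table (correct vs. incorrect per model)."""
    incorrect = [t - c for c, t in zip(correct, totals)]
    n = sum(totals)
    stat = 0.0
    for row in (correct, incorrect):
        row_sum = sum(row)
        for observed, col_total in zip(row, totals):
            # Expected count under independence: row total * column total / grand total
            expected = row_sum * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_p_value(stat, df):
    """Upper-tail P value of the chi-square distribution, closed forms for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Unit 1 row: ChatGPT-4o 80/100, Gemini 2.0 Pro 78/100, Grok3 83/100 (three models, df = 2)
unit1 = chi_square_2xk([80, 78, 83], [100, 100, 100])
print(f"{unit1:.3f} {chi_square_p_value(unit1, 2):.3f}")  # 0.802 0.670

# Image questions row: ChatGPT-4o 46/65, Gemini 2.0 Pro 45/65 (two models, df = 1)
image = chi_square_2xk([46, 45], [65, 65])
print(f"{image:.3f} {chi_square_p_value(image, 1):.3f}")  # 0.037 0.848
```

Both rows match the tabulated values to three decimals, which is consistent with the assumed test for these rows; other rows may have been computed differently.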