Table 4 Performance comparison of model answers across different prompt types.

From: Evaluation of DeepSeek-R1 and ChatGPT-4o on the Chinese national medical licensing examination: a multi-year comparative study

| Model | Error type | Question | Accurate answer | Baseline answer | Prompt 2 answer | Prompt 3 answer |
|---|---|---|---|---|---|---|
| DeepSeek-R1 | Fact-based recall | Q17 | B | Incorrect | Correct | Incorrect |
| | | Q37 | E | Correct | Correct | Incorrect |
| | | Q1 | E | Correct | Correct | Incorrect |
| | | Q20 | D | Incorrect | Correct | Incorrect |
| | Diagnostic reasoning | Q53 | D | Incorrect | Correct | Incorrect |
| | | Q110 | D | Incorrect | Correct | Incorrect |
| ChatGPT-4o | Fact-based recall | Q14 | A | Incorrect | Correct | Correct |
| | | Q20a | D | Incorrect | Correct | Correct |
| | | Q20b | D | Incorrect | Correct | Incorrect |
| | Diagnostic reasoning | Q59 | C | Correct | Correct | Correct |
| | | Q122 | B | Incorrect | Correct | Incorrect |
| | | Q53 | D | Incorrect | Correct | Incorrect |
| | | Q110 | D | Correct | Correct | Incorrect |
| | | Q117a | B | Incorrect | Correct | Incorrect |
| | | Q117b | B | Incorrect | Correct | Incorrect |

1. The results shown in Table 4 represent a subset of the questions analyzed; only a sample of questions is provided here for comparison purposes.