Table 3 Performance comparison across different prompt types.

From: Evaluation of DeepSeek-R1 and ChatGPT-4o on the Chinese national medical licensing examination: a multi-year comparative study

| Model | Prompt type | Accuracy (%) | Accuracy vs. baseline |
|---|---|---|---|
| DeepSeek-R1 | Baseline (original prompt) | 93.50 | — |
| DeepSeek-R1 | Step-by-step reasoning | 96.00 | +2.50% |
| DeepSeek-R1 | Concise answer-only | 95.50 | +2.00% |
| ChatGPT-4o | Baseline (original prompt) | 63.50 | — |
| ChatGPT-4o | Step-by-step reasoning | 88.00 | +24.50% |
| ChatGPT-4o | Concise answer-only | 77.50 | +14.00% |