Table 2 Comparative performance of LLMs and clinicians across total and domain scores.

From: Benchmarking large language models against clinicians across hospital levels in cardiovascular decision-making: a cross-sectional vignette-based study

| Domain | Clinician (mean ± SD) | DeepSeek-R1 | Mean difference, DeepSeek-R1 (95% CI) | P value, DeepSeek-R1 | ChatGPT 4.0 | Mean difference, ChatGPT 4.0 (95% CI) | P value, ChatGPT 4.0 |
|---|---|---|---|---|---|---|---|
| CR | 37.4 ± 2.7 | 43 | +5.6 (4.4–6.9) | <0.001 | 42 | +4.6 (3.4–5.8) | <0.001 |
| FU | 7.3 ± 2.5 | 13 | +5.7 (3.8–7.6) | <0.001 | 12 | +4.7 (2.8–6.6) | <0.001 |
| BM | 18.1 ± 2.4 | 30 | +11.9 (9.2–14.5) | <0.001 | 30 | +11.9 (9.2–14.5) | <0.001 |
| ED | 6.9 ± 2.0 | 11 | +4.1 (2.9–5.3) | <0.001 | 11 | +4.1 (2.9–5.3) | <0.001 |
| Total | 69.7 ± 7.9 | 97 | +27.3 (24.4–30.1) | <0.001 | 95 | +25.3 (22.4–28.1) | <0.001 |

  1. Mean differences (LLM − clinician) and 95% CIs were obtained from 10,000 bootstrap iterations using clinician means. A P value < 0.05 indicated a significant difference.
  2. BM, basic memory; CI, confidence interval; CR, clinical reasoning; ED, emergency decision; FU, frontier updates; LLM, large language model; SD, standard deviation.
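
The bootstrap procedure described in footnote 1 can be illustrated with a minimal sketch. It assumes a percentile bootstrap in which clinician scores are resampled with replacement and the LLM score is treated as a single fixed value; the study does not state which bootstrap variant was used. The function name `bootstrap_mean_difference`, the seed, and the clinician scores below are hypothetical placeholders, not study data, so the resulting interval will not match the table.

```python
import random
import statistics


def bootstrap_mean_difference(llm_score, clinician_scores,
                              n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for (LLM score - clinician mean).

    Assumption: the LLM score is one fixed number and only the
    clinician scores are resampled. This is an illustrative choice,
    not necessarily the procedure used in the study.
    """
    rng = random.Random(seed)
    n = len(clinician_scores)
    diffs = []
    for _ in range(n_boot):
        # Resample clinicians with replacement, same sample size.
        resample = rng.choices(clinician_scores, k=n)
        diffs.append(llm_score - statistics.fmean(resample))
    diffs.sort()
    # Percentile bounds of the sorted bootstrap distribution.
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    point = llm_score - statistics.fmean(clinician_scores)
    return point, lower, upper


# Hypothetical clinician total scores, for illustration only.
clinician_totals = [62, 75, 68, 71, 80, 59, 73, 66, 70, 77, 64, 72]

point, lower, upper = bootstrap_mean_difference(97, clinician_totals)
print(f"Mean difference: {point:+.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

With the real per-clinician scores in place of the placeholder list, the same call would give a point estimate equal to the LLM score minus the clinician mean, which is how each tabulated difference relates to its row (for example, 97 − 69.7 = 27.3 for the DeepSeek-R1 total).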