Table 2 Comparative performance of LLMs and clinicians across total and domain scores.

Domain	Clinician (mean ± SD)	DeepSeek-R1	Mean difference vs clinician (95% CI)	P value	ChatGPT 4.0	Mean difference vs clinician (95% CI)	P value
CR	37.4 ± 2.7	43	+5.6 (4.4–6.9)	<0.001	42	+4.6 (3.4–5.8)	<0.001
FU	7.3 ± 2.5	13	+5.7 (3.8–7.6)	<0.001	12	+4.7 (2.8–6.6)	<0.001
BM	18.1 ± 2.4	30	+11.9 (9.2–14.5)	<0.001	30	+11.9 (9.2–14.5)	<0.001
ED	6.9 ± 2.0	11	+4.1 (2.9–5.3)	<0.001	11	+4.1 (2.9–5.3)	<0.001
Total	69.7 ± 7.9	97	+27.3 (24.4–30.1)	<0.001	95	+25.3 (22.4–28.1)	<0.001

Mean differences (LLM − clinician) and 95% CIs were obtained from 10,000 bootstrap iterations using clinician means. A P value < 0.05 indicated a significant difference.
BM, basic memory; CI, confidence interval; CR, clinical reasoning; ED, emergency decision; FU, frontier updates; LLM, large language model; SD, standard deviation.
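
The bootstrap procedure described in the footnote can be illustrated with a minimal sketch. It assumes a percentile bootstrap in which the individual clinician scores are resampled with replacement and the fixed LLM score is compared with each resampled clinician mean. The source does not state the exact variant used, and the function name, seed, and example scores below are hypothetical, not study data.

```python
import random
import statistics


def bootstrap_mean_difference(clinician_scores, llm_score, n_iter=10_000, seed=0):
    """Percentile-bootstrap 95% CI for (LLM score - clinician mean).

    Assumption: clinicians are resampled with replacement and the LLM score
    is treated as a single fixed value, as in the table above.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(clinician_scores)
    diffs = []
    for _ in range(n_iter):
        # Draw n clinician scores with replacement and take their mean.
        resample = rng.choices(clinician_scores, k=n)
        diffs.append(llm_score - statistics.fmean(resample))
    diffs.sort()
    # 2.5th and 97.5th percentiles of the bootstrap distribution.
    lower = diffs[int(0.025 * n_iter)]
    upper = diffs[int(0.975 * n_iter) - 1]
    point = llm_score - statistics.fmean(clinician_scores)
    return point, lower, upper


# Hypothetical clinician scores for one domain (illustrative only).
example_scores = [35, 38, 40, 36, 39, 37, 34, 41, 37, 38]
point, lower, upper = bootstrap_mean_difference(example_scores, llm_score=43)
print(f"Mean difference: {point:+.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

Because the LLM contributes a single score per domain, all of the uncertainty in this sketch comes from the clinician sample; a CI that excludes zero corresponds to a significant difference at the 0.05 level.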
