Table 2 Comparative performance of LLMs and clinicians across total and domain scores.
| Domain | Clinician | DeepSeek-R1 | Mean difference, DeepSeek-R1 vs. clinician (95% CI) | P value (DeepSeek-R1) | ChatGPT 4.0 | Mean difference, ChatGPT 4.0 vs. clinician (95% CI) | P value (ChatGPT 4.0) |
|---|---|---|---|---|---|---|---|
| CR | 37.4 ± 2.7 | 43 | +5.6 (4.4–6.9) | <0.001 | 42 | +4.6 (3.4–5.8) | <0.001 |
| FU | 7.3 ± 2.5 | 13 | +5.7 (3.8–7.6) | <0.001 | 12 | +4.7 (2.8–6.6) | <0.001 |
| BM | 18.1 ± 2.4 | 30 | +11.9 (9.2–14.5) | <0.001 | 30 | +11.9 (9.2–14.5) | <0.001 |
| ED | 6.9 ± 2.0 | 11 | +4.1 (2.9–5.3) | <0.001 | 11 | +4.1 (2.9–5.3) | <0.001 |
| Total | 69.7 ± 7.9 | 97 | +27.3 (24.4–30.1) | <0.001 | 95 | +25.3 (22.4–28.1) | <0.001 |
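
The mean differences in Table 2 are consistent with each LLM score minus the clinician mean for the same row. The following is a minimal sketch of that arithmetic, using only values copied from the table. The dictionary name `table2` and the helper `mean_differences` are illustrative and not part of the study. The sketch does not attempt to reproduce the confidence intervals or P values, because the statistical method behind them is not stated in the table.

```python
# Values copied from Table 2: (clinician mean, DeepSeek-R1 score, ChatGPT 4.0 score).
# The variable and function names are hypothetical, chosen for illustration only.
table2 = {
    "CR": (37.4, 43, 42),
    "FU": (7.3, 13, 12),
    "BM": (18.1, 30, 30),
    "ED": (6.9, 11, 11),
    "Total": (69.7, 97, 95),
}


def mean_differences(rows):
    """Return {domain: (deepseek_diff, chatgpt_diff)}, rounded to one decimal.

    Each difference is the LLM score minus the clinician mean, which is an
    assumption inferred from the table values, not a stated method.
    """
    return {
        domain: (round(deepseek - clinician, 1), round(chatgpt - clinician, 1))
        for domain, (clinician, deepseek, chatgpt) in rows.items()
    }


diffs = mean_differences(table2)
for domain, (d_deepseek, d_chatgpt) in diffs.items():
    print(f"{domain}: DeepSeek-R1 {d_deepseek:+.1f}, ChatGPT 4.0 {d_chatgpt:+.1f}")
```

The same data can be used to check that the four domain rows add up to the Total row for the clinician mean and for each LLM score.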