Fig. 1: Model performance for diagnosis tasks.
From: Benchmark evaluation of DeepSeek large language models in clinical decision-making

a–d, Bubble plots showing the results of the 125 pairwise comparisons on a 5-point Likert scale for GPT-4o versus DeepSeek-R1 (a) (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 0.3085, V = 378, 95% CI −3.13 × 10⁻⁷ to infinity, estimate 0.25); GPT-4o versus Gem2FTE (b) (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 7.89 × 10⁻⁶, V = 1,576, 95% CI 0.5 to infinity, estimate 0.75); DeepSeek-R1 versus Gem2FTE (c) (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 5.73 × 10⁻⁵, V = 1,515, 95% CI 0.5 to infinity, estimate 0.5); and DeepSeek-R1 versus DeepSeek-V3 (d) (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 1, V = 307, 95% CI −0.25 to infinity, estimate 1.97 × 10⁻⁵). e, Violin plots comparing the Likert scores of GPT-4o, DeepSeek-R1, DeepSeek-V3 and Gem2FTE with those of GPT-4, GPT-3.5 and Google in our previous study (n.s., not significant; ***P < 0.001; significance levels show the results of the statistical tests performed in a–d). An exploratory comparison of the n = 110 cases analyzed by all seven models with the n = 15 newly added cases shows that the performance scores align well (one-sided unpaired Mann–Whitney test, alternative = greater; GPT-4o: P = 0.5441, W = 813.5, 95% CI −1.84 × 10⁻⁵ to infinity, estimate −4.99 × 10⁻⁵; DeepSeek-R1: P = 0.7710, W = 740, 95% CI 3.75 × 10⁻⁵ to infinity, estimate −2.16 × 10⁻⁵; DeepSeek-V3: P = 0.6678, W = 775.5, 95% CI −7.45 × 10⁻⁵ to infinity, estimate 5.91 × 10⁻⁵; Gem2FTE: P = 0.9899, W = 540, 95% CI −0.5 to infinity, estimate −3.51 × 10⁻⁵). f, The cumulative frequency of the Likert scores for GPT-4o, DeepSeek-R1, DeepSeek-V3, Gem2FTE and GPT-4.
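
For readers who want to see how an adjusted P value of the kind reported for a–d can arise, the following is a minimal sketch and not the authors' code. It assumes that the paired test is a signed-rank procedure (V being the sum of the ranks of the positive differences), that the P value comes from a normal approximation with a continuity correction of 0.5 and a tie correction, and that the Bonferroni adjustment multiplies P by k and caps the result at 1. The function names and the example scores are hypothetical.

```python
import math
from collections import Counter


def signed_rank_greater(x, y):
    """One-sided (x > y) paired signed-rank test via normal approximation.

    Returns (V, p). V is the sum of the ranks of the positive differences.
    Assumption: zero differences are dropped, tied |differences| receive
    average ranks, and a continuity correction of 0.5 is applied.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Assign average ranks to the sorted absolute differences.
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1..j
        i = j

    v = sum(rank_of[abs(d)] for d in diffs if d > 0)

    # Mean and tie-corrected variance of V under the null hypothesis.
    mean = n * (n + 1) / 4
    ties = Counter(abs_sorted).values()
    var = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in ties) / 48

    z = (v - mean - 0.5) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return v, p


def bonferroni(p, k):
    """Bonferroni-adjusted P value for k comparisons, capped at 1."""
    return min(1.0, p * k)


if __name__ == "__main__":
    # Hypothetical Likert scores for two models on the same six cases.
    model_a = [5, 4, 5, 3, 4, 5]
    model_b = [4, 4, 3, 3, 5, 2]
    v, p = signed_rank_greater(model_a, model_b)
    print("V =", v, "adjusted P =", bonferroni(p, k=4))
```

With heavily tied 5-point Likert data an exact P value is not available, which is one reason a normal approximation with continuity correction is commonly reported for this kind of comparison.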