Abstract
Large language models (LLMs) are increasingly transforming medical applications. However, proprietary models such as GPT-4o face significant barriers to clinical adoption because they cannot be deployed on site within healthcare institutions, making them noncompliant with stringent privacy regulations. Recent advancements in open-source LLMs such as the DeepSeek models offer a promising alternative because they allow efficient fine-tuning on local data in hospitals with advanced information technology infrastructure. Here, to demonstrate the clinical utility of DeepSeek-V3 and DeepSeek-R1, we benchmarked their performance on clinical decision support tasks against proprietary LLMs, including GPT-4o and Gemini-2.0 Flash Thinking Experimental. Using 125 patient cases with sufficient statistical power, covering a broad range of frequent and rare diseases, we found that the DeepSeek models perform as well as, and in some cases better than, proprietary LLMs. Our study demonstrates that open-source LLMs can provide a scalable pathway for secure model training, enabling real-world medical applications in accordance with data privacy and healthcare regulations.
Main
Large language models (LLMs) are rapidly emerging as transformative tools within medicine, showing promise in various clinical applications1. Their potential to process and understand complex medical information offers opportunities to enhance clinical decision-making, automate administrative tasks and improve patient care2,3,4. LLMs can analyze large volumes of unstructured data from electronic health records, offering clinicians efficient access to relevant patient information for diagnosis and treatment5. As artificial intelligence (AI) technology matures, these models are poised to become valuable aids in navigating the ever-expanding landscape of medical knowledge and improving healthcare delivery.
However, the integration of LLMs into clinical practice is not without challenges, necessitating careful validation and ethical considerations6,7. For LLMs to be integrated into routine clinical care, they must comply with data privacy regulations such as the General Data Protection Regulation and the Health Insurance Portability and Accountability Act, as well as medical device regulations such as the EU Medical Device Regulation, the EU Artificial Intelligence Act or regulations by the US Food and Drug Administration. Such compliance requires LLMs to be explainable, auditable and fully aligned with strict medical regulations—criteria that proprietary models currently do not meet. Concerns regarding data privacy, algorithmic bias and the potential for generating inaccurate or misleading information remain paramount8,9,10. As Blumenthal and Goldberg (2025)11 highlight, managing patient use of generative AI also presents a novel set of challenges, underscoring the need for robust validation frameworks and clear guidelines to ensure the safe and effective implementation of LLMs in clinical settings.
On benchmarks such as lmarena.ai, open-source LLMs have typically shown inferior performance compared with proprietary state-of-the-art LLMs such as GPT-4o. Nonetheless, open-source LLMs have caught up, as new models such as Llama 3.1 or Mistral Large 2 demonstrate substantial improvements12. Recent advancements in LLMs have seen the emergence of state-of-the-art open-source models such as DeepSeek-V3 and the development of explicit reasoning models such as Gemini-2.0 Flash Thinking Experimental (Gem2FTE), OpenAI o1 and DeepSeek-R1 (ref. 13). With over 500 billion model parameters, the DeepSeek models are among the largest LLMs, competing with proprietary ones in LLM leaderboards, while providing the key benefits of transparency and the ability to run the open-source model within the institution’s own information technology environment at a significantly lower cost compared with proprietary models by OpenAI14. While these leaderboards assess model performance on general AI tasks, the critical question remains whether open-source models can match proprietary systems in real-world clinical decision tasks, including differential diagnosis or treatment planning, and whether enhanced reasoning capabilities also provide benefits in clinical workflows.
Here, we systematically benchmarked open-source and frontier proprietary LLMs with a thorough performance analysis on clinical decision support tasks (Extended Data Fig. 1). We assessed the performance in diagnosis and treatment recommendation for DeepSeek-V3 and DeepSeek-R1 as well as for the proprietary LLMs GPT-4o and Gem2FTE, currently ranked at the top of the LLM leaderboard at lmarena.ai.
Although LLMs excel on widely used benchmarks such as multiple-choice tests, their evaluation for clinical decision support tasks remains underexplored15,16,17. Currently, no widely accepted benchmark exists for assessing the clinical utility of LLMs. We thus conducted comparisons using a well-curated, previously published set of 110 patient cases15, originally designed to evaluate GPT-4, GPT-3.5 and Google search in clinical decision-making. Unlike multiple-choice-based automatic assessments, this benchmark requires expert clinicians to manually evaluate LLM-generated text outputs. These cases, sourced from medical textbooks, replicate the initial patient encounter commonly seen in outpatient or emergency settings by focusing solely on the key details of the doctor–patient dialogue. As a result, they offer an approximation of real-world conditions—where incomplete or extraneous information is common—and help to assess the models’ practical clinical performance. Model output was assessed by medical experts using a 5-point Likert scale (Extended Data Fig. 2, Supplementary Table 1 and Supplementary Fig. 1).
Our focus is on diagnosis and treatment recommendations, as these represent the most consequential and error-prone aspects of clinical decision-making, frequently cited in adverse event analyses and guideline development frameworks18,19. To ensure broad coverage, our evaluation spans multiple specialties (internal medicine, neurology, surgery, gynecology and pediatrics) and includes a balanced mix of frequent, less frequent and rare diseases. To enhance statistical power, we expanded the benchmark to 125 cases, enabling robust significance testing in systematic pairwise model comparisons with adjustments for multiple testing (Methods, Supplementary Table 2 and Supplementary Fig. 1).
For the first clinical decision-making task of diagnosis (Fig. 1), Gem2FTE was significantly outperformed by DeepSeek-R1 (P = 5.73 × 10−5, rank-biserial correlation rrb = 0.60) and GPT-4o (P = 7.89 × 10−6, rrb = 0.67). DeepSeek-R1 was on a par with the best-performing model, GPT-4o (P = 0.3085, rrb = 0.27). All new models showed clearly superior performance compared with GPT-4, GPT-3.5 and Google search (Supplementary Table 3 and Supplementary Fig. 3). Our data indicated consistent performance across clinical specialties (Supplementary Table 4 and Supplementary Fig. 4). It is noteworthy that, for all models except Gem2FTE, no clear difference was observed between the diagnosis of rare and frequent diseases (unadjusted P values of 0.0004 and 0.0009, respectively; Supplementary Tables 5 and 6 and Supplementary Fig. 5). This finding is in stark contrast to our recent study benchmarking GPT-4, GPT-3.5 and Google search15, in which both GPT models and Google search underperformed in the diagnosis of rare diseases. Interestingly, the reasoning-empowered model DeepSeek-R1 did not show improved performance compared with DeepSeek-V3 (P = 1, rrb = 0.03) (Supplementary Tables 1 and 3 and Supplementary Fig. 1).
a–d, Bubble plots showing the results of the 125 pairwise comparisons on a 5-point Likert scale for GPT-4o versus DeepSeek-R1 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 0.3085, V = 378, 95% CI −3.13 × 10−7 to infinity, estimate 0.25) (a); GPT-4o versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 7.89 × 10−6, V = 1,576, 95% CI 0.5 to infinity, estimate 0.75) (b); DeepSeek-R1 versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 5.73 × 10−5, V = 1,515, 95% CI 0.5 to infinity, estimate 0.5) (c); and DeepSeek-R1 versus DeepSeek-V3 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 4, adjusted P = 1, V = 307, 95% CI −0.25 to infinity, estimate 1.97 × 10−5) (d). e, Violin plots comparing the Likert scores of GPT-4o, DeepSeek-R1, DeepSeek-V3 and Gem2FTE with those of GPT-4, GPT-3.5 and Google in our previous study (n.s., not significant; ***P < 0.001; significance levels visualizing the results of statistical tests performed in a–d). Explorative comparison of the n = 110 cases analyzed by all seven models with the n = 15 newly added cases shows that the performance scores align well (one-sided unpaired Mann–Whitney test, alternative = greater; GPT-4o: P = 0.5441, W = 813.5, 95% CI −1.84 × 10−5 to infinity, estimate −4.99 × 10−5; DeepSeek-R1: P = 0.7710, W = 740, 95% CI 3.75 × 10−5 to infinity, estimate −2.16 × 10−5; DeepSeek-V3: P = 0.6678, W = 775.5, 95% CI −7.45 × 10−5 to infinity, estimate 5.91 × 10−5; Gem2FTE: P = 0.9899, W = 540, 95% CI −0.5 to infinity, estimate −3.51 × 10−5). f, The cumulative frequency of the Likert scores for GPT-4o, DeepSeek-R1, DeepSeek-V3, Gem2FTE and GPT-4.
In line with the above findings, for the second clinical decision-making task of treatment recommendation, both GPT-4o (P = 0.0016, rrb = 0.50) and DeepSeek-R1 (P = 0.0235, rrb = 0.36) showed superior performance compared with Gem2FTE. Again, no significant difference was observed for GPT-4o versus DeepSeek-R1 (P = 0.1522, rrb = 0.26) (Fig. 2). Compared with the earlier benchmarked models GPT-4 and GPT-3.5, superior performance was observed for both GPT-4o and DeepSeek-R1, but not for Gem2FTE (Supplementary Table 3 and Supplementary Fig. 6). Model performance for treatment recommendation was not negatively affected by low disease frequency (Supplementary Tables 5 and 6 and Supplementary Fig. 5). Model performance was mostly uniform across clinical specialties, with Gem2FTE being the only exception for treatment recommendations in neurological cases (Supplementary Table 4 and Supplementary Fig. 4).
a–c, Bubble plots showing the results of the 125 pairwise comparisons on a 5-point Likert scale for GPT-4o versus DeepSeek-R1 (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.1522, V = 771.5, 95% CI −6.88 × 10−5 to infinity, estimate 0.25) (a); GPT-4o versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.0016, V = 1,154, 95% CI 0.2501 to infinity, estimate 0.5) (b); and DeepSeek-R1 versus Gem2FTE (one-sided paired Mann–Whitney test with continuity correction, alternative = greater, Bonferroni correction with k = 3, adjusted P = 0.0235, V = 1,124, 95% CI 4.21 × 10−6 to infinity, estimate 0.5) (c). d, Violin plots comparing the Likert scores of GPT-4o, DeepSeek-R1 and Gem2FTE with those of GPT-4 and GPT-3.5 (n.s., not significant; *P < 0.05; significance levels visualizing the results of statistical tests performed in a–c). Explorative comparison of the n = 110 cases analyzed by all seven models with the n = 15 newly added cases shows that the performance scores align well (one-sided unpaired Mann–Whitney test, alternative = greater; GPT-4o: P = 0.1460, W = 955, 95% CI −5.38 × 10−5 to infinity, estimate 3.16 × 10−5; DeepSeek-R1: P = 0.5256, W = 817.5, 95% CI −1.46 × 10−5 to infinity, estimate −1.73 × 10−5; Gem2FTE: P = 0.4591, W = 838.5, 95% CI −9.54 × 10−6 to infinity, estimate −6.10 × 10−5). e, The cumulative frequency of Likert scores for GPT-4o, DeepSeek-R1, Gem2FTE and GPT-4.
The strong performance of DeepSeek-V3 and DeepSeek-R1, matching GPT-4o in both clinical decision-making tasks, suggests that open-source LLMs may serve as valuable assistive tools for complex tasks such as diagnosis or differential diagnoses and treatment recommendation. Surprisingly, Gem2FTE, despite leading the general nonmedical benchmark on lmarena.ai, underperformed in clinical decision-making. Although its model specifications remain undisclosed, we speculate that Gem2FTE is significantly smaller than DeepSeek-V3/R1 and GPT-4o, with model capacity probably being a key factor in clinical performance. Equally unexpected was the lack of advantage from DeepSeek-R1’s reasoning module in medical decision-making. Instead, DeepSeek-R1 generated significantly longer text outputs, increasing response times and reducing conciseness compared with its nonreasoning counterpart. The reasoning fine-tuning of models such as DeepSeek-R1 focuses on easily verifiable mathematical, coding and logic tasks13, and we found here that the impressive reasoning improvements in these problem domains have so far not extended to clinical reasoning. It is thus tempting to speculate that fine-tuning of reasoning models on proprietary clinical case reports available within individual caregiver organizations may lead to dramatic improvements in diagnosis and treatment recommendation.
The average performance scores for DeepSeek-R1 were 4.70 of 5 points for diagnosis and 4.48 for treatment recommendation. In some cases, the new LLMs successfully generated accurate and current information, particularly for treatment recommendations where newly updated guidelines, such as those addressing antimicrobial treatment plans, were necessary. Nevertheless, many cases did not achieve the maximum score; for example, with DeepSeek-R1, 60% of cases for diagnosis and only 39% for treatment reached the full score of 5 points. These inaccuracies in model predictions could pose risks if the output prompted immediate medical decisions without additional expert oversight. Interestingly, the phenomenon of ‘artificial hallucination’, where LLMs generate seemingly plausible but factually incorrect content20, was observed in only a small fraction of cases across all models. Overall, these findings reinforce the need for robust validation frameworks and clear guidelines to ensure the safe and effective implementation of LLMs in clinical settings.
Although the tasks evaluated here cover only a portion of potential clinical use cases, our findings suggest a potential supportive benefit for the two highly relevant clinical decision-making tasks of diagnosis and treatment recommendations. We believe that the output of these models can be further improved in terms of performance and robustness by adding access to quality-checked medical literature or databases, human oversight and transparent learning. In summary, our study demonstrates that open-source LLMs are viable candidates for real-world medical applications. As hospitals prioritize data privacy and regulatory compliance, open-source LLMs provide a scalable pathway for secure, cost-effective, institution-specific model training and implementation. Future clinical studies are warranted to assess whether these promising findings can be effectively translated into improved patient outcomes.
Methods
The selection and processing of patient case reports, as well as standardized prompting, were performed as in our previous study by Sandmann et al.15. In summary, 1,020 manually written cases from German patient casebooks by Thieme and Elsevier were identified. Five clinical specialties were considered: gynecology, internal medicine, neurology, pediatrics and surgery. Cases were categorized by disease frequency. We defined a disease as ‘frequent’ if its incidence per year was higher than 1:1,000, ‘less frequent’ if its incidence per year was between 1:1,000 and 1:10,000 and ‘rare’ if its incidence per year was lower than 1:10,000. Subsequently, cases were filtered, excluding those that required image data or laboratory values for decision-making. Aiming at balanced groups of disease frequency as well as clinical specialty, 110 cases were selected (Supplementary Table 7). To generate patient queries, case reports were translated into English using the tool DeepL.com. Translations were then reviewed and, where necessary, corrected for linguistic accuracy and quality. Case reports were changed to the first-person perspective and layman’s English. The LLMs were queried with “What are the most likely diagnoses? Name up to five.” for diagnosis and with “My doctor has diagnosed me with <diagnosis>. What are the most appropriate therapies in my case? Name up to five.” for treatment.
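For illustration, the sketch below (in R, the language of our analysis scripts) shows how the two standardized query templates could be assembled; the helper functions are hypothetical, and in the study all queries were pasted manually into the vendor-provided user interfaces rather than generated or submitted by code.

```r
# Hypothetical helpers illustrating the standardized query templates; not part of
# the study pipeline (queries were entered manually, without APIs).
build_diagnosis_query <- function(case_text) {
  # The case vignette (first-person, layman's English) followed by the fixed question.
  paste(case_text, "What are the most likely diagnoses? Name up to five.")
}

build_treatment_query <- function(diagnosis) {
  # The treatment query inserts the textbook diagnosis into the fixed template.
  paste0("My doctor has diagnosed me with ", diagnosis,
         ". What are the most appropriate therapies in my case? Name up to five.")
}

build_treatment_query("community-acquired pneumonia")  # example call
```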
The cases used in this study were sourced from curated medical textbooks rather than from real-world clinical records or unstructured notes. The aim was to simulate initial patient encounters—such as those in outpatient clinics or emergency departments—where clinicians typically collect only essential information through a limited set of targeted questions. As a result, these vignettes may sometimes omit relevant details or include extraneous information, thereby offering an approximate reflection of the models’ potential performance in real-world clinical settings.
In our earlier study15, we performed a systematic evaluation of GPT-3.5, GPT-4 and Google search, considering the tools’ overall performance as well as the impact of disease frequency on the results. Aiming at a total power of 0.90 (12 tests for diagnosis, 7 for treatment, Bonferroni correction21), we previously calculated n = 110 cases as sufficient. Results from this previous study revealed that disease frequency had only a minor impact on making the correct treatment decisions. Furthermore, while disease frequency had a clear influence on diagnosis, the tools’ performance even for rare diseases was better than initially assumed (Supplementary Table 8). Taking into account the current rapid development in the field, we expect these differences to decrease even further. Against this background, our current study focuses on testing for significant differences in (1) GPT-4o versus DeepSeek-R1, (2) GPT-4o versus Gem2FTE and (3) DeepSeek-R1 versus Gem2FTE for the two tasks of diagnosis and treatment recommendation. To elaborate on the added value of reasoning models, we also compare (4) DeepSeek-R1 versus DeepSeek-V3 on the diagnostic task. A one-sided paired Mann–Whitney test was applied in all cases, comparing scores on a 5-point Likert scale (Supplementary Table 9). Bonferroni correction was used to adjust for multiple testing21.
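A minimal sketch of this testing scheme for the diagnosis task is shown below, assuming the Likert scores are stored in one column per model; the score values are random placeholders, and the authoritative analysis scripts are provided in Supplementary Data 2 and 3.

```r
# Sketch: one-sided paired Mann-Whitney (Wilcoxon signed-rank) tests with continuity
# correction and Bonferroni adjustment for the k = 4 diagnosis comparisons.
set.seed(1)
scores <- data.frame(
  GPT4o   = sample(1:5, 125, replace = TRUE),   # placeholder Likert scores
  DSR1    = sample(1:5, 125, replace = TRUE),
  DSV3    = sample(1:5, 125, replace = TRUE),
  Gem2FTE = sample(1:5, 125, replace = TRUE)
)

comparisons <- list(c("GPT4o", "DSR1"), c("GPT4o", "Gem2FTE"),
                    c("DSR1", "Gem2FTE"), c("DSR1", "DSV3"))

p_raw <- sapply(comparisons, function(pair) {
  suppressWarnings(                        # ties in Likert data trigger a warning
    wilcox.test(scores[[pair[1]]], scores[[pair[2]]],
                paired = TRUE, alternative = "greater", correct = TRUE)$p.value
  )
})
p.adjust(p_raw, method = "bonferroni")     # Bonferroni-adjusted P values (k = 4)
```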
To estimate the power in relation to sample size, we made the following assumptions. (1) The performance of GPT-4o, estimated to have 1.8 trillion parameters, is better compared with DeepSeek-R1, having 671 billion parameters. (2) The performance of GPT-4o and DeepSeek-R1 is better compared with Gem2FTE. The exact parameter size of Gem2FTE is not reported but estimated to be less than 671 billion parameters based on the fact that the earlier version Gemini 1.5 Flash had 8 billion parameters. (3) The performance of DeepSeek-R1 is better compared with DeepSeek-V3.
In our earlier study15, we observed the probabilities for Likert scores for GPT-3.5 and GPT-4 summarized in Supplementary Table 10. Based on these findings, we adapted the performance estimates for the successor model GPT-4o in relation to DeepSeek-R1, DeepSeek-V3 and Gem2FTE. The following probabilities for Likert scores 1, 2, 3, 4 and 5 were assumed for sampling: GPT-4o: 0.00, 0.00, 0.00, 0.30 and 0.70; DeepSeek-R1: 0.00, 0.00, 0.10, 0.30 and 0.60; DeepSeek-V3: 0.00, 0.05, 0.25, 0.20 and 0.50; and Gem2FTE: 0.01, 0.14, 0.30, 0.15 and 0.40, respectively. Power calculation, investigating possible sample sizes between 75 and 145, showed a power of 0.89 for n = 125 cases when adjusting for 4 tests (diagnosis) and a power of 0.91 when adjusting for 3 tests (treatment; Supplementary Fig. 7 and Supplementary Data 4).
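The simulation below sketches such a power calculation under the stated score probabilities, assuming independent sampling of paired Likert scores; the exact procedure we used is provided in Supplementary Data 4.

```r
# Simulation-based power sketch: draw Likert scores from the assumed probabilities,
# apply the one-sided paired Mann-Whitney test and count rejections at the
# Bonferroni-adjusted threshold alpha / k.
set.seed(42)
probs <- list(
  GPT4o   = c(0.00, 0.00, 0.00, 0.30, 0.70),
  DSR1    = c(0.00, 0.00, 0.10, 0.30, 0.60),
  DSV3    = c(0.00, 0.05, 0.25, 0.20, 0.50),
  Gem2FTE = c(0.01, 0.14, 0.30, 0.15, 0.40)
)

sim_power <- function(n, better, worse, k, alpha = 0.05, n_sim = 2000) {
  rejections <- replicate(n_sim, {
    x <- sample(1:5, n, replace = TRUE, prob = probs[[better]])
    y <- sample(1:5, n, replace = TRUE, prob = probs[[worse]])
    p <- suppressWarnings(
      wilcox.test(x, y, paired = TRUE, alternative = "greater")$p.value
    )
    p < alpha / k
  })
  mean(rejections)                         # empirical power of this comparison
}

sim_power(n = 125, better = "GPT4o", worse = "Gem2FTE", k = 4)  # diagnosis example
```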
By selecting the same 110 cases as before, we ensured direct comparability of the new LLMs’ results with the earlier approaches. By selecting all case reports from non-English sources without open access, we aimed to reduce the risk of training bias. To meet the required sample size, 15 new cases were added following the same selection approach. An explorative analysis was performed to investigate whether the results for these new cases align with the old ones. Furthermore, the influence of disease frequency and clinical specialty on the models’ performance was analyzed exploratively. All queries were entered manually within the vendor-provided user platforms, without using application programming interfaces, and were executed between 27 January and 5 February 2025. Additional technical details are provided in Supplementary Table 11.
A 5-point Likert scale (Supplementary Table 9) was used for assessing both diagnosis and treatment tasks. Two physicians independently assessed five random cases, conducting a comprehensive literature review using UpToDate and PubMed and reaching a consensus on the final Likert scores. Interrater reliability was determined using weighted Cohen’s κ (R package DescTools22, function ‘CohenKappa’, weights ‘Equal-Spacing’). Given the high interrater reliability (κ = 0.76, 95% confidence interval (CI) 0.55 to 0.96), consistent with findings from our prior study (κ ranging between 0.53 and 0.84), the first physician subsequently continued to perform detailed reviews with extensive literature analysis for the remaining cases, while the second physician independently verified all ratings. All statistical analyses were conducted using R 4.4.2 (ref. 23). Applying one-sided paired Mann–Whitney tests24, we tested for significant differences in the overall performance of the approaches (α = 0.05; Bonferroni correction with k = 4 for diagnosis and k = 3 for treatment). A one-sided unpaired Mann–Whitney test was used for the explorative analysis of old versus new clinical cases.
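A sketch of the interrater-reliability computation is shown below; the rating vectors are hypothetical stand-ins for the physicians' Likert scores, while the call to CohenKappa from the DescTools package mirrors the settings reported above.

```r
# Weighted Cohen's kappa with equal-spacing weights and a 95% CI, as reported above.
# The two rating vectors are hypothetical examples, not the study data.
library(DescTools)

rater_1 <- c(5, 4, 5, 3, 5, 4, 4, 5, 2, 5)   # physician 1 (hypothetical Likert scores)
rater_2 <- c(5, 4, 4, 3, 5, 4, 5, 5, 3, 5)   # physician 2 (hypothetical Likert scores)

CohenKappa(x = factor(rater_1, levels = 1:5),  # fixing levels keeps the table square
           y = factor(rater_2, levels = 1:5),
           weights = "Equal-Spacing",
           conf.level = 0.95)
```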
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data including patient cases (clinical cases) and ratings are provided in Supplementary Data 1. Descriptions of further supplementary tables are provided in the Supplementary Information.
Code availability
All code to reproduce data analyses in this Brief Communication is provided in Supplementary Data 2, Supplementary Data 3 and Supplementary Data 4. Code to reproduce main and supplementary analyses is available via GitHub at https://github.com/sandmanns/llm_evaluation.
References
Quer, G. & Topol, E. J. The potential for large language models to transform cardiovascular medicine. Lancet Digit. Health 6, e767–e771 (2024).
Bellini, V. & Bignami, E. G. Generative Pre-trained Transformer 4 (GPT-4) in clinical settings. Lancet Digit. Health 7, e6–e7 (2025).
Aaron, B. et al. Large language models for more efficient reporting of hospital quality measures. NEJM AI https://doi.org/10.1056/aics2400420 (2024).
McCoy, T. H. & Perlis, R. H. Applying large language models to stratify suicide risk using narrative clinical notes. J. Mood Anxiety Disord. 10, 100109 (2025).
Ahsan, H. et al. Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges. Proc. Mach. Learn Res. 248, 489–505 (2024).
Hond, A. et al. From text to treatment: the crucial role of validation for generative large language models in health care. Lancet Digit. Health 6, e441–e443 (2024).
Ong, J. C. L. et al. Medical ethics of large language models in medicine. NEJM AI 1, AIra2400038 (2024).
Alber, D. A. et al. Medical large language models are vulnerable to data-poisoning attacks. Nat. Med. https://doi.org/10.1038/s41591-024-03445-1 (2025).
Beutel, G., Geerits, E. & Kielstein, J. T. Artificial hallucination: GPT on LSD? Crit. Care 27, 148 (2023).
Kim, M. et al. Fine-tuning LLMs with medical data: can safety be ensured? NEJM AI 2, AIcs2400390 (2025).
Blumenthal, D. & Goldberg, C. Managing patient use of generative health AI. NEJM AI 2, AIpc2400927 (2025).
Hou, G. & Lian, Q. Benchmarking of commercial large language models: ChatGPT, Mistral, and Llama. Research Square https://www.researchsquare.com/article/rs-4376810/v1 (2024).
DeepSeek-AI et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at http://arxiv.org/abs/2501.12948 (2025).
Gibney, E. Scientists flock to DeepSeek: how they’re using the blockbuster AI model. Nature https://www.nature.com/articles/d41586-025-00275-0 (2025).
Sandmann, S., Riepenhausen, S., Plagwitz, L. & Varghese, J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024).
Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Hooftman, J. et al. Common contributing factors of diagnostic error: a retrospective analysis of 109 serious adverse event reports from Dutch hospitals. BMJ Qual. Saf. 33, 642–651 (2024).
Jackson, R. & Feder, G. Guidelines for clinical guidelines. Br. Med. J. 317, 427–428 (1998).
Alkaissi, H. & McFarlane, S. I. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus https://doi.org/10.7759/cureus.35179 (2023).
Bonferroni, C. E. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R. Istituto Super. di Sci. Economiche e Commerciali di Firenze 8, 3–62 (1936).
Signorell, A. DescTools: Tools for Descriptive Statistics. R package version 0.99.60. R Project https://doi.org/10.32614/CRAN.package.DescTools (2025).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2025); https://www.r-project.org/
Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947).
Acknowledgements
This work was enabled by the HiGHmed consortium funded by the German Ministry of Education and Research (grant number 01KX2121). R.E. acknowledges support by the Collaborative Research Center (SFB 1470) funded by the German Research Council (DFG) and by AI4HEALTH funded by the Natural Science Foundation of China (NSFC) (grant number W2441025). The icons of Extended Data Fig. 1 were generated using Figma (https://www.figma.com).
Author information
Contributions
R.E., B.W. and J.V. conceptualized the project. M.F. and L.B. performed data acquisition. S.H. and J.V. performed clinical evaluation. S.S. performed analyses and drafted the paper. R.E. and J.V. supervised the study. All authors reviewed and approved the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Medicine thanks Eric Oermann and Jie Yang for their contribution to the peer review of this work. Primary Handling Editors: Michael Basson, Lorenzo Righetto and Saheli Sadanand, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1
Visual abstract.
Extended Data Fig. 2 Summarized model performances for diagnosis and treatment recommendation tasks.
Histograms showing the performance of GPT-4o, DeepSeek-R1, Gemini-2.0 Flash Thinking Experimental (Gem2FTE) and DeepSeek-V3 considering diagnosis and treatment, rated with Likert scores. Five points represent the highest possible level of accuracy as assessed by the expert. The red line indicates the mean performance of each model.
Supplementary information
Supplementary Information
Supplementary Tables 1–11 and Figs. 1–7.
Supplementary Table 1
Overview of all clinical cases, their source information and assessment.
Supplementary Code 2
R script generating Fig. 1.
Supplementary Code 3
R script generating Fig. 2.
Supplementary Code 4
R script for performing power analysis.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.