Brief Communication

Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning

Abstract

DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. Here we assessed the capabilities of three LLMs (DeepSeek-R1, ChatGPT-o1 and Llama 3.1-405B) in performing four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning on the basis of text-based diagnostic and management cases, providing tumor classification according to RECIST 1.1 criteria and providing summaries of diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1 (accuracy 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy 0.95; P = 0.04) but better than that of Llama 3.1-405B (accuracy 0.83; P < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy 0.57 versus 0.55, P = 0.76, and 0.74 versus 0.76, P = 0.06, using the New England Journal of Medicine and Médicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.74 versus 0.81; P = 0.10). Diagnostic reasoning steps provided by DeepSeek were deemed more accurate than those provided by ChatGPT and Llama 3.1-405B (average Likert score of 3.61, 3.22 and 3.13, respectively; P = 0.005 and P < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 versus 4.8; P < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement.
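The abstract reports pairwise comparisons of model accuracies with associated P values. The statistical test behind these comparisons is not stated in this preview, so the following is only a hedged sketch of how a paired per-question comparison between two models could be run, here assuming McNemar's test; the variable names and example data are hypothetical and are not the authors' code or data.

```python
# Illustrative sketch (not the authors' code): paired accuracy comparison of
# two LLMs answering the same question set, using McNemar's test on the
# discordant pairs. The per-question correctness vectors below are hypothetical.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

model_a_correct = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0], dtype=bool)
model_b_correct = np.array([1, 0, 0, 1, 1, 1, 1, 1, 0, 0], dtype=bool)

# 2x2 table of concordant/discordant answers between the two models.
table = np.array([
    [np.sum(model_a_correct & model_b_correct),    # both correct
     np.sum(model_a_correct & ~model_b_correct)],  # only model A correct
    [np.sum(~model_a_correct & model_b_correct),   # only model B correct
     np.sum(~model_a_correct & ~model_b_correct)]  # both incorrect
])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"Accuracy A: {model_a_correct.mean():.2f}, "
      f"Accuracy B: {model_b_correct.mean():.2f}, "
      f"McNemar P = {result.pvalue:.3f}")
```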

Fig. 1: Overview of this study comparing three LLMs in performing the four indicated medical tasks.
Fig. 2: Comparative performance of three LLMs for a variety of medical tasks.

Data availability

The USMLE dataset is available at https://www.usmle.org/exam-resources. The NEJM case challenges are available at https://www.nejm.org/case-challenges. The BMJ endgames are available at https://www.bmj.com/specialties/endgames. The MIMIC-III dataset can be requested at https://physionet.org/content/mimiciii/1.4/. The Médicilline dataset (http://www.medicilline.com/) of multiple-choice questions for clinical diagnosis and management, translated into English, is available upon request for private or research use only, after agreement from Médicilline. Requests for Médicilline data should be sent to Mickael.Tordjman@mssm.edu and are typically answered within approximately 2 weeks. The radiological reports used for summarization and RECIST classification are not available owing to privacy concerns.

Code availability

All studies were conducted using the Azure OpenAI service (ChatGPT), Azure AI Foundry (Llama) and the Azure AI service (DeepSeek). The models were used according to the official guidelines provided by the developers of the evaluated LLMs (DeepSeek, ChatGPT and Llama), and the implementation followed these guidelines strictly to ensure consistency and reproducibility of results. Because the study did not involve the development of new code but rather the application of existing, officially provided models, the specific source code referenced is proprietary and maintained by the respective developers.
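The Code availability statement indicates that all three models were queried through Azure-hosted services rather than custom code. A minimal sketch of what such a query might look like with the Azure OpenAI Python SDK is given below; the endpoint variables, API version, deployment name and prompt text are placeholders assumed for illustration and are not values reported by the authors. DeepSeek and Llama deployments on Azure AI can be queried analogously, for example through the azure-ai-inference ChatCompletionsClient.

```python
# Minimal sketch (assumptions, not the authors' implementation): querying an
# Azure-hosted chat model with a USMLE-style prompt via the Azure OpenAI
# Python SDK. Endpoint, API version and deployment name are hypothetical.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # assumed; use the version available to your deployment
)

response = client.chat.completions.create(
    model="my-chat-deployment",  # hypothetical deployment name
    messages=[
        {"role": "system",
         "content": "You are a medical expert. Answer with the single best option."},
        {"role": "user",
         "content": "A 54-year-old man presents with ... Which is the most likely diagnosis? (A) ... (B) ..."},
    ],
)
print(response.choices[0].message.content)
```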


Acknowledgements

We thank G. Zagury and Médicilline Editions for sharing their dataset. We thank N. Venturelli, L. Chiche, J. Beaziz and R. Lejoyeux for their help updating these questions based on the latest medical knowledge. This project is supported by the Eric and Wendy Schmidt AI in Human Health Fellowship, a program of Schmidt Sciences. M.T. is supported by the French Society of Radiology and the French Musculoskeletal Imaging Society. Figure 1 was created with BioRender.com.

Author information

Contributions

Concept and design: M.T., Z.L., H.-C.L., Z.A.F. and X.M. Acquisition, analysis or interpretation of data: M.T., Z.L., M.Y., V.F., Y.M., J.H., I.B., H.A., C.H., A.S.P., A.G., A.M., N.Y., N.N., P.R., A.Z., S.L., M.H., T.D., B.T., H.-C.L., Z.A.F. and X.M. Drafting of the paper: M.T., Z.L., M.Y., V.F., A.M., I.B., S.L., B.T., H.-C.L., Z.A.F. and X.M. Critical revision of the paper and final draft: all authors.

Corresponding authors

Correspondence to Hao-Chih Lee, Zahi A. Fayad or Xueyan Mei.

Ethics declarations

Competing interests

T.D. is the managing partner of RadImageNet LLC and a paid consultant to GEHC and AirsMedical. X.M. is a paid consultant to RadImageNet LLC. The other authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Ahmed Alaa, Kirk Roberts and Jie Yang for their contribution to the peer review of this work. Primary Handling Editors: Michael Basson, Lorenzo Righetto and Saheli Sadanand, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1 Number of parameters for each model according to Azure
Extended Data Table 2 Prompts used for the different tasks
Extended Data Table 3 Text-based case evaluation using 5-point Likert score (adapted from the R-IDEA score) for the reasoning evaluation of Complex NEJM Cases and Management evaluation of the BMJ cases; and accuracy for the NEJM and private database of multiple choice questions
Extended Data Table 4 Comparison of the average scores for the linguistic evaluation of the “Findings to Impression” task on the 2 databases (private database of reports and MIMIC) based on the evaluation of 2 independent radiologists for each database
Extended Data Table 5 RECIST 1.1 performances for the 3 LLMs per category of response based on the consensus between the 3 human experts (average of the 3 runs)
Extended Data Table 6 BERTscore (F1) between the 3 runs of each model for test-retest repeatability
Extended Data Table 7 Fleiss kappa between the 3 runs of each model for test-retest repeatability
Extended Data Table 8 Inter-model Cohen kappa based on the first run of each model for USMLE, RECIST, Medicilline questions, and NEJM diagnostic cases
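Extended Data Tables 6-8 report test-retest repeatability and inter-model agreement using BERTScore (F1), Fleiss kappa and Cohen kappa. As an illustration only, and not the authors' code, these metrics can be computed with commonly used Python packages as sketched below; the example responses, runs and category labels are hypothetical.

```python
# Illustrative sketch (not the authors' code) of the agreement metrics named in
# Extended Data Tables 6-8, using commonly available Python packages.
import numpy as np
from bert_score import score as bert_score                      # pip install bert-score
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters
from sklearn.metrics import cohen_kappa_score

# BERTScore F1 between two runs of the same model (free-text repeatability).
run1 = ["Stable disease according to RECIST 1.1."]
run2 = ["The target lesions indicate stable disease (RECIST 1.1)."]
_, _, f1 = bert_score(run1, run2, lang="en")
print("BERTScore F1:", float(f1.mean()))

# Fleiss kappa across three runs treated as three raters of categorical answers.
runs = np.array([   # rows = items, columns = runs, values = chosen category
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
])
counts, _ = aggregate_raters(runs)   # items x categories count table
print("Fleiss kappa:", fleiss_kappa(counts))

# Cohen kappa between the first runs of two different models.
model_a = [0, 1, 2, 2, 1]
model_b = [0, 1, 2, 1, 1]
print("Cohen kappa:", cohen_kappa_score(model_a, model_b))
```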

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tordjman, M., Liu, Z., Yuce, M. et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med 31, 2550–2555 (2025). https://doi.org/10.1038/s41591-025-03726-3

