Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education
  • Article
  • Open access
  • Published: 17 March 2026

  • Qing Wang1,
  • Junlian Li1,
  • Xiaoying Li1 &
  • Panpan Deng1

Scientific Reports (2026)


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computational biology and bioinformatics
  • Health care
  • Mathematics and computing

Abstract

Large language models (LLMs) such as ChatGPT and DeepSeek are gaining attention for their potential in medical education. This study evaluates the performance of ChatGPT and DeepSeek on the United States Medical Licensing Examination (USMLE) and the Chinese National Medical Licensing Examination (CNMLE), and then proposes targeted optimizations to advance the efficient and effective application of LLMs in medical education. We conducted a comparative quantitative analysis across multiple dimensions, including answer accuracy, consistency, number of reasoning characters, and runtime. Based on the identified limitations of the models, we explored targeted optimizations, including a technical safeguard framework and a multi-dimensional evaluation system. On the USMLE, DeepSeek achieved an average accuracy of 92.59% with a Fleiss' kappa of 0.96, versus 90.26% and 0.93 for ChatGPT. On the CNMLE, DeepSeek achieved 86.78% accuracy with a Fleiss' kappa of 0.96, versus 79.44% and 0.90 for ChatGPT. Both models were able to identify flawed questions, yet both also produced incorrect answers due to hallucinations, and DeepSeek had a relatively longer runtime. To address these issues, we propose a knowledge-graph-based retrieval-augmented generation (RAG) fact-checking framework centered on evidence anchoring, together with a multi-dimensional evaluation system focused on reliability and safety. DeepSeek generally outperforms ChatGPT in accuracy, particularly on complex medical problems and Chinese medical knowledge, but at the cost of longer runtime. The proposed optimization framework and evaluation system address core issues such as LLM hallucination and clarify the positioning of LLMs as "auxiliary tools" that require rigorous fact-checking.
Together, these solutions form a core governance system for the application of LLMs in medical education, providing key support for their precise and efficient integration into educational scenarios. The study indicates that LLMs can bring about a progressive transformation, evolving from functional enhancement to paradigm reconstruction.
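The consistency figures reported above are Fleiss' kappa values over repeated runs of each model on the same question set. As a minimal sketch (not the authors' code), assuming each question is posed n times and the model's answer choices are treated as categorical ratings, the statistic can be computed as:

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for N items, each rated n times.

    ratings: list of N lists, each holding the n categorical answers
             (e.g. "A".."E") a model gave for one question.
    categories: the full set of possible answer options.
    """
    n = len(ratings[0])   # answers per item (assumed equal for all items)
    N = len(ratings)
    totals = Counter()    # marginal counts per category, pooled over items
    P_bar = 0.0           # mean observed per-item agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # agreement among the n answers to this item
        P_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar /= N
    # chance agreement from the marginal category proportions
    P_e = sum((totals[c] / (N * n)) ** 2 for c in categories)
    return (P_bar - P_e) / (1 - P_e)
```

For example, three questions answered identically on every run give kappa = 1.0 (perfect consistency), matching the near-1 values reported for both models; disagreeing runs pull the value toward 0.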


Data availability

In this study, the 2023 USMLE sample exam, consisting of 376 publicly released questions, was downloaded from the official USMLE website (https://www.usmle.org/). The original questions of the 2022 Chinese National Medical Licensing Examination (CNMLE) were purchased from a local bookstore. While these purchased data sets are not publicly available in a centralized online repository, interested readers can obtain similar data by purchasing the same or equivalent CNMLE preparation books. These books typically contain practice questions, exam formats, and detailed explanations that mirror the actual CNMLE exam content, and they were instrumental in shaping the analysis and conclusions of this study. All data sources are cited within the manuscript to support transparency and reproducibility.

Funding

This research was supported by Peking Union Medical College’s 2024 Youth Medical Education Backbone Training Program under Grant No. [2024mesp010].

Author information

Author notes
  1. These authors contributed equally to this work: Qing Wang and Junlian Li.

Authors and Affiliations

  1. Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China

    Qing Wang, Junlian Li, Xiaoying Li & Panpan Deng


Contributions

Qing Wang drafted the initial manuscript and led subsequent revisions. Xiaoying Li designed the article’s structure and conceptualized its core framework. Junlian Li performed statistical analysis and data processing, while Panpan Deng managed data collection and curation. All authors provided critical feedback during revisions and approved the final version.

Corresponding author

Correspondence to Xiaoying Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Wang, Q., Li, J., Li, X. et al. Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education. Sci Rep (2026). https://doi.org/10.1038/s41598-026-40043-2


  • Received: 07 October 2025

  • Accepted: 10 February 2026

  • Published: 17 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-40043-2


Keywords

  • Large language models
  • Medical education
  • DeepSeek
  • ChatGPT
  • United States Medical Licensing Examination
  • Chinese National Medical Licensing Examination