Abstract
Large language models (LLMs) such as ChatGPT and DeepSeek are gaining attention for their potential in medical education. This study evaluates the performance of ChatGPT and DeepSeek on the United States Medical Licensing Examination (USMLE) and the Chinese National Medical Licensing Examination (CNMLE), and then proposes targeted optimization methods to advance the efficient and effective application of LLMs in medical education. We conducted a comparative quantitative analysis across multiple dimensions, including answer accuracy, consistency, number of reasoning characters, and runtime. Based on the identified limitations of LLMs, targeted optimizations were explored, including the construction of a technical safeguard framework and a multi-dimensional evaluation system. On the USMLE, DeepSeek achieved an average accuracy of 92.59% and a Fleiss’ Kappa of 0.96, versus 90.26% and 0.93 for ChatGPT. On the CNMLE, DeepSeek achieved an accuracy of 86.78% and a Fleiss’ Kappa of 0.96, versus 79.44% and 0.90 for ChatGPT. Both DeepSeek and ChatGPT demonstrated the ability to identify flawed questions, yet both also produced incorrect answers due to hallucinations, and DeepSeek had a comparatively longer runtime. To address these issues, we propose a Knowledge Graph-Based RAG Fact-Checking Framework centered on evidence anchoring, together with a multi-dimensional evaluation system focused on reliability and safety. Overall, DeepSeek outperforms ChatGPT in accuracy, particularly on complex medical problems and Chinese medical knowledge, but at the cost of longer runtime. The proposed optimization framework and evaluation system address core issues such as LLM hallucination and clarify the positioning of LLMs as “auxiliary tools” that require rigorous fact-checking.
These solutions jointly form a core governance system for the application of LLMs in medical education, providing key support for their precise and efficient integration into educational scenarios. The study indicates that LLMs are expected to drive a progressive transformation, evolving from functional enhancement to paradigm reconstruction.
Acknowledgements
Data Availability Statement
In this study, the 2023 USMLE (United States Medical Licensing Examination) sample exam, consisting of 376 publicly released questions, was downloaded from the official USMLE website (https://www.usmle.org/). The original questions of the 2022 Chinese National Medical Licensing Examination (CNMLE) were purchased from a local bookstore. While the specific datasets from these purchased resources are not publicly available in a centralized online repository, interested readers can obtain similar data by purchasing the same or equivalent CNMLE preparation books. These books typically contain practice questions, exam formats, and detailed explanations that mirror the actual CNMLE exam content, and they were instrumental in shaping the analysis and conclusions of this study. We have ensured that all data sources are properly cited within the manuscript to facilitate transparency and reproducibility of our research.
Funding
This research was supported by Peking Union Medical College’s 2024 Youth Medical Education Backbone Training Program under Grant No. [2024mesp010].
Author information
Authors and Affiliations
Contributions
Qing Wang drafted the initial manuscript and led subsequent revisions. Xiaoying Li designed the article’s structure and conceptualized its core framework. Junlian Li performed statistical analysis and data processing, while Panpan Deng managed data collection and curation. All authors provided critical feedback during revisions and approved the final version.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, Q., Li, J., Li, X. et al. Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education. Sci Rep (2026). https://doi.org/10.1038/s41598-026-40043-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-40043-2