Abstract
Large language models (LLMs) such as ChatGPT and DeepSeek are gaining attention for their potential in medical education. This study evaluates the performance of ChatGPT and DeepSeek on the United States Medical Licensing Examination (USMLE) and the Chinese National Medical Licensing Examination (CNMLE), and then proposes targeted optimization methods to advance the efficient and effective application of LLMs in medical education. We conducted a comparative quantitative analysis across multiple dimensions, including answer accuracy, consistency, number of reasoning characters, and runtime. Based on the identified limitations of LLMs, targeted optimizations were explored, including the construction of a technical safeguard framework and a multi-dimensional evaluation system. On the USMLE, DeepSeek achieved an average accuracy of 92.59% and a Fleiss’ Kappa of 0.96, versus 90.26% and 0.93 for ChatGPT. On the CNMLE, DeepSeek achieved an accuracy of 86.78% and a Fleiss’ Kappa of 0.96, versus 79.44% and 0.90 for ChatGPT. Both DeepSeek and ChatGPT demonstrated the ability to identify flawed questions, yet both also produced incorrect answers due to hallucinations, and DeepSeek had a comparatively longer runtime. To address these issues, we propose a Knowledge Graph-Based RAG Fact-Checking Framework centered on evidence anchoring, together with a multi-dimensional evaluation system focused on reliability and safety. Overall, DeepSeek outperforms ChatGPT in accuracy, particularly on complex medical problems and Chinese medical knowledge, but at the cost of longer runtime. The proposed optimization framework and evaluation system address core issues such as LLM hallucination and clarify the positioning of LLMs as “auxiliary tools” that require rigorous fact-checking.
These solutions jointly form a core governance system for the application of LLMs in medical education, providing key support for their precise and efficient integration into educational scenarios. The study indicates that LLMs are expected to drive a progressive transformation, evolving from functional enhancement to paradigm reconstruction.
Acknowledgements
Data Availability Statement
In this study, the 2023 USMLE (United States Medical Licensing Examination) sample exam, consisting of 376 publicly released questions, was downloaded from the official USMLE website (https://www.usmle.org/). The original questions of the 2022 Chinese National Medical Licensing Examination (CNMLE) were purchased from a local bookstore. While the specific datasets from these purchased resources are not publicly available in a centralized online repository, interested readers can obtain similar data by purchasing the same or equivalent CNMLE preparation books. These books typically contain practice questions, exam formats, and detailed explanations that mirror the actual CNMLE exam content, and they were instrumental in shaping the analysis and conclusions of this study. We have ensured that all data sources are properly cited within the manuscript to facilitate transparency and reproducibility of our research.
Funding
This research was supported by Peking Union Medical College’s 2024 Youth Medical Education Backbone Training Program under Grant No. [2024mesp010].
Author information
Authors and Affiliations
Contributions
Qing Wang drafted the initial manuscript and led subsequent revisions. Xiaoying Li designed the article’s structure and conceptualized its core framework. Junlian Li performed statistical analysis and data processing, while Panpan Deng managed data collection and curation. All authors provided critical feedback during revisions and approved the final version.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, Q., Li, J., Li, X. et al. Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education. Sci Rep (2026). https://doi.org/10.1038/s41598-026-40043-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-40043-2