Abstract
Background
Direct clinical uses of large language models (LLMs) remain controversial, partly because of the lack of methodological rigor in assessing their risks and benefits in medicine.
Methods
We developed Medieval, a multidisciplinary, randomized, and blinded expert evaluation framework. A ten-point Dreyfus-based scoring scale linked to the career stages of human physicians was designed to reflect response quality. Seven advanced LLMs or their distilled versions, all released within a short time frame (≤45 days) in early 2025, were tested. The incidence of fabricated medical facts was documented. Linear mixed-effects models and variance-stabilizing Bayesian generalized linear mixed models were employed for the statistical analyses.
Results
We first developed a high-quality question bank comprising 685 real and simulated clinical cases across 13 specialties. An expert panel of 27 clinicians (average years of service: 25.9) evaluated the 4795 model responses. We show that these LLM ratings (n = 9856) have excellent reliability (intraclass correlation coefficients >0.9). Among the seven LLMs tested, Gemini 2.0 Flash achieved the highest raw scores. However, after adjusting for confounders, DeepSeek-R1 was the top-performing model, with a mean score of 6.36 (95% confidence interval 6.03–6.69), a performance level equivalent to that of an early-career physician. Despite these strengths, 3–19% of LLM responses were rated as incompetent, and 40 instances of LLM hallucination were identified.
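For intuition about the reliability criterion above (ICC > 0.9), the sketch below computes a one-way random-effects intraclass correlation, ICC(1,1), from a complete ratings matrix. This is an illustrative simplification with a hypothetical helper name (`icc_oneway`): the study's actual analyses used estimators designed for planned incomplete rating designs (Ten Hove et al.), implemented in R, not this complete-data formula.

```python
def icc_oneway(ratings):
    """One-way random-effects intraclass correlation, ICC(1,1).

    ratings: list of rows, one row per rated target, one column per rater.
    Assumes every target is rated by every rater (complete data) --
    a simplification relative to the planned-incomplete design in the study.
    """
    n = len(ratings)       # number of rated targets
    k = len(ratings[0])    # number of raters per target
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-target and within-target mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Perfect agreement between two raters gives ICC = 1; disagreement lowers it.
print(icc_oneway([[1, 1], [2, 2], [3, 3]]))  # 1.0
print(icc_oneway([[1, 2], [2, 3], [3, 4]]))  # 0.6
```

Values near 1 indicate that most rating variance reflects true differences between responses rather than rater disagreement, which is what makes the panel's scores usable as a measurement instrument.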
Conclusions
Our study shows that despite LLMs’ substantial potential in medicine, their unguarded clinical application could present serious risks, which must be continuously monitored by human expert panels. The evaluation framework developed and validated in this study will facilitate such efforts.
Plain language summary
Artificial intelligence (AI) systems are becoming increasingly capable of answering medical questions, but it is still important to understand how safe and reliable they truly are. In this study, we created a structured evaluation framework in which experienced doctors reviewed the responses of seven newly released AI models to hundreds of real and simulated clinical scenarios. Across nearly 5,000 answers, doctors used a scoring system designed to reflect the quality expected at different stages of a medical career. Some AI models performed well, occasionally reaching a level similar to that of early-career physicians. However, doctors also identified answers that were incomplete, inaccurate, or based on invented details. Our study shows that while these AI systems are promising, ensuring their safe use in medicine will require ongoing oversight. Standard engineering tests alone are not enough; evaluations by clinical experts remain a crucial safeguard to help identify potential risks and prevent misuse.
Data availability
Due to privacy concerns, the original clinical cases submitted to the models, despite having been deidentified, will not be made publicly available, but can be obtained by email to PKC. The model responses to these questions have been uploaded to a public repository as Supplementary Data 1 (ref. 44). The evaluation results have been posted to the same repository as Supplementary Data 2 (ref. 32).
Code availability
The custom code for processing the LLM responses, running the web system, and analyzing the resulting data has been deposited in an open repository (ref. 45; https://doi.org/10.6084/m9.figshare.30889115). The analyses were run in R (version 4.3.1), with the dependent packages listed in the above repository.
References
Liu, M. et al. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using the Japanese national medical examination. Int. J. Med. Inf. 193, 105673 (2025).
Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
Zeng, D. et al. DeepSeek’s “Low-Cost” Adoption Across China’s Hospital Systems: Too Fast, Too Soon? JAMA 333, 1866–1869 (2025).
Feldman, M. J. et al. Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses. JAMA Netw. Open 8, e2512994 (2025).
Menezes, M. C. S. et al. The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study. Lancet Digit. Health 7, e35–e43 (2025).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Lim, Z. W. et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 95, 104770 (2023).
Rydzewski, N. R. et al. Comparative Evaluation of LLMs in Clinical Oncology. NEJM AI 1, https://doi.org/10.1056/aioa2300151 (2024).
Akkus Yildirim, B. et al. Large language models standardize the interpretation of complex oncology guidelines for brain metastases. Commun. Med. 6, 56 (2025).
McGrath, S. P. et al. A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions. J. Am. Med. Inf. Assoc. 31, 2271–2283 (2024).
Heinz, M. V. et al. Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI 2, AIoa2400802 (2025).
Du, X. et al. Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes. EBioMedicine 109, 105401 (2024).
Moura, L. et al. Implications of Large Language Models for Quality and Efficiency of Neurologic Care: Emerging Issues in Neurology. Neurology 102, e209497 (2024).
Sosa, B. R. et al. Capacity for large language model chatbots to aid in orthopedic management, research, and patient queries. J. Orthop. Res. 42, 1276–1282 (2024).
Kim, J. et al. Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease. Am. J. Hum. Genet. 111, 2190–2202 (2024).
Zelin, C. et al. Rare disease diagnosis using knowledge guided retrieval augmentation for ChatGPT. J. Biomed. Inf. 157, 104702 (2024).
Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983–2984 (2023).
Goodman, R. S. et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw. Open 6, e2336483 (2023).
Ong, J. C. L. et al. Medical ethics of large language models in medicine. NEJM AI 1, AIra2400038 (2024).
Li, H. et al. CMMLU: measuring massive multitask language understanding in Chinese. in Findings of the Association for Computational Linguistics: ACL 2024 (2024).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. in Conference on Health, Inference, and Learning. PMLR (2022).
Flores-Gouyonnet, J. et al. Performance of large language models in rheumatology board-like questions: accuracy, quality, and safety. Lancet Rheumatol. 7, e152–e154 (2025).
Chiang, W.-L. et al. Chatbot Arena: an open platform for evaluating LLMs by human preference. in Forty-first International Conference on Machine Learning (2024).
Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 7, e2440969 (2024).
Newble, D. Techniques for measuring clinical competence: objective structured clinical examinations. Med. Educ. 38, 199–203 (2004).
Carraccio, C. L. et al. From the educational bench to the clinical bedside: translating the Dreyfus developmental model to the learning of clinical skills. Acad. Med. 83, 761–767 (2008).
Ten Hove, D., Jorgensen, T. D. & van der Ark, L. A. How to Estimate Intraclass Correlation Coefficients for Interrater Reliability from Planned Incomplete Data. Multivar. Behav. Res. 60, 1042–1061 (2025).
Fong, Y., Rue, H. & Wakefield, J. Bayesian inference for generalized linear mixed models. Biostatistics 11, 397–412 (2010).
Giuffre, M., You, K. & Shung, D. L. Evaluating ChatGPT in Medical Contexts: The Imperative to Guard Against Hallucinations and Partial Accuracies. Clin. Gastroenterol. Hepatol. 22, 1145–1146 (2024).
Chen, P. & Cai, J.-F. Expert evaluations of LLM responses on clinical questions. [Dataset] (2025). Available from: https://doi.org/10.6084/m9.figshare.30899360.
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Riggs, E. R. et al. Technical standards for the interpretation and reporting of constitutional copy-number variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen). Genet. Med. 22, 245–257 (2020).
Dai, H. P. et al. Expert Consensus on the Diagnosis and Treatment of Anticancer Drug-Induced Interstitial Lung Disease. Curr. Med. Sci. 43, 1–12 (2023).
Brahmer, J. R. et al. Management of Immune-Related Adverse Events in Patients Treated With Immune Checkpoint Inhibitor Therapy: American Society of Clinical Oncology Clinical Practice Guideline. J. Clin. Oncol. 36, 1714–1768 (2018).
Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 31, 2546–2549 (2025).
Tordjman, M. et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat. Med. 31, 2550–2555 (2025).
Biderman, S. et al. Emergent and predictable memorization in large language models. Adv. Neural Inf. Process. Syst. 36, 28072–28090 (2023).
Antman, E. M. et al. ACC/AHA guidelines for the management of patients with ST-elevation myocardial infarction-executive summary. A report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines (Writing Committee to revise the 1999 guidelines for the management of patients with acute myocardial infarction). J. Am. Coll. Cardiol. 44, 671–719 (2004).
Duong, D. & Solomon, B. D. Artificial intelligence in clinical genetics. Eur. J. Hum. Genet. 33, 281–288 (2025).
Omar, M. et al. Sociodemographic biases in medical decision making by large language models. Nat. Med. 31, 1873–1881 (2025).
Omar, M. et al. Refining LLMs outputs with iterative consensus ensemble (ICE). Comput. Biol. Med. 196, 110731 (2025).
Chen, P. Responses of seven tested models for real clinical questions. [Dataset] V1.0 (2025). Available from: https://doi.org/10.6084/m9.figshare.30889556.
Chen, P., Shih, D. J. H. & Hu, Y. Medieval is an expert-evaluation framework for benefits and risks of using language models for medicine. [Software] V1.0 (2025). Available from: https://doi.org/10.6084/m9.figshare.30889115.
Acknowledgements
This work was supported by the Sanming Project of Medicine in Shenzhen (SZSM202311022), Shenzhen Clinical Research Center for Rare Diseases (LCYSSQ20220823091402005), Shenzhen Key Medical Discipline Construction Fund (SZXK2020084 and SZXK077), Shenzhen Science and Technology Program (JCYJ20250604180803005), Shenzhen Science and Technology Major Project of China (KJZD20240903102759061), Small Equipment Grant (HKU), and the HKU - SZH Translational Med Centre (HZQSWS-KCCYB-2024055). PKC thanks the Shenzhen Peacock Plan (No.20210830100 C) and Futian Talent Program, and Shenzhen General Research Program JCYJ20250604180803005. We thank the three disqualified evaluators for their engagement. We thank C Zhong, D Chen, Kevin Tam, Profs Pak Sham (HKU) and N Ding (ZJU), and Drs A. Wai and CF Jiang for assistance or useful discussions.
Author information
Authors and Affiliations
Contributions
K.M.C.C. and P.K.C. conceived the project. K.M.C.C., M.K.T.T. and Y.H. coordinated the efforts. P.K.C. designed the methodology, developed and implemented the web system, designed the analyses, and wrote the first draft. P.K.C., D.J.H.S. and Y.H. performed the statistical analyses. J.F.C., J.Y.Z., S.X.C., and C.G.X. are question designers and non-independent expert evaluators for the neurology, pediatrics, A.E., and NICU teams. L.H.Y., X.Y.D., X.W.C., Y.Z.W., X.L., S.F.G., X.L.L., and J.C.Y. are question designers for their affiliated departments. The co-first authors were all involved in designing the methodology. J.J., K.L.D., Y.Z.C., G.M.K., J.S.X., L.B.L., H.B.X., S.J.Y., J.Y., Y.L.Y., J.L.C., Y.H.C., Q.S.Zhang, Q.S.Zhou, L.N.Z., M.W., X.T., L.R., Z.X.W., W.F.Q., Y.L.W., L.W.C., X.Y.L., M.X.W., H.R.T., N.W., and P.P. are the independent evaluators in their respective affiliations’ disciplines (see also Supplementary Fig. S1). All evaluators vouched for the professionalism and neutrality of the evaluation process. Co-first authors co-wrote the first draft, and all authors contributed to, revised and approved the final draft.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Medicine thanks Mahmud Omar, Boya Zhang and Haoze Du for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, P., Cai, J., Zhou, J. et al. Multidisciplinary blinded randomized expert evaluation of large language models for clinical diagnosis and management. Commun Med (2026). https://doi.org/10.1038/s43856-026-01576-9
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s43856-026-01576-9