Abstract
Background
Direct clinical uses of large language models (LLMs) remain controversial, partly because of the lack of methodological rigor in assessing their risks and benefits in medicine.
Methods
We developed Medieval, a multidisciplinary, randomized, and blinded expert evaluation framework. A ten-point Dreyfus-based scoring scale linked to the career stages of human physicians was designed to reflect response quality. Seven advanced LLMs or their distilled versions, all released within a short time frame (≤45 days) in early 2025, were tested. The incidence of fabricated medical facts was documented. Linear mixed-effects models and variance-stabilizing Bayesian generalized linear mixed models were employed for the statistical analyses.
Results
We first developed a high-quality question bank comprising 685 real and simulated clinical cases across 13 specialties. An expert panel of 27 clinicians (average years of service: 25.9) evaluated the 4795 model responses. We show that these LLM ratings (n = 9856) have excellent reliability (intraclass correlation coefficients >0.9). Among the seven LLMs tested, Gemini 2.0 Flash achieved the highest raw scores. However, after adjusting for confounders, DeepSeek-R1 was the top-performing model, with a mean score of 6.36 (95% confidence interval 6.03–6.69), a performance level equivalent to that of an early-career physician. Despite these strengths, 3–19% of LLM responses were rated as incompetent, and 40 instances of LLM hallucination were identified.
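For intuition about the reliability criterion above (ICC > 0.9), the sketch below computes a one-way random-effects intraclass correlation, ICC(1,1), from a complete ratings matrix. This is an illustrative simplification with a hypothetical helper name (`icc_oneway`): the study's actual analyses used estimators designed for planned incomplete rating designs (Ten Hove et al.), implemented in R, not this complete-data formula.

```python
def icc_oneway(ratings):
    """One-way random-effects intraclass correlation, ICC(1,1).

    ratings: list of rows, one row per rated target, one column per rater.
    Assumes every target is rated by every rater (complete data) --
    a simplification relative to the planned-incomplete design in the study.
    """
    n = len(ratings)       # number of rated targets
    k = len(ratings[0])    # number of raters per target
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-target and within-target mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Perfect agreement between two raters gives ICC = 1; disagreement lowers it.
print(icc_oneway([[1, 1], [2, 2], [3, 3]]))  # 1.0
print(icc_oneway([[1, 2], [2, 3], [3, 4]]))  # 0.6
```

Values near 1 indicate that most rating variance reflects true differences between responses rather than rater disagreement, which is what makes the panel's scores usable as a measurement instrument.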
Conclusions
Our study shows that despite LLMs’ substantial potential in medicine, their unguarded clinical application could present serious risks, which must be continuously monitored by human expert panels. The evaluation framework developed and validated in this study will facilitate such efforts.
Plain language summary
Artificial intelligence (AI) systems are becoming increasingly capable of answering medical questions, but it is still important to understand how safe and reliable they truly are. In this study, we created a structured evaluation framework in which experienced doctors reviewed the responses of seven newly released AI models to hundreds of real and simulated clinical scenarios. Across nearly 5,000 answers, doctors used a scoring system designed to reflect the quality expected at different stages of a medical career. Some AI models performed well, occasionally reaching a level similar to that of early-career physicians. However, doctors also identified answers that were incomplete, inaccurate, or based on invented details. Our study shows that while these AI systems are promising, ensuring their safe use in medicine will require ongoing oversight. Standard engineering tests alone are not enough; evaluations by clinical experts remain a crucial safeguard to help identify potential risks and prevent misuse.
Data availability
Due to privacy concerns, the original clinical cases submitted to the models, despite having been deidentified, will not be made publicly available, but can be obtained by email to PKC. The model responses to these questions have been uploaded to a public repository as Supplementary Data 1 (ref. 44). The evaluation results have been posted to the same repository as Supplementary Data 2 (ref. 32).
Code availability
The custom code for processing the LLM responses, running the web system, and analyzing the resulting data has been deposited in an open repository (ref. 45; https://doi.org/10.6084/m9.figshare.30889115). The analyses were run in R (version 4.3.1), with the dependent packages listed in the above repository.
References
Liu, M. et al. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using the Japanese national medical examination. Int. J. Med. Inf. 193, 105673 (2025).
Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
Zeng, D. et al. DeepSeek’s “Low-Cost” Adoption Across China’s Hospital Systems: Too Fast, Too Soon? JAMA 333, 1866–1869 (2025).
Feldman, M. J. et al. Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses. JAMA Netw. Open 8, e2512994 (2025).
Menezes, M. C. S. et al. The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study. Lancet Digit. Health 7, e35–e43 (2025).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Lim, Z. W. et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 95, 104770 (2023).
Rydzewski, N. R. et al. Comparative Evaluation of LLMs in Clinical Oncology. NEJM AI 1, https://doi.org/10.1056/aioa2300151 (2024).
Akkus Yildirim, B. et al. Large language models standardize the interpretation of complex oncology guidelines for brain metastases. Commun. Med. 6, 56 (2025).
McGrath, S. P. et al. A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions. J. Am. Med. Inf. Assoc. 31, 2271–2283 (2024).
Heinz, M. V. et al. Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI 2, AIoa2400802 (2025).
Du, X. et al. Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes. EBioMedicine 109, 105401 (2024).
Moura, L. et al. Implications of Large Language Models for Quality and Efficiency of Neurologic Care: Emerging Issues in Neurology. Neurology 102, e209497 (2024).
Sosa, B. R. et al. Capacity for large language model chatbots to aid in orthopedic management, research, and patient queries. J. Orthop. Res. 42, 1276–1282 (2024).
Kim, J. et al. Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease. Am. J. Hum. Genet. 111, 2190–2202 (2024).
Zelin, C. et al. Rare disease diagnosis using knowledge guided retrieval augmentation for ChatGPT. J. Biomed. Inf. 157, 104702 (2024).
Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983–2984 (2023).
Goodman, R. S. et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw. Open 6, e2336483 (2023).
Ong, J. C. L. et al. Medical ethics of large language models in medicine. NEJM AI 1, AIra2400038 (2024).
Li, H. et al. CMMLU: measuring massive multitask language understanding in Chinese. in Findings of the Association for Computational Linguistics: ACL 2024 (2024).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. in Conference on Health, Inference, and Learning. PMLR (2022).
Flores-Gouyonnet, J. et al. Performance of large language models in rheumatology board-like questions: accuracy, quality, and safety. Lancet Rheumatol. 7, e152–e154 (2025).
Chiang, W.-L. et al. Chatbot Arena: an open platform for evaluating LLMs by human preference. in Forty-first International Conference on Machine Learning (2024).
Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 7, e2440969 (2024).
Newble, D. Techniques for measuring clinical competence: objective structured clinical examinations. Med. Educ. 38, 199–203 (2004).
Carraccio, C. L. et al. From the educational bench to the clinical bedside: translating the Dreyfus developmental model to the learning of clinical skills. Acad. Med. 83, 761–767 (2008).
Ten Hove, D., Jorgensen, T. D. & van der Ark, L. A. How to Estimate Intraclass Correlation Coefficients for Interrater Reliability from Planned Incomplete Data. Multivar. Behav. Res. 60, 1042–1061 (2025).
Fong, Y., Rue, H. & Wakefield, J. Bayesian inference for generalized linear mixed models. Biostatistics 11, 397–412 (2010).
Giuffre, M., You, K. & Shung, D. L. Evaluating ChatGPT in Medical Contexts: The Imperative to Guard Against Hallucinations and Partial Accuracies. Clin. Gastroenterol. Hepatol. 22, 1145–1146 (2024).
Chen, P. & Cai, J.-F. Expert evaluations of LLM responses on clinical questions. [Dataset] (2025). Available from: https://doi.org/10.6084/m9.figshare.30899360.
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Riggs, E. R. et al. Technical standards for the interpretation and reporting of constitutional copy-number variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen). Genet. Med. 22, 245–257 (2020).
Dai, H. P. et al. Expert Consensus on the Diagnosis and Treatment of Anticancer Drug-Induced Interstitial Lung Disease. Curr. Med. Sci. 43, 1–12 (2023).
Brahmer, J. R. et al. Management of Immune-Related Adverse Events in Patients Treated With Immune Checkpoint Inhibitor Therapy: American Society of Clinical Oncology Clinical Practice Guideline. J. Clin. Oncol. 36, 1714–1768 (2018).
Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 31, 2546–2549 (2025).
Tordjman, M. et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat. Med. 31, 2550–2555 (2025).
Biderman, S. et al. Emergent and predictable memorization in large language models. Adv. Neural Inf. Process. Syst. 36, 28072–28090 (2023).
Antman, E. M. et al. ACC/AHA guidelines for the management of patients with ST-elevation myocardial infarction-executive summary. A report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines (Writing Committee to revise the 1999 guidelines for the management of patients with acute myocardial infarction). J. Am. Coll. Cardiol. 44, 671–719 (2004).
Duong, D. & Solomon, B. D. Artificial intelligence in clinical genetics. Eur. J. Hum. Genet. 33, 281–288 (2025).
Omar, M. et al. Sociodemographic biases in medical decision making by large language models. Nat. Med. 31, 1873–1881 (2025).
Omar, M. et al. Refining LLMs outputs with iterative consensus ensemble (ICE). Comput. Biol. Med. 196, 110731 (2025).
Chen, P. Responses of seven tested models for real clinical questions. [Dataset] V1.0 (2025). Available from: https://doi.org/10.6084/m9.figshare.30889556.
Chen, P., Shih, D. J. H. & Hu, Y. Medieval is an expert-evaluation framework for benefits and risks of using language models for medicine. [Software] V1.0 (2025). Available from: https://doi.org/10.6084/m9.figshare.30889115.
Acknowledgements
This work was supported by the Sanming Project of Medicine in Shenzhen (SZSM202311022), Shenzhen Clinical Research Center for Rare Diseases (LCYSSQ20220823091402005), Shenzhen Key Medical Discipline Construction Fund (SZXK2020084 and SZXK077), Shenzhen Science and Technology Program (JCYJ20250604180803005), Shenzhen Science and Technology Major Project of China (KJZD20240903102759061), Small Equipment Grant (HKU), and the HKU - SZH Translational Med Centre (HZQSWS-KCCYB-2024055). PKC thanks the Shenzhen Peacock Plan (No.20210830100 C) and Futian Talent Program, and Shenzhen General Research Program JCYJ20250604180803005. We thank the three disqualified evaluators for their engagement. We thank C Zhong, D Chen, Kevin Tam, Profs Pak Sham (HKU) and N Ding (ZJU), and Drs A. Wai and CF Jiang for assistance or useful discussions.
Author information
Authors and Affiliations
Contributions
K.M.C.C. and P.K.C. conceived the project. K.M.C.C., M.K.T.T. and Y.H. coordinated the efforts. P.K.C. designed the methodology, developed and implemented the web system, designed the analyses, and wrote the first draft. P.K.C., D.J.H.S. and Y.H. performed the statistical analyses. J.F.C., J.Y.Z., S.X.C., and C.G.X. are question designers and non-independent expert evaluators for the neurology, pediatrics, A.E., and NICU teams. L.H.Y., X.Y.D., X.W.C., Y.Z.W., X.L., S.F.G., X.L.L., and J.C.Y. are question designers for their affiliated departments. The co-first authors were all involved in designing the methodology. J.J., K.L.D., Y.Z.C., G.M.K., J.S.X., L.B.L., H.B.X., S.J.Y., J.Y., Y.L.Y., J.L.C., Y.H.C., Q.S.Zhang, Q.S.Zhou, L.N.Z., M.W., X.T., L.R., Z.X.W., W.F.Q., Y.L.W., L.W.C., X.Y.L., M.X.W., H.R.T., N.W., and P.P. are the independent evaluators in their respective affiliations’ disciplines (see also Supplementary Fig. S1). All evaluators vouched for the professionalism and neutrality of the evaluation process. Co-first authors co-wrote the first draft, and all authors contributed to, revised and approved the final draft.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Medicine thanks Mahmud Omar, Boya Zhang and Haoze Du for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, P., Cai, J., Zhou, J. et al. Multidisciplinary blinded randomized expert evaluation of large language models for clinical diagnosis and management. Commun Med (2026). https://doi.org/10.1038/s43856-026-01576-9
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s43856-026-01576-9