Multidisciplinary blinded randomized expert evaluation of large language models for clinical diagnosis and management
  • Article
  • Open access
  • Published: 15 April 2026

  • Peikai Chen^1,2,3,4 (ORCID: orcid.org/0000-0003-1880-0893) na1,
  • Jifu Cai^2,5 na1,
  • Jiaying Zhou^2,6 na1,
  • Shaoxi Chen^7 na1,
  • Chenguang Xu^8 na1,
  • Lihua Yuan^9 na1,
  • Xiaoying Dai^2,10 na1,
  • Xiaowei Chen^11 na1,
  • Yanzhe Wei^1 na1,
  • Xia Li^12 na1,
  • Shaofeng Gong^13 na1,
  • Xiaolong Liang^14 na1,
  • Jiancheng Yang^15 na1,
  • Jun Jin^13,
  • Kanglin Dai^9,
  • Yuzhen Cui^5,
  • Guan-Ming Kuang^1,
  • Jiansheng Xie^2,10,
  • Libing Luo^2,10,
  • Haibing Xiao^2,5,
  • Shijie Yin^1,2,
  • Jun Yang^6,
  • Yulan Yan^12,
  • Jianliang Chen^6,
  • Yihua Chen^8,
  • Qianshen Zhang^2,8,
  • Qingshan Zhou^13,
  • Lina Zhao^15,
  • Min Wu^15,
  • Xin Tang^16,
  • Lei Rong^12,
  • Zanxin Wang^14,
  • Weifu Qiu^7,
  • Yanli Wang^7,
  • Liwen Cui^11,
  • Xiangyang Li^11,
  • Yong Hu^3,
  • Huiren Tao^1,
  • Nan Wu^2,17 (ORCID: orcid.org/0000-0002-9429-2889),
  • David J. H. Shih^4 (ORCID: orcid.org/0000-0002-9802-4937),
  • Pearl Pai^2,11,
  • Minxin Wei^14,
  • Michael Kai-tsun To^1,2 (ORCID: orcid.org/0000-0001-6853-0591) &
  • Kenneth M. C. Cheung^1,2 (ORCID: orcid.org/0000-0001-8304-0419)

Communications Medicine, Article number: (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Health care
  • Medical research

Abstract

Background

Direct clinical uses of large language models (LLMs) remain controversial, partly because of the lack of methodological rigor in assessing their risks and benefits in medicine.

Methods

We developed Medieval, a multidisciplinary, randomized, and blinded expert evaluation framework. A ten-point Dreyfus-based scoring scale linked to the career stages of human physicians was designed to reflect response quality. Seven advanced LLMs, or their distilled versions, released within a short time frame (≤45 days) in early 2025 were tested. The incidence of fabricated medical facts was documented. Linear mixed-effects models and variance-stabilizing Bayesian generalized linear mixed models were employed for the statistical analyses.
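
As a purely illustrative sketch (not the authors' published code), a linear mixed-effects analysis of this kind could be set up in R along the following lines; the data frame `ratings` and its columns (score, model, evaluator, specialty, case) are hypothetical placeholders for the study's long-format rating data.

    # Minimal sketch, assuming one row per expert rating of one model response.
    library(lme4)
    fit <- lmer(score ~ model + (1 | evaluator) + (1 | specialty) + (1 | case),
                data = ratings)
    summary(fit)  # fixed effect of `model` gives confounder-adjusted differences
    # A variance-stabilizing Bayesian generalized linear mixed model could be
    # fitted analogously with, e.g., brms::brm(), choosing a suitable family and
    # priors; the exact specification used in the study is in the code repository (ref. 45).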

Results

We first developed a high-quality question bank comprising 685 real and simulated clinical cases across 13 specialties. An expert panel of 27 clinicians (average years of service: 25.9) evaluated the 4795 model responses. We show that these LLM ratings (n = 9856) have excellent reliability (intraclass correlation coefficients > 0.9). Among the seven LLMs tested, Gemini 2.0 Flash achieved the highest raw scores. However, after adjusting for confounders, DeepSeek-R1 was the top-performing model with a mean score of 6.36 (95% confidence interval 6.03–6.69), a performance level equivalent to that of an early-career physician. Despite these strengths, 3–19% of LLM responses were rated as incompetent, and 40 instances of LLM hallucination were identified.
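
For intuition only, an intraclass correlation coefficient (ICC) of the kind reported above can be read off the variance components of a mixed model; the sketch below is a simplified, assumption-laden illustration in R (hypothetical `ratings` data with a `response` identifier), not the planned-incomplete-data estimator of ref. 29 used in the study.

    # Minimal sketch: ICC as the share of score variance attributable to the
    # rated response, relative to total variance (response + evaluator + residual).
    library(lme4)
    fit_icc <- lmer(score ~ 1 + (1 | response) + (1 | evaluator), data = ratings)
    vc <- as.data.frame(VarCorr(fit_icc))
    icc <- vc$vcov[vc$grp == "response"] / sum(vc$vcov)
    icc  # values > 0.9 would correspond to the excellent reliability reported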

Conclusions

Our study shows that, in spite of LLMs’ substantial potential in medicine, their unguarded clinical application could present serious risks, which must be continuously monitored by human expert panels. The evaluation framework developed and validated in this study will facilitate such efforts.

Plain language summary

Artificial intelligence (AI) systems are becoming increasingly capable of answering medical questions, but it is still important to understand how safe and reliable they truly are. In this study, we created a structured evaluation framework where experienced doctors reviewed the responses of seven newly released AI models to hundreds of real and simulated clinical scenarios. Across nearly 5,000 answers, doctors used a scoring system designed to reflect the quality expected at different stages of a medical career. Some AI models performed well, occasionally reaching a level similar to that of early career physicians. However, doctors also identified answers that were incomplete, inaccurate, or based on invented details. Our study shows that while these AI systems are promising, ensuring their safe use in medicine will require ongoing oversight. Standard engineering tests alone are not enough; evaluations by clinical experts remain a crucial safeguard to help identify potential risks and prevent misuse.

Data availability

Due to privacy concerns, the original clinical cases submitted to the models, despite having been deidentified, will not be made publicly available, but can be obtained by email to P.K.C. The model responses to these questions have been uploaded to a public repository as Supplementary Data 1 (ref. 44). The evaluation results have been posted to the same repository as Supplementary Data 2 (ref. 32).

Code availability

The custom code for processing the LLM responses, running the web system, and analyzing the resulting data has been deposited in an open repository (ref. 45; https://doi.org/10.6084/m9.figshare.30889115). The R scripts were run in R (version 4.3.1), with the dependent packages listed in the above repository.

References

  1. Liu, M. et al. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using Japanese national medical examination. Int. J. Med. Inf. 193, 105673 (2025).
  2. Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
  3. Zeng, D. et al. DeepSeek’s “Low-Cost” Adoption Across China’s Hospital Systems: Too Fast, Too Soon? JAMA 333, 1866–1869 (2025).
  4. Feldman, M. J. et al. Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses. JAMA Netw. Open 8, e2512994 (2025).
  5. Menezes, M. C. S. et al. The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study. Lancet Digit. Health 7, e35–e43 (2025).
  6. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
  7. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
  8. Lim, Z. W. et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 95, 104770 (2023).
  9. Rydzewski, N. R. et al. Comparative Evaluation of LLMs in Clinical Oncology. NEJM AI 1, https://doi.org/10.1056/aioa2300151 (2024).
  10. Akkus Yildirim, B. et al. Large language models standardize the interpretation of complex oncology guidelines for brain metastases. Commun. Med. 6, 56 (2025).
  11. McGrath, S. P. et al. A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions. J. Am. Med. Inf. Assoc. 31, 2271–2283 (2024).
  12. Heinz, M. V. et al. Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI 2, AIoa2400802 (2025).
  13. Du, X. et al. Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes. EBioMedicine 109, 105401 (2024).
  14. Moura, L. et al. Implications of Large Language Models for Quality and Efficiency of Neurologic Care: Emerging Issues in Neurology. Neurology 102, e209497 (2024).
  15. Sosa, B. R. et al. Capacity for large language model chatbots to aid in orthopedic management, research, and patient queries. J. Orthop. Res. 42, 1276–1282 (2024).
  16. Kim, J. et al. Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease. Am. J. Hum. Genet. 111, 2190–2202 (2024).
  17. Zelin, C. et al. Rare disease diagnosis using knowledge guided retrieval augmentation for ChatGPT. J. Biomed. Inf. 157, 104702 (2024).
  18. Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983–2984 (2023).
  19. Goodman, R. S. et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw. Open 6, e2336483 (2023).
  20. Ong, J. C. L. et al. Medical ethics of large language models in medicine. NEJM AI 1, AIra2400038 (2024).
  21. Li, H. et al. CMMLU: measuring massive multitask language understanding in Chinese. In Findings of the Association for Computational Linguistics: ACL 2024 (2024).
  22. Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
  23. Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning (PMLR, 2022).
  24. Flores-Gouyonnet, J. et al. Performance of large language models in rheumatology board-like questions: accuracy, quality, and safety. Lancet Rheumatol. 7, e152–e154 (2025).
  25. Chiang, W.-L. et al. Chatbot Arena: an open platform for evaluating LLMs by human preference. In Forty-first International Conference on Machine Learning (2024).
  26. Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 7, e2440969 (2024).
  27. Newble, D. Techniques for measuring clinical competence: objective structured clinical examinations. Med. Educ. 38, 199–203 (2004).
  28. Carraccio, C. L. et al. From the educational bench to the clinical bedside: translating the Dreyfus developmental model to the learning of clinical skills. Acad. Med. 83, 761–767 (2008).
  29. Ten Hove, D., Jorgensen, T. D. & van der Ark, L. A. How to Estimate Intraclass Correlation Coefficients for Interrater Reliability from Planned Incomplete Data. Multivar. Behav. Res. 60, 1042–1061 (2025).
  30. Fong, Y., Rue, H. & Wakefield, J. Bayesian inference for generalized linear mixed models. Biostatistics 11, 397–412 (2010).
  31. Giuffre, M., You, K. & Shung, D. L. Evaluating ChatGPT in Medical Contexts: The Imperative to Guard Against Hallucinations and Partial Accuracies. Clin. Gastroenterol. Hepatol. 22, 1145–1146 (2024).
  32. Chen, P. & Cai, J.-F. Expert evaluations of LLM responses on clinical questions. [Dataset] (2025). https://doi.org/10.6084/m9.figshare.30899360
  33. Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
  34. Riggs, E. R. et al. Technical standards for the interpretation and reporting of constitutional copy-number variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen). Genet. Med. 22, 245–257 (2020).
  35. Dai, H. P. et al. Expert Consensus on the Diagnosis and Treatment of Anticancer Drug-Induced Interstitial Lung Disease. Curr. Med. Sci. 43, 1–12 (2023).
  36. Brahmer, J. R. et al. Management of Immune-Related Adverse Events in Patients Treated With Immune Checkpoint Inhibitor Therapy: American Society of Clinical Oncology Clinical Practice Guideline. J. Clin. Oncol. 36, 1714–1768 (2018).
  37. Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 31, 2546–2549 (2025).
  38. Tordjman, M. et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat. Med. 31, 2550–2555 (2025).
  39. Biderman, S. et al. Emergent and predictable memorization in large language models. Adv. Neural Inf. Process. Syst. 36, 28072–28090 (2023).
  40. Antman, E. M. et al. ACC/AHA guidelines for the management of patients with ST-elevation myocardial infarction-executive summary. A report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines (Writing Committee to revise the 1999 guidelines for the management of patients with acute myocardial infarction). J. Am. Coll. Cardiol. 44, 671–719 (2004).
  41. Duong, D. & Solomon, B. D. Artificial intelligence in clinical genetics. Eur. J. Hum. Genet. 33, 281–288 (2025).
  42. Omar, M. et al. Sociodemographic biases in medical decision making by large language models. Nat. Med. 31, 1873–1881 (2025).
  43. Omar, M. et al. Refining LLMs outputs with iterative consensus ensemble (ICE). Comput. Biol. Med. 196, 110731 (2025).
  44. Chen, P. Responses of seven tested models for real clinical questions. [Dataset] V1.0 (2025). https://doi.org/10.6084/m9.figshare.30889556
  45. Chen, P., Shih, D. J. H. & Hu, Y. Medieval is an expert-evaluation framework for benefits and risks of using language models for medicine. [Software] V1.0 (2025). https://doi.org/10.6084/m9.figshare.30889115


Acknowledgements

This work was supported by the Sanming Project of Medicine in Shenzhen (SZSM202311022), Shenzhen Clinical Research Center for Rare Diseases (LCYSSQ20220823091402005), Shenzhen Key Medical Discipline Construction Fund (SZXK2020084 and SZXK077), Shenzhen Science and Technology Program (JCYJ20250604180803005), Shenzhen Science and Technology Major Project of China (KJZD20240903102759061), Small Equipment Grant (HKU), and the HKU - SZH Translational Med Centre (HZQSWS-KCCYB-2024055). PKC thanks the Shenzhen Peacock Plan (No.20210830100 C) and Futian Talent Program, and Shenzhen General Research Program JCYJ20250604180803005. We thank the three disqualified evaluators for their engagement. We thank C Zhong, D Chen, Kevin Tam, Profs Pak Sham (HKU) and N Ding (ZJU), and Drs A. Wai and CF Jiang for assistance or useful discussions.

Author information

Author notes
  1. These authors contributed equally: Peikai Chen, Jifu Cai, Jiaying Zhou, Shaoxi Chen, Chenguang Xu, Lihua Yuan, Xiaoying Dai, Xiaowei Chen, Yanzhe Wei, Xia Li, Shaofeng Gong, Xiaolong Liang, Jiancheng Yang.

Authors and Affiliations

  1. Department of Orthopedics, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Peikai Chen, Yanzhe Wei, Guan-Ming Kuang, Shijie Yin, Huiren Tao, Michael Kai-tsun To & Kenneth M. C. Cheung

  2. Shenzhen Clinical Research Center for Rare Diseases, Shenzhen, China

    Peikai Chen, Jifu Cai, Jiaying Zhou, Xiaoying Dai, Jiansheng Xie, Libing Luo, Haibing Xiao, Shijie Yin, Qianshen Zhang, Nan Wu, Pearl Pai, Michael Kai-tsun To & Kenneth M. C. Cheung

  3. AIBD Lab, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Peikai Chen & Yong Hu

  4. School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, the University of Hong Kong, Pokfulam, Hong Kong SAR, China

    Peikai Chen & David J. H. Shih

  5. Department of Neurology, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Jifu Cai, Yuzhen Cui & Haibing Xiao

  6. Department of Pediatrics, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Jiaying Zhou, Jun Yang & Jianliang Chen

  7. Department of Accident and Emergency, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Shaoxi Chen, Weifu Qiu & Yanli Wang

  8. Neonatal ICU (NICU), The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Chenguang Xu, Yihua Chen & Qianshen Zhang

  9. Department of Pediatric Surgery, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Lihua Yuan & Kanglin Dai

  10. Department of Prenatal Diagnosis (Clinical Genetics), The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Xiaoying Dai, Jiansheng Xie & Libing Luo

  11. Department of Nephrology, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Xiaowei Chen, Liwen Cui, Xiangyang Li & Pearl Pai

  12. Department of Respiratory Medicine, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Xia Li, Yulan Yan & Lei Rong

  13. Intensive Care Unit (ICU), The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Shaofeng Gong, Jun Jin & Qingshan Zhou

  14. Department of Cardiac Surgery, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Xiaolong Liang, Zanxin Wang & Minxin Wei

  15. Department of Cardiology, The University of Hong Kong - Shenzhen Hospital, Shenzhen, China

    Jiancheng Yang, Lina Zhao & Min Wu

  16. Department of Pediatric Orthopedics, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, China

    Xin Tang

  17. Department of Orthopedics, Peking Union Medical College Hospital (PUMCH), Chinese Academy of Medical Sciences, Beijing, China

    Nan Wu


Contributions

K.M.C.C. and P.K.C. conceived the project. K.M.C.C., M.K.T.T. and Y.H. coordinated the efforts. P.K.C. designed the methodology, developed and implemented the web system, designed the analyses, and wrote the first draft. P.K.C., D.J.H.S. and Y.H. performed the statistical analyses. J.F.C., J.Y.Z., S.X.C., and C.G.X. were question designers and non-independent expert evaluators for the neurology, pediatrics, A.E., and NICU teams. L.H.Y., X.Y.D., X.W.C., Y.Z.W., X.L., S.F.G., X.L.L., and J.C.Y. were question designers for their affiliated departments. The co-first authors were all involved in designing the methodology. J.J., K.L.D., Y.Z.C., G.M.K., J.S.X., L.B.L., H.B.X., S.J.Y., J.Y., Y.L.Y., J.L.C., Y.H.C., Q.S. Zhang, Q.S. Zhou, L.N.Z., M.W., X.T., L.R., Z.X.W., W.F.Q., Y.L.W., L.W.C., X.Y.L., M.X.W., H.R.T., N.W., and P.P. were the independent evaluators in their respective affiliations’ disciplines (see also Supplementary Fig. S1). All evaluators vouched for professionalism and neutrality in the evaluation process. Co-first authors co-wrote the first draft, and all authors contributed to, revised, and approved the final draft.

Corresponding authors

Correspondence to Peikai Chen, Michael Kai-tsun To or Kenneth M. C. Cheung.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Medicine thanks Mahmud Omar, Boya Zhang and Haoze Du for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Chen, P., Cai, J., Zhou, J. et al. Multidisciplinary blinded randomized expert evaluation of large language models for clinical diagnosis and management. Commun Med (2026). https://doi.org/10.1038/s43856-026-01576-9


  • Received: 11 October 2025

  • Accepted: 26 March 2026

  • Published: 15 April 2026

  • DOI: https://doi.org/10.1038/s43856-026-01576-9

