Evaluating clinical competencies of large language models with a general practice benchmark
  • Article
  • Open access
  • Published: 16 April 2026

  • Zheqing Li (李哲青)1 (ORCID: orcid.org/0009-0008-8501-0369),
  • Yiying Yang (杨怡莹)2,
  • Jiping Lang (郎吉萍)1,
  • Wenhao Jiang (姜文浩)2 (ORCID: orcid.org/0000-0002-0795-366X),
  • Junrong Chen (陈俊榕)1 (ORCID: orcid.org/0000-0002-3796-5717),
  • Yuhang Zhao (赵宇航)2,
  • Shuang Li (李双)1,
  • Dingqian Wang (王定乾)2,
  • Zhu Lin (林珠)1,
  • Xuanna Li (李宣娜)1,
  • Yuze Tang (唐瑜泽)1,
  • Jiexian Qiu (邱洁娴)3,
  • Xiaolin Lu (卢晓霖)3,
  • Hongji Yu (俞鸿基)3,
  • Shuang Chen (陈爽)1,
  • Yuhua Bi (闭玉华)1,
  • Xiaofei Zeng (曾晓菲)1,
  • Yixian Chen (陈一贤)1 &
  • Lin Yao (姚麟)4,5 (ORCID: orcid.org/0000-0002-3422-8922)

Nature Communications (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Health care
  • Predictive medicine

Abstract

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.
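The abstract describes scoring model answers against expert-annotated rubrics. As a toy analogue only, the sketch below awards rubric points when a keyword appears in an answer; the function name and rubric fields are hypothetical, not the authors' actual pipeline (which lives in their GitHub repository).

```python
# Illustrative sketch only: a minimal rubric-based scorer, assuming a
# hypothetical rubric format of {"keyword", "points"} items. The real
# GPBench scoring rubrics are expert-annotated and far richer.

def score_against_rubric(answer: str, rubric: list[dict]) -> float:
    """Return the fraction of rubric points whose keyword appears in the answer."""
    earned = sum(item["points"] for item in rubric
                 if item["keyword"].lower() in answer.lower())
    total = sum(item["points"] for item in rubric)
    return earned / total if total else 0.0

rubric = [
    {"keyword": "blood pressure", "points": 2},
    {"keyword": "lifestyle", "points": 1},
    {"keyword": "referral", "points": 1},
]
answer = "Check the patient's blood pressure and advise lifestyle changes."
print(round(score_against_rubric(answer, rubric), 2))  # 0.75
```

Real rubric grading would need semantic matching rather than keyword lookup; this only illustrates the rubric-to-score shape of the evaluation.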


Data availability

The prompts used for the LLMs are included in the manuscript. Both the development set and the test set are available exclusively for non-commercial research purposes under a data use agreement via our project homepage (https://github.com/AIPrimaryCare). The development set can be accessed in full, including both the evaluation questions and the corresponding scoring rubrics. To preserve evaluation integrity, only the evaluation questions of the test set are released. Access requests should be submitted to the corresponding author with a detailed description of the research purpose; the corresponding author and the data-contributing institutions review each request and determine whether access can be granted. Source data are provided with this paper in the Source Data file.

Code availability

The source code for the study is available in the GitHub repository at https://github.com/AIPrimaryCare/gpbench_code. To ensure long-term accessibility and reproducibility, the repository has been archived on Zenodo at https://doi.org/10.5281/zenodo.18428084. A comprehensive README file provides full instructions for reproducing the experiments.
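Reproducing a benchmark run of this kind typically means iterating over the evaluation questions, querying a model, and saving outputs for later scoring. The harness below is a hypothetical sketch of that loop; the record fields and function names are assumptions, and the repository README documents the actual entry points.

```python
# Hypothetical sketch of a benchmark harness: collect model answers for
# each question so they can be rubric-scored afterwards. Field names
# ("id", "prompt") are illustrative assumptions, not the GPBench schema.
import json

def run_benchmark(questions, model):
    """Return one {"id", "answer"} record per benchmark question."""
    return [{"id": q["id"], "answer": model(q["prompt"])} for q in questions]

# A stand-in "model" that just echoes part of the prompt, for demonstration.
questions = [{"id": "gp-001", "prompt": "A 55-year-old presents with chest pain."}]
echo_model = lambda prompt: f"Assessment for: {prompt[:20]}"

out = run_benchmark(questions, echo_model)
print(json.dumps(out, ensure_ascii=False))
```

Swapping `echo_model` for a real LLM client call (and writing `out` to disk) would give the usual generate-then-grade workflow.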


Acknowledgments

This research was funded by the Guangdong Basic and Applied Basic Research Foundation of China (2024A1515220073) and the Science and Technology Program of Guangzhou (2023B03J1277).

Author information

Author notes
  1. These authors contributed equally: Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Junrong Chen.

Authors and Affiliations

  1. The Sixth Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong, China

    Zheqing Li  (李哲青), Jiping Lang  (郎吉萍), Junrong Chen  (陈俊榕), Shuang Li  (李双), Zhu Lin  (林珠), Xuanna Li  (李宣娜), Yuze Tang  (唐瑜泽), Shuang Chen  (陈爽), Yuhua Bi  (闭玉华), Xiaofei Zeng  (曾晓菲) & Yixian Chen  (陈一贤)

  2. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong, China

    Yiying Yang  (杨怡莹), Wenhao Jiang  (姜文浩), Yuhang Zhao  (赵宇航) & Dingqian Wang  (王定乾)

  3. Xinyi People’s Hospital, Xinyi, Guangdong, China

    Jiexian Qiu  (邱洁娴), Xiaolin Lu  (卢晓霖) & Hongji Yu  (俞鸿基)

  4. The Fifth Affiliated Hospital of Sun Yat-sen University, Zhuhai, Guangdong, China

    Lin Yao  (姚麟)

  5. School of Public Health of Sun Yat-sen University, Guangzhou, Guangdong, China

    Lin Yao  (姚麟)


Contributions

Z. Li wrote the original draft, contributed to review and editing, and developed the methodology. Y.Y. carried out the data curation, performed the formal analysis and visualization, wrote the original draft, and contributed to review and editing. J. Lang contributed to review and editing, carried out the data curation, and conducted the investigation and validation. W.J. conceived the study, developed the methodology, contributed to review and editing, carried out the formal analysis, and supervised the work. J. Chen conducted the investigation and validation, developed the methodology, and contributed to review and editing. Y.Z. carried out the data curation, conducted the investigation, and developed the software. D.W. carried out the data curation, developed the software, and performed the visualization. S. Li, Z. Lin, X. Li, Y.T., J.Q., X. Lu, H.Y., S. Chen, Y.B., and X.Z. conducted the investigation. Y. Chen carried out the data curation. L.Y. conceived the study, developed the methodology, provided the resources, supervised the work, and acquired the funding. Data verification: Y.Y., J. Lang, J. Chen, and Y.Z. directly accessed and verified the underlying data.

Corresponding authors

Correspondence to Wenhao Jiang  (姜文浩), Junrong Chen  (陈俊榕) or Lin Yao  (姚麟).

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information (PDF)
  • Peer Review file (PDF)
  • Description of Additional Supplementary Files (PDF)
  • Supplementary Data 1–3 (ZIP)
  • Reporting Summary (PDF)

Source data

  • Source Data (ZIP)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article


Cite this article

Li, Z., Yang, Y., Lang, J. et al. Evaluating clinical competencies of large language models with a general practice benchmark. Nat Commun (2026). https://doi.org/10.1038/s41467-026-71622-6


  • Received: 14 May 2025

  • Accepted: 26 March 2026

  • Published: 16 April 2026

  • DOI: https://doi.org/10.1038/s41467-026-71622-6
