Abstract
Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.
Data availability
The prompts used for the LLMs are included in the manuscript. Both the development set and the test set are available exclusively for non-commercial research purposes under a data use agreement via our project homepage (https://github.com/AIPrimaryCare). The development set is released in full, including both the evaluation questions and the corresponding scoring rubrics. To preserve evaluation integrity, only the evaluation questions of the test set are released. Access requests should be submitted to the corresponding author with a detailed description of the research purpose; the corresponding author and the data-contributing institutions review each request and determine whether access can be granted. Source data are provided with this paper as a Source Data file.
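For orientation, a minimal loading sketch follows. It is not part of the released materials; the file name ("dev_set.json") and field names ("question", "rubric") are hypothetical placeholders for whatever schema the project homepage documents.

```python
# Minimal sketch for inspecting the development set once access is granted.
# "dev_set.json", "question", and "rubric" are hypothetical names, not a
# documented schema; consult the project homepage for the actual format.
import json

with open("dev_set.json", encoding="utf-8") as f:
    dev_items = json.load(f)

for item in dev_items[:3]:
    print(item["question"])  # evaluation question posed to the LLM
    print(item["rubric"])    # expert-annotated scoring rubric
```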
Code availability
The source code for this study is available in the GitHub repository at https://github.com/AIPrimaryCare/gpbench_code. To ensure long-term accessibility and reproducibility, the repository has been archived on Zenodo under the DOI https://doi.org/10.5281/zenodo.18428084. A comprehensive README file provides full instructions for reproducing the experiments.
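As a further orientation aid, the sketch below shows the general shape of querying one candidate model with a benchmark question through the OpenAI Python client. It is not taken from the gpbench_code repository; the model name and decoding settings are illustrative, and the README remains the authoritative guide for reproduction.

```python
# Illustrative only (not from the gpbench_code repository): send one
# evaluation question to a candidate model via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_question(question: str, model: str = "gpt-4o") -> str:
    """Return the model's free-text answer to one benchmark question."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,  # reduce sampling variance across repeated runs
    )
    return response.choices[0].message.content
```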
Acknowledgments
This research was funded by the Guangdong Basic and Applied Basic Research Foundation of China (2024A1515220073) and the Science and Technology Program of Guangzhou (2023B03J1277).
Author information
Contributions
Z. Li wrote the original draft, performed review and editing, and developed the methodology. Y.Y. carried out the data curation, performed the formal analysis and visualization, wrote the original draft, and performed review and editing. J. Lang performed review and editing, carried out the data curation, and conducted the investigation and validation. W.J. conceived the study, developed the methodology, performed review and editing, carried out the formal analysis, and supervised the work. J. Chen conducted the investigation and validation, developed the methodology, and performed review and editing. Y.Z. carried out the data curation, conducted the investigation, and developed the software. D.W. carried out the data curation, developed the software, and performed the visualization. S. Li, Z. Lin, X. Li, Y.T., J.Q., X. Lu, H.Y., S. Chen, Y.B., and X.Z. conducted the investigation. Y. Chen carried out the data curation. L.Y. conceived the study, developed the methodology, provided the resources, supervised the work, and acquired the funding. Data verification: Y.Y., J. Lang, J. Chen, and Y.Z. directly accessed and verified the underlying data.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, Z., Yang, Y., Lang, J. et al. Evaluating clinical competencies of large language models with a general practice benchmark. Nat Commun (2026). https://doi.org/10.1038/s41467-026-71622-6