Abstract
Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.
Data availability
The prompts used for the LLMs are included in the manuscript. Both the development set and the test set are available exclusively for non-commercial research purposes under a data use agreement via our project homepage (https://github.com/AIPrimaryCare). The development set is released in full, including both the evaluation questions and the corresponding scoring rubrics. To preserve evaluation integrity, only the evaluation questions of the test set are released. Access requests should be submitted to the corresponding author with a detailed description of the research purpose; the corresponding author and the data-contributing institutions review each request and determine whether access can be granted. Source data are provided with this paper as a Source Data file.
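For orientation, a minimal loading sketch follows. It is not part of the released materials; the file name ("dev_set.json") and field names ("question", "rubric") are hypothetical placeholders for whatever schema the project homepage documents.

```python
# Minimal sketch for inspecting the development set once access is granted.
# "dev_set.json", "question", and "rubric" are hypothetical names, not a
# documented schema; consult the project homepage for the actual format.
import json

with open("dev_set.json", encoding="utf-8") as f:
    dev_items = json.load(f)

for item in dev_items[:3]:
    print(item["question"])  # evaluation question posed to the LLM
    print(item["rubric"])    # expert-annotated scoring rubric
```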
Code availability
The source code for this study is available in the GitHub repository at https://github.com/AIPrimaryCare/gpbench_code. To ensure long-term accessibility and reproducibility, the repository has been archived on Zenodo under the DOI https://doi.org/10.5281/zenodo.18428084. A comprehensive README file provides full instructions for reproducing the experiments.
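As a further orientation aid, the sketch below shows the general shape of querying one candidate model with a benchmark question through the OpenAI Python client. It is not taken from the gpbench_code repository; the model name and decoding settings are illustrative, and the README remains the authoritative guide for reproduction.

```python
# Illustrative only (not from the gpbench_code repository): send one
# evaluation question to a candidate model via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_question(question: str, model: str = "gpt-4o") -> str:
    """Return the model's free-text answer to one benchmark question."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,  # reduce sampling variance across repeated runs
    )
    return response.choices[0].message.content
```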
Acknowledgments
This research was funded by the Guangdong Basic and Applied Basic Research Foundation of China (2024A1515220073) and the Science and Technology Program of Guangzhou (2023B03J1277).
Author information
Contributions
Z. Li wrote the original draft, performed review and editing, and developed the methodology. Y.Y. carried out the data curation, performed the formal analysis and visualization, wrote the original draft, and performed review and editing. J. Lang performed review and editing, carried out the data curation, and conducted the investigation and validation. W.J. conceived the study, developed the methodology, performed review and editing, carried out the formal analysis, and supervised the work. J. Chen conducted the investigation and validation, developed the methodology, and performed review and editing. Y.Z. carried out the data curation, conducted the investigation, and developed the software. D.W. carried out the data curation, developed the software, and performed the visualization. S. Li, Z. Lin, X. Li, Y.T., J.Q., X. Lu, H.Y., S. Chen, Y.B., and X.Z. conducted the investigation. Y. Chen carried out the data curation. L.Y. conceived the study, developed the methodology, provided the resources, supervised the work, and acquired the funding. Data verification: Y.Y., J. Lang, J. Chen, and Y.Z. directly accessed and verified the underlying data.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, Z., Yang, Y., Lang, J. et al. Evaluating clinical competencies of large language models with a general practice benchmark. Nat Commun (2026). https://doi.org/10.1038/s41467-026-71622-6