Abstract
With the rapid advancement of large language model (LLM) technology, numerous studies have explored its application in the medical field. Robust evaluation is crucial for ensuring reliability and safety, and has driven the development of diverse benchmark datasets. In this study, we propose a structured taxonomy that provides researchers with practical guidance for benchmark selection. Furthermore, we introduce READY, a development framework built on five principles (Reliable, Ethical, Annotated, Diverse, and Yield-validated) to support the systematic design of medical benchmarks and strengthen future evaluation practices. To establish the taxonomy and framework, we systematically reviewed benchmark datasets designed for evaluating LLMs in medical contexts. A comprehensive literature search yielded 55 relevant studies. Each benchmark was analyzed using a structured framework encompassing dataset construction and evaluation methodology. To assess the applicability of the proposed framework, five domain experts independently applied the READY framework to benchmark studies, demonstrating consistent inter-rater agreement. We anticipate that this research will promote more rigorous and ethical LLM evaluation, paving the way for the safe application of LLMs in clinical settings.
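The abstract reports consistent inter-rater agreement among the five experts who applied the READY framework. As an illustrative sketch only (the choice of statistic, the rater data, and the function below are our assumptions, not details taken from the study), Fleiss' kappa is one standard way such multi-rater agreement could be quantified:

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa. counts[i, j] = number of raters assigning item i
    to category j; every row must sum to the same number of raters."""
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    assert (counts.sum(axis=1) == n_raters).all(), "rows must sum to n raters"
    # Observed agreement: mean fraction of agreeing rater pairs per item.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = (p_j ** 2).sum()
    return float((p_bar - p_e) / (1 - p_e))

# Hypothetical data: 5 raters judge 6 benchmarks on one binary READY
# criterion (columns: criterion met / not met).
ratings = np.array([[5, 0], [0, 5], [4, 1], [5, 0], [1, 4], [0, 5]])
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")  # ~0.733

By the common Landis and Koch convention, values above roughly 0.6 indicate substantial agreement.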
Data availability
No new data were generated or analyzed in this study. All data supporting the findings of this study are available within the articles included in the scoping review.
Acknowledgements
This research was supported by the Technology Innovation Program (RS-2024-00432987), funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea). This study was derived in part from the doctoral dissertation of Junbok Lee at Seoul National University.
Author information
Contributions
J.B.L. conceptualized the study, developed the review protocol, conducted the literature search, and performed data extraction and analysis as part of his doctoral research. J.Y.S. verified the extracted data and contributed to the development of the taxonomy and framework. B.L.C. supervised the overall study, provided methodological guidance, and critically reviewed the manuscript for important intellectual content.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lee, J., Shin, J. & Cho, B. Structured taxonomy and framework for developing medical benchmark in large language models derived from scoping review. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02567-9


