Abstract
With the rapid advancement of large language model (LLM) technology, numerous studies have explored its application in the medical field. Robust evaluation is crucial for ensuring reliability and safety, and has driven the development of diverse benchmark datasets. In this study, we propose a structured taxonomy that provides researchers with practical guidance for benchmark selection. Furthermore, we introduce READY, a development framework built on five principles (Reliable, Ethical, Annotated, Diverse, and Yield-validated) to support the systematic design of medical benchmarks and strengthen future evaluation practices. To establish the taxonomy and framework, we systematically reviewed benchmark datasets designed for evaluating LLMs in medical contexts. A comprehensive literature search yielded 55 relevant studies. Each benchmark was analyzed using a structured framework encompassing dataset construction and evaluation methodology. To assess the applicability of the proposed framework, five domain experts independently applied the READY framework to benchmark studies, demonstrating consistent inter-rater agreement. We anticipate that this research will promote more rigorous and ethical LLM evaluation, paving the way for the safe application of LLMs in clinical settings.
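The abstract reports consistent inter-rater agreement among the five experts who applied the READY framework. As an illustrative sketch only (the choice of statistic, the rater data, and the function below are our assumptions, not details taken from the study), Fleiss' kappa is one standard way such multi-rater agreement could be quantified:

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa. counts[i, j] = number of raters assigning item i
    to category j; every row must sum to the same number of raters."""
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    assert (counts.sum(axis=1) == n_raters).all(), "rows must sum to n raters"
    # Observed agreement: mean fraction of agreeing rater pairs per item.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = (p_j ** 2).sum()
    return float((p_bar - p_e) / (1 - p_e))

# Hypothetical data: 5 raters judge 6 benchmarks on one binary READY
# criterion (columns: criterion met / not met).
ratings = np.array([[5, 0], [0, 5], [4, 1], [5, 0], [1, 4], [0, 5]])
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")  # ~0.733

By the common Landis and Koch convention, values above roughly 0.6 indicate substantial agreement.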
Data availability
No new data were generated or analyzed in this study. All data supporting the findings of this study are available within the articles included in the scoping review.
Acknowledgements
This research was supported by the Technology Innovation Program (RS-2024-00432987), funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea). This study was derived in part from the doctoral dissertation of Junbok Lee at Seoul National University.
Author information
Contributions
J.B.L. conceptualized the study, developed the review protocol, conducted the literature search, and performed data extraction and analysis as part of his doctoral research. J.Y.S. verified the extracted data and contributed to the development of the taxonomy and framework. B.L.C. supervised the overall study, provided methodological guidance, and critically reviewed the manuscript for important intellectual content.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lee, J., Shin, J. & Cho, B. Structured taxonomy and framework for developing medical benchmark in large language models derived from scoping review. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02567-9


