Abstract
Recent vision-enabled multimodal large language models (LLMs) have achieved strong performance on high-stakes medical examinations, yet their capabilities in pediatrics, particularly on image-based questions, remain underexplored. We analyzed 498 unique questions, presented in mixed Korean–English terminology, drawn from 10 pediatric in-training examinations (ITEs) administered by a single pediatric department between 2016 and 2023. Approximately 22% of items contained medical images. Three recent publicly accessible LLMs (GPT-4.1, Gemini-2.5-Pro, Claude-4.1-Opus) and three prior models (GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet) were tested. The recent LLMs significantly outperformed fourth-year residents (R4) (77.7–78.9% vs. 70.1%, all P < 0.008), whereas the prior models performed comparably to R4. On text-only items, the three recent LLMs achieved a higher proportion correct (PC) than R4 (80.1–81.0% vs. 69.6%). None of the evaluated LLMs surpassed R4 on image-included questions, and both prior and recent models consistently showed a lower PC on image-included items than on text-only items. Outputs demonstrated high repeatability (intraclass correlation coefficient > 0.98) across most models. In this study, multimodal LLMs achieved high performance on the pediatric ITE, with further improvement observed over the past year and results exceeding those of senior residents. Nonetheless, performance on image-included questions remained inferior to that on text-only questions and did not exceed that of senior residents.
Data availability
The datasets generated and/or analyzed during the current study are not publicly available owing to restrictions imposed by the institutional review board of Asan Medical Center, Seoul, Korea (IRB no. 2025-0722), which prohibits data sharing with out-of-hospital facilities for ethical reasons. However, the data are available from the corresponding author upon reasonable request.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
JSP and SHK conceived and designed the study. MJK performed data analysis and interpretation. MJK, JSP and SHK contributed to data acquisition and clinical review. MJK drafted the manuscript. All authors critically revised the manuscript for important intellectual content and approved the final version for submission.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kim, M.J., Park, J.S. & Kang, S.H. Comparative performance of recent and prior large language models and pediatric residents on pediatric in-training examination questions. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44333-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-44333-7