Abstract
Recent vision-enabled multimodal large language models (LLMs) have achieved strong performance on high-stakes medical examinations, yet their capabilities in pediatrics, particularly on image-based questions, remain underexplored. We analyzed 498 unique questions, presented in mixed Korean–English terminology, drawn from 10 pediatric in-training examinations (ITEs) administered by a single pediatric department between 2016 and 2023. Approximately 22% of items contained medical images. Three recent publicly accessible LLMs (GPT-4.1, Gemini-2.5-Pro, Claude-4.1-Opus) and three prior models (GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet) were tested. The recent LLMs significantly outperformed fourth-year residents (R4) (77.7–78.9% vs. 70.1%, all P < 0.008), whereas the prior models performed comparably to R4. On text-only items, the three recent LLMs achieved a higher proportion correct (PC) than R4 (80.1–81.0% vs. 69.6%). None of the evaluated LLMs surpassed R4 on image-included questions, and both prior and recent models consistently showed a lower PC on image-included items than on text-only items. Outputs demonstrated high repeatability (intraclass correlation coefficient > 0.98) across most models. In this study, multimodal LLMs achieved high performance on the pediatric ITE, with further improvement observed over the past year and results exceeding those of senior residents. Nonetheless, performance on image-included questions remained inferior to that on text-only questions and did not exceed that of senior residents.
Data availability
The datasets generated and/or analyzed during the current study are not publicly available owing to restrictions imposed by the institutional review board of Asan Medical Center, Seoul, Korea (IRB no. 2025-0722), which prohibits data sharing with out-of-hospital facilities for ethical reasons. However, the data are available from the corresponding author upon reasonable request.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
JSP and SHK conceived and designed the study. MJK performed data analysis and interpretation. MJK, JSP and SHK contributed to data acquisition and clinical review. MJK drafted the manuscript. All authors critically revised the manuscript for important intellectual content and approved the final version for submission.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kim, M.J., Park, J.S. & Kang, S.H. Comparative performance of recent and prior large language models and pediatric residents on pediatric in-training examination questions. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44333-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-44333-7