Scientific Reports
Comparative performance of recent and prior large language models and pediatric residents on pediatric in-training examination questions
Article | Open access | Published: 02 April 2026


  • Mi Jin Kim1,
  • Jun Sung Park2 &
  • Sung Han Kang3

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Health care
  • Medical research

Abstract

Recent vision-enabled multimodal large language models (LLMs) have achieved strong performance on high-stakes medical examinations, yet their capabilities in pediatrics, particularly on image-based questions, remain underexplored. We analyzed 498 unique questions, written with mixed Korean–English terminology, drawn from 10 pediatric in-training examinations (ITEs) administered by a single pediatric department between 2016 and 2023; approximately 22% of items contained medical images. Three recent publicly accessible LLMs (GPT-4.1, Gemini-2.5-Pro, Claude-4.1-Opus) and three prior models (GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet) were tested. The recent LLMs significantly outperformed fourth-year residents (R4) (77.7–78.9% vs. 70.1%, all P < 0.008), whereas the prior models performed comparably to residents. On text-only items, all three recent LLMs achieved a higher proportion correct (PC) than R4 (80.1–81.0% vs. 69.6%). None of the evaluated LLMs surpassed R4 on image-included questions, and both prior and recent models consistently showed lower PC on image-included items than on text-only items. Outputs demonstrated high repeatability (intraclass correlation coefficient > 0.98) across most models. In this study, multimodal LLMs achieved high performance on the pediatric ITE, with further improvement observed over the past year and results exceeding those of senior residents. Nonetheless, performance on image-included questions remained inferior to that on text-only questions and did not exceed that of senior residents.
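The LLM-versus-resident comparison above rests on comparing two proportions correct over the same 498-item pool. A minimal sketch of such a comparison is a pooled two-proportion z-test; the counts below are illustrative only (back-calculated from the reported percentages, not the study's raw data), and the function name is our own:

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled variance estimate.

    x1/n1 and x2/n2 are correct counts over total items for each group.
    Returns (z statistic, two-sided p-value from the standard normal).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # P(|Z| > |z|) for a standard normal, via the complementary error function
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Illustrative counts on 498 items: ~77.7% (a recent LLM) vs. ~70.1% (R4).
z, p = two_prop_ztest(387, 498, 349, 498)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these assumed counts the test is significant well below the paper's P < 0.008 threshold, consistent with the reported result; the actual analysis may have used a different test (e.g., one respecting the paired, per-item structure of the data), which this sketch does not capture.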

Data availability

The datasets generated and/or analyzed during the current study are not publicly available due to restrictions from the institutional review board of the Asan Medical Center, Seoul, Korea (IRB no. 2025-0722), which prohibits data sharing with out-of-hospital facilities for ethical reasons. However, data are available from the corresponding author upon reasonable request.


Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Author notes
  1. Jun Sung Park and Sung Han Kang contributed equally to this work.

Authors and Affiliations

  1. Division of Pediatric Cardiology, Department of Pediatrics, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea

    Mi Jin Kim

  2. Department of Pediatrics, Asan Medical Center Children’s Hospital, University of Ulsan College of Medicine, Seoul, Republic of Korea

    Jun Sung Park

  3. Division of Pediatric Hematology/Oncology, Department of Pediatrics, Asan Medical Center Children’s Hospital, University of Ulsan College of Medicine, Seoul, Republic of Korea

    Sung Han Kang


Contributions

JSP and SHK conceived and designed the study. MJK performed data analysis and interpretation. MJK, JSP and SHK contributed to data acquisition and clinical review. MJK drafted the manuscript. All authors critically revised the manuscript for important intellectual content and approved the final version for submission.

Corresponding authors

Correspondence to Jun Sung Park or Sung Han Kang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (DOCX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Kim, M.J., Park, J.S. & Kang, S.H. Comparative performance of recent and prior large language models and pediatric residents on pediatric in-training examination questions. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44333-7


  • Received: 01 September 2025

  • Accepted: 11 March 2026

  • Published: 02 April 2026

  • DOI: https://doi.org/10.1038/s41598-026-44333-7


Keywords

  • Natural Language Processing
  • Artificial Intelligence
  • Pediatrics