Scientific Reports
Input structure–driven instability and convergence in large language model clinical reasoning: a formative study using pediatric residency–level MCQs
  • Article
  • Open access
  • Published: 17 April 2026

  • Vinson James¹,
  • Catherine Caronia¹ &
  • Rajesh Savargaonkar¹

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Health care
  • Medical research

Abstract

Large language models (LLMs) are increasingly incorporated into medical education and clinical learning environments. While prior studies have focused on model accuracy on licensing-style examinations, less attention has been paid to the stability and reproducibility of LLM clinical reasoning under varying input structures, an issue central to safe educational and clinical deployment. This study examined how question delivery structure influences performance stability, inter-model variability, and reproducibility of clinical reasoning across multiple contemporary LLMs using pediatric residency–level multiple-choice questions (MCQs). A standardized, evidence-based prompt was used to generate advanced-level pediatric USMLE Step 2/3–style MCQs emphasizing diagnostic reasoning, management decisions, and ethical judgment. One hundred draft MCQs generated with this prompt were randomly selected and independently reviewed by three pediatric physicians for medical accuracy, clinical realism, subspecialty relevance, and adherence to USMLE formatting. Twenty-three questions were excluded by unanimous consensus, yielding a validated set of 77 MCQs. Six publicly available LLMs (ChatGPT, DeepSeek AI, Gemini, Microsoft Copilot, Perplexity AI, and OpenEvidence; October–December 2025 versions) were evaluated under two delivery conditions: (1) simultaneous presentation of all questions and (2) sequential delivery in batches of ten. Accuracy and inter-model variability were compared using paired t-tests and one-way ANOVA. When all questions were presented simultaneously, model accuracy varied widely (38%–90%), with significant inter-model differences, indicating poor reproducibility. In contrast, batch delivery resulted in marked convergence of performance across models (83%–88%), with no statistically significant inter-model differences. Sequential delivery in batches of ten substantially reduced performance dispersion and instability across all evaluated systems.
LLM clinical reasoning performance is highly sensitive to input structure. Reducing contextual load through structured batch delivery improves reproducibility and minimizes inter-model variability, independent of model architecture. These findings suggest that prompt structure, rather than model selection alone, is a critical determinant of reliable LLM behavior and should be explicitly considered in the design of AI-supported medical education and assessment systems, particularly when LLMs are used as formative learning tools or clinical reasoning aids. Clinical Trial Number. The protocol was reviewed by the Office of the IRB at Good Samaritan University Hospital and determined to be exempt.
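
The two delivery conditions and the dispersion comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code, which is available only on request. The function names, the placeholder question IDs, and the choice to treat per-question 0/1 correctness scores as the ANOVA observations are all assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List, Sequence


def make_batches(items: Sequence[str], batch_size: int = 10) -> List[List[str]]:
    """Split items into consecutive batches; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def accuracy(answers: Sequence[str], key: Sequence[str]) -> float:
    """Fraction of answers that match the answer key."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def accuracy_range(per_model: Dict[str, float]) -> float:
    """Gap between the best and worst model: a crude dispersion measure."""
    return max(per_model.values()) - min(per_model.values())


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic: between-group mean square over within-group mean square.

    Assumes at least two groups and nonzero within-group variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Hypothetical question IDs standing in for the 77 validated MCQs.
    questions = [f"Q{i}" for i in range(1, 78)]
    batches = make_batches(questions, batch_size=10)
    print(len(batches), len(batches[-1]))

    # Only the range endpoints are taken from the abstract; the labels are placeholders.
    simultaneous = {"lowest": 0.38, "highest": 0.90}
    batched = {"lowest": 0.83, "highest": 0.88}
    print(round(accuracy_range(simultaneous), 2), round(accuracy_range(batched), 2))
```

With 77 questions and a batch size of ten, this yields eight batches, the last holding seven questions. The accuracy ranges reported above correspond to a spread of 52 percentage points under simultaneous delivery and 5 percentage points under batch delivery.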

Data availability

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request. No patient-level or identifiable data were used in this study.

Code availability

Custom scripts used for statistical analysis are available from the corresponding author upon reasonable request. No proprietary or restricted software was developed or modified for this study.

Author information

Authors and Affiliations

  1. Department of Pediatrics, Good Samaritan University Hospital, West Islip, NY, 11795, USA

    Vinson James, Catherine Caronia & Rajesh Savargaonkar

Contributions

V.J. conceived and designed the study, generated and curated the dataset, performed statistical analyses, interpreted the results, and drafted the manuscript. C.C. reviewed the manuscript and critically revised it for intellectual content. R.S. supervised the study, contributed to methodological refinement, and provided critical review and editing of the manuscript. All authors approved the final manuscript and agree to be accountable for all aspects of the work.

Corresponding author

Correspondence to Vinson James.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval and informed consent

This study did not involve patients, patient data, biological samples, or identifiable personal information. Human participation was limited to voluntary expert review of multiple-choice questions and scoring of model outputs by qualified physicians. All participating reviewers provided informed consent for study participation. No identifying information was collected or reported. The study met criteria for institutional review board (IRB) exemption as minimal-risk educational research. All methods performed in this study were carried out in accordance with relevant guidelines and regulations. The study was approved by the Institutional Review Board of Good Samaritan University Hospital.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (DOCX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

James, V., Caronia, C. & Savargaonkar, R. Input structure–driven instability and convergence in large language model clinical reasoning: a formative study using pediatric residency–level MCQs. Sci Rep (2026). https://doi.org/10.1038/s41598-026-48326-4

  • Received: 27 January 2026

  • Accepted: 07 April 2026

  • Published: 17 April 2026

  • DOI: https://doi.org/10.1038/s41598-026-48326-4
