Input structure–driven instability and convergence in large language model clinical reasoning: a formative study using pediatric residency–level MCQs

James, Vinson; Caronia, Catherine; Savargaonkar, Rajesh

doi:10.1038/s41598-026-48326-4


Article
Open access
Published: 17 April 2026


Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.


Abstract

Large language models (LLMs) are increasingly incorporated into medical education and clinical learning environments. Prior studies have focused on model accuracy on licensing-style examinations; less attention has been paid to the stability and reproducibility of LLM clinical reasoning under varying input structures, an issue central to safe educational and clinical deployment. This study examined how question delivery structure influences performance stability, inter-model variability, and reproducibility of clinical reasoning across multiple contemporary LLMs using pediatric residency–level multiple-choice questions (MCQs). A standardized, evidence-based LLM prompt was used to generate advanced-level pediatric USMLE Step 2/3–style MCQs emphasizing diagnostic reasoning, management decisions, and ethical judgment. One hundred of the resulting draft MCQs were randomly selected and independently reviewed by three pediatric physicians for medical accuracy, clinical realism, subspecialty relevance, and adherence to USMLE formatting. Twenty-three questions were excluded by unanimous consensus, yielding a validated set of 77 MCQs. Six publicly available LLMs (ChatGPT, DeepSeek AI, Gemini, Microsoft Copilot, Perplexity AI, and OpenEvidence; October–December 2025 versions) were evaluated under two delivery conditions: (1) simultaneous presentation of all questions and (2) sequential delivery in batches of ten. Accuracy and inter-model variability were compared using paired t-tests and one-way ANOVA. When all questions were presented simultaneously, model accuracy varied widely (38%–90%), with significant inter-model differences, indicating poor reproducibility. In contrast, batch delivery resulted in marked convergence of performance across models (83%–88%), with no statistically significant inter-model differences. Sequential delivery in batches of ten substantially reduced performance dispersion and instability across all evaluated systems.
LLM clinical reasoning performance is highly sensitive to input structure. Reducing contextual load through structured batch delivery improves reproducibility and minimizes inter-model variability, independent of model architecture. These findings suggest that prompt structure, rather than model selection alone, is a critical determinant of reliable LLM behavior and should be explicitly considered in the design of AI-supported medical education and assessment systems, particularly when LLMs are used as formative learning tools or clinical reasoning aids. Clinical Trial Number. The protocol was reviewed by the Office of the IRB at Good Samaritan University Hospital and determined to be exempt.
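As a rough illustration of the batch condition described in the abstract, the following Python sketch splits the 77 validated MCQs into consecutive batches of ten and computes the spread between the lowest and highest reported accuracy in each delivery condition. The function names (`make_batches`, `accuracy_spread`) and the integer question IDs are assumptions made for illustration only. This is not the authors' pipeline, and the abstract does not state how the final seven questions were grouped.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int = 10) -> List[List[T]]:
    """Split items into consecutive batches; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def accuracy_spread(accuracies_pct: Sequence[int]) -> int:
    """Return max minus min accuracy, in percentage points."""
    if not accuracies_pct:
        raise ValueError("at least one accuracy value is required")
    return max(accuracies_pct) - min(accuracies_pct)


# Hypothetical question IDs standing in for the 77 validated MCQs.
question_ids = list(range(1, 78))
batches = make_batches(question_ids, batch_size=10)

# Lowest and highest accuracies reported in the abstract for each condition.
simultaneous_range_pct = (38, 90)
batch_range_pct = (83, 88)

print(len(batches), len(batches[-1]))           # 8 7
print(accuracy_spread(simultaneous_range_pct))  # 52
print(accuracy_spread(batch_range_pct))         # 5
```

Under this assumed grouping the batch condition consists of eight prompts, the last containing seven questions. The reported accuracy spread narrows from 52 percentage points under simultaneous presentation to 5 under batch delivery.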


Data availability

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request. No patient-level or identifiable data were used in this study.

Code availability

Custom scripts used for statistical analysis are available from the corresponding author upon reasonable request. No proprietary or restricted software was developed or modified for this study.
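The scripts themselves are not reproduced here. As a hedged sketch of the two tests named in the abstract (paired t-test and one-way ANOVA), the Python below computes the corresponding test statistics using only the standard library. The function names and the toy numbers are hypothetical and are not study data. How the authors structured their inputs (for example, per-item correctness or per-batch accuracy) is not stated, so this shows only the generic textbook formulas, without p-values.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / sqrt(len(diffs)))


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("within-group variance is zero")
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy numbers for illustration only (NOT results from the study).
toy_condition_a = [3, 5, 7, 9]
toy_condition_b = [1, 2, 3, 4]
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

t_value = paired_t_statistic(toy_condition_a, toy_condition_b)
f_value = one_way_anova_f(toy_groups)
print(round(t_value, 3), round(f_value, 3))
```

In practice a statistics package would normally be used to obtain p-values as well; the hand-written versions above are meant only to make the computations explicit.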


Author information

Authors and Affiliations

Department of Pediatrics, Good Samaritan University Hospital, West Islip, NY, 11795, USA
Vinson James, Catherine Caronia & Rajesh Savargaonkar


Contributions

V.J. conceived and designed the study, generated and curated the dataset, performed statistical analyses, interpreted the results, and drafted the manuscript. C.C. reviewed the manuscript and critically revised it for intellectual content. R.S. supervised the study, contributed to methodological refinement, and provided critical review and editing of the manuscript. All authors approved the final manuscript and agree to be accountable for all aspects of the work.

Corresponding author

Correspondence to Vinson James.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval and informed consent

This study did not involve patients, patient data, biological samples, or identifiable personal information. Human participation was limited to voluntary expert review of multiple-choice questions and scoring of model outputs by qualified physicians. All participating reviewers provided informed consent for study participation. No identifying information was collected or reported. The study met criteria for institutional review board (IRB) exemption as minimal-risk educational research. All methods performed in this study were carried out in accordance with relevant guidelines and regulations. The study was approved by the Institutional Review Board of Good Samaritan University Hospital.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (DOCX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

James, V., Caronia, C. & Savargaonkar, R. Input structure–driven instability and convergence in large language model clinical reasoning: a formative study using pediatric residency–level MCQs. Sci Rep (2026). https://doi.org/10.1038/s41598-026-48326-4


Received: 27 January 2026
Accepted: 07 April 2026
Published: 17 April 2026
DOI: https://doi.org/10.1038/s41598-026-48326-4