Abstract
Advanced large language models (LLMs) such as ChatGPT, Gemini, Grok3, and Claude offer new possibilities for medical research interpretation and clinical decision support. While these models demonstrate remarkable natural language processing capabilities, their ability to independently reason through clinical trial data and produce conclusions consistent with published trial interpretations remains underexplored. The objective was to evaluate the reliability of LLMs in interpreting numerical and statistical healthcare data. For this study, landmark randomized controlled trials (RCTs) were selected as a standardized domain to minimize bias from poor-quality research designs. Twenty landmark RCTs in the neurosurgical and cardiovascular intervention domains, published in the New England Journal of Medicine, were analyzed. Four AI platforms were evaluated using a structured prompt covering five domains: evidence interpretation, statistical understanding, clinical relevance, limitation recognition, and practical applicability. Two independent raters scored all AI outputs on a 0–5 scale per domain, and interobserver reliability was assessed. Primary outcomes included concordance with published trial conclusions, accuracy of primary outcome identification, and appropriateness of recommendations. Secondary outcomes included output pattern analysis, recognition of limitations, and handling of confounding factors. ChatGPT demonstrated the highest concordance with published conclusions at 100%, followed by Gemini at 84%, Grok3 at 72%, and Claude at 68%. However, these concordance scores should be interpreted cautiously, as the LLMs may have been trained on these published trials, potentially inflating alignment with published conclusions. ChatGPT and Gemini accurately identified limitations and confounding factors, while Grok3 and Claude struggled with these secondary outcomes. Interobserver reliability between raters was good (Cronbach’s α = 0.868), supporting the consistency of scoring. Certain LLMs, particularly ChatGPT and Gemini, can reliably interpret clinical trial data and align closely with human conclusions, suggesting a potential role in data summarization and evidence synthesis. These findings underscore the importance of careful AI tool selection and highlight potential applications in research and clinical workflows.
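The abstract's quantitative summary rests on two simple computations: a per-platform concordance percentage with the published trial conclusions and Cronbach's alpha over the two raters' scores. The sketch below is illustrative only; it is not the authors' analysis code, and all values in it are placeholders rather than study data.

```python
# Minimal illustrative sketch (not the study's analysis code): it mirrors the two
# metrics named in the abstract -- concordance with published conclusions and
# inter-rater consistency via Cronbach's alpha -- using placeholder numbers only.
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (subjects x raters) matrix of scores."""
    k = scores.shape[1]                          # number of raters (treated as items)
    item_var = scores.var(axis=0, ddof=1)        # per-rater score variance
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1.0 - item_var.sum() / total_var)


def concordance_pct(matches: list) -> float:
    """Percentage of trials whose LLM conclusion matched the published conclusion."""
    return 100.0 * sum(matches) / len(matches)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Placeholder: 20 trial interpretations scored 0-5 by two independent raters.
    rater_scores = rng.integers(0, 6, size=(20, 2)).astype(float)
    print(f"Cronbach's alpha (illustrative): {cronbach_alpha(rater_scores):.3f}")

    # Placeholder agreement flags for one platform (True = concordant).
    flags = [True] * 17 + [False] * 3
    print(f"Concordance (illustrative): {concordance_pct(flags):.0f}%")
```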
Data availability
All data analyzed in this study were derived from manuscripts published in the New England Journal of Medicine.
Funding
This research received no funding.
Author information
Contributions
G.M. and R.S.S. designed and supervised the study. W.S., A.S.C., A.S., B.O-G., A.P., L.R.J., and K.J.O.R. collected data from the clinical trials. A.P. and W.S. performed statistical analysis and created figures. W.S. drafted the manuscript. All authors reviewed, edited, and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mao, G., Snyder, W., Chinthala, A.S. et al. Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45326-2
DOI: https://doi.org/10.1038/s41598-026-45326-2