Abstract
Advanced large language models (LLMs) such as ChatGPT, Gemini, Grok3, and Claude offer new possibilities for medical research interpretation and clinical decision support. While these models demonstrate remarkable natural language processing capabilities, their ability to independently reason through clinical trial data and produce conclusions consistent with published trial interpretations remains underexplored. The objective was to evaluate the reliability of LLMs in interpreting numerical and statistical healthcare data. For this study, landmark randomized controlled trials (RCTs) were selected as a standardized domain to minimize bias from poor-quality research designs. Twenty landmark RCTs in the neurosurgical and cardiovascular intervention domains, published in the New England Journal of Medicine, were analyzed. Four AI platforms were evaluated using a structured prompt covering five domains: evidence interpretation, statistical understanding, clinical relevance, limitation recognition, and practical applicability. Two independent raters scored all AI outputs on a 0–5 scale per domain, and interobserver reliability was assessed. Primary outcomes included concordance with published trial conclusions, accuracy of primary outcome identification, and appropriateness of recommendations. Secondary outcomes included output pattern analysis, recognition of limitations, and handling of confounding factors. ChatGPT demonstrated the highest concordance with published conclusions at 100%, followed by Gemini at 84%, Grok3 at 72%, and Claude at 68%. However, these concordance scores should be interpreted cautiously, as the LLMs may have been trained on these published trials, potentially inflating alignment with published conclusions. ChatGPT and Gemini accurately identified limitations and confounding factors, while Grok3 and Claude struggled with these secondary outcomes. Interobserver reliability between raters was good (Cronbach’s α = 0.868), supporting the consistency of scoring. Certain LLMs, particularly ChatGPT and Gemini, can reliably interpret clinical trial data and align closely with human conclusions, suggesting a potential role in data summarization and evidence synthesis. These findings underscore the importance of careful AI tool selection and highlight potential applications in research and clinical workflows.
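The abstract's quantitative summary rests on two simple computations: a per-platform concordance percentage with the published trial conclusions and Cronbach's alpha over the two raters' scores. The sketch below is illustrative only; it is not the authors' analysis code, and all values in it are placeholders rather than study data.

```python
# Minimal illustrative sketch (not the study's analysis code): it mirrors the two
# metrics named in the abstract -- concordance with published conclusions and
# inter-rater consistency via Cronbach's alpha -- using placeholder numbers only.
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (subjects x raters) matrix of scores."""
    k = scores.shape[1]                          # number of raters (treated as items)
    item_var = scores.var(axis=0, ddof=1)        # per-rater score variance
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1.0 - item_var.sum() / total_var)


def concordance_pct(matches: list) -> float:
    """Percentage of trials whose LLM conclusion matched the published conclusion."""
    return 100.0 * sum(matches) / len(matches)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Placeholder: 20 trial interpretations scored 0-5 by two independent raters.
    rater_scores = rng.integers(0, 6, size=(20, 2)).astype(float)
    print(f"Cronbach's alpha (illustrative): {cronbach_alpha(rater_scores):.3f}")

    # Placeholder agreement flags for one platform (True = concordant).
    flags = [True] * 17 + [False] * 3
    print(f"Concordance (illustrative): {concordance_pct(flags):.0f}%")
```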
Data availability
All data analyzed in this study were derived from manuscripts published in the New England Journal of Medicine.
Funding
This research received no funding.
Author information
Contributions
G.M. and R.S.S. designed and supervised the study. W.S., A.S.C., A.S., B.O-G., A.P., L.R.J., and K.J.O.R. collected data from the clinical trials. A.P. and W.S. performed statistical analysis and created figures. W.S. drafted the manuscript. All authors reviewed, edited, and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mao, G., Snyder, W., Chinthala, A.S. et al. Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45326-2
DOI: https://doi.org/10.1038/s41598-026-45326-2