Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms
  • Article
  • Open access
  • Published: 02 April 2026

  • Gordon Mao1,
  • William Snyder III2,
  • Anoop S. Chinthala2,
  • Arshjeet Singh2,
  • Barnabas Obeng-Gyasi3,
  • Alexander J. Potts2,
  • Luke R. Jackson2,
  • Kyle Jared Ortiz Rodriguez1 &
  • Ranjeet S. Singh4 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note that errors affecting the content may be present, and all legal disclaimers apply.

Subjects

  • Computational biology and bioinformatics
  • Health care
  • Medical research
  • Neurology

Abstract

Advanced large language models (LLMs) such as ChatGPT, Gemini, Grok3, and Claude offer new possibilities for medical research interpretation and clinical decision support. While these models demonstrate remarkable natural language processing capabilities, their ability to independently reason through clinical trial data and produce conclusions consistent with published trial interpretations remains underexplored. Our objective was to evaluate the reliability of LLMs in interpreting numerical and statistical healthcare data. Landmark randomized controlled trials (RCTs) were selected as a standardized domain to minimize bias from poor-quality research designs: twenty landmark RCTs from the New England Journal of Medicine, spanning neurosurgical and cardiovascular intervention domains, were analyzed. Four AI platforms were evaluated using a structured prompt covering five domains: evidence interpretation, statistical understanding, clinical relevance, limitation recognition, and practical applicability. Two independent raters scored all AI outputs on a 0–5 scale per domain, and interobserver reliability was assessed. Primary outcomes included concordance with published trial conclusions, accuracy of primary outcome identification, and appropriateness of recommendations. Secondary outcomes included output pattern analysis, recognition of limitations, and handling of confounding factors. ChatGPT demonstrated the highest concordance with published conclusions at 100%, followed by Gemini at 84%, Grok3 at 72%, and Claude at 68%. These concordance scores should be interpreted cautiously, however, as the LLMs may have been trained on these published trials, potentially inflating alignment with their conclusions. ChatGPT and Gemini accurately identified limitations and confounding factors, while Grok3 and Claude struggled on these secondary outcomes. Interobserver reliability between raters was good (Cronbach’s α = 0.868), supporting the consistency of scoring.
Certain LLMs, particularly ChatGPT and Gemini, can reliably interpret clinical trial data and align closely with human conclusions, suggesting a potential role in data summarization and evidence synthesis. These findings underscore the importance of selecting AI tools carefully and point to potential applications in research and clinical workflows.
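As an illustrative sketch only (not the authors' analysis code), the two kinds of summary statistics reported above — per-trial concordance with published conclusions, and two-rater interobserver reliability via Cronbach's α — could be computed as follows. All function names and example data here are hypothetical.

```python
import statistics

def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of per-rater score lists (equal length).

    alpha = k/(k-1) * (1 - sum of per-rater variances / variance of totals),
    using sample variances throughout.
    """
    k = len(ratings)
    rater_vars = sum(statistics.variance(r) for r in ratings)
    totals = [sum(scores) for scores in zip(*ratings)]
    return k / (k - 1) * (1 - rater_vars / statistics.variance(totals))

def concordance_rate(model_conclusions, published_conclusions):
    """Fraction of trials where the model's conclusion matched the paper's."""
    matches = sum(m == p for m, p in zip(model_conclusions, published_conclusions))
    return matches / len(published_conclusions)

# Hypothetical 0-5 domain scores from two independent raters:
rater_a = [5, 4, 4, 3, 5, 2, 4, 5]
rater_b = [5, 4, 3, 3, 5, 3, 4, 4]
print(round(cronbach_alpha([rater_a, rater_b]), 3))  # → 0.874

# Hypothetical agree/disagree labels over three trials:
print(round(concordance_rate(["agree", "agree", "disagree"],
                             ["agree", "agree", "agree"]), 3))  # → 0.667
```

With two raters, α near 1 indicates the raters' scores move together; values around 0.87, as in the hypothetical data above, would be consistent with the "good" reliability reported in the abstract.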

Data availability

All data analyzed in this study came from manuscripts published in the New England Journal of Medicine.


Funding

This research received no funding.

Author information

Authors and Affiliations

  1. Department of Neurological Surgery, Indiana University School of Medicine, Indianapolis, IN, USA

    Gordon Mao & Kyle Jared Ortiz Rodriguez

2. Indiana University School of Medicine, Indianapolis, IN, 46202, USA

    William Snyder III, Anoop S. Chinthala, Arshjeet Singh, Alexander J. Potts & Luke R. Jackson

  3. Department of Neurological Surgery, Allegheny Health Network, Pittsburgh, PA, USA

    Barnabas Obeng-Gyasi

  4. Department of Neurocritical Care, Indiana University School of Medicine, Indianapolis, IN, USA

    Ranjeet S. Singh

Contributions

G.M. and R.S.S. conceived and supervised the study. W.S., A.S.C., A.S., B.O-G., A.P., L.R.J., and K.J.O.R. collected data from the clinical trials. A.P. and W.S. performed statistical analysis and created figures. W.S. drafted the manuscript. All authors reviewed, edited, and approved the final version of the manuscript.

Corresponding author

Correspondence to Alexander J. Potts.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Mao, G., Snyder, W., Chinthala, A.S. et al. Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45326-2


  • Received: 07 November 2025

  • Accepted: 18 March 2026

  • Published: 02 April 2026

  • DOI: https://doi.org/10.1038/s41598-026-45326-2

