Although large language models (LLMs) show promise in controlled settings, a study now exposes their limitations in real-world clinical applications and points the way towards robust evaluation and benchmarking before clinical use.
Ethics declarations
Competing interests
N.H.S. is a cofounder of Prealize Health (a predictive analytics company) and Atropos Health (an on-demand evidence generation company); reports funding from the Gordon and Betty Moore Foundation for developing virtual model deployments; and served on the board of the Coalition for Healthcare AI (CHAI), a consensus-building organization providing guidelines for the responsible use of artificial intelligence in health care. The other authors have no competing financial interests.
Cite this article
Bedi, S., Jain, S.S. & Shah, N.H. Evaluating the clinical benefits of LLMs. Nat Med 30, 2409–2410 (2024). https://doi.org/10.1038/s41591-024-03181-6
DOI: https://doi.org/10.1038/s41591-024-03181-6
This article is cited by
- How to interpret ‘zero-shot’ results from generative EHR models. Nature Medicine (2026)
- An LLM chatbot to facilitate primary-to-specialist care transitions: a randomized controlled trial. Nature Medicine (2026)
- Feasibility of machine learning analysis for the identification of patients with possible primary ciliary dyskinesia. Orphanet Journal of Rare Diseases (2025)
- Two-stage prompting framework with predefined verification steps for evaluating diagnostic reasoning tasks on two datasets. npj Digital Medicine (2025)
- Efficacy of large language models in detecting postoperative delirium from unstructured clinical notes: A retrospective cohort study. npj Digital Medicine (2025)