Introduction

Physicians often need to make rapid diagnoses from insufficient data so that treatment can begin and patients can be helped1. To prevent misdiagnoses and medical errors, it is important to consider a broad differential of possible disease etiologies, including both likely and “can’t miss” diagnoses. In the process of arriving at a definitive diagnosis, physicians discuss their diagnostic explanations and adhere to clinical guidelines2,3,4,5. However, because real-world cases often do not align cleanly with clinical guidelines, the probability of each candidate diagnosis must also be weighed.

While large language models (LLMs) can generate a list of possible diagnoses, they often cannot accurately assess the likelihood of each one3. Moreover, when prompted to estimate likelihoods, LLMs tend to produce inaccurate, overconfident estimates3,6,7. Zhou et al. reason that an explainable AI approach would increase the reliability and trustworthiness of LLMs, i.e., an LLM should be able to explain how well a patient’s symptoms and data match the diagnostic criteria proposed in clinical guidelines8.
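To make “overconfident” concrete, the minimal sketch below compares an LLM’s stated probability for its top diagnosis against how often that diagnosis is actually correct. The numbers and field layout are illustrative assumptions, not data from Zhou et al.; the point is only that a positive gap between stated confidence and observed accuracy is the overconfidence the cited studies describe.

```python
# Illustrative only: each record pairs the probability an LLM assigned to its
# top diagnosis with whether that diagnosis matched the reference label.
# The numbers are made up for the sake of the example.
predictions = [
    (0.95, True), (0.90, False), (0.85, False),
    (0.80, True), (0.75, False), (0.70, True),
]

def overconfidence_gap(preds):
    """Mean stated confidence minus observed accuracy.

    A well-calibrated model has a gap near zero; a positive gap means the
    model claims more certainty than its answers warrant.
    """
    mean_confidence = sum(prob for prob, _ in preds) / len(preds)
    accuracy = sum(correct for _, correct in preds) / len(preds)
    return mean_confidence - accuracy

print(f"stated confidence exceeds accuracy by {overconfidence_gap(predictions):.2f}")
```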

Recognizing uncertainty may improve patient outcomes

Uncertainty is common in hospital settings, and unfortunately, the pressure on physicians to arrive quickly at a definitive diagnosis can lead to misdiagnosis and patient harm9. Most people have a psychological bias towards seeking definitive answers, but acknowledging uncertainty when appropriate makes it more likely that physicians will keep seeking the information they need and revise their diagnoses as new findings emerge10,11.

Recognizing what information is lacking requires knowing what you do not know12. Many current approaches to applying LLMs fail to assess whether the available medical data are sufficient before committing to a diagnosis6,13. This is because these LLMs are typically trained to provide answers rather than to admit uncertainty, and training datasets are often curated to exclude cases with noisy or confusing data. As a result, LLMs can hallucinate reasoning or diagnoses. Explainability is critical for all AI systems, but in medicine in particular, false certainty and misdiagnosis can have dangerous consequences.

ConfiDx acknowledges uncertainty and improves trust

In contrast, the outputs from ConfiDx not only explain how well the diagnostic criteria match the patient’s case but also indicate what data would be needed to reach a more certain diagnosis. This improvement arises from training ConfiDx on both evidence-complete and evidence-incomplete notes derived from the MIMIC-IV dataset, which provides de-identified electronic health records for nearly 300,000 patients treated for cardiovascular, endocrine, or hepatic conditions at Beth Israel Deaconess Medical Center14. To generate the evidence-incomplete notes, a portion of the relevant diagnostic evidence was masked, and ConfiDx was trained to flag these cases as diagnostically uncertain.
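A minimal sketch of how such evidence-incomplete training cases could be produced is shown below. This is not the authors’ code: the note text, the list of evidence spans, the `[MASKED]` placeholder, and the field names are assumptions made for illustration. The idea is simply to remove a fraction of the sentences that carry diagnostic evidence and label the result as an uncertain case, keeping track of what was removed so the model can learn what data are missing.

```python
import random

def make_evidence_incomplete(note: str, evidence_spans: list[str],
                             mask_fraction: float = 0.5,
                             placeholder: str = "[MASKED]") -> dict:
    """Mask a fraction of the diagnostic-evidence spans in a clinical note.

    Returns a training example whose label marks the diagnosis as uncertain,
    along with the spans that were removed (the "missing data" the model
    should learn to ask for).
    """
    n_to_mask = max(1, int(len(evidence_spans) * mask_fraction))
    masked_spans = random.sample(evidence_spans, n_to_mask)
    incomplete_note = note
    for span in masked_spans:
        incomplete_note = incomplete_note.replace(span, placeholder)
    return {
        "note": incomplete_note,
        "label": "uncertain",          # evidence-incomplete case
        "missing_evidence": masked_spans,
    }

# Toy example (synthetic text, not MIMIC-IV data):
note = ("Patient reports polyuria and polydipsia. "
        "HbA1c measured at 9.1%. Fasting glucose 182 mg/dL.")
evidence = ["HbA1c measured at 9.1%.", "Fasting glucose 182 mg/dL."]
print(make_evidence_incomplete(note, evidence, mask_fraction=0.5))
```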

Zhou et al. show that, compared with off-the-shelf models such as GPT-4o, OpenAI-o1, Gemini-2.0, Claude-3.7, and DeepSeek-R1, ConfiDx substantially improves uncertainty-aware diagnostic performance on a separate test dataset: it is better not only at disease diagnosis but also at diagnostic explanation, uncertainty recognition, and uncertainty explanation8. In addition, compared with unassisted medical experts, ConfiDx-assisted experts were 10.7% more accurate at uncertainty recognition, 14.6% more accurate at diagnostic explanation, and 26.3% better at uncertainty explanation8.
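One way to read the “uncertainty recognition” component is as a simple binary agreement between the model’s verdict (certain vs. uncertain) and whether the case was actually evidence-complete. The sketch below scores it that way; it is an illustrative interpretation, not the paper’s evaluation code, and the field names are assumptions.

```python
def uncertainty_recognition_accuracy(cases: list[dict]) -> float:
    """Fraction of cases where the model's certainty verdict matches the
    ground truth: evidence-complete -> "certain", evidence-incomplete ->
    "uncertain". Field names here are illustrative assumptions."""
    correct = 0
    for case in cases:
        expected = "certain" if case["evidence_complete"] else "uncertain"
        correct += case["model_verdict"] == expected
    return correct / len(cases)

# Toy evaluation set:
cases = [
    {"evidence_complete": True,  "model_verdict": "certain"},
    {"evidence_complete": False, "model_verdict": "uncertain"},
    {"evidence_complete": False, "model_verdict": "certain"},   # missed uncertainty
]
print(f"uncertainty recognition accuracy: "
      f"{uncertainty_recognition_accuracy(cases):.2f}")
```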

Improvements in recognizing uncertainty and missing information make it easier for physicians to trust ConfiDx, because it is less likely to be falsely confident than other LLMs. Its explanations let physicians follow the “reasoning” by which ConfiDx arrived at a conclusion and judge whether that reasoning makes sense. Importantly, physicians who were randomly assigned to use ConfiDx were more likely to recognize uncertainty, provided more accurate diagnostic explanations, and offered better explanations of uncertainty than physicians who did not use ConfiDx or any other LLM. This, in turn, could help patients better understand their diagnoses and prognoses.

Conclusion

In an era in which quick answers are applauded, taking the time to recognize uncertainty is key to avoiding the medical mishaps that follow from misdiagnosis and false confidence. While uncertainty-aware models have increasingly been developed and used outside of medicine, ConfiDx paves the way as the first example of an LLM in medicine that honors uncertainty. Zhou et al. demonstrate that such LLMs can be beneficial in the clinic and may improve patient outcomes in the future.