Introduction

Differential diagnosis (DDx), a critical component of clinical care, involves generating a list of potential conditions that could explain a patient’s symptoms1. It facilitates comprehensive case evaluation, identifies critical but subtle conditions, guides diagnostic testing, and optimizes resource utilization. Additionally, DDx fosters patient involvement and trust through improved communication. While numerous automatic DDx systems2,3 have been developed to support decision-making, their black-box nature, particularly in deep learning models, often undermines trust4. To address this, providing interpretative insights alongside diagnostic predictions is essential5. Explainable DDx, which takes patient symptom descriptions as input, generates differential diagnoses, and offers accompanying explanations, is thus highly desirable in clinical practice.

In recent years, large language models (LLMs), such as ChatGPT, trained on extensive corpora, have exhibited remarkable capabilities in various clinical scenarios, including medical question answering (QA)6,7,8,9, clinical text summarization10, and disease diagnosis11,12,13,14,15,16. Motivated by these advancements, some studies have explored the use of LLMs to improve diagnostic accuracy17. For instance, McDuff et al.18 fine-tuned PaLM 2 on medical data and developed an interactive interface to assist clinicians with DDx generation, while Savage et al.19 refined Chain-of-Thought (CoT) prompting20 to harness LLMs’ reasoning capabilities.

Despite these efforts, the potential of LLMs to generate reliable DDx explanations remains largely unexplored, leaving their role in supporting clinical decision-making uncertain. Two key challenges impede progress in this domain. First, the absence of DDx datasets annotated with diagnostic explanations limits model development and evaluation21,22. Second, numerous studies have highlighted LLMs’ inherent difficulties with complex reasoning tasks23,24, such as multi-step logical reasoning25,26 and clinical decision-making27,28. Thus, creating tailored datasets and developing novel methodologies to enable LLMs to synthesize high-quality DDx explanations are worthy of exploration.

In this study, we addressed these challenges by investigating prompting strategies for generating trustworthy DDx explanations. Our contributions are threefold. First, we curated a new dataset of 570 clinical notes across nine specialties, sourced from publicly available medical corpora and annotated by domain experts with differential diagnoses and explanations. To our knowledge, this is the first publicly available structured dataset with DDx explanation annotation21,29, which facilitates automated evaluation and holds substantial potential to advance the field. Second, we proposed Dual-Inf, a customized framework to optimize LLMs’ explanation generation capabilities. The core design lies in enabling LLMs to perform bidirectional inference (i.e., from symptoms to diagnoses and vice versa), leveraging backward verification to boost prediction correctness. Third, we comprehensively evaluated Dual-Inf for explainable DDx, including model explainability and error analysis. The results demonstrated that Dual-Inf achieved superior diagnostic performance while delivering reliable interpretations across various base LLMs (i.e., GPT-4, GPT-4o, Llama3-70B, and BioLlama3-70B). Overall, our findings highlight the effectiveness of Dual-Inf as a promising tool for improving clinical decision-making.

Results

Dataset

We developed Open-XDDx, a well-annotated dataset for explainable DDx, consisting of 570 clinical notes from publicly available medical exercises across nine specialties: cardiovascular, digestive, respiratory, endocrine, nervous, reproductive, circulatory, skin, and orthopedic diseases. Each note includes patient symptoms, differential diagnoses, and expert-derived explanations from the University of Minnesota (Supplementary Appendix 1). The dataset statistics are detailed in Table 1 and Table 2.

Table 1 The data characteristics of our annotated explainable DDx dataset Open-XDDx
Table 2 Breakdown of the notes in the DDx dataset Open-XDDx across the nine clinical specialties

Differential diagnosis performance

We evaluated differential diagnosis accuracy (Eq. 1) by comparing model predictions to ground-truth diagnoses, using the prompts detailed in Supplementary Appendix 3. The results with GPT-4 and GPT-4o are depicted in Fig. 1(b), and the results with Llama3-70B and BioLlama3-70B are presented in Supplementary Appendix 4. The results showed that Dual-Inf consistently outperformed the baselines across the nine specialties. Specifically, when built on GPT-4, the overall performance of SC-CoT significantly exceeded that of CoT (difference of 0.032, 95% CI 0.021–0.043, p = 0.001) and Diagnosis-CoT (difference of 0.019, 95% CI 0.001–0.028, p = 0.004). Dual-Inf further surpassed SC-CoT (0.533 vs. 0.472, difference of 0.061, 95% CI 0.055–0.062, p < 0.001). Concretely, the performance improvement of Dual-Inf over SC-CoT exceeded 16% for cardiovascular and digestive diseases. Similarly, using GPT-4o, Dual-Inf achieved over 0.55 accuracy on nervous, skin, and orthopedic diseases, exceeding the baselines by over 9%. With Llama3-70B and BioLlama3-70B, Dual-Inf outperformed SC-CoT by over 10% in cardiovascular, digestive, and respiratory diseases. The overall performance improvement of Dual-Inf over SC-CoT across the other three base LLMs (differences of 0.059, 0.048, and 0.049) was statistically significant (p < 0.001).

Fig. 1: Overview of the proposed framework and differential diagnosis performance.
figure 1

a An overview of the Dual-Inference Large Language Model framework (Dual-Inf) for explainable DDx. Dual-Inf consists of four components: (1) a forward-inference module, an LLM that generates initial diagnoses from patient symptoms; (2) a backward-inference module, an LLM that performs inverse inference by recalling the representative symptoms associated with the initial diagnoses, i.e., from diagnoses to symptoms; (3) an examination module, another LLM that receives the patient notes and the outputs of the two modules for prediction assessment (e.g., completeness examination) and decision making (e.g., filtering out low-confidence diagnoses); and (4) an iterative self-reflection mechanism, which iteratively feeds the low-confidence diagnoses back to the forward-inference module to “think twice”. b Differential diagnosis performance with two base LLMs (GPT-4 and GPT-4o) across nine specialties. The results are averaged over five runs, and standard deviations are shown.

Interpretation performance

Model explainability was examined through automatic and human assessments. For automatic evaluation, GPT-4o was employed to measure the consistency between ground-truth and predicted interpretations, utilizing the prompts detailed in Supplementary Appendix 3. We tested four base LLMs (GPT-4, GPT-4o, Llama3-70B, and BioLlama3-70B). Partial results with GPT-4 are shown in Fig. 2(a), with additional results in Supplementary Appendix 5. In Fig. 2(a), the interpretation accuracy (Eq. 2) of Diagnosis-CoT and SC-CoT was 0.305 and 0.334, surpassing CoT by 0.011 (95% CI 0.004–0.019, p = 0.012) and 0.04 (95% CI 0.038–0.043, p < 0.001), respectively. Dual-Inf achieved even higher accuracy at 0.446, a 0.112 improvement over SC-CoT (95% CI 0.105–0.118, p < 0.001). Concretely, the improvement of Dual-Inf over the baselines surpassed 26% for cardiovascular and respiratory diseases. For BERTScore, SentenceBert, and METEOR, Dual-Inf outperformed SC-CoT with scores of 0.345 vs. 0.258 (difference of 0.087, 95% CI 0.083–0.090, p < 0.001), 0.427 vs. 0.356 (difference of 0.071, 95% CI 0.067–0.076), and 0.333 vs. 0.251 (difference of 0.082, 95% CI 0.076–0.088), respectively. When taking GPT-4o as the base LLM, the interpretation accuracy of Dual-Inf reached 0.488, outperforming CoT and SC-CoT, which scored 0.366 and 0.408, respectively. On the other metrics, Dual-Inf consistently surpassed SC-CoT, with differences of 0.083, 0.064, and 0.08. Similarly, with Llama3-70B and BioLlama3-70B, Dual-Inf exceeded SC-CoT by over 17% across all metrics. In particular, the improvement in interpretation accuracy over the baselines exceeded 25% for digestive, respiratory, and endocrine diseases (Supplementary Appendix 5).

Fig. 2: Interpretation performance and error analysis.
figure 2

a Interpretation performance w.r.t interpretation accuracy (see Eq. 2) and BERTScore across nine clinical specialties. The methods were implemented with GPT-4. The results are averaged over five runs, and standard deviations are shown. b Human evaluation results on interpretation, assessing three aspects: correctness, completeness, and usefulness, with scores ranging from 1 to 5. c Error type analysis on interpretation. We manually examined 100 cases and recorded the count of each error type. Diag-CoT denotes Diagnosis-CoT, and Self-Cont denotes Self-Contrast. The results are averaged over five runs, and the methods were implemented with GPT-4.

The interpretations were also manually examined by clinicians on three qualitative metrics: Correctness, Completeness, and Usefulness (Supplementary Appendix 2). Figure 2b presents the results for 100 randomly selected notes, using GPT-4 as the base LLM. We observed that the Correctness score of Dual-Inf predominantly ranged from 3 to 4, whereas SC-CoT scores mainly fell between 2 and 3. In terms of Completeness score, Dual-Inf achieved 38 scores of 3 and 21 scores of 4, compared to SC-CoT’s 19 and 3, respectively. Regarding the Usefulness score, Dual-Inf had 33 scores of 3 and 25 scores of 4, while SC-CoT had 26 and 10, respectively.

Case study

We further provided case studies to demonstrate the superior explainability of Dual-Inf over the baselines. The example in Fig. 3 showed that SC-CoT provided only three correct explanations for one differential, i.e., Pneumothorax, whereas Dual-Inf generated more accurate explanations. In addition, Dual-Inf identified one more correct differential than the baselines, i.e., Hemothorax, supported by three correct explanations. See more examples and detailed illustrations in Supplementary Appendices 7 and 8.

Fig. 3: Case study of SC-CoT and Dual-Inf.
figure 3

The methods are implemented by taking GPT-4 as the base LLM. Correct predictions are highlighted in blue.

Error analysis on explanation

We analyzed error types in the generated explanations by comparing Dual-Inf with the baselines on 100 randomly selected samples with incorrect outputs. Based on prior studies30,31, errors were categorized as missing content (missing at least two pieces of evidence), factual errors (medically incorrect statements), or low relevance (evidence not highly pertinent). Using GPT-4 as the base LLM (Fig. 2(c)), SC-CoT had 89 cases of missing content versus 76 for Dual-Inf (difference 13.4, 95% CI 11.5–15.2). For factual errors, the baselines achieved similar performance, and the error counts for SC-CoT and Dual-Inf were 17 vs. 8.2 (difference 8.8, 95% CI 7.8–9.8). For low-relevance errors, Self-Contrast and SC-CoT had fewer errors than CoT and Diagnosis-CoT, while the comparison between SC-CoT and Dual-Inf was 15.4 vs. 10.8 (difference 4.6, 95% CI 3.9–5.3). All differences were statistically significant (p < 0.001). We further presented the error counts for each clinical specialty in Supplementary Appendix 6. The results showed that errors occurred across all specialties, with the nervous and digestive disease specialties having more errors.

Ablation study

We evaluated the contribution of each component in Dual-Inf through four variants: (1) the forward-inference module only (FI), (2) FI combined with the examination module, i.e., Dual-Inf without backward-inference (FI-EM), (3) FI-EM without self-reflection (FI-EM*), and (4) Dual-Inf without self-reflection (Dual-Inf*). We adopted the automatic metrics for this evaluation. The results in Supplementary Appendix 12 confirmed that the full Dual-Inf achieved superior diagnostic accuracy and explainability, highlighting the necessity of all components.

Discussion

Our study demonstrated that Dual-Inf significantly enhanced diagnostic accuracy by filtering low-confidence diagnoses through quality assessment. Specifically, the examination module consolidated outputs from the other components to verify correctness, while the self-reflection mechanism enabled the forward-inference module to refine predictions iteratively. To evaluate iterative reflection, we tracked the iteration count for each note in Dual-Inf (Fig. 4a), revealing that most predictions were iteratively revised. For ten randomly selected notes that ran for five iterations, the number of correct diagnoses improved progressively (Fig. 4c), confirming the effectiveness of the iterative reflection mechanism. In addition, for most notes, prediction correctness improved or remained stable at the fourth or fifth iteration, demonstrating the need to set the maximum iteration number λ to a relatively large value (e.g., 5). Additionally, the distribution of diagnostic accuracy across cases, visualized in Fig. 4b, showed that the median and upper quartile for Dual-Inf (0.495 and 0.746) outperformed SC-CoT (0.434 and 0.652) and Diagnosis-CoT (0.421 and 0.628), with statistically significant improvements (p < 0.001). These findings highlight the efficacy of Dual-Inf in enhancing diagnostic accuracy.

Fig. 4: In-depth analysis of Dual-Inf.
figure 4

a Statistics of the number of iterations for each note in Dual-Inf. b Distribution of diagnostic accuracy for each note. Diag-CoT denotes Diagnosis-CoT, and Self-Cont denotes Self-Contrast. c Performance change of Dual-Inf w.r.t diagnosis and explanation after each iteration, for ten randomly selected notes with five iterations. In this figure, the base LLM of all methods is GPT-4. d Distribution of interpretation performance for each note. SC-CoT and Dual-Inf were implemented with GPT-4. The circular points shown as outliers indicate scores that deviate from the vast majority.

Second, Dual-Inf produced superior DDx explanations. Manual evaluation of 100 cases (Fig. 2b) showed higher scores across all metrics, attributed to bidirectional inference and iterative prediction refinement. The effectiveness of iterative reflection was confirmed through ten notes with five iterations (Fig. 4c) and a case study of intermediate predictions (Supplementary Appendix 7), both demonstrating improved explanations over iterations. Distribution analysis (Fig. 4d) revealed higher median and quartile scores (e.g., BERTScore, METEOR) for Dual-Inf compared to SC-CoT, confirming its ability to generate better explanations in most cases. Notably, although the note snippets were publicly available, the ground-truth DDx and the corresponding explanations were manually created by our domain experts. Therefore, the LLMs had not been exposed to the ground truth, and the evaluation on our dataset is trustworthy.

Third, this study demonstrated that leveraging multiple LLMs mitigates explanation errors in DDx. As shown in Fig. 2c, Self-Contrast and SC-CoT reduced low-relevance errors compared to CoT and Diagnosis-CoT, highlighting the benefit of integrating multiple LLM interpretations to address hallucinations. Dual-Inf further minimized errors across all error types, attributed to its dual-inference scheme: the forward-inference module generated diagnoses, the backward-inference module recalled medical knowledge, and the examination module refined predictions. The self-reflection mechanism further improved explanation quality and reduced hallucinations through iterative refinement. Additionally, the higher error counts observed in the nervous and digestive disease specialties were attributable to their larger sample sizes in the dataset; normalizing the error counts by sample size revealed comparable error rates across the specialties.

One limitation of this study is that our dataset, while encompassing nine clinical specialties, does not fully capture the breadth of real-world scenarios. In addition, the dataset lacks annotations on the priority of each diagnosis within the DDx, as ranking the likelihood of possible diseases presents significant challenges. Furthermore, the backward-inference module relies on internal medical knowledge to generate reference signs and symptoms, making it vulnerable to severe hallucinations or erroneous knowledge, which could impact performance. This issue could be mitigated by implementing Dual-Inf with more advanced LLMs32.

In summary, this study established a manually annotated dataset for explainable DDx and designed a tailored framework that effectively harnessed LLMs to generate high-quality explanations. The findings revealed that existing prompting methods exhibited suboptimal performance in generating DDx and explanations, limiting their practical utility in clinical scenarios. Our experiments verified the effectiveness of Dual-Inf in providing accurate DDx, delivering comprehensive explanations, and reducing prediction errors. Furthermore, the released dataset with ground-truth DDx and explanations could facilitate research in this field. Future work could expand the dataset to a broader range of clinical specialties or integrate domain knowledge from external databases for further performance gains.

Methods

Data acquisition and processing

The data were sourced from publicly available medical exercises collected from medical books33,34 and the MedQA USMLE dataset35. There were two key criteria for selecting the clinical notes: (1) the notes must originate from disease diagnosis exercises, and (2) they must pertain to one of the nine clinical specialties. We transformed the exercises into free text by preserving the symptom descriptions and removing the multiple-choice options, where applicable. The texts were further preprocessed by (1) removing duplicate notes, (2) unifying all characters into UTF-8 encoding and removing illegal UTF-8 strings, (3) correcting or removing special characters, and (4) filtering out notes with fewer than 130 characters. In total, we collected 570 clinical notes, of which 10 were used for prompt development and 560 were reserved for evaluation. The full dataset can be found in Supplementary Appendix 13.
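For illustration, the preprocessing steps above can be expressed as a short Python routine. This is a minimal sketch: the specific normalization choices (NFKC normalization, stripping non-printable characters, whitespace collapsing) are assumptions made here, not necessarily the exact rules applied to Open-XDDx.

```python
import unicodedata

def preprocess_notes(raw_notes, min_chars=130):
    """Sketch of the note-cleaning pipeline; normalization choices are illustrative."""
    seen, cleaned = set(), []
    for note in raw_notes:
        # (2) unify encoding: drop byte sequences that are not valid UTF-8
        text = note.encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")
        # (3) correct or remove special characters (NFKC + strip non-printables)
        text = unicodedata.normalize("NFKC", text)
        text = "".join(ch for ch in text if ch.isprintable())
        text = " ".join(text.split())
        # (1) remove duplicate notes
        if text in seen:
            continue
        seen.add(text)
        # (4) filter out notes with fewer than 130 characters
        if len(text) >= min_chars:
            cleaned.append(text)
    return cleaned
```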

Data annotation

The raw data generally lacked annotations for DDx, explanations, and clinical specialty. To build a well-annotated dataset, we employed three clinical physicians to curate the dataset manually. Each exercise was annotated independently by two physicians; when they disagreed, a third physician examined the case and made the final annotation. We checked the inter-annotator agreement (IAA) on DDx, interpretation, and specialty (Supplementary Appendix 1). Additionally, our dataset is well-structured in a standardized format, facilitating automated evaluation.
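As an illustrative example, agreement on the categorical specialty labels could be quantified with Cohen's kappa, as in the minimal sketch below; the labels shown are hypothetical, and the IAA statistics actually reported in Supplementary Appendix 1 may use a different measure.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical specialty labels assigned by the two independent annotators
annotator_1 = ["cardiovascular", "digestive", "nervous", "skin", "respiratory"]
annotator_2 = ["cardiovascular", "digestive", "nervous", "orthopedic", "respiratory"]

# Cohen's kappa: chance-corrected agreement for categorical labels
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Specialty IAA (Cohen's kappa): {kappa:.2f}")
```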

Model development

Effectively eliciting LLMs’ capability to generate accurate DDx explanations is challenging. Inspired by the fact that humans usually conduct backward reasoning to validate the correctness of answers when solving reasoning problems36, we proposed performing backward verification (i.e., from diagnosis to symptoms) to examine the predicted diagnoses and elicit correct answers via self-reflection. Accordingly, we developed a customized framework called Dual-Inference Large Language Model (Dual-Inf), shown in Fig. 1a. Specifically, Dual-Inf consisted of four components: (1) a forward-inference module, an LLM that produced initial diagnoses, i.e., from patients’ symptoms to diagnoses; (2) a backward-inference module, an LLM that performed inverse inference by recalling the representative symptoms of the initial diagnoses, i.e., from diagnoses to symptoms; (3) an examination module, another LLM that received the patient notes and the outputs of the two modules for prediction assessment and decision making; and (4) an iterative self-reflection mechanism, which iteratively took low-confidence diagnoses as feedback for the forward-inference module to “think twice”.

The pipeline was as follows. First, the forward-inference module analyzed the clinical notes to infer initial diagnoses and provide interpretations. Next, the backward-inference module received the initial diagnoses as input and recalled the representative symptoms that these diagnoses generally present, including medical examination and laboratory test results. Because the recalled symptoms were derived from the LLM’s internal knowledge, which is generally reliable in advanced LLMs30,32, they could serve as a reference for measuring the correctness of the predicted explanations. Afterward, the examination module verified and refined the above results. Specifically, it (i) checked the forward-inference module’s explanations against the recalled knowledge and discarded erroneous ones, (ii) supplemented the interpretations by integrating the patient notes with the recalled knowledge, and (iii) decided whether to accept or filter out predictions based on their quality. The underlying idea was that a diagnosis supported by fewer interpretations was deemed less trustworthy. To assess diagnostic confidence, a threshold β was applied: diagnoses with fewer than β supporting interpretations were flagged as low-confidence. The self-reflection mechanism then took the low-confidence diagnoses as feedback to prompt the forward-inference module to “think twice.” This iterative process continued up to a maximum limit λ, balancing accuracy with efficiency. Upon reaching this limit, the framework output the final results. The prompts for the three modules are detailed in Supplementary Appendix 9. Importantly, the prompts for the forward-inference module were carefully designed to ensure objectivity toward feedback from the examination module, reducing the risk of false negatives undermining correct predictions.
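The following is a minimal Python sketch of this pipeline, assuming the base LLM is wrapped as a text-in/text-out callable that can be asked to return JSON; the prompt wording, the JSON contract, and the parsing logic are illustrative assumptions and do not reproduce the prompts in Supplementary Appendix 9.

```python
import json

def dual_inf(note, llm, beta=3, max_iter=5):
    """Sketch of the Dual-Inf loop. `llm` is a text-in/text-out callable asked to
    reply with JSON of the form {"diagnosis": ["explanation", ...], ...}."""
    feedback, accepted = [], {}
    for _ in range(max_iter):
        # (1) Forward inference: symptoms -> candidate diagnoses with explanations
        candidates = json.loads(llm(
            f"Patient note:\n{note}\n"
            f"Low-confidence diagnoses to reconsider: {feedback}\n"
            "Return a JSON object mapping each differential diagnosis to a list "
            "of supporting explanations."))
        # (2) Backward inference: each diagnosis -> its representative symptoms
        recalled = {dx: llm(f"List the representative symptoms, examination findings, "
                            f"and laboratory results typically associated with {dx}.")
                    for dx in candidates}
        # (3) Examination: verify explanations against recalled knowledge and
        #     supplement missing evidence, keeping the same JSON format
        verified = json.loads(llm(
            f"Patient note:\n{note}\n"
            f"Candidate diagnoses: {json.dumps(candidates)}\n"
            f"Reference knowledge: {json.dumps(recalled)}\n"
            "Discard explanations unsupported by the reference knowledge, add "
            "missing ones, and return the same JSON format."))
        # Flag diagnoses with fewer than beta supporting interpretations as low-confidence
        feedback = [dx for dx, expl in verified.items() if len(expl) < beta]
        accepted = {dx: expl for dx, expl in verified.items() if len(expl) >= beta}
        if not feedback:  # (4) Self-reflection: iterate only while low-confidence diagnoses remain
            break
    return accepted
```

The default values beta = 3 and max_iter = 5 in this sketch mirror the β and λ settings described in the implementation details below.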

Implementation details

We adopted four baselines: (1) CoT20, a popular prompting method; (2) Diagnosis-CoT19, a customized prompting method for disease diagnosis; (3) Self-Contrast37, an advanced method with multiple prompts and a re-examination mechanism to enhance reasoning; and (4) self-consistency CoT (SC-CoT)38, which aggregated multiple reasoning paths to enhance performance. We followed the original papers in our implementation. Specifically, SC-CoT generated five reasoning paths for each note and then selected the most consistent diagnoses and interpretations. The prompts for the baselines are shown in Supplementary Appendix 10. For Dual-Inf, we incorporated CoT into the three LLM-based modules. The maximum iteration number λ was set to 5, considering the trade-off between effectiveness and efficiency, and the threshold β was set to 3. We further analyzed the impact of the hyperparameter β on performance and present the results in Supplementary Appendix 11. For a fair comparison, all the methods were implemented with the same base LLM, chosen from GPT-4, GPT-4o, Llama3-70B (https://huggingface.co/meta-llama/Meta-Llama-3-70B), and BioLlama3-70B (https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B). For the first two, we used the OpenAI API (https://platform.openai.com/docs/models) with the model identifiers “gpt-4-turbo-preview” and “gpt-4o”; for the latter two, we downloaded the models from Huggingface for inference. The temperature parameter was set to 0.1.
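As a sketch of how the base LLM can be plugged in, the GPT-4 family could be wrapped as the text-in/text-out callable assumed above using the OpenAI Python client (v1 interface); the wrapper below is illustrative, and an analogous wrapper around a Hugging Face transformers pipeline would serve Llama3-70B and BioLlama3-70B.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt4_llm(prompt, model="gpt-4-turbo-preview", temperature=0.1):
    """Wraps an OpenAI chat model as the text-in/text-out callable used in the
    Dual-Inf sketch above; error handling and retries are omitted."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```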

Performance evaluation

We conducted automatic evaluation by comparing the ground-truth annotations with the model predictions. Following related papers27, we used accuracy as the primary metric for assessing diagnostic performance, i.e.,

$${\rm{Diagnostic\; Accuracy}}=\frac{{\rm{Cumulative\; number\; of\; correct\; diagnoses}}}{{\rm{Total\; number\; of\; diagnoses}}}$$
(1)

For interpretation performance, we employed metrics designed to assess the semantic alignment between the reference text and the predicted text, rather than relying solely on string matching. The metrics, including accuracy, BERTScore39, SentenceBert40, and METEOR41, have been widely used in related tasks42,43. Concretely, interpretation accuracy was computed as:

$${\rm{Interpretation\; Accuracy}}=\frac{{\rm{Cumulative\; number\; of\; correct\; interpretations}}}{{\rm{Total\; number\; of\; interpretations}}}$$
(2)
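Both accuracies are micro-averaged ratios over the entire test set. The sketch below illustrates this aggregation; the per-note record fields and the use of predicted items as the denominator are assumptions for illustration only.

```python
def micro_accuracy(records, item):
    """Micro-averaged accuracy across all notes (cf. Eqs. 1 and 2). Each record is
    assumed to store, per note, the list of predicted items ("diagnoses" or
    "interpretations") and the subset judged correct."""
    correct = sum(len(r[f"correct_{item}"]) for r in records)
    total = sum(len(r[f"predicted_{item}"]) for r in records)  # denominator assumption
    return correct / total if total else 0.0

# Hypothetical usage:
# diagnostic_acc = micro_accuracy(results, "diagnoses")
# interpretation_acc = micro_accuracy(results, "interpretations")
```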

BERTScore39 employs the BERT model44 to determine the semantic similarity between reference and generated text, offering a context-aware evaluation of model performance. SentenceBert40 measures sentence similarity using a BERT model that generates dense vector representations, facilitating efficient and accurate semantic comparisons. METEOR41 assesses the harmonic mean of unigram precision and recall, utilizing stemmed forms and synonym equivalence. The details of automatic and human evaluation are shown in Supplementary Appendix 2.
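The three semantic metrics can be computed with standard open-source packages, as in the sketch below; the SentenceBert checkpoint (all-MiniLM-L6-v2) and the choice of BERTScore F1 are illustrative assumptions rather than the paper's exact configuration.

```python
import nltk
from bert_score import score as bertscore
from nltk.translate.meteor_score import meteor_score
from sentence_transformers import SentenceTransformer, util

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative SentenceBert checkpoint

def semantic_metrics(reference, prediction):
    """Computes the three semantic-alignment metrics for one explanation pair."""
    # BERTScore: contextual token-embedding matching (F1 reported here)
    _, _, f1 = bertscore([prediction], [reference], lang="en")
    # SentenceBert: cosine similarity between sentence embeddings
    sbert = util.cos_sim(encoder.encode(prediction, convert_to_tensor=True),
                         encoder.encode(reference, convert_to_tensor=True)).item()
    # METEOR: unigram precision/recall with stemming and synonyms (pre-tokenized input)
    meteor = meteor_score([reference.split()], prediction.split())
    return {"BERTScore": f1.item(), "SentenceBert": sbert, "METEOR": meteor}
```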