Table 3 Initial vs. final diagnosis uncertainty within four verification strategies across two LLMs and two medical datasets

From: Two-stage prompting framework with predefined verification steps for evaluating diagnostic reasoning tasks on two datasets

| Dataset-Model | Strategy | Initial Uncertainty (%) | Final Uncertainty (%) | Initial Median (IQR) % | Final Median (IQR) % | p value |
|---|---|---|---|---|---|---|
| MedQA-GPT-4o (n = 570) | SV | 5.51% | 2.32% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | KV | 4.35% | 2.11% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | DV | 4.74% | 0.56% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | HV | 2.77% | 1.51% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| MedQA-DeepSeek-V3 (n = 570) | SV | 3.75% | 2.18% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | KV | 4.21% | 3.19% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | DV | 3.05% | 1.19% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | HV | 3.02% | 2.46% | 0% (0-0%) | 0% (0-0%) | 0.002** |
| NEJM-GPT-4o (n = 300) | SV | 21.87% | 9.60% | 0% (0-40%) | 0% (0-20%) | <0.001*** |
| | KV | 14.93% | 6.47% | 0% (0-20%) | 0% (0-0%) | <0.001*** |
| | DV | 18.67% | 2.67% | 0% (0-40%) | 0% (0-0%) | <0.001*** |
| | HV | 9.33% | 5.13% | 0% (0-20%) | 0% (0-0%) | <0.001*** |
| NEJM-DeepSeek-V3 (n = 300) | SV | 8.13% | 3.40% | 0% (0-20%) | 0% (0-0%) | <0.001*** |
| | KV | 4.33% | 2.13% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | DV | 4.67% | 0.87% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | HV | 4.73% | 2.87% | 0% (0-0%) | 0% (0-0%) | <0.001*** |

  1. Notes: SV, Stepwise Verification; KV, Knowledge-Augmented Verification; DV, Differential Verification; HV, Hybrid Verification. Initial Uncertainty (%): proportion of responses scored 3 before verification. Final Uncertainty (%): proportion of responses scored 3 after verification. Initial Median (IQR) % and Final Median (IQR) %: median (interquartile range) of case-level uncertainty before and after verification. p values are from the Wilcoxon signed-rank test; *p < 0.05, **p < 0.01, ***p < 0.001.
  2. Bold values indicate statistically significant p values (p < 0.05).
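
The quantities defined in the notes can be illustrated with a minimal sketch. It assumes that each case has several scored responses and that a score of 3 marks an uncertain response, as the notes state. The function names, the five-responses-per-case layout, the toy scores, and the "inclusive" quartile method are hypothetical choices made for illustration, not details taken from the paper.

```python
from statistics import median, quantiles

UNCERTAIN_SCORE = 3  # per the table notes, a response scored 3 counts as uncertain


def case_uncertainty(scores):
    """Percentage of one case's responses that were scored 3."""
    return 100.0 * sum(s == UNCERTAIN_SCORE for s in scores) / len(scores)


def summarize(cases):
    """Summarize a list of cases, each given as a list of response scores.

    Returns the pooled share of responses scored 3, plus the median and
    the (Q1, Q3) quartiles of the case-level uncertainty percentages.
    """
    per_case = [case_uncertainty(c) for c in cases]
    total_responses = sum(len(c) for c in cases)
    uncertain = sum(s == UNCERTAIN_SCORE for c in cases for s in c)
    # The quartile method is an assumption; the source does not name one.
    q1, _, q3 = quantiles(per_case, n=4, method="inclusive")
    return {
        "overall": 100.0 * uncertain / total_responses,
        "median": median(per_case),
        "iqr": (q1, q3),
    }


# Hypothetical toy data: 4 cases with 5 scored responses each (not from the paper).
initial_scores = [
    [1, 1, 1, 1, 1],
    [1, 3, 1, 1, 1],
    [3, 3, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
final_scores = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [3, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

initial_summary = summarize(initial_scores)
final_summary = summarize(final_scores)
print(initial_summary)
print(final_summary)
```

With five responses per case, case-level uncertainty can only take the values 0%, 20%, 40%, and so on, which is one way a median of 0% can sit alongside a nonzero overall percentage. The paired comparison reported in the p value column could then be run on the two lists of case-level percentages, for example with `scipy.stats.wilcoxon`, if SciPy is available.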