Table 4 Initial-to-final diagnosis transitions within four verification strategies across two LLMs and two medical datasets

From: Two-stage prompting framework with predefined verification steps for evaluating diagnostic reasoning tasks on two datasets

| Dataset-Model | Strategy | 3 to 2 | 3 to 1 | 3 to 1/2 | 3 to 0 | 0 to 2 | 0 to 1 | 0 to 1/2 |
|---|---|---|---|---|---|---|---|---|
| MedQA-GPT-4o (n = 570) | SV | 32.48% | 14.01% | 46.49% | 15.29% | 2.04% | 0.41% | 2.45% |
| | KV | 34.68% | 3.23% | 37.91% | 17.74% | 0.41% | 0.41% | 0.82% |
| | DV | 37.78% | 12.59% | 50.37% | 41.48% | 3.01% | 0.60% | 3.61% |
| | HV | 18.99% | 7.59% | 26.58% | 21.52% | 2.37% | 0.40% | 2.77% |
| MedQA-DeepSeek-V3 (n = 570) | SV | 27.10% | 13.08% | 40.18% | 13.08% | 2.39% | 0.80% | 3.19% |
| | KV | 17.50% | 8.33% | 25.83% | 9.17% | 3.49% | 0.32% | 3.81% |
| | DV | 19.54% | 11.49% | 31.03% | 35.63% | 0.62% | 1.09% | 1.71% |
| | HV | 5.81% | 4.65% | 10.46% | 11.63% | 2.22% | 0.30% | 2.52% |
| NEJM-GPT-4o (n = 300) | SV | 9.45% | 7.93% | 17.38% | 46.65% | 2.34% | 0.47% | 2.81% |
| | KV | 12.05% | 8.48% | 20.53% | 43.75% | 0.84% | 0.42% | 1.26% |
| | DV | 16.43% | 8.57% | 25.00% | 63.21% | 1.92% | 0.59% | 2.51% |
| | HV | 8.57% | 5.71% | 14.28% | 38.57% | 2.92% | 1.02% | 3.94% |
| NEJM-DeepSeek-V3 (n = 300) | SV | 14.75% | 4.10% | 18.85% | 45.90% | 3.03% | 1.79% | 4.82% |
| | KV | 15.38% | 7.69% | 23.07% | 40.00% | 2.34% | 0.91% | 3.25% |
| | DV | 15.71% | 4.29% | 20.00% | 64.29% | 1.69% | 0.39% | 2.08% |
| | HV | 7.04% | 5.63% | 12.67% | 38.03% | 3.35% | 1.03% | 4.38% |

Notes: SV, Stepwise Verification; KV, Knowledge-Augmented Verification; DV, Differential Verification; HV, Hybrid Verification. “3 to 2,” “3 to 1,” “3 to 1/2,” and “3 to 0” indicate the proportion of responses scored 3 (uncertain) in the initial diagnosis that changed to scores 2 (correct), 1 (partially correct), 1 or 2, and 0 (incorrect) in the final diagnosis, respectively. “0 to 2,” “0 to 1,” and “0 to 1/2” indicate the proportion of responses scored 0 in the initial diagnosis that changed to scores 2, 1, and 1 or 2, respectively, in the final diagnosis.
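
As a reading aid, the transition proportions defined in the notes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function name `transition_rates`, the `(initial, final)` pair representation, and the toy data are all hypothetical. It assumes the denominator is the number of responses that received the given initial score, which is how the notes read.

```python
from collections import Counter


def transition_rates(pairs, initial_score, final_scores=(0, 1, 2)):
    """Percentage of responses with `initial_score` that end at each final score.

    `pairs` is an iterable of (initial, final) score tuples. The denominator is
    the number of responses with that initial score (an assumption based on the
    table notes). Responses whose score does not change still count in the
    denominator.
    """
    finals = [final for initial, final in pairs if initial == initial_score]
    if not finals:
        return {}
    counts = Counter(finals)
    return {f: 100.0 * counts[f] / len(finals) for f in final_scores}


# Hypothetical toy data, not taken from the study.
pairs = [
    (3, 2), (3, 2), (3, 1), (3, 0), (3, 3),  # five responses initially scored 3
    (0, 2), (0, 0), (0, 0), (0, 0),          # four responses initially scored 0
    (2, 2),                                  # ignored for both rows
]

rates_3 = transition_rates(pairs, initial_score=3)
rates_0 = transition_rates(pairs, initial_score=0, final_scores=(1, 2))

# The "1/2" columns are the sum of the corresponding "to 2" and "to 1" columns.
three_to_1_or_2 = rates_3[2] + rates_3[1]
zero_to_1_or_2 = rates_0[2] + rates_0[1]

print(f"3 to 2: {rates_3[2]:.2f}%  3 to 1: {rates_3[1]:.2f}%  "
      f"3 to 1/2: {three_to_1_or_2:.2f}%  3 to 0: {rates_3[0]:.2f}%")
print(f"0 to 2: {rates_0[2]:.2f}%  0 to 1: {rates_0[1]:.2f}%  "
      f"0 to 1/2: {zero_to_1_or_2:.2f}%")
```

Under this reading, the "3 to" columns in a row need not sum to 100%, since responses that remain at score 3 are counted in the denominator but not reported in any column.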