Table 6 Initial vs. final diagnostic reasoning hallucinations and errors within four verification strategies across two LLMs and two datasets

From: Two-stage prompting framework with predefined verification steps for evaluating diagnostic reasoning tasks on two datasets

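| Dataset-Model | Strategy | Initial H1 (%) | Final H1 (%) | p value (H1) | Initial H2 (%) | Final H2 (%) | p value (H2) | Initial Reasoning Error (%) | Final Reasoning Error (%) | p value (Reasoning Error) |
|---|---|---|---|---|---|---|---|---|---|---|
| MedQA-GPT-4o (n = 100) | SV | 17.00% | 12.00% | 0.424 | 90.00% | 44.00% | <0.001*** | 83.00% | 66.00% | <0.001*** |
| MedQA-GPT-4o (n = 100) | KV | 19.00% | 4.00% | 0.001** | 91.00% | 44.00% | <0.001*** | 91.00% | 86.00% | 0.267 |
| MedQA-GPT-4o (n = 100) | DV | 8.00% | 2.00% | 0.109 | 86.00% | 39.00% | <0.001*** | 89.00% | 77.00% | 0.004** |
| MedQA-GPT-4o (n = 100) | HV | 11.00% | 27.00% | 0.002** | 94.00% | 63.00% | <0.001*** | 93.00% | 90.00% | 0.375 |
| MedQA-DeepSeek-V3 (n = 100) | SV | 21.00% | 4.00% | <0.001*** | 92.00% | 65.00% | <0.001*** | 99.00% | 89.00% | 0.004** |
| MedQA-DeepSeek-V3 (n = 100) | KV | 3.00% | 2.00% | >0.999 | 96.00% | 42.00% | <0.001*** | 99.00% | 93.00% | 0.041* |
| MedQA-DeepSeek-V3 (n = 100) | DV | 2.00% | 1.00% | >0.999 | 97.00% | 69.00% | <0.001*** | 99.00% | 90.00% | 0.004** |
| MedQA-DeepSeek-V3 (n = 100) | HV | 37.00% | 22.00% | 0.003** | 87.00% | 54.00% | <0.001*** | 100.00% | 91.00% | 0.004** |
| NEJM-GPT-4o (n = 100) | SV | 22.00% | 4.00% | <0.001*** | 91.00% | 36.00% | <0.001*** | 99.00% | 93.00% | 0.031* |
| NEJM-GPT-4o (n = 100) | KV | 4.00% | 0.00% | 0.125 | 91.00% | 45.00% | <0.001*** | 100.00% | 93.00% | 0.016* |
| NEJM-GPT-4o (n = 100) | DV | 13.00% | 1.00% | <0.001*** | 98.00% | 51.00% | <0.001*** | 100.00% | 87.00% | <0.001*** |
| NEJM-GPT-4o (n = 100) | HV | 14.00% | 7.00% | 0.035* | 99.00% | 39.00% | <0.001*** | 100.00% | 96.00% | 0.134 |
| NEJM-DeepSeek-V3 (n = 100) | SV | 37.00% | 6.00% | <0.001*** | 96.00% | 44.00% | <0.001*** | 100.00% | 92.00% | 0.008** |
| NEJM-DeepSeek-V3 (n = 100) | KV | 35.00% | 1.00% | <0.001*** | 98.00% | 48.00% | <0.001*** | 100.00% | 91.00% | 0.004** |
| NEJM-DeepSeek-V3 (n = 100) | DV | 29.00% | 1.00% | <0.001*** | 96.00% | 33.00% | <0.001*** | 100.00% | 94.00% | 0.041* |
| NEJM-DeepSeek-V3 (n = 100) | HV | 40.00% | 7.00% | <0.001*** | 93.00% | 53.00% | <0.001*** | 100.00% | 95.00% | 0.062 |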

  1. Notes: SV Stepwise Verification, KV Knowledge-Augmented Verification, DV Differential Verification, HV Hybrid Verification. H1, hallucination type I; H2, hallucination type II. Initial and Final H1(%)/H2(%): proportion of responses exhibiting H1/H2 before and after verification. Initial and Final Reasoning Error(%): proportion of responses with reasoning errors before and after verification. p values from McNemar test; p < 0.05*, < 0.01**, < 0.001***.
  2. Bold values indicate statistically significant p values (p < 0.05).
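
The notes name the McNemar test as the source of the p values. As a minimal sketch of how such a paired before/after comparison can be computed, the snippet below implements the exact (binomial) form of the test from the two discordant counts. The choice of the exact variant is an assumption, since the notes do not say which form was used, and the function name `mcnemar_exact` and its parameters are hypothetical. The worked case uses the NEJM-GPT-4o KV row: the reasoning error rate goes from 100.00% to 93.00% with n = 100, so 7 responses moved from "error" to "no error" and, because every response had an error initially, none can have moved the other way.

```python
from math import comb


def mcnemar_exact(improved: int, worsened: int) -> float:
    """Two-sided exact McNemar p value (hypothetical helper, not from the source).

    improved: responses with the error before verification but not after.
    worsened: responses without the error before verification but with it after.
    Concordant pairs (unchanged responses) do not enter the test.
    """
    n = improved + worsened
    if n == 0:
        # No discordant pairs: there is no evidence of change in either direction.
        return 1.0
    k = min(improved, worsened)
    # Under the null hypothesis each discordant pair is equally likely to fall
    # in either direction, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# NEJM-GPT-4o, KV, reasoning error: 100 -> 93 of 100 responses.
p_value = mcnemar_exact(improved=7, worsened=0)
print(f"{p_value:.3f}")  # 0.016
```

The test depends only on the discordant pairs, which the table does not report directly. They can be recovered from the percentages only in special cases such as this one, where an initial rate of 100.00% forces the "worsened" count to zero.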