Table 6 Initial vs. final diagnostic reasoning hallucinations and errors within four verification strategies across two LLMs and two datasets

Dataset-Model	Strategy	Initial H1 (%)	Final H1 (%)	p value (H1)	Initial H2 (%)	Final H2 (%)	p value (H2)	Initial Reasoning Error (%)	Final Reasoning Error (%)	p value (Reasoning Error)
MedQA-GPT-4o (n = 100)	SV	17.00%	12.00%	0.424	90.00%	44.00%	<0.001^***	83.00%	66.00%	<0.001^***
	KV	19.00%	4.00%	0.001^**	91.00%	44.00%	<0.001^***	91.00%	86.00%	0.267
	DV	8.00%	2.00%	0.109	86.00%	39.00%	<0.001^***	89.00%	77.00%	0.004^**
	HV	11.00%	27.00%	0.002^**	94.00%	63.00%	<0.001^***	93.00%	90.00%	0.375
MedQA-DeepSeek-V3 (n = 100)	SV	21.00%	4.00%	<0.001^***	92.00%	65.00%	<0.001^***	99.00%	89.00%	0.004^**
	KV	3.00%	2.00%	>0.999	96.00%	42.00%	<0.001^***	99.00%	93.00%	0.041^*
	DV	2.00%	1.00%	>0.999	97.00%	69.00%	<0.001^***	99.00%	90.00%	0.004^**
	HV	37.00%	22.00%	0.003^**	87.00%	54.00%	<0.001^***	100.00%	91.00%	0.004^**
NEJM-GPT-4o (n = 100)	SV	22.00%	4.00%	<0.001^***	91.00%	36.00%	<0.001^***	99.00%	93.00%	0.031^*
	KV	4.00%	0.00%	0.125	91.00%	45.00%	<0.001^***	100.00%	93.00%	0.016^*
	DV	13.00%	1.00%	<0.001^***	98.00%	51.00%	<0.001^***	100.00%	87.00%	<0.001^***
	HV	14.00%	7.00%	0.035^*	99.00%	39.00%	<0.001^***	100.00%	96.00%	0.134
NEJM-DeepSeek-V3 (n = 100)	SV	37.00%	6.00%	<0.001^***	96.00%	44.00%	<0.001^***	100.00%	92.00%	0.008^**
	KV	35.00%	1.00%	<0.001^***	98.00%	48.00%	<0.001^***	100.00%	91.00%	0.004^**
	DV	29.00%	1.00%	<0.001^***	96.00%	33.00%	<0.001^***	100.00%	94.00%	0.041^*
	HV	40.00%	7.00%	<0.001^***	93.00%	53.00%	<0.001^***	100.00%	95.00%	0.062

Notes: SV, Stepwise Verification; KV, Knowledge-Augmented Verification; DV, Differential Verification; HV, Hybrid Verification. H1, hallucination type I; H2, hallucination type II. Initial and Final H1(%)/H2(%): proportion of responses exhibiting H1/H2 before and after verification. Initial and Final Reasoning Error(%): proportion of responses with reasoning errors before and after verification. p values are from the McNemar test, comparing initial and final rates within each strategy; *p < 0.05, **p < 0.01, ***p < 0.001.
Bold values indicate statistically significant p values (p < 0.05).
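For readers unfamiliar with the McNemar test named in the notes, the following is a minimal sketch of its exact (binomial) two-sided form for paired before/after outcomes. It is illustrative only: the table does not report discordant-pair counts or state which variant of the test was used (exact or chi-square with continuity correction), so the function name `mcnemar_exact` and the counts in the usage line are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p value from discordant pair counts.

    b: cases with the error before verification but not after (hypothetical input)
    c: cases without the error before verification but with it after
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of change.
        return 1.0
    k = min(b, c)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    # Double the smaller tail and cap at 1.
    return min(1.0, 2.0 * tail)


# Hypothetical counts: 4 responses lost the error, 0 gained it.
print(mcnemar_exact(4, 0))  # 0.125
```

Only the discordant pairs determine the p value, which is why a small absolute change in a rate (for example, four responses out of 100) can fail to reach significance.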
