Table 3 Initial vs. final diagnosis uncertainty for four verification strategies across two LLMs and two medical datasets

Dataset-Model	Strategy	Initial Uncertainty (%)	Final Uncertainty (%)	Initial Median (IQR) %	Final Median (IQR) %	p value
MedQA-GPT-4o (n = 570)	SV	5.51%	2.32%	0%(0-0%)	0%(0-0%)	<0.001^***
	KV	4.35%	2.11%	0%(0-0%)	0%(0-0%)	<0.001^***
	DV	4.74%	0.56%	0%(0-0%)	0%(0-0%)	<0.001^***
	HV	2.77%	1.51%	0%(0-0%)	0%(0-0%)	<0.001^***
MedQA-DeepSeek-V3 (n = 570)	SV	3.75%	2.18%	0%(0-0%)	0%(0-0%)	<0.001^***
	KV	4.21%	3.19%	0%(0-0%)	0%(0-0%)	<0.001^***
	DV	3.05%	1.19%	0%(0-0%)	0%(0-0%)	<0.001^***
	HV	3.02%	2.46%	0%(0-0%)	0%(0-0%)	0.002^**
NEJM-GPT-4o (n = 300)	SV	21.87%	9.60%	0%(0-40%)	0%(0-20%)	<0.001^***
	KV	14.93%	6.47%	0%(0-20%)	0%(0-0%)	<0.001^***
	DV	18.67%	2.67%	0%(0-40%)	0%(0-0%)	<0.001^***
	HV	9.33%	5.13%	0%(0-20%)	0%(0-0%)	<0.001^***
NEJM-DeepSeek-V3 (n = 300)	SV	8.13%	3.40%	0%(0-20%)	0%(0-0%)	<0.001^***
	KV	4.33%	2.13%	0%(0-0%)	0%(0-0%)	<0.001^***
	DV	4.67%	0.87%	0%(0-0%)	0%(0-0%)	<0.001^***
	HV	4.73%	2.87%	0%(0-0%)	0%(0-0%)	<0.001^***

Notes: SV Stepwise Verification, KV Knowledge-Augmented Verification, DV Differential Verification, HV Hybrid Verification. Initial Uncertainty (%): proportion of responses scored 3 before verification. Final Uncertainty (%): proportion of responses scored 3 after verification. Initial Median (IQR) % and Final Median (IQR) %: median (interquartile range) of case-level uncertainty before and after verification. p values are from the Wilcoxon signed-rank test comparing initial and final uncertainty; *p < 0.05, **p < 0.01, ***p < 0.001.
Bold values indicate statistically significant p values (p < 0.05).
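The quantities defined in the notes can be illustrated with a minimal sketch. It assumes each case has several scored responses and that case-level uncertainty is the share of a case's responses scored 3. The toy data (four cases, five responses each), the function names, and the handling of zero differences and ties (zeros dropped, tied ranks averaged) are illustrative assumptions, not details taken from the study. Only the signed-rank sums are computed here; a statistics library routine such as `scipy.stats.wilcoxon` would normally supply the p value.

```python
from statistics import median, quantiles

UNCERTAIN_SCORE = 3  # per the table notes, a response scored 3 counts as uncertain


def case_uncertainty(scores):
    """Percentage of one case's responses that were scored as uncertain."""
    return 100 * sum(s == UNCERTAIN_SCORE for s in scores) / len(scores)


def summarize(cases):
    """Return (overall %, median %, Q1 %, Q3 %) for a list of per-case score lists."""
    per_case = [case_uncertainty(c) for c in cases]
    n_responses = sum(len(c) for c in cases)
    n_uncertain = sum(s == UNCERTAIN_SCORE for c in cases for s in c)
    overall = 100 * n_uncertain / n_responses
    q1, _, q3 = quantiles(per_case, n=4, method="inclusive")
    return overall, median(per_case), q1, q3


def signed_rank_sums(before, after):
    """Wilcoxon signed-rank sums (W+, W-) for paired case-level values.

    Assumed conventions: zero differences are dropped and tied absolute
    differences share the average of the ranks they span.
    """
    diffs = sorted((a - b for b, a in zip(before, after) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical toy data: 4 cases x 5 responses, scores before and after verification.
initial = [[1, 1, 1, 1, 1], [1, 3, 1, 1, 1], [3, 3, 1, 1, 1], [1, 1, 1, 1, 1]]
final = [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [3, 1, 1, 1, 1], [1, 1, 1, 1, 1]]

print("initial:", summarize(initial))
print("final:  ", summarize(final))
print("W+, W-: ", signed_rank_sums(
    [case_uncertainty(c) for c in initial],
    [case_uncertainty(c) for c in final],
))
```

Under this reading, the "Uncertainty (%)" columns pool all responses, while the "Median (IQR)" columns summarize the per-case percentages. That would explain why a median of 0% can coexist with a nonzero overall proportion when most cases have no uncertain responses.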
