Table 4 Initial-to-final diagnosis transitions within four verification strategies across two LLMs and two medical datasets

Dataset-Model	Strategy	3 to 2	3 to 1	3 to 1/2	3 to 0	0 to 2	0 to 1	0 to 1/2
MedQA-GPT-4o (n = 570)	SV	32.48%	14.01%	46.49%	15.29%	2.04%	0.41%	2.45%
	KV	34.68%	3.23%	37.91%	17.74%	0.41%	0.41%	0.82%
	DV	37.78%	12.59%	50.37%	41.48%	3.01%	0.60%	3.61%
	HV	18.99%	7.59%	26.58%	21.52%	2.37%	0.40%	2.77%
MedQA-DeepSeek-V3 (n = 570)	SV	27.10%	13.08%	40.18%	13.08%	2.39%	0.80%	3.19%
	KV	17.50%	8.33%	25.83%	9.17%	3.49%	0.32%	3.81%
	DV	19.54%	11.49%	31.03%	35.63%	0.62%	1.09%	1.71%
	HV	5.81%	4.65%	10.46%	11.63%	2.22%	0.30%	2.52%
NEJM-GPT-4o (n = 300)	SV	9.45%	7.93%	17.38%	46.65%	2.34%	0.47%	2.81%
	KV	12.05%	8.48%	20.53%	43.75%	0.84%	0.42%	1.26%
	DV	16.43%	8.57%	25.00%	63.21%	1.92%	0.59%	2.51%
	HV	8.57%	5.71%	14.28%	38.57%	2.92%	1.02%	3.94%
NEJM-DeepSeek-V3 (n = 300)	SV	14.75%	4.10%	18.85%	45.90%	3.03%	1.79%	4.82%
	KV	15.38%	7.69%	23.07%	40.00%	2.34%	0.91%	3.25%
	DV	15.71%	4.29%	20.00%	64.29%	1.69%	0.39%	2.08%
	HV	7.04%	5.63%	12.67%	38.03%	3.35%	1.03%	4.38%

Notes: SV Stepwise Verification, KV Knowledge-Augmented Verification, DV Differential Verification, HV Hybrid Verification. “3 to 2,” “3 to 1,” “3 to 1/2,” and “3 to 0” indicate the proportion of responses scored 3 (uncertain) in the initial diagnosis that changed to scores 2 (correct), 1 (partially correct), 1 or 2, and 0 (incorrect) in the final diagnosis, respectively. “0 to 2,” “0 to 1,” and “0 to 1/2” indicate the proportion of responses scored 0 in the initial diagnosis that changed to scores 2, 1, and 1 or 2, respectively, in the final diagnosis.
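
As a rough illustration of how the transition proportions defined in the notes could be computed, the sketch below assumes paired lists of initial and final scores per response. The function name `transition_rates` and the sample scores are hypothetical and are not taken from the study. The denominator is the number of responses with the given initial score, so the combined "1/2" column is the sum of the two separate columns (in the MedQA-GPT-4o SV row, 32.48% + 14.01% = 46.49%).

```python
from collections import Counter


def transition_rates(initial, final, start):
    """Return the share of each final score among responses whose
    initial score equals `start` (hypothetical helper for illustration)."""
    finals = [f for i, f in zip(initial, final) if i == start]
    if not finals:
        return {}
    counts = Counter(finals)
    return {score: counts[score] / len(finals) for score in sorted(counts)}


# Made-up scores for illustration only (3 = uncertain, 2 = correct,
# 1 = partially correct, 0 = incorrect); not data from the study.
initial = [3, 3, 3, 3, 0, 0, 2, 1]
final = [2, 1, 0, 3, 0, 2, 2, 1]

from_3 = transition_rates(initial, final, start=3)
# "3 to 1/2" is the sum of the "3 to 2" and "3 to 1" proportions.
three_to_1_or_2 = from_3.get(2, 0) + from_3.get(1, 0)

from_0 = transition_rates(initial, final, start=0)
zero_to_1_or_2 = from_0.get(2, 0) + from_0.get(1, 0)

print(f"3 to 1/2: {three_to_1_or_2:.2%}")
print(f"0 to 1/2: {zero_to_1_or_2:.2%}")
```

Because responses that keep their initial score are still counted in the denominator, the "3 to 1/2" and "3 to 0" proportions in a row need not sum to 100%.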
