Table 6 Initial vs. final diagnostic reasoning hallucinations and errors within four verification strategies across two LLMs and two datasets
| Dataset-Model | Strategy | Initial H1 (%) | Final H1 (%) | p value (H1) | Initial H2 (%) | Final H2 (%) | p value (H2) | Initial Reasoning Error (%) | Final Reasoning Error (%) | p value (Reasoning Error) |
|---|---|---|---|---|---|---|---|---|---|---|
| MedQA-GPT-4o (n = 100) | SV | 17.00% | 12.00% | 0.424 | 90.00% | 44.00% | <0.001*** | 83.00% | 66.00% | <0.001*** |
| | KV | 19.00% | 4.00% | 0.001** | 91.00% | 44.00% | <0.001*** | 91.00% | 86.00% | 0.267 |
| | DV | 8.00% | 2.00% | 0.109 | 86.00% | 39.00% | <0.001*** | 89.00% | 77.00% | 0.004** |
| | HV | 11.00% | 27.00% | 0.002** | 94.00% | 63.00% | <0.001*** | 93.00% | 90.00% | 0.375 |
| MedQA-DeepSeek-V3 (n = 100) | SV | 21.00% | 4.00% | <0.001*** | 92.00% | 65.00% | <0.001*** | 99.00% | 89.00% | 0.004** |
| | KV | 3.00% | 2.00% | >0.999 | 96.00% | 42.00% | <0.001*** | 99.00% | 93.00% | 0.041* |
| | DV | 2.00% | 1.00% | >0.999 | 97.00% | 69.00% | <0.001*** | 99.00% | 90.00% | 0.004** |
| | HV | 37.00% | 22.00% | 0.003** | 87.00% | 54.00% | <0.001*** | 100.00% | 91.00% | 0.004** |
| NEJM-GPT-4o (n = 100) | SV | 22.00% | 4.00% | <0.001*** | 91.00% | 36.00% | <0.001*** | 99.00% | 93.00% | 0.031* |
| | KV | 4.00% | 0.00% | 0.125 | 91.00% | 45.00% | <0.001*** | 100.00% | 93.00% | 0.016* |
| | DV | 13.00% | 1.00% | <0.001*** | 98.00% | 51.00% | <0.001*** | 100.00% | 87.00% | <0.001*** |
| | HV | 14.00% | 7.00% | 0.035* | 99.00% | 39.00% | <0.001*** | 100.00% | 96.00% | 0.134 |
| NEJM-DeepSeek-V3 (n = 100) | SV | 37.00% | 6.00% | <0.001*** | 96.00% | 44.00% | <0.001*** | 100.00% | 92.00% | 0.008** |
| | KV | 35.00% | 1.00% | <0.001*** | 98.00% | 48.00% | <0.001*** | 100.00% | 91.00% | 0.004** |
| | DV | 29.00% | 1.00% | <0.001*** | 96.00% | 33.00% | <0.001*** | 100.00% | 94.00% | 0.041* |
| | HV | 40.00% | 7.00% | <0.001*** | 93.00% | 53.00% | <0.001*** | 100.00% | 95.00% | 0.062 |
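
The table does not say which test produced the p values. As an illustrative sketch only, suppose each of the 100 cases was scored both before and after verification, and that the paired proportions were compared with an exact two-sided McNemar test. The function name `mcnemar_exact` and the discordant-pair counts below are hypothetical; the table reports only marginal percentages, so the per-case counts are assumptions, not reported data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p value from discordant pair counts.

    b: cases with the error initially but not finally (assumed count)
    c: cases without the error initially but with it finally (assumed count)
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of change.
        return 1.0
    k = min(b, c)
    # Two-sided binomial tail under H0: each discordant pair is equally
    # likely to fall in either direction (p = 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical reading of a 4.00% -> 0.00% change with n = 100:
# 4 cases lose the error, none gain it.
print(mcnemar_exact(4, 0))  # 0.125
```

Under these assumptions, four one-directional discordant pairs give p = 0.125, which matches the value shown for a 4.00% to 0.00% change. This agreement is suggestive only and does not confirm the test the authors used.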