Table 3 Initial vs. final diagnosis uncertainty within four verification strategies across two LLMs and two medical datasets
| Dataset-Model | Strategy | Initial Uncertainty (%) | Final Uncertainty (%) | Initial Median (IQR) % | Final Median (IQR) % | p value |
|---|---|---|---|---|---|---|
| MedQA-GPT-4o (n = 570) | SV | 5.51% | 2.32% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | KV | 4.35% | 2.11% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | DV | 4.74% | 0.56% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | HV | 2.77% | 1.51% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| MedQA-DeepSeek-V3 (n = 570) | SV | 3.75% | 2.18% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | KV | 4.21% | 3.19% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | DV | 3.05% | 1.19% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | HV | 3.02% | 2.46% | 0% (0-0%) | 0% (0-0%) | 0.002** |
| NEJM-GPT-4o (n = 300) | SV | 21.87% | 9.60% | 0% (0-40%) | 0% (0-20%) | <0.001*** |
| | KV | 14.93% | 6.47% | 0% (0-20%) | 0% (0-0%) | <0.001*** |
| | DV | 18.67% | 2.67% | 0% (0-40%) | 0% (0-0%) | <0.001*** |
| | HV | 9.33% | 5.13% | 0% (0-20%) | 0% (0-0%) | <0.001*** |
| NEJM-DeepSeek-V3 (n = 300) | SV | 8.13% | 3.40% | 0% (0-20%) | 0% (0-0%) | <0.001*** |
| | KV | 4.33% | 2.13% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | DV | 4.67% | 0.87% | 0% (0-0%) | 0% (0-0%) | <0.001*** |
| | HV | 4.73% | 2.87% | 0% (0-0%) | 0% (0-0%) | <0.001*** |