Table 4 Initial-to-final diagnosis transitions within four verification strategies across two LLMs and two medical datasets
| Dataset-Model | Strategy | 3 To 2 | 3 To 1 | 3 To 1/2 | 3 To 0 | 0 To 2 | 0 To 1 | 0 To 1/2 |
|---|---|---|---|---|---|---|---|---|
| MedQA-GPT-4o (n = 570) | SV | 32.48% | 14.01% | 46.49% | 15.29% | 2.04% | 0.41% | 2.45% |
| | KV | 34.68% | 3.23% | 37.91% | 17.74% | 0.41% | 0.41% | 0.82% |
| | DV | 37.78% | 12.59% | 50.37% | 41.48% | 3.01% | 0.60% | 3.61% |
| | HV | 18.99% | 7.59% | 26.58% | 21.52% | 2.37% | 0.40% | 2.77% |
| MedQA-DeepSeek-V3 (n = 570) | SV | 27.10% | 13.08% | 40.18% | 13.08% | 2.39% | 0.80% | 3.19% |
| | KV | 17.50% | 8.33% | 25.83% | 9.17% | 3.49% | 0.32% | 3.81% |
| | DV | 19.54% | 11.49% | 31.03% | 35.63% | 0.62% | 1.09% | 1.71% |
| | HV | 5.81% | 4.65% | 10.46% | 11.63% | 2.22% | 0.30% | 2.52% |
| NEJM-GPT-4o (n = 300) | SV | 9.45% | 7.93% | 17.38% | 46.65% | 2.34% | 0.47% | 2.81% |
| | KV | 12.05% | 8.48% | 20.53% | 43.75% | 0.84% | 0.42% | 1.26% |
| | DV | 16.43% | 8.57% | 25.00% | 63.21% | 1.92% | 0.59% | 2.51% |
| | HV | 8.57% | 5.71% | 14.28% | 38.57% | 2.92% | 1.02% | 3.94% |
| NEJM-DeepSeek-V3 (n = 300) | SV | 14.75% | 4.10% | 18.85% | 45.90% | 3.03% | 1.79% | 4.82% |
| | KV | 15.38% | 7.69% | 23.07% | 40.00% | 2.34% | 0.91% | 3.25% |
| | DV | 15.71% | 4.29% | 20.00% | 64.29% | 1.69% | 0.39% | 2.08% |
| | HV | 7.04% | 5.63% | 12.67% | 38.03% | 3.35% | 1.03% | 4.38% |