Fig. 1: Accuracy differences (ΔAccuracy) between eight prompting strategies and CoT across four LLMs and two medical datasets.

Note: SV, Stepwise Verification; KV, Knowledge-Augmented Verification; DV, Differential Verification; HV, Hybrid Verification; SC, Self-Consistency; ToT, Tree-of-Thoughts; SR, Self-Refine; CoV, Chain-of-Verification; CoT, Chain-of-Thought. Accuracy (%): percentage of responses scored 1 or 2. ΔAccuracy (%): accuracy of each prompting strategy’s final diagnosis minus the accuracy of CoT within the same model–dataset condition, in percentage points.
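
The two definitions in the note can be illustrated with a minimal sketch. The function names and the score lists below are hypothetical and are not taken from the study; the sketch assumes integer scores and treats any score other than 1 or 2 (here written as 0) as incorrect.

```python
from typing import Sequence


def accuracy_pct(scores: Sequence[int]) -> float:
    """Percentage of responses scored 1 or 2 (the note's definition of Accuracy)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    correct = sum(1 for s in scores if s in (1, 2))
    return 100.0 * correct / len(scores)


def delta_accuracy(strategy_scores: Sequence[int], cot_scores: Sequence[int]) -> float:
    """Strategy accuracy minus CoT accuracy, in percentage points.

    Both score lists must come from the same model-dataset condition.
    """
    return accuracy_pct(strategy_scores) - accuracy_pct(cot_scores)


# Made-up scores for one model-dataset condition (illustration only).
sv_scores = [2, 1, 0, 2, 1]   # 4 of 5 responses scored 1 or 2
cot_scores = [2, 0, 0, 1, 1]  # 3 of 5 responses scored 1 or 2

print(delta_accuracy(sv_scores, cot_scores))  # 20.0
```

A positive value means the strategy outperformed CoT in that condition, and a negative value means it underperformed.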