Table 4. GPT-4 MedQA performance with diagnostic reasoning prompts compared to traditional chain-of-thought (CoT) prompting.
| Prompt | Correct responses (%) | Difference from CoT (95% confidence interval) | p value^a |
|---|---|---|---|
Chain of thought | 76% | – | – |
Intuitive reasoning | 77% | 0.8% (−3.6%, 5.2%) | 0.73 |
Analytic reasoning | 78% | 1.6% (−2.4%, 5.6%) | 0.35 |
Differential diagnosis | 78% | 2.2% (−2.3%, 6.7%) | 0.24 |
Bayesian inference | 72% | −3.4% (−9.1%, 1.2%) | 0.07 |