Table 3. GPT-3.5 performance on MedQA with diagnostic reasoning prompts compared with traditional chain-of-thought (CoT) prompting.
| Prompt | Correct responses (%) | Difference in percentage (confidence interval) | p value<sup>a</sup> |
|---|---|---|---|
| Chain of thought | 46% | – | – |
| Intuitive reasoning | 48% | 1.7% (−2.5%, 5.9%) | 0.4 |
| Analytic reasoning | 40% | −6.0% (−11%, −1.5%) | 0.001 |
| Differential diagnosis | 38% | −8.9% (−14%, −3.4%) | <0.001 |
| Bayesian inference | 42% | −4.4% (−9.1%, 0.2%) | 0.02 |