Table 4 GPT-4 MedQA performance with diagnostic reasoning prompts compared to traditional CoT.

From: Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine

Prompt                    Correct responses (%)   Difference in percentage (confidence interval)   p value^a

Chain of thought          76%
Intuitive reasoning       77%                     0.8% (−3.6%, 5.2%)                                0.73
Analytic reasoning        78%                     1.6% (−2.4%, 5.6%)                                0.35
Differential diagnosis    78%                     2.2% (−2.3%, 6.7%)                                0.24
Bayesian inference        72%                     −3.4% (−9.1%, 1.2%)                               0.07

  1. GPT-4 performance on a free-response MedQA question set with the traditional chain-of-thought prompting strategy and with the clinical reasoning prompts of intuitive reasoning, analytic reasoning, differential diagnosis, and Bayesian inference.
  2. ^a Percentage difference and p value statistics are relative to traditional chain-of-thought prompting.
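
As a rough illustration of the comparison described in footnote a, the sketch below computes a difference in correct-response proportions between two prompts, with a confidence interval and a two-sided p value from a two-proportion test. The table does not state the exact statistical test or the number of questions, so the choice of test and all counts below are assumptions for illustration only, not the paper's analysis.

```python
# Illustrative only: compare correct-response rates of two prompting strategies
# with a two-proportion test. Counts and sample size are hypothetical placeholders.
from statsmodels.stats.proportion import (
    confint_proportions_2indep,
    test_proportions_2indep,
)

n_questions = 500   # hypothetical number of MedQA questions
correct_cot = 380   # hypothetical correct count, chain-of-thought prompt
correct_alt = 385   # hypothetical correct count, diagnostic reasoning prompt

# Confidence interval for the difference in proportions (alternative minus CoT)
ci_low, ci_high = confint_proportions_2indep(
    correct_alt, n_questions, correct_cot, n_questions, compare="diff"
)

# Two-sided test of the difference in proportions
result = test_proportions_2indep(
    correct_alt, n_questions, correct_cot, n_questions, compare="diff"
)

diff = correct_alt / n_questions - correct_cot / n_questions
print(f"difference = {diff:.1%}")
print(f"95% CI = ({ci_low:.1%}, {ci_high:.1%}), p = {result.pvalue:.2f}")
```

Because the same questions are answered under each prompt, a paired test (e.g., McNemar's) could also be appropriate; the independent-proportions version above is used here only to show how a difference, confidence interval, and p value of this form can be computed.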