Table 4 GPT-4 MedQA performance with diagnostic reasoning prompts compared to traditional CoT.

From: Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine

Prompt                    Correct responses (%)   Difference in percentage (confidence interval)   p value^a

Chain of thought          76%
Intuitive reasoning       77%                     0.8% (−3.6%, 5.2%)                                0.73
Analytic reasoning        78%                     1.6% (−2.4%, 5.6%)                                0.35
Differential diagnosis    78%                     2.2% (−2.3%, 6.7%)                                0.24
Bayesian inference        72%                     −3.4% (−9.1%, 1.2%)                               0.07

  1. GPT-4 performance on a free-response MedQA question set with the traditional chain-of-thought prompting strategy and with the clinical reasoning prompts of intuitive reasoning, analytic reasoning, differential diagnosis, and Bayesian inference.
  2. ^a Percentage difference and p value statistics are relative to traditional chain-of-thought prompting.
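
As a rough illustration of the comparison described in footnote a, the sketch below computes a difference in correct-response proportions between two prompts, with a confidence interval and a two-sided p value from a two-proportion test. The table does not state the exact statistical test or the number of questions, so the choice of test and all counts below are assumptions for illustration only, not the paper's analysis.

```python
# Illustrative only: compare correct-response rates of two prompting strategies
# with a two-proportion test. Counts and sample size are hypothetical placeholders.
from statsmodels.stats.proportion import (
    confint_proportions_2indep,
    test_proportions_2indep,
)

n_questions = 500   # hypothetical number of MedQA questions
correct_cot = 380   # hypothetical correct count, chain-of-thought prompt
correct_alt = 385   # hypothetical correct count, diagnostic reasoning prompt

# Confidence interval for the difference in proportions (alternative minus CoT)
ci_low, ci_high = confint_proportions_2indep(
    correct_alt, n_questions, correct_cot, n_questions, compare="diff"
)

# Two-sided test of the difference in proportions
result = test_proportions_2indep(
    correct_alt, n_questions, correct_cot, n_questions, compare="diff"
)

diff = correct_alt / n_questions - correct_cot / n_questions
print(f"difference = {diff:.1%}")
print(f"95% CI = ({ci_low:.1%}, {ci_high:.1%}), p = {result.pvalue:.2f}")
```

Because the same questions are answered under each prompt, a paired test (e.g., McNemar's) could also be appropriate; the independent-proportions version above is used here only to show how a difference, confidence interval, and p value of this form can be computed.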