Table 3 GPT-3.5 MEDQA performance with diagnostic reasoning prompts compared to traditional chain of thought (CoT).

From: Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine

| Prompt | Correct responses (%) | Difference in percentage (confidence interval) | p value^a |
| --- | --- | --- | --- |
| Chain of thought | 46% | (reference) | (reference) |
| Intuitive reasoning | 48% | 1.7% (−2.5%, 5.9%) | 0.4 |
| Analytic reasoning | 40% | −6.0% (−11%, −1.5%) | 0.001 |
| Differential diagnosis | 38% | −8.9% (−14%, −3.4%) | <0.001 |
| Bayesian inference | 42% | −4.4% (−9.1%, 0.2%) | 0.02 |

  1. GPT-3.5 performance on a free-response MEDQA question set with the traditional chain-of-thought prompting strategy and with the clinical reasoning prompts of intuitive reasoning, analytic reasoning, differential diagnosis, and Bayesian inference.
  2. ^a Percentage differences and p values are computed relative to the traditional chain-of-thought prompt.
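
The table does not state how the differences, confidence intervals, and p values were obtained, so the sketch below is only an illustration of one common approach: a two-proportion z-test with a Wald confidence interval for the difference in correct-response rates. The question counts in the example are hypothetical placeholders, and the paper's actual analysis (for example, a paired test over the same question set) may differ.

```python
import math


def two_proportion_comparison(correct_ref, n_ref, correct_alt, n_alt, z_crit=1.96):
    """Compare an alternative prompt against a reference prompt.

    Returns the difference in correct-response proportions (alt - ref),
    a Wald confidence interval for that difference, and a two-sided
    p value from a pooled-variance z-test of equal proportions.
    """
    p_ref, p_alt = correct_ref / n_ref, correct_alt / n_alt
    diff = p_alt - p_ref

    # Wald (normal-approximation) confidence interval for the difference
    se_diff = math.sqrt(p_ref * (1 - p_ref) / n_ref + p_alt * (1 - p_alt) / n_alt)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)

    # Pooled-variance z statistic under the null of equal proportions
    p_pool = (correct_ref + correct_alt) / (n_ref + n_alt)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_ref + 1 / n_alt))
    z = diff / se_pool
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    return diff, ci, p_value


# Hypothetical counts chosen only to mirror the 46% vs 38% comparison above;
# the actual number of questions is not given in this table.
diff, ci, p = two_proportion_comparison(correct_ref=230, n_ref=500,
                                        correct_alt=190, n_alt=500)
print(f"difference = {diff:.1%}, 95% CI = ({ci[0]:.1%}, {ci[1]:.1%}), p = {p:.3f}")
```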