Table 1 Comparison of Med-PaLM 2 results to reported results from GPT-4
From: Toward expert-level medical question answering with large language models
| Dataset | Flan-PaLM (best) | Med-PaLM 2 (ER) | Med-PaLM 2 (best) | GPT-4 (5-shot) | GPT-4-base (5-shot) |
|---|---|---|---|---|---|
| MedQA (USMLE) | 67.6 [65.0, 70.2] | 85.4 [83.3, 87.3] | 86.5 [84.5, 88.3] | 81.4 [79.1, 83.5] | 86.1 [84.1, 88.0] |
| PubMedQA | 79.0 [75.2, 82.5] | 75.0 [71.0, 78.7] | 81.8 [78.1, 85.1] | 75.2 [71.2, 78.9] | 80.4 [76.6, 83.8] |
| MedMCQA | 57.6 [56.1, 59.1] | 72.3 [70.9, 73.6] | 72.3 [70.9, 73.6] | 72.4 [71.0, 73.7] | 73.7 [72.3, 75.0] |
| MMLU Clinical Knowledge | 80.4 [75.1, 85.0] | 88.7 [84.2, 92.2] | 88.7 [84.2, 92.2] | 86.4 [81.7, 90.3] | 88.7 [84.2, 92.2] |
| MMLU Medical Genetics | 75.0 [65.3, 83.1] | 92.0 [84.8, 96.5] | 92.0 [84.8, 96.5] | 92.0 [84.8, 96.5] | 97.0 [91.5, 99.4] |
| MMLU Anatomy | 63.7 [55.0, 71.8] | 84.4 [77.2, 90.1] | 84.4 [77.2, 90.1] | 80.0 [72.3, 86.4] | 85.2 [78.1, 90.7] |
| MMLU Professional Medicine | 83.8 [78.9, 88.0] | 92.3 [88.4, 95.2] | 95.2 [92.0, 97.4] | 93.8 [90.2, 96.3] | 93.8 [90.2, 96.3] |
| MMLU College Biology | 88.9 [82.6, 93.5] | 95.8 [91.2, 98.5] | 95.8 [91.2, 98.5] | 95.1 [90.2, 98.0] | 97.2 [93.0, 99.2] |
| MMLU College Medicine | 76.3 [69.3, 82.4] | 83.2 [76.8, 88.5] | 83.2 [76.8, 88.5] | 76.9 [69.9, 82.9] | 80.9 [74.3, 86.5] |

Values are accuracy (%); brackets show 95% confidence intervals. ER, ensemble refinement.
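The bracketed ranges above are confidence intervals on benchmark accuracy. The paper's exact interval procedure is not restated here; as an illustration only, a nonparametric percentile bootstrap over per-question correct/incorrect outcomes is one standard way to obtain such an interval. All names and the synthetic outcome data below are assumptions for the sketch, not values from the paper:

```python
import random

def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    correct: list of 0/1 outcomes, one per benchmark question.
    Returns (point_estimate, lower, upper) as percentages.
    """
    rng = random.Random(seed)
    n = len(correct)
    point = 100.0 * sum(correct) / n
    stats = []
    for _ in range(n_resamples):
        # Resample n questions with replacement and rescore.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(100.0 * sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper

# Synthetic example: 1,100 outcomes at roughly MedQA-scale accuracy.
outcomes = [1] * 952 + [0] * 148
acc, lo, hi = bootstrap_ci(outcomes)
```

On a test set of ~1,100 questions, an accuracy in the mid-80s yields an interval roughly ±2 percentage points wide, consistent with the widths seen in the MedQA row.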