Table 1 Comparison of Med-PaLM 2 results to reported results from GPT-4

From: Toward expert-level medical question answering with large language models

| Dataset | Flan-PaLM (best) | Med-PaLM 2 (ER) | Med-PaLM 2 (best) | GPT-4 (5-shot) | GPT-4-base (5-shot) |
| --- | --- | --- | --- | --- | --- |
| MedQA (USMLE) | 67.6 [65.0, 70.2] | 85.4 [83.3, 87.3] | 86.5 [84.5, 88.3] | 81.4 [79.1, 83.5] | 86.1 [84.1, 88.0] |
| PubMedQA | 79.0 [75.2, 82.5] | 75.0 [71.0, 78.7] | 81.8 [78.1, 85.1] | 75.2 [71.2, 78.9] | 80.4 [76.6, 83.8] |
| MedMCQA | 57.6 [56.1, 59.1] | 72.3 [70.9, 73.6] | 72.3 [70.9, 73.6] | 72.4 [71.0, 73.7] | 73.7 [72.3, 75.0] |
| MMLU Clinical Knowledge | 80.4 [75.1, 85.0] | 88.7 [84.2, 92.2] | 88.7 [84.2, 92.2] | 86.4 [81.7, 90.3] | 88.7 [84.2, 92.2] |
| MMLU Medical Genetics | 75.0 [65.3, 83.1] | 92.0 [84.8, 96.5] | 92.0 [84.8, 96.5] | 92.0 [84.8, 96.5] | 97.0 [91.5, 99.4] |
| MMLU Anatomy | 63.7 [55.0, 71.8] | 84.4 [77.2, 90.1] | 84.4 [77.2, 90.1] | 80.0 [72.3, 86.4] | 85.2 [78.1, 90.7] |
| MMLU Professional Medicine | 83.8 [78.9, 88.0] | 92.3 [88.4, 95.2] | 95.2 [92.0, 97.4] | 93.8 [90.2, 96.3] | 93.8 [90.2, 96.3] |
| MMLU College Biology | 88.9 [82.6, 93.5] | 95.8 [91.2, 98.5] | 95.8 [91.2, 98.5] | 95.1 [90.2, 98.0] | 97.2 [93.0, 99.2] |
| MMLU College Medicine | 76.3 [69.3, 82.4] | 83.2 [76.8, 88.5] | 83.2 [76.8, 88.5] | 76.9 [69.9, 82.9] | 80.9 [74.3, 86.5] |

  1. Med-PaLM 2 was first announced on 14 March 2023. GPT-4 results were released on 20 March 2023, and GPT-4-base (nonproduction) results were released on 12 April 2023 (ref. 2). We include Flan-PaLM results from December 2022 for comparison (ref. 1). ER stands for ensemble refinement and includes results from prompting strategies only. Best results are across prompting strategies and use the fine-tuned model. Results are reported along with 95% confidence intervals determined by Clopper–Pearson binomial estimates.
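
As an illustration of how a Clopper–Pearson interval of the kind mentioned in the note can be computed, here is a minimal sketch in Python using only the standard library. It is not the authors' code. The function name `clopper_pearson`, the bisection solver, and the example counts (92 correct answers out of 100 questions) are assumptions made for illustration; the table reports percentages only and does not state the number of questions per dataset.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(1.0, total)


def solve_increasing(g, iters=50):
    """Bisection root of a function g that increases in p on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


# Assumed example: 92 correct out of 100 questions (counts are not in the table).
lo, hi = clopper_pearson(92, 100)
print(f"{100 * 92 / 100:.1f} [{100 * lo:.1f}, {100 * hi:.1f}]")
```

Under the assumed counts, the printed interval should land close to the 92.0 [84.8, 96.5] entries in the MMLU Medical Genetics row. Because the bounds depend on the number of questions as well as the accuracy, rows with the same accuracy on differently sized datasets would have intervals of different widths.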