Table 1 Comparison of Med-PaLM 2 results to reported results from GPT-4
From: Toward expert-level medical question answering with large language models
| Dataset | Flan-PaLM (best) | Med-PaLM 2 (ER) | Med-PaLM 2 (best) | GPT-4 (5-shot) | GPT-4-base (5-shot) |
|---|---|---|---|---|---|
| MedQA (USMLE) | 67.6 [65.0, 70.2] | 85.4 [83.3, 87.3] | 86.5 [84.5, 88.3] | 81.4 [79.1, 83.5] | 86.1 [84.1, 88.0] |
| PubMedQA | 79.0 [75.2, 82.5] | 75.0 [71.0, 78.7] | 81.8 [78.1, 85.1] | 75.2 [71.2, 78.9] | 80.4 [76.6, 83.8] |
| MedMCQA | 57.6 [56.1, 59.1] | 72.3 [70.9, 73.6] | 72.3 [70.9, 73.6] | 72.4 [71.0, 73.7] | 73.7 [72.3, 75.0] |
| MMLU Clinical Knowledge | 80.4 [75.1, 85.0] | 88.7 [84.2, 92.2] | 88.7 [84.2, 92.2] | 86.4 [81.7, 90.3] | 88.7 [84.2, 92.2] |
| MMLU Medical Genetics | 75.0 [65.3, 83.1] | 92.0 [84.8, 96.5] | 92.0 [84.8, 96.5] | 92.0 [84.8, 96.5] | 97.0 [91.5, 99.4] |
| MMLU Anatomy | 63.7 [55.0, 71.8] | 84.4 [77.2, 90.1] | 84.4 [77.2, 90.1] | 80.0 [72.3, 86.4] | 85.2 [78.1, 90.7] |
| MMLU Professional Medicine | 83.8 [78.9, 88.0] | 92.3 [88.4, 95.2] | 95.2 [92.0, 97.4] | 93.8 [90.2, 96.3] | 93.8 [90.2, 96.3] |
| MMLU College Biology | 88.9 [82.6, 93.5] | 95.8 [91.2, 98.5] | 95.8 [91.2, 98.5] | 95.1 [90.2, 98.0] | 97.2 [93.0, 99.2] |
| MMLU College Medicine | 76.3 [69.3, 82.4] | 83.2 [76.8, 88.5] | 83.2 [76.8, 88.5] | 76.9 [69.9, 82.9] | 80.9 [74.3, 86.5] |

Values are accuracy (%); brackets show 95% confidence intervals. ER, ensemble refinement.
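The bracketed ranges above are confidence intervals on benchmark accuracy. The paper's exact interval procedure is not restated here; as an illustration only, a nonparametric percentile bootstrap over per-question correct/incorrect outcomes is one standard way to obtain such an interval. All names and the synthetic outcome data below are assumptions for the sketch, not values from the paper:

```python
import random

def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    correct: list of 0/1 outcomes, one per benchmark question.
    Returns (point_estimate, lower, upper) as percentages.
    """
    rng = random.Random(seed)
    n = len(correct)
    point = 100.0 * sum(correct) / n
    stats = []
    for _ in range(n_resamples):
        # Resample n questions with replacement and rescore.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(100.0 * sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper

# Synthetic example: 1,100 outcomes at roughly MedQA-scale accuracy.
outcomes = [1] * 952 + [0] * 148
acc, lo, hi = bootstrap_ci(outcomes)
```

On a test set of ~1,100 questions, an accuracy in the mid-80s yields an interval roughly ±2 percentage points wide, consistent with the widths seen in the MedQA row.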