Table 3 Multiple-choice accuracy evaluation on various English multiple-choice question-answering benchmarks
From: Towards building multilingual language model for medicine
Method | Size | Year | MedQA | MedMCQA | PubMedQA | MMLU-CK | MMLU-MG | MMLU-An | MMLU-PM | MMLU-CB | MMLU-CM | Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---
**Closed-source Models** | | | | | | | | | | | |
GPT-3.5 | - | 2022.11 | 57.7 | 72.7 | 53.8 | 74.7 | 74.0 | 65.9 | 72.8 | 72.9 | 64.7 | 67.69
GPT-4 | - | 2023.03 | 85.8 | 72.3 | 70.0 | 90.2 | 94.0 | 84.4 | 94.5 | 93.8 | 83.2 | 85.36
Flan-PaLM | 540B | 2022.12 | 67.6 | 57.6 | 79.0 | 80.4 | 75.0 | 63.7 | 83.8 | 88.9 | 76.3 | 74.70
MedPaLM 2 | - | 2023.05 | 86.5 | 72.3 | 81.8 | 88.7 | 92.0 | 84.4 | 95.2 | 95.8 | 83.2 | 86.66
**Open-source Models** | | | | | | | | | | | |
MedAlpaca | 7B | 2023.03 | 41.7 | 37.5 | 72.8 | 57.4 | 69.0 | 57.0 | 67.3 | 65.3 | 54.3 | 58.03
PMC-LLaMA | 13B | 2023.09 | 56.4 | 56.0 | 77.9 | - | - | - | - | - | - | -
MEDITRON | 7B | 2023.11 | 57.2 | 59.2 | 74.4 | 64.6 | 59.9 | 49.3 | 55.4 | 53.8 | 44.8 | 57.62
Mistral | 7B | 2023.12 | 50.8 | 48.2 | 75.4 | 68.7 | 71.0 | 55.6 | 68.4 | 68.1 | 59.5 | 62.97
Gemma | 7B | 2024.02 | 47.2 | 49.0 | 76.2 | 69.8 | 70.0 | 59.3 | 66.2 | 79.9 | 60.1 | 64.19
BioMistral | 7B | 2024.02 | 50.6 | 48.1 | 77.5 | 59.9 | 64.0 | 56.5 | 60.4 | 59.0 | 54.7 | 58.97
Llama 3 | 8B | 2024.04 | 60.9 | 50.7 | 73.0 | 72.1 | 76.0 | 63.0 | 77.2 | 79.9 | 64.2 | 68.56
MMed-Llama 3 (Ours) | 8B | - | 65.4 | 63.5 | 80.1 | 71.3 | 85.0 | 69.6 | 77.6 | 74.3 | 66.5 | 72.59

MMLU subset abbreviations: CK Clinical Knowledge, MG Medical Genetics, An Anatomy, PM Professional Medicine, CB College Biology, CM College Medicine.
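Judging from the row values (an inference from this extract, not a definition stated in the table), the Avg. column is the unweighted mean of the nine benchmark accuracies in each row; for example, for GPT-3.5:

$$\text{Avg.} = \tfrac{1}{9}\,(57.7 + 72.7 + 53.8 + 74.7 + 74.0 + 65.9 + 72.8 + 72.9 + 64.7) \approx 67.69$$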