Table 1 Results on medical multiple-choice question answering, reported as accuracy (%)
From: Towards evaluating and building versatile large language models for medicine
| Method¹ | Size | MedQA | MedMCQA | PubMedQA | MMedBench ZH | MMedBench JA | MMedBench FR | MMedBench RU | MMedBench ES | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | | | | | |
| GPT-3.5 | – | 57.7 | 72.7 | 53.8 | 52.3 | 34.6 | 32.5 | 66.4 | 66.1 | 54.5 |
| GPT-4 | – | 85.8 | 72.3 | 75.2 | 75.1 | 72.9 | 56.6 | 83.6 | 85.7 | 75.3 |
| Open-source Models | | | | | | | | | | |
| MEDITRON | 7B | 47.9 | 59.2 | 74.4 | 61.9 | 40.2 | 35.1 | 67.6 | 53.3 | 54.9 |
| InternLM 2 | 7B | – | – | – | 77.6 | 47.7 | 41.0 | 68.4 | 59.6 | – |
| Mistral | 7B | 50.8 | 48.2 | 75.4 | 71.1 | 44.7 | 48.7 | 74.2 | 63.9 | 49.1 |
| Llama 3 | 8B | 60.9 | 50.7 | 73.0 | 78.2 | 48.2 | 50.8 | 71.5 | 64.2 | 62.2 |
| Qwen 1.5 | 7B | 48.9 | 50.2 | 67.8 | – | – | – | – | – | – |
| Med42-v2 | 8B | 62.8 | 62.8 | 75.8 | – | – | – | – | – | – |
| Baichuan 2 | 7B | 32.7 | 41.7 | – | – | – | – | – | – | – |
| MMedIns-Llama 3 | 8B | 63.6 | 57.1 | 78.2 | 78.6 | 54.3 | 46.0 | 72.3 | 61.2 | 63.9 |
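
For most rows, the Avg. column appears to be the unweighted mean of the eight benchmark accuracies (MedQA, MedMCQA, PubMedQA, and the five MMedBench languages). The sketch below recomputes it for the MMedIns-Llama 3 row as an illustration; treating Avg. as a plain arithmetic mean is an assumption about the aggregation, not a statement of the authors' exact procedure.

```python
from statistics import mean

# Assumption: "Avg." is the unweighted mean over the eight benchmark columns.
# Values copied from the MMedIns-Llama 3 (8B) row of Table 1.
scores = {
    "MedQA": 63.6, "MedMCQA": 57.1, "PubMedQA": 78.2,
    "MMedBench-ZH": 78.6, "MMedBench-JA": 54.3, "MMedBench-FR": 46.0,
    "MMedBench-RU": 72.3, "MMedBench-ES": 61.2,
}
print(f"Avg. = {mean(scores.values()):.1f}")  # prints 63.9, matching the table
```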