Table 1 Performance of models on the six QA benchmarks: MedQA, USMLE sample test (USMLE), Medbullets-4 (MB-4), Medbullets-5 (MB-5), MedMCQA, and MMLU-Medical (MMLU)
From: Small language models learn enhanced reasoning skills from medical textbooks
| Model | MedQA | USMLE | MB-4 | MB-5 | MedMCQA | MMLU | Avg. |
|---|---|---|---|---|---|---|---|
| **Commercial LLMs** | | | | | | | |
| o1 | 95.7 | 95.7 | 87.9 | 86.0 | 82.9 | 95.2 | 90.6 |
| o3-mini | 91.8 | 93.5 | 85.7 | 81.2 | 76.3 | 93.0 | 86.9 |
| o1-mini | 90.0 | 91.1 | 80.8 | 79.2 | 71.0 | 91.2 | 83.9 |
| GPT-4o | 83.6 | 89.2 | 76.3 | 66.5 | 63.1 | 86.2 | 77.5 |
| GPT-4 | 81.4 | 86.6 | 68.8 | 63.3 | 72.4 | 87.1 | 76.6 |
| GPT-3.5 (175B) | 53.6 | 58.5 | 51.0 | 47.4 | 51.0 | 67.3 | 54.8 |
| **Open-source & domain-specific SLMs** | | | | | | | |
| Mistral-7B | 43.2 | 40.5 | 38.8 | 32.8 | 40.7 | 51.0 | 41.2 |
| Llama-3-8B | 57.5 | 58.8 | 49.0 | 48.7 | 54.7 | 68.0 | 56.1 |
| MediTron-7B | 50.2 | 44.6 | 51.5 | 45.5 | 57.9 | 56.7 | 51.0 |
| BioMistral-7B | 54.3 | 51.4 | 52.3 | 48.7 | 61.1 | 64.6 | 55.4 |
| Meerkat-7B (Ours) | 71.2 | 70.1 | 60.5 | 52.8 | 61.5 | 70.7 | 64.5 |
| Meerkat-8B (Ours) | 74.2 | 73.8 | 59.7 | 55.2 | 62.7 | 74.3 | 66.7 |
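The Avg. column appears to be the unweighted arithmetic mean of the six benchmark scores, rounded to one decimal place; a minimal sketch checking that reading against two rows from the table (the table itself does not state the aggregation rule, so this is an assumption):

```python
# Sketch: recompute Avg. as the unweighted mean of the six benchmark
# scores, rounded to one decimal. The aggregation rule is assumed,
# not stated in the table.
scores = {
    "o1": [95.7, 95.7, 87.9, 86.0, 82.9, 95.2],          # Avg. 90.6
    "Meerkat-7B": [71.2, 70.1, 60.5, 52.8, 61.5, 70.7],  # Avg. 64.5
}

def row_avg(vals):
    """Unweighted mean over the six benchmarks, one decimal place."""
    return round(sum(vals) / len(vals), 1)

for model, vals in scores.items():
    print(model, row_avg(vals))
# → o1 90.6
# → Meerkat-7B 64.5
```

Both recomputed values match the reported Avg. column, consistent with a simple per-benchmark mean rather than a sample-size-weighted average.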