Table 1 Performance of models on the six QA benchmarks: MedQA, USMLE sample test (USMLE), Medbullets-4 (MB-4), Medbullets-5 (MB-5), MedMCQA, and MMLU-Medical (MMLU)
From: Small language models learn enhanced reasoning skills from medical textbooks
| Model | MedQA | USMLE | MB-4 | MB-5 | MedMCQA | MMLU | Avg. |
|---|---|---|---|---|---|---|---|
| **Commercial LLMs** | | | | | | | |
| o1 | 95.7 | 95.7 | 87.9 | 86.0 | 82.9 | 95.2 | 90.6 |
| o3-mini | 91.8 | 93.5 | 85.7 | 81.2 | 76.3 | 93.0 | 86.9 |
| o1-mini | 90.0 | 91.1 | 80.8 | 79.2 | 71.0 | 91.2 | 83.9 |
| GPT-4o | 83.6 | 89.2 | 76.3 | 66.5 | 63.1 | 86.2 | 77.5 |
| GPT-4 | 81.4 | 86.6 | 68.8 | 63.3 | 72.4 | 87.1 | 76.6 |
| GPT-3.5 (175B) | 53.6 | 58.5 | 51.0 | 47.4 | 51.0 | 67.3 | 54.8 |
| **Open-source & domain-specific SLMs** | | | | | | | |
| Mistral-7B | 43.2 | 40.5 | 38.8 | 32.8 | 40.7 | 51.0 | 41.2 |
| Llama-3-8B | 57.5 | 58.8 | 49.0 | 48.7 | 54.7 | 68.0 | 56.1 |
| MediTron-7B | 50.2 | 44.6 | 51.5 | 45.5 | 57.9 | 56.7 | 51.0 |
| BioMistral-7B | 54.3 | 51.4 | 52.3 | 48.7 | 61.1 | 64.6 | 55.4 |
| Meerkat-7B (Ours) | 71.2 | 70.1 | 60.5 | 52.8 | 61.5 | 70.7 | 64.5 |
| Meerkat-8B (Ours) | 74.2 | 73.8 | 59.7 | 55.2 | 62.7 | 74.3 | 66.7 |
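The Avg. column appears to be the unweighted arithmetic mean of the six benchmark scores, rounded to one decimal place; a minimal sketch checking that reading against two rows from the table (the table itself does not state the aggregation rule, so this is an assumption):

```python
# Sketch: recompute Avg. as the unweighted mean of the six benchmark
# scores, rounded to one decimal. The aggregation rule is assumed,
# not stated in the table.
scores = {
    "o1": [95.7, 95.7, 87.9, 86.0, 82.9, 95.2],          # Avg. 90.6
    "Meerkat-7B": [71.2, 70.1, 60.5, 52.8, 61.5, 70.7],  # Avg. 64.5
}

def row_avg(vals):
    """Unweighted mean over the six benchmarks, one decimal place."""
    return round(sum(vals) / len(vals), 1)

for model, vals in scores.items():
    print(model, row_avg(vals))
# → o1 90.6
# → Meerkat-7B 64.5
```

Both recomputed values match the reported Avg. column, consistent with a simple per-benchmark mean rather than a sample-size-weighted average.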