Table 1 Multiple-choice accuracy (%) evaluation on MMedBench
From: Towards building multilingual language model for medicine
Method | Size | Year | MMedC (pretraining) | MMedBench (fine-tuning) | English | Chinese | Japanese | French | Russian | Spanish | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 (5-shot, CoT) | - | 2023.3 | ✗ | ✗ | 90.20 | 81.00 | 76.38 | 55.14 | 85.10 | 88.80 | 79.43 |
Zero-shot Evaluation | |||||||||||
GPT-3.5 | - | 2022.12 | ✗ | ✗ | 56.88 | 52.29 | 34.63 | 32.48 | 66.36 | 66.06 | 51.47 |
GPT-4 | - | 2023.3 | ✗ | ✗ | 78.00 | 75.07 | 72.91 | 56.59 | 83.62 | 85.67 | 74.27 |
Gemini-1.0 pro | - | 2024.1 | ✗ | ✗ | 53.73 | 60.19 | 44.22 | 29.90 | 73.44 | 69.69 | 55.20 |
Parameter-efficient Fine-tuning (PEFT) Evaluation | |||||||||||
BLOOMZ | 7B | 2023.5 | ✗ | trainset | 38.88 | 48.86 | 17.59 | 18.65 | 53.91 | 44.78 | 37.11 |
InternLM | 7B | 2023.7 | ✗ | trainset | 40.93 | 52.19 | 27.14 | 18.81 | 46.88 | 40.34 | 37.71 |
Llama 2 | 7B | 2023.7 | ✗ | trainset | 37.00 | 37.13 | 24.12 | 19.13 | 63.67 | 42.89 | 37.32 |
ChatDoctor | 7B | 2023.3 | ✗ | trainset | 36.68 | 34.06 | 28.14 | 11.58 | 60.55 | 39.86 | 35.15 |
MedAlpaca | 7B | 2023.4 | ✗ | trainset | 43.28 | 36.81 | 27.14 | 16.40 | 51.95 | 41.72 | 36.22 |
PMC-LLaMA | 7B | 2023.4 | ✗ | trainset | 33.62 | 31.76 | 20.60 | 10.13 | 57.81 | 37.89 | 31.97 |
Mistral | 7B | 2023.10 | ✗ | trainset | 55.38 | 50.23 | 37.69 | 40.19 | 71.88 | 61.60 | 52.83 |
MEDITRON | 7B | 2023.11 | ✗ | trainset | 34.88 | 33.22 | 21.11 | 9.65 | 57.42 | 40.74 | 32.84 |
InternLM 2 | 7B | 2024.2 | ✗ | trainset | 52.40 | 68.18 | 39.20 | 28.78 | 63.67 | 55.25 | 51.25 |
BioMistral | 7B | 2024.2 | ✗ | trainset | 49.41 | 44.51 | 29.15 | 33.60 | 67.97 | 54.45 | 46.51 |
Llama 3 | 8B | 2024.4 | ✗ | trainset | 62.84 | 70.11 | 41.21 | 39.55 | 64.84 | 61.52 | 56.68 |
MMedLM (Ours) | 7B | - | ✓ | trainset | 41.16 | 52.22 | 27.14 | 18.49 | 47.66 | 40.34 | 37.83 |
MMedLM 2 (Ours) | 7B | - | ✓ | trainset | 58.13 | 70.43 | 54.27 | 38.26 | 71.88 | 64.95 | 59.65 |
MMed-Llama 3 (Ours) | 8B | - | ✓ | trainset | 63.08 | 69.41 | 55.78 | 41.64 | 71.48 | 66.96 | 61.39 |
Full Fine-tuning Evaluation | |||||||||||
BLOOMZ | 7B | 2023.5 | ✗ | trainset | 43.28 | 58.06 | 32.66 | 26.37 | 62.89 | 47.34 | 45.10 |
InternLM | 7B | 2023.7 | ✗ | trainset | 44.07 | 64.62 | 37.19 | 24.92 | 58.20 | 44.97 | 45.67 |
Llama 2 | 7B | 2023.7 | ✗ | trainset | 43.36 | 50.29 | 25.13 | 20.90 | 66.80 | 47.10 | 42.26 |
MedAlpaca | 7B | 2023.4 | ✗ | trainset | 46.74 | 44.80 | 29.64 | 21.06 | 59.38 | 45.00 | 41.11 |
ChatDoctor | 7B | 2023.3 | ✗ | trainset | 43.52 | 43.26 | 25.63 | 18.81 | 62.50 | 43.44 | 39.53 |
PMC-LLaMA | 7B | 2023.4 | ✗ | trainset | 47.53 | 42.44 | 24.12 | 20.74 | 62.11 | 43.29 | 40.04 |
Mistral | 7B | 2023.10 | ✗ | trainset | 61.74 | 71.10 | 44.72 | 48.71 | 74.22 | 63.86 | 60.73 |
MEDITRON | 7B | 2023.11 | ✗ | trainset | 55.46 | 61.88 | 40.20 | 35.05 | 67.58 | 53.28 | 52.24 |
InternLM 2 | 7B | 2024.2 | ✗ | trainset | 57.27 | 77.55 | 47.74 | 41.00 | 68.36 | 59.59 | 58.59 |
BioMistral | 7B | 2024.2 | ✗ | trainset | 57.82 | 71.54 | 37.19 | 47.27 | 69.92 | 60.98 | 57.45 |
Llama 3 | 8B | 2024.4 | ✗ | trainset | 63.86 | 78.23 | 48.24 | 50.80 | 71.48 | 64.15 | 62.79 |
MMedLM (Ours) | 7B | - | ✓ | trainset | 49.88 | 70.49 | 46.23 | 36.66 | 72.27 | 54.52 | 55.01 |
MMedLM 2 (Ours) | 7B | - | ✓ | trainset | 61.74 | 80.01 | 61.81 | 52.09 | 80.47 | 67.65 | 67.30 |
MMed-Llama 3 (Ours) | 8B | - | ✓ | trainset | 66.06 | 79.25 | 61.81 | 55.63 | 75.39 | 68.38 | 67.75 |
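
For reference, the Avg. column is consistent with an unweighted mean of the six per-language accuracies, rounded to two decimals. Below is a minimal sketch under that assumption, using the full fine-tuning MMed-Llama 3 row from the table as a check:

```python
# Assumption: "Avg." is the unweighted mean of the six per-language accuracies,
# rounded to two decimal places. Values are copied from the MMed-Llama 3 row
# in the full fine-tuning section of the table.
scores = {
    "English": 66.06,
    "Chinese": 79.25,
    "Japanese": 61.81,
    "French": 55.63,
    "Russian": 75.39,
    "Spanish": 68.38,
}

avg = round(sum(scores.values()) / len(scores), 2)
print(avg)  # 67.75, matching the reported Avg. for this row
```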