Table 1 Multiple-choice accuracy (%) evaluation on MMedBench
From: Towards building multilingual language model for medicine
Method | Size | Year | MMedC (pretraining) | MMedBench (fine-tuning) | English | Chinese | Japanese | French | Russian | Spanish | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 (5-shot, CoT) | - | 2023.3 | ✗ | ✗ | 90.20 | 81.00 | 76.38 | 55.14 | 85.10 | 88.80 | 79.43 |
Zero-shot Evaluation | |||||||||||
GPT-3.5 | - | 2022.12 | ✗ | ✗ | 56.88 | 52.29 | 34.63 | 32.48 | 66.36 | 66.06 | 51.47 |
GPT-4 | - | 2023.3 | ✗ | ✗ | 78.00 | 75.07 | 72.91 | 56.59 | 83.62 | 85.67 | 74.27 |
Gemini-1.0 pro | - | 2024.1 | ✗ | ✗ | 53.73 | 60.19 | 44.22 | 29.90 | 73.44 | 69.69 | 55.20 |
Parameter-efficient Fine-tuning (PEFT) Evaluation | |||||||||||
BLOOMZ | 7B | 2023.5 | ✗ | trainset | 38.88 | 48.86 | 17.59 | 18.65 | 53.91 | 44.78 | 37.11 |
InternLM | 7B | 2023.7 | ✗ | trainset | 40.93 | 52.19 | 27.14 | 18.81 | 46.88 | 40.34 | 37.71 |
Llama 2 | 7B | 2023.7 | ✗ | trainset | 37.00 | 37.13 | 24.12 | 19.13 | 63.67 | 42.89 | 37.32 |
ChatDoctor | 7B | 2023.3 | ✗ | trainset | 36.68 | 34.06 | 28.14 | 11.58 | 60.55 | 39.86 | 35.15 |
MedAlpaca | 7B | 2023.4 | ✗ | trainset | 43.28 | 36.81 | 27.14 | 16.40 | 51.95 | 41.72 | 36.22 |
PMC-LLaMA | 7B | 2023.4 | ✗ | trainset | 33.62 | 31.76 | 20.60 | 10.13 | 57.81 | 37.89 | 31.97 |
Mistral | 7B | 2023.10 | ✗ | trainset | 55.38 | 50.23 | 37.69 | 40.19 | 71.88 | 61.60 | 52.83 |
MEDITRON | 7B | 2023.11 | ✗ | trainset | 34.88 | 33.22 | 21.11 | 9.65 | 57.42 | 40.74 | 32.84 |
InternLM 2 | 7B | 2024.2 | ✗ | trainset | 52.40 | 68.18 | 39.20 | 28.78 | 63.67 | 55.25 | 51.25 |
BioMistral | 7B | 2024.2 | ✗ | trainset | 49.41 | 44.51 | 29.15 | 33.60 | 67.97 | 54.45 | 46.51 |
Llama 3 | 8B | 2024.4 | ✗ | trainset | 62.84 | 70.11 | 41.21 | 39.55 | 64.84 | 61.52 | 56.68 |
MMedLM (Ours) | 7B | - | ✓ | trainset | 41.16 | 52.22 | 27.14 | 18.49 | 47.66 | 40.34 | 37.83 |
MMedLM 2 (Ours) | 7B | - | ✓ | trainset | 58.13 | 70.43 | 54.27 | 38.26 | 71.88 | 64.95 | 59.65 |
MMed-Llama 3 (Ours) | 8B | - | ✓ | trainset | 63.08 | 69.41 | 55.78 | 41.64 | 71.48 | 66.96 | 61.39 |
Full Fine-tuning Evaluation | |||||||||||
BLOOMZ | 7B | 2023.5 | ✗ | trainset | 43.28 | 58.06 | 32.66 | 26.37 | 62.89 | 47.34 | 45.10 |
InternLM | 7B | 2023.7 | ✗ | trainset | 44.07 | 64.62 | 37.19 | 24.92 | 58.20 | 44.97 | 45.67 |
Llama 2 | 7B | 2023.7 | ✗ | trainset | 43.36 | 50.29 | 25.13 | 20.90 | 66.80 | 47.10 | 42.26 |
MedAlpaca | 7B | 2023.4 | ✗ | trainset | 46.74 | 44.80 | 29.64 | 21.06 | 59.38 | 45.00 | 41.11 |
ChatDoctor | 7B | 2023.3 | ✗ | trainset | 43.52 | 43.26 | 25.63 | 18.81 | 62.50 | 43.44 | 39.53 |
PMC-LLaMA | 7B | 2023.4 | ✗ | trainset | 47.53 | 42.44 | 24.12 | 20.74 | 62.11 | 43.29 | 40.04 |
Mistral | 7B | 2023.10 | ✗ | trainset | 61.74 | 71.10 | 44.72 | 48.71 | 74.22 | 63.86 | 60.73 |
MEDITRON | 7B | 2023.11 | ✗ | trainset | 55.46 | 61.88 | 40.20 | 35.05 | 67.58 | 53.28 | 52.24 |
InternLM 2 | 7B | 2024.2 | ✗ | trainset | 57.27 | 77.55 | 47.74 | 41.00 | 68.36 | 59.59 | 58.59 |
BioMistral | 7B | 2024.2 | ✗ | trainset | 57.82 | 71.54 | 37.19 | 47.27 | 69.92 | 60.98 | 57.45 |
Llama 3 | 8B | 2024.4 | ✗ | trainset | 63.86 | 78.23 | 48.24 | 50.80 | 71.48 | 64.15 | 62.79 |
MMedLM (Ours) | 7B | - | ✓ | trainset | 49.88 | 70.49 | 46.23 | 36.66 | 72.27 | 54.52 | 55.01 |
MMedLM 2 (Ours) | 7B | - | ✓ | trainset | 61.74 | 80.01 | 61.81 | 52.09 | 80.47 | 67.65 | 67.30 |
MMed-Llama 3 (Ours) | 8B | - | ✓ | trainset | 66.06 | 79.25 | 61.81 | 55.63 | 75.39 | 68.38 | 67.75 |
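
For reference, the Avg. column is consistent with an unweighted mean of the six per-language accuracies, rounded to two decimals. Below is a minimal sketch under that assumption, using the full fine-tuning MMed-Llama 3 row from the table as a check:

```python
# Assumption: "Avg." is the unweighted mean of the six per-language accuracies,
# rounded to two decimal places. Values are copied from the MMed-Llama 3 row
# in the full fine-tuning section of the table.
scores = {
    "English": 66.06,
    "Chinese": 79.25,
    "Japanese": 61.81,
    "French": 55.63,
    "Russian": 75.39,
    "Spanish": 68.38,
}

avg = round(sum(scores.values()) / len(scores), 2)
print(avg)  # 67.75, matching the reported Avg. for this row
```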