Table 5 Results on rationale generation, reported as BLEU/ROUGE scores
From: Towards evaluating and building versatile large language models for medicine
Columns Chinese–Spanish are the MMedBench language subsets; each cell reports BLEU/ROUGE for the generated rationales, and Avg. is the mean over the six languages.

| Method | Size | Chinese | English | French | Japanese | Russian | Spanish | Avg. |
|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | |
| Claude-3.5 | - | 44.64/34.63 | 47.07/38.67 | 48.93/41.23 | 49.22/39.15 | 38.90/28.17 | 48.80/39.99 | 46.26/36.97 |
| **Open-source Models** | | | | | | | | |
| MEDITRON | 7B | 20.39/21.79 | 38.42/31.24 | 34.43/29.33 | 18.89/24.98 | 24.32/16.77 | 37.64/31.01 | 29.01/25.86 |
| InternLM 2 | 7B | 35.23/30.77 | 44.12/37.39 | 36.10/33.65 | 29.13/33.15 | 27.43/20.99 | 41.87/36.30 | 35.65/32.04 |
| Mistral | 7B | 35.53/28.91 | 47.20/37.88 | 39.53/35.64 | 29.16/28.96 | 32.15/23.99 | 45.27/38.33 | 38.14/32.28 |
| Llama 3 | 8B | 28.51/23.30 | 44.10/39.26 | 24.92/22.24 | 13.46/15.04 | 31.16/22.85 | 32.37/27.70 | 29.09/25.06 |
| Qwen 2 | 7B | 41.53/29.89 | 43.67/34.22 | 30.39/27.72 | 46.78/33.54 | 24.89/22.15 | 40.09/36.38 | 37.89/30.65 |
| Med42-v2 | 8B | 19.42/17.55 | 47.22/39.45 | 32.01/26.71 | 10.85/11.52 | 26.87/20.35 | 32.00/24.58 | 28.06/23.36 |
| Baichuan 2 | 7B | 32.09/26.70 | 39.52/32.09 | 17.74/17.57 | 14.63/13.52 | 18.38/15.06 | 31.85/28.12 | 25.70/22.18 |
| MMedIns-Llama 3 | 8B | 50.27/34.01 | 49.08/38.19 | 46.93/38.73 | 51.74/35.19 | 35.27/23.81 | 48.15/37.35 | 46.90/34.54 |
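For readers unfamiliar with how rationale quality is scored, the sketch below shows one way to compute BLEU and ROUGE for a single generated rationale against a reference rationale, using the `sacrebleu` and `rouge_score` packages. The exact metric variants, tokenization (particularly for Chinese and Japanese), and aggregation behind the numbers in this table are not described in this excerpt, so this is an illustrative sketch rather than the paper's evaluation pipeline.

```python
# Minimal sketch: BLEU and ROUGE-1 F1 for one model-generated rationale.
# Assumptions (not from the paper): sentence-level BLEU via sacrebleu,
# ROUGE-1 F1 via rouge_score, default English tokenization, 0-100 scale.
import sacrebleu
from rouge_score import rouge_scorer


def score_rationale(hypothesis: str, reference: str) -> tuple[float, float]:
    """Return (BLEU, ROUGE-1 F1) for a generated rationale vs. a reference."""
    # sacrebleu's sentence_bleu takes the hypothesis and a list of references
    # and returns a BLEUScore object whose .score is already on a 0-100 scale.
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score

    # rouge_score returns precision/recall/F1; scale F1 to 0-100 for comparability.
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    rouge1_f = scorer.score(reference, hypothesis)["rouge1"].fmeasure * 100
    return bleu, rouge1_f


if __name__ == "__main__":
    # Hypothetical example texts, for illustration only.
    hyp = "Low ferritin and microcytic indices point to iron-deficiency anemia."
    ref = "The low ferritin with microcytic anemia indicates iron deficiency."
    print(score_rationale(hyp, ref))
```

In practice such per-sample scores would be averaged over each MMedBench language subset to obtain table entries of the BLEU/ROUGE form shown above.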