Table 5 Results on rationale generation, reported as BLEU/ROUGE scores

From: Towards evaluating and building versatile large language models for medicine

Each cell reports BLEU/ROUGE on the corresponding MMedBench language subset.

| Method | Size | Chinese | English | French | Japanese | Russian | Spanish | Avg. |
|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | |
| Claude-3.5 | – | 44.64/34.63 | 47.07/38.67 | 48.93/41.23 | 49.22/39.15 | 38.90/28.17 | 48.80/39.99 | 46.26/36.97 |
| **Open-source Models** | | | | | | | | |
| MEDITRON | 7B | 20.39/21.79 | 38.42/31.24 | 34.43/29.33 | 18.89/24.98 | 24.32/16.77 | 37.64/31.01 | 29.01/25.86 |
| InternLM 2 | 7B | 35.23/30.77 | 44.12/37.39 | 36.10/33.65 | 29.13/33.15 | 27.43/20.99 | 41.87/36.30 | 35.65/32.04 |
| Mistral | 7B | 35.53/28.91 | 47.20/37.88 | 39.53/35.64 | 29.16/28.96 | 32.15/23.99 | 45.27/38.33 | 38.14/32.28 |
| Llama 3 | 8B | 28.51/23.30 | 44.10/39.26 | 24.92/22.24 | 13.46/15.04 | 31.16/22.85 | 32.37/27.70 | 29.09/25.06 |
| Qwen 2 | 7B | 41.53/29.89 | 43.67/34.22 | 30.39/27.72 | 46.78/33.54 | 24.89/22.15 | 40.09/36.38 | 37.89/30.65 |
| Med42-v2 | 8B | 19.42/17.55 | 47.22/39.45 | 32.01/26.71 | 10.85/11.52 | 26.87/20.35 | 32.00/24.58 | 28.06/23.36 |
| Baichuan 2 | 7B | 32.09/26.70 | 39.52/32.09 | 17.74/17.57 | 14.63/13.52 | 18.38/15.06 | 31.85/28.12 | 25.70/22.18 |
| MMedIns-Llama 3 | 8B | 50.27/34.01 | 49.08/38.19 | 46.93/38.73 | 51.74/35.19 | 35.27/23.81 | 48.15/37.35 | 46.90/34.54 |

  1. Note that results for GPT-4 are not included, since the original evaluation sets were generated with it, which could introduce an unfair comparison bias. Bold indicates the best results.
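For readers unfamiliar with the metric pair used above, the sketch below shows one way a BLEU/ROUGE value could be computed for a single generated rationale against a reference rationale. This is not the authors' evaluation pipeline: it assumes the `sacrebleu` and `rouge-score` Python packages, uses ROUGE-1 F1 as the ROUGE variant, and the example strings are invented; the paper's exact tokenization, ROUGE variant, and test-set aggregation may differ.

```python
# Illustrative sketch only -- assumed tooling (sacrebleu, rouge-score), not the paper's pipeline.
import sacrebleu
from rouge_score import rouge_scorer


def bleu_rouge(prediction: str, reference: str) -> tuple[float, float]:
    """Return a (BLEU, ROUGE-1 F1) pair for one prediction/reference, both on a 0-100 scale."""
    # Sentence-level BLEU against a single reference (sacrebleu already reports 0-100).
    bleu = sacrebleu.sentence_bleu(prediction, [reference]).score
    # ROUGE-1 F1, scaled to 0-100 to match the table's convention.
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    rouge = scorer.score(reference, prediction)["rouge1"].fmeasure * 100
    return bleu, rouge


# Toy example with made-up rationale strings.
pred = "Aspirin irreversibly inhibits COX-1, reducing platelet aggregation."
ref = "By irreversibly inhibiting COX-1, aspirin reduces platelet aggregation."
print("BLEU/ROUGE: %.2f/%.2f" % bleu_rouge(pred, ref))
```

In a full evaluation, such per-sample scores would typically be averaged over each language's MMedBench test split to produce cells like those in the table.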