Table 1 Multiple-choice accuracy evaluation on MMedBench

From: Towards building multilingual language model for medicine

| Method | Size | Year | MMedC | MMedBench | English | Chinese | Japanese | French | Russian | Spanish | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4 (5-shot, CoT) | - | 2023.3 |  |  | 90.20 | 81.00 | 76.38 | 55.14 | 85.10 | 88.80 | 79.43 |
| *Zero-shot Evaluation* |  |  |  |  |  |  |  |  |  |  |  |
| GPT-3.5 | - | 2022.12 |  |  | 56.88 | 52.29 | 34.63 | 32.48 | 66.36 | 66.06 | 51.47 |
| GPT-4 | - | 2023.3 |  |  | **78.00** | **75.07** | **72.91** | **56.59** | **83.62** | **85.67** | **74.27** |
| Gemini-1.0 pro | - | 2024.1 |  |  | 53.73 | 60.19 | 44.22 | 29.90 | 73.44 | 69.69 | 55.20 |
| *Parameter-efficient Fine-tuning (PEFT) Evaluation* |  |  |  |  |  |  |  |  |  |  |  |
| BLOOMZ | 7B | 2023.5 |  | trainset | 38.88 | 48.86 | 17.59 | 18.65 | 53.91 | 44.78 | 37.11 |
| InternLM | 7B | 2023.7 |  | trainset | 40.93 | 52.19 | 27.14 | 18.81 | 46.88 | 40.34 | 37.71 |
| Llama 2 | 7B | 2023.7 |  | trainset | 37.00 | 37.13 | 24.12 | 19.13 | 63.67 | 42.89 | 37.32 |
| ChatDoctor | 7B | 2023.3 |  | trainset | 36.68 | 34.06 | 28.14 | 11.58 | 60.55 | 39.86 | 35.15 |
| MedAlpaca | 7B | 2023.4 |  | trainset | 43.28 | 36.81 | 27.14 | 16.40 | 51.95 | 41.72 | 36.22 |
| PMC-LLaMA | 7B | 2023.4 |  | trainset | 33.62 | 31.76 | 20.60 | 10.13 | 57.81 | 37.89 | 31.97 |
| Mistral | 7B | 2023.10 |  | trainset | 55.38 | 50.23 | 37.69 | 40.19 | **71.88** | 61.60 | 52.83 |
| MEDITRON | 7B | 2023.11 |  | trainset | 34.88 | 33.22 | 21.11 | 9.65 | 57.42 | 40.74 | 32.84 |
| InternLM 2 | 7B | 2024.2 |  | trainset | 52.40 | 68.18 | 39.20 | 28.78 | 63.67 | 55.25 | 51.25 |
| BioMistral | 7B | 2024.2 |  | trainset | 49.41 | 44.51 | 29.15 | 33.60 | 67.97 | 54.45 | 46.51 |
| Llama 3 | 8B | 2024.4 |  | trainset | 62.84 | 70.11 | 41.21 | 39.55 | 64.84 | 61.52 | 56.68 |
| MMedLM (Ours) | 7B | - | ✓ | trainset | 41.16 | 52.22 | 27.14 | 18.49 | 47.66 | 40.34 | 37.83 |
| MMedLM 2 (Ours) | 7B | - | ✓ | trainset | 58.13 | **70.43** | 54.27 | 38.26 | **71.88** | 64.95 | 59.65 |
| MMed-Llama 3 (Ours) | 8B | - | ✓ | trainset | **63.08** | 69.41 | **55.78** | **41.64** | 71.48 | **66.96** | **61.39** |
| *Full Fine-tuning Evaluation* |  |  |  |  |  |  |  |  |  |  |  |
| BLOOMZ | 7B | 2023.5 |  | trainset | 43.28 | 58.06 | 32.66 | 26.37 | 62.89 | 47.34 | 45.10 |
| InternLM | 7B | 2023.7 |  | trainset | 44.07 | 64.62 | 37.19 | 24.92 | 58.20 | 44.97 | 45.67 |
| Llama 2 | 7B | 2023.7 |  | trainset | 43.36 | 50.29 | 25.13 | 20.90 | 66.80 | 47.10 | 42.26 |
| MedAlpaca | 7B | 2023.4 |  | trainset | 46.74 | 44.80 | 29.64 | 21.06 | 59.38 | 45.00 | 41.11 |
| ChatDoctor | 7B | 2023.3 |  | trainset | 43.52 | 43.26 | 25.63 | 18.81 | 62.50 | 43.44 | 39.53 |
| PMC-LLaMA | 7B | 2023.4 |  | trainset | 47.53 | 42.44 | 24.12 | 20.74 | 62.11 | 43.29 | 40.04 |
| Mistral | 7B | 2023.10 |  | trainset | 61.74 | 71.10 | 44.72 | 48.71 | 74.22 | 63.86 | 60.73 |
| MEDITRON | 7B | 2023.11 |  | trainset | 55.46 | 61.88 | 40.20 | 35.05 | 67.58 | 53.28 | 52.24 |
| InternLM 2 | 7B | 2024.2 |  | trainset | 57.27 | 77.55 | 47.74 | 41.00 | 68.36 | 59.59 | 58.59 |
| BioMistral | 7B | 2024.2 |  | trainset | 57.82 | 71.54 | 37.19 | 47.27 | 69.92 | 60.98 | 57.45 |
| Llama 3 | 8B | 2024.4 |  | trainset | 63.86 | 78.23 | 48.24 | 50.80 | 71.48 | 64.15 | 62.79 |
| MMedLM (Ours) | 7B | - | ✓ | trainset | 49.88 | 70.49 | 46.23 | 36.66 | 72.27 | 54.52 | 55.01 |
| MMedLM 2 (Ours) | 7B | - | ✓ | trainset | 61.70 | **80.01** | **61.81** | 52.09 | **80.47** | 67.65 | 67.30 |
| MMed-Llama 3 (Ours) | 8B | - | ✓ | trainset | **66.06** | 79.25 | **61.81** | **55.63** | 75.39 | **68.38** | **67.75** |

  1. We report each model’s accuracy on each language separately, with “Avg.” denoting the mean score over the six languages under the zero-shot, PEFT, and full fine-tuning settings. The table also lists each model’s size, release date, and whether it was evaluated after further training on our MMedC corpus or on the MMedBench training set. The best results under each setting are in bold.
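The “Avg.” column can be reproduced directly from the six per-language scores. Below is a minimal Python sketch of that unweighted mean; the variable names are illustrative, and the hard-coded values are simply copied from the MMed-Llama 3 (8B, full fine-tuning) row above.

```python
# Minimal sketch: "Avg." is the unweighted mean of the six per-language accuracies.
# Example values are copied from the MMed-Llama 3 (8B, full fine-tuning) row of Table 1.
scores = {
    "English": 66.06, "Chinese": 79.25, "Japanese": 61.81,
    "French": 55.63, "Russian": 75.39, "Spanish": 68.38,
}

avg = sum(scores.values()) / len(scores)
print(f"Avg. = {avg:.2f}")  # prints 67.75, matching the reported value
```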