Table 3 Accuracy evaluation on various English multiple-choice question-answering benchmarks

From: Towards building multilingual language model for medicine

| Method | Size | Year | MedQA | MedMCQA | PubMedQA | MMLU-CK | MMLU-MG | MMLU-An | MMLU-PM | MMLU-CB | MMLU-CM | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | | | |
| GPT-3.5 | - | 2022.11 | 57.7 | **72.7** | 53.8 | 74.7 | 74.0 | 65.9 | 72.8 | 72.9 | 64.7 | 67.69 |
| GPT-4 | - | 2023.03 | 85.8 | 72.3 | 70.0 | **90.2** | **94.0** | **84.4** | 94.5 | 93.8 | **83.2** | 85.36 |
| Flan-PaLM | 540B | 2022.12 | 67.6 | 57.6 | 79.0 | 80.4 | 75.0 | 63.7 | 83.8 | 88.9 | 76.3 | 74.70 |
| MedPaLM 2 | - | 2023.05 | **86.5** | 72.3 | **81.8** | 88.7 | 92.0 | **84.4** | **95.2** | **95.8** | **83.2** | **86.66** |
| **Open-source Models** | | | | | | | | | | | | |
| MedAlpaca | 7B | 2023.03 | 41.7 | 37.5 | 72.8 | 57.4 | 69.0 | 57.0 | 67.3 | 65.3 | 54.3 | 58.03 |
| PMC-LLaMA | 13B | 2023.09 | 56.4 | 56.0 | 77.9 | - | - | - | - | - | - | - |
| MEDITRON | 7B | 2023.11 | 57.2 | 59.2 | 74.4 | 64.6 | 59.9 | 49.3 | 55.4 | 53.8 | 44.8 | 57.62 |
| Mistral | 7B | 2023.12 | 50.8 | 48.2 | 75.4 | 68.7 | 71.0 | 55.6 | 68.4 | 68.1 | 59.5 | 62.97 |
| Gemma | 7B | 2024.02 | 47.2 | 49.0 | 76.2 | 69.8 | 70.0 | 59.3 | 66.2 | **79.9** | 60.1 | 64.19 |
| BioMistral | 7B | 2024.02 | 50.6 | 48.1 | 77.5 | 59.9 | 64.0 | 56.5 | 60.4 | 59.0 | 54.7 | 58.97 |
| Llama 3 | 8B | 2024.04 | 60.9 | 50.7 | 73.0 | **72.1** | 76.0 | 63.0 | 77.2 | **79.9** | 64.2 | 68.56 |
| MMed-Llama 3 (Ours) | 8B | - | **65.4** | **63.5** | **80.1** | 71.3 | **85.0** | **69.6** | **77.6** | 74.3 | **66.5** | **72.59** |

  1. We report each model's accuracy on each task separately; "Avg." denotes the mean score over the nine tasks (MedQA, MedMCQA, PubMedQA, and the six MMLU subsets). MMLU subset abbreviations: CK = Clinical Knowledge, MG = Medical Genetics, An = Anatomy, PM = Professional Medicine, CB = College Biology, CM = College Medicine. For fairness, all scores are based on basic generation settings without extra ensembling or prompting strategies such as Chain-of-Thought [37] or self-consistency [53]. Since all the English benchmarks have official test sets, we directly use the scores reported in the original papers for the other models; for MedAlpaca, GPT-4, GPT-3.5, and Llama 3, the scores are taken from the Open Medical-LLM Leaderboard [54]. The best results under each setting (closed-source and open-source) are shown in bold.
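As a quick sanity check of how the "Avg." column is derived, the snippet below (a minimal sketch, not part of the authors' evaluation code; the variable names are illustrative only) averages the nine per-task accuracies for the MMed-Llama 3 row of the table.

```python
# Minimal sketch: reproduce the "Avg." column by averaging the nine
# per-task accuracies reported in Table 3 (MMed-Llama 3, 8B row).
scores = {
    "MedQA": 65.4,
    "MedMCQA": 63.5,
    "PubMedQA": 80.1,
    "MMLU-CK": 71.3,  # Clinical Knowledge
    "MMLU-MG": 85.0,  # Medical Genetics
    "MMLU-An": 69.6,  # Anatomy
    "MMLU-PM": 77.6,  # Professional Medicine
    "MMLU-CB": 74.3,  # College Biology
    "MMLU-CM": 66.5,  # College Medicine
}

avg = sum(scores.values()) / len(scores)
print(f"Avg. = {avg:.2f}")  # 72.59, matching the table
```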