Table 1 Results on medical multiple-choice question answering, reported as accuracy scores (%)

From: Towards evaluating and building versatile large language models for medicine

| Method¹ | Size | MedQA | MedMCQA | PubMedQA | MMedBench ZH | MMedBench JA | MMedBench FR | MMedBench RU | MMedBench ES | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | | | | | |
| GPT-3.5 | – | 57.7 | 72.7 | 53.8 | 52.3 | 34.6 | 32.5 | 66.4 | 66.1 | 54.5 |
| GPT-4 | – | 85.8 | 72.3 | 75.2 | 75.1 | 72.9 | 56.6 | 83.6 | 85.7 | 75.3 |
| Open-source Models | | | | | | | | | | |
| MEDITRON | 7B | 47.9 | 59.2 | 74.4 | 61.9 | 40.2 | 35.1 | 67.6 | 53.3 | 54.9 |
| InternLM 2 | 7B | – | – | – | 77.6 | 47.7 | 41.0 | 68.4 | 59.6 | – |
| Mistral | 7B | 50.8 | 48.2 | 75.4 | 71.1 | 44.7 | 48.7 | 74.2 | 63.9 | 49.1 |
| Llama 3 | 8B | 60.9 | 50.7 | 73.0 | 78.2 | 48.2 | 50.8 | 71.5 | 64.2 | 62.2 |
| Qwen 1.5 | 7B | 48.9 | 50.2 | 67.8 | – | – | – | – | – | – |
| Med42-v2 | 8B | 62.8 | 62.8 | 75.8 | – | – | – | – | – | – |
| Baichuan 2 | 7B | 32.7 | 41.7 | – | – | – | – | – | – | – |
| MMedIns-Llama 3 | 8B | 63.6 | 57.1 | 78.2 | 78.6 | 54.3 | 46.0 | 72.3 | 61.2 | 63.9 |

Bolding represents the best results. Notably, all results in this table except those for MMedIns-Llama 3 are borrowed from other works.

¹The results for GPT-3.5, GPT-4, MEDITRON, InternLM, Mistral, and Llama 3 are borrowed from MMedLM¹⁹; Med42-v2 and Baichuan 2 are borrowed from their original papers¹³,¹⁴; Qwen 1.5 is borrowed from the Open-Medical-LLM-Leaderboard²⁰. Notably, since we could not find reported scores for the more recent Qwen 2, the earlier Qwen 1.5 (Qwen/Qwen1.5-7B) is reported instead to represent this LLM family.
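For reference, accuracy on these multiple-choice benchmarks is the fraction of questions for which a model's predicted option matches the gold option. The following is a minimal illustrative sketch (assuming predictions have already been mapped to option letters; this is not the authors' evaluation code):

```python
def multiple_choice_accuracy(predictions, gold_answers):
    """Return accuracy (%) given equal-length sequences of option labels, e.g. 'A'-'E'."""
    assert len(predictions) == len(gold_answers) and len(gold_answers) > 0
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return 100.0 * correct / len(gold_answers)

# Example: 3 of 4 predictions match the gold options -> 75.0
print(multiple_choice_accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))
```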