Table 1 Performance of models on the six QA benchmarks: MedQA, USMLE sample test (USMLE), Medbullets-4 (MB-4), Medbullets-5 (MB-5), MedMCQA, and MMLU-Medical (MMLU)

From: Small language models learn enhanced reasoning skills from medical textbooks

| Model | MedQA | USMLE | MB-4 | MB-5 | MedMCQA | MMLU | Avg. |
|---|---|---|---|---|---|---|---|
| **Commercial LLMs** | | | | | | | |
| o1 | **95.7** | **95.7** | **87.9** | **86.0** | **82.9** | **95.2** | **90.6** |
| o3-mini | 91.8 | 93.5 | 85.7 | 81.2 | 76.3 | 93.0 | 86.9 |
| o1-mini | 90.0 | 91.1 | 80.8 | 79.2 | 71.0 | 91.2 | 83.9 |
| GPT-4o | 83.6 | 89.2 | 76.3 | 66.5 | 63.1 | 86.2 | 77.5 |
| GPT-4 | 81.4 | 86.6 | 68.8 | 63.3 | 72.4 | 87.1 | 76.6 |
| GPT-3.5 (175B) | 53.6 | 58.5 | 51.0 | 47.4 | 51.0 | 67.3 | 54.8 |
| **Open-source & Domain-specific SLMs** | | | | | | | |
| Mistral-7B | 43.2 | 40.5 | 38.8 | 32.8 | 40.7 | 51.0 | 41.2 |
| Llama-3-8B | 57.5 | 58.8 | 49.0 | 48.7 | 54.7 | 68.0 | 56.1 |
| MediTron-7B | 50.2 | 44.6 | 51.5 | 45.5 | 57.9 | 56.7 | 51.0 |
| BioMistral-7B | 54.3 | 51.4 | 52.3 | 48.7 | 61.1 | 64.6 | 55.4 |
| Meerkat-7B (Ours) | 71.2 | 70.1 | **60.5** | 52.8 | 61.5 | 70.7 | 64.5 |
| Meerkat-8B (Ours) | **74.2** | **73.8** | 59.7 | **55.2** | **62.7** | **74.3** | **66.7** |

  1. Our Meerkat models generally performed better than existing 7B and 8B models and GPT-3.5 across all datasets. The MMLU-Medical scores are the average accuracies across the six medical-related subjects (a brief averaging sketch follows these notes); detailed results for each subject are given in Supplementary Table 1. The GPT-3.5 and GPT-4 scores were obtained from Nori et al.11, Toma et al.53, and Chen et al.54.
  2. The best performance for each category—Commercial LLMs and Open-source & Domain-specific SLMs—is highlighted in bold.
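As a sanity check on the reported numbers, the Avg. column is consistent with an unweighted mean of the six benchmark accuracies, and note 1 describes the MMLU column as a macro-average over six medical-related subjects. The sketch below is not the authors' evaluation code; it simply reproduces the Meerkat-8B average from the values in the table.

```python
# Minimal sketch (not the authors' evaluation code): the Avg. column appears to be
# an unweighted mean over the six benchmark accuracies reported in Table 1.
meerkat_8b = {
    "MedQA": 74.2, "USMLE": 73.8, "MB-4": 59.7,
    "MB-5": 55.2, "MedMCQA": 62.7, "MMLU": 74.3,
}

avg = sum(meerkat_8b.values()) / len(meerkat_8b)
print(avg)  # ~66.65, consistent with the 66.7 reported in the Avg. column

# The MMLU entry is itself a macro-average over six medical-related subjects
# (per note 1); the subject-level accuracies are listed in Supplementary Table 1.
```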