Table 1 Accuracy comparison of small language models on mathematical reasoning benchmarks under various prompting strategies

From: SLM-MATRIX: a multi-agent trajectory reasoning and verification framework for enhancing language models in materials data extraction

| Method | Qwen2.5-7B-Instruct-Turbo | Mistral-7B-Instruct | gemma-2-9b-it | Llama-3.2-11B-Vision | Llama-3.1-8B-Instruct-Turbo |
|---|---|---|---|---|---|
| **GSM8K** | | | | | |
| Zero-shot CoT | 81.5 | 51.33 | 80.98 | 78.32 | 78.17 |
| Few-shot CoT | 84.23 | 52.84 | 78.32 | 83.24 | 84.08 |
| ToT | 76.88 | 28.51 | 79.38 | 73.62 | 64.06 |
| RAP | 70.36 | 49.81 | 58.99 | 74.68 | 74.07 |
| MoA | 87.19 | | | | |
| SLM-MATRIX | **90.83** | | | | |
| **GSM-Hard** | | | | | |
| Zero-shot CoT | 64.37 | 37.83 | 66.34 | 53.9 | 56.63 |
| Few-shot CoT | 64.29 | 37.07 | 64.22 | 54.36 | 56.94 |
| ToT | 62.24 | 23.65 | 62.62 | 52.16 | 47.46 |
| RAP | 58.15 | 36.54 | 63.99 | 55.57 | 56.18 |
| MoA | 68.76 | | | | |
| SLM-MATRIX | **73.24** | | | | |
| **SVAMP** | | | | | |
| Zero-shot CoT | 85.4 | 63.1 | 85.60 | 77.9 | 79 |
| Few-shot CoT | 86.1 | 68.1 | 86.7 | 85.3 | 86.2 |
| ToT | 71.5 | 45.6 | 85.2 | 75.5 | 65.5 |
| RAP | 74.5 | 59.4 | 78.1 | 78.2 | 78 |
| MoA | 88.3 | | | | |
| SLM-MATRIX | **92.7** | | | | |
| **MATH** | | | | | |
| Zero-shot CoT | 57.4 | 34.6 | 35.4 | 30.59 | 32.2 |
| Few-shot CoT | 55.59 | 31.4 | 37.2 | 35.2 | 32.4 |
| ToT | 53 | 18.8 | 23.4 | 26.6 | 30 |
| RAP | 54.2 | 11.4 | 11.4 | 34.4 | 32.2 |
| MoA | 58.4 | | | | |
| SLM-MATRIX | **61.8** | | | | |

Note: MoA and SLM-MATRIX report a single framework-level score per benchmark rather than per-model scores, so their rows span all model columns.

  1. This table compares the performance of five open-source small language models (Qwen2.5-7B, Mistral-7B, Gemma-2-9B-it, LLaMA-3.2-11B-Vision, and LLaMA-3.1-8B-Instruct-Turbo) on four reasoning datasets: GSM8K, GSM-Hard, SVAMP, and MATH. Each model is evaluated under four prompting strategies (Zero-shot CoT, Few-shot CoT, Tree of Thoughts, and Reasoning via Planning), as well as within the proposed MoA and SLM-MATRIX frameworks.
  2. The bold values highlight the best-performing result within a given comparison group.
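As a convenience, the flattened table above can be loaded into a simple dictionary to find the best-performing strategy per benchmark programmatically. This is an illustrative sketch, not part of the paper's code: the dict layout and the `best_method` helper are assumptions, and the GSM8K scores shown are taken from the Qwen2.5-7B-Instruct-Turbo column (framework rows report a single score).

```python
# Hypothetical sketch: organizing Table 1 accuracies (GSM8K,
# Qwen2.5-7B-Instruct-Turbo column; MoA and SLM-MATRIX are
# framework-level scores) to identify the best-scoring strategy.
GSM8K = {
    "Zero-shot CoT": 81.5,
    "Few-shot CoT": 84.23,
    "ToT": 76.88,
    "RAP": 70.36,
    "MoA": 87.19,          # framework-level result
    "SLM-MATRIX": 90.83,   # framework-level result
}

def best_method(scores):
    """Return the (method, accuracy) pair with the highest accuracy."""
    return max(scores.items(), key=lambda kv: kv[1])

print(best_method(GSM8K))  # ('SLM-MATRIX', 90.83)
```

The same lookup applied to the other three benchmarks yields SLM-MATRIX as the top scorer in each case, which is what footnote 2's bolding highlights.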