Table 1 Accuracy comparison of small language models on mathematical reasoning benchmarks under various prompting strategies

From: SLM-MATRIX: a multi-agent trajectory reasoning and verification framework for enhancing language models in materials data extraction

| Method | Qwen2.5-7B-Instruct-Turbo | Mistral-7B-Instruct | gemma-2-9b-it | Llama-3.2-11B-Vision | Llama-3.1-8B-Instruct-Turbo |
|---|---|---|---|---|---|
| **GSM8K** | | | | | |
| Zero-shot CoT | 81.5 | 51.33 | 80.98 | 78.32 | 78.17 |
| Few-shot CoT | 84.23 | 52.84 | 78.32 | 83.24 | 84.08 |
| ToT | 76.88 | 28.51 | 79.38 | 73.62 | 64.06 |
| RAP | 70.36 | 49.81 | 58.99 | 74.68 | 74.07 |
| MoA | 87.19 | | | | |
| SLM-MATRIX | **90.83** | | | | |
| **GSM-Hard** | | | | | |
| Zero-shot CoT | 64.37 | 37.83 | 66.34 | 53.9 | 56.63 |
| Few-shot CoT | 64.29 | 37.07 | 64.22 | 54.36 | 56.94 |
| ToT | 62.24 | 23.65 | 62.62 | 52.16 | 47.46 |
| RAP | 58.15 | 36.54 | 63.99 | 55.57 | 56.18 |
| MoA | 68.76 | | | | |
| SLM-MATRIX | **73.24** | | | | |
| **SVAMP** | | | | | |
| Zero-shot CoT | 85.4 | 63.1 | 85.60 | 77.9 | 79 |
| Few-shot CoT | 86.1 | 68.1 | 86.7 | 85.3 | 86.2 |
| ToT | 71.5 | 45.6 | 85.2 | 75.5 | 65.5 |
| RAP | 74.5 | 59.4 | 78.1 | 78.2 | 78 |
| MoA | 88.3 | | | | |
| SLM-MATRIX | **92.7** | | | | |
| **MATH** | | | | | |
| Zero-shot CoT | 57.4 | 34.6 | 35.4 | 30.59 | 32.2 |
| Few-shot CoT | 55.59 | 31.4 | 37.2 | 35.2 | 32.4 |
| ToT | 53 | 18.8 | 23.4 | 26.6 | 30 |
| RAP | 54.2 | 11.4 | 11.4 | 34.4 | 32.2 |
| MoA | 58.4 | | | | |
| SLM-MATRIX | **61.8** | | | | |

Note: MoA and SLM-MATRIX report a single framework-level score per benchmark rather than per-model scores, so their rows span all model columns.

  1. This table compares the performance of five open-source small language models (Qwen2.5-7B, Mistral-7B, Gemma-2-9B-it, LLaMA-3.2-11B-Vision, and LLaMA-3.1-8B-Instruct-Turbo) on four reasoning datasets: GSM8K, GSM-Hard, SVAMP, and MATH. Each model is evaluated under four prompting strategies (Zero-shot CoT, Few-shot CoT, Tree of Thoughts, and Reasoning via Planning), as well as within the proposed MoA and SLM-MATRIX frameworks.
  2. The bold values highlight the best-performing result within a given comparison group.
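As a convenience, the flattened table above can be loaded into a simple dictionary to find the best-performing strategy per benchmark programmatically. This is an illustrative sketch, not part of the paper's code: the dict layout and the `best_method` helper are assumptions, and the GSM8K scores shown are taken from the Qwen2.5-7B-Instruct-Turbo column (framework rows report a single score).

```python
# Hypothetical sketch: organizing Table 1 accuracies (GSM8K,
# Qwen2.5-7B-Instruct-Turbo column; MoA and SLM-MATRIX are
# framework-level scores) to identify the best-scoring strategy.
GSM8K = {
    "Zero-shot CoT": 81.5,
    "Few-shot CoT": 84.23,
    "ToT": 76.88,
    "RAP": 70.36,
    "MoA": 87.19,          # framework-level result
    "SLM-MATRIX": 90.83,   # framework-level result
}

def best_method(scores):
    """Return the (method, accuracy) pair with the highest accuracy."""
    return max(scores.items(), key=lambda kv: kv[1])

print(best_method(GSM8K))  # ('SLM-MATRIX', 90.83)
```

The same lookup applied to the other three benchmarks yields SLM-MATRIX as the top scorer in each case, which is what footnote 2's bolding highlights.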