Table 1: Accuracy (%) of small language models on mathematical reasoning benchmarks under various prompting strategies
| Method | Qwen2.5-7B-Instruct-Turbo | Mistral-7B-Instruct | gemma-2-9b-it | Llama-3.2-11B-Vision | Llama-3.1-8B-Instruct-Turbo |
|---|---|---|---|---|---|
| GSM8K | | | | | |
| Zero-shot CoT | 81.5 | 51.33 | 80.98 | 78.32 | 78.17 |
| Few-shot CoT | 84.23 | 52.84 | 78.32 | 83.24 | 84.08 |
| ToT | 76.88 | 28.51 | 79.38 | 73.62 | 64.06 |
| RAP | 70.36 | 49.81 | 58.99 | 74.68 | 74.07 |
| MoA | 87.19 | | | | |
| SLM-MATRIX | 90.83 | | | | |
| GSM-Hard | | | | | |
| Zero-shot CoT | 64.37 | 37.83 | 66.34 | 53.9 | 56.63 |
| Few-shot CoT | 64.29 | 37.07 | 64.22 | 54.36 | 56.94 |
| ToT | 62.24 | 23.65 | 62.62 | 52.16 | 47.46 |
| RAP | 58.15 | 36.54 | 63.99 | 55.57 | 56.18 |
| MoA | 68.76 | | | | |
| SLM-MATRIX | 73.24 | | | | |
| SVAMP | | | | | |
| Zero-shot CoT | 85.4 | 63.1 | 85.60 | 77.9 | 79.0 |
| Few-shot CoT | 86.1 | 68.1 | 86.7 | 85.3 | 86.2 |
| ToT | 71.5 | 45.6 | 85.2 | 75.5 | 65.5 |
| RAP | 74.5 | 59.4 | 78.1 | 78.2 | 78.0 |
| MoA | 88.3 | | | | |
| SLM-MATRIX | 92.7 | | | | |
| MATH | | | | | |
| Zero-shot CoT | 57.4 | 34.6 | 35.4 | 30.59 | 32.2 |
| Few-shot CoT | 55.59 | 31.4 | 37.2 | 35.2 | 32.4 |
| ToT | 53.0 | 18.8 | 23.4 | 26.6 | 30.0 |
| RAP | 54.2 | 11.4 | 11.4 | 34.4 | 32.2 |
| MoA | 58.4 | | | | |
| SLM-MATRIX | 61.8 | | | | |