Table 2 Experimental results at each stage of DeepSeek-R1
From: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
|  | Benchmark (metric) | R1-Zero | R1 Dev1 | R1 Dev2 | R1 Dev3 | R1 |
|---|---|---|---|---|---|---|
| English | MMLU (EM) | 88.8 | 89.1 | 91.2 | 91.0 | 90.8 |
|  | MMLU-Redux (EM) | 85.6 | 90.0 | 93.0 | 93.1 | 92.9 |
|  | MMLU-Pro (EM) | 68.9 | 74.1 | 83.8 | 83.1 | 84.0 |
|  | DROP (3-shot F1) | 89.1 | 89.8 | 91.1 | 88.7 | 92.2 |
|  | IF-Eval (Prompt Strict) | 46.6 | 71.7 | 72.0 | 78.1 | 83.3 |
|  | GPQA Diamond (Pass@1) | 75.8 | 66.1 | 70.7 | 71.2 | 71.5 |
|  | SimpleQA (Correct) | 30.3 | 17.8 | 28.2 | 24.9 | 30.1 |
|  | FRAMES (Acc.) | 82.3 | 78.5 | 81.8 | 81.9 | 82.5 |
|  | AlpacaEval 2.0 (LC-winrate) | 24.7 | 50.1 | 55.8 | 62.1 | 87.6 |
|  | Arena-Hard (GPT-4-1106) | 53.6 | 77.0 | 73.2 | 75.6 | 92.3 |
| Code | LiveCodeBench (Pass@1-COT) | 50.0 | 57.5 | 63.5 | 64.6 | 65.9 |
|  | Codeforces (Percentile) | 80.4 | 84.5 | 90.5 | 92.1 | 96.3 |
|  | Codeforces (Rating) | 1,444 | 1,534 | 1,687 | 1,746 | 2,029 |
|  | SWE-bench Verified (Resolved) | 43.2 | 39.6 | 44.6 | 45.6 | 49.2 |
|  | Aider-Polyglot (Acc.) | 12.2 | 6.7 | 25.6 | 44.8 | 53.3 |
| Maths | AIME 2024 (Pass@1) | 77.9 | 59.0 | 74.0 | 78.1 | 79.8 |
|  | MATH-500 (Pass@1) | 95.9 | 94.2 | 95.9 | 95.4 | 97.3 |
|  | CNMO 2024 (Pass@1) | 88.1 | 58.0 | 73.9 | 77.3 | 78.8 |
| Chinese | CLUEWSC (EM) | 93.1 | 92.8 | 92.6 | 91.6 | 92.8 |
|  | C-Eval (EM) | 92.8 | 85.7 | 91.9 | 86.4 | 91.8 |
|  | C-SimpleQA (Correct) | 66.4 | 58.8 | 64.2 | 66.9 | 63.7 |