Table 2 Experimental results at each stage of DeepSeek-R1

From: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

 

| Benchmark (metric) | R1-Zero | R1 Dev1 | R1 Dev2 | R1 Dev3 | R1 |
|---|---|---|---|---|---|
| English | | | | | |
| MMLU (EM) | 88.8 | 89.1 | 91.2 | 91.0 | 90.8 |
| MMLU-Redux (EM) | 85.6 | 90.0 | 93.0 | 93.1 | 92.9 |
| MMLU-Pro (EM) | 68.9 | 74.1 | 83.8 | 83.1 | 84.0 |
| DROP (3-shot F1) | 89.1 | 89.8 | 91.1 | 88.7 | 92.2 |
| IF-Eval (Prompt Strict) | 46.6 | 71.7 | 72.0 | 78.1 | 83.3 |
| GPQA Diamond (Pass@1) | 75.8 | 66.1 | 70.7 | 71.2 | 71.5 |
| SimpleQA (Correct) | 30.3 | 17.8 | 28.2 | 24.9 | 30.1 |
| FRAMES (Acc.) | 82.3 | 78.5 | 81.8 | 81.9 | 82.5 |
| AlpacaEval 2.0 (LC-winrate) | 24.7 | 50.1 | 55.8 | 62.1 | 87.6 |
| Arena-Hard (GPT-4-1106) | 53.6 | 77.0 | 73.2 | 75.6 | 92.3 |
| Code | | | | | |
| LiveCodeBench (Pass@1-COT) | 50.0 | 57.5 | 63.5 | 64.6 | 65.9 |
| Codeforces (Percentile) | 80.4 | 84.5 | 90.5 | 92.1 | 96.3 |
| Codeforces (Rating) | 1,444 | 1,534 | 1,687 | 1,746 | 2,029 |
| SWE-bench Verified (Resolved) | 43.2 | 39.6 | 44.6 | 45.6 | 49.2 |
| Aider-Polyglot (Acc.) | 12.2 | 6.7 | 25.6 | 44.8 | 53.3 |
| Maths | | | | | |
| AIME 2024 (Pass@1) | 77.9 | 59.0 | 74.0 | 78.1 | 79.8 |
| MATH-500 (Pass@1) | 95.9 | 94.2 | 95.9 | 95.4 | 97.3 |
| CNMO 2024 (Pass@1) | 88.1 | 58.0 | 73.9 | 77.3 | 78.8 |
| Chinese | | | | | |
| CLUEWSC (EM) | 93.1 | 92.8 | 92.6 | 91.6 | 92.8 |
| C-Eval (EM) | 92.8 | 85.7 | 91.9 | 86.4 | 91.8 |
| C-SimpleQA (Correct) | 66.4 | 58.8 | 64.2 | 66.9 | 63.7 |

  1. Numbers in bold denote that the performance is statistically significant (t-test with P < 0.01).
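
To make the stage-to-stage movement in the table easier to read, the following is a minimal sketch that computes the change between consecutive stages for three rows. The scores are copied from Table 2; the names `STAGES`, `SCORES` and `stage_deltas` are illustrative and not part of the original work, and the differences say nothing about statistical significance.

```python
# Scores copied from Table 2; column order follows the table header.
STAGES = ["R1-Zero", "R1 Dev1", "R1 Dev2", "R1 Dev3", "R1"]

# Illustrative subset of rows (hypothetical variable name, real table values).
SCORES = {
    "IF-Eval (Prompt Strict)": [46.6, 71.7, 72.0, 78.1, 83.3],
    "AlpacaEval 2.0 (LC-winrate)": [24.7, 50.1, 55.8, 62.1, 87.6],
    "AIME 2024 (Pass@1)": [77.9, 59.0, 74.0, 78.1, 79.8],
}


def stage_deltas(values):
    """Return the change between each pair of consecutive stages,
    rounded to one decimal place to match the table's precision."""
    return [round(b - a, 1) for a, b in zip(values, values[1:])]


for name, values in SCORES.items():
    deltas = stage_deltas(values)
    steps = ", ".join(
        f"{STAGES[i]} -> {STAGES[i + 1]}: {d:+.1f}" for i, d in enumerate(deltas)
    )
    print(f"{name}: {steps}")
```

For these rows, the sketch shows the AIME 2024 score falling by 18.9 points from R1-Zero to R1 Dev1 and then recovering in later stages, while IF-Eval rises at every step.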