Fig. 1: Accuracy and output length of DeepSeek-R1-Zero throughout the training process. | Nature

From: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

a, AIME accuracy of DeepSeek-R1-Zero during training. Each AIME problem gives the model a mathematical question as input and expects a number as output, as illustrated in Extended Data Table 1. pass@1 and cons@16 are described in Supplementary Information, section 4.1. The baseline is the average score achieved by human participants in the AIME competition. b, The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time. Note that a training step refers to a single policy update operation.
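
The two metrics named in panel a can be illustrated with a minimal sketch. It assumes that pass@1 is the fraction of sampled responses whose final answer is correct and that cons@16 scores the majority-vote answer over 16 sampled responses; the authoritative definitions are the ones in Supplementary Information, section 4.1. The function names and the sample answers below are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def pass_at_1(samples: Sequence[str], reference: str) -> float:
    """Fraction of sampled final answers that equal the reference answer."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    return sum(answer == reference for answer in samples) / len(samples)


def cons_at_k(samples: Sequence[str], reference: str) -> float:
    """1.0 if the most frequent sampled answer equals the reference, else 0.0."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == reference)


# Hypothetical final answers from 16 sampled responses to a single problem.
samples = ["204"] * 7 + ["210"] * 5 + ["96"] * 4

print(pass_at_1(samples, "204"))  # 7 of 16 correct: 0.4375
print(cons_at_k(samples, "204"))  # majority answer is "204": 1.0
```

Under these assumptions, a consensus score can exceed pass@1 on the same samples: the correct answer only needs to be the most common one, not the answer of every response.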