Extended Data Fig. 2: Illustration of the proposed GRPO for RL-based training.
From: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

In the proposed framework, an LLM serves as the policy model and generates a group of responses {o1, o2,…, oG} conditioned on a given query q. Each response in the group is scored by a reward model, either learned (model-based) or manually specified (rule-based), which assigns it a scalar reward. GRPO then computes the relative advantage of each group member from these rewards. Rather than relying on an explicit value function, as PPO does, GRPO estimates advantages directly from the intra-group reward distribution. The policy parameters are then updated to maximize the expected reward while limiting divergence from a reference policy, typically measured by the KL divergence. By eliminating the separate value network, GRPO offers a simpler yet effective alternative to traditional actor-critic methods such as PPO.
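
The group-relative advantage step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes each response's advantage is its reward standardized against the group mean and standard deviation. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group (illustrative sketch).

    rewards: scalar rewards for the G responses sampled for one query.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)        # group mean stands in for a learned value baseline
    sigma = pstdev(rewards)   # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for G = 4 responses to one query:
# 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form the baseline. If every response in a group receives the same reward, all advantages are zero and that group contributes no policy-gradient signal.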