Group Relative Policy Optimization (GRPO) is a widely used policy gradient algorithm that was popularized by DeepSeek [1]. In this post, we start entirely from first principles and progressively add complexity until we get to the full GRPO objective.

Our primary focus is to understand how GRPO applies to LLM training, and we tailor the notation specifically to that setting.

1. Maximizing Rewards

The fundamental objective of policy gradient algorithms is to maximize the expected reward (technically the expected return but in our setup this is identical):

\[\mathcal{J}(\theta) = \mathbb{E}_{q \sim p(Q), o \sim \pi_\theta(O|q)}[R(o)]\]

To maximize \(\mathcal{J}(\theta)\) we need to find parameters \(\theta\) for our LLM policy \(\pi_\theta\) that will produce good outputs \(o\) over our query distribution \(q \sim p(Q)\). One straightforward way is to differentiate with respect to our policy and iteratively follow the gradient toward better policies.

1.1 REINFORCE

Directly following the gradient \(\nabla \mathcal{J}(\theta)\) corresponds to the widely used REINFORCE policy gradient algorithm [2]. However, in its current form, the objective is difficult to optimize. To optimize expectations we usually take Monte-Carlo samples, differentiate them individually, and average their gradients to approximate the true gradient. However, our policy \(\pi_\theta\) only appears in the sampling distribution of the expectation; inside it we have only the rewards, which are typically not differentiable. To get around this, we can use a few definitions and tricks to derive a surrogate objective which is easier to optimize:

\[\begin{align} \nabla \mathcal{J}(\theta) &= \nabla \mathbb{E}_{q \sim p(Q), o \sim \pi_\theta(O|q)}[R(o)] \\ &= \sum_{q, o} p(q) \nabla \pi_\theta(o|q) R(o) && \text{definition + linearity of expectations}\\ &= \sum_{q, o} p(q) \frac{\pi_\theta(o|q)}{\pi_\theta(o|q) } \nabla \pi_\theta(o|q) R(o) && \text{identity trick} \\ &= \sum_{q, o} p(q) \pi_\theta(o|q) \nabla \log \pi_\theta(o|q) R(o) && \text{log derivative trick} \\ &= \mathbb{E}_{q \sim p(Q), o \sim \pi_\theta(O|q)}[ \nabla \log \pi_\theta(o|q) R(o)] \end{align}\]

The final line corresponds to the REINFORCE objective as it is typically presented. In practical implementations, we only need to implement the term inside the expectation and let autograd handle the rest:

import torch

def reinforce(
    logps: torch.Tensor,    # [batch_size, seq_len]
    masks: torch.Tensor,    # [batch_size, seq_len], 1 for output tokens, 0 elsewhere
    rewards: torch.Tensor,  # [batch_size, 1]
) -> torch.Tensor:
    # log pi_theta(o|q) * R(o), restricted to the output tokens
    masked_obj = logps * masks * rewards
    # token-level average per sequence, then average over the batch
    token_level_avg = masked_obj.sum(dim=1) / masks.sum(dim=1)
    return token_level_avg.mean()

Two minor things to note are (1) we compute the objective on the output tokens only and (2) we take a token-level (instead of a sequence-level) average.

We can formulate a different surrogate objective by pulling the identity trick into the surrogate objective itself:

\[\mathcal{J}(\theta) =\mathbb{E}_{q \sim p(Q), o \sim \pi_\theta(O|q)}\Bigg[ \frac{\pi_\theta(o|q)}{ \pi_\theta(o|q)} R(o)\Bigg]\]

Obviously this is identical to before because we’re simply multiplying the objective by \(1\). If we’re careful during implementation by adding a stop_gradient to the denominator, it should also give us identical gradients to the \(\log\)-based surrogate objective without having to rewrite anything. This may seem like a silly, contrived thing to do, but it will help us build towards the full GRPO objective below.
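To convince ourselves, here is a quick PyTorch check (shapes and values are arbitrary, chosen just for illustration) that the detached-ratio surrogate produces the same gradients as the \(\log\)-based one:

```python
import torch

torch.manual_seed(0)
logps = torch.randn(4, 8, requires_grad=True)  # per-token log-probs
rewards = torch.randn(4, 1)                    # per-sequence rewards

# log-derivative surrogate: log pi * R
(logps * rewards).sum().backward()
grad_log = logps.grad.clone()
logps.grad = None

# ratio surrogate: (pi / stop_gradient(pi)) * R, computed in log space
ratios = torch.exp(logps - logps.detach())
(ratios * rewards).sum().backward()
grad_ratio = logps.grad

print(torch.allclose(grad_log, grad_ratio))  # True
```

The ratio is identically \(1\) in the forward pass, but its gradient with respect to the numerator's log-probs is exactly the reward-weighted score function, matching REINFORCE.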

2. From Rewards to Advantages

GRPO was originally created for domains such as math where there is a single, verifiable correct answer. In these domains, the reward is typically set to \(1\) if the output is correct and \(0\) otherwise. One downside of this formulation is that it doesn't distinguish well between easy and hard queries. For example, an easy query which the model gets correct \(9/10\) times is assigned the same per-output reward as a hard query which the model gets correct only \(1/10\) times. This makes credit assignment difficult and can lead to high-variance gradients and unstable learning. What we would like is for the learning signal to vary smoothly with the difficulty of the query.

One way to achieve this is by computing and maximizing advantages instead of rewards. Suppose for a given query \(q\) we sample \(G\) outputs \(\{o_1, \dots, o_G\}\) with rewards \(\{r_1 = R(o_1), \dots, r_G = R(o_G)\}\). Then for a single output \(o_i\) we can compute the advantage as:

\[A_i = \frac{r_i - \text{mean}(r_1, \dots, r_G)}{\text{std}(r_1, \dots, r_G)}\]

In essence, the advantage of each output is the reward normalized to have zero-mean, unit-variance across the group. If the output is correct (incorrect), the advantage will be larger (smaller) than the mean by an amount relative to how easy or hard the query was across the group, which is why the algorithm is called Group Relative Policy Optimization. Unlike the rewards, the advantages will vary smoothly thereby reducing variance and providing a stronger learning signal.
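As a concrete sketch, the advantage computation for a batch of groups might look like the following (the function name and the small epsilon guarding the denominator are my own additions, not from the paper):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: [num_groups, group_size]; one row per query's G sampled outputs
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # eps guards against division by zero when all rewards in a group match
    return (rewards - mean) / (std + eps)

# an easy query (3/4 correct) vs a hard query (1/4 correct)
rewards = torch.tensor([[1., 1., 1., 0.],
                        [1., 0., 0., 0.]])
print(group_advantages(rewards))
```

Note that the correct output on the hard query receives a larger advantage than a correct output on the easy query, which is exactly the difficulty-aware signal we wanted.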

Aside: Removing unit variance normalization

Normalizing the advantages to have unit variance can inadvertently lead to a difficulty bias during training [8]. If a query is very easy or very hard so that most rewards are identical, the variance (and hence standard deviation) of the rewards will be close to \(0\). Dividing the centered rewards by this tiny value will lead to large advantages, causing training to heavily bias towards very easy or very difficult queries.

In the limiting case, if all rewards are identical, then the variance will be \(0\) and the advantage will be undefined. Such cases need special handling to protect against NaNs during training, or better yet, outright removal so no computation is wasted on them.

Because of these issues, almost all future works or practical implementations of GRPO remove the unit variance normalization from the advantage computation. In the \(0 / 1\) reward case, there should be no noticeable difference in the performance of the final policy.
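A sketch of this variant (the names are mine) that centers the rewards and drops degenerate groups where every reward is identical:

```python
import torch

def centered_advantages(rewards: torch.Tensor):
    # rewards: [num_groups, group_size]
    # keep only groups with at least one correct and one incorrect output
    keep = rewards.std(dim=1) > 0
    kept = rewards[keep]
    # center within each surviving group; no unit-variance normalization
    advantages = kept - kept.mean(dim=1, keepdim=True)
    return advantages, keep

rewards = torch.tensor([[1., 1., 1., 1.],   # all correct: no learning signal
                        [1., 0., 0., 0.]])
advantages, keep = centered_advantages(rewards)
```

The filtering also implements the "outright removal" mentioned above: all-correct or all-incorrect groups carry no gradient signal, so we skip them entirely.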

We can now update our surrogate objective to maximize advantages instead of rewards:

\[\mathcal{J}(\theta) =\mathbb{E}_{q \sim p(Q), \{o_i\}_{i=1}^G \sim \pi_\theta(O|q)}\Bigg[ \frac{\pi_\theta(o_i|q)}{ \pi_\theta(o_i|q)} A_i \Bigg]\]

3. Going off-policy

So far we have assumed the outputs are sampled from the current policy \(\{o_i\}_{i=1}^G \sim \pi_\theta(O\vert q)\), and in an ideal world this would always be the case. Unfortunately, because autoregressive sampling from LLMs is so expensive, we often accept some off-policy drift to improve system efficiency and throughput. This is an efficiency trade-off, not something desirable from an algorithmic or correctness perspective.

In modern Reinforcement Learning systems, the sampler (which autoregressively generates outputs) and the trainer (which computes the forward/backward pass and updates the parameters) are implemented as separate systems in order to maximize overall system efficiency. To stay on-policy, the trainer must synchronize its parameters with the sampler after every gradient step so the sampler can generate the next set of outputs with the most up-to-date policy parameters.

There are two ways this system can go off-policy. The first is a conscious decision by the algorithm designer to trade off some degree of off-policy drift for efficiency. For example, modern inference systems like vLLM [3] batch generations and it can be much more efficient to generate huge batches of outputs instead of generating them one-by-one. To increase efficiency we could generate a large batch of outputs with parameters \(\pi_\text{old}\) and use them to take multiple policy gradient steps on \(\pi_\theta\) before synchronizing the new parameters with the sampler. Taking this approach to the extreme, we could entirely decouple the sampler and the trainer so the sampler is continuously generating outputs while the trainer is continuously updating the policy and sending the latest parameters to the sampler. Such approaches, known broadly as Async-RL, have been shown to significantly improve system throughput without major quality degradation [4].

The second, more insidious way, is from the fact that the sampler and trainer are indeed different systems. Even in the on-policy case where the trainer parameters are always synced with the sampler, the sampler will usually produce different token probabilities than the trainer [5]. This can be because of differing CUDA kernels, differing precisions, batch non-determinism [6], quantization of the KV cache in the sampler, etc. Therefore, unless we take an unreasonable level of care to identically match sampler and trainer behavior, we cannot escape going off-policy in modern RL systems.

3.1 Importance Sampling

The major downside of going off-policy is that we’re no longer optimizing the correct objective since the outputs were sampled from \(\pi_\text{old}\) instead of \(\pi_\theta\). Instead we are optimizing:

\[\mathcal{J}(\theta) =\mathbb{E}_{q \sim p(Q), \{o_i\}_{i=1}^G \sim \pi_\text{old}(O|q)}\Bigg[ \frac{\pi_\theta(o_i|q)}{ \pi_\theta(o_i|q)} A_i \Bigg]\]

This mismatch can have real, sometimes catastrophic, consequences in practice. For example, it can lead to much worse results than an equivalent on-policy run or even cause runs to collapse entirely towards \(0\) reward. This negative behavior becomes more prevalent as the degree of off-policy drift, sequence length, or number of optimization steps increase.

Fortunately, the fix is straightforward and all we need to do is correct the expectation so we optimize the correct objective again. We can do this by multiplying importance sampling weights \(\pi_\theta(o_i|q) / \pi_\text{old}(o_i|q)\) with our previous objective:

\[\begin{align} \mathcal{J}(\theta) &= \mathbb{E}_{q \sim p(Q), \{o_i\}_{i=1}^G \sim \pi_\text{old}(O|q)}\Bigg[ \frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)} \frac{\pi_\theta(o_i|q)}{ \pi_\theta(o_i|q)} A_i \Bigg] \\ &= \mathbb{E}_{q \sim p(Q), \{o_i\}_{i=1}^G \sim \pi_\text{old}(O|q)}\Bigg[ \frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)} A_i \Bigg] \\ &= \sum_{q, o_i} p(q) \, \pi_\text{old}(o_i|q) \frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)} A_i && \text{definition of expectation} \\ &= \sum_{q, o_i} p(q) \, \pi_\theta(o_i|q) A_i && \text{cancel terms} \\ &= \mathbb{E}_{q \sim p(Q), \{o_i\}_{i=1}^G \sim \pi_\theta (O|q)}[ A_i ] && \text{definition of expectation} \end{align}\]

As we can see from the derivation above, multiplying by the importance sampling weights indeed ensures we’re optimizing the correct objective. Putting everything together, the objective we need to maximize is now:

\[\mathcal{J}(\theta) = \mathbb{E}_{q \sim p(Q), \{o_i\}_{i=1}^G \sim \pi_\text{old}(O|q)}\Bigg[ \frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)} A_i \Bigg]\]

Hopefully the connection to the “silly” surrogate objective is now clear: the form of our latest surrogate objective is identical to the previous one except we swap \(\pi_\text{old}\) for \(\pi_\theta\) in the expectation and the denominator.
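In code, the importance-sampled surrogate only requires keeping the sampler's log-probs around; a minimal sketch with illustrative values (the variable names are my own):

```python
import torch

# per-token log-probs from the trainer (current policy) and the sampler (old policy)
pi_logps = torch.tensor([[-1.0, -2.0, -0.5]], requires_grad=True)
old_logps = torch.tensor([[-1.2, -2.0, -0.7]])  # recorded at sampling time, no grad
advantages = torch.tensor([[1.0]])

# pi_theta / pi_old, computed in log space for numerical stability
ratios = torch.exp(pi_logps - old_logps)
obj = (ratios * advantages).mean()
```

Since `old_logps` are fixed values recorded at sampling time, gradients flow only through \(\pi_\theta\), just as in the detached-denominator surrogate from before.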

4. Staying proximal to the old policy

Although the importance sampling weights ensure we're optimizing the correct objective, they unfortunately introduce new potential issues that can lead to slow or unstable optimization. To understand why, consider what happens when \(\pi_\theta\) and \(\pi_\text{old}\) differ in magnitude. If \(\pi_\text{old}\) is tiny while \(\pi_\theta \gg 0\), the ratio \(\pi_\theta / \pi_\text{old}\) will be large, leading to large policy gradients. Conversely, if \(\pi_\theta\) is tiny while \(\pi_\text{old} \gg 0\), then \(\pi_\theta / \pi_\text{old} \approx 0\) and learning will be slow. As we get further off-policy this problem compounds: the ratios develop high variance, the policy gradients inherit that variance, and training becomes unstable and can potentially collapse.

Therefore, a desirable property would be for \(\pi_\theta / \pi_\text{old}\) to stay as close to \(1\) as possible, matching the on-policy case. We can enforce this explicitly by clipping the ratio any time it moves more than \(\varepsilon\) away from \(1\):

\[\text{clip}\Bigg(\frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)}, 1-\varepsilon, 1+\varepsilon\Bigg)\]

Clipping the ratio effectively sets its gradient to \(0\) whenever it falls outside of \([1-\varepsilon, 1+\varepsilon]\) ensuring \(\pi_\theta\) remains close (or proximal) to \(\pi_\text{old}\) as we take gradient steps.
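We can verify the zero-gradient behavior directly. In the sketch below, \(\varepsilon = 0.2\), so the ratios \(0.5\) and \(1.5\) fall outside \([0.8, 1.2]\) and contribute no gradient:

```python
import torch

ratios = torch.tensor([0.5, 1.0, 1.5], requires_grad=True)
clipped = torch.clip(ratios, 1 - 0.2, 1 + 0.2)
clipped.sum().backward()
print(ratios.grad)  # tensor([0., 1., 0.])
```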

Aside: Trust region motivation for clipping

We’ve given a somewhat hand-wavey motivation for why we should clip the probability ratios. While it’s outside the scope of this post, a more principled motivation is provided by the TRPO [13] and PPO [7] papers.

The DeepSeekMath paper [1] which originally introduced GRPO follows PPO [7] and sets \(\varepsilon = 0.2\). Recent papers [8, 9] have found that decoupling \(\varepsilon\) into upper and lower bounds \((\varepsilon^{-}, \varepsilon^+)\) and setting \(\varepsilon^+ = 0.28\) can help with exploration. Indeed, DeepSeek-R1 [10] sets \(\varepsilon^+ = 10\), significantly higher than originally proposed by PPO and follow-up works. In my experiments, I have also found the original \(\varepsilon = 0.2\) to be too conservative and use clipping values closer to what DeepSeek-R1 proposes.
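The decoupled variant is a one-line change; here is a sketch (the specific `eps_low` and `eps_high` defaults are illustrative, drawn from the range these papers explore):

```python
import torch

def clip_decoupled(
    ratios: torch.Tensor,
    eps_low: float = 0.2,
    eps_high: float = 0.28,
) -> torch.Tensor:
    # allow the ratio to grow further above 1 than it can shrink below it,
    # which encourages up-weighting low-probability (exploratory) tokens
    return torch.clip(ratios, 1 - eps_low, 1 + eps_high)

ratios = torch.tensor([0.7, 1.25, 1.5])
print(clip_decoupled(ratios))  # tensor([0.8000, 1.2500, 1.2800])
```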

4.1 Taking a conservative lower bound

The clipping introduces yet another issue we need to fix. Depending on the sign of the advantage \(A_i\), the clipped objective can be larger than the unclipped objective. The easiest way to see this is by exhaustively considering all possible interactions between the advantage and the importance sampling ratio. Given advantage \(A_i\) and ratio \(r_i(\theta) = \pi_\theta(o_i|q) / \pi_\text{old}(o_i|q)\) we have:

| Advantage | Ratio | Clipped vs Unclipped Objective |
| --- | --- | --- |
| \(A_i > 0\) | \(r_i(\theta) < 1 - \varepsilon\) | \(\text{clip}(r_i(\theta), 1 - \varepsilon, 1 + \varepsilon) A_i > r_i(\theta) A_i\) |
| \(A_i > 0\) | \(r_i(\theta) > 1 + \varepsilon\) | \(\text{clip}(r_i(\theta), 1 - \varepsilon, 1 + \varepsilon) A_i < r_i(\theta) A_i\) |
| \(A_i < 0\) | \(r_i(\theta) < 1 - \varepsilon\) | \(\text{clip}(r_i(\theta), 1 - \varepsilon, 1 + \varepsilon) A_i < r_i(\theta) A_i\) |
| \(A_i < 0\) | \(r_i(\theta) > 1 + \varepsilon\) | \(\text{clip}(r_i(\theta), 1 - \varepsilon, 1 + \varepsilon) A_i > r_i(\theta) A_i\) |

As we can see from rows \(1\) and \(4\) in the table above, using the clipped objective can artificially inflate how good the current policy is. A simple trick to mitigate this is to always take the minimum of the two objectives:

\[\min(r_i(\theta)A_i, \text{clip}(r_i(\theta), 1-\varepsilon, 1+\varepsilon)A_i)\]

Now our surrogate objective is always a conservative lower bound on the unclipped objective. Plugging this into what we had before gives us our latest surrogate objective:

\[\mathcal{J}(\theta) = \mathbb{E}_{q \sim p(Q), \{o_i\}_{i=1}^G \sim \pi_\text{old}(O|q)}\Bigg[ \min \Bigg( \frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)} A_i, \text{clip}\Bigg(\frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)}, 1 - \varepsilon, 1 + \varepsilon\Bigg) A_i \Bigg) \Bigg]\]

5. Staying proximal to the reference policy

The final extension GRPO makes is to add a KL divergence \(\mathbb{D}_\text{KL}(\pi_\theta \Vert \pi_\text{ref})\) into the objective. This term penalizes the policy \(\pi_\theta\) from moving too far away from a reference policy \(\pi_\text{ref}\). Both the policy and the reference models are typically initialized from a previously trained SFT model so that \(\mathbb{D}_\text{KL}(\pi_\theta \Vert \pi_\text{ref}) = 0\) at the start of training. As training progresses, the KL divergence term will ensure the policy improves while staying relatively close to the SFT model.

Aside: Removing the KL divergence term

Unlike PPO which adds the KL term in the reward, GRPO adds it directly to the objective in order to simplify the advantage computation. This is GRPO’s only significant divergence from PPO and it’s somewhat questionable whether it is a principled modification. Many follow-up works have removed the KL divergence term from the GRPO loss and observed little-to-no performance degradation. Removing the KL has the added benefit that we do not need to keep \(\pi_\text{ref}\) around during training to compute the reference probabilities, which can lead to significant memory and training time reductions. In my own experiments, I have noticed little benefit to including the KL divergence term, though possibly it is consequential when training very long context models for a large number of steps.

The original DeepSeekMath paper approximated the \(\mathbb{D}_\text{KL}(\pi_\theta \Vert \pi_\text{ref})\) using the k3 estimator [11] which takes a sample average of:

\[\mathbb{D}_\text{KL}(\pi_\theta \Vert \pi_\text{ref}) \approx \Bigg(\frac{\pi_\text{ref}(o_i | q)}{\pi_\theta(o_i | q)} - 1 \Bigg) - \log \frac{\pi_\text{ref}(o_i | q)}{\pi_\theta(o_i | q)}\]
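As a quick numerical sanity check (the distributions below are made up), the k3 estimate is non-negative elementwise and vanishes when the two policies agree:

```python
import torch

pi_logps = torch.log(torch.tensor([0.7, 0.2, 0.1]))
ref_logps = torch.log(torch.tensor([0.5, 0.3, 0.2]))

log_r = ref_logps - pi_logps        # log(pi_ref / pi_theta) per token
k3 = torch.exp(log_r) - 1 - log_r   # elementwise k3 estimates, always >= 0

# when pi_theta == pi_ref, the estimate is exactly zero
same = torch.exp(pi_logps - pi_logps) - 1 - (pi_logps - pi_logps)
```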

However, this leads to a biased estimate of the KL divergence because the outputs \(o_i\) are generated from \(\pi_\text{old}\) instead of \(\pi_\theta\). DeepSeek-v3.2 [12] fixes this by reapplying the same importance sampling approach we described in section 3.1. Namely, we multiply the term inside the expectation by \(\pi_\theta(o_i | q) / \pi_\text{old}(o_i |q)\):

\[\mathbb{D}_\text{KL}(\pi_\theta \Vert \pi_\text{ref}) \approx \frac{\pi_\theta(o_i | q)}{\pi_\text{old}(o_i | q)} \Bigg[ \Bigg( \frac{\pi_\text{ref}(o_i | q)}{\pi_\theta(o_i | q)} - 1 \Bigg) - \log \frac{\pi_\text{ref}(o_i | q)}{\pi_\theta(o_i | q)} \Bigg]\]

Plugging the KL divergence into our previous objective gives us the full GRPO surrogate objective:

\[\mathcal{J}(\theta) = \mathbb{E}_{q \sim p(Q), \{o_i\}_{i=1}^G \sim \pi_\text{old}(O|q)}\Bigg[ \min \Bigg( \frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)} A_i, \text{clip}\Bigg(\frac{\pi_\theta(o_i|q)}{\pi_\text{old}(o_i|q)}, 1 - \varepsilon, 1 + \varepsilon\Bigg) A_i \Bigg) - \beta \mathbb{D}_\text{KL}(\pi_\theta \Vert \pi_\text{ref}) \Bigg]\]

where \(\beta\) is a hyperparameter that controls how much we care about maintaining a low KL divergence. We can implement the GRPO objective in code by directly translating the math like so:

import torch

def grpo(
    pi_logps: torch.Tensor,    # [batch_size, seq_len]
    old_logps: torch.Tensor,   # [batch_size, seq_len]
    ref_logps: torch.Tensor,   # [batch_size, seq_len]
    masks: torch.Tensor,       # [batch_size, seq_len], 1 for output tokens, 0 elsewhere
    advantages: torch.Tensor,  # [batch_size, 1]
    eps: float = 0.2,
    beta: float = 0.1,
) -> torch.Tensor:
    # importance sampling ratios pi_theta / pi_old, computed in log space
    ratios = torch.exp(pi_logps - old_logps)
    clipped_ratios = torch.clip(ratios, 1 - eps, 1 + eps)
    # conservative lower bound over the clipped and unclipped objectives
    lower_bound = torch.minimum(ratios * advantages, clipped_ratios * advantages)

    # k3 estimator of KL(pi_theta || pi_ref), importance-weighted to correct
    # for the outputs being sampled from pi_old
    k3 = torch.exp(ref_logps - pi_logps) - 1 - (ref_logps - pi_logps)
    unbiased_k3 = ratios * k3

    masked_obj = (lower_bound - beta * unbiased_k3) * masks
    token_level_avg = masked_obj.sum(dim=1) / masks.sum(dim=1)
    return token_level_avg.mean()

As with REINFORCE, we take a token-level average over the objective.

Conclusion

In this post we have started from the principle of maximizing rewards and have shown how to progressively add complexity until we arrive at the full GRPO surrogate objective. If we take a step back, we see that GRPO (and PPO) is nothing but an effort to mitigate off-policy behavior.