In this post, I will theoretically prove that two common implementation tricks in PPO – reward whitening and reward normalization – are unnecessary and can be emulated by adjusting other free parameters.

Preliminaries of PPO

For simplicity, let’s consider a single instance of prompt-response. We denote the response as $a_1 … a_T$, where $a_T = \text{</s>}$. The reward model (RM) assigns a sequence-level reward $R$ to this instance, and by contrasting the policy with the reference policy we obtain the token-level KL penalty $k_1 … k_T$, where $k_t = -\beta \cdot \log \frac{p_\theta (a_t | s_t)}{p_{\theta_0} (a_t | s_t)}$. The token-level reward $r_1 … r_T$ is then \(r_t = \begin{cases} k_t + R & \text{if } t = T \\ k_t & \text{if } t < T \end{cases}\)

Then the empirical return $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, and the advantage $A_t = G_t - V(s_t)$. From these we can compute the PPO losses. The policy loss $L_P = \sum_{t=1}^{T} f(A_t)$ is some function of the advantages, which are whitened across the batch before entering the PPO objective, and the value loss $L_V = \sum_{t=1}^{T} \frac{1}{2} (G_t - V(s_t))^2 = \sum_{t=1}^{T} \frac{1}{2} (A_t)^2$.
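To make the bookkeeping concrete, here is a minimal sketch of these quantities for a single response; it is not TRL’s implementation, and the log-probabilities, values, $\gamma$, $\beta$, and $R$ below are made-up placeholders.

import torch

torch.manual_seed(0)
T = 5                     # response length; a_T is </s>
beta, gamma = 0.1, 1.0    # KL penalty coefficient and discount factor
R = 0.7                   # sequence-level reward from the RM

logp_policy = -torch.rand(T)   # stand-in for log p_theta(a_t | s_t)
logp_ref = -torch.rand(T)      # stand-in for log p_theta_0(a_t | s_t)
values = torch.randn(T)        # stand-in for V(s_t)

# Token-level reward: KL penalty at every token, plus R on the final token.
k = -beta * (logp_policy - logp_ref)
r = k.clone()
r[-1] += R

# Empirical return G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed backwards.
G = torch.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = r[t] + gamma * running
    G[t] = running

# Advantage, whitened advantage (here the "batch" is just this one response),
# and the value loss.
A = G - values
A_whitened = (A - A.mean()) / (A.std() + 1e-8)
value_loss = 0.5 * (A ** 2).sum()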

Reward whitening

The reward whitening trick applies an affine transformation to the token-level rewards, such that their standard deviation within the batch becomes 1 while their mean is preserved.

Citing TRL’s implementation:

# https://github.com/huggingface/trl/blob/v0.8.3/trl/trainer/ppo_trainer.py#L1156
rewards = masked_whiten(rewards, mask, shift_mean=False)

# https://github.com/huggingface/trl/blob/v0.8.3/trl/core.py#L179-L185
def masked_whiten(values: torch.Tensor, mask: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    """Whiten values with masked values."""
    mean, var = masked_mean(values, mask), masked_var(values, mask)
    whitened = (values - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened += mean
    return whitened

Essentially, what it does is replace $r_t$ with $\tilde{r}_t = w \cdot r_t + b$ for some scalars $w$ and $b$. Consequently, the empirical return \(\tilde{G}_t = \sum_{t'=t}^{T} \gamma^{t'-t} \tilde{r}_{t'} = \sum_{t'=t}^{T} \gamma^{t'-t} (w \cdot r_{t'} + b) = w \cdot \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} + b \cdot \sum_{t'=t}^{T} \gamma^{t'-t} = w \cdot G_t + b'\) also undergoes an affine transformation.
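As a quick sanity check, the following sketch (assuming an all-ones mask, so plain mean and variance stand in for the masked versions) confirms that the shift_mean=False branch is exactly such an affine map, with $w = 1 / \sqrt{\mathrm{var} + 10^{-8}}$ and $b = \mathrm{mean} \cdot (1 - w)$:

import torch

rewards = torch.tensor([0.2, -0.1, 0.05, 0.0, 1.3])
mean, var = rewards.mean(), rewards.var(unbiased=False)

# The shift_mean=False path: whiten, then add the mean back.
whitened = (rewards - mean) * torch.rsqrt(var + 1e-8) + mean

# The same operation written as an affine transformation w * r + b.
w = torch.rsqrt(var + 1e-8)
b = mean * (1 - w)
assert torch.allclose(whitened, w * rewards + b)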

Now suppose that we apply a corresponding affine transformation to the value function, such that $\tilde{V}(s_t) = w \cdot V(s_t) + b'$. Then the advantage becomes $\tilde{A}_t = \tilde{G}_t - \tilde{V}(s_t) = w \cdot A_t$. For the policy loss, the scaling factor between $A_t$ and $\tilde{A}_t$ is wiped out by the advantage whitening trick. For the value loss, the resulting factor of $w^2$ can be absorbed into the value loss coefficient $\alpha$.
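Spelling out the substitution with the definitions above:

\(\tilde{A}_t = \tilde{G}_t - \tilde{V}(s_t) = (w \cdot G_t + b') - (w \cdot V(s_t) + b') = w \cdot A_t, \qquad \tilde{L}_V = \sum_{t=1}^{T} \frac{1}{2} (\tilde{A}_t)^2 = w^2 \cdot L_V\)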

Therefore, the effect of reward whitening can be emulated by properly learning the value function and adjusting the hyperparameters.

Reward normalization

The reward normalization trick applies an affine transformation to the sequence-level reward (i.e., the RM output), such that it has a population mean of 0 and a standard deviation of 1 across all instances, before it is combined with the token-level KL penalty. Essentially, what it does is replace $R$ with $\tilde{R} = w \cdot R + b$ for some scalars $w$ and $b$.

Now suppose that, alongside the normalization, we also scale the KL penalty coefficient by $w$, such that $\tilde{\beta} = w \cdot \beta$ (equivalently, a run with reward normalization and coefficient $\tilde{\beta}$ emulates an unnormalized run with $\beta = \tilde{\beta} / w$). Then the token-level rewards become \(\tilde{r}_t = \begin{cases} w \cdot r_t + b & \text{if } t = T \\ w \cdot r_t & \text{if } t < T \end{cases}\)
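To see the $t = T$ case explicitly (for $t < T$ it is simply $\tilde{r}_t = \tilde{k}_t = w \cdot k_t = w \cdot r_t$):

\(\tilde{r}_T = \tilde{k}_T + \tilde{R} = w \cdot k_T + (w \cdot R + b) = w \cdot (k_T + R) + b = w \cdot r_T + b\)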

Consequently, the empirical return \(\tilde{G}_t = \sum_{t'=t}^{T} \gamma^{t'-t} \tilde{r}_{t'} = w \cdot G_t + \gamma^{T-t} \cdot b\) also undergoes an affine transformation (the offset depends on $t$, which is fine for the same reason as in footnote 2). This reduces to our argument in the reward whitening section, and thus the effect of reward normalization can also be emulated.
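This can also be checked numerically. The toy sketch below (not TRL code; all constants are made up) pairs the normalized reward and the rescaled KL coefficient with a value function shifted by the position-dependent offset $\gamma^{T-t} \cdot b$ (the analogue of $b'$ from the whitening section), and the resulting advantages come out exactly $w$ times the original ones.

import torch

def returns(r, gamma):
    # G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed backwards.
    G, running = torch.zeros_like(r), 0.0
    for t in reversed(range(len(r))):
        running = r[t] + gamma * running
        G[t] = running
    return G

T, gamma, beta, R, w, b = 5, 0.99, 0.1, 0.7, 0.5, -0.2
log_ratio = -torch.rand(T)   # stand-in for log(p_theta / p_theta_0)
values = torch.randn(T)      # stand-in for V(s_t)

# Original run.
r = -beta * log_ratio
r[-1] += R
A = returns(r, gamma) - values

# Normalized run: R -> w * R + b, beta -> w * beta, and the value function
# shifted by gamma^(T - t) * b (written here with 0-indexed positions).
r_tilde = -(w * beta) * log_ratio
r_tilde[-1] += w * R + b
offsets = b * gamma ** torch.arange(T - 1, -1, -1, dtype=torch.float32)
A_tilde = returns(r_tilde, gamma) - (w * values + offsets)

assert torch.allclose(A_tilde, w * A)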

Footnotes

I made several simplifications in the above argument, and in reality things get a bit more complicated, but overall the argument should still hold.

  1. PPO uses generalized advantage estimation (GAE), where there’s a factor $\lambda$ in the advantage computation. This matches our derivation exactly when $\lambda = 1.0$, while in practice people often set $\lambda = 0.95$. However, our argument still holds, since the GAE advantage (and the corresponding return target) is still a linear function of the token-level rewards and the value function.
  2. In reward whitening, the proposed offset in the value function’s affine transformation is $b' = \frac{1 - \gamma^{T-t+1}}{1 - \gamma} \cdot b$, which depends on $t$ (see the short derivation after this list). This should be fine, since it is conceivable that the value model can learn to adjust the transformation coefficients based on the position, and to roughly predict how many tokens the policy model will be generating.
  3. When doing multiple gradient updates within each rollout batch (which can happen when setting ppo_epochs > 1 or backward_batch_size < rollout_batch_size), the $V(s_t)$ in the value loss may have drifted in later mini-batches, and thus $G_t - V(s_t)$ is no longer equal to $A_t$. However, this doesn’t interfere with our proof.
  4. In reward whitening, the affine transformation coefficients are batch-specific. This problem can be mitigated as long as we’re training with a large enough batch size, such that the batch statistics are close to the population statistics.
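For reference, the offset in footnote 2 is just the discounted sum of the per-token shift $b$ that whitening adds at every position:

\(b' = \sum_{t'=t}^{T} \gamma^{t'-t} \cdot b = b \cdot \left( 1 + \gamma + \dots + \gamma^{T-t} \right) = \frac{1 - \gamma^{T-t+1}}{1 - \gamma} \cdot b\)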