Motivation: Why On-Policy RL Matters for Modern AI
If you’ve been following the latest developments in large language models (LLMs), you’ve probably heard of GRPO (Group Relative Policy Optimization) - one of the newest techniques for fine-tuning LLMs to follow human preferences. But did you know that GRPO is actually a direct descendant of algorithms developed decades ago for teaching robots to walk and play video games?
The journey from REINFORCE (1992) to PPO (2017) to GRPO (2024) represents one of the most important evolutionary paths in modern AI. Every time ChatGPT gives you a helpful response instead of generating gibberish, it’s thanks to policy gradient methods that can trace their lineage back to the fundamental algorithms we’ll explore in this post.
Why does this matter now?
- LLM Fine-tuning: Modern techniques like RLHF (Reinforcement Learning from Human Feedback) rely heavily on policy gradient methods
- Scalable Training: On-policy methods like PPO power the training of state-of-the-art models across multiple domains
- Foundational Understanding: To work with cutting-edge AI systems, you need to understand how they learn to optimize complex objectives
Modern LLM training typically involves two major phases: supervised learning (training on text data) followed by reinforcement learning (fine-tuning with human feedback). As I discussed in my previous post on RL vs SL gradients, both approaches optimize similar mathematical objectives involving log-probability terms, but RL is fundamentally more challenging because the agent chooses what data it observes. In supervised learning the training data is fixed; in RL the policy’s own actions determine which states and rewards the agent encounters, creating a feedback loop that introduces non-stationarity, high variance, and numerous “knobs” to tune.
This inherent complexity of RL - with all its challenges around data generation, exploration, and credit assignment - is precisely why the algorithmic evolution we’ll explore matters so much. Each algorithm in our journey represents a breakthrough in taming one of RL’s fundamental difficulties.
A Note on On-Policy vs Off-Policy: RL algorithms can be categorized into on-policy (learning from data generated by the current policy) and off-policy (learning from data generated by different policies) methods. While off-policy algorithms like Q-learning and SAC can be more sample-efficient in theory, they tend to be less stable and harder to tune. For this reason, modern LLM fine-tuning with RL predominantly uses on-policy methods like PPO, which provide more reliable and predictable training dynamics - crucial when working with billion-parameter models.
This post will take you on a step-by-step journey through five fundamental on-policy RL algorithms, each building on the previous one’s insights while solving its core limitations. By the end, you’ll understand not just how these algorithms work, but why they evolved the way they did - and why that evolution led directly to the techniques powering today’s AI breakthroughs.
The Complete Learning Journey: 5 Essential Algorithms
Our exploration follows the actual historical development of on-policy RL, where each algorithm emerged to solve specific problems with its predecessors:
- REINFORCE: The foundational policy gradient algorithm
- Actor-Critic Monte Carlo: Adding baselines to reduce variance
- Actor-Critic Temporal Difference: Bootstrapping for sample efficiency
- A2C: Parallel environments for stable learning
- PPO: Trust regions for robust optimization
Each algorithm represents a major breakthrough that solved critical problems preventing practical deployment. Understanding this progression isn’t just academic - it’s essential for anyone working with modern AI systems.
1. REINFORCE: The Foundation of Policy Gradients
1.1 The Core Problem: Learning from Delayed Rewards
Unlike supervised learning where we have immediate feedback (correct labels), reinforcement learning faces the credit assignment problem: which actions in a long sequence actually led to the final reward?
REINFORCE solves this with the Policy Gradient Theorem:
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]\]
Where:
- \(\theta\) are the policy parameters (neural network weights)
- \(G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}\) is the return-to-go (future rewards from time \(t\))
- \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) is the score function (gradient direction)
Intuitive Meaning: “Increase the probability of actions that led to high returns, decrease the probability of actions that led to low returns.”
1.2 The Log-Derivative Trick: Mathematical Foundation
The genius of REINFORCE lies in the log-derivative trick, which allows us to compute gradients of expectations over trajectories:
\[\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau) \nabla_\theta \log \pi_\theta(\tau)]\]
This turns the gradient of an intractable expectation into a sample-based estimator we can actually compute.
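In code, the estimator boils down to a few lines. Below is a minimal sketch assuming a PyTorch training loop; log_probs, returns_to_go, and optimizer are placeholders for objects produced by the surrounding code, not the exact notebook implementation.
import torch

# Sketch of the REINFORCE update for one sampled trajectory.
# log_probs: log pi_theta(a_t | s_t) for each step, still attached to the graph
# returns_to_go: the corresponding G_t values computed from sampled rewards
def reinforce_loss(log_probs, returns_to_go):
    # Negative sign because optimizers minimize; minimizing this ascends J(pi_theta)
    return -(log_probs * returns_to_go).sum()

loss = reinforce_loss(log_probs, returns_to_go)
loss.backward()        # gradient matches the sample-based estimator above
optimizer.step()
optimizer.zero_grad()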
1.3 Implementation: Handling Different Action Spaces
Our implementation works seamlessly with both discrete and continuous action spaces:
Discrete Actions (LunarLander’s 4 actions):
- Policy outputs logits for each action
- Apply Softmax to create probability distribution
- Sample from Categorical distribution
Continuous Actions (LunarLander’s 2D continuous control):
- Policy outputs mean for Gaussian distribution
- Sample from multivariate Normal distribution
- Clip actions to environment bounds
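A minimal sketch of both cases, assuming PyTorch distributions; policy_net, log_std, state, and env are illustrative names rather than the notebook’s exact API:
import torch
from torch.distributions import Categorical, Normal

# Discrete control (e.g. LunarLander-v3's 4 thruster actions)
logits = policy_net(state)                    # raw scores, one per action
dist = Categorical(logits=logits)             # softmax is applied internally
action = dist.sample()
log_prob = dist.log_prob(action)

# Continuous control (e.g. 2D main/side engine throttle)
mean = policy_net(state)                      # Gaussian mean, one per action dimension
dist = Normal(mean, log_std.exp())            # diagonal Gaussian policy
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)      # joint log-prob over action dimensions
low, high = float(env.action_space.low[0]), float(env.action_space.high[0])
action = action.clamp(low, high)              # clip to the environment's bounds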
1.4 The Variance Problem: REINFORCE’s Achilles’ Heel
Despite its theoretical elegance, REINFORCE suffers from extremely high variance:
- Episode returns can vary wildly (LunarLander: -200 to +300)
- Same action in similar states gets vastly different learning signals
- Training curves look chaotic rather than showing smooth improvement
The Solution: Return normalization helps, but we need more sophisticated variance reduction…
2. Actor-Critic Monte Carlo: Adding Smart Baselines
2.1 The Baseline Insight: Subtracting Without Bias
The key insight for reducing variance: subtract a baseline \(b(s_t)\) that doesn’t depend on the action:
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))\right]\]
Why this works: The baseline term has zero expectation but dramatically reduces variance by providing a reference point for each state.
2.2 Two Baseline Strategies
We implement and compare two approaches:
Global Average Baseline: \(b = \bar{G} = \frac{1}{N} \sum_{i=1}^N G_0^{(i)}\)
- Simple running average of all episode returns
- State-independent but easy to implement
- Moderate variance reduction
Learned Value Function Baseline: \(b(s_t) = V_\phi(s_t)\)
- Neural network that estimates expected return from each state
- State-dependent, more sophisticated
- Better variance reduction in theory
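As a rough sketch of how the two baselines enter the advantage computation (value_net, returns_to_go, and episode_returns are assumed names):
# Global average baseline: mean of observed episode returns G_0
baseline = sum(episode_returns) / len(episode_returns)
advantages = returns_to_go - baseline

# Learned value-function baseline: critic prediction V_phi(s_t) per visited state
values = value_net(states).squeeze(-1)
advantages = returns_to_go - values.detach()   # detach: the baseline must not receive policy gradients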
2.3 The Actor-Critic Architecture
Actor-Critic introduces a dual network design:
- Actor \(\pi_\theta(a \mid s)\): The policy network (same as REINFORCE)
- Critic \(V_\phi(s)\): Value function network (new component)
We use a shared architecture for efficiency:
- Shared feature layers process observations for both components
- Separate heads specialize in policy and value estimation
- Joint training with combined loss function
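A minimal sketch of such a shared two-headed network in PyTorch; the layer sizes and activations are arbitrary assumptions, not the notebook’s exact architecture:
import torch
from torch import nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        # Shared feature layers used by both heads
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)   # actor: action logits
        self.value_head = nn.Linear(hidden, 1)               # critic: V_phi(s)

    def forward(self, obs):
        features = self.shared(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)
During training, a combined objective such as actor_loss + 0.5 * critic_loss is backpropagated once, updating both heads and the shared trunk together.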
2.4 Normalization Strategy: A Critical Implementation Detail
Actor-Critic requires careful dual normalization:
Stage 1 - Critic Learning: Normalize returns for stable value function training
returns_normalized = (returns - returns.mean()) / (returns.std() + 1e-8)
critic_loss = mse_loss(value_predictions, returns_normalized)
Stage 2 - Actor Learning: Normalize advantages for stable policy updates
advantages = returns_normalized - value_predictions.detach()
advantages_normalized = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
actor_loss = -(log_probs * advantages_normalized).mean()
2.5 Key Insights from Implementation
The Trade-off: Baselines do help, but the results show that sometimes simpler approaches can be more stable than sophisticated ones. The global average baseline sometimes provides more consistent training than the learned value function, even though the latter is theoretically superior.
Important Finding: Value function baselines properly center advantages around zero, showing better credit assignment, but may introduce additional training complexity.
3. Actor-Critic Temporal Difference: The Bootstrapping Revolution
3.1 From Monte Carlo to Temporal Difference
The next breakthrough: instead of waiting for complete episodes, use bootstrapping with the Bellman Expectation Equation:
\[V^\pi(s_t) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t\right]\]
This enables N-step updates every few timesteps rather than episode-based learning.
3.2 N-Step TD Targets
Instead of full episode returns \(G_t\), we use N-step bootstrapped targets:
\[G_t^{(N)} = \sum_{k=0}^{N-1} \gamma^k r_{t+k+1} + \gamma^N V_\phi(s_{t+N})\]
Key Benefits:
- Lower variance: Bootstrapped estimates more stable than full returns
- Online learning: Update during episodes, not just at the end
- Sample efficiency: Learn from partial episodes
- Faster convergence: More frequent updates accelerate learning
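A sketch of computing these targets from an N-step rollout, assuming PyTorch tensors of rewards and done flags plus the critic’s estimate of the state that follows the rollout:
import torch

def n_step_targets(rewards, dones, next_value, gamma=0.99):
    # rewards, dones: tensors of length N from the rollout; next_value: V_phi(s_{t+N})
    targets = torch.zeros_like(rewards)
    bootstrap = next_value
    for k in reversed(range(len(rewards))):
        # Do not bootstrap across episode boundaries
        bootstrap = rewards[k] + gamma * bootstrap * (1.0 - dones[k])
        targets[k] = bootstrap
    return targets   # targets[0] is exactly G_t^{(N)} for the first rollout step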
3.3 The Bias-Variance Tradeoff
Monte Carlo (previous methods):
- ✅ Unbiased: \(\mathbb{E}_\pi[G_t \mid s_t] = V^\pi(s_t)\) exactly
- ❌ High Variance: Episode returns vary dramatically
Temporal Difference:
- ❌ Biased: \(\mathbb{E}_\pi[r_{t+1} + \gamma V_\phi(s_{t+1}) \mid s_t] \neq V^\pi(s_t)\) when \(V_\phi\) is inaccurate
- ✅ Lower Variance: Single-step rewards plus learned estimates
The Key: Early in training our value estimates are wrong (bias), but as training progresses, \(V\) becomes accurate while maintaining low variance.
3.4 Critical Implementation Changes
Normalization Strategy Shift:
- NO normalization of TD targets for critic learning (breaks value function meaning)
- YES normalization of advantages for actor learning (still essential)
Gradient Clipping: TD methods benefit from gradient clipping for stability due to bootstrapping feedback loops.
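In PyTorch this is a single call between the backward pass and the optimizer step; the max_norm value below is a common choice, not a prescription:
loss.backward()
torch.nn.utils.clip_grad_norm_(agent.parameters(), max_norm=0.5)   # cap the global gradient norm
optimizer.step()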
3.5 The Sample Efficiency Breakthrough
The most striking difference from previous methods is the update frequency revolution:
- Previous methods (REINFORCE, AC-MC): 1 update per episode (1,000 total updates)
- Actor-Critic TD: ~95 updates per episode (95,000+ total updates!)
This demonstrates TD learning’s massive improvement in sample efficiency, though it comes with increased training instability that must be carefully managed.
4. A2C: Parallel Environments for Stability
4.1 Learning from A3C’s Lessons
A3C (Asynchronous Advantage Actor-Critic) introduced parallel data collection but with complex asynchronous updates. A2C keeps the parallel insight but simplifies execution:
- Multiple environments collect experience simultaneously
- Synchronous updates eliminate threading complexity
- Batch learning improves GPU utilization
- Deterministic training enables reproducible results
4.2 The Parallelization Strategy
Instead of collecting N-step experience from a single environment, A2C collects N-step experience from E environments simultaneously.
Benefits:
- Larger effective batch size: \(N \times E\) samples per update
- Gradient averaging: Reduces variance across environments
- Decorrelated experience: Different environments provide diverse data
- Faster wall-clock time: Parallel data collection
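For instance, Gymnasium’s synchronous vector API can run E copies of LunarLander in lock-step; a sketch, assuming gymnasium with Box2D is installed:
import gymnasium as gym

E = 8   # number of parallel environments
vec_env = gym.vector.SyncVectorEnv(
    [lambda: gym.make("LunarLander-v3") for _ in range(E)]
)
states, infos = vec_env.reset()                 # states has shape (E, obs_dim)
actions = vec_env.action_space.sample()         # one action per environment
states, rewards, terms, truncs, infos = vec_env.step(actions)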
4.3 Key Algorithm Changes
Data Collection:
# Previous: Single environment, N steps
for step in range(N):
    action = agent.select_action(state)
    state, reward, done = env.step(action)

# A2C: Vectorized environments, all E environments step simultaneously
for step in range(N):
    actions = agent.select_actions(states)           # Get actions for all E environments
    states, rewards, dones = vec_env.step(actions)   # Step all environments at once
    agent.store_rewards(rewards, dones)              # Store data from all environments
Advantage Calculation:
- Normalize advantages across all environments
- Larger sample size for more stable normalization
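Concretely, the (N, E) advantage batch is flattened and normalized as a single population (a sketch):
# advantages: tensor of shape (N, E) - N steps collected from each of E environments
flat = advantages.flatten()
advantages = (flat - flat.mean()) / (flat.std() + 1e-8)   # one normalization over all N*E samples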
4.4 The Stability Benefits
A2C demonstrates how parallel environment collection significantly improves learning stability while maintaining the sample efficiency benefits of TD learning. The key insight is that averaging gradients across multiple diverse environments acts as a natural form of regularization.
5. PPO: Trust Regions and Robust Optimization
5.1 The Trust Region Insight
All previous methods suffer from a fundamental problem: how much should we change the policy in one update?
PPO solves this with clipped probability ratios:
\[r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\]
Clipped objective: \(L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\)
5.2 Why Clipping Works
The Problem: Large policy updates can be destructive
- Good data collected under old policy \(\pi_{\theta_{\text{old}}}\)
- Large updates create new policy \(\pi_\theta\) very different from old
- Data becomes off-policy and unreliable
PPO’s Solution:
- Allow small improvements (\(r_t \approx 1\))
- Clip large changes that would make data unreliable
- Conservative updates ensure stable progress
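A minimal sketch of the clipped surrogate in PyTorch; the log-probabilities are assumed to be evaluated on the same sampled actions under the old and current policies:
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic minimum of the two surrogates; negated because optimizers minimize
    return -torch.min(unclipped, clipped).mean()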
5.3 Multiple Epochs: Squeezing More from Data
PPO enables multiple epochs on the same data:
for epoch in range(K):   # K = 4 typically
    for batch in minibatches:
        # Compute clipped objective
        # Update both actor and critic
Why this works: Clipping prevents destructive updates, so we can safely reuse data multiple times.
5.4 The Complete PPO Recipe
- Collect trajectories using current policy
- Compute advantages using GAE (Generalized Advantage Estimation)
- Update policy and value function for K epochs using clipped objectives
- Repeat
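Step 2 is a short backward recursion over the rollout. A sketch of GAE under the usual assumptions (gamma is the discount factor, lam the GAE parameter, and dones marks episode boundaries):
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, lam=0.95):
    # values: V_phi(s_t) for each rollout step; next_value: V_phi for the state after the rollout
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next * (1.0 - dones[t]) - values[t]   # TD error
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae                   # discounted sum of TD errors
        advantages[t] = gae
    return advantages, advantages + values   # advantages and the corresponding value targets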
5.5 Connection to Modern LLM Training
PPO’s design principles directly influence modern LLM fine-tuning:
RLHF (Reinforcement Learning from Human Feedback):
- Use PPO to optimize language models for human preferences
- Reward model provides signal instead of environment rewards
- Same trust region principles prevent catastrophic policy changes
GRPO (Group Relative Policy Optimization):
- A variant of PPO specifically designed for large language model (LLM) fine-tuning
- Removes the critic (value function) entirely — no value network is trained
- This is because LLM reward signals are:
- Sparse: only given at the end of a generated sequence
- Delayed: no intermediate token-level rewards
- Non-Markovian: the reward depends on the full output sequence, not just the current token or state, so token-level value estimation becomes unreliable and the assumptions behind value-function learning in traditional RL break down
- Instead of estimating advantages with a learned value function, GRPO:
- Samples multiple outputs per prompt (a “group”)
- Uses the group’s average reward as a baseline
- Computes relative advantage: \(\hat{A}_i = \frac{r_i - \bar{r}}{\sigma_r}\)
- Eliminates the instability and compute overhead of value training
- Aligns naturally with how reward models are trained (via output comparisons)
- Achieves strong results in math-heavy settings, as demonstrated by DeepSeekMath
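The group-relative advantage itself takes only a couple of lines; a sketch for one prompt’s group of sampled completions (the reward values are made up for illustration):
import torch

# One scalar reward per sampled completion for the same prompt
rewards = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # \hat{A}_i for each completion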
6. The Evolution: Understanding the Why
6.1 The Fundamental Challenges
Each algorithm in our journey addresses specific problems:
- High Variance → Baselines (Actor-Critic MC)
- Sample Inefficiency → Bootstrapping (Actor-Critic TD)
- Gradient Variance → Parallel Environment Averaging (A2C)
- Destructive Updates → Trust Regions (PPO)
6.2 The Interconnected Nature of RL Problems
What makes RL particularly challenging is that these problems are deeply interconnected:
- High variance leads to instability
- Instability requires conservative updates
- Conservative updates need good data utilization
- Data utilization must balance on-policy constraints
PPO elegantly addresses this entire web of challenges simultaneously.
7. From Game Playing to Language Models
7.1 The Abstraction Bridge
The beauty of policy gradient methods is their generality:
Game Playing (LunarLander):
- State: Position, velocity, angle (8D vector)
- Action: Thruster controls (discrete or continuous)
- Reward: Landing success, fuel efficiency
Language Modeling (LLMs):
- State: Context tokens (high-dimensional embeddings)
- Action: Next token selection (discrete distribution)
- Reward: Human preference scores, helpfulness ratings
Same Mathematics: The policy gradient theorem applies equally to both domains!
7.2 Modern Scaling Innovations
Modern LLM training scales these principles:
Massive Parallelism: A2C’s parallel environments → thousands of distributed workers
Trust Regions: PPO’s clipping → careful update constraints for billion-parameter models
Value Functions: Critic networks → reward models trained on human preferences
7.3 Why This Historical Perspective Matters
Understanding the evolutionary path from REINFORCE to PPO illuminates:
- Design Principles: Why modern algorithms make specific choices
- Failure Modes: What problems you’ll encounter and how to diagnose them
- Future Directions: How current limitations might be addressed
When you encounter issues training modern AI systems, the debugging insights from these foundational algorithms remain invaluable.
8. Practical Insights and Implementation Details
8.1 Critical Implementation Choices
Our journey reveals several make-or-break implementation details:
Normalization Strategy:
- Returns: Normalize in MC methods, NOT in TD methods
- Advantages: Always normalize for stable policy updates
- Timing: When and how you normalize dramatically affects performance
Network Architecture:
- Shared vs. separate networks for actor-critic
- Layer normalization (generally helpful) vs. batch normalization (often harmful in RL)
- Activation functions and initialization strategies
Hyperparameter Sensitivity:
- Learning rates must be carefully tuned per algorithm
- Discount factors affect long-term vs. short-term behavior
- Gradient clipping thresholds prevent training instability
8.2 Debugging RL: What the Losses Don’t Tell You
Critical Insight: Unlike supervised learning, RL losses don’t directly indicate performance:
- High policy loss might mean active learning (good!)
- Low critic loss might mean overfitting to wrong targets
- Focus on episode scores, gradient stability, and advantage statistics
8.3 The Hardware Evolution Story
Our algorithms also reflect the hardware landscape evolution:
- REINFORCE (1992): CPU-era, simple algorithms
- Actor-Critic (2000s): Multi-core CPUs, parallel data processing
- A2C/PPO (2010s): GPU era, batch processing and parallel environments
Modern LLM training continues this trend with distributed computing across thousands of GPUs.
9. Conclusion
Rather than overwhelming you with experimental numbers and charts, I encourage you to dive into the interactive notebooks where you can:
- Run the algorithms yourself on LunarLander-v3
- Experiment with hyperparameters to see their effects
- Visualize training dynamics in real-time
- Compare different approaches side-by-side
- Debug implementation details step-by-step
The Fundamental Insight: Understanding how we got from REINFORCE to PPO provides the conceptual foundation for understanding where AI is heading next.
The algorithms we’ve explored aren’t just historical curiosities - they’re the living foundation of modern AI. Every time you interact with ChatGPT, watch a robot learn to walk, or see an AI system master a new task, you’re witnessing these principles in action.
Perhaps most importantly, the debugging intuition and implementation insights from these foundational algorithms remain directly applicable to cutting-edge AI research. The variance problems we solved in LunarLander are the same ones engineers face when training language models with trillions of parameters.
The complete implementation and detailed notebooks are available on GitHub.
📚 References
- Simple statistical gradient-following algorithms for connectionist reinforcement learning
- Human-level Control through Deep Reinforcement Learning
- Rainbow: Combining Improvements in Deep Reinforcement Learning
- Actor-Critic Reinforcement Learning for Control with Stability Guarantees
- Asynchronous Methods for Deep Reinforcement Learning
- Continuous Control with Deep Reinforcement Learning
- Addressing Function Approximation Error in Actor-Critic Methods
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- Proximal Policy Optimization Algorithms
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models