Motivation: Why On-Policy RL Matters for Modern AI
If you’ve been following the latest developments in large language models (LLMs), you’ve probably heard of GRPO (Group Relative Policy Optimization) - one of the newest techniques for fine-tuning LLMs to follow human preferences. But did you know that GRPO is actually a direct descendant of algorithms developed decades ago for teaching robots to walk and play video games?
The journey from REINFORCE (1992) to PPO (2017) to GRPO (2024) represents one of the most important evolutionary paths in modern AI. Every time ChatGPT gives you a helpful response instead of generating gibberish, it’s thanks to policy gradient methods that can trace their lineage back to the fundamental algorithms we’ll explore in this post.
Why does this matter now?
- LLM Fine-tuning: Modern techniques like RLHF (Reinforcement Learning from Human Feedback) rely heavily on policy gradient methods
- Scalable Training: On-policy methods like PPO power the training of state-of-the-art models across multiple domains
- Foundational Understanding: To work with cutting-edge AI systems, you need to understand how they learn to optimize complex objectives
Modern LLM training typically involves two major phases: supervised learning (training on text data) followed by reinforcement learning (fine-tuning with human feedback). As I discussed in my previous post on RL vs SL gradients, both approaches optimize similar mathematical objectives involving log-probability terms, but RL is fundamentally more challenging because the agent chooses what data it observes. In supervised learning the training data is fixed; in RL the policy’s own actions determine which states and rewards the agent encounters, creating a feedback loop that introduces non-stationarity, high variance, and numerous “knobs” to tune.
This inherent complexity of RL - with all its challenges around data generation, exploration, and credit assignment - is precisely why the algorithmic evolution we’ll explore matters so much. Each algorithm in our journey represents a breakthrough in taming one of RL’s fundamental difficulties.
A Note on On-Policy vs Off-Policy: RL algorithms can be categorized into on-policy (learning from data generated by the current policy) and off-policy (learning from data generated by different policies) methods. While off-policy algorithms like Q-learning and SAC can be more sample-efficient in theory, they tend to be less stable and harder to tune. For this reason, modern LLM fine-tuning with RL predominantly uses on-policy methods like PPO, which provide more reliable and predictable training dynamics - crucial when working with billion-parameter models.
This post will take you on a step-by-step journey through five fundamental on-policy RL algorithms, each building on the previous one’s insights while solving its core limitations. By the end, you’ll understand not just how these algorithms work, but why they evolved the way they did - and why that evolution led directly to the techniques powering today’s AI breakthroughs.
The Complete Learning Journey: 5 Essential Algorithms
Our exploration follows the actual historical development of on-policy RL, where each algorithm emerged to solve specific problems with its predecessors:
- REINFORCE: The foundational policy gradient algorithm
- Actor-Critic Monte Carlo: Adding baselines to reduce variance
- Actor-Critic Temporal Difference: Bootstrapping for sample efficiency
- A2C: Parallel environments for stable learning
- PPO: Trust regions for robust optimization
Each algorithm represents a major breakthrough that solved critical problems preventing practical deployment. Understanding this progression isn’t just academic - it’s essential for anyone working with modern AI systems.
1. REINFORCE: The Foundation of Policy Gradients
1.1 The Core Problem: Learning from Delayed Rewards
Unlike supervised learning where we have immediate feedback (correct labels), reinforcement learning faces the credit assignment problem: which actions in a long sequence actually led to the final reward?
REINFORCE solves this with the Policy Gradient Theorem:
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]\]
Where:
- \(\theta\) are the policy parameters (neural network weights)
- \(G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}\) is the return-to-go (future rewards from time \(t\))
- \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) is the score function (gradient direction)
Intuitive Meaning: “Increase the probability of actions that led to high returns, decrease the probability of actions that led to low returns.”
1.2 The Log-Derivative Trick: Mathematical Foundation
The genius of REINFORCE lies in the log-derivative trick, which allows us to compute gradients of expectations over trajectories:
\[\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau) \nabla_\theta \log \pi_\theta(\tau)]\]
This turns the gradient of an intractable expectation into a sample-based estimator we can actually compute.
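In code, the estimator boils down to a few lines. Below is a minimal sketch assuming a PyTorch training loop; log_probs, returns_to_go, and optimizer are placeholders for objects produced by the surrounding code, not the exact notebook implementation.
import torch

# Sketch of the REINFORCE update for one sampled trajectory.
# log_probs: log pi_theta(a_t | s_t) for each step, still attached to the graph
# returns_to_go: the corresponding G_t values computed from sampled rewards
def reinforce_loss(log_probs, returns_to_go):
    # Negative sign because optimizers minimize; minimizing this ascends J(pi_theta)
    return -(log_probs * returns_to_go).sum()

loss = reinforce_loss(log_probs, returns_to_go)
loss.backward()        # gradient matches the sample-based estimator above
optimizer.step()
optimizer.zero_grad()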
1.3 Implementation: Handling Different Action Spaces
Our implementation works seamlessly with both discrete and continuous action spaces:
Discrete Actions (LunarLander’s 4 actions):
- Policy outputs logits for each action
- Apply Softmax to create probability distribution
- Sample from Categorical distribution
Continuous Actions (LunarLander’s 2D continuous control):
- Policy outputs mean for Gaussian distribution
- Sample from multivariate Normal distribution
- Clip actions to environment bounds
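A minimal sketch of both cases, assuming PyTorch distributions; policy_net, log_std, state, and env are illustrative names rather than the notebook’s exact API:
import torch
from torch.distributions import Categorical, Normal

# Discrete control (e.g. LunarLander-v3's 4 thruster actions)
logits = policy_net(state)                    # raw scores, one per action
dist = Categorical(logits=logits)             # softmax is applied internally
action = dist.sample()
log_prob = dist.log_prob(action)

# Continuous control (e.g. 2D main/side engine throttle)
mean = policy_net(state)                      # Gaussian mean, one per action dimension
dist = Normal(mean, log_std.exp())            # diagonal Gaussian policy
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)      # joint log-prob over action dimensions
low, high = float(env.action_space.low[0]), float(env.action_space.high[0])
action = action.clamp(low, high)              # clip to the environment's bounds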
1.4 The Variance Problem: REINFORCE’s Achilles’ Heel
Despite its theoretical elegance, REINFORCE suffers from extremely high variance:
- Episode returns can vary wildly (LunarLander: -200 to +300)
- Same action in similar states gets vastly different learning signals
- Training curves look chaotic rather than showing smooth improvement
The Solution: Return normalization helps, but we need more sophisticated variance reduction…
2. Actor-Critic Monte Carlo: Adding Smart Baselines
2.1 The Baseline Insight: Subtracting Without Bias
The key insight for reducing variance: subtract a baseline \(b(s_t)\) that doesn’t depend on the action:
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))\right]\]
Why this works: The baseline term has zero expectation but dramatically reduces variance by providing a reference point for each state.
2.2 Two Baseline Strategies
We implement and compare two approaches:
Global Average Baseline: \(b = \bar{G} = \frac{1}{N} \sum_{i=1}^N G_0^{(i)}\)
- Simple running average of all episode returns
- State-independent but easy to implement
- Moderate variance reduction
Learned Value Function Baseline: \(b(s_t) = V_\phi(s_t)\)
- Neural network that estimates expected return from each state
- State-dependent, more sophisticated
- Better variance reduction in theory
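As a rough sketch of how the two baselines enter the advantage computation (value_net, returns_to_go, and episode_returns are assumed names):
# Global average baseline: mean of observed episode returns G_0
baseline = sum(episode_returns) / len(episode_returns)
advantages = returns_to_go - baseline

# Learned value-function baseline: critic prediction V_phi(s_t) per visited state
values = value_net(states).squeeze(-1)
advantages = returns_to_go - values.detach()   # detach: the baseline must not receive policy gradients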
2.3 The Actor-Critic Architecture
Actor-Critic introduces a dual network design:
- Actor \(\pi_\theta(a \mid s)\): The policy network (same as REINFORCE)
- Critic \(V_\phi(s)\): Value function network (new component)
We use a shared architecture for efficiency:
- Shared feature layers process observations for both components
- Separate heads specialize in policy and value estimation
- Joint training with combined loss function
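A minimal sketch of such a shared two-headed network in PyTorch; the layer sizes and activations are arbitrary assumptions, not the notebook’s exact architecture:
import torch
from torch import nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        # Shared feature layers used by both heads
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)   # actor: action logits
        self.value_head = nn.Linear(hidden, 1)               # critic: V_phi(s)

    def forward(self, obs):
        features = self.shared(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)
During training, a combined objective such as actor_loss + 0.5 * critic_loss is backpropagated once, updating both heads and the shared trunk together.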
2.4 Normalization Strategy: A Critical Implementation Detail
Actor-Critic requires careful dual normalization:
Stage 1 - Critic Learning: Normalize returns for stable value function training
returns_normalized = (returns - returns.mean()) / (returns.std() + 1e-8)
critic_loss = mse_loss(value_predictions, returns_normalized)
Stage 2 - Actor Learning: Normalize advantages for stable policy updates
advantages = returns_normalized - value_predictions.detach()
advantages_normalized = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
actor_loss = -(log_probs * advantages_normalized).mean()
2.5 Key Insights from Implementation
The Trade-off: Baselines do help, but the results show that sometimes simpler approaches can be more stable than sophisticated ones. The global average baseline sometimes provides more consistent training than the learned value function, even though the latter is theoretically superior.
Important Finding: Value function baselines properly center advantages around zero, showing better credit assignment, but may introduce additional training complexity.
3. Actor-Critic Temporal Difference: The Bootstrapping Revolution
3.1 From Monte Carlo to Temporal Difference
The next breakthrough: instead of waiting for complete episodes, use bootstrapping with the Bellman Expectation Equation:
\[V^\pi(s_t) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t\right]\]
This enables N-step updates every few timesteps rather than episode-based learning.
3.2 N-Step TD Targets
Instead of full episode returns \(G_t\), we use N-step bootstrapped targets:
\[G_t^{(N)} = \sum_{k=0}^{N-1} \gamma^k r_{t+k+1} + \gamma^N V_\phi(s_{t+N})\]
Key Benefits:
- Lower variance: Bootstrapped estimates more stable than full returns
- Online learning: Update during episodes, not just at the end
- Sample efficiency: Learn from partial episodes
- Faster convergence: More frequent updates accelerate learning
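A sketch of computing these targets from an N-step rollout, assuming PyTorch tensors of rewards and done flags plus the critic’s estimate of the state that follows the rollout:
import torch

def n_step_targets(rewards, dones, next_value, gamma=0.99):
    # rewards, dones: tensors of length N from the rollout; next_value: V_phi(s_{t+N})
    targets = torch.zeros_like(rewards)
    bootstrap = next_value
    for k in reversed(range(len(rewards))):
        # Do not bootstrap across episode boundaries
        bootstrap = rewards[k] + gamma * bootstrap * (1.0 - dones[k])
        targets[k] = bootstrap
    return targets   # targets[0] is exactly G_t^{(N)} for the first rollout step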
3.3 The Bias-Variance Tradeoff
Monte Carlo (previous methods):
- ✅ Unbiased: \(\mathbb{E}_\pi[G_t \mid s_t] = V^\pi(s_t)\) exactly
- ❌ High Variance: Episode returns vary dramatically
Temporal Difference:
- ❌ Biased: \(\mathbb{E}_\pi[r_{t+1} + \gamma V_\phi(s_{t+1}) \mid s_t] \neq V^\pi(s_t)\) when \(V_\phi\) is inaccurate
- ✅ Lower Variance: Single-step rewards plus learned estimates
The Key: Early in training our value estimates are wrong (bias), but as training progresses, \(V\) becomes accurate while maintaining low variance.
3.4 Critical Implementation Changes
Normalization Strategy Shift:
- NO normalization of TD targets for critic learning (breaks value function meaning)
- YES normalization of advantages for actor learning (still essential)
Gradient Clipping: TD methods benefit from gradient clipping for stability due to bootstrapping feedback loops.
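In PyTorch this is a single call between the backward pass and the optimizer step; the max_norm value below is a common choice, not a prescription:
loss.backward()
torch.nn.utils.clip_grad_norm_(agent.parameters(), max_norm=0.5)   # cap the global gradient norm
optimizer.step()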
3.5 The Sample Efficiency Breakthrough
The most striking difference from previous methods is the update frequency revolution:
- Previous methods (REINFORCE, AC-MC): 1 update per episode (1,000 total updates)
- Actor-Critic TD: ~95 updates per episode (95,000+ total updates!)
This demonstrates TD learning’s massive improvement in sample efficiency, though it comes with increased training instability that must be carefully managed.
4. A2C: Parallel Environments for Stability
4.1 Learning from A3C’s Lessons
A3C (Asynchronous Advantage Actor-Critic) introduced parallel data collection but with complex asynchronous updates. A2C keeps the parallel insight but simplifies execution:
- Multiple environments collect experience simultaneously
- Synchronous updates eliminate threading complexity
- Batch learning improves GPU utilization
- Deterministic training enables reproducible results
4.2 The Parallelization Strategy
Instead of collecting N-step experience from a single environment, A2C collects N-step experience from E environments simultaneously.
Benefits:
- Larger effective batch size: \(N \times E\) samples per update
- Gradient averaging: Reduces variance across environments
- Decorrelated experience: Different environments provide diverse data
- Faster wall-clock time: Parallel data collection
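For instance, Gymnasium’s synchronous vector API can run E copies of LunarLander in lock-step; a sketch, assuming gymnasium with Box2D is installed:
import gymnasium as gym

E = 8   # number of parallel environments
vec_env = gym.vector.SyncVectorEnv(
    [lambda: gym.make("LunarLander-v3") for _ in range(E)]
)
states, infos = vec_env.reset()                 # states has shape (E, obs_dim)
actions = vec_env.action_space.sample()         # one action per environment
states, rewards, terms, truncs, infos = vec_env.step(actions)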
4.3 Key Algorithm Changes
Data Collection:
# Previous: Single environment, N steps
for step in range(N):
    action = agent.select_action(state)
    state, reward, done = env.step(action)

# A2C: Vectorized environments, all E environments step simultaneously
for step in range(N):
    actions = agent.select_actions(states)           # Get actions for all E environments
    states, rewards, dones = vec_env.step(actions)   # Step all environments at once
    agent.store_rewards(rewards, dones)              # Store data from all environments
Advantage Calculation:
- Normalize advantages across all environments
- Larger sample size for more stable normalization
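Concretely, the (N, E) advantage batch is flattened and normalized as a single population (a sketch):
# advantages: tensor of shape (N, E) - N steps collected from each of E environments
flat = advantages.flatten()
advantages = (flat - flat.mean()) / (flat.std() + 1e-8)   # one normalization over all N*E samples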
4.4 The Stability Benefits
A2C demonstrates how parallel environment collection significantly improves learning stability while maintaining the sample efficiency benefits of TD learning. The key insight is that averaging gradients across multiple diverse environments acts as a natural form of regularization.
5. PPO: Trust Regions and Robust Optimization
5.1 The Trust Region Insight
All previous methods suffer from a fundamental problem: how much should we change the policy in one update?
PPO solves this with clipped probability ratios:
\[r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\]
Clipped objective: \(L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\)
5.2 Why Clipping Works
The Problem: Large policy updates can be destructive
- Good data collected under old policy \(\pi_{\theta_{\text{old}}}\)
- Large updates create new policy \(\pi_\theta\) very different from old
- Data becomes off-policy and unreliable
PPO’s Solution:
- Allow small improvements (\(r_t \approx 1\))
- Clip large changes that would make data unreliable
- Conservative updates ensure stable progress
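A minimal sketch of the clipped surrogate in PyTorch; the log-probabilities are assumed to be evaluated on the same sampled actions under the old and current policies:
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic minimum of the two surrogates; negated because optimizers minimize
    return -torch.min(unclipped, clipped).mean()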
5.3 Multiple Epochs: Squeezing More from Data
PPO enables multiple epochs on the same data:
for epoch in range(K):   # K = 4 typically
    for batch in minibatches:
        # Compute clipped objective
        # Update both actor and critic
Why this works: Clipping prevents destructive updates, so we can safely reuse data multiple times.
5.4 The Complete PPO Recipe
- Collect trajectories using current policy
- Compute advantages using GAE (Generalized Advantage Estimation)
- Update policy and value function for K epochs using clipped objectives
- Repeat
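Step 2 is a short backward recursion over the rollout. A sketch of GAE under the usual assumptions (gamma is the discount factor, lam the GAE parameter, and dones marks episode boundaries):
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, lam=0.95):
    # values: V_phi(s_t) for each rollout step; next_value: V_phi for the state after the rollout
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next * (1.0 - dones[t]) - values[t]   # TD error
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae                   # discounted sum of TD errors
        advantages[t] = gae
    return advantages, advantages + values   # advantages and the corresponding value targets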
5.5 Connection to Modern LLM Training
PPO’s design principles directly influence modern LLM fine-tuning:
RLHF (Reinforcement Learning from Human Feedback):
- Use PPO to optimize language models for human preferences
- Reward model provides signal instead of environment rewards
- Same trust region principles prevent catastrophic policy changes
GRPO (Group Relative Policy Optimization):
- A variant of PPO specifically designed for large language model (LLM) fine-tuning
- Removes the critic (value function) entirely — no value network is trained
- This is because LLM reward signals are:
- Sparse: only given at the end of a generated sequence
- Delayed: no intermediate token-level rewards
- Non-Markovian: the reward depends on the full output sequence, not just the current token or state, so token-level value estimation becomes unreliable and the assumptions behind value-function learning in traditional RL break down
- Instead of estimating advantages with a learned value function, GRPO:
- Samples multiple outputs per prompt (a “group”)
- Uses the group’s average reward as a baseline
- Computes relative advantage: \(\hat{A}_i = \frac{r_i - \bar{r}}{\sigma_r}\)
- Eliminates the instability and compute overhead of value training
- Aligns naturally with how reward models are trained (via output comparisons)
- Achieves strong results in math-heavy settings, as demonstrated by DeepSeekMath
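The group-relative advantage itself takes only a couple of lines; a sketch for one prompt’s group of sampled completions (the reward values are made up for illustration):
import torch

# One scalar reward per sampled completion for the same prompt
rewards = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # \hat{A}_i for each completion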
6. The Evolution: Understanding the Why
6.1 The Fundamental Challenges
Each algorithm in our journey addresses specific problems:
- High Variance → Baselines (Actor-Critic MC)
- Sample Inefficiency → Bootstrapping (Actor-Critic TD)
- Gradient Variance → Parallel Environment Averaging (A2C)
- Destructive Updates → Trust Regions (PPO)
6.2 The Interconnected Nature of RL Problems
What makes RL particularly challenging is that these problems are deeply interconnected:
- High variance leads to instability
- Instability requires conservative updates
- Conservative updates need good data utilization
- Data utilization must balance on-policy constraints
PPO elegantly addresses this entire web of challenges simultaneously.
7. From Game Playing to Language Models
7.1 The Abstraction Bridge
The beauty of policy gradient methods is their generality:
Game Playing (LunarLander):
- State: Position, velocity, angle (8D vector)
- Action: Thruster controls (discrete or continuous)
- Reward: Landing success, fuel efficiency
Language Modeling (LLMs):
- State: Context tokens (high-dimensional embeddings)
- Action: Next token selection (discrete distribution)
- Reward: Human preference scores, helpfulness ratings
Same Mathematics: The policy gradient theorem applies equally to both domains!
7.2 Modern Scaling Innovations
Modern LLM training scales these principles:
Massive Parallelism: A2C’s parallel environments → thousands of distributed workers
Trust Regions: PPO’s clipping → careful update constraints for billion-parameter models
Value Functions: Critic networks → reward models trained on human preferences
7.3 Why This Historical Perspective Matters
Understanding the evolutionary path from REINFORCE to PPO illuminates:
- Design Principles: Why modern algorithms make specific choices
- Failure Modes: What problems you’ll encounter and how to diagnose them
- Future Directions: How current limitations might be addressed
When you encounter issues training modern AI systems, the debugging insights from these foundational algorithms remain invaluable.
8. Practical Insights and Implementation Details
8.1 Critical Implementation Choices
Our journey reveals several make-or-break implementation details:
Normalization Strategy:
- Returns: Normalize in MC methods, NOT in TD methods
- Advantages: Always normalize for stable policy updates
- Timing: When and how you normalize dramatically affects performance
Network Architecture:
- Shared vs. separate networks for actor-critic
- Layer normalization (generally helpful) vs. batch normalization (often harmful in RL)
- Activation functions and initialization strategies
Hyperparameter Sensitivity:
- Learning rates must be carefully tuned per algorithm
- Discount factors affect long-term vs. short-term behavior
- Gradient clipping thresholds prevent training instability
8.2 Debugging RL: What the Losses Don’t Tell You
Critical Insight: Unlike supervised learning, RL losses don’t directly indicate performance:
- High policy loss might mean active learning (good!)
- Low critic loss might mean overfitting to wrong targets
- Focus on episode scores, gradient stability, and advantage statistics
8.3 The Hardware Evolution Story
Our algorithms also reflect the hardware landscape evolution:
- REINFORCE (1992): CPU-era, simple algorithms
- Actor-Critic (2000s): Multi-core CPUs, parallel data processing
- A2C/PPO (2010s): GPU era, batch processing and parallel environments
Modern LLM training continues this trend with distributed computing across thousands of GPUs.
9. Conclusion
Rather than overwhelming you with experimental numbers and charts, I encourage you to dive into the interactive notebooks where you can:
- Run the algorithms yourself on LunarLander-v3
- Experiment with hyperparameters to see their effects
- Visualize training dynamics in real-time
- Compare different approaches side-by-side
- Debug implementation details step-by-step
The Fundamental Insight: Understanding how we got from REINFORCE to PPO provides the conceptual foundation for understanding where AI is heading next.
The algorithms we’ve explored aren’t just historical curiosities - they’re the living foundation of modern AI. Every time you interact with ChatGPT, watch a robot learn to walk, or see an AI system master a new task, you’re witnessing these principles in action.
Perhaps most importantly, the debugging intuition and implementation insights from these foundational algorithms remain directly applicable to cutting-edge AI research. The variance problems we solved in LunarLander are the same ones engineers face when training language models with trillions of parameters.
The complete implementation and detailed notebooks are available on GitHub.
📚 References
- Simple statistical gradient-following algorithms for connectionist reinforcement learning
- Human-level Control through Deep Reinforcement Learning
- Rainbow: Combining Improvements in Deep Reinforcement Learning
- Actor-Critic Reinforcement Learning for Control with Stability Guarantees
- Asynchronous Methods for Deep Reinforcement Learning
- Continuous Control with Deep Reinforcement Learning
- Addressing Function Approximation Error in Actor-Critic Methods
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- Proximal Policy Optimization Algorithms
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models