February 6, 2025

RL vs SL: Understanding Their Roles in Large Language Models

Bridging Reinforcement Learning and Supervised Learning in LLMs


Large Language Models (LLMs) have brought a new twist to the way we think about training algorithms. While traditional Supervised Learning (SL) and Reinforcement Learning (RL) might seem worlds apart at first glance, their underlying optimization procedures share a striking similarity. In both cases, we aim to maximize an objective function $J(\theta)$ by ascending its gradient. Yet a fundamental difference lies in how the data is generated and controlled. This blog post dives into these similarities and differences, and explores how RL is adapted in the context of LLMs.

Gradients: A Shared Mathematical Form

In both SL and RL, our goal is to optimize the parameters $\theta$ by maximizing an objective function:

$$\theta^* = \arg\max_\theta J(\theta)$$

Taking the gradient with respect to $\theta$ gives us the direction in which to update the parameters. Let’s look at the gradient expressions for each case.

Supervised Learning

In SL, our objective is often the log-likelihood of the observed data. Assuming our data $x$ is sampled from a fixed training set $D_{\text{train}}$, we have:

$$J(\theta) = \mathbb{E}_{x \sim D_{\text{train}}}\big[\log p(x; \theta)\big].$$

The gradient of this objective is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim D_{\text{train}}}\big[\nabla_\theta \log p(x; \theta)\big].$$

Here, the log appears naturally because the objective is a log-likelihood. The key point is that the training data is fixed, sampled independently and identically distributed (iid) from $D_{\text{train}}$, meaning we have no control over which samples the model sees.
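As a concrete toy illustration of this gradient (our own example, not from the post), here is gradient ascent on the log-likelihood of a categorical model $p(x; \theta) = \mathrm{softmax}(\theta)$ fit to a fixed dataset:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Fixed iid training set: the model has no say in which samples it sees.
data = np.array([0, 0, 1, 0, 2, 0])
freq = np.bincount(data, minlength=3) / len(data)  # empirical frequencies

theta = np.zeros(3)  # logits parameterizing p(x; theta) = softmax(theta)
lr = 0.5
for _ in range(500):
    # For a softmax model, E_x[grad log p(x; theta)] = empirical freqs - model probs.
    theta += lr * (freq - softmax(theta))

# The maximum-likelihood solution matches the empirical frequencies of the fixed dataset.
print(np.round(softmax(theta), 3))
```

The fixed point of the update is exactly the empirical distribution of the training set, which is the sense in which SL "learns from what is given."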

Reinforcement Learning

In RL, our objective is to maximize the expected cumulative discounted reward:

J(θ)=Eπθ[t=0γtrt],J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right],

where πθ\pi_\theta is the policy parameterized by θ\theta, rtr_t is the reward at time tt, and γ\gamma is the discount factor.
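To make the return concrete, here is a small helper (our own, not from the post) that computes the discounted return $G_t$ for every time step of a finite reward sequence with a single backward pass:

```python
def returns(rewards, gamma):
    """Compute G_t = sum_{k >= t} gamma**(k - t) * r_k for each t."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g  # recursion: G_t = r_t + gamma * G_{t+1}
        out.append(g)
    return out[::-1]

# Example: rewards [1, 0, 2] with gamma = 0.5
print(returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```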

To derive the gradient, we employ the log trick. Although the original RL objective does not include a logarithm, we can write:

θJ(θ)=θEπθ[Gt]=Eπθ[Gtθlogπθ(τ)],\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\pi_\theta}[G_{t}] = \mathbb{E}_{\pi_\theta}\big[G_{t} \, \nabla_\theta \log \pi_\theta(\tau)\big],

where τ\tau denotes a trajectory and GG is the return. Breaking this down per time step (as in the REINFORCE algorithm), we obtain:

θJ(θ)=Eπθ[t=0θlogπθ(atst)Gt]\boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right]}

Note: The log term is introduced via the log trick. It allows us to bring the gradient inside the expectation by rewriting $\nabla_\theta \pi_\theta(a_t \mid s_t)$ as $\pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
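Spelled out at the trajectory level, the log trick is just the chain rule applied under the integral sign:

```latex
\begin{aligned}
\nabla_\theta \mathbb{E}_{\pi_\theta}\big[G(\tau)\big]
  &= \nabla_\theta \int \pi_\theta(\tau)\, G(\tau)\, d\tau
   = \int \nabla_\theta \pi_\theta(\tau)\, G(\tau)\, d\tau \\
  &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, G(\tau)\, d\tau
   = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau)\, G(\tau)\big].
\end{aligned}
```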

For simplicity, we are considering the return $G_t$ as the cumulative discounted reward from time $t$ onward. In more sophisticated algorithms, $G_t$ might be replaced with an advantage function (involving a critic) to reduce variance. Here, however, we stick with the simplest form, REINFORCE.
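As a minimal sketch of REINFORCE, consider a toy two-armed bandit of our own invention (single-step episodes, so $G_t$ is just the immediate reward, made deterministic so the example is easy to check):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros(2)  # policy logits over two arms

lr = 0.2
for _ in range(1000):
    p = softmax(theta)
    a = rng.choice(2, p=p)           # the policy samples its own training data
    reward = 1.0 if a == 1 else 0.0  # toy setup: arm 1 pays, arm 0 does not
    grad_log_pi = -p
    grad_log_pi[a] += 1.0            # grad log pi(a) for softmax logits
    theta += lr * reward * grad_log_pi  # REINFORCE: return times grad log pi

# The policy shifts its probability mass toward the rewarding arm.
print(np.round(softmax(theta), 3))
```

Note how the sampled action comes from $\pi_\theta$ itself: unlike the SL case, the policy controls which data it trains on, which is exactly the difference discussed next.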

A Common Structure, But a Key Difference

Both gradients share a similar structure:

- SL: $\nabla_\theta J(\theta) = \mathbb{E}_{x \sim D_{\text{train}}}\big[\nabla_\theta \log p(x; \theta)\big]$, where every sample is weighted equally.
- RL: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]$, where each log-probability is weighted by the return $G_t$.

In both cases, the gradient is an expectation over a log derivative, an indication that we are essentially adjusting our parameters in the direction that increases the log-probability of favorable outcomes.

The Fundamental Difference: Control Over Data

The similarity ends when we consider how the data is obtained:

- In SL, the data distribution is fixed: samples are drawn iid from $D_{\text{train}}$ regardless of the current parameters $\theta$.
- In RL, the data is generated by the policy itself: the trajectories we learn from depend on $\pi_\theta$, so every parameter update changes the distribution of future training data.

This fundamental difference is why RL is notoriously difficult. The interdependence between the policy and the data generation process introduces non-stationarity, high variance, and a need for careful exploration strategies.

RL in the Context of LLMs

When RL is applied to Large Language Models (LLMs), the setting changes in interesting ways:

- The prompts play the role of a fixed dataset: they are drawn from a fixed distribution, much like inputs in supervised learning.
- The environment is deterministic: the state is simply the prompt plus the tokens generated so far, and each action (emitting a token) deterministically appends to it.
- The reward is typically given only at the end of a generated sequence, for example by a reward model scoring the completion.
- Because state transitions and rewards are deterministic, the need for extensive exploration is greatly reduced compared to classical RL.
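To make this setting concrete, here is a deliberately tiny sketch (our own toy, not an actual language model): a per-position softmax policy generates token sequences, transitions deterministically append tokens, and a single terminal reward scores the whole sequence:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, V = 3, 2               # sequence length, vocabulary size
theta = np.zeros((T, V))  # per-position logits (a stand-in for an LLM)

lr = 0.2
for _ in range(1000):
    tokens, grads = [], np.zeros_like(theta)
    for t in range(T):
        # Deterministic transition: the state is just the tokens so far,
        # and sampling a token appends it to the sequence.
        p = softmax(theta[t])
        a = rng.choice(V, p=p)
        tokens.append(a)
        g = -p
        g[a] += 1.0       # grad log pi(a_t | s_t) for softmax logits
        grads[t] = g
    # Terminal reward only: 1 if the whole sequence is all ones (toy reward).
    reward = 1.0 if sum(tokens) == T else 0.0
    theta += lr * reward * grads  # REINFORCE with the terminal reward as G_t

# The policy concentrates its mass on the rewarded sequence.
print([round(float(softmax(theta[t])[1]), 3) for t in range(T)])
```

Because the transition is deterministic and the reward is fixed given the sequence, the only stochasticity is the policy's own sampling, which is the sense in which exploration is less of a concern here than in classical RL.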

Conclusion

To summarize, while both supervised learning and reinforcement learning optimize similar-looking gradient expressions involving a log term, the crucial difference lies in the control over data generation. In supervised learning, data is fixed and sampled iid from a training set, meaning the model simply learns from what is given. In reinforcement learning, however, the agent’s policy actively influences the data it sees, making the problem inherently more challenging due to non-stationarity and the need for exploration.

In the realm of Large Language Models (LLMs), this complexity is mitigated by structuring the problem such that the prompts are fixed and the environment is deterministic. The only genuine RL challenge is the sequential generation of tokens—an aspect that, thanks to deterministic state transitions and rewards, reduces the need for extensive exploration. This hybrid approach allows techniques from both supervised learning and RL to be used effectively, enabling efficient training of these powerful models.

Understanding these nuances not only deepens our insight into model training but also helps us design better algorithms tailored to the unique challenges of each learning paradigm.