Large Language Models (LLMs) have brought a new twist to the way we think about training algorithms. While traditional Supervised Learning (SL) and Reinforcement Learning (RL) might seem worlds apart at first glance, their underlying optimization procedures share a striking similarity. In both cases, we aim to maximize an objective function \(J(\theta)\) by ascending its gradient. Yet, a fundamental difference lies in the way data is generated and controlled. This blog post dives into these similarities and differences, and explores how RL is adapted in the context of LLMs.
Gradients: A Shared Mathematical Form
In both SL and RL, our goal is to optimize the parameters \(\theta\) by maximizing an objective function:
\[\theta^* = \arg\max_\theta J(\theta)\]Taking the gradient with respect to \(\theta\) gives us the direction in which to update the parameters. Let’s look at the gradient expressions for each case.
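In both settings, the resulting update is simply a step in this gradient direction, \[\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta),\]where \(\alpha\) is a learning rate; only the way the gradient is estimated differs, as the expressions below show.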
Supervised Learning
In SL, our objective is often the log-likelihood of the observed data. Assuming our data \(x\) is sampled from a fixed training set \(D_{\text{train}}\), we have:
\[J(\theta) = \mathbb{E}_{x \sim D_{\text{train}}}\big[\log p(x; \theta)\big].\]The gradient of this objective is:
\[\nabla_\theta J(\theta) = \mathbb{E}_{x \sim D_{\text{train}}}\big[\nabla_\theta \log p(x; \theta)\big].\]Here, the log appears because the objective itself is a log-likelihood. The key point is that the training data is fixed, sampled independently and identically distributed (iid) from \(D_{\text{train}}\), meaning we have no control over which samples the model sees.
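As a concrete sketch, here is what one such update looks like in PyTorch (the framework choice and the `model`, `inputs`, `targets` placeholders are illustrative assumptions, not something the derivation depends on). Maximizing the log-likelihood is implemented as minimizing the negative log-likelihood of samples drawn from the fixed training set:

```python
import torch
import torch.nn.functional as F

def supervised_step(model, optimizer, inputs, targets):
    """One SL update: stochastic gradient ascent on E_{x ~ D_train}[log p(x; theta)]."""
    logits = model(inputs)                      # unnormalized scores over classes/tokens
    log_probs = F.log_softmax(logits, dim=-1)   # log p(. | inputs; theta)
    # Negative log-likelihood of the fixed targets; minimizing it is
    # gradient ascent on the log-likelihood objective J(theta).
    loss = F.nll_loss(log_probs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the batch `(inputs, targets)` comes from a data loader over \(D_{\text{train}}\); the model never influences which batch it receives.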
Reinforcement Learning
In RL, our objective is to maximize the expected cumulative discounted reward:
\[J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right],\]where \(\pi_\theta\) is the policy parameterized by \(\theta\), \(r_t\) is the reward at time \(t\), and \(\gamma\) is the discount factor.
To derive the gradient, we employ the log trick. Although the original RL objective does not include a logarithm, we can write:
\[\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\big[G(\tau)\big] = \mathbb{E}_{\tau \sim \pi_\theta}\big[G(\tau) \, \nabla_\theta \log \pi_\theta(\tau)\big],\]where \(\tau\) denotes a trajectory and \(G(\tau)\) is its total return. Breaking this down per time step (as in the REINFORCE algorithm), we obtain:
\[\boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right]}\]Note: The log term is introduced via the log trick. It allows us to bring the gradient inside the expectation by rewriting \(\nabla_\theta \pi_\theta(a_t \mid s_t)\) as \(\pi_\theta(a_t \mid s_t) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\).
For simplicity, we treat the return \(G_t\) as the cumulative discounted reward from time step \(t\) onward. In more sophisticated algorithms, \(G_t\) is replaced by an advantage estimate (involving a critic) to reduce variance. Here, however, we stick with the simplest form, REINFORCE.
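To ground this, here is a minimal REINFORCE sketch in PyTorch. The `policy` (assumed to return a `torch.distributions.Categorical` over actions) and the `env` object with a simplified `reset()`/`step()` interface are hypothetical stand-ins, not a specific library API:

```python
import torch

def reinforce_loss(policy, env, gamma=0.99):
    """Sample one trajectory with the current policy and build the surrogate
    loss whose gradient is -E[sum_t grad log pi(a_t | s_t) * G_t]."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())   # simplified step signature
        rewards.append(reward)

    # Discounted returns G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # The returns act as fixed weights on the log-probabilities; minimizing
    # this loss is gradient ascent on J(theta).
    return -(torch.stack(log_probs) * returns).sum()
```

Unlike the supervised step above, the data here (states, actions, rewards) is produced by the very policy being trained.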
A Common Structure, But a Key Difference
Both gradients share a similar structure:
- Supervised Learning:
  \[\nabla_\theta J(\theta) = \mathbb{E}_{x \sim D_{\text{train}}}\left[\nabla_\theta \log p(x; \theta)\right]\]
- Reinforcement Learning:
  \[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right]\]
In both cases, the gradient is an expectation over a log derivative—an indication that we’re essentially adjusting our parameters in the direction that increases the log-probability of favorable outcomes.
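Written as losses, the parallel is almost mechanical. In the sketch below (PyTorch; `log_probs` and `returns` are assumed to be tensors of per-sample and per-step quantities, as in the snippets above), the only difference is the weighting term:

```python
import torch

def sl_loss(log_probs: torch.Tensor) -> torch.Tensor:
    # Supervised learning: every observed sample is weighted equally.
    return -log_probs.mean()

def rl_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # REINFORCE: each log-probability is weighted by its sampled return G_t,
    # so high-return actions become more likely. No gradient flows through returns.
    return -(log_probs * returns.detach()).mean()
```

Setting every return to 1 recovers the supervised loss; the real difference, discussed next, is where `log_probs` comes from.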
The Fundamental Difference: Control Over Data
The similarity ends when we consider how the data is obtained:
- Supervised Learning: The training data \(x\) is sampled iid from \(D_{\text{train}}\). The model has no control over which data points it receives; it simply adapts to a pre-existing, fixed distribution.
- Reinforcement Learning: The data (trajectories of states and actions) is generated by the agent interacting with the environment, so the policy \(\pi_\theta\) directly influences which data is collected. This creates a feedback loop:
  - As the policy changes, so does the distribution of the data.
  - The agent must explore to gather informative data and must cope with the non-stationarity that its own updates introduce.
This fundamental difference is why RL is notoriously difficult. The interdependence between the policy and the data generation process introduces non-stationarity, high variance, and a need for careful exploration strategies.
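The feedback loop can be made explicit as a training loop: trajectories are re-collected with the current policy at every iteration, so the training distribution shifts whenever \(\theta\) is updated. This is a minimal sketch reusing the hypothetical `reinforce_loss` from above:

```python
def train_policy(policy, env, optimizer, num_iterations: int):
    for _ in range(num_iterations):
        # The policy generates its own data: every update changes the
        # distribution of trajectories collected at the next iteration.
        loss = reinforce_loss(policy, env)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In the supervised analogue, the inner line would simply read the next batch from a fixed data loader, which is exactly the "no control over the data" distinction.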
RL in the Context of LLMs
When RL is applied to Large Language Models (LLMs), the setting changes in interesting ways:
- Supervised-Learning-Like Data Generation: The prompts or questions provided to the LLM are typically sampled from a fixed dataset, and the question forms part of the state. Since the prompt is independent of the model's actions, the data distribution is fixed, just as in supervised learning. The only RL aspect is in generating the sequence of tokens.
- Deterministic Reward and Transitions: The reward in LLM RL settings is often defined deterministically, for example as \[R(\text{question}, \text{generated sequence}).\]If the generated answer is correct (or matches certain criteria), the reward is the same every time. Similarly, the state transition (appending the chosen token to the generated sequence) is deterministic: if the model chooses a token, that token is generated with certainty.
  Although the reward function is deterministic, we still face the challenge of credit assignment in the LLM setting. A key decision is whether to assign the same reward to every token in the sequence or only to the final token (with earlier tokens receiving zero reward). Either way, a discount factor \(\gamma\) must be chosen to account for the sequential nature of the decision-making process. If we assign rewards to all tokens, the return \(G_t\) for the token generated at time step \(t\) is \[G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k,\]where \(T\) is the final time step of the sequence and \(r_k\) is the reward at time step \(k\). This ensures that future rewards are properly discounted, a standard consideration in sequential decision-making that carries over to LLMs (a small code sketch of the terminal-reward variant follows this list).
- Reduced Need for Exploration: Because the questions are fixed and the environment (i.e., the token generation process) is deterministic, the agent in an LLM does not need to explore as extensively as in traditional RL. The main challenge is credit assignment over the sequence: determining how earlier tokens contributed to the final reward.
- Bridging the Gap: These factors allow RL in LLMs to be formulated much closer to a supervised learning or bandit problem. Algorithms like Direct Preference Optimization (DPO) take advantage of this structure, effectively transforming the RL task into one that resembles a supervised learning problem.
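To make the reward and credit-assignment discussion concrete, below is a small sketch (PyTorch; the exact-match reward and the tensor shapes are illustrative assumptions, not a prescribed recipe) of the common choice where the deterministic reward is computed once on the full answer, assigned to the final token, and discounted back to earlier tokens:

```python
import torch

def exact_match_reward(generated: str, reference: str) -> float:
    """Deterministic reward: the same (question, answer) pair always scores the same."""
    return 1.0 if generated.strip() == reference.strip() else 0.0

def terminal_token_returns(reward: float, seq_len: int, gamma: float = 1.0) -> torch.Tensor:
    """Per-token returns when only the last token is rewarded:
    G_t = gamma^(T - t) * R, where T is the final time step."""
    steps_to_go = torch.arange(seq_len - 1, -1, -1, dtype=torch.float32)
    return (gamma ** steps_to_go) * reward

def sequence_policy_loss(token_log_probs: torch.Tensor, reward: float, gamma: float = 1.0) -> torch.Tensor:
    """REINFORCE loss for one generated sequence; `token_log_probs` has shape
    (seq_len,) and holds log pi_theta(token_t | prompt, tokens_<t)."""
    returns = terminal_token_returns(reward, token_log_probs.shape[0], gamma)
    return -(token_log_probs * returns).sum()
```

With \(\gamma = 1\), every token receives the same sequence-level return and the loss reduces to reward-weighted negative log-likelihood of the whole generated answer, which is the supervised-learning-like, bandit-style structure described above; DPO goes further by working directly on preference pairs, but it similarly treats the entire generated sequence as a single decision.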
Conclusion
To summarize, while both supervised learning and reinforcement learning optimize similar-looking gradient expressions involving a log term, the crucial difference lies in the control over data generation. In supervised learning, data is fixed and sampled iid from a training set, meaning the model simply learns from what is given. In reinforcement learning, however, the agent’s policy actively influences the data it sees, making the problem inherently more challenging due to non-stationarity and the need for exploration.
In the realm of Large Language Models (LLMs), this complexity is mitigated by structuring the problem such that the prompts are fixed and the environment is deterministic. The only genuine RL challenge is the sequential generation of tokens—an aspect that, thanks to deterministic state transitions and rewards, reduces the need for extensive exploration. This hybrid approach allows techniques from both supervised learning and RL to be used effectively, enabling efficient training of these powerful models.
Understanding these nuances not only deepens our insight into model training but also helps us design better algorithms tailored to the unique challenges of each learning paradigm.