Supervised Learning Objective: Maximum Likelihood
In supervised learning, the goal is to learn a function \(f(\boldsymbol{x})\) that maps an input \(\boldsymbol{x}\) to a label \(y\). This is typically done by maximizing the likelihood of the correct label \(y\) given \(\boldsymbol{x}\). For classification over \(N\) possible categories, the model produces a score (logit) \(f_i(\boldsymbol{x})\) for each class \(i\) and converts these scores into probabilities with the softmax function. The probability of class \(y\) is given by:
\[P(y \mid \boldsymbol{x}) = \frac{e^{f_y(\boldsymbol{x})}}{\sum_{i=1}^{N} e^{f_i(\boldsymbol{x})}}\]The objective is to maximize this probability. For classification, we normally minimize the cross-entropy loss, which is equivalent to maximizing the likelihood:
\[L(f(\boldsymbol{x}), y) = - \log P(y \mid \boldsymbol{x})\]Minimizing this cross-entropy loss encourages the model’s predicted probability distribution to match the “one-hot” labeled distribution.
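As a minimal sketch of these two formulas, here is how the softmax probabilities and the cross-entropy loss for a single example might be computed with NumPy; the logit values are made up purely for illustration:

```python
import numpy as np

def softmax(logits):
    # Shift by the max logit for numerical stability before exponentiating.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def cross_entropy_loss(logits, y):
    # Negative log-probability assigned to the correct class y.
    return -np.log(softmax(logits)[y])

# Toy example: logits f_i(x) for N = 3 classes, true label y = 2.
logits = np.array([1.0, 0.5, 2.0])
print(cross_entropy_loss(logits, y=2))  # ~0.46
```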
For example, in an image classification task, we have labeled data \((\boldsymbol{x}, y)\), where \(\boldsymbol{x}\) is the image and \(y\) is its label (e.g., “cat” or “dog”). Supervised learning uses these labeled examples to learn the correct mapping from inputs to labels.
Note on Regression
If the model outputs a single real number (instead of a categorical distribution), we often assume outputs follow a Gaussian distribution. In this scenario, maximizing the likelihood under that assumption is equivalent to minimizing MSE:
\[L(f(\boldsymbol{x}), y) = \bigl(f(\boldsymbol{x}) - y\bigr)^2.\]
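To see why, note that the negative log-likelihood of a Gaussian with mean \(f(\boldsymbol{x})\) and fixed variance \(\sigma^2\) is
\[-\log \mathcal{N}\bigl(y;\, f(\boldsymbol{x}), \sigma^2\bigr) = \frac{\bigl(f(\boldsymbol{x}) - y\bigr)^2}{2\sigma^2} + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr),\]so for a fixed \(\sigma\), minimizing the negative log-likelihood differs from minimizing the squared error only by a constant scale and offset.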
For simplicity, however, we focus on classification in this post.
Reinforcement Learning Objective: Reward Maximization in MDP
In reinforcement learning (RL), the goal is to maximize the cumulative reward by interacting with an environment. Formally, RL uses a Markov Decision Process (MDP), where at each step \(t\):
- The agent is in a state \(s_t\).
- The agent takes an action \(a_t\).
- The agent receives a reward \(r_{t+1}\) and transitions to a new state \(s_{t+1}\).
The agent’s objective is to maximize the expected sum of discounted rewards, also called the return:
\[J(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r_{t+1} \,\middle|\, s_0 \right]\]Here, \(\gamma\) (the discount factor) controls how future rewards are weighted, \(r_{t+1}\) is the reward at time \(t+1\), and \(\pi(a \mid s)\) is the policy the agent follows. A key aspect of RL is the exploration-exploitation trade-off: the agent must try different actions to discover which yield the most reward, especially when \(r\) can be stochastic.
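For a finite trajectory, the return inside this expectation is just a discounted sum, as in the following small Python snippet (the reward values are made up for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    # Sum of gamma^t * r_{t+1} over a finite trajectory of rewards
    # (the infinite sum truncated at the end of the episode).
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Rewards r_1, r_2, r_3 received after each of three steps from s_0.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```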
Simplifying RL: Supervised Learning as a Special Case of RL
We can see supervised learning as a special case of RL if we apply three key simplifications:
- No transition based on actions: The next state does not depend on the current action.
- Deterministic reward function: Instead of stochastic rewards, the reward for each action is a fixed function of the context, so repeating the same action never yields a different outcome.
- The agent can take all possible actions and get the corresponding rewards: We have access to the reward for every possible action at each step, unlike traditional RL, which only reveals the reward of the chosen action.
With these conditions, a reinforcement learning problem effectively behaves like supervised learning.
No Transition Based on Actions: Contextual Bandits
The first simplification eliminates action-based state transitions. In a standard MDP, the next state \(s_{t+1}\) depends on both \(s_t\) and \(a_t\). However, in a contextual bandit problem, there are no transitions between states. Instead, the state (or context) \(\boldsymbol{x}\) remains independent of any chosen action, and the objective is:
\[J(\pi) = \mathbb{E}_\pi \left[ r(a \mid \boldsymbol{x}) \right].\]In other words, for a given context \(\boldsymbol{x}\), the agent selects an action \(a\) to maximize its immediate reward \(r(a \mid \boldsymbol{x})\). This structure is closer to supervised learning, where one simply learns a mapping from \(\boldsymbol{x}\) to \(y\) (instead of worrying about transitions).
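A minimal sketch of one contextual-bandit round, assuming hypothetical `policy` and `reward_fn` helpers (neither is defined in the text above; they are placeholders for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def bandit_round(policy, reward_fn, context):
    # One contextual-bandit round: observe x, sample a ~ pi(.|x), receive r(a|x).
    # There is no next state: the chosen action does not affect future contexts.
    probs = policy(context)                    # pi(a | x), a vector over actions
    action = rng.choice(len(probs), p=probs)   # sample an action from the policy
    return action, reward_fn(context, action)  # only the immediate reward matters

# Illustrative uniform policy over 3 actions and a made-up reward function.
uniform_policy = lambda x: np.ones(3) / 3
toy_reward = lambda x, a: float(a == int(x) % 3)
print(bandit_round(uniform_policy, toy_reward, context=4))
```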
Contextual vs. Non-Contextual Bandits
- In a non-contextual bandit, the state is effectively fixed: the agent repeatedly chooses among the same set of arms, as in a classic multi-armed bandit.
- In a contextual bandit, the state \(\boldsymbol{x}\) can change at each step (sampled from some distribution). In this post, we assume all \(\boldsymbol{x}\) come from the same distribution, and no action changes that distribution.
Deterministic Reward Function
Next, we assume the reward function is deterministic. In general RL, the reward for an action \(a\) in state \(s\) can be stochastic—performing the same action multiple times can yield different outcomes. But if
\[r(a \mid \boldsymbol{x})\]is deterministic, the agent knows the exact reward each action would produce given \(\boldsymbol{x}\). That effectively removes any need for the agent to explore the uncertainty in the reward distribution.
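A toy illustration of the difference, using a made-up mean-reward table: with a stochastic reward the agent must average repeated samples to estimate the mean, while a deterministic reward is revealed exactly by a single query.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mean-reward table: rows are contexts x, columns are actions a.
MEAN_REWARD = np.array([[0.2, 0.8],
                        [0.9, 0.1]])

def stochastic_reward(x, a):
    # General RL/bandit setting: repeating the same action in the same context
    # yields noisy samples around the (unknown) mean reward.
    return MEAN_REWARD[x, a] + rng.normal(scale=0.5)

def deterministic_reward(x, a):
    # Simplification 2: the reward is a fixed function of (x, a), so one query
    # reveals it exactly and there is no reward noise left to explore.
    return MEAN_REWARD[x, a]
```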
Agent Can Take All Actions and Get the Rewards
Finally, we suppose the agent can take all possible actions and learn the reward for each. Typically, in RL or even standard contextual bandits, the agent observes only the reward of the chosen action. But if we assume it has access to:
\[r(a_1 \mid \boldsymbol{x}), \quad r(a_2 \mid \boldsymbol{x}), \quad \dots, \quad r(a_N \mid \boldsymbol{x}),\]this is analogous to labeled data in supervised learning: for each input \(\boldsymbol{x}\), we effectively know the “correct label” (the highest reward action) as well as the payoffs of all the “incorrect” actions.
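In code, having the full reward vector for a context behaves just like having a label, as in this illustrative sketch:

```python
import numpy as np

def label_from_rewards(reward_vector):
    # With r(a_i | x) known for every action, the "correct label" is simply the
    # highest-reward action, and the whole vector acts as supervision.
    return int(np.argmax(reward_vector))

rewards = np.array([0.1, 0.7, 0.2])  # r(a_1|x), r(a_2|x), r(a_3|x)
print(label_from_rewards(rewards))   # 1 -> plays the role of the label y for x
```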
Classification as a Special Case of RL
Given (1) no transitions, (2) deterministic rewards, and (3) access to the reward for all actions, we can see how classification becomes a special case of RL. When classifying:
- Input \(\boldsymbol{x}\): Serves as the “context.”
- Predicting a label \(y\): Is akin to taking an action \(a\).
- Knowing the true label: Gives the “reward” for that action. Having labeled data is like having the reward function for all possible labels.
Because the “best action” (true label) is always known for each context, exploration is unnecessary. You have a full reward vector just like a labeled dataset. Even using a softmax output to pick a class is akin to a policy that selects among actions.
Thus, supervised learning can be thought of as a deterministic, single-step, fully-informed RL problem.
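This correspondence can be made concrete with a small sketch: under a one-hot reward (an assumption made here for illustration), maximizing the expected reward of a softmax policy pushes toward the same solution as minimizing the cross-entropy loss.

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def expected_reward(logits, reward_vector):
    # Single-step, fully-informed objective: E_{a ~ pi(.|x)}[ r(a | x) ].
    return softmax(logits) @ reward_vector

def cross_entropy(logits, y):
    # Standard classification loss: -log pi(y | x).
    return -np.log(softmax(logits)[y])

logits = np.array([1.0, 0.5, 2.0])
one_hot_reward = np.array([0.0, 0.0, 1.0])  # reward 1 only for the true label y = 2

# With a one-hot reward, maximizing E[r] means pushing pi(y|x) toward 1,
# which is exactly what minimizing the cross-entropy loss does.
print(expected_reward(logits, one_hot_reward))  # pi(y=2 | x) ~ 0.63
print(cross_entropy(logits, y=2))               # -log pi(y=2 | x) ~ 0.46
```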
Conclusion
- Supervised Learning: We optimize a model to output the correct label \(\hat{y}\) given an input \(\boldsymbol{x}\). For classification, we typically maximize the likelihood of the correct label, often via the cross-entropy loss.
- Reinforcement Learning: We maximize the cumulative reward in an MDP, dealing with unknown reward distributions, partial feedback, and potential dependencies between actions and future states.
Supervised learning emerges as a special case of RL when:
- No state transitions depend on actions,
- Rewards are deterministic, and
- We observe rewards for all possible actions.
In that scenario, we no longer need to explore or handle multi-step dependencies. Real RL, on the other hand, involves much harder challenges, such as stochastic rewards, partial feedback, and the exploration-exploitation trade-off. This highlights both the generality and the complexity of RL—and why solving RL problems can be significantly more difficult than supervised learning.
One final note: supervised learning generally operates on a fixed, pre-collected dataset, making it inherently offline—there is no ongoing interaction once data gathering is complete. By contrast, reinforcement learning can be online, where the agent actively interacts with an environment in real time to collect new data (rewards) based on its actions. However, RL can also be done offline, where you first gather a dataset of transitions (e.g., from some policy or human demonstrations) and then train on that dataset—very much like supervised learning. This illustrates how the two approaches can blend once the RL problem constraints are simplified enough.