Is supervised learning a special type of reinforcement learning?

The reason why reinforcement learning is such a hard problem

By Taewoon Kim

Supervised Learning Objective: Maximum Likelihood

In supervised learning, the goal is to learn a function \(f(\boldsymbol{x})\) that maps an input \(\boldsymbol{x}\) to a label \(y\). This is typically done by maximizing the likelihood of the correct label \(y\) given \(\boldsymbol{x}\). For classification over \(N\) possible categories, the model produces a score \(f_i(\boldsymbol{x})\) for each class \(i\) and converts these scores into probabilities with the softmax function. The probability of class \(y\) is given by:

\[P(y \mid \boldsymbol{x}) = \frac{e^{f_y(\boldsymbol{x})}}{\sum_{i=1}^{N} e^{f_i(\boldsymbol{x})}}\]

The objective is to maximize this probability. For classification, we normally use the cross-entropy loss, whose minimization is equivalent to maximizing the likelihood:

\[L(f(\boldsymbol{x}), y) = - \log P(y \mid \boldsymbol{x})\]

Minimizing the cross-entropy loss simply pushes the network's output distribution as close as possible to the one-hot vector given by the label.

For example, in an image classification task, we are given labeled data \((\boldsymbol{x}, y)\), where \(\boldsymbol{x}\) is the image and \(y\) is its corresponding label (e.g., “cat” or “dog”). Supervised learning relies on this labeled data to learn the correct mapping from inputs to labels.
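To make this concrete, here is a minimal NumPy sketch of the softmax and cross-entropy computation described above; the scores and label are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the class scores f_i(x)."""
    z = logits - logits.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def cross_entropy(logits, label):
    """Cross-entropy loss: the negative log-likelihood of the correct class."""
    probs = softmax(logits)
    return -np.log(probs[label])

# Made-up scores f_i(x) for N = 3 classes, with "cat" (index 0) as the true label.
logits = np.array([2.0, 0.5, -1.0])
print(cross_entropy(logits, label=0))  # small loss: the correct class already gets high probability
```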

Reinforcement Learning Objective: Reward Maximization in MDP

In reinforcement learning (RL), the objective is to maximize the cumulative reward by interacting with an environment. Formally, RL is modeled as a Markov Decision Process (MDP), where at each step \(t\), the agent is in a state \(s_t\), takes an action \(a_t\), receives a reward \(r_{t+1}\), and transitions to a new state \(s_{t+1}\).

The objective in RL is to maximize the expected cumulative reward, or return, over time:

\[J(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 \right]\]

Here, \(\gamma\) is the discount factor that controls the weight of future rewards, \(r_{t+1}\) is the reward received at time \(t+1\), and \(\pi(a \mid s)\) is the policy that the agent follows. An essential part of RL is the exploration-exploitation trade-off: the agent needs to explore different actions to learn which ones yield the highest rewards, especially because the reward function is often stochastic.
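As a small illustration of the return being maximized, here is a sketch that computes the discounted sum \(\sum_{t} \gamma^t r_{t+1}\) for a single episode; the reward sequence is made up.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the return G = sum_t gamma^t * r_{t+1} for one episode."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# Hypothetical reward sequence r_1, ..., r_T collected while following some policy pi.
episode_rewards = [0.0, 0.0, 1.0, -0.5, 2.0]
print(discounted_return(episode_rewards))
```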

Simplifying RL: Supervised Learning as a Special Case of RL

In this post, we want to show how supervised learning can be viewed as a special case of reinforcement learning. The key simplifications are:

  1. No transition based on actions: In a simplified version of RL, the next state does not depend on the current action, unlike in a general MDP.
  2. Deterministic reward function: Instead of a stochastic reward, we assume the reward function is deterministic.
  3. The agent can take all possible actions and get the corresponding rewards: We assume that the agent has access to the rewards for all possible actions at each step, which is not possible in traditional RL.

These simplifications effectively turn a reinforcement learning problem into something that resembles supervised learning.

No Transition Based on Actions: Contextual Bandits

The first simplification is to remove the state transitions based on actions. In a general MDP, the next state \(s_{t+1}\) depends on both the current state \(s_t\) and the action \(a_t\) taken by the agent. In a contextual bandit problem, however, there are no such transitions: at each round the agent observes a context \(\boldsymbol{x}\) that is unaffected by its actions, and the goal is simply to choose the best action for that context to maximize the immediate reward. The objective for contextual bandits is:

\[J(\pi) = \mathbb{E}_\pi \left[ r(a \mid \boldsymbol{x}) \right]\]

This simplification makes the problem much closer to supervised learning, where the focus is on making predictions based on static inputs rather than managing state transitions.
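For intuition, here is a minimal epsilon-greedy sketch of this setting, simplified to a single fixed context with three actions; the true expected rewards are made up and hidden from the agent, which only observes noisy feedback for the action it picks.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 0.8, 0.5])  # hidden expected rewards r(a | x) for a fixed context x
q = np.zeros(3)                           # the agent's running reward estimates
counts = np.zeros(3)

for _ in range(1000):
    # Epsilon-greedy: explore a random action with probability 0.1, otherwise exploit.
    a = rng.integers(3) if rng.random() < 0.1 else int(np.argmax(q))
    r = true_rewards[a] + rng.normal(0.0, 0.1)  # noisy reward, seen only for the chosen action
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]              # incremental mean update

print(int(np.argmax(q)))  # converges to action 1, the one with the highest expected reward
```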

Deterministic Reward Function

The second simplification is to assume a deterministic reward function. In general RL, the reward \(r\) for taking an action \(a\) in state \(s\) is often stochastic, meaning that taking the same action in the same state might yield different rewards due to randomness in the environment. However, if we assume the reward function is deterministic, we know with certainty what the reward will be for each action: \(r(a \mid \boldsymbol{x})\).

In this scenario, there is no uncertainty in the reward, making it unnecessary for the agent to explore different actions to estimate the expected reward. This mirrors supervised learning, where the correct label (reward) is always known and deterministic.
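A toy sketch of the difference, using a made-up ground-truth mapping: with a deterministic reward, a single query per (context, action) pair already reveals everything, so repeated sampling and averaging are pointless.

```python
correct_action = {"image_1": 0, "image_2": 2}  # assumed ground-truth action per context

def deterministic_reward(context, action):
    """The same (context, action) pair always returns the same reward."""
    return 1.0 if action == correct_action[context] else 0.0

# Querying twice gives identical answers, so there is nothing to average over.
print(deterministic_reward("image_1", 0), deterministic_reward("image_1", 0))  # 1.0 1.0
```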

Agent Can Take All Actions and Get the Rewards

The third simplification is assuming that the agent can take all possible actions and immediately receive the corresponding rewards for each. In standard RL and contextual bandits, after taking an action, the agent only receives feedback about the reward for the action it took. However, in this special case, the agent knows the rewards for all possible actions at each step:

\[r(a_1 \mid \boldsymbol{x}), r(a_2 \mid \boldsymbol{x}), \dots, r(a_N \mid \boldsymbol{x})\]

This is equivalent to receiving the full reward vector for every possible action, which is similar to the labeled data in supervised learning where, for each input \(\boldsymbol{x}\), we know the correct label (or the correct action). In this case, the agent has full information about the rewards for all actions, so exploration becomes unnecessary.
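Under this assumption, the feedback for one input is an entire reward vector rather than a single scalar; for classification with correct class \(y\), that vector is just the one-hot label, as sketched below with made-up values.

```python
import numpy as np

def full_reward_vector(label, num_actions):
    """Full feedback: the agent observes r(a | x) for every action a.
    For classification, this is exactly the one-hot vector of the label."""
    rewards = np.zeros(num_actions)
    rewards[label] = 1.0
    return rewards

print(full_reward_vector(label=1, num_actions=3))  # [0. 1. 0.]
```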

Classification as a Special Case of RL

Given these three simplifications—(1) no transitions, (2) deterministic rewards, and (3) access to all rewards for all actions—we can now see that supervised learning is indeed a special case of reinforcement learning.

In classification, given an input \(\boldsymbol{x}\), the model predicts the label \(y\), which corresponds to taking an action. The feedback (or label) for the input is deterministic and known for all possible classes, which corresponds to receiving all possible rewards for the available actions. The model could apply a softmax function to assign probabilities to each action (or class):

\[P(a \mid \boldsymbol{x}) = \frac{e^{f_a(\boldsymbol{x})}}{\sum_{i=1}^{N} e^{f_i(\boldsymbol{x})}}\]

Although training a softmax over actions with full feedback is not how RL typically works, it is applicable in this case because we know the reward for every possible action, just like in supervised learning. Therefore, the classification problem can be seen as a special case of reinforcement learning with these simplifications.
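As a sanity check of this equivalence, the sketch below (with made-up logits) computes the expected reward \(\sum_a P(a \mid \boldsymbol{x})\, r(a \mid \boldsymbol{x})\) under a one-hot reward vector and shows that it equals \(P(y \mid \boldsymbol{x})\), so maximizing it is the same as minimizing the cross-entropy loss from the first section.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Made-up logits f_a(x) for 3 actions (classes); the true label is action 0.
logits = np.array([1.5, 0.3, -0.7])
rewards = np.array([1.0, 0.0, 0.0])          # full, deterministic reward vector r(a | x)

probs = softmax(logits)
expected_reward = probs @ rewards            # sum_a P(a | x) * r(a | x) = P(y | x)
cross_entropy = -np.log(probs[0])            # supervised-learning loss for the same input

print(expected_reward, np.exp(-cross_entropy))  # identical: maximizing one minimizes the other
```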

Conclusion

To conclude, supervised learning can be viewed as a special case of reinforcement learning when we simplify the RL framework by assuming (1) no transitions based on actions, (2) a deterministic reward function, and (3) the ability to take all possible actions and immediately get the rewards for each. These conditions align supervised learning with a simplified version of RL, specifically contextual bandits without exploration.

One more thing to mention is that supervised learning is inherently offline, whereas reinforcement learning can be done both offline and online. When framing supervised learning as a special case of reinforcement learning, we can think of it as first collecting enough data from the environment in an online fashion and then training a supervised model on that collected offline data.

Reinforcement learning is a much more generalized and complex framework. It must handle transitions between states, stochastic rewards, and partial feedback, all while balancing exploration and exploitation. This complexity makes RL a powerful but challenging approach to solving a wide variety of learning problems, far beyond the scope of traditional supervised learning. The RL in MDPs we discussed can be further generalized into more complex settings, e.g., POMDPs and multi-agent RL, which involve more sophisticated strategies than single-agent RL in a fully observable environment.
