## Supervised Learning Objective: Maximum Likelihood

In **supervised learning**, the goal is to learn a function \(f(\boldsymbol{x})\) that
maps an input \(\boldsymbol{x}\) to a label \(y\). This is typically done by
maximizing the likelihood of the correct label \(y\) given \(\boldsymbol{x}\). For
classification, the model outputs probabilities using the **softmax** function over \(N\) possible categories. The probability of class \(y\) is given by:

The objective is to maximize this probability. As for classification, we normally choose
the **cross-entropy loss** as a loss function to maximize the likelihood:

Basically minimizing the cross-entropy loss here is to make your neural network output a probability distribution as close as possible to the labeled one-hot vector.

For example, in an image classification task, we are given labeled data \((\boldsymbol{x}, y)\), where \(\boldsymbol{x}\) is the image and \(y\) is its corresponding label (e.g., “cat” or “dog”). Supervised learning relies on this labeled data to learn the correct mapping from inputs to labels.

## Reinforcement Learning Objective: Reward Maximization in MDP

In **reinforcement learning (RL)**, the objective is to maximize the cumulative reward
by interacting with an environment. Formally, RL is modeled as a **Markov Decision
Process (MDP)**, where at each step \(t\), the agent is in a state \(s_t\), takes an
action \(a_t\), receives a reward \(r_{t+1}\), and transitions to a new state \(s_{t+1}\).

The objective in RL is to maximize the expected cumulative reward, or **return**, over
time:

Here, \(\gamma\) is the discount factor that controls the weight of future rewards, \(r_{t+1}\) is the reward received at time \(t+1\), and \(\pi(a \mid s)\) is the
policy that the agent follows. An essential part of RL is the **exploration-exploitation
trade-off**: the agent needs to explore different actions to learn which ones yield the
highest rewards, especially because the reward function is often **stochastic**.

## Simplifying RL: Supervised Learning as a Special Case of RL

In this post, we want to show how supervised learning can be viewed as a special case of reinforcement learning. The key simplifications are:

**No transition based on actions**: In a simplified version of RL, the next state does not depend on the current action, unlike in a general MDP.**Deterministic reward function**: Instead of a stochastic reward, we assume the reward function is deterministic.**The agent can take all possible actions and get the corresponding rewards**: We assume that the agent has access to the rewards for all possible actions at each step, which is not possible in traditional RL.

These simplifications effectively turn a reinforcement learning problem into something that resembles supervised learning.

## No Transition Based on Actions: Contextual Bandits

The first simplification is to remove the **state transitions** based on actions. In a
general MDP, the next state \(s_{t+1}\) depends on both the current state \(s_t\)
and the action \(a_t\) taken by the agent. However, in a **contextual bandit**
problem, there are no transitions between states. The state (or context) \(\boldsymbol{x}\) remains fixed, and the goal is simply to choose the best action based
on the current context to maximize the immediate reward. The objective for contextual
bandits is:

This simplification makes the problem much closer to supervised learning, where the focus is on making predictions based on static inputs rather than managing state transitions.

## Deterministic Reward Function

The second simplification is to assume a **deterministic reward function**. In general
RL, the reward \(r\) for taking an action \(a\) in state \(s\) is often
stochastic, meaning that taking the same action in the same state might yield different
rewards due to randomness in the environment. However, if we assume the reward function
is **deterministic**, we know with certainty what the reward will be for each action: \(r(a \mid \boldsymbol{x})\).

In this scenario, there is no uncertainty in the reward, making it unnecessary for the agent to explore different actions to estimate the expected reward. This mirrors supervised learning, where the correct label (reward) is always known and deterministic.

## Agent Can Take All Actions and Get the Rewards

The third simplification is assuming that the agent can take **all possible actions**
and immediately receive the corresponding rewards for each. In standard RL and
contextual bandits, after taking an action, the agent only receives feedback about the
reward for the action it took. However, in this special case, the agent knows the
rewards for **all possible actions** at each step:

This is equivalent to receiving the **full reward vector** for every possible action,
which is similar to the labeled data in supervised learning where, for each input \(\boldsymbol{x}\), we know the correct label (or the correct action). In this case, the
agent doesn’t need to explore because it has full information about the rewards for all
actions, making exploration unnecessary.

## Classification as a Special Case of RL

Given these three simplifications—(1) no transitions, (2) deterministic rewards, and (3)
access to all rewards for all actions—we can now see that supervised learning is indeed
a **special case of reinforcement learning**.

In classification, given an input \(\boldsymbol{x}\), the model predicts the label \(y\), which corresponds to taking an action. The feedback (or label) for the input is
deterministic and known for all possible classes, which corresponds to receiving all
possible rewards for the available actions. The model could apply a **softmax** function
to assign probabilities to each action (or class):

Although softmax is not typically used in RL, it is applicable in this case because we have full feedback about all possible actions, just like in supervised learning. Therefore, the classification problem can be seen as a special case of reinforcement learning with these simplifications.

## Conclusion

To conclude, **supervised learning** can be viewed as a **special case of reinforcement
learning** when we simplify the RL framework by assuming (1) no transitions based on
actions, (2) a deterministic reward function, and (3) the ability to take all possible
actions and immediately get the rewards for each. These conditions align supervised
learning with a simplified version of RL, specifically **contextual bandits** without
exploration.

One more thing is to mention is that supervised learning is inherently offline, although reinforcement learning can be done both offline and online. In framing supervised learning as a special case of reinforcement learning, we can consider that we just collect enough data in an online fashion from the environment, and train a supervised learning model with the collected offline data.

**Reinforcement learning** is a much more generalized and complex framework. It must
handle transitions between states, stochastic rewards, and partial feedback—all while
balancing exploration and exploitation. This complexity makes RL a powerful but
challenging approach to solving a wide variety of learning problems, far beyond the
scope of traditional supervised learning. The RL in MDP we discussed can be further
generalized into more complex RL, e.g., POMDP, multi-agent RL, etc., which involves more
sophisticated strategies than the RL with one fully observable agent.