Most of us, whether consciously or not, use Maximum Likelihood Estimation (MLE) in our daily machine learning workflows. When you’re training a model to predict labels in a supervised setting, or even when you’re using a self-supervised approach like masked language modeling, you’re often implicitly performing MLE under the hood. From linear regression (minimizing mean squared error) to large language models (minimizing cross-entropy), the common thread is that these objectives can be interpreted as maximizing the probability of the observed data given the model’s parameters. In other words, at the foundation of many learning algorithms lies a fundamental statistical principle: make the observed data as likely as possible under your chosen model.
Why MLE Matters
In essence, MLE answers the question: “Which parameters best explain my observed data?” By turning your dataset into a probability—via a likelihood function—MLE provides a principled way to select the model parameters that make those observations most probable. This elegantly ties into common machine learning losses: minimizing negative log-likelihood (NLL) is maximizing the likelihood of your data. When you fit a regression line or train a neural classifier, you’re effectively applying MLE to find parameters that best match the data under assumed noise or distributional conditions.
Example: Coin Flips
Imagine you flip a coin three times and observe the sequence Heads, Heads, Tail. Let \(p\) be the probability of getting heads. The probability of observing the sequence “Heads, Heads, Tail” is given by:
\[p(\text{Heads, Heads, Tail} \mid p) = p^2 (1-p).\]
- If \(p = 0.5\): the likelihood is \(0.5^2 \times 0.5 = 0.125\).
- If \(p = \frac{2}{3}\): the likelihood is \(\left(\frac{2}{3}\right)^2 \times \frac{1}{3} = \frac{4}{27} \approx 0.148\).
Since \(\frac{4}{27} > 0.125\), the likelihood is higher at \(p = \frac{2}{3}\): that value of \(p\) makes the observed sequence more probable.
In practice, rather than testing candidate values of \(p\) one by one, we either solve for the maximizer analytically (setting the derivative of the log-likelihood to zero, which here yields \(p = \frac{2}{3}\), the observed fraction of heads) or use optimization techniques like gradient descent to find it efficiently.
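To make this concrete, here is a minimal sketch (assuming NumPy and SciPy are available) that recovers the estimate numerically by minimizing the negative log-likelihood of the three observed flips:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(p: float) -> float:
    """Negative log-likelihood of observing Heads, Heads, Tail."""
    # likelihood = p^2 * (1 - p); taking -log turns the product into a sum
    return -(2 * np.log(p) + np.log(1 - p))

# Minimize over (0, 1); the bounds keep log() well-defined at the edges.
result = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ~0.6667, matching the closed-form MLE of 2/3
```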
MLE Through the Lens of Bayes’ Rule
Bayes’ rule states:
\[p(\theta \mid X) \;=\; \frac{p(X \mid \theta) \, p(\theta)}{p(X)}.\]Here, \(p(X \mid \theta)\) is the likelihood, \(p(\theta)\) is the prior, and \(p(\theta \mid X)\) is the posterior. If you assume a uniform prior on \(\theta\)—or more generally, treat the prior as constant over the parameter space—then maximizing the posterior \(p(\theta \mid X)\) is exactly the same as maximizing \(p(X \mid \theta)\). This is precisely what MLE does.
MAP vs. MLE
If the prior is not uniform, you get Maximum A Posteriori (MAP) estimation:
\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log p(X \mid \theta) + \log p(\theta) \right].\]In contrast:
- MLE: \(\arg\max_\theta \log p(X \mid \theta).\)
- MAP: \(\arg\max_\theta [\log p(X \mid \theta) + \log p(\theta)].\)
So, MLE is essentially a special case of MAP when \(\log p(\theta)\) is effectively constant or negligible.
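To make the role of the prior concrete, consider a Gaussian prior \(\theta \sim \mathcal{N}(0, \tau^2 I)\) (an illustrative choice, not the only one). Its log-density is quadratic in \(\theta\), so after dropping constants the MAP objective becomes MLE plus an L2 penalty, which is exactly the familiar weight decay:
\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log p(X \mid \theta) - \frac{\|\theta\|_2^2}{2\tau^2}\right].\]A broad prior (large \(\tau\)) shrinks the penalty toward zero and recovers plain MLE, matching the statement above.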
Defining the Likelihood and Log-Likelihood
When applying MLE, we start by assuming that the data points \(x_1, x_2, \dots, x_n\) are drawn independently and identically distributed (i.i.d.) from some distribution \(p(x \mid \theta)\). This assumption allows us to express the joint probability of the observed data as the product of the individual likelihoods:
\[p(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta).\]Taking the logarithm converts this product into a sum, which is much easier to work with mathematically:
\[\log p(x_1, x_2, \dots, x_n \mid \theta) = \sum_{i=1}^n \log p(x_i \mid \theta).\]Since MLE seeks to maximize the likelihood, we are equivalently maximizing the log-likelihood. In practice, however, optimization routines are designed to minimize a loss function. Therefore, we often reformulate the problem as minimizing the negative log-likelihood (NLL):
\[\hat{\theta} = \arg\min_\theta \left[-\sum_{i=1}^n \log p(x_i \mid \theta)\right].\]This formulation seamlessly ties MLE to many common loss functions used in machine learning.
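As a minimal sketch of this recipe (assuming SciPy is available), the snippet below fits the mean of a Gaussian with known variance by minimizing the NLL and checks the result against the sample mean, which is the known closed-form MLE in this case:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=500)  # synthetic i.i.d. data

def nll(theta: np.ndarray) -> float:
    """Negative log-likelihood: -sum_i log p(x_i | theta), sigma fixed at 1."""
    return -np.sum(norm.logpdf(x, loc=theta[0], scale=1.0))

result = minimize(nll, x0=np.array([0.0]))
print(result.x[0], x.mean())  # both ~3.0: the MLE of the mean is the sample mean
```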
Loss Functions as Negative Log-Likelihoods
A key insight in MLE is that by choosing a particular distribution for the data, we implicitly select a corresponding loss function for training our model. In other words, minimizing the negative log-likelihood (NLL) under a given distributional assumption is equivalent to minimizing a well-known loss function. Here are a few examples:
Gaussian Distribution and Mean Squared Error (MSE)
Suppose we assume that the target variable \(y\) is generated from a Gaussian distribution with mean \(\hat{y}\) (the model’s prediction) and variance \(\sigma^2\). The probability density function is:
\[p(y \mid \hat{y}; \sigma^2) = \frac{1}{\sqrt{2\pi\,\sigma^2}} \exp\!\Bigl(-\frac{(y - \hat{y})^2}{2\sigma^2}\Bigr).\]Taking the negative log yields:
\[-\log p(y \mid \hat{y}; \sigma^2) = \frac{(y - \hat{y})^2}{2\sigma^2} + \text{constant}.\]Minimizing this is equivalent to minimizing the squared error between \(y\) and \(\hat{y}\), which is the familiar MSE loss.
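A quick numerical check (a NumPy sketch, with \(\sigma\) fixed at an arbitrary value) confirms that the Gaussian NLL and the squared error differ only by a constant scale and shift, so they share the same minimizer:

```python
import numpy as np

sigma = 2.0
y = 1.5                              # observed target
y_hat = np.linspace(-3.0, 6.0, 901)  # grid of candidate predictions

# Gaussian NLL, including the normalization constant
nll = 0.5 * np.log(2 * np.pi * sigma**2) + (y - y_hat) ** 2 / (2 * sigma**2)
mse = (y - y_hat) ** 2

# Both curves bottom out at y_hat == y
print(y_hat[np.argmin(nll)], y_hat[np.argmin(mse)])  # 1.5 1.5
```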
Laplace Distribution and Mean Absolute Error (MAE)
If we assume instead that the errors follow a Laplace distribution with mean \(\hat{y}\) and scale parameter \(\beta\), the probability density function is:
\[p(y \mid \hat{y}; \beta) = \frac{1}{2\beta} \exp\!\Bigl(-\frac{|y-\hat{y}|}{\beta}\Bigr).\]The negative log-likelihood is then:
\[-\log p(y \mid \hat{y}; \beta) = \frac{|y-\hat{y}|}{\beta} + \text{constant}.\]Minimizing this is equivalent to minimizing the absolute error between \(y\) and \(\hat{y}\), resulting in the MAE loss.
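To see how the two noise assumptions behave differently, here is a small sketch: for a constant prediction over a dataset, minimizing the Laplace NLL (MAE) recovers the median, while the Gaussian NLL (MSE) recovers the mean, which is why MAE is more robust to outliers:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # dataset with one outlier
c = np.linspace(0.0, 110.0, 11001)         # candidate constant predictions

mae = np.mean(np.abs(y[:, None] - c[None, :]), axis=0)
mse = np.mean((y[:, None] - c[None, :]) ** 2, axis=0)

print(c[np.argmin(mae)], np.median(y))  # ~3.0: MAE tracks the median
print(c[np.argmin(mse)], np.mean(y))    # ~22.0: MSE is pulled by the outlier
```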
Bernoulli/Categorical Distribution and Cross-Entropy Loss
For classification tasks, if we assume the target labels follow a Bernoulli distribution (for binary classification) or a Categorical distribution (for multi-class classification), the negative log-likelihood becomes the cross-entropy loss.
- Binary Classification (Bernoulli): For a binary classification problem where the model predicts \(\hat{y}\) as the probability of class 1, the probability mass function is
\[p(y \mid \hat{y}) = \hat{y}^y (1-\hat{y})^{1-y},\]and the negative log-likelihood is
\[-\log p(y \mid \hat{y}) = -\Bigl[y\log(\hat{y}) + (1-y)\log(1-\hat{y})\Bigr].\]This is exactly the binary cross-entropy loss.
- Multi-Class Classification (Categorical): In a multi-class classification problem with \(K\) classes, each training sample is represented by a one-hot encoded vector \(y = (y_1, y_2, \dots, y_K),\) where one element is 1 (indicating the true class) and the others are 0. The model outputs a predicted probability vector \(\hat{y} = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_K),\) with the constraint that \(\sum_{c=1}^K \hat{y}_c = 1.\) The probability of observing the correct label given the prediction is defined as
\[p(y \mid \hat{y}) = \prod_{c=1}^K \hat{y}_c^{\,y_c},\]which simplifies to the predicted probability of the true class. Taking the negative logarithm gives
\[-\log p(y \mid \hat{y}) = -\sum_{c=1}^K y_c \log \hat{y}_c.\]This expression is exactly the multi-class cross-entropy loss. A short numerical check of both cases follows below.
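Here is a minimal sketch (NumPy only) that evaluates both negative log-likelihoods directly and confirms they match the familiar cross-entropy formulas:

```python
import numpy as np

# Binary case: true label y = 1, predicted probability of class 1 is 0.8
y, y_hat = 1, 0.8
nll_bernoulli = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(nll_bernoulli)  # 0.2231... == -log(0.8)

# Multi-class case: one-hot label for the third of K = 3 classes
y_onehot = np.array([0.0, 0.0, 1.0])
probs = np.array([0.1, 0.2, 0.7])  # model's predicted distribution
nll_categorical = -np.sum(y_onehot * np.log(probs))
print(nll_categorical)  # 0.3566... == -log(0.7)
```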
These examples illustrate that by choosing a particular likelihood model, we are implicitly defining the loss function used during training. In other words, minimizing the NLL under a specific distributional assumption is equivalent to minimizing the corresponding loss function.
Why Choosing the Right Loss Matters
The loss function we choose is not just a mathematical convenience—it reflects our assumptions about how the data are generated. By assuming that the data come from a specific distribution, we implicitly define the noise model underlying the observations. For instance, assuming Gaussian noise implies that deviations from the model are symmetrically distributed and that larger errors are penalized quadratically, which gives rise to the Mean Squared Error (MSE) loss. On the other hand, if the data contain outliers or are heavy-tailed, a Laplace noise assumption might be more appropriate, leading to the Mean Absolute Error (MAE) loss.
Choosing the right loss function is crucial because:
- Model Accuracy: The loss function directly influences how the model learns. A misaligned loss function can lead to suboptimal parameter estimates.
- Noise Modeling: It embodies our assumptions about the nature of the errors in the data. This choice determines the model’s sensitivity to different types of errors.
- Optimization Behavior: The properties of the loss function, such as its gradients, affect the efficiency and stability of the optimization process.
- Interpretability: A loss function grounded in a probabilistic model allows for a more meaningful interpretation of the results within a statistical framework.
In summary, selecting the appropriate loss function—by aligning it with a realistic assumption about the data distribution—is key to building models that perform well and accurately capture the underlying data-generating process.
Interpreting MLE as a Point Estimate
One key aspect of MLE is that it yields a single point estimate \(\hat{\theta}\). It doesn’t provide a distribution over \(\theta\)—that would be a Bayesian approach where you’d maintain the entire posterior \(p(\theta \mid X)\). MLE is simpler in that sense:
- Form a likelihood based on your model assumption.
- Optimize w.r.t. \(\theta\).
- Obtain a single “best-fit” parameter value.
While straightforward, point estimates are limited when you want to capture model uncertainty; we'll save that topic for another blog post!
Minimizing Cross-Entropy and KL Divergence
In supervised learning, the cross-entropy between the true label distribution \(p\) and the predicted distribution \(q\) is defined as the expectation of the negative log-probability of \(q\) under \(p\):
\[H(p, q) = \mathbb{E}_{c \sim p}[-\log q(c)] = -\sum_{c} p(c) \log q(c).\]Similarly, the self-entropy (or Shannon entropy) of the true distribution \(p\) is:
\[H(p) = \mathbb{E}_{c \sim p}[-\log p(c)] = -\sum_{c} p(c) \log p(c).\]The Kullback–Leibler (KL) divergence from \(p\) to \(q\) is defined as:
\[D_{KL}(p \parallel q) = \mathbb{E}_{c \sim p}\left[\log \frac{p(c)}{q(c)}\right] = \sum_{c} p(c) \log \frac{p(c)}{q(c)}.\]Expanding the KL divergence, we have:
\[D_{KL}(p \parallel q) = \sum_{c} p(c) \log p(c) - \sum_{c} p(c) \log q(c) = -H(p) + H(p, q).\]Rearranging this expression gives:
\[H(p, q) = H(p) + D_{KL}(p \parallel q).\]In supervised learning, the true distribution \(p\) (often represented as a one-hot vector) is fixed, so its entropy \(H(p)\) is constant with respect to the model parameters. Consequently, minimizing the cross-entropy loss
\[\min_q H(p, q)\]is equivalent to minimizing the KL divergence
\[\min_q D_{KL}(p \parallel q),\]since the constant term \(H(p)\) does not affect the optimization. (For a one-hot \(p\), \(H(p) = 0\), so the two objectives are not merely equivalent up to a constant but identical.)
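As a sanity check, the identity \(H(p, q) = H(p) + D_{KL}(p \parallel q)\) is easy to verify numerically; here is a small sketch with two arbitrary distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model's predicted distribution

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
kl = np.sum(p * np.log(p / q))

# H(p, q) == H(p) + KL(p || q), up to floating-point error
print(np.isclose(cross_entropy, entropy + kl))  # True
```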
Conclusion
Maximum Likelihood Estimation (MLE) provides a unifying framework for understanding and designing loss functions in machine learning. At its core, MLE seeks the set of parameters that makes the observed data as likely as possible under a given probability model. This probabilistic approach naturally leads to common loss functions: assuming Gaussian noise results in the Mean Squared Error (MSE) loss, while a Laplace noise assumption leads to the Mean Absolute Error (MAE) loss. For classification tasks, modeling the data with Bernoulli or Categorical distributions gives rise to the cross-entropy loss—and when the true label distribution is fixed, minimizing cross-entropy is equivalent to minimizing the KL divergence between the true distribution and the model’s predictions.
Moreover, the choice of loss function reflects our assumptions about the data-generating process. Aligning the loss function with these assumptions is crucial because it directly influences model accuracy, the behavior of the optimization process, and how sensitive the model is to various types of errors. Although MLE yields a single best-fit parameter value—a point estimate—it forms the foundation upon which more advanced techniques (such as Bayesian methods for capturing uncertainty) are built.
By understanding MLE and its intimate connection to loss functions, you gain valuable insight into why models are trained the way they are and how the underlying assumptions about noise and variability shape the learning process. This understanding can empower you to make more informed choices in designing and optimizing machine learning models.