Maximum Likelihood Principle for deriving cost functions for regression and classification

Hypothesis set

We are given a set of hypotheses $\mathcal H$ (here: parametric models) for explaining some data $\mathcal D$. The hypotheses in the set share the same functional form but differ in their parameters $\theta$.

As an example we assume that the data are $(x,y)$-pairs (like in supervised learning):

$\mathcal D = \{ (x_1, y_1), (x_2, y_2), \dots , (x_n,y_n) \}$

Example: (Linear) Regression

Let's generate some data (xy-pairs):

In [2]:
import numpy as np
x = np.random.rand(25) * 2. - 1.                           # 25 points uniform in [-1, 1]
y = 1. + 1.5 * np.sin(x) + np.random.randn(len(x)) * 0.3   # gaussian noise, as assumed below
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(x, y, 'bx')

Just looking at the graph, a linear relation seems to be a good fit. (In fact, in the given $x$-range the noise is much stronger than the non-linearity of the sine.)

A plausible assumption for the noise of real-world data is that it is gaussian, $\mathcal N(0, \sigma^2)$, because of the Central Limit Theorem: if many independent factors influence the noise, each of them can be seen as a step of a one-dimensional random walk, and the probability distribution of a one-dimensional random walk converges to a gaussian distribution (see for instance Reif: Fundamentals of Statistical and Thermal Physics).

The hypothesis set $\mathcal H$ of a linear model:

$$ h_\theta(x) = \theta_0 + \theta_1 x $$

So $\mathcal H$ is parametrized by $\theta = \{\theta_0, \theta_1 \}$.

The probability $p(y \mid x)$ follows from the noise model $y = h_\theta(x) + \epsilon$ with gaussian noise $\epsilon \sim \mathcal N(0, \sigma^2)$:

$$ p(y \mid x) = \mathcal N(y \mid h_\theta(x), \sigma^2) $$

This is a discriminative model $p(y \mid x)$. We have a probability for $y$ given $x$ and some fixed parameters. (The variance of the noise $\sigma^2$ can also be seen as a parameter of the model.)
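A minimal sketch of sampling from this discriminative model (the parameter values $\theta_0 = 1$, $\theta_1 = 1.5$ and $\sigma = 0.3$ are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x, theta0=1.0, theta1=1.5):
    # linear hypothesis h_theta(x) = theta_0 + theta_1 * x
    return theta0 + theta1 * x

def sample_y(x, sigma=0.3):
    # draw y ~ N(h_theta(x), sigma^2), i.e. sample from the model p(y | x)
    return h(x) + rng.normal(0.0, sigma, size=np.shape(x))

x = rng.uniform(-1.0, 1.0, size=5)
y = sample_y(x)
```

For each fixed $x$, repeated draws of `sample_y(x)` scatter around $h_\theta(x)$ with standard deviation $\sigma$.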

Bayes Theorem

Given a hypothesis set $\mathcal H$ and some data $\mathcal D$, the probability of the parameters $\theta$ can be expressed by Bayes' theorem:

$$ P(\theta \mid \mathcal D, \mathcal H) = \frac{P(\mathcal D \mid \theta, \mathcal H) \cdot P(\theta \mid \mathcal H)}{P(\mathcal D \mid \mathcal H)} $$

For this problem the different parts of the equation have the following names: $$ posterior = \frac{likelihood \cdot prior}{evidence} $$

Maximum Likelihood Estimator

A maximum likelihood estimator considers only the likelihood $P(\mathcal D \mid \theta, \mathcal H)$: it estimates the parameters $\theta$ by maximizing the likelihood.

$$ \underset{\theta}{\operatorname{argmax}} P(\mathcal D \mid \theta, \mathcal H) $$

For a constant prior the following holds (it follows from Bayes' rule):

$$ \underset{\theta}{\operatorname{argmax}} P(\theta \mid \mathcal D, \mathcal H) = \underset{\theta}{\operatorname{argmax}} P(\mathcal D \mid \theta, \mathcal H) $$

Note: The denominator $P(\mathcal D\mid \mathcal H)$ is independent of $\theta$, so it plays no role in the maximization.
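A small numerical sketch of this (a hypothetical one-parameter model: estimating the mean of a gaussian with known $\sigma = 1$ under a flat prior):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=50)     # D: 50 samples, unknown mean, sigma = 1
thetas = np.linspace(0.0, 4.0, 401)      # grid of candidate means theta

# log-likelihood of D under N(theta, 1), up to an additive constant
log_lik = np.array([-0.5 * np.sum((data - t) ** 2) for t in thetas])
log_prior = np.zeros_like(thetas)        # constant (flat) prior
log_posterior = log_lik + log_prior      # posterior up to the log-evidence
```

Adding a constant log-prior shifts every grid point by the same amount, so `argmax(log_posterior)` and `argmax(log_lik)` coincide (here both land on the grid point closest to the sample mean).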

Example: Regression

For a discriminative model we use the conditional likelihood $P(Y \mid X, \theta, \mathcal H)$ instead of $P(\mathcal D \mid \theta, \mathcal H)$.

For our regression example note that the data points are conditionally independent of each other under the hypothesis set $\mathcal H$ (for a fixed $\theta$).

$$ \underset{\theta}{\operatorname{argmax}} \left[ P(y_1\mid x_1; \theta, \mathcal H) \cdot P(y_2\mid x_2; \theta, \mathcal H) \cdots P(y_n\mid x_n;\theta, \mathcal H) \right] $$ with

$$ P(y_i\mid x_i; \theta, \mathcal H) = \mathcal N(y_i \mid h_\theta(x_i), \sigma^2) = \frac {1}{\sigma \sqrt{2\pi}} e^{- \frac{(y_i - h_\theta(x_i) )^2 }{2 \sigma^2}} $$

The maximizer of the likelihood is the same as the minimizer of the negative log-likelihood, because the logarithm is a monotonic function.

$$ \underset{\theta}{\operatorname{argmin}} \left[- \log(P(y_1\mid x_1; \theta, \mathcal H)) - \log(P(y_2\mid x_2; \theta, \mathcal H)) - \dots - \log(P(y_n\mid x_n ;\theta, \mathcal H)) \right] $$

with $$ \log(P(y_i \mid x_i; \theta, \mathcal H)) = \log(\mathcal N(y_i \mid h_\theta(x_i), \sigma^2)) = C -\frac{(y_i - h_\theta(x_i))^2}{2 \sigma ^2} $$ where $C = -\log(\sigma \sqrt{2 \pi})$ is independent of $\theta$.

The argmin is independent of the constant $C$, and with the assumption that each data point has the same $\sigma^2$, the maximum likelihood estimation of the parameters $\theta$ is equivalent to the minimization of the squared error:

$$ \hat\theta = \underset{\theta}{\operatorname{argmin}} \; E(\theta) \quad \text{with} \quad E(\theta) = \frac{1}{2}\sum_i (y_i - h_\theta (x_i))^2 $$

Note that we didn't use the linear relationship between $x$ and $y$, so this holds for parametric regression in general.

Therefore the MLE (maximum likelihood estimation) for finding the parameters of a parametric regression is equivalent to the minimization of the squared error.

This rests on the gaussian noise assumption.
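A small numerical check of this equivalence (hypothetical data; for simplicity only the slope is varied, with the intercept fixed at its true value and $\sigma$ known): the negative log-likelihood is the squared error scaled by $1/\sigma^2$ plus a constant, so both have the same argmin.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=30)
y = 1.0 + 1.5 * x + rng.normal(0.0, 0.3, size=30)   # linear data with gaussian noise

sigma = 0.3
theta1_grid = np.linspace(0.0, 3.0, 301)            # slope candidates

def sq_err(t1):
    # squared error E(theta) with intercept fixed at 1
    return 0.5 * np.sum((y - (1.0 + t1 * x)) ** 2)

def neg_log_lik(t1):
    # gaussian negative log-likelihood for the same model
    r = y - (1.0 + t1 * x)
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2) + r ** 2 / (2 * sigma ** 2))

E = np.array([sq_err(t) for t in theta1_grid])
NLL = np.array([neg_log_lik(t) for t in theta1_grid])
```

Both curves are minimized at the same grid point, which lies close to the true slope 1.5.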

For example from numpy's polyfit documentation:

Fit a polynomial ``p(x) = p[0] * x**deg + ... + p[deg]`` of degree `deg`
to points `(x, y)`. Returns a vector of coefficients `p` that minimises
the squared error.
In [3]:
theta_1, theta_0 = np.polyfit(x, y, deg=1)  # polyfit returns coefficients with the highest degree first

x_ = np.array([-1.,1])
y_ = theta_0 + theta_1 * x_

plt.plot(x, y, 'bx')
plt.plot(x_,y_, 'g-')

Example: Binary Classification

For binary classification the training data $\mathcal D$ consists of $(\vec x^{(i)}, t^{(i)})$ pairs (with $t^{(i)} \in \{0,1\}$).

The hypothesis $h(\vec x^{(i)})$ gives us the probability that the feature vector $\vec x^{(i)}$ corresponds to an object of the positive class, i.e. the probability of $y^{(i)}=1$:

$$ p(y^{(i)}=1 \mid \vec x^{(i)}) = h(\vec x^{(i)}) $$

and for the negative class

$$ p(y^{(i)}=0 \mid \vec x^{(i)}) = 1 - p(y^{(i)}=1 \mid \vec x^{(i)}) = 1- h(\vec x^{(i)}) $$

Each data point is conditionally independent of the others under the hypothesis set $\mathcal H$, as in the regression example. So the maximum likelihood principle gives us:

$$ \underset{\theta}{\operatorname{argmax}} P(\mathcal D| \theta, \mathcal H) = \underset{\theta}{\operatorname{argmax}} \prod_{i} \left(h(\vec x^{(i)})\right)^{t^{(i)}} \left(1-h(\vec x^{(i)})\right)^{1-t^{(i)}} $$

Note: The 'to the power' trick selects the right probability factor, $h$ resp. $1-h$, depending on the label $t^{(i)}$.

So the corresponding minimization of the negative log likelihood is given by:

$$ \underset{\theta}{\operatorname{argmin}} \left( - \log P(\mathcal D\mid \theta, \mathcal H) \right) = \underset{\theta}{\operatorname{argmin}} \left( - \sum_{i} \left( t^{(i)} \log \left(h(\vec x^{(i)})\right) + (1-t^{(i)}) \log\left(1-h(\vec x^{(i)})\right) \right) \right) $$


This cost function is also called the cross entropy cost for binary classification (see below). So the maximum likelihood principle for binary classification is equivalent to minimizing the (binary) cross entropy.
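A minimal numerical sketch of this cost (the clipping constant and the example probabilities are illustrative choices, not part of the derivation):

```python
import numpy as np

def binary_nll(h, t):
    # negative log-likelihood of labels t in {0,1} given the predicted
    # probabilities h = h(x) for the positive class
    h = np.clip(h, 1e-12, 1 - 1e-12)   # numerical safety at exactly 0 and 1
    return -np.sum(t * np.log(h) + (1 - t) * np.log(1 - h))

t = np.array([1, 0, 1, 1])
h_good = np.array([0.9, 0.1, 0.8, 0.7])   # confident and mostly correct
h_bad  = np.array([0.5, 0.5, 0.5, 0.5])   # uninformative predictions
```

The cost is lower for predictions that put high probability on the observed labels, which is exactly what maximizing the likelihood demands.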

What's exactly the relation to cross entropy?

Definition of the cross entropy $H(p,q)$ between two distributions $p(y \mid \vec x)$ and $q(y \mid \vec x)$:

$$ H(p,q) := - \mathbb E_{(\vec x,y)\sim p}[\log q(y \mid \vec x)] = - \sum_y \int_{\mathcal X} p(\vec x, y) \log q(y \mid \vec x) \, d\vec x $$


  • $\mathcal X$: $\vec x \in \mathbb R^n$
  • $y \in \{0,1\}$

For a binary classification problem with one output $h(\vec x)$ of the classifier, the probability for the positive class $y=1$ is the output $h(\vec x)$:

  • $q(y=1 \mid \vec x) = h(\vec x)$
  • $q(y=0 \mid \vec x) = 1- h(\vec x)$
  • $p(y\mid x)$: "true" unknown conditional probability

So the cross entropy becomes: $$ H(p,q) = - \mathbb E_{(\vec x,y)\sim p}[\log q(y \mid \vec x)] = - \int_{\mathcal X} p(\vec x, y{=}1) \log h(\vec x) \, d\vec x - \int_{\mathcal X} p(\vec x, y{=}0) \log \left(1-h(\vec x)\right) d\vec x $$

By sampling $m$ training examples $(\vec x^{(i)}, t^{(i)})$ from the distribution $p(\vec x, y)$ (each labeled with its true label $t^{(i)}$) we can estimate the cross entropy $J \approx H(p,q)$ by:

$$ J = - \frac{1}{m} \sum_{i=1}^m \left(\mathbb 1_{t^{(i)}=1}\log h(\vec x^{(i)}) + \mathbb 1_{t^{(i)}=0} \log \left(1-h(\vec x^{(i)})\right) \right) $$ which is equivalent to $$ J = - \frac{1}{m} \sum_{i=1}^m \left(t^{(i)}\log h(\vec x^{(i)}) + (1-t^{(i)}) \log \left(1-h(\vec x^{(i)})\right) \right) $$
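A small sketch (with hypothetical random labels and predicted probabilities) showing that the indicator form and the $t$-weighted form compute the same $J$:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
t = rng.integers(0, 2, size=m)            # labels t^(i) in {0, 1}
h = rng.uniform(0.05, 0.95, size=m)       # predicted probabilities h(x^(i))

# indicator form: pick log h or log(1-h) depending on the label
J_ind = -np.mean(np.where(t == 1, np.log(h), np.log(1 - h)))
# t-weighted form: exponent trick turned into multiplication by t and 1-t
J_t = -np.mean(t * np.log(h) + (1 - t) * np.log(1 - h))
```

Since $t^{(i)} \in \{0,1\}$, the weights $t^{(i)}$ and $1-t^{(i)}$ act exactly like the indicators, so the two estimates agree.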

So minimization of the negative log likelihood is equivalent to the minimization of the cross entropy (up to the constant factor $\frac{1}{m}$).

Because of $$ D_{KL}(p \mid \mid q) = H(p,q) - H(p) $$


  • $D_{KL}$: Kullback Leibler Divergence
  • $H(p)$: Entropy

For fixed $p$ the entropy $H(p)$ is constant, so minimizing the cross entropy $H(p,q)$ over $q$ is equivalent to minimizing $D_{KL}(p \mid\mid q)$. The minimal possible $D_{KL}$ is $0$, which holds if and only if the distributions are the same, i.e. $p=q$.
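For discrete distributions the identity $D_{KL}(p \mid\mid q) = H(p,q) - H(p)$ can be checked directly (the two distributions below are arbitrary examples):

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i log p_i
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i log q_i
    return -np.sum(p * np.log(q))

def kl(p, q):
    # D_KL(p || q) = sum_i p_i log(p_i / q_i)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
```

As expected, `kl(p, q)` equals `cross_entropy(p, q) - entropy(p)`, is non-negative, and vanishes for `q = p`.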