Maximum Likelihood Principle for deriving cost functions for regression and classification

Hypothesis set

Given a set of hypotheses $\mathcal H$ (here parametric models) for explaining some data $\mathcal D$, the hypotheses in the set share the same functional form but differ in their parameters $\theta$.

As an example we assume that the data are $(x,y)$-pairs (like in supervised learning):

$\mathcal D = \{ (x_1, y_1), (x_2, y_2), \dots , (x_n,y_n) \}$

Example: (Linear) Regression

Let's generate some data (xy-pairs):

In [2]:
import numpy as np
x = np.random.rand(25) * 2.  - 1. 
y = 1. + 1.5 * np.sin(x) + np.random.randn(len(x)) * 0.5  # zero-mean Gaussian noise, matching the model assumption below
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(x, y, 'bx')

Just looking at the graph, a linear relation seems to be a good fit. (In fact, the noise is much stronger than the non-linearity of the sine in the given $x$-range.)

A plausible assumption for the noise of real-world data is that it is Gaussian, $\mathcal N(0,\sigma)$, because of the Central Limit Theorem: if many independent factors influence the noise, each of them can be seen as a step of a one-dimensional random walk, and the probability distribution of a one-dimensional random walk approaches a Gaussian distribution (see for instance Reif: Fundamentals of Statistical and Thermal Physics).
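This can be checked numerically. A minimal sketch (assuming only numpy is available): sum many small independent uniform steps and compare the empirical spread of the walk with the Gaussian prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each noise sample is modeled as the end point of a one-dimensional
# random walk: the sum of many small independent steps.
n_steps, n_samples = 1000, 100_000
steps = rng.uniform(-1.0, 1.0, size=(n_samples, n_steps))
walk_end = steps.sum(axis=1)

# By the CLT the sum is approximately N(0, sqrt(n_steps * var(step))),
# where var(U(-1, 1)) = 1/3.
expected_std = np.sqrt(n_steps / 3.0)
print(walk_end.mean(), walk_end.std(), expected_std)
```

The empirical mean is close to 0 and the empirical standard deviation matches the CLT prediction, even though the individual steps are uniform, not Gaussian.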

The hypothesis set $\mathcal H$ of a linear model:

$$ h_\theta(x) = \theta_0 + \theta_1 x $$

So $\mathcal H$ is parametrized by $\theta = \{\theta_0, \theta_1 \}$.

The model assumes that $y$ is the prediction plus Gaussian noise, i.e. $y = h_\theta(x) + \epsilon$ with $\epsilon \sim \mathcal N(0, \sigma)$. So the probability $p(y|x)$ is given by:

$$ p(y|x) = \mathcal N(y;\, h_\theta(x), \sigma) $$

This is a discriminative model $p(y|x)$: we have a probability for $y$ given $x$ and some fixed parameters. $\sigma$ can also be seen as a parameter of the model.
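As a small sketch of this discriminative model (the helper names below are illustrative, not from a library): for fixed parameters, the density of $y$ given $x$ is a Gaussian centered at the model prediction.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Density of N(mu, sigma) evaluated at y."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def p_y_given_x(y, x, theta0, theta1, sigma):
    """Discriminative model: y | x ~ N(h_theta(x), sigma) with h_theta(x) = theta0 + theta1 * x."""
    return gaussian_pdf(y, theta0 + theta1 * x, sigma)

# For theta = (1.0, 0.5) and x = 1.0 the density peaks at y = h_theta(1.0) = 1.5:
print(p_y_given_x(1.5, 1.0, 1.0, 0.5, 0.3))
```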

Bayes Theorem

Given a hypothesis set $\mathcal H$ and some data $\mathcal D$, the probability of the parameters $\theta$ can be expressed by Bayes' theorem:

$$ P(\theta| \mathcal D, \mathcal H) = \frac{P(\mathcal D| \theta, \mathcal H) \cdot P(\theta| \mathcal H)}{P(\mathcal D| \mathcal H)} $$

For this problem the different parts of the equation have the following names: $$ posterior = \frac{likelihood \cdot prior}{evidence} $$

Maximum Likelihood Estimator

A maximum likelihood estimator for the parameters considers only the likelihood $P(\mathcal D| \theta, \mathcal H)$ to get an estimation for the parameters $\theta$ by maximizing the likelihood.

$$ \underset{\theta}{\operatorname{argmax}} P(\mathcal D| \theta, \mathcal H) $$

For a constant prior the following holds (it follows from Bayes' rule):

$$ \underset{\theta}{\operatorname{argmax}} P(\theta| \mathcal D, \mathcal H) = \underset{\theta}{\operatorname{argmax}} P(\mathcal D| \theta, \mathcal H) $$

Note: The denominator $P(\mathcal D| \mathcal H)$ is independent of $\theta$, so it plays no role in the maximization.
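This equivalence can be illustrated with a toy one-parameter model (a hypothetical example, not from the regression above): estimating the mean of Gaussian data on a parameter grid, the log-posterior with a flat prior has the same maximizer as the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)   # toy data, true mean 2.0

thetas = np.linspace(0.0, 4.0, 401)              # grid over the parameter

# Log-likelihood of the data under N(theta, 1), up to a theta-independent constant:
log_lik = np.array([-0.5 * np.sum((data - t) ** 2) for t in thetas])

prior = np.ones_like(thetas)                     # constant (flat) prior
log_post = log_lik + np.log(prior)               # evidence is theta-independent, so it is dropped

# Same maximizer for posterior and likelihood:
print(thetas[np.argmax(log_lik)], thetas[np.argmax(log_post)])
```

For this model the maximum likelihood estimate is the sample mean, which the grid search recovers up to the grid spacing.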

Example: Regression

For our regression example, note that each data point is conditionally independent of the others under the hypothesis set $\mathcal H$ (for a fixed $\theta$). Additionally, the hypothesis set is a discriminative model, so the $x$ values are treated as fixed.

$$ \underset{\theta}{\operatorname{argmax}} P(y_1| \theta, \mathcal H) \cdot P(y_2| \theta, \mathcal H) \dots \cdot P(y_n| \theta, \mathcal H) $$


$$ P(y_i| \theta, \mathcal H) = \mathcal N(y_i;\, h_\theta(x_i), \sigma) = \frac {1}{\sigma \sqrt{2\pi}} e^{- \frac{(y_i - h_\theta(x_i) )^2 }{2 \sigma^2}} $$

The maximizer of the likelihood is the same as the minimizer of the negative log-likelihood, because the logarithm is a monotonic function.

$$ \underset{\theta}{\operatorname{argmin}} \left(- \log(P(y_1| \theta, \mathcal H)) - \log(P(y_2| \theta, \mathcal H)) - \dots - \log(P(y_n| \theta, \mathcal H)) \right) $$

with $$ \log(P(y_i| \theta, \mathcal H)) = \log(\mathcal N(y_i;\, h_\theta(x_i), \sigma)) = C -\frac{(y_i - h_\theta(x_i))^2}{2 \sigma ^2} $$ where $C = -\log(\sigma \sqrt{2\pi})$ is independent of $\theta$.

The argmin is independent of the constant $C$, and with the assumption that each data point has the same $\sigma$, the maximum likelihood estimation of the parameters $\theta$ is equivalent to the minimization of the squared error $E(\theta)$:

$$ \underset{\theta}{\operatorname{argmin}}\; E(\theta) \quad \text{with} \quad E(\theta) = \frac{1}{2}\sum_i (y_i - h_\theta (x_i))^2 $$

Note that we did not use the linear form of $h_\theta$, so this result is valid for regression in general.

Therefore, the parameters of a parametric regression model are usually found by minimizing the squared error loss. This implies a Gaussian noise assumption.
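The equivalence can be checked numerically. A sketch using freshly generated toy data (so the cell is self-contained), with $\theta_0$ and $\sigma$ held fixed so that only $\theta_1$ is scanned on a grid: the full negative log-likelihood and the squared error have the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=40)
y = 1.0 + 1.5 * x + rng.normal(0, 0.5, size=40)   # toy linear data, Gaussian noise

sigma = 0.5
theta1_grid = np.linspace(0.0, 3.0, 301)

def neg_log_lik(t1):
    # Full Gaussian negative log-likelihood, theta0 fixed at 1.0.
    r = y - (1.0 + t1 * x)
    return np.sum(0.5 * r ** 2 / sigma ** 2 + np.log(sigma * np.sqrt(2 * np.pi)))

def squared_error(t1):
    r = y - (1.0 + t1 * x)
    return 0.5 * np.sum(r ** 2)

nll = np.array([neg_log_lik(t) for t in theta1_grid])
sse = np.array([squared_error(t) for t in theta1_grid])
print(theta1_grid[np.argmin(nll)], theta1_grid[np.argmin(sse)])
```

The two curves differ only by an affine transformation (scale $1/\sigma^2$ and a constant), so their argmins coincide exactly.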

For example from numpy's polyfit documentation:

Fit a polynomial ``p(x) = p[0] * x**deg + ... + p[deg]`` of degree `deg`
to points `(x, y)`. Returns a vector of coefficients `p` that minimises
the squared error.
In [3]:
# polyfit returns coefficients in order of decreasing degree: [slope, intercept]
theta_1, theta_0 = np.polyfit(x, y, deg=1)

x_ = np.array([-1.,1])
y_ = theta_0 + theta_1 * x_

plt.plot(x, y, 'bx')
plt.plot(x_,y_, 'g-')
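As a cross-check (a sketch with its own toy data), `np.polyfit` with `deg=1` gives the same coefficients as solving the least-squares problem directly with `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=25)
y = 1.0 + 1.5 * np.sin(x) + rng.normal(0, 0.3, size=25)

# Design matrix [x, 1]; column order matches polyfit's [slope, intercept].
A = np.column_stack([x, np.ones_like(x)])
coef_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
coef_polyfit = np.polyfit(x, y, deg=1)
print(coef_lstsq, coef_polyfit)
```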

Example: Binary Classification

For binary classification the training data $\mathcal D$ consist of $(\vec x^{(i)}, t^{(i)})$-pairs (with $t^{(i)} \in \{0,1\}$).

The hypothesis $h(\vec x^{(i)})$ gives us the probability that the feature vector $\vec x^{(i)}$ corresponds to an object of the positive class, i.e. the probability that $t^{(i)}=1$: $$ p(t^{(i)}=1 \mid \vec x^{(i)}) = h(\vec x^{(i)}) $$

and for the negative class

$$ p(t^{(i)}=0 \mid \vec x^{(i)}) = 1 - p(t^{(i)}=1 \mid \vec x^{(i)}) = 1- h(\vec x^{(i)}) $$

Each data point is conditionally independent of the others under the hypothesis set $\mathcal H$, as in the regression example. So the maximum likelihood principle gives us:

$$ \underset{\theta}{\operatorname{argmax}} P(\mathcal D| \theta, \mathcal H) = \underset{\theta}{\operatorname{argmax}} \prod_{i} \left(h(\vec x^{(i)})\right)^{t^{(i)}} \left(1-h(\vec x^{(i)})\right)^{1-t^{(i)}} $$

Note: The 'to the power' trick selects the right probability: $h$ for $t^{(i)}=1$ and $1-h$ for $t^{(i)}=0$.
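A quick numerical illustration of the trick (hypothetical toy values):

```python
import numpy as np

h = np.array([0.9, 0.2, 0.7])   # predicted probabilities for class 1
t = np.array([1, 0, 1])         # targets

# t = 1 selects h (since (1-h)**0 = 1); t = 0 selects 1-h (since h**0 = 1).
selected = h ** t * (1 - h) ** (1 - t)
print(selected)  # [0.9, 0.8, 0.7]
```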

So the corresponding minimization of the negative log likelihood is given by:

$$ \underset{\theta}{\operatorname{argmin}} \left( - \log P(\mathcal D| \theta, \mathcal H) \right)= \underset{\theta}{\operatorname{argmin}} \left( - \sum_{i} \left( {t^{(i)}} \log (h(\vec x^{(i)})) + {(1-t^{(i)})} \log(1-h(\vec x^{(i)} )) \right)\right) $$

So the maximum likelihood principle yields the cross-entropy cost function for binary classification.
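Putting it together, a minimal sketch (toy 1-d data, plain gradient descent; not a production implementation): logistic regression trained by minimizing exactly this cross entropy.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 1-d binary data: class 1 tends to have larger x.
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
t = np.concatenate([np.zeros(100), np.ones(100)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# h(x) = sigmoid(w * x + b); gradient descent on the mean cross entropy.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    h = sigmoid(w * x + b)
    # d/dw and d/db of the mean cross entropy have this simple form:
    w -= lr * np.mean((h - t) * x)
    b -= lr * np.mean(h - t)

h = sigmoid(w * x + b)
cross_entropy = -np.mean(t * np.log(h) + (1 - t) * np.log(1 - h))
accuracy = np.mean((h > 0.5) == t)
print(w, b, cross_entropy, accuracy)
```

The learned weight is positive (class 1 lies at larger $x$) and the cross entropy drops well below the $\log 2 \approx 0.69$ of an uninformed predictor.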