Examples:
The likelihood function is the probability of the data $\mathcal D$, considered as a function of the parameters $\theta$:
$$ \mathcal L (\theta) = p(\mathcal D | \theta) $$
The likelihood function is not a probability function!
"Never say 'the likelihood of the data'. Alway say 'the likelihood of the parameters'. The likelihood function is not a probability distribution." (from D. MacKay: Information Theory, Inference and Learning Algorithms, Page 29, Cambride 2003, http://www.inference.phy.cam.ac.uk/itila/book.html)
Often the negative log-likelihood is minimized:
$$ \arg\max_\theta \mathcal L(\theta) = \arg\min_\theta \left(- \log \left( \mathcal L(\theta) \right) \right) $$
Example: Binomial Distribution (e.g. tossing a thumbtack)
$$ \arg\min_\theta \left( - \log \mathcal L(\theta) \right)= \arg\min_\theta \left( - \log \left( {n \choose k} \theta^k (1-\theta)^{n-k}\right) \right) $$
A necessary condition for a minimum (the binomial coefficient is a constant factor and can be ignored) is a vanishing derivative:
$$ 0 = \frac{d}{d\theta} \left( \theta^k (1-\theta)^{n-k}\right) = k \theta^{k-1} (1-\theta)^{n-k} - (n-k) \theta^{k} (1-\theta)^{n-k-1} $$
Dividing both sides by $\theta^{k-1} (1-\theta)^{n-k-1}$ yields
$$ k(1-\theta) = (n-k) \theta $$
$$ \theta_{ML} = \frac{k}{n} $$
Example: Thumbtack using maximum likelihood estimation.
from IPython.display import Image
Image(filename='./thumbtack.jpg', width=200, height=200)
# number of iid flips
n = 14
from scipy.stats import bernoulli
bernoulli.rvs(0.3, size=n)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])
The sufficient statistics for $\mathcal D$ are $k$ (number of positive outcomes) and $n$ (total number of outcomes).
import numpy as np

def maximum_likelihood_estimate(theta, n_total):
    # draw n_total iid Bernoulli(theta) samples and compute the running estimate k/n
    r = bernoulli.rvs(theta, size=n_total)
    mle = [np.sum(r[:i]) / float(i) for i in range(1, len(r) + 1)]
    return r, mle
theta = 0.3
r, mle = maximum_likelihood_estimate(theta, n_total=10)
print "\nMaximum likelihood estimate for the parameter theta up to the k-th trail:"
mle
Maximum likelihood estimate for the parameter theta up to the k-th trial:
[0.0, 0.0, 0.0, 0.25, 0.40000000000000002, 0.33333333333333331, 0.42857142857142855, 0.375, 0.33333333333333331, 0.29999999999999999]
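The closed-form estimate $k/n$ can also be cross-checked numerically; below is a minimal sketch (not part of the original notebook) that minimizes the negative log-likelihood of the sample r drawn above with scipy.optimize.minimize_scalar:
# Cross-check (sketch): compare the closed-form MLE k/n with a direct
# numerical minimization of the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

k, n_obs = np.sum(r), len(r)  # sufficient statistics of the sample drawn above

def negative_log_likelihood(theta):
    # -log L(theta), dropping the constant binomial coefficient
    return -(k * np.log(theta) + (n_obs - k) * np.log(1.0 - theta))

result = minimize_scalar(negative_log_likelihood,
                         bounds=(1e-6, 1.0 - 1e-6), method='bounded')
print("closed-form MLE k/n: %.4f" % (float(k) / n_obs))
print("numerical arg min  : %.4f" % result.x)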
plot_estimates(theta)
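plot_estimates is a helper defined elsewhere in the notebook; a minimal hypothetical version, assuming it simply overlays the running estimates on the true value with matplotlib, could look like this:
# Hypothetical stand-in for plot_estimates (the actual helper is defined
# elsewhere in the notebook): running MLE k/n versus the true parameter.
import matplotlib.pyplot as plt

def plot_estimates_sketch(theta, n_total=100):
    r, mle = maximum_likelihood_estimate(theta, n_total=n_total)
    plt.plot(range(1, n_total + 1), mle, label="running MLE k/n")
    plt.axhline(theta, color="red", linestyle="--", label="true theta")
    plt.xlabel("number of trials")
    plt.ylabel("estimate of theta")
    plt.legend()
    plt.show()

plot_estimates_sketch(0.3)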
Example: Graph of two binary random variables $X$ and $Y$:
draw_XY(3.)
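draw_XY is likewise a helper defined elsewhere in the notebook; a minimal hypothetical version, assuming it just draws the directed graph $X \rightarrow Y$ with networkx, might be:
# Hypothetical stand-in for draw_XY (the actual helper is defined elsewhere):
# draw the directed graph X -> Y of the two binary random variables.
import networkx as nx
import matplotlib.pyplot as plt

def draw_XY_sketch(scale=3.0):
    g = nx.DiGraph()
    g.add_edge("X", "Y")
    nx.draw(g, pos={"X": (0, 0), "Y": (1, 0)}, with_labels=True,
            node_size=scale * 1000, node_color="lightblue", arrows=True)
    plt.show()

draw_XY_sketch(3.0)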
$\mathcal D$ consists of $n$ different observations, each a pair of values for $X$ and $Y$ with $X, Y \in \{0, 1\}$.
We have three free parameters for three Binomial distributions:
$$ \theta_X = P(X=1), \qquad \theta_{1Y} = P(Y=1 | X=1), \qquad \theta_{0Y} = P(Y=1 | X=0) $$
Jointly written as parameter vector $\vec \theta = (\theta_X, \theta_{0Y},\theta_{1Y})$ or as a set $\theta = \{\theta_X, \theta_{0Y},\theta_{1Y}\}$.
The likelihood of parameter vector $\vec \theta = (\theta_X, \theta_{0Y},\theta_{1Y})$ given $\mathcal D$ is:
$$ \mathcal L(\vec \theta) = p( \mathcal D| \vec \theta) = P(\mathcal D_X | \theta_X) P(\mathcal D_{Y|X} | \theta_{0Y}, \theta_{1Y}) $$
Omitting the constant binomial coefficients (which are fixed given the data):
$$ \mathcal L(\vec \theta) \propto \left( \theta_X^{k} (1-\theta_X)^{n-k} \right) \left( \theta_{1Y}^{i} (1-\theta_{1Y})^{k-i} \right)^k \left( \theta_{0Y}^{g} (1-\theta_{0Y})^{l-g} \right)^{l} $$
Here $k$ is the number of observations with $X=1$, $l = n-k$ the number with $X=0$, $i$ the number of observations with $Y=1$ among the $X=1$ cases, and $g$ the number with $Y=1$ among the $X=0$ cases.
The likelihood decomposes into a product of terms; this is called Decomposability.
Correspondingly, for the log-likelihood:
$$ \begin{align} \log \mathcal L(\vec \theta) \propto & { } \left( k \log \theta_X + (n-k) \log (1-\theta_X) \right) + \\ & { } k \left( i \log \theta_{1Y} + (k-i) \log (1-\theta_{1Y}) \right) + \\ & { } l \left( g \log \theta_{0Y} + (l-g) \log (1-\theta_{0Y}) \right) \end{align} $$
So we have three independent terms:
$$ \begin{align} \log \mathcal L(\vec \theta) \propto & { }\log \mathcal L(\theta_X) + \\ & { } \log \mathcal L(\theta_{1Y}) + \\ & { } \log \mathcal L(\theta_{0Y}) \end{align} $$
So the argmax of $\log \mathcal L(\vec \theta)$ can be found by computing the argmax of each term independently. Also note that the leading factors $k$ and $l$ in the second and third term, respectively, are just constants which can be neglected in the optimization.
Log-Likelihood of the first term:
$$ \log \mathcal L (\theta_{X}) \propto k \log \theta_{X} + (n-k) \log (1-\theta_{X}) $$
We already know how to compute the MLE (maximum likelihood estimator) of $\theta_{X}$ (see above):
$$ \theta_{X}^{ML} = \frac{k}{n} $$
(Local) Log-Likelihood (second term):
$$ \log \mathcal L(\theta_{1Y})\propto i \log \theta_{1Y} + (k-i) \log (1-\theta_{1Y}) $$
MLE (maximum likelihood estimator):
$$ \theta_{1Y}^{ML} = \frac{i}{k} $$
(Local) Log-Likelihood (third term):
$$ \log \mathcal L(\theta_{0Y}) \propto g \log \theta_{0Y} + (l-g) \log (1-\theta_{0Y}) $$
MLE (maximum likelihood estimator):
$$ \theta_{0Y}^{ML} = \frac{g}{l} $$
So for (local) Binomial distributions (and analogously for Multinomials) the MLEs can be obtained just by ratios of frequencies.
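As an illustration (a sketch with assumed true parameters, not part of the original notebook), the frequency-ratio MLEs can be checked by sampling $(X, Y)$ pairs from the $X \rightarrow Y$ model and counting:
# Sketch: recover the three parameters of the X -> Y model by counting.
import numpy as np
from scipy.stats import bernoulli

theta_X, theta_1Y, theta_0Y = 0.3, 0.8, 0.1   # assumed true parameters
n = 10000
x = bernoulli.rvs(theta_X, size=n)
# draw Y from a Bernoulli whose parameter depends on the value of X
y = np.where(x == 1, bernoulli.rvs(theta_1Y, size=n), bernoulli.rvs(theta_0Y, size=n))

k = np.sum(x == 1)       # observations with X = 1
l = n - k                # observations with X = 0
i = np.sum(y[x == 1])    # Y = 1 among the X = 1 cases
g = np.sum(y[x == 0])    # Y = 1 among the X = 0 cases

print("theta_X  MLE = k/n: %.3f (true %.3f)" % (float(k) / n, theta_X))
print("theta_1Y MLE = i/k: %.3f (true %.3f)" % (float(i) / k, theta_1Y))
print("theta_0Y MLE = g/l: %.3f (true %.3f)" % (float(g) / l, theta_0Y))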