Goal: find the Maximum Likelihood Estimate (MLE) (or the Maximum a Posteriori / MAP estimate) when some of the data is missing.
Given $n$ iid-observations (data)
$$ {\bf X} = (\vec x_1, \vec x_2, \dots \vec x_n) $$
Note: $\vec x_i$ can be a sequence.
Each observation has an associated latent (unobserved) variable $\vec z$: $$ (\vec x, \vec z) \sim p_\theta $$ for some (unknown) $\theta \in \Theta$
Typically $p_\theta$ is in the exponential family.
Maximization of the likelihood (or log likelihood) of the parameters given the observations.
$$ \theta_{mle} = \text{arg} \max_\theta p({\bf X} \mid \theta) $$
$$ \ell(\theta) = p({\bf X} \mid \theta) = \prod_{i=1}^n \left(\sum_{\vec z_i} p(\vec x_i, \vec z_i \mid \theta )\right) $$
Note: For continuous latent variables the sum over $\vec z_i$ must be replaced by an integral.
$$ \mathcal L(\theta) = \log p({\bf X} \mid \theta) = \sum_{i=1}^n \left(\log \sum_{\vec z_i} p(\vec x_i, \vec z_i \mid \theta )\right) $$
The log-likelihood is difficult to maximize directly because the sum over the latent variables is inside the log.
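If $\vec z_i$ takes only finitely many values, $\mathcal L(\theta)$ can nevertheless be evaluated numerically, e.g. with the log-sum-exp trick. A minimal sketch, assuming a matrix log_joint with entries $\log p(\vec x_i, \vec z_i = k \mid \theta)$ (the name is only illustrative):
import numpy as np
from scipy.special import logsumexp

def incomplete_data_log_likelihood(log_joint):
    # log_joint[i, k] = log p(x_i, z_i = k | theta)
    # L(theta) = sum_i log sum_k exp(log_joint[i, k])
    return logsumexp(log_joint, axis=1).sum()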
The complete-data log-likelihood $$ \mathcal L_c(\theta) = \log p({\bf X, Z} \mid \theta) = \sum_{i=1}^n \log p(\vec x_i, \vec z_i \mid \theta ) $$
cannot be computed, since the $\vec z_i$ are unknown.
So we introduce the expected complete-data log-likelihood as an auxiliary function $\mathcal Q(\theta , \theta^{(t)})$:
$$ \mathcal Q(\theta , \theta^{(t)}) = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta^{(t)} \right] = \sum_{\bf Z} p({\bf Z} \mid {\bf X}, \theta^{(t)}) \log p({\bf X, Z} \mid \theta) = \sum_{i=1}^n \sum_{\vec z_i}p(\vec z_i \mid \vec x_i, \theta^{(t)}) \log p(\vec x_i, \vec z_i \mid \theta ) $$
Initialize $\theta^{(0)} \in \Theta$
For $t = 0,1,2, \dots$ (until convergence) :
E-Step (Expectation): $$\mathcal Q(\theta ,\theta^{(t)}) = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta^{(t)} \right]$$ Here, typically the "weights" $p(\vec z_i \mid \vec x_i, \theta^{(t)})$ for the expectation of the complete-data log-likelihood are computed based on the current parameters $\theta^{(t)}$. The prior $p(\vec z_i \mid \theta^{(t)})$ is typically also treated as a parameter and is updated in the M-step.
M-Step (Maximization):
$$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)}) $$
Note:
In the E-step, $\mathcal Q$ is typically not computed directly. Instead, the terms inside it on which the MLE depends are computed.
These are the sufficient statistics.
For MAP, the M-step (maximization) becomes:
$$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)}) + \log p(\theta)$$
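Schematically (not part of the derivation above), the whole iteration can be written as a small generic loop; e_step and m_step are placeholders for model-specific code, and the log-likelihood is only used to monitor convergence. A sketch:
def em(X, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    # generic EM loop: e_step returns the expected sufficient statistics
    # (and the current log-likelihood), m_step maximizes Q with respect to theta;
    # for MAP, m_step would maximize Q + log p(theta) instead
    theta, ll_old = theta0, float("-inf")
    for t in range(max_iter):
        stats, ll = e_step(X, theta)
        theta = m_step(stats)
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return theta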
For a different view on the EM algorithm, see EM - variational derivation.
The generative model is $z \to \vec x$:
$$ p(\vec x, z) = p(\vec x \mid z) p(z) $$
with the discrete latent state $z \in \{1,2, \dots K\}$
Using $z_k$ to denote a specific latent state:
$$ p(\vec x, z_k) = p(\vec x \mid z_k) p(z_k) $$
or, written as a product over all $K$ states (using $k$ for $z_k$):
$$ p(\vec x \mid z) = \prod_{k=1}^K p(\vec x \mid z_k)^{\mathbb 1\left(k=z\right)} $$
and for the joint distribution $$ p(\vec x, z) = \prod_{k=1}^K \left(\pi_{k}\, p(\vec x \mid z_k)\right)^{\mathbb 1\left(k=z\right)} $$ with the mixing coefficients $p(z_k)=\pi_k$.
Typically we use base distributions with parameters $\theta_k$ for $p(\vec x \mid z_k)$.
Here the notation is a little sloppy (the latent state $z_k$ implicitly selects the corresponding $\theta_k$):
$$ p(\vec x \mid z_k) = p(\vec x \mid \theta_k) $$
For GMMs we use Gaussian base distributions
$$ p(\vec x \mid z_k) = \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k}) $$
or, written as a product over all $K$ states (using $k$ for $z_k$):
$$ p(\vec x \mid z) = \prod_{k=1}^K \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})^{\mathbb 1\left(k=z\right)} $$
and for the joint distribution $$ p(\vec x, z) = \prod_{k=1}^K \left(\pi_{k}\, \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})\right)^{\mathbb 1\left(k=z\right)} $$
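To make the generative view $z \to \vec x$ concrete: sampling from the mixture means first drawing the component $z$ from a categorical distribution with probabilities $\pi_k$ and then drawing $\vec x$ from the selected Gaussian. A minimal sketch (the function is only illustrative and not used below):
import numpy as np

def sample_gmm(n, pi, means, covs, seed=0):
    # ancestral sampling: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z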
Marginalizing over $z$ gives $$ p(\vec x) = \sum_{k=1}^{K} p(z_k) p(\vec x \mid z_k) = \sum_{k=1}^{K} \pi_k p(\vec x \mid \theta_k) $$
for GMM
$$ p(\vec x) = \sum_{k=1}^K \pi_k \mathcal N(\vec x \mid \vec \mu_k, \Sigma _k) $$
$$ p(\vec x_i\mid \theta) = \sum_{k=1}^K p(k \mid \theta) p(\vec x_i \mid \theta, k) = \sum_{k=1}^K \pi_k p(\vec x_i \mid \theta _k) $$
with the parameters
$\theta = \{\pi_1, \pi_2, \dots, \pi_K, \theta_1, \theta_2, \dots, \theta_K \}$
i.e. we consider the mixing coefficients $\pi_k$ as parameters.
for GMM:
$$ p(\vec x_i\mid \theta) = \sum_{k=1}^K \pi_k \mathcal N(\vec x_i \mid \vec \mu_k, \Sigma _k) $$
with the parameters $\theta = \{\pi_1, \pi_2, \dots, \pi_K, (\vec \mu_1, \Sigma_1), (\vec \mu_2, \Sigma_2), \dots, (\vec \mu_K, \Sigma_K) \}$
$$ \begin{align} \mathcal Q(\theta, \theta^{(t)}) & = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta^{(t)} \right] \\ & = \mathbb E \left[ \sum_{i=1}^n \log p(\vec x_i, z_i \mid \theta) \right] \\ & = \sum_{i=1}^n \mathbb E \left[ \log p(\vec x_i, z_i \mid \theta) \right] \\ & = \sum_{i=1}^n \mathbb E \left[ \log \left( \prod_{k=1}^K \left(\pi_k p(\vec x_i \mid \theta _k) \right)^{\mathbb 1(z_i=k)}\right) \right]\\ & = \sum_{i=1}^n \sum_{k=1}^K\mathbb E \left[ {\mathbb 1(z_i=k)} \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right]\\ & = \sum_{i=1}^n \sum_{k=1}^K \sum_{z_i=1}^K {\mathbb 1(z_i=k)}\, p(z_i\mid \vec x_i, \theta^{(t)})\log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \\ & = \sum_{i=1}^n \sum_{k=1}^K p(z_i = k\mid \vec x_i, \theta^{(t)} ) \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \\ & = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \\ & = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k + \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log p(\vec x_i \mid \theta _k) \end{align} $$
with the responsibility $r_{ik}$ of cluster $k$ for data point $\vec x_i$:
$$ r_{ik} = p(z_i = k\mid \vec x_i, \theta^{(t)}) $$
In the E-step we compute the responsibilities $r_{ik} = p(z_i = k\mid \vec x_i, \theta^{(t)})$
with
$$
p(z_i \mid \vec x_i, \theta^{(t)}) = \frac{p(z_i,\vec x_i \mid \theta^{(t)})}{p(\vec x_i\mid \theta^{(t)})}
= \frac{p(z_i\mid \theta^{(t)})p(\vec x_i \mid z_i, \theta^{(t)})}{\sum_{z_i=1}^K p(z_i, \vec x_i \mid \theta^{(t)})} =
\frac{p(z_i\mid \theta^{(t)})p(\vec x_i \mid z_i, \theta^{(t)})}{\sum_{z_i=1}^K p(z_i\mid \theta^{(t)})p(\vec x_i \mid z_i, \theta^{(t)})}
$$
and
the dependence of $r_{ik}$ on the current parameter estimates $\theta^{(t)} = \{\pi_1^{(t)}, \pi_2^{(t)}, \dots, \pi_K^{(t)}, \theta_1^{(t)}, \theta_2^{(t)}, \dots, \theta_K^{(t)} \}$ is
$$ \begin{align} r_{ik} & = p(z_i = k\mid \vec x_i, \theta^{(t)}) \\ & = \frac{\pi_k^{(t)}p(\vec x_i \mid \theta^{(t)}_k)}{\sum_{k'=1}^K \pi_{k'}^{(t)}p(\vec x_i \mid \theta^{(t)}_{k'})} \end{align} $$
This holds for all mixture models.
Remember, for GMMs: $$ p(\vec x_i \mid \theta^{(t)}_k) = \mathcal N(\vec x_i \mid \vec \mu^{(t)}_k, \Sigma^{(t)}_k) $$
so the $r_{ik}$ can easily be computed.
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
# Data
cov_1 = np.array([[0.6,0.2],[0.2, 0.3]])
mean_1 = np.array([2.6,2.2])
cov_2 = np.array([[0.6,-0.55],[-0.55, 0.6]])
mean_2 = np.array([-.9,1.7])
cov_3 = np.array([[.1, 0.01],[0.01, 4.3]])
mean_3 = np.array([.9,5.5])
nb_1 = 40
nb_2 = 60
nb_3 = 50
X_1 = np.random.multivariate_normal(mean_1, cov_1, size=nb_1)
X_2 = np.random.multivariate_normal(mean_2, cov_2, size=nb_2)
X_3 = np.random.multivariate_normal(mean_3, cov_3, size=nb_3)
X_ = np.concatenate((X_1,X_2,X_3))
X = np.random.permutation(X_)
labels = np.zeros(len(X))
labels[nb_1:nb_2+nb_1] = 1
labels[nb_1+nb_2:] = 2
plt.figure(figsize=(5,5))
plt.scatter(X_[:, 0], X_[:, 1], s=50, c=labels, cmap='viridis')
plt.xlabel("$x_1$")
plt.xlim(-3,5)
plt.ylim(-1,10)
_ = plt.ylabel("$x_2$")
from scipy.stats import multivariate_normal
means = np.random.uniform(low=-2,high=3, size=(3,2))
# TODO: better random cov (allow negative off-diagonal elements) - see the sketch below
def get_random_cov(size=2, s=1.5):
    # m @ m.T is symmetric positive semi-definite, i.e. a valid covariance matrix
    m = np.random.rand(size, size) * s
    return np.dot(m, m.T)
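One way to address the TODO (with m having only non-negative entries, the off-diagonal elements of m @ m.T are always positive) is to rotate a random diagonal matrix with a random orthogonal matrix. This is only a sketch; get_random_cov_rotated is not used below:
def get_random_cov_rotated(size=2, low=0.1, high=2.0):
    # random orthogonal matrix via the QR decomposition of a Gaussian random matrix
    q, _ = np.linalg.qr(np.random.randn(size, size))
    # random positive eigenvalues; q @ diag(eigvals) @ q.T is symmetric positive definite
    eigvals = np.random.uniform(low=low, high=high, size=size)
    return q @ np.diag(eigvals) @ q.T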
covs = np.array([get_random_cov() for i in range(3)])
pi = np.array([0.33,0.33,0.34])
means.shape
n = X.shape[0]
K = means.shape[0]
p_x_i__theta_k = np.ndarray([n, K])
for i, (mean, cov) in enumerate(zip(means, covs)):
p_x_i__theta_k[:,i] = multivariate_normal.pdf(X, mean=mean, cov=cov)
def responsibility(means, covs, pi, X):
    # returns the matrix r with entries r[i, k] = p(z_i = k | x_i, theta^(t))
    n = X.shape[0]
    K = means.shape[0]
    p_x_i__theta_k = np.empty((n, K))
    for k, (mean, cov) in enumerate(zip(means, covs)):
        p_x_i__theta_k[:, k] = multivariate_normal.pdf(X, mean=mean, cov=cov)
    r = pi * p_x_i__theta_k
    # normalize each row, so that the responsibilities of a data point sum to one
    r = (r.T / r.sum(axis=1)).T
    return r
r = responsibility(means, covs, pi, X)
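As a quick sanity check (not part of the original computation): every row of r is a posterior over the $K$ components and should therefore sum to one.
np.testing.assert_allclose(r.sum(axis=1), 1.0)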
Remember the M-Step: $$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)}) $$
i.e. maximization of $\mathcal Q(\theta , \theta^{(t)})$ with respect to $\theta = \{\pi_1, \pi_2, \dots, \pi_K, (\vec \mu_1, \Sigma_1), (\vec \mu_2, \Sigma_2), \dots, (\vec \mu_K, \Sigma_K) \}$
$$ \mathcal Q(\theta, \theta^{(t)}) = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k + \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log p(\vec x_i \mid \theta _k) $$
We can maximize the two terms separately, since the first depends only on the mixing coefficients $\pi_k$ and the second only on the component parameters $\theta_k$.
the first term is $$ \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k = \sum_{i,k} r_{ik} \log \pi_k $$
so we are searching for the maximum of this term under the constraint $\sum_k{\pi_k}=1$
with the Lagrange multiplier $\lambda$ we have to maximize $$ \sum_{i,k} r_{ik} \log \pi_k + \lambda \left(1- \sum_{k=1}^K{\pi_k}\right) $$
Setting the derivative with respect to $\pi_k$ to zero: $$ \frac{\partial}{\partial \pi_k} \left(\sum_{i,k} r_{ik} \log \pi_k + \lambda \left(1-\sum_{k=1}^K{\pi_k}\right) \right) = 0 $$
$$ \frac{\sum_i r_{ik}}{\pi_k} - \lambda = 0 $$
or
$$ \pi_k = \frac{\sum_i r_{ik}}{\lambda} $$
Summing over all $k$ and using the constraint, we get: $$ \sum_k \frac{\sum_i r_{ik}}{\lambda} = \sum_k \pi_k = 1 $$ Using the definition of the $r_{ik}$ we get for $\lambda$: $$ \lambda = \sum_{i,k} r_{ik} = \sum_i \sum_k p(z_i = k\mid \vec x_i, \theta^{(t)}) = \sum_i 1 = n $$
This gives, with $r_{k} = \sum_{i=1}^n r_{ik}$,
$$ \pi_k = \frac{r_k}{n} $$
def m_step_pi(r_k, n):
return r_k / n
In all the computations above we did not use the fact that the base distributions are Gaussians, so they also hold for other mixture models.
TODO: explicit derivation (a brief sketch for $\vec \mu_k$ is given below the update equations)
$$ \vec \mu_k = \frac{\sum_i r_{ik} \vec x_i}{r_k} $$
$$ \Sigma_k = \frac{\sum_i r_{ik}(\vec x_i - \vec \mu_k) (\vec x_i - \vec \mu_k)^T}{r_k} = \frac{\sum_i r_{ik}\vec x_i \vec x_i^T}{r_k} - \vec \mu_k \vec \mu_k^T $$
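A brief sketch of the derivation for $\vec \mu_k$ (filling in the TODO above): only the second term of $\mathcal Q$ depends on $\vec \mu_k$, so we set its gradient to zero.
$$ \frac{\partial}{\partial \vec \mu_k} \sum_{i=1}^n r_{ik} \log \mathcal N(\vec x_i \mid \vec \mu_k, \Sigma_k) = \sum_{i=1}^n r_{ik}\, \Sigma_k^{-1} (\vec x_i - \vec \mu_k) \stackrel{!}{=} 0 \quad \Rightarrow \quad \vec \mu_k = \frac{\sum_{i=1}^n r_{ik} \vec x_i}{\sum_{i=1}^n r_{ik}} = \frac{\sum_{i=1}^n r_{ik} \vec x_i}{r_k} $$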
for numpy's einsum see https://obilaniu6266h16.wordpress.com/2016/02/04/einstein-summation-in-numpy/
# test of the computation of the outer products x_i x_i^T with einsum
a = np.einsum('ac,ad->acd', X, X)
for i in range(len(X)):
    b = np.outer(X[i, :], X[i, :])
    np.testing.assert_allclose(a[i], b)
def m_step_mu_sigma(r, r_k, X):
    # mu_k = (sum_i r_ik x_i) / r_k
    mu = np.dot(r.T, X)
    mu = (mu.T / r_k).T
    # Sigma_k = (sum_i r_ik x_i x_i^T) / r_k - mu_k mu_k^T
    XX = np.einsum('ac,ad->acd', X, X)    # outer products x_i x_i^T
    t1 = np.einsum('ace,ad->dce', XX, r)  # sum_i r_ik x_i x_i^T (index d is k)
    t1 = (t1.T / r_k).T
    t2 = np.einsum('ac,ad->acd', mu, mu)  # outer products mu_k mu_k^T
    sigma = t1 - t2
    return mu, sigma
r_k = r.sum(axis=0)
mu, sigma = m_step_mu_sigma(r, r_k, X)
sigma
# visualization adapted from the Python Data Science Handbook
# TODO: replace with our own plotting code
from matplotlib.patches import Ellipse
def draw_ellipse(position, covariance, ax=None, **kwargs):
"""Draw an ellipse with a given position and covariance"""
ax = ax or plt.gca()
# Convert covariance to principal axes
if covariance.shape == (2, 2):
U, s, Vt = np.linalg.svd(covariance)
angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
width, height = 2 * np.sqrt(s)
else:
angle = 0
width, height = 2 * np.sqrt(covariance)
# Draw the Ellipse
for nsig in range(1, 4):
ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                     angle=angle, **kwargs))
def plot_gmm(means, covars, weights, X, label=False, ax=None):
ax = ax or plt.gca()
if label:
ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
else:
ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
ax.axis('equal')
w_factor = 0.2 / weights.max()
for pos, covar, w in zip(means, covars, weights):
draw_ellipse(pos, covar, alpha=w * w_factor)
# TODO make numerically stable (see the log-space sketch at the end)
def expectation_maximization(X, k=3):
    n = X.shape[0]
    data_dim = X.shape[1]
    # random start values
    pi = np.ones(k) / k
    mus = np.random.uniform(low=-2, high=3, size=(k, data_dim))
    # one random covariance matrix per component
    sigmas = np.array([get_random_cov(size=data_dim, s=1.5) for _ in range(k)])
    tol = 1e-02
    converged = False
    while not converged:
        # E-Step: responsibilities based on the current parameters
        r = responsibility(mus, sigmas, pi, X)
        r_k = r.sum(axis=0)
        # M-Step: update the parameters
        pi_ = m_step_pi(r_k, n)
        mus_, sigmas_ = m_step_mu_sigma(r, r_k, X)
        # stopping criterion: the parameters no longer change (within tolerance)
        if np.allclose(pi, pi_, rtol=tol) and np.allclose(mus, mus_, rtol=tol) and np.allclose(sigmas, sigmas_, rtol=tol):
            converged = True
        pi, mus, sigmas = pi_, mus_, sigmas_
    return pi, mus, sigmas
# a bad random initialization can raise a ValueError (e.g. a singular covariance in multivariate_normal.pdf); retry until EM succeeds
while True:
    try:
        pi, mus, sigmas = expectation_maximization(X)
        break
    except ValueError:
        pass
plot_gmm(mus, sigmas, pi, X, label=False, ax=None)
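Regarding the numerical-stability TODO above: a common option is to compute the responsibilities in log space with logsumexp and logpdf instead of pdf, so that very small component densities do not underflow. A sketch of a drop-in alternative to responsibility (not used in the run above; it reuses np and multivariate_normal imported earlier):
from scipy.special import logsumexp

def responsibility_log(means, covs, pi, X):
    # log r_ik = log pi_k + log N(x_i | mu_k, Sigma_k) - logsumexp over k
    n, K = X.shape[0], means.shape[0]
    log_r = np.empty((n, K))
    for k, (mean, cov) in enumerate(zip(means, covs)):
        log_r[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mean, cov=cov)
    log_r -= logsumexp(log_r, axis=1, keepdims=True)
    return np.exp(log_r)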