Expectation-Maximization

Goal

Find the Maximum Likelihood Estimate (MLE) (or the Maximum a Posteriori / MAP estimate) of the parameters when some data is missing.

Given $n$ i.i.d. observations (data)

$$ {\bf X} = (\vec x_1, \vec x_2, \dots \vec x_n) $$

Note: $\vec x_i$ can be a sequence.

Model

$$ (\vec x, \vec z) \sim p_\theta $$ for some (unknown) $\theta \in \Theta$

Typically $p_\theta$ is in the exponential family.

  • $\vec z$: hidden (latent) random variables

Goal

Maximization of the likelihood (or log likelihood) of the parameters given the observations.

$$ \theta_{mle} = \text{arg} \max_\theta p({\bf X} \mid \theta) $$

Marginal likelihood of the parameters w.r.t. the observed data:

$$ \mathcal l(\theta) = p({\bf X} \mid \theta) = \prod_{i=1}^n \left(\sum_{\vec z_i} p(\vec x_i, \vec z_i \mid \theta )\right) $$

Note: For continuous latent variables the sum over $\vec z_i$ must be replaced by an integral.

Marginal log likelihood of the parameters w.r.t. the observed data

$$ \mathcal L(\theta) = \log p({\bf X} \mid \theta) = \sum_{i=1}^n \left(\log \sum_{\vec z_i} p(\vec x_i, \vec z_i \mid \theta )\right) $$

This log-likelihood is difficult to maximize: the sum is inside the log.

Complete data log likelihood:

$$ \mathcal L_c(\theta) = \log p({\bf X, Z} \mid \theta) = \sum_{i=1}^n \log p(\vec x_i, \vec z_i \mid \theta ) $$

cannot be computed, since the $\vec z_i$ are unknown.

Auxiliary function:

So we introduce the expected complete data log likelihood as an auxiliary function $\mathcal Q(\theta , \theta^{(t)})$:

$$ \mathcal Q(\theta , \theta^{(t)}) = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta^{(t)} \right] = \sum_{\bf Z} p({\bf Z} \mid {\bf X}, \theta^{(t)}) \log p({\bf X, Z} \mid \theta) = \sum_{i=1}^n \sum_{\vec z_i}p(\vec z_i \mid \vec x_i, \theta^{(t)}) \log p(\vec x_i, \vec z_i \mid \theta ) $$

EM Algorithm

Initialize $\theta^{(0)} \in \Theta$
For $t = 0,1,2, \dots$ (until convergence) :

  • E-Step (Expectation): $$\mathcal Q(\theta ,\theta^{(t)}) = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta^{(t)} \right]$$ Here typically the "weights" $p(\vec z_i \mid \vec x_i, \theta^{(t)})$ for the expectation of the complete data log-likelihood are computed, based on the current parameters $\theta^{(t)}$. $p(\vec z_i \mid \theta^{(t)})$ is typically also treated as a parameter and is computed in the M-Step.

  • M-Step (Maximization):
    $$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)}) $$

Note:

In the E-Step $\mathcal Q$ is typically not computed explicitly. Instead, only the terms inside of it on which the MLE depends are computed.
These are the sufficient statistics.
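A minimal generic sketch of this loop (`e_step` and `m_step` are placeholder names for the model-specific computations, not functions defined in these notes):

def em(X, theta_0, e_step, m_step, n_iter=100):
    # Generic EM loop (sketch): e_step computes the expected sufficient
    # statistics under theta^(t), m_step returns the maximizing theta^(t+1).
    theta = theta_0
    for t in range(n_iter):
        stats = e_step(X, theta)   # E-Step
        theta = m_step(stats, X)   # M-Step
    return theta

In practice one stops when the parameters (or the marginal log-likelihood) no longer change noticeably, as the GMM implementation below does.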

MAP

For MAP estimation the M-Step (Maximization) becomes:
$$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)}) + \log p(\theta)$$

Variational Methods and EM

For a different view on the EM algorithm see EM - variational derivation.

Example: Mixture Models / Gaussians Mixture Models

Mixture Model for the data probability distribution:

 z -> x

$$ p(\vec x, z) = p(\vec x \mid z) p(z) $$

with the discrete latent state $z \in \{1,2, \dots K\}$

using $z_k$ to denote a specific latent state

$$ p(\vec x, z_k) = p(\vec x \mid z_k) p(z_k) $$

or (using $k$ as $z_k$)

$$ p(\vec x \mid z) = \prod_{k=1}^K p(\vec x \mid z_k)^{\mathbb 1\left(k=z\right)} $$

  • The indicator function $\mathbb 1\left(k=z\right)$ selects one of the $K$ base distributions

and for the joint $$ p(\vec x, z) = \prod_{k=1}^K \left(\pi_{k}\, p(\vec x \mid z_k)\right)^{\mathbb 1\left(k=z\right)} $$ with the mixing coefficients $p(z_k)=\pi_k$:

  • $\pi_k\geq0$
  • $\sum_{k=1}^K \pi_k=1$

Gaussian Mixture Model (Mixture of Gaussian, GMM)

Typically we use base distributions $p(\vec x \mid z_k)$ with parameters $\theta_k$.

Here the notation is a little sloppy (the latent state $z_k$ implicitly selects the corresponding $\theta_k$):

$$ p(\vec x \mid z_k) = p(\vec x \mid \theta_k) $$

For GMMs we use Gaussian base distributions

$$ p(\vec x \mid z_k) = \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k}) $$

  • $\mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})$: Gaussian distribution with mean vector $\vec \mu_k$ and covariance matrix $\Sigma_k$

or (using $k$ as $z_k$)

$$ p(\vec x \mid z) = \prod_{k=1}^K \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})^{\mathbb 1\left(k=z\right)} $$

and for the joint $$ p(\vec x, z) = \prod_{k=1}^K \left(\pi_{k}\, \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})\right)^{\mathbb 1\left(k=z\right)} $$
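To illustrate this generative process: ancestral sampling from a GMM first draws the component $z$ and then $\vec x$ given $z$. A small sketch with made-up example parameters (`pi_ex`, `means_ex`, `covs_ex` are illustrative values, not the parameters used later):

import numpy as np

rng = np.random.default_rng(0)
pi_ex = np.array([0.3, 0.5, 0.2])                           # mixing coefficients
means_ex = np.array([[2.0, 2.0], [-1.0, 1.5], [1.0, 5.0]])  # component means
covs_ex = np.array([np.eye(2) * 0.4 for _ in range(3)])     # component covariances

z = rng.choice(len(pi_ex), size=500, p=pi_ex)               # z ~ Categorical(pi)
x = np.array([rng.multivariate_normal(means_ex[k], covs_ex[k]) for k in z])  # x ~ N(mu_z, Sigma_z)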

(Data) Marginal

$$ p(\vec x) = \sum_{k=1}^{K} p(z_k) p(\vec x \mid z_k) = \sum_{k=1}^{K} \pi_k p(\vec x \mid \theta_k) $$

for GMM

$$ p(\vec x) = \sum_{k=1}^K \pi_k \mathcal N(\vec x \mid \vec \mu_k, \Sigma _k) $$

Likelihood of one data point $\vec x_i$:

$$ p(\vec x_i\mid \theta) = \sum_{k=1}^K p(k \mid \theta) p(\vec x_i \mid \theta, k) = \sum_{k=1}^K \pi_k p(\vec x_i \mid \theta _k) $$

with the parameters

$\theta = \{\pi_1, \pi_2, \dots, \pi_K, \theta_1, \theta_2, \dots, \theta_K \}$

i.e. we consider the mixing coefficients $\pi_k$ as parameters.

for GMM:

$$ p(\vec x_i\mid \theta) = \sum_{k=1}^K \pi_k \mathcal N(\vec x_i \mid \vec \mu_k, \Sigma _k) $$

with the parameters $\theta = \{\pi_1, \pi_2, \dots, \pi_K, (\vec \mu_1, \Sigma_1), (\vec \mu_2, \Sigma_2), \dots, (\vec \mu_K, \Sigma_K) \}$
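The marginal log-likelihood $\mathcal L(\theta) = \sum_i \log \sum_k \pi_k \mathcal N(\vec x_i \mid \vec \mu_k, \Sigma_k)$ is also useful for monitoring convergence. A small helper sketch (the name `gmm_log_likelihood` is just illustrative):

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, means, covs):
    # log p(X | theta) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k),
    # evaluated in log space with logsumexp for numerical stability
    log_p = np.stack([multivariate_normal.logpdf(X, mean=m, cov=c)
                      for m, c in zip(means, covs)], axis=1)  # shape (n, K)
    return logsumexp(np.log(pi) + log_p, axis=1).sum()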

Expected complete log likelihood

$$ \begin{align} \mathcal Q(\theta, \theta^{(t)}) & = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta^{(t)} \right] \\ & = \mathbb E \left[ \sum_{i=1}^n \log p(\vec x_i, z_i \mid \theta) \right] \\ & = \sum_{i=1}^n \mathbb E \left[ \log p(\vec x_i, z_i \mid \theta) \right] \\ & = \sum_{i=1}^n \mathbb E \left[ \log \left( \prod_{k=1}^K \left(\pi_k p(\vec x_i \mid \theta _k) \right)^{\mathbb 1(z_i=k)}\right) \right]\\ & = \sum_{i=1}^n \sum_{k=1}^K\mathbb E \left[ {\mathbb 1(z_i=k)} \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right]\\ & = \sum_{i=1}^n \sum_{k=1}^K \sum_{z_i=1}^K \left( {\mathbb 1(z_i=k)}\, p(z_i\mid \vec x_i, \theta^{(t)})\log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right)\\ & = \sum_{i=1}^n \sum_{k=1}^K \left( p(z_i = k\mid \vec x_i, \theta^{(t)} ) \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right)\\ & = \sum_{i=1}^n \sum_{k=1}^K \left( r_{ik} \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right)\\ & = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k + \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log p(\vec x_i \mid \theta _k)\\ \end{align} $$

with the responsibility $r_{ik}$ of cluster $k$ for data point $\vec x_i$:

$$ r_{ik} = p(z_i = k \mid \vec x_i, \theta^{(t)}) $$

  • $r_{ik} \geq 0$
  • $\sum_{k=1}^K r_{ik} = 1$

E-Step

In the E-Step we compute the responsibility $r_{ik} = p(z_i = k \mid \vec x_i, \theta^{(t)})$

with
$$ p(z_i \mid \vec x_i, \theta^{(t)}) = \frac{p(z_i,\vec x_i \mid \theta^{(t)})}{p(\vec x_i\mid \theta^{(t)})} = \frac{p(z_i\mid \theta^{(t)})\,p(\vec x_i \mid z_i, \theta^{(t)})}{\sum_{k'=1}^K p(z_i = k', \vec x_i \mid \theta^{(t)})} = \frac{p(z_i\mid \theta^{(t)})\,p(\vec x_i \mid z_i, \theta^{(t)})}{\sum_{k'=1}^K p(z_i = k'\mid \theta^{(t)})\,p(\vec x_i \mid z_i = k', \theta^{(t)})} $$

and

  • $p(z_i = k \mid \theta^{(t)}) = \pi_k^{(t)}$
  • $p(\vec x_i \mid z_i = k, \theta^{(t)}) = p(\vec x_i \mid \theta^{(t)}_k)$

The dependence of $r_{ik}$ on the current parameter estimates $\theta^{(t)} = \{\pi_1^{(t)}, \pi_2^{(t)}, \dots, \pi_K^{(t)}, \theta_1^{(t)}, \theta_2^{(t)}, \dots, \theta_K^{(t)} \}$ is

$$ \begin{align} r_{ik} & = p(z_i = k \mid \vec x_i, \theta^{(t)}) \\ & = \frac{\pi_k^{(t)}\,p(\vec x_i \mid \theta^{(t)}_k)}{\sum_{k'=1}^K \pi_{k'}^{(t)}\,p(\vec x_i \mid \theta^{(t)}_{k'})} \end{align} $$

This holds for all mixture models.

Remember, for GMMs: $$ p(\vec x_i \mid \theta^{(t)}_k) = \mathcal N(\vec x_i \mid \vec \mu^{(t)}_k, \Sigma^{(t)}_k) $$

so $r_{ik}$ can easily be computed.

In [1]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
In [2]:
# Data
cov_1 = np.array([[0.6,0.2],[0.2, 0.3]])
mean_1 = np.array([2.6,2.2])
cov_2 = np.array([[0.6,-0.55],[-0.55, 0.6]])
mean_2 = np.array([-.9,1.7])
cov_3 = np.array([[.1, 0.01],[0.01, 4.3]])
mean_3 = np.array([.9,5.5])
nb_1 = 40
nb_2 = 60
nb_3 = 50 
X_1 = np.random.multivariate_normal(mean_1, cov_1, size=nb_1)
X_2 = np.random.multivariate_normal(mean_2, cov_2, size=nb_2)
X_3 = np.random.multivariate_normal(mean_3, cov_3, size=nb_3)
X_ = np.concatenate((X_1,X_2,X_3))
X = np.random.permutation(X_)
labels = np.zeros(len(X))
labels[nb_1:nb_2+nb_1] = 1
labels[nb_1+nb_2:] = 2
In [3]:
plt.figure(figsize=(5,5))
plt.scatter(X_[:, 0], X_[:, 1], s=50, c=labels, cmap='viridis')
plt.xlabel("$x_1$")
plt.xlim(-3,5)
plt.ylim(-1,10)
_ = plt.ylabel("$x_2$")
In [4]:
#np.random.multivariate_normal?
In [5]:
from scipy.stats import multivariate_normal 
In [6]:
means = np.random.uniform(low=-2,high=3, size=(3,2))
In [7]:
# multivariate_normal.pdf(X, mean=mean_1, cov=cov_1).shape
#np.random.uniform?
In [8]:
# TODO: better random covariances (allow negative off-diagonal elements)
def get_random_cov(size=2, s=1.5):
    m = np.random.rand(size, size) * s
    return np.dot(m, m.T)  # m m^T is symmetric and positive semidefinite
covs = np.array([get_random_cov() for i in range(3)])
In [9]:
pi = np.array([0.33,0.33,0.34])
In [10]:
means.shape
n = X.shape[0]
K = means.shape[0]
p_x_i__theta_k = np.ndarray([n, K])
for i, (mean, cov) in enumerate(zip(means, covs)):
    p_x_i__theta_k[:,i] = multivariate_normal.pdf(X, mean=mean, cov=cov)
    
#pi_ks * p_x_i__theta_k
In [11]:
def responsibility(means, covs, pi, X):
    # returns the matrix r with indices i (data point) and k (mixture component)
    
    n = X.shape[0]
    K = means.shape[0]
    
    # p(x_i | theta_k) for every data point i and component k
    p_x_i__theta_k = np.ndarray([n, K])
    for k, (mean, cov) in enumerate(zip(means, covs)):
        p_x_i__theta_k[:, k] = multivariate_normal.pdf(X, mean=mean, cov=cov)
        
    # r_ik is proportional to pi_k * p(x_i | theta_k), normalized over k
    r = pi * p_x_i__theta_k
    r = (r.T / r.sum(axis=1)).T
    return r
In [12]:
r = responsibility(means, covs, pi, X)
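The responsibility function multiplies densities directly, which can underflow for high-dimensional data or points far from all components. A possible log-space variant (a sketch, reusing numpy and multivariate_normal imported above):

from scipy.special import logsumexp

def log_responsibility(means, covs, pi, X):
    # sketch: log r_ik = log pi_k + log N(x_i | mu_k, Sigma_k) - logsumexp over k
    n, K = X.shape[0], means.shape[0]
    log_p = np.empty((n, K))
    for k, (mean, cov) in enumerate(zip(means, covs)):
        log_p[:, k] = multivariate_normal.logpdf(X, mean=mean, cov=cov)
    log_r = np.log(pi) + log_p
    log_r -= logsumexp(log_r, axis=1, keepdims=True)
    return np.exp(log_r)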

M-Step

Remember the M-Step: $$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)}) $$

i.e. maximization of $\mathcal Q(\theta , \theta^{(t)})$ with respect to $\theta = \{\pi_1, \pi_2, \dots, \pi_K, (\vec \mu_1, \Sigma_1), (\vec \mu_2, \Sigma_2), \dots, (\vec \mu_K, \Sigma_K) \}$

$$ \mathcal Q(\theta, \theta^{(t)}) = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k + \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log p(\vec x_i \mid \theta _k) $$

We can maximize the two terms separately.

M-Step for $\pi_k$

The first term is $$ \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k = \sum_{i,k} r_{ik} \log \pi_k $$

So we are looking for the maximum of this term under the constraint $\sum_k{\pi_k}=1$.

With a Lagrange multiplier $\lambda$ we have to maximize $$ \sum_{i,k} r_{ik} \log \pi_k + \lambda \left(1- \sum_{k=1}^K{\pi_k}\right) $$

Setting the derivative with respect to $\pi_k$ to zero: $$ \frac{\partial}{\partial \pi_k} \left(\sum_{i,k'} r_{ik'} \log \pi_{k'} + \lambda \left(1-\sum_{k'=1}^K{\pi_{k'}}\right) \right) = 0 $$

$$ \frac{\sum_i r_{ik}}{\pi_k} - \lambda = 0 $$

or

$$ \frac{\sum_i r_{ik}}{\lambda} = \pi_k $$

Summing over all $k$ and using the constraint, we get: $$ \sum_k \frac{\sum_i r_{ik}}{\lambda} = \sum_k \pi_k = 1 $$ Using the definition of the $r_{ik}$ we get for $\lambda$: $$ \lambda = \sum_{i,k} r_{ik} = \sum_i \sum_k p(z_i = k \mid \vec x_i, \theta^{(t)}) = \sum_i 1 = n $$

This gives, with $r_{k} = \sum_{i=1}^n r_{ik}$,

$$ \pi_k = \frac{r_k}{n} $$

In [13]:
def m_step_pi(r_k, n):
    return r_k / n

M-Step for $\vec \mu_k$ and $\Sigma_k$

In all the computations above we did not use the fact that the base distributions are Gaussians, so they also hold for other mixture models.

TODO: explicit derivation (a short sketch for $\vec \mu_k$ is given below the formulas)

$$ \vec \mu_k = \frac{\sum_i r_{ik} \vec x_i}{r_k} $$

$$ \Sigma_k = \frac{\sum_i r_{ik}(\vec x_i - \vec \mu_k) (\vec x_i - \vec \mu_k)^T}{r_k} = \frac{\sum_i r_{ik}\vec x_i \vec x_i^T}{r_k} - \vec \mu_k \vec \mu_k^T $$
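A brief sketch for the $\vec \mu_k$ update: the second term of $\mathcal Q$ is a weighted sum of Gaussian log-densities, and setting its gradient with respect to $\vec \mu_k$ to zero gives

$$ \frac{\partial}{\partial \vec \mu_k} \sum_{i=1}^n r_{ik} \log \mathcal N(\vec x_i \mid \vec \mu_k, \Sigma_k) = \sum_{i=1}^n r_{ik}\, \Sigma_k^{-1} (\vec x_i - \vec \mu_k) = 0 \quad \Rightarrow \quad \vec \mu_k = \frac{\sum_i r_{ik}\, \vec x_i}{r_k} $$

The update for $\Sigma_k$ follows analogously by setting the gradient with respect to $\Sigma_k$ to zero.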

In [14]:
# Test of the XX computation with einsum: compare against explicit outer products
a = np.einsum('ac,ad->acd', X, X)
for i in range(len(X)):
    b = np.outer(X[i,:], X[i,:])
    np.testing.assert_allclose(a[i], b)
In [15]:
def m_step_mu_sigma(r, r_k, X):
    # weighted means: mu_k = sum_i r_ik x_i / r_k
    mu = np.dot(r.T, X)
    mu = (mu.T / r_k).T
    
    # weighted second moments: sum_i r_ik x_i x_i^T / r_k
    XX = np.einsum('ac,ad->acd', X, X)    # outer products x_i x_i^T
    t1 = np.einsum('ace,ad->dce', XX, r)  # sum_i r_ik x_i x_i^T for every k
    t1 = (t1.T / r_k).T
    # covariances: subtract mu_k mu_k^T
    t2 = np.einsum('ac,ad->acd', mu, mu)
    sigma = t1 - t2
    return mu, sigma
In [16]:
r_k = r.sum(axis=0)
mu, sigma = m_step_mu_sigma(r, r_k, X)
In [17]:
sigma
Out[17]:
array([[[2.06087833, 0.18880427],
        [0.18880427, 6.45992082]],

       [[0.62648901, 0.40307152],
        [0.40307152, 0.44573864]],

       [[5.93178835, 0.29021649],
        [0.29021649, 0.09886901]]])
In [18]:
# TODO:
# visualization helper from the Python Data Science Handbook
# remove/replace this!

from matplotlib.patches import Ellipse

def draw_ellipse(position, covariance, ax=None, **kwargs):
    """Draw an ellipse with a given position and covariance"""
    ax = ax or plt.gca()
    
    # Convert covariance to principal axes
    if covariance.shape == (2, 2):
        U, s, Vt = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        angle = 0
        width, height = 2 * np.sqrt(covariance)
    
    # Draw the Ellipse
    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                             angle=angle, **kwargs))

def plot_gmm(means, covars, weights, X, label=False, ax=None):
    
    ax = ax or plt.gca()
    if label:
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    
    w_factor = 0.2 / weights.max()
    for pos, covar, w in zip(means, covars, weights):
        draw_ellipse(pos, covar, alpha=w * w_factor)
In [19]:
# TODO make numerically stable
def expectation_maximization(X, k=3):
    n = X.shape[0]
    data_dim = X.shape[1]

    # random start values - TODO
    pi = np.ones(k)/k
    mus = np.random.uniform(low=-2, high=3, size=(k, data_dim))
    # one random covariance matrix per component
    sigmas = np.array([get_random_cov(size=data_dim, s=1.5) for _ in range(k)])

    converged = False
    
    while not converged: 
        # E-Step
        r = responsibility(mus, sigmas, pi, X)
        r_k = r.sum(axis=0)
        # M-Step
        pi_, (mus_, sigmas_)  = m_step_pi(r_k, n), m_step_mu_sigma(r, r_k, X)
        # stopping criterion
        tol = 1e-02
        if np.allclose(pi, pi_, rtol=tol) and np.allclose(mus, mus_, rtol=tol) and np.allclose(sigmas, sigmas_, rtol=tol):
            converged = True 
        pi, mus, sigmas = pi_, mus_, sigmas_
    return pi, mus, sigmas
In [20]:
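# EM can fail for an unlucky random initialization (e.g. a covariance estimate
# that is no longer positive definite raises a ValueError in multivariate_normal);
# in that case simply retry with new random start values.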
while True:
    try:
        pi, mus, sigmas = expectation_maximization(X)
        break
    except ValueError:
        pass
In [21]:
plot_gmm(mus, sigmas, pi, X, label=False, ax=None)

Literature

  • K. Murphy: Machine Learning - A Probabilistic Perspective, MIT Press, 2012
  • [Row99] Sam Roweis, Zoubin Ghahramani: A Unified View of Linear Gaussian Models, Neural Computation, 11(2), 305-45, 1999
  • [Rad93] Radford M. Neal and Geoffrey E. Hinton: A New View of the EM Algorithm that Justifies Incremental and Other Variants, Learning in Graphical Models, 355-368, Kluwer Academic Publishers, 1993