Expectation-Maximization

following a variational derivation

Goal

Find Maximum Likelihood Estimation (MLE) (or Maximum a posteriori / MAP) when some data is missing.

Given $n$ iid-observations (data)

$$ {\bf X} = (\vec x_1, \vec x_2, \dots \vec x_n) $$

Note: $\vec x_i$ can be a sequence.

Model

$$ (\vec x, \vec z) \sim p_\theta $$ for some (unknown) $\theta \in \Theta$

Typically $p_\theta$ is in the exponential family.

  • $\vec z$: hidden (latent) random variables

Goal

Maximization of the likelihood (or log-likelihood) of the parameters given the observations.

$$ \theta_{mle} = \text{arg} \max_\theta p({\bf X} \mid \theta) $$

Marginal likelihood of the parameters w.r.t. the observed data:

$$ \mathcal l(\theta) = p({\bf X} \mid \theta) = \prod_{i=1}^n \left(\sum_{\vec z_i} p(\vec x_i, \vec z_i \mid \theta )\right) $$

Note: For continuous latent variables the sum over $\vec z_i$ must be replaced by an integral.

$$ \mathcal l(\theta) = p({\bf X} \mid \theta) = \prod_{i=1}^n \left(\int_{\mathcal Z} p(\vec x_i, \vec z_i \mid \theta ) d \vec z_i\right) $$

Marginal log likelihood of the parameters w.r.t. the observed data

$$ L(\theta) = \log p({\bf X} \mid \theta) = \sum_{i=1}^n \left(\log \sum_{\vec z_i} p(\vec x_i, \vec z_i \mid \theta )\right) $$

The the log-likelihood is difficult to maximize: the sum is inside the log.

Complete data log likelihood:

$$ L_c(\theta) = \log p({\bf X, Z} \mid \theta) = \sum_{i=1}^n \log p(\vec x_i, \vec z_i \mid \theta ) $$

can not be computed, since $\vec z_i$ is unknown.

Here: for each observation $x_i$ we a have a hidden variable $z_i$.

Lower Bound

With an arbitrary distribution $q({\bf Z})$ over the hidden variables and using Jensen's inequality, we can obtain a lower bound for the log-likelihood:

$$ \begin{align} L(\theta) &= \sum_{i} \log p({\bf x}_i \mid \theta ) \\ & = \sum_{i} \log \int_{\mathcal Z} p({\bf x}_i,{\bf z}_i \mid \theta ) d {\bf z}_i = \\ & = \sum_{i} \log \int_{\mathcal Z} q({\bf z}_i) \frac{p({\bf x}_i, {\bf z}\mid \theta )}{q({\bf z}_i)} d {\bf z}_i \\ & \geq \sum_{i} \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} d {\bf z}_i \\ & = \sum_{i} \mathbb E_{q({\bf z}_i)} \left[\log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)}\right] \\ & = - \sum_{i} \mathcal D_{KL} \left( q({\bf z}_i) \mid \mid p({\bf x}_i, {\bf z}_i \mid \theta ) \right) \\ & = \mathcal L(q, \theta) \end{align} $$

  • We have a family of lower bounds: Different $q(x)$ choose other members of the family. So maximizing the lower bound w.r.t. $q$'s for a fixed $\theta$ gives a "stronger" lower bound at the position $\theta$.
  • $\mathcal L(q, \theta)$: variational lower bound

Note: By using Jensen's inequality we pushed the $\log$ inside the integral/sum.

EM Algorithm

  • E-Step: $q^{(t+1)} \leftarrow \text{arg} \max_q \mathcal L(q,\theta^{(t)}) $
    Search in the family of lower bounds at the current position $\theta^{(t)}$ the strongest bound.
  • M-Step: $\theta^{(t+1)} \leftarrow \text{arg} \max_{\theta} \mathcal L(q^{(t+1)},\theta)$
    Maximize the current lower bound (given by $q^{(t+1)}$) w.r.t. $\theta$.

E-Step in detail

The Gap between the log-likelihood $L(\theta) $ and the lower bound $\mathcal L$ is

$$ \begin{align} \text{GAP} &= L(\theta) - \mathcal L (q, \theta)\\ &= \sum_{i} \log p({\bf x}_i \mid \theta ) - \sum_{i} \int_{\mathcal Z} q({\bf z_i}) \log \frac{p({\bf x}_i, {\bf z_i} \mid \theta )}{q({\bf z}_i)} d {\bf z}_i\\ &= \sum_{i} \log p({\bf x}_i \mid \theta )\int_{\mathcal Z} q({\bf z}_i) d{\bf z}_i - \sum_{i} \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} d {\bf z}_i\\ &= \sum_{i} \left( \int_{\mathcal Z} q({\bf z}_i) \log p({\bf x}_i \mid \theta ) d{\bf z} - \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} d {\bf z}_i \right)\\ &= \sum_{i} \left( \int_{\mathcal Z} \left( q({\bf z}_i) \log p({\bf x}_i \mid \theta ) - q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} \right) d {\bf z}_i \right)\\ &= \sum_{i} \left( \int_{\mathcal Z} \left( q({\bf z}_i) \log \frac{p({\bf x}_i \mid \theta ) q({\bf z}_i) }{p({\bf x}_i, {\bf z}_i \mid \theta )} \right) d {\bf z}_i \right)\\ &= \sum_{i} \left( \int_{\mathcal Z} \left( q({\bf z}_i) \log \frac{p({\bf x}_i \mid \theta ) q({\bf z_i}) }{p({\bf z_i} \mid {\bf x_i}, \theta )p({\bf x_i}\mid \theta)} \right) d {\bf z}_i \right)\\ &= \sum_{_i} \left( \int_{\mathcal Z} \left( q({\bf z}_i) \log \frac{ q({\bf z}_i) }{p({\bf z}_i \mid {\bf x_i}, \theta )} \right) d {\bf z}_i \right)\\ &= \sum_{_i} \mathcal D_{KL}\left( q({\bf z}_i) \mid \mid p({\bf z}_i \mid {\bf x_i}, \theta )\right ) \end{align} $$

with the Kullback-Leiber divergence $\mathcal D_{KL}$.

We want to minimize the GAP (equals to maximize the lower bound).

The Kullback-Leiber divergence $\mathcal D_{KL}$ is zero iff $q({\bf z}_i) = p({\bf z}_i \mid {\bf x_i}, \theta )$.

So in the E-Step we set the variational distribution $q({\bf z}_i)$ to $p({\bf z}_i \mid {\bf x_i}, \theta )$ (if we can compute $p({\bf z}_i \mid {\bf x_i}, \theta )$)

Note: For each $i$ we have a (in general different) $q({\bf z}_i)$, i.e. we used the usual "abuse of notation": the variable ${\bf z}_i $ of the function $p({\bf z}_i)$ is used to identify the distribution.

M-Step in detail

$$ \begin{align} \theta^{(t+1)} &= \text{arg} \max_{\theta} \mathcal L (q, \theta) \\ &=\text{arg} \max_{\theta} \left( \sum_{i} \int_{\mathcal Z} q({\bf z_i}) \log \frac{p({\bf x}_i, {\bf z_i} \mid \theta )}{q({\bf z}_i)} d {\bf z}_i \right)\\ &=\text{arg} \max_{\theta} \left(\sum_{i} \int_{\mathcal Z} q({\bf z_i}) \log {p({\bf x}_i, {\bf z_i} \mid \theta )} d {\bf z}_i - \sum_{i} \int_{\mathcal Z} {q({\bf z}_i)} \log q({\bf z}_i) d {\bf z}_i \right)\\ &=\text{arg} \max_{\theta} \left( \sum_{i} \int_{\mathcal Z} q({\bf z_i}) \log {p({\bf x}_i, {\bf z_i} \mid \theta )} d {\bf z}_i\right)\\ &=\text{arg} \max_{\theta} \sum_{i} \mathbb E_{q_i} \left[ \log {p({\bf x}_i, {\bf z_i} \mid \theta )} \right] \end{align} $$

Note: We maximize the expected complete data log-likelihood w.r.t. $q({\bf z}_i)$, i.e. the complete data log-likelihood by using our current guess of the distribution $p({\bf z}_i \mid {\bf x_i}, \theta )$.

Convergence of EM

Lower bound:

$$ \log p(X \mid \theta^{(t+1)}) \geq \mathcal L(q^{(t+1)}, \theta^{(t+1)}) $$

and because of the optimization procedure (E-Step and M-Step) the following inequalities holds:

$$ \mathcal L(q^{(t+1)}, \theta^{(t+1)})\geq \mathcal L(q^{(t+1)}, \theta^{(t)}) \geq \mathcal L(q^{(t)}, \theta^{(t)}) $$

and for $q({\bf z}_i) = p({\bf z}_i \mid {\bf x_i}, \theta )$ $$ \mathcal L(q^{(t)}, \theta^{(t)}) = \log p(X \mid \theta^{(t)}) $$

Therefore $$ \log p(X \mid \theta^{(t+1)}) \geq \log p(X \mid \theta^{(t)}) $$

So is guaranteed that in each iteration the log likelihood increases (until convergence).

  • This can be used for debugging: the log likelihood must not increase.
  • EM converges to a (local) maximum (or a saddle point).

Summary EM Algorithm

  • E-Step: $q^{(t+1)}= \text{arg}\min_q\sum_{_i} \mathcal D_{KL}\left( q({\bf z}_i) \mid \mid p({\bf z}_i \mid {\bf x_i}, \theta^{(t)} )\right)$
  • M-Step: $\theta^{(t+1)} = \text{arg}\max_\theta \sum_{i} \mathbb E_{q_i^{(t+1)}} \left[ \log {p({\bf x}_i, {\bf z_i} \mid \theta )} \right] $

Literature

  • [Rad93] Radford M. Neal and Geoffrey E. Hinton: A New View of the EM Algorithm that Justifies Incremental and Other Variants, Learning in Graphical Models, 355-368, Kluwer Academic Publishers, 1993
  • A. Novikov, D. Polykovskiy: Online Course: Bayesian Methods for Machine Learning