MCMC slides

Sampling Methods

Source: D. Koller, Graphical Models

IID-Data:

$$ \mathcal D = \{ x^{(1)}, x^{(2)}, x^{(3)}, \dots , x^{(m)} \} $$

If $P(X=1) = \theta_1$, an estimator for $\theta_1$ is:

$$ \hat \theta_1 = \frac{1}{m} \sum_{i=1}^{m} \mathbb 1_{x^{(i)}=1} $$

with the indicator function $\mathbb 1$ (which is 1 if its subscript is true and 0 otherwise).

More generally, for any distribution $P$ and function $f(x)$, the empirical expectation $\mathbb{\hat E}_\mathcal D (f)$ on a data set $\mathcal D$ is

$$ \mathbb{\hat E}_\mathcal D (f) := \frac{1}{m} \sum_{i=1}^{m} f(x^{(i)}) $$

The empirical expectation is an approximation of the expectation under $P$:

$$ \mathbb{E}_P (f) \approx \mathbb{\hat E}_\mathcal D (f) $$

Example: Sampling from multinomial with $ p = (0.5, 0.2, 0.3)$

In [4]:
import numpy as np

# draw 100 samples from a multinomial with p = (0.5, 0.2, 0.3)
mn = np.random.multinomial(1, (0.5, 0.2, 0.3), size=100)
c = mn.argmax(axis=1)          # category index of each sample
for i in range(3):
    theta = float((c == i).sum()) / len(c)   # empirical estimate of p_i
    print(theta)
0.43
0.25
0.32

Hoeffding Bound (additive bound):

$$ \mathbb P_D \left( \mathbb{\hat E}_\mathcal D (f) \notin [\theta-\epsilon, \theta+\epsilon] \right) \leq 2 e^{-2m \epsilon^2} $$

Here $\theta = \mathbb E_P(f)$ denotes the true expectation; the probability of drawing a data set whose empirical estimate is far from this true value shrinks exponentially in the number of samples $m$.

Chernoff Bound (multiplicative bound):

$$ \mathbb P_D \left( \mathbb{\hat E}_\mathcal D (f) \notin [\theta(1-\epsilon), \theta(1+\epsilon)] \right) \leq 2 e^{-m \theta \epsilon^2 / 3} $$

For the additive bound: if we want the empirical estimate to be at most $\epsilon$ away from the true expectation with probability at least $1-\delta$, we require $$ 2 e^{-2m \epsilon^2} \leq \delta $$

$$ m \geq \frac{\ln {2/\delta}}{2\epsilon^2} $$

Analogously for the Chernoff bound: $$ m \geq \frac{3 \ln(2/\delta)}{\theta \epsilon^2} $$

Note: $\theta$ appears in the denominator, so for small $\theta$ we need more samples $m$.
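For illustration, the required sample sizes can be computed directly from these two formulas (a minimal sketch; the values chosen for $\epsilon$, $\delta$ and $\theta$ are arbitrary):

In [ ]:
import numpy as np

def m_hoeffding(eps, delta):
    # additive bound: m >= ln(2/delta) / (2 eps^2)
    return int(np.ceil(np.log(2.0 / delta) / (2.0 * eps**2)))

def m_chernoff(eps, delta, theta):
    # multiplicative bound: m >= 3 ln(2/delta) / (theta eps^2)
    return int(np.ceil(3.0 * np.log(2.0 / delta) / (theta * eps**2)))

print(m_hoeffding(eps=0.01, delta=0.05))
print(m_chernoff(eps=0.1, delta=0.05, theta=0.01))   # small theta -> many samples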

Predictions

Posterior predictive distribution

  • observed data $y$
  • prediction of $\tilde y$ (theoretically observable)
$$ p(\tilde y \mid y) = \int p(\tilde y, \theta \mid y) d\theta = \int p(\tilde y \mid \theta, y) p(\theta \mid y) d\theta $$

Using the conditional independence $\tilde y \perp y \mid \theta$, i.e. $p(\tilde y \mid \theta, y) = p(\tilde y \mid \theta)$, this becomes:

$$ p(\tilde y \mid y) = \int p(\tilde y \mid \theta) p(\theta \mid y) d\theta $$
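Given posterior samples $\theta^{(i)} \sim p(\theta \mid y)$, the integral can be approximated by the Monte Carlo average $\frac{1}{m}\sum_i p(\tilde y \mid \theta^{(i)})$. A minimal sketch, assuming a Beta-Bernoulli model with hypothetical data (not part of the slides), where the posterior happens to be available in closed form:

In [ ]:
import numpy as np

np.random.seed(0)
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])    # hypothetical observed Bernoulli data

# Beta(1, 1) prior  ->  Beta(1 + sum(y), 1 + len(y) - sum(y)) posterior
a_post = 1 + y.sum()
b_post = 1 + len(y) - y.sum()

# Monte Carlo estimate of p(y_tilde = 1 | y) = E[theta | y]
theta_samples = np.random.beta(a_post, b_post, size=100000)
print(theta_samples.mean())              # Monte Carlo approximation
print(a_post / (a_post + b_post))        # exact posterior mean for comparison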

Markov Chain Monte Carlo (MCMC)

Markov Chain

A Markov chain defines a probabilistic transition model $T(x \rightarrow x')$ with

$$ \forall x: \sum_{x'} T(x \rightarrow x') = 1 $$

Temporal dynamics

$$ P( X^{(t+1)} = x') = \sum_x P( X^{(t)} = x) T(x\rightarrow x') $$

Stationary distributions

for large $t$ $$ P(X^{(t)} =x') \approx P\left( X^{(t+1)} = x'\right) = \sum_x P( X^{(t)} = x) T(x\rightarrow x') $$

stationary distribution $\pi$ $$ \pi (x') = \sum_x \pi(x) T(x\rightarrow x') $$

Example: PageRank

Example

Transition probability matrix for three states $a, b$ and $c$:

$$ T = \begin{pmatrix} T(a\rightarrow a) & T(a\rightarrow b) & T(a\rightarrow c) \\ T(b\rightarrow a) & T(b\rightarrow b) & T(b\rightarrow c) \\ T(c\rightarrow a) & T(c\rightarrow b) & T(c\rightarrow c) \end{pmatrix} = \begin{pmatrix} 0.3 & 0.7 & 0.0 \\ 0.5 & 0.05 & 0.45 \\ 0.0 & 0.45 & 0.55 \end{pmatrix} $$

Note: each row sums to 1.

In [5]:
import numpy as np
T = np.array([[0.3, 0.7, 0.0],
              [0.5, 0.05, 0.45],
              [0.0, 0.45, 0.55]])

# state distribution at t=0: we start at b
x_0 = np.array([0., 1., 0.])
In [6]:
# one transition step
x1 = x_0.dot(T)
print(x1)
[ 0.5   0.05  0.45]

The probabilities converge to a stationary distribution:

In [11]:
x = x_0
for i in range(20):
    x = x.dot(T)      # propagate the state distribution one step
    print(x)
    #print(x.sum())
    #assert x.sum() == 1.
[ 0.5   0.05  0.45]
[ 0.175  0.555  0.27 ]
[ 0.33     0.27175  0.39825]
[ 0.234875  0.4238    0.341325]
[ 0.2823625   0.33919875  0.37843875]
[ 0.25430812  0.38491112  0.36078075]
[ 0.268748    0.35961258  0.37163942]
[ 0.26043069  0.37334197  0.36622734]
[ 0.26480019  0.36577089  0.36942892]
[ 0.2623255   0.36989169  0.36778281]
[ 0.2636435   0.3676247   0.36873181]
[ 0.2629054   0.368861    0.36823361]
[ 0.26330212  0.36818195  0.36851593]
[ 0.26308161  0.36855275  0.36836564]
[ 0.26320086  0.3683493   0.36844984]
[ 0.26313491  0.36846049  0.3684046 ]
[ 0.26317072  0.36839953  0.36842975]
[ 0.26315098  0.36843287  0.36841615]
[ 0.26316173  0.3684146   0.36842367]
[ 0.26315582  0.36842459  0.36841959]

Note that the stationary distribution (steady state) satisfies $$ \vec x\, T = \vec x $$ This is an eigenvalue equation (for a left eigenvector of $T$) with eigenvalue 1.

In [13]:
# left eigenvectors of T are the eigenvectors of T transposed
eigenvalues, eigenvectors = np.linalg.eig(T.T)
eigenvalues
Out[13]:
array([-0.54749372,  0.44749372,  1.        ])
In [14]:
i = eigenvalues.argmax()   # index of the largest eigenvalue (which is 1)
eigenvectors[:, i] / sum(eigenvectors[:, i])   # normalize to get the stationary distribution
Out[14]:
array([ 0.26315789,  0.36842105,  0.36842105])
  • Does every Markov chain have a stationary distribution?
  • Even if a Markov chain has a unique stationary distribution, is it guaranteed that the chain converges to it from an arbitrary starting state?

In general, the answer to both questions is no.

But if a Markov chain is regular, then both properties hold.

Regular Markov Chains

A Markov chain is regular if there exists a $k$ such that, for every $x, x'$, the probability of getting from $x$ to $x'$ in exactly $k$ steps is $>0$.

A regular Markov chain converges to a unique stationary distribution regardless of the start state.
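Regularity can be checked numerically for the three-state example above: if some power $T^k$ has only strictly positive entries, every state can be reached from every state in exactly $k$ steps (a small sketch reusing the matrix $T$ from the example):

In [ ]:
import numpy as np

T = np.array([[0.3, 0.7, 0.0],
              [0.5, 0.05, 0.45],
              [0.0, 0.45, 0.55]])

# entry (i, j) of T^k is the probability of getting from state i to state j in exactly k steps
for k in range(1, 5):
    print(k, (np.linalg.matrix_power(T, k) > 0).all())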

  • a Markov Chain defines a dynamical system from which we can sample trajectories
  • under certain conditions (e.g. regularity), this process is guaranteed to converge to a stationary distribution in the limit
  • this allows us to sample from a distribution indirectly and thereby provides a mechanism for sampling from an intractable distribution

Using a Markov Chain

  • Goal: compute $p(x \in S)$
    • but $p$ is too hard to sample from directly
  • Construct a Markov chain $T$ whose unique stationary distribution is $p$
  • Sample $x^{(0)}$ from some initial distribution $P^{(0)}$
  • for $t = 0, 1, 2, \dots$:
    • generate $x^{(t+1)}$ from $T(x^{(t)} \rightarrow x')$
  • We only want to use samples that are drawn from a distribution close to $p$
  • At early iterations, $p^{(t)}$ is usually far from $p$
  • so we start collecting samples only after the chain has run long enough to "mix" (see the sketch after this list)
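A schematic version of this procedure (a sketch only; `sample_transition` is a hypothetical helper standing in for a draw from $T(x^{(t)} \rightarrow x')$):

In [ ]:
import numpy as np

def mcmc_samples(sample_transition, x0, n_samples, burn_in=1000):
    """Run a Markov chain and keep only the samples after the burn-in phase."""
    x = x0
    samples = []
    for t in range(burn_in + n_samples):
        x = sample_transition(x)       # x^{(t+1)} ~ T(x^{(t)} -> x')
        if t >= burn_in:               # discard samples drawn before the chain has mixed
            samples.append(x)
    return samples

# example: the three-state chain from above, states 0 (=a), 1 (=b), 2 (=c)
T = np.array([[0.3, 0.7, 0.0],
              [0.5, 0.05, 0.45],
              [0.0, 0.45, 0.55]])
step = lambda x: np.random.choice(3, p=T[x])
samples = mcmc_samples(step, x0=0, n_samples=10000)
print(np.bincount(samples, minlength=3) / len(samples))   # approx. the stationary distribution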

Mixing

  • How do you know if a chain has mixed or not?

    • in general, you can never "prove" a chain has mixed
    • but in many cases you can show that it has NOT
  • How do you know a chain has not mixed?

    • Compare chain statistics in different windows within a single run of the chain
    • and across different runs initialized differently (see the sketch below)
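An informal check along these lines, reusing the three-state chain from the example above: compare the state frequencies in different windows of a single run and across two runs with different initial states; if they disagree noticeably, the chain has not mixed.

In [ ]:
import numpy as np

T = np.array([[0.3, 0.7, 0.0],
              [0.5, 0.05, 0.45],
              [0.0, 0.45, 0.55]])

def run_chain(x0, n_steps):
    xs = [x0]
    for _ in range(n_steps):
        xs.append(np.random.choice(3, p=T[xs[-1]]))   # one transition step
    return np.array(xs)

def window_freqs(xs, n_windows=4):
    # state frequencies in consecutive windows of a single run
    return [np.bincount(w, minlength=3) / len(w) for w in np.array_split(xs, n_windows)]

chain_a = run_chain(x0=0, n_steps=4000)
chain_b = run_chain(x0=2, n_steps=4000)

print(window_freqs(chain_a))   # should look similar across windows ...
print(window_freqs(chain_b))   # ... and similar across the two runs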

Markov Chain: $$ \theta_0 \rightarrow \theta_1 \rightarrow \theta_2 \rightarrow \theta_3 \rightarrow \theta_4 \rightarrow \theta_5 \dots $$

where $\theta_0 \sim p(\theta_0), \theta_1 \sim p(\theta_1) \dots$
and with the property

$$ p_t(\theta) = \int d\theta' p_{t-1}(\theta') T(\theta'\rightarrow \theta) $$

Gibbs Chain

  • Target distribution: $P_\Phi (X_1, X_2, \dots X_n)$
  • Markov chain state space: complete assignments $\vec x$ to $\vec X$
  • Transition model given starting state $\vec x$:
    • for $i \in \{1, 2, \dots, n\}$:
      • sample $x_i \sim P_\Phi(X_i \mid \vec x_{-i})$, where $\vec x_{-i}$ is the assignment to all variables except $X_i$
    • the resulting assignment is the new state $\vec x'$ (see the sketch below)
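A concrete sketch (not from the slides): Gibbs sampling for a standard bivariate Gaussian with correlation $\rho$, where both full conditionals are known in closed form, $x_1 \mid x_2 \sim \mathcal N(\rho x_2,\, 1-\rho^2)$ and analogously for $x_2 \mid x_1$:

In [ ]:
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, burn_in=1000):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho."""
    x1, x2 = 0.0, 0.0
    samples = []
    for t in range(burn_in + n_samples):
        # sample each variable from its full conditional given the current value of the other
        x1 = np.random.normal(rho * x2, np.sqrt(1 - rho**2))
        x2 = np.random.normal(rho * x1, np.sqrt(1 - rho**2))
        if t >= burn_in:
            samples.append((x1, x2))
    return np.array(samples)

s = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(np.corrcoef(s.T)[0, 1])   # should be close to 0.8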

Reversible Chain

Detailed Balance:

$$ p^*(\theta') T(\theta'\rightarrow \theta) = p^*(\theta) T(\theta\rightarrow \theta') $$

Theorem: If detailed balance holds for $p^*$ and $T$ is regular, then $p^*$ is the unique stationary distribution of $T$.

Proof: Summing the detailed balance equation over $x'$ gives

$$ \sum_{x'} p^*(x') T(x'\rightarrow x) = \sum_{x'} p^*(x) T(x\rightarrow x') $$

The right-hand side is: $$ \sum_{x'} p^*(x) T(x\rightarrow x') = p^*(x) \sum_{x'} T(x\rightarrow x') = p^*(x) $$ So this is exactly the definition of a stationary distribution:

$$ \sum_{x'} p^*(x') T(x'\rightarrow x) = p^*(x) $$
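A small numerical illustration of the theorem: construct a chain that satisfies detailed balance with respect to a chosen $\pi$ (here by setting $T(x \rightarrow x') \propto \pi(x')$ for $x \neq x'$, an arbitrary construction for this sketch) and verify that $\pi$ is indeed stationary:

In [ ]:
import numpy as np

pi = np.array([0.2, 0.3, 0.5])             # target distribution

# reversible chain: T(x -> x') = c * pi(x') for x != x', so that
# pi(x) T(x -> x') = c pi(x) pi(x') = pi(x') T(x' -> x)   (detailed balance)
c = 0.5
T = c * np.tile(pi, (3, 1))
np.fill_diagonal(T, 0.0)
np.fill_diagonal(T, 1.0 - T.sum(axis=1))   # remaining probability mass stays at x

balance = pi[:, None] * T                  # entry (x, x') is pi(x) T(x -> x')
print(np.allclose(balance, balance.T))     # detailed balance holds
print(np.allclose(pi.dot(T), pi))          # hence pi is stationary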

Metropolis Hastings Chain

  • Proposal distribution $Q(\vec x \rightarrow \vec x')$

  • Acceptance probability $A(\vec x \rightarrow \vec x')$

At each state $\vec x$,

  • sample from $Q(\vec x \rightarrow \vec x')$
  • accept proposal with acceptance probability $A(\vec x \rightarrow \vec x')$,
    otherwise stay at $\vec x$

for $\vec x \neq \vec x'$: $$ T(\vec x \rightarrow \vec x') = Q(\vec x \rightarrow \vec x') A(\vec x \rightarrow \vec x') $$

for $\vec x = \vec x'$: $$ T(\vec x \rightarrow \vec x) = Q(\vec x \rightarrow \vec x) + \sum_{\vec x' \neq \vec x} Q(\vec x \rightarrow \vec x') \left(1-A(\vec x \rightarrow \vec x')\right) $$

from detailed balance:

$$ \pi(x') T(x'\rightarrow x) = \pi(x) T(x \rightarrow x') $$

and for $\vec x \neq \vec x'$ the following must hold:

$$ \pi(\vec x') Q(\vec x' \rightarrow \vec x) A(\vec x' \rightarrow \vec x) = \pi(\vec x) Q(\vec x \rightarrow \vec x') A(\vec x \rightarrow \vec x') $$

This yields a constraint on the ratio of the acceptance probabilities:

$$ \frac{A(\vec x \rightarrow \vec x')}{A(\vec x' \rightarrow \vec x)} = \frac{\pi(\vec x') Q(\vec x' \rightarrow \vec x) }{\pi(\vec x) Q(\vec x \rightarrow \vec x') } $$

Without loss of generality, assume the numerator is smaller than the denominator; then choose:

  • $A(\vec x \rightarrow \vec x') = \rho \leq 1$
  • $A(\vec x' \rightarrow \vec x) = 1$

general rule:

$$ A(\vec x \rightarrow \vec x') = \min\left[ 1, \frac{\pi(\vec x') Q(\vec x' \rightarrow \vec x) }{\pi(\vec x) Q(\vec x \rightarrow \vec x')} \right] $$
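A minimal sketch of the resulting sampler, assuming a one-dimensional target that is known only up to normalization and a Gaussian random-walk proposal (which is symmetric, so the $Q$-ratio cancels):

In [ ]:
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, proposal_std=1.0):
    """Random-walk Metropolis-Hastings; log_p is the unnormalized log target."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + proposal_std * np.random.randn()    # sample from Q(x -> x')
        # accept with probability A = min(1, pi(x') / pi(x)); the symmetric Q cancels
        if np.log(np.random.rand()) < log_p(x_prop) - log_p(x):
            x = x_prop                                   # accept the proposal
        samples.append(x)                                # otherwise stay at x
    return np.array(samples)

log_p = lambda x: -0.5 * x**2                            # standard normal, up to a constant
samples = metropolis_hastings(log_p, x0=0.0, n_samples=50000)
print(samples[10000:].mean(), samples[10000:].std())     # approx. 0 and 1 after burn-in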

Choice of Q

Q must be reversible:

  • if $Q(\vec x \rightarrow \vec x') > 0$ then also $Q(\vec x' \rightarrow \vec x) > 0$

Opposing forces:

  • Q should try to spread out, to improve mixing
  • but then the acceptance probability is often low (see the sketch below)
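This trade-off can be made visible by running a random-walk sampler with different proposal widths and recording the acceptance rate (a sketch with a standard normal target):

In [ ]:
import numpy as np

log_p = lambda x: -0.5 * x**2        # standard normal target, up to a constant

for proposal_std in [0.1, 1.0, 10.0, 100.0]:
    x, n_accept = 0.0, 0
    for _ in range(20000):
        x_prop = x + proposal_std * np.random.randn()
        if np.log(np.random.rand()) < log_p(x_prop) - log_p(x):
            x, n_accept = x_prop, n_accept + 1
    # wide proposals move further per step but are rejected more often
    print(proposal_std, n_accept / 20000.0)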