MCMC slides

## Sampling Methods¶

Source: D. Koller, Graphical Models

#### IID-Data:¶

$$\mathcal D = \{ x^{(1)}, x^{(2)}, x^{(3)}, \dots , x^{(m)} \}$$

if $P(X=1) = \theta_1$

Estimator for $\theta_1$:

$$\hat \theta_1 = \frac{1}{m} \sum_{i=1}^{m} \mathbb 1_{x^{(i)}=1}$$

with the indicator function $\mathbb 1$ (is 1 if subscript is true; else 0)

More generally, for any distribution $P$ and function $f(x)$, the empirical expectation $\mathbb{\hat E}_\mathcal D (f)$ on a data set $\mathcal D$ is

$$\mathbb{\hat E}_\mathcal D (f) := \frac{1}{m} \sum_{i=1}^{m} f(x^{(i)})$$

The empirial expectation is an approximation for the expectation on $P$:

$$\mathbb{E}_P (f) \approx \mathbb{\hat E}_\mathcal D (f)$$

#### Example: Sampling from multinomial with $p = (0.5, 0.2, 0.3)$¶

In :
import numpy as np
mn = np.random.multinomial(1, (0.5,0.2,0.3), size=100)
c = mn.argmax(axis=1)
for i in range(3):
theta = float((c==i).sum())/len(c)
print theta

0.43
0.25
0.32


$$\mathbb P_D \left( \mathbb{\hat E}_\mathcal D (f) \notin [\theta-\epsilon, \theta+\epsilon] \right) \leq 2 e^{-2m \epsilon^2}$$

so the probability that we have data which gives us an empirical estimation far from the "true" value shrinks exponentially with the number of data $m$.

Chernoff Bound (multiplicative bound):

$$\mathbb P_D \left( \mathbb{\hat E}_\mathcal D (f) \notin [\theta(1-\epsilon), \theta(1+\epsilon)] \right) \leq 2 e^{-2m \theta \epsilon^2 / 3}$$

For the additive bound: If we want with probability $1-\delta$ that our empirical estimation is maximal $\epsilon$ away from the "real" expectation, so we want $$2 e^{-2m \epsilon^2} \leq \delta$$

$$m \geq \frac{\ln {2/\delta}}{2\epsilon^2}$$

analog for the Chernoff Bound $$m \geq 3\frac{\ln {2/\delta}}{2 \theta \epsilon^2}$$

Note: $\theta$ is in the denominator. So for low $\theta$-values we need more samples $m$.

### Predictions¶

#### Posterior predictive distribution¶

• observed data $y$
• predition on $\tilde y$ (theoretically observable)
$$p(\tilde y \mid y) = \int p(\tilde y, \theta \mid y) d\theta = \int p(\tilde y \mid \theta, y) p(\theta \mid y) d\theta$$

with the conditional independence: $( y \mid \theta ) \perp ( \tilde y \mid \theta)$ this becomes:

$$p(\tilde y \mid y) = \int p(\tilde y \mid \theta) p(\theta \mid y) d\theta$$

### Markov Chain Monte Carlo (MCMC)¶

#### Markov Chain¶

A Markov chain defines a probabilistic transition model with

$$\forall x: \sum_{x'} T(x \rightarrow x')$$

#### Temporal dynamics¶

$$P( X^{(t+1)} = x') = \sum_x P( X^{(t)} = x) T(x\rightarrow x')$$

### Stationary distributions¶

for large $t$ $$P(X^{(t)} =x') \approx P\left( X^{(t+1)} = x'\right) = \sum_x P( X^{(t)} = x) T(x\rightarrow x')$$

stationary distribution $\pi$ $$\pi (x') = \sum_x \pi(x) T(x\rightarrow x')$$

Example: Page Rank

#### Example¶

Transition probability matrix for three states $a, b$ and $c$:

$$T = \begin{pmatrix} T(a\rightarrow a) & T(a\rightarrow b) & T(a\rightarrow c) \\ T(b\rightarrow a) & T(b\rightarrow b) & T(b\rightarrow c) \\ T(c\rightarrow a) & T(c\rightarrow b) & T(c\rightarrow c) \end{pmatrix} = \begin{pmatrix} 0.3 & 0.7 & 0.0 \\ 0.5 & 0.05 & 0.45 \\ 0.0 & 0.45 & 0.55 \end{pmatrix}$$

Note: each row sum to 1.

In :
import numpy as np
T = np.array([[0.3,0.7,0.],[0.5,0.05,0.45],[0.0,0.45,0.55]])

#state at t=0, we are at b:
x_0 = np.array([0.,1.,.0])

In :
x1 = x_0.dot(T)
print x1

[ 0.5   0.05  0.45]


The probabilities converge to a stationary distribution:

In :
x = x_0
for i in range(20):
x = x.dot(T)
print x
#print x.sum()
#assert x.sum() == 1.

[ 0.5   0.05  0.45]
[ 0.175  0.555  0.27 ]
[ 0.33     0.27175  0.39825]
[ 0.234875  0.4238    0.341325]
[ 0.2823625   0.33919875  0.37843875]
[ 0.25430812  0.38491112  0.36078075]
[ 0.268748    0.35961258  0.37163942]
[ 0.26043069  0.37334197  0.36622734]
[ 0.26480019  0.36577089  0.36942892]
[ 0.2623255   0.36989169  0.36778281]
[ 0.2636435   0.3676247   0.36873181]
[ 0.2629054   0.368861    0.36823361]
[ 0.26330212  0.36818195  0.36851593]
[ 0.26308161  0.36855275  0.36836564]
[ 0.26320086  0.3683493   0.36844984]
[ 0.26313491  0.36846049  0.3684046 ]
[ 0.26317072  0.36839953  0.36842975]
[ 0.26315098  0.36843287  0.36841615]
[ 0.26316173  0.3684146   0.36842367]
[ 0.26315582  0.36842459  0.36841959]


Note that for the stationary distribution (steady state) the following equation holds: $$\vec x T = T$$ That's the eigen equation with eigenvalue 1.

In :
eigenvalues

Out:
array([-0.54749372,  0.44749372,  1.        ])
In :
i=eigenvalues.argmax() # index of the largest eigenvector
eigenvectors[:,i]/sum(eigenvectors[:,i])

Out:
array([ 0.26315789,  0.36842105,  0.36842105])
• Has every Markov chain has a stationary distribution?
• Even if a stationary Markov chain has a unique stationary state, is it guaranteed that starting from an arbitary state that the chain converges to this stationary state?

In general both question must not be true.

But if a Markov chain is regular than both properties will become true.

#### Regular Markov Chains¶

A Markov is regular if there exists $k$ such that, for every $x, x'$, the probability of getting from $x$ to $x'$ in exactly $k$ steps is $>0$.

A regular Markov chain converges to a unique stationary distribution regardless of the start state.

• a Markov Chain defines a dynamical system from which we can sample trajectories
• under certain conditions (e.g. regularity), this process is guaranteed to converge to a stationary distribution at the limit
• this allows us to sample from a distribution indirectly and thereby provides a mechanism for sampling from an intractable distribution

#### Using a Markov Chain¶

• Goal compute $p(x \in S)$
• but $p$ is to hard to sample from directly
• Construct a Markov chain $T$ whose unique stationary distribution is $p$
• Sample $x^{(0)}$ from some $P^{(0)}$
• for $t = 0,1,2, \dots$:
• generate $x^{(t+1)}$ from $T(x^{(t)}\rightarrow x'$)
• We only want to use samples that are sampled from a distribution close to $p$
• At iterations, $p^{(t)}$ is usually far from $p$
• start collecting samples only after the chain has long enough to "mix"

#### Mixing¶

• How do you know if a chain has mixed or not?

• in general, you can never "prove" a chain has mixed
• but in many cases you can show that it has NOT
• How do you know a chain has not mixed?

• Compare chain statistics in different windows within a single run of the chain
• and across different runs initialized differently

Markov Chain: $$\theta_0 \rightarrow \theta_1 \rightarrow \theta_2 \rightarrow \theta_3 \rightarrow \theta_4 \rightarrow \theta_5 \dots$$

where $\theta_0 \sim p(\theta_0), \theta_1 \sim p(\theta_1) \dots$
and with the property

$$p_t(\theta) = \int d\theta' p_{t-1}(\theta') T(\theta'\rightarrow \theta)$$

#### Gibbs Chain¶

• Target distribution: $P_\Phi (X_1, X_2, \dots X_n)$
• Markov chain state space: complete assignments $\vec x$ to $\vec X$
• Transition model given starting state $\vec x$:
• for $i \in \{1,2, \dots n\}$:
• sample $x_i \sim P_\Phi(X_i | \vec x_{-i})$ ($\vec x_{-1}$ is an assigment to all variables $x_1, x_2, \dots x_n$ except $x_i$)
• set new $\vec x' = \vec x$

#### Reversible Chain¶

Detailed Balance:

$$p^*(\theta') T(\theta'\rightarrow \theta) = p^*(\theta) T(\theta\rightarrow \theta')$$

Theorem: If detailed balance holds, and $T$ is regular, than $T$ has a unique stationary distribution $\pi$.

Proof:

$$\sum_{x'} p(x') T(x'\rightarrow x) = \sum_{x'} p(x) T(x\rightarrow x')$$

The right hand side is: $$\sum_{x'} p(x) T(x\rightarrow x') = p(x) \sum_{x'} T(x\rightarrow x') = p(x)$$ So this is equivalent to the definition of a stationary distribution:

$$\sum_{x'} p(x') T(x'\rightarrow x) = p(x)$$

#### Metropolis Hastings Chain¶

• Proposal distribution $Q(\vec x \rightarrow \vec x')$

• Acceptance probability $A(\vec x \rightarrow \vec x')$

At each state $\vec x$,

• sample from $Q(\vec x \rightarrow \vec x')$
• accept proposal with acceptance probability $A(\vec x \rightarrow \vec x')$,
otherwise stay at $\vec x$

for $x \neq x'$: $$T(\vec x \rightarrow \vec x') = Q(\vec x \rightarrow \vec x') A(\vec x \rightarrow \vec x')$$

for $x=x'$ $$T(\vec x \rightarrow \vec x) = Q(\vec x \rightarrow \vec x) + \sum_{x'}Q(\vec x \rightarrow \vec x') \left(1-A(\vec x \rightarrow \vec x')\right)$$

from detailed balance:

$$\pi(x') T(x'\rightarrow x) = \pi(x) T(x \rightarrow x')$$

and for $x\neq x'$ the following should hold:

$$\pi(\vec x') Q(\vec x' \rightarrow \vec x) A(\vec x' \rightarrow \vec x) = \pi(\vec x) Q(\vec x \rightarrow \vec x') A(\vec x \rightarrow \vec x')$$

therefore follows a constraint on the $A$-ratio:

$$\frac{A(\vec x \rightarrow \vec x')}{A(\vec x' \rightarrow \vec x)} = \frac{\pi(\vec x') Q(\vec x' \rightarrow \vec x) }{\pi(\vec x) Q(\vec x \rightarrow \vec x') }$$

without loss of generality: numerator smaller as denominator

• $A(\vec x \rightarrow \vec x') = \rho \leq 1$
• $A(\vec x' \rightarrow \vec x) = 1$

general rule:

$$A(\vec x \rightarrow \vec x') = \min\left[ 1, \frac{\pi(\vec x') Q(\vec x' \rightarrow \vec x) }{\pi(\vec x) Q(\vec x \rightarrow \vec x')} \right]$$

#### Choise of Q¶

Q must be reversible:

• iff $Q(\vec x \rightarrow \vec x') > 0$ than $Q(\vec x' \rightarrow \vec x) > 0$

Opposing forces:

• Q should try to spread out, to improve mixing
• but than acceptance probability often low