## Introduction into Bayesian Inference with PyMc3¶

In a full bayesian approach the parameters of a model are considered as random variables. In this way a Bayesian handles with the uncertainty about the values of the parameters.

In [1]:
%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt


#### Example: Tossing Thumtacks¶

In [2]:
from IPython.display import Image
Image(filename='./thumbtack.jpg', width=200, height=200)

Out[2]:

### Bernoulli trials¶

To simulate the tossing behaviour of the thumtack we need the Bernoulli distribution.
$\theta$ is the probability of success, i.e. $X=1$; here: thumtack shows head with $\theta=0.3$:

$$P(X=1) = \theta \\ P(X=0) = 1 - \theta$$

$n$-iid (independent and identically distributed) Benoulli trials can be simulated by:

In [3]:
import scipy.stats
scipy.stats.bernoulli.rvs(0.3, size=10)

Out[3]:
array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1])

### Binomial Distribution¶

with $\theta$ the probability of successes of a Bernoulli trial the probability mass function (pmf) of $k$ successes of $n$ iid trials is given by

$$P( X = k \mid \theta) = {n \choose k} \theta^k (1-\theta)^{n-k}$$

In [4]:
def plot_binom(theta, n):
p=theta
binom = scipy.stats.binom
fig, ax = plt.subplots(1, 1)
x = np.arange(binom.ppf(0.001, n, p),
binom.ppf(0.999, n, p))
ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
ax.set_xlabel("k")
ax.set_ylabel("p(k)")
ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)


In [5]:
plot_binom(theta=0.3, n=10)


### Beta Distribution¶

Probability density function (pdf) of the Beta distribution:

\begin{align} Beta(\alpha, \beta) = p(\theta \mid \alpha,\beta) & = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\int_0^1 u^{\alpha-1} (1-u)^{\beta-1}\, du} \end{align}

with

• $\theta \in { } [0,1]$
• shape parameters: $\alpha > 0, \beta > 0$

Note: The denominator normalizes the pdf

#### Beta distribution in scipy¶

In [6]:
alpha=10.; beta=40.
plt.figure(figsize=(5,5))
plt.title("Beta(%0.1f,%0.1f)"%(alpha,beta))
r = np.arange(0.,1.01,.01)
plt.xlabel("$\\theta$")
plt.ylabel("beta")
plt.plot(r, scipy.stats.beta.pdf(r, alpha, beta), 'r-', lw=5, alpha=0.6, label='beta pdf')
plt.show()


### Conjugate Prior¶

Bayes Rule for the parameters $\vec \theta$ given the observation $\mathcal D$ (data):

$$p(\vec \theta | \mathcal D) = \frac{p({\mathcal D \mid \vec \theta}) p(\vec \theta)}{p(\mathcal D)}$$

In Bayesian probability theory, if the posterior distributions $p(\theta | \mathcal D)$ are in the same family as the prior probability distribution $p(\theta)$, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

Example: The Beta distribution is the conjugate prior if the likelihood function is the Binomial distribution.

#### Sufficient statistics¶

Definition:

• a statistic is sufficient with respect to a statistical model and its associated unknown parameter if "no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter".

Sufficient statistics for Binominal (Example "tossing of the thumtack"): $k, n$

• a sufficent statistics of $\mathcal D$ is the number of successes $X=k$ and the number of trials $n$.
• so we can represent the data, e.g. $\mathcal D = (0,0,1,0,0,1,\dots)$, compactly (with two numbers) as $\mathcal D = (k, n)$.

#### Beta distribution as conjugate prior of the Binomial distribution¶

$$p(\theta \mid \mathcal D) = \frac{p(\mathcal D \mid \theta) p(\theta)}{p(\mathcal D)}$$ Note: the denominator is independent of $\theta$ and normalizes the distribution.

with the sufficent statistics of $\mathcal D$ the number of successes $X=k$ and the number of trials $n$:

$$p(\theta \mid \mathcal D) \propto \theta^k (1-\theta)^{n-k} \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}$$

##### Beta distribution as conjugate prior of the Binomial distribution¶
• prior for $\theta$: $Beta(\alpha, \beta)$
• (new) observations: $k$-successes in $n$-trials
• posterior: $Beta(\alpha + k, \beta + n - k)$
In [7]:
def plot_prior_posterior_beta(alpha, beta, k, n):
r=np.arange(0.,1.01,.01)
plt.figure(figsize=(10,5))
ax = plt.subplot(121)
plt.title("Prior Beta(%0.1f,%0.1f)"%(alpha,beta))
ax.plot(r, scipy.stats.beta.pdf(r, alpha, beta), 'r-', lw=2, alpha=0.6, label='beta pdf')
ax.set_xlabel("$\\theta$")
ax.set_ylabel("prior probability density")
ax = plt.subplot(122)
a = alpha+k; b = beta+n-k
plt.title("Posterior Beta(%0.1f,%0.1f)"%(a ,b))
ax.plot(r, scipy.stats.beta.pdf(r, a, b), 'b-', lw=2, alpha=0.6, label='beta pdf')
ax.set_xlabel("$\\theta$")
ax.set_ylabel("posterior probability density")
plt.show()

In [8]:
plot_prior_posterior_beta(alpha=1., beta=1., k=100, n=300)

##### Analogon: Thumtack machine¶
• think of a machine that produces thumtacks
• the machine can be parameterized with $\alpha, \beta$
• even with fixed $\alpha, \beta$ values the machine produces in general thumtacks with different $\theta$

The prior is our knowledgle of the machine. The machine produces a thumtack. Without any data (throwing the thumtack) our knowledge about $\theta$ is given by our knowledge of the machine.

From the observe data (by throwing the thumtack) we can compute the likelihood of the data.

With Bayes rule we can combine our prior knowledge with the knowledge we gained from the data (the likelihood).

In [9]:
# without any observation of a produced thumtack: k=0, n=0

# with machine parameter: alpha=2., beta=6.
# tacking one produced thumtack and throwing is 100 times
#   with 20 positive observations:
plot_prior_posterior_beta(alpha=2., beta=6., k=20, n=100)

In [10]:
s = 0.9
figsize(11*s, 9*s)

def plot_thumtack_flip(theta_unk = 0.3):
# Beta distribution is the conjugate prior of the binomial distribution
dist = scipy.stats.beta
n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = scipy.stats.bernoulli.rvs(theta_unk, size=n_trials[-1])
x = np.linspace(0, 1, 100)

# For the already prepared, I'm using Binomial's conj. prior.
for k, N in enumerate(n_trials):
sx = plt.subplot(len(n_trials) / 2, 2, k + 1)
#plt.xlabel("$\\theta$, probability of heads") \
#    if k in [0, len(n_trials) - 1] else None
plt.setp(sx.get_yticklabels(), visible=False)
plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
plt.vlines(theta_unk, 0, 4, color="k", linestyles="--", lw=1)

leg = plt.legend()
leg.get_frame().set_alpha(0.4)
plt.autoscale(tight=True)

plt.suptitle("Bayesian updating of posterior probabilities of $\\theta$",
y=1.02,
fontsize=14)

plt.tight_layout()

In [11]:
plot_thumtack_flip()


### Random Variables in PyMc¶

Here we choose $X$ to be Beta distributed (conjugate prior to the Binomial distribution):

$$X \sim Beta(\alpha=1,\beta=6)$$

Such a variable is called stochastic in the terminology of pymc.

In [12]:
alpha=1; beta=6;


### Drawing samples from a distribution¶

In [13]:
import pymc3

def get_samples(alpha, beta, num_samples):
with pymc3.Model() as model:
b = pymc3.Beta('b', alpha=alpha, beta=beta)
tr = pymc3.sample(num_samples)
return tr.get_values('b')

In [44]:
def plot_sample_of_beta_dist(alpha, beta, num_samples):
samples = get_samples(alpha, beta, num_samples)
plt.figure(figsize=(10,5))
ax = plt.subplot(121)
ax.hist(samples, bins=int(np.sqrt(num_samples)), normed=True )
ax.set_xlim(0.,1.)
ax.set_ylim(0.,7.)
plt.title("Sample-Histogram Beta(%0.1f,%0.1f)"%(alpha,beta))

ax = plt.subplot(122)
plt.title("Exact distribution Beta(%0.1f,%0.1f)"%(alpha,beta))
r=np.arange(0.,1.01,.01)
ax.set_xlim(0.,1.)
ax.set_ylim(0.,7.)
ax.plot(r, scipy.stats.beta.pdf(r, alpha, beta), 'r-', lw=5, alpha=0.6, label='beta pdf')
plt.show()

In [45]:
plot_sample_of_beta_dist(alpha=1., beta=7., num_samples=10000)

Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Average ELBO = -0.062832: 100%|██████████| 200000/200000 [00:14<00:00, 13491.39it/s]
Finished [100%]: Average ELBO = -0.070504
INFO:pymc3:Finished [100%]: Average ELBO = -0.070504
100%|██████████| 10000/10000 [00:04<00:00, 2234.80it/s]
/Users/christian/.virtualenvs/py3/lib/python3.5/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "


## PyMc Example: Binomial trials¶

Let's assume we have some data from iid binomial trials and we want to estimate the probability of observing a 1.

In [46]:
#simulate some observations,
theta_unk = 0.3 # unknown theta, that's the parameter we want to estimate
nb_data = n = 40
data = scipy.stats.bernoulli.rvs(theta_unk, size=nb_data)
print ("Data:", data)
k = data.sum()
print ("Suffient statistics:\n\t number of trails:",n, "\n\t number of positive outcomes:", k)
theta_ml = float(k)/float(n)
print ("Maximum likelihood estimate of theta:", theta_ml)

Data: [0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
0 0 1]
Suffient statistics:
number of trails: 40
number of positive outcomes: 7
Maximum likelihood estimate of theta: 0.175


### Model in PyMc¶

#### Prior¶

We choose the Beta distribution as prior with $\alpha=1$ und $\beta=1$:

theta = pymc.Beta('theta', alpha=1, beta=1)

If we would put more prior (knowledge) in our model we could use a Beta-Distribution with different parameters $\alpha, \beta$, e.g.:

theta = pymc.Beta('theta', alpha=4., beta=8.)

#### Observed Variable¶

Our observed variable is a Bernoulli random variable. For a pymc Bernoulli variable we can use another pymc variable as value for p: our theta-Variable:

pymc3.Bernoulli('bernoulli',p=theta, observed=data)

In [33]:
# with bernoulli distribution
def create_model_pymc3(data):
with pymc3.Model() as model:
theta = pymc3.Beta('theta', alpha=1, beta=1)
bernoulli = pymc3.Bernoulli('bernoulli',p=theta, observed=data)
return model


Alternatively we can use the sufficient statistics of the data $\mathcal D$, i.e. $(k, n)$.

Our observed variable is then a Binomial random variable:

pymc3.Binomial('binominal', len(data), p, observed=np.sum(data))

In [47]:
# with binominal distribution
def create_model_pymc3_(data):
with pymc3.Model() as model:
theta = pymc3.Beta('theta', alpha=1, beta=1)
# sufficient statistics
n = len(data)
k = np.sum(data)
binominal = pymc3.Binomial('binominal', n, p, observed=k)
return model

In [48]:
model = create_model_pymc3(data)

/Users/christian/.virtualenvs/py3/lib/python3.5/site-packages/theano/tensor/basic.py:2146: UserWarning: theano.tensor.round() changed its default from half_away_from_zero to half_to_even to have the same default as NumPy. Use the Theano flag warn.round=False to disable this warning.
"theano.tensor.round() changed its default from"

In [49]:
map_estimate = pymc3.find_MAP(model=model)

map_estimate

Optimization terminated successfully.
Current function value: 20.450334
Iterations: 5
Function evaluations: 6

Out[49]:
{'theta_logodds_': array(-1.44691753)}

### Sampling from the model¶

In [37]:
with model:
# draw 50000 posterior samples
p_trace = pymc3.sample(50000)

Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Average ELBO = -26.169: 100%|██████████| 200000/200000 [00:22<00:00, 8881.44it/s]
Finished [100%]: Average ELBO = -26.162
INFO:pymc3:Finished [100%]: Average ELBO = -26.162
100%|██████████| 50000/50000 [00:25<00:00, 1946.95it/s]

In [38]:
theta_trace = p_trace['theta'][10000:]


#### Plotting the posterior of $\theta$¶

In [39]:
plt.figure(figsize=(14,5))
ax = plt.subplot(121)
hist, bins = np.histogram(theta_trace, bins=np.arange(0.,1.01,0.01), normed=True)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
ax.bar(center, hist, align='center', width=width)
plt.xlim([0., 1.0])
plt.ylim([0., 6.0])
ax = plt.subplot(122)
ax.plot(bins, scipy.stats.beta.pdf(bins, k+1, n-k+1), 'r-', lw=5, alpha=0.6, label='beta pdf')
plt.xlim([0., 1.0])
plt.ylim([0., 6.0])
plt.legend()
plt.show()

In [40]:
sample_size=30

def get_traces_pymc3(sample_size, theta_unk=.3):
observed_data=scipy.stats.bernoulli.rvs(theta_unk, size=sample_size)
model_pymc3 = create_model_pymc3(observed_data)
with model_pymc3:
# obtain starting values via MAP
start = pymc3.find_MAP()
# draw 2000 posterior samples
trace = pymc3.sample(20000, start=start)
return trace.get_values('theta'), observed_data


Comparing pymc's approximation with the exact solution:

In [41]:
s = 0.8
figsize(11*s, 9*s)

n_samples = [20,50,100,500,5000]

def get_theta_histograms():
histograms = dict()
data = dict()
bins = np.arange(0.,1.01,0.01)
for i, N in enumerate(n_samples):
#prior_trace, observed_data = get_traces(N, theta_unk=theta_unk)
prior_trace, observed_data = get_traces_pymc3(N, theta_unk=theta_unk)
hist, bins = np.histogram(prior_trace, bins=bins)
histograms[N] = hist
data[N] = observed_data
return histograms, bins, data

histograms, bins, data = get_theta_histograms()


/Users/christian/.virtualenvs/py3/lib/python3.5/site-packages/theano/tensor/basic.py:2146: UserWarning: theano.tensor.round() changed its default from half_away_from_zero to half_to_even to have the same default as NumPy. Use the Theano flag warn.round=False to disable this warning.
"theano.tensor.round() changed its default from"

Optimization terminated successfully.
Current function value: 12.890958
Iterations: 4
Function evaluations: 5

Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Average ELBO = -12.705: 100%|██████████| 200000/200000 [00:23<00:00, 8613.06it/s]
Finished [100%]: Average ELBO = -12.701
INFO:pymc3:Finished [100%]: Average ELBO = -12.701
100%|██████████| 20000/20000 [00:09<00:00, 2005.96it/s]

Optimization terminated successfully.
Current function value: 26.831371
Iterations: 5
Function evaluations: 6

Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Average ELBO = -27.002: 100%|██████████| 200000/200000 [00:21<00:00, 9114.08it/s]
Finished [100%]: Average ELBO = -26.993
INFO:pymc3:Finished [100%]: Average ELBO = -26.993
100%|██████████| 20000/20000 [00:11<00:00, 1772.43it/s]

Optimization terminated successfully.
Current function value: 65.595639
Iterations: 5
Function evaluations: 6

Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Average ELBO = -66.248: 100%|██████████| 200000/200000 [00:23<00:00, 8569.99it/s]
Finished [100%]: Average ELBO = -66.248
INFO:pymc3:Finished [100%]: Average ELBO = -66.248
100%|██████████| 20000/20000 [00:11<00:00, 1781.82it/s]

Optimization terminated successfully.
Current function value: 310.290668
Iterations: 5
Function evaluations: 6

Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Average ELBO = -311.71: 100%|██████████| 200000/200000 [00:24<00:00, 8329.36it/s]
Finished [100%]: Average ELBO = -311.72
INFO:pymc3:Finished [100%]: Average ELBO = -311.72
100%|██████████| 20000/20000 [00:15<00:00, 1265.23it/s]

Optimization terminated successfully.
Current function value: 3037.018908
Iterations: 5
Function evaluations: 6

Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Average ELBO = -3,039.6: 100%|██████████| 200000/200000 [00:40<00:00, 4899.37it/s]
Finished [100%]: Average ELBO = -3,039.6
INFO:pymc3:Finished [100%]: Average ELBO = -3,039.6
100%|██████████| 20000/20000 [00:23<00:00, 856.38it/s]

In [42]:
def plot_theta_histograms(histograms, bins, data):
x = np.linspace(0, 1, 100)
theta_unk = 0.3
plt.figure(figsize=(10,16))
for i, N in enumerate(n_samples):
ax = plt.subplot(len(n_samples), 2, (i*2)+1)
plt.setp(ax.get_yticklabels(), visible=False)
hist = histograms[N]
#np.histogram(prior_trace, bins=np.arange(0.,1.01,0.01))
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
ax.bar(center, hist, align='center', width=width)
plt.xlim([0., 1.0])
ax.set_xlabel("$theta$")
ax.set_ylabel("$p(theta)$")

ax = plt.subplot(len(n_samples), 2, (i*2) + 2)
plt.setp(ax.get_yticklabels(), visible=False)
observed_data = data[N]
k = observed_data.sum(); n=len(observed_data)
ax.plot(bins, scipy.stats.beta.pdf(bins, k+1, n-k+1), 'r-', lw=2, alpha=0.6, label='beta pdf')
ax.set_xlabel("$theta$")
ax.set_ylabel("$p(theta)$")
plt.suptitle("MCMC sample histograms and exact solution for different $n$")
#plt.legend()
plt.show()


In [43]:
plot_theta_histograms(histograms, bins, data)