Introduction to Bayesian Inference with PyMC3

In a fully Bayesian approach the parameters of a model are treated as random variables. In this way a Bayesian explicitly accounts for the uncertainty about the values of the parameters.

In [1]:
%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt

Example: Tossing Thumbtacks

In [2]:
from IPython.display import Image
Image(filename='./thumbtack.jpg', width=200, height=200)
Out[2]:
(image: a thumbtack)
Bernoulli trials

To simulate the tossing behaviour of the thumbtack we need the Bernoulli distribution.
$\theta$ is the probability of success, i.e. $X=1$; here the thumbtack lands head up with $\theta=0.3$:

$$ P(X=1) = \theta \\ P(X=0) = 1 - \theta $$

$n$ iid (independent and identically distributed) Bernoulli trials can be simulated by:

In [3]:
import scipy.stats
scipy.stats.bernoulli.rvs(0.3, size=10)
Out[3]:
array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1])
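
For a large sample the relative frequency of ones approaches $\theta = 0.3$; it is also the maximum likelihood estimate of $\theta$, which we will meet again below. A quick check:

import scipy.stats
# relative frequency of ones in a large iid Bernoulli sample
sample = scipy.stats.bernoulli.rvs(0.3, size=100000)
print(sample.mean())   # close to 0.3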

Binomial Distribution

With $\theta$ the success probability of a single Bernoulli trial, the probability mass function (pmf) of $k$ successes in $n$ iid trials is given by

$$ P( X = k \mid \theta) = {n \choose k} \theta^k (1-\theta)^{n-k} $$
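
As a quick sanity check, the explicit formula agrees with scipy's built-in pmf (using scipy.special.comb for the binomial coefficient):

import scipy.stats
import scipy.special
theta, n, k = 0.3, 10, 3
pmf_formula = scipy.special.comb(n, k) * theta**k * (1 - theta)**(n - k)
print(pmf_formula, scipy.stats.binom.pmf(k, n, theta))   # both approx. 0.2668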

In [4]:
def plot_binom(theta, n):
    p=theta
    binom = scipy.stats.binom
    fig, ax = plt.subplots(1, 1)
    x = np.arange(binom.ppf(0.001, n, p),
                   binom.ppf(0.999, n, p))
    ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
    ax.set_xlabel("k")
    ax.set_ylabel("p(k)")
    ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
    
In [5]:
plot_binom(theta=0.3, n=10)

Beta Distribution

Probability density function (pdf) of the Beta distribution:

$$ \begin{align} Beta(\alpha, \beta) = p(\theta \mid \alpha,\beta) & = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\int_0^1 u^{\alpha-1} (1-u)^{\beta-1}\, du} \end{align} $$

with

  • $\theta \in [0,1]$
  • shape parameters: $\alpha > 0, \beta > 0$

Note: The denominator normalizes the pdf
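
The normalizing integral in the denominator is the Beta function $B(\alpha, \beta)$, available as scipy.special.beta; a quick numerical check of the normalization:

import numpy as np
import scipy.special
alpha, beta = 10., 40.
# closed form of the denominator: the Beta function B(alpha, beta)
print(scipy.special.beta(alpha, beta))
# numerical evaluation of the integral in the denominator
u = np.linspace(0., 1., 10001)
print(np.trapz(u**(alpha - 1) * (1 - u)**(beta - 1), u))   # same value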

Beta distribution in scipy

In [6]:
alpha=10.; beta=40.
plt.figure(figsize=(5,5))
plt.title("Beta(%0.1f,%0.1f)"%(alpha,beta))
r = np.arange(0.,1.01,.01)
plt.xlabel("$\\theta$")
plt.ylabel("beta")
plt.plot(r, scipy.stats.beta.pdf(r, alpha, beta), 'r-', lw=5, alpha=0.6, label='beta pdf')
plt.show()

Conjugate Prior

Bayes Rule for the parameters $\vec \theta$ given the observation $\mathcal D$ (data):

$$ p(\vec \theta | \mathcal D) = \frac{p({\mathcal D \mid \vec \theta}) p(\vec \theta)}{p(\mathcal D)} $$

In Bayesian probability theory, if the posterior distributions $p(\theta | \mathcal D)$ are in the same family as the prior probability distribution $p(\theta)$, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

Example: The Beta distribution is the conjugate prior if the likelihood function is the Binomial distribution.

Sufficient statistics

Definition:

  • a statistic is sufficient with respect to a statistical model and its associated unknown parameter if "no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter".

Sufficient statistics for the Binomial (example: "tossing the thumbtack"): $k, n$

  • a sufficient statistic of $\mathcal D$ is the number of successes $X=k$ together with the number of trials $n$.
  • so we can represent the data, e.g. $\mathcal D = (0,0,1,0,0,1,\dots)$, compactly (with two numbers) as $\mathcal D = (k, n)$, as in the snippet below.
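
In code the compression to the sufficient statistics is a one-liner (a small sketch with made-up data):

import numpy as np
D = np.array([0, 0, 1, 0, 0, 1])
# (k, n) carries all the information the Binomial likelihood needs about D
k, n = D.sum(), len(D)
print(k, n)   # (2, 6)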

Beta distribution as conjugate prior of the Binomial distribution

$$ p(\theta \mid \mathcal D) = \frac{p(\mathcal D \mid \theta) p(\theta)}{p(\mathcal D)} $$ Note: the denominator is independent of $\theta$ and normalizes the distribution.

With the sufficient statistics of $\mathcal D$, the number of successes $X=k$ and the number of trials $n$:

$$ p(\theta \mid \mathcal D) \propto \theta^k (1-\theta)^{n-k} \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1} $$
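
As a quick numerical check of this conjugacy, normalizing the product of likelihood and Beta prior on a grid reproduces the density of $Beta(\alpha + k, \beta + n - k)$ (a sketch with arbitrarily chosen values):

import numpy as np
import scipy.stats
alpha, beta, k, n = 2., 6., 20, 100
theta = np.linspace(0., 1., 1001)
# unnormalized posterior: Binomial likelihood times Beta prior
unnorm = theta**k * (1 - theta)**(n - k) * scipy.stats.beta.pdf(theta, alpha, beta)
posterior = unnorm / np.trapz(unnorm, theta)
# conjugate result: Beta(alpha + k, beta + n - k)
exact = scipy.stats.beta.pdf(theta, alpha + k, beta + n - k)
print(np.max(np.abs(posterior - exact)))   # numerically close to zero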

Beta distribution as conjugate prior of the Binomial distribution
  • prior for $\theta$: $Beta(\alpha, \beta)$
  • (new) observations: $k$-successes in $n$-trials
  • posterior: $Beta(\alpha + k, \beta + n - k)$
In [7]:
def plot_prior_posterior_beta(alpha, beta, k, n):
    r=np.arange(0.,1.01,.01)
    plt.figure(figsize=(10,5))
    ax = plt.subplot(121)
    plt.title("Prior Beta(%0.1f,%0.1f)"%(alpha,beta))
    ax.plot(r, scipy.stats.beta.pdf(r, alpha, beta), 'r-', lw=2, alpha=0.6, label='beta pdf')
    ax.set_xlabel("$\\theta$")
    ax.set_ylabel("prior probability density")
    ax = plt.subplot(122)
    a = alpha+k; b = beta+n-k
    plt.title("Posterior Beta(%0.1f,%0.1f)"%(a ,b))
    ax.plot(r, scipy.stats.beta.pdf(r, a, b), 'b-', lw=2, alpha=0.6, label='beta pdf')
    ax.set_xlabel("$\\theta$")
    ax.set_ylabel("posterior probability density")
    plt.show()
In [8]:
plot_prior_posterior_beta(alpha=1., beta=1., k=100, n=300)
Analogy: Thumbtack machine
  • think of a machine that produces thumbtacks
  • the machine can be parameterized with $\alpha, \beta$
  • even with fixed $\alpha, \beta$ values the machine in general produces thumbtacks with different $\theta$

The prior is our knowledge of the machine. The machine produces a thumbtack. Without any data (throwing the thumbtack) our knowledge about $\theta$ is given by our knowledge of the machine.

From the observed data (gathered by throwing the thumbtack) we can compute the likelihood of the data.

With Bayes rule we can combine our prior knowledge with the knowledge we gained from the data (the likelihood).

In [9]:
# without any observation of a produced thumbtack: k=0, n=0

# with machine parameters: alpha=2., beta=6.
# taking one produced thumbtack and throwing it 100 times
#   with 20 positive observations:
plot_prior_posterior_beta(alpha=2., beta=6., k=20, n=100)
In [10]:
s = 0.9
figsize(11*s, 9*s)

def plot_thumtack_flip(theta_unk = 0.3):
    # Beta distribution is the conjugate prior of the binomial distribution
    dist = scipy.stats.beta
    n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
    data = scipy.stats.bernoulli.rvs(theta_unk, size=n_trials[-1])
    x = np.linspace(0, 1, 100)

    # the posterior after N tosses is Beta(1 + heads, 1 + N - heads),
    # i.e. the conjugate Beta(1, 1) prior updated with the Binomial likelihood
    for k, N in enumerate(n_trials):
        sx = plt.subplot(len(n_trials) // 2, 2, k + 1)
        plt.setp(sx.get_yticklabels(), visible=False)
        heads = data[:N].sum()
        y = dist.pdf(x, 1 + heads, 1 + N - heads)
        plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
        plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
        plt.vlines(theta_unk, 0, 4, color="k", linestyles="--", lw=1)

        leg = plt.legend()
        leg.get_frame().set_alpha(0.4)
        plt.autoscale(tight=True)

    plt.suptitle("Bayesian updating of posterior probabilities of $\\theta$",
             y=1.02,
             fontsize=14)

    plt.tight_layout()
In [11]:
plot_thumtack_flip()

Random Variables in PyMC3

Here we choose $X$ to be Beta distributed (conjugate prior to the Binomial distribution):

$$ X \sim Beta(\alpha=1,\beta=6) $$

Such a variable is called stochastic in the terminology of PyMC3.

In [12]:
alpha=1; beta=6;

Drawing samples from a distribution

In [13]:
import pymc3

def get_samples(alpha, beta, num_samples):
    with pymc3.Model() as model:
        b = pymc3.Beta('b', alpha=alpha, beta=beta)
        tr = pymc3.sample(num_samples)
    return tr.get_values('b')
In [44]:
def plot_sample_of_beta_dist(alpha, beta, num_samples):
    samples = get_samples(alpha, beta, num_samples)
    plt.figure(figsize=(10,5))
    ax = plt.subplot(121)
    ax.hist(samples, bins=int(np.sqrt(num_samples)), density=True)
    ax.set_xlim(0.,1.)
    ax.set_ylim(0.,7.)
    plt.title("Sample-Histogram Beta(%0.1f,%0.1f)"%(alpha,beta))

    ax = plt.subplot(122)
    plt.title("Exact distribution Beta(%0.1f,%0.1f)"%(alpha,beta))
    r=np.arange(0.,1.01,.01)
    ax.set_xlim(0.,1.)
    ax.set_ylim(0.,7.)
    ax.plot(r, scipy.stats.beta.pdf(r, alpha, beta), 'r-', lw=5, alpha=0.6, label='beta pdf')
    plt.show()
In [45]:
plot_sample_of_beta_dist(alpha=1., beta=7., num_samples=10000)
Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using advi...
INFO:pymc3:Initializing NUTS using advi...
Average ELBO = -0.062832: 100%|██████████| 200000/200000 [00:14<00:00, 13491.39it/s]
Finished [100%]: Average ELBO = -0.070504
INFO:pymc3:Finished [100%]: Average ELBO = -0.070504
100%|██████████| 10000/10000 [00:04<00:00, 2234.80it/s]

PyMC3 Example: Binomial trials

Let's assume we have data from iid Bernoulli trials and we want to estimate the probability of observing a 1.

In [46]:
#simulate some observations, 
theta_unk = 0.3 # unknown theta, that's the parameter we want to estimate
nb_data = n = 40
data = scipy.stats.bernoulli.rvs(theta_unk, size=nb_data)
print ("Data:", data)
k = data.sum()
print ("Suffient statistics:\n\t number of trails:",n, "\n\t number of positive outcomes:", k)
theta_ml = float(k)/float(n)
print ("Maximum likelihood estimate of theta:", theta_ml)
Data: [0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
 0 0 1]
Sufficient statistics:
	 number of trials: 40 
	 number of positive outcomes: 7
Maximum likelihood estimate of theta: 0.175

Model in PyMC3

Prior

We choose the Beta distribution as prior with $\alpha=1$ and $\beta=1$:

theta = pymc3.Beta('theta', alpha=1, beta=1)

If we wanted to put more prior knowledge into our model, we could use a Beta distribution with different parameters $\alpha, \beta$, e.g.:

theta = pymc3.Beta('theta', alpha=4., beta=8.)
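
With the plot_prior_posterior_beta helper from above we can see how such a stronger prior pulls the posterior away from the pure data estimate (a small sketch, reusing the observed k and n from the data cell above):

# posterior with the stronger Beta(4, 8) prior and the observed k successes in n trials
plot_prior_posterior_beta(alpha=4., beta=8., k=k, n=n)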

Observed Variable

Our observed variable is a Bernoulli random variable. For a PyMC3 Bernoulli variable we can use another PyMC3 variable as the value of p, namely our theta variable:

pymc3.Bernoulli('bernoulli',p=theta, observed=data)

In [33]:
# with bernoulli distribution
def create_model_pymc3(data):
    with pymc3.Model() as model: 
        theta = pymc3.Beta('theta', alpha=1, beta=1)
        bernoulli = pymc3.Bernoulli('bernoulli',p=theta, observed=data)
    return model

Alternatively we can use the sufficient statistics of the data $\mathcal D$, i.e. $(k, n)$.

Our observed variable is then a Binomial random variable:

pymc3.Binomial('binomial', n=len(data), p=theta, observed=np.sum(data))

In [47]:
# with binomial distribution
def create_model_pymc3_(data):
    with pymc3.Model() as model: 
        theta = pymc3.Beta('theta', alpha=1, beta=1)
        # sufficient statistics
        n = len(data)
        k = np.sum(data)
        binomial = pymc3.Binomial('binomial', n=n, p=theta, observed=k)
    return model
In [48]:
model = create_model_pymc3(data)
/Users/christian/.virtualenvs/py3/lib/python3.5/site-packages/theano/tensor/basic.py:2146: UserWarning: theano.tensor.round() changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy. Use the Theano flag `warn.round=False` to disable this warning.
  "theano.tensor.round() changed its default from"
In [49]:
map_estimate = pymc3.find_MAP(model=model)

map_estimate
Optimization terminated successfully.
         Current function value: 20.450334
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 6
Out[49]:
{'theta_logodds_': array(-1.44691753)}
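
find_MAP reports the optimum on PyMC3's internal unconstrained scale, here the log-odds of theta (the key name matches the output above). Applying the inverse-logit function maps it back to the unit interval; note that it need not coincide exactly with the maximum likelihood estimate k/n, because the optimization runs on the transformed scale:

import scipy.special
# transform the MAP value from log-odds back to theta in [0, 1]
theta_map = scipy.special.expit(map_estimate['theta_logodds_'])
print(theta_map)   # roughly 0.19 here, close to the ML estimate k/n = 0.175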

Sampling from the model

In [37]:
with model:
    # draw 50000 posterior samples
    p_trace = pymc3.sample(50000)
Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using advi...
INFO:pymc3:Initializing NUTS using advi...
Average ELBO = -26.169: 100%|██████████| 200000/200000 [00:22<00:00, 8881.44it/s] 
Finished [100%]: Average ELBO = -26.162
INFO:pymc3:Finished [100%]: Average ELBO = -26.162
100%|██████████| 50000/50000 [00:25<00:00, 1946.95it/s]
In [38]:
theta_trace = p_trace['theta'][10000:]

Plotting the posterior of $\theta$

In [39]:
plt.figure(figsize=(14,5))
ax = plt.subplot(121)
hist, bins = np.histogram(theta_trace, bins=np.arange(0.,1.01,0.01), density=True)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
ax.bar(center, hist, align='center', width=width)
plt.xlim([0., 1.0])
plt.ylim([0., 6.0])
ax = plt.subplot(122)
ax.plot(bins, scipy.stats.beta.pdf(bins, k+1, n-k+1), 'r-', lw=5, alpha=0.6, label='beta pdf')
plt.xlim([0., 1.0])
plt.ylim([0., 6.0])
plt.legend()
plt.show()
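
The trace can also be summarized numerically and compared with the exact posterior $Beta(k+1, n-k+1)$; a small sketch, reusing theta_trace, k and n from the cells above:

import numpy as np
import scipy.stats
# posterior mean and a 95% credible interval from the MCMC samples
print("posterior mean (MCMC): ", theta_trace.mean())
print("95% interval (MCMC):   ", np.percentile(theta_trace, [2.5, 97.5]))
# the same quantities from the exact Beta(k+1, n-k+1) posterior
exact_posterior = scipy.stats.beta(k + 1, n - k + 1)
print("posterior mean (exact):", exact_posterior.mean())
print("95% interval (exact):  ", exact_posterior.ppf([0.025, 0.975]))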
In [40]:
sample_size=30

def get_traces_pymc3(sample_size, theta_unk=.3):
    observed_data=scipy.stats.bernoulli.rvs(theta_unk, size=sample_size)
    model_pymc3 = create_model_pymc3(observed_data)
    with model_pymc3:
        # obtain starting values via MAP
        start = pymc3.find_MAP()
        # draw 20000 posterior samples
        trace = pymc3.sample(20000, start=start) 
    return trace.get_values('theta'), observed_data

Comparing PyMC3's approximation with the exact solution:

In [41]:
s = 0.8
figsize(11*s, 9*s)

n_samples = [20,50,100,500,5000]

def get_theta_histograms():    
    histograms = dict()
    data = dict()
    bins = np.arange(0.,1.01,0.01)
    for i, N in enumerate(n_samples):
        #prior_trace, observed_data = get_traces(N, theta_unk=theta_unk)
        prior_trace, observed_data = get_traces_pymc3(N, theta_unk=theta_unk)
        hist, bins = np.histogram(prior_trace, bins=bins)
        histograms[N] = hist
        data[N] = observed_data
    return histograms, bins, data

histograms, bins, data = get_theta_histograms()

   
/Users/christian/.virtualenvs/py3/lib/python3.5/site-packages/theano/tensor/basic.py:2146: UserWarning: theano.tensor.round() changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy. Use the Theano flag `warn.round=False` to disable this warning.
  "theano.tensor.round() changed its default from"
Optimization terminated successfully.
         Current function value: 12.890958
         Iterations: 4
         Function evaluations: 5
         Gradient evaluations: 5
Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using advi...
INFO:pymc3:Initializing NUTS using advi...
Average ELBO = -12.705: 100%|██████████| 200000/200000 [00:23<00:00, 8613.06it/s] 
Finished [100%]: Average ELBO = -12.701
INFO:pymc3:Finished [100%]: Average ELBO = -12.701
100%|██████████| 20000/20000 [00:09<00:00, 2005.96it/s]
Optimization terminated successfully.
         Current function value: 26.831371
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 6
Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using advi...
INFO:pymc3:Initializing NUTS using advi...
Average ELBO = -27.002: 100%|██████████| 200000/200000 [00:21<00:00, 9114.08it/s] 
Finished [100%]: Average ELBO = -26.993
INFO:pymc3:Finished [100%]: Average ELBO = -26.993
100%|██████████| 20000/20000 [00:11<00:00, 1772.43it/s]
Optimization terminated successfully.
         Current function value: 65.595639
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 6
Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using advi...
INFO:pymc3:Initializing NUTS using advi...
Average ELBO = -66.248: 100%|██████████| 200000/200000 [00:23<00:00, 8569.99it/s] 
Finished [100%]: Average ELBO = -66.248
INFO:pymc3:Finished [100%]: Average ELBO = -66.248
100%|██████████| 20000/20000 [00:11<00:00, 1781.82it/s]
Optimization terminated successfully.
         Current function value: 310.290668
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 6
Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using advi...
INFO:pymc3:Initializing NUTS using advi...
Average ELBO = -311.71: 100%|██████████| 200000/200000 [00:24<00:00, 8329.36it/s] 
Finished [100%]: Average ELBO = -311.72
INFO:pymc3:Finished [100%]: Average ELBO = -311.72
100%|██████████| 20000/20000 [00:15<00:00, 1265.23it/s]
Optimization terminated successfully.
         Current function value: 3037.018908
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 6
Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using advi...
INFO:pymc3:Initializing NUTS using advi...
Average ELBO = -3,039.6: 100%|██████████| 200000/200000 [00:40<00:00, 4899.37it/s]
Finished [100%]: Average ELBO = -3,039.6
INFO:pymc3:Finished [100%]: Average ELBO = -3,039.6
100%|██████████| 20000/20000 [00:23<00:00, 856.38it/s] 
In [42]:
def plot_theta_histograms(histograms, bins, data):
    x = np.linspace(0, 1, 100)
    theta_unk = 0.3
    plt.figure(figsize=(10,16))
    for i, N in enumerate(n_samples):
        ax = plt.subplot(len(n_samples), 2, (i*2)+1)
        plt.setp(ax.get_yticklabels(), visible=False)
        hist = histograms[N]
        #np.histogram(prior_trace, bins=np.arange(0.,1.01,0.01))
        width = 0.7 * (bins[1] - bins[0])
        center = (bins[:-1] + bins[1:]) / 2
        ax.bar(center, hist, align='center', width=width)
        plt.xlim([0., 1.0])
        ax.set_xlabel("$\\theta$")
        ax.set_ylabel("$p(\\theta)$")
        
        ax = plt.subplot(len(n_samples), 2, (i*2) + 2)
        plt.setp(ax.get_yticklabels(), visible=False)
        observed_data = data[N]
        k = observed_data.sum(); n=len(observed_data)
        ax.plot(bins, scipy.stats.beta.pdf(bins, k+1, n-k+1), 'r-', lw=2, alpha=0.6, label='beta pdf')
        ax.set_xlabel("$\\theta$")
        ax.set_ylabel("$p(\\theta)$")
    plt.suptitle("MCMC sample histograms and exact solution for different $n$")
    #plt.legend()
    plt.show()
 
In [43]:
plot_theta_histograms(histograms, bins, data)