Multinomial-Dirichlet slides

Categorical data / Multinomial distribution¶

Random variable vector $\vec v$. Each element in the vector corresponds to a trial (categorical variable).

$v_j \in \{1,2, \dots , k\}$

$k$ possible outcomes (categories)

e.g. (with k=3): $\vec v = (3,1,2,2,3,3,2,1) $

$\theta_i$ probability to get the outcome $i$ in a trial e.g.

$ P(v_j = i \mid \theta) = \theta_i $

Constraint on $\theta$¶

$\theta = \{\theta_1, \theta_2, \dots, \theta_n\}$

$\theta_1 + \dots + \theta_k = 1$ and
$\theta_i \geq 0$.

Multinomial distribution¶

with:

$k$: possible outcomes
$\theta_i$ probability to get the outcome $i$ in a trial, with: $\theta_1 + \dots + \theta_k = 1$
$i \in \{1, \dotsc, k\}$

The probability mass fuction to get in $n$-iid trials the number of $x_i$-outcomes for category $i$ is given by the multinomial distribution:

$$ P(X_1=x_1, X_2=x_2, \dotsc, X_k=x_k)= \frac {n!}{x_1! \cdot x_2! \dotsm x_k!}\theta_1^{x_1} \cdot \theta_2^{x_2} \dotsm \theta_k^{x_k} $$

with: $x_1 + \dots + x_k = n$

Dirichlet distribution as conjugate prior¶

The Dirichlet distribution is the conjugate prior of the multinomial distribution, i.e. if the prior distribution of the multinomial parameters is Dirichlet then the posterior distribution is also a Dirichlet distribution (with parameters different from those of the prior)

the posterior distribution is easy to compute
it in some sense is possible to quantify how much our beliefs have changed after collecting the data.

Dirichlet pdf (probability density function):

$$ p(\theta_1,\dots, \theta_{K} \mid \alpha_1,\dots, \alpha_K) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^K \theta_i^{\alpha_i - 1} $$

with

$\theta_i \geq 0$ and $\sum_i \theta_i = 1$
the multinomial Beta function $\mathrm{B}(\alpha)$ as normalizing constant.
(concentration) parameter $\alpha_1, \dots, \alpha_k$

Dirchlet pdf in scipy¶

Visualization on the probability simplex in $\mathbb{R}^3$¶

Simplex: $\theta_i \geq 0$ and $\sum_i \theta_i = 1$
Example: $k=3$:

Dirichlet on simplex with $\alpha = (1, 1, 1)$

In [9]:

draw_pdf_contours(Dirichlet([1., 1., 1.]))

Dirichlet on simplex with $\alpha = (.9999, 0.9999, 0.9999)$

In [10]:

draw_pdf_contours(Dirichlet([.9999, 0.9999, 0.9999]))

In [11]:

draw_pdf_contours(Dirichlet([10., 20., 3.]))

In [12]:

draw_pdf_contours(Dirichlet([1., 2., 5.]))

$$ \mathbb{E}(\theta_i) = \frac{\alpha_i}{A} $$ with $$ A = \sum_i \alpha_i $$

Posterior given data¶

In [15]:

alpha = np.array([1., 1., 1.]) # prior parameter of Dirichlet
x = np.array([10, 3, 4]) # observations

alpha_posterior = plot_posterior_dirichlet(alpha, x)
print ("posterior parameters: ", alpha_posterior)

posterior parameters:  [11.  4.  5.]

Mode of the Dirichlet¶

The mode is the value that appears most often in a set of data.

The mode of the Dirichlet is given by:

$$ \theta_i =\frac{\alpha_i - 1}{A - k} $$

with

$k$: number of different classes
$A = \sum_i \alpha_i$

PyMc3¶

In [19]:

theta_unk = [0.2, 0.5, 0.3]
#data = np.random.multinomial(20, theta_unk, size=1)
data = np.random.choice(3, 200, p=theta_unk)
data

Out[19]:

array([2, 2, 2, 1, 2, 1, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1, 0, 1, 1, 1, 0, 0,
       2, 0, 1, 1, 0, 2, 1, 0, 1, 1, 1, 2, 0, 2, 2, 0, 1, 2, 2, 1, 1, 2,
       0, 2, 2, 1, 0, 0, 2, 1, 2, 0, 1, 0, 0, 0, 2, 2, 0, 0, 1, 2, 1, 0,
       2, 2, 1, 1, 2, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0,
       2, 0, 1, 0, 1, 2, 1, 2, 1, 1, 0, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 1,
       1, 2, 2, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 0, 1, 0, 2, 1, 1,
       2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 2, 2, 2, 0, 0, 1, 1, 2, 0, 2, 2, 0,
       0, 2, 2, 1, 2, 2, 1, 0, 0, 1, 1, 0, 2, 2, 1, 1, 0, 2, 2, 2, 2, 1,
       2, 0])

In [20]:

import pymc3
def create_model(data):
    with pymc3.Model() as model:    
        k = data.max()+1
        a = np.ones(k)
        p = pymc3.Dirichlet('p', a, shape=a.shape)
        c = pymc3.Categorical('c', p, observed=data, shape=1)
        
    return model

In [21]:

model = create_model(data)
with model:
    step = pymc3.Metropolis(model.vars)
    trace = pymc3.sample(20000, step)
    
a = pymc3.traceplot(trace)

Multiprocess sampling (2 chains in 2 jobs)
Metropolis: [p]
Sampling 2 chains: 100%|██████████| 41000/41000 [00:12<00:00, 3392.72draws/s]
The number of effective samples is smaller than 10% for some parameters.