Multinomial-Dirichlet slides

Categorical data / Multinomial distribution

Random vector $\vec v$: each element of the vector corresponds to one trial (a categorical variable).

$v_j \in \{1,2, \dots , k\}$

  • $k$ possible outcomes (categories)

e.g. (with k=3): $\vec v = (3,1,2,2,3,3,2,1) $

$\theta_i$: probability of getting outcome $i$ in a trial, i.e.

$ P(v_j = i \mid \theta) = \theta_i $

Constraint on $\theta$

$\theta = \{\theta_1, \theta_2, \dots, \theta_k\}$

  • $\theta_1 + \dots + \theta_k = 1$ and
  • $\theta_i \geq 0$.

Multinomial distribution

with:

  • $k$: possible outcomes
  • $\theta_i$: probability of getting outcome $i$ in a trial, with $\theta_1 + \dots + \theta_k = 1$
  • $i \in \{1, \dotsc, k\}$

The probability mass function for obtaining, in $n$ i.i.d. trials, exactly $x_i$ outcomes of category $i$ is given by the multinomial distribution:

$$ P(X_1=x_1, X_2=x_2, \dotsc, X_k=x_k)= \frac {n!}{x_1! \cdot x_2! \dotsm x_k!}\theta_1^{x_1} \cdot \theta_2^{x_2} \dotsm \theta_k^{x_k} $$

with: $x_1 + \dots + x_k = n$
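
The pmf can be evaluated with scipy.stats.multinomial; a minimal sketch (the counts and probabilities are made-up illustration values):

import numpy as np
from scipy.stats import multinomial

theta = [0.2, 0.5, 0.3]   # category probabilities, sum to 1
x = [2, 5, 3]             # observed counts, sum to n = 10

# P(X1=2, X2=5, X3=3) in n=10 iid trials
print(multinomial.pmf(x, n=10, p=theta))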

Dirichlet distribution as conjugate prior

The Dirichlet distribution is the conjugate prior of the multinomial distribution: if the prior over the multinomial parameters is a Dirichlet, then the posterior is also a Dirichlet (with parameters different from those of the prior). As a consequence:

  • the posterior distribution is easy to compute
  • in some sense, it is possible to quantify how much our beliefs have changed after collecting the data.

Dirichlet pdf (probability density function):

$$ p(\theta_1,\dots, \theta_{K} \mid \alpha_1,\dots, \alpha_K) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^K \theta_i^{\alpha_i - 1} $$

with

  • $\theta_i \geq 0$ and $\sum_i \theta_i = 1$
  • the multinomial Beta function $\mathrm{B}(\alpha)$ as normalizing constant (written out below)
  • (concentration) parameters $\alpha_1, \dots, \alpha_K$
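
Written out, the normalizing constant is expressed via the Gamma function:

$$ \mathrm{B}(\alpha) = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^K \alpha_i\right)} $$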

Dirichlet pdf in scipy
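
The pdf can be evaluated with scipy.stats.dirichlet; a minimal sketch (parameters and query point are illustration values):

import numpy as np
from scipy.stats import dirichlet

alpha = [1., 2., 5.]      # concentration parameters
theta = [0.2, 0.3, 0.5]   # a point on the simplex (nonnegative, sums to 1)

# density of Dirichlet(alpha) at theta
print(dirichlet.pdf(theta, alpha))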

Visualization on the probability simplex in $\mathbb{R}^3$

Simplex: $\theta_i \geq 0$ and $\sum_i \theta_i = 1$
Example: $k = 3$, using the plotting helpers draw_pdf_contours and Dirichlet defined elsewhere in the notebook:

Dirichlet on simplex with $\alpha = (1, 1, 1)$

In [9]:
draw_pdf_contours(Dirichlet([1., 1., 1.]))

Dirichlet on simplex with $\alpha = (0.9999, 0.9999, 0.9999)$

In [10]:
draw_pdf_contours(Dirichlet([.9999, 0.9999, 0.9999]))
In [11]:
draw_pdf_contours(Dirichlet([10., 20., 3.]))
In [12]:
draw_pdf_contours(Dirichlet([1., 2., 5.]))

Mean of the Dirichlet

$$ \mathbb{E}(\theta_i) = \frac{\alpha_i}{A} \quad \text{with} \quad A = \sum_i \alpha_i $$
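
In code the mean is just the normalized parameter vector; scipy.stats.dirichlet also exposes it directly (using the $\alpha = (10, 20, 3)$ example from above):

import numpy as np
from scipy.stats import dirichlet

alpha = np.array([10., 20., 3.])
print(alpha / alpha.sum())     # E(theta_i) = alpha_i / A
print(dirichlet.mean(alpha))   # same result via scipy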

Posterior given data

With a $\mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$ prior and observed counts $x_1, \dots, x_K$, the posterior is $\mathrm{Dirichlet}(\alpha_1 + x_1, \dots, \alpha_K + x_K)$: each count is simply added to the corresponding prior parameter.

In [15]:
alpha = np.array([1., 1., 1.]) # prior parameter of Dirichlet
x = np.array([10, 3, 4]) # observations

# plot_posterior_dirichlet: plotting helper defined elsewhere in the notebook;
# it plots the posterior and returns the updated parameters alpha + x
alpha_posterior = plot_posterior_dirichlet(alpha, x)
print("posterior parameters: ", alpha_posterior)
posterior parameters:  [11.  4.  5.]
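
Since the update is just elementwise addition of the counts, the posterior parameters can also be computed directly, without the plotting helper:

alpha_posterior = alpha + x   # Dirichlet(1, 1, 1) prior + counts (10, 3, 4)
print(alpha_posterior)        # [11.  4.  5.]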

Mode of the Dirichlet

The mode of a distribution is the point at which its probability density attains its maximum.

The mode of the Dirichlet is given by:

$$ \theta_i =\frac{\alpha_i - 1}{A - k} $$

with

  • $k$: number of different classes
  • $A = \sum_i \alpha_i$
  • valid when all $\alpha_i > 1$ (otherwise the density attains its maximum on the boundary of the simplex)
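
A small helper for the mode (the function name dirichlet_mode is ours, not a library call):

import numpy as np

def dirichlet_mode(alpha):
    # mode_i = (alpha_i - 1) / (A - k), assuming all alpha_i > 1
    alpha = np.asarray(alpha, dtype=float)
    return (alpha - 1.) / (alpha.sum() - len(alpha))

print(dirichlet_mode([11., 4., 5.]))   # mode of the posterior from above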

PyMC3

In [19]:
theta_unk = [0.2, 0.5, 0.3]                   # "unknown" category probabilities to recover
#data = np.random.multinomial(20, theta_unk, size=1)
data = np.random.choice(3, 200, p=theta_unk)  # 200 categorical draws with these probabilities
data
Out[19]:
array([2, 2, 2, 1, 2, 1, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1, 0, 1, 1, 1, 0, 0,
       2, 0, 1, 1, 0, 2, 1, 0, 1, 1, 1, 2, 0, 2, 2, 0, 1, 2, 2, 1, 1, 2,
       0, 2, 2, 1, 0, 0, 2, 1, 2, 0, 1, 0, 0, 0, 2, 2, 0, 0, 1, 2, 1, 0,
       2, 2, 1, 1, 2, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0,
       2, 0, 1, 0, 1, 2, 1, 2, 1, 1, 0, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 1,
       1, 2, 2, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 0, 1, 0, 2, 1, 1,
       2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 2, 2, 2, 0, 0, 1, 1, 2, 0, 2, 2, 0,
       0, 2, 2, 1, 2, 2, 1, 0, 0, 1, 1, 0, 2, 2, 1, 1, 0, 2, 2, 2, 2, 1,
       2, 0])
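
For comparison, the conjugate result gives the exact posterior from the observed counts (plain NumPy; the counts depend on the random draw above):

counts = np.bincount(data, minlength=3)         # observed counts per category
alpha_posterior = np.ones(3) + counts           # flat Dirichlet(1, 1, 1) prior + counts
print(alpha_posterior / alpha_posterior.sum())  # posterior mean, compare to theta_unk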
In [20]:
import pymc3

def create_model(data):
    with pymc3.Model() as model:
        k = data.max() + 1                            # number of categories in the data
        a = np.ones(k)                                # flat Dirichlet(1, ..., 1) prior
        p = pymc3.Dirichlet('p', a, shape=a.shape)    # unknown category probabilities
        c = pymc3.Categorical('c', p, observed=data)  # likelihood; shape inferred from data

    return model
In [21]:
model = create_model(data)
with model:
    step = pymc3.Metropolis(model.vars)  # Metropolis steps for all free variables
    trace = pymc3.sample(20000, step)

a = pymc3.traceplot(trace)
Multiprocess sampling (2 chains in 2 jobs)
Metropolis: [p]
Sampling 2 chains: 100%|██████████| 41000/41000 [00:12<00:00, 3392.72draws/s]
The number of effective samples is smaller than 10% for some parameters.
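
To check how well the unknown $\theta$ is recovered, the trace can be summarized (exact numbers depend on the random draws; the posterior means should be close to theta_unk = [0.2, 0.5, 0.3]):

pymc3.summary(trace)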