topic-models slides

## Topic Models

### Simple Model: One Topic per Document

Naive Bayes assumption: each document is drawn from a single topic (NOT a mixture of topics).

• $N$: number of documents in the collection.
• $M_n$: number of words in the $n$-th document.
• $V$: number of distinct words (vocabulary size) in the whole collection.
• $K$: number of predefined topics (the number of topics is not inferred automatically).

Generative process:

1. Choose a word distribution $\phi_k$ for each topic $k$.
2. Choose a topic $z_n$ for each document $n$ from $\theta$, with $z_n \in \{ 1,\dots, K \}$.
3. For each word position $(n, m)$, where $n \in \{ 1,\dots, N \}$ and $m \in \{ 1,\dots,M_n \}$,
choose a word $w_{n,m} \,\sim\, \mathrm{Categorical}( \phi_{z_{n}})$.

#### Topic word distributions

• $\phi_k$: word distribution for topic $k$
$$\phi_k \sim \mathrm{Dirichlet}(\beta)$$

with $1\leq k \leq K$; each $\phi_k$ is a vector with $V$ elements.

Global (collection) topic distribution:

$$\theta \sim \mathrm{Dirichlet}(\alpha)$$

$\theta_k$ is the probability that a document has topic $k$; $\theta$ is a vector with $K$ elements.

$\alpha$ and $\beta$ are hyperparameters and fixed in our model.

$$z_n \sim \mathrm{Categorical}(\theta)$$

with $1 \leq n \leq N$
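
To make the generative story concrete, here is a minimal NumPy sketch that samples a tiny synthetic corpus from this single-topic-per-document model (the function name `sample_corpus` and all parameter values are illustrative, not part of the notebook):

In [ ]:
import numpy as np

def sample_corpus(K=3, V=6, N=5, doc_len=6, alpha=1.0, beta=1.0, seed=0):
    # illustrative only: draw phi_k, theta, one topic z_n per document, then the words
    rng = np.random.RandomState(seed)
    phi = rng.dirichlet(beta * np.ones(V), size=K)     # K word distributions over V terms
    theta = rng.dirichlet(alpha * np.ones(K))          # global topic distribution
    topics, corpus = [], []
    for n in range(N):
        z_n = rng.choice(K, p=theta)                   # a single topic for the whole document
        w_n = rng.choice(V, size=doc_len, p=phi[z_n])  # every word comes from phi_{z_n}
        topics.append(int(z_n))
        corpus.append(w_n.tolist())
    return phi, theta, topics, corpus

phi_sim, theta_sim, topics_sim, corpus_sim = sample_corpus()
print(topics_sim)     # one topic index per document
print(corpus_sim[0])  # word indices of the first synthetic document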

In [5]:
plot_naive_bayes_topic_model()

In [6]:
docs = ["football ball football ball ball ball football",
"money economy money money money economy economy",
"football ball football ball football football",
"economy economy money money",
"money economy computer economy",
"computer computer technology technology computer technology",
"technology computer technology",
"money economy economy money technology",
"computer technology computer technology"]

In [36]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
v = vectorizer.fit_transform(docs)

In [8]:
vectorizer.vocabulary_

Out[8]:
{u'ball': 0,
u'computer': 1,
u'economy': 2,
u'football': 3,
u'money': 4,
u'technology': 5}
In [9]:
v.toarray()

Out[9]:
array([[4, 0, 0, 3, 0, 0],
[0, 0, 3, 0, 4, 0],
[2, 0, 0, 4, 0, 0],
[0, 0, 2, 0, 2, 0],
[0, 1, 2, 0, 1, 0],
[0, 3, 0, 0, 0, 3],
[0, 1, 0, 0, 0, 2],
[0, 0, 2, 0, 2, 1],
[0, 2, 0, 0, 0, 2]])
In [10]:
def to_one_word_array(v):
    # convert the document-term count matrix into one list of word indices per document
    docs = list()
    for row in v.toarray():
        words = list()
        for word_index, count in enumerate(row):
            for j in range(count):
                words.append(word_index)
        docs.append(words)
    return docs

In [11]:
documents = to_one_word_array(v)

In [12]:
documents

Out[12]:
[[0, 0, 0, 0, 3, 3, 3],
[2, 2, 2, 4, 4, 4, 4],
[0, 0, 3, 3, 3, 3],
[2, 2, 4, 4],
[1, 2, 2, 4],
[1, 1, 1, 5, 5, 5],
[1, 5, 5],
[2, 2, 4, 4, 5],
[1, 1, 5, 5]]
In [13]:
# number of topics
K = 3

In [14]:
# number of words
V = len(vectorizer.vocabulary_)
V

Out[14]:
6
In [15]:
#number of documents
N = len(docs)
N

Out[15]:
9
In [16]:
# document lengths: number of tokens in each document of the collection
m_n = [m.sum() for m in v]
m_n

Out[16]:
[7, 7, 6, 4, 4, 6, 3, 5, 4]
In [17]:
# Hyperparameters: alpha and beta (both fixed, symmetric)
import numpy as np

# K: number of predefined topics
alpha = np.ones((K,))

# V: number of vocabulary terms
beta = np.ones((V,))

In [39]:
import pymc

# word distribution phi_k for each topic k
# (beta has dimension V, so each phi_k is a distribution over the V vocabulary terms)
phi_ = [pymc.Dirichlet("pphi_%i" % k, theta=beta) for k in range(K)]
phi  = [pymc.CompletedDirichlet("phi_%i" % k, phi_[k]) for k in range(K)]

# each document belongs to exactly one topic z_n
theta = pymc.Dirichlet("theta", theta=alpha)
z = [pymc.Categorical("z_%i" % n, p=theta) for n in range(N)]

# alternative version with a Multinomial likelihood per document:
#words_in_docs = [pymc.Multinomial("words_in_doc_%i" % (d,), value=v[d].toarray(), observed=True,
#                    n=n, p=pymc.Lambda("phi_%i" % (d,), lambda z=z[d], phi=phi: phi[z]))
#                 for d, n in enumerate(Nd)]

# observed words: each word w_{n,m} is drawn from the word distribution of its document's topic z_n
w = [pymc.Categorical("w_%i_%i" % (n, m),
                      p=pymc.Lambda("phi_z_%i_%i" % (n, m),
                                    lambda z=z[n], phi=phi: phi[z]),
                      value=documents[n][m],
                      observed=True)
     for n in range(N) for m in range(m_n[n])]

In [37]:
documents[3][2]

Out[37]:
4
In [19]:
mcmc = pymc.MCMC([phi_, phi, theta, z, w])
mcmc.sample(10000, burn=5000)

 [-----------------100%-----------------] 10000 of 10000 complete in 43.7 sec
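
Before interpreting the samples, it can help to eyeball the traces. pymc ships plotting utilities for this; a minimal sketch (it assumes matplotlib is installed, and plotting every variable of this model produces many figures):

In [ ]:
# optional sanity check: trace plots and histograms for the sampled variables
pymc.Matplot.plot(mcmc)
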
In [20]:
vectorizer.vocabulary_

Out[20]:
{u'ball': 0,
u'computer': 1,
u'economy': 2,
u'football': 3,
u'money': 4,
u'technology': 5}
In [23]:
inv_voc = {v: k for k, v in vectorizer.vocabulary_.items()}
inv_voc

Out[23]:
{0: u'ball',
1: u'computer',
2: u'economy',
3: u'football',
4: u'money',
5: u'technology'}
In [24]:
for k in range(K):
    print "topic %i" % k
    # posterior mean of phi_k over the last 100 samples of the trace
    for i, j in enumerate(mcmc.trace('phi_%i' % k)[-100:-1].mean(axis=0)[0]):
        print "\t", inv_voc[i], ":", j
    print

topic 0
ball : 0.0363931429669
computer : 0.256455662602
economy : 0.0402787781787
football : 0.0705131792063
money : 0.109059925401
technology : 0.487299311644

topic 1
ball : 0.0149182117232
computer : 0.0822085487173
economy : 0.385940319504
football : 0.0417620171837
money : 0.4026711362
technology : 0.0724997666716

topic 2
ball : 0.27412201123
computer : 0.0610294392576
economy : 0.133883284225
football : 0.421780433754
money : 0.0507701509037
technology : 0.0584146806298


In [26]:
# last sampled topic assignment z_n for each document
for n in range(N):
    print(mcmc.trace('z_%i' % n)[-1])

2
1
2
1
1
0
0
1
0
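
The listing above uses only the very last MCMC sample. A slightly more robust summary is the most frequent topic assignment over the tail of the trace; a minimal sketch reusing the `mcmc`, `N`, and `K` objects from above:

In [ ]:
# posterior mode of each document's topic over the last 100 samples
for n in range(N):
    counts = np.bincount(mcmc.trace('z_%i' % n)[-100:].astype(int), minlength=K)
    print((n, counts.argmax(), counts))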


### LDA - Latent Dirichlet Allocation

#### Smoothed LDA

The generative process is as follows. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for a corpus $D$ consisting of $N$ documents, each of length $M_n$:

1. Choose $M_n \sim \mathrm{Poisson}(\zeta)$, the number of terms in document $n$.
2. Choose a word distribution $\phi_k \, \sim \, \mathrm{Dir}(\beta)$ for each topic $k$.
3. Choose $\theta_n \, \sim \, \mathrm{Dir}(\alpha)$, where $n \in \{ 1,\dots, N \}$
1. $\theta_n$ is the topic distribution for document $n$
2. and $\mathrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$
4. For each word position $(n, m)$, where $n \in \{ 1,\dots, N \}$ and $m \in \{ 1,\dots,M_n \}$:
a. Choose a topic $z_{n,m} \,\sim\, \mathrm{Categorical}(\theta_n)$.
b. Choose a word $w_{n,m} \,\sim\, \mathrm{Categorical}( \phi_{z_{n,m}})$.

• $\alpha$: parameter of the Dirichlet prior on the per-document topic distributions ($K$-vector)
• $\beta$: parameter of the Dirichlet prior on the per-topic word distributions ($V$-vector)
• $\theta_n$: topic distribution for document $n$
• $z_{n, m}$: topic for the $m$th word in document $n$
• $w_{n,m}$: the $m$th word in document $n$
• $V$: number of words in the vocabulary
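
In contrast to the naive Bayes model above, every word position now gets its own topic draw. A small NumPy sketch of this per-word generative process (purely illustrative; `sample_lda_corpus` and its parameters are not part of the notebook):

In [ ]:
import numpy as np

def sample_lda_corpus(K=3, V=6, N=5, zeta=6, alpha=1.0, beta=1.0, seed=0):
    # illustrative only: per-document theta_n, per-word topic z_{n,m}, then the word
    rng = np.random.RandomState(seed)
    phi = rng.dirichlet(beta * np.ones(V), size=K)           # per-topic word distributions
    corpus = []
    for n in range(N):
        m = max(1, rng.poisson(zeta))                        # document length M_n ~ Poisson(zeta)
        theta_n = rng.dirichlet(alpha * np.ones(K))          # topic mixture of document n
        z_nm = rng.choice(K, size=m, p=theta_n)              # one topic per word position
        w_nm = [int(rng.choice(V, p=phi[z])) for z in z_nm]  # word drawn from its topic
        corpus.append(w_nm)
    return corpus

print(sample_lda_corpus()[0])  # word indices of the first synthetic document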

Exercise: Write down the graph factorization as formula: $$p(\theta, \bf z, \bf w \mid \alpha, \beta) = ?$$

In [28]:
plot_smoothed_lda()

In [29]:
# Hyperparameters: alpha and beta (same as above)

# K: number of predefined topics
alpha = np.ones((K,))

# V: number of vocabulary terms
beta = np.ones((V,))

In [32]:
D = N       # number of documents
Nd = m_n    # number of tokens per document

# word distribution for each topic k (reuses the Dirichlet stochastics phi_ defined above)
phi = [pymc.CompletedDirichlet("phi_%i" % k, phi_[k]) for k in range(K)]

# each document has its own topic distribution theta_d
theta = [pymc.Dirichlet("theta_%i" % d, theta=alpha) for d in range(D)]

# one topic z_{d,i} per word position, drawn from the document's topic distribution
z = [pymc.Categorical("z_%i" % d, p=theta[d],
                      size=Nd[d],
                      value=np.random.randint(K, size=Nd[d]))
     for d in range(D)]

# each word is generated from phi of its word-level topic z_{d,i}
w = [pymc.Categorical("w_%i_%i" % (d, i),
                      p=pymc.Lambda("phi_z_%i_%i" % (d, i),
                                    lambda z=z[d][i], phi=phi: phi[z]),
                      value=documents[d][i],
                      observed=True)
     for d in range(D) for i in range(Nd[d])]


In [34]:
mcmc = pymc.MCMC([phi_, phi, theta, z, w])
mcmc.sample(10000, burn=8000)

 [-----------------100%-----------------] 10000 of 10000 complete in 219.4 sec
In [30]:
# how many words of each document were assigned to each topic,
# based on the last MCMC sample only (each row sums to the document length)
for d in range(D):
    #print mcmc.trace('z_%i' % d)[-1]
    print np.bincount(mcmc.trace('z_%i' % d)[-1], minlength=K)

[7 0 0]
[0 7 0]
[6 0 0]
[0 4 0]
[0 2 2]
[0 0 6]
[0 0 3]
[0 4 1]
[0 2 2]
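
The per-document topic mixtures $\theta_d$ can also be read off the trace directly. A minimal sketch reusing the `mcmc`, `D`, and `K` objects above; note that pymc's `Dirichlet` stores only the first $K-1$ components, so the last one is recovered as one minus their sum:

In [ ]:
# posterior mean of theta_d over the last 100 samples, completed to a full K-vector
for d in range(D):
    t = mcmc.trace('theta_%i' % d)[-100:].mean(axis=0)
    print(np.append(t, 1.0 - t.sum()))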

In [32]:
for k in range(K):
    print "topic %i" % k
    for i, j in enumerate(mcmc.trace('phi_%i' % k)[-100:-1].mean(axis=0)[0]):
        print "\t", inv_voc[i], ":", j
    print

topic 0
ball : 0.494302656478
computer : 0.00330601256377
economy : 0.0113991691287
football : 0.418033392202
money : 0.039234887214
technology : 0.0337238824139

topic 1
ball : 0.0144646150808
computer : 0.0642180078031
economy : 0.494408160655
football : 0.0146421743477
money : 0.352925984262
technology : 0.0593410578516

topic 2
ball : 0.00538960882665
computer : 0.411470519517
economy : 0.0136952757781
football : 0.048737756214
money : 0.0305771090712
technology : 0.490129730593
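
Finally, the fitted topics are easier to read when summarized by their most probable words; a minimal sketch based on the posterior means computed above:

In [ ]:
# top-3 words per topic from the posterior mean of phi_k
for k in range(K):
    p = mcmc.trace('phi_%i' % k)[-100:].mean(axis=0)[0]
    top = np.argsort(p)[::-1][:3]
    print("topic %i: %s" % (k, ", ".join(inv_voc[i] for i in top)))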