Latent Dirichlet Allocation (Blei et al.)
Kris Sankaran
2016-11-14
Introduction
Agenda
- Generative Mechanism (15 minutes): What is the proposed model, and how does it differ from what existed before?
- Interpretations (10 minutes): What are alternative ways to understand the model?
- Model Inference (15 minutes): How would we fit this model in practice?
- Examples and Conclusion (10 minutes): Why might we fit LDA in practice, and what are its limitations?
Context and Motivation
- Motivated by topic modeling:
  - Building interpretable representations of text data
  - Designing preprocessing steps for classification or information retrieval
- This said, LDA is not necessarily tied to text analysis
- Generative modeling: design unified probabilistic models
  - Is explicit about assumptions, feels less ad hoc
  - Gives access to the (large) Bayesian inference literature
  - Can be used as a module in larger probabilistic models
Generative Model
Latent Dirichlet Allocation
- For the nth word in document d,

      w_dn | z_dn = k, β ∼ Cat(β_·k)
      z_dn | θ_d ∼ Cat(θ_d)
      θ_d | α ∼ Dir(α)

- Mnemonics:
  - w_dn ∈ {1, …, V} is the term used as the nth word in document d
  - z_dn ∈ {1, …, K} is the topic associated with the nth word in document d
  - θ_d ∈ S^{K−1} are the topic mixture proportions for document d
  - β_·k ∈ S^{V−1} are the term mixture proportions for topic k
  - α is the topic shrinkage parameter
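This three-line generative process can be simulated directly. The following is a minimal numpy sketch; the sizes V, K, D, N and the symmetric α are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): vocabulary, topics, docs, words/doc
V, K, D, N = 50, 3, 10, 100
alpha = np.full(K, 0.5)                    # symmetric topic shrinkage parameter
beta = rng.dirichlet(np.ones(V), size=K)   # beta[k] = term proportions of topic k

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)              # theta_d | alpha ~ Dir(alpha)
    z_d = rng.choice(K, size=N, p=theta_d)      # z_dn | theta_d ~ Cat(theta_d)
    # w_dn | z_dn = k ~ Cat(beta_{.k})
    w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])
    docs.append(w_d)
```

Each `docs[d]` is a bag of word indices; in a real corpus the document lengths N_d would of course vary.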
Latent Dirichlet Allocation
[Graphical model: plate diagram α → θ → z → w ← β, with z and w inside the per-word plate N, and θ, z, w inside the per-document plate D]

- w are observed data
- α, β are fixed, global parameters
- θ, z are random, local parameters
Observed Counts (sum of w_dn's)

[Figure: heatmap of observed word counts by document (axes: word × doc; count scale 0–30)]
Mixing Proportions (θ_d's)

[Figure: heatmap of fitted mixture proportions (axes: topic × doc; theta scale 0.25–0.75)]
Topic Counts (sum of z_dn's)

[Figure: per-topic heatmaps of z counts by word and document (count scale 10–30)]
Latent Dirichlet Allocation (β)

[Figure: heatmap of fitted topic-term proportions (axes: topic × word; beta scale 0.01–0.03)]
Unigram Model
- It can be illustrative to compare with earlier topic modeling approaches
- The unigram model draws all words from the same multinomial,

      w_dn ∼ Cat(β)

[Graphical model: β → w, with w inside plates N and D]
Mixture of Unigrams

- This is the multinomial analog of Gaussian mixture models
- Each word is drawn from one of K topics,

      z_d ∼ p(z)
      w_dn | z_d ∼ Cat(β_·z_d)

- The topic assignment is drawn at the document level

[Graphical model: z → w ← β, with w inside plate N and z, w inside plate D]
Probabilistic Latent Semantic Indexing (pLSI)

- pLSI draws a different topic for each word in the document,

      z_dn | d ∼ p(z_dn | d)
      w_dn | z_dn ∼ Cat(β_·z_dn)

- The per-document topic mixture proportions are nonrandom and different for each document
- The number of fixed parameters grows linearly with the number of documents

[Graphical model: d → z → w ← β, with z, w inside plate N and d, z, w inside plate D]
Back to LDA
- Essential difference: randomness in the topic mixture proportions lets us share information across documents
- The number of fixed parameters does not grow with the number of documents

[Graphical model: α → θ → z → w ← β, with z, w inside plate N and θ, z, w inside plate D]
Interpretations
Geometric

- Each topic is a point on the simplex, and the K topics determine a topics simplex
- The mixture of unigrams model gives each document a corner of the topics simplex
- pLSI estimates the empirical distribution of observed mixing proportions
- LDA estimates a smooth density over the topics simplex
Matrix Factorization

- We can think of topics as latent factors and mixing proportions as document scores,

      p(w_dn = v | θ_d, β) = Σ_{k=1}^K β_vk p(z_dn = k)
                           = β_v·^T p(z_dn)

- The different models treat the p(z_dn)'s differently
- In LDA, this probability is β_v·^T θ_d

[Figure: the D × V matrix of p(w_dn = v) factors as the D × K matrix of θ_dk times the K × V matrix of β_kv]
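Under this view, the whole table of word probabilities is a single matrix product. A small numpy sketch with arbitrary random inputs (β is stored here as a K × V array, the transpose of the β_vk orientation in the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, V = 4, 3, 6                           # illustrative sizes
theta = rng.dirichlet(np.ones(K), size=D)   # D x K document scores (theta_dk)
beta = rng.dirichlet(np.ones(V), size=K)    # K x V topic factors (beta_kv)

# p(w_dn = v | theta_d, beta) = sum_k theta_dk beta_kv, for all d and v at once
p_w = theta @ beta                          # D x V

# each row of the product is still a probability distribution over terms
row_sums = p_w.sum(axis=1)
```

A mixture of distributions is again a distribution, which is why every row of `p_w` sums to one.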
Inference
Variational Bayes
- As scientists / modelers, our primary interest is in the posterior p(θ, z | w, α, β) after observing the words w
- This is not available in closed form (the normalizing constant is intractable)
- In practice, we also need to estimate α and β – more on this later
Variational Bayes
- (Blei et al.) propose a variational approach
  - Turns Bayesian inference into an optimization problem
- Specifically, consider the family Γ of q's that factor like

      q(θ, z | γ, φ) = Π_{d=1}^D [ Dir(θ_d | γ_d) Π_{n=1}^N Cat(z_dn | φ_dn) ],

  and try to identify

      argmin_{q ∈ Γ} KL( q(θ, z | γ, φ) ‖ p(θ, z | w, α, β) )
KL Minimization
- Note that

      KL(q ‖ p) = E_q[ log( q(θ, z | γ, φ) / p(θ, z | w, α, β) ) ]
                = −H(q) + log p(w | α, β) − E_q[ log p(θ, z, w | α, β) ],

  and that the middle term (the "evidence") does not depend on q, so it is irrelevant to our optimization.
- Hence, find γ*, φ* that maximize

      E_q[ log p(θ, z, w | α, β) ] + H(q),

  the "evidence lower bound" (ELBO).
KL Minimization
- The ELBO can be written explicitly (though it's not pretty),

      Σ_{d=1}^D Σ_{k=1}^K (α_k − 1) E_q[log θ_dk | γ_d]
      + Σ_{d=1}^D Σ_{n=1}^N Σ_{k=1}^K φ_dnk E_q[log θ_dk | γ_d]
      + Σ_{d=1}^D Σ_{n=1}^N Σ_{k=1}^K Σ_{v=1}^V 1(w_dn = v) φ_dnk log β_vk
      − Σ_{d=1}^D [ log Γ(Σ_{k=1}^K γ_dk) − Σ_{k=1}^K log Γ(γ_dk) + Σ_{k=1}^K (γ_dk − 1) E_q[log θ_dk | γ_d] ]
      − Σ_{d=1}^D Σ_{n=1}^N Σ_{k=1}^K φ_dnk log φ_dnk,

  where we have omitted terms that are constant in γ, φ.
KL Minimization
- The point is that we can perform coordinate ascent on the parameters φ and γ to find locally optimal φ* and γ*
- The updates look like

      φ_dnk ∝ β_{w_dn k} exp( E_q[log θ_dk | γ_d] )
      γ_dk = α_k + Σ_{n=1}^N φ_dnk

- Interpretation:
  - The first update is like p(z_dn | w_dn) ∝ p(w_dn | z_dn) p(z_dn)
  - The second update is like a Dirichlet posterior update upon observing data φ_dnk
  - The φ_dnk are the same across occurrences of the same term → saves memory
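The updates above are easy to implement per document. A minimal numpy/scipy sketch, using the Dirichlet identity E_q[log θ_dk | γ_d] = ψ(γ_dk) − ψ(Σ_k γ_dk); the function name and the initialization γ_d = α + N/K are conventional choices, not notation from the slides:

```python
import numpy as np
from scipy.special import digamma

def update_document(w_d, alpha, beta, n_iter=50):
    """Coordinate ascent on (phi_d, gamma_d) for one document.

    w_d   : (N,) array of word indices
    alpha : (K,) Dirichlet parameter
    beta  : (K, V) topic-term probabilities (transpose of the slides' beta_vk)
    """
    K = alpha.shape[0]
    N = w_d.shape[0]
    phi = np.full((N, K), 1.0 / K)   # q(z_dn) initialized uniform
    gamma = alpha + N / K            # conventional initialization
    for _ in range(n_iter):
        # phi_dnk ∝ beta_{w_dn, k} exp(E_q[log theta_dk | gamma_d])
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        phi = beta[:, w_d].T * np.exp(e_log_theta)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_dk = alpha_k + sum_n phi_dnk
        gamma = alpha + phi.sum(axis=0)
    return phi, gamma
```

Note that after every sweep, Σ_k γ_dk = Σ_k α_k + N, since the φ rows each sum to one.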
Estimating α,β
- So far, we have assumed the fixed parameters α, β are known, when in practice they aren't
- (Blei et al.) propose two approaches
  - Variational EM: the ELBO takes the place of the usual expected complete log-likelihood, and we alternate between optimizing φ_dnk, γ_d (variational E-step) and α, β (variational M-step)
  - Smoothed variational Bayes: place a Dirichlet prior on β and introduce it into the variational approximation; the variational M-step then only optimizes α
- The smoothed Bayesian approach is better when ML estimates of β are unreliable (e.g., when data are sparse)
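For concreteness, the variational M-step for β in Blei et al.'s variational EM aggregates the E-step responsibilities: β_vk ∝ Σ_d Σ_n φ_dnk 1(w_dn = v). A sketch, where the helper name and the K × V storage of β are my own conventions:

```python
import numpy as np

def m_step_beta(docs, phis, V):
    """Variational M-step for beta: beta_kv ∝ sum_d sum_n phi_dnk 1[w_dn = v].

    docs : list of (N_d,) word-index arrays
    phis : list of (N_d, K) variational topic responsibilities
    """
    K = phis[0].shape[1]
    beta = np.zeros((K, V))
    for w_d, phi_d in zip(docs, phis):
        for k in range(K):
            # scatter-add responsibility mass onto the terms that appeared
            np.add.at(beta[k], w_d, phi_d[:, k])
    beta /= beta.sum(axis=1, keepdims=True)  # normalize each topic over terms
    return beta
```

In the smoothed variant, a Dirichlet prior on β would add its pseudo-counts to these sufficient statistics instead of maximizing them directly.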
Conclusion
Examples

(Blei et al.) compare approaches on a variety of topic modeling tasks,

- Directly fitting to the Associated Press corpus, evaluated using held-out likelihood
- As preprocessing for classification on the Reuters data
- Collaborative filtering – evaluating likelihood on held-out movies instead of words
Conclusion
- The basic LDA model can be easily extended by removing various exchangeability assumptions (D. M. Blei and J. D. Lafferty, D. Blei and J. Lafferty, Lacoste-Julien et al.)
- More generally, the three-level hierarchical Bayesian idea opens the door to a variety of "mixed-membership" models (Airoldi et al., Erosheva and Fienberg, Mackey et al., Fox and Jordan)
- Alternative MCMC, variational inference, and method of moments techniques are still an active area of research (M. Hoffman et al., Anandkumar et al., M. D. Hoffman et al., Teh et al.)
References

Airoldi, Edoardo M., et al. "Mixed Membership Stochastic Blockmodels." Journal of Machine Learning Research, vol. 9, no. Sep, 2008, pp. 1981–2014.

Anandkumar, Anima, et al. "A Spectral Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2012, pp. 917–925.

Blei, David M., and John D. Lafferty. "Dynamic Topic Models." Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 113–120.

Blei, David M., et al. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, no. Jan, 2003, pp. 993–1022.

Blei, David, and John Lafferty. "Correlated Topic Models." Advances in Neural Information Processing Systems, vol. 18, 2006, p. 147.

Erosheva, Elena A., and Stephen E. Fienberg. "Bayesian Mixed Membership Models for Soft Clustering and Classification." Classification—The Ubiquitous Challenge, Springer, 2005, pp. 11–26.

Fox, Emily B., and Michael I. Jordan. "Mixed Membership Models for Time Series." ArXiv Preprint ArXiv:1309.3533, 2013.

Hoffman, Matthew D., et al. "Stochastic Variational Inference." Journal of Machine Learning Research, vol. 14, no. 1, 2013, pp. 1303–1347.

Hoffman, Matthew, et al. "Online Learning for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2010, pp. 856–864.

Lacoste-Julien, Simon, et al. "DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification." Advances in Neural Information Processing Systems, 2009, pp. 897–904.

Mackey, Lester W., et al. "Mixed Membership Matrix Factorization." Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 711–718.

Teh, Yee W., et al. "A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2006, pp. 1353–1360.