


Document and Topic Models: pLSA and LDA

Andrew Levandoski and Jonathan Lobo

CS 3750 Advanced Topics in Machine Learning

2 October 2018

Outline

• Topic Models

• pLSA
  • LSA
  • Model
  • Fitting via EM
  • pHITS: link analysis

• LDA
  • Dirichlet distribution
  • Generative process
  • Model
  • Geometric Interpretation
  • Inference


Topic Models: Visual Representation

[Figure: example corpus illustrating topics (word distributions), documents, and per-word topic proportions and assignments]


Topic Models: Importance

• For a given corpus, we learn two things:
  1. Topics: from the full vocabulary set, we learn important subsets
  2. Topic proportions: we learn what each document is about

• This can be viewed as a form of dimensionality reduction
  • From the large vocabulary set, extract basis vectors (topics)
  • Represent each document in topic space (topic proportions)
  • Dimensionality is reduced from $w_{1:N} \in \mathbb{Z}_V^N$ to $\theta \in \mathbb{R}^K$

• Topic proportions are useful for several applications, including document classification, discovery of semantic structures, sentiment analysis, object localization in images, etc.


Topic Models: Terminology

• Document Model
  • Word: element in a vocabulary set
  • Document: collection of words
  • Corpus: collection of documents

• Topic Model
  • Topic: collection of words (subset of the vocabulary)
  • Each document is represented by a (latent) mixture of topics
  • $p(w|d) = \sum_z p(w|z)\,p(z|d)$   (z: topic)

• Note: a document is a collection of words, not a sequence
  • 'Bag of words' assumption
  • In probability, we call this the exchangeability assumption
  • $p(w_1, \ldots, w_N) = p(w_{\sigma(1)}, \ldots, w_{\sigma(N)})$   (σ: permutation)


Topic Models: Terminology (cont'd)

• Represent each document as a vector in a vector space

• A word is an item from a vocabulary indexed by {1, …, V}. We represent words using unit-basis vectors: the v-th word is represented by a V-vector w such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.

• A document is a sequence of N words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, where $w_n$ is the n-th word in the sequence.

• A corpus is a collection of M documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$.
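To make these definitions concrete, here is a minimal Python sketch (the toy vocabulary and all names are our own illustration, not part of the slides) of unit-basis word vectors and the bag-of-words counts of a document:

    import numpy as np

    # Toy vocabulary indexed by {0, ..., V-1} (hypothetical example)
    vocab = ["dog", "cat", "oven", "food", "congress"]
    V = len(vocab)
    word_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """Unit-basis vector: component v is 1 for the v-th word, 0 elsewhere."""
        vec = np.zeros(V)
        vec[word_index[word]] = 1.0
        return vec

    # A document is a sequence of words; under the bag-of-words assumption
    # only the counts n(d, w) matter, not the order.
    doc = ["dog", "cat", "dog", "food"]
    counts = sum(one_hot(w) for w in doc)
    print(counts)   # [2. 1. 0. 1. 0.]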


Probabilistic Latent Semantic Analysis (pLSA)


Motivation

• Learning from text and natural language

• Learning the meaning and usage of words without prior linguistic knowledge

• Modeling semantics
  • Account for polysemes and similar words
  • Difference between what is said and what is meant


Vector Space Model

• Want to represent documents and terms as vectors in a lower-dimensional space

• N × M word-document co-occurrence matrix $\mathbf{N}$
  • Limitations: high dimensionality, noisy, sparse
  • Solution: map to a lower-dimensional latent semantic space using SVD

$D = \{d_1, \ldots, d_N\}$
$W = \{w_1, \ldots, w_M\}$
$\mathbf{N}_{ij} = n(d_i, w_j)$


Latent Semantic Analysis (LSA)

• Goal
  • Map the high-dimensional vector space representation to a lower-dimensional representation in a latent semantic space
  • Reveal semantic relations between documents (count vectors)

• SVD
  • $\mathbf{N} = U \Sigma V^T$
  • U: orthogonal matrix of left singular vectors (eigenvectors of $\mathbf{N}\mathbf{N}^T$)
  • V: orthogonal matrix of right singular vectors (eigenvectors of $\mathbf{N}^T\mathbf{N}$)
  • Σ: diagonal matrix of singular values of $\mathbf{N}$

• Select the k largest singular values from Σ to obtain a rank-k approximation $\tilde{\mathbf{N}}$ with minimal error
• Can compute similarity values between document vectors and term vectors
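A sketch of the LSA step in Python/NumPy, assuming N is a count matrix like the one built above and k is chosen by hand:

    import numpy as np

    def lsa(N, k):
        """Rank-k LSA via the SVD N = U Σ V^T."""
        U, s, Vt = np.linalg.svd(N.astype(float), full_matrices=False)
        U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
        doc_vecs = U_k * s_k              # documents in the latent semantic space
        term_vecs = Vt_k.T * s_k          # terms in the latent semantic space
        N_k = U_k @ np.diag(s_k) @ Vt_k   # best rank-k approximation of N
        return doc_vecs, term_vecs, N_k

    def cosine(a, b):
        """Similarity between two vectors in the latent space."""
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)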



LSA Strengths

• Outperforms the naïve vector space model

• Unsupervised, simple

• Noise removal and robustness due to dimensionality reduction

• Can capture synonymy

• Language independent

• Can easily perform queries, clustering, and comparisons


LSA Limitations

• No probabilistic model of term occurrences

• Results are difficult to interpret

• Assumes that words and documents form a joint Gaussian model

• Arbitrary selection of the number of dimensions k

• Cannot account for polysemy

• No generative model


Probabilistic Latent Semantic Analysis (pLSA)

• Difference between topics and words?
  • Words are observable
  • Topics are not; they are latent

• Aspect Model
  • Associates an unobserved latent class variable $z \in Z = \{z_1, \ldots, z_K\}$ with each observation
  • Defines a joint probability model over documents and words
  • Assumes w is independent of d conditioned on z
  • Cardinality of z should be much less than that of d and w


pLSA Model Formulation

• Basic Generative Model
  • Select a document d with probability P(d)
  • Select a latent class z with probability P(z|d)
  • Generate a word w with probability P(w|z)

• Joint Probability Model

$P(d, w) = P(d)\,P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$


pLSA Graphical Model Representation

[Figure: plate diagrams for the two equivalent parameterizations]

Asymmetric formulation:
$P(d, w) = P(d)\,P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$

Symmetric formulation:
$P(d, w) = \sum_{z \in Z} P(z)\,P(d|z)\,P(w|z)$


pLSA Joint Probability Model

๐‘ƒ ๐‘‘,๐‘ค = ๐‘ƒ ๐‘‘ ๐‘ƒ ๐‘ค ๐‘‘

๐‘ƒ ๐‘ค|๐‘‘ =

๐‘ง ๐œ– โ„ค

๐‘ƒ ๐‘ค|๐‘ง ๐‘ƒ ๐‘ง ๐‘‘

โ„’ =

๐‘‘๐œ–๐ท

๐‘ค๐œ–๐‘Š

๐‘› ๐‘‘,๐‘ค log๐‘ƒ(๐‘‘, ๐‘ค)

Maximize:

Corresponds to a minimization of KL divergence (cross-entropy) between the empirical distribution of words and the model distribution P(w|d)


Probabilistic Latent Semantic Space

• P(w|d) for all documents is approximated by a multinomial combination of all factors P(w|z)

• The weights P(z|d) uniquely define a point in the latent semantic space and represent how topics are mixed in a document


Probabilistic Latent Semantic Space

• A topic is represented by a probability distribution over words

• A document is represented by a probability distribution over topics

$z_i = (w_1, \ldots, w_m)$, e.g. $z_1 = (0.3,\, 0.1,\, 0.2,\, 0.3,\, 0.1)$
$d_j = (z_1, \ldots, z_n)$, e.g. $d_1 = (0.5,\, 0.3,\, 0.2)$

Model Fitting via Expectation Maximization

• E-step: compute posterior probabilities of the latent variables z using the current parameters

$P(z|d, w) = \frac{P(z)\,P(d|z)\,P(w|z)}{\sum_{z'} P(z')\,P(d|z')\,P(w|z')}$

• M-step: update the parameters using the posterior probabilities

$P(w|z) = \frac{\sum_{d} n(d, w)\,P(z|d, w)}{\sum_{d, w'} n(d, w')\,P(z|d, w')}$

$P(d|z) = \frac{\sum_{w} n(d, w)\,P(z|d, w)}{\sum_{d', w} n(d', w)\,P(z|d', w)}$

$P(z) = \frac{1}{R} \sum_{d, w} n(d, w)\,P(z|d, w), \qquad R \equiv \sum_{d, w} n(d, w)$
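A compact NumPy sketch of these updates, assuming a dense count matrix n_dw of shape (documents, words); the random initialization and fixed iteration count are arbitrary, and no convergence check or log-likelihood monitoring is shown:

    import numpy as np

    def plsa_em(n_dw, K, iters=100, seed=0):
        """Fit pLSA by EM; returns P(z), P(d|z), P(w|z)."""
        rng = np.random.default_rng(seed)
        D, W = n_dw.shape
        P_z = np.full(K, 1.0 / K)
        P_d_z = rng.random((D, K))
        P_d_z /= P_d_z.sum(axis=0)
        P_w_z = rng.random((W, K))
        P_w_z /= P_w_z.sum(axis=0)
        for _ in range(iters):
            # E-step: P(z|d,w) ∝ P(z) P(d|z) P(w|z), normalized over z
            post = P_z[None, None, :] * P_d_z[:, None, :] * P_w_z[None, :, :]
            post /= post.sum(axis=2, keepdims=True) + 1e-12        # shape (D, W, K)
            # M-step: re-estimate parameters from n(d,w) P(z|d,w)
            nz = n_dw[:, :, None] * post
            P_w_z = nz.sum(axis=0) / (nz.sum(axis=(0, 1)) + 1e-12)
            P_d_z = nz.sum(axis=1) / (nz.sum(axis=(0, 1)) + 1e-12)
            P_z = nz.sum(axis=(0, 1)) / n_dw.sum()
        return P_z, P_d_z, P_w_z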


pLSA Strengths

• Models word-document co-occurrences as a mixture of conditionally independent multinomial distributions

• A mixture model, not a clustering model

• Results have a clear probabilistic interpretation

• Allows for model combination

• The problem of polysemy is better addressed


pLSA Limitations

• Potentially higher computational complexity

• EM algorithm only finds a local maximum

• Prone to overfitting
  • Solution: Tempered EM

• Not a well-defined generative model for new documents
  • Solution: Latent Dirichlet Allocation


pLSA Model Fitting Revisited

• Tempered EM
  • Goals: maximize performance on unseen data, accelerate the fitting process
  • Define a control parameter β that is continuously modified
  • Modified E-step:

$P_\beta(z|d, w) = \frac{P(z)\,[P(d|z)\,P(w|z)]^{\beta}}{\sum_{z'} P(z')\,[P(d|z')\,P(w|z')]^{\beta}}$


Tempered EM Steps

1) Split the data into training and validation sets

2) Set β to 1

3) Perform EM on the training set until performance on the validation set decreases

4) Decrease β by setting it to ηβ, where η < 1, and go back to step 3

5) Stop when decreasing β gives no improvement
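A sketch of this schedule in Python, reusing the pLSA quantities from the EM sketch above; fit_until_validation_drops and validation_score are hypothetical caller-supplied callbacks standing in for steps 1 and 3, not part of any library:

    import numpy as np

    def tempered_posterior(P_z, P_d_z, P_w_z, beta):
        """Tempered E-step: P_beta(z|d,w) ∝ P(z) [P(d|z) P(w|z)]**beta."""
        post = P_z[None, None, :] * (P_d_z[:, None, :] * P_w_z[None, :, :]) ** beta
        return post / (post.sum(axis=2, keepdims=True) + 1e-12)

    def tempered_em(fit_until_validation_drops, validation_score, eta=0.9):
        """Anneal beta: fit with the tempered E-step at each beta and shrink
        beta by eta while the held-out score keeps improving (steps 2-5)."""
        beta, best = 1.0, -np.inf
        while True:
            fit_until_validation_drops(beta)      # step 3
            score = validation_score()
            if score <= best:                     # step 5: no further improvement
                return beta
            best = score
            beta *= eta                           # step 4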


Example: Identifying Authoritative Documents


HITS

• Hubs and Authorities
  • Each webpage has an authority score x and a hub score y

• Authority: value of the content on the page to a community
  • likelihood of being cited

• Hub: value of the page's links to other pages
  • likelihood of citing authorities

• A good hub points to many good authorities

• A good authority is pointed to by many good hubs

• Principal components correspond to different communities
  • Identify the principal eigenvector of the co-citation matrix
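A sketch of the standard HITS iteration (Python/NumPy; A is a 0/1 adjacency matrix with A[i, j] = 1 when page i links to page j):

    import numpy as np

    def hits(A, iters=100):
        """Return (authority, hub) scores by power iteration: authorities are
        the principal eigenvector of the co-citation matrix A^T A, hubs of A A^T."""
        n = A.shape[0]
        x = np.ones(n)                 # authority scores
        y = np.ones(n)                 # hub scores
        for _ in range(iters):
            x = A.T @ y                # good authorities are cited by good hubs
            y = A @ x                  # good hubs cite good authorities
            x /= np.linalg.norm(x)
            y /= np.linalg.norm(y)
        return x, y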


HITS Drawbacks

• Uses only the largest eigenvectors, which are not necessarily the only relevant communities

• Authoritative documents in smaller communities may be given no credit

• Solution: Probabilistic HITS (pHITS)


pHITS

$P(d, c) = \sum_{z} P(z)\,P(c|z)\,P(d|z)$

[Figure: documents and citations linked through latent communities, with factors P(d|z) and P(c|z)]

Interpreting pHITS Results

• Explain d and c in terms of the latent variable "community"

• Authority score: P(c|z)
  • Probability of a document being cited from within community z

• Hub score: P(d|z)
  • Probability that document d contains a reference to community z

• Community membership: P(z|c)
  • Can be used to classify documents


Joint Model of pLSA and pHITS

• Joint probabilistic model of document content (pLSA) and connectivity (pHITS)
  • Able to answer questions about both structure and content

• The model can use evidence about link structure to make predictions about document content, and vice versa

• Reference flow: the connection between one topic and another

• Maximize the log-likelihood function:

$\mathcal{L} = \sum_{j} \left[\, \alpha \sum_{i} \frac{N_{ij}}{\sum_{i'} N_{i'j}} \log \sum_{k} P(w_i|z_k)\,P(z_k|d_j) \;+\; (1 - \alpha) \sum_{l} \frac{A_{lj}}{\sum_{l'} A_{l'j}} \log \sum_{k} P(c_l|z_k)\,P(z_k|d_j) \,\right]$

pLSA: Main Deficiencies

• Incomplete in that it provides no probabilistic model at the document level, i.e., no proper priors are defined

• Each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers; thus:

  1. The number of parameters in the model grows linearly with the size of the corpus, leading to overfitting

  2. It is unclear how to assign probability to a document outside of the training set

• Latent Dirichlet Allocation (LDA) captures the exchangeability of both words and documents using a Dirichlet distribution, allowing a coherent generative process for test data


Latent Dirichlet Allocation (LDA)


LDA: Dirichlet Distribution

• A 'distribution of distributions'
• A multivariate distribution whose components all take values in (0, 1) and sum to one
• Parameterized by the vector α, which has the same number of elements (k) as our multinomial parameter θ
• A generalization of the beta distribution to multiple dimensions
• The α hyperparameter controls the mixture of topics for a given document
• The β hyperparameter controls the distribution of words per topic

Note: ideally we want our composites (documents) to be made up of only a few topics and our parts (words) to belong to only some of the topics. With this in mind, α and β are typically set below one.
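A quick way to see this effect of α, sketched with NumPy's Dirichlet sampler:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 5
    for alpha in (0.1, 1.0, 10.0):
        theta = rng.dirichlet(np.full(K, alpha), size=3)
        print(alpha, np.round(theta, 2))
    # alpha < 1: samples sit near the corners of the simplex
    #            (each document dominated by a few topics)
    # alpha > 1: samples sit near the uniform distribution
    #            (each document mixes many topics)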


LDA: Dirichlet Distribution (cont'd)

• A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex (a k-vector θ lies in the (k−1)-simplex if $\theta_i \geq 0$ and $\sum_{i=1}^{k} \theta_i = 1$) and has the following probability density on this simplex:

$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\; \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1},$

where the parameter α is a k-vector with components $\alpha_i > 0$ and Γ(x) is the Gamma function.

• The Dirichlet is a convenient distribution on the simplex:
  • In the exponential family
  • Has finite-dimensional sufficient statistics
  • Conjugate to the multinomial distribution

LDA: Generative Process

LDA assumes the following generative process for each document w in a corpus D:

1. Choose $N \sim \mathrm{Poisson}(\xi)$.

2. Choose $\theta \sim \mathrm{Dir}(\alpha)$.

3. For each of the N words $w_n$:
   a. Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   b. Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.

Example: assume a group of articles that can be broken down into three topics described by the following words:

• Animals: dog, cat, chicken, nature, zoo

• Cooking: oven, food, restaurant, plates, taste

• Politics: Republican, Democrat, Congress, ineffective, divisive

To generate a new document that is 80% about animals and 20% about cooking:
• Choose the length of the article (say, 1000 words)
• For each word, choose a topic based on the specified mixture (~800 words will come from the topic 'animals')
• Choose a word based on the word distribution for that topic
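A sketch of this generative process (Python/NumPy); the two-topic vocabulary and all parameter values are our own toy numbers, and β is taken as a fixed topic-word matrix rather than sampled:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["dog", "cat", "zoo", "oven", "food", "taste"]
    # beta[k, v] = p(word v | topic k); rows are 'animals' and 'cooking'
    beta = np.array([[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.4, 0.4, 0.2]])
    alpha = np.array([1.6, 0.4])                  # Dirichlet prior skewed toward 'animals'

    def generate_document(xi=1000):
        N = rng.poisson(xi)                       # 1. choose N ~ Poisson(xi)
        theta = rng.dirichlet(alpha)              # 2. choose theta ~ Dir(alpha)
        words = []
        for _ in range(N):                        # 3. for each of the N words:
            z = rng.choice(len(alpha), p=theta)   #    a. topic z_n ~ Multinomial(theta)
            w = rng.choice(len(vocab), p=beta[z]) #    b. word w_n ~ p(w | z_n, beta)
            words.append(vocab[w])
        return words

    print(generate_document()[:10])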


LDA: Model (Plate Notation)

α is the parameter of the Dirichlet prior on the per-document topic distribution,

β is the parameter of the Dirichlet prior on the per-topic word distribution,

$\theta_M$ is the topic distribution for document M,

$z_{MN}$ is the topic for the N-th word in document M, and

$w_{MN}$ is the word.

LDA: Model

[Figure: matrix view of the LDA model]
• α: parameters of the Dirichlet distribution (a K-vector)
• $\theta_{dk}$: topic proportions, one row per document d = 1, …, M and one column per topic k = 1, …, K
• $z_{dn} \in \{1, \ldots, K\}$: topic assignment of the n-th word (n = 1, …, $N_d$) in document d
• $\beta_{ki} = p(w = i \mid z = k)$: word distribution of each topic over the vocabulary indices i = 1, …, V


LDA: Model (cont'd)

• α controls the mixture of topics

• β controls the distribution of words per topic

LDA: Model (cont'd)

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by:

$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta),$

where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique i such that $z_n^i = 1$. Integrating over θ and summing over z, we obtain the marginal distribution of a document:

$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta.$

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d.$


LDA: Exchangeability

• A finite set of random variables $\{x_1, \ldots, x_N\}$ is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N:

$p(x_1, \ldots, x_N) = p(x_{\pi(1)}, \ldots, x_{\pi(N)})$

• An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable

• We assume that words are generated by topics and that those topics are infinitely exchangeable within a document

• By De Finetti's theorem:

$p(w, z) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta$


LDA vs. other latent variable models

Unigram model: $p(w) = \prod_{n=1}^{N} p(w_n)$

Mixture of unigrams: $p(w) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$

pLSI: $p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$


LDA: Geometric Interpretation

• The topic simplex for three topics is embedded in the word simplex for three words

• Corners of the word simplex correspond to the three distributions where each word has probability one

• Corners of the topic simplex correspond to three different distributions over words

• The mixture of unigrams places each document at one of the corners of the topic simplex

• pLSI induces an empirical distribution on the topic simplex (denoted by diamonds)

• LDA places a smooth distribution on the topic simplex (denoted by contour lines)

LDA: Goal of Inference

LDA inputs: the set of words per document, for each document in a corpus

LDA outputs:
• Corpus-wide topic vocabulary distributions
• Topic assignments per word
• Topic proportions per document


LDA: Inference


The key inferential problem we need to solve with LDA is that of computing the posterior distribution of the hidden variables given a document:

๐‘ ๐œƒ, ๐‘ง ๐‘ค, ๐›ผ, ๐›ฝ =๐‘(๐œƒ, ๐‘ง, ๐‘ค|๐›ผ, ๐›ฝ)

๐‘(๐‘ค|๐›ผ, ๐›ฝ)

This formula is intractable to compute in general (the integral cannot be solved in closed form), so to normalize the distribution we marginalize over the hidden variables:

๐‘ ๐‘ค ๐›ผ, ๐›ฝ =ฮ“(ฯƒ๐‘– ๐›ผ๐‘–)

ฯ‚๐‘– ฮ“(๐›ผ๐‘–)เถฑ เท‘

๐‘–=1

๐‘˜

๐œƒ๐‘–๐›ผ๐‘–โˆ’1 เท‘

๐‘›=1

๐‘

๐‘–=1

๐‘˜

เท‘

๐‘—=1

๐‘‰

(๐œƒ๐‘–๐›ฝ๐‘–๐‘—)๐‘ค๐‘›๐‘—

๐‘‘๐œƒ

LDA: Variational Inference

• Basic idea: make use of Jensen's inequality to obtain an adjustable lower bound on the log likelihood

• Consider a family of lower bounds indexed by a set of variational parameters, chosen by an optimization procedure that attempts to find the tightest possible lower bound

• The problematic coupling between θ and β arises due to the edges between θ, z, and w. By dropping these edges and the w nodes, we obtain a family of distributions on the latent variables characterized by the following variational distribution:

$q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$

where γ and $(\phi_1, \ldots, \phi_N)$ are the free variational parameters.


LDA: Variational Inference (cont'd)

• With this specified family of probability distributions, we set up the following optimization problem to determine γ and φ:

$(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} D\!\left( q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta) \right)$

• The optimizing values of these parameters are found by minimizing the KL divergence between the variational distribution and the true posterior $p(\theta, z \mid w, \alpha, \beta)$

• By computing the derivatives of the KL divergence and setting them equal to zero, we obtain the following pair of update equations:

$\phi_{ni} \propto \beta_{i w_n} \exp\!\left( E_q[\log \theta_i \mid \gamma] \right)$

$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$

• The expectation in the multinomial update can be computed as follows:

$E_q[\log \theta_i \mid \gamma] = \Psi(\gamma_i) - \Psi\!\left( \sum_{j=1}^{k} \gamma_j \right)$

where Ψ is the first derivative of the log Γ function.
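A sketch of these coordinate-ascent updates for a single document (Python; beta is a K × V topic-word matrix, alpha a length-K NumPy vector, word_ids the document's word indices, and the fixed iteration count stands in for a proper convergence test):

    import numpy as np
    from scipy.special import digamma

    def variational_inference(word_ids, alpha, beta, iters=50):
        """Return (gamma, phi) for one document: phi[n, i] ≈ q(z_n = i),
        gamma is the Dirichlet parameter of q(theta)."""
        K, N = len(alpha), len(word_ids)
        phi = np.full((N, K), 1.0 / K)
        gamma = alpha + N / K
        for _ in range(iters):
            # phi_{ni} ∝ beta_{i,w_n} exp(E_q[log theta_i | gamma])
            log_theta = digamma(gamma) - digamma(gamma.sum())
            phi = beta[:, word_ids].T * np.exp(log_theta)
            phi /= phi.sum(axis=1, keepdims=True)
            # gamma_i = alpha_i + Σ_n phi_{ni}
            gamma = alpha + phi.sum(axis=0)
        return gamma, phi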



LDA: Parameter Estimation

• Given a corpus of documents $D = \{w_1, w_2, \ldots, w_M\}$, we wish to find α and β that maximize the marginal log likelihood of the data:

$\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(w_d \mid \alpha, \beta)$

• Variational EM yields the following iterative algorithm:
  1. (E-step) For each document, find the optimizing values of the variational parameters $\{\gamma_d^*, \phi_d^* : d \in D\}$
  2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β

These two steps are repeated until the lower bound on the log likelihood converges.


LDA: Smoothing

• Introduces Dirichlet smoothing on β to avoid the zero-frequency word problem

• Fully Bayesian approach:

$q(\beta_{1:k}, z_{1:M}, \theta_{1:M} \mid \lambda, \phi, \gamma) = \prod_{i=1}^{k} \mathrm{Dir}(\beta_i \mid \lambda_i) \prod_{d=1}^{M} q_d(\theta_d, z_d \mid \phi_d, \gamma_d)$

where $q_d(\theta, z \mid \phi, \gamma)$ is the variational distribution defined for LDA. We require an additional update for the new variational parameter λ:

$\lambda_{ij} = \eta + \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^{*}_{dni}\, w_{dn}^{j}$


Topic Model Applications

• Information Retrieval

• Visualization

• Computer Vision
  • Document = image, word = "visual word"

• Bioinformatics
  • Genomic features, gene sequencing, diseases

• Modeling networks
  • Cities, social networks

pLSA / LDA Libraries

• gensim (Python)

• MALLET (Java)

• topicmodels (R)

• Stanford Topic Modeling Toolbox
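A minimal gensim example (Python); the toy corpus and parameter choices are only illustrative:

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["dog", "cat", "zoo", "nature"],
            ["oven", "food", "restaurant", "taste"],
            ["congress", "democrat", "republican", "divisive"],
            ["dog", "food", "taste", "cat"]]

    dictionary = corpora.Dictionary(docs)            # word <-> id mapping
    corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words counts

    lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
                   alpha="auto", passes=10, random_state=0)

    print(lda.print_topics())                  # per-topic word distributions
    print(lda.get_document_topics(corpus[0]))  # topic proportions for one document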


References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. JMLR, 2003.

David Cohn and Huan Chang. Learning to Probabilistically Identify Authoritative Documents. ICML, 2000.

David Cohn and Thomas Hofmann. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. NIPS, 2000.

Thomas Hofmann. Probabilistic Latent Semantic Analysis. UAI, 1999.

Thomas Hofmann. Probabilistic Latent Semantic Indexing. SIGIR, 1999.