Generative Topic Models for Community Analysis
Ramesh Nallapati
Objectives
• Provide an overview of topic models and their learning techniques
  – Mixture models, PLSA, LDA
  – EM, variational EM, Gibbs sampling
• Convince you that topic models are an attractive framework for community analysis
  – 5 definitive papers
Outline
• Part I: Introduction to Topic Models
  – Naive Bayes model
  – Mixture models
    • Expectation Maximization
  – PLSA
  – LDA
    • Variational EM
    • Gibbs sampling
• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author-Topic model
  – Author-Topic-Recipient model
  – Modeling influence of citations
  – Mixed Membership Stochastic Block model
Introduction to Topic Models
• Multinomial Naïve Bayes

[Plate diagram: class node C generating word nodes W1 W2 W3 … WN, repeated over M documents]

• For each document d = 1,…,M
  • Generate Cd ~ Mult(· | π)
  • For each position n = 1,…,Nd
    • Generate wn ~ Mult(· | β, Cd)
Introduction to Topic Models

• Naïve Bayes model: compact representation

[Figure: the expanded model (C with word nodes W1 W2 W3 … WN) redrawn in plate notation as C → W, with a word plate of size N nested inside a document plate of size M]
Introduction to Topic Models
• Multinomial Naïve Bayes: learning
  – Maximize the log-likelihood of the observed variables w.r.t. the parameters
  – The objective is concave, so the optimum is global
  – The solution is available in closed form, as reconstructed below
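The slide's equations were images and did not survive extraction; the following is a standard reconstruction, assuming π is the class prior and β the class-conditional word distributions:

    \log p(\mathbf{w}, \mathbf{C}) = \sum_{d=1}^{M} \Big[ \log \pi_{C_d} + \sum_{n=1}^{N_d} \log \beta_{C_d, w_{dn}} \Big]

    \hat{\pi}_c = \frac{\#\{d : C_d = c\}}{M}, \qquad
    \hat{\beta}_{c,w} = \frac{\sum_{d : C_d = c} n(d, w)}{\sum_{w'} \sum_{d : C_d = c} n(d, w')}

where n(d, w) is the count of word w in document d.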
Introduction to Topic Models
• Mixture model: unsupervised Naïve Bayes model

[Plate diagram: latent class node Z generating word node W, with a word plate of size N nested inside a document plate of size M]

• Joint probability of words and classes: reconstructed below
• But the classes Z are not observed, so they must be summed out
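The joint and marginal probabilities the slide refers to, reconstructed in standard notation (π are the mixing weights, β the component word distributions, K the number of components):

    p(\mathbf{w}_d, Z_d = z) = \pi_z \prod_{n=1}^{N_d} \beta_{z, w_{dn}},
    \qquad
    p(\mathbf{w}_d) = \sum_{z=1}^{K} \pi_z \prod_{n=1}^{N_d} \beta_{z, w_{dn}}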
Introduction to Topic Models
• Mixture model: learning
  – The log-likelihood is no longer concave
    • No guarantee of a global optimum
  – Solution: Expectation Maximization
    • Iterative algorithm
    • Finds a local optimum
    • Guaranteed to maximize a lower bound on the log-likelihood of the observed data
Introduction to Topic Models
• Quick summary of EM:
  – log is a concave function: log(0.5·x1 + 0.5·x2) ≥ 0.5·log(x1) + 0.5·log(x2)
  – This gives a lower bound on the log-likelihood that is concave in each set of variables
  – Optimize this lower bound w.r.t. each variable in turn

[Figure: the chord 0.5·x1 + 0.5·x2 lies below the log curve between x1 and x2, illustrating Jensen’s inequality; the bound also carries an entropy term H(q)]
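Spelled out, the bound sketched in the figure is the standard Jensen lower bound, valid for any distribution q over the latent variable z:

    \log p(\mathbf{w}) = \log \sum_{z} q(z)\, \frac{p(\mathbf{w}, z)}{q(z)}
    \;\ge\; \sum_{z} q(z) \log \frac{p(\mathbf{w}, z)}{q(z)}
    = \mathbb{E}_{q}[\log p(\mathbf{w}, z)] + H(q)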
Introduction to Topic Models
• Mixture model: EM solution
The E-step and M-step updates are reconstructed below.
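The slide's update equations were lost in extraction; a standard reconstruction, with γdz denoting the posterior responsibility of component z for document d:

    \text{E-step:}\quad
    \gamma_{dz} = \frac{\pi_z \prod_{n} \beta_{z, w_{dn}}}{\sum_{z'} \pi_{z'} \prod_{n} \beta_{z', w_{dn}}}

    \text{M-step:}\quad
    \pi_z = \frac{1}{M} \sum_{d=1}^{M} \gamma_{dz},
    \qquad
    \beta_{z,w} = \frac{\sum_{d} \gamma_{dz}\, n(d, w)}{\sum_{w'} \sum_{d} \gamma_{dz}\, n(d, w')}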
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis (PLSA) model

[Plate diagram: document node d, topic z, word w; a position plate of size N nested inside a corpus plate of size M; θd is the per-document topic distribution]

• Select document d ~ Mult(· | π)
• For each position n = 1,…,Nd
  • Generate zn ~ Mult(· | θd)
  • Generate wn ~ Mult(· | βzn)
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis model
  – Learning using EM
  – Not a complete generative model
    • Has a distribution only over the training set of documents: no new document can be generated!
  – Nevertheless, more realistic than the mixture model
    • Documents can discuss multiple topics!
Introduction to Topic Models
• PLSA topics (TDT-1 corpus)
Introduction to Topic Models
• Latent Dirichlet Allocation (LDA)

[Plate diagram: per-document topic proportions θd, topic z, word w; a position plate of size N nested inside a corpus plate of size M]

• For each document d = 1,…,M
  • Generate θd ~ Dir(· | α)
  • For each position n = 1,…,Nd
    • Generate zn ~ Mult(· | θd)
    • Generate wn ~ Mult(· | βzn)
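To make the generative story concrete, here is a minimal simulation sketch in Python with numpy; the vocabulary size, topic count, document count, and hyperparameter values are toy assumptions, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    V, K, M = 1000, 20, 5          # vocabulary size, topics, documents (toy values)
    alpha = np.full(K, 0.1)        # symmetric Dirichlet prior over topic proportions
    beta = rng.dirichlet(np.full(V, 0.01), size=K)  # K topic-word distributions

    corpus = []
    for d in range(M):
        theta_d = rng.dirichlet(alpha)       # theta_d ~ Dir(alpha)
        N_d = rng.poisson(100)               # document length (not part of LDA itself)
        z = rng.choice(K, size=N_d, p=theta_d)            # z_n ~ Mult(theta_d)
        words = [rng.choice(V, p=beta[zn]) for zn in z]   # w_n ~ Mult(beta_{z_n})
        corpus.append(words)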
Introduction to Topic Models
• Latent Dirichlet Allocation
  – Overcomes the issues with PLSA
    • Can generate any random document
  – Parameter learning:
    • Variational EM
      – Numerical approximation using lower bounds
      – Results in biased solutions
      – Convergence has numerical guarantees
    • Gibbs sampling
      – Stochastic simulation
      – Unbiased solutions
      – Stochastic convergence
Introduction to Topic Models
• Variational EM for LDA
  – Approximate the posterior by a simpler, factorized distribution (reconstructed below)
  – The resulting bound is tractable: it can be optimized in closed form w.r.t. each variational parameter in turn
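The factorized family the slide has in mind (its equation was lost) is the mean-field approximation of Blei et al. (2003), with free variational parameters γ and φ:

    q(\theta, \mathbf{z} \mid \gamma, \phi) = \mathrm{Dir}(\theta \mid \gamma) \prod_{n=1}^{N} \mathrm{Mult}(z_n \mid \phi_n)

Fitting q by maximizing the evidence lower bound then alternates closed-form updates for γ and the φn.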
Introduction to Topic Models
• Gibbs sampling
  – Applicable when the joint distribution is hard to sample from directly, but each variable’s conditional distribution is known
  – The sequence of samples comprises a Markov chain
  – The stationary distribution of the chain is the joint distribution
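For LDA in particular, the collapsed Gibbs sampler of Griffiths and Steyvers (2004), not spelled out on the slide, resamples each topic assignment from its conditional given all the others (η is the topic-word Dirichlet hyperparameter, V the vocabulary size, and the counts n exclude position i):

    p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
    \left(n_{d,k}^{-i} + \alpha\right) \frac{n_{k, w_i}^{-i} + \eta}{n_{k}^{-i} + V\eta}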
Introduction to Topic Models
• LDA topics
Introduction to Topic Models
• LDA’s view of a document
Introduction to Topic Models
• Perplexity comparison of various models
[Figure: held-out perplexity curves for the unigram model, the mixture model, PLSA, and LDA; lower is better]
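For reference, since the slide does not define it: perplexity on held-out documents is conventionally

    \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)

so lower values mean the model assigns higher probability to unseen text.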
Introduction to Topic Models
• Summary
  – Generative models for exchangeable data
  – Unsupervised models
  – Automatically discover topics
  – Well-developed approximate techniques available for inference and learning
Part II: Topic Models for Community Analysis
Hyperlink modeling using PLSA
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

[Plate diagram: PLSA with a citation branch; document d selects topics z for both words w (N positions) and cited documents c (L citations), inside a corpus plate of size M]

• Select document d ~ Mult(· | π)
• For each position n = 1,…,Nd
  • Generate zn ~ Mult(· | θd)
  • Generate wn ~ Mult(· | βzn)
• For each citation j = 1,…,Ld
  • Generate zj ~ Mult(· | θd)
  • Generate cj ~ Mult(· | Ωzj), where Ωz is the topic-specific distribution over cited documents
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

PLSA likelihood vs. the new likelihood over words and citations: both are reconstructed below.
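A reconstruction of the two objectives (the slide's equations were images); n(d, w) and n(d, c) count words and citations in document d:

    \mathcal{L}_{\mathrm{PLSA}} = \sum_{d} \sum_{w} n(d, w) \log \sum_{z} p(w \mid z)\, p(z \mid d)

    \mathcal{L}_{\mathrm{new}} = \sum_{d} \Big[ \sum_{w} n(d, w) \log \sum_{z} p(w \mid z)\, p(z \mid d)
    + \sum_{c} n(d, c) \log \sum_{z} p(c \mid z)\, p(z \mid d) \Big]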
Learning using EM
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

Heuristic: weight the two likelihood terms as α·(content term) + (1−α)·(citation term), where 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks.
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

• Experiments: text classification
• Datasets:
  – WebKB
    • 6000 CS department web pages with hyperlinks
    • 6 classes: faculty, course, student, staff, etc.
  – Cora
    • 2000 machine learning abstracts with citations
    • 7 classes: sub-areas of machine learning
• Methodology (a sketch follows this list):
  – Learn the model on the complete data and obtain θd for each document
  – Classify each test document with the label of its nearest neighbor in the training set
  – Distance measured as cosine similarity in the topic (θ) space
  – Measure the performance as a function of α
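A minimal sketch of the nearest-neighbor step, assuming theta_train and theta_test hold the fitted document-topic vectors and labels the training labels; all names here are illustrative, not from the paper:

    import numpy as np

    def cosine_nn_labels(theta_train, labels, theta_test):
        """Label each test doc with its nearest training doc by cosine similarity."""
        # Normalize rows so that dot products equal cosine similarities.
        tr = theta_train / np.linalg.norm(theta_train, axis=1, keepdims=True)
        te = theta_test / np.linalg.norm(theta_test, axis=1, keepdims=True)
        sims = te @ tr.T                  # (num_test, num_train) similarity matrix
        nearest = sims.argmax(axis=1)     # index of the most similar training doc
        return np.asarray(labels)[nearest]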
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

• Classification performance

[Figure: accuracy as α sweeps between pure hyperlink modeling and pure content modeling, one panel per dataset]
Hyperlink modeling using LDA
Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS 2004]

[Plate diagram: LDA with a citation branch; θd selects topics z for both words w (N positions) and cited documents c (L citations), inside a corpus plate of size M]

• For each document d = 1,…,M
  • Generate θd ~ Dir(· | α)
  • For each position n = 1,…,Nd
    • Generate zn ~ Mult(· | θd)
    • Generate wn ~ Mult(· | βzn)
  • For each citation j = 1,…,Ld
    • Generate zj ~ Mult(· | θd)
    • Generate cj ~ Mult(· | Ωzj)
Learning using variational EM
Author-Topic Model for Scientific Literature
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

[Plate diagram: observed author set ad selects an author x, who selects a topic z, which generates word w; author-topic distributions θa sit in a plate of size A, topic-word distributions βk in a plate of size K]

• For each author a = 1,…,A
  • Generate θa ~ Dir(· | α)
• For each topic k = 1,…,K
  • Generate βk ~ Dir(· | η)
• For each document d = 1,…,M
  • For each position n = 1,…,Nd
    • Generate author x ~ Unif(· | ad)
    • Generate zn ~ Mult(· | θx)
    • Generate wn ~ Mult(· | βzn)
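Marginalizing over the author choice x and the topic z gives the probability this model assigns to a word of document d (with author set ad); a standard identity for the model, in the notation above:

    p(w \mid d) = \frac{1}{|a_d|} \sum_{x \in a_d} \sum_{k=1}^{K} \theta_{x,k}\, \beta_{k,w}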
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

Learning: Gibbs sampling, resampling the author and topic assignment (x, z) of each word position.
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
• Perplexity results
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
• Topic-Author visualization
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
• Application 1: Author similarity
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
• Application 2: Author entropy
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
Gibbs sampling
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]

• Datasets
  – Enron email data
    • 23,488 messages between 147 users
  – McCallum’s personal email
    • 23,488(?) messages with 128 authors
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
• Topic Visualization: Enron set
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
• Topic Visualization: McCallum’s data
Modeling Citation Influences
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Copycat model
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence model
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence graph for LDA paper
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Words in LDA paper assigned to citations
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Performance evaluation
  – Data:
    • 22 seed papers and 132 cited papers
    • Users labeled citations on a scale of 1–4
  – Models considered:
    • Citation influence model
    • Copycat model
    • LDA-JS-divergence (symmetric divergence in topic space)
    • LDA-post
    • PageRank
    • TF-IDF
  – Evaluation measure:
    • Area under the ROC curve (AUC)
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Results
Mixed Membership Stochastic Block Models [Work in Progress]

• A complete generative model for text and citations
• Can model the topicality of citations
  – Topic-specific PageRank
• Can also predict citations between unseen documents
Summary
• Topic modeling is an interesting new framework for community analysis
  – Sound theoretical basis
  – Completely unsupervised
  – Simultaneous modeling of multiple fields
  – Discovers “soft” communities and clusters in terms of “topic” membership
  – Can also be used for predictive purposes