Generative Topic Models for Community Analysis

Ramesh Nallapati

9/18/2007 10-802: Guest Lecture 2 / 57

Objectives

• Provide an overview of topic models and their learning techniques
  – Mixture models, PLSA, LDA
  – EM, variational EM, Gibbs sampling

• Convince you that topic models are an attractive framework for community analysis
  – 5 definitive papers

Outline

• Part I: Introduction to Topic Models
  – Naive Bayes model
  – Mixture Models
    • Expectation Maximization
  – PLSA
  – LDA
    • Variational EM
    • Gibbs Sampling

• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author Topic Model
  – Author Topic Recipient Model
  – Modeling influence of Citations
  – Mixed membership Stochastic Block Model

Introduction to Topic Models

• Multinomial Naïve Bayes

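[Plate diagram: class node C with word nodes W1, W2, W3, …, WN as children; plate over M documents]

• For each document $d = 1, \dots, M$
  • Generate $C_d \sim \mathrm{Mult}(\cdot \mid \pi)$
  • For each position $n = 1, \dots, N_d$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta, C_d)$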

Introduction to Topic Models

• Naïve Bayes Model: Compact representation

[Plate diagrams: the expanded model (C with children W1, W2, W3, …, WN, plate over M documents) next to its compact form (C → W, with W in a plate of size N, inside a plate of size M)]

Introduction to Topic Models

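• Multinomial naïve Bayes: Learning
  – Maximize the log-likelihood of the observed variables w.r.t. the parameters (class prior $\pi$, class-conditional word distributions $\beta$):
    $\ell(\pi, \beta) = \sum_{d=1}^{M} \Big[ \log \pi_{C_d} + \sum_{n=1}^{N_d} \log \beta_{C_d, w_n} \Big]$

• The log-likelihood is concave in the parameters (a convex optimization problem): global optimum

• Solution: relative-frequency estimates, i.e. $\hat{\pi}_c$ is the fraction of documents in class $c$ and $\hat{\beta}_{c,w}$ is the fraction of word positions in class-$c$ documents occupied by word $w$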
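The count-based solution can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `fit_naive_bayes`, the `(label, words)` input format, and the toy documents are all assumptions, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(docs):
    """Maximum-likelihood estimates for multinomial naive Bayes.

    docs: list of (class_label, list_of_words) pairs (hypothetical format).
    Returns (pi, beta): class priors and per-class word distributions.
    """
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    for label, words in docs:
        word_counts[label].update(words)

    # pi_c = fraction of documents labeled c
    pi = {c: n / len(docs) for c, n in class_counts.items()}

    # beta_{c,w} = fraction of word positions in class c occupied by w
    beta = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        beta[c] = {w: k / total for w, k in counts.items()}
    return pi, beta


# Toy data, invented for illustration only.
toy_docs = [
    ("sports", ["ball", "goal", "ball"]),
    ("sports", ["goal", "team"]),
    ("politics", ["vote", "team", "vote", "vote"]),
]
pi, beta = fit_naive_bayes(toy_docs)
```

In practice a small pseudo-count is usually added to every word count so that unseen words do not get probability zero.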

Introduction to Topic Models

• Mixture model: unsupervised naïve Bayes model

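[Plate diagram: class node C → W, with W in a plate of size N, inside a plate of size M; the unobserved class is denoted Z]

• Joint probability of words and classes (class prior $\pi$, class-conditional word distributions $\beta$):
  $P(\mathbf{w}_d, C_d) = \pi_{C_d} \prod_{n=1}^{N_d} \beta_{C_d, w_n}$

• But classes are not visible, so they are summed out:
  $P(\mathbf{w}_d) = \sum_{z} \pi_{z} \prod_{n=1}^{N_d} \beta_{z, w_n}$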

Introduction to Topic Models

• Mixture model: learning
  – The log-likelihood is no longer concave
    • No closed-form global optimum
  – Solution: Expectation Maximization
    • Iterative algorithm
    • Finds a local optimum
    • Guaranteed to maximize a lower bound on the log-likelihood of the observed data

Introduction to Topic Models

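• Quick summary of EM:
  – Log is a concave function, so by Jensen's inequality $\log(0.5 x_1 + 0.5 x_2) \ge 0.5 \log(x_1) + 0.5 \log(x_2)$
  – The resulting lower bound is concave in each variable
  – Optimize this lower bound w.r.t. each variable instead

[Figure: the log curve between x1 and x2, comparing log(0.5x1 + 0.5x2) at the midpoint 0.5x1 + 0.5x2 with the chord value 0.5log(x1) + 0.5log(x2)]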
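The bound that EM optimizes follows from the same inequality. A sketch, where $q(z)$ is an arbitrary distribution over the hidden class (notation assumed here, not taken from the slide) and $H(q)$ is its entropy, presumably the $H(\cdot)$ term in the slide's figure:

```latex
\log p(\mathbf{w})
  = \log \sum_{z} q(z)\,\frac{p(\mathbf{w}, z)}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{p(\mathbf{w}, z)}{q(z)}
  = \sum_{z} q(z) \log p(\mathbf{w}, z) + H(q),
\qquad
H(q) = -\sum_{z} q(z) \log q(z).
```

The bound is tight when $q(z)$ equals the posterior $p(z \mid \mathbf{w})$, which is what the E-step computes; the M-step then maximizes the first term over the model parameters.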

Introduction to Topic Models

• Mixture model: EM solution

E-step:

M-step:
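As a hedged illustration of the two steps, here is a small pure-Python EM loop for a mixture of multinomials. The function name `em_mixture`, the word-id input format, the random initialization, and the smoothing constant `smooth` are assumptions made for this sketch rather than details from the lecture.

```python
import math
import random


def em_mixture(docs, vocab_size, n_classes, n_iters=50, seed=0, smooth=1e-2):
    """EM for a mixture of multinomials.

    docs: list of documents, each a list of word ids in range(vocab_size).
    Returns (pi, beta, resp), where resp[d][c] approximates P(class c | doc d).
    """
    rng = random.Random(seed)
    pi = [1.0 / n_classes] * n_classes
    beta = []
    for _ in range(n_classes):
        row = [rng.random() + 0.1 for _ in range(vocab_size)]
        total = sum(row)
        beta.append([v / total for v in row])

    resp = []
    for _ in range(n_iters):
        # E-step: posterior over classes for each document, in log space.
        resp = []
        for words in docs:
            logp = [
                math.log(pi[c]) + sum(math.log(beta[c][w]) for w in words)
                for c in range(n_classes)
            ]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            norm = sum(unnorm)
            resp.append([u / norm for u in unnorm])

        # M-step: re-estimate parameters from expected (soft) counts.
        for c in range(n_classes):
            mass = sum(r[c] for r in resp)
            pi[c] = (mass + smooth) / (len(docs) + n_classes * smooth)
            counts = [smooth] * vocab_size
            for r, words in zip(resp, docs):
                for w in words:
                    counts[w] += r[c]
            total = sum(counts)
            beta[c] = [v / total for v in counts]
    return pi, beta, resp


# Toy corpus of word ids, invented for illustration.
toy_docs = [[0, 0, 1], [0, 1, 0], [2, 3, 3], [3, 2, 2]]
pi, beta, resp = em_mixture(toy_docs, vocab_size=4, n_classes=2)
```

Because EM only finds a local optimum, different seeds can give different solutions; running several restarts and keeping the best likelihood is a common remedy.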

Introduction to Topic Models

• Probabilistic Latent Semantic Analysis Model

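[Plate diagram: document node d → topic z → word w, with z and w in a plate of size N, inside a plate of size M; each document d has its own topic distribution $\theta_d$]

• Select document $d \sim \mathrm{Mult}(\cdot \mid \pi)$
  • For each position $n = 1, \dots, N_d$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$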

Introduction to Topic Models

• Probabilistic Latent Semantic Analysis Model
  – Learning using EM
  – Not a complete generative model
    • Has a distribution over the training set of documents: no new document can be generated!
  – Nevertheless, more realistic than the mixture model
    • Documents can discuss multiple topics!

Introduction to Topic Models

• PLSA topics (TDT-1 corpus)

Introduction to Topic Models

• Latent Dirichlet Allocation

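[Plate diagram: topic z → word w in a plate of size N, inside a plate of size M]

• For each document $d = 1, \dots, M$
  • Generate $\theta_d \sim \mathrm{Dir}(\cdot \mid \alpha)$
  • For each position $n = 1, \dots, N_d$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$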
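The generative steps above can be sketched in plain Python. The function names, the `seed` argument, and the toy topic-word matrix are illustrative assumptions; Dirichlet draws are built from normalized Gamma variates using only the standard library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_lda_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents from the LDA generative process.

    alpha: Dirichlet parameter, one entry per topic.
    beta: beta[k][w] = probability of word w under topic k.
    Returns a list of (theta_d, [(z_n, w_n), ...]) pairs.
    """
    rng = random.Random(seed)
    corpus = []
    for n_words in doc_lengths:
        theta = sample_dirichlet(alpha, rng)      # per-document topic mixture
        tokens = []
        for _ in range(n_words):
            z = sample_categorical(theta, rng)    # topic for this position
            w = sample_categorical(beta[z], rng)  # word drawn from that topic
            tokens.append((z, w))
        corpus.append((theta, tokens))
    return corpus


# Two toy topics over a three-word vocabulary (made-up numbers).
toy_beta = [[0.8, 0.15, 0.05], [0.05, 0.15, 0.8]]
corpus = generate_lda_corpus(alpha=[0.5, 0.5], beta=toy_beta, doc_lengths=[5, 8])
```

Unlike PLSA, nothing here depends on a fixed set of training documents: every call draws a fresh $\theta_d$, so new documents can be generated at will.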

Introduction to Topic Models

• Latent Dirichlet Allocation
  – Overcomes the issues with PLSA
    • Can generate any random document
  – Parameter learning:
    • Variational EM
      – Numerical approximation using lower bounds
      – Results in biased solutions
      – Convergence has numerical guarantees
    • Gibbs Sampling
      – Stochastic simulation
      – Unbiased solutions
      – Stochastic convergence

Introduction to Topic Models

• Variational EM for LDA
  – Approximate the posterior by a simpler distribution
    • The resulting lower bound is a concave function of each parameter (a convex optimization problem in each)!

Introduction to Topic Models

• Gibbs sampling
  – Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  – The sequence of samples comprises a Markov chain
  – The stationary distribution of the chain is the joint distribution
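For LDA, one common instance is collapsed Gibbs sampling, in which each topic assignment is resampled from its conditional given all the others. The sketch below assumes symmetric priors `alpha` and `eta` and word ids in `range(vocab_size)`; the slide does not say which sampler was intended, so this is one possible reading rather than the lecture's algorithm.

```python
import random


def gibbs_lda(docs, vocab_size, n_topics, alpha=0.1, eta=0.01,
              n_sweeps=100, seed=0):
    """Collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of documents, each a list of word ids.
    Returns (assign, doc_topic, topic_word): the topic of every token and
    the count tables implied by those assignments.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initial assignment of a topic to every token.
    assign = []
    for d, words in enumerate(docs):
        zs = []
        for w in words:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_sweeps):
        for d, words in enumerate(docs):
            for n, w in enumerate(words):
                # Remove the current token from the counts.
                z = assign[d][n]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Conditional of z_n given every other assignment.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]

                # Add the token back under its new topic.
                assign[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assign, doc_topic, topic_word


# Toy corpus of word ids, invented for illustration.
toy_docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
assign, doc_topic, topic_word = gibbs_lda(toy_docs, vocab_size=4, n_topics=2)
```

Only the conditional for a single assignment is ever evaluated, which is exactly the situation the bullets describe: the joint posterior is intractable, but each conditional is a simple ratio of counts.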

Introduction to Topic Models

• LDA topics

Introduction to Topic Models

• LDA’s view of a document

Introduction to Topic Models

• Perplexity comparison of various models

[Figure: perplexity curves for the unigram model, mixture model, PLSA, and LDA; lower is better]
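Perplexity is the exponentiated negative average log-likelihood per word on held-out text. A small sketch, assuming the per-document log-probabilities have already been computed by whichever model is being scored (the function and argument names are illustrative):

```python
import math


def perplexity(doc_log_probs, doc_lengths):
    """Per-word perplexity of a held-out corpus.

    doc_log_probs: natural-log probability of each held-out document.
    doc_lengths: number of word tokens in each document.
    """
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("corpus has no words")
    return math.exp(-sum(doc_log_probs) / total_words)


# Sanity check: a uniform model over 4 words should have perplexity 4.
uniform_ppl = perplexity([10 * math.log(0.25)], [10])
```

A model that assigns higher probability to unseen documents gets a lower perplexity, which is why lower is better in the comparison.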

Introduction to Topic Models

• Summary
  – Generative models for exchangeable data
  – Unsupervised models
  – Automatically discover topics
  – Well-developed approximate techniques available for inference and learning

Hyperlink modeling using PLSA

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

[Plate diagram: document d → topic z → word w in a plate of size N, and d → topic z → citation c in a plate of size L, all inside a plate of size M]

• Select document $d \sim \mathrm{Mult}(\cdot \mid \pi)$
  • For each position $n = 1, \dots, N_d$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$
  • For each citation $j = 1, \dots, L_d$
    • Generate $z_j \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $c_j \sim \mathrm{Mult}(\cdot \mid \gamma_{z_j})$

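Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

[Plate diagram: document d → topic z → word w in a plate of size N, and d → topic z → citation c in a plate of size L, all inside a plate of size M]

PLSA likelihood:
$\prod_{d=1}^{M} P(d) \prod_{n=1}^{N_d} \sum_{z} P(z \mid d)\, P(w_n \mid z)$

New likelihood:
$\prod_{d=1}^{M} P(d) \Big[ \prod_{n=1}^{N_d} \sum_{z} P(z \mid d)\, P(w_n \mid z) \Big] \Big[ \prod_{j=1}^{L_d} \sum_{z} P(z \mid d)\, P(c_j \mid z) \Big]$

Learning using EM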

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

Heuristic: weight the content term of the log-likelihood by $\alpha$ and the hyperlink term by $(1 - \alpha)$

$0 \le \alpha \le 1$ determines the relative importance of content and hyperlinks
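One way to write the weighted objective, as a sketch in terms of the quantities of the generative process ($P(z \mid d)$, $P(w \mid z)$, $P(c \mid z)$). The published objective may additionally normalize the word and citation counts per document; that detail is omitted here.

```latex
\mathcal{L}(\alpha) = \sum_{d=1}^{M} \Big[
    \alpha \sum_{n=1}^{N_d} \log \sum_{z} P(w_n \mid z)\, P(z \mid d)
    \;+\; (1 - \alpha) \sum_{j=1}^{L_d} \log \sum_{z} P(c_j \mid z)\, P(z \mid d)
\Big]
```

Setting $\alpha = 1$ recovers content-only PLSA, and $\alpha = 0$ uses the hyperlinks alone.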

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

• Experiments: Text Classification
• Datasets:
  – WebKB
    • 6000 CS dept web pages with hyperlinks
    • 6 classes: faculty, course, student, staff, etc.
  – Cora
    • 2000 machine learning abstracts with citations
    • 7 classes: sub-areas of machine learning
• Methodology:
  – Learn the model on complete data and obtain $\theta_d$ for each document
  – Test documents classified into the label of the nearest neighbor in the training set
  – Nearness measured by cosine similarity in the $\theta$ space
  – Measure the performance as a function of $\alpha$
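The nearest-neighbor step of this methodology can be sketched as follows. The function names and the toy topic vectors are assumptions for illustration; the topic vectors would come from the fitted model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor_label(theta_test, train):
    """Label of the training document whose topic vector is most similar.

    train: list of (theta_d, label) pairs (hypothetical format).
    """
    best = max(train, key=lambda item: cosine_similarity(theta_test, item[0]))
    return best[1]


# Made-up topic vectors for two training pages.
toy_train = [([0.9, 0.1], "faculty"), ([0.2, 0.8], "course")]
predicted = nearest_neighbor_label([0.8, 0.2], toy_train)
```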

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

• Classification performance

[Figure: two classification-performance plots, each with a horizontal axis running from "Hyperlink" to "content"]

Hyperlink modeling using LDA

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

[Plate diagram: topic z → word w in a plate of size N, and topic z → citation c in a plate of size L, inside a plate of size M]

• For each document $d = 1, \dots, M$
  • Generate $\theta_d \sim \mathrm{Dir}(\cdot \mid \alpha)$
  • For each position $n = 1, \dots, N_d$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$
  • For each citation $j = 1, \dots, L_d$
    • Generate $z_j \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $c_j \sim \mathrm{Mult}(\cdot \mid \gamma_{z_j})$

Learning using variational EM

Author-Topic Model for Scientific Literature

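Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

[Plate diagram: observed author set $a_d$ → author x → topic z → word w, with x, z, w in a plate of size N, inside a plate of size M; author-topic distributions in a plate of size A; topic-word distributions in a plate of size K]

• For each author $a = 1, \dots, A$
  • Generate $\theta_a \sim \mathrm{Dir}(\cdot \mid \alpha)$
• For each topic $k = 1, \dots, K$
  • Generate $\phi_k \sim \mathrm{Dir}(\cdot \mid \beta)$
• For each document $d = 1, \dots, M$
  • For each position $n = 1, \dots, N_d$
    • Generate author $x \sim \mathrm{Unif}(\cdot \mid a_d)$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_x)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \phi_{z_n})$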
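The per-word part of this process (pick an author uniformly, then a topic from that author's mixture, then a word from that topic) can be sketched as below. The function name, the dictionary layout of the author mixtures, and the toy numbers are assumptions for illustration; the Dirichlet draws for the author and topic distributions are taken as given.

```python
import random


def generate_author_topic_doc(doc_authors, theta, phi, n_words, seed=0):
    """Sample one document from the author-topic generative process.

    doc_authors: list of author ids for this document (the set a_d).
    theta: dict mapping author id -> topic distribution.
    phi: phi[k][w] = probability of word w under topic k.
    Returns a list of (author, topic, word) triples, one per position.
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(n_words):
        x = rng.choice(doc_authors)  # author chosen uniformly from a_d
        z = rng.choices(range(len(theta[x])), weights=theta[x], k=1)[0]
        w = rng.choices(range(len(phi[z])), weights=phi[z], k=1)[0]
        tokens.append((x, z, w))
    return tokens


# Degenerate toy parameters: each author writes about exactly one topic.
toy_theta = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
toy_phi = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
tokens = generate_author_topic_doc(["alice", "bob"], toy_theta, toy_phi, n_words=20)
```

The only difference from LDA is where the topic mixture comes from: it belongs to the sampled author rather than to the document.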

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

Learning: Gibbs sampling

[Plate diagram of the author-topic model: nodes a, x, z, w with plates of size N, M, A, and K]

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

• Perplexity results

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

• Topic-Author visualization

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

• Application 1: Author similarity

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

• Application 2: Author entropy
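Assuming author entropy here means the Shannon entropy of an author's topic distribution, it can be computed as in this sketch (the function name and example vectors are illustrative):

```python
import math


def topic_entropy(theta):
    """Shannon entropy (in nats) of a topic distribution; 0 * log 0 is taken as 0."""
    return -sum(p * math.log(p) for p in theta if p > 0.0)


# A focused author versus one spread evenly over four topics (made-up numbers).
focused = topic_entropy([1.0, 0.0, 0.0, 0.0])
broad = topic_entropy([0.25, 0.25, 0.25, 0.25])
```

Under that reading, high entropy marks authors who write across many topics and low entropy marks specialists.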

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Gibbs sampling

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Datasets
  – Enron email data
    • 23,488 messages between 147 users
  – McCallum’s personal email
    • 23,488(?) messages with 128 authors

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Topic Visualization: Enron set

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Topic Visualization: McCallum’s data

Modeling Citation Influences

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Copycat model

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Citation influence model

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Citation influence graph for LDA paper

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Words in LDA paper assigned to citations

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Performance evaluation
  – Data:
    • 22 seed papers and 132 cited papers
    • Users labeled citations on a scale of 1-4
  – Models considered:
    • Citation influence model
    • Copycat model
    • LDA-JS-divergence
      – Symmetric divergence in topic space
    • LDA-post
    • PageRank
    • TF-IDF
  – Evaluation measure:
    • Area under the ROC curve
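The area under the ROC curve equals the probability that a randomly chosen positive item is scored above a randomly chosen negative one, with ties counted as one half. A minimal sketch of that pairwise form follows; the function name and the toy scores are assumptions, and how the 1-4 user labels were turned into binary relevance is not specified here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted influence scores.
    labels: 1 (or True) for relevant citations, 0 (or False) otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores, invented for illustration.
toy_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The quadratic loop is fine for 132 cited papers; a rank-based formula gives the same value more cheaply on large collections.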

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Results

Mixed membership Stochastic Block models [Work In Progress]

• A complete generative model for text and citations

• Can model the topicality of citations
  – Topic Specific PageRank

• Can also predict citations between unseen documents

Summary

• Topic Modeling is an interesting, new framework for community analysis
  – Sound theoretical basis
  – Completely unsupervised
  – Simultaneous modeling of multiple fields
  – Discovers “soft” communities and clusters in terms of “topic” membership
  – Can also be used for predictive purposes