Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models
Ramesh Nallapati
Joint work with John Lafferty, Amr Ahmed, William Cohen and Eric Xing
Machine Learning Department
Carnegie Mellon University
Introduction
• Statistical topic modeling: an attractive framework for topic discovery
– Completely unsupervised
– Models text very well
• Lower perplexity compared to unigram models
– Reveals meaningful semantic patterns
– Can help summarize and visualize document collections
– e.g.: PLSA, LDA, DPM, DTM, CTM, PA
Introduction
• A common assumption in all the variants:
– Exchangeability: the "bag of words" assumption
– Topics represented as a ranked list of words
• Consequences:
– Word correlation information is lost
• e.g.: "white-house" vs. "white" and "house"
• Long-distance correlations are not captured
Introduction
• Objective:
– To capture correlations between words within topics
• Motivation:
– A more interpretable representation of topics as a network of words rather than a list
– Helps better visualize and summarize document collections
– May reveal unexpected relationships and patterns within topics
Past Work: Topic Models
• Bigram topic models [Wallach, ICML 2006]
• Requires KV(V-1) parameters (see the note below)
• Only captures local dependencies
• Does not model sparsity of correlations
• Does not capture “within-topic” correlations
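For reference, the bigram topic model's conditional distribution (a standard formulation; the notation here is ours, since the slide's equation was not preserved) makes the parameter count explicit:

```latex
% Each word is conditioned on both its topic and the previous word:
p(w_i = v \mid w_{i-1} = u,\; z_i = k) = \phi_{k,u,v},
\qquad \sum_{v=1}^{V} \phi_{k,u,v} = 1
% One distribution over V words per (topic, previous-word) pair:
% K topics x V previous words x (V-1) free parameters = KV(V-1).
```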
Past work: Other approaches
• Hyperspace Analog to Language (HAL) [Lund and Burgess, Cog. Sci., '96]
– Word-pair correlation measured as a weighted count of the number of times the two words occur within a fixed-length window
– Weight of an occurrence ∝ 1/(mutual distance) (a sketch follows below)
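A minimal sketch of this weighting scheme, assuming simple whitespace tokens (our own illustration, not the original HAL implementation):

```python
from collections import defaultdict

def hal_cooccurrence(tokens, window=5):
    """Weighted co-occurrence counts: each pair of words appearing
    within `window` positions of each other contributes a weight
    inversely proportional to their distance."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[(w, tokens[j])] += 1.0 / (j - i)
    return counts

# Example: "bank" co-occurs with both "river" and "check",
# so HAL links it to both senses indiscriminately.
print(hal_cooccurrence("the river bank and the bank check".split()))
```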
Past work: Other approaches
• Hyperspace Analog to Language (HAL) [Lund and Burgess, Cog. Sci., '96]
– Pluses:
• Sparse solutions, scalability
– Minuses:
• Only unearths global correlations, not semantic correlations
– e.g.: "river – bank" vs. "bank – check"
• Captures only local dependencies
Past work: Other approaches
• Query expansion in IR
– Similar in spirit: finds words that co-occur strongly with the query words
– However, not a corpus visualization tool: requires a query context to operate on
• WordNet
– Semantic networks
– Human-labeled: not directly related to our goal
Our approach
• L1-norm regularization
– Known to enforce sparse solutions
• Sparsity permits scalability
– Convex optimization problem
• Globally optimal solutions
– Recent advances in learning the structure of graphical models:
• The L1 regularization framework asymptotically recovers the true structure
Background: LASSO
• Example: linear regression
• Regularization used to improve generalizability
– E.g. 1: Ridge regression: L2-norm regularization
– E.g. 2: Lasso: L1-norm regularization (see the objectives below)
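The slide's equations did not survive extraction; the standard ridge and lasso objectives (our reconstruction) are:

```latex
% Linear regression with L2 (ridge) and L1 (lasso) penalties:
\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
\qquad
\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
```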
Background: LASSO
• Lasso encourages sparse solutions
– The corners of the L1 ball drive many coefficients to exactly zero
Background: Gaussian Random Fields
• Multivariate Gaussian distribution: $X = (X_1, \ldots, X_p) \sim \mathcal{N}(\mu, \Sigma)$
• Random field structure: G = (V, E)
– V: set of all variables $\{X_1, \ldots, X_p\}$
– $(s,t) \in E \iff (\Sigma^{-1})_{st} \neq 0$
– $X_s \perp X_u \mid X_{N(s)}$ for all $u \notin N(s)$
Background: Gaussian Random Fields
• Estimating the graph structure of a GRF from data [Meinshausen and Bühlmann, Annals of Statistics, 2006]
– Regress each variable onto all the others, imposing an L1 penalty to encourage sparsity
– Estimated neighborhood: $\hat{N}(s) = \{\, t : \hat{\beta}_{st} \neq 0 \,\}$, the variables with non-zero lasso coefficients (a sketch follows below)
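A minimal sketch of neighborhood selection under these definitions (our illustration using scikit-learn's Lasso, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_neighborhoods(X, lam=0.1):
    """Meinshausen-Buhlmann style neighborhood selection:
    lasso-regress each variable on all the others; the non-zero
    coefficients define that variable's estimated neighbors."""
    n, p = X.shape
    neighbors = {}
    for s in range(p):
        others = np.delete(np.arange(p), s)
        fit = Lasso(alpha=lam).fit(X[:, others], X[:, s])
        neighbors[s] = others[np.abs(fit.coef_) > 1e-10]
    return neighbors

# Toy data: 3 variables, X2 depends on X0 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=500)
print(estimate_neighborhoods(X))
```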
Background: Gaussian Random Fields
[Figure: true graph vs. estimated graph. Courtesy: Meinshausen and Bühlmann, Annals of Statistics, 2006]
Background: Gaussian Random Fields
• Application to topic models: CTM [Blei and Lafferty, NIPS, 2006]
Background: Gaussian Random Fields
• Application to CTM: [Blei & Lafferty, Annals of Applied Statistics, '07]
Structure learning of an MRF
• Ising model (binary MRF; see the equations below)
• L1-regularized conditional likelihood learns the true structure asymptotically [Wainwright, Ravikumar and Lafferty, NIPS '06]
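The Ising model and the per-node conditional it induces (standard forms, reconstructed here because the slide's equations were lost):

```latex
% Ising model over binary variables x_s \in \{-1,+1\}:
p(x;\theta) \propto \exp\Big( \sum_{(s,t) \in E} \theta_{st}\, x_s x_t \Big)
% Each node's conditional is logistic in its neighbors, so an
% L1-penalized logistic regression of x_s on the remaining
% variables recovers the non-zero \theta_{st}, i.e. the edges at s:
p(x_s \mid x_{\setminus s}) =
  \frac{1}{1 + \exp\!\big(-2 x_s \sum_{t \neq s} \theta_{st} x_t\big)}
```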
Structure learning of an MRF
[Figure courtesy: Wainwright, Ravikumar and Lafferty, NIPS '06]
Sparse Word Graphs
• Algorithm (a sketch follows below)
– Run LDA on the document collection and obtain topic assignments
– Convert the topic assignments for each document into K binary vectors X: for topic k, $X^{(k)}_w = 1$ iff word w is assigned to topic k in the document
– Assume an MRF for each topic, with X as the underlying data
– Apply structure learning for the MRF using L1-regularized conditional likelihood
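A compact sketch of the pipeline under the definitions above (our illustration; the per-token topic assignments Z are assumed to come from any fitted LDA implementation, and scikit-learn's L1 logistic regression stands in for the interior-point solver the authors used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sparse_word_graph(Z, V, topic, C=10.0):
    """Z: list of documents, each a list of (word_id, topic_id)
    pairs from a fitted LDA model. Builds the binary matrix X for
    one topic (X[d, w] = 1 iff word w is assigned to `topic` in
    doc d), then L1-logistic-regresses each word on all the
    others; non-zero coefficients become edges in the word graph."""
    X = np.zeros((len(Z), V))
    for d, doc in enumerate(Z):
        for w, k in doc:
            if k == topic:
                X[d, w] = 1
    edges = set()
    for s in range(V):
        y = X[:, s]
        if y.min() == y.max():  # word never/always on-topic: skip
            continue
        others = np.delete(np.arange(V), s)
        fit = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=C).fit(X[:, others], y)
        for t in others[np.abs(fit.coef_[0]) > 1e-10]:
            edges.add(tuple(sorted((s, int(t)))))
    return edges
```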
Sparse Word Graphs
Sparse Word Graphs: Scalability
• We still run V logistic regression problems, each of size V, for each topic: O(KV²)!
– However, each example is very sparse
– The L1 penalty results in sparse solutions
– Each topic can be run in parallel (see the sketch below)
– Efficient interior-point-based L1-regularized logistic regression [Koh, Kim & Boyd, JMLR '07]
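Since the K per-topic jobs share nothing, parallelizing them is trivial; a minimal sketch reusing the hypothetical sparse_word_graph function from the previous slide's example:

```python
from functools import partial
from multiprocessing import Pool

def all_topic_graphs(Z, V, K):
    """Run the K per-topic structure-learning jobs in parallel;
    each topic's V logistic regressions are independent."""
    with Pool(processes=K) as pool:
        return pool.map(partial(sparse_word_graph, Z, V), range(K))
```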
Experiments
• Small AP corpus
– 2.2K docs, 10.5K unique words
• Ran a 10-topic LDA model
• Used λ = 0.1 in the L1 logistic regression
• Took just 45 min. per topic
• Very sparse solutions
– Under 0.1% of the total number of possible edges are selected
Topic “Business”: neighborhood of top LDA terms
Topic “Business”: neighborhood of top edges
Topic “War”: neighborhood of top LDA terms
Topic “War”: neighborhood of top edges
Concluding remarks
• Pros
– A highly scalable algorithm for capturing within-topic word correlations
– Captures both short-distance and long-distance correlations
– Makes topics more interpretable
• Cons
– Not a complete probabilistic model
• A significant modeling challenge, since the correlations are latent
Concluding remarks
• Applications of Sparse Word Graphs
– A better document summarization and visualization tool
– Word sense disambiguation
– Semantic query expansion
• Future work
– Evaluation on a "real task"
– Building a unified statistical model