Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models
Ramesh Nallapati
Joint work with John Lafferty, Amr Ahmed, William Cohen and Eric Xing
Machine Learning Department
Carnegie Mellon University
Introduction
• Statistical topic modeling: an attractive framework for topic discovery
– Completely unsupervised
– Models text very well
• Lower perplexity compared to unigram models
– Reveals meaningful semantic patterns
– Can help summarize and visualize document collections
– e.g.: PLSA, LDA, DPM, DTM, CTM, PA
Introduction
• A common assumption in all the variants:
– Exchangeability: the "bag of words" assumption
– Topics represented as a ranked list of words
• Consequences:
– Word correlation information is lost
• e.g.: "white-house" vs. "white" and "house"
• Long-distance correlations are not captured
Introduction
• Objective:
– To capture correlations between words within topics
• Motivation:
– A more interpretable representation of topics as a network of words rather than a list
– Helps better visualize and summarize document collections
– May reveal unexpected relationships and patterns within topics
Past Work: Topic Models
• Bigram topic models [Wallach, ICML 2006]
• Requires KV(V-1) parameters (see the note below)
• Only captures local dependencies
• Does not model sparsity of correlations
• Does not capture “within-topic” correlations
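For reference, the bigram topic model's conditional distribution (a standard formulation; the notation here is ours, since the slide's equation was not preserved) makes the parameter count explicit:

```latex
% Each word is conditioned on both its topic and the previous word:
p(w_i = v \mid w_{i-1} = u,\; z_i = k) = \phi_{k,u,v},
\qquad \sum_{v=1}^{V} \phi_{k,u,v} = 1
% One distribution over V words per (topic, previous-word) pair:
% K topics x V previous words x (V-1) free parameters = KV(V-1).
```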
Past work: Other approaches
• Hyperspace Analog to Language (HAL) [Lund and Burgess, Cog. Sci., '96]
– Word-pair correlation measured as a weighted count of the number of times the two words occur within a fixed-length window
– Weight of an occurrence ∝ 1/(mutual distance) (a sketch follows below)
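A minimal sketch of this weighting scheme, assuming simple whitespace tokens (our own illustration, not the original HAL implementation):

```python
from collections import defaultdict

def hal_cooccurrence(tokens, window=5):
    """Weighted co-occurrence counts: each pair of words appearing
    within `window` positions of each other contributes a weight
    inversely proportional to their distance."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[(w, tokens[j])] += 1.0 / (j - i)
    return counts

# Example: "bank" co-occurs with both "river" and "check",
# so HAL links it to both senses indiscriminately.
print(hal_cooccurrence("the river bank and the bank check".split()))
```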
Past work: Other approaches
• Hyperspace Analog to Language (HAL) [Lund and Burgess, Cog. Sci., '96]
– Pluses:
• Sparse solutions, scalability
– Minuses:
• Only unearths global correlations, not semantic correlations
– e.g.: "river – bank" vs. "bank – check"
• Captures only local dependencies
Past work: Other approaches
• Query expansion in IR
– Similar in spirit: finds words that co-occur strongly with the query words
– However, not a corpus visualization tool: requires a query context to operate on
• WordNet
– Semantic networks
– Human-labeled: not directly related to our goal
Our approach
• L1-norm regularization
– Known to enforce sparse solutions
• Sparsity permits scalability
– Convex optimization problem
• Globally optimal solutions
– Recent advances in learning the structure of graphical models:
• The L1 regularization framework asymptotically recovers the true structure
Background: LASSO
• Example: linear regression
• Regularization used to improve generalizability
– E.g. 1: Ridge regression: L2-norm regularization
– E.g. 2: Lasso: L1-norm regularization (see the objectives below)
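The slide's equations did not survive extraction; the standard ridge and lasso objectives (our reconstruction) are:

```latex
% Linear regression with L2 (ridge) and L1 (lasso) penalties:
\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
\qquad
\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
```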
Background: LASSO
• Lasso encourages sparse solutions
– The corners of the L1 ball drive many coefficients to exactly zero
Background: Gaussian Random Fields
• Multivariate Gaussian distribution: $X = (X_1, \ldots, X_p) \sim \mathcal{N}(\mu, \Sigma)$
• Random field structure: G = (V, E)
– V: set of all variables $\{X_1, \ldots, X_p\}$
– $(s,t) \in E \iff (\Sigma^{-1})_{st} \neq 0$
– $X_s \perp X_u \mid X_{N(s)}$ for all $u \notin N(s)$
Background: Gaussian Random Fields
• Estimating the graph structure of a GRF from data [Meinshausen and Bühlmann, Annals of Statistics, 2006]
– Regress each variable onto all the others, imposing an L1 penalty to encourage sparsity
– Estimated neighborhood: $\hat{N}(s) = \{\, t : \hat{\beta}_{st} \neq 0 \,\}$, the variables with non-zero lasso coefficients (a sketch follows below)
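A minimal sketch of neighborhood selection under these definitions (our illustration using scikit-learn's Lasso, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_neighborhoods(X, lam=0.1):
    """Meinshausen-Buhlmann style neighborhood selection:
    lasso-regress each variable on all the others; the non-zero
    coefficients define that variable's estimated neighbors."""
    n, p = X.shape
    neighbors = {}
    for s in range(p):
        others = np.delete(np.arange(p), s)
        fit = Lasso(alpha=lam).fit(X[:, others], X[:, s])
        neighbors[s] = others[np.abs(fit.coef_) > 1e-10]
    return neighbors

# Toy data: 3 variables, X2 depends on X0 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=500)
print(estimate_neighborhoods(X))
```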
Background: Gaussian Random Fields
[Figure: true graph vs. estimated graph. Courtesy: Meinshausen and Bühlmann, Annals of Statistics, 2006]
Background: Gaussian Random Fields
• Application to topic models: CTM [Blei and Lafferty, NIPS, 2006]
Background: Gaussian Random Fields
• Application to CTM: [Blei & Lafferty, Annals of Applied Statistics, '07]
Structure learning of an MRF
• Ising model (binary MRF; see the equations below)
• L1-regularized conditional likelihood learns the true structure asymptotically [Wainwright, Ravikumar and Lafferty, NIPS '06]
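The Ising model and the per-node conditional it induces (standard forms, reconstructed here because the slide's equations were lost):

```latex
% Ising model over binary variables x_s \in \{-1,+1\}:
p(x;\theta) \propto \exp\Big( \sum_{(s,t) \in E} \theta_{st}\, x_s x_t \Big)
% Each node's conditional is logistic in its neighbors, so an
% L1-penalized logistic regression of x_s on the remaining
% variables recovers the non-zero \theta_{st}, i.e. the edges at s:
p(x_s \mid x_{\setminus s}) =
  \frac{1}{1 + \exp\!\big(-2 x_s \sum_{t \neq s} \theta_{st} x_t\big)}
```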
Structure learning of an MRF
[Figure courtesy: Wainwright, Ravikumar and Lafferty, NIPS '06]
Sparse Word Graphs
• Algorithm (a sketch follows below)
– Run LDA on the document collection and obtain topic assignments
– Convert the topic assignments for each document into K binary vectors X: for topic k, $X^{(k)}_w = 1$ iff word w is assigned to topic k in the document
– Assume an MRF for each topic, with X as the underlying data
– Apply structure learning for the MRF using L1-regularized conditional likelihood
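A compact sketch of the pipeline under the definitions above (our illustration; the per-token topic assignments Z are assumed to come from any fitted LDA implementation, and scikit-learn's L1 logistic regression stands in for the interior-point solver the authors used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sparse_word_graph(Z, V, topic, C=10.0):
    """Z: list of documents, each a list of (word_id, topic_id)
    pairs from a fitted LDA model. Builds the binary matrix X for
    one topic (X[d, w] = 1 iff word w is assigned to `topic` in
    doc d), then L1-logistic-regresses each word on all the
    others; non-zero coefficients become edges in the word graph."""
    X = np.zeros((len(Z), V))
    for d, doc in enumerate(Z):
        for w, k in doc:
            if k == topic:
                X[d, w] = 1
    edges = set()
    for s in range(V):
        y = X[:, s]
        if y.min() == y.max():  # word never/always on-topic: skip
            continue
        others = np.delete(np.arange(V), s)
        fit = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=C).fit(X[:, others], y)
        for t in others[np.abs(fit.coef_[0]) > 1e-10]:
            edges.add(tuple(sorted((s, int(t)))))
    return edges
```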
Sparse Word Graphs
Sparse Word Graphs: Scalability
• We still run V logistic regression problems, each of size V, for each topic: O(KV²)!
– However, each example is very sparse
– The L1 penalty results in sparse solutions
– Each topic can be run in parallel (see the sketch below)
– Efficient interior-point-based L1-regularized logistic regression [Koh, Kim & Boyd, JMLR '07]
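Since the K per-topic jobs share nothing, parallelizing them is trivial; a minimal sketch reusing the hypothetical sparse_word_graph function from the previous slide's example:

```python
from functools import partial
from multiprocessing import Pool

def all_topic_graphs(Z, V, K):
    """Run the K per-topic structure-learning jobs in parallel;
    each topic's V logistic regressions are independent."""
    with Pool(processes=K) as pool:
        return pool.map(partial(sparse_word_graph, Z, V), range(K))
```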
Experiments
• Small AP corpus
– 2.2K docs, 10.5K unique words
• Ran a 10-topic LDA model
• Used λ = 0.1 in the L1 logistic regression
• Took just 45 min. per topic
• Very sparse solutions
– Under 0.1% of the total number of possible edges are selected
Topic “Business”: neighborhood of top LDA terms
Topic “Business”: neighborhood of top edges
Topic “War”: neighborhood of top LDA terms
Topic “War”: neighborhood of top edges
Concluding remarks
• Pros
– A highly scalable algorithm for capturing within-topic word correlations
– Captures both short-distance and long-distance correlations
– Makes topics more interpretable
• Cons
– Not a complete probabilistic model
• A significant modeling challenge, since the correlations are latent
Concluding remarks
• Applications of Sparse Word Graphs
– A better document summarization and visualization tool
– Word sense disambiguation
– Semantic query expansion
• Future work
– Evaluation on a "real task"
– Building a unified statistical model