

Statistical Modeling of Large Text Collections

Padhraic Smyth
Department of Computer Science
University of California, Irvine

MURI Project Kick-off Meeting
November 18th 2008

3

The Text Revolution

Widespread availability of text in digital form is driving many new applications based on automated text analysis

• Categorization/classification

• Automated summarization

• Machine translation

• Information extraction

• And so on….

• Most of this work is happening in computing, but many of the underlying techniques are statistical

4

Motivation

• NYT: 1.5 million articles
• Medline: 16 million articles
• Pennsylvania Gazette: 80,000 articles, 1728-1800

6

Problems of Interest

• What topics do these documents “span”?

• Which documents are about a particular topic?

• How have topics changed over time?

• What does author X write about?

• and so on…..

Key Ideas:
• Learn a probabilistic model over words and docs
• Treat query-answering as computation of appropriate conditional probabilities

7

Topic Models for Documents

P(words | document) = Σ_topics P(words | topic) P(topic | document)

• Topic = probability distribution over words
• P(topic | document) = mixing coefficients for each document
• Both are automatically learned from the text corpus
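As a rough numerical illustration of this mixture (not from the talk), the sketch below evaluates P(words | document) for a tiny made-up vocabulary; all numbers and variable names are assumptions.

```python
# Minimal sketch (illustrative numbers): P(w | d) = sum_z P(w | z) P(z | d)
import numpy as np

phi = np.array([                  # P(word | topic): 2 topics x 3-word vocabulary
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])
theta_d = np.array([0.25, 0.75])  # P(topic | document) for one document

p_w_given_d = theta_d @ phi       # mixture of the two topic distributions
print(p_w_given_d)                # e.g. [0.25, 0.275, 0.475]
```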


9

Topics = Multinomials over Words

P(w | z)

TOPIC 209
WORD           PROB.
PROBABILISTIC  0.0778
BAYESIAN       0.0671
PROBABILITY    0.0532
CARLO          0.0309
MONTE          0.0308
DISTRIBUTION   0.0257
INFERENCE      0.0253
PROBABILITIES  0.0253
CONDITIONAL    0.0229
PRIOR          0.0219
...            ...

TOPIC 289
WORD           PROB.
RETRIEVAL      0.1179
TEXT           0.0853
DOCUMENTS      0.0527
INFORMATION    0.0504
DOCUMENT       0.0441
CONTENT        0.0242
INDEXING       0.0205
RELEVANCE      0.0159
COLLECTION     0.0146
RELEVANT       0.0136
...            ...

10

Basic Concepts

• Topics = distributions over words
   • Unknown a priori, learned from data

• Documents represented as mixtures of topics

• Learning algorithm
   • Gibbs sampling (stochastic search)
   • Linear time per iteration

• Provides a full probabilistic model over words, documents, and topics

• Query answering = computation of conditional probabilities
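To make "query answering = computation of conditional probabilities" concrete, here is a minimal sketch (ours, not the talk's code): given a learned document-topic matrix theta, answering "which documents are about topic k?" amounts to ranking documents by P(topic = k | document). The matrix and sizes below are assumed for illustration.

```python
# Minimal sketch (illustrative): rank documents by P(topic = k | document) = theta[d, k]
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(20), size=1000)   # stand-in for learned proportions: 1000 docs x 20 topics

def docs_about_topic(theta, k, top_n=5):
    # documents with the highest probability of topic k
    return np.argsort(-theta[:, k])[:top_n]

print(docs_about_topic(theta, k=3))
```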

11

Enron email data

250,000 emails

28,000 individuals

1999-2002

12

Enron email: business topics

Four topics, shown as columns in the original table (topic IDs as labeled on the slide: 23, 36, 72, 54):

Topic (column 1):
  Top words: FEEDBACK 0.0781, PERFORMANCE 0.0462, PROCESS 0.0455, PEP 0.0446, MANAGEMENT 0.03, COMPLETE 0.0205, QUESTIONS 0.0203, SELECTED 0.0187, COMPLETED 0.0146, SYSTEM 0.0146
  Top senders: perfmgmt 0.2195, perf eval process 0.0784, enron announcements 0.0489, *** 0.0089, *** 0.0048

Topic (column 2):
  Top words: PROJECT 0.0514, PLANT 0.028, COST 0.0182, CONSTRUCTION 0.0169, UNIT 0.0166, FACILITY 0.0165, SITE 0.0136, PROJECTS 0.0117, CONTRACT 0.011, UNITS 0.0106
  Top senders: *** 0.0288, *** 0.022, *** 0.0123, *** 0.0111, *** 0.0108

Topic (column 3):
  Top words: FERC 0.0554, MARKET 0.0328, ISO 0.0226, COMMISSION 0.0215, ORDER 0.0212, FILING 0.0149, COMMENTS 0.0116, PRICE 0.0116, CALIFORNIA 0.0110, FILED 0.0110
  Top senders: *** 0.0532, *** 0.0454, *** 0.0384, *** 0.0334, *** 0.0317

Topic (column 4):
  Top words: ENVIRONMENTAL 0.0291, AIR 0.0232, MTBE 0.019, EMISSIONS 0.017, CLEAN 0.0143, EPA 0.0133, PENDING 0.0129, SAFETY 0.0104, WATER 0.0092, GASOLINE 0.0086
  Top senders: *** 0.1339, *** 0.0275, *** 0.0205, *** 0.0166, *** 0.0129

13

Enron: non-work topics…

Four topics, shown as columns in the original table (topic IDs as labeled on the slide: 109, 66, 182, 113):

Topic (column 1):
  Top words: HOLIDAY 0.0857, PARTY 0.0368, YEAR 0.0316, SEASON 0.0305, COMPANY 0.0255, CELEBRATION 0.0199, ENRON 0.0198, TIME 0.0194, RECOGNIZE 0.019, MONTH 0.018
  Top senders: chairman & ceo 0.131, *** 0.0102, *** 0.0046, *** 0.0022, general announcement 0.0017

Topic (column 2):
  Top words: TEXANS 0.0145, WIN 0.0143, FOOTBALL 0.0137, FANTASY 0.0129, SPORTSLINE 0.0129, PLAY 0.0123, TEAM 0.0114, GAME 0.0112, SPORTS 0.011, GAMES 0.0109
  Top senders: cbs sportsline com 0.0866, houston texans 0.0267, houstontexans 0.0203, sportsline rewards 0.0175, pro football 0.0136

Topic (column 3):
  Top words: GOD 0.0357, LIFE 0.0272, MAN 0.0116, PEOPLE 0.0103, CHRIST 0.0092, FAITH 0.0083, LORD 0.0079, JESUS 0.0075, SPIRITUAL 0.0066, VISIT 0.0065
  Top senders: crosswalk com 0.2358, wordsmith 0.0208, *** 0.0107, doctor dictionary 0.0101, *** 0.0061

Topic (column 4):
  Top words: AMAZON 0.0312, GIFT 0.0226, CLICK 0.0193, SAVE 0.0147, SHOPPING 0.0140, OFFER 0.0124, HOLIDAY 0.0122, RECEIVE 0.0102, SHIPPING 0.0100, FLOWERS 0.0099
  Top senders: amazon com 0.1344, jos a bank 0.0266, sharperimageoffers 0.0136, travelocity com 0.0094, barnes & noble com 0.0089

15

Examples of Topics from New York Times

Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX

Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS

Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS

Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP

16

Topic trends from New York Times

330,000 articles, 2000-2002

[Figure: three time-series panels of topic prevalence, Jan 2000 through Jan 2003]

• Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE
• Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING
• Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING

20

What does an author write about?

• Author = Jerry Friedman, Stanford:
   • Topic 1: regression, estimate, variance, data, series, …
   • Topic 2: classification, training, accuracy, decision, data, …
   • Topic 3: distance, metric, similarity, measure, nearest, …

• Author = Rakesh Agrawal, IBM:
   • Topic 1: index, data, update, join, efficient, …
   • Topic 2: query, database, relational, optimization, answer, …
   • Topic 3: data, mining, association, discovery, attributes, …

21

Examples of Data Sets Modeled

• 1,200 Bible chapters (KJV)
• 4,000 Blog entries
• 20,000 PNAS abstracts
• 80,000 Pennsylvania Gazette articles
• 250,000 Enron emails
• 300,000 North Carolina vehicle accident police reports
• 500,000 New York Times articles
• 650,000 CiteSeer abstracts
• 8 million MEDLINE abstracts
• Books by Austen, Dickens, and Melville
• …..

• Exactly the same algorithm used in all cases – and in all cases interpretable topics produced automatically

22

Related Work

• Statistical origins
   • Latent class models in statistics (late 1960s)
   • Admixture models in genetics

• LDA model: Blei, Ng, and Jordan (2003)
   • Variational EM

• Topic model: Griffiths and Steyvers (2004)
   • Collapsed Gibbs sampler

• Alternative approaches
   • Latent semantic indexing (LSI/LSA): less interpretable, not appropriate for count data
   • Document clustering: simpler but less powerful


25

Clusters v. Topics

Hidden Markov Models in Molecular Biology: New Algorithms and Applications
Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure

Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.

[cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov

[topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling

[topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes

One Cluster vs. Multiple Topics

26

Extensions

• Author-topic models
   • Authors = mixtures over topics (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004)

• Special-words model
   • Documents = mixtures of topics + idiosyncratic words (Chemudugunta, Smyth, Steyvers, 2006)

• Entity-topic models
   • Topic models that can reason about entities (Newman, Chemudugunta, Smyth, Steyvers, 2006)

• See also work by McCallum, Blei, Buntine, Welling, Fienberg, Xing, etc.

• Probabilistic basis allows for a wide range of generalizations

27

Combining Models for Networks and Text


31

Technical Approach and Challenges

• Develop flexible probabilistic network models that can incorporate textual information (see the illustrative sketch after this list)
   • e.g., ERGMs with text as node or edge covariates
   • e.g., latent space models with text-based covariates
   • e.g., dynamic relational models with text as edge covariates

• Research challenges
   • Computational scalability: ERGMs are not directly applicable to large text data sets
   • What text representation to use: a high-dimensional "bag of words" or low-dimensional latent topics?
   • Utility of text: does incorporating textual information produce more accurate models or predictions, and how can this be quantified?
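As a loose illustration of "text as covariates" (a simple stand-in, not one of the project's proposed models), the sketch below builds dyadic covariates from per-node topic proportions and fits a plain logistic edge model; every name, size, and number in it is an assumption for illustration only.

```python
# Minimal sketch (illustrative): topic proportions as node covariates for edge prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_nodes, n_topics = 100, 5

# theta[i] = topic proportions for the text associated with node i
# (e.g., obtained by running a topic model over each node's documents)
theta = rng.dirichlet(np.ones(n_topics), size=n_nodes)

# Toy network: nodes with similar topic mixtures are more likely to be linked
def edge_prob(i, j):
    return 1.0 / (1.0 + np.exp(-(10 * theta[i] @ theta[j] - 2)))

pairs = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
y = np.array([rng.random() < edge_prob(i, j) for i, j in pairs], dtype=int)

# Dyadic covariates built from text: elementwise product of the two topic mixtures
X = np.array([theta[i] * theta[j] for i, j in pairs])

model = LogisticRegression(max_iter=1000).fit(X, y)
print("Mean predicted edge probability:", model.predict_proba(X)[:, 1].mean())
```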

Graphical Model

[Plate-style diagram: a group variable z generates the observed words Word 1, Word 2, …, Word n]

Graphical Model

[Plate-style diagram: group variable z generates word w, repeated for the n words in a document]

Graphical Model

[Plate-style diagram: group variable z generates word w, repeated for n words, replicated over D documents]

Mixture Model for Documents

[Plate-style diagram: group probabilities → group variable z → word w, with group-word distributions; repeated for n words in each of D documents]

Clustering with a Mixture Model

[Plate-style diagram: cluster probabilities → cluster variable z → word w, with cluster-word distributions; repeated for n words in each of D documents]
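For concreteness, here is a minimal sketch (ours, not the talk's) of the generative process this mixture/clustering model describes: each document draws a single cluster z and then draws all of its words from that cluster's word distribution. All sizes and variable names are assumptions.

```python
# Minimal sketch (illustrative): mixture-of-multinomials document clustering,
# where each document gets ONE cluster and all words come from that cluster.
import numpy as np

rng = np.random.default_rng(0)
V, K, D, n_words = 1000, 10, 5, 50                  # vocab size, clusters, docs, words per doc

cluster_probs = rng.dirichlet(np.ones(K))           # P(z): cluster probabilities
cluster_word = rng.dirichlet(np.ones(V), size=K)    # P(w | z): one multinomial per cluster

docs = []
for d in range(D):
    z = rng.choice(K, p=cluster_probs)              # a single cluster per document
    words = rng.choice(V, size=n_words, p=cluster_word[z])
    docs.append((z, words))
```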

37

Graphical Model for Topics

[Plate-style diagram: document-topic distributions → topic z → word w, via topic-word distributions; repeated for n words in each of D documents]
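By contrast, the topic model draws a separate topic for every word occurrence. A minimal generative sketch under assumed sizes and hyperparameters (theta = document-topic distributions, phi = topic-word distributions); this is illustrative, not the talk's code.

```python
# Minimal sketch (illustrative): generative process for the topic model.
import numpy as np

rng = np.random.default_rng(0)
V, K, D, n_words = 1000, 20, 5, 100
alpha, beta = 0.1, 0.01                             # assumed Dirichlet hyperparameters

phi = rng.dirichlet(beta * np.ones(V), size=K)      # topic-word distributions
theta = rng.dirichlet(alpha * np.ones(K), size=D)   # document-topic distributions

docs = []
for d in range(D):
    z = rng.choice(K, size=n_words, p=theta[d])     # a topic for EACH word
    w = np.array([rng.choice(V, p=phi[k]) for k in z])
    docs.append(w)
```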

38

Learning via Gibbs sampling

[Same plate diagram as the previous slide: document-topic distributions → topic z → word w via topic-word distributions, for n words in each of D documents]

Gibbs sampler to estimate z for each word occurrence, marginalizing over the other parameters.
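A hedged sketch of a collapsed Gibbs sampler of this kind (illustrative, not the talk's implementation): each word's topic assignment z is resampled from its conditional distribution given all other assignments, with the document-topic and topic-word parameters integrated out. The function name and hyperparameter defaults are assumptions.

```python
# Minimal sketch (illustrative) of collapsed Gibbs sampling for a topic model.
# docs is a list of word-id lists; V is vocabulary size, K the number of topics.
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # document-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initialization

    for d, doc in enumerate(docs):  # initialize the counts from z
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(n_iter):         # one iteration = full pass over all word tokens
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]         # remove the current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # P(z_i = k | all other assignments, words), up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k         # add the new assignment back
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw, nk
```

Each sweep is a full pass over all word tokens, so the cost per iteration is linear in corpus size, matching the "linear time per iteration" point earlier in the talk.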

39

More Details on Learning

• Gibbs sampling for word-topic assignments (z)
   • 1 iteration = full pass through all words in all documents
   • Typically run a few hundred Gibbs iterations

• Estimating θ and φ
   • Use z samples to get point estimates
   • Non-informative Dirichlet priors for θ and φ

• Computational efficiency
   • Learning is linear in the number of word tokens
   • Can still take on the order of a day for 100k or more documents
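Continuing the sketch above (and assuming its count matrices ndk and nkw), point estimates of θ and φ are simple Dirichlet-smoothed ratios of counts; this is an illustrative sketch, not the talk's code.

```python
# Minimal sketch (illustrative): point estimates of theta and phi from Gibbs-sampler counts.
def point_estimates(ndk, nkw, alpha=0.1, beta=0.01):
    K, V = nkw.shape
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)  # P(topic | doc)
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)      # P(word | topic)
    return theta, phi
```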

40

Gibbs Sampler Stability