Structured Topic Models: Jointly Modeling Words and Their Accompanying Modalities


Xuerui Wang

Computer Science Department, University of Massachusetts Amherst

Joint work with Andrew McCallum, Andres Corrada-Emmanuel, Chris Pal, Xing Wei and Natasha Mohanty.

Probabilistic topic models

• Main assumptions:
  – Documents are mixtures of topics
  – Topics are distributions over words that capture word co-occurrence

• Objectives:
  – Understand text using learned topics
  – Represent documents in topic space

Clustering words into topics with Latent Dirichlet Allocation

[Blei, Ng, Jordan 2003]

Generative process:

For each document:
  Sample a distribution over topics, θ
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w

Example:

θ: 70% finance, 30% environment
z: finance
w: “bank”
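The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the two toy topics, their word probabilities, the symmetric Dirichlet parameter and the helper names are all assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document: theta first, then (z, w) for each word position."""
    names = list(topics)
    theta = sample_dirichlet(alpha, rng)           # distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choices(names, weights=theta)[0]   # sample a topic
        vocab = list(topics[z])
        probs = [topics[z][v] for v in vocab]
        w = rng.choices(vocab, weights=probs)[0]   # sample a word from topic z
        words.append((z, w))
    return theta, words


# Hypothetical topics and word probabilities, for illustration only.
TOPICS = {
    "finance": {"bank": 0.5, "loan": 0.3, "money": 0.2},
    "environment": {"river": 0.5, "bank": 0.2, "forest": 0.3},
}

theta, doc = generate_document(TOPICS, alpha=[1.0, 1.0], n_words=8,
                               rng=random.Random(0))
```

Note that "bank" has non-zero probability in both toy topics, which is the kind of ambiguity the per-word topic variable z is meant to resolve.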

Example topics induced from a large collection of text

[Tenenbaum et al.]

• STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL

• MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE

• WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

• DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN

• FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED

• SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES

• BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY

• JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE


Documents are not just text!

• Multiple modalities:
  – Research papers (author, venue, words, etc.)
  – Email messages (sender, recipients, time, words, etc.)
  – Legislative resolutions (voting record, words, etc.)
  – And many more

• Most previous work: one modality at a time
  – Learn topics from words
  – Discover groups from relations
  – Etc.

Outline

• Introduction

• Role and Topic Discovery in Social Networks

• Group and Topic Discovery from Voting Records

• Topics over Time

• Topical Phrase with Markov Assumption

• Conclusions

All possible “topic models” with one latent topic, two observed modalities and two conditional dependencies

Role and Topic Discovery in Social Networks

From LDA to Author-Recipient-Topic

All possible “topic models” with two observed modalities

Inference and Estimation

Gibbs sampling:
– Easy to implement
– Reasonably fast
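To give a sense of why Gibbs sampling is considered easy to implement, here is a minimal collapsed Gibbs sampler. It is a sketch for plain LDA, not the ART sampler from the talk (the slide's update equation did not survive extraction); an ART sampler would additionally resample a recipient for each token. The hyperparameter values, the toy corpus and the function name `gibbs_lda` are assumptions for illustration.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # n(k, w)
    topic_total = [0] * n_topics                               # n(k)
    assign = []
    # Initialize every token with a random topic and record the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional p(z = j | all other assignments).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the new assignment.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return assign, doc_topic, topic_word


# Toy corpus of word ids (hypothetical).
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 2]]
assign, doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4)
```

The sampler only ever touches count tables, which is what keeps each update cheap.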

Enron email corpus

• 250k email messages
• 147 people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com

Topics, and prominent senders / receivers discovered by ART

Topic names assigned by hand.

Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”

Comparing role discovery

connection strength (A,B) =

• Traditional SNA: distribution over recipients
• Author-Topic: distribution over authored topics
• ART: distribution over authored topics
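Each method on this slide represents a person by a probability distribution, so comparing two people reduces to comparing two distributions. The slides do not say which measure is used; the sketch below uses Jensen-Shannon divergence purely as an illustrative assumption, and the example distributions for persons A and B are made up.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-mass entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 only for identical inputs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical topic distributions for two people (assumed values).
person_a = [0.7, 0.2, 0.1]
person_b = [0.1, 0.2, 0.7]

role_distance = js_divergence(person_a, person_b)
```

A smaller divergence would be read as more similar roles; the same function works whether the distribution is over recipients or over topics.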

Comparing role discovery: Tracy Geaconne and Dan McCarty

Traditional SNA    Author-Topic       ART
Similar roles      Different roles    Different roles

Geaconne = “Secretary”
McCarty = “Vice President”

Comparing role discovery: Lynn Blair and Kimberly Watson

Traditional SNA    Author-Topic       ART
Different roles    Very different     Very similar

Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”

McCallum Email Corpus 2004

• January - October 2004
• 23k email messages
• 825 people

From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.
CALO registration receipt.

Thanks,
Kate

Two most prominent topics in discussions with ____?

Words      Prob        Words      Prob
love       0.030514    today      0.051152
house      0.015402    tomorrow   0.045393
donna      0.013659    time       0.041289
time       0.012351    ll         0.039145
great      0.011334    meeting    0.033877
hope       0.011043    week       0.025484
dinner     0.00959     talk       0.024626
saturday   0.009154    meet       0.023279
left       0.009154    morning    0.022789
ll         0.009009    monday     0.020767
roweis     0.008282    back       0.019358
visit      0.008137    call       0.016418
evening    0.008137    free       0.015621
stay       0.007847    home       0.013967
bring      0.007701    won        0.013783
weekend    0.007411    day        0.01311
road       0.00712     hope       0.012987
sunday     0.006829    leave      0.012987
kids       0.006539    office     0.012742
flight     0.006539    tuesday    0.012558

Group and Topic Discovery from Voting Records

Discovering groups from observed set of relations

Admiration relations among six high school students.

Student Roster

Adams, Bennett, Carter, Davis, Edwards, Frederking

Academic Admiration

Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)

Adjacency matrix representing relations

[Figure: three views of the 6×6 adjacency matrix of the Academic Admiration relation: in roster order (A B C D E F); in roster order with group labels (G1 G2 G1 G2 G3 G3); and reordered by group (A C B D E F, with groups G1 G1 G2 G2 G3 G3).]
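The block structure the figure illustrates can be reproduced directly from the Academic Admiration pairs listed in the talk. The sketch below builds the adjacency matrix in roster order and again in the group order A C B D E F; the helper names `adjacency` and `show` are assumptions for the example.

```python
ROSTER = ["A", "B", "C", "D", "E", "F"]

# Academic Admiration pairs (admirer, admired), as listed on the slide.
ACAD = [
    ("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
    ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
    ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C"),
]


def adjacency(relations, order):
    """Build a 0/1 adjacency matrix with rows and columns in the given order."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in relations:
        matrix[index[src]][index[dst]] = 1
    return matrix


def show(matrix, order):
    """Render the matrix as text with row and column labels."""
    lines = ["  " + " ".join(order)]
    for name, row in zip(order, matrix):
        lines.append(name + " " + " ".join(str(x) for x in row))
    return "\n".join(lines)


GROUPED = ["A", "C", "B", "D", "E", "F"]   # G1 G1 G2 G2 G3 G3
roster_view = adjacency(ACAD, ROSTER)
grouped_view = adjacency(ACAD, GROUPED)
print(show(grouped_view, GROUPED))
```

In the grouped ordering the ones fall into solid 2×2 blocks (G1 admires G2, G2 admires G3, G3 admires G1), which is the pattern a group model is trying to recover.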

Group Model: partitioning entities into groups

Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]

S: number of entities

G: number of groups

[Plate diagram: nodes v (plate S²), γ (plate G²) and g (plate S), with hyperparameters α and β; distributions labeled Beta, Dirichlet, Binomial and Multinomial.]

Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]

Two relations with different attributes

Academic Admiration, ordered by group: A C B D E F (G1 G1 G2 G2 G3 G3)

Social Admiration, ordered by group: A C E B D F (G1 G1 G1 G2 G2 G2)

Social Admiration

Soci(A, B) Soci(A, D) Soci(A, F)
Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F)
Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F)
Soci(F, A) Soci(F, C) Soci(F, E)

Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.

budget, funding, annual, cash

document, corrections, review, annual

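The Group-Topic model: discovering groups and topics simultaneously

[Plate diagram: a topic component with nodes w (plate N_b), t (plate B), φ (plate T) and η, with distributions labeled Dirichlet, Multinomial and Uniform; joined to the group component with nodes v (plate S²), γ (plate G²) and g (plate S), hyperparameters α and β, and distributions labeled Beta, Dirichlet, Binomial and Multinomial, replicated over T topics.]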

U.S. Senate data set

• 16 years of voting records in the US Senate (1989 – 2005)

• a Senator may respond Yea or Nay to a resolution

• 3423 resolutions with text attributes (index terms)

• 191 Senators in total across 16 years

S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms

Adams (D-WA), Nay
Akaka (D-HI), Yea
Bentsen (D-TX), Yea
Biden (D-DE), Yea
Bond (R-MO), Yea
Bradley (D-NJ), Nay
Conrad (D-ND), Nay
……

Topics discovered (U.S. Senate)

Mixture of Unigrams

Education     Energy       Military Misc.   Economic
education     energy       government       federal
school        power        military         labor
aid           water        foreign          insurance
children      nuclear      tax              aid
drug          gas          congress         tax
students      petrol       aid              business
elementary    research     law              employee
prevention    pollution    policy           care

Group-Topic Model

Education + Domestic   Foreign        Economic     Social Security + Medicare
education              foreign        labor        social
school                 trade          insurance    security
federal                chemicals      tax          insurance
aid                    tariff         congress     medical
government             congress       income       care
tax                    drugs          minimum      medicare
energy                 communicable   wage         disability
research               diseases       business     assistance

Groups discovered (US Senate)

Groups from topic Education + Domestic

Senators who change coalition the most, depending on topic

e.g. Senator Shelby (D-AL) votes
• with the Republicans on Economic
• with the Democrats on Education + Domestic
• with a small group of maverick Republicans on Social Security + Medicare

Do we get better groups with the GT model?

Baseline Model:

1. Cluster bills into topics using mixture of unigrams;

2. Apply group model on topic-specific subsets of bills.

GT Model:

1. Jointly cluster topics and groups at the same time using the GT model.

Agreement Index (AI) measures group cohesion. Higher is better.

Dataset   Avg. AI for Baseline   Avg. AI for GT   p-value
Senate    0.8198                 0.8294           <.01
UN        0.8548                 0.8664           <.01
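The slides do not give the formula for the Agreement Index. As a rough illustration only, the sketch below uses one definition found in the roll-call voting literature, which scores a group on a single vote from its Yea, Nay and Abstain counts: 1 when the group is unanimous, 0 when it splits evenly three ways. Whether this is exactly the variant behind the numbers above is an assumption, as are the function names and the example counts.

```python
def agreement_index(yea, nay, abstain=0):
    """Cohesion of one group on one roll call (assumed definition, see lead-in)."""
    total = yea + nay + abstain
    if total == 0:
        raise ValueError("no votes cast")
    largest = max(yea, nay, abstain)
    return (largest - 0.5 * (total - largest)) / total


def average_ai(roll_calls):
    """Mean agreement index over (yea, nay, abstain) tuples for one group."""
    return sum(agreement_index(*counts) for counts in roll_calls) / len(roll_calls)


# Hypothetical vote counts for one group on three roll calls.
example = [(10, 0, 0), (5, 5, 0), (4, 4, 4)]
mean_cohesion = average_ai(example)
```

Averaging this score over the groups and roll calls within each topic would give a figure comparable in spirit to the "Avg. AI" columns.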

Topics over Time

Want to model trends over time

• Is prevalence of topic growing or waning?

• Pattern appears only briefly
  – Capture its statistics in focused way
  – Don’t confuse it with patterns elsewhere in time

• How do roles, groups, influence shift over time?

Topics Over Time (TOT)

[Plate diagram, shown in two forms with the same labels: a Dirichlet prior on the multinomial over topics; a topic index; a multinomial over words with a Dirichlet prior, generating the word; and a Beta over time, generating the timestamp.]
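The labels in the TOT diagram suggest a generative process in which every token gets a topic, a word drawn from that topic's multinomial, and a timestamp drawn from that topic's Beta distribution. The sketch below follows that reading. Because a Beta distribution has support on [0, 1], timestamps are assumed to be rescaled to that interval; all parameter values and names (`phi`, `psi`, `generate_tot_document`) are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_tot_document(phi, psi, alpha, n_words, rng):
    """Generate (topic, word, timestamp) triples for one document."""
    theta = sample_dirichlet(alpha, rng)                 # multinomial over topics
    tokens = []
    for _ in range(n_words):
        z = rng.choices(range(len(phi)), weights=theta)[0]          # topic index
        vocab = list(phi[z])
        w = rng.choices(vocab, weights=[phi[z][v] for v in vocab])[0]  # word
        a, b = psi[z]
        t = rng.betavariate(a, b)                        # timestamp in [0, 1]
        tokens.append((z, w, t))
    return tokens


# Hypothetical parameters: topic 0 peaks early in time, topic 1 peaks late.
phi = [
    {"tariff": 0.6, "treasury": 0.4},
    {"terrorism": 0.5, "internet": 0.5},
]
psi = [(2.0, 8.0), (8.0, 2.0)]

tokens = generate_tot_document(phi, psi, alpha=[1.0, 1.0], n_words=6,
                               rng=random.Random(1))
```

Tying each topic to its own Beta is what lets a topic be localized in time instead of being spread evenly over the whole collection.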


State of the Union addresses

208 addresses delivered between January 8, 1790 and January 29, 2002.

To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stop words were removed.

• 17,156 ‘documents’

• 21,534 words

• 669,425 tokens

Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.

1910
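The preprocessing described for this corpus (split into paragraphs, drop one-line paragraphs, remove stop words) might look roughly like the sketch below. The details are assumptions: paragraphs are taken to be separated by blank lines, a "one-line paragraph" is taken to mean a paragraph occupying a single line of the source file, and the stop list is a tiny stand-in for whatever list was actually used.

```python
import re

# Tiny illustrative stop list (assumption; the real list is not given).
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "this", "by",
              "which", "from", "it", "be", "that"}


def paragraphs_to_documents(address, stop_words=STOP_WORDS):
    """Split one address into paragraph 'documents' of lowercased tokens."""
    documents = []
    for paragraph in re.split(r"\n\s*\n", address.strip()):
        lines = [ln for ln in paragraph.splitlines() if ln.strip()]
        if len(lines) < 2:          # exclude one-line paragraphs
            continue
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", paragraph.lower())
        documents.append([t for t in tokens if t not in stop_words])
    return documents


sample = (
    "Fellow citizens:\n"
    "\n"
    "Our scheme of taxation consists of a tariff\n"
    "levied upon importations from abroad.\n"
    "\n"
    "Thank you."
)
documents = paragraphs_to_documents(sample)
```

Only the middle paragraph survives here, since the greeting and the closing each fit on one line.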

Comparing TOT against LDA

Topic Distributions Conditioned on Time

[Figure: topic mass (in vertical height) over time in NIPS conference papers.]

TOT improves ability to predict time

Predicting the year of a State-of-the-Union address.

L1 = distance between predicted year and actual year.
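The slide does not spell out how a year is predicted. One plausible reading, sketched below as an assumption, is to pick the candidate timestamp that maximizes the summed log Beta density of the topics assigned to a document's tokens, then score the prediction by its L1 distance from the true year. The Beta parameters, candidate grid and function names are hypothetical.

```python
import math


def log_beta_pdf(t, a, b):
    """Log density of Beta(a, b) at t, for 0 < t < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)


def predict_time(token_topics, psi, candidates):
    """Pick the candidate time with the highest total log density."""
    def score(t):
        return sum(log_beta_pdf(t, *psi[z]) for z in token_topics)
    return max(candidates, key=score)


def l1_error(predicted_year, actual_year):
    """L1 distance between predicted and actual year."""
    return abs(predicted_year - actual_year)


# Hypothetical per-topic Beta parameters and token topic assignments.
psi = {0: (2.0, 8.0), 1: (8.0, 2.0)}
token_topics = [1, 1, 1, 0]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]   # normalized timestamps

best = predict_time(token_topics, psi, candidates)
```

With three tokens from the late-peaking topic and one from the early-peaking topic, the best candidate on this grid is 0.7; mapping it back to a calendar year depends on how timestamps were normalized.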

Topical Phrase with Markov Assumption

Topic Interpretability

LDA: algorithms, algorithm, genetic, problems, efficient

Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

Topics modeling phrases

• Topics based only on unigrams are often difficult to interpret

• Topic discovery itself is confused because important meaning / distinctions are carried by phrases.

• Significant opportunity to provide improved language models to ASR, MT, IR, etc.

Topical N-Gram model

[Graphical model: topics z1 ... z4, words w1 ... w4 and variables y1 ... y4 for consecutive word positions; plates D, T, W and TW; hyperparameters α, β, γ1 and γ2.]

Features of Topical N-Grams model

• Easily trained by Gibbs sampling
  – Can run efficiently on millions of words

• Topic-specific phrase discovery
  – “white house” has special meaning as a phrase in the politics topic,
  – ... but not in the real estate topic.
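Topic-specific phrase discovery can be illustrated by how phrases are read off once every token has a topic and a binary indicator saying whether it attaches to the previous word. The sketch below assumes that reading of the y variables, and additionally assumes that a phrase takes the topic of its last word; the token sequence, topics and indicators are invented for the example.

```python
def merge_phrases(words, topics, y):
    """Group consecutive words into n-grams using per-word indicators.

    y[i] == 1 means word i continues a phrase started by the previous word.
    Assumption: a phrase inherits the topic of its last word.
    """
    phrases = []
    for word, topic, flag in zip(words, topics, y):
        if flag == 1 and phrases:
            previous_words, _ = phrases[-1]
            phrases[-1] = (previous_words + [word], topic)
        else:
            phrases.append(([word], topic))
    return [(" ".join(ws), topic) for ws, topic in phrases]


# Hypothetical sampled state: the same two words under two different topics.
words = ["white", "house", "statement", "white", "house", "listing"]
topics = ["politics"] * 3 + ["real_estate"] * 3
y = [0, 1, 0, 0, 0, 0]

phrases = merge_phrases(words, topics, y)
```

Here "white house" is merged into one unit under the politics topic but stays as two separate words under the real estate topic, mirroring the example on the slide.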

NIPS research papers

• Full text of NIPS papers between 1987-1999.

• 1,740 research papers in total.

• 13,649 unique words and 2,301,375 word tokens.

• Stop words removed and no stemming.

“Reinforcement Learning”

LDA: state, learning, policy, action, reinforcement, states, time, optimal, actions, function, algorithm, reward, step, dynamic, control, sutton, rl, decision, algorithms, agent

Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning RL, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods

Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

“Support Vector Machines”

LDA: kernel, linear, vector, support, set, nonlinear, data, algorithm, space, pca, function, problem, margin, vectors, solution, training, svm, kernels, matrix, machines

Topical N-grams (2+): support vectors, test error, support vector machines, training error, feature space, training examples, decision function, cost functions, test inputs, kkt conditions, leave-one-out procedure, soft margin, bayesian transduction, training patterns, training points, maximum margin, strictly convex, regularization operators, base classifiers, convex optimization

Topical N-grams (1): kernel, training, support, margin, svm, solution, kernels, regularization, adaboost, test, data, generalization, examples, cost, convex, algorithm, working, feature, sv, functions

Word dependencies in information retrieval

• Long-distance dependency: topical (semantic) dependency helps [Hofmann, 1999; Wei and Croft, 2006].

• Short-distance dependency: phrases (usually discovered by separate modules) can boost IR performance [Fagan, 1989; Evans et al., 1991; Strzalkowski, 1995; Mitra et al., 1997].

• TNG captures both simultaneously.

San Jose Mercury News (TREC)

• Covers materials from San Jose Mercury News in 1991

• With TREC queries 51-150

• 90,257 documents in total, 255,686 unique words and 17,574,989 word tokens.

• Stop words removed and no stemming.

<DOC>
<DOCNO> SJMN91-06364022 </DOCNO>
<ACCESS> 06364022 </ACCESS>
<CAPTION> Photo; PHOTO: Associated Press; MONSTER MASH -- Kentucky's Jamal MashBurn shows his stuff in the Wildcats' 103-89 victory over state rival Louisville on Saturday. Mashburn had 25 points. </CAPTION>
<DESCRIPT> COLLEGE; BASKETBALL; GAME; RESULT; RANKING; SCHOOL </DESCRIPT>
<LEADPARA> Arizona had a 24-point night from Sean Rooks, a height advantage and strong defense, but still struggled to an 83-76 victory over Evansville in the Fiesta Bowl Classic in Tucson, Ariz., on Saturday.; The victory moved the No. 6 Wildcats into the championship of their tournament for the seventh straight time. </LEADPARA>
<SECTION> Sports </SECTION>
<HEADLINE> ARIZONA EDGES EVANSVILLE……

Ad-hoc retrieval on SJMN

Clearly contain phrases

No phrases due to stopping and punctuation removal

Mixed results on many other queries.

* indicates statistically significant differences in performance with 95% confidence according to the Wilcoxon test


All possible “topic models” with two observed modalities (revisit)

ART, GT, TOT, TNG

Conclusions

• With carefully designed model structures, we can utilize multi-modality information.

• Choices of configuration are task dependent.

• Better results are obtained from joint inference on various tasks.