Topic modeling
Mark Steyvers
Department of Cognitive Sciences, University of California, Irvine
Some topics we can discuss
• Introduction to LDA: basic topic model
• Preliminary work on therapy transcripts
• Extensions to LDA
– Conditional topic models (for predicting behavioral codes)
– Various topic models for word order
– Topic models incorporating parse trees
– Topic models for dialogue
– Topic models incorporating speech information
Most basic topic model: LDA (Latent Dirichlet Allocation)
Automatic and unsupervised extraction of semantic themes from large text collections, e.g.:

• NYT: 330,000 articles
• Enron: 250,000 emails
• Medline: 16 million articles
• NSF/NIH: 100,000 grants
• AOL queries: 20,000,000 queries from 650,000 users
• Pennsylvania Gazette (1728-1800): 80,000 articles
Model Input
• Matrix of counts: number of times words occur in documents
• Note:
– word order is lost: “bag of words” approach
– some function words are deleted: “the”, “a”, “in”
[Figure: word-document count matrix. Rows are words (e.g., FOOD, ITALIAN, PASTA, PIZZA), columns are documents (Doc1, Doc2, Doc3, …); each cell holds the number of times a word occurs in a document.]
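As a concrete illustration (not from the talk), here is a minimal sketch of building such a count matrix with scikit-learn's CountVectorizer; the three documents are made up, and the built-in English stop list removes function words such as "the" and "a":

```python
# Minimal sketch: build a word-document count matrix (bag of words).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the pizza and pasta at the italian place",
    "italian food with pasta and pizza",
    "food stores selling pizza",
]

vectorizer = CountVectorizer(stop_words="english")  # drops "the", "a", "in", ...
X = vectorizer.fit_transform(docs)                  # documents x words, sparse

print(vectorizer.get_feature_names_out())           # vocabulary
print(X.toarray())                                  # counts; word order is lost
```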
Basic Assumptions
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word in a document originates from a single topic
Document = mixture of topics
[Figure: two example documents. One document is a mixture of two topics (20% / 80%); the other uses a single topic (100%). Example topics, each a distribution over words:
• brain, fmri, imaging, functional, mri, subjects, magnetic, resonance, neuroimaging, structural
• schizophrenia, patients, deficits, schizophrenic, psychosis, subjects, psychotic, dysfunction, abnormalities, clinical
• memory, working, memories, tasks, retrieval, encoding, cognitive, processing, recognition, performance
• disease, ad, alzheimer, diabetes, cardiovascular, insulin, vascular, blood, clinical, individuals
• auto, car, parts, cars, used, ford, honda, truck, toyota
• hannah, montana, zac, efron, disney, high school, musical, miley cyrus, hilary duff
• webmd, cymbalta, xanax, gout, vicodin, effexor, prednisone, lexapro, ambien
• party, store, wedding, birthday, jewelry, ideas, cards, cake, gifts]
Generative Process
• For each document, choose a mixture of topics: θ ~ Dirichlet(α)
• For each word, sample a topic z ∈ [1..T] from the mixture: z ~ Multinomial(θ)
• Sample a word from that topic: w ~ Multinomial(φ(z)), where each topic φ ~ Dirichlet(β)

[Figure: LDA in plate notation. θ is drawn once per document (plate D), z and w once per word token (plate Nd), and φ once per topic (plate T).]
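The generative story is easy to simulate. Below is a minimal numpy sketch; T, W, and the hyperparameter values are illustrative choices, not values from the talk:

```python
# Minimal sketch of the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
T, W, alpha, beta = 2, 5, 0.5, 0.5        # topics, vocabulary size, priors

phi = rng.dirichlet([beta] * W, size=T)   # one word distribution per topic

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * T)    # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)        # sample a topic from the mixture
        words.append(rng.choice(W, p=phi[z]))  # sample a word from that topic
    return words

docs = [generate_document(20) for _ in range(16)]
```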
Prior Distributions

• Dirichlet priors encourage sparsity on topic mixtures and topics:

θ ~ Dirichlet( α )    φ ~ Dirichlet( β )

[Figure: Dirichlet densities on two probability simplices, one with corners Topic 1, Topic 2, Topic 3 and one with corners Word 1, Word 2, Word 3 (darker colors indicate lower probability).]
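A quick numerical illustration (mine, not the talk's) of how the concentration parameter controls sparsity:

```python
# Small Dirichlet parameters yield samples near the corners of the simplex.
import numpy as np

rng = np.random.default_rng(0)
print(rng.dirichlet([0.1, 0.1, 0.1]))     # sparse: one component dominates
print(rng.dirichlet([10.0, 10.0, 10.0]))  # dense: close to uniform
```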
Statistical Inference
• Three sets of latent variables:
– document-topic distributions θ
– topic-word distributions φ
– topic assignments z
• Estimate posterior distribution over topic assignments P( z | w )
– we “collapse” over topic mixtures and word mixtures
– we can later infer θ and φ
• Use approximate methods: Markov chain Monte Carlo (MCMC) with Gibbs sampling
Toy Example: Artificial Dataset
• Two topics, 16 documents, vocabulary {River, Stream, Bank, Money, Loan}
• True topic-word probabilities:

word     topic 1   topic 2
River     0.33      0
Stream    0.33      0
Bank      0.33      0.33
Money     0         0.33
Loan      0         0.33

• Can we recover the original topics and topic mixtures from this data?
• Initialization: assign word tokens randomly to topics (● = topic 1; ○ = topic 2)

[Figure: documents 1-16 as rows, word counts over River, Stream, Bank, Money, Loan as columns, with each token's random initial topic assignment shown.]
Gibbs Sampling
$$ p(z_i = t \mid z_{-i}) \;\propto\; \frac{n^{-i}_{td} + \alpha}{\sum_{t'} n^{-i}_{t'd} + T\alpha} \cdot \frac{n^{-i}_{wt} + \beta}{\sum_{w'} n^{-i}_{w't} + W\beta} $$

• n_td: count of topic t assigned to doc d
• n_wt: count of word w assigned to topic t
• the superscript -i excludes the current token i; the left-hand side is the probability that word i is assigned to topic t
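A minimal collapsed Gibbs sampler implementing this equation (a sketch under the notation above, not the talk's code); docs is a list of token-id lists, e.g. the toy dataset with River…Loan mapped to ids 0-4:

```python
# Minimal collapsed Gibbs sampler for LDA.
import numpy as np

def gibbs_lda(docs, T, W, alpha=0.5, beta=0.5, iters=32, seed=0):
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((W, T))                    # word-topic counts
    n_td = np.zeros((T, len(docs)))            # topic-document counts
    z = [[int(rng.integers(T)) for _ in doc] for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1
            n_td[z[d][i], d] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_wt[w, t] -= 1                # remove token i from the counts
                n_td[t, d] -= 1
                p = ((n_td[:, d] + alpha) / (n_td[:, d].sum() + T * alpha) *
                     (n_wt[w, :] + beta) / (n_wt.sum(axis=0) + W * beta))
                t = rng.choice(T, p=p / p.sum())   # sample a new assignment
                z[d][i] = t
                n_wt[w, t] += 1                # restore the counts
                n_td[t, d] += 1
    phi = (n_wt + beta) / (n_wt.sum(axis=0) + W * beta)      # P( w | z )
    theta = (n_td + alpha) / (n_td.sum(axis=0) + T * alpha)  # P( z | d )
    return phi, theta, z
```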
After 1, 4, 8, and 32 iterations

• Apply the sampling equation to each word token in turn

[Figures: token-level topic assignments (● = topic 1; ○ = topic 2) for documents 1-16 after 1, 4, 8, and 32 iterations; the assignments gradually separate into the two topics.]
Estimated topic-word probabilities after 32 iterations (close to the true topics):

word     topic 1   topic 2
River     0.42      0
Stream    0.29      0.05
Bank      0.28      0.31
Money     0         0.29
Loan      0         0.35
Summary of Algorithm

INPUT: word-document counts (word order is irrelevant)

OUTPUT:
• topic assignments to each word: P( zi )
• likely words in each topic: P( w | z )
• likely topics in each document (“gist”): P( z | d )
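For day-to-day use, a library sketch: scikit-learn's LatentDirichletAllocation fits the same model, though with variational inference rather than Gibbs sampling. A hedged usage example with made-up documents:

```python
# Usage sketch: fit LDA and read off the two output distributions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["river bank stream", "money loan bank", "bank loan money river"]
X = CountVectorizer().fit_transform(docs)    # word-document counts

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.components_)   # per-topic word weights, proportional to P( w | z )
print(lda.transform(X))  # per-document topic mixtures, P( z | d )
```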
Example topics from TASA: an educational corpus
• 37K docs, 26K word vocabulary
• 300 topics, e.g.:

[Figure: P( w | z ) bar charts for four example topics.]

TOPIC 77: MUSIC, DANCE, SONG, PLAY, SING, SINGING, BAND, PLAYED, SANG, SONGS, DANCING, PIANO, PLAYING, RHYTHM
TOPIC 82: LITERATURE, POEM, POETRY, POET, PLAYS, POEMS, PLAY, LITERARY, WRITERS, DRAMA, WROTE, POETS, WRITER, SHAKESPEARE
TOPIC 137: RIVER, LAND, RIVERS, VALLEY, BUILT, WATER, FLOOD, WATERS, NILE, FLOWS, RICH, FLOW, DAM, BANKS
TOPIC 254: READ, BOOK, BOOKS, READING, LIBRARY, WROTE, WRITE, FIND, WRITTEN, PAGES, WORDS, PAGE, AUTHOR, TITLE
Three documents with the word “play” (numbers & colors indicate topic assignments)

A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ... He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077... Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166....
Topic model vs. LSA as matrix factorization:

• Topic model: C = F Q, where C is the normalized word-document co-occurrence matrix (words × documents), F contains the mixture components, i.e. the topics (words × topics), and Q the mixture weights (topics × documents).
• LSA: C = U D Vᵀ, with U (words × dims), D (dims × dims), and Vᵀ (dims × documents).
Documents as Topic Mixtures: a Geometric Interpretation

[Figure: the probability simplex over three words, with axes P(word1), P(word2), P(word3) and the constraint P(word1) + P(word2) + P(word3) = 1. Topic 1 and topic 2 are points on the simplex; each observed document lies on the segment between the topics it mixes.]
Some Preliminary Work on Therapy Transcripts
Defining documents
• Can define “document” in multiple ways
– all words within a therapy session
– all words from a particular speaker within a session
• Clearly we need to extend the topic model to dialogue…
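A minimal sketch of the two document definitions, assuming a hypothetical (session, speaker, words) utterance format:

```python
# Build "documents" per session and per speaker-within-session.
from collections import defaultdict

# each utterance: (session_id, speaker, list of words); data is made up
utterances = [
    ("s1", "husband", ["we", "never", "talk"]),
    ("s1", "wife", ["i", "feel", "lonely"]),
    ("s1", "husband", ["that", "is", "not", "fair"]),
]

by_session = defaultdict(list)
by_speaker = defaultdict(list)
for session, speaker, words in utterances:
    by_session[session].extend(words)             # all words within a session
    by_speaker[(session, speaker)].extend(words)  # per speaker within a session
```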
Positive/Negative Topic Usage by Group

[Figure: bar chart of P( topic ) for the positive topic and the negative topic, comparing controls with pre-therapy couples; y-axis from 0 to 0.14.]
Positive/Negative Topic Usage by Changes in Satisfaction

[Figure: bar chart of P( topic ) for the positive topic and the negative topic, comparing couples with a positive change in satisfaction to couples with a negative change; y-axis from 0 to 0.12.]

This graph shows that couples with a decrease in satisfaction over the course of therapy use relatively more negative language; those who leave therapy with increased satisfaction exhibit more positive language.
Topics used by Satisfied/ Unsatisfied Couples
[Figure: scatter plot of usage probabilities for topics T1-T200; the x-axis is usage by couples with a positive change in satisfaction, the y-axis usage by couples with a negative change in satisfaction.]

Topic 38: talk, divorce, problem, house, along, separate, separation, talking, agree, example

Dissatisfied couples talk relatively more often about separation and divorce.
Affect Dynamics
Analyze the short-term dynamics of affect usage: do unhappy couples follow up negative language with negative language more often than happy couples? In other words, are unhappy couples involved in a negative feedback loop?

Calculated: P( z2=+ | z1=+ ), P( z2=+ | z1=- ), P( z2=- | z1=+ ), P( z2=- | z1=- )

E.g., P( z2=- | z1=+ ) is the probability that after a positive word the next non-neutral word will be a negative word.
Markov Chain Illustration

group              P(-|+)  P(+|+)  P(+|-)  P(-|-)  base P(+)  base P(-)
Normal Controls     .27     .73     .28     .72      .51        .49
Positive Change     .33     .67     .27     .73      .45        .55
Little Change       .37     .63     .22     .78      .38        .62
Negative Change     .41     .59     .22     .78      .35        .65
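A minimal sketch of how such transition probabilities can be estimated from a sequence of non-neutral affect labels (the sequence here is made up):

```python
# Estimate P( z2 | z1 ) from bigram counts over +/- labels.
from collections import Counter

labels = ["+", "+", "-", "+", "-", "-", "-", "+"]  # hypothetical sequence

pairs = Counter(zip(labels, labels[1:]))           # bigram counts
for prev in "+-":
    total = sum(pairs[(prev, nxt)] for nxt in "+-")
    for nxt in "+-":
        print(f"P(z2={nxt} | z1={prev}) = {pairs[(prev, nxt)] / total:.2f}")
```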
Modeling Extensions
Extensions
• Multi-label Document Classification
– conditional topic models
• Topic models and word order
– n-grams/collocations
– hidden Markov models
• Some potential model developments:
– topic models incorporating parse trees
– topic models for dialogue
– topic models incorporating speech information
Conditional Topic Models
Assume there is a topic associated with each label/behavioral code. The model is only allowed to assign words to topics whose labels are associated with the document.

This model can learn the distribution of words associated with each label/behavioral code (see the sketch after the figure below).
[Figure: documents and their word-level topic assignments. Each document carries behavioral-code labels, e.g. Vulnerability=yes/no and Hard Expression=yes/no; its topic weights, and hence the topic assignments of its words, are restricted to the topics associated with the codes that apply.]
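A minimal sketch of how the label constraint could enter the Gibbs sampler shown earlier: when resampling a token's topic, only the topics tied to the document's behavioral codes are candidates. Names follow the gibbs_lda sketch above and are illustrative, not the talk's implementation:

```python
# Label-constrained topic resampling for one token.
import numpy as np

def sample_topic(w, d, doc_labels, n_wt, n_td, alpha, beta, rng):
    W, _ = n_wt.shape
    allowed = np.array(sorted(doc_labels[d]))  # topics allowed for this doc
    # the document-length denominator is constant over topics, so it is dropped
    p = ((n_td[allowed, d] + alpha) *
         (n_wt[w, allowed] + beta) / (n_wt[:, allowed].sum(axis=0) + W * beta))
    return allowed[rng.choice(len(allowed), p=p / p.sum())]
```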
Preliminary Results

HEX (Hard Expressions):
[child_making_noises] 0.02250, [sighs] 0.01715, alice 0.01661, furniture 0.01447, brother 0.01340, literally 0.01072, [oldest_daughter] 0.00911, jason 0.00804, shit 0.00804, budget 0.00697, cold 0.00697, [child] 0.00643, da 0.00643, dad 0.00643, diaper 0.00643, lie 0.00643, [youngest_daughter] 0.00590, criticize 0.00590, enemy 0.00590, expression 0.00590, mmhmm 0.00590, monitoring 0.00590, months 0.00590, mad 0.00536, piece 0.00536, pssht 0.00536, carpet 0.00483, floors 0.00483, huge 0.00483, techniques 0.00483

vul (Vulnerability):
care 0.02262, love 0.01996, [sigh] 0.01775, home 0.01730, life 0.01730, talking 0.01553, family 0.01508, baby 0.01375, brother 0.01375, [sighs] 0.01021, difficult 0.01021, school 0.01021, guy 0.00976, talk 0.00976, da 0.00932, heart 0.00932, sister 0.00932, appreciate 0.00887, connection 0.00887, stress 0.00887, class 0.00799, lonely 0.00799, yesterday 0.00799, called 0.00754, emotional 0.00754, mom 0.00754, mother 0.00754, hard 0.00710, phone 0.00710, read 0.00710
Topic Models for short-range sequential dependencies
Hidden Markov Topics Model
• Syntactic dependencies: short-range; semantic dependencies: long-range

[Figure: graphical model. A hidden chain of syntactic states s generates the word sequence w; one designated semantic state generates its words from the topic model (topic assignments z), while the other syntactic states generate words from HMM emission distributions.]
(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
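A minimal generative sketch of this composite model; the parameter values are illustrative, not those of Griffiths et al. (2004):

```python
# HMM topics model: one semantic state emits from the topic model,
# the other (syntactic) states emit from state-specific distributions.
import numpy as np

rng = np.random.default_rng(0)
S, T, W = 3, 2, 10                        # syntactic states, topics, vocab
trans = rng.dirichlet([1.0] * S, size=S)  # HMM transition matrix
emit = rng.dirichlet([1.0] * W, size=S)   # per-state emission distributions
phi = rng.dirichlet([1.0] * W, size=T)    # topic-word distributions
SEM = 0                                   # designate state 0 as semantic

def generate(n_words):
    theta = rng.dirichlet([1.0] * T)      # document topic mixture
    s, words = 0, []
    for _ in range(n_words):
        s = rng.choice(S, p=trans[s])
        if s == SEM:                      # semantic state: use the topic model
            z = rng.choice(T, p=theta)
            words.append(rng.choice(W, p=phi[z]))
        else:                             # syntactic state: use the HMM
            words.append(rng.choice(W, p=emit[s]))
    return words

doc = generate(15)
```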
NIPS Semantics (example topics):
• EXPERTS, EXPERT, GATING, HME, ARCHITECTURE, MIXTURE, LEARNING, MIXTURES, FUNCTION, GATE
• DATA, GAUSSIAN, MIXTURE, LIKELIHOOD, POSTERIOR, PRIOR, DISTRIBUTION, EM, BAYESIAN, PARAMETERS
• STATE, POLICY, VALUE, FUNCTION, ACTION, REINFORCEMENT, LEARNING, CLASSES, OPTIMAL, *
• MEMBRANE, SYNAPTIC, CELL, *, CURRENT, DENDRITIC, POTENTIAL, NEURON, CONDUCTANCE, CHANNELS
• IMAGE, IMAGES, OBJECT, OBJECTS, FEATURE, RECOGNITION, VIEWS, #, PIXEL, VISUAL
• KERNEL, SUPPORT, VECTOR, SVM, KERNELS, #, SPACE, FUNCTION, MACHINES, SET
• NETWORK, NEURAL, NETWORKS, OUTPUT, INPUT, TRAINING, INPUTS, WEIGHTS, #, OUTPUTS

NIPS Syntax (example classes):
• MODEL, ALGORITHM, SYSTEM, CASE, PROBLEM, NETWORK, METHOD, APPROACH, PAPER, PROCESS
• IS, WAS, HAS, BECOMES, DENOTES, BEING, REMAINS, REPRESENTS, EXISTS, SEEMS
• SEE, SHOW, NOTE, CONSIDER, ASSUME, PRESENT, NEED, PROPOSE, DESCRIBE, SUGGEST
• USED, TRAINED, OBTAINED, DESCRIBED, GIVEN, FOUND, PRESENTED, DEFINED, GENERATED, SHOWN
• IN, WITH, FOR, ON, FROM, AT, USING, INTO, OVER, WITHIN
• HOWEVER, ALSO, THEN, THUS, THEREFORE, FIRST, HERE, NOW, HENCE, FINALLY
• #, *, I, X, T, N, -, C, F, P
Random sentence generation

LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
Collocation Topic Model
Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX

Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS

Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS

Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP
Potential Model Developments

Using parse trees / POS taggers?

[Figure: parse trees for “You complete me” and “I complete you”, each S → NP VP; a bag-of-words model cannot distinguish the two.]
Modeling Dialogue
Topic Segmentation Model
• Purver, Kording, Griffiths, & Tenenbaum (2006). Unsupervised topic modeling for multi-party spoken discourse. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
• Automatically segments multi-party discourse into topically coherent segments
• Outperforms standard HMMs
• Model does not incorporate speaker information or speaker turns
– goal is simply to segment a long stream of words into segments

At each utterance, there is a probability of changing θ, the topic mixture. If no change is indicated, words are drawn from the same mixture of topics. If there is a change, the topic mixture is resampled from the Dirichlet prior (see the sketch below).
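A minimal generative sketch of that switching process; p_switch and all other values are illustrative assumptions, not parameters from the paper:

```python
# Per-utterance topic-mixture switching for topic segmentation.
import numpy as np

rng = np.random.default_rng(0)
T, W, alpha, p_switch = 5, 50, 0.5, 0.1
phi = rng.dirichlet([0.5] * W, size=T)     # topic-word distributions

theta = rng.dirichlet([alpha] * T)         # initial topic mixture
utterances = []
for length in [8, 12, 6, 9]:               # hypothetical utterance lengths
    if rng.random() < p_switch:            # segment boundary: resample theta
        theta = rng.dirichlet([alpha] * T)
    utterances.append([rng.choice(W, p=phi[rng.choice(T, p=theta)])
                       for _ in range(length)])
```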
Latent Dialogue Structure model, Ding et al. (NIPS workshop, 2009)

• Designed for modeling sequences of messages on discussion forums
• Models the relationship of messages within documents: a message might relate to any previous message within a dialogue
• It does not incorporate speaker-specific variables
Learning User Intentions in Spoken Dialogue Systems, Chinaei et al. (ICAART, 2009)

• Applies the HTMM model (Gruber et al., 2007) to dialogue
• Assumes that within each talk-turn, words are drawn from the same topic z (not a mixture!); at the start of a new talk-turn, there is some probability ψ of sampling a new topic z from the mixture θ
Other ideas

• Can we enhance topic models with non-verbal speech information?
• Each topic is a distribution over words as well as voicing information (f0, timing, etc.)

[Figure: plate diagram extending LDA, in which the topic z generates a non-verbal feature y alongside each word w; plates over Nd word tokens, D documents, and T topics.]
Other Extensions
Learning Topic Hierarchies (example: Psych Review abstracts)
[Figure: topic hierarchy learned from Psych Review abstracts; example topic nodes include:]
• THE, OF, AND, TO, IN, A, IS
• A, MODEL, MEMORY, FOR, MODELS, TASK, INFORMATION, RESULTS, ACCOUNT
• RESPONSE, STIMULUS, REINFORCEMENT, RECOGNITION, STIMULI, RECALL, CHOICE, CONDITIONING
• SPEECH, READING, WORDS, MOVEMENT, MOTOR, VISUAL, WORD, SEMANTIC
• ACTION, SOCIAL, SELF, EXPERIENCE, EMOTION, GOALS, EMOTIONAL, THINKING
• GROUP, IQ, INTELLIGENCE, SOCIAL, RATIONAL, INDIVIDUAL, GROUPS, MEMBERS
• SEX, EMOTIONS, GENDER, EMOTION, STRESS, WOMEN, HEALTH, HANDEDNESS
• REASONING, ATTITUDE, CONSISTENCY, SITUATIONAL, INFERENCE, JUDGMENT, PROBABILITIES, STATISTICAL
• IMAGE, COLOR, MONOCULAR, LIGHTNESS, GIBSON, SUBMOVEMENT, ORIENTATION, HOLOGRAPHIC
• CONDITIONING, STRESS, EMOTIONAL, BEHAVIORAL, FEAR, STIMULATION, TOLERANCE, RESPONSES
• SELF, SOCIAL, PSYCHOLOGY, RESEARCH, RISK, STRATEGIES, INTERPERSONAL, PERSONALITY, SAMPLING
• MOTION, VISUAL, SURFACE, BINOCULAR, RIVALRY, CONTOUR, DIRECTION, CONTOURS, SURFACES
• DRUG, FOOD, BRAIN, AROUSAL, ACTIVATION, AFFECTIVE, HUNGER, EXTINCTION, PAIN