Topic modeling
Mark Steyvers
Department of Cognitive Sciences, University of California, Irvine
Some topics we can discuss
• Introduction to LDA: basic topic model
• Preliminary work on therapy transcripts
• Extensions to LDA
– Conditional topic models (for predicting behavioral codes)
– Various topic models for word order
– Topic models incorporating parse trees
– Topic models for dialogue
– Topic models incorporating speech information
Most basic topic model: LDA (Latent Dirichlet Allocation)
Automatic and unsupervised extraction of semantic themes from large text collections, e.g.:

• NYT: 330,000 articles
• Enron: 250,000 emails
• Medline: 16 million articles
• NSF/NIH: 100,000 grants
• AOL queries: 20,000,000 queries from 650,000 users
• Pennsylvania Gazette (1728-1800): 80,000 articles
Model Input
• Matrix of counts: number of times words occur in documents
• Note:
– word order is lost: “bag of words” approach
– some function words are deleted: “the”, “a”, “in”
[Figure: word-document count matrix. Rows are words (e.g., FOOD, ITALIAN, PASTA, PIZZA), columns are documents (Doc1, Doc2, Doc3, …); each cell holds the number of times a word occurs in a document.]
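As a concrete illustration (not from the talk), here is a minimal sketch of building such a count matrix with scikit-learn's CountVectorizer; the three documents are made up, and the built-in English stop list removes function words such as "the" and "a":

```python
# Minimal sketch: build a word-document count matrix (bag of words).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the pizza and pasta at the italian place",
    "italian food with pasta and pizza",
    "food stores selling pizza",
]

vectorizer = CountVectorizer(stop_words="english")  # drops "the", "a", "in", ...
X = vectorizer.fit_transform(docs)                  # documents x words, sparse

print(vectorizer.get_feature_names_out())           # vocabulary
print(X.toarray())                                  # counts; word order is lost
```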
Basic Assumptions
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word in a document originates from a single topic
Document = mixture of topics
[Figure: two example documents. One document is a mixture of two topics (20% / 80%); the other uses a single topic (100%). Example topics, each a distribution over words:
• brain, fmri, imaging, functional, mri, subjects, magnetic, resonance, neuroimaging, structural
• schizophrenia, patients, deficits, schizophrenic, psychosis, subjects, psychotic, dysfunction, abnormalities, clinical
• memory, working, memories, tasks, retrieval, encoding, cognitive, processing, recognition, performance
• disease, ad, alzheimer, diabetes, cardiovascular, insulin, vascular, blood, clinical, individuals
• auto, car, parts, cars, used, ford, honda, truck, toyota
• hannah, montana, zac, efron, disney, high school, musical, miley cyrus, hilary duff
• webmd, cymbalta, xanax, gout, vicodin, effexor, prednisone, lexapro, ambien
• party, store, wedding, birthday, jewelry, ideas, cards, cake, gifts]
Generative Process
• For each document, choose a mixture of topics: θ ~ Dirichlet(α)
• For each word, sample a topic z ∈ [1..T] from the mixture: z ~ Multinomial(θ)
• Sample a word from that topic: w ~ Multinomial(φ(z)), where each topic φ ~ Dirichlet(β)

[Figure: LDA in plate notation. θ is drawn once per document (plate D), z and w once per word token (plate Nd), and φ once per topic (plate T).]
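The generative story is easy to simulate. Below is a minimal numpy sketch; T, W, and the hyperparameter values are illustrative choices, not values from the talk:

```python
# Minimal sketch of the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
T, W, alpha, beta = 2, 5, 0.5, 0.5        # topics, vocabulary size, priors

phi = rng.dirichlet([beta] * W, size=T)   # one word distribution per topic

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * T)    # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)        # sample a topic from the mixture
        words.append(rng.choice(W, p=phi[z]))  # sample a word from that topic
    return words

docs = [generate_document(20) for _ in range(16)]
```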
Prior Distributions

• Dirichlet priors encourage sparsity on topic mixtures and topics:

θ ~ Dirichlet( α )    φ ~ Dirichlet( β )

[Figure: Dirichlet densities on two probability simplices, one with corners Topic 1, Topic 2, Topic 3 and one with corners Word 1, Word 2, Word 3 (darker colors indicate lower probability).]
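A quick numerical illustration (mine, not the talk's) of how the concentration parameter controls sparsity:

```python
# Small Dirichlet parameters yield samples near the corners of the simplex.
import numpy as np

rng = np.random.default_rng(0)
print(rng.dirichlet([0.1, 0.1, 0.1]))     # sparse: one component dominates
print(rng.dirichlet([10.0, 10.0, 10.0]))  # dense: close to uniform
```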
Statistical Inference
• Three sets of latent variables:
– document-topic distributions θ
– topic-word distributions φ
– topic assignments z
• Estimate posterior distribution over topic assignments P( z | w )
– we “collapse” over topic mixtures and word mixtures
– we can later infer θ and φ
• Use approximate methods: Markov chain Monte Carlo (MCMC) with Gibbs sampling
Toy Example: Artificial Dataset
• Two topics, 16 documents, vocabulary {River, Stream, Bank, Money, Loan}
• True topic-word probabilities:

word     topic 1   topic 2
River     0.33      0
Stream    0.33      0
Bank      0.33      0.33
Money     0         0.33
Loan      0         0.33

• Can we recover the original topics and topic mixtures from this data?
• Initialization: assign word tokens randomly to topics (● = topic 1; ○ = topic 2)

[Figure: documents 1-16 as rows, word counts over River, Stream, Bank, Money, Loan as columns, with each token's random initial topic assignment shown.]
Gibbs Sampling
$$ p(z_i = t \mid z_{-i}) \;\propto\; \frac{n^{-i}_{td} + \alpha}{\sum_{t'} n^{-i}_{t'd} + T\alpha} \cdot \frac{n^{-i}_{wt} + \beta}{\sum_{w'} n^{-i}_{w't} + W\beta} $$

• n_td: count of topic t assigned to doc d
• n_wt: count of word w assigned to topic t
• the superscript -i excludes the current token i; the left-hand side is the probability that word i is assigned to topic t
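A minimal collapsed Gibbs sampler implementing this equation (a sketch under the notation above, not the talk's code); docs is a list of token-id lists, e.g. the toy dataset with River…Loan mapped to ids 0-4:

```python
# Minimal collapsed Gibbs sampler for LDA.
import numpy as np

def gibbs_lda(docs, T, W, alpha=0.5, beta=0.5, iters=32, seed=0):
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((W, T))                    # word-topic counts
    n_td = np.zeros((T, len(docs)))            # topic-document counts
    z = [[int(rng.integers(T)) for _ in doc] for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1
            n_td[z[d][i], d] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_wt[w, t] -= 1                # remove token i from the counts
                n_td[t, d] -= 1
                p = ((n_td[:, d] + alpha) / (n_td[:, d].sum() + T * alpha) *
                     (n_wt[w, :] + beta) / (n_wt.sum(axis=0) + W * beta))
                t = rng.choice(T, p=p / p.sum())   # sample a new assignment
                z[d][i] = t
                n_wt[w, t] += 1                # restore the counts
                n_td[t, d] += 1
    phi = (n_wt + beta) / (n_wt.sum(axis=0) + W * beta)      # P( w | z )
    theta = (n_td + alpha) / (n_td.sum(axis=0) + T * alpha)  # P( z | d )
    return phi, theta, z
```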
After 1, 4, 8, and 32 iterations

• Apply the sampling equation to each word token in turn

[Figures: token-level topic assignments (● = topic 1; ○ = topic 2) for documents 1-16 after 1, 4, 8, and 32 iterations; the assignments gradually separate into the two topics.]
Estimated topic-word probabilities after 32 iterations (close to the true topics):

word     topic 1   topic 2
River     0.42      0
Stream    0.29      0.05
Bank      0.28      0.31
Money     0         0.29
Loan      0         0.35
Summary of Algorithm

INPUT: word-document counts (word order is irrelevant)

OUTPUT:
• topic assignments to each word: P( zi )
• likely words in each topic: P( w | z )
• likely topics in each document (“gist”): P( z | d )
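For day-to-day use, a library sketch: scikit-learn's LatentDirichletAllocation fits the same model, though with variational inference rather than Gibbs sampling. A hedged usage example with made-up documents:

```python
# Usage sketch: fit LDA and read off the two output distributions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["river bank stream", "money loan bank", "bank loan money river"]
X = CountVectorizer().fit_transform(docs)    # word-document counts

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.components_)   # per-topic word weights, proportional to P( w | z )
print(lda.transform(X))  # per-document topic mixtures, P( z | d )
```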
Example topics from TASA: an educational corpus
• 37K docs, 26K word vocabulary
• 300 topics, e.g.:

[Figure: P( w | z ) bar charts for four example topics.]

TOPIC 77: MUSIC, DANCE, SONG, PLAY, SING, SINGING, BAND, PLAYED, SANG, SONGS, DANCING, PIANO, PLAYING, RHYTHM
TOPIC 82: LITERATURE, POEM, POETRY, POET, PLAYS, POEMS, PLAY, LITERARY, WRITERS, DRAMA, WROTE, POETS, WRITER, SHAKESPEARE
TOPIC 137: RIVER, LAND, RIVERS, VALLEY, BUILT, WATER, FLOOD, WATERS, NILE, FLOWS, RICH, FLOW, DAM, BANKS
TOPIC 254: READ, BOOK, BOOKS, READING, LIBRARY, WROTE, WRITE, FIND, WRITTEN, PAGES, WORDS, PAGE, AUTHOR, TITLE
Three documents with the word “play” (numbers & colors indicate topic assignments)

A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ... He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077... Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166....
Topic model vs. LSA as matrix factorization:

• Topic model: C = F Q, where C is the normalized word-document co-occurrence matrix (words × documents), F contains the mixture components, i.e. the topics (words × topics), and Q the mixture weights (topics × documents).
• LSA: C = U D Vᵀ, with U (words × dims), D (dims × dims), and Vᵀ (dims × documents).
Documents as Topic Mixtures: a Geometric Interpretation

[Figure: the probability simplex over three words, with axes P(word1), P(word2), P(word3) and the constraint P(word1) + P(word2) + P(word3) = 1. Topic 1 and topic 2 are points on the simplex; each observed document lies on the segment between the topics it mixes.]
Some Preliminary Work on Therapy Transcripts
Defining documents
• Can define “document” in multiple ways
– all words within a therapy session
– all words from a particular speaker within a session
• Clearly we need to extend the topic model to dialogue…
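A minimal sketch of the two document definitions, assuming a hypothetical (session, speaker, words) utterance format:

```python
# Build "documents" per session and per speaker-within-session.
from collections import defaultdict

# each utterance: (session_id, speaker, list of words); data is made up
utterances = [
    ("s1", "husband", ["we", "never", "talk"]),
    ("s1", "wife", ["i", "feel", "lonely"]),
    ("s1", "husband", ["that", "is", "not", "fair"]),
]

by_session = defaultdict(list)
by_speaker = defaultdict(list)
for session, speaker, words in utterances:
    by_session[session].extend(words)             # all words within a session
    by_speaker[(session, speaker)].extend(words)  # per speaker within a session
```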
Positive/Negative Topic Usage by Group

[Figure: bar chart of P( topic ) for the positive topic and the negative topic, comparing controls with pre-therapy couples; y-axis from 0 to 0.14.]
Positive/Negative Topic Usage by Changes in Satisfaction

[Figure: bar chart of P( topic ) for the positive topic and the negative topic, comparing couples with a positive change in satisfaction to couples with a negative change; y-axis from 0 to 0.12.]

This graph shows that couples with a decrease in satisfaction over the course of therapy use relatively more negative language; those who leave therapy with increased satisfaction exhibit more positive language.
Topics used by Satisfied/ Unsatisfied Couples
[Figure: scatter plot of usage probabilities for topics T1-T200; the x-axis is usage by couples with a positive change in satisfaction, the y-axis usage by couples with a negative change in satisfaction.]

Topic 38: talk, divorce, problem, house, along, separate, separation, talking, agree, example

Dissatisfied couples talk relatively more often about separation and divorce.
Affect Dynamics
Analyze the short-term dynamics of affect usage: do unhappy couples follow up negative language with negative language more often than happy couples? In other words, are unhappy couples involved in a negative feedback loop?

Calculated: P( z2=+ | z1=+ ), P( z2=+ | z1=- ), P( z2=- | z1=+ ), P( z2=- | z1=- )

E.g., P( z2=- | z1=+ ) is the probability that after a positive word the next non-neutral word will be a negative word.
Markov Chain Illustration

group              P(-|+)  P(+|+)  P(+|-)  P(-|-)  base P(+)  base P(-)
Normal Controls     .27     .73     .28     .72      .51        .49
Positive Change     .33     .67     .27     .73      .45        .55
Little Change       .37     .63     .22     .78      .38        .62
Negative Change     .41     .59     .22     .78      .35        .65
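A minimal sketch of how such transition probabilities can be estimated from a sequence of non-neutral affect labels (the sequence here is made up):

```python
# Estimate P( z2 | z1 ) from bigram counts over +/- labels.
from collections import Counter

labels = ["+", "+", "-", "+", "-", "-", "-", "+"]  # hypothetical sequence

pairs = Counter(zip(labels, labels[1:]))           # bigram counts
for prev in "+-":
    total = sum(pairs[(prev, nxt)] for nxt in "+-")
    for nxt in "+-":
        print(f"P(z2={nxt} | z1={prev}) = {pairs[(prev, nxt)] / total:.2f}")
```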
Modeling Extensions
Extensions
• Multi-label Document Classification
– conditional topic models
• Topic models and word order
– n-grams/collocations
– hidden Markov models
• Some potential model developments:
– topic models incorporating parse trees
– topic models for dialogue
– topic models incorporating speech information
Conditional Topic Models
Assume there is a topic associated with each label/behavioral code. The model is only allowed to assign words to topics whose labels are associated with the document.

This model can learn the distribution of words associated with each label/behavioral code (see the sketch after the figure below).
[Figure: documents and their word-level topic assignments. Each document carries behavioral-code labels, e.g. Vulnerability=yes/no and Hard Expression=yes/no; its topic weights, and hence the topic assignments of its words, are restricted to the topics associated with the codes that apply.]
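A minimal sketch of how the label constraint could enter the Gibbs sampler shown earlier: when resampling a token's topic, only the topics tied to the document's behavioral codes are candidates. Names follow the gibbs_lda sketch above and are illustrative, not the talk's implementation:

```python
# Label-constrained topic resampling for one token.
import numpy as np

def sample_topic(w, d, doc_labels, n_wt, n_td, alpha, beta, rng):
    W, _ = n_wt.shape
    allowed = np.array(sorted(doc_labels[d]))  # topics allowed for this doc
    # the document-length denominator is constant over topics, so it is dropped
    p = ((n_td[allowed, d] + alpha) *
         (n_wt[w, allowed] + beta) / (n_wt[:, allowed].sum(axis=0) + W * beta))
    return allowed[rng.choice(len(allowed), p=p / p.sum())]
```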
Preliminary Results

HEX (Hard Expressions):
[child_making_noises] 0.02250, [sighs] 0.01715, alice 0.01661, furniture 0.01447, brother 0.01340, literally 0.01072, [oldest_daughter] 0.00911, jason 0.00804, shit 0.00804, budget 0.00697, cold 0.00697, [child] 0.00643, da 0.00643, dad 0.00643, diaper 0.00643, lie 0.00643, [youngest_daughter] 0.00590, criticize 0.00590, enemy 0.00590, expression 0.00590, mmhmm 0.00590, monitoring 0.00590, months 0.00590, mad 0.00536, piece 0.00536, pssht 0.00536, carpet 0.00483, floors 0.00483, huge 0.00483, techniques 0.00483

vul (Vulnerability):
care 0.02262, love 0.01996, [sigh] 0.01775, home 0.01730, life 0.01730, talking 0.01553, family 0.01508, baby 0.01375, brother 0.01375, [sighs] 0.01021, difficult 0.01021, school 0.01021, guy 0.00976, talk 0.00976, da 0.00932, heart 0.00932, sister 0.00932, appreciate 0.00887, connection 0.00887, stress 0.00887, class 0.00799, lonely 0.00799, yesterday 0.00799, called 0.00754, emotional 0.00754, mom 0.00754, mother 0.00754, hard 0.00710, phone 0.00710, read 0.00710
Topic Models for short-range sequential dependencies
Hidden Markov Topics Model
• Syntactic dependencies: short-range; semantic dependencies: long-range

[Figure: graphical model. A hidden chain of syntactic states s generates the word sequence w; one designated semantic state generates its words from the topic model (topic assignments z), while the other syntactic states generate words from HMM emission distributions.]
(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
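A minimal generative sketch of this composite model; the parameter values are illustrative, not those of Griffiths et al. (2004):

```python
# HMM topics model: one semantic state emits from the topic model,
# the other (syntactic) states emit from state-specific distributions.
import numpy as np

rng = np.random.default_rng(0)
S, T, W = 3, 2, 10                        # syntactic states, topics, vocab
trans = rng.dirichlet([1.0] * S, size=S)  # HMM transition matrix
emit = rng.dirichlet([1.0] * W, size=S)   # per-state emission distributions
phi = rng.dirichlet([1.0] * W, size=T)    # topic-word distributions
SEM = 0                                   # designate state 0 as semantic

def generate(n_words):
    theta = rng.dirichlet([1.0] * T)      # document topic mixture
    s, words = 0, []
    for _ in range(n_words):
        s = rng.choice(S, p=trans[s])
        if s == SEM:                      # semantic state: use the topic model
            z = rng.choice(T, p=theta)
            words.append(rng.choice(W, p=phi[z]))
        else:                             # syntactic state: use the HMM
            words.append(rng.choice(W, p=emit[s]))
    return words

doc = generate(15)
```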
NIPS Semantics (example topics):
• EXPERTS, EXPERT, GATING, HME, ARCHITECTURE, MIXTURE, LEARNING, MIXTURES, FUNCTION, GATE
• DATA, GAUSSIAN, MIXTURE, LIKELIHOOD, POSTERIOR, PRIOR, DISTRIBUTION, EM, BAYESIAN, PARAMETERS
• STATE, POLICY, VALUE, FUNCTION, ACTION, REINFORCEMENT, LEARNING, CLASSES, OPTIMAL, *
• MEMBRANE, SYNAPTIC, CELL, *, CURRENT, DENDRITIC, POTENTIAL, NEURON, CONDUCTANCE, CHANNELS
• IMAGE, IMAGES, OBJECT, OBJECTS, FEATURE, RECOGNITION, VIEWS, #, PIXEL, VISUAL
• KERNEL, SUPPORT, VECTOR, SVM, KERNELS, #, SPACE, FUNCTION, MACHINES, SET
• NETWORK, NEURAL, NETWORKS, OUTPUT, INPUT, TRAINING, INPUTS, WEIGHTS, #, OUTPUTS

NIPS Syntax (example classes):
• MODEL, ALGORITHM, SYSTEM, CASE, PROBLEM, NETWORK, METHOD, APPROACH, PAPER, PROCESS
• IS, WAS, HAS, BECOMES, DENOTES, BEING, REMAINS, REPRESENTS, EXISTS, SEEMS
• SEE, SHOW, NOTE, CONSIDER, ASSUME, PRESENT, NEED, PROPOSE, DESCRIBE, SUGGEST
• USED, TRAINED, OBTAINED, DESCRIBED, GIVEN, FOUND, PRESENTED, DEFINED, GENERATED, SHOWN
• IN, WITH, FOR, ON, FROM, AT, USING, INTO, OVER, WITHIN
• HOWEVER, ALSO, THEN, THUS, THEREFORE, FIRST, HERE, NOW, HENCE, FINALLY
• #, *, I, X, T, N, -, C, F, P
Random sentence generation

LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
Collocation Topic Model
Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX

Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS

Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS

Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP
Potential Model Developments

Using parse trees / POS taggers?

[Figure: parse trees for “You complete me” and “I complete you”, each S → NP VP; a bag-of-words model cannot distinguish the two.]
Modeling Dialogue
Topic Segmentation Model
• Purver, Kording, Griffiths, & Tenenbaum (2006). Unsupervised topic modeling for multi-party spoken discourse. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
• Automatically segments multi-party discourse into topically coherent segments
• Outperforms standard HMMs
• Model does not incorporate speaker information or speaker turns
– goal is simply to segment a long stream of words into segments

At each utterance, there is a probability of changing θ, the topic mixture. If no change is indicated, words are drawn from the same mixture of topics. If there is a change, the topic mixture is resampled from the Dirichlet prior (see the sketch below).
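A minimal generative sketch of that switching process; p_switch and all other values are illustrative assumptions, not parameters from the paper:

```python
# Per-utterance topic-mixture switching for topic segmentation.
import numpy as np

rng = np.random.default_rng(0)
T, W, alpha, p_switch = 5, 50, 0.5, 0.1
phi = rng.dirichlet([0.5] * W, size=T)     # topic-word distributions

theta = rng.dirichlet([alpha] * T)         # initial topic mixture
utterances = []
for length in [8, 12, 6, 9]:               # hypothetical utterance lengths
    if rng.random() < p_switch:            # segment boundary: resample theta
        theta = rng.dirichlet([alpha] * T)
    utterances.append([rng.choice(W, p=phi[rng.choice(T, p=theta)])
                       for _ in range(length)])
```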
Latent Dialogue Structure model, Ding et al. (NIPS workshop, 2009)

• Designed for modeling sequences of messages on discussion forums
• Models the relationship of messages within documents: a message might relate to any previous message within a dialogue
• It does not incorporate speaker-specific variables
Learning User Intentions in Spoken Dialogue Systems, Chinaei et al. (ICAART, 2009)

• Applies the HTMM model (Gruber et al., 2007) to dialogue
• Assumes that within each talk-turn, words are drawn from the same topic z (not a mixture!); at the start of a new talk-turn, there is some probability ψ of sampling a new topic z from the mixture θ
Other ideas

• Can we enhance topic models with non-verbal speech information?
• Each topic is a distribution over words as well as voicing information (f0, timing, etc.)

[Figure: plate diagram extending LDA, in which the topic z generates a non-verbal feature y alongside each word w; plates over Nd word tokens, D documents, and T topics.]
Other Extensions
Learning Topic Hierarchies (example: Psych Review abstracts)
[Figure: topic hierarchy learned from Psych Review abstracts; example topic nodes include:]
• THE, OF, AND, TO, IN, A, IS
• A, MODEL, MEMORY, FOR, MODELS, TASK, INFORMATION, RESULTS, ACCOUNT
• RESPONSE, STIMULUS, REINFORCEMENT, RECOGNITION, STIMULI, RECALL, CHOICE, CONDITIONING
• SPEECH, READING, WORDS, MOVEMENT, MOTOR, VISUAL, WORD, SEMANTIC
• ACTION, SOCIAL, SELF, EXPERIENCE, EMOTION, GOALS, EMOTIONAL, THINKING
• GROUP, IQ, INTELLIGENCE, SOCIAL, RATIONAL, INDIVIDUAL, GROUPS, MEMBERS
• SEX, EMOTIONS, GENDER, EMOTION, STRESS, WOMEN, HEALTH, HANDEDNESS
• REASONING, ATTITUDE, CONSISTENCY, SITUATIONAL, INFERENCE, JUDGMENT, PROBABILITIES, STATISTICAL
• IMAGE, COLOR, MONOCULAR, LIGHTNESS, GIBSON, SUBMOVEMENT, ORIENTATION, HOLOGRAPHIC
• CONDITIONING, STRESS, EMOTIONAL, BEHAVIORAL, FEAR, STIMULATION, TOLERANCE, RESPONSES
• SELF, SOCIAL, PSYCHOLOGY, RESEARCH, RISK, STRATEGIES, INTERPERSONAL, PERSONALITY, SAMPLING
• MOTION, VISUAL, SURFACE, BINOCULAR, RIVALRY, CONTOUR, DIRECTION, CONTOURS, SURFACES
• DRUG, FOOD, BRAIN, AROUSAL, ACTIVATION, AFFECTIVE, HUNGER, EXTINCTION, PAIN