Upload
adrian-hart
View
216
Download
2
Tags:
Embed Size (px)
Citation preview
Topic Models forSocial Network Analysis
and Bibliometrics
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with Xuerui Wang, Natasha Mohanty,Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.
Goal:
Mine actionable knowledgefrom unstructured text.
From Text to Actionable Knowledge
SegmentClassifyAssociateCluster
Filter
Prediction Outlier detection Decision support
IE
Documentcollection
Database
Discover patterns - entity types - links / relations - events
DataMining
Spider
Actionableknowledge
SegmentClassifyAssociateCluster
Filter
Prediction Outlier detection Decision support
IE
Documentcollection
Database
Discover patterns - entity types - links / relations - events
DataMining
Spider
Actionableknowledge
Uncertainty Info
Emerging Patterns
Joint Inference
SegmentClassifyAssociateCluster
Filter
Prediction Outlier detection Decision support
IE
Documentcollection
ProbabilisticModel
Discover patterns - entity types - links / relations - events
DataMining
Spider
Actionableknowledge
Conditional Random Fields [Lafferty, McCallum, Pereira]
Conditional PRMs [Koller…], [Jensen…], [Geetor…], [Domingos…]
Discriminatively-trained undirected graphical models
Complex Inference and LearningJust what we researchers like to sink our teeth into!
Unified Model
Context
SegmentClassifyAssociateCluster
Filter
Prediction Outlier detection Decision support
IE
Documentcollection
Database
Discover patterns - entity types - links / relations - events
DataMining
Spider
Actionableknowledge
Leveraging Text in Social Network Analysis
Joint inferenceamong detailed steps
Outline
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Social Network Analysis with Topic Models
Multi-Conditional Mixtures
Social Network in an Email Dataset
QuickTime™ and aTIFF (LZW) decompressor
are needed to see this picture.
Clustering words into topics withLatent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Sample a distributionover topics,
For each document:
Sample a topic, z
For each word in doc
Sample a wordfrom the topic, w
Example:
70% Iraq war30% US election
Iraq war
“bombing”
GenerativeProcess:
STORYSTORIESTELL
CHARACTERCHARACTERS
AUTHORREADTOLD
SETTINGTALESPLOT
TELLINGSHORTFICTIONACTIONTRUE
EVENTSTELLSTALENOVEL
MINDWORLDDREAMDREAMSTHOUGHT
IMAGINATIONMOMENT
THOUGHTSOWNREALLIFE
IMAGINESENSE
CONSCIOUSNESSSTRANGEFEELINGWHOLEBEINGMIGHTHOPE
WATERFISHSEASWIM
SWIMMINGPOOLLIKESHELLSHARKTANK
SHELLSSHARKSDIVING
DOLPHINSSWAMLONGSEALDIVE
DOLPHINUNDERWATER
DISEASEBACTERIADISEASESGERMSFEVERCAUSECAUSEDSPREADVIRUSES
INFECTIONVIRUS
MICROORGANISMSPERSON
INFECTIOUSCOMMONCAUSING
SMALLPOXBODY
INFECTIONSCERTAIN
Example topicsinduced from a large collection of text
FIELDMAGNETICMAGNETWIRE
NEEDLECURRENT
COILPOLESIRON
COMPASSLINESCORE
ELECTRICDIRECTION
FORCEMAGNETS
BEMAGNETISM
POLEINDUCED
SCIENCESTUDY
SCIENTISTSSCIENTIFIC
KNOWLEDGEWORK
RESEARCHCHEMISTRY
TECHNOLOGYMANY
MATHEMATICSBIOLOGYFIELD
PHYSICSLABORATORY
STUDIESWORLD
SCIENTISTSTUDYINGSCIENCES
BALLGAMETEAM
FOOTBALLBASEBALLPLAYERS
PLAYFIELD
PLAYERBASKETBALL
COACHPLAYEDPLAYING
HITTENNISTEAMSGAMESSPORTSBAT
TERRY
JOBWORKJOBS
CAREEREXPERIENCEEMPLOYMENTOPPORTUNITIES
WORKINGTRAININGSKILLS
CAREERSPOSITIONS
FINDPOSITIONFIELD
OCCUPATIONSREQUIRE
OPPORTUNITYEARNABLE
[Tennenbaum et al]
STORYSTORIESTELL
CHARACTERCHARACTERS
AUTHORREADTOLD
SETTINGTALESPLOT
TELLINGSHORTFICTIONACTIONTRUE
EVENTSTELLSTALENOVEL
MINDWORLDDREAMDREAMSTHOUGHT
IMAGINATIONMOMENT
THOUGHTSOWNREALLIFE
IMAGINESENSE
CONSCIOUSNESSSTRANGEFEELINGWHOLEBEINGMIGHTHOPE
WATERFISHSEASWIM
SWIMMINGPOOLLIKESHELLSHARKTANK
SHELLSSHARKSDIVING
DOLPHINSSWAMLONGSEALDIVE
DOLPHINUNDERWATER
DISEASEBACTERIADISEASESGERMSFEVERCAUSECAUSEDSPREADVIRUSES
INFECTIONVIRUS
MICROORGANISMSPERSON
INFECTIOUSCOMMONCAUSING
SMALLPOXBODY
INFECTIONSCERTAIN
FIELDMAGNETICMAGNETWIRE
NEEDLECURRENT
COILPOLESIRON
COMPASSLINESCORE
ELECTRICDIRECTION
FORCEMAGNETS
BEMAGNETISM
POLEINDUCED
SCIENCESTUDY
SCIENTISTSSCIENTIFIC
KNOWLEDGEWORK
RESEARCHCHEMISTRY
TECHNOLOGYMANY
MATHEMATICSBIOLOGYFIELD
PHYSICSLABORATORY
STUDIESWORLD
SCIENTISTSTUDYINGSCIENCES
BALLGAMETEAM
FOOTBALLBASEBALLPLAYERS
PLAYFIELDPLAYER
BASKETBALLCOACHPLAYEDPLAYING
HITTENNISTEAMSGAMESSPORTSBAT
TERRY
JOBWORKJOBS
CAREEREXPERIENCEEMPLOYMENTOPPORTUNITIES
WORKINGTRAININGSKILLS
CAREERSPOSITIONS
FINDPOSITIONFIELD
OCCUPATIONSREQUIRE
OPPORTUNITYEARNABLE
Example topicsinduced from a large collection of text
[Tennenbaum et al]
From LDA to Author-Recipient-Topic(ART) [McCallum et al 2005]
Inference and Estimation
Gibbs Sampling:- Easy to implement- Reasonably fast
r
Enron Email Corpus
• 250k email messages• 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)From: [email protected]: [email protected]: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra PerlingiereEnron North America Corp.Legal Department1400 Smith Street, EB 3885Houston, Texas [email protected]
Topics, and prominent senders / receiversdiscovered by ARTTopic names,
by hand
Topics, and prominent senders / receiversdiscovered by ART
Beck = “Chief Operations Officer”Dasovich = “Government Relations Executive”Shapiro = “Vice President of Regulatory Affairs”Steffes = “Vice President of Government Affairs”
Comparing Role Discovery
connection strength (A,B) =
distribution overauthored topics
Traditional SNA
distribution overrecipients
distribution overauthored topics
Author-TopicART
Comparing Role Discovery Tracy Geaconne Dan McCarty
Traditional SNA Author-TopicART
Similar roles Different rolesDifferent roles
Geaconne = “Secretary”McCarty = “Vice President”
Traditional SNA Author-TopicART
Different roles Very differentVery similar
Blair = “Gas pipeline logistics”Watson = “Pipeline facilities planning”
Comparing Role Discovery Lynn Blair Kimberly Watson
McCallum Email Corpus 2004
• January - October 2004• 23k email messages• 825 people
From: [email protected]: NIPS and ....Date: June 14, 2004 2:27:41 PM EDTTo: [email protected]
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
NIPS registration receipt.CALO registration receipt.
Thanks,Kate
Four most prominent topicsin discussions with ____?
Two most prominent topicsin discussions with ____?
Words Problove 0.030514house 0.015402
0.013659time 0.012351great 0.011334hope 0.011043dinner 0.00959saturday 0.009154left 0.009154ll 0.009009
0.008282visit 0.008137evening 0.008137stay 0.007847bring 0.007701weekend 0.007411road 0.00712sunday 0.006829kids 0.006539flight 0.006539
Words Probtoday 0.051152tomorrow 0.045393time 0.041289ll 0.039145meeting 0.033877week 0.025484talk 0.024626meet 0.023279morning 0.022789monday 0.020767back 0.019358call 0.016418free 0.015621home 0.013967won 0.013783day 0.01311hope 0.012987leave 0.012987office 0.012742tuesday 0.012558
Role-Author-Recipient-Topic Models
Results with RART:People in “Role #3” in Academic Email
• olc lead Linux sysadmin• gauthier sysadmin for CIIR group• irsystem mailing list CIIR sysadmins• system mailing list for dept. sysadmins• allan Prof., chair of “computing
committee”• valerie second Linux sysadmin• tech mailing list for dept. hardware• steve head of dept. I.T. support
Roles for allan (James Allan)
• Role #3 I.T. support• Role #2 Natural Language
researcher
Roles for pereira (Fernando Pereira) • Role #2 Natural Language researcher• Role #4 SRI CALO project participant• Role #6 Grant proposal writer• Role #10 Grant proposal coordinator• Role #8 Guests at McCallum’s house
Traditional SNA Author-TopicART
Block structured NotNot
ART: Roles but not Groups
Enron TransWestern Division
Outline
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Social Network Analysis with Topic Models
Multi-Conditional Mixtures
Groups and Topics
• Input:– Observed relations between people– Attributes on those relations (text, or categorical)
• Output:– Attributes clustered into “topics”– Groups of people---varying depending on topic
Discovering Groups from Observed Set of Relations
Admiration relations among six high school students.
Student Roster
AdamsBennettCarterDavisEdwardsFrederking
Academic Admiration
Acad(A, B) Acad(C, B)Acad(A, D) Acad(C, D)Acad(B, E) Acad(D, E)Acad(B, F) Acad(D, F)Acad(E, A) Acad(F, A)Acad(E, C) Acad(F, C)
Adjacency Matrix Representing Relations
A B C D E FABCDEF
A B C D E FG1G2G1G2G3G3
G1G2G1G2G3G3
ABCDEF
A C B D E FG1G1G2G2G3G3
G1G1G2G2G3G3
ACBDEF
Student Roster
AdamsBennettCarterDavisEdwardsFrederking
Academic Admiration
Acad(A, B) Acad(C, B)Acad(A, D) Acad(C, D)Acad(B, E) Acad(D, E)Acad(B, F) Acad(D, F)Acad(E, A) Acad(F, A)Acad(E, C) Acad(F, C)
Group Model: Partitioning Entities into Groups
2Sv
β
2Gγ α
Stochastic Blockstructures for Relations[Nowicki, Snijders 2001]
S: number of entities
G: number of groups
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]
BetaDirichlet
Binomial
SgMultinomial
Two Relations with Different Attributes
A C B D E FG1G1G2G2G3G3
G1G1G2G2G3G3
A C E B D FG1G1G1G2G2G2
G1G1G1G2G2G2
ACEBDF
Student Roster
AdamsBennettCarterDavisEdwardsFrederking
Academic Admiration
Acad(A, B) Acad(C, B)Acad(A, D) Acad(C, D)Acad(B, E) Acad(D, E)Acad(B, F) Acad(D, F)Acad(E, A) Acad(F, A)Acad(E, C) Acad(F, C)
Social Admiration
Soci(A, B) Soci(A, D) Soci(A, F)Soci(B, A) Soci(B, C) Soci(B, E)Soci(C, B) Soci(C, D) Soci(C, F)Soci(D, A) Soci(D, C) Soci(D, E)Soci(E, B) Soci(E, D) Soci(E, F)Soci(F, A) Soci(F, C) Soci(F, E)
ACBDEF
The Group-Topic Model: Discovering Groups and Topics Simultaneously
bNw
t
B
T
φ
η
DirichletMultinomial
Uniform
2Sv
β
2Gγ α
Beta
Dirichlet
Binomial
SgMultinomial
T
[Wang, Mohanty, McCallum 2006]
Inference and EstimationGibbs Sampling:- Many r.v.s can be integrated out- Easy to implement- Reasonably fast
We assume the relationship is symmetric.
Dataset #1:U.S. Senate
• 16 years of voting records in the US Senate (1989 – 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
S.543 Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes. Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2) Latest Major Action: 12/19/1991 Became Public Law No: 102-242. Index terms: Banks and banking Accounting Administrative fees Cost control Credit Deposit insurance Depressed areas and other 110 terms
Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……
Topics Discovered (U.S. Senate)Education Energy
MilitaryMisc.
Economic
education energy government federalschool power military laboraid water foreign insurance
children nuclear tax aiddrug gas congress tax
students petrol aid businesselementary research law employeeprevention pollution policy care
Mixture of Unigrams
Group-Topic Model
Education
+ DomesticForeign Economic
Social Security
+ Medicareeducation foreign labor socialschool trade insurance securityfederal chemicals tax insuranceaid tariff congress medical
government congress income caretax drugs minimum medicare
energy communicable wage disabilityresearch diseases business assistance
Groups Discovered (US Senate)
Groups from topic Education + Domestic
Senators Who Change Coalition the most Dependent on Topic
e.g. Senator Shelby (D-AL) votes with the Republicans on Economicwith the Democrats on Education + Domesticwith a small group of maverick Republicans on Social Security + Medicaid
Dataset #2:The UN General Assembly
• Voting records of the UN General Assembly (1990 - 2003)
• A country may choose to vote Yes, No or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960-2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
Topics Discovered (UN)
Everything Nuclear
Human RightsSecurity
in Middle East
nuclear rights occupiedweapons human israel
use palestine syriaimplementation situation security
countries israel calls
Mixture ofUnigrams
Group-TopicModel
NuclearNon-proliferation
Nuclear Arms Race
Human Rights
nuclear nuclear rightsstates arms humanunited prevention palestine
weapons race occupiednations space israel
GroupsDiscovered(UN)The countries list for each group are ordered by their 2005 GDP (PPP) and only 5 countries are shown in groups that have more than 5 members.
Outline
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Social Network Analysis with Topic Models
Multi-Conditional Mixtures
Latent Dirichlet Allocation
[Blei, Ng, Jordan, 2003]
N
n
w
z
θ
α
Tφ
β
LDA 100motiondetectionfieldopticalflowsensitivemovingfunctionaldetectcontrastlightdimensionalintensitycomputermtmeasuresocclusiontemporaledgereal
“motion,some junk”
LDA 20visual modelmotionfieldobjectimageimagesobjectsfieldsreceptiveeyepositionspatialdirectiontargetvisionmultiplefigureorientationlocation
“images,motion, eyes”
Correlated Topic Model[Blei, Lafferty, 2005]
N
n
w
z
η
Tφ
β
Square matrix ofpairwise correlations.
logisticnormal
Pachinko Machine
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.QuickTime™ and a
TIFF (Uncompressed) decompressorare needed to see this picture.
Pachinko Allocation Model[Li, McCallum, 2005]
α22
α31 α33
α41 α42 α43 α44 α45
Model stru
cture
,
not the g
raphical m
odel
α32
word1 word2 word3 word4 word5 word6 word7 word8
Given: directed acyclic graph (DAG); at each interior node: a Dirichlet over its children and words at leaves
For each document: Sample a multinomial from each Dirichlet
For each word in this document: Starting from the root, sample a child from successive nodes, down to a leaf.Generate the word at the leaf
α21
α11
Like a Polya tree, but DAG shaped, with arbitrary number of children.
Thanks to Michael Jordan
for suggesting the name
Pachinko Allocation Model[Li, McCallum, 2005]
Model stru
cture
,
not the g
raphical m
odel
DAG may have arbitrary structure• arbitrary depth• any number of children per node• sparse connectivity• edges may skip layers
α22
α31 α33
α41 α42 α43 α44 α45
α32
word1 word2 word3 word4 word5 word6 word7 word8
α21
α11
Pachinko Allocation Model[Li, McCallum, 2005]
Model stru
cture
,
not the g
raphical m
odel
Distributions over words (like “LDA topics”)
Distributions over topics;mixtures, representing topic correlations
Distributions over distributions over topics...
Some interior nodes could contain one multinomial, used for all documents.(i.e. a very peaked Dirichlet)
α22
α31 α33
α41 α42 α43 α44 α45
α32
word1 word2 word3 word4 word5 word6 word7 word8
α21
α11
Pachinko Allocation Model[Li, McCallum, 2005]
Model stru
cture
,
not the g
raphical m
odel
Estimate all these Dirichlets from data.
Estimate model structure from data. (number of nodes, and connectivity)
α22
α31 α33
α41 α42 α43 α44 α45
α32
word1 word2 word3 word4 word5 word6 word7 word8
α21
α11
Pachinko Allocation Special CasesLatent Dirichlet Allocation
α41 α42 α43 α44 α45
α32
word1 word2 word3 word4 word5 word6 word7 word8
Pachinko Allocation Special CasesHierarchical Latent Dirichlet Allocation (HLDA)
α41
α32
α51
α33
α42
word1 word2 word3 word4 word5 word6 word7 word8
α22 α23 α24α21
α11
α31 α34
Very low variance Dirichlet at root
TheHLDAhier.
Each leaf of theHLDA topic hier. hasa distr. over nodeson path to the root.
Pachinko Allocation on a Topic Hierarchy
α41
α32
α51
α33
α42
word1 word2 word3 word4 word5 word6 word7 word8
α22 α23 α24α21
α11
α31 α34
Combining best of HLDA and Pachinko Allocation
TheHLDAhier.
...representingcorrelations amongtopic leaves.
α12
α00
ThePAMDAG.
Pachinko Allocation Model... with two layers, no skipping layers,fully-connected from one layer to the next.
α11
α21 α23
α31 α32 α33 α34 α35
α22
word1 word2 word3 word4 word5 word6 word7 word8
“sub-topics”
“super-topics”
fixed multinomials
Another special case would select only one super-topic per document.
Graphical Models
N
n
w
z1
qα
qθ
z2 zm
Tφ
β…
N
n
w
z
θ
α
Tφ
β
LDA PAM(with fixed multinomials for topics)
Pachinko Allocation Model
• Likelihood
• Estimate z’s by Gibbs sampling
• Estimate α’s by moment matching.
€
P(w,z,θ,ϕ |α ,β ) = P(ϕ i |β )i=1
T
∏ ×
∏ ∏∏∏= = =
−=
N
i
n
jijmj
m
kkijijk
q
jjij zwPzzPP
1 1 2)1(
1
))),|(),|(()|(( ϕθαθ
∏ ∑∑= −
−− +
+×
+
+∝
−
−m
k k kijm
zijijm
k kzkiji
zzijkkiji
ijij zC
wzC
zC
zzCzwzP ijm
kij
ijkkij
2 ' '' ',)1(
,)1(
)(
),(
)(
),(),,,|(
)1(
)1(
β
β
α
αβα
Preliminary Experimental Results
• Topic Coherence
• Likelihood on held-out data
• Document classification
NIPS Dataset
•1740 papers •13649 Words•2,301,375 tokens
NIPS Conference PapersVolumes 0-12
Spanning 1987 – 1999. Prepared by Sam Roweis.
Topic Coherence Comparison
LDA 100estimationlikelihoodmaximumnoisyestimatesmixturescenesurfacenormalizationgeneratedmeasurementssurfacesestimatingestimatediterativecombinedfiguredivisivesequenceideal
LDA 20models modelparametersdistributionbayesianprobabilityestimationdatagaussianmethodslikelihoodemmixtureshowapproachpaperdensityframeworkapproximationmarkov
Example super-topic33 input hidden units function number27 estimation bayesian parameters data methods24 distribution gaussian markov likelihood mixture11 exact kalman full conditional deterministic1 smoothing predictive regularizers intermediate slope
“models,estimation, stopwords”
“estimation,some junk”
PAM 100estimationbayesianparametersdatamethodsestimatemaximumprobabilisticdistributionsnoisevariablevariablesnoisyinferencevarianceentropymodelsframeworkstatisticalestimating
“estimation”
Topic Coherence Comparison
LDA 100networklayermultitrainedhighperceptronlayersgivetypenonlinearityperceptronsmodulemodifiedmatchedperformedprovideddesignedsamplesstudymode
PAM 100inputhiddenunitsfunctionnumberfunctionsnetworksoutputlinearlayersingleresultsweightinputsbasisparametersstandardnetworkpatternsstudy
LDA 20architecturenetworkinputoutputstructurepaperleveltaskworksequencessequencemultipleproblemshowsconnectionistnetworkscontextperformscalelearn
“neural networks,some junk”
“neural networks,some junk”
“neural networks,much less junk”
Blind Topic Evaluation
• Randomly select 25 similar pairs of topics generated from PAM and LDA
• 5 people• Each asked “select the topic
in each pair that you find most semantically coherent.”
LDA PAM
5 votes 0 5
>= 4 votes 3 8
>= 3 votes 9 16
Topic countsPrefer PAM
Topic Correlations in PAM
5000 research paper abstracts, from across all CS
Numbers on edges are supertopics’ Dirichlet parameters
Likelihood on Held Out Data
• Likelihood comparison– NIPS abstracts– Train the model with 75% data– Calculate likelihood on 25% data
• Calculate likelihood by– Sampling many, many documents from the model– Estimating a simple mixture of multinomials from these– Calculate the likelihood of data under this simple
mixture.
Likelihood Comparison
Varying number of topics
Document Classification
Test Accuracy (%)
“Comp5” from 20 Newsgroups corpus.
Train on 25%, test on 75%Like Naive Bayes, but use LDA/PAM per-class instead of multinomial.
~2.5% increase
Outline
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Social Network Analysis with Topic Models
Multi-Conditional Mixtures
Want to Model Trends over Time
• Is prevalence of topic growing or waning?
• Pattern appears only briefly– Capture its statistics in focused way– Don’t confuse it with patterns elsewhere in time
• How do roles, groups, influence shift over time?
Topics over Time (TOT)
w t
α
Nd
z
D
T
T
Betaover time
Multinomialover words
β γ
Dirichlet
multinomialover topics
topicindex
wordtime
stamp
Dirichletprior
Uniformprior
w
t
Nd
z
D
T
Multinomialover words
β
time stamp
multinomialover topics
topicindex
word
Dirichletprior
distributionon timestamps
T
Betaover time
γ
Uniformprior
State of the Union Address
208 Addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopping was applied.
•17156 ‘documents’
•21534 words
•669,425 tokens
Our scheme of taxation, by means of which this needless surplus is takenfrom the people and put into the public Treasury, consists of a tariff orduty levied upon importations from abroad and internal-revenue taxes leviedupon the consumption of tobacco and spirituous and malt liquors. It must beconceded that none of the things subjected to internal-revenue taxationare, strictly speaking, necessaries. There appears to be no just complaintof this taxation by the consumers of these articles, and there seems to benothing so well able to bear the burden without hardship to any portion ofthe people.
1910
Comparing
TOT
against
LDA
TOT on 17 years of NIPS proceedings
TOT on 17 years of NIPS proceedings
TOT LDA
TOT
versus
LDA
on my email
TOT improves ability to Predict Time
Predicting the year of a State-of-the-Union address.
L1 = distance between predicted year and actual year.
Outline
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Social Network Analysis with Topic Models
Multi-Conditional Mixtures
Topics Modeling Phrases
• Topics based only on unigrams often difficult to interpret
• Topic discovery itself is confused because important meaning / distinctions carried by phrases.
• Significant opportunity to provide improved language models to ASR, MT, IR, etc.
Topical N-gram Model
z1 z2 z3 z4
w1 w2 w3 w4
y1 y2 y3 y4
1
T
D
. . .
. . .
. . .
α
WTW
γ1 γ2β 2
LDA Topic
LDA
algorithmsalgorithmgenetic
problemsefficient
Topical N-grams
genetic algorithmsgenetic algorithm
evolutionary computationevolutionary algorithms
fitness function
Topic Comparison
learningoptimalreinforcementstateproblemspolicydynamicactionprogrammingactionsfunctionmarkovmethodsdecisionrlcontinuousspacessteppoliciesplanning
LDA
reinforcement learningoptimal policydynamic programmingoptimal controlfunction approximatorprioritized sweepingfinite-state controllerlearning systemreinforcement learning_rlfunction approximatorsmarkov decision problemsmarkov decision processeslocal searchstate-action pairmarkov decision processbelief statesstochastic policyaction selectionupright positionreinforcement learning methods
policyactionstatesactionsfunctionrewardcontrolagentq-learningoptimalgoallearningspacestepenvironmentsystemproblemstepssuttonpolicies
Topical N-grams (2) Topical N-grams (1)
Topic Comparison
motionvisualfieldpositionfiguredirectionfieldseyelocationretinareceptivevelocityvisionmovingsystemflowedgecenterlightlocal
LDA
receptive fieldspatial frequencytemporal frequencyvisual motionmotion energytuning curveshorizontal cellsmotion detectionpreferred directionvisual processingarea mtvisual cortexlight intensitydirectional selectivityhigh contrastmotion detectorsspatial phasemoving stimulidecision strategyvisual stimuli
motionresponsedirectioncellsstimulusfigurecontrastvelocitymodelresponsesstimulimovingcellintensitypopulationimagecentertuningcomplexdirections
Topical N-grams (2) Topical N-grams (1)
Topic Comparison
wordsystemrecognitionhmmspeechtrainingperformancephonemewordscontextsystemsframetrainedspeakersequencespeakersmlpframessegmentationmodels
LDA
speech recognitiontraining dataneural networkerror ratesneural nethidden markov modelfeature vectorscontinuous speechtraining procedurecontinuous speech recognitiongamma filterhidden controlspeech productionneural netsinput representationoutput layerstraining algorithmtest setspeech framesspeaker dependent
speechwordtrainingsystemrecognitionhmmspeakerperformancephonemeacousticwordscontextsystemsframetrainedsequencephoneticspeakersmlphybrid
Topical N-grams (2) Topical N-grams (1)
Outline
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Social Network Analysis with Topic Models
Multi-Conditional Mixtures
Social Networks in Research Literature
• Better understand structure of our own research area.
• Structure helps us learn a new field.• Aid collaboration• Map how ideas travel through social networks
of researchers.
• Aids for hiring and finding reviewers!
Traditional Bibliometrics
• Analyses a small amount of data(e.g. 19 articles from a single issue of a journal)
• Uses “journal” as a proxy for “research topic”(but there is no journal for information extraction)
• Uses impact measures almost exclusively based on simple citation counts.
How can we use topic models to create new, interesting impact measures?
Our Data
• Over 1 million research papers, gathered as part of Rexa.info portal.
• Cross linked references / citations.
QuickTime™ and aTIFF (LZW) decompressor
are needed to see this picture.
Finding Topics with TNG
Traditional unigram LDArun on 1 milliontitles / abstracts
(200 topics)
...select ~300k papers onML, NLP, robotics, vision...
Find 200 TNG topics among those papers.
Topical Bibliometric Impact Measures
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Diversity
• Topical Precedence
• Topical Transfer
Topical DiversityEntropy of the topic distribution among
papers that cite this paper (this topic).
LowDiversity
HighDiversity
Topical Diversity
Can also be measured on particular papers...
Topical PrecedenceWithin a topic, what are the earliest papers that received more than n citations?
“Early-ness”
Information Retrieval:
On Relevance, Probabilistic Indexing and Information Retrieval,Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems,
Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
Topical PrecedenceWithin a topic, what are the earliest papers that received more than n citations?
“Early-ness”
Speech Recognition:
Some experiments on the recognition of speech, with one and two ears,E. Colin Cherry (1953)
Spectrographic study of vowel reduction, B. Lindblom (1963)
Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
Automatic Recognition of Speakers from Their Voices, B. Atal (1976)
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic Cit’s Paper Title
Web Pages 31 Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision 14 On being ‘Undigital’ with digital cameras: extending the dynamic...
Video 12 Lessons learned from the creation and deployment of a terabyte digital video
Graphs 12 Trawling the Web for Emerging Cyber-Communities
Web Pages 11 WebBase: a repository of Web pages
Topical TransferCitation counts from one topic to another.
Map “producers and consumers”
Outline
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Social Network Analysis with Topic Models
Multi-Conditional Mixtures
Want a “topic model” with the advantages of CRFs
• Use arbitrary, overlapping features of the input.
• Undirected graphical model, so we don’t have to think about avoiding cycles.
• Integrate naturally with our other CRF components.
• Train “discriminatively”
• Natural semi-supervised training
What does this mean?
Topic models are unsupervised!
“Multi-Conditional Mixtures”Latent Variable Models fit by Multi-way Conditional Probability
• For clustering structured data,ala Latent Dirichlet Allocation & its successors
• But an undirected model,like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
• But trained by a “multi-conditional” objective: O = P(A|B,C) P(B|A,C) P(C|A,B)e.g. A,B,C are different modalities
[McCallum, Wang, Pal, 2005], [McCallum, Pal, Wang, 2006]
Objective Functions for Parameter EstimationTraditional, joint training (e.g. naive Bayes, most topic models)
Traditional, conditional training (e.g. MaxEnt classifiers, CRFs)
Conditional mixtures (e.g. Jebara’s CEM)
Multi-conditional(mostly conditional, generative regularization)
Multi-conditional(for semi-sup)
Multi-conditional(for transfer learning, 2 tasks, shared hiddens)
Tra
dit
ion
alN
ew,
mu
lti-
con
dit
ion
al
Traditional mixture model (e.g. LDA)
“Multi-Conditional Learning” (Regularization)[McCallum, Pal, Wang, 2006]
Predictive Random Fieldsmixture of Gaussians on synthetic data
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.
Data, classify by color Generatively trained
Conditionally-trained [Jebara 1998]
Multi-Conditional
[McCallum, Wang, Pal, 2005]
Multi-Conditional Mixturesvs. Harmoniun
on document retrieval task
Harmonium, joint with words, no labels
Harmonium, joint,with class labels and words
Conditionally-trained,to predict class labels
Multi-Conditional,multi-way conditionally trained
[McCallum, Wang, Pal, 2005]
Outline
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Social Network Analysis with Topic Models
Multi-Conditional Mixtures
Summary
Assigning topics to documents
1. Build a 200 topic n-gram topic model on 300k documents
2. Remove stopword or methodological topics (e.g. “efficient, fast, speed”)
3. For each document d, if more than 10% of d’s tokens are assigned to topic t, and that comprises more than two tokens, assign d to t
Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.
Impact Factor
Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.
2004 Impact factors from JCR:
Nature 32.182
Cell 28.389
JMLR 5.952
Machine Learning 3.258
Topic Impact Factor
Broad Impact: Diffusion
Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100
Problem: relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of citing topics
Better at capturing broad end of impact spectrum: the high diffusion topics are identical to the least frequently cited topics
Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of citing topics
Topic diversity can also be measured for papers:
Longevity: Cited Half Life
Two views:• Given a paper, what is the median age of
citations to that paper?• What is the median age of citations from
current literature?
History: Topical Precedence
Within a topic, what are the earliest papers that received more than n citations?
Information Retrieval (138):On Relevance, Probabilistic Indexing and Information Retrieval,
Kuhns and Maron (1960)Expected Search Length: A Single Measure of Retrieval
Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)Relevance feedback and the optimization of retrieval
effectiveness, Salton (1971)New experiments in relevance feedback, Ide (1971)Automatic Indexing of a Sound Database Using Self-organizing
Neural Nets, Feiten and Gunzel (1982)