Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with

Topic Models for Social Network Analysis and Bibliometrics

Andrew McCallum

Computer Science Department

University of Massachusetts Amherst

Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.

Goal:

Mine actionable knowledge from unstructured text.

From Text to Actionable Knowledge

[Figure: pipeline. Spider → Filter → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)]



Uncertainty Info

Emerging Patterns

Joint Inference


[Figure: the same pipeline, with the Database replaced by a Probabilistic Model]

Conditional Random Fields [Lafferty, McCallum, Pereira]

Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Discriminatively-trained undirected graphical models

Complex Inference and Learning

Just what we researchers like to sink our teeth into!

Unified Model

Context



Leveraging Text in Social Network Analysis

Joint inference among detailed steps

Outline

• Role Discovery (Author-Recipient-Topic Model, ART)

• Group Discovery (Group-Topic Model, GT)

• Enhanced Topic Models

– Correlations among Topics (Pachinko Allocation, PAM)

– Time Localized Topics (Topics-over-Time Model, TOT)

– Markov Dependencies in Topics (Topical N-Grams Model, TNG)

• Bibliometric Impact Measures enabled by Topics

Social Network Analysis with Topic Models

Multi-Conditional Mixtures

Social Network in an Email Dataset


Clustering words into topics with Latent Dirichlet Allocation

[Blei, Ng, Jordan 2003]

Generative Process:

For each document:
    Sample a distribution over topics, θ
    For each word in doc:
        Sample a topic, z
        Sample a word from the topic, w

Example:
    Distribution over topics: 70% Iraq war, 30% US election
    Sampled topic: Iraq war
    Sampled word: “bombing”
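The generative process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy topics, the four-word vocabulary, and the symmetric Dirichlet parameter are made up for the example, and the helper names (`sample_dirichlet`, `generate_document`) are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document: a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)              # distribution over topics
    doc = []
    for _ in range(n_words):
        z = sample_index(theta, rng)                  # sample a topic, z
        w = sample_index(topic_word_probs[z], rng)    # sample a word from the topic, w
        doc.append((z, vocab[w]))
    return theta, doc

# Toy, hand-written topics (assumed for illustration, not learned from data).
VOCAB = ["bombing", "troops", "vote", "ballot"]
TOPICS = [
    [0.5, 0.5, 0.0, 0.0],   # an "Iraq war"-like topic
    [0.0, 0.0, 0.5, 0.5],   # a "US election"-like topic
]

theta, doc = generate_document(TOPICS, VOCAB, alpha=[1.0, 1.0],
                               n_words=8, rng=random.Random(0))
print(theta)
print(doc)
```

Inference runs this story in reverse: given only the words, it estimates the topics and the per-document mixtures.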

Example topics induced from a large collection of text [Tenenbaum et al]

STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL

MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE

WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN

FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED

SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES

BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY

JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE


From LDA to Author-Recipient-Topic (ART) [McCallum et al 2005]

Inference and Estimation

Gibbs Sampling:
• Easy to implement
• Reasonably fast

Enron Email Corpus

• 250k email messages

• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas
[email protected]

Topics, and prominent senders / receivers discovered by ART

Topic names, by hand

Topics, and prominent senders / receivers discovered by ART

Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”

Comparing Role Discovery

connection strength (A,B) =

Traditional SNA: distribution over recipients

Author-Topic: distribution over authored topics

ART: distribution over authored topics

Comparing Role Discovery: Tracy Geaconne, Dan McCarty

Traditional SNA    Author-Topic    ART

Similar roles    Different roles    Different roles

Geaconne = “Secretary”
McCarty = “Vice President”

Comparing Role Discovery: Lynn Blair, Kimberly Watson

Different roles    Very different    Very similar

Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”

McCallum Email Corpus 2004

• January - October 2004

• 23k email messages

• 825 people

From: [email protected]
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: [email protected]

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.
CALO registration receipt.

Thanks,
Kate

Four most prominent topics in discussions with ____?

Two most prominent topics in discussions with ____?

Words               Prob
love                0.030514
house               0.015402
[blank in source]   0.013659
time                0.012351
great               0.011334
hope                0.011043
dinner              0.00959
saturday            0.009154
left                0.009154
ll                  0.009009
[blank in source]   0.008282
visit               0.008137
evening             0.008137
stay                0.007847
bring               0.007701
weekend             0.007411
road                0.00712
sunday              0.006829
kids                0.006539
flight              0.006539

Words               Prob
today               0.051152
tomorrow            0.045393
time                0.041289
ll                  0.039145
meeting             0.033877
week                0.025484
talk                0.024626
meet                0.023279
morning             0.022789
monday              0.020767
back                0.019358
call                0.016418
free                0.015621
home                0.013967
won                 0.013783
day                 0.01311
hope                0.012987
leave               0.012987
office              0.012742
tuesday             0.012558

Role-Author-Recipient-Topic Models

Results with RART: People in “Role #3” in Academic Email

• olc: lead Linux sysadmin
• gauthier: sysadmin for CIIR group
• irsystem: mailing list CIIR sysadmins
• system: mailing list for dept. sysadmins
• allan: Prof., chair of “computing committee”
• valerie: second Linux sysadmin
• tech: mailing list for dept. hardware
• steve: head of dept. I.T. support

Roles for allan (James Allan)

• Role #3: I.T. support
• Role #2: Natural Language researcher

Roles for pereira (Fernando Pereira)

• Role #2: Natural Language researcher
• Role #4: SRI CALO project participant
• Role #6: Grant proposal writer
• Role #10: Grant proposal coordinator
• Role #8: Guests at McCallum’s house


ART: Roles but not Groups

Enron TransWestern Division

[Figure panel labels: Block structured; Not; Not]

Groups and Topics

• Input:
– Observed relations between people
– Attributes on those relations (text, or categorical)

• Output:
– Attributes clustered into “topics”
– Groups of people, varying depending on topic

Discovering Groups from Observed Set of Relations

Admiration relations among six high school students.

Student Roster

Adams, Bennett, Carter, Davis, Edwards, Frederking

Academic Admiration

Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)

Adjacency Matrix Representing Relations

[Figure: adjacency matrix over students A-F, first in roster order A B C D E F (group labels G1 G2 G1 G2 G3 G3), then reordered as A C B D E F (group labels G1 G1 G2 G2 G3 G3).]
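As a rough illustration of why reordering by group matters, the sketch below builds the adjacency matrix from the Academic Admiration relations listed on this slide and prints it in both orderings. Reading `Acad(X, Y)` as "row X, column Y" is an assumption, and the function names are hypothetical.

```python
ROSTER = ["A", "B", "C", "D", "E", "F"]

# Academic Admiration relations from the slide, as (first, second) argument pairs.
ACAD = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

# Group labels shown on the slide for roster order A..F.
GROUPS = {"A": "G1", "B": "G2", "C": "G1", "D": "G2", "E": "G3", "F": "G3"}

def adjacency(order, relations):
    """Build a 0/1 matrix with rows and columns in the given order."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in relations:
        matrix[index[src]][index[dst]] = 1
    return matrix

def reorder_by_group(order, groups):
    """Sort entities so that members of the same group are adjacent."""
    return sorted(order, key=lambda name: (groups[name], name))

for order in (ROSTER, reorder_by_group(ROSTER, GROUPS)):
    print(" ".join(order))
    for row in adjacency(order, ACAD):
        print(" ".join(str(cell) for cell in row))
    print()
```

In roster order the ones are scattered; after grouping, they collect into solid blocks, which is the structure a block model looks for.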



Group Model: Partitioning Entities into Groups

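Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]

S: number of entities

G: number of groups

[Graphical model: group assignments g (plate S, Multinomial) with a Dirichlet prior α; group-pair parameters γ (plate G², Beta with parameter β); observed relations v (plate S², Binomial).]

Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]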

Two Relations with Different Attributes

[Figure: two adjacency matrices over the students in the roster. Academic Admiration is shown in the ordering A C B D E F (group labels G1 G1 G2 G2 G3 G3); Social Admiration is shown in the ordering A C E B D F (group labels G1 G1 G1 G2 G2 G2).]

Social Admiration

Soci(A, B) Soci(A, D) Soci(A, F)
Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F)
Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F)
Soci(F, A) Soci(F, C) Soci(F, E)

The Group-Topic Model: Discovering Groups and Topics Simultaneously

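[Graphical model: the group model (group assignments g, plate S, Multinomial, Dirichlet prior α; group-pair parameters γ, plate G², Beta with parameter β; relations v, plate S², Binomial) combined with a topic component (a topic t per event, Uniform, plate B; words w, plate N_b, drawn from per-topic Multinomials φ, plate T, with Dirichlet prior η). The group structure is replicated per topic (plate T).]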

[Wang, Mohanty, McCallum 2006]

Inference and Estimation

Gibbs Sampling:
• Many r.v.s can be integrated out
• Easy to implement
• Reasonably fast

We assume the relationship is symmetric.

Dataset #1: U.S. Senate

• 16 years of voting records in the US Senate (1989 – 2005)

• a Senator may respond Yea or Nay to a resolution

• 3423 resolutions with text attributes (index terms)

• 191 Senators in total across 16 years

S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms

Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……


Topics Discovered (U.S. Senate)

Mixture of Unigrams

Education     Energy      Military Misc.   Economic
education     energy      government       federal
school        power       military         labor
aid           water       foreign          insurance
children      nuclear     tax              aid
drug          gas         congress         tax
students      petrol      aid              business
elementary    research    law              employee
prevention    pollution   policy           care

Group-Topic Model

Education + Domestic   Foreign        Economic    Social Security + Medicare
education              foreign        labor       social
school                 trade          insurance   security
federal                chemicals      tax         insurance
aid                    tariff         congress    medical
government             congress       income      care
tax                    drugs          minimum     medicare
energy                 communicable   wage        disability
research               diseases       business    assistance

Groups Discovered (US Senate)

Groups from topic Education + Domestic

Senators Who Change Coalition the most Dependent on Topic

e.g. Senator Shelby (D-AL) votes
• with the Republicans on Economic
• with the Democrats on Education + Domestic
• with a small group of maverick Republicans on Social Security + Medicare

Dataset #2: The UN General Assembly

• Voting records of the UN General Assembly (1990 - 2003)

• A country may choose to vote Yes, No or Abstain

• 931 resolutions with text attributes (titles)

• 192 countries in total

• Also experiments later with resolutions from 1960-2003

Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting

The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:

In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

Topics Discovered (UN)

Mixture of Unigrams

Everything Nuclear   Human Rights   Security in Middle East
nuclear              rights         occupied
weapons              human          israel
use                  palestine      syria
implementation       situation      security
countries            israel         calls

Group-Topic Model

Nuclear Non-proliferation   Nuclear Arms Race   Human Rights
nuclear                     nuclear             rights
states                      arms                human
united                      prevention          palestine
weapons                     race                occupied
nations                     space               israel

Groups Discovered (UN)

The country lists for each group are ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.

Latent Dirichlet Allocation

[Blei, Ng, Jordan, 2003]

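[Graphical model: α → θ → z → w ← φ ← β, with plate n over the words of a document, plate N over documents, and plate T over topics.]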

LDA 100 (“motion, some junk”): motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real

LDA 20 (“images, motion, eyes”): visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location

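Correlated Topic Model [Blei, Lafferty, 2005]

[Graphical model: as for LDA (z, w, φ, β; plates n, N, T), but with per-document η drawn from a logistic normal, parameterized by a square matrix of pairwise correlations.]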

Pachinko Machine


Pachinko Allocation Model [Li, McCallum, 2005]

Thanks to Michael Jordan for suggesting the name

[Figure (model structure, not the graphical model): a DAG with root α11; interior nodes α21, α22; α31, α32, α33; α41, α42, α43, α44, α45; and leaves word1 ... word8.]

Given: directed acyclic graph (DAG); at each interior node: a Dirichlet over its children and words at leaves

For each document: Sample a multinomial from each Dirichlet

For each word in this document: Starting from the root, sample a child from successive nodes, down to a leaf. Generate the word at the leaf

Like a Polya tree, but DAG shaped, with arbitrary number of children.
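The generative story on this slide (one multinomial per interior node per document, then a root-to-leaf walk for every word) can be sketched as follows. The DAG, its node names, and the symmetric Dirichlet parameters are hypothetical choices for illustration only, not the structure used in the talk.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_pam_document(dag, alphas, root, n_words, rng):
    """Generate one document by repeated root-to-leaf walks through the DAG."""
    # For each document: sample a multinomial from each interior node's Dirichlet.
    theta = {node: sample_dirichlet(alphas[node], rng) for node in dag}
    words = []
    for _ in range(n_words):
        node = root
        # Starting from the root, sample a child from successive nodes, down to a leaf.
        while node in dag:
            node = rng.choices(dag[node], weights=theta[node], k=1)[0]
        words.append(node)  # the leaf reached is the generated word
    return words

# Hypothetical toy DAG: interior nodes map to their children; leaves are words.
DAG = {
    "root": ["super1", "super2", "topicC"],   # one edge skips a layer
    "super1": ["topicA", "topicB"],
    "super2": ["topicB", "topicC"],           # topicB has two parents: a DAG, not a tree
    "topicA": ["word1", "word2"],
    "topicB": ["word2", "word3"],
    "topicC": ["word3", "word4"],
}
ALPHAS = {node: [1.0] * len(children) for node, children in DAG.items()}

print(generate_pam_document(DAG, ALPHAS, "root", n_words=10, rng=random.Random(0)))
```

Because the per-node multinomials are resampled for every document, which topics co-occur is itself governed by the Dirichlets higher in the DAG.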


DAG may have arbitrary structure
• arbitrary depth
• any number of children per node
• sparse connectivity
• edges may skip layers

[Figure: the same DAG (model structure, not the graphical model).]


Distributions over words (like “LDA topics”)

Distributions over topics; mixtures, representing topic correlations

Distributions over distributions over topics...

Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).

[Figure: the same DAG (model structure, not the graphical model), with its layers annotated as above.]


Estimate all these Dirichlets from data.

Estimate model structure from data. (number of nodes, and connectivity)

[Figure: the same DAG (model structure, not the graphical model).]

Pachinko Allocation Special Cases: Latent Dirichlet Allocation

[Figure: a single Dirichlet node, α32, over the nodes α41 ... α45.]


Pachinko Allocation Special Cases: Hierarchical Latent Dirichlet Allocation (HLDA)

[Figure: DAG with nodes α11; α21 ... α24; α31 ... α34; α41, α42; α51, with the HLDA hierarchy marked.]

Very low variance Dirichlet at root

Each leaf of the HLDA topic hier. has a distr. over nodes on path to the root.

Pachinko Allocation on a Topic Hierarchy

Combining best of HLDA and Pachinko Allocation

[Figure: the HLDA hierarchy (nodes α11; α21 ... α24; α31 ... α34; α41, α42; α51) together with the PAM DAG (nodes α00, α12), representing correlations among topic leaves.]

Pachinko Allocation Model

... with two layers, no skipping layers, fully-connected from one layer to the next.

[Figure: root α11; “super-topics” α21, α22, α23; “sub-topics” α31 ... α35; fixed multinomials below the sub-topics.]

Another special case would select only one super-topic per document.

Graphical Models

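[Figure: two graphical models side by side. LDA: α, θ, z, w, φ, β with plates n, N, T. PAM (with fixed multinomials for topics): α and θ in a plate of size q, a topic path z1, z2, ..., zm for each word, w, φ, β with plates n, N, T.]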

Pachinko Allocation Model

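• Likelihood

$$P(\mathbf{w}, \mathbf{z}, \theta, \phi \mid \alpha, \beta) = \prod_{i=1}^{T} P(\phi_i \mid \beta) \times \prod_{i=1}^{N} \left( \prod_{j=1}^{q} P(\theta_{ij} \mid \alpha_j) \prod_{j=1}^{n} \left( \prod_{k=2}^{m} P(z_{ijk} \mid \theta_{i z_{ij(k-1)}}) \, P(w_{ij} \mid z_{ijm}, \phi) \right) \right)$$

• Estimate z’s by Gibbs sampling

$$P(z_{ij2}, \ldots, z_{ijm} \mid \mathbf{w}, \mathbf{z}_{-ij}, \alpha, \beta) \propto \prod_{k=2}^{m} \frac{C_i(z_{ij(k-1)}, z_{ijk}) + \alpha_{z_{ij(k-1)} z_{ijk}}}{C_i(z_{ij(k-1)}) + \sum_{k'} \alpha_{z_{ij(k-1)} k'}} \times \frac{C(z_{ijm}, w_{ij}) + \beta_{w_{ij}}}{C(z_{ijm}) + \sum_{w'} \beta_{w'}}$$

Here C_i(·) are counts within document i and C(·) are counts over the whole corpus, both excluding the token being resampled.

• Estimate α’s by moment matching.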

Preliminary Experimental Results

• Topic Coherence

• Likelihood on held-out data

• Document classification

NIPS Dataset

• 1740 papers
• 13649 words
• 2,301,375 tokens

NIPS Conference Papers, Volumes 0-12

Spanning 1987 – 1999. Prepared by Sam Roweis.

Topic Coherence Comparison

LDA 100 (“estimation, some junk”): estimation, likelihood, maximum, noisy, estimates, mixture, scene, surface, normalization, generated, measurements, surfaces, estimating, estimated, iterative, combined, figure, divisive, sequence, ideal

LDA 20 (“models, estimation, stopwords”): models, model, parameters, distribution, bayesian, probability, estimation, data, gaussian, methods, likelihood, em, mixture, show, approach, paper, density, framework, approximation, markov

PAM 100 (“estimation”): estimation, bayesian, parameters, data, methods, estimate, maximum, probabilistic, distributions, noise, variable, variables, noisy, inference, variance, entropy, models, framework, statistical, estimating

Example super-topic
33  input hidden units function number
27  estimation bayesian parameters data methods
24  distribution gaussian markov likelihood mixture
11  exact kalman full conditional deterministic
1   smoothing predictive regularizers intermediate slope

Topic Coherence Comparison

LDA 100 (“neural networks, some junk”): network, layer, multi, trained, high, perceptron, layers, give, type, nonlinearity, perceptrons, module, modified, matched, performed, provided, designed, samples, study, mode

LDA 20 (“neural networks, some junk”): architecture, network, input, output, structure, paper, level, task, work, sequences, sequence, multiple, problem, shows, connectionist, networks, context, perform, scale, learn

PAM 100 (“neural networks, much less junk”): input, hidden, units, function, number, functions, networks, output, linear, layer, single, results, weight, inputs, basis, parameters, standard, network, patterns, study

Blind Topic Evaluation

• Randomly select 25 similar pairs of topics generated from PAM and LDA

• 5 people

• Each asked “select the topic in each pair that you find most semantically coherent.”

Topic counts

              LDA   PAM
5 votes         0     5
>= 4 votes      3     8
>= 3 votes      9    16

Prefer PAM

Topic Correlations in PAM

5000 research paper abstracts, from across all CS

Numbers on edges are supertopics’ Dirichlet parameters

Likelihood on Held Out Data

• Likelihood comparison
– NIPS abstracts
– Train the model with 75% data
– Calculate likelihood on 25% data

• Calculate likelihood by
– Sampling many, many documents from the model
– Estimating a simple mixture of multinomials from these
– Calculating the likelihood of data under this simple mixture.

Likelihood Comparison

Varying number of topics

Document Classification

Test Accuracy (%)

“Comp5” from 20 Newsgroups corpus.

Train on 25%, test on 75%

Like Naive Bayes, but use LDA/PAM per-class instead of multinomial.

~2.5% increase

Want to Model Trends over Time

• Is prevalence of topic growing or waning?

• Pattern appears only briefly
– Capture its statistics in focused way
– Don’t confuse it with patterns elsewhere in time

• How do roles, groups, influence shift over time?

Topics over Time (TOT)

[Graphical model, shown in two layouts: a per-document multinomial over topics (Dirichlet prior α); for each of the N_d tokens in each of the D documents, a topic index z, a word w, and a time stamp t; for each of the T topics, a multinomial over words (Dirichlet prior β) and a Beta distribution over time stamps (uniform prior γ).]

State of the Union Address

208 Addresses delivered between January 8, 1790 and January 29, 2002.

To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopping was applied.

• 17156 ‘documents’

• 21534 words

• 669,425 tokens

Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.

1910

Comparing TOT against LDA

TOT on 17 years of NIPS proceedings

TOT on 17 years of NIPS proceedings

TOT LDA

TOT versus LDA on my email

TOT improves ability to Predict Time

Predicting the year of a State-of-the-Union address.

L1 = distance between predicted year and actual year.

Topics Modeling Phrases

• Topics based only on unigrams often difficult to interpret

• Topic discovery itself is confused because important meaning / distinctions carried by phrases.

• Significant opportunity to provide improved language models to ASR, MT, IR, etc.

Topical N-gram Model

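[Graphical model: a chain over word positions with topics z1, z2, z3, z4, ..., words w1, w2, w3, w4, ..., and indicator variables y1, y2, y3, y4, ...; plates D, T, W and TW; hyperparameters α, β, γ1, γ2.]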

LDA topic: algorithms, algorithm, genetic, problems, efficient

Topical N-grams topic: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

Topic Comparison

LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning

Topical N-grams (2): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods

Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

Topic Comparison

LDA: motion, visual, field, position, figure, direction, fields, eye, location, retina, receptive, velocity, vision, moving, system, flow, edge, center, light, local

Topical N-grams (2): receptive field, spatial frequency, temporal frequency, visual motion, motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli

Topical N-grams (1): motion, response, direction, cells, stimulus, figure, contrast, velocity, model, responses, stimuli, moving, cell, intensity, population, image, center, tuning, complex, directions


Topic Comparison

LDA: word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models

Topical N-grams (2): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent

Topical N-grams (1): speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid


Social Networks in Research Literature

• Better understand structure of our own research area.

• Structure helps us learn a new field.

• Aid collaboration

• Map how ideas travel through social networks of researchers.

• Aids for hiring and finding reviewers!

Traditional Bibliometrics

• Analyses a small amount of data (e.g. 19 articles from a single issue of a journal)

• Uses “journal” as a proxy for “research topic” (but there is no journal for information extraction)

• Uses impact measures almost exclusively based on simple citation counts.

How can we use topic models to create new, interesting impact measures?

Our Data

• Over 1 million research papers, gathered as part of Rexa.info portal.

• Cross linked references / citations.



Finding Topics with TNG

Traditional unigram LDA run on 1 million titles / abstracts (200 topics)

...select ~300k papers on ML, NLP, robotics, vision...

Find 200 TNG topics among those papers.

Topical Bibliometric Impact Measures

• Topical Citation Counts

• Topical Impact Factors

• Topical Longevity

• Topical Diversity

• Topical Precedence

• Topical Transfer

Topical Diversity

Entropy of the topic distribution among papers that cite this paper (this topic).

Low Diversity

High Diversity

Topical Diversity

Can also be measured on particular papers...
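As a small worked example of the definition above, the sketch below computes the entropy of the topic distribution among citing papers. The function name and the use of base-2 logarithms are assumptions; the slides do not specify the base.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy (in bits) of the empirical topic distribution of citing papers.

    citing_topics: one topic label per citing paper.
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A paper cited only from within one topic: low diversity.
print(topical_diversity(["IR"] * 8))
# A paper cited evenly from four topics: high diversity.
print(topical_diversity(["IR", "NLP", "Vision", "Robotics"] * 2))
```

The same function applies to a whole topic by pooling the citing topics of all its papers.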

Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

“Early-ness”

Information Retrieval:

On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)

Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)

Relevance feedback in information retrieval, Rocchio (1971)

Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)

New experiments in relevance feedback, Ide (1971)

Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)

Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

“Early-ness”

Speech Recognition:

Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)

Spectrographic study of vowel reduction, B. Lindblom (1963)

Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)

Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)

Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

Topical Transfer

Transfer from Digital Libraries to other topics

Other topic       Cit’s   Paper Title
Web Pages         31      Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision   14      On being ‘Undigital’ with digital cameras: extending the dynamic...
Video             12      Lessons learned from the creation and deployment of a terabyte digital video
Graphs            12      Trawling the Web for Emerging Cyber-Communities
Web Pages         11      WebBase: a repository of Web pages

Topical Transfer

Citation counts from one topic to another.

Map “producers and consumers”
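A minimal sketch of counting citations from one topic to another is shown below. It assumes, for simplicity, that each paper carries a single topic label (the assignment procedure later in the deck can give a paper several), and the paper identifiers and function name are hypothetical.

```python
from collections import Counter

def topical_transfer(citations, paper_topic):
    """Count citations between topics.

    citations: iterable of (citing_paper, cited_paper) pairs.
    paper_topic: dict mapping a paper to its (single, assumed) topic.
    Returns a Counter keyed by (citing_topic, cited_topic).
    """
    counts = Counter()
    for citing, cited in citations:
        if citing in paper_topic and cited in paper_topic:
            counts[(paper_topic[citing], paper_topic[cited])] += 1
    return counts

# Toy data, made up for illustration.
PAPER_TOPIC = {"p1": "Digital Libraries", "p2": "Web Pages",
               "p3": "Web Pages", "p4": "Video"}
CITATIONS = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p3", "p2")]

for (src, dst), n in sorted(topical_transfer(CITATIONS, PAPER_TOPIC).items()):
    print(f"{src} cites {dst}: {n}")
```

Topics that are mostly cited are "producers"; topics that mostly cite are "consumers".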

Want a “topic model” with the advantages of CRFs

• Use arbitrary, overlapping features of the input.

• Undirected graphical model, so we don’t have to think about avoiding cycles.

• Integrate naturally with our other CRF components.

• Train “discriminatively”

• Natural semi-supervised training

What does this mean?

Topic models are unsupervised!

“Multi-Conditional Mixtures”

Latent Variable Models fit by Multi-way Conditional Probability

• For clustering structured data, à la Latent Dirichlet Allocation & its successors

• But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]

• But trained by a “multi-conditional” objective: O = P(A|B,C) P(B|A,C) P(C|A,B), e.g. A, B, C are different modalities

[McCallum, Wang, Pal, 2005], [McCallum, Pal, Wang, 2006]

Objective Functions for Parameter Estimation

Traditional
• Traditional, joint training (e.g. naive Bayes, most topic models)
• Traditional mixture model (e.g. LDA)
• Traditional, conditional training (e.g. MaxEnt classifiers, CRFs)
• Conditional mixtures (e.g. Jebara’s CEM)

New, multi-conditional
• Multi-conditional (mostly conditional, generative regularization)
• Multi-conditional (for semi-sup)
• Multi-conditional (for transfer learning, 2 tasks, shared hiddens)

“Multi-Conditional Learning” (Regularization) [McCallum, Pal, Wang, 2006]

Predictive Random Fields: mixture of Gaussians on synthetic data

[Figure panels: Data, classify by color; Generatively trained; Conditionally-trained [Jebara 1998]; Multi-Conditional]

[McCallum, Wang, Pal, 2005]

Multi-Conditional Mixtures vs. Harmonium on document retrieval task

Harmonium, joint with words, no labels

Harmonium, joint, with class labels and words

Conditionally-trained, to predict class labels

Multi-Conditional, multi-way conditionally trained

[McCallum, Wang, Pal, 2005]

Summary

Assigning topics to documents

1. Build a 200 topic n-gram topic model on 300k documents

2. Remove stopword or methodological topics (e.g. “efficient, fast, speed”)

3. For each document d, if more than 10% of d’s tokens are assigned to topic t, and that comprises more than two tokens, assign d to t
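The assignment rule in step 3 can be sketched as a short function. It assumes each document arrives as a list of per-token topic assignments (for example, from the final Gibbs sample); the function and parameter names are hypothetical, and the `excluded` set stands in for the topics removed in step 2.

```python
from collections import Counter

def assign_topics(token_topics, min_fraction=0.10, min_tokens=2, excluded=frozenset()):
    """Return the topics a document is assigned to.

    A topic qualifies if it accounts for more than min_fraction of the
    document's tokens and for more than min_tokens tokens in absolute terms.
    """
    counts = Counter(token_topics)
    total = len(token_topics)
    return sorted(t for t, c in counts.items()
                  if t not in excluded
                  and c > min_fraction * total
                  and c > min_tokens)

# Toy document with 20 tokens: topic 1 x5, topic 2 x2, topic 3 x13.
doc = [1] * 5 + [2] * 2 + [3] * 13
print(assign_topics(doc))                 # topics passing both thresholds
print(assign_topics(doc, excluded={3}))   # with topic 3 treated as a removed topic
```

The absolute threshold matters for very short documents, where a single token can already exceed 10% of the total.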

Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.

Impact Factor

Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.

2004 Impact factors from JCR:

Nature 32.182

Cell 28.389

JMLR 5.952

Machine Learning 3.258
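The impact factor definition above translates directly into code. The sketch below uses made-up papers and citations; substituting a topic label for the venue gives the topical variant. All names here are hypothetical.

```python
def impact_factor(citations, venue_of, year_of, venue, census_year):
    """Citations made in census_year to the venue's articles from the two
    preceding years, divided by the number of those articles."""
    window = {census_year - 2, census_year - 1}
    articles = {p for p, v in venue_of.items()
                if v == venue and year_of[p] in window}
    if not articles:
        return 0.0
    cites = sum(1 for citing, cited in citations
                if year_of.get(citing) == census_year and cited in articles)
    return cites / len(articles)

# Toy data, made up for illustration.
VENUE_OF = {"p1": "Cell", "p2": "Cell", "p3": "Cell",
            "q1": "Other", "q2": "Other", "q3": "Other", "q4": "Other"}
YEAR_OF = {"p1": 2002, "p2": 2003, "p3": 2001,
           "q1": 2004, "q2": 2004, "q3": 2004, "q4": 2003}
CITATIONS = [("q1", "p1"), ("q2", "p1"), ("q3", "p2"),
             ("q4", "p1"),   # citing paper is from 2003, so not counted
             ("q1", "p3")]   # cited paper is from 2001, outside the window

print(impact_factor(CITATIONS, VENUE_OF, YEAR_OF, "Cell", 2004))
```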

Topic Impact Factor

Broad Impact: Diffusion

Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100

Problem: relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
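The brittleness noted above is easy to see numerically. In this sketch, each list entry is the journal (or topic) of one incoming citation; the function name and the toy inputs are hypothetical.

```python
def diffusion(citing_sources):
    """Number of distinct citing sources divided by total citations, times 100."""
    if not citing_sources:
        return 0.0
    return 100.0 * len(set(citing_sources)) / len(citing_sources)

# Cited twice, by two different journals: the maximum possible diffusion.
print(diffusion(["Journal X", "Journal Y"]))
# Cited 100 times, by the same two journals: diffusion collapses.
print(diffusion(["Journal X"] * 60 + ["Journal Y"] * 40))
```

Both cases have the same set of citing journals, yet the rarely cited one scores fifty times higher, which is why an entropy-based measure is considered next.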

Broad Impact: Diversity

Topic Diversity: Entropy of the distribution of citing topics

Better at capturing broad end of impact spectrum: the high diffusion topics are identical to the least frequently cited topics

Broad Impact: Diversity

Topic Diversity: Entropy of the distribution of citing topics

Topic diversity can also be measured for papers:

Longevity: Cited Half Life

Two views:
• Given a paper, what is the median age of citations to that paper?
• What is the median age of citations from current literature?
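Both views reduce to a median over citation ages. The sketch below assumes that "age" means the difference in publication years between the citing and cited work; the function names and the toy years are hypothetical.

```python
from statistics import median

def median_citation_age(cited_year, citing_years):
    """First view: median age of the citations received by one paper."""
    ages = [year - cited_year for year in citing_years]
    return median(ages) if ages else None

def median_reference_age(current_year, cited_years):
    """Second view: median age of the works cited by current literature."""
    ages = [current_year - year for year in cited_years]
    return median(ages) if ages else None

print(median_citation_age(2000, [2001, 2002, 2006]))
print(median_reference_age(2006, [2005, 2000, 1990, 2004]))
```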

History: Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

Information Retrieval (138):

On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)

Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)

Relevance feedback in information retrieval, Rocchio (1971)

Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)

New experiments in relevance feedback, Ide (1971)

Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
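The precedence query above amounts to a filter followed by a sort. The sketch below uses made-up records; the tuple layout, the function name, and the cut-off `k` are assumptions for illustration.

```python
def topical_precedence(papers, n, k=5):
    """Earliest k papers in a topic that received more than n citations.

    papers: list of (title, year, citation_count) tuples for one topic.
    """
    qualifying = [p for p in papers if p[2] > n]
    return sorted(qualifying, key=lambda p: p[1])[:k]

# Toy records, made up for illustration.
PAPERS = [
    ("Paper A", 1971, 40),
    ("Paper B", 1960, 25),
    ("Paper C", 1955, 3),    # early, but too few citations
    ("Paper D", 1968, 12),
]

for title, year, cites in topical_precedence(PAPERS, n=10, k=3):
    print(year, title, cites)
```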
