Analyzing Federal Funding, Scientific Publications and Email with Probabilistic Topic Models

Mark Steyvers (UC Irvine), Padhraic Smyth (UC Irvine), Dave Newman (UC Irvine), Tom Griffiths (Brown University)


Page 2

Analyzing Content / Managing Information

EMAIL

JOURNALS

NEWSPAPERS

Page 3

The Problem

Many information retrieval systems assess the similarity of documents based on raw word counts

  • DOCUMENT (CAR, CHEAP, PRICE, ...) vs. DOCUMENT (AUTOMOBILE, AFFORDABLE, AMOUNT, ...): no word overlap but high similarity
  • DOCUMENT (BANK, MONEY, ...) vs. DOCUMENT (BANK, RIVER, ...): high word overlap but low similarity
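The mismatch between word overlap and similarity can be sketched in a few lines. The example below is only an illustration, assuming cosine similarity over raw word counts as the overlap measure (the slides do not name a specific measure); the function name and the toy documents built from the slide's words are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two token lists, using raw word counts."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents built from the words on the slide.
car_doc = ["car", "cheap", "price"]
auto_doc = ["automobile", "affordable", "amount"]
money_doc = ["bank", "money"]
river_doc = ["bank", "river"]

# Related meaning but no shared words: similarity is exactly 0.
print(cosine_similarity(car_doc, auto_doc))
# Different meaning but a shared word ("bank"): similarity is about 0.5.
print(cosine_similarity(money_doc, river_doc))
```

A word-count measure therefore ranks the two "bank" documents as more alike than the car and automobile documents, which is the opposite of what a reader would judge.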

Page 4

One solution: compare documents on a latent set of factors (topics)

  • DOCUMENT (CAR, CHEAP, PRICE, ...) and DOCUMENT (AUTOMOBILE, AFFORDABLE, AMOUNT, ...) both map to topic 1 and topic 2: high topical overlap
  • DOCUMENT (BANK, MONEY, ...) maps to topic 3 and DOCUMENT (BANK, RIVER, ...) maps to topic 4: no topical overlap

Page 5

2nd generation systems

  • Go beyond the raw word information
  • Extract content in terms of topics
  • Deal with large sets of documents
  • Minimal supervision

Page 6

Probabilistic Topic Models

  • originated in the domain of statistics & machine learning
  • perform unsupervised extraction of topics from large text collections

Text documents:
  • scientific articles
  • book chapters
  • newspaper articles
  • ... any set of words in a verbal context

Page 7

Overview

I Probabilistic Topic Models

II Analyzing Scientific Topics: PNAS

III Analyzing Topics of Federal Funding

IV Analyzing Enron Email

V Extensions of the model

VI Conclusion

Page 8

I Probabilistic Topic Models

Page 9

Probabilistic Topic Models

Each document is a probability distribution over topics

Each topic is a probability distribution over words

We do not observe these distributions but we can infer them statistically

Page 10

The Generative Model

View document generation as a probabilistic process: TOPICS MIXTURE → TOPIC → WORD

1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic
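The three steps can be sketched as a short sampling routine. This is a minimal illustration, not the authors' code: the function names are hypothetical, the word probabilities are invented (the slides give none), and drawing the mixture from a Dirichlet via normalized gamma draws is an assumption consistent with the Bayesian treatment mentioned in the talk.

```python
import random


def sample_dirichlet(alpha, n_topics, rng):
    """Step 1: draw mixture weights from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_weights, topics, n_words, rng):
    """Generate (word, topic number) pairs for one document."""
    topic_ids = list(range(len(topics)))
    tokens = []
    for _ in range(n_words):
        # Step 2: sample a topic for this word slot from the mixture.
        z = rng.choices(topic_ids, weights=topic_weights)[0]
        # Step 3: sample a word from the chosen topic.
        words = list(topics[z])
        probs = [topics[z][w] for w in words]
        word = rng.choices(words, weights=probs)[0]
        tokens.append((word, z + 1))  # report topics as 1..T
    return tokens


# Hypothetical word probabilities for two topics.
topics = [
    {"money": 1 / 3, "loan": 1 / 3, "bank": 1 / 3},
    {"river": 1 / 3, "stream": 1 / 3, "bank": 1 / 3},
]
rng = random.Random(0)
weights = sample_dirichlet(1.0, len(topics), rng)
doc = generate_document(weights, topics, 16, rng)
print(" ".join(f"{word}{z}" for word, z in doc))
```

Note that "bank" can be emitted by either topic; only the sampled topic number tells the two senses apart.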

Page 11

Example

Mixture components:
  • TOPIC 1: money, loan, bank
  • TOPIC 2: river, stream, bank

Mixture weights:
  • DOCUMENT 1: .8 on TOPIC 1, .2 on TOPIC 2
  • DOCUMENT 2: .3 on TOPIC 1, .7 on TOPIC 2

DOCUMENT 1: money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

DOCUMENT 2: river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1

(The digit after each word is the topic it was sampled from.)

Bayesian approach: use priors
  • Mixture weights ~ Dirichlet(α)
  • Mixture components ~ Dirichlet(β)

Page 12

INVERTING THE GENERATIVE PROCESS

TOPIC 1: ?

TOPIC 2: ?

DOCUMENT 1: A Play is written to be performed on a stage before a live audience or before motion picture or television cameras (for later viewing by large audiences). A Play is written because playwrights have something ...

DOCUMENT 2: He was listening to music coming from a passing riverboat. The music had already captured his heart as well as his ear. It was jazz. Bix Beiderbecke had already had music lessons. He wanted to play the cornet. And he wanted to play jazz ...

We estimate the assignments of topics to words

Page 13

INVERTING THE GENERATIVE PROCESS

TOPIC 1

TOPIC 2

DOCUMENT 1: A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...

DOCUMENT 2: He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077 ...

We estimate the assignments of topics to words
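The slides do not say which estimation procedure is used. One widely used option for this kind of model is collapsed Gibbs sampling, and the sketch below assumes it purely for illustration: the function name, the hyperparameter values alpha and beta, the iteration count and the toy documents are all hypothetical choices, not taken from the talk.

```python
import random


def gibbs_topic_assignments(docs, n_topics, alpha=0.1, beta=0.01,
                            n_iters=200, seed=0):
    """Estimate a topic assignment for every word token (collapsed Gibbs)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    topic_ids = list(range(n_topics))
    # Count tables: words per topic, tokens per topic, topics per document.
    topic_word = [dict.fromkeys(vocab, 0) for _ in topic_ids]
    topic_total = [0] * n_topics
    doc_topic = [[0] * n_topics for _ in docs]
    # Start from random assignments.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, t in zip(doc, z[d]):
            topic_word[t][w] += 1
            topic_total[t] += 1
            doc_topic[d][t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[d][i]
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                doc_topic[d][t] -= 1
                # Weight of each topic given all other assignments:
                # (how much topic k likes word w) * (how much doc d uses k).
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + n_vocab * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in topic_ids
                ]
                t = rng.choices(topic_ids, weights=weights)[0]
                z[d][i] = t
                topic_word[t][w] += 1
                topic_total[t] += 1
                doc_topic[d][t] += 1
    return z, topic_word, doc_topic


docs = [
    "money bank bank loan money bank loan money".split(),
    "river stream bank stream river bank stream".split(),
    "money loan bank river stream bank".split(),
]
z, topic_word, doc_topic = gibbs_topic_assignments(docs, n_topics=2)
for doc, assignment in zip(docs, z):
    print(" ".join(f"{w}{t + 1}" for w, t in zip(doc, assignment)))
```

The returned count tables are what the word and document distributions are read from: normalizing a row of `topic_word` gives the likely words in a topic, and normalizing a row of `doc_topic` gives the likely topics for a document.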

Page 14

Choosing number of topics

Subjective interpretability

Bayesian model selection

Generalization tests

Models that grow with size of data

Page 15

Input/Output

INPUT: word-document counts

OUTPUT:
  • topic assignments to each word
  • likely words in each topic
  • likely topics for a document (“gist”)

Quantities labelled on the slide: z, P(w|z), P(z|w)

Page 16

Example: topics from an educational corpus (TASA)

  • 37K docs, 26K words
  • 1700 topics, e.g.:

PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS

PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED

TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT

JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL

HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION

STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

Page 17

Polysemy

The slide repeats the six TASA topics from the previous page. Several words occur in more than one of them:

  • PLAY: in the PLAY/PLAYS/STAGE topic and in the TEAM/GAME topic
  • COURT: in the TEAM/GAME topic and in the JUDGE/TRIAL topic
  • CHARACTERS: in the PRINTING topic and in the PLAY/PLAYS/STAGE topic
  • EVIDENCE: in the JUDGE/TRIAL topic and in the HYPOTHESIS/EXPERIMENT topic
  • TEST: in the HYPOTHESIS/EXPERIMENT topic and in the STUDY topic
  • TRY: in the TEAM/GAME topic and in the STUDY topic

Page 18

Three documents with the word “play”

(numbers & colors = topic assignments)

A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...

He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077 ...

Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166 ...

Page 19

II Analyzing Scientific Topics: PNAS

Page 20

PNAS Topics

Applied model to PNAS abstracts (Proceedings of the National Academy of Sciences)

Page 21

A selection of topics (out of 300)

FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL

HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL

MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE

STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE

NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC

TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

Page 22

PNAS Topics and classes

PNAS authors provide class designations
  • major: Biological, Physical, Social Sciences
  • minor: 33 separate disciplines

Find topics diagnostic of classes
  • validate “reality” of classes
  • show how disciplines overlap topically

Page 24

TOPIC 210: SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS

[Figure labels: Neurobiology, Topic 210]

Page 25

TOPIC 280: SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY

[Figure labels: Evolution, Population biology, Topic 280]

Page 26

TOPIC 39: THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM

[Figure labels: Mathematics, Applied Mathematics, Topic 39]

Page 27

Topic Dynamics

We have the distribution over topics for PNAS abstracts from 1991 to 2001

Analysis of dynamics:
  • perform linear trend analysis for each topic
  • “hot topics” go up, “cold topics” go down
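A linear trend analysis of this kind can be sketched as an ordinary least-squares slope per topic. The example below is a rough illustration under stated assumptions: the function name is hypothetical, the yearly topic probabilities are invented for the demonstration, and the sign of the slope is used as the hot/cold criterion without any significance test.

```python
def linear_trend_slope(years, values):
    """Ordinary least-squares slope of values regressed on years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


years = list(range(1991, 2002))  # 1991 to 2001 inclusive

# Invented yearly mean topic probabilities, for illustration only.
topic_by_year = {
    "rising": [0.002 + 0.0005 * i for i in range(len(years))],
    "falling": [0.012 - 0.0006 * i for i in range(len(years))],
}

slopes = {name: linear_trend_slope(years, values)
          for name, values in topic_by_year.items()}
hot = [name for name, slope in slopes.items() if slope > 0]
cold = [name for name, slope in slopes.items() if slope < 0]
print("hot:", hot)
print("cold:", cold)
```

In practice the yearly value for a topic would be the average of that topic's probability over all abstracts published in the year.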

Page 28

[Figure: two panels, “Cold topics” (topics 289, 37, 75) and “Hot topics” (topics 179, 2, 134); x-axis: year (1990 to 2002), y-axis: P(topic). Annotations on the figure: NOBEL 1987, NOBEL 2002.]

Cold topics:

37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY

289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED

75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING

Hot topics:

2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE

134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED

179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION

Page 29

III Analyzing Topics of Federal Funding

Page 30

Analyzing Topics of Funding

Get a large-scale overview of funding for social sciences

How similar are different funding programs?

How is funding distributed over topics?

Page 31

Dataset

22,189 abstracts from grants active in 2003

NIH
  • NIMH (National Institute of Mental Health)
  • NCI (National Cancer Institute)

NSF
  • SBE (Social, Behavioral and Economic Sciences)
  • BIO (Biological Sciences)

Page 32

Extracted topics (1..20)

Topic | Interpretation | Likely words
1 | training program | research students program training faculty
2 | mental health services | health intervention care mental services
3 | protein binding | protein proteins binding domain domains
4 | pathway signaling | signaling kinase pathways signal activation
5 | neural activity | neurons brain neuronal synaptic activity
6 | collaborative projects | university award research project collaboration
7 | gene expression | expression gene transcription regulation genes
8 | immunology | cells tumor immune cell antigen
9 | children/family | children child family school adolescents
10 | archaeology | sites site archaeological region data
11 | genetics | genes gene genome sequence genetic
12 | ecosystem/climate | forest soil climate ecosystem carbon
13 | tumors | tumor tumors human expression mammary
14 | gene mutation | mice mutations gene mutant genes
15 | cell differentiation | cell cells cycle growth differentiation
16 | psychiatric disorders | depression disorders disorder symptoms psychiatric
17 | research training | research training development career candidate
18 | clinical trials | patients clinical treatment therapy trials
19 | cancer | cancer breast lung cancers ovarian
20 | research center | research center program core support

Page 33

80 interpretable topics (out of 100)

training program | environmental issues | software/databases | rna
mental health services | patient treatment | metabolism | sexual behavior
protein binding | dna repair | equipment/facilities | molecular mechanisms
pathway signaling | drugs/agents | sample analysis | modeling
neural activity | structural biology | gene development | imaging techniques
collaborative projects | public policy | ecology | family genetics
gene expression | protein physiology | information technology | specimen collection
immunology | apoptosis | markets | stress response
children/family | evolution | cell receptors | viral infections
archaeology | economic forces | genetic variation | ethnic minorities
genetics | cancer treatment | research development | hypothesis testing
ecosystem/climate | tumor growth | theory | bacteria
tumors | conference/meetings | commercial development | systems research
gene mutation | group sociology | marine environment | calcium channels
cell differentiation | memory & cognition | population screening | individual differences
psychiatric disorders | risk factors | plant growth | hormones
research training | data analysis | circadian rhythms | smoking
clinical trials | tumor therapy | leukemia | research training biology
cancer | functional imaging | science/technology | language
research center | imaging systems | decision making | energy transfer

Page 34

Likely topics for NIH

Interpretation | %NIH | %SBE | %BIO
cancer | 94 | 3 | 3
tumors | 92 | 3 | 5
clinical trials | 91 | 4 | 5
psychiatric disorders | 90 | 5 | 5
tumor therapy | 89 | 4 | 7
cancer treatment | 89 | 5 | 6
leukemia | 89 | 4 | 7
patient treatment | 88 | 6 | 6
immunology | 87 | 3 | 9
mental health services | 87 | 8 | 5

Likely topics for NSF-SBE

Interpretation | %NIH | %SBE | %BIO
collaborative projects | 9 | 81 | 9
public policy | 14 | 79 | 7
archaeology | 12 | 76 | 12
economic forces | 18 | 73 | 10
markets | 18 | 70 | 12
environmental issues | 16 | 65 | 19
science/technology | 21 | 63 | 17
systems research | 19 | 62 | 19
language | 29 | 59 | 12
decision making | 33 | 51 | 16

Likely topics for NSF-BIO

Interpretation | %NIH | %SBE | %BIO
plant growth | 10 | 9 | 81
ecology | 9 | 14 | 77
evolution | 12 | 13 | 76
bacteria | 17 | 9 | 74
genetic variation | 19 | 13 | 69
ecosystem/climate | 8 | 24 | 69
gene development | 30 | 6 | 63
marine environment | 13 | 27 | 60
specimen collection | 17 | 24 | 59
protein physiology | 37 | 6 | 58

Page 35

Program similarity using topics

[Figure: similarity between funding programs 1 to 57, arranged by the four-level hierarchy below; numbers in parentheses are abstract counts.]

Level 1: Dept of Health and Human Services (11609)
  Level 2: National Institutes of Health (11609)
    Level 3: National Cancer Institute (7574)
      1 Cancer biology, detection and diagnosis
      2 AIDS Research
      3 Cancer Research Centers
      4 Cancer causation
      5 Cancer prevention and control
      6 Cancer treatment
      7 NCI research manpower development
    Level 3: National Institute of Mental Health (4035)
      8 AIDS Research
      9 Extramural research
      10 Intramural research

Level 1: National Science Foundation (10580)
  Level 2: Social, Behavioral, and Economic Sciences (SBE) (4584)
    Level 3: Behavioral and cognitive sciences (BCS) (1469)
      11 Archaeology, archeometry, and ...
      12 Behavioral and cognitive sciences - Other
      13 Child learning and development
      14 Cultural anthropology
      15 Environmental social and behavioral science
      16 Geography and regional science
      17 Human cognition and perception
      18 Instrumentation
      19 Linguistics
      20 Physical anthropology
      21 Social psychology
    Level 3: International science and engineering (INT) (formerly International cooperative scientific activities) (1299)
      22 Africa, Near East, and South Asia
      23 Americas
      24 Central and Eastern Europe
      25 East Asia and Pacific
      26 International activities - Other
      27 Japan and Korea
      28 Western Europe
    Level 3: Social and economic sciences (SES) (1816)
      29 Decision, risk, and management science
      30 Methodology, measures, and statistics
      31 Economics
      32 Ethics and values studies
      33 Innovation and organizational change
      34 Law and social science
      35 Political science
      36 Research on science and technology
      37 Science and technology studies
      38 Social and economic sciences - Other
      39 Sociology
      40 Transformations to quality organizations
  Level 2: Biological Sciences (BIO) (5996)
    Level 3: Biological infrastructure (BIR/DBI) (1061)
      41 Biological infrastructure - Other
      42 Human resources
      43 Instrumentation
      44 Research resources
    Level 3: Environmental biology (DEB) (1609)
      45 Ecological studies
      46 Environmental biology - Other
      47 Systematic & population biology
    Level 3: Integrative biology and neuroscience (IBN/BNS) (1673)
      48 Developmental mechanisms
      49 Integrative biology and neuroscience - Other
      50 Neuroscience
      51 Physiology and ethology
    Level 3: Molecular and cellular biosciences (MCB/DCB) (1534)
      52 Biochemical and biomolecular processes
      53 Biomolecular structure & function
      54 Cell biology
      55 Genetics
      56 Molecular and cellular biosciences - Other
    Level 3: Plant genome research (119)
      57 Plant genome research project
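The slides do not state how program similarity is computed. One plausible sketch, assumed here only for illustration, is to average the topic distributions of the grants in each program and compare the resulting profiles with the Jensen-Shannon divergence (0 means identical profiles, 1 means no shared topics when logs are base 2). The program names, the three-topic distributions and the function names are all hypothetical.

```python
from math import log2


def mean_distribution(rows):
    """Average several topic distributions into one program profile."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (base-2 logs)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing to the sum.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented topic distributions per grant, grouped by program.
grants = {
    "program A": [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]],
    "program B": [[0.7, 0.2, 0.1]],
    "program C": [[0.1, 0.1, 0.8], [0.1, 0.3, 0.6]],
}
profiles = {name: mean_distribution(rows) for name, rows in grants.items()}

d_ab = js_divergence(profiles["program A"], profiles["program B"])
d_ac = js_divergence(profiles["program A"], profiles["program C"])
print(f"A vs B: {d_ab:.3f}")  # near 0: the two programs fund the same topics
print(f"A vs C: {d_ac:.3f}")  # clearly larger: different topics
```

A matrix of such pairwise values over all 57 programs is the kind of input a 2D layout method could then place on a plane.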


Page 37

2D visualization of funding programs: nearby programs support similar topics

[Figure: each of the 57 funding programs is plotted as a point labelled with its division prefix (NCI, NIMH, BCS, INT, SES, BIR, DEB, IBN, MCB, PGR) and program name; three regions are marked NIH, NSF – BIO and NSF – SBE.]

Page 38

Funding Amounts per Topic

  • We have $ funding per grant
  • We have % of topics for each grant
  • We can solve for the $ amount per topic

What are expensive topics?
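The slides do not spell out how the dollar amount per topic is solved for. One simple reading, assumed here for illustration, is to split each grant's dollars across topics in proportion to the grant's topic percentages and then sum over grants. The function name and all numbers below are hypothetical.

```python
def funding_per_topic(grant_dollars, grant_topic_shares):
    """Split each grant's dollars by its topic shares and sum per topic."""
    n_topics = len(grant_topic_shares[0])
    totals = [0.0] * n_topics
    for dollars, shares in zip(grant_dollars, grant_topic_shares):
        for topic, share in enumerate(shares):
            totals[topic] += dollars * share
    return totals


# Invented grants: dollar amount and topic shares (each row sums to 1).
grant_dollars = [100_000, 300_000, 50_000]
grant_topic_shares = [
    [0.5, 0.5, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8],
]

totals = funding_per_topic(grant_dollars, grant_topic_shares)
grand_total = sum(totals)
for topic, amount in enumerate(totals, start=1):
    print(f"topic {topic}: ${amount:,.0f} ({100 * amount / grand_total:.1f}%)")
```

Because every grant's shares sum to 1, the per-topic totals add up to the total funding, so the percentages can be compared directly across topics.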

Page 39

High $$$ topics

Funding % | Interpretation
3.47 | research center
2.87 | cancer control
2.26 | mental health services
2.01 | clinical treatment
1.87 | cancer
1.73 | gene sequencing
1.61 | risk factors
1.56 | children/parents
1.51 | tumors
1.48 | training program
1.47 | immunology
1.43 | disorders
1.40 | patient treatment

Low $$$ topics

Funding % | Interpretation
0.60 | conference/meetings
0.56 | theory
0.55 | public policy
0.55 | collaborative projects
0.55 | marine environment
0.55 | decision making
0.55 | ecological diversity
0.53 | sexual behavior
0.52 | markets
0.51 | science/technology
0.49 | computer systems
0.45 | language
0.44 | archaeology

Page 40

IV Analyzing Enron Email

Page 41

Enron email data

  • 500,000 emails
  • 5000 authors
  • 1999-2002

Page 42

Enron topics

[Figure: topic timelines over 2000 to 2003, with labels PERSON1 and PERSON2]

TIMELINE: May 22, 2000: Start of California energy crisis

Topics shown:
  • TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
  • GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
  • ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
  • FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
  • POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
  • STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU

Page 43

V Extensions of the model

Page 44

Pennsylvania Gazette

  • 1728-1800
  • 80,000 articles

(courtesy of David Newman & Sharon Block, History Department, UC Irvine)

Page 45: Analyzing Federal Funding, Scientific Publications and Email with Probabilistic Topic Models Mark SteyversUC Irvine Padhraic Smyth Dave Newman Tom Griffiths

Historical Trends in Pen. Gazette

[Figure: topic trends over time; x-axis YEAR, 1730 to 1800; y-axis Topic Proportion (%), 0 to 10]

STATE GOVERNMENT CONSTITUTION LAW UNITED POWER CITIZEN PEOPLE PUBLIC CONGRES
SILK COTTON DITTO WHITE BLACK LINEN CLOTH WOMEN BLUE WORSTED

(courtesy of David Newman & Sharon Block, UC Irvine)
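
The slide does not say how the plotted quantity is computed. One plausible reading is that the topic proportion for a year is the percentage of that year's word tokens assigned to the topic. The sketch below works under that assumption; the function name `topic_share_by_year` and the toy assignments are hypothetical and are not the Gazette data.

```python
from collections import Counter, defaultdict


def topic_share_by_year(tokens):
    """Return {year: {topic: percent of that year's tokens}}.

    `tokens` is an iterable of (year, topic) pairs, one pair per word token.
    This input format is an assumption made for the sketch.
    """
    counts = defaultdict(Counter)
    for year, topic in tokens:
        counts[year][topic] += 1

    shares = {}
    for year, topic_counts in sorted(counts.items()):
        total = sum(topic_counts.values())
        shares[year] = {t: 100.0 * n / total for t, n in topic_counts.items()}
    return shares


# Made-up token-level topic assignments for two years.
toy_tokens = [
    (1730, "cloth"), (1730, "cloth"), (1730, "cloth"), (1730, "government"),
    (1790, "cloth"), (1790, "government"), (1790, "government"), (1790, "government"),
]
shares = topic_share_by_year(toy_tokens)
print(shares)
```

Plotting `shares[year][topic]` against `year` for a fixed topic would give one curve of the kind described in the figure.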

Page 46

Learning Topic Hierarchies

In the regular topic model, there are no relations between topics

Alternative: hierarchical topic organization

[Tree diagram: topic 1 at the top; topic 2 and topic 3 on the second level; topics 4, 5, 6, 7 on the third level]

Apply to Psych Review abstracts
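
The slide gives only the tree, not the generative details. As a minimal sketch of one way a hierarchical organization can be used, assume each document follows a single root-to-leaf path and draws every word from one of the topics on that path. The parent-child layout (children of topic k are topics 2k and 2k+1), the uniform choice of level, and the word lists are all assumptions made for illustration.

```python
import random

# Assumed layout: children of topic k are topics 2k and 2k+1
# (1 -> 2, 3; 2 -> 4, 5; 3 -> 6, 7), matching the seven topics on the slide.
LEAVES = [4, 5, 6, 7]

# Hypothetical word lists, one per topic; not taken from the talk.
TOPIC_WORDS = {
    1: ["theory", "model"],
    2: ["memory", "recall"],
    3: ["visual", "perception"],
    4: ["list", "item"],
    5: ["retrieval", "storage"],
    6: ["motion", "contour"],
    7: ["binocular", "rivalry"],
}


def path_to_leaf(leaf):
    """Return the topics from the root down to `leaf`, e.g. 5 -> [1, 2, 5]."""
    path = []
    node = leaf
    while node >= 1:
        path.append(node)
        node //= 2
    return path[::-1]


def generate_document(leaf, n_words, rng):
    """Draw each word from a topic chosen uniformly along the document's path."""
    path = path_to_leaf(leaf)
    words = []
    for _ in range(n_words):
        topic = rng.choice(path)
        words.append(rng.choice(TOPIC_WORDS[topic]))
    return words


rng = random.Random(0)
print(generate_document(rng.choice(LEAVES), 8, rng))
```

Under this sketch, documents that share a leaf share all their topics, while documents in different subtrees share only the general topic at the root.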

Page 47

[Figure: topic hierarchy for Psych Review abstracts. Each node lists five words:]

theory model data information proposed
model theory models word response
reading text readers meaning comprehension
bias associative matrices matrix al
memory list item items recognition
distributed grams associate associations paired
strength familiarity retroactive deviation likelihood
response instrumental responses conditioning behavior
choice delays alternatives fixed reward
memory model models information social
knowledge skill reading access specific
model effects learning theory systems
memory retrieval serial storage working
preference reinforcement choice punishment contingent
model theory information effects account
images perception according lightness objects
visual imagery representations mental subsystems
movement eye position speed target
orientation erotic bem sexual ebe
situational consistency cross temporal behavior
object based neglect attention space
stimuli visual component contour forward
attribute stochastic choice difference transitivity
masking metacontrast type inhibition mask
serial function latency position items
reasoning bayesian similarities statements gain
similarity geometric objects density distance
ce conditioning principles reinforcement reward
model memory processes models learning
image components bound nearest neighbor
memory reasoning interference process background
theory sentence james fit emotion
model memory decision response theory
theory achievement emotion motivation failure
model cs avoidance ucs conditioning
model memory problems items theoretical
goodness approach representation holographic pictorial
letters model words letter memory
function psychometric correlations individuals performance
stress system immune arousal fight
sex affects biological differences handedness
cognitive gigerenzer heuristics reasoning biases
child children development field risk
bayesian inference algorithms authors frequency
speech auditory acoustic perceptual sound
action control intention goal intentions
personality behavior trait consistency idiographic
surface representations surfaces occluding contour
psychological psychology review american association
events interpersonal event impersonal equilibrium
categories category metaphor object metaphors
motion contrast path visual contour
left cerebral handedness speech human
social perception impression research approach
sleep imagery dreaming rem eye
reinforcement behavior extinction matching partial
binocular rivalry stereopsis monocular visual
structure relations scale dimensional keys
risk conjunction decision probabilities risky
distance retinal disparity image perceived
perception visual direction rule adaptation
part thinking kind scientific activities
behavior development evolutionary genes comparative
group intelligence intellectual iq connections
behavior food drinking hypothalamus physiological
task resource performance processing anaphors
developmental social ethnic processes development
fear anxiety pain amygdala automatic
neural visual neurons behavioral masking
strategies problems term confirmation limitations
language semantic linguistic thought correlations
learning maps map barrier parallel
statistical heuristics knowledge intuitive heuristic
face recognition faces damaged semantic

Page 48

Integrating Topics and Syntax

Syntactic dependencies: short-range dependencies

Semantic dependencies: long-range

[Graphical model: for each word position, a topic variable z, a word w, and a state s]

Semantic state: generate words from topic model

Syntactic states: generate words from HMM

(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
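
The two generation rules above can be sketched as a small sampler: a hidden Markov chain moves between states, ordinary syntactic states emit words from their own word lists, and one designated semantic state hands off to the topic model. The state names, transition probabilities, and word lists below are hypothetical toy values chosen for illustration; they are not parameters from the talk or from the cited paper.

```python
import random

SEMANTIC = "SEM"  # hypothetical name for the state that defers to the topic model

# Toy HMM transitions between states (assumed values).
TRANSITIONS = {
    "START": {"DET": 1.0},
    "DET": {SEMANTIC: 1.0},
    SEMANTIC: {"VERB": 0.5, "END": 0.5},
    "VERB": {"DET": 1.0},
}

# Words emitted by the syntactic states (assumed lists).
CLASS_WORDS = {"DET": ["THE", "A"], "VERB": ["IS", "HAS"]}

# Words emitted by each topic when the chain is in the semantic state (assumed lists).
TOPIC_WORDS = {
    "memory": ["MEMORY", "RETRIEVAL", "STORAGE"],
    "vision": ["VISUAL", "ATTENTION", "SEARCH"],
}


def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate_sentence(topic_weights, rng, max_len=12):
    """Generate one sentence given a document's topic weights."""
    words, state = [], "START"
    while len(words) < max_len:
        state = sample(TRANSITIONS[state], rng)
        if state == "END":
            break
        if state == SEMANTIC:
            # Semantic state: pick a topic for this word, then a word from it.
            topic = sample(topic_weights, rng)
            words.append(rng.choice(TOPIC_WORDS[topic]))
        else:
            # Syntactic state: emit from the state's own word list.
            words.append(rng.choice(CLASS_WORDS[state]))
    return words


rng = random.Random(0)
print(" ".join(generate_sentence({"memory": 0.8, "vision": 0.2}, rng)))
```

In this sketch the word order comes entirely from the state chain, while the document's topic weights only affect which content words fill the semantic slots.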

Page 49

...

IN BY WITH ON AS FROM TO FOR
THE A AN THIS THEIR ITS EACH ONE
IS ARE BE HAS HAVE WAS WERE AS
BASED PRESENTED DISCUSSED PROPOSED DESCRIBED SUCH USED DERIVED
THEORY MODEL PROCESSES MODELS SYSTEM PROCESS EFFECTS INFORMATION
ATTENTION SEARCH VISUAL PROCESSING TASK PERFORMANCE INFORMATION ATTENTIONAL
MEMORY TERM LONG SHORT RETRIEVAL STORAGE MEMORIES AMNESIA
IQ BEHAVIOR EVOLUTIONARY ENVIRONMENT GENES HERITABILITY GENETIC SELECTION
DRUG AROUSAL NEURAL BRAIN HABITUATION BIOLOGICAL TOLERANCE BEHAVIORAL
SOCIAL SELF ATTITUDE IMPLICIT ATTITUDES PERSONALITY JUDGMENT PERCEPTION

(S) THE SEARCH IN LONG TERM MEMORY ……

(S) A MODEL OF VISUAL ATTENTION ……

Page 50

Random sentence generation

LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

Page 51

Conclusion

Unsupervised extraction of content from large text collections

Topics provide quick overview of content

Topic models: text mining / information retrieval, psychology / memory

Connection?

Good semantic memory models for finding semantically relevant information might also be good information retrieval models

Page 52

Psych Review abstracts

All 1281 abstracts since 1967

50 topics, examples:

SIMILARITY CATEGORY CATEGORIES RELATIONS DIMENSIONS FEATURES STRUCTURE SIMILAR REPRESENTATION OBJECTS
STIMULUS CONDITIONING LEARNING RESPONSE STIMULI RESPONSES AVOIDANCE REINFORCEMENT CLASSICAL DISCRIMINATION
MEMORY RETRIEVAL RECALL ITEMS INFORMATION TERM RECOGNITION ITEMS LIST ASSOCIATIVE
GROUP INDIVIDUAL GROUPS OUTCOMES INDIVIDUALS GROUPS OUTCOMES INDIVIDUALS DIFFERENCES INTERACTION
EMOTIONAL EMOTION BASIC EMOTIONS AFFECT STATES EXPERIENCES AFFECTIVE AFFECTS RESEARCH

...