Bibliometric Impact Measures Leveraging Topic Analysis

Gideon Mann

David Mimno

Andrew McCallum

Computer Science Department

University of Massachusetts Amherst


Goal:

Measure the impact of papers and research subfields.

Important for:

• Researchers understanding their own field.

• Libraries deciding which journals to purchase.

• Personnel committees deciding on hiring, promotion, awards.

Typical Impact Measures

• Citation Count

• Garfield’s Journal Impact Factor

Why are topical divisions useful in bibliometrics?

Citation counts (Source: Journal Citation Reports, 2004)

Biochemistry and molecular biology:
J. Biol. Chem          405017
Cell                   136472
Biochem.-US             96809

Mathematics:
Lect. Notes Math         6926
T. Am. Math. Soc         6469
J. Math. Anal. Appl.     6004

Can you compare the tallest building in NY to the tallest building in Stamford, CT?

Why are topical divisions useful in bibliometrics?

Why not use Journal as a proxy for Topic?

• Journals not necessarily about one topic.

• Topics may not have their own journal.

• Open access publishing on the rise.

• 5% of the 200 most-cited papers in CiteSeer are tech reports!

• Spidered web documents often do not include venue information.

This Paper

• Discovering fine-grained, interpretable topics from text

• 8 impact measures leveraging topics

Analysis on 1.5 million research papers and their citations.

Talk Outline

• Topical N-Grams, a phrase-discovering enhancement to LDA

• A quick tour of 8 impact measures, with examples

• Where did we get all this data from? An introduction to Rexa, a new little sibling of CiteSeer, Google Scholar, etc.

Clustering words into topics with Latent Dirichlet Allocation

[Blei, Ng, Jordan 2003]

Generative Process:

• For each document:

– Sample a distribution over topics (example: 70% Iraq war, 30% US election)

– For each word in doc:

   – Sample a topic, z (example: Iraq war)

   – Sample a word from the topic, w (example: “bombing”)
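The generative process above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the toy topics, their word probabilities, and the function names `sample_topic_mixture` and `sample_document` are all assumptions made for the example. The per-document mixture is drawn from a symmetric Dirichlet by normalizing gamma draws.

```python
import random

# Hypothetical toy topics: each topic is a distribution over words.
TOPICS = {
    "Iraq war": {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},
    "US election": {"vote": 0.5, "campaign": 0.3, "ballot": 0.2},
}

def sample_topic_mixture(topic_names, alpha, rng):
    """Draw a per-document distribution over topics from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in topic_names]
    total = sum(draws)
    return {name: d / total for name, d in zip(topic_names, draws)}

def sample_document(topics, mixture, num_words, rng):
    """For each word position, sample a topic z from the mixture, then a word w from topic z."""
    names = list(mixture)
    doc = []
    for _ in range(num_words):
        z = rng.choices(names, weights=[mixture[n] for n in names])[0]
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        doc.append((z, w))
    return doc

rng = random.Random(0)
mixture = {"Iraq war": 0.7, "US election": 0.3}  # the slide's example mixture
print(sample_document(TOPICS, mixture, 5, rng))
```

Each returned pair is a (topic, word) draw, so a pair such as ("Iraq war", "bombing") corresponds to the slide's example.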

Inference and Estimation

Gibbs Sampling:

– Easy to implement

– Reasonably fast

Example topics induced from a large collection of text [Tenenbaum et al.]

• STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL

• MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE

• WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

• DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN

• FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED

• SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES

• BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY

• JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE


Topics Modeling Multi-word Phrases

• Topics based only on unigrams are sometimes difficult to interpret.

• Topic discovery itself is confused because important meaning and distinctions are carried by phrases.

A Topic Comparison

LDA: algorithms, algorithm, genetic, problems, efficient

Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

Topical N-gram Model

[Wang, McCallum 2005]

[Figure: graphical model of the Topical N-gram model, with per-token variables z1 to z4, w1 to w4, and y1 to y4, and plates labeled D, T, W, and TW.]

Features of Topical N-Grams model

• Easily trained by Gibbs sampling

– Can run efficiently on millions of words

• Topic-specific phrase discovery

– “white house” has special meaning as a phrase in the politics topic,

– ... but not in the real estate topic.

Topic Comparison

LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning

Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods

Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

Our Data for This JCDL Paper

• 1.6 million research papers

– mostly in Computer Science

– 400k of them with full text

• 14 fields of meta-data, automatically extracted with 99% accuracy, from

– “headers” at top of papers

– “citations” in References section

• Reference resolution performed on 4 million citations.

Example Results on our Corpus

[Figure: sample LDA topics and sample Topical N-gram topics]

Step 1: Run LDA on 1.6 million papers. Use topic analysis to select a subset of AI: ML, NLP, robotics, vision, etc.

Step 2: Run Topical N-grams on the ~300k papers in the subset.

Each topic is now an intellectual “domain” that includes some number of documents.

We can substitute topic for journal in most traditional bibliometric indicators.

We can also now define several new indicators.

Impact Measures Leveraging Topics

1. Topical Citation count

2. Topical Impact factor

3. Topical Diffusion

4. Topical Diversity

5. Topical Half-life

6. Topical Precedence

7. Topical H-factor

8. Topical Transfer


Topical Citation Count

Impact Factor

Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.

2004 Impact factors from JCR:

Nature 32.182

Cell 28.389

JMLR 5.952

Machine Learning 3.258
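Substituting a topic for the journal in the definition above gives a topical impact factor. The sketch below assumes each paper carries a single hard topic label and a publication year; the data layout and the name `topical_impact_factor` are illustrative assumptions, not the authors' implementation.

```python
def topical_impact_factor(papers, citations, topic, year):
    """Citations made in `year` to papers of `topic` from the two preceding years,
    divided by the number of such papers.

    papers:    dict paper_id -> (publication_year, topic)   (assumed layout)
    citations: list of (citing_id, cited_id) pairs
    """
    # Papers in the topic published in the two-year window before `year`.
    window = {
        pid for pid, (pub_year, t) in papers.items()
        if t == topic and year - 2 <= pub_year <= year - 1
    }
    if not window:
        return 0.0
    # Count citations coming from papers published in `year`.
    cites = sum(
        1 for citing, cited in citations
        if cited in window and papers[citing][0] == year
    )
    return cites / len(window)

# Hypothetical toy data.
papers = {
    "a": (2002, "RL"), "b": (2003, "RL"), "e": (2001, "RL"),
    "c": (2004, "NLP"), "d": (2004, "RL"),
}
citations = [("c", "a"), ("d", "a"), ("d", "b"), ("d", "e"), ("b", "a")]
print(topical_impact_factor(papers, citations, "RL", 2004))
```

In the toy data, two RL papers fall in the 2002-3 window and receive three citations from 2004 papers, so the ratio is 3 / 2.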

Topical Impact Factor over time


Broad Impact: Diffusion

Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100

Problem: Relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
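A short sketch makes the brittleness concrete. It applies the diffusion definition with topics in place of journals; the function name `topical_diffusion` and the input format (one citing-topic label per citation) are assumptions for illustration.

```python
def topical_diffusion(citing_topics):
    """Number of distinct citing topics divided by total citations, times 100.

    citing_topics: one topic label per citation received (assumed input format).
    """
    if not citing_topics:
        return 0.0
    return 100.0 * len(set(citing_topics)) / len(citing_topics)

# Cited twice, by two different topics: the highest possible score.
print(topical_diffusion(["vision", "nlp"]))
# Cited 100 times from 5 topics: a far lower score despite broader real impact.
print(topical_diffusion(["a"] * 96 + ["b", "c", "d", "e"]))
```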

Broad Impact: Diversity

Topic Diversity: Entropy of the distribution of citing topics

Diffusion: These are just the least cited topics!

Diversity: Better at capturing broad end of impact spectrum.

Broad Impact: Diversity, for papers

Topic Diversity: Entropy of the distribution of citing topics
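Topic diversity can be sketched as the Shannon entropy of the citing-topic distribution. The logarithm base (2, giving bits) and the name `topical_diversity` are assumptions; the slides do not state a base.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Shannon entropy (in bits) of the distribution of citing topics.

    citing_topics: one topic label per citation received (assumed input format).
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# All citations from one topic: no diversity.
print(topical_diversity(["nlp"] * 4))
# Citations spread evenly over four topics: maximal diversity for four topics.
print(topical_diversity(["nlp", "vision", "ir", "rl"]))
```

Unlike diffusion, this score does not reward a topic simply for having very few citations: two citations from two topics score 1 bit, less than a heavily cited topic whose citations are spread over many topics.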


Topical Longevity: Cited Half Life

Two views:• Given a paper, what is the median age of citations to that paper?• What is the median age of citations from current literature?

Collaborative Filtering is young, fast moving.

Maximum Entropy looks further back, but is still producing new work.

Neural Networks literature is aging.
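The two views of half-life reduce to a median over citation ages. The following is a minimal sketch with assumed function names and hypothetical years; for a topic, the same computation would run over all citations to (or from) the papers assigned to it.

```python
from statistics import median

def cited_half_life(cited_year, citing_years):
    """View 1: median age of the citations a paper (or topic) receives."""
    return median(y - cited_year for y in citing_years)

def citing_half_life(current_year, referenced_years):
    """View 2: median age of the work that current literature cites."""
    return median(current_year - y for y in referenced_years)

# Hypothetical paper from 1995, cited in 1996, 1997 and 2001.
print(cited_half_life(1995, [1996, 1997, 2001]))
# Hypothetical 2005 papers citing work from 2004, 2003, 1990 and 1985.
print(citing_half_life(2005, [2004, 2003, 1990, 1985]))
```

A small citing half-life suggests a young, fast-moving topic; a large one suggests literature that looks further back.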

Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

“Early-ness”

Speech Recognition:

Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)

Spectrographic study of vowel reduction, B. Lindblom (1963)

Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)

Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)

Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

“Early-ness”

Information Retrieval:

On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)

Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)

Relevance feedback in information retrieval, Rocchio (1971)

Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)

New experiments in relevance feedback, Ide (1971)

Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
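The precedence query is a filter followed by a sort. In this sketch the record layout, the placeholder titles, the citation counts, and the name `topical_precedence` are all hypothetical.

```python
def topical_precedence(papers, n, k):
    """Return the k earliest papers in a topic that received more than n citations.

    papers: list of (title, year, citation_count) tuples for one topic (assumed layout).
    """
    qualifying = [p for p in papers if p[2] > n]
    return sorted(qualifying, key=lambda p: p[1])[:k]

# Hypothetical records for a single topic.
topic_papers = [
    ("paper A", 1971, 40),
    ("paper B", 1960, 12),
    ("paper C", 1955, 2),   # early, but below the citation threshold
    ("paper D", 1968, 25),
]
print(topical_precedence(topic_papers, n=10, k=2))
```

The threshold n keeps early but rarely cited papers, such as "paper C" here, out of the list.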


H-factor

H = maximum number K for which you have K papers, each with at least K citations.

...for journals [Braun et al, 2005]
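The H definition above translates directly into code. To obtain a topical H-factor, the same function would be applied to the citation counts of the papers assigned to one topic (optionally restricted to one publication year). The name `h_factor` and the sample counts are assumptions.

```python
def h_factor(citation_counts):
    """Largest K such that at least K papers have at least K citations each."""
    h = 0
    # Walk papers from most to least cited; rank is the number of papers seen so far.
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for the papers in one topic.
print(h_factor([10, 8, 5, 4, 3]))
```

With these counts, four papers have at least 4 citations each, but there are not five papers with at least 5, so H is 4.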

Topical H-factor, Year 1990

ID    H    Topic
16    12   Natural Language Parsing
173   12   Neural Networks
120   12   Speech Recognition
21    11   Hidden Markov Models
71    11   Genetic Algorithms
48    11   Optical Flow
83    10   Reinforcement Learning
49    10   Computer Vision
22    10   Mobile Robots
118   9    Word Sense Disambiguation
160   9    NLP
35    8    Planning
106   8    Markov Chain Monte Carlo
40    8    Maximum Likelihood Estimators
131   8    Genetic Algorithms
61    7    Genetic Programming

Topical H-factor, Year 1995

ID    H    Topic
49    18   Computer Vision
120   17   Speech Recognition
146   15   Decision Trees
176   15   Data Mining
21    14   Hidden Markov Models
71    14   Genetic Algorithms
106   13   Markov Chain Monte Carlo
138   13   IR And Queries
118   12   Word Sense Disambiguation
80    12   Web And VR
16    12   Natural Language Parsing
110   12   Bayesian Inference
83    12   Reinforcement Learning
150   12   Logic Programming
22    12   Mobile Robots
160   12   NLP

Topical H-factor, Year 2001

ID    H    Topic
129   15   Web Pages
186   15   Ontologies
50    13   SVMs
49    13   Computer Vision
126   13   Gene Expression
176   13   Data Mining
29    12   Dimensionality Reduction
111   12   Question Answering
132   12   Search Engines
16    11   Natural Language Parsing
83    11   Reinforcement Learning
184   11   Web Services
164   11   HCI
21    10   Hidden Markov Models
118   10   Word Sense Disambiguation
138   10   IR And Queries


Topical Transfer

Transfer from Digital Libraries to other topics

Other topic        Cit’s   Paper Title
Web Pages          31      Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision    14      On being ‘Undigital’ with digital cameras: extending the dynamic...
Video              12      Lessons learned from the creation and deployment of a terabyte digital video
Graphs             12      Trawling the Web for Emerging Cyber-Communities
Web Pages          11      WebBase: a repository of Web pages

Topical Transfer: Citation counts from one topic to another.

Map “producers and consumers”
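Topical transfer can be sketched as a count over (cited topic, citing topic) pairs, where the cited topic is the "producer" and the citing topic is the "consumer". The single hard topic label per paper, the exclusion of within-topic citations, and the name `topical_transfer` are assumptions made for this example.

```python
from collections import Counter

def topical_transfer(paper_topic, citations):
    """Count citations flowing from each producer (cited) topic to each consumer (citing) topic.

    paper_topic: dict paper_id -> topic label (assumed one topic per paper)
    citations:   list of (citing_id, cited_id) pairs
    """
    transfer = Counter()
    for citing, cited in citations:
        producer, consumer = paper_topic[cited], paper_topic[citing]
        if producer != consumer:  # assumption: only cross-topic citations count as transfer
            transfer[(producer, consumer)] += 1
    return transfer

# Hypothetical toy data.
paper_topic = {"p1": "Digital Libraries", "p2": "Web Pages", "p3": "Web Pages", "p4": "Digital Libraries"}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p1", "p2")]
for (producer, consumer), count in topical_transfer(paper_topic, citations).most_common():
    print(producer, "->", consumer, count)
```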

Where did the data come from?

http://rexa.info

Rexa System Overview

1. Spider Web (WWW) for PDFs: home-grown Java+MySQL (~1m PDF/day)

2. Convert to text (with layout & format): enhanced ps2text (better word stitching, plus layout in XML)

3. Extract metadata (title, authors, abstract, venue, citations; 14 fields in total): Conditional Random Fields (99% word accuracy)

4. Reference resolution (of papers, authors & grants): discriminatively trained graph partitioning (competition-winning accuracy)

5. Topic Analysis & other Data Mining

6. Browsable Web Interface

Additional data source: NSF grant DB

IE from Research Papers [McCallum et al ‘99]

@article{ kaelbling96reinforcement,
  author = "Leslie Pack Kaelbling and Michael L. Littman and Andrew P. Moore",
  title = "Reinforcement Learning: A Survey",
  journal = "Journal of Artificial Intelligence Research",
  volume = "4",
  pages = "237-285",
  year = "1996",

(Linear Chain) Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence

[Figure: finite state model and graphical model views. FSM states (output seq) y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, ... with labels OTHER PERSON OTHER ORG TITLE …; observations (input seq) x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, ... with words "said Jones a Microsoft VP …".]

\[ p(y \mid x) = \frac{1}{Z_x} \prod_t \Phi(y_t, y_{t-1}, x, t) \]

where

\[ \Phi(y_t, y_{t-1}, x, t) = \exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \right) \]

Widespread interest, positive experimental results in many applications:

• Noun phrase, Named entity [HLT’03], [CoNLL’03]

• Protein structure prediction [ICML’04]

• IE from Bioinformatics text [Bioinformatics ‘04], …

• Asian word segmentation [COLING’04], [ACL’04]

• IE from Research papers [HLT’04]

• Object classification in images [CVPR ‘04]
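The conditional probability p(y | x) of a linear-chain CRF can be checked on a tiny example by brute force: score every label sequence, then normalize. This is only a sketch for illustration; the two feature functions, their weights, the "START" placeholder for the label before the first position, and the name `crf_probability` are assumptions, and practical implementations compute the normalizer Z_x with the forward algorithm rather than by enumeration.

```python
import itertools
import math

def crf_probability(y, x, labels, feature_fns, weights, start="START"):
    """p(y | x) = (1 / Z_x) * prod_t exp(sum_k lambda_k * f_k(y_t, y_{t-1}, x, t))."""
    def score(seq):
        # Log of the product of potentials: sum over positions of weighted features.
        total, prev = 0.0, start
        for t, y_t in enumerate(seq):
            total += sum(w * f(y_t, prev, x, t) for f, w in zip(feature_fns, weights))
            prev = y_t
        return total

    # Z_x sums the unnormalized score over every possible label sequence.
    z_x = sum(math.exp(score(seq)) for seq in itertools.product(labels, repeat=len(x)))
    return math.exp(score(tuple(y))) / z_x

# Hypothetical binary features f_k(y_t, y_{t-1}, x, t).
def capitalized_person(y_t, y_prev, x, t):
    return 1.0 if y_t == "PERSON" and x[t][0].isupper() else 0.0

def other_follows_other(y_t, y_prev, x, t):
    return 1.0 if y_t == "OTHER" and y_prev == "OTHER" else 0.0

x = ["said", "Jones"]
labels = ["OTHER", "PERSON"]
features = [capitalized_person, other_follows_other]
weights = [2.0, 1.0]  # hypothetical lambda_k values

print(crf_probability(("OTHER", "PERSON"), x, labels, features, weights))
```

Because the normalizer is shared by all labelings of the same input, only the weighted feature sums decide which output sequence is most probable.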

IE from Research Papers

Field-level F1

Hidden Markov Models (HMMs)         75.6   [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)      89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)    93.9   [Peng, McCallum, 2004]

(40% reduction in error)

(Word-level accuracy is >99%)
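One plausible reading of the 40% figure is the relative reduction in field-level error, taking error as 100 minus F1, when moving from SVMs (89.7) to CRFs (93.9). That interpretation is an assumption, and the function name below is hypothetical; the arithmetic is a quick check.

```python
def relative_error_reduction(old_f1, new_f1):
    """Fraction of the old error (100 - F1) removed by the new system."""
    old_error = 100.0 - old_f1
    new_error = 100.0 - new_f1
    return (old_error - new_error) / old_error

# SVMs (89.7) to CRFs (93.9): error drops from 10.3 to 6.1 points.
print(round(relative_error_reduction(89.7, 93.9), 3))
```

The ratio is 4.2 / 10.3, which is approximately 0.41 and consistent with the slide's rounded 40%.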

Previous Systems

[Diagram: entity Research Paper with relation Cites]

More Entities and Relations

[Diagram: Research Paper and Cites, extended with Person, University, Venue, Grant, Groups, Expertise]

access, access control, digital library, digital, digital libraries

Summary

• Demonstrated a new topic discovery method (Topical N-Grams) on 1.6m research papers.

• Presented 8 impact measures based on topics.

• Introduced Rexa, a showcase for our research on information extraction, coreference and data mining.

http://rexa.info is publicly available now. Try it out! Feedback appreciated.

Trends in 17 years of NIPS proceedings

Topic Distributions Conditioned on Time

[Figure: x-axis: time; y-axis: topic mass (in vertical height)]

Finding Topics in 1 million CS papers

200 topics & keywords automatically discovered.

Topic Correlations in PAM

5000 research paper abstracts, from across all CS

Numbers on edges are supertopics’ Dirichlet parameters