Bibliometric Impact Measures Leveraging Topic Analysis
Gideon Mann
David Mimno
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Goal:
Measure the impact of papers and research subfields.
Important for:
• Researchers understanding their own field.
• Libraries deciding which journals to purchase.
• Personnel committees deciding on hiring, promotion, awards.
Why are topical divisions useful in bibliometrics?
Citation counts by journal (Source: Journal Citation Reports, 2004):
Biochemistry and molecular biology:
  J. Biol. Chem          405017
  Cell                   136472
  Biochem.-US             96809
Mathematics:
  Lect. Notes Math         6926
  T. Am. Math. Soc         6469
  J. Math. Anal. Appl.     6004
Can you compare the tallest building in NY to the tallest building in Stamford, CT?
Why not use Journal as a proxy for Topic?
• Journals not necessarily about one topic.
• Topics may not have their own journal.
• Open access publishing on the rise.
• 5% of the 200 most-cited papers in CiteSeer are tech reports!
• Spidered web documents often do not include venue information.
This Paper
• Discovering fine-grained, interpretable topics from text
• 8 impact measures leveraging topics
Analysis on 1.5 million research papers and their citations.
• Where did we get all this data from?
Talk Outline
• Topical N-Grams, a phrase-discovering enhancement to LDA
• A quick tour of 8 impact measures, with examples
• An introduction to Rexa, a new little sibling of CiteSeer, Google Scholar, etc.
Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative Process:
For each document:
  Sample a distribution over topics
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w
Example:
  Distribution over topics: 70% Iraq war, 30% US election
  Sampled topic: Iraq war
  Sampled word: “bombing”
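The generative process above can be sketched in a few lines of Python. This is a toy illustration, not the implementation used for the talk: the two topics and their word probabilities are invented, and `alpha` (the Dirichlet hyperparameters) is an assumed input that the slide does not specify.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document following the LDA generative story.

    topics: list of dicts mapping word -> probability (one dict per topic).
    alpha:  Dirichlet hyperparameters, one per topic (assumed, not from the slides).
    """
    theta = sample_dirichlet(alpha, rng)  # per-document distribution over topics
    words = []
    for _ in range(n_words):
        # Sample a topic z from the document's topic distribution.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Sample a word w from the chosen topic's word distribution.
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append((z, w))
    return theta, words

# Toy topics, invented for illustration only.
TOY_TOPICS = [
    {"bombing": 0.5, "troops": 0.5},   # stands in for "Iraq war"
    {"ballot": 0.5, "campaign": 0.5},  # stands in for "US election"
]
```

Calling `generate_document(TOY_TOPICS, [1.0, 1.0], 10, random.Random(0))` returns the sampled topic mixture together with ten `(topic, word)` pairs. Topic discovery runs this story in reverse, inferring the topics from observed words.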
Example topics induced from a large collection of text
• STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
• MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
• WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
• DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
• FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
• SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
• BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
• JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
[Tenenbaum et al]
Topics Modeling Multi-word Phrases
• Topics based only on unigrams are sometimes difficult to interpret.
• Topic discovery itself is confused because important meaning and distinctions are carried by phrases.
A Topic Comparison
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
Topical N-gram Model
[Wang, McCallum 2005]
[Figure: graphical model over topic variables z1 to z4, words w1 to w4, and variables y1 to y4, with plates labeled T, D, and W]
Features of Topical N-Grams model
• Easily trained by Gibbs sampling
  – Can run efficiently on millions of words
• Topic-specific phrase discovery
  – “white house” has special meaning as a phrase in the politics topic,
  – ... but not in the real estate topic.
Topic Comparison
LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies
Our Data for This JCDL Paper
• 1.6 million research papers
  – mostly in Computer Science
  – 400k of them with full text
• 14 fields of meta-data, automatically extracted with 99% accuracy, from
  – “headers” at top of papers
  – “citations” in References section
• Reference resolution performed on 4 million citations.
Example Results on our Corpus
Step 1: Run LDA on 1.6 million papers. Use topic analysis to select a subset of AI: ML, NLP, robotics, vision, etc.
Step 2: Run Topical N-grams on the ~300k papers in the subset.
[Figure: sample LDA topics and sample Topical N-gram topics]
Each topic is now an intellectual “domain” that includes some number of documents.
We can substitute topic for journal in most traditional bibliometric indicators.
We can also now define several new indicators.
Impact Measures Leveraging Topics
1. Topical Citation count
2. Topical Impact factor
3. Topical Diffusion
4. Topical Diversity
5. Topical Half-life
6. Topical Precedence
7. Topical H-factor
8. Topical Transfer
Impact Factor
Journal Impact Factor, using Cell in 2004 as the example: citations from articles published in 2004 to articles published in Cell in 2002-3, divided by the number of articles published in Cell in 2002-3.
2004 Impact factors from JCR:
Nature 32.182
Cell 28.389
JMLR 5.952
Machine Learning 3.258
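Substituting topic for journal in the definition above gives a topical impact factor. The sketch below is one plausible reading, not the authors' code: it assumes each paper is assigned to exactly one topic, and the `papers` and `citations` structures are hypothetical.

```python
def topical_impact_factor(papers, citations, topic, year):
    """Citations made in `year` to the topic's papers from the two prior years,
    divided by the number of those papers.

    papers:    {paper_id: (topic, publication_year)}  (hypothetical layout)
    citations: iterable of (citing_id, cited_id) pairs
    """
    window = {year - 2, year - 1}
    targets = {pid for pid, (t, y) in papers.items() if t == topic and y in window}
    if not targets:
        return 0.0
    cites = sum(
        1
        for citing, cited in citations
        if cited in targets and papers[citing][1] == year
    )
    return cites / len(targets)
```

With two ML papers from 2002-3 that receive three citations from papers published in 2004, the function returns 1.5.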
Broad Impact: Diffusion
Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100
Problem: Relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
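A minimal sketch of the diffusion formula makes the brittleness concrete. The input format, one citing-topic label per citation received, is an assumption for illustration.

```python
def topical_diffusion(citing_topics):
    """Number of distinct citing topics per 100 citations received.

    citing_topics: list with one entry per citation, naming the citing topic.
    """
    if not citing_topics:
        return 0.0
    return 100.0 * len(set(citing_topics)) / len(citing_topics)
```

A topic cited once each by two different topics scores `topical_diffusion(["A", "B"]) == 100.0`. A topic cited 100 times, split evenly between the same two topics, scores only 2.0.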
Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of citing topics
Diffusion: These are just the least cited topics!
Diversity: Better at capturing the broad end of the impact spectrum.
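Diversity as entropy can be sketched as follows. The logarithm base is not stated on the slide, so base 2 (bits) is an assumption here, as is the input format of one citing-topic label per citation.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Shannon entropy (in bits) of the distribution of citing topics."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Unlike diffusion, entropy depends only on how citations are spread across topics. Two topics citing equally give 1 bit whether there are 2 citations or 200, and a single citing topic gives 0.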
Topical Longevity: Cited Half Life
Two views:
• Given a paper, what is the median age of citations to that paper?
• What is the median age of citations from current literature?
Collaborative Filtering is young, fast moving.
Maximum Entropy looks further back, but is still producing new work.
Neural Networks literature is aging.
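Both views reduce to a median over citation ages. The sketch below assumes each citation is given as a hypothetical `(citing_year, cited_year)` pair. Which citations are passed in (those received by a topic, or those made by its current papers) selects the view.

```python
from statistics import median

def cited_half_life(citation_years):
    """Median age, in years, of a set of citations.

    citation_years: list of (citing_year, cited_year) pairs.
    Returns None when there are no citations.
    """
    ages = [citing - cited for citing, cited in citation_years]
    return median(ages) if ages else None
```

A small median means most citations go to recent work, as in a young, fast-moving topic. A large median means the topic's citations reach further back.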
Topical Precedence (“Early-ness”)
Within a topic, what are the earliest papers that received more than n citations?
Speech Recognition:
• Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
• Spectrographic study of vowel reduction, B. Lindblom (1963)
• Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
• Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
• Automatic Recognition of Speakers from Their Voices, B. Atal (1976)
Information Retrieval:
• On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
• Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
• Relevance feedback in information retrieval, Rocchio (1971)
• Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
• New experiments in relevance feedback, Ide (1971)
• Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
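The precedence query can be sketched as a filter followed by a sort on publication year. The data layout and the cutoff `k` on the number of results are assumptions for illustration, as is the hard assignment of each paper to one topic.

```python
def topical_precedence(papers, citation_counts, topic, n, k=5):
    """Earliest papers in `topic` that received more than `n` citations.

    papers:          {paper_id: (topic, publication_year)}  (hypothetical layout)
    citation_counts: {paper_id: number of citations received}
    k:               how many papers to return (an assumed cutoff)
    """
    qualifying = [
        (year, pid)
        for pid, (t, year) in papers.items()
        if t == topic and citation_counts.get(pid, 0) > n
    ]
    return [pid for year, pid in sorted(qualifying)[:k]]
```

Raising `n` trades recall for reliability: a low threshold surfaces very old papers that may be only loosely related to the topic.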
H-factor
H = maximum number K for which you have K papers, each with at least K citations.
...for journals [Braun et al, 2005]
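The definition of H translates directly into code. For a topical H-factor, the input would be the citation counts of the papers assigned to one topic. How papers are assigned to topics is outside this sketch.

```python
def h_factor(citation_counts):
    """Largest K such that K papers each have at least K citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h
```

For citation counts `[10, 8, 5, 4, 3]` the result is 4: four papers have at least 4 citations each, but there are not five papers with at least 5.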
Topical H-factor, Year 1990
Topic ID  H   Topic
16        12  Natural Language Parsing
173       12  Neural Networks
120       12  Speech Recognition
21        11  Hidden Markov Models
71        11  Genetic Algorithms
48        11  Optical Flow
83        10  Reinforcement Learning
49        10  Computer Vision
22        10  Mobile Robots
118       9   Word Sense Disambiguation
160       9   NLP
35        8   Planning
106       8   Markov Chain Monte Carlo
40        8   Maximum Likelihood Estimators
131       8   Genetic Algorithms
61        7   Genetic Programming
Topical H-factor, Year 1995
Topic ID  H   Topic
49        18  Computer Vision
120       17  Speech Recognition
146       15  Decision Trees
176       15  Data Mining
21        14  Hidden Markov Models
71        14  Genetic Algorithms
106       13  Markov Chain Monte Carlo
138       13  IR And Queries
118       12  Word Sense Disambiguation
80        12  Web And VR
16        12  Natural Language Parsing
110       12  Bayesian Inference
83        12  Reinforcement Learning
150       12  Logic Programming
22        12  Mobile Robots
160       12  NLP
Topical H-factor, Year 2001
Topic ID  H   Topic
129       15  Web Pages
186       15  Ontologies
50        13  SVMs
49        13  Computer Vision
126       13  Gene Expression
176       13  Data Mining
29        12  Dimensionality Reduction
111       12  Question Answering
132       12  Search Engines
16        11  Natural Language Parsing
83        11  Reinforcement Learning
184       11  Web Services
164       11  HCI
21        10  Hidden Markov Models
118       10  Word Sense Disambiguation
138       10  IR And Queries
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic       Citations  Paper title
Web Pages         31         Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision   14         On being ‘Undigital’ with digital cameras: extending the dynamic...
Video             12         Lessons learned from the creation and deployment of a terabyte digital video
Graphs            12         Trawling the Web for Emerging Cyber-Communities
Web Pages         11         WebBase: a repository of Web pages
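The slides do not give a formal definition of transfer, so the following is only one plausible reading of the table: for each paper in the source topic, count the citations it receives from each other topic, then rank the (other topic, paper) pairs. The data layout and the single-topic assignment per paper are assumptions.

```python
from collections import Counter

def topical_transfer(papers, citations, source_topic):
    """Rank (citing topic, cited paper) pairs for citations that cross out of a topic.

    papers:    {paper_id: topic}  (hypothetical layout, one topic per paper)
    citations: iterable of (citing_id, cited_id) pairs
    """
    counts = Counter()
    for citing, cited in citations:
        # Keep only citations into the source topic that come from a different topic.
        if papers[cited] == source_topic and papers[citing] != source_topic:
            counts[(papers[citing], cited)] += 1
    return counts.most_common()
```

Each returned entry has the shape `((other_topic, paper_id), citation_count)`, matching the three columns of the table.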
Rexa System Overview
[Figure: processing pipeline, starting from the WWW]
• Spider Web for PDFs: home-grown Java+MySQL (~1m PDF/day)
• Convert to text (with layout & format): enhanced ps2text (better word stitching, plus layout in XML)
• Extract metadata (title, authors, abstract, venue, citations; 14 fields in total): Conditional Random Fields (99% word accuracy)
• NSF grant DB
• Reference resolution (of papers, authors & grants): discriminatively trained graph partitioning (competition-winning accuracy)
• Topic Analysis & other Data Mining
• Browsable Web Interface
IE from Research Papers [McCallum et al ‘99]
@article{ kaelbling96reinforcement,
  author = "Leslie Pack Kaelbling and Michael L. Littman and Andrew P. Moore",
  title = "Reinforcement Learning: A Survey",
  journal = "Journal of Artificial Intelligence Research",
  volume = "4",
  pages = "237-285",
  year = "1996",
}
(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence.
[Figure: finite state model and graphical model, with FSM states (output seq) y_{t-1}, y_t, ..., y_{t+3} above observations (input seq) x_{t-1}, x_t, ..., x_{t+3}]
input seq: said Jones a Microsoft VP …
output seq: OTHER PERSON OTHER ORG TITLE …
$$p(y \mid x) = \frac{1}{Z_x} \prod_t \Phi(y_t, y_{t-1}, x, t)$$
where
$$\Phi(y_t, y_{t-1}, x, t) = \exp\Big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\Big)$$
Wide-spread interest, positive experimental results in many applications:
• Noun phrase, Named entity [HLT’03], [CoNLL’03]
• Protein structure prediction [ICML’04]
• IE from Bioinformatics text [Bioinformatics ‘04], …
• Asian word segmentation [COLING’04], [ACL’04]
• IE from Research papers [HLT’04]
• Object classification in images [CVPR ‘04]
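The linear-chain CRF distribution p(y | x), a product of exponentiated weighted feature sums normalized by Z_x, can be evaluated by brute force on a tiny example. This sketch is for intuition only. Practical systems compute Z_x with the forward algorithm instead of enumerating every label sequence, and the two feature functions and their weights here are hypothetical.

```python
import math
from itertools import product

def potential(y_t, y_prev, x, t, features, weights):
    """Phi(y_t, y_{t-1}, x, t) = exp(sum_k lambda_k * f_k(y_t, y_{t-1}, x, t))."""
    return math.exp(sum(w * f(y_t, y_prev, x, t) for f, w in zip(features, weights)))

def unnormalized_score(y, x, features, weights):
    """Product of the potentials over all positions t."""
    score = 1.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else None  # None marks the start of the sequence
        score *= potential(y[t], y_prev, x, t, features, weights)
    return score

def crf_probability(y, x, labels, features, weights):
    """p(y | x), with Z_x computed by enumerating every label sequence."""
    z_x = sum(
        unnormalized_score(candidate, x, features, weights)
        for candidate in product(labels, repeat=len(x))
    )
    return unnormalized_score(tuple(y), x, features, weights) / z_x

# Hypothetical feature functions, for illustration only.
def f_capitalized_person(y_t, y_prev, x, t):
    return 1.0 if x[t][0].isupper() and y_t == "PERSON" else 0.0

def f_lowercase_other(y_t, y_prev, x, t):
    return 1.0 if x[t][0].islower() and y_t == "OTHER" else 0.0
```

With `x = ["said", "Jones"]`, labels `["OTHER", "PERSON"]`, and both weights set to 2.0, the labeling `("OTHER", "PERSON")` receives the highest probability of the four candidates.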
IE from Research Papers
Method                             Field-level F1  Reference
Hidden Markov Models (HMMs)        75.6            [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7            [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9            [Peng, McCallum, 2004]
Error reduced by 40% (SVMs to CRFs).
(Word-level accuracy is >99%)
access, access control, digital library, digital, digital libraries
Summary
• Demonstrated a new topic discovery method (Topical N-Grams) on 1.6m research papers.
• Presented 8 impact measures based on topics.
• Introduced Rexa, a showcase for our research on information extraction, coreference and data mining.
http://rexa.info is publicly available now. Try it out! Feedback appreciated.