N-gram Topic Models for Bibliometric Analysis
Gideon Mann, David Mimno, and Andrew McCallum
Can topic models provide better measurements of the impact of research literature?
Bibliometrics and Scientometrics
Typically analyzes patterns of citations in research literature
Derek de Solla Price: “Little Science, Big Science”
Eugene Garfield: Science Citation Index, Journal Citation Reports
Comparing apples to apples: top journals by citations
Biochemistry and molecular biology:
J. Biol. Chem 405017
Cell 136472
Biochem.-US 96809
Mathematics:
Lect. Notes Math 6926
T. Am. Math. Soc 6469
J. Math. Anal. Appl. 6004
Source: Journal Citation Reports (2004)
What’s wrong with grouping by journal?
• 10 of the 200 most cited papers in CiteSeer are unpublished technical reports; 15% of the most cited papers are from conference proceedings
• Open-access publication increasing, but venue information often not available
• Hand-entered ISI citation data noisy
• Article has only one venue, journals cover many topics
A topic model for N-grams
Determine whether the next word will be part of an n-gram based on the current word and the current hidden topic. “White house” is a collocation in politics, but may not be one in real estate.
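A minimal sketch of that decision, assuming the model keeps a probability of continuing an n-gram for each (topic, current word) pair. The function name, the `join_prob` table, and the probabilities in it are hypothetical illustrations, not values from the model described here.

```python
import random

def sample_ngram_indicator(current_word, current_topic, join_prob, rng):
    """Return True if the next word should be joined to current_word as an n-gram.

    join_prob is a hypothetical lookup table mapping (topic, word) to the
    probability that the next word continues an n-gram. Pairs that are
    missing from the table default to 0.0 (never join).
    """
    p = join_prob.get((current_topic, current_word), 0.0)
    return rng.random() < p

# Illustrative extremes: "white" always starts a collocation in politics,
# never in real estate.
join_prob = {("politics", "white"): 1.0, ("real_estate", "white"): 0.0}
rng = random.Random(0)
in_politics = sample_ngram_indicator("white", "politics", join_prob, rng)
in_real_estate = sample_ngram_indicator("white", "real_estate", join_prob, rng)
```

The point of conditioning on the topic is that the same word can yield different n-gram decisions in different documents.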
Sample n-gram topics
1. Digital Libraries (102): digital, electronic, library, metadata, access; “digital libraries”, “digital library”, “electronic commerce”, “dublin core”, “cultural heritage”
2. WWW (129): web, site, pages, page, www, sites; “world wide web”, “web pages”, “web sites”, “web site”, “world wide”
3. Ontologies (186): semantic, ontology, ontologies, rdf, semantics, meta; “semantic web”, “description logics”, “rdf schema”, “description logic”, “resource description framework”
4. Web services (184): web, services, service, xml, business; “web services”, “web service”, “markup language”, “xml documents”, “xml schema”
Assigning topics to documents
1. Build a 200 topic n-gram topic model on 300k documents
2. Remove stopword or methodological topics (e.g. “efficient, fast, speed”)
3. For each document d, if more than 10% of d’s tokens are assigned to topic t, and that comprises more than two tokens, assign d to t
Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.
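The assignment rule in step 3 can be sketched as follows, assuming each document is available as a list of per-token topic ids from the trained model. The function and parameter names are hypothetical; the `excluded` set stands in for the stopword and methodological topics removed in step 2.

```python
from collections import Counter

def assign_topics(token_topics, min_fraction=0.10, min_tokens=2, excluded=frozenset()):
    """Return the set of topics a document is assigned to.

    token_topics: list with one topic id per token in the document.
    A topic qualifies when it covers more than min_fraction of the tokens
    AND more than min_tokens tokens, and is not in the excluded set.
    """
    counts = Counter(token_topics)
    n = len(token_topics)
    return {
        topic
        for topic, count in counts.items()
        if topic not in excluded
        and count > min_fraction * n
        and count > min_tokens
    }

# 20 tokens: topic 3 has exactly 10% of the tokens, so it does not qualify.
doc = [1] * 5 + [2] * 3 + [3] * 2 + [4] * 10
assigned = assign_topics(doc)
```

Because both conditions are strict inequalities, a document can belong to several topics or to none.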
Impact Factor
Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.
2004 Impact factors from JCR:
Nature 32.182
Cell 28.389
JMLR 5.952
Machine Learning 3.258
Topic Impact Factor
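Substituting topic for journal in the impact factor definition gives a computation along these lines. This is a sketch under assumed data structures (a `papers` dict and a list of citation pairs, both hypothetical), not the authors' implementation.

```python
def topic_impact_factor(topic, year, papers, citations):
    """Citations made in `year` to the topic's papers from the two prior
    years, divided by the number of those papers.

    papers: dict mapping paper id -> (publication year, set of topic ids).
    citations: iterable of (citing paper id, cited paper id) pairs.
    """
    window = {year - 1, year - 2}
    targets = {
        pid for pid, (pub_year, topics) in papers.items()
        if pub_year in window and topic in topics
    }
    if not targets:
        return 0.0
    hits = sum(
        1
        for citing, cited in citations
        if cited in targets and citing in papers and papers[citing][0] == year
    )
    return hits / len(targets)

# Toy data with made-up ids; a paper may belong to more than one topic.
papers = {
    "a": (2002, {1}),
    "b": (2003, {1, 2}),
    "c": (2004, {2}),
    "d": (2004, {1}),
    "e": (2001, {1}),
}
citations = [("c", "a"), ("d", "a"), ("d", "b"), ("c", "e"), ("b", "a")]
tif = topic_impact_factor(1, 2004, papers, citations)
```

As with the journal version, citations count regardless of which topic the citing paper belongs to.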
Broad Impact: Diffusion
Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100
Problem: relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
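A short sketch of the diffusion indicator makes the brittleness concrete. It assumes the input is a list with one entry per incoming citation, naming the citing journal or topic; the function name and sample labels are hypothetical.

```python
def diffusion(citing_sources):
    """Number of distinct citing sources per 100 citations.

    citing_sources: list with one entry per citation received, giving the
    journal (or topic) the citation came from.
    """
    if not citing_sources:
        return 0.0
    return 100.0 * len(set(citing_sources)) / len(citing_sources)

# Two citations from two different sources reach the maximum score.
rarely_cited = diffusion(["A", "B"])             # 100.0
# Ten citations from the same two sources score much lower.
often_cited = diffusion(["A"] * 9 + ["B"])       # 20.0
```

Both examples are cited by exactly two sources, yet the rarely cited one scores five times higher.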
Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of citing topics
Better at capturing broad end of impact spectrum: the high diffusion topics are identical to the least frequently cited topics
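Topic diversity can be sketched as the Shannon entropy of the citing-topic counts. The slides do not state a logarithm base, so the natural log used here is an assumption, as are the function name and sample labels.

```python
import math
from collections import Counter

def topic_diversity(citing_topics):
    """Shannon entropy (natural log, an assumption) of the citing topics.

    citing_topics: list with one entry per citation received, giving the
    topic of the citing document.
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Citations spread evenly over four topics are more diverse than
# citations concentrated in one topic.
broad = topic_diversity(["t1", "t2", "t3", "t4"])
narrow = topic_diversity(["t1"] * 4)
```

Unlike diffusion, entropy does not reward a low citation count by itself: it depends on how evenly citations are spread across topics.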
Topic diversity can also be measured for papers:
Longevity: Cited Half Life
Two views:
• Given a paper, what is the median age of citations to that paper?
• What is the median age of citations from current literature?
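Both views reduce to a median over citation ages, where age is the gap in years between the citing and the cited publication. The sketch below assumes publication years are known for both ends of each citation; the function names and the sample years are hypothetical.

```python
from statistics import median

def cited_half_life(pub_year, citing_years):
    """View 1: median age of the citations a paper has received.

    citing_years: publication years of the documents citing the paper.
    """
    return median(year - pub_year for year in citing_years)

def citing_half_life(current_year, cited_years):
    """View 2: median age of the references made by current literature.

    cited_years: publication years of the works cited by documents
    published in current_year.
    """
    return median(current_year - year for year in cited_years)

received = cited_half_life(2000, [2001, 2003, 2008])   # ages 1, 3, 8
made = citing_half_life(2004, [2003, 2000, 1990])      # ages 1, 4, 14
```

Aggregating either function over all papers assigned to a topic would give a topic-level figure.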
History: Topical Precedence
Within a topic, what are the earliest papers that received more than n citations?
Information Retrieval (138):
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
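A list like the one above could be produced by filtering a topic's papers on citation count and sorting by year. This is a sketch with hypothetical record fields and made-up placeholder papers; the threshold n and list length k are free parameters.

```python
def topical_precedence(papers, n, k=5):
    """Return the k earliest papers in a topic with more than n citations.

    papers: list of dicts with hypothetical keys "title", "year", and
    "citations", all belonging to the same topic.
    """
    qualifying = [p for p in papers if p["citations"] > n]
    return sorted(qualifying, key=lambda p: p["year"])[:k]

# Placeholder records, not real bibliographic data.
topic_papers = [
    {"title": "Paper A", "year": 1985, "citations": 40},
    {"title": "Paper B", "year": 1962, "citations": 3},
    {"title": "Paper C", "year": 1971, "citations": 12},
    {"title": "Paper D", "year": 1968, "citations": 25},
]
earliest = [p["title"] for p in topical_precedence(topic_papers, n=10, k=2)]
```

Paper B is the oldest record but falls below the citation threshold, so it is skipped.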
Sharing: Topical Transfer