I present here my work on how to identify memes in the scientific literature by using the citation network.
Meme Extraction from Corpora of Scientific Literature using Citation Networks
Tobias Kuhn
http://www.tkuhn.ch
@txkuhn
ETH Zurich
Colloquium
Institute of Computational Linguistics
University of Zurich
25 November 2014
Reference
Journal article on the content of this talk:
Tobias Kuhn, Matjaz Perc, and Dirk Helbing. Inheritance patterns in citation networks reveal scientific memes. Physical Review X, 4, 041036, 21 November 2014. https://journals.aps.org/prx/abstract/10.1103/PhysRevX.4.041036
Tobias Kuhn, ETH Zurich Meme Extraction from Corpora of Scientific Literature using Citation Networks 2 / 22
Meme Detection
I am presenting an approach to “meme detection”, which is related to a number of existing problems and approaches:
• Named-entity extraction
• Keyphrase extraction
• Topic modeling
• Terminology extraction
Context for NLP
Most NLP approaches focus on the analysis of the texts themselves:
• Grammar
• Morphology
• Text Structure
• Statistical Patterns
Some also take the contexts of the texts into account:
• Comparison to properties of entire corpus (e.g. tf–idf)
• Training on particular corpus/domain/speaker
• Citation graph of scientific publications
Citation Graph of Scientific Publications
Nodes: publications
Edges: citations (in gray)
Entire giant component (33 million nodes) of the citation graph of Thomson Reuters' Web of Science dataset.
Legend:
• Natural/Agricultural Sciences (except Physical Sciences)
• Physical Sciences
• Engineering and Technology
• Medical and Health Sciences
• Social Sciences / Humanities
Citation Graph: American Physical Society
Citation graph of the Physical Review journals (463k nodes).
Legend:
• A: Atomic, molecular, optical phys.
• B: Condensed matter, materials phys.
• C: Nuclear phys.
• D: Particles, fields, gravitation, cosmology
• E: Statistical, nonlinear, soft matter phys.
• other journals
Citation Graph: Memes
Specific phrases or “memes” localize to specific regions in the citation graph.
Legend:
• quantum
• fission
• graphene
• self-organized criticality
• traffic flow
Scientific Memes
“Meme” was coined by Richard Dawkins:
“Just as genes propagate themselves in the gene pool by leaping from body to body via sperm or eggs, so memes propagate themselves in the meme pool by leaping from brain to brain via a process which, in the broad sense, can be called imitation.” [Dawkins, The Selfish Gene]
Examples of memes:
• Melodies
• Recipes
• Cultural habits
• Words, grammar rules, text style
• Scientific concepts
Genes/Memes as Network Patterns!
Dawkins’ Definition of “Gene”:
“I am using the word gene to mean a genetic unit that is small enough to last for a number of generations and to be distributed around in many copies.” [Dawkins, The Selfish Gene]
Our Working Definition of “Scientific Meme”:
A scientific meme is a short unit of text in a publication that is replicated in citing publications and thereby distributed around in many copies.
Propagation Score
The propagation score P quantifies the degree to which a meme’s occurrence aligns with the citation graph:
$$P_m = \frac{\text{sticking factor}}{\text{sparking factor}} = \frac{d_{m \to m}}{d_{\to m}} \Big/ \frac{d_{m \to \not m}}{d_{\to \not m}}$$
To keep infrequent phrases from getting a high propagation score by chance, we can add a small amount of controlled noise δ (we use δ = 3):
$$P_m = \frac{d_{m \to m}}{d_{\to m} + \delta} \Big/ \frac{d_{m \to \not m} + \delta}{d_{\to \not m} + \delta}$$
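A worked example with made-up counts may help show what the noise term does. The numbers below are hypothetical and not taken from the talk, and the reading of the subscripts is an assumption: d with "m → m" counts meme-carrying publications that cite at least one meme-carrying publication, d with "→ m" counts all publications that cite at least one meme-carrying publication, and the crossed-out variants count the same for publications that cite no meme-carrying publication. Consider a rare phrase used by only two publications, one of which cites the other:

```latex
% Hypothetical counts for a rare phrase (illustration only):
% d_{m \to m} = 1, d_{\to m} = 1, d_{m \to \not m} = 1, d_{\to \not m} = 100000
\begin{align*}
\text{without noise:}\quad P_m &= \frac{1}{1} \Big/ \frac{1}{100000} = 100000 \\
\text{with } \delta = 3\text{:}\quad P_m &= \frac{1}{1 + 3} \Big/ \frac{1 + 3}{100000 + 3}
  = \frac{1}{4} \cdot \frac{100003}{4} = \frac{100003}{16} \approx 6250
\end{align*}
```

Under these assumed counts, the noise term lowers the score of the chance hit by a factor of about 16, while a phrase with large counts would be almost unaffected.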
Frequency/Propagation Score for APS Data
[Figure: density of n-grams by relative frequency and propagation score (logarithmic axes) for the APS data, N = 1,372,365. Highlighted memes: quantum, fission, graphene, self-organized criticality, traffic flow.]
Meme Score
The meme score M is the product of relative frequency f and propagation score P:
$$M_m = f_m P_m$$
Top 20 Memes for APS (Physics):
1. loop quantum cosmology +*
2. unparticle +*
3. sonoluminescence +*
4. MgB2 +
5. stochastic resonance +*
6. carbon nanotubes +*
7. NbSe3 +
8. black hole +*
9. nanotubes +
10. lattice Boltzmann +*
11. dark energy +*
12. Rashba
13. CuGeO3 +
14. strange nonchaotic
15. in NbSe3
16. spin Hall +
17. elliptic flow +*
18. quantum Hall +*
19. CeCoIn5 +
20. inflation +
+ annotators agreed that this is an interesting and important physics concept
* also found on the list of terms extracted from Wikipedia
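The meme score defined above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the talk. It assumes that the relative frequency f is the fraction of publications carrying a phrase, that publications without any meme-carrying reference (including those with no references at all) count on the "not citing m" side, and that δ > 0. The function name `meme_scores` and the toy corpus are hypothetical.

```python
from collections import Counter

def meme_scores(texts, cites, delta=3.0):
    """Return {phrase: meme score} for every phrase in the corpus.

    texts: publication id -> set of phrases it contains
    cites: publication id -> iterable of cited publication ids (all keys of texts)
    delta: controlled noise; this sketch assumes delta > 0
    """
    if delta <= 0:
        raise ValueError("this sketch assumes delta > 0")
    n = len(texts)
    carried = Counter()   # publications carrying m
    d_to_m = Counter()    # publications citing at least one carrier of m
    d_m_to_m = Counter()  # carriers of m citing at least one carrier of m
    for pub, phrases in texts.items():
        inherited = set()  # phrases present in any cited publication
        for cited in cites.get(pub, ()):
            inherited |= texts[cited]
        carried.update(phrases)
        d_to_m.update(inherited)
        d_m_to_m.update(phrases & inherited)
    scores = {}
    for m, count in carried.items():
        d_m_to_not_m = count - d_m_to_m[m]  # carriers citing no carrier
        d_to_not_m = n - d_to_m[m]          # all publications citing no carrier
        sticking = d_m_to_m[m] / (d_to_m[m] + delta)
        sparking = (d_m_to_not_m + delta) / (d_to_not_m + delta)
        scores[m] = (count / n) * sticking / sparking  # M_m = f_m * P_m
    return scores

# Toy corpus (hypothetical): five publications with one phrase each.
toy_texts = {"a": {"graphene"}, "b": {"graphene"}, "c": {"x"},
             "d": {"y"}, "e": {"graphene"}}
toy_cites = {"b": ["a"], "c": ["a"], "d": ["c"], "e": ["d"]}
for phrase, score in sorted(meme_scores(toy_texts, toy_cites, delta=1.0).items(),
                            key=lambda item: -item[1]):
    print(phrase, round(score, 4))
```

All counts are collected in a single pass over the publications, so no threshold on phrase frequency or length is needed in this sketch.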
Properties of the Meme Score
The meme score has a number of nice properties:
• Can be calculated efficiently and exhaustively even on very large datasets
• No upper limit on the length of n-grams
• No dependence on external linguistic or ontological knowledge
• No stop-word lists or other kinds of arbitrary filters or thresholds
Manual Annotation
• Two annotators (A1, A2): PhD students with physics degree
• Annotation with respect to (1) physics concept or not and (2) linguistic category
• Randomly extracted phrases for comparison
[Figure: annotations by A1 and A2 for terms selected by meme score, at random, and by weighted random sampling (horizontal axis: terms, 30 to 150). Categories: physics concept / not a physics concept; noun phrase / verb / adjective or adverb / other.]
Comparison to Alternative Metrics
[Figure: A (area under curve), on a scale from 0 to 0.5, for meme score, frequency, max. absolute change over time, max. relative change over time, max. absolute difference across journals, and max. relative difference across journals.]
[Figure: percentage of Wikipedia terms among the top x terms by meme score.]
40% of top 50 terms are found on Wikipedia list
Evolution over Time: Exemplary Memes
[Figure: meme score (δ = 1) plotted against publication count (×10^5), with years from 1940 to 2008 marked, for quantum, fission, graphene, self-organized criticality, and traffic flow.]
Evolution over Time
[Figure: meme score plotted against publication count (×10^5), with years from 1940 to 2008 marked. Labeled memes: graphene, entanglement, MgB2, nanotubes, carbon nanotubes, quark, neutrino, Bose-Einstein, quantum Hall, black, C60, Hubbard model, quantum wells, graphite, reactions, photoemission, black hole, tricritical, Kondo, superconducting, fission, MeV, diffuse scattering.]
Conclusions
The citation graph is a very powerful resource for detecting memes.
Combined with other existing approaches, this seems to be a promising tool for NLP on scientific publications.
Could be applied to other types of texts that have a certain kind of citation structure (legal texts?).
Allows for studying memes in an exhaustive manner.
Thank you for your Attention!
Questions?
Randomized Network
[Figure: density of n-grams by relative frequency and propagation score (logarithmic axes) for the APS citation network after time-preserving randomization, N = 89,356.]
Meme Score Calculation
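1. Collect all phrases that stick at least once (not counting “free-riding” on larger memes)
2. Calculate sticking and sparking factors for all collected phrases:
$$M_m = f_m P_m \quad \text{with} \quad P_m = \frac{\text{sticking factor}}{\text{sparking factor}} = \frac{d_{m \to m}}{d_{\to m} + \delta} \Big/ \frac{d_{m \to \not m} + \delta}{d_{\to \not m} + \delta}$$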
Example
Citing title:
– covariant effective action for loop quantum cosmology from order reduction
Cited titles:
– quantum nature of the big bang
– absence of a singularity in loop quantum cosmology
– large scale effective theory for cosmological bounces
Sticking phrases: loop quantum cosmology, quantum, effective, for
Sparking phrases: covariant, covariant effective action, order reduction, ...
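The example above can be reproduced with a short Python sketch. It rests on one possible reading of “free-riding”, which is an assumption and not confirmed by the slides: for each cited title, only the longest n-grams shared with the citing title are kept, so that "loop quantum" does not count separately when it only rides along inside "loop quantum cosmology". Sparking phrases are taken to be all n-grams of the citing title that occur in none of the cited titles. The function names are hypothetical.

```python
def ngrams(title):
    """All contiguous word sequences of a title, as tuples of words."""
    words = title.split()
    return {tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))

def sticking_phrases(citing, cited_titles):
    """Maximal n-grams shared between the citing title and each cited title."""
    citing_grams = ngrams(citing)
    result = set()
    for cited in cited_titles:
        common = citing_grams & ngrams(cited)
        for gram in common:
            # Assumed free-riding rule: drop a shared n-gram if a longer
            # n-gram shared with the same cited title contains it.
            free_riding = any(len(other) > len(gram) and contains(other, gram)
                              for other in common)
            if not free_riding:
                result.add(" ".join(gram))
    return result

def sparking_phrases(citing, cited_titles):
    """N-grams of the citing title that appear in no cited title."""
    cited_grams = set()
    for cited in cited_titles:
        cited_grams |= ngrams(cited)
    return {" ".join(gram) for gram in ngrams(citing) - cited_grams}

citing_title = ("covariant effective action for loop quantum cosmology "
                "from order reduction")
cited = [
    "quantum nature of the big bang",
    "absence of a singularity in loop quantum cosmology",
    "large scale effective theory for cosmological bounces",
]
print(sorted(sticking_phrases(citing_title, cited)))
print(sorted(sparking_phrases(citing_title, cited)))
```

Note that "quantum" still sticks on its own here, because it is shared with the first cited title independently of the longer phrase.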
Top Meme Scores for Web of Science Data
1. MgB2
2. lattice Boltzmann
3. graphene
4. on chalcogenolates
5. Ti3SiC2
6. harmony search
7. seasonal climate summary southern hemisphere
8. empirical likelihood
9. proxy re-encryption
10. spiking neural P systems
11. loop quantum cosmology
12. zero-divisor
13. BiFeO3
14. Neospora
15. Papuloerythroderma
16. Neospora caninum
17. metal dusting
18. porcine circovirus
19. cone metric
20. ranked set
Top Meme Scores for PubMed Central Data
1. Buruli ulcer
2. G-quadruplex
3. miRNAs
4. chronic cerebrospinal venous insufficiency
5. cerebrospinal venous
6. Mycobacterium ulcerans
7. enterovirus 71
8. G-quadruplexes
9. CCSVI
10. malaria
11. Nipah virus
12. miRNA
13. microRNAs
14. hepatitis E virus
15. the 45 and Up Study
16. chronic cerebrospinal venous insufficiency (CCSVI)
17. EV71
18. bluetongue
19. Schmallenberg virus
20. Nipah