Detecting Research Topics via the Correlation between
Graphs and Texts
Yookyung Jo, Dept. of Computer Science, Cornell University
Carl Lagoze† and C. Lee Giles‡
† Dept. of Computer Science, Cornell University
‡ Information Sciences and Technology, The Pennsylvania State University
Acknowledgment
• John E. Hopcroft
• Thorsten Joachims
• Simeon Warner
• Isaac G. Councill
• NSF IIS-0430906, 0227648, 0227888, and 0424671
Topic detection
• Problem statement:
– How to detect topics in a linked corpus (e.g. Citeseer, arXiv, the Web …)
• Our strategy:
– Use the correlation between the distribution of terms representing a topic and the distribution of citation links
Correlation between Terms and Links
Term α : representing a topic (e.g. “sensor network” or “association rule”)
Term η : not representing a topic (e.g. “six months” or “practical examples”)
[Figure: two panels, “Term citation graph for α” and “Term citation graph for η”, with nodes labeled α and η respectively]
[Figure: term citation graph for a term α]
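The construction of a term citation graph can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `docs` and `citations` inputs, and the plain substring match used to decide whether a document contains a term are all assumptions made for the example.

```python
def term_citation_graph(term, docs, citations):
    """Return (nodes, edges) of the term citation graph for `term`.

    docs      : dict mapping document id -> text (hypothetical input format)
    citations : iterable of (citing_id, cited_id) pairs (hypothetical input format)
    """
    needle = term.lower()
    # Nodes: documents whose text contains the term (simple substring match).
    nodes = {doc_id for doc_id, text in docs.items() if needle in text.lower()}
    # Edges: citation links whose two endpoints are both nodes of the graph.
    edges = {(u, v) for u, v in citations if u in nodes and v in nodes and u != v}
    return nodes, edges


def connectivity_stats(nodes, edges):
    """Return (n, nc, |E|): nodes, nodes with at least one link inside the graph, edges."""
    connected = set()
    for u, v in edges:
        connected.add(u)
        connected.add(v)
    return len(nodes), len(connected), len(edges)


docs = {
    1: "Sensor network routing",
    2: "A sensor network protocol",
    3: "Sensor network energy models",
    4: "We show results after six months",
}
citations = [(2, 1), (3, 1), (4, 1), (3, 4)]

nodes, edges = term_citation_graph("sensor network", docs, citations)
print(connectivity_stats(nodes, edges))
```

In this toy corpus the links touching document 4 are dropped, because document 4 does not contain the term.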
Detecting a topic via a single term
• Given a term A, make a binary decision of whether A represents a topic or not
• H1 : A represents a topic
• H0 : A does not represent a topic
• $G_A$ : the term citation graph for A
• $O(G_A)$ : link connectivity observation on $G_A$

$$\mathrm{TopicScore}(A) = \ln \frac{\Pr(O(G_A) \mid H_1)}{\Pr(O(G_A) \mid H_0)}$$

• Finally, a ranked list of terms
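Log-likelihood of H1
• Observation $O(G_A)$ :
– For each node i in $G_A$, is it connected to other nodes in $G_A$ by at least one link?
– $p_c$ : the probability that it is
• Under H1
– $p_{c1}$ : estimation of $p_c$
– $p_{c1}$ set to a value close to 1 (e.g. $p_{c1} = 0.9$)
– $n_A$ : number of nodes in $G_A$; $n_{A,c}$ : number of nodes with at least one connection within $G_A$

$$\ln \Pr(O(G_A) \mid H_1) = \sum_{i \in G_A} \ln \Pr(O_i \mid H_1) = n_{A,c} \ln(p_{c1}) + (n_A - n_{A,c}) \ln(1 - p_{c1})$$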
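As a small worked example, the H1 log-likelihood can be evaluated directly from the two node counts. The sketch below assumes the form n_c ln(p_c1) + (n - n_c) ln(1 - p_c1) with the slide's example value p_c1 = 0.9; the function name `loglik_h1` and the toy counts are illustrative choices, not from the slides.

```python
import math


def loglik_h1(n, n_c, p_c1=0.9):
    """Log-likelihood of the connectivity observation under H1.

    n    : number of nodes in the term citation graph
    n_c  : number of those nodes with at least one link inside the graph
    p_c1 : assumed connection probability under H1 (close to 1)
    """
    return n_c * math.log(p_c1) + (n - n_c) * math.log(1.0 - p_c1)


# A well-connected graph is far more likely under H1 than a sparse one.
well_connected = loglik_h1(n=10, n_c=9)
sparse = loglik_h1(n=10, n_c=1)
print(well_connected, sparse)
```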
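Log-likelihood of H0
• $\frac{n_A - 1}{N - 1}$ : the probability that a link of a node in $G_A$ falls within $G_A$ (N : number of nodes in the whole corpus)
• $p_{c0}$ : estimation of $p_c$ ($l_i$ : number of links of node i)

$$p_{c0} = 1 - \left(1 - \frac{n_A - 1}{N - 1}\right)^{l_i}$$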
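Putting the two hypotheses together, the score and the final ranking might be computed as in the following sketch. It rests on several assumptions not stated on the slides: that l_i is the number of links of node i, that N is the number of documents in the corpus, that nodes are treated independently under both hypotheses, and that probabilities are clamped away from 0 and 1 to keep the logarithms finite. All function and parameter names are hypothetical.

```python
import math


def _clamp(p, eps=1e-12):
    # Keep probabilities strictly inside (0, 1) so that log() stays finite.
    return min(max(p, eps), 1.0 - eps)


def topic_score(connected, link_counts, corpus_size, p_c1=0.9):
    """ln Pr(O | H1) - ln Pr(O | H0) for one term citation graph.

    connected   : list of bools, True if node i has a link inside the graph
    link_counts : list of ints, assumed number of links l_i of node i
    corpus_size : assumed total number of documents N
    """
    n = len(connected)
    within = (n - 1) / (corpus_size - 1)  # chance that one link falls inside the graph
    score = 0.0
    for is_connected, l_i in zip(connected, link_counts):
        p_c0 = _clamp(1.0 - (1.0 - within) ** l_i)
        if is_connected:
            score += math.log(p_c1) - math.log(p_c0)
        else:
            score += math.log(1.0 - p_c1) - math.log(1.0 - p_c0)
    return score


def rank_terms(observations, corpus_size):
    """Sort terms by decreasing score; observations maps term -> (connected, link_counts)."""
    scores = {t: topic_score(c, l, corpus_size) for t, (c, l) in observations.items()}
    return sorted(scores, key=scores.get, reverse=True)


observations = {
    "topic-like term": ([True] * 9 + [False], [10] * 10),
    "stop-phrase-like term": ([False] * 10, [10] * 10),
}
print(rank_terms(observations, corpus_size=10_000))
```

A densely connected graph receives a positive score because its connectivity is very unlikely under the random-link model of H0, while a graph with no internal links receives a negative score.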
Evaluation
• arXiv
– A Physics literature collection
– Year 1991-2006, 7 major arXiv areas
– 214,546 papers, 2,165,170 citation links
– Abstract as document
– 137,098 bi-gram terms after low-frequency prune
• Citeseer
– A Computer Science related collection
– Year 1994-2004
– 716,771 papers, 1,740,326 citation links
– Abstract + title as document
– 631,839 bi-gram terms after low-frequency prune
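The bi-gram term extraction with low-frequency pruning could look roughly like the sketch below. The slides do not give the tokenization rule or the pruning threshold, so the regular expression, the use of document frequency, and the `min_doc_freq` parameter are assumptions made for illustration only.

```python
import re
from collections import Counter


def bigrams(text):
    """Return the adjacent word pairs of a text as lowercase 'w1 w2' strings."""
    # Tokens are runs of letters/digits, optionally joined by hyphens (e.g. "real-time").
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def bigram_vocabulary(documents, min_doc_freq=2):
    """Keep bi-grams that occur in at least `min_doc_freq` documents (assumed threshold)."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(bigrams(text)))  # count each bi-gram once per document
    return {term for term, count in doc_freq.items() if count >= min_doc_freq}


abstracts = [
    "Black hole entropy",
    "Entropy of a black hole",
    "Quantum gravity",
]
print(bigram_vocabulary(abstracts))
```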
arXiv (physics): topic terms at top ranks

Rank  Topic (term)             n      nc     |E|
1     Black hole               4978   4701   38952
2     Quantum hall             1863   1493   4862
3     Black holes              3131   2896   22824
4     Higgs boson              2079   1896   12607
5     Renormalization group    3738   2920   8490
6     Quantum gravity          2014   1724   9693
7     Standard model           7848   7145   53829
8     Heavy quark              1671   1473   6570
9     Cosmological constant    2141   1815   7134
10    Quantum dot              1366   1031   2926
11    Chiral perturbation      1132   1050   5578
12    Form factors             1578   1354   5616
13    Lattice qcd              1425   1265   5240
14    String theory            3818   3539   26250
15    Hubbard model            1702   1167   2678

n : number of nodes in GA
nc : number of nodes with at least one connection within GA
|E| : number of edges in GA
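To relate these counts to the connection probability p_c, one can compute the observed fraction nc / n for a few rows. The snippet below only illustrates that arithmetic; the three rows are copied from the arXiv top-rank table, and the helper name `connected_fraction` is made up for the example.

```python
# (n, nc, |E|) triples copied from the arXiv top-rank table.
rows = {
    "black hole": (4978, 4701, 38952),
    "standard model": (7848, 7145, 53829),
    "hubbard model": (1702, 1167, 2678),
}


def connected_fraction(n, n_c):
    """Observed fraction of nodes with at least one link inside the graph."""
    return n_c / n


for term, (n, n_c, _num_edges) in rows.items():
    print(f"{term}: {connected_fraction(n, n_c):.3f}")
```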
arXiv (physics): term citation graphs for intermediate rank topic terms
[Figure: term citation graphs showing research communities over time]
arXiv (physics): terms at bottommost ranks

$$\mathrm{TopicScore}(A) = \ln \frac{\Pr(O(G_A) \mid H_1)}{\Pr(O(G_A) \mid H_0)}$$

Bottom entries are stop-phrases

Rank     Term
137098   we show
137097   has been
137096   we find
137095   we present
137094   we study
137093   we have
137092   we also
137091   have been
137090   we discuss
137089   we consider
137088   does not
137087   our results
137086   we investigate
137085   into account
137084   we propose
Citeseer (CS): top rank terms
Top rank terms from two different time periods
• Time up to 1999
• Time since 2000

Rank  Topic (term) up to 1999      Topic (term) since 2000
1     logic programs               sensor networks
2     model checking               hoc networks
3     semidefinite programming     logic programs
4     inductive logic              image retrieval
5     petri nets                   support vector
6     genetic programming          congestion control
7     interior point               model checking
8     kolmogorov complexity        decision diagrams
9     automatic differentiation    wireless sensor
10    complementarity problems     ad hoc
11    congestion control           intrusion detection
12    complementarity problem      vector machines
13    conservation laws            mobile ad
14    linear logic                 binary decision
15    timed automata               sensor network
16    situation calculus           energy consumption
17    real-time database           content-based image
18    motion planning              semantic web
19    duration calculus            fading channels
20    volume rendering             xml data
21    chain monte                  source separation
22    association rules            timed automata
Citeseer: Topic time evolution
[Figure: panels for “sensor networks”, “support vector”, “congestion control”, and “logic programs”]
Citeseer: Topic time evolution
[Figure: panels for “petri nets”, “association rules”, “genetic programming”, and “semantic web”]
Algorithm Extension
• To detect topics represented by a single term
– Algorithm
– Evaluation on arXiv, Citeseer
• To detect topics defined by a set of terms
– Algorithm
– Evaluation on arXiv
Conclusion (poster session : #7)
• Topic detection via the correlation between terms and links
• Our algorithm (in its evaluation on arXiv, Citeseer)
– Effectively discovers topics represented by a single term or by a set of terms
– Identifies stop-phrases as a by-product
– Discovers topics in their natural scale
– Demonstrates its utility in trend analysis
– Shows the association between topic scale and specificity