Detecting Research Topics via the Correlation between Graphs and Texts. Yookyung Jo (Dept. of Computer Science, Cornell University), Carl Lagoze (Dept. of Computer Science, Cornell University), and C. Lee Giles (Information Sciences and Technology, The Pennsylvania State University)

Page 1

Detecting Research Topics via the Correlation between

Graphs and Texts

Yookyung Jo, Dept. of Computer Science, Cornell University

Carl Lagoze†, and C. Lee Giles‡

† Dept. of Computer Science, Cornell University, ‡ Information Sciences and Technology, The Pennsylvania State University

Page 2

Acknowledgment

• John E. Hopcroft
• Thorsten Joachims
• Simeon Warner
• Isaac G. Councill

• NSF IIS-0430906, 0227648, 0227888, and 0424671

Page 3

Topic detection

• Problem statement : How to detect topics in a linked corpus (e.g. Citeseer, arXiv, the Web …)

• Our strategy : use the correlation between
  – Distribution of terms representing a topic
  – Distribution of citation links

Page 4

Correlation between Terms and Links

[Figure: two panels, "Term citation graph for α" and "Term citation graph for η"]

Term α : representing a topic (e.g. “sensor network”, or “association rule”)

Term η : not representing a topic (e.g. “six months”, or “practical examples”)

Page 5

Term citation graph for a term α

[Figure: term citation graph for α]
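As a rough illustration of how a term citation graph might be assembled, the Python sketch below assumes (the slides do not spell this out) that the nodes of the graph are the documents containing the term and that its edges are the citation links between those documents. The function name, the plain substring match, and the toy corpus are hypothetical choices made only for this example.

```python
def term_citation_graph(docs, citations, term):
    """Return (nodes, edges, connected) for the given term.

    docs      : dict mapping document id -> text (assumed input format)
    citations : iterable of (citing_id, cited_id) pairs
    term      : phrase to look for, matched as a lowercase substring
    """
    needle = term.lower()
    # Nodes: documents whose text contains the term (assumption).
    nodes = {doc_id for doc_id, text in docs.items() if needle in text.lower()}
    # Edges: citation links with both endpoints among the nodes.
    edges = {(u, v) for u, v in citations if u in nodes and v in nodes and u != v}
    # Connected nodes: those touched by at least one edge inside the graph.
    connected = {doc_id for edge in edges for doc_id in edge}
    return nodes, edges, connected


# Toy corpus, invented for illustration only.
docs = {
    1: "Sensor network routing",
    2: "A sensor network survey",
    3: "Six months of data",
    4: "Sensor network energy",
}
citations = [(2, 1), (3, 1), (4, 3)]

nodes, edges, connected = term_citation_graph(docs, citations, "sensor network")
n, n_c, num_edges = len(nodes), len(connected), len(edges)
```

The three counts computed at the end correspond to the number of nodes, the number of nodes with at least one connection, and the number of edges in the term citation graph.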


Page 7

Detecting a topic via a single term

• Given a term A, make a binary decision of whether A represents a topic or not
• H1 : A represents a topic
• H0 : A does not represent a topic
• GA : The term citation graph for A
• O(GA) : Link connectivity observation on GA

$$\mathrm{TopicScore}(A) = \ln \frac{\Pr(O(G_A) \mid H_1)}{\Pr(O(G_A) \mid H_0)}$$

• Finally, a ranked list of terms

Page 8

Loglikelihood of H1

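• Observation O(GA) :
  – For each node i in GA, is it connected to other nodes in GA by at least one link?
  – The probability of this event is pc

• Under H1
  – pc1 : estimation of pc
  – pc1 set to a value close to 1 (e.g. pc1 = 0.9)

$$\ln \Pr(O(G_A) \mid H_1) = \sum_{i \in G_A} \ln \Pr(O_i(G_A) \mid H_1) = n_{A,c} \ln(p_{c,1}) + (n_A - n_{A,c}) \ln(1 - p_{c,1})$$

where $n_A$ is the number of nodes in GA and $n_{A,c}$ is the number of nodes with at least one connection within GA.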
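The H1 log-likelihood can be evaluated directly from two counts. The following Python sketch is one possible reading of the formula on this slide; the function name and argument names are hypothetical, and the default of 0.9 for pc1 is simply the example value the slide mentions.

```python
import math


def loglik_h1(n, n_c, p_c1=0.9):
    """Log-likelihood of the connectivity observation under H1.

    n    : number of nodes in the term citation graph
    n_c  : number of those nodes with at least one connection inside it
    p_c1 : assumed probability that a node is connected under H1
    """
    if not 0 <= n_c <= n:
        raise ValueError("n_c must lie between 0 and n")
    if not 0.0 < p_c1 < 1.0:
        raise ValueError("p_c1 must lie strictly between 0 and 1")
    # Connected nodes contribute ln(p_c1); the rest contribute ln(1 - p_c1).
    return n_c * math.log(p_c1) + (n - n_c) * math.log(1.0 - p_c1)


# Counts reported for "black hole" in the arXiv results later in the deck.
black_hole_h1 = loglik_h1(n=4978, n_c=4701)
```

Because each unconnected node costs ln(1 - pc1), which is much more negative than ln(pc1) when pc1 is close to 1, a graph with many isolated nodes is penalized heavily under H1.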

Page 9

Loglikelihood of H0

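• The probability that a link of a node in GA falls within GA :

$$\frac{n_A - 1}{N - 1}$$

• pc0 : estimation of pc

$$p_{c,0} = 1 - \left(1 - \frac{n_A - 1}{N - 1}\right)^{l_i}$$

Here $n_A$ is the number of nodes in GA. The slide does not define $N$ and $l_i$; from the formula, $N$ reads as the number of documents in the whole corpus and $l_i$ as the number of links of node i.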
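Putting the two hypotheses together, the sketch below shows one way the topic score could be computed for a single term. It rests on several assumptions that the slides do not confirm: that the H0 log-likelihood sums per-node terms in the same way as the H1 log-likelihood, that `n_total` is the number of documents in the whole corpus, that each entry of `link_counts` is the number of links of one node, and that clamping probabilities with a small `EPS` is an acceptable numerical guard. All names are hypothetical.

```python
import math

EPS = 1e-12  # numerical guard against log(0); an assumption, not from the slides


def p_c0(n_a, n_total, l_i):
    """Probability under H0 that a node with l_i links has at least one inside GA."""
    q = (n_a - 1) / (n_total - 1)  # chance that a single link falls within GA
    return 1.0 - (1.0 - q) ** l_i


def loglik_h1(connected, p_c1=0.9):
    """H1 log-likelihood from per-node connectivity flags."""
    n_a = len(connected)
    n_c = sum(connected)
    return n_c * math.log(p_c1) + (n_a - n_c) * math.log(1.0 - p_c1)


def loglik_h0(connected, link_counts, n_total):
    """H0 log-likelihood, using a per-node p_c0 (assumed form)."""
    n_a = len(connected)
    total = 0.0
    for is_connected, l_i in zip(connected, link_counts):
        p = min(max(p_c0(n_a, n_total, l_i), EPS), 1.0 - EPS)
        total += math.log(p) if is_connected else math.log(1.0 - p)
    return total


def topic_score(connected, link_counts, n_total, p_c1=0.9):
    """Log-likelihood ratio of H1 against H0 for one term."""
    return loglik_h1(connected, p_c1) - loglik_h0(connected, link_counts, n_total)


# Toy comparison: same corpus size and link counts, different connectivity.
links = [5] * 20
dense = topic_score([True] * 18 + [False] * 2, links, n_total=10_000)
sparse = topic_score([False] * 20, links, n_total=10_000)
```

In this toy setting the densely connected graph receives a positive score and the fully disconnected one a negative score, which is the behavior one would expect from a ranking that places topic terms at the top and stop-phrases at the bottom.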

Page 10

Evaluation

• arXiv
  – A physics literature collection
  – Years 1991-2006, 7 major arXiv areas
  – 214,546 papers, 2,165,170 citation links
  – Abstract as document
  – 137,098 bi-gram terms after low-frequency pruning

• Citeseer
  – A computer science related collection
  – Years 1994-2004
  – 716,771 papers, 1,740,326 citation links
  – Abstract + title as document
  – 631,839 bi-gram terms after low-frequency pruning

Page 11

arXiv (physics) : topic terms at top ranks

rank  Topic (term)            <n, nc, |E|>
1     Black hole              <4978, 4701, 38952>
2     Quantum hall            <1863, 1493, 4862>
3     Black holes             <3131, 2896, 22824>
4     Higgs boson             <2079, 1896, 12607>
5     Renormalization group   <3738, 2920, 8490>
6     Quantum gravity         <2014, 1724, 9693>
7     Standard model          <7848, 7145, 53829>
8     Heavy quark             <1671, 1473, 6570>
9     Cosmological constant   <2141, 1815, 7134>
10    Quantum dot             <1366, 1031, 2926>
11    Chiral perturbation     <1132, 1050, 5578>
12    Form factors            <1578, 1354, 5616>
13    Lattice qcd             <1425, 1265, 5240>
14    String theory           <3818, 3539, 26250>
15    Hubbard model           <1702, 1167, 2678>

n : number of nodes in GA

nc : number of nodes with at least one connection within GA

|E| : number of edges in GA

Page 12

arXiv (Physics): Term citation graphs for intermediate-rank topic terms

[Figure: term citation graphs, with the labels "Research communities" and "time"]

Page 13

arXiv(Physics): terms at bottommost ranks

$$\mathrm{TopicScore}(A) = \ln \frac{\Pr(O(G_A) \mid H_1)}{\Pr(O(G_A) \mid H_0)}$$

Bottom entries are stop-phrases

rank     term
137098   we show
137097   has been
137096   we find
137095   we present
137094   we study
137093   we have
137092   we also
137091   have been
137090   we discuss
137089   we consider
137088   does not
137087   our results
137086   we investigate
137085   into account
137084   we propose

Page 14

Citeseer (CS): top rank terms

Top rank terms from two different time periods: up to 1999, and since 2000

rank  topic (term) up to 1999     topic (term) since 2000
1     logic programs              sensor networks
2     model checking              hoc networks
3     semidefinite programming    logic programs
4     inductive logic             image retrieval
5     petri nets                  support vector
6     genetic programming         congestion control
7     interior point              model checking
8     kolmogorov complexity       decision diagrams
9     automatic differentiation   wireless sensor
10    complementarity problems    ad hoc
11    congestion control          intrusion detection
12    complementarity problem     vector machines
13    conservation laws           mobile ad
14    linear logic                binary decision
15    timed automata              sensor network
16    situation calculus          energy consumption
17    real-time database          content-based image
18    motion planning             semantic web
19    duration calculus           fading channels
20    volume rendering            xml data
21    chain monte                 source separation
22    association rules           timed automata

Page 15

Citeseer: Topic time evolution

[Figure panels: “sensor networks”, “support vector”, “congestion control”, “logic programs”]

Page 16

Citeseer: Topic time evolution

[Figure panels: “petri nets”, “association rules”, “genetic programming”, “semantic web”]

Page 17

Algorithm Extension

• To detect topics represented by a single term
  – Algorithm
  – Evaluation on arXiv, Citeseer

• To detect topics defined by a set of terms
  – Algorithm
  – Evaluation on arXiv

Page 18

Conclusion (poster session : #7)

• Topic detection via the correlation between terms and links

• Our algorithm (in its evaluation on arXiv, Citeseer)

– Effectively discovers topics represented by a single term or by a set of terms

– Identifies stop-phrases as a by-product

– Discovers topics in their natural scale

– Demonstrates its utility in trend analysis

– Shows the association between topic scale and specificity