1 ArnetMiner – Extraction and Mining of Academic Social Networks 1 Jie Tang, 1 Jing Zhang, 1 Limin Yao, 1 Juanzi Li, 2 Li Zhang, and 2 Zhong Su 1 Knowledge Engineering Group, Dept. of Computer Science and Technology Tsinghua University 2 IBM, China Research Lab August 25 th 2008


1

ArnetMiner – Extraction and Mining of Academic Social Networks

¹Jie Tang, ¹Jing Zhang, ¹Limin Yao, ¹Juanzi Li, ²Li Zhang, and ²Zhong Su

¹Knowledge Engineering Group, Dept. of Computer Science and Technology, Tsinghua University

²IBM, China Research Lab

August 25th 2008

2

Motivation

“Academic search is treated as document search, but it ignores semantics”

“The information need is not only about publications…”

3

Examples – Expertise search

Researcher A

• When starting work on a new research topic;

• Or brainstorming for novel ideas.

• Who are experts in this field?

• What are the top conferences in the field?

• What are the best papers?

• What are the top research labs?

4

Examples – Citation network analysis

Researcher B

• How to gain an in-depth understanding of the research field?

[Figure: citation network in which each paper is assigned a topic and each citation edge is labeled with a relationship type]

Papers shown:
• Self-Indexing Inverted Files for Fast Text Retrieval
• Static Index Pruning for Information Retrieval Systems
• Signature files: An access Method for Documents and its Analytical Performance Evaluation
• Filtered Document Retrieval with Frequency-Sorted Indexes
• Vector-space Ranking with Effective Early Termination
• Efficient Document Retrieval in Main Memory
• A Document-centric Approach to Static Index Pruning in Text Retrieval Systems
• An Inverted Index Implementation
• Parameterised Compression for Sparse Bitmaps
• Introduction of Modern Information Retrieval
• Memory Efficient Ranking

Topics:
• Topic 1: Theory
• Topic 21: Framework
• Topic 22: Compression
• Topic 23: Index method
• Topic 27: Information retrieval
• Topic 31: Ranking and Inverted Index
• Topic 34: Parallel computing
• Other

Citation Relationship Type: Basic theory, Comparable work, Other

5

Examples – Conference Suggestion

Researcher C

Which conference should we submit the paper to?

Inputs shown: authors, content

6

Examples – Reviewer Suggestion

KDD Committee

Who are the best-matching reviewers for each paper?

Inputs shown: paper content, conference

7

Topic Browser

8

9

Academic Network Extraction in ArnetMiner

= Researcher Profiling + Name Disambiguation

10

Ruud Bolle
Office: 1S-D58
Letters: IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Packages: IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532 USA
Email: [email protected]

Ruud M. Bolle was born in Voorburg, The Netherlands. He received the Bachelor's Degree in Analog Electronics in 1977 and the Master's Degree in Electrical Engineering in 1980, both from Delft University of Technology, Delft, The Netherlands. In 1983 he received the Master's Degree in Applied Mathematics and in 1984 the Ph.D. in Electrical Engineering from Brown University, Providence, Rhode Island. In 1984 he became a Research Staff Member at the IBM Thomas J. Watson Research Center in the Artificial Intelligence Department of the Computer Science Department. In 1988 he became manager of the newly formed Exploratory Computer Vision Group which is part of the Math Sciences Department.

Currently, his research interests are focused on video database indexing, video processing, visual human-computer interaction and biometrics applications.

Ruud M. Bolle is a Fellow of the IEEE and the AIPR. He is Area Editor of Computer Vision and Image Understanding and Associate Editor of Pattern Recognition. Ruud M. Bolle is a Member of the IBM Academy of Technology.

DBLP: Ruud Bolle

2006

[50] Nalini K. Ratha, Jonathan Connell, Ruud M. Bolle, Sharat Chikkerur: Cancelable Biometrics: A Case Study in Fingerprints. ICPR (4) 2006: 370-373

[49] Sharat Chikkerur, Sharath Pankanti, Alan Jea, Nalini K. Ratha, Ruud M. Bolle: Fingerprint Representation Using Localized Texture Features. ICPR (4) 2006: 521-524

[48] Andrew Senior, Arun Hampapur, Ying-li Tian, Lisa Brown, Sharath Pankanti, Ruud M. Bolle: Appearance models for occlusion handling. Image Vision Comput. 24(11): 1233-1243 (2006)

2005

[47] Ruud M. Bolle, Jonathan H. Connell, Sharath Pankanti, Nalini K. Ratha, Andrew W. Senior: The Relation between the ROC Curve and the CMC. AutoID 2005: 15-20

[46] Sharat Chikkerur, Venu Govindaraju, Sharath Pankanti, Ruud M. Bolle, Nalini K. Ratha: Novel Approaches for Minutiae Verification in Fingerprint Images. WACV. 2005: 111-116

...


Motivating Example

The homepage and DBLP page are annotated with four kinds of information: contact information, educational history, academic services, and publications.

Extracted profile:

Name: Ruud Bolle
Photo
Homepage: http://researchweb.watson.ibm.com/ecvg/people/bolle.html
Position: Research Staff
Affiliation: IBM T.J. Watson Research Center
Address: IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Address: IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532 USA
Email: [email protected]
Phduniv: Brown University
Phdmajor: Electrical Engineering
Phddate: 1984
Msuniv: Delft University of Technology
Msmajor: Electrical Engineering
Msdate: 1980
Msmajor: Applied Mathematics
Bsuniv: Delft University of Technology
Bsmajor: Analog Electronics
Bsdate: 1977
Research_Interest: video database indexing, video processing, visual human-computer interaction, biometrics applications

Publication 1#
Title: Cancelable Biometrics: A Case Study in Fingerprints
Venue: ICPR
Date: 2006
Start_page: 370
End_page: 373

Publication 2#
Title: Fingerprint Representation Using Localized Texture Features
Venue: ICPR
Date: 2006
Start_page: 521
End_page: 524

. . .

[Figure: Ruud Bolle is linked by coauthor edges, through Publication #3 and Publication #5, to two co-authors; one co-author node carries the attributes affiliation: UIUC and position: Professor]

11


Two key issues:
• How to accurately extract the researcher profile information from the Web?
• How to integrate the information from different sources?

12

Researcher Network Extraction

[Figure: schema of the researcher network]

Researcher: Name, Person Photo, Position, Affiliation, Address, Phone, Fax, Email, Homepage, Research_Interest, Phduniv, Phdmajor, Phddate, Msuniv, Msmajor, Msdate, Bsuniv, Bsmajor, Bsdate

Publication: Title, Publication_venue, Start_page, End_page, Date

Relationships: Authored (Researcher to Publication), Coauthor (Researcher to Researcher)

Observations:
• 70.60% of the researchers have at least one homepage/introducing page
• 85.6% are from universities, 14.4% are from companies
• 71.9% are homepages, 28.1% are introducing pages
• 60% are natural language text, 40% are in lists and tables
• 70% moved at least one time
• A large number of person names have the ambiguity problem
• Even 3 “Yi Li” graduated from the author’s lab
• 300 most common male names are used by 1 billion+ people (78.74%) in USA

13

Our Approach Picture – based on Markov Random Field

[Figure: an undirected graph over the random variables $Y_a, Y_b, Y_c, Y_d, Y_e, Y_f$]

Markov Property:

$$P(Y_i \mid Y_j, j \neq i) = P(Y_i \mid Y_j, j \sim i)$$

where $j \sim i$ means that $Y_j$ is a neighbor of $Y_i$ in the graph.

Special cases:
- Conditional Random Fields
- Hidden Markov Random Fields

Applications: Researcher Profiling, Name Disambiguation

[Figure: papers x1 to x11 connected by co-conference, cite, coauthor, and τ-coauthor edges, with hidden labels y1 = y2 = y3 = y8 = 1, y4 = y5 = y6 = y7 = 2, and y9 = y10 = y11 = 3]

14

CRFs

[Figure: a linear-chain CRF over the token sequence “He is a Professor at UIUC …”; each token has candidate labels such as OTH, POS, AFF, and ADR]

- Green nodes are hidden variables
- Purple nodes are observations

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, j} \lambda_j\, t_j(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k\, s_k(v, y|_v, x) \Big)$$
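To make the CRF probability concrete, here is a minimal sketch on the slide's example sentence. It assumes a linear chain, so edges connect adjacent tokens only. The weights in `STATE_W` and `TRANS_W`, the default bias for `OTH`, and the brute-force computation of the normalizer Z(x) are all illustrative assumptions, not the trained ArnetMiner model.

```python
import math
from itertools import product

LABELS = ["OTH", "POS", "AFF"]
TOKENS = ["He", "is", "a", "Professor", "at", "UIUC"]

# Hypothetical hand-set weights (assumptions for illustration only).
STATE_W = {("Professor", "POS"): 2.0, ("UIUC", "AFF"): 2.0}  # vertex features s_k
TRANS_W = {("POS", "OTH"): 0.5, ("OTH", "AFF"): 0.5}         # edge features t_j


def state_score(token, label):
    """Weighted vertex feature; unknown (token, label) pairs fall back to a bias toward OTH."""
    return STATE_W.get((token, label), 1.0 if label == "OTH" else 0.0)


def log_score(labels):
    """Sum of weighted vertex and edge features for one label sequence."""
    total = 0.0
    for i, (token, label) in enumerate(zip(TOKENS, labels)):
        total += state_score(token, label)
        if i > 0:
            total += TRANS_W.get((labels[i - 1], label), 0.0)
    return total


def distribution():
    """p(y | x) for every label sequence, with Z(x) computed by enumeration."""
    sequences = list(product(LABELS, repeat=len(TOKENS)))
    z = sum(math.exp(log_score(s)) for s in sequences)
    return {s: math.exp(log_score(s)) / z for s in sequences}


dist = distribution()
best = max(dist, key=dist.get)
print(best, round(dist[best], 4))
```

Enumerating all label sequences is only feasible for toy inputs; a practical implementation would use the forward algorithm for Z(x) and Viterbi decoding for the best sequence.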

15

Token Definitions

• Standard word: words in natural language
• Special word: several general ‘special words’, e.g. email address, IP address, URL, date, number, money, percentage, unnecessary tokens (e.g. ‘===’ and ‘###’), etc.
• Image token: <IMAGE src="defaul3.jpg" alt=""/>
• Term: base NP, like “Computer Science”
• Punctuation marks: period, question mark, and exclamation mark

16

Feature Definition

• Content features

Standard Word:
• Word features: whether the current token is a word
• Morphological features: whether the word is capitalized

Image Token:
• Image size: the size of the image
• Image height/width ratio: the value of height/width. The value of a person photo is often larger than 1
• Image format: JPG or BMP
• Image color: the number of “unique colors” used in the image and the number of bits used per pixel, i.e. 32, 24, 16, 8, 1
• Face recognition: whether the current image contains a person's face
• Image filename: whether the filename contains (partially) the researcher name
• Image “ALT”: whether the “alt” of the image contains (partially) the researcher name
• Image positive keywords: “myself”, “biology”
• Image negative keywords: “ads”, “banner”, “logo”
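Several of the image-token features listed above can be computed from the image tag alone. The following is a rough sketch under stated assumptions: the function name `image_token_features`, its parameters, and the substring matching are hypothetical choices for illustration, and the features that need the image data itself (size, color, face recognition) are left out.

```python
POSITIVE_KEYWORDS = ("myself", "biology")      # positive keywords from the slide
NEGATIVE_KEYWORDS = ("ads", "banner", "logo")  # negative keywords from the slide


def image_token_features(src, alt, width, height, researcher_name):
    """Binary features for one image token (hypothetical helper, not the authors' code)."""
    name_parts = [part.lower() for part in researcher_name.split()]
    filename = src.rsplit("/", 1)[-1].lower()
    alt_text = alt.lower()
    text = filename + " " + alt_text
    return {
        # Person photos are often taller than wide.
        "ratio_gt_1": width > 0 and height / width > 1,
        "format_jpg_or_bmp": filename.endswith((".jpg", ".bmp")),
        "filename_has_name": any(part in filename for part in name_parts),
        "alt_has_name": any(part in alt_text for part in name_parts),
        # Crude substring matching: "ads" would also match "downloads".
        "positive_keyword": any(k in text for k in POSITIVE_KEYWORDS),
        "negative_keyword": any(k in text for k in NEGATIVE_KEYWORDS),
    }


photo = image_token_features("images/bolle_photo.jpg", "Ruud Bolle", 120, 160, "Ruud Bolle")
banner = image_token_features("img/banner_logo.gif", "", 468, 60, "Ruud Bolle")
print(photo)
print(banner)
```

Each dictionary would then serve as the observation features of one image token in the sequence labeling model.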

17

Profiling Experiments

• Dataset
  – 1,000 researchers from ArnetMiner.org

• Baseline
  – Amilcare
  – Support Vector Machines
  – Unified_NT (CRFs without transition features)

• Evaluation measures
  – Precision, Recall, F1
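For reference, the three evaluation measures can be computed as below. This is a generic sketch: representing a profile as a set of (field, value) pairs and requiring exact matches are assumptions, since the slides do not specify the matching criterion, and the predicted values are made up for the example.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over sets of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: (field, value) pairs for one researcher profile.
gold = {
    ("Position", "Research Staff"),
    ("Affiliation", "IBM T.J. Watson Research Center"),
    ("Phduniv", "Brown University"),
    ("Phddate", "1984"),
}
predicted = {
    ("Position", "Research Staff"),
    ("Phduniv", "Brown University"),
    ("Phddate", "1983"),  # a deliberately wrong extraction
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```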

18

Profiling Results (5-fold cross validation)

Profiling Task   Unified   Unified_NT   SVM     Amilcare
Photo            89.11     88.64        88.86   31.62
Position         69.44     64.70        64.68   56.48
Affiliation      83.52     72.16        73.86   46.65
Phone            91.10     78.72        79.71   83.33
Fax              90.83     64.28        64.17   86.88
Email            80.35     75.47        79.37   78.70
Address          86.34     75.15        77.04   66.24
Bsuniv           67.38     57.56        59.54   47.17
Bsmajor          64.20     59.18        60.75   58.67
Bsdate           53.49     40.59        28.49   52.34
Msuniv           57.55     47.49        49.78   45.00
Msmajor          63.35     61.92        62.10   57.14
Msdate           48.96     41.27        30.07   56.00
Phduniv          63.73     53.11        57.01   59.42
Phdmajor         67.92     59.30        59.67   57.93
Phddate          57.75     42.49        41.44   61.19
Overall          83.37     72.09        73.57   62.30

19

20

Name Disambiguation

Name: Jing Zhang (26)

Affiliation:
• Shanghai Jiao Tong Univ.
• Yunnan Univ.
• Tsinghua Univ.
• Alabama Univ.
• Univ. of California, Davis
• Carnegie Mellon University
• Henan Institute of Education

Proposal of a semi-supervised framework

21

Our Method to Name Disambiguation

• A hidden Markov Random Field model

• Observable Variables X represent publications

• Hidden Variables Y represent the labels of publications

• Paper relationships define the dependencies over hidden variables

[Figure: papers x1 to x11 connected by co-conference, cite, coauthor, and τ-coauthor edges, with hidden labels y1 = y2 = y3 = y8 = 1, y4 = y5 = y6 = y7 = 2, and y9 = y10 = y11 = 3]

22

Objective Function

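$$P(Y \mid X) \propto P(Y)\, P(X \mid Y)$$

$$P(X \mid Y) = \frac{1}{Z_2} \exp\Big( -\sum_{x_i \in X} D(x_i, y_i) \Big)$$

$$P(Y) = \frac{1}{Z_1} \exp\big( -V(Y) \big) = \frac{1}{Z_1} \exp\Big( -\sum_{N_i \in N} V_{N_i}(Y) \Big) = \frac{1}{Z_1} \exp\Big( -\sum_{i,j} V(i, j) \Big)$$

$$P(Y) = \frac{1}{Z_1} \exp\Big( -\sum_{i,j} \Big\{ D(x_i, x_j)\, I(y_i \neq y_j) \sum_{c_k \in C} [w_k\, c_k(y_i, y_j)] \Big\} \Big)$$

Maximizing $P(Y \mid X)$ is equivalent to minimizing

$$f_{obj} = \sum_{i,j} \Big\{ D(x_i, x_j)\, I(y_i \neq y_j) \sum_{c_k \in C} [w_k\, c_k(y_i, y_j)] \Big\} + \sum_{x_i \in X} D(x_i, y_i) + \log Z, \qquad Z = Z_1 Z_2$$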

23

Relationship Definition

Papers and their authors:
p1: A, B, C
p2: A, B
p3: A, D
p4: C, D

Mp(0):
     p1  p2  p3
p1   1   0   0
p2   0   1   0
p3   0   0   1

Mp(1):
     p1  p2  p3
p1   1   1   0
p2   1   1   0
p3   0   0   1

Mp(2):
     p1  p2  p3
p1   1   1   1
p2   1   1   0
p3   1   0   1

Mp(3):
     p1  p2  p3
p1   1   1   1
p2   1   1   1
p3   1   1   1

C    W    Relationship    Description
c1   w1   Co-Conference   pi.pubvenue = pj.pubvenue
c2   w2   CoAuthor        ∃ r, s > 0, ai(r) = aj(s)
c3   w3   Citation        pi cites pj or pj cites pi
c4   w4   Constraints     Feedback supplied by users
c5   w5   τ-CoAuthor      one common author in τ extension
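The τ-CoAuthor matrices Mp(0) to Mp(3) above can be reproduced with a short script. The slides do not spell out the algorithm, so the following is one possible reading, stated as an assumption: A is the ambiguous author name and is excluded, the remaining authors form a coauthor graph, and two papers are τ-related when some pair of their remaining authors is at most τ-1 coauthor hops apart. The function name `tau_coauthor_matrices` is hypothetical.

```python
from itertools import combinations

INF = float("inf")


def tau_coauthor_matrices(papers, target, focus, max_tau):
    """Return [Mp(0), ..., Mp(max_tau)] for the papers listed in `focus`."""
    others = {p: authors - {target} for p, authors in papers.items()}
    names = sorted(set().union(*others.values()))
    # Coauthor hop distances between the remaining authors (Floyd-Warshall).
    dist = {a: {b: (0 if a == b else INF) for b in names} for a in names}
    for authors in others.values():
        for a, b in combinations(sorted(authors), 2):
            dist[a][b] = dist[b][a] = 1
    for k in names:
        for a in names:
            for b in names:
                if dist[a][k] + dist[k][b] < dist[a][b]:
                    dist[a][b] = dist[a][k] + dist[k][b]

    def related(pi, pj, tau):
        if pi == pj:
            return 1
        if tau == 0:
            return 0
        closest = min((dist[a][b] for a in others[pi] for b in others[pj]), default=INF)
        return int(closest <= tau - 1)

    return [[[related(pi, pj, tau) for pj in focus] for pi in focus]
            for tau in range(max_tau + 1)]


papers = {"p1": {"A", "B", "C"}, "p2": {"A", "B"}, "p3": {"A", "D"}, "p4": {"C", "D"}}
matrices = tau_coauthor_matrices(papers, target="A", focus=["p1", "p2", "p3"], max_tau=3)
for tau, matrix in enumerate(matrices):
    print(f"Mp({tau}) =", matrix)
```

For example, p1 and p3 become related at τ = 2 because C (an author of p1) and D (an author of p3) are coauthors on p4.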

24

Parameterized Distance Function

We define the distance function as follows (Basu, 04):

$$D(x_i, x_j) = 1 - \frac{x_i^T A\, x_j}{\|x_i\|_A\, \|x_j\|_A}$$

where

$$\|x_i\|_A = \sqrt{x_i^T A\, x_i}$$

We can see that $\|x_i\|_A$ actually maps each vector $x_i$ into another space, i.e. $A^{1/2} x_i$.

To simplify the problem, we define A as a diagonal matrix.
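As a small illustration of the parameterized distance, the sketch below implements D(xi, xj) = 1 - xi^T A xj / (||xi||_A ||xj||_A) for a diagonal A stored as a list of weights. The vectors and weights are made-up toy values, and the function names are assumptions for this example.

```python
import math


def a_norm(x, a):
    """||x||_A = sqrt(x^T A x) for a diagonal A given by its diagonal entries `a`."""
    return math.sqrt(sum(w * v * v for w, v in zip(a, x)))


def distance(xi, xj, a):
    """Parameterized cosine distance between two feature vectors."""
    weighted_dot = sum(w * u * v for w, u, v in zip(a, xi, xj))
    return 1.0 - weighted_dot / (a_norm(xi, a) * a_norm(xj, a))


xi, xj = [1.0, 1.0], [1.0, 0.0]
uniform = distance(xi, xj, [1.0, 1.0])   # plain cosine distance
weighted = distance(xi, xj, [4.0, 1.0])  # more weight on the shared first dimension
print(round(uniform, 4), round(weighted, 4))
```

Raising the weight of a dimension that both vectors share pulls them closer together, which is how learning A can adapt the distance to the constraints.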

25

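EM Framework

• Initialization
  • Use constraints to generate the initial k clusters

• E-Step: assign each paper $x_i$ to the cluster $h$ that minimizes

$$f_{obj}(h, x_i) = \sum_{j} \Big\{ D(x_i, x_j)\, I(h \neq l_j) \sum_{c_k \in C} [w_k\, c_k(p_i, p_j)] \Big\} + D(x_i, y_h)$$

• M-Step
  • Update cluster centroid

$$y_h = \frac{\sum_{i:\, l_i = h} x_i}{\big\| \sum_{i:\, l_i = h} x_i \big\|_A}$$

  • Update parameter matrix A

$$\frac{\partial f_{obj}}{\partial a_m} = \sum_{i,j} \Big\{ \frac{\partial D(x_i, x_j)}{\partial a_m}\, I(l_i \neq l_j) \sum_{c_k \in C} [w_k\, c_k(p_i, p_j)] \Big\} + \sum_{x_i \in X} \frac{\partial D(x_i, y_{l_i})}{\partial a_m}$$

$$\frac{\partial D(x_i, x_j)}{\partial a_m} = \frac{ x_i^T A\, x_j\, \dfrac{x_{im}^2 \|x_j\|_A^2 + x_{jm}^2 \|x_i\|_A^2}{2\, \|x_i\|_A \|x_j\|_A} - x_{im} x_{jm}\, \|x_i\|_A \|x_j\|_A }{ \|x_i\|_A^2\, \|x_j\|_A^2 }$$

Here $l_i$ denotes the cluster label of paper $p_i$ (with feature vector $x_i$), $y_h$ the centroid of cluster $h$, and $a_m$ the m-th diagonal entry of A.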

26

Disambiguation Experiments

• Data set:

Abbr. Name       #Publications   #Actual Person
Cheng Chang      12              3
Wen Gao          286             4
Yi Li            42              21
Jie Tang         21              2
Bin Yu           66              12
Rakesh Kumar     61              5
Bing Liu         130             11
Gang Wu          40              16
Jing Zhang       54              25
Kuo Zhang        6               2
Hui Fang         15              3
Lei Wang         109             40
Michael Wagner   44              12
Jim Smith        33              5

27

Our Approach vs. Baseline

Data set: Real Name

                 Baseline (Tan, 2006)     Our Approach             Our Approach (w/o relation)
Person Name      Prec.   Rec.    F1       Prec.   Rec.    F1       Prec.   Rec.    F1
Cheng Chang      100.0   100.0   100.0    100.0   100.0   100.0    72.73   64.00   68.09
Wen Gao          96.60   62.64   76.00    99.29   98.59   98.94    96.17   33.53   49.72
Yi Li            86.64   95.12   90.68    70.91   97.50   82.11    20.97   31.71   25.25
Jie Tang         100.0   100.0   100.0    100.0   100.0   100.0    88.68   54.65   67.63
Gang Wu          97.54   97.54   97.54    71.86   98.36   83.05    57.69   36.89   45.00
Jing Zhang       85.0    69.86   76.69    83.91   100.0   91.25    10.79   20.55   14.15
Kuo Zhang        100.0   100.0   100.0    100.0   100.0   100.0    66.67   40.00   50.00
Hui Fang         100.0   100.0   100.0    100.0   100.0   100.0    64.71   70.97   67.69
Bin Yu           67.22   50.25   57.51    86.53   53.00   65.74    26.50   23.25   24.77
Lei Wang         68.45   41.12   51.38    88.64   89.06   88.85    24.13   25.16   24.63
Rakesh Kumar     63.36   92.41   75.18    99.14   96.91   98.01    67.11   43.04   52.45
Michael Wagner   18.35   60.26   28.13    85.19   76.16   80.42    36.89   50.33   42.57
Bing Liu         84.88   43.16   57.22    88.25   86.49   87.36    66.14   19.51   30.13
Jim Smith        92.43   86.80   89.53    95.81   93.56   94.67    60.94   59.39   60.16
Avg.             82.89   78.51   80.64    90.68   92.12   91.39    54.29   40.93   46.67

29

Contribution of Relationships

[Figure: bar chart of Pre., Rec., and F1 (scale 0.00 to 100.00) for five settings: w/o Relationship, +CoConference, +Citation, +CoAuthor, and All]

30

Distribution Analysis

(1) All methods can achieve good performance

(2) Our method can achieve good performance

(3) Our method obtains acceptable results but still needs further improvement

31

Modeling the Academic Network and Applications

32

The Academic Network

[Figure: an example academic network. Persons (Dr. Tang, Limin, Prof. Wang, Prof. Li) are linked to papers (“SVM...”, “Association...”, “Tree CRF...”, “Semantic...”, “EOS...”, “Annotation...”) by write edges, papers to papers by cite edges, venues (IJCAI, ISWC, WWW) to papers by publish edges, persons to persons by coauthor edges, and persons to venues by PC member edges]

Academic Network

Heterogeneous objects: Paper, Person, Conf./Journal

Relationships:
• Conf./Journal publishes paper
• Paper cites paper
• Person writes paper
• Person is PC member of Conf./Journal
• Person is coauthor of person

Challenges:
- How to model the heterogeneous objects in a unified approach?
- How to apply the modeling approach to different applications?

33

Modeling the Academic Network

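[Figure: plate diagrams of the three models ACT1, ACT2, and ACT3, which relate authors, topics, words, and conferences. The diagrams use the variables w, z, x, a_d, and c, the plates T, D, N_d, A (AC in ACT2), and the parameters α, β, θ, Φ; ACT1 additionally uses μ and ψ, and ACT3 additionally uses η and σ²]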

34

Generative Story of ACT1 Model

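• Generative process

[Figure: generative story for the example paper “Latent Dirichlet Co-clustering” by Shafiei and Milios. The authors are associated with topics such as NLP, ML, DM, and IR; each topic z has a conference distribution P(c|z) and a word distribution P(w|z), from which words (e.g. “clustering”, “inference”) and conferences (e.g. ICDM, NIPS) are generated for the paper]

Abstract of the example paper: “We present a generative model for clustering documents and terms. Our model is a four hierarchical bayesian model. We present efficient inference techniques based on Markov Chain Monte Carlo. We report results in document modeling, document and terms clustering …”

Example topic distributions:
• One topic: P(c|z): ICDM 0.23, KDD 0.19, …; P(w|z): mining 0.23, clustering 0.19, classification 0.17, …
• Another topic: P(c|z): ICML 0.23, NIPS 0.19, …; P(w|z): model 0.23, learning 0.19, boost 0.17, …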

35

ACT Model 1

Generative process:

[Figure: plate diagram of ACT1 over authors, topics, words, and conferences, with variables w, z, x, a_d, c, plates T, D, N_d, A, and parameters α, β, θ, Φ, μ, ψ]

36

ACT Model 2

Generative process:

[Figure: plate diagram of ACT2 over conferences, authors, and words, with variables w, z, x, a_d, c, plates T, D, N_d, AC, and parameters α, β, θ, Φ]

37

ACT Model 3

Generative process:

[Figure: plate diagram of ACT3 over conferences, authors, and words, with variables w, z, x, a_d, c, plates T, D, N_d, A, and parameters α, β, θ, Φ, η, σ²]

38

Applications

[Figure: the ACT1 plate diagram surrounded by its applications]

• Hot topic on a conference
• Researcher interests
• Expertise search
• Association search
• Topic browser

39

Expertise Search

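• Calculate the relevance of query q and different objects (i.e., papers, authors, and conferences)

• E.g., for a paper d:

$$P(q \mid d) = \prod_{w \in q} P(w \mid d)$$

$$P(w \mid d) = P_{LM}(w \mid d) \times P_{ACT}(w \mid d)$$

$$P_{LM}(w \mid d) = \frac{N_d}{N_d + \lambda} \cdot \frac{tf(w, d)}{N_d} + \Big( 1 - \frac{N_d}{N_d + \lambda} \Big) \cdot \frac{tf(w, \mathbf{D})}{N_{\mathbf{D}}}$$

$$P_{ACT}(w \mid d) = \sum_{z=1}^{T} \sum_{x=1}^{A_d} P(w \mid z, \phi_z)\, P(z \mid x, \theta_x)\, P(x \mid d)$$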
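The scoring scheme can be tried on toy data. This sketch assumes that the language-model score is a smoothed mixture of the document and collection term frequencies with a smoothing parameter `lam`, that the two scores are combined by multiplication, and that P(x|d) is uniform over a paper's authors. The documents, authors, and topic probabilities below are made up for illustration (the word probabilities echo the example topics earlier in the deck).

```python
import math
from collections import Counter


def p_lm(word, doc, collection, lam):
    """Smoothed language-model probability of `word` given document `doc`."""
    weight = len(doc) / (len(doc) + lam)
    doc_part = Counter(doc)[word] / len(doc)
    collection_part = Counter(collection)[word] / len(collection)
    return weight * doc_part + (1 - weight) * collection_part


def p_act(word, doc_authors, p_word_given_topic, p_topic_given_author):
    """Topic-model probability: sum over topics z and authors x of the paper."""
    p_author = 1 / len(doc_authors)  # assumption: uniform P(x | d)
    return sum(
        p_word_given_topic[z].get(word, 0.0) * p_topic_given_author[x][z] * p_author
        for x in doc_authors
        for z in range(len(p_word_given_topic))
    )


def score(query, doc_id):
    """P(q | d): product over query words of P_LM(w | d) * P_ACT(w | d)."""
    return math.prod(
        p_lm(w, DOCS[doc_id], COLLECTION, LAM)
        * p_act(w, AUTHORS[doc_id], P_WORD_GIVEN_TOPIC, P_TOPIC_GIVEN_AUTHOR)
        for w in query
    )


# Toy data (all values are illustrative assumptions).
DOCS = {"d1": ["data", "mining", "clustering", "data"],
        "d2": ["learning", "model", "boost", "model"]}
AUTHORS = {"d1": ["x1"], "d2": ["x2"]}
COLLECTION = [w for doc in DOCS.values() for w in doc]
LAM = 4  # smoothing parameter, here set to the average document length
P_WORD_GIVEN_TOPIC = [{"mining": 0.23, "clustering": 0.19, "classification": 0.17},
                      {"model": 0.23, "learning": 0.19, "boost": 0.17}]
P_TOPIC_GIVEN_AUTHOR = {"x1": [0.9, 0.1], "x2": [0.2, 0.8]}

score_d1 = score(["mining"], "d1")
score_d2 = score(["mining"], "d2")
print(score_d1, score_d2)
```

Because the topic-model term is non-zero even when the query word is absent from the paper, d2 still receives a small score for “mining”, while d1, which both contains the word and has an author focused on that topic, ranks higher.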

40

Expertise Search Results

ArnetMiner data:
• 14,134 authors
• 10,716 papers
• 1,434 confs/journals

Evaluation measures: pooled relevance + human judgement

Baselines:
- Language Model (LM)
- LDA
- Author Topic (AT)

42

ArnetMiner Today

43

ArnetMiner Today

• ArnetMiner data:
  – > 0.5M researcher profiles
  – > 2M papers
  – > 8M citation relationships
  – > 4K conferences
• Visits come from more than 165 countries
• Visits increase continuously by 20% per month
• Currently, more than 1,500 unique-IP visits per day

Top 10 countries: 1. USA, 2. China, 3. Germany, 4. India, 5. UK, 6. Canada, 7. Japan, 8. France, 9. Taiwan, 10. Italy

44

Person Search

Basic Info.

Research Interests

Publications

Social Network

45

Expertise Search

Finding experts, expertise conferences, and expertise papers for “data mining”

47

Association Search

Finding associations between persons
- high efficiency
- Top-K associations

Usage:
- to find a partner
- to find a person with the same interests

48

Survey Paper Finding

Survey papers

49

Topic Browser

200 topics have been discovered automatically from the academic network

50

Acknowledgements

• National Science Foundation of China (NSFC)

• National 985 Funding

• Chinese Young Faculty Research Funding

• Minnesota-China Collaboration Project

• IBM CRL

• Tsinghua-Google Joint Research Project

• National Foundation Science Research (973)

51

Thanks!

Q&A & Demo

HP: http://keg.cs.tsinghua.edu.cn/persons/tj/

Online URL: http://arnetminer.org

If you want to know more technical details, please come to our poster session tomorrow night.