Mining Academic Community

Jan-Ming Ho

hoho@iis.sinica.edu.tw

Computer System and Communication Lab

Institute of Information Science

Academia Sinica

2

What is Community?

In graph theory: densely connected groups of vertices, with sparser connections between groups.

In social network analysis: groups of entities that share similar properties or connect to each other via certain relations.

A social network is a structure made up of nodes, representing entities from different conceptual groups, that are linked by different types of relations.

3

Why is Community Important?

Interesting data with community structure: researcher collaboration, friendship networks, the WWW, massively multiplayer online gaming, electronic communications.

Groups of web pages that link to more pages inside the community than outside it correspond to web pages on related topics.

Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.

4

Motivation

Understand the research network between authors, conferences and topics (rank entities by relevance for given entities)

Find and justifiably recommend research collaborators for given authors

Explore the academic social network: find the most important papers, researchers, and venues for a given topic

5

Related Systems

Many digital library systems exist: ACM Digital Library, IEEE Xplore, DBLP, CiteSeer, Libra, DBconnect.

Problems:
• The coverage of the datasets is not large enough
• The name ambiguity problem exists in both web pages and citation records

6

Libra Academic Search

http://libra.msra.cn

• Free computer science bibliography search engine
• A test-bed for object-level vertical search research
• Currently the following types of paper-related objects can be searched: Papers, Authors, Conferences, Journals, Research Communities

7

8

9

DBconnect: Conference

10

DBconnect: Topic

11

DBconnect: Author

12

ZoomInfo

(1) People Directory

(2) Developer Tools

(3) Social Network, Profile Statistics, Employment History

(4) Ability to distinguish ambiguous names? E.g., a search returns 21 different people called “Bing Liu”

13

ArnetMiner

14

Our goal

Develop an automatic system to:
• Explore the academic social network
• Find the most important papers, researchers, and venues for a given topic

Provide solutions for existing problems:
• Collecting larger citation datasets
– Retrieving data from web pages: publication list finder, extracting citation strings from web pages, citation parser
– Multilingual data sources: Chinese and English corpora
• Name disambiguation mechanism for web pages and citation records

15

Our contributions

Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee, and Jan-Ming Ho, "Web Appearance Disambiguation of Personal Names Based on Network Motif," in the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), Hong Kong, Dec. 18-22, 2006

Kai-Hsiang Yang, Jen-Ming Chung and Jan-Ming Ho, "PLF: A Publication List Web Page Finder for Researchers," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007

Kai-Hsiang Yang, Wei-Da Chen, Hahn-Ming Lee and Jan-Ming Ho, "Mining Translations of Chinese Name from Web Corpora by Using Query Expansion Technique and Support Vector Machine," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007

Chia-Ching Chou, Kai-Hsiang Yang and Hahn-Ming Lee, "AEFS: Authoritative Expert Finding System Based on a Language Model and Social Network Analysis," in Proceedings of the 12th Conference on Artificial Intelligence and Applications (TAAI2007), Nov 16-17, 2007

Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment Techniques," will appear in Proceedings of the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA-08)

16

PLF: A Publication List Web Page Finder for Researchers

17

Agenda

• Introduction
• Publication List Web Page Finder (PLF)
• Performance Evaluation
• Conclusion and Future Work

18

Overview of a Publication List Web Page

A publication list web page:
• Helps readers keep abreast of state-of-the-art research
• Contains citations not found elsewhere
• May provide reference materials, such as slides and talks

Challenges:
• How to find the publication list web pages given only a name
• Various versions or multiple copies: an author may have many affiliations
• Name ambiguity problem: e.g., for Dr. Bing Liu, a query to ZoomInfo (a people search engine) shows that 26 people share the same name

19

Problem

“Publication List Web Page?”

20

Definition of Publication List

[Affiliation] Institute of Information Science, Academia Sinica

[Citation string]

Affiliated Personal Publication List Web Page (APPL): a web page that belongs to the affiliated web site of a specific person with the given name.

21

Agenda

• Introduction
• Publication List Web Page Finder (PLF)
• Performance Evaluation
• Conclusion and Future Work

22

Process Flow

[Figure: PLF process flow, starting from the given names.]

1. Citation String Search: queries digital libraries and collects the citation strings.

2. Web Page Crawler: interacts with search engines and collects the hyperlinks of web pages, using the collected citation strings as queries.

3. Rank Function: parses the citation statistics and analyses the statistics of all the collected hyperlinks of web pages.

23

Basic Concept

A publication list web page may contain many citation strings.

[Figure: each paper title of Jan-Ming Ho (PT1, PT2, ..., PTm) is sent to a search engine as a query (QPT1, QPT2, ...); each query returns web pages WP1, WP2, ..., WPn, among which a publication list web page is marked.]
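The basic concept suggests a simple ranking idea: a page returned for many different citation queries is more likely to be a publication list. The slides do not specify the actual rank function, so the following is only a minimal sketch under that assumption; the function name `rank_candidate_pages`, the input layout, and the example URLs are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def rank_candidate_pages(results_per_query: Dict[str, List[str]],
                         top_k: int = 5) -> List[str]:
    """Rank URLs by how many distinct citation queries returned them.

    Assumption: the hit count stands in for the (unspecified) rank function.
    """
    hits = Counter()
    for urls in results_per_query.values():
        hits.update(set(urls))  # count each URL at most once per query
    # Sort by descending hit count, then by URL to keep the order deterministic.
    ranked = sorted(hits, key=lambda url: (-hits[url], url))
    return ranked[:top_k]


# Hypothetical search results for three paper-title queries.
example = {
    "paper title 1": ["http://a.example/pubs", "http://b.example/p1"],
    "paper title 2": ["http://a.example/pubs", "http://c.example/p2"],
    "paper title 3": ["http://a.example/pubs", "http://b.example/p1"],
}
print(rank_candidate_pages(example))
```

Here the page returned by all three queries is ranked first, which matches the intuition that a publication list contains many of the author's citation strings.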

24

Agenda

• Introduction
• Publication List Web Page Finder (PLF)
• Performance Evaluation
• Conclusion and Future Work

25

Dataset

Scenario:
• Seminar members have usually published major research works
• We randomly collected 200 names from the WWW ’06 Conference Committee website

APPL type      #APPL   #people   %population
others         0       22        11%
single-group   1       120       60%
multi-group    2       35        17.5%
               3       16        8%
               4       7         3.5%

26

Experiment Evaluation

Evaluation metrics: we consider the top-5 results derived by each link and focus on the top-5 recall metric, which is calculated by:

$$\mathrm{recall} = \frac{R}{R_a}$$

Notation and definition:
• $R_a$: the number of publication list web pages belonging to researchers listed in the dataset
• $R$: the number of publication list web pages contained in the top-5 results
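The top-5 recall can be computed directly from the two counts defined above. The sketch below assumes a hypothetical data layout (one ranked result list and one set of true publication list pages per researcher name); the slides do not describe how the results were stored.

```python
from typing import Dict, List, Set


def top5_recall(top_results: Dict[str, List[str]],
                true_pages: Dict[str, Set[str]]) -> float:
    """recall = R / R_a over all researchers in the dataset."""
    # R_a: publication list pages belonging to the researchers in the dataset.
    r_a = sum(len(pages) for pages in true_pages.values())
    # R: how many of those pages show up in each researcher's top-5 results.
    r = sum(len(set(top_results.get(name, [])[:5]) & pages)
            for name, pages in true_pages.items())
    return r / r_a if r_a else 0.0


# Hypothetical example: 3 true pages in total, 2 of them found in the top 5.
truth = {"researcher A": {"u1"}, "researcher B": {"u2", "u3"}}
found = {"researcher A": ["u1", "x"], "researcher B": ["y", "u3"]}
print(top5_recall(found, truth))
```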

27

Parameter Analysis for Single-Group

[Figure: (a) fixed n with varying m; (b) fixed m with varying n. Axis label: (m, n).]

Figure (a)

• When m increases, the recall rate also increases.

Figure (b)

• System performance may be constrained by m.

28

Parameter Analysis for Multi-Group

[Figure: (a) fixed n with varying m; (b) fixed m with varying n.]

Figure (a)

• It is clear that the performance when m = 40 is always better than the other settings.

Figure (b)

• The best performance (top-5 recall is 70%) occurs when n = 75.

29

Performance Evaluations

[Figure: (a) performance of the approaches on the single-group dataset; (b) performance of the approaches on the multi-group dataset. One of the plotted approaches is labelled (given name + keyword).]

1. The parameter m has a strong influence on the system’s performance; for example, an oversized m may degrade the performance.

2. The parameter n has little influence on the system’s performance.

3. The PLF system outperforms the other two approaches on both the single-group and the multi-group datasets.

30

Conclusion

We have defined the problem of finding the publication list web pages of a researcher and proposed the “PLF” system.

Ongoing work:
• The name ambiguity problem
• How to merge the multiple publication list web pages of a specific person into a single page

31

Discussion – Name Ambiguity Problem

Scenario: we take the name “Bing Liu” as an example and analyze the results manually.

Observations:
• Citation count
• Name translation problem
• Partial matching problem

32

Extracting Citation Strings from Web Pages

33

Extract Citation Records

Structured Data

Extract

Web Page

34

Challenges

• The formats of publication list web pages vary
• There are no fixed syntactic rules for parsing citation records
• Hence, we cannot apply simple rules to extract citation records automatically

35

Challenges: Complex Layouts of Publication List Pages

36

Ideas

The semantic structure of web pages is organized by visual arrangement.

We can use the semi-structured (visual) information of web pages to help the extraction task.

With its hierarchical structure and geometric information, the DOM tree is not only a good structure for representing web pages but also very helpful for visual pattern analysis.

37

DOM Tree Presentation of Web page

[Figure: DOM tree of a web page with the regions Banner, Navigator Bar, and Publication List; the Publication List region contains the citation strings.]

38

Architecture of Citation Extraction System

[Figure: architecture of the Citation Extraction System. Labelled components: Publication List Web Page Finder, Publication List Pages, Parsing, DOM Tree, Common Style Finder (Mining Common Style Patterns), Citation Extractor (Extracting Candidate Records, Estimating Citation Length, Ranking Records), Normal Citation Model, CiteSeer, Candidate Citation Records, Citation Records. The insets show DOM subtrees with tags such as p, li, a, em, and div over text nodes T1 to T6.]

39

Modules of Citation Extraction System

Common Style Finder
• Finds all common style patterns for each level of granularity in web pages

Citation Extractor
• Explores data regions with common style patterns
• Distills extraction rules from those data regions
• Ranks extraction patterns based on a normal word count distribution probability

40

BibPro: A Citation Parser based on Sequence Alignment Techniques

41

System Goal

Citation: Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory. 2(3) 113--124.

Our System

Metadata:
Author: Chomsky, Noam
Title: Three models for the description of language
Journal: IRE Transactions on Information Theory
Volume: 2
Issue: 3
Page: 113-124
Month:
Year: 1956

42

Basic Idea(1/2)

Encode the citation as a protein sequence, keeping only the citation style information:
• Order of fields
• Field separators

[Figure: the fields Author, Title, Journal, year, page, ... are mapped to a protein sequence such as A D T D L D Y R P H S.]

43

Basic Idea(2/2)

Determine the citation style from the order of punctuation marks and reserved words.

[Figure: during system preprocessing, pairs of (feature index, citation style) are stored; during online parsing, the feature index of a citation string is matched against them with the BLAST search tool to find its citation style.]

44

How to encode citation to protein sequence?

Keep the citation style information:
• Which fields should be included? (Only 23 symbols are available.)
• Which punctuation marks are used to separate fields?

By observing different citation styles, we define an encode table that translates each token of a citation to an amino acid symbol.

45

Encode Table

A: Author
T: Title
L: Journal
F: Volume value
W: Issue value
H: Page value
M: Month
Y: Year
X: noise (unrecognized token)
S: Issue key, e.g. “no”, “No”
P: Page key, e.g. “pp”, “page”
V: Volume key, e.g. “Vol”, “vo”
N: numeral
Q: @ # $ % ^ & * + = \ | ~ _ / ! ? 。
I: ( [ { < 「
K: ) ] } > 」
D: .
G: " “ ”
R: ,
C: - :
E: ' `
Z: ;
B: blank
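To make the encode table concrete, here is a minimal sketch that maps the surface tokens of a citation fragment (punctuation, numerals, blanks, and the reserved key words) to their symbols. It is not BibPro's implementation: the tokenizing regular expression and the name `encode_surface` are assumptions, and the field symbols (A, T, L, F, W, H, M, Y) are left out because assigning them requires knowing which tokens belong to which field. Every other word is emitted as X.

```python
import re

# Punctuation symbols copied from the encode table.
PUNCTUATION = {}
for symbol, chars in {
    "Q": "@#$%^&*+=\\|~_/!?。",
    "I": "([{<「",
    "K": ")]}>」",
    "D": ".",
    "G": "\"“”",
    "R": ",",
    "C": "-:",
    "E": "'`",
    "Z": ";",
}.items():
    for ch in chars:
        PUNCTUATION[ch] = symbol

# Reserved words from the encode table (matched case-insensitively here).
KEYWORDS = {"no": "S", "pp": "P", "page": "P", "vol": "V", "vo": "V"}


def encode_surface(text: str) -> str:
    """Encode a citation fragment as a string of encode-table symbols."""
    symbols = []
    # Assumed tokenization: words, digit runs, whitespace runs, single characters.
    for token in re.findall(r"[A-Za-z]+|\d+|\s+|.", text):
        if token.isspace():
            symbols.append("B")          # blank
        elif token.isdigit():
            symbols.append("N")          # numeral
        elif token.lower() in KEYWORDS:
            symbols.append(KEYWORDS[token.lower()])
        elif token in PUNCTUATION:
            symbols.append(PUNCTUATION[token])
        else:
            symbols.append("X")          # unrecognized token
    return "".join(symbols)


print(encode_surface("Vol. 2, no. 3"))   # VDBNRBSDBN
print(encode_surface("pp. 113--124"))    # PDBNCCN
```

The resulting strings use only letters from the table, which is what allows a protein alignment tool to compare citation styles.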

46

How to use the protein sequence to extract metadata?

Transform the extraction problem into a sequence alignment problem.

Form translation:
• Unknown answer: BASE FORM, ALIGN FORM, INDEX FORM
• Known answer: RESULT FORM, STYLE FORM, INDEX FORM

47

RESULT FORM (Known Answer)

48

BASE FORM (Unknown Answer)

49

System Structure

System PreProcess (Template Generating System)
• Citation Crawler
• Template Builder

Online Parsing (Parsing System)
• Template Matching
• Metadata Extraction

[Figure: (1) System PreProcess builds the Template Database from resources on the Internet; (2) Online Parsing uses the Template Database to turn a query citation into metadata.]

50

Citation Crawler

[Figure: Citation Crawler. Labelled components: IEEE engine, Google engine, CiteSeer engine, ACM engine, BibTeX records, BibTeX parser, and the resulting (citation, metadata) pairs.]

51

BLAST-powered Template Matching

[Figure: BLAST-powered template matching. The query citation is encoded with the Encode Table and translated into an INDEX FORM; BLAST searches the template database of (INDEX FORM, STYLE FORM) pairs and returns matching STYLE FORMs.]

52

Evaluation for CiteSeer DataSet

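Consider the inconsistency between the citation string and the BibTeX file (metadata).

Old measurement:

$$\mathrm{FieldPrecision}_{\mathrm{old}} = \frac{\#[\mathrm{Token}_{\text{parsed field}} \cap \mathrm{Token}_{\text{BibTex field}}]}{\#[\mathrm{Token}_{\text{parsed field}} \cap \mathrm{Token}_{\text{BibTex}}]}$$

New measurement:

$$\mathrm{FieldPrecision}_{\mathrm{new}} = \frac{\#[\mathrm{Token}_{\text{parsed field}} \cap \mathrm{Token}_{\text{BibTex field}}]}{\#[\mathrm{Token}_{\text{query citation}} \cap \mathrm{Token}_{\text{BibTex}}]}$$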

53

Definition

• $\mathrm{Token}_{\text{parsed field}}$: tokens that appear in the parsed subfield
• $\mathrm{Token}_{\text{query citation}}$: tokens that appear in the query citation string
• $\mathrm{Token}_{\text{BibTex field}}$: tokens that appear in the specific subfield in the BibTeX file
• $\mathrm{Token}_{\text{BibTex}}$: all tokens that appear in the BibTeX file

These tokens do not include punctuation.

54

Compare with ParaCite

Dataset collected from CiteSeer:
• Training set: 2416
• Testing set: 4131

ParaCite:
• Uses its default template database (adding templates to its database is not easy)
• Tested on the testing set

Our system:
• Uses a template database built from the training set
• Tested on the testing set

55

Experimental Results

ParaCite

                 Author   Title    Journal  Page     Issue    Year     Score
new evaluation   32.90%   73.35%   29.83%   4.58%    25.05%   77.04%   50.22%
old evaluation   99.08%   62.72%   30.46%   100.00%  93.96%   99.70%   78.81%

Our system

                 Author   Title    Journal  Volume   Page     Issue    Month    Year     Score
new evaluation   93.73%   73.32%   51.34%   83.52%   94.62%   85.11%   89.18%   96.49%   84.80%
old evaluation   90.58%   89.51%   67.66%   93.58%   96.69%   91.79%   99.49%   99.50%   91.45%

56

Analysis

• ParaCite can extract only one author name
• The old evaluation has a problem: you are highly likely to obtain high accuracy if you extract less information

57

Evaluation for clean DataSet

The citation string is fully composed of the corresponding metadata.

$$\mathrm{Accuracy} = \frac{\text{Number of correctly extracted fields}}{\text{Total number of fields}}$$

58

Compare with INFOMAP

Dataset: includes 160000 records
• Training dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)
• Testing dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)

59

Result

  Author Title Journal Volume Page Issue Year Overall average

APA 99.67% 96.38% 97.06% 98.99% 98.71% 98.12% 99.42% 98.33%

IEEE 98.72% 98.12% 99.12% 99.30% 98.40% 98.39% 99.40% 98.78%

ACM 97.14% 95.01% 93.93% 97.19% 97.92% 97.03% 98.88% 96.73%

ISR 99.48% 96.17% 96.96% 99.15% 98.55% 98.39% 99.35% 98.29%

MISQ 98.59% 97.99% 98.98% 99.41% 98.83% 98.61% 99.54% 98.85%

JMIS 91.95% 87.90% 90.46% 99.23% 98.76% 98.03% 99.46% 95.11%

Average 97.59% 95.26% 96.09% 98.88% 98.53% 98.09% 99.34% 97.68%

60

Evaluation for Cora DataSet

• 500 records
• Used as a benchmark by many papers (HMM, SVM, CRF)

61

Evaluation

Divide words into four kinds: TP, FP, TN, FN.

Four metrics:
• Word accuracy: (TP+TN)/(TP+FP+FN+TN)
• Precision: TP/(TP+FP)
• Recall: TP/(TP+FN)
• F1-measure: (2*Precision*Recall)/(Precision+Recall)
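The four metrics follow directly from the word counts. A minimal sketch, assuming the counts have already been obtained for one field; the function name `word_metrics` and the zero-division handling are illustrative choices, not part of the original evaluation.

```python
from typing import Dict


def word_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute word accuracy, precision, recall, and F1 from word counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for one field: 100 words, 8 of them labelled correctly
# as belonging to the field, 2 wrongly included, 2 missed.
print(word_metrics(tp=8, fp=2, tn=88, fn=2))
```

Because most words of a citation lie outside any given field, TN is usually large, which is why word accuracy can be much higher than F1 for the same field.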

62

Our System

  acc. F1.

Author 97.17% 93.98%

Title 94.17% 90.13%

Journal 93.58% 83.27%

Volume 99.21% 84.62%

Page 99.21% 92.09%

Date 99.92% 98.96%

63

Mining Translations of Chinese Names from Web Corpora by Using a Query Expansion Technique and Support Vector Machine

64

Agenda

• Introduction
• Proposed Approach
• Experiments
• Conclusions and Future Work

65

Background

Most academic information can be found on the Web: Google Scholar, DBLP, etc.

66

Problems in Searching Chinese Name

Only Chinese Corpus

67

Challenges in Chinese Name Translation

Pronunciation rules differ between areas:
• 陳 Chen (Taiwan)
• 陳 Tsun (Hong Kong)
• 陳 Tan (Fukien)

Some additional words exist:
• 黃光明 (Kwang-Ming Frank Hwang)
• 張韻詩 (Jane Win-Shih Liu)

68

Common Chinese Name Translation Format

Name format: examples

Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name): 劉豐哲 (Fon-Che Liu), 黃田漢 (Ng Tian Hann), 林牛 (Ngau Lam)

Type-2. (Merged Chinese given name) (Surname): 吳德琪 (Derchyi Wu)

Type-3. (Western first name) (Surname): 趙蓮菊 (Anne Chao)

Type-4. (Chinese given name) (Western first name) (Surname): 黃光明 (Kwang-Ming Frank Hwang)

Type-5. (Abbreviated Chinese given name) (Surname): 張秀瑜 (S.-Y. Chang)

Type-6. (Western first name) (Abbreviated Chinese given name) (Surname): 李昭勝 (Jack-C. Lee)

Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname): 蔡桂紅 (Gwei-Hung H. Tsai)

Type-8. (Chinese given name) (Unpredictable Surname): 張韻詩 (Jane Win-Shih Liu)

69

Goal

Design an automatic mechanism to translate a given Chinese name into its related English name

70

Agenda

• Introduction
• Proposed Approach
• Experiments
• Conclusions and Future Work

71

Concepts of Proposed Approach

No corresponding translations

72

Three Major Techniques

Query expansion technique (translation of the surname)
• Obtains the Web page snippets related to the Chinese name translation.
• Solves the problem of unrelated terms existing in the name translation.

Knowledge-based method (Chinese surname database, a common dictionary, Western first name database)
• Obtains all the name-like terms from the returned Web page snippets.

SVM (Chinese pronunciation database, the phonetic feature and the distance feature, selected training samples)
• Selects the appropriate Chinese name translations from the candidates.

73

System Architecture

[Figure: system architecture. Chinese names → Query expander (uses the Chinese surname database) → returned Web page snippets → Candidate extractor (uses the Western first name database and an on-line dictionary) → name candidates → SVM-based name selector (uses the Chinese pronunciation database) → translated English names.]

74

Query Expander

Goal: To retrieve Web page snippets that contain both a person’s Chinese name and the translation of the person’s surname.

Name splitter
• Determines whether the input Chinese name contains a compound surname (using the Chinese surname database).
• Divides the input Chinese name into a “surname” part and a “given name” part.

Surname translator
• Selects appropriate surname translations (using the Chinese surname database).
• The strength of the relationship between each surname translation and the person is determined by the “distance from the person’s Chinese name to the surname’s translation”.

Web page retriever
• Makes the concept of the query word clearer.
• Retrieves the related Web pages.
• The new query word is “(Chinese name) + (Surname’s translation)”.

75

Distance between Two Terms

Calculation of the “distance between two terms”:

$$D = N$$

where D is the distance and N is the number of non-words between the two terms.

Example: 陳威達 ( Wei-Da Chen)

The distance from the person’s Chinese name (陳威達) to the surname’s translation (Chen) is 3.

76

Candidate Extractor

Goal: To extract possible candidates from the retrieved Web page snippets.

Steps:

1. Remove all HTML tags.

2. Identify all the positions of the Chinese surnames in the snippets (using the Chinese surname database).

3. Extract any English term near each surname in the snippets if the term has one of the following properties:
– The term cannot be found in a common dictionary.
– The term is a Western first name.
– The length of the term is 1.

※ At most three English terms in the neighborhood of the surname will be extracted.
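The three steps can be sketched as follows. This is only an illustration under stated assumptions: the slides do not define “near”, so a character window (`window`) is used; the small word sets stand in for the common dictionary and the Western first name database; and the function name `extract_candidates` is hypothetical.

```python
import re
from typing import List, Set


def extract_candidates(snippet: str, surnames: Set[str], dictionary: Set[str],
                       western_first_names: Set[str],
                       window: int = 40, max_terms: int = 3) -> List[str]:
    """Return English name-like terms found near Chinese surnames."""
    # Step 1: remove all HTML tags.
    text = re.sub(r"<[^>]+>", "", snippet)
    # All English terms with their character offsets.
    terms = [(m.start(), m.group()) for m in re.finditer(r"[A-Za-z]+", text)]

    candidates: List[str] = []
    for surname in surnames:
        # Step 2: locate every occurrence of the surname in the snippet.
        start = text.find(surname)
        while start != -1:
            nearby = []
            for term_pos, term in terms:
                distance = abs(term_pos - start)
                if distance > window:
                    continue
                # Step 3: keep the term if it has one of the three properties.
                if (term.lower() not in dictionary
                        or term in western_first_names
                        or len(term) == 1):
                    nearby.append((distance, term))
            nearby.sort()  # closest terms first
            # At most three terms per surname occurrence.
            for _, term in nearby[:max_terms]:
                if term not in candidates:
                    candidates.append(term)
            start = text.find(surname, start + 1)
    return candidates


snippet = "<b>陳威達</b> (Wei-Da Chen) is a student"
print(extract_candidates(snippet, {"陳"}, {"is", "a", "student"}, set()))
# ['Wei', 'Da', 'Chen']
```

The single-letter term “a” also passes the filter (length 1), but it is dropped here because only the three closest terms are kept.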

77

System Architecture 4/10 - Candidate extractor

Step 1: Identify all the positions of the Chinese surnames in the snippets.

Step 2: Extract any English term near each surname in the snippets if the term has one of the following properties:

• The term cannot be found in a common dictionary.

• The term is a Western first name.

• The length of the term is 1.

The extracted terms become the name translation candidates and are sent to the SVM-based name selector for processing.

78

SVM-based Name Selector

Goal: To extract each candidate’s features and use them to determine whether the candidate is the correct translation of the input Chinese name.

Features:

1. The phonetic feature:
– Phonetic similarity (Soundex algorithm)

2. The distance feature:
– Smallest distance (between the Chinese name and the translation candidate)
– Number of appearances in the neighborhood
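For reference, the standard American Soundex code of a romanized name can be computed as below. The slides name the Soundex algorithm but do not say how the codes are turned into a phonetic similarity feature, so the equality check in `same_soundex` is an assumption made for illustration only.

```python
# Letter-to-digit table of standard (American) Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of an ASCII name ('' if none)."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return result.ljust(4, "0")


def same_soundex(a: str, b: str) -> bool:
    """Assumed similarity test: two romanizations share one Soundex code."""
    return soundex(a) != "" and soundex(a) == soundex(b)


print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(same_soundex("Chen", "Chan"))           # True
```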

79

Distance Features

The “neighborhood”:
• The close area of each occurrence of the Chinese name.
• The close area is defined by a given threshold on the distance, measured in number of words.

Example for the candidate “win-shih”:
• Number of appearances in the neighborhood: 2
• Smallest distance: 2

80

Summary

Query expansion technique
• Retrieves related Web pages.

Knowledge-based method
• Extracts appropriate name translation candidates from the retrieved Web pages.

SVM
• Learns the verification rule.
• Selects the appropriate name translation from the extracted candidates.

81

Agenda

Introduction Proposed Approach Experiments Conclusions and Future Work

82

Testing Environment and Dataset 1/3

The following tools are used:
• Cambridge on-line dictionary
• Google search engine
• LIBSVM

Two datasets are used:
• Dataset I (training & testing): collected from the Directory of scholars of the Institute of Mathematics. Contains 78 pieces of data.
• Dataset II (testing): collected by our program from the website of the Directory of the Division of Computer Science of the National Science Council. Contains 1,157 pieces of data; the name translations of 40 of them cannot be found on Google.

83

Testing Environment and Dataset 2/3

Name format, examples, and counts (# and %) in Dataset I and Dataset II:

Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name)
Examples: 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang)
Dataset I: 19 (24.3%); Dataset II: 1000 (89.5%)

Type-2. (Merged Chinese given name) (Surname)
Example: 蔡丕裕 (Piyu Tsai)
Dataset I: 10 (12.8%); Dataset II: 42 (3.8%)

Type-3. (Western first name) (Surname)
Example: 賴友仁 (Eugene Lai)
Dataset I: 9 (11.5%); Dataset II: 9 (0.8%)

Type-4. (Chinese given name) (Western first name) (Surname)
Examples: 劉立頌 (Alan Li-Sung liu), 陳嘉懿 (Jia-Yih Joy Chen), 楊豐瑞 (Fongray Frank Young)
Dataset I: 14 (17.9%); Dataset II: 50 (4.5%)

Type-5. (Abbreviated Chinese given name) (Surname)
Example: 洪英超 (I.-C. Hung)
Dataset I: 3 (3.8%); Dataset II: 0 (0%)

Type-6. (Western first name) (Abbreviated Chinese given name) (Surname)
Example: 曾秋蓉 (Judy C. R. Tseng)
Dataset I: 8 (10.3%); Dataset II: 9 (0.8%)

Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname)
Example: 黃哲志 (Tetz C. Huang)
Dataset I: 3 (3.8%); Dataset II: 3 (0.4%)

Type-8. (Chinese given name) (Unpredictable Surname)
Example: 張肇健 (Trieu-Kien Truong)
Dataset I: 12 (15.4%); Dataset II: 4 (0.4%)

84

Testing Environment and Dataset 3/3

The alignment accuracy
• Proposed by Huang (2005).
• The probability of selecting the correct answers when the searched snippets contain the correct answers.

$$A_i = \frac{N_{cc}}{N_d}$$

where
• $A_i$: the alignment accuracy of candidate i.
• $N_d$: the number of testing data.
• $N_{cc}$: the number of correct translations.

Performance measurement: Top-1 to Top-5 alignment accuracy.
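A Top-k version of this measurement can be sketched as below, under the assumption that a test item counts as correct when any of its first k ranked candidates is an accepted translation. The data layout, the placeholder names, and the function name `top_k_alignment_accuracy` are hypothetical.

```python
from typing import Dict, List, Set


def top_k_alignment_accuracy(ranked: Dict[str, List[str]],
                             correct: Dict[str, Set[str]], k: int) -> float:
    """N_cc / N_d, where an item is correct if a top-k candidate is accepted."""
    n_d = len(ranked)  # number of testing data
    n_cc = sum(1 for name, candidates in ranked.items()
               if set(candidates[:k]) & correct.get(name, set()))
    return n_cc / n_d if n_d else 0.0


# Made-up ranked outputs for two test names (placeholders, not real results).
ranked = {
    "name_1": ["wrong translation", "right translation 1"],
    "name_2": ["right translation 2"],
}
correct = {
    "name_1": {"right translation 1"},
    "name_2": {"right translation 2"},
}
print(top_k_alignment_accuracy(ranked, correct, k=1))  # 0.5
print(top_k_alignment_accuracy(ranked, correct, k=2))  # 1.0
```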

85

Results and Analysis 1/3

- Overall performance on Dataset I

70.5% top-1 accuracy

91% top-5 accuracy

86

Results and Analysis 2/3

- Overall performance on Dataset II

57.9% top-1 accuracy

86.2% top-5 accuracy

87

Results and Analysis 3/3

- Performance of each name type

Our system performs better on Type-1, Type-2, Type-4, and Type-6.

Name format: example

Type-1: 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang)

Type-2: 蔡丕裕 (Piyu Tsai)

Type-3: 賴友仁 (Eugene Lai)

Type-4: 劉立頌 (Alan Li-Sung liu), 陳嘉懿 (Jia-Yih Joy Chen)

Type-5: 洪英超 (I.-C. Hung)

Type-6: 曾秋蓉 (Judy C. R. Tseng)

Type-7: 黃哲志 (Tetz C. Huang)

Type-8: 張肇健 (Trieu-Kien Truong)

88

Discussions

• The major reason for the low performance on Type-3, Type-5, Type-7 and Type-8 is the lack of Web information.

• Usually more than one correct name translation is found for an input Chinese name: the name ambiguity problem.

89

Limitations

Uncommon surnames

Reliance on Web resources

Search engine selection

No name disambiguation

90

Agenda

• Introduction
• Proposed Approach
• Experiments
• Conclusions

91

Conclusions

Mining information from Web corpora is effective for dealing with the person name translation problem.

The name ambiguity problem arises frequently.

92

Thank You

Jan-Ming Ho

hoho@iis.sinica.edu.tw

Institute of Information Science

Academia Sinica