Mining Academic Community
Jan-Ming Ho
hoho@iis.sinica.edu.tw
Computer System and Communication Lab
Institute of Information Science
Academia Sinica
2
What is Community?
In graph theory: densely connected groups of vertices, with sparser connections between groups
In social network analysis: groups of entities that share similar properties or connect to each other via certain relations
A social network is a structure made up of nodes, representing entities from different conceptual groups, that are linked by different types of relations
3
Why is Community Important?
Interesting data with community structure: researcher collaboration, friendship networks, the WWW, massively multiplayer online gaming, electronic communications
Groups of web pages that link to more web pages inside the community than outside it correspond to web pages on related topics
Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.
4
Motivation
Understand the research network between authors, conferences and topics (rank entities by relevance for given entities)
Find and justifiably recommend research collaborators for given authors
Explore the academic social network
• Find the most important papers, researchers and venues for a given topic
5
Related Systems
Many digital library systems exist
• ACM Digital Library
• IEEE Xplore
• DBLP
• CiteSeer
• Libra
• DBconnect
Problems
• The coverage of the datasets is not large enough
• The name ambiguity problem exists in web pages and citation records
6
Libra Academic Search
http://libra.msra.cn
Free computer science bibliography search engine
A test-bed for object-level vertical search research
Currently the following types of paper-related objects can be searched: Papers, Authors, Conferences, Journals, Research Communities
7
8
9
DBconnect: Conference
10
DBconnect: Topic
11
DBconnect: Author
12
ZoomInfo
(1) People Directory
(2) Developer Tools
(3) Social Network, Profile Statistics, Employment History
(4) Ability to identify ambiguous names?! E.g., it returns 21 different people called “Bing Liu”
13
ArnetMiner
14
Our goal
Develop an automatic system to
• Explore the academic social network
• Find the most important papers, researchers and venues for a given topic
Provide solutions for existing problems
• Collecting larger citation datasets
  Retrieving data from web pages: publication list finder, extracting citation strings from web pages, citation parser
  Multilingual data sources: Chinese and English corpora
• Name disambiguation mechanism in web pages and citation records
15
Our contributions
Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee, and Jan-Ming Ho, "Web Appearance Disambiguation of Personal Names Based on Network Motif," in the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), Hong Kong, Dec. 18-22, 2006
Kai-Hsiang Yang, Jen-Ming Chung and Jan-Ming Ho, "PLF: A Publication List Web Page Finder for Researchers," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007
Kai-Hsiang Yang, Wei-Da Chen, Hahn-Ming Lee and Jan-Ming Ho, "Mining Translations of Chinese Name from Web Corpora by Using Query Expansion Technique and Support Vector Machine," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007
Chia-Ching Chou, Kai-Hsiang Yang and Hahn-Ming Lee, "AEFS: Authoritative Expert Finding System Based on a Language Model and Social Network Analysis," in Proceedings of the 12th Conference on Artificial Intelligence and Applications (TAAI2007), Nov 16-17, 2007
Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment Techniques," will appear in Proceedings of the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA-08)
16
PLF: A Publication List Web Page Finder for
Researchers
17
Agenda
Introduction
Publication List Web Page Finder, PLF
Performance Evaluation
Conclusion, Future Work
18
Overview of a Publication List Web Page
Keep abreast of state-of-the-art research
• Contains citations not found elsewhere.
• May provide reference materials, such as slides and talks.
Challenges
• How to find the publication list web pages given only the name.
• Various versions or multiple copies: an author may have many affiliations.
• Name ambiguity problem: e.g., for Dr. Bing Liu, a query to ZoomInfo (a people search engine) shows that 26 people share the same name.
19
Problem
“Publication List Web Page?”
20
Definition of Publication List
[Affiliation] Institute of Information Science, Academia Sinica
citation string
Affiliated Personal Publication List Web Page (APPL)
• a web page that belongs to the affiliated web site of a specific person with the given name.
21
Agenda
Introduction
Publication List Web Page Finder, PLF
Performance Evaluation
Conclusion, Future Work
22
Process Flow
[Figure: process flow from the given names to ranked web pages]
1. Citation String Search: queries digital libraries with the given names and collects the citation strings from them.
2. Web Page Crawler: collects the hyperlinks of web pages from search engines by using the collected citation strings as queries.
3. Rank Function: analyses the statistics of all the collected hyperlinks of web pages.
23
Basic Concept
A publication list web page may contain many citation strings
[Figure: each paper title PT1, PT2, …, PTm of an author (here Jan-Ming Ho) is sent as a query (QPT1, QPT2, …) to a search engine, and each query returns web pages WP1, WP2, …, WPn; a publication list web page is marked among the returned pages]
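The basic concept suggests a simple ranking idea: a page returned by many different citation-string queries is more likely to be a publication list. The snippet below is only a minimal sketch of that idea, not the PLF rank function itself; the function name `rank_pages`, the input shape (query string mapped to returned URLs) and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def rank_pages(results_per_query, top_k=5):
    """Rank URLs by the number of distinct citation queries that returned them.

    results_per_query: dict mapping a query (e.g. a paper title) to the list
    of URLs a search engine returned for it (hypothetical input format).
    """
    hits = Counter()
    for urls in results_per_query.values():
        # Count each page at most once per query.
        hits.update(set(urls))
    # Sort by hit count (descending), then by URL for a deterministic order.
    ordered = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ordered[:top_k]]


if __name__ == "__main__":
    toy_results = {
        "PT1": ["a.example/pubs", "b.example/paper1"],
        "PT2": ["a.example/pubs", "c.example/paper2"],
        "PT3": ["a.example/pubs", "b.example/paper1"],
    }
    print(rank_pages(toy_results))
```

In this toy input the page returned for all three titles comes first, which is the behaviour the slide's figure hints at.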
24
Agenda
Introduction
Publication List Web Page Finder, PLF
Performance Evaluation
Conclusion, Future Work
25
Dataset
Scenario
• Seminar members have usually published major research works
• We randomly collected 200 names from the WWW ’06 Conference Committee website

APPL type      #APPL   #people   % of population
others         0       22        11%
single-group   1       120       60%
multi-group    2       35        17.5%
multi-group    3       16        8%
multi-group    4       7         3.5%
26
Experiment Evaluation
Evaluation metrics
• We consider the top-5 results derived by each link and focus on the top-5 recall metric, which is calculated by:
$$\text{recall} = \frac{R}{R_a}$$
Notation and definition
• $R_a$: the number of publication list web pages belonging to researchers listed in the dataset
• $R$: the number of publication list web pages contained in the top-5 results
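As a worked illustration of the top-5 recall metric, the sketch below counts how many known publication list pages show up in the top-k results and divides by the number of known pages. The function name and the dictionary-based inputs are assumptions for illustration; the slides do not specify how the results were stored.

```python
def top_k_recall(ranked_by_person, true_pages_by_person, k=5):
    """Top-k recall: found publication list pages / all known ones.

    ranked_by_person: dict person -> ranked list of URLs (hypothetical format).
    true_pages_by_person: dict person -> set of that person's publication
    list page URLs (the ground truth).
    """
    found = 0
    total = 0
    for person, truth in true_pages_by_person.items():
        total += len(truth)
        top = set(ranked_by_person.get(person, [])[:k])
        found += len(top & set(truth))
    return found / total if total else 0.0


if __name__ == "__main__":
    ranked = {"x": ["p1", "q", "r"], "y": ["z"]}
    truth = {"x": {"p1", "p2"}, "y": {"p3"}}
    print(top_k_recall(ranked, truth))
```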
27
Parameter Analysis for Single-Group
[Figures: top-5 recall for different (m, n) settings. (a) Fixed n with varying m; (b) fixed m with varying n]
Figure (a)
• When m increases, the recall rate also increases.
Figure (b)
• System performance may be constrained by m.
28
Parameter Analysis for Multi-Group
(a) Fixed n mixed with different scale m
(b) Fixed m mixed with different scale n
Figure (a)
• It is clear that the performance when m = 40 is always better than the other settings.
Figure (b)
• The best performance (top-5 recall is 70%) occurs when n = 75.
29
Performance Evaluations
[Figures: (a) performance of the approaches on the single-group dataset; (b) performance of the approaches on the multi-group dataset; legend label: (given name + keyword)]
1. The parameter m has a strong influence on the system’s performance; for example, an oversized m may degrade the performance.
2. The parameter n has little influence on the system’s performance.
3. The PLF system outperforms the other two approaches on both the single-group and the multi-group datasets.
30
Conclusion
We have defined the problem of finding the publication list web pages of a researcher, and proposed the “PLF” system
Ongoing work
• Name ambiguity problem
• How to merge the multiple publication list web pages for a specific person into a single page.
31
Discussion – Name Ambiguity Problem
Scenario
• We take the name “Bing Liu” as an example
• Analyze manually
Observation
• Citation count
• Name translation problem
• Partial matching problem
32
Extracting Citation Strings from Web Pages
33
Extract Citation Records
[Figure: Web Page → Extract → Structured Data]
34
Challenges
The formats of publication list web pages vary
There are no fixed syntactic rules for parsing citation records
Hence, we cannot apply simple rules to extract citation records automatically
35
Challenges: Complex Layouts of Publication List Pages
36
Ideas
The semantic structure of web pages is organized by visual arrangement.
We can utilize the semi-structured (visual) information of web pages to help the extraction task.
With its hierarchical structure and geometric information, the DOM tree is not only a good structure for representing Web pages, but also very helpful for visual pattern analysis.
37
DOM Tree Presentation of Web page
[Figure: DOM tree of a web page with Banner, Navigator Bar and Publication List nodes; the Publication List node contains the Citation String nodes]
38
Architecture of Citation Extraction System
[Figure: architecture of the citation extraction system. The Publication List Web Page Finder supplies publication list pages, which are parsed into DOM trees. The Common Style Finder mines common style patterns. The Citation Extractor extracts candidate citation records and estimates the citation length, then ranks the records with a Normal Citation Model (CiteSeer) to produce the citation records. The example DOM subtrees (p, li, a, em, div) hold text nodes T1 to T6 that are grouped into candidate records such as T1T2 and T3T4T5.]
39
Modules of Citation Extraction System
Common Style Finder
• find all common style patterns for each level of granularity in web pages
Citation Extractor
• explore data regions with common style patterns
• distill extraction rules from those data regions
• rank extraction patterns based on a normal word count distribution probability
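The last step, ranking by a normal word count distribution, can be sketched as scoring each candidate record by how typical its word count is for a citation. This is an illustrative sketch only: the mean and standard deviation below are made-up placeholders (the slides suggest the model is derived from CiteSeer but give no values), and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std * std)) / (std * math.sqrt(2 * math.pi))


def rank_by_word_count(candidates, mean=20.0, std=5.0):
    """Order candidate records so that citation-like word counts come first.

    mean and std are hypothetical parameters of the word count model.
    """
    return sorted(
        candidates,
        key=lambda text: normal_pdf(len(text.split()), mean, std),
        reverse=True,
    )


if __name__ == "__main__":
    records = [
        "Home",
        "Chomsky, Noam. 1956. Three models for the description of language. "
        "IRE Transactions on Information Theory. 2(3) 113--124.",
    ]
    print(rank_by_word_count(records)[0])
```

A navigation label such as "Home" gets a very low density under this model, so it sinks below text blocks whose length resembles a citation.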
40
BibPro: A Citation Parser based on Sequence Alignment Techniques
41
System Goal
Citation: Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory. 2(3) 113--124.
Our System
Metadata
Author: Chomsky, Noam
Title: Three models for the description of language
Journal: IRE Transactions on Information Theory
Volume: 2
Issue: 3
Page: 113-124
Month:
Year: 1956
42
Basic Idea(1/2)
Encode a citation as a protein sequence
Keep only the citation style information
• order of fields
• field separators
[Figure: the fields of a citation (Author, Title, Journal, year, page, …) are encoded as the protein sequence … A D T D L D Y R P H S]
43
Basic Idea(2/2)
Determine the citation style by the order of punctuation marks and reserved words
[Figure: in the system preprocess, pairs of (Feature Index, Citation Style) are stored; in online parsing, the Feature Index of a citation string is matched against them with the search tool BLAST to obtain its Citation Style]
44
How to encode citation to protein sequence?
Keep the citation style information
• Which fields should be included? (only 23 symbols can be used)
• Which punctuation marks are used to separate fields?
By observing different citation styles, we define an encode table that translates each token of a citation into an amino acid symbol
45
Encode Table
A: Author
T: Title
L: Journal
F: Volume value
W: Issue value
H: Page value
M: Month
Y: Year
X: noise (unrecognized token)
S: Issue key, e.g. “no”, “No”
P: Page key, e.g. “pp”, “page”
V: Volume key, e.g. “Vol”, “vo”
N: numeral
Q: @ # $ % ^ & * + = \ | ~ _ / ! ? 。
I: ( [ { < 「
K: ) ] } > 」
D: .
G: " “ ”
R: ,
C: - :
E: ' `
Z: ;
B: blank
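To make the encode table concrete, the sketch below maps the tokens of a citation string to the symbols for punctuation, numerals, blanks and keyword tokens. It is a simplified illustration, not BibPro's encoder: recognising field tokens (A, T, L, and so on) needs knowledge the slides do not describe, so every other word is mapped to X here, and the tokenisation rule and the case-insensitive keyword matching are assumptions.

```python
import re

# Keyword tokens from the encode table (matched case-insensitively here).
KEYWORDS = {"no": "S", "pp": "P", "page": "P", "vol": "V", "vo": "V"}

# Punctuation classes copied from the encode table.
PUNCTUATION_CLASSES = {
    "Q": "@#$%^&*+=\\|~_/!?。",
    "I": "([{<「",
    "K": ")]}>」",
    "D": ".",
    "G": "\"“”",
    "R": ",",
    "C": "-:",
    "E": "'`",
    "Z": ";",
}
SYMBOL_OF_CHAR = {
    ch: symbol for symbol, chars in PUNCTUATION_CLASSES.items() for ch in chars
}

# Whitespace runs, digit runs, letter runs, or any other single character.
TOKEN_RE = re.compile(r"\s+|\d+|[^\W\d_]+|.")


def encode_citation(citation):
    """Translate a citation string into a sequence of encode-table symbols."""
    symbols = []
    for token in TOKEN_RE.findall(citation):
        if token.isspace():
            symbols.append("B")  # blank
        elif token.isdigit():
            symbols.append("N")  # numeral
        elif token.isalpha():
            # Keyword if listed, otherwise treated as noise in this sketch.
            symbols.append(KEYWORDS.get(token.lower(), "X"))
        else:
            symbols.append(SYMBOL_OF_CHAR.get(token, "X"))
    return "".join(symbols)


if __name__ == "__main__":
    print(encode_citation("Vol. 2, pp. 113-124"))  # VDBNRBPDBNCN
```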
46
How to use a protein sequence to extract metadata?
Transform the extraction problem into a sequence alignment problem
Form translation
• Unknown answer: BASE FORM, ALIGN FORM, INDEX FORM
• Known answer: RESULT FORM, STYLE FORM, INDEX FORM
47
RESULT FORM (Known Answer)
48
BASE FORM (Unknown Answer)
49
System Structure
System PreProcess (Template Generating System)
• Citation Crawler
• Template Builder
Online Parsing (Parsing System)
• Template Matching
• Metadata Extraction
[Figure: (1) System PreProcess builds the Template Database from resources on the Internet; (2) Online Parsing uses the Template Database to turn a query citation into metadata]
50
Citation Crawler
[Figure: the Citation Crawler queries the IEEE, Google, CiteSeer and ACM engines, collects BibTex records, and passes them through a BibTex Parser to produce (Metadata, Citation) pairs]
51
BLAST-powered Template Matching
[Figure: the query citation is encoded with the Encode Table and translated into an INDEX FORM; BLAST matches it against the (INDEX FORM, STYLE FORM) pairs in the TEMPLATE DATABASE and returns the matching STYLE FORMs]
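The template matching step can be illustrated with a small stand-in: pick the stored template whose encoded sequence is most similar to the encoded query. The sketch below uses Python's `difflib.SequenceMatcher` in place of BLAST, purely so the example runs without external tools; the template names and sequences are made up, and a real system would use BLAST's alignment scores instead of this similarity ratio.

```python
from difflib import SequenceMatcher


def best_template(query_sequence, templates):
    """Return (name, score) of the template most similar to the query.

    templates: dict mapping a hypothetical template name to its encoded
    style sequence. SequenceMatcher is a stand-in for BLAST here.
    """
    best_name, best_score = None, -1.0
    for name, sequence in templates.items():
        score = SequenceMatcher(None, query_sequence, sequence, autojunk=False).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


if __name__ == "__main__":
    template_db = {
        "style_a": "ADTDLDYRPH",
        "style_b": "ARYDTDLRVNRPN",
    }
    print(best_template("ADTDLDYRPN", template_db))
```

Once the closest template is known, its style form tells the parser which field each aligned position of the query belongs to.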
52
Evaluation for CiteSeer DataSet
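Consider the inconsistency between the citation string and the BibTex file (metadata)
Old measurement:
$$\text{Field Precision}_{old} = \frac{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTex field}}]}{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTex}}]}$$
New measurement:
$$\text{Field Precision}_{new} = \frac{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTex field}}]}{\#[\text{Token}_{\text{query citation}} \cap \text{Token}_{\text{BibTex}}]}$$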
53
Definition
$\text{Token}_{\text{parsed field}}$: the tokens that appear in the parsed subfield
$\text{Token}_{\text{query citation}}$: the tokens that appear in the query citation string
$\text{Token}_{\text{BibTex field}}$: the tokens that appear in the specific subfield in the BibTex file
$\text{Token}_{\text{BibTex}}$: all tokens that appear in the BibTex file
These tokens don't include punctuation
54
Compare with ParaCite
Dataset
• Collected from CiteSeer
• Training set: 2416
• Testing set: 4131
ParaCite
• Uses its default template database (adding templates to its database isn’t easy)
• Tested on the testing set
Our system
• Uses a template database built from the training set
• Tested on the testing set
55
Experimental Results
ParaCite
Evaluation   Author   Title    Journal   Page      Issue    Year     Score
new          32.90%   73.35%   29.83%    4.58%     25.05%   77.04%   50.22%
old          99.08%   62.72%   30.46%    100.00%   93.96%   99.70%   78.81%

Our system
Evaluation   Author   Title    Journal   Volume   Page     Issue    Month    Year     Score
new          93.73%   73.32%   51.34%    83.52%   94.62%   85.11%   89.18%   96.49%   84.80%
old          90.58%   89.51%   67.66%    93.58%   96.69%   91.79%   99.49%   99.50%   91.45%
56
Analysis
ParaCite can extract only one author name
The old evaluation has a problem: it is highly probable that you will obtain high accuracy if you extract less information
57
Evaluation for clean DataSet
The citation string is fully composed of the corresponding metadata
$$\text{Accuracy} = \frac{\text{Number of correctly extracted fields}}{\text{Total number of fields}}$$
58
Compare with INFOMAP
Dataset
• Includes 160000 records
• Training dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)
• Testing dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)
59
Result
Style     Author   Title    Journal   Volume   Page     Issue    Year     Overall average
APA       99.67%   96.38%   97.06%    98.99%   98.71%   98.12%   99.42%   98.33%
IEEE      98.72%   98.12%   99.12%    99.30%   98.40%   98.39%   99.40%   98.78%
ACM       97.14%   95.01%   93.93%    97.19%   97.92%   97.03%   98.88%   96.73%
ISR       99.48%   96.17%   96.96%    99.15%   98.55%   98.39%   99.35%   98.29%
MISQ      98.59%   97.99%   98.98%    99.41%   98.83%   98.61%   99.54%   98.85%
JMIS      91.95%   87.90%   90.46%    99.23%   98.76%   98.03%   99.46%   95.11%
Average   97.59%   95.26%   96.09%    98.88%   98.53%   98.09%   99.34%   97.68%
60
Evaluation for Cora DataSet
500 records
Used as a benchmark in many papers (HMM, SVM, CRF)
61
Evaluation
Divide words into four kinds: TP, FP, TN, FN
Four metrics:
• Word accuracy: (TP+TN)/(TP+FP+FN+TN)
• Precision: TP/(TP+FP)
• Recall: TP/(TP+FN)
• F1-measure: (2*Precision*Recall)/(Precision+Recall)
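The four metrics above translate directly into code. The sketch below computes them from word-level counts; the function name and the choice to return 0.0 when a denominator is zero are assumptions for illustration, and the counts in the example are made up.

```python
def word_metrics(tp, fp, tn, fn):
    """Word accuracy, precision, recall and F1 from word-level counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up counts for one field.
    print(word_metrics(tp=8, fp=2, tn=88, fn=2))
```

Because most words in a citation do not belong to any single field, TN dominates the accuracy figure, which is one reason F1 is reported alongside it.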
62
Our System
acc. F1.
Author 97.17% 93.98%
Title 94.17% 90.13%
Journal 93.58% 83.27%
Volume 99.21% 84.62%
Page 99.21% 92.09%
Date 99.92% 98.96%
63
Mining Translations of Chinese Names from Web Corpora by Using a
Query Expansion Technique and Support Vector Machine
64
Agenda
Introduction
Proposed Approach
Experiments
Conclusions and Future Work
65
Background
Most academic information can be found on the Web
• Google Scholar, DBLP, etc.
66
Problems in Searching Chinese Name
Only Chinese Corpus
67
Challenges in Chinese Name Translation
Many pronunciation rules in different areas
• 陳 Chen (Taiwan)
• 陳 Tsun (Hong Kong)
• 陳 Tan (Fukien)
Some additional words exist
• Ex: 黃光明 (Kwang-Ming Frank Hwang)
• Ex: 張韻詩 (Jane Win-Shih Liu)
68
Common Chinese Name Translation Format
Name format: examples
Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name): 劉豐哲 (Fon-Che Liu), 黃田漢 (Ng Tian Hann), 林牛 (Ngau Lam)
Type-2. (Merged Chinese given name) (Surname): 吳德琪 (Derchyi Wu)
Type-3. (Western first name) (Surname): 趙蓮菊 (Anne Chao)
Type-4. (Chinese given name) (Western first name) (Surname): 黃光明 (Kwang-Ming Frank Hwang)
Type-5. (Abbreviated Chinese given name) (Surname): 張秀瑜 (S.-Y. Chang)
Type-6. (Western first name) (Abbreviated Chinese given name) (Surname): 李昭勝 (Jack-C. Lee)
Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname): 蔡桂紅 (Gwei-Hung H. Tsai)
Type-8. (Chinese given name) (Unpredictable Surname): 張韻詩 (Jane Win-Shih Liu)
69
Goal
Design an automatic mechanism to translate a given Chinese name into its related English name
70
Agenda
Introduction
Proposed Approach
Experiments
Conclusions and Future Work
71
Concepts of Proposed Approach
No corresponding translations
72
Three Major Techniques
Query expansion technique
• Resource: translation of the surname
• Obtaining the related Web page snippets of the Chinese name translation.
• Solving the problem of unrelated terms existing in the name translation.
Knowledge-based method
• Resources: Chinese surname database, a common dictionary, Western first name database
• Obtaining all the name-like terms from the returned Web page snippets.
SVM
• Resources: Chinese pronunciation database, the phonetic feature and the distant feature, selected training samples
• Selecting the appropriate Chinese name translations from the candidates.
73
System Architecture
[Figure: Chinese names → Query expander (uses the Chinese surname database) → returned Web page snippets → Candidate extractor (uses the Western first name database and the on-line dictionary) → name candidates → SVM-based name selector (uses the Chinese pronunciation database) → translated English names]
74
Query Expander
Goal: To retrieve Web page snippets that contain both a person’s Chinese name and the translation of the person’s surname.
Name splitter
• Determines whether the input Chinese name contains a compound surname (using the Chinese surname database).
• Divides the input Chinese name into a “surname” part and a “given name” part.
Surname translator
• Selects appropriate surname translations (using the Chinese surname database).
• The strength of the relationship between each surname translation and the person is determined by the “distance from the person’s Chinese name to the surname’s translation”.
Web page retriever
• Makes the concept of the query word clearer.
• Retrieves the related Web pages.
• The new query word will be “(Chinese name) + (Surname’s translation)”.
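The query expander can be sketched as two small steps: split the name into surname and given name, then pair the full Chinese name with each known surname translation. This is a minimal sketch under stated assumptions: the two tiny dictionaries stand in for the Chinese surname database (their entries come from examples elsewhere in these slides), the function names are hypothetical, and the quoting of the Chinese name in the query string is a guess at how a search engine would be queried.

```python
# Tiny stand-ins for the Chinese surname database (illustrative only).
COMPOUND_SURNAMES = {"歐陽"}
SURNAME_TRANSLATIONS = {
    "陳": ["Chen", "Tsun", "Tan"],
    "歐陽": ["Ouhyang"],
}


def split_name(chinese_name):
    """Split a Chinese name into (surname, given name)."""
    if len(chinese_name) > 2 and chinese_name[:2] in COMPOUND_SURNAMES:
        return chinese_name[:2], chinese_name[2:]
    return chinese_name[:1], chinese_name[1:]


def expand_queries(chinese_name):
    """Build one query per surname translation: (Chinese name) + (translation)."""
    surname, _given = split_name(chinese_name)
    return [
        f'"{chinese_name}" {translation}'
        for translation in SURNAME_TRANSLATIONS.get(surname, [])
    ]


if __name__ == "__main__":
    print(split_name("歐陽明"))
    print(expand_queries("陳威達"))
```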
75
Distance between Two Terms
Calculation of the “distance between two terms”:
$$D = N$$
where D is the distance and N is the number of non-words between the two terms.
Example: 陳威達 (Wei-Da Chen)
The distance from the person’s Chinese name (陳威達) to the surname’s translation (Chen) is 3.
76
Candidate Extractor
Goal: To extract possible candidates from the retrieved Web page snippets.
Steps:
1. Removing all HTML tags.
2. Identifying all the positions of the Chinese surnames in the snippets (using the Chinese surname database).
3. Extracting any English term near each surname in the snippets if the term has one of the following properties:
– The term cannot be found in a common dictionary.
– The term is a Western first name.
– The length of the term is 1.
※ At most three English terms in the neighborhood of the surname will be extracted.
77
System Architecture 4/10
- Candidate extractor
Step 1: Identifying all the positions of the Chinese surnames in the snippets.
Step 2: Extracting any English term near each surname in the snippets if the term has one of the following properties:
• The term cannot be found in a common dictionary.
• The term is a Western first name.
• The length of the term is 1.
The extracted terms are the name translation candidates and are sent to the SVM-based name selector for processing
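The extraction steps can be sketched as follows. This is an illustrative approximation, not the system's extractor: the word sets stand in for the common dictionary and the Western first name database, the surname is located through its English translation, and "near" is taken to mean up to three terms on either side, which the slides do not define precisely.

```python
import re

# Tiny stand-ins for the common dictionary and the Western first name database.
COMMON_WORDS = {"professor", "of", "the", "department", "university", "and", "mathematics"}
WESTERN_FIRST_NAMES = {"frank", "jane"}


def is_name_like(term):
    """True if the term has one of the three properties listed on the slide."""
    lowered = term.lower()
    return (
        lowered not in COMMON_WORDS
        or lowered in WESTERN_FIRST_NAMES
        or len(term) == 1
    )


def extract_candidates(snippet, surname_translation, window=3, limit=3):
    """Collect up to `limit` name-like English terms near each surname occurrence."""
    text = re.sub(r"<[^>]+>", " ", snippet)  # remove HTML tags
    terms = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    candidates = []
    for i, term in enumerate(terms):
        if term.lower() != surname_translation.lower():
            continue
        # Neighbours before the surname (nearest first), then those after it.
        neighbours = terms[max(0, i - window):i][::-1] + terms[i + 1:i + 1 + window]
        picked = []
        for neighbour in neighbours:
            if is_name_like(neighbour) and neighbour not in picked:
                picked.append(neighbour)
            if len(picked) == limit:
                break
        candidates.extend(picked)
    return candidates


if __name__ == "__main__":
    snippet = "<b>黃光明</b> Professor Kwang-Ming Frank Hwang, Department of Mathematics"
    print(extract_candidates(snippet, "Hwang"))
```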
78
SVM-based Name Selector
Goal: To extract each candidate’s features and utilize them to determine whether the candidate is the correct translation of the input Chinese name.
Features:
1. The phonetic feature:
– Phonetic similarity (Soundex algorithm)
2. The distant feature:
– Smallest distance (between the Chinese name and the translation candidates)
– Number of appearances in the neighborhood
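For the phonetic feature, the slide names the Soundex algorithm. The sketch below is a compact implementation of standard American Soundex for ASCII letters; how the system maps a Chinese name to a romanised pronunciation before comparison is not described in the slides, so comparing two romanisations directly, as in the example, is an assumption.

```python
# Standard Soundex digit classes.
SOUNDEX_CODES = {}
for _digit, _letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items():
    for _letter in _letters:
        SOUNDEX_CODES[_letter] = _digit


def soundex(word):
    """Return the four-character American Soundex code of an ASCII word."""
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # vowels reset the previous code
    return (encoded + "000")[:4]


if __name__ == "__main__":
    # Two romanisations that sound alike share a code.
    print(soundex("Chen"), soundex("Chan"))
```

A binary "same code or not" comparison is the simplest phonetic similarity; a graded score over the code characters would be an equally plausible feature.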
79
Distant Features
The “neighborhood”:
• The close area around each occurrence of the Chinese name.
• The close area is defined by a given threshold on the distance, measured in number of words.
Example for the candidate “win-shih”
• Number of appearances in the neighborhood: 2
• Smallest distance: 2
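The two distant features can be sketched on a tokenised snippet. The details here are assumptions for illustration: tokens are split on whitespace, the distance is taken as the difference between token positions, the threshold of 3 is arbitrary, and the snippet is a made-up toy example rather than data from the experiments.

```python
def distant_features(tokens, chinese_name, candidate, threshold=3):
    """Return (appearances in the neighborhood, smallest distance).

    Distance is the difference in token positions (an assumed definition).
    Returns (0, None) when the name or the candidate never occurs.
    """
    name_positions = [i for i, t in enumerate(tokens) if t == chinese_name]
    candidate_positions = [
        i for i, t in enumerate(tokens) if t.lower() == candidate.lower()
    ]
    if not name_positions or not candidate_positions:
        return 0, None
    # Distance from each candidate occurrence to its nearest name occurrence.
    distances = [
        min(abs(c - n) for n in name_positions) for c in candidate_positions
    ]
    appearances = sum(1 for d in distances if d <= threshold)
    return appearances, min(distances)


if __name__ == "__main__":
    toy = "張韻詩 Jane Win-Shih Liu is a professor ; 張韻詩 ( Win-Shih Liu ) retired"
    print(distant_features(toy.split(), "張韻詩", "win-shih"))
```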
80
Summary
Query expansion technique
• Retrieving related Web pages.
Knowledge-based method
• Extracting appropriate name translation candidates from the retrieved Web pages.
SVM
• Learning the verification rule and
• Selecting the appropriate name translation from the extracted candidates.
81
Agenda
Introduction
Proposed Approach
Experiments
Conclusions and Future Work
82
Testing Environment and Dataset 1/3
The following tools are used:
• Cambridge on-line dictionary
• Google search engine
• LIBSVM
Two datasets are used:
• Dataset I (training & testing): collected from the Directory of scholars of Institute of Mathematics; contains 78 pieces of data.
• Dataset II (testing): collected by our program from the Website of the Directory of Division of Computer Science of National Science Council; contains 1,157 pieces of data, and the name translations of 40 of them do not exist in Google.
83
Testing Environment and Dataset 2/3
Name format, examples, and counts (# and %) in Dataset I and Dataset II
Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name)
  Examples: 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang)
  Dataset I: 19 (24.3%); Dataset II: 1000 (89.5%)
Type-2. (Merged Chinese given name) (Surname)
  Example: 蔡丕裕 (Piyu Tsai)
  Dataset I: 10 (12.8%); Dataset II: 42 (3.8%)
Type-3. (Western first name) (Surname)
  Example: 賴友仁 (Eugene Lai)
  Dataset I: 9 (11.5%); Dataset II: 9 (0.8%)
Type-4. (Chinese given name) (Western first name) (Surname)
  Examples: 劉立頌 (Alan Li-Sung liu), 陳嘉懿 (Jia-Yih Joy Chen), 楊豐瑞 (Fongray Frank Young)
  Dataset I: 14 (17.9%); Dataset II: 50 (4.5%)
Type-5. (Abbreviated Chinese given name) (Surname)
  Example: 洪英超 (I.-C. Hung)
  Dataset I: 3 (3.8%); Dataset II: 0 (0%)
Type-6. (Western first name) (Abbreviated Chinese given name) (Surname)
  Example: 曾秋蓉 (Judy C. R. Tseng)
  Dataset I: 8 (10.3%); Dataset II: 9 (0.8%)
Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname)
  Example: 黃哲志 (Tetz C. Huang)
  Dataset I: 3 (3.8%); Dataset II: 3 (0.4%)
Type-8. (Chinese given name) (Unpredictable Surname)
  Example: 張肇健 (Trieu-Kien Truong)
  Dataset I: 12 (15.4%); Dataset II: 4 (0.4%)
84
Testing Environment and Dataset 3/3
The alignment accuracy
• Proposed by Huang (2005).
• The probability of selecting the correct answers when the searched snippets contain the correct answers.
$$A_i = \frac{N_{cc}}{N_d}$$
where $A_i$ is the alignment accuracy of candidate i, $N_d$ is the number of testing data, and $N_{cc}$ is the number of correct translations.
Performance measurement: Top-1 to Top-5 alignment accuracy.
Results and Analysis 1/3
- Overall performance on Dataset I
70.5% top-1 accuracy
91% top-5 accuracy
86
Results and Analysis 2/3
- Overall performance on Dataset II
57.9% top-1 accuracy
86.2% top-5 accuracy
87
Results and Analysis 3/3
- Performance of each name type
Our system performs better on Type-1, Type-2, Type-4 and Type-6.
Name format: example
Type-1: 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang)
Type-2: 蔡丕裕 (Piyu Tsai)
Type-3: 賴友仁 (Eugene Lai)
Type-4: 劉立頌 (Alan Li-Sung liu), 陳嘉懿 (Jia-Yih Joy Chen)
Type-5: 洪英超 (I.-C. Hung)
Type-6: 曾秋蓉 (Judy C. R. Tseng)
Type-7: 黃哲志 (Tetz C. Huang)
Type-8: 張肇健 (Trieu-Kien Truong)
88
Discussions
Major reason for the low performance on Type-3, Type-5, Type-7 and Type-8
• The lack of Web information.
Usually more than one correct name translation is found for an input Chinese name
• The name ambiguity problem.
89
Limitations
Uncommon surname
Rely on Web resources
Search engine selection
No name disambiguation
90
Agenda
Introduction
Proposed Approach
Experiments
Conclusions
91
Conclusions
Mining information from Web corpora is effective for dealing with the personal name translation problem
The name ambiguity problem arises frequently
92
Thank You
Jan-Ming Ho
hoho@iis.sinica.edu.tw
Institute of Information Science
Academia Sinica