Page 1: Results on Tracks 1 and 2 of KDD Cup 2013

Results on Tracks 1 and 2 of KDD Cup 2013

Chih-Jen Lin

Department of Computer Science
National Taiwan University

Joint work with members of the team “Algorithm” from National Taiwan University

Chih-Jen Lin (National Taiwan Univ.) Aug 11, 2013 1 / 51

Page 2: Results on Tracks 1 and 2 of KDD Cup 2013

Outline

1 Introduction
2 Track 1: paper-author identification
    Feature generation
    Classification
    Ensemble and post-processing
    Results
3 Track 2: author disambiguation
    Strategies and architecture
    Implementation
    Ensemble and typo handling
    Analysis
4 Conclusions

Page 3: Results on Tracks 1 and 2 of KDD Cup 2013

Introduction


Page 4: Results on Tracks 1 and 2 of KDD Cup 2013

Introduction

Team Members

At National Taiwan University, we organized a course for KDD Cup 2013

Three instructors, three TAs, and 18 students

The 18 students were split into six sub-teams named after algorithms

A*, Binary-Search, Dijkstra, K-means, Quick-Sort, Simplex

Submission quotas were divided equally among the six sub-teams

Page 5: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification


Page 6: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification

Paper-author Identification

Given an (author, paper) pair, did the author write the paper?

What information do we have?

Author and paper profiles
Labeled (author, paper) pairs
    Confirmed: author wrote paper
    Deleted: author didn’t write paper

For a given (author, paper) pair, we use “target author” and “target paper” to distinguish them from other authors/papers

Page 7: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification

Paper-author Identification (Cont’d)

Submission: ranking query papers for each query author

Example: author 9417 has query papers 1, 2, 3, 6, and 9.

If 3, 6 are confirmed and 1, 2, 9 are deleted, we should submit “9417, 3 6 1 2 9”

MAP (Mean Average Precision) is the evaluation measure
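To make the evaluation measure concrete, here is a minimal sketch of average precision and MAP on the example above. The function names are illustrative, and the sketch assumes each ranking contains no duplicated paper IDs; it is not the official evaluation code.

```python
from typing import Dict, Sequence, Set


def average_precision(ranking: Sequence[int], confirmed: Set[int]) -> float:
    """Average precision of one author's ranked list of query papers."""
    if not confirmed:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, paper in enumerate(ranking, start=1):
        if paper in confirmed:
            hits += 1
            precision_sum += hits / rank  # precision at this confirmed paper
    return precision_sum / len(confirmed)


def mean_average_precision(rankings: Dict[int, Sequence[int]],
                           confirmed: Dict[int, Set[int]]) -> float:
    """Mean of the per-author average precisions."""
    aps = [average_precision(rankings[a], confirmed[a]) for a in rankings]
    return sum(aps) / len(aps)


# Author 9417: papers 3 and 6 are confirmed; 1, 2, and 9 are deleted.
best = average_precision([3, 6, 1, 2, 9], {3, 6})
worse = average_precision([1, 3, 6, 2, 9], {3, 6})
overall = mean_average_precision({9417: [3, 6, 1, 2, 9]}, {9417: {3, 6}})
```

Ranking the confirmed papers first gives an average precision of 1; pushing a deleted paper ahead of them lowers it.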

Page 8: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification

Internal Validation Set

We split Train.csv into internal training and validation sets because the number of submissions per day is limited.

This also avoids overfitting the leaderboard

[Figure: Train, Valid, and Test have a size ratio of 5 : 2 : 3; Train is randomly shuffled and split into internal train and internal valid at a ratio of 5 : 2]
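A shuffled 5 : 2 split could be sketched as follows. The function name, the fixed seed, and the choice of splitting unit (plain rows here) are assumptions; the slides do not say which unit was shuffled.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def internal_split(rows: Sequence[T], train_parts: int = 5,
                   valid_parts: int = 2, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the rows and split them train_parts : valid_parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = len(shuffled) * train_parts // (train_parts + valid_parts)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(700))  # stand-in for the rows of Train.csv
internal_train, internal_valid = internal_split(rows)
```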

Page 9: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification

System Overview

The list of 97 features is given in the paper

Page 10: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation


Page 11: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Features from Author Profiles

Given a query (author 1360414, paper 1841516), what information do we have about the author?

Author.csv: 1360414,Chih-Jen Lin,National Taiwan University

PaperAuthor.csv: 1841516,1360414,Chih-Jen Lin,"National Taiwan University, Taipei"

Distance between the target author’s names, affiliations, etc. in the two csv files ⇒ features to indicate consistency
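As an illustration of such consistency features, the sketch below computes a Levenshtein distance and a token-level Jaccard distance between the two records shown above. The helper names and the tokenization are assumptions; the actual feature definitions are in the paper.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lower-cased word sets."""
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


author_row = ("Chih-Jen Lin", "National Taiwan University")
paper_author_row = ("Chih-Jen Lin", "National Taiwan University, Taipei")

name_distance = levenshtein(author_row[0], paper_author_row[0])
affiliation_edit = levenshtein(author_row[1], paper_author_row[1])
affiliation_jaccard = jaccard_distance(author_row[1], paper_author_row[1])
```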

Page 12: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Features from Author Profiles (Cont’d)

We need to address two issues

Distance between full and abbreviated names
Western and eastern order of names
    Example: “Chih Jen Lin” and “Lin Chih Jen”

See paper for details
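One simple way to approach both issues is to normalize names before measuring distance. The sketch below is only an assumption about how this might be done (the paper has the actual procedure): `abbreviate` reduces all words but the last to initials, and `same_words_any_order` ignores word order.

```python
import re
from typing import List


def name_words(name: str) -> List[str]:
    """Lower-cased alphabetic words; hyphens and periods act as separators."""
    return re.findall(r"[a-z]+", name.lower())


def abbreviate(name: str) -> str:
    """Keep the last word and reduce the others to initials (western order assumed)."""
    words = name_words(name)
    if not words:
        return ""
    return " ".join([w[0] for w in words[:-1]] + [words[-1]])


def same_words_any_order(a: str, b: str) -> bool:
    """True if both names consist of the same words, regardless of order."""
    return sorted(name_words(a)) == sorted(name_words(b))


abbr_full = abbreviate("Chih Jen Lin")
abbr_short = abbreviate("C. J. Lin")
order_match = same_words_any_order("Chih Jen Lin", "Lin Chih Jen")
```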

Page 13: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Features from Author Profiles (Cont’d)


Page 14: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Features from Coauthors Names

Example: deleted paper 5633 of Li Zhang has two authors with the same name

The relation between the target author and the authors of the target paper can be used as features

Examples

1. Minimum name distance between the target author and authors of the target paper

2. Same as 1, but check abbreviated names

Page 15: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Features for Author/Paper Consistency

Information should be consistent across papers and authors

Examples

1. Maximum distance between the target author’s affiliation and the affiliations of co-authors in the target paper

2. Maximum distance between the target paper’s title and the titles of the target author’s papers

Page 16: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Missing Value Handling

Two empty strings have zero distance, so two missing values would look like a perfect match

d('Chih J Lin', 'C Jen Lin') ≥ d('', '')

Replace the distance between empty strings with a non-zero value

Distance       Value
Jaro           0.5
Jaccard        0.5
Levenshtein    average length of all entries

Missing-value indicators. Example: number of coauthors without affiliation information
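The replacement rule can be wrapped around any distance function. This is a sketch under assumptions: the helper names are made up, a token-level Jaccard distance stands in for the real features, and the case where only one string is empty is passed through unchanged because the slides do not cover it.

```python
from typing import Callable, Iterable, Optional


def token_jaccard_distance(a: str, b: str) -> float:
    """Plain Jaccard distance on word sets; two empty strings give 0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def distance_with_missing(a: str, b: str, dist: Callable[[str, str], float],
                          both_missing_value: float) -> float:
    """Use a fixed non-zero value when both strings are empty."""
    if not a and not b:
        return both_missing_value
    return dist(a, b)


def average_length(entries: Iterable[str]) -> float:
    """Replacement value for Levenshtein: mean length of the given entries."""
    lengths = [len(e) for e in entries]
    return sum(lengths) / len(lengths) if lengths else 0.0


def count_missing(values: Iterable[Optional[str]]) -> int:
    """Missing-value indicator: how many entries are empty or None."""
    return sum(1 for v in values if not v)


both_empty = distance_with_missing("", "", token_jaccard_distance, 0.5)
identical = distance_with_missing("ntu", "ntu", token_jaccard_distance, 0.5)
missing_affiliations = count_missing(["National Taiwan University", "", None])
```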

Page 17: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Features Using Publication Time

Examples

1. Earliest/latest publication year of target author
2. Publication year of target paper

Data cleaning:

Years outside [1800, 2013] are removed
Then we must handle missing values
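A minimal sketch of the year cleaning and the earliest/latest features follows. Representing a removed year as `None` is an assumption; how the missing values are filled afterwards is not shown on the slide.

```python
from typing import Iterable, Optional, Tuple


def clean_year(year: Optional[int], lo: int = 1800, hi: int = 2013) -> Optional[int]:
    """Return the year if it lies in [lo, hi]; otherwise treat it as missing."""
    if year is None or not lo <= year <= hi:
        return None
    return year


def author_year_range(years: Iterable[Optional[int]]) -> Tuple[Optional[int], Optional[int]]:
    """Earliest and latest valid publication year, or (None, None) if none is valid."""
    valid = [y for y in map(clean_year, years) if y is not None]
    if not valid:
        return None, None
    return min(valid), max(valid)


earliest, latest = author_year_range([0, 1999, 2005, 2050, None])
```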

Page 18: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Features Based on A Network

We construct a network of authors, papers, journals, and conferences

Page 19: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Feature generation

Features Based on A Network (Cont’d)

From the network we can extract features to describe node relationships

Examples

1. # of publications of the author
2. # of coauthored papers of the target author with all the coauthors of the target paper
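The two example features can be sketched on a small author-paper network built from (paper, author) pairs. The function names are illustrative, and the second feature is one possible reading of the slide: for each coauthor of the target paper, count the papers shared with the target author, then sum over coauthors.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def build_network(pairs: Iterable[Tuple[Hashable, Hashable]]):
    """Map each author to their papers and each paper to its authors."""
    papers_of: Dict[Hashable, Set] = defaultdict(set)
    authors_of: Dict[Hashable, Set] = defaultdict(set)
    for paper, author in pairs:
        papers_of[author].add(paper)
        authors_of[paper].add(author)
    return papers_of, authors_of


def num_publications(author, papers_of) -> int:
    """Feature 1: number of publications of the author."""
    return len(papers_of.get(author, set()))


def coauthored_paper_count(target_author, target_paper, papers_of, authors_of) -> int:
    """Feature 2 (assumed reading): shared papers summed over the target paper's coauthors."""
    coauthors = authors_of.get(target_paper, set()) - {target_author}
    mine = papers_of.get(target_author, set())
    return sum(len(mine & papers_of.get(c, set())) for c in coauthors)


pairs = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (3, "a"), (3, "c")]
papers_of, authors_of = build_network(pairs)
n_pubs = num_publications("a", papers_of)
n_shared = coauthored_paper_count("a", 1, papers_of, authors_of)
```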

Page 20: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Classification


Page 21: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Classification

Classification

Tree-based classifiers

    Random forests (RF)
    Gradient boosting decision tree (GBDT)
    LambdaMART (LM)

Classifier   Tree ensemble   # of trees   Parallel   MAP on Valid.csv
RF           bagging         12,000       yes        0.983340
GBDT         boosting        300          no         0.983046
LM           boosting        300          no         0.983047

RF is sensitive to the initial random seed. Using 12,000 trees stabilizes the results

Page 22: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Ensemble and post-processing


Page 23: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Ensemble and post-processing

Ensemble

Weighted average over RF, GBDT, and LambdaMART

We did not use more complicated settings such as regression because we have only three models

A simple grid search on weights

Final weights

RF: 5, GBDT: 1, and LambdaMART: 1.
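A weighted average with a small grid search can be sketched as below. The score vectors, the weight grid, and the `evaluate` callback are placeholders; in practice `evaluate` would compute MAP on the internal validation set.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def weighted_average(scores: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Blend per-model score vectors using normalized weights."""
    total = sum(weights)
    n = len(scores[0])
    return [sum(w * s[i] for w, s in zip(weights, scores)) / total for i in range(n)]


def grid_search(scores: Sequence[Sequence[float]],
                evaluate: Callable[[List[float]], float],
                grid: Sequence[int] = tuple(range(6))) -> Tuple[Tuple[int, ...], float]:
    """Try every weight combination on the grid and keep the best-scoring one."""
    best_weights, best_value = None, float("-inf")
    for weights in product(grid, repeat=len(scores)):
        if sum(weights) == 0:
            continue  # all-zero weights cannot be normalized
        value = evaluate(weighted_average(scores, weights))
        if value > best_value:
            best_weights, best_value = weights, value
    return best_weights, best_value


# Placeholder scores for two (author, paper) pairs from RF, GBDT, and LambdaMART.
rf, gbdt, lm = [0.9, 0.2], [0.5, 0.5], [0.1, 0.6]
blended = weighted_average([rf, gbdt, lm], (5, 1, 1))
```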

Page 24: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Ensemble and post-processing

Post-Processing

Our post-processing procedure is simple, but one thing to note is duplicated paper IDs.

Suppose an author has confirmed papers 1, 2, 2, 4 and deleted paper 3

The evaluation code seems to treat the second “2” as a deleted paper

Thus, MAP of 1,2,4,3,2 > MAP of 1,2,2,4,3

We move duplicated ones to the end
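The effect can be reproduced with a small sketch. The average-precision function below encodes the assumed evaluator behaviour (only the first occurrence of a paper can count as confirmed, and the sum is divided by the number of distinct confirmed papers); it is a guess at the behaviour described above, not the actual evaluation code.

```python
from typing import List, Sequence, Set


def move_duplicates_to_end(ranking: Sequence[int]) -> List[int]:
    """Keep first occurrences in order and append repeated IDs at the end."""
    seen: Set[int] = set()
    first, repeats = [], []
    for paper in ranking:
        if paper in seen:
            repeats.append(paper)
        else:
            seen.add(paper)
            first.append(paper)
    return first + repeats


def average_precision_first_hit(ranking: Sequence[int], confirmed: Set[int]) -> float:
    """Average precision where a repeated ID counts as a miss (assumed behaviour)."""
    remaining = set(confirmed)
    hits, total = 0, 0.0
    for rank, paper in enumerate(ranking, start=1):
        if paper in remaining:
            remaining.discard(paper)
            hits += 1
            total += hits / rank
    return total / len(confirmed) if confirmed else 0.0


original = [1, 2, 2, 4, 3]
fixed = move_duplicates_to_end(original)
ap_original = average_precision_first_hit(original, {1, 2, 4})
ap_fixed = average_precision_first_hit(fixed, {1, 2, 4})
```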

Page 25: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Results


Page 26: Results on Tracks 1 and 2 of KDD Cup 2013

Track 1: paper-author identification Results

Results

                         Public     Private
1st of public            0.98554    0.98100
12th of public (ours)    0.98235    0.98259 (1st)

Possible reasons for the best result in the end

Improvements after Valid.csv was released

1. Data cleaning: Unicode → ASCII
2. Missing-value handling (0.98334)
3. Ensemble (0.98390)

We didn’t give up even though we were 12th!

We effectively used the internal validation set to avoid overfitting

Page 27: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation


Page 28: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation

Author disambiguation

C. J. Smile Lin          C. J. Cry Lin
National Taiwan Univ.    Univ. of Michigan
LIBSVM Guide             LIBLINEAR Guide

Are they duplicates?

Page 29: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Strategies and architecture


Page 30: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Strategies and architecture

Main Strategies

Using string matching rather than other learning techniques

An author without any papers is treated as a single group without duplicates

Recognizing if an author is Chinese or not

Page 31: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Strategies and architecture

Architecture

[Figure: Implementation 1 and Implementation 2 feed into Ensemble, which produces Final]

Page 32: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Strategies and architecture

Results

Method              Public     Private
Implementation 1    0.99186    0.99198
Implementation 2    0.99071    0.99083
Final               0.99195    0.99202

Page 33: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Strategies and architecture

Framework of the Two Implementations

1. Cleaning: remove redundant information

2. Chinese-or-not: classify each author as Chinese or non-Chinese

3. Selection: select a set of candidates of possible duplicates for each author

4. Identification: identify duplicates from the set of candidates for each author

5. Splitting: split incorrect cases (not discussed here)

Page 34: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Strategies and architecture

Differences between Two Implementations

The basic elements are different

Implementation 1: author identifiers

Implementation 2: author names

Author identifier           1001
Name in Author.csv          Chih Jen Lin
Names in PaperAuthor.csv    C. J. Lin
                            Chih Jen Peter Lin
                            C. J. P. Lin

More (complicated) rules in Implementation 1

We focus on Implementation 1 because of limited time

Page 35: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation


Page 36: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Cleaning

Clean redundant information.

Examples:

CHih JEN LIn → chih jen lin
Mr. Chih Jen Lin → chih jen lin
Chih Jen Lin → chih jen lin
Chih Jen Bill Lin → chih jen william lin
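The examples above suggest lower-casing, dropping titles, and expanding nicknames. A minimal sketch, assuming small hypothetical lookup tables (only the “bill” → “william” mapping and the “Mr.” title come from the slide; the other titles are illustrative):

```python
# Hypothetical lookup tables; the real lists are in the team's code.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
NICKNAMES = {"bill": "william"}


def clean_name(name: str) -> str:
    """Lower-case, collapse whitespace, drop titles, and expand nicknames."""
    cleaned = []
    for word in name.lower().split():
        if word.rstrip(".") in TITLES:
            continue  # drop "Mr.", "Dr", ...
        cleaned.append(NICKNAMES.get(word, word))
    return " ".join(cleaned)


examples = ["CHih JEN LIn", "Mr. Chih Jen Lin", "Chih   Jen Lin", "Chih Jen Bill Lin"]
cleaned = [clean_name(n) for n in examples]
```

Periods on initials such as “C.” are deliberately kept, since the Chinese-or-not step distinguishes abbreviated words from full words.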

Page 37: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Chinese or not

Chinese and non-Chinese names are very different

No middle name in Chinese. “Chih Lin” and “Chih J. Lin” are likely different

Some Chinese last names like “Wang” are too common. Also, “林” and “藺” are both romanized to “Lin”

Page 38: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Chinese or not (Cont’d)

[Figure: a dictionary containing words such as wu, lin, juan, zhuang, chin, tung is used to check the names “C. J. Chris Peter Lin”, “C J L”, and “C. J. Lin”]

Using common Chinese last names and words as a dictionary

Page 39: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Chinese or not (Cont’d)

Check if the name contains words in our dictionary

A full word is a word without “.” and longer than one character

Examples:

No full word → Non-Chinese
    e.g., C J L
Only one full word and it is in the Chinese dictionary → Chinese
    e.g., C. J. Lin
More than one full word not in the Chinese dictionary → Non-Chinese
    e.g., C. J. Chris Peter Lin
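The three example rules can be written down directly. This sketch uses a tiny illustrative dictionary and returns `None` for names that none of the three rules covers, since the slides do not show the remaining rules.

```python
from typing import List, Optional, Set

# Illustrative dictionary only; the real one holds common Chinese last names and words.
CHINESE_WORDS: Set[str] = {"wu", "lin", "juan", "zhuang", "chin", "tung"}


def full_words(name: str) -> List[str]:
    """Words without '.' and longer than one character."""
    return [w for w in name.lower().split() if "." not in w and len(w) > 1]


def chinese_or_not(name: str, dictionary: Set[str] = CHINESE_WORDS) -> Optional[str]:
    """Apply the three example rules; None means no listed rule applies."""
    words = full_words(name)
    if not words:
        return "non-chinese"
    if len(words) == 1 and words[0] in dictionary:
        return "chinese"
    unknown = [w for w in words if w not in dictionary]
    if len(unknown) > 1:
        return "non-chinese"
    return None


labels = [chinese_or_not(n) for n in ("C J L", "C. J. Lin", "C. J. Chris Peter Lin")]
```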

Page 40: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Selection

Find candidates of duplicates so that later comparisons take linear rather than quadratic time

Links indicate candidates

Each author generates several keys. “Chih Jen Lin” has:

“Chih” “Jen” “Lin” “Chih Jen”
“Jen Lin” “Chih Lin” “Chih Jen Lin”
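The seven keys listed above are exactly the non-empty, order-preserving combinations of the name's words, so key generation can be sketched with `itertools.combinations`. Whether the actual implementation generates keys this way for longer names is an assumption.

```python
from itertools import combinations
from typing import List


def generate_keys(name: str) -> List[str]:
    """All non-empty combinations of the name's words, keeping word order."""
    words = name.split()
    keys = []
    for size in range(1, len(words) + 1):
        for combo in combinations(words, size):
            keys.append(" ".join(combo))
    return keys


keys = generate_keys("Chih Jen Lin")
```

A name with n words yields 2^n − 1 keys, which stays small for typical names.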

Page 41: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Selection − Chih Jen Lin’s candidates

One author is a candidate of another if the two share a key. Common keys are ignored.

[Figure: author network with links from Chih Jen Lin to its candidates. Authors shown: Chih Jen Lin, Shou De Lin, Hsuan Tien Lin, Chien Chih Wang, Peng Jen Chen, Yong Zhaung, Felix Wu, Hsiao Yu Tung, Lin Chih Jen, C. J. Lin, Yu Chin Juan, Wei Sheng Chin]
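Candidate selection by shared keys can be sketched with an inverted index from keys to authors. The threshold `max_key_count` used to decide that a key is too common is a made-up parameter; the slides only say that common keys are ignored.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set


def keys_of(name: str) -> Set[str]:
    """Order-preserving word combinations of a lower-cased name."""
    words = name.lower().split()
    return {" ".join(c) for n in range(1, len(words) + 1)
            for c in combinations(words, n)}


def select_candidates(names: Dict[int, str], max_key_count: int = 3) -> Dict[int, Set[int]]:
    """Authors sharing a non-common key become candidates of each other."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for author_id, name in names.items():
        for key in keys_of(name):
            index[key].add(author_id)
    candidates: Dict[int, Set[int]] = {a: set() for a in names}
    for group in index.values():
        if len(group) > max_key_count:
            continue  # key shared by too many authors: treated as common
        for author_id in group:
            candidates[author_id] |= group - {author_id}
    return candidates


names = {1: "Chih Jen Lin", 2: "Lin Chih Jen", 3: "Shou De Lin", 4: "Hsuan Tien Lin",
         5: "Peng Jen Chen", 6: "Felix Wu", 7: "Chien Chih Wang"}
candidates = select_candidates(names)
```

In this toy data the key “lin” is shared by four authors and is skipped, so Shou De Lin and Hsuan Tien Lin do not become candidates merely because of the common last name.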

Page 42: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Selection − Chih Jen Lin’s candidates


Page 43: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Identification

Find duplicates from candidates

[Figure: candidate links marked “?” are resolved by the matching functions and then the dry-run]

Page 44: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Identification − Matching Functions

13 matching functions

1. Two authors have the same words
2. (Non-Chinese only) Only one author has a middle name and their last names differ in the last two characters
3. ...

Examples:

Two names                            Matching function
Chih Jen Lin, Lin Chih Jen           Fun. 1
Michael I. Jordan, Michael Jordan    Fun. 2

Page 45: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Implementation

Identification − Dry-run

Making corrections, as matching functions may wrongly identify duplicates

Check if two names are “loosely identical”

Examples:

Potential duplicates                              Pass
C. J. Lin, Chih Jen Lin, Chih Lin, Chen Ju Lin    No
C. J. Lin, Chih Jen Lin, Chih J. Lin              Yes
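The slides do not define “loosely identical”, so the sketch below uses a hypothetical rule chosen only to be consistent with the two example rows: two names are compatible if they have the same number of words and each aligned pair of words is either equal or one is the initial of the other. A group passes the dry-run if all pairs are compatible.

```python
from itertools import combinations
from typing import Sequence


def words_compatible(a: str, b: str) -> bool:
    """Equal words, or one word is a single-letter initial of the other."""
    a, b = a.rstrip("."), b.rstrip(".")
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def loosely_identical(name_a: str, name_b: str) -> bool:
    """Hypothetical rule: same word count and position-wise compatible words."""
    wa, wb = name_a.lower().split(), name_b.lower().split()
    return len(wa) == len(wb) and all(words_compatible(x, y) for x, y in zip(wa, wb))


def dry_run_pass(group: Sequence[str]) -> bool:
    """A group of potential duplicates passes if every pair is loosely identical."""
    return all(loosely_identical(a, b) for a, b in combinations(group, 2))


row1 = dry_run_pass(["C. J. Lin", "Chih Jen Lin", "Chih Lin", "Chen Ju Lin"])
row2 = dry_run_pass(["C. J. Lin", "Chih Jen Lin", "Chih J. Lin"])
```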

Page 46: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Ensemble and typo handling


Page 47: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Ensemble and typo handling

Ensemble

Method              Author identifier    Duplicates
Implementation 1    10                   10,11
Implementation 2    10                   10,11,12,13,14
Ensembled           10                   10,11,12,14

Implementation 1 is taken as the major prediction

{12, 13, 14} are additional duplicates from Implementation 2

Check if each of (10, 12), (10, 13), (10, 14) has similar affiliations or fields
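The merging rule in the table can be sketched as follows. The `is_similar` callback stands in for the affiliation/field check, whose details are not given on the slide; here it is a hard-coded placeholder that accepts 12 and 14.

```python
from typing import Callable, Dict, Set


def ensemble_duplicates(major: Dict[int, Set[int]], minor: Dict[int, Set[int]],
                        is_similar: Callable[[int, int], bool]) -> Dict[int, Set[int]]:
    """Start from the major predictions and add extra duplicates that pass the check."""
    result = {}
    for author, duplicates in major.items():
        merged = set(duplicates)
        for extra in minor.get(author, set()) - merged:
            if is_similar(author, extra):
                merged.add(extra)
        result[author] = merged
    return result


implementation_1 = {10: {10, 11}}
implementation_2 = {10: {10, 11, 12, 13, 14}}
# Placeholder check: pretend only authors 12 and 14 have similar affiliations or fields.
ensembled = ensemble_duplicates(implementation_1, implementation_2,
                                lambda author, other: other in {12, 14})
```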

Page 48: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Analysis


Page 49: Results on Tracks 1 and 2 of KDD Cup 2013

Track 2: author disambiguation Analysis

Analysis

We conducted some analyses after the competition. We thank Kaggle for re-opening the submission site

Method                    Public     Private
Final                     0.99195    0.99202
Implementation 1          0.99186    0.99198
Without Chinese-or-not    0.99109    0.99125
Without dry-run           0.99097    0.99112
Without both              0.98891    0.98934

Splitting Chinese/non-Chinese and the dry-run function in the identification stage are both useful

Page 50: Results on Tracks 1 and 2 of KDD Cup 2013

Conclusions


Page 51: Results on Tracks 1 and 2 of KDD Cup 2013

Conclusions

Conclusions

Our code is available at

github.com/kdd-cup-2013-ntu/track1

github.com/kdd-cup-2013-ntu/track2

Papers are at www.csie.ntu.edu.tw/~cjlin/papers/kddcup2013/kddcup2013track1.pdf and kddcup2013track2.pdf

We thank the organizers and National Taiwan University for their support