LEARNING-BASED ENTITY RESOLUTION WITH MAPREDUCE
Lars Kolb, Hanna Köpcke, Andreas Thor, Erhard Rahm

Database Group Leipzig
http://dbs.uni-leipzig.de

Glasgow, CloudDB 2011

ENTITY RESOLUTION
• Identification of semantically equivalent entities
• Within one data source or between different sources
• To merge them, compare them, improve data quality, etc.

Duplicates due to
• Order of authors
• Extraction errors
• Different titles
• Typos
• …

ENTITY RESOLUTION (2)
• A lot of research work
• Pairwise entity comparison
• Application of multiple similarity measures on several attributes
• Combination of similarity values to a match decision for each entity pair
• Hard to configure the combination of similarity values manually

• Study of real-world match systems/problems [VLDB’10]
• Effective matching is difficult – F-Measure <75% for product data
• Matching is expensive – scalability issues for O(n²)
• Learning-based approaches automate the combination of similarity values but come with poor efficiency

[VLDB’10] Koepcke, Thor, Rahm: Evaluation of entity resolution approaches on real-world match problems. VLDB 2010


LEARNING-BASED ENTITY RESOLUTION
• Based on training data, entity pairs are classified as match/non-match
• Pairwise similarity values serve as features for classification


[Figure: two-phase workflow]

Phase 1: Training
Training data R×S (attributes A1 … Au of R, A1 … Av of S, match label true/false) → Similarity computation → Training Similarities → Classifier Training → Classifier

sim1 … simk  match
0.8  … 0.7   true
0.4  … 0.6   false

Phase 2: Application
R, S → Similarity computation → Classifier Application → Match Result (idR, idS)

idR  idS  sim1 … simk  match
...  ...  0.5  … 0.6   ?
...  ...  0.8  … 0.9   ?

Observations
• Training phase < 5%
• Similarity computation accounts for 95% of the Application phase
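The two phases can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the two similarity functions, the attribute names `title` and `authors`, and the one-level decision rule (`train_stump`) are assumptions chosen to keep the example self-contained. The talk itself uses Decision Tree and SVM classifiers from WEKA. The training and application rows reuse the example similarity values shown on the slide, treating the two visible columns as the only features.

```python
from difflib import SequenceMatcher


def edit_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (hypothetical sim1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased word sets (hypothetical sim2)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def features(e_r: dict, e_s: dict) -> list[float]:
    """Similarity computation: one feature vector per entity pair."""
    return [
        edit_ratio(e_r["title"], e_s["title"]),
        token_jaccard(e_r["authors"], e_s["authors"]),
    ]


def train_stump(rows: list[list[float]], labels: list[bool]) -> tuple[int, float]:
    """Phase 1: pick the (feature, threshold) rule with the fewest training errors."""
    best = (len(rows) + 1, 0, 0.0)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            errors = sum((row[f] >= t) != y for row, y in zip(rows, labels))
            best = min(best, (errors, f, t))
    return best[1], best[2]


def classify(stump: tuple[int, float], row: list[float]) -> bool:
    """Phase 2: apply the learned rule to one feature vector."""
    f, t = stump
    return row[f] >= t


# Training similarities and unlabeled rows as shown on the slide.
stump = train_stump([[0.8, 0.7], [0.4, 0.6]], [True, False])
predictions = [classify(stump, row) for row in [[0.5, 0.6], [0.8, 0.9]]]

# Similarity computation for one made-up entity pair.
same = {"title": "Entity Resolution", "authors": "Kolb Thor Rahm"}
pair_features = features(same, same)
```

In this sketch, `features` is the step that the observations above identify as dominating the runtime, which is why the rest of the talk concentrates on distributing it.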


OUTLINE

• Motivation

• MapReduce

• Strategies for similarity computation and classifier application on Cartesian product of two data sources with MapReduce
• Solely in map phase (“Broadcast Join”) → MapSide
• Even distribution of entity pairs across reduce tasks → ReduceSplit

• Experimental Results
• Conclusions & Future Work


MAPREDUCE
• Programming model for distributed computation in cluster environments
• UDF map applied on each input entity which outputs key-value pairs
• UDF part applied on key of map output pairs assigns each pair to a reduce task
• UDF group applied on key to group key-value pairs
• UDF reduce invoked for each group
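The interplay of the four UDFs can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not Hadoop code: `run_mapreduce` and the word-count UDFs (`wc_map`, `wc_part`, `wc_group`, `wc_reduce`) are hypothetical names, and tasks run sequentially rather than in parallel.

```python
from itertools import groupby


def run_mapreduce(inputs, map_fn, part_fn, group_fn, reduce_fn, r):
    """Simulate one map task per input partition and r reduce tasks."""
    reduce_inputs = [[] for _ in range(r)]
    for partition in inputs:                       # one map task per partition
        for entity in partition:
            for key, value in map_fn(entity):      # map emits key-value pairs
                # part(key) must return a reduce task index in [0, r-1]
                reduce_inputs[part_fn(key, r)].append((key, value))
    output = []
    for pairs in reduce_inputs:                    # one reduce task per list
        pairs.sort(key=lambda kv: kv[0])           # sort by key before grouping
        for _, group in groupby(pairs, key=lambda kv: group_fn(kv[0])):
            output.extend(reduce_fn(list(group)))  # reduce runs once per group
    return output


# Hypothetical word-count UDFs, used only to exercise the four hooks.
def wc_map(line):
    return [(word, 1) for word in line.split()]


def wc_part(key, r):
    return sum(map(ord, key)) % r   # deterministic stand-in for a hash partitioner


def wc_group(key):
    return key                      # group by the full key


def wc_reduce(group):
    return [(group[0][0], sum(value for _, value in group))]


counts = sorted(
    run_mapreduce([["a b a"], ["b c"], ["c c"]], wc_map, wc_part, wc_group, wc_reduce, r=3)
)
```

Keeping `part` and `group` as separate hooks matters for the strategies that follow: a composite key can be partitioned and grouped on only some of its components while the full key still controls the sort order.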

[Figure: MapReduce dataflow. Input data is processed by map tasks (m=3); part(key) ∈ [0, r-1] assigns each map output pair to one of the reduce tasks (r=3).]


DISTRIBUTED EVALUATION OF THE CARTESIAN PRODUCT
• Pairwise entity comparison requires distribution of entity pairs to computing tasks/nodes

classifier.classify(sim1(eR,eS), sim2(eR,eS), …, simk(eR,eS)) = “match”

[Figure: two ways to distribute R × S]
• Split R in x blocks (x=2), split S in y blocks (y=2), replicate each R-block y times, replicate each S-block x times → x*y “match tasks”
• Split S in x blocks (x=2), replicate R x times → x “match tasks”


MAPSIDE (m=3)
• Map tasks buffer R in memory at initialization time
• Each map task operates on a partition of the larger data source S
• map(entity) – match currently processed entity of S with all buffered entities of R


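[Figure: MapSide example, R = {a, b} buffered by every map task]
map0: S partition {c, d} → Pairs a-c, b-c, a-d, b-d
map1: S partition {e, f} → Pairs a-e, b-e, a-f, b-f
map2: S partition {g, h} → Pairs a-g, b-g, a-h, b-h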
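The MapSide strategy can be sketched as follows. The function name `mapside` and the `is_match` callback are hypothetical. In the talk's setting, the callback would wrap similarity computation and classifier application. The simulated map tasks run one after another here, and the inputs reproduce the slide's example with R = {a, b} and three partitions of S.

```python
def mapside(r_entities, s_partitions, is_match):
    """Return the matching pairs found by each simulated map task."""
    per_task = []
    for s_partition in s_partitions:      # one map task per partition of S
        buffer_r = list(r_entities)       # R is buffered at task initialization
        matches = [
            (e_r, e_s)
            for e_s in s_partition        # map(entity) is called for each entity of S
            for e_r in buffer_r           # compare against every buffered entity of R
            if is_match(e_r, e_s)
        ]
        per_task.append(matches)
    return per_task


# Accept every pair so that the output lists all comparisons per map task.
pairs_per_task = mapside(
    ["a", "b"],
    [["c", "d"], ["e", "f"], ["g", "h"]],
    is_match=lambda e_r, e_s: True,
)
```

Because every task holds its own copy of R, memory use grows with the number of map tasks per node, which is the constraint discussed in the comparison with ReduceSplit.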


REDUCESPLIT
• R is split in x blocks, S is split in y blocks
• All x blocks of R are compared with all y blocks of S
• Implementation
• Composite map output keys

               | Assigned block index (random) | Outputted key-value pairs
Entity e of R  | i ∈ [0, x-1]                  | y pairs (i.j.R, e) for j ∈ [0, y-1]
Entity e of S  | j ∈ [0, y-1]                  | x pairs (i.j.S, e) for i ∈ [0, x-1]

Partitioning function: part(i.j.source) = (i + j·x) mod r

• Grouping by i.j → invocation of reduce per group
• Entities of R appear before entities of S in the list of entities
• Reduce tasks buffer entities of R and match each entity of S with buffer

Example reduce task assignment of part for x=2, y=3, r=3

       j=0  j=1  j=2
i=0     0    2    1
i=1     1    0    2

[Figure: blocks R0, R1, …, Rx-1 and S0, S1, …, Sy-1. An entity e of block R1 is output as (1.0.R, e), (1.1.R, e), …, (1.y-1.R, e); an entity e of block Sy-1 is output as (0.y-1.S, e), (1.y-1.S, e), …, (x-1.y-1.S, e).]


REDUCESPLIT (m=3, r=3, x=2, y=3)

Map (Key = IndexR.IndexS.Source)
Input R {a, b}:    (0.0.R, a) (0.1.R, a) (0.2.R, a) (1.0.R, b) (1.1.R, b) (1.2.R, b)
Input S {c, d, e}: (0.0.S, c) (1.0.S, c) (0.1.S, d) (1.1.S, d) (0.2.S, e) (1.2.S, e)
Input S {f, g, h}: (0.0.S, f) (1.0.S, f) (0.1.S, g) (1.1.S, g) (0.2.S, h) (1.2.S, h)

Partitioning by (IndexR + IndexS*x) modulo r

Reduce (Group By: IndexR.IndexS)
reduce0: (0.0.R, a) (0.0.S, c) (0.0.S, f) | (1.1.R, b) (1.1.S, d) (1.1.S, g) → Pairs a-c, a-f, b-d, b-g
reduce1: (0.2.R, a) (0.2.S, e) (0.2.S, h) | (1.0.R, b) (1.0.S, c) (1.0.S, f) → Pairs a-e, a-h, b-c, b-f
reduce2: (0.1.R, a) (0.1.S, d) (0.1.S, g) | (1.2.R, b) (1.2.S, e) (1.2.S, h) → Pairs a-d, a-g, b-e, b-h
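The ReduceSplit walkthrough can be replayed with a short simulation. This is a sketch under assumptions: the function names are hypothetical, the block assignment is fixed by hand to mirror the slide (the talk assigns block indexes randomly), and `is_match` stands in for similarity computation plus classification.

```python
def part(i, j, x, r):
    """Partitioning function from the slide: (i + j*x) mod r."""
    return (i + j * x) % r


def reducesplit(r_blocks, s_blocks, r, is_match):
    """r_blocks[i] and s_blocks[j] list the entities assigned to each block."""
    x, y = len(r_blocks), len(s_blocks)

    # Map phase: replicate each entity under composite keys (i, j, source).
    map_output = []
    for i, block in enumerate(r_blocks):
        for e in block:
            map_output += [((i, j, "R"), e) for j in range(y)]   # y pairs per R entity
    for j, block in enumerate(s_blocks):
        for e in block:
            map_output += [((i, j, "S"), e) for i in range(x)]   # x pairs per S entity

    # Partitioning: only the block indexes i and j decide the reduce task.
    tasks = [[] for _ in range(r)]
    for key, e in map_output:
        tasks[part(key[0], key[1], x, r)].append((key, e))

    # Reduce phase: within each (i, j) group, "R" sorts before "S".
    per_task = []
    for pairs in tasks:
        pairs.sort(key=lambda kv: kv[0])
        buffers, matches = {}, []
        for (i, j, source), e in pairs:
            if source == "R":
                buffers.setdefault((i, j), []).append(e)          # buffer the R block
            else:
                matches += [(e_r, e) for e_r in buffers.get((i, j), []) if is_match(e_r, e)]
        per_task.append(matches)
    return per_task, len(map_output)


assignment = [[part(i, j, 2, 3) for j in range(3)] for i in range(2)]
pairs_per_task, n_map_output = reducesplit(
    r_blocks=[["a"], ["b"]],                          # a in R0, b in R1
    s_blocks=[["c", "f"], ["d", "g"], ["e", "h"]],    # S0, S1, S2
    r=3,
    is_match=lambda e_r, e_s: True,                   # accept all pairs to list comparisons
)
```

With these inputs, each reduce task receives two of the six (i, j) groups and performs four comparisons, so the twelve comparisons of the Cartesian product are spread evenly across the three reduce tasks.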


MAPSIDE VS. REDUCESPLIT

• MapSide requires that R fits entirely in the main memory available per map task (multiple map tasks per node!)
• No data redistribution, sorting, grouping and reduce task scheduling

• With ReduceSplit, only |R|/x entities need to be buffered
• At the expense of data replication (|R|*y + |S|*x map output pairs)
• Careful choice of x, y is crucial for performance
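The trade-off can be made concrete with a small calculation. The function below is a hypothetical helper that only evaluates the two expressions from the bullets above. The source sizes are the dataset sizes reported in the experiments (2,600 and 64,000), assuming the smaller source plays the role of R. The values x=2 and y=3 are illustrative and not taken from the evaluation.

```python
import math


def reducesplit_cost(size_r: int, size_s: int, x: int, y: int) -> dict:
    """Buffer size and replication volume of ReduceSplit for given x and y."""
    return {
        # About |R|/x entities per block; rounded up as an approximation,
        # since random block assignment only balances blocks on average.
        "buffered_r_entities": math.ceil(size_r / x),
        # Every R entity is emitted y times, every S entity x times.
        "map_output_pairs": size_r * y + size_s * x,
    }


cost = reducesplit_cost(size_r=2_600, size_s=64_000, x=2, y=3)
mapside_buffer = 2_600   # MapSide buffers all of R in every map task
```

In this sketch, halving the buffer costs 135,800 map output pairs that must be shuffled and sorted, whereas MapSide produces no intermediate data at all. Increasing x shrinks the buffer further but multiplies the replication of the larger source S.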



EXPERIMENTAL RESULTS – MATCH QUALITY
• Bibliographic datasets – DBLP (2,600) vs. GoogleScholar (64,000)
• Up to six matchers
• Two classifiers – Decision Tree and Support Vector Machine from WEKA

• Employing multiple matchers increases overall match quality (F-measure)
• Especially true if additional matchers operate on different attributes



EXPERIMENTAL RESULTS – TIME DISTRIBUTION
• Evaluation of the runtime using MapSide
• Same match problem
• 10 Amazon EC2 High-CPU Medium instances (each with two virtual cores)

• Generally multiple matchers increase match quality
• At the expense of runtime
• Similarity computation consumes between 88% and 97% of overall runtime depending on number of matchers



EXPERIMENTAL RESULTS – SCALABILITY

• MapSide with n = 1…50 dual-core VMs
• Almost linear speedup for up to 10 nodes
• Still good speedup values for more nodes (e.g. ≈40 for n=50)
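To put the reported speedup in perspective, it can be converted into parallel efficiency, using the standard definition efficiency = speedup / n. The slide does not state efficiency itself; only the speedup of about 40 for n=50 comes from the talk, and the helper names are hypothetical.

```python
def speedup(t_one_node: float, t_n_nodes: float) -> float:
    """Speedup of an n-node run relative to the single-node runtime."""
    return t_one_node / t_n_nodes


def parallel_efficiency(speedup_value: float, nodes: int) -> float:
    """Fraction of ideal linear speedup that is actually achieved."""
    return speedup_value / nodes


efficiency_50 = parallel_efficiency(40, 50)   # slide value: speedup of about 40 for n=50
```

Under this definition, a speedup of about 40 on 50 nodes corresponds to roughly 80% efficiency.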



CONCLUSIONS
• Learning-based Entity Resolution with MapReduce
• Two different strategies for evaluation of Cartesian product of two input sources
• MapSide – similarity computation solely during map phase
• ReduceSplit – distribution of Cartesian product evaluation evenly across all reduce tasks
• Evaluation of the proposed approaches

• Future work
• Incorporate blocking strategies
• Analysis of learned model to avoid application of all matchers



THANK YOU FOR YOUR ATTENTION
