LEARNING-BASED ENTITY RESOLUTION WITH MAPREDUCE
Lars Kolb, Hanna Köpcke, Andreas Thor, Erhard Rahm
Database Group Leipzig
http://dbs.uni-leipzig.de
Glasgow, CloudDB 2011
ENTITY RESOLUTION
• Identification of semantically equivalent entities
• Within one data source or between different sources
• To merge them, compare them, improve data quality, etc.
Duplicates due to
• Order of authors
• Extraction errors
• Different titles
• Typos
• …
ENTITY RESOLUTION (2)
• A lot of research work
• Pairwise entity comparison
• Application of multiple similarity measures on several attributes
• Combination of similarity values into a match decision for each entity pair
• Hard to configure the combination of similarity values manually
• Study of real-world match systems/problems [VLDB’10]
• Effective matching is difficult: F-measure < 75% for product data
• Matching is expensive: scalability issues due to O(n²) pairwise comparisons
• Learning-based approaches automate the combination of similarity values but come with poor efficiency
[VLDB’10] Koepcke, Thor, Rahm: Evaluation of entity resolution approaches on real-world match problems. VLDB 2010
LEARNING-BASED ENTITY RESOLUTION
• Based on training data, entity pairs are classified as match / non-match
• Pairwise similarity values serve as features for classification
[Figure: two-phase workflow]
Phase 1: Training
• Training data: entity pairs from R and S (attributes A1 … Au and A1 … Av) labeled match = true/false
• Similarity computation yields training similarities (sim1 … simk, match), e.g. (0.8 … 0.7, true) and (0.4 … 0.6, false)
• Classifier training yields the classifier
Phase 2: Application
• Similarity computation on R and S yields rows (idR, idS, sim1 … simk, match = ?), e.g. (0.5 … 0.6, ?) and (0.8 … 0.9, ?)
• Classifier application yields the match result (idR, idS)
Observations
• Training phase < 5%
• Similarity computation accounts for 95% of the application phase
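The two phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matchers `sim_title` and `sim_year`, the record fields, and the single-threshold "decision stump" learner are hypothetical stand-ins (the talk itself uses Decision Tree and SVM classifiers from WEKA). The training rows reuse the similarity values shown on the slide.

```python
from difflib import SequenceMatcher

# Hypothetical matchers; each returns a similarity value in [0, 1].
def sim_title(a, b):
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()

def sim_year(a, b):
    return 1.0 if a["year"] == b["year"] else 0.0

MATCHERS = [sim_title, sim_year]

def features(e_r, e_s):
    """Similarity computation: one feature per matcher."""
    return [sim(e_r, e_s) for sim in MATCHERS]

def train_stump(rows):
    """Phase 1 (stand-in learner): pick the feature index k and threshold t
    that classify the most training rows correctly with 'feature[k] >= t'."""
    best = None
    for k in range(len(rows[0][0])):
        for t in sorted({f[k] for f, _ in rows}):
            correct = sum((f[k] >= t) == label for f, label in rows)
            if best is None or correct > best[0]:
                best = (correct, k, t)
    return best[1], best[2]

def classify(model, fv):
    k, t = model
    return fv[k] >= t

def match(R, S, model):
    """Phase 2: similarity computation + classifier application on R x S."""
    return [(r["id"], s["id"]) for r in R for s in S
            if classify(model, features(r, s))]

# Training similarities taken from the slide: (sim1, simk, match).
TRAINING = [([0.8, 0.7], True), ([0.4, 0.6], False)]
MODEL = train_stump(TRAINING)
```

Note that `match` evaluates every matcher on every pair of the Cartesian product, which is the step the observations above identify as dominating the application phase.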
OUTLINE
• Motivation
• MapReduce
• Strategies for similarity computation and classifier application on the Cartesian product of two data sources with MapReduce
• MapSide: solely in the map phase (“Broadcast Join”)
• ReduceSplit: even distribution of entity pairs across reduce tasks
• Experimental Results
• Conclusions & Future Work
MAPREDUCE
• Programming model for distributed computation in cluster environments
• UDF map is applied to each input entity and outputs key-value pairs
• UDF part is applied to the key of each map output pair and assigns the pair to a reduce task
• UDF group is applied to the key to group key-value pairs
• UDF reduce is invoked for each group
[Figure: MapReduce data flow with m=3 map tasks (map0, map1, map2) and r=3 reduce tasks (reduce0, reduce1, reduce2); part(key) yields a reduce task index in [0, r-1]]
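The interplay of the four UDFs can be imitated on a single machine. The sketch below is a toy, sequential simulation (not Hadoop or any real framework API); the function name `run_mapreduce` and the word-count UDFs are assumptions chosen only to show where map, part, group, and reduce plug in.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, part_fn, group_fn, reduce_fn, r):
    """Sequentially simulate one MapReduce job with r reduce tasks."""
    # Map phase: map_fn emits key-value pairs, part_fn picks the reduce task.
    reduce_inputs = [[] for _ in range(r)]
    for record in inputs:
        for key, value in map_fn(record):
            reduce_inputs[part_fn(key, r)].append((key, value))

    # Reduce phase: sort by key, group with group_fn, call reduce_fn per group.
    output = []
    for pairs in reduce_inputs:
        pairs.sort(key=lambda kv: kv[0])
        groups = defaultdict(list)
        for key, value in pairs:
            groups[group_fn(key)].append(value)
        for group_key, values in groups.items():
            output.extend(reduce_fn(group_key, values))
    return output

# Hypothetical word-count UDFs, used only to exercise the simulation.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_part(key, r):
    return sum(key.encode()) % r   # deterministic stand-in for a hash function

def wc_group(key):
    return key

def wc_reduce(word, counts):
    return [(word, sum(counts))]

WORD_COUNTS = run_mapreduce(["a b", "a"], wc_map, wc_part, wc_group, wc_reduce, r=2)
```

Separating `part_fn` from `group_fn` matters for the strategies that follow: a reduce task may receive several groups, and the partitioning function alone decides how evenly work is spread.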
DISTRIBUTED EVALUATION OF THE CARTESIAN PRODUCT
• Pairwise entity comparison requires distribution of entity pairs to computing tasks/nodes
• Match decision per pair: classifier.classify(sim1(eR,eS), sim2(eR,eS), …, simk(eR,eS)) = “match”
[Figure: two ways to distribute R × S]
• Split S in x blocks (x=2), replicate R x times: x “match tasks”
• Split R in x blocks (x=2), split S in y blocks (y=2), replicate each R-block y times and each S-block x times: x*y “match tasks”
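Both distribution schemes can be expressed with one helper, since the first scheme is the second with R left unsplit. The following sketch assumes a round-robin block split, which is an arbitrary choice for illustration; the names `split` and `match_tasks` are hypothetical.

```python
from itertools import product

def split(entities, n):
    """Round-robin split into n blocks (assumed splitting strategy)."""
    return [entities[i::n] for i in range(n)]

def match_tasks(R, S, x, y):
    """Pair every R-block with every S-block: x*y match tasks.
    Each R-block appears in y tasks, each S-block in x tasks."""
    return list(product(split(R, x), split(S, y)))

def pairs_of(tasks):
    """All entity pairs compared by the given match tasks."""
    return [(e_r, e_s) for r_block, s_block in tasks
            for e_r in r_block for e_s in s_block]

# Scheme 1: R unsplit and replicated, S in 2 blocks -> 2 match tasks.
TASKS_REPLICATE_R = match_tasks(list("ab"), list("cdef"), 1, 2)
# Scheme 2: R in 2 blocks, S in 2 blocks -> 4 match tasks.
TASKS_SPLIT_BOTH = match_tasks(list("ab"), list("cdef"), 2, 2)
```

In both cases every pair of the Cartesian product is covered exactly once; the schemes differ only in how many tasks there are and how much of R each task has to hold.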
MAPSIDE (m=3)
• Map tasks buffer R in memory at initialization time
• Each map task operates on a partition of the larger data source S
• map(entity): match the currently processed entity of S with all buffered entities of R
[Figure: MapSide example; R = {a, b} is buffered by every map task]
• map0, S partition {c, d}: pairs a-c, b-c, a-d, b-d
• map1, S partition {e, f}: pairs a-e, b-e, a-f, b-f
• map2, S partition {g, h}: pairs a-g, b-g, a-h, b-h
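A minimal, sequential sketch of the MapSide idea follows. It assumes the map tasks run one after another in a single process and that `is_match` wraps similarity computation plus classifier application; both the function name `mapside` and that callback are illustrative assumptions, not the authors' implementation.

```python
def mapside(R, S_partitions, is_match):
    """Simulate MapSide: one map task per partition of S, no reduce phase."""
    results = []
    for partition in S_partitions:          # one map task per S partition
        buffered_R = list(R)                # buffered at task initialization
        for e_s in partition:               # map(entity) for each entity of S
            for e_r in buffered_R:          # compare with every buffered R entity
                if is_match(e_r, e_s):
                    results.append((e_r, e_s))
    return results

# Reproduce the slide's example; accepting every pair shows which
# comparisons each map task performs.
COMPARED = mapside(["a", "b"], [["c", "d"], ["e", "f"], ["g", "h"]],
                   lambda e_r, e_s: True)
```

The copy of R made per task mirrors the memory requirement of the strategy: every map task holds all of R, regardless of how small its partition of S is.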
REDUCESPLIT
• R is split in x blocks, S is split in y blocks
• All x blocks of R are compared with all y blocks of S
• Implementation
• Composite map output keys of the form i.j.source
• Grouping by i.j: one invocation of reduce per group
• Entities of R appear before entities of S in the list of entities
• Reduce tasks buffer entities of R and match each entity of S with the buffer
Map output per entity:
Entity e of R: assigned block index (random) i ∈ [0, x-1]; outputs y pairs (i.j.R, e) for j ∈ [0, y-1]
Entity e of S: assigned block index (random) j ∈ [0, y-1]; outputs x pairs (i.j.S, e) for i ∈ [0, x-1]
Partitioning function: part(i.j.source) = (i + j·x) mod r
Example reduce task assignment of part for x=2, y=3, r=3:
       j=0  j=1  j=2
i=0     0    2    1
i=1     1    0    2
[Figure: blocks R0, R1, …, Rx-1 and S0, S1, …, Sy-1. An entity e of R in block 1 is output as (1.0.R, e), (1.1.R, e), …, (1.y-1.R, e); an entity e of S in block y-1 is output as (0.y-1.S, e), (1.y-1.S, e), …, (x-1.y-1.S, e)]
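The key construction, partitioning, and reduce-side buffering can be sketched as follows. This is a single-process simulation under assumptions: keys are modeled as tuples `(i, j, source)` rather than dotted strings, block indexes come from a seeded `random.Random`, and every compared pair is emitted instead of being classified.

```python
import random

def part(i, j, x, r):
    """Partitioning function from the slide: (i + j*x) mod r."""
    return (i + j * x) % r

def map_entity(entity, source, x, y, rng):
    """Emit the replicated key-value pairs for one entity."""
    if source == "R":
        i = rng.randrange(x)                       # random R-block index
        return [((i, j, "R"), entity) for j in range(y)]
    j = rng.randrange(y)                           # random S-block index
    return [((i, j, "S"), entity) for i in range(x)]

def reducesplit(R, S, x, y, r, seed=0):
    rng = random.Random(seed)
    tasks = [[] for _ in range(r)]                 # input of each reduce task
    for source, entities in (("R", R), ("S", S)):
        for e in entities:
            for key, value in map_entity(e, source, x, y, rng):
                tasks[part(key[0], key[1], x, r)].append((key, value))

    pairs = []
    for task in tasks:
        # Sorting by (i, j, source) puts R before S within each i.j group.
        task.sort(key=lambda kv: kv[0])
        current, buffer = None, []
        for (i, j, source), e in task:
            if (i, j) != current:                  # new group: reset the buffer
                current, buffer = (i, j), []
            if source == "R":
                buffer.append(e)
            else:
                pairs.extend((e_r, e) for e_r in buffer)
    return pairs

PAIRS = reducesplit(list("ab"), list("cdefgh"), x=2, y=3, r=3)
```

Because an R entity in block i reaches all groups i.j and an S entity in block j reaches all groups i.j, each pair meets in exactly one group, so no pair is compared twice.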
REDUCESPLIT (m=3, r=3, x=2, y=3)
[Figure: worked example with R = {a, b} and S = {c, d, e, f, g, h}]
Map (key = IndexR.IndexS.Source)
• map0, input R = {a, b}: (0.0.R, a) (0.1.R, a) (0.2.R, a) (1.0.R, b) (1.1.R, b) (1.2.R, b)
• map1, input S = {c, d, e}: (0.0.S, c) (1.0.S, c) (0.1.S, d) (1.1.S, d) (0.2.S, e) (1.2.S, e)
• map2, input S = {f, g, h}: (0.0.S, f) (1.0.S, f) (0.1.S, g) (1.1.S, g) (0.2.S, h) (1.2.S, h)
Partitioning by (IndexR + IndexS*x) modulo r
Reduce (group by IndexR.IndexS)
• reduce0, groups 0.0 and 1.1: (0.0.R, a) (0.0.S, c) (0.0.S, f) (1.1.R, b) (1.1.S, d) (1.1.S, g); pairs a-c, a-f, b-d, b-g
• reduce1, groups 0.2 and 1.0: (0.2.R, a) (0.2.S, e) (0.2.S, h) (1.0.R, b) (1.0.S, c) (1.0.S, f); pairs a-e, a-h, b-c, b-f
• reduce2, groups 0.1 and 1.2: (0.1.R, a) (0.1.S, d) (0.1.S, g) (1.2.R, b) (1.2.S, e) (1.2.S, h); pairs a-d, a-g, b-e, b-h
MAPSIDE VS. REDUCESPLIT
• MapSide requires that R fits entirely in the main memory available per map task (multiple map tasks per node!)
• No data redistribution, sorting, grouping, or reduce task scheduling
• With ReduceSplit, only |R|/x entities need to be buffered
• At the expense of data replication (|R|*y + |S|*x map output pairs)
• Careful choice of x and y is crucial for performance
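The trade-off between buffer size and replication can be made concrete with the two formulas above. The helper below is a small sketch; the function name and dictionary keys are hypothetical, and |R|/x is the average block size under random block assignment. The usage lines plug in the dataset sizes quoted in the experiments (2,600 and 64,000 entities) with x=2, y=3 chosen arbitrarily.

```python
def reducesplit_cost(size_r, size_s, x, y):
    """Cost figures for ReduceSplit, using the formulas from the slide."""
    return {
        "match_tasks": x * y,                          # number of i.j groups
        "buffered_r_entities": size_r / x,             # average |R|/x per group
        "map_output_pairs": size_r * y + size_s * x,   # |R|*y + |S|*x
    }

# x=1, y=1 degenerates to a single group that buffers all of R.
UNSPLIT = reducesplit_cost(2600, 64000, x=1, y=1)
SPLIT = reducesplit_cost(2600, 64000, x=2, y=3)
```

Here splitting halves the buffer per group (1,300 instead of 2,600 entities) while roughly doubling the map output (135,800 instead of 66,600 pairs); since S is the larger source, increasing x is the expensive direction.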
EXPERIMENTAL RESULTS – MATCH QUALITY
• Bibliographic datasets: DBLP (2,600) vs. Google Scholar (64,000)
• Up to six matchers
• Two classifiers from WEKA: Decision Tree and Support Vector Machine
• Employing multiple matchers increases overall match quality (F-measure)
• Especially true if additional matchers operate on different attributes
EXPERIMENTAL RESULTS – TIME DISTRIBUTION
• Evaluation of the runtime using MapSide
• Same match problem
• 10 Amazon EC2 High-CPU Medium instances (each with two virtual cores)
• Generally, multiple matchers increase match quality
• At the expense of runtime
• Similarity computation consumes between 88% and 97% of overall runtime, depending on the number of matchers
EXPERIMENTAL RESULTS – SCALABILITY
• MapSide with n = 1…50 dual-core VMs
• Almost linear speedup for up to 10 nodes
• Still good speedup values for more nodes (e.g. ≈40 for n=50)
CONCLUSIONS
• Learning-based Entity Resolution with MapReduce
• Two different strategies for evaluation of the Cartesian product of two input sources
• MapSide: similarity computation solely during the map phase
• ReduceSplit: distribution of the Cartesian product evaluation evenly across all reduce tasks
• Evaluation of the proposed approaches
• Future work
• Incorporate blocking strategies
• Analysis of the learned model to avoid application of all matchers
THANK YOU FOR YOUR ATTENTION