Record Linkage in a Distributed Environment
Huang Yipeng
Wing group meeting, 11 March 2011

Introduction

Record Linkage

Determining whether pairs of personal records refer to the same entity

E.g. distinguishing between data belonging to…

<Yipeng, author of this presentation> and <Yipeng, son of PM Lee>

Introduction

The Distributed Environment

Why?
◦ Dealing with large data
◦ Limitation of blocking

Advantages
◦ Parallel computation
◦ Data source flexibility
◦ Complementary to blocking methods

[Figure: pairwise comparison of the example names Amanda, Beverley, Katherine, Amanda; O(nC2) comparisons]

Introduction

The Distributed Environment

MapReduce
◦ Distributed environment for large data sets

Hadoop
◦ Open source implementation
◦ Convenient model for scaling Record Linkage
◦ Protects users from system level concerns

Introduction

Research Problem

Disconnect between the generic parallel framework and the specific Record Linkage problem

The goal

Tailor Hadoop for Record Linkage tasks

Outline
◦ Introduction
◦ Related Work
◦ Methodology
◦ Evaluation
◦ Conclusion

Related Work

Record Linkage Literature
◦ Blocking techniques

Parallel Record Linkage Literature
◦ P-Febrl (P Christen 2003)
◦ P-Swoosh (H Kawai 2006)
◦ Parallel Linkage (H Kim 2007)

Hadoop Literature
◦ Evaluation Metrics
◦ Pairwise comparisons (T Elsayed 2008)

Methodology

MapReduce Workflow

[Figure: MapReduce workflow diagram; labelled stage: Partitioner]

Methodology

Implementation

Map

Purpose:
◦ Parallelism
◦ Data manipulation
◦ Blocking

Reads lines of input and outputs <key, value> pairs.

Reduce

Purpose:
◦ Parallelism
◦ Record Linkage

Records with the same <key> are processed in the same Reduce(), which outputs the linkage results.
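The Map and Reduce roles above can be sketched in plain Python, without Hadoop. This is a minimal single-process sketch, not the presenter's implementation: the input format (id, given name, surname), the use of the surname as blocking key, and the exact-match comparison are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def map_record(line):
    """Map: parse one input line and emit a <key, value> pair.
    Assumption: the blocking key is the lowercased surname."""
    rec_id, given_name, surname = [field.strip() for field in line.split(",")]
    return surname.lower(), (rec_id, given_name)

def reduce_block(records):
    """Reduce: compare every pair of records that share a key.
    Assumption: two records match when their given names are identical."""
    links = []
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        if name_a == name_b:
            links.append((id_a, id_b))
    return links

def run_job(lines):
    # Shuffle step: group map output by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_block(groups[key]))
    return results

lines = [
    "r1, taylor, swift",
    "r2, taylor, swift",
    "r3, amanda, lee",
    "r4, taylor, lee",
]
links = run_job(lines)
print(links)
```

Records r1 and r4 share a given name but never meet, because they fall into different blocks. That is the trade-off blocking makes in exchange for fewer comparisons.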

Methodology

Hash Partitioner

Default implementation: Hash(Key) mod N

Good for uniform data but not for skewed distributions

[Figure: reduce task list for Job x]

Name       Distribution   Comparisons
joshua     5000           12497500
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465

5416986 comparisons

210 comparisons
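To see why Hash(Key) mod N struggles with the name distribution above, the sketch below assigns each name to one of N reducers and sums the pairwise comparisons each reducer receives. It is illustrative only: CRC32 stands in for the key hash (Hadoop hashes the Java key object instead), and the function names `hash_partition` and `reducer_loads` are hypothetical.

```python
import zlib

def pairwise(n):
    """Number of pairwise comparisons among n records: n choose 2."""
    return n * (n - 1) // 2

def hash_partition(key, num_reducers):
    # Assumption: CRC32 is used as a stable stand-in for Hash(Key).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(name_counts, num_reducers):
    """Total comparison workload per reducer under Hash(Key) mod N."""
    loads = [0] * num_reducers
    for name, count in name_counts.items():
        loads[hash_partition(name, num_reducers)] += pairwise(count)
    return loads

# Name distribution taken from the table above.
name_counts = {
    "joshua": 5000, "emiily": 48, "jack": 35,
    "thomas": 33, "lachlan": 32, "benjamin": 31,
}
loads = reducer_loads(name_counts, 2)
print(loads)
```

Whichever reducer receives "joshua" carries at least 12497500 of the 12500712 comparisons, because a key is never split across reducers no matter how the hash falls.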

Methodology

Record Linkage Partitioner

Goal:
◦ Have all nodes finish the reduce phase at the same time
◦ Attain a better runtime while retaining the same level of accuracy

Methodology

Domain principles

Counting pairwise comparisons gives a more accurate picture of the true computational workload

The distribution of names tends to follow a power law in many countries (D Zanette 2001), (S Miyazima 2000)
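A short calculation shows why record counts understate the workload. The sketch below compares two blocks by record count and by pairwise comparisons, using the sizes 5000 and 48 that appear in the talk's name table. The helper name `pairwise_comparisons` is an assumption for illustration.

```python
def pairwise_comparisons(n):
    """Comparisons needed to compare every record in a block with every other."""
    return n * (n - 1) // 2

large_block, small_block = 5000, 48

# Ratio by record count versus ratio by actual comparison workload.
record_ratio = large_block / small_block
comparison_ratio = pairwise_comparisons(large_block) / pairwise_comparisons(small_block)

print(f"records: {record_ratio:.1f}x, comparisons: {comparison_ratio:.1f}x")
```

The large block holds roughly a hundred times the records but more than ten thousand times the comparisons, since the workload grows quadratically with block size.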

Methodology

Record Linkage Workflow

Round 1: Range partition based on comparison workload

Round 2: Merge lost comparisons from Round 1

Round 3: Remove cross duplicates

Methodology

Round 1

Map Phase

Distribution

1. Calculate the average comparison workload over N nodes.

2. Check whether a record will exceed the average. If yes, divide it by the minimum number of nodes needed to drop below the average.

3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.

4. Recurse until the comparison load can be evenly distributed among nodes.
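One way to read the four steps above is sketched below. It is an interpretation, not the presenter's code: a "record" is taken to mean a block of records sharing a key, blocks are split into near-equal sub-blocks, and a greedy least-loaded assignment stands in for the range partitioning named in the workflow. All function names are hypothetical.

```python
import math

def pairwise(n):
    return n * (n - 1) // 2

def split_sizes(size, parts):
    """Split `size` records into `parts` near-equal sub-blocks."""
    base, extra = divmod(size, parts)
    return [base + 1] * extra + [base] * (parts - extra)

def balance_blocks(block_sizes, num_nodes):
    """Steps 1, 2 and 4: split blocks until none exceeds the average workload."""
    blocks = list(block_sizes)
    while True:
        # Step 1: average comparison workload over the nodes.
        avg = sum(pairwise(b) for b in blocks) / num_nodes
        if not any(pairwise(b) > avg for b in blocks):
            return blocks
        next_blocks = []
        for b in blocks:
            if pairwise(b) > avg:
                # Step 2: smallest number of parts that drops below the average.
                parts = 2
                while parts < b and pairwise(math.ceil(b / parts)) > avg:
                    parts += 1
                next_blocks.extend(split_sizes(b, parts))
            else:
                next_blocks.append(b)
        # Step 4: repeat, because splitting loses cross-block comparisons
        # and therefore lowers the average.
        blocks = next_blocks

def assign_to_nodes(blocks, num_nodes):
    """Step 3 (assumed greedy): give each block to the least-loaded node."""
    loads = [0] * num_nodes
    for b in sorted(blocks, reverse=True):
        node = loads.index(min(loads))
        loads[node] += pairwise(b)
    return loads

blocks = balance_blocks([5000, 48, 35, 33, 32, 31], 2)
loads = assign_to_nodes(blocks, 2)
lost = pairwise(5000) - 2 * pairwise(2500)
print(blocks, loads, lost)
```

In this run the 5000-record block is split in two, and the comparisons between the two halves are lost in this round and have to be recovered in a later one.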

Methodology

Round 2

[Figure: Round 2 diagram; labels: List X, B, R1, R2]

[Figure: Round 2 diagram; labels: B, C, Job 1, Job 2, Job 3]

1. Only acts on lost comparisons

2. Because the input records are not distinct, a 3rd round of deduplication may be needed.
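The sketch below assumes that "lost comparisons" are the pairs whose two records were placed in different sub-blocks when a large block was split in Round 1. Under that assumption, Round 1 covers the within-sub-block pairs, Round 2 covers the cross-sub-block pairs, and together they reproduce every pair of the original block. The function names are hypothetical.

```python
from itertools import combinations, product

def round1_pairs(sub_blocks):
    """Pairs compared in Round 1: both records sit in the same sub-block."""
    pairs = set()
    for sub in sub_blocks:
        pairs.update(combinations(sub, 2))
    return pairs

def round2_pairs(sub_blocks):
    """Lost pairs recovered in Round 2: records sit in different sub-blocks."""
    pairs = set()
    for sub_a, sub_b in combinations(sub_blocks, 2):
        pairs.update(product(sub_a, sub_b))
    return pairs

records = list(range(10))                  # one block of 10 record ids
sub_blocks = [records[:5], records[5:]]    # split across two nodes in Round 1

within = round1_pairs(sub_blocks)
cross = round2_pairs(sub_blocks)
all_pairs = set(combinations(records, 2))
print(len(within), len(cross), len(all_pairs))
```

The two sets are disjoint here because each record lives in exactly one sub-block. If records were replicated across sub-blocks, the same pair could be produced twice, which is one reason a further deduplication round could be needed.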

Evaluation

Performance Metrics

Performance evaluation in absolute runtime, speedup & scaleup on a shared cluster.
◦ “It’s what users care about”
◦ Representative of real operations
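The slide names speedup and scaleup without defining them. A common convention, assumed here rather than taken from the talk, is that speedup compares runtimes on a fixed data set as nodes are added, while scaleup compares a base run against a run where nodes and data grow by the same factor. The runtimes in the example are hypothetical placeholders, not measurements.

```python
def speedup(time_one_node, time_n_nodes):
    """Fixed data size: how many times faster n nodes are than one node."""
    return time_one_node / time_n_nodes

def scaleup(time_base, time_scaled):
    """Data and nodes both grown n times: 1.0 means perfect scaleup."""
    return time_base / time_scaled

# Hypothetical runtimes in seconds, for illustration only.
s = speedup(time_one_node=120.0, time_n_nodes=40.0)
c = scaleup(time_base=100.0, time_scaled=125.0)
print(s, c)
```

With n nodes, a speedup close to n and a scaleup close to 1.0 would indicate that the reduce phase is well balanced.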

Methodology

Input Records

◦ 10 million records, 0.9 million original, 0.1 million duplicate
◦ Up to 9 duplicates per record
◦ 1 modification per field, 1 modification per record
◦ Duplicates follow a Poisson distribution

<rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9>

Methodology

Data sets

Synthetic data produced with Febrl data generator
◦ Artificially skewed distribution

[Figure: chart with axis label "Comparisons"]

Name       Distribution   Comparisons
joshua     50             1225
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465

Evaluation

Utilization

[Figure: utilization charts for 2 nodes (Node 1, Node 2) and 3 nodes (Node 1, Node 2, Node 3); legend: Idle, Computation]

[Figure: utilization chart; legend: Redistributed Computation, Original Computation]

Round 2

[Figure: Round 2 diagram; labels: J4, J6, ?]

Node Utilization 50-100%

Evaluation

Results so far…

Configuration                             Default Workflow   RL Workflow
2 nodes, 5000 records, 2433 duplicates    71.5 secs          75 secs
2 nodes, 7000 records, 4814 duplicates    >10 mins           196.8 secs

Evaluation

Results so far…

RL Workflow runtime
◦ Similar to Hash-based runtime on small datasets
◦ Better as the size of the dataset grows

Conclusion

Parallelism is a step in the right direction for record linkage
◦ Complementary to existing approaches

Hadoop can be tailored for Record Linkage tasks
◦ “Record Linkage” Partitioner / Workflow is just one example of possible improvements
