Record Linkage in a Distributed Environment
Huang Yipeng
Wing group meeting, 11 March 2011


Introduction

Record Linkage
Determining whether pairs of personal records refer to the same entity

E.g. distinguishing between data belonging to…
<Yipeng, author of this presentation> and <Yipeng, son of PM Lee>

The Distributed Environment

Why?
◦ Dealing with large data
◦ Limitation of blocking
Advantages
◦ Parallel computation
◦ Data source flexibility
◦ Complementary to blocking methods

O(nC2)

[Figure: list of names: Amanda, Beverley, Katherine, Amanda, Amanda, Amanda, Amanda, Amanda]
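As a rough illustration of the O(nC2) cost and of what blocking saves, the sketch below counts comparisons for the eight names shown on this slide, with and without grouping by name. Treating the name itself as the blocking key, and the helper names `all_pairs` and `blocked_pairs`, are assumptions made for the example.

```python
from collections import Counter
from math import comb

# The eight names from the slide: six "amanda" plus two others.
names = ["amanda", "beverley", "katherine"] + ["amanda"] * 5

def all_pairs(records):
    """Comparisons needed when every record is compared with every other."""
    return comb(len(records), 2)

def blocked_pairs(records):
    """Comparisons needed when only records sharing a name are compared."""
    return sum(comb(size, 2) for size in Counter(records).values())

print(all_pairs(names))      # 28
print(blocked_pairs(names))  # 15
```

Blocking removes the comparisons between different names, but the single large "amanda" block still accounts for all of the remaining work.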

The Distributed Environment

MapReduce
◦ Distributed environment for large data sets
Hadoop
◦ Open source implementation
◦ Convenient model for scaling Record Linkage
◦ Protects users from system-level concerns

Research Problem
Disconnect between the generic parallel framework and the specific Record Linkage problem

The goal
Tailor Hadoop for Record Linkage tasks

Outline
Introduction
Related Work
Methodology
Evaluation
Conclusion

Related Work

Record Linkage Literature
◦ Blocking techniques
Parallel Record Linkage Literature
◦ P-Febrl (P Christen 2003)
◦ P-Swoosh (H Kawai 2006)
◦ Parallel Linkage (H Kim 2007)
Hadoop Literature
◦ Evaluation Metrics
◦ Pairwise comparisons (T Elsayed 2008)


Methodology

MapReduce Workflow
[Figure: MapReduce workflow diagram; labelled component: Partitioner]

Implementation

Map
Purpose:
◦ Parallelism
◦ Data manipulation
◦ Blocking
Reads lines of input and outputs <key, value> pairs.

Reduce
Purpose:
◦ Parallelism
◦ Record Linkage ops
Records with the same <key> are handled in the same Reduce() call.
Output: linkage results
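The Map and Reduce roles described above can be sketched as a small in-memory simulation. This is not Hadoop code: `map_record`, `reduce_block` and `run` are hypothetical names, the second field is assumed to be the name used as blocking key, and an exact match on the third field stands in for a real record comparison.

```python
from collections import defaultdict
from itertools import combinations

def map_record(line):
    """Map: parse one input line and emit a <key, value> pair.
    Assumption: field 2 is the name and serves as the blocking key."""
    fields = [f.strip() for f in line.split(",")]
    return fields[1], tuple(fields)

def reduce_block(records):
    """Reduce: compare every pair of records that share a key.
    Toy rule (assumption): two records link if field 3 is identical."""
    return [(a[0], b[0]) for a, b in combinations(records, 2) if a[2] == b[2]]

def run(lines):
    """Group map output by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    links = []
    for key in sorted(groups):
        links.extend(reduce_block(groups[key]))
    return links

sample = [
    "rec-1-org, joshua, smith",
    "rec-1-dup-0, joshua, smith",
    "rec-2-org, joshua, jones",
    "rec-3-org, jack, smith",
]
print(run(sample))  # [('rec-1-org', 'rec-1-dup-0')]
```

Because only records with the same key meet in a reduce call, "rec-3-org" is never compared with the "joshua" records even though its third field matches.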

Hash Partitioner
Default implementation: Hash(Key) mod N
Good for uniformly distributed data, but not for skewed distributions

[Figure: Reduce task list for Job x, plotted per Node]

Name       Distribution   Comparisons
joshua     5000           12497500
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465

5416986 comparisons

210 comparisons
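To see why Hash(Key) mod N struggles with skew, the sketch below sends the six name blocks from the table to N nodes by hash and sums the comparisons each node receives. `zlib.crc32` is only a stand-in for the key hash (an assumption, chosen because it is stable across runs), and `node_loads` is a hypothetical helper.

```python
import zlib

def pairwise(n):
    """Number of comparisons inside a block of n records (nC2)."""
    return n * (n - 1) // 2

def hash_partition(key, num_nodes):
    """Hash(Key) mod N, with crc32 standing in for the key hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def node_loads(block_sizes, num_nodes):
    """Total comparisons assigned to each node."""
    loads = [0] * num_nodes
    for key, size in block_sizes.items():
        loads[hash_partition(key, num_nodes)] += pairwise(size)
    return loads

# Block sizes taken from the table above.
blocks = {"joshua": 5000, "emiily": 48, "jack": 35,
          "thomas": 33, "lachlan": 32, "benjamin": 31}
loads = node_loads(blocks, num_nodes=4)
print(loads, max(loads) / sum(loads))
```

Whichever node receives "joshua" carries at least 12497500 of the 12500712 comparisons, so the other nodes finish almost immediately and then sit idle.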

Record Linkage Partitioner

Goal:
◦ Have all nodes finish the reduce phase at the same time
◦ Attain a better runtime while retaining the same level of accuracy

Domain principles
Counting pairwise comparisons gives a more accurate picture of the true computational workload
The distribution of names tends to follow a power law in many countries (D Zanette 2001; S Miyazima 2000)
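The two principles combine badly for a naive partitioner: under a power law, the most common name holds a modest share of the records but a much larger share of the comparisons. The sketch below uses made-up block sizes where the r-th most common name has roughly largest / r records; the sizes, the exponent and the function names are illustrative assumptions, not data from the talk.

```python
def pairwise(n):
    """Number of comparisons inside a block of n records (nC2)."""
    return n * (n - 1) // 2

def power_law_sizes(largest, num_names, exponent=1.0):
    """Hypothetical block sizes following size ~ largest / rank**exponent."""
    return [max(1, round(largest / rank ** exponent))
            for rank in range(1, num_names + 1)]

sizes = power_law_sizes(largest=1000, num_names=100)
record_share = sizes[0] / sum(sizes)
comparison_share = pairwise(sizes[0]) / sum(pairwise(n) for n in sizes)
print(f"top name: {record_share:.2f} of records, "
      f"{comparison_share:.2f} of comparisons")
```

Balancing nodes by record count would therefore underestimate the work attached to the largest blocks.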

Record Linkage Workflow

Round 1: Range partition based on comparison workload
Round 2: Merge lost comparisons from Round 1
Round 3: Remove cross duplicates

Round 1

[Figure labels: Input, Map Phase, Distribution]

1. Calculate the average comparison workload over N nodes.
2. Check whether a record group (records sharing a key) will exceed the average. If yes, divide it by the minimum number of nodes needed to drop below the average.
3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.
4. Recurse until the comparison load can be evenly distributed among nodes.
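A simplified, single-pass sketch of steps 1 to 3 follows. It is not the talk's implementation: it does not recurse (step 4), and it places pieces with a greedy least-loaded rule rather than a range partition. All function names are hypothetical.

```python
import heapq

def pairwise(n):
    """Number of comparisons inside a block of n records (nC2)."""
    return n * (n - 1) // 2

def split_sizes(n, k):
    """Split n records into k near-equal sub-blocks."""
    base, extra = divmod(n, k)
    return [base + 1] * extra + [base] * (k - extra)

def min_splits(n, target):
    """Smallest k such that each sub-block needs at most `target` comparisons."""
    k = 1
    while pairwise(-(-n // k)) > target:  # -(-n // k) is ceil(n / k)
        k += 1
    return k

def plan_round1(block_sizes, num_nodes):
    total = sum(pairwise(n) for n in block_sizes.values())
    target = total / num_nodes                      # step 1: average workload
    pieces, lost = [], 0
    for key, n in block_sizes.items():
        parts = split_sizes(n, min_splits(n, target))  # step 2: split if too big
        lost += pairwise(n) - sum(pairwise(p) for p in parts)
        pieces.extend((pairwise(p), key, i) for i, p in enumerate(parts))
    # Step 3 (greedy stand-in): largest piece first, onto the least-loaded node.
    heap = [(0, node) for node in range(num_nodes)]
    loads, assignment = [0] * num_nodes, {}
    for cost, key, i in sorted(pieces, reverse=True):
        load, node = heapq.heappop(heap)
        assignment[(key, i)] = node
        loads[node] = load + cost
        heapq.heappush(heap, (loads[node], node))
    return assignment, loads, lost

blocks = {"joshua": 5000, "emiily": 48, "jack": 35,
          "thomas": 33, "lachlan": 32, "benjamin": 31}
assignment, loads, lost = plan_round1(blocks, num_nodes=4)
print(loads, lost)
```

With these sizes the "joshua" block is halved, and the 2500 × 2500 comparisons between the two halves are the lost comparisons that a later round has to pick up. The remaining imbalance in `loads` is why the average needs updating and the procedure repeated.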

Round 2

[Figure: List X with parts A and B; the grids mark which round (R1, R2) covers each pair of parts]

     A    B
A    R1
B         R1

     A    B
A    R1
B    R2   R1

Round 2

     A       B
A
B    Job 1

     A       B       C
A
B    Job 1
C    Job 2   Job 3

1. Only acts on lost comparisons
2. Because the inputs are not distinct, a 3rd round of deduplication may be needed.
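The grids above suggest one Round 2 job per pair of parts. A minimal sketch of that enumeration follows, with hypothetical names (`round2_jobs`, `lost_pairs`) and toy record ids; it generates only the cross-part comparisons, which are the ones Round 1 did not perform.

```python
from itertools import combinations, product

def round2_jobs(sub_blocks):
    """One job per unordered pair of parts, e.g. (A, B), (A, C), (B, C)."""
    return list(combinations(sorted(sub_blocks), 2))

def lost_pairs(sub_blocks):
    """All record pairs that span two different parts."""
    pairs = []
    for left, right in round2_jobs(sub_blocks):
        pairs.extend(product(sub_blocks[left], sub_blocks[right]))
    return pairs

# Toy split of one oversized block into three parts.
sub_blocks = {"A": ["r1", "r2"], "B": ["r3", "r4"], "C": ["r5"]}
print(round2_jobs(sub_blocks))      # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(len(lost_pairs(sub_blocks)))  # 8
```

Here the 2 within-part comparisons from Round 1 plus these 8 cross-part comparisons add up to the 10 pairs of the original five-record block, so no comparison is dropped.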


Evaluation

Performance Metrics
Performance evaluation in absolute runtime, speedup and scaleup on a shared cluster.
◦ “It’s what users care about”
◦ Representative of real operations
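The slide does not spell out how speedup and scaleup are computed, so the sketch below assumes the commonly used definitions: speedup compares one node with N nodes on the same data, and scaleup compares one node on a base data set with N nodes on a data set N times larger. The timings are made up for illustration.

```python
def speedup(time_one_node, time_n_nodes):
    """Same data on both runs; the ideal value equals the number of nodes."""
    return time_one_node / time_n_nodes

def scaleup(time_base_one_node, time_scaled_n_nodes):
    """Data grown by the same factor as the nodes; the ideal value is 1.0."""
    return time_base_one_node / time_scaled_n_nodes

# Made-up timings in seconds, not measurements from the talk.
print(speedup(120.0, 40.0))   # 3.0
print(scaleup(120.0, 150.0))  # 0.8
```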


Input Records

10 million records, 0.9 million original, 0.1 million duplicate, up to 9 duplicates per record, 1 modification per field, 1 modification per record, duplicates follow Poisson distribution.

<rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9>

Data sets
Synthetic data produced with the Febrl data generator
◦ Artificially skewed distribution

[Figure: chart with y-axis Comparisons, 0 to 1400]

Name       Distribution   Comparisons
joshua     50             1225
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465

Utilization
[Figure: utilization chart for Node 1 and Node 2; y-axis 0 to 12; legend: Idle, Computation]

Utilization
[Figure: utilization chart for Node 1, Node 2 and Node 3; y-axis 0 to 12; legend: Idle, Computation]

Utilization
[Figure: utilization chart for Node 1, Node 2 and Node 3; y-axis 0 to 12; legend: Idle, Computation; labels: A, B, C]

Utilization
[Figure: utilization chart for Node 1, Node 2 and Node 3; y-axis 0 to 12; legend: Idle, Redistributed Computation, Original Computation; labels: C, A, B]

Round 2
[Figure: grid over A, B and C showing jobs J1 to J6, with one cell marked “?”]
Node Utilization 50-100%

Results so far…

                                         Default Workflow   RL Workflow
2 nodes, 5000 records, 2433 duplicates   71.5 secs          75 secs
2 nodes, 7000 records, 4814 duplicates   >10 mins           196.8 secs

Results so far…

RL Workflow runtime
◦ Similar to hash-based runtime on small datasets
◦ Better as the size of the dataset grows

Conclusion

Parallelism is a step in the right direction for record linkage
◦ Complementary to existing approaches
Hadoop can be tailored for Record Linkage tasks
◦ “Record Linkage” Partitioner / Workflow is just one example of possible improvements
