Multi-pass Sorted Neighborhood Blocking with MapReduce Lars Kolb, Andreas Thor, Erhard Rahm Jens Hildebrandt, Jakob Zwiener


Page 1: Multi-pass Sorted Neighborhood Blocking with MapReduce

Multi-pass Sorted Neighborhood Blocking with MapReduce Lars Kolb, Andreas Thor, Erhard Rahm

Jens Hildebrandt, Jakob Zwiener

Page 2: Multi-pass Sorted Neighborhood Blocking with MapReduce

Agenda

1. Sorted Neighborhood Method
   ■ with Map Reduce
   ■ with Entity Replication
2. Multipass Sorted Neighborhood Method
3. Load Balancing
4. Benchmarks

Sorted Neighborhood with MapReduce | Jens Hildebrandt, Jakob Zwiener | 6 May 2013

Page 3: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood Method


sorting key   artist_name   disc_title    Genre   tracks
              Sonny Terry   The Blues     Blues   18
              Fats Waller   Portrait      Jazz    17
              Blind Blake   Best Of       Blues   18
              Fats Domino   I'M Walking   Blues   18
              Chris Rea     Stony Road    Blues   17
              Jazz          Jazz          Jazz    20
              Acustica      Acustica      Blues   19
              Various       The Blues     Blues   17
              Kelis         Tasty         R+B     17

1. Calculate Sorting Key
   • Genre + tracks

Page 4: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood Method


sorting key   artist_name   disc_title    Genre   tracks
Blues18       Sonny Terry   The Blues     Blues   18
Jazz17        Fats Waller   Portrait      Jazz    17
Blues18       Blind Blake   Best Of       Blues   18
Blues18       Fats Domino   I'M Walking   Blues   18
Blues17       Chris Rea     Stony Road    Blues   17
Jazz20        Jazz          Jazz          Jazz    20
Blues19       Acustica      Acustica      Blues   19
Blues17       Various       The Blues     Blues   17
R+B17         Kelis         Tasty         R+B     17

1. Calculate Sorting Key
   • Genre + tracks
2. Sort

Page 5: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood Method

sorting key   artist_name   disc_title    Genre   tracks
Blues17       Chris Rea     Stony Road    Blues   17
Blues17       Various       The Blues     Blues   17
Blues18       Sonny Terry   The Blues     Blues   18
Blues18       Blind Blake   Best Of       Blues   18
Blues18       Fats Domino   I'M Walking   Blues   18
Blues19       Acustica      Acustica      Blues   19
Jazz17        Fats Waller   Portrait      Jazz    17
Jazz20        Jazz          Jazz          Jazz    20
R+B17         Kelis         Tasty         R+B     17

1. Calculate Sorting Key
   • Genre + tracks
2. Sort
3. Move a window over the data
   • Window size w = 3
   • Row count n = 9

Comparisons: O(n*w)
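The three steps can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the presenters' implementation; the names `RECORDS`, `sorting_key`, and `sorted_neighborhood` are made up for this example, and ties between equal sorting keys are simply left in input order.

```python
# Example records from the slides: (artist_name, disc_title, genre, tracks)
RECORDS = [
    ("Sonny Terry", "The Blues", "Blues", 18),
    ("Fats Waller", "Portrait", "Jazz", 17),
    ("Blind Blake", "Best Of", "Blues", 18),
    ("Fats Domino", "I'M Walking", "Blues", 18),
    ("Chris Rea", "Stony Road", "Blues", 17),
    ("Jazz", "Jazz", "Jazz", 20),
    ("Acustica", "Acustica", "Blues", 19),
    ("Various", "The Blues", "Blues", 17),
    ("Kelis", "Tasty", "R+B", 17),
]


def sorting_key(record):
    """Step 1: sorting key = Genre + tracks, e.g. 'Blues18'."""
    _artist, _title, genre, tracks = record
    return f"{genre}{tracks}"


def sorted_neighborhood(records, key, w):
    """Steps 2 and 3: sort by key, then pair each record with its w-1 successors."""
    ordered = sorted(records, key=key)  # stable sort: ties keep input order
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + w]:
            pairs.append((left, right))
    return pairs


pairs = sorted_neighborhood(RECORDS, sorting_key, w=3)
print(len(pairs))  # 15 candidate pairs for n = 9, w = 3
```

For n = 9 and w = 3 the window yields (w - 1)(n - w/2) = 15 candidate pairs, compared with 36 pairs when every record is compared with every other one.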

Page 6: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Algorithm

disc_title    Genre   tracks
The Blues     Blues   18
Portrait      Jazz    17
Best Of       Blues   18

disc_title    Genre   tracks
I'M Walking   Blues   18
Stony Road    Blues   17
Jazz          Jazz    20

disc_title    Genre   tracks
Acustica      Blues   19
The Blues     Blues   17
Tasty         R+B     17

Page 7: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Algorithm

map1
sort      disc_title    ...
          The Blues     ...
          Portrait      ...
          Best Of       ...

map2
sort      disc_title    ...
          I'M Walking   ...
          Stony Road    ...
          Jazz          ...

map3
sort      disc_title    ...
          Acustica      ...
          The Blues     ...
          Tasty         ...

Page 8: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Algorithm

map1
sort      disc_title    ...
Blues18   The Blues     ...
Jazz17    Portrait      ...
Blues18   Best Of       ...

map2
sort      disc_title    ...
Blues18   I'M Walking   ...
Blues17   Stony Road    ...
Jazz20    Jazz          ...

map3
sort      disc_title    ...
Blues19   Acustica      ...
Blues17   The Blues     ...
R+B17     Tasty         ...

Map:
1. Calculate SortingKey: Genre+tracks

Page 9: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Algorithm

map1
part.sort   disc_title    ...
1.Blues18   The Blues     ...
Jazz17      Portrait      ...
1.Blues18   Best Of       ...

map2
part.sort   disc_title    ...
1.Blues18   I'M Walking   ...
1.Blues17   Stony Road    ...
Jazz20      Jazz          ...

map3
part.sort   disc_title    ...
1.Blues19   Acustica      ...
1.Blues17   The Blues     ...
R+B17       Tasty         ...

Map:
1. Calculate SortingKey: Genre+tracks
2. Calculate Partition:

sorting key   partition
B…            1

Page 10: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Algorithm

map1
part.sort   disc_title    ...
1.Blues18   The Blues     ...
2.Jazz17    Portrait      ...
1.Blues18   Best Of       ...

map2
part.sort   disc_title    ...
1.Blues18   I'M Walking   ...
1.Blues17   Stony Road    ...
2.Jazz20    Jazz          ...

map3
part.sort   disc_title    ...
1.Blues19   Acustica      ...
1.Blues17   The Blues     ...
R+B17       Tasty         ...

Map:
1. Calculate SortingKey: Genre+tracks
2. Calculate Partition:

sorting key   partition
B…            1
J…            2

Page 11: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Algorithm

map1
part.sort   disc_title    ...
1.Blues18   The Blues     ...
2.Jazz17    Portrait      ...
1.Blues18   Best Of       ...

map2
part.sort   disc_title    ...
1.Blues18   I'M Walking   ...
1.Blues17   Stony Road    ...
2.Jazz20    Jazz          ...

map3
part.sort   disc_title    ...
1.Blues19   Acustica      ...
1.Blues17   The Blues     ...
2.R+B17     Tasty         ...

Map:
1. Calculate SortingKey: Genre+tracks
2. Calculate Partition:

sorting key   partition
B…            1
J…            2
R…            2

Page 12: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Algorithm

map1
part.sort   disc_title    ...
1.Blues18   The Blues     ...
2.Jazz17    Portrait      ...
1.Blues18   Best Of       ...

map2
part.sort   disc_title    ...
1.Blues18   I'M Walking   ...
1.Blues17   Stony Road    ...
2.Jazz20    Jazz          ...

map3
part.sort   disc_title    ...
1.Blues19   Acustica      ...
1.Blues17   The Blues     ...
2.R+B17     Tasty         ...

Partitioning

Page 13: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Algorithm

Partitioning

part.sort   disc_title    ...
1.Blues17   The Blues     ...
1.Blues17   Stony Road    ...
1.Blues18   I'M Walking   ...
1.Blues18   Best Of       ...
1.Blues18   The Blues     ...
1.Blues19   Acustica      ...

part.sort   disc_title    ...
2.Jazz17    Portrait      ...
2.Jazz20    Jazz          ...
2.R+B17     Tasty         ...
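The map, partitioning, and reduce steps walked through on these slides can be simulated in plain Python. The sketch below is an illustration under assumptions, not Hadoop code: `partition_of` hard-codes the partition table shown on the slides (B… to 1, J… and R… to 2), and `map_phase`, `shuffle`, and `reduce_phase` are hypothetical helper names.

```python
from collections import defaultdict

# Three input splits as on the slides: (disc_title, genre, tracks)
SPLITS = [
    [("The Blues", "Blues", 18), ("Portrait", "Jazz", 17), ("Best Of", "Blues", 18)],
    [("I'M Walking", "Blues", 18), ("Stony Road", "Blues", 17), ("Jazz", "Jazz", 20)],
    [("Acustica", "Blues", 19), ("The Blues", "Blues", 17), ("Tasty", "R+B", 17)],
]


def partition_of(sort_key):
    """Partition table from the slides: B... -> 1, J... -> 2, R... -> 2."""
    return 1 if sort_key.startswith("B") else 2


def map_phase(split):
    """Emit ((partition, sorting key), record) for every record of one split."""
    for title, genre, tracks in split:
        sort_key = f"{genre}{tracks}"
        yield (partition_of(sort_key), sort_key), (title, genre, tracks)


def shuffle(map_outputs):
    """Route by partition prefix, then sort each reducer's input by composite key."""
    reducers = defaultdict(list)
    for composite_key, record in map_outputs:
        reducers[composite_key[0]].append((composite_key, record))
    # sorted() is stable, so equal keys keep arrival order in this simulation
    return {p: sorted(items, key=lambda kv: kv[0]) for p, items in reducers.items()}


def reduce_phase(items, w):
    """Slide a window of size w over one reducer's sorted input."""
    records = [record for _key, record in items]
    return [(a, b) for i, a in enumerate(records) for b in records[i + 1 : i + w]]


emitted = [kv for split in SPLITS for kv in map_phase(split)]
reducer_inputs = shuffle(emitted)
candidates = {p: reduce_phase(items, w=3) for p, items in reducer_inputs.items()}
for p in sorted(reducer_inputs):
    print(p, [f"{k[0]}.{k[1]}" for k, _ in reducer_inputs[p]], len(candidates[p]))
```

With w = 3, reducer 1 produces 9 candidate pairs and reducer 2 produces 3, so 12 pairs in total. A single global window over the same data yields 15, which is the gap the boundary-entity discussion addresses.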

Page 14: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Map Reduce - Limitations

• Neighboring sorting keys must be on the same reducer, so a custom partition function is needed
  • Self-defined partitioning + sorting
  • Internal load balancing does not work anymore
• Boundary entities
  • Sliding window cannot compare entities that are assigned to different reduce nodes
  • Solution: data replication

reduce1
part.sort   disc_title    ...
1.Blues17   The Blues     ...
1.Blues17   Stony Road    ...
1.Blues18   I'M Walking   ...
1.Blues18   Best Of       ...
1.Blues18   The Blues     ...
1.Blues19   Acustica      ...

reduce2
part.sort   disc_title    ...
2.Jazz17    Portrait      ...
2.Jazz20    Jazz          ...
2.R+B17     Tasty         ...
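The boundary problem can be made concrete by comparing the pairs found per reducer with the pairs a single global window would find. The sketch below is only an illustration; `window_pairs` is a hypothetical helper, and entities are identified by (sorting key, disc_title) in the order shown on the slide.

```python
def window_pairs(items, w):
    """Pairs (a, b) where b is one of the w-1 successors of a in the given order."""
    return {(a, b) for i, a in enumerate(items) for b in items[i + 1 : i + w]}


# Reducer inputs from the slide, as (sorting key, disc_title)
reduce1 = [
    ("Blues17", "The Blues"), ("Blues17", "Stony Road"), ("Blues18", "I'M Walking"),
    ("Blues18", "Best Of"), ("Blues18", "The Blues"), ("Blues19", "Acustica"),
]
reduce2 = [("Jazz17", "Portrait"), ("Jazz20", "Jazz"), ("R+B17", "Tasty")]

w = 3
local_pairs = window_pairs(reduce1, w) | window_pairs(reduce2, w)
global_pairs = window_pairs(reduce1 + reduce2, w)
missed = global_pairs - local_pairs
for a, b in sorted(missed):
    print(a, "<->", b)
```

Exactly w(w - 1)/2 = 3 pairs are missed at the single partition boundary. All of them involve the last w - 1 entities of reducer 1, which is why replicating those entities to the next reducer is enough.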

Page 15: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Entity Replication

map1   map2   map3

reduce1   reduce2

Page 16: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Entity Replication

map1
sort      disc_title    ...
Blues18   The Blues     ...
Jazz17    Portrait      ...
Blues18   Best Of       ...

map2
sort      disc_title    ...
Blues18   I'M Walking   ...
Blues17   Stony Road    ...
Jazz20    Jazz          ...

map3
sort      disc_title    ...
Blues19   Acustica      ...
Blues17   The Blues     ...
R+B17     Tasty         ...

reduce1   reduce2

Page 17: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Entity Replication

map1
part.sort   disc_title    ...
1.Blues18   The Blues     ...
2.Jazz17    Portrait      ...
1.Blues18   Best Of       ...

map2
part.sort   disc_title    ...
1.Blues18   I'M Walking   ...
1.Blues17   Stony Road    ...
2.Jazz20    Jazz          ...

map3
part.sort   disc_title    ...
1.Blues19   Acustica      ...
1.Blues17   The Blues     ...
2.R+B17     Tasty         ...

reduce1   reduce2

Page 18: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Entity Replication

map1
part.sort   disc_title    ...
1.Blues18   The Blues     ...
2.Jazz17    Portrait      ...
1.Blues18   Best Of       ...
1.Blues18   The Blues     ...
1.Blues18   Best Of       ...

map2
part.sort   disc_title    ...
1.Blues18   I'M Walking   ...
1.Blues17   Stony Road    ...
2.Jazz20    Jazz          ...
1.Blues17   Stony Road    ...
1.Blues18   I'M Walking   ...

map3
part.sort   disc_title    ...
1.Blues19   Acustica      ...
1.Blues17   The Blues     ...
2.R+B17     Tasty         ...
1.Blues19   Acustica      ...
1.Blues17   The Blues     ...

reduce1   reduce2

Page 19: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Entity Replication

map1
red.part.sort   disc_title    ...
1.1.Blues18     The Blues     ...
2.2.Jazz17      Portrait      ...
1.1.Blues18     Best Of       ...
2.1.Blues18     The Blues     ...
2.1.Blues18     Best Of       ...

map2
red.part.sort   disc_title    ...
1.1.Blues18     I'M Walking   ...
1.1.Blues17     Stony Road    ...
2.2.Jazz20      Jazz          ...
2.1.Blues17     Stony Road    ...
2.1.Blues18     I'M Walking   ...

map3
red.part.sort   disc_title    ...
1.1.Blues19     Acustica      ...
1.1.Blues17     The Blues     ...
2.2.R+B17       Tasty         ...
2.1.Blues19     Acustica      ...
2.1.Blues17     The Blues     ...

reduce1   reduce2

Page 20: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Entity Replication

Partitioning

reduce1
red.part.sort   disc_title    ...
1.1.Blues17     The Blues     ...
1.1.Blues17     Stony Road    ...
1.1.Blues18     I'M Walking   ...
1.1.Blues18     Best Of       ...
1.1.Blues18     The Blues     ...
1.1.Blues19     Acustica      ...

reduce2
red.part.sort   disc_title    ...

Page 21: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Entity Replication

Partitioning

reduce1
red.part.sort   disc_title    ...
1.1.Blues17     The Blues     ...
1.1.Blues17     Stony Road    ...
1.1.Blues18     I'M Walking   ...
1.1.Blues18     Best Of       ...
1.1.Blues18     The Blues     ...
1.1.Blues19     Acustica      ...

reduce2
red.part.sort   disc_title    ...
2.1.Blues17     The Blues     ...
2.1.Blues17     Stony Road    ...
2.1.Blues18     I'M Walking   ...
2.1.Blues18     The Blues     ...
2.1.Blues18     Best Of       ...
2.1.Blues19     Acustica      ...

Page 22: Multi-pass Sorted Neighborhood Blocking with MapReduce

Sorted Neighborhood with Entity Replication

Partitioning

reduce1
red.part.sort   disc_title    ...
1.1.Blues17     The Blues     ...
1.1.Blues17     Stony Road    ...
1.1.Blues18     I'M Walking   ...
1.1.Blues18     Best Of       ...
1.1.Blues18     The Blues     ...
1.1.Blues19     Acustica      ...

reduce2
red.part.sort   disc_title    ...
2.1.Blues17     The Blues     ...
2.1.Blues17     Stony Road    ...
2.1.Blues18     I'M Walking   ...
2.1.Blues18     Best Of       ...
2.1.Blues18     The Blues     ...
2.1.Blues19     Acustica      ...
2.2.Jazz17      Portrait      ...
2.2.Jazz20      Jazz          ...
2.2.R+B17       Tasty         ...
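The replication scheme on these slides can be sketched as follows. This is a simplified simulation under assumptions, not the presenters' code: each map task replicates its w - 1 largest entities of partition p to reducer p + 1 using a (reducer, partition, sorting key) composite key, and the reducer keeps only the last w - 1 replicas. The names `map_with_replication` and `reduce_with_replication` are hypothetical, and ties are broken by comparing the records themselves so that all tasks agree on one total order.

```python
from collections import defaultdict

# Three input splits as on the slides: (disc_title, genre, tracks)
SPLITS = [
    [("The Blues", "Blues", 18), ("Portrait", "Jazz", 17), ("Best Of", "Blues", 18)],
    [("I'M Walking", "Blues", 18), ("Stony Road", "Blues", 17), ("Jazz", "Jazz", 20)],
    [("Acustica", "Blues", 19), ("The Blues", "Blues", 17), ("Tasty", "R+B", 17)],
]
W = 3
NUM_PARTITIONS = 2


def partition_of(sort_key):
    """Partition table from the slides: B... -> 1, everything else -> 2."""
    return 1 if sort_key.startswith("B") else 2


def map_with_replication(split, w):
    """Emit (reducer, partition, sorting key, record) and replicate boundary entities."""
    by_partition = defaultdict(list)
    for record in split:
        sort_key = f"{record[1]}{record[2]}"
        by_partition[partition_of(sort_key)].append((sort_key, record))
    for p, entries in by_partition.items():
        entries.sort()
        for sort_key, record in entries:
            yield (p, p, sort_key, record)
        if p < NUM_PARTITIONS:
            # the w-1 largest entities of this split may border the next partition
            for sort_key, record in entries[max(0, len(entries) - (w - 1)):]:
                yield (p + 1, p, sort_key, record)


def reduce_with_replication(reducer, items, w):
    """Window over the last w-1 replicas followed by the reducer's own entities."""
    items = sorted(items)  # orders by (reducer, partition, sorting key, record)
    replicas = [it for it in items if it[1] < reducer]
    own = [it for it in items if it[1] == reducer]
    replicas = replicas[max(0, len(replicas) - (w - 1)):]
    window_input = replicas + own
    pairs = set()
    for i, a in enumerate(window_input):
        for b in window_input[i + 1 : i + w]:
            if a[1] == reducer or b[1] == reducer:  # replica-replica pairs belong to the previous reducer
                pairs.add((a[2:], b[2:]))
    return pairs


reducer_inputs = defaultdict(list)
for split in SPLITS:
    for item in map_with_replication(split, W):
        reducer_inputs[item[0]].append(item)

all_pairs = set()
for reducer, items in reducer_inputs.items():
    all_pairs |= reduce_with_replication(reducer, items, W)
print(len(all_pairs))  # 15, the same as one global window over all nine entities
```

As on the slides, reducer 2 receives six replicated `2.1.*` entities but only the last two can fall into a window with its own `2.2.*` entities, so the rest are discarded before the window is applied.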

Page 23: Multi-pass Sorted Neighborhood Blocking with MapReduce

Challenges in Sorted Neighborhood on Map Reduce

• Sorted Neighborhood with Map Reduce
• Multipass in one Map Reduce
• Load Balancing for Nodes

Page 24: Multi-pass Sorted Neighborhood Blocking with MapReduce

Multipass Sorted Neighborhood Method

map1
disc_title   Genre   tracks
The Blues    Blues   18
Portrait     Jazz    17
Best Of      Blues   18

map2
disc_title   Genre   tracks
Stony Road   Blues   17
Jazz         Jazz    20

Page 25: Multi-pass Sorted Neighborhood Blocking with MapReduce

Multipass Sorted Neighborhood Method

map1
red.part.sort   disc_title   ...
1.1.Blues18     The Blues    ...
2.2.Jazz17      Portrait     ...
1.1.Blues18     Best Of      ...

map2
red.part.sort   disc_title   ...
1.1.Blues17     Stony Road   ...
2.2.Jazz20      Jazz         ...

Page 26: Multi-pass Sorted Neighborhood Blocking with MapReduce

Multipass Sorted Neighborhood Method

map1
red.part.sort   disc_title   ...
1.1.Blues18     The Blues    ...
2.2.Jazz17      Portrait     ...
1.1.Blues18     Best Of      ...
                The Blues    ...
                Portrait     ...
                Best Of      ...

map2
red.part.sort   disc_title   ...
1.1.Blues17     Stony Road   ...
2.2.Jazz20      Jazz         ...
                Stony Road   ...
                Jazz         ...

Page 27: Multi-pass Sorted Neighborhood Blocking with MapReduce

Multipass Sorted Neighborhood Method

map1
red.part.sort   disc_title   ...
1.1.Blues18     The Blues    ...
2.2.Jazz17      Portrait     ...
1.1.Blues18     Best Of      ...
Th18            The Blues    ...
Po17            Portrait     ...
Be18            Best Of      ...

map2
red.part.sort   disc_title   ...
1.1.Blues17     Stony Road   ...
2.2.Jazz20      Jazz         ...
St17            Stony Road   ...
Ja20            Jazz         ...

Page 28: Multi-pass Sorted Neighborhood Blocking with MapReduce

Multipass Sorted Neighborhood Method

map1
red.part.sort   disc_title   ...
1.1.Blues18     The Blues    ...
2.2.Jazz17      Portrait     ...
1.1.Blues18     Best Of      ...
2.2.Th18        The Blues    ...
2.2.Po17        Portrait     ...
1.1.Be18        Best Of      ...

map2
red.part.sort   disc_title   ...
1.1.Blues17     Stony Road   ...
2.2.Jazz20      Jazz         ...
2.2.St17        Stony Road   ...
1.1.Ja20        Jazz         ...

Page 29: Multi-pass Sorted Neighborhood Blocking with MapReduce

Multipass Sorted Neighborhood Method

map1
pass.red.part.sort   disc_title   ...
1.1.1.Blues18        The Blues    ...
1.2.2.Jazz17         Portrait     ...
1.1.1.Blues18        Best Of      ...
2.2.2.Th18           The Blues    ...
2.2.2.Po17           Portrait     ...
2.1.1.Be18           Best Of      ...

map2
pass.red.part.sort   disc_title   ...
1.1.1.Blues17        Stony Road   ...
1.2.2.Jazz20         Jazz         ...
2.2.2.St17           Stony Road   ...
2.1.1.Ja20           Jazz         ...
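A multipass run inside one job can be sketched by emitting every entity once per pass and prefixing the key with the pass number. The code below is a simplified illustration: it omits the reducer component and the replication step, and the pass-2 partition rule (first letter before "M" goes to partition 1) is an assumption chosen only because it reproduces the partition numbers shown on the slides. `PASSES`, `map_multipass`, and `reduce_window` are hypothetical names.

```python
from collections import defaultdict

# Two input splits as on the slides: (disc_title, genre, tracks)
SPLITS = [
    [("The Blues", "Blues", 18), ("Portrait", "Jazz", 17), ("Best Of", "Blues", 18)],
    [("Stony Road", "Blues", 17), ("Jazz", "Jazz", 20)],
]

# pass number -> (sorting key function, partition function)
PASSES = {
    1: (lambda r: f"{r[1]}{r[2]}", lambda k: 1 if k.startswith("B") else 2),
    # assumed split point "M": Be.., Ja.. -> 1 and Po.., St.., Th.. -> 2
    2: (lambda r: f"{r[0][:2]}{r[2]}", lambda k: 1 if k[0] < "M" else 2),
}


def map_multipass(split):
    """Emit each record once per pass with the key (pass, partition, sorting key)."""
    for record in split:
        for pass_id, (key_of, partition_of) in PASSES.items():
            sort_key = key_of(record)
            yield (pass_id, partition_of(sort_key), sort_key), record


def reduce_window(records, w):
    """Unordered candidate pairs from a window of size w over sorted records."""
    return {frozenset((a, b)) for i, a in enumerate(records) for b in records[i + 1 : i + w]}


emitted = [kv for split in SPLITS for kv in map_multipass(split)]

groups = defaultdict(list)
for key, record in emitted:
    groups[key[:2]].append((key, record))  # one group per (pass, partition)

candidates = set()
for group_key in sorted(groups):
    ordered = [record for _key, record in sorted(groups[group_key])]
    candidates |= reduce_window(ordered, w=3)
print(len(emitted), len(candidates))
```

Each of the five entities is emitted twice, and the union over both passes contains 7 distinct candidate pairs. Pairs such as Best Of and Jazz appear only because of the second sorting key.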

Page 30: Multi-pass Sorted Neighborhood Blocking with MapReduce

Challenges in Sorted Neighborhood on Map Reduce

• Sorted Neighborhood with Map Reduce
• Multipass in one Map Reduce
• Load Balancing for Nodes

Page 31: Multi-pass Sorted Neighborhood Blocking with MapReduce

Load Balancing

sort      disc_title    ...
Blues18   The Blues     ...
Jazz17    Portrait      ...
Blues18   Best Of       ...

sort      disc_title    ...
Blues18   I'M Walking   ...
Blues17   Stony Road    ...
Jazz20    Jazz          ...

sort      disc_title    ...
Blues19   Acustica      ...
Blues17   The Blues     ...
R+B17     Tasty         ...

sortK     disc_title    ...
Blues17   Stony Road    ...
Blues17   The Blues     ...
Blues18   The Blues     ...
Blues18   Best Of       ...
Blues18   I'M Walking   ...
Blues19   Acustica      ...

sortK     disc_title    ...
Jazz17    Portrait      ...
Jazz20    Jazz          ...
R+B17     Tasty         ...

sort      disc_title    ...
Blues17   Stony Road    ...
Blues17   The Blues     ...
Blues18   The Blues     ...
Blues18   Best Of       ...

sort      disc_title    ...
Blues18   I'M Walking   ...
Blues19   Acustica      ...
Jazz17    Portrait      ...
Jazz20    Jazz          ...
R+B17     Tasty         ...

Page 32: Multi-pass Sorted Neighborhood Blocking with MapReduce

Load Balancing

sort.mapN   disc_title    ...
Blues18.1   The Blues     ...
Jazz17.1    Portrait      ...
Blues18.1   Best Of       ...

sort.mapN   disc_title    ...
Blues18.2   I'M Walking   ...
Blues17.2   Stony Road    ...
Jazz20.2    Jazz          ...

sort.mapN   disc_title    ...
Blues19.3   Acustica      ...
Blues17.3   The Blues     ...
R+B17.3     Tasty         ...

sort.mapN   disc_title    ...
Blues17.2   Stony Road    ...
Blues17.3   The Blues     ...
Blues18.1   The Blues     ...
Blues18.1   Best Of       ...

sort.mapN   disc_title    ...
Blues18.2   I'M Walking   ...
Blues19.3   Acustica      ...
Jazz17.1    Portrait      ...
Jazz20.2    Jazz          ...
R+B17.3     Tasty         ...

Page 33: Multi-pass Sorted Neighborhood Blocking with MapReduce

Load Balancing

sort.mapN.counter   disc_title    ...
Blues18.1.1         The Blues     ...
Jazz17.1.1          Portrait      ...
Blues18.1.2         Best Of       ...

sort.mapN.counter   disc_title    ...
Blues18.2.1         I'M Walking   ...
Blues17.2.1         Stony Road    ...
Jazz20.2.1          Jazz          ...

sort.mapN.counter   disc_title    ...
Blues19.3.1         Acustica      ...
Blues17.3.1         The Blues     ...
R+B17.3.1           Tasty         ...

sort.mapN.counter   disc_title    ...
Blues17.2.1         Stony Road    ...
Blues17.3.1         The Blues     ...
Blues18.1.1         The Blues     ...
Blues18.1.2         Best Of       ...

sort.mapN.counter   disc_title    ...
Blues18.2.1         I'M Walking   ...
Blues19.3.1         Acustica      ...
Jazz17.1.1          Portrait      ...
Jazz20.2.1          Jazz          ...
R+B17.3.1           Tasty         ...

Page 34: Multi-pass Sorted Neighborhood Blocking with MapReduce

Load Balancing

sortKey   MapN: 1   2   3
Blues17         0   1   1
Blues18         2   1   0
Blues19         0   0   1
Jazz17          1   0   0
Jazz20          0   1   0
R+B17           0   0   1

Blues18.2.1

part.sort   disc_title   ...

part.sort   disc_title   ...

Page 35: Multi-pass Sorted Neighborhood Blocking with MapReduce

Load Balancing

sortKey   MapN: 1   2   3
Blues17         0   1   1
Blues18         2   1   0
Blues19         0   0   1
Jazz17          1   0   0
Jazz20          0   1   0
R+B17           0   0   1

Blues18.2.1

part.sort   disc_title    ...

part.sort   disc_title    ...
2.Blues18   I'M Walking   ...

Page 36: Multi-pass Sorted Neighborhood Blocking with MapReduce

Load Balancing

sortKey   MapN: 1   2   3
Blues17         0   1   1
Blues18         2   1   0
Blues19         0   0   1
Jazz17          1   0   0
Jazz20          0   1   0
R+B17           0   0   1

Blues18.1.1

part.sort   disc_title    ...

part.sort   disc_title    ...
2.Blues18   I'M Walking   ...

Page 37: Multi-pass Sorted Neighborhood Blocking with MapReduce

Load Balancing

sortKey   MapN: 1   2   3
Blues17         0   1   1
Blues18         2   1   0
Blues19         0   0   1
Jazz17          1   0   0
Jazz20          0   1   0
R+B17           0   0   1

Blues18.1.1

part.sort   disc_title    ...
1.Blues18   The Blues     ...

part.sort   disc_title    ...
2.Blues18   I'M Walking   ...

Page 38: Multi-pass Sorted Neighborhood Blocking with MapReduce

Load Balancing

sortKey   MapN: 1   2   3
Blues17         0   1   1
Blues18         2   1   0
Blues19         0   0   1
Jazz17          1   0   0
Jazz20          0   1   0
R+B17           0   0   1

part.sort   disc_title    ...
1.Blues17   Stony Road    ...
1.Blues17   The Blues     ...
1.Blues18   The Blues     ...
1.Blues18   Best Of       ...

part.sort   disc_title    ...
2.Blues18   I'M Walking   ...
2.Blues19   Acustica      ...
2.Jazz17    Portrait      ...
2.Jazz20    Jazz          ...
2.R+B17     Tasty         ...
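The way the matrix turns a tag such as `Blues18.2.1` into a partition number can be sketched as below. The matrix values are taken from the slides. The splitting rule (equal-sized blocks, with the last reducer taking the remainder) is an assumption that happens to reproduce the 4/5 split shown, and `KDM`, `global_position`, and `balanced_partition` are hypothetical names.

```python
# Matrix from the slides: sorting key -> number of entities per map task (1, 2, 3)
KDM = {
    "Blues17": [0, 1, 1],
    "Blues18": [2, 1, 0],
    "Blues19": [0, 0, 1],
    "Jazz17": [1, 0, 0],
    "Jazz20": [0, 1, 0],
    "R+B17": [0, 0, 1],
}
NUM_REDUCERS = 2


def global_position(sort_key, map_n, counter):
    """1-based rank of the entity tagged sort_key.map_n.counter in the global order."""
    smaller_keys = sum(sum(counts) for key, counts in KDM.items() if key < sort_key)
    same_key_earlier_maps = sum(KDM[sort_key][: map_n - 1])
    return smaller_keys + same_key_earlier_maps + counter


def balanced_partition(sort_key, map_n, counter):
    """Assign the entity to a reducer so that reducers get near-equal shares."""
    total = sum(sum(counts) for counts in KDM.values())
    block = total // NUM_REDUCERS  # assumed rule: last reducer takes the remainder
    index = global_position(sort_key, map_n, counter) - 1
    return min(index // block, NUM_REDUCERS - 1) + 1


print(balanced_partition("Blues18", 2, 1))  # 2, i.e. 2.Blues18 I'M Walking
print(balanced_partition("Blues18", 1, 1))  # 1, i.e. 1.Blues18 The Blues
```

Because every map task holds the same matrix, each one can compute the partition of its own entities locally, and entities with the same sorting key (here Blues18) may end up on different reducers.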

Page 39: Multi-pass Sorted Neighborhood Blocking with MapReduce

Benchmarks


Page 40: Multi-pass Sorted Neighborhood Blocking with MapReduce

Benchmarks


artist[:2] + title

artist[:1] + title[:1]

Page 41: Multi-pass Sorted Neighborhood Blocking with MapReduce

Summary

1. Sorted Neighborhood Method
   ■ with Map Reduce
   ■ with Entity Replication
2. Multipass Sorted Neighborhood Method
3. Load Balancing
4. Benchmarks
