Introduction to MapReduce Algorithms and Analysis

Jeff M. Phillips

October 25, 2013

Slides: jeffp/teaching/MCMD/MR-intro.pdf



Trade-Offs

Massive parallelism that is very easy to program.

• Cheaper than HPC style (which uses top-of-the-line everything)

• Assumptions about the data (key-value pairs)

• Restrictions on how computation is done


Cluster of Commodity Nodes (2003)

Big Iron Box:

• 8 × 2 GHz Xeon processors
• 64 GB RAM
• 8 TB disk
• 758,000 USD

Google Rack:

• 176 × 2 GHz Xeon processors
• 176 GB RAM
• 7 TB disk
• 278,000 USD


Google File System

SOSP 2003 (Ghemawat, Gobioff, Leung)

Key-Value Pairs: all files are stored as key-value pairs.

• key: log id; value: actual log

• key: web address; value: html and/or outgoing links

• key: document id in set; value: list of words

• key: word in corpus; value: how often it appears

Blocking: all files are broken into blocks (often 64 MB). Each block has a replication factor (say 3×), stored on separate nodes. No locality on compute nodes: no neighbors or replicas on the same node (but often the same rack).


No locality?

Really?

• Resiliency: if one node dies, use another (on big clusters this happens all the time)

• Redundancy: if one is slow, use another (...the curse of the last reducer)

• Heterogeneity: the format is quite flexible, and 64 MB is often still enough (recall: I/O-efficiency)


MapReduce

OSDI 04 (Dean, Ghemawat)

Each processor has a full hard drive; data items are ⟨key, value⟩ pairs. Parallelism proceeds in rounds:

• Map: assigns items to processors by key.

• Reduce: processes all items using value. Usually combines many items with the same key.

Repeat Map + Reduce a constant number of times, often only one round.

• Optional post-processing step.

[Figure: CPU/RAM nodes feeding a MAP → sort → REDUCE → post-process pipeline]

Pro: Robust (duplication) and simple. Can harness locality.
Con: Somewhat restrictive model.
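One Map + shuffle + Reduce round can be sketched on a single machine; this is an illustrative simulation (not the real distributed system), where "processors" are just buckets chosen by hashing the key:

```python
from collections import defaultdict

def run_round(items, map_fn, reduce_fn, num_procs=4):
    """Simulate one MapReduce round: Map, shuffle by key, Reduce."""
    # Map: each input item may emit several <key, value> pairs.
    emitted = [kv for item in items for kv in map_fn(item)]
    # Shuffle: assign pairs to "processors" by hashing the key,
    # then group by key within each processor.
    procs = defaultdict(lambda: defaultdict(list))
    for k, v in emitted:
        procs[hash(k) % num_procs][k].append(v)
    # Reduce: combine all values that share a key.
    out = {}
    for groups in procs.values():
        for k, vs in groups.items():
            out[k] = reduce_fn(k, vs)
    return out

# e.g. summing values by key:
result = run_round([("a", 1), ("b", 2), ("a", 3)],
                   lambda kv: [kv],
                   lambda k, vs: sum(vs))
```

Because every pair with the same key hashes to the same processor, each reducer sees all values for its keys, which is what makes the per-key combination correct.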


Granularity and Pipelining

OSDI 04 (Dean, Ghemawat)


Last Reducer

Typically the Map phase is linear over blocks; the Reducers are more variable.

No answer until the last one is done! Some machines get slow or crash!

Solution: automatically run back-up copies. Take the first to complete.

Scheduled by a Master Node: it organizes the computation but does not process data. If it fails, everything goes down.


Example 1: Word Count

Given a text corpus of ⟨doc id, list of words⟩ pairs, count how many times each word appears.

Map:
For each word w → ⟨w, 1⟩

Reduce:
{⟨w, c1⟩, ⟨w, c2⟩, ⟨w, c3⟩, ...} → ⟨w, ∑i ci⟩

w = "the" is 7% of all words!

Combine: (run before Map output goes to the Shuffle phase)
{⟨w, c1⟩, ⟨w, c2⟩, ⟨w, c3⟩, ...} → ⟨w, ∑i ci⟩
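The Map, Combine, and Reduce steps above can be sketched as plain functions; this is a single-machine sketch, not Hadoop code:

```python
from collections import Counter

def map_doc(doc_id, words):
    # Map: each word w -> <w, 1>
    return [(w, 1) for w in words]

def combine(pairs):
    # Combine: merge counts locally before the shuffle, so a common
    # word like "the" does not flood the network with <the, 1> pairs.
    c = Counter()
    for w, n in pairs:
        c[w] += n
    return list(c.items())

def reduce_counts(shuffled):
    # Reduce: {<w, c1>, <w, c2>, ...} -> <w, sum_i ci>
    totals = Counter()
    for w, n in shuffled:
        totals[w] += n
    return dict(totals)

docs = {1: ["the", "cat", "the"], 2: ["the", "dog"]}
mapped = [p for d, ws in docs.items() for p in combine(map_doc(d, ws))]
counts = reduce_counts(mapped)  # {"the": 3, "cat": 1, "dog": 1}
```

Note that Combine and Reduce run the same aggregation here; that works because addition is associative and commutative.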


Example 1: Word Count - Actual Code

[Code shown as an image on the slide; not preserved in this extraction]


Example 2: Inverted Index

Given all of Wikipedia (all webpages), for each word, list all pages it appears on.

Map:
For page p, each word w → ⟨w, p⟩

Combine:
{⟨w, p1⟩, ⟨w, p2⟩, ⟨w, p3⟩, ...} → ⟨w, ∪i pi⟩

Reduce:
{⟨w, p1⟩, ⟨w, p2⟩, ⟨w, p3⟩, ...} → ⟨w, ∪i pi⟩
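The inverted index follows the same pattern as word count, but the Reduce takes a union of page ids instead of a sum; a minimal single-machine sketch:

```python
from collections import defaultdict

def map_page(page, words):
    # Map: for page p, each word w -> <w, p>
    return [(w, page) for w in words]

def reduce_index(pairs):
    # Reduce: {<w, p1>, <w, p2>, ...} -> <w, union_i pi>
    index = defaultdict(set)
    for w, p in pairs:
        index[w].add(p)
    return dict(index)

pages = {"p1": ["big", "data"], "p2": ["big", "iron"]}
pairs = [kv for p, ws in pages.items() for kv in map_page(p, ws)]
index = reduce_index(pairs)
# index == {"big": {"p1", "p2"}, "data": {"p1"}, "iron": {"p2"}}
```

Using a set makes the union idempotent, so a Combine step running the same union locally cannot change the final answer.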


Hadoop

Open source version of MapReduce (and related systems, e.g. HDFS)

• Began 2005 (Cutting + Cafarella), supported by Yahoo!

• Stable enough for large scale around 2008

• Source code released 2009

• Written in Java (Google's MapReduce is in C++)

Led to widespread adoption in industry and academia!


Rounds

Many algorithms are iterative, especially in machine learning / data mining:

• Lloyd's algorithm for k-means

• gradient descent

• singular value decomposition

These may require log2 n rounds, and log2(n = 1 billion) ≈ 30.

MapReduce puts rounds at a premium. Hadoop can have a several-minute delay between rounds. (Each round writes to HDFS for resiliency; the same holds in MapReduce.)

The MRC model (Karloff, Suri, Vassilvitskii; SODA 2010) stresses rounds.
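To see where a log2 n round count comes from, consider the classic tree reduction: each round merges disjoint pairs of partial results in parallel, halving their number, so n inputs need ⌈log2 n⌉ rounds. A small sketch:

```python
import math

def tree_reduce(values, op):
    """Pairwise-merge partial results round by round; each round halves
    the count, so n inputs take ceil(log2 n) rounds."""
    rounds = 0
    while len(values) > 1:
        # One synchronous round: merge disjoint pairs "in parallel".
        values = [op(values[i], values[i + 1]) if i + 1 < len(values)
                  else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

total, rounds = tree_reduce(list(range(16)), lambda a, b: a + b)
# total = 120, rounds = 4; at n = 1 billion this would be ~30 rounds,
# and with minutes of per-round overhead that cost dominates.
```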


Bulk Synchronous Parallel

Les Valiant [1989]: BSP. Creates "barriers" in a parallel algorithm.

1. Each processor computes on its data

2. Processors send/receive data

3. Barrier: all processors wait for communication to end globally

Allows for easy synchronization. Easier to analyze, since it handles many messy synchronization details when emulated.

[Figure: CPU/RAM nodes proceeding in supersteps separated by barriers]


Reduction from MR (Goodrich, Sitchinava, Zhang; ISAAC 2011)
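The compute / communicate / barrier structure can be illustrated with Python threads; this is a toy sketch of one superstep, where `threading.Barrier` guarantees every send completes before any worker reads its messages:

```python
import threading

P = 3                              # number of "processors"
barrier = threading.Barrier(P)
inbox = [[] for _ in range(P)]     # messages delivered to each worker
results = [0] * P

def worker(pid, value):
    # Step 1+2: compute, then send our value to every other worker.
    for q in range(P):
        if q != pid:
            inbox[q].append(value)
    barrier.wait()   # Step 3: barrier -- all communication has ended
    # Next superstep: every worker now safely sees all messages.
    results[pid] = value + sum(inbox[pid])

threads = [threading.Thread(target=worker, args=(i, i + 1)) for i in range(P)]
for t in threads: t.start()
for t in threads: t.join()
# every worker computes 1 + 2 + 3 = 6
```

Without the barrier a fast worker could read its inbox before a slow worker's message arrived; the barrier is exactly the synchronization point BSP makes explicit.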


Replication Rate

Consider a join of two sets R and S, each of size n = 10,000.
List pair (r, s) ∈ R × S if f(r, s) = 1, for some function f.

Option 1: Create n² reducers.
Replication rate g = n.

Option 2: Create 1 reducer.
The reducer has size 2n and performs n² = 100 million operations.
Replication rate g = 1. No parallelism.

Option 3: Create g² reducers, each with 2 groups of size n/g.
Reducer size 2n/g, (n/g)² operations (for g = 10, only 1 million).
Replication rate g.


(Afrati, Das Sarma, Salihoglu, Ullman 2013); (Beame, Koutris, Suciu 2013)
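The arithmetic behind the trade-off can be checked directly; a small sketch computing the cost of a replication rate g (Option 2 is g = 1, Option 3 with g = 10):

```python
n = 10_000

def join_cost(g, n=n):
    """With replication rate g: g**2 reducers, each holding 2n/g items
    and doing (n/g)**2 pairwise comparisons."""
    reducers = g * g
    reducer_size = 2 * n // g
    ops_per_reducer = (n // g) ** 2
    return reducers, reducer_size, ops_per_reducer

# g = 1: one reducer, 100 million comparisons, no parallelism.
print(join_cost(1))    # (1, 20000, 100000000)
# g = 10: 100 reducers, only 1 million comparisons each.
print(join_cost(10))   # (100, 2000, 1000000)
```

Raising g buys parallelism (smaller, faster reducers) at the price of shipping each input item g times, which is exactly the replication rate.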


Sawzall / Dremel / Tenzing

Google's solutions to the many-to-few problem:

• Compute statistics on massive distributed data.

• Separate local computation from aggregation.

• Better with iteration.


Berkeley Spark: Processing in memory

Zaharia, Chowdhury, Das, Ma, McCauley, Franklin, Shenker, Stoica (HotCloud 2010, NSDI 2012)

• Keeps relevant information in memory.

• Much faster on iterative algorithms (machine learning, SQL queries).

• Requires careful work to retain resiliency.

Key idea: RDDs (Resilient Distributed Datasets). They can be stored in memory without replication and rebuilt from lineage if lost.
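The lineage idea can be sketched in a few lines; this toy class (a made-up illustration, not Spark's API) caches a partition in memory and recomputes it from its parent when the cache is lost:

```python
class RDD:
    """Toy sketch of lineage-based recovery: data is cached in memory,
    and rebuilt from its parent transformation if the cache is lost."""
    def __init__(self, data=None, parent=None, fn=None):
        self.parent, self.fn = parent, fn
        self._cache = list(data) if data is not None else None

    def map(self, fn):
        # Record the transformation (the lineage), not its result.
        return RDD(parent=self, fn=fn)

    def collect(self):
        if self._cache is None:  # lost or never computed: replay lineage
            self._cache = [self.fn(x) for x in self.parent.collect()]
        return self._cache

    def evict(self):
        if self.parent is not None:  # simulate losing the in-memory copy
            self._cache = None

base = RDD(data=[1, 2, 3])
doubled = base.map(lambda x: 2 * x)
print(doubled.collect())  # [2, 4, 6]
doubled.evict()
print(doubled.collect())  # [2, 4, 6] -- rebuilt from lineage, no replica needed
```

Because the lineage (parent + function) is tiny compared to the data, resiliency comes almost for free, without HDFS-style replication of every block.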