MapReduce Presented by Zilong Tan

A Computational View of MapReduce


The Data Model

● A (logical) file is a string $a_1 a_2 \dots a_n$, where each $a_j$ is a substring.
  Eg: “Hello\nworld!” ⇒ $a_1$ = “Hello”, $a_2$ = “world!”; “\n” is a separator.
● Q1: How to split a file equally?
  ○ Eg: $a_1 a_2 \dots a_{2n}$ ⇒ $a_1 a_2 \dots a_n$ and $a_{n+1} a_{n+2} \dots a_{2n}$.
● Q2: What about splitting the file into more segments?
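One possible answer to Q1 and Q2 is sketched below. It assumes that "equal" means an equal number of records rather than an equal number of bytes, and the function names `split_records` and `split_segments` are hypothetical, not part of the slides.

```python
def split_records(data: str, sep: str = "\n") -> list[str]:
    """Split a logical file into its records a_1, ..., a_n on a separator."""
    return data.split(sep)


def split_segments(records: list[str], k: int) -> list[list[str]]:
    """Split records into k contiguous segments whose sizes differ by at most one."""
    base, extra = divmod(len(records), k)
    segments, start = [], 0
    for i in range(k):
        # The first `extra` segments take one additional record each.
        size = base + (1 if i < extra else 0)
        segments.append(records[start:start + size])
        start += size
    return segments


records = split_records("Hello\nworld!")
halves = split_segments(["a1", "a2", "a3", "a4"], 2)
thirds = split_segments(["a1", "a2", "a3", "a4", "a5"], 3)
```

Splitting on record boundaries keeps every $a_j$ whole, at the cost of segments that may differ in byte size.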

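The $\mathrm{Map}(a_j)$ Function

● Map: $a_j \to \{(\mathrm{key}(a_j), \mathrm{val}(a_j))\}$
● $\mathrm{key}(a_j)$ and $\mathrm{val}(a_j)$ are strings.

Eg (each line uses a different choice of Map):
Map(“Hello”) = (“Hello”, “1”),
Map(“Hello world”) = {(“Hello”, “1”), (“world”, “1”)},
Map(“Hello world”) = (“world”, “Hello”).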
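The three examples can be read as three different Map functions. The sketch below writes each one out, assuming a Map always returns a list of (key, value) string pairs; the function names are hypothetical.

```python
def map_identity_count(record: str) -> list[tuple[str, str]]:
    """Emit the whole record as the key, with the value "1"."""
    return [(record, "1")]


def map_word_count(record: str) -> list[tuple[str, str]]:
    """Emit one (word, "1") pair per whitespace-separated word."""
    return [(word, "1") for word in record.split()]


def map_swap(record: str) -> list[tuple[str, str]]:
    """Emit (second word, first word); assumes exactly two words."""
    first, second = record.split()
    return [(second, first)]
```

For instance, `map_word_count("Hello world")` yields the two pairs shown in the second example.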

Contd.

● The input file $a_1 a_2 \dots a_m$ is organized as

Key                   Value 1               Value 2               Value 3               ...
$\mathrm{key}(a_1)$   $\mathrm{val}(a_1)$   $\mathrm{val}(a_7)$   $\mathrm{val}(a_2)$   #
$\mathrm{key}(a_5)$   $\mathrm{val}(a_n)$   $\mathrm{val}(a_5)$   #
$\mathrm{key}(a_3)$   $\mathrm{val}(a_m)$   $\mathrm{val}(a_2)$   $\mathrm{val}(a_3)$   ...
...

● Each row shares the same key.
● Mistake in the table above: $a_2$ cannot appear in two rows.
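A minimal sketch of how such a table could be built from Map output, assuming the pairs arrive as a flat list and the table is held in memory as a dictionary. The name `build_table` and the sample pairs are illustrative only.

```python
from collections import defaultdict


def build_table(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (key, value) pairs so that each row holds the values of one key."""
    table = defaultdict(list)
    for key, value in pairs:
        table[key].append(value)
    return dict(table)


pairs = [
    ("world", "2"), ("hello", "10"), ("world", "11"),
    ("hello", "0"), ("world", "3"), ("hello", "5"),
]
table = build_table(pairs)
```

Each value lands in exactly one row, the row of its own key.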

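The $\mathrm{Reduce}(k, v_1, v_2, \dots, v_d)$ Function

● Reduce: $(k, v_1, v_2, \dots, v_d) \to v$.
● The arguments $\mathrm{key}(s), \mathrm{val}(s_1), \mathrm{val}(s_2), \dots, \mathrm{val}(s_d)$ form a row.

Eg: Reduce(“Hello”, “2”, “1”, “5”) = “Hello 8”. (WordCount)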

Contd.

Key        Value 1    Value 2    Value 3
“world”    “2”        “11”       “3”
“hello”    “10”       “0”        “5”

Contd. Sort

Key        Value 1    Value 2    Value 3
“hello”    “0”        “5”        “10”
“world”    “2”        “3”        “11”

Contd.

● Reduce() is executed on each row of the sorted table.
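The sort-then-reduce step can be sketched as follows. The sketch assumes that rows are ordered by key, that values within a row are ordered numerically (as in the sorted table), and that Reduce is the WordCount sum from the earlier example; `sort_table` and `reduce_sum` are hypothetical names.

```python
def reduce_sum(key: str, *values: str) -> str:
    """WordCount-style Reduce: add up the numeric strings in one row."""
    return f"{key} {sum(int(v) for v in values)}"


def sort_table(table: dict[str, list[str]]) -> dict[str, list[str]]:
    """Order rows by key and the values inside each row numerically."""
    return {key: sorted(values, key=int) for key, values in sorted(table.items())}


table = {"world": ["2", "11", "3"], "hello": ["10", "0", "5"]}
sorted_table = sort_table(table)

# One Reduce() call per row of the sorted table.
results = [reduce_sum(key, *values) for key, values in sorted_table.items()]
```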

Parallel Computation

● The table we have seen is global.
● A Map node is assigned a file segment $s_j s_{j+1} \dots s_{j+k}$, and executes Map() on each $s_i$ in the segment.
● A Reduce node is associated with one or more rows of the table, and executes Reduce() on each associated row.
● Map() and Reduce() execute concurrently on multiple machines.
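The assignment of segments to Map nodes and rows to Reduce nodes can be imitated on a single machine. In the sketch below, threads from `concurrent.futures.ThreadPoolExecutor` stand in for nodes, which is an assumption made purely for illustration. The Map phase also finishes before the Reduce phase starts, which is a simplification. `map_segment` and `reduce_row` are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_segment(segment: list[str]) -> list[tuple[str, str]]:
    """Work of one Map node: run Map() on every record of its segment."""
    pairs = []
    for record in segment:
        pairs.extend((word, "1") for word in record.split())
    return pairs


def reduce_row(row: tuple[str, list[str]]) -> tuple[str, str]:
    """Work of one Reduce node on a single row: sum the values."""
    key, values = row
    return key, str(sum(int(v) for v in values))


segments = [["cat dog", "bird cat"], ["dog cat"]]

with ThreadPoolExecutor(max_workers=2) as pool:
    # One task per file segment stands in for one Map node.
    mapped = list(pool.map(map_segment, segments))

    # The global table: one row per key.
    table = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            table[key].append(value)

    # One task per row stands in for the Reduce nodes.
    reduced = dict(pool.map(reduce_row, sorted(table.items())))
```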

WordCount Example

● Input: $w = w_1 w_2 \dots w_k$.
● $\mathrm{Map}(w) = \{(w_j, \text{“1”})\}$, $j = 1, 2, \dots, k$.
● $\mathrm{Reduce}(w, v_1, v_2, \dots, v_d) = (w, \sum_{j=1}^{d} v_j)$.

Contd.

● w = “cat … dog … bird …”.

After Map:

Key       Value 1   Value 2   Value 3   ...
“cat”     “1”       “1”       “1”       ...
“dog”     “1”       “1”       “1”       ...
“bird”    “1”       “1”       “1”       ...

After sorting by key:

Key       Value 1   Value 2   Value 3   ...
“bird”    “1”       “1”       “1”       ...
“cat”     “1”       “1”       “1”       ...
“dog”     “1”       “1”       “1”       ...

After Reduce:

Key       Value 1
“bird”    “39”
“cat”     “20”
“dog”     “11”
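The three stages of the WordCount example (Map, sort, Reduce) can be traced on a small input. The sketch below is sequential and uses a short made-up string, not the slide's input, so its counts differ from those in the table; the function names are hypothetical.

```python
from collections import defaultdict


def word_count_map(w: str) -> list[tuple[str, str]]:
    """Map(w) = {(w_j, "1")} for every word w_j of the input."""
    return [(word, "1") for word in w.split()]


def word_count_reduce(word: str, *values: str) -> tuple[str, str]:
    """Reduce(w, v_1, ..., v_d) = (w, sum of the v_j)."""
    return word, str(sum(int(v) for v in values))


w = "cat dog bird cat bird cat"

# After Map: one row per word, one "1" per occurrence.
table = defaultdict(list)
for key, value in word_count_map(w):
    table[key].append(value)

# After sorting by key.
sorted_rows = sorted(table.items())

# After Reduce: a single value per row.
counts = [word_count_reduce(key, *values) for key, values in sorted_rows]
```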

The Bursting I/O Problem

● Let $N$ be the file size.
● What would be the table size?
● At least $\Omega(N)$.
  ○ Each word in the input file corresponds to a value in the table.
● Too much I/O traffic!

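The $\mathrm{Combine}_k(v_1, v_2, \dots, v_d)$ Function

● Goal: to reduce the table size.
● Assumptions:
$\mathrm{Combine}_k(v) = v$,
$\mathrm{Combine}_k(v_1, \dots, v_d) = \mathrm{Combine}_k(\mathrm{Combine}_k(v_1, \dots, v_{d-1}), v_d)$,
$\mathrm{Reduce}(k, v_1, v_2, \dots, v_d) = \mathrm{Reduce}(k, \mathrm{Combine}_k(v_1, \dots, v_d))$.
● Table size reduction ($m$ Map nodes):
$\mathrm{Reduce}(k, v_1, v_2, \dots, v_d) = \mathrm{Reduce}(k, \mathrm{Combine}_k(v_1, \dots, v_{d/m}), \mathrm{Combine}_k(v_{d/m+1}, \dots, v_{2d/m}), \dots)$.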
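For WordCount, addition satisfies these assumptions, so a combiner can be sketched as a pairwise fold. The code below assumes non-empty value lists and an even split of a row's values across $m$ Map nodes; `combine`, `reduce_sum`, and `chunk` are hypothetical names.

```python
from functools import reduce as fold


def combine(values: list[str]) -> str:
    """Combine_k for WordCount: fold the values pairwise, so combine([v]) == v."""
    return fold(lambda acc, v: str(int(acc) + int(v)), values)


def reduce_sum(key: str, *values: str) -> tuple[str, str]:
    """WordCount Reduce: sum the values of one row."""
    return key, str(sum(int(v) for v in values))


def chunk(values: list[str], m: int) -> list[list[str]]:
    """Split a non-empty row into at most m contiguous parts, one per Map node."""
    size = -(-len(values) // m)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


values = ["1"] * 12
m = 3

# Each Map node combines its own share, so the row shrinks from 12 values to 3.
partials = [combine(part) for part in chunk(values, m)]
```

Reducing the three partial values gives the same result as reducing the original twelve, which is the table size reduction stated above.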

Contd.

● Assume $m$ Map nodes:
  ○ Best case: each Map node has a combiner.
  ○ Minimum possible space: $\Theta(m)$.

Key       Value 1   Value 2   Value 3   ...
“bird”    “300”     “351”     “310”     ...
“cat”     “109”     “1112”    “207”     ...
“dog”     “4”       “2”       “3”       ...

The $\mathrm{Partition}(k, M)$ Function

● How to assign rows to Reduce nodes?
● Partition: key → node.
● Typically $\mathrm{Partition}(k, M) = \mathrm{HashFunction}(k) \bmod M$.

Eg, taking HashFunction(“cat”) = 1: Partition(“cat”, 5) = 1 mod 5 = 1.
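A minimal sketch of such a partitioner follows. The slides do not name a hash function, so CRC32 from Python's `zlib` module is an assumption here; it is chosen because Python's built-in `hash()` for strings varies between runs. The hash value 1 for “cat” in the example is illustrative and is not what CRC32 produces.

```python
import zlib


def partition(key: str, num_nodes: int) -> int:
    """Partition(k, M) = HashFunction(k) mod M, with CRC32 as the hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


# Every row with the same key is sent to the same Reduce node.
node_for_cat = partition("cat", 5)
```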

Discussion

● Data Skew Problem
  ○ A particular Reduce node is assigned many more rows than others.
● Binary File Support
  ○ What would happen if the file is a binary string?
  ○ Propose a solution.
● Straggler Detection
  ○ A Reduce node runs longer than usual.
  ○ Identify whether this is due to a machine-related issue.