Transcript
Page 1: Upper and Lower Bound on the Cost of a MapReduce Computation

Upper and Lower Bounds on the Cost of a Map-Reduce Computation

38th International Conference on Very Large Data Bases (VLDB 2012)

Tzu-Li Tai, National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory

Foto N. Afrati, National Technical University of Athens

Anish Das Sarma, Google Research

Semih Salihoglu, Jeffrey D. Ullman, Stanford University

Page 2: Upper and Lower Bound on the Cost of a MapReduce Computation

Agenda

A. Background
B. A Motivating Example
C. Tradeoff: Parallelism & Communication
D. Problem Model and Assumptions
E. The Hamming-Distance-1 Problem
F. Conclusion

Page 3: Upper and Lower Bound on the Cost of a MapReduce Computation

Background

The MapReduce Paradigm

[Figure: Map tasks read (k1, v1) input pairs and emit (k2, v2) pairs; Reduce tasks receive (k2, [v2]) and emit (k3, v3).]

Page 4: Upper and Lower Bound on the Cost of a MapReduce Computation

Background

Distributed/Parallel Computing in Clusters

• Often uses MapReduce to express applications (Hadoop)
  - This paper focuses on single-round MR applications

• Limited bandwidth

• Limited resources (memory, processing units, etc.)

• For public clouds, you "pay as you go" for these resources
  - Amazon EC2 charges for both bandwidth usage & processing units

Page 5: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

The Drug Interaction Problem

• 3000 sets of drug data (patients taking the drug, dates, diagnoses)

• About 1 MB of data per drug

• Problem: find two drugs that, when taken together, increase the risk of heart attack

• Requires cross-referencing every pair of drugs across the whole set

Page 6: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

[Figure: naive approach, shown for four drugs. One Map task per drug, one Reduce task per pair of drugs.]

Map output (key, value):
- Drug 1: ({1,2}, data 1), ({1,3}, data 1), ({1,4}, data 1)
- Drug 2: ({1,2}, data 2), ({2,3}, data 2), ({2,4}, data 2)
- Drug 3: ({1,3}, data 3), ({2,3}, data 3), ({3,4}, data 3)
- Drug 4: ({1,4}, data 4), ({2,4}, data 4), ({3,4}, data 4)

Reduce input:
- Reduce for {1,2}: data 1+2
- Reduce for {1,3}: data 1+3
- Reduce for {1,4}: data 1+4
- Reduce for {2,3}: data 2+3
- Reduce for {2,4}: data 2+4
- Reduce for {3,4}: data 3+4
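
The naive scheme in this figure can be sketched in a few lines of Python. This is only an illustration: the function names `naive_map` and `shuffle` and the placeholder payloads are assumptions, not part of the original slides, and the in-memory "shuffle" stands in for what a MapReduce framework would do across machines.

```python
from collections import defaultdict


def naive_map(drug_id, data, num_drugs):
    """Emit one ({i, j}, data) pair for every other drug j."""
    for other in range(1, num_drugs + 1):
        if other != drug_id:
            yield frozenset((drug_id, other)), (drug_id, data)


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


num_drugs = 4
records = {d: f"data {d}" for d in range(1, num_drugs + 1)}  # placeholder payloads
emitted = [kv for d, data in records.items() for kv in naive_map(d, data, num_drugs)]
reducers = shuffle(emitted)

print(len(emitted))   # 12 key-value pairs: each record is sent 3 times
print(len(reducers))  # 6 reducers, one per unordered pair of drugs
```

Every record is copied once per other drug, which is why the replication grows linearly with the number of drugs.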

Page 7: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

What Went Wrong?

• For 3000 drugs, each set of drug data is replicated 2999 times

• Each set of data is about 1 MB, so 3000 × 2999 × 1 MB
  ≈ 9 terabytes of communication
  ≈ 90,000 sec on a 1 Gigabit network

• Communication cost is too high!
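
The arithmetic behind these figures can be checked with a short script. It assumes 1 MB means 10^6 bytes and that the link delivers about 100 MB/s of effective throughput, which is the rate that reproduces the slide's 90,000 sec (at the raw 125 MB/s of a 1 Gbit/s link the transfer would take roughly 72,000 sec). The constant names are illustrative.

```python
NUM_DRUGS = 3000
RECORD_BYTES = 1_000_000          # "about 1 MB" per drug, taken as 10**6 bytes
THROUGHPUT_BYTES_PER_SEC = 100e6  # assumed effective rate of a 1 Gbit/s link

copies_per_record = NUM_DRUGS - 1
total_bytes = NUM_DRUGS * copies_per_record * RECORD_BYTES
seconds = total_bytes / THROUGHPUT_BYTES_PER_SEC

print(f"{total_bytes / 1e12:.3f} TB")  # 8.997 TB
print(f"{seconds:,.0f} s")             # 89,970 s
```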

Page 8: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

Different Approach: Grouping Drugs
• G1: Drugs 1-2
• G2: Drugs 3-4
• G3: Drugs 5-6

[Figure: one Map task per drug, six drugs in three groups.]

Map output (key, value):
- Drug 1: ({G1,G2}, data 1), ({G1,G3}, data 1)
- Drug 2: ({G1,G2}, data 2), ({G1,G3}, data 2)
- Drug 3: ({G1,G2}, data 3), ({G2,G3}, data 3)
- Drug 4: ({G1,G2}, data 4), ({G2,G3}, data 4)
- Drug 5: ({G1,G3}, data 5), ({G2,G3}, data 5)
- Drug 6: ({G1,G3}, data 6), ({G2,G3}, data 6)

Key: own group + each other group
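
A minimal Python sketch of the grouping approach follows, assuming equal-sized groups of consecutive drug ids. The helper names `group_of` and `grouped_map` are hypothetical, and the dictionary plays the role of the shuffle phase.

```python
from collections import defaultdict


def group_of(drug_id, group_size):
    """Return the 1-based group number of a 1-based drug id."""
    return (drug_id - 1) // group_size + 1


def grouped_map(drug_id, data, num_groups, group_size):
    """Emit ({own group, other group}, data) for every other group."""
    own = group_of(drug_id, group_size)
    for other in range(1, num_groups + 1):
        if other != own:
            yield frozenset((own, other)), (drug_id, data)


num_drugs, group_size = 6, 2
num_groups = num_drugs // group_size
reducers = defaultdict(list)
for drug in range(1, num_drugs + 1):
    for key, value in grouped_map(drug, f"data {drug}", num_groups, group_size):
        reducers[key].append(value[0])  # keep only the drug id for display

for key in sorted(reducers, key=sorted):
    print(sorted(key), reducers[key])
# [1, 2] [1, 2, 3, 4]
# [1, 3] [1, 2, 5, 6]
# [2, 3] [3, 4, 5, 6]
```

Note that two drugs in the same group meet at more than one reducer (drugs 1 and 2 appear together under both {G1,G2} and {G1,G3}), so a real implementation would need a rule for which reducer handles such a pair.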

Page 9: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

[Figure: the same six Map tasks and map outputs as on the previous slide, now feeding three Reduce tasks.]

Reduce input:
- Reduce for {G1,G2}: data 1+2+3+4
- Reduce for {G1,G3}: data 1+2+5+6
- Reduce for {G2,G3}: data 3+4+5+6

Page 10: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

• Therefore, if we group the 3000 drugs into 30 groups
  - G1: 1-100, G2: 101-200, ..., G30: 2901-3000

• Each set of drug data is replicated only 29 times
  = 87 GB vs. 9 TB communication cost

• But lower parallelism, higher processing cost!
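
To see both sides of this tradeoff at once, the sketch below tabulates the communication volume and the number of records each reducer must hold for a few group counts. The function `grouping_cost` is a hypothetical helper; it assumes 1 MB = 10^6 bytes and that the group count divides the number of drugs evenly.

```python
def grouping_cost(num_drugs, num_groups, record_bytes=1_000_000):
    """Return (copies per record, total bytes sent, records per reducer)."""
    group_size = num_drugs // num_groups   # assumes num_groups divides num_drugs
    copies = num_groups - 1                # one copy per other group
    total_bytes = num_drugs * copies * record_bytes
    records_per_reducer = 2 * group_size   # two groups meet at each reducer
    return copies, total_bytes, records_per_reducer


for groups in (3000, 300, 30):
    copies, total_bytes, per_reducer = grouping_cost(3000, groups)
    print(groups, copies, f"{total_bytes / 1e9:,.0f} GB", per_reducer)
# 3000 2999 8,997 GB 2
# 300 299 897 GB 20
# 30 29 87 GB 200
```

Fewer groups means less data on the network, but each reducer receives more records and has more pairs to compare.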

Page 11: Upper and Lower Bound on the Cost of a MapReduce Computation

Tradeoff: Parallelism & Communication

Parallelism vs. Communication

• To evaluate communication cost, define the replication rate $r$: the average number of key-value pairs created from a single map input

• To evaluate processing cost, define the reducer size $q$: the maximum number of values associated with a single key

Page 12: Upper and Lower Bound on the Cost of a MapReduce Computation

Tradeoff: Parallelism & Communication

[Figure: the grouped six-drug example. Each drug's data is sent to two reducers, and each of the three reducers receives the data of four drugs.]

$r = 2$, $q = 4$
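
Both parameters can be read directly off a reducer assignment. The following sketch computes them for the six-drug grouping; the function name and the dictionary layout are illustrative assumptions.

```python
def replication_rate_and_reducer_size(assignment, num_inputs):
    """assignment maps each reducer key to the list of inputs it receives."""
    sizes = [len(inputs) for inputs in assignment.values()]
    r = sum(sizes) / num_inputs   # average number of copies per input
    q = max(sizes)                # largest reducer input list
    return r, q


# The six-drug grouping: three reducers, each receiving four drugs.
assignment = {
    ("G1", "G2"): [1, 2, 3, 4],
    ("G1", "G3"): [1, 2, 5, 6],
    ("G2", "G3"): [3, 4, 5, 6],
}
print(replication_rate_and_reducer_size(assignment, num_inputs=6))  # (2.0, 4)
```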

Page 13: Upper and Lower Bound on the Cost of a MapReduce Computation

Tradeoff: Parallelism & Communication

How the Tradeoff can be Used

$r = f(q)$

• Communication cost: $ar$, where $a$ is a constant

• Processing cost: some function of $q$
  - Take, for example, the previous drug interaction problem
  - The work for each reducer is $O(q^2)$, so $\text{Cost}_{\text{each}} = bq^2$, where $b$ is a constant
  - The number of reducers is proportional to $\frac{1}{q}$
  - $\text{Cost}_{\text{total}} = bq^2 \times \frac{1}{q} = bq$

Page 14: Upper and Lower Bound on the Cost of a MapReduce Computation

Tradeoff: Parallelism & Communication

How the Tradeoff can be Used

$$\text{Combined Cost} = ar + bq = a f(q) + bq$$

• Solve for the $q$ that minimizes the combined cost

• Determine $r$ from $r = f(q)$

• Decide on an appropriate algorithm implementation
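
As a worked illustration of these three steps, the sketch below uses the grouping scheme from the drug example, where $g$ groups give $r = g - 1$ and $q = 2n/g$, hence $f(q) = 2n/q - 1$. That closed form is derived here from the example rather than stated on the slides, and the constants `a` and `b` are made-up values chosen only so the numbers come out round.

```python
import math


def replication(q, n):
    """r = f(q) for the grouping scheme: g = 2n/q groups, r = g - 1."""
    return 2 * n / q - 1


def combined_cost(q, n, a, b):
    """Combined cost a*r + b*q with r = f(q)."""
    return a * replication(q, n) + b * q


def best_reducer_size(n, a, b):
    """Setting d/dq [a*(2n/q - 1) + b*q] = 0 gives q = sqrt(2an/b)."""
    return math.sqrt(2 * a * n / b)


n, a, b = 3000, 1.0, 0.15   # a and b are made-up cost constants
q_star = best_reducer_size(n, a, b)
print(round(q_star))                  # 200
print(round(replication(q_star, n)))  # 29
```

With these particular constants the minimum lands on the 30-group configuration from the motivating example; other values of `a` and `b` would shift the optimum toward more or fewer groups.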

Page 15: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

[Figure: the problem model, with four labeled parts]

• Finite domain N
• Hypothetical set of all inputs, constructed from domain N
• Mapping schema ($r$, $q$)
• All possible outputs corresponding to the inputs

Page 16: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

Example: Hamming Distance 1

1011010011

Distance:2

1011010010

Distance:1

Page 17: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

Example: Hamming Distance 1

Domain: bit strings of length $b$

$2^b$ hypothetical inputs: 000...00, 000...01, 000...10, ..., 111...00, 111...01, 111...10, 111...11

Mapping Schema ($r$, $q$)

No. of outputs = $\frac{b}{2} \times 2^b$
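
The output count can be confirmed by brute force for small $b$. This sketch represents bit strings as Python integers; the function names are illustrative, and the enumeration is only practical for small string lengths.

```python
from itertools import combinations


def hamming_distance(x, y):
    """Number of bit positions in which the integers x and y differ."""
    return bin(x ^ y).count("1")


def count_outputs(b):
    """Count unordered pairs of b-bit strings at Hamming distance exactly 1."""
    return sum(
        1
        for x, y in combinations(range(2 ** b), 2)
        if hamming_distance(x, y) == 1
    )


print(hamming_distance(0b1011010011, 0b1011010010))  # 1
for b in (2, 3, 4, 8):
    assert count_outputs(b) == b * 2 ** b // 2
```

Each of the $2^b$ strings has $b$ neighbors at distance 1, and every pair is counted from both ends, which gives the factor of one half.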

Page 18: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

The Mapping Schema Tradeoff Derivation

Given the maximum reducer size $q$, and assuming there are $p$ reducers,

$$r = \frac{\sum_{i=1}^{p} q_i}{I}$$

$q_i$: reducer size of reducer $i$ ($q_i \le q$)
$I$: total input size

Page 19: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

[Figure: the grouped six-drug example, with three reducers that each receive the data of four drugs.]

$q_1 = 4$, $q_2 = 4$, $q_3 = 4$, $I = 6$

$$\Rightarrow r = \frac{\sum_{i=1}^{p} q_i}{I} = \frac{4 + 4 + 4}{6} = 2$$

Page 20: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

Finding the lower bound of $r$ for a given $q$

1. Derive $g(q)$: an upper bound on the number of outputs a reducer of size $q$ can cover

Page 21: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

[Figure: the grouped six-drug example, with three reducers that each receive the data of four drugs.]

$q = 4 \Rightarrow$ each reducer covers $\binom{4}{2}$ outputs

$$\Longrightarrow g(q) = \binom{q}{2} = \frac{q(q-1)}{2} \approx \frac{q^2}{2}$$

Page 22: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

Finding the lower bound of $r$ for a given $q$

1. Derive $g(q)$: an upper bound on the number of outputs a reducer of size $q$ can cover
2. Determine the number of inputs $I$ and outputs $O$
3. Establish the inequality:

$$\sum_{i=1}^{p} g(q_i) \ge O$$

4. Manipulate the inequality (using $q_i \le q$ and the fact that $g(q)/q$ does not decrease as $q$ grows):

$$\sum_{i=1}^{p} q_i \frac{g(q_i)}{q_i} \ge O \;\Rightarrow\; \sum_{i=1}^{p} q_i \frac{g(q)}{q} \ge O$$

$$\Rightarrow r = \frac{\sum_{i=1}^{p} q_i}{I} \ge \frac{q \times O}{g(q) \times I}$$
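
The final bound is easy to evaluate numerically. The sketch below applies it to the all-pairs (drug interaction) setting, where $I = n$, $O = \binom{n}{2}$ and $g(q) = \binom{q}{2}$, and prints it next to the replication rate achieved by the grouping scheme. Function names are hypothetical.

```python
from math import comb


def replication_lower_bound(q, num_inputs, num_outputs, g):
    """Evaluate r >= q * O / (g(q) * I) for a given bound function g."""
    return q * num_outputs / (g(q) * num_inputs)


def all_pairs_g(q):
    """A reducer holding q inputs can cover at most C(q, 2) pairs."""
    return comb(q, 2)


for n, q, achieved in ((6, 4, 2), (3000, 200, 29)):
    bound = replication_lower_bound(q, n, comb(n, 2), all_pairs_g)
    print(n, q, round(bound, 2), achieved)
# 6 4 1.67 2
# 3000 200 15.07 29
```

In both cases the grouping scheme stays within roughly a factor of two of the lower bound.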

Page 23: Upper and Lower Bound on the Cost of a MapReduce Computation

The Hamming-Distance-1 Problem

1. $g(q) = \frac{q}{2} \log_2 q$ (by mathematical induction)

2. $I = 2^b$, $O = \frac{b}{2} 2^b$

3. Inequality:

$$\sum_{i=1}^{p} g(q_i) = \sum_{i=1}^{p} \frac{q_i}{2} \log_2 q_i \ge \frac{b}{2} 2^b$$

$$\Rightarrow \sum_{i=1}^{p} \frac{q_i}{2} \log_2 q \ge \frac{b}{2} 2^b$$

$$\Rightarrow r = \frac{\sum_{i=1}^{p} q_i}{2^b} \ge \frac{b}{\log_2 q}$$
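
A short script makes the shape of this bound concrete. It evaluates $b / \log_2 q$ at a few reducer sizes for an arbitrarily chosen string length of 20 bits; the function name is an assumption for illustration.

```python
import math


def hamming1_lower_bound(b, q):
    """Lower bound r >= b / log2(q) for reducer size q (2 <= q <= 2**b)."""
    return b / math.log2(q)


b = 20
print(hamming1_lower_bound(b, 2))        # 20.0: one reducer per output pair
print(hamming1_lower_bound(b, 2 ** 10))  # 2.0
print(hamming1_lower_bound(b, 2 ** b))   # 1.0: a single reducer holds every input
```

The two extremes are met by trivial schemes: with $q = 2$ every input is sent to one reducer per neighbor ($r = b$), and with $q = 2^b$ one reducer receives everything ($r = 1$).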

Page 24: Upper and Lower Bound on the Cost of a MapReduce Computation

Conclusion

• Presents a new approach to studying optimal Map-Reduce algorithms

• Establishes a unified model with two parameters, replication rate and reducer size, for studying performance over a spectrum of possible computing clusters

• For several problems, the two parameters have been shown to be related by a tradeoff formula

