Upper and Lower Bounds on the Cost of a Map-Reduce Computation. 38th International Conference on Very Large Data Bases (VLDB 2012). Tzu-Li Tai, National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory. Foto N. Afrati, National Technical University of Athens. Anish Das Sarma, Google Research. Semih Salihoglu and Jeffrey D. Ullman, Stanford University.


DESCRIPTION

[Paper Study] VLDB 2012 Author: Foto N. Afrati, National Technical University of Athens


Page 1: Upper and Lower Bound on the Cost of a MapReduce Computation

Upper and Lower Bounds on the Cost of a Map-Reduce Computation

38th International Conference on Very Large Data Bases (VLDB 2012)

Tzu-Li Tai, National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory

Foto N. Afrati, National Technical University of Athens

Anish Das Sarma, Google Research

Semih Salihoglu, Jeffrey D. Ullman, Stanford University


Agenda

A. Background
B. A Motivating Example
C. Tradeoff: Parallelism & Communication
D. Problem Model and Assumptions
E. The Hamming-Distance-1 Problem
F. Conclusion


Background

The MapReduce Paradigm

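[Figure: MapReduce data flow. Map tasks read input pairs (k1, v1) and emit intermediate pairs (k2, v2). The intermediate pairs are grouped by key into (k2, [v2]) and passed to Reduce tasks, which emit output pairs (k3, v3).]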
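The data flow in the figure can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Hadoop's API: `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical names chosen here, and word counting is only a stand-in workload.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-round MapReduce over an in-memory list of (k1, v1) pairs."""
    # Map phase: each (k1, v1) yields zero or more (k2, v2) pairs.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # shuffle: collect values by key
    # Reduce phase: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    outputs = []
    for k2, values in groups.items():
        outputs.extend(reduce_fn(k2, values))
    return outputs

# Hypothetical workload: count word occurrences across documents.
def map_words(doc_id, text):
    for word in text.split():
        yield word, 1

def reduce_counts(word, counts):
    yield word, sum(counts)

docs = [(1, "map reduce map"), (2, "reduce")]
result = dict(run_mapreduce(docs, map_words, reduce_counts))
```

The only communication between the two phases is the grouped intermediate data, which is the quantity the rest of the talk tries to bound.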


Background

Distributed/Parallel Computing in Clusters

• Often uses MapReduce to express applications (Hadoop)
  - This paper focuses on single-round MR applications

• Limited bandwidth

• Limited resources (memory, processing units, etc.)

• For public clouds, you "pay as you go" for these resources
  - Amazon EC2 charges for both bandwidth usage & processing units


A Motivating Example

The Drug Interaction Problem

• 3000 sets of drug data (patients taking the drug, dates, diagnoses)

• About 1 MB of data per drug

• Problem: find 2 drugs that, when taken together, increase the risk of heart attack

• Requires cross-referencing every pair of drugs across the whole set of drugs


A Motivating Example

[Figure: naive approach, shown for four drugs. Each Map task reads one drug's data and emits it once for every pair that contains that drug: Drug 1 emits ({1,2}, data 1), ({1,3}, data 1), ({1,4}, data 1); Drug 2 emits ({1,2}, data 2), ({2,3}, data 2), ({2,4}, data 2); Drug 3 emits ({1,3}, data 3), ({2,3}, data 3), ({3,4}, data 3); Drug 4 emits ({1,4}, data 4), ({2,4}, data 4), ({3,4}, data 4). There is one Reduce task per pair ({1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}), and the reducer for {i, j} receives data i + data j.]
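The naive scheme in the figure can be written out as a small sketch. The function name `naive_map` and the placeholder strings standing in for drug records are assumptions made for illustration.

```python
from collections import defaultdict

def naive_map(drug_id, data, num_drugs):
    """Emit one copy of a drug's data for every pair containing that drug."""
    for other in range(1, num_drugs + 1):
        if other != drug_id:
            key = (min(drug_id, other), max(drug_id, other))
            yield key, (drug_id, data)

num_drugs = 4
records = {d: f"data {d}" for d in range(1, num_drugs + 1)}  # placeholder data

# Shuffle: one reducer per unordered pair of drugs.
reducers = defaultdict(list)
for drug_id, data in records.items():
    for key, value in naive_map(drug_id, data, num_drugs):
        reducers[key].append(value)

pairs_emitted = sum(len(values) for values in reducers.values())
```

With four drugs this produces six reducers and twelve key-value pairs: every record is sent three times, once for each other drug.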


A Motivating Example

What Went Wrong?

• For 3000 drugs, each set of drug data is replicated 2999 times

• Each set of data is about 1 MB, so the total is 3000 × 2999 × 1 MB ≈ 9 terabytes of communication
  - About 90,000 sec on a 1 Gigabit network

• Communication cost is too high!
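The arithmetic behind these figures can be checked directly. The sketch below assumes that "1M" means roughly 1 MB (10^6 bytes) per drug. The throughput value is simply what the slide's 90,000-second figure implies, not a measurement.

```python
NUM_DRUGS = 3000
RECORD_BYTES = 1_000_000          # assumption: "1M" is about 1 MB per drug
copies_per_drug = NUM_DRUGS - 1   # one copy for every other drug

total_bytes = NUM_DRUGS * copies_per_drug * RECORD_BYTES
total_terabytes = total_bytes / 1e12

# Throughput implied by the 90,000-second figure on the slide.
implied_bytes_per_sec = total_bytes / 90_000
```

The implied rate is about 100 MB/s, a little under the 125 MB/s raw rate of a 1 Gbit/s link.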


A Motivating Example

Different Approach: Grouping Drugs
• G1: Drugs 1-2
• G2: Drugs 3-4
• G3: Drugs 5-6

Key: Own Group + Other Groups

[Figure: six Map tasks, one per drug. Each emits its drug's data once for every pair made of its own group and one other group: Drugs 1 and 2 emit under keys {G1, G2} and {G1, G3}; Drugs 3 and 4 under {G1, G2} and {G2, G3}; Drugs 5 and 6 under {G1, G3} and {G2, G3}.]


A Motivating Example

[Figure: the same six Map tasks now feed three Reduce tasks, one per pair of groups. Reduce for {G1, G2} receives data 1+2+3+4; Reduce for {G1, G3} receives data 1+2+5+6; Reduce for {G2, G3} receives data 3+4+5+6.]
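A minimal sketch of the grouped scheme, assuming equal-sized groups numbered from 1. The names `group_of` and `grouped_map` are hypothetical, and only drug ids are shipped in place of the actual records.

```python
from collections import defaultdict

def group_of(drug_id, group_size):
    """1-based group index: drugs 1..group_size form group 1, and so on."""
    return (drug_id - 1) // group_size + 1

def grouped_map(drug_id, num_groups, group_size):
    """Emit the drug once per pair {own group, other group}."""
    own = group_of(drug_id, group_size)
    for other in range(1, num_groups + 1):
        if other != own:
            yield (min(own, other), max(own, other)), drug_id

num_drugs, group_size = 6, 2
num_groups = num_drugs // group_size

# Shuffle: one reducer per unordered pair of groups.
reducers = defaultdict(list)
for drug_id in range(1, num_drugs + 1):
    for key, value in grouped_map(drug_id, num_groups, group_size):
        reducers[key].append(value)
```

Every pair of drugs still meets at some reducer: two drugs from different groups meet at the reducer for that pair of groups, and two drugs from the same group travel together to every reducer their group takes part in.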


A Motivating Example

• Therefore, if we group 3000 drugs into 30 groups
  - G1: 1-100, G2: 101-200, ……, G30: 2901-3000

• Each set of drug data is replicated only 29 times
  - 3000 × 29 × 1 MB = 87 GB vs. 9 TB of communication

• But lower parallelism, higher processing cost!
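To see both sides of the tradeoff at once, the sketch below tabulates the quantities for an arbitrary number of equal-sized groups. `grouping_costs` is a hypothetical helper, and the 1 MB record size is the same assumption as in the slides.

```python
from math import comb

def grouping_costs(num_drugs, num_groups, record_bytes=1_000_000):
    """Replication, traffic, and reducer load for equal-sized groups."""
    group_size = num_drugs // num_groups   # assumes num_groups divides num_drugs
    replication = num_groups - 1           # copies sent per drug
    return {
        "replication": replication,
        "traffic_gb": num_drugs * replication * record_bytes / 1e9,
        "reducers": comb(num_groups, 2),   # one per pair of groups
        "drugs_per_reducer": 2 * group_size,
    }

grouped = grouping_costs(3000, 30)
naive = grouping_costs(3000, 3000)  # groups of size 1 reproduce the naive scheme
```

Going from 3000 groups to 30 cuts traffic from about 9 TB to 87 GB, but the number of reducers drops from roughly 4.5 million to 435 and each reducer now handles 200 drugs instead of 2.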


Tradeoff: Parallelism & Communication

Parallelism vs. Communication

• To evaluate communication cost, define the replication rate r: the average number of key-value pairs created from a single map input

• To evaluate processing cost, define the reducer size q: the maximum number of values associated with a single key
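Both parameters can be read off an assignment of inputs to reducers. The sketch below uses the six-drug grouping example from the earlier slides. The function name and the dictionary layout are assumptions made for illustration.

```python
def replication_rate_and_reducer_size(assignment, num_inputs):
    """assignment maps each reducer key to the list of inputs it receives."""
    sizes = [len(inputs) for inputs in assignment.values()]
    r = sum(sizes) / num_inputs  # average key-value pairs created per input
    q = max(sizes)               # longest value list for any single key
    return r, q

# Six drugs in three groups, one reducer per pair of groups.
assignment = {
    ("G1", "G2"): [1, 2, 3, 4],
    ("G1", "G3"): [1, 2, 5, 6],
    ("G2", "G3"): [3, 4, 5, 6],
}
r, q = replication_rate_and_reducer_size(assignment, num_inputs=6)
```

For this assignment the result is r = 2 and q = 4.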


Tradeoff: Parallelism & Communication

[Figure: the grouped drug example again (6 drugs, groups G1-G3, one reducer per pair of groups). Each drug's data is sent to 2 reducers and each reducer receives 4 drug data sets.]

$r = 2$, $q = 4$


Tradeoff: Parallelism & Communication

How the Tradeoff can be Used

$r = f(q)$

• Communication cost: $ar$, where $a$ is a constant

• Processing cost: some function of $q$
  - Take for example the previous drug interaction problem
  - The work for each reducer is $O(q^2)$, so $Cost_{each} = bq^2$, where $b$ is a constant
  - The number of reducers is proportional to $\frac{1}{q}$
  - $Cost_{total} = bq^2 \times \frac{1}{q} = bq$


Tradeoff: Parallelism & Communication

How the Tradeoff can be Used

$\text{Combined Cost} = ar + bq = a f(q) + bq$

• Solve for the $q$ that minimizes the combined cost

• Determine $r$ with $r = f(q)$

• Decide on an appropriate algorithm implementation
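As a rough illustration of this procedure, the sketch below applies it to the drug-grouping scheme. It assumes n drugs split into g equal groups, so that r = g - 1 and q = 2n/g (that is, f(q) = 2n/q - 1). The constants `a` and `b` are made-up values, and the search simply tries every group count that divides n.

```python
def combined_cost(num_drugs, num_groups, a, b):
    """a*r + b*q for the grouping scheme, with r = g - 1 and q = 2n/g."""
    r = num_groups - 1
    q = 2 * num_drugs / num_groups
    return a * r + b * q

def best_group_count(num_drugs, a, b):
    """Try every group count that divides num_drugs and keep the cheapest."""
    candidates = [g for g in range(2, num_drugs + 1) if num_drugs % g == 0]
    return min(candidates, key=lambda g: combined_cost(num_drugs, g, a, b))

# a and b are illustrative constants only, not values from the paper.
g_comm_expensive = best_group_count(3000, a=100.0, b=1.0)
g_comm_cheap = best_group_count(3000, a=1.0, b=100.0)
```

When communication is expensive relative to processing, the minimum sits at a few large groups (8 here). When it is cheap, many small groups win (750 here).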


Problem Model & Assumptions

[Figure: a finite domain N defines a hypothetical set of all inputs constructed from N. A mapping schema with parameters (r, q) connects these inputs to all possible outputs corresponding to them.]


Problem Model & Assumptions

Example: Hamming Distance 1

1011010011

Distance:2

1011010010

Distance:1
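For reference, the Hamming distance between two equal-length bit strings is the number of positions in which they differ. A minimal sketch, with `hamming_distance` as a name chosen here:

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length bit strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s, t))

# The two strings shown on the slide differ only in their last bit.
d = hamming_distance("1011010011", "1011010010")
```

The Hamming-distance-1 problem asks for all pairs of input strings whose distance is exactly 1.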


Problem Model & Assumptions

Example: Hamming Distance 1

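[Figure: the domain is all bit strings of length $b$ (000……00, 000……01, 000……10, ……, 111……00, 111……01, 111……10, 111……11), giving $2^b$ hypothetical inputs, which a mapping schema with parameters $(r, q)$ assigns to reducers.]

No. of outputs = $\frac{2^b \times b}{2}$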
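The output count can be checked by brute force for small string lengths: each of the 2^b strings has b neighbours at distance 1, and each pair is counted from both ends. The helper name below is hypothetical.

```python
def count_distance_one_pairs(b):
    """Count unordered pairs of b-bit strings that differ in exactly one bit."""
    count = 0
    for x in range(2 ** b):
        for bit in range(b):
            y = x ^ (1 << bit)  # flip one bit of x
            if x < y:           # count each unordered pair once
                count += 1
    return count

counts = {b: count_distance_one_pairs(b) for b in range(1, 6)}
```

For b = 1 to 5 the counts agree with the formula (2^b × b) / 2.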


Problem Model & Assumptions

The Mapping Schema Tradeoff Derivation

Given the maximum reducer size $q$, and assuming there are $p$ reducers,

$r = \sum_{i=1}^{p} \frac{q_i}{I}$

$q_i$: reducer size of reducer $i$ ($q_i \le q$)
$I$: total input size


Problem Model & Assumptions

[Figure: the grouped drug example (6 drugs, groups G1-G3). Reduce for {G1, G2} receives data 1+2+3+4, Reduce for {G1, G3} receives data 1+2+5+6, and Reduce for {G2, G3} receives data 3+4+5+6.]

$q_1 = 4$, $q_2 = 4$, $q_3 = 4$, $I = 6$

$\Rightarrow r = \sum_{i=1}^{p} \frac{q_i}{I} = \frac{4 + 4 + 4}{6} = 2$


Problem Model & Assumptions

Finding the lower bound of $r$ with given $q$

1. Derive $g(q)$: an upper bound on the number of outputs that a reducer of size $q$ can cover


Problem Model & Assumptions

[Figure: the grouped drug example again; each of the three reducers receives 4 drug data sets.]

$q = 4 \Rightarrow$ covers $\binom{4}{2}$ outputs

$\Rightarrow g(q) = \binom{q}{2} = \frac{q(q-1)}{2} \approx \frac{q^2}{2}$
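For the all-pairs (drug interaction) case, the value of g(q) can be confirmed by listing the pairs a single reducer can form from its inputs. `outputs_covered` and `g` are names chosen for this sketch.

```python
from itertools import combinations

def outputs_covered(inputs):
    """All unordered pairs a reducer can form from the inputs it holds."""
    return list(combinations(inputs, 2))

def g(q):
    """Upper bound on outputs covered by a reducer of size q (all-pairs case)."""
    return q * (q - 1) // 2

covered = outputs_covered([1, 2, 3, 4])  # a reducer with q = 4 inputs
ratio_large_q = g(1000) / (1000 ** 2 / 2)  # approaches 1 as q grows
```

A reducer with 4 inputs covers 6 pairs, and for large q the exact count is close to q²/2.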


Problem Model & Assumptions

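Finding the lower bound of $r$ with given $q$

1. Derive $g(q)$: an upper bound on the number of outputs that a reducer of size $q$ can cover
2. Determine the number of inputs $I$ and outputs $O$
3. Establish the inequality:

$\sum_{i=1}^{p} g(q_i) \ge O$

4. Manipulate the inequality (the second step uses $q_i \le q$ and assumes $g(q)/q$ is non-decreasing in $q$):

$\sum_{i=1}^{p} q_i \frac{g(q_i)}{q_i} \ge O \Rightarrow \sum_{i=1}^{p} q_i \frac{g(q)}{q} \ge O$

$\Rightarrow r = \sum_{i=1}^{p} \frac{q_i}{I} \ge \frac{q \times O}{g(q) \times I}$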
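The bound from step 4 can be evaluated for the drug example as a sanity check. The sketch assumes that the drug problem needs every pair of drugs compared, so O is the number of pairs of the n drugs and g(q) is the number of pairs of q inputs. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def replication_lower_bound(q, num_inputs, num_outputs, g):
    """r >= q * O / (g(q) * I), the bound from step 4."""
    return Fraction(q * num_outputs, g(q) * num_inputs)

def g_all_pairs(q):
    return comb(q, 2)  # pairs a reducer with q inputs can cover

# Six drugs, reducer size 4: the grouping scheme on the slides achieves r = 2.
small = replication_lower_bound(4, 6, comb(6, 2), g_all_pairs)

# 3000 drugs, reducers holding two groups of 100: that scheme achieves r = 29.
large = replication_lower_bound(200, 3000, comb(3000, 2), g_all_pairs)
```

Under these assumptions the bound simplifies to (n - 1)/(q - 1): 5/3 for the small example and about 15.07 for the large one, so the grouping scheme is within roughly a factor of two of the lower bound.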


The Hamming-Distance-1 Problem

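1. $g(q) = \frac{q}{2} \log_2 q$ (by mathematical induction)

2. $I = 2^b$, $O = \frac{b}{2} 2^b$

3. Inequality:

$\sum_{i=1}^{p} g(q_i) = \sum_{i=1}^{p} \frac{q_i}{2} \log_2 q_i \ge \frac{b}{2} 2^b$

$\sum_{i=1}^{p} \frac{q_i}{2} \log_2 q \ge \frac{b}{2} 2^b$

$\Rightarrow r = \sum_{i=1}^{p} \frac{q_i}{2^b} \ge \frac{b}{\log_2 q}$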
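A quick way to get a feel for the final bound is to evaluate it at a few reducer sizes. The string length b = 20 and the chosen values of q are arbitrary illustrative inputs, and the function name is hypothetical.

```python
from math import log2

def hamming_replication_lower_bound(b, q):
    """r >= b / log2(q) for b-bit strings and reducer size q (q >= 2)."""
    return b / log2(q)

b = 20
bounds = {q: hamming_replication_lower_bound(b, q) for q in (2, 2 ** 10, 2 ** 20)}
```

At q = 2 (one pair of strings per reducer) every input must be replicated at least b times. At q = 2^b (a single reducer holding everything) the bound falls to 1, and at q = 2^(b/2) it is 2.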


Conclusion

• Presents a new approach to studying optimal Map-Reduce algorithms

• Establishes a unified model with two parameters, replication rate and reducer size, to study performance over a spectrum of possible computing clusters

• For several problems, shows that the two parameters are related by a tradeoff formula
