[Paper Study] VLDB 2012 Author: Foto N. Afrati, National Technical University of Athens
Upper and Lower Bounds on the Cost of a Map-Reduce Computation
38th International Conference on Very Large Data Bases (VLDB 2012)
Tzu-Li Tai, National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory
Foto N. Afrati, National Technical University of Athens
Anish Das Sarma, Google Research
Semih Salihoglu, Jeffrey D. Ullman, Stanford University
Agenda
A. Background
B. A Motivating Example
C. Tradeoff: Parallelism & Communication
D. Problem Model and Assumptions
E. The Hamming-Distance-1 Problem
F. Conclusion
Background
The MapReduce Paradigm
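[Figure: four Map tasks read input (key, value) pairs and emit intermediate (key, value) pairs; the intermediate pairs are grouped by key into (key, [values]) and passed to three Reduce tasks, which emit the output (key, value) pairs.]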
Distributed/Parallel Computing in Clusters
• Often uses MapReduce to express applications (Hadoop)
  - This paper focuses on single-round MR applications
• Limited bandwidth
• Limited resources (memory, processing units, etc.)
• For public clouds, you "pay as you go" for these resources
  - Amazon EC2 charges for both bandwidth usage & processing units
A Motivating Example
The Drug Interaction Problem
• 3000 sets of drug data (patients taking the drug, dates, diagnoses)
• About 1 MB of data per drug
• Problem: find pairs of drugs that, when taken together, increase the risk of heart attack
• This requires cross-referencing every pair of drugs across the whole set of drugs
[Figure: the naive approach with 4 drugs. The mapper for drug i emits ({i, j}, data i) for every other drug j; for example, drug 1 emits ({1,2}, data 1), ({1,3}, data 1) and ({1,4}, data 1). There is one reducer per pair of drugs, six in total, and the reducer for {i, j} receives data i + data j.]
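The naive scheme in the figure can be sketched as a small in-memory simulation. This is only an illustration, not Hadoop code: the helper names `naive_map` and `shuffle` and the placeholder strings such as "data 1" are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def naive_map(drug_id, data, num_drugs):
    """Emit the drug's data once for every pair {drug_id, other}."""
    for other in range(1, num_drugs + 1):
        if other != drug_id:
            key = frozenset((drug_id, other))
            yield key, (drug_id, data)


def shuffle(pairs):
    """Group values by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


num_drugs = 4
records = {i: f"data {i}" for i in range(1, num_drugs + 1)}  # placeholder data
emitted = list(chain.from_iterable(
    naive_map(i, d, num_drugs) for i, d in records.items()))
reducers = shuffle(emitted)

print(len(emitted))   # 12 key-value pairs: each record is copied 3 times
print(len(reducers))  # 6 reducers, one per pair of drugs
```

Every record is copied once per other drug, which is what drives the communication cost as the number of drugs grows.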
What Went Wrong?
• For 3000 drugs, each set of drug data is replicated 2999 times
• Each set of data is 1 MB, so 3000 × 2999 × 1 MB ≈ 9 terabytes of communication, which takes about 90,000 sec on a 1 Gigabit network
• Communication cost is too high!
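The data volume can be checked with a few lines of arithmetic. The sketch below assumes that "1 MB" means 10^6 bytes and that the link runs at its raw rate of 10^9 bits per second with no protocol overhead; both are assumptions, so the time it prints is a lower bound of the same order as the 90,000 sec quoted on the slide.

```python
NUM_DRUGS = 3000
RECORD_BYTES = 10**6          # "1 MB" taken as 10^6 bytes (assumption)
LINK_BITS_PER_SEC = 10**9     # raw 1 Gigabit/s line rate (assumption)

copies_per_record = NUM_DRUGS - 1
total_bytes = NUM_DRUGS * copies_per_record * RECORD_BYTES
raw_seconds = total_bytes * 8 / LINK_BITS_PER_SEC

print(total_bytes / 10**12)  # about 9.0 terabytes
print(raw_seconds)           # about 72,000 s at the raw line rate
```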
Different Approach: Grouping Drugs
• G1: Drugs 1-2
• G2: Drugs 3-4
• G3: Drugs 5-6
Key: Own Group + Other Groups
[Figure: six mappers (M), one per drug. Each mapper emits its data under every pair of groups that contains its own group: drugs 1-2 emit under {G1, G2} and {G1, G3}, drugs 3-4 under {G1, G2} and {G2, G3}, and drugs 5-6 under {G1, G3} and {G2, G3}.]
[Figure: the same six mappers, now followed by one reducer per pair of groups. Reduce for {G1, G2} receives data 1+2+3+4, Reduce for {G1, G3} receives data 1+2+5+6, and Reduce for {G2, G3} receives data 3+4+5+6.]
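The grouped scheme can be simulated in the same spirit. This is a minimal sketch under stated assumptions: `group_of`, `grouped_map` and `reduce_pairs` are hypothetical helper names, and the mappers emit only drug ids instead of the actual records.

```python
from collections import defaultdict
from itertools import combinations

GROUP_SIZE = 2   # drugs per group, as in the six-drug figure
NUM_GROUPS = 3


def group_of(drug_id):
    """Drugs 1-2 -> group 1, drugs 3-4 -> group 2, drugs 5-6 -> group 3."""
    return (drug_id - 1) // GROUP_SIZE + 1


def grouped_map(drug_id):
    """Emit the drug once per pair of groups that contains its own group."""
    own = group_of(drug_id)
    for other in range(1, NUM_GROUPS + 1):
        if other != own:
            yield frozenset((own, other)), drug_id


def reduce_pairs(drug_ids):
    """Each reducer compares every pair of drugs it received."""
    return list(combinations(sorted(drug_ids), 2))


reducers = defaultdict(list)
for drug in range(1, GROUP_SIZE * NUM_GROUPS + 1):
    for key, value in grouped_map(drug):
        reducers[key].append(value)

for key in sorted(reducers, key=sorted):
    print(sorted(key), sorted(reducers[key]))
# [1, 2] [1, 2, 3, 4]
# [1, 3] [1, 2, 5, 6]
# [2, 3] [3, 4, 5, 6]
```

Note that two drugs from the same group meet at more than one reducer, so a real job would need a rule for which reducer reports that pair; the slides do not specify one.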
• Therefore, if we group the 3000 drugs into 30 groups
  - G1: 1-100, G2: 101-200, ..., G30: 2901-3000
• Each set of drug data is replicated only 29 times: 3000 × 29 × 1 MB = 87 GB vs. 9 TB of communication
• But parallelism is lower and the processing cost per reducer is higher!
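The effect of the group count on both sides of the tradeoff can be tabulated with a short sketch. The function name `grouping_costs` is hypothetical, and the code assumes that the number of groups divides the number of drugs and that 1 MB means 10^6 bytes.

```python
from math import comb


def grouping_costs(num_drugs, num_groups, record_bytes=10**6):
    """Return (replication, reducers, reducer_size, total_bytes) for a grouping.

    Assumes num_groups divides num_drugs and each record is record_bytes
    long (1 MB taken as 10^6 bytes).
    """
    group_size = num_drugs // num_groups
    replication = num_groups - 1          # one copy per other group
    reducers = comb(num_groups, 2)        # one reducer per pair of groups
    reducer_size = 2 * group_size         # records received by each reducer
    total_bytes = num_drugs * replication * record_bytes
    return replication, reducers, reducer_size, total_bytes


for g in (3000, 300, 30, 3):
    r, p, q, nbytes = grouping_costs(3000, g)
    print(f"groups={g:5d} r={r:5d} reducers={p:8d} q={q:5d} GB={nbytes / 1e9:9.1f}")
```

With 3000 groups of one drug each this reproduces the naive scheme; with 30 groups it gives 29 copies per record, 435 reducers of 200 records each, and 87 GB of communication.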
Tradeoff: Parallelism & Communication
• To evaluate communication cost, define the replication rate $r$: the average number of key-value pairs created from a single map input
• To evaluate processing cost, define the reducer size $q$: the maximum number of values associated with a single key
[Figure: the six-drug grouping example. Each drug's data is sent to 2 reducers, and each reducer receives the data of 4 drugs.]
$r = 2$, $q = 4$
How the Tradeoff can be Used
$r = f(q)$
• Communication cost: $ar$, where $a$ is a constant
• Processing cost: some function of $q$
  - Take for example the previous drug interaction problem
  - The work for each reducer is $O(q^2)$, so $Cost_{each} = bq^2$, where $b$ is a constant
  - The number of reducers is proportional to $\frac{1}{q}$
  - $Cost_{total} = bq^2 \times \frac{1}{q} = bq$
$Combined\ Cost = ar + bq = a f(q) + bq$
• Solve for the $q$ that minimizes the combined cost
• Determine $r$ from $r = f(q)$
• Decide on the appropriate algorithm implementation
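As a worked illustration of these three steps, the sketch below assumes a specific tradeoff curve, $f(q) = 2n/q$, which approximates the drug-grouping example (with $g$ groups of $n/g$ drugs, $q = 2n/g$ and $r = g - 1 \approx 2n/q$). The cost constants `a` and `b` are made-up values, and the function names are hypothetical.

```python
from math import sqrt


def combined_cost(q, n, a, b):
    """a*r + b*q with the assumed tradeoff r = f(q) = 2n/q."""
    return a * (2 * n / q) + b * q


def best_reducer_size(n, a, b):
    """Minimizer of a*2n/q + b*q, found by setting the derivative to zero."""
    return sqrt(2 * a * n / b)


n, a, b = 3000, 1.0, 0.5          # a and b are made-up cost constants
q_star = best_reducer_size(n, a, b)
r_star = 2 * n / q_star
print(round(q_star, 1), round(r_star, 1))
```

Setting the derivative $-2an/q^2 + b$ to zero gives $q = \sqrt{2an/b}$; a more expensive network (larger `a`) pushes the optimum toward larger reducers and less replication.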
Problem Model & Assumptions
[Figure: a finite domain N; the hypothetical set of all inputs constructed from domain N; all possible outputs corresponding to the inputs; and a mapping schema with parameters (r, q).]
Example: Hamming Distance 1
1011010011
Distance:2
1011010010
Distance:1
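A minimal sketch of the distance computation: the Hamming distance between two equal-length bit strings is the number of positions at which they differ. The function name `hamming_distance` and the third string "1011010000" are assumptions added for illustration.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length bit strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(cx != cy for cx, cy in zip(x, y))


print(hamming_distance("1011010011", "1011010010"))  # 1: only the last bit differs
print(hamming_distance("1011010011", "1011010000"))  # 2: the last two bits differ
```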
• Domain: bit strings of length $b$ (000...00, 000...01, 000...10, ..., 111...10, 111...11)
• $2^b$ hypothetical inputs
• Mapping schema (r, q)
• No. of outputs = $\frac{2^b \times b}{2}$
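The output count can be confirmed by brute force for small $b$: each of the $2^b$ strings has $b$ neighbours at distance 1, and every pair is counted twice. The helper name `hd1_pairs` is an assumption made for this sketch.

```python
from itertools import combinations, product


def hd1_pairs(b):
    """All unordered pairs of b-bit strings at Hamming distance exactly 1."""
    strings = ["".join(bits) for bits in product("01", repeat=b)]
    return [(x, y) for x, y in combinations(strings, 2)
            if sum(cx != cy for cx, cy in zip(x, y)) == 1]


for b in range(1, 6):
    assert len(hd1_pairs(b)) == (2**b * b) // 2
print(len(hd1_pairs(3)))  # 12 = 2^3 * 3 / 2
```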
The Mapping Schema Tradeoff Derivation
Given the maximum reducer size $q$, and assuming there are $p$ reducers,
$r = \sum_{i=1}^{p} \frac{q_i}{I}$
$q_i$: reducer size of reducer $i$ ($q_i \le q$)
$I$: total input size
[Figure: the six-drug grouping example. Reduce for {G1, G2} receives data 1+2+3+4, Reduce for {G1, G3} receives data 1+2+5+6, and Reduce for {G2, G3} receives data 3+4+5+6.]
$q_1 = 4$, $q_2 = 4$, $q_3 = 4$, $I = 6$
$\rightarrow r = \sum_{i=1}^{3} \frac{q_i}{I} = \frac{4 + 4 + 4}{6} = 2$
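The same calculation can be written against an explicit mapping schema. In this sketch the schema is a plain dictionary from reducer key to the inputs assigned to it; the representation and the function names `replication_rate` and `reducer_size` are assumptions made for illustration.

```python
def replication_rate(schema, num_inputs):
    """r = (sum of reducer sizes q_i) / (total number of inputs I)."""
    return sum(len(inputs) for inputs in schema.values()) / num_inputs


def reducer_size(schema):
    """q = the largest number of inputs assigned to any one reducer."""
    return max(len(inputs) for inputs in schema.values())


# Mapping schema of the six-drug example: reducer key -> drugs it receives.
schema = {
    ("G1", "G2"): [1, 2, 3, 4],
    ("G1", "G3"): [1, 2, 5, 6],
    ("G2", "G3"): [3, 4, 5, 6],
}
print(replication_rate(schema, num_inputs=6))  # 2.0
print(reducer_size(schema))                    # 4
```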
Finding the lower bound of r with given q
1. Deriving $g(q)$: an upper bound on the number of outputs that a reducer of size $q$ can cover
[Figure: the six-drug grouping example. Each reducer receives $q = 4$ drugs (for example, Reduce for {G1, G2} receives data 1+2+3+4) and can therefore compare at most 6 pairs of drugs.]
$q = 4 \Rightarrow g(q) = \binom{q}{2} = \frac{q(q-1)}{2} \approx \frac{q^2}{2}$
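Finding the lower bound of r with given q
1. Derive $g(q)$: an upper bound on the number of outputs that a reducer of size $q$ can cover
2. Determine the number of inputs $I$ and outputs $O$
3. Establish the inequality: $\sum_{i=1}^{p} g(q_i) \ge O$
4. Manipulate the inequality (using $q_i \le q$ and assuming $g(q)/q$ does not decrease as $q$ grows):
$\sum_{i=1}^{p} q_i \frac{g(q_i)}{q_i} \ge O \rightarrow \sum_{i=1}^{p} q_i \frac{g(q)}{q} \ge O$
$\rightarrow r = \sum_{i=1}^{p} \frac{q_i}{I} \ge \frac{q \times O}{g(q) \times I}$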
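The resulting bound can be evaluated for the all-pairs drug example, where a reducer holding $q$ inputs covers at most $\binom{q}{2}$ pairs. This is a sketch under that assumption; `replication_lower_bound` and `all_pairs_g` are hypothetical helper names.

```python
from math import comb


def replication_lower_bound(q, num_inputs, num_outputs, g):
    """Evaluate r >= q * O / (g(q) * I) for a given coverage bound g."""
    return q * num_outputs / (g(q) * num_inputs)


def all_pairs_g(q):
    """A reducer holding q inputs can compare at most C(q, 2) pairs."""
    return comb(q, 2)


n = 6  # drugs in the small example
bound = replication_lower_bound(q=4, num_inputs=n,
                                num_outputs=comb(n, 2), g=all_pairs_g)
print(round(bound, 3))  # 1.667, below the r = 2 achieved by the grouping scheme
```

For all pairs the bound simplifies to $(n-1)/(q-1)$, so the grouping scheme in the six-drug example is within a small factor of the lower bound.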
The Hamming-Distance-1 Problem
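1. $g(q) = \frac{q}{2} \log_2 q$ (by mathematical induction)
2. $I = 2^b$, $O = \frac{b}{2} 2^b$
3. Inequality:
$\sum_{i=1}^{p} g(q_i) = \sum_{i=1}^{p} \frac{q_i}{2} \log_2 q_i \ge \frac{b}{2} 2^b$
$\rightarrow \sum_{i=1}^{p} \frac{q_i}{2} \log_2 q \ge \frac{b}{2} 2^b$
$\rightarrow r = \sum_{i=1}^{p} \frac{q_i}{2^b} \ge \frac{b}{\log_2 q}$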
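Both the coverage bound $g(q)$ and the final inequality can be sanity-checked for a tiny case. The sketch below brute-forces $b = 3$ only, which is an illustration and not a proof; `hd1_lower_bound` and `max_covered_pairs` are hypothetical helper names.

```python
from itertools import combinations, product
from math import log2


def hd1_lower_bound(b, q):
    """Lower bound r >= b / log2(q) for reducer size q >= 2."""
    return b / log2(q)


def max_covered_pairs(b, q):
    """Brute force: most distance-1 pairs any q of the b-bit strings cover."""
    strings = ["".join(bits) for bits in product("01", repeat=b)]

    def covered(subset):
        return sum(1 for x, y in combinations(subset, 2)
                   if sum(cx != cy for cx, cy in zip(x, y)) == 1)

    return max(covered(s) for s in combinations(strings, q))


b = 3
for q in (2, 4, 8):
    # For these sizes the brute-force maximum meets g(q) = (q/2) * log2(q).
    assert max_covered_pairs(b, q) == (q / 2) * log2(q)
    print(q, hd1_lower_bound(b, q))
# 2 3.0
# 4 1.5
# 8 1.0
```

At one extreme ($q = 2$) every input must be replicated at least $b$ times; at the other ($q = 2^b$) a single reducer holds everything and the bound drops to 1.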
Conclusion
• The paper presents a new approach to studying optimal Map-Reduce algorithms
• It establishes a unified model with two parameters, replication rate and reducer size, for studying performance over a spectrum of possible computing clusters
• For several problems, it shows that the two parameters are related by a tradeoff formula