[Paper Study] VLDB 2012 Author: Foto N. Afrati, National Technical University of Athens
Upper and Lower Bounds
on the Cost of a
Map-Reduce Computation
38th International Conference on Very Large Data Bases (VLDB 2012)
Tzu-Li Tai
National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory

Foto N. Afrati, National Technical University of Athens
Anish Das Sarma, Google Research
Semih Salihoglu and Jeffrey D. Ullman, Stanford University
Agenda
A. Background
B. A Motivating Example
C. Tradeoff: Parallelism & Communication
D. Problem Model and Assumptions
E. The Hamming-Distance-1 Problem
F. Conclusion
Background
The MapReduce Paradigm
[Diagram: the MapReduce paradigm. Map tasks transform input pairs (k1, v1) into intermediate pairs (k2, v2); all values sharing a key are grouped into (k2, [v2]) and passed to a reduce task, which emits output pairs (k3, v3).]
Distributed/Parallel Computing in Clusters
• Often uses MapReduce to express applications (Hadoop)
  - This paper focuses on single-round MR applications
• Limited bandwidth
• Limited resources (memory, processing units, etc.)
• For public clouds, you "pay as you go" for these resources
  - Amazon EC2 charges for both bandwidth usage & processing units
A Motivating Example
The Drug Interaction Problem
• 3000 sets of drug data (patients taking the drug, dates, diagnoses)
• About 1 MB of data per drug
• Problem: find 2 drugs that, when taken together, increase the risk of heart attack
• Requires cross-referencing every pair of drugs across the whole set
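As a sketch of this naive pairing scheme (the function and parameter names are illustrative, not from the paper), each drug's map task emits its data once for every possible partner drug:

```python
def naive_map(drug_id, data, n_drugs):
    """Naive scheme: emit ({i, j}, data) for every other drug j,
    so the reducer for pair {i, j} receives both drugs' data."""
    pairs = []
    for other in range(1, n_drugs + 1):
        if other != drug_id:
            key = frozenset({drug_id, other})  # unordered pair as the key
            pairs.append((key, data))
    return pairs

# With 4 drugs, Drug 1's data is replicated to 3 reducers: {1,2}, {1,3}, {1,4}
print(len(naive_map(1, "data 1", 4)))  # 3
```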
[Diagram: naive scheme for 4 drugs. Each drug i's map task emits ({i, j}, data i) for every other drug j; e.g., Drug 1 emits ({1,2}, data 1), ({1,3}, data 1), ({1,4}, data 1). One reducer per pair {i, j} combines data i + data j, giving reducers for {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}.]
What Went Wrong?
• For 3000 drugs, each set of drug data is replicated 2999 times
• Each set of data is 1 MB, giving ≈ 9 terabytes of communication, or ≈ 90,000 seconds over a 1-gigabit network
• Communication cost is too high!
Different approach: grouping drugs
• G1: Drugs 1-2
• G2: Drugs 3-4
• G3: Drugs 5-6
• Key: own group paired with each other group

[Diagram: each drug's map task emits ({G_own, G_other}, data) once for each other group; e.g., Drug 1 (in G1) emits ({G1, G2}, data 1) and ({G1, G3}, data 1).]
[Diagram: the grouped map output feeds one reducer per pair of groups. Reducer {G1, G2} receives data 1+2+3+4, reducer {G1, G3} receives data 1+2+5+6, and reducer {G2, G3} receives data 3+4+5+6.]
• Therefore, if we group the 3000 drugs into 30 groups
  - G1: 1-100, G2: 101-200, ……, G30: 2901-3000
• Each set of drug data is replicated only 29 times
  = 87 GB vs. 9 TB of communication cost
• But lower parallelism and higher processing cost per reducer!
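The grouped scheme can be sketched the same way (group_size and n_groups are illustrative parameter names): each map task emits its data once per other group instead of once per other drug:

```python
def grouped_map(drug_id, data, group_size, n_groups):
    """Grouped scheme: emit ({G_own, G_other}, data) once per other group."""
    own = (drug_id - 1) // group_size + 1  # 1-based group index of this drug
    return [(frozenset({own, g}), data)
            for g in range(1, n_groups + 1) if g != own]

# 3000 drugs in 30 groups of 100: each record now goes to only 29 reducers
print(len(grouped_map(1, "data 1", 100, 30)))  # 29
```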
Tradeoff: Parallelism & Communication
• To evaluate communication cost, define the replication rate r: the average number of key-value pairs created from a single map input
• To evaluate processing cost, define the reducer size q: the maximum number of inputs sent to any single reducer (i.e., the longest value list for a single key)
[Diagram: the 6-drug grouping example. Each drug's data is sent to 2 reducers and each reducer receives 4 drugs' data: r = 2, q = 4.]
How the Tradeoff can be Used
• The tradeoff is expressed as a function r = f(q)
• Communication cost: a·r (a: constant)
• Processing cost: some function of q
  - Take for example the previous drug interaction problem
  - The work for each reducer is O(q²), so Cost_each = b·q² (b: constant)
  - The number of reducers is proportional to 1/q
  - Cost_total = b·q² × 1/q = b·q
Combined Cost = a·r + b·q = a·f(q) + b·q

• Solve for the q that minimizes the combined cost
• Determine r from r = f(q)
• Decide on an appropriate algorithm implementation
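For the drug-interaction example, grouping gives roughly r = f(q) = 2n/q (a reducer of size q holds two groups of q/2 drugs, so each of the n inputs reaches about 2n/q reducers; this form is my derivation from the slides, not stated in the paper). Under that assumption, minimizing a·(2n/q) + b·q gives q = √(2an/b):

```python
import math

def combined_cost(q, a, b, n):
    """Combined cost a*f(q) + b*q with the assumed f(q) = 2n/q."""
    return a * (2 * n / q) + b * q

def optimal_q(a, b, n):
    """Minimizer of combined_cost: set d/dq = -2an/q^2 + b to zero."""
    return math.sqrt(2 * a * n / b)

# Illustrative constants a = b = 1 for n = 3000 drugs
print(optimal_q(1.0, 1.0, 3000))  # ~77.5: reducer size balancing both costs
```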
Problem Model & Assumptions
Mapping Schema (r, q)

[Diagram: a mapping schema with parameters r and q maps the hypothetical set of all inputs, constructed from a finite domain N, to all possible outputs corresponding to those inputs.]
Example: Hamming Distance 1
[Diagram: 10-bit strings compared bit by bit; e.g., 1011010011 and 1011010010 differ only in the last position, so their Hamming distance is 1.]
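Hamming distance between equal-length bit strings can be computed directly; a minimal helper:

```python
def hamming(x, y):
    """Count the positions where two equal-length bit strings differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

print(hamming("1011010011", "1011010010"))  # 1: only the last bit differs
```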
Domain: all bit strings of length b
000……00, 000……01, 000……10, …, 111……10, 111……11
⇒ 2^b hypothetical inputs

Mapping Schema (r, q)

Number of outputs = (2^b × b) / 2
(each of the 2^b strings has b neighbors at distance 1; each pair is counted twice)
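The output count (2^b × b)/2 can be checked by brute force for small b:

```python
from itertools import product

def count_distance1_pairs(b):
    """Count unordered pairs of b-bit strings differing in exactly one bit."""
    strings = ["".join(bits) for bits in product("01", repeat=b)]
    return sum(1
               for i, s in enumerate(strings)
               for t in strings[i + 1:]
               if sum(a != c for a, c in zip(s, t)) == 1)

# Formula: each of the 2^b strings has b neighbors, every pair counted twice
print(count_distance1_pairs(4), (2 ** 4) * 4 // 2)  # 32 32
```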
The Mapping Schema Tradeoff Derivation
Given the maximum reducer size q, and assuming there are p reducers:

r = (Σ_{i=1}^{p} q_i) / I

where q_i is the size of reducer i (q_i ≤ q) and I is the total input size.
[Diagram: the 6-drug grouping example again, with its 3 reducers of 4 inputs each.]

q_1 = q_2 = q_3 = 4, I = 6
⇒ r = (q_1 + q_2 + q_3) / I = (4 + 4 + 4) / 6 = 2
Finding the lower bound of r with a given q

1. Derive g(q): an upper bound on the number of outputs a reducer of size q can cover
[Diagram: the 6-drug grouping example again. A reducer with q = 4 inputs covers C(4,2) = 6 outputs.]

⇒ g(q) = C(q,2) = q(q − 1)/2 ≈ q²/2
Finding the lower bound of r with a given q

1. Derive g(q): an upper bound on the number of outputs a reducer of size q can cover
2. Determine the number of inputs I and outputs O
3. Establish the inequality: Σ_{i=1}^{p} g(q_i) ≥ O
4. Manipulate the inequality (since g(q)/q is nondecreasing, g(q_i)/q_i ≤ g(q)/q for q_i ≤ q):

Σ_{i=1}^{p} q_i · g(q)/q ≥ Σ_{i=1}^{p} q_i · g(q_i)/q_i = Σ_{i=1}^{p} g(q_i) ≥ O

⇒ r = (Σ_{i=1}^{p} q_i) / I ≥ q·O / (g(q)·I)
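As an illustrative check of this bound on the drug-interaction numbers (with g(q) = q(q − 1)/2, the number of pairs a size-q reducer covers): for I = 6 inputs, O = C(6,2) = 15 outputs, and q = 4, the bound gives r ≥ 5/3, and the grouping scheme's r = 2 satisfies it:

```python
def r_lower_bound(q, I, O, g):
    """Lower bound on the replication rate: r >= q*O / (g(q)*I)."""
    return q * O / (g(q) * I)

g = lambda q: q * (q - 1) / 2  # a size-q reducer covers at most C(q,2) pairs

bound = r_lower_bound(4, 6, 15, g)
print(bound)  # 1.666...; the grouping scheme achieves r = 2 >= bound
```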
The Hamming-Distance-1 Problem
1. g(q) = (q/2)·log₂ q (by mathematical induction)
2. I = 2^b, O = (b/2)·2^b
3. Inequality:

Σ_{i=1}^{p} g(q_i) = Σ_{i=1}^{p} (q_i/2)·log₂ q_i ≥ (b/2)·2^b

⇒ Σ_{i=1}^{p} (q_i/2)·log₂ q ≥ (b/2)·2^b  (since q_i ≤ q)

⇒ r = (Σ_{i=1}^{p} q_i) / 2^b ≥ b / log₂ q
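Evaluating the bound r ≥ b / log₂ q numerically: with b = 20 and reducers each covering q = 2^10 strings (e.g., a hypothetical scheme that fixes one half of the bits per reducer), the bound evaluates to exactly 2, while one giant reducer (q = 2^b) drives it down to 1 at the price of no parallelism:

```python
import math

def hd1_r_lower_bound(b, q):
    """Lower bound r >= b / log2(q) for Hamming-distance-1 on b-bit strings."""
    return b / math.log2(q)

print(hd1_r_lower_bound(20, 2 ** 10))  # 2.0
print(hd1_r_lower_bound(20, 2 ** 20))  # 1.0: a single reducer holds everything
```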
Conclusion
• Presents a new approach to studying optimal MapReduce algorithms
• Establishes a unified model with two parameters, replication rate and reducer size, to study performance over a spectrum of possible computing clusters
• For several problems, it is shown that the two parameters are related by a tradeoff formula