8/10/2019 Mrjoin Final
1/33
Joins in Hadoop
Gang and Ronnie
Agenda
Introduction of new types of joins
Experiment results
Join plan generator
Summary and future work
Problem at hand
Map join (fragment-duplicate join)
[Diagram: the fragment (large table) is divided into Split 1 to Split 4, one per map task; the duplicate (small table) is copied to every map task.]
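The fragment-duplicate idea can be sketched in a few lines of Python. This is only an illustration, not the presenters' Hadoop code: the function name `map_join`, the tuple layout, and the sample rows are all assumptions. Each call stands for one map task that holds the whole duplicate in memory and streams one split of the fragment.

```python
from collections import defaultdict

def map_join(duplicate, fragment_split, key_dup=0, key_frag=0):
    """Join one fragment split against the in-memory duplicate table."""
    # Build phase: hash the whole small (duplicate) table by join key.
    table = defaultdict(list)
    for row in duplicate:
        table[row[key_dup]].append(row)
    # Probe phase: stream the split of the large (fragment) table.
    for row in fragment_split:
        for match in table.get(row[key_frag], ()):
            yield match + row

# Hypothetical sample data: every map task receives the same duplicate.
customers = [(1, "Ann"), (2, "Bob")]      # duplicate (small table)
split_1 = [(1, "book"), (3, "pen")]       # one split of the fragment
split_2 = [(2, "lamp"), (1, "desk")]      # another split
results = [list(map_join(customers, s)) for s in (split_1, split_2)]
```

Because every map task builds its own copy of the hash table, the small table is read once per task, which is the duplication cost discussed on the next slide.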
Slide taken from project proposal
Too many copies of the small table are shuffled across the network
Partially solved: Distributed Cache
Doesn't work with too many nodes involved
Size of R    Size of S    Map tasks    Duplicate data
150 MB       24 GB        352          32 GB
150 MB       24 GB        277
Slide taken from project proposal II
Memory Limitation
The hash table is not memory-efficient.
Its size is usually larger than the heap memory assigned to a task.
Result: OutOfMemoryError!
Solving Not-Enough-Memory problem
New Map Joins:
Multi-phase map join (MMJ)
Reversed map join (RMJ)
JDBM-based map join (JMJ)
Small table as: duplicate; large table as: fragment
Multi-phase map join
n-phase map join
[Diagram: the duplicate is divided into Part 1, Part 2, ..., Part n; each phase joins one part against every split of the fragment (Split 1 to Split 4, one per map task).]
Problem? Reading the large table multiple times!
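A minimal sketch of the n-phase idea, assuming the duplicate is cut into contiguous parts of at most `max_rows` rows (a hypothetical parameter standing in for the memory limit; the slides do not say how the parts are chosen). The function name and sample rows are likewise illustrative.

```python
from collections import defaultdict

def multi_phase_map_join(duplicate, fragment, max_rows, key_dup=0, key_frag=0):
    """Join in several phases so each hash table holds at most max_rows rows."""
    out = []
    phases = 0
    for start in range(0, len(duplicate), max_rows):
        part = duplicate[start:start + max_rows]   # duplicate Part i
        table = defaultdict(list)
        for row in part:
            table[row[key_dup]].append(row)
        # Every phase re-reads the entire fragment: this is the cost of MMJ.
        for row in fragment:
            for match in table.get(row[key_frag], ()):
                out.append(match + row)
        phases += 1
    return out, phases

# Hypothetical sample data.
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
sales = [(1, "book"), (3, "pen"), (2, "lamp")]
joined, phases = multi_phase_map_join(customers, sales, max_rows=2)
```

With three duplicate rows and `max_rows=2` the fragment is scanned twice, which shows why the running time grows with the number of phases.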
Reversed map join
Default map join (in each Map task):
1. read the duplicate into memory, build a hash table
2. for each tuple in the fragment, probe the hash table
Reversed map join (in each Map task):
1. read the fragment into memory, build a hash table
2. for each tuple in the duplicate, probe the hash table
Problem? Not really a Map job
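The two numbered steps of the reversed variant can be sketched as follows. This is an illustration under assumed names (`reversed_map_join`, the tuple layout, the sample rows), not the presenters' implementation: the roles of the two inputs are simply swapped relative to the default map join.

```python
from collections import defaultdict

def reversed_map_join(fragment_split, duplicate, key_frag=0, key_dup=0):
    """Hash the fragment split in memory and stream the duplicate past it."""
    # Step 1: read the fragment split into memory and build a hash table.
    table = defaultdict(list)
    for row in fragment_split:
        table[row[key_frag]].append(row)
    # Step 2: for each tuple in the duplicate, probe the hash table.
    for row in duplicate:
        for match in table.get(row[key_dup], ()):
            yield row + match

# Hypothetical sample data.
split = [(1, "book"), (3, "pen"), (1, "desk")]
customers = [(1, "Ann"), (2, "Bob")]
joined = list(reversed_map_join(split, customers))
```

Memory use now depends on the split size rather than on the size of the small table, so the hash table stays bounded even when the duplicate does not fit in the heap.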
JDBM-based map join
JDBM is a transactional persistence engine for
Java.
Using JDBM, we can avoid Java's OutOfMemoryError: the size of the hash
table is no longer bound by the heap size.
Problem? Probing a hash table on disk might take much time!
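To illustrate the disk-backed hash table idea without JDBM itself, the sketch below uses Python's standard-library `shelve` module as a stand-in persistence engine. That substitution, the function name `disk_map_join`, and the sample rows are assumptions made for the example; JDBM's own API is not shown.

```python
import os
import shelve
import tempfile

def disk_map_join(duplicate, fragment_split, db_path, key_dup=0, key_frag=0):
    """Map join whose hash table lives on disk instead of on the heap."""
    out = []
    with shelve.open(db_path, flag="n") as table:
        # Build phase: store the duplicate rows on disk, keyed by join key.
        for row in duplicate:
            k = str(row[key_dup])          # shelve keys must be strings
            rows = table.get(k, [])
            rows.append(row)
            table[k] = rows                # reassign so the update is persisted
        # Probe phase: every lookup may go to disk, hence the slowdown.
        for row in fragment_split:
            for match in table.get(str(row[key_frag]), []):
                out.append(match + row)
    return out

# Hypothetical sample data.
customers = [(1, "Ann"), (2, "Bob")]
split = [(2, "lamp"), (3, "pen"), (1, "desk")]
with tempfile.TemporaryDirectory() as tmp:
    joined = disk_map_join(customers, split, os.path.join(tmp, "dup_table"))
```

The join logic is unchanged from the in-memory version; only the storage behind the hash table differs, which trades memory pressure for probe latency.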
Advanced Joins
Step 1: Semi join on join key only;
Step 2: Use the result to filter the table;
Step 3: Join new tables.
Can be applied to both map and reduce-side joins
Problem? Step 1 and 2 have overhead!
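The three steps can be sketched in memory as below. The slide does not say which table is filtered in Step 2, so filtering both sides is an assumption here, as are the function name `advanced_join` and the sample rows.

```python
from collections import defaultdict

def advanced_join(small, large, key_small=0, key_large=0):
    """Semi join on keys, filter the inputs, then join the reduced tables."""
    # Step 1: semi join on the join key only (no payload columns involved).
    common = {r[key_small] for r in small} & {r[key_large] for r in large}
    # Step 2: use the result to filter the tables (both sides: an assumption).
    small_f = [r for r in small if r[key_small] in common]
    large_f = [r for r in large if r[key_large] in common]
    # Step 3: join the new, smaller tables with an ordinary hash join.
    table = defaultdict(list)
    for r in small_f:
        table[r[key_small]].append(r)
    joined = [m + r for r in large_f for m in table[r[key_large]]]
    return small_f, large_f, joined

# Hypothetical sample data.
customers = [(1, "Ann"), (2, "Bob"), (4, "Dee")]
sales = [(1, "book"), (3, "pen"), (1, "desk")]
small_f, large_f, joined = advanced_join(customers, sales)
```

Steps 1 and 2 are extra passes over the data, so the approach only pays off when the filtered tables are much smaller than the originals.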
The Nine Candidates
AMJ/no dist: advanced map join without distributed cache (DC)
AMJ/dist: advanced map join with DC
DMJ/no dist: default map join without DC
DMJ/dist: default map join with DC
MMJ: multi-phase map join
RMJ/dist: reversed map join with DC
JMJ/dist: JDBM-based map join with DC
ARJ/dist: advanced reduce join with DC
DRJ: default reduce join
Experiment Setup
TPC-DS benchmark
Evaluated query:
JOIN customer, web_sales ON cid
Performed on different scales of generated
data, e.g. 10GB, 170GB (not actual table size)
Each combination is performed five (5) times
Results are analyzed with error bars
Hadoop Cluster
Hadoop Cluster (Altocumulus):
128 Hewlett Packard DL160 Compute Building Blocks
Each equipped with:
2 quad-core CPUs
16 GB RAM
2 TB storage
High-speed network connection
Used in the experiment:
64 nodes
Result analysis
[Chart: results for the nine candidates (AMJ/no dist, AMJ/dist, DMJ/no dist, DMJ/dist, MMJ, RMJ/dist, JMJ/dist, ARJ/dist, DRJ); vertical axis from 0 to 400.]
Some results ignored
One small note
What does 50*200 mean?
TABLE customer: from 50GB version of TPC-DS
- actual table size: about 100MB
TABLE web_sales: from 200GB version of TPC-DS
- actual table size: about 30GB
Distributed Cache
[Chart: DMJ/no dist vs. DMJ/dist; vertical axis from 0 to 400.]
Distributed Cache II
Distributed cache introduces an overhead
when copying the file from HDFS to local disks.
The following situations favor
distributed cache (compared to non-DC):
1. number of nodes is low
2. number of map tasks is high
Advanced vs. Default
[Chart: ARJ/dist vs. DRJ; vertical axis from 0 to 300; horizontal axis: 10*10, 10*30, 10*50, 10*70, 10*100, 10*130, 10*170, 10*200, 50*50, 50*70, 50*100, 50*130, 50*170, 50*200, 70*70.]
Advanced vs. Default II
[Chart: AMJ/dist vs. DMJ/dist; vertical axis from 0 to 1200.]
Advanced vs. Default III
The overhead of semi-join and filtering is
heavy.
The following situations are in favor of
advanced joins (compared to reduce joins):
1. join selectivity gets lower
2. network becomes slower (true!)
3. we need to handle skewed data
Map Join vs Reduce Join--Part I
[Chart: DMJ/no dist, MMJ, JMJ/dist, ARJ/dist, DRJ; vertical axis from 0 to 450.]
Map Join vs Reduce Join-- Part II
[Chart: DMJ/no dist, RMJ/dist, JMJ/dist, ARJ/dist, DRJ; vertical axis from 0 to 1400.]
Beyond Default Map Join
Multi-Phase Map Join
Succeeds in all experiment groups.
Performance is comparable with DMJ when only one phase is involved.
Performance degrades sharply when the number of phases is greater than 2, because many more tasks are launched.
Currently no support for distributed cache; not scalable.
Beyond Default Map Join
Reversed Map Join
Succeeds in all experiment groups.
Does not perform as well as DRJ due to the overhead of
distributed cache
Performs best when
Beyond Default Map Join
JDBM Map Join
Fails for the last two experiment groups, mainly due
to improper configuration settings.
Join Plan Generator
Cost-based + rule-based
Focus on three aspects:
Whether or not to use distributed cache
Whether to use Default Map Join
Map joins or reduce-side join
Parameters:
Number of distributed files d
Network speed v
Number of map tasks m
Number of reduce tasks r
Number of working nodes n
Small table size s
Large table size l
Join Plan Generator
Whether to use distributed cache
Only works for map join approaches
Cost model
With distributed cache: $\alpha d + \frac{s n}{v}$
where $\alpha$ is the average overhead to distribute one file
Without distributed cache: $\frac{s m}{v}$
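The comparison can be tried out numerically. The sketch below assumes the with-cache cost is a per-file distribution overhead times d plus s·n/v, and the without-cache cost is s·m/v; the name `alpha` for the per-file overhead, the additive form, and the sample values for `alpha` and `v` are all assumptions chosen for illustration.

```python
def cost_with_cache(alpha, d, s, n, v):
    """Assumed model: per-file overhead for d files, plus one copy of s per node."""
    return alpha * d + s * n / v

def cost_without_cache(s, m, v):
    """Assumed model: every one of the m map tasks pulls the small table itself."""
    return s * m / v

def use_distributed_cache(alpha, d, s, n, m, v):
    """Pick the cache when its modelled cost is strictly lower."""
    return cost_with_cache(alpha, d, s, n, v) < cost_without_cache(s, m, v)

# s = 150 (MB), n = 64 nodes and m = 352 map tasks echo figures from the slides;
# alpha = 10 and v = 100 (MB per time unit) are hypothetical values.
many_tasks = use_distributed_cache(alpha=10, d=1, s=150, n=64, m=352, v=100)
few_tasks = use_distributed_cache(alpha=10, d=1, s=150, n=64, m=32, v=100)
```

Under these assumed numbers the cache wins with 352 map tasks and loses with 32, in line with the earlier observation that a low node count and a high map-task count favor distributed cache.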
Join Plan Generator
Whether to use Default Map Join
We give Default Map Join the highest priority since
it usually works best
The choice on distributed cache can ensure Default Map Join works efficiently
Rule: if small table can fit into memory entirely,
just do it.
Join Plan Generator
Map Joins or Default Reduce side Join
In those situations where DMJ fails, Reversed Map
Join is most promising in terms of usability and
scalability.
Cost model:
RMJ:
(without distributed cache) cost expressed in d, s and m
(with distributed cache) cost expressed in d, s and n
where one coefficient is the average overhead to distribute one file
DRJ: cost expressed in s, l and a function f(v, r)
Join Plan Generator
[Flowchart: three decisions in sequence. Distributed cache? (Y/N). Default Map Join? (Y: do it; N: continue). Reversed Map Join or Default Reduce-side Join: do it.]
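The decision flow can be sketched as a small function. Everything here is illustrative: the function name `generate_plan`, the `heap_limit` parameter, the tie-breaking, and the convention that the caller supplies precomputed cost estimates are assumptions, not the presenters' generator.

```python
def generate_plan(s, heap_limit, cache_cost, no_cache_cost, rmj_cost, drj_cost):
    """Return (join_type, use_distributed_cache) following the three decisions."""
    # Decision 1 (cost-based): distributed cache or not.
    use_cache = cache_cost < no_cache_cost
    # Decision 2 (rule-based): Default Map Join if the small table fits in memory.
    if s <= heap_limit:
        return ("DMJ", use_cache)
    # Decision 3 (cost-based): Reversed Map Join or Default Reduce-side Join.
    if rmj_cost <= drj_cost:
        return ("RMJ", use_cache)
    # The distributed-cache choice only applies to map join approaches.
    return ("DRJ", False)

# Hypothetical inputs: sizes in MB, costs in arbitrary time units.
plan_small = generate_plan(100, 512, 106.0, 528.0, 300.0, 400.0)
plan_large = generate_plan(2048, 512, 106.0, 528.0, 500.0, 400.0)
```

Keeping the cost estimates as inputs lets the same flow be reused with whatever cost formulas are fitted to a given cluster.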
Summary
Distributed cache is a double-edged sword
When distributed cache is used properly, Default
Map Join performs best
The three new map join approaches extend
the usability of default map join
Future Work
SPJA workflow
(selection, projection, join, aggregation)
Better optimizer
Multi-way join
Build into a hybrid system
Need a dedicated (slower) cluster