Mrjoin Final

8/10/2019 Mrjoin Final


Joins in Hadoop

Gang and Ronnie



Agenda

Introduction of new types of joins

Experiment results

Join plan generator

Summary and future work



Problem at hand

Map join (fragment-duplicate join)

[Figure: the duplicate (small table) is copied to every map task, while the fragment (large table) is divided into Splits 1 to 4 across the map tasks.]



Slide taken from project proposal

Too many copies of the small table are shuffled across the network

Partially Solved

Distributed Cache

Doesn't work with too many nodes involved

Size of R    Size of S    Map tasks    Duplicate data
150 MB       24 GB        352          32 GB

Size of R    Size of S    Map tasks    Duplicate data
150 MB       24 GB        277



Slide taken from project proposal II

Memory Limitation

Hash table is not memory-efficient.

The table size is usually larger than the heap memory assigned to a task

Out Of Memory Exception!



Solving Not-Enough-Memory problem

New Map Joins:

Multi-phase map join (MMJ)

Reversed map join (RMJ)

JDBM-based map join (JMJ)

small table as: duplicate; large table as: fragment



Multi-phase map join

n-phase map join

[Figure: the duplicate is divided into Part 1, Part 2, ..., Part n; in each phase one part is joined against the fragment, which is divided into Splits 1 to 4 across the map tasks.]

Problem? - Reading large table multiple times!
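The n-phase idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Hadoop implementation: tables are plain lists of dicts, and the function name and the `max_rows_in_memory` parameter are hypothetical.

```python
from collections import defaultdict


def multi_phase_map_join(fragment_split, duplicate, key, max_rows_in_memory):
    """Join one split of the large table against the small table in phases.

    Only `max_rows_in_memory` rows of the duplicate are hashed at a time
    (a hypothetical stand-in for the heap limit of a map task).
    """
    joined = []
    for start in range(0, len(duplicate), max_rows_in_memory):
        # Build a hash table over just one part of the duplicate.
        part = duplicate[start:start + max_rows_in_memory]
        table = defaultdict(list)
        for row in part:
            table[row[key]].append(row)
        # The fragment split is re-read in every phase.
        for big_row in fragment_split:
            for small_row in table.get(big_row[key], []):
                joined.append({**small_row, **big_row})
    return joined
```

The inner loop over `fragment_split` runs once per phase, which is the repeated read of the large table noted above.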




Reversed map join

Default map join (in each Map task):

1. read duplicate to memory, build hash table

2. for each tuple in fragment, probe the hash table

Reversed map join (in each Map task):

1. read fragment to memory, build hash table

2. for each tuple in duplicate, probe the hash table

Problem? Not really a Map job
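The two variants differ only in which side is hashed and which side is streamed. A minimal Python sketch, assuming in-memory lists of dicts in place of Hadoop input splits (all function names are hypothetical):

```python
from collections import defaultdict


def build_hash_table(rows, key):
    """Group rows by join key."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def default_map_join(fragment_split, duplicate, key):
    # Step 1: hash the duplicate; step 2: stream the fragment and probe.
    table = build_hash_table(duplicate, key)
    return [{**d, **f} for f in fragment_split for d in table.get(f[key], [])]


def reversed_map_join(fragment_split, duplicate, key):
    # Step 1: hash the fragment; step 2: stream the duplicate and probe.
    table = build_hash_table(fragment_split, key)
    return [{**d, **f} for d in duplicate for f in table.get(d[key], [])]
```

Both functions return the same rows, possibly in a different order. In the reversed variant, memory use is bounded by the size of one fragment split rather than by the size of the duplicate.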



JDBM-based map join

JDBM is a transactional persistence engine for Java.

Using JDBM, we can eliminate OutOfMemoryException. The size of the hash table is no longer bound by the heap size.

Problem? Probing a hash table on disk might take much time!
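The trade-off can be illustrated with a disk-backed table. The sketch below does not use JDBM: it swaps in Python's standard `shelve` module as a stand-in persistent key-value store, and the function name and `db_path` parameter are hypothetical.

```python
import shelve


def disk_backed_map_join(fragment_split, duplicate, key, db_path):
    """Map join whose build-side table lives on disk instead of on the heap."""
    # flag="n" always creates a new, empty store at db_path.
    with shelve.open(db_path, flag="n") as table:
        for row in duplicate:
            k = str(row[key])  # shelve keys must be strings
            rows = table.get(k, [])
            rows.append(row)
            table[k] = rows  # written back to disk
        joined = []
        for big_row in fragment_split:
            # Every probe is a disk lookup, which is the cost noted above.
            for small_row in table.get(str(big_row[key]), []):
                joined.append({**small_row, **big_row})
    return joined
```

The table no longer has to fit in memory, but each probe pays for a disk access and deserialization.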



Advanced Joins

Step 1: Semi join on join key only;

Step 2: Use the result to filter the table;

Step 3: Join new tables.

Can be applied to both map-side and reduce-side joins

Problem? Steps 1 and 2 have overhead!
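The three steps can be sketched as follows. This is a single-process illustration, not the MapReduce jobs themselves; filtering both tables in step 2 and using a hash join in step 3 are assumptions, and `advanced_join` is a hypothetical name.

```python
from collections import defaultdict


def advanced_join(left, right, key):
    """Semi-join based join; returns the joined rows and both filtered tables."""
    # Step 1: semi join on the join key only.
    common_keys = {row[key] for row in left} & {row[key] for row in right}
    # Step 2: use the result to filter the tables.
    left_filtered = [row for row in left if row[key] in common_keys]
    right_filtered = [row for row in right if row[key] in common_keys]
    # Step 3: join the new (smaller) tables.
    table = defaultdict(list)
    for row in left_filtered:
        table[row[key]].append(row)
    joined = [{**l, **r} for r in right_filtered for l in table.get(r[key], [])]
    return joined, left_filtered, right_filtered
```

Steps 1 and 2 each scan both inputs before the actual join starts, so the approach only pays off when the filtering removes enough rows.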



The Nine Candidates

AMJ/no dist    advanced map join without DC
AMJ/dist       advanced map join with DC
DMJ/no dist    default map join without DC
DMJ/dist       default map join with DC
MMJ            multi-phase map join
RMJ/dist       reversed map join with DC
JMJ/dist       JDBM-based map join with DC
ARJ/dist       advanced reduce join with DC
DRJ            default reduce join

(DC: distributed cache)



Experiment Setup

TPC-DS benchmark

Evaluated query:

JOIN customer, web_sales ON cid

Performed on different scales of generated data, e.g. 10GB, 170GB (not actual table size)

Each combination is performed five (5) times

Results are analyzed with error bars



Hadoop Cluster

128 Hewlett Packard DL160 Compute Building Blocks

Each equipped with:

2 quad-core CPUs

16 GB RAM

2 TB storage

High-speed network connection

Used in the experiment:

Hadoop Cluster (Altocumulus): 64 nodes



Result analysis

[Figure: chart comparing AMJ/no dist, AMJ/dist, DMJ/no dist, DMJ/dist, MMJ, RMJ/dist, JMJ/dist, ARJ/dist and DRJ; axis range 0 to 400.]

Some results ignored



One small note

What does 50*200 mean?

TABLE customer: from the 50GB version of TPC-DS

- actual table size: about 100MB

TABLE web_sales: from the 200GB version of TPC-DS

- actual table size: about 30GB



Distributed Cache

[Figure: chart comparing DMJ/no dist and DMJ/dist; axis range 0 to 400.]



Distributed Cache II

Distributed cache introduces an overhead when copying the file from HDFS to local disks.

The following situations are in favor of distributed cache (compared to non-DC):

1. number of nodes is low

2. number of map tasks is high



Advanced vs. Default

[Figure: chart comparing ARJ/dist and DRJ over the table-size combinations 10*10, 10*30, 10*50, 10*70, 10*100, 10*130, 10*170, 10*200, 50*50, 50*70, 50*100, 50*130, 50*170, 50*200, 70*70; axis range 0 to 300.]



Advanced vs. Default II

[Figure: chart comparing AMJ/dist and DMJ/dist; axis range 0 to 1200.]



Advanced vs. Default III

The overhead of semi-join and filtering is heavy.

The following situations are in favor of advanced joins (compared to reduce joins):

1. join selectivity gets lower

2. network becomes slower (true!)

3. we need to handle skewed data



Map Join vs Reduce Join, Part I

[Figure: chart comparing DMJ/no dist, MMJ, JMJ/dist, ARJ/dist and DRJ; axis range 0 to 450.]



Map Join vs Reduce Join, Part II

[Figure: chart comparing DMJ/no dist, RMJ/dist, JMJ/dist, ARJ/dist and DRJ; axis range 0 to 1400.]



Beyond Default Map Join

Multi-Phase Map Join

Succeeds in all experiment groups.

Performance is comparable with DMJ when only one phase is involved.

Performance degrades sharply when the number of phases is greater than 2, because many more tasks are launched.

Currently no support for distributed cache; not scalable




Reversed Map Join

Succeeds in all experiment groups.

Does not perform as well as DRJ, due to the overhead of distributed cache

Performs best when




JDBM Map Join

Fails for the last two experiment groups, mainly due to improper configuration settings.



Join Plan Generator

Cost-based + rule-based

Focus on three aspects:

Whether or not to use distributed cache

Whether to use Default Map Join

Map joins or reduce side join

Parameters:

Number of distributed files d

Network speed v

Number of map tasks m

Number of reduce tasks r

Number of working nodes n

Small table size s

Large table size l



Join Plan Generator

Whether to use distributed cache

Only works for map join approaches

Cost model

With distributed cache:

$\frac{1}{v} \cdot s \cdot n + \alpha \cdot d$

where $\alpha$ is the average overhead to distribute one file

Without distributed cache:

$\frac{1}{v} \cdot s \cdot m$
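A small Python sketch of this decision follows. The formulas on the slide did not survive extraction cleanly, so the form used here is an assumption: with distributed cache, the small table is copied once per working node plus a per-file overhead; without it, the small table is copied once per map task. The name `overhead_per_file` and the sample values for `v` and the overhead are hypothetical.

```python
def cost_with_distributed_cache(s, n, v, d, overhead_per_file):
    """Assumed form: one copy of the small table per node, plus per-file overhead."""
    return s * n / v + overhead_per_file * d


def cost_without_distributed_cache(s, m, v):
    """Assumed form: one copy of the small table per map task."""
    return s * m / v


def use_distributed_cache(s, n, m, v, d, overhead_per_file):
    """Return True when the assumed cost with distributed cache is lower."""
    with_dc = cost_with_distributed_cache(s, n, v, d, overhead_per_file)
    without_dc = cost_without_distributed_cache(s, m, v)
    return with_dc < without_dc


# Example: 150 MB small table, 64 nodes, 352 map tasks (values from the slides);
# v = 100 MB/s and a 10 s per-file overhead are made-up numbers.
decision = use_distributed_cache(s=150, n=64, m=352, v=100, d=1, overhead_per_file=10)
```

Under this assumed model, distributed cache wins when the number of nodes is low relative to the number of map tasks.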



Join Plan Generator

Whether to use Default Map Join

We give Default Map Join the highest priority since it usually works best

The choice on distributed cache can ensure Default Map Join works efficiently

Rule: if the small table can fit into memory entirely, just do it.



Join Plan Generator

Map Joins or Default Reduce side Join

In those situations where DMJ fails, Reversed Map Join is most promising in terms of usability and scalability.

Cost model:

RMJ:

(without distributed cache)

(with distributed cache)

where is the average overhead to distribute one

file

DRJ:

dsm

dsn

),(,)( rvfls



Join Plan Generator

Distributed cache? (Y / N)

Default Map Join?

Y: Do it

N: Reversed Map Join / Default Reduce side Join: Do it
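The decision flow can be sketched as one function. This is a hedged illustration, not the authors' generator: costs are passed in as precomputed numbers because the RMJ and DRJ cost formulas are not recoverable here, and every name below is hypothetical.

```python
def choose_join_plan(hash_table_bytes, task_heap_bytes,
                     dc_cost, no_dc_cost, rmj_cost, drj_cost):
    """Return (join type, use distributed cache) following the three decisions."""
    # Decision 1 (cost-based): distributed cache or not.
    use_dc = dc_cost < no_dc_cost
    # Decision 2 (rule-based): Default Map Join if the small table fits in memory.
    if hash_table_bytes <= task_heap_bytes:
        return ("DMJ", use_dc)
    # Decision 3 (cost-based): Reversed Map Join or Default Reduce side Join.
    if rmj_cost <= drj_cost:
        return ("RMJ", use_dc)
    # Distributed cache only applies to map join approaches.
    return ("DRJ", False)
```

For example, `choose_join_plan(100, 200, 1, 2, 5, 3)` picks Default Map Join with distributed cache, because the table fits in the heap and the cached variant is cheaper.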



Summary

Distributed cache is a double-edged sword

When using distributed cache properly, Default Map Join performs best

The three new map join approaches extend the usability of default map join



Future Work

SPJA workflow

(selection, projection, join, aggregation)

Better optimizer

Multi-way join

Build to hybrid system

Need a dedicated (slower) cluster
