Mrjoin Final

  • 8/10/2019 Mrjoin Final

    1/33

    Joins in Hadoop

    Gang and Ronnie


    Agenda

    Introduction of new types of joins

    Experiment results

    Join plan generator

    Summary and future work


    Problem at hand

    Map join (fragment-duplicate join)

    [Figure: four map tasks, one per split (Split 1 to Split 4) of the fragment (large table); each map task also receives a copy of the duplicate (small table).]


    Slide taken from project proposal

    Too many copies of the small table are shuffled across the network

    Partially solved: Distributed Cache

    Doesn't work with too many nodes involved

    Size of R    Size of S    Map tasks    Duplicate data
    150 MB       24 GB        352          32 GB

    Size of R    Size of S    Map tasks    Duplicate data
    150 MB       24 GB        277


    Slide taken from project proposal II

    Memory Limitation

    Hash table is not memory-efficient.

    The table size is usually larger than the heap memory assigned to a task

    Out Of Memory Exception!


    Solving Not-Enough-Memory problem

    New Map Joins:

    Multi-phase map join (MMJ)

    Reversed map join (RMJ)

    JDBM-based map join (JMJ)

    Small table as: duplicate; large table as: fragment


    Multi-phase map join

    n-phase map join

    [Figure: the duplicate is divided into Part 1, Part 2, ..., Part n; in each phase, the map tasks (Split 1 to Split 4 of the fragment) join against one part of the duplicate.]

    Problem? Reading the large table multiple times!
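    A minimal single-process sketch of the multi-phase idea, in Python rather than Hadoop code. The function name `multi_phase_map_join`, the `max_rows_in_memory` parameter and the tuple layout (join key in column 0) are assumptions made for illustration. In the deck each phase appears to launch its own tasks, whereas the sketch simply loops, which is enough to show why the fragment is read once per phase.

```python
def multi_phase_map_join(fragment_split, duplicate, max_rows_in_memory, key=0):
    """Join one fragment split against the duplicate, one part per phase.

    Hypothetical helper: only `max_rows_in_memory` rows of the duplicate
    are held in the in-memory hash table at any time.
    """
    results = []
    phases = 0
    for start in range(0, len(duplicate), max_rows_in_memory):
        # Build a hash table over the current part of the duplicate only.
        part = duplicate[start:start + max_rows_in_memory]
        table = {}
        for row in part:
            table.setdefault(row[key], []).append(row)
        # The whole fragment split is scanned again in every phase.
        for row in fragment_split:
            for match in table.get(row[key], ()):
                results.append(match + row)
        phases += 1
    return results, phases
```

    With a three-row duplicate and room for two rows in memory, the sketch runs two phases and therefore scans the fragment split twice.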


    Reversed map join

    Default map join (in each Map task):

    1. read duplicate to memory, build hash table

    2. for each tuple in fragment, probe the hash table

    Reversed map join (in each Map task):

    1. read fragment to memory, build hash table

    2. for each tuple in duplicate, probe the hash table

    Problem? Not really a Map job
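    The two build/probe orders can be sketched side by side. This is an illustrative Python sketch, not the authors' Hadoop implementation; the function names and the tuple layout (join key in column 0) are assumptions. Both functions emit the same joined rows; they differ only in which input is held in memory.

```python
from collections import defaultdict


def default_map_join(fragment_split, duplicate, key=0):
    """Build the hash table on the duplicate, probe with the fragment split."""
    table = defaultdict(list)
    for row in duplicate:
        table[row[key]].append(row)
    for row in fragment_split:
        for match in table.get(row[key], []):
            yield match + row  # duplicate columns first, then fragment columns


def reversed_map_join(fragment_split, duplicate, key=0):
    """Build the hash table on the fragment split, probe with the duplicate."""
    table = defaultdict(list)
    for row in fragment_split:
        table[row[key]].append(row)
    for row in duplicate:
        for match in table.get(row[key], []):
            yield row + match  # same column order as default_map_join
```

    In the reversed variant the memory footprint is bounded by the size of one fragment split rather than by the size of the whole duplicate.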


    JDBM-based map join

    JDBM is a transactional persistence engine for Java.

    Using JDBM, we can eliminate the OutOfMemoryError. The size of the hash table is no longer bound by the heap size.

    Problem? Probing a hash table on disk might take a long time!
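    The trade-off can be illustrated with any disk-backed keyed store. The sketch below uses Python's standard `sqlite3` module purely as a stand-in for JDBM (it is not JDBM and does not mirror its API); the function name, table name and two-column row layout are assumptions. The duplicate lives in an indexed on-disk table, so its size is not limited by process memory, but every probe becomes a disk-backed lookup.

```python
import sqlite3


def disk_backed_map_join(fragment_split, duplicate, db_path):
    """Join (key, item) fragment rows against (key, payload) duplicate rows.

    Hypothetical helper: the duplicate is stored in an on-disk SQLite table
    at `db_path` instead of an in-memory hash table.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE dup (k INTEGER, payload TEXT)")
        conn.execute("CREATE INDEX dup_k ON dup (k)")
        conn.executemany("INSERT INTO dup VALUES (?, ?)", duplicate)
        conn.commit()
        joined = []
        for k, item in fragment_split:
            # Each probe is an indexed lookup against the on-disk table.
            cursor = conn.execute("SELECT payload FROM dup WHERE k = ?", (k,))
            for (payload,) in cursor:
                joined.append((k, payload, item))
        return joined
    finally:
        conn.close()
```

    The function expects `db_path` to point at a file that does not exist yet, for example a path inside a temporary directory.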


    Advanced Joins

    Step 1: Semi join on join key only;

    Step 2: Use the result to filter the table;

    Step 3: Join new tables.

    Can be applied to both map and reduce-side joins

    Problem? Steps 1 and 2 have overhead!
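    The three steps can be sketched in a few lines of Python. This is one possible reading of the slide, not the authors' code: it assumes the semi join is computed as the intersection of the two key sets and that both tables are filtered with it. The name `advanced_join` and the tuple layout (join key in column 0) are illustrative.

```python
def advanced_join(small, large, key=0):
    """Semi-join on keys, filter both inputs, then join the filtered tables."""
    # Step 1: semi join on the join key only (no payload columns involved).
    common_keys = {row[key] for row in small} & {row[key] for row in large}

    # Step 2: use the result to drop rows that cannot find a partner.
    small_filtered = [row for row in small if row[key] in common_keys]
    large_filtered = [row for row in large if row[key] in common_keys]

    # Step 3: join the new, smaller tables (a plain hash join here).
    table = {}
    for row in small_filtered:
        table.setdefault(row[key], []).append(row)
    joined = []
    for row in large_filtered:
        for match in table[row[key]]:
            joined.append(match + row)
    return joined
```

    Steps 1 and 2 each scan the inputs once more, which is the overhead the slide points out; the filtering only pays off when it removes many rows before step 3.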


    The Nine Candidates

    (DC = distributed cache)

    AMJ/no dist    advanced map join without DC
    AMJ/dist       advanced map join with DC
    DMJ/no dist    default map join without DC
    DMJ/dist       default map join with DC
    MMJ            multi-phase map join
    RMJ/dist       reversed map join with DC
    JMJ/dist       JDBM-based map join with DC
    ARJ/dist       advanced reduce join with DC
    DRJ            default reduce join


    Experiment Setup

    TPC-DS benchmark

    Evaluated query:

    JOIN customer, web_sales ON cid

    Performed on different scales of generated data, e.g. 10GB, 170GB (not actual table size)

    Each combination is performed five (5) times

    Results are analyzed with error bars


    Hadoop Cluster

    Hadoop Cluster (Altocumulus):

    128 Hewlett Packard DL160 Compute Building Blocks

    Each equipped with:

    2 quad-core CPUs

    16 GB RAM

    2 TB storage

    High-speed network connection

    Used in the experiment: 64 nodes


    Result analysis

    [Chart: results for AMJ/no dist, AMJ/dist, DMJ/no dist, DMJ/dist, MMJ, RMJ/dist, JMJ/dist, ARJ/dist and DRJ; axis scale 0 to 400.]

    Some results ignored


    One small note

    What does 50*200 mean?

    TABLE customer: from 50GB version of TPC-DS
    - actual table size: about 100MB

    TABLE web_sales: from 200GB version of TPC-DS
    - actual table size: about 30GB


    Distributed Cache

    [Chart: DMJ/no dist vs. DMJ/dist; axis scale 0 to 400.]


    Distributed Cache II

    Distributed cache introduces an overhead when copying the file from HDFS to the local disks.

    The following situations favor distributed cache (compared to non-DC):

    1. number of nodes is low

    2. number of map tasks is high


    Advanced vs. Default

    [Chart: ARJ/dist vs. DRJ across the groups 10*10, 10*30, 10*50, 10*70, 10*100, 10*130, 10*170, 10*200, 50*50, 50*70, 50*100, 50*130, 50*170, 50*200 and 70*70; axis scale 0 to 300.]


    Advanced vs. Default II

    [Chart: AMJ/dist vs. DMJ/dist; axis scale 0 to 1200.]


    Advanced vs. Default III

    The overhead of semi-join and filtering is heavy.

    The following situations favor advanced joins (compared to reduce joins):

    1. join selectivity gets lower

    2. network becomes slower (true!)

    3. we need to handle skewed data


    Map Join vs Reduce Join--Part I

    [Chart: DMJ/no dist, MMJ, JMJ/dist, ARJ/dist and DRJ; axis scale 0 to 450.]


    Map Join vs Reduce Join-- Part II

    [Chart: DMJ/no dist, RMJ/dist, JMJ/dist, ARJ/dist and DRJ; axis scale 0 to 1400.]


    Beyond Default Map Join

    Multi-Phase Map Join

    Succeeds in all experiment groups.

    Performance is comparable with DMJ when only one phase is involved.

    Performance degrades sharply when the number of phases is greater than 2, because many more tasks are launched.

    Currently no support for distributed cache; not scalable


    Beyond Default Map Join

    Reversed Map Join

    Succeeds in all experiment groups.

    Does not perform as well as DRJ due to the overhead of distributed cache

    Performs best when


    Beyond Default Map Join

    JDBM Map Join

    Fails in the last two experiment groups, mainly due to improper configuration settings.


    Join Plan Generator

    Cost-based + rule-based

    Focus on three aspects

    Whether or not to use distributed cache

    Whether to use Default Map Join

    Map joins or reduce side join

    Parameters

    Number of distributed files    d
    Network speed                  v
    Number of map tasks            m
    Number of reduce tasks         r
    Number of working nodes        n
    Small table size               s
    Large table size               l


    Join Plan Generator

    Whether to use distributed cache

    Only works for map join approaches

    Cost model

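    With distributed cache:

    $\frac{1}{v} \cdot s \cdot n + \alpha \cdot d$

    where $\alpha$ is the average overhead to distribute one file (the original symbol did not survive extraction; $\alpha$ stands in for it)

    Without distributed cache:

    $\frac{1}{v} \cdot s \cdot m$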
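    A small numeric sketch of this comparison, in Python. It assumes the two cost terms read as s·n/v plus a per-file overhead times d (with distributed cache) and s·m/v (without), which is an interpretation of the slide's formulas. The function names, the `overhead_per_file` parameter and the units are illustrative assumptions.

```python
def cost_with_dc(s, n, v, d, overhead_per_file):
    """Ship the small table once per working node, plus a per-file overhead."""
    return s * n / v + overhead_per_file * d


def cost_without_dc(s, m, v):
    """Ship the small table once per map task."""
    return s * m / v


def use_distributed_cache(s, m, n, v, d, overhead_per_file):
    """Return True when the distributed-cache estimate is the cheaper one."""
    return cost_with_dc(s, n, v, d, overhead_per_file) < cost_without_dc(s, m, v)
```

    Under these assumptions, a 150 MB small table with 352 map tasks on 64 nodes favors the distributed cache when the network speed is taken as a hypothetical 100 MB/s and the overhead as a hypothetical 60 s per file, whereas with only 64 map tasks on the same 64 nodes the overhead is not recovered. This matches the earlier observation that distributed cache helps when the number of nodes is low and the number of map tasks is high.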


    Join Plan Generator

    Whether to use Default Map Join

    We give Default Map Join the highest priority since it usually works best

    The choice on distributed cache can ensure Default Map Join works efficiently

    Rule: if small table can fit into memory entirely, just do it.


    Join Plan Generator

    Map Joins or Default Reduce side Join

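    In those situations where DMJ fails, Reversed Map Join is most promising in terms of usability and scalability.

    Cost model:

    RMJ (without distributed cache): [formula garbled in extraction; surviving symbols: d, s, m]

    RMJ (with distributed cache): [formula garbled in extraction; surviving symbols: d, s, n]

    where [symbol lost] is the average overhead to distribute one file

    DRJ: [formula garbled in extraction; surviving symbols: f, s, l, v, r]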


    Join Plan Generator

    [Flowchart: (1) Distributed cache? Y / N. (2) Default Map Join? Y: do it; N: continue. (3) Reversed Map Join / Default Reduce side Join: do it.]
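    The decision flow can be sketched as a small function. This Python sketch is an assumption-laden reading of the flowchart, not the authors' generator: the name `choose_join_plan`, its parameters and the tie-breaking are hypothetical, and all cost estimates are taken as precomputed inputs because the deck's RMJ and DRJ formulas are not recoverable here.

```python
def choose_join_plan(small_fits_in_memory, dc_cost, no_dc_cost, rmj_cost, drj_cost):
    """Return (join_type, use_distributed_cache) following the three questions."""
    # Question 1 (cost-based): is distributing the small table via the cache cheaper?
    use_dc = dc_cost < no_dc_cost

    # Question 2 (rule-based): Default Map Join has the highest priority.
    if small_fits_in_memory:
        return ("DMJ", use_dc)

    # Question 3 (cost-based): Reversed Map Join or Default Reduce side Join.
    if rmj_cost <= drj_cost:
        return ("RMJ", use_dc)

    # Distributed cache only applies to map join approaches.
    return ("DRJ", False)
```

    For example, `choose_join_plan(True, 156, 528, 10, 20)` picks the default map join with distributed cache, because the small table fits in memory and the cache estimate is the lower one.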


    Summary

    Distributed cache is a double-edged sword

    When using distributed cache properly, Default Map Join performs best

    The three new map join approaches extend the usability of default map join


    Future Work

    SPJA workflow

    (selection, projection, join, aggregation)

    Better optimizer

    Multi-way join

    Build to hybrid system

    Need a dedicated (slower) cluster