1
HJ-Hadoop: An Optimized MapReduce Runtime for Multi-core Systems
Yunming Zhang
Advised by: Prof. Alan Cox and Vivek Sarkar
Rice University
Social Informatics
Slide borrowed from Prof. Geoffrey Fox’s presentation
Big Data Era
• 20 PB/day (2008)
• 100 PB media (2012)
• 120 PB cluster (2011)
• Bing ~ 300 PB (2012)
Slide borrowed from Florin Dinu’s presentation
4
MapReduce Runtime
[Figure 1. MapReduce programming model: Job Starts, then Map tasks run in parallel, then Reduce tasks, then Job Ends]
5
Hadoop MapReduce
• Open source implementation of the MapReduce runtime system
– Scalable
– Reliable
– Available
• Popular platform for big data analytics
6
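The map/reduce flow sketched in Figure 1 can be illustrated in plain Java with a word-count example (no Hadoop dependency; the class and method names here are ours, for illustration only):

```java
import java.util.*;

// Minimal sketch of the MapReduce model: the "map" phase emits (word, 1)
// pairs independently per input record; the "reduce" phase sums per key.
public class MapReduceSketch {
    // Map phase: one input line mapped to (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce phase: all values for the same key are combined.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : List.of("big data era", "big data")) {
            all.addAll(map(line));
        }
        System.out.println(reduce(all)); // {big=2, data=2, era=1}
    }
}
```

In real Hadoop, the map outputs are shuffled across machines before the reduce phase; this sketch only shows the logical dataflow.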
Habanero Java (HJ)
• Programming language and runtime developed at Rice University
• Optimized for multi-core systems
– Lightweight async tasks
– Work-sharing runtime
– Dynamic task parallelism
– http://habanero.rice.edu
7
Kmeans
8
Kmeans
[Diagram: the to-be-classified documents on one side, the Topics on the other]
Kmeans is an application that takes a large number of documents as input and tries to classify them into different topics.
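The classification step of Kmeans can be sketched in plain Java: each document (reduced here to a small feature vector, made up for illustration) is assigned to the nearest topic centroid.

```java
import java.util.*;

// Sketch of the Kmeans classification step: assign each document vector
// to its nearest topic centroid by squared Euclidean distance.
public class KmeansSketch {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Returns the index of the centroid (topic) closest to the document.
    static int classify(double[] doc, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (dist2(doc, centroids[c]) < dist2(doc, centroids[best])) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] topics = { {0, 0}, {10, 10} };
        System.out.println(classify(new double[] {1, 2}, topics)); // 0
        System.out.println(classify(new double[] {9, 8}, topics)); // 1
    }
}
```

The memory problem discussed in the following slides comes from the centroid array (`topics` above), which every map task must hold in memory in full.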
Kmeans using Hadoop
9
[Diagram: the to-be-classified documents are split into slices 1–8; on Machine 1, a map task runs in its own JVM and holds the Topics (cluster centroids) in memory; the same layout repeats on the other machines]
Kmeans using Hadoop
13
[Diagram: slices 1–8 of the to-be-classified documents; on Machine 1, four map tasks each run in their own JVM, and each JVM holds its own 1x copy of the Topics (duplicated in-memory cluster centroids); the same layout repeats on the other machines]
14
Memory Wall
[Chart: KMeans throughput benchmark. x-axis: Topics data size (MB) at 4 KB/topic, 0–400 MB; y-axis: number of topics / time in min, 0–200; series: Hadoop. For sequential Hadoop we used 8 mappers for 30–80 MB, 4 mappers for 100–150 MB, and 2 mappers for 180–380 MB.]
Cluster size (MB) | Hadoop full GC calls
30                | 3,542
50                | 4,390
70                | 5,186
80                | 1,108,888
15
Memory Wall
• Hadoop's approach to the problem
– Increase the memory available to each map task JVM by reducing the number of map tasks assigned to each machine.
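In classic Hadoop (1.x), that trade is made through configuration. A hedged sketch of the relevant settings (the key names are from Hadoop 1.x; the values here are illustrative only, not those used in the benchmark):

```xml
<!-- mapred-site.xml (Hadoop 1.x style; values are illustrative only) -->
<configuration>
  <!-- Fewer concurrent map tasks per TaskTracker... -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- ...so each map task JVM can be given a larger heap. -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>
  </property>
</configuration>
```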
Kmeans using Hadoop
16
[Diagram: slices 1–8 of the to-be-classified documents; on Machine 1, only two map tasks now run, each JVM holding a 2x copy of the Topics; the same layout repeats on the other machines]
17
Memory Wall
[Chart: KMeans throughput benchmark, as before (x: Topics data size (MB) at 4 KB/topic; y: number of topics / time in min; series: Hadoop). Annotation: decreased throughput due to the reduced number of map tasks per machine. For sequential Hadoop we used 8 mappers for 30–80 MB, 4 mappers for 100–150 MB, and 2 mappers for 180–380 MB.]
18
HJ-Hadoop Approach 1
[Diagram: slices 1–8 of the to-be-classified documents; on Machine 1, a single map task JVM holds one 4x copy of the Topics, and dynamic chunking distributes input slices across threads within that JVM; the same layout repeats on the other machines]
19
HJ-Hadoop Approach 1
[Diagram: a single map task JVM with dynamic chunking holds one 4x copy of the Topics; no duplicated in-memory cluster centroids]
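The dynamic chunking idea in Approach 1 can be sketched in plain Java (the names are ours, not HJ-Hadoop's actual API): one JVM holds a single read-only copy of the centroids, while worker threads dynamically pull chunks of input records from a shared queue.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of dynamic chunking: worker threads grab the next available chunk
// of records from a shared queue; all threads read ONE shared centroid array.
public class DynamicChunkingSketch {
    static int[] classifyAll(List<double[]> docs, double[][] centroids,
                             int workers, int chunkSize) throws Exception {
        int[] assignment = new int[docs.size()];
        Queue<Integer> chunkStarts = new ConcurrentLinkedQueue<>();
        for (int s = 0; s < docs.size(); s += chunkSize) chunkStarts.add(s);

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                Integer start;
                while ((start = chunkStarts.poll()) != null) { // dynamic: grab next chunk
                    int end = Math.min(start + chunkSize, docs.size());
                    for (int i = start; i < end; i++) {
                        // centroids are shared and read-only: no duplication
                        assignment[i] = nearest(docs.get(i), centroids);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return assignment;
    }

    static int nearest(double[] doc, double[][] centroids) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < doc.length; i++) { double t = doc[i] - centroids[c][i]; d += t * t; }
            if (d < bestD) { bestD = d; best = c; }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        double[][] topics = { {0}, {10} };
        List<double[]> docs = List.of(new double[]{1}, new double[]{9},
                                      new double[]{2}, new double[]{8});
        System.out.println(Arrays.toString(classifyAll(docs, topics, 2, 2))); // [0, 1, 0, 1]
    }
}
```

Pulling chunks dynamically (rather than statically splitting the input per thread) is what keeps CPU utilization high at small task granularity, as the trade-offs slide below notes.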
20
Results
[Chart: KMeans throughput benchmark (x: Topics data size (MB) at 4 KB/topic; y: number of topics / time in min; series: HJ-Hadoop, Hadoop). We used 2 mappers for HJ-Hadoop.]
21
Results
[Chart: KMeans throughput benchmark, as before (series: HJ-Hadoop, Hadoop; 2 mappers for HJ-Hadoop). Annotation: HJ-Hadoop processes 5x more Topics efficiently.]
22
Results
[Chart: KMeans throughput benchmark, as before (series: HJ-Hadoop, Hadoop; 2 mappers for HJ-Hadoop). Annotation: 4x throughput improvement.]
24
HJ-Hadoop Approach 1: only a single thread reads the input
[Diagram: slices 1–8; on Machine 1, a single map task JVM with dynamic chunking holds one 4x copy of the Topics; the same layout repeats on the other machines]
Kmeans using Hadoop
25
[Diagram: slices 1–8; on Machine 1, four map tasks each run in their own JVM with their own copy of the Topics; four threads read the input in parallel]
HJ-Hadoop Approach 2
26
[Diagram: slices 1–8; on Machine 1, a single map task JVM holds one shared copy of the Topics while four threads read the input in parallel]
HJ-Hadoop Approach 2
27
[Diagram: four threads reading input into a single map task JVM with one shared copy of the Topics; no duplicated in-memory cluster centroids]
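Approach 2, with several reader threads feeding computation that shares one in-memory model, can be sketched as a producer-consumer pipeline in plain Java (names are ours; the "reads" are simulated with in-memory slices, and real HJ-Hadoop does this inside a Hadoop map task):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of Approach 2: multiple reader threads ingest input slices in
// parallel into one shared queue; the consumer classifies records against
// a SINGLE shared in-memory model, overlapping IO with computation.
public class MultiReaderSketch {
    static int countKnownWords(List<List<String>> slices, Set<String> model,
                               int readers) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        int total = slices.stream().mapToInt(List::size).sum();

        ExecutorService io = Executors.newFixedThreadPool(readers);
        for (List<String> slice : slices) {
            io.submit(() -> queue.addAll(slice)); // reader thread: "IO" side
        }
        io.shutdown();

        // Computation overlaps with reading: consume records as they arrive,
        // probing the one shared model (no duplicated in-memory copies).
        int hits = 0;
        for (int i = 0; i < total; i++) {
            if (model.contains(queue.take())) hits++;
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        Set<String> model = Set.of("hadoop", "mapreduce");
        List<List<String>> slices = List.of(
            List.of("hadoop", "java"), List.of("mapreduce", "hadoop"));
        System.out.println(countKnownWords(slices, model, 2)); // 3
    }
}
```

The blocking queue is what provides the overlap: computation starts as soon as the first record is available, instead of waiting for a single reader thread to finish, which is the bottleneck Approach 1 suffers from.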
28
Trade-offs between the two approaches
• Approach 1
– Minimal memory overhead
– Improved CPU utilization with small task granularity
• Approach 2
– Improved IO performance
– Overlap between IO and computation
• Hybrid approach
– Improved IO with small memory overhead
– Improved CPU utilization
29
Conclusions
• Our goal is to tackle the memory inefficiency of MapReduce applications executing on multi-core systems by integrating a shared-memory parallel model into the Hadoop MapReduce runtime.
– HJ-Hadoop can solve larger problems more efficiently than Hadoop, processing 5x more data at the full throughput of the system.
– HJ-Hadoop can deliver 4x the throughput of Hadoop when processing very large in-memory data sets.