USENIX ATC’11
HiTune
Dataflow-Based Performance Analysis for Big Data Cloud
Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu
Intel Asia-Pacific Research and Development Ltd
Shanghai, China, 200241
Big Data
“Industrial Revolution of Data”
• The heartbeat of mobile, cloud and social computing
• Expanding faster than Moore’s law
– E.g., Internet of Things
What is Big Data?
• Too large to work with using traditional tools (e.g., RDBMS)
• Require a new architecture
– Massively parallel software running on 100s~1000s of servers
Dataflow Model for Big Data Analytics
User
• Applications modeled as dataflow graphs
• Write subroutines running on the vertices
• Abstracted away from messy details of distributed computing
System runtime
• Dynamically maps dataflow graphs to the cluster
• Handles all the low-level details
– Data partitioning, task distribution, load balancing, node communication, fault tolerance, …
[Figure: the MapReduce dataflow model — Partitioned Input → Map Tasks (map, spill, sort/merge) → shuffle → Reduce Tasks (copier, sort, merge, reduce) → Aggregated Output, shown both as a sequential dataflow and as a streaming dataflow; example systems: MapReduce, Hadoop, Dryad]
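To make the division of labor concrete, here is a minimal WordCount sketch (Hadoop's classic example, and one of the workloads measured later in this talk). The user writes only the map and reduce subroutines; partitioning, shuffling, and fault tolerance are left to the runtime. This is illustrative code against the modern org.apache.hadoop.mapreduce API, not code from the talk itself:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map vertex: user subroutine, invoked once per input record
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (!tok.isEmpty()) {
          word.set(tok);
          ctx.write(word, ONE); // runtime handles partitioning + shuffle
        }
      }
    }
  }

  // Reduce vertex: user subroutine, invoked once per key group
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```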
What Worked
Parallel programming is hard
• Distributed programming is harder
• Dataflow model makes it a lot easier
An appropriately high level of abstraction
• The user only needs to consider the data parallelism exposed by the dataflow
• The runtime distributes the execution of subroutines by exploiting data dependencies encoded in the dataflow
“Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans.”
— Edward A. Lee, CGO 2007, March 2007
Auto-Partitioning Compiler for Intel Network Processor (IXP)
What Didn’t Work
The dataflow abstraction makes the Big Data system appear as a “black box”
• Very difficult for the user to understand runtime behaviors
• Performance analysis & tuning remain a big challenge
Key challenges of performance analysis for Big Data
• Massively distributed system
– How to correlate concurrent performance activities (across 10000s of programs and machines)?
• High level dataflow abstraction
– How to relate low level performance activities to high level dataflow model?
HiTune: “VTune for Hadoop”
Distributed instrumentations
• Lightweight sampling using binary instrumentation
– No source code modifications
• Implemented using Java programming language agents
– Generic sampling information collected
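As a sketch of what a sampling agent of this kind can look like (hypothetical code, not HiTune's actual implementation): a java.lang.instrument premain hook starts a daemon thread that periodically snapshots every thread's state and top stack frame — exactly the kind of generic, source-agnostic sample stream described above.

```java
import java.lang.instrument.Instrumentation;
import java.util.Map;

public final class SamplerAgent {
  // Loaded via: java -javaagent:sampler.jar ... (Premain-Class in the manifest)
  public static void premain(String args, Instrumentation inst) {
    Thread sampler = new Thread(() -> {
      while (true) {
        long ts = System.currentTimeMillis();
        // Snapshot all threads: name, state, and current top stack frame
        for (Map.Entry<Thread, StackTraceElement[]> e
            : Thread.getAllStackTraces().entrySet()) {
          Thread t = e.getKey();
          StackTraceElement[] stack = e.getValue();
          String top = stack.length > 0 ? stack[0].toString() : "<no frame>";
          // In a real agent this record would go to a local log file,
          // to be picked up by a collection framework (HiTune uses Chukwa).
          System.out.printf("%d\t%s\t%s\t%s%n",
              ts, t.getName(), t.getState(), top);
        }
        try {
          Thread.sleep(20); // ~50 Hz sampling keeps the overhead low
        } catch (InterruptedException ie) {
          return;
        }
      }
    }, "hitune-sampler");
    sampler.setDaemon(true); // never block JVM shutdown
    sampler.start();
  }
}
```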
Dataflow-driven analysis
• Reconstructing the dataflow execution process from low-level sampling information
– Based on a dataflow specification
• Implemented as several Hadoop jobs
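The dataflow specification is what ties each sampled thread back to a node or edge of the dataflow graph. A minimal, hypothetical version of that mapping is sketched below; the phase names follow the Hadoop dataflow shown earlier, but the thread-name patterns and HiTune's actual specification format are assumptions for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public final class DataflowSpec {
  // Ordered mapping: thread-name pattern -> dataflow phase.
  // Phases mirror the Hadoop dataflow: map, spill, copier (shuffle),
  // merge, reduce. The patterns here are illustrative only.
  private final Map<Pattern, String> rules = new LinkedHashMap<>();

  public DataflowSpec() {
    rules.put(Pattern.compile("MapTask.*main"), "map");
    rules.put(Pattern.compile(".*SpillThread.*"), "spill");
    rules.put(Pattern.compile(".*MapOutputCopier.*"), "copier");
    rules.put(Pattern.compile(".*InMemoryMerge.*"), "merge");
    rules.put(Pattern.compile("ReduceTask.*main"), "reduce");
  }

  // Classify one sample by its thread name; unknown threads fall
  // through to "other" so nothing is silently dropped.
  public String phaseOf(String threadName) {
    for (Map.Entry<Pattern, String> r : rules.entrySet()) {
      if (r.getKey().matcher(threadName).matches()) return r.getValue();
    }
    return "other";
  }
}
```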
HiTune 0.9
[Figure: HiTune 0.9 architecture. Instrumentation: per-node Chukwa agents (each hosting several adaptors) stream samples to Chukwa collectors. PostProcess: collectors land the data in HDFS, where Chukwa demux produces local HiTune data and HiTune parsers (run as Hadoop jobs) normalize it. Analysis: the HiTune analyzer (also a Hadoop job) emits a HiTune report (.csv). Aggregation: Hive/HiveQL queries over the report feed Excel visual report samples (.xlsm)]
Status
• Used intensively both inside Intel and by several external customers
• Open sourced under Apache License 2.0
• Available at https://github.com/hitune/hitune
Overhead
Ratio of instrumented vs. uninstrumented clusters
• Less than 2% runtime overhead due to instrumentation
[Figure: four bar charts comparing instrumented vs. uninstrumented clusters on Sort, WordCount, and Nutch indexing, for 5-, 10-, and 20-slave clusters — ratios of completion time (100–102%), CPU utilization (100–102%), memory utilization (100–102%), and throughput (98–100%)]
The Hadoop Dataflow Model
[Figure: the Hadoop dataflow model, as before — map tasks (map, spill, sort/merge) feeding reduce tasks (copier, sort, merge, reduce) over the shuffle, shown as sequential and streaming dataflows]
Case Study: Limitation of Traditional Tools
Sorting many small files (3200 500KB-sized files) using Hadoop 0.20.1
• Cluster very lightly utilized (extremely low CPU, disk I/O and network utilization)
• No obvious bottlenecks or hotspots in the cluster
• Traditional tools (e.g., system monitors and program profilers) fail to reveal the root cause
HiTune results (dataflow execution) reveal the root cause
• Upgrading to “Fair Scheduler 2.0” fixes the issue
[Figure: dataflow execution charts (map/reduce tasks vs. timeline; phases: bootstrap, map, shuffle, sort, reduce, idle) for Hadoop 0.19.1 and 0.20.1, contrasting the low-utilization issue with the fix]
Case Study: Limitation of Hadoop Logs
TeraSort
• Large gap between end of map and end of shuffle
– None of CPU, disk I/O and network bandwidth are bottlenecked during the gap
• “Shuffle Fetchers Busy Percent” metric reported by Hadoop is always 100%
– Increasing the number of copier threads brings no improvement
• Traditional tools or Hadoop logs fail to reveal the root cause
[Figure: dataflow execution chart for TeraSort, highlighting the gap between the end of map and the end of shuffle]
HiTune results (dataflow-based hotspot breakdown) reveal the root cause
• Copier threads are idle 80% of the time, waiting for the memory merge thread
• The memory merge thread is busy mostly due to compression
• Changing the compression codec to LZO fixes the issue
[Figure: thread-time breakdowns — Copier threads: busy 20%, idle 80% (reserve 79%, others 1%); Memory merge threads: busy 87% (compress 81%, others 6%), idle 13%]
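For reference, switching the map-output (shuffle) compression codec to LZO on a Hadoop 0.20-era cluster looks roughly like this. The property names are the stock Hadoop ones of that era, but the LzoCodec class comes from the separately installed hadoop-lzo add-on, so treat this as a sketch rather than a drop-in fix:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleCompression {
  // Configure map-output compression for the shuffle path.
  static Configuration withLzoShuffle() {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);
    // LzoCodec ships with the hadoop-lzo add-on, not core Hadoop
    conf.set("mapred.map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");
    return conf;
  }
}
```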
Case Study: Extensibility
Easily extended to support Hive
• Simply change the dataflow specification
Aggregation query in Hive performance benchmarks
• 68% of time spent on data input/output, Hadoop/Hive initialization & cleanup
• Critical to reduce intermediate results, improve data input/output, and reduce Hadoop/Hive overheads
[Figure: Hive dataflow stage timelines for map and reduce tasks (Stage Init → Hive Init → Hive Active → Hive Close → Stage Close), plus the time breakdown for the aggregation query — Hive Init 2.1%, Hive Active 77.6% (Hive Input 42.4%, Hive Operations 32.0%, Hive Output 3.2%), Hive Close 0.2%, Stage Close 20.2%]
Summary
HiTune - “VTune for Hadoop”
• Better insight into Hadoop runtime behaviors
– Dataflow-based analysis
• Extremely low runtime overhead
• Very good scalability & extensibility
• v0.9 open sourced under Apache License 2.0
– See https://github.com/hitune/hitune