USENIX ATC’11
HiTune
Dataflow-Based Performance Analysis for Big Data Cloud
Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu
Intel Asia-Pacific Research and Development Ltd
Shanghai, China, 200241
Big Data
“Industrial Revolution of Data”
• The heartbeat of mobile, cloud and social computing
• Expanding faster than Moore’s law
– E.g., Internet of Things
What is Big Data?
• Too large to work with using traditional tools (e.g., RDBMS)
• Require a new architecture
– Massively parallel software running on 100s~1000s of servers
Dataflow Model for Big Data Analytics
User
• Applications modeled as dataflow graphs
• Write subroutines running on the vertices
• Abstracted away from messy details of distributed computing
System runtime
• Dynamically maps dataflow graphs to the cluster
• Handles all the low-level details
– Data partitioning, task distribution, load balancing, node communication, fault tolerance, …
[Figure: the MapReduce dataflow model — Partitioned Input → Map Tasks (map, spill, sort/merge) → shuffle → Reduce Tasks (copier, sort, merge, reduce) → Aggregated Output, shown both as a sequential dataflow and as a streaming dataflow; example systems: MapReduce, Hadoop, Dryad]
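To make the division of labor concrete, here is a minimal WordCount sketch (Hadoop's classic example, and one of the workloads measured later in this talk). The user writes only the map and reduce subroutines; partitioning, shuffling, and fault tolerance are left to the runtime. This is illustrative code against the modern org.apache.hadoop.mapreduce API, not code from the talk itself:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map vertex: user subroutine, invoked once per input record
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (!tok.isEmpty()) {
          word.set(tok);
          ctx.write(word, ONE); // runtime handles partitioning + shuffle
        }
      }
    }
  }

  // Reduce vertex: user subroutine, invoked once per key group
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```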
What Worked
Parallel programming is hard
• Distributed programming is harder
• Dataflow model makes it a lot easier
An appropriately high level of abstraction
• The user only needs to consider the data parallelism exposed by the dataflow
• The runtime distributes the execution of subroutines by exploiting data dependencies encoded in the dataflow
“Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans.”
— Edward A. Lee, CGO 2007, March 2007
Auto-Partitioning Compiler for Intel Network Processor (IXP)
What Didn’t Work
The dataflow abstraction makes the Big Data system appear as a “black box”
• Very difficult for the user to understand runtime behaviors
• Performance analysis & tuning remain a big challenge
Key challenges of performance analysis for Big Data
• Massively distributed system
– How to correlate concurrent performance activities (across 10000s of programs and machines)?
• High level dataflow abstraction
– How to relate low level performance activities to high level dataflow model?
HiTune: “VTune for Hadoop”
Distributed instrumentations
• Lightweight sampling using binary instrumentation
– No source code modifications
• Implemented using Java programming language agents
– Generic sampling information collected
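As a sketch of what a sampling agent of this kind can look like (hypothetical code, not HiTune's actual implementation): a java.lang.instrument premain hook starts a daemon thread that periodically snapshots every thread's state and top stack frame — exactly the kind of generic, source-agnostic sample stream described above.

```java
import java.lang.instrument.Instrumentation;
import java.util.Map;

public final class SamplerAgent {
  // Loaded via: java -javaagent:sampler.jar ... (Premain-Class in the manifest)
  public static void premain(String args, Instrumentation inst) {
    Thread sampler = new Thread(() -> {
      while (true) {
        long ts = System.currentTimeMillis();
        // Snapshot all threads: name, state, and current top stack frame
        for (Map.Entry<Thread, StackTraceElement[]> e
            : Thread.getAllStackTraces().entrySet()) {
          Thread t = e.getKey();
          StackTraceElement[] stack = e.getValue();
          String top = stack.length > 0 ? stack[0].toString() : "<no frame>";
          // In a real agent this record would go to a local log file,
          // to be picked up by a collection framework (HiTune uses Chukwa).
          System.out.printf("%d\t%s\t%s\t%s%n",
              ts, t.getName(), t.getState(), top);
        }
        try {
          Thread.sleep(20); // ~50 Hz sampling keeps the overhead low
        } catch (InterruptedException ie) {
          return;
        }
      }
    }, "hitune-sampler");
    sampler.setDaemon(true); // never block JVM shutdown
    sampler.start();
  }
}
```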
Dataflow-driven analysis
• Reconstructing the dataflow execution process from low-level sampling information
– Based on a dataflow specification
• Implemented as several Hadoop jobs
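The dataflow specification is what ties each sampled thread back to a node or edge of the dataflow graph. A minimal, hypothetical version of that mapping is sketched below; the phase names follow the Hadoop dataflow shown earlier, but the thread-name patterns and HiTune's actual specification format are assumptions for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public final class DataflowSpec {
  // Ordered mapping: thread-name pattern -> dataflow phase.
  // Phases mirror the Hadoop dataflow: map, spill, copier (shuffle),
  // merge, reduce. The patterns here are illustrative only.
  private final Map<Pattern, String> rules = new LinkedHashMap<>();

  public DataflowSpec() {
    rules.put(Pattern.compile("MapTask.*main"), "map");
    rules.put(Pattern.compile(".*SpillThread.*"), "spill");
    rules.put(Pattern.compile(".*MapOutputCopier.*"), "copier");
    rules.put(Pattern.compile(".*InMemoryMerge.*"), "merge");
    rules.put(Pattern.compile("ReduceTask.*main"), "reduce");
  }

  // Classify one sample by its thread name; unknown threads fall
  // through to "other" so nothing is silently dropped.
  public String phaseOf(String threadName) {
    for (Map.Entry<Pattern, String> r : rules.entrySet()) {
      if (r.getKey().matcher(threadName).matches()) return r.getValue();
    }
    return "other";
  }
}
```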
HiTune 0.9
[Figure: HiTune 0.9 architecture. Instrumentation: per-node Chukwa agents (each hosting several adaptors) stream samples to Chukwa collectors. PostProcess: collectors land the data in HDFS, where Chukwa demux produces local HiTune data and HiTune parsers (run as Hadoop jobs) normalize it. Analysis: the HiTune analyzer (also a Hadoop job) emits a HiTune report (.csv). Aggregation: Hive/HiveQL queries over the report feed Excel visual report samples (.xlsm)]
Status
• Used intensively both inside Intel and by several external customers
• Open sourced under Apache License 2.0
• Available at https://github.com/hitune/hitune
Overhead
Ratio of instrumented vs. uninstrumented clusters
• Less than 2% runtime overhead due to instrumentation
[Figure: four bar charts comparing instrumented vs. uninstrumented clusters on Sort, WordCount, and Nutch indexing, for 5-, 10-, and 20-slave clusters — ratios of completion time (100–102%), CPU utilization (100–102%), memory utilization (100–102%), and throughput (98–100%)]
The Hadoop Dataflow Model
[Figure: the Hadoop dataflow model, as before — map tasks (map, spill, sort/merge) feeding reduce tasks (copier, sort, merge, reduce) over the shuffle, shown as sequential and streaming dataflows]
Case Study: Limitation of Traditional Tools
Sorting many small files (3200 500KB-sized files) using Hadoop 0.20.1
• Cluster very lightly utilized (extremely low CPU, disk I/O and network utilization)
• No obvious bottlenecks or hotspots in the cluster
• Traditional tools (e.g., system monitors and program profilers) fail to reveal the root cause
HiTune results (dataflow execution) reveal the root cause
• Upgrading to “Fair Scheduler 2.0” fixes the issue
[Figure: dataflow execution charts (map/reduce tasks vs. timeline; phases: bootstrap, map, shuffle, sort, reduce, idle) for Hadoop 0.19.1 and 0.20.1, contrasting the low-utilization issue with the fix]
Case Study: Limitation of Hadoop Logs
TeraSort
• Large gap between end of map and end of shuffle
– None of CPU, disk I/O and network bandwidth are bottlenecked during the gap
• “Shuffle Fetchers Busy Percent” metric reported by Hadoop is always 100%
– Increasing the number of copier threads brings no improvement
• Traditional tools or Hadoop logs fail to reveal the root cause
[Figure: dataflow execution chart for TeraSort, highlighting the gap between the end of map and the end of shuffle]
HiTune results (dataflow-based hotspot breakdown) reveal the root cause
• Copier threads are idle 80% of the time, waiting for the memory merge thread
• The memory merge thread is busy mostly due to compression
• Changing the compression codec to LZO fixes the issue
[Figure: thread-time breakdowns — Copier threads: busy 20%, idle 80% (reserve 79%, others 1%); Memory merge threads: busy 87% (compress 81%, others 6%), idle 13%]
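For reference, switching the map-output (shuffle) compression codec to LZO on a Hadoop 0.20-era cluster looks roughly like this. The property names are the stock Hadoop ones of that era, but the LzoCodec class comes from the separately installed hadoop-lzo add-on, so treat this as a sketch rather than a drop-in fix:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleCompression {
  // Configure map-output compression for the shuffle path.
  static Configuration withLzoShuffle() {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);
    // LzoCodec ships with the hadoop-lzo add-on, not core Hadoop
    conf.set("mapred.map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");
    return conf;
  }
}
```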
Case Study: Extensibility
Easily extended to support Hive
• Simply change the dataflow specification
Aggregation query in Hive performance benchmarks
• 68% of time spent on data input/output, Hadoop/Hive initialization & cleanup
• Critical to reduce intermediate results, improve data input/output, and reduce Hadoop/Hive overheads
[Figure: Hive dataflow stage timelines for map and reduce tasks (Stage Init → Hive Init → Hive Active → Hive Close → Stage Close), plus the time breakdown for the aggregation query — Hive Init 2.1%, Hive Active 77.6% (Hive Input 42.4%, Hive Operations 32.0%, Hive Output 3.2%), Hive Close 0.2%, Stage Close 20.2%]
Summary
HiTune - “VTune for Hadoop”
• Better insight into Hadoop runtime behaviors
– Dataflow-based analysis
• Extremely low runtime overhead
• Very good scalability & extensibility
• v0.9 open sourced under Apache License 2.0
– See https://github.com/hitune/hitune