© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
TakLon Stephen Wu, Software developer - Amazon EMR
May 18, 2017
Performance Benchmarking in Open-Source at Amazon EMR
About me (Before 2016)

• PhD candidate (on leave) @ Indiana University
• Research in distributed systems, cloud computing, data mining, large-scale computation, and MapReduce
• User of Apache building blocks, e.g., Hadoop, HBase, Pig, Hive
• Used HBase and contributed to infrastructure that can handle TB-scale Twitter data
• Integrated Apache Pig with a Hadoop plug-in (HARP)
• Built an automatic provisioning framework to execute scientific workloads
Me at AWS (2016+)
• Software Development Engineer @ Amazon EMR
• Continuing to work on open source building blocks!
• Contributed patches to Apache Oozie and Hue
• Leading performance benchmarking for different Apache open source applications
• Building an automatic benchmarking pipeline
What to expect from the session
• Background
  • Amazon EMR
  • Amazon S3
• Why performance benchmarking?
• Our solution: build a benchmarking pipeline
• Case study
  • Amazon S3 storage mode for Apache HBase
• Key takeaways
What is Amazon EMR?

• Managed framework for big data processing
• Run Hadoop, Spark, Presto, Hive, and more
• Launch a cluster in minutes
• Baked-in security features
• Pay by the hour and save with Spot
• Auto Scaling
Open source projects on EMR

16 open source applications are provided in EMR releases:
• Apache Hadoop
• Apache Hive
• Apache Spark
• Apache HBase
• Presto
• Many others
* http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html
Using EMR with S3 as a data lake
[Diagram: Hive, Pig, Spark, Presto, and HBase on EMR, all backed by Amazon S3]
Using Amazon EMR with S3 as a data lake (cont.)
Resource independence
• Compute and storage can scale independently
• No need to scale HDFS
• Run a cluster only for the duration of a job
• Use Amazon EC2 Spot Instances

Get all the benefits of Amazon S3:
• Designed to deliver 99.999999999% durability
• Virtually unlimited scalability
• Run multiple clusters against the same copy of the data
* AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem
Why performance benchmarking?
Software development life cycle
[Diagram: software development life cycle — Design → Implement → Validate → Deploy → Release — with performance tests in the Validate stage]
Applications release often (as of May 2017)
About every 6 months
• Hadoop, Hive, etc.
About every 3 months
• Spark, Zeppelin, etc.
About every month
• Flink (between 2016/08/08 and 2017/03/22)
About every week
• Presto has had 22 releases in the last 6 months
It’s hard to keep up!
Performance benchmarking @ EMR
Release and software updates
• Compare differences between versions, especially new ones

Development
• Find the limitations and hidden values
• Figure out the best tuning parameters for a given use case
• Drive innovation

Benchmarking tools
• Exist individually, but not as a single integrated piece
Technical issues when benchmarking
Applications and benchmarks are constantly changing
Configuration
• Scalability tests
• Benchmark tuning
• Prone to errors
Benchmark health monitoring
Other issues when benchmarking
Developers may spend too much time understanding each benchmark

It’s hard to add a new benchmark without a standard model

Lack of a common format for sharing results
Our solution: build a benchmarking pipeline
Build a benchmarking pipeline
• Benchmarks are under version control
• Automate benchmark execution
• Performance metrics are collected and pushed to storage
• Compare and visualize performance
• Easy to export data and share reports
Benchmarks under control
Keep benchmark source code in git
• Developers can track changes easily

Define and standardize the execution flow
• Prepare section (data generation)
• Main section (programs or queries)
• Post section (generate performance report)

Compile benchmark scripts into programs (see the sketch below)
• Prototyped with Java annotations and the Apache Velocity template engine
• Reduce errors through automation
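To make the flow concrete, here is a minimal sketch of what an annotation-driven benchmark definition might look like. The @Prepare, @Main, and @Post annotation names and the WordCountBenchmark class are hypothetical illustrations, not the actual EMR pipeline code; a Velocity template could expand a class like this into a runnable driver.

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical annotations marking the standardized execution flow
@Retention(RetentionPolicy.RUNTIME) @interface Prepare {}
@Retention(RetentionPolicy.RUNTIME) @interface Main {}
@Retention(RetentionPolicy.RUNTIME) @interface Post {}

public class WordCountBenchmark {
    @Prepare
    public void generateData() {
        // Prepare section: stage input data on S3 or HDFS
    }

    @Main
    public void run() {
        // Main section: submit the program or query under test
    }

    @Post
    public void report() {
        // Post section: collect timings and emit a performance report
    }
}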
Automatic driver
End-to-end execution
• Start compute capacity based on the provided configuration
• Execute the benchmark with the test configuration
• Monitor each execution step in action
  • E.g., continue on failure or terminate on failure
• Terminate resources after all defined steps are complete

Provide helpful debug information for failed executions (WIP)
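As an illustration of the end-to-end driver, the sketch below uses the AWS SDK for Java to launch a transient EMR cluster, run one benchmark step, and let EMR tear the cluster down when the steps finish. The cluster shape mirrors the case study later in this talk; the cluster name and the run-benchmark.sh script path are placeholders.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

public class BenchmarkDriver {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // One benchmark run as an EMR step; TERMINATE_CLUSTER frees the
        // compute resources automatically if the step fails.
        StepConfig benchmarkStep = new StepConfig()
            .withName("run-benchmark")
            .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
            .withHadoopJarStepConfig(new HadoopJarStepConfig()
                .withJar("command-runner.jar") // ships with EMR
                .withArgs("bash", "-c", "/home/hadoop/run-benchmark.sh")); // placeholder script

        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("hbase-benchmark")
            .withReleaseLabel("emr-5.5.0")
            .withApplications(new Application().withName("HBase"))
            .withServiceRole("EMR_DefaultRole")
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withInstances(new JobFlowInstancesConfig()
                .withMasterInstanceType("c3.4xlarge")
                .withSlaveInstanceType("c3.4xlarge")
                .withInstanceCount(21)
                .withKeepJobFlowAliveWhenNoSteps(false)) // terminate after all steps complete
            .withSteps(benchmarkStep);

        System.out.println("Cluster: " + emr.runJobFlow(request).getJobFlowId());
    }
}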
Performance metric collector
Generic interface to gather data from different sources
• Read local and remote measurements
  • System information, cluster size, benchmark name
  • Runtime built-in metrics, e.g., the YARN timeline server

Persist collected data to storage before cluster termination
• Helps maintain a historical view of the data
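A minimal sketch of the persistence half of the collector, assuming results land in an S3 bucket as CSV lines so they survive cluster termination. The MetricCollector class, bucket name, and CSV layout are illustrative, not the actual pipeline code.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class MetricCollector {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Persist one data point to S3 before the cluster goes away,
    // keyed by benchmark name and timestamp for the historical view.
    public void persist(String benchmark, int clusterSize, long runtimeMillis) {
        String csvLine = String.format("%s,%d,%d,%d%n",
            benchmark, clusterSize, runtimeMillis, System.currentTimeMillis());
        String key = "metrics/" + benchmark + "/" + System.currentTimeMillis() + ".csv";
        s3.putObject("my-benchmark-results", key, csvLine); // bucket name is a placeholder
    }
}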
Compare collected data points
Define test criteria
• What is a “good” result?
• Aggregations: mean, standard deviation, and standard error, with equal weight for each step of a benchmark

Provide easy-to-use tools for comparing results
• Command line interface
• Generic interface, such as a SQL-like syntax
• Graph UI

Export to open, easy-to-share formats
• CSV
• JSON
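For the aggregation step, a minimal sketch of the statistics computed per benchmark step, with every run weighted equally and the result printed as a CSV row; the class and method names are illustrative.

import java.util.List;

public class RunStats {
    // Summarize one benchmark step across repeated runs: mean, sample
    // standard deviation, and standard error (assumes at least two runs).
    public static void summarize(String step, List<Double> runtimes) {
        int n = runtimes.size();
        double mean = runtimes.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double variance = runtimes.stream()
            .mapToDouble(t -> (t - mean) * (t - mean)).sum() / (n - 1);
        double stdDev = Math.sqrt(variance);
        double stdErr = stdDev / Math.sqrt(n);
        // CSV output keeps the result easy to export and share
        System.out.printf("%s,%.2f,%.2f,%.2f%n", step, mean, stdDev, stdErr);
    }
}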
Case study: Amazon S3 storage mode for Apache HBase
HBase
Open source, non-relational, distributed database
Runs on top of Hadoop HDFS
• Limited by the cluster instance storage

Stores large quantities of sparse data

Portions of data are cached in memory
• Read: BlockCache and BucketCache
• Write: Memstore
[Diagram: an HBase region server with Memstore, BlockCache, and an on-disk BucketCache on local disk; the WAL and HFiles reside on HDFS]
Develop new features
Amazon S3 storage mode for HBase
Develop new features (cont.)
Our assumption
• Read operations from S3 can be as fast as from HDFS
• Write performance to S3 should match, given sufficient network bandwidth
How can we confirm it? (see the sketch below)
• YCSB benchmark
• HBase built-in PerformanceEvaluation tool
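YCSB is typically driven from its command line, but the second tool can also be invoked programmatically. A minimal sketch, assuming HBase 1.2.x and its configuration are on the classpath; the flags follow the tool's command line usage, and this is an illustration rather than the benchmark code used in this talk.

import org.apache.hadoop.hbase.PerformanceEvaluation;

public class HBasePerfCheck {
    public static void main(String[] args) throws Exception {
        // Run HBase's built-in tool in client mode (--nomapred) with 10
        // client threads; sequentialWrite also creates the test table,
        // so read workloads can be run against it afterward.
        PerformanceEvaluation.main(new String[] {"--nomapred", "sequentialWrite", "10"});
    }
}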
HBase performance tests
HBase 1.2.3
• Compare HDFS and S3 storage mode (with consistent view)

YCSB workloads
• Various read, scan, update, and insert rates

Cluster size
• 21-node homogeneous c3.4xlarge cluster with a single master node
• 2 x 160 GB SSDs attached to each node

Running a total of 270 cases
• 6 different workloads
• Three dataset sizes: 10 million, 100 million, and 1 billion records
• Each workload runs 5 times
• (6 workloads × 3 datasets × 5 runs across the three storage configurations = 270 cases)
HBase tuning parameters
Parameter                                   Initial              Tuned
hbase.hregion.memstore.flush.size           134217728            402653184
hfile.block.cache.size                      0.4                  0.4
hbase.hstore.blockingStoreFiles             200                  1000
hbase.hregion.memstore.block.multiplier     4                    8
hbase.hregion.max.filesize                  1610612736 (1.6 GB)  1610612736 (1.6 GB)
hbase.bucketcache.size                      40 GB                16 GB (with HBASE-15314)

• Mitigate latency for “large” compactions
• Read from caches, especially the on-disk BucketCache
• HBASE-15314 allows multiple backing files in BucketCache
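For reference, here is the tuned column expressed as a programmatic HBase configuration. In practice these values belong in hbase-site.xml (on EMR, the hbase-site configuration classification); the sketch assumes hbase.bucketcache.size is given in megabytes, as HBase interprets values greater than 1.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TunedHBaseConfig {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.setLong("hbase.hregion.memstore.flush.size", 402653184L); // 384 MB
        conf.setFloat("hfile.block.cache.size", 0.4f);
        conf.setInt("hbase.hstore.blockingStoreFiles", 1000);
        conf.setInt("hbase.hregion.memstore.block.multiplier", 8);
        conf.setLong("hbase.hregion.max.filesize", 1610612736L);       // 1.6 GB
        conf.setFloat("hbase.bucketcache.size", 16384f);               // 16 GB, in MB
        return conf;
    }
}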
YCSB, 100 million records (before tuning)
[Chart: execution time in seconds (log10 scale) for YCSB Workloads A, B, C, D, and F, comparing HDFS against S3 storage mode. Workload mixes: A = 50% read / 50% update, B = 95% read / 5% update, C = 100% read, D = 95% read / 5% insert, F = 50% read / 50% read-modify-write.]
• Enabled BucketCache for both HDFS and S3 storage mode
• 10 HBase clients
• Too slow in Workload A and Workload D (a region server restarted during the run)
YCSB, 100 million records (after tuning)
[Chart: execution time in seconds (log10 scale) for YCSB Workloads A, B, C, D, and F, comparing HDFS, S3 storage mode, and S3 storage mode with consistent view. Workload mixes: A = 50% read / 50% update, B = 95% read / 5% update, C = 100% read, D = 95% read / 5% insert, F = 50% read / 50% read-modify-write.]
• Increased parallelism to 32 HBase clients
• Improved IOPS by splitting the BucketCache into two backing files
• Less compaction helps improve performance
Summary
Maintain a manageable collection of benchmarks for different runtimes.
Leverage a benchmarking pipeline; automation saves hours.
Archive a historical view of benchmark data points in a single repository.
Export performance results in a standard format (e.g., CSV) so they can be easily used by other developers and data scientists.
Thank you! If you have any questions,