© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
TakLon Stephen Wu, Software developer - Amazon EMR
May 18, 2017
Performance Benchmarking in Open-Source at Amazon EMR
About me (Before 2016)

• PhD candidate (on leave) @ Indiana University
• Research in distributed systems, cloud computing, data mining, large-scale computation, and MapReduce
• User of Apache building blocks, e.g., Hadoop, HBase, Pig, Hive
• Used HBase and contributed to infrastructure that can handle TB-scale Twitter data
• Integrated Apache Pig with a Hadoop plug-in (HARP)
• Built an automatic provisioning framework to execute scientific workloads
Me at AWS (2016+)
• Software Development Engineer @ Amazon EMR
• Continuing to work on open source building blocks!
• Contributed patches to Apache Oozie and Hue
• Leading performance benchmarking for different Apache open source applications
• Building an automatic benchmarking pipeline
What to expect from the session
• Background
  • Amazon EMR
  • Amazon S3
• Why performance benchmarking?
• Our solution: build a benchmarking pipeline
• Case study
  • Amazon S3 storage mode for Apache HBase
• Key takeaways
What is Amazon EMR?

• Managed framework for big data processing
• Run Hadoop, Spark, Presto, Hive, and more
• Launch a cluster in minutes
• Baked-in security features
• Pay by the hour and save with Spot
• Auto Scaling
Open source projects on EMR

16 open source applications are provided in EMR releases:
• Apache Hadoop
• Apache Hive
• Apache Spark
• Apache HBase
• Presto
• Many others
* http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html
Using EMR with S3 as a data lake
[Diagram: Hive, Pig, Spark, Presto, and HBase on EMR, all backed by Amazon S3]
Using Amazon EMR with S3 as a data lake (cont.)
Resource independence
• Compute and storage can scale independently
• No need to scale HDFS
• Run a cluster only for the duration of a job
• Use Amazon EC2 Spot Instances

Get all the benefits of Amazon S3:
• Designed to deliver 99.999999999% durability
• Virtually unlimited scalability
• Run multiple clusters against the same copy of the data
* AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem
Why performance benchmarking?
Software development life cycle
[Diagram: software development life cycle — Design → Implement → Validate → Deploy → Release — with performance tests in the Validate stage]
Applications release often (as of May 2017)
About every 6 months
• Hadoop, Hive, etc.
About every 3 months
• Spark, Zeppelin, etc.
About every month
• Flink (between 2016/08/08 and 2017/03/22)
About every week
• Presto has had 22 releases in the last 6 months
It’s hard to keep up!
Performance benchmarking @ EMR
Release and software updates
• Compare differences between versions, especially new ones

Development
• Find the limitations and hidden values
• Figure out the best tuning parameters for a given use case
• Drive innovation

Benchmarking tools
• Exist individually, but not as a single integrated piece
Technical issues when benchmarking
Applications and benchmarks are constantly changing
Configuration
• Scalability tests
• Benchmark tuning
• Prone to errors
Benchmark health monitoring
Other issues when benchmarking
Developers may spend too much time understanding each benchmark

It’s hard to add a new benchmark without a standard model

Lack of a common format for sharing results
Our solution: build a benchmarking pipeline
Build a benchmarking pipeline
• Benchmarks are under version control
• Automate benchmark execution
• Performance metrics are collected and pushed to storage
• Compare and visualize performance
• Easy to export data and share reports
Benchmarks under control
Keep benchmark source code in git
• Developers can track changes easily

Define and standardize the execution flow
• Prepare section (data generation)
• Main section (programs or queries)
• Post section (generate performance report)

Compile benchmark scripts into programs (see the sketch below)
• Prototyped with Java annotations and the Apache Velocity template engine
• Reduce errors through automation
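To make the flow concrete, here is a minimal sketch of what an annotation-driven benchmark definition might look like. The @Prepare, @Main, and @Post annotation names and the WordCountBenchmark class are hypothetical illustrations, not the actual EMR pipeline code; a Velocity template could expand a class like this into a runnable driver.

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical annotations marking the standardized execution flow
@Retention(RetentionPolicy.RUNTIME) @interface Prepare {}
@Retention(RetentionPolicy.RUNTIME) @interface Main {}
@Retention(RetentionPolicy.RUNTIME) @interface Post {}

public class WordCountBenchmark {
    @Prepare
    public void generateData() {
        // Prepare section: stage input data on S3 or HDFS
    }

    @Main
    public void run() {
        // Main section: submit the program or query under test
    }

    @Post
    public void report() {
        // Post section: collect timings and emit a performance report
    }
}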
Automatic driver
End-to-end execution
• Start compute capacity based on the provided configuration
• Execute the benchmark with the test configuration
• Monitor each execution step in action
  • E.g., continue on failure or terminate on failure
• Terminate resources after all defined steps are complete

Provide helpful debug information for failed executions (WIP)
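As an illustration of the end-to-end driver, the sketch below uses the AWS SDK for Java to launch a transient EMR cluster, run one benchmark step, and let EMR tear the cluster down when the steps finish. The cluster shape mirrors the case study later in this talk; the cluster name and the run-benchmark.sh script path are placeholders.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

public class BenchmarkDriver {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // One benchmark run as an EMR step; TERMINATE_CLUSTER frees the
        // compute resources automatically if the step fails.
        StepConfig benchmarkStep = new StepConfig()
            .withName("run-benchmark")
            .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
            .withHadoopJarStepConfig(new HadoopJarStepConfig()
                .withJar("command-runner.jar") // ships with EMR
                .withArgs("bash", "-c", "/home/hadoop/run-benchmark.sh")); // placeholder script

        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("hbase-benchmark")
            .withReleaseLabel("emr-5.5.0")
            .withApplications(new Application().withName("HBase"))
            .withServiceRole("EMR_DefaultRole")
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withInstances(new JobFlowInstancesConfig()
                .withMasterInstanceType("c3.4xlarge")
                .withSlaveInstanceType("c3.4xlarge")
                .withInstanceCount(21)
                .withKeepJobFlowAliveWhenNoSteps(false)) // terminate after all steps complete
            .withSteps(benchmarkStep);

        System.out.println("Cluster: " + emr.runJobFlow(request).getJobFlowId());
    }
}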
Performance metric collector
Generic interface to gather data from different sources
• Read local and remote measurements
  • System information, cluster size, benchmark name
  • Runtime built-in metrics, e.g., the YARN timeline server

Persist collected data to storage before cluster termination
• Helps maintain a historical view of the data
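A minimal sketch of the persistence half of the collector, assuming results land in an S3 bucket as CSV lines so they survive cluster termination. The MetricCollector class, bucket name, and CSV layout are illustrative, not the actual pipeline code.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class MetricCollector {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Persist one data point to S3 before the cluster goes away,
    // keyed by benchmark name and timestamp for the historical view.
    public void persist(String benchmark, int clusterSize, long runtimeMillis) {
        String csvLine = String.format("%s,%d,%d,%d%n",
            benchmark, clusterSize, runtimeMillis, System.currentTimeMillis());
        String key = "metrics/" + benchmark + "/" + System.currentTimeMillis() + ".csv";
        s3.putObject("my-benchmark-results", key, csvLine); // bucket name is a placeholder
    }
}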
Compare collected data points
Define test criteria
• What is a “good” result?
• Aggregations: mean, standard deviation, and standard error, with equal weight for each step of a benchmark

Provide easy-to-use tools for comparing results
• Command line interface
• Generic interface, such as a SQL-like syntax
• Graph UI

Export to open, easy-to-share formats
• CSV
• JSON
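For the aggregation step, a minimal sketch of the statistics computed per benchmark step, with every run weighted equally and the result printed as a CSV row; the class and method names are illustrative.

import java.util.List;

public class RunStats {
    // Summarize one benchmark step across repeated runs: mean, sample
    // standard deviation, and standard error (assumes at least two runs).
    public static void summarize(String step, List<Double> runtimes) {
        int n = runtimes.size();
        double mean = runtimes.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double variance = runtimes.stream()
            .mapToDouble(t -> (t - mean) * (t - mean)).sum() / (n - 1);
        double stdDev = Math.sqrt(variance);
        double stdErr = stdDev / Math.sqrt(n);
        // CSV output keeps the result easy to export and share
        System.out.printf("%s,%.2f,%.2f,%.2f%n", step, mean, stdDev, stdErr);
    }
}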
Case study: Amazon S3 storage mode for Apache HBase
HBase
Open source, non-relational, distributed database
Runs on top of Hadoop HDFS
• Limited by the cluster instance storage

Stores large quantities of sparse data

Portions of data are cached in memory
• Read: BlockCache and BucketCache
• Write: Memstore
[Diagram: an HBase region server with Memstore, BlockCache, and an on-disk BucketCache on local disk; the WAL and HFiles reside on HDFS]
Develop new features
Amazon S3 storage mode for HBase
Develop new features (cont.)
Our assumption
• Read operations from S3 can be as fast as from HDFS
• Write performance to S3 should match, given sufficient network bandwidth
How can we confirm it? (see the sketch below)
• YCSB benchmark
• HBase built-in PerformanceEvaluation tool
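YCSB is typically driven from its command line, but the second tool can also be invoked programmatically. A minimal sketch, assuming HBase 1.2.x and its configuration are on the classpath; the flags follow the tool's command line usage, and this is an illustration rather than the benchmark code used in this talk.

import org.apache.hadoop.hbase.PerformanceEvaluation;

public class HBasePerfCheck {
    public static void main(String[] args) throws Exception {
        // Run HBase's built-in tool in client mode (--nomapred) with 10
        // client threads; sequentialWrite also creates the test table,
        // so read workloads can be run against it afterward.
        PerformanceEvaluation.main(new String[] {"--nomapred", "sequentialWrite", "10"});
    }
}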
HBase performance tests
HBase 1.2.3
• Compare HDFS and S3 storage mode (with consistent view)

YCSB workloads
• Various read, scan, update, and insert rates

Cluster size
• 21-node homogeneous c3.4xlarge cluster with a single master node
• 2 x 160 GB SSDs attached to each node

Running a total of 270 cases
• 6 different workloads
• Three dataset sizes: 10 million, 100 million, and 1 billion records
• Each workload runs 5 times
• (6 workloads × 3 datasets × 5 runs across the three storage configurations = 270 cases)
HBase tuning parameters
Parameter                                   Initial              Tuned
hbase.hregion.memstore.flush.size           134217728            402653184
hfile.block.cache.size                      0.4                  0.4
hbase.hstore.blockingStoreFiles             200                  1000
hbase.hregion.memstore.block.multiplier     4                    8
hbase.hregion.max.filesize                  1610612736 (1.6 GB)  1610612736 (1.6 GB)
hbase.bucketcache.size                      40 GB                16 GB (with HBASE-15314)

• Mitigate latency for “large” compactions
• Read from caches, especially the on-disk BucketCache
• HBASE-15314 allows multiple backing files in BucketCache
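For reference, here is the tuned column expressed as a programmatic HBase configuration. In practice these values belong in hbase-site.xml (on EMR, the hbase-site configuration classification); the sketch assumes hbase.bucketcache.size is given in megabytes, as HBase interprets values greater than 1.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TunedHBaseConfig {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.setLong("hbase.hregion.memstore.flush.size", 402653184L); // 384 MB
        conf.setFloat("hfile.block.cache.size", 0.4f);
        conf.setInt("hbase.hstore.blockingStoreFiles", 1000);
        conf.setInt("hbase.hregion.memstore.block.multiplier", 8);
        conf.setLong("hbase.hregion.max.filesize", 1610612736L);       // 1.6 GB
        conf.setFloat("hbase.bucketcache.size", 16384f);               // 16 GB, in MB
        return conf;
    }
}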
YCSB, 100 million records (before tuning)
[Chart: execution time in seconds (log10 scale) for YCSB Workloads A, B, C, D, and F, comparing HDFS against S3 storage mode. Workload mixes: A = 50% read / 50% update, B = 95% read / 5% update, C = 100% read, D = 95% read / 5% insert, F = 50% read / 50% read-modify-write.]
• Enabled BucketCache for both HDFS and S3 storage mode
• 10 HBase clients
• Too slow in Workload A and Workload D (a region server restarted during the run)
YCSB, 100 million records (after tuning)
[Chart: execution time in seconds (log10 scale) for YCSB Workloads A, B, C, D, and F, comparing HDFS, S3 storage mode, and S3 storage mode with consistent view. Workload mixes: A = 50% read / 50% update, B = 95% read / 5% update, C = 100% read, D = 95% read / 5% insert, F = 50% read / 50% read-modify-write.]
• Increased parallelism to 32 HBase clients
• Improved IOPS by splitting the BucketCache into two backing files
• Less compaction helps improve performance
Summary
Maintain a manageable collection of benchmarks for different runtimes.
Leverage a benchmarking pipeline; automation saves hours.
Archive a historical view of benchmark data points in a single repository.
Export performance results in a standard format (e.g., CSV) so they can be easily used by other developers and data scientists.
Thank you! If you have any questions,