
Technical Brief

TeraSort Benchmark Comparison for YARN

Introduction

TeraSort is a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system. It is commonly used to measure MapReduce performance of an Apache™ Hadoop® cluster. The following report compares performance of a YARN-scheduled TeraSort job on MapR and other distributions.
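For reference, the benchmark is typically run with the stock Hadoop examples jar in three phases: generate, sort, validate. The sketch below computes the row count for 1 TB; the jar path and HDFS directories are placeholders, not taken from this brief, and should be adjusted for the cluster at hand.

```shell
# Minimal TeraSort benchmark flow (jar path and HDFS paths are assumptions).
EXAMPLES_JAR=/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar
IN=/benchmarks/tera-in
OUT=/benchmarks/tera-out

# TeraGen writes 100-byte rows, so 1 TB = 10^12 bytes / 100 = 10^10 rows.
ROWS=$(( 1000000000000 / 100 ))
echo "teragen rows for 1 TB: $ROWS"

# On a live cluster the three phases would run as:
#   hadoop jar "$EXAMPLES_JAR" teragen      "$ROWS" "$IN"
#   hadoop jar "$EXAMPLES_JAR" terasort     "$IN"   "$OUT"
#   hadoop jar "$EXAMPLES_JAR" teravalidate "$OUT"  "$OUT-report"
```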

Test Results

The MapR Distribution including Apache Hadoop continues to be the fastest Hadoop distribution on the market. As seen in the figure, MapR is significantly faster than the other distribution tested (Cloudera CDH was chosen for comparison purposes), sorting 1 TB of data on a 21-node cluster in 494 seconds; the other distribution, run under the same conditions, took 822 seconds. Please refer to the Appendix for test environment details.
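As a quick check of the headline numbers, the two reported times imply a speedup of roughly 1.66x:

```shell
# Relative speedup implied by the figure: CDH time (822 s) / MapR time (494 s).
awk 'BEGIN { printf "speedup: %.2fx\n", 822 / 494 }'
# prints "speedup: 1.66x"
```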

MapR shows a significant performance advantage over other distributions for two primary reasons:

MapR Data Platform Advantage

MapR has set world records for MapReduce performance thanks to numerous differentiated performance features, including:

• Distributed metadata to eliminate bottlenecks

• C++ implementation in key components

• Fast, direct disk I/O (vs. layered I/O on top of the Linux file system)

• Optimized MapReduce shuffle algorithm

All of these features continue to provide performance benefits and a lower infrastructure footprint when applied to MapReduce v2 jobs scheduled using YARN.


Test Results (continued)

Taking Disk I/O into Account for YARN Scheduling

To calculate the system resources required for a job, the YARN scheduler today takes the memory and CPU characteristics of the nodes into account. For a MapReduce job, for instance, the optimum number of map and reduce slots is calculated based on CPU and memory availability across the nodes.

MapR additionally allows the YARN scheduler to take disk I/O characteristics into account when calculating system resources. This ensures that disk bottlenecks are correctly identified during resource allocation, making YARN jobs perform much better.
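A dry-run sketch of what a per-task resource ask with a disk dimension could look like. The `mapreduce.map.disk` and `mapreduce.reduce.disk` property names are MapR-specific and quoted here as an assumption based on MapR 4.x documentation of this feature; verify them against the installed release. The command is echoed rather than executed, since it requires a live MapR cluster.

```shell
# Per-task resource ask: memory (standard YARN) plus a disk dimension
# (MapR-specific; property names are an assumption to verify).
MEM_OPTS="-Dmapreduce.map.memory.mb=1024 -Dmapreduce.reduce.memory.mb=3072"
DISK_OPTS="-Dmapreduce.map.disk=0.5 -Dmapreduce.reduce.disk=1.33"

# Dry run: print the command instead of submitting it.
echo hadoop jar hadoop-mapreduce-examples.jar terasort \
  $MEM_OPTS $DISK_OPTS /benchmarks/tera-in /benchmarks/tera-out
```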

Conclusion

MapR provides the best Hadoop performance for a variety of workloads, as proven by MapReduce v1, MapReduce v2 (YARN), and YCSB benchmarks. Along with high reliability and random read-write NFS capability, the MapR performance advantage continues to be one of many key benefits for end users. MapR clusters have proven to be the most cost-efficient Hadoop deployments, requiring a much smaller hardware footprint than other distributions.

MapR World-Record Setting Benchmark

MapR holds the TeraSort world record, sorting 1 TB in 54 seconds on 1,003 virtual nodes on the Google Cloud Platform. Details of the MapR world-record setting benchmark can be found in the MapR blogs.


MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease-of-use and world-record speed to Hadoop, NoSQL, database and streaming applications in one unified distribution for Hadoop. MapR is used by more than 500 customers across financial services, government, healthcare, manufacturing, media, retail and telecommunications as well as by leading Global 2000 and Web 2.0 companies. Amazon, Cisco, Google and HP are part of the broad MapR partner ecosystem. Investors include Google Capital, Lightspeed Venture Partners, Mayfield Fund, NEA, Qualcomm Ventures and Redpoint Ventures. MapR is based in San Jose, CA. © 2014 MapR Technologies, Inc.

Appendix

Test Environment Details

• Number of Nodes: 21 (20 + 1 node for NameNode/CLDB and YARN ResourceManager)
• RAM: 128 GB
• Disks: 11 disks, 110 GB
• CPU: 2 × 16 cores
• Network: 10 GbE
• CDH Version: CDH 5.1.0 (YARN)
• MapR Version: MapR 4.0.1 (YARN)

Test parameters*                                        Value
mapreduce.reduce.memory.mb                              3072
mapreduce.map.memory.mb                                 1024
mapred.maxthreads.generate.mapoutput                    2
mapreduce.tasktracker.reserved.physicalmemory.mb.low    0.95
mapred.maxthreads.partition.closer                      2
mapreduce.map.sort.spill.percent                        0.99
mapreduce.reduce.merge.inmem.threshold                  0
mapreduce.job.reduce.slowstart.completedmaps            1
mapreduce.reduce.shuffle.parallelcopies                 40
mapreduce.map.speculative                               false
mapreduce.reduce.speculative                            false
mapreduce.map.output.compress                           false
mapreduce.job.reduces                                   160
mapreduce.task.io.sort.mb                               480
mapreduce.task.io.sort.factor                           400
mfs.heapsize                                            35


* Tuned according to best practices for MapReduce workloads.
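The job-level parameters from the table above can be passed as generic -D overrides on the TeraSort invocation; the sketch below covers a representative subset (cluster-side settings such as mfs.heapsize and the reserved-memory threshold belong in the site configuration instead, and the HDFS paths are placeholders). The command is echoed as a dry run.

```shell
# Job-level tuning from the Appendix, expressed as -D overrides.
TUNING="-Dmapreduce.reduce.memory.mb=3072 -Dmapreduce.map.memory.mb=1024 \
 -Dmapreduce.map.sort.spill.percent=0.99 \
 -Dmapreduce.reduce.merge.inmem.threshold=0 \
 -Dmapreduce.job.reduce.slowstart.completedmaps=1 \
 -Dmapreduce.reduce.shuffle.parallelcopies=40 \
 -Dmapreduce.map.speculative=false -Dmapreduce.reduce.speculative=false \
 -Dmapreduce.map.output.compress=false -Dmapreduce.job.reduces=160 \
 -Dmapreduce.task.io.sort.mb=480 -Dmapreduce.task.io.sort.factor=400"

# Dry run: print the command instead of submitting it.
echo hadoop jar hadoop-mapreduce-examples.jar terasort \
  $TUNING /benchmarks/tera-in /benchmarks/tera-out
```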