Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch five years ago, our customers have launched more than 15 million Hadoop clusters on EMR. In this webinar, we introduce Amazon EMR design patterns such as using Amazon S3 instead of HDFS and taking advantage of both long- and short-lived clusters, along with other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically, and introduce ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient.
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Amazon Elastic MapReduce:
Deep Dive and Best Practices
Ian Meyers, AWS (meyersi@)
October 29th, 2014
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
What is EMR?
Map-Reduce engine
Vibrant ecosystem
Hadoop-as-a-Service
Massively parallel
Cost-effective AWS wrapper
Integrated with AWS services
Amazon EMR
[Diagram build: the EMR stack, layer by layer — HDFS and EMRFS inside the Amazon EMR cluster; Amazon S3 and Amazon DynamoDB beneath EMRFS; analytics languages and data management on top; Amazon RDS, Amazon Redshift, and AWS Data Pipeline integrated around the cluster]
Amazon EMR Introduction
Launch clusters of any size in a matter of minutes
Use a variety of instance sizes that match your workload
Don't get stuck with hardware
Don't deal with capacity planning
Run multiple clusters with different sizes, specs, and node types
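For illustration, a minimal sketch of launching a small cluster with today's aws CLI (cluster name, key pair, and log bucket are placeholders, and the AMI version is just an era-appropriate example):

# Hypothetical: launch a three-node EMR cluster
aws emr create-cluster \
  --name "dev-cluster" \
  --ami-version 3.3 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --log-uri s3://my-bucket/emr-logs/ \
  --use-default-roles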
Elastic MapReduce & Amazon S3
EMR has an optimised driver for Amazon S3
64 MB range-offset reads to increase performance
EMR Consistent View further increases performance
Addresses S3 consistency
S3 cost: $0.03/GB, with volume-based price tiering
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
Amazon EMR Design Patterns
Pattern #1: Transient vs. Alive Clusters
Pattern #2: Core Nodes and Task Nodes
Pattern #3: Amazon S3 & HDFS
Pattern #1: Transient vs. Alive Clusters
Pattern #1: Transient Clusters
Cluster lives for the duration of the job
Shut down the cluster when the job is done
Data persists on Amazon S3
Input & output data on Amazon S3
Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
• Cluster goes away when job is done
3. Practice cloud architecture
• Pay for what you use
• Data processing as a workflow
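A transient cluster maps directly onto the CLI: submit the work as steps and let the cluster terminate itself when they finish. A minimal sketch, assuming a Hive script on S3 (names and paths are placeholders):

# Hypothetical: run one Hive step, then shut the cluster down automatically
aws emr create-cluster \
  --name "nightly-etl" \
  --ami-version 3.3 \
  --instance-type m3.xlarge --instance-count 5 \
  --ec2-attributes KeyName=my-key --use-default-roles \
  --steps Type=HIVE,Name=etl,Args=[-f,s3://my-bucket/etl.q] \
  --auto-terminate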
Alive Clusters
Very similar to traditional Hadoop deployments
Cluster stays around after the job is done
Data persistence model:
• Amazon S3
• Amazon S3 copied to HDFS
• HDFS, with Amazon S3 as backup
Alive Clusters
Always keep data safe on Amazon S3, even if you're using HDFS for primary storage
Get in the habit of shutting down your cluster and starting a new one, once a week or month
Design your data processing workflow to account for failure
You can use workflow management tools such as AWS Data Pipeline
Pattern #2: Core & Task Nodes
Core Nodes
[Diagram: master instance group and core instance group in an Amazon EMR cluster]
Run TaskTrackers (compute)
Run DataNode (HDFS)
Core Nodes
Can add core nodes
• More HDFS space
• More CPU/memory
Core Nodes
Can't remove core nodes, because of HDFS
Amazon EMR Task Nodes
[Diagram: task instance group alongside the master and core instance groups]
Run TaskTrackers
No HDFS
Reads from core node HDFS
Amazon EMR Task Nodes
Can add task nodes
• More CPU power
• More memory
Amazon EMR Task Nodes
You can remove task nodes when processing is completed
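Resizing is just an API call. A minimal sketch of shrinking a task group to zero once processing is done (the instance group ID is a placeholder; describe-cluster will list the real one):

# Hypothetical: remove all task nodes from a running cluster
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=0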
Task Node Use-Cases
Speed up job processing using the Spot market
• Run task nodes on the Spot market
• Get a discount on the hourly price
• Nodes can come and go without interruption to your cluster
When you need extra horsepower for a short amount of time
• Example: need to pull a large amount of data from Amazon S3
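One way to attach Spot task nodes with the aws CLI, as a hedged sketch (cluster ID and bid price are placeholders):

# Hypothetical: add ten m3.xlarge Spot task nodes to a running cluster
aws emr add-instance-groups \
  --cluster-id j-XXXXXXXXXXXX \
  --instance-groups InstanceGroupType=TASK,InstanceType=m3.xlarge,InstanceCount=10,BidPrice=0.10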
Pattern #3: Amazon S3 & HDFS
Option 1: Amazon S3 as HDFS
Use Amazon S3 as your permanent data store
HDFS for temporary storage of data between jobs
No additional step to copy data to HDFS
[Diagram: Amazon EMR cluster (task and core instance groups with HDFS) reading and writing directly to Amazon S3]
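Because the EMR S3 driver is a Hadoop file system, jobs can name s3:// paths directly as input and output. A minimal streaming sketch (bucket, scripts, and the jar path, which varies by AMI, are all placeholders):

# Hypothetical: read from S3 and write to S3 with no copy step
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input s3://my-bucket/input/ \
  -output s3://my-bucket/output/ \
  -mapper my_mapper.py \
  -reducer my_reducer.py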
Benefits: Amazon S3 as HDFS
Ability to shut down your cluster
• HUGE benefit!!
Use Amazon S3 as your durable storage
• 11 9s of durability
Benefits: Amazon S3 as HDFS
No need to scale HDFS
• Capacity
• Replication for durability
Amazon S3 scales with your data
• Both in IOPS and data storage
Benefits: Amazon S3 as HDFS
Ability to share data between multiple clusters
• Hard to do with HDFS
[Diagram: two EMR clusters sharing one Amazon S3 bucket]
Benefits: Amazon S3 as HDFS
Take advantage of Amazon S3 features
• Amazon S3 server-side encryption
• Amazon S3 lifecycle policies
• Amazon S3 versioning to protect against corruption
Build elastic clusters
• Add nodes to read from Amazon S3
• Remove nodes with data safe on Amazon S3
EMR Consistent View
Provides a 'consistent view' of data on S3 within a cluster
Ensures that all files created by a step are available to subsequent steps
Index of data from S3, managed by DynamoDB
Configurable retry & metastore
New Hadoop config file: emrfs-site.xml
fs.s3.consistent* system properties
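For illustration, a minimal sketch of enabling consistent view at launch via the CLI's --emrfs option (retry values are placeholders; they correspond to the fs.s3.consistent* properties in emrfs-site.xml):

# Hypothetical: create a cluster with EMRFS consistent view on
aws emr create-cluster \
  --name "consistent-cluster" \
  --ami-version 3.3 \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key --use-default-roles \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30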
EMR Consistent View
[Diagram: EMRFS on the Amazon EMR cluster reads file data from Amazon S3 and checks the processed-files registry in Amazon DynamoDB]
EMR Consistent View
Manage data in EMRFS using the emrfs client:
• describe-metadata, set-metadata-capacity, delete-metadata, create-metadata, list-metadata-stores - work with metadata stores
• diff - show what in a bucket is missing from the index
• delete - remove index entries
• sync - ensure that the index is synced with a bucket
• import - import bucket items into the index
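For example, a hypothetical repair session on one bucket prefix (bucket name is a placeholder):

# Hypothetical: report index/bucket drift, then bring the index in sync
emrfs diff s3://my-bucket/data/
emrfs sync s3://my-bucket/data/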
What About Data Locality?
Run your job in the same region as your Amazon S3 bucket
Amazon EMR nodes have high-speed connectivity to Amazon S3
If your job is CPU/memory-bound, locality doesn't make a huge difference
Amazon S3 provides near-linear scalability
S3 Streaming Performance
100 VMs: 9.6 GB/s, $26/hr
350 VMs: 28.7 GB/s, $90/hr
34 secs per terabyte
[Chart: performance & scalability — GB/second vs. reader connections]
When HDFS is a Better Choice…
Iterative workloads
• If you're processing the same dataset more than once
Disk I/O intensive workloads
Option 2: Optimise for Latency with HDFS
1. Data persisted on Amazon S3
2. Launch Amazon EMR and copy data to HDFS with S3DistCp
3. Start processing data on HDFS
Benefits: HDFS instead of S3
Better pattern for I/O-intensive workloads
Amazon S3 as system of record
• Durability
• Scalability
• Cost
• Features: lifecycle policy, security
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
Amazon EMR Nodes and Size
Use m1.small instances for functional testing
Use xlarge+ nodes for production workloads
Use CC2/C3 for memory- and CPU-intensive jobs
HS1, HI1, I2 instances for HDFS workloads
Prefer a smaller cluster of larger nodes
Holy Grail Question
How many nodes do I need?
Instance Resource Allocation
• Hadoop 1 - static number of mappers/reducers configured for the cluster nodes
• Hadoop 2 - variable number of Hadoop applications based on file splits and available memory
• Useful to understand old vs. new sizing
Instance Resources
[Chart: memory (GB), mappers, reducers, CPU (ECU units), and local storage (GB) by instance type]
Cluster Sizing Calculation
1. Estimate the number of tasks your job requires.
2. Pick an instance and note down the number of tasks it can run in parallel.
3. Pick sample data files to run a test workload. The number of sample files should be the same number from step #2.
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
Cluster Sizing Calculation
Estimated number of nodes =
(Total tasks × time to process sample files) / (Instance task capacity × desired processing time)
Example: Cluster Sizing Calculation
1. Estimate the number of tasks your job requires
• 150
2. Pick an instance and note down the number of tasks it can run in parallel
• m1.xlarge with 8 task capacity per instance
3. Pick sample data files to run a test workload; the number of sample files should be the same number from step #2
• 8 files selected for our sample test
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3; note down the time taken
• 3 min to process 8 files
Cluster Sizing Calculation
Estimated number of nodes =
(Total tasks for your job × time to process sample files) / (per-instance task capacity × desired processing time)
= (150 × 3 min) / (8 × 5 min) ≈ 11 m1.xlarge
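As a quick sanity check, the same arithmetic in shell, mirroring the example's inputs:

# Worked example: (150 tasks * 3 min) / (8 tasks per node * 5 min target)
total_tasks=150; sample_time=3
task_capacity=8; target_time=5
echo $(( (total_tasks * sample_time) / (task_capacity * target_time) ))  # prints 11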
File Best Practices
Avoid small files at all costs (smaller than 100 MB)
Use compression
Holy Grail Question
What if I have small file issues?
Dealing with Small Files
Use S3DistCp to combine smaller files together
S3DistCp takes a pattern and a target size to combine smaller input files into larger ones

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,\
--dest,hdfs:///local,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128'
Compression
Always compress data files on Amazon S3
• Reduces bandwidth between Amazon S3 and Amazon EMR
• Speeds up your job
Compress task output
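As a sketch of what that looks like in a Hadoop 1 era job (jar and class names are placeholders; -D overrides require a ToolRunner-based main class):

# Hypothetical: Snappy-compress both map output and final job output
hadoop jar my-job.jar MyJob \
  -D mapred.compress.map.output=true \
  -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  input/ output/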
Compression
Compression types:
• Some are fast BUT offer less space reduction
• Some are space-efficient BUT slower
• Some are splittable and some are not

Algorithm | % Space Remaining | Encoding Speed | Decoding Speed
GZIP      | 13%               | 21 MB/s        | 118 MB/s
LZO       | 20%               | 135 MB/s       | 410 MB/s
Snappy    | 22%               | 172 MB/s       | 409 MB/s
Changing Compression Type
You may decide to change compression type
Use S3DistCp to change the compression type of your files
Example:

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,\
--dest,hdfs:///local,\
--outputCodec,lzo'
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
M1/C1 Instance Families
Heavily used by EMR customers
However, HDFS utilisation is typically very low
M3/C3 offers better performance/$
M1 vs M3

Instance   | Cost / Map Task | Cost / Reduce Task
m1.large   | $0.08           | $0.15
m1.xlarge  | $0.06           | $0.15
m3.xlarge  | $0.04           | $0.07
m3.2xlarge | $0.04           | $0.07
C1 vs C3

Instance   | Cost / Map Task | Cost / Reduce Task
c1.medium  | $0.13           | $0.13
c1.xlarge  | $0.35           | $0.70
c3.xlarge  | $0.05           | $0.11
c3.2xlarge | $0.05           | $0.11
Orc vs Parquet
File formats designed for SQL/data warehousing on Hadoop
Columnar file formats
Compress well
• High row count, low cardinality
Orc File Format
Optimised Row Columnar format
• Zlib or Snappy external compression
• 250 MB stripe of one column and index
• Run-length or dictionary encoding
• 1 output file per container task
Parquet File Format
• Gzip or Snappy external compression
• Array data structures
• Limited data type support for Hive
• Batch creation
• 1 GB files
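For illustration, a hypothetical pair of Hive statements declaring the same table in each format (table and column names are invented; STORED AS PARQUET requires Hive 0.13+):

# Hypothetical: ORC (Snappy-compressed) and Parquet versions of one table
hive -e "
CREATE TABLE events_orc (user_id BIGINT, event STRING)
  STORED AS ORC TBLPROPERTIES ('orc.compress' = 'SNAPPY');
CREATE TABLE events_parquet (user_id BIGINT, event STRING)
  STORED AS PARQUET;
"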
Orc vs Parquet
Depends on the Tool you are using
Consider Future Architecture & Requirements
Test Test Test
In Summary
• Practice cloud architecture with transient clusters
• Utilize S3 as the system of record for durability
• Utilize task nodes on Spot for increased performance and lower cost
• Move to new instance families for better performance/$
• Exciting developments around columnar file formats