© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Senior Product Manager, Amazon EMR
May 20, 2015
Getting Started with Amazon EMR
Easy, fast, secure, and cost-effective Hadoop on AWS
Agenda
• Is Hadoop the answer?
• Amazon EMR 101
• Integration with AWS storage and database services
• Common Amazon EMR design patterns
• Q+A
Big Data problems are everywhere when leveraging your data to derive new insights:
• Data lacks structure
• Analyzing streams of information
• Processing large datasets
• Warehousing large datasets
• Flexibility for undefined ad hoc analysis
• Speed of queries on large data sets
Hadoop is the right system for Big Data
• Massively parallel
• Scalable and fault tolerant
• Flexibility for multiple languages
and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
Hadoop 2
• Storage: S3, HDFS
• YARN: cluster resource management
• Batch: MapReduce; Interactive: Tez; In memory: Spark
• Applications: Pig, Hive, Cascading, Mahout, Giraph, HBase, Presto, Impala

Hadoop 1
• Storage: S3, HDFS
• Batch: MapReduce
• Applications
Customers across many verticals
Amazon Elastic MapReduce (EMR) is the
easiest way to run Hadoop in the cloud.
Why Amazon EMR?
• Easy to Use: Launch a cluster in minutes
• Low Cost: Pay an hourly rate
• Elastic: Easily add or remove capacity
• Reliable: Spend less time monitoring
• Secure: Manage firewalls
• Flexible: Customize the cluster
Easy to Use: Launch a cluster in minutes
Easy to deploy
AWS Management Console AWS Command Line Interface
You can also use the Amazon EMR API with your favorite SDK
or use AWS Data Pipeline to start your clusters.
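As a sketch of what a CLI launch looks like (cluster name, key pair, applications, and instance counts below are placeholder values, and the flags reflect the 2015-era AWS CLI):

```bash
aws emr create-cluster \
  --name "Getting Started" \
  --ami-version 3.8.0 \
  --applications Name=Hive Name=Pig \
  --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```

The command returns a cluster ID (j-…) that you can use to monitor, resize, or terminate the cluster from the same CLI.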
Try different configurations to find your optimal architecture.
Choose your instance types
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Match the family to the workload: batch processing, machine learning, Spark and interactive analysis, or large HDFS capacity.
Low Cost: Pay an hourly rate
Mix on-demand and EC2 Spot capacity for low costs
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
Meet your SLA at predictable cost, or exceed your SLA at lower cost.
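A rough illustration of the cost math behind mixing the two purchase options (the node counts and hourly prices below are hypothetical examples, not current AWS rates):

```python
# Sketch: blended hourly cost of an EMR cluster that mixes
# on-demand core nodes with Spot task nodes.
# All prices are hypothetical examples, not real AWS rates.

def cluster_hourly_cost(core_nodes, core_price, task_nodes, task_price):
    """Hourly EC2 cost: on-demand core nodes plus Spot task nodes."""
    return core_nodes * core_price + task_nodes * task_price

# 4 on-demand core nodes at $0.42/hr, 8 Spot task nodes at $0.07/hr
mixed = cluster_hourly_cost(4, 0.42, 8, 0.07)

# The same 12 nodes, all at the on-demand rate
all_on_demand = cluster_hourly_cost(4, 0.42, 8, 0.42)

print(f"mixed: ${mixed:.2f}/hr vs all on-demand: ${all_on_demand:.2f}/hr")
# mixed: $2.24/hr vs all on-demand: $5.04/hr
```

Because task nodes carry no HDFS data, losing a Spot task node slows the job but never loses data, which is why the Spot discount goes on the task group.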
Use multiple EMR instance groups
Example Amazon EMR cluster:
• Master node (r3.2xlarge): runs NameNode (HDFS) and ResourceManager (YARN), and serves as a gateway
• Slave group, Core (c3.2xlarge): core nodes run HDFS (DataNode)
• Slave groups, Task (m3.xlarge and m3.2xlarge, EC2 Spot): task nodes do not run HDFS
Core and task nodes each run YARN (NodeManager).
Elastic: Easily add or remove capacity
Resizable clusters
Easy to add and remove compute capacity in your cluster from the console, CLI, or API.
Match compute demands with cluster sizing.
Use Amazon S3 instead of HDFS for your data layer to decouple your compute capacity and storage.
Shut down your EMR clusters when you are not processing data, and stop paying for them!
Reliable: Spend less time monitoring
Easy to monitor and debug
Monitor with Amazon CloudWatch or Ganglia: cluster, node, and I/O metrics.
EMR logging to S3 makes logs easily available for debugging.
Secure: Integrates with AWS security features
Use Identity and Access Management (IAM) roles with
your Amazon EMR cluster
• IAM roles provide fine-grained control over the permissions Amazon EMR uses when calling other AWS services and accessing AWS resources
• EMR uses two IAM roles:
• EMR service role is for the Amazon EMR
control plane
• EC2 instance profile is for the actual
instances in the Amazon EMR cluster
• Default IAM roles can be easily created and
used from the AWS Console and AWS CLI
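From the AWS CLI, creating and using the default roles is a two-command sketch (the role names shown in the comment are the defaults the CLI creates; this assumes the 2015-era CLI):

```bash
# Creates EMR_DefaultRole (service role) and EMR_EC2_DefaultRole
# (EC2 instance profile) if they do not already exist
aws emr create-default-roles

# Reference both defaults at cluster launch
aws emr create-cluster --use-default-roles \
  --name "Cluster with default roles" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3
```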
EMR Security Groups: default and custom
A security group is a virtual firewall which controls access to the EC2 instances in your Amazon EMR cluster
• There is a single default master and default slave security group across all of your clusters
• The master security group has port 22 access for SSHing to your cluster
You can add additional security groups to the master and slave groups on a cluster to separate them from the default master and slave security groups, and further limit ingress and egress policies.
Other Amazon EMR security features
EMRFS encryption options
• S3 server-side encryption
• S3 client-side encryption (use AWS Key Management Service keys or custom keys)
CloudTrail integration
• Track Amazon EMR API calls for auditing
Launch your Amazon EMR clusters in a VPC
• Logically isolated section of the AWS cloud (Amazon Virtual Private Cloud)
• Enhanced networking on certain instance types
Flexible: Customize the cluster
Hadoop applications available in EMR
Use Hive on EMR to interact with your data in HDFS and Amazon S3
• Batch or ad hoc workloads
• Integration with EMRFS for better performance reading and writing to S3
• SQL-like query language to make iterative queries easier
• Schema-on-read to query data without needing pre-processing
• Use the Tez engine for faster queries
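Schema-on-read in practice can be sketched in HiveQL as follows (the table name, columns, and bucket path are hypothetical):

```sql
-- Define a schema over log files already sitting in S3;
-- no load step or pre-processing is required
CREATE EXTERNAL TABLE access_logs (
  request_time STRING,
  user_id      STRING,
  url          STRING,
  status       INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/logs/';

-- Query immediately; reads go through EMRFS
SELECT status, COUNT(*) FROM access_logs GROUP BY status;
```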
Use Pig to easily create ETL workflows
• Uses the high-level "Pig Latin" language to easily script data transformations in Hadoop
• Strong optimizer for workloads
• Options to create robust user-defined functions
Use HBase on a persistent EMR cluster as a scalable NoSQL database
• Billions of rows and millions of columns
• Backup to and restore from Amazon S3
• Flexible datatypes
• Update your HBase tables as you add new data to your system
Impala: a fast SQL query engine for EMR Clusters
• Low-latency SQL query engine for Hadoop
• Perfect for fast ad hoc, interactive queries on structured or unstructured data
• Can be easily installed on an EMR cluster and queried from the CLI or a third-party BI tool
• Perfect for memory optimized instances
• Currently uses HDFS as data layer
Hadoop User Experience (Hue)
• Query editor
• Job browser
• File browser: Amazon S3 and the Hadoop Distributed File System (HDFS)
To install anything else, use Bootstrap Actions
https://github.com/awslabs/emr-bootstrap-actions
Spark: an alternative engine to Hadoop MapReduce with its own ecosystem of applications
• Does not use the MapReduce execution framework
• In-memory processing for fast queries
• Great for machine learning or other iterative workloads
• Use Spark SQL to create a low-latency data warehouse
• Spark Streaming for real-time workloads
Also use Bootstrap Actions to configure your applications:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--<keyword>-config-file (merge values in the new config file into the existing config)
--<keyword>-key-value (override the values provided)

Configuration file   Keyword   File-name shortcut   Key-value shortcut
core-site.xml        core      C                    c
hdfs-site.xml        hdfs      H                    h
mapred-site.xml      mapred    M                    m
yarn-site.xml        yarn      Y                    y
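Putting the keywords above together, a launch that overrides a single yarn-site.xml value might look like this sketch (the memory setting is an illustrative value, not a recommendation, and the flag shape follows the 2015-era CLI):

```bash
aws emr create-cluster \
  --name "Tuned cluster" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Args=["-y","yarn.nodemanager.resource.memory-mb=8192"]
```

Here "-y" is the yarn-site.xml key-value shortcut from the table above.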
EMR Step API
• An EMR step can be a MapReduce job, Hive program, Pig script, or even an arbitrary script
• Easily submit steps from the console, CLI, or API
• Submit multiple steps to use EMR as a sequential workflow engine
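Submitting a step from the CLI can be sketched as follows (the cluster ID and script path are placeholders):

```bash
# Add a Hive step to a running cluster; queued steps run sequentially
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=HIVE,Name="Daily report",ActionOnFailure=CONTINUE,\
Args=[-f,s3://my-bucket/scripts/daily-report.q]
```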
Submit work via the EMR Step API or SSH to the EMR master node
Connect to the master node
• Connect to Hue, interact with application CLIs, or submit work directly to the Hadoop APIs
• View the Hadoop UI
• Useful for long-running clusters and interactive use cases
Let’s see it!
Quick tour of the EMR Console and Hue on an EMR cluster
Diverse set of partners to use with Amazon EMR, spanning BI/visualization, Hadoop distributions, data transfer, data transformation, ETL tools, monitoring, performance tuning, and graphical IDEs. Partners are available on AWS Marketplace or as a distribution in Amazon EMR.
Integration with AWS storage
and database services
Choose your data stores
Amazon S3 as your persistent data store
• Designed for 99.999999999% durability
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at the same data in Amazon S3 using the EMR File System (EMRFS)
EMRFS makes it easier to leverage Amazon S3
• Better performance and error-handling options
• Transparent to applications: just read/write to "s3://"
• Consistent view: consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
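Consistent view is switched on when the cluster is created; a hedged CLI sketch (the --emrfs option shape follows the EMR CLI of this era, and the cluster parameters are placeholders):

```bash
aws emr create-cluster \
  --name "Consistent EMRFS" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --emrfs Consistent=true
```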
Consistent view and fast listing using the optional EMRFS metadata
EMRFS metadata in Amazon DynamoDB provides:
• List and read-after-write consistency
• Faster list operations

Number of objects   Without consistent view   With consistent view
1,000,000           147.72                    29.70
100,000             12.70                     3.69

*Tested using a single-node cluster with an m3.xlarge instance.
EMRFS support for Amazon S3 client-side encryption
Amazon EMR clusters with EMRFS enabled for Amazon S3 client-side encryption use Amazon S3 encryption clients to read and write client-side encrypted objects in Amazon S3.
Key vendor: AWS KMS or your custom key vendor.
Amazon EMR integration with Amazon Kinesis
• Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
• No intermediate data persistence required
• Simple way to introduce real-time sources into batch-oriented systems
• Multi-application support and automatic checkpointing
Use Hive on EMR to query data in DynamoDB
• Export data stored in DynamoDB to Amazon S3
• Import data in Amazon S3 to DynamoDB
• Query live DynamoDB data using SQL-like statements (HiveQL)
• Join data stored in DynamoDB and export it, or query against the joined data
• Load DynamoDB data into HDFS and use it in your EMR job
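A hedged HiveQL sketch of the DynamoDB integration (the table names, columns, and bucket path are hypothetical; the storage handler class is the one EMR ships for DynamoDB):

```sql
-- Map a Hive table onto a live DynamoDB table
CREATE EXTERNAL TABLE orders_ddb (order_id STRING, total DOUBLE)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);

-- Export the DynamoDB data to S3 with a single query
INSERT OVERWRITE DIRECTORY 's3://my-bucket/orders-export/'
SELECT * FROM orders_ddb;
```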
Use AWS Data Pipeline and EMR to transform data and load it into Amazon Redshift
Unstructured data is processed by EMR and loaded into Amazon Redshift, with the pipeline orchestrated and scheduled by AWS Data Pipeline.
Amazon EMR design patterns
Amazon EMR example #1: Batch processing
GBs of logs are pushed to Amazon S3 hourly; a daily Amazon EMR cluster uses Hive to process the data, with input and output stored in Amazon S3.
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Using Amazon S3 and HDFS
Data from multiple data sources is aggregated and stored in Amazon S3.
• A transient EMR cluster runs batch map/reduce jobs for daily and weekly reports
• A long-running EMR cluster holds data in HDFS for interactive Hive queries and ad hoc analysis
Multiple EMR workflows using the same S3 dataset
Computations (S3DistCp, Cascalog, LZO-compressed data) read from an input Amazon S3 bucket, write to an intermediate Amazon S3 bucket, and fan out to multiple final Amazon S3 buckets.
Crashlytics (part of Twitter) uses EMR to analyze data in S3 to power dashboards on its Answers platform.
Amazon EMR example #2: Long-running cluster
Data is pushed to Amazon S3. A daily Amazon EMR cluster extracts, transforms, and loads (ETL) the data into a 24/7 Amazon EMR cluster running HBase, which holds the last 2 years' worth of data. A front-end service uses the HBase cluster to power a dashboard with high concurrency.
Amazon EMR example #3: Interactive query
TBs of logs are sent daily and stored in Amazon S3. An Amazon EMR cluster uses Presto for ad hoc analysis of the entire log set.
Interactive query using Presto on a multi-petabyte warehouse:
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
EMR example #4: EMR for ETL and query engine for investigations that require all raw data
TBs of logs are sent daily and stored in S3. An hourly EMR cluster uses Spark for ETL and loads a subset into an Amazon Redshift data warehouse. A transient EMR cluster uses Spark for ad hoc analysis of the entire log set.
EMR example #5: Streaming data
A typical streaming pipeline: client/sensor, recording service, aggregator/sequencer, continuous processor, data warehouse, analytics and reporting.
• With common open source tools, Kafka serves as the streaming data repository between the recording service and the continuous processor.
• With Amazon Kinesis, Kinesis serves as the streaming data repository and feeds the continuous processor for dashboards.
• Amazon Kinesis + Amazon EMR = fewer moving parts: Kinesis handles logging (e.g., via Log4J) and acts as the streaming data repository, while Amazon EMR handles data processing.
Real-time processing with Spark Streaming and batch workloads on Kinesis streams with the Hadoop stack
• Input: a customer application pushes records with Log4J into Amazon Kinesis
• Hive, Pig, and Cascading on Amazon EMR pull from Amazon Kinesis for batch workflows; Spark pulls for real-time processing
• Processed output lands in stores such as Amazon DynamoDB, serving both real-time and batch workflows
AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices, and new cloud services.
Details
• July 1, 2015
• Chicago, Illinois
• At McCormick Place
Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking
Registration is now open
• Come and see what AWS and the cloud can do for you.
CTA Script
- If you are interested in learning more about how to navigate the cloud to grow your business, then attend the AWS Summit Chicago on July 1st.
- Register today to learn from technical sessions led by AWS engineers, hear best practices from AWS customers and partners, and participate in some of the 30+ paid sessions and labs.
- Simply go to https://aws.amazon.com/summits/chicago/?trkcampaign=summit_chicago_bootcamps&trk=Webinar_slide to register today.
- Registration is FREE.