© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz Senior Product Manager, Amazon EMR May 20, 2015 Getting Started with Amazon EMR Easy, fast, secure, and cost-effective Hadoop on AWS.

AWS May Webinar Series - Getting Started with Amazon EMR


Page 1: AWS May Webinar Series - Getting Started with Amazon EMR

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jonathan Fritz, Senior Product Manager, Amazon EMR

May 20, 2015

Getting Started with Amazon EMR

Easy, fast, secure, and cost-effective Hadoop on AWS.

Page 2: AWS May Webinar Series - Getting Started with Amazon EMR

Agenda

• Is Hadoop the answer?

• Amazon EMR 101

• Integration with AWS storage and database services

• Common Amazon EMR design patterns

• Q+A

Page 3: AWS May Webinar Series - Getting Started with Amazon EMR

When leveraging your data to derive new insights, Big Data problems are everywhere

• Data lacks structure

• Analyzing streams of information

• Processing large datasets

• Warehousing large datasets

• Flexibility for undefined ad hoc analysis

• Speed of queries on large data sets

Page 4: AWS May Webinar Series - Getting Started with Amazon EMR

Hadoop is the right system for Big Data

• Massively parallel

• Scalable and fault tolerant

• Flexibility for multiple languages and data formats

• Open source

• Ecosystem of tools

• Batch and real-time analytics

Page 5: AWS May Webinar Series - Getting Started with Amazon EMR

Hadoop 1

• Applications

• Batch: MapReduce

• Storage: S3, HDFS

Hadoop 2

• Applications: Pig, Hive, Cascading, Mahout, Giraph

• HBase, Presto, Impala

• Batch: MapReduce

• Interactive: Tez

• In Memory: Spark

• YARN: Cluster Resource Management

• Storage: S3, HDFS

Page 6: AWS May Webinar Series - Getting Started with Amazon EMR

Customers across many verticals

Page 7: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon Elastic MapReduce (EMR) is the

easiest way to run Hadoop in the cloud.

Page 8: AWS May Webinar Series - Getting Started with Amazon EMR

Why Amazon EMR?

• Easy to Use: Launch a cluster in minutes

• Low Cost: Pay an hourly rate

• Elastic: Easily add or remove capacity

• Reliable: Spend less time monitoring

• Secure: Manage firewalls

• Flexible: Customize the cluster

Page 9: AWS May Webinar Series - Getting Started with Amazon EMR

Easy to Use: Launch a cluster in minutes

Page 10: AWS May Webinar Series - Getting Started with Amazon EMR

Easy to deploy

• AWS Management Console

• AWS Command Line Interface

You can also use the Amazon EMR API with your favorite SDK or use AWS Data Pipeline to start your clusters.

Page 11: AWS May Webinar Series - Getting Started with Amazon EMR

Choose your instance types

Try different configurations to find your optimal architecture.

• General (m1 family, m3 family): batch processing

• CPU (c3 family, cc1.4xlarge, cc2.8xlarge): machine learning

• Memory (m2 family, r3 family): Spark and interactive

• Disk/IO (d2 family, i2 family): large HDFS

Page 12: AWS May Webinar Series - Getting Started with Amazon EMR

Low Cost: Pay an hourly rate

Page 13: AWS May Webinar Series - Getting Started with Amazon EMR

Mix on-demand and EC2 Spot capacity for low costs

• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity. Meet SLA at predictable cost.

• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing. Exceed SLA at lower cost.
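
A rough way to reason about this mix is to price core nodes at the on-demand rate and task nodes at a discounted Spot rate. The sketch below is illustrative only: the hourly rate, node counts, and 70% Spot discount are made-up assumptions, not AWS prices (the slide only says Spot can be up to 90% off), and it ignores the separate Amazon EMR charge.

```python
def hourly_cluster_cost(on_demand_rate, core_nodes, task_nodes, spot_discount):
    """Estimate hourly EC2 cost with core nodes on-demand and task nodes on Spot.

    spot_discount is the fraction saved versus on-demand (0.0 to 1.0).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    core_cost = core_nodes * on_demand_rate
    task_cost = task_nodes * on_demand_rate * (1.0 - spot_discount)
    return core_cost + task_cost


# Hypothetical inputs: $0.50/hour per node, 4 core nodes, 6 task nodes.
RATE = 0.50
all_on_demand = hourly_cluster_cost(RATE, core_nodes=4, task_nodes=6, spot_discount=0.0)
mixed = hourly_cluster_cost(RATE, core_nodes=4, task_nodes=6, spot_discount=0.7)
print(f"all on-demand: ${all_on_demand:.2f}/hour, mixed: ${mixed:.2f}/hour")
```

With these assumed numbers, only the task-node share of the bill shrinks; the core-node cost stays predictable.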

Page 14: AWS May Webinar Series - Getting Started with Amazon EMR

Use multiple EMR instance groups

Example Amazon EMR Cluster

• Master Node: r3.2xlarge

• Slave Group - Core: c3.2xlarge

• Slave Group - Task: m3.xlarge (EC2 Spot)

• Slave Group - Task: m3.2xlarge (EC2 Spot)

Master node runs NameNode (HDFS), ResourceManager (YARN), and serves as a gateway.

Core nodes run HDFS (DataNode). Task nodes do not run HDFS. Core and Task nodes each run YARN (NodeManager).
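
The example cluster on this slide could be described to the Amazon EMR API roughly as follows. This is a sketch, not a full launch script: the dictionary keys follow the instance group shape of the EMR RunJobFlow request as I understand it, the helper name `instance_group` is hypothetical, and the instance counts and bid prices are invented for illustration. Nothing here calls AWS.

```python
def instance_group(name, role, instance_type, count, bid_price=None):
    """Build one instance group entry; passing a bid price marks it as Spot."""
    group = {
        "Name": name,
        "InstanceRole": role,            # "MASTER", "CORE", or "TASK"
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": "SPOT" if bid_price is not None else "ON_DEMAND",
    }
    if bid_price is not None:
        group["BidPrice"] = bid_price    # string, USD per instance-hour
    return group


# Instance types come from the slide; counts and bids are assumptions.
instance_groups = [
    instance_group("Master", "MASTER", "r3.2xlarge", 1),
    instance_group("Core", "CORE", "c3.2xlarge", 4),
    instance_group("Task A", "TASK", "m3.xlarge", 4, bid_price="0.10"),
    instance_group("Task B", "TASK", "m3.2xlarge", 2, bid_price="0.20"),
]

for group in instance_groups:
    print(group["Name"], group["InstanceType"], group["Market"])
```

Keeping the master and core groups on-demand protects HDFS and the cluster itself, while the two Spot task groups only add compute.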

Page 15: AWS May Webinar Series - Getting Started with Amazon EMR

Elastic: Easily add or remove capacity

Page 16: AWS May Webinar Series - Getting Started with Amazon EMR

Resizable clusters

Easy to add and remove compute capacity in your cluster from the console, CLI, or API.

Match compute demands with cluster sizing.

Page 17: AWS May Webinar Series - Getting Started with Amazon EMR

Use S3 instead of HDFS for your data layer to decouple your compute capacity and storage

Amazon S3

Amazon EMR

Shut down your EMR clusters when you are not processing data, and stop paying for them!

Page 18: AWS May Webinar Series - Getting Started with Amazon EMR

Reliable: Spend less time monitoring

Page 19: AWS May Webinar Series - Getting Started with Amazon EMR

Easy to monitor and debug

Monitor with Amazon CloudWatch or Ganglia

Cluster, Node, and IO

Monitor Debug

Page 20: AWS May Webinar Series - Getting Started with Amazon EMR

EMR logging to S3 makes logs easily available

Page 21: AWS May Webinar Series - Getting Started with Amazon EMR

Secure: Integrates with AWS security features

Page 22: AWS May Webinar Series - Getting Started with Amazon EMR

Use Identity and Access Management (IAM) roles with your Amazon EMR cluster

• IAM roles give fine-grained control over delegating permissions to AWS services and over access to AWS resources

• EMR uses two IAM roles:

• EMR service role is for the Amazon EMR control plane

• EC2 instance profile is for the actual instances in the Amazon EMR cluster

• Default IAM roles can be easily created and used from the AWS Console and AWS CLI
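
As a minimal sketch of how the two roles show up in a cluster launch request: the keys `ServiceRole` and `JobFlowRole` follow the EMR RunJobFlow request shape as I understand it, and `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are the usual names of the default roles. The helper `with_default_roles` and the cluster name are hypothetical, and nothing here calls AWS.

```python
def with_default_roles(request, service_role="EMR_DefaultRole",
                       instance_profile="EMR_EC2_DefaultRole"):
    """Return a copy of a launch request with both IAM roles filled in."""
    updated = dict(request)
    updated["ServiceRole"] = service_role        # used by the EMR control plane
    updated["JobFlowRole"] = instance_profile    # EC2 instance profile for cluster nodes
    return updated


request = with_default_roles({"Name": "example-cluster"})
print(request)
```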

Page 23: AWS May Webinar Series - Getting Started with Amazon EMR

EMR Security Groups: default and custom

A security group is a virtual firewall which controls access to the EC2 instances in your Amazon EMR cluster

• There is a single default master and default slave security group across all of your clusters

• The master security group has port 22 access for SSHing to your cluster

You can add additional security groups to the master and slave groups on a cluster to separate them from the default master and slave security groups, and further limit ingress and egress policies.

Master Security Group

Slave Security Group

Page 24: AWS May Webinar Series - Getting Started with Amazon EMR

Other Amazon EMR security features

EMRFS encryption options

• S3 server-side encryption

• S3 client-side encryption (use AWS Key Management Service keys or custom keys)

CloudTrail integration

• Track Amazon EMR API calls for auditing

Launch your Amazon EMR clusters in a VPC

• Logically isolated portion of the cloud (“Virtual Private Cloud”)

• Enhanced networking on certain instance types

Page 25: AWS May Webinar Series - Getting Started with Amazon EMR

Flexible: Customize the cluster

Page 26: AWS May Webinar Series - Getting Started with Amazon EMR

Hadoop applications available in EMR

Page 27: AWS May Webinar Series - Getting Started with Amazon EMR

Use Hive on EMR to interact with your data in HDFS and Amazon S3

• Batch or ad hoc workloads

• Integration with EMRFS for better performance reading and writing to S3

• SQL-like query language to make iterative queries easier

• Schema-on-read to query data without needing pre-processing

• Use Tez engine for faster queries
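
To make the schema-on-read point concrete, the sketch below generates a HiveQL statement that lays a table definition over files already sitting in Amazon S3, without moving or pre-processing them. The helper name, table name, columns, and bucket path are all hypothetical, and the tab-delimited layout is an assumption about the data.

```python
def external_table_ddl(table, columns, s3_location, delimiter="\\t"):
    """Build a HiveQL CREATE EXTERNAL TABLE statement over existing S3 data."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}';"
    )


ddl = external_table_ddl(
    "access_logs",                                   # hypothetical table
    [("request_time", "STRING"), ("status", "INT"), ("bytes_sent", "BIGINT")],
    "s3://example-bucket/logs/",                     # hypothetical bucket
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata; the objects in S3 stay in place.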

Page 28: AWS May Webinar Series - Getting Started with Amazon EMR

Use Pig to easily create ETL workflows

• Uses high-level “Pig Latin” language to easily script data transformations in Hadoop

• Strong optimizer for workloads

• Options to create robust user defined functions

Page 29: AWS May Webinar Series - Getting Started with Amazon EMR

Use HBase on a persistent EMR cluster as a scalable NoSQL database

• Billions of rows and millions of columns

• Backup to and restore from Amazon S3

• Flexible datatypes

• Modulate your HBase tables when adding new data to your system

Page 30: AWS May Webinar Series - Getting Started with Amazon EMR

Impala: a fast SQL query engine for EMR Clusters

• Low-latency SQL query engine for Hadoop

• Perfect for fast ad hoc, interactive queries on structured or unstructured data

• Can be easily installed on an EMR cluster, and queried from the CLI or a 3rd party BI tool

• Perfect for memory optimized instances

• Currently uses HDFS as data layer

Page 31: AWS May Webinar Series - Getting Started with Amazon EMR

Hadoop User Experience (Hue)

Query Editor

Page 32: AWS May Webinar Series - Getting Started with Amazon EMR

Hue

Job Browser

Page 33: AWS May Webinar Series - Getting Started with Amazon EMR

Hue

File Browser: Amazon S3 and the Hadoop Distributed File System (HDFS)

Page 34: AWS May Webinar Series - Getting Started with Amazon EMR

To install anything else, use Bootstrap Actions

https://github.com/awslabs/emr-bootstrap-actions

Page 35: AWS May Webinar Series - Getting Started with Amazon EMR

Spark: an alternative engine to Hadoop with its own ecosystem of applications

• Does not use map-reduce framework

• In-memory for fast queries

• Great for machine learning or other iterative queries

• Use Spark SQL to create a low-latency data warehouse

• Spark Streaming for real-time workloads

Page 36: AWS May Webinar Series - Getting Started with Amazon EMR

Also use Bootstrap Actions to configure your applications

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop

--keyword-config-file (Merge values in new config to existing)

--keyword-key-value (Override values provided)

Configuration File Name | Configuration File Keyword | File Name Shortcut | Key-Value Pair Shortcut
core-site.xml | core | C | c
hdfs-site.xml | hdfs | H | h
mapred-site.xml | mapred | M | m
yarn-site.xml | yarn | Y | y
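
The shortcut table can be turned into bootstrap action arguments mechanically. The sketch below assumes each shortcut is passed as a single-dash flag followed by either a file path or a `key=value` pair, and that a bootstrap action is described by a name plus a script path and argument list, as in the EMR RunJobFlow request. The helper name, the example setting, and the example bucket are hypothetical; nothing here calls AWS.

```python
CONFIGURE_HADOOP = "s3://elasticmapreduce/bootstrap-actions/configure-hadoop"

# keyword -> (file name shortcut, key-value pair shortcut), from the slide's table
SHORTCUTS = {
    "core": ("C", "c"),
    "hdfs": ("H", "h"),
    "mapred": ("M", "m"),
    "yarn": ("Y", "y"),
}


def configure_hadoop_action(key_values=None, config_files=None):
    """Build a bootstrap action entry for the configure-hadoop script."""
    args = []
    # Config files are merged into the existing configuration.
    for keyword, path in (config_files or {}).items():
        args += ["-" + SHORTCUTS[keyword][0], path]
    # Key-value pairs override individual settings.
    for keyword, settings in (key_values or {}).items():
        for key, value in settings.items():
            args += ["-" + SHORTCUTS[keyword][1], f"{key}={value}"]
    return {
        "Name": "Configure Hadoop",
        "ScriptBootstrapAction": {"Path": CONFIGURE_HADOOP, "Args": args},
    }


action = configure_hadoop_action(
    key_values={"yarn": {"yarn.nodemanager.resource.memory-mb": "8192"}},
    config_files={"core": "s3://example-bucket/conf/core-site.xml"},
)
print(action["ScriptBootstrapAction"]["Args"])
```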

Page 37: AWS May Webinar Series - Getting Started with Amazon EMR

Submit work via the EMR Step API or SSH to the EMR master node

EMR Step API

• EMR step can be a map-reduce job, Hive program, Pig script, or even an arbitrary script

• Easily submit Step from console, CLI, or API

• Submit multiple steps to use EMR as a sequential workflow engine

Connect to Master Node

• Connect to HUE, interact with application CLIs, or submit work directly to the Hadoop APIs

• View the Hadoop UI

• Useful for long-running clusters and interactive use cases
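
A sequential workflow on the Step API amounts to an ordered list of step definitions. The sketch below builds two such definitions using the step shape of the EMR API as I understand it (`Name`, `ActionOnFailure`, `HadoopJarStep`). The helper name, JAR locations, and arguments are hypothetical placeholders, and the list is only printed, not submitted.

```python
def jar_step(name, jar, args, action_on_failure="CANCEL_AND_WAIT"):
    """Build one step definition that runs a JAR with the given arguments."""
    return {
        "Name": name,
        # CANCEL_AND_WAIT cancels the remaining steps if this one fails,
        # which suits steps that depend on each other.
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


# Steps run in list order; all paths below are made up.
steps = [
    jar_step("Clean raw logs", "s3://example-bucket/jars/clean.jar",
             ["s3://example-bucket/raw/", "s3://example-bucket/clean/"]),
    jar_step("Aggregate daily report", "s3://example-bucket/jars/report.jar",
             ["s3://example-bucket/clean/", "s3://example-bucket/reports/"]),
]

for position, step in enumerate(steps, start=1):
    print(position, step["Name"])
```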

Page 38: AWS May Webinar Series - Getting Started with Amazon EMR

Let’s see it!

Quick tour of the EMR Console and HUE on an EMR cluster

Page 39: AWS May Webinar Series - Getting Started with Amazon EMR

Diverse set of partners to use with Amazon EMR

BI / Visualization Business Intelligence BI / Visualization BI / Visualization

Hadoop Distribution Data Transfer Data Transformation

Monitoring Performance Tuning Graphical IDE Graphical IDE

Available on AWS Marketplace Available as a distribution in Amazon EMR

ETL Tool

BI / Visualization

Page 40: AWS May Webinar Series - Getting Started with Amazon EMR

Integration with AWS storage and database services

Page 41: AWS May Webinar Series - Getting Started with Amazon EMR

Choose your data stores

Page 42: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon S3 as your persistent data store

Amazon S3

• Designed for 99.999999999% durability

• Separate compute and storage

Resize and shut down Amazon EMR clusters with no data loss

Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)

Page 43: AWS May Webinar Series - Getting Started with Amazon EMR

EMRFS makes it easier to leverage Amazon S3

Better performance and error handling options

Transparent to applications – just read/write to “s3://”

Consistent view

• For consistent list and read-after-write for new puts

Support for Amazon S3 server-side and client-side encryption

Faster listing using EMRFS metadata

Page 44: AWS May Webinar Series - Getting Started with Amazon EMR

Consistent view and fast listing using the optional EMRFS metadata

Amazon S3

EMRFS metadata in Amazon DynamoDB

• List and read-after-write consistency

• Faster list operations

Number of objects | Without Consistent Views | With Consistent Views
1,000,000 | 147.72 | 29.70
100,000 | 12.70 | 3.69

*Tested using a single node cluster with a m3.xlarge instance.
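
The listing figures on this slide are easier to compare as ratios. The short calculation below divides the time without consistent views by the time with it for each object count; the numbers are copied from the slide, and the ratio does not depend on the unit, which the slide text does not state.

```python
# (without consistent views, with consistent views), as listed on the slide
timings = {
    1_000_000: (147.72, 29.70),
    100_000: (12.70, 3.69),
}


def speedup(without_cv, with_cv):
    """Return how many times faster listing is with consistent views."""
    return without_cv / with_cv


speedups = {count: speedup(*pair) for count, pair in timings.items()}
for count, ratio in speedups.items():
    print(f"{count:>9,} objects: {ratio:.2f}x faster")
```

On these figures, listing is roughly 5x faster at one million objects and about 3.4x faster at one hundred thousand.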

Page 45: AWS May Webinar Series - Getting Started with Amazon EMR

EMRFS support for Amazon S3 client-side encryption

• Amazon S3 encryption clients

• EMRFS enabled for Amazon S3 client-side encryption

• Key vendor (AWS KMS or your custom key vendor)

• Amazon S3 (client-side encrypted objects)

Page 46: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon EMR Integration with Amazon Kinesis

• Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams

• No intermediate data persistence required

• Simple way to introduce real-time sources into batch-oriented systems

• Multi-application support and automatic checkpointing

Page 47: AWS May Webinar Series - Getting Started with Amazon EMR

Use Hive with EMR to query data in DynamoDB

• Export data stored in DynamoDB to Amazon S3

• Import data in Amazon S3 to DynamoDB

• Query live DynamoDB data using SQL-like statements (HiveQL)

• Join data stored in DynamoDB and export it or query against the joined data

• Load DynamoDB data into HDFS and use it in your EMR job

Page 48: AWS May Webinar Series - Getting Started with Amazon EMR

Use AWS Data Pipeline and EMR to transform

data and load into Amazon Redshift

Unstructured Data Processed Data

Pipeline orchestrated and scheduled by AWS Data Pipeline

Page 49: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon EMR design patterns

Page 50: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon EMR example #1: Batch processing

• GBs of logs pushed to Amazon S3 hourly

• Daily Amazon EMR cluster using Hive to process data

• Input and output stored in Amazon S3

250 Amazon EMR jobs per day, processing 30 TB of data

http://aws.amazon.com/solutions/case-studies/yelp/

Page 51: AWS May Webinar Series - Getting Started with Amazon EMR

Using Amazon S3 and HDFS

• Data Sources

• Data aggregated and stored in Amazon S3

• Transient EMR cluster for batch map/reduce jobs for daily reports

• Long running EMR cluster holding data in HDFS for Hive interactive queries

• Weekly Report

• Ad-hoc Query


Page 52: AWS May Webinar Series - Getting Started with Amazon EMR

Multiple EMR workflows using the same S3 dataset

• Input Amazon S3 bucket

• S3DistCp

• Intermediate Amazon S3 bucket

• Computations

• Cascalog

• LZO

• Final Amazon S3 bucket (three shown)

Crashlytics (part of Twitter) uses EMR to analyze data in S3 to power dashboards on its Answers platform.

Page 53: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon EMR example #2: Long-running cluster

• Data pushed to Amazon S3

• Daily Amazon EMR cluster: Extract, Transform, and Load (ETL) data into database

• 24/7 Amazon EMR cluster running HBase holds last 2 years’ worth of data

• Front-end service uses HBase cluster to power dashboard with high concurrency

Page 54: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon EMR example #3: Interactive query

• TBs of logs sent daily

• Logs stored in Amazon S3

• Amazon EMR cluster using Presto for ad hoc analysis of entire log set

Interactive query using Presto on multipetabyte warehouse

http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html

Page 55: AWS May Webinar Series - Getting Started with Amazon EMR

EMR example #4: EMR for ETL and query engine for investigations which require all raw data

• TBs of logs sent daily

• Logs stored in S3

• Hourly EMR cluster using Spark for ETL

• Load subset into Redshift DW

• Transient EMR cluster using Spark for ad hoc analysis of entire log set

Page 56: AWS May Webinar Series - Getting Started with Amazon EMR

EMR Example #5: Streaming Data

• Client/Sensor

• Recording Service

• Aggregator/Sequencer

• Continuous Processor

• Data Warehouse

• Analytics and Reporting

Page 57: AWS May Webinar Series - Getting Started with Amazon EMR

Common Tools

• Client/Sensor

• Recording Service

• Aggregator/Sequencer: Kafka

• Continuous Processor

• Data Warehouse

• Analytics and Reporting

Page 58: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon Kinesis

Streaming Data Repository

Page 59: AWS May Webinar Series - Getting Started with Amazon EMR

Amazon Kinesis + Amazon EMR = Fewer Moving Parts

• Client/Sensor

• Recording Service

• Aggregator/Sequencer

• Continuous Processor for Dashboard

• Data Warehouse

• Analytics and Reporting

• Logging: Log4J

• Streaming Data Repository: Amazon Kinesis

• Data Processing: Amazon EMR

Page 60: AWS May Webinar Series - Getting Started with Amazon EMR

Real-time processing with Spark Streaming and batch workloads on Kinesis streams with the Hadoop stack

• Input: Customer Application

• Push with Log4J to Amazon Kinesis

• Amazon EMR (Hive, Pig, Cascading, Spark): pull from Amazon Kinesis

• Amazon DynamoDB

• Processed output in real-time and batch workflows

Page 61: AWS May Webinar Series - Getting Started with Amazon EMR

AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices and new cloud services.

Details

• July 1, 2015

• Chicago, Illinois

• @ McCormick Place

Featuring

• New product launches

• 36+ sessions, labs, and bootcamps

• Executive and partner networking

Registration is now open

• Come and see what AWS and the cloud can do for you.

Page 62: AWS May Webinar Series - Getting Started with Amazon EMR

CTA Script

- If you are interested in learning more about how to navigate the cloud to grow your business, then attend the AWS Summit Chicago, July 1st.

- Register today to learn from technical sessions led by AWS engineers, hear best practices from AWS customers and partners, and participate in some of the 30+ paid sessions and labs.

- Simply go to https://aws.amazon.com/summits/chicago/?trkcampaign=summit_chicago_bootcamps&trk=Webinar_slide to register today.

- Registration is FREE.

TRACKING CODE:

- Listed above.

Page 63: AWS May Webinar Series - Getting Started with Amazon EMR

Thank you!

www.aws.amazon.com/elasticmapreduce

www.blogs.aws.amazon.com/bigdata
