© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Senior Product Manager, Amazon EMR
May 20, 2015
Getting Started with Amazon EMR
Easy, fast, secure, and cost-effective Hadoop on AWS
Agenda
• Is Hadoop the answer?
• Amazon EMR 101
• Integration with AWS storage and database services
• Common Amazon EMR design patterns
• Q+A
Big Data problems are everywhere when leveraging your data to derive new insights:
• Data lacks structure
• Analyzing streams of information
• Processing large datasets
• Warehousing large datasets
• Flexibility for undefined ad hoc analysis
• Speed of queries on large data sets
Hadoop is the right system for Big Data
• Massively parallel
• Scalable and fault tolerant
• Flexibility for multiple languages
and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
Hadoop 2
• Storage: S3, HDFS
• YARN: cluster resource management
• Batch: MapReduce; Interactive: Tez; In memory: Spark
• Applications: Pig, Hive, Cascading, Mahout, Giraph, HBase, Presto, Impala

Hadoop 1
• Storage: S3, HDFS
• Batch: MapReduce
• Applications
Customers across many verticals
Amazon Elastic MapReduce (EMR) is the
easiest way to run Hadoop in the cloud.
Why Amazon EMR?
• Easy to Use: Launch a cluster in minutes
• Low Cost: Pay an hourly rate
• Elastic: Easily add or remove capacity
• Reliable: Spend less time monitoring
• Secure: Manage firewalls
• Flexible: Customize the cluster
Easy to Use: Launch a cluster in minutes
Easy to deploy
AWS Management Console AWS Command Line Interface
You can also use the Amazon EMR API with your favorite SDK
or use AWS Data Pipeline to start your clusters.
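As a sketch of what a CLI launch looks like (cluster name, key pair, applications, and instance counts below are placeholder values, and the flags reflect the 2015-era AWS CLI):

```bash
aws emr create-cluster \
  --name "Getting Started" \
  --ami-version 3.8.0 \
  --applications Name=Hive Name=Pig \
  --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```

The command returns a cluster ID (j-…) that you can use to monitor, resize, or terminate the cluster from the same CLI.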
Try different configurations to find your optimal architecture.
Choose your instance types
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Match the family to the workload: batch processing, machine learning, Spark and interactive analysis, or large HDFS capacity.
Low Cost: Pay an hourly rate
Mix on-demand and EC2 Spot capacity for low costs
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
Meet your SLA at predictable cost, or exceed your SLA at lower cost.
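A rough illustration of the cost math behind mixing the two purchase options (the node counts and hourly prices below are hypothetical examples, not current AWS rates):

```python
# Sketch: blended hourly cost of an EMR cluster that mixes
# on-demand core nodes with Spot task nodes.
# All prices are hypothetical examples, not real AWS rates.

def cluster_hourly_cost(core_nodes, core_price, task_nodes, task_price):
    """Hourly EC2 cost: on-demand core nodes plus Spot task nodes."""
    return core_nodes * core_price + task_nodes * task_price

# 4 on-demand core nodes at $0.42/hr, 8 Spot task nodes at $0.07/hr
mixed = cluster_hourly_cost(4, 0.42, 8, 0.07)

# The same 12 nodes, all at the on-demand rate
all_on_demand = cluster_hourly_cost(4, 0.42, 8, 0.42)

print(f"mixed: ${mixed:.2f}/hr vs all on-demand: ${all_on_demand:.2f}/hr")
# mixed: $2.24/hr vs all on-demand: $5.04/hr
```

Because task nodes carry no HDFS data, losing a Spot task node slows the job but never loses data, which is why the Spot discount goes on the task group.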
Use multiple EMR instance groups
Example Amazon EMR cluster:
• Master node (r3.2xlarge): runs NameNode (HDFS) and ResourceManager (YARN), and serves as a gateway
• Slave group, Core (c3.2xlarge): core nodes run HDFS (DataNode)
• Slave groups, Task (m3.xlarge and m3.2xlarge, EC2 Spot): task nodes do not run HDFS
Core and task nodes each run YARN (NodeManager).
Elastic: Easily add or remove capacity
Resizable clusters
Easy to add and remove compute capacity in your cluster from the console, CLI, or API.
Match compute demands with cluster sizing.
Use Amazon S3 instead of HDFS for your data layer to decouple your compute capacity and storage.
Shut down your EMR clusters when you are not processing data, and stop paying for them!
Reliable: Spend less time monitoring
Easy to monitor and debug
Monitor with Amazon CloudWatch or Ganglia: cluster, node, and I/O metrics.
EMR logging to S3 makes logs easily available for debugging.
Secure: Integrates with AWS security features
Use Identity and Access Management (IAM) roles with
your Amazon EMR cluster
• IAM roles provide fine-grained control over the permissions Amazon EMR uses when calling other AWS services and accessing AWS resources
• EMR uses two IAM roles:
• EMR service role is for the Amazon EMR
control plane
• EC2 instance profile is for the actual
instances in the Amazon EMR cluster
• Default IAM roles can be easily created and
used from the AWS Console and AWS CLI
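From the AWS CLI, creating and using the default roles is a two-command sketch (the role names shown in the comment are the defaults the CLI creates; this assumes the 2015-era CLI):

```bash
# Creates EMR_DefaultRole (service role) and EMR_EC2_DefaultRole
# (EC2 instance profile) if they do not already exist
aws emr create-default-roles

# Reference both defaults at cluster launch
aws emr create-cluster --use-default-roles \
  --name "Cluster with default roles" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3
```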
EMR Security Groups: default and custom
A security group is a virtual firewall which controls access to the EC2 instances in your Amazon EMR cluster
• There is a single default master and default slave security group across all of your clusters
• The master security group has port 22 access for SSHing to your cluster
You can add additional security groups to the master and slave groups on a cluster to separate them from the default master and slave security groups, and further limit ingress and egress policies.
Other Amazon EMR security features
EMRFS encryption options
• S3 server-side encryption
• S3 client-side encryption (use AWS Key Management Service keys or custom keys)
CloudTrail integration
• Track Amazon EMR API calls for auditing
Launch your Amazon EMR clusters in a VPC
• Logically isolated section of the AWS cloud (Amazon Virtual Private Cloud)
• Enhanced networking on certain instance types
Flexible: Customize the cluster
Hadoop applications available in EMR
Use Hive on EMR to interact with your data in HDFS and Amazon S3
• Batch or ad hoc workloads
• Integration with EMRFS for better performance reading and writing to S3
• SQL-like query language to make iterative queries easier
• Schema-on-read to query data without needing pre-processing
• Use the Tez engine for faster queries
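Schema-on-read in practice can be sketched in HiveQL as follows (the table name, columns, and bucket path are hypothetical):

```sql
-- Define a schema over log files already sitting in S3;
-- no load step or pre-processing is required
CREATE EXTERNAL TABLE access_logs (
  request_time STRING,
  user_id      STRING,
  url          STRING,
  status       INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/logs/';

-- Query immediately; reads go through EMRFS
SELECT status, COUNT(*) FROM access_logs GROUP BY status;
```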
Use Pig to easily create ETL workflows
• Uses the high-level "Pig Latin" language to easily script data transformations in Hadoop
• Strong optimizer for workloads
• Options to create robust user-defined functions
Use HBase on a persistent EMR cluster as a scalable NoSQL database
• Billions of rows and millions of columns
• Backup to and restore from Amazon S3
• Flexible datatypes
• Update your HBase tables as you add new data to your system
Impala: a fast SQL query engine for EMR Clusters
• Low-latency SQL query engine for Hadoop
• Perfect for fast ad hoc, interactive queries on structured or unstructured data
• Can be easily installed on an EMR cluster and queried from the CLI or a third-party BI tool
• Perfect for memory optimized instances
• Currently uses HDFS as data layer
Hadoop User Experience (Hue)
• Query editor
• Job browser
• File browser: Amazon S3 and the Hadoop Distributed File System (HDFS)
To install anything else, use Bootstrap Actions
https://github.com/awslabs/emr-bootstrap-actions
Spark: an alternative engine to Hadoop MapReduce with its own ecosystem of applications
• Does not use the MapReduce execution framework
• In-memory processing for fast queries
• Great for machine learning or other iterative workloads
• Use Spark SQL to create a low-latency data warehouse
• Spark Streaming for real-time workloads
Also use Bootstrap Actions to configure your applications:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--<keyword>-config-file (merge values in the new config file into the existing config)
--<keyword>-key-value (override the values provided)

Configuration file   Keyword   File-name shortcut   Key-value shortcut
core-site.xml        core      C                    c
hdfs-site.xml        hdfs      H                    h
mapred-site.xml      mapred    M                    m
yarn-site.xml        yarn      Y                    y
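Putting the keywords above together, a launch that overrides a single yarn-site.xml value might look like this sketch (the memory setting is an illustrative value, not a recommendation, and the flag shape follows the 2015-era CLI):

```bash
aws emr create-cluster \
  --name "Tuned cluster" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
Args=["-y","yarn.nodemanager.resource.memory-mb=8192"]
```

Here "-y" is the yarn-site.xml key-value shortcut from the table above.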
EMR Step API
• An EMR step can be a MapReduce job, Hive program, Pig script, or even an arbitrary script
• Easily submit steps from the console, CLI, or API
• Submit multiple steps to use EMR as a sequential workflow engine
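Submitting a step from the CLI can be sketched as follows (the cluster ID and script path are placeholders):

```bash
# Add a Hive step to a running cluster; queued steps run sequentially
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=HIVE,Name="Daily report",ActionOnFailure=CONTINUE,\
Args=[-f,s3://my-bucket/scripts/daily-report.q]
```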
Submit work via the EMR Step API or SSH to the EMR master node
Connect to the master node
• Connect to Hue, interact with application CLIs, or submit work directly to the Hadoop APIs
• View the Hadoop UI
• Useful for long-running clusters and interactive use cases
Let’s see it!
Quick tour of the EMR Console and Hue on an EMR cluster
Diverse set of partners to use with Amazon EMR, spanning BI/visualization, Hadoop distributions, data transfer, data transformation, ETL tools, monitoring, performance tuning, and graphical IDEs. Partners are available on AWS Marketplace or as a distribution in Amazon EMR.
Integration with AWS storage
and database services
Choose your data stores
Amazon S3 as your persistent data store
• Designed for 99.999999999% durability
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at the same data in Amazon S3 using the EMR File System (EMRFS)
EMRFS makes it easier to leverage Amazon S3
• Better performance and error-handling options
• Transparent to applications: just read/write to "s3://"
• Consistent view: consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
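Consistent view is switched on when the cluster is created; a hedged CLI sketch (the --emrfs option shape follows the EMR CLI of this era, and the cluster parameters are placeholders):

```bash
aws emr create-cluster \
  --name "Consistent EMRFS" \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --emrfs Consistent=true
```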
Consistent view and fast listing using the optional EMRFS metadata
EMRFS metadata in Amazon DynamoDB provides:
• List and read-after-write consistency
• Faster list operations

Number of objects   Without consistent view   With consistent view
1,000,000           147.72                    29.70
100,000             12.70                     3.69

*Tested using a single-node cluster with an m3.xlarge instance.
EMRFS support for Amazon S3 client-side encryption
Amazon EMR clusters with EMRFS enabled for Amazon S3 client-side encryption use Amazon S3 encryption clients to read and write client-side encrypted objects in Amazon S3.
Key vendor: AWS KMS or your custom key vendor.
Amazon EMR integration with Amazon Kinesis
• Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
• No intermediate data persistence required
• Simple way to introduce real-time sources into batch-oriented systems
• Multi-application support and automatic checkpointing
Use Hive on EMR to query data in DynamoDB
• Export data stored in DynamoDB to Amazon S3
• Import data in Amazon S3 to DynamoDB
• Query live DynamoDB data using SQL-like statements (HiveQL)
• Join data stored in DynamoDB and export it, or query against the joined data
• Load DynamoDB data into HDFS and use it in your EMR job
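A hedged HiveQL sketch of the DynamoDB integration (the table names, columns, and bucket path are hypothetical; the storage handler class is the one EMR ships for DynamoDB):

```sql
-- Map a Hive table onto a live DynamoDB table
CREATE EXTERNAL TABLE orders_ddb (order_id STRING, total DOUBLE)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);

-- Export the DynamoDB data to S3 with a single query
INSERT OVERWRITE DIRECTORY 's3://my-bucket/orders-export/'
SELECT * FROM orders_ddb;
```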
Use AWS Data Pipeline and EMR to transform data and load it into Amazon Redshift
Unstructured data is processed by EMR and loaded into Amazon Redshift, with the pipeline orchestrated and scheduled by AWS Data Pipeline.
Amazon EMR design patterns
Amazon EMR example #1: Batch processing
GBs of logs are pushed to Amazon S3 hourly; a daily Amazon EMR cluster uses Hive to process the data, with input and output stored in Amazon S3.
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Using Amazon S3 and HDFS
Data from multiple data sources is aggregated and stored in Amazon S3.
• A transient EMR cluster runs batch map/reduce jobs for daily and weekly reports
• A long-running EMR cluster holds data in HDFS for interactive Hive queries and ad hoc analysis
Multiple EMR workflows using the same S3 dataset
Computations (S3DistCp, Cascalog, LZO-compressed data) read from an input Amazon S3 bucket, write to an intermediate Amazon S3 bucket, and fan out to multiple final Amazon S3 buckets.
Crashlytics (part of Twitter) uses EMR to analyze data in S3 to power dashboards on its Answers platform.
Amazon EMR example #2: Long-running cluster
Data is pushed to Amazon S3. A daily Amazon EMR cluster extracts, transforms, and loads (ETL) the data into a 24/7 Amazon EMR cluster running HBase, which holds the last 2 years' worth of data. A front-end service uses the HBase cluster to power a dashboard with high concurrency.
Amazon EMR example #3: Interactive query
TBs of logs are sent daily and stored in Amazon S3. An Amazon EMR cluster uses Presto for ad hoc analysis of the entire log set.
Interactive query using Presto on a multi-petabyte warehouse:
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
EMR example #4: EMR for ETL and query engine for investigations that require all raw data
TBs of logs are sent daily and stored in S3. An hourly EMR cluster uses Spark for ETL and loads a subset into an Amazon Redshift data warehouse. A transient EMR cluster uses Spark for ad hoc analysis of the entire log set.
EMR example #5: Streaming data
A typical streaming pipeline: client/sensor, recording service, aggregator/sequencer, continuous processor, data warehouse, analytics and reporting.
• With common open source tools, Kafka serves as the streaming data repository between the recording service and the continuous processor.
• With Amazon Kinesis, Kinesis serves as the streaming data repository and feeds the continuous processor for dashboards.
• Amazon Kinesis + Amazon EMR = fewer moving parts: Kinesis handles logging (e.g., via Log4J) and acts as the streaming data repository, while Amazon EMR handles data processing.
Real-time processing with Spark Streaming and batch workloads on Kinesis streams with the Hadoop stack
• Input: a customer application pushes records with Log4J into Amazon Kinesis
• Hive, Pig, and Cascading on Amazon EMR pull from Amazon Kinesis for batch workflows; Spark pulls for real-time processing
• Processed output lands in stores such as Amazon DynamoDB, serving both real-time and batch workflows
AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices, and new cloud services.
Details
• July 1, 2015
• Chicago, Illinois
• At McCormick Place
Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking
Registration is now open
• Come and see what AWS and the cloud can do for you.
CTA Script
- If you are interested in learning more about how to navigate the cloud to grow your business, then attend the AWS Summit Chicago on July 1st.
- Register today to learn from technical sessions led by AWS engineers, hear best practices from AWS customers and partners, and participate in some of the 30+ paid sessions and labs.
- Simply go to https://aws.amazon.com/summits/chicago/?trkcampaign=summit_chicago_bootcamps&trk=Webinar_slide to register today.
- Registration is FREE.