Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch five years ago, our customers have launched more than 15 million Hadoop clusters on EMR. In this webinar, we introduce Amazon EMR design patterns such as using Amazon S3 instead of HDFS and taking advantage of both long- and short-lived clusters, along with other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically, and introduce ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient.
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Amazon Elastic MapReduce:
Deep Dive and Best Practices
Ian Meyers, AWS (meyersi@)
October 29th, 2014
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
What is EMR?
Map-Reduce engine
Vibrant ecosystem
Hadoop-as-a-Service
Massively parallel
Cost-effective AWS wrapper
Integrated with AWS services
Amazon EMR
[Diagram build: the EMR stack, layer by layer — HDFS and EMRFS inside the Amazon EMR cluster; Amazon S3 and Amazon DynamoDB beneath EMRFS; analytics languages and data management on top; Amazon RDS, Amazon Redshift, and AWS Data Pipeline integrated around the cluster]
Amazon EMR Introduction
Launch clusters of any size in a matter of minutes
Use a variety of instance sizes that match your workload
Don't get stuck with hardware
Don't deal with capacity planning
Run multiple clusters with different sizes, specs, and node types
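For illustration, a minimal sketch of launching a small cluster with today's aws CLI (cluster name, key pair, and log bucket are placeholders, and the AMI version is just an era-appropriate example):

# Hypothetical: launch a three-node EMR cluster
aws emr create-cluster \
  --name "dev-cluster" \
  --ami-version 3.3 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --log-uri s3://my-bucket/emr-logs/ \
  --use-default-roles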
Elastic MapReduce & Amazon S3
EMR has an optimised driver for Amazon S3
64 MB range-offset reads to increase performance
EMR Consistent View further increases performance
Addresses S3 consistency
S3 cost: $0.03/GB, with volume-based price tiering
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
Amazon EMR Design Patterns
Pattern #1: Transient vs. Alive Clusters
Pattern #2: Core Nodes and Task Nodes
Pattern #3: Amazon S3 & HDFS
Pattern #1: Transient vs. Alive Clusters
Pattern #1: Transient Clusters
Cluster lives for the duration of the job
Shut down the cluster when the job is done
Data persists on Amazon S3
Input & output data on Amazon S3
Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
• Cluster goes away when job is done
3. Practice cloud architecture
• Pay for what you use
• Data processing as a workflow
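A transient cluster maps directly onto the CLI: submit the work as steps and let the cluster terminate itself when they finish. A minimal sketch, assuming a Hive script on S3 (names and paths are placeholders):

# Hypothetical: run one Hive step, then shut the cluster down automatically
aws emr create-cluster \
  --name "nightly-etl" \
  --ami-version 3.3 \
  --instance-type m3.xlarge --instance-count 5 \
  --ec2-attributes KeyName=my-key --use-default-roles \
  --steps Type=HIVE,Name=etl,Args=[-f,s3://my-bucket/etl.q] \
  --auto-terminate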
Alive Clusters
Very similar to traditional Hadoop deployments
Cluster stays around after the job is done
Data persistence model:
• Amazon S3
• Amazon S3 copied to HDFS
• HDFS, with Amazon S3 as backup
Alive Clusters
Always keep data safe on Amazon S3, even if you're using HDFS for primary storage
Get in the habit of shutting down your cluster and starting a new one, once a week or month
Design your data processing workflow to account for failure
You can use workflow management tools such as AWS Data Pipeline
Pattern #2: Core & Task Nodes
Core Nodes
[Diagram: master instance group and core instance group in an Amazon EMR cluster]
Run TaskTrackers (compute)
Run DataNode (HDFS)
Core Nodes
Can add core nodes
• More HDFS space
• More CPU/memory
Core Nodes
Can't remove core nodes, because of HDFS
Amazon EMR Task Nodes
[Diagram: task instance group alongside the master and core instance groups]
Run TaskTrackers
No HDFS
Reads from core node HDFS
Amazon EMR Task Nodes
Can add task nodes
• More CPU power
• More memory
Amazon EMR Task Nodes
You can remove task nodes when processing is completed
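Resizing is just an API call. A minimal sketch of shrinking a task group to zero once processing is done (the instance group ID is a placeholder; describe-cluster will list the real one):

# Hypothetical: remove all task nodes from a running cluster
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=0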
Task Node Use-Cases
Speed up job processing using the Spot market
• Run task nodes on the Spot market
• Get a discount on the hourly price
• Nodes can come and go without interruption to your cluster
When you need extra horsepower for a short amount of time
• Example: need to pull a large amount of data from Amazon S3
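One way to attach Spot task nodes with the aws CLI, as a hedged sketch (cluster ID and bid price are placeholders):

# Hypothetical: add ten m3.xlarge Spot task nodes to a running cluster
aws emr add-instance-groups \
  --cluster-id j-XXXXXXXXXXXX \
  --instance-groups InstanceGroupType=TASK,InstanceType=m3.xlarge,InstanceCount=10,BidPrice=0.10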
Pattern #3: Amazon S3 & HDFS
Option 1: Amazon S3 as HDFS
Use Amazon S3 as your permanent data store
HDFS for temporary storage of data between jobs
No additional step to copy data to HDFS
[Diagram: Amazon EMR cluster (task and core instance groups with HDFS) reading and writing directly to Amazon S3]
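Because the EMR S3 driver is a Hadoop file system, jobs can name s3:// paths directly as input and output. A minimal streaming sketch (bucket, scripts, and the jar path, which varies by AMI, are all placeholders):

# Hypothetical: read from S3 and write to S3 with no copy step
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input s3://my-bucket/input/ \
  -output s3://my-bucket/output/ \
  -mapper my_mapper.py \
  -reducer my_reducer.py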
Benefits: Amazon S3 as HDFS
Ability to shut down your cluster
• HUGE benefit!!
Use Amazon S3 as your durable storage
• 11 9s of durability
Benefits: Amazon S3 as HDFS
No need to scale HDFS
• Capacity
• Replication for durability
Amazon S3 scales with your data
• Both in IOPS and data storage
Benefits: Amazon S3 as HDFS
Ability to share data between multiple clusters
• Hard to do with HDFS
[Diagram: two EMR clusters sharing one Amazon S3 bucket]
Benefits: Amazon S3 as HDFS
Take advantage of Amazon S3 features
• Amazon S3 server-side encryption
• Amazon S3 lifecycle policies
• Amazon S3 versioning to protect against corruption
Build elastic clusters
• Add nodes to read from Amazon S3
• Remove nodes with data safe on Amazon S3
EMR Consistent View
Provides a 'consistent view' of data on S3 within a cluster
Ensures that all files created by a step are available to subsequent steps
Index of data from S3, managed by DynamoDB
Configurable retry & metastore
New Hadoop config file: emrfs-site.xml
fs.s3.consistent* system properties
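For illustration, a minimal sketch of enabling consistent view at launch via the CLI's --emrfs option (retry values are placeholders; they correspond to the fs.s3.consistent* properties in emrfs-site.xml):

# Hypothetical: create a cluster with EMRFS consistent view on
aws emr create-cluster \
  --name "consistent-cluster" \
  --ami-version 3.3 \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key --use-default-roles \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30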
EMR Consistent View
[Diagram: EMRFS on the Amazon EMR cluster reads file data from Amazon S3 and checks the processed-files registry in Amazon DynamoDB]
EMR Consistent View
Manage data in EMRFS using the emrfs client:
• describe-metadata, set-metadata-capacity, delete-metadata, create-metadata, list-metadata-stores - work with metadata stores
• diff - show what in a bucket is missing from the index
• delete - remove index entries
• sync - ensure that the index is synced with a bucket
• import - import bucket items into the index
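For example, a hypothetical repair session on one bucket prefix (bucket name is a placeholder):

# Hypothetical: report index/bucket drift, then bring the index in sync
emrfs diff s3://my-bucket/data/
emrfs sync s3://my-bucket/data/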
What About Data Locality?
Run your job in the same region as your Amazon S3 bucket
Amazon EMR nodes have high-speed connectivity to Amazon S3
If your job is CPU/memory-bound, locality doesn't make a huge difference
Amazon S3 provides near-linear scalability
S3 Streaming Performance
100 VMs: 9.6 GB/s, $26/hr
350 VMs: 28.7 GB/s, $90/hr
34 secs per terabyte
[Chart: performance & scalability — GB/second vs. reader connections]
When HDFS is a Better Choice…
Iterative workloads
• If you're processing the same dataset more than once
Disk I/O intensive workloads
Option 2: Optimise for Latency with HDFS
1. Data persisted on Amazon S3
2. Launch Amazon EMR and copy data to HDFS with S3DistCp
3. Start processing data on HDFS
Benefits: HDFS instead of S3
Better pattern for I/O-intensive workloads
Amazon S3 as system of record
• Durability
• Scalability
• Cost
• Features: lifecycle policy, security
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
Amazon EMR Nodes and Size
Use m1.small instances for functional testing
Use xlarge+ nodes for production workloads
Use CC2/C3 for memory- and CPU-intensive jobs
HS1, HI1, I2 instances for HDFS workloads
Prefer a smaller cluster of larger nodes
Holy Grail Question
How many nodes do I need?
Instance Resource Allocation
• Hadoop 1 - static number of mappers/reducers configured for the cluster nodes
• Hadoop 2 - variable number of Hadoop applications based on file splits and available memory
• Useful to understand old vs. new sizing
Instance Resources
[Chart: memory (GB), mappers, reducers, CPU (ECU units), and local storage (GB) by instance type]
Cluster Sizing Calculation
1. Estimate the number of tasks your job requires.
2. Pick an instance and note down the number of tasks it can run in parallel.
3. Pick sample data files to run a test workload. The number of sample files should be the same number from step #2.
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
Cluster Sizing Calculation
Estimated number of nodes =
(Total tasks × time to process sample files) / (Instance task capacity × desired processing time)
Example: Cluster Sizing Calculation
1. Estimate the number of tasks your job requires
• 150
2. Pick an instance and note down the number of tasks it can run in parallel
• m1.xlarge with 8 task capacity per instance
3. Pick sample data files to run a test workload; the number of sample files should be the same number from step #2
• 8 files selected for our sample test
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3; note down the time taken
• 3 min to process 8 files
Cluster Sizing Calculation
Estimated number of nodes =
(Total tasks for your job × time to process sample files) / (per-instance task capacity × desired processing time)
= (150 × 3 min) / (8 × 5 min) ≈ 11 m1.xlarge
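As a quick sanity check, the same arithmetic in shell, mirroring the example's inputs:

# Worked example: (150 tasks * 3 min) / (8 tasks per node * 5 min target)
total_tasks=150; sample_time=3
task_capacity=8; target_time=5
echo $(( (total_tasks * sample_time) / (task_capacity * target_time) ))  # prints 11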
File Best Practices
Avoid small files at all costs (smaller than 100 MB)
Use compression
Holy Grail Question
What if I have small file issues?
Dealing with Small Files
Use S3DistCp to combine smaller files together
S3DistCp takes a pattern and a target size to combine smaller input files into larger ones

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,\
--dest,hdfs:///local,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128'
Compression
Always compress data files on Amazon S3
• Reduces bandwidth between Amazon S3 and Amazon EMR
• Speeds up your job
Compress task output
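As a sketch of what that looks like in a Hadoop 1 era job (jar and class names are placeholders; -D overrides require a ToolRunner-based main class):

# Hypothetical: Snappy-compress both map output and final job output
hadoop jar my-job.jar MyJob \
  -D mapred.compress.map.output=true \
  -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  input/ output/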
Compression
Compression types:
• Some are fast BUT offer less space reduction
• Some are space-efficient BUT slower
• Some are splittable and some are not

Algorithm | % Space Remaining | Encoding Speed | Decoding Speed
GZIP      | 13%               | 21 MB/s        | 118 MB/s
LZO       | 20%               | 135 MB/s       | 410 MB/s
Snappy    | 22%               | 172 MB/s       | 409 MB/s
Changing Compression Type
You may decide to change compression type
Use S3DistCp to change the compression type of your files
Example:

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,\
--dest,hdfs:///local,\
--outputCodec,lzo'
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Observations from AWS
M1/C1 Instance Families
Heavily used by EMR customers
However, HDFS utilisation is typically very low
M3/C3 offers better performance/$
M1 vs M3

Instance   | Cost / Map Task | Cost / Reduce Task
m1.large   | $0.08           | $0.15
m1.xlarge  | $0.06           | $0.15
m3.xlarge  | $0.04           | $0.07
m3.2xlarge | $0.04           | $0.07
C1 vs C3

Instance   | Cost / Map Task | Cost / Reduce Task
c1.medium  | $0.13           | $0.13
c1.xlarge  | $0.35           | $0.70
c3.xlarge  | $0.05           | $0.11
c3.2xlarge | $0.05           | $0.11
Orc vs Parquet
File formats designed for SQL/data warehousing on Hadoop
Columnar file formats
Compress well
• High row count, low cardinality
Orc File Format
Optimised Row Columnar format
• Zlib or Snappy external compression
• 250 MB stripe of one column and index
• Run-length or dictionary encoding
• 1 output file per container task
Parquet File Format
• Gzip or Snappy external compression
• Array data structures
• Limited data type support for Hive
• Batch creation
• 1 GB files
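For illustration, a hypothetical pair of Hive statements declaring the same table in each format (table and column names are invented; STORED AS PARQUET requires Hive 0.13+):

# Hypothetical: ORC (Snappy-compressed) and Parquet versions of one table
hive -e "
CREATE TABLE events_orc (user_id BIGINT, event STRING)
  STORED AS ORC TBLPROPERTIES ('orc.compress' = 'SNAPPY');
CREATE TABLE events_parquet (user_id BIGINT, event STRING)
  STORED AS PARQUET;
"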
Orc vs Parquet
Depends on the Tool you are using
Consider Future Architecture & Requirements
Test Test Test
In Summary
• Practice cloud architecture with transient clusters
• Utilize S3 as the system of record for durability
• Utilize task nodes on Spot for increased performance and lower cost
• Move to new instance families for better performance/$
• Exciting developments around columnar file formats