Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 instance storage, Amazon S3, and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop), and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability—at the right cost.
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
DAT203 - AWS Storage and Database
Architecture Best Practices
Siva Raghupathy, Amazon Web Services
The Third Platform
• Built on:
– Mobile devices
– Cloud services
– Social technologies
– Big data
• Billions of users
• Millions of apps
Data Volume, Velocity, Variety
• 2.7 zettabytes (ZB) of data exists in the digital universe today – 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data
Common Questions from Database Developers

Cloud Migration
• How do I move (my data) to the cloud?

Data/Storage Technologies
• What data store should I use?
  – SQL or NoSQL?
  – Hadoop or DW?
  – What about search?

Management Concerns
• Is my data (in the cloud) secure?
• Relational features without management nightmares?
• My data volume, velocity, and variety are exploding!
• How can I reduce cost?

Performance and Delivery
• Need low latency (ms or µs)
• Need high throughput
• Need to ship in days – not years!
Cloud Data Tier Anti-Pattern
[Diagram: an anti-pattern data tier – a single database serving every workload.]
Cloud Data Tier Architecture – Use the Right Tool for the Job!
[Diagram: client tier → app/web tier → data tier, with the data tier composed of cache, SQL, NoSQL, data warehouse, search, Hadoop, ETL, and blob store components.]
[Diagram: the AWS platform – compute, storage, database, networking, and app services running on the AWS global infrastructure, with deployment and administration tooling.]
AWS Managed Database & Storage Services

Structured – Complex Query
• SQL – Amazon RDS (MySQL, Oracle, SQL Server)
• Data warehouse – Amazon Redshift
• Search – Amazon CloudSearch

Unstructured – Custom Query
• Hadoop – Amazon Elastic MapReduce (EMR)

Structured – Simple Query
• NoSQL – Amazon DynamoDB
• Cache – Amazon ElastiCache (Memcached, Redis)

Unstructured – No Query
• Cloud storage – Amazon S3, Amazon Glacier
AWS Primitive Compute and Storage

Compute Capabilities
• Many different EC2 instance types
  – General purpose
  – Compute optimized
  – Storage optimized
  – Memory optimized
• Host any major data storage technology
  – RDBMS
  – NoSQL
  – Cache

Raw Storage Options
• EC2 instance store (ephemeral)
• Amazon Elastic Block Store (EBS)
  – Standard volume: 1 TB, ~100 IOPS per volume
  – Provisioned IOPS volume: 1 TB, up to 4,000 IOPS per volume
  – Stripe multiple volumes for higher IOPS or storage

Primitives add flexibility, but also come with operational burden!
AWS Data Tier Architecture – Use the right tool for the job!

[Diagram: a data tier built from Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon Elastic MapReduce, Amazon Redshift, Amazon S3, and Amazon Glacier, with AWS Data Pipeline moving data between them.]
Reference Architecture

[Diagram: Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon EMR, Amazon Redshift, Amazon S3, and Amazon Glacier composed into one application architecture, with AWS Data Pipeline orchestrating data movement.]
Use Case: A Video Streaming Application
Use Case: A Video Streaming App – Upload

[Diagram: upload flow using Amazon S3, Amazon DynamoDB, Amazon RDS, and Amazon CloudSearch.]
Use Case: A Video Streaming App – Discovery

[Diagram: discovery flow using Amazon CloudFront, Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, and Amazon S3; Amazon Glacier is crossed out of the serving path.]
Use Case: A Video Streaming App – Recommendations

[Diagram: recommendations flow using Amazon EMR over data in Amazon S3 and Amazon Glacier, with results in Amazon DynamoDB.]
Use Case: A Video Streaming App – Analytics

[Diagram: analytics flow using Amazon EMR, Amazon S3, Amazon Glacier, and Amazon Redshift.]
What is the temperature of your data?
Data Characteristics: Hot, Warm, Cold
              Hot         Warm       Cold
Volume        MB–GB       GB–TB      PB
Item size     B–KB        KB–MB      KB–TB
Latency       ms          ms, sec    min, hrs
Durability    Low–High    High       Very High
Request rate  Very High   High       Low
Cost/GB       $$–$        $–¢¢       ¢
[Diagram: services plotted from hot to cold data – Amazon ElastiCache and Amazon DynamoDB (hot), Amazon RDS and Amazon Redshift (warm), Amazon EMR, Amazon S3, and Amazon Glacier (cold). Moving from hot to cold: request rate and cost/GB run from high to low, latency and data volume from low to high, along a structure axis from high to low.]
What data store should I use?

Service              Avg latency            Data volume               Item size           Request rate               Storage cost ($/GB/month)  Durability
Amazon ElastiCache   ms                     GB                        B–KB                Very high                  $$                          Low–moderate
Amazon DynamoDB      ms                     GB–TBs (no limit)         KB (64 KB max)      Very high                  ¢¢                          Very high
Amazon RDS           ms, sec                GB–TB (3 TB max)          KB (~row size)      High                       ¢¢                          High
Amazon CloudSearch   ms, sec                GB–TB                     KB (1 MB max)       High                       $                           High
Amazon Redshift      sec, min               TB–PB (1.6 PB max)        KB (64 K max)       Low                        ¢                           High
Amazon EMR (Hive)    sec, min, hrs          GB–PB (~nodes)            KB–MB               Low                        ¢                           High
Amazon S3            ms, sec, min (~size)   GB–PB (no limit)          KB–GB (5 TB max)    Low–very high (no limit)   ¢                           Very high
Amazon Glacier       hrs                    GB–PB (no limit)          GB (40 TB max)      Very low (no limit)        ¢                           Very high

The columns run from hot data (left) to cold data (right).
AWS Data Tier Architecture – Use the right tool for the job!

[Diagram: the same data tier – Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon Elastic MapReduce, Amazon Redshift, Amazon S3, Amazon Glacier, and AWS Data Pipeline.]
Cost Conscious Design
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

"I'm currently scoping out a project that will greatly increase my team's use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…"

Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
300                         2,048                 1,483                   777,600,000
Amazon S3 or Amazon DynamoDB?

             Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
Scenario 1   300                         2,048                 1,483                   777,600,000
Scenario 2   300                         32,768                23,730                  777,600,000

For Scenario 1 (many small objects), use Amazon DynamoDB; for Scenario 2 (larger objects), use Amazon S3. With small objects, per-request charges dominate the bill; with larger objects, storage dominates.
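The monthly figures follow directly from the steady write rate. A quick sanity check as a minimal Python sketch (the 30-day month is an assumption):

```python
# Back-of-envelope sizing for the S3 vs. DynamoDB example.
# Request rates and object sizes come from the slide; a 30-day
# month is assumed.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def monthly_volume(writes_per_sec: int, object_bytes: int) -> tuple[int, float]:
    """Return (objects per month, GB per month) for a steady write rate."""
    objects = writes_per_sec * SECONDS_PER_MONTH
    gigabytes = objects * object_bytes / 1024**3
    return objects, gigabytes

for label, size in [("Scenario 1", 2048), ("Scenario 2", 32768)]:
    objects, gb = monthly_volume(300, size)
    print(f"{label}: {objects:,} objects, {gb:,.0f} GB/month")
# Scenario 1: 777,600,000 objects, 1,483 GB/month
# Scenario 2: 777,600,000 objects, 23,730 GB/month
```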
Best Practices
When to use
• Transactions
• Complex queries
• Medium to high query/write rate – up to 30K IOPS (15K reads + 15K writes)
• 100s of GB to low TBs
• Workload can fit in a single node
• High durability

When not to use
• Massive read/write rates – example: 150K write requests per second
• Data size or throughput demands sharding – example: 10s or 100s of terabytes
• Simple get/put and queries that a NoSQL store can handle
• Complex analytics
Amazon RDS

[Diagram: Multi-AZ deployment across AZ 1 and AZ 2 within a region, read replicas, and push-button scaling.]
Amazon RDS Best Practices
• Use the right DB instance class
• Use EBS-optimized instances – db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge
• Use provisioned IOPS
• Use Multi-AZ for high availability
• Use read replicas for
  – Scaling reads
  – Schema changes
  – Additional failure recovery
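As a concrete illustration, a minimal sketch of provisioning a Multi-AZ instance with provisioned IOPS and adding a read replica, using the boto3 SDK (which postdates this talk); all identifiers, sizes, and the instance class are placeholders:

```python
# Hedged sketch: Multi-AZ RDS instance with provisioned IOPS plus a
# read replica, via boto3. All names and values are placeholders.
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    DBInstanceClass="db.m5.xlarge",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me",
    AllocatedStorage=500,    # GiB
    StorageType="io1",
    Iops=4000,               # provisioned IOPS for consistent latency
    MultiAZ=True,            # synchronous standby in a second AZ
)

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="app-db-replica-1",
    SourceDBInstanceIdentifier="app-db",  # offload read traffic from the primary
)
```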
When to use
• Fast and predictable performance
• Seamless/massive scale
• Autosharding
• Consistent, low latency
• No size or throughput limits
• Very high durability
• Key-value or simple queries

When not to use
• Need multi-item/row or cross-table transactions
• Need complex queries, joins
• Need real-time analytics on historic data
• Storing cold data

Amazon DynamoDB
Amazon DynamoDB Best Practices
• Keep item size small
• Store metadata in Amazon DynamoDB and large blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use a table per day, week, month, etc. for storing time-series data (see the layout and sketch below)
• Use conditional/OCC updates
• Use a hash-range key to model
  – 1:N relationships
  – Multi-tenancy
• Avoid hot keys and hot partitions
Events_table_2012: Event_id (hash key), Timestamp (range key), Attribute1 … AttributeN

Weekly variants – Events_table_2012_05_week1, Events_table_2012_05_week2, Events_table_2012_05_week3 – each with the same Event_id (hash key), Timestamp (range key) schema.
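A minimal sketch of these two practices combined – routing writes to a per-period table and using a conditional (OCC-style) put – via boto3; the version attribute and helper names are illustrative assumptions, not part of the original deck:

```python
# Hedged sketch: table-per-period writes with a conditional (OCC) put.
# Table naming follows the Events_table layout above; "version" is an
# illustrative assumption.
from datetime import date
import boto3

dynamodb = boto3.resource("dynamodb")

def table_for(day: date):
    # Route writes to the current period's table, e.g. Events_table_2012_05_week1.
    week = (day.day - 1) // 7 + 1
    return dynamodb.Table(f"Events_table_{day:%Y_%m}_week{week}")

def record_event(day: date, event_id: str, timestamp: str, version: int, attrs: dict):
    table_for(day).put_item(
        Item={"Event_id": event_id, "Timestamp": timestamp, "version": version, **attrs},
        # Only write if the item is new or this version is newer (optimistic concurrency).
        ConditionExpression="attribute_not_exists(Event_id) OR version < :v",
        ExpressionAttributeValues={":v": version},
    )

record_event(date(2012, 5, 3), "event-42", "2012-05-03T10:00:00Z", 1, {"Attribute1": "x"})
```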
When to use
• Transient key-value store
• Need to speed up reads/writes
• Caching frequent SQL, NoSQL, or DW query results
• Saving transient and frequently updated data
  – Increment/decrement game scores/counters
  – Web application session storage
• Best-effort deduplication

When not to use
• Storing infrequently used data
• Need persistence

Amazon ElastiCache (Memcached)
Amazon ElastiCache (Memcached) Best Practices
• Use Auto Discovery
• Share memcached client objects in the application
• Use TTLs
• Account for per-connection memory overhead
• Use Amazon CloudWatch alarms / SNS alerts on
  – Number of connections
  – Swap memory usage
  – Freeable memory
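For illustration, a cache-aside read with a TTL, as a minimal sketch assuming the pymemcache client library; the endpoint is a placeholder and load_from_db() is a stand-in for the real backing store:

```python
# Hedged sketch: cache-aside read with a 5-minute TTL, using pymemcache.
# The endpoint is a placeholder for an ElastiCache Memcached node.
from pymemcache.client.hash import HashClient

cache = HashClient([("my-cluster.cfg.use1.cache.amazonaws.com", 11211)])

def load_from_db(user_id: str) -> bytes:
    # Stand-in for the real SQL/NoSQL read.
    return f"user-record-{user_id}".encode()

def get_user(user_id: str) -> bytes:
    key = f"user:{user_id}"
    value = cache.get(key)                 # fast path: cache hit
    if value is None:                      # miss: fall back to the database
        value = load_from_db(user_id)
        cache.set(key, value, expire=300)  # TTL keeps the cache transient
    return value
```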
When to use
• Key-value store with advanced data structures – strings, lists, sets, sorted sets, hashes
• Caching
• Leaderboards
• High-speed sorting
• Atomic counters
• Queuing systems
• Activity streams

When not to use
• Need "native" sharding or scale-out
• Need "hard" persistence
• Data won't fit in memory
• Need transaction rollback, even under exceptions

Amazon ElastiCache (Redis)
Amazon ElastiCache (Redis) Best Practices
• Use TTLs
• Use the right instance types – instances with high ECU/vCPU and network performance yield the highest throughput (e.g., m2.4xlarge, m2.2xlarge)
• Use read replicas
  – Increase read throughput
  – AOF cannot protect against all failure modes
  – Promote a read replica to primary on failure
• Use an RDB file snapshot for on-premises to Amazon ElastiCache migration
• Key parameter group settings
  – Avoid "AOF with fsync always" – huge impact on performance
  – AOF (+ RDB) with fsync everysec – best durability + performance balance
  – Pub/sub: set client-output-buffer-limit-pubsub-hard-limit and client-output-buffer-limit-pubsub-soft-limit based on the workload
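The leaderboard use case above maps directly onto a Redis sorted set; a minimal sketch with the redis-py client, where the endpoint is a placeholder for an ElastiCache Redis node:

```python
# Hedged sketch: leaderboard on a Redis sorted set, using redis-py.
import redis

r = redis.Redis(host="my-redis.use1.cache.amazonaws.com", port=6379)

def add_score(player: str, points: int) -> None:
    r.zincrby("leaderboard", points, player)  # atomic counter increment

def top(n: int = 10):
    # Highest scores first, with scores attached.
    return r.zrevrange("leaderboard", 0, n - 1, withscores=True)
```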
When to use
• No search expertise needed
• Full-text search
• Ranking
• Relevance
• Structured and unstructured data
• Faceting – e.g., $0 to $10 (4 items), $10 and above (3 items)

When not to use
• As a replacement for a database
  – Not as a system of record
  – Transient data
  – Nonatomic updates

Amazon CloudSearch
Amazon CloudSearch Best Practices
• Batch documents for uploading
• Use Amazon CloudSearch for searching and another store for retrieving full records for the UI (i.e., don't use return fields)
• Include other data, like popularity scores, in documents
• Use stop words to remove common terms
• Use fielded queries to reduce match sets
• Query latency is proportional to query specificity
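A hedged sketch of the batching practice using the boto3 cloudsearchdomain client; the document endpoint, field names, and batch size are placeholders:

```python
# Hedged sketch: batch document upload to a CloudSearch domain.
# Endpoint and fields are placeholders.
import json
import boto3

client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-mydomain-xxxx.us-east-1.cloudsearch.amazonaws.com",
)

# One batch of add operations instead of one request per document.
batch = [
    {"type": "add", "id": str(i),
     "fields": {"title": f"Video {i}", "popularity": i % 100}}
    for i in range(1000)
]
client.upload_documents(
    documents=json.dumps(batch),
    contentType="application/json",
)
```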
When to use
• Information analysis and reporting
• Complex DW queries that summarize historical data
• Batched large updates, e.g., daily sales totals
• 10s of concurrent queries
• 100s of GB to PB
• Compression
• Column-based storage
• Very high durability

When not to use
• OLTP workloads
  – 1000s of concurrent users
  – Large numbers of singleton updates

Amazon Redshift
Amazon Redshift Best Practices
• Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
  – Split your data into multiple files
  – Use GZIP or LZOP compression
  – Use a manifest file
• Choose a proper sort key
  – Range or equality predicates in the WHERE clause
• Choose a proper distribution key
  – Join column, foreign key or largest dimension, GROUP BY column
  – Avoid a distribution key for denormalized data
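A minimal sketch of a manifest-driven, compressed COPY issued through the psycopg2 driver; the cluster endpoint, table, bucket, and IAM role are placeholders:

```python
# Hedged sketch: COPY from S3 into Redshift via psycopg2.
# Endpoint, credentials, table, and S3 paths are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="change-me",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY daily_sales
        FROM 's3://my-bucket/sales/manifest'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy'
        MANIFEST GZIP DELIMITER '|';
    """)  # manifest lists the split, compressed input files
```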
When to use
• Batch analytics/processing – answers in minutes or hours
• Structured and unstructured data
• Parallel scans of the entire dataset with uniform query performance
• Supports Hive QL + other languages
• GB, TB, or PB of data
• Replicated data store (HDFS) for ad hoc and real-time queries (HBase)

When not to use
• Real-time analytics (DW) – need answers in seconds
• 1000s of concurrent users

Amazon Elastic MapReduce
Amazon Elastic MapReduce Best Practices
• Choose between transient and persistent clusters for best TCO
• Leverage Amazon S3 integration for highly durable and interim storage
• Right-size cluster instances based on each job – not one size fits all
• Leverage resizing and spot to add and remove capacity cost-effectively
• Tuning cluster instances can be easier than tuning Hadoop code
[Diagram: resizing the cluster shrinks a job flow's duration from 14 hours to 7 hours.]
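The transient-cluster practice can be sketched with boto3's EMR client: the cluster terminates itself after its steps finish, so you pay only for the job's duration. The release label, instance types, script path, and roles below are placeholders:

```python
# Hedged sketch: transient EMR cluster running one Hive step.
# All names, releases, and S3 paths are placeholders.
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.15.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: shut down when done
    },
    Steps=[{
        "Name": "hive-daily-rollup",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-bucket/scripts/rollup.q"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```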
AWS Data Pipeline
When to use
• Automate movement and transformation of data (ETL in the cloud)
• Dependency management
  – Data
  – Control
• Schedule management
• Transient Amazon EMR clusters
• Regular data-movement patterns
  – Every hour or day
  – Every 30 minutes
• Amazon DynamoDB backups – cross-region

When not to use
• Scheduling intervals shorter than 15 minutes
• Execution latency of less than a minute
• Event-based scheduling
AWS Data Pipeline Best Practices
• Use dependency-based rather than time-based triggers
• Make your activities idempotent
• Add your own tools using the shell activity
• Use Amazon S3 for staging
When to use
• Store large objects
• Key-value store – Get/Put/List
• Unlimited storage
• Versioning
• Very high durability – 99.999999999%
• Very high throughput (via parallel clients)
• Storing persistent data
  – Backups
  – Source/target for EMR
  – Blob store, with metadata in SQL or NoSQL

When not to use
• Complex queries
• Very low latency (ms)
• Search
• Read-after-write consistency for overwrites
• Need transactions

Amazon S3
Amazon S3 Best Practices
• Use a random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high-throughput GETs and PUTs
• Leverage the high durability and high throughput of Amazon S3 for backup and as a common storage sink
  – Durable sink between data services
  – Supports decoupling and asynchronous delivery
• Consider RRS for lower-cost, lower-durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range GET for faster reads
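Two of these practices – parallel multipart upload and a random hash prefix on the key – combine in a minimal boto3 sketch; the bucket, key, and threshold values are placeholders:

```python
# Hedged sketch: parallel multipart upload via boto3's managed transfer.
# Bucket and key names are placeholders; "a1b2c3/" is a random hash prefix.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    max_concurrency=16,                    # parallel part uploads
)
s3.upload_file("video.mp4", "my-bucket", "a1b2c3/videos/video.mp4", Config=config)
```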
When to use
• Infrequently accessed data sets
• Very low-cost storage
• Data retrieval times of several hours are acceptable
• Encryption at rest
• Very high durability – 99.999999999%
• Unlimited amount of storage

When not to use
• Frequent access
• Low-latency access

Amazon Glacier
Amazon Glacier Best Practices
• Reduce request and storage costs with aggregation – bundle your files into bigger archives before sending them to Amazon Glacier
• Store checksums along with your files
• Use a format that lets you access individual files within an aggregate archive
• Improve speed and reliability with multipart upload
• Reduce costs with ranged retrievals
• Maintain your own index in a highly durable store
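The aggregation practice in a minimal boto3 sketch: many small files go into one tar archive before upload, and the returned archive ID is kept in your own index. Vault and file names are placeholders:

```python
# Hedged sketch: aggregate small files into one archive before upload.
# Vault name and file paths are placeholders.
import tarfile
import boto3

glacier = boto3.client("glacier")

# Many small files -> one archive; tar preserves per-file access offsets.
with tarfile.open("batch-0001.tar", "w") as tar:
    for name in ["a.log", "b.log", "c.log"]:
        tar.add(name)

with open("batch-0001.tar", "rb") as f:
    resp = glacier.upload_archive(vaultName="my-vault", body=f)

# Persist the archiveId -> contents mapping in a durable index (e.g., DynamoDB).
print(resp["archiveId"])
```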
When to use
• Alternate data store technologies
• Hand-tuned performance needs
• Direct/admin access required

When not to use
• When a managed service will do the job
• When operational experience is low

Amazon EC2 + Amazon EBS/Instance Storage
Amazon EBS Best Practices
• Pick the right EC2 instance type
  – Higher "network performance" instances for driving more Amazon EBS IOPS
  – EBS-optimized EC2 instances for dedicated throughput between EC2 and Amazon EBS
• Use provisioned IOPS volumes for database workloads requiring consistent IOPS
• Use standard volumes for workloads requiring low to moderate IOPS and occasional bursts
• Stripe multiple Amazon EBS volumes for higher IOPS or storage
  – RAID 0 for higher I/O
  – RAID 10 for highest local durability
• Amazon EBS snapshots – quiesce the file system before taking a snapshot
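As a concrete illustration, a minimal boto3 sketch of creating and attaching a provisioned IOPS volume; the availability zone, size, IOPS figure, and instance ID are placeholders:

```python
# Hedged sketch: provisioned IOPS EBS volume, created and attached.
# AZ, size, IOPS, and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2")
vol = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,             # GiB
    VolumeType="io1",     # provisioned IOPS volume type
    Iops=4000,
)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```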
Amazon EC2 Best Practices
• HI1 instances – best IOPS/$
• HS1 instances – best GB/$
• Compute-optimized instances – best vCPU/$
• Memory-optimized instances – best memory (GiB)/$
Summary
Cloud Data Tier Architecture Anti-Pattern
[Diagram: the monolithic data tier, revisited.]

AWS Data Tier Architecture – Use the right tool for the job!
[Diagram: the data tier built from Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon Elastic MapReduce, Amazon Redshift, Amazon S3, Amazon Glacier, and AWS Data Pipeline, revisited.]
Reference Architecture
[Diagram: the reference architecture, revisited – Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon EMR, Amazon Redshift, Amazon S3, Amazon Glacier, and AWS Data Pipeline.]
Cost Conscious Design
Please give us your feedback on this presentation (DAT203). As a thank you, we will select prize winners daily for completed surveys!

Remember…