Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 instance storage, Amazon S3, and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop), and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability—at the right cost.
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
DAT203 - AWS Storage and Database
Architecture Best Practices
Siva Raghupathy, Amazon Web Services
The Third Platform
• Built on:
– Mobile devices
– Cloud services
– Social technologies
– Big data
• Billions of users
• Millions of apps
Data Volume, Velocity, Variety
• 2.7 zettabytes (ZB) of data exists in the digital universe today – 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data
Common Questions from Database Developers

Cloud Migration
• How do I move (my data) to the cloud?

Data/Storage Technologies
• What data store should I use?
  – SQL or NoSQL?
  – Hadoop or DW?
  – What about search?

Management Concerns
• Is my data (in the cloud) secure?
• Relational features without management nightmares?
• My data volume, velocity, and variety are exploding!
• How can I reduce cost?

Performance and Delivery
• Need low latency (ms or µs)
• Need high throughput
• Need to ship in days – not years!
Cloud Data Tier Anti-Pattern
[Diagram: an anti-pattern data tier – a single database serving every workload.]
Cloud Data Tier Architecture – Use the Right Tool for the Job!
[Diagram: client tier → app/web tier → data tier, with the data tier composed of cache, SQL, NoSQL, data warehouse, search, Hadoop, ETL, and blob store components.]
[Diagram: the AWS platform – compute, storage, database, networking, and app services running on the AWS global infrastructure, with deployment and administration tooling.]
AWS Managed Database & Storage Services

Structured – Complex Query
• SQL – Amazon RDS (MySQL, Oracle, SQL Server)
• Data warehouse – Amazon Redshift
• Search – Amazon CloudSearch

Unstructured – Custom Query
• Hadoop – Amazon Elastic MapReduce (EMR)

Structured – Simple Query
• NoSQL – Amazon DynamoDB
• Cache – Amazon ElastiCache (Memcached, Redis)

Unstructured – No Query
• Cloud storage – Amazon S3, Amazon Glacier
AWS Primitive Compute and Storage

Compute Capabilities
• Many different EC2 instance types
  – General purpose
  – Compute optimized
  – Storage optimized
  – Memory optimized
• Host any major data storage technology
  – RDBMS
  – NoSQL
  – Cache

Raw Storage Options
• EC2 instance store (ephemeral)
• Amazon Elastic Block Store (EBS)
  – Standard volume: 1 TB, ~100 IOPS per volume
  – Provisioned IOPS volume: 1 TB, up to 4,000 IOPS per volume
  – Stripe multiple volumes for higher IOPS or storage

Primitives add flexibility, but also come with operational burden!
AWS Data Tier Architecture – Use the right tool for the job!

[Diagram: a data tier built from Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon Elastic MapReduce, Amazon Redshift, Amazon S3, and Amazon Glacier, with AWS Data Pipeline moving data between them.]
Reference Architecture

[Diagram: Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon EMR, Amazon Redshift, Amazon S3, and Amazon Glacier composed into one application architecture, with AWS Data Pipeline orchestrating data movement.]
Use Case: A Video Streaming Application
Use Case: A Video Streaming App – Upload

[Diagram: upload flow using Amazon S3, Amazon DynamoDB, Amazon RDS, and Amazon CloudSearch.]
Use Case: A Video Streaming App – Discovery

[Diagram: discovery flow using Amazon CloudFront, Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, and Amazon S3; Amazon Glacier is crossed out of the serving path.]
Use Case: A Video Streaming App – Recommendations

[Diagram: recommendations flow using Amazon EMR over data in Amazon S3 and Amazon Glacier, with results in Amazon DynamoDB.]
Use Case: A Video Streaming App – Analytics

[Diagram: analytics flow using Amazon EMR, Amazon S3, Amazon Glacier, and Amazon Redshift.]
What is the temperature of your data?
Data Characteristics: Hot, Warm, Cold
              Hot         Warm       Cold
Volume        MB–GB       GB–TB      PB
Item size     B–KB        KB–MB      KB–TB
Latency       ms          ms, sec    min, hrs
Durability    Low–High    High       Very High
Request rate  Very High   High       Low
Cost/GB       $$–$        $–¢¢       ¢
[Diagram: services plotted from hot to cold data – Amazon ElastiCache and Amazon DynamoDB (hot), Amazon RDS and Amazon Redshift (warm), Amazon EMR, Amazon S3, and Amazon Glacier (cold). Moving from hot to cold: request rate and cost/GB run from high to low, latency and data volume from low to high, along a structure axis from high to low.]
What data store should I use?

Service              Avg latency            Data volume               Item size           Request rate               Storage cost ($/GB/month)  Durability
Amazon ElastiCache   ms                     GB                        B–KB                Very high                  $$                          Low–moderate
Amazon DynamoDB      ms                     GB–TBs (no limit)         KB (64 KB max)      Very high                  ¢¢                          Very high
Amazon RDS           ms, sec                GB–TB (3 TB max)          KB (~row size)      High                       ¢¢                          High
Amazon CloudSearch   ms, sec                GB–TB                     KB (1 MB max)       High                       $                           High
Amazon Redshift      sec, min               TB–PB (1.6 PB max)        KB (64 K max)       Low                        ¢                           High
Amazon EMR (Hive)    sec, min, hrs          GB–PB (~nodes)            KB–MB               Low                        ¢                           High
Amazon S3            ms, sec, min (~size)   GB–PB (no limit)          KB–GB (5 TB max)    Low–very high (no limit)   ¢                           Very high
Amazon Glacier       hrs                    GB–PB (no limit)          GB (40 TB max)      Very low (no limit)        ¢                           Very high

The columns run from hot data (left) to cold data (right).
AWS Data Tier Architecture – Use the right tool for the job!

[Diagram: the same data tier – Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon Elastic MapReduce, Amazon Redshift, Amazon S3, Amazon Glacier, and AWS Data Pipeline.]
Cost Conscious Design
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

"I'm currently scoping out a project that will greatly increase my team's use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…"

Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
300                         2,048                 1,483                   777,600,000
Amazon S3 or Amazon DynamoDB?

             Request rate (writes/sec)   Object size (bytes)   Total size (GB/month)   Objects per month
Scenario 1   300                         2,048                 1,483                   777,600,000
Scenario 2   300                         32,768                23,730                  777,600,000

For Scenario 1 (many small objects), use Amazon DynamoDB; for Scenario 2 (larger objects), use Amazon S3. With small objects, per-request charges dominate the bill; with larger objects, storage dominates.
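The monthly figures follow directly from the steady write rate. A quick sanity check as a minimal Python sketch (the 30-day month is an assumption):

```python
# Back-of-envelope sizing for the S3 vs. DynamoDB example.
# Request rates and object sizes come from the slide; a 30-day
# month is assumed.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def monthly_volume(writes_per_sec: int, object_bytes: int) -> tuple[int, float]:
    """Return (objects per month, GB per month) for a steady write rate."""
    objects = writes_per_sec * SECONDS_PER_MONTH
    gigabytes = objects * object_bytes / 1024**3
    return objects, gigabytes

for label, size in [("Scenario 1", 2048), ("Scenario 2", 32768)]:
    objects, gb = monthly_volume(300, size)
    print(f"{label}: {objects:,} objects, {gb:,.0f} GB/month")
# Scenario 1: 777,600,000 objects, 1,483 GB/month
# Scenario 2: 777,600,000 objects, 23,730 GB/month
```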
Best Practices
When to use
• Transactions
• Complex queries
• Medium to high query/write rate – up to 30K IOPS (15K reads + 15K writes)
• 100s of GB to low TBs
• Workload can fit in a single node
• High durability

When not to use
• Massive read/write rates – example: 150K write requests per second
• Data size or throughput demands sharding – example: 10s or 100s of terabytes
• Simple get/put and queries that a NoSQL store can handle
• Complex analytics
Amazon RDS

[Diagram: Multi-AZ deployment across AZ 1 and AZ 2 within a region, read replicas, and push-button scaling.]
Amazon RDS Best Practices
• Use the right DB instance class
• Use EBS-optimized instances – db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge
• Use provisioned IOPS
• Use Multi-AZ for high availability
• Use read replicas for
  – Scaling reads
  – Schema changes
  – Additional failure recovery
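As a concrete illustration, a minimal sketch of provisioning a Multi-AZ instance with provisioned IOPS and adding a read replica, using the boto3 SDK (which postdates this talk); all identifiers, sizes, and the instance class are placeholders:

```python
# Hedged sketch: Multi-AZ RDS instance with provisioned IOPS plus a
# read replica, via boto3. All names and values are placeholders.
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    DBInstanceClass="db.m5.xlarge",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me",
    AllocatedStorage=500,    # GiB
    StorageType="io1",
    Iops=4000,               # provisioned IOPS for consistent latency
    MultiAZ=True,            # synchronous standby in a second AZ
)

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="app-db-replica-1",
    SourceDBInstanceIdentifier="app-db",  # offload read traffic from the primary
)
```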
When to use
• Fast and predictable performance
• Seamless/massive scale
• Autosharding
• Consistent, low latency
• No size or throughput limits
• Very high durability
• Key-value or simple queries

When not to use
• Need multi-item/row or cross-table transactions
• Need complex queries, joins
• Need real-time analytics on historic data
• Storing cold data

Amazon DynamoDB
Amazon DynamoDB Best Practices
• Keep item size small
• Store metadata in Amazon DynamoDB and large blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use a table per day, week, month, etc. for storing time-series data (see the layout and sketch below)
• Use conditional/OCC updates
• Use a hash-range key to model
  – 1:N relationships
  – Multi-tenancy
• Avoid hot keys and hot partitions
Events_table_2012: Event_id (hash key), Timestamp (range key), Attribute1 … AttributeN

Weekly variants – Events_table_2012_05_week1, Events_table_2012_05_week2, Events_table_2012_05_week3 – each with the same Event_id (hash key), Timestamp (range key) schema.
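A minimal sketch of these two practices combined – routing writes to a per-period table and using a conditional (OCC-style) put – via boto3; the version attribute and helper names are illustrative assumptions, not part of the original deck:

```python
# Hedged sketch: table-per-period writes with a conditional (OCC) put.
# Table naming follows the Events_table layout above; "version" is an
# illustrative assumption.
from datetime import date
import boto3

dynamodb = boto3.resource("dynamodb")

def table_for(day: date):
    # Route writes to the current period's table, e.g. Events_table_2012_05_week1.
    week = (day.day - 1) // 7 + 1
    return dynamodb.Table(f"Events_table_{day:%Y_%m}_week{week}")

def record_event(day: date, event_id: str, timestamp: str, version: int, attrs: dict):
    table_for(day).put_item(
        Item={"Event_id": event_id, "Timestamp": timestamp, "version": version, **attrs},
        # Only write if the item is new or this version is newer (optimistic concurrency).
        ConditionExpression="attribute_not_exists(Event_id) OR version < :v",
        ExpressionAttributeValues={":v": version},
    )

record_event(date(2012, 5, 3), "event-42", "2012-05-03T10:00:00Z", 1, {"Attribute1": "x"})
```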
When to use
• Transient key-value store
• Need to speed up reads/writes
• Caching frequent SQL, NoSQL, or DW query results
• Saving transient and frequently updated data
  – Increment/decrement game scores/counters
  – Web application session storage
• Best-effort deduplication

When not to use
• Storing infrequently used data
• Need persistence

Amazon ElastiCache (Memcached)
Amazon ElastiCache (Memcached) Best Practices
• Use Auto Discovery
• Share memcached client objects in the application
• Use TTLs
• Account for per-connection memory overhead
• Use Amazon CloudWatch alarms / SNS alerts on
  – Number of connections
  – Swap memory usage
  – Freeable memory
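For illustration, a cache-aside read with a TTL, as a minimal sketch assuming the pymemcache client library; the endpoint is a placeholder and load_from_db() is a stand-in for the real backing store:

```python
# Hedged sketch: cache-aside read with a 5-minute TTL, using pymemcache.
# The endpoint is a placeholder for an ElastiCache Memcached node.
from pymemcache.client.hash import HashClient

cache = HashClient([("my-cluster.cfg.use1.cache.amazonaws.com", 11211)])

def load_from_db(user_id: str) -> bytes:
    # Stand-in for the real SQL/NoSQL read.
    return f"user-record-{user_id}".encode()

def get_user(user_id: str) -> bytes:
    key = f"user:{user_id}"
    value = cache.get(key)                 # fast path: cache hit
    if value is None:                      # miss: fall back to the database
        value = load_from_db(user_id)
        cache.set(key, value, expire=300)  # TTL keeps the cache transient
    return value
```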
When to use
• Key-value store with advanced data structures – strings, lists, sets, sorted sets, hashes
• Caching
• Leaderboards
• High-speed sorting
• Atomic counters
• Queuing systems
• Activity streams

When not to use
• Need "native" sharding or scale-out
• Need "hard" persistence
• Data won't fit in memory
• Need transaction rollback, even under exceptions

Amazon ElastiCache (Redis)
Amazon ElastiCache (Redis) Best Practices
• Use TTLs
• Use the right instance types – instances with high ECU/vCPU and network performance yield the highest throughput (e.g., m2.4xlarge, m2.2xlarge)
• Use read replicas
  – Increase read throughput
  – AOF cannot protect against all failure modes
  – Promote a read replica to primary on failure
• Use an RDB file snapshot for on-premises to Amazon ElastiCache migration
• Key parameter group settings
  – Avoid "AOF with fsync always" – huge impact on performance
  – AOF (+ RDB) with fsync everysec – best durability + performance balance
  – Pub/sub: set client-output-buffer-limit-pubsub-hard-limit and client-output-buffer-limit-pubsub-soft-limit based on the workload
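The leaderboard use case above maps directly onto a Redis sorted set; a minimal sketch with the redis-py client, where the endpoint is a placeholder for an ElastiCache Redis node:

```python
# Hedged sketch: leaderboard on a Redis sorted set, using redis-py.
import redis

r = redis.Redis(host="my-redis.use1.cache.amazonaws.com", port=6379)

def add_score(player: str, points: int) -> None:
    r.zincrby("leaderboard", points, player)  # atomic counter increment

def top(n: int = 10):
    # Highest scores first, with scores attached.
    return r.zrevrange("leaderboard", 0, n - 1, withscores=True)
```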
When to use
• No search expertise needed
• Full-text search
• Ranking
• Relevance
• Structured and unstructured data
• Faceting – e.g., $0 to $10 (4 items), $10 and above (3 items)

When not to use
• As a replacement for a database
  – Not as a system of record
  – Transient data
  – Nonatomic updates

Amazon CloudSearch
Amazon CloudSearch Best Practices
• Batch documents for uploading
• Use Amazon CloudSearch for searching and another store for retrieving full records for the UI (i.e., don't use return fields)
• Include other data, like popularity scores, in documents
• Use stop words to remove common terms
• Use fielded queries to reduce match sets
• Query latency is proportional to query specificity
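A hedged sketch of the batching practice using the boto3 cloudsearchdomain client; the document endpoint, field names, and batch size are placeholders:

```python
# Hedged sketch: batch document upload to a CloudSearch domain.
# Endpoint and fields are placeholders.
import json
import boto3

client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-mydomain-xxxx.us-east-1.cloudsearch.amazonaws.com",
)

# One batch of add operations instead of one request per document.
batch = [
    {"type": "add", "id": str(i),
     "fields": {"title": f"Video {i}", "popularity": i % 100}}
    for i in range(1000)
]
client.upload_documents(
    documents=json.dumps(batch),
    contentType="application/json",
)
```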
When to use
• Information analysis and reporting
• Complex DW queries that summarize historical data
• Batched large updates, e.g., daily sales totals
• 10s of concurrent queries
• 100s of GB to PB
• Compression
• Column-based storage
• Very high durability

When not to use
• OLTP workloads
  – 1000s of concurrent users
  – Large numbers of singleton updates

Amazon Redshift
Amazon Redshift Best Practices
• Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
  – Split your data into multiple files
  – Use GZIP or LZOP compression
  – Use a manifest file
• Choose a proper sort key
  – Range or equality predicates in the WHERE clause
• Choose a proper distribution key
  – Join column, foreign key or largest dimension, GROUP BY column
  – Avoid a distribution key for denormalized data
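A minimal sketch of a manifest-driven, compressed COPY issued through the psycopg2 driver; the cluster endpoint, table, bucket, and IAM role are placeholders:

```python
# Hedged sketch: COPY from S3 into Redshift via psycopg2.
# Endpoint, credentials, table, and S3 paths are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="change-me",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY daily_sales
        FROM 's3://my-bucket/sales/manifest'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy'
        MANIFEST GZIP DELIMITER '|';
    """)  # manifest lists the split, compressed input files
```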
When to use
• Batch analytics/processing – answers in minutes or hours
• Structured and unstructured data
• Parallel scans of the entire dataset with uniform query performance
• Supports Hive QL + other languages
• GB, TB, or PB of data
• Replicated data store (HDFS) for ad hoc and real-time queries (HBase)

When not to use
• Real-time analytics (DW) – need answers in seconds
• 1000s of concurrent users

Amazon Elastic MapReduce
Amazon Elastic MapReduce Best Practices
• Choose between transient and persistent clusters for best TCO
• Leverage Amazon S3 integration for highly durable and interim storage
• Right-size cluster instances based on each job – not one size fits all
• Leverage resizing and spot to add and remove capacity cost-effectively
• Tuning cluster instances can be easier than tuning Hadoop code
[Diagram: resizing the cluster shrinks a job flow's duration from 14 hours to 7 hours.]
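The transient-cluster practice can be sketched with boto3's EMR client: the cluster terminates itself after its steps finish, so you pay only for the job's duration. The release label, instance types, script path, and roles below are placeholders:

```python
# Hedged sketch: transient EMR cluster running one Hive step.
# All names, releases, and S3 paths are placeholders.
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.15.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: shut down when done
    },
    Steps=[{
        "Name": "hive-daily-rollup",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-bucket/scripts/rollup.q"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```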
AWS Data Pipeline
When to use
• Automate movement and transformation of data (ETL in the cloud)
• Dependency management
  – Data
  – Control
• Schedule management
• Transient Amazon EMR clusters
• Regular data-movement patterns
  – Every hour or day
  – Every 30 minutes
• Amazon DynamoDB backups – cross-region

When not to use
• Scheduling intervals shorter than 15 minutes
• Execution latency of less than a minute
• Event-based scheduling
AWS Data Pipeline Best Practices
• Use dependency-based rather than time-based triggers
• Make your activities idempotent
• Add your own tools using the shell activity
• Use Amazon S3 for staging
When to use
• Store large objects
• Key-value store – Get/Put/List
• Unlimited storage
• Versioning
• Very high durability – 99.999999999%
• Very high throughput (via parallel clients)
• Storing persistent data
  – Backups
  – Source/target for EMR
  – Blob store, with metadata in SQL or NoSQL

When not to use
• Complex queries
• Very low latency (ms)
• Search
• Read-after-write consistency for overwrites
• Need transactions

Amazon S3
Amazon S3 Best Practices
• Use a random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high-throughput GETs and PUTs
• Leverage the high durability and high throughput of Amazon S3 for backup and as a common storage sink
  – Durable sink between data services
  – Supports decoupling and asynchronous delivery
• Consider RRS for lower-cost, lower-durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range GET for faster reads
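Two of these practices – parallel multipart upload and a random hash prefix on the key – combine in a minimal boto3 sketch; the bucket, key, and threshold values are placeholders:

```python
# Hedged sketch: parallel multipart upload via boto3's managed transfer.
# Bucket and key names are placeholders; "a1b2c3/" is a random hash prefix.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    max_concurrency=16,                    # parallel part uploads
)
s3.upload_file("video.mp4", "my-bucket", "a1b2c3/videos/video.mp4", Config=config)
```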
When to use
• Infrequently accessed data sets
• Very low-cost storage
• Data retrieval times of several hours are acceptable
• Encryption at rest
• Very high durability – 99.999999999%
• Unlimited amount of storage

When not to use
• Frequent access
• Low-latency access

Amazon Glacier
Amazon Glacier Best Practices
• Reduce request and storage costs with aggregation – bundle your files into bigger archives before sending them to Amazon Glacier
• Store checksums along with your files
• Use a format that lets you access individual files within an aggregate archive
• Improve speed and reliability with multipart upload
• Reduce costs with ranged retrievals
• Maintain your own index in a highly durable store
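The aggregation practice in a minimal boto3 sketch: many small files go into one tar archive before upload, and the returned archive ID is kept in your own index. Vault and file names are placeholders:

```python
# Hedged sketch: aggregate small files into one archive before upload.
# Vault name and file paths are placeholders.
import tarfile
import boto3

glacier = boto3.client("glacier")

# Many small files -> one archive; tar preserves per-file access offsets.
with tarfile.open("batch-0001.tar", "w") as tar:
    for name in ["a.log", "b.log", "c.log"]:
        tar.add(name)

with open("batch-0001.tar", "rb") as f:
    resp = glacier.upload_archive(vaultName="my-vault", body=f)

# Persist the archiveId -> contents mapping in a durable index (e.g., DynamoDB).
print(resp["archiveId"])
```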
When to use
• Alternate data store technologies
• Hand-tuned performance needs
• Direct/admin access required

When not to use
• When a managed service will do the job
• When operational experience is low

Amazon EC2 + Amazon EBS/Instance Storage
Amazon EBS Best Practices
• Pick the right EC2 instance type
  – Higher "network performance" instances for driving more Amazon EBS IOPS
  – EBS-optimized EC2 instances for dedicated throughput between EC2 and Amazon EBS
• Use provisioned IOPS volumes for database workloads requiring consistent IOPS
• Use standard volumes for workloads requiring low to moderate IOPS and occasional bursts
• Stripe multiple Amazon EBS volumes for higher IOPS or storage
  – RAID 0 for higher I/O
  – RAID 10 for highest local durability
• Amazon EBS snapshots – quiesce the file system before taking a snapshot
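As a concrete illustration, a minimal boto3 sketch of creating and attaching a provisioned IOPS volume; the availability zone, size, IOPS figure, and instance ID are placeholders:

```python
# Hedged sketch: provisioned IOPS EBS volume, created and attached.
# AZ, size, IOPS, and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2")
vol = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,             # GiB
    VolumeType="io1",     # provisioned IOPS volume type
    Iops=4000,
)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```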
Amazon EC2 Best Practices
• HI1 instances – best IOPS/$
• HS1 instances – best GB/$
• Compute-optimized instances – best vCPU/$
• Memory-optimized instances – best memory (GiB)/$
Summary
Cloud Data Tier Architecture Anti-Pattern
[Diagram: the monolithic data tier, revisited.]

AWS Data Tier Architecture – Use the right tool for the job!
[Diagram: the data tier built from Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon Elastic MapReduce, Amazon Redshift, Amazon S3, Amazon Glacier, and AWS Data Pipeline, revisited.]
Reference Architecture
[Diagram: the reference architecture, revisited – Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon EMR, Amazon Redshift, Amazon S3, Amazon Glacier, and AWS Data Pipeline.]
Cost Conscious Design
Please give us your feedback on this presentation (DAT203). As a thank you, we will select prize winners daily for completed surveys!

Remember…