In this session you'll learn about the decisions that went into designing and building DynamoDB, and how it allows you to stay focused on your application while enjoying single digit latencies at any scale. We'll dive deep on how to model data, maintain maximum throughput, and drive analytics against your data, while profiling real world use cases, tips and tricks from customers running on DynamoDB today.
Under the Covers of DynamoDB
Matt Wood
Principal Data Scientist
@mza
Hello.
1. Getting started
2. Data modeling
3. Partitioning
4. Replication & Analytics
Overview
5. Customer story: Localytics
Getting started
1
DynamoDB is a managed
NoSQL database service.
Store and retrieve any amount of data.
Serve any level of request traffic.
Without the operational burden.
Consistent, predictable performance.
Single digit millisecond latency.
Backed by solid-state drives.
Flexible data model.
Key/attribute pairs. No schema required.
Easy to create. Easy to adjust.
Seamless scalability.
No table size limits. Unlimited storage.
No downtime.
Durable.
Consistent, disk-only writes.
Replication across data centers and availability zones.
Without the operational burden.
Focus on your app.
Two decisions + three clicks
= ready for use
Primary keys
Level of throughput
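Those two decisions map directly onto a CreateTable call. A minimal sketch of the request parameters, assuming boto3-style low-level syntax and illustrative table/attribute names (string types assumed):

```python
def create_table_request(table_name, hash_key, range_key=None,
                         read_units=5, write_units=5):
    """Build CreateTable parameters: the key schema (decision one)
    and the provisioned throughput (decision two)."""
    key_schema = [{"AttributeName": hash_key, "KeyType": "HASH"}]
    attr_defs = [{"AttributeName": hash_key, "AttributeType": "S"}]
    if range_key:
        # Composite primary key: hash key plus range key.
        key_schema.append({"AttributeName": range_key, "KeyType": "RANGE"})
        attr_defs.append({"AttributeName": range_key, "AttributeType": "S"})
    return {
        "TableName": table_name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attr_defs,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }

params = create_table_request("scores", "user_id", "game",
                              read_units=50, write_units=10)
```

The dict would be passed to a DynamoDB client's `create_table`; everything else is managed by the service.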
Provisioned throughput.
Reserve IOPS for reads and writes.
Scale up or down at any time.
Pay per capacity unit.
Priced per hour of provisioned throughput.
Write throughput.
Size of item x writes per second
$0.0065 per hour for 10 write units
Consistent writes.
Atomic increment and decrement.
Optimistic concurrency control: conditional writes.
Transactions.
Item level transactions only.
Puts, updates and deletes are ACID.
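An atomic increment guarded by a conditional write can be sketched as UpdateItem parameters. This assumes the current ConditionExpression syntax (the 2013-era API used an `Expected` clause) and illustrative table/attribute names:

```python
def increment_score_request(user_id, game, delta, expected_version):
    """Build UpdateItem parameters: atomically add `delta` to `score`,
    but only if `version` still matches (optimistic concurrency).
    If another writer bumped the version first, the call fails."""
    return {
        "TableName": "scores",
        "Key": {"user_id": {"S": user_id}, "game": {"S": game}},
        "UpdateExpression": "ADD score :d, version :one",
        "ConditionExpression": "version = :v",
        "ExpressionAttributeValues": {
            ":d": {"N": str(delta)},
            ":one": {"N": "1"},
            ":v": {"N": str(expected_version)},
        },
    }
```

On a condition failure the caller re-reads the item and retries, which is the usual optimistic-concurrency loop.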
Read throughput.
Strong or eventual consistency.
Provisioned units = size of item x reads per second.
$0.0065 per hour buys 50 strongly consistent read units,
or 100 eventually consistent read units (half the cost per read).
Same latency expectations.
Mix and match at ‘read time’.
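The formula above can be worked through numerically. A sketch, assuming the 4 KB-per-read-unit granularity of current DynamoDB pricing (treat `kb_per_unit` as an assumption, not a figure from the talk):

```python
import math

def provisioned_read_units(item_size_kb, reads_per_sec,
                           eventually_consistent=False, kb_per_unit=4):
    """Provisioned units = size of item x reads per second.
    Item size rounds up to a whole number of read units."""
    units = math.ceil(item_size_kb / kb_per_unit) * reads_per_sec
    if eventually_consistent:
        # Eventually consistent reads need half the provisioned units.
        units = math.ceil(units / 2)
    return units
```

For example, 100 strongly consistent reads per second of a 4 KB item needs 100 units, but only 50 units if eventual consistency is acceptable.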
Provisioned throughput is
managed by DynamoDB.
Data is partitioned and
managed by DynamoDB.
Indexed data storage.
$0.25 per GB per month.
Tiered bandwidth pricing:
aws.amazon.com/dynamodb/pricing
Reserved capacity.
Save up to 53% with a 1 year reservation.
Save up to 76% with a 3 year reservation.
Authentication.
Session based to minimize latency.
Uses the Amazon Security Token Service.
Handled by AWS SDKs.
Integrates with IAM.
Monitoring.
CloudWatch metrics:
latency, consumed read and write throughput,
errors and throttling.
Libraries, mappers and mocks.
ColdFusion, Django, Erlang, Java, .Net,
Node.js, Perl, PHP, Python, Ruby
http://j.mp/dynamodb-libs
Data modeling
2
Table (each row is an Item; each name = value pair is an Attribute):

id = 100   date = 2012-05-16-09-00-10   total = 25.00
id = 101   date = 2012-05-15-15-00-11   total = 35.00
id = 101   date = 2012-05-16-12-00-10   total = 100.00
id = 102   date = 2012-03-20-18-23-10   total = 20.00
id = 102   date = 2012-03-20-18-23-10   total = 120.00
Where is the schema?
Tables do not require a formal schema.
Items are an arbitrarily sized hash.
Indexing.
Items are indexed by primary and secondary keys.
Primary keys can be composite.
Secondary keys are local to the table.
ID         Date        Total
Hash key   Range key   Secondary range key

Composite primary key = hash key + range key.
Programming DynamoDB.
Small but perfectly formed API.
CreateTable
UpdateTable
DeleteTable
DescribeTable
ListTables
Query
Scan
PutItem
GetItem
UpdateItem
DeleteItem
BatchGetItem
BatchWriteItem
Conditional updates.
PutItem, UpdateItem, DeleteItem can take
optional conditions for operation.
UpdateItem performs atomic increments.
One API call, multiple items
BatchGet returns multiple items by key.
Throughput is measured by IO, not API calls.
BatchWrite performs up to 25 put or delete operations.
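Because BatchWriteItem caps out at 25 operations, larger writes have to be chunked client-side. A minimal sketch, with an illustrative table name:

```python
def batch_write_requests(table, items, batch_size=25):
    """Chunk put requests into BatchWriteItem payloads of at most 25
    operations each (the per-call limit)."""
    batches = []
    for i in range(0, len(items), batch_size):
        chunk = items[i:i + batch_size]
        batches.append({
            "RequestItems": {
                table: [{"PutRequest": {"Item": item}} for item in chunk]
            }
        })
    return batches
```

Each payload would be sent in its own call; a real client also retries any `UnprocessedItems` the response reports.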
Query vs Scan
Query returns items by key.
Scan reads the whole table sequentially.
Query patterns
Retrieve all items by hash key.
Range key conditions:
==, <, >, >=, <=, begins with, between.
Counts. Top and bottom n values.
Paged responses.
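These query patterns translate directly into Query parameters. A sketch of a "top n by range key" request, assuming current KeyConditionExpression syntax and illustrative table/attribute names:

```python
def scores_by_user_request(user_id, limit=10, descending=True):
    """Build Query parameters: every item under one hash key,
    ordered by range key, returning the top `limit` values."""
    return {
        "TableName": "scores",
        "KeyConditionExpression": "user_id = :u",
        "ExpressionAttributeValues": {":u": {"S": user_id}},
        "ScanIndexForward": not descending,  # False = highest values first
        "Limit": limit,  # paging continues via LastEvaluatedKey
    }
```

Range-key conditions (`<`, `between`, `begins_with`, ...) would be appended to the KeyConditionExpression in the same way.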
Mapping relationships.
EXAMPLE 1:
Players

user_id = mza        location = Cambridge   joined = 2011-07-04
user_id = jeffbarr   location = Seattle     joined = 2012-01-20
user_id = werner     location = Worldwide   joined = 2011-05-15

Scores (query for scores by user)

user_id = mza      game = angry-birds   score = 11,000
user_id = mza      game = tetris        score = 1,223,000
user_id = werner   game = bejewelled    score = 55,000

Leader boards (high scores by game)

game = angry-birds   score = 11,000      user_id = mza
game = tetris        score = 1,223,000   user_id = mza
game = tetris        score = 9,000,000   user_id = jeffbarr
Storing large items.
EXAMPLE 2:
Unlimited storage.
Unlimited attributes per item.
Unlimited items per table.
Maximum of 64k per item.
Split across items:

message_id = 1   part = 1   message = <first 64k>
message_id = 1   part = 2   message = <second 64k>
message_id = 1   part = 3   message = <third 64k>

Store a pointer to S3:

message_id = 1   message = http://s3.amazonaws.com...
message_id = 2   message = http://s3.amazonaws.com...
message_id = 3   message = http://s3.amazonaws.com...
Time series data
EXAMPLE 3:
Hot and cold tables.

April (hot):

event_id = 1000   timestamp = 2013-04-16-09-59-01   key = value
event_id = 1001   timestamp = 2013-04-16-09-59-02   key = value
event_id = 1002   timestamp = 2013-04-16-09-59-02   key = value

March (cold):

event_id = 1000   timestamp = 2013-03-01-09-59-01   key = value
event_id = 1001   timestamp = 2013-03-01-09-59-02   key = value
event_id = 1002   timestamp = 2013-03-01-09-59-02   key = value
April March February January December
Archive data.
Move old data to S3: lower cost.
Still available for analytics.
Run queries across hot and cold data
with Elastic MapReduce.
Partitioning
3
Uniform workload.
Data stored across multiple partitions.
Data is primarily distributed by primary key.
Provisioned throughput is divided evenly across partitions.
To achieve and maintain full
provisioned throughput, spread
workload evenly across hash keys.
Non-uniform workload.
Might be throttled, even at high levels of throughput.
Distinct values for hash keys.
BEST PRACTICE 1:
Hash key elements should have a
high number of distinct values.
user_id = mza        first_name = Matt     last_name = Wood
user_id = jeffbarr   first_name = Jeff     last_name = Barr
user_id = werner     first_name = Werner   last_name = Vogels
user_id = simone     first_name = Simone   last_name = Brunozzi
...

Lots of users with unique user_id.
Workload well distributed across hash key.
Avoid limited hash key values.
BEST PRACTICE 2:
Hash key elements should have a
high number of distinct values.
status = 200   date = 2012-04-01-00-00-01
status = 404   date = 2012-04-01-00-00-01
status = 404   date = 2012-04-01-00-00-01
status = 404   date = 2012-04-01-00-00-01

Small number of status codes.
Uneven, non-uniform workload.
Model for even distribution.
BEST PRACTICE 3:
Access by hash key value should be evenly
distributed across the dataset.
mobile_id = 100   access_date = 2012-04-01-00-00-01
mobile_id = 100   access_date = 2012-04-01-00-00-02
mobile_id = 100   access_date = 2012-04-01-00-00-03
mobile_id = 100   access_date = 2012-04-01-00-00-04
...

Large number of devices.
Small number which are much more popular than others.
Workload unevenly distributed.

mobile_id = 100.1   access_date = 2012-04-01-00-00-01
mobile_id = 100.2   access_date = 2012-04-01-00-00-02
mobile_id = 100.3   access_date = 2012-04-01-00-00-03
mobile_id = 100.4   access_date = 2012-04-01-00-00-04
...
Sample access pattern.
Workload randomized by hash key.
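The suffix pattern above ("100.1", "100.2", ...) can be sketched in a few lines. A hedged sketch, with `NUM_SHARDS` and the suffix derivation as assumptions (the talk does not prescribe either):

```python
import hashlib

NUM_SHARDS = 10  # assumption: sized to the per-key write volume

def sharded_hash_key(mobile_id, access_date):
    """Spread one hot hash key across NUM_SHARDS values, e.g. '100.3'.
    Deriving the shard from the range key keeps it deterministic,
    so a single item can still be read back directly."""
    digest = hashlib.md5(access_date.encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"{mobile_id}.{shard}"

def all_shard_keys(mobile_id):
    """Reads for the whole device fan out across every shard key."""
    return [f"{mobile_id}.{s}" for s in range(NUM_SHARDS)]
```

Writes now randomize across NUM_SHARDS partitions; queries for a device issue one Query per shard key and merge the results.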
Replication & Analytics
4
Seamless scale.
Scalable methods for data processing.
Scalable methods for backup/restore.
Amazon Elastic MapReduce.
Managed Hadoop service for
data-intensive workflows.
aws.amazon.com/emr
create external table items_db
  (id string, votes bigint, views bigint)
stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
tblproperties (
  "dynamodb.table.name" = "items",
  "dynamodb.column.mapping" = "id:id,votes:votes,views:views");

select id, votes, views
from items_db
order by views desc;
5
Mohit Dilawari
Director of Engineering
@mdilawari
DynamoDB @ Localytics
About Localytics
• Mobile App Analytics Service
• 750+ Million Devices and over 20,000 Apps
• Customers Include:
…and many more.
About the Development Team
• Small team of four managing entire AWS infrastructure - 100 EC2 Instances
• Experts in BigData
• Leveraging Amazon's services has been the key to our success
• Large scale users of: SQS, S3, ELB, RDS, Route53, ElastiCache, EMR
…and of course DynamoDB
Why DynamoDB?
Set it and Forget it
Our use-case: Dedup Data
• Each datapoint includes a globally unique ID
• Mobile traffic over 2G/3G will upload periodic duplicate data
• We accept data up to a 28 day window
First Design for Dedup table
Unique ID: aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333
Table Name = dedup_table

ID
aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333   ← inserted via "Test and Set" in a single operation
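The "test and set" is a conditional put: insert the ID only if it is not already there, and treat a condition failure as a detected duplicate. A sketch, assuming current ConditionExpression syntax:

```python
def dedup_put_request(unique_id):
    """Build PutItem parameters that succeed only when no item with
    this ID exists yet; DynamoDB rejects the call for a duplicate."""
    return {
        "TableName": "dedup_table",
        "Item": {"ID": {"S": unique_id}},
        "ConditionExpression": "attribute_not_exists(ID)",
    }
```

One round trip answers both questions: "have we seen this ID?" and, if not, "record that we now have".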
Optimization One - Data Aging
• Partition by month
• Create the new table the day before the month begins
• Need to keep two months of data
Optimization One - Data Aging
Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
Check Previous month:
Table Name = March2013_dedup
ID
aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
Not Here!
Optimization One - Data Aging
Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
Test and Set in current month:
Table Name = April2013_dedup

ID
bbbbbbbbbbbbbbbbbbbbbbbbb111111111111111
bbbbbbbbbbbbbbbbbbbbbbbbb222222222222222
bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333   ← inserted
Optimization Two
• Reduce the index size - reduces costs
• Each item has a 100 byte overhead, which is substantial
• Combine multiple IDs together into one record
• Split each ID into two halves: the first half is the key, the second half is added to the set
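The split itself is trivial string slicing. A sketch; the 25-character prefix length is inferred from the IDs shown on the slides and should be treated as an assumption:

```python
def split_id(unique_id, prefix_len=25):
    """Split a unique ID into a shared prefix (the item's hash key)
    and a suffix that is added to that item's set of values."""
    return unique_id[:prefix_len], unique_id[prefix_len:]
```

Many IDs sharing a prefix then collapse into one item, so the 100-byte per-item overhead is paid once per prefix instead of once per ID.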
Optimization Two - Use Sets
Unique ID: ccccccccccccccccccccccccccc999999999999999
Prefix                        Values
aaaaaaaaaaaaaaaaaaaaaaaaa     [111111111111111, 222222222222222, 333333333333333]
bbbbbbbbbbbbbbbbbbbbbbbbb     [444444444444444, 555555555555555, 666666666666666]
ccccccccccccccccccccccccccc   [777777777777777, 888888888888888, 999999999999999]   ← 999999999999999 added to the set
Optimization Three - Combine Months
• Go back to a single table
Prefix           March2013                          April2013
aaaaaaaaaa...    [111111111111111, 22222222222...   [1212121212121212, 3434343434....
bbbbbbbbbb...    [444444444444444, 555555555....    [4545454545454545, 6767676767.....
ccccccccccc...   [777777777777777, 888888888...     [8989898989898989, 1313131313....

One operation:
1. Delete the February2013 field
2. Check the ID in March2013
3. Test and set into April2013
Recap
Compare plans for 20 billion IDs per month:

Plan                   Storage Costs   Read Costs   Write Costs   Total    Savings
Naive (after a year)   $8400           0            $4000         $12400   -
Data Age               $900            $350         $4000         $5250    57%
Using Sets             $150            $350         $4000         $4500    64%
Multiple Months        $150            0            $4000         $4150    67%
Thank You @mdilawari
1. Getting started
2. Data modeling
3. Partitioning
4. Replication & Analytics
Summary
5. Customer story: Localytics
Free tier.
aws.amazon.com/dynamodb