© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH
2014-05-15
Amazon Redshift & Amazon DynamoDB
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year
A fully managed data warehouse service
• Massively parallel relational data warehouse
• Takes care of cluster management and distribution of your data
• Columnar data store with variable compression
• Optimized for complex queries across many large tables
• Use standard SQL & standard BI tools
Amazon DynamoDB
A fully managed fast key-value store
• Fast, predictable performance
• Simple and fast to deploy
• Easy to scale as you go, up to millions of IOPS
• Pay only for what you use: read/write IOPS + storage
• Data is automatically replicated across data centers
Amazon DynamoDB vs. Amazon Redshift

Amazon DynamoDB
• Fast insert & update
• Limited query capability (single table only)
• NoSQL database

Amazon Redshift
• Fast queries
• Flexible queries (JOINs, aggregation functions, …)
• SQL
Queries in Amazon DynamoDB
• Query or BatchGetItem API calls retrieve items
• Scan & filter to comb through a whole table
• You have to join tables in your own code!
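Since DynamoDB has no JOIN, the application must combine items itself. A minimal sketch of a client-side hash join; the table names (`orders`, `customers`) and key attribute are illustrative, not from the talk, and the boto3 fetch is shown only as a comment because it needs live AWS credentials:

```python
# Client-side join of items fetched from two DynamoDB tables.
# Table and attribute names are illustrative examples.

def hash_join(left_items, right_items, key):
    """Join two lists of item dicts on a shared attribute (hash join)."""
    index = {item[key]: item for item in right_items}
    return [
        {**left, **index[left[key]]}
        for left in left_items
        if left[key] in index
    ]

# Fetching the items would look roughly like this with boto3
# (commented out: requires AWS credentials and live tables):
#
#   import boto3
#   from boto3.dynamodb.conditions import Key
#   dynamodb = boto3.resource("dynamodb")
#   orders = dynamodb.Table("orders").query(
#       KeyConditionExpression=Key("customer_id").eq("c-42"))["Items"]
#   customers = dynamodb.Table("customers").scan()["Items"]

orders = [{"customer_id": "c-42", "total": 99}]
customers = [{"customer_id": "c-42", "name": "Alice"}]
print(hash_join(orders, customers, "customer_id"))
```

The join runs entirely in application memory, which is exactly the limitation the following slides work around.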
Queries in Amazon DynamoDB (2)
• Apache Hive on Amazon EMR can access data in DynamoDB
• Run HiveQL queries for bulk processing
• Can integrate data in HDFS, Amazon S3, …
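Hive on EMR sees a DynamoDB table through an external-table definition backed by EMR's DynamoDB storage handler. A sketch that builds such a DDL statement; the table name (`events`) and column mapping are illustrative assumptions:

```python
# Build the HiveQL DDL that exposes a DynamoDB table to Hive on EMR.
# Table name and columns are illustrative; the storage handler class is
# the one shipped with Amazon EMR.

def hive_ddl_for_dynamodb(hive_table, dynamo_table, column_mapping):
    """column_mapping: list of (hive_column, hive_type, dynamodb_attribute)."""
    cols = ",\n  ".join(f"{col} {typ}" for col, typ, _ in column_mapping)
    mapping = ",".join(f"{col}:{attr}" for col, _, attr in column_mapping)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} (\n  {cols}\n)\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        "TBLPROPERTIES (\n"
        f'  "dynamodb.table.name" = "{dynamo_table}",\n'
        f'  "dynamodb.column.mapping" = "{mapping}"\n'
        ");"
    )

ddl = hive_ddl_for_dynamodb(
    "events_hive", "events",
    [("video_id", "string", "videoId"), ("plays", "bigint", "plays")])
print(ddl)
```

Once the external table exists, ordinary HiveQL (including joins against HDFS or S3 data) runs against it.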
Queries in Amazon DynamoDB (3)
• Import data into Amazon Redshift
• Use SQL queries, BI tools, etc.
• Powerful analytics and aggregation functions
Importing Data into Amazon Redshift
TMTOWTDI – There's More Than One Way To Do It
Query & Insert
[Diagram: the application #1 issues Query / BatchQuery calls against Amazon DynamoDB, #2 retrieves the items, and #3 runs INSERT … INTO (…) statements against Amazon Redshift]
Query & Insert

The Good
• Full control over queries
• Decide which items you want to move to Redshift
• Process data on the way

The Bad
• Slow
• Inefficient on the Redshift side of things
• Does not scale well
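One reason Query & Insert is inefficient on the Redshift side is that single-row INSERTs are expensive; batching many items into one multi-row INSERT helps, though it still cannot match COPY. A sketch, with illustrative table and column names and deliberately naive escaping:

```python
# Turn a batch of DynamoDB items into one multi-row INSERT statement.
# Redshift handles one large INSERT far better than thousands of small
# ones, although COPY remains the fast path. Names are illustrative.

def multi_row_insert(table, columns, items):
    def literal(value):
        if value is None:
            return "NULL"
        if isinstance(value, (int, float)):
            return str(value)
        return "'" + str(value).replace("'", "''") + "'"  # naive escaping

    rows = ", ".join(
        "(" + ", ".join(literal(item.get(col)) for col in columns) + ")"
        for item in items
    )
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {rows};"

stmt = multi_row_insert(
    "plays", ["video_id", "count"],
    [{"video_id": "v1", "count": 3}, {"video_id": "v2", "count": 5}])
print(stmt)
```

In production code the statement would be executed through a parameterized driver call rather than string literals, but the batching idea is the same.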
The COPY Command
[Diagram: Amazon Redshift receives #1 COPY FROM …, then #2 politely asks for the whole table by running parallel Scans against the Amazon DynamoDB table, and #3 the items are returned to the Redshift cluster]
The COPY Command
• COPY a single table at a time, from one Amazon DynamoDB table into one Amazon Redshift table
• Fast – executed in parallel on all data nodes in the Amazon Redshift cluster
• Can be limited to use a certain percentage of provisioned throughput on the DynamoDB table
The COPY Command

COPY <table_name> (col1, col2, …)
FROM 'dynamodb://<table_name2>'
CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
READRATIO 10  -- use 10% of available read capacity
COMPROWS 0    -- how many rows to read to determine compression
[…other options…]
The COPY Command
• Attributes are mapped to columns by name
• Case of column names is ignored
• Attributes that do not map are ignored
• Missing attributes are stored as NULL or empty values
• Only works for STRING and NUMBER attributes
The COPY Command

The Good
• Easy to use
• Fast
• Efficient use of resources
• Scales linearly with cluster size
• Only uses a certain percentage of read throughput

The Bad
• Whole tables only
• No processing in between
• Can only copy from DynamoDB in the same region
• Only works with STRING and NUMBER types
Query & Insert at Scale
[Diagram: Amazon EMR workers #1 issue Query / BatchQuery calls against Amazon DynamoDB in parallel, #2 retrieve the items, and #3 run INSERT … INTO (…) statements against Amazon Redshift in parallel]
Query & Import using Amazon EMR
[Diagram: Amazon EMR workers #1 issue Query / BatchQuery calls against Amazon DynamoDB in parallel, #2 retrieve the items, and #3 export them to file(s) on Amazon S3; Amazon Redshift then #4 runs COPY … FROM s3:// and #5 retrieves the files]
Query & Import using Amazon EMR
[Diagram: Amazon EMR workers #1 issue Query / BatchQuery calls against Amazon DynamoDB in parallel and #2 retrieve the items; Amazon Redshift then #3 runs COPY … FROM emr:// and #4 retrieves the files from HDFS]
Query & Import using Amazon EMR

The Good
• Decide which items you want to move to Redshift
• Full control over queries
• Process data on the way
• Scales well
• Integrates with other data sources easily

The Bad
• Additional complexity
• Additional cost (for EMR)
• Slower than direct COPY from Amazon DynamoDB
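The S3 variant of this pipeline boils down to two steps: serialize the selected items to delimited files, then issue a COPY pointing at them. A sketch with illustrative bucket, prefix, and column names:

```python
# Serialize DynamoDB items to pipe-delimited lines destined for S3, and
# build the COPY statement Redshift would run against those files.
# Bucket, prefix, and column names are illustrative.

def to_delimited_lines(items, columns, delimiter="|"):
    """Flatten items to one delimited line each, in column order."""
    return "\n".join(
        delimiter.join(str(item.get(col, "")) for col in columns)
        for item in items
    )

def copy_from_s3(table, s3_path, delimiter="|"):
    """COPY statement loading all files under the given S3 prefix."""
    return (
        f"COPY {table} FROM '{s3_path}'\n"
        "CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'\n"
        f"DELIMITER '{delimiter}';"
    )

lines = to_delimited_lines(
    [{"video_id": "v1", "plays": 3}], ["video_id", "plays"])
print(lines)
print(copy_from_s3("plays", "s3://my-bucket/exports/2014-05-15/"))
```

Because the EMR workers do the selection and transformation before writing to S3, this route lifts the "whole tables only" and "no processing in between" restrictions of the direct COPY, at the cost of the extra hop.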
Please welcome Erez Hadas-Sonnenschein, Sr. Product Manager, and Witali Stohler, Data Warehouse & BI Specialist
clipkit GmbH
Video Syndication – The Possibilities
News, sports, cars/motor, business/finances, music, gaming, cinema, cooking/food, lifestyle/fashion, traveling, computer/mobile, fitness/wellness, knowledge/hobby, entertainment
Content – Partner Overview
clipkit Player – Analytics (Metrics)
[Player screenshot, annotated with the tracked events: Full Screen, Category, Playlist Pos., Play / Pause, Progress Pos., Mute / Unmute, Volume]
clipkit Player – Analytics (Metrics)
• Location (country, city)
• Language
• Browser
• Operating system
• Video ID
• Publisher URL
• Etc.
First Implementation (Expensive and Slow)
• Designed in the early days, not sized for this amount of data
• Slow copy process from S3 to the DB (old PHP application architecture)
• Fixed EC2 price (expensive to support peak hours)
• PostgreSQL scalability limitations
• Sometimes the copy process was so slow that the delay was ~3 days
Analytics / Metrics (Requests Graph)
[Graph: requests per second over the course of a day – 4,000% growth in requests during the day]

Analytics / Metrics (Numbers)
• ~6,000,000 new entries per day
• ~1,000 requests per second (peak hours)
• ~25 requests per second (off-peak hours)
Second Implementation (Expensive and Slow)
• Inserting only into one (big) table
• The COPY command only works for whole tables
• The minimum delay was one day
• Our solution had to increase the provisioned throughput, and that was expensive
• NO REAL-TIME DATA
Third Implementation (Cheap and Fast)
Third Implementation – DynamoDB
• Java SDK: AmazonDynamoDBAsyncClient (fire and forget)
• Easy to create and delete tables
• Write latency ~5 ms
• Throughput auto-scales with Dynamic DynamoDB
• One table per day
• Continuous iteration and copy to Redshift
• We just pay for what we use
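The one-table-per-day pattern above can be sketched as follows; the table-name format ("metrics_YYYY_MM_DD") and the Redshift target table are illustrative assumptions, and the COPY options mirror the command shown earlier in the deck:

```python
# Compute the per-day DynamoDB table name and the COPY statement that
# pulls it into Redshift, as in clipkit's daily-table pattern.
# The naming scheme is an illustrative assumption.

from datetime import date

def daily_table_name(day):
    """DynamoDB table dedicated to a single day's metrics."""
    return f"metrics_{day:%Y_%m_%d}"

def copy_daily_table(day, read_ratio=10):
    """COPY statement importing one day's table into Redshift."""
    table = daily_table_name(day)
    return (
        f"COPY metrics FROM 'dynamodb://{table}'\n"
        "CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'\n"
        f"READRATIO {read_ratio};"
    )

print(daily_table_name(date(2014, 5, 15)))  # metrics_2014_05_15
print(copy_daily_table(date(2014, 5, 15)))
```

Because each table holds exactly one day of data, the "whole tables only" limitation of COPY stops being a problem, and the table can simply be deleted after it has been copied.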
Third Implementation – Redshift
• Standard PostgreSQL JDBC driver
• Fully managed by Amazon
• Automated backups and fast restores
• ~7,000 items inserted per second
• Queries over >1 billion entries in less than 2 seconds
• Data available in near real time (maximum 1 minute delay)
Third Implementation – Conclusions
• Java web application
  – Auto-scales (off-peak: 1 small instance)
• DynamoDB
  – One table per day (deleted after it has been copied)
  – Auto-scales
  – ~5 ms PutItem latency
• Redshift
  – Inserts ~7,000 items per second
  – Fully managed
Thank You!