Amazon Redshift & Amazon DynamoDB
Michael Hanisch, Amazon Web Services
Erez Hadas-Sonnenschein, clipkit GmbH
Witali Stohler, clipkit GmbH
2014-05-15

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.




Amazon Redshift


Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift


Amazon Redshift

A fully managed data warehouse service
•  Massively parallel relational data warehouse
•  Takes care of cluster management and distribution of your data
•  Columnar data store with variable compression
•  Optimized for complex queries across many large tables
•  Use standard SQL & standard BI tools


Amazon DynamoDB


A fully managed fast key-value store
•  Fast, predictable performance
•  Simple and fast to deploy
•  Easy to scale as you go, up to millions of IOPS
•  Pay only for what you use: read/write IOPS + storage
•  Data is automatically replicated across data centers

Amazon DynamoDB


Amazon DynamoDB
•  Fast insert & update
•  Limited query capability (single table only)
•  NoSQL database

Amazon Redshift
•  Fast queries
•  Flexible queries (JOINs, aggregation functions, …)
•  SQL


Queries in Amazon DynamoDB


Queries in Amazon DynamoDB
•  Query or BatchGetItem APIs retrieve items
•  Scan & filter to comb through a whole table
•  You have to join tables in your own code!

Amazon DynamoDB
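
Joining "in your own code" can be sketched as follows. This is only an illustration, assuming the items from two tables have already been retrieved into plain Python dictionaries; the attribute names (`video_id`, `title`, `event`) and the sample items are hypothetical, and no AWS call is made.

```python
from collections import defaultdict

def hash_join(left_items, right_items, key):
    """Join two lists of items (dicts) on a shared attribute, in application code."""
    # Build an index over the right-hand items, keyed by the join attribute.
    index = defaultdict(list)
    for item in right_items:
        if key in item:
            index[item[key]].append(item)
    # Probe the index with every left-hand item and merge matching pairs.
    joined = []
    for item in left_items:
        for match in index.get(item.get(key), []):
            joined.append({**match, **item})
    return joined

# Hypothetical items, shaped like the results of two separate table reads.
videos = [
    {"video_id": "v1", "title": "Goal of the week"},
    {"video_id": "v2", "title": "Pasta in 10 minutes"},
]
events = [
    {"video_id": "v1", "event": "play"},
    {"video_id": "v1", "event": "pause"},
    {"video_id": "v3", "event": "play"},
]

joined = hash_join(events, videos, "video_id")
for row in joined:
    print(row)
```

Every such join costs application code and read capacity, which is part of the motivation for moving analytical queries into Amazon Redshift.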


Queries in Amazon DynamoDB (2)
•  Apache Hive on Amazon EMR can access data in DynamoDB
•  Run HiveQL queries for bulk processing
•  Can integrate data in HDFS, Amazon S3, …

Amazon DynamoDB HiveQL queries on Amazon EMR


Queries in Amazon DynamoDB (3)
•  Import data into Amazon Redshift
•  Use SQL queries, use BI tools etc.
•  Powerful analytics and aggregation functions

Amazon Redshift Amazon DynamoDB


Importing Data into Amazon Redshift


TMTOWTDI (There's more than one way to do it)


Query & Insert

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchGetItem

#2 Retrieve Items

#3 INSERT … INTO (…)


Query & Insert

The Good
•  Full control over queries
•  Decide which items you want to move to Redshift
•  Process data on the way

The Bad
•  Slow
•  Inefficient on the Redshift side of things
•  Does not scale well


The COPY Command

Amazon Redshift

Amazon DynamoDB

#1 COPY FROM …

#2 Politely ask for a table

#3 Return whole table


The COPY Command

Amazon Redshift

Amazon DynamoDB

#1 COPY FROM …

#2 Parallel Scans


The COPY Command

Amazon Redshift

Amazon DynamoDB

#1 COPY FROM …

#3 Return Items


The COPY Command
•  COPY a single table at a time
•  From one Amazon DynamoDB table into one Amazon Redshift table
•  Fast – executed in parallel on all data nodes in the Amazon Redshift cluster
•  Can be limited to use a certain percentage of provisioned throughput on the DynamoDB table


The COPY Command

COPY <table_name> (col1, col2, …)
FROM 'dynamodb://<table_name2>'
CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
READRATIO 10 -- use 10% of available read capacity
COMPROWS 0   -- how many rows to read to determine compression
[…other options…]
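
As a rough sketch of how an application might assemble this statement before sending it over a database connection, the helper below builds the text of a COPY command. The function name `build_copy_statement`, the table names, and the identifier checks are assumptions made for illustration. The credentials string is passed through untouched as a placeholder, and the 1 to 200 range check reflects my understanding of the documented READRATIO limits rather than anything shown on the slide.

```python
import re

# Conservative patterns for illustration; real naming rules are broader.
_SQL_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_DYNAMO_NAME = re.compile(r"[A-Za-z0-9_.-]+")

def build_copy_statement(target_table, columns, dynamodb_table, credentials, readratio=10):
    """Assemble a COPY ... FROM 'dynamodb://...' statement as a string."""
    for name in [target_table, *columns]:
        if not _SQL_IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if not _DYNAMO_NAME.fullmatch(dynamodb_table):
        raise ValueError(f"unsafe DynamoDB table name: {dynamodb_table!r}")
    # READRATIO is a percentage of the table's provisioned read throughput.
    if not 1 <= readratio <= 200:
        raise ValueError("readratio must be between 1 and 200")
    return (
        f"COPY {target_table} ({', '.join(columns)})\n"
        f"FROM 'dynamodb://{dynamodb_table}'\n"
        f"CREDENTIALS '{credentials}'\n"
        f"READRATIO {readratio};"
    )

# Hypothetical table names; the credentials placeholder is left unfilled on purpose.
sql = build_copy_statement(
    "events", ["video_id", "event"], "events_2014_05_15", "<credentials string>"
)
print(sql)
```

Keeping READRATIO low leaves read capacity for the application that is still using the DynamoDB table while the copy runs.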


The COPY Command
•  Attributes are mapped to columns by name
•  Case of column names is ignored
•  Attributes that do not map are ignored
•  Missing attributes are stored as NULL or empty values
•  Only works for STRING and NUMBER attributes
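
The mapping rules can be imitated in a few lines of Python. This is a local simulation only, not how Amazon Redshift implements COPY. The function name `map_item_to_row` and the sample item are hypothetical, and raising an error for values that are neither strings nor numbers is a choice made for this sketch.

```python
def map_item_to_row(item, columns):
    """Map one DynamoDB-style item (attribute name -> value) onto a column list."""
    # Attribute names are matched to column names without regard to case.
    by_lower = {name.lower(): value for name, value in item.items()}
    row = []
    for column in columns:
        # A missing attribute yields None, standing in for NULL.
        value = by_lower.get(column.lower())
        is_scalar = isinstance(value, (str, int, float)) and not isinstance(value, bool)
        if value is not None and not is_scalar:
            raise TypeError(f"attribute for column {column!r} is not a string or number")
        row.append(value)
    # Attributes without a matching column are never looked at, so they are ignored.
    return tuple(row)

# Hypothetical item: "Extra" has no column, and "country" has no attribute.
item = {"VideoId": "v1", "Volume": 80, "Extra": "not a column"}
row = map_item_to_row(item, ["videoid", "VOLUME", "country"])
print(row)
```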


The COPY Command

The Good
•  Easy to use
•  Fast
•  Efficient use of resources
•  Scales linearly with cluster size
•  Only uses certain percentage of read throughput

The Bad
•  Whole tables only
•  No processing in between
•  Can only copy from DynamoDB in same region
•  Only works with STRING and NUMBER types


Query & Insert at Scale

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchGetItem

#2 Retrieve Items

#3 INSERT … INTO (…) in parallel in parallel


Amazon EMR

Query & Insert at Scale

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchGetItem

#2 Retrieve Items

#3 INSERT … INTO (…) in parallel in parallel



Amazon EMR

Query & Import using Amazon EMR

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchGetItem

#2 Retrieve Items

in parallel

Amazon S3

#3 Export to file(s) on S3

#4 COPY … FROM s3://

#5 Retrieve files


Amazon EMR

Query & Import using Amazon EMR

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchGetItem

#2 Retrieve Items

in parallel

#3 COPY … FROM emr://

#4 Retrieve files from HDFS


Query & Import using Amazon EMR

The Good
•  Decide which items you want to move to Redshift
•  Full control over queries
•  Process data on the way
•  Scales well
•  Integrates with other data sources easily

The Bad
•  Additional complexity
•  Additional cost (for EMR)
•  Slower than direct COPY from Amazon DynamoDB


Please welcome
Erez Hadas-Sonnenschein, Sr. Product Manager
Witali Stohler, Datawarehouse & BI Specialist

clipkit GmbH


Video Syndication – The Possibilities


News, Sports, Cars/motor, Business/finances, Music, Gaming, Cinema, Cooking/food, Lifestyle/fashion, Traveling, Computer/mobile, Fitness/wellness, Knowledge/hobby, Entertainment

Content – Partner Overview


clipkit Player – Analytics (Metrics)

Full Screen

Category

Playlist Pos.

Play / Pause

Progress Pos. Mute / Unmute Volume


clipkit Player – Analytics (Metrics)
•  Location (Country, City)
•  Language
•  Browser
•  Operating System
•  Video Id
•  Publisher URL
•  Etc…


First Implementation (Expensive and Slow)

•  Designed in the early days
•  Not sized for this amount of data
•  Slow copy process from S3 to the DB (PHP application, old architecture)
•  Fixed EC2 price (expensive to support peak hours)
•  PostgreSQL scalability limitations
•  Sometimes the copy process was so slow that the delay was ~3 days


Analytics / Metrics (Requests Graph)


Analytics / Metrics (Numbers)
•  ~6,000,000 new entries per day
•  ~1,000 requests per second (peak hours)
•  ~25 requests per second (off-peak hours)

The peak request rate is about 40 times the off-peak rate (4000%).
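
The figures on this slide can be cross-checked with a little arithmetic. The sketch below only restates the slide's approximate numbers. Whether one "entry" corresponds to one "request" is not stated in the deck, so the daily average is shown purely for scale.

```python
# Approximate figures taken from the slide.
PEAK_RPS = 1_000
OFF_PEAK_RPS = 25
ENTRIES_PER_DAY = 6_000_000

# Ratio between the busiest and the quietest request rate.
peak_to_off_peak = PEAK_RPS / OFF_PEAK_RPS

# Average entries per second if the daily total were spread evenly.
average_per_second = ENTRIES_PER_DAY / (24 * 60 * 60)

print(f"peak is {peak_to_off_peak:.0f}x off-peak ({peak_to_off_peak:.0%})")
print(f"average rate: {average_per_second:.1f} entries per second")
```

A 40-fold swing between off-peak and peak is what makes fixed-size capacity expensive: it has to be sized for the peak and sits mostly idle the rest of the day.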


Second Implementation (Expensive and Slow)

•  Inserting into only one (big) table
•  The COPY command only works for whole tables
•  The minimum delay was one day
•  Our workaround was to increase the provisioned throughput, and that was expensive

NO REAL-TIME DATA


Third Implementation (Cheap and Fast)


Third Implementation – DynamoDB
•  Java SDK: AmazonDynamoDBAsyncClient (fire and forget)
•  Easy to create and delete tables
•  Write latency ~5 ms
•  Throughput auto-scales with Dynamic DynamoDB
•  One table per day
•  Continuous iteration and copy to Redshift
•  We pay only for what we use
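
The "one table per day" pattern might be organized along these lines. This is a sketch under assumptions, not clipkit's actual code: the table prefix, the date format in the table name, and the rule that yesterday's table is the one to finish copying and then delete are all hypothetical.

```python
from datetime import date, timedelta

TABLE_PREFIX = "player_events_"  # hypothetical naming scheme

def table_for(day):
    """Name of the DynamoDB table that holds the entries for one calendar day."""
    return f"{TABLE_PREFIX}{day:%Y_%m_%d}"

def rotation_plan(today):
    """Decide which table receives writes and which one can be retired."""
    yesterday = today - timedelta(days=1)
    return {
        # New entries always go to the current day's table.
        "write_to": table_for(today),
        # Once fully copied to Redshift, the previous day's table is deleted.
        "copy_then_delete": table_for(yesterday),
    }

plan = rotation_plan(date(2014, 5, 15))
print(plan)
```

Small per-day tables keep each whole-table COPY short, and deleting a table after it has been copied stops the storage and throughput charges for it.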


Third Implementation – Redshift
•  Standard PostgreSQL JDBC
•  Fully managed by Amazon
•  Automated backups and fast restores
•  ~7,000 items inserted per second
•  Queries over > 1 billion entries in less than 2 seconds
•  Data available in near real time (maximum 1 minute delay)


Third Implementation – Conclusions
•  Java Web Application
   –  Auto Scale (Off-Peak - 1 Small Instance)
•  DynamoDB
   –  One table per day (deleted after it has been copied)
   –  Auto Scale
   –  ~5 ms Put Item latency
•  Redshift
   –  Insert ~7,000 items per second
   –  Fully managed


Thank You!