
AWS June Webinar Series - Getting Started: Amazon Redshift


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Pavan Pothukuchi

June 17, 2015

Amazon Redshift: Getting Started

Introduction

• Petabyte scale
• Massively parallel
• Relational data warehouse
• Fully managed; zero admin

Amazon Redshift: a lot faster, a lot cheaper, a whole lot simpler

[Diagram: AWS analytics pipeline: Collect (Direct Connect, Kinesis, S3), Store (S3, DynamoDB, Glacier, Data Pipeline), Analyze (Redshift, EMR, EC2)]

Selected Amazon Redshift Customers

Rapidly Growing Ecosystem

Benefits

Amazon Redshift Architecture

Leader node
• SQL endpoint, JDBC/ODBC
• Stores metadata
• Coordinates query execution

Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3
• Load from Amazon DynamoDB or SSH

Two hardware platforms, optimized for data processing
• DS2: HDD; scale from 2 TB to 2 PB
• DC1: SSD; scale from 160 GB to 326 TB

[Diagram: leader node in front of compute nodes on a 10 GigE (HPC) interconnect; ingestion, backup, and restore go through Amazon S3; clients connect to the leader node via JDBC/ODBC]
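Because the leader node is a standard SQL endpoint, existing drivers work unchanged. As a minimal sketch, a JDBC URL with the Amazon Redshift driver looks like this (hypothetical cluster endpoint and database name):

jdbc:redshift://examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dev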

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes

Column storage example:

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
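With column storage, a query reads only the columns it references. A minimal sketch against a hypothetical table t with the columns above: only the State and Amount blocks are read from disk; ID and Age are never touched.

select state, sum(amount)
from t
group by state;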


Data compression example: ANALYZE COMPRESSION recommends an encoding for each column.

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
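These recommendations can be applied with explicit ENCODE clauses when the table is created. A minimal sketch, assuming the column types of the TICKIT sample schema's listing table (leaving COMPUPDATE on, as in the COPY examples later, would pick encodings automatically on the first load):

create table listing (
    listid         integer      encode delta,
    sellerid       integer      encode delta32k,
    eventid        integer      encode delta32k,
    dateid         smallint     encode bytedict,
    numtickets     smallint     encode bytedict,
    priceperticket decimal(8,2) encode delta32k,
    totalprice     decimal(8,2) encode mostly32,
    listtime       timestamp    encode raw
);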

Zone maps:

• Track the minimum and maximum value for each block
• Skip over blocks that don't contain the data needed for a given query
• Minimize unnecessary I/O
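Zone maps are most effective when the data is physically ordered, which is what a sort key provides. A minimal sketch with a hypothetical sales table: blocks whose min/max saletime fall outside the queried range are skipped without being read.

create table sales (
    saleid   integer,
    saletime timestamp,
    amount   decimal(8,2)
)
sortkey (saletime);

select sum(amount)
from sales
where saletime between '2015-01-01' and '2015-01-31';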

Direct-attached storage and large block sizes:

• Use direct-attached storage to maximize throughput
• Hardware optimized for high-performance data processing
• Large block sizes to make the most of each read
• Amazon Redshift manages durability for you

Amazon Redshift Node Types

DS2 (HDD)
• Optimized for I/O-intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/year
• Scale from 2 TB to 2 PB

DS2.XL: 31 GB RAM, 2 cores, 2 TB compressed storage, 0.5 GB/sec scan
DS2.8XL: 244 GB RAM, 16 cores, 16 TB compressed storage, 4 GB/sec scan

DC1 (SSD)
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/year
• Scale from 160 GB to 326 TB

DC1.L: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
DC1.8XL: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage

Priced to let you analyze all your data

• Price is nodes × hourly cost; no charge for the leader node
• 3x data compression on average
• Price includes 3 copies of data

DS2 (HDD)             Price per hour, DS2.XL single node    Effective annual price per TB compressed
On-Demand             $0.850                                $3,725
1-Year Reservation    $0.500                                $2,190
3-Year Reservation    $0.228                                $999

DC1 (SSD)             Price per hour, DC1.L single node     Effective annual price per TB compressed
On-Demand             $0.250                                $13,690
1-Year Reservation    $0.161                                $8,795
3-Year Reservation    $0.100                                $5,500
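As a worked example under these on-demand rates: a 10-node DS2.XL cluster costs 10 × $0.850 = $8.50/hour, with nothing extra for the leader node; reserved pricing lowers the hourly rate in the proportions shown above.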

Built-in Security

• Load encrypted from S3
• SSL to secure data in transit; ECDHE for perfect forward secrecy
• Encryption to secure data at rest: all blocks on disks and in Amazon S3 encrypted
• Block key, cluster key, master key (AES-256)
• On-premises HSM and AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP
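For the first bullet, COPY can load files that were client-side encrypted before upload to S3. A hedged sketch (hypothetical table, bucket, and keys; master_symmetric_key is the base64-encoded AES key the files were encrypted with):

copy mytable from 's3://mybucket/encrypted/data'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>;master_symmetric_key=<master_key>'
encrypted;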

[Diagram: the cluster runs in an AWS-internal VPC and is reached from the customer VPC over JDBC/ODBC; 10 GigE (HPC) interconnect between nodes; ingestion, backup, and restore via Amazon S3]

Durability and Availability – Managed

Replication within the cluster and backup to Amazon S3 maintain multiple copies of data at all times

Backups to Amazon S3 are continuous, automatic, and incremental; designed for eleven nines of durability

Continuous monitoring and automated recovery from failures of drives and nodes

Able to restore snapshots to any Availability Zone within a region

Easily enable backups to a second region for disaster recovery

Use cases

Common Customer Use Cases

Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business

Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude

• Tens of millions of ads/day
• Stores 18 months of data
• Analyzes ad opportunities, clicks, and experiments

• 250M mobile events/day
• Stores 3 weeks of granular and 4 years of aggregate data
• Analyzes new feature usage and A/B testing

Create and Scale

Console walkthrough:
• Enter cluster details
• Select node configuration
• Select security settings and provision
• Point-and-click resize

Resize

• Resize while remaining online
• Provision a new cluster in the background
• Copy data in parallel from node to node
• Only charged for the source cluster

Load data

Data loading options

[Diagram: flat files from the corporate data center are uploaded to Amazon S3 and loaded into Amazon Redshift]

[Diagram: source databases in the corporate data center are loaded into Amazon Redshift through an ETL tool]

[Diagram: streaming data from Amazon Kinesis is loaded into Amazon Redshift]

Demo for loading data

Use the COPY command

• Each slice can load one file at a time
• A single input file means only one slice is ingesting data
• Instead of 100 MB/s, you're only getting 6.25 MB/s (one of 16 slices doing the work in this example)
• Use multiple input files to maximize throughput

Use the COPY command

• You need at least as many input files as you have slices
• With 16 input files, all slices are working, so you maximize throughput
• Get 100 MB/s per node; scale linearly as you add nodes
• Use multiple input files to maximize throughput

Load lineorder table from single file

copy lineorder from 's3://awssampledb/load/lo/lineorder-single.tbl'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'
gzip
compupdate off
region 'us-east-1';

Load lineorder table from multiple files

copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'
gzip
compupdate off
region 'us-east-1';
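A note on why the second command picks up multiple files: COPY treats the FROM value as an S3 key prefix, so it loads every object whose key begins with lineorder-multi.tbl. Splitting the source into at least as many parts as the cluster has slices lets every slice ingest in parallel.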

Query

Amazon Redshift works with your existing analysis tools

[Diagram: two topologies: BI clients connect directly to Redshift over ODBC/JDBC, or clients connect to a BI server that in turn connects to Redshift over ODBC/JDBC]

Monitor query performance

View explain plans

Resources

Pavan Pothukuchi | [email protected] |

Detail Pages

• http://aws.amazon.com/redshift

• https://aws.amazon.com/marketplace/redshift/

Best Practices

• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html

• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html

• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html

Deep Dive Webinar Series in July

• Migration and Loading Data

• Optimizing Performance

• Reporting and Advanced Analytics

AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices and new cloud services.

Details
• July 1, 2015
• Chicago, Illinois @ McCormick Place

Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking

Registration is now open
• Come and see what AWS and the cloud can do for you.
• Click here to register: http://amzn.to/1RooPPL

Load part table using key prefix

copy part from 's3://pp-redshift-webinar-demo/load/part-csv.tbl'

credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'

csv

null as '\000';
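The NULL AS '\000' clause tells COPY to load any field containing the NUL character as a SQL NULL; presumably that is how these sample CSV files mark missing values.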

Load supplier table using gzip

copy supplier from 's3://awssampledb/ssbgz/supplier.tbl'

credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'

delimiter '|'

gzip

region 'us-east-1';

Load customer table using a manifest file

copy customer from 's3://pp-redshift-webinar-demo/load/customer-fw-manifest'

credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'

fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region:12, c_phone:15, c_mktsegment:10'

maxerror 10

acceptinvchars as '^'

manifest;
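The manifest referenced above is a JSON file in S3 that lists the exact objects to load. A minimal sketch of what customer-fw-manifest could contain (the part file names are hypothetical):

{
  "entries": [
    {"url": "s3://pp-redshift-webinar-demo/load/customer-fw.tbl-000", "mandatory": true},
    {"url": "s3://pp-redshift-webinar-demo/load/customer-fw.tbl-001", "mandatory": true}
  ]
}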

Load dwdate using auto

copy dwdate from 's3://pp-redshift-webinar-demo/load/dwdate-tab.tbl'

credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'

delimiter '\t'

dateformat 'auto';
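With DATEFORMAT 'auto', COPY recognizes and converts several common date and time formats automatically instead of requiring a single fixed format string.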
