© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Anurag Gupta, Vice President Amazon Athena, Amazon CloudSearch, AWS Data Pipeline, Amazon Elasticsearch Service, Amazon EMR,
Amazon Redshift, AWS Glue, Amazon Aurora, Amazon RDS for MariaDB, RDS for MySQL, RDS for PostgreSQL
April 19, 2017
Introduction to Amazon Redshift Spectrum
What is Big Data?
When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them
Generate
Collect & Store
Analyze
Collaborate & Act
Individual AWS customers generate over a PB/day
It's never been easier to generate vast amounts of data
Amazon S3 lets you collect and store all this data
Store exabytes of data in S3
But how do you analyze it?
[Chart: data volume by year, 1990-2020. Generated data far outpaces data available for analysis, which remains highly constrained. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares"]
The Dark Data Problem: most generated data is unavailable for analysis
The tyranny of OR
Amazon EMR: directly access data in S3; scale out to thousands of nodes; open data formats; popular big data frameworks; anything you can dream up and code
Amazon Redshift: super-fast local disk performance; sophisticated query optimization; join-optimized data formats; query using standard SQL; optimized for data warehousing
But I don't want to choose.
I shouldn't have to choose
I want all of the above
I want
sophisticated query optimization and scale-out processing
super fast performance and support for open formats
the throughput of local disk and the scale of S3
I want all this
From one data processing engine
With my data accessible from all data processing engines
Now and in the future
We're told you have to choose:
Pick small clusters for joins or large ones for scans. Shuffles are expensive.
Open formats can't collocate data for joins.
They have to deal with variable cluster sizes.
Query optimization requires statistics. You can't determine these for external data.
It's just physics
Amazon Redshift Spectrum
Amazon Redshift Spectrum Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale Elastic & highly available On-demand, pay-per-query
High concurrency: Multiple clusters access same data
No ETL: Query data in-place using open file formats
Full Amazon Redshift SQL support
S3 SQL
Query SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY
Life of a query
[Diagram: client connects via JDBC/ODBC to the Amazon Redshift leader node; compute nodes 1...N fan out to the Amazon Redshift Spectrum layer, which scans Amazon S3 (exabyte-scale object storage) and consults the Data Catalog / Apache Hive Metastore]
1. Query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum
2. Query plan is sent to all compute nodes
3. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
4. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
5. Amazon Redshift Spectrum nodes scan your S3 data
6-7. Amazon Redshift Spectrum projects, filters, and aggregates
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
9. Result is sent back to client
Running an analytic query over an exabyte in S3
Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's get the prior books she's written. (1 table, 2 filters)
SELECT P.ASIN, P.TITLE FROM products P WHERE P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling'
Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values. (2 tables: 1 S3, 1 local; 2 filters; 1 join; 2 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation)
SELECT P.ASIN, P.TITLE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, products P WHERE D.ASIN = P.ASIN AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' GROUP BY P.ASIN, P.TITLE ORDER BY SALES_sum DESC LIMIT 20;
Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions. (3 tables: 1 S3, 2 local; 5 filters; 2 joins; 3 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation; 1 function; 2 casts)
SELECT P.ASIN, P.TITLE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA. (4 tables: 1 S3, 3 local; 8 filters; 3 joins; 4 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation; 1 function; 2 casts)
SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND D.REGION_ID = R.REGION_ID AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND R.COUNTRY_CODE = 'US' AND R.CITY = 'Seattle' AND R.STATE = 'WA' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
Now let's run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail records for each day over the past 20 years. 190 million files across 15,000 partitions in S3; one partition per day for the USA and rest of world. We need a billion-fold reduction in data processed. Running this query using a 1,000-node Hive cluster would take over 5 years.*

Compression: 5X
Columnar file format: 10X
Scanning with 2,500 nodes: 2,500X
Static partition elimination: 2X
Dynamic partition elimination: 350X
Redshift's query optimizer: 40X
Total reduction: ~3.5 billion X

* Estimated using a 20-node Hive cluster and 1.4 TB of data, assuming linear scaling
* Query used a 20-node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data; generated for this demo based on the data format used by Amazon Retail
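As a quick sanity check, the per-stage factors above do multiply out to roughly 3.5 billion; a minimal sketch (factors taken from this slide):

```python
# Multiply out the per-stage data-reduction factors quoted on the slide.
factors = {
    "compression": 5,
    "columnar file format": 10,
    "scanning with 2,500 nodes": 2500,
    "static partition elimination": 2,
    "dynamic partition elimination": 350,
    "query optimizer": 40,
}

total = 1
for factor in factors.values():
    total *= factor

print(f"{total:,}x")  # 3,500,000,000x -- about a 3.5-billion-fold reduction
```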
Amazon Redshift Spectrum is fast
Leverages Amazon Redshift's advanced cost-based optimizer. Pushes down projections, filters, aggregations, and join reduction. Dynamic partition pruning minimizes data processed. Automatic parallelization of query execution against S3 data. Efficient join processing within the Amazon Redshift cluster.
Amazon Redshift Spectrum is cost-effective
You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3. Each query can leverage thousands of Amazon Redshift Spectrum nodes. You can reduce the TB scanned and improve query performance by:
Partitioning data
Using a columnar file format
Compressing data
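A back-of-the-envelope sketch of the per-query scan charge (assuming decimal terabytes; the $5/TB and 1 TB-to-130 GB Parquet figures are from this talk, everything else is illustrative):

```python
PRICE_PER_TB = 5.00  # dollars per TB scanned from S3, per this talk

def scan_cost(bytes_scanned: int) -> float:
    """Estimated Redshift Spectrum charge for the S3 bytes a query scans."""
    return bytes_scanned / 10**12 * PRICE_PER_TB

# Scanning 1 TB of raw text vs. the same data stored as 130 GB of
# snappy-compressed Parquet (the conversion ratio quoted later in this talk):
print(scan_cost(10**12))       # 5.0 dollars
print(scan_cost(130 * 10**9))  # about 0.65 dollars
```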
Amazon Redshift Spectrum is secure
End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
Virtual private cloud: Amazon Redshift leader node runs in your VPC; compute nodes run in a private VPC; Spectrum nodes run in a private VPC and store no state
Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
Certifications & compliance: PCI DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
Amazon Redshift Spectrum uses standard SQL
Spectrum seamlessly integrates with your existing SQL & BI apps. Support for complex joins, nested queries & window functions. Support for data partitioned in S3 by any key:
Date, time, and any other custom keys, e.g., year, month, day, hour
Defining External Schema and Creating Tables
Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore:
CREATE EXTERNAL SCHEMA <schema_name>
Query external tables using <schema_name>.<table_name>
Register external tables using Athena, your Hive Metastore client, or Amazon Redshift's CREATE EXTERNAL TABLE syntax:
CREATE EXTERNAL TABLE <table_name> [PARTITIONED BY <partition_key>] STORED AS file_format LOCATION 's3_location' [TABLE PROPERTIES 'property_name'='property_value', ...];
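For example, a hypothetical date-partitioned clickstream table registered through the Athena data catalog might be declared like this (the schema, database, table, IAM role ARN, and bucket names are illustrative, not from the talk):

```sql
-- Illustrative sketch: all names and the role ARN are hypothetical.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.clicks (
    user_id  bigint,
    url      varchar(2048),
    ts       timestamp
)
PARTITIONED BY (event_date char(10))
STORED AS PARQUET
LOCATION 's3://my-bucket/clicks/';
```

Once registered, the table is queryable from Amazon Redshift as spectrum.clicks alongside local tables.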
Amazon Redshift Spectrum Current support
File formats: Parquet, CSV, Sequence, ORC (coming soon), RCFile, RegexSerDe (coming soon)
Compression: Gzip, Snappy, LZO (coming soon), Bz2
Encryption: SSE with AES-256; SSE-KMS with default key
Column types: numeric (bigint, int, smallint, float, double, and decimal), char/varchar/string, timestamp, boolean; the DATE type can be used only as a partitioning key
Table types: non-partitioned table (s3://mybucket/orders/..) and partitioned table (s3://mybucket/orders/date=YYYY-MM-DD/..)
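For a partitioned table, each partition must be registered with its S3 location before it can be queried; a sketch matching the orders layout above (table and bucket names are illustrative):

```sql
-- Illustrative sketch: adds one daily partition for a table laid out as
-- s3://mybucket/orders/date=YYYY-MM-DD/..
ALTER TABLE spectrum.orders
ADD IF NOT EXISTS PARTITION (date = '2017-04-19')
LOCATION 's3://mybucket/orders/date=2017-04-19/';
```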
Converting to Parquet and ORC using Amazon EMR
You can use Hive CREATE TABLE AS SELECT to convert data:
CREATE TABLE data_converted STORED AS PARQUET AS SELECT col_1, col_2, col_3 FROM data_source;
Or use Spark: about 20 lines of PySpark code, running on Amazon EMR.
1 TB of text data reduced to 130 GB in Parquet format with Snappy compression. Total cost of the EMR job to do this: $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
Is Amazon Redshift Spectrum useful if I don't have an exabyte?
Your data will get bigger: on average, data warehousing volumes grow 10x every 5 years, and the average Amazon Redshift customer doubles data each year.
Amazon Redshift Spectrum makes data analysis simpler: access your data without ETL pipelines, and teams using EMR, Athena & Amazon Redshift can collaborate using the same data lake.
Amazon Redshift Spectrum improves availability and concurrency: run multiple Amazon Redshift clusters against common data, and isolate jobs with tight SLAs from ad hoc analysis.
The Emerging Analytics Architecture
Amazon Athena Interactive Query
AWS Glue ETL & Data Catalog
Storage
Serverless Compute
Data Processing
Amazon S3 Exabyte-scale Object Storage
Amazon Kinesis Firehose Real-Time Data Streaming
Amazon EMR Managed Hadoop Applications
AWS Lambda Trigger-based Code Execution
AWS Glue Data Catalog Hive-compatible Metastore
Amazon Redshift Spectrum Fast @ Exabyte scale
Amazon Redshift Petabyte-scale Data Warehousing
Over 20 customers helped preview Amazon Redshift Spectrum
Thank you!
Questions?