
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of data in S3


  • © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

    Anurag Gupta, Vice President Amazon Athena, Amazon CloudSearch, AWS Data Pipeline, Amazon Elasticsearch Service, Amazon EMR,

    Amazon Redshift, AWS Glue, Amazon Aurora, Amazon RDS for MariaDB, RDS for MySQL, RDS for PostgreSQL

    April 19, 2017

    Introduction to Amazon Redshift Spectrum

  • What is Big Data?

    When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them

  • Generate

    Collect & Store

    Analyze

    Collaborate & Act

    Individual AWS customers generate over a PB/day

    It's never been easier to generate vast amounts of data

  • Generate

    Collect & Store

    Analyze

    Collaborate & Act

    Individual AWS customers generate over a PB/day

    Amazon S3 lets you collect and store all this data

    Store exabytes of data in S3

  • Generate

    Collect & Store

    Analyze

    Collaborate & Act

    Individual AWS customers generate over a PB/day

    Analysis is highly constrained

    But how do you analyze it?

    Store exabytes of data in S3

  • The Dark Data Problem: most generated data is unavailable for analysis

    [Chart: Data Volume by Year, 1990–2020 — generated data grows far faster than the data available for analysis]

    Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"

  • The tyranny of OR

    Amazon EMR: directly access data in S3; scale out to thousands of nodes; open data formats; popular big data frameworks; anything you can dream up and code

    Amazon Redshift: super-fast local disk performance; sophisticated query optimization; join-optimized data formats; query using standard SQL; optimized for data warehousing

  • But I don't want to choose.

    I shouldn't have to choose.

    I want all of the above.

  • I want

    sophisticated query optimization and scale-out processing

    super-fast performance and support for open formats

    the throughput of local disk and the scale of S3

  • I want all this

    From one data processing engine

    With my data accessible from all data processing engines

    Now and in the future

  • We're told you have to choose

    Pick small clusters for joins or large ones for scans; shuffles are expensive

    Open formats can't collocate data for joins, and they have to deal with variable cluster sizes

    Query optimization requires statistics, and you can't determine these for external data

  • It's just physics

  • Amazon Redshift Spectrum

  • Amazon Redshift Spectrum: Run SQL queries directly against data in S3 using thousands of nodes

    Fast at exabyte scale; elastic and highly available; on-demand, pay-per-query

    High concurrency: multiple clusters access the same data

    No ETL: query data in place using open file formats

    Full Amazon Redshift SQL support

  • Life of a query

    [Diagram: a client connects via JDBC/ODBC to an Amazon Redshift cluster (leader node plus compute nodes 1 … N); the cluster fans out to Amazon Redshift Spectrum nodes that read from Amazon S3 (exabyte-scale object storage), with table and partition metadata served by the Data Catalog / Apache Hive Metastore]

    1. A query, e.g. SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …, arrives via JDBC/ODBC
    2. The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum
    3. The query plan is sent to all compute nodes
    4. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
    5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
    6. Amazon Redshift Spectrum nodes scan your S3 data
    7. Amazon Redshift Spectrum projects, filters, and aggregates
    8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
    9. The result is sent back to the client
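    To make the steps concrete, here is a minimal sketch of the kind of query the flow above describes; the external schema spectrum and table ext_sales are hypothetical names standing in for an external table you have registered, not objects from this talk.

    -- Hypothetical names for illustration only.
    -- The scan, filter, and partial aggregation of spectrum.ext_sales
    -- run in the Amazon Redshift Spectrum layer (steps 5-7); the final
    -- aggregation happens in the Redshift cluster (step 8).
    SELECT region, COUNT(*) AS order_count
    FROM spectrum.ext_sales
    WHERE order_date >= '2017-01-01'
    GROUP BY region;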

  • Running an analytic query over an exabyte in S3

  • Let's build an analytic query - #1

    An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's get the prior books she's written. (1 table, 2 filters)

    SELECT P.ASIN, P.TITLE FROM products P WHERE P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling';

  • Let's build an analytic query - #2

    An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values. (2 tables (1 S3, 1 local), 2 filters, 1 join, 2 GROUP BY columns, 1 ORDER BY, 1 LIMIT, 1 aggregation)

    SELECT P.ASIN, P.TITLE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, products P WHERE D.ASIN = P.ASIN AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' GROUP BY P.ASIN, P.TITLE ORDER BY SALES_sum DESC LIMIT 20;

  • Let's build an analytic query - #3

    An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions. (3 tables (1 S3, 2 local), 5 filters, 2 joins, 3 GROUP BY columns, 1 ORDER BY, 1 LIMIT, 1 aggregation, 1 function, 2 casts)

    SELECT P.ASIN, P.TITLE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;

  • Let's build an analytic query - #4

    An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA. (4 tables (1 S3, 3 local), 8 filters, 3 joins, 4 GROUP BY columns, 1 ORDER BY, 1 LIMIT, 1 aggregation, 1 function, 2 casts)

    SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND D.REGION_ID = R.REGION_ID AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND R.COUNTRY_CODE = 'US' AND R.CITY = 'Seattle' AND R.STATE = 'WA' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;

  • Now let's run that query over an exabyte of data in S3

    Roughly 140 TB of customer item order detail records for each day over the past 20 years. 190 million files across 15,000 partitions in S3, one partition per day for USA and rest of world. We need a billion-fold reduction in data processed. Running this query on a 1,000-node Hive cluster would take over 5 years.*

    Compression: 5X
    Columnar file format: 10X
    Scanning with 2,500 nodes: 2500X
    Static partition elimination: 2X
    Dynamic partition elimination: 350X
    Redshift's query optimizer: 40X
    Total reduction: 5 × 10 × 2500 × 2 × 350 × 40 = 3.5 billion X

    * Estimated using a 20-node Hive cluster and 1.4 TB of data, assuming linear scaling
    * Query used a 20-node DC1.8XLarge Amazon Redshift cluster
    * Not actual sales data; generated for this demo based on the data format used by Amazon Retail

  • Amazon Redshift Spectrum is fast

    Leverages Amazon Redshift's advanced cost-based optimizer. Pushes down projections, filters, aggregations, and join reduction. Dynamic partition pruning minimizes data processed. Automatic parallelization of query execution against S3 data. Efficient join processing within the Amazon Redshift cluster.

  • Amazon Redshift Spectrum is cost-effective

    You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3. Each query can leverage thousands of Amazon Redshift Spectrum nodes. You can reduce the TB scanned and improve query performance by:

    partitioning data; using a columnar file format; compressing data (see the sketch below for one way to measure what a query scanned)
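    As a sketch of how you might check the effect of these optimizations, assuming the SVL_S3QUERY_SUMMARY system view that Amazon Redshift exposes for Spectrum query statistics (verify the view and column names against your release's documentation):

    -- Assumption: SVL_S3QUERY_SUMMARY is available on your cluster.
    -- Shows rows and bytes scanned from S3 for recent Spectrum queries.
    SELECT query, s3_scanned_rows, s3_scanned_bytes
    FROM svl_s3query_summary
    ORDER BY query DESC
    LIMIT 10;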

  • Amazon Redshift Spectrum is secure

    End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE

    Virtual private cloud: the Amazon Redshift leader node runs in your VPC; compute nodes run in a private VPC; Spectrum nodes run in a private VPC and store no state

    Alerts & notifications: event-specific notifications via email, text message, or call with Amazon SNS

    Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift

    Certifications & compliance: PCI DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA

  • Amazon Redshift Spectrum uses standard SQL

    Spectrum seamlessly integrates with your existing SQL & BI apps. Support for complex joins, nested queries, and window functions. Support for data partitioned in S3 by any key: date, time, and other custom keys (e.g., year, month, day, hour).

  • Defining External Schema and Creating Tables

    Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore

    CREATE EXTERNAL SCHEMA

    Query external tables using <schema_name>.<table_name>

    Register external tables using Athena, your Hive Metastore client, or Amazon Redshift's CREATE EXTERNAL TABLE syntax:

    CREATE EXTERNAL TABLE <schema_name>.<table_name> (<column definitions>) [PARTITIONED BY (<partition columns>)] STORED AS <file_format> LOCATION '<s3_location>' [TABLE PROPERTIES ('<property_name>'='<property_value>', ...)];
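    As a minimal sketch with hypothetical names (a spectrum schema backed by a catalog database spectrumdb, a made-up IAM role ARN, and a sales table in a made-up bucket):

    -- All identifiers, the role ARN, and the bucket are illustrative.
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG DATABASE 'spectrumdb'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Partitioned external table; DATE is allowed as a partition key.
    CREATE EXTERNAL TABLE spectrum.sales (
      asin      VARCHAR(10),
      quantity  INT,
      our_price DECIMAL(8,2)
    )
    PARTITIONED BY (order_day DATE)
    STORED AS PARQUET
    LOCATION 's3://mybucket/sales/';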

  • Amazon Redshift Spectrum: current support

    File formats: Parquet, CSV, Sequence, RCFile; ORC and RegExSerDe coming soon

    Compression: Gzip, Snappy, Bz2; LZO coming soon

    Encryption: SSE with AES256; SSE-KMS with the default key

    Column types: numeric (bigint, int, smallint, float, double, decimal); char/varchar/string; timestamp; boolean. The DATE type can be used only as a partitioning key

    Table types: non-partitioned (s3://mybucket/orders/..) and partitioned (s3://mybucket/orders/date=YYYY-MM-DD/..)
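    For the partitioned layout, each partition must be registered before it can be queried. A sketch using the hypothetical spectrum.sales table from earlier:

    -- Hypothetical table and bucket; one ALTER TABLE per partition.
    ALTER TABLE spectrum.sales
    ADD PARTITION (order_day='2017-04-19')
    LOCATION 's3://mybucket/sales/order_day=2017-04-19/';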

  • Converting to Parquet and ORC using Amazon EMR

    You can use Hive CREATE TABLE AS SELECT to convert data:

    CREATE TABLE data_converted STORED AS PARQUET AS SELECT col_1, col2, col3 FROM data_source;
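    A slightly fuller sketch of the same Hive conversion, adding Hive's Parquet compression table property to get the Snappy compression mentioned below (table and column names as in the snippet above):

    -- Hive CTAS; 'parquet.compression' is a standard Hive table property.
    CREATE TABLE data_converted
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression'='SNAPPY')
    AS SELECT col_1, col2, col3 FROM data_source;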

    Or use Spark: roughly 20 lines of PySpark code running on Amazon EMR

    1 TB of text data reduced to 130 GB in Parquet format with Snappy compression. Total cost of the EMR job: $5

    https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion


  • Is Amazon Redshift Spectrum useful if I don't have an exabyte?

    Your data will get bigger: on average, data warehousing volumes grow 10x every 5 years, and the average Amazon Redshift customer doubles data each year. Amazon Redshift Spectrum makes data analysis simpler: access your data without ETL pipelines, and teams using EMR, Athena & Amazon Redshift can collaborate using the same data lake. Amazon Redshift Spectrum improves availability and concurrency: run multiple Amazon Redshift clusters against common data, and isolate jobs with tight SLAs from ad hoc analysis.

  • The Emerging Analytics Architecture

    Storage: Amazon S3 (exabyte-scale object storage); Amazon Kinesis Firehose (real-time data streaming)

    Serverless compute: Amazon Athena (interactive query); AWS Lambda (trigger-based code execution); AWS Glue (ETL); AWS Glue Data Catalog (Hive-compatible metastore)

    Data processing: Amazon EMR (managed Hadoop applications); Amazon Redshift (petabyte-scale data warehousing); Amazon Redshift Spectrum (fast at exabyte scale)

  • Over 20 customers helped preview Amazon Redshift Spectrum

  • Thank you!

    Questions?
