© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Anurag Gupta, Vice President Amazon Athena, Amazon CloudSearch, AWS Data Pipeline, Amazon Elasticsearch Service, Amazon EMR,
Amazon Redshift, AWS Glue, Amazon Aurora, Amazon RDS for MariaDB, RDS for MySQL, RDS for PostgreSQL
April 19, 2017
Introduction to Amazon Redshift Spectrum
What is Big Data?
When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them
Generate
Collect & Store
Analyze
Collaborate & Act
Individual AWS customers generate over a PB/day
It's never been easier to generate vast amounts of data
Amazon S3 lets you collect and store all this data
Store exabytes of data in S3
But how do you analyze it?
[Chart: data volume by year, 1990-2020. Generated data far outpaces data available for analysis, which remains highly constrained. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares"]
The Dark Data Problem: most generated data is unavailable for analysis
The tyranny of OR
Amazon EMR: directly access data in S3; scale out to thousands of nodes; open data formats; popular big data frameworks; anything you can dream up and code
Amazon Redshift: super-fast local disk performance; sophisticated query optimization; join-optimized data formats; query using standard SQL; optimized for data warehousing
But I don't want to choose.
I shouldn't have to choose
I want all of the above
I want
sophisticated query optimization and scale-out processing
super fast performance and support for open formats
the throughput of local disk and the scale of S3
I want all this
From one data processing engine
With my data accessible from all data processing engines
Now and in the future
We're told you have to choose:
Pick small clusters for joins or large ones for scans. Shuffles are expensive.
Open formats can't collocate data for joins.
They have to deal with variable cluster sizes.
Query optimization requires statistics. You can't determine these for external data.
It's just physics
Amazon Redshift Spectrum
Amazon Redshift Spectrum Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale Elastic & highly available On-demand, pay-per-query
High concurrency: Multiple clusters access same data
No ETL: Query data in-place using open file formats
Full Amazon Redshift SQL support
S3 SQL
Query SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY
Life of a query
[Diagram: client connects via JDBC/ODBC to the Amazon Redshift leader node; compute nodes 1...N fan out to the Amazon Redshift Spectrum layer, which scans Amazon S3 (exabyte-scale object storage) and consults the Data Catalog / Apache Hive Metastore]
1. Query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum
2. Query plan is sent to all compute nodes
3. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
4. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
5. Amazon Redshift Spectrum nodes scan your S3 data
6-7. Amazon Redshift Spectrum projects, filters, and aggregates
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
9. Result is sent back to client
Running an analytic query over an exabyte in S3
Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's get the prior books she's written. (1 table, 2 filters)
SELECT P.ASIN, P.TITLE FROM products P WHERE P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling'
Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values. (2 tables: 1 S3, 1 local; 2 filters; 1 join; 2 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation)
SELECT P.ASIN, P.TITLE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, products P WHERE D.ASIN = P.ASIN AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' GROUP BY P.ASIN, P.TITLE ORDER BY SALES_sum DESC LIMIT 20;
Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions. (3 tables: 1 S3, 2 local; 5 filters; 2 joins; 3 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation; 1 function; 2 casts)
SELECT P.ASIN, P.TITLE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA. (4 tables: 1 S3, 3 local; 8 filters; 3 joins; 4 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation; 1 function; 2 casts)
SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND D.REGION_ID = R.REGION_ID AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND R.COUNTRY_CODE = 'US' AND R.CITY = 'Seattle' AND R.STATE = 'WA' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
Now let's run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail records for each day over the past 20 years. 190 million files across 15,000 partitions in S3; one partition per day for the USA and rest of world. We need a billion-fold reduction in data processed. Running this query using a 1,000-node Hive cluster would take over 5 years.*

Compression: 5X
Columnar file format: 10X
Scanning with 2,500 nodes: 2,500X
Static partition elimination: 2X
Dynamic partition elimination: 350X
Redshift's query optimizer: 40X
Total reduction: ~3.5 billion X

* Estimated using a 20-node Hive cluster and 1.4 TB of data, assuming linear scaling
* Query used a 20-node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data; generated for this demo based on the data format used by Amazon Retail
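As a quick sanity check, the per-stage factors above do multiply out to roughly 3.5 billion; a minimal sketch (factors taken from this slide):

```python
# Multiply out the per-stage data-reduction factors quoted on the slide.
factors = {
    "compression": 5,
    "columnar file format": 10,
    "scanning with 2,500 nodes": 2500,
    "static partition elimination": 2,
    "dynamic partition elimination": 350,
    "query optimizer": 40,
}

total = 1
for factor in factors.values():
    total *= factor

print(f"{total:,}x")  # 3,500,000,000x -- about a 3.5-billion-fold reduction
```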
Amazon Redshift Spectrum is fast
Leverages Amazon Redshift's advanced cost-based optimizer. Pushes down projections, filters, aggregations, and join reduction. Dynamic partition pruning minimizes data processed. Automatic parallelization of query execution against S3 data. Efficient join processing within the Amazon Redshift cluster.
Amazon Redshift Spectrum is cost-effective
You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3. Each query can leverage thousands of Amazon Redshift Spectrum nodes. You can reduce the TB scanned and improve query performance by:
Partitioning data
Using a columnar file format
Compressing data
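A back-of-the-envelope sketch of the per-query scan charge (assuming decimal terabytes; the $5/TB and 1 TB-to-130 GB Parquet figures are from this talk, everything else is illustrative):

```python
PRICE_PER_TB = 5.00  # dollars per TB scanned from S3, per this talk

def scan_cost(bytes_scanned: int) -> float:
    """Estimated Redshift Spectrum charge for the S3 bytes a query scans."""
    return bytes_scanned / 10**12 * PRICE_PER_TB

# Scanning 1 TB of raw text vs. the same data stored as 130 GB of
# snappy-compressed Parquet (the conversion ratio quoted later in this talk):
print(scan_cost(10**12))       # 5.0 dollars
print(scan_cost(130 * 10**9))  # about 0.65 dollars
```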
Amazon Redshift Spectrum is secure
End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
Virtual private cloud: Amazon Redshift leader node runs in your VPC; compute nodes run in a private VPC; Spectrum nodes run in a private VPC and store no state
Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
Certifications & compliance: PCI DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
Amazon Redshift Spectrum uses standard SQL
Spectrum seamlessly integrates with your existing SQL & BI apps. Support for complex joins, nested queries & window functions. Support for data partitioned in S3 by any key:
Date, time, and any other custom keys, e.g., year, month, day, hour
Defining External Schema and Creating Tables
Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore:
CREATE EXTERNAL SCHEMA <schema_name>
Query external tables using <schema_name>.<table_name>
Register external tables using Athena, your Hive Metastore client, or Amazon Redshift's CREATE EXTERNAL TABLE syntax:
CREATE EXTERNAL TABLE <table_name> [PARTITIONED BY <partition_key>] STORED AS file_format LOCATION 's3_location' [TABLE PROPERTIES 'property_name'='property_value', ...];
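For example, a hypothetical date-partitioned clickstream table registered through the Athena data catalog might be declared like this (the schema, database, table, IAM role ARN, and bucket names are illustrative, not from the talk):

```sql
-- Illustrative sketch: all names and the role ARN are hypothetical.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.clicks (
    user_id  bigint,
    url      varchar(2048),
    ts       timestamp
)
PARTITIONED BY (event_date char(10))
STORED AS PARQUET
LOCATION 's3://my-bucket/clicks/';
```

Once registered, the table is queryable from Amazon Redshift as spectrum.clicks alongside local tables.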
Amazon Redshift Spectrum Current support
File formats: Parquet, CSV, Sequence, ORC (coming soon), RCFile, RegexSerDe (coming soon)
Compression: Gzip, Snappy, LZO (coming soon), Bz2
Encryption: SSE with AES-256; SSE-KMS with default key
Column types: numeric (bigint, int, smallint, float, double, and decimal), char/varchar/string, timestamp, boolean; the DATE type can be used only as a partitioning key
Table types: non-partitioned table (s3://mybucket/orders/..) and partitioned table (s3://mybucket/orders/date=YYYY-MM-DD/..)
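For a partitioned table, each partition must be registered with its S3 location before it can be queried; a sketch matching the orders layout above (table and bucket names are illustrative):

```sql
-- Illustrative sketch: adds one daily partition for a table laid out as
-- s3://mybucket/orders/date=YYYY-MM-DD/..
ALTER TABLE spectrum.orders
ADD IF NOT EXISTS PARTITION (date = '2017-04-19')
LOCATION 's3://mybucket/orders/date=2017-04-19/';
```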
Converting to Parquet and ORC using Amazon EMR
You can use Hive CREATE TABLE AS SELECT to convert data:
CREATE TABLE data_converted STORED AS PARQUET AS SELECT col_1, col_2, col_3 FROM data_source;
Or use Spark: about 20 lines of PySpark code, running on Amazon EMR.
1 TB of text data reduced to 130 GB in Parquet format with Snappy compression. Total cost of the EMR job to do this: $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
Is Amazon Redshift Spectrum useful if I don't have an exabyte?
Your data will get bigger: on average, data warehousing volumes grow 10x every 5 years, and the average Amazon Redshift customer doubles data each year.
Amazon Redshift Spectrum makes data analysis simpler: access your data without ETL pipelines, and teams using EMR, Athena & Amazon Redshift can collaborate using the same data lake.
Amazon Redshift Spectrum improves availability and concurrency: run multiple Amazon Redshift clusters against common data, and isolate jobs with tight SLAs from ad hoc analysis.
The Emerging Analytics Architecture
Amazon Athena Interactive Query
AWS Glue ETL & Data Catalog
Storage
Serverless Compute
Data Processing
Amazon S3 Exabyte-scale Object Storage
Amazon Kinesis Firehose Real-Time Data Streaming
Amazon EMR Managed Hadoop Applications
AWS Lambda Trigger-based Code Execution
AWS Glue Data Catalog Hive-compatible Metastore
Amazon Redshift Spectrum Fast @ Exabyte scale
Amazon Redshift Petabyte-scale Data Warehousing
Over 20 customers helped preview Amazon Redshift Spectrum
Thank you!
Questions?