Upload
amazon-web-services
View
548
Download
1
Embed Size (px)
Citation preview
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pavan Pothukuchi, Principal Product Manager, AWS
Chris Liu, Data Infrastructure Engineer, Coursera
July 13, 2016
Getting Started with
Amazon Redshift
AnalyzeStore
Amazon
Glacier
Amazon S3
Amazon
DynamoDB
Amazon RDS,
Amazon Aurora
AWS big data portfolio
AWS Data Pipeline
Amazon
CloudSearch
Amazon EMR Amazon EC2
Amazon
Redshift
Amazon
Machine
Learning
Amazon
Elasticsearch
Service
AWS Database
Migration Service
Amazon
QuickSight
Amazon
Kinesis
Firehose
AWS Import/Export
Snowball
AWS Direct
Connect
Collect
Amazon Kinesis
Streams
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
The Amazon Redshift view of data warehousing
10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programming
Easily leverage BI tools,
Hadoop, machine learning,
streaming
Analysis inline with process
flows
Pay as you go, grow as you
need
Managed availability and
disaster recovery
Enterprise Big data SaaS
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical
representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any
vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
Forrester Wave™ enterprise data warehouse Q4 ’15
Amazon Redshift architecture
Leader node
Simple SQL endpoint
Stores metadata
Optimizes query plan
Coordinates query execution
Compute nodes
Local columnar storage
Parallel/distributed execution of all queries, loads,
backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS2: HDD; scale from 2 TB to 2 PB
Ingestion/backup
Backup
Restore
JDBC/ODBC
10 GigE
(HPC)
Benefit #1: Amazon Redshift is fast
Dramatically less I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
10 | 13 | 14 | 26 |…
… | 100 | 245 | 324
375 | 393 | 417…
… 512 | 549 | 623
637 | 712 | 809 …
… | 834 | 921 | 959
10
324
375
623
637
959
Benefit #1: Amazon Redshift is fast
Parallel and distributed
Query
Load
Export
Backup
Restore
Resize
Benefit #1: Amazon Redshift is fast
Hardware optimized for I/O intensive workloads, 4 GB/sec/node
Enhanced networking, over 1 million packets/sec/node
Choice of storage type, instance size
Regular cadence of autopatched improvements
Benefit #1: Amazon Redshift is fast
New dense storage (HDD) instance type (Jun 2015)
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: Same as our prior generation!
Performance improvement: 50%
Enhanced I/O and commit improvements (Jan 2016)
Performance improvement: 35%
Memory allocation improvements (May 2016)
Performance improvement: 60%
Benefit #2: Amazon Redshift is inexpensive
Ds2 (HDD)Price per hour for
DW1.XL single node
Effective annual
price per TB compressed
On demand $ 0.850 $ 3,725
1-year reservation $ 0.500 $ 2,190
3-year reservation $ 0.228 $ 999
Dc1 (SSD)Price per hour for
DW2.L single node
Effective annual
price per TB compressed
On demand $ 0.250 $ 13,690
1-year reservation $ 0.161 $ 8,795
3-year reservation $ 0.100 $ 5,500
Pricing is simple
Number of nodes x price/hour
No charge for leader node
No upfront costs
Pay as you go
Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups
to Amazon S3
Continuous and incremental backups
across regions
Streaming restore
Amazon S3
Amazon S3
Region 1
Region 2
Benefit #3: Amazon Redshift is fully managed
Amazon S3
Amazon S3
Region 1
Region 2
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/region level disasters
Benefit #4: Security is built in
• Load encrypted from Amazon S3
• SSL to secure data in transit
• ECDHE perfect forward security
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks and in Amazon S3 encrypted
• Block key, cluster key, master key (AES-256)
• On-premises HSM and AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal
VPC
JDBC/ODBC
Benefit #5: We innovate quickly
Well over 125 new features added since launch
Release every two weeks
Automatic patching
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit
Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS
Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500
(12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch size support for single node clusters, new
system tables with commit stats, row_number(), strotol() and query
termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE ciphers (4/22)
3 new regex features, Unload to single file, FedRAMP(5/6)
Rename Cluster (6/2)
Copy from multiple regions, percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)
Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User-defined functions
• Machine learning
• Data science
Benefit #7: Amazon Redshift has a large ecosystem
Data integration Systems integratorsBusiness intelligence
Benefit #8: Service-oriented architecture
Amazon
DynamoDB
Amazon EMR
Amazon S3
Amazon EC2/SSH
Amazon RDS/
Amazon Aurora
Amazon
Redshift
Amazon Kinesis
Amazon ML
AWS Data Pipeline
Amazon
CloudSearch
Amazon
Mobile
Analytics
Performance
Ease of use
SecurityAnalytics and functionality
SOA
Recent launchesDynamic WLM parameters
Queue hopping for timed-out queries
Merge rows from staging to prod. table
2x improvement in query throughput
10x latency improvement for UNION ALL queries
Bzip2 format for ingestion
Table level restore
10x improvement in vacuum perf.
Default access privileges
Tag-based AWS IAM access
IAM roles for COPY/UNLOAD
SAS connector enhancements,
Implicit conversion of SAS
queries to Amazon Redshift
DMS support from OLTP sources
Enhanced data ingestion from
Kinesis Firehose
Improved data schema conversion
to Amazon ML
NTT Docomo: Japan’s largest mobile service
provider
68 million customers
Tens of TBs per day of data across a
mobile network
6 PB of total data (uncompressed)
Data science for marketing
operations, logistics, and so on
Greenplum on premises
Scaling challenges
Performance issues
Need same level of security
Need for a hybrid environment
NTT Docomo: Japan’s largest mobile service
provider
125 node DS2.8XL cluster
4,500 vCPUs, 30 TB RAM
2 PB compressed
10x faster analytic queries
50% reduction in time for new
BI application deployment
Significantly less operations
overhead
Data
Source
ET
AWS
Direct
Connect
Client
Forwarder
LoaderState
management
SandboxAmazon Redshift
S3
Nasdaq: powering 100 marketplaces in 50
countries
Orders, quotes, trade executions,
market “tick” data from 7 exchanges
7 billion rows/day
Analyze market share, client activity,
surveillance, billing, and so on
Microsoft SQL Server on premises
Expensive legacy DW
($1.16 M/yr.)
Limited capacity (1 yr. of data
online)
Needed lower TCO
Must satisfy multiple security
and regulatory requirements
Similar performance
23 node DS2.8XL cluster
828 vCPUs, 5 TB RAM
368 TB compressed
2.7 T rows, 900 B derived
8 tables with 100 B rows
7 man-month migration
¼ the cost, 2x storage, room to
grow
Faster performance, very
secure
Nasdaq: powering 100 marketplaces in 50
countries
Education at scale
140partners
3.7 millioncourse completions
20 millionlearners worldwide
1900courses
Outline
• Moving from no data warehouse to the Amazon
Redshift ecosystem• No warehouse: m2.2xlarge read replica – 4 CPUs, 32 GB RAM on Amazon RDS
• First Amazon Redshift cluster: 1 ds1.xl node – 2 CPU, 16 GB RAM
Outline
• Moving from no data warehouse to the Amazon
Redshift ecosystem• No warehouse: m2.2xlarge read replica – 4 CPUs, 32 GB RAM on Amazon RDS
• First Amazon Redshift cluster: 1 ds1.xl node – 2 CPU, 16 GB RAM
• The Amazon Redshift ecosystem at Coursera• Current day: 9 dc1.8xl nodes – 288 CPUs, 2.4 TB RAM
Outline
• Moving from no data warehouse to the Amazon
Redshift ecosystem• No warehouse: m2.2xlarge read replica – 4 CPUs, 32 GB RAM on Amazon RDS
• First Amazon Redshift cluster: 1 ds1.xl node – 2 CPU, 16 GB RAM
• The Amazon Redshift ecosystem at Coursera• Current day: 9 dc1.8xl nodes – 288 CPUs, 2.4 TB RAM
• Learnings from 3 years on Amazon Redshift• Lessons in communication, surprises, reflections
Starting point
• Querying production read replica
• Makeshift libraries providing thin abstraction layer• 45 minutes to provide aggregate metrics over all classes running on Coursera =(
Starting point
• Querying production read replica
• Makeshift libraries providing thin abstraction layer• 45 minutes to provide aggregate metrics over all classes running on Coursera =(
Move in progress
• Risk-free deployment• "Let's try it out"
• Few clicks to deploy cluster, connect to cluster, resize
Move in progress
• Risk-free deployment• "Let's try it out"
• Few clicks to deploy cluster, connect to cluster, resize
• AWS ecosystem integration• COPY from S3/EMR/SSH
• Unload to S3
• UNLOAD(COPY(data)) == COPY(UNLOAD(data)) == data
Move in progress
• Risk-free deployment• "Let's try it out"
• Few clicks to deploy cluster, connect to cluster, resize
• AWS ecosystem integration• COPY from S3/EMR/SSH
• Unload to S3
• UNLOAD(COPY(data)) == COPY(UNLOAD(data)) == data
• Minimal administration• In aggregate, less than 1 full-time employee for administration
• Automation and tooling for monitoring usage and performance
Amazon Redshift ecosystem at Coursera
• Data flow in and out of Amazon Redshift
• Business insights and reporting
• Deriving value from data
• Democratizing data access
Amazon Redshift ecosystem at Coursera
• Data flow in and out of Amazon Redshift
• Business insights and reporting
• Deriving value from data
• Democratizing data access
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI applications
Internal applicationsThird-party tools
Cassandra
Cassandra
AWS Data Pipeline
Data flow
Amazon S3
Amazon Redshift ecosystem at Coursera
• Data flow in and out of Amazon Redshift
• Business insights and reporting
• Data products
• Democratizing data access
• Provide directional insights and aggregate metrics
• Aggregate metrics over all classes on Coursera: < 5 seconds
• Goal: Insight at the speed of thought
• Results1: 0.8s median, 28s p95, 120s p99
• Companywide goal tracking
• Scheduled reports to internal and external stakeholders
• Crucial part of data informed culture
1 Results for ad hoc queries ran in the last 90 days
Business insights and reporting
Amazon Redshift ecosystem at Coursera
• Data flow in and out of Amazon Redshift
• Business insights and reporting
• Data products
• Democratizing data access
• AB experimentation
• 300 M impression table joined with 1.8 B events table in 12 minutes.
• Recommendations model
• Amazon Redshift for relational transformation
• Unload to S3 for model training
• Providing university partners with analytical dashboard
and research exports
Data Products
Amazon Redshift ecosystem at Coursera
• Data flow in and out of Amazon Redshift
• Business insights and reporting
• Data products
• Democratizing data access
Learnings from 3 years on Amazon Redshift
• Thinking in Amazon Redshift
• Communicate to users
• Surprises
• Reflections
Thinking in Amazon Redshift
• Columnar
• SELECT * considered harmful in most cases
• Nodes, slices, blocks
• 1 MB blocks per slice, n slices per node depending on node type
Thinking in Amazon Redshift
• Columnar
• SELECT * considered harmful in most cases
• Nodes, slices, blocks
• 1 MB blocks per slice, n slices per node depending on node type
• Sorting and distribution
• Share nothing massively parallel processing => data is sorted per slice
• Up to 2 orders of magnitude increase in JOIN/GROUP BY for merge join vs hash join
Communicating to users
• Prefer the scientific method over gut feel
• Investigate how many rows were materialized with svl_query_report
• Understand EXPLAIN plan for data distribution, join strategy, predicate order
Communicating to users
• Prefer the scientific method over gut feel
• Investigate how many rows were materialized with svl_query_report
• Understand EXPLAIN plan for data distribution, join strategy, predicate order
• SQL style guide for readability
• Leading commas, capitalized SQL keywords, conventions for handling dates/timestamps,
conventions for table names, mapping tables
Communicating to users
• Prefer the scientific method over gut feel
• Investigate how many rows were materialized with svl_query_report
• Understand EXPLAIN plan for data distribution, join strategy, predicate order
• SQL style guide for readability
• Leading commas, capitalized SQL keywords, conventions for handling dates/timestamps,
conventions for table names, mapping tables
• Use the right tool for the right task
• Amazon Redshift is not for online traffic serving
• Amazon Redshift is not for stream processing
Surprises
• "Fundamental theorem of Redshift at Coursera"
• Most queries involve full table scans
• 9 nodes x 32 slices/node x 1 block/slice x 1 MB/block => at least 288 MB allocated per column
• Store ~75 M integer values and maintain 1 block per slice
Surprises
• "Fundamental theorem of Redshift at Coursera"
• Most queries involve full table scans
• 9 nodes x 32 slices/node x 1 block/slice x 1 MB/block => at least 288 MB allocated per column
• Store ~75 M integer values and maintain 1 block per slice
• Features may behave in unexpected fashions
• Sort key compression
• Primary and foreign keys
Surprises
• "Fundamental theorem of Redshift at Coursera"
• Most queries involve full table scans
• 9 nodes x 32 slices/node x 1 block/slice x 1 MB/block => at least 288 MB allocated per column
• Store ~75 M integer values and maintain 1 block per slice
• Features may behave in unexpected fashions
• Sort key compression
• Primary and foreign keys
• Features may be unexpectedly expensive
• COMMIT – Batch work, monitor with stl_commit_stats
• VACUUM – Prefer TRUNCATE, monitor with stl_vacuum, stl_query
• Your mileage may vary
Reflections
• Simplicity – Relational model, Postgres 8.0 compliant SQL, things just
work. Minimal administration. Minimal tuning.
Reflections
• Simplicity – Relational model, Postgres 8.0 compliant SQL, things just
work. Minimal administration. Minimal tuning.
• Scalability – Scaled cluster up 5 times in the last 3 years as data volume
and usage increased.
Reflections
• Simplicity – Relational model, Postgres 8.0 compliant SQL, things just
work. Minimal administration. Minimal tuning.
• Scalability – Scaled cluster up 5 times in the last 3 years as data volume
and usage increased.
• Flexibility – No strict requirement on data modeling; dusty knobs for
tuning in majority of cases. Handles both heavily normalized data model
and denormalized clickstream data.
Reflections
• Simplicity – Relational model, Postgres 8.0 compliant SQL, things just
work. Minimal administration. Minimal tuning.
• Scalability – Scaled cluster up 5 times in the last 3 years as data volume
and usage increased.
• Flexibility – No strict requirement on data modeling; dusty knobs for
tuning in majority of cases. Handles both heavily normalized data model
and denormalized clickstream data.
• Extensibility – Standard API (JDBC/ODBC/libpq) and integration points.
Resources
Pavan Pothukuchi | [email protected] |
Chris Liu | [email protected] |
Detail pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
Best practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Related breakout sessions
• Deep Dive on Amazon QuickSight (2:15–3:15 pm)
• Getting Started with Amazon QuickSight (2:15–3:15 pm)
• Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with Minimal Downtime (4:45–5:45 pm)