Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. This presentation gives an introduction to the service and its pricing before diving into how it delivers fast query performance on data sets ranging from hundreds of gigabytes to a petabyte or more. Speakers: Steffen Krause, Technical Evangelist, AWS; Padraic Mulligan, Architect and Lead Developer, and Mike McCarthy, CTO, SkillPages.
Steffen Krause, Technical Evangelist
Introducing Amazon Redshift
Data warehousing done the AWS way
• No upfront costs, pay as you go
• Really fast performance at a really low price
• Open and flexible with support for popular tools
• Easy to provision and scale up massively
We set out to build…
A fast and powerful, petabyte-scale data warehouse that is:
Delivered as a managed service
A Lot Faster
A Lot Cheaper
A Lot Simpler
Amazon Redshift
We’re off to a good start
Amazon Redshift dramatically reduces I/O
[Figure: the same table under row storage and column storage; the scan direction differs]
ID  | Age | State
----+-----+------
123 | 20  | CA
345 | 25  | WA
678 | 40  | FL
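To make the I/O argument concrete, here is a toy sketch in Python using the three rows from the slide. It is only an illustration of the row versus column layout idea, not Redshift's actual storage format; the function names and the "values read" counter are assumptions made for this example.

```python
# Toy illustration only: the three rows from the slide, stored two ways.
rows = [
    (123, 20, "CA"),
    (345, 25, "WA"),
    (678, 40, "FL"),
]

# Row storage keeps each record together: id, age, state, id, age, state, ...
row_store = [value for row in rows for value in row]

# Column storage keeps all values of one column together.
column_store = {
    "id": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "state": [r[2] for r in rows],
}


def average_age_row_store(flat, width=3, age_offset=1):
    """Scan every stored value, although only one in `width` is an age."""
    values_read = len(flat)
    ages = flat[age_offset::width]
    return sum(ages) / len(ages), values_read


def average_age_column_store(store):
    """Scan only the 'age' column."""
    ages = store["age"]
    return sum(ages) / len(ages), len(ages)


row_avg, row_reads = average_age_row_store(row_store)
col_avg, col_reads = average_age_column_store(column_store)
print(f"row store:    avg={row_avg:.2f}, values read={row_reads}")
print(f"column store: avg={col_avg:.2f}, values read={col_reads}")
```

Both layouts give the same answer, but the column layout touches a third of the values for this single-column aggregate.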
Amazon Redshift automatically compresses your data
• Compression saves space and reduces disk I/O
• COPY automatically analyzes and compresses your data
– Samples data; selects best compression encoding
– Supports: byte dictionary, delta, mostly n, run length, text
• Customers see 4-8x space savings with real data
– 20x and higher possible, depending on the data set
• Run ANALYZE COMPRESSION to see details
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
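As a rough intuition for two of the encodings named above (run length and delta), the following Python sketch shows the general idea on small lists. It is a textbook-style illustration under simplifying assumptions, not Redshift's on-disk implementation, and the helper names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(values):
    """Keep the first value, then only the difference to the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    result = []
    for index, delta in enumerate(deltas):
        result.append(delta if index == 0 else result[-1] + delta)
    return result


states = ["CA", "CA", "CA", "WA", "WA", "FL"]
list_ids = [1000, 1001, 1003, 1006]

print(run_length_encode(states))
print(delta_encode(list_ids))
```

Run length pays off on columns with long runs of identical values; delta pays off on columns such as sequential IDs, where neighbouring values differ by small amounts.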
Amazon Redshift architecture
• Leader Node – SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes – Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB
• Single node version available
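The leader/compute split can be pictured as a scatter-gather pattern. The Python sketch below is an analogy only: threads stand in for compute nodes, and the function names and sample data are assumptions made for illustration rather than anything from the Redshift implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_partial(local_ages):
    """Each 'compute node' aggregates only the slice of data it stores."""
    return sum(local_ages), len(local_ages)


def leader_average(node_slices):
    """The 'leader' fans the work out, then merges the partial results."""
    with ThreadPoolExecutor(max_workers=len(node_slices)) as pool:
        partials = list(pool.map(compute_node_partial, node_slices))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


# Hypothetical data, already distributed across three nodes.
node_slices = [[20, 25], [40], [31, 52, 18]]
average = leader_average(node_slices)
print(average)
```

The point of the pattern is that each node returns a small partial result (a sum and a count), so only a few numbers travel back to the leader instead of the raw rows.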
[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
Amazon Redshift runs on optimized hardware
HS1.8XL: 128 GB RAM, 16 Cores, 24 Spindles, 16 TB compressed user storage, 2 GB/sec scan rate
HS1.XL: 16 GB RAM, 2 Cores, 3 Spindles, 2 TB compressed customer storage
• Optimized for I/O intensive workloads
• High disk density
• Runs in HPC - fast network
• HS1.8XL available on Amazon EC2
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup
• Restore
• Resize
Amazon Redshift lets you start small and grow big
Extra Large Node (HS1.XL) 3 spindles, 2 TB, 16 GB RAM, 2 cores
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)
Eight Extra Large Node (HS1.8XL) 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE
Cluster 2-100 Nodes (32 TB – 1.6 PB)
Note: Nodes not to scale
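The capacity ranges on this slide follow directly from node count times per-node storage. A small Python sketch, using only the figures quoted above (the dictionary layout and function name are assumptions for this example, and it covers multi-node clusters only, not the single-node option):

```python
# Per-node storage and cluster size limits as quoted on the slide.
NODE_TYPES = {
    "HS1.XL": {"tb_per_node": 2, "min_nodes": 2, "max_nodes": 32},
    "HS1.8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def cluster_capacity_tb(node_type, node_count):
    """Return compressed storage in TB for a multi-node cluster."""
    spec = NODE_TYPES[node_type]
    if not spec["min_nodes"] <= node_count <= spec["max_nodes"]:
        raise ValueError(
            f"{node_type} clusters support {spec['min_nodes']}-"
            f"{spec['max_nodes']} nodes, got {node_count}"
        )
    return node_count * spec["tb_per_node"]


for node_type, spec in NODE_TYPES.items():
    low = cluster_capacity_tb(node_type, spec["min_nodes"])
    high = cluster_capacity_tb(node_type, spec["max_nodes"])
    print(f"{node_type}: {low} TB to {high} TB")
```

The largest HS1.8XL cluster works out to 1,600 TB, which is the 1.6 PB quoted on the slide.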
Amazon Redshift is priced to let you analyze all your data
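Plan               | Price per hour (HS1.XL single node) | Effective hourly price per TB | Effective annual price per TB
-------------------+-------------------------------------+-------------------------------+------------------------------
On-Demand          | $0.850                              | $0.425                        | $3,723
1 Year Reservation | $0.500                              | $0.250                        | $2,190
3 Year Reservation | $0.228                              | $0.114                        | $999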
Simple Pricing
Number of Nodes x Cost per Hour
No charge for Leader Node
No upfront costs
Pay as you go
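The per-TB figures in the pricing table can be reproduced from the hourly prices, assuming 2 TB per HS1.XL node (as stated elsewhere in the deck) and 8,760 hours per year. The Python sketch below takes the hourly prices as given on the slide; the names are assumptions for this example, and the prices are those of the presentation, not current ones.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760
HS1_XL_STORAGE_TB = 2      # compressed storage per HS1.XL node

# Hourly prices for one HS1.XL node, copied from the slide.
HOURLY_PRICE = {
    "on_demand": 0.850,
    "1_year_reservation": 0.500,
    "3_year_reservation": 0.228,
}


def effective_hourly_per_tb(plan):
    """Hourly node price divided by the node's storage in TB."""
    return HOURLY_PRICE[plan] / HS1_XL_STORAGE_TB


def effective_annual_per_tb(plan):
    """Effective hourly price per TB over a full year."""
    return effective_hourly_per_tb(plan) * HOURS_PER_YEAR


def cluster_cost_per_hour(plan, node_count):
    """'Number of nodes x cost per hour'; the leader node is not charged."""
    return node_count * HOURLY_PRICE[plan]


for plan in HOURLY_PRICE:
    print(
        f"{plan}: ${effective_hourly_per_tb(plan):.3f}/TB/hour, "
        f"${effective_annual_per_tb(plan):,.0f}/TB/year"
    )
```

Rounded to whole dollars this gives $3,723, $2,190 and $999 per TB per year, matching the table.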
Amazon Redshift is easy to use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Built in security
• Automatic backups
Resize your cluster while remaining online
• New target provisioned in the background
• Only charged for source cluster
• Fully automated
– Data automatically redistributed
• Read only mode during resize
• Parallel node-to-node data copy
• Automatic DNS-based endpoint cutover
• Only charged for one cluster
Amazon Redshift has security built-in
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
• No direct access to compute nodes
• Amazon VPC support
[Diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
Amazon Redshift continuously backs up your data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
Amazon Redshift integrates with multiple data sources
• Amazon DynamoDB
• Amazon Elastic MapReduce
• Amazon Simple Storage Service (S3)
• Amazon Elastic Compute Cloud (EC2)
• AWS Storage Gateway Service
• Corporate Data Center
• Amazon Relational Database Service (RDS)
• More coming soon…
Amazon Redshift provides multiple data loading options
• Upload to Amazon S3
• AWS Import/Export
• AWS Direct Connect
• Work with a partner
– Data Integration
– Systems Integrators
– More coming soon…
Amazon Redshift works with your existing analysis tools
JDBC/ODBC
Amazon Redshift
More coming soon…
One Place to Find Skilled People
Everyone Needs Skilled People
At Home
At Work
In Life
Repeatedly
[Chart: registered members, 2011 to 2013; labelled values: 2 million and 15 million]
• 77 instances across 3 Availability Zones
• Reserved, On-Demand and Spot instances
• Add capacity as required; auto scale
• US East (Northern VA); planned for multi-region
• Our social graph models over 2.5 billion social relationships
• 10M+ growth increments: ready for an additional 10 million users at any point in time
• Tech team of 21; 1 data analyst; total company size 37
• More than 150,000,000 emails sent per month
• 21,000,000+ skills added by members
• 1,500,000+ new members per month
• 1,200,000,000+ social connections imported
• A new member every 2 seconds
We Measure Everything!
Why Measure?
• Business Insights
• KPIs
• Campaign Management
• Behavioural Analysis
• Algorithm Improvements
• Performance Management
Best user experience
History with Redshift
• Amazon Customer since 2010
• Proprietary SQL Data Warehouse 2011
• Rapid Growth 2012
• Redshift Trials 2012
• Redshift Production DW 2013
Data Architecture
[Diagram labels: Data Analyst; Raw Data; Get Data; user actions (Join via Facebook, Add a Skill Page, Invite Friends); Web Servers; Amazon S3; User Action Trace Events]
EMR Hive Scripts Process Content
• Process log files with regular expressions to parse out the info we need.
• Process cookies into useful searchable data such as Session, UserId and API security token.
• Filter out surplus info such as internal Varnish logging.
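The parsing steps above run as Hive scripts on EMR in the SkillPages setup; the deck does not show them. As a language-neutral illustration of the same idea, here is a minimal Python sketch. The log line format, the cookie names' layout and the function name are all assumptions invented for this example.

```python
import re

# Hypothetical access-log format: ip [timestamp] "METHOD path" status "cookies"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>[^"\s]+)" '
    r'(?P<status>\d{3}) "(?P<cookies>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of searchable fields, or None for lines we filter out."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # Lines that do not look like a request (e.g. internal logging)
        # are treated as surplus and dropped.
        return None
    # Split "k1=v1; k2=v2" into a dictionary.
    cookies = dict(
        pair.split("=", 1)
        for pair in match.group("cookies").split("; ")
        if "=" in pair
    )
    return {
        "timestamp": match.group("timestamp"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        "session": cookies.get("Session"),
        "user_id": cookies.get("UserId"),
    }


sample = (
    '10.0.0.1 [12/Mar/2013:10:15:32 +0000] "GET /skills/plumber" 200 '
    '"Session=abc123; UserId=42"'
)
event = parse_log_line(sample)
print(event)
```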
[Diagram labels: Amazon S3; Aggregated Data; Raw Events; Internal Web; Excel; Tableau; Amazon Redshift]
EMR
• Heavy Lifting
• Log Parsing & Data Extraction
• Cookies
• Clickstream
• Directory Generation
• Network Processing
• Process 40GB+ Telemetry data daily
• Reserved & Spot Instances
Redshift Implementation
• High Storage Extra Large (XL) DW Node
– Growing from 2 x DW.HS1.XLARGE nodes
• Reservations
• ETL Activities
– Approx. 90 minutes including exports from RDBMS, copying to S3, loading stage tables, loading target tables, vacuuming and analysing tables
• Schema
• Compression
– Starting to use columnar compression
• Retention
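The ETL sequence listed above (copy to S3, load stage tables, load target tables, vacuum, analyse) could be sketched as an ordered list of SQL statements. The Python helper below only builds the statement text; the table names, the `stage_` prefix, the S3 path, the pipe delimiter and the GZIP option are assumptions for illustration and are not taken from the SkillPages pipeline. COPY, INSERT, VACUUM and ANALYZE are standard Redshift commands.

```python
def nightly_load_statements(table, s3_prefix, credentials):
    """Build the load sequence for one table, in execution order.

    All arguments are assumed to be trusted constants from configuration;
    this sketch does no escaping and must not be fed user input.
    """
    stage = f"stage_{table}"  # hypothetical staging-table naming scheme
    return [
        # 1. Bulk-load the exported files from S3 into the stage table.
        f"COPY {stage} FROM '{s3_prefix}' "
        f"CREDENTIALS '{credentials}' DELIMITER '|' GZIP;",
        # 2. Move the staged rows into the target table.
        f"INSERT INTO {table} SELECT * FROM {stage};",
        # 3. Re-sort rows and reclaim space after the load.
        f"VACUUM {table};",
        # 4. Refresh the statistics used by the query planner.
        f"ANALYZE {table};",
    ]


statements = nightly_load_statements(
    table="events",
    s3_prefix="s3://example-bucket/events/",
    credentials="aws_access_key_id=<key>;aws_secret_access_key=<secret>",
)
for statement in statements:
    print(statement)
```

Running VACUUM and ANALYZE after the load keeps sort order and statistics current, which is why they appear in the 90-minute ETL window.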
DW Anatomy
Dimension     | Purpose
--------------+---------------------------------------------------------------------------------
Users         | Analyse the composition of the user base
Events        | Analyse significant actions that reflect user activity & behaviour
Clickstream   | Analyse user browsing and landing events at a page level
Email         | Click through and Cohort Analysis
Notifications | Analyse user to user messaging: which users are mailing which users, and when
Sessions      | Traffic & Visit Analysis
Skills        | Analyse Skills by Classification and User Context
Opportunities | Analyse Opportunities by Classification, User Context and response rate
Search        | Analyse and quantify the characteristics of each search made on the platform
Performance
Accessing Data
• Consumers
• Tableau
• Excel/PowerPivot
• Technical Team
• SQL Workbench
– Driver: JDBC for PostgreSQL 8.xx
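Because the tools connect through the PostgreSQL JDBC driver, the connection string follows the usual `jdbc:postgresql://host:port/database` shape. The Python helper below only assembles such a URL; the host name and database name are placeholders invented for this example, 5439 is Redshift's default port, and `ssl=true` is the PostgreSQL JDBC driver's SSL parameter.

```python
def redshift_jdbc_url(host, database, port=5439, use_ssl=True):
    """Assemble a PostgreSQL-style JDBC URL for a Redshift endpoint."""
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    if use_ssl:
        url += "?ssl=true"  # encrypt data in transit
    return url


# Placeholder endpoint; a real one is shown in the cluster's console page.
url = redshift_jdbc_url(
    host="examplecluster.example.us-east-1.redshift.amazonaws.com",
    database="dw",
)
print(url)
```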
Data Visualisation
Redshift - Nice to haves
• Ability to load LZO-compressed files from S3
• Additional analytical functions e.g. MEDIAN
• Hierarchies
• ETL tool working with S3, many database vendors
Why Redshift works for SkillPages
• Scale - MPP
• Performance
• Columnar
• Platform Integration
• S3, Dynamo
• Operational Advantages
• Ease of Access
• Cost
Thank you!
Customer Use Case
Mike McCarthy
CTO, SkillPages
Resources & Questions
• Steffen Krause | @AWS_Aktuell
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
• https://www.jaspersoft.com/webinar-AWS-Agile-Reporting-and-Analytics-in-the-Cloud