REDSHIFT - Amazon

Page 1: REDSHIFT - Amazon

Amazon Redshift

Author: Douglas Bernardini

Page 2: REDSHIFT - Amazon

What is Redshift?
• Cloud-hosted data warehouse service from AWS
• Massively parallel processing (MPP)
• Analytics workloads on large-scale datasets, up to petabytes
• Data stored on a column-oriented DBMS principle
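To make the column-oriented idea concrete, here is a minimal Python sketch. It is not Redshift code: the sample rows and column names are invented for illustration. It only shows why an aggregate over one column can touch less data when values are stored per column rather than per row.

```python
# Illustrative only: a toy comparison of row-oriented and column-oriented
# layouts. The sample rows and column names are invented for this sketch.

rows = [
    {"song_id": 1, "artist": "A", "duration": 210.0},
    {"song_id": 2, "artist": "B", "duration": 185.5},
    {"song_id": 3, "artist": "A", "duration": 242.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def average_duration_row_store(records):
    """Row layout: every full row is visited to read a single field."""
    return sum(r["duration"] for r in records) / len(records)


def average_duration_column_store(columns):
    """Column layout: only the 'duration' column is touched."""
    durations = columns["duration"]
    return sum(durations) / len(durations)


columns = to_columns(rows)
```

Both functions return the same average; the difference is that the column version never reads `song_id` or `artist`.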

Page 3: REDSHIFT - Amazon

Features and Benefits
• Columnar storage
• Parallelized queries
• Multiple nodes
• Custom JDBC and ODBC drivers
• Ready-made integrations:
    • Amazon S3
    • Amazon DynamoDB
    • Amazon Elastic MapReduce
    • Amazon Kinesis
    • Any SSH-enabled host
• Fault tolerant
• Automated backups
• Fast restores
• Secure:
    • Encryption
    • Network isolation
    • Audit and compliance
• SQL friendly
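The "parallelized queries" and "multiple nodes" points can be pictured with a small Python sketch. This is a toy model under stated assumptions, not how Redshift is implemented: the node count, the hash-based placement, and the sample rows are all hypothetical. Rows are spread across nodes by a key, each node computes a partial aggregate, and a final step merges the partial results.

```python
# Illustrative only: a toy model of massively parallel processing (MPP).
# Rows are spread across "nodes" by hashing a key; each node computes a
# partial aggregate, and a leader step merges the partial results.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NODE_COUNT = 4  # hypothetical cluster size


def node_for(key, node_count=NODE_COUNT):
    """Pick a node deterministically from the distribution key."""
    return crc32(key.encode("utf-8")) % node_count


def distribute(rows, node_count=NODE_COUNT):
    """Assign each (artist, plays) row to exactly one node."""
    nodes = [[] for _ in range(node_count)]
    for artist, plays in rows:
        nodes[node_for(artist, node_count)].append((artist, plays))
    return nodes


def partial_sum(node_rows):
    """Work done independently on one node: SUM(plays) GROUP BY artist."""
    totals = Counter()
    for artist, plays in node_rows:
        totals[artist] += plays
    return totals


def parallel_group_sum(rows, node_count=NODE_COUNT):
    """Run the per-node aggregates concurrently, then merge them."""
    nodes = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(partial_sum, nodes))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


rows = [("A", 3), ("B", 5), ("A", 4), ("C", 1), ("B", 2)]
totals = parallel_group_sum(rows)
```

Because all rows for a given key land on the same node, each node's partial result is already complete for its keys and the merge step is cheap.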

Page 4: REDSHIFT - Amazon

Marketplace

BI Tools
• Actian
• Actuate Corporation
• Birst
• Chartio
• ClearStory Data
• Dundas Data Visualization
• Infor
• Jaspersoft
• JReport
• Logi Analytics
• Looker (Software)
• MicroStrategy
• Pentaho
• Periscope.io
• Qlik
• Redrock BI
• SAS (software)
• SiSense
• Spotfire
• Tableau Software

Data Integration Tools
• Attunity
• FlyData
• Informatica
• SnapLogic
• Talend
• Xplenty

Page 5: REDSHIFT - Amazon

Data Load

Page 6: REDSHIFT - Amazon

DynamoDB Integration

Page 7: REDSHIFT - Amazon

DynamoDB Integration

Page 8: REDSHIFT - Amazon

Business Case

Page 9: REDSHIFT - Amazon

Data growing fast!
• Enterprise data is growing at an exponential rate
    • Structured and unstructured data
    • Data requirements change rapidly
• Cost to maintain data is prohibitive
    • Hardware not scalable
    • Expensive to support
• Business agility suffers
    • Reporting unable to change with the pace of business
    • Data silos create bottlenecks

Page 10: REDSHIFT - Amazon

Solution Proposal
• Leverage the flexibility of Amazon Web Services
    • Scalable
    • Flexible
    • Cost-effective
• AWS Redshift
    • Data warehouse
• AWS S3
    • Persistent storage
• AWS Data Pipeline
    • Data orchestration and ETL
• AWS EC2 / MySQL
    • Transaction processing
• Qlik Sense Desktop
    • Business intelligence reporting

Page 11: REDSHIFT - Amazon

AWS Redshift
Petabyte-Scale Data Warehouse

• Optimized for data warehousing
    • Columnar storage
    • Data compression
    • Zone maps to reduce I/O
• Scalable
    • Easily change the number of nodes
    • 1-32 node configurations
• Cost-efficient
    • On-demand pricing starts at $0.25/hr
    • Run for as low as $1,000 per TB per year
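The "zone maps to reduce I/O" point can be sketched in a few lines of Python. This is a toy illustration, not Redshift's implementation: the block size and the sample data are invented. Each block of a column records its minimum and maximum value, so a range filter can skip any block whose range cannot contain a match.

```python
# Illustrative only: a toy zone map. Each block of a column keeps its
# minimum and maximum value so a range filter can skip blocks that cannot
# contain matching rows. Block size and data are invented for this sketch.

BLOCK_SIZE = 4  # hypothetical; real block sizes are far larger


def build_zone_map(values, block_size=BLOCK_SIZE):
    """Split a column into blocks and record (min, max) per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    zone_map = [(min(block), max(block)) for block in blocks]
    return blocks, zone_map


def scan_range(blocks, zone_map, low, high):
    """Return values in [low, high] and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # the zone map proves no row in this block can match
        blocks_read += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read


# Sorted data (for example, years) gives tight, non-overlapping ranges.
years = [1990, 1991, 1992, 1993, 1998, 1999, 2000, 2001, 2005, 2006, 2007, 2008]
blocks, zone_map = build_zone_map(years)
matches, blocks_read = scan_range(blocks, zone_map, 1999, 2001)
```

With this sample data only one of the three blocks is read. The benefit depends on the data being sorted or clustered on the filtered column; on unsorted data the block ranges overlap and fewer blocks can be skipped.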

Page 12: REDSHIFT - Amazon

AWS Redshift
Petabyte-Scale Data Warehouse

• Get started in minutes
    • Web console
    • CLI
• Fully managed
• Fault tolerant
• Automated backups / fast restores
• Encryption
    • Data at rest: AES-256
    • Can manage own keys
• Compatible
    • SQL
    • Data integrations

Page 13: REDSHIFT - Amazon

AWS Simple Storage Service (S3)
Online File/Object Storage

• Durable
    • Data redundantly stored across multiple facilities/devices
• Available
    • 99.99% availability
    • Choose from different AWS regions
• Secure
    • SSL for data transfer
    • Auto-encrypted at rest
• Scalable
    • Flexible capacity based on data demands
• Low cost
    • Pay for what you use
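As a worked piece of arithmetic, the 99.99% availability figure on this slide can be converted into an equivalent amount of downtime per year. The sketch below is plain arithmetic on the slide's number and assumes a 365-day year; it makes no claim about any service-level agreement.

```python
# Illustrative only: converts an availability percentage into the
# equivalent number of unavailable minutes per year (365-day year assumed).


def downtime_minutes_per_year(availability_percent, days_per_year=365):
    """Minutes per year during which the service may be unavailable."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * days_per_year * 24 * 60


minutes = downtime_minutes_per_year(99.99)
print(f"{minutes:.2f} minutes per year")
```

For 99.99% this comes to a little under an hour per year.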

Page 14: REDSHIFT - Amazon

AWS Data Pipeline
Data Processing and Transfer Platform

• Reliable
    • Distributed infrastructure ensures activity completion
    • Integrated with SNS for event notifications
• Simple
    • Drag-and-drop console
    • Pre-built templates for other AWS services
    • Visual pipeline editor
• Scalable
    • Dispatch work to one machine or many
    • Serial and/or parallel processing
• Low cost
    • Charged per pipeline
    • Frequency
    • Volume

Page 15: REDSHIFT - Amazon

AWS Elastic Compute Cloud (EC2) + MySQL
Cloud Infrastructure for Applications & Development

• Flexible
    • Linux and Windows virtual machines
    • Supports multiple instance types, software packages, and resource configurations
• Elastic
    • Increase or decrease capacity within minutes
    • Commission any number of server instances simultaneously
• Secure
    • Security groups / network ACLs
    • VPC / VPN
• Low cost
    • On-Demand / Reserved / Spot Instance options

Page 16: REDSHIFT - Amazon

Qlik Sense Desktop
Data Visualization / BI Tool

• Drag-and-drop visualizations
• Smart search
• Explore multiple data sources in a single dashboard/report
• Access analytics on multiple device types
• Collaborate and share insights within reports
• Enables self-service simplicity

Page 17: REDSHIFT - Amazon

Architecture

Page 18: REDSHIFT - Amazon

Demo

Page 19: REDSHIFT - Amazon

Tech Demo

• During this demonstration, we will discuss how to set up and run Amazon Redshift as an on-demand, cloud-based data warehouse solution.

• Our sample data comes from the “Million Song Dataset” available from Columbia University - http://labrosa.ee.columbia.edu/millionsong/

• The BI Tool that is used to create a business-focused dashboard is Qlik Sense Desktop, a Windows-based desktop application - http://www.qlik.com/us/explore/products/sense

• In addition, the following services in the Amazon Web Services stack are used: Amazon Redshift, Amazon S3, AWS Data Pipeline, and Amazon EC2 (a Linux AMI running MySQL serves as the transactional database for the demo).

Page 20: REDSHIFT - Amazon

Demo Steps

1. Create a new Linux AMI instance that will host MySQL for transaction data processing.
    • Start a new Linux instance and update security groups for MySQL accessibility
    • Install MySQL
    • Create new MySQL users and a database, and populate it with the demonstration dataset (using MySQL Workbench)

2. Create a new S3 bucket for the Pipeline ETL processes.

3. Create the Redshift cluster (data warehouse).
    • Instantiate the cluster
    • Connect using SQL Workbench (via JDBC)
    • Create the initial data table

4. Create AWS Pipeline(s) for data processing.
    • MySQL -> S3
        • Activate the pipeline for the initial ETL from MySQL to S3
    • S3 -> Redshift
        • Activate the pipeline for the initial ETL from S3 to Redshift

5. Install Qlik Sense Desktop.
    • Install the Redshift ODBC drivers locally on the desktop
    • Create the Qlik Sense “Report” (included in FP submission for simplicity). Verify the initial data in the report.

6. Solution demonstration (using the Amazon CLI, the Command Line Interface).
    • Simulate a transactional data load in MySQL
    • Verify the new data (record count) in MySQL using MySQL Workbench
    • Delete the initial data in the S3 bucket (from Round 1)
    • Trigger the AWS Pipeline that loads data to S3 from MySQL
    • Verify the data load (CSV file) in the S3 bucket
    • Trigger the AWS Pipeline that loads data to Redshift from S3
    • Verify the data load in Redshift (using SQL Workbench)
    • Refresh the Qlik report to view analytics of the initial data load
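The deck does not show the pipeline definitions. One common way to load CSV files from S3 into Redshift is the SQL `COPY` command, and the Python sketch below only builds the text of such a statement for the "S3 -> Redshift" step. The bucket name, key prefix, and IAM role ARN are hypothetical placeholders, and the helper `build_copy_statement` is invented for this illustration; it is not the author's pipeline configuration.

```python
# Illustrative only: builds the text of a Redshift COPY statement for the
# "S3 -> Redshift" step. Bucket, key prefix, table, and IAM role ARN are
# hypothetical placeholders; the deck does not show the real values.


def build_copy_statement(table, bucket, prefix, iam_role_arn, skip_header=True):
    """Return a COPY statement that loads CSV files from S3 into a table."""
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    parts = [
        f"COPY {table}",
        f"FROM 's3://{bucket}/{prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if skip_header:
        parts.append("IGNOREHEADER 1")  # skip the CSV header row
    return "\n".join(parts) + ";"


statement = build_copy_statement(
    table="song_data",
    bucket="example-etl-bucket",
    prefix="mysql-export/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

The resulting statement could be run from SQL Workbench over the JDBC connection from step 3; executing it requires a real cluster, bucket, and role, which this sketch does not provide.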

Page 21: REDSHIFT - Amazon

Linux AMI hosts MySQL

Page 22: REDSHIFT - Amazon

Redshift Cluster

Page 23: REDSHIFT - Amazon

Pipes

Page 24: REDSHIFT - Amazon

QlikSense Desktop

Page 25: REDSHIFT - Amazon

Add new data into MySQL

Insert songs_data

Count (*)

Page 26: REDSHIFT - Amazon

Checking Redshift

SELECT COUNT(*) FROM song_data
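The record-count check used in the demo (count in the source, load, count in the target) can be sketched end to end in Python. To keep the example runnable anywhere, in-memory `sqlite3` databases stand in for both MySQL and Redshift, and a CSV string stands in for the file in the S3 bucket. The column names and sample rows are assumptions; only the table name `song_data` comes from the slide.

```python
# Illustrative only: sqlite3 stands in for both MySQL (source) and Redshift
# (target) so the sketch runs anywhere. Column names and rows are assumed.
import csv
import io
import sqlite3


def make_db(table):
    """Create an in-memory database with an empty two-column table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (song_id INTEGER, title TEXT)")
    return conn


def row_count(conn, table):
    """Equivalent of: SELECT COUNT(*) FROM <table>."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def export_csv(conn, table):
    """Source -> CSV text (the file that would land in the S3 bucket)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(conn.execute(f"SELECT song_id, title FROM {table}"))
    return buffer.getvalue()


def load_csv(conn, table, csv_text):
    """CSV text -> target table (the step a warehouse load would perform)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)


source = make_db("song_data")
target = make_db("song_data")
source.executemany(
    "INSERT INTO song_data VALUES (?, ?)",
    [(1, "First Song"), (2, "Second Song"), (3, "Third, with comma")],
)

load_csv(target, "song_data", export_csv(source, "song_data"))
counts_match = row_count(source, "song_data") == row_count(target, "song_data")
```

Matching counts are a quick sanity check, not proof that every value arrived intact; a fuller check would also compare checksums or sampled rows.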

Page 27: REDSHIFT - Amazon

Qlik Update

Page 28: REDSHIFT - Amazon

Results
• Amazon Web Services provides a powerful platform to extend on-premises infrastructure to the cloud
    • Enables massive data consolidation
    • Efficient ETL orchestration and workflow
    • Simplifies resource management and drives down computing costs across multiple services
• Changing needs of business executives can be met quickly and efficiently
• AWS supports industry-standard data source connections
• Existing reports and dashboards can consume AWS Redshift data with no code changes

Page 29: REDSHIFT - Amazon

Questions?