REDSHIFT - Amazon

1

Amazon Redshift

Author: Douglas Bernardini

2

What is Redshift?
• Cloud-hosted data warehouse service from AWS
• Massively parallel processing (MPP)
• Analytics workloads on large-scale datasets, up to petabytes
• Data stored on column-oriented DBMS principles
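As a rough illustration of the massively parallel processing idea above, the sketch below spreads rows across a few hypothetical "nodes" by hashing a key, lets each node compute a partial aggregate over its own rows, and merges the partial results. The node count, row layout, and function names are assumptions made for illustration; this is not a description of Redshift's internals.

```python
import zlib
from collections import defaultdict

def distribute(rows, num_nodes):
    """Assign each (key, value) row to a node by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        # crc32 gives a stable hash across runs (unlike built-in hash() for str)
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id].append((key, value))
    return nodes

def partial_sum(node_rows):
    """Per-node work: sum values by key over the local rows only."""
    totals = defaultdict(int)
    for key, value in node_rows:
        totals[key] += value
    return dict(totals)

def merge(partials):
    """Coordinator step: combine the per-node partial results."""
    result = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            result[key] += value
    return dict(result)

# Hypothetical sample data: (genre, play_count)
rows = [("rock", 3), ("jazz", 5), ("rock", 4), ("pop", 1), ("jazz", 2)]
nodes = distribute(rows, num_nodes=3)
totals = merge(partial_sum(n) for n in nodes)
```

Because rows with the same key always hash to the same node, each node can aggregate independently and the merge step only has to combine small partial results.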

3

Features and Benefits
• Columnar storage
• Parallelized queries
• Multiple nodes
• Custom JDBC and ODBC drivers
• Ready-made integration with:
  • Amazon S3
  • Amazon DynamoDB
  • Amazon Elastic MapReduce
  • Amazon Kinesis
  • Any SSH-enabled host
• Fault tolerant
• Automated backups
• Fast restores
• Secure:
  • Encryption
  • Network isolation
  • Audit and compliance
• SQL friendly
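To make the columnar storage point concrete, here is a minimal sketch that pivots row-oriented records into per-column lists, so an aggregate over one column only touches that column's values. The sample records and the `to_columns` helper are hypothetical and only illustrate the layout idea, not any Redshift API.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

# Hypothetical row-oriented records
rows = [
    {"song_id": 1, "artist": "Artist A", "duration": 210.0},
    {"song_id": 2, "artist": "Artist B", "duration": 180.0},
    {"song_id": 3, "artist": "Artist C", "duration": 240.0},
]

columns = to_columns(rows)

# An analytic query such as AVG(duration) reads a single column,
# not every field of every row.
avg_duration = sum(columns["duration"]) / len(columns["duration"])
```

Storing each column contiguously also tends to help compression, since neighbouring values share a type and often a similar range.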

4

Marketplace

BI Tools
• Actian
• Actuate Corporation
• Birst
• Chartio
• ClearStory Data
• Dundas Data Visualization
• Infor
• Jaspersoft
• JReport
• Logi Analytics
• Looker (Software)
• MicroStrategy
• Pentaho
• Periscope.io
• Qlik
• Redrock BI
• SAS (software)
• SiSense
• Spotfire
• Tableau Software

Data Integration Tools
• Attunity
• FlyData
• Informatica
• SnapLogic
• Talend
• Xplenty

5

Data Load

6

DynamoDB Integration

7

DynamoDB Integration

8

Business Case

9

Data growing fast!
• Enterprise data is growing at an exponential rate
  • Structured and unstructured data
  • Data requirements change rapidly
• Cost to maintain data is prohibitive
  • Hardware not scalable
  • Expensive to support
• Business agility suffers
  • Reporting unable to change with the pace of business
  • Data silos create bottlenecks

10

Solution Proposal

• Leverage the flexibility of Amazon Web Services
  • Scalable
  • Flexible
  • Cost-effective
• AWS Redshift
  • Data warehouse
• AWS S3
  • Persistent storage
• AWS Data Pipeline
  • Data orchestration and ETL
• AWS EC2 / MySQL
  • Transaction processing
• Qlik Sense Desktop
  • Business intelligence reporting

11

AWS Redshift

Petabyte-Scale Data Warehouse

• Optimized for DW
  • Columnar storage
  • Data compression
  • Zone maps to reduce I/O
• Scalable
  • Easily change the number of nodes
  • 1-32 node configurations
• Cost-Efficient
  • On-demand pricing starts at $0.25/hr
  • Run for as low as $1,000 per TB per year
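The "zone maps to reduce I/O" bullet can be sketched as follows: keep the minimum and maximum value of each storage block, and skip any block whose range cannot overlap the query's filter. The block size, sample data, and function names here are assumptions for illustration, not Redshift's actual block format.

```python
def build_zone_map(values, block_size):
    """Split a column into blocks and record each block's min and max."""
    zone_map = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        zone_map.append((start, min(block), max(block)))
    return zone_map

def scan_range(values, zone_map, block_size, low, high):
    """Return values in [low, high] and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for start, block_min, block_max in zone_map:
        if block_max < low or block_min > high:
            continue  # the block's range cannot contain a match, so skip it
        blocks_read += 1
        block = values[start:start + block_size]
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read

# Hypothetical "year" column, stored in sorted order
years = [1990, 1991, 1992, 1993, 2000, 2001, 2002, 2003, 2010, 2011, 2012, 2013]
zone_map = build_zone_map(years, block_size=4)
matches, blocks_read = scan_range(years, zone_map, 4, low=2001, high=2002)
```

In this sample only one of the three blocks is read. The benefit depends on how well the data is ordered: when values are scattered randomly, most block ranges overlap the filter and few blocks can be skipped.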

12

AWS Redshift

Petabyte-Scale Data Warehouse

• Get started in minutes
  • Web console
  • CLI
• Fully managed
• Fault tolerant
• Automated backups / fast restores
• Encryption
  • Data at rest: AES-256
  • Can manage own keys
• Compatible
  • SQL
  • Data integrations

13

AWS Simple Storage Service (S3)

Online File/Object Storage

• Durable
  • Data redundantly stored across multiple facilities/devices
• Available
  • 99.99% availability
  • Choose from different AWS regions
• Secure
  • SSL for data transfer
  • At rest: auto-encrypted
• Scalable
  • Flexible capacity based on data demands
• Low Cost
  • Pay for what you use

14

AWS Data Pipeline

Data Processing and Transfer Platform

• Reliable
  • Distributed infrastructure ensures activity completion
  • Integrated with SNS for event notifications
• Simple
  • Drag-and-drop console
  • Pre-built templates for other AWS services
  • Visual pipeline editor
• Scalable
  • Dispatch work to one machine or many
  • Serial and/or parallel processing
• Low Cost
  • Charged per pipeline
  • Frequency
  • Volume

15

AWS Elastic Compute Cloud (EC2) + MySQL

Cloud Infrastructure for Applications & Development

• Flexible
  • Linux and Windows virtual machines
  • Supports multiple instance types, software packages, resource configs
• Elastic
  • Increase/decrease capacity within minutes
  • Commission any number of server instances simultaneously
• Secure
  • Security Groups / Network ACLs
  • VPC / VPN
• Low Cost
  • On-Demand / Reserved / Spot Instance options

16

Qlik Sense Desktop

Data Visualization / BI Tool

• Drag-and-drop Visualizations

• Smart Search

• Explore Multiple data sources in single dashboard/report

• Access analytics on multiple device types

• Collaborate and share insights within reports

• Enables self-service simplicity

17

Architecture

18

Demo

19

Tech Demo

• This demonstration covers how to set up and use Amazon Redshift as an on-demand, cloud-based data warehouse solution.

• Our sample data comes from the “Million Song Dataset” available from Columbia University - http://labrosa.ee.columbia.edu/millionsong/

• The BI tool used to create a business-focused dashboard is Qlik Sense Desktop, a Windows-based desktop application - http://www.qlik.com/us/explore/products/sense

• In addition, the following services in the Amazon Web Services stack are used: Amazon Redshift, Amazon S3, AWS Data Pipeline, and EC2 (a Linux AMI running MySQL serves as the transactional database for the demo).

20

Demo Steps
1. Create a new Linux instance (from a Linux AMI) that will host MySQL for transaction data processing.
  • Start new Linux instance and update security groups for MySQL accessibility
  • Install MySQL
  • Create new MySQL users and a database, and populate it with the demonstration dataset (using MySQL Workbench)
2. Create new S3 bucket for Pipeline ETL processes
3. Create Redshift cluster (data warehouse)
  • Instantiate cluster
  • Connect using SQL Workbench (via JDBC)
  • Create initial data table
4. Create AWS Pipeline(s) for data processing
  • MySQL -> S3
    • Activate Pipeline for initial ETL from MySQL to S3
  • S3 -> Redshift
    • Activate Pipeline for initial ETL from S3 to Redshift
5. Install Qlik Sense Desktop
  • Install Redshift ODBC drivers locally on desktop
  • Create Qlik Sense “Report” (included in FP submission for simplicity). Verify initial data in report.
6. Solution demonstration (using the Amazon CLI, the command line interface)
  • Simulate transactional data load in MySQL
  • Verify new data (record count) in MySQL using MySQL Workbench
  • Delete initial data in S3 bucket (from Round 1)
  • Trigger AWS Pipeline that loads data to S3 from MySQL
  • Verify data load (CSV file) in S3 bucket
  • Trigger AWS Pipeline that loads data to Redshift from S3
  • Verify data load in Redshift (using SQL Workbench)
  • Refresh Qlik report to view analytics of initial data load
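The two ETL hops in the demo steps (MySQL -> S3 as a CSV file, then S3 -> Redshift) can be sketched in a few lines. The example below only builds the CSV text and the SQL string; it does not connect to any service. The bucket name, IAM role ARN, and helper names are hypothetical, and authorizing the load with `IAM_ROLE` is an assumption, since the deck does not say how the pipeline authenticates. The table name `song_data` is taken from the count query used later in the demo.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize rows to CSV text, as an export step might before uploading to S3."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

def build_copy_statement(table, s3_uri, iam_role_arn):
    """Build a Redshift COPY statement that loads CSV data from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV;"
    )

# Hypothetical rows standing in for the MySQL export
csv_text = rows_to_csv([(1, "Song A", 1999), (2, "Song B", 2005)])

# Hypothetical bucket and role; replace with real values before running the SQL
copy_sql = build_copy_statement(
    table="song_data",
    s3_uri="s3://example-bucket/export/song_data.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

The resulting `copy_sql` string could then be run from a SQL client connected to the cluster, such as the SQL Workbench session mentioned in step 3.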

21

Linux AMI hosts MySQL

22

Redshift Cluster

23

Pipes

24

Qlik Sense Desktop

25

Add New data into MySQL

Insert songs_data

Count (*)

26

Checking Redshift

SELECT COUNT(*) FROM song_data;
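The record-count check above can be scripted as a before/after comparison around a load. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in so that it runs anywhere; the demo itself runs the query against Redshift through SQL Workbench. The column names and sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE song_data (song_id INTEGER, title TEXT)")

def count_rows(connection):
    """Run the same count query used in the demo and return the number."""
    return connection.execute("SELECT COUNT(*) FROM song_data").fetchone()[0]

# Initial load
conn.executemany(
    "INSERT INTO song_data VALUES (?, ?)",
    [(1, "Song A"), (2, "Song B")],
)
before = count_rows(conn)

# Simulated incremental load
conn.executemany("INSERT INTO song_data VALUES (?, ?)", [(3, "Song C")])
after = count_rows(conn)

# Number of rows added by the incremental load
loaded = after - before
```

Comparing `loaded` with the number of rows inserted at the source is a simple sanity check that the pipeline moved everything it should have.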

27

Qlik Update

28

Results
• Amazon Web Services provides a powerful platform to extend on-premises infrastructure to the cloud
  • Enables massive data consolidation
  • Efficient ETL orchestration and workflow
  • Simplifies resource management and drives down computing costs across multiple services
• Changing needs of business executives can be met quickly and efficiently
• AWS supports industry-standard data source connections
• Existing reporting/dashboards can consume AWS Redshift data with no code changes

29

[email protected]

Questions?

Data & Analytics
