© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. November 30, 2016 Migrating Your Data Warehouse to Amazon Redshift DAT202 Pavan Pothukuchi, Sr. Manager PM, Amazon Redshift Ali Khan, Director of BI and Analytics, Scholastic Laxmikanth Malladi, Principal Architect, Northbay Solutions

AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Page 1: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

November 30, 2016

Migrating Your Data Warehouse to Amazon Redshift

DAT202

Pavan Pothukuchi, Sr. Manager PM, Amazon Redshift

Ali Khan, Director of BI and Analytics, Scholastic

Laxmikanth Malladi, Principal Architect, Northbay Solutions

“It’s our biggest driver of growth in our biggest markets, and is a feature of the company” …on Data Mining in Redshift
- Chris Lambert, Lyft CTO

“The doors were blown wide open to create custom dashboards for anyone to instantly go in and see and assess what is going in our ad delivery landscape, something we have never been able to do until now.”
- Bryan Blair, Vevo’s VP of Ad Operations

“Analytical queries are 10 times faster in Amazon Redshift than they were with our previous data warehouse.”
- Yuki Moritani, NTT Docomo Innovation Manager

“We have several petabytes of data and use a massive Redshift cluster. Our data science team can get to the data faster and then analyze that data to find new ways to reduce costs, market products, and enable new business.”
- Yuki Moritani, NTT Docomo Innovation Manager

“We saw a 2x performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement.”
- Naeem Ali, Director of Software Development, Data Science at Cablevision (Optimum)

“Over the last few years, we’ve tried all kinds of databases in search of more speed, including $15k of custom hardware. Of everything we’ve tried, Amazon Redshift won out each time.”
- Periscope Data, Analyst’s Guide to Redshift

“We took Amazon Redshift for a test run the moment it was released. It’s fast. It’s easy. Did I mention it’s ridiculously fast? We’re using it to provide our analysts an alternative to Hadoop.”
- Justin Yan, Data Scientist at Yelp

“The move to Redshift also significantly improved dashboard query performance… Redshift performed ~200% faster than the traditional SQL Server we had been using in the past.”
- Dean Donovan, Product Development at DiamondStream

“…[Redshift] performance has blown away everyone here (we generally see 50-100x speedup over Hive)”
- Jie Li, Data Infrastructure at Pinterest

“450,000 online queries 98 percent faster than previous traditional data center, while reducing infrastructure costs by 80 percent.”
- John O’Donovan, CTO, Financial Times

“We needed to load six months' worth of data, about 10 TB of data, for a campaign. That type of load would have taken about 20 days with our previous solution. By using Amazon Redshift, it only took six hours to load the data.”
- Zhong Hong, VP of Infrastructure, Vivaki (Publicis Groupe)

“We regularly process multibillion row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily.”
- Bob Harris, CTO, Channel 4

“On our previous big data warehouse system, it took around 45 minutes to run a query against a year of data, but that number went down to just 25 seconds using Amazon Redshift”
- Kishore Raja, Director of Strategic Programs and R&D, Boingo Wireless

“Most competing data warehousing solutions would have cost us up to $1 million a year. By contrast, Amazon Redshift costs us just $100,000 all-in, representing a total cost savings of around 90%”
- Joel Cumming, Head of Data, Kik Interactive

“Annual costs of Redshift are equivalent to just the annual maintenance of some of the cheaper on-premises options for data warehouses.”
- Kevin Diamond, CTO, HauteLook (Nordstrom)

“Our data volume keeps growing, and we can support that growth because Amazon Redshift scales so well. We wouldn’t have that capability using the supporting on-premises hardware in our previous solution.”
- Ajit Zadgaonkar, Director of Ops. and Infrastructure, Edmunds

“With Amazon Redshift and Tableau, anyone in the company can set up any queries they like - from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts had in different areas”
- Jon Hoffman, Head of Engineering, Foursquare

Page 2: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Today’s agenda

• Amazon Redshift Overview

• Use cases and benefits

• Migration options

• Scholastic’s use case

• Architecture details

• Technical overview

• Key project learnings

Page 3: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Relational data warehouse

Massively parallel; petabyte scale

Fully managed

HDD and SSD platforms

$1,000/TB/year; starts at $0.25/hour

Amazon Redshift

a lot faster

a lot simpler

a lot cheaper

Page 4: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical

representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any

vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.

Forrester Wave™ Enterprise Data Warehouse Q4 ’15

Page 5: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Selected Amazon Redshift customers

Page 6: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Why migrate to Amazon Redshift?

Versus a transactional database:

• 100x faster

• Scales from GBs to PBs

• Analyze data without storage constraints

Versus an MPP database:

• 10x cheaper

• Easy to provision and operate

• Higher productivity

Versus Hadoop:

• 10x faster

• No programming

• Standard interfaces and integration to leverage BI tools, machine learning, streaming

Page 7: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Migration from Oracle @ Boingo Wireless

2000+ Commercial Wi-Fi locations

1 million+ Hotspots

90M+ ad engagements

100+ countries

Legacy DW: Oracle 11g based DW

Before migration

Rapid data growth slowed analytics

Mediocre IOPS, limited memory, vertical scaling

Admin overhead

Expensive (license, h/w, support)

After migration

180x performance improvement

7x cost savings

Page 8: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Migration from Oracle @ Boingo Wireless

[Chart: cost comparison (axis 0 to 400,000). Exadata $400,000; SAP HANA $300,000; Redshift $55,000]

[Chart: latency in seconds, existing system vs. Redshift, for query performance (1 year of data) and data load performance (1 million records). Bar labels: 7,200 and 2,700; 15 and 15]

7X cheaper than Oracle Exadata

180X faster than Oracle database
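
As a quick check, the two headline ratios on this slide can be reproduced from the chart figures. The sketch below assumes the "180X" claim compares the 2,700-second bar with the 15-second Redshift bar; the slide itself does not say which bars the claim is based on.

```python
# Figures read off the chart on this slide.
exadata_cost = 400_000       # dollars
redshift_cost = 55_000       # dollars
existing_latency_s = 2_700   # assumption: the bar behind the "180X" claim
redshift_latency_s = 15

cost_ratio = exadata_cost / redshift_cost
speedup = existing_latency_s / redshift_latency_s

print(f"cost ratio: {cost_ratio:.1f}x")  # roughly 7x cheaper
print(f"speedup: {speedup:.0f}x")
```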

Page 9: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Migration from Greenplum @ NTT Docomo

68 million customers

10s of TBs per day of data across mobile network

6PB of total data (uncompressed)

Data science for marketing operations, logistics etc.

Legacy DW: Greenplum on-premises

After migration:

125 node DS2.8XL cluster

4,500 vCPUs, 30TB RAM

6 PB uncompressed

10x faster analytic queries

50% reduction in time for new BI app deployment

Significantly less ops. overhead

Page 10: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Migration from SQL on Hadoop @ Yahoo

Analytics for website/mobile events across multiple Yahoo properties

On an average day

2B events

25M devices

Before migration: Hive. Found it to be slow, hard to use, share and repeat

After migration:

21 node DC1.8XL (SSD)

50TB compressed data

100x performance improvement

Real-time insights

Easier deployment and maintenance

Page 11: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Migration from SQL on Hadoop @ Yahoo

[Chart: query time in seconds (log scale, 1 to 10,000), Amazon Redshift vs. Impala, for Count Distinct Devices, Count All Events, Filter Clauses, and Joins]

Page 12: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Business Value and Productivity

Business Productivity Benefits

Analyze more data

Faster time to market

Get better insights

Match capacity with demand

Page 13: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

How to Migrate?

ENGINE X to Amazon Redshift

ETL scripts

SQL in reports

Ad hoc queries

Schema Conversion / Database Migration

Schema & Data Transformation

• Map data types

• Choose compression encoding, sort keys, distribution keys

• Generate and apply DDL

Data Migration

• Bulk load

• Capture updates

• Transformations

Convert SQL Code

Assess Gaps

• Stored procedures

• Functions

(The four steps are numbered 1 to 4 in the slide diagram.)
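
The schema conversion step (map data types, choose compression encoding, sort keys and distribution keys, generate DDL) can be sketched in a few lines. This is only an illustration under stated assumptions: the type map, the `map_type` and `build_ddl` helpers, and the example table are hypothetical and are not how AWS SCT works internally.

```python
# Hypothetical Oracle-to-Redshift type map; extend it for your own schema.
TYPE_MAP = {
    "VARCHAR2": "VARCHAR",
    "NUMBER": "NUMERIC",
    "DATE": "TIMESTAMP",       # Oracle DATE carries a time component
    "CLOB": "VARCHAR(65535)",
}


def map_type(source_type):
    """Translate e.g. 'VARCHAR2(50)' to 'VARCHAR(50)', keeping length/precision."""
    base, paren, rest = source_type.partition("(")
    base = base.strip().upper()
    target = TYPE_MAP.get(base, base)
    if paren and "(" not in target:
        return target + "(" + rest
    return target


def build_ddl(table, columns, dist_key, sort_keys, encodings=None):
    """Build a Redshift CREATE TABLE statement with DISTKEY, SORTKEY and ENCODE."""
    encodings = encodings or {}
    col_lines = []
    for name, source_type in columns:
        line = f"    {name} {map_type(source_type)}"
        if name in encodings:
            line += f" ENCODE {encodings[name]}"
        col_lines.append(line)
    body = ",\n".join(col_lines)
    return (
        f"CREATE TABLE {table} (\n{body}\n)\n"
        f"DISTSTYLE KEY DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


ddl = build_ddl(
    "sales",
    [("order_id", "NUMBER(10)"), ("customer", "VARCHAR2(50)"), ("sold_at", "DATE")],
    dist_key="order_id",
    sort_keys=["sold_at"],
    encodings={"customer": "lzo"},
)
print(ddl)
```

In practice the distribution and sort keys would be chosen from join and filter patterns in the workload, not hard-coded as they are here.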

Page 14: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

AWS Schema Conversion Tool (AWS SCT)

Convert schema in a few clicks

Sources include Oracle, Teradata, Greenplum and Netezza

Automatic schema optimization

Converts application SQL code

Detailed assessment report

Page 15: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

AWS Schema Conversion Tool

Page 16: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

AWS Database Migration Service (AWS DMS)

Start your first migration in a few minutes

Sources include: Aurora, Oracle, SQL Server, MySQL and PostgreSQL

Bulk load and continuous replication

Migrate a TB for $3

Fault tolerant

Page 17: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

AWS DMS: Change data capture

[Diagram: Source, Replication instance, Target. Transactions t1 and t2 update the source; the captured changes are applied to the target after the bulk load]
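
The idea on this slide, changes captured during the bulk load and applied afterwards, can be illustrated with a toy model. This is not AWS DMS code; the `bulk_load` and `apply_changes` helpers and the in-memory dictionaries standing in for source and target tables are assumptions made for the example.

```python
def bulk_load(source_snapshot):
    """Copy a point-in-time snapshot of the source into a new target."""
    return dict(source_snapshot)


def apply_changes(target, changes):
    """Replay captured changes, in commit order, on top of the bulk-loaded target."""
    for op, key, value in changes:
        if op == "delete":
            target.pop(key, None)
        else:  # "insert" and "update" both set the latest value
            target[key] = value
    return target


# Snapshot taken when the bulk load starts.
snapshot = {1: "alice", 2: "bob"}

# Transactions committed on the source while the bulk load was running.
captured = [
    ("update", 2, "bobby"),
    ("insert", 3, "carol"),
    ("delete", 1, None),
]

target = bulk_load(snapshot)
apply_changes(target, captured)
print(target)
```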

Page 18: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Data integration partners

Data Integration Systems Integrators

Amazon Redshift

Page 19: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Beyond Amazon Redshift…

Page 20: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Scholastic, Established 1920

Page 21: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Where were we?

Platform

13+ years old. IBM AS/400 DB2 and Microsoft SQL Server are the primary data warehouse platforms. BI Platform is primarily Microsoft (SSRS, SSAS, Excel, SharePoint)

500+ direct users across every LOB and business function

20+ TB. 5,500+ DB2 workloads, 350+ SQL Server workloads, 15 SSAS cubes, 150+ SSRS reports

Challenges

Inflexible, multi-layered architecture, slow time to market

Inability to meet internal SLAs due to performance of daily ETL processes

Scalability limitations with SQL Server Analysis Services (SSAS) for reports

Limited ability to perform self-service Business Intelligence

Page 22: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Moving forward: Key decision factors

• Improved performance, scalability, availability, logging, security

• Enablement of self-service business intelligence

• Leverage the skill set of current team (Relational DB & SQL)

• Integration with existing technology stack

• Alignment with the tech strategy (DevOps model, Cloud First)

• Ability to support Big Data initiatives

• Team up with an experienced consulting partner

Page 23: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Why we chose AWS and Amazon Redshift

AWS was chosen for its agility, scalability, elasticity, and security

Redshift

• Scalable, fast

• Managed service, cost-optimization models, elastic

• SQL/relational matched skillset of team

S3 was chosen as location for ingestion process

NorthBay was chosen as the implementation partner for their expertise in Big Data and Redshift migrations

Page 24: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

How the project unfolded

Goals

• 3-month pilot to migrate a Functional area in key LOB

• Demonstrate immediate business value

• Use AWS Stack & Open Source for Data Movement from DB2 (No CDC/ETL tool)

Outcomes

• Core Framework for Migration

• ELT Architecture and Validation

• Visualization/Self-service capability through Tableau

Page 25: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Scholastic data cloud: Technical architecture

[Architecture diagram. Labeled components, in the order extracted from the slide:]

• EMR Cluster running Sqoop Script

• Output Bucket

• EC2 Instance running Copy Command

• Redshift (Staging)

• Data Pipeline

• SNS Topic (Pipeline Status) (Pipeline Failure)

• SNS Email Notification

• Lambda (Save Pipeline Stats)

• RDS MySQL Instance (Pipeline Configurations)

• DynamoDB

• Redshift (Enterprise Data Repository)

• AS400 / DB2

• (Staging)

• SQL Server EDW

• Tableau (Reporting Tool)

• Source DBs

• SSAS Cubes

• SSRS Reports

Page 26: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Core Framework

• Jobs and Job Groups are defined as metadata in DynamoDB

• Control-M scheduler, Custom Application and Data Pipeline for Orchestration

• ELT Process with EMR/Sqoop for Extraction. Load and Transform the data through Redshift SQL scripts

• Core Framework enables

  • Restart capability from point of failure

  • Capturing of operational statistics (# of rows updated, etc.)

  • Audit capability (which feed caused the Fact to change, etc.)
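
A restart-from-point-of-failure capability driven by job metadata might look like the sketch below. It is not Scholastic's framework: the `run_job_group` helper is hypothetical, and a plain dictionary stands in for the job status that the slide says is kept as metadata in DynamoDB.

```python
def run_job_group(jobs, status, runners):
    """Run jobs in order, skipping any already marked DONE in `status`.

    Returns (job, rows) pairs for the jobs executed in this invocation, so the
    row counts can be stored as operational statistics.
    """
    executed = []
    for job in jobs:
        if status.get(job) == "DONE":
            continue  # finished in an earlier attempt; do not repeat it
        try:
            rows = runners[job]()
        except Exception:
            status[job] = "FAILED"
            raise
        status[job] = "DONE"
        executed.append((job, rows))
    return executed


# Simulated jobs: the load step fails on its first attempt only.
attempts = {"load": 0}


def flaky_load():
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("simulated load failure")
    return 100


jobs = ["extract", "load", "transform"]
runners = {"extract": lambda: 100, "load": flaky_load, "transform": lambda: 95}
status = {}  # stand-in for the persisted job metadata

try:
    run_job_group(jobs, status, runners)
except RuntimeError:
    pass  # the first attempt stops at the failed load job

rerun = run_job_group(jobs, status, runners)
print(rerun)
```

On the second call the extract job is skipped because its status is already DONE, so the batch resumes at the job that failed.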

Page 27: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Extract

• Pre-create EMR resources at the start of Batch

• Achieve parallelism in Sqoop with mappers and Fair Scheduling

• Sqoop query to add additional fields like Batch_id, Updated_date, etc.

• Data extracts are split and compressed for optimized loading into Redshift

[Diagram: AS400 / DB2, EMR with Sqoop, S3, Metadata, KMS, Data Pipeline; steps 1 to 6; control flow and data flow]
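
Splitting and compressing extracts matters because Redshift's COPY command can load multiple files in parallel. The sketch below shows the idea with the Python standard library only; in the deck the extraction is done by Sqoop on EMR, so the `split_and_compress` helper, the file naming, and the pipe-delimited rows are all assumptions made for illustration.

```python
import gzip
import os
import tempfile


def split_and_compress(lines, out_dir, prefix, parts):
    """Write `lines` round-robin into `parts` gzip files and return their paths."""
    paths = [
        os.path.join(out_dir, f"{prefix}.part_{i:02d}.gz") for i in range(parts)
    ]
    handles = [gzip.open(p, "wt", encoding="utf-8") for p in paths]
    try:
        for i, line in enumerate(lines):
            handles[i % parts].write(line + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


out_dir = tempfile.mkdtemp()
# Each row carries an extra batch identifier, as the Sqoop query on the slide adds.
rows = [f"{i}|item-{i}|batch-42" for i in range(10)]
paths = split_and_compress(rows, out_dir, "orders", parts=4)
print(paths)
```

A common guideline is to make the number of parts a multiple of the number of slices in the cluster so that every slice has work to do during the load.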

Page 28: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Load

• Truncate and Load through Data Pipeline for Staging tables

• Dynamic Workload Management (WLM) queues set up to allow maximum resources during Loading/Transformation

• Check and terminate any locks on tables to allow truncation

• Capture metrics related to number of rows loaded, time taken, etc.

[Diagram: S3, Staging, KMS, Data Pipeline, EC2; steps 1 to 4; control flow and data flow]
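
The truncate-and-load pattern for staging tables comes down to two SQL statements per table. The sketch below only builds the statement text; the `build_load_statements` helper, the table name, bucket, and IAM role ARN are placeholders, and the GZIP and DELIMITER options assume compressed, pipe-delimited extracts.

```python
def build_load_statements(table, s3_prefix, iam_role, delimiter="|"):
    """Return the TRUNCATE and COPY statements for one staging table."""
    return [
        f"TRUNCATE TABLE {table};",
        (
            f"COPY {table} FROM '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' "
            f"GZIP DELIMITER '{delimiter}';"
        ),
    ]


statements = build_load_statements(
    table="staging.orders",
    s3_prefix="s3://example-bucket/orders/",  # placeholder bucket and prefix
    iam_role="arn:aws:iam::123456789012:role/example-copy-role",  # placeholder
)
for statement in statements:
    print(statement)
```

Because COPY reads every object under the prefix, the split files from the extract step are loaded in one command. Note that TRUNCATE in Redshift commits the current transaction, which is one reason open locks on the table have to be cleared first.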

Page 29: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Transform

• Custom Application for building Dimensions and Facts

• SQL Scripts are stored in S3 and executed by ELT process

• SQL scripts refactored from SQL Server and AS400 scripts

• Non-Functional Requirements are achieved through Custom App

[Diagram: S3, Staging, Facts, Dimensions, Metadata, App; steps 1 to 7b; control flow and data flow]
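
An ELT process that executes stored SQL scripts and records how many rows each statement touched could be sketched as follows. The example uses an in-memory SQLite database purely as a stand-in for Redshift so that it runs anywhere; the `run_script` helper, table names, and script text are hypothetical.

```python
import sqlite3


def run_script(conn, script):
    """Execute each ';'-separated statement and record (verb, rows affected).

    The naive split on ';' is an assumption that holds only for scripts
    without semicolons inside string literals.
    """
    stats = []
    cur = conn.cursor()
    for statement in (s.strip() for s in script.split(";")):
        if not statement:
            continue
        cur.execute(statement)
        stats.append((statement.split()[0].upper(), cur.rowcount))
    conn.commit()
    return stats


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE staging_orders (order_id INTEGER, amount REAL);"
    "CREATE TABLE fact_orders (order_id INTEGER, amount REAL);"
    "INSERT INTO staging_orders VALUES (1, 10.0), (2, 25.5), (3, 0.0);"
)

# In the deck the scripts live in S3; here the script is an inline string.
script = """
DELETE FROM fact_orders;
INSERT INTO fact_orders
SELECT order_id, amount FROM staging_orders WHERE amount > 0;
"""

stats = run_script(conn, script)
print(stats)
```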

Page 30: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Schema Design

• Modified Star Schema

• Natural Keys instead of generating unique identifiers

• Commonly used columns from Dimensions are copied over to Facts

• Surrogate keys are eliminated except for a few cases

• Compression

• Define appropriate Distribution and Sort Keys

• Define primary and foreign keys

Page 31: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Security

• AWS Key Management Service (KMS) is used for encrypting access credentials to Source and Target databases

• Jenkins job to allow encrypting of credentials using KMS directly by Database Administrators

• Amazon EMR, Jenkins resources are given KMS decrypt permissions to allow connecting to Sources and Targets during the ELT process

• Standard Security in Transit and at Rest throughout the process

• IAM federation through Enterprise Active Directory
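
Encrypting database credentials with KMS and decrypting them at run time can be sketched with two small helpers. The helpers call `encrypt` and `decrypt` on whatever client is passed in, using the request and response field names of the boto3 KMS client. The `FakeKms` class is a test stand-in that performs no real encryption, and the key alias is a placeholder.

```python
import base64


def encrypt_credential(kms_client, key_id, secret):
    """Encrypt a credential and return it as base64 text suitable for a config store."""
    response = kms_client.encrypt(KeyId=key_id, Plaintext=secret.encode("utf-8"))
    return base64.b64encode(response["CiphertextBlob"]).decode("ascii")


def decrypt_credential(kms_client, encoded):
    """Reverse encrypt_credential; the caller needs KMS decrypt permission."""
    response = kms_client.decrypt(CiphertextBlob=base64.b64decode(encoded))
    return response["Plaintext"].decode("utf-8")


class FakeKms:
    """Stand-in that mimics the two client calls used above. NOT real encryption."""

    def encrypt(self, KeyId, Plaintext):
        return {"CiphertextBlob": Plaintext[::-1]}

    def decrypt(self, CiphertextBlob):
        return {"Plaintext": CiphertextBlob[::-1]}


kms = FakeKms()  # with AWS access this would be boto3.client("kms")
stored = encrypt_credential(kms, "alias/example-etl-key", "db-password")
recovered = decrypt_credential(kms, stored)
print(stored != "db-password", recovered == "db-password")
```

Passing the client in as a parameter keeps the helpers testable without AWS access, and mirrors the split on the slide: administrators encrypt through one job, while the ELT resources only need decrypt permission.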

Page 32: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Reporting

• Business users access Facts/Dimensions through Tableau

• Power users access Staging tables through Tableau

• Data Analysts access files in S3 using Hive/Presto

• Self-Service capability across business users

[Diagram: S3, Staging, Facts/Dimensions; Business Analysts, Power Users, Data Analysts; EMR with Presto/Hive]

Page 33: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Workstream Effort

• Define Jobs and Job Groups specific to each Workstream

• Create Redshift tables (Staging, Facts, Dimensions) based on mapping from AS400 and best practices learned

• Create new SQL scripts (based on the logic from AS400/SQL Server code) for transformation

• Develop, Test and Deploy in 2-week Agile sprints

Page 34: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Key Lessons - Technical

• Isolate core framework with project-specific code repositories

• Consolidating logging across Amazon S3, Amazon Redshift, Amazon DynamoDB, etc. was a challenge

• Make appropriate schema changes when migrating to new platform

• Custom Framework for gathering operational stats (e.g., # of rows loaded)

• Start with Test Automation tools and Acceptance Test Driven Development (ATDD) earlier in the project

Page 35: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Project timeline revisited

After the successful pilot:

• Executive Leadership accelerated timeline:

  • Reduce project timeline by 50% (to 12 months) to deliver value faster to LOBs

  • Realize cost savings by eliminating the DB2 and SQL Server platforms earlier

• Users wanted to be on the new platform!

• Scholastic & NorthBay partnered to create a training curriculum to ensure a supply of skilled staff would be available to our teams

Page 36: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Scaling up: 7 workstreams

• Developed a model for estimating effort and cost (AWS costs & Labor per LOB migration)

• Running agile teams in parallel, employed Agile coaches

• Enhanced the core framework to ensure it would scale effectively when in use by multiple teams simultaneously

• Building a Code repository for use by all teams

• Building CI / CD Frameworks

Page 37: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Where are we now?

• 4 of 7 LOBs migrated. The framework enables complete migration of a functional area within days or weeks as opposed to months. On track to migrate and decommission entire legacy environment within next 6 months

• 10 weeks to migrate from an external vendor hosting data and providing reports for one LOB

• Cost of Data Ingestion Framework is under $40/day (EC2, EMR, Data Pipeline)

• First “Big Data” initiative in production, captures and processes an average of 1.5 million e-reading events daily (peak: 7 million)

• Profile: LOB #1

  • Loading ~5-6 million rows/day (6-7 GB/day)

  • Processing over 1.5 billion rows within Redshift daily

  • Complete ETL/ELT batch cycle performance improved by over 170%

Page 38: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Key lessons – project execution

• Essential to monitor and optimize AWS costs

• “Data Champion” / “Data Guide” partnership absolutely critical for successful adoption of new platforms

• Importance of strong Agile coaches while scaling out Agile teams

• Criticality of choosing consulting partners (AWS & NorthBay) who can ramp up and supply key resources fast and cycle off the project when finished

• Creating new data platforms and migrating data into them is easy, especially with AWS. Decommissioning existing data platforms is hard!

Page 39: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Thank you!

Page 40: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Remember to complete your evaluations!

Page 41: AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)

Related Sessions

Hear from other customers discussing their Amazon Redshift use cases:

• BDM402: Best Practices for Data Warehousing with Amazon Redshift (King.com)

• BDA304: What’s New with Amazon Redshift

• SVR308: Content and Data Platforms at Vevo: Rebuilding and Scaling from Zero in One Year

• GAM301: How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

• BDA207: Fanatics: Deploying Scalable, Self-Service Business Intelligence on AWS

• BDM306: Netflix: Using Amazon S3 as the fabric of our big data ecosystem

• BDA203: Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift (GE Power and Water)

• BDM206: Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT Analytics Platform on AWS (Hello)

• STG307: Case Study: How Prezi Built and Scales a Cost-Effective, Multipetabyte Data Platform and Storage Infrastructure on Amazon S3