
Redshift at Lightspeed

How to continuously optimize and modify Redshift schemas

Panoply.io

The Missing Part: Continuous Data Warehousing

Core Ideas and Products of modern software delivery:

Core Ideas: Continuous Integration, Unit Tests, Server Frameworks, Client Frameworks, SCRUM, Kanban, Extreme Programming

Products: Puppet, Chef, New Relic, AWS, Heroku, Docker, Github, Bitbucket

speed /spēd/ (noun): the rate at which someone or something is able to operate or change state

"First, make the change easy. Then, make the easy change." (Kent Beck)


1. Data: Columns, Tables, Data Types, Compression, Constraints (driven by code changes)

2. Queries: Transformations, Sortkeys, Distkeys (driven by life, business, environment)

Continuous Data Integration

#1 Data & Metadata Changes


Groups: int g_id, string name

Session Events: int s_id, int u_id, datetime time, datetime start_time, datetime end_time

Users: int u_id, string gender, string name, string first_name, string last_name, int g_id

Messages: int m_id, int from, int to, string text, int to_u, int to_g

Users support (commit #4acd617 by alice)
Adding messages (commit #0ca9e87 by bob)
Track user sessions (commit #709ff49 by alice)
Breakdown the name (commit #791079b by alice)
Add Groups (commit #44ff83b by bob)
Session time-range (commit #df7a369 by alice)


automate with build scripts

Commit Log → schema.yaml

Users support (commit #4acd617 by alice)
Breakdown the name (commit #791079b by alice)
Add Groups (commit #44ff83b by bob)

users:
  - column: u_id
    type: int
  - column: first_name
    type: varchar
  - column: last_name
    type: varchar
  - column: g_id
    type: int
    references:
      - groups.g_id
groups:
  - column: g_id
    type: int


schema.yaml:

users:
  - column: id
    type: integer
  - column: address
    type: varchar
groups:
  - column: admin
    type: integer
    references:
      - users.id

Resulting tables: Users (integer id, varchar address), Groups (integer admin → users.id)

Reject on error. Generated DDL:

create table ... ( ... )
alter table ... add column ...
alter table ... drop column ...
alter table ... rename column ... to ...

Otherwise: remodel
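The build script boils down to diffing the committed schema.yaml against the live catalog and emitting that DDL. A minimal sketch under assumed names (the dict layout and `diff_schema` are illustrative, not Panoply's actual implementation):

```python
def diff_schema(current, desired, table):
    """Compare two {column: type} dicts and emit the ALTER statements
    needed to move `current` toward `desired`. The caller decides
    whether to apply them or reject the commit on error."""
    stmts = []
    for col, typ in desired.items():
        if col not in current:
            stmts.append(f"alter table {table} add column {col} {typ}")
    for col in current:
        if col not in desired:
            stmts.append(f"alter table {table} drop column {col}")
    return stmts

# Example: the "Breakdown the name" commit splits `name` into two columns.
print(diff_schema(
    {"u_id": "int", "name": "varchar"},
    {"u_id": "int", "first_name": "varchar", "last_name": "varchar"},
    "users",
))
```

Type changes and renames need more care than this add/drop diff, which is exactly what the next slides cover.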


Concurrency & Locks

Commits are applied serially: commit 1, commit 2, commit 3; each one starts, then either finishes (done) or rolls back.

ALTER TABLE takes a table lock, so concurrent queries on Users (integer id, varchar address) wait until the commit completes.
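One way to get the start / done / rollback behavior above is to wrap each commit's DDL in a single transaction. A hedged sketch with an injected `execute` callback standing in for a real Redshift connection:

```python
def apply_commit(statements, execute):
    """Apply one commit's DDL atomically: begin, run every statement,
    commit. On any failure, roll back so the schema is left untouched."""
    execute("begin")
    try:
        for stmt in statements:
            execute(stmt)
    except Exception:
        execute("rollback")
        return False
    execute("commit")
    return True

log = []
apply_commit(["alter table users add column g_id int"], log.append)
print(log)  # → ['begin', 'alter table users add column g_id int', 'commit']
```

Because Redshift DDL is transactional, the rollback leaves no half-applied commit for the next one in the queue to trip over.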


Altering Column Types

Messages: date created

1. Add temporary column:
   alter table messages add column created_tmp timestamp;
   (Messages: date created, timestamp created_tmp)

2. Copy data to new column:
   update messages set created_tmp = created;

3. Rename columns:
   alter table messages rename column created to created_old;
   alter table messages rename column created_tmp to created;
   (Messages: date created_old, timestamp created)

4. Drop old column:
   alter table messages drop column created_old;
   (Messages: timestamp created)

Reject on error.
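The four steps above are mechanical enough to script. A sketch that emits the statement sequence (the function name is ours):

```python
def change_column_type(table, column, new_type):
    """Emit the add / copy / rename / drop sequence used to change a
    column's type, since a direct ALTER COLUMN ... TYPE is not generally
    available in Redshift for these changes."""
    tmp, old = f"{column}_tmp", f"{column}_old"
    return [
        f"alter table {table} add column {tmp} {new_type}",
        f"update {table} set {tmp} = {column}",
        f"alter table {table} rename column {column} to {old}",
        f"alter table {table} rename column {tmp} to {column}",
        f"alter table {table} drop column {old}",
    ]

for stmt in change_column_type("messages", "created", "timestamp"):
    print(stmt)
```

Running the sequence through the transactional `apply`-style wrapper keeps the reject-on-error guarantee: either the column changes type or nothing happens.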


Rebuilding Views & Constraints

View Group Admins (string group_name, string admin_name), built on Users (string name) and Groups (integer admin).

Approach #1: Drop all, and reconstruct until reaching stability.

Approach #2: pg_depend, i.e. walk the dependency DAG (Directed Acyclic Graph) and rebuild in order.

On error: reject.
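Approach #2 amounts to reading pg_depend into a DAG and rebuilding views in topological order. A minimal sketch, with dependencies expressed as a {view: [objects it reads from]} dict (object names taken from the slide):

```python
def rebuild_order(deps):
    """Topologically sort objects so every view is (re)created only
    after the tables and views it depends on. View dependencies cannot
    be cyclic, so a plain depth-first visit suffices."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in deps.get(node, []):
            visit(dep)
        order.append(node)

    for node in deps:
        visit(node)
    return order

# group_admins reads users and groups, so it is rebuilt last.
print(rebuild_order({"group_admins": ["users", "groups"]}))
# → ['users', 'groups', 'group_admins']
```

Approach #1 (drop everything and retry until stable) is the brute-force equivalent of this sort, trading extra passes for not having to query pg_depend at all.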

#2 Query Changes

Transformations, Sortkeys, Distkeys


ETL (extract → transform → load): data available only after the transform step.

ELT (extract → load → transform): data available as soon as it is loaded.

ETL Is Yesterday's Problem: rigid, inflating, dev-dependent; raw data is lost.


Raw Data → Transformation Views

Raw data: users, groups

View users-per-group: int g_id, varchar name, int count_users
View avg-turn-around: string type, float turn_around, int uniques

Immediate Availability

Selective Materialization (e.g. avg-turn-around, …)
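A transformation view like users-per-group is plain SQL over the raw tables. The sketch below assembles plausible DDL for it; the column names come from the diagram, but the exact join and aggregation are our guess:

```python
def users_per_group_view():
    """DDL for a users-per-group transformation view: one row per group
    with its user count, available as soon as raw data lands."""
    return (
        "create view users_per_group as "
        "select g.g_id, g.name, count(u.u_id) as count_users "
        "from groups g left join users u on u.g_id = g.g_id "
        "group by g.g_id, g.name"
    )

print(users_per_group_view())
```

Because it is a view rather than an ETL job, it tracks the raw tables immediately; only views that prove too slow get selectively materialized.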


Sortkeys: Recap

[Diagram: the gender column stored in 1MB blocks, unsorted vs. sorted; once sorted, Female and Male values cluster into separate blocks]


SELECT COUNT(1) FROM users WHERE gender = 'female';

count: 2498644

SELECT * FROM STL_EXPLAIN WHERE ...;

plannode | cost | info
Aggregate | cost=68718.76..68718.76 rows=1 width=0 |
Seq Scan on users | cost=0.00..62500.00 rows=2487501 width=0 | Filter: gender = 'female'

SELECT DATEDIFF('ms', starttime, endtime), * FROM STL_SCAN;

slice | datediff | rows | pre_filter | is_rrscan
1 | 250 | 4071 | 7813 | t
2 | 309 | 3846 | 7813 | t

Only 52% and 49% of the pre-filter rows were actually read.
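The 52% / 49% figures come straight from the STL_SCAN output: rows actually read divided by rows before the range-restricted filter. As arithmetic:

```python
def scan_ratio(rows, rows_pre_filter):
    """Fraction of rows a slice actually read, versus what it would
    have read without the sortkey's range-restricted scan."""
    return rows / rows_pre_filter

# Slices 1 and 2 from the STL_SCAN output above.
print(round(scan_ratio(4071, 7813) * 100))  # → 52
print(round(scan_ratio(3846, 7813) * 100))  # → 49
```

Tracking this ratio over time is a cheap signal for whether a sortkey is still earning its keep as query patterns drift.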


Diststyle & Distkeys: Recap

[Diagram: diststyle EVEN, ALL, and KEY; with KEY on gender, Female and Male rows land on different slices]


SELECT groups.name, COUNT(DISTINCT u_id)
FROM groups FULL JOIN users ON groups.g_id = users.g_id
GROUP BY groups.name;

name | count
Group 1 | 6
Group 2 | 2

plan | cost | info
HashAggregate | 61000331250 |
Subquery Scan | 61000306250 |
HashAggregate | 61000256250 |
Hash Full Join DS_DIST_BOTH | 61000231250 | users.g_id = groups.g_id
Seq Scan on users (outer) | 50000 |
Hash (inner) | 15000 |
Seq Scan on groups | 15000 |


Good (no redistribution at run time): DS_DIST_NONE, DS_DIST_ALL_NONE

OK: DS_DIST_INNER

Bad (redistribute or broadcast at run time): DS_DIST_BOTH, DS_BCAST_INNER, DS_DIST_ALL_INNER

[Diagram: inner/outer tables redistributed or broadcast across node 1, node 2, node 3; ALL copies the table to every node]


With matching distkeys:

users: DISTSTYLE KEY, DISTKEY (g_id)
groups: DISTSTYLE KEY, DISTKEY (g_id)

plan | cost | info
HashAggregate | 331250 |
Subquery Scan | 306250 |
HashAggregate | 256250 |
Hash Full Join DS_DIST_NONE | 231250 | users.g_id = groups.g_id
Seq Scan on users | 50000 |
Hash | 15000 |
Seq Scan on groups | 15000 |
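Declaring the matching distkeys is one line of DDL per table. A sketch of a generator for it (the helper name is ours):

```python
def distkey_ddl(table, columns, distkey):
    """CREATE TABLE statement with DISTSTYLE KEY, so rows sharing the
    distkey value land on the same slice and the join stays local."""
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    return (f"create table {table} ({cols}) "
            f"diststyle key distkey ({distkey})")

print(distkey_ddl("users", [("u_id", "int"), ("g_id", "int")], "g_id"))
print(distkey_ddl("groups", [("g_id", "int"), ("name", "varchar")], "g_id"))
```

Both tables keyed on g_id is what turns the DS_DIST_BOTH join above into DS_DIST_NONE.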


Real Life Example

SELECT "type", AVG(time_to_session) avg_time_to_session
FROM (
  SELECT u_id, "type",
         DATEDIFF(seconds, MS.time, first_session_after_message) time_to_session
  FROM (
    SELECT MM.u_id, MM.time,
           CASE WHEN to_u IS NOT NULL THEN 'Private'
                WHEN to_g IS NOT NULL THEN 'Group' END "type",
           MIN(S.start_time) first_session_after_message
    FROM (
      SELECT M.time, to_u, to_g, nvl(A.u_id, B.u_id) u_id
      FROM messages M
      LEFT JOIN (
        SELECT DISTINCT G.id AS g_id, U.u_id
        FROM groups G
        RIGHT JOIN users U ON G.id = U.g_id
      ) A ON M.to_g = A.g_id
      LEFT JOIN users B ON M.to_u = B.u_id
    ) MM
    LEFT JOIN (
      SELECT u_id, start_time, end_time FROM sessions
    ) S ON MM.u_id = S.u_id
       AND MM.time < S.start_time
       AND MM.time < S.end_time
    WHERE DATEDIFF(seconds, MM.time, S.start_time) < 3600
    GROUP BY MM.u_id, MM.time,
             CASE WHEN to_u IS NOT NULL THEN 'Private'
                  WHEN to_g IS NOT NULL THEN 'Group' END
  ) MS
) MS1
GROUP BY "type";


SELECT ... →

type | time_to_session
Private | 62.398
Group | 102.873

EXPLAIN SELECT ... →

# | hash join | cost | info
1 | Hash Join DS_BCAST_INNER | 5872656227563 | users.u_id = sessions.u_id
2 | Hash Left Join DS_BCAST_INNER | 1949786557708 | messages.to_u = users.u_id
3 | Hash Left Join DS_DIST_INNER | 1349121397704 | messages.to_g = users.g_id
4 | Hash Left Join DS_DIST_BOTH | 49000231250 | users.g_id = groups.g_id


After repartitioning:

users: DISTKEY (u_id)
sessions: DISTKEY (u_id)
messages: DISTKEY (to_u)
groups: DISTSTYLE ALL

Cost: 2x faster. Actual: 8x faster (12 mins to 1.5 mins).

Previously:

# | hash join | cost | info
1 | DS_BCAST_INNER | 5872656227563 | users.u_id = sessions.u_id
2 | DS_BCAST_INNER | 1949786557708 | messages.to_u = users.u_id
3 | DS_DIST_INNER | 1349121397704 | messages.to_g = users.g_id
4 | DS_DIST_BOTH | 49000231250 | users.g_id = groups.g_id

Now:

# | hash join | cost | improvement
1 | DS_DIST_OUTER | 2900030128376 | 50%
2 | DS_DIST_NONE | 1300007879765 | 32%
3 | DS_BCAST_INNER | 1300001568828 | 3%
4 | DS_DIST_NONE | 231250 | 121,000%


Automated Optimization

analyze: STL_EXPLAIN, STL_SCAN
parse → explain_analysis (int q_id, timestamp time, varchar table, varchar column, float filter_cost, float dist_cost)
rebuild: N-weeks optimum

Constraints: 1 distkey per table; duplicate tables or use ALL.

current configuration:

users:
  diststyle: KEY
  distkey: g_id
  sortkeys:
    - gender
groups:
  diststyle: KEY
  distkey: g_id
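With explain_analysis populated, choosing a distkey per table can be as simple as picking the join column with the highest accumulated dist_cost, within the one-distkey-per-table constraint. A toy sketch (the row shape and function name are assumptions):

```python
from collections import defaultdict

def recommend_distkeys(rows):
    """rows: (table, column, dist_cost) tuples parsed from STL_EXPLAIN
    over the last N weeks. Returns the single highest-cost join column
    per table, honoring the one-distkey-per-table constraint."""
    costs = defaultdict(float)
    for table, column, dist_cost in rows:
        costs[(table, column)] += dist_cost

    best = {}
    for (table, column), cost in costs.items():
        if table not in best or cost > best[table][1]:
            best[table] = (column, cost)
    return {table: column for table, (column, _) in best.items()}

print(recommend_distkeys([
    ("users", "g_id", 100.0),
    ("users", "u_id", 500.0),   # joins on u_id dominate redistribution cost
    ("groups", "g_id", 100.0),
]))
```

A real optimizer would also weigh filter_cost for sortkeys and fall back to DISTSTYLE ALL or duplicated tables when two columns are both expensive, as the constraint line above suggests.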


Rebuilding Tables

1. Clone the table schema with the new Sortkey and Distkey (users → users-dist-uid)
2. …unload / copy data
3. …chase by update_time
4. Empiric test: replay queries / explains
5. Instant swap
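The "instant swap" at the end is a pair of renames inside one transaction, so readers never observe a missing table. A sketch that emits the statements (table names from the slide, helper name is ours):

```python
def swap_tables(live, candidate):
    """Atomically replace `live` with the rebuilt `candidate` table:
    rename both inside one transaction, keeping the old copy around
    in case the empiric test was wrong."""
    return [
        "begin",
        f"alter table {live} rename to {live}_old",
        f"alter table {candidate} rename to {live}",
        "commit",
    ]

print(swap_tables("users", "users_dist_uid"))
```

Keeping the `_old` copy makes the swap reversible with one more rename, which fits the reject-on-error theme of the rest of the pipeline.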

Summary: Continuous Data Warehousing

Future: Tip of the Iceberg; Frameworks & Platforms

Panoply.io: Automated Data Management Platform over Redshift
