Redshift at Lightspeed
How to continuously optimize and modify Redshift schemas
Panoply.io
The Missing Part: Continuous Data Warehousing
Core Ideas: Continuous Integration, Unit Tests, Server Frameworks, Client Frameworks, SCRUM, Kanban, Extreme Programming
Products: Puppet, Chef, New Relic, AWS, Heroku, Docker, Github, Bitbucket
speed /spēd/ noun: the rate at which someone or something is able to operate or change state
"First, make the change easy. Then, make the easy change." — Kent Beck
1. Data: Columns, Tables, Data Types, Compression, Constraints (driven by code changes)
2. Queries: Transformations, Sortkeys, Distkeys (driven by life, business, environment)
Continuous Data Integration
#1 Data & Metadata Changes
Groups: int g_id, string name
Session Events: int s_id, int u_id, datetime time, datetime start_time, datetime end_time
Users: int u_id, string gender, string name, string first_name, string last_name, int g_id
Messages: int m_id, int from, int to, string text, int to_u, int to_g
- Users support (commit #4acd617 by alice)
- Adding messages (commit #0ca9e87 by bob)
- Track user sessions (commit #709ff49 by alice)
- Breakdown the name (commit #791079b by alice)
- Add Groups (commit #44ff83b by bob)
- Session time-range (commit #df7a369 by alice)
automate with build scripts: Commit Log → schema.yaml

- Users support (commit #4acd617 by alice)
- Breakdown the name (commit #791079b by alice)
- Add Groups (commit #44ff83b by bob)

users:
  - column: u_id
    type: int
  - column: first_name
    type: varchar
  - column: last_name
    type: varchar
  - column: g_id
    type: int
    references:
      - groups.g_id
groups:
  - column: g_id
    type: int
schema.yaml:

users:
  - column: id
    type: integer
  - column: address
    type: varchar
groups:
  - column: admin
    type: integer
    references:
      - users.id

remodel (reject on error):

create table ... ( ... )
alter table ... add column ...
alter table ... drop column ...
alter table ... rename column ... to ...
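The remodel step amounts to diffing the desired schema (parsed from schema.yaml) against the live catalog and emitting DDL. A minimal sketch; the dict shapes and the `diff_to_ddl` helper are illustrative assumptions, not Panoply's actual implementation:

```python
def diff_to_ddl(table, desired, current):
    """Emit ALTER TABLE statements turning `current` columns into `desired`.

    desired/current: {column_name: type} dicts for one table.
    """
    stmts = []
    for col, typ in desired.items():
        if col not in current:  # column in schema.yaml but not in the warehouse
            stmts.append(f"alter table {table} add column {col} {typ}")
    for col in current:
        if col not in desired:  # column no longer declared: drop it
            stmts.append(f"alter table {table} drop column {col}")
    return stmts

desired = {"id": "integer", "address": "varchar"}
current = {"id": "integer", "name": "varchar"}
print(diff_to_ddl("users", desired, current))
```

A real build script would also handle type changes and renames, which is where the rejection-on-error and column-migration slides below come in.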
Concurrency & Locks

Schema commits are applied serially (commit 1 → commit 2 → commit 3); each runs Start → Done, or rolls back on error. While an Alter Table holds its lock on Users (integer id, varchar address), concurrent queries on the table wait.
Altering Column Types

Example: changing messages.created from date to timestamp (reject on error).

1. Add temporary column:
   alter table messages add column created_tmp timestamp;
2. Copy data to new column:
   update messages set created_tmp = created;
3. Rename columns:
   alter table messages rename column created to created_old;
   alter table messages rename column created_tmp to created;
4. Drop old column:
   alter table messages drop column created_old;
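The four-step type migration is entirely mechanical, so a build script can generate it. A sketch; the function name and the `_tmp`/`_old` suffix convention are assumptions for illustration:

```python
def alter_column_type(table, column, new_type):
    """Generate the add / copy / rename / drop statements for a type change."""
    tmp, old = f"{column}_tmp", f"{column}_old"
    return [
        f"alter table {table} add column {tmp} {new_type}",   # 1. temp column
        f"update {table} set {tmp} = {column}",               # 2. copy data
        f"alter table {table} rename column {column} to {old}",  # 3. rename
        f"alter table {table} rename column {tmp} to {column}",
        f"alter table {table} drop column {old}",             # 4. drop old
    ]

for stmt in alter_column_type("messages", "created", "timestamp"):
    print(stmt)
```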
Rebuilding Views & Constraints

View Group Admins (string group_name, string admin_name) depends on Users (string name) and Groups (integer admin).

Approach #1: Drop all, and reconstruct until reaching stability
Approach #2: Walk the dependency DAG (Directed Acyclic Graph) from pg_depend

On error: reject
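The pg_depend approach boils down to a topological sort of the dependency graph: rebuild every object after the objects it depends on. A minimal sketch, where the hard-coded edge list stands in for what you would actually read out of pg_depend:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Map each object to the objects it depends on
# (as would be derived from pg_depend).
deps = {
    "group_admins": {"users", "groups"},  # the view
    "users": set(),                       # base tables
    "groups": set(),
}

# static_order() yields dependencies before dependents,
# i.e. a valid rebuild order.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

A cycle in the graph raises `graphlib.CycleError`, which maps naturally onto the "on error: reject" rule.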
#2 Query Changes
Transformations Sortkeys Distkeys
ETL: extract → transform → load (data available only at the end)
ELT: extract → load → transform (data available immediately after load)

ETL Is Yesterday's Problem: Rigid, Inflating, Dev-dependent, Lost
Raw Data → Transformation Views

Raw tables: users, groups

View users-per-group: int g_id, varchar name, int count_users
View avg-turn-around: string type, float turn_around, int uniques

Immediate Availability
Selective Materialization (e.g. avg-turn-around)
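Selective materialization can be driven by query stats: keep cheap transformations as plain views, and copy the expensive ones into real tables. A toy sketch; the threshold, the stats dict, and the `_mat` naming are hypothetical, not the slide's mechanism:

```python
def plan_materialization(view_stats, threshold_ms=1000):
    """Pick which transformation views are worth materializing.

    view_stats: {view_name: average query time in ms}. Views above the
    threshold get a CREATE TABLE AS copy; the rest stay as views.
    """
    plans = []
    for view, avg_ms in view_stats.items():
        if avg_ms > threshold_ms:
            # Quote names: view names here contain hyphens.
            plans.append(f'create table "{view}_mat" as select * from "{view}"')
    return plans

print(plan_materialization({"users-per-group": 120, "avg-turn-around": 4500}))
```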
Sortkeys: Recap

[diagram: the gender column, unsorted vs sorted, across 1 MB blocks; once sorted, Female and Male values occupy separate blocks, so scans can skip whole blocks]
SELECT COUNT(1) FROM users WHERE gender = 'female'

count
2498644

SELECT * FROM STL_EXPLAIN WHERE ...

plannode | cost | info
Aggregate | cost=68718.76..68718.76 rows=1 width=0 |
Seq Scan on users | cost=0.00..62500.00 rows=2487501 width=0 | Filters: gender = 'female'

SELECT DATEDIFF('ms', starttime, endtime), * FROM STL_SCAN

slice | datediff | rows | rows_pre_filter | is_rrscan
1 | 250 | 4071 | 7813 | t
2 | 309 | 3846 | 7813 | t

Only 52% and 49% of each slice's scanned rows survive the filter.
Diststyles: EVEN, ALL, KEY (e.g. KEY on gender: Female rows land on one slice, Male on another)

Diststyle & Distkeys: Recap
SELECT groups.name, COUNT(DISTINCT u_id)
FROM groups FULL JOIN users ON groups.g_id = users.g_id
GROUP BY groups.name;

name | count
Group 1 | 6
Group 2 | 2

plan | cost | info
HashAggregate | 61000331250 |
Subquery Scan | 61000306250 |
HashAggregate | 61000256250 |
Hash Full Join DS_DIST_BOTH | 61000231250 | users.g_id = groups.g_id
Seq Scan on users (outer) | 50000 |
Hash | 15000 |
Seq Scan on groups (inner) | 15000 |
Join distribution types, from good to bad:
Good: DS_DIST_NONE, DS_DIST_ALL_NONE
OK: DS_DIST_INNER, DS_DIST_ALL_INNER
Bad: DS_DIST_BOTH, DS_BCAST_INNER

[diagram: rows redistributed or broadcast across node 1, node 2, node 3]
plan | cost | info
HashAggregate | 331250 |
Subquery Scan | 306250 |
HashAggregate | 256250 |
Hash Full Join DS_DIST_NONE | 231250 | users.g_id = groups.g_id
Seq Scan on users | 50000 |
Hash | 15000 |
Seq Scan on groups | 15000 |

users: DISTSTYLE KEY DISTKEY (g_id)
groups: DISTSTYLE KEY DISTKEY (g_id)
SELECT "type", AVG(time_to_session) avg_time_to_session
FROM (
  SELECT u_id, "type",
         DATEDIFF(seconds, MS.time, first_session_after_message) time_to_session
  FROM (
    SELECT MM.u_id, MM.time,
           CASE WHEN to_u IS NOT NULL THEN 'Private'
                WHEN to_g IS NOT NULL THEN 'Group' END "type",
           MIN(S.start_time) first_session_after_message
    FROM (
      SELECT M.time, to_u, to_g, nvl(A.u_id, B.u_id) u_id
      FROM messages M
      LEFT JOIN (
        SELECT DISTINCT G.id AS g_id, U.u_id
        FROM groups G RIGHT JOIN users U ON G.id = U.g_id
      ) A ON M.to_g = A.g_id
      LEFT JOIN users B ON M.to_u = B.u_id
    ) MM
    LEFT JOIN (
      SELECT u_id, start_time, end_time FROM sessions
    ) S ON MM.u_id = S.u_id
       AND MM.time < S.start_time
       AND MM.time < S.end_time
    WHERE DATEDIFF(seconds, MM.time, S.start_time) < 3600
    GROUP BY MM.u_id, MM.time,
             CASE WHEN to_u IS NOT NULL THEN 'Private'
                  WHEN to_g IS NOT NULL THEN 'Group' END
  ) MS
) MS1
GROUP BY "type";
Real Life Example
SELECT ...

type | time_to_session
Private | 62.398
Group | 102.873

EXPLAIN SELECT ...

plan | cost | info
1. Hash Join DS_BCAST_INNER | 5872656227563 | users.u_id = sessions.u_id
2. Hash Left Join DS_BCAST_INNER | 1949786557708 | messages.to_u = users.u_id
3. Hash Left Join DS_DIST_INNER | 1349121397704 | messages.to_g = users.g_id
4. Hash Left Join DS_DIST_BOTH | 49000231250 | users.g_id = groups.g_id
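Spotting the bad joins in an EXPLAIN like this is easy to automate: scan the plan text for DS_* labels and rank them by the good/OK/bad ordering above. A sketch; the severity values are just that ordering encoded as numbers:

```python
import re

SEVERITY = {  # higher is worse, per the good / OK / bad ranking
    "DS_DIST_NONE": 0, "DS_DIST_ALL_NONE": 0,
    "DS_DIST_INNER": 1, "DS_DIST_OUTER": 1, "DS_DIST_ALL_INNER": 1,
    "DS_BCAST_INNER": 2, "DS_DIST_BOTH": 2,
}

def worst_joins(plan_text):
    """Return the distinct DS_* distribution types in a plan, worst first."""
    found = re.findall(r"DS_[A-Z_]+", plan_text)
    return sorted(set(found), key=lambda t: -SEVERITY.get(t, 0))

plan = """
Hash Join DS_BCAST_INNER users.u_id = sessions.u_id
Hash Left Join DS_DIST_INNER messages.to_g = users.g_id
Hash Left Join DS_DIST_BOTH users.g_id = groups.g_id
"""
print(worst_joins(plan))
```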
users: DISTKEY (u_id)
sessions: DISTKEY (u_id)
messages: DISTKEY (to_u)
groups: DISTSTYLE ALL

Estimated cost: 2x faster. Actual: 8x faster (12 min → 1.5 min).
Previously:
hash join | cost | info
1. DS_BCAST_INNER | 5872656227563 | users.u_id = sessions.u_id
2. DS_BCAST_INNER | 1949786557708 | messages.to_u = users.u_id
3. DS_DIST_INNER | 1349121397704 | messages.to_g = users.g_id
4. DS_DIST_BOTH | 49000231250 | users.g_id = groups.g_id

After:
hash join | cost
1. DS_DIST_OUTER | 2900030128376
2. DS_DIST_NONE | 1300007879765
3. DS_BCAST_INNER | 1300001568828
4. DS_DIST_NONE | 231250

Per-join cost improvements: 50%, 32%, 3%, 121,000%.
analyze: STL_EXPLAIN, STL_SCAN
parse: into an explain_analysis table
  (int q_id, timestamp time, varchar table, varchar column, float filter_cost, float dist_cost)
rebuild: toward the N-weeks optimum

One distkey per table: duplicate tables or use DISTSTYLE ALL.

current configuration:

users:
  diststyle: KEY
  distkey: g_id
  sortkeys:
    - gender
groups:
  diststyle: KEY
  distkey: g_id
Automated Optimization
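Given parsed plan stats of that shape, key selection can be a simple ranking: the column with the highest accumulated filter cost is the sortkey candidate, the one with the highest redistribution cost is the distkey candidate. A toy sketch; the row dicts and scoring rule are illustrative assumptions:

```python
def pick_keys(rows):
    """rows: dicts shaped like the explain_analysis table
    (table, column, filter_cost, dist_cost).

    Returns, per table, the best sortkey (highest filter cost)
    and distkey (highest redistribution cost) candidates.
    """
    best = {}
    for r in rows:
        t = best.setdefault(r["table"], {"sortkey": None, "distkey": None,
                                         "f": -1.0, "d": -1.0})
        if r["filter_cost"] > t["f"]:
            t["f"], t["sortkey"] = r["filter_cost"], r["column"]
        if r["dist_cost"] > t["d"]:
            t["d"], t["distkey"] = r["dist_cost"], r["column"]
    return {t: {"sortkey": v["sortkey"], "distkey": v["distkey"]}
            for t, v in best.items()}

rows = [
    {"table": "users", "column": "gender", "filter_cost": 62500.0, "dist_cost": 0.0},
    {"table": "users", "column": "g_id", "filter_cost": 10.0, "dist_cost": 231250.0},
]
print(pick_keys(rows))
```

A production version would weight by query frequency over the N-week window rather than taking a raw maximum.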
users → users-dist-uid: clone the table schema with the new Sortkey and Distkey
unload / copy the data
chase rows changed during the copy by update_time
Empirical test: replay queries / explains
instant swap
Rebuilding Tables
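The clone → copy → chase → swap flow is a fixed statement sequence, so it too can be scripted. A sketch; the `_rebuild`/`_old` naming and the exact statement templates are assumptions, and the update_time chase assumes the table carries that column:

```python
def rebuild_statements(table, distkey, sortkey):
    """Statements to rebuild `table` with new keys and swap it in."""
    tmp = f"{table}_rebuild"
    return [
        # clone + bulk copy (CTAS accepts DISTKEY/SORTKEY in Redshift)
        f"create table {tmp} distkey({distkey}) sortkey({sortkey}) "
        f"as select * from {table}",
        # chase rows that changed while the copy ran
        f"insert into {tmp} select * from {table} "
        f"where update_time > (select max(update_time) from {tmp})",
        # instant swap via renames
        f"alter table {table} rename to {table}_old",
        f"alter table {tmp} rename to {table}",
    ]

for s in rebuild_statements("users", "u_id", "gender"):
    print(s)
```

The empirical-test step would replay recorded queries against the rebuilt table between the chase and the swap, and abort (dropping the clone) if the plans regress.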
Summary: Continuous Data Warehousing
Future: Tip of the Iceberg (Frameworks & Platforms)
Panoply.io: Automated Data Management Platform over Redshift