© 2011 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Redshift Best Practices Part 2
May 2013
Eric Ferreira & John Loughlin

AWS Webcast - Amazon Redshift Best Practices Part 2 – Performance

DESCRIPTION

This session follows our webinar on data loading and key choices and shows you how to use Amazon Redshift efficiently. Hear our experts discuss how to extract the best performance from your Amazon Redshift cluster by using commands like VACUUM appropriately. Understand what information is exposed in the Amazon Redshift console and how to use it. Learn how to tune performance by explaining query plans and examining how memory and disk space are used.

Reasons to attend:
- Learn how to use Amazon Redshift efficiently.
- Manage storage effectively with Vacuum.
- Attend Q&A session with Amazon Redshift experts.

Agenda

Introduction & Recap

Best Practices for
• Workload Migration
• Copy Command Options
• Vacuum
• Space Management

Q&A

AWS Database Services: Scalable High Performance Application Storage in the Cloud

• Amazon DynamoDB: Fast, Predictable, Highly-Scalable NoSQL Data Store
• Amazon RDS: Managed Relational Database Service for MySQL, Oracle and SQL Server
• Amazon ElastiCache: In-Memory Caching Service
• Amazon Redshift: Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service

[Diagram: AWS service layers labeled Compute, Storage, Networking, Database, Application Services, Deployment & Administration, and AWS Global Infrastructure]

Amazon Redshift architecture

Leader Node
• SQL endpoint
• Postgres based
• Stores metadata
• Communicates with client
• Compiles queries
• Coordinates query execution

Compute Nodes
• Local, columnar storage
• Execute queries in parallel - slices
• Load, backup, restore via Amazon S3

Everything is mirrored

[Architecture diagram labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]

In Part 1…

This is Part 2 of the Redshift Best Practices series.

To watch Part 1, visit:

http://aws.amazon.com/resources/databaseservices/webinars/

Workload Migration

ELT/ETL Process
• Load Atomic Data (target table or staging area)
• Transform data (includes cleanup and aggregation)
• Prepare target tables for query/reports
• Includes Statistics gathering and vacuum
• Includes data retention policy

Re-evaluate to take advantage of cloud characteristics.

Workload Migration cont.

Make provision for testing multiple options before you migrate the production workflow

• Different number of nodes

• Few large nodes versus many small nodes (2x8XL versus 16xXL)

• WLM Settings

• Concurrency versus response time

• Different Sort and Distribution Keys

• Test both queries and load/vacuum times

• Compression

Workload Best Practices

Organizing and keeping your load files in S3 allows for re-run or scenario testing as you evolve your workflow in the platform.

• Keep in S3 or Glacier for fiscal/legal reasons

Data updated only in the short term
• Consider having a short-term version of the table for staging and a long-term version once the data is stable.

Round Robin distribution key
• Use when you don’t have a good Distribution Key
• Check Part 1 for a query that checks for distribution skew
• Trade-off: you give up collocated joins

Loading the target (final) table
• Use a chronological date/timestamp column as the first sortkey. Vacuum is then needed less often and runs faster
• When the first sort column has low cardinality/resolution (e.g., date instead of timestamp), subsequent columns should match common filters and/or grouping columns
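
A minimal sketch of the staging and long-term table split described on this slide. The table and column names (events_staging, events, customer_id, and so on) are hypothetical, and DISTSTYLE EVEN is assumed to be the setting the slide refers to as round-robin distribution.

```sql
-- Hypothetical short-term staging table: no good distribution key,
-- so rows are spread evenly (round robin) across slices.
create table events_staging (
    event_time  timestamp not null,
    customer_id integer   not null,
    event_type  varchar(32),
    amount      decimal(12,2)
)
diststyle even
sortkey (event_time);

-- Hypothetical long-term table: distributed on the join column so joins
-- on customer_id can be collocated. The first sort column is a
-- low-resolution date, so the second sort column matches a common
-- filter/grouping column.
create table events (
    event_date  date    not null,
    customer_id integer not null,
    event_type  varchar(32),
    amount      decimal(12,2)
)
distkey (customer_id)
sortkey (event_date, customer_id);

-- Move rows into the long-term table once they are stable
-- (the cutoff date is an arbitrary example).
insert into events
select cast(event_time as date), customer_id, event_type, amount
from events_staging
where event_time < '2013-05-01';
```

Whether a distribution key or even distribution works better depends on the join patterns and on skew, so both variants are worth testing as suggested under Workload Migration.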

Workload Best Practices cont.

Use the UNLOAD command to archive data that is not needed for business reasons
• Data that needs to exist only for fiscal/legal reasons can be re-loaded as needed.

Consider applying retention policies less often than the regular workflow
• Weekly/Monthly process during a less busy time

• Make space provision for the data growth

• Make sure all queries have date/timestamp range filters (> and <)

• Keep a sliding window of data to minimize block re-write during vacuum

Take manual snapshots to save status at specific mileposts (year-end).
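
One way the archive-and-retain steps above might look, as a sketch only. The events table, the bucket name, and the cutoff date are hypothetical; the credentials string is a placeholder.

```sql
-- Archive rows older than the retention window to S3 as gzip files.
-- Single quotes inside the UNLOAD query are doubled.
unload ('select * from events where event_date < ''2012-01-01''')
to 's3://my-archive-bucket/events/2011/events_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
gzip;

-- Apply the retention policy during a less busy time.
delete from events where event_date < '2012-01-01';

-- Reclaim the space of the deleted rows and refresh statistics.
vacuum events;
analyze events;
```

Because the table is sorted by date and the deleted rows form one contiguous range, this follows the sliding-window advice above.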

Workload Best Practices cont.

Ratio between Load/Query performance needs
• Low ratio: Consider Load -> Snapshot -> Spin “Query” clusters -> Tear down
• High ratio: Consider Performance above space needs when choosing number of nodes

Normalization Rule of Thumb
• De-normalize only to avoid non-collocated joins
• Slowly Changing Dimensions (type II): Keep normalized, match distkey with fact table

COPY Command

COPY table_name [ (column1 [,column2, ...]) ]
FROM 's3://objectpath'
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials'
[ option [ ... ] ]

Options worth mentioning:

GZIP
• Using compressed files saves network bandwidth and can speed up loads.

MAXERROR and NOLOAD
• Default maxerror is 0. Set it to a larger value while troubleshooting a new data stream
• Use with the noload option to speed up file validation

STATUPDATE
• When loading a significant amount of data into a non-empty table, COPY can update statistics at the end of the load.
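
The options above could be combined as in the following sketch. The bucket and prefix are hypothetical, the credentials string is a placeholder, and the maxerror value of 100 is an arbitrary choice for troubleshooting.

```sql
-- Step 1: validate a new data stream without loading any rows.
-- noload checks the files only; maxerror lets the check continue
-- past the first bad row so more problems surface in one pass.
copy lineitem
from 's3://my-load-bucket/new-stream/lineitem.tbl.'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
gzip
noload
maxerror 100;

-- Step 2: the real load into a non-empty table, with statistics
-- updated at the end of the load.
copy lineitem
from 's3://my-load-bucket/new-stream/lineitem.tbl.'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
gzip
statupdate on;
```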

COPY Command Common Issues

UTF-8
• Currently Redshift can only load well-formed UTF-8 characters up to 3 bytes.

NULL AS and ESCAPE
• Common issues loading files can be circumvented with these options
• Narrow down to a small set of rows and visually find what type of problem you have
• Note that the error message might refer to a later portion. For example, “Delimiter not found” might be caused by an EOL that was not escaped.

DATEFORMAT and TIMEFORMAT
• Currently all date/timestamp columns have to use the same formatting defined by the option
• Using ACCEPTANYDATE will not generate errors but will load NULL when the format does not match
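
A sketch of a COPY that uses the options discussed on this slide. The bucket, the null marker 'NULL', and the date and time formats are assumptions about a hypothetical input file, not values from the webinar.

```sql
-- Load a pipe-delimited file whose fields may contain escaped
-- delimiters or newlines, and whose missing values are spelled NULL.
copy orders
from 's3://my-load-bucket/orders/orders.tbl.'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
null as 'NULL'
escape
dateformat 'MM/DD/YYYY'
timeformat 'MM/DD/YYYY HH:MI:SS'
acceptanydate   -- dates that do not match dateformat load as NULL
maxerror 10;
```

With acceptanydate in place, a count of NULL dates after the load is a quick way to see how many values did not match the declared format.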

COPY Command Troubleshooting

STL_LOAD_ERRORS / STL_LOADERROR_DETAIL
• Find errors during specific loads
• You can create a view to simplify the troubleshooting process:

create view loadview as
(select distinct tbl, trim(name) as table_name, query, starttime,
 trim(filename) as input, line_number, colname, err_code,
 trim(err_reason) as reason
 from stl_load_errors sl, stv_tbl_perm sp
 where sl.tbl = sp.id);

• Then run “select * from loadview where table_name = '<table>'” if you have any issues.

STL_LOAD_COMMITS / STL_FILE_SCAN / STL_S3CLIENT
• Load times for specific files. Confirms a given file was read

STL_S3CLIENT_ERROR
• Information about specific S3 or file transfer errors that happen during the load process
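
Two short queries against the tables named above, as a sketch. They assume the COPY in question was the most recent one run in the current session.

```sql
-- Confirm which files the last COPY in this session committed.
select query, trim(filename) as filename, curtime, status
from stl_load_commits
where query = pg_last_copy_id();

-- Show the ten most recent load errors with file, line and reason.
select query, starttime, trim(filename) as input, line_number,
       trim(colname) as colname, err_code, trim(err_reason) as reason
from stl_load_errors
order by starttime desc
limit 10;
```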

COPY Command – Historical Information

Look back to confirm the number of files and bytes loaded by each COPY statement

select substring(q.querytxt,1,40) as querytxt, s.n_files, size_mb, s.time_seconds,
       s.size_mb/decode(s.time_seconds,0,1,s.time_seconds) as mb_per_s
from (select query, count(*) as n_files,
             sum(transfer_size/(1024*1024)) as size_MB,
             (max(end_Time) - min(start_Time))/(1000000) as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where query > 0 and transfer_time > 0
      group by query) as s
LEFT JOIN stl_Query as q on q.query = s.query
order by mb_per_s desc
limit 10

COPY Command – Historical Information cont.

 querytxt                                                     | n_files | size_mb | time_seconds | mb_per_s
--------------------------------------------------------------+---------+---------+--------------+----------
 copy lineitem from 's3://tpc-h/100/lineitem.tbl.' credential |     603 |   22201 |         2390 |        9
 copy lineitem from 's3://tpc-h/1/lineitem.tbl.' credentials  |      34 |     192 |           21 |        8
 copy customer from 's3://tpc-h/100/customer.tbl.' credential |     152 |     750 |           85 |        8
 copy partsupp from 's3://tpc-h/100/partsupp.tbl.' credential |      82 |    2720 |          367 |        7
 COPY ANALYZE part                                            |      22 |      40 |            7 |        5
 copy orders from 's3://tpc-h/100/orders.tbl.' credentials '' |     152 |    4800 |         1035 |        4
 copy orders from 's3://tpc-h/1/orders.tbl.' credentials '' g |      34 |      32 |            7 |        4
 copy part from 's3://tpc-h/100/part.tbl.' credentials '' gzi |     202 |     400 |           95 |        4
 COPY ANALYZE supplier                                        |      34 |       0 |            3 |        0
 copy supplier from 's3://tpc-h/100/supplier.tbl.' credential |     102 |       0 |           10 |        0
(10 rows)

Vacuum

Before Vacuum

• Data inserted goes to a “non-sorted” area at the end of the table

• As this area grows, query times grow

• Data deleted is “marked” in a special column

• As that column grows, query times grow

What vacuum does

• Non-sorted area gets sorted and integrated into the table

• Deleted rows are removed and blocks reorganized

Vacuum cont.

• Vacuum takes advantage of sortkey and skips blocks that don’t need to be modified.

• Vacuum is a maintenance type operation

• Only one vacuum can be running at a time (cluster-wide)

• More Memory = Faster Vacuum

– set wlm_query_slot_count to 4;

• Keep track of Vacuum progress (ETA)

– SVV_VACUUM_PROGRESS

• Record vacuum details afterwards to decide whether to adjust the vacuum frequency

– SVV_VACUUM_SUMMARY

[Figure: table blocks labeled by month (March/2013, April/2013, May/2013, June/2013) alongside an Unsorted region]
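
A vacuum session along the lines of the bullets above might look like this sketch. The table name events is hypothetical, and a slot count of 4 assumes the WLM queue has a concurrency of at least 4.

```sql
-- Give this session more WLM slots, and so more memory, for the vacuum.
set wlm_query_slot_count to 4;

vacuum events;
analyze events;

-- Return to a single slot for normal queries.
set wlm_query_slot_count to 1;

-- From another session, while the vacuum is running: progress and ETA.
select * from svv_vacuum_progress;

-- Afterwards: details of past vacuums, useful when deciding
-- how often to schedule them.
select * from svv_vacuum_summary
where trim(table_name) = 'events';
```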

Space Management

Redshift has a single pool of space used for tables and temporary segments.
• Loads need 2.5 times the space of the data being loaded if the table has a sortkey
• Vacuum may need 2.5 times the size of the table.

Monitor the free space
• Performance Tab in the console
• CloudWatch Alarms
• SQL

Space Management cont.

Table Sizes

select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
       trim(a.name) as Table, b.mbytes, a.rows
from (select db_id, id, name, sum(rows) as rows
      from stv_tbl_perm a
      group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
      from stv_blocklist
      group by tbl) b on a.id=b.tbl
order by mbytes desc, a.db_id, a.name;

Free Space

select sum(capacity)/1024 as capacity_gbytes,
       sum(used)/1024 as used_gbytes,
       (sum(capacity) - sum(used))/1024 as free_gbytes
from stv_partitions
where part_begin=0;

• Redshift allows you to resize your cluster up and down and across node types while it stays online (read-only access).
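
The two queries above can be combined into a rough pre-vacuum check against the 2.5 times rule of thumb from the previous slide. This is a sketch under assumptions: events is a hypothetical table name, matching on the name alone ignores schemas and databases, and the 'ok'/'low' labels are illustrative.

```sql
-- Compare the free space in the cluster with 2.5x the size of one table.
-- Each row in stv_blocklist is counted as 1 MB, as in the Table Sizes query.
select t.table_mbytes,
       t.table_mbytes * 2.5 as vacuum_headroom_mbytes,
       f.free_mbytes,
       case when f.free_mbytes >= t.table_mbytes * 2.5
            then 'ok' else 'low' end as headroom
from (select count(*) as table_mbytes
      from stv_blocklist b
      join stv_tbl_perm p on p.id = b.tbl and p.slice = b.slice
      where trim(p.name) = 'events') as t
cross join
     (select sum(capacity) - sum(used) as free_mbytes
      from stv_partitions
      where part_begin = 0) as f;
```

For example, a 200 GB table would call for roughly 500 GB of free space before a vacuum under this rule.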

Summary

• Experiment to optimize your workflows

• Various STL/STV tables hold most information needed for troubleshooting

• Space Management and Vacuum schedule should be considered during implementation phase

More information

COPY Command

http://docs.aws.amazon.com/redshift/latest/dg/t_Loading_tables_with_the_COPY_command.html

Loads Troubleshooting

http://docs.aws.amazon.com/redshift/latest/dg/t_Troubleshooting_load_errors.html

Vacuum

http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html

UNLOADING data

http://docs.aws.amazon.com/redshift/latest/dg/c_unloading_data.html

Q&A