
Page 1: AWS (Amazon Redshift) presentation

AWS: Redshift overview
PRESENTATION PREPARED BY VOLODYMYR ROVETSKIY

Page 2: AWS (Amazon Redshift) presentation

Agenda

What is AWS Redshift

Amazon Redshift Pricing

AWS Redshift Architecture
• Data Warehouse System Architecture
• Internal Architecture and System Operation

Query Planning and Designing Tables
• Query Planning And Execution Workflow
• Columnar Storage
• Zone Maps
• Compression
• Referential Integrity

Data layout in Redshift
• The Sort Key
• The Distribution Key

Workload Management (WLM)

Loading Data
• What is Amazon S3
• Data Loading from Amazon S3
• COPY from Amazon S3

Redshift table maintenance operations
• ANALYZE
• VACUUM

Amazon Redshift Snapshots

Amazon Redshift Security

Monitoring Cluster Performance

Useful resources

Conclusion

Page 3: AWS (Amazon Redshift) presentation

What is Amazon Redshift

Cluster architecture

Columnar storage

Zone maps

Compression

Read optimized

No referential integrity by design

Redshift is Amazon's cloud data warehousing service; it can interact with Amazon EC2 and S3 components but is managed separately via the Redshift tab of the AWS console. As a cloud-based system it is rented by the hour from Amazon, and broadly, the more storage you rent, the more you pay.

Amazon Redshift features

Page 4: AWS (Amazon Redshift) presentation

Amazon Redshift Pricing

Clients pay an hourly rate based on the type and number of nodes in the cluster. There is a discount of up to 75% over On-Demand rates for committing to use Amazon Redshift for a 1- or 3-year term.

Prices include two additional copies of your data, one on the cluster nodes and one in Amazon S3.

Amazon Redshift takes care of backup, durability, availability, security, monitoring, and maintenance.

Price depends on the chosen region.

Dense Storage (DS) nodes allow you to create large data warehouses using hard disk drives (HDDs) for a low price point.

Dense Compute (DC) nodes allow you to create high-performance data warehouses using fast CPUs, large amounts of RAM, and solid-state disks (SSDs).

Page 5: AWS (Amazon Redshift) presentation

Data Warehouse System Architecture

Leader node
• Stores metadata
• Manages communications with client programs and compute nodes
• Manages distributing data to the slices on compute nodes
• Develops and distributes execution plans for compute nodes

Compute nodes
• Execute the query segments in parallel and send results back to the leader node for final aggregation
• Each compute node has its own dedicated CPU, memory, and attached disk storage
• User data is stored on the compute nodes

Node slices
• Each slice is allocated a portion of the node's memory and disk space
• The slices work in parallel to complete the operation
• The number of slices per node is determined by the node size of the cluster
• The rows of a table are distributed to the node slices according to the distribution key

Client applications
• Amazon Redshift is based on industry-standard PostgreSQL

Page 6: AWS (Amazon Redshift) presentation

The following diagram shows a high level view of internal components and functionality of the Amazon Redshift data warehouse.

Internal Architecture and System Operation

Page 7: AWS (Amazon Redshift) presentation

Query Planning And Execution Workflow

The query planning and execution workflow follows these steps:
1. The leader node receives the query and parses the SQL.
2. The parser produces an initial query tree that is a logical representation of the original query. Amazon Redshift then inputs this query tree into the query optimizer.
3. The optimizer evaluates the query and, if necessary, rewrites it to maximize its efficiency.
4. The optimizer generates a query plan for the execution with the best performance.
5. The execution engine translates the query plan into compiled C++ code.
6. The compute nodes execute the compiled code segments in parallel.
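As a quick illustration, you can inspect the plan the optimizer produces with EXPLAIN (the sales table and its columns are hypothetical, borrowed from the later examples):

EXPLAIN
SELECT region, COUNT(*)
FROM sales
WHERE date >= '2013-06-01'
GROUP BY region;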

Page 8: AWS (Amazon Redshift) presentation

Columnar Storage

Pic.1 shows how records from database tables are typically stored into disk blocks by row.

Pic.2 shows how, with columnar storage, the values for each column are stored sequentially into disk blocks.

Columnar storage optimizes analytic query performance because:
• it reduces the overall disk I/O requirements
• it reduces the amount of data you need to load from disk
• each block holds the same type of data
• block data can use a compression scheme selected specifically for the column data type

Page 9: AWS (Amazon Redshift) presentation

Zone Maps

The zone map is held separately from the block, like an index.

The zone map holds only two data points per block: the highest and lowest values in the block.

Redshift uses the zone map when executing queries and excludes the blocks that the zone map indicates won't be returned by the WHERE clause filter.

Zone maps filter data blocks efficiently if the filtered columns are used as the sort key.

Page 10: AWS (Amazon Redshift) presentation

Compression

Benefits of Compression
• Reduces the size of data when it is stored or read from storage
• Conserves storage space
• Reduces the amount of disk I/O
• Improves query performance

Redshift recommendations and advice:
• Use the COPY command to apply automatic compression (COMPUPDATE ON)
• Produce a report with the suggested column encoding schemes for the tables analyzed (ANALYZE COMPRESSION)
• Compression type cannot be changed for a column after the table is created
• Highly compressed sort keys mean many rows per block, so you'll scan more data blocks than you need
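A minimal sketch of both recommendations; the table name, S3 path, and IAM role are hypothetical:

-- Produce a report of suggested column encodings for an existing table
ANALYZE COMPRESSION sales;

-- Apply automatic compression on the first load into an empty table
COPY sales
FROM 's3://my-bucket/sales/part'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE ON;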

Page 11: AWS (Amazon Redshift) presentation

Referential integrity: Redshift unsupported features
• Table partitioning
• Tablespaces
• Constraints:
  ◦ Unique
  ◦ Foreign key
  ◦ Primary key
  ◦ Check constraints
  ◦ Exclusion constraints
• Indexes
• Collations
• Stored procedures
• Triggers
• Table functions
• Sequences
• Full text search
• Exotic data types (arrays, JSON, geospatial types, etc.)

Important: Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints, and they should be declared if your ETL process or some other process in your application enforces their integrity.
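As an illustration (hypothetical tables), constraints can still be declared so the planner can use them, even though Redshift will not enforce them:

CREATE TABLE dim_customer (
  customer_id INTEGER NOT NULL PRIMARY KEY, -- informational only, not enforced
  name        VARCHAR(100)
);

CREATE TABLE fact_sales (
  sale_id     INTEGER NOT NULL,
  customer_id INTEGER NOT NULL REFERENCES dim_customer (customer_id) -- planning hint
);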

Page 12: AWS (Amazon Redshift) presentation

Data layout in Redshift

The Sort Key
• Each table can have a single sort key: a compound key comprised of 1 to 400 columns from the table
• Redshift stores data on disk in sort key order
• Sort keys should be selected based on how the table is used:
  • Columns that are used to join to other tables should be included in the sort key
  • Date-type columns that are used in filtering operations should be included
  • Redshift stores metadata about each data block, including the min and max of each column value; using the sort key, Redshift can skip entire blocks when answering a query

Page 13: AWS (Amazon Redshift) presentation

Sort keys and Zone Maps

Without a sort key:

CREATE TABLE SOME_TABLE (
  SALESID INTEGER NOT NULL,
  DATE DATETIME NOT NULL
);

SELECT COUNT(*) FROM SOME_TABLE
WHERE DATE = '09-JUNE-2013';

With a sort key, zone maps let Redshift skip blocks that cannot match the filter:

CREATE TABLE SOME_TABLE (
  SALESID INTEGER NOT NULL,
  DATE DATETIME NOT NULL
) SORTKEY (DATE);

SELECT COUNT(*) FROM SOME_TABLE
WHERE DATE = '09-JUNE-2013';

Page 14: AWS (Amazon Redshift) presentation

The Sort keys – Single Column

Table is sorted by 1 column [ SORTKEY ( date ) ]. Best for:

• Queries that use the 1st column (i.e. date) as primary filter
• Can speed up joins and group by
• Quickest to VACUUM

Example:

create table sales(
  date datetime not null,
  region varchar not null,
  country varchar not null)
distkey(date)
sortkey(date);

Page 15: AWS (Amazon Redshift) presentation

The Sort keys – Compound

Table is sorted by the 1st column, then the 2nd column, etc. [ SORTKEY COMPOUND ( date, region, country) ]. Best for:

• Queries that filter on a prefix of the sort-key columns (e.g. date, then region)

Example:

create table sales(
  date datetime not null,
  region varchar not null,
  country varchar not null)
distkey(date)
sortkey compound (date, region, country);

Page 16: AWS (Amazon Redshift) presentation

The Sort keys – Interleaved

Equal weight is given to each column. [ SORTKEY INTERLEAVED ( date, region, country) ] Best for:

• Queries that use different columns in the filter
• Queries get faster the more columns are used in the filter
• The slowest to VACUUM

Example:

create table sales(
  date datetime not null,
  region varchar not null,
  country varchar not null)
distkey(date)
sortkey interleaved(date, region, country);

Page 17: AWS (Amazon Redshift) presentation

Data layout in Redshift

The Distribution Key

• Redshift will distribute and replicate data between compute nodes
• By default, data will be spread evenly across all compute node slices (EVEN distribution)
• The even distribution of data across the nodes is vital to ensuring consistent query performance
• If data is denormalised and does not participate in joins, then an EVEN distribution won't be problematic
• Alternatively, a distribution key can be provided (KEY distribution)
• The distribution key helps distribute data across a node's slices
• The distribution key is defined on a per-table basis
• The distribution key is comprised of only a single column

Page 18: AWS (Amazon Redshift) presentation

Distribution styles by example

KEY distribution
• Large fact tables
• Large dimension tables

ALL distribution
• Medium dimension tables (1K – 2M rows)

EVEN distribution
• Tables with no joins or group by
• Small dimension tables (<1000 rows)
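A minimal sketch of declaring each style; the tables and columns are hypothetical, and ALL replicates the whole table to every node:

create table fact_sales(
  sale_id integer not null,
  customer_id integer not null)
distkey(customer_id);  -- KEY: co-locate rows that join on customer_id

create table dim_region(
  region_id integer not null,
  region_name varchar not null)
diststyle all;         -- ALL: replicate a small or medium dimension table

create table raw_events(
  event_id integer not null,
  payload varchar not null)
diststyle even;        -- EVEN: spread rows round-robin across slices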

Page 19: AWS (Amazon Redshift) presentation

Workload Management (WLM)

WLM allows you to:
• Manage and adjust query concurrency
• Increase query concurrency up to 15 in a queue
• Define user groups and query groups
• Segregate short and long running queries
• Help improve performance of individual queries

Be aware:
• Query workload is distributed to every compute node
• Increasing concurrency may not always help due to resource contention (CPU, memory, I/O)
• Total throughput may increase by letting one query complete first and allowing other queries to wait

WLM options by default:
• 1 queue with a concurrency of 5
• Define up to 8 queues with a total concurrency of 15
• Redshift has a superuser queue internally
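As a sketch, a session can route its queries to a particular WLM queue through a query group; the group name 'reports' is hypothetical and must match a queue's WLM configuration:

SET query_group TO 'reports';   -- subsequent queries run in the matching queue

SELECT COUNT(*) FROM sales;     -- hypothetical query handled by that queue

RESET query_group;              -- return to the default queue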

Page 20: AWS (Amazon Redshift) presentation

Short Description of Amazon Simple Storage Service (S3)
• Cloud storage for web applications
• Origin store for content distribution
• Staging area and persistent store for Big Data analytics
• Backup and archive target for databases

To use Amazon S3, you need an AWS account. Before you can store data in Amazon S3, you must create a bucket. Add an object to the created bucket (a text file, a photo, a video, and so forth). When objects are added to the bucket, you can view and manage them.

Page 21: AWS (Amazon Redshift) presentation

Data Loading from Amazon S3

Best Practice and recommendations:

• S3 bucket and your cluster must be created in the same region
• Split your data on S3 into multiple files
• Use a COPY command to load data
• Load your data in sort key order to avoid needing to vacuum
• Organize your data as a sequence of time-series tables
• Run the VACUUM command whenever you add, delete, or modify a large number of rows
• Run the ANALYZE command whenever you've made a non-trivial number of changes, to update table statistics

Page 22: AWS (Amazon Redshift) presentation

COPY from Amazon S3: Syntax Parameters

FROM - the path to the Amazon S3 objects that contain the data.

MANIFEST - the manifest is a text file in JSON format that lists the URL of each file that is to be loaded from Amazon S3. The URL includes the bucket name and full object path for the file. The files that are specified in the manifest can be in different buckets, but all the buckets must be in the same region as the Amazon Redshift cluster.

ENCRYPTED - specifies that the input files on Amazon S3 are encrypted using client-side encryption.

REGION [AS] 'aws-region' - specifies the AWS region where the source data is located.

Examples
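A sketch of a manifest-based load; the bucket, manifest path, and IAM role are hypothetical:

COPY sales
FROM 's3://my-bucket/manifests/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
REGION 'us-east-1';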

Page 23: AWS (Amazon Redshift) presentation

Redshift table maintenance operations

ANALYZE: The command used to capture statistical information about a table for use by the query planner.

• Run before running queries.
• Run against the database after a regular load or update cycle.
• Run against any new tables that you create.
• Consider running ANALYZE operations on different schedules for different types of tables and columns, depending on their use in queries and their propensity to change.
• You do not need to analyze all columns in all tables regularly or on the same schedule. Analyze the columns that are frequently used in the following:
  • Sorting and grouping operations
  • Joins
  • Query predicates

This command can analyze the whole table or specified columns:

ANALYZE <TABLE NAME>;
ANALYZE <TABLE NAME> (<COLUMN1>,<COLUMN2>);

Page 24: AWS (Amazon Redshift) presentation

Redshift table maintenance operations

VACUUM: a process to physically reorganize tables after load activity. It can be run in 4 modes:
• VACUUM FULL - reclaims space and re-sorts
• VACUUM DELETE ONLY - reclaims space but does not re-sort
• VACUUM SORT ONLY - re-sorts but does not reclaim space
• VACUUM REINDEX - used for INTERLEAVED sort keys; re-analyzes sort keys and then runs a full VACUUM

VACUUM is an I/O intensive operation and can take time to run. To minimize the impact of VACUUM:

• Run VACUUM on a regular schedule during time periods when you expect minimal activity on the cluster

• Use TRUNCATE instead of DELETE where possible
• TRUNCATE or DROP test tables
• Perform a Deep Copy instead of VACUUM
• Load data in sort order to remove the need for VACUUM

TO threshold PERCENT - the threshold above which VACUUM skips the sort phase, and the target threshold for reclaiming space in the delete phase. If you include the TO threshold PERCENT parameter, you must also specify a table name. This parameter can't be used with REINDEX. For example, if you specify 75 for the threshold, VACUUM skips the sort phase if 75 percent or more of the table's rows are already in sort order. For the delete phase, VACUUM sets a target of reclaiming disk space such that at least 75 percent of the table's rows are not marked for deletion following the vacuum. The threshold value must be an integer between 0 and 100. The default is 95.
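For example, with a hypothetical sales table:

-- Skip the sort phase if at least 75 percent of rows are already in sort order,
-- and reclaim space until at most 25 percent of rows remain marked for deletion
VACUUM FULL sales TO 75 PERCENT;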

Page 25: AWS (Amazon Redshift) presentation

Amazon Redshift Snapshots

Automated Snapshots
• Enabled by default when the cluster is created
• Taken periodically from the cluster (every eight hours or every 5 GB of data changes)
• Deleted at the end of a retention period (1 day by default)
• Can be disabled (set the retention period to 0)

Manual Snapshots

• Can be taken whenever you want
• Are never deleted automatically
• Accrue storage charges

Excluding Tables From Snapshot

• To create a no-backup table, include the BACKUP NO parameter when you create the table
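For example, a hypothetical staging table that does not need to appear in snapshots:

CREATE TABLE staging_events (
  event_id integer not null,
  payload  varchar not null)
BACKUP NO;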

Copying Snapshots to Another Region

• Copying snapshots across regions incurs data transfer charges

Restoring a Table from a Snapshot (feature added March 10, 2016)

• You can restore a table only to the current, active running cluster and from a snapshot that was taken of that cluster.
• You can restore only one table at a time.
• You cannot restore a table from a cluster snapshot that was taken prior to a cluster being resized.

Page 26: AWS (Amazon Redshift) presentation

Amazon Redshift Security

Cluster security: controlling access to the Redshift cluster for management
• The cluster runs within a Virtual Private Cloud (VPC) managed by the Amazon Redshift service

Connection security: controlling clients that can connect to the Redshift cluster
• Users can only connect to the cluster using ODBC or JDBC connections. You may optionally only permit connections to the Amazon Redshift cluster from a VPC you control.

Database object security: controlling which users have access to which database objects
• At the database security level Amazon Redshift uses the Postgres security model, with user name / password authentication. Database user accounts are configured separately from Redshift's management security using SQL commands.

Data security: encryption of data at rest (load data, table data, and backup data)
• You can encrypt data that is loaded into Amazon Redshift, encrypt the data stored in the Amazon Redshift tables, and encrypt the backups.
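A minimal sketch of the Postgres-style object security layer; the user, password, and table are hypothetical:

-- Create a database user and grant read-only access to a single table
CREATE USER report_user PASSWORD 'Str0ngPassw0rd1';
GRANT SELECT ON TABLE sales TO report_user;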

Page 27: AWS (Amazon Redshift) presentation

Monitoring Cluster Performance

Amazon CloudWatch metrics help you monitor physical aspects of your cluster, such as CPU utilization, latency, and throughput.

Query/load performance data helps you monitor database activity and performance. This data is aggregated in the Amazon Redshift console to help you easily correlate it with what you see in Amazon CloudWatch metrics.

(Slide diagram: Query/Load Performance Data alongside Amazon CloudWatch Metrics)

Page 28: AWS (Amazon Redshift) presentation

Useful resources to learn more about Redshift

Redshift Documentation

• https://aws.amazon.com/redshift• http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html

Open Source Scripts and Tools

• https://github.com/awslabs/amazon-redshift-utils• http://www.aginity.com/redshift

Page 29: AWS (Amazon Redshift) presentation

Conclusion

Amazon Redshift’s features

• Optimized for Data Warehousing - It uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries. Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take advantage of all available resources.

• Scalable- With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change.

• No Up-Front Costs- You pay only for the resources you provision. You can choose On-Demand pricing with no up-front costs or long-term commitments, or obtain significantly discounted rates with Reserved Instance pricing.

• Fault Tolerant- Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3.

• SQL - Amazon Redshift is a SQL data warehouse and uses industry standard ODBC and JDBC connections and Postgres drivers.

• Isolation - Amazon Redshift enables you to configure firewall rules to control network access to your data warehouse cluster.

• Encryption – With just a couple of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at rest.

(Slide diagram: Redshift - Optimized for Data Warehousing, Scalable, No Up-Front Costs, Fault Tolerant, Secure, SQL Standards)

Page 30: AWS (Amazon Redshift) presentation

Jeff Bezos reacted to my payment :-))