
Amazon Athena Hands-On Workshop


Page 1: Amazon Athena Hands-On Workshop

Amazon Athena Workshop, 26 January 2017

Page 2: Amazon Athena Hands-On Workshop

Agenda

Facilities & Organization

Wi-Fi: DaHouseGuest, Pass: JustDoit!

Feedback Form: goo.gl/T9BZvy
Labs: github.com/doitintl/athena-workshop

Breaks: 11:30 | 13:00 - 13:45 | 15:00

Q & A

Page 3: Amazon Athena Hands-On Workshop


About us

Vadim Solovey, CTO

Shahar Frank, Software Engineering Lead

Page 4: Amazon Athena Hands-On Workshop


Page 5: Amazon Athena Hands-On Workshop


Page 6: Amazon Athena Hands-On Workshop


Page 7: Amazon Athena Hands-On Workshop

Workshop Agenda

● Module 1
  ○ Introduction to AWS Athena
  ○ Demo

● Module 2
  ○ Interacting with AWS Athena
  ○ Lab 2

● Module 3
  ○ Supported Formats and SerDes
  ○ Lab 3

● Module 4
  ○ Partitioning Data
  ○ Lab 4

● Module 5
  ○ Converting to columnar formats
  ○ Lab 5

● Module 6
  ○ Athena Security

● Module 7
  ○ Service Limits

● Module 8
  ○ Comparison to Google BigQuery
  ○ Demo

Page 8: Amazon Athena Hands-On Workshop

AWS Athena

[1] Introduction

Understanding Purpose & Use-Cases

Page 9: Amazon Athena Hands-On Workshop

[1] Challenges

Organizations find it hard to analyze their data without heavy investment and long deployment times

● Significant effort is required to analyze data on S3
● Users often have access only to aggregated data sets
● Managing Hadoop or a data warehouse requires expertise

Page 10: Amazon Athena Hands-On Workshop

[1] Introducing AWS Athena

Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL

Page 11: Amazon Athena Hands-On Workshop

[1] AWS Athena Overview

Easy to use:

1. Log in to the console
2. Create a table (either by following a wizard or by typing a Hive DDL statement)
3. Start querying

Page 12: Amazon Athena Hands-On Workshop

[1] AWS Athena is Highly Available

High Availability Features

● You connect to a service endpoint or log into the console
● Athena uses warm compute pools across multiple Availability Zones
● Your data is in Amazon S3, which has 99.999999999% durability

Page 13: Amazon Athena Hands-On Workshop

[1] Querying Data Directly from Amazon S3

Direct access to your data without hassles

● No loading of data
● No ETL required
● No additional storage required
● Query data in its raw format

Page 14: Amazon Athena Hands-On Workshop

[1] Use ANSI SQL

Use skills you probably already have

● Write standard ANSI SQL syntax
● Support for complex joins, nested queries & window functions (see the sketch below)
● Support for complex data types (arrays, structs)
● Support for partitioning of data by any key:
  ○ e.g. date, time, custom keys
  ○ or customer-year-month-day-hour
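To make the window-function and complex-type bullets concrete, here is a minimal sketch of an Athena query. The trips table, its columns and the passenger_tags array column are hypothetical and exist only for illustration:

SELECT vendor_id,
       trip_distance,
       -- window function: rank each vendor's trips by distance
       rank() OVER (PARTITION BY vendor_id ORDER BY trip_distance DESC) AS distance_rank,
       -- complex type: read the first element of an array column
       element_at(passenger_tags, 1) AS first_tag
FROM trips
WHERE trip_distance > 10;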

Page 15: Amazon Athena Hands-On Workshop

[1] AWS Athena Overview

Amazon Athena is a serverless way to query your data that lives on S3 using SQL

Features:
● Serverless, with zero spin-up time and transparent upgrades
● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web log formats
  ○ AVRO (coming soon)
● Compression is supported out of the box
● Queries cost $5 per terabyte of data scanned, with a 10 MB minimum per query

Additional Information:
● Not a general-purpose database
● Usually used by data analysts to run interactive queries over large datasets
● Currently available in us-east-1 (N. Virginia) and us-west-2 (Oregon)

Page 16: Amazon Athena Hands-On Workshop

[1] Underlying Technologies

Presto (originating from Facebook)

● Used for SQL queries (see the sketch below)
● In-memory distributed query engine, ANSI SQL compatible with extensions

Hive (originating from the Hadoop project)

● Used for DDL functionality
● Complex data types
● Multitude of formats
● Supports data partitioning
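A rough illustration of how the two layers divide the work; the events table, its columns and the bucket path are hypothetical:

-- DDL statements go through the Hive layer
CREATE EXTERNAL TABLE events (
  event_id string,
  event_time timestamp
)
LOCATION 's3://my-bucket/events/';

-- Queries are executed by the Presto engine
SELECT count(*) FROM events
WHERE event_time > timestamp '2017-01-01 00:00:00';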

Page 17: Amazon Athena Hands-On Workshop

[1] Presto vs. Hive Architecture

Page 18: Amazon Athena Hands-On Workshop

[1] Use Cases

Athena complements Amazon Redshift and Amazon EMR

Page 19: Amazon Athena Hands-On Workshop

AWS Athena

[2] Interacting with AWS Athena

Develop, Execute and Visualize Queries

Page 20: Amazon Athena Hands-On Workshop

[2] Interacting with AWS Athena

Amazon Athena is a serverless way to query your data that lives on S3 using SQL

Web User Interface:
● Run queries and examine results
● Manage databases and tables
● Save queries and share them across the organization for re-use
● Query history

JDBC Driver:
● Programmatic access to AWS Athena
  ○ SQL Workbench, JetBrains DataGrip, sqlline
  ○ Your own app

AWS QuickSight:
● Visualize Athena data with charts, pivots and dashboards

Page 21: Amazon Athena Hands-On Workshop

Hands On

Lab 2

Interacting with AWS Athena

Page 22: Amazon Athena Hands-On Workshop

Data Formats

[3] Supported Formats and SerDes

Efficient Data Storage

Page 23: Amazon Athena Hands-On Workshop

[3] Data and Compression Formats

The data formats presently supported are:

● CSV
● TSV
● Parquet (Snappy is the default compression)
● ORC (Zlib is the default compression)
● JSON
● Apache Web Server logs (RegexSerDe)
● Custom delimiters

Compression Formats

● Currently, Snappy, Zlib, and GZIP are the supported compression formats
● LZO is not supported as of today

Page 24: Amazon Athena Hands-On Workshop

[3] CSV Example

CREATE EXTERNAL TABLE `mydb.yellow_trips`(
  `vendor_id` string,
  `pickup_datetime` timestamp,
  `dropoff_datetime` timestamp,
  `pickup_longitude` float,
  `pickup_latitude` float,
  `dropoff_longitude` float,
  `dropoff_latitude` float,
  ...
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://nyc-yellow-trips/csv/';

Page 25: Amazon Athena Hands-On Workshop

[3] Parquet Example

CREATE EXTERNAL TABLE `mydb.yellow_trips`(
  `vendor_id` string,
  `pickup_datetime` timestamp,
  `dropoff_datetime` timestamp,
  `pickup_longitude` float,
  `pickup_latitude` float,
  `dropoff_longitude` float,
  `dropoff_latitude` float,
  ...
)
STORED AS PARQUET
LOCATION 's3://nyc-yellow-trips/parquet/'
TBLPROPERTIES ("parquet.compress"="SNAPPY");

Page 26: Amazon Athena Hands-On Workshop

[3] ORC Example

CREATE EXTERNAL TABLE `mydb.yellow_trips`(
  `vendor_id` string,
  `pickup_datetime` timestamp,
  `dropoff_datetime` timestamp,
  `pickup_longitude` float,
  `pickup_latitude` float,
  `dropoff_longitude` float,
  `dropoff_latitude` float,
  ...
)
STORED AS ORC
LOCATION 's3://nyc-yellow-trips/orc/'
TBLPROPERTIES ("orc.compress"="ZLIB");

Page 27: Amazon Athena Hands-On Workshop

[3] RegEx Serde (Apache Log Example)

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  Date DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)
LOCATION 's3://athena-examples/cloudfront/plaintext/';

Page 28: Amazon Athena Hands-On Workshop

[3] Comparing Formats

PARQUET

● Columnar format
● Schema segregation into footer
● Column-major format
● All data is pushed to the leaf
● Integrated compression and indexes
● Support for predicate pushdown

ORC

● Apache Top Level Project
● Schema segregation into footer
● Column-major format with stripes
● Integrated compression, indexes and stats
● Support for predicate pushdown

Page 29: Amazon Athena Hands-On Workshop

[3] Comparing Formats

Page 30: Amazon Athena Hands-On Workshop

[3] Converting to Parquet or ORC format

● You can use Hive CTAS to convert data:

  CREATE TABLE new_key_value_store
  STORED AS PARQUET
  AS SELECT c1, c2, c3, .., cN
  FROM noncolumnartable
  SORT BY key;

● You can also use Spark to convert the files to Parquet or ORC

● 20 lines of PySpark code running on EMR [1]
  ○ Converts 1TB of text data into 130GB of Parquet with Snappy compression
  ○ Approx. cost is $5

[1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion

Page 31: Amazon Athena Hands-On Workshop

[3] Pay By the Query ($5 per TB scanned)

● You pay for the amount of data scanned

● Means to save on cost (see the sketch after the table below):
  ○ Compress
  ○ Convert to columnar format
  ○ Use partitioning

● Free: DDL queries, failed queries

Dataset                | Size on S3 | Query Runtime | Data Scanned | Cost
Logs stored as CSV     | 1TB        | 237s          | 1.15TB       | $5.75
Logs stored as PARQUET | 130GB      | 5.13s         | 2.69GB       | $0.013
Savings                | 87% less   | 34x faster    | 99% less     | 99.7% cheaper
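A rough sketch of how those savings show up in a query; the logs table, its columns and the dt partition key are hypothetical:

-- Scans every column of every object under the table's location
SELECT * FROM logs;

-- Scans only the status and bytes columns, and only the objects under the
-- dt='2017-01-25' partition (assuming the table is partitioned by dt)
SELECT status, count(*) AS requests, sum(bytes) AS total_bytes
FROM logs
WHERE dt = '2017-01-25'
GROUP BY status;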

Page 32: Amazon Athena Hands-On Workshop

Hands On

Lab 3

Formats & SerDes

Page 33: Amazon Athena Hands-On Workshop

AWS Athena

[4] Partitioning Data

To improve performance and reduce cost

Page 34: Amazon Athena Hands-On Workshop

[4] Partitioning Data

By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost

Benefits of Data Partitioning:
● Partitions limit the scope of data being scanned during the query
● Improves performance
● Reduces query cost
● You can partition your data by any key

Common Practice:
● Based on time, often a multi-level partitioning scheme (see the sketch below)
  ○ YEAR -> MONTH -> DAY -> HOUR
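A minimal sketch of such a scheme; the events table, its columns and the bucket path are hypothetical:

CREATE EXTERNAL TABLE events (
  event_id string,
  payload string
)
PARTITIONED BY (year string, month string, day string, hour string)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Filtering on the partition columns limits the scan to the matching prefixes
SELECT count(*) FROM events
WHERE year = '2017' AND month = '01' AND day = '26';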

Page 35: Amazon Athena Hands-On Workshop

[4] Data already partitioned and stored on S3

$ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
    PRE dt=2009-04-12-13-00/
    PRE dt=2009-04-12-13-05/
    PRE dt=2009-04-12-13-10/
    PRE dt=2009-04-12-13-15/
    PRE dt=2009-04-12-13-20/
    PRE dt=2009-04-12-14-00/
    PRE dt=2009-04-12-14-05/

CREATE EXTERNAL TABLE impressions (
  ...
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

-- load partitions into Athena
MSCK REPAIR TABLE impressions;

-- run a sample query
SELECT dt, impressionid
FROM impressions
WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00';

Page 36: Amazon Athena Hands-On Workshop

[4] Data is not partitioned

aws s3 ls s3://athena-examples/elb/plaintext/ --recursive

2016-11-23 17:54:46   11789573 elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46    8776899 elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46    9309800 elb/plaintext/2015/01/01/part-r-00002-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47    9412570 elb/plaintext/2015/01/01/part-r-00003-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47   10725938 elb/plaintext/2015/01/01/part-r-00004-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46    9439710 elb/plaintext/2015/01/01/part-r-00005-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47          0 elb/plaintext/2015/01/01_$folder$
2016-11-23 17:54:47    9012723 elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47    7571816 elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47    9673393 elb/plaintext/2015/01/02/part-r-00008-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48   11979218 elb/plaintext/2015/01/02/part-r-00009-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48    9546833 elb/plaintext/2015/01/02/part-r-00010-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt

ALTER TABLE elb_logs_raw_native_part
ADD PARTITION (year='2015', month='01', day='01')
LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';
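The DDL for elb_logs_raw_native_part is not shown on the slide; for the ALTER TABLE above to work, the table would need matching partition keys, roughly along these lines (column list abbreviated and row format assumed, for illustration only):

-- Rough sketch: the real table would declare the full ELB log column list
-- and an appropriate SerDe; the point is that the partition keys match
-- the ALTER TABLE statement above.
CREATE EXTERNAL TABLE elb_logs_raw_native_part (
  request_timestamp string,
  elb_name string,
  request_ip string,
  backend_response_code string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://athena-examples/elb/plaintext/';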

Page 37: Amazon Athena Hands-On Workshop

AWS Athena

[5] Converting to Columnar Formats

Apache Parquet & ORC

Page 38: Amazon Athena Hands-On Workshop

[5] Converting to Columnar Formats (batch data)

Your Amazon Athena query performance improves if you convert your data into open source columnar formats such as Apache Parquet or ORC.

The process for converting to columnar formats using an EMR cluster is as follows:

● Create an EMR cluster with Hive installed.
● In the step section of the cluster create statement, specify a script stored in Amazon S3 that points to your input data and creates output data in the columnar format in an Amazon S3 location (a sketch of such a script follows). In this example, the cluster auto-terminates.
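A minimal sketch of the kind of Hive script such a step could run; the bucket paths, table names and columns are hypothetical:

-- Source table over the raw text data
CREATE EXTERNAL TABLE raw_logs (
  request_time string,
  status int,
  bytes bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/raw/';

-- Target table stored as Parquet
CREATE EXTERNAL TABLE parquet_logs (
  request_time string,
  status int,
  bytes bigint
)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet/';

-- Rewrite the raw data as Parquet under the output location
INSERT OVERWRITE TABLE parquet_logs
SELECT request_time, status, bytes FROM raw_logs;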

Page 39: Amazon Athena Hands-On Workshop

[5] Converting to Columnar Formats (streaming data)

Your Amazon Athena query performance improves if you convert your data into open source columnar formats such as Apache Parquet or ORC.

The process for converting to columnar formats using an EMR cluster is as follows:

● Create an EMR cluster with Spark
● Run a Spark Streaming job that reads the data from a Kinesis stream and writes Parquet files to S3

Page 40: Amazon Athena Hands-On Workshop

AWS Athena

[6] Athena Security

Authorization and Access

Page 41: Amazon Athena Hands-On Workshop

[6] Athena Security

Amazon offers three ways to control data access:

● AWS Identity and Access Management (IAM) policies
● Access Control Lists (ACLs)
● Amazon S3 bucket policies

Users control who can access their data on S3. It is possible to fine-tune security so that different people see different sets of data, and to grant access to another user's data.

Page 42: Amazon Athena Hands-On Workshop

AWS Athena

[7] Service Limits

Know your limits and mitigate the risk

Page 43: Amazon Athena Hands-On Workshop

[7] Service Limits

You can request a limit increase by contacting AWS Support.

● Currently, you can submit only one query at a time and can have up to five concurrent queries per account.

● Query timeout: 30 minutes
● Number of databases: 100
● Tables: 100 per database
● Number of partitions: 20k per table
● You may encounter the Amazon S3 bucket limit per account, which is 100

Page 44: Amazon Athena Hands-On Workshop

[7] Known Limitations

The following are known limitations in Amazon Athena:

● User-defined functions (UDFs or UDAFs) are not supported.
● Stored procedures are not supported.
● Currently, Athena does not support the transactions found in Hive or Presto. For a full list of unsupported keywords, see Unsupported DDL.
● LZO is not supported. Use Snappy instead.

Page 45: Amazon Athena Hands-On Workshop

[7] Avoid Surprises

Use backticks if table or column names begin with an underscore. For example:

CREATE TABLE myUnderScoreTable (
  `_id` string,
  `_index` string,
  ...

For the LOCATION clause, use a trailing slash:

USE:
  s3://path_to_bucket/

DO NOT USE:
  s3://path_to_bucket
  s3://path_to_bucket/*
  s3://path_to_bucket/mySpecialFile.dat

Page 46: Amazon Athena Hands-On Workshop

AWS Athena

[8] Comparing to Google BigQuery

How Athena compares on features, performance and cost

Page 47: Amazon Athena Hands-On Workshop


Google BigQuery

• Serverless analytical columnar database based on Google Dremel
• Data:
  • Native tables
  • External tables (*SV, JSON, AVRO files stored in a Google Cloud Storage bucket)
• Ingestion:
  • File imports
  • Streaming API (up to 100K records/sec per table)
  • Federated tables (files in a bucket, a Bigtable table or a Google Spreadsheet)
• ANSI SQL 2011
• Priced at $5/TB of scanned data + storage + streaming (if used)
• Cost optimization: partitioning, limit queried columns, 24-hour cache, cold data

Page 48: Amazon Athena Hands-On Workshop


Summary

Feature \ Product      | AWS Athena                   | Google BigQuery
Data Formats           | *SV, JSON, PARQUET/z, ORC/z  | External (*SV, JSON, AVRO) / Native
ANSI SQL Support       | Yes*                         | Yes*
DDL Support            | Only CREATE/ALTER/DROP       | CREATE/UPDATE/DELETE (w/ quotas)
Underlying Technology  | FB Presto                    | Google Dremel
Caching                | No                           | Yes
Cold Data Pricing      | S3 Lifecycle Policy          | 50% discount after 90 days of inactivity
User Defined Functions | No                           | Yes
Data Partitioning      | On any key                   | By DAY
Pricing                | $5/TB (scanned) plus S3 ops  | $5/TB (scanned) less cached data

Page 49: Amazon Athena Hands-On Workshop


Test Drive Summary

Query Type          | AWS Athena (GB/time) | Google BigQuery (GB/time) | t.diff %
[1] LOOKUP          | 48MB (4.1s)          | 130GB (2.0s)              | -51%
[2] LOOKUP & AGGR   | 331MB (4.35s)        | 13.4GB (2.7s)             | -48%
[3] GROUP/ORDER BY  | 5.74GB (8.85s)       | 8.26GB (5.4s)             | -27%
[4] TEXT FUNCTIONS  | 606MB (11.3s)        | 13.6GB (2.4s)             | -470%
[5] JSON FUNCTIONS  | 29MB (17.8s)         | 63.9GB (8.9s)             | -100%
[6] REGEX FUNCTIONS | (1.3s)               | 5.45GB (1.9s)             | +31%
[7] FEDERATED DATA  | 133GB (19.4s)        | 133GB (36.4s)             | +47%

Page 50: Amazon Athena Hands-On Workshop


What does Athena do better than BigQuery?

Advantages:
• Can be faster than BigQuery, especially with federated/external tables
• Ability to use a regex to define a schema (query files without needing to change their format)
• Can be faster and cheaper than BigQuery when using a partitioned/columnar format
• Tables can be partitioned on any column

Issues:
• It's not easy to convert data between formats
• Doesn't support DML, i.e. no insert/update/delete
• No built-in ingestion

Page 51: Amazon Athena Hands-On Workshop


What does BigQuery do better than Athena?

• It has native table support, giving it better performance and more features

• It's easy to manipulate data, insert/update records and write query results back to a table

• Querying native tables is very fast

• Easy to convert non-columnar formats into a native table for columnar queries

• Supports UDFs (although they will be available in the future for Athena)

• Supports nested tables (nested and repeated fields)

Page 52: Amazon Athena Hands-On Workshop

Remember to complete your evaluations ;-)

https://goo.gl/T9BZvy