
Amazon Athena Hands-On Workshop


Page 1: Amazon Athena Hands-On Workshop

Amazon Athena Workshop, 26 January 2017

Page 2: Amazon Athena Hands-On Workshop

Agenda

Facilities & Organization

Wi-Fi: DaHouseGuest, Pass: JustDoit!

Feedback Form: goo.gl/T9BZvy
Labs: github.com/doitintl/athena-workshop

Breaks: 11:30 | 13:00 - 13:45 | 15:00

Q & A

Page 3: Amazon Athena Hands-On Workshop


About us

Vadim Solovey, CTO

Shahar Frank, Software Engineering Lead

Page 4: Amazon Athena Hands-On Workshop


Page 5: Amazon Athena Hands-On Workshop


Page 6: Amazon Athena Hands-On Workshop


Page 7: Amazon Athena Hands-On Workshop

Workshop Agenda

● Module 1
  ○ Introduction to AWS Athena
  ○ Demo

● Module 2
  ○ Interacting with AWS Athena
  ○ Lab 2

● Module 3
  ○ Supported Formats and SerDes
  ○ Lab 3

● Module 4
  ○ Partitioning Data
  ○ Lab 4

● Module 5
  ○ Converting to columnar formats
  ○ Lab 5

● Module 6
  ○ Athena Security

● Module 7
  ○ Service Limits

● Module 8
  ○ Comparison to Google BigQuery
  ○ Demo

Page 8: Amazon Athena Hands-On Workshop

AWS Athena

[1] Introduction

Understanding Purpose & Use-Cases

Page 9: Amazon Athena Hands-On Workshop

[1] Challenges

Organizations find it hard to analyze their data without heavy investment and long deployment times

● Significant effort is required to analyze data on S3
● Users often have access only to aggregated data sets
● Managing Hadoop or a data warehouse requires expertise

Page 10: Amazon Athena Hands-On Workshop

[1] Introducing AWS Athena

Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL

Page 11: Amazon Athena Hands-On Workshop

[1] AWS Athena Overview

Easy to use:

1. Log in to the console
2. Create a table (either by following a wizard or by typing a Hive DDL statement)
3. Start querying

Page 12: Amazon Athena Hands-On Workshop

[1] AWS Athena is Highly Available

High Availability Features

● You connect to a service endpoint or log into the console
● Athena uses warm compute pools across multiple Availability Zones
● Your data is in Amazon S3, which has 99.999999999% durability

Page 13: Amazon Athena Hands-On Workshop

[1] Querying Data Directly from Amazon S3

Direct access to your data without hassles

● No loading of data
● No ETL required
● No additional storage required
● Query data in its raw format

Page 14: Amazon Athena Hands-On Workshop

[1] Use ANSI SQL

Use skills you probably already have

● Write standard ANSI SQL syntax
● Support for complex joins, nested queries & window functions (see the sketch below)
● Support for complex data types (arrays, structs)
● Support for partitioning of data by any key:
  ○ e.g. date, time, custom keys
  ○ or customer-year-month-day-hour
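To make the window-function and complex-type bullets concrete, here is a minimal sketch of an Athena query. The trips table, its columns and the passenger_tags array column are hypothetical and exist only for illustration:

SELECT vendor_id,
       trip_distance,
       -- window function: rank each vendor's trips by distance
       rank() OVER (PARTITION BY vendor_id ORDER BY trip_distance DESC) AS distance_rank,
       -- complex type: read the first element of an array column
       element_at(passenger_tags, 1) AS first_tag
FROM trips
WHERE trip_distance > 10;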

Page 15: Amazon Athena Hands-On Workshop

[1] AWS Athena Overview

Amazon Athena is a serverless way to query your data that lives on S3 using SQL

Features:
● Serverless, with zero spin-up time and transparent upgrades
● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web log formats
  ○ AVRO (coming soon)
● Compression is supported out of the box
● Queries cost $5 per terabyte of data scanned, with a 10 MB minimum per query

Additional Information:
● Not a general-purpose database
● Usually used by data analysts to run interactive queries over large datasets
● Currently available in us-east-1 (N. Virginia) and us-west-2 (Oregon)

Page 16: Amazon Athena Hands-On Workshop

[1] Underlying Technologies

Presto (originating from Facebook)

● Used for SQL queries (see the sketch below)
● In-memory distributed query engine, ANSI SQL compatible with extensions

Hive (originating from the Hadoop project)

● Used for DDL functionality
● Complex data types
● Multitude of formats
● Supports data partitioning
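A rough illustration of how the two layers divide the work; the events table, its columns and the bucket path are hypothetical:

-- DDL statements go through the Hive layer
CREATE EXTERNAL TABLE events (
  event_id string,
  event_time timestamp
)
LOCATION 's3://my-bucket/events/';

-- Queries are executed by the Presto engine
SELECT count(*) FROM events
WHERE event_time > timestamp '2017-01-01 00:00:00';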

Page 17: Amazon Athena Hands-On Workshop

[1] Presto vs. Hive Architecture

Page 18: Amazon Athena Hands-On Workshop

[1] Use Cases

Athena complements Amazon Redshift and Amazon EMR

Page 19: Amazon Athena Hands-On Workshop

AWS Athena

[2] Interacting with AWS Athena

Develop, Execute and Visualize Queries

Page 20: Amazon Athena Hands-On Workshop

[2] Interacting with AWS Athena

Amazon Athena is a serverless way to query your data that lives on S3 using SQL

Web User Interface:
● Run queries and examine results
● Manage databases and tables
● Save queries and share them across the organization for re-use
● Query history

JDBC Driver:
● Programmatic access to AWS Athena
  ○ SQL Workbench, JetBrains DataGrip, sqlline
  ○ Your own app

AWS QuickSight:
● Visualize Athena data with charts, pivots and dashboards

Page 21: Amazon Athena Hands-On Workshop

Hands On

Lab 2

Interacting with AWS Athena

Page 22: Amazon Athena Hands-On Workshop

Data Formats

[3] Supported Formats and SerDes

Efficient Data Storage

Page 23: Amazon Athena Hands-On Workshop

[3] Data and Compression Formats

The data formats presently supported are:

● CSV
● TSV
● Parquet (Snappy is the default compression)
● ORC (Zlib is the default compression)
● JSON
● Apache Web Server logs (RegexSerDe)
● Custom delimiters

Compression Formats

● Currently, Snappy, Zlib, and GZIP are the supported compression formats
● LZO is not supported as of today

Page 24: Amazon Athena Hands-On Workshop

[3] CSV Example

CREATE EXTERNAL TABLE `mydb.yellow_trips`(
  `vendor_id` string,
  `pickup_datetime` timestamp,
  `dropoff_datetime` timestamp,
  `pickup_longitude` float,
  `pickup_latitude` float,
  `dropoff_longitude` float,
  `dropoff_latitude` float,
  ...
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://nyc-yellow-trips/csv/';

Page 25: Amazon Athena Hands-On Workshop

[3] Parquet Example

CREATE EXTERNAL TABLE `mydb.yellow_trips`(
  `vendor_id` string,
  `pickup_datetime` timestamp,
  `dropoff_datetime` timestamp,
  `pickup_longitude` float,
  `pickup_latitude` float,
  `dropoff_longitude` float,
  `dropoff_latitude` float,
  ...
)
STORED AS PARQUET
LOCATION 's3://nyc-yellow-trips/parquet/'
TBLPROPERTIES ("parquet.compress"="SNAPPY");

Page 26: Amazon Athena Hands-On Workshop

[3] ORC Example

CREATE EXTERNAL TABLE `mydb.yellow_trips`(
  `vendor_id` string,
  `pickup_datetime` timestamp,
  `dropoff_datetime` timestamp,
  `pickup_longitude` float,
  `pickup_latitude` float,
  `dropoff_longitude` float,
  `dropoff_latitude` float,
  ...
)
STORED AS ORC
LOCATION 's3://nyc-yellow-trips/orc/'
TBLPROPERTIES ("orc.compress"="ZLIB");

Page 27: Amazon Athena Hands-On Workshop

[3] RegEx Serde (Apache Log Example)

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  Date DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)
LOCATION 's3://athena-examples/cloudfront/plaintext/';

Page 28: Amazon Athena Hands-On Workshop

[3] Comparing Formats

PARQUET

● Columnar format
● Schema segregation into footer
● Column-major format
● All data is pushed to the leaf
● Integrated compression and indexes
● Support for predicate pushdown

ORC

● Apache Top Level Project
● Schema segregation into footer
● Column-major format with stripes
● Integrated compression, indexes and stats
● Support for predicate pushdown

Page 29: Amazon Athena Hands-On Workshop

[3] Comparing Formats

Page 30: Amazon Athena Hands-On Workshop

[3] Converting to Parquet or ORC format

● You can use Hive CTAS to convert data:

  CREATE TABLE new_key_value_store
  STORED AS PARQUET
  AS SELECT c1, c2, c3, .., cN
  FROM noncolumnartable
  SORT BY key;

● You can also use Spark to convert the files to Parquet or ORC

● 20 lines of PySpark code running on EMR [1]
  ○ Converts 1TB of text data into 130GB of Parquet with Snappy compression
  ○ Approx. cost is $5

[1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion

Page 31: Amazon Athena Hands-On Workshop

[3] Pay By the Query ($5 per TB scanned)

● You pay for the amount of data scanned

● Means to save on cost (see the sketch after the table below):
  ○ Compress
  ○ Convert to columnar format
  ○ Use partitioning

● Free: DDL queries, failed queries

Dataset                | Size on S3 | Query Runtime | Data Scanned | Cost
Logs stored as CSV     | 1TB        | 237s          | 1.15TB       | $5.75
Logs stored as PARQUET | 130GB      | 5.13s         | 2.69GB       | $0.013
Savings                | 87% less   | 34x faster    | 99% less     | 99.7% cheaper
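A rough sketch of how those savings show up in a query; the logs table, its columns and the dt partition key are hypothetical:

-- Scans every column of every object under the table's location
SELECT * FROM logs;

-- Scans only the status and bytes columns, and only the objects under the
-- dt='2017-01-25' partition (assuming the table is partitioned by dt)
SELECT status, count(*) AS requests, sum(bytes) AS total_bytes
FROM logs
WHERE dt = '2017-01-25'
GROUP BY status;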

Page 32: Amazon Athena Hands-On Workshop

Hands On

Lab 3

Formats & SerDes

Page 33: Amazon Athena Hands-On Workshop

AWS Athena

[4] Partitioning Data

To improve performance and reduce cost

Page 34: Amazon Athena Hands-On Workshop

[4] Partitioning Data

By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost

Benefits of Data Partitioning:
● Partitions limit the scope of data being scanned during the query
● Improves performance
● Reduces query cost
● You can partition your data by any key

Common Practice:
● Based on time, often a multi-level partitioning scheme (see the sketch below)
  ○ YEAR -> MONTH -> DAY -> HOUR
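A minimal sketch of such a scheme; the events table, its columns and the bucket path are hypothetical:

CREATE EXTERNAL TABLE events (
  event_id string,
  payload string
)
PARTITIONED BY (year string, month string, day string, hour string)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Filtering on the partition columns limits the scan to the matching prefixes
SELECT count(*) FROM events
WHERE year = '2017' AND month = '01' AND day = '26';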

Page 35: Amazon Athena Hands-On Workshop

[4] Data already partitioned and stored on S3

$ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
    PRE dt=2009-04-12-13-00/
    PRE dt=2009-04-12-13-05/
    PRE dt=2009-04-12-13-10/
    PRE dt=2009-04-12-13-15/
    PRE dt=2009-04-12-13-20/
    PRE dt=2009-04-12-14-00/
    PRE dt=2009-04-12-14-05/

CREATE EXTERNAL TABLE impressions (
  ...
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

-- load partitions into Athena
MSCK REPAIR TABLE impressions;

-- run a sample query
SELECT dt, impressionid
FROM impressions
WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00';

Page 36: Amazon Athena Hands-On Workshop

[4] Data is not partitioned

aws s3 ls s3://athena-examples/elb/plaintext/ --recursive

2016-11-23 17:54:46   11789573 elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46    8776899 elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46    9309800 elb/plaintext/2015/01/01/part-r-00002-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47    9412570 elb/plaintext/2015/01/01/part-r-00003-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47   10725938 elb/plaintext/2015/01/01/part-r-00004-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46    9439710 elb/plaintext/2015/01/01/part-r-00005-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47          0 elb/plaintext/2015/01/01_$folder$
2016-11-23 17:54:47    9012723 elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47    7571816 elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47    9673393 elb/plaintext/2015/01/02/part-r-00008-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48   11979218 elb/plaintext/2015/01/02/part-r-00009-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48    9546833 elb/plaintext/2015/01/02/part-r-00010-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt

ALTER TABLE elb_logs_raw_native_part
ADD PARTITION (year='2015', month='01', day='01')
LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';
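The DDL for elb_logs_raw_native_part is not shown on the slide; for the ALTER TABLE above to work, the table would need matching partition keys, roughly along these lines (column list abbreviated and row format assumed, for illustration only):

-- Rough sketch: the real table would declare the full ELB log column list
-- and an appropriate SerDe; the point is that the partition keys match
-- the ALTER TABLE statement above.
CREATE EXTERNAL TABLE elb_logs_raw_native_part (
  request_timestamp string,
  elb_name string,
  request_ip string,
  backend_response_code string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://athena-examples/elb/plaintext/';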

Page 37: Amazon Athena Hands-On Workshop

AWS Athena

[5] Converting to Columnar Formats

Apache Parquet & ORC

Page 38: Amazon Athena Hands-On Workshop

[5] Converting to Columnar Formats (batch data)

Your Amazon Athena query performance improves if you convert your data into open source columnar formats such as Apache Parquet or ORC.

The process for converting to columnar formats using an EMR cluster is as follows:

● Create an EMR cluster with Hive installed.
● In the step section of the cluster create statement, specify a script stored in Amazon S3 that points to your input data and creates output data in the columnar format in an Amazon S3 location (a sketch of such a script follows). In this example, the cluster auto-terminates.
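A minimal sketch of the kind of Hive script such a step could run; the bucket paths, table names and columns are hypothetical:

-- Source table over the raw text data
CREATE EXTERNAL TABLE raw_logs (
  request_time string,
  status int,
  bytes bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/raw/';

-- Target table stored as Parquet
CREATE EXTERNAL TABLE parquet_logs (
  request_time string,
  status int,
  bytes bigint
)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet/';

-- Rewrite the raw data as Parquet under the output location
INSERT OVERWRITE TABLE parquet_logs
SELECT request_time, status, bytes FROM raw_logs;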

Page 39: Amazon Athena Hands-On Workshop

[5] Converting to Columnar Formats (streaming data)

Your Amazon Athena query performance improves if you convert your data into open source columnar formats such as Apache Parquet or ORC.

The process for converting to columnar formats using an EMR cluster is as follows:

● Create an EMR cluster with Spark
● Run a Spark Streaming job that reads the data from a Kinesis stream and writes Parquet files to S3

Page 40: Amazon Athena Hands-On Workshop

AWS Athena

[6] Athena Security

Authorization and Access

Page 41: Amazon Athena Hands-On Workshop

[6] Athena Security

Amazon offers three ways to control data access:

● AWS Identity and Access Management (IAM) policies
● Access Control Lists (ACLs)
● Amazon S3 bucket policies

Users control who can access their data on S3. It is possible to fine-tune security so that different people see different sets of data, and to grant access to another user's data.

Page 42: Amazon Athena Hands-On Workshop

AWS Athena

[7] Service Limits

Know your limits and mitigate the risk

Page 43: Amazon Athena Hands-On Workshop

[7] Service Limits

You can request a limit increase by contacting AWS Support.

● Currently, you can submit only one query at a time and can have up to five concurrent queries per account.

● Query timeout: 30 minutes
● Number of databases: 100
● Tables: 100 per database
● Number of partitions: 20k per table
● You may encounter the Amazon S3 bucket limit per account, which is 100

Page 44: Amazon Athena Hands-On Workshop

[7] Known Limitations

The following are known limitations in Amazon Athena:

● User-defined functions (UDFs or UDAFs) are not supported.
● Stored procedures are not supported.
● Currently, Athena does not support the transactions found in Hive or Presto. For a full list of unsupported keywords, see Unsupported DDL.
● LZO is not supported. Use Snappy instead.

Page 45: Amazon Athena Hands-On Workshop

[7] Avoid Surprises

Use backticks if table or column names begin with an underscore. For example:

CREATE TABLE myUnderScoreTable (
  `_id` string,
  `_index` string,
  ...

For the LOCATION clause, use a trailing slash:

USE:
  s3://path_to_bucket/

DO NOT USE:
  s3://path_to_bucket
  s3://path_to_bucket/*
  s3://path_to_bucket/mySpecialFile.dat

Page 46: Amazon Athena Hands-On Workshop

AWS Athena

[8] Comparing to Google BigQuery

How Athena compares on features, performance and cost

Page 47: Amazon Athena Hands-On Workshop


Google BigQuery

• Serverless analytical columnar database based on Google Dremel
• Data:
  • Native tables
  • External tables (*SV, JSON, AVRO files stored in a Google Cloud Storage bucket)
• Ingestion:
  • File imports
  • Streaming API (up to 100K records/sec per table)
  • Federated tables (files in a bucket, a Bigtable table or a Google Spreadsheet)
• ANSI SQL 2011
• Priced at $5/TB of scanned data + storage + streaming (if used)
• Cost optimization: partitioning, limit queried columns, 24-hour cache, cold data

Page 48: Amazon Athena Hands-On Workshop


Summary

Feature \ Product      | AWS Athena                   | Google BigQuery
Data Formats           | *SV, JSON, PARQUET/z, ORC/z  | External (*SV, JSON, AVRO) / Native
ANSI SQL Support       | Yes*                         | Yes*
DDL Support            | Only CREATE/ALTER/DROP       | CREATE/UPDATE/DELETE (w/ quotas)
Underlying Technology  | FB Presto                    | Google Dremel
Caching                | No                           | Yes
Cold Data Pricing      | S3 Lifecycle Policy          | 50% discount after 90 days of inactivity
User Defined Functions | No                           | Yes
Data Partitioning      | On any key                   | By DAY
Pricing                | $5/TB (scanned) plus S3 ops  | $5/TB (scanned) less cached data

Page 49: Amazon Athena Hands-On Workshop


Test Drive Summary

Query Type          | AWS Athena (GB/time) | Google BigQuery (GB/time) | t.diff %
[1] LOOKUP          | 48MB (4.1s)          | 130GB (2.0s)              | -51%
[2] LOOKUP & AGGR   | 331MB (4.35s)        | 13.4GB (2.7s)             | -48%
[3] GROUP/ORDER BY  | 5.74GB (8.85s)       | 8.26GB (5.4s)             | -27%
[4] TEXT FUNCTIONS  | 606MB (11.3s)        | 13.6GB (2.4s)             | -470%
[5] JSON FUNCTIONS  | 29MB (17.8s)         | 63.9GB (8.9s)             | -100%
[6] REGEX FUNCTIONS | (1.3s)               | 5.45GB (1.9s)             | +31%
[7] FEDERATED DATA  | 133GB (19.4s)        | 133GB (36.4s)             | +47%

Page 50: Amazon Athena Hands-On Workshop


What does Athena do better than BigQuery?

Advantages:
• Can be faster than BigQuery, especially with federated/external tables
• Ability to use a regex to define a schema (query files without needing to change their format)
• Can be faster and cheaper than BigQuery when using a partitioned/columnar format
• Tables can be partitioned on any column

Issues:
• It's not easy to convert data between formats
• Doesn't support DML, i.e. no insert/update/delete
• No built-in ingestion

Page 51: Amazon Athena Hands-On Workshop


What does BigQuery do better than Athena?

• It has native table support, giving it better performance and more features

• It's easy to manipulate data, insert/update records and write query results back to a table

• Querying native tables is very fast

• Easy to convert non-columnar formats into a native table for columnar queries

• Supports UDFs (although they will be available in the future for Athena)

• Supports nested tables (nested and repeated fields)

Page 52: Amazon Athena Hands-On Workshop

Remember to complete your evaluations ;-)

https://goo.gl/T9BZvy