Presto@Netflix Presto Meetup 03-19-15

Presto @ Netflix: Interactive Queries

at Petabyte Scale

Nezih Yigitbasi and Zhenxiao Luo

Big Data Platform

Outline

» Big data platform @ Netflix

» Why we love Presto?

» Our contributions

» What are we working on?

» What else we need?

Cloud

Apps

S3

Suro Ursula

SSTable

sCassandra Aegisthus

Event Data

15m

Daily

Dimension Data

Our Data Pipeline

https://github.com/Netflix/suro

https://github.com/Netflix/aegisthus

Data

Warehouse

Service

Tool

s

Gateways

Big Data Platform Architecture

Prod

Clients

Clusters

VPCQuery Prod TestBonusProd

» Batch jobs (Pig, Hive)» ETL jobs

» reporting and other analysis

» Ad-hoc queries» interactive data exploration

» Looked at Impala, Redshift, Spark, and Presto

Our Use Cases

Deployment

» v 0.86

» 1 coordinator (r3.4xlarge)

» 250 workers (m2.4xlarge)

Tooling

Numbers

» ~2.5K queries/day against our 10PB Hive DW on S3

» 230+ Presto users out of 300+ platform users

» presto-cli, Python, R,

BI tools (ODBC/JDBC), etc.

» Atlas/Suro for monitoring/logging

Presto @ Netflix

https://github.com/Netflix/Atlas


Why we love Presto?

» Open source

» Fast

» Scalable

» Works well on AWS

» Good integration with the Hadoop stack

» ANSI SQL

Our Contributions

24 open PRs, 60+ commits

» S3 file system

» multipart upload, IAM roles, retries, monitoring, etc.

» Functions for complex types

» Parquet

» name/index-based access, type coercion, etc.

» Query optimization

» Various other bug fixes

» Vectorized reader* Read based on column vectors

» Predicate pushdown Use statistics to skip data

» Lazy load Postpone loading the data until needed

» Lazy materialization Postpone decoding the data until needed

What are we Working on?Parquet Optimizations

* PARQUET-

131

https://issues.apache.org/jira/browse/PARQUET-131

Netflix Integration

» BI tools integration» ODBC driver, Tableau web connector, etc.

» Better monitoring » Ganglia ⟶ Atlas

» Data lineage» Presto ⟶ Suro⟶ Charlotte

https://github.com/Netflix/atlas


» Graceful cluster shrink

» Better resource management

» Dynamic type coercion for all file formats

» Support for more Hive types (e.g., decimal)

» Predictable metastore cache behavior

» Big table joins similar to Hive

What else we need?

THANK YOU

Internet

Presto@Netflix Presto Meetup 03-19-15