Presto @ Netflix: Interactive Queries at Petabyte Scale
Nezih Yigitbasi and Zhenxiao Luo
Big Data Platform

Outline
  Big data platform @ Netflix
  Why we love Presto?
  Our contributions
  What are we working on?
  What else we need?

Our Data Pipeline
[Diagram. Labels: Event Data, 15m, Daily, Dimension Data. Components: Cloud Apps, S3, Suro, Ursula, SSTables, Cassandra, Aegisthus]




Big Data Platform Architecture
[Diagram. Labels: Gateways, Prod]








Our Use Cases
  Batch jobs (Pig, Hive)
    ETL jobs
    Reporting and other analysis
  Ad-hoc queries
    Interactive data exploration
    Looked at Impala, Redshift, Spark, and Presto

Presto @ Netflix
  Deployment
    v 0.86
    1 coordinator (r3.4xlarge)
    250 workers (m2.4xlarge)
  Numbers
    ~2.5K queries/day against our 10PB Hive DW on S3
    230+ Presto users out of 300+ platform users
  Tooling
    presto-cli, Python, R, BI tools (ODBC/JDBC), etc.
    Atlas/Suro for monitoring/logging

r3.4xlarge and m2.4xlarge are both memory-optimized instances; m2 is a previous-generation instance type.
5PB of our 10PB Hive DW is in Parquet format.

Why we love Presto?
  Open source
  Fast
  Scalable
  Works well on AWS
  Good integration with the Hadoop stack
  ANSI SQL

With a single warehouse on S3, we can spin up multiple test/prod Presto clusters and query live data, etc.

Our Contributions
  24 open PRs, 60+ commits
  S3 file system
    multipart upload, IAM roles, retries, monitoring, etc.
  Functions for complex types
  Parquet
    name/index-based access, type coercion, etc.
  Query optimization
  Various other bug fixes
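As an illustration of the "name/index-based access" item, here is a minimal sketch of the two ways a reader could map table columns onto the columns stored in a Parquet file. It is not the Presto implementation; the function `resolve_columns` and the example column names are hypothetical.

```python
from typing import List, Optional


def resolve_columns(table_cols: List[str], file_cols: List[str],
                    by_name: bool) -> List[Optional[int]]:
    """Map each table column to a position in the file schema (None if absent)."""
    if by_name:
        # Name-based: look up each table column by its name in the file.
        positions = {name: i for i, name in enumerate(file_cols)}
        return [positions.get(name) for name in table_cols]
    # Index-based: the i-th table column reads the i-th file column, if present.
    return [i if i < len(file_cols) else None for i in range(len(table_cols))]


# Hypothetical schemas: the file was written with a different column order
# and before the "device" column was added to the table.
table = ["user_id", "country", "device"]
old_file = ["country", "user_id"]

by_name = resolve_columns(table, old_file, by_name=True)
by_index = resolve_columns(table, old_file, by_name=False)
print(by_name)   # [1, 0, None]
print(by_index)  # [0, 1, None]
```

In this sketch, name-based access tolerates reordered columns, index-based access tolerates renamed columns, and a column missing from the file resolves to None in both modes.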

S3 file system: exponential backoff, exposed various configs for the AWS SDK, multipart upload, IAM roles, and monitoring for PrestoS3FileSystem and the AWS SDK.
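The exponential backoff mentioned in the note above can be sketched roughly as follows. This is a generic retry loop, not the PrestoS3FileSystem code; the function `retry_with_backoff`, its parameters, and the simulated `flaky_read` operation are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(op: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 0.1, max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op(), retrying on IOError with exponentially growing, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt, capped at max_delay, with random jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("unreachable")


# Simulated transient failure: the first two calls fail, the third succeeds.
calls = {"n": 0}


def flaky_read() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated transient S3 error")
    return b"ok"


delays: List[float] = []
result = retry_with_backoff(flaky_read, sleep=delays.append)  # record delays instead of sleeping
print(result, calls["n"], len(delays))
```

Passing `sleep=delays.append` keeps the example fast; a real caller would leave the default `time.sleep`.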

What are we Working on?
Parquet Optimizations
  Vectorized reader*
    Read based on column vectors
  Predicate pushdown
    Use statistics to skip data
  Lazy load
    Postpone loading the data until needed
  Lazy materialization
    Postpone decoding the data until needed
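To make "use statistics to skip data" concrete, here is a small sketch of predicate pushdown over row groups that carry min/max statistics. It is a toy model, not the Presto Parquet reader; the `RowGroup` class and `scan_greater_than` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    min_value: int      # smallest value in this row group (from file statistics)
    max_value: int      # largest value in this row group (from file statistics)
    values: List[int]   # the column data itself


def scan_greater_than(row_groups: List[RowGroup],
                      threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for rg in row_groups:
        if rg.max_value <= threshold:
            continue  # statistics prove no row can match, so skip the data
        groups_read += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
matches, groups_read = scan_greater_than(groups, 18)
print(matches, groups_read)  # [20, 21, 25, 30] 2
```

Only two of the three row groups are read here, because the first one is ruled out by its maximum value alone.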

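The "lazy load" idea (postpone loading the data until needed) can be sketched as a column wrapper that only invokes its loader on first access. This is an illustrative model under assumed names (`LazyColumn`, `filter_and_project`), not the reader's actual design.

```python
from typing import Callable, List, Optional


class LazyColumn:
    """Holds a loader and only runs it the first time the values are requested."""

    def __init__(self, loader: Callable[[], List[int]]):
        self._loader = loader
        self._values: Optional[List[int]] = None
        self.loaded = False

    def get(self) -> List[int]:
        if self._values is None:
            self._values = self._loader()  # the expensive read happens here
            self.loaded = True
        return self._values


def filter_and_project(filter_col: List[int], payload: LazyColumn,
                       threshold: int) -> List[int]:
    """Evaluate the filter first; touch the payload column only if rows survive."""
    selected = [i for i, v in enumerate(filter_col) if v > threshold]
    if not selected:
        return []  # no matching rows, so the payload column is never loaded
    values = payload.get()
    return [values[i] for i in selected]


payload_a = LazyColumn(lambda: [100, 200, 300])
no_match = filter_and_project([1, 2, 3], payload_a, 10)

payload_b = LazyColumn(lambda: [100, 200, 300])
some_match = filter_and_project([1, 20, 30], payload_b, 10)

print(no_match, payload_a.loaded)    # [] False
print(some_match, payload_b.loaded)  # [200, 300] True
```

Lazy materialization follows the same pattern one step later: the bytes are loaded, but decoding is deferred until a value is actually needed.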


Netflix Integration
  BI tools integration
    ODBC driver, Tableau web connector, etc.
  Better monitoring
    Ganglia, Atlas
  Data lineage
    Presto, Suro, Charlotte

What else we need?
  Graceful cluster shrink
  Better resource management
  Dynamic type coercion for all file formats
  Support for more Hive types (e.g., decimal)
  Predictable metastore cache behavior
  Big table joins similar to Hive



