Netflix running Presto in the AWS Cloud Zhenxiao Luo Senior Software Engineer @ Netflix


Page 1: Netflix running Presto in the AWS Cloud

Netflix running Presto in the AWS Cloud

Zhenxiao Luo

Senior Software Engineer @ Netflix

Page 2: Netflix running Presto in the AWS Cloud

Outline

● Big Data Platform @ Netflix
● Use cases & requirements
● What we did
  ○ Reading/Writing from/to Amazon S3
  ○ Operations
  ○ Deployment
  ○ Performance
● What’s next?

Page 3: Netflix running Presto in the AWS Cloud

Big Data Platform @ Netflix

Page 4: Netflix running Presto in the AWS Cloud

Use Cases

● Big Batch Jobs
  ○ high throughput, fault tolerant, ETL
  ○ data spills to disk
  ○ Hive on Tez, Pig on Tez
● Ad hoc Queries
  ○ low latency, interactive, data exploration
  ○ in-memory, but limited data size
  ○ Impala, Redshift, Spark, Presto

Page 5: Netflix running Presto in the AWS Cloud

Netflix Requirements

● SQL-like language
● Low latency for ad hoc queries
● Works well on the AWS cloud
● Good integration with the Hadoop stack
● Scales to a 1000+ node cluster
● Open source with community support

Page 6: Netflix running Presto in the AWS Cloud

What did Netflix do?

Page 7: Netflix running Presto in the AWS Cloud

Reading/Writing from/to S3

● Option 1: Apache Hadoop NativeS3FileSystem
● Option 2: PrestoS3FileSystem
  ○ retry logic for read timeout
  ○ write directly to final S3 path
● Option 3: emrFileSystem
  ○ disable hadoop logging
  ○ disable hadoop FileSystem cache
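
The "retry logic for read timeout" item under Option 2 can be illustrated with a minimal Python sketch. This is not the PrestoS3FileSystem code: `ReadTimeout`, `read_with_retry`, and `FlakyReader` are hypothetical names, and the exponential backoff policy is an assumption chosen for illustration.

```python
import time


class ReadTimeout(Exception):
    """Stand-in for a transient read timeout from the object store (hypothetical)."""


def read_with_retry(read_fn, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call read_fn, retrying on ReadTimeout with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except ReadTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyReader:
    """Hypothetical reader that times out a fixed number of times, then succeeds."""

    def __init__(self, failures, payload):
        self.failures = failures
        self.payload = payload
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ReadTimeout("simulated S3 read timeout")
        return self.payload


reader = FlakyReader(failures=2, payload=b"row data")
delays = []
# Record the requested delays instead of actually sleeping.
data = read_with_retry(reader, sleep=delays.append)
print(data, reader.calls, delays)
```

Passing `sleep` as a parameter keeps the sketch fast to run and makes the backoff schedule observable. A real reader would also need to resume from the last byte offset it read, which this sketch does not attempt.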

Page 9: Netflix running Presto in the AWS Cloud

Our Operations Environment

● Launch script on top of EMR

● Ganglia integration

● Usage graphs - concurrent queries & tasks

Page 10: Netflix running Presto in the AWS Cloud

Current Deployment

● Presto in Production @ Netflix
● 100+ node Presto cluster
● 1000+ queries running per day
● Presto queries the same petabyte-scale S3 data warehouse as Hive and Pig

Page 11: Netflix running Presto in the AWS Cloud

Observed Performance @ Netflix

● Data in Sequence File format
● One MapReduce Job, Small Table Scan
  ○ MapReduce overhead dominates the query execution time
  ○ Presto is always ~10X faster than Hive
● One MapReduce Job, Big Table Scan
  ○ MapReduce overhead is marginal compared with big table scan time
  ○ Presto performs similarly to Hive
● Multiple MapReduce Aggregation
  ○ Presto is always > 10X faster than Hive
● Joins
  ○ Presto is always > 2X faster than Hive
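
The scan observations above fit a simple fixed-overhead picture, sketched below as a toy cost model in Python. All of it is an assumption for illustration: the function name, the idea that both engines scan at the same rate, and the numbers (5 s and 3600 s scans, 45 s launch overhead) are hypothetical and are not measurements from the talk.

```python
def hive_to_presto_ratio(scan_seconds, overhead_per_job, num_jobs=1):
    """Ratio of (scan + per-job overhead) to scan alone under a toy cost model."""
    # Assumption: Hive pays a fixed launch overhead for every MapReduce job.
    hive_time = scan_seconds + overhead_per_job * num_jobs
    # Assumption: Presto scans at the same rate but pays no per-job overhead.
    presto_time = scan_seconds
    return hive_time / presto_time


# Hypothetical inputs, chosen only to show the shape of the effect.
small_scan = hive_to_presto_ratio(scan_seconds=5, overhead_per_job=45)
big_scan = hive_to_presto_ratio(scan_seconds=3600, overhead_per_job=45)
multi_job = hive_to_presto_ratio(scan_seconds=5, overhead_per_job=45, num_jobs=3)

print(small_scan, big_scan, multi_job)
```

Under these assumptions the ratio is large when the scan is short, approaches 1 as the scan grows, and increases with the number of jobs. The model says nothing about joins, where other factors drive the difference.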

Page 12: Netflix running Presto in the AWS Cloud

What we are working on

● Support Parquet File Format
  ○ https://github.com/facebook/presto/pull/1147
  ○ Parquet performs similarly to Sequence File, but not as fast as RCFile
● ODBC/JDBC driver for Presto
  ○ Support MicroStrategy running on Presto

Page 13: Netflix running Presto in the AWS Cloud

Some inconveniences ...

● Support Server Side “Use Schema”
  ○ Workaround: Client Side “Use Schema” or “Schema.Table”
● Recurse the partition directory
  ○ Different behavior from Hive
● Metadata caching
  ○ have to rerun the query a number of times to see the metadata change
● Extend JSON extract functions to allow . notation
  ○ json_extract_scalar(mapColumn, '$.namePart1.namePart2')
  ○ Workaround: regexp_extract
● WebUI running slow
  ○ load query task info on demand
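
To show what the requested dot notation would buy over the regexp workaround, here is a small Python sketch. It mimics the idea only and is not Presto's implementation: `extract_scalar` and `extract_by_regex` are hypothetical helpers, and the sample document is made up.

```python
import json
import re


def extract_scalar(json_text, path):
    """Follow a '$.a.b' style path through nested JSON objects; None if absent."""
    if not path.startswith("$."):
        raise ValueError("path must start with '$.'")
    node = json.loads(json_text)
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    # Only scalars are returned; objects and arrays yield None.
    return None if isinstance(node, (dict, list)) else str(node)


def extract_by_regex(json_text, key):
    """Regex workaround: grab the first string value stored under `key`."""
    match = re.search(r'"%s"\s*:\s*"([^"]*)"' % re.escape(key), json_text)
    return match.group(1) if match else None


# Made-up document where the same key appears at two nesting levels.
doc = '{"namePart2": "other", "namePart1": {"namePart2": "value"}}'
by_path = extract_scalar(doc, "$.namePart1.namePart2")
by_regex = extract_by_regex(doc, "namePart2")
print(by_path, by_regex)
```

The path walk respects nesting, while the regex returns whichever occurrence of the key comes first in the text. That fragility is one reason a path-aware function is preferable to the workaround.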

Page 14: Netflix running Presto in the AWS Cloud

Features we would like

● Big table join
● User Defined Functions
● Break down one column value into several tuples
  ○ In Hive: lateral view explode json_tuple
● Decimal type
● Scheduler
● Writes
  ○ Insert overwrite
  ○ Alter table add partition
  ○ Parallel writes from workers (not client only)
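
The "break down one column value into several tuples" request can be pictured with a short Python sketch of explode-style behavior. The function name, column names, and rows are hypothetical, and the sketch only mirrors the general idea of Hive's lateral view explode rather than any engine's actual code.

```python
import json


def explode(rows, column):
    """Yield one output row per element of the JSON array stored in `column`."""
    for row in rows:
        for element in json.loads(row[column]):
            out = dict(row)        # copy the other columns unchanged
            out[column] = element  # replace the array with a single element
            yield out


# Hypothetical input: one row per user, with a JSON array in "titles".
rows = [
    {"user": "a", "titles": '["t1", "t2"]'},
    {"user": "b", "titles": '["t3"]'},
    {"user": "c", "titles": "[]"},
]
exploded = list(explode(rows, "titles"))
for row in exploded:
    print(row)
```

In this sketch a row with an empty array produces no output rows, so user "c" disappears from the result.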

Page 15: Netflix running Presto in the AWS Cloud

Q & A

Thank you!