Copyright ©2015 Treasure Data. All Rights Reserved.
Presto as a Service
Tips for operation and monitoring
Dongmin Yu
Treasure Data, Inc.
min@treasure-data.com
JeroMQ / ZeroMQ committer & maintainer
Mar 19, 2015
Presto Meetup @ Facebook
Topics
• Presto as a Service in Treasure Data
  – Error Recovery
  – Presto Deployment
• Tips for Monitoring Presto
  – JSON API
  – Presto + Fluentd
• Custom changes
Treasure Data: Presto as a Service
[Figure: Treasure Data architecture diagram. Labels: Presto Public Release; TD API / Web Console; Interactive query; batch query; Hive; Presto; td-presto connector; Treasure Data PlazmaDB: MessagePack Columnar Storage]
Deployment
• Building Presto takes more than 20 minutes
• Facebook frequently releases new versions
• Let CircleCI build Presto
  – Deploy jar files to a private Maven repository
  – We sometimes use non-release versions
    • for fixing serious bugs
    • hot-fix patches
• Integration Test
  – td-presto connector
    • PlazmaDB, Multi-tenant query scheduler
    • Query optimizer
  – Run test queries on staging cluster
  – Presto Verifier
Production: Blue-Green Deployment
• http://martinfowler.com/bliki/BlueGreenDeployment.html
• 2 Presto Coordinators (Blue/Green)
  – Route Presto queries to the active cluster
  – No downtime upon deployment
• Launch Presto worker instances with Chef <- less than 5 min. in AWS
• The inactive cluster is used for pre-production testing and customer support
  – Investigation and tuning of customer query performance
  – Troubleshooting
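The routing idea on this slide can be sketched as a small switch between two coordinator URLs. This is only an illustration: the class name, method names and example URLs are hypothetical and are not taken from the TD implementation.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Routes queries to whichever coordinator is currently active (hypothetical)."""
    blue_url: str
    green_url: str
    active: str = "blue"  # which cluster currently serves production traffic

    def active_url(self) -> str:
        return self.blue_url if self.active == "blue" else self.green_url

    def inactive_url(self) -> str:
        # The inactive cluster stays available for pre-production tests and support.
        return self.green_url if self.active == "blue" else self.blue_url

    def switch(self) -> None:
        # Flip traffic once the new version has been verified on the inactive side.
        self.active = "green" if self.active == "blue" else "blue"


# Example URLs are placeholders, not real endpoints.
router = BlueGreenRouter("http://presto-blue.example:8080",
                         "http://presto-green.example:8080")
```

Calling `router.switch()` after a deployment moves new queries to the other coordinator without restarting the one that is still serving running queries.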
Error Recovery
• Presto has no fault tolerance
• Error types
  – User error
    • Syntax errors
      – SQL syntax, missing function
    • Semantic errors
      – missing tables/columns
  – Insufficient resource
    • Exceeded task memory size
  – Internal failure
    • I/O error
      – S3/Riak CS
    • worker failure
    • etc.
Worth A Retry!
Failed Query Rate
Query Retry Patterns used in TD
• Error code + message pattern
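A minimal sketch of matching on error code plus message pattern follows. The rule list, the error code strings and the regular expressions are made up for illustration; the patterns actually used in TD are not given in the slides.

```python
import re
from typing import NamedTuple, Optional


class RetryRule(NamedTuple):
    error_code: Optional[str]  # None matches any error code
    message_pattern: str       # regular expression searched in the error message


# Hypothetical rules: internal failures are retried, user errors are not.
RETRY_RULES = [
    RetryRule("INTERNAL_ERROR", r"S3|Riak CS|I/O"),
    RetryRule("INTERNAL_ERROR", r"worker .* (failed|unreachable)"),
]


def should_retry(error_code: str, message: str) -> bool:
    """Return True when some rule matches both the error code and the message."""
    for rule in RETRY_RULES:
        if rule.error_code is not None and rule.error_code != error_code:
            continue
        if re.search(rule.message_pattern, message):
            return True
    return False
```

Under these assumed rules, `should_retry("INTERNAL_ERROR", "I/O error reading from S3")` is retried, while a syntax error is returned to the user immediately.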
Monitoring Presto with Fluentd
Monitoring Presto
• REST API for monitoring Presto state
  – JSON format
• (presto server IP):8080/v1/query
  – List of recent queries (BasicQueryInfo class)
• (presto server IP):8080/v1/query/(query id)
  – Detailed query state information
  – Query plan, tasks and running worker IDs
  – Processed rows/data size
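As a rough sketch, the query list endpoint can be polled with the Python standard library and summarized by state. The `queryId` and `state` keys are assumed field names of the returned JSON objects, and the sample payload is hand-written rather than captured from a real server.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_query_list(host: str, port: int = 8080) -> list:
    """GET http://host:port/v1/query and decode the JSON array."""
    with urlopen(f"http://{host}:{port}/v1/query", timeout=10) as resp:
        return json.load(resp)


def count_by_state(queries: list) -> Counter:
    """Count queries per state; the 'state' key is an assumed field name."""
    return Counter(q.get("state", "UNKNOWN") for q in queries)


# Offline example with a hand-written payload instead of a live coordinator.
sample = json.loads('[{"queryId": "q1", "state": "RUNNING"},'
                    ' {"queryId": "q2", "state": "FINISHED"},'
                    ' {"queryId": "q3", "state": "FINISHED"}]')
state_counts = count_by_state(sample)
```

Against a live coordinator, `count_by_state(fetch_query_list("(presto server IP)"))` would give the same kind of summary.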
Query List /v1/query
Detailed query Info /v1/query/(query id)
/ui/query-execution/(query id)
Complex Queries
Presto Coordinator
• Organizes query execution pipelines
  – Coordinates presto workers
• Retrieves table partition and split location from connectors
  – Creates distributed query plans
• Full GC
  – Stalls coordinator
• When memory is insufficient
  – Use memory-rich machine
  – GC Tuning
    • UseG1GC
presto-metrics (Ruby)
• https://github.com/xerial/presto-metrics
Query Collection in TD
• SQL query logs
  – query, detailed query plan, elapsed time, processed rows, etc.
  – newSetBinder(binder, EventClient.class).addBinding().to(FluentEventClient.class)
• Presto is used for analyzing the query history
Daily/Hourly Query Usage
Query Running Time
• More than 90% of queries finish within 2 min.
  – ≒ expected response time for interactive queries
Detecting Anomaly
• Started Query Rate (in 5min/15min)
  – If no query has started, cluster may be down (or not started properly)
• Processed rows in a query
  – Sum up the number of the processed rows from all of the sub stages
  – Simple, but the most reliable measure
• Send an alert
  – Slack notification
  – PagerDuty call
    • JP/US team rotation
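The two measures above can be sketched as follows. The JSON keys `stageStats`, `processedInputPositions` and `subStages` are assumptions about the detailed query info layout, and the 5-minute window is just the smaller of the two intervals named on the slide.

```python
from typing import Iterable


def total_processed_rows(stage: dict) -> int:
    """Recursively sum processed rows over a stage and all of its sub stages."""
    rows = stage.get("stageStats", {}).get("processedInputPositions", 0)
    for sub in stage.get("subStages", []):
        rows += total_processed_rows(sub)
    return rows


def cluster_looks_down(start_times: Iterable[float], now: float,
                       window_sec: float = 300.0) -> bool:
    """True when no query started within the last window_sec seconds."""
    return not any(now - window_sec <= t <= now for t in start_times)


# Hand-written stage tree (assumed layout), not real Presto output.
stage_tree = {
    "stageStats": {"processedInputPositions": 10},
    "subStages": [
        {"stageStats": {"processedInputPositions": 1000}, "subStages": []},
        {"stageStats": {"processedInputPositions": 500},
         "subStages": [{"stageStats": {"processedInputPositions": 40}}]},
    ],
}
```

An alerting job could call `cluster_looks_down` on the start timestamps of recent queries and send the Slack or PagerDuty notification when it returns `True`.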
Benchmarking
• Query performance comparison
  – between two versions of Presto
• Benchmark
  – Run query set multiple times
  – Store the results to TD
  – Report the result with Presto
    • Aggregation query
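A simple way to compare two versions is the ratio of median elapsed times per query. The sketch below does this in plain Python with invented timings; in TD the aggregation is reported with a Presto query over stored results, which is not reproduced here.

```python
from statistics import median


def compare_versions(runs_a: dict, runs_b: dict) -> dict:
    """Return median(B) / median(A) per query; a value above 1.0 means B is slower."""
    return {
        name: median(runs_b[name]) / median(runs_a[name])
        for name in runs_a
        if name in runs_b
    }


# Hypothetical elapsed times in seconds for two Presto versions.
old_version = {"q1": [10.0, 12.0, 11.0], "q2": [4.0, 4.0, 5.0]}
new_version = {"q1": [5.0, 6.0, 5.5], "q2": [8.0, 8.0, 9.0]}
ratios = compare_versions(old_version, new_version)
```

Using the median rather than the mean keeps a single slow run, for example one hit by a full GC, from dominating the comparison.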
Presto Operation Tool
• Prestop
  – Our internal tool for managing multiple presto clusters
    • written in Scala
  – Query monitoring
  – Benchmarking
  – Workload simulation
    • stress testing
• Monitoring
  – Datadog
  – PagerDuty
  – ChartIO (query stats)
Optimizing Scan Performance – Storage Manager
• Fully utilize the network bandwidth from S3
• CPU then becomes the bottleneck in TD Presto
[Figure: Storage Manager read path. Recoverable labels:]
• TableScanOperators
  – s3 file list
  – table schema header
• Buffer
  – release(Buffer)
  – Buffer size limit
  – Reuse allocated buffers
• Request Queue
  – priority queue
  – max connections limit
• S3 / RiakCS
• MPC1 file
  – Header
  – Column Block 0 (column names)
  – Column Block 1, ..., Column Block i, ..., Column Block m
• HeaderReader
  – callback to HeaderParser
• HeaderParser
  – parse MPC file header
  – column block offsets
  – column names
• ColumnBlockReader
• MessageUnpacker
• S3 read
  – decompression
  – msgpack-java v07
  – On-demand de-ser
• Retry GET request on
  – 500 (internal error)
  – 503 (slow down)
  – 404 (not found) - eventual consistency
• Arrow labels: request, header, column block request, column block, prepare, buffer, pull records
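The GET retry policy from the figure can be sketched like this. The status codes come from the slide; the function name, the attempt limit and the exponential backoff are assumptions added for illustration, and `do_get` stands in for whatever S3 / RiakCS client call is actually used.

```python
import time
from typing import Callable, Tuple

RETRYABLE_STATUS = {500, 503, 404}  # status codes listed on the slide


def get_with_retry(do_get: Callable[[], Tuple[int, bytes]],
                   max_attempts: int = 5,
                   base_delay_sec: float = 0.1,
                   sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call do_get until it returns status 200, retrying on RETRYABLE_STATUS."""
    for attempt in range(max_attempts):
        status, body = do_get()
        if status == 200:
            return body
        if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
            raise IOError(f"GET failed with status {status}")
        sleep(base_delay_sec * (2 ** attempt))  # exponential backoff (assumption)
    raise IOError("no attempts were made")


# Simulated responses: two retryable failures, then success.
responses = iter([(503, b""), (404, b""), (200, b"column block")])
result = get_with_retry(lambda: next(responses), sleep=lambda _s: None)
```

Retrying on 404 only makes sense because of eventual consistency: an object that was just written may not be visible yet, so a short wait can turn the miss into a hit.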
Multi-tenancy: Resource Allocation
• Price-plan based resource allocation
• Parameters
  – The number of worker nodes to use (min-candidates)
  – The number of hash partitions (initial-hash-partitions)
  – The maximum number of running tasks per account
• If running queries exceed the allowed number of tasks, subsequent queries must wait (queued)
• Presto: SqlQueryExecution class
  – Controls query execution state: planning -> running -> finished
    • No resource allocation policy
  – Extended TDSqlQueryExection class monitors running tasks and limits resource usage
    • Rewriting SqlQueryExecutionFactory at run-time by using ASM library
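The per-account limit can be illustrated with a small admission queue. This is a simplified sketch in Python, not the TD extension of SqlQueryExecution: the class and method names are hypothetical, and a query is admitted only if its tasks fit under the account limit at submit time.

```python
from collections import defaultdict, deque


class AccountTaskLimiter:
    """Queues queries once an account would exceed its allowed running tasks."""

    def __init__(self, max_tasks_per_account: dict):
        self.max_tasks = max_tasks_per_account  # e.g. derived from the price plan
        self.running = defaultdict(int)         # account -> running task count
        self.queued = defaultdict(deque)        # account -> waiting (query_id, tasks)

    def submit(self, account: str, query_id: str, tasks: int) -> str:
        if self.running[account] + tasks <= self.max_tasks[account]:
            self.running[account] += tasks
            return "running"
        self.queued[account].append((query_id, tasks))
        return "queued"

    def finish(self, account: str, tasks: int) -> list:
        """Release tasks and start queued queries that now fit, in FIFO order."""
        self.running[account] -= tasks
        started = []
        queue = self.queued[account]
        while queue and self.running[account] + queue[0][1] <= self.max_tasks[account]:
            query_id, needed = queue.popleft()
            self.running[account] += needed
            started.append(query_id)
        return started


# Hypothetical account with a limit of 10 running tasks.
limiter = AccountTaskLimiter({"acme": 10})
first = limiter.submit("acme", "q1", 8)
second = limiter.submit("acme", "q2", 5)
resumed = limiter.finish("acme", 8)
```

Here `q2` waits while `q1` holds 8 of the 10 allowed tasks and starts as soon as `q1` finishes.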
WE ARE HIRING!
Check: www.treasuredata.com