Enterprise DataLake Consumption Layer powered by Presto @ WalmartLabs

Ashish Tadose, Principal Engineer

Agenda

• Data stores @ Walmart Labs

• Motivation for Presto as Distributed Query service

• Multi-tenant Distributed Query service

• Presto deployment & auto-scaling in GCP

• Security integrations

• Overall architecture

• Monitoring

• Best practices and tuning

Data stores @ Walmart Labs

Access needs vary from team to team; one solution does not fit all.

Motivation for Presto

• DataLake cluster powered by on-prem Hadoop/HDFS

• Compute and storage colocation – GOOD

• Need to ingest data from diverse sources – CHALLENGING

• Scaling out compute with growing needs – CHALLENGING

• Need to separate storage and compute and support federated queries – PRESTO

• Isolated clusters in private cloud powering dedicated data-marts

Data journey

Multi-tenant Query service - requirements

• Simplified query access layer

• Leverage cloud elastic compute

• Better scalability & effective cluster utilization by auto-scaling

• Performant query response times

• Security

– Authentication – LDAP
– Authorization – work with existing policies

• Handle sensitive data – encryption at rest & over the wire

• Efficient monitoring & alerting

• Resource quotas – SLA guarantees

• Flexibility to configure query configuration per tenant

Presto & Alluxio work well together

[Charts: Presto vs. Presto + Alluxio. Small range query response time (lower is better); large scan query response time (lower is better); concurrency (higher is better)]

• Avoids unpredictable network

• Consistent query latency

• Higher throughput and better concurrency

Presto on GCP

• Cloud DataProc init scripts or optional image - https://cloud.google.com/dataproc/docs/tutorials/presto-dataproc

– Super easy to spawn a Presto cluster
– Elevated cost due to managed services such as DataProc
– Overhead of additional Hadoop components
– Difficult to source a new catalog or deploy config changes

• Alluxio – no GCP managed deployment

• Presto-admin – can be used for deployment and configuration, but not for auto-scaling

• Need for a lower-level deployment strategy

GCP Presto auto-deployment

• WalmartLabs internal auto-scaler Presto deployer

• Framework to deploy and auto-scale Presto clusters in GCP

• Leverages Ansible & GCP Deployment Manager

• Auto-scaling via configurable cluster-wide CPU & memory usage thresholds

• Our recent changes – will be released soon to the open community

– Alluxio deployment co-located with Presto workers
– Efficient configurability – suitable for multiple envs
– More auto-scaling configs
– Terraform integration – making it cloud agnostic
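
The threshold-based auto-scaling mentioned on this slide could look roughly like the sketch below. It is not the WalmartLabs deployer: the `ScalingPolicy` fields, default thresholds, and step size are hypothetical, chosen only to show how cluster-wide CPU and memory usage might drive a worker-count decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical thresholds; names and defaults are illustrative only."""
    scale_out_cpu: float = 0.75  # add workers above this cluster-wide CPU fraction
    scale_out_mem: float = 0.80  # add workers above this cluster-wide memory fraction
    scale_in_cpu: float = 0.30   # remove workers only when CPU is below this ...
    scale_in_mem: float = 0.40   # ... and memory is below this
    min_workers: int = 2
    max_workers: int = 50
    step: int = 2                # workers added or removed per decision


def desired_workers(current: int, cpu: float, mem: float,
                    policy: ScalingPolicy) -> int:
    """Return the target worker count for the observed cluster-wide usage.

    Scale out when either resource is hot; scale in only when both are
    cold, so a memory-bound cluster is not shrunk because its CPU is idle.
    """
    if cpu > policy.scale_out_cpu or mem > policy.scale_out_mem:
        target = current + policy.step
    elif cpu < policy.scale_in_cpu and mem < policy.scale_in_mem:
        target = current - policy.step
    else:
        target = current
    # Clamp to the configured cluster size bounds.
    return max(policy.min_workers, min(policy.max_workers, target))
```

Keeping a gap between the scale-out and scale-in thresholds is one way to avoid flapping when usage hovers around a single cut-off.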

Presto Security integrations

• Ranger plugin for Hive catalog

• Caching Ranger policies

• Hive MetaStore impersonation
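
The slide does not say how Ranger policies are cached. As a generic illustration only, the sketch below shows a time-based cache that refreshes policies after a TTL and keeps serving the last good copy if a refresh fails. `PolicyCache`, the `fetch` callable, and the user-to-tables policy shape are all hypothetical and are not the Ranger plugin's API.

```python
import time
from typing import Callable, Dict, Optional, Set

Policies = Dict[str, Set[str]]  # hypothetical shape: user -> tables they may read


class PolicyCache:
    """Caches the result of `fetch` and refreshes it after `ttl_seconds`."""

    def __init__(self, fetch: Callable[[], Policies], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._policies: Optional[Policies] = None
        self._loaded_at = 0.0

    def get(self) -> Policies:
        now = self._clock()
        if self._policies is None or now - self._loaded_at >= self._ttl:
            try:
                self._policies = self._fetch()
                self._loaded_at = now
            except Exception:
                # Keep serving the stale copy if the policy store is down;
                # with no copy at all there is nothing safe to serve.
                if self._policies is None:
                    raise
        return self._policies

    def is_allowed(self, user: str, table: str) -> bool:
        return table in self.get().get(user, set())
```

The trade-off in any such cache is staleness: a revoked permission stays effective until the next successful refresh.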

Hive MetaStore, Alluxio integration & Views

• Automated approach to sync metadata

• Hive MetaStore event listeners

• External metastore clients

• Waggle-dance (WIP) - https://github.com/HotelsDotCom/waggle-dance

• Hive native views access

Presto Alluxio – overall stack

Presto monitoring & archiving

• Presto event listeners

– Track latencies
– Analyze failures
– Faulty clients
– Frequently queried tables for caching

• On-prem monitoring - Prometheus & Grafana

• GCP Stackdriver integration

• GCP Stackdriver Presto MBeans integration issue
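
Presto event listeners are written in Java against Presto's event listener SPI. The sketch below is not such a listener. It assumes query-completion events have already been archived as simple records, and shows the kind of analysis the slide lists: latencies, failures per client, and frequently queried tables. The `QueryEvent` fields are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import median
from typing import Iterable, List


@dataclass
class QueryEvent:
    """Hypothetical archived record of one completed query."""
    query_id: str
    source: str          # client application that submitted the query
    state: str           # "FINISHED" or "FAILED"
    wall_seconds: float
    tables: List[str] = field(default_factory=list)


def summarize(events: Iterable[QueryEvent], top_n: int = 3) -> dict:
    """Aggregate archived events into latency, failure, and hot-table stats."""
    events = list(events)
    finished = [e.wall_seconds for e in events if e.state == "FINISHED"]
    failures = Counter(e.source for e in events if e.state == "FAILED")
    table_hits = Counter(t for e in events for t in e.tables)
    return {
        "median_latency_seconds": median(finished) if finished else None,
        "failures_by_source": dict(failures),  # points at faulty clients
        "hot_tables": [t for t, _ in table_hits.most_common(top_n)],
    }
```

The `hot_tables` list is the sort of signal that could feed a decision about which tables to cache.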

Presto custom connectors

• Kafka – ability to apply timestamp filters based on the Kafka message timestamp - https://www.slideshare.net/shubhamtagra/debugging-data-pipelines-ola-by-karan-kumar

• Druid connector – based on the Druid JDBC interface and an extension of Presto’s BaseJdbcClient

• ClickHouse connector

• ThoughtSpot connector

• BigQuery connector

• SAP HANA connector

Supporting Multi-tenant cluster

• SLA guarantees by Presto resource groups - https://prestosql.io/docs/current/admin/resource-groups.html

• Each application group has varying query patterns

– Configurable through session properties
  • join_reordering_strategy
  • optimize_top_n_row_number
  • query_max_execution_time

– Session Property Managers - https://prestosql.io/docs/current/admin/session-property-managers.html
  • Configure sessions for resource groups, source types, client tags
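
As a rough illustration of how per-tenant quotas and session defaults fit together, the sketch below generates the two JSON documents used by Presto's file-based resource group and session property managers. The tenant names, limits, source patterns, and property values are invented for the example; the field names follow the documentation linked above and should be checked against the Presto version in use.

```python
import json
import re
from typing import Dict, List

# Hypothetical tenants: quotas plus the session defaults each should receive.
TENANTS: Dict[str, dict] = {
    "reporting": {
        "source_regex": "reporting-.*", "memory": "60%",
        "concurrency": 20, "max_queued": 200,
        "session": {"query_max_execution_time": "30m",
                    "join_reordering_strategy": "AUTOMATIC"},
    },
    "adhoc": {
        "source_regex": "adhoc-.*", "memory": "40%",
        "concurrency": 10, "max_queued": 50,
        "session": {"query_max_execution_time": "10m",
                    "optimize_top_n_row_number": "true"},
    },
}


def resource_groups(tenants: Dict[str, dict]) -> dict:
    """One sub-group per tenant under a shared 'global' root group."""
    return {
        "rootGroups": [{
            "name": "global",
            "softMemoryLimit": "100%",
            "hardConcurrencyLimit": sum(t["concurrency"] for t in tenants.values()),
            "maxQueued": sum(t["max_queued"] for t in tenants.values()),
            "subGroups": [
                {"name": name,
                 "softMemoryLimit": t["memory"],
                 "hardConcurrencyLimit": t["concurrency"],
                 "maxQueued": t["max_queued"]}
                for name, t in tenants.items()
            ],
        }],
        # Route queries to a tenant group by the client-supplied source.
        "selectors": [
            {"source": t["source_regex"], "group": f"global.{name}"}
            for name, t in tenants.items()
        ],
    }


def session_property_rules(tenants: Dict[str, dict]) -> List[dict]:
    """Session defaults matched on the resource group a query lands in."""
    return [
        {"group": re.escape(f"global.{name}"), "sessionProperties": t["session"]}
        for name, t in tenants.items() if t.get("session")
    ]


if __name__ == "__main__":
    print(json.dumps(resource_groups(TENANTS), indent=2))
    print(json.dumps(session_property_rules(TENANTS), indent=2))
```

Generating both files from one tenant table keeps the quota and the session defaults for a tenant from drifting apart.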

Distributed query across Data stores

ORC storage recommendations

• ORC compression – ZLIB

– Point-to-point queries perform well with Snappy
– For large aggregations, ZLIB is better

• Enable bloom filters on columns frequently used in filters

• Enable sorting on frequently used columns (boosts query performance at the cost of higher ingestion time)

• Increase ORC stripe & stride size

– ORC files are splittable at the stripe level, which affects parallelism
– We observed an 18%-22% increase in Presto parallelism (after setting stripe size = 128 MB and index stride = 16k)

• Enable table & column stats (most important)

– Stats can now be computed via Presto - https://prestosql.io/docs/current/sql/analyze.html
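
One way to apply these recommendations is through ORC table properties in Hive, followed by `ANALYZE` in Presto. The sketch below only builds the statements as strings. The table and column names are made up, and the slide's "16k" stride is taken as 16,000 rows, which is an assumption. Changing table properties generally affects only files written afterwards, so existing data would need to be rewritten to benefit.

```python
from typing import Dict, Sequence


def orc_table_properties(bloom_columns: Sequence[str],
                         compression: str = "ZLIB",
                         stripe_bytes: int = 128 * 1024 * 1024,
                         row_index_stride: int = 16_000) -> Dict[str, str]:
    """ORC writer settings expressed as Hive table properties."""
    props = {
        "orc.compress": compression,
        "orc.stripe.size": str(stripe_bytes),          # bytes per stripe
        "orc.row.index.stride": str(row_index_stride),  # rows per index entry
    }
    if bloom_columns:
        props["orc.bloom.filter.columns"] = ",".join(bloom_columns)
    return props


def alter_table_statement(table: str, props: Dict[str, str]) -> str:
    """Hive DDL that sets the given properties on an existing table."""
    assignments = ", ".join(f"'{k}'='{v}'" for k, v in sorted(props.items()))
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({assignments})"


def analyze_statement(table: str) -> str:
    """Presto statement that collects table and column statistics."""
    return f"ANALYZE {table}"


if __name__ == "__main__":
    # Hypothetical table and filter column.
    props = orc_table_properties(["customer_id"])
    print(alter_table_statement("sales.orders", props))
    print(analyze_statement("sales.orders"))
```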

THANKS!
