Presto - Analytical Database
Wojciech Biela, Łukasz Osipiuk
https://prestodb.io
2
Who are we?
Center for Hadoop
3
History of Presto
➔ Fall 2008: Facebook open sources Hive
➔ Fall 2012: 6 developers start Presto development
➔ Spring 2013: Presto rolled out within Facebook
➔ Fall 2013: Facebook open sources Presto
➔ Fall 2014: 88 releases, 41 contributors, 3943 commits
➔ Fall 2015: 132 releases, 105 contributors, 6300 commits; Teradata part of Presto community & offers support
4
What is Presto?
➔ 100% open source distributed ANSI SQL engine for Big Data
➔ Optimized for low latency, interactive querying
◆ Cross platform query capability, not only SQL on Hadoop
◆ Distributed under the Apache license, now supported by Teradata
◆ Used by a community of well known, well respected technology companies
◆ Modern code base
◆ Proven scalability
5
High level architecture
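[Diagram: a Client talks to the Coordinator, which contains the Parser/analyzer, Planner, and Scheduler. The Coordinator uses the Metadata API and the Data location API; the Workers read data through the Data stream API. All three APIs are marked as pluggable.]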
6
Plan execution: Hive vs. Presto
[Diagram: on the Hive side, map and reduce stages separated by I/O steps; on the Presto side, a pipeline of tasks with I/O only at the edges.]
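The contrast on this slide can be sketched with a toy Python model. It is only an illustration of materialized stages versus pipelined operators, not how Hive or Presto are actually implemented; the function names and the `io_log` list are hypothetical.

```python
from typing import Iterable, Iterator, List


def staged(rows: List[int], io_log: List[str]) -> List[int]:
    """Stage-by-stage execution: each stage finishes before the next starts."""
    io_log.append("read source")
    mapped = [r * 2 for r in rows]            # "map" stage, fully materialized
    io_log.append("write intermediate")       # result stored between stages
    io_log.append("read intermediate")
    return [r for r in mapped if r % 3 == 0]  # "reduce"-like stage


def double(rows: Iterable[int]) -> Iterator[int]:
    """Streaming operator: emits each row as soon as it is produced."""
    for r in rows:
        yield r * 2


def keep_multiples_of_three(rows: Iterable[int]) -> Iterator[int]:
    """Streaming operator: filters rows without buffering the whole input."""
    for r in rows:
        if r % 3 == 0:
            yield r


def pipelined(rows: List[int], io_log: List[str]) -> List[int]:
    """Pipelined execution: operators are chained, nothing is stored in between."""
    io_log.append("read source")
    return list(keep_multiples_of_three(double(rows)))
```

Both functions compute the same result; the staged version simply records two extra I/O steps for the intermediate data, which is the cost the slide highlights.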
7
Presto Extensibility – connector interfaces
[Diagram: the Coordinator (Parser/analyzer, Planner, Scheduler) and a Worker. Each of the Metadata API, the Data location API, and the Data stream API has connector implementations for Hive, Cassandra, Kafka, MySQL, …]
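A minimal sketch of how the three connector-facing APIs named on this slide could fit together. Presto's real connector interfaces are written in Java and look different; the class names, method names, and the in-memory connector below are all hypothetical and exist only to show the division of responsibilities.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Split:
    """A unit of work: one chunk of a table and the host that holds it."""
    table: str
    host: str
    chunk: int


class Connector(ABC):
    """Hypothetical stand-in for the three APIs shown on the slide."""

    @abstractmethod
    def columns(self, table: str) -> List[str]:
        """Metadata API: lets the parser/analyzer and planner resolve names."""

    @abstractmethod
    def splits(self, table: str) -> List[Split]:
        """Data location API: tells the scheduler where the data lives."""

    @abstractmethod
    def read(self, split: Split) -> Iterator[Tuple]:
        """Data stream API: lets a worker pull rows for one split."""


class InMemoryConnector(Connector):
    """Toy connector: each table is (column names, list of row chunks)."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[List[Tuple]]]]):
        self._tables = tables

    def columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def splits(self, table: str) -> List[Split]:
        chunks = self._tables[table][1]
        return [Split(table, f"host{i}", i) for i in range(len(chunks))]

    def read(self, split: Split) -> Iterator[Tuple]:
        yield from self._tables[split.table][1][split.chunk]


def run_scan(connector: Connector, table: str) -> Tuple[List[str], List[Tuple]]:
    """Walk the three APIs in the order a full table scan would use them."""
    cols = connector.columns(table)    # coordinator: analyze and plan
    splits = connector.splits(table)   # coordinator: schedule
    rows = [row for s in splits for row in connector.read(s)]  # workers: read
    return cols, rows
```

Under this sketch, supporting a new data source means writing one more `Connector` subclass; the coordinator and worker logic in `run_scan` does not change.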
8
Presto Extensibility – plugins
➔ Connectors
➔ Data types
➔ Extra functions
➔ Security providers
9
Some usage facts
➔ Facebook
◆ Multiple production clusters (100s of nodes total)
● Including 300 PB Hadoop data warehouse
● Single cluster size on the order of 10s of nodes
◆ 1000s of internal daily active users
◆ Millions of queries each month
◆ Multiple PBs scanned every day
◆ Trillions of rows a day
◆ ORC format
➔ Netflix
◆ Over 250-node production cluster on EC2
◆ Over 15 PB in S3 (Parquet format)
◆ Over 300 users and 2.5K queries daily
◆ presto-cli, R, Python, BI tools
◆ 50% of queries under 4s
10
Netflix Data Pipeline
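[Diagram: event data from TVs, mobile devices, and laptops flows through Suro / Kafka and Ursula into Amazon S3; dimension data flows from Cassandra through Aegisthus into Amazon S3. The diagram also shows a box labeled TD.]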
11
Presto use-cases at Facebook
➔ three use cases
◆ Data warehouse - big data
◆ User facing - small data
◆ User facing - medium data
12
Presto use-cases at Facebook (data warehouse)
HDFS data warehouse
13
Presto use-cases at Facebook (data warehouse)
➔ Multiple clusters
➔ O(10^3) users
➔ O(10^6) queries per month
➔ petabytes of data scanned every day
➔ 100s of concurrent queries
14
Presto use-cases at Facebook (data warehouse)
[Diagram: a Loader, a Client, Hive, and a set of Data Nodes; some Data Nodes run Presto and others run M/R.]
15
Presto use-cases at Facebook (data warehouse)
[Diagram: a Client, a Dispatcher, and multiple Presto instances.]
16
Presto use-cases at Facebook (realtime)
Real time user facing
17
Presto use-cases at Facebook (realtime)
Requirements
➔ User facing
➔ 0.1-5 seconds latency
➔ Support for data updates
➔ highly available
➔ 10-15 way joins
18
Presto use-cases at Facebook (realtime)
[Diagram: a Loader, a Client, several Presto nodes, and multiple mysql instances.]
19
Presto use-cases at Facebook (semi realtime)
Requirements
➔ Large data sets (smaller than warehouse)
➔ seconds to minutes latency
➔ predictable performance
➔ 5-15 minutes load latency
➔ 100s concurrent queries
20
Presto use-cases at Facebook (semi realtime)
Raptor
21
Presto use-cases at Facebook (semi realtime)
[Diagram: a Raptor Loader and a Loader, a Client, Presto nodes each paired with Flash storage, mysql, several Kafka instances, and Gluster as the backup tier.]
22
Presto use-cases at Facebook (semi realtime)
[Diagram: same layout as the previous slide (Raptor Loader, Loader, Client, Presto nodes with Flash storage, mysql, Kafka, Gluster backup tier), annotated with the two loader steps below.]
➔ Mark load in progress in MySQL
➔ INSERT INTO raptor_table SELECT * FROM kafka_table WHERE token BETWEEN ${last_token} AND ${next_token}
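The two loader annotations on this slide can be sketched as a small driver loop. The SQL text and the `${last_token}` / `${next_token}` placeholders come from the slide; everything else is an assumption made for illustration: the window size, the non-overlapping window boundaries, the `mark` and `execute` callbacks, and the "DONE" state are all hypothetical, and the slides do not say how the real loader chooses tokens or records completion.

```python
from string import Template
from typing import Callable, Iterator, List, Tuple

# Statement taken from the slide; only the keyword casing is normalized.
LOAD_SQL = Template(
    "INSERT INTO raptor_table SELECT * FROM kafka_table "
    "WHERE token BETWEEN ${last_token} AND ${next_token}"
)


def token_ranges(start: int, end: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, non-overlapping (last_token, next_token) windows.

    BETWEEN is inclusive on both ends, so consecutive windows must not
    share a boundary or rows would be loaded twice (an assumption about
    how tokens are used, not something the slide states).
    """
    last = start
    while last <= end:
        nxt = min(last + step - 1, end)
        yield last, nxt
        last = nxt + 1


def plan_loads(start: int, end: int, step: int) -> List[str]:
    """Render one INSERT statement per token window."""
    return [
        LOAD_SQL.substitute(last_token=a, next_token=b)
        for a, b in token_ranges(start, end, step)
    ]


def load_window(
    last_token: int,
    next_token: int,
    mark: Callable[[int, int, str], None],
    execute: Callable[[str], None],
) -> None:
    """Run one load: record it as in progress, then issue the INSERT."""
    mark(last_token, next_token, "IN_PROGRESS")  # slide: mark load in progress
    execute(LOAD_SQL.substitute(last_token=last_token, next_token=next_token))
    mark(last_token, next_token, "DONE")         # hypothetical final state
```

In a real deployment `mark` would write to the MySQL bookkeeping table and `execute` would submit the statement to Presto; here they are plain callbacks so the control flow can be read on its own.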
23
Presto use-cases at Facebook (semi realtime)
Extra features
➔ Physical data reorganization
➔ Fully fledged and atomic DDL
➔ Atomic data loading
➔ Tiered architecture
24
Presto = Performance
➔ Data stays in memory during execution and is pipelined across nodes MPP-style
➔ Vectorized columnar processing
➔ Presto is written in highly tuned Java
◆ Efficient in-memory data structures
◆ Very careful coding of inner loops
◆ Bytecode generation
➔ Optimized ORC reader
➔ Predicate push-down
➔ Query optimizer
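Two of the points on this slide, columnar processing and predicate push-down, can be illustrated with a toy Python model. It is a sketch under stated assumptions, not Presto's ORC reader: the `Stripe` class and `scan_sum` function are hypothetical, and the only property borrowed from columnar formats such as ORC is that per-column min/max statistics are available without reading the data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Stripe:
    """A block of rows stored column by column, with min/max stats per column."""
    columns: Dict[str, List[int]]
    stats: Dict[str, Tuple[int, int]] = field(init=False)

    def __post_init__(self) -> None:
        # Computed once up front, standing in for statistics kept in file metadata.
        self.stats = {name: (min(v), max(v)) for name, v in self.columns.items()}


def scan_sum(
    stripes: List[Stripe], filter_col: str, lo: int, hi: int, sum_col: str
) -> Tuple[int, int]:
    """Compute sum(sum_col) where lo <= filter_col <= hi.

    Returns (total, stripes_read) so the effect of skipping is visible.
    """
    total = 0
    stripes_read = 0
    for stripe in stripes:
        col_min, col_max = stripe.stats[filter_col]
        if col_max < lo or col_min > hi:
            continue  # predicate pushed down: the whole stripe cannot match
        stripes_read += 1
        keys = stripe.columns[filter_col]   # only the two needed columns are touched
        values = stripe.columns[sum_col]
        total += sum(v for k, v in zip(keys, values) if lo <= k <= hi)
    return total, stripes_read
```

With three stripes covering days 1-3, 4-6, and 7-9, a filter on days 5 to 7 reads only the last two stripes and never touches columns other than the filter and sum columns.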
25
How can I contribute?
www.github.com/facebook/presto
www.github.com/prestodb
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto User's Group: www.groups.google.com/group/presto-users
Interested in joining Teradata?
● Presto development
● Other Hadoop related development and consulting
Contact our Recruitment Partner: Renata Rosłoniec (VBC), tel. 514 035 237, [email protected]