1
Presto - SQL on anything
January 2017
Grzegorz Kokosiński, Karol Sobczak
Teradata Center for Hadoop
2
Agenda
- Who are we?
- What is Presto?
- What is data federation?
- Different federation strategies in other databases (HIVE)
- what is supported and what are the problems
- Presto Connector
- Show time
3
Let's make some noise
• Let's tweet about this presentation!
– #whug
– #prestodb
– #teradata
• Later on we will query that data!
4
Who are we
5
What is Presto?
• 100% open source distributed SQL query engine
- Originally developed by Facebook
• Key Differentiators:
- Performance & Scale
- Cross platform query capability, not only SQL on Hadoop
• Apache licensed, hosted on GitHub
- Certified distro & support from Teradata
6
Presto Users
See more at https://github.com/prestodb/presto/wiki/Presto-Users
7
Presto in Production
• Facebook
– Multiple production clusters (100s of nodes total)
– 300PB in HDFS, sharded MySQL, SSD-based Raptor
– 1000s of internal daily active users
– 10s-100s of concurrent queries
• Netflix
– 250+ nodes on EC2, 40+ PB in S3 (Parquet format)
– Over 650 active users and 6K+ queries daily
• Twitter
– 200+ nodes on-premises over Parquet nested data
• Uber
– 200+ nodes (2 dedicated clusters) with 25K+ & 3K+ queries daily
• FINRA
– 120+ nodes in AWS, 2PB in S3, 200+ users (supported by Teradata)
8
Presto – Query Execution Performance
• In-memory processing
• Pipelined execution across nodes (MPP-style)
– Vectorized columnar processing
– Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
– Efficient memory management (reduced GC overhead)
– Very careful coding of inner loops
– Runtime bytecode generation
• Optimized ORC & Parquet readers
• Excellent performance with interactive SQL analytics
– Enables the use of BI tools
9
Supported data sources & file formats
• Hadoop/Hive connector & file formats (HDFS/S3):
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text
• Raptor
– columnar store on flash driven by Facebook
• Open source data stores (driven by the community)
– MySQL & PostgreSQL (non-parallel)
– Cassandra (by Teradata)
– Kafka
– Redis
– MongoDB
– ElasticSearch
– Accumulo (by Bloomberg)
10
ANSI SQL Support
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
In addition:
• Windowing functions
• UNNEST, TABLESAMPLE
• ROLLUP, CUBE, GROUPING SETS
• UNION, EXCEPT, INTERSECT
• Subqueries (EXISTS, IN)
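A few of the core clauses from the grammar above (WITH, GROUP BY, HAVING, ORDER BY, LIMIT) can be tried without a cluster. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in engine rather than Presto, and the tweets table and its rows are hypothetical.

```python
import sqlite3

# Assumption: sqlite3 stands in for the query engine; this is not Presto.
# The "tweets" table and its contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (hashtag TEXT, author TEXT);
    INSERT INTO tweets VALUES
        ('#whug', 'alice'), ('#whug', 'bob'), ('#whug', 'carol'),
        ('#prestodb', 'alice'), ('#prestodb', 'bob'),
        ('#teradata', 'dave');
""")

# WITH + GROUP BY + HAVING + ORDER BY + LIMIT, as listed in the grammar.
query = """
    WITH counts AS (
        SELECT hashtag, COUNT(*) AS n
        FROM tweets
        GROUP BY hashtag
        HAVING COUNT(*) >= 2
    )
    SELECT hashtag, n
    FROM counts
    ORDER BY n DESC
    LIMIT 2
"""
top_hashtags = conn.execute(query).fetchall()
print(top_hashtags)  # [('#whug', 3), ('#prestodb', 2)]
```

Because these clauses are standard SQL, the query text should need little or no change when pointed at another ANSI-compliant engine.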
11
Presto is not a database!
• Presto is a query execution engine (storage independent)
• Pluggable custom user functionalities
– Connectors
– Functions
– Types
– System access controllers
– Resource group configuration managers
– Event listeners
– …
• Built-in core functionalities:
– parser, execution, types, SQL functions, monitoring
12
Data federation
• Query data from several data sources (databases)
• Streaming
– One to One
- there is a single connection between database access points
- e.g. PSQL via PSQL
- using storage handlers to access RDBMS data from Hive
– Many to One
- many connections from one database's nodes to a single access point of the other database
- accessing REST from a UDF in (possibly each) HIVE map/reduce task
– Many to Many
- workers talk to each other directly
• Through storage
– Needs (intermittent) data materialization
• Presto supports them all!
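As a rough illustration of the one-to-one streaming case, the sketch below joins rows from two independent databases inside the "engine". Everything here is an assumption made for the example: two in-memory sqlite3 databases play the role of separate data sources, and federated_join is a hypothetical helper, not Presto code.

```python
import sqlite3

def make_source(script):
    """Create an independent in-memory database (a stand-in data source)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn

# Two unrelated sources; table names and rows are hypothetical.
users_db = make_source("""
    CREATE TABLE users (id INTEGER, name TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
""")
orders_db = make_source("""
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (1, 25), (3, 7);
""")

def federated_join(users_conn, orders_conn):
    """Hash join across sources: build from one, stream the other."""
    # Build side: load the smaller table into a hash map.
    names = dict(users_conn.execute("SELECT id, name FROM users"))
    # Probe side: rows are streamed through the cursor, one at a time.
    probe = orders_conn.execute(
        "SELECT user_id, amount FROM orders ORDER BY user_id, amount"
    )
    for user_id, amount in probe:
        if user_id in names:
            yield names[user_id], amount

result = list(federated_join(users_db, orders_db))
print(result)  # [('alice', 10), ('alice', 25), ('carol', 7)]
```

Each source is reached over a single connection, which is the one-to-one shape; the many-to-one and many-to-many variants differ in how many such connections run in parallel.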
13
Data federation common problems
• model incompatibilities
• multinode streaming is not always possible
• transactions
• cost based optimizations (statistics)
• SQL pushdown (predicates, projections, aggregations?, joins?)
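To make the pushdown item concrete, the sketch below compares filtering in the engine with filtering at the source. It is a toy under stated assumptions: sqlite3 stands in for the remote database, the events table is hypothetical, and the number of rows fetched is used as a proxy for data transferred.

```python
import sqlite3

# Assumption: an in-memory sqlite3 database plays the remote data source.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER, country TEXT)")
source.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, "PL" if i % 10 == 0 else "US") for i in range(1000)],
)

def scan_without_pushdown(conn, country):
    """Fetch every row, then filter on the engine side."""
    transferred = conn.execute("SELECT id, country FROM events").fetchall()
    matching = [row for row in transferred if row[1] == country]
    return len(transferred), matching

def scan_with_pushdown(conn, country):
    """Send the predicate to the source so only matches are fetched."""
    transferred = conn.execute(
        "SELECT id, country FROM events WHERE country = ?", (country,)
    ).fetchall()
    return len(transferred), transferred

moved_plain, rows_plain = scan_without_pushdown(source, "PL")
moved_pushed, rows_pushed = scan_with_pushdown(source, "PL")
print(moved_plain, moved_pushed)  # 1000 100
```

Both scans return the same rows; only the amount of data crossing the source boundary differs, which is why pushdown of predicates (and possibly projections, aggregations, or joins) matters in a federated setup.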
14
Connector
• Presto interface to access an arbitrary data source (hive, mysql, jmx)
• Provides:
– metadata
– ability to do distributed, parallel and streamed reads/writes
– transaction boundary
– physical data layouts
– statistics
– (SQL) predicate pushdown
– indexes (index join)
– session or table properties
– access control
– procedures (CALL …)
– . . .
• Most (if not all) of the above points are optional
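The idea of a contract in which most capabilities are optional can be sketched as follows. This is not Presto's connector SPI (which is written in Java); the Connector class, its method names, and the scan helper are all hypothetical, chosen only to show an engine falling back when an optional capability is missing.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Tuple
Predicate = Callable[[Row], bool]

class Connector(ABC):
    """Hypothetical connector contract; not Presto's actual SPI."""

    @abstractmethod
    def list_tables(self) -> List[str]:
        """Metadata: names of the tables this source exposes."""

    @abstractmethod
    def read(self, table: str) -> Iterator[Row]:
        """Stream all rows of a table."""

    def read_filtered(
        self, table: str, predicate: Predicate
    ) -> Optional[Iterator[Row]]:
        """Optional pushdown; None means 'not supported'."""
        return None

class InMemoryConnector(Connector):
    """Implements only the required part of the contract."""

    def __init__(self, tables: Dict[str, List[Row]]):
        self._tables = tables

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def read(self, table: str) -> Iterator[Row]:
        return iter(self._tables[table])

class FilteringConnector(InMemoryConnector):
    """Additionally supports the optional predicate pushdown."""

    def read_filtered(self, table, predicate):
        return (row for row in self._tables[table] if predicate(row))

def scan(connector: Connector, table: str, predicate: Predicate) -> List[Row]:
    """Use pushdown when offered, otherwise filter in the engine."""
    pushed = connector.read_filtered(table, predicate)
    if pushed is not None:
        return list(pushed)
    return [row for row in connector.read(table) if predicate(row)]

data = {"numbers": [(1,), (2,), (3,)]}
is_odd = lambda row: row[0] % 2 == 1
basic_result = scan(InMemoryConnector(data), "numbers", is_odd)
pushdown_result = scan(FilteringConnector(data), "numbers", is_odd)
print(basic_result, pushdown_result)  # [(1,), (3,)] [(1,), (3,)]
```

The engine gets the same answer either way; a connector that implements more of the optional surface simply lets the engine delegate more work to the source.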
15
Presto Architecture
[Diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); Workers. Pluggable: Metadata API, Data location API, Data stream API]
16
Data federation with Presto
• Through the storage
• Demo
– HIVE
[Diagram: Presto coordinator and Presto workers; Hive Metastore, HDFS NameNode, HDFS DataNodes; links labeled "metadata" and "data transfer"]
17
Data federation with Presto
• One to One
• Demo
– psql
– REST
– and above with HIVE
[Diagram: Presto coordinator and Presto workers; SQL Database; links labeled "JDBC metadata" and "JDBC data"]
18
Many to many - data federation with Presto
[Diagram: TERADATA and PRESTO; labels: AMP (×4), PE, QG Exchange (×2), Coordinator, Worker Thread (×4), "Init & metadata exchange", "Bi-directional fully parallel data exchange"]
• Key features:
– Low latency
– High performance
– Concurrency
– SQL pushdown
– Data conversion
– Compression
– Efficient CPU usage
19
Conclusion
• Presto Connector is expressive
• 3rd party data source is 1st class citizen
• Single ANSI SQL to rule them all
– use BI tools on data which is not BI friendly
• Rapid data integration
20
More information
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto Users Group: www.groups.google.com/group/presto-users
GitHub:
www.github.com/prestodb/presto
www.github.com/Teradata/presto
21
www.teradata.com/presto