Presto: Fast SQL-on-Anything
including Delta Lake, Snowflake, Elasticsearch and more!
Kamil Bajda-Pawlikowski
Co-founder/CTO @ Starburst
Agenda
▪ Presto & Starburst
▪ Delta Lake Integration
▪ Data Platform Architecture
▪ Use Cases
Presto & Starburst
What is Presto?
High-performance MPP SQL engine
•Interactive ANSI SQL queries
•Proven scalability
•High concurrency
Separation of compute & storage
•Scale storage & compute independently
•SQL-on-anything
•Federated queries
Community-driven open source project
Deploy Anywhere
•Kubernetes
•Cloud
•On premises
Presto Users
Facebook: 10,000+ nodes, 1000s of users
Uber: 2,000+ nodes, 160K+ queries daily
LinkedIn: 500+ nodes, 200K+ queries daily
Lyft: 400+ nodes, 100K+ queries daily
Starburst
Our Platform
•Enterprise-Grade Security
•On-Prem or Cloud
•Rapid Time to Insights
•Low Cost of Ownership
•24x7 Expert Support
•ANSI SQL MPP Query Engine
•High Concurrency
•Massive Scale
Named Open Source Startup to Watch 2020
600% Growth YoY
100+ Enterprise Customers
NPS Score: 80+
Starburst Enterprise Presto
Performance
•From petabytes to exabytes – query data from disparate sources using SQL – with high concurrency
•Control your price/performance with the latest cost-based optimizer
•Caching available for frequently accessed data
Connectivity
•30+ supported enterprise connectors
•High performance parallel connectors for Oracle, Teradata, Snowflake and more
Security
•Kerberos & LDAP integration
•Global Security for fine-grained Access Control
•Data encryption
•Data masking
•Query auditing
Management
•Configuration
•Autoscaling
•High availability
•Monitoring
•Deploy anywhere
Support
•The largest team of Presto experts in the world
•Fully-tested, stable releases, curated by the Presto creators
•Hot fixes & security patches
•24x7 support, 365 – we’ve got your back
Starburst Customers
Tech, Retail, Media & Telco, Finance & Insurance, Healthcare & Pharma, Other
Delta Lake Integration
Why Delta Lake?
▪ ACID properties over data lake
▪ Open source table format
▪ Stored as Parquet files
▪ Object storage support
▪ Schema evolution
▪ Time travel feature
▪ Metadata & statistics
▪ Data skipping & z-ordering
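A minimal sketch of how a transaction log can provide both a consistent table snapshot and time travel. It assumes a simplified, in-memory stand-in for the Delta log: each commit is a list of JSON actions, and only "add" and "remove" actions are modeled. The names COMMITS and live_files are hypothetical and not part of any Delta Lake or Presto API.

```python
import json

# Hypothetical, simplified commit log: one list of JSON action strings per
# table version. Only "add" and "remove" file actions are modeled here.
COMMITS = [
    [json.dumps({"add": {"path": "part-0.parquet"}}),
     json.dumps({"add": {"path": "part-1.parquet"}})],
    [json.dumps({"remove": {"path": "part-0.parquet"}}),
     json.dumps({"add": {"path": "part-2.parquet"}})],
    [json.dumps({"add": {"path": "part-3.parquet"}})],
]


def live_files(commits, version=None):
    """Replay commits 0..version and return the set of live data files."""
    if version is None:
        version = len(commits) - 1  # default to the latest version
    files = set()
    for commit in commits[: version + 1]:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


latest = live_files(COMMITS)               # current snapshot
as_of_v0 = live_files(COMMITS, version=0)  # time travel to version 0
print(sorted(latest))
print(sorted(as_of_v0))
```

Because readers only see files listed by a fully written commit, a reader at a given version gets the same file set no matter what later writers do, which is roughly how the log supports both ACID reads and time travel.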
Native Presto Delta Lake Reader
Supports data skipping & dynamic filtering
Optimizes query using file statistics
Supports reading the Delta transaction log
Native connector written from scratch
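To illustrate the idea behind data skipping and dynamic filtering, here is a rough sketch that prunes files using per-file min/max statistics. It is not the connector's implementation: FileStats, FILES, skip_by_range and skip_by_dynamic_filter are hypothetical names, and the statistics are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column (customer_id)."""
    path: str
    min_customer_id: int
    max_customer_id: int


FILES = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]


def skip_by_range(files, low, high):
    """Data skipping: keep files whose [min, max] overlaps [low, high]."""
    return [f for f in files
            if f.max_customer_id >= low and f.min_customer_id <= high]


def skip_by_dynamic_filter(files, join_keys):
    """Dynamic filtering: keep files that may hold a key seen on the
    other side of a join, which is only known at query run time."""
    return [f for f in files
            if any(f.min_customer_id <= k <= f.max_customer_id
                   for k in join_keys)]


# Static predicate: customer_id BETWEEN 150 AND 180
static_hits = [f.path for f in skip_by_range(FILES, 150, 180)]
# Keys collected from the other join input while the query runs
dynamic_hits = [f.path for f in skip_by_dynamic_filter(FILES, {5, 250})]
print(static_hits, dynamic_hits)
```

In both cases the pruned files are never opened, so the saving grows with how well the data is clustered on the filtered column.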
Native Delta Lake Reader Performance
Standard TPC-H benchmark:
▪ 2x average speedup across 22 queries
▪ 6x best query speedup
Feedback from customers:
▪ “What we have here is game changing for our industry. Especially now that the native Delta reader works as fast as it does. We have people lining up to now use this data”
▪ “We have queries that were running in 10 minutes that are now running in 47 seconds”
Data Platform Architecture
The Data Consumption Layer
Consumers: Data Scientists, Data Analysts, Finance, Marketers (existing analytics tools)
Starburst Platform
Security: Data Masking, Global Security, Column + Row-level permissions, Query Auditing, Fine-grained access control, Data Encryption
Data sources: Data Lakes, Relational Databases, NoSQL Stores, Publish/Subscribe (e.g. Azure Event Hub)
Different SQL Technologies In Your Toolbelt
Streaming Ingestion
Machine Learning
Data Investigation
Large Batch Jobs
Fast Federated Queries
High Concurrency SQL Engine
High Performance Ad Hoc Reporting/Analytics
Optionality
Cloud Data Warehouse
Rapid Ad Hoc Reporting/Analytics
Fast, but everything must live in Snowflake (ETL/ELT is required)
Vendor and data lock-in
Cloud Data Platform Ecosystem
Deployment Architecture
Use Cases
Data Flow Diagram
Using a combination of Databricks and Starburst Presto to bring a full data ingestion and analytical environment to life
Data Ingestion and Transformation
● Real-time ingestion of event data into Delta tables
● Customer and inventory data ingested every hour
● Modified customer information merged into Delta Lake table
● Data marts created using streaming and batch data
Query-time Data Federation
● Single point of access to numerous data sources
● Query Delta Lake and federate with legacy databases as well as many NoSQL data stores
● Enforce table, column and row level policies to ensure maximum data security
● Mask column data for different groups and users
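The row-level and column-masking policies listed above can be pictured with a small sketch. This is only an illustration of the concept, not how Starburst enforces policies: ROWS, POLICIES, mask and apply_policy are hypothetical names, and the group names and data are invented.

```python
# Invented sample data standing in for a customer table.
ROWS = [
    {"id": 1, "region": "US", "email": "ann@example.com"},
    {"id": 2, "region": "EU", "email": "bob@example.com"},
    {"id": 3, "region": "US", "email": "cy@example.com"},
]

# Hypothetical policies per group: a row filter plus a set of masked columns.
POLICIES = {
    "us_analysts": {
        "row_filter": lambda row: row["region"] == "US",
        "masked": {"email"},
    },
    "admins": {
        "row_filter": lambda row: True,
        "masked": set(),
    },
}


def mask(value):
    """Keep the first character and replace the rest with asterisks."""
    text = str(value)
    return text[:1] + "*" * (len(text) - 1)


def apply_policy(rows, group):
    """Return the rows a group may see, with its masked columns obscured."""
    policy = POLICIES[group]
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row-level policy: drop rows the group may not see
        visible.append({col: mask(val) if col in policy["masked"] else val
                        for col, val in row.items()})
    return visible


analyst_view = apply_policy(ROWS, "us_analysts")
admin_view = apply_policy(ROWS, "admins")
print(analyst_view)
```

The same query therefore returns different results depending on who runs it, without the underlying tables being copied or altered.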
Data Consumption & Analytics
BI Reporting Tools
SQL Query Tools
• Connect using a variety of BI and SQL tools including Looker, Tableau, Power BI and DBeaver
• JDBC, ODBC and many libraries including Python, R and Java
SELECT c.id, COUNT(*), SUM(e.active_seconds)
FROM delta.iot.events e
JOIN snowflake.sales.customer c ON (e.customer_id = c.id)
WHERE e.event_date >= current_date
  AND c.region = 'US'
  AND c.id IN
    (SELECT l.customer_id
     FROM elastic.web.logs l
     WHERE l.visit_date >= date '2020-01-01')
GROUP BY c.id;
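To make the shape of the federated query above concrete without a Presto cluster, the sketch below uses Python's built-in sqlite3 module, with three attached in-memory databases standing in for the delta, snowflake and elastic catalogs. This is an assumption-laden stand-in: SQLite is not Presto, the two-part names (delta_iot.events and so on) replace Presto's catalog.schema.table names, the sample rows are invented, and a fixed date parameter replaces current_date so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each ATTACH creates a separate in-memory database; here each one plays
# the role of a different source system (stand-ins, not real connectors).
for name in ("delta_iot", "snowflake_sales", "elastic_web"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {name}")

conn.executescript("""
CREATE TABLE delta_iot.events
    (customer_id INTEGER, event_date TEXT, active_seconds INTEGER);
CREATE TABLE snowflake_sales.customer (id INTEGER, region TEXT);
CREATE TABLE elastic_web.logs (customer_id INTEGER, visit_date TEXT);

INSERT INTO delta_iot.events VALUES
    (1, '2020-06-01', 30), (1, '2020-06-01', 45),
    (2, '2020-06-01', 10), (3, '2020-06-01', 99);
INSERT INTO snowflake_sales.customer VALUES (1, 'US'), (2, 'EU'), (3, 'US');
INSERT INTO elastic_web.logs VALUES
    (1, '2020-03-15'), (2, '2020-02-01'), (3, '2019-12-31');
""")

QUERY = """
SELECT c.id, COUNT(*), SUM(e.active_seconds)
FROM delta_iot.events e
JOIN snowflake_sales.customer c ON e.customer_id = c.id
WHERE e.event_date >= ?
  AND c.region = 'US'
  AND c.id IN (SELECT l.customer_id
               FROM elastic_web.logs l
               WHERE l.visit_date >= '2020-01-01')
GROUP BY c.id
"""

# One SQL statement joins and filters across all three "sources".
rows = conn.execute(QUERY, ("2020-06-01",)).fetchall()
print(rows)
```

Only customer 1 survives all three conditions in this sample data: customer 2 is outside the US region and customer 3 has no web visit on or after 2020-01-01.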
Thank You!
Try Presto with Delta:
www.starburstdata.com/delta-lake-reader
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.