Performance Engineering for Apache Spark and
Databricks Runtime
ETHZ, Big Data HS19
Bart Samwel, Sabir Akhadov
About Databricks & the Presenters
Databricks: a "startup" founded by the original creators of Apache Spark™, with 1000+ employees (engineering in SF and Amsterdam).
Bart Samwel: software engineer @ European Development Center, Tech Lead of performance engineering teams
Sabir Akhadov: software engineer @ EDC, performance benchmarking team
Our Job: More Speed
• Make Databricks Runtime (= Apache Spark + extensions) faster
  • Find the bottlenecks
  • Translate research & insights
  • Invent novel ways to speed it up
• Make sure it doesn't slow down
  • Regression performance benchmarking
• Make sure it's faster than the competition
  • Competitive benchmarking & analysis
The Paths to Speed
1. Do less (reduce I/O or data volume)
2. Be prepared (do stuff ahead of time, e.g. indexes, clustering)
3. Do things once (caching)
4. Be smarter (better algorithms / query plans)
5. Go faster (better raw execution speed)
You need all of these to win the race!
[Diagram: query plan with Scan → Filter → Project operators over an RDD]

SELECT Store, Amount
FROM Sales
WHERE day_of_week = "Friday" (*)

(*) You can't actually write Spark SQL and get plain RDDs. This is just to make a point.
[Diagram: the same query over an RDD collapses into a single opaque Scan]

SELECT Store, Amount
FROM Sales
WHERE day_of_week = "Friday"

We don't know what's in there!
DataFrame

[Diagram: the DataFrame query plan keeps Scan, Filter, and Project visible to the optimizer]

• Filter in the data source
• Read only the columns you need!
Traditional MapReduce / RDDs are opaque. SQL / DataFrames are transparent.

Key Insight:
• Opaque / operational != optimizable
• Transparent / declarative = optimizable
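A minimal sketch of the contrast (hypothetical Sale record, salesRdd / salesDf inputs, and an active SparkSession; not from the original slides):

  import org.apache.spark.sql.functions.col

  // RDD version: the predicate is an arbitrary closure. Spark cannot
  // look inside it, so nothing can be pushed down or pruned.
  val rddResult = salesRdd
    .filter(s => s.dayOfWeek == "Friday")
    .map(s => (s.store, s.amount))

  // DataFrame version: filter and projection are declarative
  // expressions, so Catalyst can push the filter into the scan and
  // read only the two referenced columns.
  val dfResult = salesDf
    .where(col("day_of_week") === "Friday")
    .select("Store", "Amount")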
1. Do Less
• Read only the columns you need
  • Columnar file formats (Parquet, ORC)
  • NOT Avro, JSON, CSV, ...
• Can you avoid reading files at all?
  • Yes, if the query has filters!
  • But: need clustering!
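In DataFrame code, "doing less" looks like this (path and column names illustrative):

  import org.apache.spark.sql.functions.col

  // Parquet is columnar: only the referenced column chunks are read,
  // and the Country filter can skip whole files via their statistics.
  spark.read.parquet("/data/sales")
    .where(col("Country") === "CH")
    .select("Store", "Amount")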
File 1: ('CH', 2019-12-03, 3231), ('NL', 2019-12-03, 2216), ('CH', 2019-11-29, 3283), ('CH', 2019-12-02, 1823)
File 2: ('NL', 2019-12-02, 2731), ('NL', 2019-12-01, 812), ('CH', 2019-11-30, 12833), ('NL', 2019-11-30, 1823)
File 3: ('CH', 2019-11-29, 5122), ('NL', 2019-11-28, 8975), ('NL', 2019-11-29, 2617), ('CH', 2019-11-28, 8537)

SELECT Store, Amount
FROM Sales
WHERE Country = 'CH'

Every file mixes 'CH' and 'NL' rows, so no file can be skipped.
After clustering by Country:

File 1: ('CH', 2019-12-03, 3231), ('CH', 2019-11-30, 12833), ('CH', 2019-11-29, 3283), ('CH', 2019-12-02, 1823)
File 2: ('NL', 2019-12-02, 2731), ('NL', 2019-12-01, 812), ('NL', 2019-12-03, 2216), ('NL', 2019-11-30, 1823)
File 3: ('CH', 2019-11-29, 5122), ('NL', 2019-11-28, 8975), ('NL', 2019-11-29, 2617), ('CH', 2019-11-28, 8537)

SELECT Store, Amount
FROM Sales
WHERE Country = 'CH'

Now File 2 contains only 'NL' rows: its min/max statistics prove it cannot match, so it can be skipped.
Knowing How to Do Less
• Know which file contains what data
• For each file, for each column:
  • min/max value
  • bloom filters
• Parquet files are self-describing!
  • So you have to read the file to skip it?!?
• Cloud storage is high latency
  • Need consolidated metadata cache / storage
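A minimal sketch of min/max skipping, assuming per-file statistics are already available from a metadata store (types and names hypothetical, not the Databricks implementation):

  case class ColumnStats(min: String, max: String)
  case class FileMeta(path: String, stats: Map[String, ColumnStats])

  // A file cannot contain `column = value` if value lies outside the
  // file's [min, max] range for that column, so it can be skipped.
  def canSkip(file: FileMeta, column: String, value: String): Boolean =
    file.stats.get(column) match {
      case Some(s) => value < s.min || value > s.max
      case None    => false // no statistics: must read the file
    }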
2. Be Prepared
• Traditional database: indexes (e.g. B-trees)
• File-based big data: partitioning and clustering
Partitioning in Parquet
month=1/
file1.parquet
file2.parquet
month=2/
file3.parquet
month=3/
...etc...
• Low cardinality columns only!
• Can skip files per partition!
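For example, writing with partitionBy produces exactly this directory layout, and a filter on the partition column touches only the matching directories (paths illustrative):

  import org.apache.spark.sql.functions.col

  // Creates month=1/, month=2/, ... with Parquet files inside.
  sales.write.partitionBy("month").parquet("/data/sales_by_month")

  // Reads only the month=2/ directory.
  spark.read.parquet("/data/sales_by_month").where(col("month") === 2)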
Order Based Clustering
• Sort data by search columns X, Y
• Split sorted data into files of reasonable size

✔ X = <value>
✔ X = <value> AND Y = <value>
❌ Y = <value>
Better: Z-Order Clustering
• X, Y => 2D plane
• Map onto 1D space using a space filling curve (e.g. Z-Order)
• Sort by 1D space
• Observe: every curve range has narrow min/max for both X, Y => good for skipping
• blog post
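A minimal sketch of the bit-interleaving idea behind Z-Ordering (16-bit coordinates for brevity; not Databricks' actual implementation):

  // Interleave the bits of x and y so that points close in both
  // dimensions map to nearby 1D values.
  def zOrder(x: Int, y: Int): Long = {
    var z = 0L
    for (i <- 0 until 16) {
      z |= ((x >> i) & 1L) << (2 * i)     // bits of x at even positions
      z |= ((y >> i) & 1L) << (2 * i + 1) // bits of y at odd positions
    }
    z
  }

Sorting rows by zOrder(x, y) before splitting them into files gives every file a narrow min/max range in both X and Y.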
Reorganization & Concurrency

• Clustering = reorganization
• Reorganization = replacing files
• BUT: concurrent readers may see inconsistent data
• Partitioning != reorganization
  • Happens immediately at write time
  • But produces many small files for incremental insertions
  • Compacting small files = reorganization
Consistency: Delta Lake
• Transaction log for file sets
  • 1 transaction = atomically add files & remove files
• Stores file names and metadata (min/max per column)
  • No more file listing
  • No more opening files for skipping metadata
• SELECT/INSERT/UPDATE/DELETE with serializable isolation
• Reorganize data safely
https://delta.io/
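Basic Delta Lake usage as a sketch (paths illustrative; requires the delta-core library on the classpath):

  import io.delta.tables.DeltaTable

  // Every write is a transaction recorded in the log.
  sales.write.format("delta").save("/data/sales_delta")

  // Readers see a consistent snapshot, never a half-finished rewrite.
  val df = spark.read.format("delta").load("/data/sales_delta")

  // Deletes and updates are transactional too.
  DeltaTable.forPath(spark, "/data/sales_delta").delete("Country = 'NL'")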
3. Do Things Once
• 0x is best
• 1x is next best
• 2x is a waste
File Caching
• Cloud storage (e.g. S3) has low bandwidth compared to SSD
• "Delta Cache": cache cloud files on local SSD
• Changes queries from I/O bound to CPU bound
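On Databricks the cache is controlled per cluster; the configuration key below is taken from the Databricks documentation (availability depends on the runtime and instance type):

  spark.conf.set("spark.databricks.io.cache.enabled", "true")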
Result Caching
• Spark already allows you to cache DataFrames/RDDs explicitly
• Automatic caching is more difficult
  • Use case: everybody in the company opens the same dashboard all the time
  • Safely reuse a result from a different user's session?
    – Same settings?
    – Same permissions?
    – Is the data not stale?
    – How stale is acceptable?
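Explicit caching is already easy (names illustrative):

  // Mark the result, materialize it once, then repeated dashboard
  // queries are served from the cache.
  val dashboard = spark.table("sales").groupBy("store").sum("amount")
  dashboard.cache()
  dashboard.count() // first action materializes the cache
  dashboard.show()  // reuses the cached result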
4. Be Smarter
• Pick the best algorithm for your query
• Catalyst optimizer
Rule-based optimizations

Transform the logical query plan based on rules, e.g.:
• Push down work below operators to reduce data size, e.g.:
  • filter before join, aggregation, projection, ...
  • aggregation before join
  • projection before everything (drop columns!)
• Simplify expressions
• Precompute constant expressions
• Turn EXISTS subqueries into semijoins
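You can watch these rewrites happen with explain(true), which prints the plan before and after optimization (paths and column names illustrative):

  import org.apache.spark.sql.functions.col

  val sales  = spark.read.parquet("/data/sales")
  val stores = spark.read.parquet("/data/stores")

  // The country filter is pushed below the join and into the Parquet
  // scan; unused columns are pruned from the scan as well.
  sales.join(stores, "store_id")
    .where(col("country") === "CH")
    .select("store_id", "amount")
    .explain(true)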
Cost-based optimization

Select the final plan using cost + rules:
• Cost = based on table statistics
• Compare multiple options by cost
• Join reordering
• Join method (sort-merge join, hash join, broadcast join)
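To give the cost model something to work with, collect statistics; or bypass costing for one join with a hint (table and DataFrame names illustrative):

  import org.apache.spark.sql.functions.broadcast

  // Table-level and column-level statistics for the optimizer.
  spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
  spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS country")

  // Force a broadcast join regardless of estimated sizes.
  largeDf.join(broadcast(smallDf), "key")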
Are Statistics Robust?
No.
Averages are often wrong
SELECT * FROM Sales WHERE
• country = 'USA'
• country = 'Greenland'
Filters interact in unexpected ways
SELECT * FROM People WHERE
• city = 'Amsterdam' AND favorite_team = 'Ajax'
• city = 'Rotterdam' AND favorite_team = 'Ajax'
A good plan for average data can be really bad for actual data!
Robust: Adaptive Query Execution
Be smart at execution time!
• In Spark 3.0: automatic Broadcast Join (fast but works only when one input is "small", <100MB or so)
• Detect skew (hot partitions), automatically mitigate
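In Spark 3.0 these behaviors are controlled by the adaptive execution settings (key names per the Spark 3.0 docs):

  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")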
5. Go Faster
When you really can't avoid doing work... do the work fast!
Multiple possible execution model choices:
• Row-at-a-time vs. column-at-a-time
  • Columnar enables vectorization
• Interpreted vs. compiled
• Native vs. JVM
Execution Models in Spark / Databricks
• Classic: interpreted, row-at-a-time, JVM
• Tungsten: JIT compiled, row-at-a-time, JVM
  • Based on ideas from the HyPer paper (but JVM instead of LLVM)
  • Code specialized for each query, multiple operators pipelined over the same data, e.g. filter + project + aggregate
  • Efficient because data stays in registers
• Databricks Parquet Reader: interpreted, columnar, native
  • Only for scans
• Future: Tungsten+LLVM? Columnar+JVM? Columnar+Native?
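Conceptually, the code Tungsten generates for filter + project + aggregate is one fused loop where intermediate values live in locals/registers instead of materialized rows (an illustrative hand-written analogue, not actual generated code):

  def fridayTotal(days: Array[String], amounts: Array[Long]): Long = {
    var sum = 0L
    var i = 0
    while (i < days.length) {
      if (days(i) == "Friday") { // filter
        sum += amounts(i)        // project + aggregate, no row object
      }
      i += 1
    }
    sum
  }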
Final Notes
• There is no silver bullet -- every workload has its own bottleneck
• Bottlenecks are ever-changing, e.g.:
  • Make the CPU faster -> I/O bound
  • Reduce I/O -> CPU bound
  • Speed up aggregation -> shuffle becomes the problem
• Benchmark, benchmark, benchmark
  • "If you say you care about performance and you don't have a benchmark, then you don't really care about performance"
And now...
Dynamic Partition Pruning in Apache Spark 3.0
*Do less*
TPCDS Q98 on 10 TB
How to Make a Query 100x Faster?
Static Partition Pruning

SELECT * FROM Sales WHERE date_year = 2019

[Diagram: three plans side by side -- basic data-flow (Scan → Filter), filter push-down (the filter evaluated inside the Scan), and partition pruning (the filter selects only the 2019 partition out of partition files 2014 ... 2019)]
Star Schema
Joining Tables

SELECT * FROM Sales
JOIN Date ON date_id = Date.id
WHERE Date.year = 2019
Static pruning not possible

[Diagram: Join of Scan Sales with Filter (year = 2019) over Scan Date -- the filter is on the Date dimension, so the Sales scan cannot be statically pruned]

Alternative: precompute a denormalized table (*Be prepared*)

[Diagram: a single Scan over the denormalized table replaces the join]

But:
• Duplicate data
• Join maintenance
• Wide table
Dynamic Pruning

SELECT * FROM Sales
JOIN Date ON date_id = Date.id
WHERE Date.year = 2019

[Diagram: the result of Filter (year = 2019) over Scan Date is fed back into Scan Sales at run time to prune its partitions before the Join]
Dynamic Pruning

[Diagram: Scan FACT TABLE over partition files, joined on date_id with Filter DIM over Scan DIM TABLE (a non-partitioned dataset)]
A Simple Approach

[Diagram: duplicate the dimension side -- a separate Scan DIM TABLE + Filter DIM (year = 2019) subquery computes the matching keys, which prune the fact table's partition files before the join]

But this scans and filters the dimension table twice: double the work! (*Do things once*)
Broadcast Hash Join

[Diagram: FileScan (fact, non-partitioned dataset) and FileScan with Dim Filter feed a BroadcastExchange into a Broadcast Hash Join on the worker nodes]

• Execute the smaller side
• Broadcast the smaller side's result
• Execute the join locally without a shuffle
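In DataFrame code this corresponds to the broadcast hint (names illustrative):

  import org.apache.spark.sql.functions.{broadcast, col}

  // Filter the small dimension table, ship the result to every worker,
  // and join without shuffling the large fact table.
  val dimFiltered = dateDim.where(col("year") === 2019)
  sales.join(broadcast(dimFiltered), sales("date_id") === dimFiltered("id"))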
Reusing Broadcast Results

[Diagram: the BroadcastExchange already built for the Broadcast Hash Join is reused as a Dynamic Filter (year = 2019) on the fact table's FileScan over the partition files]
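Dynamic partition pruning ships enabled in Spark 3.0; the switch (key name per the Spark 3.0 docs):

  spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")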
Experimental Setup

Workload Selection
- TPC-DS, scale factors 1-10 TB

Cluster Configuration
- 10 i3.xlarge machines, 40 cores total

Data-Processing Framework
- Apache Spark 3.0
TPCDS 1 TB
60 / 102 queries show speedups between 2x and 18x
Data Skipped
Very effective in skipping data
TPCDS 10 TB
Even better speedups at 10x the scale
Query 98

SELECT i_item_desc, i_category, i_class, i_current_price,
       sum(ss_ext_sales_price) AS itemrevenue,
       sum(ss_ext_sales_price) * 100 / sum(sum(ss_ext_sales_price))
           OVER (PARTITION BY i_class) AS revenueratio
FROM store_sales, item, date_dim
WHERE ss_item_sk = i_item_sk
  AND i_category IN ('Sports', 'Books', 'Home')
  AND ss_sold_date_sk = d_date_sk
  AND cast(d_date AS date) BETWEEN cast('1999-02-22' AS date)
      AND (cast('1999-02-22' AS date) + interval '30' day)
GROUP BY i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY i_category, i_class, i_item_id, i_item_desc, revenueratio
TPCDS 10 TB, Q98
Highly selective dimension filter that retains only one month out of 5 years of data
Conclusion
Apache Spark 3.0 introduces Dynamic Partition Pruning
Significant speedup, exhibited in many TPC-DS queries
This optimization improves Spark performance for star-schema queries, making it unnecessary to denormalize tables.
Thanks!
We're hiring for internships (3 months) and full time engineers in Amsterdam and San Francisco!
databricks.com/company/careers
Bart Samwel: linkedin.com/in/bsamwel
Sabir Akhadov: linkedin.com/in/akhadov