Performance Engineering for Apache Spark and
Databricks Runtime
ETHZ, Big Data HS19
Bart Samwel, Sabir Akhadov
About Databricks & the Presenters
Databricks: a "startup" founded by the original creators of Apache Spark™, with 1000+ employees (engineering in SF and Amsterdam).
Bart Samwel: software engineer @ European Development Center, Tech Lead of performance engineering teams
Sabir Akhadov: software engineer @ EDC, performance benchmarking team
Our Job: More Speed
• Make Databricks Runtime (= Apache Spark + extensions) faster
  • Find the bottlenecks
  • Translate research & insights
  • Invent novel ways to speed it up
• Make sure it doesn't slow down
  • Regression performance benchmarking
• Make sure it's faster than the competition
  • Competitive benchmarking & analysis
The Paths to Speed
1. Do less (reduce I/O or data volume)
2. Be prepared (do stuff ahead of time, e.g. indexes, clustering)
3. Do things once (caching)
4. Be smarter (better algorithms / query plans)
5. Go faster (better raw execution speed)
You need all of these to win the race!
[Diagram: query plan with Scan → Filter → Project operators over an RDD]

SELECT Store, Amount
FROM Sales
WHERE day_of_week = "Friday" (*)

(*) You can't actually write Spark SQL and get plain RDDs. This is just to make a point.
[Diagram: the same query over an RDD collapses into a single opaque Scan]

SELECT Store, Amount
FROM Sales
WHERE day_of_week = "Friday"

We don't know what's in there!
DataFrame

[Diagram: the DataFrame query plan keeps Scan, Filter, and Project visible to the optimizer]

• Filter in the data source
• Read only the columns you need!
Traditional MapReduce / RDDs are opaque. SQL / DataFrames are transparent.

Key Insight:
• Opaque / operational != optimizable
• Transparent / declarative = optimizable
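A minimal sketch of the contrast (hypothetical Sale record, salesRdd / salesDf inputs, and an active SparkSession; not from the original slides):

  import org.apache.spark.sql.functions.col

  // RDD version: the predicate is an arbitrary closure. Spark cannot
  // look inside it, so nothing can be pushed down or pruned.
  val rddResult = salesRdd
    .filter(s => s.dayOfWeek == "Friday")
    .map(s => (s.store, s.amount))

  // DataFrame version: filter and projection are declarative
  // expressions, so Catalyst can push the filter into the scan and
  // read only the two referenced columns.
  val dfResult = salesDf
    .where(col("day_of_week") === "Friday")
    .select("Store", "Amount")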
1. Do Less
• Read only the columns you need
  • Columnar file formats (Parquet, ORC)
  • NOT Avro, JSON, CSV, ...
• Can you avoid reading files at all?
  • Yes, if the query has filters!
  • But: need clustering!
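In DataFrame code, "doing less" looks like this (path and column names illustrative):

  import org.apache.spark.sql.functions.col

  // Parquet is columnar: only the referenced column chunks are read,
  // and the Country filter can skip whole files via their statistics.
  spark.read.parquet("/data/sales")
    .where(col("Country") === "CH")
    .select("Store", "Amount")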
File 1: ('CH', 2019-12-03, 3231), ('NL', 2019-12-03, 2216), ('CH', 2019-11-29, 3283), ('CH', 2019-12-02, 1823)
File 2: ('NL', 2019-12-02, 2731), ('NL', 2019-12-01, 812), ('CH', 2019-11-30, 12833), ('NL', 2019-11-30, 1823)
File 3: ('CH', 2019-11-29, 5122), ('NL', 2019-11-28, 8975), ('NL', 2019-11-29, 2617), ('CH', 2019-11-28, 8537)

SELECT Store, Amount
FROM Sales
WHERE Country = 'CH'

Every file mixes 'CH' and 'NL' rows, so no file can be skipped.
After clustering by Country:

File 1: ('CH', 2019-12-03, 3231), ('CH', 2019-11-30, 12833), ('CH', 2019-11-29, 3283), ('CH', 2019-12-02, 1823)
File 2: ('NL', 2019-12-02, 2731), ('NL', 2019-12-01, 812), ('NL', 2019-12-03, 2216), ('NL', 2019-11-30, 1823)
File 3: ('CH', 2019-11-29, 5122), ('NL', 2019-11-28, 8975), ('NL', 2019-11-29, 2617), ('CH', 2019-11-28, 8537)

SELECT Store, Amount
FROM Sales
WHERE Country = 'CH'

Now File 2 contains only 'NL' rows: its min/max statistics prove it cannot match, so it can be skipped.
Knowing How to Do Less
• Know which file contains what data
• For each file, for each column:
  • min/max value
  • bloom filters
• Parquet files are self-describing!
  • So you have to read the file to skip it?!?
• Cloud storage is high latency
  • Need consolidated metadata cache / storage
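A minimal sketch of min/max skipping, assuming per-file statistics are already available from a metadata store (types and names hypothetical, not the Databricks implementation):

  case class ColumnStats(min: String, max: String)
  case class FileMeta(path: String, stats: Map[String, ColumnStats])

  // A file cannot contain `column = value` if value lies outside the
  // file's [min, max] range for that column, so it can be skipped.
  def canSkip(file: FileMeta, column: String, value: String): Boolean =
    file.stats.get(column) match {
      case Some(s) => value < s.min || value > s.max
      case None    => false // no statistics: must read the file
    }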
2. Be Prepared
• Traditional database: indexes (e.g. B-trees)
• File-based big data: partitioning and clustering
Partitioning in Parquet
month=1/
file1.parquet
file2.parquet
month=2/
file3.parquet
month=3/
...etc...
• Low cardinality columns only!
• Can skip files per partition!
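For example, writing with partitionBy produces exactly this directory layout, and a filter on the partition column touches only the matching directories (paths illustrative):

  import org.apache.spark.sql.functions.col

  // Creates month=1/, month=2/, ... with Parquet files inside.
  sales.write.partitionBy("month").parquet("/data/sales_by_month")

  // Reads only the month=2/ directory.
  spark.read.parquet("/data/sales_by_month").where(col("month") === 2)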
Order Based Clustering
• Sort data by search columns X, Y
• Split sorted data into files of reasonable size

✔ X = <value>
✔ X = <value> AND Y = <value>
❌ Y = <value>
Better: Z-Order Clustering
• X, Y => 2D plane
• Map onto 1D space using a space filling curve (e.g. Z-Order)
• Sort by 1D space
• Observe: every curve range has narrow min/max for both X, Y => good for skipping
• blog post
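A minimal sketch of the bit-interleaving idea behind Z-Ordering (16-bit coordinates for brevity; not Databricks' actual implementation):

  // Interleave the bits of x and y so that points close in both
  // dimensions map to nearby 1D values.
  def zOrder(x: Int, y: Int): Long = {
    var z = 0L
    for (i <- 0 until 16) {
      z |= ((x >> i) & 1L) << (2 * i)     // bits of x at even positions
      z |= ((y >> i) & 1L) << (2 * i + 1) // bits of y at odd positions
    }
    z
  }

Sorting rows by zOrder(x, y) before splitting them into files gives every file a narrow min/max range in both X and Y.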
Reorganization & Concurrency

• Clustering = reorganization
• Reorganization = replacing files
• BUT: concurrent readers may see inconsistent data
• Partitioning != reorganization
  • Happens immediately at write time
  • But produces many small files for incremental insertions
  • Compacting small files = reorganization
Consistency: Delta Lake
• Transaction log for file sets
  • 1 transaction = atomically add files & remove files
• Stores file names and metadata (min/max per column)
  • No more file listing
  • No more opening files for skipping metadata
• SELECT/INSERT/UPDATE/DELETE with serializable isolation
• Reorganize data safely
https://delta.io/
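Basic Delta Lake usage as a sketch (paths illustrative; requires the delta-core library on the classpath):

  import io.delta.tables.DeltaTable

  // Every write is a transaction recorded in the log.
  sales.write.format("delta").save("/data/sales_delta")

  // Readers see a consistent snapshot, never a half-finished rewrite.
  val df = spark.read.format("delta").load("/data/sales_delta")

  // Deletes and updates are transactional too.
  DeltaTable.forPath(spark, "/data/sales_delta").delete("Country = 'NL'")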
3. Do Things Once
• 0x is best
• 1x is next best
• 2x is a waste
File Caching
• Cloud storage (e.g. S3) has low bandwidth compared to SSD
• "Delta Cache": cache cloud files on local SSD
• Changes queries from I/O bound to CPU bound
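On Databricks the cache is controlled per cluster; the configuration key below is taken from the Databricks documentation (availability depends on the runtime and instance type):

  spark.conf.set("spark.databricks.io.cache.enabled", "true")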
Result Caching
• Spark already allows you to cache DataFrames/RDDs explicitly
• Automatic caching is more difficult
  • Use case: everybody in the company opens the same dashboard all the time
  • Safely reuse a result from a different user's session?
    – Same settings?
    – Same permissions?
    – Is the data not stale?
    – How stale is acceptable?
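Explicit caching is already easy (names illustrative):

  // Mark the result, materialize it once, then repeated dashboard
  // queries are served from the cache.
  val dashboard = spark.table("sales").groupBy("store").sum("amount")
  dashboard.cache()
  dashboard.count() // first action materializes the cache
  dashboard.show()  // reuses the cached result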
4. Be Smarter
• Pick the best algorithm for your query
• Catalyst optimizer
Rule-based optimizations

Transform the logical query plan based on rules, e.g.:
• Push down work below operators to reduce data size, e.g.:
  • filter before join, aggregation, projection, ...
  • aggregation before join
  • projection before everything (drop columns!)
• Simplify expressions
• Precompute constant expressions
• Turn EXISTS subqueries into semijoins
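You can watch these rewrites happen with explain(true), which prints the plan before and after optimization (paths and column names illustrative):

  import org.apache.spark.sql.functions.col

  val sales  = spark.read.parquet("/data/sales")
  val stores = spark.read.parquet("/data/stores")

  // The country filter is pushed below the join and into the Parquet
  // scan; unused columns are pruned from the scan as well.
  sales.join(stores, "store_id")
    .where(col("country") === "CH")
    .select("store_id", "amount")
    .explain(true)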
Cost-based optimization

Select the final plan using cost + rules:
• Cost = based on table statistics
• Compare multiple options by cost
• Join reordering
• Join method (sort-merge join, hash join, broadcast join)
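To give the cost model something to work with, collect statistics; or bypass costing for one join with a hint (table and DataFrame names illustrative):

  import org.apache.spark.sql.functions.broadcast

  // Table-level and column-level statistics for the optimizer.
  spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
  spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS country")

  // Force a broadcast join regardless of estimated sizes.
  largeDf.join(broadcast(smallDf), "key")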
Are Statistics Robust?
No.
Averages are often wrong
SELECT * FROM Sales WHERE
• country = 'USA'
• country = 'Greenland'
Filters interact in unexpected ways
SELECT * FROM People WHERE
• city = 'Amsterdam' AND favorite_team = 'Ajax'
• city = 'Rotterdam' AND favorite_team = 'Ajax'
A good plan for average data can be really bad for actual data!
Robust: Adaptive Query Execution
Be smart at execution time!
• In Spark 3.0: automatic Broadcast Join (fast but works only when one input is "small", <100MB or so)
• Detect skew (hot partitions), automatically mitigate
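In Spark 3.0 these behaviors are controlled by the adaptive execution settings (key names per the Spark 3.0 docs):

  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")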
5. Go Faster
When you really can't avoid doing work... do the work fast!
Multiple possible execution model choices:
• Row-at-a-time vs. column-at-a-time
  • Columnar enables vectorization
• Interpreted vs. compiled
• Native vs. JVM
Execution Models in Spark / Databricks
• Classic: interpreted, row-at-a-time, JVM
• Tungsten: JIT compiled, row-at-a-time, JVM
  • Based on ideas from the HyPer paper (but JVM instead of LLVM)
  • Code specialized for each query, multiple operators pipelined over the same data, e.g. filter + project + aggregate
  • Efficient because data stays in registers
• Databricks Parquet Reader: interpreted, columnar, native
  • Only for scans
• Future: Tungsten+LLVM? Columnar+JVM? Columnar+Native?
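Conceptually, the code Tungsten generates for filter + project + aggregate is one fused loop where intermediate values live in locals/registers instead of materialized rows (an illustrative hand-written analogue, not actual generated code):

  def fridayTotal(days: Array[String], amounts: Array[Long]): Long = {
    var sum = 0L
    var i = 0
    while (i < days.length) {
      if (days(i) == "Friday") { // filter
        sum += amounts(i)        // project + aggregate, no row object
      }
      i += 1
    }
    sum
  }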
Final Notes
• There is no silver bullet -- every workload has its own bottleneck
• Bottlenecks are ever-changing, e.g.:
  • Make the CPU faster -> I/O bound
  • Reduce I/O -> CPU bound
  • Speed up aggregation -> shuffle becomes the problem
• Benchmark, benchmark, benchmark
  • "If you say you care about performance and you don't have a benchmark, then you don't really care about performance"
And now...
Dynamic Partition Pruning in Apache Spark 3.0
*Do less*
TPCDS Q98 on 10 TB
How to Make a Query 100x Faster?
Static Partition Pruning

SELECT * FROM Sales WHERE date_year = 2019

[Diagram: three plans side by side -- basic data-flow (Scan → Filter), filter push-down (the filter evaluated inside the Scan), and partition pruning (the filter selects only the 2019 partition out of partition files 2014 ... 2019)]
Star Schema
Joining Tables

SELECT * FROM Sales
JOIN Date ON date_id = Date.id
WHERE Date.year = 2019
Static pruning not possible

[Diagram: Join of Scan Sales with Filter (year = 2019) over Scan Date -- the filter is on the Date dimension, so the Sales scan cannot be statically pruned]

Alternative: precompute a denormalized table (*Be prepared*)

[Diagram: a single Scan over the denormalized table replaces the join]

But:
• Duplicate data
• Join maintenance
• Wide table
Dynamic Pruning

SELECT * FROM Sales
JOIN Date ON date_id = Date.id
WHERE Date.year = 2019

[Diagram: the result of Filter (year = 2019) over Scan Date is fed back into Scan Sales at run time to prune its partitions before the Join]
Dynamic Pruning

[Diagram: Scan FACT TABLE over partition files, joined on date_id with Filter DIM over Scan DIM TABLE (a non-partitioned dataset)]
A Simple Approach

[Diagram: duplicate the dimension side -- a separate Scan DIM TABLE + Filter DIM (year = 2019) subquery computes the matching keys, which prune the fact table's partition files before the join]

But this scans and filters the dimension table twice: double the work! (*Do things once*)
Broadcast Hash Join

[Diagram: FileScan (fact, non-partitioned dataset) and FileScan with Dim Filter feed a BroadcastExchange into a Broadcast Hash Join on the worker nodes]

• Execute the smaller side
• Broadcast the smaller side's result
• Execute the join locally without a shuffle
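In DataFrame code this corresponds to the broadcast hint (names illustrative):

  import org.apache.spark.sql.functions.{broadcast, col}

  // Filter the small dimension table, ship the result to every worker,
  // and join without shuffling the large fact table.
  val dimFiltered = dateDim.where(col("year") === 2019)
  sales.join(broadcast(dimFiltered), sales("date_id") === dimFiltered("id"))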
Reusing Broadcast Results

[Diagram: the BroadcastExchange already built for the Broadcast Hash Join is reused as a Dynamic Filter (year = 2019) on the fact table's FileScan over the partition files]
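Dynamic partition pruning ships enabled in Spark 3.0; the switch (key name per the Spark 3.0 docs):

  spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")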
Experimental Setup

Workload Selection
- TPC-DS, scale factors 1-10 TB

Cluster Configuration
- 10 i3.xlarge machines, 40 cores total

Data-Processing Framework
- Apache Spark 3.0
TPCDS 1 TB
60 / 102 queries show speedups between 2x and 18x
Data Skipped
Very effective in skipping data
TPCDS 10 TB
Even better speedups at 10x the scale
Query 98

SELECT i_item_desc, i_category, i_class, i_current_price,
       sum(ss_ext_sales_price) AS itemrevenue,
       sum(ss_ext_sales_price) * 100 / sum(sum(ss_ext_sales_price))
           OVER (PARTITION BY i_class) AS revenueratio
FROM store_sales, item, date_dim
WHERE ss_item_sk = i_item_sk
  AND i_category IN ('Sports', 'Books', 'Home')
  AND ss_sold_date_sk = d_date_sk
  AND cast(d_date AS date) BETWEEN cast('1999-02-22' AS date)
      AND (cast('1999-02-22' AS date) + interval '30' day)
GROUP BY i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY i_category, i_class, i_item_id, i_item_desc, revenueratio
TPCDS 10 TB, Q98
Highly selective dimension filter that retains only one month out of 5 years of data
Conclusion
Apache Spark 3.0 introduces Dynamic Partition Pruning
Significant speedup, exhibited in many TPC-DS queries
This optimization improves Spark performance for star-schema queries, making it unnecessary to denormalize tables.
Thanks!
We're hiring for internships (3 months) and full time engineers in Amsterdam and San Francisco!
databricks.com/company/careers
Bart Samwel: linkedin.com/in/bsamwel
Sabir Akhadov: linkedin.com/in/akhadov