New Developments in Spark
Matei Zaharia
August 18th, 2015
About Databricks
Founded by the creators of Spark in 2013; remains the top contributor

End-to-end service for Spark on EC2
• Interactive notebooks, dashboards, and production jobs
Our Goal for Spark
Unified engine across data workloads and platforms
…
SQL, Streaming, ML, Graph, Batch, …
Past 2 Years
Fast growth in libraries and integration points
• New library for SQL + DataFrames
• 10x growth of ML library
• Pluggable data source API
• R language support

Result: very diverse use of Spark
• Only 40% of users on Hadoop YARN
• Most users use at least 2 of Spark's built-in libraries
• 98% of Databricks customers use SQL, 60% use Python
Beyond Libraries
Best thing about basing Spark’s libraries on a high-level API is that we can also make big changes underneath them
Now working on some of the largest changes to Spark Core since the project began
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Hardware Trends

             2010              2015              Change
Storage      50+ MB/s (HDD)    500+ MB/s (SSD)   10x
Network      1 Gbps            10 Gbps           10x
CPU          ~3 GHz            ~3 GHz            ~1x
Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Off-heap memory management
(2) Runtime code generation
(3) Cache-aware algorithms
Interfaces to Tungsten
[Diagram: Spark SQL, DataFrames (Python, Java, Scala, R), RDDs, … feed a data schema + query plan into Tungsten, which targets backends such as JVM, LLVM, GPU, NVRAM, …]
DataFrame API
Single-node tabular structure in R and Python, with APIs for:
• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)
Google Trends for “data frame”
DataFrame: lingua franca for “small data”
head(flights)
#> Source: local data frame [6 x 16]
#>
#>   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013     1   1      517         2      830        11      UA  N14228
#> 2 2013     1   1      533         4      850        20      UA  N24211
#> 3 2013     1   1      542         2      923        33      AA  N619AA
#> 4 2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
Spark DataFrames
Structured data collections with a similar API to R/Python
• DataFrame = RDD + schema

Capture many operations as expressions in a DSL
• Enables rich optimizations
df = jsonFile("tweets.json")

df.filter(df("user") === "matei")
  .groupBy("date")
  .sum("retweets")
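The key idea in the snippet above is that DSL operations build an expression tree rather than executing immediately, so the engine can inspect and optimize the whole query. Here is a toy Python sketch of that mechanism (an illustration only, not Spark's actual Catalyst implementation; the `Col`/`Expr` names are hypothetical):

```python
# Toy sketch (not Spark's implementation) of how a DataFrame DSL can
# capture an operation like df("user") === "matei" as an expression
# tree instead of evaluating it eagerly.

class Expr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def describe(self):
        # Render the captured tree; an optimizer could rewrite it here.
        left = getattr(self.left, "name", self.left)
        return f"({left} {self.op} {self.right})"

class Col:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # The comparison is *recorded*, not computed.
        return Expr("==", self, other)

e = Col("user") == "matei"
print(e.describe())  # (user == matei)
```

Because the comparison returns a tree node instead of a boolean, the query planner sees the full computation before running it, which is what enables the optimizations described below.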
[Chart: running time comparison of Python RDD, Scala RDD, and DataFrame implementations]
How does Tungsten help?
1. Off-Heap Memory Management
Store data outside the JVM heap to avoid object overhead & GC
• For RDDs: fast serialization libraries
• For DataFrames & SQL: a binary format we compute on directly
2-10x space saving, especially for strings, nested objects
Can use new RAM-like devices, e.g. flash, 3D XPoint
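The space saving comes from dropping per-object headers and pointers in favor of a packed layout. A rough illustration in Python (Spark's actual code does this in Java against off-heap memory; this is just the layout argument):

```python
import struct
import sys

# Compare the memory footprint of a record stored as ordinary
# language objects versus the same record packed into a fixed
# binary layout (the idea behind Tungsten's binary row format).

# As objects: each int and string carries its own header overhead.
record = (2013, 1, 1, "UA")
object_size = sum(sys.getsizeof(x) for x in record) + sys.getsizeof(record)

# Packed: three 4-byte ints plus a 2-byte string = 14 bytes total.
packed = struct.pack("<iii2s", 2013, 1, 1, b"UA")

print(object_size, len(packed))  # object form is several times larger
```

The gap widens further for strings and nested structures, which is why the slide cites 2-10x savings for those cases.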
2. Runtime Code Generation
Generate Java code for DataFrame and SQL expressions requested by user
Avoids virtual calls and generics/boxing
Can do the same in core, ML and graph
• Code-gen serializers, fused functions, math expressions

[Chart: evaluating "SELECT a+a+a" (time in seconds): hand-written code 9.3, code gen 9.4, interpreted projection 36.7]
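The chart's point is that generated code runs nearly as fast as hand-written code, while a tree-walking interpreter pays for a dispatch on every node. A minimal Python stand-in for the technique (Spark generates Java, not Python; the tuple-tree encoding here is an assumption for the sketch):

```python
# Sketch of runtime code generation for an expression like a+a+a:
# compare interpreting a tree on every row with compiling the tree
# to a function once and reusing it.

def interpret(expr, row):
    # expr is a nested tuple tree: ("+", left, right) or ("col", name).
    # Every evaluation re-walks the tree (the slow path in the chart).
    if expr[0] == "col":
        return row[expr[1]]
    return interpret(expr[1], row) + interpret(expr[2], row)

def codegen(expr):
    # Emit source for the tree once, then compile it to a function,
    # eliminating the per-node dispatch.
    def emit(e):
        if e[0] == "col":
            return f"row[{e[1]!r}]"
        return f"({emit(e[1])} + {emit(e[2])})"
    return eval(f"lambda row: {emit(expr)}")

tree = ("+", ("+", ("col", "a"), ("col", "a")), ("col", "a"))
row = {"a": 5}
assert interpret(tree, row) == codegen(tree)(row) == 15
```

The generated function also avoids the virtual calls and boxing mentioned above, since the arithmetic is inlined into one flat expression.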
3. Cache-Aware Algorithms
Use custom memory layout to better leverage CPU cache
Example: AlphaSort-style prefix sort
• Store prefixes of sort keys inside the pointer array
• Compare prefixes to avoid full record fetches + comparisons

Naïve layout:          pointer → record
Cache-friendly layout: key prefix + pointer → record
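The cache-friendly layout can be sketched as follows (a simplified Python model; the real implementation works on raw memory, and `prefix_sort` is a hypothetical name for illustration):

```python
# AlphaSort-style prefix sort sketch: sort a compact array of
# (key prefix, record index) pairs so that most comparisons touch
# only the small, cache-resident prefix rather than fetching the
# full record.

def prefix_sort(records, key, prefix_len=4):
    # The "pointer array": a short prefix of each sort key plus an
    # index standing in for the record pointer.
    entries = [(key(r)[:prefix_len], i) for i, r in enumerate(records)]
    # Tuples compare left to right, so the full key (the expensive
    # record fetch) is consulted only when prefixes tie.
    entries.sort(key=lambda e: (e[0], key(records[e[1]])))
    return [records[i] for _, i in entries]

names = ["charlie", "alice", "alicia", "bob"]
print(prefix_sort(names, key=lambda s: s))
# ['alice', 'alicia', 'bob', 'charlie']
```

With 4-byte prefixes, "alice" and "alicia" tie on "alic" and fall back to the full keys, while every other comparison is decided by the prefix alone.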
Tungsten Performance Results
[Chart: run time in seconds (0-1200) vs. data set size (1x-16x) for Default, Code Gen, Tungsten on-heap, and Tungsten off-heap]
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Motivation
Network and storage speeds have improved 10x, but this speed isn’t always easy to leverage!
Many challenges with:
• Keeping disk operations large (even on SSDs)
• Keeping network connections busy & balanced across the cluster
• Doing all this on many cores and many disks
Sort Benchmark
Started by Jim Gray in 1987 to measure HW+SW advances
• Many entrants use purpose-built hardware & software

Participated in the largest category: Daytona GraySort
• Sort 100 TB of 100-byte records in a fault-tolerant manner

Set a new world record (tied with UCSD)
• Saturated 8 SSDs and a 10 Gbps network per node
• 1st time public cloud + open source won
On-Disk Sort Record
Time to sort 100 TB

                      Machines   Time
2013 Record: Hadoop   2100       72 minutes
2014 Record: Spark    207        23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org

Also sorted 1 PB in 4 hours
Saturating the Network
1.1GB/sec per node
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Motivation
Query planning is crucial to performance in a distributed setting
• Level of parallelism in operations
• Choice of algorithm (e.g. broadcast vs. shuffle join)

Hard to do well for big data even with cost-based optimization
• Unindexed data => don't have statistics
• User-defined functions => hard to predict

Solution: let Spark change its query plan adaptively
Traditional Spark Scheduling
file.map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortByKey()

Stages: map → reduce → sort (all planned up front)
Adaptive Planning

file.map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortByKey()

Stages: map → reduce → sort, each planned only after the previous stage runs, using statistics from its output
Advanced Example: Join

Goal: Bring together data items with the same key

• Shuffle join: good if both datasets are large
• Broadcast join: good if one dataset is small
• Hybrid join: broadcast each popular key, shuffle the rest
More details: SPARK-9850
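The hybrid strategy can be sketched as a local simulation (pure Python, single process; `hybrid_join` and the skew threshold are illustrative names, not Spark's API, which is tracked in SPARK-9850):

```python
# Hybrid join sketch: rows of the small side whose keys are heavily
# skewed ("popular") are broadcast to every task, while the
# remaining keys are handled with an ordinary shuffle join.
from collections import Counter, defaultdict

def hybrid_join(left, right, skew_threshold=2):
    # left/right: lists of (key, value) pairs; left is the big side.
    counts = Counter(k for k, _ in left)
    popular = {k for k, c in counts.items() if c >= skew_threshold}

    # Split the small side: popular keys go to the broadcast table,
    # the rest would be shuffled by key.
    broadcast, shuffle = defaultdict(list), defaultdict(list)
    for k, v in right:
        (broadcast if k in popular else shuffle)[k].append(v)

    out = []
    for k, v in left:
        table = broadcast if k in popular else shuffle
        out.extend((k, v, w) for w in table[k])
    return out

left = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
right = [("a", 10), ("b", 20)]
print(sorted(hybrid_join(left, right)))
# [('a', 1, 10), ('a', 2, 10), ('a', 3, 10), ('b', 4, 20)]
```

Broadcasting the skewed key "a" avoids routing all of its rows through a single reducer, which is exactly the imbalance the hybrid plan is designed to remove.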
Impact of Adaptive Planning
Level of parallelism: 2-3x
Choice of join algorithm: as much as 10x
Follow it at SPARK-9850
Effect of Optimizations in Core
Often, when we made one optimization, we saw all of the Spark components get faster
• Scheduler optimization for Spark Streaming => SQL 2x faster
• Network optimizations => speed up all comm-intensive libraries
• Tungsten => DataFrames, SQL and parts of ML
Same applies to other changes in core, e.g. debug tools
Conclusion
Spark has grown a lot, and it remains the most active open source project in big data
Small core + high-level API => can make changes quickly
New hardware => exciting optimizations at all levels
Learn More: sparkhub.databricks.com