New Developments in Spark, Matei Zaharia, August 18th, 2015


Page 1: New Developments in Spark

New Developments in Spark

Matei Zaharia, August 18th, 2015

Page 2: New Developments in Spark

About Databricks

Founded by creators of Spark in 2013 and remains the top contributor

End-to-end service for Spark on EC2
• Interactive notebooks, dashboards, and production jobs

Page 3: New Developments in Spark

Our Goal for Spark

Unified engine across data workloads and platforms

SQL, Streaming, ML, Graph, Batch, …

Page 4: New Developments in Spark

Past 2 Years

Fast growth in libraries and integration points
• New library for SQL + DataFrames
• 10x growth of ML library
• Pluggable data source API
• R language

Result: very diverse use of Spark
• Only 40% of users on Hadoop YARN
• Most users use at least 2 of Spark’s built-in libraries
• 98% of Databricks customers use SQL, 60% use Python

Page 5: New Developments in Spark

Beyond Libraries

The best thing about basing Spark’s libraries on a high-level API is that we can also make big changes underneath them

Now working on some of the largest changes to Spark Core since the project began

Page 6: New Developments in Spark

This Talk

Project Tungsten: CPU and memory efficiency

Network and disk I/O

Adaptive query execution

Page 7: New Developments in Spark

Hardware Trends

Storage

Network

CPU

Page 8: New Developments in Spark

Hardware Trends

2010

Storage: 50+ MB/s (HDD)

Network: 1 Gbps

CPU: ~3 GHz

Page 9: New Developments in Spark

Hardware Trends

2010 → 2015

Storage: 50+ MB/s (HDD) → 500+ MB/s (SSD)

Network: 1 Gbps → 10 Gbps

CPU: ~3 GHz → ~3 GHz

Page 10: New Developments in Spark

Hardware Trends

2010 → 2015

Storage: 50+ MB/s (HDD) → 500+ MB/s (SSD): 10x

Network: 1 Gbps → 10 Gbps: 10x

CPU: ~3 GHz → ~3 GHz: no change

Page 11: New Developments in Spark

Tungsten: Preparing Spark for Next 5 Years

Substantially speed up execution by optimizing CPU efficiency, via:

(1) Off-heap memory management
(2) Runtime code generation
(3) Cache-aware algorithms

Page 12: New Developments in Spark

Interfaces to Tungsten

DataFrames (Python, Java, Scala, R), Spark SQL, RDDs, …

Data schema + query plan

Tungsten backends: JVM, LLVM, GPU, NVRAM

Page 13: New Developments in Spark

DataFrame API

Single-node tabular structure in R and Python, with APIs for:

• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)

Google Trends for “data frame”

Page 14: New Developments in Spark

DataFrame: lingua franca for “small data”

head(flights)
#> Source: local data frame [6 x 16]
#>
#>   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013     1   1      517         2      830        11      UA  N14228
#> 2 2013     1   1      533         4      850        20      UA  N24211
#> 3 2013     1   1      542         2      923        33      AA  N619AA
#> 4 2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...

Page 15: New Developments in Spark


Spark DataFrames

Structured data collections with similar API to R/Python
• DataFrame = RDD + schema

Capture many operations as expressions in a DSL
• Enables rich optimizations

df = jsonFile("tweets.json")

df(df("user") === "matei")
  .groupBy("date")
  .sum("retweets")

[Chart: running time of Python RDD vs. Scala RDD vs. DataFrame]
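The key point behind the snippet above is that each DataFrame operation builds an expression tree rather than executing immediately, which is what lets the engine optimize the whole query. A toy Python sketch of that idea (the `Expr` and `col` names are illustrative, not Spark's internals):

```python
# Minimal sketch of a DataFrame-style DSL that captures operations
# as expression trees instead of executing them eagerly.

class Expr:
    """A node in an expression tree an optimizer could inspect."""
    def __init__(self, op, *children):
        self.op = op
        self.children = children

    def __eq__(self, other):              # analogue of df("user") === "matei"
        return Expr("eq", self, other)

    def describe(self):
        if self.op == "col":
            return f"col({self.children[0]!r})"
        if self.op == "eq":
            left, right = self.children
            ls = left.describe() if isinstance(left, Expr) else repr(left)
            rs = right.describe() if isinstance(right, Expr) else repr(right)
            return f"({ls} == {rs})"
        return self.op

def col(name):
    return Expr("col", name)

# The comparison below does not compute a boolean; it records the
# predicate so an engine could later optimize and compile it.
pred = (col("user") == "matei")
print(pred.describe())                    # (col('user') == 'matei')
```

Because the predicate is data, not executed code, the engine is free to reorder, fuse, or code-generate it later.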

Page 16: New Developments in Spark

How does Tungsten help?

Page 17: New Developments in Spark

1. Off-Heap Memory Management

Store data outside JVM heap to avoid object overhead & GC
• For RDDs: fast serialization libraries
• For DataFrames & SQL: binary format we compute on directly

2-10x space saving, especially for strings, nested objects

Can use new RAM-like devices, e.g. flash, 3D XPoint
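The space saving comes from laying rows out as packed bytes in one contiguous buffer instead of one heap object per field. A rough Python illustration using the standard `struct` module (the 12-byte row layout here is made up for the example; Tungsten's actual binary format differs):

```python
import struct

# Pack (int id, 8-byte fixed string name) rows into one contiguous
# buffer, the way a binary row format avoids per-object overhead.
fmt = struct.Struct("<i8s")   # 4-byte int + 8-byte string = 12 bytes/row

rows = [(i, b"user%03d" % i) for i in range(1000)]

packed = bytearray()
for row in rows:
    packed += fmt.pack(*row)

# Read row 500 back directly from its byte offset -- no object per row,
# and nothing here for a garbage collector to trace.
rid, name = fmt.unpack_from(packed, 500 * fmt.size)
name = name.rstrip(b"\x00")           # drop fixed-width padding
print(rid, name)                      # 500 b'user500'
print(len(packed))                    # 12000 bytes for 1000 rows
```

Fixed-size rows also make random access a single offset computation, which is part of why the engine can compute on the binary format directly.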

Page 18: New Developments in Spark

2. Runtime Code Generation

Generate Java code for DataFrame and SQL expressions requested by user

Avoids virtual calls and generics/boxing

Can do the same in core, ML and graph
• Code-gen serializers, fused functions, math expressions

[Chart: Evaluating “SELECT a+a+a” (time in seconds): hand-written 9.3, code gen 9.4, interpreted projection 36.7]
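The gap between the interpreted and generated numbers in the chart can be sketched in miniature: interpret an expression tree for every row, or compile the expression once and just call the result. This is a toy Python analogue only; Spark generates Java code, not Python:

```python
# Toy analogue of runtime code generation for "SELECT a+a+a".

# Interpreted path: walk an expression tree for every row.
tree = ("add", ("add", "a", "a"), "a")

def interpret(node, row):
    if node == "a":
        return row["a"]
    _, left, right = node
    return interpret(left, row) + interpret(right, row)

# Generated path: turn the tree into source once, compile it,
# then evaluation is a plain function call with no tree walk.
def to_source(node):
    if node == "a":
        return "row['a']"
    _, left, right = node
    return f"({to_source(left)} + {to_source(right)})"

code = compile(f"lambda row: {to_source(tree)}", "<gen>", "eval")
generated = eval(code)

row = {"a": 7}
print(interpret(tree, row), generated(row))   # 21 21
```

Both paths give the same answer; the generated one avoids the per-row dispatch (in Spark's case, virtual calls and boxing) that dominates the interpreted time.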

Page 19: New Developments in Spark

3. Cache-Aware Algorithms

Use custom memory layout to better leverage CPU cache

Example: AlphaSort-style prefix sort
• Store prefixes of sort key inside pointer array
• Compare prefixes to avoid full record fetches + comparisons

Naïve layout: pointer → record

Cache-friendly layout: [key prefix | pointer] → record
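A small Python sketch of the same trick: keep a short prefix of each sort key next to the record index, and touch the full record only when prefixes tie. This is a simplification; the real win comes from the cache-friendly memory layout, which Python cannot express directly:

```python
from functools import cmp_to_key

# Sketch of AlphaSort-style prefix sorting: compare fixed-size key
# prefixes first; only fetch the full record when prefixes tie.

records = ["banana split", "banana bread", "apple pie", "cherry tart"]
PREFIX = 4  # characters of the sort key kept inside the "pointer array"

# The pointer array: (key prefix, index of the full record).
entries = [(r[:PREFIX], i) for i, r in enumerate(records)]

fetches = 0  # how often a comparison had to touch a full record

def compare(a, b):
    global fetches
    if a[0] != b[0]:                   # prefix alone decides the order
        return -1 if a[0] < b[0] else 1
    fetches += 1                       # tie: fall back to full records
    ra, rb = records[a[1]], records[b[1]]
    return (ra > rb) - (ra < rb)

entries.sort(key=cmp_to_key(compare))
ordered = [records[i] for _, i in entries]
print(ordered)
```

Only the two "bana…" entries force a full-record fetch; every other comparison is resolved from the compact prefix array.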

Page 20: New Developments in Spark

Tungsten Performance Results

[Chart: run time (seconds) vs. data set size (1x-16x, relative) for four configurations: Default, Code Gen, Tungsten on-heap, Tungsten off-heap]

Page 21: New Developments in Spark

This Talk

Project Tungsten: CPU and memory efficiency

Network and disk I/O

Adaptive query execution

Page 22: New Developments in Spark

Motivation

Network and storage speeds have improved 10x, but this speed isn’t always easy to leverage!

Many challenges with:
• Keeping disk operations large (even on SSDs)
• Keeping network connections busy & balanced across cluster
• Doing all this on many cores and many disks

Page 23: New Developments in Spark

Sort Benchmark

Started by Jim Gray in 1987 to measure HW+SW advances
• Many entrants use purpose-built hardware & software

Participated in largest category: Daytona GraySort
• Sort 100 TB of 100-byte records in a fault-tolerant manner

Set a new world record (tied with UCSD)
• Saturated 8 SSDs and a 10 Gbps network per node
• First time public cloud + open source won

Page 24: New Developments in Spark

On-Disk Sort Record
Time to sort 100 TB

2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes

Also sorted 1 PB in 4 hours

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 25: New Developments in Spark

Saturating the Network

1.1 GB/s per node

Page 26: New Developments in Spark

This Talk

Project Tungsten: CPU and memory efficiency

Network and disk I/O

Adaptive query execution

Page 27: New Developments in Spark

Motivation

Query planning is crucial to performance in a distributed setting
• Level of parallelism in operations
• Choice of algorithm (e.g. broadcast vs. shuffle join)

Hard to do well for big data even with cost-based optimization
• Unindexed data => no statistics available
• User-defined functions => hard to predict

Solution: let Spark change query plan adaptively
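As a sketch, the broadcast-vs-shuffle decision reduces to a size test applied at runtime, once actual input sizes have been observed, rather than at plan time from possibly missing statistics. The threshold and names below are illustrative, not Spark's configuration:

```python
# Sketch: pick a join strategy from observed (not estimated) input sizes.

BROADCAST_LIMIT = 10 * 1024 * 1024   # illustrative 10 MB threshold

def choose_join(left_bytes, right_bytes, limit=BROADCAST_LIMIT):
    """Broadcast the smaller side if it fits; otherwise shuffle both."""
    if min(left_bytes, right_bytes) <= limit:
        side = "left" if left_bytes <= right_bytes else "right"
        return f"broadcast-{side}"
    return "shuffle"

# With runtime sizes in hand, the planner can switch strategies per stage.
print(choose_join(5 * 1024 * 1024, 80 * 1024**3))   # broadcast-left
print(choose_join(40 * 1024**3, 80 * 1024**3))      # shuffle
```

The point of doing this adaptively is that unindexed data and UDFs make these sizes nearly impossible to estimate up front, but trivial to measure after a stage runs.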

Page 28: New Developments in Spark

Traditional Spark Scheduling

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

map → reduce → sort

Page 29: New Developments in Spark

Adaptive Planning

map

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

Page 31: New Developments in Spark

Adaptive Planning

map

reduce

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

Page 34: New Developments in Spark

Adaptive Planning

map

reduce

sort

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()
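One concrete payoff of scheduling stages one at a time is choosing reduce parallelism after the map output size is known, rather than from a fixed default. A hedged sketch of that decision (the 64 MB target is illustrative, not Spark's setting):

```python
# Sketch: choose the number of reduce tasks from observed map output,
# instead of a default fixed before the job runs.

TARGET_PARTITION_BYTES = 64 * 1024 * 1024   # illustrative 64 MB target

def reduce_tasks(map_output_bytes, target=TARGET_PARTITION_BYTES):
    """One reduce task per ~target bytes of actual map output."""
    return max(1, -(-map_output_bytes // target))   # ceiling division

print(reduce_tasks(10 * 1024 * 1024))   # tiny job -> 1 task
print(reduce_tasks(10 * 1024**3))       # 10 GB of map output -> 160 tasks
```

A small job avoids launching hundreds of near-empty tasks, and a large one avoids oversized partitions, without the user tuning anything.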

Page 35: New Developments in Spark

Advanced Example: Join

Goal: Bring together data items with the same key

Page 36: New Developments in Spark

Advanced Example: Join

Shuffle join (good if both datasets are large)

Goal: Bring together data items with the same key

Page 37: New Developments in Spark

Advanced Example: Join

Broadcast join (good if the top dataset is small)

Goal: Bring together data items with the same key

Page 39: New Developments in Spark

Advanced Example: Join

Hybrid join (broadcast popular keys, shuffle the rest)

Goal: Bring together data items with the same key

More details: SPARK-9850
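The hybrid strategy can be sketched in plain Python: count keys to find the popular (skewed) ones, serve those from a broadcast-style table, and hash-join the rest. The threshold and data below are illustrative:

```python
from collections import Counter

# Sketch of a hybrid join: broadcast rows for popular keys, hash-join
# ("shuffle") the rest, so one hot key does not overload a single task.

left = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 5), ("c", 6)]
right = [("a", "x"), ("b", "y"), ("c", "z")]

SKEW_THRESHOLD = 3  # a key this frequent on the left is "popular"

counts = Counter(k for k, _ in left)
popular = {k for k, c in counts.items() if c >= SKEW_THRESHOLD}

# Broadcast path: ship the few popular right-side rows to every task.
broadcast_side = {k: v for k, v in right if k in popular}

# Shuffle path: ordinary hash join for the remaining keys.
hash_side = {k: v for k, v in right if k not in popular}

joined = []
for k, v in left:
    table = broadcast_side if k in popular else hash_side
    if k in table:
        joined.append((k, v, table[k]))

print(joined)
```

Here only key "a" is skewed, so only its right-side row is broadcast; "b" and "c" take the ordinary path.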

Page 40: New Developments in Spark

Impact of Adaptive Planning

Level of parallelism: 2-3x

Choice of join algorithm: as much as 10x

Follow it at SPARK-9850

Page 41: New Developments in Spark

Effect of Optimizations in Core

Often, when we made one optimization, we saw all of the Spark components get faster

• Scheduler optimization for Spark Streaming => SQL 2x faster
• Network optimizations => speed up all comm-intensive libraries
• Tungsten => DataFrames, SQL and parts of ML

Same applies to other changes in core, e.g. debug tools

Page 42: New Developments in Spark

Conclusion

Spark has grown a lot, but it remains the most active open source project in big data

Small core + high-level API => can make changes quickly

New hardware => exciting optimizations at all levels

Page 43: New Developments in Spark

Learn More: sparkhub.databricks.com