New Developments in Spark
Matei Zaharia
August 18th, 2015
About Databricks
Founded by the creators of Spark in 2013; remains the top contributor

End-to-end service for Spark on EC2
• Interactive notebooks, dashboards, and production jobs
Our Goal for Spark
Unified engine across data workloads and platforms
…
SQL, Streaming, ML, Graph, Batch, …
Past 2 Years
Fast growth in libraries and integration points
• New library for SQL + DataFrames
• 10x growth of ML library
• Pluggable data source API
• R language support

Result: very diverse use of Spark
• Only 40% of users on Hadoop YARN
• Most users use at least 2 of Spark's built-in libraries
• 98% of Databricks customers use SQL, 60% use Python
Beyond Libraries
Best thing about basing Spark’s libraries on a high-level API is that we can also make big changes underneath them
Now working on some of the largest changes to Spark Core since the project began
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Hardware Trends

             2010              2015              Change
Storage      50+ MB/s (HDD)    500+ MB/s (SSD)   10x
Network      1 Gbps            10 Gbps           10x
CPU          ~3 GHz            ~3 GHz            ~1x
Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Off-heap memory management
(2) Runtime code generation
(3) Cache-aware algorithms
Interfaces to Tungsten
[Diagram: Spark SQL, DataFrames (Python, Java, Scala, R), RDDs, … feed a data schema + query plan into Tungsten, which targets backends such as JVM, LLVM, GPU, NVRAM, …]
DataFrame API
Single-node tabular structure in R and Python, with APIs for:
• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)
Google Trends for “data frame”
DataFrame: lingua franca for “small data”
head(flights)
#> Source: local data frame [6 x 16]
#>
#>   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013     1   1      517         2      830        11      UA  N14228
#> 2 2013     1   1      533         4      850        20      UA  N24211
#> 3 2013     1   1      542         2      923        33      AA  N619AA
#> 4 2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
Spark DataFrames
Structured data collections with a similar API to R/Python
• DataFrame = RDD + schema

Capture many operations as expressions in a DSL
• Enables rich optimizations
df = jsonFile("tweets.json")

df.filter(df("user") === "matei")
  .groupBy("date")
  .sum("retweets")
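The key idea in the snippet above is that DSL operations build an expression tree rather than executing immediately, so the engine can inspect and optimize the whole query. Here is a toy Python sketch of that mechanism (an illustration only, not Spark's actual Catalyst implementation; the `Col`/`Expr` names are hypothetical):

```python
# Toy sketch (not Spark's implementation) of how a DataFrame DSL can
# capture an operation like df("user") === "matei" as an expression
# tree instead of evaluating it eagerly.

class Expr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def describe(self):
        # Render the captured tree; an optimizer could rewrite it here.
        left = getattr(self.left, "name", self.left)
        return f"({left} {self.op} {self.right})"

class Col:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # The comparison is *recorded*, not computed.
        return Expr("==", self, other)

e = Col("user") == "matei"
print(e.describe())  # (user == matei)
```

Because the comparison returns a tree node instead of a boolean, the query planner sees the full computation before running it, which is what enables the optimizations described below.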
[Chart: running time comparison of Python RDD, Scala RDD, and DataFrame implementations]
How does Tungsten help?
1. Off-Heap Memory Management
Store data outside the JVM heap to avoid object overhead & GC
• For RDDs: fast serialization libraries
• For DataFrames & SQL: a binary format we compute on directly
2-10x space saving, especially for strings, nested objects
Can use new RAM-like devices, e.g. flash, 3D XPoint
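The space saving comes from dropping per-object headers and pointers in favor of a packed layout. A rough illustration in Python (Spark's actual code does this in Java against off-heap memory; this is just the layout argument):

```python
import struct
import sys

# Compare the memory footprint of a record stored as ordinary
# language objects versus the same record packed into a fixed
# binary layout (the idea behind Tungsten's binary row format).

# As objects: each int and string carries its own header overhead.
record = (2013, 1, 1, "UA")
object_size = sum(sys.getsizeof(x) for x in record) + sys.getsizeof(record)

# Packed: three 4-byte ints plus a 2-byte string = 14 bytes total.
packed = struct.pack("<iii2s", 2013, 1, 1, b"UA")

print(object_size, len(packed))  # object form is several times larger
```

The gap widens further for strings and nested structures, which is why the slide cites 2-10x savings for those cases.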
2. Runtime Code Generation
Generate Java code for DataFrame and SQL expressions requested by user
Avoids virtual calls and generics/boxing
Can do the same in core, ML and graph
• Code-gen serializers, fused functions, math expressions

[Chart: evaluating "SELECT a+a+a" (time in seconds): hand-written code 9.3, code gen 9.4, interpreted projection 36.7]
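The chart's point is that generated code runs nearly as fast as hand-written code, while a tree-walking interpreter pays for a dispatch on every node. A minimal Python stand-in for the technique (Spark generates Java, not Python; the tuple-tree encoding here is an assumption for the sketch):

```python
# Sketch of runtime code generation for an expression like a+a+a:
# compare interpreting a tree on every row with compiling the tree
# to a function once and reusing it.

def interpret(expr, row):
    # expr is a nested tuple tree: ("+", left, right) or ("col", name).
    # Every evaluation re-walks the tree (the slow path in the chart).
    if expr[0] == "col":
        return row[expr[1]]
    return interpret(expr[1], row) + interpret(expr[2], row)

def codegen(expr):
    # Emit source for the tree once, then compile it to a function,
    # eliminating the per-node dispatch.
    def emit(e):
        if e[0] == "col":
            return f"row[{e[1]!r}]"
        return f"({emit(e[1])} + {emit(e[2])})"
    return eval(f"lambda row: {emit(expr)}")

tree = ("+", ("+", ("col", "a"), ("col", "a")), ("col", "a"))
row = {"a": 5}
assert interpret(tree, row) == codegen(tree)(row) == 15
```

The generated function also avoids the virtual calls and boxing mentioned above, since the arithmetic is inlined into one flat expression.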
3. Cache-Aware Algorithms
Use custom memory layout to better leverage CPU cache
Example: AlphaSort-style prefix sort
• Store prefixes of sort keys inside the pointer array
• Compare prefixes to avoid full record fetches + comparisons

Naïve layout:          pointer → record
Cache-friendly layout: key prefix + pointer → record
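The cache-friendly layout can be sketched as follows (a simplified Python model; the real implementation works on raw memory, and `prefix_sort` is a hypothetical name for illustration):

```python
# AlphaSort-style prefix sort sketch: sort a compact array of
# (key prefix, record index) pairs so that most comparisons touch
# only the small, cache-resident prefix rather than fetching the
# full record.

def prefix_sort(records, key, prefix_len=4):
    # The "pointer array": a short prefix of each sort key plus an
    # index standing in for the record pointer.
    entries = [(key(r)[:prefix_len], i) for i, r in enumerate(records)]
    # Tuples compare left to right, so the full key (the expensive
    # record fetch) is consulted only when prefixes tie.
    entries.sort(key=lambda e: (e[0], key(records[e[1]])))
    return [records[i] for _, i in entries]

names = ["charlie", "alice", "alicia", "bob"]
print(prefix_sort(names, key=lambda s: s))
# ['alice', 'alicia', 'bob', 'charlie']
```

With 4-byte prefixes, "alice" and "alicia" tie on "alic" and fall back to the full keys, while every other comparison is decided by the prefix alone.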
Tungsten Performance Results
[Chart: run time in seconds (0-1200) vs. data set size (1x-16x) for Default, Code Gen, Tungsten on-heap, and Tungsten off-heap]
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Motivation
Network and storage speeds have improved 10x, but this speed isn’t always easy to leverage!
Many challenges with:
• Keeping disk operations large (even on SSDs)
• Keeping network connections busy & balanced across the cluster
• Doing all this on many cores and many disks
Sort Benchmark
Started by Jim Gray in 1987 to measure HW+SW advances
• Many entrants use purpose-built hardware & software

Participated in the largest category: Daytona GraySort
• Sort 100 TB of 100-byte records in a fault-tolerant manner

Set a new world record (tied with UCSD)
• Saturated 8 SSDs and a 10 Gbps network per node
• 1st time public cloud + open source won
On-Disk Sort Record
Time to sort 100 TB

                      Machines   Time
2013 Record: Hadoop   2100       72 minutes
2014 Record: Spark    207        23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org

Also sorted 1 PB in 4 hours
Saturating the Network
1.1GB/sec per node
This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution
Motivation
Query planning is crucial to performance in a distributed setting
• Level of parallelism in operations
• Choice of algorithm (e.g. broadcast vs. shuffle join)

Hard to do well for big data even with cost-based optimization
• Unindexed data => don't have statistics
• User-defined functions => hard to predict

Solution: let Spark change its query plan adaptively
Traditional Spark Scheduling
file.map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortByKey()

Stages: map → reduce → sort (all planned up front)
Adaptive Planning

file.map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortByKey()

Stages: map → reduce → sort, each planned only after the previous stage runs, using statistics from its output
Advanced Example: Join

Goal: Bring together data items with the same key

• Shuffle join: good if both datasets are large
• Broadcast join: good if one dataset is small
• Hybrid join: broadcast each popular key, shuffle the rest
More details: SPARK-9850
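The hybrid strategy can be sketched as a local simulation (pure Python, single process; `hybrid_join` and the skew threshold are illustrative names, not Spark's API, which is tracked in SPARK-9850):

```python
# Hybrid join sketch: rows of the small side whose keys are heavily
# skewed ("popular") are broadcast to every task, while the
# remaining keys are handled with an ordinary shuffle join.
from collections import Counter, defaultdict

def hybrid_join(left, right, skew_threshold=2):
    # left/right: lists of (key, value) pairs; left is the big side.
    counts = Counter(k for k, _ in left)
    popular = {k for k, c in counts.items() if c >= skew_threshold}

    # Split the small side: popular keys go to the broadcast table,
    # the rest would be shuffled by key.
    broadcast, shuffle = defaultdict(list), defaultdict(list)
    for k, v in right:
        (broadcast if k in popular else shuffle)[k].append(v)

    out = []
    for k, v in left:
        table = broadcast if k in popular else shuffle
        out.extend((k, v, w) for w in table[k])
    return out

left = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
right = [("a", 10), ("b", 20)]
print(sorted(hybrid_join(left, right)))
# [('a', 1, 10), ('a', 2, 10), ('a', 3, 10), ('b', 4, 20)]
```

Broadcasting the skewed key "a" avoids routing all of its rows through a single reducer, which is exactly the imbalance the hybrid plan is designed to remove.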
Impact of Adaptive Planning
Level of parallelism: 2-3x
Choice of join algorithm: as much as 10x
Follow it at SPARK-9850
Effect of Optimizations in Core
Often, when we made one optimization, we saw all of the Spark components get faster
• Scheduler optimization for Spark Streaming => SQL 2x faster
• Network optimizations => speed up all comm-intensive libraries
• Tungsten => DataFrames, SQL and parts of ML
Same applies to other changes in core, e.g. debug tools
Conclusion
Spark has grown a lot, and it remains the most active open source project in big data
Small core + high-level API => can make changes quickly
New hardware => exciting optimizations at all levels
Learn More: sparkhub.databricks.com