Scaling Spark on Lustre
Nicholas Chaimov (University of Oregon), Costin Iancu, Khaled Ibrahim, Shane Canon (LBNL)
Allen D. Malony (University of Oregon)
Workshop on Performance and Scalability of Storage Systems (WOPSSS 2016)
My Background
q Main area of research: HPC parallel performance measurement and analysis ❍ TAU Performance System
q Research interests ❍ Performance observation (application + system) ❍ Introspection
◆ runtime performance observation and query ❍ In situ analysis ❍ Feedback and adaptive control
q Performance optimization across whole system ❍ Based on behavior knowledge and performance models
q This work supported by IPCC grant to LBL for data analytics (Spark) on Lustre ❍ Nick’s summer internship and Ph.D. graduate support
Data Analytics q Apache Spark is a popular data analytics framework
❍ High-level constructs for expressing analytics computations ❍ Fast and general engine for large-scale data processing ❍ Supports datasets larger than the system physical memory
q Improves programmer productivity through ❍ HLL front-ends (Scala, R, SQL) ❍ Multiple domain-specific libraries:
◆ Streaming, SparkSQL, SparkR, GraphX, Splash, MLLib, Velox
q Specialized runtime provides for ❍ Elastic parallelism and resilience
q Developed for cloud and commodity environments q Latency-optimized local disk storage and bandwidth-optimized network
q Can Spark run well on HPC platforms?
Preview of Main Contributions
q Improved scaling of Spark on Lustre by 520x on HPC systems ❍ from 100 cores to 52,000 cores on Cray
q Delivered scalable data-intensive processing systems competitive with node-level local storage (SSD)
Berkeley Data Analytics Stack (Spark)
From https://amplab.cs.berkeley.edu/software/
(Figure: Berkeley Data Analytics Stack components, annotated with the benchmarks used: Big Data Benchmark, PageRank, Collaborative Filtering, Spark-Perf)
HPC Node Performance on Spark
q Cray node ~2x slower than workstation for Spark when data is on disk
❍ Same concurrency: 86% slower on Edison
❍ All cores (24): 40% slower on Edison
q The problem is I/O! q Cray matches performance when data is cached
Spark SQL Big Data Benchmark dataset (S3 suffix /5nodes/, scale factor 5):
  Rankings: 90 million rows, 6.38 GB
  UserVisits: 775 million rows, 126.8 GB
  Documents: 136.9 GB
Spark and HPC Design Assumptions q Why is I/O a problem on the HPC node? q Spark expects
❍ Local disk w/ HDFS overlay for distributed file system ❍ Can utilize fast local disk (SSD) for shuffle files ❍ Assumes ALL disk operations are fast
q Spark generally targets cloud / commodity cluster ❍ Disk: I/O optimized for latency ❍ Network: optimized for bandwidth ❍ Matches well to Spark expectations
q HPC systems conflict with these assumptions ❍ Disk: I/O optimized for bandwidth ❍ Network: optimized for latency
Getting a Handle on Design Assumptions q Can HPC architectures give performance advantages?
❍ Do we need local disks? ◆ Cloud: node local SSD ◆ BurstBuffer: mid layer of SSD storage ◆ Lustre: backend storage system
❍ Can we exploit the advantages of HPC networks? ◆ What is the impact of RDMA optimizations?
q Differences in architecture guide the software design ❍ Evaluation of Spark on HPC system (Cray XC30, XC40) ❍ Techniques to improve performance on HPC architectures
by eliminating disk I/O overhead
Data Movement in Spark q Block is the unit of movement and execution (persistent)
❍ Vertical movement: blocks moved between memory and local storage within a node ❍ Horizontal movement: blocks transferred across the network to another node
q Shuffle involves both vertical + horizontal movement (temporary)
(Diagram: two nodes connected by the interconnection network; each node runs tasks on its cores and has memory, a Block Manager with persistent storage, and a Shuffle Manager with temporary storage)
I/O Happens Everywhere
q Program input/output ❍ Explicit ❍ Distributed with global namespace (HDFS)
q Shuffle and Block Manager ❍ Implicit ❍ Local (Java) ◆ FileInputStream ◆ FileOutputStream
q Lots of I/O ❍ Both disk and network
(Diagram: example job DAG — RDDs A through G across Stage 1, Stage 2, and Stage 3, built with map, union, groupBy, and join; Block Managers handle input reads, shuffle reads/writes, and output writes)
Spark Data Management Abstraction
q Resilient Distributed Dataset (RDD) ❍ Composed of partitions of data
◆ which are composed of blocks
❍ RDDs are created from other RDDs by applying transformations or actions
❍ Has a lineage specifying how its blocks are computed
❍ Requesting a block either retrieves from cache or triggers computation
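As a small illustration (hypothetical RDD names, not from the slides), caching determines whether a requested block is recomputed from its lineage or served from memory:

val lines = sc.textFile("input.txt")           // builds lineage only, nothing is computed yet
val upper = lines.map(_.toUpperCase).cache()   // mark blocks for caching
upper.count()  // action: blocks are computed and stored by the block manager
upper.count()  // blocks are now served from cache instead of being recomputed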
Word Count Example
q Transformations declare intent ❍ Do not trigger computation but simply build
the lineage ❍ textFile, flatMap, map, reduceByKey
q Actions trigger computation on parent RDD ❍ collect
q Data transparently managed by runtime
val textFile = sc.textFile("input.txt")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.collect()

(Lineage: textFile → flatMap → map → reduceByKey)
Partitioning
(Diagram: the textFile → flatMap → map → reduceByKey pipeline replicated across Nodes 1-6, each node holding a partition of the data)
Data is partitioned by the runtime
Stages
q Vertical data movement (local) between transformations
q Horizontal (remote) and vertical data movement between stages (shuffle)
(Diagram: JOB 0 — STAGE 0 applies textFile, flatMap, map, and reduceByKey(local) to partitions p1-p3; a shuffle then feeds STAGE 1, which performs reduceByKey(global) on partitions p1-p3)
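As a rough illustration (assuming a SparkContext sc and the word-count pipeline from the earlier example), the stage boundary introduced by the shuffle can be inspected with RDD.toDebugString:

val words  = sc.textFile("input.txt").flatMap(_.split(" "))  // narrow, local transformations stay in one stage
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)       // wide dependency: the shuffle starts a new stage
println(counts.toDebugString)  // prints the lineage; the ShuffledRDD marks the stage boundary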
Structure of Spark Input q Data in Spark formats is pre-partitioned q Parquet format:
❍ Single directory containing:
  _common_metadata   ._common_metadata.crc
  _metadata          ._metadata.crc
  part-r-00001.gz.parquet   .part-r-00001.gz.parquet.crc
  part-r-00002.gz.parquet   .part-r-00002.gz.parquet.crc
  […]
  part-r-03977.gz.parquet   .part-r-03977.gz.parquet.crc
  _SUCCESS   ._SUCCESS.crc
❍ This example has 3,977 partitions ◆ 3,977 data files ◆ 3,977 checksum files ◆ 3 metadata files and 3 metadata checksum files
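A minimal sketch of the read that drives these file opens (hypothetical path; Spark 1.5 DataFrame API):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Reading the partitioned Parquet directory touches the metadata files once,
// then each task opens its own part-r-*.parquet file (plus checksum), as described on the next slide.
val rankings = sqlContext.read.parquet("/scratch/bdb/rankings.parquet")
rankings.count()  // action that actually schedules the per-partition reads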
Reason for Multiple File Opens q Even with greatly increased time spent blocking on
open, individual opens are still short q However, there are a large number of opens
❍ At minimum, each task opens file, reads its part, closes file ◆ For each partition of the input, at least one file open
❍ Many readers do multiple opens per partition ◆ sc.textFile => 2 opens per partition
❍ Parquet reader: each task ◆ Opens input file, opens checksum file, compares checksum ◆ Closes input file and checksum file ◆ Opens input file, reads footer, closes input file ◆ Opens input file, reads actual data, closes input file
❍ Total of 4 file open/close cycles per task ❍ 3,977 × 4 = 15,908 file opens to read the Big Data Benchmark dataset
Shuffle (Communication)
q Output (Map) phase ❍ Every node
independently ◆ sorts local data ◆ writes sorted data to disk
q Input (Reduce) phase ❍ Every node
◆ reads local blocks from disk ◆ issues requests for remote blocks ◆ services incoming requests for blocks
(Diagram: the same JOB 0 stage graph as before, with STAGE 0 labeled as the Map phase and STAGE 1 as the Reduce phase)
Shuffle Directory Structure
q Each node stores its maps in a worker specific directory in shuffle temporary storage ❍ Storage is in subdirectories
◆ 15/shuffle_0_1_0.data, 36/shuffle_0_2_0.data, 29/shuffle_0_3_0.data, … ❍ As many shuffle files as there are shuffle tasks
◆ # of shuffle tasks is configurable (# block managers is configurable) ◆ Default: max # of partitions of any parent RDD
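A minimal sketch (configuration values are illustrative) of the two usual ways to control the number of shuffle tasks:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ShufflePartitions").set("spark.default.parallelism", "96")  // default shuffle task count
val sc = new SparkContext(conf)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _, 96)  // per-operation override of the shuffle task count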
(Diagram: the same two-node Block Manager / Shuffle Manager figure as before, highlighting the temporary storage used for shuffle files)
Shuffle Writes q Each shuffle file is written to as many times as there are
partitions in the input ❍ This is controllable ❍ Need partitions > cores for load balancing/latency hiding
q For each write ❍ Open file in append mode ❍ Write results of sorting partition ❍ Close file
q Size of write is size of partition after local reduce ❍ Varies with workload ❍ Partitions are constrained to fit in memory ❍ Often very small (in practice will never exceed 1 GB)
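A rough sketch of the per-write cycle described above (illustrative only, not Spark's actual shuffle writer):

import java.io.{File, FileOutputStream}

def appendShuffleOutput(file: File, sortedPartition: Array[Byte]): Unit = {
  val out = new FileOutputStream(file, true)  // open the shuffle file in append mode
  try out.write(sortedPartition)              // write the sorted, locally reduced partition
  finally out.close()                         // close; this cycle repeats once per input partition
}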
Ok, so what about Spark on HPC Systems?
q Spark certainly can run on HPC platforms
q Consider Spark on Cray machines ❍ Cray XC30 (Edison) and XC40 (Cori) ❍ Lustre parallel distributed file system
q Out of the box, there are performance issues
q The good news is that it's all about Lustre!
q Perform experiments ❍ comparing Lustre and Burst Buffer ❍ using Spark benchmarks
Lustre Design
Burst Buffer
(Diagram: Cori P1 architecture — Spark running on one or multiple compute nodes, connected over the network to the Burst Buffer SSD layer in front of the backend storage)
Experimental Setup q Cray XC30 at NERSC (Edison)
❍ 2.4 GHz Ivy Bridge ❍ 64 GB RAM
q Cray XC40 at NERSC (Cori) ❍ 2.3 GHz Haswell ❍ 128 GB RAM ❍ Cray DataWarp
q Comet at SDSC ❍ 2.5GHz Haswell ❍ InfiniBand FDR ❍ 320 GB SSD ❍ 128 GB RAM regular nodes ❍ 1.5 TB RAM large memory nodes
q Spark 1.5.0 ❍ spark-perf benchmarks (Core + MLLib)
I/O Scalability (Lustre and BB, Cori)
(Chart: GroupByTest I/O components on Cori — time per operation in microseconds vs. nodes (1-16) for Open, Read, and Write on Lustre, striped Burst Buffer, and private Burst Buffer; annotated operation counts: 9,216; 36,864; 147,456; 589,824; 2,359,296)
q Read/Write scales better than Open
q Open/Close operations grow as O(cores²)
BB: Burst Buffer
Spark results in lots of opens
Single MDS node bottleneck (!) vs. multiple OSS nodes (✔)
I/O Variability is HIGH with Extreme Outliers
(Chart panels: READ and fopen time distributions)
q # of fopens quickly overwhelms the MDS
q Variability in fopen access time is the real problem ❍ Mean time is 23x larger than SSD, variability is 14,000x
q Extreme outliers result in straggler tasks (the longest sets a bound!)
Many Opens versus One Open
(Chart: slowdown of n×(open-read-close) cycles vs. a single open with n reads, as a function of read size in bytes, for Edison Lustre, Workstation Local Disk, Cori Lustre, Cori Striped BB, Cori Private BB, Cori Mounted File, Comet Lustre, and Comet SSD; the Lustre configurations are highlighted)
Improving I/O Performance q Eliminate file operations that affect the metadata server
❍ Combine files within a node (currently per core combine) ❍ Keep files open (cache fopen()) ❍ Use memory mapped local file system /dev/shm (no spill) ❍ Use file system backed by single Lustre file
q Partial solutions that need to be used in conjunction ❍ Memory pressure is high in Spark due to resilience and poor
garbage collection ❍ fopen() not necessarily from Spark (e.g., Parquet reader) ❍ Third party layers not optimized for HPC/Lustre
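One of the techniques above (/dev/shm for temporary storage) needs only stock Spark configuration; a minimal sketch, with an illustrative directory name:

import org.apache.spark.{SparkConf, SparkContext}

// Point Spark's temporary (shuffle / block spill) storage at the memory-backed
// /dev/shm file system instead of Lustre. As noted above, there is no spill
// beyond available node memory.
val conf = new SparkConf()
  .setAppName("ShuffleOnShm")
  .set("spark.local.dir", "/dev/shm/spark-tmp")
val sc = new SparkContext(conf)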
Adding File-Backed Filesystems in Shifter q NERSC Shifter
❍ Lightweight container infrastructure for HPC ❍ Compatible with Docker images ❍ Integrated with Slurm scheduler ❍ Idea: control mounting of filesystems within container
q Per-Node Cache ❍ New feature added to improve Spark performance
--volume=$SCRATCH/backingFile:mnt:perNodeCache=size=100G ❍ File for each node is stored on backend Lustre filesystem ❍ File-backed filesystem mounted within each node’s
container instance at common path (/mnt) ❍ Single file open (intermediate data file opens are kept local)
q Now deployed in production on Cori
Scalability with Shifter Lustre-mount Solution
(Chart: Cori GroupBy weak scaling — time to job completion (s) vs. nodes (1-320), comparing Ramdisk, Mounted File, and Lustre)
q Spark on plain Lustre scales only up to O(100) cores
q Spark on Lustre with mounted files scales up to O(10,000) cores ❍ Only 60% slower than in-memory ❍ Does not take away memory resources and is not limited in size to available memory
User-level Techniques Reduce System Call Overhead
(Chart: Cori GroupBy weak scaling — time to job completion (s) vs. nodes (1-320), comparing Ramdisk, Ramdisk+Shifter, and Ramdisk+Pooling)
• File Pooling reduces time spent in syscalls by avoiding fopen calls
• Shifter moves some calls into user mode
• Shifter also benefits shared libraries, class files, and so on, which are stored on a mounted read-only filesystem
10,000 cores w/ Ramdisk • + file pooling (8% speedup) • + Shifter (15% speedup)
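A toy sketch of the file-pooling idea (not the authors' implementation): cache open handles per path so repeated reads of the same file skip the fopen() round trip to the metadata server.

import java.io.RandomAccessFile
import scala.collection.concurrent.TrieMap

object FilePool {
  private val handles = TrieMap.empty[String, RandomAccessFile]

  // Reuse an existing handle if the file was opened before; open it only once otherwise.
  def read(path: String, offset: Long, buf: Array[Byte]): Int = {
    val f = handles.getOrElseUpdate(path, new RandomAccessFile(path, "r"))
    f.synchronized { f.seek(offset); f.read(buf) }
  }

  def closeAll(): Unit = { handles.values.foreach(_.close()); handles.clear() }
}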
Midlayer storage: Optimizing for the Tail
BB median open is 2x slower than Lustre, but BB open variance is 5x smaller
BB scales better than standalone Lustre
Global vs. Local Storage
(Chart: Comet spark-perf MLLib — ratio of time with Lustre storage to time with SSD storage at 1, 2, 4, 8, and 16 nodes, for naive-bayes, kmeans, svd, pca, block-matrix-mult, pearson, chi-sq-feature, chi-sq-gof, chi-sq-mat, word2vec, fp-growth, lda, pic, spearman, summary-stats, and prefix-span)
• When the shuffle is not quadratic or iterative – tasks are compute-bound – Lustre and NAS storage is competitive
• The chart (Comet – time with Lustre storage / time with SSD storage) separates compute-bound from I/O-bound benchmarks
Optimizations Make Global Storage Competitive
(Chart: Cori spark-perf MLLib — ratio of time with Lustre storage to time with a Lustre-backed mounted file at 1, 2, 4, 8, and 16 nodes, for the same benchmark set as above)
• Shifter eliminates remote metadata I/O, but not read/write I/O • High-shuffle benchmark slowdown is high on Lustre compared to Lustre-backed mounted file
– This indicates that most of the overhead (>10x) is in fopen – Not in latency/BW to storage (<2x)
• The slowdown of Lustre relative to SSD is about 2x the slowdown of Lustre relative to the mounted file
Global Storage matches Local Storage
(Chart: chi-sq-feature — time (s) vs. nodes (1-16) on Comet, Comet with RDMA, and Cori, broken down into App, Fetch, and JVM time)
• Cori (TCP) is 20% faster than Comet (RDMA) at 512 cores
Competitive Advantage from Network Performance
(Chart: pic (PowerIterationClustering) — time (s) vs. nodes (1-16) on Comet, Comet with RDMA, and Cori, broken down into App, Fetch, and JVM time)
You do need good TCP
• Benefit of RDMA optimizations is target dependent – Single node Comet is 50% faster than Cori – Cori/TCP is 27% faster than Comet/RDMA on 16 nodes
• Better communication leads to higher availability of cores
Conclusions and Impact q Transformative recipe for tuning Spark on Lustre
❍ We started at O(100) cores scalability ❍ Showed O(10,000) cores scalability
q NERSC used our solutions up to 52,000 cores (Cori Phase I whole machine) ❍ Lustre-mount released in Shifter (Cori and Edison)
https://github.com/NERSC/shifter
q Future work ❍ Global namespace enables “Spark” redesign ❍ Combine Lustre-mount and Burst Buffer ❍ Decentralize scheduler ❍ Evaluate competitive advantage from better networks