Scaling Spark on Lustre
Nicholas Chaimov (University of Oregon), Costin Iancu, Khaled Ibrahim, Shane Canon (LBNL)
Allen D. Malony (University of Oregon)
Workshop on Performance and Scalability of Storage Systems (WOPSSS 2016)
My Background
q Main area of research: HPC parallel performance measurement and analysis ❍ TAU Performance System
q Research interests ❍ Performance observation (application + system) ❍ Introspection
◆ runtime performance observation and query ❍ In situ analysis ❍ Feedback and adaptive control
q Performance optimization across whole system ❍ Based on behavior knowledge and performance models
q This work supported by IPCC grant to LBL for data analytics (Spark) on Lustre ❍ Nick’s summer internship and Ph.D. graduate support
Data Analytics q Apache Spark is a popular data analytics framework
❍ High-level constructs for expressing analytics computations ❍ Fast and general engine for large-scale data processing ❍ Supports datasets larger than the system physical memory
q Improves programmer productivity through ❍ HLL front-ends (Scala, R, SQL) ❍ Multiple domain-specific libraries:
◆ Streaming, SparkSQL, SparkR, GraphX, Splash, MLLib, Velox
q Specialized runtime provides for ❍ Elastic parallelism and resilience
q Developed for cloud and commodity environments q Latency-optimized local disk storage and bandwidth-optimized network
q Can Spark run well on HPC platforms?
Preview of Main Contributions
q Improved scaling of Spark on Lustre by 520x on HPC systems ❍ from 100 cores to 52,000 cores on Cray
q Delivered scalable data-intensive processing systems competitive with node-level local storage (SSD)
Berkeley Data Analytics Stack (Spark)
From https://amplab.cs.berkeley.edu/software/
(Figure: Berkeley Data Analytics Stack components, annotated with the benchmarks used: Big Data Benchmark, PageRank, Collaborative Filtering, Spark-Perf)
HPC Node Performance on Spark
q Cray node ~2x slower than workstation for Spark when data is on disk
❍ Same concurrency: 86% slower on Edison
❍ All cores (24): 40% slower on Edison
q The problem is I/O! q Cray matches performance when data is cached
Spark SQL Big Data Benchmark dataset (S3 suffix /5nodes/, scale factor 5):
  Rankings: 90 million rows, 6.38 GB
  UserVisits: 775 million rows, 126.8 GB
  Documents: 136.9 GB
Spark and HPC Design Assumptions q Why is I/O a problem on the HPC node? q Spark expects
❍ Local disk w/ HDFS overlay for distributed file system ❍ Can utilize fast local disk (SSD) for shuffle files ❍ Assumes ALL disk operations are fast
q Spark generally targets cloud / commodity cluster ❍ Disk: I/O optimized for latency ❍ Network: optimized for bandwidth ❍ Matches well to Spark expectations
q HPC systems conflict with these assumptions ❍ Disk: I/O optimized for bandwidth ❍ Network: optimized for latency
Getting a Handle on Design Assumptions q Can HPC architectures give performance advantages?
❍ Do we need local disks? ◆ Cloud: node local SSD ◆ BurstBuffer: mid layer of SSD storage ◆ Lustre: backend storage system
❍ Can we exploit the advantages of HPC networks? ◆ What is the impact of RDMA optimizations?
q Differences in architecture guide the software design ❍ Evaluation of Spark on HPC system (Cray XC30, XC40) ❍ Techniques to improve performance on HPC architectures
by eliminating disk I/O overhead
Data Movement in Spark q Block is the unit of movement and execution (persistent)
❍ Vertical movement: blocks moved between memory and local storage within a node ❍ Horizontal movement: blocks transferred across the network to another node
q Shuffle involves both vertical + horizontal movement (temporary)
(Diagram: two nodes connected by the interconnection network; each node runs tasks on its cores and has memory, a Block Manager with persistent storage, and a Shuffle Manager with temporary storage)
I/O Happens Everywhere
q Program input/output ❍ Explicit ❍ Distributed with global namespace (HDFS)
q Shuffle and Block Manager ❍ Implicit ❍ Local (Java) ◆ FileInputStream ◆ FileOutputStream
q Lots of I/O ❍ Both disk and network
(Diagram: example job DAG — RDDs A through G across Stage 1, Stage 2, and Stage 3, built with map, union, groupBy, and join; Block Managers handle input reads, shuffle reads/writes, and output writes)
Spark Data Management Abstraction
q Resilient Distributed Dataset (RDD) ❍ Composed of partitions of data
◆ which are composed of blocks
❍ RDDs are created from other RDDs by applying transformations or actions
❍ Has a lineage specifying how its blocks are computed
❍ Requesting a block either retrieves from cache or triggers computation
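As a small illustration (hypothetical RDD names, not from the slides), caching determines whether a requested block is recomputed from its lineage or served from memory:

val lines = sc.textFile("input.txt")           // builds lineage only, nothing is computed yet
val upper = lines.map(_.toUpperCase).cache()   // mark blocks for caching
upper.count()  // action: blocks are computed and stored by the block manager
upper.count()  // blocks are now served from cache instead of being recomputed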
Word Count Example
q Transformations declare intent ❍ Do not trigger computation but simply build
the lineage ❍ textFile, flatMap, map, reduceByKey
q Actions trigger computation on parent RDD ❍ collect
q Data transparently managed by runtime
val textFile = sc.textFile("input.txt")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.collect()

(Lineage: textFile → flatMap → map → reduceByKey)
Partitioning
(Diagram: the textFile → flatMap → map → reduceByKey pipeline replicated across Nodes 1-6, each node holding a partition of the data)
Data is partitioned by the runtime
Stages
q Vertical data movement (local) between transformations
q Horizontal (remote) and vertical data movement between stages (shuffle)
(Diagram: JOB 0 — STAGE 0 applies textFile, flatMap, map, and reduceByKey(local) to partitions p1-p3; a shuffle then feeds STAGE 1, which performs reduceByKey(global) on partitions p1-p3)
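As a rough illustration (assuming a SparkContext sc and the word-count pipeline from the earlier example), the stage boundary introduced by the shuffle can be inspected with RDD.toDebugString:

val words  = sc.textFile("input.txt").flatMap(_.split(" "))  // narrow, local transformations stay in one stage
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)       // wide dependency: the shuffle starts a new stage
println(counts.toDebugString)  // prints the lineage; the ShuffledRDD marks the stage boundary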
Structure of Spark Input q Data in Spark formats is pre-partitioned q Parquet format:
❍ Single directory containing:
  _common_metadata   ._common_metadata.crc
  _metadata          ._metadata.crc
  part-r-00001.gz.parquet   .part-r-00001.gz.parquet.crc
  part-r-00002.gz.parquet   .part-r-00002.gz.parquet.crc
  […]
  part-r-03977.gz.parquet   .part-r-03977.gz.parquet.crc
  _SUCCESS   ._SUCCESS.crc
❍ This example has 3,977 partitions ◆ 3,977 data files ◆ 3,977 checksum files ◆ 3 metadata files and 3 metadata checksum files
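A minimal sketch of the read that drives these file opens (hypothetical path; Spark 1.5 DataFrame API):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Reading the partitioned Parquet directory touches the metadata files once,
// then each task opens its own part-r-*.parquet file (plus checksum), as described on the next slide.
val rankings = sqlContext.read.parquet("/scratch/bdb/rankings.parquet")
rankings.count()  // action that actually schedules the per-partition reads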
Reason for Multiple File Opens q Even with greatly increased time spent blocking on
open, individual opens are still short q However, there are a large number of opens
❍ At minimum, each task opens file, reads its part, closes file ◆ For each partition of the input, at least one file open
❍ Many readers do multiple opens per partition ◆ sc.textFile => 2 opens per partition
❍ Parquet reader: each task ◆ Opens input file, opens checksum file, compares checksum ◆ Closes input file and checksum file ◆ Opens input file, reads footer, closes input file ◆ Opens input file, reads actual data, closes input file
❍ Total of 4 file open/close cycles per task ❍ 3,977 × 4 = 15,908 file opens to read the Big Data Benchmark dataset
Shuffle (Communication)
q Output (Map) phase ❍ Every node
independently ◆ sorts local data ◆ writes sorted data to disk
q Input (Reduce) phase ❍ Every node
◆ reads local blocks from disk ◆ issues requests for remote blocks ◆ services incoming requests for blocks
(Diagram: the same JOB 0 stage graph as before, with STAGE 0 labeled as the Map phase and STAGE 1 as the Reduce phase)
Shuffle Directory Structure
q Each node stores its maps in a worker specific directory in shuffle temporary storage ❍ Storage is in subdirectories
◆ 15/shuffle_0_1_0.data, 36/shuffle_0_2_0.data, 29/shuffle_0_3_0.data, … ❍ As many shuffle files as there are shuffle tasks
◆ # of shuffle tasks is configurable (# block managers is configurable) ◆ Default: max # of partitions of any parent RDD
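A minimal sketch (configuration values are illustrative) of the two usual ways to control the number of shuffle tasks:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ShufflePartitions").set("spark.default.parallelism", "96")  // default shuffle task count
val sc = new SparkContext(conf)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _, 96)  // per-operation override of the shuffle task count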
(Diagram: the same two-node Block Manager / Shuffle Manager figure as before, highlighting the temporary storage used for shuffle files)
Shuffle Writes q Each shuffle file is written to as many times as there are
partitions in the input ❍ This is controllable ❍ Need partitions > cores for load balancing/latency hiding
q For each write ❍ Open file in append mode ❍ Write results of sorting partition ❍ Close file
q Size of write is size of partition after local reduce ❍ Varies with workload ❍ Partitions are constrained to fit in memory ❍ Often very small (in practice will never exceed 1 GB)
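A rough sketch of the per-write cycle described above (illustrative only, not Spark's actual shuffle writer):

import java.io.{File, FileOutputStream}

def appendShuffleOutput(file: File, sortedPartition: Array[Byte]): Unit = {
  val out = new FileOutputStream(file, true)  // open the shuffle file in append mode
  try out.write(sortedPartition)              // write the sorted, locally reduced partition
  finally out.close()                         // close; this cycle repeats once per input partition
}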
Ok, so what about Spark on HPC Systems?
q Spark certainly can run on HPC platforms
q Consider Spark on Cray machines ❍ Cray XC30 (Edison) and XC40 (Cori) ❍ Lustre parallel distributed file system
q Out of the box, there are performance issues
q The good news is that it's all about Lustre!
q Perform experiments ❍ comparing Lustre and Burst Buffer ❍ using Spark benchmarks
Lustre Design
Burst Buffer
(Diagram: Cori P1 architecture — Spark running on one or multiple compute nodes, connected over the network to the Burst Buffer SSD layer in front of the backend storage)
Experimental Setup q Cray XC30 at NERSC (Edison)
❍ 2.4 GHz Ivy Bridge ❍ 64 GB RAM
q Cray XC40 at NERSC (Cori) ❍ 2.3 GHz Haswell ❍ 128 GB RAM ❍ Cray DataWarp
q Comet at SDSC ❍ 2.5GHz Haswell ❍ InfiniBand FDR ❍ 320 GB SSD ❍ 128 GB RAM regular nodes ❍ 1.5 TB RAM large memory nodes
q Spark 1.5.0 ❍ spark-perf benchmarks (Core + MLLib)
I/O Scalability (Lustre and BB, Cori)
(Chart: GroupByTest I/O components on Cori — time per operation in microseconds vs. nodes (1-16) for Open, Read, and Write on Lustre, striped Burst Buffer, and private Burst Buffer; annotated operation counts: 9,216; 36,864; 147,456; 589,824; 2,359,296)
q Read/Write scales better than Open
q Open/Close operations grow as O(cores²)
BB: Burst Buffer
Spark results in lots of opens
Single MDS node bottleneck (!) vs. multiple OSS nodes (✔)
I/O Variability is HIGH with Extreme Outliers
(Chart panels: READ and fopen time distributions)
q # of fopens quickly overwhelms the MDS
q Variability in fopen access time is the real problem ❍ Mean time is 23x larger than SSD, variability is 14,000x
q Extreme outliers result in straggler tasks (the longest sets a bound!)
Many Opens versus One Open
(Chart: slowdown of n×(open-read-close) cycles vs. a single open with n reads, as a function of read size in bytes, for Edison Lustre, Workstation Local Disk, Cori Lustre, Cori Striped BB, Cori Private BB, Cori Mounted File, Comet Lustre, and Comet SSD; the Lustre configurations are highlighted)
Improving I/O Performance q Eliminate file operations that affect the metadata server
❍ Combine files within a node (currently per core combine) ❍ Keep files open (cache fopen()) ❍ Use memory mapped local file system /dev/shm (no spill) ❍ Use file system backed by single Lustre file
q Partial solutions that need to be used in conjunction ❍ Memory pressure is high in Spark due to resilience and poor
garbage collection ❍ fopen() not necessarily from Spark (e.g., Parquet reader) ❍ Third party layers not optimized for HPC/Lustre
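One of the techniques above (/dev/shm for temporary storage) needs only stock Spark configuration; a minimal sketch, with an illustrative directory name:

import org.apache.spark.{SparkConf, SparkContext}

// Point Spark's temporary (shuffle / block spill) storage at the memory-backed
// /dev/shm file system instead of Lustre. As noted above, there is no spill
// beyond available node memory.
val conf = new SparkConf()
  .setAppName("ShuffleOnShm")
  .set("spark.local.dir", "/dev/shm/spark-tmp")
val sc = new SparkContext(conf)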
Adding File-Backed Filesystems in Shifter q NERSC Shifter
❍ Lightweight container infrastructure for HPC ❍ Compatible with Docker images ❍ Integrated with Slurm scheduler ❍ Idea: control mounting of filesystems within container
q Per-Node Cache ❍ New feature added to improve Spark performance
--volume=$SCRATCH/backingFile:mnt:perNodeCache=size=100G ❍ File for each node is stored on backend Lustre filesystem ❍ File-backed filesystem mounted within each node’s
container instance at common path (/mnt) ❍ Single file open (intermediate data file opens are kept local)
q Now deployed in production on Cori
Scalability with Shifter Lustre-mount Solution
(Chart: Cori GroupBy weak scaling — time to job completion (s) vs. nodes (1-320), comparing Ramdisk, Mounted File, and Lustre)
q Spark on plain Lustre scales only up to O(100) cores
q Spark on Lustre with mounted files scales up to O(10,000) cores ❍ Only 60% slower than in-memory ❍ Does not take away memory resources and is not limited in size to available memory
User-level Techniques Reduce System Call Overhead
(Chart: Cori GroupBy weak scaling — time to job completion (s) vs. nodes (1-320), comparing Ramdisk, Ramdisk+Shifter, and Ramdisk+Pooling)
• File Pooling reduces time spent in syscalls by avoiding fopen calls
• Shifter moves some calls into user mode
• Shifter also benefits shared libraries, class files, and so on, which are stored on a mounted read-only filesystem
10,000 cores w/ Ramdisk • + file pooling (8% speedup) • + Shifter (15% speedup)
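A toy sketch of the file-pooling idea (not the authors' implementation): cache open handles per path so repeated reads of the same file skip the fopen() round trip to the metadata server.

import java.io.RandomAccessFile
import scala.collection.concurrent.TrieMap

object FilePool {
  private val handles = TrieMap.empty[String, RandomAccessFile]

  // Reuse an existing handle if the file was opened before; open it only once otherwise.
  def read(path: String, offset: Long, buf: Array[Byte]): Int = {
    val f = handles.getOrElseUpdate(path, new RandomAccessFile(path, "r"))
    f.synchronized { f.seek(offset); f.read(buf) }
  }

  def closeAll(): Unit = { handles.values.foreach(_.close()); handles.clear() }
}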
Midlayer storage: Optimizing for the Tail
BB median open is 2x slower than Lustre, but BB open variance is 5x smaller
BB scales better than standalone Lustre
Global vs. Local Storage
(Chart: Comet spark-perf MLLib — ratio of time with Lustre storage to time with SSD storage at 1, 2, 4, 8, and 16 nodes, for naive-bayes, kmeans, svd, pca, block-matrix-mult, pearson, chi-sq-feature, chi-sq-gof, chi-sq-mat, word2vec, fp-growth, lda, pic, spearman, summary-stats, and prefix-span)
• When the shuffle is not quadratic or iterative – tasks are compute-bound – Lustre and NAS storage is competitive
• The chart (Comet – time with Lustre storage / time with SSD storage) separates compute-bound from I/O-bound benchmarks
Optimizations Make Global Storage Competitive
(Chart: Cori spark-perf MLLib — ratio of time with Lustre storage to time with a Lustre-backed mounted file at 1, 2, 4, 8, and 16 nodes, for the same benchmark set as above)
• Shifter eliminates remote metadata I/O, but not read/write I/O • High-shuffle benchmark slowdown is high on Lustre compared to Lustre-backed mounted file
– This indicates that most of the overhead (>10x) is in fopen – Not in latency/BW to storage (<2x)
• The slowdown of Lustre relative to SSD is about 2x the slowdown of Lustre relative to the mounted file
Global Storage matches Local Storage
(Chart: chi-sq-feature — time (s) vs. nodes (1-16) on Comet, Comet with RDMA, and Cori, broken down into App, Fetch, and JVM time)
• Cori (TCP) is 20% faster than Comet (RDMA) at 512 cores
Competitive Advantage from Network Performance
(Chart: pic (PowerIterationClustering) — time (s) vs. nodes (1-16) on Comet, Comet with RDMA, and Cori, broken down into App, Fetch, and JVM time)
You do need good TCP
• Benefit of RDMA optimizations is target dependent – Single node Comet is 50% faster than Cori – Cori/TCP is 27% faster than Comet/RDMA on 16 nodes
• Better communication leads to higher availability of cores
Conclusions and Impact q Transformative recipe for tuning Spark on Lustre
❍ We started at O(100) cores scalability ❍ Showed O(10,000) cores scalability
q NERSC used our solutions up to 52,000 cores (Cori Phase I whole machine) ❍ Lustre-mount released in Shifter (Cori and Edison)
https://github.com/NERSC/shifter
q Future work ❍ Global namespace enables “Spark” redesign ❍ Combine Lustre-mount and Burst Buffer ❍ Decentralize scheduler ❍ Evaluate competitive advantage from better networks