Accelerating Spark with RDMA for Big Data Processing: Early Experiences
HotI 2014
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Digital Data Explosion in the Society
Figure Source: http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode
Big Data Technology - Hadoop
[Figure: Hadoop 1.x stack - MapReduce (cluster resource management & data processing) over HDFS and Hadoop Common/Core (RPC, ...); Hadoop 2.x stack - MapReduce and other models (data processing) over YARN (cluster resource management & job scheduling), HDFS, and Hadoop Common/Core (RPC, ...)]
• Apache Hadoop is one of the most popular Big Data technologies
  – Provides frameworks for large-scale, distributed data storage and processing
  – MapReduce, HDFS, YARN, RPC, etc.
Big Data Technology - Spark
[Figure: Spark master-slave deployment - a Driver (SparkContext) and Master coordinating multiple Workers, with HDFS and Zookeeper; RDD dependency graph contrasting narrow vs. wide dependencies between partitions/data blocks]
• An emerging in-memory processing framework
  – Iterative machine learning
  – Interactive data analytics
  – Scala-based implementation
  – Master-Slave architecture; HDFS, Zookeeper
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Same as Hadoop, sockets-based communication
Common Protocols using Open Fabrics
[Figure: protocol stack options over modern interconnects - kernel-space TCP/IP sockets on 1/10/40 GigE, hardware-offloaded TCP/IP on 10/40 GigE-TOE, IPoIB over InfiniBand, user-space RSockets and SDP over InfiniBand, and RDMA via iWARP, RoCE, and native IB Verbs]
Previous Studies
• Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
  – Hadoop Acceleration with RDMA
    • N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC'14
    • N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC'12
    • M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS'14
    • M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC'13
    • X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP'13
  – HBase Acceleration with RDMA
    • J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12
  – Memcached Acceleration with RDMA
    • J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP'11
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
RDMA for Apache Hadoop (http://hibd.cse.ohio-state.edu)
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for HDFS, MapReduce, and RPC components
  – Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.9 (03/31/14)
  – Based on Apache Hadoop 1.2.1
  – Compliant with Apache Hadoop 1.2.1 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs
• RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!
Performance Benefits – RandomWriter & Sort in SDSC-Gordon
[Figure: Job execution time (sec) of RandomWriter and Sort for 50–300 GB on 16–64 node clusters, IPoIB (QDR) vs. OSU-IB (QDR); annotations show reductions of 20% and 36%]
• RandomWriter in a cluster with a single SSD per node
  – 16% improvement over IPoIB for 50 GB in a cluster of 16 nodes
  – 20% improvement over IPoIB for 300 GB in a cluster of 64 nodes
• Sort in a cluster with a single SSD per node
  – 20% improvement over IPoIB for 50 GB in a cluster of 16 nodes
  – 36% improvement over IPoIB for 300 GB in a cluster of 64 nodes
Can RDMA also benefit Apache Spark on high-performance networks?
Problem Statement
• Is it worth it?
  – Is the performance improvement potential high enough, if we can successfully adapt RDMA to Spark?
  – A few percentage points or orders of magnitude?
• How difficult is it to adapt RDMA to Spark?
  – Can RDMA be adapted to suit the communication needs of Spark?
  – Is it viable to rewrite portions of the Spark code with RDMA?
• Can Spark applications benefit from an RDMA-enhanced design?
  – What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
  – Can an RDMA-based design benefit applications transparently?
Assessment of Performance Improvement Potential
• How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
• Assessment Methodology
  – Evaluation on primitive-level micro-benchmarks (see the sketch after this list)
    • Latency, Bandwidth
    • 1GigE, 10GigE, IPoIB (32Gbps), RDMA (32Gbps)
    • Java/Scala-based environment
  – Evaluation on typical workloads
    • GroupByTest
    • 1GigE, 10GigE, IPoIB (32Gbps)
    • Spark 0.9.1
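To make the methodology concrete, the following is a minimal sketch of a socket-based ping-pong latency micro-benchmark of the kind usable in a Java/Scala environment; the port, message size, and iteration count are illustrative assumptions, not the exact micro-benchmarks used in this study.

```scala
// Minimal sketch of a socket ping-pong latency micro-benchmark (illustrative only).
// Run one JVM with "server <port>" and one with "client <host> <port> <msgSize>".
import java.net.{ServerSocket, Socket}

object PingPongLatency {
  val Iterations = 1000 // illustrative iteration count

  def main(args: Array[String]): Unit = args(0) match {
    case "server" =>
      val srv = new ServerSocket(args(1).toInt)
      val sock = srv.accept()
      val in = sock.getInputStream
      val out = sock.getOutputStream
      val buf = new Array[Byte](4 * 1024 * 1024)
      var n = in.read(buf)
      while (n > 0) { out.write(buf, 0, n); out.flush(); n = in.read(buf) } // echo back
      sock.close(); srv.close()

    case "client" =>
      val sock = new Socket(args(1), args(2).toInt)
      sock.setTcpNoDelay(true)
      val in = sock.getInputStream
      val out = sock.getOutputStream
      val msg = new Array[Byte](args(3).toInt) // message size under test
      val start = System.nanoTime()
      for (_ <- 1 to Iterations) {
        out.write(msg); out.flush()
        var read = 0
        while (read < msg.length) read += in.read(msg, read, msg.length - read) // wait for echo
      }
      val avgUs = (System.nanoTime() - start) / 1000.0 / Iterations / 2 // half round-trip
      println(f"msg=${msg.length} bytes, avg one-way latency = $avgUs%.2f us")
      sock.close()
  }
}
```

A bandwidth test follows the same pattern, streaming large messages in one direction and dividing the bytes moved by the elapsed time.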
Evaluation on Primitive-level Micro-benchmarks
[Figure: latency vs. message size (1 byte – 2 KB in us; 4 KB – 4 MB in ms) and peak bandwidth (MBps) for 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]
• As expected, IPoIB and 10GigE are much better than 1GigE.
• Compared to the other interconnects/protocols, RDMA can significantly
  – reduce the latency for all message sizes
  – improve the peak bandwidth
Can these benefits of high-performance networks be achieved in Apache Spark?
Evaluation on Typical Workloads
• For high-performance networks,
  – The execution time of GroupBy is significantly improved
  – The network throughput is much higher for both send and receive
• Spark can benefit greatly from high-performance interconnects!
[Figure: GroupBy time (sec) for 4–10 GB data sizes and network throughput (MB/s) sampled over time in the send and receive directions, for 1GigE, 10GigE, and IPoIB (32Gbps); annotations show improvements of 65% and 22%]
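For reference, the GroupBy-style workload used above can be expressed in a few lines of Spark. The sketch below is modeled loosely on Spark's bundled GroupByTest example; the mapper/reducer counts, record sizes, and local master setting are illustrative, not the exact parameters of this evaluation.

```scala
// Simplified sketch of a GroupBy-style shuffle workload (modeled on Spark's GroupByTest example).
// Sizes and counts are illustrative, not the exact parameters of this evaluation.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits needed on older Spark (e.g. 0.9.x)
import scala.util.Random

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GroupBySketch").setMaster("local[4]") // local mode for illustration
    val sc = new SparkContext(conf)
    val numMappers  = 32      // number of map tasks
    val numKVPairs  = 100000  // key/value pairs generated per mapper
    val valSize     = 1000    // bytes per value
    val numReducers = 32      // number of shuffle (reduce) partitions

    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new Random
      Array.fill(numKVPairs)((rand.nextInt(Int.MaxValue), new Array[Byte](valSize)))
    }.cache()
    pairs.count() // materialize the input before timing the shuffle

    val start = System.nanoTime()
    val groups = pairs.groupByKey(numReducers).count() // the all-to-all shuffle is exercised here
    println(s"groups=$groups, groupBy time=${(System.nanoTime() - start) / 1e9} s")
    sc.stop()
  }
}
```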
Can RDMA further improve Spark performance compared with IPoIB and 10GigE?
Architecture Overview
• Design Goals
  – High performance
  – Keep the existing Spark architecture and interfaces intact
  – Minimal code changes
[Figure: Spark shuffle architecture - Spark applications (Scala/Java/Python) run tasks whose BlockManager/BlockFetcherIterator use either the default Java NIO shuffle server/fetcher, the optional Netty shuffle server/fetcher, or the proposed RDMA shuffle server/fetcher plug-ins; the plug-ins sit on an RDMA-based Shuffle Engine (Java/JNI) over native InfiniBand (QDR/FDR), while the defaults use Java sockets over 1/10 GigE or IPoIB]
• Approaches
  – Plug-in based approach to extend the shuffle framework in Spark (see the sketch below)
    • RDMA Shuffle Server + RDMA Shuffle Fetcher
    • 100 lines of code changes inside the original Spark files
  – RDMA-based Shuffle Engine
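Conceptually, the plug-in approach means the shuffle server/fetcher implementation is selected behind a small interface. The trait, class, and property names in the sketch below are hypothetical simplifications for illustration only; they are not Spark's internal shuffle classes or configuration keys.

```scala
// Hypothetical, simplified view of a pluggable shuffle fetcher chosen by configuration.
// Names here are illustrative; they are not Spark's internal API.
import java.nio.ByteBuffer

trait ShuffleFetcherPlugin {
  /** Fetch the shuffle blocks for one reduce partition from the given map outputs. */
  def fetchBlocks(blockIds: Seq[String]): Iterator[(String, ByteBuffer)]
}

class NioShuffleFetcher extends ShuffleFetcherPlugin {   // default path: Java NIO sockets
  def fetchBlocks(blockIds: Seq[String]): Iterator[(String, ByteBuffer)] = Iterator.empty // stub
}

class RdmaShuffleFetcher extends ShuffleFetcherPlugin {  // plug-in path: RDMA-based Shuffle Engine (JNI)
  def fetchBlocks(blockIds: Seq[String]): Iterator[(String, ByteBuffer)] = Iterator.empty // stub
}

object ShuffleFetcherPlugin {
  /** Choose a fetcher implementation from a (hypothetical) configuration property. */
  def forConfig(props: java.util.Properties): ShuffleFetcherPlugin =
    props.getProperty("spark.shuffle.fetcher.plugin", "nio") match {
      case "rdma" => new RdmaShuffleFetcher
      case _      => new NioShuffleFetcher
    }
}
```

Keeping the selection behind one small interface is what limits the changes to the original Spark files to a handful of lines.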
SEDA-based Data Shuffle Plug-ins
• SEDA - Staged Event-Driven Architecture
• A set of stages connected by queues
• A dedicated thread pool is in charge of processing the events on the corresponding queue
  – Listener, Receiver, Handlers, Responders
• Admission control is performed on these event queues
• High throughput by maximally overlapping the different processing stages while maintaining the default task-level parallelism in Spark (see the sketch below)
[Figure: SEDA pipeline inside the RDMA-based Shuffle Engine - a Listener accepts connections, a Receiver receives requests and enqueues them on the request queue, Handlers dequeue requests and read block data, and a Responder dequeues from the response queue and sends responses; connections come from a shared connection pool]
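A minimal sketch of the SEDA structure - stages connected by bounded queues, each drained by its own thread pool - might look like the following; the types, queue capacities, and thread counts are illustrative assumptions rather than the engine's actual implementation.

```scala
// Minimal sketch of two SEDA stages connected by bounded queues, each drained by its own
// thread pool. Types, queue capacities, and thread counts are illustrative assumptions.
import java.util.concurrent.{ArrayBlockingQueue, Executors}

case class Request(blockId: String)
case class Response(blockId: String, data: Array[Byte])

object SedaShuffleSketch {
  // Bounded queues give a simple form of admission control between stages.
  val requestQueue  = new ArrayBlockingQueue[Request](1024)
  val responseQueue = new ArrayBlockingQueue[Response](1024)

  val handlers   = Executors.newFixedThreadPool(4) // handler stage: read block data
  val responders = Executors.newFixedThreadPool(2) // responder stage: send responses

  def start(): Unit = {
    for (_ <- 1 to 4) handlers.submit(new Runnable {
      def run(): Unit = while (true) {
        val req = requestQueue.take()                 // blocks until a request is enqueued
        val data = readBlock(req.blockId)             // overlaps with the other stages
        responseQueue.put(Response(req.blockId, data))
      }
    })
    for (_ <- 1 to 2) responders.submit(new Runnable {
      def run(): Unit = while (true) sendResponse(responseQueue.take())
    })
  }

  // Stubs standing in for the real block-read and RDMA-send paths.
  def readBlock(blockId: String): Array[Byte] = new Array[Byte](0)
  def sendResponse(rsp: Response): Unit = ()
}
```

Because each stage has its own threads and queue, slow block reads overlap with request reception and response sending instead of serializing behind them.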
RDMA-based Shuffle Engine: Connection Management
• Alternative designs
  – Pre-connection
    • Hides the overhead in the initialization stage
    • Before the actual communication, pre-connects processes to each other
    • Sacrifices more resources to keep all of these connections alive
  – Dynamic Connection Establishment and Sharing
    • A connection is established if and only if an actual data exchange is going to take place
    • A naive dynamic connection design is not optimal, because resources must be allocated for every data block transfer
    • An advanced dynamic connection scheme reduces the number of connection establishments
    • Spark uses a multi-threading approach to support multi-task execution in a single JVM → a good chance to share connections!
    • How long should a connection be kept alive for possible re-use?
    • Timeout mechanism for connection destruction (see the sketch below)
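A rough sketch of dynamic connection establishment with sharing and an idle timeout is shown below; the Connection class, endpoint key, and timeout value are placeholders rather than the engine's actual data structures.

```scala
// Rough sketch of dynamic connection establishment with sharing and an idle timeout.
// The Connection type, endpoint key, and timeout value are illustrative placeholders.
import java.util.concurrent.ConcurrentHashMap

class Connection(val endpoint: String) {
  @volatile var lastUsedMs: Long = System.currentTimeMillis()
  def close(): Unit = () // stub for tearing down the underlying RDMA connection
}

class ConnectionPool(idleTimeoutMs: Long = 30000) {
  private val conns = new ConcurrentHashMap[String, Connection]()

  /** Establish a connection only when a data exchange is about to happen; share it afterwards. */
  def get(endpoint: String): Connection = {
    var c = conns.get(endpoint)
    if (c == null) {
      val fresh = new Connection(endpoint)
      val prev = conns.putIfAbsent(endpoint, fresh) // many tasks in the JVM share one connection
      c = if (prev != null) prev else fresh
    }
    c.lastUsedMs = System.currentTimeMillis()
    c
  }

  /** Periodically destroy connections that have been idle longer than the timeout. */
  def reapIdle(): Unit = {
    val now = System.currentTimeMillis()
    val it = conns.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      if (now - e.getValue.lastUsedMs > idleTimeoutMs) { e.getValue.close(); it.remove() }
    }
  }
}
```

Sharing one connection per endpoint amortizes setup cost across the many concurrent tasks in a single JVM, while the timeout bounds the resources held for idle peers.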
RDMA-based Shuffle Engine: Data Transfer
• Each connection is used by multiple tasks (threads) to transfer data concurrently
• Packets over the same communication lane go to different entities on both the server and fetcher sides
• Alternative designs
  – Sequential transfers of complete blocks over a communication lane → keeps the order
    • Causes long wait times for tasks that are ready to transfer data over the same connection
  – Non-blocking and out-of-order data transfer (see the sketch below)
    • Chunking data blocks
    • Non-blocking sending over shared connections
    • Out-of-order packet communication
    • Guarantees both performance and ordering
    • Works efficiently with the dynamic connection management and sharing mechanism
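The chunking and reordering idea can be sketched as follows: blocks are split into chunks tagged with a request ID and sequence number, chunks from different tasks may interleave and arrive out of order on a shared connection, and the fetcher reassembles each block by sequence number. The field names and chunk size below are illustrative; the reassembly map is a single-threaded sketch.

```scala
// Sketch of chunked, out-of-order block transfer over a shared connection. Chunks carry a
// request ID and a sequence number so the receiver can reassemble blocks per request even
// when packets from different tasks interleave. Names and sizes are illustrative.
import scala.collection.mutable

case class Chunk(requestId: Long, seqNo: Int, totalChunks: Int, payload: Array[Byte])

object ChunkedTransfer {
  val ChunkSize = 64 * 1024 // illustrative chunk size

  /** Split one shuffle block into chunks that can be sent non-blocking over a shared lane. */
  def split(requestId: Long, block: Array[Byte]): Seq[Chunk] = {
    val total = math.max(1, (block.length + ChunkSize - 1) / ChunkSize)
    (0 until total).map { i =>
      Chunk(requestId, i, total,
        block.slice(i * ChunkSize, math.min((i + 1) * ChunkSize, block.length)))
    }
  }

  // Reassembly state on the fetcher side (single-threaded sketch): chunks may arrive
  // out of order and interleaved with chunks belonging to other requests.
  private val pending = mutable.Map[Long, mutable.Map[Int, Array[Byte]]]()

  /** Returns the complete block once all of its chunks have arrived, otherwise None. */
  def onChunkArrived(c: Chunk): Option[Array[Byte]] = {
    val parts = pending.getOrElseUpdate(c.requestId, mutable.Map())
    parts(c.seqNo) = c.payload
    if (parts.size == c.totalChunks) {
      pending.remove(c.requestId)
      Some((0 until c.totalChunks).flatMap(parts(_)).toArray) // order restored by sequence number
    } else None
  }
}
```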
RDMA-based Shuffle Engine: Buffer Management
• On-JVM-Heap vs. Off-JVM-Heap buffer management
• Off-JVM-Heap
  – High performance through native I/O
  – Shadow buffers in Java/Scala
  – Registered for RDMA communication
• Flexibility for upper-layer design choices (see the sketch below)
  – Supports the connection sharing mechanism → request ID
  – Supports in-order packet processing → sequence number
  – Supports non-blocking send → buffer flag + callback
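Off-JVM-heap buffers can be pictured as direct ByteBuffers whose small header carries the request ID, sequence number, and a flag for the non-blocking send callback; the layout below is an illustrative assumption, and RDMA memory registration itself would happen on the native (JNI) side.

```scala
// Sketch of off-JVM-heap (direct) buffers acting as shadow buffers for the native engine.
// The header layout (request ID, sequence number, flag) is an illustrative assumption;
// RDMA memory registration itself would be done by the native (JNI) side.
import java.nio.ByteBuffer

object OffHeapBuffers {
  val BufferSize  = 256 * 1024 // illustrative buffer size
  val HeaderBytes = 8 + 4 + 4  // requestId (Long) + seqNo (Int) + flag (Int)

  /** Allocate a direct buffer outside the JVM heap so native I/O avoids an extra copy. */
  def allocate(): ByteBuffer = ByteBuffer.allocateDirect(BufferSize)

  /** Write the small header that lets shared connections demultiplex and order packets. */
  def writeHeader(buf: ByteBuffer, requestId: Long, seqNo: Int, flag: Int): ByteBuffer = {
    buf.clear()
    buf.putLong(requestId) // identifies which fetch request this packet belongs to
    buf.putInt(seqNo)      // restores ordering under out-of-order delivery
    buf.putInt(flag)       // e.g. last-chunk marker used by the non-blocking send callback
    buf
  }

  def readHeader(buf: ByteBuffer): (Long, Int, Int) = {
    buf.rewind()
    (buf.getLong(), buf.getInt(), buf.getInt())
  }
}
```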
Experimental Setup
• Hardware
  – Intel Westmere Cluster (A)
    • Up to 9 nodes
    • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 24 GB main memory
    • Mellanox QDR HCAs (32Gbps) + 10GigE
  – TACC Stampede Cluster (B)
    • Up to 17 nodes
    • Intel Sandy Bridge (E5-2680) dual octa-core processors, running at 2.70 GHz, 32 GB main memory
    • Mellanox FDR HCAs (56Gbps)
• Software
  – Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
  – GroupBy Test
Performance Evaluation on Cluster A
• For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
• For 64 cores, up to 20% improvement over IPoIB (32Gbps) and 34% over 10GigE
[Figure: GroupBy time (sec) vs. data size for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps) - 4 worker nodes, 32 cores (32M 32R), 4–10 GB, up to 18% improvement; 8 worker nodes, 64 cores (64M 64R), 8–20 GB, up to 20% improvement]
Performance Evaluation on Cluster B
• For 128 cores, up to 83% improvement over IPoIB (56Gbps)
• For 256 cores, up to 79% improvement over IPoIB (56Gbps)
[Figure: GroupBy time (sec) vs. data size for IPoIB (56Gbps) and RDMA (56Gbps) - 8 worker nodes, 128 cores (128M 128R), 8–20 GB, up to 83% improvement; 16 worker nodes, 256 cores (256M 256R), 16–40 GB, up to 79% improvement]
Performance Analysis on Cluster B
• Java-based micro-benchmark comparison
  – Peak Bandwidth: IPoIB (56Gbps): 1741.46 MBps; RDMA (56Gbps): 5612.55 MBps
• Benefit of RDMA Connection Sharing Design
[Figure: GroupBy time (sec) vs. data size (16–40 GB) for RDMA without vs. with connection sharing (56Gbps)]
• By enabling connection sharing, a 55% performance benefit is achieved for the 40 GB data size on 16 nodes
Conclusion and Future Work
• Three major conclusions
– RDMA and high-performance interconnects can benefit Spark.
– Plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity.
– Spark applications can benefit from an RDMA-enhanced design.
• Future Work
  – Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
  – Make this design publicly available through the HiBD project
Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/