Accelerating Spark with RDMA for Big Data Processing: Early Experiences
HotI 2014
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Digital Data Explosion in the Society
Figure Source: http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode
Big Data Technology - Hadoop
[Figure: Hadoop 1.x stack - MapReduce (cluster resource management & data processing) over HDFS and Hadoop Common/Core (RPC, ...); Hadoop 2.x stack - MapReduce and other models (data processing) over YARN (cluster resource management & job scheduling), HDFS, and Hadoop Common/Core (RPC, ...)]
• Apache Hadoop is one of the most popular Big Data technologies
  – Provides frameworks for large-scale, distributed data storage and processing
  – MapReduce, HDFS, YARN, RPC, etc.
Big Data Technology - Spark
[Figure: Spark master-slave deployment - a Driver (SparkContext) and Master coordinating multiple Workers, with HDFS and Zookeeper; RDD dependency graph contrasting narrow vs. wide dependencies between partitions/data blocks]
• An emerging in-memory processing framework
  – Iterative machine learning
  – Interactive data analytics
  – Scala-based implementation
  – Master-Slave architecture; HDFS, Zookeeper
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Same as Hadoop, sockets-based communication
Common Protocols using Open Fabrics
[Figure: protocol stack options over modern interconnects - kernel-space TCP/IP sockets on 1/10/40 GigE, hardware-offloaded TCP/IP on 10/40 GigE-TOE, IPoIB over InfiniBand, user-space RSockets and SDP over InfiniBand, and RDMA via iWARP, RoCE, and native IB Verbs]
Previous Studies
• Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
  – Hadoop Acceleration with RDMA
    • N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC'14
    • N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC'12
    • M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS'14
    • M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC'13
    • X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP'13
  – HBase Acceleration with RDMA
    • J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12
  – Memcached Acceleration with RDMA
    • J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP'11
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
RDMA for Apache Hadoop (http://hibd.cse.ohio-state.edu)
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for HDFS, MapReduce, and RPC components
  – Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.9 (03/31/14)
  – Based on Apache Hadoop 1.2.1
  – Compliant with Apache Hadoop 1.2.1 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs
• RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!
Performance Benefits – RandomWriter & Sort in SDSC-Gordon
[Figure: Job execution time (sec) of RandomWriter and Sort for 50–300 GB on 16–64 node clusters, IPoIB (QDR) vs. OSU-IB (QDR); annotations show reductions of 20% and 36%]
• RandomWriter in a cluster with a single SSD per node
  – 16% improvement over IPoIB for 50 GB in a cluster of 16 nodes
  – 20% improvement over IPoIB for 300 GB in a cluster of 64 nodes
• Sort in a cluster with a single SSD per node
  – 20% improvement over IPoIB for 50 GB in a cluster of 16 nodes
  – 36% improvement over IPoIB for 300 GB in a cluster of 64 nodes
Can RDMA also benefit Apache Spark on high-performance networks?
Problem Statement
• Is it worth it?
  – Is the performance improvement potential high enough, if we can successfully adapt RDMA to Spark?
  – A few percentage points or orders of magnitude?
• How difficult is it to adapt RDMA to Spark?
  – Can RDMA be adapted to suit the communication needs of Spark?
  – Is it viable to rewrite portions of the Spark code with RDMA?
• Can Spark applications benefit from an RDMA-enhanced design?
  – What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
  – Can an RDMA-based design benefit applications transparently?
Assessment of Performance Improvement Potential
• How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
• Assessment Methodology
  – Evaluation on primitive-level micro-benchmarks (see the sketch after this list)
    • Latency, Bandwidth
    • 1GigE, 10GigE, IPoIB (32Gbps), RDMA (32Gbps)
    • Java/Scala-based environment
  – Evaluation on typical workloads
    • GroupByTest
    • 1GigE, 10GigE, IPoIB (32Gbps)
    • Spark 0.9.1
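To make the methodology concrete, the following is a minimal sketch of a socket-based ping-pong latency micro-benchmark of the kind usable in a Java/Scala environment; the port, message size, and iteration count are illustrative assumptions, not the exact micro-benchmarks used in this study.

```scala
// Minimal sketch of a socket ping-pong latency micro-benchmark (illustrative only).
// Run one JVM with "server <port>" and one with "client <host> <port> <msgSize>".
import java.net.{ServerSocket, Socket}

object PingPongLatency {
  val Iterations = 1000 // illustrative iteration count

  def main(args: Array[String]): Unit = args(0) match {
    case "server" =>
      val srv = new ServerSocket(args(1).toInt)
      val sock = srv.accept()
      val in = sock.getInputStream
      val out = sock.getOutputStream
      val buf = new Array[Byte](4 * 1024 * 1024)
      var n = in.read(buf)
      while (n > 0) { out.write(buf, 0, n); out.flush(); n = in.read(buf) } // echo back
      sock.close(); srv.close()

    case "client" =>
      val sock = new Socket(args(1), args(2).toInt)
      sock.setTcpNoDelay(true)
      val in = sock.getInputStream
      val out = sock.getOutputStream
      val msg = new Array[Byte](args(3).toInt) // message size under test
      val start = System.nanoTime()
      for (_ <- 1 to Iterations) {
        out.write(msg); out.flush()
        var read = 0
        while (read < msg.length) read += in.read(msg, read, msg.length - read) // wait for echo
      }
      val avgUs = (System.nanoTime() - start) / 1000.0 / Iterations / 2 // half round-trip
      println(f"msg=${msg.length} bytes, avg one-way latency = $avgUs%.2f us")
      sock.close()
  }
}
```

A bandwidth test follows the same pattern, streaming large messages in one direction and dividing the bytes moved by the elapsed time.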
Evaluation on Primitive-level Micro-benchmarks
[Figure: latency vs. message size (1 byte – 2 KB in us; 4 KB – 4 MB in ms) and peak bandwidth (MBps) for 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]
• As expected, IPoIB and 10GigE are much better than 1GigE.
• Compared to the other interconnects/protocols, RDMA can significantly
  – reduce the latency for all message sizes
  – improve the peak bandwidth
Can these benefits of high-performance networks be achieved in Apache Spark?
Evaluation on Typical Workloads
• For high-performance networks,
  – The execution time of GroupBy is significantly improved
  – The network throughput is much higher for both send and receive
• Spark can benefit greatly from high-performance interconnects!
[Figure: GroupBy time (sec) for 4–10 GB data sizes and network throughput (MB/s) sampled over time in the send and receive directions, for 1GigE, 10GigE, and IPoIB (32Gbps); annotations show improvements of 65% and 22%]
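For reference, the GroupBy-style workload used above can be expressed in a few lines of Spark. The sketch below is modeled loosely on Spark's bundled GroupByTest example; the mapper/reducer counts, record sizes, and local master setting are illustrative, not the exact parameters of this evaluation.

```scala
// Simplified sketch of a GroupBy-style shuffle workload (modeled on Spark's GroupByTest example).
// Sizes and counts are illustrative, not the exact parameters of this evaluation.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits needed on older Spark (e.g. 0.9.x)
import scala.util.Random

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GroupBySketch").setMaster("local[4]") // local mode for illustration
    val sc = new SparkContext(conf)
    val numMappers  = 32      // number of map tasks
    val numKVPairs  = 100000  // key/value pairs generated per mapper
    val valSize     = 1000    // bytes per value
    val numReducers = 32      // number of shuffle (reduce) partitions

    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new Random
      Array.fill(numKVPairs)((rand.nextInt(Int.MaxValue), new Array[Byte](valSize)))
    }.cache()
    pairs.count() // materialize the input before timing the shuffle

    val start = System.nanoTime()
    val groups = pairs.groupByKey(numReducers).count() // the all-to-all shuffle is exercised here
    println(s"groups=$groups, groupBy time=${(System.nanoTime() - start) / 1e9} s")
    sc.stop()
  }
}
```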
Can RDMA further improve Spark performance compared with IPoIB and 10GigE?
Architecture Overview
• Design Goals
  – High performance
  – Keep the existing Spark architecture and interfaces intact
  – Minimal code changes
[Figure: Spark shuffle architecture - Spark applications (Scala/Java/Python) run tasks whose BlockManager/BlockFetcherIterator use either the default Java NIO shuffle server/fetcher, the optional Netty shuffle server/fetcher, or the proposed RDMA shuffle server/fetcher plug-ins; the plug-ins sit on an RDMA-based Shuffle Engine (Java/JNI) over native InfiniBand (QDR/FDR), while the defaults use Java sockets over 1/10 GigE or IPoIB]
• Approaches
  – Plug-in based approach to extend the shuffle framework in Spark (see the sketch below)
    • RDMA Shuffle Server + RDMA Shuffle Fetcher
    • 100 lines of code changes inside the original Spark files
  – RDMA-based Shuffle Engine
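Conceptually, the plug-in approach means the shuffle server/fetcher implementation is selected behind a small interface. The trait, class, and property names in the sketch below are hypothetical simplifications for illustration only; they are not Spark's internal shuffle classes or configuration keys.

```scala
// Hypothetical, simplified view of a pluggable shuffle fetcher chosen by configuration.
// Names here are illustrative; they are not Spark's internal API.
import java.nio.ByteBuffer

trait ShuffleFetcherPlugin {
  /** Fetch the shuffle blocks for one reduce partition from the given map outputs. */
  def fetchBlocks(blockIds: Seq[String]): Iterator[(String, ByteBuffer)]
}

class NioShuffleFetcher extends ShuffleFetcherPlugin {   // default path: Java NIO sockets
  def fetchBlocks(blockIds: Seq[String]): Iterator[(String, ByteBuffer)] = Iterator.empty // stub
}

class RdmaShuffleFetcher extends ShuffleFetcherPlugin {  // plug-in path: RDMA-based Shuffle Engine (JNI)
  def fetchBlocks(blockIds: Seq[String]): Iterator[(String, ByteBuffer)] = Iterator.empty // stub
}

object ShuffleFetcherPlugin {
  /** Choose a fetcher implementation from a (hypothetical) configuration property. */
  def forConfig(props: java.util.Properties): ShuffleFetcherPlugin =
    props.getProperty("spark.shuffle.fetcher.plugin", "nio") match {
      case "rdma" => new RdmaShuffleFetcher
      case _      => new NioShuffleFetcher
    }
}
```

Keeping the selection behind one small interface is what limits the changes to the original Spark files to a handful of lines.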
SEDA-based Data Shuffle Plug-ins
• SEDA - Staged Event-Driven Architecture
• A set of stages connected by queues
• A dedicated thread pool is in charge of processing the events on the corresponding queue
  – Listener, Receiver, Handlers, Responders
• Admission control is performed on these event queues
• High throughput by maximally overlapping the different processing stages while maintaining the default task-level parallelism in Spark (see the sketch below)
[Figure: SEDA pipeline inside the RDMA-based Shuffle Engine - a Listener accepts connections, a Receiver receives requests and enqueues them on the request queue, Handlers dequeue requests and read block data, and a Responder dequeues from the response queue and sends responses; connections come from a shared connection pool]
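A minimal sketch of the SEDA structure - stages connected by bounded queues, each drained by its own thread pool - might look like the following; the types, queue capacities, and thread counts are illustrative assumptions rather than the engine's actual implementation.

```scala
// Minimal sketch of two SEDA stages connected by bounded queues, each drained by its own
// thread pool. Types, queue capacities, and thread counts are illustrative assumptions.
import java.util.concurrent.{ArrayBlockingQueue, Executors}

case class Request(blockId: String)
case class Response(blockId: String, data: Array[Byte])

object SedaShuffleSketch {
  // Bounded queues give a simple form of admission control between stages.
  val requestQueue  = new ArrayBlockingQueue[Request](1024)
  val responseQueue = new ArrayBlockingQueue[Response](1024)

  val handlers   = Executors.newFixedThreadPool(4) // handler stage: read block data
  val responders = Executors.newFixedThreadPool(2) // responder stage: send responses

  def start(): Unit = {
    for (_ <- 1 to 4) handlers.submit(new Runnable {
      def run(): Unit = while (true) {
        val req = requestQueue.take()                 // blocks until a request is enqueued
        val data = readBlock(req.blockId)             // overlaps with the other stages
        responseQueue.put(Response(req.blockId, data))
      }
    })
    for (_ <- 1 to 2) responders.submit(new Runnable {
      def run(): Unit = while (true) sendResponse(responseQueue.take())
    })
  }

  // Stubs standing in for the real block-read and RDMA-send paths.
  def readBlock(blockId: String): Array[Byte] = new Array[Byte](0)
  def sendResponse(rsp: Response): Unit = ()
}
```

Because each stage has its own threads and queue, slow block reads overlap with request reception and response sending instead of serializing behind them.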
RDMA-based Shuffle Engine: Connection Management
• Alternative designs
  – Pre-connection
    • Hides the overhead in the initialization stage
    • Before the actual communication, pre-connects processes to each other
    • Sacrifices more resources to keep all of these connections alive
  – Dynamic Connection Establishment and Sharing
    • A connection is established if and only if an actual data exchange is going to take place
    • A naive dynamic connection design is not optimal, because resources must be allocated for every data block transfer
    • An advanced dynamic connection scheme reduces the number of connection establishments
    • Spark uses a multi-threading approach to support multi-task execution in a single JVM → a good chance to share connections!
    • How long should a connection be kept alive for possible re-use?
    • Timeout mechanism for connection destruction (see the sketch below)
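A rough sketch of dynamic connection establishment with sharing and an idle timeout is shown below; the Connection class, endpoint key, and timeout value are placeholders rather than the engine's actual data structures.

```scala
// Rough sketch of dynamic connection establishment with sharing and an idle timeout.
// The Connection type, endpoint key, and timeout value are illustrative placeholders.
import java.util.concurrent.ConcurrentHashMap

class Connection(val endpoint: String) {
  @volatile var lastUsedMs: Long = System.currentTimeMillis()
  def close(): Unit = () // stub for tearing down the underlying RDMA connection
}

class ConnectionPool(idleTimeoutMs: Long = 30000) {
  private val conns = new ConcurrentHashMap[String, Connection]()

  /** Establish a connection only when a data exchange is about to happen; share it afterwards. */
  def get(endpoint: String): Connection = {
    var c = conns.get(endpoint)
    if (c == null) {
      val fresh = new Connection(endpoint)
      val prev = conns.putIfAbsent(endpoint, fresh) // many tasks in the JVM share one connection
      c = if (prev != null) prev else fresh
    }
    c.lastUsedMs = System.currentTimeMillis()
    c
  }

  /** Periodically destroy connections that have been idle longer than the timeout. */
  def reapIdle(): Unit = {
    val now = System.currentTimeMillis()
    val it = conns.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      if (now - e.getValue.lastUsedMs > idleTimeoutMs) { e.getValue.close(); it.remove() }
    }
  }
}
```

Sharing one connection per endpoint amortizes setup cost across the many concurrent tasks in a single JVM, while the timeout bounds the resources held for idle peers.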
RDMA-based Shuffle Engine: Data Transfer
• Each connection is used by multiple tasks (threads) to transfer data concurrently
• Packets over the same communication lane go to different entities on both the server and fetcher sides
• Alternative designs
  – Sequential transfers of complete blocks over a communication lane → keeps the order
    • Causes long wait times for tasks that are ready to transfer data over the same connection
  – Non-blocking and out-of-order data transfer (see the sketch below)
    • Chunking data blocks
    • Non-blocking sending over shared connections
    • Out-of-order packet communication
    • Guarantees both performance and ordering
    • Works efficiently with the dynamic connection management and sharing mechanism
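The chunking and reordering idea can be sketched as follows: blocks are split into chunks tagged with a request ID and sequence number, chunks from different tasks may interleave and arrive out of order on a shared connection, and the fetcher reassembles each block by sequence number. The field names and chunk size below are illustrative; the reassembly map is a single-threaded sketch.

```scala
// Sketch of chunked, out-of-order block transfer over a shared connection. Chunks carry a
// request ID and a sequence number so the receiver can reassemble blocks per request even
// when packets from different tasks interleave. Names and sizes are illustrative.
import scala.collection.mutable

case class Chunk(requestId: Long, seqNo: Int, totalChunks: Int, payload: Array[Byte])

object ChunkedTransfer {
  val ChunkSize = 64 * 1024 // illustrative chunk size

  /** Split one shuffle block into chunks that can be sent non-blocking over a shared lane. */
  def split(requestId: Long, block: Array[Byte]): Seq[Chunk] = {
    val total = math.max(1, (block.length + ChunkSize - 1) / ChunkSize)
    (0 until total).map { i =>
      Chunk(requestId, i, total,
        block.slice(i * ChunkSize, math.min((i + 1) * ChunkSize, block.length)))
    }
  }

  // Reassembly state on the fetcher side (single-threaded sketch): chunks may arrive
  // out of order and interleaved with chunks belonging to other requests.
  private val pending = mutable.Map[Long, mutable.Map[Int, Array[Byte]]]()

  /** Returns the complete block once all of its chunks have arrived, otherwise None. */
  def onChunkArrived(c: Chunk): Option[Array[Byte]] = {
    val parts = pending.getOrElseUpdate(c.requestId, mutable.Map())
    parts(c.seqNo) = c.payload
    if (parts.size == c.totalChunks) {
      pending.remove(c.requestId)
      Some((0 until c.totalChunks).flatMap(parts(_)).toArray) // order restored by sequence number
    } else None
  }
}
```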
RDMA-based Shuffle Engine: Buffer Management
• On-JVM-Heap vs. Off-JVM-Heap buffer management
• Off-JVM-Heap
  – High performance through native I/O
  – Shadow buffers in Java/Scala
  – Registered for RDMA communication
• Flexibility for upper-layer design choices (see the sketch below)
  – Supports the connection sharing mechanism → request ID
  – Supports in-order packet processing → sequence number
  – Supports non-blocking send → buffer flag + callback
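Off-JVM-heap buffers can be pictured as direct ByteBuffers whose small header carries the request ID, sequence number, and a flag for the non-blocking send callback; the layout below is an illustrative assumption, and RDMA memory registration itself would happen on the native (JNI) side.

```scala
// Sketch of off-JVM-heap (direct) buffers acting as shadow buffers for the native engine.
// The header layout (request ID, sequence number, flag) is an illustrative assumption;
// RDMA memory registration itself would be done by the native (JNI) side.
import java.nio.ByteBuffer

object OffHeapBuffers {
  val BufferSize  = 256 * 1024 // illustrative buffer size
  val HeaderBytes = 8 + 4 + 4  // requestId (Long) + seqNo (Int) + flag (Int)

  /** Allocate a direct buffer outside the JVM heap so native I/O avoids an extra copy. */
  def allocate(): ByteBuffer = ByteBuffer.allocateDirect(BufferSize)

  /** Write the small header that lets shared connections demultiplex and order packets. */
  def writeHeader(buf: ByteBuffer, requestId: Long, seqNo: Int, flag: Int): ByteBuffer = {
    buf.clear()
    buf.putLong(requestId) // identifies which fetch request this packet belongs to
    buf.putInt(seqNo)      // restores ordering under out-of-order delivery
    buf.putInt(flag)       // e.g. last-chunk marker used by the non-blocking send callback
    buf
  }

  def readHeader(buf: ByteBuffer): (Long, Int, Int) = {
    buf.rewind()
    (buf.getLong(), buf.getInt(), buf.getInt())
  }
}
```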
Experimental Setup
• Hardware
  – Intel Westmere Cluster (A)
    • Up to 9 nodes
    • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 24 GB main memory
    • Mellanox QDR HCAs (32Gbps) + 10GigE
  – TACC Stampede Cluster (B)
    • Up to 17 nodes
    • Intel Sandy Bridge (E5-2680) dual octa-core processors, running at 2.70 GHz, 32 GB main memory
    • Mellanox FDR HCAs (56Gbps)
• Software
  – Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
  – GroupBy Test
Performance Evaluation on Cluster A
• For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
• For 64 cores, up to 20% improvement over IPoIB (32Gbps) and 34% over 10GigE
[Figure: GroupBy time (sec) vs. data size for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps) - 4 worker nodes, 32 cores (32M 32R), 4–10 GB, up to 18% improvement; 8 worker nodes, 64 cores (64M 64R), 8–20 GB, up to 20% improvement]
Performance Evaluation on Cluster B
• For 128 cores, up to 83% improvement over IPoIB (56Gbps)
• For 256 cores, up to 79% improvement over IPoIB (56Gbps)
[Figure: GroupBy time (sec) vs. data size for IPoIB (56Gbps) and RDMA (56Gbps) - 8 worker nodes, 128 cores (128M 128R), 8–20 GB, up to 83% improvement; 16 worker nodes, 256 cores (256M 256R), 16–40 GB, up to 79% improvement]
Performance Analysis on Cluster B
• Java-based micro-benchmark comparison
  – Peak Bandwidth: IPoIB (56Gbps): 1741.46 MBps; RDMA (56Gbps): 5612.55 MBps
• Benefit of RDMA Connection Sharing Design
[Figure: GroupBy time (sec) vs. data size (16–40 GB) for RDMA without vs. with connection sharing (56Gbps)]
• By enabling connection sharing, a 55% performance benefit is achieved for the 40 GB data size on 16 nodes
Conclusion and Future Work
• Three major conclusions
– RDMA and high-performance interconnects can benefit Spark.
– Plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity.
– Spark applications can benefit from an RDMA-enhanced design.
• Future Work
  – Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
  – Make this design publicly available through the HiBD project
Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/