High Performance RDMA-based Design of HDFS over InfiniBand
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy* & D. K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
*IBM TJ Watson Research Center, Yorktown Heights, NY, USA
SC 2012
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Introduction
• Big Data: provides groundbreaking opportunities for
enterprise information management and decision
making
• The rate of information growth appears to be
exceeding Moore’s Law
• The amount of data is exploding; companies are
capturing and digitizing more information than ever
• 35 zettabytes of data will be generated and
consumed by the end of this decade
Introduction
[Figure: graph of 1026 Twitter users whose tweets contain ‘big data’, collected over a period of 8 hours and 24 minutes (https://nodexlgraphgallery.org/Pages/Graph.aspx?graphID=1266)]
Big Data Technology
• Apache Hadoop is a popular Big Data technology
– Provides a framework for large-scale, distributed data storage and processing
• Hadoop is an open-source implementation of the MapReduce programming model
• The Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/) is the underlying file system of Hadoop and of the Hadoop database, HBase
[Diagram: the Hadoop framework, with MapReduce and HBase on top of HDFS]
Hadoop Distributed File System (HDFS)
• Adopted by many reputed organizations, e.g., Facebook and Yahoo!
• Highly reliable and fault-tolerant through replication
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform independence and portability
• Uses Java sockets for communication
[Diagram: HDFS architecture]
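For reference, writing a file through the unmodified user-level HDFS API looks like the sketch below (standard Hadoop 0.20-era Java API; the path and sizes are illustrative). The client sees only a byte stream; block placement and replication are handled underneath by the NameNode and DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // contacts the NameNode
        FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"));
        byte[] buf = new byte[64 * 1024];
        for (int i = 0; i < 1024; i++) {               // 64 MB total, one default-sized HDFS block
            out.write(buf);                            // data is streamed to DataNodes in packets
        }                                              // and replicated (default factor 3)
        out.close();
        fs.close();
    }
}
```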
Modern High Performance Interconnects (Socket Interface)
[Diagram: sockets-based protocol stacks for applications/middleware: kernel-space TCP/IP over 1/10 GigE adapters and Ethernet switches, hardware-offloaded TCP/IP over 10 GigE-TOE adapters and switches, and IPoIB over InfiniBand adapters and switches]
• Cloud Computing systems are being widely used on High Performance Computing (HPC) clusters
• Hadoop middleware components do not leverage HPC cluster features for communication
• Commodity high performance networks like InfiniBand can provide low-latency and high-throughput data transmission
• For data-intensive applications, network performance becomes a key component of HDFS performance
HDFS Block Write Time
• HDFS block write time:
– 1097 ms for 1GigE
– 448 ms for IPoIB (QDR) (TCP/IP emulation over IB)
• IPoIB (QDR) improves the write time by 59%
[Chart: HDFS block write time (ms) for 1GigE vs. IPoIB (QDR); 4 DataNodes, each with a single HDD; block size = 64MB, replication factor = 3; 1097 ms vs. 448 ms, a 59% reduction]
• The byte-stream nature of TCP/IP communication requires multiple data copies
• Can we improve further?
Modern High Performance Interconnects
[Diagram: protocol stacks for applications/middleware: kernel-space TCP/IP over 1/10 GigE adapters and Ethernet switches, hardware-offloaded TCP/IP over 10 GigE-TOE adapters and switches, IPoIB over InfiniBand, and user-space IB Verbs with RDMA over InfiniBand adapters and switches]
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future work
Problem Statement
• How does the network performance impact the overall
HDFS performance?
• Can we re-design HDFS to take advantage of high
performance networks and exploit advanced features
such as RDMA?
• What will be the performance improvement of HDFS with
the RDMA-based design over InfiniBand?
• Can we observe the performance improvement for other
cloud computing middleware such as HBase?
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
HDFS Hybrid Design Overview
• JNI layer bridges the Java-based HDFS with a communication library written in native code
• Only the communication part of HDFS Write is modified; no change in the HDFS architecture
[Diagram: hybrid design. Applications run on HDFS; the Write path goes through the Java Native Interface (JNI) to UCR (the native communication library) and IB Verbs over InfiniBand, while other operations use the Java socket interface over 1/10 GigE or IPoIB networks]
• HDFS Write involves replication and is therefore more network intensive; HDFS Read is mostly node-local
• Enables high performance RDMA communication while supporting the traditional socket interface
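A rough sketch of what the JNI bridge could look like is shown below; the class and method names are hypothetical and only illustrate the idea of exposing native end-points to the Java write path, not the actual interface.

```java
// Hypothetical JNI bridge between the Java-based HDFS write path and the native
// UCR library; names are illustrative only, not the real implementation.
public class UCRBridge {
    static {
        System.loadLibrary("ucr_hdfs");   // native library built over IB Verbs (assumed name)
    }

    // Create an RDMA end-point/connection to a DataNode (analogous to a socket).
    public native long connect(String host, int port);

    // Send one HDFS packet buffer over the RDMA connection.
    public native int sendPacket(long endpoint, byte[] packet, int offset, int length);

    // Tear down the end-point.
    public native void close(long endpoint);
}
```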
Unified Communication Runtime (UCR)
• Light-weight, high performance communication runtime
• Design of UCR evolved from MVAPICH/MVAPICH2 software
stacks (http://mvapich.cse.ohio-state.edu/)
• Communications based on endpoints, analogous to sockets
• Designed as a native library to extract high performance from
advanced network technologies
• Enhanced APIs to support data center middleware such as HBase, Memcached, etc.
Design Challenges
• Socket-based HDFS
– New socket connection for each block
– New receiver thread per block in the DataNode
– Multiple data copies
• RDMA-based HDFS
– Creating a new UCR connection per block is expensive (see the connection-reuse sketch below)
– Reduce data-copy overhead
– Keep a low memory footprint
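One way to avoid paying the connection cost per block is to cache and reuse end-points per DataNode. The sketch below is purely illustrative (it reuses the hypothetical UCRBridge from the earlier sketch) and is not the actual UCR code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative end-point cache: one RDMA connection per DataNode, reused across blocks.
class EndpointCache {
    private final UCRBridge bridge = new UCRBridge();
    private final Map<String, Long> endpoints = new HashMap<String, Long>();

    // Return an existing end-point for this DataNode, or create one on first use.
    synchronized long get(String dataNodeHost, int port) {
        String key = dataNodeHost + ":" + port;
        Long ep = endpoints.get(key);
        if (ep == null) {
            ep = bridge.connect(dataNodeHost, port);
            endpoints.put(key, ep);
        }
        return ep;
    }

    // Close all cached end-points, e.g., when the DFS client shuts down.
    synchronized void closeAll() {
        for (long ep : endpoints.values()) {
            bridge.close(ep);
        }
        endpoints.clear();
    }
}
```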
Components and Communication Flow
[Diagram: components and communication flow between the RDMA-enabled DFSClient and the DataNode. Client side: the RDMADataStreamer takes packets from the DataQueue and sends them through JNI, UCR and IB Verbs, while the RDMAResponseProcessor drains the AckQueue as acknowledgements arrive. DataNode side: the RDMADataXceiverServer creates an end-point per connection for an RDMADataXceiver; the RDMABlockXceiver receives packets and the RDMAPacketResponder returns acknowledgements, with a matching AckQueue on the DataNode]
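The client-side components follow the standard HDFS DataStreamer/ResponseProcessor pattern: the writer enqueues packets on the DataQueue, a streamer thread sends them, and sent packets wait on the AckQueue until acknowledged. A much simplified, illustrative sketch of that pattern (again reusing the hypothetical UCRBridge) is shown below.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Simplified, illustrative sketch of the streamer side of the DataQueue/AckQueue pattern.
class PacketStreamer implements Runnable {
    private final LinkedBlockingQueue<byte[]> dataQueue;   // packets produced by the writer
    private final LinkedBlockingQueue<byte[]> ackQueue;    // packets awaiting acknowledgement
    private final UCRBridge bridge;                        // hypothetical JNI bridge
    private final long endpoint;                           // cached RDMA connection to the DataNode

    PacketStreamer(LinkedBlockingQueue<byte[]> dataQueue,
                   LinkedBlockingQueue<byte[]> ackQueue,
                   UCRBridge bridge, long endpoint) {
        this.dataQueue = dataQueue;
        this.ackQueue = ackQueue;
        this.bridge = bridge;
        this.endpoint = endpoint;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] packet = dataQueue.take();                      // next packet from the writer
                bridge.sendPacket(endpoint, packet, 0, packet.length); // send over RDMA
                ackQueue.put(packet);   // packet stays here until the response processor
                                        // removes it upon acknowledgement from the pipeline
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                        // shut down cleanly
        }
    }
}
```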
HDFS-JNI Interaction
• The new design does not change the user-level APIs
• The communication part of the HDFS write API is modified, keeping the socket-based design intact
• A new configuration parameter, dfs.ib.enabled, is added to select the communication protocol (see the snippet below):
– dfs.ib.enabled = true -> writes go over RDMA
– dfs.ib.enabled = false -> writes go over sockets
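Selecting the protocol then only requires setting this one parameter, either in hdfs-site.xml or programmatically. The snippet below uses the standard Hadoop Configuration API; the parameter name dfs.ib.enabled is taken from this design, and everything else is an ordinary, unchanged HDFS write.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RdmaWriteToggle {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // New parameter from this design; can equally be set once in hdfs-site.xml.
        conf.setBoolean("dfs.ib.enabled", true);   // true: write over RDMA, false: write over sockets
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/rdma-write.dat"));
        out.write(new byte[4 * 1024]);             // user-level write API is unchanged
        out.close();
        fs.close();
    }
}
```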
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Experimental Setup
• Hardware
– Intel Clovertown (Cluster A)
• Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk
• Network: 1GigE, IPoIB, 10GigE TOE and IB-DDR (16Gbps)
– Intel Westmere (Cluster B)
• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk
• 4 storage nodes with two 1 TB HDDs per node, 24 GB RAM
• 4 storage nodes with a 300 GB OCZ VeloDrive PCIe SSD
• Network: 1GigE, IPoIB and IB-QDR (32Gbps)
• Software
– Hadoop 0.20.2, HBase 0.90.3 and Sun Java SDK 1.7
– Yahoo! Cloud Serving Benchmark (YCSB)
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
– Determining the optimal packet-size for different
interconnects/protocols
– Micro-benchmark level evaluations (measures the latency of file write)
– Evaluations with TestDFSIO (measures the throughput of sequential file
write)
– Integration with HBase (TCP/IP) (measures latency and throughput of
HBase put operation)
• Conclusion & Future Work
HDFS Optimal Packet-Size Evaluation
• For both clusters (HDD and SSD), the optimal packet sizes are as follows (a tuning sketch appears after the charts below):
– Optimal packet-size for 1/10GigE: 64KB
– Optimal packet-size for IPoIB: 128KB
– Optimal packet-size for OSU-IB: 512KB
[Charts: file write time (s) vs. packet-size (64KB to 1MB) for file sizes of 1GB to 5GB, shown for 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
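In stock Hadoop 0.20.x the client-side packet size is read from dfs.write.packet.size (default 64 KB); the RDMA path may expose its own knob, so treat the sketch below purely as an illustration of applying the per-interconnect optima reported above.

```java
import org.apache.hadoop.conf.Configuration;

public class PacketSizeTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        boolean useRdma  = conf.getBoolean("dfs.ib.enabled", false); // parameter from this design
        boolean useIpoib = false;   // set according to the deployed network (assumption)

        int packetSize;
        if (useRdma) {
            packetSize = 512 * 1024;        // OSU-IB optimum: 512 KB
        } else if (useIpoib) {
            packetSize = 128 * 1024;        // IPoIB optimum: 128 KB
        } else {
            packetSize = 64 * 1024;         // 1/10 GigE optimum: 64 KB (stock Hadoop default)
        }
        // dfs.write.packet.size is the stock Hadoop 0.20.x client packet-size parameter.
        conf.setInt("dfs.write.packet.size", packetSize);
        System.out.println("dfs.write.packet.size = " + conf.getInt("dfs.write.packet.size", 0));
    }
}
```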
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
– Determining the optimal packet-size for different
interconnects/protocols
– Micro-benchmark level evaluations (measures the latency of file write)
– Evaluations with TestDFSIO (measures the throughput of sequential file
write)
– Integration with HBase (TCP/IP) (measures latency and throughput of
HBase put operation)
• Conclusion & Future Work
Evaluations using Micro-benchmark
(DataNode Storage: HDD)
• Cluster A with 4 DataNodes
– 14% improvement over 10GigE for 5 GB file size
– 20% improvement over IPoIB (16Gbps) for 5GB file size
• Cluster B with 32 DataNodes
– 16% improvement over IPoIB (32Gbps) for 8GB file size
[Charts: file write time (s) vs. file size. Two panels: Cluster A with 4 DataNodes (1GB to 5GB files; 1GigE, IPoIB (DDR), 10GigE, OSU-IB (DDR)) and Cluster B with 32 DataNodes (2GB to 10GB files; 1GigE, IPoIB (QDR), OSU-IB (QDR))]
Evaluations using Micro-benchmark
(DataNode Storage: SSD)
• Cluster A with 4 DataNodes
– 16% improvement over 10GigE for 5GB file size
– 25% improvement over IPoIB (16Gbps) for 5GB file size
• Cluster B with 4 DataNodes
– 25% improvement over IPoIB (32Gbps) for 10GB file size
[Charts: file write time (s) vs. file size. Two panels: Cluster A with 4 DataNodes (1GB to 5GB files; 1GigE, IPoIB (DDR), 10GigE, OSU-IB (DDR)) and Cluster B with 4 DataNodes (2GB to 10GB files; 1GigE, IPoIB (QDR), OSU-IB (QDR))]
Communication Times in HDFS
• Cluster B with 32 HDD DataNodes
– 30% improvement in communication time over IPoIB (32Gbps)
– 87% improvement in communication time over 1GigE
• Similar improvements are obtained for SSD DataNodes
[Chart: communication time (s) vs. file size (2GB to 10GB) on Cluster B, comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
– Determining the optimal packet-size for different
interconnects/protocols
– Micro-benchmark level evaluations (measures the latency of file write)
– Evaluations with TestDFSIO (measures the throughput of sequential file
write)
– Integration with HBase (TCP/IP) (measures latency and throughput of
HBase put operation)
• Conclusion & Future Work
Evaluations using TestDFSIO
• Cluster B with 32 HDD DataNodes
– 13.5% improvement over IPoIB (32Gbps) for 8GB file size
• Cluster B with 4 SSD DataNodes
– 15% improvement over IPoIB (32Gbps) for 8GB file size
[Charts: TestDFSIO throughput (MegaBytes/s) vs. file size (2GB to 10GB) on Cluster B, comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR). Two panels: 32 HDD DataNodes and 4 SSD DataNodes]
Evaluations using TestDFSIO
• Cluster B with 4 DataNodes, 2 HDD per node
– 16.2% improvement over IPoIB (32Gbps) for 10GB file size
• Cluster B with 4 DataNodes, 1 HDD per node
– 10% improvement over IPoIB (32Gbps) for 10GB file size
• 2 HDD vs 1 HDD
– 2.01x improvement for OSU-IB (32Gbps)
– 1.8x improvement for IPoIB (32Gbps)
[Chart: average throughput (MegaBytes/s) vs. file size (2GB to 10GB) on Cluster B with 4 storage nodes, comparing 1GigE, IPoIB and OSU-IB, each with 1 HDD and 2 HDDs per node]
Evaluations using TestDFSIO
[Chart: average throughput (MegaBytes/s) vs. file size (2GB to 10GB), comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
• Cluster B with 4 DataNodes, 1 SSD per node
– 28% improvement over IPoIB (32Gbps) for 10GB file size
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
– Determining the optimal packet-size for different
interconnects/protocols
– Micro-benchmark level evaluations (measures the latency of file write)
– Evaluations with TestDFSIO (measures the throughput of sequential file
write)
– Integration with HBase (TCP/IP) (measures latency and throughput of
HBase put operation)
• Conclusion & Future Work
Evaluations using YCSB
(Single Region Server: 100% Update)
• HBase Put latency for 360K records
– 204 us for OSU Design; 252 us for IPoIB (32Gbps)
• HBase Put throughput for 360K records
– 4.42 Kops/sec for OSU Design; 3.63 Kops/sec for IPoIB (32Gbps)
• 20% improvement in both average latency and throughput
[Charts (Cluster B): HBase Put latency (us) and Put throughput vs. number of records (120K to 480K, record size = 1KB), comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
Evaluations using YCSB
(32 Region Servers: 100% Update)
[Charts (Cluster B): HBase Put latency (us) and Put throughput (Kops/sec) vs. number of records (240K to 960K, record size = 1KB), comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
• HBase Put latency for 960K records
– 195 us for OSU Design; 273 us for IPoIB (32Gbps)
• HBase Put throughput for 960K records
– 4.60 Kops/sec for OSU Design; 3.45 Kops/sec for IPoIB (32Gbps)
• 29% improvement in average latency; 33% improvement in throughput
Outline
• Introduction and Motivation
• Problem Statement
• Design using Hybrid Transports
• Performance Evaluation
• Conclusion & Future Work
Conclusion
• Detailed Profiling and Analysis of HDFS
• RDMA-based Design of HDFS over InfiniBand
– First design of HDFS over the InfiniBand network
• Comprehensive Evaluation of the RDMA-based
design of HDFS
• Integration with HBase leads to performance improvement of the HBase Put operation
Future Work
• Identify architectural bottlenecks in higher-level HDFS designs and propose enhancements to work with high performance communication schemes
• Investigate faster recovery from DataNode failures
• Implement other HDFS operations (e.g., HDFS Read) over RDMA
• Integrate with other Hadoop components designed over InfiniBand
Thank You!
{islamn, rahmanmd, jose, rajachan, wangh, subramon, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page http://mvapich.cse.ohio-state.edu/