High Performance RDMA-based Design of HDFS over InfiniBand
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy* & D. K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
*IBM TJ Watson Research Center, Yorktown Heights, NY, USA
SC 2012
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Introduction
• Big Data: provides groundbreaking opportunities for
enterprise information management and decision
making
• The rate of information growth appears to be
exceeding Moore’s Law
• The amount of data is exploding; companies are
capturing and digitizing more information than ever
• 35 zettabytes of data will be generated and
consumed by the end of this decade
Introduction
[Figure: graph of 1026 Twitter users whose tweets contain ‘big data’, collected over a period of 8 hours and 24 minutes (https://nodexlgraphgallery.org/Pages/Graph.aspx?graphID=1266)]
Big Data Technology
• Apache Hadoop is a popular Big Data technology
– Provides a framework for large-scale, distributed data storage and processing
• Hadoop is an open-source implementation of the MapReduce programming model
• The Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/) is the underlying file system of Hadoop and of the Hadoop database, HBase
[Diagram: the Hadoop framework, with MapReduce and HBase on top of HDFS]
Hadoop Distributed File System (HDFS)
• Adopted by many reputed organizations, e.g., Facebook and Yahoo!
• Highly reliable and fault-tolerant through replication
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform independence and portability
• Uses Java sockets for communication
[Diagram: HDFS architecture]
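For reference, writing a file through the unmodified user-level HDFS API looks like the sketch below (standard Hadoop 0.20-era Java API; the path and sizes are illustrative). The client sees only a byte stream; block placement and replication are handled underneath by the NameNode and DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // contacts the NameNode
        FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"));
        byte[] buf = new byte[64 * 1024];
        for (int i = 0; i < 1024; i++) {               // 64 MB total, one default-sized HDFS block
            out.write(buf);                            // data is streamed to DataNodes in packets
        }                                              // and replicated (default factor 3)
        out.close();
        fs.close();
    }
}
```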
Modern High Performance Interconnects (Socket Interface)
[Diagram: sockets-based protocol stacks for applications/middleware: kernel-space TCP/IP over 1/10 GigE adapters and Ethernet switches, hardware-offloaded TCP/IP over 10 GigE-TOE adapters and switches, and IPoIB over InfiniBand adapters and switches]
• Cloud Computing systems are being widely used on High Performance Computing (HPC) clusters
• Hadoop middleware components do not leverage HPC cluster features for communication
• Commodity high performance networks like InfiniBand can provide low-latency and high-throughput data transmission
• For data-intensive applications, network performance becomes a key component of HDFS performance
HDFS Block Write Time
• HDFS block write time:
– 1097 ms for 1GigE
– 448 ms for IPoIB (QDR) (TCP/IP emulation over IB)
• IPoIB (QDR) improves the write time by 59%
[Chart: HDFS block write time (ms) for 1GigE vs. IPoIB (QDR); 4 DataNodes, each with a single HDD; block size = 64MB, replication factor = 3; 1097 ms vs. 448 ms, a 59% reduction]
• The byte-stream nature of TCP/IP communication requires multiple data copies
• Can we improve further?
Modern High Performance Interconnects
[Diagram: protocol stacks for applications/middleware: kernel-space TCP/IP over 1/10 GigE adapters and Ethernet switches, hardware-offloaded TCP/IP over 10 GigE-TOE adapters and switches, IPoIB over InfiniBand, and user-space IB Verbs with RDMA over InfiniBand adapters and switches]
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future work
Problem Statement
• How does the network performance impact the overall
HDFS performance?
• Can we re-design HDFS to take advantage of high
performance networks and exploit advanced features
such as RDMA?
• What will be the performance improvement of HDFS with
the RDMA-based design over InfiniBand?
• Can we observe the performance improvement for other
cloud computing middleware such as HBase?
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
HDFS Hybrid Design Overview
• JNI layer bridges the Java-based HDFS with a communication library written in native code
• Only the communication part of HDFS Write is modified; no change in the HDFS architecture
[Diagram: hybrid design. Applications run on HDFS; the Write path goes through the Java Native Interface (JNI) to UCR (the native communication library) and IB Verbs over InfiniBand, while other operations use the Java socket interface over 1/10 GigE or IPoIB networks]
• HDFS Write involves replication and is therefore more network intensive; HDFS Read is mostly node-local
• Enables high performance RDMA communication while supporting the traditional socket interface
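A rough sketch of what the JNI bridge could look like is shown below; the class and method names are hypothetical and only illustrate the idea of exposing native end-points to the Java write path, not the actual interface.

```java
// Hypothetical JNI bridge between the Java-based HDFS write path and the native
// UCR library; names are illustrative only, not the real implementation.
public class UCRBridge {
    static {
        System.loadLibrary("ucr_hdfs");   // native library built over IB Verbs (assumed name)
    }

    // Create an RDMA end-point/connection to a DataNode (analogous to a socket).
    public native long connect(String host, int port);

    // Send one HDFS packet buffer over the RDMA connection.
    public native int sendPacket(long endpoint, byte[] packet, int offset, int length);

    // Tear down the end-point.
    public native void close(long endpoint);
}
```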
Unified Communication Runtime (UCR)
• Light-weight, high performance communication runtime
• Design of UCR evolved from MVAPICH/MVAPICH2 software
stacks (http://mvapich.cse.ohio-state.edu/)
• Communications based on endpoints, analogous to sockets
• Designed as a native library to extract high performance from
advanced network technologies
• Enhanced APIs to support data center middleware such as HBase, Memcached, etc.
Design Challenges
• Socket-based HDFS
– New socket connection for each block
– New receiver thread per block in the DataNode
– Multiple data copies
• RDMA-based HDFS
– Creating a new UCR connection per block is expensive (see the connection-reuse sketch below)
– Reduce data-copy overhead
– Keep a low memory footprint
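One way to avoid paying the connection cost per block is to cache and reuse end-points per DataNode. The sketch below is purely illustrative (it reuses the hypothetical UCRBridge from the earlier sketch) and is not the actual UCR code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative end-point cache: one RDMA connection per DataNode, reused across blocks.
class EndpointCache {
    private final UCRBridge bridge = new UCRBridge();
    private final Map<String, Long> endpoints = new HashMap<String, Long>();

    // Return an existing end-point for this DataNode, or create one on first use.
    synchronized long get(String dataNodeHost, int port) {
        String key = dataNodeHost + ":" + port;
        Long ep = endpoints.get(key);
        if (ep == null) {
            ep = bridge.connect(dataNodeHost, port);
            endpoints.put(key, ep);
        }
        return ep;
    }

    // Close all cached end-points, e.g., when the DFS client shuts down.
    synchronized void closeAll() {
        for (long ep : endpoints.values()) {
            bridge.close(ep);
        }
        endpoints.clear();
    }
}
```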
Components and Communication Flow
[Diagram: components and communication flow between the RDMA-enabled DFSClient and the DataNode. Client side: the RDMADataStreamer takes packets from the DataQueue and sends them through JNI, UCR and IB Verbs, while the RDMAResponseProcessor drains the AckQueue as acknowledgements arrive. DataNode side: the RDMADataXceiverServer creates an end-point per connection for an RDMADataXceiver; the RDMABlockXceiver receives packets and the RDMAPacketResponder returns acknowledgements, with a matching AckQueue on the DataNode]
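The client-side components follow the standard HDFS DataStreamer/ResponseProcessor pattern: the writer enqueues packets on the DataQueue, a streamer thread sends them, and sent packets wait on the AckQueue until acknowledged. A much simplified, illustrative sketch of that pattern (again reusing the hypothetical UCRBridge) is shown below.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Simplified, illustrative sketch of the streamer side of the DataQueue/AckQueue pattern.
class PacketStreamer implements Runnable {
    private final LinkedBlockingQueue<byte[]> dataQueue;   // packets produced by the writer
    private final LinkedBlockingQueue<byte[]> ackQueue;    // packets awaiting acknowledgement
    private final UCRBridge bridge;                        // hypothetical JNI bridge
    private final long endpoint;                           // cached RDMA connection to the DataNode

    PacketStreamer(LinkedBlockingQueue<byte[]> dataQueue,
                   LinkedBlockingQueue<byte[]> ackQueue,
                   UCRBridge bridge, long endpoint) {
        this.dataQueue = dataQueue;
        this.ackQueue = ackQueue;
        this.bridge = bridge;
        this.endpoint = endpoint;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] packet = dataQueue.take();                      // next packet from the writer
                bridge.sendPacket(endpoint, packet, 0, packet.length); // send over RDMA
                ackQueue.put(packet);   // packet stays here until the response processor
                                        // removes it upon acknowledgement from the pipeline
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                        // shut down cleanly
        }
    }
}
```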
HDFS-JNI Interaction
• The new design does not change the user-level APIs
• The communication part of the HDFS write API is modified, keeping the socket-based design intact
• A new configuration parameter, dfs.ib.enabled, is added to select the communication protocol (see the snippet below):
– dfs.ib.enabled = true -> writes go over RDMA
– dfs.ib.enabled = false -> writes go over sockets
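Selecting the protocol then only requires setting this one parameter, either in hdfs-site.xml or programmatically. The snippet below uses the standard Hadoop Configuration API; the parameter name dfs.ib.enabled is taken from this design, and everything else is an ordinary, unchanged HDFS write.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RdmaWriteToggle {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // New parameter from this design; can equally be set once in hdfs-site.xml.
        conf.setBoolean("dfs.ib.enabled", true);   // true: write over RDMA, false: write over sockets
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/rdma-write.dat"));
        out.write(new byte[4 * 1024]);             // user-level write API is unchanged
        out.close();
        fs.close();
    }
}
```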
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Experimental Setup
• Hardware
– Intel Clovertown (Cluster A)
• Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk
• Network: 1GigE, IPoIB, 10GigE TOE and IB-DDR (16Gbps)
– Intel Westmere (Cluster B)
• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk
• 4 storage nodes with two 1 TB HDDs per node, 24 GB RAM
• 4 storage nodes with a 300 GB OCZ VeloDrive PCIe SSD
• Network: 1GigE, IPoIB and IB-QDR (32Gbps)
• Software
– Hadoop 0.20.2, HBase 0.90.3 and Sun Java SDK 1.7
– Yahoo! Cloud Serving Benchmark (YCSB)
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
– Determining the optimal packet-size for different
interconnects/protocols
– Micro-benchmark level evaluations (measures the latency of file write)
– Evaluations with TestDFSIO (measures the throughput of sequential file
write)
– Integration with HBase (TCP/IP) (measures latency and throughput of
HBase put operation)
• Conclusion & Future Work
HDFS Optimal Packet-Size Evaluation
• For both clusters (HDD and SSD), the optimal packet sizes are as follows (a tuning sketch appears after the charts below):
– Optimal packet-size for 1/10GigE: 64KB
– Optimal packet-size for IPoIB: 128KB
– Optimal packet-size for OSU-IB: 512KB
[Charts: file write time (s) vs. packet-size (64KB to 1MB) for file sizes of 1GB to 5GB, shown for 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
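In stock Hadoop 0.20.x the client-side packet size is read from dfs.write.packet.size (default 64 KB); the RDMA path may expose its own knob, so treat the sketch below purely as an illustration of applying the per-interconnect optima reported above.

```java
import org.apache.hadoop.conf.Configuration;

public class PacketSizeTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        boolean useRdma  = conf.getBoolean("dfs.ib.enabled", false); // parameter from this design
        boolean useIpoib = false;   // set according to the deployed network (assumption)

        int packetSize;
        if (useRdma) {
            packetSize = 512 * 1024;        // OSU-IB optimum: 512 KB
        } else if (useIpoib) {
            packetSize = 128 * 1024;        // IPoIB optimum: 128 KB
        } else {
            packetSize = 64 * 1024;         // 1/10 GigE optimum: 64 KB (stock Hadoop default)
        }
        // dfs.write.packet.size is the stock Hadoop 0.20.x client packet-size parameter.
        conf.setInt("dfs.write.packet.size", packetSize);
        System.out.println("dfs.write.packet.size = " + conf.getInt("dfs.write.packet.size", 0));
    }
}
```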
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
– Determining the optimal packet-size for different
interconnects/protocols
– Micro-benchmark level evaluations (measures the latency of file write)
– Evaluations with TestDFSIO (measures the throughput of sequential file
write)
– Integration with HBase (TCP/IP) (measures latency and throughput of
HBase put operation)
• Conclusion & Future Work
Evaluations using Micro-benchmark
(DataNode Storage: HDD)
• Cluster A with 4 DataNodes
– 14% improvement over 10GigE for 5 GB file size
– 20% improvement over IPoIB (16Gbps) for 5GB file size
• Cluster B with 32 DataNodes
– 16% improvement over IPoIB (32Gbps) for 8GB file size
[Charts: file write time (s) vs. file size. Two panels: Cluster A with 4 DataNodes (1GB to 5GB files; 1GigE, IPoIB (DDR), 10GigE, OSU-IB (DDR)) and Cluster B with 32 DataNodes (2GB to 10GB files; 1GigE, IPoIB (QDR), OSU-IB (QDR))]
Evaluations using Micro-benchmark
(DataNode Storage: SSD)
• Cluster A with 4 DataNodes
– 16% improvement over 10GigE for 5GB file size
– 25% improvement over IPoIB (16Gbps) for 5GB file size
• Cluster B with 4 DataNodes
– 25% improvement over IPoIB (32Gbps) for 10GB file size
[Charts: file write time (s) vs. file size. Two panels: Cluster A with 4 DataNodes (1GB to 5GB files; 1GigE, IPoIB (DDR), 10GigE, OSU-IB (DDR)) and Cluster B with 4 DataNodes (2GB to 10GB files; 1GigE, IPoIB (QDR), OSU-IB (QDR))]
Communication Times in HDFS
• Cluster B with 32 HDD DataNodes
– 30% improvement in communication time over IPoIB (32Gbps)
– 87% improvement in communication time over 1GigE
• Similar improvements are obtained for SSD DataNodes
[Chart: communication time (s) vs. file size (2GB to 10GB) on Cluster B, comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
– Determining the optimal packet-size for different
interconnects/protocols
– Micro-benchmark level evaluations (measures the latency of file write)
– Evaluations with TestDFSIO (measures the throughput of sequential file
write)
– Integration with HBase (TCP/IP) (measures latency and throughput of
HBase put operation)
• Conclusion & Future Work
Evaluations using TestDFSIO
• Cluster B with 32 HDD DataNodes
– 13.5% improvement over IPoIB (32Gbps) for 8GB file size
• Cluster B with 4 SSD DataNodes
– 15% improvement over IPoIB (32Gbps) for 8GB file size
[Charts: TestDFSIO throughput (MegaBytes/s) vs. file size (2GB to 10GB) on Cluster B, comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR). Two panels: 32 HDD DataNodes and 4 SSD DataNodes]
Evaluations using TestDFSIO
• Cluster B with 4 DataNodes, 2 HDD per node
– 16.2% improvement over IPoIB (32Gbps) for 10GB file size
• Cluster B with 4 DataNodes, 1 HDD per node
– 10% improvement over IPoIB (32Gbps) for 10GB file size
• 2 HDD vs 1 HDD
– 2.01x improvement for OSU-IB (32Gbps)
– 1.8x improvement for IPoIB (32Gbps)
[Chart: average throughput (MegaBytes/s) vs. file size (2GB to 10GB) on Cluster B with 4 storage nodes, comparing 1GigE, IPoIB and OSU-IB, each with 1 HDD and 2 HDDs per node]
Evaluations using TestDFSIO
[Chart: average throughput (MegaBytes/s) vs. file size (2GB to 10GB), comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
• Cluster B with 4 DataNodes, 1 SSD per node
– 28% improvement over IPoIB (32Gbps) for 10GB file size
Outline
• Introduction and Motivation
• Problem Statement
• Design
• Performance Evaluation
– Determining the optimal packet-size for different
interconnects/protocols
– Micro-benchmark level evaluations (measures the latency of file write)
– Evaluations with TestDFSIO (measures the throughput of sequential file
write)
– Integration with HBase (TCP/IP) (measures latency and throughput of
HBase put operation)
• Conclusion & Future Work
Evaluations using YCSB
(Single Region Server: 100% Update)
• HBase Put latency for 360K records
– 204 us for OSU Design; 252 us for IPoIB (32Gbps)
• HBase Put throughput for 360K records
– 4.42 Kops/sec for OSU Design; 3.63 Kops/sec for IPoIB (32Gbps)
• 20% improvement in both average latency and throughput
[Charts (Cluster B): HBase Put latency (us) and Put throughput vs. number of records (120K to 480K, record size = 1KB), comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
Evaluations using YCSB
(32 Region Servers: 100% Update)
[Charts (Cluster B): HBase Put latency (us) and Put throughput (Kops/sec) vs. number of records (240K to 960K, record size = 1KB), comparing 1GigE, IPoIB (QDR) and OSU-IB (QDR)]
• HBase Put latency for 960K records
– 195 us for OSU Design; 273 us for IPoIB (32Gbps)
• HBase Put throughput for 960K records
– 4.60 Kops/sec for OSU Design; 3.45 Kops/sec for IPoIB (32Gbps)
• 29% improvement in average latency; 33% improvement in throughput
Outline
• Introduction and Motivation
• Problem Statement
• Design using Hybrid Transports
• Performance Evaluation
• Conclusion & Future Work
Conclusion
• Detailed Profiling and Analysis of HDFS
• RDMA-based Design of HDFS over InfiniBand
– First design of HDFS over the InfiniBand network
• Comprehensive Evaluation of the RDMA-based
design of HDFS
• Integration with HBase leads to performance improvement of the HBase Put operation
Future Work
• Identify architectural bottlenecks in higher-level HDFS designs and propose enhancements to work with high performance communication schemes
• Investigate faster recovery from DataNode failures
• Implement other HDFS operations (e.g., HDFS Read) over RDMA
• Integrate with other Hadoop components designed over InfiniBand
Thank You!
{islamn, rahmanmd, jose, rajachan, wangh, subramon, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page http://mvapich.cse.ohio-state.edu/