Early Experiences with NFS over RDMA
OpenFabrics Workshop, San Francisco, September 25, 2006
Sandia National Laboratories, CA
Helen Y. Chen, Dov Cohen, Joe Kenny, Jeff Decker, and Noah Fischer
hycsw, idcoehn, jcdecke, [email protected]
SAND 2006-4293C
Outline
• Motivation
• RDMA technologies
• NFS over RDMA
• Testbed hardware and software
• Preliminary results and analysis
• Conclusion
• Ongoing Work and Future Plans
What is NFS
• A network-attached storage file-access protocol layered on RPC, typically carried over UDP or TCP over IP
• Allows files to be shared among multiple clients across LANs and WANs
• A standard, stable, and mature protocol adopted for cluster platforms
NFS Scalability Concerns in Large Clusters
• Large number of concurrent requests from parallel applications
• Parallel I/O requests are serialized by NFS to a large extent
• Need RDMA and pNFS
[Figure: applications 1 through N issuing concurrent I/O to a single NFS server]
How DMA Works
How RDMA Works
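The original slide is a diagram. As a hedged illustration (not from the slides), the C sketch below shows how a consumer of the OpenFabrics verbs API posts a one-sided RDMA Write, the operation that lets the adapter place data directly into a remote buffer without involving the remote CPU. The function name is hypothetical, and the protection domain, queue pair, completion queue, and the peer's buffer address and rkey are assumed to have been established during connection setup.

```c
/* Hedged sketch: posting a one-sided RDMA Write with libibverbs.
 * Assumes pd, qp, cq, and the peer's (remote_addr, rkey) were set up
 * elsewhere (e.g., via the RDMA connection manager). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       struct ibv_cq *cq, void *buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* Register (pin) the local buffer so the HCA can DMA from it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: no CPU on the peer */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;        /* peer buffer, learned at setup */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr)) {
        ibv_dereg_mr(mr);
        return -1;
    }

    /* Reap the local completion; the remote host's CPU is never involved. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;  /* busy-poll for brevity */

    ibv_dereg_mr(mr);
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```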
Why NFS over RDMA
• NFS moves big chunks of data, incurring many copies with each RPC
• Cluster computing demands high bandwidth and low latency
• RDMA
– Offloads protocol processing
– Offloads the host memory I/O bus
– A must for 10/20 Gbps networks
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-nfs-rdma-problem-statement-04.txt
The NFS RDMA Architecture
• NFS is a family of protocols layered over RPC
• XDR encodes RPC requests and results onto RPC transports
• NFS RDMA is implemented as a new RPC transport mechanism
• Selection of transport is an NFS mount option
[Figure: protocol stack, with NFSv2, NFSv3, NFSv4, NLM, and NFSACL layered over RPC and XDR, which run over UDP, TCP, or RDMA transports]
Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, Omer Asad, “NFS over RDMA”, ACM SIGCOMM 2003 Workshops, August 25-27, 2003
This Study
OpenFabrics Software Stack
[Figure: the OpenFabrics software stack. Hardware (InfiniBand HCA, iWARP R-NIC) and hardware-specific drivers sit at the bottom; above them are the kernel-level verbs/API and mid-layer services (Connection Manager Abstraction, connection managers, MAD, SA client, SMA, PMA, OpenSM, diagnostic tools); upper-layer protocols include IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, and cluster file systems; user space reaches the stack through user-level verbs, uDAPL, and kernel-bypass paths for sockets-based access, various MPIs, clustered DB access, block storage access, file system access, and IP-based applications. Key: HCA = Host Channel Adapter, R-NIC = RDMA NIC, UDAPL = User Direct Access Programming Library, RDS = Reliable Datagram Service, iSER = iSCSI RDMA Protocol (initiator), SRP = SCSI RDMA Protocol (initiator), SDP = Sockets Direct Protocol, IPoIB = IP over InfiniBand, PMA = Performance Manager Agent, SMA = Subnet Manager Agent, MAD = Management Datagram, SA = Subnet Administrator.]
Offers a common, open-source, and open-development RDMA application programming interface. http://openfabrics.org/
Testbed Key Hardware
• Mainboard: Tyan Thunder K8WE (S2895)
http://www.tyan.com/products/html/thunderk8we.html
– CPU: dual 2.2 GHz AMD Opteron, Socket 940
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs
– Memory: 8 GB ATP 1 GB PC3200 DDR SDRAM on the NFS server and 2 GB CORSAIR CM725D512RLP-3200/M on the client
• IB switch: Flextronics InfiniScale III 24-port switch
http://mellanox.com/products/switch_silicon.php
• IB HCA: Mellanox MT25208 InfiniHost III Ex
http://www.mellanox.com/products/shared/Infinihostglossy.pdf
Testbed Key Software
• Kernel: Linux 2.6.16.5 with deadline I/O scheduler
• NFS/RDMA release candidate 4 –
http://sourceforge.net/projects/nfs-rdma/
• oneSIS used to boot all the nodes
http://www.oneSIS.org
• OpenFabrics IB stack, svn revision 7442
http://openib.org
Testbed Configuration
• One NFS server and up to four clients
– NFS/TCP vs. NFS/RDMA
– IPoIB and IB RDMA running at SDR
• Ext2 with a software RAID0 backend
• Clients ran IOZONE, writing and reading 64 KB records with a 5 GB aggregate file size
– To eliminate cache effects on the clients
– To maintain consistent disk I/O on the server
– Allowing evaluation of the NFS/RDMA transport without being constrained by disk I/O
• System resources monitored using vmstat at 2-second intervals
[Figure: testbed topology, clients connected to the NFS server through an IB switch]
Local, NFS, and NFS/RDMA Throughput
Local NFS (IPoIB) NFS/RDMA
Write (MB/s) 266.11 100.02 130.26
Read (MB/s) 1518.20 179.94 692.94
• Reads are served from the server cache, reflecting the capability of the RPC transport:
– The TCP RPC transport achieved ~180 MB/s (1.4 Gb/s) of throughput
– The RDMA RPC transport was capable of delivering ~700 MB/s (5.6 Gb/s) of throughput
• RPCNFSDCOUNT=8
• /proc/sys/sunrpc/svc_rdma/max_requests=16
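The two settings above are ordinary server-side knobs: RPCNFSDCOUNT is the nfsd thread count and the procfs entry caps outstanding svc_rdma requests. As a hedged illustration (not from the slides), the trivial C snippet below writes the max_requests value shown above; running `echo 16 > /proc/sys/sunrpc/svc_rdma/max_requests` as root is equivalent.

```c
/* Hedged sketch: set the svc_rdma request limit named on the slide above.
 * Requires root and a kernel with the NFS/RDMA server module loaded. */
#include <stdio.h>

int main(void)
{
    const char *path = "/proc/sys/sunrpc/svc_rdma/max_requests";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "16\n");   /* value used in this study */
    fclose(f);
    return 0;
}
```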
NFS & NFS/RDMA Server Disk I/O
[Figure: two panels, 'NFS/RDMA Server Block I/O' and 'NFS Server Block I/O', plotting blocks written out (bo) and IOWAIT (wa) vs. elapsed time across the write, rewrite, read, reread, and idle phases]
• Writes incurred disk I/O issued according to the deadline scheduler
– The NFS/RDMA server has a higher incoming data rate, and thus a higher block I/O output rate to disk
– The NFS/RDMA data rate was bottlenecked by the storage I/O rate, as indicated by the higher IOWAIT time
NFS vs. NFS/RDMA Client Interrupt and Context Switch
[Figure: two panels, 'NFS Client Interrupt & Context Switch' and 'NFS/RDMA Client Interrupt & Context Switch', plotting interrupts (in) and context switches (cs) vs. elapsed time]
• NFS/RDMA incurred ~1/8 of the interrupts and completed in a little more than 1/2 of the time
• NFS/RDMA showed higher context-switch rates, indicating faster processing of application requests
• Higher throughput compared to NFS!
Client CPU Efficiency
[Figure: two panels, 'NFS Client CPU' and 'NFS/RDMA Client CPU', plotting %CPU vs. elapsed time across the write, rewrite, read, reread, and idle phases]
CPU per MB of transfer = (Σ %CPU · Δt) / 100 / file size (MB)
• Write: NFS = 0.00375, NFS/RDMA = 0.00144 (61.86% more efficient!)
• Read: NFS = 0.00435, NFS/RDMA = 0.00107 (75.47% more efficient!)
Improved application performance
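The CPU-per-MB metric above integrates the sampled CPU utilization over the run. As a hedged sketch (not code from the study), the short C program below illustrates that arithmetic; the 2-second interval and the 5 GB (5120 MB) file size follow the testbed description, while the %CPU samples are placeholders rather than measured data.

```c
/* Hedged sketch: CPU-seconds consumed per MB transferred, derived from
 * periodic %CPU samples such as vmstat output taken every 2 seconds.
 *   cpu_per_mb = sum(%CPU * interval) / 100 / file_size_mb */
#include <stdio.h>

int main(void)
{
    const double interval_s   = 2.0;     /* vmstat sampling interval from the testbed */
    const double file_size_mb = 5120.0;  /* 5 GB aggregate file */

    /* Placeholder %CPU samples (user + system) covering one test phase. */
    const double cpu_pct[] = { 12.0, 15.0, 14.0, 13.0, 16.0, 15.0 };
    const int n = sizeof(cpu_pct) / sizeof(cpu_pct[0]);

    double cpu_seconds = 0.0;
    for (int i = 0; i < n; i++)
        cpu_seconds += cpu_pct[i] / 100.0 * interval_s;

    printf("CPU-seconds per MB: %.5f\n", cpu_seconds / file_size_mb);
    return 0;
}
```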
Server CPU Efficiency
[Figure: two panels, 'NFS Server CPU' and 'NFS/RDMA Server CPU', plotting %CPU vs. elapsed time across the write, rewrite, read, reread, and idle phases]
CPU per MB of transfer = (Σ %CPU · Δt) / 100 / file size (MB)
• Write: NFS = 0.00564, NFS/RDMA = 0.00180 (68.10% more efficient!)
• Read: NFS = 0.00362, NFS/RDMA = 0.00055 (84.70% more efficient!)
Improved system performance
Scalability Test – Throughput
• To minimize the impact of disk I/O, the aggregate file size was held at 5 GB: one client wrote/read 5 GB, two clients 2.5 GB each, three clients 1.67 GB each, and four clients 1.25 GB each
• Rewrite and reread results were ignored due to client-side cache effects

NFS aggregate throughput (KB/s):
        1 Client    2 Clients   3 Clients   4 Clients
write   100017.20   98584.20    87956.00    83743.20
read    179944.00   236774.80   264673.00   272429.40

NFS/RDMA aggregate throughput (KB/s):
        1 Client    2 Clients   3 Clients   4 Clients
write   130257.00   198165.00   270703.60   318829.20
read    692946.80   958892.80   927666.40   1276621.40
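One way to read the two tables is to compute how each transport scales from one to four clients and how the transports compare at four clients. The hedged C snippet below (not part of the original study) performs exactly that arithmetic with values copied from the tables above.

```c
/* Hedged sketch: scaling ratios derived from the aggregate-throughput
 * tables above (values in KB/s, 1-client and 4-client columns). */
#include <stdio.h>

int main(void)
{
    const double nfs_write_1  = 100017.20, nfs_write_4  = 83743.20;
    const double nfs_read_1   = 179944.00, nfs_read_4   = 272429.40;
    const double rdma_write_1 = 130257.00, rdma_write_4 = 318829.20;
    const double rdma_read_1  = 692946.80, rdma_read_4  = 1276621.40;

    printf("NFS write scaling, 1 -> 4 clients:      %.2fx\n", nfs_write_4 / nfs_write_1);
    printf("NFS/RDMA write scaling, 1 -> 4 clients: %.2fx\n", rdma_write_4 / rdma_write_1);
    printf("NFS read scaling, 1 -> 4 clients:       %.2fx\n", nfs_read_4 / nfs_read_1);
    printf("NFS/RDMA read scaling, 1 -> 4 clients:  %.2fx\n", rdma_read_4 / rdma_read_1);
    printf("NFS/RDMA vs. NFS at 4 clients (write):  %.1fx\n", rdma_write_4 / nfs_write_4);
    printf("NFS/RDMA vs. NFS at 4 clients (read):   %.1fx\n", rdma_read_4 / nfs_read_4);
    return 0;
}
```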
Scalability Test – Server I/O
[Figure: two panels, 'NFS Server I/O (4 Clients)' and 'NFS/RDMA Server I/O (4 clients)', plotting blocks written out (bo) and IOWAIT (wa) vs. elapsed time]
• The NFS RDMA transport demonstrated faster processing of concurrent RPC I/O requests and responses from and to the 4 clients than NFS
• Concurrent NFS/RDMA writes were impacted more by our slow storage, as indicated by CPU IOWAIT times close to 80%
Scalability Test – Server CPU
[Figure: two panels, 'NFS/RDMA Server CPU (4 clients)' and 'NFS Server CPU (4 clients)', plotting system (sy), idle (id), and IOWAIT (wa) CPU percentages vs. elapsed time]
• NFS/RDMA incurred ~1/2 the CPU overhead for about half the duration, yet delivered 4 times the aggregate throughput compared to NFS
• NFS/RDMA write performance was impacted more by the backend storage than NFS, as indicated by ~70% vs. ~30% of CPU time spent waiting for I/O to complete
Preliminary Conclusion
• Compared to NFS, NFS/RDMA demonstrated:
– impressive CPU efficiency
– promising scalability
NFS/RDMA will improve application- and system-level performance!
NFS/RDMA can easily take advantage of the bandwidth of 10/20 Gigabit networks for large file accesses
Ongoing Work
• SC06 participation
– HPC Storage Challenge finalist
• Micro-benchmarks
• MPI applications with POSIX and/or MPI I/O
– Xnet NFS/RDMA demo over IB and iWARP
Future Plans
• Initiate study of NFSv4 pNFS performance with RDMA storage
– Blocks (SRP, iSER)
– File (NFSv4/RDMA)
– Object (iSCSI-OSD)?
Why NFSv4
• NFSv3
– Use of the ancillary Network Lock Manager (NLM) protocol adds complexity and limits scalability in parallel I/O
– No attribute-caching requirement squelches performance
• NFSv4
– Use of integrated lock management allows the byte-range locking required for parallel I/O
– Compound operations improve the efficiency of data movement and …
Why Parallel NFS (pNFS)
• pNFS extends NFSv4
– A minimum extension to allow out-of-band I/O
– A standards-based scalable I/O solution
• Asymmetric, out-of-band solutions offer scalability
– The control path (open/close) is different from the data path (read/write)
http://www3.ietf.org/proceedings/04nov/slides/nfsv4-8/pnfs-reqs-ietf61.ppt
Acknowledgement
• The authors would like to thank the following for their technical input
– Tom Talpey and James Lentini from NetApp
– Tom Tucker from Open Grid Computing
– James Ting from Mellanox
– Matt Leininger and Mitch Sukalski from Sandia