Early Experiences with NFS over RDMA
OpenFabrics Workshop, San Francisco, September 25, 2006
Sandia National Laboratories, CA
Helen Y. Chen, Dov Cohen, Joe Kenny, Jeff Decker, and Noah Fischer
hycsw, idcoehn, jcdecke, [email protected]
SAND 2006-4293C
Outline
• Motivation
• RDMA technologies
• NFS over RDMA
• Testbed hardware and software
• Preliminary results and analysis
• Conclusion
• Ongoing Work and Future Plans
What is NFS
• A network-attached storage file-access protocol layered on RPC, typically carried over UDP or TCP over IP
• Allows files to be shared among multiple clients across LANs and WANs
• A standard, stable, and mature protocol adopted for cluster platforms
NFS Scalability Concerns in Large Clusters
• Large number of concurrent requests from parallel applications
• Parallel I/O requests are serialized by NFS to a large extent
• Need RDMA and pNFS
[Figure: applications 1 through N issuing concurrent I/O to a single NFS server]
How DMA Works
How RDMA Works
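The original slide is a diagram. As a hedged illustration (not from the slides), the C sketch below shows how a consumer of the OpenFabrics verbs API posts a one-sided RDMA Write, the operation that lets the adapter place data directly into a remote buffer without involving the remote CPU. The function name is hypothetical, and the protection domain, queue pair, completion queue, and the peer's buffer address and rkey are assumed to have been established during connection setup.

```c
/* Hedged sketch: posting a one-sided RDMA Write with libibverbs.
 * Assumes pd, qp, cq, and the peer's (remote_addr, rkey) were set up
 * elsewhere (e.g., via the RDMA connection manager). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       struct ibv_cq *cq, void *buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* Register (pin) the local buffer so the HCA can DMA from it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: no CPU on the peer */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;        /* peer buffer, learned at setup */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr)) {
        ibv_dereg_mr(mr);
        return -1;
    }

    /* Reap the local completion; the remote host's CPU is never involved. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;  /* busy-poll for brevity */

    ibv_dereg_mr(mr);
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```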
Why NFS over RDMA
• NFS moves big chunks of data, incurring many copies with each RPC
• Cluster computing demands high bandwidth and low latency
• RDMA
– Offloads protocol processing
– Offloads the host memory I/O bus
– A must for 10/20 Gbps networks
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-nfs-rdma-problem-statement-04.txt
The NFS RDMA Architecture
• NFS is a family of protocols layered over RPC
• XDR encodes RPC requests and results onto RPC transports
• NFS RDMA is implemented as a new RPC transport mechanism
• Selection of transport is an NFS mount option
[Figure: protocol stack, with NFSv2, NFSv3, NFSv4, NLM, and NFSACL layered over RPC and XDR, which run over UDP, TCP, or RDMA transports]
Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, Omer Asad, “NFS over RDMA”, ACM SIGCOMM 2003 Workshops, August 25-27, 2003
This Study
OpenFabrics Software Stack
[Figure: the OpenFabrics software stack. Hardware (InfiniBand HCA, iWARP R-NIC) and hardware-specific drivers sit at the bottom; above them are the kernel-level verbs/API and mid-layer services (Connection Manager Abstraction, connection managers, MAD, SA client, SMA, PMA, OpenSM, diagnostic tools); upper-layer protocols include IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, and cluster file systems; user space reaches the stack through user-level verbs, uDAPL, and kernel-bypass paths for sockets-based access, various MPIs, clustered DB access, block storage access, file system access, and IP-based applications. Key: HCA = Host Channel Adapter, R-NIC = RDMA NIC, UDAPL = User Direct Access Programming Library, RDS = Reliable Datagram Service, iSER = iSCSI RDMA Protocol (initiator), SRP = SCSI RDMA Protocol (initiator), SDP = Sockets Direct Protocol, IPoIB = IP over InfiniBand, PMA = Performance Manager Agent, SMA = Subnet Manager Agent, MAD = Management Datagram, SA = Subnet Administrator.]
Offers a common, open-source, and open-development RDMA application programming interface. http://openfabrics.org/
Testbed Key Hardware
• Mainboard: Tyan Thunder K8WE (S2895)
http://www.tyan.com/products/html/thunderk8we.html
– CPU: dual 2.2 GHz AMD Opteron, Socket 940
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs
– Memory: 8 GB ATP 1 GB PC3200 DDR SDRAM on the NFS server and 2 GB CORSAIR CM725D512RLP-3200/M on the client
• IB switch: Flextronics InfiniScale III 24-port switch
http://mellanox.com/products/switch_silicon.php
• IB HCA: Mellanox MT25208 InfiniHost III Ex
http://www.mellanox.com/products/shared/Infinihostglossy.pdf
Testbed Key Software
• Kernel: Linux 2.6.16.5 with deadline I/O scheduler
• NFS/RDMA release candidate 4 –
http://sourceforge.net/projects/nfs-rdma/
• oneSIS used to boot all the nodes
http://www.oneSIS.org
• OpenFabrics IB stack, svn revision 7442
http://openib.org
Testbed Configuration
• One NFS server and up to four clients
– NFS/TCP vs. NFS/RDMA
– IPoIB and IB RDMA running at SDR
• Ext2 with a software RAID0 backend
• Clients ran IOZONE, writing and reading 64 KB records with a 5 GB aggregate file size
– To eliminate cache effects on the clients
– To maintain consistent disk I/O on the server
– Allowing evaluation of the NFS/RDMA transport without being constrained by disk I/O
• System resources monitored using vmstat at 2-second intervals
[Figure: testbed topology, clients connected to the NFS server through an IB switch]
Local, NFS, and NFS/RDMA Throughput
Local NFS (IPoIB) NFS/RDMA
Write (MB/s) 266.11 100.02 130.26
Read (MB/s) 1518.20 179.94 692.94
• Reads are served from the server cache, reflecting the capability of the RPC transport:
– The TCP RPC transport achieved ~180 MB/s (1.4 Gb/s) of throughput
– The RDMA RPC transport was capable of delivering ~700 MB/s (5.6 Gb/s) of throughput
• RPCNFSDCOUNT=8
• /proc/sys/sunrpc/svc_rdma/max_requests=16
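The two settings above are ordinary server-side knobs: RPCNFSDCOUNT is the nfsd thread count and the procfs entry caps outstanding svc_rdma requests. As a hedged illustration (not from the slides), the trivial C snippet below writes the max_requests value shown above; running `echo 16 > /proc/sys/sunrpc/svc_rdma/max_requests` as root is equivalent.

```c
/* Hedged sketch: set the svc_rdma request limit named on the slide above.
 * Requires root and a kernel with the NFS/RDMA server module loaded. */
#include <stdio.h>

int main(void)
{
    const char *path = "/proc/sys/sunrpc/svc_rdma/max_requests";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "16\n");   /* value used in this study */
    fclose(f);
    return 0;
}
```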
NFS & NFS/RDMA Server Disk I/O
[Figure: two panels, 'NFS/RDMA Server Block I/O' and 'NFS Server Block I/O', plotting blocks written out (bo) and IOWAIT (wa) vs. elapsed time across the write, rewrite, read, reread, and idle phases]
• Writes incurred disk I/O issued according to the deadline scheduler
– The NFS/RDMA server has a higher incoming data rate, and thus a higher block I/O output rate to disk
– The NFS/RDMA data rate was bottlenecked by the storage I/O rate, as indicated by the higher IOWAIT time
NFS vs. NFS/RDMA Client Interrupt and Context Switch
[Figure: two panels, 'NFS Client Interrupt & Context Switch' and 'NFS/RDMA Client Interrupt & Context Switch', plotting interrupts (in) and context switches (cs) vs. elapsed time]
• NFS/RDMA incurred ~1/8 of the interrupts and completed in a little more than 1/2 of the time
• NFS/RDMA showed higher context-switch rates, indicating faster processing of application requests
• Higher throughput compared to NFS!
Client CPU Efficiency
[Figure: two panels, 'NFS Client CPU' and 'NFS/RDMA Client CPU', plotting %CPU vs. elapsed time across the write, rewrite, read, reread, and idle phases]
CPU per MB of transfer = (Σ %CPU · Δt) / 100 / file size (MB)
• Write: NFS = 0.00375, NFS/RDMA = 0.00144 (61.86% more efficient!)
• Read: NFS = 0.00435, NFS/RDMA = 0.00107 (75.47% more efficient!)
Improved application performance
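The CPU-per-MB metric above integrates the sampled CPU utilization over the run. As a hedged sketch (not code from the study), the short C program below illustrates that arithmetic; the 2-second interval and the 5 GB (5120 MB) file size follow the testbed description, while the %CPU samples are placeholders rather than measured data.

```c
/* Hedged sketch: CPU-seconds consumed per MB transferred, derived from
 * periodic %CPU samples such as vmstat output taken every 2 seconds.
 *   cpu_per_mb = sum(%CPU * interval) / 100 / file_size_mb */
#include <stdio.h>

int main(void)
{
    const double interval_s   = 2.0;     /* vmstat sampling interval from the testbed */
    const double file_size_mb = 5120.0;  /* 5 GB aggregate file */

    /* Placeholder %CPU samples (user + system) covering one test phase. */
    const double cpu_pct[] = { 12.0, 15.0, 14.0, 13.0, 16.0, 15.0 };
    const int n = sizeof(cpu_pct) / sizeof(cpu_pct[0]);

    double cpu_seconds = 0.0;
    for (int i = 0; i < n; i++)
        cpu_seconds += cpu_pct[i] / 100.0 * interval_s;

    printf("CPU-seconds per MB: %.5f\n", cpu_seconds / file_size_mb);
    return 0;
}
```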
Server CPU Efficiency
[Figure: two panels, 'NFS Server CPU' and 'NFS/RDMA Server CPU', plotting %CPU vs. elapsed time across the write, rewrite, read, reread, and idle phases]
CPU per MB of transfer = (Σ %CPU · Δt) / 100 / file size (MB)
• Write: NFS = 0.00564, NFS/RDMA = 0.00180 (68.10% more efficient!)
• Read: NFS = 0.00362, NFS/RDMA = 0.00055 (84.70% more efficient!)
Improved system performance
Scalability Test – Throughput
• To minimize the impact of disk I/O, the aggregate file size was held at 5 GB: one client wrote/read 5 GB, two clients 2.5 GB each, three clients 1.67 GB each, and four clients 1.25 GB each
• Rewrite and reread results were ignored due to client-side cache effects

NFS aggregate throughput (KB/s):
        1 Client    2 Clients   3 Clients   4 Clients
write   100017.20   98584.20    87956.00    83743.20
read    179944.00   236774.80   264673.00   272429.40

NFS/RDMA aggregate throughput (KB/s):
        1 Client    2 Clients   3 Clients   4 Clients
write   130257.00   198165.00   270703.60   318829.20
read    692946.80   958892.80   927666.40   1276621.40
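One way to read the two tables is to compute how each transport scales from one to four clients and how the transports compare at four clients. The hedged C snippet below (not part of the original study) performs exactly that arithmetic with values copied from the tables above.

```c
/* Hedged sketch: scaling ratios derived from the aggregate-throughput
 * tables above (values in KB/s, 1-client and 4-client columns). */
#include <stdio.h>

int main(void)
{
    const double nfs_write_1  = 100017.20, nfs_write_4  = 83743.20;
    const double nfs_read_1   = 179944.00, nfs_read_4   = 272429.40;
    const double rdma_write_1 = 130257.00, rdma_write_4 = 318829.20;
    const double rdma_read_1  = 692946.80, rdma_read_4  = 1276621.40;

    printf("NFS write scaling, 1 -> 4 clients:      %.2fx\n", nfs_write_4 / nfs_write_1);
    printf("NFS/RDMA write scaling, 1 -> 4 clients: %.2fx\n", rdma_write_4 / rdma_write_1);
    printf("NFS read scaling, 1 -> 4 clients:       %.2fx\n", nfs_read_4 / nfs_read_1);
    printf("NFS/RDMA read scaling, 1 -> 4 clients:  %.2fx\n", rdma_read_4 / rdma_read_1);
    printf("NFS/RDMA vs. NFS at 4 clients (write):  %.1fx\n", rdma_write_4 / nfs_write_4);
    printf("NFS/RDMA vs. NFS at 4 clients (read):   %.1fx\n", rdma_read_4 / nfs_read_4);
    return 0;
}
```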
Scalability Test – Server I/O
[Figure: two panels, 'NFS Server I/O (4 Clients)' and 'NFS/RDMA Server I/O (4 clients)', plotting blocks written out (bo) and IOWAIT (wa) vs. elapsed time]
• The NFS RDMA transport demonstrated faster processing of concurrent RPC I/O requests and responses from and to the 4 clients than NFS
• Concurrent NFS/RDMA writes were impacted more by our slow storage, as indicated by CPU IOWAIT times close to 80%
Scalability Test – Server CPU
[Figure: two panels, 'NFS/RDMA Server CPU (4 clients)' and 'NFS Server CPU (4 clients)', plotting system (sy), idle (id), and IOWAIT (wa) CPU percentages vs. elapsed time]
• NFS/RDMA incurred ~1/2 the CPU overhead for about half the duration, yet delivered 4 times the aggregate throughput compared to NFS
• NFS/RDMA write performance was impacted more by the backend storage than NFS, as indicated by ~70% vs. ~30% of CPU time spent waiting for I/O to complete
Preliminary Conclusion
• Compared to NFS, NFS/RDMA demonstrated:
– impressive CPU efficiency
– promising scalability
NFS/RDMA will improve application- and system-level performance!
NFS/RDMA can easily take advantage of the bandwidth of 10/20 Gigabit networks for large file accesses
Ongoing Work
• SC06 participation
– HPC Storage Challenge finalist
• Micro-benchmarks
• MPI applications with POSIX and/or MPI I/O
– Xnet NFS/RDMA demo over IB and iWARP
Future Plans
• Initiate study of NFSv4 pNFS performance with RDMA storage
– Blocks (SRP, iSER)
– File (NFSv4/RDMA)
– Object (iSCSI-OSD)?
Why NFSv4
• NFSv3
– Use of the ancillary Network Lock Manager (NLM) protocol adds complexity and limits scalability in parallel I/O
– No attribute-caching requirement squelches performance
• NFSv4
– Use of integrated lock management allows the byte-range locking required for parallel I/O
– Compound operations improve the efficiency of data movement and …
Why Parallel NFS (pNFS)
• pNFS extends NFSv4
– A minimum extension to allow out-of-band I/O
– A standards-based scalable I/O solution
• Asymmetric, out-of-band solutions offer scalability
– The control path (open/close) is different from the data path (read/write)
http://www3.ietf.org/proceedings/04nov/slides/nfsv4-8/pnfs-reqs-ietf61.ppt
Acknowledgement
• The authors would like to thank the following for their technical input
– Tom Talpey and James Lentini from NetApp
– Tom Tucker from Open Grid Computing
– James Ting from Mellanox
– Matt Leininger and Mitch Sukalski from Sandia