Upload
vuongnguyet
View
221
Download
2
Embed Size (px)
Citation preview
SoftRDMA: Rekindling High PerformanceSoftware RDMA over Commodity Ethernet
MaoMiao,Fengyuan Ren,Xiaohui Luo,JingXie,Qingkai Meng,Wenxue Cheng
Dept. of Computer Science and Technology, Tsinghua University,
Background
• Remote Direct Memory Access (RDMA) - Protocoloffload
• reduceCPUoverheadandbypass kernel - Memorypre-allocationandpre-registering- Zero-Copy
• Datatransferreddirectlyfromuserspace
• DomainRDMANetworkProtocols- InfniBand (IB)- RDMAOverConvergedEthernet(RoCE)- InternetWideAreaRDMAProtocol(iWARP)
Background
• InfniBand (IB)- Customnetworkprotocolandpurpose-builtHW- LosslessL2useshop-by-hop,credit-basedflowcontroltopreventpacketdrops
• Challenges- Incompatiblewith Ethernetinfrastructure- Costmuch
• DCoperatorsneedtodeployandmanagetwoseparatenetworks
Background• RDMA Over Converged Ethernet (RoCE)
- IndeedIBoverEthernet- Routability
• RoCEv2currentlyincludesUDPandIPlayerstoactuallyprovidethatcapability
- LosslessL2• Priority-basedFlowControl(PFC)
• Challenges- Complex andrestrictiveconfiguration- Perils ofusingPFC inalarge-scaledeployment
• head-of-lineblocking,unfairness,spreadingcongestionproblems,etc
Background
• Internet Wide Area RDMA Protocol (iWARP)- EnablesRDMAovertheexistingTCP/IP
• LeveragesTCP/IPforreliabilityandcongestioncontrolmechanismstoensurescalability,routabilityandreliability
- OnlyNICs(RNIC)shouldbespeciallybuilt,nootherchangesrequired
• Challenges- Specially-builtRNIC
Motivation
• Common challenges for RDMA deployment- Specificandnon-backwardcompatibledevices
• Ethernetnon-compatibility• Inflexibility ofHWdevices
- Expensive andhighcost• Equipmentreplacement• Extraburdenofoperationmanagement
• Is it possible to design software RDMA (SoftRDMA) over commodity Ethernet devices ?
Motivation• Software Framework Evolution
- High-performance packet I/O• IntelDPDK,netmap,PacketShader I/O(PSIO)
- High-performance user-level stack• mTCP,IX,Arrakis,Sandstorm
- Drivethetechnicalchanges• Memoryresourcespre-allocationandre-use• Zero-copy• Kernelbypassing• Batchprocessing• Affinityandprefetching
Motivation
• Much similar design philosophy between RDMA and novel SW evolvements - Memoryresourcespre-allocation- Zero-copy- Kernelbypassing
• Can we design SoftRDMA based on the high-performance packet I/O ?- ComparableperformancetoRDMAschemes- Nocustomizeddevicesrequired- CompatiblewithEthernetinfrastructures
SoftRDMA Design: Dedicated Userspace iWARP Stack
Applications
Verbs API
RDMAP
DDP
MPA
TCP
IP
NIC driver Data Link
DPDK
Userspace
Applications
Verbs API
RDMAP
DDP
MPA
TCP
IP
NIC driver Data Link
User Space
Kernel Space
Applications
Verbs API
RDMAP
DDP
MPA
TCP
IP
NIC driver Data Link
User Space
Kernel Space
• User-leveliWARP +Kernel-levelTCP/IP• Kernel-leveliWARP +Kernel-levelTCP/IP• User-leveliWARP +User-levelTCP/IP
SoftRDMA Design: Dedicated Userspace iWARP Stack
• In-kernel stack- Modeswitchingoverhead- Complexityofkernelmodification
• User-level stack- Eliminatemodeswitchingoverhead- Morefreespaceforstackdesign
One-Copy versus Zero-Copy• Seven steps for Pkts from NIC to App• Two-Copy
- Step6:copiedfromRXringbufferforstackprocessing- Step7:copiedtodifferentApps’bufferafterprocessing
123456
7 8 …
Device Driver
RX Ring Buffer
NIC1
DMA Engine
Interrupt Generator
NIC InterruptHandler
SoftirqNet_rx_action
NIC
1
IP
TCP
App1 read()
HardwareInterrupt
Netif_rx_schedule()Raised softirq check
1
2
3
Poll_queue()
dev->poll
App2 read()
…U
ser Space
Kernel
5
Stack Processing6
7 App recv
Control flow Data flow
1st COPY
4
2nd COPY
1
2
3
4
5
6
7
One-Copy versus Zero-Copy
123456
7 8 …
Device Driver
RX Ring Buffer
NIC1
DMA Engine
Interrupt Generator
NIC InterruptHandler
SoftirqNet_rx_action
NIC
1
IP
TCP
App1 read()
HardwareInterrupt
Netif_rx_schedule()Raised softirq check
1
2
3
Poll_queue()
dev->poll
App2 read()
…U
ser Space
Kernel
5
Stack Processing6
7 App recv
Control flow Data flow
1st COPY
4
2nd COPY
• One-Copy- MemorymappingbetweenkernelanduserspacetoremovethecopyinStep7
• Zero-Copy- SharingtheDMAregiontoremovethecopyinStep6
7
6
One-Copy versus Zero-Copy
• Two obstacles for Zero-Copy in stack processing- WheretoputtheinputPkts fordifferentAppsandhowtomanagethem?
• Unaware abouttheapplication-appointedplace tostoretheinputPkts beforestackprocessing
- WhethertheDMAregionislargeenoughorcouldbereusedfasttoholdinputpackets?
• TheDMAregionisfiniteandfixed,whichcouldonlystoreuptothousandsofinputpackets
• SoftRDMA adopts One-Copy
SoftRDMA Threading Model
• Traditional multi-threading model (c1)- OnethreadforAppprocessing,theotherforPkts’RX/TX- Goodforthroughputasbatchprocessing- Higherlatencyasthethreadswitchingandcommunicationcost
TCP/IP TCP/IP
AppEvent
ConditionsBatchedSystcalls
(c1)
Thread 1 Thread 2
SoftRDMA Threading Model
• Run-to-completion threading model (c2)- Runallstages(Pkt RX/TX,APPprocessing…)intocompletion- Indeedimprovethelatency- SophisticatedprocessingmaymakethePkt loss
TCP/IP TCP/IP
App
NIC Driver
User Space
(c2)
Thread 1 Thread 2
SoftRDMA Threading Model
• SoftRDMA threading model (c3)- OnethreadforPkts’RX,includingOne-Copy,theotherforAppprocessingandPkts’TX
- AcceleratethePkt receivingprocess- AppprocessingandPkts’TXrunwithinathreadtoimprovetheefficiencyandreducethelatency
TCP/IP TCP/IP
App
(c3)
Thread 1 Thread 2
SoftRDMA Implementation
• 20K lines of code, 7.8K are new- DPDKI/O- User-levelTCP/IPbasedonlwIP rawAPI- MPA/DDP/RDMAP layerofiWARP
• RDMA Verbs
SoftRDMA Performance
• Experiment config- DELLPowerEdgeR430- Intel82599ES10GbE NIC- Chelsio T520-SO-CR10GbERNIC
• Four RDMA implementation schemes- Hardware-supportedRNIC(iWARP RNIC)- User-leveliWARP basedonkernel-socket(KernelSocket)- User-leveliWARP basedonDPDK-basedlwIP sequentialAPI(SequentialAPI)
- User-leveliWARP basedonDPDK-basedlwIP rawAPI(SoftRDMA)
SoftRDMA Implementation• Short Message (≤ 10KB) Transfer
- Thecloselatencymetric• SoftRDMA:6.63us/64B6.80us/1KB52.20us/10KB• RNIC: 3.59us/64B5.29us/1KB16.27us/10KB
- Thethroughputfallsfarbehind• Acceptableforshortmessagedelivering
SoftRDMA Implementation• Long Message (10KB-500KB) Transfer
- Thecloselatencymetric• SoftRDMA:101.36us/100KB500.06us/500KB• RNIC: 93.45us/100KB432.50us/500KB
- Theclosethroughputperformance• SoftRDMA:1461.71Mbps/10KB 7893.31Mbps/100KB• RNIC: 8854.16Mbps/10KB 8917.44Mbps/100KB
Next work
• Amorestable and robust user-levelstack• NICs’HW features utilized toacceleratetheprotocolprocessing- TSO/LSO/LRO- Memorybasedscatter/gatherforZero-Copy
• Morecomparisonexperiments- TestsamongSoftRDMA,iWARPNIC, RoCENIC- Testson40GbE/50GbEdevices
Conclusion
• SoftRDMA:ahigh-performancesoftwareRDMAimplementationovercommodityEthernet- Thededicateduserspace iWARP stackbasedonhigh-performancenetworkI/O
- One-Copy- Thecarefullydesignedthreadingmodel
Thanks!Q&A