Scalable RDMA Software Solution
Sean Hefty
Intel Corporation
Overview
RDMA Software Solution
• Common Addressing Model
• Communication Management
• Device Handling
Infiniband Software Solution: Infiniband Scalability
• Path Resolution
• Multicast Support
Common Addressing Model
• Simplify user interface
• Permit use of standard name services and interfaces
– Socket addresses, name service resolution
• Consistent addressing model across RDMA devices
Address Mapping Service
• Map IP address to RDMA device
[Figure: address mapping diagram. Labels: network SW; IPoIB net_device; local resolution; remote resolution (ARP); src_dev_addr; dst_dev_addr; broadcast; address mapping; request queue; RDMA workqueue (generic workqueue); RDMA addresses; RDMA devices; network.]
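The idea of mapping an IP address to an RDMA device can be illustrated with a small lookup. This is only a sketch under stated assumptions: the slides do not say how the mapping is chosen, so the longest-prefix match below, the `Interface` record, and the device names are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    """Hypothetical record tying a network interface to an RDMA device port."""
    name: str                       # e.g. an IPoIB net_device name (illustrative)
    network: ipaddress.IPv4Network  # subnet configured on the interface
    rdma_device: str                # illustrative RDMA device name
    port: int                       # port number on that device


def map_ip_to_rdma_device(dst_ip, interfaces):
    """Return (rdma_device, port) for dst_ip, or None if no interface matches.

    Assumption: the interface with the most specific matching subnet wins.
    """
    dst = ipaddress.ip_address(dst_ip)
    matches = [i for i in interfaces if dst in i.network]
    if not matches:
        return None
    best = max(matches, key=lambda i: i.network.prefixlen)
    return best.rdma_device, best.port


INTERFACES = [
    Interface("ib0", ipaddress.ip_network("192.168.1.0/24"), "dev0", 1),
    Interface("ib1", ipaddress.ip_network("192.168.0.0/16"), "dev1", 1),
]
```

With this table, `map_ip_to_rdma_device("192.168.1.7", INTERFACES)` selects `dev0`, because the /24 subnet is more specific than the /16.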
Communication Management
• Socket-like semantics
• IP network addresses
• RDMA port spaces
– Conceptual port space sharing
– TCP, UDP, SDP, SCTP
• Common connection interface for all RDMA devices
[Figure: software stack diagram. Labels: RDMA API; verbs; RDMA CM; address mapping; RDMA WQ; RDMA devices; network.]
RDMA CM
• Transport independent interface
• Acquire device before connecting
• Wildcard listens across all devices
• Resolve routing before connecting
• Handle device hotplug events
RDMA CM
[Figure: RDMA CM state diagram. States: idle; bind local address; address resolution; route resolution; connect; listen; device removal; destroy. Annotations: bind to local device; select fabric path; new connection; serialize removal with connect events.]
• Optionally bind to a specific device
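The states named on this slide can be sketched as a small state machine. The state names come from the slide; the transition table is an assumption, since the arrows of the original diagram did not survive, and `CmId`, `allowed`, and `move` are hypothetical names rather than the RDMA CM API.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    BOUND = "bind local address"
    ADDR_RESOLVED = "address resolution"
    ROUTE_RESOLVED = "route resolution"
    CONNECT = "connect"
    LISTEN = "listen"
    DEVICE_REMOVAL = "device removal"
    DESTROYED = "destroy"


# Assumed forward transitions: the active side resolves an address, then a
# route, then connects; the passive side binds a local address and listens.
_FORWARD = {
    State.IDLE: {State.BOUND, State.ADDR_RESOLVED},
    State.BOUND: {State.ADDR_RESOLVED, State.LISTEN},
    State.ADDR_RESOLVED: {State.ROUTE_RESOLVED},
    State.ROUTE_RESOLVED: {State.CONNECT},
    State.CONNECT: set(),
    State.LISTEN: set(),
    State.DEVICE_REMOVAL: set(),
    State.DESTROYED: set(),
}


def allowed(current, target):
    """Return True if the assumed state machine permits current -> target."""
    if current is State.DESTROYED:
        return False
    if target is State.DESTROYED:
        return True  # assumption: an identifier can be destroyed at any time
    if target is State.DEVICE_REMOVAL:
        # Assumption: removal only affects identifiers that hold a device.
        return current not in (State.IDLE, State.DEVICE_REMOVAL)
    return target in _FORWARD[current]


class CmId:
    """Hypothetical communication identifier tracking its current state."""

    def __init__(self):
        self.state = State.IDLE

    def move(self, target):
        if not allowed(self.state, target):
            raise ValueError(f"illegal transition: {self.state.value} -> {target.value}")
        self.state = target
```

An active-side identifier would call `move` with `ADDR_RESOLVED`, `ROUTE_RESOLVED`, and `CONNECT` in turn; skipping route resolution raises `ValueError`, which mirrors the slide's point that routing is resolved before connecting.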
Infiniband Scalability
• Support scale-out to thousands of nodes
• Efficient MPI collective operations
• Prevent SA storms
• Reduce hardware requirements
– High performance UD
– Multicast endpoints
– Minimal memory footprint
Path Resolution
• Reduce connection setup time
• Decrease SA flood on app startup
• Application selected routes
– I view MPI as an app
• MultiPath record support
– Path independence
[Figure: path record retrieval diagram. Labels: Get Table; Indexing Service; SA; network; Local SA; Path Records.]
Path Resolution
• Efficient SA interaction
• Radix tree with variable-sized key
• Still requires scalable SA
[Figure: timeline of events and cache updates. Labels: time; event; update; update delay; hold time; cache timeout; path records.]
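The timeline labels (update delay, hold time, cache timeout) suggest a policy for deciding when a local path record cache is refreshed from the SA. The sketch below is one possible reading, not the presented implementation. It assumes that an event schedules a refresh after the update delay, that two refreshes are never closer together than the hold time, and that the cache is refreshed anyway once the cache timeout expires. `PathCachePolicy` and its methods are hypothetical names.

```python
class PathCachePolicy:
    """Decides when a local path record cache should be refreshed.

    All times are plain numbers in the same unit (for example seconds).
    """

    def __init__(self, update_delay, hold_time, cache_timeout):
        self.update_delay = update_delay    # wait after an event, to batch changes
        self.hold_time = hold_time          # minimum spacing between refreshes
        self.cache_timeout = cache_timeout  # refresh even when no event arrives
        self.last_update = None             # time of the last refresh
        self.pending_since = None           # time of the first unhandled event

    def on_event(self, now):
        """Record a fabric change; later events fold into the pending one."""
        if self.pending_since is None:
            self.pending_since = now

    def should_update(self, now):
        """Return True if the cache should be refreshed at time `now`."""
        if self.last_update is None:
            return True  # never populated
        if self.pending_since is not None:
            due = max(self.pending_since + self.update_delay,
                      self.last_update + self.hold_time)
            return now >= due
        return now >= self.last_update + self.cache_timeout

    def mark_updated(self, now):
        """Record that a refresh completed at time `now`."""
        self.last_update = now
        self.pending_since = None
```

Folding a burst of events into a single delayed refresh is what would keep many nodes from querying the SA at once, which matches the deck's goal of preventing SA storms.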
Multicast Support
Architectural support issues
• Creation attributes outside of spec
• SA tracks join/leave requests per port
• Requires local reference counting
• Serialize operations to SA
• Queue join/leave requests
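Because the SA tracks membership per port, several local users of the same group have to share one SA join. A minimal sketch of local reference counting with a queue of SA requests might look like the following; `MulticastRefs` and its methods are hypothetical names, and actually sending the queued requests is left out.

```python
from collections import deque


class MulticastRefs:
    """Per-port reference counts for multicast groups, keyed by MGID."""

    def __init__(self):
        self._refs = {}          # mgid -> number of local users
        self.sa_queue = deque()  # pending ("join" | "leave", mgid) requests

    def join(self, mgid):
        """Add a local user; queue an SA join only for the first one."""
        count = self._refs.get(mgid, 0)
        self._refs[mgid] = count + 1
        if count == 0:
            self.sa_queue.append(("join", mgid))

    def leave(self, mgid):
        """Drop a local user; queue an SA leave only for the last one."""
        count = self._refs.get(mgid, 0)
        if count == 0:
            raise KeyError(f"no local members for {mgid}")
        if count == 1:
            del self._refs[mgid]
            self.sa_queue.append(("leave", mgid))
        else:
            self._refs[mgid] = count - 1

    def next_sa_request(self):
        """Pop the oldest queued request so that only one is issued at a time."""
        return self.sa_queue.popleft() if self.sa_queue else None
```

Draining `sa_queue` one request at a time gives the serialization the slide asks for: a leave queued after a join for the same group cannot overtake it.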
Multicast Support
• Multicast group identified by MGID
• Map IP address to MGID
• Support send-only and full membership
• Creation?
– Automatic using IPoIB for attributes
– Separate application
[Figure: multicast membership state diagram. States: idle; joining send; joining full; send member; full member; upgrading; downgrading; leaving.]
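The slides do not say how an IP address is mapped to an MGID. As an illustration only, the sketch below follows the IPoIB convention for IPv4 multicast from RFC 4391: the MGID is built from the prefix 0xFF, flags 0x1, a scope nibble, the IPv4 signature 0x401B, the P_Key, and the low 28 bits of the IPv4 group address. The function name and the defaults (P_Key 0xFFFF, link-local scope 0x2) are assumptions for the example.

```python
import ipaddress

IPOIB_IPV4_SIGNATURE = 0x401B  # IPv4 signature used in IPoIB multicast GIDs


def ipv4_multicast_to_mgid(ip, pkey=0xFFFF, scope=0x2):
    """Map an IPv4 multicast address to a 128-bit MGID, IPoIB style.

    The result is returned as an IPv6Address purely for convenient printing.
    """
    addr = ipaddress.IPv4Address(ip)
    if not addr.is_multicast:
        raise ValueError(f"{ip} is not an IPv4 multicast address")
    group_id = int(addr) & 0x0FFFFFFF          # low 28 bits of the address
    mgid = (
        (0xFF << 120)                          # multicast prefix
        | (0x1 << 116)                         # flags
        | ((scope & 0xF) << 112)               # scope nibble
        | (IPOIB_IPV4_SIGNATURE << 96)         # IPv4 signature
        | ((pkey & 0xFFFF) << 80)              # partition key
        | group_id                             # group ID in the low-order bits
    )
    return ipaddress.IPv6Address(mgid)
```

For example, `ipv4_multicast_to_mgid("224.0.0.1")` gives `ff12:401b:ffff::1`. Reusing the IPoIB layout is one way to read the "Automatic using IPoIB for attributes" option; a separate application creating the groups could choose MGIDs differently.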
Current Status
• RDMA CM
– Submitted for 2.6.17
– Merging iWarp support
• Path resolution
– Verify implementation
– Enable route selection algorithms
• Multicast
– Next at bat
Future Work
• Verbs
– QP redirection
• MAD
– Large transfers
– Dual-sided RMPP
• Subnet administration
– Where to start?
Future Work
• RDMA CM
– Failover – multiple routes
– UD support
– IPv6
• Path resolution
– Scalable memory footprint
– Reduce network requirements
– Efficient change detection