Scalable RDMA Software Solution Sean Hefty Intel Corporation

Overview

• Common Addressing Model

• Communication Management

• Device Handling

Infiniband Software Solution

• Path Resolution

• Multicast Support

RDMA Software Solution

Infiniband Scalability

Common Addressing Model

• Simplify user interface

• Permit use of standard name services and interfaces
  – Socket addresses, name service resolution

Consistent addressing model across RDMA devices

Address Mapping Service

Map IP address to RDMA device

[Figure: address mapping service. Components: address mapping, request queue, RDMA workqueue (generic workqueue), network SW, IPoIB net_device, RDMA devices, network. Resolution steps: local resolution, remote resolution (ARP). Address fields: src_dev_addr, dst_dev_addr, broadcast. Output: RDMA addresses.]
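
The address mapping flow on this slide can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the class `AddressMapper`, its lookup tables, and the callback convention are all hypothetical. Only the field names `src_dev_addr`, `dst_dev_addr`, and `broadcast` come from the slide. The neighbor table stands in for ARP, and `process()` stands in for a workqueue handler draining the request queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Optional, Tuple


@dataclass
class RdmaDevAddr:
    """Field names taken from the slide; string values are placeholders."""
    src_dev_addr: str
    dst_dev_addr: Optional[str] = None
    broadcast: Optional[str] = None


class AddressMapper:
    """Hypothetical sketch: queue requests, resolve them in a worker pass."""

    def __init__(self,
                 local_ifaces: Dict[str, Tuple[str, str]],
                 neighbors: Dict[str, str]) -> None:
        # local_ifaces: local IP -> (device address, broadcast address)
        # neighbors: remote IP -> device address (stands in for ARP)
        self.local_ifaces = local_ifaces
        self.neighbors = neighbors
        self.requests: Deque[Tuple[str, str, Callable]] = deque()

    def resolve(self, src_ip: str, dst_ip: str, callback: Callable) -> None:
        """Queue a request; the callback fires later from process()."""
        self.requests.append((src_ip, dst_ip, callback))

    def process(self) -> None:
        """Drain the request queue, as a workqueue handler might."""
        while self.requests:
            src_ip, dst_ip, callback = self.requests.popleft()
            local = self.local_ifaces.get(src_ip)    # local resolution
            remote = self.neighbors.get(dst_ip)      # remote resolution
            if local is None or remote is None:
                callback(None)                       # resolution failed
                continue
            dev_addr, bcast = local
            callback(RdmaDevAddr(dev_addr, remote, bcast))
```

Resolution is asynchronous in this sketch: `resolve()` only queues work, and results arrive through the callback once `process()` runs.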

Communication Management

• Socket like semantics

• IP network addresses

• RDMA port spaces
  – Conceptual port space sharing
  – TCP, UDP, SDP, SCTP

Common connection interface for all RDMA devices
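
One way to picture separate RDMA port spaces is a binding table per space, so that the same port number can be bound once in each space. The sketch below is an assumption-laden illustration: the class `PortSpaces`, the `WILDCARD` constant, and the conflict rule (modeled on plain socket binding without address reuse) are hypothetical. Only the port space names come from the slide.

```python
from typing import Dict, Set, Tuple

WILDCARD = "0.0.0.0"  # hypothetical marker for "any local address"


class PortSpaces:
    """Hypothetical per-space binding table (spaces: TCP, UDP, SDP, SCTP)."""

    def __init__(self) -> None:
        self._bound: Dict[str, Set[Tuple[str, int]]] = {}

    def bind(self, space: str, ip: str, port: int) -> bool:
        """Return True if (ip, port) is free in this port space."""
        bound = self._bound.setdefault(space, set())
        for b_ip, b_port in bound:
            # Assumed rule: same port conflicts if the addresses match
            # or either side is the wildcard address.
            if b_port == port and (b_ip == ip or WILDCARD in (b_ip, ip)):
                return False
        bound.add((ip, port))
        return True
```

In this sketch a bind in one space never conflicts with a bind in another, which is the property the slide's separate port spaces suggest.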

RDMA API

[Figure: RDMA API stack. Components: verbs, RDMA CM, address mapping, RDMA WQ, RDMA devices, network.]

RDMA CM: transport independent interface

Acquire device before connecting

Wildcard listens across all devices

Resolve routing before connecting

Handle device hotplug events
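
The combination of wildcard listens and device hotplug handling can be sketched as a registry that keeps every wildcard listener attached to the current device set. All names here (`DeviceRegistry`, `WildcardListener`, `listen_any`) are hypothetical and chosen for illustration; the slides do not describe the actual data structures.

```python
from typing import List, Set


class WildcardListener:
    """A listen that is not tied to one device (hypothetical)."""

    def __init__(self, port: int) -> None:
        self.port = port
        self.devices: Set[str] = set()  # devices currently listened on


class DeviceRegistry:
    """Tracks devices and propagates hotplug events to wildcard listens."""

    def __init__(self) -> None:
        self.devices: Set[str] = set()
        self.listeners: List[WildcardListener] = []

    def listen_any(self, port: int) -> WildcardListener:
        """Start a wildcard listen on every device known right now."""
        listener = WildcardListener(port)
        listener.devices.update(self.devices)
        self.listeners.append(listener)
        return listener

    def add_device(self, name: str) -> None:
        """Hot-add: existing wildcard listens extend to the new device."""
        self.devices.add(name)
        for listener in self.listeners:
            listener.devices.add(name)

    def remove_device(self, name: str) -> None:
        """Hot-remove: detach the device from every wildcard listen."""
        self.devices.discard(name)
        for listener in self.listeners:
            listener.devices.discard(name)
```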

RDMA CM

[Figure: RDMA CM state diagram. States: idle, bind local address, address resolution, route resolution, connect, listen, device removal, destroy.]

Optionally bind to a specific device

Bind to local device

New connection

Serialize removal with connect events

Select fabric path
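
The state diagram on this slide can be approximated with a small transition table. The extracted slide text does not preserve the arrows between states, so the transitions below are an assumption based on the usual active flow (resolve address, resolve route, connect) and passive flow (bind, listen). The names `State`, `TRANSITIONS`, and `CmId` are hypothetical.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    BOUND = "bound to local address"
    ADDR_RESOLVED = "address resolved"
    ROUTE_RESOLVED = "route resolved"
    CONNECTED = "connected"
    LISTENING = "listening"
    DEVICE_REMOVED = "device removed"
    DESTROYED = "destroyed"


# Assumed transitions; the slide's arrows were lost in extraction.
TRANSITIONS = {
    (State.IDLE, "bind_addr"): State.BOUND,
    (State.IDLE, "resolve_addr"): State.ADDR_RESOLVED,
    (State.BOUND, "resolve_addr"): State.ADDR_RESOLVED,
    (State.BOUND, "listen"): State.LISTENING,
    (State.ADDR_RESOLVED, "resolve_route"): State.ROUTE_RESOLVED,
    (State.ROUTE_RESOLVED, "connect"): State.CONNECTED,
}


class CmId:
    """Hypothetical connection identifier that enforces the table above."""

    def __init__(self) -> None:
        self.state = State.IDLE

    def apply(self, event: str) -> State:
        if self.state is State.DESTROYED:
            raise ValueError("identifier already destroyed")
        if event == "destroy":
            self.state = State.DESTROYED
        elif event == "device_removal":
            # After removal, only destroy is accepted in this sketch.
            self.state = State.DEVICE_REMOVED
        else:
            key = (self.state, event)
            if key not in TRANSITIONS:
                raise ValueError(f"{event!r} not allowed in {self.state.value}")
            self.state = TRANSITIONS[key]
        return self.state
```

Funnelling every event through one `apply()` call is one simple way to serialize device removal against connection events, as the slide asks; a real implementation would need locking as well.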

Infiniband Scalability

• Support scale-out to thousands of nodes

• Efficient MPI collective operations

• Prevent SA storms

• Reduce hardware requirements
  – High performance UD
  – Multicast endpoints
  – Minimal memory footprint

Path Resolution

• Reduce connection setup time

• Decrease SA flood on app startup

• Application selected routes
  – I view MPI as an app

• MultiPath record support
  – Path independence

Path Resolution

[Figure: local SA path record cache. Components: SA, network, Get Table, Local SA, Indexing Service, Path Records. Timeline: events and updates over time, with intervals labeled update delay, hold time, and cache timeout.]

Efficient SA interaction

Radix tree – w/ variable sized key

Still requires scalable SA
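
The three timing intervals on the timeline could interact along the following lines. This is an interpretation, not something the slide states: it assumes `update_delay` is the wait between an event and the resulting cache update, `hold_time` is the minimum spacing between updates, and `cache_timeout` is the periodic refresh interval. The function name and the numeric values used with it are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheTimers:
    update_delay: float   # assumed: wait after an event before updating
    hold_time: float      # assumed: minimum time between two updates
    cache_timeout: float  # assumed: maximum age before a periodic refresh


def next_update_time(timers: CacheTimers,
                     last_update: float,
                     event_time: Optional[float] = None) -> float:
    """Return when the local path record cache should next be refreshed."""
    periodic = last_update + timers.cache_timeout
    if event_time is None:
        return periodic  # no event pending: wait for the timeout
    # An event asks for an update after update_delay, but never sooner
    # than hold_time after the previous update.
    triggered = max(event_time + timers.update_delay,
                    last_update + timers.hold_time)
    return min(triggered, periodic)
```

Under these assumptions, a burst of events shortly after an update collapses into a single refresh at the end of the hold time, which limits the query load placed on the SA.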

Multicast Support

• Creation attributes outside of spec

• SA tracks join/leave requests per port

• Requires local reference counting

• Serialize operations to SA

• Queue join/leave requests

Architectural support issues
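
Because the SA tracks membership per port rather than per user, the local side has to count references and send only the first join and the last leave. The sketch below combines that with a queue that keeps a single request outstanding at a time. The class `MulticastManager`, the `send_to_sa` callable, and `sa_reply()` are hypothetical stand-ins for the real SA interaction.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class MulticastManager:
    """Hypothetical per-port manager: refcount groups, serialize SA ops."""

    def __init__(self, send_to_sa: Callable[[str, str], None]) -> None:
        self.send_to_sa = send_to_sa       # stand-in for an SA request
        self.refs: Dict[str, int] = {}     # MGID -> local user count
        self.pending: Deque[Tuple[str, str]] = deque()
        self.busy = False                  # True while a request is in flight

    def join(self, mgid: str) -> None:
        count = self.refs.get(mgid, 0)
        self.refs[mgid] = count + 1
        if count == 0:                     # first local user: tell the SA
            self._enqueue("join", mgid)

    def leave(self, mgid: str) -> None:
        count = self.refs.get(mgid, 0)
        if count == 0:
            raise ValueError("leave without matching join")
        if count == 1:                     # last local user: tell the SA
            del self.refs[mgid]
            self._enqueue("leave", mgid)
        else:
            self.refs[mgid] = count - 1

    def _enqueue(self, op: str, mgid: str) -> None:
        self.pending.append((op, mgid))
        self._kick()

    def _kick(self) -> None:
        # Send the next queued request only if none is outstanding.
        if not self.busy and self.pending:
            self.busy = True
            self.send_to_sa(*self.pending.popleft())

    def sa_reply(self) -> None:
        """Call when the SA answers the outstanding request."""
        self.busy = False
        self._kick()
```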

Multicast Support

[Figure: multicast membership state diagram. States: idle, joining send, joining full, send member, full member, upgrading, downgrading, leaving.]

• Multicast group identified by MGID

• Map IP address to MGID

• Support send-only and full membership

• Creation?
  – Automatic using IPoIB for attributes
  – Separate application
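
The slide does not say how an IP address is mapped to an MGID. One plausible choice, sketched here as an assumption, follows the IPoIB convention for IPv4 multicast: a fixed prefix carrying the scope, the IPv4 signature 0x401B, and the P_Key, followed by the low 28 bits of the group address. The function name and the default `pkey` and `scope` values are illustrative.

```python
import ipaddress


def ipv4_to_mgid(group: str, pkey: int = 0xFFFF, scope: int = 0x2) -> bytes:
    """Map an IPv4 multicast address to a 16-byte MGID (IPoIB-style sketch)."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    mgid = bytearray(16)                 # remaining bytes stay zero
    mgid[0] = 0xFF                       # multicast GID prefix
    mgid[1] = 0x10 | (scope & 0x0F)      # flags nibble and scope nibble
    mgid[2] = 0x40                       # IPv4 signature, high byte
    mgid[3] = 0x1B                       # IPv4 signature, low byte
    mgid[4] = (pkey >> 8) & 0xFF         # P_Key, high byte
    mgid[5] = pkey & 0xFF                # P_Key, low byte
    low28 = int(addr) & 0x0FFFFFFF       # low 28 bits of the group address
    mgid[12:16] = low28.to_bytes(4, "big")
    return bytes(mgid)
```

Deriving the MGID from the IP address this way would let a group joined through the RDMA CM line up with the group IPoIB uses for the same address, which fits the "automatic using IPoIB for attributes" option on the slide.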

Current Status

• RDMA CM
  – Submitted for 2.6.17
  – Merging iWarp support

• Path resolution
  – Verify implementation
  – Enable route selection algorithms

• Multicast
  – Next at bat

Future Work

• Verbs
  – QP redirection

• MAD
  – Large transfers
  – Dual-sided RMPP

• Subnet administration
  – Where to start?

Future Work

• RDMA CM
  – Failover – multiple routes
  – UD support
  – IPv6

• Path resolution
  – Scalable memory footprint
  – Reduce network requirements
  – Efficient change detection