Architecture of Parallel Computers CSC / ECE 506 OpenFabrics Alliance Lecture 18 7/17/2006 Dr Steve Hunter


Outline

• Infiniband and Ethernet Review

• DDP and RDMA

• OpenFabrics Alliance

– IP over Infiniband (IPoIB)

– Sockets Direct Protocol (SDP)

– Network File System (NFS)

– SCSI RDMA Protocol (SRP)

– iSCSI Extensions for RDMA (iSER)

– Reliable Datagram Sockets (RDS)


Infiniband Goals - Review

• Interconnect for server I/O and efficient interprocess communications

• Standard across the industry

– backed by all the major players

» 200+ companies

• With an architecture able to match future systems:

– Low overhead

– Scalable bandwidth, up and down

– Scalable fanout, few to thousands

– Low cost, excellent price/performance

– Robust reliability, availability, and serviceability

– Leverages Internet Protocol suite and paradigms


The Basic Unit: an IB Subnet - Review

• Basic whole IB system is a subnet

• Elements:

– Endnodes

– Links

– Switches

• What it does: Communicate

– endnodes with endnodes,

– via message queues,

– which process messages over several transport types,

– and are SARed (segmented and reassembled) into packets,

– which are placed on links,

– and routed by switches.

[Figure: an IB subnet in which end nodes attach to switches, and the switches to each other, over links]


End Node Attachment to IB - Review

• End nodes attach to IB via Channel Adapters:

– Host CAs (HCAs)

» O/S API/KPIs not specified

» Queues and memory accessible via verbs

» QP, CQ, and RDMA engines

» Must support three IB Transports

» Can include:

• Dual ports

– load balancing, availability (path migration)

– Attach to same or different subnets

• Partitioning

• Atomics, …

– Target CAs (TCAs)

» Queue access method is vendor unique

» QP and CQ engines

» Need only support Unreliable Datagram

» ULP can be standard or proprietary

» In other words…

• A smaller subset of required functions.

[Figure: a Host (CPUs, Memory Controller, Memory Tables) attached through an HCA that is accessed via Verbs, and an IO Controller attached through a TCA; each adapter contains QPs, CQs, and the IB Layers]


Infiniband Summary

• InfiniBand architecture is a very high performance, low latency interconnect technology based on an industry-standard approach to Remote Direct Memory Access (RDMA)

– An InfiniBand fabric is built from hardware and software that are configured, monitored and operated to deliver a variety of services to users and applications

• Characteristics of the technology that differentiate it from comparable interconnects such as traditional Ethernet include:

– End-to-end reliable delivery

– Scalable bandwidths from 10 to 60 Gbps available today, moving to 120 Gbps in the near future

– Scalability without performance degradation

– Low latency between devices

– Greatly reduced server CPU utilization for protocol processing

– Efficient I/O channel architecture for network and storage virtualization


Advanced Ethernet - Review

[Figure: the TCP/IP model (Physical, Link, Network, Transport, Application) with examples (Copper/Optical, Ethernet, IP, TCP/UDP, HTTP/SMTP/FTP), beside the iSER / RNIC model shown with a SCSI application. From bottom to top the RNIC stack is Physical, Media Access Control (MAC), Internet Protocol (IP), Transmission Control Protocol (TCP), Markers with PDU Alignment (MPA), Direct Data Placement (DDP), Remote Direct Memory Access Protocol (RDMAP), iSCSI Extensions for RDMA (iSER), Internet SCSI (iSCSI), and the SCSI application, grouped into MAC, IP, TCP, RDMA, and SCSI services.]

• The OpenFabrics effort (i.e., the OpenIB / OpenRDMA merger) is expected to bring even more advanced functions into NIC technology


Advanced Ethernet Summary

• The iWARP technology, implemented as an RDMA Network Interface Card (RNIC), achieves zero-copy, RDMA, and protocol offload over existing TCP/IP networks

– It was demonstrated that a 10GbE based RNIC can reduce the CPU processing overhead from 80-90% to less than 10% compared with its host stack equivalent

– Additionally, its achievable end-to-end latency is now 5 microseconds or less.

• iWARP together with the emerging low latency (low hundreds of nanoseconds) 10 GbE switches can also provide a powerful infrastructure for clustered computing, server-to-server processing, visualization and file system access

– The advantages of the iWARP technology include its ability to leverage the widely deployed TCP/IP infrastructure, its broad knowledge base, and mature management and monitoring capabilities.

– In addition, an iWARP infrastructure is a routable infrastructure, thereby eliminating the need for gateways to connect to the LAN or WAN internet.


DDP and RDMA

• IETF RFC http://rfc.net/rfc4296.html

• The central idea of general-purpose DDP is that a data sender will supplement the data it sends with placement information that allows the receiver's network interface to place the data directly at its final destination without any copying.

– DDP can be used to steer received data to its final destination, without requiring layer-specific behavior for each different layer.

– Data sent with such DDP information is said to be `tagged'.

• The central components of the DDP architecture are the “buffer”, which is an object with beginning and ending addresses, and a method (set()), which sets the value of an octet at an address.

– In many cases, a buffer corresponds directly to a portion of host user memory. However, DDP does not depend on this; a buffer could be a disk file, or anything else that can be viewed as an addressable collection of octets.
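The buffer-plus-set() idea can be sketched in a few lines. This is only an illustration of the model described above, not an implementation of any DDP wire format: the class name `Buffer`, the helper `place_tagged`, the tag value, and the addresses are all made up for the example.

```python
class Buffer:
    """An addressable collection of octets, after the DDP notion of a buffer."""

    def __init__(self, start, length):
        self.start = start            # beginning address
        self.end = start + length     # ending address (exclusive in this sketch)
        self._octets = bytearray(length)

    def set(self, address, value):
        """Set the value of the octet at an address inside the buffer."""
        if not self.start <= address < self.end:
            raise IndexError("address outside buffer")
        self._octets[address - self.start] = value

    def contents(self):
        return bytes(self._octets)


def place_tagged(buffers, tag, offset, payload):
    """Place one tagged segment directly into the buffer its tag names."""
    target = buffers[tag]
    for i, octet in enumerate(payload):
        target.set(target.start + offset + i, octet)


# Hypothetical registration table: tag 7 names an 8-octet buffer at 0x1000.
buffers = {7: Buffer(start=0x1000, length=8)}

# Segments may arrive in any order; each carries its own placement information.
place_tagged(buffers, tag=7, offset=4, payload=b"WXYZ")
place_tagged(buffers, tag=7, offset=0, payload=b"ABCD")
print(buffers[7].contents())
```

Because every segment names its destination, the receiver needs no intermediate reassembly buffer, which is the point of "without any copying".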


DDP and RDMA

• Remote Direct Memory Access (RDMA) extends the capabilities of DDP with two primary functions.

– It adds the ability to read from buffers registered to a socket (RDMA Read).

» This allows a client protocol to perform arbitrary, bidirectional data movement without involving the remote client.

» When RDMA is implemented in hardware, arbitrary data movement can be performed without involving the remote host CPU at all.

• Second, RDMA specifies a transport-independent untagged message service (Send) with characteristics that are both very efficient to implement in hardware, and convenient for client protocols.

– The RDMA architecture is patterned after the traditional model for device programming, where the client requests an operation using Send-like actions (programmed I/O), the server performs the necessary data transfers for the operation (DMA reads and writes), and notifies the client of completion.

» The programmed I/O+DMA model efficiently supports a high degree of concurrency and flexibility for both the client and server, even when operations have a wide range of intrinsic latencies.
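The programmed I/O + DMA pattern can be mimicked with plain Python objects. This is a toy model under stated assumptions: the `Server` class, the request dictionary fields, and the region names are invented for illustration, and ordinary slice assignment stands in for what real hardware would do as RDMA Write and RDMA Read.

```python
class Server:
    """Plays the device role: receives requests, moves data, reports completion."""

    def __init__(self, storage):
        self.storage = storage  # key -> bytes

    def handle_send(self, request, client_memory):
        region = client_memory[request["region"]]  # buffer the client advertised
        if request["op"] == "read":
            data = self.storage[request["key"]]
            if len(data) > len(region):
                return {"status": "too-large", "length": 0}
            region[:len(data)] = data          # stands in for an RDMA Write
            return {"status": "ok", "length": len(data)}
        if request["op"] == "write":
            # stands in for an RDMA Read of the client's buffer
            self.storage[request["key"]] = bytes(region)
            return {"status": "ok", "length": len(region)}
        return {"status": "unsupported", "length": 0}


client_memory = {"buf0": bytearray(16)}
server = Server({"block42": b"hello"})

# The client issues a Send-like request, then only waits for the completion.
completion = server.handle_send(
    {"op": "read", "key": "block42", "region": "buf0"}, client_memory)
print(completion, bytes(client_memory["buf0"][:completion["length"]]))
```

The client never copies the payload itself: it names a buffer, and the server side decides when and how to move the data.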


OpenFabrics Alliance

• The OpenFabrics Alliance is an international organization composed of industry, academic and research groups that have developed a unified core of open source software stacks (OpenSTAC) leveraging RDMA architectures for both the Linux and Windows operating systems over both InfiniBand and Ethernet.

– RDMA is a communications technique allowing data to be transmitted from the memory of one computer to the memory of another computer without passing through either device's CPU, without needing extensive buffering, and without calling into an operating system kernel

• The core OpenSTAC software supports all the well known standard upper layer protocols such as MPI, IP, SDP, NFS, SRP, iSER, and RDS on top of Ethernet and InfiniBand (IB) infrastructures

– The OpenFabrics software and supporting services better enable low-latency InfiniBand and 10 GbE to deliver clustered computing, server-to-server processing, visualization and file system access


OpenFabrics Software Stack

[Figure: the OpenFabrics software stack, layered from bottom to top as Hardware (InfiniBand HCA, iWARP R-NIC), Provider (hardware specific drivers), Mid-Layer (InfiniBand Verbs / API, R-NIC Driver API, MAD, SMA, SA Client, Connection Manager, Connection Manager Abstraction (CMA)), and Upper Layer Protocol (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, Cluster File Sys) in kernel space, then User APIs (User Level Verbs / API, UDAPL, SDP Library, User Level MAD API, OpenSM, Diag Tools) in user space. Application-level access methods for using the OF stack: IP based app access, sockets based access (IBM DB2), various MPIs, access to file systems, block storage access, clustered DB access (Oracle10g RAC). Components are keyed as Common, InfiniBand, or iWARP.]

Key to abbreviations: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Lib; HCA = Host Channel Adapter; R-NIC = RDMA NIC


IP over IB (IPoIB)

• IETF Standard for mapping Internet protocols to Infiniband

– IETF IPoIB Working Group

• Covers

– Fabric initialization

– Multicast/Broadcast

– Address resolution (IPv4/IPv6)

– IP Datagram encapsulation (IPv4/IPv6)

– MIBs


IP over IB (IPoIB)

• Communication Parameters

– Obtained from Subnet Manager (SM)

» P_Key (Partition Key)

» SL (Service Level)

» Path Rate

» Link MTU (for IPv6 can be reduced with router advert)

» GRH parameters – TClass, Flow Label, HopLimit

– Obtained from address resolution

» Data Link Layer Address (GID)

• Persistent Data Link layer address necessary

• Enables IB Routers to be deployed eventually

» QPN (queue pair number)
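The two values obtained from address resolution, the GID and the QPN, together act as the IPoIB link-layer address. A minimal sketch of packing them follows. The 20-octet layout used here (one reserved octet, a 24-bit QPN, then the 128-bit GID) is my reading of the IETF IPoIB encapsulation specification and is not stated on the slides, so treat it as an assumption; the function names and sample values are hypothetical.

```python
import ipaddress
import struct


def pack_link_address(qpn, gid):
    """Pack a QPN and a GID into a 20-octet IPoIB-style link-layer address."""
    if not 0 <= qpn < 1 << 24:
        raise ValueError("QPN must fit in 24 bits")
    # "!I" writes 4 big-endian octets; the top octet stays 0 (reserved).
    return struct.pack("!I", qpn) + ipaddress.IPv6Address(gid).packed


def unpack_link_address(raw):
    """Split a 20-octet address back into (QPN, GID)."""
    if len(raw) != 20:
        raise ValueError("expected 20 octets")
    (word,) = struct.unpack("!I", raw[:4])
    return word & 0xFFFFFF, ipaddress.IPv6Address(raw[4:])


# A GID has the same 128-bit format as an IPv6 address; these values are made up.
raw = pack_link_address(0x000048, "fe80::2:c903:0:1b7d")
qpn, gid = unpack_link_address(raw)
print(len(raw), hex(qpn), gid)
```

An address this long is one reason IPoIB cannot simply reuse Ethernet's 6-octet hardware address handling.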


IP over IB (IPoIB)

• Address Resolution

– IPv4

» ARP request is sent on Broadcast MGID

» ARP reply is unicast back and contains GID and QPN

– IPv6

» Neighbor discovery using all IP-hosts multicast address

» Existing RFCs

• Summary

– Feels like Ethernet with 2KB MTU

– Doesn’t utilize most of InfiniBand's custom hardware

» e.g., SAR, Reliable Transport, Zero Copy, RDMA Reads/Writes, Kernel Bypass

» SDP is the enhanced version


Sockets Direct Protocol (SDP)

• Based on Microsoft’s Winsock Direct Protocol

• SDP Feature Summary

– Maps sockets SOCK_STREAM to RDMA semantics

– Optimizations for transaction oriented protocols

– Optimizations for mixing of small and large messages

• Uses advanced Infiniband features

– Reliable Connected (RC) service

– Uses RDMA Writes, Reads, and Sends

– Supports Automatic Path Migration


SDP Terminology

• Data Source

– Side of connection which is sourcing the ULP data to be transferred

• Data Sink

– Side of connection which is receiving (sinking) the ULP data

• Data Transfer Mechanism

– To move ULP data from Data Source to Data Sink (e.g., Bcopy, Receiver Initiated Write Zcopy, Read Zcopy)

• Flow Control Mode

– State that the half connection is currently in (Combined, Pipelined, Buffered)

• Bcopy Threshold

– If message length is under threshold, use Bcopy mechanism. Threshold is locally defined.


SDP Modes

• Flow Control Modes restrict data transfer mechanisms

• Buffered Mode

– Used when receiver wishes to force all transfers to use the Bcopy Mechanism

• Combined Mode

– Used when receiver is not pre-posting buffers and uses peek/select interface (Bcopy or Read Zcopy, only one outstanding)

• Pipelined Mode

– Highly optimized transfer mode: multiple write or read buffers outstanding; can use all data transfer mechanisms (Bcopy, Read Zcopy, Receiver Initiated Write Zcopy)
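How the flow control mode and the Bcopy threshold interact can be sketched as a small selection function. The table mirrors the three modes listed above; the function name, the `prefer` parameter, and the fallback policy (drop to Bcopy when the preferred zero-copy mechanism is not permitted) are assumptions made for illustration, not behavior taken from the SDP specification.

```python
# Mechanisms each flow control mode permits, as listed on the slide.
ALLOWED = {
    "Buffered": ("Bcopy",),
    "Combined": ("Bcopy", "Read Zcopy"),
    "Pipelined": ("Bcopy", "Read Zcopy", "Receiver Initiated Write Zcopy"),
}


def choose_mechanism(mode, length, bcopy_threshold, prefer="Read Zcopy"):
    """Pick a data transfer mechanism for one message (illustrative policy)."""
    allowed = ALLOWED[mode]
    # Short messages always take the buffer-copy path.
    if length < bcopy_threshold:
        return "Bcopy"
    # Long messages use the preferred zero-copy mechanism if the mode allows it.
    if prefer in allowed:
        return prefer
    return "Bcopy"


threshold = 4096  # locally defined; this value is arbitrary
for mode in ALLOWED:
    print(mode, choose_mechanism(mode, 64, threshold),
          choose_mechanism(mode, 1 << 20, threshold))
```

The threshold only matters in modes that permit a zero-copy mechanism; in Buffered mode every message takes the Bcopy path regardless of length.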


SDP Terminology

• Enables buffer-copy when

– Transfer is short

– Application needs buffering

• Enables zero-copy when

– Transfer is long

[Figure: the Data Source user buffer and the Data Sink user buffer, each behind a CA, joined by an InfiniBand Reliable Connection (RC). The zero copy path moves data directly between the user buffers; the buffer copy path passes through an SDP private buffer pool (fixed size).]


Network File System (NFS)

• Network File System (NFS) is a protocol originally developed by Sun Microsystems in 1984 and defined in RFCs 1094, 1813, and 3530 (obsoletes 3010), as a distributed file system which allows a computer to access files over a network as easily as if they were on its local disks.

– NFS is one of many protocols built on the Open Network Computing Remote Procedure Call system (ONC RPC)

• Version 2 of the protocol

– originally operated entirely over UDP and was meant to keep the protocol stateless, with locking (for example) implemented outside of the core protocol

• Version 3 added:

– support for 64-bit file sizes and offsets, to handle files larger than 4GB;

– support for asynchronous writes on the server, to improve write performance;

– additional file attributes in many replies, to avoid the need to refetch them;

– a READDIRPLUS operation, to get file handles and attributes along with file names when scanning a directory;

– assorted other improvements.


Network File System (NFS)

• Version 4 (RFC 3530)

– Influenced by AFS and CIFS, includes performance improvements, mandates strong security, and introduces a stateful protocol. Version 4 was the first version developed with the Internet Engineering Task Force (IETF) after Sun Microsystems handed over the development of the NFS protocols.

• Various side-band protocols have been added to NFS, including:

– The byte-range advisory Network Lock Manager (NLM) protocol which was added to support System V UNIX file locking APIs.

– The remote quota reporting (RQUOTAD) protocol to allow NFS users to view their data storage quotas on NFS servers.

• WebNFS is an extension to Version 2 and Version 3 which allows NFS to be more easily integrated into Web browsers and to enable operation through firewalls.


SCSI RDMA Protocol (SRP)

• SRP defines a SCSI protocol mapping onto the InfiniBand Architecture and/or functionally similar cluster protocols

• RDMA Consortium voted to create iSER instead of porting SRP to IP

– SRP doesn’t have a wide following

– SRP doesn’t have a discovery or management protocol

– Version 2 of SRP hasn’t been updated for 1.5 years


iSCSI Extensions for RDMA (iSER)

• iSER combines SRP and iSCSI with new RDMA capabilities

• iSER is maintained as part of iSCSI in IETF

– Recently extended to IB by IBM, Voltaire, HP, EMC, and others

• Benefits of adding iSER to IB

– Uses (almost) the same storage protocol across all RDMA networks

» Easier to train staff

» Bridging products more straightforward

» Motivates the storage community toward an iSCSI/iSER mentality and may help with acceptance on IP

– Desire for a common Discovery and Management protocol across iSCSI, iSER/iWARP, and IP

» i.e., same Management and discovery process and software to handle IP networks and IB networks


iSCSI Extensions for RDMA (iSER)

• iSCSI’s main performance deficiencies stem from TCP/IP

– TCP is a complex protocol requiring significant processing

– Stream based, making it hard to separate data and headers

– Requires copies that increase latency and CPU overhead

– Uses checksums, requiring additional CRCs in the ULP

• iSER eliminates the bottlenecks through:

– Zero copy using RDMA

– CRC calculated by hardware

– Work with message boundaries instead of streams

– Transport protocol implemented in hardware (minimal CPU cycles per I/O)
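Why a byte stream makes it "hard to separate data and headers" is easy to see in a small model. The sketch below is not iSCSI: the 4-byte length prefix and the names `frame` and `StreamReassembler` are hypothetical stand-ins for whatever framing a stream-based protocol uses. It only shows the parsing and copying work that a message-oriented transport avoids.

```python
import struct

HEADER = struct.Struct("!I")  # hypothetical 4-byte big-endian length prefix


def frame(payload):
    """Prefix a payload with its length so it can be found again in a stream."""
    return HEADER.pack(len(payload)) + payload


class StreamReassembler:
    """Recovers message boundaries from arbitrarily split stream chunks."""

    def __init__(self):
        self._pending = bytearray()

    def feed(self, chunk):
        self._pending += chunk                      # copy into a staging buffer
        messages = []
        while len(self._pending) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._pending)
            end = HEADER.size + length
            if len(self._pending) < end:
                break                               # message not complete yet
            # second copy: out of the staging buffer, into the message
            messages.append(bytes(self._pending[HEADER.size:end]))
            del self._pending[:end]
        return messages


stream = frame(b"read-cmd") + frame(b"data" * 3)
reassembler = StreamReassembler()
received = []
for i in range(0, len(stream), 5):       # deliver in arbitrary 5-byte chunks
    received.extend(reassembler.feed(stream[i:i + 5]))
print(received)
```

With a message-based transport each receive completion already corresponds to one message, so neither the staging buffer nor the boundary search is needed.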


iSCSI Extensions for RDMA (iSER)

• iSER leverages iSCSI management, discovery, and RAS

– Zero-Configuration, Discovery and global storage name server (SLP, iSNS)

– Change Notifications and active monitoring of devices and initiators

– High-Availability, and 3 levels of automated recovery

– Multi-pathing and storage aggregation

– Industry standard management interfaces (MIB)

– 3rd party storage managers

– Security: Partitioning, Authentication, Central login control, etc.

• Working with iSER over IB doesn’t require any changes

– Focused effort from both communities

• More advanced than SRP


iSCSI Extensions for RDMA (iSER)

• iSCSI specification:

– http://www.ietf.org/rfc/rfc3720.txt

• iSER and DA Introduction

– http://www.rdmaconsortium.org/home/iSER_DA_intro.pdf

• iSER specification

– http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-05.txt

• iSER over IB Overview

– http://www.haifa.il.ibm.com/satran/ips/iSER-in-an-IB-network-V9.pdf


Reliable Datagram Sockets (RDS)

• Goals

– Provide reliable datagram service

» performance

» scalability

» high availability

» simplify application code

– Maintain sockets API

» application code portability

» faster time-to-market


Reliable Datagram Sockets (RDS)

[Figure: Stack Overview. User space: Oracle 10g, socket applications, UDP applications. Kernel: TCP, UDP, SDP, and RDS, with IP and IPoIB, above the OpenIB Access Layer and the Host Channel Adapter.]


Reliable Datagram Sockets (RDS)

• Application connectionless

– RDS maintains node-to-node connection

– IP addressing

– Uses CMA

– On-demand connection setup

» connect on first sendmsg() or data recv

» disconnect on error or policy like inactivity

– Connection setup/teardown transparent to applications
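The on-demand behavior can be modeled without any real networking. The class below is a toy: `RdsNode`, its fields, and the sample address are invented for illustration, and a dictionary entry stands in for the node-to-node connection that RDS would set up through the CMA.

```python
class RdsNode:
    """Toy model: the application just sends; connections appear as needed."""

    def __init__(self):
        self.connections = {}   # peer IP -> connection state
        self.setups = 0         # how many connection setups have happened

    def _connect(self, peer_ip):
        self.setups += 1
        conn = {"peer": peer_ip, "sent": []}
        self.connections[peer_ip] = conn
        return conn

    def sendmsg(self, peer_ip, payload):
        conn = self.connections.get(peer_ip)
        if conn is None:                 # first send to this node: connect now
            conn = self._connect(peer_ip)
        conn["sent"].append(payload)
        return len(payload)

    def disconnect(self, peer_ip):
        """Called on error or by policy (e.g. inactivity); invisible to the app."""
        self.connections.pop(peer_ip, None)


node = RdsNode()
node.sendmsg("192.0.2.10", b"one")
node.sendmsg("192.0.2.10", b"two")     # reuses the existing connection
node.disconnect("192.0.2.10")          # e.g. an inactivity policy fired
node.sendmsg("192.0.2.10", b"three")   # transparently reconnects
print(node.setups)
```

The application code is the same three `sendmsg()` calls whether or not a connection existed, which is what "transparent to applications" means here.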


Reliable Datagram Sockets (RDS)

• Data and Control Channel

– Uses RC QP for node level connections

– Data and Control QPs per session

– Selectable MTU

– b-copy send/recv

– H/W flow control


The End


RDS - Send

• Connection established on first send

• sendmsg()

– allows send pipelining

• ENOBUFS returned if insufficient send buffers; application retries
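The "application retries" step might look like the loop below. It is a sketch under assumptions: `send_with_retry` and `FlakySender` are hypothetical names, the sender is a stand-in that fails a fixed number of times, and the error is modeled with the standard `ENOBUFS` errno constant rather than a real RDS socket.

```python
import errno
import time


def send_with_retry(send, payload, attempts=5, delay=0.01):
    """Call send(payload), retrying while it reports exhausted send buffers."""
    for _ in range(attempts):
        try:
            return send(payload)
        except OSError as exc:
            if exc.errno != errno.ENOBUFS:
                raise                    # some other failure: do not mask it
            time.sleep(delay)            # give the sender time to drain
    raise OSError(errno.ENOBUFS, "send buffers still exhausted after retries")


class FlakySender:
    """Stand-in for a socket send that lacks buffers for the first few calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, payload):
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError(errno.ENOBUFS, "no send buffers available")
        return len(payload)


sender = FlakySender(failures=2)
sent = send_with_retry(sender, b"datagram", delay=0)
print(sent, sender.calls)
```

Bounding the number of attempts keeps a persistently congested destination from hanging the caller.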


RDS - Receive

• Identical to UDP recvmsg()

– similar blocking/non-blocking behavior

• “Slow” receiver ports are stalled at sender side

– combination of activity (LRU) and memory utilization used to detect slow receivers

– sendmsg() to stalled destination port returns EWOULDBLOCK, application can retry

» Blocking socket can wait for unblock

– recvmsg() on a stalled port un-stalls it
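The stall and un-stall cycle can be illustrated with a toy model. Note the simplification: the slides say slow receivers are detected from activity (LRU) and memory utilization, whereas this sketch uses a simple queue-depth limit as a stand-in. The class name and the limit are hypothetical.

```python
import errno


class StalledPortModel:
    """Toy destination port that stalls when its queue reaches a limit."""

    def __init__(self, limit):
        self.limit = limit
        self.queue = []
        self.stalled = False

    def sendmsg(self, payload):
        if self.stalled:
            # Non-blocking behavior: the caller is told to retry later.
            raise BlockingIOError(errno.EWOULDBLOCK, "destination port stalled")
        self.queue.append(payload)
        if len(self.queue) >= self.limit:   # stand-in for slow-receiver detection
            self.stalled = True
        return len(payload)

    def recvmsg(self):
        self.stalled = False                # receiving un-stalls the port
        return self.queue.pop(0) if self.queue else None


port = StalledPortModel(limit=2)
port.sendmsg(b"a")
port.sendmsg(b"b")                          # queue is now full: port stalls
try:
    port.sendmsg(b"c")
except BlockingIOError:
    print("stalled; retry later")
print(port.recvmsg())                       # receiver catches up
port.sendmsg(b"c")                          # accepted again
```

Stalling at the sender keeps one slow consumer from tying up buffers that other destinations on the same node-to-node connection need.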


RDS - High Availability (failover)

• Use of RC and on-demand connection setup allows HA

– connection setup/teardown transparent to applications

– every sendmsg() could “potentially” result in a connection setup

– if a path fails, connection is torn down, next send can connect on an alternate path (different port or different HCA)
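Failover falls out of the same on-demand mechanism, as the following toy model suggests. The class `FailoverSender`, the path labels, and the "first healthy path" selection rule are assumptions for illustration; the slides do not say how an alternate path is chosen.

```python
class FailoverSender:
    """Toy model: tear down on path failure, reconnect on the next send."""

    def __init__(self, paths):
        self.paths = list(paths)   # candidate paths, in preference order
        self.failed = set()
        self.active = None         # path of the current connection, if any

    def path_failed(self, path):
        self.failed.add(path)
        if self.active == path:
            self.active = None     # connection is torn down

    def sendmsg(self, payload):
        if self.active is None:    # this send triggers a connection setup
            healthy = [p for p in self.paths if p not in self.failed]
            if not healthy:
                raise ConnectionError("no usable path")
            self.active = healthy[0]
        return self.active         # report which path carried the message


sender = FailoverSender(["hca0/port1", "hca0/port2", "hca1/port1"])
print(sender.sendmsg(b"m1"))
sender.path_failed("hca0/port1")
print(sender.sendmsg(b"m2"))       # same call, different path
```

The application issues identical `sendmsg()` calls before and after the failure; only the path underneath changes.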


Preliminary performance RDS on OpenIB

[Figure: netperf (UDP_STREAM) throughput in Mbits/sec (axis 0 to 4000) versus msg size (2k to 64K bytes) for UDP GbE, UDP ipoib send, UDP ipoib recv, and Rds (send = recv)]

* Dual 2.4GHz Xeon, 2G memory, 4x PCI-X HCA

** Sdp ~3700Mb/sec TCP_STREAM


Preliminary performance RDS on OpenIB

[Figure: netperf (UDP_STREAM) throughput in Mbits/sec (axis 0 to 4000) versus msg size (2k to 64K bytes) for UDP GbE, UDP ipoib recv, and Rds (send = recv)]

* Dual 2.4GHz Xeon, 2G memory, 4x PCI-X HCA

** Sdp ~3700Mb/sec TCP_STREAM


Preliminary performance RDS on OpenIB

[Figure: Latency in usec (axis 0 to 500) versus msg size (4 to 32768 bytes) for UDP GigE, UDP ipoib, and Rds]