
Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet

A Tutorial at Supercomputing '09

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Pavan Balaji
Argonne National Laboratory
E-mail: balaji@mcs.anl.gov
http://www.mcs.anl.gov/~balaji

Matthew Koop
NASA Goddard
E-mail: matthew.koop@nasa.gov
http://www.cse.ohio-state.edu/~koop

Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The Open Fabrics Software Stack

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A

Supercomputing '09

HPC Clusters and Applications

• Multi-core processors becoming increasingly common

• System scales growing rapidly
– Small (~256 cores – 8/16 nodes with 8/16 cores each)
– Medium (~1K cores – 32/64 nodes with 8/16 cores each)
– Large (~10K cores – 320/640 nodes with 8/16 cores each)
– Huge (~100K cores – 3200/6400 nodes with 8/16 cores each)

• Large range of applications
– Scientific
– Commercial

• Diverse computation and communication characteristics

• Diverse scaling requirements

Supercomputing '09

Trends for Computing Clusters in the Top 500 List

• Top 500 list of Supercomputers (www.top500.org)

Top 500 List

Jun. 2001: 33/500 (6.6%)    Nov. 2005: 360/500 (72.0%)
Nov. 2001: 43/500 (8.6%)    Jun. 2006: 364/500 (72.8%)
Jun. 2002: 80/500 (16.0%)   Nov. 2006: 361/500 (72.2%)
Nov. 2002: 93/500 (18.6%)   Jun. 2007: 373/500 (74.6%)
Jun. 2003: 149/500 (29.8%)  Nov. 2007: 406/500 (81.2%)
Nov. 2003: 208/500 (41.6%)  Jun. 2008: 400/500 (80.0%)
Jun. 2004: 291/500 (58.2%)  Nov. 2008: 410/500 (82.0%)
Nov. 2004: 294/500 (58.8%)  Jun. 2009: 410/500 (82.0%)
Jun. 2005: 304/500 (60.8%)  Nov. 2009: To be announced

Supercomputing '09

PetaFlop to ExaFlop Computing

10 PFlops in 2011

100 PFlops in 2015

Supercomputing '09

Expected to have an ExaFlop system in 2018-2019 !

Integrated High-End Computing Environments

[Diagram: a compute cluster (compute nodes and a frontend node) connects over a LAN to a storage cluster (a meta-data manager holding meta-data and I/O server nodes holding data), and over a LAN/WAN to a datacenter for visualization and mining (routers/servers, application servers and database servers across Tiers 1-3, interconnected by switches). Different requirements exist for each type of HEC system.]

Supercomputing '09

Programming Models for HPC Clusters

• Message Passing Interface (MPI) is the de-facto standard
– MPI 1.0 was the initial standard; MPI 2.2 is the latest

– MPI 3.0 is under discussion in the MPI Forum

• Other models coming up as well

– Traditional Partitioned Global Address Space Models (Global Arrays, UPC, Coarray Fortran)

– HPCS languages (X10, Chapel, Fortress)

Supercomputing '09

Networking Requirements for HPC Clusters

• Different Communication Patterns
– Point-to-point
• Low latency, high bandwidth, low CPU usage, overlap of computation & communication

– Scalable collective or "group" communication
• Broadcast, barrier, reduce, all-reduce, all-to-all, etc.

– Support for concurrent multi-pair communication at the NIC
• Emerging multi-core architecture

• Reduced network contention and congestion
– Good routing scheme
– Efficient congestion control management

• Reliability and Fault Tolerance
– Failure detection and recovery (network and process level)

– Ease of administration

Supercomputing '09

Sample Diagram of State-of-the-Art File Systems

• Sample file systems:
– Lustre, Panasas, GPFS, Sistina/Redhat GFS

– PVFS, Google File System, Oracle Cluster File System (OCFS2)

[Diagram: computing nodes and I/O servers connected through a network, with a metadata server attached to the same network.]

Supercomputing '09

Networking Requirements for Storage Clusters

• Several similar to HPC Clusters

– Low latency, high bandwidth, low CPU utilization

– Reduced network contention & congestion

– Reliability and Fault Tolerance

• Several unique challenges as well
– Aggregate bandwidth becomes very important
• Systems typically use fewer I/O nodes than compute nodes

– Quality of Service (QoS)
• The same set of file servers support multiple "client applications"

– Network Unification
• Need to work with both compute systems & "standard" storage protocols (Fibre Channel, iSCSI)

Supercomputing '09

Enterprise Datacenter Environments

[Diagram: clients send requests over the WAN to a proxy web-server (Apache), which forwards them to an application server (PHP) and a database server (MySQL) backed by storage; computation and communication requirements grow toward the back-end tiers.]

• Requests are received from clients over the WAN

• Proxy nodes perform caching, load balancing, resource monitoring, etc.
– If not cached, request is forwarded to the next tier (application server)

• Application server performs the business logic (CGI, Java servlets)
– Retrieves appropriate data from the database to process the requests

Supercomputing '09

Networking Requirements for Datacenters

• Support for large number of communication streams

• Heterogeneous compute requirements
– Front-end servers are typically much less loaded than back-end servers; communication needs to be handled in a load-resilient manner

• Quality of Service (QoS)
– The same set of servers respond to many clients

• Network Virtualization
– Zero-overhead communication in virtual environments

– Efficient Migration

Supercomputing '09

Networking Requirements for Integrated Environments

• High performance WAN-level communication

– Low latency

– High bandwidth (unidirectional and bidirectional)

– Low CPU utilization

• Good performance in spite of delays– Out-of-order messages

– Buffering requirements to keep high-bandwidth pipes full

• Seamless integration between LAN and WAN protocols

• Resiliency to network failure

Supercomputing '09

Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The Open Fabrics Software Stack Usage

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A

Supercomputing '09

A Typical IB Network

Three primary components

Channel Adapters

Switches/Routers

Links and connectors

Supercomputing '09

Communication and Management Semantics

• Two forms of communication semantics

– Channel semantics (Send/Recv)

– Memory semantics (RDMA, Atomic operations)

• Management model
– A detailed management model complete with managers, agents, messages and protocols

• Verbs Interface
– A low-level programming interface for performing communication as well as management

Supercomputing '09

Communication in the Channel Semantics (Send/Receive Model)

[Diagram: on each side, a processor and memory (send buffer / receive buffer) sit above an InfiniBand device with a QP (send and receive queues) and a CQ; the devices exchange the data and a hardware ACK.]

Processor is involved only to:
1. Post receive WQE
2. Post send WQE
3. Pull out completed CQEs from the CQ

Send WQE contains information about the send buffer.

Receive WQE contains information on the receive buffer; incoming messages have to be matched to a receive WQE to know where to place the data.

Supercomputing '09

Communication in the Memory Semantics (RDMA Model)

[Diagram: same setup as the previous slide; the initiator's InfiniBand device writes directly into the target's memory and receives a hardware ACK.]

Initiator processor is involved only to:
1. Post send WQE
2. Pull out completed CQE from the send CQ

No involvement from the target processor

Send WQE contains information about the send buffer and the receive buffer.

Supercomputing '09

Basic iWARP Capabilities

• Supports most of the communication features supported by IB (with minor differences)
– Hardware acceleration, RDMA, Multicast, QoS

• Lacks some features
– E.g., Atomic operations

• … but supports some other features
– Out-of-Order data placement (useful for iSCSI semantics)

– Fine-grained data rate control (very useful for long-haul networks)

– Fixed bandwidth QoS (more details later)

Supercomputing '09

IB and 10GE: Commonalities and Differences

                        IB                          iWARP/10GE
Hardware Acceleration   Supported                   Supported (for TOE and iWARP)
RDMA                    Supported                   Supported (for iWARP)
Atomic Operations       Supported                   Not supported
Multicast               Supported                   Supported
Data Placement          Ordered                     Out-of-order (for iWARP)
Data Rate-control       Static and Coarse-grained   Dynamic and Fine-grained (for TOE and iWARP)
QoS                     Prioritization              Prioritization and Fixed Bandwidth QoS

Supercomputing '09

One-way Latency: MPI over IB

[Charts: small-message and large-message MPI latency for MVAPICH over several IB adapters (InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2, Qlogic-DDR-PCIe2). Small-message latencies of the different adapters are 1.06, 1.28, 1.49, 2.19 and 2.77 us.]

Supercomputing '09

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch

ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back

Bandwidth: MPI over IB

[Charts: unidirectional and bidirectional MPI bandwidth for MVAPICH over the same set of IB adapters. Peak unidirectional bandwidths shown: 936.5, 1389.4, 1399.8, 1952.9 and 3022.1 MillionBytes/s; peak bidirectional bandwidths shown: 1519.8, 2457.4, 2718.3, 3621.4 and 5553.5 MillionBytes/s.]

Supercomputing '09

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch

ConnectX-QDR-PCIe2: 2.4 GHz Quad-core (Nehalem) Intel with IB switch

One-way Latency: MPI over iWARP

[Chart: one-way MPI latency over the Chelsio 10GE adapter for messages from 1 byte to 4 KB; about 15.47 us with host-based TCP/IP versus about 6.88 us with iWARP.]

Supercomputing '09

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch

Bandwidth: MPI over iWARP

[Charts: unidirectional and bidirectional MPI bandwidth over the Chelsio 10GE adapter. Unidirectional: about 839.8 MillionBytes/s with TCP/IP versus 1231.8 with iWARP; bidirectional: about 855.3 with TCP/IP versus 2260.8 with iWARP.]

Supercomputing '09

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch

Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The Open Fabrics Software Stack Usage

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A

Supercomputing '09

Advanced Capabilities in InfiniBand

Complete Hardware Implementations Exist

Supercomputing '09

Basic IB Capabilities at Each Protocol Layer

• Link Layer

– CRC-based data integrity, Buffering and Flow-control, Virtual Lanes, Service Levels and QoS, Switching and Multicast, WAN capabilities

• Network Layer
– Routing and Flow Labels

• Transport Layer
– Reliable Connection, Unreliable Datagram, Reliable Datagram and Unreliable Connection

– Shared Receive Queue and eXtended Reliable Connections (discussed in more detail later)

Supercomputing '09

Advanced Capabilities in IB

• Link Layer

– Congestion Control

• Network Layer
– Multipathing Capability and Automatic Path Migration

• Transport Layer
– Shared Receive Queues, eXtended Reliable Connection

– Data Segmentation, Transaction Ordering

– Message-level End-to-End Flow Control

– Static Rate Control and Auto-Negotiation

• Management Tools

Supercomputing '09

Congestion Control

• Switch detects congestion on a link

– Detects whether it is the root or victim of congestion

• IB follows a three-step protocol
– Forward Explicit Congestion Notification (FECN)
• Used to communicate congested port status
• Switch sets the FECN bit, marking packets leaving the congested state

– Backward Explicit Congestion Notification (BECN)
• Destination sends BECN to the sender informing it about congestion

– Injection Rate Control (Throttling)
• Source throttles its send rate temporarily (timer based)
• The original injection rate is restored over time
• Congestion control may be performed per QP or SL

• Pro-active: does not wait for packet drops to occur

Supercomputing '09

Multipathing Capability

• Similar to basic switching, except…

– … sender can utilize multiple LIDs associated to the same destination port

• Packets sent to one DLID take a fixed path
• Different packets can be sent using different DLIDs
• Each DLID can have a different path (switch can be configured differently for each DLID)

• Can cause out-of-order arrival of packets
– IB uses a simplistic approach:
• If packets in one connection arrive out-of-order, they are dropped

– Easier to use different DLIDs for different connections (see the sketch below)

Supercomputing '09
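As a rough illustration of how a library can select one of several paths, the sketch below (not from the tutorial; names such as remote_base_lid, remote_qpn, remote_psn and path_index are assumed) sets the destination LID during the INIT-to-RTR transition of an RC QP. With LMC = n the destination port answers to 2^n consecutive LIDs, so choosing a different offset selects a different route through the fabric.

/* assumes <infiniband/verbs.h> and <string.h> are included */
struct ibv_qp_attr attr;
memset(&attr, 0, sizeof(attr));
attr.qp_state              = IBV_QPS_RTR;
attr.path_mtu              = IBV_MTU_2048;
attr.dest_qp_num           = remote_qpn;
attr.rq_psn                = remote_psn;
attr.max_dest_rd_atomic    = 1;
attr.min_rnr_timer         = 12;
attr.ah_attr.is_global     = 0;
attr.ah_attr.dlid          = remote_base_lid + path_index;  /* path_index in [0, 2^LMC) */
attr.ah_attr.src_path_bits = path_index;                    /* source side of the same path */
attr.ah_attr.sl            = 0;
attr.ah_attr.port_num      = 1;

ret = ibv_modify_qp(qp, &attr,
                    IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
                    IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);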

Automatic Path Migration

• Automatically utilizes IB multipathing for network fault-tolerance

• Enables migrating connections to a different path

– Connection recovery in the case of failures

– Optional Feature

• Available for RC, UC, and RD

• Reliability guarantees for the service type are maintained during migration (a minimal sketch of arming APM follows)
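A minimal sketch, assuming a connected RC QP and an alternate DLID (alt_dlid) obtained out-of-band, of how an MPI library or application can load an alternate path and arm APM through the verbs interface; the exact QP state in which these attributes may be set can be HCA-dependent.

/* assumes <infiniband/verbs.h> and <string.h> are included */
struct ibv_qp_attr attr;
memset(&attr, 0, sizeof(attr));
attr.alt_ah_attr.dlid     = alt_dlid;       /* second path to the same destination */
attr.alt_ah_attr.sl       = 0;
attr.alt_ah_attr.port_num = 1;
attr.alt_port_num         = 1;
attr.alt_timeout          = 14;
attr.path_mig_state       = IBV_MIG_REARM;  /* HCA migrates to the alternate path on failure */

ret = ibv_modify_qp(qp, &attr, IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);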

Supercomputing '09

Advanced Capabilities in IB

• Link Layer

– Congestion Control

• Network Layer
– Multipathing Capability and Automatic Path Migration

• Transport Layer
– Shared Receive Queues, eXtended Reliable Connection

– Data Segmentation, Transaction Ordering

– Message-level End-to-End Flow Control

– Static Rate Control and Auto-Negotiation

• Management Tools

Supercomputing '09

Shared Receive Queues (SRQs)

[Diagram: with ordinary receive queues each QP of a process has its own receive queue; with an SRQ, many QPs of the process feed from one shared receive queue.]

• Shared Receive Queues allow multiple QPs to share a single Receive Queue

• Allows much better scalability for applications/libraries that pre-post to receive queues (a minimal creation sketch follows)
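A minimal sketch (assuming a protection domain pd and completion queue cq created elsewhere; queue depths are arbitrary example values) of creating an SRQ with ibv_create_srq() and attaching QPs to it at creation time:

/* assumes <infiniband/verbs.h> and <string.h> are included */
struct ibv_srq_init_attr srq_attr;
memset(&srq_attr, 0, sizeof(srq_attr));
srq_attr.attr.max_wr  = 1024;              /* receive buffers shared by all QPs */
srq_attr.attr.max_sge = 1;
struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);   /* NULL on failure */

/* when creating each QP, point it at the SRQ instead of a private receive queue */
struct ibv_qp_init_attr qp_attr;
memset(&qp_attr, 0, sizeof(qp_attr));
qp_attr.qp_type          = IBV_QPT_RC;
qp_attr.send_cq          = cq;
qp_attr.recv_cq          = cq;
qp_attr.srq              = srq;            /* many QPs can share this SRQ */
qp_attr.cap.max_send_wr  = 64;
qp_attr.cap.max_send_sge = 1;
struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);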

Supercomputing '09

Flow Control with SRQs

• SRQ has link-level flow-control, but no E2E flow-control

– Problem: How will the sender know how many messages to send (buffers are shared by all receivers)?
• Solution: SRQ limit event functionality can be used to post additional buffers

• Limit event functionality
– Receiver requests the network for an interrupt when there are fewer than some number of buffers left (low watermark)
– Problem: What if you have a burst of messages and buffers get used up before you can post more?
• Solution: Sender will just retransmit (handled in hardware)
• Can lose some performance, but this is a very rare case!

Supercomputing '09

Low Watermark Method

• Upon receiving a low watermark (SRQ limit event) the receiver can post additional buffers

[Timeline: the receiver posts 10 buffers and sets a limit of 4; senders 1 and 2 consume buffers; when 4 remain, the limit event fires and a thread posts 6 more.]

• If the SRQ is exhausted, the send operation does not complete until the receiver has posted more buffers
– Sender hardware gets Receiver Not Ready (RNR NAK)

A minimal sketch of arming and handling the limit event is shown below.
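This sketch assumes the SRQ and its ibv_context (ctx) already exist, and post_more_recv_buffers() is a hypothetical helper that reposts receive WRs in a loop with ibv_post_srq_recv():

/* assumes <infiniband/verbs.h> and <string.h> are included */
struct ibv_srq_attr attr;
memset(&attr, 0, sizeof(attr));
attr.srq_limit = 4;                            /* low watermark, in outstanding WRs */
ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);     /* arm the limit event */

struct ibv_async_event event;
if (!ibv_get_async_event(ctx, &event)) {       /* blocks until an async event arrives */
    if (event.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
        post_more_recv_buffers(srq);           /* hypothetical helper: ibv_post_srq_recv() loop */
        /* the limit is disarmed after firing; re-arm it for the next burst */
        ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }
    ibv_ack_async_event(&event);
}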

Supercomputing '09

SRQ Buffer Usage

• Buffers are consumed in the order received from RQs and SRQs

• Libraries such as MPI are forced to post fixed sized buffers – leading to potential waste

• Multiple SRQs, each with a different size?

[Illustration: a 1 KB MPI message lands in a posted 8 KB buffer, leaving 7 KB unused.]

Supercomputing '09

Multiple SRQs

[Diagram: a process keeps one SRQ with 8 KB buffers and another with 4 KB buffers; for every peer it needs one QP per SRQ.]

• Each additional SRQ multiplies the number of QPs required
– Each QP can only be associated with one SRQ

• Not a scalable solution since many QPs are needed for each process

Supercomputing '09

eXtended Reliable Connection (XRC)

M = number of nodes
P = number of processes per node

• Instead of connecting processes, connect processes to a node

• Each process needs only a single QP to a node

• In the best case, M-1 QPs per process are needed for a fully-connected setup

• M*(M-1)*P QPs in total, vs. M²P² – M*P for RC

Supercomputing '09

XRC Addressing

• XRC uses SRQ Numbers (SRQN) to direct where an operation should complete

[Diagram: processes 0 and 1 on one node send to SRQ #1 and SRQ #2 on the remote node, which belong to processes 2 and 3 respectively.]

• Hardware does all routing of data, so p2 is not actually involved in the data transfer

• Connections are not bi-directional, so p3 cannot send to p0

Supercomputing '09

Multiple SRQs with XRC

[Diagram: with RC, a process needs one QP per SRQ per peer process; with XRC, it needs one QP per node, and the SRQ number selects the 8 KB or 4 KB buffer pool.]

• Addressing by SRQ number also allows multiple SRQs per process without additional memory resources

• Potential to use less memory in applications and libraries

Supercomputing '09

Advanced Capabilities in IB

• Link Layer

– Congestion Control

• Network Layer
– Multipathing Capability and Automatic Path Migration

• Transport Layer
– Shared Receive Queues, eXtended Reliable Connection

– Data Segmentation, Transaction Ordering

– Message-level End-to-End Flow Control

– Static Rate Control and Auto-Negotiation

• Management Tools

Supercomputing '09

Data Segmentation

• Application can hand over a large message

– Network adapter segments it into MTU-sized packets

– Single notification when the entire message is transmitted or received (not per packet)

• Reduced host overhead to send/receive messages

– Depends on the number of messages, not the number of bytes
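As a concrete illustration (numbers chosen for this example, not from the slides): sending a 1 MB message over a path with a 2048-byte MTU causes the HCA to emit 1 MB / 2 KB = 512 packets on the wire, yet the host posts a single send WQE and consumes a single CQE; a host-based protocol stack would typically pay per-segment processing costs instead.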

Supercomputing '09

Transaction Ordering

• IB follows a strong transaction ordering for RC


• Sender network adapter transmits messages in the order in which WQEs were posted

• Each QP utilizes a single LID– All WQEs posted on same QP take the same path

– All packets are received by the receiver in the same order

– All receive WQEs are completed in the order in which they were posted

Supercomputing '09

Message-level Flow-Control

• Also called End-to-end Flow-control

– Does not depend on the number of network hops

• Separate from Link-level Flow-Control
– Link-level flow-control only relies on the number of bytes being transmitted, not the number of messages

– Message-level flow-control only relies on the number of messages transferred, not the number of bytes

• If 5 receive WQEs are posted, the sender can send 5 messages (can post 5 send WQEs)
– If the sent messages are larger than the receive buffers that are posted, flow-control cannot handle it

Supercomputing '09

Static Rate Control and Auto-Negotiation

• IB allows link rates to be statically changed

– On a 4X link, we can set data to be sent at 1X

– For heterogeneous links, the rate can be set to the lowest link rate

– Useful for low-priority traffic

• Auto-negotiation also available
– E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at 1X rate

• Only fixed settings available
– Cannot set the rate requirement to 3.16 Gbps, for example (see the sketch below)
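A minimal sketch (assuming remote_lid is known and the address-handle attributes are later passed to ibv_create_ah() or ibv_modify_qp()) of requesting a reduced static rate through the verbs ibv_ah_attr structure; only the enumerated ibv_rate values are available:

/* assumes <infiniband/verbs.h> and <string.h> are included */
struct ibv_ah_attr ah_attr;
memset(&ah_attr, 0, sizeof(ah_attr));
ah_attr.dlid        = remote_lid;
ah_attr.port_num    = 1;
ah_attr.static_rate = IBV_RATE_2_5_GBPS;   /* e.g., send at 1X rate on a 4X DDR link */
/* ... use ah_attr with ibv_create_ah(pd, &ah_attr) or as part of a QP path ... */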

Supercomputing '09

Advanced Capabilities in IB

• Link Layer

– Congestion Control

• Network Layer
– Multipathing Capability and Automatic Path Migration

• Transport Layer
– Shared Receive Queues, eXtended Reliable Connection

– Data Segmentation, Transaction Ordering

– Message-level End-to-End Flow Control

– Static Rate Control and Auto-Negotiation

• Management Tools

Supercomputing '09

InfiniBand Management & Tools

• Subnet Management

• Diagnostic Tools

– System Discovery Tools

– System Health Monitoring Tools

– System Performance Monitoring Tools

Supercomputing '09

Concepts in IB Management

• Agents

– Processes or hardware units running on each adapter, switch, router (everything on the network)
– Provide capability to query and set parameters

• Managers
– Make high-level decisions and implement them on the network fabric using the agents

• Messaging schemes
– Used for interactions between the manager and agents (or between agents)

• Messages

Supercomputing '09

InfiniBand Management

• All IB management happens using packets called Management Datagrams

– Popularly referred to as "MAD packets"

• Four major classes of management mechanisms

– Subnet Management

– Subnet Administration

– Communication Management

– General Services

Supercomputing '09

Subnet Management & Administration

• Consists of at least one subnet manager (SM) and several subnet management agents (SMAs)
– Each adapter, switch, router has an agent running

– Communication between the SM and agents or between agents happens using MAD packets called Subnet Management Packets (SMPs)

• SM's responsibilities include:
– Discovering the physical topology of the subnet
– Assigning LIDs to the end nodes, switches and routers
– Populating switches and routers with routing paths
– Subnet sweeps to discover topology changes

Supercomputing '09

Subnet Manager

[Diagram: compute nodes connected by switches with active and inactive links; a node issues a multicast join, and the subnet manager performs the multicast setup on the switches.]

Supercomputing '09

Subnet Manager Sweep Behavior

• SM can be configured to sweep once or continuously

• On the first sweep:
– All ports are assigned LIDs

– All routes are set up on the switches

• On subsequent sweeps:
– If there has been any change to the topology, the appropriate routes are updated

– If DLID X is down, the packet is not sent all the way
• The first hop will not have a forwarding entry for LID X

• Sweep time is configured by the system administrator
– Cannot be too high or too low

Supercomputing '09

Subnet Manager Scalability Issues

• A single subnet manager has issues on large systems

– Performance and overhead of scanning
• Hardware implementations on switches are faster, but will work only for small systems (memory usage)
• Software implementations are more popular (OpenSM)

– Fault tolerance
• There can be multiple SMs
• During initialization only one should be active; once started, other SMs can handle different network portions

• Asynchronous events specified to improve scalability
– E.g., TRAPs are events sent by an agent to the SM when a link goes down

Supercomputing '09

Multicast Group Management

• Creation, joining/leaving, and deleting multicast groups occur as SA requests
– The requesting node sends a request to the SA

– The SA sends MAD packets to SMAs on the switches to set up routes for the multicast packets
• Each switch contains information on which ports to forward the multicast packet to

• Multicast itself does not go through the subnet manager
– Only the setup and teardown goes through the SM (a minimal sketch of the node-side attach step follows)
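On the node side, once the SA join has returned the group's MGID and MLID (for example via librdmacm's rdma_join_multicast()), the UD QP is attached to the group in hardware. A minimal sketch, with mgid and mlid assumed to come from that join response:

/* assumes <infiniband/verbs.h> is included */
union ibv_gid mgid;      /* multicast GID of the group (from the SA join) */
uint16_t      mlid;      /* multicast LID assigned by the SM/SA */
/* ... fill mgid/mlid from the join response ... */
if (ibv_attach_mcast(qp, &mgid, mlid))       /* non-zero on failure */
    /* handle error */;
/* later: ibv_detach_mcast(qp, &mgid, mlid) when leaving the group */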

Supercomputing '09

General Services

• Several general service management features provided by the standard
– Performance Management
• Several required and optional performance counters
• Flow control counters, RNR counters, number of sent and received packets

– Hardware Management
• Baseboard Management
• Device Management
• SNMP Tunneling
• Vendor Specific
• Application Specific

Supercomputing '09

InfiniBand Management & Tools

• Subnet Management

• Diagnostic Tools

– System Discovery Tools

– System Health Monitoring Tools

– System Performance Monitoring Tools

Supercomputing '09

Tools to Analyze InfiniBand Networks

• Different types of tools exist:

– High-level tools that internally talk to the subnet manager using management datagrams

– Each hardware device exposes a few mandatory counters and a number of optional (sometimes vendor-specific) counters

• Possible to write your own tools based on the management datagram interface
– Several vendors provide such IB management tools

Supercomputing '09

Network Discovery Tools

• Starting with almost no knowledge about the system, we can identify several details of the network configuration
– Example tools include:

• ibhosts: finds all the network adapters in the system

• ibswitches: finds all the network switches in the system

• ibnetdiscover: finds the connectivity between the ports

• … and many others exist

– Possible to write your own tools based on the management datagram interface
• Several vendors provide such IB management tools

Supercomputing '09

Discovering Network Adapters

% ibhosts
Ca : 0x0002c9020023c314 ports 2 " HCA-2"
Ca : 0x0002c9020023c05c ports 2 " HCA-2"
Ca : 0x0002c9020023c0e8 ports 2 " HCA-2"
Ca : 0x0002c9020023c178 ports 2 " HCA-2"
Ca : 0x0002c9020023c058 ports 2 " HCA-2"
Ca : 0x0002c9020023bffc ports 2 " HCA-2"
Ca : 0x0002c9020023c08c ports 2 "wci59"
Ca : 0x0011750000ffe01a ports 1 " HCA-1"
Ca : 0x0011750000ffe141 ports 1 " HCA-1"
Ca : 0x0011750000ffe1dd ports 1 " HCA-1"
Ca : 0x0011750000ffe079 ports 1 " HCA-1"
Ca : 0x0011750000ffe25c ports 1 " HCA-1"
Ca : 0x0002c9020023c318 ports 2 " HCA-2"
...

Each line shows the GUID of the adapter, the number of adapter ports and the adapter description; 96 adapters are "online" in this system.

Supercomputing '09

Network Adapter Classification

% ibnetdiscover -H   /* Some parts snipped out */
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
...

The device ID and vendor ID classify the adapters: 59 Mellanox InfiniHost III adapters, 29 Mellanox ConnectX adapters and 8 Qlogic adapters in this system.

Supercomputing '09

Discovering Network Switches

% ibswitches   /* Some parts snipped out */
Switch : ports 24 "SilverStorm 9120 Leaf 1, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 2, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 1, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 3, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 2, Chip B"
Switch : ports 24 "SilverStorm 9120 Spine 1, Chip B"
Switch : ports 24 "SilverStorm 9120 Spine 3, Chip B"
Switch : ports 24 "SilverStorm 9120 Leaf 8, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 2, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 4, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 12, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 6, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 10, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 7, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 3, Chip A"
...

Each line shows the switch vendor information and the number of ports per switch: 12 leaf switches and 3 spine switches with 2 chips each.

Supercomputing '09

Discovering Network Connectivity

% ibnetdiscover   /* Some parts snipped out */
Switch 24 "S-00066a000700067c"  # "SilverStorm 9120 GUID=0x00066a00020001aa Leaf 1, Chip A" base port 0 lid 66 lmc 0
[24]  # "SilverStorm 9120 Spine 2, Chip A" lid 104 4xDDR
[23]  # "SilverStorm 9120 Spine 2, Chip A" lid 104 4xDDR
[22]  # "SilverStorm 9120 Spine 3, Chip A" lid 100 4xDDR
[21]  # "SilverStorm 9120 Spine 1, Chip A" lid 110 4xDDR
...
[12] "H-0002c9030001e5e6"  # " HCA-1" lid 125 4xDDR
[11] "H-0002c9030001e3fa"  # " HCA-1" lid 142 4xDDR
[10] "H-0002c9030000b0c4"  # " HCA-1" lid 106 4xDDR
[9]  "H-0002c9030000b0c8"  # " HCA-1" lid 108 4xDDR
[8]  "H-0002c9030001e5fa"  # " HCA-1" lid 143 4xDDR
...

The output lists the connectivity of each switch port.

Supercomputing '09

Roughly Constructed Network Fabric

[Diagram: the discovered fabric is a SilverStorm 9120 chassis with three spine modules (Spine 1 and 2 managed, Spine 3 unmanaged, each with Chips A and B) and 12 leaf modules, with 59 Mellanox InfiniHost III adapters, 29 Mellanox ConnectX adapters and 8 Qlogic adapters attached to the leaf switches.]

Supercomputing '09

InfiniBand Management & Tools

• Subnet Management

• Diagnostic Tools

– System Discovery Tools

– System Health Monitoring Tools

– System Performance Monitoring Tools

Supercomputing '09

Overall Diagnostics

• Tools to query overall fabric health

[ib1 ]# ibdiagnet -r
…
STAGE                           Errors  Warnings
Bad GUIDs/LIDs Check            0       0
Link State Active Check         0       0
Performance Counters Report     0       0
Partitions Check                0       0
IPoIB Subnets Check             0       0
Subnet Manager Check            0       0
Fabric Qualities Report         0       0
Credit Loops Check              0       0
Multicast Groups Report         0       0

Supercomputing '09

End-node Adapter State

[ib1 ]# ibportstate 8 1
PortInfo:
# Port info: Lid 8 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
LinkSpeedActive:.................5.0 Gbps

Supercomputing '09

End-node Adapter Counters

[ib1 ]# ibdatacounts 119 1
# Port counters: Lid 119 port 1
XmtData:.........................2102127705
RcvData:.........................2101904109
XmtPkts:.........................9069780
RcvPkts:.........................9068305

[ib1 ]# ibdatacounts 119 1
# Port counters: Lid 119 port 1
XmtData:.........................432
RcvData:.........................432
XmtPkts:.........................6
RcvPkts:.........................6

[ib1 ]# ibcheckerrs -v 20 1
Error check on lid 20 (ib12 HCA-2) port 1: OK

Supercomputing '09

InfiniBand Management & Tools

• Subnet Management

• Diagnostic Tools

– System Discovery Tools

– System Health Monitoring Tools

– System Performance Monitoring Tools

Supercomputing '09

Network Switching and Routing

% ibroute -G 0x66a000700067c
Lid    Out   Destination
       Port  Info
0x0001 001 : (Channel Adapter portguid 0x0002c9030001e3f3: ' HCA-1')
0x0002 013 : (Channel Adapter portguid 0x0002c9020023c301: ' HCA-1')
0x0003 014 : (Channel Adapter portguid 0x0002c9030001e603: ' HCA-1')
0x0004 015 : (Channel Adapter portguid 0x0002c9020023c305: ' HCA-2')
0x0005 016 : (Channel Adapter portguid 0x0011750000ffe005: ' HCA-1')
0x0014 017 : (Switch portguid 0x00066a0007000728: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 8, Chip A')
0x0015 020 : (Channel Adapter portguid 0x0002c9020023c131: ' HCA-2')
0x0016 019 : (Switch portguid 0x00066a0007000732: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 10, Chip A')
0x0017 019 : (Channel Adapter portguid 0x0002c9030001c937: ' HCA-1')
0x0018 019 : (Channel Adapter portguid 0x0002c9020023c039: ' HCA-2')
...

For example, packets to LID 0x0001 will be sent out on port 001.

Supercomputing '09

Static Analysis of Network Contention

[Diagram: leaf and spine blocks of the fabric with numbered ports, illustrating the paths taken by a set of flows.]

• Based on destination LIDs and switching/routing information, the exact path of the packets can be identified

– If the application communication pattern is known, we can statically identify possible network contention

Supercomputing '09

Dynamic Analysis of Network Contention

• IB provides many optional performance counters that can be queried
– PortXmitWait: Number of ticks in which there was data to send, but no flow-control credits

– RNR NAKs: Number of times a message was sent, but the receiver had not yet posted a receive buffer
• This can time out, so it can be an error in some cases

– PortXmitFlowPkts: Number of (link-level) flow-control packets transmitted on the port

– SWPortVLCongestion: Number of packets dropped due to congestion
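These counters can be read with the standard infiniband-diags utilities; as an illustration (flag support varies with the infiniband-diags version and the counters implemented by the port):

% perfquery 119 1        # standard port counters for LID 119, port 1
% perfquery -x 119 1     # extended counters, where supported (includes PortXmitWait)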

Supercomputing '09

Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The Open Fabrics Software Stack Usage

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A

Supercomputing '09

Advanced Capabilities in iWARP/10GE

• Security in iWARP

• Multipathing support using VLANs (Ethernet Feature)

• 64b/66b Encoding Standards

• Link Aggregation (Ethernet Feature)

Supercomputing '09

Security in iWARP

• iWARP was designed to be compliant with the Internet, while providing high performance
– Security is an important consideration

– E.g., can be used in data-centers where unknown clients can communicate over iWARP with the server

• Multiple levels of security measures specified by the standard (in practice, only a few are implemented)

– Untrusted peer access model

– Encrypted Wire Protocol

– Information Disclosure

Supercomputing '09

Untrusted Peer Access Model

• Single access RDMA
– The target exposes a memory region for a single protected access
– Initiator performs three steps:
• Writes data to the target location
• Sends an invalidate STag message (all access to the target memory is removed)
• Sends a verification key

– Target performs two steps:
• Uses the verification key to ensure the message is not tampered with
• Marks the process as complete

– Verification model unspecified by the iWARP standard

Supercomputing '09

Encrypted Wire Protocol and Information Disclosure

• iWARP is built on top of TCP/IP, so all its security protocols are directly usable by iWARP
– The standard discusses IPSec, but does not specify it
• Can be thought of as the "recommended mechanism"

– Security capabilities are not directly accessible, except to turn them on or off

• Information Disclosure
– Peer-specific memory protection capabilities
• E.g., only peer X can access this buffer

• Only peer X can write to this, and peer Y can read from it

• Mixed modes (a buffer is readable by a subset of peers)

Supercomputing '09

Denial of Service Attacks

• Typical forms of DoS attacks are when the peer negotiates resources (e.g., by opening a connection) but performs no real work
– Especially difficult to handle when done on the network adapter (limited resources)

• The iWARP standard does not specify any solution for this; it leaves it to the TCP/IP layer to handle
– E.g., using authentication or terminating connections by monitoring usage

– Recommends that communication be offloaded only after the authentication is done

Supercomputing '09

Advanced Capabilities in iWARP/10GE

• Security in iWARP

• Multipathing support using VLANs (Ethernet Feature)

• 64b/66b Encoding Standards

• Link Aggregation (Ethernet Feature)

Supercomputing '09

VLAN based Multipathing

• Ethernet basic switching

– Network is broken down to a tree by disabling links

– Pros: No live-locks and simple switching

– Cons: Single path between nodes and wastage of links

• VLAN based multipathing
– Overlay many logical networks on one physical network

• Each overlay network will break down into a unique tree

• Depending on which overlay network you send on, you get a different path

• Adding nodes/links is simple; you just add a new overlay

• Older overlays will continue to work as before

Supercomputing '09

Example VLAN Configuration

[Diagram: a small Ethernet topology shown once as a single spanning tree and once as two overlaid VLANs.]

• Basic Ethernet converts the topology to a tree
– Wastes four of the links

• Can be considered as two different VLANs
– All the links in the network are utilized

• Can be used for:
– High Performance

– Security (if someone has to get access only to a part of the network)

– Fault tolerance

• Supported by several switch vendors
– Woven Systems, Cisco

Supercomputing '09

Advanced Capabilities in iWARP/10GE

• Security in iWARP

• Multipathing support using VLANs (Ethernet Feature)

• 64b/66b Encoding Standards

• Link Aggregation (Ethernet Feature)

Supercomputing '09

64b/66b Network Data Encoding

• All communication channels utilize data encoding

– There can be an imbalance in the number of 1's and 0's in the data bytes that are being transmitted
• Leads to a problem called DC-balancing
• Reduces signal integrity, especially for fast networks

– Encoding converts data into a format with a more even mix of 1's and 0's
– E.g., 10GE has 12.5 Gbps signaling; same for Myrinet, Quadrics, GigaNet cLAN and most other networks
• 8b/10b encoding
– Pros: Better signal integrity, so fewer retransmits
– Cons: More bits sent over the wire (20% overhead)

• 64b/66b has the same benefit, but less overhead

Supercomputing '09
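As a quick back-of-the-envelope check (numbers chosen for illustration): with 8b/10b, carrying 10 Gbps of payload requires 10 x 10/8 = 12.5 Gbps of signaling, so 2.5 of every 12.5 Gbps (20%) on the wire is encoding overhead; with 64b/66b the same 10 Gbps needs only 10 x 66/64 = ~10.31 Gbps, roughly 3% overhead.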

Link Aggregation

• Link aggregation allows multiple links to logically look like a single faster link
– Done at a hardware level

– Several multi-port network adapters allow for packet sequencing to avoid out-of-order packets

– Both 64b/66b encoding and link aggregation are mainly driven by the need to conserve power
• More data rate, but not necessarily a higher clock speed

Supercomputing '09

Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The Open Fabrics Software Stack Usage

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A

Supercomputing '09

OpenFabrics

• www.openfabrics.org

• Open source organization (formerly OpenIB)
• Incorporates both IB and iWARP in a unified manner

• Focusing on effort for Open Source IB and iWARP support for Linux and Windows

• Design of complete software stack with `best of breed' components
– Gen1
– Gen2 (current focus)

• Users can download the entire stack and run
– Latest release is OFED 1.4.2

– OFED 1.5 is being worked out

Supercomputing '09

OpenFabrics Software Stack

[Diagram: the OpenFabrics stack. At the application level sit various MPIs, sockets-based applications, clustered DB access, access to file systems, block storage and IP-based applications, along with OpenSM, diagnostic tools and user-level components (user-level verbs/MAD API, UDAPL, SDP library). Kernel-space upper-layer protocols include IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC and cluster file systems. The mid-layer provides the kernel verbs/API, connection manager abstraction (CMA), connection managers, SA client, MAD, SMA and kernel-bypass paths, sitting on hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs.]

Acronyms:
SA - Subnet Administrator
MAD - Management Datagram
SMA - Subnet Manager Agent
PMA - Performance Manager Agent
IPoIB - IP over InfiniBand
SDP - Sockets Direct Protocol
SRP - SCSI RDMA Protocol (Initiator)
iSER - iSCSI RDMA Protocol (Initiator)
RDS - Reliable Datagram Service
UDAPL - User Direct Access Programming Lib
HCA - Host Channel Adapter
R-NIC - RDMA NIC

Supercomputing '09

Programming with OpenFabrics

[Diagram: sender and receiver processes, each with a kernel and an HCA; the data path bypasses the kernel.]

Sample Steps:

1. Create QPs (endpoints)

2. Register memory for sending and receiving

3. Send
– Channel semantics
• Post receive
• Post send

– Or post an RDMA operation

Supercomputing '09

Verb Steps

• Open HCA and create QPs to end nodes

– Can be done with connection managers (rdma_cm or ibcm) or directly through verbs with out-of-band communication

• Register memory

struct ibv_mr *mr_handle = ibv_reg_mr(pd, buffer, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
        IBV_ACCESS_REMOTE_READ);

• Permissions can be set as needed (a QP-creation sketch follows)
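A minimal sketch of the QP-creation step mentioned above, assuming the device has been opened with ibv_open_device(), a protection domain pd allocated with ibv_alloc_pd() and a completion queue cq created with ibv_create_cq(); the queue depths are arbitrary example values:

/* assumes <infiniband/verbs.h> and <string.h> are included */
struct ibv_qp_init_attr init_attr;
memset(&init_attr, 0, sizeof(init_attr));
init_attr.qp_type          = IBV_QPT_RC;   /* reliable connection */
init_attr.send_cq          = cq;
init_attr.recv_cq          = cq;
init_attr.cap.max_send_wr  = 128;
init_attr.cap.max_recv_wr  = 128;
init_attr.cap.max_send_sge = 1;
init_attr.cap.max_recv_sge = 1;

struct ibv_qp *qp = ibv_create_qp(pd, &init_attr);
/* the QP is then moved through INIT -> RTR -> RTS with ibv_modify_qp(),
   or the transitions are handled by rdma_cm when using connection managers */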

Supercomputing '09

Verbs: Post Receive

• Prepare and post receive descriptor (channel semantics)

struct ibv_recv_wr *bad_wr;
struct ibv_recv_wr rr;
struct ibv_sge sg_entry;

rr.next = NULL;
rr.wr_id = 0;
rr.num_sge = 1;
rr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;              /* local buffer address */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;              /* memory handle */

ret = ibv_post_recv(qp, &rr, &bad_wr);        /* post to QP */

ret = ibv_post_srq_recv(srq, &rr, &bad_wr);   /* or post to SRQ */

Supercomputing '09

Verbs: Post Send

• Prepare and post send descriptor (channel semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_SEND;
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;
sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);

Supercomputing '09

Verbs: Post RDMA Write

• Prepare and post RDMA write (memory semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_RDMA_WRITE;           /* set type to RDMA Write */
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;
sr.wr.rdma.remote_addr = remote_addr;    /* remote virtual addr. */
sr.wr.rdma.rkey = rkey;                  /* from remote node */
sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;         /* local buffer */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);

Supercomputing '09
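The slides stop at posting the work requests; for completeness, a minimal sketch (assuming cq is the completion queue bound to the QP above) of reaping the resulting completion, i.e., the "pull out completed CQEs" step from the channel/memory semantics slides:

/* assumes <infiniband/verbs.h> is included */
struct ibv_wc wc;
int n;
do {
    n = ibv_poll_cq(cq, 1, &wc);         /* non-blocking; returns number of completions */
} while (n == 0);
if (n < 0 || wc.status != IBV_WC_SUCCESS)
    /* handle error; ibv_wc_status_str(wc.status) gives a readable reason */;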

Presentation Overview

• Networking Requirements of HEC Systems

• Recap of InfiniBand and 10GE Overview

• Advanced Features of IB

• Advanced Features of 10/40/100GE

• The Open Fabrics Software Stack Usage

• Designing High-End Systems with IB and 10GE
– MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization

• Conclusions and Final Q&A

Supercomputing '09

Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)

• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)

• File Systems

• Multi-tier Data Centers

• Virtualization

Supercomputing '09

Designing MPI Using IB/iWARP Features

[Diagram: MPI design components – protocol mapping, buffer management, flow control, connection management, communication progress, collective communication, multi-rail support and a one-sided active/passive substrate – mapped onto IB and Ethernet features such as RDMA operations, send/receive, atomic operations, shared receive queues, unreliable datagram, end-to-end flow control, static and dynamic rate control, multicast, out-of-order placement, QoS, multi-path LMC and multi-path VLANs.]

Supercomputing '09

MVAPICH/MVAPICH2 Software

• High Performance MPI Library for IB and 10GE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)

– Used by more than 975 organizations in 51 countries

– More than 32,000 downloads from OSU site directly

– Empowering many TOP500 clusters
• 8th ranked 62,976-core cluster (Ranger) at TACC

– Available with software stacks of many IB, 10GE and server vendors including Open Fabrics Enterprise Distribution (OFED)

– Also supports uDAPL device to work with any network supporting uDAPL

– http://mvapich.cse.ohio-state.edu/

Supercomputing '09

MPICH2 Software Stack

• High-performance and Widely Portable MPI
– Supports MPI-1, MPI-2 and MPI-2.1

– Supports multiple networks (TCP, IB, iWARP, Myrinet)
– Commercial support by many vendors
• IBM (integrated stack distributed by Argonne)
• Microsoft, Intel (in process of integrating their stack)

– Used by many derivative implementations
• E.g., MVAPICH2, IBM, Intel, Microsoft, SiCortex, Cray, Myricom
• MPICH2 and its derivatives support many Top500 systems (estimated at more than 90%)
– Available with many software distributions
– Integrated with the ROMIO MPI-IO implementation and the MPE profiling library
– http://www.mcs.anl.gov/research/projects/mpich2

Supercomputing '09

Design Challenges and Sample Results

• Interaction with Multi-core Environments

– Communication Characteristics on Multi-core Systems

– Protocol Processing Interactions

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability

Supercomputing '09

MPI Bandwidth on ConnectX with Multicore

[Chart: multi-stream MPI bandwidth (MillionBytes/sec) on ConnectX for 1, 2, 4 and 8 pairs of communicating cores, showing up to a 5-fold performance difference across message sizes.]

Supercomputing '09

S. Sur, M. J. Koop, L. Chai and D. K. Panda, “Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms”, IEEE Hot Interconnects, 2007

Design Challenges and Sample Results

• Interaction with Multi-core Environments

– Communication Characteristics on Multi-core Systems

– Protocol Processing Interactions

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability

Supercomputing '09

Analyzing Interrupts and Cache Misses

[Charts: hardware interrupts per message and the percentage difference in L2 cache misses for cores 0-3, across message sizes from 1 byte to 2 MB.]

Supercomputing '09

MPI Performance on Different Cores

[Charts: MPI bandwidth (Mbps) versus message size when the MPI process is bound to core 0, 1, 2 or 3, on an Intel platform and an AMD platform.]

Supercomputing '09

Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability

Supercomputing '09

Hot-Spot Avoidance with MVAPICH

• Deterministic nature of IB routing leads to network hot-spots

• Responsibility of path utilization is up to the MPI Library

• We design HSAM (Hot-Spot Avoidance MVAPICH) to alleviate this problem

[Chart: NAS FT execution time (seconds) for 32x1 and 64x1 processes, comparing the original scheme with HSAM using 4 paths.]

Supercomputing '09

A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula and D. K. Panda , “Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective”, CCGrid ’07 (Best Paper Nominee)

Network Congestion with Multi-Cores

[Diagram: the balance between protocol processing and network communication under a faster processor, a faster network, and multi-cores, illustrating the increased network usage with multi-cores.]

Supercomputing '09

Network Usage

[Charts: communication burstiness and throughput (Mbps) per configuration, and network congestion measured as RX/TX frames and pause frames per configuration.]

Supercomputing '09

A. Shah, B. N. Bryant, H. Shah, P. Balaji and W. Feng, “Switch Analysis of Network Congestion for Scientific Applications” (under preparation)

Out-of-Order (OoO) Packets

• Multi-path communication supported by many networks

– IB, 10GE: Hardware feature!
– Can cause OoO packets that the protocol stack has to handle
• Simple approach by most protocols (e.g., IB): drop & retransmit
• Not good for large-scale systems

– iWARP specifies a more graceful approach
• Out-of-order placement of data
• Overhead of out-of-order packets should be minimized

[Illustration: packets 1-4 sent over different paths arrive out of order at the receiver.]

Supercomputing '09

Issues with Out-of-Order Packets

[Diagram: iWARP messages carried in TCP segments, each with a packet header, iWARP header and data payload. Intermediate switch segmentation can split a payload across packets, so a delayed or out-of-order packet may arrive without an identifiable iWARP header.]

Supercomputing '09

Handling Out-of-Order Packets in iWARP

[Diagram: three placements of the iWARP protocol stack (RDMAP, DDP, markers, CRC, TCP/IP) between the host and the NIC: host-based, host-offloaded and host-assisted.]

• Packet structure becomes overly complicated

• Performing it in hardware is no longer straightforward!

Supercomputing '09

Overhead of Supporting OoO Packets

[Charts: iWARP latency (us) and bandwidth (Mbps) versus message size for host-offloaded, host-based, host-assisted and in-order iWARP designs.]

Supercomputing '09

Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication

– IB Multicast based MPI Broadcast

– Shared memory aware collectives

• Scalability for Large-scale Systems

• Fault Tolerance

• Quality of Service

• Application Scalability

Supercomputing '09

MPI Broadcast Using IB Multicast

[Charts: MPI broadcast latency (us) versus message size, comparing IB multicast based and point-to-point based designs; actual measurements on 64 processes and an analytical model for 1024 processes.]

Supercomputing '09

Shared-memory Aware Collectives (4K cores on TACC Ranger with MVAPICH2)

[Charts: MPI_Reduce and MPI_Allreduce latency (us) on 4096 cores versus message size, comparing the original and shared-memory-aware designs.]

Supercomputing '09

Design Challenges and Sample Results

• Interaction with Multi-core Environments

• Network Congestion and Hot-spots

• Collective Communication

• Scalability for Large-scale Systems

– Memory Efficient Communication

• Fault Tolerance

• Quality of Service

• Application Scalability

Supercomputing '09

Memory Utilization using Shared Receive Queues

[Charts: MPI_Init memory utilization (MB) measured for up to 32 processes, and an analytical model of memory used (GB) for up to 16,384 processes, comparing MVAPICH-RDMA, MVAPICH-SR and MVAPICH-SRQ.]

• SRQ consumes only 1/10th of the memory compared to RDMA for 16,000 processes

• Send/Recv exhausts the buffer pool after 1000 processes and consumes 2X the memory of SRQ for 16,000 processes

Supercomputing '09

S. Sur, L. Chai, H. –W. Jin and D. K. Panda, “Shared Receive Queue Based Scalable MPI Design for InfiniBand Clusters”, IPDPS 2006

Communication Buffer Memory Utilization with NAMD (apoa1)

[Chart: normalized memory usage (MB) and performance for ARDMA-SR, ARDMA-SRQ and SRQ channels on 16, 32 and 64 processes.]

Avg. RDMA channels: 53.15
Avg. Low watermarks: 0.03
Unexpected Msgs (%): 48.2
Total Messages: 3.7e6
MPI Time (%): 23.54

• 50% of messages are < 128 Bytes, the other 50% between 128 Bytes and 32 KB
– 53 RDMA connections set up for the 64-process experiment

• SRQ Channel takes 5-6 MB of memory
– Memory needed by SRQ decreases by 1 MB going from 16 to 64 processes

Supercomputing '09

S. Sur, M. Koop and D. K. Panda, “High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis”, SC ‘06

UD vs. RC: Performance and Scalability (SMG2000 Application)

[Figure: Normalized execution time of SMG2000 for 128-4096 processes, RC vs. UD]

Memory Usage (MB/process)

Procs    RC (MVAPICH 0.9.8): Conn. / Buffers / Struct. / Total    UD Design: Buffers / Struct. / Total
 512             22.9 / 65.0 / 0.3 /  88.2                                 37.0 / 0.2 / 37.2
1024             29.5 / 65.0 / 0.6 /  95.1                                 37.0 / 0.4 / 37.4
2048             42.4 / 65.0 / 1.2 / 107.4                                 37.0 / 0.9 / 37.9
4096             66.7 / 65.0 / 2.4 / 134.1                                 37.0 / 1.7 / 38.7

• Large number of peers per process (992 at maximum)
• UD reduces HCA QP cache thrashing

M. Koop, S. Sur, Q. Gao and D. K. Panda, “High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters,” ICS ‘07
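For reference, the verbs calls that distinguish a UD endpoint from an RC one are sketched below; reliability, segmentation to MTU-sized packets and address-handle caching are left to the MPI library and omitted here, and the Q_Key value is a placeholder.

#include <infiniband/verbs.h>
#include <stdint.h>

/* A single UD QP can reach any peer once an address handle for that peer
 * exists, so per-process memory stays flat as the job scales; an RC design
 * needs one connected QP per communicating peer instead. */
struct ibv_qp *create_ud_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_UD,
        .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1,  .max_recv_sge = 1 },
    };
    return ibv_create_qp(pd, &attr);
}

/* Sending on UD: name the destination per work request through an address
 * handle, remote QP number and Q_Key. */
int post_ud_send(struct ibv_qp *qp, struct ibv_ah *ah, uint32_t remote_qpn,
                 struct ibv_sge *sge)
{
    struct ibv_send_wr wr = {
        .sg_list    = sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.ud      = { .ah = ah, .remote_qpn = remote_qpn,
                        .remote_qkey = 0x11111111 },   /* placeholder Q_Key */
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}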

eXtended Reliable Connection (XRC)

[Figure: Connection memory usage per process (MB) vs. number of connections (1-16K) for RC, XRC (8-core) and XRC (16-core); normalized NAMD execution time at 64 cores on the apoa1, er-gre, f1atpase and jac datasets, RC vs. XRC]

• Memory usage for 32K processes with 16 cores per node can be 30 MB/process (for connections)
• Performance for NAMD can increase when there is frequent communication to many peers (HCA cache misses go down)

M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable Connection,” Cluster ‘08
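As a back-of-the-envelope illustration of why XRC shrinks connection state, assuming a fully connected communication pattern (an upper bound, not NAMD's actual behavior):

% Per-process QP count for P processes on nodes with C cores each
\[ Q_{\mathrm{RC}} = P - 1 \quad\text{(one QP per peer process)} \]
\[ Q_{\mathrm{XRC}} = P / C \quad\text{(one QP per peer node)} \]
\[ P = 32768,\; C = 16 \;\Rightarrow\; Q_{\mathrm{RC}} \approx 32767,\quad Q_{\mathrm{XRC}} = 2048 \]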

Hybrid Transport Design (UD/RC/XRC)

• Both UD and RC/XRC have benefits

• Evaluate characteristics of all of them and use two sets of transports in the same application – get the best of both


M. Koop, T. Jones and D. K. Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand,” IPDPS ‘08

Design Challenges and Sample Results

• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
  – Network Faults: Automatic Path Migration
  – Process Faults: Checkpoint-Restart
• Quality of Service
• Application Scalability

Fault Tolerance

• Component failures are common in large-scale clusters
• This imposes a need for reliability and fault tolerance
• Working along the following three angles:
  – Reliable networking with Automatic Path Migration (APM) utilizing redundant communication paths (available since MVAPICH 1.0 and MVAPICH2 1.0)
  – Process fault tolerance with efficient checkpoint and restart (available since MVAPICH2 0.9.8)
  – End-to-end reliability with memory-to-memory CRC (available since MVAPICH 0.9.9)

Network Fault-Tolerance with APM

• Network Fault Tolerance using InfiniBand Automatic Path Migration (APM)

– Utilizes Redundant Communication Paths

• Multiple Ports

• LMC

• Supported in OFED 1.2

A. Vishnu, A. Mamidala, S. Narravula and D. K. Panda, "Automatic Path Migration over InfiniBand: Early Experiences", Third International Workshop on System Management Techniques, Processes, and Services, held in conjunction with IPDPS '07
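A rough sketch of how a redundant path is loaded and armed through the verbs interface; the alternate-path attributes shown are placeholders, and a real design fills in the full alternate address vector obtained from the subnet manager.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Load an alternate path (e.g., a second port or a different LMC-derived
 * LID) into an established RC queue pair and arm APM; on a path failure
 * the HCA can then migrate traffic without tearing down the connection. */
int arm_alternate_path(struct ibv_qp *qp, uint16_t alt_dlid,
                       uint8_t alt_port, uint8_t alt_timeout)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.alt_ah_attr.dlid     = alt_dlid;   /* alternate destination LID */
    attr.alt_ah_attr.port_num = alt_port;   /* e.g., the HCA's second port */
    attr.alt_pkey_index       = 0;
    attr.alt_port_num         = alt_port;
    attr.alt_timeout          = alt_timeout;
    attr.path_mig_state       = IBV_MIG_REARM;  /* arm APM on this QP */

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
}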

APM Performance Evaluation

[Figure: Execution time (seconds) of NAS FT Class B and LU Class B vs. QPs per process (8-64), comparing Original, Armed, Armed-Migrated and Network Fault cases]

Checkpoint-Restart Support in MVAPICH2

• Process-level fault tolerance
  – User-transparent, system-level checkpointing
  – Based on BLCR from LBNL to take coordinated checkpoints of the entire program, including the front end and individual processes
  – Designed novel schemes to:
    • Coordinate all MPI processes to drain all in-flight messages in IB connections
    • Store communication state & buffers while checkpointing
    • Restart from the checkpoint
• System-level checkpoints can be initiated from the application (available since MVAPICH2 1.0)

Checkpoint-Restart Performance with PVFS2

[Figure: Execution time (seconds) of NAS LU Class C, 32x1 (storage: 8 PVFS2 servers on IPoIB) with no checkpoint and with 1-4 checkpoints taken at average intervals of 60, 40, 30 and 20 seconds]

Q. Gao, W. Yu, W. Huang and D.K. Panda, “Application-Transparent Checkpoint/Restart for MPI over InfiniBand”, ICPP ‘06

Enhancing CR Performance with I/O Aggregation for Multi-core Systems

[Figure: Time to take one checkpoint (ms) for LU.C.64, SP.C.64, BT.C.64 and EP.C.64 with the original BLCR vs. node-level write aggregation (speedups of 9.13, 11.57, 13.08 and 1.67 across the four benchmarks)]

• 64 MPI processes on 4 nodes, 16 processes/node
• Checkpoint data is written to local disk files

X. Ouyang, K. Gopalakrishnan, D. K. Panda, Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems, Int'l Conference on Parallel Processing (ICPP '09), Sept. 2009.
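A simplified sketch of the node-level write aggregation idea, assuming one designated writer per node and equal-sized checkpoint chunks; the actual design operates inside BLCR/MVAPICH2 rather than at the MPI level.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Instead of every process writing its own checkpoint chunk to local disk,
 * the processes on a node funnel their data to one writer, which issues a
 * single large sequential write. */
void aggregated_checkpoint(const char *path, const void *chunk, int bytes,
                           MPI_Comm node_comm)
{
    int rank, nprocs;
    MPI_Comm_rank(node_comm, &rank);
    MPI_Comm_size(node_comm, &nprocs);

    char *agg = NULL;
    if (rank == 0)
        agg = malloc((size_t)bytes * nprocs);

    /* Gather every process's chunk to the node-level writer (rank 0). */
    MPI_Gather((void *)chunk, bytes, MPI_BYTE,
               agg, bytes, MPI_BYTE, 0, node_comm);

    if (rank == 0) {
        FILE *f = fopen(path, "wb");
        if (f) {
            fwrite(agg, 1, (size_t)bytes * nprocs, f);  /* one large write */
            fclose(f);
        }
        free(agg);
    }
}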

Design Challenges and Sample Results

• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
• Quality of Service
• Application Scalability

QoS in IB

• IB is capable of providing network-level differentiated service (QoS) through Virtual Lanes
• Uses Service Levels (SL) and Virtual Lanes (VL) to classify traffic

[Figure: IB HCA buffer organization — a common buffer pool feeding multiple virtual lanes, with a virtual lane arbiter multiplexing them onto the IB link]
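For illustration, assigning traffic to a service level is a single field in the address vector when an RC QP is moved to RTR, as in the sketch below (which assumes the QP is already in INIT and omits the RTS transition); the SL value must correspond to an SL-to-VL mapping configured in the subnet manager, and the other attribute values are placeholders.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Bind this connection's traffic to a Service Level; the switches'
 * SL-to-VL tables, programmed by the subnet manager (e.g., OpenSM),
 * then map the SL onto a virtual lane with its own arbitration weight. */
int connect_with_sl(struct ibv_qp *qp, uint16_t dest_lid, uint32_t dest_qpn,
                    uint8_t port, uint8_t sl)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = dest_qpn;
    attr.rq_psn             = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = dest_lid;
    attr.ah_attr.sl         = sl;      /* service level for this connection */
    attr.ah_attr.port_num   = port;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}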

Inter-Job Quality of Service

[Figure: Latency (us) for small messages (1-512 bytes) and for large messages under No-Traffic, No-QoS and QoS configurations]

QoS can differentiate between multiple jobs

Design Challenges and Sample Results

• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
• Quality of Service
• Application Scalability

Performance of HPC Applications on TACC Ranger using MVAPICH + IB

• Rob Farber's facial recognition application was run on up to 60K cores using MVAPICH
• Ranges from 84% of peak at the low end to 65% of peak at the high end

http://www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber

3DFFT based Computations

• Internally utilize sequential 1D FFT libraries and perform data grid transforms to collect the required data
  – Example implementations: P3DFFT, FFTW
  – 3D volume of data divided amongst a 2D grid of processes
  – Grid transpose uses MPI_Alltoallv across the row and column communicators (see the sketch below)
    • Each process communicates with all processes in its row
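A minimal sketch of that transpose pattern (not P3DFFT's actual code): split the 2D process grid into row and column communicators, then perform MPI_Alltoallv within one of them; counts and displacements are left schematic.

#include <mpi.h>

/* Split a P = prow x pcol process grid into row and column communicators;
 * each 3D-FFT transpose is then an MPI_Alltoallv inside one of them. */
void make_grid_comms(MPI_Comm comm, int prow, MPI_Comm *row_comm,
                     MPI_Comm *col_comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_split(comm, rank / prow, rank % prow, row_comm);  /* same row */
    MPI_Comm_split(comm, rank % prow, rank / prow, col_comm);  /* same col */
}

/* One transpose step: exchange pencils with every process in the row.
 * sendcounts/recvcounts/displacements must be computed from the local
 * grid decomposition, which is omitted here. */
void transpose_row(const double *sendbuf, double *recvbuf,
                   const int *sendcounts, const int *sdispls,
                   const int *recvcounts, const int *rdispls,
                   MPI_Comm row_comm)
{
    MPI_Alltoallv((void *)sendbuf, (int *)sendcounts, (int *)sdispls,
                  MPI_DOUBLE, recvbuf, (int *)recvcounts, (int *)rdispls,
                  MPI_DOUBLE, row_comm);
}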

Performance of HPC Applications on TACC Ranger: DNS/Turbulence

[Figure: DNS/Turbulence results on Ranger — courtesy P.K. Yeung, Diego Donzis, TG 2008]

Application Example: Blast Simulations

• Researchers from the University of Utah have developed a simulation framework called Uintah
• Combines advanced mechanical, chemical and physical models into a novel computational framework
• Have run > 32K MPI tasks on Ranger
• Uses asynchronous communication

http://www.tacc.utexas.edu/news/feature-stories/2009/explosive-science/
Courtesy: J. Luitjens, M. Berzins, Univ. of Utah

Application Example: OMEN

• OMEN is a two- and three-dimensional Schrodinger-Poisson based solver
• Used in semiconductor modeling
• Run to almost 60K tasks

Courtesy: Mathieu Luisier, Gerhard Klimeck, Purdue
http://www.tacc.utexas.edu/RangerImpact/pdf/Save_Our_Semiconductors.pdf

Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization

IPoIB vs. SDP Architectural Models

[Figure: Traditional model — a sockets application and the sockets API in user space over a kernel TCP/IP sockets provider, TCP/IP transport driver and InfiniBand CA driver. Possible SDP model — the same sockets API mapped onto the Sockets Direct Protocol, which bypasses the kernel TCP/IP stack and uses RDMA semantics through the RDMA driver to the InfiniBand CA. (Source: InfiniBand Trade Association, 2002)]

SDP vs. IPoIB (IB QDR)

[Figure: Unidirectional and bidirectional bandwidth (MBps) and latency (us) vs. message size for IPoIB-RC, IPoIB-UD and SDP]

SDP enables high bandwidth (up to 15 Gbps) and low latency (6.6 µs)
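Because SDP preserves the sockets API, unmodified code such as the sketch below can run over it; on OFED-based stacks this is commonly done by preloading the SDP library (for example LD_PRELOAD=libsdp.so), though the exact preload mechanism and configuration vary by release, so treat the run line in the comment as an assumption.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Plain TCP client; nothing here is SDP-specific.  With the SDP preload
 * library interposing on socket(), the same binary runs over SDP on IB:
 *   LD_PRELOAD=libsdp.so ./client 192.168.1.10 5001      (assumed usage) */
int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: %s ip port\n", argv[0]); return 1; }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port   = htons((unsigned short)atoi(argv[2]));
    inet_pton(AF_INET, argv[1], &srv.sin_addr);

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }
    const char msg[] = "hello over SDP or TCP";
    write(fd, msg, sizeof(msg));
    close(fd);
    return 0;
}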

Flow-Control in IB

• Previous implementations of high-speed sockets (such as SDP) were on other networks
  – Implemented flow-control in software
• IB provides end-to-end message-level flow-control in hardware
  – Benefits: asynchronous progress (i.e., the SDP stack does not need to keep waiting for the receiver to be ready; the hardware takes care of it)
  – Issues:
    • No intelligent software coalescing
    • Does not handle buffer overflow

Flow-control Performance

[Figure: Latency (us) and bandwidth (Mbps) vs. message size for credit-based, RDMA-based and NIC-assisted flow-control]

P. Balaji, S. Bhagvat, D. K. Panda, R. Thakur and W. Gropp, Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand, ICPP, Sep 2007

Application Evaluation

[Figure: Execution time (s) of Iso-surface Rendering (dataset dimensions 1024-8192) and Virtual Microscope (dataset dimensions 512-4096) for credit-based, RDMA-based and NIC-assisted flow-control]

Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization

Lustre Performance

[Figure: Write and read throughput (MBps) with 4 OSSs vs. number of clients (1-4), IPoIB vs. native IB]

• Lustre over native IB
  – Write: 1.38X faster than IPoIB; Read: 2.16X faster than IPoIB
• Memory copies in IPoIB and native IB
  – Reduced throughput and high overhead; I/O servers are saturated

CPU Utilization

[Figure: CPU utilization (%) vs. number of clients (1-4) for IPoIB and native IB reads and writes]

• 4 OSS nodes, IOzone record size 1 MB
• Offers potential for greater scalability

Can we enhance NFS Performance using IB RDMA?

• Many enterprise environments use NFS
  – The IETF's current major revision is NFSv4; NFSv4.1 deals with pNFS
• Current systems use Ethernet with TCP/IP; can they use IB?
  – Metadata-intensive workloads are latency sensitive
  – Need throughput for large transfers and OLTP-type workloads
• An NFS over RDMA standard has been proposed
  – Designed and implemented this on InfiniBand in OpenSolaris
    • Taking advantage of RDMA mechanisms in InfiniBand
    • Design works for NFSv3 and NFSv4
    • Interoperable with Linux NFS/RDMA
  – NFS over RDMA design incorporated into OpenSolaris by Sun
    • Ongoing work for pNFS (NFSv4.1)
  – Joint work with Sun and NetApp
  – http://nowlab.cse.ohio-state.edu/projects/nfs-rdma/index.html

NFS/RDMA Performance

[Figure: IOzone write (tmpfs) and read (tmpfs) throughput (MB/s) vs. number of threads (1-15) for the Read-Read and Read-Write designs]

• IOzone read bandwidth up to 913 MB/s (Sun x2200s with x8 PCIe)
• Read-Write design by OSU, available with the latest OpenSolaris
• NFS/RDMA is being added into OFED 1.4

R. Noronha, L. Chai, T. Talpey and D. K. Panda, “Designing NFS With RDMA For Security, Performance and Scalability”, ICPP ‘07

Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization

Enterprise Datacenter Environments

[Figure: Multi-tier datacenter — WAN clients connect to proxy/web servers (Apache), which forward to application servers (PHP) and database servers (MySQL) backed by storage; computation and communication requirements grow toward the back-end tiers]

• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
  – If not cached, the request is forwarded to the next tier: the application server
• The application server performs the business logic (CGI, Java servlets)
  – Retrieves appropriate data from the database to process the requests

Proposed Architecture for Enterprise Multi-tier Data-centers

[Figure: Layered architecture — existing data-center components sit on advanced system services (dynamic content caching via active and cooperative caching; active resource adaptation via dynamic reconfiguration and resource monitoring), which build on data-center service primitives (soft shared state, distributed lock manager, global memory aggregator, point-to-point services) over a distributed data sharing substrate; underneath are communication protocols and subsystems (Sockets Direct Protocol, advanced communication protocols with RDMA, atomics and multicast offload, RDMA-based flow-control, asynchronous zero-copy communication) over the network]

Data-Center Response Time with SDP (Internet Proxies)

[Figure: Response time (ms) vs. message size (32K-2M bytes) for IPoIB and SDP]

Cache Coherency and Consistency with Dynamic Data

[Figure: Example of strong cache coherency — never send stale data from the proxy node cache when the back-end data has changed. Example of strong cache consistency — always follow an increasing time line of events across user requests #1, #2 and #3]

Active Polling: An Approach for Strong Cache Coherency

[Figure: Request handling at the proxy node — a software-based coherency check imposes overhead at the back-end on both cache hits and cache misses, whereas an RDMA Read lets the proxy check the back-end data directly]
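A rough sketch of the RDMA Read underlying this active-polling check; exporting a version counter from the back-end is one way to realize the idea and is shown purely for illustration, with memory registration and connection setup omitted.

#include <infiniband/verbs.h>
#include <stdint.h>

/* Proxy-side check: read a version counter exported by the back-end with a
 * one-sided RDMA Read, then compare it against the version of the cached
 * copy.  The back-end CPU is not involved in servicing this read. */
int post_version_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                      uint64_t *local_version, uint64_t remote_addr,
                      uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_version,
        .length = sizeof(*local_version),
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
    /* The completion is reaped from the send CQ; the cached object is
     * served only if *local_version matches the cached version. */
}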

Strong Cache Coherency with RDMA

[Figure: Data-center throughput (transactions per second) and response time (ms) vs. number of compute threads on the back-end, comparing No Cache, IPoIB and RDMA based coherency]

RDMA can sustain performance even with heavy load on the back-end

S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda, “Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand”, SAN ‘04

Advanced System Services over IB

• Cache coherency is one of the possible enhancements IB can provide for enterprise datacenter environments
• Other concepts:
  – System load monitoring
    • RDMA-based monitoring of the load on each node (the kernel writes this information to a memory location; read it using RDMA)
    • Multicast capabilities can help spread such information quickly to multiple processes (reliability not very important)
  – Load balancing for performance and QoS
    • Asynchronously update forwarding tables to decide how many machines serve the content for a given website

http://nowlab.cse.ohio-state.edu/projects/data-centers

Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization

Current I/O Virtualization Approaches

• I/O in the VMM (e.g., VMware ESX)
  – Device drivers hosted in the VMM
  – I/O operations always trap into the VMM
  – The VMM ensures safe device sharing among VMs
• I/O in a special VM
  – Device drivers are hosted in a special (privileged) VM
  – I/O operations always involve the VMM and the special VM
  – E.g., Xen and VMware Workstation

[Figure: A guest VM's application and OS route I/O through a guest module to the backend and privileged modules in Dom0/the special VM, which the VMM maps onto the device]

From OS-bypass to VMM-bypass

• Guest modules in guest VMs handle setup and management (privileged access)
  – Guest modules communicate with VMM backend modules to get jobs done
  – The original privileged module can be reused
• Once set up, devices are accessed directly from guest VMs (VMM-bypass)
  – Either from the OS kernel or from applications
• Backend and privileged modules can also reside in a special VM

[Figure: Privileged access (setup/management) flows from the guest module through the backend and privileged modules in the VMM, while VMM-bypass access goes from the application or guest OS directly to the device]

Xen Overhead with VMM Bypass

[Figure: One-way latency (us) and unidirectional bandwidth (million bytes/sec) vs. message size, Xen vs. native]

• Only VMM-bypass operations are used (MVAPICH implementation)
• Xen-IB performs similarly to native InfiniBand

W. Huang, J. Liu, B. Abali, D. K. Panda. “A Case for High Performance Computing with Virtual Machines”, ICS ’06

Optimizing VM Migration through RDMA

[Figure: A helper process transfers machine state between the source and target physical hosts, with resources pre-allocated on the target]

Live VM migration:
• Step 1: Pre-allocate resources on the target host
• Step 2: Pre-copy machine states for multiple iterations
• Step 3: Suspend the VM and copy the latest updates to machine states
• Step 4: Restart the VM on the new host

Fast Migration over RDMA

[Figure: NAS benchmark execution time (s) for SP, BT, FT, LU, EP and CG with no migration (native), IPoIB-based and RDMA-based migration; CPU utilization and effective migration bandwidth (MB/s) for SP.A.9, BT.A.9, FT.B.8, LU.A.8, EP.B.9 and CG.B.8 with IPoIB vs. RDMA]

• One physical CPU is disabled on the nodes
• Migration overhead with IPoIB drastically increases
• RDMA achieves higher migration performance with less CPU usage

W. Huang, Q. Gao, J. Liu, D. K. Panda. "High Performance Virtual Machine Migration with RDMA over Modern Interconnects", Cluster '07 (Selected as a Best Paper)

Summary of Design Issues and Results

• Current-generation IB adapters, 10GE/iWARP adapters and software environments are already delivering competitive performance compared to other interconnects
• IB and 10GE/iWARP hardware, firmware, and software are going through rapid changes
• Significant performance improvement is expected in the near future

Presentation Overview

• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE Overview
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The Open Fabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A

Concluding Remarks

• Presented networking requirements for HEC clusters
• Presented advanced features of IB and 10GE
• Discussed the OpenFabrics stack and its usage
• Discussed design issues, challenges, and the state of the art in designing various high-end systems with IB and 10GE
• IB and 10GE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues that need novel solutions

Funding Acknowledgments

Our research is supported by the following organizations

• Funding support by

• Equipment support by


Personnel Acknowledgments

Current Students
– M. Kalaiya (M.S.)
– K. Kandalla (M.S.)
– P. Lai (Ph.D.)
– M. Luo (Ph.D.)
– G. Marsh (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Potluri (M.S.)
– H. Subramoni (Ph.D.)

Past Students
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– S. Krishnamoorthy (M.S.)
– R. Kumar (M.S.)
– P. Lai (Ph.D.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– G. Santhanaraman (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)

Current Post-Doc
– E. Mancini

Current Programmer
– J. Perkins

Web Pointers

http://www.cse.ohio-state.edu/~panda
http://www.mcs.anl.gov/~balaji
http://www.cse.ohio-state.edu/~koop
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu

panda@cse.ohio-state.edu
balaji@mcs.anl.gov
matthew.koop@nasa.gov
