Hot Interconnects 2005
Control Path Implementation for a Low-Latency Optical HPC Switch
C. Minkenberg¹, F. Abel¹, P. Müller¹, R. Krishnamurthy¹, M. Gusat¹, B.R. Hemenway²
¹ IBM Research, Zurich Research Laboratory
² Corning Inc., Science and Technology
OSMOSIS
Hot Interconnects 2005 © 2005 IBM Corporation & Corning Incorporated
Outline
Motivation: Can optics play a significant role in high-performance computing interconnects?
OSMOSIS Requirements
Design decisions
Architecture
Control path challenges: arbiter speed & complexity
Summary
HPC Interconnection Networks
Presently implemented as electronic packet switching networks, but quickly approaching electronic limits with further scaling
Future could be based on maturing all-optical packet switching, but we need to solve the technical challenges and accelerate the cost reduction of all-optical packet switching for HPC interconnects
Goal: build a full-function all-optical packet switch demonstrator system showing the scalability, performance, and cost paths for a potential commercial system
OSMOSIS: Optical Shared MemOry Supercomputer Interconnect System
Sponsored by DoE & NNSA as part of ASC
Joint 2½-year project between Corning (optics and packaging) and IBM (electronics: arbiter, input and output adapters; and system design)
HPC Requirements for OSMOSIS
Near 1 microsecond memory-to-memory latency: includes encoding/decoding, arbitration, and virtual output queues (VOQs)
Scaling to 2048+ nodes in a multi-stage topology
Very low bit-error rate (10⁻²¹) after forward error correction (FEC) and reliable delivery (RD)
Low switching overhead (<25%): includes optical switching overhead, header, line coding, and FEC
FPGA-only implementation, for cost and flexibility
Key design choices
Large switch radix: a 64-port switch allows scaling to 2048 nodes in two levels
3-stage, 2-level Fat Tree topology with flow control
Basic module scales from 16 to 128 ports
Cell switching: no provisioning or aggregation techniques (burst/container switching)
Full switch reconfiguration every time slot (51.2 ns) requires low overhead and fast arbitration
Enabled by fast semiconductor optical amplifiers (SOAs)
Input queuing with central arbitration: optical crossbar, no buffering in the optical domain
Electronic input buffers with VOQs to eliminate HOL blocking
Electronic central arbiter to achieve high maximum throughput and low latency
Port speed 40 Gb/s, cell size 256 B: allows ~25% overhead (see the sketch after this list)
Arbitration feasible at 40 Gb/s
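As a quick sanity check on these numbers, the sketch below (plain Python; the payload split is an assumption for illustration, not a figure from the slides) relates the 256 B cell and the 40 Gb/s port speed to the 51.2 ns time slot and to the <25% overhead budget.

```python
# Back-of-the-envelope check of the OSMOSIS cell timing (illustrative only).

PORT_RATE_BPS = 40e9   # 40 Gb/s per port
CELL_BYTES = 256       # cell size

# A 256 B cell at 40 Gb/s occupies the link for exactly one time slot.
cell_bits = CELL_BYTES * 8
slot_ns = cell_bits / PORT_RATE_BPS * 1e9
print(f"time slot = {slot_ns:.1f} ns")   # -> 51.2 ns, as on the slide

# With a <25% overhead budget (header, line coding, FEC, switching overhead),
# at least 75% of each cell remains for payload. The resulting byte count is
# an assumed illustration, not a number taken from the slides.
OVERHEAD_BUDGET = 0.25
print(f"payload budget >= {CELL_BYTES * (1 - OVERHEAD_BUDGET):.0f} B per cell")
```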
OSMOSIS System Architecture
Broadcast-and-select architecture (crossbar)
Combination of wavelength- and space-division multiplexing
Fast switching based on SOAs
Electronic input and output adapters
Electronic arbitration
[Figure: system diagram. 64 ingress adapters (VOQs, Tx, control) feed an all-optical switch built from 8 broadcast units (WDM mux, optical amplifier, star coupler) and 128 select units (fast SOA 1x8 fiber- and wavelength-selector gates, 8x1 combiner); 64 egress adapters (2 Rx, EQ, control) receive the outputs; a central arbiter running a bipartite graph matching algorithm connects to the adapters via control links.]
Control Path Challenges
System size: remote adapter cards – long round-trip time
Control channel protocol: combines many functions
Matching algorithm: iterative round-robin-based matching algorithm (see the sketch after this list)
– Good performance, practical, amenable to distribution
– Requires about log2(64) = 6 iterations for highest performance
Speed
– Short cell duration makes it impossible to complete sufficient iterations
Complexity
– Implement an iterative matching algorithm for 64 ports @ 51.2 ns in FPGAs
– Parallelism and distribution are needed
Packaging: a large number of high-end FPGAs must be accommodated in close proximity
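For orientation, here is a minimal Python sketch of an iterative two-phase round-robin matcher in the spirit of DRRM (request pointers at the inputs, grant pointers at the outputs, several iterations per time slot). It illustrates the class of algorithm named above, not the actual OSMOSIS arbiter; the port count, iteration count, and pointer-update policy are simplified assumptions.

```python
# Minimal iterative two-phase round-robin matcher (DRRM-style), illustrative only.
# voq[i][j] > 0 means input i has cells queued for output j.

def iterative_rr_match(voq, n_ports, n_iter):
    in_ptr = [0] * n_ports        # round-robin request pointer per input
    out_ptr = [0] * n_ports       # round-robin grant pointer per output
    match_in = [None] * n_ports   # output matched to each input
    match_out = [None] * n_ports  # input matched to each output

    for _ in range(n_iter):
        # Phase 1: every still-unmatched input requests one non-empty VOQ
        # aimed at a still-unmatched output, starting at its pointer.
        requests = {}  # output -> list of requesting inputs
        for i in range(n_ports):
            if match_in[i] is not None:
                continue
            for k in range(n_ports):
                j = (in_ptr[i] + k) % n_ports
                if voq[i][j] > 0 and match_out[j] is None:
                    requests.setdefault(j, []).append(i)
                    break
        # Phase 2: every requested output grants one input, starting at its pointer.
        for j, reqs in requests.items():
            for k in range(n_ports):
                i = (out_ptr[j] + k) % n_ports
                if i in reqs:
                    match_in[i], match_out[j] = j, i
                    in_ptr[i] = (j + 1) % n_ports    # simplified pointer update
                    out_ptr[j] = (i + 1) % n_ports
                    break
    return match_in

# Example: 4 ports, 2 iterations; entry [i] is the output granted to input i.
voq = [[1, 0, 2, 0],
       [0, 1, 0, 0],
       [3, 0, 0, 1],
       [0, 0, 1, 0]]
print(iterative_rr_match(voq, n_ports=4, n_iter=2))  # e.g. [0, 1, 3, 2]
```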
System Size
Large installations: switch remote from nodes, long cables
Long round-trip latency between
– adapters and arbiter
– adapters and crossbar
– arbiter and crossbar
Control channel protocol (ΔRGP): incremental requests and grants (a sketch follows the figures below)
Arbiter keeps track of pending requests per VOQ
Careful delay matching to ensure correct operation
[Diagram: round-trip timing between adapter, arbiter, crossbar, and adapter.]
[Plot: mean delay in packet cycles (log scale) vs. throughput for a 32x32 switch running 5-SLIP with ΔRGP under uniform Bernoulli traffic; curves for round-trip times RT = 0, 16, 32, 64, and 128.]
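To illustrate the incremental request/grant idea (this is a toy model, not the OSMOSIS message format), the sketch below keeps per-VOQ pending-request counters at the arbiter: adapters report only the number of new arrivals each time slot, and every grant decrements the matching counter, so no absolute queue state has to survive the long round trip.

```python
# Toy model of an incremental request/grant protocol (ΔRGP-style), illustrative only.
# The arbiter tracks pending requests per (input, output) VOQ; adapters send only
# increments (new arrivals), and each grant consumes one pending request.

class ArbiterRequestState:
    def __init__(self, n_ports):
        # pending[i][j]: cells requested by input i for output j, not yet granted
        self.pending = [[0] * n_ports for _ in range(n_ports)]

    def receive_increment(self, inp, out, delta):
        """Adapter 'inp' reports 'delta' newly queued cells for output 'out'."""
        self.pending[inp][out] += delta

    def grant(self, inp, out):
        """Grant one cell of VOQ (inp, out) if a request is pending."""
        if self.pending[inp][out] > 0:
            self.pending[inp][out] -= 1
            return True
        return False

# Example: input 3 reports two arrivals for output 7; the arbiter can grant twice.
arb = ArbiterRequestState(n_ports=64)
arb.receive_increment(3, 7, 2)
print(arb.grant(3, 7), arb.grant(3, 7), arb.grant(3, 7))  # True True False
```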
Control Channel Protocol
Incremental request/grant protocol (ΔRGP): to cope with the round-trip time
“Census”: to ensure consistency of ΔRGP in the presence of errors
Reliable delivery: relaying of intra-switch acknowledgments
Flow control: to prevent egress buffer overflow (on-off, watermark-based; see the sketch after this list)
Multicast: very large fanout (64 bits), special control message format
Control channel bandwidth: 12 B control messages
2 Gb/s/port (2.5 Gb/s raw), bidirectional
Aggregate arbiter bandwidth = (64 + 16) * 2.5 * 2 = 400 Gb/s!
One control channel interface (CCI) FPGA per two ports
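The on-off, watermark-based egress flow control mentioned above follows a standard hysteresis pattern, sketched below in Python; the watermark values are made-up placeholders and the code is a generic illustration rather than the OSMOSIS implementation.

```python
# Generic on-off watermark flow control for an egress buffer (illustrative only).
# XOFF is asserted when occupancy crosses the high watermark and released
# only after it drops below the low watermark (hysteresis).

HIGH_WATERMARK = 48   # placeholder values, in cells
LOW_WATERMARK = 16

class EgressFlowControl:
    def __init__(self):
        self.occupancy = 0   # cells currently buffered
        self.xoff = False    # True = ask the arbiter to stop granting to this egress

    def on_cell_arrival(self):
        self.occupancy += 1
        if self.occupancy >= HIGH_WATERMARK:
            self.xoff = True

    def on_cell_departure(self):
        self.occupancy -= 1
        if self.occupancy <= LOW_WATERMARK:
            self.xoff = False

# Example: fill the buffer past the high watermark, then drain it.
fc = EgressFlowControl()
for _ in range(50):
    fc.on_cell_arrival()
print(fc.xoff)   # True: arbiter should withhold grants for this egress
for _ in range(40):
    fc.on_cell_departure()
print(fc.xoff)   # False again once below the low watermark
```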
FLPPR: Fast Low-latency Parallel Pipelined aRbitration
Short cell duration and large radix: pipelining required to complete enough iterations per matching; mean latency decreases as the number of iterations increases
FLPPR: pipelined allocators, parallel requests (see the sketch at the end of this slide)
– Allows requests to be issued to any allocator in any time slot
– Matching rate independent of the number of iterations
Performance advantages
– Eliminates pipelining latency at low load
– Achieves 100% throughput with uniform traffic
– Reduces latency with respect to PMM also at high load
– Can improve throughput with nonuniform traffic
Highly amenable to distributed implementation in FPGAs
Can be applied to any existing iterative matching algorithm
[Diagram: FLPPR structure with VOQ pending-request counters, a delay line, and four sub-arbiters exchanging intermediate requests and grants between the incoming requests and the outgoing grants.]
[Plot: latency in time slots (log scale) vs. utilization for a 64x64 switch, K = 6, uniform Bernoulli traffic, comparing PMM, FLPPR basic, and FLPPR optimized.]
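As a rough picture of the pipelining idea, the toy model below runs K sub-arbiters concurrently, lets newly arriving requests be handed to any stage (here, as an assumed placeholder policy, to the stage about to finish), and emits one completed matching per time slot regardless of how many iterations each matching takes. K, the request policy, and the trivial "matching" routine are all simplifications, not the OSMOSIS design.

```python
# Toy model of pipelined, parallel arbitration (FLPPR-style), illustrative only.

from collections import deque

K = 6  # number of pipelined sub-arbiters (the slides use K = 6 for 64 ports)

class SubArbiter:
    """One pipeline stage working on one matching."""
    def __init__(self):
        self.requests = []   # requests accumulated for the matching in progress

    def add_requests(self, reqs):
        self.requests.extend(reqs)

    def finish(self):
        # Placeholder: a real sub-arbiter would run round-robin iterations here
        # and resolve conflicts; this toy simply emits what was accumulated.
        matching, self.requests = list(self.requests), []
        return matching

pipeline = deque(SubArbiter() for _ in range(K))

def time_slot(new_requests):
    # FLPPR lets requests be issued to any allocator in any time slot. As a
    # placeholder policy, hand them to the oldest stage (the one finishing now),
    # which is what removes the pipelining latency at low load.
    pipeline[0].add_requests(new_requests)
    # The oldest sub-arbiter completes its matching this slot; recycle it as
    # the youngest stage, so one matching is emitted every time slot.
    done = pipeline.popleft()
    matching = done.finish()
    pipeline.append(done)
    return matching

# Example: one request per slot comes back matched in the same slot.
for t in range(4):
    print(t, time_slot([(f"in{t}", f"out{t}")]))
```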
Distribution of Arbiter Complexity
Matching 64 ports = state of the art: the full algorithm for 64 ports does not fit in the largest Xilinx FPGA
Approach: distribute the input and output selectors
– Place 2 input selectors per CCI FPGA
– Place the 64 output selectors in the sub-arbiter FPGAs
Works well only with a two-phase algorithm (e.g. DRRM, but not SLIP)
Still performs well; requires a careful request policy
New issue: round trip between input and output selectors, still under study
[Diagram: distributed arbiter structure with input selectors located in the control channel interface FPGAs and output selectors located in the sub-arbiter FPGA.]
Arbiter Structure and Packaging
Name                       ID    #    Location
Control channel interface  CCI   32   OSCI[0:31]
Switch command interface   SCI   8    OSCI[32:39]
Sub-arbiters               A     4    OSCB
Clocking and control       CLK   1    OSCB
Multiplexer                MUX   1    OSCB
ACK router                 ACK   1    OSCB
Embedded microprocessor    μP    1    OSCB
Total                            47 + 1 devices
[Figure: OSCB layout.]
Midplane (OSCB; prototype shown here) with 40 daughter boards (OSCI)
Summary
OSMOSIS all-optical data path
multi-stage ready
high radix
cell switching
pipelined, distributed central arbitration
low latency, high throughput and low overhead
Project status: all FPGAs designed (placed and routed)
Final arbiter baseboard in layout
Final switch being integrated
Scheduled for completion in 1Q06
[Photo: optical switching module (fiber selection stage with 8 SOAs).]