Hot Interconnects 2005
Control Path Implementation for a Low-Latency Optical HPC Switch
C. Minkenberg¹, F. Abel¹, P. Müller¹, R. Krishnamurthy¹, M. Gusat¹, B.R. Hemenway²
¹ IBM Research, Zurich Research Laboratory
² Corning Inc., Science and Technology
OSMOSIS
Hot Interconnects 2005 © 2005 IBM Corporation & Corning Incorporated
Outline
Motivation: Can optics play a significant role in high-performance computing interconnects?
OSMOSIS Requirements
Design decisions
Architecture
Control path challenges: arbiter speed & complexity
Summary
HPC Interconnection Networks
Presently implemented as electronic packet switching networks, but quickly approaching electronic limits with further scaling
Future could be based on maturing all-optical packet switching, but we need to solve the technical challenges and accelerate the cost reduction of all-optical packet switching for HPC interconnects
Goal: build a full-function all-optical packet switch demonstrator system showing the scalability, performance, and cost paths for a potential commercial system
OSMOSIS: Optical Shared MemOry Supercomputer Interconnect System
Sponsored by DoE & NNSA as part of ASC
Joint 2½-year project between Corning (optics and packaging) and IBM (electronics: arbiter, input and output adapters; and system design)
HPC Requirements for OSMOSIS
Near 1 microsecond memory-to-memory latency: includes encoding/decoding, arbitration, and virtual output queues (VOQs)
Scaling to 2048+ nodes in a multi-stage topology
Very low bit-error rate (10⁻²¹) after forward error correction (FEC) and reliable delivery (RD)
Low switching overhead (<25%): includes optical switching overhead, header, line coding, and FEC
FPGA-only implementation, for cost and flexibility
Key design choices
Large switch radix: a 64-port switch allows scaling to 2048 nodes in two levels
3-stage, 2-level Fat Tree topology with flow control
Basic module scales from 16 to 128 ports
Cell switching: no provisioning or aggregation techniques (burst/container switching)
Full switch reconfiguration every time slot (51.2 ns) requires low overhead and fast arbitration
Enabled by fast semiconductor optical amplifiers (SOAs)
Input queuing with central arbitration: optical crossbar, no buffering in the optical domain
Electronic input buffers with VOQs to eliminate HOL blocking
Electronic central arbiter to achieve high maximum throughput and low latency
Port speed 40 Gb/s, cell size 256 B: allows ~25% overhead (see the sketch after this list)
Arbitration feasible at 40 Gb/s
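As a quick sanity check on these numbers, the sketch below (plain Python; the payload split is an assumption for illustration, not a figure from the slides) relates the 256 B cell and the 40 Gb/s port speed to the 51.2 ns time slot and to the <25% overhead budget.

```python
# Back-of-the-envelope check of the OSMOSIS cell timing (illustrative only).

PORT_RATE_BPS = 40e9   # 40 Gb/s per port
CELL_BYTES = 256       # cell size

# A 256 B cell at 40 Gb/s occupies the link for exactly one time slot.
cell_bits = CELL_BYTES * 8
slot_ns = cell_bits / PORT_RATE_BPS * 1e9
print(f"time slot = {slot_ns:.1f} ns")   # -> 51.2 ns, as on the slide

# With a <25% overhead budget (header, line coding, FEC, switching overhead),
# at least 75% of each cell remains for payload. The resulting byte count is
# an assumed illustration, not a number taken from the slides.
OVERHEAD_BUDGET = 0.25
print(f"payload budget >= {CELL_BYTES * (1 - OVERHEAD_BUDGET):.0f} B per cell")
```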
OSMOSIS System Architecture
Broadcast-and-select architecture (crossbar)
Combination of wavelength- and space-division multiplexing
Fast switching based on SOAs
Electronic input and output adapters
Electronic arbitration
[Figure: system diagram. 64 ingress adapters (VOQs, Tx, control) feed an all-optical switch built from 8 broadcast units (WDM mux, optical amplifier, star coupler) and 128 select units (fast SOA 1x8 fiber- and wavelength-selector gates, 8x1 combiner); 64 egress adapters (2 Rx, EQ, control) receive the outputs; a central arbiter running a bipartite graph matching algorithm connects to the adapters via control links.]
Control Path Challenges
System size: remote adapter cards – long round-trip time
Control channel protocol: combines many functions
Matching algorithm: iterative round-robin-based matching algorithm (see the sketch after this list)
– Good performance, practical, amenable to distribution
– Requires about log2(64) = 6 iterations for highest performance
Speed
– Short cell duration makes it impossible to complete sufficient iterations
Complexity
– Implement an iterative matching algorithm for 64 ports @ 51.2 ns in FPGAs
– Parallelism and distribution are needed
Packaging: a large number of high-end FPGAs must be accommodated in close proximity
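For orientation, here is a minimal Python sketch of an iterative two-phase round-robin matcher in the spirit of DRRM (request pointers at the inputs, grant pointers at the outputs, several iterations per time slot). It illustrates the class of algorithm named above, not the actual OSMOSIS arbiter; the port count, iteration count, and pointer-update policy are simplified assumptions.

```python
# Minimal iterative two-phase round-robin matcher (DRRM-style), illustrative only.
# voq[i][j] > 0 means input i has cells queued for output j.

def iterative_rr_match(voq, n_ports, n_iter):
    in_ptr = [0] * n_ports        # round-robin request pointer per input
    out_ptr = [0] * n_ports       # round-robin grant pointer per output
    match_in = [None] * n_ports   # output matched to each input
    match_out = [None] * n_ports  # input matched to each output

    for _ in range(n_iter):
        # Phase 1: every still-unmatched input requests one non-empty VOQ
        # aimed at a still-unmatched output, starting at its pointer.
        requests = {}  # output -> list of requesting inputs
        for i in range(n_ports):
            if match_in[i] is not None:
                continue
            for k in range(n_ports):
                j = (in_ptr[i] + k) % n_ports
                if voq[i][j] > 0 and match_out[j] is None:
                    requests.setdefault(j, []).append(i)
                    break
        # Phase 2: every requested output grants one input, starting at its pointer.
        for j, reqs in requests.items():
            for k in range(n_ports):
                i = (out_ptr[j] + k) % n_ports
                if i in reqs:
                    match_in[i], match_out[j] = j, i
                    in_ptr[i] = (j + 1) % n_ports    # simplified pointer update
                    out_ptr[j] = (i + 1) % n_ports
                    break
    return match_in

# Example: 4 ports, 2 iterations; entry [i] is the output granted to input i.
voq = [[1, 0, 2, 0],
       [0, 1, 0, 0],
       [3, 0, 0, 1],
       [0, 0, 1, 0]]
print(iterative_rr_match(voq, n_ports=4, n_iter=2))  # e.g. [0, 1, 3, 2]
```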
System Size
Large installations: switch remote from nodes, long cables
Long round-trip latency between
– adapters and arbiter
– adapters and crossbar
– arbiter and crossbar
Control channel protocol (ΔRGP): incremental requests and grants (a sketch follows the figures below)
Arbiter keeps track of pending requests per VOQ
Careful delay matching to ensure correct operation
[Diagram: round-trip timing between adapter, arbiter, crossbar, and adapter.]
[Plot: mean delay in packet cycles (log scale) vs. throughput for a 32x32 switch running 5-SLIP with ΔRGP under uniform Bernoulli traffic; curves for round-trip times RT = 0, 16, 32, 64, and 128.]
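To illustrate the incremental request/grant idea (this is a toy model, not the OSMOSIS message format), the sketch below keeps per-VOQ pending-request counters at the arbiter: adapters report only the number of new arrivals each time slot, and every grant decrements the matching counter, so no absolute queue state has to survive the long round trip.

```python
# Toy model of an incremental request/grant protocol (ΔRGP-style), illustrative only.
# The arbiter tracks pending requests per (input, output) VOQ; adapters send only
# increments (new arrivals), and each grant consumes one pending request.

class ArbiterRequestState:
    def __init__(self, n_ports):
        # pending[i][j]: cells requested by input i for output j, not yet granted
        self.pending = [[0] * n_ports for _ in range(n_ports)]

    def receive_increment(self, inp, out, delta):
        """Adapter 'inp' reports 'delta' newly queued cells for output 'out'."""
        self.pending[inp][out] += delta

    def grant(self, inp, out):
        """Grant one cell of VOQ (inp, out) if a request is pending."""
        if self.pending[inp][out] > 0:
            self.pending[inp][out] -= 1
            return True
        return False

# Example: input 3 reports two arrivals for output 7; the arbiter can grant twice.
arb = ArbiterRequestState(n_ports=64)
arb.receive_increment(3, 7, 2)
print(arb.grant(3, 7), arb.grant(3, 7), arb.grant(3, 7))  # True True False
```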
Control Channel Protocol
Incremental request/grant protocol (ΔRGP): to cope with the round-trip time
“Census”: to ensure consistency of ΔRGP in the presence of errors
Reliable delivery: relaying of intra-switch acknowledgments
Flow control: to prevent egress buffer overflow (on-off, watermark-based; see the sketch after this list)
Multicast: very large fanout (64 bits), special control message format
Control channel bandwidth: 12 B control messages
2 Gb/s/port (2.5 Gb/s raw), bidirectional
Aggregate arbiter bandwidth = (64 + 16) * 2.5 * 2 = 400 Gb/s!
One control channel interface (CCI) FPGA per two ports
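The on-off, watermark-based egress flow control mentioned above follows a standard hysteresis pattern, sketched below in Python; the watermark values are made-up placeholders and the code is a generic illustration rather than the OSMOSIS implementation.

```python
# Generic on-off watermark flow control for an egress buffer (illustrative only).
# XOFF is asserted when occupancy crosses the high watermark and released
# only after it drops below the low watermark (hysteresis).

HIGH_WATERMARK = 48   # placeholder values, in cells
LOW_WATERMARK = 16

class EgressFlowControl:
    def __init__(self):
        self.occupancy = 0   # cells currently buffered
        self.xoff = False    # True = ask the arbiter to stop granting to this egress

    def on_cell_arrival(self):
        self.occupancy += 1
        if self.occupancy >= HIGH_WATERMARK:
            self.xoff = True

    def on_cell_departure(self):
        self.occupancy -= 1
        if self.occupancy <= LOW_WATERMARK:
            self.xoff = False

# Example: fill the buffer past the high watermark, then drain it.
fc = EgressFlowControl()
for _ in range(50):
    fc.on_cell_arrival()
print(fc.xoff)   # True: arbiter should withhold grants for this egress
for _ in range(40):
    fc.on_cell_departure()
print(fc.xoff)   # False again once below the low watermark
```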
FLPPR: Fast Low-latency Parallel Pipelined aRbitration
Short cell duration and large radix: pipelining required to complete enough iterations per matching; mean latency decreases as the number of iterations increases
FLPPR: pipelined allocators, parallel requests (see the sketch at the end of this slide)
– Allows requests to be issued to any allocator in any time slot
– Matching rate independent of the number of iterations
Performance advantages
– Eliminates pipelining latency at low load
– Achieves 100% throughput with uniform traffic
– Reduces latency with respect to PMM also at high load
– Can improve throughput with nonuniform traffic
Highly amenable to distributed implementation in FPGAs
Can be applied to any existing iterative matching algorithm
[Diagram: FLPPR structure with VOQ pending-request counters, a delay line, and four sub-arbiters exchanging intermediate requests and grants between the incoming requests and the outgoing grants.]
[Plot: latency in time slots (log scale) vs. utilization for a 64x64 switch, K = 6, uniform Bernoulli traffic, comparing PMM, FLPPR basic, and FLPPR optimized.]
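As a rough picture of the pipelining idea, the toy model below runs K sub-arbiters concurrently, lets newly arriving requests be handed to any stage (here, as an assumed placeholder policy, to the stage about to finish), and emits one completed matching per time slot regardless of how many iterations each matching takes. K, the request policy, and the trivial "matching" routine are all simplifications, not the OSMOSIS design.

```python
# Toy model of pipelined, parallel arbitration (FLPPR-style), illustrative only.

from collections import deque

K = 6  # number of pipelined sub-arbiters (the slides use K = 6 for 64 ports)

class SubArbiter:
    """One pipeline stage working on one matching."""
    def __init__(self):
        self.requests = []   # requests accumulated for the matching in progress

    def add_requests(self, reqs):
        self.requests.extend(reqs)

    def finish(self):
        # Placeholder: a real sub-arbiter would run round-robin iterations here
        # and resolve conflicts; this toy simply emits what was accumulated.
        matching, self.requests = list(self.requests), []
        return matching

pipeline = deque(SubArbiter() for _ in range(K))

def time_slot(new_requests):
    # FLPPR lets requests be issued to any allocator in any time slot. As a
    # placeholder policy, hand them to the oldest stage (the one finishing now),
    # which is what removes the pipelining latency at low load.
    pipeline[0].add_requests(new_requests)
    # The oldest sub-arbiter completes its matching this slot; recycle it as
    # the youngest stage, so one matching is emitted every time slot.
    done = pipeline.popleft()
    matching = done.finish()
    pipeline.append(done)
    return matching

# Example: one request per slot comes back matched in the same slot.
for t in range(4):
    print(t, time_slot([(f"in{t}", f"out{t}")]))
```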
Distribution of Arbiter Complexity
Matching 64 ports = state of the art: the full algorithm for 64 ports does not fit in the largest Xilinx FPGA
Approach: distribute the input and output selectors
– Place 2 input selectors per CCI FPGA
– Place the 64 output selectors in the sub-arbiter FPGAs
Works well only with a two-phase algorithm (e.g. DRRM, but not SLIP)
Still performs well; requires a careful request policy
New issue: round trip between input and output selectors, still under study
[Diagram: distributed arbiter structure with input selectors located in the control channel interface FPGAs and output selectors located in the sub-arbiter FPGA.]
Arbiter Structure and Packaging
Name                       ID    #    Location
Control channel interface  CCI   32   OSCI[0:31]
Switch command interface   SCI   8    OSCI[32:39]
Sub-arbiters               A     4    OSCB
Clocking and control       CLK   1    OSCB
Multiplexer                MUX   1    OSCB
ACK router                 ACK   1    OSCB
Embedded microprocessor    μP    1    OSCB
Total                            47 + 1 devices
[Figure: OSCB layout.]
Midplane (OSCB; prototype shown here) with 40 daughter boards (OSCI)
Summary
OSMOSIS all-optical data path
multi-stage ready
high radix
cell switching
pipelined, distributed central arbitration
low latency, high throughput and low overhead
Project status: all FPGAs designed (placed and routed)
Final arbiter baseboard in layout
Final switch being integrated
Scheduled for completion in 1Q06
[Photo: optical switching module (fiber selection stage with 8 SOAs).]