FAST AND DEADLOCK-FREE ROUTING FOR ARBITRARY INFINIBAND NETWORKS AKA. DFSSSP ROUTING
Torsten Hoefler and Jens Domke
All used images belong to the producers.
New challenging problems need more bandwidth Scientific computing
Big data analytics
Distributed graph problems
InfiniBand offers advanced routing (kind of) IB spec mandates deadlock-free deterministic
distributed routing
OpenSM is the standard routing software Optimal algorithms for fat-tree and tori networks
MinHop for others (can produce credit loops)
Commercial alternatives exist
MOTIVATION FOR BETTER ROUTING
T. Hoefler and J. Domke Slide 2 of 28
FORGET FULL BISECTION BANDWIDTH Expensive topologies do not guarantee high bandwidth
Deterministic oblivious routing cannot reach full bandwidth! See Valiant’s lower bound
Random routing is asymptotically optimal but looses locality
InfiniBand routing: Deterministic oblivious, destination-based, simple
Linear forwarding table (LFT) at each switch
LID mask control (LMC) enables multiple addresses per port
Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks
T. Hoefler and J. Domke Slide 3 of 28
BUT MY VENDOR SAID “NON-BLOCKING”
Two communications 16, 414
Full bisection bandwidth network
No full bandwidth observed!
Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks
T. Hoefler and J. Domke Slide 4 of 28
SO HOW BAD IS CONGESTION?
Lower Bound! (~ GigE Speed)
Reality?
CHiC Supercomputer: • slightly aged but reflects routing • 566 nodes, full bisection IB fat-tree • no endpoint congestion! • effective Bisection Bandwidth: 0.699
Microbenchmarks (yet to be seen in
practice)
Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks
T. Hoefler and J. Domke Slide 5 of 28
BUT I HAVE A CLEVER SUBNET MANAGER! OpenSM (IB) routing algorithms: minhop (finds minimal paths, balances number of routes local
at each switch)
updn/dnup (uses Up*/Down* turn-control, limits choice but routes contain no credit loops)
ftree (fat-tree optimized routing, no credit loops)
dor/torus-2QoS (dimension order routing for k-ary n-cubes, DOR might generate credit loops)
lash (uses DOR and breaks credit loops with virtual lanes)
It’s clever if you have a fat-tree or a torus But beware if you add or remove one link!
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
T. Hoefler and J. Domke Slide 6 of 28
WHY ARBITRARY TOPOLOGIES? Many networks grow over time or fulfill more
than one purpose
Fat-trees and butterflies are hard to grow
Tori networks may have undesirable properties
IB supports arbitrary topologies!
Hybrid networks exist:
T. Hoefler and J. Domke Slide 7 of 28
EFFECTIVE BISECTION BANDWIDTH A measure for global network performance Considers routing
Can be measured with a benchmark
More realistic then bisection bandwidth
Effective Bisection Bandwidth (eBB) benchmark Divide network into equal partitions A and B
combinations
Find one peer in B for each node in A pairings
Huge number of patterns Statistics converge fast (~1000 measurements)
Implemented in Netgauge/eBB (download and try!)
Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks
T. Hoefler and J. Domke Slide 8 of 28
ACHIEVING HIGH BANDWIDTH Model network as G=(VP[VC, E)
Path r(u,v) is a path between u,v 2 VP
Routing R consists of P(P-1) paths
Edge load l(e) = number of paths on e 2 E
Edge forwarding index ¼(G,R)=maxe2E l(e)
¼(G,R) is an upper bound to congestion!
Goal is to find R that minimizes ¼(G,R)
Shown to be NP-hard in the general case
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
T. Hoefler and J. Domke Slide 9 of 28
A SIMPLE HEURISTIC
Keep it simple, greedily minimize ¼(G,R)
SSSP routing starts an SSSP run at each node
Finds paths with minimal edge-load l(e)
Updates routing tables in reverse
Essentially SDSP
Updates l(e) between runs
Strives for global balancing
An example …
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
T. Hoefler and J. Domke Slide 10 of 28
Source: penny-arcade.com
STEP 1/3 Step 1:
Source-node 0:
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
T. Hoefler and J. Domke Slide 11 of 28
STEP 2/3 Step 2:
Source-node 1:
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
T. Hoefler and J. Domke Slide 12 of 28
STEP 3/3 Step 3:
Source-node 2:
¼(G,R)=2
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
T. Hoefler and J. Domke Slide 13 of 28
EVALUATION - ODIN
Simulation Benchmark
(Netgauge; pattern: eBB)
Simulation predicts 5% improvement
Benchmark shows 18% improvement!
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
T. Hoefler and J. Domke Slide 14 of 28
EVALUATION - DEIMOS
Simulation Benchmark
(Netgauge; pattern: eBB)
Simulation predicts 23% improvement
Benchmark shows 40% improvement!
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks
T. Hoefler and J. Domke Slide 15 of 28
IT WORKS, IS THAT ALL? JUST SSSP?
Shown to run well on real systems in practice Odin (128 nodes, 18% eBB speedup)
Deimos (726 nodes, 40% eBB speedup)
Lomonosov1 (~4.5k nodes, ~10-20% Graph500 speedup)
Unfortunately not!
SSSP Routing may create credit loops On certain topologies
To be proven if some topologies are loop-free
Problematic in production environments (and interesting in theory )
1Lomonosov experiments were executed by Anton Korzh and Alexander Naumov
T. Hoefler and J. Domke Slide 16 of 28
WHAT ARE CREDITS AND WHY DO THEY LOOP?
IB uses credit-based flow-control egress messages sent only if receive-buffer available
Very similar to deadlocks in wormhole-routed systems
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 17 of 28
DEAL WITH CREDIT LOOPS Prevent (UP*/Down*, turn-based routing) Limits routing options
Resolve (LASH, use VLs to break cycles) Consumes additional buffers
Ignore (MinHop, DOR) Potential resolution: packet timeouts
Discouraged by IB specification
Others: Bubble Routing etc. Not supported by current specification
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 18 of 28
USING VLS TO AVOID DEADLOCKS Pioneered with LASH, example:
Deadlock!
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 19 of 28
USING VLS TO AVOID DEADLOCKS Pioneered with LASH, example:
2 VLs resolve deadlock
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 20 of 28
DEADLOCK-FREE SSSP ROUTING Perform normal SSSP
Detect cycles
“Break” cycle by adding new VL, rinse, repeat
VLs are expensive, how many do we need?
NP-hard problem
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 21 of 28
IS IT PRACTICAL? DO WE HAVE ENOUGH VLS?
Usually yes!
But it depends on the topology and network size
IB supports up to 15 VLs
Current hardware has 8 VLs
0
2
4
6
8
10
12
14
16
ChiC
Deimos
Odin
JUROPA
Ranger
Tsubame
Needed #(virtual lanes)
LASH
DFSSSP
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 22 of 28
MPI BENCHMARKS
47% faster execution (compared to MinHop)
Same performance for SSSP and DFSSSP
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0 512 1024 1536 2048 2560 3072 3584 4096
Runtime in s
Elements in buffer (#floats)
MPI All-to-All on 128 nodes (1 core per node)
MinHop
LASHDFSSSP
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 23 of 28
EBB AND NAS PARALLEL BENCHMARKS Measured on Deimos at TU Dresden
1 MPI process per node (except for 1024)
0
50
100
150
200
250
300
350
400
128 256 512 1024
eBB in MiByte/s
Number of MPI-processes
MinHop
LASHSSSPDFSSSP
0
50
100
150
200
250
121 256 484 1024
Gflop/s (total)
Number of MPI-processes
MinHop
LASHSSSPDFSSSP
BT, class C – solver
Netgauge, eBB
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 24 of 28
IS IT PRACTICAL? WHAT ABOUT LARGE SCALE? Up to 64% higher eBB (depending on #nodes and topology)
Runtime can be an issue! (for systems w/ #nodes > 5.000)
0
0.2
0.4
0.6
0.8
1
CHiC
Deimos
JUROPA
Odin
Ranger
Tsubame
Eff. bisection bandwidth
MinHopUp*/Down*FatTree
LASHDORSSSP
DFSSSP
10-4
10-2
100
102
104
CHiC
Deimos
JUROPA
Odin
Ranger
Tsubame
Runtime in s
MinHopUp*/Down*FatTree
LASHDORSSSP
DFSSSP
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 25 of 28
HOW DO I GET/USE DFSSSP? OpenFabrics Alliance integrated DFSSSP into the
official OFED stack in May, 2012 http://www.openfabrics.org/downloads/
management/opensm-3.3.16.tar.gz
Start OpenSM with option “-R dfsssp” or set “routing_engine dfsssp” in configuration file
Configure QoS for switches/HCAs to equally use VLs
(Please don’t use DFSSSP in opensm-3.3.14/15)
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 26 of 28
HOW DO I GET/USE DFSSSP? QoS config example for OpenSM:
qos TRUE qos_max_vls 8 qos_high_limit 4 qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64 qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4 qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
Enable SL queries for path setup in Open MPI: Configure Open MPI with
“--with-openib --enable-openib-dynamic-sl”
Run your application with “--mca btl_openib_ib_path_record_service_level 1”
Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
T. Hoefler and J. Domke Slide 27 of 28
DFSSSP can offer: Same or better eff. bisection bandwidth
(compared to other routing algorithms)
Huge increase in application performance and speed-up of MPI-collectives (especially for irregular topologies)
Usability on arbitrary topologies and efficient usage of VLs
Free and easy to use (integration in OFED)
Enhancement for other technologies (similar to IB)
CONCLUSION
T. Hoefler and J. Domke Slide 28 of 28