Page 1: Robustness of Interconnection Networks

Robustness of Interconnection Networks

3rd JLESC Summer School

Atsushi Hori, RIKEN AICS

Tuesday, June 28, 2016

Page 2: Robustness of Interconnection Networks

3rd JLESC SS@Lyon 2016

Self Introduction

Atsushi Hori - System Software Researcher

• RIKEN: the oldest and largest governmental research institute in Japan, since 1917
• Advanced Institute for Computational Science (AICS), since 2010
    • Running the K computer, the largest supercomputer in Japan (ranked 5th in Top500, June 2016)
    • Involved in the Flagship2020 project to develop the post-K computer

Page 4: Robustness of Interconnection Networks

Self Introduction (continued)

DISCLAIMER: The contents of this talk are based on my personal experiences and are independent of the Flagship2020 project.

Page 5: Robustness of Interconnection Networks

Self Introduction (continued)

The colored slides are supplements.

Page 6: Robustness of Interconnection Networks

Self Introduction (continued)

The venue of the next JLESC meeting: Kobe, in December.

Page 7: Robustness of Interconnection Networks

HPC Network

• Low latency and high bandwidth
    • Higher performance than silicon disks
• High bisection bandwidth
    • Low congestion possibility (hopefully)
• Very reliable
    • No error, no loss
• Dense (in a computer room)
    • The Internet, in contrast, covers the whole earth
• Packet switching
    • No circuit switching (as in the old telephone network)

Page 8: Robustness of Interconnection Networks

Outline

• Network Basics
    • Topology
    • Routing
    • Implementation
• Fault Resilience
    + my personal opinion

Page 9: Robustness of Interconnection Networks

Glossary

• A network consists of
    • Nodes, where packets are sent and received (a node may include a switch, see below)
    • Switches (routers)
    • Links, connecting nodes and switches
• Data transfer
    • Packet: a unit of transfer
    • Message: consists of multiple packets

Page 10: Robustness of Interconnection Networks

Topology

Page 11: Robustness of Interconnection Networks

Network Topologies (1)

[Figure: example topologies drawn with nodes, links and switches: Mesh, Torus, FatTree and "SkinnyTree"]

Page 12: Robustness of Interconnection Networks

Network Topology in Top500

• Topologies in Top500 (http://www.top500.org)
    • Torus/Mesh: BG/Q, the K (Tofu)
    • FatTree: InfiniBand, Aries, Cray Gemini, Tianhe
    • SkinnyTree: Ethernet
    • Misc.: IBM Power 775

[Figure: topology (FatTree, Torus/Mesh, SkinnyTree, Misc.) of each system plotted against its rank (0 to 500) in Top500 as of Nov. 2015]

Page 13: Robustness of Interconnection Networks

Network Topologies (2)

• Hypercube: CM-2, nCUBE in the 90s
• Dragonfly: Cray XC series

and many others (ring, star, butterfly, to name a few)

[Figure: hypercube and dragonfly topologies drawn with nodes, links and switches]

Page 14: Robustness of Interconnection Networks

Routing

Page 15: Robustness of Interconnection Networks

Routing

• Find a path from a sender node to a receiver node
• Ex) X-Y (dimension-order) routing in a 2D mesh

[Figure: 2D mesh of nodes with an X-Y route from node Ni to node Nj]
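The X-Y routing example can be sketched in a few lines. This is a minimal illustration, assuming nodes are addressed by integer (x, y) coordinates; the function name `xy_route` is made up for this sketch and does not come from the talk.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst in a 2D mesh.

    Dimension-order routing: correct the X coordinate first, then Y.
    """
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                 # phase 1: move along X only
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # phase 2: move along Y only
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet finishes its X movement before it starts moving in Y, the route between two nodes is fixed, which makes this a static routing scheme.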

Page 16: Robustness of Interconnection Networks

Deadlock

• A routing algorithm on a network topology must be deadlock free
• A cyclic path can cause deadlock
• Deadlock can be avoided by having a bypass
    • Virtual channels

[Figure: a switch with 1 channel versus a switch with 2 (virtual) channels]
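One common way to reason about the cyclic-path problem is a channel dependency graph: if the graph has a cycle, packets can wait on each other forever. The sketch below is an illustration under stated assumptions, not the scheme used by any particular machine: it models a unidirectional ring, and the two-virtual-channel variant uses a "dateline" rule (packets switch to the second virtual channel after crossing one chosen link), which the slide does not specify. All function names are hypothetical.

```python
def has_cycle(deps):
    """Depth-first search for a cycle in a channel dependency graph.

    deps maps each channel to the channels a packet may request next
    while still holding it.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def visit(c):
        color[c] = GREY
        for nxt in deps[c]:
            if color[nxt] == GREY:          # back edge: a cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(deps))


def ring_single_channel(n):
    # One channel per link i -> (i + 1) % n; holding link i, a packet may request link i + 1.
    return {i: [(i + 1) % n] for i in range(n)}


def ring_two_virtual_channels(n):
    # Channel (vc, i) is link i -> (i + 1) % n on virtual channel vc.
    # Assumed dateline rule: packets start on vc 0 and move to vc 1 after
    # crossing link n - 1, so no dependency wraps around on a single vc.
    deps = {(vc, i): [] for vc in (0, 1) for i in range(n)}
    for i in range(n - 1):
        deps[(0, i)].append((0, i + 1))
        deps[(1, i)].append((1, i + 1))
    deps[(0, n - 1)].append((1, 0))
    return deps


print(has_cycle(ring_single_channel(4)))        # True
print(has_cycle(ring_two_virtual_channels(4)))  # False
```

With a single channel the dependencies close into a loop; the second virtual channel acts as the bypass that breaks it.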

Page 18: Robustness of Interconnection Networks

Hot Spot

• Hot spot
    • A place where packet congestion happens
• 2D mesh: hot spot at the center
• 2D torus: no hot spots

[Figure: 2D mesh of nodes with the hot spot marked]
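The mesh-versus-torus contrast can be checked by counting link loads. The sketch below is a simplification: it uses one dimension only (a line instead of a 2D mesh, a ring instead of a 2D torus) and assumes uniform all-to-all traffic with shortest-path routing. The function name `link_loads` is hypothetical.

```python
from collections import Counter


def link_loads(n, wraparound):
    """Count how many source-destination pairs cross each link.

    Nodes 0..n-1 sit on a line (wraparound=False) or a ring
    (wraparound=True); every ordered pair sends one packet along a
    shortest path. Key i is the link between node i and node (i + 1) % n.
    """
    loads = Counter()
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            step = 1 if dst > src else -1
            if wraparound and abs(dst - src) > n // 2:
                step = -step          # going around the other way is shorter
            node = src
            while node != dst:
                nxt = (node + step) % n
                link = node if step == 1 else nxt
                loads[link] += 1
                node = nxt
    return [loads[i] for i in range(n if wraparound else n - 1)]


print(link_loads(5, wraparound=False))  # [8, 12, 12, 8]
print(link_loads(5, wraparound=True))   # [6, 6, 6, 6, 6]
```

On the line, the central links carry the most traffic, whereas the wraparound links of the ring even out the load. An odd node count is used so that every shortest path on the ring is unique.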

Page 19: Robustness of Interconnection Networks

Partitioning

• Multiple jobs can run on a big machine
    • Node space is partitioned
• Partitioning may change the topology of a job
• Jobs may interfere with each other

[Figure: a 2D torus of nodes partitioned among Jobs A, B, C and D. Annotations: "2D torus turns into 2D mesh"; "Job B, C and D can interfere with the others"]

Page 20: Robustness of Interconnection Networks

Dynamic (Adaptive) Routing

• Static routing
    • Once a path is fixed, packets always go along that path
• Dynamic (adaptive) routing
    • Paths can be changed dynamically according to the state of the network
• Issues
    • Algorithm: how, who, when?
    • Deadlock freedom
    • Route-changing latency and H/W resources
    • Stability (see the next slide)
    • Packet order is not preserved (see the slide after next)

Page 21: Robustness of Interconnection Networks

Oscillation in Adaptive Routing

Two roads to the same destination:

1. One is very crowded
2. The radio says the other is empty
3. Everybody rushes into the other road
4. (repeat 1-3)

Page 22: Robustness of Interconnection Networks

Packet Order

• Adaptive routing cannot preserve packet ordering
• This can be problematic when receiving large messages consisting of multiple packets

[Figure: when sending order = receiving order, packets P0 P1 P2 P3 P4 P5 P6 P7 … fill Recvbuf 0 and Recvbuf 1 in sequence. When sending order ≠ receiving order, packets arrive as P0 P5 P3 P2 P4 P7 P9 P6 … and the receive buffers are filled out of sequence.]
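A receiver can cope with out-of-order arrival if every packet carries its position in the message. The sketch below assumes a sequence number in each packet and a fixed payload size; both are assumptions made for illustration, not details given in the talk.

```python
def reassemble(packets, packet_size):
    """Place packets into a receive buffer by sequence number.

    packets: iterable of (seq, payload) pairs in arrival order.
    Each payload must be exactly packet_size bytes.
    """
    packets = list(packets)
    buf = bytearray(packet_size * len(packets))
    for seq, payload in packets:
        offset = seq * packet_size      # position depends on seq, not on arrival order
        buf[offset:offset + packet_size] = payload
    return bytes(buf)


arrived = [(0, b"AA"), (3, b"DD"), (1, b"BB"), (2, b"CC")]
print(reassemble(arrived, 2))   # b'AABBCCDD'
```

The cost is that the receiver cannot simply append to a buffer: it needs the offset of each packet, and it only knows the message is complete once every sequence number has been seen.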

Page 23: Robustness of Interconnection Networks

Metrics

• Topology
    • Network diameter
    • High-radix or low-radix
        • The higher the radix, the smaller the network diameter
• Performance
    • Whole
        • Bisection bandwidth
    • P2P
        • Bandwidth and latency
        • Hop count
    • Collective operations (barrier, and so on)
        • Latency

Page 24: Robustness of Interconnection Networks

Implementation

Page 25: Robustness of Interconnection Networks

Installation of the K Computer

Page 26: Robustness of Interconnection Networks

Direct/Indirect Network

• Direct or indirect network
    • Direct network
        • Every node has a switch inside
    • Indirect network
        • Node has no switch

Machine/Network    Direct/Indirect
the K (Tofu)       Direct
BG/Q               Direct
InfiniBand         Indirect
Ethernet           Indirect

Note: In many books, direct or indirect network is categorized as an aspect of topology.

Page 27: Robustness of Interconnection Networks

Level off Cable Lengths

• A naive implementation results in uneven cable lengths

Page 28: Robustness of Interconnection Networks

Level off Cable Lengths

• To level off cable lengths, alternate nodes are connected
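The effect of connecting alternate nodes can be seen on a single ring. The sketch below assumes nodes sit in physical slots 0..n-1 along a row and measures cable length in slots; the "folded" ordering shown is one common way to realise the idea and is an assumption, since the slide only shows a figure.

```python
def naive_ring(n):
    """Physical slot order 0, 1, ..., n-1, closed by one long cable."""
    return list(range(n))


def folded_ring(n):
    """Visit even slots going up, then odd slots coming back down."""
    evens = list(range(0, n, 2))
    odds = list(range(1, n, 2))
    return evens + odds[::-1]


def cable_lengths(order):
    """Distance in slots between consecutive ring neighbours (including the wrap)."""
    n = len(order)
    return [abs(order[i] - order[(i + 1) % n]) for i in range(n)]


print(cable_lengths(naive_ring(8)))    # [1, 1, 1, 1, 1, 1, 1, 7]
print(cable_lengths(folded_ring(8)))   # [2, 2, 2, 1, 2, 2, 2, 1]
```

The naive layout has one cable spanning the whole row; the folded layout trades slightly longer short cables for a uniform maximum length of two slots.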

Page 29: Robustness of Interconnection Networks

Co-Design

• Network Cost = ∑ C + ∑ S + ∑ L
    C: network interface (card) of a node
    S: switch
    L: cable
• Co-design
    • Communication patterns of applications
    • Find protocols to maximize performance of possible applications, and
        • to minimize network cost
        • to minimize power consumption

Page 30: Robustness of Interconnection Networks

Fault Resilience

Page 31: Robustness of Interconnection Networks

Fault Resilience

• The system and/or jobs can survive a network component failure
• Possible failure points
    • Link
    • Switch
    • Node

Page 32: Robustness of Interconnection Networks

Link or Switch Failure

• Static routing
    • Somebody changes the routing information to bypass the failed part(s)
• Dynamic routing
    • If a failure can be detected, the failed part(s) can be bypassed automatically
• Needless to say, the resulting routing must be deadlock free

Page 33: Robustness of Interconnection Networks

Node Failure

• If the application has
    • Dynamic load balancing
        • The job stops using the failed node and rebalances the load
    • Static load balancing
        • Ex) Stencil computation
        • Hard to rebalance the load => spare node

2D Jacobi iteration on a 2D array V(N,M):

V'(i,j) = A * ( V(i-1,j) + V(i+1,j) + V(i,j-1) + V(i,j+1) )
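The Jacobi formula above can be written as a single sweep over a grid. This is a minimal single-process sketch; the coefficient value 0.25 (plain averaging) and the fixed boundary are assumptions, as the slide only gives the general form with a coefficient A.

```python
def jacobi_step(v, a=0.25):
    """One 2D Jacobi sweep over the interior of grid v (a list of lists).

    Boundary cells are kept fixed. a is the coefficient A of the formula;
    0.25 is an assumed value.
    """
    n, m = len(v), len(v[0])
    new = [row[:] for row in v]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = a * (v[i - 1][j] + v[i + 1][j] + v[i][j - 1] + v[i][j + 1])
    return new


grid = [[0.0] * 4 for _ in range(4)]
grid[0] = [1.0] * 4                     # hot top boundary
grid = jacobi_step(grid)
print(grid[1])   # [0.0, 0.25, 0.25, 0.0]
```

Every cell needs its four neighbours, so when the array is split across nodes each node exchanges boundary data with the same four neighbours in every iteration. That fixed pattern is why the load is hard to rebalance after a node is lost.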

Page 34: Robustness of Interconnection Networks

Spare Node Substitution

• Assuming switches and links are all healthy
• A naive spare node substitution may result in a large number of packet collisions
    • Max. latency depends on the number of collisions
• Is there a way to avoid this situation?

[Figure: a failed node (F) migrates to a spare node (S); annotations mark "4 Possible Collisions" and "5 Possible Collisions"]

Page 35: Robustness of Interconnection Networks

Spare Node

• Pros
    • Easy to program
    • Balanced load
• Cons
    • Lower node utilization
    • Additional packet collisions

[Figure: a 6 x 6 grid of nodes numbered 0 to 35 with spare nodes alongside; node 21 fails. Panels: 0D sliding, 1D sliding, 2D sliding. In the 0D panel, 21 is placed on a spare node at the end of its row. In the 1D panel, 21, 27 and 33 each shift one position along their column, with 33 landing on a spare node.]

Page 36: Robustness of Interconnection Networks

Sliding Methods - Basic Idea

• Sliding methods
    • 0D - Naive method
    • 1D - Up to 3 (worst case)
    • 2D - Up to 2
    • and so on
• Hybrid sliding
    • Try the highest-degree method first
    • If it fails, try a lower-degree method

[Figure: same as on the previous slide (0D, 1D and 2D sliding after node 21 fails)]
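The 1D case in the figure (node 21 fails and 21, 27, 33 each move one node along the column) can be sketched as a shift of one column towards a spare row. This is an illustration of that single panel only, under the assumption that one unused spare node sits at the end of each column; it is not the implementation from reference [7], and the function name `slide_1d` is hypothetical.

```python
def slide_1d(ranks, failed_row, col):
    """Shift the ranks of one column down by one, starting at the failed node.

    ranks is a list of rows; the last row holds spare nodes (None = unused).
    The failed position is marked None afterwards.
    """
    new = [row[:] for row in ranks]
    for r in range(len(ranks) - 1, failed_row, -1):
        new[r][col] = ranks[r - 1][col]   # each rank moves to the node below
    new[failed_row][col] = None           # the failed node no longer hosts a rank
    return new


rows, cols = 6, 6
ranks = [[r * cols + c for c in range(cols)] for r in range(rows)]
ranks.append([None] * cols)               # one row of spare nodes (assumed layout)
after = slide_1d(ranks, failed_row=3, col=3)   # node 21 fails
print([row[3] for row in after])
# [3, 9, 15, None, 21, 27, 33]
```

Every moved rank stays next to its former column neighbours, so the stencil pattern is disturbed only around the failed node rather than along the whole path to a distant spare.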

Page 37: Robustness of Interconnection Networks

Repair

Page 38: Robustness of Interconnection Networks

The K and FX100

The K Computer (2011)
• CPU: 8 cores; ICC (Tofu network)
• 1 system board: 4 CPUs and 4 ICCs (32 cores)
• 24 chassis (768 cores)

FX100 (2015)
• CPU: 32+2 cores + Tofu
• 3 CPUs (nodes)
• 4 system boards (384 cores)
• 18 chassis (6,912 cores)

[Embedded Fujitsu slide "Key Technology (5)", Copyright 2015 FUJITSU LIMITED, translated from Japanese]
• SPARC64 XIfx
    • HPC-ACE2
    • L1 cache: number of ways doubled
    • Enhanced superscalar execution
        • More out-of-order resources
        • Enhanced branch prediction
    • 256-bit wide SIMD
        • Single-precision double-width mode
        • 8-byte integer instructions
    • Assistant cores
        • Handle I/O, OS and communication daemons
[Die diagram: 32 cores and 2 assistant cores, L2 caches, Tofu2 interface and controller, HMC interfaces, MACs, PCI interface and controller]

[Embedded Fujitsu slide "Key Technology (4)", Copyright 2015 FUJITSU LIMITED, translated from Japanese]
• Rack
    • 216 nodes per cabinet
    • CPUs, memory and optical modules are directly water-cooled (90% water cooling)
• Chassis
    • 19-inch rack-mount chassis
    • 12 nodes per 2U
    • Tofu2 links between main units are optical
• CPU Memory Board
    • CPU x 3
    • 3 x 8 Micron HMCs

Source: http://accc.riken.jp/wp-content/uploads/2015/06/chiba.pdf

Page 39: Robustness of Interconnection Networks

Fujitsu FX100

• A chassis contains 3 nodes, switches and links.
• A Tofu unit, which consists of 12 nodes, is also a scheduling unit
• Tofu 6D coordinate: "XYZabc" (a=2, b=3, c=2)
    • The "XYZ" coordinate represents the location of a Tofu unit
    • The "abc" coordinate represents the location inside a Tofu unit
• A chassis contains 12 nodes
• 3 chassis compose 3 Tofu units

Excerpt from the white paper "FUJITSU Supercomputer PRIMEHPC FX100 – Evolution to the Next Generation", page 7 of 8 (http://www.fujitsu.com/global/products/computing/servers/supercomputer/primehpc-fx100/):

Tofu Interconnect 2

The Tofu interconnect 2 (Tofu2) is an interconnect integrated into the SPARC64™ XIfx processor. Tofu2 enhances the bandwidth and functions of the Tofu interconnect (Tofu1) of the previous PRIMEHPC FX10 systems.

6D mesh/torus network
Tofu2 interconnects nodes to construct a system with a 6D mesh/torus network, like with Tofu1. The sizes of the three axes X, Y, and Z vary depending on the system configuration. The sizes of the other three axes, A, B, and C, are fixed at 2, 3, and 2, respectively. Each node has 10 ports. The network topology from the user's view is a virtual 1D/2D/3D torus. An arbitrary number of dimensions and size of each axis are specified by a user. The virtual torus space is mapped to the 6D mesh/torus network and reflected in the rank numbers. This virtual torus scheme improves the system fault tolerance and availability by enabling the region containing a failed node to be utilized as a torus.

High-speed 25 Gbps serial transmission
Each link of Tofu2 consists of four lanes of signals with a data transfer speed of 25.78125 Gbps and provides peak throughput of 12.5 GB/s. The link bandwidth is 2.5 times higher than that of Tofu1, which uses 8 lanes of 6.25 Gbps signals and provides 5 GB/s of throughput.

Tofu2 connects 12 nodes in a PRIMEHPC FX100 main unit by electrical links, and inter-main unit links use optical transceiver modules because of a large transmission loss at 25 Gbps. Optical transceivers are even placed near the CPU to minimize the transmission loss. In contrast, Tofu1 does not use optical transceivers.

Optical-link dominant network
Twelve nodes in a main unit are connected in the configuration of (X, Y, Z, A, B, C) = (1, 1, 3, 2, 1, 2). The number of intra-main unit links is 20 (Figure 14). Therefore, 40 out of 120 ports are used for intra-main unit links, and the other 80 are used for inter-main unit links.

For conventional HPC interconnects using 10 Gbps generation transmission technology, the ratio of optical links in the total network was up to one-third (A, B, and C in Figure 15). These interconnects partially used optical transmission and only used it to extend the wire length. In contrast, the ratio of optical links in Tofu2 is far beyond that of electrical links. Tofu2 is recognized as a next-generation HPC interconnect that mainly uses optical transmission.

[Figures: Figure 12, 6D mesh/torus topology model; Figure 13, Close placement of optical modules and CPU; Figure 14, Connection topology in main unit]

Excerpt from IEEE Micro, Hot Interconnects issue, p. 22 (the Tofu interconnect article, reference 2); the slide labels the pictured unit "A Tofu unit":

[…] for synchronization. The host bus connects a Sparc64 chip to the Tofu interconnect and PCI Express devices.

Figure 2a shows the ICC chip structure and interconnections. Four Sparc64 chips on a board are interconnected with the A- and C-axes links. Three boards in a Tofu unit are interconnected with the B-axis links. Tofu units are interconnected with the X-, Y- and Z-axes links that form a 3D torus. The Z-axis links connect 17 Tofu units in two racks: 16 for computing nodes, and one for I/O nodes. The X-axis and Y-axis are expandable according to the number of columns and rows of racks.

Figure 2b shows a topological model of the ABC 3D mesh/torus forming a Tofu unit. In the event of a single-board failure that decreases the B-axis's length, the 3D torus graph's embeddability is unaffected because the ABC 3D topology remains cubic.

One of the big challenges in building the K computer was system reliability. For instance, a mean time between failures (MTBF) of five years per node, assuming commodity processing nodes, would bring about two failures every hour in the 80,000-node system. We needed about one-hundredth of that failure rate. To minimize the failure rate, we integrated all active components of the Tofu interconnect into a single ICC chip, protected major data paths using error-correction code, and […]

Figure 1. A micrograph of the ICC chip. The chip integrates all active components of the Tofu interconnect: a Tofu network router (TNR), four Tofu network interfaces (TNIs), a Tofu barrier interface (TBI), a host bus interface, and two PCI Express root complexes.

Figure 2. The ICC chip structure and interconnections along six axes (a). A topological model of the A-, B-, and C-axes (b).

White paper source: "FUJITSU Supercomputer PRIMEHPC FX100 – Evolution to the Next Generation", https://www.fujitsu.com/global/Images/primehpc-fx100-hard-en.pdf
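The reliability figure quoted in the excerpt (a five-year node MTBF giving about two failures every hour on 80,000 nodes) follows from simple arithmetic. The sketch below assumes independent nodes with a constant failure rate, which is the usual reading of MTBF but is not stated in the excerpt; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def failures_per_hour(nodes, node_mtbf_years):
    """Expected system-wide failure rate, assuming independent nodes
    with a constant failure rate."""
    return nodes / (node_mtbf_years * HOURS_PER_YEAR)


rate = failures_per_hour(80_000, 5)
print(round(rate, 2))      # 1.83 failures per hour, i.e. "about two"
print(1 / (rate / 100))    # hours between failures at one-hundredth of that rate
```

The second line shows what the design target of one-hundredth of that rate means in practice: a failure every couple of days rather than every half hour.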

Page 40: Robustness of Interconnection Networks

The K and FX100 (continued)

The Tofu circuit is on the same die as the CPU; however, the Tofu circuit can keep running while the CPU cores are shut down and powered off.

Page 41: Robustness of Interconnection Networks

Various Units in FX100

• Various units in Fujitsu FX100
    • Unit of network
        • Tofu unit (12 nodes)
    • Unit of scheduling
        • Tofu unit (so as not to interfere with other jobs)
    • Physical unit
        • Chassis: a chassis spans 3 Tofu units
• Replacement of a chassis
    • Affects 3 Tofu units (36 nodes, 1152 cores)
    • Before the replacement, the jobs running on the 36 nodes must be aborted
    • During the replacement, the affected 36 nodes cannot accept jobs
    • Because Tofu is a direct network, the replacement can affect the entire network, since XYZ connections for I/O are lost
    • These 36 - Nf nodes, where Nf is the number of nodes that actually failed, are called apparent failures (in this talk)

Page 42: Robustness of Interconnection Networks

Repair Schedule

• The entire system must keep running as much as possible
• Replacement may cause more apparent failures as packaging density increases
    • Replacement cannot take place as soon as a failure happens
    • Remedying apparent failures is getting harder
• The more frequent the system service, the higher the running cost
• So, repair is scheduled once a day, 2-3 times a week, once a week, and so on

The K's case: every morning, SEs replace the failed nodes
1. Shut down the chassis
2. Unplug the chassis
3. Replace the failed motherboard
4. Plug in the chassis
5. Reboot (K's nodes are diskless)

[Annotation on the steps above: Apparent failure]

Page 43: Robustness of Interconnection Networks

Repair Interval

• The longer the repair interval, the larger the number of failed parts
• K: one node failure in 1~2 days

[Figure: two plots of the number of failed components over time, each alternating operation and repair periods, with the apparent failures at repair time and the average marked]

Page 44: Robustness of Interconnection Networks

Towards Exa-flops

• Higher failure rate
    • Larger number of components
    • The end of Moore's law is close
• Longer time between repairs
    • To reduce running cost
    • Denser packaging results in more apparent failures
        • Larger impact on running jobs

➡ There will always be one or more failed components

Page 45: Robustness of Interconnection Networks

Network Resilience Towards Exa-scale and Beyond

My Personal Opinion

Page 46: Robustness of Interconnection Networks

Failure will be Daily Life

• Assumption of current HPC: failure happens unexpectedly and unusually
    • System design is based on particular rules and algorithms
    • Random failure breaks those rules and algorithms
    • Node MTBF is less than a day already
• If failure happens daily, why don't we design HPC systems with failures in mind?

Page 47: Robustness of Interconnection Networks

Failure Conscious Design

• Failure
    • happens randomly
    • the number of combinations is factorial!
        • impossible to handle failures case by case
        • impossible to predict performance degradation due to the failures
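How quickly the number of failure cases grows can be checked with the standard library. The sketch below counts, for a 36-node example (the size is an arbitrary choice), both the unordered sets of failed nodes and the ordered failure sequences; reading the slide's "factorial" as these counting formulas is an interpretation, not something the talk spells out.

```python
import math


def failure_patterns(nodes, failed):
    """Number of distinct sets of `failed` failed nodes among `nodes`."""
    return math.comb(nodes, failed)


def failure_sequences(nodes, failed):
    """Number of ordered failure sequences (which node fails first, second, ...)."""
    return math.perm(nodes, failed)


for k in (1, 2, 3):
    print(k, failure_patterns(36, k), failure_sequences(36, k))
# 1 36 36
# 2 630 1260
# 3 7140 42840
```

Even on 36 nodes, three failures already give thousands of distinct cases, which is why handling them one by one does not scale.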

Page 48: Robustness of Interconnection Networks

Stencil and Cartesian Topology

• The node failure problem in stencil computations is revisited
• The communication pattern of stencil computation fits Cartesian topology very well
• When spare node substitutions take place, the fit is lost and performance degrades

[Figure: 5P-Stencil Communication Performance Degradation over the Number of Failed Nodes [7]. x-axis: # Node Failures (0 to 200); y-axis: # Collisions (0 to 120); series: Best, Average, Worst]

Page 49: Robustness of Interconnection Networks

Topology and Protocol

• Protocols of collective operations are optimized according to topology
• If the conditions for H/W support are NOT met, a general protocol is used instead
• Failures break those conditions

[Figure: x-axis: # Node Failures (0 to 300); y-axis: Slowdown (0 to 12); series: K-Barrier, K-Allreduce, BGQ-Barrier, BGQ-Allreduce, BGQ-Barrier*, BGQ-Allreduce*]

Page 50: Robustness of Interconnection Networks

Regular topology turns into random topology as the number of failed links increases

[Figure: three dragonfly networks of nodes and switches, labelled "(Full) Dragonfly", "22/28" and "16/28"]

Page 51: Robustness of Interconnection Networks

Regular topology turns into random topology as the number of failed links increases (continued)

[Figure: the same three dragonfly networks ("(Full) Dragonfly", "22/28", "16/28"), with the steps between them annotated "Qualitative Change" and "Quantitative Change"]

Page 55: Robustness of Interconnection Networks

Randomness may be an answer

• Can we rely on rules and algorithms which can be broken by failures?
    • Failures on regularity
      Qualitative change: hard to imagine
• What if we give up having such rules?
    • Failures on randomness
      Quantitative change: easier to imagine
• Let's start designing random systems from the beginning, and forget about failures in regular systems
    ➡ Random Topology
    ➡ Random Network

Page 56: Robustness of Interconnection Networks

Random Topology (1)

[Embedded slide "Good Point of Random Topology" by Michihiro Koibuchi, http://research.nii.ac.jp/~koibuchi/pdf/hpca2013_slide.pdf]

• Goal: to make a low-latency topology for HPC networks, with low diameter and low average path hops
• Random topology is best [Koibuchi et al., ISCA 2012]

[Figure: average shortest path length [hops] for a 1,024-node network with (a) non-random shortcuts and (b) random shortcuts; axis note: switch degree ≈ number of shortcuts; annotation: "100 times improvement"]
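The effect of random shortcuts on average path length can be reproduced at small scale. The sketch below is a toy experiment, not the evaluation from the cited work: the 64-node ring, the 32 shortcuts and the seed are arbitrary choices, and all function names are hypothetical.

```python
import random
from collections import deque


def ring(n):
    """Adjacency sets of an n-node ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_random_shortcuts(adj, count, rng):
    """Add `count` random links between nodes that are not already neighbours."""
    nodes = list(adj)
    added = 0
    while added < count:
        a, b = rng.sample(nodes, 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj


def average_shortest_path(adj):
    """Mean hop count over all ordered node pairs, by BFS from every node."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


n = 64
plain = average_shortest_path(ring(n))
shortcut = average_shortest_path(add_random_shortcuts(ring(n), 32, random.Random(0)))
print(plain, shortcut)
```

Adding links can never lengthen a shortest path, so the second number is always smaller than the first; how much smaller depends on where the shortcuts happen to land.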

Page 57: Robustness of Interconnection Networks

Random Topology (2)

[Embedded slide "Two Approaches to Quasi-randomness" by Michihiro Koibuchi, http://research.nii.ac.jp/~koibuchi/pdf/hpca2013_slide.pdf]

• Method A makes a non-random topology random
• Method B makes a random topology layout-friendly

[Figure: a randomness axis from low (not random) to high (fully random); Method A starts at the low end and Method B at the high end, with quasi-random topologies in between]

Page 58: Robustness of Interconnection Networks

Random Routing in Hypercube

[Embedded slide "Random Routing in Hypercube" by Sid C-K Chau, https://www.cl.cam.ac.uk/teaching/1011/CompSysMod/RandBits_Lec2V2.pdf]

• For deterministic bit-fix routing, the worst case requires at least 2^(n/2)/2 steps (exponential in n)
• But for random bit-fix routing, it requires O(n) steps with high probability (i.e., using more than O(n) steps has a vanishing probability converging to 0, as n → ∞)
• Random bit-fix routing has two stages:
    1. Pick a random node r(i) in the hypercube independently, and use bit-fixing routing from i to r(i)
    2. Use bit-fixing routing from r(i) to d(i)
• Obviously, longer paths are needed for random bit-fix routing. Then why is this better?
• The intuition is that random routing can average out the worst case configuration from deterministic routing
• The probability that a randomly generated configuration is the worst case is very low, and is vanishing for large n
• This intuition is behind many randomized algorithms

A two-stage configuration:

i    → r(i) → d(i)
0000 → 0000 → 0000
0001 → 0001 → 0100
0010 → 1000 → 1000
0011 → 0101 → 1100
0100 → 0001 → 0001
0101 → 1110 → 0101
0111 → 1101 → 1101
1110 → 0000 → 1011
1111 → 1110 → 1111

[Figure: random bit-fixing routing on a 4-dimensional hypercube (nodes 0000 to 1111), showing sources i and j, intermediate nodes r(i) and r(j), and destinations d(i) and d(j)]
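The two stages described above can be sketched as follows. Nodes are represented as n-bit integers; fixing bits from the most significant end downwards is an assumption (the embedded slide does not state the order), and both function names are hypothetical.

```python
import random


def bit_fix_path(src, dst, n):
    """Nodes visited when fixing differing bits from the most significant down."""
    path = [src]
    cur = src
    for bit in reversed(range(n)):
        mask = 1 << bit
        if (cur ^ dst) & mask:
            cur ^= mask              # flip this bit: one hop along that dimension
            path.append(cur)
    return path


def random_bit_fix_path(src, dst, n, rng):
    """Two-stage routing: src -> random intermediate node r -> dst."""
    r = rng.randrange(1 << n)
    first = bit_fix_path(src, r, n)
    second = bit_fix_path(r, dst, n)
    return first + second[1:]        # do not repeat the intermediate node


print([format(v, "04b") for v in bit_fix_path(0b0000, 0b1011, 4)])
# ['0000', '1000', '1010', '1011']
```

A two-stage path can take up to 2n hops instead of n, which is the "longer paths" cost mentioned above; the benefit is that no fixed permutation can reliably hit the worst case.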

Page 59: Robustness of Interconnection Networks

Dynamic routing vs. Random routing

• A switch has several routing candidates for a packet to go through
    • Static routing
        • always chooses the same fixed one
    • Dynamic routing
        • chooses one according to the network status
    • Random routing
        • chooses one in a random way
        • the choice does not have to be uniformly random
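The three policies differ only in how a switch picks among its candidate output ports. The sketch below is a toy model: the port ids, the `load` dictionary of queued packets, and the load-based weighting used for the non-uniform random choice are all assumptions made for illustration.

```python
import random


def choose_port(candidates, policy, load=None, rng=random):
    """Pick one output port among the routing candidates of a switch.

    candidates: list of port ids; load: dict port -> queued packets
    (required for "dynamic", optional for "random").
    """
    if policy == "static":
        return candidates[0]                              # always the same fixed port
    if policy == "dynamic":
        return min(candidates, key=lambda p: load[p])     # least-loaded port right now
    if policy == "random":
        # Weight lightly loaded ports more; the choice need not be uniform.
        weights = [1.0 / (1 + (load or {}).get(p, 0)) for p in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]
    raise ValueError(f"unknown policy: {policy}")


ports = [1, 2, 3]
queued = {1: 5, 2: 0, 3: 2}
print(choose_port(ports, "static"))            # 1
print(choose_port(ports, "dynamic", queued))   # 2
print(choose_port(ports, "random", queued, random.Random(0)) in ports)  # True
```

Unlike the dynamic policy, the random policy does not send every packet to the port that currently looks best, which is one way to avoid the oscillation described earlier in the talk.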

Page 60: Robustness of Interconnection Networks

Randomness in a Network

• Combination of regularity and randomness
    • Random topology
        • Regular part + random part
        • Ex) Ring + random shortcuts
    • Random (oblivious) routing (≠ Brownian motion)
        • Random routing + regular routing
        • A node/switch on the way is randomly chosen
• Failure may happen on the regular part?
    • The factorial nature can be relaxed
    • Ex) Redundant links in the regular part of the topology

[Embedded slide "Conclusions"]
• Use of random shortcuts in HPC interconnects
    • Ring + random shortcuts is best
    • Advantage of high-radix networks
    • Little variability of sampling and performance
• Random shortcut topology imposes no constraints on the number of switches and links

[Figure: a random shortcut topology (ring + random shortcuts) compared with a hypercube (non-random topology); annotation: "Up to 18% lower latency"]

Page 61: Robustness of Interconnection Networks

My Last Word

“An eye for an eye, a tooth for a tooth”

Randomness for randomness

Randomness MAY save the future supercomputers (not yet proven)

Thank you

Page 62: Robustness of Interconnection Networks

References

1) High-radix router: Microarchitecture of a High-Radix Router, John Kim, William J. Dally, et al., ISCA '05.
2) Tofu network: The Tofu Interconnect, Yuichiro Ajima, et al., Hot Interconnects, 2012.
3) Dragonfly network: Technology-Driven, Highly-Scalable Dragonfly Topology, John Kim, William J. Dally, et al., ISCA '08.
4) Routing algorithms: A Survey and Evaluation of Topology-Agnostic Deterministic Routing Algorithms, J. Flich et al., IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 3, pp. 405-425, March 2012.
5) Shortest path finding algorithm (Dijkstra algorithm): A note on two problems in connexion with graphs, E. W. Dijkstra, Numerische Mathematik, 1959.
6) Adaptive routing in InfiniBand: Fail-in-place Network Design: Interaction Between Topology, Routing Algorithm and Failures, J. Domke, T. Hoefler, and S. Matsuoka, SC '14, 2014.
7) Spare node substitution: Sliding Substitution of Failed Nodes, Atsushi Hori, et al., Proceedings of the 22nd European MPI Users' Group Meeting, ACM, 2015.
8) Random algorithms including the random routing: Randomized Algorithms, Rajeev Motwani and Prabhakar Raghavan, Cambridge University Press, 1995.
9) Random network: A Case for Random Shortcut Topologies for HPC Interconnects, Michihiro Koibuchi, et al., ISCA '12.
10) Another view on HPC network robustness: Robustness Attributes of Interconnection Networks for Parallel Processing, Behrooz Parhami, keynote lecture, 1st Int'l Supercomputing Conf. (ISUM-2010), March 4, 2010. (https://www.ece.ucsb.edu/~parhami/pres_folder/parh10-isum-robustness-int-nets.ppt)