
Source: research.nii.ac.jp/graphgolf/2019/candar19/graphgolf2019-nakashima.pdf


Page 1

Copyright 2019 FUJITSU LABORATORIES LIMITED

Conservative Interconnect of Large-scale HPC systems

Kohta Nakashima

Fujitsu Laboratories Ltd.

2019.11.26

Page 2

Introduction

Interconnect technologies are important for HPC systems

For performance

Low latency and high bandwidth can improve system performance

Latency (InfiniBand):

• Direct connection: 1us

• Switch latency: 0.2us/switch

Throughput:

• Average shortest path length affects throughput

For cost

Small # of switches and links can reduce system cost

Smaller diameter and average shortest path length improve HPC system performance and cost

[Figure: direct connection: 1.0 us; path through 4 switches: 1.0 + 0.2 x 4 = 1.8 us; on the longer path, 2 data flows share one link]
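The latency annotation above follows a simple hop-count formula, and the throughput remark can be quantified with the diameter and average shortest path length (ASPL) of the switch graph. Below is a minimal sketch of both; this is my own illustration rather than anything from the slides, it assumes the networkx library is available, and it uses a toy 8-switch ring as the example graph.

```python
# Minimal sketch (not from the slides): latency from the InfiniBand numbers
# above, plus diameter / ASPL of a toy switch graph.  networkx is assumed.
import networkx as nx

DIRECT_LATENCY_US = 1.0   # NIC-to-NIC latency with no switch in between
SWITCH_LATENCY_US = 0.2   # additional latency per switch traversed

def path_latency_us(num_switches: int) -> float:
    """Latency of a path that crosses num_switches switches."""
    return DIRECT_LATENCY_US + SWITCH_LATENCY_US * num_switches

if __name__ == "__main__":
    print(path_latency_us(4))                      # 1.0 + 0.2 * 4 = 1.8 us
    ring = nx.cycle_graph(8)                       # toy 8-switch ring
    print(nx.diameter(ring))                       # 4
    print(nx.average_shortest_path_length(ring))   # ~2.29
```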

Page 3

Outline

Introduction

Example of HPC systems

Network topologies for HPC systems

Why HPC networks are so conservative

How to explore ways to innovate HPC networks

Page 4

Example of supercomputer in Japan (1)

ABCI: AI Bridging Cloud Infrastructure (2018)

# of nodes: 1,088

Peak Flops: 37 PFlops

HPL perf.: 19.88 PF (5th in Top500, 2018/6)

NVIDIA Tesla V100

InfiniBand x 2

[Figure: ABCI computing node with 2x Xeon (1.5 TF each), 192 GB DDR4 (128 GB/s) per socket, 4x V100 (7.8 TF each), PCIe 16 GB/s, and 2x InfiniBand links at 12.5 GB/s each; fabric with 34 nodes per edge switch, and edge switches connected to director switches (clusters of switches)]

Page 5

Example of supercomputer in Japan (2)

Oakforest-PACS (2016)

# of nodes: 8,208

Peak Flops: 25 PFlops

HPL perf.: 13.55 PF (6th in Top500, 2016/11)

Xeon Phi, 68 cores / 1.4 GHz

OmniPath 100 Gbps (12.5 GB/s)

[Figure: Oakforest-PACS computing node with Xeon Phi (3.0 TF), 16 GB MCDRAM (480 GB/s), 96 GB DDR4 (128 GB/s), and an OmniPath link at 12.5 GB/s over PCIe 16 GB/s; fabric with 8,208 nodes, 24 nodes per edge switch and 24 uplinks from each edge switch to the director switches (clusters of switches)]

Page 6

Top 20 systems in the Top500 list

Rank  Name            Fabric
 1    Summit          Fat Tree/IB
 2    Sierra          Fat Tree/IB
 3    TaihuLight      Fat Tree/Custom
 4    Tianhe-2A       Fat Tree/TX2
 5    Frontera        Fat Tree/IB
 6    Piz Daint       DF/Aries
 7    Trinity         DF/Aries
 8    ABCI            Fat Tree/IB
 9    SuperMUC-NG     Fat Tree/OPA
10    Lassen          Fat Tree?/IB
11    PANGEA III      Fat Tree?/IB
12    Sequoia         Torus/Custom
13    Cori            DF/Aries
14    Nurion          Fat Tree/OPA
15    Oakforest-PACS  Fat Tree/OPA
16    HPC4            Fat Tree?/IB
17    Tera-1000-2     Fat Tree/BXI
18    Stampede2       Fat Tree/OPA
19    Marconi         Fat Tree/OPA
20    DGX SuperPOD    Fat Tree?/IB

DF: Dragonfly, IB: InfiniBand, TX2: TH Express 2, OPA: OmniPath, BXI: Bull Exascale Interconnect

Topology

Fat Tree: 16 systems

Dragonfly: 3 systems

Torus: 1 system

Interconnect

InfiniBand: 7 systems

OmniPath: 5 systems

Aries(Cray): 3 systems

TX2: 1, BXI: 1, Custom: 2

Page 7

Other topologies

16 Fat Tree, 3 Dragonfly, 1 Torus in Top20

Others

Dragonfly+

• Gadi, National Computational Infrastructure (NCI Australia), #47

• Niagara, University of Toronto, #76

Hypercube

• Eagle, National Renewable Energy Laboratory, 8D Hypercube #43

Top500: Fat Tree, Dragonfly, Torus, Dragonfly+, Hypercube

Fat Tree: Almost all systems

Dragonfly: Cray users

Torus: Fujitsu FX series, IBM BlueGene, and Sugon users

Dragonfly+: More adventurous users, with Mellanox InfiniBand

Hypercube: Some SGI users

Page 8

Fabrics

Open

Third parties can purchase the fabric and integrate systems with it

InfiniBand (Mellanox) and OmniPath (Intel)

Custom

The vendor only provides systems already integrated with its own fabric

Open fabrics:

Fabric      Vendor    Topology
InfiniBand  Mellanox  Fat Tree, Dragonfly+, Torus, Hypercube, etc.
OmniPath    Intel     Fat Tree

Custom fabrics:

Fabric           Vendor   Topology
Aries/Slingshot  Cray     Dragonfly
BXI (*)          Atos     Fat Tree
Tofu             Fujitsu  Torus

(* BXI: Bull eXascale Interconnect; Bull was acquired by Atos)

Page 9

Outline

Introduction

Example of HPC systems

Network topologies for HPC systems

Why HPC networks are so conservative

How to explore ways to innovate HPC networks

Page 10

Fat Tree

Non-blocking if the # of uplink and downlink ports is the same

The balance between performance and cost is configurable

[Figure: 3-level fat tree built from switches with 3 uplink and 3 downlink ports; diameter: 6]
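As a rough way to see how the performance/cost knobs and the diameter fall out of the structure, here is a minimal sketch using the standard folded-Clos rules of thumb (radix-r switches support about r^2/2 hosts at 2 levels and r^3/4 at 3 levels when non-blocking). This is my own illustration, not a formula given in the slides.

```python
# Minimal sketch (not from the slides): capacity and hop-count diameter of a
# non-blocking fat tree built from radix-r switches (folded-Clos rules of thumb).
def fat_tree_hosts(radix: int, levels: int) -> int:
    if levels == 2:
        return radix ** 2 // 2   # r/2 hosts per edge switch, r edge switches
    if levels == 3:
        return radix ** 3 // 4
    raise ValueError("sketch only covers 2- and 3-level fat trees")

def fat_tree_diameter(levels: int) -> int:
    # Worst case: up to the top level and back down, counting the host links.
    return 2 * levels

if __name__ == "__main__":
    print(fat_tree_hosts(36, 2))   # 648 hosts with 36-port switches
    print(fat_tree_hosts(36, 3))   # 11,664 hosts
    print(fat_tree_diameter(3))    # 6, as in the figure above
```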

Page 11

Dragonfly

Local: local switches are connected directly within a group

Global: Groups are connected directly

[Figure: 4 groups (group1 to group4), each containing directly connected switches s1 to s3; groups are connected directly to each other; diameter: 5]
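The slide gives no sizing rule, but the standard Dragonfly construction (Kim et al., ISCA 2008) shows how far a direct group-to-group design scales; a minimal sketch under that assumption:

```python
# Minimal sketch (standard Dragonfly sizing, not taken from the slides):
# each switch has p terminals, a-1 local links (all-to-all within a group of
# a switches), and h global links, so at most a*h + 1 groups can be connected
# directly to each other.
def dragonfly_max_nodes(p: int, a: int, h: int) -> int:
    groups = a * h + 1   # every group keeps a direct link to every other group
    return p * a * groups

if __name__ == "__main__":
    # Balanced configuration a = 2p = 2h with p = h = 4, a = 8:
    print(dragonfly_max_nodes(p=4, a=8, h=4))   # 4 * 8 * 33 = 1,056 nodes
```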

Page 12

Dragonfly+

Local: nodes within a group are connected through a Clos network (like a fat tree)

Global: Groups are connected directly

[Figure: 4 groups (group1 to group4), each an internal Clos of switches s1 to s3; groups are connected directly to each other; diameter: 5]

Page 13

Torus

Scalable, easy to connect physically for 10,000+ nodes


[Figure: 2D torus]
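To make the scalability trade-off concrete, a minimal sketch (my own illustration, not from the slides) of how a torus's diameter grows with machine size, in contrast to a fat tree whose diameter is fixed by its number of levels:

```python
# Minimal sketch (not from the slides): hop-count diameter of a k-ary
# n-dimensional torus.  Wraparound links put the farthest node floor(k/2)
# hops away in each dimension, so the diameter grows with system size.
def torus_diameter(k: int, n: int) -> int:
    return n * (k // 2)

if __name__ == "__main__":
    print(torus_diameter(k=8, n=2))    # 8x8 2D torus -> diameter 8
    print(torus_diameter(k=10, n=6))   # 10-ary 6D torus -> diameter 30
```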

Page 14

My recommendation for customers/colleagues

> 16,000 nodes or Fujitsu FX series customer

Torus

Easy for physical installation

800 to 16,000 nodes

For most customers

• 3-level Fat Tree

• Mature

For more adventurous customers

• DF/DF+

• DF+: supported by Mellanox

< 800 nodes

2-level Fat Tree

Perfect

Who is the rival? Fat Tree!
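The recommendation above can be read as a simple decision rule. Below is a minimal sketch of it in code; this is my paraphrase of the slide, not an official sizing tool. The 800 and 16,000 breakpoints happen to match what 40-port switches allow in non-blocking 2- and 3-level fat trees (40^2/2 = 800 and 40^3/4 = 16,000), although the slide does not state that explicitly.

```python
# Minimal sketch (my paraphrase of this slide, not an official tool):
# pick a topology from the node count, using the thresholds above.
def recommend_topology(num_nodes: int, adventurous: bool = False) -> str:
    if num_nodes > 16_000:
        return "Torus (easy physical installation at this scale)"
    if num_nodes >= 800:
        return ("Dragonfly / Dragonfly+ (Mellanox-supported)" if adventurous
                else "3-level Fat Tree (mature)")
    return "2-level Fat Tree"

if __name__ == "__main__":
    print(recommend_topology(500))                      # 2-level Fat Tree
    print(recommend_topology(5_000))                    # 3-level Fat Tree
    print(recommend_topology(5_000, adventurous=True))  # DF / DF+
    print(recommend_topology(50_000))                   # Torus
```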

Page 15

Detail of Fat Tree features (1)

Easy to configure the balance between performance and cost

Easy to explain the fabric configuration

A sales engineer can easily configure the fabric for a customer

[Figure: a fully provisioned fat tree with 3 uplinks and 3 downlinks per edge switch versus a reduced one with 2 uplinks and 3 downlinks, giving 2/3 global bandwidth and 1/1 local bandwidth, i.e. "2/3 perf"]
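The "paper and pencil" claim is easy to make concrete; here is a minimal sketch (my own illustration of the figure above, not from the slides) of the arithmetic a sales engineer can do per edge switch:

```python
# Minimal sketch (not from the slides): the oversubscription arithmetic behind
# the figure above.  Traffic leaving an edge switch gets uplinks/downlinks of
# the full bandwidth; traffic staying inside the edge switch keeps 1/1.
def fat_tree_perf_fraction(uplinks: int, downlinks: int) -> float:
    """Fraction of full global bandwidth for traffic leaving an edge switch."""
    return min(1.0, uplinks / downlinks)

if __name__ == "__main__":
    print(fat_tree_perf_fraction(3, 3))   # 1.0 -> non-blocking
    print(fat_tree_perf_fraction(2, 3))   # 0.666... -> the "2/3 perf" case
```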

Page 16

Detail of Fat Tree features (2)

Easy to explain the behavior when some links or switches fail

Page 17

Outline

Introduction

Example of HPC systems

Network topologies for HPC systems

Why HPC networks are so conservative

How to explore ways to innovate HPC networks

Page 18

How the configuration of an HPC system is decided

Define the specification for the procurement

Explain what purposes the system is suited for

Important to convince various stakeholders

[Figure: the stakeholders and their interactions: the system owner (university / national lab, or industry) with an upper layer and a technical division; the system vendor (IBM / Cray / Fujitsu, etc.) with an upper layer, a sales division, and a technical division, which proposes the system design and explains internally whether the procurement is good for business; and the component vendors (Intel / AMD / NVIDIA, Mellanox / Intel, etc.), which consult on and promote their components]

Page 19

Why HPC networks are so conservative

Because Fat Tree is very suitable for HPC systems

Easy to explain the design

Easy to configure cost and performance

A little expensive but mature

Because stakeholders are conservative

Hard to take a chance on novel, high-risk technologies because the total system budget is very high ($1M to $1,000M)

Stakeholders

• System owners

• System vendors

• Component vendors

Page 20

Outline

Introduction

Example of HPC systems

Network topologies for HPC systems

Why HPC networks are so conservative

How to explore ways to innovate HPC networks

Page 21

Conditions for acceptable topologies (1)

Easy to understand

Easy to configure the performance and cost

Not only for engineers but also for the sales division

Without simulation; using only paper and pencil

Feels low-risk

Simple logic for routing, deadlock avoidance, and fault tolerance

Easy to find alternative routes when some links fail

Feels reassuring, somehow…

Page 22

Conditions for acceptable topologies (2)

Proven/Mature

Stakeholders are easier to convince if other systems already use the topology

Is your proposed network topology used in other systems?

Sure, X University and Y National Laboratory use it.

(OK, I will ask my friend who works at X University.)

How is the topology working out at X University?

It is good! No problem.

Page 23

How to get stakeholders to adopt a novel topology (1)

Strategy 1: Collaboration with interconnect vendor

Pros

If the topology has a strong advantage over other topologies, it can motivate the interconnect vendor to support it.

If the interconnect vendor supports the topology, system owners and system vendors feel the risk is low and are reassured

Cons

If the topology reduces the # of switches and/or links, it may be unacceptable to the interconnect vendor

Almost all good topologies reduce the # of switches, so it is hard for the interconnect vendor to accept supporting them

Example of success: Dragonfly+ by Mellanox

Page 24

How to get stakeholders to adopt a novel topology (2)

Strategy 2: Verify the usefulness of the topology through research collaboration with sympathetic customers

Pros

Customers that are themselves research institutions jointly verify the usefulness, so they accept it easily

If the customer is an authority on interconnects, other customers may also accept the topology

Cons

A verification system is required for the research collaboration

• The system requires hundreds of computing nodes and switches…

• Total cost: at least several million dollars…

The cost of the verification system:

• For the customer: too high, because the verification may fail

• For the system vendor: too high as an investment; other customers may not accept the topology even if the verification succeeds

Page 25

How to get stakeholders to adopt a novel topology (3)

Strategy 3: A national supercomputing project

Pros

High probability of success because of huge investment

Cons

Technologies with too-high risk are unacceptable, precisely because of the huge investment

If the project's technologies focus only on huge systems, their merit for mid-range systems may diminish…

Examples of success:

Dragonfly: Cray XC30 (Cascade), DARPA HPCS project, USA

6D Mesh/Torus: Tofu on the K computer, the national HPC project in Japan

Page 26

Summary and proposal

Summary: HPC interconnects are very conservative

Fat tree is very suitable for practical HPC systems

Stakeholders do not like to take risks

Proposal

Network combination

• The traditional network guarantees the minimum performance and fault tolerance

• The novel network delivers excellent performance

Additional evaluation metrics for the novel network

• Diameter and average shortest path length with failed switches and links

• Worst-case results when n switches/links have failed (see the sketch after the figure below)

[Figure: a novel network combined with a traditional network]
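One concrete way to compute the proposed failure-aware metrics is sketched below. This is my own illustration, not a tool from the talk: it assumes the networkx library, uses a small circulant graph as a stand-in topology, and brute-forces the worst case over all single-link failures.

```python
# Minimal sketch (not from the slides): diameter and average shortest path
# length (ASPL) of a topology, and the worst case over all combinations of
# n removed links.  networkx is assumed; the graph is a toy stand-in.
import itertools
import networkx as nx

def worst_case_after_link_failures(g: nx.Graph, n_failures: int = 1):
    """Worst (diameter, ASPL) over every choice of n_failures failed links."""
    worst = (0, 0.0)
    for links in itertools.combinations(g.edges(), n_failures):
        h = g.copy()
        h.remove_edges_from(links)
        if not nx.is_connected(h):   # a disconnecting failure is the worst case
            return (float("inf"), float("inf"))
        worst = max(worst, (nx.diameter(h),
                            nx.average_shortest_path_length(h)))
    return worst

if __name__ == "__main__":
    g = nx.circulant_graph(32, [1, 5])   # toy 4-regular, 32-switch graph
    print(nx.diameter(g), nx.average_shortest_path_length(g))
    print(worst_case_after_link_failures(g, n_failures=1))
```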

Page 27

Copyright 2019 FUJITSU LABORATORIES LIMITED