Large-Scale HPC systems based on Heterogeneous multicore processors

Toshikazu Ebisuzaki (RIKEN)


Contents

• What are heterogeneous many-core processors?
• Introduction to GYOUKOU
• Prospects of Exaflops Computing

Homogeneous vs. heterogeneous

[Figure: a homogeneous many-core processor, in which identical cores share an interconnect. Examples: GPU, PEZY-SC1.]

Homogeneous many-core processor system

[Figure: a general-purpose processor with memory, connected through an interconnect to several homogeneous many-core processors, each made of cores on an internal interconnect.]

Homogeneous vs. heterogeneous

[Figure: side-by-side comparison. Homogeneous many-core processor (GPU, PEZY-SC1): cores on an interconnect. Heterogeneous many-core processor (SW26010, PEZY-SC2): cores, a processor, and memory on an interconnect.]

Heterogeneous many-core processor system

[Figure: several heterogeneous many-core processors, each containing cores, a processor, memory, and an internal interconnect, joined by a system interconnect.]

Heterogeneous Manycore Processors

• SW26010: Sunway TaihuLight
→ see the talk by Professor Liu
• PEZY-SC2
– Gyoukou: JAMSTEC
– Shoubu System B: RIKEN ACCC
– Suiren Blue: KEK
– Ajisai: RIKEN AICS
– Satsuki: RIKEN CAP

PEZY-SC/SC2 Specification

Item | PEZY-SC | PEZY-SC2
Process | TSMC28HPM | TSMC16FFPGL
Core frequency | 733MHz | 1GHz
Peripheral frequency | 66MHz | 66MHz
Cache (chip total) | L1: 1MB, L2: 4MB, L3: 8MB | L1: 12MB, L2: 12MB, LLC: 40MB
Scratch pad | 16MB (16KB/PE) | 40MB (20KB/PE)
Control CPU | ARM926 x 2 (Management, Debug); cache L1: 32KB x 2, L2: 64KB | MIPS64R6 (P6600) 6 cores (General Purpose)
PCIe I/F | PCIe Gen3, 8 lanes x 4 ports (8GB/s x 4 = 32GB/s) | PCIe Gen4, 8 lanes x 4 ports (64GB/s)
DDR I/F | DDR4 64bit 2,400MHz x 8 ports (19.2GB/s x 8 = 153.6GB/s) | Custom TCI stacked DRAM, 4 ports, 2TB/s (available on phase-2 version, fall 2017); DDR4 3,200MHz, 4 ports, 100GB/s
Number of PEs (MIMD cores) | 1,024 | 2,048
Peak performance | 3.0 TFlops (single precision), 1.5 TFlops (double precision) | 8.2 TFlops (single precision), 4.1 TFlops (double precision)
Power (typical) | 70W (Leak: 10W, Dynamic: 60W) | 130W (estimated)
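The peak-performance figures in the specification can be cross-checked from the PE count and core clock. The sketch below is a rough consistency check, not a vendor formula. It assumes that the "4 FP ops/cycle" listed per PE on the hierarchical-architecture slide refers to single precision, and that double precision runs at half that rate; the half rate is an assumption chosen because it reproduces the stated double-precision peaks. The function name `peak_tflops` is illustrative only.

```python
def peak_tflops(num_pe: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Nominal peak = PEs x clock x floating-point ops per cycle, in TFlops."""
    return num_pe * clock_hz * flops_per_cycle / 1e12

# Assumption: 4 FP ops/cycle is the single-precision rate per PE.
SP_FLOPS_PER_CYCLE = 4
# Assumption: double precision issues at half the single-precision rate.
DP_FLOPS_PER_CYCLE = 2

# PEZY-SC: 1,024 PEs at 733 MHz
sc_sp = peak_tflops(1024, 733e6, SP_FLOPS_PER_CYCLE)
sc_dp = peak_tflops(1024, 733e6, DP_FLOPS_PER_CYCLE)

# PEZY-SC2: 2,048 PEs at 1 GHz
sc2_sp = peak_tflops(2048, 1e9, SP_FLOPS_PER_CYCLE)
sc2_dp = peak_tflops(2048, 1e9, DP_FLOPS_PER_CYCLE)

print(f"PEZY-SC : {sc_sp:.1f} TFlops SP, {sc_dp:.1f} TFlops DP")
print(f"PEZY-SC2: {sc2_sp:.1f} TFlops SP, {sc2_dp:.1f} TFlops DP")
```

Under these assumptions the rounded values agree with the table: 3.0 and 1.5 TFlops for PEZY-SC, 8.2 and 4.1 TFlops for PEZY-SC2.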

PEZY-SC2 Block Diagram

[Figure: block diagram. Eight prefectures (16 cities, 256 PEs each) and their LLC blocks are joined by crossbar buses (labeled 256bit x 32 xbar and 256bit x 8). Four TCI DRAM stacks (8GB, 512GB/s each), four DDR4 DIMM channels (25GB/s each), four PCIe x8 ports, and MIPS64R6 x 6 attach to the same fabric. Other labels: state, Uncached Access.]

Hierarchical Architecture

• PE
– Program Counter × 8
– L1 Instruction Cache: 64bit × 512w (4KB)
– ALU: 4 FP ops/cycle
– Register File: 32bit × 512w (2KB)
– Local Storage: 32bit × 5120w (20KB)
• Village (4 PE)
– L1 Data Cache: 2KB
• City (16 PE) = 4 Villages
– L2 Data Cache: 64KB
– L2 Instruction Cache: 32KB
– Special Function Unit
• Prefecture (256 PE) = 16 Cities
• PEZY-SC2 (2,048 PE) = 8 Prefectures
– LLC: 2560KB × 16
– TCI DRAM: 512GB/s × 4
– DDR4 DIMM: 25GB/s × 4
– PCIe Gen4 x8: 4 ports
– MIPS × 6
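The per-level sizes in the hierarchy can be multiplied out and compared with the chip totals in the specification table (L1: 12MB, L2: 12MB, LLC: 40MB, scratch pad 40MB). The following is a small sketch of that bookkeeping. It assumes the 2KB L1 data cache is per PE and that the chip carries sixteen 2560KB LLC blocks, as counted in the figure; both readings are assumptions, checked only by the fact that the totals come out as stated.

```python
KB = 1024
MB = 1024 * KB

# Hierarchy fan-out taken from the slide.
PE_PER_VILLAGE = 4
VILLAGE_PER_CITY = 4
CITY_PER_PREFECTURE = 16
PREFECTURE_PER_CHIP = 8

cities = CITY_PER_PREFECTURE * PREFECTURE_PER_CHIP
pes = PE_PER_VILLAGE * VILLAGE_PER_CITY * cities

# Assumption: L1 instruction (4KB) and L1 data (2KB) caches are both per PE.
l1_total = pes * (4 + 2) * KB
# L2 data (64KB) and L2 instruction (32KB) caches are per city.
l2_total = cities * (64 + 32) * KB
# Assumption: 16 LLC blocks of 2560KB each, as counted in the figure.
llc_total = 16 * 2560 * KB
# Local storage (scratch pad) is 20KB per PE.
scratch_total = pes * 20 * KB

print(pes, "PEs in", cities, "cities")
print("L1 :", l1_total // MB, "MB")
print("L2 :", l2_total // MB, "MB")
print("LLC:", llc_total // MB, "MB")
print("Scratch pad:", scratch_total // MB, "MB")
```

With these readings the totals are 2,048 PEs, 12MB of L1, 12MB of L2, 40MB of LLC, and 40MB of scratch pad, matching the PEZY-SC2 column of the specification.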

Die Plot

[Figure: die plot.] Die size: 27172.32 um x 23695.200 um

Processing Element

• 2-way superscalar, in-order issue / out-of-order completion
• 16-stage pipeline
• Fine-grain time-sliced multi-threading (like HEP, Sun Niagara)
– 8 hardware threads / PE (4 active threads, 4 inactive threads)
– Simplifies data forwarding / pipeline control
– Eliminates the hardware branch prediction mechanism
• L1/L2 cache coherence is NOT supported by hardware

Pipeline stages: IF1 IF2 IF3 ID RA1 RA2 RA3 RA4 EX1 EX2 EX3 EX4 WB1 WB2 FB CB (the figure also labels LA1 and LA2)

[Figure: pipeline timing diagram. Threads 0 to 3 enter the pipeline on successive clocks. The thread slots are labeled TH0F to TH3F and TH0B to TH3B, with the l.chgthread and l.actthread instructions shown between them.]
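The effect of fine-grain time slicing can be illustrated with a toy model. The sketch below is not a description of the actual PEZY-SC2 scheduler: it assumes a strict round-robin over the four active threads, and it assumes that a thread change swaps one slot between its "F" and "B" thread, which is only a guess based on the TH0F/TH0B labels in the figure. The helpers `issue_order` and `chgthread` are hypothetical names for this illustration.

```python
def issue_order(active, cycles):
    """Return the thread issued on each cycle under strict round-robin slicing."""
    return [active[cycle % len(active)] for cycle in range(cycles)]

def chgthread(front, back, slot):
    """Hypothetical model of a thread change: swap the active and inactive thread of one slot."""
    front[slot], back[slot] = back[slot], front[slot]

# Assumed reading of the figure labels: F = active set, B = inactive set.
front = ["TH0F", "TH1F", "TH2F", "TH3F"]
back = ["TH0B", "TH1B", "TH2B", "TH3B"]

order = issue_order(front, 8)
print(order)

# Distance in cycles between two consecutive instructions of the same thread.
first, second = [i for i, t in enumerate(order) if t == "TH0F"][:2]
gap = second - first
print("cycles between TH0F issues:", gap)

# After a thread change on slot 1, TH1B takes over that issue slot.
chgthread(front, back, 1)
order_after = issue_order(front, 4)
print(order_after)
```

In this model a given thread issues only every fourth cycle, so back-to-back instructions of one thread are never adjacent in the pipeline. That spacing is the usual reason such designs can simplify data forwarding and drop branch prediction, as the slide states.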

Instruction Set Architecture

• Original ISA
– Focused on scientific computation, image processing, AI, and deep learning
– Supports double-precision / single-precision / half-precision floating point
• Register File (per thread)
– Integer: 64b x 32
– Floating point: 64b x 32
• Multi-processor Support
– change thread (switch active / inactive thread)
– cache flush (each cache level)
– barrier synchronization (each hierarchy level)

Software Environment

• We provide PZCL, an OpenCL-like framework
– Develop both host-processor code and PEZY-SC2 code
– LLVM is used in the PZCL compiler
• Special functions for PEZY-SC2 control
– sync (barrier synchronization)
– flush (writeback from specified cache)
– get_pid, get_tid (get PE ID / thread ID)
– chgthread (change active / inactive thread)
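The role of barrier synchronization in this kind of programming model can be shown with a host-side analogue. The sketch below is plain Python using the standard `threading.Barrier`; it is not PZCL code and does not use any PZCL API. The barrier call stands in for `sync`, and the `tid` argument stands in for what `get_tid` would return. Each worker writes a partial sum, all workers wait at the barrier, and only then does one worker combine the results.

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
partial = [0] * NUM_WORKERS
result = []

def worker(tid, data):
    # Each worker reduces its own strided slice of the input.
    partial[tid] = sum(data[tid::NUM_WORKERS])
    # Analogue of sync: nobody proceeds until every partial sum is written.
    barrier.wait()
    # After the barrier, worker 0 can safely read all partial sums.
    if tid == 0:
        result.append(sum(partial))

data = list(range(100))
threads = [threading.Thread(target=worker, args=(t, data)) for t in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result[0])
```

On hardware without cache coherence, as described for the processing element, a real kernel would presumably also need a `flush` before the barrier so that the written values are visible to other PEs; that step has no counterpart in this Python analogue.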

GYOUKOU: ZettaScaler-2.0 1st system

System Overview

• 26 tanks, 832-node system (32 nodes / tank)

Model: ZettaScaler-2.0
Nodes: 832
Vendor: ExaScaler Inc.
Processor: Xeon D-1571
Speed (MHz): 1,300
Sockets per Node: 1
Cores per Socket: 16
Accelerator/CP: PEZY-SC2
Accelerators/CP per Node: 16
Cores per Accelerator/CP: 2,048
Operating System: Linux CentOS 7.3
Primary Interconnect: InfiniBand EDR
Memory per Node (GB): 1,088

Immersion Cooling Tanks

• 26 tanks @ JAMSTEC
• 16 bricks / tank

System Network

[Figure: network topology. Within each tank, bricks (hosts marked H, each with SC2 x16) connect to IB switches, with links labeled down 32 and up 4; the up links from all tanks meet at a 648-port director switch.]

• H: EDR HCA MCX-455AH
• 8-port InfiniBand EDR / tank
• Tank switch: Mellanox SB7790
• Total 8 x 26 = 208-port connection using a 648-port director switch (Mellanox CS7500)

Brick

• 1 brick = 2 nodes, 32 x PEZY-SC2
• Ultimate High Density Implementation
– 1 Base Carrier Board
– 8 Sub Carrier Boards
– 32 PEZY-SC2 Module Cards
– 1 Dual-XeonD Module Card
– 4 InfiniBand EDR HCAs
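The system-level counts quoted on the overview, tank, network, and brick slides follow from a few multiplications. The sketch below works them through. The last figure, a nominal double-precision peak, is an assumption-laden estimate: it simply multiplies the accelerator count by the 4.1 TFlops per-chip peak from the specification table, ignores the host CPUs, and is not a measured or officially quoted number.

```python
# Counts taken from the slides.
TANKS = 26
BRICKS_PER_TANK = 16
NODES_PER_BRICK = 2
SC2_PER_NODE = 16
PE_PER_SC2 = 2048
HCA_PER_BRICK = 4
UPLINKS_PER_TANK = 8

nodes_per_tank = BRICKS_PER_TANK * NODES_PER_BRICK
nodes = TANKS * nodes_per_tank
sc2_chips = nodes * SC2_PER_NODE
total_pes = sc2_chips * PE_PER_SC2
hcas = TANKS * BRICKS_PER_TANK * HCA_PER_BRICK
director_ports_used = TANKS * UPLINKS_PER_TANK

# Assumption: nominal peak only, using the 4.1 TFlops double-precision
# per-chip figure at 1 GHz; host CPUs are not counted.
SC2_DP_TFLOPS = 4.1
nominal_dp_pflops = sc2_chips * SC2_DP_TFLOPS / 1000

print("nodes per tank      :", nodes_per_tank)
print("nodes               :", nodes)
print("PEZY-SC2 chips      :", sc2_chips)
print("PEs                 :", total_pes)
print("EDR HCAs            :", hcas)
print("director ports used :", director_ports_used)
print("nominal DP peak     : %.1f PFlops" % nominal_dp_pflops)
```

The results reproduce the slide values of 32 nodes per tank, 832 nodes, and 208 director-switch ports, and the HCA count works out to two per node, consistent with the two InfiniBand links per node on the node overview.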

Node Overview

• Node
– 1 x Xeon D-1571 (16 cores, 1.3GHz)
– 16 x PEZY-SC2 (2,048 cores, 1GHz)
• Multi-Layer PCIe Internal Network (Gen3 x16, 128Gbps + 128Gbps)
• Inter-SC2 Ring Network (PCIe Gen4 x8, 128Gbps + 128Gbps)
• 2 x InfiniBand Inter-node Network (EDR 100Gbps)

[Figure: node block diagram. A Xeon D-1571 (16 cores / 32 threads, 1.3GHz, TB 2.1GHz), six PLX PEX9797 PCI Express Fabric switches, sixteen PEZY-SC2 chips with CN connectors, and IB EDR 2CH adapters (IB EDR 100Gbps x 2). Link labels: Gen3 x16, 128Gbps + 128Gbps; Gen4 x8, 128Gbps + 128Gbps.]
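The two PCIe link labels, Gen3 x16 and Gen4 x8, both come out at 128Gbps per direction because Gen4 doubles the per-lane rate while the ring uses half as many lanes. The sketch below shows the arithmetic using the standard per-lane signaling rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b line encoding; the function names are illustrative, and the payload figure ignores protocol overhead beyond line encoding.

```python
# Per-lane signaling rates in GT/s for the two PCIe generations on the slide.
GT_PER_LANE = {"Gen3": 8, "Gen4": 16}
# Both generations use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130

def raw_gbps(gen: str, lanes: int) -> int:
    """Raw signaling rate per direction in Gbps."""
    return GT_PER_LANE[gen] * lanes

def payload_gbytes_per_s(gen: str, lanes: int) -> float:
    """Approximate payload bandwidth per direction in GB/s after line encoding."""
    return raw_gbps(gen, lanes) * ENCODING_EFFICIENCY / 8

internal = raw_gbps("Gen3", 16)  # multi-layer internal network
ring = raw_gbps("Gen4", 8)       # inter-SC2 ring network

print("Gen3 x16:", internal, "Gbps raw per direction")
print("Gen4 x8 :", ring, "Gbps raw per direction")
print("Gen4 x8 : %.2f GB/s after 128b/130b encoding" % payload_gbytes_per_s("Gen4", 8))

# Four Gen4 x8 ports at 16 GB/s raw each give the 64GB/s quoted in the specification.
four_port_raw_gbytes = 4 * ring / 8
print("4 x Gen4 x8:", four_port_raw_gbytes, "GB/s raw")
```

The "128Gbps + 128Gbps" notation on the slide then reads as 128Gbps in each direction of a full-duplex link.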

ExaFlops Computing (10^18 flop/s)

• Computational power ≈ a human brain
• Deep Learning ≈ Matrix Algebra
– Single/double precision is not necessary
– Half precision is sufficient
• An exaflops machine will surpass human brains
→ relief of humans from tedious work
• Full and real-time emulation of a human brain
→ studies of human brains
→ experimental philosophy, literature, theology

Conclusions

• Heterogeneous many-core processors:
– Next processor architecture for HPC
– Sunway SW26010
– PEZY-SC2
• GYOUKOU is based on PEZY-SC2
• Prospect of ExaFlops Computing
– Computational power ≈ a human brain