70SU :ED73ECDP ( -AOEC0E0C#0&:0E0C … Fu Presentatio… · LongTerm Plan for SunwayTaihuLight 6PHEA. Sunway-I: - CMA service, 1998 ... - 1st of TOP500:DA 70SU4?DE0A #IEHU. Core Group2

S U : ED 3ECDP(-AOEC E C : E C ?EA PEBE? LLHE? PEK O

P PDA ? HA KB # 4EHHEK ,KNAO7GUN GT 5

=G OUTGR A KXIUS OTM 2KT KX OT D O3K GX SKT UL 4GX N A Y KS AIOKTIK BYOTMN G CTO KXYO

AK KSHKX N % / 820A

Sunway Machine: the Challenges and Opportunities

Scientific Computing with 10 Million Cores

Long Term Plan for Sunway TaihuLight

6 PHE A

Sunway-I:

- CMA service, 1998

- commercial chip

- 0.384 Tflops

- 48th of TOP500

Sunway BlueLight:

- NSCC-Jinan, 2011

- 16-core processor

- 1 Pflops

- 14th of TOP500

Sunway TaihuLight:

- NSCC-Wuxi, 2016

- 260-core processor

- 125 Pflops

- 1st of TOP500

:DA S U 4 ?DE A IEHU

Core Group 2

Data Transfer Network

MPE 8*8 CPE Mesh

PPU

iMC

Memory

Core Group 0

MPE8*8 CPE Mesh

iMC

PPU

Memory

Core Group 1

MPE8*8 CPE Mesh

PPU

Core Group 3 iMC

Memory

MPE8*8 CPE

Mesh

PPU

iMC

Memory

NoC

ComputingCore

LDM

ColumnCommunication Bus

Control Network

Registers

Row Communication

Bus

Transfer Agent (TA)

Memory Level

LDM Level

Register Level

Computing Level

8*8 CPE Mesh

# ( S U ,KNA NK?AOOKN

n 0 5O K K KR 8T KMXG OUT 7OKXGXINp IUS OTM TUJK

p IUS OTM HUGXJ

p Y KX TUJK

p IGHOTK

p KT OXKIUS OTMY Y KS

0ECD -A OEPU 1 PACN PEK KB PDA ,KIL PE C UOPAI


p IUS OTM HUGXJ

p Y KX TUJK

p IGHOTK




p IUS OTM HUGXJ

p Y KX TUJK

p IGHOTK




p IUS OTM HUGXJ

p Y KX TUJK

p IGHOTK




p IUS OTM HUGXJ

p Y KX TUJK

p IGHOTK



0KS PK ,K A?P PDA # 4EHHEK ,KNAO)


2D core arraywith row andcolumn buses


2D core array with rowand column buses

Network on Chip



Network on Chip

Customized Network Board toFully Connect 256 Nodes



Network on Chip

Customized Network Board toFully Connect 256 Nodes

Sunway Net

:SAAP ,KIIA PO BNKI NKB PKODE 4 PO K





Sunway Micro




6 PHE A

4 ?DE A , L EHEPU ,KIL NEOK

0

0.5

1

1.5

2

2.5

3

PeakPerformance

MemorySize

Gflops/Watt

Tflops/m^3

memorybandwidth

communicationbandwidth

TaihuLight Tianhe-2 Titan KComputer

0

0.5

1

1.5

2

2.5

3Linpack

Graph

HPCG

hpgmg

TaihuLight Tianhe-2

Titan KComputer

Sunway TaihuLight

125 Pflops

32 GB and136GB/s per node 22 flops/byte

10 millioncores

MPE + CPE

user-controlled64 KB LDM

registercommunicationamong CPEs

4 FKN A P NAO PK ,K OE AN

Sunway TaihuLight

125 Pflops


10 millioncores

MPE + CPE



4 FKN A P NAO PK ,K OE AN

Intel KNL 7250 of Cori: 6.5 flops/byteNVIDIA P100 of Piz Daint: 7.2 flops/byte

Sunway TaihuLight

125 Pflops


10 millioncores

MPE + CPE



4 FKN ,D HHA CA #( ? HE C

Sunway TaihuLight

125 Pflops


10 millioncores

MPE + CPE



4 FKN ,D HHA CA ( 4AIKNU HH

Sunway TaihuLight

125 Pflops


10 millioncores

MPE + CPE



4 FKN ,D HHA CA ( 4AIKNU HH

Refactoring and Redesigning

2016FullyImplicitSolver for AtmosphericDynamics

SurfaceWaveModeling

PhaseFieldSimulationsofCoarseningDynamics

AtomisticSimulationofSiliconNanowires

Run-awayElectronTrajectorySimulation

GenomeFunctionalAnnotationandHomeoticGeneBuilding

SpacecraftCFDNumericalSimulation

2017Extreme-scaleGraphProcessingFramework

SimulationofPlanetaryRings

SimulationsofQuantumSpinLiquidStatesviaPEPS++

MolecularDynamicsSimulationofCondensedCovalentMaterials

cryo-EMMacromoleculeStructureDetermination

RedesigningCAM-SE

NonlinearEarthquakeSimulation

1 ?KILHAPA 3EOP KB HH ? HA LLHE? PEK O

2016 Gordon Bell FinalistsFullyImplicitSolver for AtmosphericDynamics

SurfaceWaveModeling











RedesigningCAM-SE



2016 Gordon Bell PrizeFullyImplicitSolver for AtmosphericDynamics

SurfaceWaveModeling











RedesigningCAM-SE



2016 Gordon Bell PrizeFullyImplicitSolver for AtmosphericDynamics

SurfaceWaveModeling






2017 Gordon Bell FinalistsExtreme-scaleGraphProcessingFramework





RedesigningCAM-SE



racks chips core-groups cores total number of cores

163,840 processes 65 threads

DD-MG K-cycle

Very

sha

llow

Uniform DD

Plug & PlayNow let’s find a way to design a subdomain solver.

LLHE? PEK 1 ( 1ILHE?EP KHRAN BKN PIKOLDANE? -U IE?O

racks chips core-groups cores total number of cores

163,840 processes 65 threads

Geometry-based pipelined ILU (GP-ILU)

YX

Z

8×88×8

8×88×8

Two-levelpipeline

blk_height

Synchronizationavoiding

11´

( ) dim_zblk_height1num_corescell_sizereg_size

<+-

DD-MG K-cycle

Subdomain matrix of 1st-order with geometric index

Our goal of design:1. Single sweep2. Synchronization-free3. Improved data-locality

BNK S XKY X T, AF 3 O N )< IUXKY J .% Y 8 KTGR -(

PNK C O? HE C NAO HPO

1M 2M 3M 4M 5M 6M 7M 8M 9M 10M 11M0%

20%

40%

60%

80%

100%

Total number of cores

Para

llel e

ffici

ency

33% (GB’15)

67%

45%

0.00125

0.0025

0.005

0.01

0.02

0.04

0.08

0.16

0.33 M 1.33 M 5.32 M2.66 M 10.64 M

0.488

34X

SYPD

Total number of cores

Implicit Explicit

89.5X

2.480 1.389 0.920 0.620Resolution (km)

0.67 M

A O? HE C NAO HPO

7.95 DP-PF

23.66 DP-PF

DOFs=772B

“Exa-scale”for exp

The 488-m res run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit

LLHE? PEK 11 ( KNPE C ,. 4 A AOEC E C , 4 . BKNS U : ED 3ECDP

35

CAM5.0 ��

��

CPL7

CESM1.2.0

Tsinghua + BNU 30+ Professors and Students

• Four component models, millions lines of code• Large-scale run on Sunway TaihuLight

• 24,000 MPI processes•Over one million cores

• 10-20x speedup for kernels• 2-3x speedup for the entire model

“Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”, in Proceedings of SC 2016.

LLHE? PEK 11 ( KNPE C ,. 4 A AOEC E C , 4 . BKNS U : ED 3ECDP

36

CAM5.0 ��

��

CPL7

CESM1.2.0

Tsinghua + BNU 30+ Professors and Students

• Four component models, millions lines of code• Large-scale run on Sunway TaihuLight

• 24,000 MPI processes•Over one million cores

• 10-20x speedup for kernels• 2-3x speedup for the entire model

“Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”, in Proceedings of SC 2016.

a high complexity in application, and a heavy legacy in thecode base (millions lines of code)

an extremely complicated MPMD program with nohotspots (or hundreds of hotspots)

misfit between the in-place design philosophy and thenew architecture

lack of people with interdisciplinary knowledge andexperience

4 FKN ,D HHA CAO

6LA ,, OA AB ?PKNE C KB , 4

CAMinitial Dyn_run Phy_run1 Phy_run2

PassstatevariablesPassstatevariablesandtracers

Passtracers(u,v)todynamics

do ie = nets, netedo k = 1, nlev

do q = 1, qsizeqmin(k,q,ie) = …qmax(k,q,ie) = …Qtens(k,q,ie) = …

end doend do

end do

Euler_step:

do ie = nets, netecompute Q min/max values for lim8compute Biharmonic mixing term f

end do

do ie = nets, nete2D advection stepdata packing

end do

Bonundary exchange

Data extracting


dp(k) = func_1()do q = 1, qsize

Qtens(k,q,ie) = func_2(dp(k))end do

end doend do


do q = 1, qsizeqmin(k,q,ie) = …qmax(k,q,ie) = …

end doend do

end do

do ie = nets, nete do k = 1, nlev

do q = 1, qsizeQtens(k,q,ie) = …

end do end do

end doData packing

do ie = nets, netedo q = 1, qsize

do k = 1, nlev….

end doend do

end do


dp0 = func_3()dpdiss = func_4()do q = 1, qsize

Qtens(k,q,ie) = func_5(dp0, dpdiss)

end doend do

end do


dp(k) = func_5()Vstar(k) = func_6()

end do

do q = 1, qsizedo k = 1, nlev

Qtens(k,q,ie) = func_7(dp(k), Vstar(k))

end do

do k = 1, nlevdp_star(k) = func_8(dp(k))

end do

do k = 1, nlevQtens(k,q,ie) =

func_9(dp_star(k))end do

end doData packing

end do

optimized:


do q = 1, qsizeQtens(k,q,ie) = func_2(func_1())

end doend do

end do


do q = 1, qsizeqmin(k,q,ie) = …qmax(k,q,ie) = …

end doend do

end do


do k = 1, nlev….

end doend do

end do


do q = 1, qsizeQtens(k,q,ie) =

func_5(func_3(),func_4())end do

end doend do

do ie = nets, nete do q = 1, qsize


func_7(func_5(),func_6())end do


func_9(func_8(func_5())) end do

end doData packing

end do


do k = 1, nlevqmin(k,q,ie) = …qmax(k,q,ie) = …Qtens(k,q,ie) = …

end doend do

end do


do k = 1, nlevQtens(k,q,ie) = …

end do end do

end doData packing

!$ACC PARALLEL LOOPdo ie_q = 1, qsize*(nete-nets)

do k = 1, nlevq = func(ie_q)ie = func(ie_q)qmin(k,q,ie) = …qmax(k,q,ie) = …Qtens(k,q,ie) = …

end doend do

!$ACC PARALLEL LOOPdo ie_q = 1, qsize*(nete-nets)

do k = 1, nlev q = func(ie_q)ie = func(ie_q)Qtens(k,q,ie) = …

end do end do!$ACC PARALLEL LOOPData packing

1

2

3

4

5

6

• manual transformation ofloops

• manual OpenACCparallelization andoptimization on code anddata structures

do begin_chunk to end_chunktphysbc(){convect_deep_tend(6.47%)convect_shallow_tend(15.57%)macrop_driver_tend(8.38%)microp_aero_run(4.29%)microp_driver_tend(7.13%)aerosol_wet_intr(4.29%)convect_deep_tend_2(0.51%)radiation_tend(54.07%)

}enddo

tphysbc(){do begin_chunk to end_chunkconvect_deep_tend(6.47%)convect_shallow_tend(15.57%)macrop_driver_tend(8.38%)microp_aero_run(4.29%)microp_driver_tend(7.13%)aerosol_wet_intr(4.29%)convect_deep_tend_2(0.51%)radiation_tend(54.07%)

enddo}

tphysbc(){do begin_chunk to end_chunkconvect_deep_tend(6.47%)

enddo……do begin_chunk to end_chunk

microp_driver_tend(7.13%)enddo……do begin_chunk to end_chunk

radiation_tend(54.07%)enddo

}

do begin_chunk to end_chunkconvect_deep_tend(6.47%){zm_conv_tend(6.47%){zm_convr(2.03%)zm_conv_evap()montran()convtranc(0.06%)

}}

enddo

convect_deep_tend(6.47%){zm_conv_tend(6.47%){

do begin_chunk to end_chunkzm_convr(2.03%)

enddodo begin_chunk to end_chunkzm_conv_evap()

enddodo begin_chunk to end_chunkmontran()

enddodo begin_chunk to end_chunk

convtranc(0.06%)enddo

}}

• tool based transformation of loops

0.040.15

0.24 0.25

0.6

0.780.87

1.54

1.2

1.621.75

2.81

0

0.5

1

1.5

2

2.5

3

1024 2400 4096 5120 7350 9600 12000 24000

Simula

tionS

peed(D

escribe

dinM

odelYearPe

rDay(M

YPD))

NumberofCGs(eachCGincludes1MPEand64CPEs)

MPEonly MPE+CPEfordynamiccore MPE+CPEforbothdynamiccoreandphysicsschemes

, 4 IK AH( O? H EHEPU OLAA L• SORROUT IUXK YIGRK % AF 3• SGT IUXK XKLGI UXOTM LUX NK

KT OXK SUJKR• IUS K O O K YOS RG OUT Y KKJ U

NK YGSK SUJKR UT =20@FKRRU Y UTK

n A K , XK XO K UL 5UX XGT KT022 IUJK U 0 NXKGJ 2 IUJKp LOTKX SKSUX IUT XUR NXU MN

G Y KIOLOI 3<0 YINKSK

p SUXK KLLOIOKT KI UXO G OUT

n A K %, XKMOY KX IUSS TOIG OUTHGYKJ XKJKYOMTp XKSU K JG G JK KTJKTI

p K UYK SUXK GXGRRKROYS

PDNA OA E A CN E A A AOEC

elek elek+7

C0,0

C7,0

a16*iStage 1 Ci, j a16*i+a16*i+1 a16*i+...+a16*i+15

Stage 2 a0C0,0 a0+...+a15

a16C1,0 a16+...+a31

p15

p31

akCk, 0 ak+...+ak+15

p15+ =

p127p111+ =

...

...

...

...

... ... ...

C0,1

C7,1

... ...

...

...

...

C0,7

C7,7

a0+a1

a8+a9

ak+ak+1

Stage 3 Ci, j ... p16*ia16*i a16*i+a16*i+1 a16*i+...+a16*i+15

...

ANBKNI ?A 1ILNKRAIA P PDNK CD A AOEC

1 Sunway CG (64 CPEs)could be equivalent to0.1x Intel Coreor1.8x Intel Coreor7.2x Intel Coreor in certain cases43.1x Intel Core

? HE CPDA

-U IE?,KNAPK

4EHHEK OKB ,KNAO

512 2048 8192 32768 131072

0.01

0.05

0.1

1

5

Number of Processes

PFlo

ps

48 elements in each process192 elements in each process768 elements in each process650 elelments in each process

3.3 PFlopsPara.eff 98.55%




EI H PEK KB 0 NNE? A 2 PNE

n 3 TGSOI X XK YU XIKMKTKXG UX UXOMOTG KJ LXUS26 53<

n AKOYSOI G K XU GMG OUT UXOMOTG KJ LXUS 0D 32

n NKX ORO OKY,p YU XIK GX O OUTKXp 3 <UJKR 8T KX URG UXp @KY GX IUT XURRKX

LLHE? PEK 111 ( K HE A N . NPDM A EI H PEK KS U : ED 3ECDP

Source Partitioner

Restart Controller

3D Model Interpolator

LZ4 Compression, Group I/O, Balanced I/O Forwarding

Snapshot/Sesimo Recorder

Velocity Update

Stress Update

Source Injection

Stress Adjustment For

Plasticity

Dynamic Rupture Source Generator

(Based on CG-FDM)

Seismic Wave Propagation

(Based on AWP-ODC)

Next Timestep

3D Vel/Den Model

Fault Stress Init

Friction Law Ctrl

Wave Eqn Solver

4 HPE 3ARAH -KI E -A?KILKOEPEK

𝑀𝑥

𝑀𝑦

(1) MPI decomposition

𝐵𝑧

𝐵𝑦

(2) CG blocking

Finished area

Computing area

Buffer area

Unfinished area

𝑤𝑥𝑤𝑧

𝑤𝑦

𝐶𝑧

𝐶𝑦

z

y

x

Computedirection

(3) Athread decomposition(4) LDM buffering scheme

1xx 2xx nxx

1yy 2yy nyy

1zz 2zz nzz

1yz 2yz nyz

...

1v 2v nv

1u 2u nu

1w 2w nw

1v 2v nv

1u 2u nu 1w 2w nw

1xx 2xx nxx

1yy 2yy nyy

1zz 2zz nzz

1yz 2yz nyz

...

1xx 1yy 1zz 1xy 1xz 1yz


1u 1v 1w 2u 2v 2w

1xx 2xx nxx

1yy 2yy nyy

1zz 2zz nzz

1yz 2yz nyz

1v 2v nv

1u 2u nu 1w 2w nw

1u 1v 1w 2u 2v 2w



...

dvelcx

dstrqc

dvelcx

dstrqc

afterbefore

fusearrays

……

Leftboundary Innerpart

Rightboundary DMAtransfer

Registercommunication

Registercommunication

ID:63ID:00 ID:01 ID:02

# NN U B OEK

D HK AT?D CA PDNK CDNACEOPAN ?KII E? PEK

KLPEIE A HK? E C?K BEC N PEK C E A U

HUPE? H IK AH

H ?A 4AIKNU ?DAIA

6 PDA BHU ,KILNAOOEK

CPE

(a) Collect statistic from coarse grid (b) Computation workflowCoarse Fine

(d) Compression algorithms

(2)

(1)

(3)

dma_get

Decompressed block

W

TN

(c) Decompress-compute-compress scheme

Compressed grid

dma_put

min/max

13-point stencilIn-place

decomposition

sign exp (8b) frac (24b) sign exp (0-8b) frac (7-15b)

... ...ef

e

NN

EEN

�

�

15

)(log minmax2

(str, r1,r2, ,r6,sigma2,yldfac)

sign exp (5b) frac (10b)sign exp (8b) frac (24b)

1EEE754 32b to 16b FP conversion

(vel,ww0,phi,cohes,taxx, ,taxz)

sign exp (8b) frac (24b)

IEEE754 32-bit floating point format

sign frac (15b)

8

)/(1 minmax

��

��

VV

VVVVV

cmpr

16-bit floating point formats

(d1,lam,mu,qp,qs,vx1,vx2,ww)

LDM

Main Memory

16b to 32b decompression General 32b computation 32b to 16b compression

Host Memory

Host Memory:

CPE:

40.647.8 45.4

28.9 27.622.9

4.2

39.3

12.9 13.1

0102030405060

Speedup

MPE

PAR

MEM

CMPR

LAA L( , . RO # 4 .

21.262%

21.262% 18.5

54%18.554%

23.870%

24.873%

12.436%

26.979%

2779%

2779%

0

510

1520

2530

DMA Bandwidth

PAR

MEM

CMPR

4AIKNU SE PD PEHE PEK

Number of processes8K 12K 16K 24K 32K 40K 48K 64K 80K 96K 120K 160K

PFLO

PS

0.6

1

1.5 2

3 4

6

9

14 18 Ideal (Linear)

Ideal (Non-linear)Ideal (Linear+Compress)Ideal (Non-linear+Compress)

Linear (Peak: 10.7PFLops, Para. eff. 97.9%)Non-linear (Peak: 15.2PFlops, Para. eff. 80.1%)Linear+Compress (Peak: 14.2PFlops, Para. eff. 96.5%)Non-linear+Compress (Peak: 18.9PFlops, Para. eff. 79.5%)

A ? HE C

Spee

dup

1

2 3 4 6 8

121622

79.9%73.6%

63.6%

Linear

Idealdx=100mdx=50mdx=16m 75.5%

75.6%

53.3%

Non-Linear

Idealdx=100mdx=50mdx=16m

Spee

dup

1

2 3 4 6 8

121622

160K

75.8%

128K

72.4%

100K

51.2%

Linear+Compress

80K

64K

Number of processes48

K32

K24

K16

K12

K8K


160K

67.5%

128K

Non-Linear+Compress

67.2%

100K80

K

51.7%

64K

Number of processes48

K32

K24

K16

K12

K8K


PNK C ? HE C

116˚E 117˚E 118˚E 119˚E

38˚N

39˚N

40˚N Beijing

Tangshan

Shunyi

Tianjin

Ninhe

Luanxian

Cangzhou

Bohai

LuannanWuqing

(e)

116˚E 117˚E 118˚E 119˚E

Beijing

Tangshan

Shunyi

Tianjin

Ninhe

Luanxian

Cangzhou

Bohai

LuannanWuqing

5

6

7

8

9

10

11

Inte

nsi

ty

(f)

38˚N

39˚N

40˚N Beijing

Tangshan

Shunyi

Tianjin

Ninhe

Luanxian

Cangzhou

Bohai

LuannanWuqing

(c)

Beijing

Tangshan

Shunyi

Tianjin

Ninhe

Luanxian

Cangzhou

Bohai

LuannanWuqing

−0.2

−0.1

0.0

0.1

0.2

Vel

oci

ty (

m/s

)

(d)

0s 50s 100s

Ninghe (200m)

Cangzhou (200m) (a)

0s 50s 100s

Ninghe (16m)

Cangzhou (16m) (b)

EI H PEK AO HPO( I RO # I




6 PHE A

Traditional HPCApplications

(Science -> Service)

Deep LearningRelated

ApplicationsSunway Micro

3K C :ANI H

54




ApplicationsSunway Micro tanshan.gif

3K C :ANI H

55“15-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of Realistic 10 Hz Scenarios”, Gordon Bell Prize Finalist, SC 2017.





3K C :ANI H

56





3K C :ANI H

57

1.0x

2.2x

3.5x

1.0x

2.7x

4.5x

1.0x 7.1x 8.5x0

10

20

30

40

50

60

70

80

Intel(24cores) swBLAS(1CG) swDNN(1CG)

Averagetim

epe

rite

ration

TrainingAlexNet withswCaffe

total convolution fullyconnected





3K C :ANI H

58

n < AB 2NOTG, SGPUX Y UTYUXY UL NK 7 2 NGXJ GXK GTJ YUL GXK JK KRU SKT

n =@2 2, KTJUX UL NK SGINOTK

n =20@, @OIN UL UNT 3KTTOY 0RROYUT 1G KX 7GO OTM E Y UX GTJ GJ OIK UTNK 20< A4 UX

n A242, FOLKTM 2 O A K K 3G 3GTOKR @U KT :OS RYKT UYN BUHOT 0RK 1XK KX GTJ3G KO < JOYI YYOUT GTJ GJ OIK UT NK KGX NW G K YOS RG OUT UX

? KSHA CAIA PO

Documents

70SU :ED73ECDP ( -AOEC0E0C#0&:0E0C … Fu Presentatio… · LongTerm Plan for SunwayTaihuLight 6PHEA. Sunway-I: - CMA service, 1998 ... - 1st of TOP500:DA 70SU4?DE0A #IEHU. Core Group2