Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores
Haohuan Fu
National Supercomputing Center in Wuxi
Department of Earth System Science, Tsinghua University
September 2017, ICAS
Outline
- Sunway Machine: the Challenges and Opportunities
- Scientific Computing with 10 Million Cores
- Long Term Plan for Sunway TaihuLight
The Sunway Machine Family
Sunway-I:
- CMA service, 1998
- commercial chip
- 0.384 Tflops
- 48th of TOP500
Sunway BlueLight:
- NSCC-Jinan, 2011
- 16-core processor
- 1 Pflops
- 14th of TOP500
Sunway TaihuLight:
- NSCC-Wuxi, 2016
- 260-core processor
- 125 Pflops
- 1st of TOP500
SW26010: Sunway 260-Core Processor
[Figure: SW26010 block diagram. Four core groups (Core Group 0 to 3) are linked by the NoC; each has an MPE, an 8*8 CPE mesh, a PPU, and an iMC attached to its own memory. Labels for a single CPE: computing core, LDM, registers, transfer agent (TA), row communication bus, column communication bus, control network, data transfer network. Four levels are marked: memory level, LDM level, register level, computing level.]
High Density Integration of the Computing System
• A Five-Level Integration Hierarchy
- computing node
- computing board
- super node
- cabinet
- entire computing system
How to Connect the 10 Million Cores?
- 2D core array with row and column buses
- Network on Chip
- Customized Network Board to Fully Connect 256 Nodes
- Sunway Net
Tweet Comments from Prof. Satoshi Matsuoka
Sunway Micro
Scientific Computing with 10 Million Cores
Machine Capability Comparison
[Figure: two charts (scale 0 to 3) comparing TaihuLight, Tianhe-2, Titan, and K Computer. Chart 1 axes: peak performance, memory size, Gflops/Watt, Tflops/m^3, memory bandwidth, communication bandwidth. Chart 2 axes: Linpack, Graph, HPCG, hpgmg.]
Major Features to Consider
Sunway TaihuLight:
- 125 Pflops
- 10 million cores
- MPE + CPE
- 32 GB and 136 GB/s per node: 22 flops/byte
- user-controlled 64 KB LDM
- register communication among CPEs
Intel KNL 7250 of Cori: 6.5 flops/byte
NVIDIA P100 of Piz Daint: 7.2 flops/byte
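The 22 flops/byte figure can be checked with a few lines of arithmetic. The sketch below divides the per-node peak by the per-node memory bandwidth. The node count (40,960) is an assumption that does not appear on the slide; the peak, the bandwidth, and the two comparison ratios are the slide's values.

```python
PEAK_SYSTEM_FLOPS = 125e15   # 125 Pflops, from the slide
NODE_BANDWIDTH = 136e9       # 136 GB/s per node, from the slide
NUM_NODES = 40_960           # assumption: node count is not given on the slide


def flops_per_byte(peak_flops, bytes_per_second):
    """Machine balance: peak floating-point rate over memory bandwidth."""
    return peak_flops / bytes_per_second


node_peak = PEAK_SYSTEM_FLOPS / NUM_NODES
taihulight = flops_per_byte(node_peak, NODE_BANDWIDTH)

# Ratios quoted on the slide for comparison.
others = {"Intel KNL 7250 (Cori)": 6.5, "NVIDIA P100 (Piz Daint)": 7.2}

print(f"peak per node: {node_peak / 1e12:.2f} Tflops")
print(f"TaihuLight node: {taihulight:.1f} flops/byte")
for name, ratio in others.items():
    print(f"{name}: {ratio} flops/byte, TaihuLight is {taihulight / ratio:.1f}x higher")
```

A higher ratio means a kernel must perform more arithmetic per byte fetched before it stops being limited by memory bandwidth.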
Major Challenge 1: Scaling
Major Challenge 2: Memory Wall
Refactoring and Redesigning
Incomplete List of Full-Scale Applications
2016 (Gordon Bell Finalists, Gordon Bell Prize):
- Fully Implicit Solver for Atmospheric Dynamics
- Surface Wave Modeling
- Phase Field Simulations of Coarsening Dynamics
- Atomistic Simulation of Silicon Nanowires
- Run-away Electron Trajectory Simulation
- Genome Functional Annotation and Homeotic Gene Building
- Spacecraft CFD Numerical Simulation
2017 (Gordon Bell Finalists):
- Extreme-scale Graph Processing Framework
- Simulation of Planetary Rings
- Simulations of Quantum Spin Liquid States via PEPS++
- Molecular Dynamics Simulation of Condensed Covalent Materials
- cryo-EM Macromolecule Structure Determination
- Redesigning CAM-SE
- Nonlinear Earthquake Simulation
Application I: Implicit Solver for Atmospheric Dynamics
racks, chips, core-groups, cores: total number of cores
163,840 processes, 65 threads
[Figure: solver hierarchy. DD-MG K-cycle (very shallow), uniform DD, plug & play subdomain solver.]
Now let’s find a way to design a subdomain solver.
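The process and thread counts on this slide account for the "10 million cores" in the talk title. The sketch below multiplies them out. It assumes one process per core group, which the slide does not state explicitly; the chip count follows from the four core groups shown in the processor diagram.

```python
PROCESSES = 163_840        # from the slide
THREADS_PER_PROCESS = 65   # from the slide: 1 MPE + 64 CPEs (8*8 mesh)
CGS_PER_CHIP = 4           # core groups 0..3 in the processor diagram

total_cores = PROCESSES * THREADS_PER_PROCESS
chips = PROCESSES // CGS_PER_CHIP   # assumption: one process per core group

print(f"{total_cores:,} cores on {chips:,} chips")
```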
Geometry-based pipelined ILU (GP-ILU)
[Figure: subdomain of 8×8 blocks along the X, Y, and Z axes, swept with a two-level pipeline of height blk_height; synchronization avoiding. A constraint on the slide relates dim_z, blk_height, num_cores, cell_size, and reg_size.]
Subdomain matrix of 1st-order with geometric index
Our goal of design:
1. Single sweep
2. Synchronization-free
3. Improved data-locality
Strong-scaling results
[Figure: parallel efficiency (0% to 100%) versus total number of cores (1M to 11M). Data labels: 33% (GB’15), 67%, 45%. The caption summarizes one run (resolution, SYPD, core count, time step); its values are garbled in the source.]
Weak-scaling results
[Figure: SYPD (log axis, 0.00125 to 0.16) versus total number of cores (0.33 M, 0.67 M, 1.33 M, 2.66 M, 5.32 M, 10.64 M), implicit versus explicit. Resolution (km) labels: 2.480, 1.389, 0.920, 0.620, 0.488. Speedup labels: 34X, 89.5X.]
7.95 DP-PF
23.66 DP-PF
DOFs=772B
“Exa-scale” for exp
The 488-m res run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit
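To make the SYPD figure concrete, the sketch below converts the quoted 0.07 SYPD and dt=240s into wall-clock seconds per time step, and derives the explicit solver's speed implied by the 89.5X ratio. It assumes a 365-day model year; the function name is illustrative, not from the talk.

```python
DAYS_PER_YEAR = 365  # assumption: 365-day model year


def wall_seconds_per_step(sypd, dt_seconds):
    """Wall-clock time spent on one model time step.

    sypd * DAYS_PER_YEAR is the number of simulated seconds advanced per
    wall-clock second, so one step of dt_seconds costs dt / (sypd * 365).
    """
    return dt_seconds / (sypd * DAYS_PER_YEAR)


implicit_sypd = 0.07      # from the slide (488-m run, 10.6M cores)
dt = 240.0                # from the slide, in seconds
speedup = 89.5            # implicit over explicit, from the slide

step_cost = wall_seconds_per_step(implicit_sypd, dt)
explicit_sypd = implicit_sypd / speedup

print(f"wall time per implicit step: {step_cost:.2f} s")
print(f"implied explicit speed: {explicit_sypd:.5f} SYPD")
```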
Application II: Porting CESM, Redesigning CAM-SE for Sunway TaihuLight
[Figure: CESM1.2.0 component models, including CAM5.0, coupled through CPL7; the other component labels are illegible.]
Tsinghua + BNU: 30+ Professors and Students
• Four component models, millions of lines of code
• Large-scale run on Sunway TaihuLight
- 24,000 MPI processes
- Over one million cores
• 10-20x speedup for kernels
• 2-3x speedup for the entire model
“Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”, in Proceedings of SC 2016.
Major Challenges
- a high complexity in application, and a heavy legacy in the code base (millions of lines of code)
- an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
- misfit between the in-place design philosophy and the new architecture
- lack of people with interdisciplinary knowledge and experience
OpenACC-based Refactoring of CAM
[Figure: CAM workflow: CAM initial, Dyn_run, Phy_run1, Phy_run2. Arrow labels: pass state variables; pass state variables and tracers; pass tracers (u,v) to dynamics.]
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      qmin(k,q,ie) = …
      qmax(k,q,ie) = …
      Qtens(k,q,ie) = …
    end do
  end do
end do
Euler_step:
do ie = nets, nete
  compute Q min/max values for lim8
  compute Biharmonic mixing term f
end do
do ie = nets, nete
  2D advection step
  data packing
end do
Boundary exchange
Data extracting
do ie = nets, nete
  do k = 1, nlev
    dp(k) = func_1()
    do q = 1, qsize
      Qtens(k,q,ie) = func_2(dp(k))
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      qmin(k,q,ie) = …
      qmax(k,q,ie) = …
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      Qtens(k,q,ie) = …
    end do
  end do
end do
Data packing
do ie = nets, nete
  do q = 1, qsize
    do k = 1, nlev
      ….
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    dp0 = func_3()
    dpdiss = func_4()
    do q = 1, qsize
      Qtens(k,q,ie) = func_5(dp0, dpdiss)
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    dp(k) = func_5()
    Vstar(k) = func_6()
  end do
  do q = 1, qsize
    do k = 1, nlev
      Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
    end do
    do k = 1, nlev
      dp_star(k) = func_8(dp(k))
    end do
    do k = 1, nlev
      Qtens(k,q,ie) = func_9(dp_star(k))
    end do
  end do
  Data packing
end do
optimized:
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      Qtens(k,q,ie) = func_2(func_1())
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      qmin(k,q,ie) = …
      qmax(k,q,ie) = …
    end do
  end do
end do
do ie = nets, nete
  do q = 1, qsize
    do k = 1, nlev
      ….
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      Qtens(k,q,ie) = func_5(func_3(), func_4())
    end do
  end do
end do
do ie = nets, nete
  do q = 1, qsize
    do k = 1, nlev
      Qtens(k,q,ie) = func_7(func_5(), func_6())
    end do
    do k = 1, nlev
      Qtens(k,q,ie) = func_9(func_8(func_5()))
    end do
  end do
  Data packing
end do
do ie = nets, nete
  do q = 1, qsize
    do k = 1, nlev
      qmin(k,q,ie) = …
      qmax(k,q,ie) = …
      Qtens(k,q,ie) = …
    end do
  end do
end do
do ie = nets, nete
  do q = 1, qsize
    do k = 1, nlev
      Qtens(k,q,ie) = …
    end do
  end do
end do
Data packing
!$ACC PARALLEL LOOP
do ie_q = 1, qsize*(nete-nets+1)   ! collapsed (ie, q) loop; nets..nete is inclusive
  q = func(ie_q)                   ! recover the tracer index
  ie = func(ie_q)                  ! recover the element index
  do k = 1, nlev
    qmin(k,q,ie) = …
    qmax(k,q,ie) = …
    Qtens(k,q,ie) = …
  end do
end do
!$ACC PARALLEL LOOP
do ie_q = 1, qsize*(nete-nets+1)
  q = func(ie_q)
  ie = func(ie_q)
  do k = 1, nlev
    Qtens(k,q,ie) = …
  end do
end do
!$ACC PARALLEL LOOP
Data packing
• manual transformation of loops
• manual OpenACC parallelization and optimization on code and data structures
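One of the manual transformations shown is collapsing the element loop (ie) and the tracer loop (q) into a single ie_q loop, so that one parallel loop exposes qsize times more iterations. The slides leave the index recovery as func(ie_q); the sketch below is one possible mapping, written in Python for illustration, with hypothetical function names. Because nets..nete is inclusive, the collapsed loop needs qsize*(nete-nets+1) iterations.

```python
def collapsed_to_pair(ie_q, qsize, nets):
    """Map a 1-based collapsed index to the (ie, q) pair it stands for."""
    ie = nets + (ie_q - 1) // qsize   # element index, nets..nete
    q = 1 + (ie_q - 1) % qsize        # tracer index, 1..qsize
    return ie, q


def nested_order(nets, nete, qsize):
    """Iteration order of the original two nested loops."""
    return [(ie, q) for ie in range(nets, nete + 1) for q in range(1, qsize + 1)]


def collapsed_order(nets, nete, qsize):
    """Iteration order of the single collapsed loop."""
    count = qsize * (nete - nets + 1)
    return [collapsed_to_pair(i, qsize, nets) for i in range(1, count + 1)]


# The collapsed loop visits exactly the same (ie, q) pairs in the same order.
print(nested_order(3, 6, 5) == collapsed_order(3, 6, 5))
```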
do begin_chunk to end_chunk
  tphysbc(){
    convect_deep_tend(6.47%)
    convect_shallow_tend(15.57%)
    macrop_driver_tend(8.38%)
    microp_aero_run(4.29%)
    microp_driver_tend(7.13%)
    aerosol_wet_intr(4.29%)
    convect_deep_tend_2(0.51%)
    radiation_tend(54.07%)
  }
enddo

tphysbc(){
  do begin_chunk to end_chunk
    convect_deep_tend(6.47%)
    convect_shallow_tend(15.57%)
    macrop_driver_tend(8.38%)
    microp_aero_run(4.29%)
    microp_driver_tend(7.13%)
    aerosol_wet_intr(4.29%)
    convect_deep_tend_2(0.51%)
    radiation_tend(54.07%)
  enddo
}

tphysbc(){
  do begin_chunk to end_chunk
    convect_deep_tend(6.47%)
  enddo
  ……
  do begin_chunk to end_chunk
    microp_driver_tend(7.13%)
  enddo
  ……
  do begin_chunk to end_chunk
    radiation_tend(54.07%)
  enddo
}

do begin_chunk to end_chunk
  convect_deep_tend(6.47%){
    zm_conv_tend(6.47%){
      zm_convr(2.03%)
      zm_conv_evap()
      montran()
      convtranc(0.06%)
    }
  }
enddo

convect_deep_tend(6.47%){
  zm_conv_tend(6.47%){
    do begin_chunk to end_chunk
      zm_convr(2.03%)
    enddo
    do begin_chunk to end_chunk
      zm_conv_evap()
    enddo
    do begin_chunk to end_chunk
      montran()
    enddo
    do begin_chunk to end_chunk
      convtranc(0.06%)
    enddo
  }
}
• tool based transformation of loops
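The physics transformation pushes the chunk loop down the call tree: one outer loop that calls every scheme per chunk becomes one loop per scheme, so each scheme can be offloaded as its own kernel. The sketch below illustrates why this is safe when no scheme reads another chunk's data. It is a generic illustration in Python with hypothetical stand-in kernels, not the tool used in the talk.

```python
def per_chunk(chunks, kernels):
    """Original shape: one chunk loop whose body calls every kernel."""
    results = []
    for chunk in chunks:
        state = chunk
        for kernel in kernels:
            state = kernel(state)
        results.append(state)
    return results


def per_kernel(chunks, kernels):
    """Transformed shape: each kernel gets its own loop over all chunks."""
    states = list(chunks)
    for kernel in kernels:
        states = [kernel(state) for state in states]
    return states


# Hypothetical stand-ins for physics schemes; each touches only its own chunk.
def convect(x):
    return x + 1.0


def microp(x):
    return x * 2.0


def radiation(x):
    return x - 0.5


chunks = [0.0, 1.0, 2.5]
kernels = [convect, microp, radiation]
print(per_chunk(chunks, kernels) == per_kernel(chunks, kernels))
```

The two orders agree only because the kernels are independent across chunks; a scheme that needed a neighbor chunk's updated state would break the equivalence.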
[Figure: simulation speed in model years per day (MYPD), 0 to 3, versus number of CGs (1024, 2400, 4096, 5120, 7350, 9600, 12000, 24000; each CG includes 1 MPE and 64 CPEs). Series: MPE only; MPE+CPE for dynamic core; MPE+CPE for both dynamic core and physics schemes. Data labels: 0.04, 0.15, 0.24, 0.25, 0.6, 0.78, 0.87, 1.54, 1.2, 1.62, 1.75, 2.81.]
CAM model: scalability and speedup
• million-core scale, 2… SYPD
• many-core refactoring for the entire model
• competitive simulation speed to the same model on NCAR Yellowstone
Athread-based Fine-grained Redesign
• Step 1: rewrite of Fortran OpenACC code to Athread C code
- finer memory control through a specific DMA scheme
- more efficient vectorization
• Step 2: register communication based redesign
- remove data dependency
- expose more parallelism
[Figure: three-stage accumulation over the 8×8 CPE mesh (C0,0 to C7,7), elements elek to elek+7, 16 values per CPE. Stage 1: each Ci,j forms the local running sums a16*i, a16*i + a16*i+1, ..., a16*i + ... + a16*i+15. Stage 2: the block totals (a0 + ... + a15 on C0,0, a16 + ... + a31 on C1,0, ..., ak + ... + ak+15 on Ck,0) are accumulated across CPEs into p15, p31, ..., p111, p127. Stage 3: each Ci,j combines p16*i with its local running sums.]
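The three stages in the figure follow the usual blocked prefix-sum pattern: local running sums, a scan over the block totals, then an offset applied to each block. The sketch below shows that pattern sequentially in Python. The block size of 16 and the 128-element input mirror the figure labels; the function name is an assumption, and the real code exchanges the block totals between CPEs through register communication rather than a Python list.

```python
from itertools import accumulate


def three_stage_scan(values, block):
    """Inclusive prefix sum computed block by block, as a core array would."""
    # Stage 1: each core computes running sums over its own block.
    blocks = [values[i:i + block] for i in range(0, len(values), block)]
    local = [list(accumulate(b)) for b in blocks]

    # Stage 2: scan the block totals to get the offset each block must add.
    totals = [sums[-1] for sums in local]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Stage 3: each core adds the sum of all preceding blocks to its results.
    return [x + offset for sums, offset in zip(local, offsets) for x in sums]


values = list(range(1, 129))           # 128 values: 8 cores x 16 values each
result = three_stage_scan(values, 16)
print(result == list(accumulate(values)))
```

Only stage 2 needs communication between cores, and it moves one number per core, which is what makes the register buses a good fit.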
Performance Improvement through Redesign
1 Sunway CG (64 CPEs) could be equivalent to
- 0.1x Intel Core, or
- 1.8x Intel Core, or
- 7.2x Intel Core, or
- in certain cases 43.1x Intel Core
Scaling the Dynamic Core to Millions of Cores
[Figure: PFlops (log axis, 0.01 to 5) versus number of processes (512, 2048, 8192, 32768, 131072). Series: 48, 192, 768, and 650 elements in each process. Data labels: 3.3 PFlops (para. eff. 98.55%), 2.4 PFlops (para. eff. 92.2%), 1.76 PFlops (para. eff. 88.3%), 2.72 PFlops (para. eff. 92.9%).]
Simulation of Hurricane Katrina
Application III: Nonlinear Earthquake Simulation on Sunway TaihuLight
• Dynamic rupture source generator originated from CG-FDM
• Seismic wave propagation originated from AWP-ODC
• Other utilities:
- source partitioner
- 3D Model Interpolator
- Restart controller
[Figure: simulation framework. Utilities: Source Partitioner, Restart Controller, 3D Model Interpolator, Snapshot/Seismo Recorder; I/O: LZ4 Compression, Group I/O, Balanced I/O Forwarding. Dynamic Rupture Source Generator (based on CG-FDM): 3D Vel/Den Model, Fault Stress Init, Friction Law Ctrl, Wave Eqn Solver. Seismic Wave Propagation (based on AWP-ODC): Velocity Update, Stress Update, Source Injection, Stress Adjustment for Plasticity, Next Timestep.]
Multi-Level Domain Decomposition
[Figure: four levels of decomposition along x, y, z. (1) MPI decomposition (Mx, My); (2) CG blocking (By, Bz); (3) Athread decomposition (Cy, Cz); (4) LDM buffering scheme (wx, wy, wz) along the compute direction, with finished, computing, buffer, and unfinished areas.]
A Balanced Memory Scheme
[Figure: array fusion, before and after. Before: one array per component (xx, yy, zz, ..., yz and u, v, w), each holding points 1 to n. After: fused arrays dvelcx (1u 1v 1w 2u 2v 2w ...) and dstrqc (1xx 1yy 1zz 1xy 1xz 1yz 2xx 2yy 2zz 2xy 2xz 2yz ...).]
[Figure: halo exchange across CPEs ID:00 to ID:63. Each CPE holds a left boundary, an inner part, and a right boundary; labels: DMA transfer, register communication.]
1. Array fusion
2. Halo exchange through register communication
3. Optimized blocking configuration guided by an analytical model
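Array fusion interleaves per-component arrays so that all components of one grid point sit next to each other, which lets a block of points move in one contiguous transfer instead of one transfer per component. The sketch below shows the layout change on toy data in Python; the helper names are hypothetical, and dvelcx is used only as a label borrowed from the figure.

```python
def fuse_arrays(*components):
    """Interleave equal-length component arrays point by point."""
    n = len(components[0])
    if any(len(c) != n for c in components):
        raise ValueError("all component arrays must have the same length")
    return [c[i] for i in range(n) for c in components]


def block(fused, start, count, ncomp):
    """Contiguous slice holding `count` points starting at point `start`."""
    return fused[start * ncomp:(start + count) * ncomp]


# Before fusion: one array per velocity component.
u = ["1u", "2u", "3u"]
v = ["1v", "2v", "3v"]
w = ["1w", "2w", "3w"]

# After fusion: a single array, as in the dvelcx layout of the figure.
dvelcx = fuse_arrays(u, v, w)
print(dvelcx)
print(block(dvelcx, 1, 2, 3))   # points 2 and 3 in one contiguous range
```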
On-the-fly Compression
[Figure, panels (a) to (d):
(a) Collect statistic from coarse grid (coarse, fine; min/max)
(b) Computation workflow: host memory, main memory, LDM, CPE
(c) Decompress-compute-compress scheme: compressed grid, dma_get, decompressed block, 16b to 32b decompression, general 32b computation (13-point stencil), 32b to 16b compression, dma_put; in-place decomposition
(d) Compression algorithms (1) to (3), from the IEEE754 32-bit floating point format (sign, exp (8b), frac (23b)) to 16-bit floating point formats:
- IEEE754 32b to 16b FP conversion: sign, exp (5b), frac (10b)
- sign, exp (0-8b), frac (7-15b), with the exponent width taken from a log2 expression of the min/max range
- sign, frac (15b), with V_cmpr defined from V, V_min, and V_max
Variable groups shown: (str, r1, r2, …, r6, sigma2, yldfac); (vel, ww0, phi, cohes, taxx, …, taxz); (d1, lam, mu, qp, qs, vx1, vx2, ww).]
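The decompress-compute-compress idea can be tried out with Python's standard struct module, whose "e" format code packs a value as IEEE 754 binary16. The sketch below covers only the plain 32b-to-16b conversion case; the two range-based formats on the slide are not reproduced. Function names are hypothetical, and Python floats are 64-bit, so "compute" here runs at higher precision than the 32-bit arithmetic described on the slide.

```python
import struct


def compress16(x):
    """Store a value as 2 bytes of IEEE 754 binary16 (raises on overflow)."""
    return struct.pack("<e", x)


def decompress16(b):
    """Expand 2 bytes of binary16 back to a Python float."""
    return struct.unpack("<e", b)[0]


def update_block(compressed, compute):
    """Decompress a block, apply the computation, and compress the result."""
    values = [decompress16(b) for b in compressed]    # 16b -> full precision
    values = [compute(v) for v in values]             # compute at full precision
    return [compress16(v) for v in values]            # back to 16b for storage


stored = [compress16(v) for v in (0.25, 1.5)]
stored = update_block(stored, lambda v: 2.0 * v)
print([decompress16(b) for b in stored])

# Round-trip error for a value that binary16 cannot represent exactly.
third = decompress16(compress16(1.0 / 3.0))
print(abs(third - 1.0 / 3.0))
```

Halving the stored width halves the bytes moved per grid point, which is where the bandwidth gain comes from; the cost is the rounding error visible in the last line.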
Speedup: CPE vs. 1 MPE
[Figure: bar chart of speedup (0 to 60) for the MPE, PAR, MEM, and CMPR versions. Data labels: 40.6, 47.8, 45.4, 28.9, 27.6, 22.9, 4.2, 39.3, 12.9, 13.1.]
Memory Bandwidth Utilization
[Figure: bar chart of DMA bandwidth (0 to 30) for the PAR, MEM, and CMPR versions. Data labels: 21.2 (62%), 18.5 (54%), 23.8 (70%), 24.8 (73%), 12.4 (36%), 26.9 (79%), 27 (79%).]
Weak Scaling
[Figure: PFLOPS (log axis, 0.6 to 18) versus number of processes (8K, 12K, 16K, 24K, 32K, 40K, 48K, 64K, 80K, 96K, 120K, 160K), each series plotted against its ideal line:
- Linear (Peak: 10.7 PFlops, Para. eff. 97.9%)
- Non-linear (Peak: 15.2 PFlops, Para. eff. 80.1%)
- Linear+Compress (Peak: 14.2 PFlops, Para. eff. 96.5%)
- Non-linear+Compress (Peak: 18.9 PFlops, Para. eff. 79.5%)]
Strong Scaling
[Figure: four panels of speedup (log axis, 1 to 22) versus number of processes (8K, 12K, 16K, 24K, 32K, 48K, 64K, 80K, 100K, 128K, 160K), each with an ideal line and curves for dx=100m, dx=50m, and dx=16m. Efficiency labels per panel:
- Linear: 79.9%, 73.6%, 63.6%
- Non-Linear: 75.5%, 75.6%, 53.3%
- Linear+Compress: 75.8%, 72.4%, 51.2%
- Non-Linear+Compress: 67.5%, 67.2%, 51.7%]
Simulation Results: 200 m vs 16 m
[Figure: panels (a) to (f) for the region 116˚E to 119˚E, 38˚N to 40˚N, with Beijing, Tangshan, Shunyi, Tianjin, Ninghe, Luanxian, Cangzhou, Bohai, Luannan, and Wuqing marked. (a) Seismograms over 0 s to 100 s at Ninghe and Cangzhou (200m); (b) the same stations (16m); (c), (d) velocity maps, color scale −0.2 to 0.2 m/s; (e), (f) intensity maps, color scale 5 to 11.]
Long Term Plan for Sunway TaihuLight
Long Term Plan
- Traditional HPC Applications (Science -> Service)
- Deep Learning Related Applications
- Sunway Micro
“15-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of Realistic 10 Hz Scenarios”, Gordon Bell Prize Finalist, SC 2017.
[Figure: Training AlexNet with swCaffe. Average time per iteration (0 to 80) for Intel (24 cores), swBLAS (1 CG), and swDNN (1 CG), split into total, convolution, and fully connected. Speedup labels: 1.0x, 2.2x, 3.5x; 1.0x, 2.7x, 4.5x; 1.0x, 7.1x, 8.5x.]
Acknowledgements
• MOST China: major sponsors of the HPC hardware and software development
• NRCPC: vendor of the machine
• NCAR: Rich Loft, John Dennis, Allison Baker, Haiying Xu, support and advice on the CAM-SE work
• SCEC: Yifeng Cui, Steve Day, Daniel Roten, Kim Olsen, Josh Tobin, Alex Breuer, and Dawei Mu, discussion and advice on the earthquake simulation work