Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Early Experience with P100 on Power8
Christian Trott - Center for Computing Research
Sandia National Laboratories/NM
Unclassified Unlimited Release SAND2016-11748 C
Sandia’s Path Forward
- Mandate to run across all DOE Leadership class Systems
  - Trinity – LANL/SNL – KNL/Haswell (2017)
  - Cori – NERSC – KNL (2017)
  - Sierra – LLNL – POWER9/Volta (2018)
  - Summit – ORNL – POWER9/Volta (2018)
  - Aurora – ANL – KNH (2019)
- We need performance portable code:
  - Kokkos for C++ Applications (majority of our apps)
  - OpenMP for Fortran
[Figure: applications and libraries (LAMMPS, Sierra, Albany, Trilinos) layered on Kokkos, which targets node architectures with differing DDR and HBM memory configurations]
https://github.com/kokkos
Kokkos: Performance, Portability and Productivity
Kokkos: Performance Portability through Abstraction
Kokkos
Parallel Execution
- Execution Spaces (“Where”)
  - N-Level
  - Support Heterogeneous Execution
- Execution Patterns (“What”)
  - parallel_for/reduce/scan, task spawn
  - Enable nesting
- Execution Policies (“How”)
  - Range, Team, Task-Dag
  - Dynamic / Static Scheduling
  - Support non-persistent scratch-pads
Data Structures
- Memory Spaces (“Where”)
  - Multiple-Levels
  - Logical Space (think UVM vs explicit)
- Memory Layouts (“How”)
  - Architecture dependent index-maps
  - Also needed for subviews
- Memory Traits
  - Access Intent: Stream, Random, …
  - Access Behavior: Atomic
  - Enables special load paths: i.e. texture
OpenMP Support/Interoperability
- OpenMP 4.5+ backend for Kokkos
- Goal: interoperability of C++ apps/libraries in native OpenMP 4.5 with Kokkos libraries/apps.
- On branch: basic features are now working, including data management and simple data parallelism
- Expected to go into main-line Kokkos early next year.
- Working closely with IBM CORAL OpenMP research compiler group and IBM XL
https://www.ibm.com/developerworks/community/groups/service/html/communitystart?communityUuid=8e0d7b52-b996-424b-bb33-345205594e0d
P100+POWER8: a testbed for Sierra
- P100+POWER8 have common features with Sierra/Summit
  - NVLink: First generation now, second generation later
  - IBM Software stack (XL compiler, Cuda for Power)
  - HBM memory
  - Infiniband Network
  - Significant new P100 feature: Hardware double precision atomic add
- Get software to work
  - Port applications to POWER (not always trivial due to different memory model from X86)
  - Benchmarks Focus on Kokkos: using GCC 5.3.0/CUDA 8.0.44

Possible through dedicated TestBed team at Sandia: Special thanks to Si Hammond
Synthetic Benchmarks HBM
[Figure: four bar charts comparing K40, K80 and P100, each in GB/s: Global Bandwidth, Cache Bandwidth, Global Gather Bandwidth, Cache Gather Bandwidth]
kokkos/benchmarks/bytes_and_flops
Synthetic Benchmarks NVLINK
kokkos/benchmarks/bytes_and_flops (Set MemorySpace of Arrays to HostPinned or UVM)
[Figure: Bandwidth in GB/s (log scale) vs. Problem Size [GB] from 0.125 to 64; series: HostPinned 1, HostPinned 32, UVM 1, UVM 32]
LAMMPS – A Science Application
[Figure: three panels (Lennard Jones, EAM, Tersoff) showing Million Atomsteps per Second at 4000, 32000, 256000 and 2048000; series: K40 Full, K80 Full, P100 Full, K40 Half, K80 Half, P100 Half, P100 Half-HWAtomic]
Finite Element Assembly
[Figure: Matrix Fill (Graph is reused); Elements Per Second (in Thousand) on K40, K80 and P100; series: NoAtomic, Atomic, HW-Atomic]
- Our Finite Element Codes have 3 Phases:
  - Graph Construction (reused over multiple timesteps)
  - Matrix Assembly (20%-50%)
  - Solve (30%-80%)
- Matrix Assembly
  - Perform Physics Calculations
  - Add contributions to multiple matrix entries
  - Major limiter is number of load/store operations in flight
  - P100 adds hardware atomic add for doubles (used CAS loops before)
Conclusion
- P100 delivers increased performance for our algorithms
- Main improvement seems to come from memory-system
  - More loads/stores in flight
  - 2x-5x improvement over K40/K80
- Hardware Atomic Add for double helps significantly
  - About 20%-50% improvement vs using CAS Loops for algorithms
  - In many cases makes non-atomic algorithms obsolete
- What about NVLink?
  - Codes ported to GPUs are not designed to benefit from it…
  - Codes which need it are not yet ported…
Kokkos Demo Station at DOE booth: today 2-4pm