Early Experience with P100 on Power8
Christian Trott - Center for Computing Research
Sandia National Laboratories/NM
Unclassified Unlimited Release SAND2016-11748 C

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Early Experience with P100 on Power8 - Sandia National … · 2017-02-17 · Early Experience with P100 on Power8 ... basic features are now working ... 1 10 100 1000 0.125 0.25 0.5



Sandia’s Path Forward

§ Mandate to run across all DOE leadership-class systems
  § Trinity – LANL/SNL – KNL/Haswell (2017)
  § Cori – NERSC – KNL (2017)
  § Sierra – LLNL – POWER9/Volta (2018)
  § Summit – ORNL – POWER9/Volta (2018)
  § Aurora – ANL – KNH (2019)

§ We need performance portable code:
  § Kokkos for C++ applications (majority of our apps)
  § OpenMP for Fortran


Kokkos: Performance, Portability and Productivity

[Figure: software stack diagram. Labels: LAMMPS, Sierra, Albany, Trilinos; Kokkos; DDR and HBM memories.]

https://github.com/kokkos


Kokkos: Performance Portability through Abstraction

Parallel Execution

§ Execution Spaces (“Where”)
  - N-Level
  - Support Heterogeneous Execution

§ Execution Patterns (“What”)
  - parallel_for/reduce/scan, task spawn
  - Enable nesting

§ Execution Policies (“How”)
  - Range, Team, Task-Dag
  - Dynamic / Static Scheduling
  - Support non-persistent scratch-pads

Data Structures

§ Memory Spaces (“Where”)
  - Multiple-Levels
  - Logical Space (think UVM vs explicit)

§ Memory Layouts (“How”)
  - Architecture dependent index-maps
  - Also needed for subviews

§ Memory Traits
  - Access Intent: Stream, Random, …
  - Access Behavior: Atomic
  - Enables special load paths: i.e. texture
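The "architecture dependent index-maps" item can be illustrated with a minimal sketch in plain C++. This is not the Kokkos API: the names `index_right`, `index_left`, `Layout` and `Array2D` are hypothetical and chosen only for illustration. The idea shown is that the same `(i, j)` access can map to different memory offsets depending on a compile-time layout choice, comparable to a row-major versus a column-major layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; not part of Kokkos.

// Row-major index map: the last index j is contiguous in memory.
inline std::size_t index_right(std::size_t i, std::size_t j,
                               std::size_t n_rows, std::size_t n_cols) {
  assert(i < n_rows && j < n_cols);
  (void)n_rows;  // only used for the bounds check
  return i * n_cols + j;
}

// Column-major index map: the first index i is contiguous in memory.
inline std::size_t index_left(std::size_t i, std::size_t j,
                              std::size_t n_rows, std::size_t n_cols) {
  assert(i < n_rows && j < n_cols);
  (void)n_cols;  // only used for the bounds check
  return i + j * n_rows;
}

enum class Layout { Right, Left };

// A 2D array whose index map is selected at compile time.
// User code writes a(i, j) and never sees the layout.
template <Layout L>
struct Array2D {
  std::size_t n_rows;
  std::size_t n_cols;
  std::vector<double> data;

  Array2D(std::size_t rows, std::size_t cols)
      : n_rows(rows), n_cols(cols), data(rows * cols, 0.0) {}

  double& operator()(std::size_t i, std::size_t j) {
    const std::size_t offset = (L == Layout::Right)
                                   ? index_right(i, j, n_rows, n_cols)
                                   : index_left(i, j, n_rows, n_cols);
    return data[offset];
  }
};
```

With this arrangement a kernel written against `a(i, j)` stays unchanged while the layout is switched per architecture, for example so that consecutive GPU threads indexed by `i` touch consecutive addresses.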


OpenMP Support/Interoperability

§ OpenMP 4.5+ backend for Kokkos

§ Goal: interoperability of C++ apps/libraries in native OpenMP 4.5 with Kokkos libraries/apps.

§ On branch: basic features are now working, including data management and simple data parallelism

§ Expected to go into main-line Kokkos early next year.

§ Working closely with IBM CORAL OpenMP research compiler group and IBM XL

https://www.ibm.com/developerworks/community/groups/service/html/communitystart?communityUuid=8e0d7b52-b996-424b-bb33-345205594e0d


P100 + POWER8: a testbed for Sierra

§ P100 + POWER8 have common features with Sierra/Summit
  § NVLink: first generation now, second generation later
  § IBM software stack (XL compiler, CUDA for Power)
  § HBM memory
  § InfiniBand network
  § Significant new P100 feature: hardware double-precision atomic add

§ Get software to work
§ Port applications to POWER (not always trivial due to different memory model from x86)
§ Benchmarks focus on Kokkos: using GCC 5.3.0 / CUDA 8.0.44

Possible through dedicated TestBed team at Sandia: Special thanks to Si Hammond


Synthetic Benchmarks HBM

[Figure: four bar charts comparing K40, K80 and P100, each in GB/s: Global Bandwidth, Cache Bandwidth, Global Gather Bandwidth, Cache Gather Bandwidth.]

kokkos/benchmarks/bytes_and_flops


Synthetic Benchmarks NVLINK

kokkos/benchmarks/bytes_and_flops (Set MemorySpace of Arrays to HostPinned or UVM)

[Figure: Bandwidth in GB/s (log scale, 1 to 1000) versus Problem Size in GB (0.125 to 64). Series: HostPinned 1, HostPinned 32, UVM 1, UVM 32.]


LAMMPS – A Science Application

[Figure: three panels (Lennard Jones, EAM, Tersoff) showing Million Atomsteps per Second at sizes 4000, 32000, 256000 and 2048000. Series: K40 Full, K80 Full, P100 Full, K40 Half, K80 Half, P100 Half, P100 Half-HWAtomic.]


Finite Element Assembly

[Figure: bar chart "Matrix Fill (Graph is reused)", Elements Per Second (in Thousand) on K40, K80 and P100. Series: NoAtomic, Atomic, HW-Atomic.]

§ Our finite element codes have 3 phases:
  § Graph construction (reused over multiple timesteps)
  § Matrix assembly (20%-50%)
  § Solve (30%-80%)

§ Matrix assembly
  § Perform physics calculations
  § Add contributions to multiple matrix entries
  § Major limiter is number of load/store operations in flight
  § P100 adds hardware atomic add for doubles (used CAS loops before)
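The "CAS loops" mentioned above can be sketched as follows. This is a host-side C++ analogue using `std::atomic<double>`, not the GPU code used in the benchmarks; the names `atomic_add_cas` and `accumulate_contributions` are hypothetical. It shows how an atomic add on a double can be emulated when only compare-and-swap is available: read the current value, compute the sum, and retry until no other thread has modified the entry in between.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Emulated atomic add for doubles using a compare-and-swap loop.
// Returns the value stored before the add.
inline double atomic_add_cas(std::atomic<double>& target, double value) {
  double expected = target.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak writes the current value into
  // `expected`, so the sum is recomputed on the next attempt.
  while (!target.compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
  }
  return expected;
}

// Hypothetical stand-in for matrix assembly: several threads each add
// n_adds contributions of 1.0 into the same shared entry.
inline double accumulate_contributions(int n_threads, int n_adds) {
  assert(n_threads > 0 && n_adds >= 0);
  std::atomic<double> entry{0.0};
  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; ++t) {
    workers.emplace_back([&entry, n_adds] {
      for (int k = 0; k < n_adds; ++k) {
        atomic_add_cas(entry, 1.0);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return entry.load();
}
```

Under contention each failed attempt costs another load and another compare-and-swap, which is one plausible reason a single hardware atomic add instruction helps when load/store operations in flight are the limiter.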


Conclusion

§ P100 delivers increased performance for our algorithms

§ Main improvement seems to come from memory system
  § More loads/stores in flight
  § 2x-5x improvement over K40/K80

§ Hardware atomic add for double helps significantly
  § About 20%-50% improvement vs using CAS loops for algorithms
  § In many cases makes non-atomic algorithms obsolete

§ What about NVLink?
  § Codes ported to GPUs are not designed to benefit from it…
  § Codes which need it are not yet ported…


Kokkos Demo Station at DOE booth: today 2-4pm