Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.
Kokkos: The C++ Performance Portability Programming Model

Christian Trott ([email protected]), Carter Edwards, D. Sunderland, N. Ellingwood, D. Ibanez, S. Hammond, S. Rajamanickam, K. Kim, M. Deveci, M. Hoemmen, G.

Center for Computing Research, Sandia National Laboratories, NM    SAND2017-4935C
New Programming Models

§ HPC is at a crossroads
  § Diversifying hardware architectures
  § More parallelism necessitates a paradigm shift from MPI-only
§ Need for new programming models
  § Performance portability: OpenMP 4.5, OpenACC, Kokkos, RAJA, SyCL, C++20?, …
  § Resilience and load balancing: Legion, HPX, UPC++, ...
§ Vendor decoupling drives external development
2
What is Kokkos? What is new?

Why should you trust us?

Kokkos: Performance, Portability and Productivity
[Diagram: the Kokkos layer sits between applications (LAMMPS, Sierra, Albany, Trilinos) and diverse node architectures with DDR and HBM memories]
https://github.com/kokkos
Performance Portability through Abstraction

Kokkos: Separating Concerns for Future Systems

Parallel Execution
§ Execution Spaces ("Where")
  - N-level
  - Support heterogeneous execution
§ Execution Patterns ("What")
  - parallel_for/reduce/scan, task spawn
  - Enable nesting
§ Execution Policies ("How")
  - Range, Team, Task-DAG
  - Dynamic / static scheduling
  - Support non-persistent scratch-pads

Data Structures
§ Memory Spaces ("Where")
  - Multiple levels
  - Logical space (think UVM vs. explicit)
§ Memory Layouts ("How")
  - Architecture-dependent index maps
  - Also needed for subviews
§ Memory Traits
  - Access intent: Stream, Random, …
  - Access behavior: Atomic
  - Enables special load paths, e.g. texture
Capability Matrix

6

                           Par.   Par.  Tightly  Non-tightly  Task   Data    Data   Adv. Data
Implementation  Technique  Loops  Red.  Nested   Nested       Par.   Alloc.  Xfers  Abstractions
Kokkos      C++ Abstraction  X      X     X        X            X      X       X      X
OpenMP      Directives       X      X     X        X            X      X       X      -
OpenACC     Directives       X      X     X        X            -      X       X      -
CUDA        Extension       (X)     -    (X)       X            -      X       X      -
OpenCL      Extension       (X)     -    (X)       X            -      X       X      -
C++AMP      Extension        X      -     X        -            -      X       X      -
Raja        C++ Abstraction  X      X     X       (X)           -      -       -      -
TBB         C++ Abstraction  X      X     X        X            X      X       -      -
C++17       Language         X      -     -        -           (X)     X      (X)    (X)
Fortran2008 Language         X      -     -        -            -      X      (X)     -
Example: Conjugate Gradient Solver

§ Simple iterative linear solver
  § For example used in MiniFE
§ Uses only three math operations:
  § Vector addition (AXPBY)
  § Dot product (DOT)
  § Sparse matrix-vector multiply (SPMV)
§ Data management with Kokkos Views:

7

View<double*, HostSpace, MemoryTraits<Unmanaged>> h_x(x_in, nrows);
View<double*> x("x", nrows);
deep_copy(x, h_x);
CGSolve: The AXPBY

8

§ Simple data-parallel loop: Kokkos::parallel_for
§ Easy to express in most programming models
§ Bandwidth bound

§ Serial implementation:

void axpby(int n, double* z, double alpha, const double* x,
           double beta, const double* y) {
  for(int i=0; i<n; i++)
    z[i] = alpha*x[i] + beta*y[i];
}

§ Kokkos implementation:

void axpby(int n, View<double*> z, double alpha, View<const double*> x,
           double beta, View<const double*> y) {
  parallel_for("AXpBY", n, KOKKOS_LAMBDA (const int& i) {
    z(i) = alpha*x(i) + beta*y(i);
  });
}
Anatomy of the Kokkos implementation:
§ Parallel pattern: for loop
§ String label "AXpBY": profiling/debugging
§ Execution policy: do n iterations
§ Iteration handle: integer index
§ Loop body: z(i) = alpha*x(i) + beta*y(i)
CGSolve: The Dot Product

15

§ Simple data-parallel loop with reduction: Kokkos::parallel_reduce
§ Nontrivial in CUDA due to lack of built-in reduction support
§ Bandwidth bound

§ Serial implementation:

double dot(int n, const double* x, const double* y) {
  double sum = 0.0;
  for(int i=0; i<n; i++)
    sum += x[i]*y[i];
  return sum;
}

§ Kokkos implementation:

double dot(int n, View<const double*> x, View<const double*> y) {
  double x_dot_y = 0.0;
  parallel_reduce("Dot", n, KOKKOS_LAMBDA (const int& i, double& sum) {
    sum += x[i]*y[i];
  }, x_dot_y);
  return x_dot_y;
}
Anatomy of the Kokkos implementation:
§ Parallel pattern: loop with reduction
§ Iteration index + thread-local reduction variable (sum)
CGSolve: The SPMV

§ Loop over rows
§ Dot product of matrix row with a vector
§ Example of non-tightly nested loops
§ Random access on the vector (texture fetch on GPUs)

19

void SPMV(int nrows, const int* A_row_offsets, const int* A_cols,
          const double* A_vals, double* y, const double* x) {
  for(int row=0; row<nrows; ++row) {
    double sum = 0.0;
    int row_start = A_row_offsets[row];
    int row_end   = A_row_offsets[row+1];
    for(int i=row_start; i<row_end; ++i) {
      sum += A_vals[i]*x[A_cols[i]];
    }
    y[row] = sum;
  }
}
Anatomy of the serial implementation:
§ Outer loop over matrix rows
§ Inner dot product of row x vector
CGSolve: The SPMV

23

void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols,
          View<const double*> A_vals, View<double*> y,
          View<const double*, MemoryTraits<RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDA
  int rows_per_team = 64;
  int team_size = 64;
#else
  int rows_per_team = 512;
  int team_size = 1;
#endif
  parallel_for("SPMV:Hierarchy",
    TeamPolicy<Schedule<Static>>((nrows+rows_per_team-1)/rows_per_team, team_size, 8),
    KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
      const int first_row = team.league_rank()*rows_per_team;
      const int last_row  = first_row+rows_per_team < nrows ?
                            first_row+rows_per_team : nrows;
      parallel_for(TeamThreadRange(team, first_row, last_row), [&] (const int row) {
        const int row_start  = A_row_offsets[row];
        const int row_length = A_row_offsets[row+1] - row_start;
        double y_row;
        parallel_reduce(ThreadVectorRange(team, row_length),
          [=] (const int i, double& sum) {
            sum += A_vals(i+row_start)*x(A_cols(i+row_start));
          }, y_row);
        y(row) = y_row;
      });
    });
}
Anatomy of the Kokkos implementation:
§ Team parallelism over row worksets
§ Distribute rows in a workset over team threads
§ Row x vector dot product over vector lanes
CGSolve: Performance

28

[Charts: performance in Gflop/s for AXPBY, DOT and SPMV at problem sizes 100 and 200. Left: NVIDIA P100 / IBM Power8 (OpenACC, CUDA, Kokkos, OpenMP). Right: Intel KNL 7250 (OpenACC, Kokkos, OpenMP, TBB flat, TBB hierarchical).]
§ Comparison with other programming models
§ Straightforward implementation of kernels
§ OpenMP 4.5 is immature at this point
§ Two problem sizes: 100x100x100 and 200x200x200 elements
Custom Reductions With Lambdas

§ Added capability to do custom reductions with lambdas
§ Provide built-in reducers for common operations
  § Add, Min, Max, Prod, MinLoc, MaxLoc, MinMaxLoc, And, Or, Xor, …
§ Users can implement their own reducers
§ Example Max reduction:

29

double result;
parallel_reduce(N, KOKKOS_LAMBDA (const int& i, double& lmax) {
  if(lmax < a(i)) lmax = a(i);
}, Max<double>(result));
[Chart: reduction performance (bandwidth in GB/s) for Sum, Max and MaxLoc reductions on K80, P100 and KNL.]
New Features: MDRangePolicy

§ Many people perform structured grid calculations
  § Sandia's codes are predominantly unstructured though
§ MDRangePolicy introduced for tightly nested loops
  § Use case corresponding to the OpenMP collapse clause
§ Optionally set iteration order and tiling:

30

void launch(int N0, int N1, [ARGS]) {
  parallel_for(MDRangePolicy<Rank<3>>({0,0,0},{N0,N1,N2}),
    KOKKOS_LAMBDA (int i0, int i1, int i2) { /*...*/ });
}

MDRangePolicy<Rank<3,Iterate::Right,Iterate::Left>>
  ({0,0,0}, {N0,N1,N2}, {T0,T1,T2})
[Chart: Albany kernel performance (Gflop/s) on KNL and P100: naïve, best raw, MDRange.]
New Features: Task Graphs

§ Task DAGs are an important class of algorithmic structures
§ Used for algorithms with complex data dependencies
  § For example triangular solve
§ Kokkos tasking is on-node
  § Future based, not explicit data centric (as, for example, Legion)
§ Tasks return futures
§ Tasks can depend on futures
§ Respawn of tasks possible
§ Tasks can spawn other tasks
§ Tasks can have data-parallel loops
  § I.e. a task can utilize multiple threads, like the hyperthreads on a core or a CUDA block

31

Carter Edwards, S7253 "Task Data Parallelism", right after this talk.
New Features: Pascal Support

§ Pascal GPUs provide a set of new capabilities
  § Much better memory subsystem
  § NVLink (next slide)
  § Hardware support for double-precision atomic add
§ Generally significant speedup of 3-5x over K80 for Sandia apps

32
[Chart: LAMMPS Tersoff potential, million atom-steps per second vs. number of atoms (4000 to 2048000) on K40, K80 and P100, with and without atomics.]

[Chart: atomic bandwidth in GB/s vs. update array size (1 to 4096).]
New Features: HBM Support

§ New architectures with HBM: Intel KNL, NVIDIA P100
§ Generally three types of allocations:
  § Page pinned in DDR
  § Page pinned in HBM
  § Page migratable by OS or hardware caching
§ Kokkos supports all three on both architectures
  § For the Cuda backend: CudaHostPinnedSpace, CudaSpace and CudaUVMSpace
  § E.g.: Kokkos::View<double*, CudaUVMSpace> a("A", N);
§ P100 bandwidth with and without data reuse

33
[Chart: bandwidth in GB/s vs. problem size (0.125 to 64 GB) for HostPinned and UVM allocations with 1 and 32 data reuses.]
Upcoming Features

§ Support for OpenMP 4.5+ target backend
  § Experimentally available on github
  § CUDA will stay the preferred backend
  § Maybe support for FPGAs in the future?
  § Help maturing OpenMP 4.5+ compilers
§ Support for AMD ROCm backend
  § Experimentally available on github
  § Mainly developed by AMD
  § Support for APUs and discrete GPUs
  § Expect maturation fall 2017
34
Beyond Capabilities

§ Using Kokkos is invasive; you can't just swap it out
  § A significant part of the data structures needs to be taken over
  § Function markings everywhere
§ It is not enough to have initial capabilities
  § Robustness comes from utilization and experience
  § Different types of applications and coding styles will expose different corner cases
§ Applications need libraries
  § Interaction with TPLs such as BLAS must work
  § Many library capabilities must be reimplemented in the programming model
§ Applications need tools
  § Profiling and debugging capabilities are required for serious work
35
Timeline

36

2008-2017:
§ Initial Kokkos: linear algebra for Trilinos
§ Restart of Kokkos: scope now a programming model
§ Mantevo MiniApps: compare Kokkos to other models
§ LAMMPS: demonstrate legacy app transition
§ Trilinos: move Tpetra over to use Kokkos Views
§ Multiple apps start exploring (Albany, Uintah, …)
§ Sandia multiday tutorial (~80 attendees)
§ Sandia decision to prefer Kokkos over other models
§ Github release of Kokkos 2.0
§ Kokkos-Kernels and Kokkos-Tools release
§ DOE Exascale Computing Project starts
[Chart: LULESH Figure of Merit results (Problem 60), FOM (Z/s), for HSW 1x16, HSW 1x32, P8 1x40 XL, KNC 1x224, ARM64 1x8 and NV K40; higher is better. Results by Dennis Dinge, Christian Trott and Si Hammond.]
Initial Demonstrations (2012-2015)

37

§ Demonstrate feasibility of performance portability
  § Development of a number of MiniApps from different science domains
§ Demonstrate low performance loss versus native models
  § MiniApps are implemented in various programming models
§ DOE Tri-Lab collaboration
  § Show Kokkos works for other labs' apps
  § Note this is historical data: improvements were found, RAJA implemented similar optimizations, etc.
Training the User Base

§ Typical legacy application developer
  § Science background
  § Mostly serial coding (MPI apps usually have a communication layer few people touch)
  § Little hardware background, little parallel programming experience
§ Not sufficient to teach programming model syntax
  § Need training in parallel programming techniques
  § Teach fundamental hardware knowledge (how do CPU, MIC and GPU differ, and what does it mean for my code)
  § Need training in performance profiling
§ Regular Kokkos tutorials
  § ~200 slides, 9 hands-on exercises to teach parallel programming techniques, performance considerations and Kokkos
  § Now a dedicated ECP Kokkos support project: develop an online support community
  § ~200 HPC developers (mostly from DOE labs) have had Kokkos training so far

38
Keeping Applications Happy

§ Never underestimate developers' ability to find new corner cases!!
  § Having a programming model deployed in MiniApps or a single big app is very different from having half a dozen multi-million line code customers.
§ 538 issues in 24 months
  § 28% are small enhancements
  § 18% bigger feature requests
  § 24% are bugs: often corner cases

39

[Chart: issues since 2015 by category: enhancements, feature requests, bugs, compiler issues, questions, other.]

§ Example: subviews
  § Initially the data type needed to match, including compile-time dimensions
  § Allow compile/runtime conversion
  § Allow layout conversion if possible
  § Automatically find best layout
  § Add subview patterns
Testing and Software Quality

Compilers: GCC (4.8-6.3), Clang (3.6-4.0), Intel (15.0-18.0), IBM (13.1.5, 14.0), PGI (17.3), NVIDIA (7.0-8.0)
Hardware:  Intel Haswell, Intel KNL, ARMv8, IBM Power8, NVIDIA K80, NVIDIA P100
Backends:  OpenMP, Pthreads, Serial, CUDA
Warnings treated as errors

Workflow: feature development/tests → review → tests → develop → integration → master → release version

New features are developed on forks and branches. A limited number of developers can push to the develop branch. Pull requests are reviewed/tested (nightly, unit tests).

Each merge into master is a minor release. An extensive integration test suite (Trilinos, LAMMPS, … multi-config integration tests) ensures backward compatibility and catches unit-test coverage gaps.
Building an EcoSystem

41

[Diagram: the Kokkos ecosystem stack]
§ Applications and MiniApps
§ Trilinos (linear solvers, load balancing, discretization, distributed linear algebra)
§ Kokkos-Kernels (sparse/dense BLAS, graph kernels, tensor kernels)
§ Algorithms (Random, Sort); Containers (Map, CrsGraph, Mem Pool)
§ Kokkos (parallel execution, data allocation, data transfer)
§ Backends: std::thread, OpenMP, CUDA, ROCm
§ Alongside the stack: Kokkos-Tools (Kokkos-aware profiling and debugging tools) and the Kokkos support community (application support, developer training)
Kokkos Tools

42

§ Utilities
  § KernelFilter: enable/disable profiling for a selection of kernels
§ Kernel inspection
  § KernelLogger: runtime information about entering/leaving kernels and regions
  § KernelTimer: postprocessing information about kernel and region times
§ Memory analysis
  § MemoryHighWater: maximum memory footprint over the whole run
  § MemoryUsage: per-memory-space utilization timeline
  § MemoryEvents: per-memory-space allocations and deallocations
§ Third-party connectors
  § VTune Connector: mark kernels as frames inside of VTune
  § VTune Focused Connector: mark kernels as frames + start/stop profiling

https://github.com/kokkos/kokkos-tools
Kokkos-Tools: Example MemUsage

§ Tools are loaded at runtime
  § Profile actual release builds of applications
  § Set via: export KOKKOS_PROFILE_LIBRARY=[PATH_TO_PROFILING_LIB]
§ Output depends on the tool
  § Often one file per process
§ MemoryUsage provides per-MemorySpace utilization timelines
  § Time starts with Kokkos::initialize
  § HOSTNAME-PROCESSID-CudaUVM.memspace_usage

43

# Space CudaUVM
# Time(s)   Size(MB)   HighWater(MB)   HighWater-Process(MB)
0.317260    38.1       38.1            81.8
0.377285    0.0        38.1            158.1
0.384785    38.1       38.1            158.1
0.441988    0.0        38.1            158.1
Kokkos-Kernels

44

§ Provide BLAS (1,2,3), sparse, graph and tensor kernels
§ No required dependencies other than Kokkos
§ Local kernels (no MPI)
§ Hooks in TPLs such as MKL or cuBLAS/cuSparse where applicable
§ Provide kernels for all levels of hierarchical parallelism:
  § Global kernels: use all execution resources available
  § Team-level kernels: use a subset of threads for execution
  § Thread-level kernels: utilize vectorization inside the kernel
  § Serial kernels: provide elemental functions (OpenMP declare SIMD)
§ Work started based on customer priorities; expect a multi-year effort for broad coverage
§ People: many developers from Trilinos contribute
  § Consolidate node-level reusable kernels previously distributed over multiple packages
Kokkos-Kernels: Dense BLAS Example

45

[Charts: batched DGEMM and batched TRSM with 32k blocks, performance (GFlop/s) vs. matrix block size (3, 5, 10, 15) for MKL @ KNL, KK @ KNL, cuBLAS @ P100, KK @ P100.]
§ Batched small matrices using an interleaved memory layout
§ Matrix sizes based on common physics problems: 3, 5, 10, 15
§ 32k small matrices
§ Vendor libraries get better for more and larger matrices
Kokkos Users Spread

46

§ Users from a dozen major institutions
§ More than two dozen applications/libraries
  § Including many multi-million-line projects

[Diagram: map of Kokkos users and backend-optimization contributors]
Further Material

§ https://github.com/kokkos — Kokkos Github Organization
  § Kokkos: core library, Containers, Algorithms
  § Kokkos-Kernels: sparse and dense BLAS, graph, tensor (under development)
  § Kokkos-Tools: profiling and debugging
  § Kokkos-MiniApps: MiniApp repository and links
  § Kokkos-Tutorials: extensive tutorials with hands-on exercises
§ https://cs.sandia.gov — publications (search for 'Kokkos')
  § Many presentations on Kokkos and its use in libraries and apps
§ Talks at this GTC:
  § Carter Edwards, S7253 "Task Data Parallelism", today 10:00, 211B
  § Ramanan Sankaran, S7561 "High Pres. Reacting Flows", today 1:30, 212B
  § Pierre Kestener, S7166 "High Res. Fluid Dynamics", today 14:30, 212B
  § Michael Carilli, S7148 "Liquid Rocket Simulations", today 16:00, 212B
  § Panel, S7564 "Accelerator Programming Ecosystems", Tuesday 16:00, Ball3
  § Training Lab, L7107 "Kokkos, Manycore PP", Wednesday 16:00, LL21E

47
http://www.github.com/kokkos