Kokkos: The C++ Performance Portability Programming Model

Christian Trott (crtrott@sandia.gov), Carter Edwards, D. Sunderland, N. Ellingwood, D. Ibanez, S. Hammond, S. Rajamanickam, K. Kim, M. Deveci, M. Hoemmen, G.

Center for Computing Research, Sandia National Laboratories, NM
SAND2017-4935 C

on-demand.gputechconf.com/gtc/2017/presentation/s7344-christian …

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.

New Programming Models

§ HPC is at a Crossroads
  § Diversifying Hardware Architectures
  § More parallelism necessitates paradigm shift from MPI-only
§ Need for New Programming Models
  § Performance Portability: OpenMP 4.5, OpenACC, Kokkos, RAJA, SYCL, C++20?, …
  § Resilience and Load Balancing: Legion, HPX, UPC++, ...
§ Vendor decoupling drives external development

What is Kokkos? What is new? Why should you trust us?

Kokkos: Performance, Portability and Productivity

[Figure: diagram with the labels LAMMPS, Sierra, Albany, Trilinos; Kokkos; and node architectures with DDR and HBM memory.]

https://github.com/kokkos

Performance Portability through Abstraction

Separation of Concerns for Future Systems…

Parallel Execution
§ Execution Spaces (“Where”)
  - N-Level
  - Support Heterogeneous Execution
§ Execution Patterns (“What”)
  - parallel_for/reduce/scan, task spawn
  - Enable nesting
§ Execution Policies (“How”)
  - Range, Team, Task-Dag
  - Dynamic / Static Scheduling
  - Support non-persistent scratch-pads

Data Structures
§ Memory Spaces (“Where”)
  - Multiple-Levels
  - Logical Space (think UVM vs explicit)
§ Memory Layouts (“How”)
  - Architecture dependent index-maps
  - Also needed for subviews
§ Memory Traits
  - Access Intent: Stream, Random, …
  - Access Behavior: Atomic
  - Enables special load paths: i.e. texture
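The "architecture dependent index-maps" bullet can be illustrated without Kokkos. The sketch below, in plain C++, maps a 2D index (i, j) to a linear offset in two ways: row-major (rightmost index contiguous, which Kokkos calls LayoutRight) and column-major (leftmost index contiguous, LayoutLeft). The function names `offset_right` and `offset_left` are hypothetical and are not part of the Kokkos API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not Kokkos API): linear offset of element (i, j)
// in an n0 x n1 array under two common memory layouts.

// Row-major: the rightmost index j is contiguous in memory.
std::size_t offset_right(std::size_t i, std::size_t j,
                         std::size_t n0, std::size_t n1) {
  assert(i < n0 && j < n1);
  return i * n1 + j;
}

// Column-major: the leftmost index i is contiguous in memory.
std::size_t offset_left(std::size_t i, std::size_t j,
                        std::size_t n0, std::size_t n1) {
  assert(i < n0 && j < n1);
  return i + j * n0;
}
```

A CPU thread that walks j for a fixed i typically benefits from the row-major map, while GPU threads that take consecutive values of i typically benefit from the column-major map, since neighboring threads then touch neighboring addresses.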

Capability Matrix

Model        Implementation   Parallel  Parallel   Tightly       Non-tightly   Task         Data         Data       Advanced Data
             Technique        Loops     Reduction  Nested Loops  Nested Loops  Parallelism  Allocations  Transfers  Abstractions
Kokkos       C++ Abstraction  X         X          X             X             X            X            X          X
OpenMP       Directives       X         X          X             X             X            X            X          -
OpenACC      Directives       X         X          X             X             -            X            X          -
CUDA         Extension        (X)       -          (X)           X             -            X            X          -
OpenCL       Extension        (X)       -          (X)           X             -            X            X          -
C++AMP       Extension        X         -          X             -             -            X            X          -
Raja         C++ Abstraction  X         X          X             (X)           -            -            -          -
TBB          C++ Abstraction  X         X          X             X             X            X            -          -
C++17        Language         X         -          -             -             (X)          X            (X)        (X)
Fortran2008  Language         X         -          -             -             -            X            (X)        -

Example: Conjugate Gradient Solver

§ Simple Iterative Linear Solver
§ For example used in MiniFE
§ Uses only three math operations:
  § Vector addition (AXPBY)
  § Dot product (DOT)
  § Sparse Matrix Vector multiply (SPMV)
§ Data management with Kokkos Views:

    View<double*,HostSpace,MemoryTraits<Unmanaged> > h_x(x_in, nrows);
    View<double*> x("x",nrows);
    deep_copy(x,h_x);
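To show how the three operations fit together, here is a minimal serial sketch of an unpreconditioned conjugate gradient iteration in plain C++. It is not the MiniFE or Kokkos code: the driver `cg_solve`, its parameters `tol` and `max_iter`, and the use of `std::vector` are assumptions made for illustration, and the kernels are plain serial loops.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Serial reference kernels for the three operations named above.
void axpby(int n, double* z, double alpha, const double* x,
           double beta, const double* y) {
  for (int i = 0; i < n; i++) z[i] = alpha * x[i] + beta * y[i];
}

double dot(int n, const double* x, const double* y) {
  double sum = 0.0;
  for (int i = 0; i < n; i++) sum += x[i] * y[i];
  return sum;
}

// y = A*x for a matrix in compressed sparse row (CSR) form.
void spmv(int nrows, const int* row_offsets, const int* cols,
          const double* vals, double* y, const double* x) {
  for (int row = 0; row < nrows; ++row) {
    double sum = 0.0;
    for (int i = row_offsets[row]; i < row_offsets[row + 1]; ++i)
      sum += vals[i] * x[cols[i]];
    y[row] = sum;
  }
}

// Hypothetical driver: unpreconditioned CG for a symmetric positive definite
// CSR matrix. Returns the number of iterations performed; x holds the result.
int cg_solve(int n, const std::vector<int>& row_offsets,
             const std::vector<int>& cols, const std::vector<double>& vals,
             const std::vector<double>& b, std::vector<double>& x,
             double tol, int max_iter) {
  assert(static_cast<int>(row_offsets.size()) == n + 1);
  std::vector<double> r(n), p(n), Ap(n);
  x.assign(n, 0.0);
  r = b;  // x starts at zero, so the initial residual is b
  p = r;
  double rr = dot(n, r.data(), r.data());
  int iter = 0;
  while (iter < max_iter && std::sqrt(rr) > tol) {
    spmv(n, row_offsets.data(), cols.data(), vals.data(), Ap.data(), p.data());
    const double alpha = rr / dot(n, p.data(), Ap.data());
    axpby(n, x.data(), 1.0, x.data(), alpha, p.data());    // x = x + alpha*p
    axpby(n, r.data(), 1.0, r.data(), -alpha, Ap.data());  // r = r - alpha*A*p
    const double rr_new = dot(n, r.data(), r.data());
    axpby(n, p.data(), 1.0, r.data(), rr_new / rr, p.data());  // p = r + beta*p
    rr = rr_new;
    ++iter;
  }
  return iter;
}
```

Each iteration calls SPMV once, DOT twice, and AXPBY three times, which is why the performance of these three kernels largely determines the performance of the solver.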

CGSolve: The AXPBY

§ Simple data parallel loop: Kokkos::parallel_for
§ Easy to express in most programming models
§ Bandwidth bound
§ Serial Implementation:

    void axpby(int n, double* z, double alpha, const double* x,
               double beta, const double* y) {
      for(int i=0; i<n; i++)
        z[i] = alpha*x[i] + beta*y[i];
    }

§ Kokkos Implementation:

    void axpby(int n, View<double*> z, double alpha, View<const double*> x,
               double beta, View<const double*> y) {
      parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
        z(i) = alpha*x(i) + beta*y(i);
      });
    }

Annotations on the Kokkos implementation of AXPBY:

§ Parallel Pattern: for loop (parallel_for)
§ String Label: Profiling/Debugging ("AXpBY")
§ Execution Policy: do n iterations (n)
§ Iteration handle: integer index (const int& i)
§ Loop Body (z(i) = alpha*x(i) + beta*y(i);)
CGSolve: The Dot Product

§ Simple data parallel loop with reduction: Kokkos::parallel_reduce
§ Nontrivial in CUDA due to lack of built-in reduction support
§ Bandwidth bound
§ Serial Implementation:

    double dot(int n, const double* x, const double* y) {
      double sum = 0.0;
      for(int i=0; i<n; i++)
        sum += x[i]*y[i];
      return sum;
    }

§ Kokkos Implementation:

    double dot(int n, View<const double*> x, View<const double*> y) {
      double x_dot_y = 0.0;
      parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
        sum += x[i]*y[i];
      }, x_dot_y);
      return x_dot_y;
    }

Annotations on the Kokkos implementation of the dot product:

§ Parallel Pattern: loop with reduction (parallel_reduce)
§ Iteration Index + Thread-Local Reduction Variable (const int& i, double& sum)
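The thread-local reduction variable can be pictured as each worker accumulating into its own private partial sum, followed by a final combine. The sketch below emulates this serially in plain C++. The function name `dot_chunked` and the `nworkers` parameter are hypothetical; a real backend would run the chunks concurrently and may combine the partial sums in a different order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical serial emulation of a parallel reduction: each "worker" owns a
// private partial sum (the role played by `sum` in the lambda), and the
// partial sums are combined at the end.
double dot_chunked(const std::vector<double>& x, const std::vector<double>& y,
                   int nworkers) {
  assert(x.size() == y.size() && nworkers > 0);
  const int n = static_cast<int>(x.size());
  const int chunk = (n + nworkers - 1) / nworkers;  // ceil(n / nworkers)
  std::vector<double> partial(nworkers, 0.0);
  for (int w = 0; w < nworkers; ++w) {
    const int begin = std::min(n, w * chunk);
    const int end = std::min(n, begin + chunk);
    for (int i = begin; i < end; ++i) partial[w] += x[i] * y[i];
  }
  double result = 0.0;
  for (double p : partial) result += p;  // final combine step
  return result;
}
```

Because floating-point addition is not associative, different worker counts can give results that differ in the last bits for general input; this is expected for any parallel sum.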

CGSolve: The SPMV

§ Loop over rows
§ Dot product of matrix row with a vector
§ Example of Non-Tightly nested loops
§ Random access on the vector (Texture fetch on GPUs)

    void SPMV(int nrows, const int* A_row_offsets, const int* A_cols,
              const double* A_vals, double* y, const double* x) {
      for(int row=0; row<nrows; ++row) {
        double sum = 0.0;
        int row_start=A_row_offsets[row];
        int row_end=A_row_offsets[row+1];
        for(int i=row_start; i<row_end; ++i) {
          sum += A_vals[i]*x[A_cols[i]];
        }
        y[row] = sum;
      }
    }

Annotations on the serial SPMV:

§ Outer loop over matrix rows (the loop over row)
§ Inner dot product row x vector (the loop over i)

CGSolve: The SPMV (Kokkos implementation)

    void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols,
              View<const double*> A_vals, View<double*> y,
              View<const double*, MemoryTraits< RandomAccess>> x) {

    #ifdef KOKKOS_ENABLE_CUDA
      int rows_per_team = 64;
      int team_size = 64;
    #else
      int rows_per_team = 512;
      int team_size = 1;
    #endif

      parallel_for("SPMV:Hierarchy",
        TeamPolicy< Schedule< Static > >((nrows+rows_per_team-1)/rows_per_team,team_size,8),
        KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {

        const int first_row = team.league_rank()*rows_per_team;
        const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;

        parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {
          const int row_start=A_row_offsets[row];
          const int row_length=A_row_offsets[row+1]-row_start;

          double y_row;
          parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {
            sum += A_vals(i+row_start)*x(A_cols(i+row_start));
          } , y_row);
          y(row) = y_row;
        });
      });
    }

Annotations on the Kokkos implementation of SPMV:

§ Team Parallelism over Row Worksets (outer parallel_for with TeamPolicy)
§ Distribute rows in workset over team-threads (parallel_for over TeamThreadRange)
§ Row x Vector dot product (parallel_reduce over ThreadVectorRange)
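The index arithmetic of the hierarchical version can be checked in isolation. The sketch below emulates the workset decomposition serially in plain C++: the league size is the number of row worksets, each workset covers `rows_per_team` rows, and the last one is clipped to `nrows`. `spmv_worksets` is a hypothetical name, and the three nested loops merely stand in for the three parallel levels; they run sequentially here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical serial emulation of the workset decomposition used above.
std::vector<double> spmv_worksets(int nrows, const std::vector<int>& row_offsets,
                                  const std::vector<int>& cols,
                                  const std::vector<double>& vals,
                                  const std::vector<double>& x,
                                  int rows_per_team) {
  assert(rows_per_team > 0);
  assert(static_cast<int>(row_offsets.size()) == nrows + 1);
  std::vector<double> y(nrows, 0.0);
  // Number of worksets, rounded up so a partial last workset is included.
  const int league_size = (nrows + rows_per_team - 1) / rows_per_team;
  for (int league_rank = 0; league_rank < league_size; ++league_rank) {
    const int first_row = league_rank * rows_per_team;
    const int last_row = std::min(first_row + rows_per_team, nrows);
    // Stands in for the TeamThreadRange loop over the rows of this workset.
    for (int row = first_row; row < last_row; ++row) {
      const int row_start = row_offsets[row];
      const int row_length = row_offsets[row + 1] - row_start;
      double y_row = 0.0;
      // Stands in for the ThreadVectorRange reduction over one row.
      for (int i = 0; i < row_length; ++i)
        y_row += vals[i + row_start] * x[cols[i + row_start]];
      y[row] = y_row;
    }
  }
  return y;
}
```

The result does not depend on `rows_per_team`; that parameter only changes how rows are grouped, which is why the slide can pick 64 for CUDA and 512 otherwise without touching the loop body.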

CGSolve: Performance

§ Comparison with other Programming Models
§ Straightforward implementation of kernels
§ OpenMP 4.5 is immature at this point
§ Two problem sizes: 100x100x100 and 200x200x200 elements

[Figure: Performance [Gflop/s] for AXPBY-100, AXPBY-200, DOT-100, DOT-200, SPMV-100, SPMV-200. Panel "NVIDIA P100 / IBM Power8": OpenACC, CUDA, Kokkos, OpenMP. Panel "Intel KNL 7250": OpenACC, Kokkos, OpenMP, TBB (Flat), TBB (Hierarchical).]

Custom Reductions With Lambdas

§ Added Capability to do Custom Reductions with Lambdas
§ Provide built-in reducers for common operations
  § Add, Min, Max, Prod, MinLoc, MaxLoc, MinMaxLoc, And, Or, Xor,
§ Users can implement their own reducers
§ Example Max reduction:

    double result;
    parallel_reduce(N,
      KOKKOS_LAMBDA(const int& i, double& lmax) {
        if(lmax < a(i)) lmax = a(i);
      },Max<double>(result));

[Figure: Reduction Performance, Bandwidth in GB/s on K80, P100 and KNL for Sum, Max, MaxLoc.]
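What a reducer has to supply can be sketched without Kokkos: an identity value that initializes each private partial result, and a join operation that combines two partial results. The plain C++ sketch below does this for a MaxLoc-style reduction (largest value and its index). The names `ValLoc`, `MaxLocSketch`, `init`, `join` and `reduce_serial` are illustrative assumptions and do not reproduce the actual Kokkos reducer interface.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Illustrative types only; not the Kokkos reducer interface.
struct ValLoc {
  double val;
  int loc;
};

struct MaxLocSketch {
  // Identity element: joining it with any real entry returns that entry.
  static ValLoc init() { return {-std::numeric_limits<double>::infinity(), -1}; }
  // Combine two partial results, keeping the one with the larger value.
  static ValLoc join(const ValLoc& a, const ValLoc& b) {
    return (b.val > a.val) ? b : a;
  }
};

// Serial driver standing in for a parallel reduction over a.
template <class Reducer>
ValLoc reduce_serial(const std::vector<double>& a) {
  ValLoc result = Reducer::init();
  for (int i = 0; i < static_cast<int>(a.size()); ++i)
    result = Reducer::join(result, ValLoc{a[i], i});
  return result;
}
```

In this serial sketch a tie is resolved in favor of the earlier index. A parallel implementation may join partial results in a different order, so tie-breaking should be made explicit in `join` if it matters.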

New Features: MDRangePolicy

§ Many people perform structured grid calculations
  § Sandia’s codes are predominantly unstructured though
§ MDRangePolicy introduced for tightly nested loops
  § Use case corresponding to OpenMP collapse clause
§ Optionally set iteration order and tiling:

    void launch (int N0, int N1, int N2, [ARGS]) {
      parallel_for(MDRangePolicy<Rank<3>>({0,0,0},{N0,N1,N2}),
        KOKKOS_LAMBDA (int i0, int i1, int i2) {/*...*/});
    }

    MDRangePolicy<Rank<3,Iterate::Right,Iterate::Left>> ({0,0,0},{N0,N1,N2},{T0,T1,T2})

[Figure: Albany Kernel, Gflop/s on KNL and P100 for Naïve, Best Raw, MDRange.]
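The effect of tiling on iteration order can be sketched in plain C++ for a 2D range. The outer two loops walk tiles of size T0 x T1, the inner two walk the indices inside one tile, and tiles at the upper boundary are clipped. `for_each_tiled` and `visit_counts` are hypothetical helpers, not Kokkos API, and the loop order shown is one possible choice that does not claim to match what MDRangePolicy does internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: visit every (i0, i1) in [0,N0) x [0,N1) tile by tile.
template <class Functor>
void for_each_tiled(int N0, int N1, int T0, int T1, Functor f) {
  assert(T0 > 0 && T1 > 0);
  for (int t0 = 0; t0 < N0; t0 += T0)
    for (int t1 = 0; t1 < N1; t1 += T1)
      // Clip the tile at the upper bounds of the range.
      for (int i0 = t0; i0 < std::min(t0 + T0, N0); ++i0)
        for (int i1 = t1; i1 < std::min(t1 + T1, N1); ++i1)
          f(i0, i1);
}

// Count how often each index pair is visited; every entry should end up 1.
std::vector<int> visit_counts(int N0, int N1, int T0, int T1) {
  std::vector<int> count(static_cast<std::size_t>(N0) * N1, 0);
  for_each_tiled(N0, N1, T0, T1,
                 [&](int i0, int i1) { count[i0 * N1 + i1] += 1; });
  return count;
}
```

Tiling changes only the order in which index pairs are visited, not the set of pairs, so a kernel whose iterations are independent gives the same result for any tile size. The tile size is then purely a performance parameter.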

New Features: Task Graphs

§ Task dags are an important class of algorithmic structures
§ Used for algorithms with complex data dependencies
  § For example triangular solve
§ Kokkos tasking is on-node
§ Future based, not explicit data centric (as for example Legion)
  § Tasks return futures
  § Tasks can depend on futures
§ Respawn of tasks possible
§ Tasks can spawn other tasks
§ Tasks can have data parallel loops
  § I.e. a task can utilize multiple threads like the hyperthreads on a core or a CUDA block

Carter Edwards, S7253 “Task Data Parallelism”, right after this talk.

New Features: Pascal Support

§ Pascal GPUs Provide a set of new Capabilities
  § Much better memory subsystem
  § NVLink (next slide)
  § Hardware support for double precision atomic add
§ Generally significant speedup 3-5x over K80 for Sandia Apps

[Figure: LAMMPS Tersoff Potential, Million Atomsteps per Second vs. Number of Atoms (4000 to 2048000) for K40, K80, P100, K40 Atomic, K80 Atomic, P100 Atomic.]

[Figure: Atomic Bandwidth, Bandwidth in GB/s vs. Update Array Size (1 to 4096).]

New Features: HBM Support

§ New architectures with HBM: Intel KNL, NVIDIA P100
§ Generally three types of allocations:
  § Page pinned in DDR
  § Page pinned in HBM
  § Page migratable by OS or hardware caching
§ Kokkos supports all three on both architectures
  § For Cuda backend: CudaHostPinnedSpace, CudaSpace and CudaUVMSpace
  § E.g.: Kokkos::View<double*,CudaUVMSpace> a("A",N);
§ P100 Bandwidth with and without data Reuse

[Figure: Bandwidth GB/s vs. Problem Size [GB] (0.125 to 64) for HostPinned 1, HostPinned 32, UVM 1, UVM 32.]

Upcoming Features

§ Support for OpenMP 4.5+ Target Backend
  § Experimentally available on github
  § CUDA will stay preferred backend
  § Maybe support for FPGAs in the future?
  § Help maturing OpenMP 4.5+ compilers
§ Support for AMD ROCm Backend
  § Experimentally available on github
  § Mainly developed by AMD
  § Support for APUs and discrete GPUs
  § Expect maturation fall 2017

Beyond Capabilities

§ Using Kokkos is invasive, you can’t just swap it out
  § Significant part of data structures need to be taken over
  § Function markings everywhere
§ It is not enough to have initial Capabilities
  § Robustness comes from utilization and experience
  § Different types of application and coding styles will expose different corner cases
§ Applications need libraries
  § Interaction with TPLs such as BLAS must work
  § Many library capabilities must be reimplemented in the programming model
§ Applications need tools
  § Profiling and Debugging capabilities are required for serious work

Timeline

§ Initial Kokkos: Linear Algebra for Trilinos
§ Restart of Kokkos: Scope now Programming Model
§ Mantevo MiniApps: Compare Kokkos to other Models
§ LAMMPS: Demonstrate Legacy App Transition
§ Trilinos: Move Tpetra over to use Kokkos Views
§ Multiple Apps start exploring (Albany, Uintah, …)
§ Sandia Multiday Tutorial (~80 attendees)
§ Sandia Decision to prefer Kokkos over other models
§ Github Release of Kokkos 2.0
§ Kokkos-Kernels and Kokkos-Tools Release
§ DOE Exascale Computing Project starts

[Figure: the events above placed along a time axis marked 2008, 2011, 2012, 2013, 2014, 2015, 2016, 2017.]

Initial Demonstrations (2012-2015)

§ Demonstrate Feasibility of Performance Portability
  § Development of a number of MiniApps from different science domains
§ Demonstrate Low Performance Loss versus Native Models
  § MiniApps are implemented in various programming models
§ DOE TriLab Collaboration
  § Show Kokkos works for other labs’ apps
  § Note this is historical data: improvements were found, RAJA implemented similar optimization etc.

[Figure: LULESH Figure of Merit Results (Problem 60), FOM (Z/s), higher is better, for HSW 1x16, HSW 1x32, P8 1x40 XL, KNC 1x224, ARM64 1x8, NV K40. Results by Dennis Dinge, Christian Trott and Si Hammond.]

Training the User-Base

§ Typical Legacy Application Developer
  § Science Background
  § Mostly Serial Coding (MPI apps usually have a communication layer few people touch)
  § Little hardware background, little parallel programming experience
§ Not sufficient to teach Programming Model Syntax
  § Need training in parallel programming techniques
  § Teach fundamental hardware knowledge (how do CPU, MIC and GPU differ, and what does it mean for my code)
  § Need training in performance profiling
§ Regular Kokkos Tutorials
  § ~200 slides, 9 hands-on exercises to teach parallel programming techniques, performance considerations and Kokkos
  § Now dedicated ECP Kokkos support project: develop online support community
  § ~200 HPC developers (mostly from DOE labs) had Kokkos training so far

Keeping Applications Happy

§ Never underestimate developers’ ability to find new corner cases!!
  § Having a Programming Model deployed in MiniApps or a single big app is very different from having half a dozen multi-million line code customers.
  § 538 Issues in 24 months
  § 28% are small enhancements
  § 18% bigger feature requests
  § 24% are bugs: often corner cases
§ Example: Subviews
  § Initially data type needed to match including compile time dimensions
  § Allow compile/runtime conversion
  § Allow Layout conversion if possible
  § Automatically find best layout
  § Add subview patterns

[Figure: Issues since 2015, number of issues (0 to 600) by category: Enhancements, Feature Request, Bug, Compiler Issue, Questions, Other.]

Testing and Software Quality

§ Compilers: GCC (4.8-6.3), Clang (3.6-4.0), Intel (15.0-18.0), IBM (13.1.5, 14.0), PGI (17.3), NVIDIA (7.0-8.0)
§ Hardware: Intel Haswell, Intel KNL, ARMv8, IBM Power8, NVIDIA K80, NVIDIA P100
§ Backends: OpenMP, Pthreads, Serial, CUDA
§ Warnings as Errors

[Figure: branch workflow with the labels Feature Development / Tests, Review, Tests, Develop, Nightly, Unit Tests, Integration (Trilinos, LAMMPS, …; Multiconfig integration test), Master, Release Version.]

New Features are developed on forks and branches. A limited number of developers can push to the develop branch. Pull requests are reviewed/tested.

Each merge into master is a minor release. An extensive integration test suite ensures backward compatibility and catches unit-test coverage gaps.

Building an EcoSystem

§ Applications, MiniApps
§ Trilinos (Linear Solvers, Load Balancing, Discretization, Distributed Linear Algebra)
§ Kokkos – Kernels (Sparse/Dense BLAS, Graph Kernels, Tensor Kernels)
§ Algorithms (Random, Sort)
§ Containers (Map, CrsGraph, Mem Pool)
§ Kokkos (Parallel Execution, Data Allocation, Data Transfer)
§ std::thread, OpenMP, CUDA, ROCm
§ Kokkos – Tools (Kokkos aware Profiling and Debugging Tools)
§ Kokkos – Support Community (Application Support, Developer Training)


Kokkos Tools

§ Utilities
  § KernelFilter: Enable/Disable Profiling for a selection of Kernels
§ Kernel Inspection
  § KernelLogger: Runtime information about entering/leaving Kernels and Regions
  § KernelTimer: Postprocessing information about Kernel and Region Times
§ Memory Analysis
  § MemoryHighWater: Maximum Memory Footprint over whole run
  § MemoryUsage: Per Memory Space Utilization Timeline
  § MemoryEvents: Per Memory Space Allocation and Deallocations
§ Third Party Connector
  § VTune Connector: Mark Kernels as Frames inside of VTune
  § VTune Focused Connector: Mark Kernels as Frames + start/stop profiling

https://github.com/kokkos/kokkos-tools
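The KernelLogger idea can be sketched as a small set of C-linkage callbacks that record when kernels are entered and left. This is a sketch under assumptions, not the tool's source: the `kokkosp_*` hook names and signatures follow the convention used in the kokkos-tools repository and should be checked against it before use, while `g_log`, `g_next_kernel_id`, and `logged_lines()` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory log; a real tool would more likely print or write a file.
static std::vector<std::string> g_log;
static std::uint64_t g_next_kernel_id = 0;

// Assumed hook signatures (verify against the kokkos-tools repository).
extern "C" void kokkosp_init_library(const int /*loadSeq*/,
                                     const std::uint64_t /*interfaceVer*/,
                                     const std::uint32_t /*devInfoCount*/,
                                     void* /*deviceInfo*/) {
  g_log.clear();
  g_next_kernel_id = 0;
  g_log.push_back("init");
}

extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const std::uint32_t /*devID*/,
                                           std::uint64_t* kID) {
  assert(name != nullptr && kID != nullptr);
  *kID = g_next_kernel_id++;  // handle passed back to the matching end hook
  g_log.push_back("enter " + std::string(name));
}

extern "C" void kokkosp_end_parallel_for(const std::uint64_t kID) {
  g_log.push_back("leave kernel " + std::to_string(kID));
}

extern "C" void kokkosp_finalize_library() { g_log.push_back("finalize"); }

// Hypothetical accessor used only to inspect the sketch's output.
const std::vector<std::string>& logged_lines() { return g_log; }
```

Because the hooks are plain C symbols, a file like this can be built as a shared library independently of the application, which is what allows tools to be swapped without recompiling.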


Kokkos-Tools: Example MemUsage
§ Tools are loaded at runtime
  § Profile actual release builds of applications
  § Set via: export KOKKOS_PROFILE_LIBRARY=[PATH_TO_PROFILING_LIB]
§ Output depends on tool
  § Often per process file
§ MemoryUsage provides per MemorySpace utilization timelines
  § Time starts with Kokkos::initialize
  § HOSTNAME-PROCESSID-CudaUVM.memspace_usage


# Space CudaUVM
# Time(s) Size(MB) HighWater(MB) HighWater-Process(MB)
0.317260 38.1 38.1 81.8
0.377285 0.0 38.1 158.1
0.384785 38.1 38.1 158.1
0.441988 0.0 38.1 158.1
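The relationship between the Size and HighWater columns can be sketched with a running counter per memory space. This is an illustration only, not the MemoryUsage tool's code; `SpaceUsage` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-memory-space counter; not the MemoryUsage tool's actual code.
struct SpaceUsage {
  std::uint64_t current = 0;     // bytes currently allocated in this space
  std::uint64_t high_water = 0;  // largest value `current` has reached

  void allocate(std::uint64_t bytes) {
    current += bytes;
    if (current > high_water) high_water = current;
  }

  void deallocate(std::uint64_t bytes) {
    assert(bytes <= current);  // cannot free more than is allocated
    current -= bytes;          // the high-water mark never decreases
  }
};
```

Replaying the pattern in the timeline above (allocate a buffer, free it, allocate the same size again, free it) leaves the high-water mark at the size of one buffer, consistent with the constant HighWater column while Size alternates.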


Kokkos-Kernels

§ Provide BLAS (1,2,3); Sparse; Graph and Tensor Kernels
§ No required dependencies other than Kokkos
§ Local kernels (no MPI)
§ Hooks in TPLs such as MKL or cuBLAS/cuSparse where applicable
§ Provide kernels for all levels of hierarchical parallelism:
  § Global Kernels: use all execution resources available
  § Team Level Kernels: use a subset of threads for execution
  § Thread Level Kernels: utilize vectorization inside the kernel
  § Serial Kernels: provide elemental functions (OpenMP declare SIMD)
§ Work started based on customer priorities; expect multi-year effort for broad coverage
§ People: Many developers from Trilinos contribute
  § Consolidate node level reusable kernels previously distributed over multiple packages
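The "Serial Kernels" level, elemental functions marked with OpenMP `declare simd`, can be illustrated in plain C++. This is a minimal sketch rather than Kokkos-Kernels code: `axpy_elem` and `axpy` are hypothetical names. The pragmas take effect only when the compiler's OpenMP SIMD support is enabled; without it they are ignored and the code remains valid serial C++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Elemental function: operates on one element, so the compiler may
// generate a vectorized variant when OpenMP SIMD support is enabled.
#pragma omp declare simd
inline double axpy_elem(double a, double x, double y) { return a * x + y; }

// Hypothetical caller: applies the elemental function across a range.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  const std::size_t n = y.size();
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) y[i] = axpy_elem(a, x[i], y[i]);
}
```

The higher levels in the list (thread, team, global) would call such a function from progressively wider parallel loops; only the innermost, per-element piece is shown here.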


Kokkos-Kernels: Dense BLAS Example

[Figure: Batched DGEMM, 32k blocks. Performance (GFlop/s) vs. Matrix Block Size (3, 5, 10, 15). Series: MKL @ KNL, KK @ KNL, CUBLAS @ P100, KK @ P100.]

[Figure: Batched TRSM, 32k blocks. Performance (GFlop/s) vs. Matrix Block Size (3, 5, 10, 15). Series: MKL @ KNL, KK @ KNL, CUBLAS @ P100, KK @ P100.]

§ Batched small matrices using an interleaved memory layout
§ Matrix sizes based on common physics problems: 3, 5, 10, 15
§ 32k small matrices
§ Vendor libraries get better for more and larger matrices
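One way to picture an interleaved memory layout for batched small matrices is to make the batch index the fastest-varying one, so that the same entry of consecutive matrices sits next to each other in memory. The following is a minimal sketch under that assumption, not the Kokkos-Kernels implementation; `idx` and `batched_gemm` are hypothetical names, and the exact interleaving used in the benchmark is not stated in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved storage: entry (i, j) of matrix b lives at (i*n + j)*batch + b,
// so the same entry of consecutive matrices is contiguous in memory.
inline std::size_t idx(std::size_t b, std::size_t i, std::size_t j,
                       std::size_t n, std::size_t batch) {
  return (i * n + j) * batch + b;
}

// C[b] += A[b] * B[b] for `batch` independent n x n matrices.
void batched_gemm(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n, std::size_t batch) {
  assert(A.size() == n * n * batch);
  assert(B.size() == A.size() && C.size() == A.size());
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        // Innermost loop runs over the batch with unit stride,
        // which is the loop a compiler can vectorize.
        for (std::size_t b = 0; b < batch; ++b)
          C[idx(b, i, j, n, batch)] +=
              A[idx(b, i, k, n, batch)] * B[idx(b, k, j, n, batch)];
}
```

With matrices as small as 3 to 15 rows, vectorizing within one matrix leaves vector lanes idle; vectorizing across the batch keeps them full regardless of the block size.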


Kokkos Users Spread

§ Users from a dozen major institutions
§ More than two dozen applications/libraries
  § Including many multi-million-line projects

[Figure: user institutions. Recoverable labels: Users; Backend Optimizations.]


Further Material
§ https://github.com/kokkos Kokkos Github Organization
  § Kokkos: Core library, Containers, Algorithms
  § Kokkos-Kernels: Sparse and Dense BLAS, Graph, Tensor (under development)
  § Kokkos-Tools: Profiling and Debugging
  § Kokkos-MiniApps: MiniApp repository and links
  § Kokkos-Tutorials: Extensive Tutorials with Hands-On Exercises
§ https://cs.sandia.gov Publications (search for 'Kokkos')
  § Many Presentations on Kokkos and its use in libraries and apps
§ Talks at this GTC:
  § Carter Edwards, S7253 “Task Data Parallelism”, Today 10:00, 211B
  § Ramanan Sankaran, S7561 “High Pres. Reacting Flows”, Today 1:30, 212B
  § Pierre Kestener, S7166 “High Res. Fluid Dynamics”, Today 14:30, 212B
  § Michael Carilli, S7148 “Liquid Rocket Simulations”, Today 16:00, 212B
  § Panel, S7564 “Accelerator Programming Ecosystems”, Tuesday 16:00, Ball3
  § Training Lab, L7107 “Kokkos, Manycore PP”, Wednesday 16:00, LL21E


http://www.github.com/kokkos