Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.
Kokkos: The C++ Performance Portability Programming Model

Christian Trott ([email protected]), Carter Edwards, D. Sunderland, N. Ellingwood, D. Ibanez, S. Hammond, S. Rajamanickam, K. Kim, M. Deveci, M. Hoemmen, G.

Center for Computing Research, Sandia National Laboratories, NM    SAND2017-4935C
New Programming Models

§ HPC is at a crossroads
  § Diversifying hardware architectures
  § More parallelism necessitates a paradigm shift from MPI-only
§ Need for new programming models
  § Performance portability: OpenMP 4.5, OpenACC, Kokkos, RAJA, SyCL, C++20?, …
  § Resilience and load balancing: Legion, HPX, UPC++, ...
§ Vendor decoupling drives external development
2
What is Kokkos? What is new?

Why should you trust us?

Kokkos: Performance, Portability and Productivity
[Diagram: the Kokkos layer sits between applications (LAMMPS, Sierra, Albany, Trilinos) and diverse node architectures with DDR and HBM memories]
https://github.com/kokkos
Performance Portability through Abstraction

Kokkos: Separating Concerns for Future Systems

Parallel Execution
§ Execution Spaces ("Where")
  - N-level
  - Support heterogeneous execution
§ Execution Patterns ("What")
  - parallel_for/reduce/scan, task spawn
  - Enable nesting
§ Execution Policies ("How")
  - Range, Team, Task-DAG
  - Dynamic / static scheduling
  - Support non-persistent scratch-pads

Data Structures
§ Memory Spaces ("Where")
  - Multiple levels
  - Logical space (think UVM vs. explicit)
§ Memory Layouts ("How")
  - Architecture-dependent index maps
  - Also needed for subviews
§ Memory Traits
  - Access intent: Stream, Random, …
  - Access behavior: Atomic
  - Enables special load paths, e.g. texture
Capability Matrix

6

                           Par.   Par.  Tightly  Non-tightly  Task   Data    Data   Adv. Data
Implementation  Technique  Loops  Red.  Nested   Nested       Par.   Alloc.  Xfers  Abstractions
Kokkos      C++ Abstraction  X      X     X        X            X      X       X      X
OpenMP      Directives       X      X     X        X            X      X       X      -
OpenACC     Directives       X      X     X        X            -      X       X      -
CUDA        Extension       (X)     -    (X)       X            -      X       X      -
OpenCL      Extension       (X)     -    (X)       X            -      X       X      -
C++AMP      Extension        X      -     X        -            -      X       X      -
Raja        C++ Abstraction  X      X     X       (X)           -      -       -      -
TBB         C++ Abstraction  X      X     X        X            X      X       -      -
C++17       Language         X      -     -        -           (X)     X      (X)    (X)
Fortran2008 Language         X      -     -        -            -      X      (X)     -
Example: Conjugate Gradient Solver

§ Simple iterative linear solver
  § For example used in MiniFE
§ Uses only three math operations:
  § Vector addition (AXPBY)
  § Dot product (DOT)
  § Sparse matrix-vector multiply (SPMV)
§ Data management with Kokkos Views:

7

View<double*, HostSpace, MemoryTraits<Unmanaged>> h_x(x_in, nrows);
View<double*> x("x", nrows);
deep_copy(x, h_x);
CGSolve: The AXPBY

8

§ Simple data-parallel loop: Kokkos::parallel_for
§ Easy to express in most programming models
§ Bandwidth bound

§ Serial implementation:

void axpby(int n, double* z, double alpha, const double* x,
           double beta, const double* y) {
  for(int i=0; i<n; i++)
    z[i] = alpha*x[i] + beta*y[i];
}

§ Kokkos implementation:

void axpby(int n, View<double*> z, double alpha, View<const double*> x,
           double beta, View<const double*> y) {
  parallel_for("AXpBY", n, KOKKOS_LAMBDA (const int& i) {
    z(i) = alpha*x(i) + beta*y(i);
  });
}
Anatomy of the Kokkos implementation:
§ Parallel pattern: for loop
§ String label "AXpBY": profiling/debugging
§ Execution policy: do n iterations
§ Iteration handle: integer index
§ Loop body: z(i) = alpha*x(i) + beta*y(i)
CGSolve: The Dot Product

15

§ Simple data-parallel loop with reduction: Kokkos::parallel_reduce
§ Nontrivial in CUDA due to lack of built-in reduction support
§ Bandwidth bound

§ Serial implementation:

double dot(int n, const double* x, const double* y) {
  double sum = 0.0;
  for(int i=0; i<n; i++)
    sum += x[i]*y[i];
  return sum;
}

§ Kokkos implementation:

double dot(int n, View<const double*> x, View<const double*> y) {
  double x_dot_y = 0.0;
  parallel_reduce("Dot", n, KOKKOS_LAMBDA (const int& i, double& sum) {
    sum += x[i]*y[i];
  }, x_dot_y);
  return x_dot_y;
}
Anatomy of the Kokkos implementation:
§ Parallel pattern: loop with reduction
§ Iteration index + thread-local reduction variable (sum)
CGSolve: The SPMV

§ Loop over rows
§ Dot product of matrix row with a vector
§ Example of non-tightly nested loops
§ Random access on the vector (texture fetch on GPUs)

19

void SPMV(int nrows, const int* A_row_offsets, const int* A_cols,
          const double* A_vals, double* y, const double* x) {
  for(int row=0; row<nrows; ++row) {
    double sum = 0.0;
    int row_start = A_row_offsets[row];
    int row_end   = A_row_offsets[row+1];
    for(int i=row_start; i<row_end; ++i) {
      sum += A_vals[i]*x[A_cols[i]];
    }
    y[row] = sum;
  }
}
Anatomy of the serial implementation:
§ Outer loop over matrix rows
§ Inner dot product of row x vector
CGSolve: The SPMV

23

void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols,
          View<const double*> A_vals, View<double*> y,
          View<const double*, MemoryTraits<RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDA
  int rows_per_team = 64;
  int team_size = 64;
#else
  int rows_per_team = 512;
  int team_size = 1;
#endif
  parallel_for("SPMV:Hierarchy",
    TeamPolicy<Schedule<Static>>((nrows+rows_per_team-1)/rows_per_team, team_size, 8),
    KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
      const int first_row = team.league_rank()*rows_per_team;
      const int last_row  = first_row+rows_per_team < nrows ?
                            first_row+rows_per_team : nrows;
      parallel_for(TeamThreadRange(team, first_row, last_row), [&] (const int row) {
        const int row_start  = A_row_offsets[row];
        const int row_length = A_row_offsets[row+1] - row_start;
        double y_row;
        parallel_reduce(ThreadVectorRange(team, row_length),
          [=] (const int i, double& sum) {
            sum += A_vals(i+row_start)*x(A_cols(i+row_start));
          }, y_row);
        y(row) = y_row;
      });
    });
}
Anatomy of the Kokkos implementation:
§ Team parallelism over row worksets
§ Distribute rows in a workset over team threads
§ Row x vector dot product over vector lanes
CGSolve: Performance

28

[Charts: performance in Gflop/s for AXPBY, DOT and SPMV at problem sizes 100 and 200. Left: NVIDIA P100 / IBM Power8 (OpenACC, CUDA, Kokkos, OpenMP). Right: Intel KNL 7250 (OpenACC, Kokkos, OpenMP, TBB flat, TBB hierarchical).]
§ Comparison with other programming models
§ Straightforward implementation of kernels
§ OpenMP 4.5 is immature at this point
§ Two problem sizes: 100x100x100 and 200x200x200 elements
Custom Reductions With Lambdas

§ Added capability to do custom reductions with lambdas
§ Provide built-in reducers for common operations
  § Add, Min, Max, Prod, MinLoc, MaxLoc, MinMaxLoc, And, Or, Xor, …
§ Users can implement their own reducers
§ Example Max reduction:

29

double result;
parallel_reduce(N, KOKKOS_LAMBDA (const int& i, double& lmax) {
  if(lmax < a(i)) lmax = a(i);
}, Max<double>(result));
[Chart: reduction performance (bandwidth in GB/s) for Sum, Max and MaxLoc reductions on K80, P100 and KNL.]
New Features: MDRangePolicy

§ Many people perform structured grid calculations
  § Sandia's codes are predominantly unstructured though
§ MDRangePolicy introduced for tightly nested loops
  § Use case corresponding to the OpenMP collapse clause
§ Optionally set iteration order and tiling:

30

void launch(int N0, int N1, [ARGS]) {
  parallel_for(MDRangePolicy<Rank<3>>({0,0,0},{N0,N1,N2}),
    KOKKOS_LAMBDA (int i0, int i1, int i2) { /*...*/ });
}

MDRangePolicy<Rank<3,Iterate::Right,Iterate::Left>>
  ({0,0,0}, {N0,N1,N2}, {T0,T1,T2})
[Chart: Albany kernel performance (Gflop/s) on KNL and P100: naïve, best raw, MDRange.]
New Features: Task Graphs

§ Task DAGs are an important class of algorithmic structures
§ Used for algorithms with complex data dependencies
  § For example triangular solve
§ Kokkos tasking is on-node
  § Future based, not explicit data centric (as, for example, Legion)
§ Tasks return futures
§ Tasks can depend on futures
§ Respawn of tasks possible
§ Tasks can spawn other tasks
§ Tasks can have data-parallel loops
  § I.e. a task can utilize multiple threads, like the hyperthreads on a core or a CUDA block

31

Carter Edwards, S7253 "Task Data Parallelism", right after this talk.
New Features: Pascal Support

§ Pascal GPUs provide a set of new capabilities
  § Much better memory subsystem
  § NVLink (next slide)
  § Hardware support for double-precision atomic add
§ Generally significant speedup of 3-5x over K80 for Sandia apps

32
[Chart: LAMMPS Tersoff potential, million atom-steps per second vs. number of atoms (4000 to 2048000) on K40, K80 and P100, with and without atomics.]

[Chart: atomic bandwidth in GB/s vs. update array size (1 to 4096).]
New Features: HBM Support

§ New architectures with HBM: Intel KNL, NVIDIA P100
§ Generally three types of allocations:
  § Page pinned in DDR
  § Page pinned in HBM
  § Page migratable by OS or hardware caching
§ Kokkos supports all three on both architectures
  § For the Cuda backend: CudaHostPinnedSpace, CudaSpace and CudaUVMSpace
  § E.g.: Kokkos::View<double*, CudaUVMSpace> a("A", N);
§ P100 bandwidth with and without data reuse

33
[Chart: bandwidth in GB/s vs. problem size (0.125 to 64 GB) for HostPinned and UVM allocations with 1 and 32 data reuses.]
Upcoming Features

§ Support for OpenMP 4.5+ target backend
  § Experimentally available on github
  § CUDA will stay the preferred backend
  § Maybe support for FPGAs in the future?
  § Help maturing OpenMP 4.5+ compilers
§ Support for AMD ROCm backend
  § Experimentally available on github
  § Mainly developed by AMD
  § Support for APUs and discrete GPUs
  § Expect maturation fall 2017
34
Beyond Capabilities

§ Using Kokkos is invasive; you can't just swap it out
  § A significant part of the data structures needs to be taken over
  § Function markings everywhere
§ It is not enough to have initial capabilities
  § Robustness comes from utilization and experience
  § Different types of applications and coding styles will expose different corner cases
§ Applications need libraries
  § Interaction with TPLs such as BLAS must work
  § Many library capabilities must be reimplemented in the programming model
§ Applications need tools
  § Profiling and debugging capabilities are required for serious work
35
Timeline

36

2008-2017:
§ Initial Kokkos: linear algebra for Trilinos
§ Restart of Kokkos: scope now a programming model
§ Mantevo MiniApps: compare Kokkos to other models
§ LAMMPS: demonstrate legacy app transition
§ Trilinos: move Tpetra over to use Kokkos Views
§ Multiple apps start exploring (Albany, Uintah, …)
§ Sandia multiday tutorial (~80 attendees)
§ Sandia decision to prefer Kokkos over other models
§ Github release of Kokkos 2.0
§ Kokkos-Kernels and Kokkos-Tools release
§ DOE Exascale Computing Project starts
[Chart: LULESH Figure of Merit results (Problem 60), FOM (Z/s), for HSW 1x16, HSW 1x32, P8 1x40 XL, KNC 1x224, ARM64 1x8 and NV K40; higher is better. Results by Dennis Dinge, Christian Trott and Si Hammond.]
Initial Demonstrations (2012-2015)

37

§ Demonstrate feasibility of performance portability
  § Development of a number of MiniApps from different science domains
§ Demonstrate low performance loss versus native models
  § MiniApps are implemented in various programming models
§ DOE Tri-Lab collaboration
  § Show Kokkos works for other labs' apps
  § Note this is historical data: improvements were found, RAJA implemented similar optimizations, etc.
Training the User Base

§ Typical legacy application developer
  § Science background
  § Mostly serial coding (MPI apps usually have a communication layer few people touch)
  § Little hardware background, little parallel programming experience
§ Not sufficient to teach programming model syntax
  § Need training in parallel programming techniques
  § Teach fundamental hardware knowledge (how do CPU, MIC and GPU differ, and what does it mean for my code)
  § Need training in performance profiling
§ Regular Kokkos tutorials
  § ~200 slides, 9 hands-on exercises to teach parallel programming techniques, performance considerations and Kokkos
  § Now a dedicated ECP Kokkos support project: develop an online support community
  § ~200 HPC developers (mostly from DOE labs) have had Kokkos training so far

38
Keeping Applications Happy

§ Never underestimate developers' ability to find new corner cases!!
  § Having a programming model deployed in MiniApps or a single big app is very different from having half a dozen multi-million line code customers.
§ 538 issues in 24 months
  § 28% are small enhancements
  § 18% bigger feature requests
  § 24% are bugs: often corner cases

39

[Chart: issues since 2015 by category: enhancements, feature requests, bugs, compiler issues, questions, other.]

§ Example: subviews
  § Initially the data type needed to match, including compile-time dimensions
  § Allow compile/runtime conversion
  § Allow layout conversion if possible
  § Automatically find best layout
  § Add subview patterns
Testing and Software Quality

Compilers: GCC (4.8-6.3), Clang (3.6-4.0), Intel (15.0-18.0), IBM (13.1.5, 14.0), PGI (17.3), NVIDIA (7.0-8.0)
Hardware:  Intel Haswell, Intel KNL, ARMv8, IBM Power8, NVIDIA K80, NVIDIA P100
Backends:  OpenMP, Pthreads, Serial, CUDA
Warnings treated as errors

Workflow: feature development/tests → review → tests → develop → integration → master → release version

New features are developed on forks and branches. A limited number of developers can push to the develop branch. Pull requests are reviewed/tested (nightly, unit tests).

Each merge into master is a minor release. An extensive integration test suite (Trilinos, LAMMPS, … multi-config integration tests) ensures backward compatibility and catches unit-test coverage gaps.
Building an EcoSystem

41

[Diagram: the Kokkos ecosystem stack]
§ Applications and MiniApps
§ Trilinos (linear solvers, load balancing, discretization, distributed linear algebra)
§ Kokkos-Kernels (sparse/dense BLAS, graph kernels, tensor kernels)
§ Algorithms (Random, Sort); Containers (Map, CrsGraph, Mem Pool)
§ Kokkos (parallel execution, data allocation, data transfer)
§ Backends: std::thread, OpenMP, CUDA, ROCm
§ Alongside the stack: Kokkos-Tools (Kokkos-aware profiling and debugging tools) and the Kokkos support community (application support, developer training)
Kokkos Tools

42

§ Utilities
  § KernelFilter: enable/disable profiling for a selection of kernels
§ Kernel inspection
  § KernelLogger: runtime information about entering/leaving kernels and regions
  § KernelTimer: postprocessing information about kernel and region times
§ Memory analysis
  § MemoryHighWater: maximum memory footprint over the whole run
  § MemoryUsage: per-memory-space utilization timeline
  § MemoryEvents: per-memory-space allocations and deallocations
§ Third-party connectors
  § VTune Connector: mark kernels as frames inside of VTune
  § VTune Focused Connector: mark kernels as frames + start/stop profiling

https://github.com/kokkos/kokkos-tools
Kokkos-Tools: Example MemUsage

§ Tools are loaded at runtime
  § Profile actual release builds of applications
  § Set via: export KOKKOS_PROFILE_LIBRARY=[PATH_TO_PROFILING_LIB]
§ Output depends on the tool
  § Often one file per process
§ MemoryUsage provides per-MemorySpace utilization timelines
  § Time starts with Kokkos::initialize
  § HOSTNAME-PROCESSID-CudaUVM.memspace_usage

43

# Space CudaUVM
# Time(s)   Size(MB)   HighWater(MB)   HighWater-Process(MB)
0.317260    38.1       38.1            81.8
0.377285    0.0        38.1            158.1
0.384785    38.1       38.1            158.1
0.441988    0.0        38.1            158.1
Kokkos-Kernels

44

§ Provide BLAS (1,2,3), sparse, graph and tensor kernels
§ No required dependencies other than Kokkos
§ Local kernels (no MPI)
§ Hooks in TPLs such as MKL or cuBLAS/cuSparse where applicable
§ Provide kernels for all levels of hierarchical parallelism:
  § Global kernels: use all execution resources available
  § Team-level kernels: use a subset of threads for execution
  § Thread-level kernels: utilize vectorization inside the kernel
  § Serial kernels: provide elemental functions (OpenMP declare SIMD)
§ Work started based on customer priorities; expect a multi-year effort for broad coverage
§ People: many developers from Trilinos contribute
  § Consolidate node-level reusable kernels previously distributed over multiple packages
Kokkos-Kernels: Dense BLAS Example

45

[Charts: batched DGEMM and batched TRSM with 32k blocks, performance (GFlop/s) vs. matrix block size (3, 5, 10, 15) for MKL @ KNL, KK @ KNL, cuBLAS @ P100, KK @ P100.]
§ Batched small matrices using an interleaved memory layout
§ Matrix sizes based on common physics problems: 3, 5, 10, 15
§ 32k small matrices
§ Vendor libraries get better for more and larger matrices
Kokkos Users Spread

46

§ Users from a dozen major institutions
§ More than two dozen applications/libraries
  § Including many multi-million-line projects

[Diagram: map of Kokkos users and backend-optimization contributors]
Further Material

§ https://github.com/kokkos — Kokkos Github Organization
  § Kokkos: core library, Containers, Algorithms
  § Kokkos-Kernels: sparse and dense BLAS, graph, tensor (under development)
  § Kokkos-Tools: profiling and debugging
  § Kokkos-MiniApps: MiniApp repository and links
  § Kokkos-Tutorials: extensive tutorials with hands-on exercises
§ https://cs.sandia.gov — publications (search for 'Kokkos')
  § Many presentations on Kokkos and its use in libraries and apps
§ Talks at this GTC:
  § Carter Edwards, S7253 "Task Data Parallelism", today 10:00, 211B
  § Ramanan Sankaran, S7561 "High Pres. Reacting Flows", today 1:30, 212B
  § Pierre Kestener, S7166 "High Res. Fluid Dynamics", today 14:30, 212B
  § Michael Carilli, S7148 "Liquid Rocket Simulations", today 16:00, 212B
  § Panel, S7564 "Accelerator Programming Ecosystems", Tuesday 16:00, Ball3
  § Training Lab, L7107 "Kokkos, Manycore PP", Wednesday 16:00, LL21E

47
http://www.github.com/kokkos