Library Interface Design and Performance Portability
Jed Brown, Jeremy Thompson, Valeria Barra
SIAM CSE, Spokane, 2019-02-25
This talk: https://jedbrown.org/files/20190225-PerfPortability.pdf
What is performance portability?
- Performs well across a range of architectures and problem configurations with modest development and maintenance effort.
- All architectures require massive parallelism
  - 28-core Xeon: dual-issue FMA with 16-lane registers, 6-cycle latency: 5376 in-flight flops
  - About 4x more for V100
- Architectural differences
  - Persistent cache on CPU, big register space on GPU
  - Programming model expression of coalesced loads (e.g., CUDA vs OpenMP/OpenACC)
  - Tolerance for lane divergence (SIMT versus masked SIMD); what does the compiler need to know/use?
- Problem configurations: time to solution requirements (problem sizes), constitutive models, etc.
HPGMG-FE on Edison, SuperMUC, Titan
[Figure: HPGMG-FE performance spectra on Edison, SuperMUC, and Titan; Titan shows >200 ms variability; problem sizes 1.6B, 12.9B, 155B, and 309B]
Modest development and maintenance effort
- Some custom code may be needed; how to leverage?
- Granularity of extensibility
  - BLAS/LAPACK: Users compose large dense linear algebraic operations
  - BLIS: Users implement packing schemes; reuse "microkernel"
  - Users implement different PDE, discretization, and constitutive models
  - FEniCS: domain-specific language for finite element
- Maintenance
  - Interface stability for the extensible parts
  - Ability to rapidly affect "infrastructure" (turtles all the way down?)
Approaches
Type of software                        Expressive  Interoperable/  Contributors  New           Packaging/
                                                    Environment                   architecture  Distribution
Library (LAPACK, FFTW, BLIS @MS129)         −            +              +             0              +
Dynamic/extensible library (libCEED)        0            +              +           plugin      if stable ABI
Template library (Kokkos @MS129)            0            0              0             +           recompile
DSL/code generation/JIT (Firedrake)         +            −              0             +              ?
New language (Chapel)                       +            −              −             ?              ?
[Figure: Theoretical peak floating point operations per byte, double precision, by end of year 2008–2016, for INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and INTEL Xeon Phis; c/o Karl Rupp]
Performance of assembled versus unassembled
[Figure: bytes/dof (left) and flops/dof (right) versus polynomial order 1–7, for tensor, tensor-qstore, and assembled variants with b = 1 and b = 3]

- Arithmetic intensity for \(Q_p\) elements: \(< \tfrac{1}{4}\) (assembled), \(\approx 10\) (unassembled), \(\approx 5\) to \(10\) (hardware)
- Store Jacobian information at Gauss quadrature points; can use AD
libCEED: Code for Efficient Extensible Discretization
- BSD-2 license, C library with Fortran interface
- Releases: v0.1 – v0.3 (2018); v0.4 to be released in March
- Purely algebraic interface
- Extensible backends
  - CPU: reference, blocked/vectorized, libXSMM
  - OCCA (just-in-time compilation): CPU, OpenMP, OpenCL, CUDA
  - MAGMA
  - CUDA using NVRTC (Steven Roberts & Yohann Dudouit, to be merged soon)
- Platform for collaboration with vendors
- Minimal assumptions about execution environment, parallel decomposition
- Primary target: high-order finite element methods
  - \(H^1\), \(H(\mathrm{div})\), \(H(\mathrm{curl})\)
  - Also of interest to spectral difference, etc.
  - Exploit tensor product structure when possible
[Diagram: T-vector (global domain: all shared dofs) → L-vector (sub-domains: device-local dofs) → E-vector (elements: element dofs) → Q-vector (quadrature point values). The global problem owns the T-vector; the libCEED Operator performs the dense element operations between L-, E-, and Q-vectors; the Quadrature Function acts pointwise on Q-vectors.]
\[
v^T F(u) \sim \int_\Omega v \cdot f_0(u,\nabla u) + \nabla v : f_1(u,\nabla u),
\qquad
v^T J w \sim \int_\Omega
\begin{bmatrix} v \\ \nabla v \end{bmatrix}^T
\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}
\begin{bmatrix} w \\ \nabla w \end{bmatrix}
\]

\[
u = B_I \mathcal{E}_e u_L, \qquad
\nabla u = \frac{\partial X}{\partial x} B_\nabla \mathcal{E}_e u_L
\]

\[
J w = \sum_e \mathcal{E}_e^T
\begin{bmatrix} B_I \\ B_\nabla \end{bmatrix}^T
\underbrace{
\begin{bmatrix} I & \\ & \left(\frac{\partial X}{\partial x}\right)^T \end{bmatrix}
W_q
\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}
\begin{bmatrix} I & \\ & \frac{\partial X}{\partial x} \end{bmatrix}
}_{\text{coefficients at quadrature points}}
\begin{bmatrix} B_I \\ B_\nabla \end{bmatrix}
\mathcal{E}_e w_L
\]
- \(B_I\) and \(B_\nabla\) are tensor contractions – independent of element geometry
- Choice of how to order and represent gathers \(\mathcal{E}\) and scatters \(\mathcal{E}^T\)
- Who computes the metric terms and other coefficients?
- Similar for Neumann/Robin and nonlinear boundary conditions
Quadrature Functions
- Multiple inputs and outputs
- Independent operations at each of Q quadrature points
- Ordering and number of elements not specified

int L2residual(void *ctx, CeedInt Q,
               const CeedScalar *const in[],
               CeedScalar *const out[]) {
  const CeedScalar *u = in[0], *rho = in[1], *target = in[2];
  CeedScalar *v = out[0];
  #pragma omp simd
  for (CeedInt i=0; i<Q; i++)
    v[i] = rho[i] * (u[i] - target[i]);
  return 0;
}

CeedQFunctionAddInput(qf, "u", 1, CEED_EVAL_INTERP);
CeedQFunctionAddInput(qf, "rho", 1, CEED_EVAL_INTERP);
CeedQFunctionAddInput(qf, "target", 1, CEED_EVAL_INTERP);
CeedQFunctionAddOutput(qf, "v", 1, CEED_EVAL_INTERP);
Building Operators
\[
v^T F(u) \sim \int_\Omega v \cdot f_0(u,\nabla u) + \nabla v : f_1(u,\nabla u),
\qquad
u = B_I \mathcal{E}_e u_L, \quad
\nabla u = \frac{\partial X}{\partial x} B_\nabla \mathcal{E}_e u_L
\]

- CeedOperatorCreate(ceed, f, &op);
- CeedOperatorSetField(op, "velocity", E, B, uL);
- Apply: implementation handles batching, work buffers, and calling \(f(u, \nabla u)\).
- User links to interface library
- Backend implementation switchable at run-time
- Two-phase implementation enables connectivity analysis and JIT
libCEED Operator
\[
A = P^T \underbrace{\mathcal{E}^T B^T D B \mathcal{E}}_{\text{CeedOperator}} P
\]

- Quadrature function \(D\)
- For each field: element restriction \(\mathcal{E}\), basis \(B\), and where to find the vector

CeedOperatorCreate(ceed, qf_L2residual, &op);
CeedOperatorSetField(op, "u", E, Basis, CEED_VECTOR_ACTIVE);
CeedOperatorSetField(op, "rho", E_id, CEED_BASIS_COLOCATED, rho);
CeedOperatorSetField(op, "target", E_id, CEED_BASIS_COLOCATED, target);
CeedOperatorSetField(op, "v", E, Basis, CEED_VECTOR_ACTIVE);
CeedOperatorApply(op, u, v, &request);
Element restriction \(\mathcal{E}_e\)

- Conforming homogeneous mesh: boolean matrix with homogeneous block size
- Non-conforming mesh: anchored rows have linear combination
- Nek5000/DG-style E-vector: indexed identity
- libCEED backends are allowed to reorder, compress, etc.
- May be applied all at once or in batches
- CeedOperator implementation chooses if/how to fuse
Performance versatility: \(n_{1/2}\) and \(t_{1/2}\)

- Suppose a linear scaling algorithm
- Let \(r(n)\) be the performance rate (e.g., DOF/second or GF/s) for local problem size \(n = N/P\)
- Let \(r_{\max} = \max_n r(n)\) be the peak attainable performance
- \(n_{1/2} = \min \{ n : r(n) \ge \tfrac{1}{2} r_{\max} \}\)
- Local problem sizes \(n < n_{1/2}\) will not yield acceptable efficiency
- \(t_{1/2} = 2 n_{1/2} / r_{\max}\)
- Time to solution less than \(t_{1/2}\) is not feasible with acceptable efficiency
Performance spectra
[Figure: performance spectra for /cpu/self/avx/blocked, petsc-bp1, polynomial orders p = 1–16. Left: throughput ([DOFs × CG iterations] / [compute nodes × seconds], ×1e8) versus points per compute node (10^2–10^6). Right: the same throughput versus time per iteration (10^-4–10^-1 s).]
Vectorization techniques
- Vectorize within a single high-order element
  - Minimal working set (as small as one element)
  - Specialized implementation for different degree / # quadrature points
  - Hard to avoid cross-lane operations at modest degree
  - Nek5000
- Vectorize across elements in batches [i,j,k,e]
  - Working set has at least vector-length many elements (e.g., 8)
  - Generic implementation is easy to optimize; no cross-lane operations
  - HPGMG-FE, Deal.II (Kronbichler and Kormann), MFEM (new)
AVX internal vs external vectorization
[Figure: performance spectra (throughput versus time per iteration) for /cpu/self/avx/serial (internal vectorization, left) and /cpu/self/avx/blocked (external vectorization, right), petsc-bp1, p = 1–16.]
libXSMM internal vs external vectorization
[Figure: performance spectra (throughput versus time per iteration) for /cpu/self/xsmm/serial (internal vectorization, left) and /cpu/self/xsmm/blocked (external vectorization, right), petsc-bp1, p = 1–16.]
Outlook
- spack install ceed works; next release in March
- libCEED is interested in contributors and friendly users
- Need consistent strategy for JIT of Q-functions
  - How much fusion and inlining?
- Performance optimizations in progress
  - Backends should automatically choose internal versus external vectorization
  - Choice depends on architecture, element size, number of fields, problem size
  - Even/odd decomposition (symmetry)
- Mixed topology (works, but needs flattening)
- Hanging node interpolation (can be done in \(P\))