Library Interface Design and Performance Portability
Jed Brown, Jeremy Thompson, Valeria Barra
SIAM CSE, Spokane, 2019-02-25
This talk: https://jedbrown.org/files/20190225-PerfPortability.pdf
What is performance portability?
- Performs well across a range of architectures and problem configurations with modest development and maintenance effort.
- All architectures require massive parallelism
  - 28-core Xeon: dual-issue FMA with 16-lane registers, 6-cycle latency: 5376 in-flight flops
  - About 4x more for V100
- Architectural differences
  - Persistent cache on CPU, big register space on GPU
  - Programming model expression of coalesced loads (e.g., CUDA vs OpenMP/OpenACC)
  - Tolerance for lane divergence (SIMT versus masked SIMD); what does the compiler need to know/use?
- Problem configurations: time to solution requirements (problem sizes), constitutive models, etc.
HPGMG-FE on Edison, SuperMUC, Titan
[Figure: HPGMG-FE performance spectra on Edison, SuperMUC, and Titan; Titan shows >200 ms variability; problem sizes 1.6B, 12.9B, 155B, and 309B]
Modest development and maintenance effort
- Some custom code may be needed; how to leverage?
- Granularity of extensibility
  - BLAS/LAPACK: Users compose large dense linear algebraic operations
  - BLIS: Users implement packing schemes; reuse "microkernel"
  - Users implement different PDE, discretization, and constitutive models
  - FEniCS: domain-specific language for finite element
- Maintenance
  - Interface stability for the extensible parts
  - Ability to rapidly affect "infrastructure" (turtles all the way down?)
Approaches
Type of software                        Expressive  Interoperable/  Contributors  New           Packaging/
                                                    Environment                   architecture  Distribution
Library (LAPACK, FFTW, BLIS @MS129)         −            +              +             0              +
Dynamic/extensible library (libCEED)        0            +              +           plugin      if stable ABI
Template library (Kokkos @MS129)            0            0              0             +           recompile
DSL/code generation/JIT (Firedrake)         +            −              0             +              ?
New language (Chapel)                       +            −              −             ?              ?
[Figure: Theoretical peak floating point operations per byte, double precision, by end of year 2008–2016, for INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and INTEL Xeon Phis; c/o Karl Rupp]
Performance of assembled versus unassembled
[Figure: bytes/dof (left) and flops/dof (right) versus polynomial order 1–7, for tensor, tensor-qstore, and assembled variants with b = 1 and b = 3]

- Arithmetic intensity for \(Q_p\) elements: \(< \tfrac{1}{4}\) (assembled), \(\approx 10\) (unassembled), \(\approx 5\) to \(10\) (hardware)
- Store Jacobian information at Gauss quadrature points; can use AD
libCEED: Code for Efficient Extensible Discretization
- BSD-2 license, C library with Fortran interface
- Releases: v0.1 – v0.3 (2018); v0.4 to be released in March
- Purely algebraic interface
- Extensible backends
  - CPU: reference, blocked/vectorized, libXSMM
  - OCCA (just-in-time compilation): CPU, OpenMP, OpenCL, CUDA
  - MAGMA
  - CUDA using NVRTC (Steven Roberts & Yohann Dudouit, to be merged soon)
- Platform for collaboration with vendors
- Minimal assumptions about execution environment, parallel decomposition
- Primary target: high-order finite element methods
  - \(H^1\), \(H(\mathrm{div})\), \(H(\mathrm{curl})\)
  - Also of interest to spectral difference, etc.
  - Exploit tensor product structure when possible
[Diagram: T-vector (global domain: all shared dofs) → L-vector (sub-domains: device-local dofs) → E-vector (elements: element dofs) → Q-vector (quadrature point values). The global problem owns the T-vector; the libCEED Operator performs the dense element operations between L-, E-, and Q-vectors; the Quadrature Function acts pointwise on Q-vectors.]
\[
v^T F(u) \sim \int_\Omega v \cdot f_0(u,\nabla u) + \nabla v : f_1(u,\nabla u),
\qquad
v^T J w \sim \int_\Omega
\begin{bmatrix} v \\ \nabla v \end{bmatrix}^T
\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}
\begin{bmatrix} w \\ \nabla w \end{bmatrix}
\]

\[
u = B_I \mathcal{E}_e u_L, \qquad
\nabla u = \frac{\partial X}{\partial x} B_\nabla \mathcal{E}_e u_L
\]

\[
J w = \sum_e \mathcal{E}_e^T
\begin{bmatrix} B_I \\ B_\nabla \end{bmatrix}^T
\underbrace{
\begin{bmatrix} I & \\ & \left(\frac{\partial X}{\partial x}\right)^T \end{bmatrix}
W_q
\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}
\begin{bmatrix} I & \\ & \frac{\partial X}{\partial x} \end{bmatrix}
}_{\text{coefficients at quadrature points}}
\begin{bmatrix} B_I \\ B_\nabla \end{bmatrix}
\mathcal{E}_e w_L
\]
- \(B_I\) and \(B_\nabla\) are tensor contractions – independent of element geometry
- Choice of how to order and represent gathers \(\mathcal{E}\) and scatters \(\mathcal{E}^T\)
- Who computes the metric terms and other coefficients?
- Similar for Neumann/Robin and nonlinear boundary conditions
Quadrature Functions
- Multiple inputs and outputs
- Independent operations at each of Q quadrature points
- Ordering and number of elements not specified

int L2residual(void *ctx, CeedInt Q,
               const CeedScalar *const in[],
               CeedScalar *const out[]) {
  const CeedScalar *u = in[0], *rho = in[1], *target = in[2];
  CeedScalar *v = out[0];
  #pragma omp simd
  for (CeedInt i=0; i<Q; i++)
    v[i] = rho[i] * (u[i] - target[i]);
  return 0;
}

CeedQFunctionAddInput(qf, "u", 1, CEED_EVAL_INTERP);
CeedQFunctionAddInput(qf, "rho", 1, CEED_EVAL_INTERP);
CeedQFunctionAddInput(qf, "target", 1, CEED_EVAL_INTERP);
CeedQFunctionAddOutput(qf, "v", 1, CEED_EVAL_INTERP);
Building Operators
\[
v^T F(u) \sim \int_\Omega v \cdot f_0(u,\nabla u) + \nabla v : f_1(u,\nabla u),
\qquad
u = B_I \mathcal{E}_e u_L, \quad
\nabla u = \frac{\partial X}{\partial x} B_\nabla \mathcal{E}_e u_L
\]

- CeedOperatorCreate(ceed, f, &op);
- CeedOperatorSetField(op, "velocity", E, B, uL);
- Apply: implementation handles batching, work buffers, and calling \(f(u, \nabla u)\).
- User links to interface library
- Backend implementation switchable at run-time
- Two-phase implementation enables connectivity analysis and JIT
libCEED Operator
\[
A = P^T \underbrace{\mathcal{E}^T B^T D B \mathcal{E}}_{\text{CeedOperator}} P
\]

- Quadrature function \(D\)
- For each field: element restriction \(\mathcal{E}\), basis \(B\), and where to find the vector

CeedOperatorCreate(ceed, qf_L2residual, &op);
CeedOperatorSetField(op, "u", E, Basis, CEED_VECTOR_ACTIVE);
CeedOperatorSetField(op, "rho", E_id, CEED_BASIS_COLOCATED, rho);
CeedOperatorSetField(op, "target", E_id, CEED_BASIS_COLOCATED, target);
CeedOperatorSetField(op, "v", E, Basis, CEED_VECTOR_ACTIVE);
CeedOperatorApply(op, u, v, &request);
Element restriction \(\mathcal{E}_e\)

- Conforming homogeneous mesh: boolean matrix with homogeneous block size
- Non-conforming mesh: anchored rows have linear combination
- Nek5000/DG-style E-vector: indexed identity
- libCEED backends are allowed to reorder, compress, etc.
- May be applied all at once or in batches
- CeedOperator implementation chooses if/how to fuse
Performance versatility: \(n_{1/2}\) and \(t_{1/2}\)

- Suppose a linear scaling algorithm
- Let \(r(n)\) be the performance rate (e.g., DOF/second or GF/s) for local problem size \(n = N/P\)
- Let \(r_{\max} = \max_n r(n)\) be the peak attainable performance
- \(n_{1/2} = \min \{ n : r(n) \ge \tfrac{1}{2} r_{\max} \}\)
- Local problem sizes \(n < n_{1/2}\) will not yield acceptable efficiency
- \(t_{1/2} = 2 n_{1/2} / r_{\max}\)
- Time to solution less than \(t_{1/2}\) is not feasible with acceptable efficiency
Performance spectra
[Figure: performance spectra for /cpu/self/avx/blocked, petsc-bp1, polynomial orders p = 1–16. Left: throughput ([DOFs × CG iterations] / [compute nodes × seconds], ×1e8) versus points per compute node (10^2–10^6). Right: the same throughput versus time per iteration (10^-4–10^-1 s).]
Vectorization techniques
- Vectorize within a single high-order element
  - Minimal working set (as small as one element)
  - Specialized implementation for different degree / # quadrature points
  - Hard to avoid cross-lane operations at modest degree
  - Nek5000
- Vectorize across elements in batches [i,j,k,e]
  - Working set has at least vector-length many elements (e.g., 8)
  - Generic implementation is easy to optimize; no cross-lane operations
  - HPGMG-FE, Deal.II (Kronbichler and Kormann), MFEM (new)
AVX internal vs external vectorization
[Figure: performance spectra (throughput versus time per iteration) for /cpu/self/avx/serial (internal vectorization, left) and /cpu/self/avx/blocked (external vectorization, right), petsc-bp1, p = 1–16.]
libXSMM internal vs external vectorization
[Figure: performance spectra (throughput versus time per iteration) for /cpu/self/xsmm/serial (internal vectorization, left) and /cpu/self/xsmm/blocked (external vectorization, right), petsc-bp1, p = 1–16.]
Outlook
- spack install ceed works; next release in March
- libCEED is interested in contributors and friendly users
- Need consistent strategy for JIT of Q-functions
  - How much fusion and inlining?
- Performance optimizations in progress
  - Backends should automatically choose internal versus external vectorization
  - Choice depends on architecture, element size, number of fields, problem size
  - Even/odd decomposition (symmetry)
- Mixed topology (works, but needs flattening)
- Hanging node interpolation (can be done in \(P\))