
Page 1: Grid: data parallel library for QCD

Grid: data parallel library for QCD

Peter Boyle, Azusa Yamaguchi (University of Edinburgh)

Guido Cossu (KEK)

Work funded as an Intel Parallel Computing Centre

Page 2: Grid: data parallel library for QCD

Parallelism Paradigm Proliferation

• Large computers are getting increasingly difficult to program

• Suffering from PPP: “Parallelism Paradigm Proliferation”
  http://www.nersc.gov/assets/Uploads/RequirementsreviewsHellandV3150610.pdf

ASCR Computing Upgrades At a Glance (Exascale Requirements Gathering, HEP, 6/10/2015):

• NERSC now, Edison: 2.6 PF peak, 2 MW, 357 TB memory, 0.460 TF/node, Intel Ivy Bridge, 5,600 nodes, Aries interconnect, 7.6 PB / 168 GB/s Lustre.
• OLCF now, Titan: 27 PF peak, 9 MW, 710 TB memory, 1.452 TF/node, AMD Opteron + NVIDIA Kepler, 18,688 nodes, Gemini interconnect, 32 PB / 1 TB/s Lustre.
• ALCF now, Mira: 10 PF peak, 4.8 MW, 768 TB memory, 0.204 TF/node, 64-bit PowerPC A2, 49,152 nodes, 5D torus interconnect, 26 PB / 300 GB/s GPFS.
• NERSC upgrade, Cori (2016): >30 PF peak, <3.7 MW, ~1 PB DDR4 + High Bandwidth Memory (HBM) + 1.5 PB persistent memory, >3 TF/node, Intel Knights Landing many-core CPUs plus Intel Haswell CPUs in a data partition, 9,300 nodes + 1,900 data-partition nodes, Aries interconnect, 28 PB / 744 GB/s Lustre.
• OLCF upgrade, Summit (2017-2018): 150 PF peak, 10 MW, >1.74 PB DDR4 + HBM + 2.8 PB persistent memory, >40 TF/node, multiple IBM Power9 CPUs and multiple NVIDIA Volta GPUs, ~3,500 nodes, dual-rail EDR InfiniBand, 120 PB / 1 TB/s GPFS.
• ALCF upgrade, Theta (2016): >8.5 PF peak, 1.7 MW, >480 TB DDR4 + HBM, >3 TF/node, Intel Knights Landing Xeon Phi many-core CPUs, >2,500 nodes, Aries interconnect, 10 PB / 210 GB/s Lustre (initial).
• ALCF upgrade, Aurora (2018-2019): 180 PF peak, 13 MW, >7 PB high-bandwidth on-package memory, local memory and persistent memory, >17x Mira per node, Knights Hill Xeon Phi many-core CPUs, >50,000 nodes, 2nd generation Intel Omni-Path Architecture, 150 PB / 1 TB/s Lustre.

Page 3: Grid: data parallel library for QCD

SIMD is a particularly big pain in the backside...

...but a technologically cheap way to accelerate code

Isn’t there an easier way to get good performance on KNL and Haswell/Skylake?

Textbook computer science (e.g. Hennessy & Patterson):

• Code optimisations should expose spatial data reference locality

• Code optimisations should expose temporal data reference locality

SIMD brings a new level of restrictiveness that is much harder to hit

• Code optimisations should expose spatial operation locality

Aren’t we going to have to make it easier to use 128/256/512/???? bit SIMD?

Plan:

• Clean-slate re-engineering of a QDP++-style interface to exploit all forms of parallelism effectively

• MPI ⊗ OpenMP ⊗ SIMD

• Keep an open strategy for OpenMP 4.0 offload

Page 4: Grid: data parallel library for QCD

vSIMD: a performance-portable SIMD library

Define performant classes vRealF, vRealD, vComplexF, vComplexD.

#if defined (AVX1) || defined (AVX2)

typedef __m256 dvec;

#endif

#if defined (SSE2)

typedef __m128 dvec;

#endif

#if defined (AVX512)

typedef __m512 dvec;

#endif

#if defined (QPX)

typedef vector4double dvec;

#endif

#if defined (OPENMP4)

typedef double dvec[4];

#endif

class vRealD {

dvec v;

// Define arithmetic operators

friend inline vRealD operator + (vRealD a, vRealD b);

friend inline vRealD operator - (vRealD a, vRealD b);

friend inline vRealD operator * (vRealD a, vRealD b);

friend inline vRealD operator / (vRealD a, vRealD b);

static int Nsimd(void) { return sizeof(dvec)/sizeof(double);}

};
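As a hedged illustration (assumed, not taken from the Grid source), one AVX instantiation of this pattern could look as follows; __m256d holds four doubles and the intrinsics are the standard <immintrin.h> ones, so each overloaded operator issues one vector instruction over all four lanes.

// Sketch only: an AVX-backed stand-in for vRealD (compile with -mavx)
#include <immintrin.h>

typedef __m256d dvecd;   // 4 doubles per register

class vRealD_sketch {
public:
  dvecd v;
  friend inline vRealD_sketch operator + (vRealD_sketch a, vRealD_sketch b) {
    vRealD_sketch r; r.v = _mm256_add_pd(a.v, b.v); return r;   // 4 adds in one instruction
  }
  friend inline vRealD_sketch operator * (vRealD_sketch a, vRealD_sketch b) {
    vRealD_sketch r; r.v = _mm256_mul_pd(a.v, b.v); return r;
  }
  static int Nsimd(void) { return sizeof(dvecd)/sizeof(double); } // 4 lanes
};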

Page 5: Grid: data parallel library for QCD

What is the best SIMD strategy?

SIMD is most efficient for independent but identical work, e.g. apply N small dense matrix-vector multiplies in parallel:

template<int N, class simd>
inline void matmul(simd * __restrict__ x,
                   simd * __restrict__ y,
                   simd * __restrict__ z)

{

for(int i=0;i<N;i++){

for(int j=0;j<N;j++){

fmac(y[i*N+j],z[j],x[i]);

}

}

}

Figure: SIMD interleave. A single Vector = Matrix x Vector requires a reduction of the vector sum, which is the bottleneck for small N. Many vectors = many matrices x many vectors, interleaved across SIMD lanes, requires no reduction or SIMD lane-crossing operations.
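As a hedged, self-contained illustration of why this layout vectorises perfectly (not Grid code: a hand-rolled 4-lane struct stands in for vRealD), the sketch below instantiates the matmul template above so that four independent 3x3 matrix-vector products proceed in lock-step, one per lane, with no reduction or lane-crossing operation.

// Sketch only: 4 independent 3x3 matrix-vector products, one per SIMD lane
#include <cstdio>

struct simd4 { double v[4]; };   // stand-in for a 4-wide SIMD type
inline simd4 operator * (simd4 a, simd4 b) { simd4 r; for(int l=0;l<4;l++) r.v[l]=a.v[l]*b.v[l]; return r; }
inline simd4 operator + (simd4 a, simd4 b) { simd4 r; for(int l=0;l<4;l++) r.v[l]=a.v[l]+b.v[l]; return r; }
inline void fmac(const simd4 &a, const simd4 &b, simd4 &acc) { acc = acc + a*b; }  // acc += a*b, lane by lane

template<int N, class simd>
inline void matmul(simd * __restrict__ x, simd * __restrict__ y, simd * __restrict__ z)
{
  for(int i=0;i<N;i++)
    for(int j=0;j<N;j++)
      fmac(y[i*N+j], z[j], x[i]);   // every lane performs the same i,j work
}

int main() {
  const int N = 3;
  simd4 x[N] = {}, y[N*N], z[N];
  for(int i=0;i<N*N;i++) for(int l=0;l<4;l++) y[i].v[l] = 1.0;       // lane l holds site l's matrix
  for(int j=0;j<N;j++)   for(int l=0;l<4;l++) z[j].v[l] = l + 1.0;   // lane l holds site l's vector
  matmul<N,simd4>(x, y, z);
  std::printf("lane 0: %g, lane 3: %g\n", x[0].v[0], x[0].v[3]);     // 3 and 12: four sites at once
  return 0;
}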

Page 6: Grid: data parallel library for QCD

Back to the Future

Q) How do we find copious independent but identical work?

A) Remember that SIMD was NOT hard in the 1980’s

Image caption: Connection Machine Model CM-2 and DataVault System. The Connection Machine Model CM-2 uses thousands of processors operating in parallel to achieve peak processing speeds of above 10 gigaflops. The DataVault mass storage system stores up to 60 gigabytes of data.

• Resurrect Jurassic data parallel programming techniques: cmfortran, HPF

• Address SIMD, OpenMP, MPI with single data parallel interface

• Map arrays to virtual nodes with user-controlled layout primitives
• Conformable array operations proceed data parallel with 100% SIMD efficiency
• CSHIFT primitives handle communications

Page 7: Grid: data parallel library for QCD

GRID parallel library

• Geometrically decompose cartesian arrays across nodes (MPI)

• Subdivide node volume into smaller virtual nodes

• Spread virtual nodes across SIMD lanes

• Use OpenMP+MPI+SIMD to process conformable array operations

• Same instructions executed on many nodes, each node operates on four virtual nodes

Over decompose the subgrids

Interleave overdecomposed subvolumes in SIMD vector

Code for a single overdecomposed subvolume with fat vector data types

Processes all subvolumes in parallel with 100% SIMD efficiency

• Conclusion: Modify data layout to align data parallel operations to SIMD hardware

• Conformable array operations simple and vectorise perfectly

OVERDECOMPOSE and INTERLEAVE for SIMD
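A minimal sketch of the overdecompose-and-interleave bookkeeping, assuming a 1-D local dimension split evenly across the SIMD lanes (illustrative only): site x lands in packed element x mod Lsub of lane x / Lsub, consistent with the (A,B,C,D)(E,F,G,H) packing on the next slide.

// Sketch only: map 1-D lattice sites to (packed index, SIMD lane)
#include <cstdio>

int main() {
  const int L = 8, Nsimd = 4, Lsub = L / Nsimd;   // 4 virtual nodes of length 2
  for(int x = 0; x < L; x++) {
    int lane  = x / Lsub;   // which virtual node owns site x
    int osite = x % Lsub;   // position inside the packed array
    std::printf("site %d -> packed[%d], lane %d\n", x, osite, lane);
  }
  return 0;
}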

Page 8: Grid: data parallel library for QCD

GRID data parallel CSHIFT details

• Crossing between SIMD lanes is restricted to cshifts between virtual nodes

• Code for N virtual nodes is identical to scalar code for one, except the datum is N-fold bigger

$$\underbrace{(A,B,C,D)}_{\text{virtual subnode}}\;\underbrace{(E,F,G,H)}_{\text{virtual subnode}} \;\longrightarrow\; \underbrace{(AE,BF,CG,DH)}_{\text{packed SIMD}}$$

• CSHIFT involves a CSHIFT of the packed SIMD data, and a permute only on the surface

$$(AE,BF,CG,DH) \;\longrightarrow\; \underbrace{(BF,CG,DH,AE)}_{\text{cshift bulk}} \;\longrightarrow\; \underbrace{(BF,CG,DH,EA)}_{\text{permute face}} \;\longrightarrow\; \underbrace{(B,C,D,E)}_{\text{virtual subnode}}\;\underbrace{(F,G,H,A)}_{\text{virtual subnode}}$$

• Shuffle overhead is suppressed by surface to volume ratio

Page 9: Grid: data parallel library for QCD

GRID data parallel template library

Ordering       | Layout                                     | Vectorisation | Data Reuse
Microprocessor | Array-of-Structs (AoS)                     | Hard          | Maximised
Vector         | Struct-of-Arrays (SoA)                     | Easy          | Minimised
Bagel          | Array-of-Structs-of-Short-Vectors (AoSoSV) | Easy          | Maximised

• Opaque C++11 containers hide layout from user

• Automatically transform the layout of arrays of mathematical objects (scalar, vector, matrix, higher-rank tensors) using the vSIMD types vRealF, vRealD, vComplexF, vComplexD

template<class vtype> class iScalar

{

vtype _internal;

};

template<class vtype,int N> class iVector

{

vtype _internal[N];

};

template<class vtype,int N> class iMatrix

{

vtype _internal[N][N];

};

typedef Lattice<iMatrix<vComplexD,3> > LatticeColourMatrix;

typedef iMatrix<ComplexD,3> ColourMatrix;

• Define matrix, vector, and scalar site operations

• Conformable array operations are data parallel on the same Grid layout

• Internal type can be SIMD vectors or scalars

LatticeColourMatrix A(Grid);

LatticeColourMatrix B(Grid);

LatticeColourMatrix C(Grid);

LatticeColourMatrix dC_dy(Grid);

C = A*B;

const int Ydim = 1;

dC_dy = 0.5*Cshift(C,Ydim, 1 )

- 0.5*Cshift(C,Ydim,-1 );

• High-level data parallel code gets 65% of peak on AVX2

• High-level data parallel code gets 160 GF (single) on KNC

• A single data parallelism model targets BOTH SIMD and threads efficiently.
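A hedged toy of how a single conformable statement can exercise both levels of parallelism (not Grid's Lattice class; all names here are illustrative): the outer site loop is threaded with OpenMP while each site datum is itself a short SIMD-style vector.

// Sketch only: C = A*B as an OpenMP loop over outer sites with SIMD-width site data
#include <vector>
#include <cstdio>

struct vReal4 { double v[4]; };   // stand-in SIMD datum: 4 virtual nodes per site
inline vReal4 operator * (vReal4 a, vReal4 b) {
  vReal4 r; for(int l=0;l<4;l++) r.v[l] = a.v[l]*b.v[l]; return r;
}

template<class T> struct ToyLattice {
  std::vector<T> site;
  explicit ToyLattice(int osites) : site(osites) {}
};

template<class T>
void mult(ToyLattice<T> &out, const ToyLattice<T> &a, const ToyLattice<T> &b) {
  #pragma omp parallel for                  // thread parallelism over outer sites
  for(long i = 0; i < (long)out.site.size(); i++)
    out.site[i] = a.site[i] * b.site[i];    // SIMD parallelism inside the site datum
}

int main() {
  const int osites = 1024;
  ToyLattice<vReal4> A(osites), B(osites), C(osites);
  for(int i=0;i<osites;i++) for(int l=0;l<4;l++) { B.site[i].v[l] = 2.0; C.site[i].v[l] = 3.0; }
  mult(A, B, C);
  std::printf("A[0] lane 0 = %g\n", A.site[0].v[0]);   // 6
  return 0;
}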

Page 10: Grid: data parallel library for QCD

High level code performance

std::vector<int> grid    ({ 8,8,8,8 });
std::vector<int> simd    ({ 1,1,2,2 });
std::vector<int> mpi     ({ 1,1,1,1 });
std::vector<int> threads ({ 1,1,1,1 });

CartesianGrid Grid(grid,threads,simd,mpi);

LatticeColourMatrix A(Grid);

LatticeColourMatrix B(Grid);

LatticeColourMatrix C(Grid);

A = B * C;

Figure: Ivybridge single-core performance (Gflop/s) for 3x3 complex matrix multiplies versus memory footprint in bytes. Grid SU(3)xSU(3) reaches 65% of single-precision peak in cache and saturates the streams bandwidth limit out of cache; USQCD QDP++ shown for comparison.

Performance on an 8^4 lattice on AVX

Performance on a 16^3 lattice on AVX512 KNC

Page 11: Grid: data parallel library for QCD

LLVM/Clang++ is a cracking good compiler

Figure: compiler comparison on the 5d chiral fermion kernel (plot labels lost in extraction).

• Key routine for 5d chiral fermions (DWF/Overlap/Mobius)

• Test system 2.3GHz quad-core Ivybridge; 147GF peak

• Clang++ beats ICPC at loop unrolling, object copy elision.

• G++ doesn’t support AVX on Mac OS

Page 12: Grid: data parallel library for QCD

g++-4.9 performance on Xeon XC30 Ivybridge nodes

Figure: g++-4.9 strong scaling on Xeon XC30 Ivybridge nodes (plot labels lost in extraction).

• 8^4 × 8 local volume

• Dual 12 core 2.7 GHz Ivybridge (Archer)

• Node peak is 1004 GF in single precision.

• 42% of peak on 1 core

• 26% of peak on 24 cores

• Intel and Clang compilers would likely be higher

Page 13: Grid: data parallel library for QCD

Code examples, cshift
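The cshift code shown on this slide was not captured in the extraction. As a hedged 1-D stand-in (not Grid's implementation), the sketch below reproduces the bulk-shift plus face-permute pattern of the CSHIFT slide above: packing sites A..D into lane 0 and E..H into lane 1, a shift by +1 gives (BF,CG,DH,EA), i.e. lanes (B,C,D,E) and (F,G,H,A).

// Sketch only: cshift by +1 on SIMD-packed 1-D data; bulk shift plus a face permute
#include <cstdio>

const int Nsimd = 2;           // SIMD lanes = virtual nodes
const int L     = 8;           // local 1-D lattice length
const int Lpack = L / Nsimd;   // packed (per-virtual-node) length

struct packed { char lane[Nsimd]; };   // one site's datum per lane

void cshift_plus1(const packed *in, packed *out) {
  for(int i = 0; i < Lpack - 1; i++)        // bulk: the neighbour sits in the same lanes
    out[i] = in[i + 1];
  for(int l = 0; l < Nsimd; l++)            // face: the neighbour sits in the next lane
    out[Lpack - 1].lane[l] = in[0].lane[(l + 1) % Nsimd];
}

int main() {
  // lane 0 holds sites A,B,C,D; lane 1 holds sites E,F,G,H
  packed in[Lpack] = { {{'A','E'}}, {{'B','F'}}, {{'C','G'}}, {{'D','H'}} };
  packed out[Lpack];
  cshift_plus1(in, out);
  for(int i = 0; i < Lpack; i++)
    std::printf("(%c%c) ", out[i].lane[0], out[i].lane[1]);   // (BF) (CG) (DH) (EA)
  std::printf("\n");
  return 0;
}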

Page 14: Grid: data parallel library for QCD

Differences from QDP++

• Layout → multiple Grid objects
  • 5d and 4d grids are natural; no multi1d<LatticeFermion> for DWF
  • Red-black grid for checkerboarded fields

• Arbitrary depth cshift (e.g. Naik term)
  • Subplanes internally addressed through block-strided descriptors
  • Cshift and Stencil objects proceed via subplane shuffling

• Stencil used for Dirac operators
  • Less need to drop to high performance plug-in code

• Using C++11 features: auto, decltype, etc.
  • Home-grown expression templates: no PETE template engine
  • 50+k lines of PETE code → under 200 lines
  • Arbitrary tensor nesting depth via recursive decltype

• Compiled code is around 5.5x faster!

Page 15: Grid: data parallel library for QCD

Differences from QDP++

• Recursively infer return type of arithmetic operators

• Arbitrary depth tensor products supported

• QDP++/PETE generates over 50k LOC enumerating the cases for oLattice ⊗ Spin ⊗ Colour ⊗ Reality ⊗ iLattice

• A good example of less code and greater generality, enabled by C++11.

// scal x scal = scal
// mat x mat = mat

// mat x scal = mat

// scal x mat = mat

// mat x vec = vec

// vec x scal = vec

// scal x vec = vec

template<class l,class r,int N>

auto operator * (const iScalar<l>& lhs,const iMatrix<r,N>& rhs)// S*M = M at this level

-> iMatrix<decltype(lhs._internal*rhs._internal[0][0]),N> // recurses to next level return type

{

typedef decltype(lhs._internal*rhs._internal[0][0]) ret_t;

iMatrix<ret_t,N> ret;

for(int c1=0;c1<N;c1++){

for(int c2=0;c2<N;c2++){

mult(&ret._internal[c1][c2],&lhs._internal,&rhs._internal[c1][c2]);

}}

return ret;

}
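A hedged, self-contained toy of the same return-type deduction (simplified to double leaves and plain element-wise assignment in place of Grid's recursive mult): the trailing decltype means a single template covers every element type without enumerating cases.

// Sketch only: decltype-deduced return type for iScalar * iMatrix
#include <iostream>

template<class vtype> struct iScalar { vtype _internal; };
template<class vtype,int N> struct iMatrix { vtype _internal[N][N]; };

template<class l,class r,int N>
auto operator * (const iScalar<l> &lhs, const iMatrix<r,N> &rhs)
  -> iMatrix<decltype(lhs._internal*rhs._internal[0][0]),N>   // element type deduced from lhs*rhs
{
  iMatrix<decltype(lhs._internal*rhs._internal[0][0]),N> ret;
  for(int c1=0;c1<N;c1++)
    for(int c2=0;c2<N;c2++)
      ret._internal[c1][c2] = lhs._internal * rhs._internal[c1][c2];
  return ret;
}

int main() {
  iScalar<double> s{2.0};
  iMatrix<double,3> m{};
  for(int i=0;i<3;i++) m._internal[i][i] = 1.0;   // identity
  auto p = s*m;                                   // deduced as iMatrix<double,3>
  std::cout << p._internal[1][1] << std::endl;    // prints 2
  return 0;
}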

Page 16: Grid: data parallel library for QCD

Stencil operators

Grid : Data Parallel QCD Library

We present progress on a new C++ data parallel QCD library. It enables the description of cartesian fields of arbitrary tensor mathematical types.

Data parallel interface, conformable array syntax with Cshift and masked operations (cf. QDP++, cmfortran or HPF).

Three distinct forms of parallelism are transparently used underneath the single simple interface:

•  MPI task parallelism

•  OpenMP thread parallelism

•  SIMD vector parallelism.

The SIMD vector parallelism achieves nearly 100% SIMD efficiency due to the adoption of a virtual node layout transformation, similar to those in the Connection Machine.

This ensures identical and independent work lies in adjacent SIMD lanes. SSE, AVX, AVX2, AVX512 and Arm Neon SIMD targets are implemented.

The library is under development. Solvers for Wilson, domain wall, and multiple 5d chiral fermions (Cayley, continued fraction, partial fraction) are implemented.

Features differing from QDP++:

•  Shift by arbitrary distance

•  Storage:

•  Checkerboarded fields are half length

•  5d fields are the same type as 4d, with a different Grid.

•  Multiple grids and Projection/Promotion support

•  blockProject, blockPromote, blockInnerProduct

•  Stencil object

•  encapsulates geometry of operation

•  Performs halo exchange

•  Simple to write kernels for Dirac operators.

•  C++11 : expression template engine < 200 LoC


Peter A. Boyle, Azusa Yamaguchi (Intel Parallel Computing Centre @ Higgs Centre for Theoretical Physics, University of Edinburgh), with Guido Cossu (KEK, Tsukuba)


• The Stencil organises the halo exchange for any vector type; a compressor can do the spin projection for Wilson fermions.
• Pass the stencil a list of directions and displacements.
• The stencil provides the index of each neighbour (it knows the geometry).
• The user dictates how to treat the internal indices in the operator.
• Stencil support eases the pain of optimised matrix multiplies, e.g. the coarse grid operator in Grid.
• www.github.com/paboyle/Grid
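As a hedged toy of the stencil idea (names and API here are illustrative, not Grid's Stencil class): build the geometry from a list of displacements, precompute the neighbour index table (standing in for the halo exchange, on a periodic 1-D lattice), and let the user's kernel decide what to do with the internal indices.

// Sketch only: a toy stencil on a periodic 1-D lattice driving a coarse-operator-style kernel
#include <vector>
#include <cstdio>

struct ToyStencil {
  int L;                                  // local 1-D volume
  std::vector<int> disp;                  // displacements, e.g. {+1,-1}
  std::vector<std::vector<int>> neigh;    // neigh[d][i] = index of site i shifted by disp[d]
  ToyStencil(int L_, std::vector<int> d) : L(L_), disp(d), neigh(d.size(), std::vector<int>(L_)) {
    for(size_t dd = 0; dd < disp.size(); dd++)
      for(int i = 0; i < L; i++)
        neigh[dd][i] = ((i + disp[dd]) % L + L) % L;   // periodic wrap stands in for the halo exchange
  }
};

int main() {
  const int L = 8;
  ToyStencil st(L, {+1, -1});             // nearest-neighbour geometry

  std::vector<double> x(L), y(L, 0.0);
  for(int i = 0; i < L; i++) x[i] = i;

  // User kernel: y[i] = sum over stencil points of x[neighbour]
  for(int i = 0; i < L; i++)
    for(size_t d = 0; d < st.disp.size(); d++)
      y[i] += x[st.neigh[d][i]];

  for(int i = 0; i < L; i++) std::printf("%g ", y[i]);
  std::printf("\n");
  return 0;
}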

Figure: GCC 4.9 performance on Archer and Edison (Cray XC30, dual 12-core Ivybridge, 24 cores per node).

•  Fermion actions
  •  Wilson
  •  Chiral fermions: (Cayley / ContFrac / PartFrac) ⊗ (Zolotarev / Tanh) ⊗ (Wilson / Shamir / Mobius)
•  Red-black 4d and unpreconditioned
•  CG, MCR, multishift CG, GCR
•  Zolotarev, Chebyshev, Remez approximations

•  Gauge actions
  •  Quenched Wilson / Symanzik / Iwasaki

•  Multigrid coarse operators
•  NERSC file I/O

•  Roadmap:
  •  (R)HMC, gauge and fermion force terms
  •  FFT, measurements
  •  Link smearing

The programming style works with multiple C++11 compilers. Clang is the best on 256-bit SIMD (AVX, AVX2); ICPC is the only one supporting the full set of 512-bit intrinsics properly.

Page 17: Grid: data parallel library for QCD

Wilson Dirac Kernel

Page 18: Grid: data parallel library for QCD

Multi-Grid for Chiral Fermions

• Using Grid as a rapid development environment

• multi-grid transfer interface
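A hedged toy of the block projection and promotion behind the multi-grid transfer interface (illustrative only: Grid's blockProject/blockPromote act on Lattice types, not plain vectors): coarse degrees of freedom are block-local inner products with a set of basis vectors, and promotion reconstructs a fine field from them.

// Sketch only: block project/promote on a 1-D fine field with 2 toy basis vectors per block
#include <vector>
#include <cstdio>

int main() {
  const int L = 16, B = 4, Nblocks = L / B, Nbasis = 2;

  std::vector<double> fine(L);
  for(int i = 0; i < L; i++) fine[i] = 1.0 + i;

  // Toy block-local basis: a flat mode and a linear mode
  auto basis = [&](int k, int j) { return k == 0 ? 1.0 : (j - (B - 1) / 2.0); };

  // "blockProject": coarse coefficients are block-local inner products
  std::vector<double> coarse(Nblocks * Nbasis, 0.0);
  for(int b = 0; b < Nblocks; b++)
    for(int k = 0; k < Nbasis; k++)
      for(int j = 0; j < B; j++)
        coarse[b * Nbasis + k] += basis(k, j) * fine[b * B + j];

  // "blockPromote": reconstruct a fine field from the coarse coefficients
  std::vector<double> promoted(L, 0.0);
  for(int b = 0; b < Nblocks; b++)
    for(int k = 0; k < Nbasis; k++)
      for(int j = 0; j < B; j++)
        promoted[b * B + j] += coarse[b * Nbasis + k] * basis(k, j);

  std::printf("block 0 coarse coefficients: %g %g; promoted[0] = %g\n",
              coarse[0], coarse[1], promoted[0]);
  return 0;
}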

Page 19: Grid: data parallel library for QCD

Implementation status

• Basic Grid type system essentially complete
  • QCD types, generic SU(N), arbitrary dimension
  • Simple to port UKQCD observables code over from QDP++
  • Sum, SliceSum
  • To do: slow and fast Fourier transform, gauge fixing

• Algorithms
  • CG, MCR, GCR, VPGCR
  • Chebyshev approximation, Remez, multishift CG
  • Two-level VPGCR (first steps to multigrid)
  • Quenched heatbath
  • Quenched HMC

• Fermion Dirac operators
  • Even-odd and unpreconditioned have a single unified definition
  • Wilson
  • {Wilson, Shamir, Mobius} kernel ⊗ {Zolotarev, Tanh} approximation ⊗ {Continued fraction, Partial fraction, Cayley} representation
  • DWF, Mobius as special cases

• www.github.com/paboyle/Grid
  • 30k lines of code since April!

Page 20: Grid: data parallel library for QCD

Chiral fermion multi-grid

• HDCG (Boyle, Lattice 2012) coarsened the Hermitian even-odd matrix: next-to-next-to-next-to-nearest neighbour

• HDCG setup is too expensive for (R)HMC acceleration
  • The biggest problem is recomputation of the coarse operator on an evolved configuration

• Negative-mass Wilson breaks the half-plane condition for Krylov solvers
  • Polynomial approximation to the complex function f(z) = 1/z on an open region enclosing the origin is hard
  • No analytic function can wind the phase around the pole

• Resolution:
  • Normal equations: M†M is Hermitian positive definite with a positive real spectrum
  • This forces a non-local coarse space representation; not a true multigrid, as sparsity is not retained
  • Use Hermitian indefinite solvers like MCR on
    • γ5 R5 M for DWF/Tanh
    • γ5 M for partial fraction, continued fraction Zolotarev/Tanh

• Successfully solving the DWF and PF, CF Hermitian indefinite operators
  • No algorithmic loss compared to CG
  • Implemented coarsening and a two-level GCR
  • Tuning smoothers, but no multigrid speedup as yet. Work in progress!
  • Kobe update: using both η and γ5 R5 η has substantially improved the coarse space condition number! Still no overall speed-up.
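For reference, a hedged sketch of a textbook conjugate residual iteration, the kind of Hermitian-indefinite solver referred to as MCR above; a dense 2x2 symmetric indefinite system stands in for γ5 R5 M, and this is not Grid's solver code.

// Sketch only: conjugate residual (CR) on a small symmetric indefinite system
#include <cstdio>
#include <cmath>

typedef double vec2[2];
const double A[2][2] = { {2.0, 1.0}, {1.0, -3.0} };   // symmetric, indefinite

void matvec(const vec2 x, vec2 y) { for(int i=0;i<2;i++) y[i] = A[i][0]*x[0] + A[i][1]*x[1]; }
double dot(const vec2 a, const vec2 b) { return a[0]*b[0] + a[1]*b[1]; }

int main() {
  vec2 b = {1.0, 1.0}, x = {0.0, 0.0};
  vec2 r, Ar, p, Ap;

  matvec(x, Ar);                                        // Ar temporarily holds A*x
  for(int i=0;i<2;i++) r[i] = b[i] - Ar[i];
  matvec(r, Ar);
  for(int i=0;i<2;i++) { p[i] = r[i]; Ap[i] = Ar[i]; }
  double rAr = dot(r, Ar);

  for(int it = 0; it < 50 && std::sqrt(dot(r, r)) > 1e-12; it++) {
    double alpha = rAr / dot(Ap, Ap);
    for(int i=0;i<2;i++) { x[i] += alpha*p[i]; r[i] -= alpha*Ap[i]; }
    matvec(r, Ar);
    double rAr_new = dot(r, Ar);
    double beta = rAr_new / rAr;
    for(int i=0;i<2;i++) { p[i] = r[i] + beta*p[i]; Ap[i] = Ar[i] + beta*Ap[i]; }
    rAr = rAr_new;
  }
  std::printf("x = (%g, %g)\n", x[0], x[1]);            // solution of A x = b
  return 0;
}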

Page 21: Grid: data parallel library for QCD

Comparison with BFM code

It is instructive to compare BFM code for the Wilson kernel to compiler-generated Grid code: a measure of how efficiently compilers place intrinsics in registers.

Compiler   | SIMD   | Load/stores (single-site Wilson)
Compulsory |        | 12 x 8 + 9 x 8 + 12 = 180
Bagel      | avx512 | 200
icpc       | avx512 | 676
Clang++    | avx    | 665
g++        | sse4   | 660
icpc       | avx    | 615

• The “perfect” BFM register allocation strategy was indicated to the compiler with 31 temporaries

• Equivalent to inserting neon lights in the code flashing“this is the solution to your reg allocation problem”

• Compilers survive by relying heavily on the L1 cache

• Intel compiler is performing object copies, while Clang++ is eliminating these

• May still end up with a small amount of Bagel code for the Wilson term and AVX512. We will see how we do with KNL.

Page 22: Grid: data parallel library for QCD

Summary

• Ground-up development of a new library for lattice QCD, designed to be fast

• Intended to be efficient on the next generation of systems
  • SSE, AVX, AVX2, AVX512 implemented; ARM Neon partially implemented
  • Plan a backport to bgclang
  • Plan OpenMP 4.0 offload targets also

• Plan a broad range of fermion action and gauge evolution support
  • Rate of progress is rapid
  • Already a vehicle for algorithm research
  • Possibly non-QCD field theories

• In our (unbiased !?) view it is rather good!