Grid: data parallel library for QCD
Peter Boyle, Azusa Yamaguchi (University of Edinburgh)
Guido Cossu (KEK)
Work funded as an Intel Parallel Computing Centre
Parallelism Paradigm Proliferation
• Large computers are getting increasingly difficult to program
• Suffering from PPP: “Parallelism Paradigm Proliferation”
  http://www.nersc.gov/assets/Uploads/RequirementsreviewsHellandV3150610.pdf
ASCR Computing Upgrades At a Glance

| System attributes | NERSC now (Edison) | OLCF now (TITAN) | ALCF now (MIRA) | NERSC upgrade (Cori, 2016) | OLCF upgrade (Summit, 2017-2018) | ALCF upgrade (Theta, 2016) | ALCF upgrade (Aurora, 2018-2019) |
|---|---|---|---|---|---|---|---|
| System peak (PF) | 2.6 | 27 | 10 | > 30 | 150 | > 8.5 | 180 |
| Peak power (MW) | 2 | 9 | 4.8 | < 3.7 | 10 | 1.7 | 13 |
| Total system memory | 357 TB | 710 TB | 768 TB | ~1 PB DDR4 + HBM + 1.5 PB persistent memory | > 1.74 PB DDR4 + HBM + 2.8 PB persistent memory | > 480 TB DDR4 + HBM | > 7 PB high-bandwidth on-package, local and persistent memory |
| Node performance (TF) | 0.460 | 1.452 | 0.204 | > 3 | > 40 | > 3 | > 17 times Mira |
| Node processors | Intel Ivy Bridge | AMD Opteron + Nvidia Kepler | 64-bit PowerPC A2 | Intel Knights Landing many-core CPUs; Intel Haswell CPUs in data partition | Multiple IBM Power9 CPUs and multiple Nvidia Volta GPUs | Intel Knights Landing Xeon Phi many-core CPUs | Knights Hill Xeon Phi many-core CPUs |
| System size (nodes) | 5,600 | 18,688 | 49,152 | 9,300 (+ 1,900 in data partition) | ~3,500 | > 2,500 | > 50,000 |
| System interconnect | Aries | Gemini | 5D Torus | Aries | Dual-rail EDR-IB | Aries | 2nd-generation Intel Omni-Path Architecture |
| File system | 7.6 PB, 168 GB/s, Lustre | 32 PB, 1 TB/s, Lustre | 26 PB, 300 GB/s, GPFS | 28 PB, 744 GB/s, Lustre | 120 PB, 1 TB/s, GPFS | 10 PB, 210 GB/s, Lustre (initial) | 150 PB, 1 TB/s, Lustre |

Exascale Requirements Gathering -- HEP 6/10/2015
SIMD is a particularly big pain in the backside...
...but a technologically cheap way to accelerate code
Isn’t there an easier way to get good performance on KNL and Haswell/Skylake?
Text book comp sci: (e.g. Hennessy & Patterson)
• Code optimisations should expose spatial data reference locality
• Code optimisations should expose temporal data reference locality
SIMD brings a new level of restrictiveness that is much harder to hit
• Code optimisations should expose spatial operation locality
Aren’t we going to have to make it easier to use 128/256/512/???? bit SIMD?
Plan:
• Clean slate reengineer QDP++ style interface to exploit all forms of parallelism effectively
• MPI ⊗ OpenMP ⊗ SIMD
• Keep an open strategy for OpenMP 4.0 offload
vSIMD performance portable SIMD library
Define performant classes vRealF, vRealD, vComplexF, vComplexD.
#if defined (AVX1) || defined (AVX2)
typedef __m256 dvec;
#endif
#if defined (SSE2)
typedef __m128 dvec;
#endif
#if defined (AVX512)
typedef __m512 dvec;
#endif
#if defined (QPX)
typedef vector4double dvec;
#endif
#if defined (OPENMP4)
typedef double dvec[4];
#endif
class vRealD {
dvec v;
// Define arithmetic operators
friend inline vRealD operator + (vRealD a, vRealD b);
friend inline vRealD operator - (vRealD a, vRealD b);
friend inline vRealD operator * (vRealD a, vRealD b);
friend inline vRealD operator / (vRealD a, vRealD b);
static int Nsimd(void) { return sizeof(dvec)/sizeof(double);}
};
What is the best SIMD strategy?
SIMD is most efficient for independent but identical work, e.g. applying N small dense matrix-vector multiplies in parallel:
template<int N, class simd> inline
void matmul( simd * __restrict__ x,
simd * __restrict__ y,
simd * __restrict__ z)
{
for(int i=0;i<N;i++){
for(int j=0;j<N;j++){
fmac(y[i*N+j],z[j],x[i]);
}
}
}
SIMD interleave
= x
Reduction of vector sum is bottleneck for small N
Vector = Matrix x Vector
Many vectors = many matrices x many vectors
No reduction or SIMD lane crossing operations.
Back to the Future
Q) How do we find copious independent but identical work?
A) Remember that SIMD was NOT hard in the 1980’s
Connection Machine Model CM-2 and DataVault System
The Connection Machine Model CM-2 uses thousands of processors operating in parallel to achieve
peak processing speeds of above 10 gigaflops. The DataVault mass storage system stores up to
60 gigabytes of data.
• Resurrect Jurassic data parallel programming techniques: cmfortran, HPF
• Address SIMD, OpenMP, MPI with single data parallel interface
• Map arrays to virtual nodes with user controlled layout primitives
• Conformable array operations proceed data parallel with 100% SIMD efficiency
• CSHIFT primitives handle communications
GRID parallel library
• Geometrically decompose cartesian arrays across nodes (MPI)
• Subdivide node volume into smaller virtual nodes
• Spread virtual nodes across SIMD lanes
• Use OpenMP+MPI+SIMD to process conformable array operations
• Same instructions executed on many nodes, each node operates on four virtual nodes
Over decompose the subgrids
Interleave overdecomposed subvolumes in SIMD vector
Code for single overdecomposed subvolume with fat vector data types
Processes all subvolumes in parallel with 100% SIMD efficiency
• Conclusion: Modify data layout to align data parallel operations to SIMD hardware
• Conformable array operations simple and vectorise perfectly
OVERDECOMPOSE and INTERLEAVE for SIMD
GRID data parallel CSHIFT details
• Crossing between SIMD lanes is restricted to cshifts between virtual nodes
• Code for N-virtual nodes is identical to scalar code for one, except datum is N fold bigger
(A,B,C,D) (E,F,G,H)  [two virtual subnodes]  →  (AE,BF,CG,DH)  [packed SIMD]
• CSHIFT involves a CSHIFT of SIMD, and a permute only on the surface
(AE,BF,CG,DH)  →  (BF,CG,DH,AE)  [cshift bulk]  →  (BF,CG,DH,EA)  [permute face]
→  (B,C,D,E) (F,G,H,A)  [two virtual subnodes]
• Shuffle overhead is suppressed by surface to volume ratio
GRID data parallel template library
| Ordering | Layout | Vectorisation | Data reuse |
|---|---|---|---|
| Microprocessor | Array-of-Structs (AoS) | Hard | Maximised |
| Vector | Struct-of-Arrays (SoA) | Easy | Minimised |
| Bagel | Array-of-structs-of-short-vectors (AoSoSV) | Easy | Maximised |
• Opaque C++11 containers hide layout from user
• Automatically transform layout of arrays of mathematical objects using vSIMD: scalar, vector, matrix, higher rank tensors.
vRealF, vRealD, vComplexF, vComplexD
template<class vtype> class iScalar
{
vtype _internal;
};
template<class vtype,int N> class iVector
{
vtype _internal[N];
};
template<class vtype,int N> class iMatrix
{
vtype _internal[N][N];
};
typedef Lattice<iMatrix<vComplexD,3> > LatticeColourMatrix;
typedef iMatrix<ComplexD,3> ColourMatrix;
• Define matrix, vector, scalar site operations
• Conformable array operations are data parallel on the same Grid layout
• Internal type can be SIMD vectors or scalars
LatticeColourMatrix A(Grid);
LatticeColourMatrix B(Grid);
LatticeColourMatrix C(Grid);
LatticeColourMatrix dC_dy(Grid);
C = A*B;
const int Ydim = 1;
dC_dy = 0.5*Cshift(C,Ydim, 1 )
- 0.5*Cshift(C,Ydim,-1 );
• High-level data parallel code gets 65% of peak on AVX2
• High-level data parallel code gets 160 GF (single) on KNC
• Single data parallelism model targets BOTH SIMD and threads efficiently.
High level code performance
std::vector<int> grid ({ 8,8,8,8 });
std::vector<int> simd ({ 1,1,2,2 });
std::vector<int> mpi ({ 1,1,1,1 });
std::vector<int> threads ({ 1,1,1,1 });
CartesianGrid Grid(grid,threads,simd,mpi);
LatticeColourMatrix A(Grid);
LatticeColourMatrix B(Grid);
LatticeColourMatrix C(Grid);
A = B * C;
[Figure: Ivybridge core performance (GFlop/s) for 3x3 complex multiplies vs memory footprint in bytes. 65% of peak in cache; saturates the streams bound out of cache. Series: Grid SU(3)xSU(3), USQCD QDP++, streams bandwidth limit, single precision peak. Performance on an 8^4 lattice on AVX and a 16^3 lattice on AVX512 KNC.]
LLVM/Clang++ is a cracking good compiler
• Key routine for 5d chiral fermions (DWF/Overlap/Mobius)
• Test system 2.3GHz quad-core Ivybridge; 147GF peak
• Clang++ beats ICPC at loop unrolling, object copy elision.
• G++ doesn’t support AVX on Mac OS
g++-4.9 performance on Xeon XC30 Ivybridge nodes
• 8^4 × 8 local volume
• Dual 12 core 2.7 GHz Ivybridge (Archer)
• Node peak is 1004 GF in single precision.
• 42% of peak on 1 core
• 26% of peak on 24 cores
• Intel and Clang compilers likely higher
Code examples, cshift
Differences from QDP++
• Layout → multiple Grid objects
  • 5d and 4d grids natural; no multi1d<LatticeFermion> for DWF
  • Red-black grid for checkerboard fields
• Arbitrary depth cshift (e.g. Naik term)
  • Subplanes internally addressed through block-strided descriptors
  • Cshift and Stencil objects proceed via subplane shuffling
• Stencil used for Dirac operators
  • Less need to drop to high performance plug-in code
• Using C++11 features auto, decltype, etc.
  • Home grown expression templates: no PETE template engine
  • 50+k lines of PETE code → under 200 lines
  • Arbitrary tensor nesting depth via recursive decltype
• Compiled code is around 5.5x faster!
Differences from QDP++
• Recursively infer return type of arithmetic operators
• Arbitrary depth tensor products supported
• QDP++/PETE generates over 50k LOC enumerating the cases for oLattice ⊗ Spin ⊗ Colour ⊗ Reality ⊗ iLattice
• Good example of less code and more general enabled by C++11.
// scal x scal = scal
// mat x mat = mat
// mat x scal = mat
// scal x mat = mat
// mat x vec = vec
// vec x scal = vec
// scal x vec = vec
template<class l,class r,int N>
auto operator * (const iScalar<l>& lhs,const iMatrix<r,N>& rhs)// S*M = M at this level
-> iMatrix<decltype(lhs._internal*rhs._internal[0][0]),N> // recurses to next level return type
{
typedef decltype(lhs._internal*rhs._internal[0][0]) ret_t;
iMatrix<ret_t,N> ret;
for(int c1=0;c1<N;c1++){
for(int c2=0;c2<N;c2++){
mult(&ret._internal[c1][c2],&lhs._internal,&rhs._internal[c1][c2]);
}}
return ret;
}
Stencil operators
Grid : Data Parallel QCD Library
We present progress on a new C++ data parallel QCD library. It enables the description of cartesian fields of arbitrary tensor mathematical types.
Data parallel interface with conformable array syntax, Cshift and masked operations (cf. QDP++, cmfortran or HPF).
Three distinct forms of parallelism are transparently used underneath the single simple interface:
• MPI task parallelism
• OpenMP thread parallelism
• SIMD vector parallelism.
The SIMD vector parallelism achieves nearly 100% SIMD efficiency due to the adoption of a virtual node layout transformation, similar to those in the Connection Machine.
This ensures identical and independent work lies in adjacent SIMD lanes. SSE, AVX, AVX2, AVX512 and Arm Neon SIMD targets are implemented.
The library is under development. Solvers for Wilson, Domain Wall, and multiple 5d chiral fermions (Cayley, continued fraction, partial fraction) are implemented.
Features differing from QDP++:
• Shift by arbitrary distance
• Storage:
• Checkerboarded fields are half length
• 5d fields are same type as 4d with different Grid.
• Multiple grids and Projection/Promotion support
• blockProject, blockPromote, blockInnerProduct
• Stencil object
• encapsulates geometry of operation
• Performs halo exchange
• Simple to write kernel for Dirac operators.
• C++11 : expression template engine < 200 LoC
Peter A Boyle, Azusa Yamaguchi Intel Parallel Computing Centre @ Higgs Centre for Theoretical Physics, University of Edinburgh
[Figure panels: SU3xSU3 AVX; SU3xSU3 Xeon Phi]
Guido Cossu KEK, Tsukuba
Stencil organises halo exchange for any vector type; compressor can do spin proj for Wilson fermions.
Stencil provides index of each neighbour (knows the geometry)
User dictates how to treat the internal indices in operator
Coarse grid operator in Grid
www.github.com/paboyle/Grid
Pass the stencil a list of directions and displacements
Stencil support eases the pain of optimised matrix multiplies
GCC 4.9 Performance on Archer and Edison Cray XC30 dual Ivybridge 12/24 core.
• Fermion actions
  • Wilson
  • Chiral fermions: (Cayley/ContFrac/PartFrac) × (Zolotarev/Tanh) × (Wilson/Shamir/Mobius)
• Red-black 4d and unpreconditioned
• CG, MCR, Multishift CG, GCR
• Zolotarev, Chebyshev, Remez approx
• Gauge actions
  • Quenched Wilson/Symanzik/Iwasaki
• Multigrid coarse operators
• Nersc file I/O
• Roadmap: (R)HMC, gauge and fermion force terms; FFT, measurements; link smearing
Programming style works with multiple C++11 compilers. Clang is the best on 256-bit SIMD (AVX, AVX2). ICPC is the only one supporting the full set of 512-bit intrinsics properly.
Wilson Dirac Kernel
Multi-Grid for Chiral Fermions
• Using Grid as a rapid development environment
• multi-grid transfer interface
Implementation status
• Basic Grid type system essentially complete
  • QCD types, generic SU(N), arbitrary dimension
  • Simple to port UKQCD observables code over from QDP++
  • Sum, SliceSum
  • To do: Slow and Fast Fourier Transform, gauge fixing
• Algorithms
  • CG, MCR, GCR, VPGCR
  • Chebyshev approx, Remez, Multishift CG
  • Two level VPGCR (first steps to multigrid)
  • Quenched heatbath
  • Quenched HMC
• Fermion Dirac operators
  • Even-odd and unpreconditioned have a single unified definition
  • Wilson
  • {Wilson, Shamir, Mobius} Kernel ⊗ {Zolotarev, Tanh} Approximation ⊗ {Continued Fraction, Partial Fraction, Cayley} Representation
  • DWF, Mobius as special cases
• www.github.com/paboyle/Grid
  • 30k lines of code since April!
Chiral fermion multi-grid
• HDCG (Boyle, Lat 2012) coarsened the Hermitian even-odd matrix: next-to-next-to-next-to-nearest neighbour
• HDCG setup too expensive for (R)HMC acceleration
  • Biggest problem is recomputation of the coarse operator on an evolved configuration
• Negative mass Wilson breaks the half-plane condition for Krylov solvers
  • Polynomial approximation to the complex f(z) = 1/z on an open region enclosing the origin is hard
  • No analytic function can wind the phase around the pole
• Resolution:
  • Normal equations: M†M is Hermitian pos. def. with positive real spectrum
• Forces non-local coarse space representation; not a true multigrid as sparsity is not retained
• Use Hermitian indefinite solvers like MCR on
  • γ5 R5 M for DWF/Tanh
  • γ5 M for partial fraction, continued fraction Zolotarev/Tanh
• Successfully solving DWF and PF, CF Hermitian indefinite operators
  • No algorithmic loss compared to CG
  • Implemented coarsening, two level GCR
  • Tuning smoothers but no multigrid speedup as yet. Work in progress!
  • Kobe update: using both η and γ5 R5 η has substantially improved the coarse space condition number! Still no overall speed up.
Comparison with BFM code
Instructive to compare BFM code for the Wilson kernel to compiler-generated Grid code: a measure of the efficiency of compilers at placing intrinsics in registers.
| Compiler | SIMD | Load/stores (single site Wilson) |
|---|---|---|
| Compulsory | — | 12 × 8 + 9 × 8 + 12 = 180 |
| Bagel | avx512 | 200 |
| icpc | avx512 | 676 |
| Clang++ | avx | 665 |
| g++ | sse4 | 660 |
| icpc | avx | 615 |
• The “perfect” BFM register allocation strategy was indicated to the compiler with 31 temporaries
• Equivalent to inserting neon lights in the code flashing“this is the solution to your reg allocation problem”
• Compilers survive by relying heavily on the L1 cache
• Intel compiler is performing object copies, while Clang++ is eliminating these
• May still end up with a small amount of Bagel code for Wilson term and AVX512. Will see how we do with KNL.
Summary
• Ground up development of new library for Lattice QCD designed to be fast
• Intended to be efficient on the next generation of systems
  • SSE, AVX, AVX2, AVX512 implemented; ARM Neon partially implemented
  • Plan backport to bgclang
  • Plan OpenMP 4.0 offload targets also
• Plan broad range of Fermion action, Gauge evolution support
  • Rate of progress is rapid
  • Already a vehicle for algorithm research
  • Possibly non-QCD field theories
• In our (unbiased !?) view it is rather good!