CL2QCD
Lattice QCD with OpenCL
https://github.com/CL2QCD/cl2qcd

Developers: Alessandro Sciarra, Christopher Czaban, Francesca Cuteri
Past developers: Matthias Bach, Christopher Pinke, Frederik Depta, Christian Schäfer, Lars Zeidlewicz

Alessandro Sciarra, Lattice Seminar, DESY (Zeuthen), 20.06.2016
Lattice group of Owe Philipsen, Goethe Universität Frankfurt
Motivation and Philosophy
Our physics motivation (I)
QCD THERMODYNAMICS
N_s^3 × N_t lattice → T = 1/(a·N_t), with 1 ≪ N_t ≪ N_s

[Figure: second order scenario vs. first order scenario]
Alessandro Sciarra (Goethe Universität) CL2 QCD HIC for FAIR – 20.06.2016 2 / 21
Our physics motivation (II)
[Figure: Columbia-like phase diagram in the (m_{u,d}, m_s, µ/T) space, showing the Roberge–Weiss (RW) plane, the Z_2 and O(4) regions, the points A and B, first order triple lines and the m_s = ∞ limit]
[M. D'Elia, O. Philipsen et al., Phys.Rev. D90 (2014)]

In the volume below the plane µ = 0, simulations are safe.
The A point in the Roberge–Weiss plane is at a value of the mass that can be simulated.
The position of the point B determines the type of transition at the N_f = 2 chiral point at µ = 0.
The idea is to study the line connecting A and B in the m_s = ∞ plane.
Why do we need a fast code?

Just to give an idea of how costly it is...

for Nt in ...; do                  # ~3 values
  for mu in ...; do                # ~6 values
    for mass in ...; do            # ~6 values
      for Ns in ...; do            # ~3 values, >= 3*Nt
        for T in ...; do           # ~5 values
          echo "Run the (R)HMC for >50k trajectories"
          # ...
        done
      done
    done
  done
done
# Consider that the typical time of a simulation varies from weeks to months.
Considering 3 months as the average time of a single simulation...
Running the loops in order as above → 405 years
Running the inner three loops in parallel → 4.5 years × a stretch factor (roughly 1.5, from feedback and technical overhead)
Should a code just work? NO

Any code in principle should be:
- Readable
- Maintainable
- Easy to extend
- Easy to use
- Hard to break
- Testable

CL2QCD: Lattice QCD with OpenCL, Clean Code 2 Limit Questions and Doubts
Outline of the talk

1 Motivation and our philosophy
2 CL2QCD features
3 Structure of the code: Physics, Hardware
4 Programming on GPU
5 Unit tests, maintainability and portability
6 Performances of the code: Multiple GPUs
7 Ongoing and future work
The code from the user point of view (I)
Features implemented in CL2QCD
Different fermion formulations
  ✓ Wilson fermions in standard formulation
  ✓ Staggered fermions in rooted standard formulation
  ✓ Twisted Mass Wilson fermions
Improved gauge actions (for Monte Carlo simulations)
  ✓ Tree-level Symanzik
  ✓ Iwasaki
  ✓ Doubly Blocked Wilson
Pure gauge simulations (heatbath and Monte Carlo)
Standard inversion and integration algorithms
ILDG-compatible input/output (using LIME)
RANLUX pseudo-random number generator
Even-odd preconditioning
Hasenbusch's trick for Wilson fermions
The code from the user point of view (II)
Installation procedure

# Clone the git repository
cd <CL2QCD_INSTALL_DIR>
git clone https://github.com/CL2QCD/cl2qcd.git
# Make a build directory
cd <CL2QCD_INSTALL_DIR>/cl2qcd
mkdir build
cd build/
#
# Run cmake; if not found automatically, cmake variables
# can be set via command line as "-D <CMAKE_VARIABLE>=<VALUE>"
#
cmake ..
# Build all executables
make -j
The compiler must be capable of basic C++11 features
Required libraries:
  OpenCL  → http://www.khronos.org/opencl/
  LIME    → http://usqcd.jlab.org/usqcd-docs/c-lime/
  libxml2 → http://xmlsoft.org/
  Boost   → http://www.boost.org/
  GMP     → http://gmplib.org/
  MPFR    → http://www.mpfr.org/
  Nettle  → http://www.lysator.liu.se/~nisse/nettle/
The code from the user point of view (III)
EXECUTABLES
  HMC              - only with (Twisted Mass) Wilson fermions
  RHMC             - only with standard staggered fermions
  SU(3) Heatbath   - possibility of doing overrelaxation
  Inverter         - measurement of fermionic observables
  Gaugeobservables - measurement of gauge observables
Executable Options
Each executable has its helper → ./<executable> -h
Each option has a default value, used if not specified.
The helpers also list options not relevant for that algorithm.
Some options are used for profiling/benchmarks only.
...
HMC options:
  --tau arg (=0.5)
  --reversibility_check arg (=0)
  --integrationsteps0 arg (=10)
  --integrationsteps1 arg (=10)
  --integrationsteps2 arg (=10)
  --hmcsteps arg (=10)
  --num_timescales arg (=1)
  --integrator0 arg (=leapfrog)
  --integrator1 arg (=leapfrog)
  --integrator2 arg (=leapfrog)
  --lambda0 arg (=0.19318332750378359)
  --lambda1 arg (=0.19318332750378359)
  --lambda2 arg (=0.19318332750378359)
  --use_gauge_only arg (=0)
  --use_mp arg (=0)
...
Possible input file:
...
fermact=rooted_stagg
num_tastes=2
solver=cg
cgmax=8000
measure_correlators=0
cg_iteration_block_size=50
num_timescales=2
tau=1
integrator0=twomn
integrator1=twomn
integrationsteps0=3
integrationsteps1=6
nspace=24
ntime=6
mass=0.6000
rhmcsteps=100000
savefrequency=500
savepointfrequency=200
startcondition=hot
...
The output levels of the code

Einhard © 2010 Matthias Bach

Levels: OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE

Example output:
  [01:05:26] FATAL:
  [22:13:59] ERROR:
  [15:42:13] WARNING:
  [10:00:12] INFO:
  [03:35:17] DEBUG:
  [08:52:48] TRACE:

The undesired output is removed by pre-processor operations → NO OVERHEAD
The code structure

M. Bach, O. Philipsen, C. Pinke, and A. Sciarra, «CL2QCD - Lattice QCD based on OpenCL», http://arxiv.org/pdf/1411.5219.pdf
M. Bach, «Energy- and Cost-Efficient Lattice-QCD Computations using Graphics Processing Units», http://publikationen.ub.uni-frankfurt.de/frontdoor/index/index/docId/37074

META
  Parameters, Utilities, ILDG IO

PHYSICS
  Algorithms: implementation of algorithms using Lattices and Fermionmatrix objects
    HMC, RHMC, HEATBATH, SOLVER, INTEGRATOR, METROPOLIS
  Lattices: representations of lattice fields, operations on fields
    GAUGEFIELD, FERMIONFIELD, GAUGEMOMENTA
  Fermionmatrix: matrix operations on fermion fields
    WILSON, TWISTED MASS, STAGGERED
  Observables: GAUGEOBSERVABLES, ⟨ψ̄ψ⟩, CORRELATORS
  PRNG: RANLUX
  Noise Sources: POINT, Z2, GAUSSIAN, Z4

HARDWARE
  Buffers: representation of OpenCL buffers, device-dependent layout (AoS or SoA)
    SU3, SU3VEC, SPINOR, GAUGEMOMENTA, ...
  Lattices: representation of field buffers, host-device communications
    GAUGEFIELD, FERMIONFIELD, GAUGEMOMENTA
  Code: execution of specific OpenCL kernels, meta information about kernels
    SPINOR ALGEBRA, FERMIONS, MOLECULAR DYNAMICS, GAUGEFIELD, ...
  OpenCL Compiler: compilation of OpenCL kernels, reread functionality
  Device: representation of a specific device, provides Code modules
  System: representation of the current architecture, provides the available OpenCL devices

OPENCL KERNELS
  Code for execution on device, compiled at runtime
  PLAQUETTE, SAXPY, DSLASH, GAMMA5, ...
The Physics package

PHYSICS
  LATTICE FIELDS: ψ-fields, Uµ-fields, Hµ-fields
  FERMION MATRICES: Wilson, Twisted Mass, Staggered
  GAUGE OBSERVABLES: Plaquette, Polyakov Loop, Chiral Condensate
  ALGORITHMS: Integrator (Leapfrog, 2MN), Fermion force, Solver (CG, CG-M, BiCGstab)
  PRNG
  NOISE SOURCES: Point, Slices, Volume
  TESTS
The Hardware package

HARDWARE
  LATTICE BUFFERS: ψ-fields, Uµ-fields, Hµ-fields
  OPENCL BUFFERS: Plain, 3-vector, 12-vector, 8-vector, 3×3 matrix, PRNG, Real, Complex
  CODE: PRNG, Uµ-fields, Hµ-fields, ψ-fields, Heatbath, Molecular Dynamics, Correlator, Buffers, Kappa
  SYSTEM
  DEVICE
  OPENCL COMPILER
  TESTS
General Purpose Graphics Processing Unit

CARD                     CHIP           MEMORY {GB}  PEAK DP {GFLOPS}  PEAK BW {GB/s}  CLOCK {MHz}  YEAR
AMD Radeon HD 5870       Cypress        1            544               154             850          2009
AMD Radeon HD 7970       Tahiti         3            947               264             925          2012
AMD FirePro S10000       Tahiti Pro GL  2×3 - 6      2×1480            2×240           825          2012
AMD FirePro S9050        Tahiti         12           806               264             900          2014
AMD FirePro S9150        Hawaii         16           2530              320             900          2014
AMD FirePro S9170        Grenada        32           2620              320             930          2015
AMD FirePro S9300        Capsaicin      2×4 (HBM)    868               2×512           850          2016
NVIDIA GeForce GTX 680   Kepler         2 - 4        129               192             1006         2012
NVIDIA Tesla K40         Kepler         12           1680              288             745          2013
NVIDIA Tesla K80         Kepler         2×12         2912              2×240           560          2014

Arithmetic intensity: ρ = (number of FLOPs) / (number of bytes to read and write)
  Wilson fermions:    ρ(D̸) ≈ 0.57
  Staggered fermions: ρ(D_KS) ≈ 0.35

«FLOPS do not count.» – Clark
CPU vs GPU code

saxpy: φ = α·ψ1 + ψ2

On CPU:
void saxpy(const spinor *x, const spinor *y,
           const hmc_complex *alpha, spinor *out)
{
    for (unsigned int index = 0; index < VTOT; index += 1) {
        const spinor tmp = spinor_times_complex(x[index], alpha);
        out[index] = spinor_acc(y[index], tmp);
    }
}
On GPU:
__kernel void saxpy(__global const spinor *x, __global const spinor *y,
                    __global const hmc_complex *alpha, __global spinor *out)
{
    int id = get_global_id(0);
    int gs = get_global_size(0);
    for (unsigned int idMem = id; idMem < SPINORFIELDSIZE_MEM; idMem += gs) {
        const spinor tmp = spinor_times_complex(x[idMem], alpha);
        out[idMem] = spinor_acc(y[idMem], tmp);
    }
}
Crucial concepts

Robert C. Martin (2009), «Clean Code»
Kent Beck (2002), «Test Driven Development: By Example»

Test each single part of the code on its own.
Unit tests are implemented using the Boost and CMake unit test frameworks.
Regression tests for the OpenCL parts are absolutely mandatory.
LQCD functions are local → analytic results to test against can be calculated.
Avoid dependence of the tests on specific environments.

Time spent at present: Testing 33%, Developing 33%, Refactoring 25%, Optimisation 8%
[Timeline 2011-2018: more development in the past, more refactoring planned ahead]
The Dirac operator kernel

[Plot: performance of the Wilson /D kernel — achieved memory bandwidth (GB/s) and GFLOPS (up to ~150) vs lattice size, from 16³×8 up to 48³×12]
[Plot: performance of the staggered D_KS kernel — achieved memory bandwidth (GB/s) and GFLOPS (up to ~100) vs lattice size, same range]
[Secondary axes: achieved fraction of peak bandwidth ⟨BW⟩/BW_max, 50–100%]

Benchmarked devices and peak memory bandwidths: AMD Radeon HD 7970 (264 GB/s), AMD Radeon HD 5870 (154 GB/s), NVIDIA Tesla K40 (288 GB/s), AMD FirePro S10000 (240 GB/s), AMD FirePro S9150 (320 GB/s)
CL2QCD vs tmLQCD (in 2013)

[Bar chart: execution time in s for the three setups — A: 6459, 6469, 2178; B: 3458, 3410, 1021; C: 1602, 1592, 553 — comparing AMD FirePro S10000 (2012), AMD Radeon HD 5870 (2009), and tmLQCD on 2 AMD Opteron 6172 (2010)]

Setup | mπ [MeV]
A     | 260
B     | 310
C     | 520

β = 3.9, κ_c = 0.160856, N_τ = 8, N_σ = 24, at maximal twist

Runs done on LOEWE-CSC
tmLQCD has been run on a whole node (24 cores)
Price per flop for the GPUs is much lower than for the CPUs
1 GPU ≈ 4 CPUs

M. Bach et al., «Twisted-Mass Lattice QCD using OpenCL», http://pos.sissa.it/archive/conferences/187/032/LATTICE%202013_032.pdf
Multi-GPU scaling

CL2QCD can use multiple GPUs within the same node
At the moment, the lattice can be split only along the temporal direction
Tested on SANAM (AMD FirePro S10000, in 2013), 4 GPUs per node

[Plot: strong scaling — solver performance (GFLOPS) vs number of GPUs (1–4) with the total lattice size kept constant: 32³×12, 48³×16, 32³×64, 24³×128]
[Plot: weak scaling — solver performance (GFLOPS) vs number of GPUs (1–4) with the local lattice size kept constant: 32³×16, 24³×32]
Plausible future work

Roadmap 2016–2018:
• Arbitrary parallelization direction
• Pure Wilson RHMC
• Staggered pion mass
• Clover term
• Multiple pseudofermions
• Physics analytic tests
• Smearing?
• Disentangle all packages + work on CMake
• Parallelization in >1 direction
• BaHaMAS
• Cleaner and easier-to-develop code

https://github.com/CL2QCD/cl2qcd