Research Presentation for Computer Modelling Group Ltd.
Malcolm Roberts
University of Strasbourg
2016-04-26
[email protected], www.malcolmiwroberts.com
Outline
I Convolutions
  I Implicitly dealiased FFT-based convolutions
  I Shared-memory implementation
  I Parallel OpenMP/MPI implementation
  I Pseudospectral simulations
I GPU programming
  I OpenCL
  I schnaps
  I Performance analysis
Malcolm Roberts malcolmiwroberts.com 2
FFT-based convolutions
The convolution of $\{F_k\}_{k=0}^{m-1}$ and $\{G_k\}_{k=0}^{m-1}$ is
$$(F \star G)_k = \sum_{\ell=0}^{k} F_\ell G_{k-\ell}, \qquad k = 0, \dots, m-1. \tag{1}$$
[Figure: example sequences F and G, and their convolution F ∗ G.]
FFT-based convolutions
Applications:
I Signal processing
I Machine learning: convolutional neural networks
I Image processing
I Particle image velocimetry
I Pseudospectral simulations of nonlinear PDEs
The convolution theorem:
$$\mathcal{F}[F \star G] = \mathcal{F}[F] \odot \mathcal{F}[G]. \tag{2}$$
Using FFTs improves speed and accuracy.
FFT-based convolutions
Let $\zeta_m = \exp(2\pi i/m)$. Forward and backward Fourier transforms are given by
$$f_j = \sum_{k=0}^{m-1} \zeta_m^{jk} F_k, \qquad F_k = \frac{1}{m} \sum_{j=0}^{m-1} \zeta_m^{-kj} f_j. \tag{3}$$
We will use the identity
$$\sum_{j=0}^{m-1} \zeta_m^{\ell j} =
\begin{cases}
m & \text{if } \ell = sm \text{ for } s \in \mathbb{Z},\\[2pt]
\dfrac{1-\zeta_m^{\ell m}}{1-\zeta_m^{\ell}} = 0 & \text{otherwise.}
\end{cases} \tag{4}$$
FFT-based convolutions
The convolution theorem works because
$$\begin{aligned}
\sum_{j=0}^{m-1} f_j g_j \zeta_m^{-jk}
&= \sum_{j=0}^{m-1} \zeta_m^{-jk} \left( \sum_{p=0}^{m-1} \zeta_m^{jp} F_p \right) \left( \sum_{q=0}^{m-1} \zeta_m^{jq} G_q \right) \\
&= \sum_{p=0}^{m-1} F_p \sum_{q=0}^{m-1} G_q \sum_{j=0}^{m-1} \zeta_m^{j(-k+p+q)} \\
&= m \sum_{s} \sum_{p=0}^{m-1} F_p G_{k-p+sm}.
\end{aligned} \tag{5}$$
The terms with $s \neq 0$ are aliases; they contaminate the result.
Conventional dealiasing: zero padding
Let $\tilde{F} = \{F_0, F_1, \dots, F_{m-2}, F_{m-1}, \underbrace{0, \dots, 0}_{m}\}$. Then
$$\left(\tilde{F} *_{2m} \tilde{G}\right)_k
= \sum_{\ell=0}^{2m-1} \tilde{F}_{\ell \bmod 2m}\, \tilde{G}_{(k-\ell) \bmod 2m}
= \sum_{\ell=0}^{m-1} F_\ell\, \tilde{G}_{(k-\ell) \bmod 2m}
= \sum_{\ell=0}^{k} F_\ell\, G_{k-\ell}. \tag{6}$$
There is also a "2/3"-padded version for pseudospectral simulations, where the input $\{F_k\}_{k=-m}^{m-1}$ is padded to $3m$.
Dealiasing with conventional zero-padding
[Diagram: explicit dealiasing — pad F and G with zeros, transform to f and g, multiply pointwise, and transform back to obtain F ∗ G.]
Dealiasing with implicit zero-padding
We modify the FFT to account for the zeros implicitly.
Let $\zeta_n = \exp(-2\pi i/n)$. The Fourier transform of $\tilde{F}$ is
$$f_x = \sum_{k=0}^{2m-1} \zeta_{2m}^{xk} \tilde{F}_k = \sum_{k=0}^{m-1} \zeta_{2m}^{xk} F_k. \tag{7}$$
We can compute this using two discontiguous buffers:
$$f_{2x} = \sum_{k=0}^{m-1} \zeta_m^{xk} F_k, \qquad f_{2x+1} = \sum_{k=0}^{m-1} \zeta_m^{xk} \left(\zeta_{2m}^{k} F_k\right). \tag{8}$$
Dealiasing with implicit zero-padding

[Animation: F and G are transformed by FFT_x^{-1} into even-n_x and odd-n_x parts held in separate buffers, multiplied pointwise, and the even and odd parts of FFT_x^{-1}{F ∗ G} are recombined to yield F ∗ G.]
Shared-memory implementation
I Implicit dealiasing requires less memory.
I We avoid FFTs on zero data.
I By using discontiguous buffers, we can use multiple NUMA nodes.
I SSE2 vectorization instructions.
I Additional threads require additional sub-dimensional work buffers.
I We use strides instead of transposes because we need to multi-thread.
Multi-threaded performance: 1D
[Plot: performance, m log₂ m / time (ns⁻¹), vs m from 10² to 10⁶; curves for Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
Multi-threaded performance: 2D
[Plot: performance, m² log₂ m² / time (ns⁻¹), vs m from 10² to 10³; curves for Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
Multi-threaded performance: 3D
[Plot: performance, m³ log₂ m³ / time (ns⁻¹), vs m from 10¹ to 10²; curves for Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
Multi-threaded speedup: 3D
[Plot: relative speed vs m from 10¹ to 10²; T=1 vs T=4.]
Distributed-memory implementation
I Implicit dealiasing requires less communication.
I By using discontiguous buffers, we can overlap communication and computation.
I We use a hybrid OpenMP/MPI parallelization for clusters of multi-core machines.
I 2D MPI data decomposition.
I We make use of the hybrid transpose algorithm.
Hybrid MPI Transpose
Matrix transposes are an essential primitive of high-performance computing: they localize data on one process so that shared-memory algorithms can be applied.
I will discuss two algorithms for transposes:
I Direct Transpose.
I Recursive Transpose.
We combine these into a hybrid transpose.
Direct (AlltoAll) Transpose
I Most direct method.
I Efficient for P ≪ m (large messages).
I Many small messages when P ≈ m.
Implementations:
I MPI_Alltoall
I MPI_Send, MPI_Recv
Direct (AlltoAll) Transpose

[Animation: processes 0–7 perform the transpose directly, each process exchanging one block with every other process.]
Recursive Transpose
I Efficient for P ≈ m (small messages).
I Recursively subdivides the transpose into smaller block transposes.
I log m phases.
I Communications are grouped to reduce latency.
I Requires intermediate communication.
Implementations:
I FFTW
Recursive Transpose

[Animation: processes 0–7 perform the transpose in recursive stages, exchanging grouped blocks within successively smaller subsets of processes.]
Hybrid Transpose
I Recursive, but just one level.
I Uses the empirical properties of the cluster to determine the best parameters.
I Optionally groups messages to reduce latency.
Implementation:
I FFTW++

Direct transpose communication cost: $\frac{P-1}{P^2} m^2$ data, $P-1$ messages.
Hybrid cost with $P = ab$: $\frac{(a-1)b\,m^2}{P^2} + \frac{(b-1)a\,m^2}{P^2}$ data, $a + b - 2$ messages.
Hybrid Transpose
Let $\tau_\ell$ be the message latency, and $\tau_d$ the time to send one element. The time to send $n$ elements is
$$\tau_\ell + n\tau_d. \tag{9}$$
The time required to do a direct transpose is
$$T_D = \tau_\ell (P-1) + \tau_d \frac{P-1}{P^2} m^2 = (P-1)\left(\tau_\ell + \tau_d \frac{m^2}{P^2}\right). \tag{10}$$
The time for a block transpose is
$$T_B(a) = \tau_\ell \left(a + \frac{P}{a} - 2\right) + \tau_d \left(2P - a - \frac{P}{a}\right)\frac{m^2}{P^2}. \tag{11}$$
Hybrid Transpose
[Plot: communication cost vs P from 10¹ to 10³; curves for Zero Latency, Direct, Block.]
Hybrid Transpose
[Plot: transpose time (µs) vs nodes × threads (1024×1, 512×2, 256×4, 128×8) for a 4096² matrix; FFTW vs hybrid.]
Hybrid Transpose
The hybrid transpose
I Uses a direct transpose for large message sizes.
I Uses a block transpose for small message sizes.
I Offers a performance advantage when P ≈ m.
I Can be tuned based upon the values of τ_ℓ and τ_d for the cluster.
We use the hybrid transpose when computing convolutions with implicit dealiasing on clusters.
MPI Convolution: 2D performance
[Plot: performance, m² log₂ m² / time (ns⁻¹), vs m from 10² to 10⁴; Implicit and Explicit for P=24, 48, 96, T=1.]
MPI Convolution: 2D performance
[Plot: relative speed vs m from 10² to 10⁴; P=24, 48, 96, T=1.]
MPI Convolution: multithreaded 2D performance
[Plot: performance, m² log₂ m² / time (ns⁻¹), vs m from 10² to 10⁴; Implicit and Explicit for P=1, 2, 4, T=24.]
MPI Convolution: 3D performance
[Plot: performance, m³ log₂ m³ / time (ns⁻¹), vs m from 10² to 10³; Implicit and Explicit for P=24, 48, 96, T=1.]
MPI Convolution: 3D performance
[Plot: relative speed vs m around 10²; P=24, 48, 96, T=1.]
MPI Convolution: multithreaded 3D performance
[Plot: performance, m³ log₂ m³ / time (ns⁻¹), vs m from 10² to 10³; Implicit and Explicit for P=1, 2, 4, T=24.]
MPI Convolution: 3D scaling
[Plot: speedup vs number of cores (24, 48, 96, 768, 1536) for problem sizes 512², 1024², 2048², 4096², 8192², 16384², 32768².]
Application: Pseudospectral simulation
Application: Pseudospectral simulation
[Video: robertsetal_tubulent_helical_mhd.avi (video/avi)]
Convolutions Summary
Implicitly dealiased convolutions:
I use less memory,
I have lower communication costs,
I and are faster than conventional zero-padding techniques.
The hybrid transpose is faster for small message sizes.
Collaboration with John Bowman, University of Alberta.
Implementation in the open-source project FFTW++:
fftwpp.sf.net
We have around 13 000 downloads (plus clones).
Running on GPUs
General-purpose GPU computing has two advantages:
I High performance
I Low energy consumption
There are a variety of options for running on GPUs:
I CUDA: libraries and tools available; Nvidia-only.
I OpenMP 4.0: pragma-based, high-level.
I OpenACC: being rolled into OpenMP.
I OpenCL: similar to CUDA, but released later.
  I Works on all vendors, very flexible.
  I Runs on GPUs, CPUs, and MICs (Xeon Phi).
OpenCL
One writes a normal program in which the code for the GPU is contained in a string.
At run-time, the program:
1. Selects the OpenCL platform(s) and device(s).
2. Creates an OpenCL context and queue.
3. Compiles the programs into kernels.
4. Allocates buffers on the device.
5. Launches kernels in the queue: managed with events.
OpenCL
Kernels are the code from the interior of loops. Example: the C code

```c
void myfunc(double *a, double *b, int n) {
  for (int i = 0; i < n; ++i) {
    a[i] *= b[i];
  }
}
```

becomes:

```c
__kernel void mykernel(__global double *a, __global double *b) {
  int i = get_global_id(0);
  a[i] *= b[i];
}
```
OpenCL
Since the kernel has no loop dependencies, everything is vectorized: even RAM buffers are aligned to the vector width.
The __global keyword specifies that one uses the global device memory.
One has access to the cache with __local; if one wants to have data in the cache, one writes a loop to put it there.
Coalescent memory access is crucial.
So one has a lot of control, but there is a bit more work.
But the performance is good!
OpenCL
We developed a discontinuous-Galerkin code for solving hyperbolic conservation laws:
schnaps
Solver for Conservative Hyperbolic Non-linear systems Applied to PlasmaS
$$\frac{\partial w}{\partial t} + \sum_{k=1}^{d} \frac{\partial}{\partial x^k} F^k(w) = S. \tag{12}$$
schnaps
schnaps
Discontinuous Galerkin method:
I Deals well with complex geometries.
I Local refinement: non-uniform grid.
OpenCL implementation:
I Hexahedral elements for coalescent memory access.
I Macrocell / subcell formulation.
I Array of structs of arrays: yet more coalescence.
schnaps
schnaps
But, is it fast?
Performance analysis of schnaps
[Plot: schnaps run time (seconds, 10⁻¹–10⁰) vs refinement (~10¹); C vs OpenCL.]
Performance analysis of schnaps
clFFT, an FFT library written in OpenCL by AMD.
[Plot: clFFT time (seconds, 10⁻⁴–10⁻¹) vs problem size (10²–10³); CPU vs GPU.]
Performance analysis of schnaps
[Plot: schnaps run time (seconds) vs refinement (~10¹); CPU vs GPU.]
Performance analysis of schnaps
schnaps works well on the Xeon Phi.
[Plot: time (s, 10¹–10³) vs N (10¹–10³); 12 CPU cores, 1 GPU, 1 MIC.]
schnaps summary
We observe that:
1. The C code makes use of all the cores.
2. The C and OpenCL code speeds on the CPU are close for large problem sizes.
3. The performance difference of schnaps between the CPU and GPU is near what we should expect.
Thus, we claim that our code makes effective use of the GPU.
We can further improve the code by profiling.
Collaboration with Philippe Helluy and TONUS, University of Strasbourg.
Example simulation: Maxwell’s equations
Conclusion
I presented two projects:
I FFTW++
  I Implicitly dealiased convolutions: faster, less memory.
  I OpenMP and/or MPI implementation.
  I Hybrid MPI transpose.
  I Applicable to a wide variety of situations.
I schnaps
  I OpenCL implementation of the discontinuous Galerkin method.
  I Good performance on the CPU, GPU, and MIC.
Thank you for your attention!
Timing statistics
[Histogram: frequency of kernel run times, roughly 5×10⁻⁵ to 7×10⁻⁵ s.]