Research Presentation for Computer Modelling Group Ltd.
Malcolm Roberts
University of Strasbourg
2016-04-26
[email protected], www.malcolmiwroberts.com
Outline
I Convolutions
  I Implicitly dealiased FFT-based convolutions
  I Shared-memory implementation
  I Parallel OpenMP/MPI implementation
  I Pseudospectral simulations
I GPU programming
  I OpenCL
  I schnaps
  I Performance analysis
Malcolm Roberts malcolmiwroberts.com 2
FFT-based convolutions
The convolution of $\{F_k\}_{k=0}^{m-1}$ and $\{G_k\}_{k=0}^{m-1}$ is
$$(F \star G)_k = \sum_{\ell=0}^{k} F_\ell G_{k-\ell}, \qquad k = 0, \dots, m-1. \tag{1}$$
[Figure: example sequences F and G, and their convolution F ∗ G.]
FFT-based convolutions
Applications:
I Signal processing
I Machine learning: convolutional neural networks
I Image processing
I Particle image velocimetry
I Pseudospectral simulations of nonlinear PDEs
The convolution theorem:
$$\mathcal{F}[F \star G] = \mathcal{F}[F] \odot \mathcal{F}[G]. \tag{2}$$
Using FFTs improves speed and accuracy.
FFT-based convolutions
Let $\zeta_m = \exp(2\pi i/m)$. Forward and backward Fourier transforms are given by
$$f_j = \sum_{k=0}^{m-1} \zeta_m^{jk} F_k, \qquad F_k = \frac{1}{m} \sum_{j=0}^{m-1} \zeta_m^{-kj} f_j. \tag{3}$$
We will use the identity
$$\sum_{j=0}^{m-1} \zeta_m^{\ell j} =
\begin{cases}
m & \text{if } \ell = sm \text{ for } s \in \mathbb{Z},\\[2pt]
\dfrac{1-\zeta_m^{\ell m}}{1-\zeta_m^{\ell}} = 0 & \text{otherwise.}
\end{cases} \tag{4}$$
FFT-based convolutions
The convolution theorem works because
$$\begin{aligned}
\sum_{j=0}^{m-1} f_j g_j \zeta_m^{-jk}
&= \sum_{j=0}^{m-1} \zeta_m^{-jk} \left( \sum_{p=0}^{m-1} \zeta_m^{jp} F_p \right) \left( \sum_{q=0}^{m-1} \zeta_m^{jq} G_q \right) \\
&= \sum_{p=0}^{m-1} F_p \sum_{q=0}^{m-1} G_q \sum_{j=0}^{m-1} \zeta_m^{j(-k+p+q)} \\
&= m \sum_{s} \sum_{p=0}^{m-1} F_p G_{k-p+sm}.
\end{aligned} \tag{5}$$
The terms with $s \neq 0$ are aliases; they contaminate the result.
Conventional dealiasing: zero padding
Let $\tilde{F} = \{F_0, F_1, \dots, F_{m-2}, F_{m-1}, \underbrace{0, \dots, 0}_{m}\}$. Then
$$\left(\tilde{F} *_{2m} \tilde{G}\right)_k
= \sum_{\ell=0}^{2m-1} \tilde{F}_{\ell \bmod 2m}\, \tilde{G}_{(k-\ell) \bmod 2m}
= \sum_{\ell=0}^{m-1} F_\ell\, \tilde{G}_{(k-\ell) \bmod 2m}
= \sum_{\ell=0}^{k} F_\ell\, G_{k-\ell}. \tag{6}$$
There is also a "2/3"-padded version for pseudospectral simulations, where the input $\{F_k\}_{k=-m}^{m-1}$ is padded to $3m$.
Dealiasing with conventional zero-padding
[Diagram: explicit dealiasing — pad F and G with zeros, transform to f and g, multiply pointwise, and transform back to obtain F ∗ G.]
Dealiasing with implicit zero-padding
We modify the FFT to account for the zeros implicitly.
Let $\zeta_n = \exp(-2\pi i/n)$. The Fourier transform of $\tilde{F}$ is
$$f_x = \sum_{k=0}^{2m-1} \zeta_{2m}^{xk} \tilde{F}_k = \sum_{k=0}^{m-1} \zeta_{2m}^{xk} F_k. \tag{7}$$
We can compute this using two discontiguous buffers:
$$f_{2x} = \sum_{k=0}^{m-1} \zeta_m^{xk} F_k, \qquad f_{2x+1} = \sum_{k=0}^{m-1} \zeta_m^{xk} \left(\zeta_{2m}^{k} F_k\right). \tag{8}$$
Dealiasing with implicit zero-padding

[Animation: F and G are transformed by FFT_x^{-1} into even-n_x and odd-n_x parts held in separate buffers, multiplied pointwise, and the even and odd parts of FFT_x^{-1}{F ∗ G} are recombined to yield F ∗ G.]
Shared-memory implementation
I Implicit dealiasing requires less memory.
I We avoid FFTs on zero data.
I By using discontiguous buffers, we can use multiple NUMA nodes.
I SSE2 vectorization instructions.
I Additional threads require additional sub-dimensional work buffers.
I We use strides instead of transposes because we need to multi-thread.
Multi-threaded performance: 1D
[Plot: performance, m log₂ m / time (ns⁻¹), vs m from 10² to 10⁶; curves for Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
Multi-threaded performance: 2D
[Plot: performance, m² log₂ m² / time (ns⁻¹), vs m from 10² to 10³; curves for Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
Multi-threaded performance: 3D
[Plot: performance, m³ log₂ m³ / time (ns⁻¹), vs m from 10¹ to 10²; curves for Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
Multi-threaded speedup: 3D
[Plot: relative speed vs m from 10¹ to 10²; T=1 vs T=4.]
Distributed-memory implementation
I Implicit dealiasing requires less communication.
I By using discontiguous buffers, we can overlap communication and computation.
I We use a hybrid OpenMP/MPI parallelization for clusters of multi-core machines.
I 2D MPI data decomposition.
I We make use of the hybrid transpose algorithm.
Hybrid MPI Transpose
Matrix transposes are an essential primitive of high-performance computing: they localize data on one process so that shared-memory algorithms can be applied.
I will discuss two algorithms for transposes:
I Direct Transpose.
I Recursive Transpose.
We combine these into a hybrid transpose.
Direct (AlltoAll) Transpose
I Most direct method.
I Efficient for P ≪ m (large messages).
I Many small messages when P ≈ m.
Implementations:
I MPI_Alltoall
I MPI_Send, MPI_Recv
Direct (AlltoAll) Transpose

[Animation: processes 0–7 perform the transpose directly, each process exchanging one block with every other process.]
Recursive Transpose
I Efficient for P ≈ m (small messages).
I Recursively subdivides the transpose into smaller block transposes.
I log m phases.
I Communications are grouped to reduce latency.
I Requires intermediate communication.
Implementations:
I FFTW
Recursive Transpose

[Animation: processes 0–7 perform the transpose in recursive stages, exchanging grouped blocks within successively smaller subsets of processes.]
Hybrid Transpose
I Recursive, but just one level.
I Uses the empirical properties of the cluster to determine the best parameters.
I Optionally groups messages to reduce latency.
Implementation:
I FFTW++

Direct transpose communication cost: $\frac{P-1}{P^2} m^2$ data, $P-1$ messages.
Hybrid cost with $P = ab$: $\frac{(a-1)b\,m^2}{P^2} + \frac{(b-1)a\,m^2}{P^2}$ data, $a + b - 2$ messages.
Hybrid Transpose
Let $\tau_\ell$ be the message latency, and $\tau_d$ the time to send one element. The time to send $n$ elements is
$$\tau_\ell + n\tau_d. \tag{9}$$
The time required to do a direct transpose is
$$T_D = \tau_\ell (P-1) + \tau_d \frac{P-1}{P^2} m^2 = (P-1)\left(\tau_\ell + \tau_d \frac{m^2}{P^2}\right). \tag{10}$$
The time for a block transpose is
$$T_B(a) = \tau_\ell \left(a + \frac{P}{a} - 2\right) + \tau_d \left(2P - a - \frac{P}{a}\right)\frac{m^2}{P^2}. \tag{11}$$
Hybrid Transpose
[Plot: communication cost vs P from 10¹ to 10³; curves for Zero Latency, Direct, Block.]
Hybrid Transpose
[Plot: transpose time (µs) vs nodes × threads (1024×1, 512×2, 256×4, 128×8) for a 4096² matrix; FFTW vs hybrid.]
Hybrid Transpose
The hybrid transpose
I Uses a direct transpose for large message sizes.
I Uses a block transpose for small message sizes.
I Offers a performance advantage when P ≈ m.
I Can be tuned based upon the values of τ_ℓ and τ_d for the cluster.
We use the hybrid transpose when computing convolutions with implicit dealiasing on clusters.
MPI Convolution: 2D performance
[Plot: performance, m² log₂ m² / time (ns⁻¹), vs m from 10² to 10⁴; Implicit and Explicit for P=24, 48, 96, T=1.]
MPI Convolution: 2D performance
[Plot: relative speed vs m from 10² to 10⁴; P=24, 48, 96, T=1.]
MPI Convolution: multithreaded 2D performance
[Plot: performance, m² log₂ m² / time (ns⁻¹), vs m from 10² to 10⁴; Implicit and Explicit for P=1, 2, 4, T=24.]
MPI Convolution: 3D performance
[Plot: performance, m³ log₂ m³ / time (ns⁻¹), vs m from 10² to 10³; Implicit and Explicit for P=24, 48, 96, T=1.]
MPI Convolution: 3D performance
[Plot: relative speed vs m around 10²; P=24, 48, 96, T=1.]
MPI Convolution: multithreaded 3D performance
[Plot: performance, m³ log₂ m³ / time (ns⁻¹), vs m from 10² to 10³; Implicit and Explicit for P=1, 2, 4, T=24.]
MPI Convolution: 3D scaling
[Plot: speedup vs number of cores (24, 48, 96, 768, 1536) for problem sizes 512², 1024², 2048², 4096², 8192², 16384², 32768².]
Application: Pseudospectral simulation
Application: Pseudospectral simulation
[Video: robertsetal_tubulent_helical_mhd.avi (video/avi)]
Convolutions Summary
Implicitly dealiased convolutions:
I use less memory,
I have lower communication costs,
I and are faster than conventional zero-padding techniques.
The hybrid transpose is faster for small message sizes.
Collaboration with John Bowman, University of Alberta.
Implementation in the open-source project FFTW++:
fftwpp.sf.net
We have around 13 000 downloads (plus clones).
Running on GPUs
General-purpose GPU computing has two advantages:
I High performance
I Low energy consumption
There are a variety of options for running on GPUs:
I CUDA: libraries and tools available; Nvidia-only.
I OpenMP 4.0: pragma-based, high-level.
I OpenACC: being rolled into OpenMP.
I OpenCL: similar to CUDA, but released later.
  I Works on all vendors, very flexible.
  I Runs on GPUs, CPUs, and MICs (Xeon Phi).
OpenCL
One writes a normal program in which the code for the GPU is contained in a string.
At run-time, the program:
1. Selects the OpenCL platform(s) and device(s).
2. Creates an OpenCL context and queue.
3. Compiles the programs into kernels.
4. Allocates buffers on the device.
5. Launches kernels in the queue: managed with events.
OpenCL
Kernels are the code from the interior of loops. Example: the C code

```c
void myfunc(double *a, double *b, int n) {
  for (int i = 0; i < n; ++i) {
    a[i] *= b[i];
  }
}
```

becomes:

```c
__kernel void mykernel(__global double *a, __global double *b) {
  int i = get_global_id(0);
  a[i] *= b[i];
}
```
OpenCL
Since the kernel has no loop dependencies, everything is vectorized: even RAM buffers are aligned to the vector width.
The __global keyword specifies that one uses the global device memory.
One has access to the cache with __local; if one wants to have data in the cache, one writes a loop to put it there.
Coalescent memory access is crucial.
So one has a lot of control, but there is a bit more work.
But the performance is good!
OpenCL
We developed a discontinuous-Galerkin code for solving hyperbolic conservation laws:
schnaps
Solver for Conservative Hyperbolic Non-linear systems Applied to PlasmaS
$$\frac{\partial w}{\partial t} + \sum_{k=1}^{d} \frac{\partial}{\partial x^k} F^k(w) = S. \tag{12}$$
schnaps
schnaps
Discontinuous Galerkin method:
I Deals well with complex geometries.
I Local refinement: non-uniform grid.
OpenCL implementation:
I Hexahedral elements for coalescent memory access.
I Macrocell / subcell formulation.
I Array of structs of arrays: yet more coalescence.
schnaps
schnaps
But, is it fast?
Performance analysis of schnaps
[Plot: schnaps run time (seconds, 10⁻¹–10⁰) vs refinement (~10¹); C vs OpenCL.]
Performance analysis of schnaps
clFFT, an FFT library written in OpenCL by AMD.
[Plot: clFFT time (seconds, 10⁻⁴–10⁻¹) vs problem size (10²–10³); CPU vs GPU.]
Performance analysis of schnaps
[Plot: schnaps run time (seconds) vs refinement (~10¹); CPU vs GPU.]
Performance analysis of schnaps
schnaps works well on the Xeon Phi.
[Plot: time (s, 10¹–10³) vs N (10¹–10³); 12 CPU cores, 1 GPU, 1 MIC.]
schnaps summary
We observe that:
1. The C code makes use of all the cores.
2. The C and OpenCL code speeds on the CPU are close for large problem sizes.
3. The performance difference of schnaps between the CPU and GPU is near what we should expect.
Thus, we claim that our code makes effective use of the GPU.
We can further improve the code by profiling.
Collaboration with Philippe Helluy and TONUS, University of Strasbourg.
Example simulation: Maxwell’s equations
Conclusion
I presented two projects:
I FFTW++
  I Implicitly dealiased convolutions: faster, less memory.
  I OpenMP and/or MPI implementation.
  I Hybrid MPI transpose.
  I Applicable to a wide variety of situations.
I schnaps
  I OpenCL implementation of the discontinuous Galerkin method.
  I Good performance on the CPU, GPU, and MIC.
Thank you for your attention!
Timing statistics
[Histogram: frequency of kernel run times, roughly 5×10⁻⁵ to 7×10⁻⁵ s.]