
Computational Science and Engineering (Int. Master's Program)
Technische Universität München

Master's Thesis

MPI Parallelization of GPU-based Lattice Boltzmann Simulations

Author: Arash Bakhtiari
1st examiner: Prof. Dr. Hans-Joachim Bungartz
2nd examiner: Prof. Dr. Michael Bader
Assistant advisor: Dr. rer. nat. Philipp Neumann
                   Dipl.-Inf. Christoph Riesinger
                   Dipl.-Inf. Martin Schreiber

Thesis handed in on: October 7, 2013

I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.

    October 7, 2013 Arash Bakhtiari

Acknowledgments

I would like to express my gratitude to Prof. Dr. Hans-Joachim Bungartz for giving me the great opportunity to work on this project. I wish to thank Dr. rer. nat. Philipp Neumann for his scientific support. I would like to express my great appreciation to my advisors, Christoph Riesinger and Martin Schreiber, for their ongoing support of my work, for helpful discussions and for their encouragement throughout this time.


Contents

Acknowledgments

1. Introduction

2. Lattice Boltzmann Method
   2.1. Boltzmann Equation
   2.2. Lattice Boltzmann Method
   2.3. Lattice Boltzmann and HPC

3. Turbulent LBM
   3.1. Turbulence Modeling
   3.2. Overview of Simulation Approaches
   3.3. BGK-Smagorinsky Model

4. GPU Architecture
   4.1. CPU vs. GPU
   4.2. OpenCL
        4.2.1. Platform Model
        4.2.2. Execution Model
        4.2.3. Memory Model
        4.2.4. Programming Model
        4.2.5. Advanced OpenCL Event Model Usage

5. Single-GPU LBM
   5.1. OpenCL Implementation
        5.1.1. Memory layout
        5.1.2. Data Storage

6. Multi-GPU LBM
   6.1. Parallelization Models
        6.1.1. Domain Decomposition
        6.1.2. Ghost Layer Synchronization
        6.1.3. CPU/GPU Communication
   6.2. Software Design
        6.2.1. Modules
   6.3. Basic Implementation
   6.4. Overlapping Work and Communication
        6.4.1. SBK-SCQ Method
        6.4.2. MBK-SCQ Method
        6.4.3. MBK-MCQ Method
   6.5. Validation
        6.5.1. Validation Setup
        6.5.2. Multi-GPU Validation
   6.6. Performance Optimization
        6.6.1. 1K1DD Approach
        6.6.2. 1K19DD Approach
        6.6.3. 12K19DD Approach
   6.7. Performance Evaluation
        6.7.1. Simulation Platform and Setup
        6.7.2. Weak Scaling
        6.7.3. Strong Scaling

7. Multi-GPU Turbulent LBM
   7.1. OpenCL Implementation

8. Conclusion

Appendix
A. Configuration File Example

Bibliography

1. Introduction

Fluid flow plays an important role in our lives; examples range from the blood flow in our body to the air flow around a space shuttle. The field of computational fluid dynamics (CFD) has therefore always been an area of great interest for numerical simulation.

Most CFD simulations require the modeling of complex physical effects via numerical algorithms with high computational demands. Hence, these simulations have to be executed on supercomputers. The lattice Boltzmann method (LBM) is a popular class of CFD methods for fluid simulation which is suitable for massively parallel simulations because of its local memory access pattern. It is therefore the method of choice in this thesis.

Nowadays, GPGPU (general-purpose computing on graphics processing units) is becoming more and more popular in the High Performance Computing (HPC) field, due to the massively parallel compute power GPUs provide at a relatively low hardware cost. Since the LBM exposes a high degree of parallelism and has minimal dependencies between data elements, it is a perfect candidate for implementation on GPU platforms. However, due to the restricted memory available on GPUs, simulations with high memory demands require the use of multiple GPUs.

The major contribution of this thesis is the thorough design and implementation of several methods for an efficient MPI-parallelized LBM on a GPU cluster. The primary goal is to exploit advanced features of modern GPUs in order to obtain an efficient and scalable, massively parallel Multi-GPU LBM code. To achieve this goal, several sophisticated overlapping techniques are designed and implemented. In addition, software design has been an important aspect throughout this thesis; the development of extendable and maintainable software was therefore of high priority.

Furthermore, the implementation of these methods is optimized to achieve a better overall performance of the simulation software. To demonstrate that the primary goal has been met, the performance of each method is evaluated.

In addition, the GPU code is extended to simulate turbulent flows. Large Eddy Simulation (LES) with the Smagorinsky subgrid-scale (SGS) model is adopted for the turbulence simulation.

The thesis starts with an introduction to and discussion of the theory of the LBM. In chapter three, turbulence and the equations required to implement LES with the Smagorinsky model are given. The next chapter is devoted to the GPU architecture and an introduction to the OpenCL programming model and its advanced features. The Single-GPU implementation and its memory layout are presented in chapter five. Chapter six covers the parallelization models, the software design and the advanced overlapping techniques developed in this thesis; at the end of that chapter, the validation strategy, performance optimization and the evaluation of the Multi-GPU code are given. In chapter seven, the extension of the Multi-GPU code to turbulent fluid simulations is discussed.


2. Lattice Boltzmann Method

The Lattice Boltzmann Method (LBM) is a well-known approach for fluid simulation. The method has its origin in a molecular description of a fluid. Since most of the calculations in this method are performed locally, an optimized implementation normally achieves linear scalability in parallel computing. In addition, complex geometries and physical phenomena can easily be represented by this method. Depending on the underlying simulation requirements, the LBM can thus yield beneficial properties compared to Navier-Stokes based methods. However, instead of incompressible flows, lattice Boltzmann schemes simulate weakly compressible flows.

An introduction to the LBM and its history can be found in [20, 22, 19, 21] and the references therein.

    2.1. Boltzmann Equation

    The LBM is based on the Boltzmann equation:

\frac{\partial f}{\partial t} + \vec{v} \cdot \nabla f = \Delta(f - f^{eq})    (2.1)

where f(\vec{x}, \vec{v}, t) denotes the probability density for finding fluid particles in an infinitesimal volume around \vec{x} at time t having the velocity \vec{v}. The term on the right-hand side of Eq. 2.1 is called the collision operator and represents changes due to intermolecular collisions in the fluid. f^{eq} is the equilibrium distribution, given by Eq. 2.2; it is also known as the Maxwell-Boltzmann distribution.

f^{eq}(\vec{v}) = \rho \left( \frac{m}{2\pi\kappa_b T} \right)^{d/2} e^{-\frac{m(\vec{v}-\vec{u})^2}{2\kappa_b T}}    (2.2)

This equation describes the density of particles which have a specific velocity \vec{v} within a region with bulk velocity \vec{u} in an infinitesimally small volume. Here, \kappa_b is the Boltzmann constant, d the number of dimensions, m the mass of a single particle, T the temperature and \rho the mass density.

Solving Eq. 2.1 analytically is very challenging and can only be done for special cases. Prabhu L. Bhatnagar, Eugene P. Gross and Max Krook noticed that the main effect of the collision term is to bring the velocity distribution function closer to the equilibrium distribution. Based on this observation, they proposed the BGK approximation, in which the collision operator of the Boltzmann equation is approximated by

\Delta(f - f^{eq}) \approx -\frac{1}{\tau}(f - f^{eq})    (2.3)

where \tau is the relaxation time of the system, i.e. a characteristic time for the relaxation of f towards equilibrium.
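For reference (this intermediate form is implied but not written out in the text), inserting the BGK approximation (2.3) into the Boltzmann equation (2.1) yields the BGK-Boltzmann equation that the lattice Boltzmann update in section 2.2 discretizes:

\frac{\partial f}{\partial t} + \vec{v} \cdot \nabla f = -\frac{1}{\tau}\left( f - f^{eq} \right)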


Equations 2.1 and 2.3 provide information on the flow using statistical methods, while the Navier-Stokes equations are based on a set of continuity equations. However, the quantities known from the Navier-Stokes equations, such as the velocity \vec{u} or the density \rho, can be computed at a point \vec{x} by integrating the probability density f over the velocity space:

\rho(\vec{x}, t) = \int_{\mathbb{R}^D} f \; dv_1 \ldots dv_D    (2.4)

\rho(\vec{x}, t)\, \vec{u}(\vec{x}, t) = \int_{\mathbb{R}^D} f \, \vec{v} \; dv_1 \ldots dv_D    (2.5)

    2.2. Lattice Boltzmann Method

The LBM originated from the lattice gas automata (LGA) method. LGA is a simplified molecular dynamics model in which quantities like space, time and particle velocities are discrete. In this model, each lattice node is connected with its neighbors by six lattice velocities. In each time interval, as demonstrated in Fig. 2.1, the particles at each node move to the neighboring nodes in the direction of one of the six lattice velocities.

Figure 2.1.: LGA model lattice vectors. From [2].

When particles from different directions arrive at the same node, they collide and, according to a set of collision rules, change their velocities. More in-depth information regarding the LGA method is provided in [12].

The transition from the LGA model to the LBM is accomplished by switching from a quantitative description to a probabilistic description of the particles in phase space.

The LBM simulates fluid flow by tracking particle distribution functions. Accomplishing this in a continuous phase space is impossible; therefore, the LBM tracks the particle distribution along a limited number of directions.

The particle collisions are computed based on the density values of the cells with a so-called collision operator. The movement of the particles within one time step is then simulated by propagating the density values to the adjacent cell in the direction of the corresponding velocity vector. This is called the streaming step.


LBM models can operate on different discretization schemes, which describe the dimension and the adjacent cells for data exchange along the lattice vectors. To classify the different schemes, the DdQq notation is used, where d is the number of dimensions and q is the number of density distributions per cell. For example, D3Q19 is a 3D model with 19 density distributions per cell (Fig. 2.2); it is the model used in this work.

dd0–dd3, dd16, dd17: distributions with lattice speed 1; dd4–dd15: distributions with lattice speed sqrt(2); dd18: particles at rest.

Figure 2.2.: Lattice vectors for the D3Q19 model. From [17].

Based on the LBM model, the discrete representation of the lattice Boltzmann update is

f_i(\vec{x} + \vec{c}_i\, dt,\; t + dt) = f_i(\vec{x}, t) - \frac{1}{\tau}\left( f_i(\vec{x}, t) - f_i^{eq} \right)    (2.6)

where f_i^{eq} is the discretized equilibrium distribution function

f_i^{eq}(\vec{u}) = \omega_i \rho \left[ 1 + 3\,(\vec{e}_i \cdot \vec{u}) + \frac{9}{2}\,(\vec{e}_i \cdot \vec{u})^2 - \frac{3}{2}\,\vec{u}^2 \right]    (2.7)

with \omega_i = \frac{1}{18} for i \in [0;3] \cup [16;17], \omega_i = \frac{1}{36} for i \in [4;15], and \omega_i = \frac{1}{3} for i = 18.
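As a quick consistency check (added here for illustration; not part of the original text), these weights are properly normalized: six distributions carry weight 1/18, twelve carry 1/36, and the rest particle carries 1/3, so

\sum_{i=0}^{18} \omega_i = 6 \cdot \tfrac{1}{18} + 12 \cdot \tfrac{1}{36} + \tfrac{1}{3} = 1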

In order to compute the density and velocity of each cell, equations 2.4 and 2.5 can be rewritten as

\rho = \sum_{i=0}^{18} f_i, \qquad \rho\,\vec{v} = \sum_{i=0}^{18} f_i\, \vec{e}_i    (2.8)
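As a small illustration of Eq. 2.8 (a sketch, not the thesis kernel code; the array and function names are assumptions), the density and velocity of one D3Q19 cell can be computed as follows:

/* f holds the 19 density distributions of one cell, e the 19 integer
 * lattice vectors; rho and u receive the density and velocity moments. */
void compute_moments(const float f[19], const int e[19][3],
                     float *rho, float u[3])
{
    float r = 0.0f;
    float m[3] = {0.0f, 0.0f, 0.0f};
    for (int i = 0; i < 19; ++i) {
        r    += f[i];               /* rho    = sum_i f_i       */
        m[0] += f[i] * e[i][0];     /* rho*v  = sum_i f_i * e_i */
        m[1] += f[i] * e[i][1];
        m[2] += f[i] * e[i][2];
    }
    *rho = r;
    u[0] = m[0] / r;
    u[1] = m[1] / r;
    u[2] = m[2] / r;
}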

    2.3. Lattice Boltzmann and HPC

Since the discrete probability distribution functions used in the lattice Boltzmann model require more memory than the variables of the Navier-Stokes equations, the method might at first sight seem quite resource-consuming; in practice, however, this is rarely an issue on modern computers. Because the collision operation of each cell can be computed independently and the data dependency scheme is simple, the lattice Boltzmann method is particularly well suited for parallel architectures. This advantage also carries over to other types of high-performance hardware such as general-purpose graphics processing units (GPGPUs), which are investigated in this thesis.


3. Turbulent LBM

Turbulent flow is characterised by irregular and stochastic changes in fluid properties such as pressure and velocity. Many complex flows, ranging from the smoke rising from a cigarette, which after some time becomes completely disordered in the air, to a jet exhaust, show chaotic and irregular flow disturbances. In contrast to laminar flows, turbulent flows exhibit a wide range of length scales.

In order to measure the strength of turbulence in a flow, the Reynolds number

Re = \frac{v L}{\nu}    (3.1)

is introduced, where v is the fluid velocity, \nu the fluid viscosity and L the characteristic length scale. Turbulent flows typically have a Reynolds number above 5000, while flows with a Reynolds number below 1500 are typically laminar.
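As an illustrative example (the numbers are chosen here and do not appear in the thesis): for air with a kinematic viscosity of about 1.5 · 10^-5 m²/s flowing at v = 1 m/s past an obstacle of size L = 0.1 m,

Re = \frac{1\,\mathrm{m/s} \cdot 0.1\,\mathrm{m}}{1.5 \cdot 10^{-5}\,\mathrm{m^2/s}} \approx 6700,

which by the thresholds above already falls into the turbulent regime.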

Table 3.1 provides a comparison of the characteristic properties of laminar and turbulent flows.

Laminar Flows                          | Turbulent Flows
Highly ordered flow                    | Chaotic fluid flow
No stochastic irregularity             | Irregular in space and time
Stable against outside disturbances    | Unsteady
Occur at low Reynolds numbers          | Occur only at high Reynolds numbers

Table 3.1.: Comparison of characteristic properties of laminar and turbulent flows. From [4].

    In the next section, the modeling of turbulent fluids is investigated.

    3.1. Turbulence Modeling

This section is intended to give a brief introduction to three established turbulence modeling approaches, namely Direct Numerical Simulation (DNS), Reynolds-averaged Navier-Stokes (RANS) and Large-Eddy Simulation (LES), for a better understanding of the implementation, without going into the mathematical and physical details of these models.

At the end, the BGK-Smagorinsky model, which is implemented in this thesis, is presented.


    3.2. Overview of Simulation Approaches

An overview of turbulence models is provided in this section. The interested reader can find more in-depth material regarding turbulence models in [4, 11].

Direct Numerical Simulation (DNS) This approach is based on solving the three-dimensional, unsteady Navier-Stokes equations directly. DNS is the most accurate method to simulate turbulent flows, since the whole range of spatial and temporal scales of the turbulence is resolved. As a consequence, the computational cost of DNS is very high, even at low Reynolds numbers. The only error source of this approach is the approximation error of the numerical method, which can be controlled by choosing an appropriate scheme. For most industrial applications, the computational resources required by a DNS would exceed the capacity of the most powerful computers currently available.

Reynolds-Averaged Navier-Stokes (RANS) This model is based on a statistical view of a turbulent flow. The instantaneous values of the turbulent flow field are split into a temporal (ensemble) average part and a fluctuating part. Inserting these values into the Navier-Stokes equations and applying a temporal average yields the Reynolds-averaged Navier-Stokes (RANS) equations.

Large-Eddy Simulation (LES) This model was initially proposed in 1963 by Joseph Smagorinsky. The basic idea behind LES is to apply a low-pass filter to the Navier-Stokes equations in order to eliminate the small scales of the solution. This leads to transformed equations for a filtered velocity field. To reduce the computational cost, the LES model resolves the large scales of the solution and models the smaller scales. This makes LES suitable for industrial simulations with complex geometries.

In this thesis, the BGK-Smagorinsky model, an LES-type model, is implemented. The next section is devoted to this model.

    3.3. BGK-Smagorinsky Model

As described in the previous section, the LES model is based on applying a low-pass filter to the Navier-Stokes equations. In order to understand the concept of LES simulations, we first look at the characteristic properties of turbulent flows. A rough comparison of the characteristic properties of large and small scales in turbulent flows is given in Table 3.2.


Large Scales                        | Small Scales
Produced by the mean fluid flow     | Originate from large-scale motions
Exhibit coherent structures         | Chaotic, stochastic
Long-lived and energy-rich          | Short-lived and low in energy
Hard to model                       | Easier to model

Table 3.2.: Comparison of characteristic properties of large and small scales in turbulent flows. From [4].

In the RANS model, both large and small scales are modeled with an approximate statistical model. In the DNS model, in contrast, both scales are computed by a direct solution of the Navier-Stokes equations. The LES model can be considered a compromise between RANS and DNS: the large scale values are obtained by a direct solution of the Navier-Stokes equations, while the small scales are approximated with a model. The interested reader can find more detailed information about these methods in [11].

The following discrete lattice Boltzmann equation is obtained by applying the filter (see [9]):

\bar{f}_\alpha(\vec{x} + \vec{e}_\alpha \delta t,\; t + \delta t) = \bar{f}_\alpha(\vec{x}, t) - \frac{1}{\tau_*}\left[ \bar{f}_\alpha(\vec{x}, t) - \bar{f}_\alpha^{eq}(\vec{x}, t) \right] + \left( 1 - \frac{1}{2\tau_*} \right) \bar{S}_\alpha(\vec{x}, t)\, \delta t    (3.2)

where overbars indicate filtered values. The major difference between this equation and the original lattice Boltzmann equation (see Eq. 2.6) is that the density distribution functions are replaced by their filtered counterparts. The relaxation time is replaced by the total relaxation time \tau_* = \tau + \tau_t. Consequently, the total viscosity is defined as follows:

\nu_* = \nu + \nu_t = \frac{1}{3}\left( \tau_* - \frac{1}{2} \right) c^2 \delta t = \frac{1}{3}\left( \tau + \tau_t - \frac{1}{2} \right) c^2 \delta t    (3.3)

    where the turbulent viscosity νt is defined in Eq. 3.4 and τt is the turbulent relaxation time.

\nu_t = \frac{1}{3}\, \tau_t\, c^2 \delta t    (3.4)

Smagorinsky Model In the eddy viscosity subgrid-scale (SGS) model, the turbulent stress is given by Eq. 3.5:

\tau_{ij}^t - \frac{1}{3}\delta_{ij}\tau_{kk}^t = -2\nu_t \bar{S}_{ij}    (3.5)

In addition, the turbulent eddy viscosity is computed by the following formula:

\nu_t = (C_s \Delta)^2 |\bar{S}|    (3.6)

where C_s is the Smagorinsky constant; in this thesis, C_s is set to 0.1. Eq. 3.7 defines the filtered strain rate:

|\bar{S}| = \sqrt{2\,\bar{S}_{ij}\bar{S}_{ij}}    (3.7)


where the filtered strain rate tensor is:

\bar{S}_{ij} = \frac{1}{2}\left( \frac{\partial \bar{u}_i}{\partial x_j} + \frac{\partial \bar{u}_j}{\partial x_i} \right)    (3.8)

The computation of the filtered strain rate tensor as formulated in Eq. 3.8 requires finite difference calculations. In the LBM, this can be avoided by using the second moment of the filtered non-equilibrium density distribution functions. In our case, the filtered non-equilibrium flux tensor is applied:

\bar{S}_{ij} = -\frac{1}{2\,\Delta t\, \tau_*\, \rho\, c_s^2}\, \bar{Q}_{ij}    (3.9)

with

\bar{Q}_{ij} \equiv \bar{\Pi}_{ij}^{(neq)} + 0.5\left( \bar{u}_i \bar{F}_j + \bar{u}_j \bar{F}_i \right)    (3.10)

where \bar{\Pi}_{ij}^{(neq)} is the filtered non-equilibrium momentum flux tensor, which can be computed as follows:

\bar{\Pi}_{ij}^{(neq)} = \sum_\alpha e_{\alpha i}\, e_{\alpha j} \left( \bar{f}_\alpha - \bar{f}_\alpha^{(eq)} \right)    (3.11)

Finally, the total relaxation time can be computed directly with the following formula:

\tau_* = \tau + \tau_t = \frac{1}{2}\left( \tau + \sqrt{ \tau^2 + \frac{18\sqrt{2}\,(C_s\Delta)^2 \sqrt{\bar{Q}_{ij}\bar{Q}_{ij}}}{\rho\, \Delta t^2 c^4} } \right)    (3.12)
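As an illustration of Eq. 3.12 (a sketch with assumed parameter names, not the thesis kernel code; q_norm stands for the precomputed value sqrt(Q_ij Q_ij)):

#include <math.h>

float total_relaxation_time(float tau, float cs, float delta,
                            float q_norm, float rho, float dt, float c)
{
    /* 18 * sqrt(2) * (Cs*Delta)^2 * sqrt(Q_ij Q_ij) / (rho * dt^2 * c^4) */
    float cs_delta = cs * delta;
    float term = 18.0f * sqrtf(2.0f) * cs_delta * cs_delta * q_norm
                 / (rho * dt * dt * c * c * c * c);

    return 0.5f * (tau + sqrtf(tau * tau + term));
}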

    In chapter 7, this method is implemented on a Multi-GPU platform.


4. GPU Architecture

Programmable GPUs (graphics processing units) have advanced into highly parallel processors with tremendous computational power. Since GPUs are specialized for compute-intensive, highly parallel computation, they are designed such that more transistors are devoted to data processing rather than to data caching and flow control. As a result, the floating-point capability of GPUs has attracted a lot of attention in the scientific computing community.

Especially for problems in which the same program is executed on many data elements in parallel (data-parallel computation), or for problems with high memory requirements such as the LBM, it has been shown that a great performance boost can be achieved by using GPU architectures [15, 18].

    In the next section, the CPU and GPU architectures are compared more deeply.

    4.1. CPU vs. GPU

CPUs are optimized for sequential code performance. Therefore, they use sophisticated control logic and provide large cache memories, but neither the control logic nor the cache memories contribute to the peak calculation speed.

GPUs were originally designed as accelerators for rendering and graphics computations, e.g. for computer games. In most graphical tasks, such as visualization, the same operations are executed for all data elements. As a result, the design of GPUs is optimized for data-parallel, throughput-oriented computations, and much more chip area is dedicated to floating-point calculations. Compared to CPUs, GPUs therefore offer a higher throughput of floating-point operations.

The schematic layouts of CPUs and GPUs are illustrated in Fig. 4.1. In GPUs, smaller flow control units and cache memories are provided to help control the bandwidth requirements. For algorithms with heavy flow control, GPUs are therefore less suitable than CPUs, due to the smaller flow control units provided by their architecture.

    Figure 4.1.: The GPU devotes more transistors to data processing. From [5].


The main advantages of GPUs are the massive parallelism and the wide memory bandwidth that their architecture provides. This is achieved by providing thousands of smaller, more efficient cores designed for parallel performance, whereas multicore CPUs consist of a few cores optimized for serial processing.

Most graphics processing units (GPUs) follow a single instruction, multiple data (SIMD) design: SIMD computers have multiple processing elements that perform the same operation on multiple data points simultaneously.

Multicore CPUs are optimised for coarse, heavyweight threads which deliver better performance per thread, whereas GPUs create fine, lightweight threads with relatively poor single-thread performance. For instance, the NVIDIA Tesla M2090 provides 512 processor cores with a 1.3 GHz core clock. Each NVIDIA Tesla M2090 has a main memory, the so-called global memory, of 6 GB. Table 4.1 lists the specifications of the NVIDIA Tesla M2090 and the older M2050 model.

NVIDIA Tesla              | M2050 | M2090
Compute Capability        | 2.0   | 2.0
Code Name                 | Fermi | Fermi
CUDA Cores                | 480   | 512
Main Memory [GB]          | 3     | 6
Memory Bandwidth [GB/s]   | 140   | 178
Peak DP FLOPS [GFLOPS]    | 515   | 665

Table 4.1.: Comparison of NVIDIA Tesla M2050 and M2090. From [5].

    In table 4.2 a comparison of NVIDIA Tesla M2090 and Intel Core i7-3770K is provided.

                          | Intel Core i7-3770K | NVIDIA Tesla M2090
Cores                     | 4                   | 512
Main Memory [GB]          | max. 32             | 6
Memory Bandwidth [GB/s]   | 25.6                | 178
Peak DP FLOPS [GFLOPS]    | 112                 | 665

Table 4.2.: Comparison of the NVIDIA Tesla M2090 and the Intel Core i7-3770K.

Programming models like CUDA and OpenCL make GPUs accessible for general computation, similar to CPUs. OpenCL is currently the dominant open general-purpose GPU computing standard. In the following, a brief introduction to OpenCL is given to the extent required for understanding this thesis.

    4.2. OpenCL

In older graphics cards, the computing elements were specialized to process independent vertices and fragments. GPGPU programming interfaces like CUDA and OpenCL remove this specialization and allow running programs written in a language similar to C without using the 3D graphics API. OpenCL is an open industry standard for programming a heterogeneous collection of CPUs and GPUs organized into a single platform. OpenCL includes a language, an API, libraries and a runtime system to support software development.

OpenCL is the first industry standard that directly addresses the needs of heterogeneous computing. It was first released in December 2008, and the early products became available in the fall of 2009. With OpenCL, one can write a single program that runs on a wide range of systems, from cell phones to the nodes of massive supercomputers. This is one of the reasons why OpenCL is so important; at the same time, it is also the source of much of the criticism aimed at OpenCL. OpenCL is based on four models:

    • Platform Model

    • Memory Model

    • Execution Model

    • Programming Model

In the following sections, each of these models is explained in more detail. The content of the following sections is based on the OpenCL specification [10]. A more in-depth explanation of the OpenCL standard can also be found in [6, 16, 13].

    4.2.1. Platform Model

The platform model is illustrated in Fig. 4.2. The model consists of a single host and a number of OpenCL devices connected to the host. The host performs tasks like I/O or the interaction with the program's user. An OpenCL device can be a CPU, a GPU, a digital signal processor (DSP) or another type of processor. An OpenCL device consists of one or more compute units (CUs). Each CU provides one or more processing elements (PEs), which perform the actual computations on the device.

    Figure 4.2.: OpenCL platform model. From [10].


    4.2.2. Execution Model

An OpenCL application consists of two parts: the host program and one or more kernels. The host program runs on the host; it creates the context for the kernels and manages their execution. The kernels execute on the OpenCL devices. Kernels are simple functions that perform the real work of an OpenCL application. These functions are written in the OpenCL C programming language and compiled with the OpenCL compiler for the target device.

The kernels are defined on the host. In order to submit a kernel for execution on a device, the host program issues a command. Afterwards, an integer index space is instantiated by the OpenCL runtime system. For each element of this index space, an instance of the kernel is created and launched. Each instance is called a work-item. Work-items are identified by their coordinates in the index space; these coordinates are called global IDs and are unique for each work-item. While each work-item executes the same instructions, the data used by each work-item can vary based on its global ID.

Work-items are organized into work-groups. Work-groups are assigned a unique ID with the same dimensionality as the index space used for the work-items. Within a work-group, work-items are assigned a unique local ID. As a result, a single work-item can be uniquely identified in two ways: (1) by its global ID, or (2) by the combination of its local ID and its work-group ID.
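The following OpenCL C fragment is a minimal illustration (not taken from the thesis code) of how a work-item obtains its IDs and uses the global ID to select its data element; the kernel, buffer and argument names are assumptions.

__kernel void scale(__global float *data, const float factor, const uint n)
{
    size_t gid = get_global_id(0);   /* unique global ID of this work-item */
    size_t lid = get_local_id(0);    /* local ID within the work-group     */
    size_t grp = get_group_id(0);    /* ID of the enclosing work-group     */
    (void)lid; (void)grp;            /* shown for illustration only        */

    if (gid < n)                     /* guard against a padded index space */
        data[gid] *= factor;
}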

    4.2.3. Memory Model

The OpenCL standard offers two types of memory objects: buffer objects and image objects. A buffer object is a contiguous block of memory made available to the kernels. A programmer can initialise a buffer with any type of information from the host and access it through pointers. Image objects are restricted to holding images.

A summary of the OpenCL memory model and its interaction with the platform model is depicted in Fig. 4.3.

Figure 4.3.: OpenCL memory model. From [13].

OpenCL defines four distinct memory regions that executing work-items have access to; a short kernel sketch illustrating the corresponding address space qualifiers follows the list:

    14

  • 4.2. OpenCL

• Global Memory: "All work-items in all work-groups have read/write access to this memory region. Depending on the capabilities of the device, reads and writes to this memory region may be cached" [10].

• Constant Memory: "This memory remains constant during the execution of a kernel. Constant memory is allocated in the global memory region. Initialization of constant memory is done through the host" [10].

• Local Memory: "This memory region is local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group. Depending on the device, the local memory can be mapped onto the global memory or it can have its own dedicated memory" [10].

• Private Memory: "Variables defined in one work-item's private memory are not visible to another work-item" [10].
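The following OpenCL C fragment is a minimal, illustrative sketch (not taken from the thesis) showing how these four regions appear in kernel code through the address space qualifiers __global, __constant, __local and __private; all names are assumptions.

/* The qualifiers map directly to the four memory regions listed above. */
__kernel void region_demo(__global float *out,        /* global memory   */
                          __constant float *coeff,    /* constant memory */
                          __local float *scratch)     /* local memory    */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    float tmp = coeff[0];            /* private memory: per-work-item variable */
    scratch[lid] = tmp;              /* shared within the work-group           */
    barrier(CLK_LOCAL_MEM_FENCE);    /* make local writes visible to the group */

    out[gid] = scratch[lid];
}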

    4.2.4. Programming Model

This section describes how a programmer can map parallel algorithms onto OpenCL using a programming model. OpenCL supports two different programming models, task parallelism and data parallelism, as well as hybrids of the two. The primary model behind the design of OpenCL is data parallelism.

Data Parallelism The data-parallel programming model is the main idea behind OpenCL's execution model. In this programming model, the same sequence of instructions is applied to multiple elements of a memory object. Access to the memory object is normally realized via the index space associated with the OpenCL execution model. The programmer needs to align the data structures of the problem with the index space in order to access the correct data in the memory object.

Task Parallelism Although OpenCL is designed for data parallelism, it can also be used to achieve task parallelism. In this case, a single instance of a kernel is executed independently of any index space. Since task parallelism is not used in this thesis, it is not described in further detail here.

    4.2.5. Advanced OpenCL Event Model Usage

OpenCL Usage Models In this section, various usage models of the OpenCL standard are explained. In order to perform an operation on OpenCL objects like memory, kernel and program objects, a command-queue is used. In addition, command-queues are used to submit work to a device. The commands queued in a command-queue can execute in-order or out-of-order. Having multiple command-queues allows an application to queue multiple independent commands. Note that sharing objects across multiple command-queues, or using an out-of-order command-queue, requires explicit synchronisation of the commands. Based on these choices, various OpenCL usage models can be defined [8]. In the following, a few of these models are described:

    15

  • 4. GPU Architecture

• Single Device In-Order Usage Model (SDIO): This usage model is composed of a single, simple in-order queue. "All commands execute on a single device and all memory operations occur in a single memory pool."

• Single Device Out-of-Order Usage Model (SDOO): This model is the same as the SDIO model, but with an out-of-order queue. As a result, there are no guarantees on the execution order, and the device starts executing commands as soon as possible. It is the responsibility of the developer to ensure program correctness by analyzing the command dependencies.

• Single Device Multi-Command Queue Usage Model (SDMC): In this model, multiple command-queues are employed to queue commands to a single device. The model can be applied in order to overlap the execution of different commands, or to overlap commands with host/device communication. Depending on the GPU's compute capability, the SDMC model makes it possible to launch several kernels concurrently on the same device (a minimal setup sketch is given below).

Depending on the parallelization algorithm, each OpenCL usage model can lead to different performance and scalability results. Some of these models are applied when implementing the more sophisticated overlapping techniques.
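As an illustration of the SDMC setup (a sketch with assumed variable names, not the thesis code), two command-queues are created for the same device; kernels can then be enqueued to one queue and host/device transfers to the other, with cross-queue ordering enforced via events.

/* Assumes `context` and `device` have already been created/selected. */
cl_int err;
cl_command_queue compute_queue  = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue transfer_queue = clCreateCommandQueue(context, device, 0, &err);
/* e.g. kernels go to compute_queue, clEnqueueRead/WriteBuffer calls go to
 * transfer_queue; dependencies between the queues must be expressed with events. */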

Synchronization Mechanisms Changes to the state of a shared object (such as a command-queue object or a memory object) must occur in the correct order, for instance when multiple command-queues in multiple threads modify the state of a shared object. To guarantee this, the application needs to implement appropriate synchronisation across the host threads by constructing a task graph. The OpenCL event model provides the ability to construct complicated task graphs for the tasks enqueued in any of the command-queues associated with an OpenCL context. In addition, OpenCL events can interact with functions on the host through the callback mechanism defined in OpenCL 1.1. These OpenCL features and their application in the implementation of the Multi-GPU LBM simulation with overlapping techniques are described in the following sections.

Events An event is an object that can be used to determine the status of commands in OpenCL. Events can be generated by commands in a command-queue, and other commands can use these events to synchronise with them. Hence, event objects can be used as synchronisation points.

All clEnqueue() functions can return event objects: an event can be passed as the final argument of an enqueue function. In addition, a list of events can be passed to an enqueue function to specify its dependencies. Based on this dependence list, which in OpenCL terminology is called an event wait list, the command will not start executing until all events in the list have completed. The following code demonstrates an example of OpenCL event-based synchronization:

    Listing 4.1: OpenCL event based synchronization.

cl_uint num_events_in_waitlist = 2;
cl_event event_waitlist[2];

/* offset, size and host pointer are left as 0/NULL placeholders */
err = clEnqueueReadBuffer(queue, buffer0, CL_FALSE /* non-blocking */,
                          0, 0, NULL, 0, NULL, &event_waitlist[0]);

err = clEnqueueReadBuffer(queue, buffer1, CL_FALSE /* non-blocking */,
                          0, 0, NULL, 0, NULL, &event_waitlist[1]);

/* the last read buffer waits on the two previous read buffer events */
err = clEnqueueReadBuffer(queue, buffer2, CL_FALSE /* non-blocking */,
                          0, 0, NULL, num_events_in_waitlist,
                          event_waitlist, NULL);

User Events Events can also be used to synchronize commands running within a command-queue with functions executing on the host. This can be done by creating so-called user events. A user event can be used in OpenCL enqueue functions like any other event; in this case, however, the execution status of the event is set explicitly by the host. Creating a user event on the host is accomplished with the clCreateUserEvent function.
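A minimal sketch of this mechanism (illustrative only; it assumes an existing context, queue and buffer, and the host array is a placeholder): a command is enqueued with a user event in its wait list and is released once the host sets the event status to CL_COMPLETE.

float host_data[16] = {0};
cl_int err;
cl_event user_evt = clCreateUserEvent(context, &err);

/* the write is enqueued now, but may not start before user_evt completes */
err = clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, sizeof(host_data),
                           host_data, 1, &user_evt, NULL);

/* ... host prepares host_data ... */

/* signalling CL_COMPLETE releases the enqueued write */
err = clSetUserEventStatus(user_evt, CL_COMPLETE);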

Callback Events Callbacks are functions that are invoked asynchronously when the associated event reaches a specific state. A programmer can associate a callback with an arbitrary event. Using the OpenCL callback mechanism is beneficial especially for applications in which the host CPU would otherwise have to wait while the device is executing, which can lead to worse system efficiency. In such cases, by setting a callback to a host function, the CPU can do other work instead of spinning while waiting on the GPU. The clSetEventCallback function is used to set a callback for an event.
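A sketch of the callback mechanism (illustrative; kernel_evt is assumed to be the event returned by some previously enqueued command): the registered function is invoked asynchronously on a host thread once the event reaches CL_COMPLETE, so the CPU does not have to spin while waiting on the GPU.

#include <stdio.h>

void CL_CALLBACK on_command_done(cl_event evt, cl_int status, void *user_data)
{
    /* runs on a host thread once the associated command has completed */
    printf("command finished with status %d\n", status);
}

/* ... after enqueueing a command that returned kernel_evt ... */
err = clSetEventCallback(kernel_evt, CL_COMPLETE, on_command_done, NULL);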

Using Events for Profiling Performance analysis is a crucial part of any HPC programming effort. Hence, the mechanism to profile OpenCL programs uses events to collect the profiling data. To make this functionality available, the command-queue has to be created with the flag CL_QUEUE_PROFILING_ENABLE. Then, using the function clGetEventProfilingInfo(), the timing data can be extracted from event objects. A sample code of this process is shown in Listing 4.2.

    Listing 4.2: Extracting profiling information with OpenCL events.

cl_event event;
cl_ulong start;     // start time in nanoseconds
cl_ulong end;       // end time in nanoseconds
cl_float duration;  // duration in milliseconds

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
duration = (end - start) * 0.000001;

CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END are flags used to return the value of the device time counter, in nanoseconds, for the start and the end of the command associated with the event, respectively.
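For completeness, a one-line sketch (with assumed variable names) of creating a command-queue with profiling enabled, which is the prerequisite for the calls in Listing 4.2:

cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, &err);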


5. Single-GPU LBM

GPU computing, with its advantages of massively parallel processing and wide memory bandwidth, allows LBM simulations to achieve high performance. The fact that the LBM cells can be computed independently makes the method well suited for parallelization on the GPU.

This chapter gives a brief introduction to the implementation of the Single-GPU LBM with the OpenCL API. For more in-depth information regarding the Single-GPU implementation, see [17]. The Multi-GPU implementation in chapter 6 uses this work as its initial framework.

    5.1. OpenCL Implementation

This section describes the implementation of the LBM using the OpenCL standard. One of the most important design aspects of a GPU-accelerated simulation software is the memory access pattern it uses. The following sections discuss the memory layout used in the Single-GPU implementation for storing and accessing the simulation data, i.e. the density distribution values.

    5.1.1. Memory layout

Since the global memory on GPUs is still restricted, designing an efficient and frugal memory layout for LBM simulations on GPUs plays a crucial role in achieving optimized software. Several memory layouts for storing the density distributions have been developed by different research groups. In the following, two of these memory layouts, namely the A-B pattern and the A-A pattern, are discussed. The Single-GPU LBM used in this thesis employs the A-A memory layout, which consumes the lowest amount of memory [3].

A-B Pattern In this memory layout, after applying the collision operator, the density distribution values are stored in the same memory location. The propagation operator reads the density distribution value of the adjacent cell in the opposite direction of the current lattice vector and saves the value in the corresponding lattice vector direction of the current cell. The A-B memory layout is illustrated in Fig. 5.1.


Figure 5.1.: Memory layout of the A-B pattern: density distributions (model) and data storage (implementation) for the collision and propagation steps. From [17].

To avoid race conditions when reading from and writing to the same cells in a parallel implementation, an additional density distribution buffer is required. As a result, the A-B pattern doubles the memory consumption due to this additional buffer. The A-A memory layout addresses this issue.

A-A Pattern The main design goal of the A-A pattern is to reduce the memory demand while maintaining almost the same performance. The A-A pattern achieves this by using two different kernels for odd and even time steps; following [17], we refer to them as the alpha and beta kernels.

The alpha kernel reads the distribution values in the same way as the A-B pattern. After applying the collision operator, the new values are stored in the opposite lattice vector of the current cell. Hence, the alpha kernel does not change the values of other cells at all and only accesses the values of the current cell.

In contrast, the beta kernel reads and writes the density distribution values only from and to adjacent cells. In this kernel, the values are read from the adjacent cells in the opposite direction of the current lattice vector. After applying the collision operator, the new values are stored in the adjacent cell in the same direction of the current lattice vector. Figure 5.2 demonstrates the A-A memory access pattern.


Figure 5.2.: Memory layout of the A-A pattern: density distributions (model) and data storage (implementation) for the combined collision and propagation of the alpha and beta kernels. From [17].

    Together with the alpha kernel, this access rule implicitly implements the propagationoperator.

    5.1.2. Data Storage

The density distributions of each lattice vector are stored linearly in the global memory of the GPU. To achieve coalesced memory accesses, first all x components are stored, then the y and z components. The density distribution values of the next lattice vector follow consecutively. A schematic illustration of this memory layout is given in Fig. 5.3.

Figure 5.3.: Density distribution memory layout: for each of the 19 density distributions (0, 1, ..., 18), the cell values are stored contiguously, one distribution after the other. From [17].

The velocity vectors of each cell are stored in the same way as the density distribution values. In addition, the density values and the cell type flags are stored consecutively in separate buffers. More information about the memory layout of the Single-GPU LBM can be found in [17].
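The resulting linear indexing can be sketched as follows (an illustration consistent with the layout described above; the function and parameter names are assumptions, not the thesis code):

#include <stddef.h>

/* Linear index of density distribution q for cell (x, y, z) in a domain of
 * nx * ny * nz cells: all cells of distribution 0 come first (x running
 * fastest, then y, then z), followed by all cells of distribution 1, etc. */
size_t dd_index(size_t q, size_t x, size_t y, size_t z,
                size_t nx, size_t ny, size_t nz)
{
    size_t cells_per_distribution = nx * ny * nz;
    return q * cells_per_distribution + (z * ny + y) * nx + x;
}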


6. Multi-GPU LBM

The simulation of real-world scenarios is usually very compute-intensive. In addition, the main memory of one compute device is commonly not sufficient to meet the memory demands (e.g., 6 GB on an NVIDIA M2090 GPU). Using multiple GPUs efficiently for the LBM helps to fulfill the memory requirements and thus makes it possible to run simulations with a higher number of unknowns (weak scaling).

However, the use of multiple GPUs demands more sophisticated memory management, communication and synchronization techniques in order to avoid communication overhead in a distributed and even in a shared memory system. To overcome these challenges in a Multi-GPU LBM simulation, sophisticated optimization techniques are required.

In the following section, the parallelization paradigm used in this thesis to proceed from the Single-GPU to a Multi-GPU implementation is described. Additionally, the central components of the software, which are crucial for a good software design, are explained comprehensively. In order to achieve good performance and to overcome the communication overhead, various techniques for overlapping computation and communication are implemented and their efficiency is investigated. At the end, benchmark results of the software on the MAC GPU cluster are presented.

    6.1. Parallelization Models

There are two parallelization models: shared memory and distributed memory parallelization. In a shared memory model, as the name implies, different processes share a global address space and asynchronously read and write data in it. In a distributed memory model, the processes exchange data by passing messages to each other in an asynchronous or synchronous way. The most common standard for message passing is MPI (Message Passing Interface). For this thesis, the distributed memory model is adopted and the data exchange between GPUs is accomplished via MPI.

An essential aspect of any parallelization paradigm is problem decomposition. The two types of problem decomposition are task parallelism and data parallelism. Task parallelism focuses on the processes of execution. With data parallelism, in contrast, a set of tasks operates on a data set, but independently on separate partitions. Since in the LBM the same computations are applied to each domain cell, and the cells can be computed independently of each other, the method is a perfect candidate for data parallelism.

In the next section, these parallelization paradigms are explained specifically for LBM simulations.


    6.1.1. Domain Decomposition

Domain decomposition is the most common way to parallelize LBM codes. The term domain decomposition means that the computational domain is divided into several smaller parts which are distributed to several computational units. Each domain partition is assigned to one MPI process and one GPU, which is responsible for the computation of that partition.

As a result of the data dependencies, the MPI processes need to exchange data among each other. The required data, following from the LBM data dependencies, is the outer layer of the process-local simulation domain. Therefore, each MPI process extracts the data that is required by other processes and sends it to the receiving process. The received data is saved in the so-called ghost layer subregion of the corresponding simulation data.

    6.1.2. Ghost Layer Synchronization

Normally, the ghost layer data is only used during the local computation of the subregion and can safely be overwritten by new data in the next simulation step.

As described in section 5.1.1, the α-kernel only accesses the values of the current cell for collision and propagation. The general approach of ghost layer data exchange can therefore be applied to the data synchronization after an α time step. In this thesis, we call this α-synchronization.

Figure 6.1.: α-Synchronization. In α-synchronization, all lattice vector values are exchanged.

In contrast to α-synchronization, the β-kernel reads the density distribution values from the adjacent cells and, after performing the computation, writes the results to the adjacent cells in the direction of the lattice vector of the computed density distribution. As a consequence, the GPU threads computing the neighboring cells of the ghost layers write their results into the ghost layer data. This data is required for the next simulation step of the process that the ghost layer data originally comes from. Thus, the new data in the ghost layer has to be sent back to the original process. This procedure is demonstrated in Fig. 6.2; we call it β-synchronization.

Figure 6.2.: β-Synchronization. In β-synchronization, only the red lattice vectors are exchanged.

    6.1.3. CPU/GPU Communication

The simulation data of each subdomain is stored in the global memory of the corresponding GPU. Therefore, before performing MPI communication, the data has to be transferred to the host memory. The data is sent over the bus system connecting host and device (e.g. PCI Express).

In contrast to MPI communication involving only CPUs, Multi-GPU communication requires an additional step to transfer the data from the GPU memory to the host memory. This process is presented in Fig. 6.3.

Figure 6.3.: Activity diagram of the MPI communication for the Multi-GPU simulation: GPU-to-CPU copy, PCI Express transfer, InfiniBand send via MPI, InfiniBand receive via MPI, PCI Express transfer, CPU-to-GPU copy.
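The following host-side sketch (illustrative only; buffer names, sizes and the neighbor rank are assumptions) shows the sequence of Fig. 6.3 for one ghost layer exchange: read the boundary layer from the GPU, exchange it with the neighboring rank via MPI, and write the received ghost layer back to the GPU.

/* 1. copy the outgoing boundary layer from device to host */
clEnqueueReadBuffer(queue, d_send_layer, CL_TRUE, 0, layer_bytes,
                    h_send, 0, NULL, NULL);

/* 2. exchange the layers with the neighboring MPI rank */
MPI_Sendrecv(h_send, layer_bytes, MPI_BYTE, neighbor_rank, 0,
             h_recv, layer_bytes, MPI_BYTE, neighbor_rank, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

/* 3. copy the received ghost layer from host to device */
clEnqueueWriteBuffer(queue, d_ghost_layer, CL_TRUE, 0, layer_bytes,
                     h_recv, 0, NULL, NULL);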


To get a better intuition of the Multi-GPU MPI communication process, a geometric overview of the data communication in a Multi-GPU LBM simulation is provided in Fig. 6.4.

Figure 6.4.: A geometric overview of Multi-GPU MPI communication. The figure shows the three forms of communication, namely GPU-GPU, GPU-CPU and MPI communication. In this thesis, local communication is not implemented. From [5].

In the case that one host manages more than one GPU, no MPI communication is required for the data exchange between those GPUs and the communication can be performed locally, since the GPUs can access each other's global memory directly, or indirectly through the host. However, this thesis focuses on MPI communication only.

    6.2. Software Design

From the beginning of this thesis, software design has been an important aspect. The following aspects were considered in the design process:

• Extensibility: New capabilities can be added to the software effortlessly, without restructuring major parts of the software components and their interrelations.

• Modularity: The software comprises independent modules, which leads to better maintainability. In addition, the components can be tested in isolation before being integrated into the software.

• Maintainability: As a consequence of modularity and extensibility, locating bugs is simplified.

• Efficiency: The software is optimized in many respects. Data structures are chosen such that they consume little memory and provide the performance needed for a massively parallel, large-scale simulation software.

• Scalability: Scalability is the major design goal of the software developed in this thesis. Sophisticated techniques like the overlapping of work and communication are implemented in order to achieve good scalability for large-scale simulations.

    Apart from HPC and design aspects, the software development of this thesis also offers:

    • Storing visualization data in VTK format (legacy format and xml binary format)

• Automatic profiling of different code modules with tools like Scalasca [7], Vampir and Valgrind [14]

In the following, a detailed overview of the available modules and their underlying design strategy is provided.

    6.2.1. Modules

In this section, all modules developed in this thesis and their interoperability are discussed. Figure 6.5 presents a general overview of the simulation flow. The figure was generated using Callgrind. In addition, it shows the caller/callee relationships between the most essential functions of the simulation software.

Figure 6.5.: Simulation callgraph.

Each node in the graph represents a function, and each edge represents calls. The cost shown per function is the cost spent while that function is running.

In the following, a detailed overview of the software modules is provided:

Manager Class: The Manager class is responsible for the management of general tasks in the simulation process that are not specific to one particular subdomain, e.g. the domain decomposition and the assignment of subdomains to different processes. These tasks are normally carried out during the initialization phase and need to be done only once during the simulation.


The class features one template parameter, T, which determines the data type of the simulation data stored in memory. Using this template parameter makes it possible to run the simulation software on GPUs with single or double precision support. This template parameter is also used in most of the other software modules developed in this thesis.

• Simulation Parameters: The Manager class stores the grid information, such as the domain length and the number of lattice cells in each direction. Further, the class also saves the number of subdomains for the domain decomposition, which is specified by the user via the configuration file or command line arguments. The grid information and the number of subdomains are passed as constructor arguments when the Manager class is instantiated. Using this information, the Manager instance computes the size of each subdomain and its location in the entire grid.

• Domain Partitioning: The Manager class is also responsible for assigning tasks (the computation of each subdomain) to MPI processes. Distributing the workload across multiple processes can be performed using various strategies. An optimized strategy is one that provokes the least amount of communication between MPI processes.

• Partition Boundaries: In addition, based on the location of each subdomain in the simulation grid, the class determines the appropriate boundary conditions for each subdomain. For instance, for subdomains that share a domain face with a neighboring subdomain, the boundary condition for that face is set to FLAG_GHOST_LAYER. This flag specifies that the computation related to this face can only be carried out after the corresponding ghost layer data has been fetched from the neighboring subdomain.

• Communication: After detecting a ghost layer face, the Manager class initiates a Comm object. The Comm class contains the information required for the communication with the neighbors, like the size and origin of the data to send and receive. The Comm class is explained in more detail in the following sections.

• Simulation Geometry: The Manager class is also responsible for setting up the geometry of the simulation domain. This is achieved by setting appropriate flags for each cell of the subdomain. FLAG_OBSTACLE and FLAG_FLUID are some examples of the currently available flags.

• Controller: The Manager class, as previously stated, is responsible for the tasks related to the initialization phase of the simulation. Two strategies are available: in the first, the initialization is done on the root process and the results are sent to the other processes via MPI send commands; in the second, each process performs the initialization phase separately. Which strategy leads to better performance depends on the initialization/communication time ratio. In this thesis, since the initialization operations are not compute-intensive, each process performs the initialization of its own subdomain individually in order to avoid communication between the MPI processes. As a consequence, every MPI process instantiates its own Manager class, which carries out the initialization operations of the corresponding subdomain. In addition, every Manager instance aggregates a Controller class which controls the simulation procedure of that subdomain.

Controller Class: The class Controller provides an interface to control the simulation steps on the corresponding subdomain.

For every subdomain, one Controller instance is created. Each Controller has a unique ID which is used in MPI communications as the identifier of the subdomain. Each Controller instance needs to communicate with the neighboring Controller instances in order to synchronize the ghost layer data. This task is accomplished by the syncAlpha and syncBeta functions, which send the alpha and beta ghost layer data, respectively. An overloaded version of these functions, with two additional arguments of types MPI_Request and MPI_Status, provides the ability to perform non-blocking MPI communication.
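As a rough illustration of what such a non-blocking exchange amounts to at the MPI level (a sketch with assumed buffer names and tags, not the thesis interface), the ghost layer of one direction could be exchanged like this:

MPI_Request reqs[2];

/* post the receive for the neighbor's ghost layer ... */
MPI_Irecv(h_recv, layer_bytes, MPI_BYTE, neighbor_rank, tag,
          MPI_COMM_WORLD, &reqs[0]);
/* ... and send our own boundary layer without blocking */
MPI_Isend(h_send, layer_bytes, MPI_BYTE, neighbor_rank, tag,
          MPI_COMM_WORLD, &reqs[1]);

/* overlap other work here, then wait before using the received data */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);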

    The required MPI communication information is stored in Comm class instance assignedto the corresponding communication. For subdomains with more than one neighbor theComm instances are stored in a C++ std::map container with the position of neighboringdomain (direction of communication which is a C++ enumeration) as the key of map con-tainer. This makes it possible to access the communication information of each directionindependently. This feature is used later in order to synchronize the ghost layer data ofeach direction separately (see section 6.4).

    Before sending the ghost layer data, the functions storeDataAlpha and storeDataBeta load the data into the send buffers. The setDataAlpha and setDataBeta functions place the data received from neighboring subdomains in the proper locations of the local simulation data of the current subdomain. These functions are overloaded twice. The overloaded functions with one argument of type MPI_COMM_DIRECTION perform their task only in the direction specified by the argument. There is also another overloaded version of these functions which, in addition to the direction, provides the ability to profit from the event-based OpenCL synchronization features by adding three more OpenCL event arguments.

    The strategy behind the different optimization methods is implemented in the simulationStepAlpha and simulationStepBeta functions. Basically, these functions encapsulate all operations needed to perform one simulation step of the alpha and beta computations, respectively. For example, which part of the computation domain is computed first and the order in which the computation and communication operations are performed can be encapsulated in these functions (see section 6.4). The Controller class performs the alpha and beta simulation steps by utilizing the lbm_solver attribute, which is of type LbmSolver. This class is explained in the following section.

    The function initLbmSolver, which is called in the constructor of the Controller class, queries the local GPUs and creates the OpenCL platforms, contexts and command-queues. It also chooses an appropriate available GPU device for performing the computations. In order to apply the modularity principle in the software design, the Controller class does not directly enqueue the OpenCL kernels to the command-queues; the LbmSolver class is employed for this purpose. Therefore, the lbm_solver attribute is instantiated in this function and the OpenCL context and device values are passed as constructor arguments of the LbmSolver class.
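
    A minimal sketch of what such a device-selection routine can look like with the plain OpenCL C API is shown below. Error handling is omitted and choosing the device by "rank modulo number of local GPUs" is a simplifying assumption for illustration, not necessarily the selection logic of the thesis code.

    // Sketch: create platform, context and command-queue for one GPU (OpenCL 1.x API).
    #include <CL/cl.h>
    #include <vector>

    cl_command_queue createQueueForRank(int mpi_rank, cl_context &context, cl_device_id &device)
    {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);

        cl_uint num_devices = 0;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, &devices[0], NULL);

        device  = devices[mpi_rank % num_devices];            // pick one local GPU per process
        context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

        // one in-order command-queue on the chosen device
        return clCreateCommandQueue(context, device, 0, NULL);
    }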

    Furthermore, the Controller class aggregates a visualization class instance in order to visualize the simulation data in each time step.

    29

  • 6. Multi-GPU LBM

    LbmSolver Class: This class works as a wrapper around all available OpenCL kernels. It provides interface functions for enqueueing computation and device/host communication kernels.

    The function reload allocates the OpenCL memory objects which contain the local computation results. Additionally, it creates the OpenCL kernels, compiles them and assigns the kernel arguments.
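
    The following sketch illustrates the kind of work such a routine performs with the OpenCL C API; the buffer, program and kernel names are hypothetical and the actual reload implementation may differ.

    // Sketch: allocate a distribution buffer, build the program and set kernel arguments.
    #include <CL/cl.h>
    #include <cstddef>

    void setupKernels(cl_context ctx, cl_device_id dev,
                      const char *src, size_t src_len, size_t num_cells, int Q)
    {
        // density distributions: Q values per cell, stored in one linear buffer
        cl_mem d_df = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                     sizeof(float) * num_cells * Q, NULL, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, &src_len, NULL);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);            // compile for the chosen device

        cl_kernel alpha = clCreateKernel(prog, "lbm_alpha", NULL); // kernel name is hypothetical
        clSetKernelArg(alpha, 0, sizeof(cl_mem), &d_df);           // pass the distribution buffer
    }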

    Enqueueing OpenCL commands into the command-queues is done only through the interface provided by this class. For instance, it implements the functions simulationStepAlpha and simulationStepBeta, which enqueue the OpenCL kernels that perform the alpha and beta computations on the entire local domain. In addition, the simulationStepAlphaRect and simulationStepBetaRect functions give the user the capability to run the alpha and beta kernels on a rectangular part of the domain. To use these functions, the user needs to provide the origin and size of the rectangular part. The rectangular functionality is also available for the functions which store and set data from and to the device. With the help of these functions it is possible to modify the data of a specific rectangular part of the whole domain. This feature is used for getting and setting the ghost layer data.

    Most of the functions provided in this class are overloaded to exploit the OpenCL event synchronization mechanism. The usage of these functions is explained in later sections.

    Comm Class: The Comm class encapsulates the information required for the MPI communication between Controller instances. The public interface of the class offers functions for accessing the MPI destination rank as well as the origin and size of the data to send to and receive from that rank. The Comm class also allocates the receive and send data buffers in its constructor, so the buffers are allocated only once for the whole simulation.
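
    A minimal sketch of the information such an object has to carry is shown below; the member names and types are illustrative assumptions rather than the exact class layout.

    // Sketch: data carried by a Comm object for one neighboring subdomain.
    #include <vector>
    #include <cstddef>

    struct CComm {
        int dst_rank;                      // MPI rank of the neighboring subdomain
        int send_origin[3], send_size[3];  // rectangular region extracted from the local domain
        int recv_origin[3], recv_size[3];  // rectangular region written back as ghost layer
        std::vector<float> send_buffer;    // allocated once in the constructor,
        std::vector<float> recv_buffer;    // reused in every simulation step

        CComm(int rank, size_t n_elems)
            : dst_rank(rank), send_buffer(n_elems), recv_buffer(n_elems) {}
    };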

    Configuration Class: This class is in charge of parsing the general settings of the simulation and providing global access to them.

    The simulation configuration can be set through command line arguments or by providing an XML file. An example of the XSD schema is available in appendix A. The configuration file name can be passed as a constructor argument when instantiating the Configuration class, or it can be specified as the argument of the loadFile function. The simulation settings are divided into four categories: physics, grid, simulation and device settings. The settings of each category are available under the corresponding XML element in the configuration file (see A).

    The class should provide a global point of access to the settings. This can be achieved via the Singleton software design pattern. The singleton pattern ensures that a class has only one instance, which is globally available through the function Instance(). In order to be able to turn any class into a singleton, a C++ singleton template class is implemented. By using the Configuration class as the template parameter of the Singleton class, it can easily be used as a singleton.
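
    A minimal sketch of such a generic singleton wrapper is shown below; the real template in the thesis code may differ in detail (the class name and the commented usage line are illustrative).

    // Sketch: generic singleton wrapper with a global access point Instance().
    template <typename T>
    class CSingleton {
    public:
        static T& Instance() {
            static T instance;   // constructed on first use, unique for the whole process
            return instance;
        }
    private:
        CSingleton();            // the wrapper itself is never instantiated
    };

    // Usage (illustrative): access the global configuration from anywhere in the code.
    // CConfiguration &config = CSingleton<CConfiguration>::Instance();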

    LbmVisualizationVTK Class: All visualization classes have to inherit from the abstract base class ILbmVisualization, which defines the interface of a visualization class compatible with the software developed in this thesis.


    The ILbmVisualization interface supplies two functions, namely setup() and render(). The first function is used for the initialization of the visualization process. The render() function should be called whenever the simulation data needs to be visualized. The functionality of the render() function depends on the implementing class. For instance, in the case of the LbmVisualizationVTK class, the render function saves the simulation data to a VTK file.
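
    A sketch of this interface and a derived VTK writer is given below; the two function names are taken from the text, while the empty bodies are placeholders.

    // Sketch: abstract visualization interface and a VTK implementation stub.
    class ILbmVisualization {
    public:
        virtual ~ILbmVisualization() {}
        virtual void setup()  = 0;   // initialize the visualization (e.g. open output resources)
        virtual void render() = 0;   // write or draw the current simulation data
    };

    class CLbmVisualizationVTK : public ILbmVisualization {
    public:
        void setup()  {}             // e.g. prepare the VTK output location
        void render() {}             // e.g. dump the current time step to a VTK file
    };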

    6.3. Basic Implementation

    To develop the software in this thesis, the following simple, disciplined and pragmatic approach to software engineering, which has been attributed to Kent Beck, is applied: Make It Work, Make It Right, Make It Fast.

    According to this approach, first a version of the software is developed which works properly and fulfills the basic goals of the thesis. In the next step, various methods to improve the basic design are implemented and verified. Finally, in the last step of development, the software is optimized in many aspects to gain the best performance in terms of execution time and scalability for large-scale simulations.

    This section describes the implementation of the basic method. This method computes all subdomains by employing one GPU per subdomain. After all computations of a time step are accomplished, the boundary regions of the subdomains are exchanged between the GPUs. “A GPU cannot directly access to the global memory of other GPUs, as a result, host CPUs are used as a bridge for data exchange between GPUs. The data exchange is composed of the following 3 steps: (1) Data transfer from GPU to CPU (2) Data exchange between CPUs via MPI (3) Data transfer back from CPU to GPU” [18]. A schematic overview of this three-step communication is illustrated in Fig. 6.6.

    Figure 6.6.: Schematic timeline for the basic method. From [18].
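
    Expressed directly in terms of the OpenCL and MPI APIs, the three-step exchange of one boundary face could look roughly as follows. This is only a sketch: the buffer names, the blocking transfers and the single neighbor rank are illustrative assumptions, not the thesis implementation, which wraps these steps in helper functions.

    // Sketch of the three-step ghost layer exchange of the basic method for one face.
    #include <CL/cl.h>
    #include <mpi.h>
    #include <cstddef>

    void exchangeFace(cl_command_queue queue, cl_mem d_boundary, cl_mem d_ghost,
                      float *h_send, float *h_recv, size_t n, int neighbor_rank)
    {
        // (1) GPU -> CPU: copy the local boundary layer into the host send buffer (blocking)
        clEnqueueReadBuffer(queue, d_boundary, CL_TRUE, 0, n * sizeof(float),
                            h_send, 0, NULL, NULL);

        // (2) CPU <-> CPU: exchange the layers with the neighboring MPI process
        MPI_Sendrecv(h_send, (int)n, MPI_FLOAT, neighbor_rank, 0,
                     h_recv, (int)n, MPI_FLOAT, neighbor_rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // (3) CPU -> GPU: copy the received ghost layer back into device memory (blocking)
        clEnqueueWriteBuffer(queue, d_ghost, CL_TRUE, 0, n * sizeof(float),
                             h_recv, 0, NULL, NULL);
    }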

    The code developed for this approach does not require sophisticated synchronization techniques, since the MPI communication is performed only once the computation of all subdomain cells is accomplished. A sample α-iteration is shown in Listing 6.1.

    Listing 6.1: Implementation of the non-optimized basic method for computing one alpha simulation step.

    // First the alpha computation of all cells
    // is performed
    cLbmPtr->simulationStepAlpha();
    // The communication phase starts after the computation phase is
    // completely finished.
    syncAlpha();


    Although the implementation of this method is straightforward, the communication overhead introduced by the three-step communication scheme described above can reduce the performance of the simulation. We expect its impact to become even more significant when more GPUs are used.

    In this method, the computation of the whole domain has to be accomplished before the boundary data can be exchanged. Therefore, the resources for the GPU/CPU transfers and the MPI communication stay idle for a long time.

    In the next sections, techniques to hide the communication costs by overlapping the communication and computation parts are investigated.

    6.4. Overlapping Work and Communication

    Increasing the performance becomes more difficult, primarily because the inter-node communication cannot keep up with the performance increase of massively parallel GPUs. Achieving good parallel efficiency on distributed-memory machines requires more advanced programming techniques for hiding the communication overhead by overlapping methods.

    A possibility to reduce the communication overhead is to perform the communication in parallel to the actual simulation. This requires two OpenCL LBM kernels: one updates the outer layer of the block, and one updates the inner region. In this way, the computation of the outer boundary is accomplished first. “Next, the computation of inner part, and the extraction, insertion and the MPI communication is executed asynchronously. Hence, the time spent for all PCI-E transfers and the MPI communication can be hidden by the computation of the inner part, if the time of the inner kernel is longer than the time for the communication. If this is not the case only part of the communication is hidden.” [5]

    Implementing the overlapping techniques requires the asynchronous execution of GPU kernels, GPU-CPU copy operations and MPI communication commands on the CPU. Both the OpenCL and MPI standards provide advanced synchronization mechanisms, which are discussed in section 4.2.5.

    Operations used in a Multi-GPU LBM simulation fall into the following four categories:

    • GPU Computation: The operations that perform the actual computation of the collision operators belong to this category. These operations run solely on the GPU.

    • GPU-CPU Communication: This category consists of the operations that are responsible for transferring updated values from GPU memory to host memory and vice versa.

    • CPU-Computation: In a hybrid model, the CPU is utilized in addition to the GPU to perform part of the computations. However, this is not exploited in our implementation.

    • MPI-Communication: All MPI commands which are used to synchronize the ghost layer data.

    The primary idea of the techniques designed in this thesis is to overlap the operations of these four categories, reducing the critical path length of the dependency graph.


    To overlap the GPU-Computation and GPU-CPU Communication operations, the advanced OpenCL event synchronization techniques described in section 4.2.5 can be exploited. In addition, if several kernels are used to compute different areas of the subdomain, the device must be capable of running multiple kernels simultaneously. In the CUDA programming model, this can be achieved by exploiting the CUDA stream concept. Achieving the same result with OpenCL is theoretically possible by using several OpenCL command-queues associated with the same device; the current OpenCL specification does not mandate this feature and, as a result, it is completely dependent on the vendor providing the OpenCL driver. Overlapping the GPU operations with the CPU-Computation category can be achieved by taking advantage of the OpenCL callback mechanism. It will be discussed in later sections that the overhead introduced by the callback mechanism significantly degrades the performance and scaling of the software. By using non-blocking MPI commands, CPU operations and MPI-Communication operations can run asynchronously. Based on these overlapping techniques for the Multi-GPU LBM, several approaches are designed and implemented in this thesis and their performance results are compared in the following sections.

    6.4.1. SBK-SCQ Method

    The Single Boundary Kernel Single Command-Queue (SBK-SCQ) method consists of one boundary kernel that computes all boundary values of every direction at once. The main idea behind this approach is to utilize this kernel to first update only the boundary values; then the communication process and the computation of the inner cells are performed asynchronously.

    In Fig. 6.7 a schematic timeline for the operations performed in this method is illustrated.

    [Figure 6.7 shows four timeline rows (CPU Computation, MPI Comm., GPU Computation, GPU-CPU Comm.): the boundary computation is followed by per-direction Store, ISend, MPI communication, Irecv and Set operations, while the inner computation runs on the GPU.]

    Figure 6.7.: Schematic timeline for SBK-SCQ method.


    After computing the boundary values, the data transfer operations are triggered. During the communication phase, the boundary values of each direction are first transferred separately from the GPU memory to the host memory. This is achieved by enqueueing OpenCL data transfer functions to the single command-queue.

    In the next step, the ghost layer data are exchanged between the MPI processes using non-blocking MPI send/receive functions. Since in this method no concurrent OpenCL kernel execution can be performed on the device, GPU-Computation and GPU-CPU transfer operations cannot overlap. In contrast, as soon as the transfer operation of one boundary is finished, the corresponding MPI communication can be executed on the host while the transfer operations of the other boundaries are executed on the GPU asynchronously.

    Finally, each simulation step concludes with receiving the data from the neighboring domains. This data needs to be transferred from the host memory to the correct location in the GPU memory. Again, in this phase, separate transfer operations are executed for each boundary.

    Since only one OpenCL command-queue is used, overlapping the GPU computation and the GPU-CPU transfer operations is not possible. As a result, the computation of the inner part of the domain can be triggered only after the data transfer operations are accomplished.

    A schematic sample code of this method is shown in Listing 6.2. simulationStepAlphaBoundaries() is the function that enqueues the boundary kernel, which computes only the boundary values.

    The functions storeDataAlpha() and setDataAlpha() are responsible for transferring the data from the GPU memory to the host memory and vice versa. The single argument of these functions specifies the intended boundary location.

    syncAlpha() performs non-blocking MPI send and receive operations. In addition to the direction of communication, this function has two output arguments of type MPI_Request for the send and receive operations. These arguments are used later to obtain information about the status of the MPI operations.

    While the communication operations are being performed, the computation of the inner part of the domain is triggered using the simulationStepAlphaRect() function.

    Listing 6.2: Implementation of SBK-SCQ method for computing one alpha simulation step.

    /////////////////////////////////////////////////////
    // --> One kernel computing the entire boundary cells
    /////////////////////////////////////////////////////
    cLbmPtr->simulationStepAlphaBoundaries(0, NULL, NULL);

    storeDataAlpha(MPI_COMM_DIRECTION_X_0);
    syncAlpha(MPI_COMM_DIRECTION_X_0, &req_send_x0, &req_recv_x0);

    storeDataAlpha(MPI_COMM_DIRECTION_X_1);
    syncAlpha(MPI_COMM_DIRECTION_X_1, &req_send_x1, &req_recv_x1);

    // same for y and z directions
    ...

    // --> Computation of inner part
    cLbmPtr->simulationStepAlphaRect(inner_origin, inner_size, 0, NULL, NULL);

    // receiving and setting the boundary data from neighbors
    MPI_Status stat_recv;
    MPI_Wait(&req_recv_x0, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_0);

    MPI_Wait(&req_recv_x1, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_1);

    // same for y and z directions
    ...

    As shown in Fig. 6.7, there are many gaps between the four kinds of operations. In order to improve the overall performance, the timeline should be as tight as possible. This can be achieved by overlapping more independent operations. The identification of independent tasks requires a more advanced dependency analysis of the simulation process.

    One way to achieve this is to divide the boundary kernel into six separate kernels, one for each boundary region. Hence, the computation of each boundary region can be accomplished independently, which provides more overlapping opportunities. This method is explained in the next section.

    6.4.2. MBK-SCQ Method

    In the “Multiple Boundary Kernels Single Command-Queue” (MBK-SCQ) method, in contrast to the SBK-SCQ method, each boundary region is computed separately. Therefore, once the computation of one boundary region is finished, the GPU-CPU transfer operation of that region gets triggered.

    Figure 6.8 illustrates a schematic timeline of the MBK-SCQ method.


    [Figure 6.8 shows four timeline rows (CPU Computation, MPI Comm., GPU-CPU Comm., GPU Computation): the per-direction boundary computations (X0 to Z1) are each followed by their Store, Isend, MPI communication, Irecv and Set operations, with the inner computation at the end of the GPU timeline.]

    Figure 6.8.: Schematic timeline for MBK-SCQ method.

    Although decomposing the boundary computation provides more flexibility in organizing the operations, it requires more advanced synchronization techniques.

    The GPU-CPU transfer operation of each boundary region should be executed directly after the computation of the corresponding region has finished. In addition, each boundary region should only be computed after the previous boundary region has successfully completed its task.

    A sample code of this method is provided in Listing 6.3. In this method, instead of executing one kernel that computes all of the partition's boundary regions, each boundary is computed separately by using the simulationStepAlphaRect() function. As explained in section 6.2, this function performs the collision and streaming operations on a rectangular subregion of the domain. The origin and size of the subregion are given as function arguments.

    Listing 6.3: Implementation of MBK-SCQ method for computing one alpha simulation step.

    ///////////////////////////////////////
    // --> Simulation step alpha x boundary
    ///////////////////////////////////////
    // performing the alpha time step on x0 boundary
    cLbmPtr->simulationStepAlphaRect(x0_origin, x_size, 0, NULL, &ev_ss_x0);

    // performing the alpha time step on x1 boundary
    cLbmPtr->simulationStepAlphaRect(x1_origin, x_size, 1, &ev_ss_x0, &ev_ss_x1);

    // --> Store x boundary
    storeDataAlpha(MPI_COMM_DIRECTION_X_0, 1, &ev_ss_x0, &ev_store_x0);
    storeDataAlpha(MPI_COMM_DIRECTION_X_1, 1, &ev_ss_x1, &ev_store_x1);

    // --> Sync x boundary
    syncAlpha(MPI_COMM_DIRECTION_X_0, &req_send_x0, &req_recv_x0);
    syncAlpha(MPI_COMM_DIRECTION_X_1, &req_send_x1, &req_recv_x1);

    ...

    // receiving and setting the boundary data from neighbors
    MPI_Status stat_recv;
    MPI_Wait(&req_recv_x0, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_0);

    MPI_Wait(&req_recv_x1, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_1);

    Although decomposing bigger tasks into smaller tasks provides more flexibility in overlapping independent computation parts, the overhead introduced by the implementation can degrade the performance. In this method, no concurrent kernel execution on one device can be applied. Hence, the implementation overhead can dominate the performance results. This is discussed in more depth in section 6.7. In section 6.4.3, some techniques to overcome this problem are introduced.

    MBK-SCQ Method with OpenCL Callback Mechanism Another technique investigated in this thesis is taking advantage of the OpenCL callback mechanism to execute the MPI communication commands.

    The callback mechanism, which is introduced in section 4.2.5, provides the ability to invoke a function on the host when an OpenCL event has reached a specific status.

    A typical usage scenario for OpenCL callbacks is an application in which the host would otherwise have to wait while the device is executing, which reduces efficiency.

    The callback mechanism can be used in the Multi-GPU implementation in such a way that, when the event associated with a GPU-CPU transfer operation has changed its status to CL_COMPLETE, a callback function is triggered which invokes the MPI communication commands to exchange the transferred data with other processes. In this way, the host process can continue enqueueing the next GPU kernels without being interrupted by MPI commands.
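
    A sketch of how such a callback could be wired up with clSetEventCallback is shown below. The context structure and the use of MPI_Isend inside the callback are illustrative assumptions; note also that the callback may run in a driver thread, so calling MPI from it is only legal if the MPI library has been initialized with MPI_THREAD_MULTIPLE.

    // Sketch: trigger the MPI send of a ghost layer from an OpenCL event callback.
    #include <CL/cl.h>
    #include <mpi.h>

    struct CallbackCtx {            // illustrative context passed to the callback
        float      *host_buffer;    // ghost layer data already copied to the host
        int         count;
        int         neighbor_rank;
        MPI_Request request;
    };

    void CL_CALLBACK onTransferComplete(cl_event ev, cl_int status, void *user_data)
    {
        CallbackCtx *ctx = static_cast<CallbackCtx*>(user_data);
        if (status == CL_COMPLETE)  // transfer to the host has finished, start the MPI send
            MPI_Isend(ctx->host_buffer, ctx->count, MPI_FLOAT,
                      ctx->neighbor_rank, 0, MPI_COMM_WORLD, &ctx->request);
    }

    // Registration, e.g. right after enqueueing the non-blocking GPU->CPU transfer:
    //   clSetEventCallback(transfer_event, CL_COMPLETE, onTransferComplete, &ctx);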

    Although this technique is promising in theory, it is shown in section 6.7 that the overhead introduced by the OpenCL callback mechanism drastically degrades the performance and scalability of the software.

    6.4.3. MBK-MCQ Method

    The fundamental shortcoming of the previous methods is the lack of concurrent execution of GPU-Computation and GPU-CPU transfer operations on one device. Although the OpenCL specification does not require this capability from the vendors providing the OpenCL drivers, the feature is available on some NVIDIA graphics cards when using the CUDA programming model.

    In order to execute GPU-CPU transfer operations and GPU-Computation operations simultaneously under the OpenCL standard, these commands have to be enqueued to separate OpenCL command-queues associated with the same device. This is the basic idea behind the “Multiple Boundary Kernels Multiple Command-Queue” (MBK-MCQ) method. By using multiple command-queues, e.g. two command-queues, one of the command-queues can be devoted to the GPU-CPU transfer operations while the other is utilized for the GPU-Computation operations. As a result, overlapping of these two command types can be achieved.
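
    A minimal sketch of this queue setup is shown below; the function and variable names are illustrative, and whether the two queues actually execute concurrently depends on the OpenCL driver, as discussed above.

    // Sketch: two command-queues on the same device, one for kernels and one for transfers.
    #include <CL/cl.h>

    void createQueues(cl_context ctx, cl_device_id dev,
                      cl_command_queue &compute_queue, cl_command_queue &transfer_queue)
    {
        compute_queue  = clCreateCommandQueue(ctx, dev, 0, NULL); // boundary/inner kernels
        transfer_queue = clCreateCommandQueue(ctx, dev, 0, NULL); // clEnqueueRead/WriteBuffer calls

        // The two queues are independent, so OpenCL events have to be used to make the
        // transfer of a boundary region wait for the kernel that computed it.
    }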

    Figure 6.9 represents the schematic timeline of this method. As shown in this figure, after computing the x0 boundary region, the transfer operation for this boundary region is enqueued to the GPU-CPU transfer command-queue while, simultaneously, the computation of the x1 boundary region is enqueued to the GPU-Computation command-queue.

    [Figure 6.9 shows four timeline rows (CPU Computation, MPI Comm., GPU-CPU Comm., GPU Computation): the per-direction boundary computations and the inner computation overlap with the Store, Isend, MPI communication, Irecv and Set operations of the individual directions.]

    Figure 6.9.: Schematic timeline for MBK-MCQ method.

    If the GPU is capable of concurrent kernel execution, the computation of the inner region can also be triggered at the beginning of the simulation step, in parallel with the computation of the boundary regions. By taking advantage of concurrent kernel execution and of overlapped transfer and computation operations on the GPU, the timeline shown in figure 6.9 is much denser than those of the previous methods, which theoretically results in a large performance boost.

    Exploiting the hardware features required by this method depends not only on the capabilities of the hardware but also on whether the vendors have implemented these features in their OpenCL drivers.

    In addition, the overhead introduced by scheduling several kernels simultaneously on the GPU depends on the OpenCL driver implementation. As a result, although this method promises better results in theory, in practice the previously stated difficulties can dominate the performance results. In section 6.7, the performance results of this method are compared with those of the previous methods.


    6.5. Validation

    This section describes the validation process of the Multi-GPU LBM simulation developed in this thesis. Numerical validation is an important issue, especially in GPU computing where calculations may be performed in single precision. The Multi-GPU implementation in this thesis is based on the Single-GPU code developed in [17]. The physical validation of the Single-GPU code is also provided in [17].

    6.5.1. Validation Setup

    To validate the LBM software, a lid-driven cavity scenario is used. In this scenario, a cubic domain is created with a velocity in x-direction imposed on the top wall and no-slip conditions on every other wall. The validation is performed for different Reynolds numbers and various domain resolutions.

    6.5.2. Multi-GPU Validation

    In order to verify the correctness of the MPI parallelization, the same simulation scenario is computed on one GPU as well as on multiple GPUs. Afterwards, the density distribution values of each cell in both cases are compared against each other. Since the fluid velocity and pressure are computed from the density distribution values, a comparison of these quantities is not necessary.

    In order to automate the validation process, a validation module is developed which runs the same scenario that is executed on multiple GPUs on one additional GPU without decomposing the domain. Afterwards, the simulation data of each subdomain (excluding the boundary values) are sent via MPI to the process that is responsible for computing the Single-GPU simulation. The validation process receives these data and, based on the ID of the sender process, identifies the location of the received data in the total domain. In the next step, the corresponding data of the Single-GPU run are transferred from the validation GPU's memory to the host of the validation process. Once the results of the Single-GPU and Multi-GPU simulations are available on the validation process, they are compared against each other and the number of non-matching values is counted. The validation is reported as successful only if no non-matching value is found.
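
    The core of the comparison step can be sketched as follows; the function and parameter names are illustrative, while the exact, element-wise comparison and the mismatch counting follow the description above.

    // Sketch: compare the multi-GPU result against the single-GPU reference cell by cell.
    #include <cstddef>

    size_t countMismatches(const float *multi_gpu, const float *single_gpu, size_t n)
    {
        size_t mismatches = 0;
        for (size_t i = 0; i < n; ++i)
            if (multi_gpu[i] != single_gpu[i])   // exact comparison of density distribution values
                ++mismatches;
        return mismatches;                       // validation succeeds only if this is zero
    }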

    In order to activate the validation process, the value of the “validate” element in the XML configuration file (see A) should be set to 1. In this case, the simulation has to be launched with one additional process devoted to the computation of the Single-GPU reference data. The validation can only be performed for scenarios with a domain resolution that fits into the memory of one GPU.

    In this thesis, the basic method as well as all overlapping methods have been successfully validated in this way.


    6.6. Performance Optimization

    Performance optimization is a necessary part of any HPC software development. The first step in the optimization process is profiling the software. A common profiling strategy is to find out how much time is spent in the different functions in order to identify hot spots. These hot spots are subsequently analyzed for possible optimization opportunities.

    In order to identify the hot spots of the software developed in this thesis, the first tool that is adopted is Callgrind. Callgrind is an extension to Cachegrind which produces information about the call graph of a program.

    The data collected by Callgrind consist of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the number of such calls.

    The result of the analysis of the software for the SBK-SCQ method after 100 iterations is shown in Table 6.1. In this table, the seven functions with the highest exclusive execution time are given.

    Incl.   Self   Called   Function
    16.20   0.11   6201     enqueueCopyRectKernel
    14.89   0.02    200     storeDensityDistribution
    10.41   0.01    100     setDataAlpha
     6.53   0.01    100     setDataBeta
     1.44   0.00      1     initSimulation
    41.36   0.00     50     simulationStepBeta
    50.68   0.00     50     simulationStepAlpha

    Table 6.1.: Table of called functions in the Multi-GPU LBM simulation. The table provides the data for a simulation with 100 iterations.

    As shown in the table, the function enqueueCopyRectKernel has the highest exclusive execution time and is called more often than the other functions. This function enqueues a kernel to the command-queue which copies a rectangular part of the domain from the density distribution buffer to a temporary buffer. This buffer is used in the next step to transfer the data from the GPU memory to the host memory. The callgraph of the enqueueCopyRectKernel() function is illustrated in Fig. 6.10. The figure shows that this function is used by the functions that store the boundary values from the GPU memory to the host as well as by those that set the received values in GPU memory.


    Figure 6.10.: Rectangle copy kernel callgraph.

    As discussed in section 5, the density distributions for a specific lattice vector are stored linearly in the global memory of the GPU. In this memory layout, first the density distribution values of lattice vector f0 are stored for all domain cells in a row; the domain cells are stored in x, y, z order. Subsequently, the values of the density distributions of the next lattice vector are stored for all domain cells, and so on.
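
    This structure-of-arrays layout can be summarized by a small index helper, sketched below under the assumption that x is the fastest-running index as described above; the function name is illustrative.

    // Sketch: linear index of the distribution value of lattice vector q for cell (x, y, z)
    // in an nx*ny*nz domain:  index = q*nx*ny*nz + z*nx*ny + y*nx + x
    #include <cstddef>

    inline size_t dfIndex(size_t x, size_t y, size_t z, size_t q,
                          size_t nx, size_t ny, size_t nz)
    {
        return q * (nx * ny * nz) + z * (nx * ny) + y * nx + x;
    }

    // Cells are contiguous in x; e.g. the cells of the x0 boundary plane (x = 0) lie
    // nx elements apart, so extracting a rectangular face requires strided accesses.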

    Although this way of storing the data provides an optimized memory access pattern for the Single-GPU implementation (see [17]), in the Multi-GPU case, where the boundary region data has to be exchanged in each simulation step, it leads to inefficient GPU memory accesses.

    Since the density distribution data are stored linearly, accessing the memory locations of a rectangular boundary region requires several read and write operations on non-contiguous locations in GPU memory. In Fig. 6.11 a sample memory layout for a 3x3x3 domain is illustrated; in this sample, the x0 boundary cells are highlighted in red.

    For the β-synchronization the problem becomes even more complicated, since some of the lattice vector values have to be skipped. All these challenges lead to uncoalesced memory accesses.