TheImmersedBoundary-LatticeBoltzmannMethodParallel ...downloads.hindawi.com/journals/mpe/2020/3913968.pdf · 5/14/2020 · e IB-LBM simulation contains four different calcu-lations:IBMpart,theentireflowfield,boundaryprocessing,

Research ArticleThe Immersed Boundary-Lattice Boltzmann Method ParallelModel for Fluid-Structure Interaction onHeterogeneous Platforms

Zhixiang Liu 1 Huichao Liu 1 Dongmei Huang 12 and Liping Zhou3

1College of Information Technology Shanghai Ocean University Shanghai 201306 China2Shanghai University of Electric Power Shanghai 200090 China3School of Computer Engineering and Science Shanghai University Shanghai 200444 China

Correspondence should be addressed to Dongmei Huang dmhuangshoueducn

Received 14 May 2020 Revised 27 July 2020 Accepted 5 August 2020 Published 27 August 2020

Academic Editor Efstratios Tzirtzilakis

Copyright copy 2020 Zhixiang Liu et al is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Immersed boundary-lattice Boltzmann method (IB-LBM) has become a popular method for studying fluid-structure interaction(FSI) problems However the performance issues of the IB-LBM have to be considered when simulating the practical problemse Graphics Processing Units (GPUs) from NVIDIA offer a possible solution for the parallel computing while the CPU is amulticore processor that can also improve the parallel performance is paper proposes a parallel algorithm for IB-LBM on aCPU-GPU heterogeneous platform in which the CPU not only controls the launch of the kernel function but also performscalculations According to the relatively local calculation characteristics of IB-LBM and the features of the heterogeneousplatform the flow field is divided into two parts GPU computing domain and CPU computing domain CUDA and OpenMP areused for parallel computing on the two computing domains respectively Since the calculation time is less than the datatransmission time a buffer is set at the junction of two computational domains e size of the buffer determines the number ofthe evolution of the flow field before the data exchange erefore the number of communications can be reduced by increasingbuffer size e performance of the method was investigated and analyzed using the traditional metric MFLUPS e newalgorithm is applied to the computational simulation of red blood cells (RBCs) in Poiseuille flow and through a microchannel

1 Introduction

In the immersed boundary-lattice Boltzmann method (IB-LBM) the flow field is solved by the lattice Boltzmannmethod (LBM) and the interaction between the fluid andthe immersed object is solved by the immersed boundarymethod (IB) [1] Since Feng and Michaelides [2] firstsuccessfully applied the IB-LBM to simulate the motion ofrigid particles IB-LBM has become a popular method forstudying the fluid-structure interaction (FSI) problemsEshghinejadfard et al [3] used IB-LBM to research tur-bulent channel flow in the presence of spherical particlesCheng et al [4] developed an IB-LBM for simulating amultiphase flow with solid particles and liquid drops eIB-LBM is also widely used to studying red blood cells(RBCs) flow in the blood such as the interaction between

plasma flow and RBCs movement [5] and the aggregationand deformation of red blood cells in an ultrasound field[6] Simulations of practical problems of FSI like bloodflow demand considerable computational resources [7]which will greatly reduce the simulation time

e architecture of the Graphics Processing Units(GPUs) offers high floating performance and memory toprocessor chip bandwidth [8] e LBM simulation has thecharacteristics of regular static and local calculations [9]which can be greatly accelerated by means of GPU [10 11]Shang et al [12] used MPI to show that LBM has goodparallel acceleration performance By proposing two newimplementation methods LBM-Ghost and LBM-SwapValero-Lara [13] reduced the huge memory requirementswhen simulating LBM on the GPU and proposed [14] three

HindawiMathematical Problems in EngineeringVolume 2020 Article ID 3913968 13 pageshttpsdoiorg10115520203913968

different parallel approaches for grid refinement based on amultidomain decomposition e homogeneous GPU andheterogeneous CPU+GPU approaches [15] for mesh re-finement based on Multidomain and irregular meshingcould achieve better performance In terms of IB-LBMrsquosparallel computing Valero-Lara et al [16] proposed anumerical approach parallelizing the IB-LBM on bothGPUs and a heterogeneous GPU-Multicore platformBoroni et al [17] presented a fully parallel GPU imple-mentation of IB-LBM e GPU kernel performs fluidboundary interactions and accelerated each code executionRecently Beny and Latt [9] proposed a new method for theimplementation of IB-LBM on GPUs which allows han-dling a substantially larger immersed surfaces Otherparallel methods compiler directive open multiprocessing(OpenMP) [18] and distributed memory clusters [19] canachieve multicore parallel processing with shared memoryon the CPU for instance a nonuniform staggered Car-tesian grid approach [20] reducing considerably thecomplexity of the communication and memory manage-ment between different refined levels for mesh refinement

In this paper a new parallel algorithm for IB-LBM isproposed on CPU-GPU heterogeneous platform When theGPU is performing simulation the CPU is in a waiting statewhich causes a waste of computing resources e new al-gorithm avoids this problem On the one hand the flow fieldis divided into CPU domains and GPU domains which arecalculated by the Compute Unified Device Architecture(CUDA) and OpenMP respectively According to the cal-culation process of IB-LBM after the flow field has evolvedonce data is exchanged at the junction of the two calculationdomains On the other hand the calculation time is shorterthan the communication time so buffers were introduced toreduce the number of data communications e size of thebuffer determines the number of times the flow field evolvesbefore data is exchanged Setting a buffer of proper size canreduce data communication and improve computing per-formance As expected this strategy greatly improved cal-culation efficiency without affecting the calculation accuracyof the IB-LBM

e remainder of this paper is organized as followsSection 2 introduces IB-LBM Section 3 illustrates theparallel algorithm on CPU-GPU heterogeneous platformthat makes full use of computing resources Section 4 showsthe benchmark tests and analyzes the parallel performanceRBCs in Poiseuille flow and through a microchannel aresimulated in Section 5 Finally Section 6 concludes thisstudy

2 Immersed Boundary-LatticeBoltzmann Method

IB-LBM is a flexible and efficient calculation in the case ofsimulated fluid flow through moving or complex bound-aries e computational area of the IB-LBM consists of twotypes of lattice points the Lagrangian points and theEulerian points (see Figure 1)

e Euler point represents the entire flow field and theLagrangian point illustrates the boundary of the immersed

object e body force of the object is calculated by thedeformation of the boundary and then it is dispersed to theEuler point of the flow field using the Dirac delta functionFinally the macroscopic density and velocity of the flow fieldare solved

e DnQm LBGK is the most widely used model forsolving the flow field in the IB-LBM n is the spatial di-mension and m represents the number of the discrete ve-locity [21 22] In this paper the most commonly usedmodel D2Q9 (see Figure 2) is employed and the corre-sponding discrete velocity ei is

ei

0 i 0

cos (i minus 1)π2

1113876 1113877 sin (i minus 1)π2

1113876 11138771113874 1113875c i 1 2 3 4

2

radiccos (i minus 5)

π2

+π4

1113876 1113877 sin (i minus 5)π2

+π4

1113876 11138771113874 1113875c i 5 6 7 8

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(1)

where c ΔxΔt is the lattice velocity and Δx is the latticespace

e LBGK model can be expressed as

fi x + eiΔt t + Δt( 1113857 minus fi(x t)

minus1τ

fi(x t) minus feqi (x t)1113858 1113859 + ΔtFi

(2)

where fi(x t) is the particle distribution function x iscurrent lattice location and t is current lattice time Δt is thelattice time step and τ is the relaxation time which is de-termined by the fluid viscosity coefficient υ of the fluid

υ τ minus12

1113874 1113875c2sΔt (3)

where cs is the speed of sound e corresponding feqi (x t)

is the equilibrium distribution function It can be written as

Lagrangianpoints

Eulerianpoints

Figure 1 Two-dimensional flow field boundary structure of IB-LBM mode

2 Mathematical Problems in Engineering

feqi (x t) ρωi 1 +

eiu

c2s

+eiu( 1113857

2

2c4s

minus|u|

2

2c2s

1113890 1113891 (4)

where u is the macroscopic velocity at the current latticepoint and ωi is the weight factor of the discrete velocity Inthe D2Q9 model the value of ωi can be written as

ωi

49 e

2i 0

19 e

2i c

2

136

e2i 2c

2

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)

For boundary conditions this paper adopts the non-equilibrium extrapolation method [23] which can beexpressed as

fi xb t( 1113857 feqi ρ xf t1113872 1113873 ub1113872 1113873 + fi xf t1113872 1113873 minus f

eqi xf t1113872 11138731113960 1113961

(6)

where ρ(xf t) represents the density of the adjacent latticepoint xf of the boundary lattice point xb and ub is thevelocity of the boundary lattice point xb

In the IB-LBM the key problem is the calculation anddispersion of forces at the Lagrangian point e penaltyforce method [2] is a way to calculate resilience e basicidea of the penal force method is to apply Hookrsquos law ofelastic deformation Regarding the boundary of the im-mersed object as an elastic boundary composed of La-grangian points the object is deformed by force and theposition of the Lagrangian point is moved to the position ofthe reference point (see Figure 3)

Restoring force due to the deformation is calculated byHookersquos law

Fb minus ε middot xli minus x

ri1113872 1113873 (7)

where xli is the Lagrangian point xr

i is the reference pointafter the corresponding Lagrangian point is moved and ε isthe stiffness coefficient the value is 003

In the calculation process the position of the Lagrangianpoint is constant After each lattice time step the position ofthe reference points moved e body force F(x t) in theflow field is obtained by interpolation of the restoring forceFb at the boundary point to the lattice point of the flow fieldIt is calculated by

e0

e1

e8e4e6

e2

e7e3 e5

(a)

f0

f1

f8f4f6

f2

f7 f3 f5

(b)

Figure 2 Velocity vectors and function vectors in D2Q9 lattice model (a) Velocity vectors (b) Distribution function

xri

xli

Figure 3 Boundary point and reference point of objectdeformation

Mathematical Problems in Engineering 3

F(X t) 1113946ΓFb(s t)δ X minus xi(t)( 1113857ds

1113944n

i

Fb middot δ X minus xi(t)( 1113857dsi

(8)

where X is the Euler point in the flow field xi is the i-thLagrangian point on the boundary Fb is the correspondingrestoring force n is the total number of Lagrangian points onthe boundary and dsi is the length of boundary elementδ(X minus xi(t)) is a Dirac function that can be used to inter-polate the restoring force at the boundary point to the Eulerpoint of the flow field Dirac functions have a variety ofexpressions e interpolation equation used in this paper isas follows

δ(x) 1 minus |x| 0le |x|le 1

0 1le |x|1113896 (9)

After interpolating the restoring force at the boundarypoint of the object to the Euler point of the flow field thebody force on the Euler point needs to be discretizedaccording to the DnQm model [24] It can be expressed by

Fi(X t) 1 minus12τ

1113874 1113875ωi

ei minus u

c2s

+ei middot u( 1113857

c4s

ei1113890 1113891 middot F(X t) (10)

According to the calculation of the above equation themacroscopic density and velocity of each lattice point can bewritten as

ρ 11139448

i0fi

ρu 11139448

i0eifi +

12

FΔt

(11)

After getting the macro speed of each lattice point thespeed of the boundary point is calculated by interpolationen the deformation displacement of the boundary point isobtained

e calculation steps for the IB-LBM can be summarizedas follows (see Figure 4)

3 IB-LBM Parallel Model onHeterogeneous Platforms

For the development of parallel algorithm of IB-LBM it isimportant to make the most of computing resources onCPU-GPU heterogeneous platforms e traditional parallelalgorithm on CPU-GPU heterogeneous platforms isimplemented by dividing the algorithm into GPU partwhich is responsible for computing the LBM part and CPUpart which consists of executing the IBM part However theamount of calculation between IBM and LBM is inconsis-tent which may cause the GPU or CPU to wait for the otherOn the other hand after the evolution of each time stepsome GPU and CPU data need to be transferred which leadsto extra time consumption is paper attempts to provide anew approach to make full use of the computing resources of

both GPU and CPU e new method improves the tra-ditional algorithms

A GPU contains multiple multiprocessors and eachmultiprocessor contains multiple processors called coreserefore in the single-instruction multiple-data mode theGPU can process a large amount of data at the same timee solution of IB-LBM to fluid-structure interactionproblems is relatively local is feature has a high agree-ment with the GPU architecture and can achieve goodperformance results To simulate IB-LBM on the GPU theflow field needs to be divided into grid and blocks

In the structure of the GPU threads are grouped byblocks and all blocks form a grid which can make a three-dimensional shape In the case of three-dimensional thethread in a block can be determined by the thread coor-dinates threadIdxx threadIdxy and threadIdxz Simi-larly the position of the block can be determined by thecoordinates of the block blockIdxx blockIdxy andblockIdxz is method is also applicable to two-dimen-sional and one-dimensional situations

As shown in Figure 5 the flow field is mapped to a two-dimensional grid which contains n times m blocks a blockcontains i times j threads Each thread calculates a lattice pointin the flow field

e index of the array can be obtained by

index x thread Idxx + block Idxx middot blockDimx

index y thread Idxy + block Idxy middot blockDimy

(12)

where blockDimx and blockDimy are the dimensions of theblock and the number of threads in the X and Y directionsrespectively threadIdxx threadIdxy blockIdxx andblockIdxy are the coordinates of the thread and blockrespectively To divide the entire flow field into multiple two-dimensional subregions of a specified size that is blocks the

Start

Initialization

t lt t_max

End

N

Y

Evolution

t++

Calculate the restore force usingequation (7)

Interpolate the force from theLagrange points to the Euler

points by equation (8)

Discretize the force into theLBGK model on the basis of

equation (10)

Compute the equilibriumdistribution function using

equation (4)

Collision and stream withequation (1)

Compute the macroscopicvariables by means of equation

(11) and equation (12)

Figure 4 Flow chart of IB-LBM for penalty method


total number of threads must be consistent with the totalnumber of lattice points in the flow field which can avoidunnecessary resource consumption and extra calculations

Although the efficiency of IB-LBM simulation on theGPU has been greatly improved it still caused a waste ofcomputing resources After the kernel function is launchedthe calculations on the GPU are performed on the GPU Atthe same time the control of the program returns to theCPU and the CPU is in a waiting state is paper proposesan algorithm on CPU-GPU heterogeneous platform to avoidthis situation We divide the flow field into two parts GPUcomputing domain and CPU computing domain Whencontrol returns to the host the CPU immediately performsIB-LBM simulation on the CPU computing domain especific execution steps are shown in Figure 6

e launch time of the kernel function is very short andcan be ignored GPU and CPU can be seen as evolving at thesame time on the CPU and GPU computing domain re-spectively Multiple kernel functions can be launched on theCPU After the kernel functions are launched they arequeued up for execution on the GPU and control is returnedto the CPU e CPU launches some kernel function of atime step and then performs a time step simulation on theCPU so that after the calculation on the CPU and the GPUis completed a time step simulation of the entire flow field iscompleted In order to ensure that the calculation has beencompleted before data transmission GPU synchronizationneeds to be set It is necessary to wait for the GPU-sidecalculation to end when the synchronization position isreached e data exchange is completed on the CPU so theCPU simulation does not need to set synchronization andthe CPU can achieve automatic synchronization Aftersynchronization is achieved only a small amount of dataneeds to be exchanged the density velocity and distributionfunction at the boundary of the calculation domain

It can be seen from Figure 6 that the GPU and the CPUneed to exchange data to ensure that the flow field can evolvecorrectly after the evolution of the flow field is performedonce While the data is being exchanged both the GPU and

the CPU need to wait At the same time frequent dataexchange adds extra time consumption In order to optimizethese problems this paper introduces the concept of a bufferat the junction of two computational domains (see Figure 7)

In Figure 7 the blue area is the buffer with the samenumber of lattice at the junction of the GPU and CPUcomputing domains In this way there is no need to ex-change data after each evolution After the introduction ofthe buffer the corresponding number of evolutions can beexecuted according to the size of the buffer and then dataexchange can be performed e red area in Figure 7 is thearea that needs to be copied e algorithm overwrites thevalue of CPU buffer-zone with the value of GPU copy-zoneand then overwrites the value of GPU buffer-zone with thevalue of CPU copy-zone In this way the evolution of theCPU and GPU computing domains has been correctlydesigned By introducing a buffer the number of data ex-changes and CPU-to-GPU communications is greatly re-duced and additional time consumption is reduced eparallel algorithm on CPU-GPU heterogeneous platformscan be illustrated as Algorithm 1

e IB-LBM simulation contains four different calcu-lations IBM part the entire flow field boundary processingand buffer assignment According to these four kinds ofcalculations different grid divisions Grid_1 Grid_2 Grid_3Grid_4 and different block divisions Block_1 Block_2Block_3 and Block_4 are used in Algorithm 1 Proper di-vision can make better use of the performance of the GPUe GPU can start multiple kernel functions simultaneouslyand then execute them sequentially e start-up time of thekernel function is extremely short and can be ignorederefore Algorithm 1 approximates that the GPU and CPUstart performing calculations at the same time

e key problem of the algorithm proposed in this paperis how to divide the GPU computing domain and the CPUcomputing domain Optimal performance can be achievedwith appropriate mesh sizes in the two computational do-mains e IB-LBM simulation with the same mesh amountis performed on the GPU and the CPU before the calculation

blockIdx (0 0)

blockIdx (0 m) blockIdx (1 m) blockIdx (n m)

blockIdx (n m)blockIdx (1 0) blockIdx (n 0)

Grid

helliphellip

helliphellip helliphellip helliphellip

helliphellip

helliphellip helliphellip

helliphellip

helliphellip

helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip

helliphellip



helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

Block

threadIdx(0 0) threadIdx(1 0) threadIdx(i 0)


threadIdx(0 j) threadIdx(1 j) threadIdx(i j)

Figure 5 e flow field is divided according to the grid structure and block structure of the GPU


domain is divided According to the proportion of GPU andCPU time consumption the flow field is divided into equalproportions

rough the above design a new algorithm on CPU-GPU heterogeneous platform is generated to simulate IB-LBM (see Figure 8) e GPU and CPU simulate at the sametime and exchange data through the buffer e CUDAprogramming model is used on the GPU side and OpenMPis used on the CPU side to perform parallel processing on IB-LBM

4 Performance Analysis

is section tests the performance of IB-LBM on a CPU-GPU heterogeneous platform by conducting experimentsflow around a cylinder e main characteristics of theexperimental platform are shown in Table 1

Evolution

GPU computation domainComputation Copy Buffer

GPU computation domainComputationCopyBuffer

Communication

CPU copy GPU Buffer GPU copyCPU Buffer

DataCopy

DataCopy

Figure 7 Data exchange between buffers in the GPU and CPU computation domains

GPU computation domain CPU computation domain

Buffer copy(buffer ge field)

Buffer copy(field ge buffer )



IB-LBM evolution IB-LBM evolutionCUDA OpenMP

Buffer

CPU-GPU communicationdata exchange

Figure 8 e new IB-LBM parallel model on heterogeneousplatforms

helliphellip

Time step t

Computing on the GPU

Computing on the CPU

Control returns to the CPULaunch kernel

CPU

-GPU

data

tran

sfer

Time step t + 1




CPU

-GPU

data

tran

sfer

Figure 6 IB-LBM calculation process and data exchange process on heterogeneous platforms


is paper used the conventional MFLUPS metric(millions of fluid lattice updates per second) to assess theperformance e following formula shows how the peaknumber of potential MFLUPS is calculated for a specificsimulation [25]

MFLUPS s middot Nfl

Ts middot 106 (13)

where s is the total number of evolutions Nfl is the numberof grids in the flow field and Ts is the execution time

e rules for dividing the flow field into GPU computingdomain and CPU computing domain can be expressed by

Ncpu MFLUPSgpu times Nfl

MFLUPScpu+MFLUPSgpu+ E

Ngpu MFLUPScpu times Nfl


(14)

where Ncpu and Ncpu are the number of GPU computingdomains and CPU computing domains MFLUPScpu andMFLUPScpu are the MFLUPS value calculated by the

multicore CPU and a GPU and E is the number of latticeerrors the value range is minus 104 leEle 104

e choice of buffer size is a key issue and the appro-priate buffer size can greatly improve the calculation effi-ciency After a large number of experiments and analyzingthe experimental results the choice of buffer size can bebased on the following expression

Nbuffer Ncpu

25+ E (15)

where Nbuffer is the number of buffer grids Ncpu is thenumber of CPU computing domains and the value of E isthe same as the value in equation (14)

Heterogeneouse data in the array is assigned to the buffer in the flow fieldbuffer_to_fieldltltltGrid_4 Block_4gtgtgt()e GPU performs IB-LBM simulation

for i 0⟶ buffer docompute_forceltltltGrid_1 Block_1gtgtgt()spread_forceltltltGrid_1 Block_1gtgtgt()discrete_forceltltltGrid_2 Block_2gtgtgt()

collisionltltltGrid_2 Block_2gtgtgt()streamingltltltGrid_2 Block_2gtgtgt()boundaryltltltGrid_3 Block_3gtgtgt()compute_macroscopicltltltGrid_2 Block_2gtgtgt()interpolation_velocityltltltGrid_1 Block_1gtgtgt()lagrangian_moveltltltGrid_1 Block_1gtgtgt()end forCopy the data of the buffer area in the flow field to a separate arrayfield_to_bufferltltltGrid_4 Block_4gtgtgt()e CPU performs IB-LBM simulation

for j 0⟶ buffer docompute_force ()

spread_force ()discrete_force ()collision ()streaming ()boundary ()compute_macroscopic ()interpolation_velocity ()lagrangian_move ()

cudaMemcpy (cudaMemcpyDeviceToHost)exchange_data ()cudaMemcpy (cudaMemcpyHostToDevice)step_count+ bufferif (step_count total_step)

goto heterogeneous

ALGORITHM 1 e IB-LBM parallel algorithm on CPU-GPU heterogeneous platforms

Table 1 Main features of the platforms

Platform Intel Xeon NVIDIA GPUModel Gold 6130 GeForce RTX 2080TiFrequency 21GHz 1318GHzCores 16 4352Memory 256GB (DDR4) 11GB (GDDR6)Bandwidth 213GBs 616GBsCompiler gcc 540 nvcc 90176


e size of the buffer has a relatively small impact onthe performance of the computation simulation on theGPU side and has a relatively large impact on the per-formance of the computation simulation on the CPU sideerefore the size of the buffer can be calculated based onthe size of the CPU computing domain Choose buffers ofdifferent sizes for multiple experiments and compare theoptimal results with the CPU computing domain eabove experiments were conducted under different flowfield scales Finally the conclusion of equation (15) isreached It can be seen from the expression that the size ofthe buffer is related to the grid amount of the CPUcomputing domain e amount of grid in the CPUcomputing domain is related to platform equipment andproblem scale erefore the size of the buffer depends onplatform features and problem scale

e single-precision and double-precision performancetests of the algorithm are shown in Tables 2 and 3

As can be seen from Table 2 and Table 3 using aseparate GPU to simulate IB-LBM can also achieve goodperformance results e parallel performance of the CPUis far from the effect of the GPU it is still a computingresource that can be utilized In some cases the hardwareimplements double precision then single precision isemulated by extending it there and the conversion willcost time is results in double-precision performancebeing faster than single-precision performance Using asingle GPU for computing has greatly improved com-puting performance However due to the computingcharacteristics of the CUDAmodel the CPU only executescontrol of the kernel functions so the computing resourcesof the CPU are not fully utilized In the proposed algo-rithm CPU controls the start of the kernel function andperforms multi-threaded calculations without causing awaste of resources Besides that compared with the tra-ditional heterogeneous method the communication be-tween the CPU and the GPU is reduced It can beconcluded from Tables 2 and 3 that the performance of thenew algorithm has been improved However due to thesmall size of the flow field the performance improvementis not obvious

It can be seen from the experimental result that theperformance with double precision is much lower than thatwith single precision on GPU is can be caused by tworeasons On the one hand the computing power of GPUsingle-precision floating-point numbers is higher than thatof double-precision floating-point numbers On the otherhand double-precision floating-point numbers increasememory access

Figure 9 shows the performance image of the enlargedflow field scale in the single precision case e performanceimprovement can be more clearly observed

It can be seen from Figure 9 that after the scale of theflow field reaches 6 million the MFLPUPS of the CPU-GPU heterogeneous platform with buffer is similar to thesum of GPU and CPU multithreading When the numberof grids is 1 million the simulation effect of a single GPU isthe best is is due to the relatively small number of gridsin the CPU computing domain divided by a heterogeneous

CPU-GPU platform e time spent on thread develop-ment and destruction has a greater impact on the total CPUsimulation time When performing parallel simulation onCPU-GPU heterogeneous platforms not only will com-munication affect performance but synchronization willalso affect performance e introduction of the bufferreduces the number of communications and synchroni-zation at the same time It can also be seen from Figure 9that the effect of using the buffer is better than not using thebuffer

To study the actual FSI the scale of the problem is oftenrelatively large At the same time many evolutionary cal-culations are often required Parallel algorithms can effec-tively reduce computation time Most clusters now includeGPU and CPU devicese algorithm proposed in this papercan make full use of the resources of the cluster and it canreduce the simulation time in the research of RBC aggre-gation and deformation in ultrasonic field [6] viscous flowin large distensible blood vessels [26] suspending viscosityeffect on RBC dynamics and blood flow behaviors inmicrovessels [27] and other studies

5 Application

RBCs are an important component in blood [28] Recentlythe IB-LBM has been used for simulating the deformationand motion of elastic bodies immersed in fluid flow in-cluding RBCs [29] Esmaily et al [30] used IB-LBM to in-vestigate motion and deformation of both healthy and sickRBCs in a microchannel with stenosis Tan et al [31] pre-sented a numerical study on nanoparticle (NP) transportand dispersion in RBCs suspensions by an IB-LBM fluidsolver

In this paper the RBCs are suspended in blood plasmawhich has a density ρ 1 and Reynolds number Relt 1 Inthe boundary processing the nonequilibrium extrapolationis usede cross section profile of a RBC in xndashy plane [32] isgiven by the following relation

y sin θ

x cos θ c0 + c1 sin θ minus c2sin2 θ1113960 1113961

0lt θlt 2π

(16)

with c0 0207 c1 2002 and c2 1122e two-dimensional RBC shape is shown in Figure 10e red points represent the Lagrangian points on the

RBC and the black points are Euler pointse force at the Lagrangian points on RBC consists of

two parts defined as

F(s t) Fs(s t) + Fb(s t) (17)

where s is the arc length Fs is the stretchingcompressionforce and Fb is the bending force Fs and Fb are obtained bythe following relation [33]


Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868 minus Δs1113874 1113875 timesXj+1 minus Xj

Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1

j2Xj+1 minus 2Xj + Xjminus 11113872 1113873 2δjk minus δj+1k minus δjminus 1k1113872 11138731113966 1113967

(18)

Here N is the total number of the Lagrangian points onthe RBC k 1 2 N (Fs)k and (Fb)k are elastic La-grangian forces associated with the node k the values of Fs

and Fb are 10 times 10minus 10 and 10 times 10minus 12 respectively X(s t)

is the Lagrangian points and δjk is the Kronecker deltafunction

Blood flow in the microvessels is better approximated bya Poiseuille flow than a shear flow which makes the study ofRBCs in Poiseuille flows more physiologically realistic [34]erefore this paper first applies the algorithm to study thedynamical RBC behavior in Poiseuille flows the simulationdomain is 30 times 100 grid points (see Figure 11)

30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10

Number of nodes (x 1e6)

Sequential32 threads

(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10


GPUCPU-GPUCPU-GPU-buffer

(b)

Figure 9 Performance of simulating flow around a cylinder on different platforms (a) CPU sequential and multi-threads (b) GPU andCPU-GPU heterogeneous platform

Table 2 Single-precision performance of CPU-GPU heterogeneous platform

Domain size CPUMFLUPS OpenMPMFLUPS GPUMFLUPS CPU-GPUMFLUPS625times 89 (55625) 0801 19449 367375 368564750times107 (80250) 0867 22435 435120 436649900times128 (115200) 0888 23023 485470 4869831080times154 (166320) 0869 24017 501921 503572

Table 3 Double-precision performance of CPU-GPU heterogeneous platform

Domain size CPUMFLUPS OpenMPMFLUPS GPUMFLUPS CPU-GPUMFLUPS62 5times 89 (55625) 0874 19841 193750 193835750times107 (80250) 0953 23902 240037 240849900times128 (115200) 0958 24057 294730 2950361080times154 (166320) 0956 24793 321114 322597


RBCs are perpendicular to the direction of fluid flow inFigure 11 the geometric shape gradually changes from adouble concave to a parachute shape e curvature afterRBC deformation is closely related to the deformability ofthe boundary [35] RBC can recover its initial shape asso-ciated with the minimal elastic energy when the flow stops[36ndash38]

Many diseases narrow the blood vessels e degree ofstenosis is high which will affect the circulatory function ofthe human body e study of RBCs passing through thestenotic area of the blood vessel is more popular is paperapplies the algorithm to this situation Figure 12 showsgeometry of the microchannel

In the geometry of the microchannel X and Y arenumber of Eulerian points in x and y direction the values are30 and 150 D is the diameter of vessel and h is the diameterof constriction e ratio of D and h is D 3 h Figure 13shows the motion and deformation of an RBC through amicrochannel with stenosis

It can be seen from Figure 13 that the velocity of the fluidincreases in the stenosis area resulting in more deformationof the RBCs [39] After passing through this area the RBCsreturn to their initial shape due to the decrease in the velocityof the fluid is is because RBC is very flexible and yetresilient enough to recover the biconcave shape wheneverthe cell is quiescent [40]

Figure 11 Snapshots of RBC behavior in Poiseuille flows

h YD

X

Figure 12 Geometry of the microchannel with stenosis

Figure 10 e two-dimensional structure of RBC in the flow field


6 Conclusions

In this paper a new parallel algorithm on CPU-GPUheterogeneous platforms is proposed to implement acoupled lattice Boltzmann and immersed boundarymethod Good performance results are obtained by in-vestigating heterogeneous platforms e main contri-bution of this paper is to make full use of the CPU andGPU computing resources CPU not only controls thelaunch of the kernel function but also performs calcu-lations A buffer is set in the flow field to reduce theamount of data exchanged between the CPU and theGPU e CUDA programming model and OpenMP areapplied to the GPU and CPU respectively for numericalsimulation is method can approximate the simulta-neous simulation of different regions of the flow field bythe GPU and CPU e final experiments have obtainedrelatively good performance results e CPU-GPUheterogeneous platform MFLUPS value is approximatelythe sum of GPU and CPU multi-threads In addition theproposed algorithm is used to simulate the movementand deformation of RBCs in Poiseuille flow and through amicrochannel with stenosis e simulation results are ingood agreement with other literatures Our future work is

to implement the method on large-scale GPU clusters andlarge-scale CPU clusters to simulate three-dimensionalFSI problems

Data Availability

All data models and codes generated or used during thestudy are included within the article

Conflicts of Interest

e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

is study was supported by the Shanghai Sailing Program(Grant no18YF1410100) National Key Research and De-velopment Program (Grant no 2016YFC1400304) NationalNatural Science Foundation of China (NSFC) (Grant nos41671431 and 61972241) the Natural Science Foundation ofShanghai (no 18ZR1417300) and the Program for theCapacity Development of Shanghai Local Colleges (Grantno 17050501900)

00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )

Figure 13 An RBC through a microchannel with stenosis at different times


References

[1] J Wu Y Cheng W Zhou and W Zhang ldquoGPU accelerationof FSI simulations by the immersed boundary-lattice Boltz-mann coupling schemerdquo Computers and Mathematics withApplications vol 78 no 9 pp 1194ndash1205 2019

[2] Z-G Feng and E E Michaelides ldquoe immersed boundary-lattice Boltzmann method for solving fluid-particles inter-action problemsrdquo Journal of Computational Physics vol 195no 2 pp 602ndash628 2004

[3] A Eshghinejadfard A Abdelsamie S A Hosseini andD evenin ldquoImmersed boundary lattice Boltzmann simu-lation of turbulent channel flows in the presence of sphericalparticlesrdquo International Journal of Multiphase Flow vol 96pp 161ndash172 2017

[4] M Cheng B Zhang and J Lou ldquoA hybrid LBM for flow withparticles and dropsrdquo Computers amp Fluids vol 155 pp 62ndash672017

[5] Y Wei L Mu Y Tang Z Shen and Y He ldquoComputationalanalysis of nitric oxide biotransport in a microvessel influ-enced by red blood cellsrdquo Microvascular Research vol 125Article ID 103878 2019

[6] X Ma B Huang G Wang X Fu and S Qiu ldquoNumericalsimulation of the red blood cell aggregation and deformationbehaviors in ultrasonic fieldrdquo Ultrasonics Sonochemistryvol 38 pp 604ndash613 2017

[7] L Mountrakis E Lorenz O Malaspinas S AlowayyedB Chopard and A G Hoekstra ldquoParallel performance of anIB-LBM suspension simulation frameworkrdquo Journal ofComputational Science vol 9 pp 45ndash50 2015

[8] C Feichtinger J Habich H Kostler U Rude and T AokildquoPerformance modeling and analysis of heterogeneous latticeBoltzmann simulations on CPU-GPU clustersrdquo ParallelComputing vol 46 pp 1ndash13 2015

[9] J Beny and J Latt ldquoEfficient LBM on GPUs for dense movingobjects using immersed boundary conditionrdquo in Proceedingsof the CILAMCE Paris Compiegne France November 2018

[10] M Bernaschi M Fatica S Melchionna S Succi andE Kaxiras ldquoA flexible high-performance Lattice BoltzmannGPU code for the simulations of fluid flows in complex ge-ometriesrdquo Concurrency and Computation Practice and Ex-perience vol 22 no 1 pp 1ndash14 2010

[11] P R Rinaldi E A Dari M J Venere and A Clausse ldquoALattice-Boltzmann solver for 3D fluid simulation on GPUrdquoSimulation Modelling Practice and 9eory vol 25 pp 163ndash171 2012

[12] Z Shang M Cheng and J Lou ldquoParallelization of latticeBoltzmann method using MPI domain decompositiontechnology for a drop impact on a wetted solid wallrdquo In-ternational Journal of Modeling Simulation and ScientificComputing vol 5 no 2 Article ID 1350024 2014

[13] P Valero-Lara ldquoReducing memory requirements for largesize LBM simulations on GPUsrdquo Concurrency and Compu-tation Practice and Experience vol 29 no 24 Article IDe4221 2017

[14] P Valero-Lara and J Jansson ldquoMulti-domain grid refinementfor lattice-Boltzmann simulations on heterogeneous plat-formsrdquo in Proceedings of 18th IEEE International Conferenceon Computational Science and Engineering (CSE) PortoPortugal October 2015

[15] P Valero-Lara and J Jansson ldquoHeterogeneous CPU+ GPUapproaches for mesh refinement over Lattice-Boltzmannsimulationsrdquo Concurrency and Computation Practice andExperience vol 29 no 7 2017

[16] P Valero-Lara P Pinelli andM Prieto-Matias ldquoAcceleratingsolid-fluid interaction using lattice-Boltzmann and immersedboundary coupled simulations on heterogeneous platformsrdquoin Proceedings of 14th International Conference on Compu-tational Science (ICCS) Cairns Australia June 2014

[17] G Boroni J Dottori and P Rinaldi ldquoFULL GPU imple-mentation of lattice-Boltzmann methods with immersedboundary conditions for fast fluid simulationsrdquo 9e Inter-national Journal of Multiphysics vol 11 no 1 2017

[18] F Massaioli and G Amati ldquoAchieving high performance in aLBM code using OpenMPrdquo in Proceedings of the FourthEuropean Workshop on OpenMP (EWOMP 2002) RomaItaly September 2002

[19] P Valero-Lara and J Jansson ldquoLBM-HPC-an open-sourcetool for fluid simulations case study Unified parallel C (UPC-PGAS)rdquo in Proceedings of IEEE International Conference onCluster Computing (CLUSTER) Chicago USA September2015

[20] P Valero-Lara and J Jansson ldquoA non-uniform StaggeredCartesian grid approach for Lattice-Boltzmann methodrdquo inProceedings of International Conference on ComputationalScience (ICCS) Reykjavık Iceland June 2015

[21] Y H Qian D DrsquoHumieres and P Lallemand ldquoLattice BGKmodels for Navier-Stokes equationrdquo Europhysics Letters(EPL) vol 17 no 6 pp 479ndash484 1992

[22] S Chen H Chen D Martnez and W Matthaeus ldquoLatticeBoltzmann model for simulation of magnetohydrodynamicsrdquoPhysical Review Letters vol 67 no 27 pp 3776ndash3779 1991

[23] Z Guo C Zheng and B Shi ldquoNon-equilibrium extrapolationmethod for velocity and boundary conditions in the latticeBoltzmann methodrdquo Chinese Physics vol 11 no 4pp 366ndash374 2002

[24] Z Guo C Zheng and B Shi ldquoDiscrete lattice effects on theforcing term in the lattice Boltzmann methodrdquo PhysicalReview E vol 65 no 4 Article ID 046308 2002

[25] A P Randles V Kale J Hammond W Gropp andE Kaxiras ldquoPerformance analysis of the lattice Boltzmannmodel beyond NavierndashStokesrdquo in Proceedings of the 27th IEEEInternational Parallel and Distributed Processing Symposium(IPDPS) Boston MA USA May 2013

[26] H Fang Z Wang Z Lin and M Liu ldquoLattice Boltzmannmethod for simulating the viscous flow in large distensibleblood vesselsrdquo Physical Review E vol 65 no 5 2002

[27] J Zhang ldquoEffect of suspending viscosity on red blood celldynamics and blood flows in microvesselsrdquo Microcirculationvol 18 no 7 pp 562ndash573 2011

[28] M Navidbakhsh and M Rezazadeh ldquoAn immersed bound-ary-lattice Boltzmann model for simulation of malaria-in-fected red blood cell in micro-channelrdquo Scientia Iranicavol 19 no 5 pp 1329ndash1336 Oct 2012

[29] A Hassanzadeh N Pourmahmoud and A Dadvand ldquoNu-merical simulation of motion and deformation of healthy andsick red blood cell through a constricted vessel using hybridlattice Boltzmann-immersed boundary methodrdquo ComputerMethods in Biomechanics and Biomedical Engineering vol 20no 7 pp 737ndash749 2017

[30] R Esmaily N Pourmahmoud and I Mirzaee ldquoNumericalsimulation of interaction between red blood cell with sur-rounding fluid in Poiseuille flowrdquo Advances in AppliedMathematics and Mechanics vol 10 no 1 pp 62ndash76 2018

[31] J Tan W Keller S Sohrabi J Yang and Y Liu ldquoCharac-terization of nanoparticle dispersion in red blood cell sus-pension by the lattice Boltzmann-immersed boundarymethodrdquo Nanomaterials vol 6 no 2 2016


[32] J Zhang P C Johnson and A S Popel ldquoRed blood cellaggregation and dissociation in shear flows simulated bylattice Boltzmann methodrdquo Journal of Biomechanics vol 41no 1 pp 47ndash55 2008

[33] A Ghafouri R Esmaily and A a Alizadeh ldquoNumericalsimulation of tank-treading and tumbling motion of redblood cell in the Poiseuille flow in a microchannel with andwithout obstaclerdquo Iranian Journal of Science and TechnologyTransactions of Mechanical Engineering vol 43 no 4pp 627ndash638 2019

[34] T Wang T W Pan Z Xing and R Glowinski ldquoNumericalsimulation of rheology of red blood cell rouleaux in micro-channelsrdquo Physical Review E vol 79 no 4 Article ID 0419162009

[35] Y Xu F Tian and Y Deng ldquoAn efficient red blood cell modelin the frame of IB-LBM and its applicationrdquo InternationalJournal of Biomathematics vol 6 no 1 2013

[36] T M Fischer ldquoShape memory of human red blood cellsrdquoBiophysical Journal vol 86 no 5 pp 3304ndash3313 2004

[37] T W Pan and T Wang ldquoDynamical simulation of red bloodcell rheology in microvesselsrdquo International Journal of Nu-merical Analysis amp Modeling vol 6 no 3 pp 455ndash473 2009

[38] C Pozrikidis ldquoAxisymmetric motion of a file of red bloodcells through capillariesrdquo Physics of Fluids vol 17 no 3Article ID 031503 2005

[39] R Esmaily N Pourmahmoud and I Mirzaee ldquoAn immersedboundary method for computational simulation of red bloodcell in Poiseuille flowrdquoMechanika vol 24 no 3 pp 329ndash3342018

[40] J Li G Lykotrafitis M Dao and S Suresh ldquoCytoskeletaldynamics of human erythrocyterdquo Proceedings of the NationalAcademy of Sciences vol 104 no 12 pp 4937ndash4942 2007


different parallel approaches for grid refinement based on amultidomain decomposition e homogeneous GPU andheterogeneous CPU+GPU approaches [15] for mesh re-finement based on Multidomain and irregular meshingcould achieve better performance In terms of IB-LBMrsquosparallel computing Valero-Lara et al [16] proposed anumerical approach parallelizing the IB-LBM on bothGPUs and a heterogeneous GPU-Multicore platformBoroni et al [17] presented a fully parallel GPU imple-mentation of IB-LBM e GPU kernel performs fluidboundary interactions and accelerated each code executionRecently Beny and Latt [9] proposed a new method for theimplementation of IB-LBM on GPUs which allows han-dling a substantially larger immersed surfaces Otherparallel methods compiler directive open multiprocessing(OpenMP) [18] and distributed memory clusters [19] canachieve multicore parallel processing with shared memoryon the CPU for instance a nonuniform staggered Car-tesian grid approach [20] reducing considerably thecomplexity of the communication and memory manage-ment between different refined levels for mesh refinement

In this paper a new parallel algorithm for IB-LBM isproposed on CPU-GPU heterogeneous platform When theGPU is performing simulation the CPU is in a waiting statewhich causes a waste of computing resources e new al-gorithm avoids this problem On the one hand the flow fieldis divided into CPU domains and GPU domains which arecalculated by the Compute Unified Device Architecture(CUDA) and OpenMP respectively According to the cal-culation process of IB-LBM after the flow field has evolvedonce data is exchanged at the junction of the two calculationdomains On the other hand the calculation time is shorterthan the communication time so buffers were introduced toreduce the number of data communications e size of thebuffer determines the number of times the flow field evolvesbefore data is exchanged Setting a buffer of proper size canreduce data communication and improve computing per-formance As expected this strategy greatly improved cal-culation efficiency without affecting the calculation accuracyof the IB-LBM

e remainder of this paper is organized as followsSection 2 introduces IB-LBM Section 3 illustrates theparallel algorithm on CPU-GPU heterogeneous platformthat makes full use of computing resources Section 4 showsthe benchmark tests and analyzes the parallel performanceRBCs in Poiseuille flow and through a microchannel aresimulated in Section 5 Finally Section 6 concludes thisstudy

2 Immersed Boundary-LatticeBoltzmann Method

IB-LBM is a flexible and efficient calculation in the case ofsimulated fluid flow through moving or complex bound-aries e computational area of the IB-LBM consists of twotypes of lattice points the Lagrangian points and theEulerian points (see Figure 1)

e Euler point represents the entire flow field and theLagrangian point illustrates the boundary of the immersed

object e body force of the object is calculated by thedeformation of the boundary and then it is dispersed to theEuler point of the flow field using the Dirac delta functionFinally the macroscopic density and velocity of the flow fieldare solved

e DnQm LBGK is the most widely used model forsolving the flow field in the IB-LBM n is the spatial di-mension and m represents the number of the discrete ve-locity [21 22] In this paper the most commonly usedmodel D2Q9 (see Figure 2) is employed and the corre-sponding discrete velocity ei is

ei

0 i 0

cos (i minus 1)π2

1113876 1113877 sin (i minus 1)π2

1113876 11138771113874 1113875c i 1 2 3 4

2

radiccos (i minus 5)

π2

+π4

1113876 1113877 sin (i minus 5)π2

+π4

1113876 11138771113874 1113875c i 5 6 7 8

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(1)

where c ΔxΔt is the lattice velocity and Δx is the latticespace

e LBGK model can be expressed as

fi x + eiΔt t + Δt( 1113857 minus fi(x t)

minus1τ

fi(x t) minus feqi (x t)1113858 1113859 + ΔtFi

(2)

where fi(x t) is the particle distribution function x iscurrent lattice location and t is current lattice time Δt is thelattice time step and τ is the relaxation time which is de-termined by the fluid viscosity coefficient υ of the fluid

υ τ minus12

1113874 1113875c2sΔt (3)

where cs is the speed of sound e corresponding feqi (x t)

is the equilibrium distribution function It can be written as

Lagrangianpoints

Eulerianpoints

Figure 1 Two-dimensional flow field boundary structure of IB-LBM mode



eiu

c2s

+eiu( 1113857

2

2c4s

minus|u|

2

2c2s

1113890 1113891 (4)


ωi

49 e

2i 0

19 e

2i c

2

136

e2i 2c

2

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)



eqi xf t1113872 11138731113960 1113961

(6)





ri1113872 1113873 (7)




e0

e1

e8e4e6

e2

e7e3 e5

(a)

f0

f1

f8f4f6

f2

f7 f3 f5

(b)


xri

xli




1113944n

i


(8)



0 1le |x|1113896 (9)


Fi(X t) 1 minus12τ

1113874 1113875ωi

ei minus u

c2s


c4s

ei1113890 1113891 middot F(X t) (10)


ρ 11139448

i0fi

ρu 11139448

i0eifi +

12

FΔt

(11)












(12)


Start

Initialization

t lt t_max

End

N

Y

Evolution

t++





equation (10)


equation (4)














blockIdx (0 0)



Grid

helliphellip


helliphellip


helliphellip

helliphellip

helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip

helliphellip



helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

Block










Evolution



Communication


DataCopy

DataCopy








Buffer



helliphellip

Time step t




CPU

-GPU

data

tran

sfer

Time step t + 1




CPU

-GPU

data

tran

sfer




MFLUPS s middot Nfl

Ts middot 106 (13)







(14)




Nbuffer Ncpu

25+ E (15)








goto heterogeneous













5 Application



y sin θ


0lt θlt 2π

(16)




F(s t) Fs(s t) + Fb(s t) (17)



Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868


Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1


(18)





30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(b)












h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References












































eiu

c2s

+eiu( 1113857

2

2c4s

minus|u|

2

2c2s

1113890 1113891 (4)


ωi

49 e

2i 0

19 e

2i c

2

136

e2i 2c

2

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)



eqi xf t1113872 11138731113960 1113961

(6)





ri1113872 1113873 (7)




e0

e1

e8e4e6

e2

e7e3 e5

(a)

f0

f1

f8f4f6

f2

f7 f3 f5

(b)


xri

xli




1113944n

i


(8)



0 1le |x|1113896 (9)


Fi(X t) 1 minus12τ

1113874 1113875ωi

ei minus u

c2s


c4s

ei1113890 1113891 middot F(X t) (10)


ρ 11139448

i0fi

ρu 11139448

i0eifi +

12

FΔt

(11)












(12)


Start

Initialization

t lt t_max

End

N

Y

Evolution

t++





equation (10)


equation (4)














blockIdx (0 0)



Grid

helliphellip


helliphellip


helliphellip

helliphellip

helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip

helliphellip



helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

Block










Evolution



Communication


DataCopy

DataCopy








Buffer



helliphellip

Time step t




CPU

-GPU

data

tran

sfer

Time step t + 1




CPU

-GPU

data

tran

sfer




MFLUPS s middot Nfl

Ts middot 106 (13)







(14)




Nbuffer Ncpu

25+ E (15)








goto heterogeneous













5 Application



y sin θ


0lt θlt 2π

(16)




F(s t) Fs(s t) + Fb(s t) (17)



Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868


Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1


(18)





30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(b)












h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References












































1113944n

i


(8)



0 1le |x|1113896 (9)


Fi(X t) 1 minus12τ

1113874 1113875ωi

ei minus u

c2s


c4s

ei1113890 1113891 middot F(X t) (10)


ρ 11139448

i0fi

ρu 11139448

i0eifi +

12

FΔt

(11)












(12)


Start

Initialization

t lt t_max

End

N

Y

Evolution

t++





equation (10)


equation (4)














blockIdx (0 0)



Grid

helliphellip


helliphellip


helliphellip

helliphellip

helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip

helliphellip



helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

Block










Evolution



Communication


DataCopy

DataCopy








Buffer



helliphellip

Time step t




CPU

-GPU

data

tran

sfer

Time step t + 1




CPU

-GPU

data

tran

sfer




MFLUPS s middot Nfl

Ts middot 106 (13)







(14)




Nbuffer Ncpu

25+ E (15)








goto heterogeneous













5 Application



y sin θ


0lt θlt 2π

(16)




F(s t) Fs(s t) + Fb(s t) (17)



Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868


Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1


(18)





30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(b)












h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References



















































blockIdx (0 0)



Grid

helliphellip


helliphellip


helliphellip

helliphellip

helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip

helliphellip



helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

helliphellip


helliphellip

Block










Evolution



Communication


DataCopy

DataCopy








Buffer



helliphellip

Time step t




CPU

-GPU

data

tran

sfer

Time step t + 1




CPU

-GPU

data

tran

sfer




MFLUPS s middot Nfl

Ts middot 106 (13)







(14)




Nbuffer Ncpu

25+ E (15)








goto heterogeneous













5 Application



y sin θ


0lt θlt 2π

(16)




F(s t) Fs(s t) + Fb(s t) (17)



Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868


Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1


(18)





30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(b)












h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References















































Evolution



Communication


DataCopy

DataCopy








Buffer



helliphellip

Time step t




CPU

-GPU

data

tran

sfer

Time step t + 1




CPU

-GPU

data

tran

sfer




MFLUPS s middot Nfl

Ts middot 106 (13)







(14)




Nbuffer Ncpu

25+ E (15)








goto heterogeneous













5 Application



y sin θ


0lt θlt 2π

(16)




F(s t) Fs(s t) + Fb(s t) (17)



Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868


Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1


(18)





30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(b)












h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References












































MFLUPS s middot Nfl

Ts middot 106 (13)







(14)




Nbuffer Ncpu

25+ E (15)








goto heterogeneous













5 Application



y sin θ


0lt θlt 2π

(16)




F(s t) Fs(s t) + Fb(s t) (17)



Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868


Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1


(18)





30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(b)












h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References



















































5 Application



y sin θ


0lt θlt 2π

(16)




F(s t) Fs(s t) + Fb(s t) (17)



Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868


Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1


(18)





30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(b)












h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References











































Fs( 1113857k Es

(Δs)2 times 1113944

Nminus 1

j1Xj+1 minus Xj

11138681113868111386811138681113868


Xj+1 minus Xj

11138681113868111386811138681113868

11138681113868111386811138681113868δjk minus δj+1k1113872 1113873

⎧⎪⎨

⎪⎩

⎫⎪⎬

⎪⎭

Fb( 1113857k Eb

(Δs)4 times 1113944

Nminus 1


(18)





30

25

20

15

10

5

0

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(a)

840

820

800

900

880

860

780

760

740

720

700

MFL

UPS

1 2 3 4 5 6 7 8 9 10



(b)












h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References
















































h YD

X




6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References











































6 Conclusions



Data Availability




Acknowledgments


00001 00009 00017U

(a)

0 00008 00016U

(b)

00001 00006 00011U

(c)

00001 00007 00013U

(d)

0 00006 00012U

(e)

00002 0001 00018U

(f )



References











































References





















































Documents

TheImmersedBoundary-LatticeBoltzmannMethodParallel ...downloads.hindawi.com/journals/mpe/2020/3913968.pdf · 5/14/2020 · e IB-LBM simulation contains four different calcu-lations:IBMpart,theentireflowfield,boundaryprocessing,