
IEEE ANTENNAS AND WIRELESS PROPAGATION LETTERS, VOL. 9, 2010

Low-Frequency MLFMA on Graphics Processors

M. Cwikla, Member, IEEE, J. Aronsson, Student Member, IEEE, and V. Okhmatovski, Senior Member, IEEE

Abstract—A parallelization of the low-frequency multilevel fast multipole algorithm (MLFMA) for graphics processing units (GPUs) is presented. The implementation exhibits speedups between 10 and 30 compared to a serial CPU implementation of the algorithm. The error of the MLFMA on the GPU is controllable down to machine precision. Under the typical method-of-moments (MoM) error requirement of three correct digits, modern GPUs are shown to handle problems with up to 7.5 million degrees of freedom in dense matrix approximation.

Index Terms—CUDA, graphics processing unit (GPU), low-frequency fast multipole method, fast algorithms, multiscale modeling.

I. INTRODUCTION

THE multilevel fast multipole algorithm (MLFMA) is today's most powerful method for solving large-scale electromagnetic problems with the boundary-element method of moments (MoM) [1]. It reduces the computational complexity and memory requirement of MoM's dense matrix-vector products from $O(N^2)$ to $O(N \log N)$, $N$ being the number of unknowns in the MoM discretization. While the high-frequency MLFMA is computationally the fastest version of multipole-centric algorithms, it suffers from the low-frequency breakdown [2]. To remedy this limitation, several low-frequency versions of the MLFMA have recently been proposed [2]–[5]. Combined with the high-frequency MLFMA, they can be used in the construction of broadband MLFMAs for large problems featuring multiscale geometries.

As the MLFMA is intended for solving large-scale problems, various parallel versions of the algorithm have been developed in the past, predominantly for CPU-based clusters of workstations. With the advent of graphics processing unit (GPU) technology, which in recent years has become more than an order of magnitude faster than the CPU in terms of floating-point operations per second (FLOPS), high interest in the parallelization of computationally intensive algorithms for GPUs has arisen in the computational community [6]. However, since the hardware is optimized for graphics applications, special care must be taken in the parallelization of such algorithms for the GPU.

In the case of the MLFMA, its first parallelization for GPUs was proposed in [7] for single-precision arithmetic and the Laplace kernel. There, speedups of 30 to 60 were achieved when compared with serial CPU runs. To date, this remains the only publication on MLFMA parallelization for GPUs. In this letter, we describe a GPU implementation of the low-frequency MLFMA with the Helmholtz kernel [2] using double-precision arithmetic. A parallelization pattern similar to the one discussed in [7] is utilized. The methodology achieves speedups of up to 30 compared to a conventional serial implementation of the low-frequency MLFMA on a CPU.

Manuscript received October 19, 2009; revised December 10, 2009. Date of publication January 15, 2010; date of current version March 05, 2010. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).

The authors are with the Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB R3T 5V6, Canada (e-mail: [email protected]).

Digital Object Identifier 10.1109/LAWP.2010.2040571

II. FIELD EXPANSIONS IN LOW-FREQUENCY MLFMA

Consider a spatial distribution of $N$ time-harmonic point sources located at $\mathbf{r}_1, \dots, \mathbf{r}_N$ and having magnitudes $q_1, \dots, q_N$, respectively. The field produced by such sources at an observation point $\mathbf{r}$ is given by [9]

$\psi(\mathbf{r}) = \sum_{\nu=1}^{N} q_\nu \, \dfrac{e^{ik|\mathbf{r}-\mathbf{r}_\nu|}}{|\mathbf{r}-\mathbf{r}_\nu|}$   (1)

where $\lambda$ is the wavelength in free space and $k = 2\pi/\lambda$. In (1) and everywhere below, the time-harmonic dependence $e^{-i\omega t}$ is assumed and suppressed for brevity.

The MLFMA represents the field in (1) using truncated multipole expansions in spherical coordinates as

$\psi(\mathbf{r}) = \sum_{n=0}^{p-1} \sum_{m=-n}^{n} C_n^m \, h_n(kr) \, Y_n^m(\theta,\varphi)$   (2)

for observation points exterior to a spherical domain enclosing all the sources, and

$\psi(\mathbf{r}) = \sum_{n=0}^{p-1} \sum_{m=-n}^{n} D_n^m \, j_n(kr) \, Y_n^m(\theta,\varphi)$   (3)

for observation points inside a spherical domain that does not contain the sources. In (2) and (3), $j_n$ and $h_n$ are the spherical Bessel and Hankel functions of order $n$, respectively, $Y_n^m$ are the spherical harmonics of degree $n$ and order $m$, $(r, \theta, \varphi)$ are the spherical coordinates of $\mathbf{r}$ with respect to the expansion center, and $p$ is the truncation number. Expansions of the form (2) and (3) are known as $S$-expansions and $R$-expansions, respectively [9]. The coefficients $C_n^m$ and $D_n^m$ in (2) and (3) depend on the distribution of sources as follows:

$C_n^m = \sum_{\nu=1}^{N} q_\nu \, j_n(k r_\nu) \, Y_n^{-m}(\theta_\nu, \varphi_\nu)$   (4)

$D_n^m = \sum_{\nu=1}^{N} q_\nu \, h_n(k r_\nu) \, Y_n^{-m}(\theta_\nu, \varphi_\nu)$   (5)

where $r_\nu$, $\theta_\nu$, and $\varphi_\nu$ are the spherical coordinates of the source position $\mathbf{r}_\nu$ with respect to the expansion center.

The MLFMA subdivides the computational domain recursively [9]. At the $l$th level of subdivision, the domain is split into $8^l$ cubes (boxes), $l = 0, 1, \dots, l_{\max}$, where level $l = 0$ corresponds


to the original domain and $l = l_{\max}$ to the leaf nodes in the tree structure. To represent the field outside a box due to the sources contained inside of it, the MLFMA expands the box's field according to an $S$-expansion (2), defined by coefficients $C_n^m$ at levels $l = 2, \dots, l_{\max}$. It also computes coefficients $D_n^m$ of an $R$-expansion (3) for the field inside each box due to the sources outside of it at levels $l = 2, \dots, l_{\max}$ [9].

The translation of coefficients $C_n^m$ and $D_n^m$ from level to level can be performed via the rotate-translate-rotate (or "point-and-shoot") method [9]. This method first rotates the coordinate system so that the translation occurs along the $z$-axis of the new coordinate system (forward rotation). It then performs the translation of the expansion coefficients along the new $z$-axis (coaxial translation) and, finally, rotates the coordinate system back to its original orientation (backward rotation). This reduces the complexity of the translations from $O(p^4)$ (using the naive method) to $O(p^3)$. The forward rotation, coaxial translation, and backward rotation are formalized as

$\tilde{A}_n^{m} = \sum_{m'=-n}^{n} T_n^{m m'} \, A_n^{m'}$   (6)

$\hat{A}_{n}^{m} = \sum_{n'=|m|}^{p-1} (E|F)_{n n'}^{m} \, \tilde{A}_{n'}^{m}$   (7)

$B_n^{m} = \sum_{m'=-n}^{n} \bar{T}_n^{m m'} \, \hat{A}_n^{m'}$   (8)

Each set of coefficients $T_n^{mm'}$, $(E|F)_{nn'}^{m}$, and $\bar{T}_n^{mm'}$ can be computed in $O(p^3)$ time via a recurrence relation [9]. In (6), $A_n^m = C_n^m$ and $A_n^m = D_n^m$ when translation to the $S$-expansion and to the $R$-expansion is considered, respectively. In (8), $B_n^m = C_n^m$ and $B_n^m = D_n^m$ when translation to the $S$-expansion and to the $R$-expansion is considered, respectively. Detailed computational procedures for both the rotation and translation coefficients are given in [9].
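To make the structure of the point-and-shoot translation concrete, the following host-side C++ sketch applies (6)–(8) to one truncated expansion using precomputed coefficient tables. The storage layout and the names rotT, coax, and rotTb are illustrative assumptions, not the paper's actual data structures; the point is that each of the three stages is an $O(p^3)$ loop nest.

#include <complex>
#include <cstdlib>
#include <vector>

using cplx = std::complex<double>;
// A truncated expansion stored as A[n][m + n] for 0 <= n < p, -n <= m <= n.
using Expansion = std::vector<std::vector<cplx>>;

// rotT[n][m + n][mp + n]  : forward-rotation matrix for degree n
// coax[m + p - 1][n][np]  : coaxial-translation matrix for order m (entries with n, np >= |m|)
// rotTb[n][m + n][mp + n] : backward-rotation matrix for degree n
Expansion rotateTranslateRotate(const Expansion& A,
                                const std::vector<std::vector<std::vector<double>>>& rotT,
                                const std::vector<std::vector<std::vector<double>>>& coax,
                                const std::vector<std::vector<std::vector<double>>>& rotTb,
                                int p)
{
    Expansion A1(p), A2(p), B(p);
    for (int n = 0; n < p; ++n) {
        A1[n].assign(2 * n + 1, cplx(0.0));
        A2[n].assign(2 * n + 1, cplx(0.0));
        B[n].assign(2 * n + 1, cplx(0.0));
    }

    // Stage 1, forward rotation (6): couples orders m within each degree n -> O(p^3) work.
    for (int n = 0; n < p; ++n)
        for (int m = -n; m <= n; ++m)
            for (int mp = -n; mp <= n; ++mp)
                A1[n][m + n] += rotT[n][m + n][mp + n] * A[n][mp + n];

    // Stage 2, coaxial translation (7): couples degrees n within each order m -> O(p^3) work.
    for (int m = -(p - 1); m <= p - 1; ++m)
        for (int n = std::abs(m); n < p; ++n)
            for (int np = std::abs(m); np < p; ++np)
                A2[n][m + n] += coax[m + p - 1][n][np] * A1[np][m + np];

    // Stage 3, backward rotation (8): restores the original orientation -> O(p^3) work.
    for (int n = 0; n < p; ++n)
        for (int m = -n; m <= n; ++m)
            for (int mp = -n; mp <= n; ++mp)
                B[n][m + n] += rotTb[n][m + n][mp + n] * A2[n][mp + n];

    return B;
}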

III. GPU ARCHITECTURE AND PROGRAMMING MODEL

Before detailing the GPU parallelization of the key MLFMA operations in (1)–(8), we review the key features of today's GPU architecture and the relevant programming considerations.

A typical GPU card architecture, its interface with the CPU, as well as the available computational resources are depicted in Fig. 1. GPU cards feature a large amount of high-latency global memory (up to 4 GB) and a small amount of low-latency shared memory. To mitigate the effects of the high global memory access latency, the following techniques can be exercised in the development of a parallel MLFMA for the GPU.

Coalescing Global Memory Access: As access to global memory incurs a latency of 400 to 600 clock cycles [10], it is important to coalesce such operations whenever possible. A single sequence of instructions on a GPU is referred to as a thread. Threads are grouped into warps, with a maximum of 32 threads per warp. On an nVidia GPU, one of the conditions required for optimal memory bandwidth is to have the first 16 or last 16 threads of a warp access global memory in a coalesced fashion [10]. Examples of coalesced and uncoalesced memory accesses are shown in Fig. 2. For optimal performance during kernel executions, the coefficients $C_n^m$ and $D_n^m$ are stored in global memory in a manner such that read and write operations can be coalesced. As an example, the coefficients $C_0^0$ for contiguously numbered boxes are stored in the first contiguous locations in global memory that are referenced by a pointer, followed by the coefficients $C_1^{-1}$, and so on.

Fig. 1. Typical architecture of a GPU card and its interfacing to the CPU.

Fig. 2. Examples of coalesced and uncoalesced memory accesses.
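A minimal CUDA sketch of this storage scheme is shown below; the kernel does nothing useful with the data (it merely accumulates it), but it illustrates why the coefficient-major layout lets one-thread-per-box kernels issue coalesced reads. The kernel and array names are illustrative assumptions, not the paper's code.

#include <cuda_runtime.h>

// Coefficients are stored "coefficient-major", with the box index as the fastest-varying
// dimension, so consecutive threads (one thread per box) read consecutive addresses and
// every access to coeff[] coalesces within a half-warp.
__global__ void sumCoefficientsPerBox(const double2* coeff,   // [numCoeff][numBoxes], complex values
                                      double2* out,           // one accumulated value per box
                                      int numBoxes, int numCoeff)
{
    int box = blockIdx.x * blockDim.x + threadIdx.x;          // one thread per box
    if (box >= numBoxes) return;

    double2 acc = make_double2(0.0, 0.0);
    for (int c = 0; c < numCoeff; ++c) {
        double2 v = coeff[c * numBoxes + box];                // stride-1 access across neighboring threads
        acc.x += v.x;
        acc.y += v.y;                                         // placeholder use of the data
    }
    out[box] = acc;
}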

Caching Repetitively Used Data: Some frequently accessed data in the MLFMA, such as the rotation and translation coefficients of the rotate-translate-rotate method [9], can be stored in the low-latency shared memory available on each multiprocessor of the GPU.
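The sketch below (hypothetical names, not the paper's kernels) shows the basic pattern: the threads of a block cooperatively load a tile of coefficients into shared memory once, synchronize, and then reuse the tile many times at low latency.

// Assumed tile size and array names are for illustration only.
__global__ void applyWithCachedCoefficients(const double* coeffGlobal, const double* x,
                                            double* y, int numCoeff, int n)
{
    __shared__ double cache[256];                             // small tile kept per block

    int tile = (numCoeff < 256) ? numCoeff : 256;
    for (int i = threadIdx.x; i < tile; i += blockDim.x)
        cache[i] = coeffGlobal[i];                            // cooperative, coalesced load
    __syncthreads();                                          // tile is now visible to all threads

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;

    double acc = 0.0;
    for (int i = 0; i < tile; ++i)
        acc += cache[i] * x[t];                               // each cached coefficient reused by every thread
    y[t] = acc;
}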

Latency Hiding: Multiple blocks should be kept active on each multiprocessor so that, while one block waits for a memory access, another block can continue executing.

IV. PARALLEL LOW-FREQUENCY MLFMA ON GPU

A. Preprocessing on CPU

The parallel MLFMA implementation on the GPU begins with preprocessing steps that are performed on the CPU prior to the traversal of the tree. First, the nonempty leaf boxes are sequentially indexed $b = 1, 2, \dots, B$. The sources inside them are renumbered as follows. The point sources situated in leaf box 1 are numbered from 1 to $M_1$, sources numbered $M_1 + 1$ to $M_1 + M_2$ are situated in leaf box 2, and so on, where the sources numbered from $M_1 + \dots + M_{b-1} + 1$ to $M_1 + \dots + M_b$ are situated in the $b$th box. In the second preprocessing step, the rotation and translation coefficients $T_n^{mm'}$, $(E|F)_{nn'}^{m}$, and $\bar{T}_n^{mm'}$ appearing in (6)–(8) are precomputed on the CPU for all levels $l = 2, \dots, l_{\max}$ and copied to the GPU's global memory (Fig. 1). This step requires less than 2 s to complete on Intel's Xeon E5520 CPU running at 2.26 GHz for a seven-level tree with truncation numbers reaching $p = 30$ at the coarsest tree level. The preprocessing steps are run on the CPU because of their minimal impact on the overall run time. Upon completion of the preprocessing stage, the traversal of the tree begins.
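As an illustration, one way to implement the renumbering step on the CPU is a counting sort over leaf-box indices. The sketch below assumes a 0-based box index per source (boxOf) and produces a permutation together with per-box offsets; it is a plausible realization under those assumptions, not necessarily the procedure of [8].

#include <cstddef>
#include <vector>

// After the call, boxStart[b] .. boxStart[b + 1] - 1 index the renumbered sources of box b,
// and perm[k] gives the original index of the k-th renumbered source.
void renumberSourcesByLeafBox(const std::vector<int>& boxOf, int numLeafBoxes,
                              std::vector<std::size_t>& perm, std::vector<std::size_t>& boxStart)
{
    boxStart.assign(numLeafBoxes + 1, 0);
    for (int b : boxOf) ++boxStart[b + 1];                                   // count sources per box
    for (int b = 0; b < numLeafBoxes; ++b) boxStart[b + 1] += boxStart[b];   // prefix sums -> box offsets

    perm.assign(boxOf.size(), 0);
    std::vector<std::size_t> cursor(boxStart.begin(), boxStart.end() - 1);
    for (std::size_t i = 0; i < boxOf.size(); ++i)
        perm[cursor[boxOf[i]]++] = i;                                        // place source i in its box's slot
}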


B. Leaf Level Aggregation Kernel

Algorithm 1 Parallel Leaf Box Aggregation on GPU

1: $b \leftarrow 1$ {initialize box count}
2: while $b \le B$ {for all boxes at leaf level} do
3:   for all $b' = b, \dots, b + N_{\text{threads}} - 1$ {for $N_{\text{threads}}$ boxes in parallel} do
4:     for $\nu = \nu_{\text{start}}(b')$ to $\nu_{\text{end}}(b')$ {all points in the $b'$th box} do
5:       $\mathbf{r}' \leftarrow \mathbf{r}_\nu - \mathbf{c}_{b'}$ {$\mathbf{r}_\nu$ is the $\nu$th point position, $\mathbf{c}_{b'}$ is the $b'$th box center, $\mathbf{r}'$ is defined in spherical coordinates of the $b'$th box, i.e., $\mathbf{r}' = (r', \theta', \varphi')$}
6:       for $n = 0$ to $p - 1$ do
7:         evaluate $j_n(kr')$
8:         for $m = -n$ to $n$ do
9:           $C_n^m(b') \leftarrow C_n^m(b') + q_\nu \, j_n(kr') \, Y_n^{-m}(\theta', \varphi')$
10:        end for
11:      end for
12:    end for
13:    $b \leftarrow b + N_{\text{threads}}$ {increment by # of threads}
14:  end for
15: end while

The procedure for the parallel leaf-level aggregation ($l = l_{\max}$) is given by Algorithm 1. The coefficients $C_n^m$ are computed on the GPU according to (4) in parallel for $N_{\text{threads}}$ boxes at a time.¹ GPU methods for computing the spherical Bessel and Hankel functions are implemented according to [11].
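For concreteness, the following CUDA sketch shows the one-thread-per-box pattern of Algorithm 1 for the lowest-order coefficient only; it is a simplified illustration under assumed array names, not the implementation of [8]. Only the $n = 0$, $m = 0$ term of (4) is formed, using $j_0(kr) = \sin(kr)/(kr)$ and $Y_0^0 = 1/\sqrt{4\pi}$; the full kernel loops over all $(n, m)$ and evaluates the spherical Bessel functions and spherical harmonics as in [11].

#include <cuda_runtime.h>
#include <math_constants.h>

__global__ void leafAggregationC00(const double3* srcPos,   // renumbered source positions
                                   const double*  srcMag,   // source magnitudes q
                                   const size_t*  boxStart, // sources of box b: boxStart[b] .. boxStart[b+1]-1
                                   const double3* boxCenter,
                                   double*        C00,      // n = 0, m = 0 coefficient per leaf box (real for real q)
                                   int numLeafBoxes, double k)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;           // one thread per leaf box
    if (b >= numLeafBoxes) return;

    double acc = 0.0;
    const double y00 = 0.5 * rsqrt(CUDART_PI);               // Y_0^0 = 1/sqrt(4*pi)
    for (size_t s = boxStart[b]; s < boxStart[b + 1]; ++s) {
        double dx = srcPos[s].x - boxCenter[b].x;
        double dy = srcPos[s].y - boxCenter[b].y;
        double dz = srcPos[s].z - boxCenter[b].z;
        double r  = sqrt(dx * dx + dy * dy + dz * dz);
        double j0 = (k * r > 1e-30) ? sin(k * r) / (k * r) : 1.0;   // j_0(kr), with the kr -> 0 limit
        acc += srcMag[s] * j0 * y00;
    }
    C00[b] = acc;
}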

C. Aggregation and Disaggregation Kernels

The nonleaf-level aggregation procedure ($l = l_{\max} - 1, \dots, 2$) is given by Algorithm 2, which is executed as one thread per parent box. The disaggregation procedure is organized similarly to Algorithm 2, with the exception that it is executed as one thread per child box and that the result is directly assigned to the child box's $D_n^m$ only once.

Algorithm 2 Parallel Nonleaf Aggregation on GPU

The forward rotation, translation, and backward rotation coefficients $T_n^{mm'}$, $(E|F)_{nn'}^{m}$, and $\bar{T}_n^{mm'}$, respectively, are precomputed on the CPU and copied into the shared memory of the GPU prior to execution.

1: for $l = l_{\max} - 1$ down to 2 {for all nonleaf levels} do
2:   $b \leftarrow 1$ {initialize box count}
3:   while $b \le B_l$ {for all boxes at level $l$} do
4:     for all $b' = b, \dots, b + N_{\text{threads}} - 1$ {for $N_{\text{threads}}$ boxes in parallel} do
5:       for each nonempty child $c$ of the $b'$th box do
6:         forward rotation of the child's coefficients via (6)
7:         coaxial translation to the parent's center via (7)
8:         backward rotation via (8)
9:         accumulate the result into $C_n^m(b')$
10:      end for
11:    end for
12:    $b \leftarrow b + N_{\text{threads}}$ {increment by # of threads}
13:  end while
14: end for

¹Although the user specifies the number of blocks and threads per block, the GPU hardware is ultimately responsible for scheduling threads in parallel.

D. Translation Kernel

One of the most time-consuming operations within the MLFMA is the translation of $S$-expansions to $R$-expansions. In this step, each of the boxes at level $l$ is considered to be a destination box, denoted as $b_d$. There are up to 189 source boxes at the same level which are in $b_d$'s interaction list [9]. The $S$-expansion in each source box within the interaction list is converted to an $R$-expansion (3) with respect to the center of box $b_d$. The parallelization for the GPU is as follows.

Organization of the interaction set at a given tree level begins with collecting all 316 unique translation vectors $\mathbf{D}_t$, $t = 0, \dots, 315$, from the centers of source boxes to the centers of destination boxes [9]. In this implementation, we do not consider additional symmetries [7]. For each unique vector $\mathbf{D}_t$ at level $l$, we collect all of the "source box–destination box" pairs that correspond to $\mathbf{D}_t$ and sort the pairs with respect to the source box. We then iterate through each of the 316 vectors. Each iteration has $P_t$ translations to perform, each of which is assigned to one thread, with $N_{\text{threads}}$ threads being executed concurrently. This approach allows the read operations from global memory to be coalesced when reading the coefficients $C_n^m$ in (6)–(8). The amount of shared memory allocated is enough for just a few hundred coefficients to be cached, which allows multiple blocks to be active per multiprocessor. Each time a loop on $n$ or $m$ completes, the algorithm checks whether the cache needs refreshing with new coefficients and does so if necessary. Additionally, a portion of shared memory is used to cache values of the sine and cosine of the rotation angles, which are used in the "on-the-fly" computation of the rotation coefficients. Before any rotations begin, all threads compute an approximately equal number of said sines and cosines and write the results into shared memory. The above procedure is summarized in Algorithm 3.

Algorithm 3 Parallel Translation on GPU

The interaction list is computed and sorted for all levels prior to the execution of this kernel. Precomputation of the coefficients $T_n^{mm'}$, $(E|F)_{nn'}^{m}$, and $\bar{T}_n^{mm'}$ is similar to Algorithm 2.

1: for $l = 2$ to $l_{\max}$ {for all levels} do
2:   for $t = 0$ to 315 {for all translation vectors $\mathbf{D}_t$} do
3:     load the rotation and coaxial translation coefficients for $\mathbf{D}_t$ into shared memory
4:     $j \leftarrow 1$ {initialize interaction pair counter for $\mathbf{D}_t$}
5:     while $j \le P_t$ do
6:       for all $j' = j, \dots, j + N_{\text{threads}} - 1$ {for $N_{\text{threads}}$ threads in parallel (one thread per pair)} do
7:         cooperatively compute and cache the required sines and cosines in shared memory
8:         read the source box coefficients $C_n^m$ from global memory (coalesced)
9:         forward rotation via (6)
10:        coaxial translation via (7)
11:        backward rotation via (8)
12:        accumulate the result into $D_n^m$ of the destination box
13:        refresh the shared-memory coefficient cache if required
14:      end for

15:      $j \leftarrow j + N_{\text{threads}}$ {increment by # of threads}
16:    end while
17:  end for
18: end for

TABLE I: BENCHMARK PARAMETERS

E. Leaf-Level Disaggregation Kernel

The MLFMA evaluates the far-field contribution with respect to each point from the coefficients $D_n^m$ at the leaf level using (3). The parallelization of these computations is similar to Algorithm 1, with the exception that it is executed as one thread per particle and that (3) replaces (4).

F. Near-Field Calculation Kernel

The near-field contributions due to the points contained inside a box and within its adjacent neighbor boxes are computed directly using (1). The presented implementation can handle up to 320 sources per box. This limitation is due to the number of registers available per GPU multiprocessor, as one thread is assigned to handle one particle within a box. More details on these kernels can be found in [8].
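A minimal CUDA sketch of this direct evaluation is given below (one thread per observation point). The CSR-style near-source list (nearStart, nearSrc) is an assumed layout for illustration, not necessarily the one used in [8].

#include <cuda_runtime.h>

// Direct near-field evaluation of (1): sum q * exp(ik|r - r_s|)/|r - r_s| over the
// sources in the observation point's own box and its adjacent neighbor boxes.
__global__ void nearFieldDirect(const double3* obsPos, const double3* srcPos, const double* srcMag,
                                const int* nearStart, const int* nearSrc,
                                double2* field, int numObs, double k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;            // one thread per observation point
    if (i >= numObs) return;

    double re = 0.0, im = 0.0;
    for (int j = nearStart[i]; j < nearStart[i + 1]; ++j) {
        int s = nearSrc[j];
        double dx = obsPos[i].x - srcPos[s].x;
        double dy = obsPos[i].y - srcPos[s].y;
        double dz = obsPos[i].z - srcPos[s].z;
        double r  = sqrt(dx * dx + dy * dy + dz * dz);
        if (r == 0.0) continue;                               // skip the self term
        double c, sn;
        sincos(k * r, &sn, &c);                               // sn = sin(kr), c = cos(kr)
        re += srcMag[s] * c / r;                              // Re{ q e^{ikr}/r }
        im += srcMag[s] * sn / r;                             // Im{ q e^{ikr}/r }
    }
    field[i].x += re;
    field[i].y += im;
}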

V. NUMERICAL RESULTS

In the numerical experiments, we considered uniform distributions of 0.5 to 7.3 million sources confined to a cube, as well as sources distributed on a spherical surface. The magnitudes $q_\nu$ were chosen randomly. The performance of the proposed GPU implementation was compared against its CPU counterpart. Table I summarizes the parameters and numerical results for the two benchmarks, Benchmark A and Benchmark B. The results presented in Table II were obtained on a Quadro FX 5800 GPU with 4 GB of memory, while the CPU used was an Intel Xeon E5520 running at 2.26 GHz. The error levels listed in Table I were computed using a relative error norm. For the translation procedure, caching data in the manner described in Algorithm 3 was found to be advantageous only at some levels, where speedups of approximately 10 were realized. At the remaining levels, an alternative to the translation procedure in Algorithm 3 was used, which handled one destination box per thread. The GPU code that implemented the coalescing and caching optimizations discussed in Section III yielded speedups of 1.1 to 1.6 in total GPU run time versus code that did not use the said optimizations. Since no more than 320 sources per leaf box can be handled due to the number of available registers on each multiprocessor, this implementation is limited to problem sizes of between 7.0 and 7.5 million sources for a volumetric distribution. A corresponding limit applies to surface distributions of sources.

TABLE II: BENCHMARK RUN TIMES IN SECONDS AND SPEEDUP

VI. CONCLUSION

The parallelization of a low-frequency MLFMA for the GPU is discussed. The MLFMA operations such as aggregation, disaggregation, and translation are parallelized to take advantage of the GPU's multiprocessor architecture. A speedup of between 10 and 30 was achieved for simulations with 0.5 to 7.3 million sources. The results are demonstrated for three and nine digits of accuracy.

REFERENCES

[1] F. P. Andriulli et al., "A multiplicative Calderon preconditioner for the electric field integral equation," IEEE Trans. Antennas Propag., vol. 56, no. 8, pt. 1, pp. 2398–2412, Aug. 2008.

[2] H. Cheng et al., "A wideband fast multipole method for the Helmholtz equation in three dimensions," J. Comput. Phys., vol. 216, no. 1, pp. 300–325, Jul. 2006.

[3] L. Xuan et al., "A broadband multilevel fast multipole algorithm," in Proc. IEEE AP-S Int. Symp., Monterey, CA, Jun. 2004, vol. 2, pp. 1195–1198.

[4] H. Wallen and J. Sarvas, "Translation procedures for broadband MLFMA," Prog. Electromagn. Res., vol. 55, pp. 47–78, 2005.

[5] I. Bogaert et al., "A nondirective plane wave MLFMA stable at low frequencies," IEEE Trans. Antennas Propag., vol. 56, no. 12, pp. 3752–3767, Dec. 2008.

[6] W. Hwu et al., "Compute unified device architecture application suitability," IEEE Comput. Sci. Eng., vol. 11, no. 3, pp. 16–25, May/Jun. 2009.

[7] N. Gumerov and R. Duraiswami, "Fast multipole methods on graphics processors," J. Comput. Phys., vol. 227, pp. 8290–8313, 2008.

[8] M. Cwikla, "The low-frequency MLFMM on graphics processors," M.S. thesis, Dept. ECE, Univ. Manitoba, Winnipeg, MB, Canada, 2009.

[9] N. A. Gumerov and R. Duraiswami, Fast Multipole Methods for the Helmholtz Equation in Three Dimensions. Amsterdam, The Netherlands: Elsevier, 2006.

[10] "nVidia CUDA Programming Guide," ver. 2.3, nVidia Corp., May 2009.

[11] S. Zhang and J.-M. Jin, Computation of Special Functions. New York: Wiley, 1996.
