The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
PARALLEL BOUNDARY ELEMENT SOLUTIONS OF BLOCK
CIRCULANT LINEAR SYSTEMS FOR ACOUSTIC RADIATION
PROBLEMS WITH ROTATIONALLY SYMMETRIC
BOUNDARY SURFACES
A Thesis in
Computer Science and Engineering
by
Kenneth D. Czuprynski
© 2012 Kenneth D. Czuprynski
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Master of Science
May 2012
The thesis of Kenneth D. Czuprynski was reviewed and approved* by the following:
Suzanne M. Shontz, Assistant Professor of Computer Science and Engineering, Thesis Adviser
Jesse L. Barlow, Professor of Computer Science and Engineering
John B. Fahnline, Assistant Professor of Acoustics
Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering
*Signatures are on file in the Graduate School.
Abstract
Coupled finite element/boundary element (FE/BE) formulations are commonly
used to solve structural-acoustic problems where a vibrating structure is idealized as
being submerged in a fluid that extends to infinity in all directions. Typically in (FE/BE)
formulations, the structural analysis is performed using the finite element method, and
the acoustic analysis is performed using the boundary element method. In general, the
problem is solved frequency by frequency, and the coefficient matrix for the boundary
element analysis is fully populated and little can be done to alleviate the storage and
computational requirements. Because acoustic boundary element calculations require
approximately six elements per wavelength to produce accurate solutions, the boundary
element formulation is limited to relatively low frequencies. However, when the outer
surface of the structure is rotationally symmetric, the system of linear equations becomes
block circulant. We propose a parallel algorithm for distributed memory systems which
takes advantage of the underlying concurrency of the inversion formula for block circulant
matrices. By using the structure of the coefficient matrix in tandem with a distributed
memory system setting, we show that the storage and computational requirements are
substantially lessened.
Table of Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Acoustic Radiation Problems . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Boundary Element Method . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 The Fourier Matrix and Fast Fourier Transform . . . . . . . . . . . . 10
1.4 Circulant Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 2. Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 3. Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Coefficient Matrix Derivation . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Block Circulant Inversion . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Invertibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 4. Parallel Solution Algorithm . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Block DFT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Block FFT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 System Solves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Parallel Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 5. Theoretical Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 Parallel Linear System Solve . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Block DFT using the DFT Algorithm . . . . . . . . . . . . . . . . . 53
5.3 Block DFT Using the FFT Algorithm . . . . . . . . . . . . . . . . . 55
5.4 Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 6. Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Chapter 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Appendix. BEM Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
A.1 STATIC MULTIPOLE ARRAYS . . . . . . . . . . . . . . . . . . . . 73
A.1.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 73
A.1.1.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 74
A.1.2 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
A.1.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . 74
A.1.2.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 75
A.2 COEFF MATRIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A.2.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A.2.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 76
A.2.1.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 76
A.2.2 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.2.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . 77
A.2.2.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 77
A.3 SOURCE AMPLITUDES MODES . . . . . . . . . . . . . . . . . . . 80
A.3.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.3.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 80
A.3.1.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 80
A.3.2 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.3.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . 82
A.3.2.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 83
A.4 SOURCE POWER . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.4.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A.4.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 86
A.4.1.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 87
A.4.2 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
A.4.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . 87
A.4.2.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 90
A.5 MODAL RESISTANCE . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.5.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.5.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 91
A.5.2 Rotationally Symmetric . . . . . . . . . . . . . . . . . . . . . 92
A.5.3 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.5.3.1 General Case . . . . . . . . . . . . . . . . . . . . . . 92
A.5.3.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 92
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
List of Figures
1.1 Radix 2 element interaction pattern obtained from [18]. . . . . . . . . . 16
3.1 A propeller with three times rotational symmetry [37]. . . . . . . . . . . 26
3.2 A four times rotationally symmetric sketch of a propeller. . . . . . . . . 27
4.1 Initial data distribution assumed in the DFT computation for the case
P = m = 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 The DFT computation for the case P = m = 4. Each arrow indicates
the communication of a processor’s owned submatrix to a neighboring
processor in the direction of the arrow. . . . . . . . . . . . . . . . . . . . 40
4.3 Parallel block DFT data decomposition for P > m. . . . . . . . . . . . . 42
4.4 Parallel block DFT data decomposition and processor groupings for P > m. 43
4.5 Process illustrating the distributed FFT. Lines crossing to different pro-
cessors indicate communication from left to right. Note the output is in
reverse bit-reversed order relative to numbering starting at zero; that is,
A1 is element 0; A2 is element 1, etc. . . . . . . . . . . . . . . . . . . . . 47
4.6 Processor grid creation for P=16 and m=4. . . . . . . . . . . . . . . . . 48
6.1 Runtime comparison using the DFT algorithm for varying P and N with
m = 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 Runtime comparison using the FFT algorithm for varying P and N with
m = 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 Speedups using the DFT algorithm for varying P and N with m = 4. . 61
6.4 Speedups using the FFT algorithm for varying P and N with m = 4. . . 61
6.5 Efficiency using the DFT algorithm for varying N and P with m = 4. . 63
6.6 Efficiency using the FFT algorithm for varying N and P with m = 4. . 64
6.7 Runtime comparison using the DFT algorithm for varying P and N with
m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.8 Runtime comparison using the FFT algorithm for varying P and N with
m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.9 Speedup comparison using the DFT algorithm for varying P and N when
m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.10 Speedup comparison using the FFT algorithm for varying P and N when
m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.11 Efficiency comparison using the DFT algorithm for varying P and N
when m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.12 Efficiency comparison using the FFT algorithm for varying P and N
when m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 1
Introduction
Coupled finite element/boundary element (FE/BE) formulations are commonly
used to solve structural-acoustic problems where a vibrating structure is idealized as be-
ing submerged in a fluid that extends to infinity in all directions. Typically in (FE/BE)
formulations, the structural analysis is performed using the finite element method, and
the acoustic analysis is performed using the boundary element method (BEM). The
boundary element formulation is advantageous for the acoustic radiation problem be-
cause only the outer surface of the structure in contact with the acoustic medium is
discretized. This formulation also allows us to neglect meshing the infinite fluid exterior
to the structure, as would be required if the finite element method were used instead.
Using the BEM, we compute the radiated sound field of a vibrating structure
Ω ⊂ R3. The main obstacle in computing the sound radiation is solving the linear
system of equations to enforce the specified boundary conditions. In the context of
the BEM, this requires the solution of a dense, complex linear system. In general, the
problem is solved frequency by frequency, and the coefficient matrix for the boundary
element analysis is fully populated and exhibits no exploitable structure. The size, N²,
of the coefficient matrix is directly correlated with the level of discretization, N , used
for the surface in question. Because acoustic boundary element calculations require
approximately six elements per wavelength to produce accurate solutions, the boundary
element formulation is limited to relatively low frequencies. For high frequency problems,
and for problems which involve large and/or complex surfaces, these matrices are large,
dense, and unstructured; therefore, there is little which can be done to alleviate the
storage and computational requirements. Iterative solvers and preconditioners have been
investigated [4, 5, 28] and are a natural choice for large problems because the cost of
direct solvers can become prohibitive. While the computational requirements can be
lessened by iterative methods, the storage requirements can still present a problem. One
obvious solution is to perform the solve in a distributed memory parallel setting. A
distributed memory parallel algorithm distributes the workload and allows the storage
of the matrix to be split between many individual systems with local memories, thereby
increasing the total available memory. In addition, because linear systems are ubiquitous
throughout scientific computation, libraries exist for their efficient parallel solution. In
particular, because the matrix is dense, Scalable LAPACK (ScaLAPACK) [6] is a favored
choice.
While in general these matrices exhibit no exploitable structure, when the bound-
ary surface is rotationally symmetric, the coefficient matrix is block circulant. Circulant
matrices are defined as each row being a circular shift of the row above it. One property
of circulant matrices is that they are all diagonalizable by the Fourier matrix. There-
fore, the Discrete or Fast Fourier Transform (D/FFT) can be used in the solution of
the system. These results generalize to the block case and can be used in the solution
of block circulant linear systems arising from acoustic radiation problems involving ro-
tationally symmetric boundary surfaces. In addition, the inversion formula for block
circulant matrices is highly amenable to parallel computation.
We propose an algorithm for distributed memory systems which takes advantage
of the underlying concurrency of the inversion formula for block circulant matrices. By
using the structure of the coefficient matrix in tandem with a distributed system setting,
the storage and computational limitations are substantially lessened. Therefore, the
algorithm allows larger and higher frequency acoustic radiation problems to be explored.
1.1 Acoustic Radiation Problems
The goal is to compute the radiated sound field due to a vibrating structure
Ω ⊂ R3 subject to given boundary conditions. The governing partial differential equation
(PDE) for acoustic radiation problems is the Helmholtz equation, i.e.,

\[ \left(\nabla^2 + k^2\right) u(p) = 0, \qquad p \in \Omega^+, \tag{1.1} \]

where ∇² is the Laplacian; k = ω/c is the wave number; ω is the angular frequency, and c
is the speed of sound in the chosen medium. Ω+ = R³\Ω denotes the region exterior to
Ω. In structural acoustics problems, it is common for the velocity distribution over the
boundary of Ω, denoted by ∂Ω, to be specified. This equates to the Neumann boundary
condition

\[ \frac{\partial u(p)}{\partial n_p} = f(p), \qquad p \in \partial\Omega, \tag{1.2} \]

where ∂/∂n_p denotes differentiation in the direction of the outward normal at p ∈ ∂Ω. In
addition, to ensure all radiated waves are outgoing, the Sommerfeld radiation condition

\[ \lim_{r \to \infty} r \left( \frac{\partial u(p)}{\partial r} - i k u(p) \right) = 0 \tag{1.3} \]

is enforced, where r is the distance of p from a fixed origin. Therefore, in order to solve
for the radiated sound field due to Ω, a solution to the Helmholtz equation (1.1), subject
to equations (1.2) and (1.3), must be found.
1.2 Boundary Element Method
The boundary element method is an algorithm for the numerical solution of PDEs
which have an equivalent boundary integral representation. The BEM reformulates
the PDE into an equivalent boundary integral equation (BIE), which is then solved
numerically. The benefit of the formulation is that it reduces the problem to one over
the boundary. However, because the BEM requires an equivalent BIE formulation, if
the PDE cannot be represented as an equivalent BIE, the BEM cannot be used. The
remainder of the section will outline the BEM within the context of an acoustic radiation
problem.
Consider a vibrating structure Ω ⊂ R3. The Helmholtz equation is the governing
PDE for the radiated sound field produced by Ω and is given by (1.1). A standard
boundary integral formulation of (1.1) yields the following equations

\[ \frac{1}{4\pi} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) = u(p), \qquad p \in \Omega^+ \tag{1.4} \]

and

\[ \frac{1}{2\pi} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) = u(p), \qquad p \in \partial\Omega, \tag{1.5} \]
where G(p, q) is the Green’s function, which can loosely be thought of as the effect the
point q has on point p. In the context of an acoustic radiation problem, the Green’s
function corresponds to the fundamental solution of the Helmholtz equation and is given
by G(p, q) = e^{ik|p−q|}/|p−q|, in which |p − q| denotes the Euclidean distance between the
points p and q. A solution to u in the exterior domain with respect to the points on the
boundary is provided by (1.4). Therefore, if the quantities u and ∂u(p)/∂n_q are known over
the boundary, the solution for the points in the exterior can be easily computed. In
addition, (1.5) provides a means of solving for the aforementioned quantities. However,
by applying the Fredholm alternative to (1.5) it is found that the solutions are not unique
for all wave numbers k, and thus an alternative formulation is required [34]. Burton and
Miller [9] showed how a unique solution can be derived. Differentiating (1.5) in the
direction of the outward normal yields
\[ \frac{1}{2\pi} \frac{\partial}{\partial n_p} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) = \frac{\partial u(p)}{\partial n_p}, \qquad p \in \partial\Omega. \tag{1.6} \]
Then constructing a linear combination of equations (1.5) and (1.6) using a purely imagi-
nary coupling coefficient, β, produces a modified BIE formulation with a unique solution.
The formulation is given by

\[ \frac{1}{2\pi} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) \;+\; \beta \left[ \frac{1}{2\pi} \frac{\partial}{\partial n_p} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) \right] = u(p) + \beta \frac{\partial u(p)}{\partial n_p}. \tag{1.7} \]
Assuming a Neumann boundary condition, (1.7) can be rearranged as follows:
\[ \int_{\partial\Omega} u(q) \left( \beta \frac{\partial^2 G(p,q)}{\partial n_q \partial n_p} + \frac{\partial G(p,q)}{\partial n_q} \right) d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} \frac{\partial u(q)}{\partial n_q} \left( G(p,q) + \beta \frac{\partial G(p,q)}{\partial n_p} \right) d(\partial\Omega) + 2\pi\beta \frac{\partial u(p)}{\partial n_p}, \qquad p \in \partial\Omega. \tag{1.8} \]
Note, in the case of a Dirichlet boundary condition, ∂u(p)/∂n_q can be solved for by rearranging
(1.8). Once u(p) has been solved for over the boundary, the solution for all points in
the exterior can be obtained. Therefore, a means for numerically solving equation (1.8)
must be devised. For notational convenience, let v(q) = ∂u(q)/∂n_q, and redefine portions of
both integrands as

\[ T(p,q) = \beta \frac{\partial^2 G(p,q)}{\partial n_q \partial n_p} + \frac{\partial G(p,q)}{\partial n_q} \tag{1.9} \]

and

\[ H(p,q) = G(p,q) + \beta \frac{\partial G(p,q)}{\partial n_p}. \tag{1.10} \]
Equation (1.8) becomes

\[ \int_{\partial\Omega} u(q)\, T(p,q)\, d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} v(q)\, H(p,q)\, d(\partial\Omega) + 2\pi\beta\, v(p), \qquad p \in \partial\Omega. \tag{1.11} \]
The next step in the BEM is to discretize the boundary surface, ∂Ω, into smaller quadri-
lateral or triangular surface elements. After the discretization, the boundary can be
represented as ∂Ω = ∂Ω1∪∂Ω2∪· · ·∪∂ΩN , where ∂Ωi represents the ith surface element
in the discretization of ∂Ω and ∂Ωi ∩ ∂Ωj = ∅ for i ≠ j.
Equation (1.11) can then be represented as

\[ \sum_{i=1}^{N} \left[ \int_{\partial\Omega_i} u(q)\, T(p,q)\, d(\partial\Omega_i) \right] - 2\pi u(p) = \sum_{i=1}^{N} \left[ \int_{\partial\Omega_i} v(q)\, H(p,q)\, d(\partial\Omega_i) \right] + 2\beta\pi\, v(p), \qquad p \in \partial\Omega. \tag{1.12} \]
The most straightforward approach to numerically solving equation (1.12) is to assume
u(p) and v(p) are constant along each surface element, ∂Ωi, i = 1, . . . , N . Therefore,
let u(p) ≈ uj and v(p) ≈ vj for p ε ∂Ωj , j = 1, . . . , N . Under this assumption, equation
(1.12) can be decomposed into N equations, i.e., one equation for each surface element;
that is,
\[ \sum_{i=1}^{N} u_i \left[ \int_{\partial\Omega_i} T(p,q)\, d(\partial\Omega_i) \right] - 2\pi u_j = \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\beta\pi\, v_j, \qquad p \in \partial\Omega_j. \tag{1.13} \]
Equation (1.13) yields a solution for the jth surface element of the boundary. The
boundary is constructed of N surface elements; therefore, there are N equations and N
unknowns total. Using this, equation (1.13) can more concisely be expressed in matrix
notation. Let

\[ M = \begin{bmatrix} \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \\ \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \\ \vdots & \vdots & \ddots & \vdots \\ \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \end{bmatrix}, \]

where the jth row is evaluated at the point p ∈ ∂Ωj appearing in equation (1.13).
Similarly, let the column vector b represent the right-hand side; that is,

\[ b = \begin{bmatrix} \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_1 \\ \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_2 \\ \vdots \\ \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_N \end{bmatrix}. \]
With a Neumann boundary condition, each vi, i = 1, . . . , N , is known, and the integrals
can be computed via numerical quadrature. Therefore, the matrix M and vector b are
known quantities. Using the new quantities, the linear system
\[ (M - 2\pi I)\, u = b, \qquad p \in \partial\Omega, \tag{1.14} \]
can be used to solve for the approximation of u over the boundary. Once we have an
approximate solution for u over the surface, (1.4) can be used to solve for u in the
exterior.
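
As a rough illustration of how (1.14) is used, the sketch below assumes the element integrals have already been evaluated by numerical quadrature and stored in hypothetical arrays T_int and H_int, with T_int[j, i] holding the integral of T(p, q) over element i for the collocation point on element j, and similarly for H_int. It is a minimal NumPy example, not the thesis BEM code.

import numpy as np

def solve_surface_system(T_int, H_int, v, beta):
    # Assemble and solve (M - 2*pi*I) u = b from (1.13)-(1.14).
    # v holds the known Neumann data v_i per element; beta is the
    # purely imaginary coupling coefficient.
    N = len(v)
    b = H_int @ v + 2.0 * np.pi * beta * v     # right-hand side vector b
    A = T_int - 2.0 * np.pi * np.eye(N)        # coefficient matrix (M - 2*pi*I)
    return np.linalg.solve(A, b)               # dense complex solve for u on the boundary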
It is difficult to precisely enforce the boundary conditions for the surface velocity
at edges and corners when the basis functions are constructed using surface distributions
of simple and dipole sources, as they are in Burton and Miller’s standard implementation.
To avoid this difficulty, it is possible to rewrite the solution in terms of surface-averaged
quantities instead, which is common in acoustics. For example, surface-averaged pres-
sures and volume velocities are commonly used in lumped parameter representations
of transducers. Since the goal is no longer to match the boundary conditions on a
point-by-point basis, it becomes permissible to simplify the solution by constructing the
basis functions from discrete sources rather than distributions of sources. Using surface-
averaged pressures and volume velocities as variables can also be shown to produce a
solution that converges with mesh density, unlike the standard formulation which can
produce a less accurate solution as the mesh is refined. The solution is then derived in
terms of source amplitudes rather than physical quantities, such as pressure or velocity.
For this type of indirect solution, an approach similar to Burton and Miller’s can be
used to prevent nonexistence/nonuniqueness difficulties. A hybrid "tripole" source type
is created from a simple and dipole source with a complex-valued coupling coefficient, as
is discussed by Hwang and Chang [19]. The numerical implementation discussed in this
thesis is based on an indirect solution using tripole sources, but the basic formulation
shares many characteristics with the standard Burton and Miller approach discussed
previously.
1.3 The Fourier Matrix and Fast Fourier Transform
The Fourier matrix is given by

\[ F = \frac{1}{\sqrt{n}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega_n^{1} & \omega_n^{2} & \cdots & \omega_n^{n-1} \\ 1 & \omega_n^{2} & \omega_n^{4} & \cdots & \omega_n^{2(n-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \cdots & \omega_n^{(n-1)(n-1)} \end{bmatrix}, \tag{1.15} \]

where ω_n = e^{i2π/n}, i = √−1, and normalizing by 1/√n makes F unitary. The discrete Fourier
transform (DFT) is defined as a matrix vector multiplication involving the Fourier ma-
trix. That is,
y = Fx. (1.16)
The vector y is called the DFT of x. Similarly, the inverse discrete Fourier transform
(IDFT) of x is given by
\[ y = F^{-1} x. \tag{1.17} \]

However, because F has been defined to be unitary, (1.17) becomes

\[ y = F^{*} x. \tag{1.18} \]
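
As a quick numerical check (a small NumPy sketch, not part of the thesis), the unitary Fourier matrix can be formed explicitly and the DFT computed as the matrix-vector product (1.16). Note that NumPy's FFT routines use the opposite sign convention and omit the 1/√n factor, so under the ω_n = e^{i2π/n} convention used here the product F x matches √n times the library inverse FFT.

import numpy as np

n = 8
kk = np.arange(n).reshape(-1, 1)
jj = np.arange(n).reshape(1, -1)
F = np.exp(2j * np.pi * kk * jj / n) / np.sqrt(n)   # unitary Fourier matrix (1.15)

x = np.random.rand(n) + 1j * np.random.rand(n)
y = F @ x                                           # DFT as a matrix-vector product (1.16)
print(np.allclose(y, np.sqrt(n) * np.fft.ifft(x)))  # True under this sign convention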
The Fourier matrix is highly structured, and this structure can be used to com-
pute the DFT. The improved method of computing the DFT is called the Fast Fourier
transform (FFT) and was first introduced by Cooley and Tukey [12]. It was shown that
for vectors with n = 2^h elements, h ∈ Z⁺, the DFT can be computed in O(n log n). Over
the years, the method has been extended to handle vectors with an arbitrary number of
elements; a comprehensive overview of these can be found in [11, 26]. This thesis uses
the Cooley and Tukey version of the algorithm, also now termed the radix-2 FFT. We
thus now overview the radix-2 algorithm.
Assuming the first column and first row are indexed by 0, consider the element
in the kth row and the jth column of the Fourier matrix, which is given by ω_n^{kj} = e^{i2πkj/n}.
Note then that each element is periodic in n. This can readily be seen by using Euler's
formula. Applying Euler's formula, we have

\[ \omega_n^{kj} = \cos\left(\frac{2\pi kj}{n}\right) + i \sin\left(\frac{2\pi kj}{n}\right). \tag{1.19} \]

Because sin and cos both have period 2π, by (1.19), if kj ≥ n, the elements begin to
repeat. It follows that each element in the Fourier matrix can be represented by ω_n^{k} for
k = 0, . . . , n − 1. For example, consider the four-by-four Fourier matrix
\[ F_4 = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^{1} & \omega_4^{2} & \omega_4^{3} \\ 1 & \omega_4^{2} & \omega_4^{4} & \omega_4^{6} \\ 1 & \omega_4^{3} & \omega_4^{6} & \omega_4^{9} \end{bmatrix}. \tag{1.20} \]

By the periodicity of the elements, (1.20) becomes

\[ F_4 = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^{1} & \omega_4^{2} & \omega_4^{3} \\ 1 & \omega_4^{2} & 1 & \omega_4^{2} \\ 1 & \omega_4^{3} & \omega_4^{2} & \omega_4^{1} \end{bmatrix}. \tag{1.21} \]
The FFT algorithm uses properties of ω coupled with a divide and conquer strategy.
The following derivation relies heavily on [11]; we follow their derivation closely.
Recall that n = 2^h for h ∈ Z⁺, and consider the operation y = Fx. Expanding the
matrix-vector product gives

\[ y_k = \sum_{j=0}^{n-1} x_j \omega_n^{jk}, \qquad k = 0, \ldots, n-1. \tag{1.22} \]

Equation (1.22) can be split into two summations: one containing all of the even terms,
and one containing all of the odd terms, i.e.,

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_n^{2jk} + \sum_{j=0}^{n/2-1} x_{2j+1} \omega_n^{(2j+1)k}, \qquad k = 0, \ldots, n-1. \tag{1.23} \]

An ω_n^{k} term in the second summation can be pulled out of the summation, i.e.,

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_n^{2jk} + \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_n^{2jk}, \qquad k = 0, \ldots, n-1. \tag{1.24} \]
Using the fact that ω_n^{2} = ω_{n/2}, (1.24) becomes

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} + \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, n-1. \tag{1.25} \]
The next observation to make is that ω_{n/2}^{(k+n/2)j} = ω_{n/2}^{kj} for k = 0, . . . , n/2 − 1. That is,
because ω_{n/2} has a smaller period, the elements begin to repeat sooner, and k, in turn,
need not go beyond n/2 − 1. Therefore, (1.25) becomes

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} + \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \frac{n}{2}-1. \tag{1.26} \]
Looking more closely, each summation represents a DFT of length n/2. Therefore, a DFT
of length n can be broken into two DFTs each half the size of the previous DFT. However,
(1.26) contains only the first n/2 terms of y. Computing the remaining terms yields

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{j(k+n/2)} + \omega_n^{k+n/2} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{j(k+n/2)}, \qquad k = 0, \ldots, \frac{n}{2}-1. \tag{1.27} \]
We then obtain

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} \omega_{n/2}^{jn/2} + \omega_n^{k+n/2} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk} \omega_{n/2}^{jn/2}, \qquad k = 0, \ldots, \frac{n}{2}-1. \tag{1.28} \]

Because ω_{n/2}^{jn/2} = 1 and ω_n^{k+n/2} = −ω_n^{k}, (1.28) becomes

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} - \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \frac{n}{2}-1. \tag{1.29} \]
Therefore, the entire vector y can be obtained by

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} + \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \frac{n}{2}-1, \tag{1.30} \]
\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} - \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \frac{n}{2}-1. \]
Let s_j = x_{2j} and t_j = x_{2j+1} for j = 0, . . . , n/2 − 1; that is, s is the vector containing all
the even elements of x, and t is the vector containing all of its odd elements. Then (1.30)
may be written as

\[ [F_n x]_k = [F_{n/2}\, s]_k + \omega_n^{k}\, [F_{n/2}\, t]_k, \qquad k = 0, \ldots, \frac{n}{2}-1, \tag{1.31} \]
\[ [F_n x]_{k+n/2} = [F_{n/2}\, s]_k - \omega_n^{k}\, [F_{n/2}\, t]_k, \qquad k = 0, \ldots, \frac{n}{2}-1. \]
From (1.31), the recursive nature of the algorithm should be clear. The DFT of a vector
y can be split into two DFTs of half the size. We can proceed in computing F_{n/2} s and
F_{n/2} t, as if it were the first time, and proceed as above. Algorithm 1.1 gives a pseudocode
of the algorithm.
Algorithm 1.1 Radix-2 FFT pseudocode.
1: Y = Radix-2FFT(X, n)
2: if n == 1 then
3:   return X;
4: else
5:   s = Radix-2FFT(Even(X), n/2);
6:   t = Radix-2FFT(Odd(X), n/2);
7:   for k = 0 to n/2 − 1 do
8:     Y_k = s_k + ω_n^k t_k;
9:     Y_{k+n/2} = s_k − ω_n^k t_k;
10:  end for
11: end if
12: return Y;
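
For concreteness, a direct Python transcription of Algorithm 1.1 might look as follows (an illustrative sketch only, using the ω_n = e^{i2π/n} convention of this section and assuming the input length is a power of two).

import numpy as np

def radix2_fft(x):
    # Recursive radix-2 FFT following Algorithm 1.1.
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    s = radix2_fft(x[0::2])                            # transform of the even-indexed elements
    t = radix2_fft(x[1::2])                            # transform of the odd-indexed elements
    w = np.exp(2j * np.pi * np.arange(n // 2) / n)     # the omega_n^k factors
    y = np.empty(n, dtype=complex)
    y[:n // 2] = s + w * t                             # lines 8-9 of Algorithm 1.1
    y[n // 2:] = s - w * t
    return y

With the e^{i2π/n} sign convention used here, radix2_fft(x) equals n · np.fft.ifft(x).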
Algorithm 1.1 follows nicely from the derived mathematics; however, the recursion
can be unrolled into an iterative format which will later facilitate the explanation of our
parallel algorithm. The algorithm can be found in [24], and our explanation follows their
discussion closely.
Algorithm 1.2 Iterative Radix-2 FFT pseudocode as presented in [24].
1: Y = Radix-2FFT(X, Y, n)
2: r = log n;
3: R = X;
4: for m = 0 to r − 1 do
5:   S = R;
6:   for i = 0 to n − 1 do
7:     // Let (b_0 b_1 . . . b_{r−1}) be the binary representation of i
8:     j = (b_0 . . . b_{m−1} 0 b_{m+1} . . . b_{r−1});
9:     k = (b_0 . . . b_{m−1} 1 b_{m+1} . . . b_{r−1});
10:    r = (b_m b_{m−1} . . . b_0 0 . . . 0);
11:    R_i = S_j + S_k ω_n^r;
12:  end for
13: end for
14: Y = R;
Algorithm 1.2 is the iterative version of Algorithm 1.1. Each iteration of the outer
loop (line 4) represents one level of the recursion, starting with the deepest level. At
each level of recursion, the output vector is updated by two entries of the given input
vector and a multiple of the factor ω, (lines 8 and 9 of Algorithm 1.1 and line 11 for
Algorithm 1.2). Algorithm 1.1 uses the input to the function at each level of recursion
to update the output vector; whereas, Algorithm 1.2 uses binary representations of the
index being modified.
The most relevant property to notice, with respect to the parallel algorithm, is the
pattern of interaction between different elements of the input vector. Figure 1.1 shows
which elements in the input vector, denoted x, are used in computing each element of
the output vector, denoted X, for a vector of length n = 16.
Fig. 1.1 Radix 2 element interaction pattern obtained from [18].
In order to solidify this notion and to clarify the meaning behind Figure 1.1,
consider the transformation of x(0). The elements of the initial input vector involved in
the transformation of x(0) are: x(0), x(8), followed by modified versions of x(4), x(2),
and x(1). Similarly, each element of the input vector in the diagram can be traced to
see the elements of the initial vector involved in each computation.
A final note about FFTs is the ordering of the output. When the algorithm is
run in place, such that it overwrites the array containing the initial data, the output is
in bit-reversed order. This can be seen in Figure 1.1. For another example, let n = 8,
and consider the computation x = F8x, where the vector x is overwritten. This yields
\[ x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} \longmapsto \begin{bmatrix} x_0 \\ x_4 \\ x_2 \\ x_6 \\ x_1 \\ x_5 \\ x_3 \\ x_7 \end{bmatrix}. \]
The indices are converted to binary, and the bit string is reversed before being converted
back into decimal. In the above example, consider the index one: (1)_{10} = (001)_2, and
flipping the bit string yields (100)_2 = (4)_{10}, so the element at index one moves to index four. This means that data migrates
to bit-reversed order when the FFT is done in place.
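
A small Python helper (illustrative only, not from the thesis code) that reproduces this bit-reversed ordering for a power-of-two length:

def bit_reversed_indices(n):
    # Return the bit-reversed index for each position 0..n-1 (n a power of two).
    bits = n.bit_length() - 1
    return [int(format(i, '0{}b'.format(bits))[::-1], 2) for i in range(n)]

print(bit_reversed_indices(8))   # [0, 4, 2, 6, 1, 5, 3, 7]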
1.4 Circulant Matrices
Circulant matrices are a subset of Toeplitz matrices which have the added prop-
erty that each row is a circular shift of the previous row. The matrix C is circulant if it
has the form
C =
c1 c2 c3 · · · cn
cn c1 c2 · · · cn−1
cn−1 cn c1 · · · cn−2
......
.... . .
...
c2 c3 c4 · · · c1
.
Matrices of this form can be uniquely represented by their first row and will be denoted
by C = circ(c0, c1, c2, · · · , cn−1).
A thorough treatment of circulant matrices is given in [13]. The important prop-
erty of circulant matrices that is used heavily throughout this thesis concerns the eigen-
values and eigenvectors of circulant matrices. Let v = [c_1 c_2 c_3 . . . c_n]^T be the column
vector constructed from the first row of a circulant matrix C. Then the eigenvalues of
C are given by
λ = Fv, (1.32)
where F is the unitary Fourier matrix [13]. That is, the discrete Fourier transform (DFT)
of the first row of C yields the eigenvalues of C. Further, the eigenvectors of a circulant
matrix C are given by the columns of the Fourier matrix of appropriate dimension. Thus,
C has eigenvalue decomposition
\[ C = F^{*} D F, \tag{1.33} \]
where F is again the Fourier matrix, and D is the diagonal matrix whose elements are
the eigenvalues of C, i.e., D = diag(λ). This means that every circulant matrix of the
same dimension has the same eigenvectors, and that the matrix C is given by
\[ C = F^{*} \operatorname{diag}(\lambda)\, F. \tag{1.34} \]

With this decomposition, a formulation for the inversion of C can easily be obtained.
The inverse of C is then given by

\[ C^{-1} = F\, \operatorname{diag}(\lambda)^{-1} F^{*}. \tag{1.35} \]

This formulation can then be used to solve a linear system. Consider the linear system

\[ Cx = b. \tag{1.36} \]

Left multiplication by C^{-1} yields

\[ x = C^{-1} b. \tag{1.37} \]

Now, substituting for the definition of C^{-1} given by (1.35) yields

\[ x = F\, \operatorname{diag}(\lambda)^{-1} F^{*} b. \tag{1.38} \]

Rearranging gives

\[ \operatorname{diag}(\lambda)\, F^{*} x = F^{*} b. \tag{1.39} \]
Let x̂ = F^* x and b̂ = F^* b; then (1.39) becomes

\[ \operatorname{diag}(\lambda)\, \hat{x} = \hat{b}, \tag{1.40} \]

whose solution is trivial. Therefore, the solution of a linear system equates to computing
three DFTs and a backsolve involving a diagonal matrix. The steps are:

1. Compute λ = Fv.
2. Compute b̂ = F^* b.
3. Solve diag(λ) x̂ = b̂.
4. Compute x = F x̂.

This formulation is advantageous because the most expensive operation needed is the
computation of the DFT, which, in its crudest form, is a matrix-vector multiplication,
and is thus O(n²). However, if permissible, the fast Fourier transform (FFT) can be
used in place of the DFT, and the computation becomes O(n log n).
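
The four steps above translate almost directly into a few lines of NumPy (a minimal sketch, not the thesis code). One caveat: scipy.linalg.circulant is parameterized by the first column rather than the first row, and NumPy's FFT uses the opposite sign convention, so the eigenvalues are obtained here as the FFT of the first column; as long as the forward/inverse pair is used consistently, the solve is unaffected.

import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # first column of C
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

C = circulant(c)                      # dense circulant matrix (built only for the check)
lam = np.fft.fft(c)                   # step 1: eigenvalues of C
b_hat = np.fft.fft(b)                 # step 2: transform the right-hand side
x_hat = b_hat / lam                   # step 3: diagonal backsolve
x = np.fft.ifft(x_hat)                # step 4: transform back

print(np.allclose(C @ x, b))          # True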
Chapter 2
Literature Review
Circulant matrices are a desirable structure in computation because of their re-
lation to the Fast Fourier transform (FFT). Therefore, many variations of circulant
matrices have appeared throughout the literature and in a wide variety of contexts.
These range from the solution of circulant tridiagonal and banded systems [32, 16, 15]
to effective preconditioners [25] and are able to exploit their computational relation to
the FFT.
We are concerned with the solution of linear systems involving block circulant
matrices and assume the blocks in the matrix themselves are dense and contain no
additional structure. The desirable properties extend to the block case as well; namely,
block circulant matrices are block diagonalizable by the block Fourier matrix. The
generalization to the block case, however, means that the inversion/solution formula
must be extended. We first note that every block circulant matrix (BCM) can be mapped
to an equivalent block matrix with circulant blocks (CBM). This can be accomplished by
multiplying by the appropriate permutation matrices. Therefore, algorithms for solving
BCMs and CBMs are equivalent.
Within engineering, when problems with periodicity properties are considered,
block circulant matrices arise in many contexts. These usually result when such periodic
problems are solved by means of integral equations, which includes the BEM. Using
the method of fundamental solutions [17], block circulant matrices in the contexts of
axisymmetric problems in potential theory [21], as well as axisymmetric harmonic and
biharmonic [38], linear elasticity [23, 22], and heat conduction problems [36] have been
investigated. In addition, scattering and radiation problems in electromagnetics have
taken advantage of block circulant matrices for a variety of integral equation techniques
[33, 30, 14, 20] including the BEM [40]. With respect to acoustics, a National Physical
Laboratory tech report discussed some properties of rotationally symmetric problems for
the BEM as applied to the Helmholtz equation [42].
Just as circulant matrices are a subset of Toeplitz matrices, block circulant matri-
ces are a subset of block Toeplitz matrices. Therefore, it is not surprising that one of the
first inversion algorithms applied to block circulant matrices was an inversion algorithm
for block Toeplitz matrices [2]. Closed form solutions for the inversion of block circulant
matrices were formalized in [27] and presented again more concisely in [41]. The se-
quential inversion formula shows that a BCM, A, has the decomposition A = F ∗bDFb, in
which Fb represents the block Fourier matrix, and D represents a block diagonal matrix.
The blocks along the diagonal are obtained by computing the block DFT of the first
block row of A; this means if v is defined to be the first block row of A, D = diagFbv.
The inversion is then given by A−1 = Fb (diagFbv)−1 F ∗b
, and only the blocks of the
block diagonal matrix are inverted. Extending the closed form inversion formulations, an
algorithm for solving a block circulant linear system was developed alongside many vari-
ants of circulant linear systems [10]. The solution of the linear system involving BCMs
resulted from a straightforward application of the inversion formula. Following these ef-
forts, [31] proposed an algorithm for the solution of CBMs. The most recent contribution
to CBMs was given in [39]. The algorithm first diagonalizes each block of the matrix by
the Fourier relation. The matrix is then a block matrix with diagonal blocks. The algo-
rithm decomposes the matrix into a two-by-two block matrix and successively performs
this decomposition to the first principal submatrix until a diagonal matrix is reached.
The diagonal matrix is inverted, and the Schur complement formulation for the inverse
of a two-by-two block matrix is successively used to compute the inversion of the entire
matrix. All inversion/solution formula of consequence use the spectral properties of the
circulant matrices. This is exploited in all aforementioned sequential inversion/solution
algorithms.
While sequential solution algorithms have been fully developed, little work has
been done on parallel algorithms for block circulant linear systems. A parallel solution
for block Toeplitz matrices exists, and parallelizes the generalized Schur algorithm [3].
Yet, using a Toeplitz solver neglects the use of the FFT and potential concurrent cal-
culations found in the BCM inversion formula. In fact, the only work we are aware of
is a parallel solver for electromagnetic problems which considers the axisymmetric case
[29]. The proposed parallel algorithm was for distributed memory systems and paral-
lelized the inversion formulation for BCMs. The assumptions of the work differ from
our own; that is, they assume a larger number of blocks of smaller order, and, in turn,
assumed that the number of processors was some fraction of the number of blocks in the
matrix. This means each processor contained multiple blocks, denoted q, of the BCM.
For each block owned by a processor, the corresponding right-hand side also resides on
that processor. This means that when solving the block diagonal matrix, each processor
could perform the solve of its q blocks simultaneously. However, when solving the linear
system, multiplications by the Fourier matrix are needed. These are needed in order
to: obtain the block diagonal matrix, modify the right-hand side vector, and modify the
solution vector. This distribution means that multiplying by the Fourier matrix requires
communication among the processors. Using the fact that block Fourier transforms can
be decomposed into independent Fourier transforms, it performs an all-to-all communi-
cation to give each processor the data needed to compute an independent FFT. They
tested the algorithm for BCMs with m = 256 blocks of order n = 318, m = 128 blocks
of order n = 189, and m = 64 blocks of order n = 93. This is where our assumptions
diverge significantly, and as a result our algorithm differs significantly in implementation
of the same inversion formula.
Chapter 3
Problem Formulation
Consider a rotationally symmetric vibrating structure, Ω ⊂ R³. The rotational
symmetry implies Ω can be constructed by rotations of a single element around a fixed
axis. Define Ω′ to be a structure in R³, and let Ω′_θ represent the structure obtained
by rotating Ω′ by angle θ. Then, supposing Ω has m rotational symmetries, Ω can be
written as Ω = Ω′_0 ∪ Ω′_{2π/m} ∪ Ω′_{4π/m} ∪ · · · ∪ Ω′_{(m−1)2π/m}; that is,

\[ \Omega = \bigcup_{k=0}^{m-1} \Omega'_{\frac{2\pi k}{m}}. \tag{3.1} \]

For example, for m = 4 the structure Ω can be written as

\[ \Omega = \Omega'_{0} \cup \Omega'_{\pi/2} \cup \Omega'_{\pi} \cup \Omega'_{3\pi/2}. \tag{3.2} \]
Note, the angle θ is relative to an initial orientation of the structure. This means that
the structure being rotated can have any initial orientation; as long as the rotation
is around a fixed axis and the rotation angle is uniform, the constructed structure is
rotationally symmetric. Figure 3.1 shows a real-world example of a structure containing
three rotational symmetries.
Fig. 3.1 A propeller with three times rotational symmetry [37].
3.1 Coefficient Matrix Derivation
Before beginning the algebraic derivation, we first present the underlying intu-
ition. Figure 3.2 shows a sketch of a propeller with four times rotational symmetry.
Consider the effect Ω′_0 has on Ω′_{π/2}, as well as the effect Ω′_{π/2} has on Ω′_π. Because the blades
are identical and dist(Ω′_0, Ω′_{π/2}) = dist(Ω′_{π/2}, Ω′_π), the entries in the coefficient matrix which
describe the effect of Ω′_0 on Ω′_{π/2} and Ω′_{π/2} on Ω′_π will be identical. This continues for the
remaining interactions of this form; therefore, the entries of the coefficient matrix due
to the effect of Ω′_0 on Ω′_{π/2}, Ω′_{π/2} on Ω′_π, Ω′_π on Ω′_{3π/2}, and Ω′_{3π/2} on Ω′_0 will be identical. This
same idea is used for all of the remaining interactions to finish populating the coefficient
matrix. The equality between interactions due to symmetry is what leads to the block
circulant structure of the coefficient matrix.
Fig. 3.2 A four times rotationally symmetric sketch of a propeller.
This decomposition of the initial structure in R3 into the union of rotated struc-
tures gives insight into the structure of the coefficient matrix. Recall, in the derivation
of the BEM, the solution over the boundary of the structure must first be solved in order
to obtain the solution in the exterior domain. Consider only the base element Ω′ = Ω′0
before any rotations. For clarity, we suppose m = 2 and use the standard boundary
integral formulations given by (1.4) and (1.5). The integral formulations which promise
uniqueness follow in the same manner. Assuming a Neumann boundary condition and
rearranging into knowns and unknowns, the equation over the boundary of Ω′_0 is given by

\[ \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_0) - 2\pi u(p) = \int_{\partial\Omega'_0} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_0), \qquad p \in \partial\Omega'_0. \tag{3.3} \]
Next, consider the solution of u over the boundary element ∂Ω′_{π/2}; that is, the
boundary surface obtained by rotating the base element Ω′_0 by 90 degrees. This yields
the following boundary integral formulation

\[ \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_{\pi/2}) - 2\pi u(p) = \int_{\partial\Omega'_{\pi/2}} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_{\pi/2}), \qquad p \in \partial\Omega'_{\pi/2}. \tag{3.4} \]
As stand-alone structures, Ω′_0 and Ω′_{π/2} are identical aside from their orientation. The
boundaries, ∂Ω′_0 and ∂Ω′_{π/2}, are unaffected by rotations and are therefore identical. Equations
(3.3) and (3.4) involve only points on the boundary and, therefore, assuming the
Neumann conditions are identical for both equations, equality holds. Note, by the
uniqueness, and the equality for identical right-hand sides, it follows that the left-hand sides
must be identical.
Intuitively, (3.3) shows the relation between a point p on ∂Ω′_0 and all the points
q on ∂Ω′_0. If a point p is chosen on ∂Ω′_0, all of the points on ∂Ω′_0 contribute to the value
of u at that point. In this sense, an N-body problem is being solved. Similarly, if a
point p is chosen on ∂Ω′_{π/2}, all of the points on ∂Ω′_{π/2} contribute to the value of u at that
point; however, ∂Ω′_0 and ∂Ω′_{π/2} are identical. Therefore, under identical boundary conditions,
the same N-body problem is being solved.
Now, consider the solution of u over the boundary of the structure obtained by
combining the two aforementioned structures, Ω′_0 and Ω′_{π/2}. The boundary is then given
by ∂Ω = ∂Ω′_0 ∪ ∂Ω′_{π/2} and the integral equation is

\[ \int_{\partial\Omega} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega), \qquad p \in \partial\Omega. \tag{3.5} \]
Using the rotational symmetries, equation (3.5) becomes
\[ \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_0) + \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_{\pi/2}) - 2\pi u(p) = \tag{3.6} \]
\[ \int_{\partial\Omega'_0} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_0) + \int_{\partial\Omega'_{\pi/2}} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_{\pi/2}), \qquad p \in \partial\Omega'_0 \cup \partial\Omega'_{\pi/2}. \]
Redefine v1(p) = u(p) for p ∈ ∂Ω′_0 and v2(p) = u(p) for p ∈ ∂Ω′_{π/2}. In addition, define

\[ \Gamma_0[v_1] = \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} v_1(q)\, d(\partial\Omega'_0), \qquad \Gamma_{\pi/2}[v_2] = \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} v_2(q)\, d(\partial\Omega'_{\pi/2}), \]
\[ \Sigma_0 = \int_{\partial\Omega'_0} \frac{\partial v_1(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_0), \qquad \text{and} \qquad \Sigma_{\pi/2} = \int_{\partial\Omega'_{\pi/2}} \frac{\partial v_2(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_{\pi/2}). \]

Note, the variables v1(p) and v2(p) are unknowns, and, therefore, Γ0[v1] and Γπ/2[v2] are
defined as operators; whereas, Σ0 and Σπ/2 are known quantities and are treated as known
values. Using the newly-defined quantities, (3.6) can be split into two simultaneous
equations over ∂Ω′_0 and ∂Ω′_{π/2}:

\[ \Gamma_0[v_1] + \Gamma_{\pi/2}[v_2] - 2\pi v_1(p) = \Sigma_0 + \Sigma_{\pi/2}, \qquad p \in \partial\Omega'_0, \tag{3.7} \]
\[ \Gamma_0[v_1] + \Gamma_{\pi/2}[v_2] - 2\pi v_2(p) = \Sigma_0 + \Sigma_{\pi/2}, \qquad p \in \partial\Omega'_{\pi/2}. \]
Upon appropriate discretization, (3.7) can be written as the following linear system

\[ \begin{bmatrix} \Gamma_0 - 2\pi I & \Gamma_{\pi/2} \\ \Gamma_0 & \Gamma_{\pi/2} - 2\pi I \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} \Sigma_0 + \Sigma_{\pi/2} \\ \Sigma_0 + \Sigma_{\pi/2} \end{bmatrix}, \tag{3.8} \]
where I is the identity matrix. Let A denote the coefficient matrix in (3.8) and consider
the entries (Γ0 − 2πI) and (Γπ/2 − 2πI). By the previous arguments in establishing the
equivalence of (3.3) and (3.4), it follows that

\[ (\Gamma_0 - 2\pi I) = (\Gamma_{\pi/2} - 2\pi I). \tag{3.9} \]

This is true even when the right-hand sides of (3.3) and (3.4) are not identical. With
this relation established, define A1 = (Γ0 − 2πI) = (Γπ/2 − 2πI). Similarly, consider the
entries Γ0 and Γπ/2. We would like to show Γ0 = Γπ/2. By definition,

\[ \Gamma_0[v_1] = \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} v_1(q)\, d(\partial\Omega'_0), \tag{3.10} \]
and upon discretization as described in Section 1.2, we obtain

\[ \Gamma_0[v_1] = \sum_{i=1}^{N} (v_1)_i \left( \int_{[\partial\Omega'_0]_i} \frac{\partial G(p,q)}{\partial n_q}\, d\left([\partial\Omega'_0]_i\right) \right). \tag{3.11} \]
The quantity Γ0[v1] becomes the product Γ0v1, in which v1 is the discretization of the
unknown v1(q), and Γ0 is a matrix of known quantities populated by integrating the
normal derivative of the Green’s function over the individual surface elements of ∂Ω′0.
In considering the discretization of Γπ/2[v2], we obtain

\[ \Gamma_{\pi/2}[v_2] = \sum_{i=1}^{N} (v_2)_i \left( \int_{[\partial\Omega'_{\pi/2}]_i} \frac{\partial G(p,q)}{\partial n_q}\, d\left([\partial\Omega'_{\pi/2}]_i\right) \right). \tag{3.12} \]
Again, the quantity Γπ/2[v2] becomes the product Γπ/2 v2, in which v2 is the discretization of
the unknown v2(q), and Γπ/2 is a matrix of known quantities populated by integrating the
normal derivative of the Green's function over the individual surface elements of ∂Ω′_{π/2}.
Assuming the discretizations of the boundaries are the same, because the boundaries ∂Ω′_0
and ∂Ω′_{π/2} are identical, the values populating Γ0 and Γπ/2 are identical, and thus Γ0 = Γπ/2.
Let A2 = Γ0 = Γπ/2; then, with the previously established definition A1 = (Γπ/2 − 2πI) =
(Γ0 − 2πI), the matrix, A, comprising the linear system (3.8) has the form

\[ A = \begin{bmatrix} A_1 & A_2 \\ A_2 & A_1 \end{bmatrix}, \tag{3.13} \]
which is a 2 × 2 block circulant matrix. In general, given m rotational symmetries, an
m×m block circulant matrix can be obtained.
3.2 Block Circulant Inversion
Let N = nm. The coefficient matrix A ∈ C^{N×N} arising from the BEM applied to
an acoustic radiation problem with a rotationally symmetric boundary surface has the
form

\[ A = \begin{bmatrix} A_1 & A_2 & \cdots & A_m \\ A_m & A_1 & \cdots & A_{m-1} \\ A_{m-1} & A_m & \cdots & A_{m-2} \\ \vdots & \vdots & \ddots & \vdots \\ A_2 & A_3 & \cdots & A_1 \end{bmatrix}, \tag{3.14} \]
where each A_j, j = 1, . . . , m, is contained in C^{n×n} and is dense. The matrix A is block
circulant and therefore can be represented by circular shifts of its first block row. The
circulant structure of A is contained in the m blocks forming the first block row of A.
Therefore, in order to perform block DFT operations, we need to scale the Fourier matrix
F ∈ C^{m×m} to the block Fourier matrix F_b ∈ C^{N×N}. The Fourier matrix F is defined as

\[ F = \frac{1}{\sqrt{m}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega_m^{1} & \omega_m^{2} & \cdots & \omega_m^{m-1} \\ 1 & \omega_m^{2} & \omega_m^{4} & \cdots & \omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_m^{m-1} & \omega_m^{2(m-1)} & \cdots & \omega_m^{(m-1)(m-1)} \end{bmatrix}, \tag{3.15} \]

where ω_m = e^{i2π/m}, i = √−1, and normalizing by 1/√m makes F unitary. Scaling each
element of F by the n × n identity matrix, I_n, produces the block Fourier matrix F_b.
This is equivalent to the Kronecker product F ⊗ In. After scaling, we have the block
Fourier matrix
\[ F_b = \frac{1}{\sqrt{m}} \begin{bmatrix} I_n & I_n & I_n & \cdots & I_n \\ I_n & I_n\omega_m^{1} & I_n\omega_m^{2} & \cdots & I_n\omega_m^{m-1} \\ I_n & I_n\omega_m^{2} & I_n\omega_m^{4} & \cdots & I_n\omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_n & I_n\omega_m^{m-1} & I_n\omega_m^{2(m-1)} & \cdots & I_n\omega_m^{(m-1)(m-1)} \end{bmatrix}. \tag{3.16} \]
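
For small cases, F_b can be formed directly from the Kronecker-product relation F_b = F ⊗ I_n (a NumPy illustration under the ω_m = e^{i2π/m} convention; in the parallel algorithm of Chapter 4, F_b is never assembled explicitly):

import numpy as np

def block_fourier_matrix(m, n):
    kk = np.arange(m).reshape(-1, 1)
    jj = np.arange(m).reshape(1, -1)
    F = np.exp(2j * np.pi * kk * jj / m) / np.sqrt(m)   # unitary m-by-m Fourier matrix (3.15)
    return np.kron(F, np.eye(n))                        # block Fourier matrix F_b = F ⊗ I_n (3.16)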
Next, the DFT relations needed for the inversion formula are established. Let X ∈ C^{N×n}
be the block column vector containing the first block row of A. The block DFT of X is
given by X̂ = F_b X; that is,

\[ \begin{bmatrix} \hat{A}_1 \\ \hat{A}_2 \\ \hat{A}_3 \\ \vdots \\ \hat{A}_m \end{bmatrix} = \frac{1}{\sqrt{m}} \begin{bmatrix} I_n & I_n & I_n & \cdots & I_n \\ I_n & I_n\omega_m^{1} & I_n\omega_m^{2} & \cdots & I_n\omega_m^{m-1} \\ I_n & I_n\omega_m^{2} & I_n\omega_m^{4} & \cdots & I_n\omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_n & I_n\omega_m^{m-1} & I_n\omega_m^{2(m-1)} & \cdots & I_n\omega_m^{(m-1)(m-1)} \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \\ A_3 \\ \vdots \\ A_m \end{bmatrix}, \tag{3.17} \]
which is nothing more than a DFT of length m with n×n matrices as coefficients in the
transform. Using the formulation of the inverse in [41], we have
\[ A^{-1} = F_b\, \operatorname{diag}\{\hat{A}_1^{-1}, \hat{A}_2^{-1}, \ldots, \hat{A}_m^{-1}\}\, F_b^{*}, \tag{3.18} \]

where diag{Â_1^{-1}, Â_2^{-1}, . . . , Â_m^{-1}} is a block diagonal matrix whose diagonal blocks
are precisely the inverses of the blocks obtained from the DFT of the first block row of
A. From the formula, we can derive the algorithm for the solution of a linear system.
Consider the system Ax = b; multiplying by A^{-1} yields

\[ x = A^{-1} b. \tag{3.19} \]

Substituting in the definition for A^{-1} from (3.18), we obtain

\[ x = F_b\, \operatorname{diag}\{\hat{A}_1^{-1}, \hat{A}_2^{-1}, \ldots, \hat{A}_m^{-1}\}\, F_b^{*}\, b. \tag{3.20} \]

Rearranging yields

\[ \operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\, F_b^{*} x = F_b^{*} b. \tag{3.21} \]

Let x̂ = F_b^* x and b̂ = F_b^* b. This yields

\[ \operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\, \hat{x} = \hat{b}. \tag{3.22} \]

Blocking the vectors x̂ and b̂ to match the block sizes of each Â_j, it is easy to see we
obtain m independent linear systems to solve:

\[ \hat{A}_j \hat{x}_j = \hat{b}_j, \qquad j = 1, \ldots, m. \tag{3.23} \]
The steps for solution of the linear system Ax = b are given by Algorithm 3.1. Each
multiplication by the matrix F_b or F_b^*
represents a block DFT or inverse DFT (IDFT)
operation, respectively. It is worth noting that the system solves in line 3 of the algorithm
are completely independent, and thus make the algorithm very amenable to parallel
implementation, as noted in [35].
Algorithm 3.1 Pseudocode for the sequential solution of a block circulant linear system.
1: Compute b̂ = F_b^* b;
2: Compute X̂ = F_b X;
3: Solve Â_j x̂_j = b̂_j, j = 1, . . . , m;
4: Compute x = F_b x̂;
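
A serial NumPy sketch of Algorithm 3.1 is given below (illustrative only, not the thesis implementation). It performs the block DFTs with NumPy's FFT along the block index, which absorbs the normalization of F_b as long as the forward/inverse pair is applied consistently, and it assumes the first block row is stored as an array blocks of shape (m, n, n).

import numpy as np

def solve_block_circulant(blocks, b):
    # Solve A x = b where A is block circulant with first block row
    # (blocks[0], ..., blocks[m-1]), each block n-by-n.
    m, n, _ = blocks.shape
    A_hat = m * np.fft.ifft(blocks, axis=0)          # block DFT of the first block row
    b_hat = np.fft.fft(b.reshape(m, n), axis=0)      # transformed right-hand side
    x_hat = np.stack([np.linalg.solve(A_hat[k], b_hat[k]) for k in range(m)])   # m independent solves (3.23)
    return np.fft.ifft(x_hat, axis=0).reshape(m * n) # transform back and flatten

# Check against a dense solve for a small random example:
# rng = np.random.default_rng(0)
# m, n = 4, 3
# blocks = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))
# A = np.block([[blocks[(j - i) % m] for j in range(m)] for i in range(m)])
# b = rng.standard_normal(m * n) + 1j * rng.standard_normal(m * n)
# assert np.allclose(A @ solve_block_circulant(blocks, b), b)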
3.3 Invertibility
Algorithm 3.1 requires the inversion of the blocks obtained from computing the
DFT of the first block row of A. Therefore, assumptions on the invertibility of these
blocks are required by the algorithm. The section will show that if the initial matrix A
is assumed to be nonsingular, then each diagonal block is also nonsingular.
In order to facilitate the proof, we first show that the block Fourier matrix given
in (3.16) is unitary.
Lemma 3.1. The block Fourier matrix, Fb, as defined in (3.16) is unitary.
Proof. Recall, the N × N block Fourier matrix Fb can be constructed as a Kronecker
product of the unitary m×m Fourier matrix F with the n×n identity matrix In. That
is,
Fb = F ⊗ In. (3.24)
By the properties of Kronecker products [13], we have (A ⊗ B)^* = A^* ⊗ B^*. Therefore,

\[ F_b^{*} = (F \otimes I_n)^{*} = F^{*} \otimes I_n^{*} = F^{*} \otimes I_n. \tag{3.25} \]

So F_b^* can be constructed in the same fashion. Now consider F_b^{-1}. By the Kronecker
product property (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, for square nonsingular A and B, we have

\[ F_b^{-1} = F^{-1} \otimes I_n^{-1}. \tag{3.26} \]

However, the Fourier matrix F is unitary, and thus

\[ F_b^{-1} = F^{*} \otimes I_n. \tag{3.27} \]

It has been established that F_b^* = F^* ⊗ I_n, and, therefore,

\[ F_b^{-1} = F_b^{*}. \tag{3.28} \]

Thus F_b is unitary.
Theorem 3.1. Given a nonsingular block circulant matrix A, the block diagonal matrix
diag{Â_1, Â_2, . . . , Â_m} is nonsingular, where the Â_j, j = 1, . . . , m, are the blocks obtained
by computing the block Fourier transform of the first block row of A.

Proof. Since A is block circulant, we have

\[ A = F_b^{*}\, \operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\, F_b. \tag{3.29} \]

Taking the determinant yields

\[ \det(A) = \det\left(F_b^{*}\, \operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\, F_b\right). \tag{3.30} \]

Using a property of determinants, we obtain

\[ \det(A) = \det(F_b^{*})\, \det\left(\operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\right)\, \det(F_b). \tag{3.31} \]

By Lemma 3.1, F_b is unitary, and, thus, det(F_b^*) det(F_b) = det(F_b^* F_b) = det(I) = 1; therefore,

\[ \det(A) = \det\left(\operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\right). \tag{3.32} \]

Using the relation for the determinant of block diagonal matrices, we have

\[ \det(A) = \det(\hat{A}_1)\det(\hat{A}_2)\cdots\det(\hat{A}_m). \tag{3.33} \]

Because A is nonsingular, det(A) ≠ 0, and, therefore, det(Â_j) ≠ 0 for j = 1, . . . , m. It
follows that each Â_j, j = 1, . . . , m, is nonsingular, and, therefore, diag{Â_1, Â_2, . . . , Â_m}
is nonsingular.
Chapter 4
Parallel Solution Algorithm
4.1 Block DFT Algorithm
While it is enticing to develop the algorithm around the Fast Fourier Transform
(FFT), the robustness of the algorithm will be lost. Recall that the length of the DFT is
determined by the number of symmetries of the boundary surface. For problems involving
real world structures, such as propellers or wind turbines, the number of symmetries will
be small, e.g., m ≤ 30. Indeed, even if a structure contained symmetries arising every
one degree, i.e., m = 360, there must be at least one surface element in the discretization
representing the symmetry, meaning n ≥ 360. This case is somewhat pathological, and,
in general, we assume each symmetry has a large number of surface elements. This
means it can be reasonably assumed that m ≪ n. In addition, FFTs make assumptions
on the properties of m. The most common assumption being that m is a power of two.
While there are now FFT algorithms for any value of m [11, 26], the algorithms applied
to feasible sizes of m have negligible benefits due to constants in the computation. We
therefore designed our algorithm to be robust in the sense that it will work for any
boundary surface input, and therefore we use a matrix multiplication DFT approach.
We derive the algorithm in the context of computing the block DFT of the first
block row of A, given by (3.17), as this computation is needed during the system solve.
Define P to be the number of processors and assume P = m. The initial data distribution
is obtained by assigning each submatrix Aj to processor Pj , for j = 1, . . . ,m. The initial
data distribution for P = m = 4 is illustrated in Figure 4.1.
Fig. 4.1 Initial data distribution assumed in the DFT computation for the case P = m =
4.
Expanding the DFT relation X̂ = F_b X, we obtain

\[ \begin{aligned} \hat{A}_1 &= A_1 + A_2 + A_3 + \cdots + A_m \\ \hat{A}_2 &= A_1 + A_2\omega_m^{1} + A_3\omega_m^{2} + \cdots + A_m\omega_m^{m-1} \\ \hat{A}_3 &= A_1 + A_2\omega_m^{2} + A_3\omega_m^{4} + \cdots + A_m\omega_m^{2(m-1)} \\ &\;\;\vdots \\ \hat{A}_m &= A_1 + A_2\omega_m^{m-1} + A_3\omega_m^{2(m-1)} + \cdots + A_m\omega_m^{(m-1)(m-1)}. \end{aligned} \tag{4.1} \]
Given this initial data distribution, in the computation of Â_1, processor P_1 already
contains a portion of the summation, namely A_1. In fact, in all of the Â_j computations,
each processor contains a scaled portion of the corresponding summation. In addition,
the scalar values ω_m^{(k−1)(j−1)}, for j, k = 1, . . . , m, are computable. This means that for
the cost of scaling a submatrix by a term in the Fourier matrix, we already have a portion
of the computation of each Â_j, j = 1, . . . , m. The algorithm expands on this idea to
compute the entire summation.
Starting from the initial data distribution, each processor computes the portion
of the summation that corresponds to the data owned. Then, each Pi cyclically sends its
submatrix to Pi−1 (P1 sends its data to Pm). Each processor computes the corresponding
term in the summation and propagates the submatrix. The computation completes after
m− 1 communications. Figure 4.2 illustrates this process for the case P = m = 4.
Fig. 4.2 The DFT computation for the case P = m = 4. Each arrow indicates the communication of a processor's owned submatrix to a neighboring processor in the direction of the arrow.
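
A bare-bones mpi4py sketch of this P = m cyclic scheme is shown below. It is an illustration under several assumptions, not the thesis implementation: one rank per block, a blocking sendrecv exchange instead of the overlapped asynchronous version described later, hypothetical local data, and the unnormalized summations of (4.1).

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, m = comm.Get_rank(), comm.Get_size()       # one rank per block, P = m

n = 4
A_local = np.full((n, n), rank + 1.0, dtype=complex)   # hypothetical local block A_{rank+1}
omega = np.exp(2j * np.pi / m)

block = A_local.copy()
A_hat = np.zeros((n, n), dtype=complex)          # accumulates the transformed block

for t in range(m):
    j = (rank + t) % m                           # original owner of the block currently held
    A_hat += block * omega ** (rank * j)         # add this block's term of the summation (4.1)
    if t < m - 1:
        # cyclically pass the block to the left neighbor, receive from the right
        block = comm.sendrecv(block, dest=(rank - 1) % m, source=(rank + 1) % m)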
The algorithm can be generalized to the case with P = cm, where c ∈ Z⁺, by
observing that a block DFT with a block size of n × n can be broken into n² independent
DFTs with block size 1. To see this, consider the kth summation taken from (4.1). We
then have

\[ \hat{A}_k = A_1 + A_2\omega_m^{(k-1)} + A_3\omega_m^{2(k-1)} + \cdots + A_m\omega_m^{(m-1)(k-1)}. \tag{4.2} \]
Recall that A_j ∈ C^{n×n} for each j = 1, . . . , m. For illustrative purposes, let n = 2. Then
(4.2) becomes

\[ \begin{bmatrix} \hat{a}^{k}_{11} & \hat{a}^{k}_{12} \\ \hat{a}^{k}_{21} & \hat{a}^{k}_{22} \end{bmatrix} = \begin{bmatrix} a^{1}_{11} & a^{1}_{12} \\ a^{1}_{21} & a^{1}_{22} \end{bmatrix} + \begin{bmatrix} a^{2}_{11} & a^{2}_{12} \\ a^{2}_{21} & a^{2}_{22} \end{bmatrix} \omega_m^{(k-1)} + \begin{bmatrix} a^{3}_{11} & a^{3}_{12} \\ a^{3}_{21} & a^{3}_{22} \end{bmatrix} \omega_m^{2(k-1)} + \cdots + \begin{bmatrix} a^{m}_{11} & a^{m}_{12} \\ a^{m}_{21} & a^{m}_{22} \end{bmatrix} \omega_m^{(m-1)(k-1)}, \tag{4.3} \]
where the superscript k indicates that â^k_{ij} is an element of Â_k. From here, the computation of the
elements of Â_k can be written as the following n² = 4 independent summations:

\[ \begin{aligned} \hat{a}^{k}_{11} &= a^{1}_{11} + a^{2}_{11}\omega_m^{(k-1)} + a^{3}_{11}\omega_m^{2(k-1)} + \cdots + a^{m}_{11}\omega_m^{(m-1)(k-1)} \\ \hat{a}^{k}_{12} &= a^{1}_{12} + a^{2}_{12}\omega_m^{(k-1)} + a^{3}_{12}\omega_m^{2(k-1)} + \cdots + a^{m}_{12}\omega_m^{(m-1)(k-1)} \\ \hat{a}^{k}_{21} &= a^{1}_{21} + a^{2}_{21}\omega_m^{(k-1)} + a^{3}_{21}\omega_m^{2(k-1)} + \cdots + a^{m}_{21}\omega_m^{(m-1)(k-1)} \\ \hat{a}^{k}_{22} &= a^{1}_{22} + a^{2}_{22}\omega_m^{(k-1)} + a^{3}_{22}\omega_m^{2(k-1)} + \cdots + a^{m}_{22}\omega_m^{(m-1)(k-1)}. \end{aligned} \tag{4.4} \]
The independence of each summation permits us, given a sufficient number of processors,
to perform these summations simultaneously. In a more general setting, this equates to
partitioning each Aj , j = 1, . . . ,m, into smaller block sizes, and then simultaneously
performing block DFTs of this smaller block size.
Now that it has been established that a block DFT can be broken down into block
DFTs of smaller block size, we explain how to exploit this in the P = cm case. Let c = 4,
i.e., P = 4m, and partition each A_j, j = 1, . . . , m, into c = 4 blocks of size n/√c × n/√c.
Note that the block size is arbitrary; therefore, if √c is not an integer, the submatrix is
simply split into c blocks with slightly different block size. The data decomposition can
be seen in Figure 4.3, where again the superscript k indicates that Â^k_{ij} is a block of Â_k.
Fig. 4.3 Parallel block DFT data decomposition for P > m.
We rewrite these as c = 4 independent block DFTs of block size n/√c × n/√c. We then
group the processors into c = 4 processor groups of size m. Grouping the processors, we
obtain four DFTs in the form presented when P = m. Figure 4.4 shows the processor
group organization. We then apply the P = m DFT algorithm within each processor
group simultaneously. Therefore, when P = cm, we can decompose each Aj into c
independent block DFTs of smaller block size. This decomposition can proceed all the
way down until each A_j is decomposed into n² independent DFTs of block size 1. In
this case, c = n², i.e., P = n²m, and n² one-dimensional DFTs are being performed
simultaneously.
Since the most expensive part of computing the blocked DFT is the communi-
cation of the submatrices, it is desirable to overlap communication and computation as
much as possible. With this in mind, we introduce asynchronous send/receives. Start-
ing from the P = m initial data distribution, begin by the asynchronous send of the
processor’s owned submatrix, followed by the asynchronous receive of the neighboring
processor’s submatrix.
Fig. 4.4 Parallel block DFT data decomposition and processor groupings for P > m.
While the processor’s current submatrix data is being sent, a
neighboring processor’s submatrix is being received. During this communication, the
data being sent is still able to be used because no modifications are being made. The
data being sent is then used to update the partial sum. Therefore, we are sending,
receiving, and computing the partial sum simultaneously.
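The following mpi4py sketch illustrates one way such an overlapped ring exchange could look; it is a simplified stand-in for the FORTRAN 90/MPI implementation, and the array sizes, tags, and the assumption P = m are purely illustrative.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()                                # assume P = m for this sketch
rank = comm.Get_rank()
n = 8                                              # block order (illustrative)
omega = np.exp(-2j * np.pi / P)

my_block = np.full((n, n), rank + 1, dtype=complex)    # stand-in for the owned submatrix
dest, src = (rank - 1) % P, (rank + 1) % P              # cyclic send to P_{i-1}, receive from P_{i+1}

recv_buf = np.empty((n, n), dtype=complex)
block, owner = my_block, rank                      # block currently held and its original index
partial = np.zeros((n, n), dtype=complex)          # running sum for the transformed block

for step in range(P):
    if step < P - 1:                               # post the exchange before computing
        req_send = comm.Isend(block, dest=dest, tag=step)
        req_recv = comm.Irecv(recv_buf, source=src, tag=step)
    partial += block * omega ** (owner * rank)     # use the held block while it is in flight
    if step < P - 1:
        MPI.Request.Waitall([req_send, req_recv])
        block, owner = recv_buf.copy(), (owner + 1) % P

# 'partial' now holds this processor's transformed block of the block DFT.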
There is a cost associated with the communication overlap. The cost is in the
amount of memory being used to enable this overlapped communication/computation.
Three times the amount of memory is now being used, the unmodified submatrix, neigh-
boring processor’s unmodified submatrix, and running partial sum for the transformed
submatrix. However, the amount of extra memory used can be managed by only com-
municating portions of a submatrix at a time. While theoretically it is best to minimize
the communication startups, in practice, for large volumes of data, it is beneficial to
send the data spread over a number of smaller packets. This blocking factor for optimal
communication times is system dependent, but it also gives a parameter which can be
modified when memory consumption becomes an issue.
Note that this algorithm is used for both the transformation by Fb and the transformation by F∗b. The only
difference is the Fourier matrix used. When referring to the parallel algorithm,
we differentiate the use of Fb and F∗b as the parallel DFT and IDFT, respectively.
4.2 Block FFT Algorithm
As mentioned in Section 4.1, the FFT is difficult to apply when considering an
arbitrary number of rotational symmetries, m, because of its restriction on the value of
m, i.e., power of two in the radix-2 algorithm. In certain cases however, when the FFT
is applicable, it can effectively be used. A relevant example concerns acoustic radiation
problems involving axisymmetric structures. These problems deal with structures ob-
tained by rotating a two-dimensional object around a third fixed, orthogonal axis. For
example, cylinders or spheres are types of axisymmetric structures. By considering the
structure of a propeller, or fan blade, it can be readily deduced that while all axisym-
metric structures are rotationally symmetric, not all rotationally symmetric structures
are axisymmetric; that is, axisymmetric structures are a subset of rotationally symmetric
structures. The advantage of axisymmetric structures comes from the ability to choose
the number of rotational symmetries in the discretization of the problem. Being able to
choose the value of m means that it can be chosen to exploit the FFT.
Section 4.1 began by detailing a DFT algorithm for the P = m case. It then ex-
tended the algorithm to the P = cm case by breaking the block DFT into c independent
block DFTs of smaller blocksize. The algorithm then constructs c processor groups, each
with m processors, around the decompositions. It then uses the P = m algorithm within
each processor group to simultaneously compute the block DFTs of smaller blocksize.
The FFT algorithm keeps the exact same framework as the DFT algorithm. The differ-
ence arises in how the P = m algorithm computes the DFT; in this case, a distributed
FFT algorithm is used.
In order to derive the parallel algorithm, consider the sequential FFT algorithm
given by Algorithm 1.2; the accompanying discussion in Section 1.3 concerned the pattern
of interaction between elements of the initial input vector in producing the transformed
vector. Indeed, this is the essence of the FFT. Figure 1.1 gave a visualization of the
interaction pattern; in addition, it also showed how the data migrated to a bit reversed
order. This is important. The parallel algorithm will distribute each element of the
input vector onto different processors, and these element interactions will become com-
munication patterns. The algorithm used to compute the distributed one-dimensional
FFT has been termed the binary exchange algorithm [24]. Only small modifications to
Algorithm 1.2 are needed to fit the parallel case.
As in Section 4.1, we present the algorithm in the context of computing the block
FFT of the first block row of A. Define P to be the number of processors and assume
P = m. The initial data distribution is obtained by assigning each submatrix Aj to
processor Pj , for j = 1, . . . ,m. The initial data distribution for P = m = 4 can again be
seen in Figure 4.1.
Now, consider Algorithm 4.1, which is the parallel FFT algorithm
resulting from simple modifications to Algorithm 1.2.
Algorithm 4.1 Distributed Radix-2 FFT pseudocode [24].
1: Y = Radix-2FFT(X, Y, n)
2: r = log n;
3: R = X;
4: for m = 0 to r − 1 do
5:   S = R;
6:   // Let (b0 b1 . . . br−1) be the binary representation of pid
7:   j = (b0 . . . bm−1 0 bm+1 . . . br−1);
8:   k = (b0 . . . bm−1 1 bm+1 . . . br−1);
9:   r = (bm bm−1 . . . b0 0 . . . 0);
10:  if pid == j then
11:    Send Apid to processor k
12:    Receive Ak from processor k
13:    Apid = Apid + Ak ω_n^r;
14:  else
15:    Receive Aj from processor j
16:    Send Apid to processor j
17:    Apid = Aj + Apid ω_n^r;
18:  end if
19: end for
20: Y = R;
The first difference to note is that the second loop in Algorithm 1.2 is no longer
needed. The iteration variable i served two purposes: identifying which element of the initial
vector to update, and determining the other elements involved in the computation. As
each processor only has one element, there is no question which element each processor
is responsible for updating. The second property remains intact because each Ai is
contained on the processor with pid = i, where pid is the processor id. During each
iteration of Algorithm 4.1, each processor needs one extra piece of data to perform the
update to the owned data. Each processor uses its processor id to compute which element
it needs to complete the current computation. By determining the element number, the
pid of the processor which owns the data is determined; this can then be used to set up
the communication to obtain the data. Figure 4.5 illustrates this process.
Fig. 4.5 Process illustrating the distributed FFT. Lines crossing to different processors
indicate communication from left to right. Note the output is in reverse bit-reversed
order relative to numbering starting at zero; that is, A1 is element 0; A2 is element 1,
etc.
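For illustration, the partner computation in lines 7–8 of Algorithm 4.1 reduces to flipping a single bit of the processor id; the helper below is a hypothetical Python restatement written only to make that point, not code taken from the implementation.

def partner(pid, stage, r):
    # Flip bit b_stage of pid, where pid is written with r bits as (b0 b1 ... b_{r-1}),
    # b0 being the most significant bit, as in Algorithm 4.1.
    return pid ^ (1 << (r - 1 - stage))

# Example with P = m = 8 (r = 3): processor 3 = (011) exchanges with
# 7 = (111) at stage 0, 1 = (001) at stage 1, and 2 = (010) at stage 2.
assert [partner(3, s, 3) for s in range(3)] == [7, 1, 2]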
The extension to the P = cm case is identical to the discussion in Section 4.1.
The block DFT can be decomposed into c independent block DFTs of smaller block size.
The processors then create c processor groupings and simultaneously perform the P = m
FFT algorithm. The advantage to computing using the distributed FFT in this way is
that the number of communications is minimized. The parallel DFT algorithm requires
O(m) communications; whereas, in using the FFT, the number of communications is
O(logm). Although we have assumed m to be quite small, in the P = m case each
communication requires that n^2 data elements be sent. This means the packet sizes are
quite large; therefore, any minimization to the number of communications is beneficial.
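A serial NumPy sketch of a decimation-in-frequency radix-2 combine over the m blocks is given below; it is a standard variant used here only to illustrate the log2(m) combine stages and the bit-reversed output, not a line-by-line transcription of Algorithm 4.1.

import numpy as np

def bit_reverse(k, bits):
    return int(format(k, "0{}b".format(bits))[::-1], 2)

m, n = 8, 3                                        # m must be a power of two
rng = np.random.default_rng(1)
A = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))

X = A.copy()
L = m
while L > 1:                                       # log2(m) combine stages
    half = L // 2
    w = np.exp(-2j * np.pi / L)
    for start in range(0, m, L):
        for i in range(half):
            u, v = X[start + i].copy(), X[start + i + half].copy()
            X[start + i] = u + v
            X[start + i + half] = (u - v) * w ** i
    L = half

bits = int(np.log2(m))
direct = np.array([sum(A[j] * np.exp(-2j * np.pi * j * k / m) for j in range(m)) for k in range(m)])
print(all(np.allclose(X[bit_reverse(k, bits)], direct[k]) for k in range(m)))   # True: output is in bit-reversed order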
4.3 System Solves
The goal is to solve all systems in line 3 of Algorithm 3.1 simultaneously. In
addition, the ScaLAPACK routine PZGESV is used to further parallelize each system
solve. By using ScaLAPACK, we are forced to work within the limits of its required
data distribution and processor organization. In particular, the matrix data must be
distributed in a block cyclic fashion, and the processors logically arranged in a grid
format [6]. Using these restrictions, the initial system is set up as follows. Assume
P = cm processors with c ∈ Z+. Now define m processor grids of size √c × √c and
denote them by Gi, i = 1, . . . ,m. If √c is not an integer, the processors are arranged
in a rectangular grid format such that the numbers of rows and columns are integers.
Figure 4.6 illustrates the grid creation process for P = 16 and m = 4.
Fig. 4.6 Processor grid creation for P=16 and m=4.
Next, each Aj and corresponding right-hand side bj are block cyclically distributed
over process grid Gj for j = 1, . . . ,m. We require that the block cyclic distribution be
performed using the same blocking factor for each Aj and bj . Each Gj is then in a
position where it can solve a system involving Aj and bj. However, before these system
solves can be performed, the left- and right-hand sides must be transformed by the DFT
and IDFT, respectively.
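As a rough illustration of this organization (the actual implementation builds BLACS process grids, so the mpi4py communicator splits and the value of m below are only illustrative), P = cm processors can be split into m grids of c processors for the system solves and, orthogonally, into c groups of m processors for the block DFT/IDFT.

from mpi4py import MPI

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
m = 4                                              # number of rotational symmetries (illustrative)
assert P % m == 0
c = P // m

grid_id = rank // c                                # which of the m grids G_1..G_m this processor joins
grid_comm = comm.Split(color=grid_id, key=rank)    # c processors per grid: used for the system solve

group_id = rank % c                                # which of the c DFT/IDFT groups this processor joins
group_comm = comm.Split(color=group_id, key=rank)  # m processors per group: one from each grid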
4.4 Parallel Algorithm
We have established an initial data distribution which can be used by ScaLAPACK
and an algorithm for computing the DFT. Working within this data distribution and
using the DFT or FFT algorithm we present the parallel algorithm.
Assume we have P = cm processors where c ∈ Z+. Define m processor grids
Gj , j = 1, . . . ,m, and block cyclically distribute each Aj and bj onto processor grid Gj
for j = 1, . . . ,m. The first step is to apply the IDFT to the right-hand side b. Each
bj is distributed onto its respective processor grid of c processors. Because each bj
was distributed over its corresponding processor grid using the same blocking factor,
the distribution process is identical to decomposing each bj into c smaller size blocks.
Therefore, we can create c processor groupings of size m where each processor group is
composed of one processor from each grid. That is, processor group 1 will be formed
by taking each Gj ’s first element; group 2 will be formed by taking each Gj ’s second
element, and this process continues until we have c processor groupings. These processor
groupings create c independent IDFTs of smaller blocksize which can use the DFT/FFT
algorithm. Therefore, the IDFT involving each bj has been decomposed into c IDFTs
of smaller size which can be done simultaneously. Using the DFT/FFT algorithm, we
perform the IDFT of bj, transforming each bj into b̂j. In the same way, we transform each
Aj to Âj. Now note that the data distribution has not changed and each Gj now has the
system Âj x̂j = b̂j, which are precisely the systems that need to be solved. Also note, if the
FFT algorithm is used, the data has migrated into a bit reversed order during the
transformations; however, both sides of the equation have migrated into a bit reversed
order and the correct systems are still obtained. More precisely, if we let rev(j) denote
the bit reversal of j, after the transformations of Aj and bj each system Âj x̂j = b̂j
resides on process grid Grev(j), for j = 1, . . . ,m. Each Gj calls the ScaLAPACK routine
PZGESV and solves its respective system. PZGESV overwrites b̂j with the solution x̂j.
Because the solution overwrites the entries of b̂j, the data distribution has not changed,
and we simply use the DFT/FFT algorithm again to transform each x̂j to xj. Thus
we have the solution of the original linear system. If the FFT algorithm was used, x̂j
would be in bit reversed order, that is, x̂j is contained in grid Grev(j) for j = 1, . . . ,m;
however, when transforming back to xj the bit reversed order is negated. Therefore, xj
is contained on grid Gj, j = 1, . . . ,m, and the solution vector is in the same form as if
the DFT algorithm had been used. Algorithm 4.2 shows the pseudocode for the parallel
algorithm as six concise steps.
Algorithm 4.2 Pseudocode for the parallel solution of a block circulant linear system, assuming P = cm.
1: Define m √c × √c process grids.
2: Block cyclically distribute each Aj and bj onto grid Gj in an identical fashion.
3: Perform c simultaneous IDFTs transforming bj to b̂j.
4: Perform c simultaneous DFTs transforming Aj to Âj.
5: Simultaneously solve each Âj x̂j = b̂j in parallel using PZGESV.
6: Perform c simultaneous DFTs transforming x̂j to xj.
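A serial NumPy sketch of the same six steps is given below for reference. It assembles a small block circulant system explicitly, applies one possible sign convention for the block transforms (the thesis' Fb/F∗b convention may differ by a conjugate), solves the m small systems, and checks the result against a direct solve; all sizes are illustrative.

import numpy as np

m, n = 4, 5
rng = np.random.default_rng(2)
blocks = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))   # first block row A_1..A_m
b = rng.standard_normal(m * n) + 1j * rng.standard_normal(m * n)

# Reference: assemble the full block circulant matrix, block (r, s) = A_{(s-r) mod m}, and solve directly.
A_full = np.block([[blocks[(s - r) % m] for s in range(m)] for r in range(m)])
x_ref = np.linalg.solve(A_full, b)

# Steps 3-6 of Algorithm 4.2, serially: transform, solve the m small systems, transform back.
D = m * np.fft.ifft(blocks, axis=0)                # transformed blocks of the first block row
b_hat = np.fft.fft(b.reshape(m, n), axis=0)        # transformed right-hand side, block by block
x_hat = np.stack([np.linalg.solve(D[k], b_hat[k]) for k in range(m)])
x = np.fft.ifft(x_hat, axis=0).reshape(-1)         # back-transform of the block solution

print(np.allclose(x, x_ref))                       # True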
Chapter 5
Theoretical Timing Analysis
In this chapter, the theoretical runtime analysis for the parallel implementations
discussed in Chapter 4 are developed. Algorithm 4.2 contains two core operations: par-
allel computation of the DFT and the parallel linear system solve. Therefore, the parallel
runtime, denoted TP (n,m), can be expressed as:
TP (n,m) = TFT (n,m) + TLS(n,m), (5.1)
where TFT (n,m) denotes the parallel runtime in computing the DFT, and TLS(n,m)
denotes the runtime of the parallel linear system solve. Chapter 4 presented two different
implementations of the DFT, and, therefore, two parallel runtimes will be developed.
Let A be a block circulant matrix with m blocks of order n, and let X contain
A’s first block row; that is,
$X = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{bmatrix}.$
Further, let b be a single column vector and the right-hand side of the linear system Ax = b.
5.1 Parallel Linear System Solve
The parallel linear system solves are performed by ScaLAPACK which conve-
niently provides the theoretical analysis of the implementation [6]. The term TLS(n,m)
is then given by
$T_{LS}(n,m) = \frac{2n^3}{3P}\,t_f + \frac{(3 + \frac{1}{4}\log_2 P)}{\sqrt{P}}\,n^2\,t_v + (6 + \log_2 P)\,t_m,$ (5.2)
where tf is the time per complex floating point operation, tm is the startup time for each
communication, and tv is the time per data item sent. In general, tm > tv; thus, the
number of communication startups should be minimized. Equation (5.2) can be broken
into three parts: the first term in the summation is the computation term; the second
term is the communication cost concerning the quantity of data items sent, and the last
term corresponds to the number of communication startups.
The variable P in (5.2) is used to denote all processors; however, in the general
case where P = cm, the parallel implementation contains m simultaneous system solves,
each with c = P/m processors devoted to the parallel system solves. Therefore, the term
P in (5.2) should be replaced by c, obtaining
$T_{LS}(n,m) = \frac{2n^3}{3c}\,t_f + \frac{(3 + \frac{1}{4}\log_2 c)}{\sqrt{c}}\,n^2\,t_v + (6 + \log_2 c)\,t_m.$ (5.3)
Note that (5.3) is the parallel runtime for all of the m linear system solves. Due to
the concurrency of the m linear system solves, solving m linear systems with P = cm
processors is equivalent to solving one linear system with c processors. This overlap in
parallelized operations is what makes the inversion formulation so amenable to parallel
solution.
5.2 Block DFT using the DFT Algorithm
In this section, the runtime analysis of Algorithm 4.2 is considered when the block
DFT algorithm (see Section 4.1) is used. There are three transformations which use the
DFT algorithm presented in Section 4.1: the transformation of Aj to Âj, bj to b̂j, and the
solution vector x̂j to xj, for j = 1, . . . ,m. Each of these transformations requires m − 1
communications. When transforming Aj to Âj, for j = 1, . . . ,m, each communication
involves messages of size n^2; similarly, the transformations of bj to b̂j and x̂j to xj, for
j = 1, . . . ,m, both involve messages of size n. Using this, the communication term in
the analysis, denoted To(n,m), can be constructed. Accounting for the communications
needed by these transformations, To(n,m) is given by
$T_o(n,m) = 3(m-1)\,t_m + (m-1)(n^2 + 2n)\,t_v,$ (5.4)
where, again, tm is the time to initialize a communication, and tv is the time per data
item sent.
The computational term in the analysis is relatively straightforward. During each
step of the algorithm, each processor multiplies the data it currently owns and adds it
to its running sum. When transforming Aj to Âj, for j = 1, . . . ,m, each processor scales
n^2 elements by a term in the Fourier matrix and adds it to the running sum; therefore,
we have n^2 m multiplications plus n^2 (m−1) additions in the transformation of Aj to Âj,
for j = 1, . . . ,m. Similarly, the transformations of bj to b̂j and x̂j to xj, for j = 1, . . . ,m,
both involve nm multiplications and n(m−1) additions. Combining the computational
and communication terms yields
$T_{DFT}(n,m) = (m-1)(n^2+2n)\,t_f + m(n^2+2n)\,t_f + 3(m-1)\,t_m + (m-1)(n^2+2n)\,t_v.$ (5.5)
The analysis can easily be extended to the P = cm case. Recall, the P = cm
DFT algorithm creates c DFTs of smaller blocksize and arranges c processor groups.
Using these processor groups, c simultaneous P = m DFTs of smaller blocksize are then
performed. While the same number of communication startups are still needed, the size
of the messages as well as the amount of computation are reduced by a factor of 1/c; therefore, by
dividing the appropriate terms in (5.5) by c, the P = cm case is obtained
$T_{DFT}(n,m) = \frac{(m-1)(n^2+2n) + m(n^2+2n)}{c}\,t_f + 3(m-1)\,t_m + \frac{(m-1)(n^2+2n)}{c}\,t_v.$ (5.6)
More compactly,
$T_{DFT}(n,m) = \frac{(2m-1)(n^2+2n)}{c}\,t_f + 3(m-1)\,t_m + \frac{(m-1)(n^2+2n)}{c}\,t_v.$ (5.7)
By combining (5.3) and (5.7), the parallel runtime for Algorithm 4.2, which is given by
(5.8), is obtained:
$T_{P1}(n,m) = \frac{(2m-1)(n^2+2n)}{c}\,t_f + 3(m-1)\,t_m + \frac{(m-1)(n^2+2n)}{c}\,t_v + \frac{2n^3}{3c}\,t_f + \frac{(3+\frac{1}{4}\log_2 c)}{\sqrt{c}}\,n^2\,t_v + (6+\log_2 c)\,t_m.$ (5.8)
By rearranging (5.8) and grouping computation and communication-specific constants,
the final parallel runtime using the DFT algorithm is given by
$T_{P1}(n,m) = \left[\frac{2n^3}{3c} + \frac{(2m-1)(n^2+2n)}{c}\right]t_f + \left[3(m-1) + (6+\log_2 c)\right]t_m + \left[\frac{(m-1)(n^2+2n)}{c} + \frac{(3+\frac{1}{4}\log_2 c)}{\sqrt{c}}\,n^2\right]t_v.$ (5.9)
5.3 Block DFT Using the FFT Algorithm
The FFT timing analysis follows directly from Section 5.2. Recall the main differ-
ence between the DFT algorithm and the FFT algorithm is the communication pattern.
Whereas the DFT required m − 1 communications, the FFT only requires log2m, for
m a power of two. Consider the DFT implementation’s communication term (5.4). By
substituting log2 m for the appropriate communication terms, (5.4) becomes
$T_o(n,m) = 3\log_2(m)\,t_m + \log_2(m)(n^2 + 2n)\,t_v$ (5.10)
when the FFT algorithm is used. In the FFT case, after each communication, each pro-
cessor scales a portion of its owned data by a term in the Fourier matrix. This modified
data is then added to the processor’s running sum; therefore, log2m communications
implies log2m multiplications and log2m additions are performed. This is reflected in
the computational term. Note, these are the only terms that change relative to the anal-
ysis involving the DFT algorithm. By proceeding in the same manner as Section 5.2, we
obtain
$T_{P2}(n,m) = \left[\frac{2n^3}{3c} + \frac{2\log_2(m)(n^2+2n)}{c}\right]t_f + \left[3\log_2 m + (6+\log_2 c)\right]t_m + \left[\frac{\log_2(m)(n^2+2n)}{c} + \frac{(3+\frac{1}{4}\log_2 c)}{\sqrt{c}}\,n^2\right]t_v$ (5.11)
for the final runtime of Algorithm 4.2 when using the FFT algorithm.
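For reference, the two runtime models (5.9) and (5.11) are straightforward to evaluate; the sketch below restates them as Python functions with made-up machine constants, only to make the dominant n^3/(3c) behaviour easy to inspect.

import math

def t_p1(n, m, c, tf, tm, tv):
    # Runtime model (5.9): DFT-based transforms plus the ScaLAPACK solve.
    comp = (2 * n ** 3 / (3 * c) + (2 * m - 1) * (n ** 2 + 2 * n) / c) * tf
    startup = (3 * (m - 1) + 6 + math.log2(c)) * tm
    volume = ((m - 1) * (n ** 2 + 2 * n) / c + (3 + 0.25 * math.log2(c)) / math.sqrt(c) * n ** 2) * tv
    return comp + startup + volume

def t_p2(n, m, c, tf, tm, tv):
    # Runtime model (5.11): FFT-based transforms plus the ScaLAPACK solve.
    comp = (2 * n ** 3 / (3 * c) + 2 * math.log2(m) * (n ** 2 + 2 * n) / c) * tf
    startup = (3 * math.log2(m) + 6 + math.log2(c)) * tm
    volume = (math.log2(m) * (n ** 2 + 2 * n) / c + (3 + 0.25 * math.log2(c)) / math.sqrt(c) * n ** 2) * tv
    return comp + startup + volume

# Illustrative (fictitious) machine constants: with n >> m the cubic term dominates both models.
print(t_p1(3000, 8, 4, 1e-9, 1e-5, 1e-8), t_p2(3000, 8, 4, 1e-9, 1e-5, 1e-8))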
5.4 Bounds
Constructing the parallel complexity analysis allows us to find the dominating
term in both parallel algorithms. Recall the assumptions in the development of the
parallel algorithms in Chapter 4, namely, n ≫ m; that is, the order of each block
in the coefficient matrix is large relative to the number of blocks. In general, it was
assumed m < 30. Looking at (5.9), it is clear that the first term, i.e., $\frac{2n^3}{3c}$, dominates
the computation. Similarly, by considering (5.11), it follows that both expressions have
the same dominating term. Therefore, we obtain
$T_{P1}(n,m) = O\!\left(\frac{n^3}{c}\right)$ (5.12)
and
$T_{P2}(n,m) = O\!\left(\frac{n^3}{c}\right).$ (5.13)
Recalling that N = nm, for N large, the term which dominates arises from the ScaLA-
PACK linear system solve. Therefore, under our assumptions, the most expensive part
of the computation is offloaded to the ScaLAPACK routine. This means that although
large packets of data must be communicated between processors in computing the DFT,
when N is large, the dominating term comes from the linear system solves. While this is
not extremely surprising, the implication is that the communication terms in the devel-
oped algorithms do not overwhelm the overall algorithm. As a result, the computational
portion of the linear system solves dominates. This is also the result reached in the
ScaLAPACK user guide [6] where only the parallel linear system solve is analyzed. This
is considered advantageous because the linear system solves are computed via ScaLA-
PACK, which is optimized for scalability.
Chapter 6
Numerical Experiments
All experiments were run using the Intel Nehalem processors of the Cyberstar
compute cluster [1] running at 2.66 GHz with 24 GB of RAM. We implemented the
parallel algorithm in FORTRAN 90 and used the ScaLAPACK and MPI libraries. A
blocking factor of 50 was used for the block cyclic distribution of each Aj and bj onto
their respective processor grids.
The FFT and DFT parallel algorithms differ in the communication routines used.
The DFT algorithm broke the communications into blocks of size 4000 which were sent
and received asynchronously using MPI’s ISEND/IRECV functions. In the case that a
processor does not contain 4000 elements, all of its data is sent in one communication.
The blocking of the communications also parsed each matrix columnwise to work within
FORTRAN’s column major data storage format. Whereas, the FFT algorithm did not
perform asynchronous sends/receives and used the standard BLACS routines for sending
2D blocks of data.
6.1 Experiment 1
First, we look at the runtime, speedup, and efficiency for a vibrating structure
with four times rotational symmetry for both the DFT algorithm and FFT algorithm.
In each case, the number of processors P and matrix size N are varied. The number of
processors is varied from 4 to 48 and N is varied from roughly 13, 000 to 24, 000.
6.2 Experiment 2
We look at the runtime, speedup, and efficiency for a vibrating structure with
eight times rotational symmetry for both the DFT algorithm and FFT algorithm. In
each case, the number of processors P and matrix size N are varied. The number of
processors is varied from 8 to 48 and N is varied from roughly 13, 000 to 24, 000.
6.3 Numerical Results
6.3.1 Experiment 1
First, consider the algorithm’s behavior when a structure with four times rota-
tional symmetry, m = 4, is examined using the DFT algorithm as well as the FFT
algorithm. Figure 6.1 shows the runtimes when using the DFT algorithm; a sharp de-
cline in runtime can be seen as the number of processors increase for various N . The
runtimes using the FFT implementation are given in Figure 6.2 showing similar trends
and runtimes as their DFT counterpart. The runtime improvements are also apparent
when looking at the speedup, which are given in Figures 6.3 and 6.4. The oscillations
in the speedups can be explained by looking more closely at the values of the runtimes.
Figures 6.1 and 6.2 show that the wall clock times are quite low, and small benign vari-
ances in the runtime for large P cause large oscillations in the speedup. This is why the
oscillations are flushed out for larger problems. Therefore as N increases, the oscillations
are dampened, and the speedups become more linear.
Fig. 6.1 Runtime comparison using the DFT algorithm for varying P and N with m = 4.
Fig. 6.2 Runtime comparison using the FFT algorithm for varying P and N with m = 4.
Fig. 6.3 Speedups using the DFT algorithm for varying P and N with m = 4.
Fig. 6.4 Speedups using the FFT algorithm for varying P and N with m = 4.
The most important category in parallel algorithm analysis is probably efficiency.
Efficiency is a measure of useful work done by a parallel algorithm and gives insight
into how much time the algorithm spends waiting on communication. Ideally, we would
like the efficiency to be as close to 1 as possible, which means all of the work is use-
ful. However, we are restricted by the underlying parallelism of the computation being
performed. Here we look at the behavior of the efficiency for varying P and N . Fig-
ures 6.5 and 6.6 show the efficiency as a function of problem size for the DFT and FFT
implementations, respectively. For nearly all processor numbers, excluding P = 4 which
simply remains efficient, the algorithm becomes more efficient as the problem size in-
creases. This tells us that as the problem size increases, the amount of time spent doing
useful work increases. Because N = nm, for m fixed, an increase in problem size directly
correlates to an increase in n. Recall the discussion in Section 5.4; for fixed m such that
n m, the dominating term comes from the computational portion of the ScaLAPACK
linear system solve. This fact is seen in Figures 6.5 and 6.6; as a function of problem
size, the amount of time spent computing grows faster than the amount of time spent
communicating.
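For reference, the quantities plotted here follow the standard definitions (restated, since they are not repeated in this chapter):
$S_P = \frac{T_1}{T_P}, \qquad E_P = \frac{S_P}{P} = \frac{T_1}{P\,T_P},$
where $T_1$ is the single-processor runtime, $T_P$ is the runtime on $P$ processors, $S_P$ is the speedup, and $E_P$ is the efficiency.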
Notice that, generally, the larger the number of processors, the lower the efficiency.
Although the DFT computations are ideally parallel and are able to overlap communi-
cations resulting from additional processors, the linear system solve computations are
not. More processors imply that the communication term in the linear system solve
will contribute more to the overall runtime; however, as the size of the linear system
grows, the computational portion of the ScaLAPACK solve begins to dominate. That
is, more processors means the efficiency will be lower for problems of the same size, but
as the computational term of the linear system solve begins to dominate, the efficiency
increases. Therefore, even though the efficiencies for different processors are decreasing
with P in Figure 6.5, it is only because the data point, i.e., the value of N , is fixed.
Fig. 6.5 Efficiency using the DFT algorithm for varying N and P with m = 4.
An interesting remark with respect to the two different implementations is to
note the similarity in their performance. Recall, the main difference in the algorithms
is the number of communications needed when computing the DFT; that is, the DFT
algorithm, i.e., matrix multiplication, or the FFT algorithm. For the case m = 4, the
number of communications is negligible; however, m = 4 also means the linear systems
are larger. Therefore, the computational term in the ScaLAPACK linear system solve
will dominate the computation more, making the algorithms behave in a similar fashion.
This was also alluded to in the theoretical analysis given in Chapter 5.
Fig. 6.6 Efficiency using the FFT algorithm for varying N and P with m = 4.
6.3.2 Experiment 2
Now, consider the performance of the algorithm when the number of rotational
symmetries m = 8. Figures 6.7 and 6.8 show the runtime analysis for the DFT and FFT
implementations. The trend in the runtimes is similar to that of the four times rotational
symmetry case. The main difference is the runtime values. Consider the largest case,
N = 24,000; Figure 6.1 shows the runtime for P = 8 is roughly 38 seconds, whereas
Figure 6.7 shows that for the same values of P and N , the computation time is only 12
seconds. As m increases, the size of the linear system decreases. This means that the
most expensive part of the computation, which is the linear system solve, decreases with
m, and results in a faster overall runtime. Even though the number of communications
in the DFT/FFT algorithms grows with m, the messages per communication are smaller.
Fig. 6.7 Runtime comparison using the DFT algorithm for varying P and N with m = 8.
Fig. 6.8 Runtime comparison using the FFT algorithm for varying P and N with m = 8.
Figure 6.9 shows the speedup for the DFT algorithm in the eight times rotational
symmetry case. What is interesting is that for smaller problems the speedup begins
leveling off past a certain point. The DFT and FFT algorithms have no increasing
dependence on P in their communication terms, and, therefore, this must be due to
the size of the linear systems. This shows that for a fixed problem size, the advantage
of additional processors becomes negligible after a certain point due to the ratio of
computation to communication in the linear system solve. What is important is that
until this point of leveling off, the speedup increases nearly linearly. This means that the
extra communications in the DFT/FFT algorithm, which are due to the increase in m, do
not overwhelm the algorithm. Indeed, by considering the larger values of N , Figure 6.9
shows that the speedups are nearly linear for larger problem sizes. The speedup in the
case of the FFT algorithm is given in Figure 6.10. The trend for the smaller values of
N appears to extend further than in Figure 6.9 and is most likely due to savings of the
FFT.
Lastly, efficiency is considered and is shown in Figures 6.11 and 6.12 for varying
values of P and N . Again, it is found that the efficiency increases for increasing problem
size. It can be seen that the overall value of the efficiency is slightly less than the
m = 4 case; this is due to two things: the increase in communications due to the
DFT/FFT algorithms, and the size of the linear system solve. However, because m is
fixed, the number of communications by the DFT/FFT algorithm will not grow with
N , even though the message size will grow. Therefore, as the problem size increases,
the computational term of the linear system solve will again begin to dominate, and the
efficiency can be expected to increase.
Fig. 6.9 Speedup comparison using the DFT algorithm for varying P and N when m = 8.
Fig. 6.10 Speedup comparison using the FFT algorithm for varying P and N when m = 8.
Fig. 6.11 Efficiency comparison using the DFT algorithm for varying P and N when m = 8.
The effect of using the FFT over the DFT can be seen by comparing the first data
point N = 13, 000 of Figures 6.11 and 6.12. The efficiencies for the FFT algorithm are
higher at this data point. In this instance, the linear systems are still relatively small, and
the computational term of the ScaLAPACK solve does not yet dominate. This is because
for the given value of N = 13, 000, the communications of the DFT transformations
contribute more. Because the FFT implementation uses fewer communications, the
efficiencies are higher for smaller N .
As in the four times rotational symmetry case, we find that the DFT and FFT
implementations perform similarly. The observed benefits for using the FFT appeared
at the lower bound of our experimental values. The FFT algorithm showed a higher
efficiency when N was small. In this instance, the linear system solves did not yet
dominate, and, therefore, the communications contribute more.
Fig. 6.12 Efficiency comparison using the FFT algorithm for varying P and N when m = 8.
However, in both the m = 4 and m = 8 cases, as N increases, the algorithms exhibit similar performance. In
our case, the assumptions rely on the solution of larger linear systems. In the case of
smaller linear systems and larger m, the FFT algorithm could be expected to produce
better performance results.
Chapter 7
Conclusions
We have proposed a parallel algorithm for the solution of block circulant linear
systems arising from acoustic radiation problems with rotationally symmetric boundary
surfaces. A derivation of the linear system was given along with conditions for application
of the algorithm. The algorithm takes advantage of the ScaLAPACK library and exploits
the embarrassingly parallel nature of block DFTs within ScaLAPACK’s required data
distributions. In addition, by exploiting the block circulant structure of the matrix
in the context of the parallel algorithm, the memory requirements are reduced. The
reduction in the memory requirements allows for the solution of larger block circulant
linear systems. Because the size of the matrix directly correlates with the number of
surface elements in the discretization, problems which require a finer discretization, i.e.,
higher frequency problems, can be explored. In addition, problems with larger overall
structures can be investigated.
The behavior of the DFT and FFT algorithms was similar for large N . The exper-
imental results show near linear speedup for varying problem sizes and that the speedups
become more linear for increasingly large N . We also showed that the efficiency of the
algorithm increases as a function of problem size. The theoretical analysis coupled with
the experimental results showed that in both cases the algorithm becomes dominated by
the ScaLAPACK linear system solve portion of the algorithm. Given the requirements
of the problem, i.e., n ≫ m with m ≤ 30, it is found that for larger problems, the
difference in the two algorithms is negligible. It has also been established that the block
DFT transformations can be performed within the ScaLAPACK data distribution, and
that the necessary communications for the DFT transformations do not overwhelm the
algorithm’s runtime.
In addition, because we developed an algorithm using a matrix multiplication
DFT approach, it can be applied to any rotationally symmetric structure. The parallel
algorithm therefore permits the efficient computation of larger acoustic radiation prob-
lems with rotationally symmetric boundary surfaces. While small gains exist by choosing
the FFT algorithm over the developed DFT algorithm, these gains are negligible given
our assumptions on N and m. The FFT also places additional requirements on the
values of m, i.e., m is a power of 2. Indeed, for the assumption m ≤ 30, there are only
four viable values of m, namely, 2, 4, 8, and 16. Nevertheless, small gains do exist, and,
therefore, one avenue for further investigation is the development of a robust algorithm
which uses FFTs within the context of using ScaLAPACK for the linear systems. If an
elegant domain decomposition can be devised, and if a robust FFT algorithm, such as
Bluestein’s FFT algorithm [7, 8], can be fitted to the problem, the algorithm could be
further improved.
Appendix
BEM Code
The modified code, in its most general form, has four different cases:
1. Sequential with no rotational symmetries.
2. Sequential with rotational symmetries.
3. Parallel with no rotational symmetries.
4. Parallel with rotational symmetries.
Therefore, in the main program, logic exists to direct the program flow through one of
the four cases given above. There are five core functions which have been modified to
support the cases given above. These are:
1. STATIC MULTIPOLE ARRAYS
2. COEFF MATRIX
3. SOURCE AMPLITUDES MODES
4. SOURCE POWER
5. MODAL RESISTANCE
Before describing each function individually, we first define some frequently used ter-
minology. When using the term “distributed data structure”, we are referring to each
processor containing a portion of a global data structure. For example, assume we are
given a matrix A and we have P processors. Instead of one processor containing all of
the matrix A, the elements are split up, and each processor has a data structure which
contains these portions of the matrix. We refer to the collection of these data structures
as a “distributed data structure” and denote it as sub[A]. This is because when all
processors combine their corresponding sub[A], we obtain the global data structure A.
A.1 STATIC MULTIPOLE ARRAYS
This function uses multipole expansions to approximate values which will end up
populating the coefficient matrix. It attempts to speed up future runs by storing the
approximated values in a file. The function initially checks for the existence of the file. If
it is not there, the function proceeds in computing the approximations and creating the
file. If, however, a file containing the approximations exists, the function immediately
returns, performing no computations.
A.1.1 Sequential
A.1.1.1 General Case
In the sequential case, the BEM code does not change with respect to the original
code. The function generates the approximations and writes the data to a file or returns.
The pseudocode for this case is given by Algorithm A.1.
Algorithm A.1 Pseudocode for the STATIC MULTIPOLE ARRAYS general sequential case.
if (Multipole data file exists) then
  return;
else
  Compute multipole expansion approximations;
  Write multipole expansion data to file;
end if
A.1.1.2 Rotationally Symmetric
Rotational symmetry plays no role in the sequential computation, and the function
performs as it does in the general case (see Section A.1.1.1 and Algorithm A.1).
A.1.2 Parallel
A.1.2.1 General Case
The parallel code behaves differently than the sequential code. A distributed data
structure, called sub[U ], is created to store the data in a distributed setting. This data
structure is a three-dimensional array. The first two dimensions vary with respect to the
total number of acoustic elements in the BEM computation. The third dimension has a
fixed value of 5, which corresponds to the number of terms in the multipole expansion.
Each processor then, simultaneously, populates its data structure. When all processors
have populated the corresponding sub[U ] data structure, the function returns. The
main difference in the computation is that no file is generated in the parallel case.
The multipole expansion data is instead held in memory distributed over the available
processors. The pseudocode is given in Algorithm A.2.
Algorithm A.2 Pseudocode for the STATIC MULTIPOLE ARRAYS general parallel case. Define sub[U]n and sub[U]m to be the number of rows and columns of the processor’s owned sub[U] data structure, respectively.
for i = 1 to sub[U]n do
  for j = 1 to sub[U]m do
    Compute multipole expansion approximation;
    Assign sub[U](i, j);
  end for
end for
A.1.2.2 Rotationally Symmetric
The multipole file is written out in this case. Due to the way the computa-
tion proceeds in the generation of the coefficient matrix (see Section A.2), the parallel
rotationally symmetric case behaves in the same fashion as the sequential case (see Al-
gorithm A.1). That is, a file containing the multipole approximations is written out if no
such file already exists, or the function returns. Because the generation of the multipole
expansion data is not time consuming, the benefits for computing the multipole expan-
sions in parallel are lost in the communication back to a single processor for the writing
of the file. Therefore, one processor computes the multipole expansion data and writes
the data out to a file. All other processors wait for the processor performing the cal-
culations and file creation to finish. Once the working processor finishes, the remaining
processors continue with the computation.
A.2 COEFF MATRIX
The COEFF MATRIX routine populates the coefficient matrix A to be used in
the computation of Ax = b. It now uses the multipole data which was computed in the
STATIC MULTIPOLE ARRAYS routine.
A.2.1 Sequential
A.2.1.1 General Case
This routine is the same as the original; it loops through each entry of the ma-
trix reading one row of the multipole data at a time and populates the matrix. Note,
the multipole data is only used if the distance between the points on the surface is
sufficiently large; however, even when the multipole data is not used, the file is still read.
Algorithm A.3 Pseudocode for the COEFF MATRIX general sequential case. Define N to be the number of rows and columns of A.
for i = 1 to N do
  Read row i of multipole data in from file;
  for j = 1 to N do
    if dist(pi, pj) > threshold then
      Compute using multipole data;
    else
      Compute without multipole data;
    end if
    Assign A(i, j);
  end for
end for
A.2.1.2 Rotationally Symmetric
The coefficient matrix for this case is block circulant. As noted previously, block
circulant matrices can be uniquely represented by their first block row. Therefore, only
the first block row of the matrix is generated by this routine. It proceeds in the same
manner as the general sequential version; however, it does not fill in the matrix beyond
the first block row.
Algorithm A.4 Pseudocode for the COEFF MATRIX rotationally symmetric sequential case. Define N to be the number of rows and columns of A. In addition, define m to be the number of symmetries.
for i = 1 to N/m do
  Read row i of multipole data from file;
  for j = 1 to N do
    if dist(pi, pj) > threshold then
      Compute using multipole data;
    else
      Compute without multipole data;
    end if
    Assign A(i, j);
  end for
end for
A.2.2 Parallel
A.2.2.1 General Case
In this case, each processor contains a distributed data structure containing the
global matrix A, and is denoted by sub[A]. Each processor’s data structure contains only
a portion of the data contained in the entire coefficient matrix A. Each processor then
populates its data structure, sub[A], simultaneously. The simultaneous population of the
matrix is due to the sub[U ] data structure populated in STATIC MULTIPOLE ARRAYS.
Without this distributed data structure, the file containing the multipole data would have
to be opened and read sequentially.
A.2.2.2 Rotationally Symmetric
Again, the coefficient matrix can be uniquely defined by its first block row. The
first block row will contain m blocks each of order n. In this case, the number of
processors, defined by P , is assumed to be some multiple of m. That is, P = cm for
c ∈ Z+. From here, m processor grids are defined; these are denoted by Gi for i = 1, . . . ,m.
Algorithm A.5 Pseudocode for the COEFF MATRIX general parallel case. Define sub[A]N and sub[A]M to be the number of rows and columns of sub[A], respectively.
for i = 1 to sub[A]N do
  for j = 1 to sub[A]M do
    if dist(pi, pj) > threshold then
      Compute using multipole data in sub[U];
    else
      Compute without multipole data;
    end if
    Assign sub[A](i, j);
  end for
end for
Each Gi contains c processors and is of dimension √c × √c. If √c is not an integer, the
closest rectangular grid is formed. In addition to defining the grids, each processor
defines variables called pId and gId. The variable pId is the processor number, and
gId identifies which of the m processor grids a processor is a part of; their existence
is acknowledged only because they are used in the coefficient matrix generation. At
this point, m processor grids have been defined. In addition, the first block row of
A is composed of m blocks of order n. Therefore, each of the m blocks in the first
block row of A will be distributed onto a corresponding processor grid. That is, block
Ai is distributed over Gi for i = 1, . . . ,m. In order to distribute Ai onto grid Gi for
i = 1, . . . ,m, the processors belonging to grid Gi must define a distributed data structure
for the corresponding Ai. The distributed data structure is denoted by sub[Ai]. Note,
only the processors which are a part of Gi contain the data structure sub[Ai]. That
is, processors belonging to G1 use the distributed data structure sub[A1]; processors
belonging to G2 use the distributed data structure sub[A2], and so on and so forth.
The grids are populated simultaneously with respect to one another, but the distributed
data structures, sub[Ai], i = 1, . . . ,m, are not populated in parallel within a grid. This is due to the file containing the multipole
approximations. The file must be read sequentially, and only one row is read at a time.
Therefore, the loops are of length n, which is the order of each Ai, i = 1, . . . ,m. The
computation proceeds as follows: for element Ai(j, k), j, k = 1, . . . , n, the function
BCCMPT L INDX is called and returns the processor whose local data structure, sub[Ai],
contains element Ai(j, k). In addition, the function returns the index into the local data
structure, denoted by (lj , lk). Therefore, in iteration (j, k), sub[Ai](lj , lk) is populated,
and this happens simultaneously for each grid. Algorithm A.6 shows the pseudocode for
the routine.
Algorithm A.6 Pseudocode for the COEFF MATRIX rotationally symmetric parallel case. The variable aP denotes the processor whose data structure will be assigned in a given iteration. Note, the variable kG allows the grids to be populated in parallel.
Define processor grids;
for j = 1 to n do
  Read row of multipole data in from file;
  for k = 1 to n do
    kG = n ∗ gId + k;
    aP = processor containing AgId(j, kG);
    Compute index (lj , lkG) into sub[AgId] using global index (j, kG);
    if pId == aP then
      Assign sub[AgId](lj , lkG);
    end if
  end for
end for
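The essence of BCCMPT L INDX is a block-cyclic global-to-local index map. The following Python sketch is a hypothetical analogue written only for illustration (it follows the usual ScaLAPACK-style block-cyclic convention, applied independently to the row and column indices of a 2D distribution).

def owner_and_local(g, nb, nprocs, src=0):
    # Map a 0-based global index g to (owning process, 0-based local index)
    # for a 1D block-cyclic distribution with blocking factor nb.
    block = g // nb                                # global block number
    proc = (block + src) % nprocs                  # process owning that block
    local = (block // nprocs) * nb + g % nb        # local index within that process
    return proc, local

# Example: 10 indices, blocking factor 2, 3 processes; blocks are dealt out round-robin.
print([owner_and_local(g, nb=2, nprocs=3) for g in range(10)])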
A.3 SOURCE AMPLITUDES MODES
A.3.1 Sequential
A.3.1.1 General Case
The general sequential case makes no changes to the original routine. The right-
hand side, b, is populated by a double for loop. It should be noted that, in general, there
will be multiple right-hand sides. That is, b will not be a single column vector. Following
this, the system Ax = b is solved using the LAPACK routine ZGESV. Algorithm A.7
gives the pseudocode.
Algorithm A.7 Pseudocode for the SOURCE AMPLITUDES MODES general sequential case. Let N and rhsn denote the number of rows and columns of b, respectively.
for i = 1 to N do
  for j = 1 to rhsn do
    Assign b(i, j);
  end for
end for
Solve Ax = b using LAPACK routine ZGESV;
A.3.1.2 Rotationally Symmetric
In the initial section of the routine, the right-hand side, b, is populated from a
simple double for loop. Following this, the solve of the system Ax = b is performed. The
rotationally symmetric system solve has been discussed in detail in Section 3.2; therefore,
the discussion in this section will be somewhat terse. The inversion formula for a block
circulant matrix A is given by
$A^{-1} = F_b^*\,\mathrm{diag}\!\left[(\hat{A}_1)^{-1}, (\hat{A}_2)^{-1}, \ldots, (\hat{A}_m)^{-1}\right] F_b,$ (A.1)
where Fb is the block Fourier matrix, and diag[(Â1)−1, (Â2)−1, . . . , (Âm)−1] is a block
diagonal matrix. In the context of solving the linear system Ax = b, we obtain
$\mathrm{diag}\!\left[\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\right] F_b^*\,x = F_b^*\,b.$ (A.2)
Now, let X̂ be a block column vector constructed from the elements of the block diagonal
matrix in (A.2). That is,
$\hat{X} = \begin{bmatrix} \hat{A}_1 \\ \hat{A}_2 \\ \hat{A}_3 \\ \vdots \\ \hat{A}_m \end{bmatrix}.$ (A.3)
In addition, let X be the column vector of the first block row of A. We then have the
relation X̂ = FbX; therefore, the elements of the block diagonal matrix are precisely the
values obtained from computing the block DFT of the first block row of A. With these
relations, the solve is computed by the following steps:
1. Compute b̂ = F∗b b.
2. Compute X̂ = FbX.
3. Solve Âj x̂j = b̂j, j = 1, . . . ,m.
4. Compute x = Fb x̂.
The pseudocode for this case of the SOURCE AMPLITUDES MODES routine is given by
Algorithm A.8.
Algorithm A.8 Pseudocode for the SOURCE AMPLITUDES MODES rotationally symmetric sequential case. Let N and rhsn denote the number of rows and columns of b, respectively. In addition, let m be the number of blocks in the first block row of A.
for i = 1 to N do
  for j = 1 to rhsn do
    Assign b(i, j);
  end for
end for
Compute the inverse DFT of the right-hand side by b̂ = F∗b b;
Compute the elements of diag[Â1, Â2, . . . , Âm] by X̂ = FbX;
for k = 1 to m do
  Solve Âk x̂k = b̂k using LAPACK routine ZGESV;
end for
Compute the DFT of the solution vector by x = Fb x̂;
A.3.2 Parallel
A.3.2.1 General Case
The general parallel case is very similar to the sequential general case. The
only difference is that the global matrix A has been distributed over the processors
and resides in the distributed data structure sub[A]. Recall that this data structure
was populated by the COEFF MATRIX routine (see Section A.2.2.1). In addition, a
distributed data structure, sub[b], is defined for the global right-hand side b. Each
processor simultaneously populates its corresponding data structure. The pseudocode
is given in Algorithm A.9. Note the existence of the variable x in line 6. This variable
is only for clarity of presentation of the algorithm. The routine PZGESV overwrites
the distributed data structure sub[b] with the result. In this way, there is no need to
maintain a distributed data structure for the variable x.
Algorithm A.9 Pseudocode for the SOURCE AMPLITUDES MODES general parallel case. Let sub[b]n and sub[b]m denote the number of rows and columns of sub[b], respectively.
1: for i = 1 to sub[b]n do
2:   for j = 1 to sub[b]m do
3:     Assign sub[b](i, j);
4:   end for
5: end for
6: Solve sub[A]x = sub[b] using ScaLAPACK routine PZGESV;
A.3.2.2 Rotationally Symmetric
With the exception of one additional operation, this routine performs the same
operations as the preceding cases. That is, it populates the right-hand side, and solves
the linear system. The extra operation comes from moving the distributed solution vector
into a different distributed format needed for a later parallel matrix multiplication. The
parallel rotationally symmetric system solve is discussed in detail and is the main topic
of the paper; therefore, this section will not discuss the details of the solve. Rather, this
section will detail the population of the right-hand side in a way which is amenable to
the parallel block circulant system solve. Recall the routine in Section A.2.2.2 defined m
processor grids, Gi, for i = 1, . . . ,m. In addition, the routine distributed each Ai onto
grid Gi. In a similar fashion, this routine will block b into m blocks of corresponding
size, denoted by bi, i = 1, . . . ,m, and distribute each bi onto Gi for i = 1, . . . ,m. Each
bi is in C^{n×rhsn}, where n is the order of each block Ai, and rhsn denotes the number
of right-hand sides, i.e., columns of b. In order to distribute each bi onto grid Gi for
i = 1, . . . ,m, a distributed data structure sub[bi] is defined. Note, only the processors
which are part of Gi contain the data structure sub[bi]. That is, processors belonging
to G1 use the distributed data structure sub[b1]; processors belonging to G2 use the
distributed data structure sub[b2], and so on and so forth. All of the distributed data
structures are populated simultaneously by looping over their corresponding distributed
data structures. At this point, each Ai and bi reside on grid Gi for i = 1, . . . ,m,
and this is precisely the setting which is needed for the parallel block circulant system
solve. Once the block circulant linear system has been solved, the solution will reside
in each sub[bi] for i = 1, . . . ,m. However, following this subroutine, a parallel matrix
vector multiplication will be performed using the solution vector. The multiplication is
performed in the context of one large process grid containing all processors and therefore,
the right-hand side, b, is distributed over the processor grid using a distributed data
structure sub[b]. Since the solution resides on the distributed data structures sub[bi] for
i = 1, . . . ,m, the data needs to be communicated to the appropriate format for sub[b].
In order to perform this reorganization of data, the routine uses a double for loop to loop
through the global b, computes which processor currently owns each entry and which processor
needs it, and performs the communication. Once this reorganization is finished, the
routine is complete.
A.4 SOURCE POWER
At this point in the program, the system Ax = b has been solved, and the resultant
vector, x, has been obtained. Because LAPACK and ScaLAPACK overwrite b with the
solution vector x, the data structures containing b now have the solution x. Therefore,
any data structures previously denoted by a b will be denoted by x. The matrix A and
the right-hand side b are no longer needed (in terms of their initial values). This routine
populates a matrix S with the intent of computing s = x∗Sx, in which x is the solution
obtained from the SOURCE AMPLITUDES MODES routine.
Algorithm A.10 Pseudocode for the SOURCE AMPLITUDES MODES rotationally symmetric parallel case. The variables pId and gId denote the processor number and the grid the processor belongs to, respectively. Let sub[bgId]n and sub[bgId]m denote the number of rows and columns of sub[bgId], respectively.
for j = 1 to sub[bgId]n do
  for k = 1 to sub[bgId]m do
    Assign sub[bgId](j, k);
  end for
end for
Solve Ax = b using the block circulant solve (Algorithm 4.2);
for j = 1 to N do
  for k = 1 to rhsn do
    SendProc = Processor which has data b(j, k) in sub[bgId];
    RecvProc = Processor which needs b(j, k) data;
    Compute index into sub[bgId]; denote by (sj , sk);
    Compute index into sub[b]; denote by (lj , lk);
    if SendProc == RecvProc then
      sub[b](lj , lk) = sub[bgId](sj , sk)
    else
      if pId == SendProc then
        Send sub[bgId](sj , sk) to processor RecvProc;
      else if pId == RecvProc then
        Recv temp = sub[bSendProc](sj , sk) from processor SendProc;
        sub[b](lj , lk) = temp;
      end if
    end if
  end for
end for
A.4.1 Sequential
A.4.1.1 General Case
The general sequential case simply populates the matrix; however, the method
of population differs from the previous routines. Instead of populating the matrix by
looping over each element in the matrix, the routine loops over the sources used in
the overall computation for populating S. That is, for each source, it computes which
element, S(i, j), uses that source, and adds the source’s contribution to S(i, j). In
general, there can be multiple sources per element and, therefore, S(i, j) will be updated
multiple times.
There are three different types of sources: simple, dipole, and a coupled simple and
dipole source which will be called a tripole source. The contribution of each source type
is done separately. That is, the simple source contributions are computed, followed by
computation of the dipole sources, and finally by computation of the tripole sources. In
the SOURCE POWER routine, there is a separate routine for each source type; however,
the algorithmic idea for populating the matrix S is the same in all cases.
In addition, the matrix S is Hermitian, and so only the upper triangular portion
is computed using the source contributions discussed above. After the upper triangular
portion is populated, the routine fills in the second half of the matrix by copying the
conjugate of the elements into the lower triangular portion of the matrix. Algorithm A.11
gives the pseudocode for the routine.
Algorithm A.11 Pseudocode for the SOURCE POWER general sequential case. Let N1, N2, and N3 be the number of simple, dipole, and tripole sources, respectively. Let N be the number of rows and columns of S.
//Fill upper triangular portion of S
for l = 1 to 3 do
  for k = 1 to Nl do
    Let (i, j) be the element to which source k contributes;
    if i ≤ j then
      Update S(i, j);
    end if
  end for
end for
//S is Hermitian, copy the data
for i = 2 to N do
  for j = 1 to i − 1 do
    S(i, j) = Conj(S(j, i));
  end for
end for
A.4.1.2 Rotationally Symmetric
In the rotationally symmetric case, the matrix is also block circulant. Because
the matrix is block circulant, only the first block row of S is filled. Then, using the first
block row of S, the matrix is filled. The fact that S is Hermitian is also used, but in this
case, it is used only for the first block in the first block row of S. The pseudocode given
in Algorithm A.12 is very similar to Algorithm A.11 except for a change in the bounds.
A.4.2 Parallel
A.4.2.1 General Case
Essentially, this routine populates the matrix S in parallel. It reuses the dis-
tributed data structure sub[A], which will now be denoted as sub[S], and populates it
by having each processor simultaneously loop over their corresponding data structures.
However, as discussed in the sequential cases, the original routines loop over sources,
Algorithm A.12 Pseudocode for the SOURCE POWER rotationally symmetric sequential case. Let N1, N2, and N3 be the number of simple, dipole, and tripole sources, respectively. Let m be the number of blocks in the first block row of S and N be the number of rows and columns of S.
//Fill the first block row of S except for the lower triangular portion of the first block.
for l = 1 to 3 do
  for k = 1 to Nl do
    Let (i, j) be the element to which source k contributes;
    if i ≤ j and i ≤ N/m then
      Update S(i, j);
    end if
  end for
end for
//The first block of S is Hermitian, copy the data
Let n = N/m;
for i = 1 to n do
  for j = 1 to i do
    S(i, j) = Conj(S(j, i));
  end for
end for
//Fill the remainder of S knowing it is block circulant
for k = 1 to m − 1 do
  for i = 1 to n do
    l = n ∗ k + i;
    for j = 1 to N do
      t = n ∗ k + j;
      if t > N then
        t = t − N;
      end if
      S(l, t) = S(i, j);
    end for
  end for
end for
not matrix elements. Therefore, this routine’s computations proceed by taking a pro-
cessor’s local indices, (li, lj), into sub[S], converting the local indices into global matrix
indices, (i, j), finding which sources are owned by this S(i, j), and looping through these
sources to compute their contributions to S(i, j). As in the sequential case, there are
three types of sources and, therefore, there are three separate routines for computing
their contributions. However, the overall algorithmic structure is the same.
Again, the matrix S is Hermitian. In contrast to the sequential case, instead of
computing only the upper triangular portion of S, all of S is computed using the source
contributions. While this adds some extra computation, the computation is being done
in parallel. If data were to be copied from the upper triangular section to the lower
triangular section, a large number of communications would have to take place and
would result in a bottleneck.
Algorithm A.13 Pseudocode for the SOURCE POWER general parallel case. Let m be the number of blocks in the first block row of S, and let sub[S]N and sub[S]M be the number of rows and columns of sub[S], respectively.
for li = 1 to sub[S]N do
  for lj = 1 to sub[S]M do
    Compute global indices (i, j) corresponding to (li, lj);
    Let SourceList = Sources corresponding to S(i, j);
    for each source type t do
      for each source of type t in SourceList do
        Update sub[S](li, lj)
      end for
    end for
  end for
end for
A.4.2.2 Rotationally Symmetric
The rotationally symmetric case is identical to the general parallel case in Sec-
tion A.4.2.1 with one modification. Because the matrix S is block circulant, the global
indices are modified to stay within the first block row of S when accessing the source
list. For example, suppose a processor's local index corresponds to an entry residing in the first block of the second block row. Because the matrix is block circulant, the second block row is simply a circular shift of the first block row, which means that the first block of the second row is the last block of the first row. Therefore, the indices corresponding to the first block of the second row are modified to point to the last block of the first row. By performing the computations this way, no communication between the processors is
necessary to fill in the matrix S. The pseudocode for the algorithm, including the index
modifications, is given by Algorithm A.14.
Algorithm A.14 Pseudocode for the SOURCE POWER rotationally symmetric parallel case. Let N be the number of rows and columns of S, m the number of blocks in the first block row of S, and n the order of each block. In addition, define sub[S]N and sub[S]M to be the number of rows and columns of sub[S], respectively.

for li = 1 to sub[S]N do
  for lj = 1 to sub[S]M do
    Compute global indices (i, j) corresponding to (li, lj);
    j = [m − (i − 1)/n] ∗ n − j;
    if j > N then
      j = j − N;
    end if
    i = mod(i − 1, n) + 1;
    Let SourceList = Sources corresponding to S(i, j);
    for each source type in SourceList do
      t = SourceType;
      for each source of type t in SourceList do
        Update sub[S](li, lj);
      end for
    end for
  end for
end for
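For reference, the basic identity behind this kind of index reduction can be stated independently of Algorithm A.14: for an N × N block circulant matrix with m blocks of order n, the entry at global position (i, j) in block row r (0-based, r = (i − 1)/n) equals the entry of the first block row at row (i − 1) mod n + 1 and column (j − rn − 1) mod N + 1. The following sketch illustrates only this generic identity; Algorithm A.14 states its index modification in the thesis's own convention, which also reflects how the source list is organized.

#include <stdio.h>

/*
 * Generic block circulant index reduction (1-based indices, matching the
 * pseudocode): map a global entry (i, j) to the equivalent entry of the
 * first block row.  This illustrates the identity only, not the exact
 * modification used in Algorithm A.14.
 */
static void reduce_to_first_block_row(int i, int j, int n, int N,
                                      int *i0, int *j0)
{
    int r = (i - 1) / n;                      /* 0-based block row index    */
    *i0 = (i - 1) % n + 1;                    /* row within the first block */
    *j0 = ((j - r * n - 1) % N + N) % N + 1;  /* circularly shifted column  */
}

int main(void)
{
    int m = 3, n = 2, N = m * n, i0, j0;

    /* An entry in the first block of the second block row maps to the
       last block of the first block row, as described in the text. */
    reduce_to_first_block_row(3, 1, n, N, &i0, &j0);
    printf("S(3,1) corresponds to first-block-row entry S(%d,%d)\n", i0, j0);
    return 0;
}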
A.5 MODAL RESISTANCE
This routine is quite straightforward. Using the solution vector x obtained from the SOURCE AMPLITUDES MODES routine (see Section A.3) and the matrix S from the SOURCE POWER routine (see Section A.4), it computes s = x∗Sx. Because previous routines have already populated the needed data structures, this routine simply performs the required multiplications using LAPACK or ScaLAPACK.
A.5.1 Sequential
A.5.1.1 General Case
The sequential routine performs two multiplications and uses one intermediate data structure, W, to hold the result of the first multiplication, i.e., W = Sx. After the first multiplication, the second multiplication S = x∗W is performed, reusing the data structure S to hold the solution. The LAPACK routine ZGEMM is used for both multiplications; ZGEMM performs general matrix-matrix multiplication and is used here because, in general, the solution vector x will contain multiple columns. For completeness, the pseudocode for this operation is given by Algorithm A.15.
Algorithm A.15 Pseudocode for the MODAL RESISTANCE general sequential case.

Compute W = Sx using LAPACK routine ZGEMM;
Compute S = x∗W using LAPACK routine ZGEMM;
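Algorithm A.15 maps directly onto two GEMM calls. The following C sketch performs the same two multiplications through the CBLAS interface (cblas_zgemm) rather than the Fortran ZGEMM called by the actual routine; the matrix sizes and data are made up for illustration, with a two-column x to reflect a solution vector containing multiple columns.

/*
 * Illustrative sketch of s = x*Sx via two ZGEMM-style multiplications using
 * the CBLAS interface.  S is N x N (Hermitian) and x is N x K, both stored
 * in column-major order; the data below are arbitrary.
 * Compile with, e.g.,  cc modal.c -lcblas  (or link against any BLAS).
 */
#include <complex.h>
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    enum { N = 3, K = 2 };
    double complex S[N * N] = {                /* column-major, Hermitian   */
        2.0,         1.0 - 1.0*I,  0.0,
        1.0 + 1.0*I, 3.0,          0.5*I,
        0.0,        -0.5*I,        1.0
    };
    double complex x[N * K] = {                /* two "mode" columns        */
        1.0, 0.0, 1.0*I,
        0.5, 1.0, 0.0
    };
    double complex W[N * K];                   /* intermediate W = Sx       */
    double complex R[K * K];                   /* result    R = x*W         */
    double complex one = 1.0, zero = 0.0;

    /* W = Sx */
    cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, K, N, &one, S, N, x, N, &zero, W, N);

    /* R = x*W = x*Sx (conjugate transpose of x) */
    cblas_zgemm(CblasColMajor, CblasConjTrans, CblasNoTrans,
                K, K, N, &one, x, N, W, N, &zero, R, K);

    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            printf("R(%d,%d) = %g + %gi\n", i + 1, j + 1,
                   creal(R[i + j * K]), cimag(R[i + j * K]));
    return 0;
}

The parallel version in Section A.5.3.1 replaces these two calls with PZGEMM calls on the distributed data structures.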
A.5.2 Rotationally Symmetric
This case behaves in exactly the same way as the general sequential case (see Section A.5.1.1).
A.5.3 Parallel
A.5.3.1 General Case
As in the sequential case, an additional data structure is required for the inter-
mediate multiplication. Therefore, the distributed data structure sub[W ] is defined. Be-
cause the distributed data structures needed for the parallel multiplications, i.e., sub[S]
and sub[x], have already been populated, this routine simply uses ScaLAPACK to per-
form the parallel matrix multiplications. The routine calls PZGEMM to compute W = Sx using the distributed data structures. Following the first multiplication, S = x∗W is computed, reusing the distributed data structure
sub[S] to store the solution. Algorithm A.16 shows the pseudocode for the routine.
Algorithm A.16 Pseudocode for the MODAL RESISTANCE general parallel case.

Compute sub[W] = sub[S]sub[x] using ScaLAPACK routine PZGEMM;
Compute sub[S] = sub[x]∗sub[W] using ScaLAPACK routine PZGEMM;
A.5.3.2 Rotationally Symmetric
The parallel rotationally symmetric case is exactly the same as the general parallel
case (see Section A.5.3.1 and Algorithm A.16).