The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
PARALLEL BOUNDARY ELEMENT SOLUTIONS OF BLOCK
CIRCULANT LINEAR SYSTEMS FOR ACOUSTIC RADIATION
PROBLEMS WITH ROTATIONALLY SYMMETRIC
BOUNDARY SURFACES
A Thesis in
Computer Science and Engineering
by
Kenneth D. Czuprynski
© 2012 Kenneth D. Czuprynski
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Master of Science
May 2012
The thesis of Kenneth D. Czuprynski was reviewed and approved* by the following:
Suzanne M. Shontz, Assistant Professor of Computer Science and Engineering, Thesis Adviser
Jesse L. Barlow, Professor of Computer Science and Engineering
John B. Fahnline, Assistant Professor of Acoustics
Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering
*Signatures are on file in the Graduate School.
Abstract
Coupled finite element/boundary element (FE/BE) formulations are commonly
used to solve structural-acoustic problems where a vibrating structure is idealized as
being submerged in a fluid that extends to infinity in all directions. Typically in (FE/BE)
formulations, the structural analysis is performed using the finite element method, and
the acoustic analysis is performed using the boundary element method. In general, the
problem is solved frequency by frequency, and the coefficient matrix for the boundary
element analysis is fully populated and little can be done to alleviate the storage and
computational requirements. Because acoustic boundary element calculations require
approximately six elements per wavelength to produce accurate solutions, the boundary
element formulation is limited to relatively low frequencies. However, when the outer
surface of the structure is rotationally symmetric, the system of linear equations becomes
block circulant. We propose a parallel algorithm for distributed memory systems which
takes advantage of the underlying concurrency of the inversion formula for block circulant
matrices. By using the structure of the coefficient matrix in tandem with a distributed
memory system setting, we show that the storage and computational requirements are
substantially lessened.
Table of Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Acoustic Radiation Problems . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Boundary Element Method . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 The Fourier Matrix and Fast Fourier Transform . . . . . . . . . . . . 10
1.4 Circulant Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 2. Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 3. Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Coefficient Matrix Derivation . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Block Circulant Inversion . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Invertibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 4. Parallel Solution Algorithm . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Block DFT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Block FFT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 System Solves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Parallel Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 5. Theoretical Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 Parallel Linear System Solve . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Block DFT using the DFT Algorithm . . . . . . . . . . . . . . . . . 53
5.3 Block DFT Using the FFT Algorithm . . . . . . . . . . . . . . . . . 55
5.4 Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 6. Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Chapter 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Appendix. BEM Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
A.1 STATIC MULTIPOLE ARRAYS . . . . . . . . . . . . . . . . . . . . 73
A.1.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 73
A.1.1.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 74
A.1.2 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
A.1.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . 74
A.1.2.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 75
A.2 COEFF MATRIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A.2.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A.2.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 76
A.2.1.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 76
A.2.2 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.2.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . 77
A.2.2.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 77
A.3 SOURCE AMPLITUDES MODES . . . . . . . . . . . . . . . . . . . 80
A.3.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.3.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 80
A.3.1.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 80
A.3.2 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.3.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . 82
A.3.2.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 83
A.4 SOURCE POWER . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.4.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A.4.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 86
A.4.1.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 87
A.4.2 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
A.4.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . 87
A.4.2.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 90
A.5 MODAL RESISTANCE . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.5.1 Sequential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.5.1.1 General Case . . . . . . . . . . . . . . . . . . . . . . 91
A.5.2 Rotationally Symmetric . . . . . . . . . . . . . . . . . . . . . 92
A.5.3 Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.5.3.1 General Case . . . . . . . . . . . . . . . . . . . . . . 92
A.5.3.2 Rotationally Symmetric . . . . . . . . . . . . . . . . 92
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
List of Figures
1.1 Radix 2 element interaction pattern obtained from [18]. . . . . . . . . . 16
3.1 A propeller with three times rotational symmetry [37]. . . . . . . . . . . 26
3.2 A four times rotationally symmetric sketch of a propeller. . . . . . . . . 27
4.1 Initial data distribution assumed in the DFT computation for the case
P = m = 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 The DFT computation for the case P = m = 4. Each arrow indicates
the communication of a processor’s owned submatrix to a neighboring
processor in the direction of the arrow. . . . . . . . . . . . . . . . . . . . 40
4.3 Parallel block DFT data decomposition for P > m. . . . . . . . . . . . . 42
4.4 Parallel block DFT data decomposition and processor groupings for P > m. 43
4.5 Process illustrating the distributed FFT. Lines crossing to different pro-
cessors indicate communication from left to right. Note the output is in
reverse bit-reversed order relative to numbering starting at zero; that is,
A1 is element 0; A2 is element 1, etc. . . . . . . . . . . . . . . . . . . . . 47
4.6 Processor grid creation for P=16 and m=4. . . . . . . . . . . . . . . . . 48
6.1 Runtime comparison using the DFT algorithm for varying P and N with
m = 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 Runtime comparison using the FFT algorithm for varying P and N with
m = 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 Speedups using the DFT algorithm for varying P and N with m = 4. . 61
6.4 Speedups using the FFT algorithm for varying P and N with m = 4. . . 61
6.5 Efficiency using the DFT algorithm for varying N and P with m = 4. . 63
6.6 Efficiency using the FFT algorithm for varying N and P with m = 4. . 64
6.7 Runtime comparison using the DFT algorithm for varying P and N with
m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.8 Runtime comparison using the FFT algorithm for varying P and N with
m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.9 Speedup comparison using the DFT algorithm for varying P and N when
m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.10 Speedup comparison using the FFT algorithm for varying P and N when
m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.11 Efficiency comparison using the DFT algorithm for varying P and N
when m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.12 Efficiency comparison using the FFT algorithm for varying P and N
when m = 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 1
Introduction
Coupled finite element/boundary element (FE/BE) formulations are commonly
used to solve structural-acoustic problems where a vibrating structure is idealized as be-
ing submerged in a fluid that extends to infinity in all directions. Typically in (FE/BE)
formulations, the structural analysis is performed using the finite element method, and
the acoustic analysis is performed using the boundary element method (BEM). The
boundary element formulation is advantageous for the acoustic radiation problem be-
cause only the outer surface of the structure in contact with the acoustic medium is
discretized. This formulation also allows us to neglect meshing the infinite fluid exterior
to the structure, as would be required if the finite element method were used instead.
Using the BEM, we compute the radiated sound field of a vibrating structure
Ω ⊂ R3. The main obstacle in computing the sound radiation is solving the linear
system of equations to enforce the specified boundary conditions. In the context of
the BEM, this requires the solution of a dense, complex linear system. In general, the
problem is solved frequency by frequency, and the coefficient matrix for the boundary
element analysis is fully populated and exhibits no exploitable structure. The size, N²,
of the coefficient matrix is directly correlated with the level of discretization, N , used
for the surface in question. Because acoustic boundary element calculations require
approximately six elements per wavelength to produce accurate solutions, the boundary
element formulation is limited to relatively low frequencies. For high frequency problems,
and for problems which involve large and/or complex surfaces, these matrices are large,
dense, and unstructured; therefore, there is little which can be done to alleviate the
storage and computational requirements. Iterative solvers and preconditioners have been
investigated [4, 5, 28] and are a natural choice for large problems because the cost of
direct solvers can become prohibitive. While the computational requirements can be
lessened by iterative methods, the storage requirements can still present a problem. One
obvious solution is to perform the solve in a distributed memory parallel setting. A
distributed memory parallel algorithm distributes the workload and allows the storage
of the matrix to be split between many individual systems with local memories, thereby
increasing the total available memory. In addition, because linear systems are ubiquitous
throughout scientific computation, libraries exist for their efficient parallel solution. In
particular, because the matrix is dense, Scalable LAPACK (ScaLAPACK) [6] is a favored
choice.
While in general these matrices exhibit no exploitable structure, when the bound-
ary surface is rotationally symmetric, the coefficient matrix is block circulant. Circulant
matrices are defined as each row being a circular shift of the row above it. One property
of circulant matrices is that they are all diagonalizable by the Fourier matrix. There-
fore, the Discrete or Fast Fourier Transform (D/FFT) can be used in the solution of
the system. These results generalize to the block case and can be used in the solution
of block circulant linear systems arising from acoustic radiation problems involving ro-
tationally symmetric boundary surfaces. In addition, the inversion formula for block
circulant matrices is highly amenable to parallel computation.
We propose an algorithm for distributed memory systems which takes advantage
of the underlying concurrency of the inversion formula for block circulant matrices. By
using the structure of the coefficient matrix in tandem with a distributed system setting,
the storage and computational limitations are substantially lessened. Therefore, the
algorithm allows larger and higher frequency acoustic radiation problems to be explored.
1.1 Acoustic Radiation Problems
The goal is to compute the radiated sound field due to a vibrating structure
Ω ⊂ R3 subject to given boundary conditions. The governing partial differential equation
(PDE) for acoustic radiation problems is the Helmholtz equation, i.e.,

\[ \left(\nabla^2 + k^2\right) u(p) = 0, \qquad p \in \Omega^+, \tag{1.1} \]

where ∇² is the Laplacian; k = ω/c is the wave number; ω is the angular frequency, and c
is the speed of sound in the chosen medium. Ω+ = R³\Ω denotes the region exterior to
Ω. In structural acoustics problems, it is common for the velocity distribution over the
boundary of Ω, denoted by ∂Ω, to be specified. This equates to the Neumann boundary
condition

\[ \frac{\partial u(p)}{\partial n_p} = f(p), \qquad p \in \partial\Omega, \tag{1.2} \]

where ∂/∂n_p denotes differentiation in the direction of the outward normal at p ∈ ∂Ω. In
addition, to ensure all radiated waves are outgoing, the Sommerfeld radiation condition

\[ \lim_{r \to \infty} r \left( \frac{\partial u(p)}{\partial r} - i k u(p) \right) = 0 \tag{1.3} \]

is enforced, where r is the distance of p from a fixed origin. Therefore, in order to solve
for the radiated sound field due to Ω, a solution to the Helmholtz equation (1.1), subject
to equations (1.2) and (1.3), must be found.
1.2 Boundary Element Method
The boundary element method is an algorithm for the numerical solution of PDEs
which have an equivalent boundary integral representation. The BEM reformulates
the PDE into an equivalent boundary integral equation (BIE), which is then solved
numerically. The benefit of the formulation is that it reduces the problem to one over
the boundary. However, because the BEM requires an equivalent BIE formulation, if
the PDE cannot be represented as an equivalent BIE, the BEM cannot be used. The
remainder of the section will outline the BEM within the context of an acoustic radiation
problem.
Consider a vibrating structure Ω ⊂ R3. The Helmholtz equation is the governing
PDE for the radiated sound field produced by Ω and is given by (1.1). A standard
boundary integral formulation of (1.1) yields the following equations

\[ \frac{1}{4\pi} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) = u(p), \qquad p \in \Omega^+ \tag{1.4} \]

and

\[ \frac{1}{2\pi} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) = u(p), \qquad p \in \partial\Omega, \tag{1.5} \]
where G(p, q) is the Green’s function, which can loosely be thought of as the effect the
point q has on point p. In the context of an acoustic radiation problem, the Green’s
function corresponds to the fundamental solution of the Helmholtz equation and is given
by G(p, q) = e^{ik|p−q|}/|p−q|, in which |p − q| denotes the Euclidean distance between the
points p and q. A solution to u in the exterior domain with respect to the points on the
boundary is provided by (1.4). Therefore, if the quantities u and ∂u(p)/∂n_q are known over
the boundary, the solution for the points in the exterior can be easily computed. In
addition, (1.5) provides a means of solving for the aforementioned quantities. However,
by applying the Fredholm alternative to (1.5) it is found that the solutions are not unique
for all wave numbers k, and thus an alternative formulation is required [34]. Burton and
Miller [9] showed how a unique solution can be derived. Differentiating (1.5) in the
direction of the outward normal yields
\[ \frac{1}{2\pi} \frac{\partial}{\partial n_p} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) = \frac{\partial u(p)}{\partial n_p}, \qquad p \in \partial\Omega. \tag{1.6} \]
Then constructing a linear combination of equations (1.5) and (1.6) using a purely imagi-
nary coupling coefficient, β, produces a modified BIE formulation with a unique solution.
The formulation is given by

\[ \frac{1}{2\pi} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) \;+\; \beta \left[ \frac{1}{2\pi} \frac{\partial}{\partial n_p} \int_{\partial\Omega} \left( \frac{\partial G(p,q)}{\partial n_q} u(q) - \frac{\partial u(q)}{\partial n_q} G(p,q) \right) d(\partial\Omega) \right] = u(p) + \beta \frac{\partial u(p)}{\partial n_p}. \tag{1.7} \]
Assuming a Neumann boundary condition, (1.7) can be rearranged as follows:
\[ \int_{\partial\Omega} u(q) \left( \beta \frac{\partial^2 G(p,q)}{\partial n_q \partial n_p} + \frac{\partial G(p,q)}{\partial n_q} \right) d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} \frac{\partial u(q)}{\partial n_q} \left( G(p,q) + \beta \frac{\partial G(p,q)}{\partial n_p} \right) d(\partial\Omega) + 2\pi\beta \frac{\partial u(p)}{\partial n_p}, \qquad p \in \partial\Omega. \tag{1.8} \]
Note, in the case of a Dirichlet boundary condition, ∂u(p)/∂n_q can be solved for by rearranging
(1.8). Once u(p) has been solved for over the boundary, the solution for all points in
the exterior can be obtained. Therefore, a means for numerically solving equation (1.8)
must be devised. For notational convenience, let v(q) = ∂u(q)/∂n_q, and redefine portions of
both integrands as

\[ T(p,q) = \beta \frac{\partial^2 G(p,q)}{\partial n_q \partial n_p} + \frac{\partial G(p,q)}{\partial n_q} \tag{1.9} \]

and

\[ H(p,q) = G(p,q) + \beta \frac{\partial G(p,q)}{\partial n_p}. \tag{1.10} \]
Equation (1.8) becomes

\[ \int_{\partial\Omega} u(q)\, T(p,q)\, d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} v(q)\, H(p,q)\, d(\partial\Omega) + 2\pi\beta\, v(p), \qquad p \in \partial\Omega. \tag{1.11} \]
The next step in the BEM is to discretize the boundary surface, ∂Ω, into smaller quadri-
lateral or triangular surface elements. After the discretization, the boundary can be
represented as ∂Ω = ∂Ω1∪∂Ω2∪· · ·∪∂ΩN , where ∂Ωi represents the ith surface element
in the discretization of ∂Ω and ∂Ωi ∩ ∂Ωj = ∅ for i ≠ j.
Equation (1.11) can then be represented as

\[ \sum_{i=1}^{N} \left[ \int_{\partial\Omega_i} u(q)\, T(p,q)\, d(\partial\Omega_i) \right] - 2\pi u(p) = \sum_{i=1}^{N} \left[ \int_{\partial\Omega_i} v(q)\, H(p,q)\, d(\partial\Omega_i) \right] + 2\beta\pi\, v(p), \qquad p \in \partial\Omega. \tag{1.12} \]
The most straightforward approach to numerically solving equation (1.12) is to assume
u(p) and v(p) are constant along each surface element, ∂Ωi, i = 1, . . . , N . Therefore,
let u(p) ≈ uj and v(p) ≈ vj for p ε ∂Ωj , j = 1, . . . , N . Under this assumption, equation
(1.12) can be decomposed into N equations, i.e., one equation for each surface element;
that is,
\[ \sum_{i=1}^{N} u_i \left[ \int_{\partial\Omega_i} T(p,q)\, d(\partial\Omega_i) \right] - 2\pi u_j = \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\beta\pi\, v_j, \qquad p \in \partial\Omega_j. \tag{1.13} \]
Equation (1.13) yields a solution for the jth surface element of the boundary. The
boundary is constructed of N surface elements; therefore, there are N equations and N
unknowns total. Using this, equation (1.13) can more concisely be expressed in matrix
notation. Let

\[ M = \begin{bmatrix} \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \\ \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \\ \vdots & \vdots & \ddots & \vdots \\ \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \end{bmatrix}, \]

where the jth row is evaluated at the point p ∈ ∂Ωj appearing in equation (1.13).
Similarly, let the column vector b represent the right-hand side; that is,

\[ b = \begin{bmatrix} \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_1 \\ \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_2 \\ \vdots \\ \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_N \end{bmatrix}. \]
With a Neumann boundary condition, each vi, i = 1, . . . , N , is known, and the integrals
can be computed via numerical quadrature. Therefore, the matrix M and vector b are
known quantities. Using the new quantities, the linear system
\[ (M - 2\pi I)\, u = b, \qquad p \in \partial\Omega, \tag{1.14} \]
can be used to solve for the approximation of u over the boundary. Once we have an
approximate solution for u over the surface, (1.4) can be used to solve for u in the
exterior.
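
As a rough illustration of how (1.14) is used, the sketch below assumes the element integrals have already been evaluated by numerical quadrature and stored in hypothetical arrays T_int and H_int, with T_int[j, i] holding the integral of T(p, q) over element i for the collocation point on element j, and similarly for H_int. It is a minimal NumPy example, not the thesis BEM code.

import numpy as np

def solve_surface_system(T_int, H_int, v, beta):
    # Assemble and solve (M - 2*pi*I) u = b from (1.13)-(1.14).
    # v holds the known Neumann data v_i per element; beta is the
    # purely imaginary coupling coefficient.
    N = len(v)
    b = H_int @ v + 2.0 * np.pi * beta * v     # right-hand side vector b
    A = T_int - 2.0 * np.pi * np.eye(N)        # coefficient matrix (M - 2*pi*I)
    return np.linalg.solve(A, b)               # dense complex solve for u on the boundary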
It is difficult to precisely enforce the boundary conditions for the surface velocity
at edges and corners when the basis functions are constructed using surface distributions
of simple and dipole sources, as they are in Burton and Miller’s standard implementation.
To avoid this difficulty, it is possible to rewrite the solution in terms of surface-averaged
quantities instead, which is common in acoustics. For example, surface-averaged pres-
sures and volume velocities are commonly used in lumped parameter representations
of transducers. Since the goal is no longer to match the boundary conditions on a
point-by-point basis, it becomes permissible to simplify the solution by constructing the
basis functions from discrete sources rather than distributions of sources. Using surface-
averaged pressures and volume velocities as variables can also be shown to produce a
solution that converges with mesh density, unlike the standard formulation which can
produce a less accurate solution as the mesh is refined. The solution is then derived in
terms of source amplitudes rather than physical quantities, such as pressure or velocity.
For this type of indirect solution, an approach similar to Burton and Miller’s can be
used to prevent nonexistence/nonuniqueness difficulties. A hybrid "tripole" source type
is created from a simple and dipole source with a complex-valued coupling coefficient, as
is discussed by Hwang and Chang [19]. The numerical implementation discussed in this
thesis is based on an indirect solution using tripole sources, but the basic formulation
shares many characteristics with the standard Burton and Miller approach discussed
previously.
1.3 The Fourier Matrix and Fast Fourier Transform
The Fourier matrix is given by

\[ F = \frac{1}{\sqrt{n}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega_n^{1} & \omega_n^{2} & \cdots & \omega_n^{n-1} \\ 1 & \omega_n^{2} & \omega_n^{4} & \cdots & \omega_n^{2(n-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \cdots & \omega_n^{(n-1)(n-1)} \end{bmatrix}, \tag{1.15} \]

where ω_n = e^{i2π/n}, i = √−1, and normalizing by 1/√n makes F unitary. The discrete Fourier
transform (DFT) is defined as a matrix vector multiplication involving the Fourier ma-
trix. That is,
y = Fx. (1.16)
The vector y is called the DFT of x. Similarly, the inverse discrete Fourier transform
(IDFT) of x is given by
\[ y = F^{-1} x. \tag{1.17} \]

However, because F has been defined to be unitary, (1.17) becomes

\[ y = F^{*} x. \tag{1.18} \]
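
As a quick numerical check (a small NumPy sketch, not part of the thesis), the unitary Fourier matrix can be formed explicitly and the DFT computed as the matrix-vector product (1.16). Note that NumPy's FFT routines use the opposite sign convention and omit the 1/√n factor, so under the ω_n = e^{i2π/n} convention used here the product F x matches √n times the library inverse FFT.

import numpy as np

n = 8
kk = np.arange(n).reshape(-1, 1)
jj = np.arange(n).reshape(1, -1)
F = np.exp(2j * np.pi * kk * jj / n) / np.sqrt(n)   # unitary Fourier matrix (1.15)

x = np.random.rand(n) + 1j * np.random.rand(n)
y = F @ x                                           # DFT as a matrix-vector product (1.16)
print(np.allclose(y, np.sqrt(n) * np.fft.ifft(x)))  # True under this sign convention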
The Fourier matrix is highly structured, and this structure can be used to com-
pute the DFT. The improved method of computing the DFT is called the Fast Fourier
transform (FFT) and was first introduced by Cooley and Tukey [12]. It was shown that
for vectors with n = 2^h elements, h ∈ Z⁺, the DFT can be computed in O(n log n). Over
the years, the method has been extended to handle vectors with an arbitrary number of
elements; a comprehensive overview of these can be found in [11, 26]. This thesis uses
the Cooley and Tukey version of the algorithm, also now termed the radix-2 FFT. We
thus now overview the radix-2 algorithm.
Assuming the first column and first row are indexed by 0, consider the element
in the kth row and the jth column of the Fourier matrix, which is given by ω_n^{kj} = e^{i2πkj/n}.
Note then that each element is periodic in n. This can readily be seen by using Euler's
formula. Applying Euler's formula, we have

\[ \omega_n^{kj} = \cos\left(\frac{2\pi kj}{n}\right) + i \sin\left(\frac{2\pi kj}{n}\right). \tag{1.19} \]

Because sin and cos both have period 2π, by (1.19), if kj ≥ n, the elements begin to
repeat. It follows that each element in the Fourier matrix can be represented by ω_n^{k} for
k = 0, . . . , n − 1. For example, consider the four-by-four Fourier matrix
\[ F_4 = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^{1} & \omega_4^{2} & \omega_4^{3} \\ 1 & \omega_4^{2} & \omega_4^{4} & \omega_4^{6} \\ 1 & \omega_4^{3} & \omega_4^{6} & \omega_4^{9} \end{bmatrix}. \tag{1.20} \]

By the periodicity of the elements, (1.20) becomes

\[ F_4 = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^{1} & \omega_4^{2} & \omega_4^{3} \\ 1 & \omega_4^{2} & 1 & \omega_4^{2} \\ 1 & \omega_4^{3} & \omega_4^{2} & \omega_4^{1} \end{bmatrix}. \tag{1.21} \]
The FFT algorithm uses properties of ω coupled with a divide and conquer strategy.
The following derivation relies heavily on [11]; we follow their derivation closely.
Recall that n = 2^h for h ∈ Z⁺, and consider the operation y = Fx. Expanding the
matrix-vector product gives

\[ y_k = \sum_{j=0}^{n-1} x_j \omega_n^{jk}, \qquad k = 0, \ldots, n-1. \tag{1.22} \]

Equation (1.22) can be split into two summations: one containing all of the even terms,
and one containing all of the odd terms, i.e.,

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_n^{2jk} + \sum_{j=0}^{n/2-1} x_{2j+1} \omega_n^{(2j+1)k}, \qquad k = 0, \ldots, n-1. \tag{1.23} \]

An ω_n^{k} term in the second summation can be pulled out of the summation, i.e.,

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_n^{2jk} + \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_n^{2jk}, \qquad k = 0, \ldots, n-1. \tag{1.24} \]
Using the fact that ω_n^{2} = ω_{n/2}, (1.24) becomes

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} + \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, n-1. \tag{1.25} \]
The next observation to make is that ω_{n/2}^{(k+n/2)j} = ω_{n/2}^{kj} for k = 0, . . . , n/2 − 1. That is,
because ω_{n/2} has a smaller period, the elements begin to repeat sooner, and k, in turn,
need not go beyond n/2 − 1. Therefore, (1.25) becomes

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} + \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \frac{n}{2}-1. \tag{1.26} \]
Looking more closely, each summation represents a DFT of length n/2. Therefore, a DFT
of length n can be broken into two DFTs each half the size of the previous DFT. However,
(1.26) contains only the first n/2 terms of y. Computing the remaining terms yields

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{j(k+n/2)} + \omega_n^{k+n/2} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{j(k+n/2)}, \qquad k = 0, \ldots, \frac{n}{2}-1. \tag{1.27} \]
We then obtain

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} \omega_{n/2}^{jn/2} + \omega_n^{k+n/2} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk} \omega_{n/2}^{jn/2}, \qquad k = 0, \ldots, \frac{n}{2}-1. \tag{1.28} \]

Because ω_{n/2}^{jn/2} = 1 and ω_n^{k+n/2} = −ω_n^{k}, (1.28) becomes

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} - \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \frac{n}{2}-1. \tag{1.29} \]
Therefore, the entire vector y can be obtained by

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} + \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \frac{n}{2}-1, \tag{1.30} \]
\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j} \omega_{n/2}^{jk} - \omega_n^{k} \sum_{j=0}^{n/2-1} x_{2j+1} \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \frac{n}{2}-1. \]
Let s_j = x_{2j} and t_j = x_{2j+1} for j = 0, . . . , n/2 − 1; that is, s is the vector containing all
the even elements of x, and t is the vector containing all of its odd elements. Then (1.30)
may be written as

\[ [F_n x]_k = [F_{n/2}\, s]_k + \omega_n^{k}\, [F_{n/2}\, t]_k, \qquad k = 0, \ldots, \frac{n}{2}-1, \tag{1.31} \]
\[ [F_n x]_{k+n/2} = [F_{n/2}\, s]_k - \omega_n^{k}\, [F_{n/2}\, t]_k, \qquad k = 0, \ldots, \frac{n}{2}-1. \]
From (1.31), the recursive nature of the algorithm should be clear. The DFT of a vector
y can be split into two DFTs of half the size. We can proceed in computing F_{n/2} s and
F_{n/2} t, as if it were the first time, and proceed as above. Algorithm 1.1 gives a pseudocode
of the algorithm.
Algorithm 1.1 Radix-2 FFT pseudocode.
1: Y = Radix-2FFT(X, n)
2: if n == 1 then
3:   return X;
4: else
5:   s = Radix-2FFT(Even(X), n/2);
6:   t = Radix-2FFT(Odd(X), n/2);
7:   for k = 0 to n/2 − 1 do
8:     Y_k = s_k + ω_n^k t_k;
9:     Y_{k+n/2} = s_k − ω_n^k t_k;
10:  end for
11: end if
12: return Y;
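
For concreteness, a direct Python transcription of Algorithm 1.1 might look as follows (an illustrative sketch only, using the ω_n = e^{i2π/n} convention of this section and assuming the input length is a power of two).

import numpy as np

def radix2_fft(x):
    # Recursive radix-2 FFT following Algorithm 1.1.
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    s = radix2_fft(x[0::2])                            # transform of the even-indexed elements
    t = radix2_fft(x[1::2])                            # transform of the odd-indexed elements
    w = np.exp(2j * np.pi * np.arange(n // 2) / n)     # the omega_n^k factors
    y = np.empty(n, dtype=complex)
    y[:n // 2] = s + w * t                             # lines 8-9 of Algorithm 1.1
    y[n // 2:] = s - w * t
    return y

With the e^{i2π/n} sign convention used here, radix2_fft(x) equals n · np.fft.ifft(x).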
Algorithm 1.1 follows nicely from the derived mathematics; however, the recursion
can be unrolled into an iterative format which will later facilitate the explanation of our
parallel algorithm. The algorithm can be found in [24], and our explanation follows their
discussion closely.
Algorithm 1.2 Iterative Radix-2 FFT pseudocode as presented in [24].
1: Y = Radix-2FFT(X, Y, n)
2: r = log n;
3: R = X;
4: for m = 0 to r − 1 do
5:   S = R;
6:   for i = 0 to n − 1 do
7:     // Let (b_0 b_1 . . . b_{r−1}) be the binary representation of i
8:     j = (b_0 . . . b_{m−1} 0 b_{m+1} . . . b_{r−1});
9:     k = (b_0 . . . b_{m−1} 1 b_{m+1} . . . b_{r−1});
10:    r = (b_m b_{m−1} . . . b_0 0 . . . 0);
11:    R_i = S_j + S_k ω_n^r;
12:  end for
13: end for
14: Y = R;
Algorithm 1.2 is the iterative version of Algorithm 1.1. Each iteration of the outer
loop (line 4) represents one level of the recursion, starting with the deepest level. At
each level of recursion, the output vector is updated by two entries of the given input
vector and a multiple of the factor ω, (lines 8 and 9 of Algorithm 1.1 and line 11 for
Algorithm 1.2). Algorithm 1.1 uses the input to the function at each level of recursion
to update the output vector; whereas, Algorithm 1.2 uses binary representations of the
index being modified.
The most relevant property to notice, with respect to the parallel algorithm, is the
pattern of interaction between different elements of the input vector. Figure 1.1 shows
which elements in the input vector, denoted x, are used in computing each element of
the output vector, denoted X, for a vector of length n = 16.
Fig. 1.1 Radix 2 element interaction pattern obtained from [18].
In order to solidify this notion and to clarify the meaning behind Figure 1.1,
consider the transformation of x(0). The elements of the initial input vector involved in
the transformation of x(0) are: x(0), x(8), followed by modified versions of x(4), x(2),
and x(1). Similarly, each element of the input vector in the diagram can be traced to
see the elements of the initial vector involved in each computation.
A final note about FFTs is the ordering of the output. When the algorithm is
run in place, such that it overwrites the array containing the initial data, the output is
in bit-reversed order. This can be seen in Figure 1.1. For another example, let n = 8,
and consider the computation x = F8x, where the vector x is overwritten. This yields
\[ x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} \longmapsto \begin{bmatrix} x_0 \\ x_4 \\ x_2 \\ x_6 \\ x_1 \\ x_5 \\ x_3 \\ x_7 \end{bmatrix}. \]
The indices are converted to binary, and the bit string is reversed before being converted
back into decimal. In the above example, consider the index one: (1)_{10} = (001)_2, and
flipping the bit string yields (100)_2 = (4)_{10}, so the element at index one moves to index four. This means that data migrates
to bit-reversed order when the FFT is done in place.
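
A small Python helper (illustrative only, not from the thesis code) that reproduces this bit-reversed ordering for a power-of-two length:

def bit_reversed_indices(n):
    # Return the bit-reversed index for each position 0..n-1 (n a power of two).
    bits = n.bit_length() - 1
    return [int(format(i, '0{}b'.format(bits))[::-1], 2) for i in range(n)]

print(bit_reversed_indices(8))   # [0, 4, 2, 6, 1, 5, 3, 7]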
1.4 Circulant Matrices
Circulant matrices are a subset of Toeplitz matrices which have the added prop-
erty that each row is a circular shift of the previous row. The matrix C is circulant if it
has the form
C =
c1 c2 c3 · · · cn
cn c1 c2 · · · cn−1
cn−1 cn c1 · · · cn−2
......
.... . .
...
c2 c3 c4 · · · c1
.
Matrices of this form can be uniquely represented by their first row and will be denoted
by C = circ(c0, c1, c2, · · · , cn−1).
A thorough treatment of circulant matrices is given in [13]. The important prop-
erty of circulant matrices that is used heavily throughout this thesis concerns the eigen-
values and eigenvectors of circulant matrices. Let v = [c_1 c_2 c_3 . . . c_n]^T be the column
vector constructed from the first row of a circulant matrix C. Then the eigenvalues of
C are given by
λ = Fv, (1.32)
where F is the unitary Fourier matrix [13]. That is, the discrete Fourier transform (DFT)
of the first row of C yields the eigenvalues of C. Further, the eigenvectors of a circulant
matrix C are given by the columns of the Fourier matrix of appropriate dimension. Thus,
C has eigenvalue decomposition
\[ C = F^{*} D F, \tag{1.33} \]
where F is again the Fourier matrix, and D is the diagonal matrix whose elements are
the eigenvalues of C, i.e., D = diag(λ). This means that every circulant matrix of the
same dimension has the same eigenvectors, and that the matrix C is given by
\[ C = F^{*} \operatorname{diag}(\lambda)\, F. \tag{1.34} \]

With this decomposition, a formulation for the inversion of C can easily be obtained.
The inverse of C is then given by

\[ C^{-1} = F\, \operatorname{diag}(\lambda)^{-1} F^{*}. \tag{1.35} \]

This formulation can then be used to solve a linear system. Consider the linear system

\[ Cx = b. \tag{1.36} \]

Left multiplication by C^{-1} yields

\[ x = C^{-1} b. \tag{1.37} \]

Now, substituting for the definition of C^{-1} given by (1.35) yields

\[ x = F\, \operatorname{diag}(\lambda)^{-1} F^{*} b. \tag{1.38} \]

Rearranging gives

\[ \operatorname{diag}(\lambda)\, F^{*} x = F^{*} b. \tag{1.39} \]
Let x̂ = F^* x and b̂ = F^* b; then (1.39) becomes

\[ \operatorname{diag}(\lambda)\, \hat{x} = \hat{b}, \tag{1.40} \]

whose solution is trivial. Therefore, the solution of a linear system equates to computing
three DFTs and a backsolve involving a diagonal matrix. The steps are:

1. Compute λ = Fv.
2. Compute b̂ = F^* b.
3. Solve diag(λ) x̂ = b̂.
4. Compute x = F x̂.

This formulation is advantageous because the most expensive operation needed is the
computation of the DFT, which, in its crudest form, is a matrix-vector multiplication,
and is thus O(n²). However, if permissible, the fast Fourier transform (FFT) can be
used in place of the DFT, and the computation becomes O(n log n).
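
The four steps above translate almost directly into a few lines of NumPy (a minimal sketch, not the thesis code). One caveat: scipy.linalg.circulant is parameterized by the first column rather than the first row, and NumPy's FFT uses the opposite sign convention, so the eigenvalues are obtained here as the FFT of the first column; as long as the forward/inverse pair is used consistently, the solve is unaffected.

import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # first column of C
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

C = circulant(c)                      # dense circulant matrix (built only for the check)
lam = np.fft.fft(c)                   # step 1: eigenvalues of C
b_hat = np.fft.fft(b)                 # step 2: transform the right-hand side
x_hat = b_hat / lam                   # step 3: diagonal backsolve
x = np.fft.ifft(x_hat)                # step 4: transform back

print(np.allclose(C @ x, b))          # True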
Chapter 2
Literature Review
Circulant matrices are a desirable structure in computation because of their re-
lation to the Fast Fourier transform (FFT). Therefore, many variations of circulant
matrices have appeared throughout the literature and in a wide variety of contexts.
These range from the solution of circulant tridiagonal and banded systems [32, 16, 15]
to effective preconditioners [25] and are able to exploit their computational relation to
the FFT.
We are concerned with the solution of linear systems involving block circulant
matrices and assume the blocks in the matrix themselves are dense and contain no
additional structure. The desirable properties extend to the block case as well; namely,
block circulant matrices are block diagonalizable by the block Fourier matrix. The
generalization to the block case, however, means that the inversion/solution formula
must be extended. We first note that every block circulant matrix (BCM) can be mapped
to an equivalent block matrix with circulant blocks (CBM). This can be accomplished by
multiplying by the appropriate permutation matrices. Therefore, algorithms for solving
BCMs and CBMs are equivalent.
Within engineering, when problems with periodicity properties are considered,
block circulant matrices arise in many contexts. These usually result when such periodic
problems are solved by means of integral equations, which includes the BEM. Using
the method of fundamental solutions [17], block circulant matrices in the contexts of
axisymmetric problems in potential theory [21], as well as axisymmetric harmonic and
biharmonic [38], linear elasticity [23, 22], and heat conduction problems [36] have been
investigated. In addition, scattering and radiation problems in electromagnetics have
taken advantage of block circulant matrices for a variety of integral equation techniques
[33, 30, 14, 20] including the BEM [40]. With respect to acoustics, a National Physical
Laboratory tech report discussed some properties of rotationally symmetric problems for
the BEM as applied to the Helmholtz equation [42].
Just as circulant matrices are a subset of Toeplitz matrices, block circulant matri-
ces are a subset of block Toeplitz matrices. Therefore, it is not surprising that one of the
first inversion algorithms applied to block circulant matrices was an inversion algorithm
for block Toeplitz matrices [2]. Closed form solutions for the inversion of block circulant
matrices were formalized in [27] and presented again more concisely in [41]. The se-
quential inversion formula shows that a BCM, A, has the decomposition A = F ∗bDFb, in
which Fb represents the block Fourier matrix, and D represents a block diagonal matrix.
The blocks along the diagonal are obtained by computing the block DFT of the first
block row of A; this means if v is defined to be the first block row of A, D = diagFbv.
The inversion is then given by A−1 = Fb (diagFbv)−1 F ∗b
, and only the blocks of the
block diagonal matrix are inverted. Extending the closed form inversion formulations, an
algorithm for solving a block circulant linear system was developed alongside many vari-
ants of circulant linear systems [10]. The solution of the linear system involving BCMs
resulted from a straightforward application of the inversion formula. Following these ef-
forts, [31] proposed an algorithm for the solution of CBMs. The most recent contribution
to CBMs was given in [39]. The algorithm first diagonalizes each block of the matrix by
the Fourier relation. The matrix is then a block matrix with diagonal blocks. The algo-
rithm decomposes the matrix into a two-by-two block matrix and successively performs
this decomposition to the first principal submatrix until a diagonal matrix is reached.
The diagonal matrix is inverted, and the Schur complement formulation for the inverse
of a two-by-two block matrix is successively used to compute the inversion of the entire
matrix. All inversion/solution formula of consequence use the spectral properties of the
circulant matrices. This is exploited in all aforementioned sequential inversion/solution
algorithms.
While sequential solution algorithms have been fully developed, little work has
been done on parallel algorithms for block circulant linear systems. A parallel solution
for block Toeplitz matrices exists, and parallelizes the generalized Schur algorithm [3].
Yet, using a Toeplitz solver neglects the use of the FFT and potential concurrent cal-
culations found in the BCM inversion formula. In fact, the only work we are aware of
is a parallel solver for electromagnetic problems which considers the axisymmetric case
[29]. The proposed parallel algorithm was for distributed memory systems and paral-
lelized the inversion formulation for BCMs. The assumptions of the work differ from
our own; that is, they assume a larger number of blocks of smaller order, and, in turn,
assumed that the number of processors was some fraction of the number of blocks in the
matrix. This means each processor contained multiple blocks, denoted q, of the BCM.
For each block owned by a processor, the corresponding right-hand side also resides on
that processor. This means that when solving the block diagonal matrix, each processor
could perform the solve of its q blocks simultaneously. However, when solving the linear
system, multiplications by the Fourier matrix are needed. These are needed in order
to: obtain the block diagonal matrix, modify the right-hand side vector, and modify the
solution vector. This distribution means that multiplying by the Fourier matrix requires
communication among the processors. Using the fact that block Fourier transforms can
be decomposed into independent Fourier transforms, it performs an all-to-all communi-
cation to give each processor the data needed to compute an independent FFT. They
tested the algorithm for BCMs with m = 256 blocks of order n = 318, m = 128 blocks
of order n = 189, and m = 64 blocks of order n = 93. This is where our assumptions
diverge significantly, and as a result our algorithm differs significantly in implementation
of the same inversion formula.
Chapter 3
Problem Formulation
Consider a rotationally symmetric vibrating structure, Ω ⊂ R³. The rotational
symmetry implies Ω can be constructed by rotations of a single element around a fixed
axis. Define Ω′ to be a structure in R³, and let Ω′_θ represent the structure obtained
by rotating Ω′ by angle θ. Then, supposing Ω has m rotational symmetries, Ω can be
written as Ω = Ω′_0 ∪ Ω′_{2π/m} ∪ Ω′_{4π/m} ∪ · · · ∪ Ω′_{(m−1)2π/m}; that is,

\[ \Omega = \bigcup_{k=0}^{m-1} \Omega'_{\frac{2\pi k}{m}}. \tag{3.1} \]

For example, for m = 4 the structure Ω can be written as

\[ \Omega = \Omega'_{0} \cup \Omega'_{\pi/2} \cup \Omega'_{\pi} \cup \Omega'_{3\pi/2}. \tag{3.2} \]
Note, the angle θ is relative to an initial orientation of the structure. This means that
the structure being rotated can have any initial orientation; as long as the rotation
is around a fixed axis and the rotation angle is uniform, the constructed structure is
rotationally symmetric. Figure 3.1 shows a real-world example of a structure containing
three rotational symmetries.
Fig. 3.1 A propeller with three times rotational symmetry [37].
3.1 Coefficient Matrix Derivation
Before beginning the algebraic derivation, we first present the underlying intu-
ition. Figure 3.2 shows a sketch of a propeller with four times rotational symmetry.
Consider the effect Ω′_0 has on Ω′_{π/2}, as well as the effect Ω′_{π/2} has on Ω′_π. Because the blades
are identical and dist(Ω′_0, Ω′_{π/2}) = dist(Ω′_{π/2}, Ω′_π), the entries in the coefficient matrix which
describe the effect of Ω′_0 on Ω′_{π/2} and Ω′_{π/2} on Ω′_π will be identical. This continues for the
remaining interactions of this form; therefore, the entries of the coefficient matrix due
to the effect of Ω′_0 on Ω′_{π/2}, Ω′_{π/2} on Ω′_π, Ω′_π on Ω′_{3π/2}, and Ω′_{3π/2} on Ω′_0 will be identical. This
same idea is used for all of the remaining interactions to finish populating the coefficient
matrix. The equality between interactions due to symmetry is what leads to the block
circulant structure of the coefficient matrix.
Fig. 3.2 A four times rotationally symmetric sketch of a propeller.
This decomposition of the initial structure in R3 into the union of rotated struc-
tures gives insight into the structure of the coefficient matrix. Recall, in the derivation
of the BEM, the solution over the boundary of the structure must first be solved in order
to obtain the solution in the exterior domain. Consider only the base element Ω′ = Ω′0
before any rotations. For clarity, we suppose m = 2 and use the standard boundary
integral formulations given by (1.4) and (1.5). The integral formulations which promise
uniqueness follow in the same manner. Assuming a Neumann boundary condition and
rearranging into knowns and unknowns, the equation over the boundary of Ω′_0 is given by

\[ \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_0) - 2\pi u(p) = \int_{\partial\Omega'_0} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_0), \qquad p \in \partial\Omega'_0. \tag{3.3} \]
Next, consider the solution of u over the boundary element ∂Ω′_{π/2}; that is, the
boundary surface obtained by rotating the base element Ω′_0 by 90 degrees. This yields
the following boundary integral formulation

\[ \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_{\pi/2}) - 2\pi u(p) = \int_{\partial\Omega'_{\pi/2}} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_{\pi/2}), \qquad p \in \partial\Omega'_{\pi/2}. \tag{3.4} \]
As stand-alone structures, Ω′_0 and Ω′_{π/2} are identical aside from their orientation. The
boundaries, ∂Ω′_0 and ∂Ω′_{π/2}, are unaffected by rotations and are therefore identical. Equations
(3.3) and (3.4) involve only points on the boundary and, therefore, assuming the
Neumann conditions are identical for both equations, equality holds. Note, by the
uniqueness, and the equality for identical right-hand sides, it follows that the left-hand sides
must be identical.
Intuitively, (3.3) shows the relation between a point p on ∂Ω′_0 and all the points
q on ∂Ω′_0. If a point p is chosen on ∂Ω′_0, all of the points on ∂Ω′_0 contribute to the value
of u at that point. In this sense, an N-body problem is being solved. Similarly, if a
point p is chosen on ∂Ω′_{π/2}, all of the points on ∂Ω′_{π/2} contribute to the value of u at that
point; however, ∂Ω′_0 and ∂Ω′_{π/2} are identical. Therefore, under identical boundary conditions,
the same N-body problem is being solved.
Now, consider the solution of u over the boundary of the structure obtained by
combining the two aforementioned structures, Ω′_0 and Ω′_{π/2}. The boundary is then given
by ∂Ω = ∂Ω′_0 ∪ ∂Ω′_{π/2} and the integral equation is

\[ \int_{\partial\Omega} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega), \qquad p \in \partial\Omega. \tag{3.5} \]
Using the rotational symmetries, equation (3.5) becomes
\[ \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_0) + \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_{\pi/2}) - 2\pi u(p) = \tag{3.6} \]
\[ \int_{\partial\Omega'_0} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_0) + \int_{\partial\Omega'_{\pi/2}} \frac{\partial u(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_{\pi/2}), \qquad p \in \partial\Omega'_0 \cup \partial\Omega'_{\pi/2}. \]
Redefine v1(p) = u(p) for p ∈ ∂Ω′_0 and v2(p) = u(p) for p ∈ ∂Ω′_{π/2}. In addition, define

\[ \Gamma_0[v_1] = \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} v_1(q)\, d(\partial\Omega'_0), \qquad \Gamma_{\pi/2}[v_2] = \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} v_2(q)\, d(\partial\Omega'_{\pi/2}), \]
\[ \Sigma_0 = \int_{\partial\Omega'_0} \frac{\partial v_1(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_0), \qquad \text{and} \qquad \Sigma_{\pi/2} = \int_{\partial\Omega'_{\pi/2}} \frac{\partial v_2(q)}{\partial n_q} G(p,q)\, d(\partial\Omega'_{\pi/2}). \]

Note, the variables v1(p) and v2(p) are unknowns, and, therefore, Γ0[v1] and Γπ/2[v2] are
defined as operators; whereas, Σ0 and Σπ/2 are known quantities and are treated as known
values. Using the newly-defined quantities, (3.6) can be split into two simultaneous
equations over ∂Ω′_0 and ∂Ω′_{π/2}:

\[ \Gamma_0[v_1] + \Gamma_{\pi/2}[v_2] - 2\pi v_1(p) = \Sigma_0 + \Sigma_{\pi/2}, \qquad p \in \partial\Omega'_0, \tag{3.7} \]
\[ \Gamma_0[v_1] + \Gamma_{\pi/2}[v_2] - 2\pi v_2(p) = \Sigma_0 + \Sigma_{\pi/2}, \qquad p \in \partial\Omega'_{\pi/2}. \]
Upon appropriate discretization, (3.7) can be written as the following linear system

\[ \begin{bmatrix} \Gamma_0 - 2\pi I & \Gamma_{\pi/2} \\ \Gamma_0 & \Gamma_{\pi/2} - 2\pi I \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} \Sigma_0 + \Sigma_{\pi/2} \\ \Sigma_0 + \Sigma_{\pi/2} \end{bmatrix}, \tag{3.8} \]
where I is the identity matrix. Let A denote the coefficient matrix in (3.8) and consider
the entries (Γ0 − 2πI) and (Γπ/2 − 2πI). By the previous arguments in establishing the
equivalence of (3.3) and (3.4), it follows that

\[ (\Gamma_0 - 2\pi I) = (\Gamma_{\pi/2} - 2\pi I). \tag{3.9} \]

This is true even when the right-hand sides of (3.3) and (3.4) are not identical. With
this relation established, define A1 = (Γ0 − 2πI) = (Γπ/2 − 2πI). Similarly, consider the
entries Γ0 and Γπ/2. We would like to show Γ0 = Γπ/2. By definition,

\[ \Gamma_0[v_1] = \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} v_1(q)\, d(\partial\Omega'_0), \tag{3.10} \]
and upon discretization as described in Section 1.2, we obtain

\[ \Gamma_0[v_1] = \sum_{i=1}^{N} (v_1)_i \left( \int_{[\partial\Omega'_0]_i} \frac{\partial G(p,q)}{\partial n_q}\, d\left([\partial\Omega'_0]_i\right) \right). \tag{3.11} \]
The quantity Γ0[v1] becomes the product Γ0v1, in which v1 is the discretization of the
unknown v1(q), and Γ0 is a matrix of known quantities populated by integrating the
normal derivative of the Green’s function over the individual surface elements of ∂Ω′0.
In considering the discretization of Γπ/2[v2], we obtain

\[ \Gamma_{\pi/2}[v_2] = \sum_{i=1}^{N} (v_2)_i \left( \int_{[\partial\Omega'_{\pi/2}]_i} \frac{\partial G(p,q)}{\partial n_q}\, d\left([\partial\Omega'_{\pi/2}]_i\right) \right). \tag{3.12} \]
Again, the quantity Γπ/2[v2] becomes the product Γπ/2 v2, in which v2 is the discretization of
the unknown v2(q), and Γπ/2 is a matrix of known quantities populated by integrating the
normal derivative of the Green's function over the individual surface elements of ∂Ω′_{π/2}.
Assuming the discretizations of the boundaries are the same, because the boundaries ∂Ω′_0
and ∂Ω′_{π/2} are identical, the values populating Γ0 and Γπ/2 are identical, and thus Γ0 = Γπ/2.
Let A2 = Γ0 = Γπ/2; then, with the previously established definition A1 = (Γπ/2 − 2πI) =
(Γ0 − 2πI), the matrix, A, comprising the linear system (3.8) has the form

\[ A = \begin{bmatrix} A_1 & A_2 \\ A_2 & A_1 \end{bmatrix}, \tag{3.13} \]
which is a 2 × 2 block circulant matrix. In general, given m rotational symmetries, an
m×m block circulant matrix can be obtained.
3.2 Block Circulant Inversion
Let N = nm. The coefficient matrix A ∈ C^{N×N} arising from the BEM applied to
an acoustic radiation problem with a rotationally symmetric boundary surface has the
form

\[ A = \begin{bmatrix} A_1 & A_2 & \cdots & A_m \\ A_m & A_1 & \cdots & A_{m-1} \\ A_{m-1} & A_m & \cdots & A_{m-2} \\ \vdots & \vdots & \ddots & \vdots \\ A_2 & A_3 & \cdots & A_1 \end{bmatrix}, \tag{3.14} \]
where each A_j, j = 1, . . . , m, is contained in C^{n×n} and is dense. The matrix A is block
circulant and therefore can be represented by circular shifts of its first block row. The
circulant structure of A is contained in the m blocks forming the first block row of A.
Therefore, in order to perform block DFT operations, we need to scale the Fourier matrix
F ∈ C^{m×m} to the block Fourier matrix F_b ∈ C^{N×N}. The Fourier matrix F is defined as

\[ F = \frac{1}{\sqrt{m}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega_m^{1} & \omega_m^{2} & \cdots & \omega_m^{m-1} \\ 1 & \omega_m^{2} & \omega_m^{4} & \cdots & \omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_m^{m-1} & \omega_m^{2(m-1)} & \cdots & \omega_m^{(m-1)(m-1)} \end{bmatrix}, \tag{3.15} \]

where ω_m = e^{i2π/m}, i = √−1, and normalizing by 1/√m makes F unitary. Scaling each
element of F by the n × n identity matrix, I_n, produces the block Fourier matrix F_b.
This is equivalent to the Kronecker product F ⊗ In. After scaling, we have the block
Fourier matrix
\[ F_b = \frac{1}{\sqrt{m}} \begin{bmatrix} I_n & I_n & I_n & \cdots & I_n \\ I_n & I_n\omega_m^{1} & I_n\omega_m^{2} & \cdots & I_n\omega_m^{m-1} \\ I_n & I_n\omega_m^{2} & I_n\omega_m^{4} & \cdots & I_n\omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_n & I_n\omega_m^{m-1} & I_n\omega_m^{2(m-1)} & \cdots & I_n\omega_m^{(m-1)(m-1)} \end{bmatrix}. \tag{3.16} \]
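
For small cases, F_b can be formed directly from the Kronecker-product relation F_b = F ⊗ I_n (a NumPy illustration under the ω_m = e^{i2π/m} convention; in the parallel algorithm of Chapter 4, F_b is never assembled explicitly):

import numpy as np

def block_fourier_matrix(m, n):
    kk = np.arange(m).reshape(-1, 1)
    jj = np.arange(m).reshape(1, -1)
    F = np.exp(2j * np.pi * kk * jj / m) / np.sqrt(m)   # unitary m-by-m Fourier matrix (3.15)
    return np.kron(F, np.eye(n))                        # block Fourier matrix F_b = F ⊗ I_n (3.16)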
Next, the DFT relations needed for the inversion formula are established. Let X ∈ C^{N×n}
be the block column vector containing the first block row of A. The block DFT of X is
given by X̂ = F_b X; that is,

\[ \begin{bmatrix} \hat{A}_1 \\ \hat{A}_2 \\ \hat{A}_3 \\ \vdots \\ \hat{A}_m \end{bmatrix} = \frac{1}{\sqrt{m}} \begin{bmatrix} I_n & I_n & I_n & \cdots & I_n \\ I_n & I_n\omega_m^{1} & I_n\omega_m^{2} & \cdots & I_n\omega_m^{m-1} \\ I_n & I_n\omega_m^{2} & I_n\omega_m^{4} & \cdots & I_n\omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_n & I_n\omega_m^{m-1} & I_n\omega_m^{2(m-1)} & \cdots & I_n\omega_m^{(m-1)(m-1)} \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \\ A_3 \\ \vdots \\ A_m \end{bmatrix}, \tag{3.17} \]
which is nothing more than a DFT of length m with n×n matrices as coefficients in the
transform. Using the formulation of the inverse in [41], we have
\[ A^{-1} = F_b\, \operatorname{diag}\{\hat{A}_1^{-1}, \hat{A}_2^{-1}, \ldots, \hat{A}_m^{-1}\}\, F_b^{*}, \tag{3.18} \]

where diag{Â_1^{-1}, Â_2^{-1}, . . . , Â_m^{-1}} is a block diagonal matrix whose diagonal blocks
are precisely the inverses of the blocks obtained from the DFT of the first block row of
A. From the formula, we can derive the algorithm for the solution of a linear system.
Consider the system Ax = b; multiplying by A^{-1} yields

\[ x = A^{-1} b. \tag{3.19} \]

Substituting in the definition for A^{-1} from (3.18), we obtain

\[ x = F_b\, \operatorname{diag}\{\hat{A}_1^{-1}, \hat{A}_2^{-1}, \ldots, \hat{A}_m^{-1}\}\, F_b^{*}\, b. \tag{3.20} \]

Rearranging yields

\[ \operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\, F_b^{*} x = F_b^{*} b. \tag{3.21} \]

Let x̂ = F_b^* x and b̂ = F_b^* b. This yields

\[ \operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\, \hat{x} = \hat{b}. \tag{3.22} \]

Blocking the vectors x̂ and b̂ to match the block sizes of each Â_j, it is easy to see we
obtain m independent linear systems to solve:

\[ \hat{A}_j \hat{x}_j = \hat{b}_j, \qquad j = 1, \ldots, m. \tag{3.23} \]
The steps for solution of the linear system Ax = b are given by Algorithm 3.1. Each
multiplication by the matrix F_b or F_b^*
represents a block DFT or inverse DFT (IDFT)
operation, respectively. It is worth noting that the system solves in line 3 of the algorithm
are completely independent, and thus make the algorithm very amenable to parallel
implementation, as noted in [35].
Algorithm 3.1 Pseudocode for the sequential solution of a block circulant linear system.
1: Compute b̂ = F_b^* b;
2: Compute X̂ = F_b X;
3: Solve Â_j x̂_j = b̂_j, j = 1, . . . , m;
4: Compute x = F_b x̂;
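
A serial NumPy sketch of Algorithm 3.1 is given below (illustrative only, not the thesis implementation). It performs the block DFTs with NumPy's FFT along the block index, which absorbs the normalization of F_b as long as the forward/inverse pair is applied consistently, and it assumes the first block row is stored as an array blocks of shape (m, n, n).

import numpy as np

def solve_block_circulant(blocks, b):
    # Solve A x = b where A is block circulant with first block row
    # (blocks[0], ..., blocks[m-1]), each block n-by-n.
    m, n, _ = blocks.shape
    A_hat = m * np.fft.ifft(blocks, axis=0)          # block DFT of the first block row
    b_hat = np.fft.fft(b.reshape(m, n), axis=0)      # transformed right-hand side
    x_hat = np.stack([np.linalg.solve(A_hat[k], b_hat[k]) for k in range(m)])   # m independent solves (3.23)
    return np.fft.ifft(x_hat, axis=0).reshape(m * n) # transform back and flatten

# Check against a dense solve for a small random example:
# rng = np.random.default_rng(0)
# m, n = 4, 3
# blocks = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))
# A = np.block([[blocks[(j - i) % m] for j in range(m)] for i in range(m)])
# b = rng.standard_normal(m * n) + 1j * rng.standard_normal(m * n)
# assert np.allclose(A @ solve_block_circulant(blocks, b), b)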
3.3 Invertibility
Algorithm 3.1 requires the inversion of the blocks obtained from computing the
DFT of the first block row of A. Therefore, assumptions on the invertibility of these
blocks are required by the algorithm. The section will show that if the initial matrix A
is assumed to be nonsingular, then each diagonal block is also nonsingular.
In order to facilitate the proof, we first show that the block Fourier matrix given
in (3.16) is unitary.
Lemma 3.1. The block Fourier matrix, Fb, as defined in (3.16) is unitary.
Proof. Recall, the N × N block Fourier matrix Fb can be constructed as a Kronecker
product of the unitary m×m Fourier matrix F with the n×n identity matrix In. That
is,
Fb = F ⊗ In. (3.24)
By the properties of Kronecker products [13], we have (A ⊗ B)^* = A^* ⊗ B^*. Therefore,

\[ F_b^{*} = (F \otimes I_n)^{*} = F^{*} \otimes I_n^{*} = F^{*} \otimes I_n. \tag{3.25} \]

So F_b^* can be constructed in the same fashion. Now consider F_b^{-1}. By the Kronecker
product property (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, for square nonsingular A and B, we have

\[ F_b^{-1} = F^{-1} \otimes I_n^{-1}. \tag{3.26} \]

However, the Fourier matrix F is unitary, and thus

\[ F_b^{-1} = F^{*} \otimes I_n. \tag{3.27} \]

It has been established that F_b^* = F^* ⊗ I_n, and, therefore,

\[ F_b^{-1} = F_b^{*}. \tag{3.28} \]

Thus F_b is unitary.
Theorem 3.1. Given a nonsingular block circulant matrix A, the block diagonal matrix
diag{Â_1, Â_2, . . . , Â_m} is nonsingular, where the Â_j, j = 1, . . . , m, are the blocks obtained
by computing the block Fourier transform of the first block row of A.

Proof. Since A is block circulant, we have

\[ A = F_b^{*}\, \operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\, F_b. \tag{3.29} \]

Taking the determinant yields

\[ \det(A) = \det\left(F_b^{*}\, \operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\, F_b\right). \tag{3.30} \]

Using a property of determinants, we obtain

\[ \det(A) = \det(F_b^{*})\, \det\left(\operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\right)\, \det(F_b). \tag{3.31} \]

By Lemma 3.1, F_b is unitary, and, thus, det(F_b^*) det(F_b) = det(F_b^* F_b) = det(I) = 1; therefore,

\[ \det(A) = \det\left(\operatorname{diag}\{\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\}\right). \tag{3.32} \]

Using the relation for the determinant of block diagonal matrices, we have

\[ \det(A) = \det(\hat{A}_1)\det(\hat{A}_2)\cdots\det(\hat{A}_m). \tag{3.33} \]

Because A is nonsingular, det(A) ≠ 0, and, therefore, det(Â_j) ≠ 0 for j = 1, . . . , m. It
follows that each Â_j, j = 1, . . . , m, is nonsingular, and, therefore, diag{Â_1, Â_2, . . . , Â_m}
is nonsingular.
Chapter 4
Parallel Solution Algorithm
4.1 Block DFT Algorithm
While it is enticing to develop the algorithm around the Fast Fourier Transform
(FFT), the robustness of the algorithm will be lost. Recall that the length of the DFT is
determined by the number of symmetries of the boundary surface. For problems involving
real world structures, such as propellers or wind turbines, the number of symmetries will
be small, e.g., m ≤ 30. Indeed, even if a structure contained symmetries arising every
one degree, i.e., m = 360, there must be at least one surface element in the discretization
representing the symmetry, meaning n ≥ 360. This case is somewhat pathological, and,
in general, we assume each symmetry has a large number of surface elements. This
means it can be reasonably assumed that m ≪ n. In addition, FFTs make assumptions
on the properties of m. The most common assumption being that m is a power of two.
While there are now FFT algorithms for any value of m [11, 26], the algorithms applied
to feasible sizes of m have negligible benefits due to constants in the computation. We
therefore designed our algorithm to be robust in the sense that it will work for any
boundary surface input, and therefore we use a matrix multiplication DFT approach.
We derive the algorithm in the context of computing the block DFT of the first
block row of A, given by (3.17), as this computation is needed during the system solve.
Define P to be the number of processors and assume P = m. The initial data distribution
is obtained by assigning each submatrix Aj to processor Pj , for j = 1, . . . ,m. The initial
data distribution for P = m = 4 is illustrated in Figure 4.1.
Fig. 4.1 Initial data distribution assumed in the DFT computation for the case P = m =
4.
Expanding the DFT relation X̂ = F_b X, we obtain

\[ \begin{aligned} \hat{A}_1 &= A_1 + A_2 + A_3 + \cdots + A_m \\ \hat{A}_2 &= A_1 + A_2\omega_m^{1} + A_3\omega_m^{2} + \cdots + A_m\omega_m^{m-1} \\ \hat{A}_3 &= A_1 + A_2\omega_m^{2} + A_3\omega_m^{4} + \cdots + A_m\omega_m^{2(m-1)} \\ &\;\;\vdots \\ \hat{A}_m &= A_1 + A_2\omega_m^{m-1} + A_3\omega_m^{2(m-1)} + \cdots + A_m\omega_m^{(m-1)(m-1)}. \end{aligned} \tag{4.1} \]
Given this initial data distribution, in the computation of Â_1, processor P_1 already
contains a portion of the summation, namely A_1. In fact, in all of the Â_j computations,
each processor contains a scaled portion of the corresponding summation. In addition,
the scalar values ω_m^{(k−1)(j−1)}, for j, k = 1, . . . , m, are computable. This means that for
the cost of scaling a submatrix by a term in the Fourier matrix, we already have a portion
of the computation of each Â_j, j = 1, . . . , m. The algorithm expands on this idea to
compute the entire summation.
Starting from the initial data distribution, each processor computes the portion
of the summation that corresponds to the data owned. Then, each Pi cyclically sends its
submatrix to Pi−1 (P1 sends its data to Pm). Each processor computes the corresponding
term in the summation and propagates the submatrix. The computation completes after
m− 1 communications. Figure 4.2 illustrates this process for the case P = m = 4.
Fig. 4.2 The DFT computation for the case P = m = 4. Each arrow indicates the communication of a processor's owned submatrix to a neighboring processor in the direction of the arrow.
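
A bare-bones mpi4py sketch of this P = m cyclic scheme is shown below. It is an illustration under several assumptions, not the thesis implementation: one rank per block, a blocking sendrecv exchange instead of the overlapped asynchronous version described later, hypothetical local data, and the unnormalized summations of (4.1).

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, m = comm.Get_rank(), comm.Get_size()       # one rank per block, P = m

n = 4
A_local = np.full((n, n), rank + 1.0, dtype=complex)   # hypothetical local block A_{rank+1}
omega = np.exp(2j * np.pi / m)

block = A_local.copy()
A_hat = np.zeros((n, n), dtype=complex)          # accumulates the transformed block

for t in range(m):
    j = (rank + t) % m                           # original owner of the block currently held
    A_hat += block * omega ** (rank * j)         # add this block's term of the summation (4.1)
    if t < m - 1:
        # cyclically pass the block to the left neighbor, receive from the right
        block = comm.sendrecv(block, dest=(rank - 1) % m, source=(rank + 1) % m)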
The algorithm can be generalized to the case with P = cm, where c ∈ Z⁺, by
observing that a block DFT with a block size of n × n can be broken into n² independent
DFTs with block size 1. To see this, consider the kth summation taken from (4.1). We
then have

\[ \hat{A}_k = A_1 + A_2\omega_m^{(k-1)} + A_3\omega_m^{2(k-1)} + \cdots + A_m\omega_m^{(m-1)(k-1)}. \tag{4.2} \]
Recall that A_j ∈ C^{n×n} for each j = 1, . . . , m. For illustrative purposes, let n = 2. Then
(4.2) becomes

\[ \begin{bmatrix} \hat{a}^{k}_{11} & \hat{a}^{k}_{12} \\ \hat{a}^{k}_{21} & \hat{a}^{k}_{22} \end{bmatrix} = \begin{bmatrix} a^{1}_{11} & a^{1}_{12} \\ a^{1}_{21} & a^{1}_{22} \end{bmatrix} + \begin{bmatrix} a^{2}_{11} & a^{2}_{12} \\ a^{2}_{21} & a^{2}_{22} \end{bmatrix} \omega_m^{(k-1)} + \begin{bmatrix} a^{3}_{11} & a^{3}_{12} \\ a^{3}_{21} & a^{3}_{22} \end{bmatrix} \omega_m^{2(k-1)} + \cdots + \begin{bmatrix} a^{m}_{11} & a^{m}_{12} \\ a^{m}_{21} & a^{m}_{22} \end{bmatrix} \omega_m^{(m-1)(k-1)}, \tag{4.3} \]
where the superscript k indicates that â^k_{ij} is an element of Â_k. From here, the computation of the
elements of Â_k can be written as the following n² = 4 independent summations:

\[ \begin{aligned} \hat{a}^{k}_{11} &= a^{1}_{11} + a^{2}_{11}\omega_m^{(k-1)} + a^{3}_{11}\omega_m^{2(k-1)} + \cdots + a^{m}_{11}\omega_m^{(m-1)(k-1)} \\ \hat{a}^{k}_{12} &= a^{1}_{12} + a^{2}_{12}\omega_m^{(k-1)} + a^{3}_{12}\omega_m^{2(k-1)} + \cdots + a^{m}_{12}\omega_m^{(m-1)(k-1)} \\ \hat{a}^{k}_{21} &= a^{1}_{21} + a^{2}_{21}\omega_m^{(k-1)} + a^{3}_{21}\omega_m^{2(k-1)} + \cdots + a^{m}_{21}\omega_m^{(m-1)(k-1)} \\ \hat{a}^{k}_{22} &= a^{1}_{22} + a^{2}_{22}\omega_m^{(k-1)} + a^{3}_{22}\omega_m^{2(k-1)} + \cdots + a^{m}_{22}\omega_m^{(m-1)(k-1)}. \end{aligned} \tag{4.4} \]
The independence of each summation permits us, given a sufficient number of processors,
to perform these summations simultaneously. In a more general setting, this equates to
partitioning each Aj , j = 1, . . . ,m, into smaller block sizes, and then simultaneously
performing block DFTs of this smaller block size.
Now that it has been established that a block DFT can be broken down into block
DFTs of smaller block size, we explain how to exploit this in the P = cm case. Let c = 4,
i.e., P = 4m, and partition each A_j, j = 1, . . . , m, into c = 4 blocks of size n/√c × n/√c.
Note that the block size is arbitrary; therefore, if √c is not an integer, the submatrix is
simply split into c blocks with slightly different block size. The data decomposition can
be seen in Figure 4.3, where again the superscript k indicates that Â^k_{ij} is a block of Â_k.
Fig. 4.3 Parallel block DFT data decomposition for P > m.
We rewrite these as c = 4 independent block DFTs of block size n/√c × n/√c. We then
group the processors into c = 4 processor groups of size m. Grouping the processors, we
obtain four DFTs in the form presented when P = m. Figure 4.4 shows the processor
group organization. We then apply the P = m DFT algorithm within each processor
group simultaneously. Therefore, when P = cm, we can decompose each Aj into c
independent block DFTs of smaller block size. This decomposition can proceed all the
way down until each A_j is decomposed into n² independent DFTs of block size 1. In
this case, c = n², i.e., P = n²m, and n² one-dimensional DFTs are being performed
simultaneously.
Since the most expensive part of computing the blocked DFT is the communi-
cation of the submatrices, it is desirable to overlap communication and computation as
much as possible. With this in mind, we introduce asynchronous send/receives. Start-
ing from the P = m initial data distribution, begin by the asynchronous send of the
processor’s owned submatrix, followed by the asynchronous receive of the neighboring
processor’s submatrix.
Fig. 4.4 Parallel block DFT data decomposition and processor groupings for P > m.
While the processor’s current submatrix data is being sent, a
neighboring processor’s submatrix is being received. During this communication, the
data being sent is still able to be used because no modifications are being made. The
data being sent is then used to update the partial sum. Therefore, we are sending,
receiving, and computing the partial sum simultaneously.
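The following mpi4py sketch illustrates one way such an overlapped ring exchange could look; it is a simplified stand-in for the FORTRAN 90/MPI implementation, and the array sizes, tags, and the assumption P = m are purely illustrative.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()                                # assume P = m for this sketch
rank = comm.Get_rank()
n = 8                                              # block order (illustrative)
omega = np.exp(-2j * np.pi / P)

my_block = np.full((n, n), rank + 1, dtype=complex)    # stand-in for the owned submatrix
dest, src = (rank - 1) % P, (rank + 1) % P              # cyclic send to P_{i-1}, receive from P_{i+1}

recv_buf = np.empty((n, n), dtype=complex)
block, owner = my_block, rank                      # block currently held and its original index
partial = np.zeros((n, n), dtype=complex)          # running sum for the transformed block

for step in range(P):
    if step < P - 1:                               # post the exchange before computing
        req_send = comm.Isend(block, dest=dest, tag=step)
        req_recv = comm.Irecv(recv_buf, source=src, tag=step)
    partial += block * omega ** (owner * rank)     # use the held block while it is in flight
    if step < P - 1:
        MPI.Request.Waitall([req_send, req_recv])
        block, owner = recv_buf.copy(), (owner + 1) % P

# 'partial' now holds this processor's transformed block of the block DFT.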
There is a cost associated with the communication overlap. The cost is in the
amount of memory being used to enable this overlapped communication/computation.
Three times the amount of memory is now being used, the unmodified submatrix, neigh-
boring processor’s unmodified submatrix, and running partial sum for the transformed
submatrix. However, the amount of extra memory used can be managed by only com-
municating portions of a submatrix at a time. While theoretically it is best to minimize
the communication startups, in practice, for large volumes of data, it is beneficial to
send the data spread over a number of smaller packets. This blocking factor for optimal
communication times is system dependent, but it also gives a parameter which can be
modified when memory consumption becomes an issue.
Note that this algorithm is used for both the transformation by Fb and the transformation by F∗b. The only
difference is the Fourier matrix used. When referring to the parallel algorithm,
we differentiate the use of Fb and F∗b as the parallel DFT and IDFT, respectively.
4.2 Block FFT Algorithm
As mentioned in Section 4.1, the FFT is difficult to apply when considering an
arbitrary number of rotational symmetries, m, because of its restriction on the value of
m, i.e., power of two in the radix-2 algorithm. In certain cases however, when the FFT
is applicable, it can effectively be used. A relevant example concerns acoustic radiation
problems involving axisymmetric structures. These problems deal with structures ob-
tained by rotating a two-dimensional object around a third fixed, orthogonal axis. For
example, cylinders or spheres are types of axisymmetric structures. By considering the
structure of a propeller, or fan blade, it can be readily deduced that while all axisym-
metric structures are rotationally symmetric, not all rotationally symmetric structures
are axisymmetric; that is, axisymmetric structures are a subset of rotationally symmetric
structures. The advantage of axisymmetric structures comes from the ability to choose
the number of rotational symmetries in the discretization of the problem. Being able to
choose the value of m means that it can be chosen to exploit the FFT.
Section 4.1 began by detailing a DFT algorithm for the P = m case. It then ex-
tended the algorithm to the P = cm case by breaking the block DFT into c independent
block DFTs of smaller blocksize. The algorithm then constructs c processor groups, each
with m processors, around the decompositions. It then uses the P = m algorithm within
each processor group to simultaneously compute the block DFTs of smaller blocksize.
The FFT algorithm keeps the exact same framework as the DFT algorithm. The differ-
ence arises in how the P = m algorithm computes the DFT; in this case, a distributed
FFT algorithm is used.
In order to derive the parallel algorithm, consider the sequential FFT algorithm
given by Algorithm 1.2; the accompanying discussion in Section 1.3 concerned the pattern
of interaction between elements of the initial input vector in producing the transformed
vector. Indeed, this is the essence of the FFT. Figure 1.1 gave a visualization of the
interaction pattern; in addition, it also showed how the data migrated to a bit reversed
order. This is important. The parallel algorithm will distribute each element of the
input vector onto different processors, and these element interactions will become com-
munication patterns. The algorithm used to compute the distributed one-dimensional
FFT has been termed the binary exchange algorithm [24]. Only small modifications to
Algorithm 1.2 are needed to fit the parallel case.
As in Section 4.1, we present the algorithm in the context of computing the block
FFT of the first block row of A. Define P to be the number of processors and assume
P = m. The initial data distribution is obtained by assigning each submatrix Aj to
processor Pj , for j = 1, . . . ,m. The initial data distribution for P = m = 4 can again be
seen in Figure 4.1.
Now, consider Algorithm 4.1, which is the parallel FFT algorithm
resulting from simple modifications to Algorithm 1.2.
Algorithm 4.1 Distributed Radix-2 FFT pseudocode [24].
1: Y = Radix-2FFT(X, Y, n)
2: r = log n;
3: R = X;
4: for m = 0 to r − 1 do
5:   S = R;
6:   // Let (b0 b1 . . . br−1) be the binary representation of pid
7:   j = (b0 . . . bm−1 0 bm+1 . . . br−1);
8:   k = (b0 . . . bm−1 1 bm+1 . . . br−1);
9:   r = (bm bm−1 . . . b0 0 . . . 0);
10:  if pid == j then
11:    Send Apid to processor k
12:    Receive Ak from processor k
13:    Apid = Apid + Ak ω_n^r;
14:  else
15:    Receive Aj from processor j
16:    Send Apid to processor j
17:    Apid = Aj + Apid ω_n^r;
18:  end if
19: end for
20: Y = R;
The first difference to note is that the second loop in Algorithm 1.2 is no longer
needed. The iteration variable i served two purposes: identifying which element of the initial
vector to update, and determining the other elements involved in the computation. As
each processor only has one element, there is no question which element each processor
is responsible for updating. The second property remains intact because each Ai is
contained on the processor with pid = i, where pid is the processor id. During each
iteration of Algorithm 4.1, each processor needs one extra piece of data to perform the
update to the owned data. Each processor uses its processor id to compute which element
it needs to complete the current computation. By determining the element number, the
pid of the processor which owns the data is determined; this can then be used to set up
the communication to obtain the data. Figure 4.5 illustrates this process.
Fig. 4.5 Process illustrating the distributed FFT. Lines crossing to different processors
indicate communication from left to right. Note the output is in reverse bit-reversed
order relative to numbering starting at zero; that is, A1 is element 0; A2 is element 1,
etc.
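For illustration, the partner computation in lines 7–8 of Algorithm 4.1 reduces to flipping a single bit of the processor id; the helper below is a hypothetical Python restatement written only to make that point, not code taken from the implementation.

def partner(pid, stage, r):
    # Flip bit b_stage of pid, where pid is written with r bits as (b0 b1 ... b_{r-1}),
    # b0 being the most significant bit, as in Algorithm 4.1.
    return pid ^ (1 << (r - 1 - stage))

# Example with P = m = 8 (r = 3): processor 3 = (011) exchanges with
# 7 = (111) at stage 0, 1 = (001) at stage 1, and 2 = (010) at stage 2.
assert [partner(3, s, 3) for s in range(3)] == [7, 1, 2]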
The extension to the P = cm case is identical to the discussion in Section 4.1.
The block DFT can be decomposed into c independent block DFTs of smaller block size.
The processors then create c processor groupings and simultaneously perform the P = m
FFT algorithm. The advantage to computing using the distributed FFT in this way is
that the number of communications is minimized. The parallel DFT algorithm requires
O(m) communications; whereas, in using the FFT, the number of communications is
O(logm). Although we have assumed m to be quite small, in the P = m case each
communication requires that n^2 data elements be sent. This means the packet sizes are
quite large; therefore, any minimization to the number of communications is beneficial.
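A serial NumPy sketch of a decimation-in-frequency radix-2 combine over the m blocks is given below; it is a standard variant used here only to illustrate the log2(m) combine stages and the bit-reversed output, not a line-by-line transcription of Algorithm 4.1.

import numpy as np

def bit_reverse(k, bits):
    return int(format(k, "0{}b".format(bits))[::-1], 2)

m, n = 8, 3                                        # m must be a power of two
rng = np.random.default_rng(1)
A = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))

X = A.copy()
L = m
while L > 1:                                       # log2(m) combine stages
    half = L // 2
    w = np.exp(-2j * np.pi / L)
    for start in range(0, m, L):
        for i in range(half):
            u, v = X[start + i].copy(), X[start + i + half].copy()
            X[start + i] = u + v
            X[start + i + half] = (u - v) * w ** i
    L = half

bits = int(np.log2(m))
direct = np.array([sum(A[j] * np.exp(-2j * np.pi * j * k / m) for j in range(m)) for k in range(m)])
print(all(np.allclose(X[bit_reverse(k, bits)], direct[k]) for k in range(m)))   # True: output is in bit-reversed order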
4.3 System Solves
The goal is to solve all systems in line 3 of Algorithm 3.1 simultaneously. In
addition, the ScaLAPACK routine PZGESV is used to further parallelize each system
solve. By using ScaLAPACK, we are forced to work within the limits of its required
data distribution and processor organization. In particular, the matrix data must be
distributed in a block cyclic fashion, and the processors logically arranged in a grid
format [6]. Using these restrictions, the initial system is set up as follows. Assume
P = cm processors with c ∈ Z+. Now define m processor grids of size √c × √c and
denote them by Gi, i = 1, . . . ,m. If √c is not an integer, the processors are arranged
in a rectangular grid format such that the numbers of rows and columns are integers.
Figure 4.6 illustrates the grid creation process for P = 16 and m = 4.
Fig. 4.6 Processor grid creation for P=16 and m=4.
Next, each Aj and corresponding right-hand side bj are block cyclically distributed
over process grid Gj for j = 1, . . . ,m. We require that the block cyclic distribution be
performed using the same blocking factor for each Aj and bj . Each Gj is then in a
position where it can solve a system involving Aj and bj. However, before these system
solves can be performed, the left- and right-hand sides must be transformed by the DFT
and IDFT, respectively.
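As a rough illustration of this organization (the actual implementation builds BLACS process grids, so the mpi4py communicator splits and the value of m below are only illustrative), P = cm processors can be split into m grids of c processors for the system solves and, orthogonally, into c groups of m processors for the block DFT/IDFT.

from mpi4py import MPI

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
m = 4                                              # number of rotational symmetries (illustrative)
assert P % m == 0
c = P // m

grid_id = rank // c                                # which of the m grids G_1..G_m this processor joins
grid_comm = comm.Split(color=grid_id, key=rank)    # c processors per grid: used for the system solve

group_id = rank % c                                # which of the c DFT/IDFT groups this processor joins
group_comm = comm.Split(color=group_id, key=rank)  # m processors per group: one from each grid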
4.4 Parallel Algorithm
We have established an initial data distribution which can be used by ScaLAPACK
and an algorithm for computing the DFT. Working within this data distribution and
using the DFT or FFT algorithm we present the parallel algorithm.
Assume we have P = cm processors where c ∈ Z+. Define m processor grids
Gj , j = 1, . . . ,m, and block cyclically distribute each Aj and bj onto processor grid Gj
for j = 1, . . . ,m. The first step is to apply the IDFT to the right-hand side b. Each
bj is distributed onto its respective processor grid of c processors. Because each bj
was distributed over its corresponding processor grid using the same blocking factor,
the distribution process is identical to decomposing each bj into c smaller size blocks.
Therefore, we can create c processor groupings of size m where each processor group is
composed of one processor from each grid. That is, processor group 1 will be formed
by taking each Gj ’s first element; group 2 will be formed by taking each Gj ’s second
element, and this process continues until we have c processor groupings. These processor
groupings create c independent IDFTs of smaller blocksize which can use the DFT/FFT
algorithm. Therefore, the IDFT involving each bj has been decomposed into c IDFTs
of smaller size which can be done simultaneously. Using the DFT/FFT algorithm, we
perform the IDFT of bj, transforming each bj into b̂j. In the same way, we transform each
Aj to Âj. Now note that the data distribution has not changed and each Gj now has the
system Âj x̂j = b̂j, which are precisely the systems that need to be solved. Also note, if the
FFT algorithm is used, the data has migrated into a bit reversed order during the
transformations; however, both sides of the equation have migrated into a bit reversed
order and the correct systems are still obtained. More precisely, if we let rev(j) denote
the bit reversal of j, after the transformations of Aj and bj each system Âj x̂j = b̂j
resides on process grid Grev(j), for j = 1, . . . ,m. Each Gj calls the ScaLAPACK routine
PZGESV and solves its respective system. PZGESV overwrites b̂j with the solution x̂j.
Because the solution overwrites the entries of b̂j, the data distribution has not changed,
and we simply use the DFT/FFT algorithm again to transform each x̂j to xj. Thus
we have the solution of the original linear system. If the FFT algorithm was used, x̂j
would be in bit reversed order, that is, x̂j is contained in grid Grev(j) for j = 1, . . . ,m;
however, when transforming back to xj the bit reversed order is negated. Therefore, xj
is contained on grid Gj, j = 1, . . . ,m, and the solution vector is in the same form as if
the DFT algorithm had been used. Algorithm 4.2 shows the pseudocode for the parallel
algorithm as six concise steps.
Algorithm 4.2 Pseudocode for the parallel solution of a block circulant linear system, assuming P = cm.
1: Define m √c × √c process grids.
2: Block cyclically distribute each Aj and bj onto grid Gj in an identical fashion.
3: Perform c simultaneous IDFTs transforming bj to b̂j.
4: Perform c simultaneous DFTs transforming Aj to Âj.
5: Simultaneously solve each Âj x̂j = b̂j in parallel using PZGESV.
6: Perform c simultaneous DFTs transforming x̂j to xj.
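A serial NumPy sketch of the same six steps is given below for reference. It assembles a small block circulant system explicitly, applies one possible sign convention for the block transforms (the thesis' Fb/F∗b convention may differ by a conjugate), solves the m small systems, and checks the result against a direct solve; all sizes are illustrative.

import numpy as np

m, n = 4, 5
rng = np.random.default_rng(2)
blocks = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))   # first block row A_1..A_m
b = rng.standard_normal(m * n) + 1j * rng.standard_normal(m * n)

# Reference: assemble the full block circulant matrix, block (r, s) = A_{(s-r) mod m}, and solve directly.
A_full = np.block([[blocks[(s - r) % m] for s in range(m)] for r in range(m)])
x_ref = np.linalg.solve(A_full, b)

# Steps 3-6 of Algorithm 4.2, serially: transform, solve the m small systems, transform back.
D = m * np.fft.ifft(blocks, axis=0)                # transformed blocks of the first block row
b_hat = np.fft.fft(b.reshape(m, n), axis=0)        # transformed right-hand side, block by block
x_hat = np.stack([np.linalg.solve(D[k], b_hat[k]) for k in range(m)])
x = np.fft.ifft(x_hat, axis=0).reshape(-1)         # back-transform of the block solution

print(np.allclose(x, x_ref))                       # True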
Chapter 5
Theoretical Timing Analysis
In this chapter, the theoretical runtime analysis for the parallel implementations
discussed in Chapter 4 are developed. Algorithm 4.2 contains two core operations: par-
allel computation of the DFT and the parallel linear system solve. Therefore, the parallel
runtime, denoted TP (n,m), can be expressed as:
TP (n,m) = TFT (n,m) + TLS(n,m), (5.1)
where TFT (n,m) denotes the parallel runtime in computing the DFT, and TLS(n,m)
denotes the runtime of the parallel linear system solve. Chapter 4 presented two different
implementations of the DFT, and, therefore, two parallel runtimes will be developed.
Let A be a block circulant matrix with m blocks of order n, and let X contain
A’s first block row; that is,
$X = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{bmatrix}.$
Further, let b be a single column vector and the right-hand side of the linear system Ax = b.
5.1 Parallel Linear System Solve
The parallel linear system solves are performed by ScaLAPACK which conve-
niently provides the theoretical analysis of the implementation [6]. The term TLS(n,m)
is then given by
$T_{LS}(n,m) = \frac{2n^3}{3P}\,t_f + \frac{(3 + \frac{1}{4}\log_2 P)}{\sqrt{P}}\,n^2\,t_v + (6 + \log_2 P)\,t_m,$ (5.2)
where tf is the time per complex floating point operation, tm is the startup time for each
communication, and tv is the time per data item sent. In general, tm > tv; thus, the
number of communication startups should be minimized. Equation (5.2) can be broken
into three parts: the first term in the summation is the computation term; the second
term is the communication cost concerning the quantity of data items sent, and the last
term corresponds to the number of communication startups.
The variable P in (5.2) is used to denote all processors; however, in the general
case where P = cm, the parallel implementation contains m simultaneous system solves,
each with c = P/m processors devoted to the parallel system solves. Therefore, the term
P in (5.2) should be replaced by c, obtaining
$T_{LS}(n,m) = \frac{2n^3}{3c}\,t_f + \frac{(3 + \frac{1}{4}\log_2 c)}{\sqrt{c}}\,n^2\,t_v + (6 + \log_2 c)\,t_m.$ (5.3)
Note that (5.3) is the parallel runtime for all of the m linear system solves. Due to
the concurrency of the m linear system solves, solving m linear systems with P = cm
processors is equivalent to solving one linear system with c processors. This overlap in
parallelized operations is what makes the inversion formulation so amenable to parallel
solution.
5.2 Block DFT using the DFT Algorithm
In this section, the runtime analysis of Algorithm 4.2 is considered when the block
DFT algorithm (see Section 4.1) is used. There are three transformations which use the
DFT algorithm presented in Section 4.1: the transformation of Aj to Âj, bj to b̂j, and the
solution vector x̂j to xj, for j = 1, . . . ,m. Each of these transformations requires m − 1
communications. When transforming Aj to Âj, for j = 1, . . . ,m, each communication
involves messages of size n^2; similarly, the transformations of bj to b̂j and x̂j to xj, for
j = 1, . . . ,m, both involve messages of size n. Using this, the communication term in
the analysis, denoted To(n,m), can be constructed. Accounting for the communications
needed by these transformations, To(n,m) is given by
$T_o(n,m) = 3(m-1)\,t_m + (m-1)(n^2 + 2n)\,t_v,$ (5.4)
where, again, tm is the time to initialize a communication, and tv is the time per data
item sent.
The computational term in the analysis is relatively straightforward. During each
step of the algorithm, each processor multiplies the data it currently owns and adds it
to its running sum. When transforming Aj to Âj, for j = 1, . . . ,m, each processor scales
n^2 elements by a term in the Fourier matrix and adds it to the running sum; therefore,
we have n^2 m multiplications plus n^2 (m−1) additions in the transformation of Aj to Âj,
for j = 1, . . . ,m. Similarly, the transformations of bj to b̂j and x̂j to xj, for j = 1, . . . ,m,
both involve nm multiplications and n(m−1) additions. Combining the computational
and communication terms yields
$T_{DFT}(n,m) = (m-1)(n^2+2n)\,t_f + m(n^2+2n)\,t_f + 3(m-1)\,t_m + (m-1)(n^2+2n)\,t_v.$ (5.5)
The analysis can easily be extended to the P = cm case. Recall, the P = cm
DFT algorithm creates c DFTs of smaller blocksize and arranges c processor groups.
Using these processor groups, c simultaneous P = m DFTs of smaller blocksize are then
performed. While the same number of communication startups are still needed, the size
of the messages as well as the amount of computation are reduced by a factor of 1/c; therefore, by
dividing the appropriate terms in (5.5) by c, the P = cm case is obtained
$T_{DFT}(n,m) = \frac{(m-1)(n^2+2n) + m(n^2+2n)}{c}\,t_f + 3(m-1)\,t_m + \frac{(m-1)(n^2+2n)}{c}\,t_v.$ (5.6)
More compactly,
$T_{DFT}(n,m) = \frac{(2m-1)(n^2+2n)}{c}\,t_f + 3(m-1)\,t_m + \frac{(m-1)(n^2+2n)}{c}\,t_v.$ (5.7)
By combining (5.3) and (5.7), the parallel runtime for Algorithm 4.2, which is given by
(5.8), is obtained:
$T_{P1}(n,m) = \frac{(2m-1)(n^2+2n)}{c}\,t_f + 3(m-1)\,t_m + \frac{(m-1)(n^2+2n)}{c}\,t_v + \frac{2n^3}{3c}\,t_f + \frac{(3+\frac{1}{4}\log_2 c)}{\sqrt{c}}\,n^2\,t_v + (6+\log_2 c)\,t_m.$ (5.8)
By rearranging (5.8) and grouping computation and communication-specific constants,
the final parallel runtime using the DFT algorithm is given by
$T_{P1}(n,m) = \left[\frac{2n^3}{3c} + \frac{(2m-1)(n^2+2n)}{c}\right]t_f + \left[3(m-1) + (6+\log_2 c)\right]t_m + \left[\frac{(m-1)(n^2+2n)}{c} + \frac{(3+\frac{1}{4}\log_2 c)}{\sqrt{c}}\,n^2\right]t_v.$ (5.9)
5.3 Block DFT Using the FFT Algorithm
The FFT timing analysis follows directly from Section 5.2. Recall the main differ-
ence between the DFT algorithm and the FFT algorithm is the communication pattern.
Whereas the DFT required m − 1 communications, the FFT only requires log2m, for
m a power of two. Consider the DFT implementation’s communication term (5.4). By
substituting log2 m for the appropriate communication terms, (5.4) becomes
$T_o(n,m) = 3\log_2(m)\,t_m + \log_2(m)(n^2 + 2n)\,t_v$ (5.10)
when the FFT algorithm is used. In the FFT case, after each communication, each pro-
cessor scales a portion of its owned data by a term in the Fourier matrix. This modified
data is then added to the processor’s running sum; therefore, log2m communications
implies log2m multiplications and log2m additions are performed. This is reflected in
the computational term. Note, these are the only terms that change relative to the anal-
ysis involving the DFT algorithm. By proceeding in the same manner as Section 5.2, we
obtain
$T_{P2}(n,m) = \left[\frac{2n^3}{3c} + \frac{2\log_2(m)(n^2+2n)}{c}\right]t_f + \left[3\log_2 m + (6+\log_2 c)\right]t_m + \left[\frac{\log_2(m)(n^2+2n)}{c} + \frac{(3+\frac{1}{4}\log_2 c)}{\sqrt{c}}\,n^2\right]t_v$ (5.11)
for the final runtime of Algorithm 4.2 when using the FFT algorithm.
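For reference, the two runtime models (5.9) and (5.11) are straightforward to evaluate; the sketch below restates them as Python functions with made-up machine constants, only to make the dominant n^3/(3c) behaviour easy to inspect.

import math

def t_p1(n, m, c, tf, tm, tv):
    # Runtime model (5.9): DFT-based transforms plus the ScaLAPACK solve.
    comp = (2 * n ** 3 / (3 * c) + (2 * m - 1) * (n ** 2 + 2 * n) / c) * tf
    startup = (3 * (m - 1) + 6 + math.log2(c)) * tm
    volume = ((m - 1) * (n ** 2 + 2 * n) / c + (3 + 0.25 * math.log2(c)) / math.sqrt(c) * n ** 2) * tv
    return comp + startup + volume

def t_p2(n, m, c, tf, tm, tv):
    # Runtime model (5.11): FFT-based transforms plus the ScaLAPACK solve.
    comp = (2 * n ** 3 / (3 * c) + 2 * math.log2(m) * (n ** 2 + 2 * n) / c) * tf
    startup = (3 * math.log2(m) + 6 + math.log2(c)) * tm
    volume = (math.log2(m) * (n ** 2 + 2 * n) / c + (3 + 0.25 * math.log2(c)) / math.sqrt(c) * n ** 2) * tv
    return comp + startup + volume

# Illustrative (fictitious) machine constants: with n >> m the cubic term dominates both models.
print(t_p1(3000, 8, 4, 1e-9, 1e-5, 1e-8), t_p2(3000, 8, 4, 1e-9, 1e-5, 1e-8))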
5.4 Bounds
Constructing the parallel complexity analysis allows us to find the dominating
term in both parallel algorithms. Recall the assumptions in the development of the
parallel algorithms in Chapter 4, namely, n ≫ m; that is, the order of each block
in the coefficient matrix is large relative to the number of blocks. In general, it was
assumed m < 30. Looking at (5.9), it is clear that the first term, i.e., $\frac{2n^3}{3c}$, dominates
the computation. Similarly, by considering (5.11), it follows that both expressions have
the same dominating term. Therefore, we obtain
$T_{P1}(n,m) = O\!\left(\frac{n^3}{c}\right)$ (5.12)
and
$T_{P2}(n,m) = O\!\left(\frac{n^3}{c}\right).$ (5.13)
Recalling that N = nm, for N large, the term which dominates arises from the ScaLA-
PACK linear system solve. Therefore, under our assumptions, the most expensive part
of the computation is offloaded to the ScaLAPACK routine. This means that although
large packets of data must be communicated between processors in computing the DFT,
when N is large, the dominating term comes from the linear system solves. While this is
not extremely surprising, the implication is that the communication terms in the devel-
oped algorithms do not overwhelm the overall algorithm. As a result, the computational
portion of the linear system solves dominates. This is also the result reached in the
ScaLAPACK user guide [6] where only the parallel linear system solve is analyzed. This
is considered advantageous because the linear system solves are computed via ScaLA-
PACK, which is optimized for scalability.
Chapter 6
Numerical Experiments
All experiments were run using the Intel Nehalem processors of the Cyberstar
compute cluster [1] running at 2.66 GHz with 24 GB of RAM. We implemented the
parallel algorithm in FORTRAN 90 and used the ScaLAPACK and MPI libraries. A
blocking factor of 50 was used for the block cyclic distribution of each Aj and bj onto
their respective processor grids.
The FFT and DFT parallel algorithms differ in the communication routines used.
The DFT algorithm broke the communications into blocks of size 4000 which were sent
and received asynchronously using MPI’s ISEND/IRECV functions. In the case that a
processor does not contain 4000 elements, all of its data is sent in one communication.
The blocking of the communications also parsed each matrix columnwise to work within
FORTRAN’s column major data storage format. Whereas, the FFT algorithm did not
perform asynchronous sends/receives and used the standard BLACS routines for sending
2D blocks of data.
6.1 Experiment 1
First, we look at the runtime, speedup, and efficiency for a vibrating structure
with four times rotational symmetry for both the DFT algorithm and FFT algorithm.
In each case, the number of processors P and matrix size N are varied. The number of
processors is varied from 4 to 48 and N is varied from roughly 13, 000 to 24, 000.
6.2 Experiment 2
We look at the runtime, speedup, and efficiency for a vibrating structure with
eight times rotational symmetry for both the DFT algorithm and FFT algorithm. In
each case, the number of processors P and matrix size N are varied. The number of
processors is varied from 8 to 48 and N is varied from roughly 13, 000 to 24, 000.
6.3 Numerical Results
6.3.1 Experiment 1
First, consider the algorithm’s behavior when a structure with four times rota-
tional symmetry, m = 4, is examined using the DFT algorithm as well as the FFT
algorithm. Figure 6.1 shows the runtimes when using the DFT algorithm; a sharp de-
cline in runtime can be seen as the number of processors increase for various N . The
runtimes using the FFT implementation are given in Figure 6.2 showing similar trends
and runtimes as their DFT counterpart. The runtime improvements are also apparent
when looking at the speedup, which are given in Figures 6.3 and 6.4. The oscillations
in the speedups can be explained by looking more closely at the values of the runtimes.
Figures 6.1 and 6.2 show that the wall clock times are quite low, and small benign vari-
ances in the runtime for large P cause large oscillations in the speedup. This is why the
oscillations are flushed out for larger problems. Therefore as N increases, the oscillations
are dampened, and the speedups become more linear.
Fig. 6.1 Runtime comparison using the DFT algorithm for varying P and N with m = 4.
Fig. 6.2 Runtime comparison using the FFT algorithm for varying P and N with m = 4.
Fig. 6.3 Speedups using the DFT algorithm for varying P and N with m = 4.
Fig. 6.4 Speedups using the FFT algorithm for varying P and N with m = 4.
The most important category in parallel algorithm analysis is probably efficiency.
Efficiency is a measure of useful work done by a parallel algorithm and gives insight
into how much time the algorithm spends waiting on communication. Ideally, we would
like the efficiency to be as close to 1 as possible, which means all of the work is use-
ful. However, we are restricted by the underlying parallelism of the computation being
performed. Here we look at the behavior of the efficiency for varying P and N . Fig-
ures 6.5 and 6.6 show the efficiency as a function of problem size for the DFT and FFT
implementations, respectively. For nearly all processor numbers, excluding P = 4 which
simply remains efficient, the algorithm becomes more efficient as the problem size in-
creases. This tells us that as the problem size increases, the amount of time spent doing
useful work increases. Because N = nm, for m fixed, an increase in problem size directly
correlates to an increase in n. Recall the discussion in Section 5.4; for fixed m such that
n m, the dominating term comes from the computational portion of the ScaLAPACK
linear system solve. This fact is seen in Figures 6.5 and 6.6; as a function of problem
size, the amount of time spent computing grows faster than the amount of time spent
communicating.
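For reference, the quantities plotted here follow the standard definitions (restated, since they are not repeated in this chapter):
$S_P = \frac{T_1}{T_P}, \qquad E_P = \frac{S_P}{P} = \frac{T_1}{P\,T_P},$
where $T_1$ is the single-processor runtime, $T_P$ is the runtime on $P$ processors, $S_P$ is the speedup, and $E_P$ is the efficiency.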
Notice that, generally, the larger the number of processors, the lower the efficiency.
Although the DFT computations are ideally parallel and are able to overlap communi-
cations resulting from additional processors, the linear system solve computations are
not. More processors imply that the communication term in the linear system solve
will contribute more to the overall runtime; however, as the size of the linear system
grows, the computational portion of the ScaLAPACK solve begins to dominate. That
is, more processors means the efficiency will be lower for problems of the same size, but
as the computational term of the linear system solve begins to dominate, the efficiency
increases. Therefore, even though the efficiencies for different processors are decreasing
with P in Figure 6.5, it is only because the data point, i.e., the value of N , is fixed.
Fig. 6.5 Efficiency using the DFT algorithm for varying N and P with m = 4.
An interesting remark with respect to the two different implementations is to
note the similarity in their performance. Recall, the main difference in the algorithms
is the number of communications needed when computing the DFT; that is, the DFT
algorithm, i.e., matrix multiplication, or the FFT algorithm. For the case m = 4, the
number of communications is negligible; however, m = 4 also means the linear systems
are larger. Therefore, the computational term in the ScaLAPACK linear system solve
will dominate the computation more, making the algorithms behave in a similar fashion.
This was also alluded to in the theoretical analysis given in Chapter 5.
Fig. 6.6 Efficiency using the FFT algorithm for varying N and P with m = 4.
6.3.2 Experiment 2
Now, consider the performance of the algorithm when the number of rotational
symmetries m = 8. Figures 6.7 and 6.8 show the runtime analysis for the DFT and FFT
implementations. The trend in the runtimes is similar to that of the four times rotational
symmetry case. The main difference is the runtime values. Consider the largest case,
N = 24,000; Figure 6.1 shows the runtime for P = 8 is roughly 38 seconds, whereas
Figure 6.7 shows that for the same values of P and N , the computation time is only 12
seconds. As m increases, the size of the linear system decreases. This means that the
most expensive part of the computation, which is the linear system solve, decreases with
m, and results in a faster overall runtime. Even though the number of communications
in the DFT/FFT algorithms grows with m, the messages per communication are smaller.
Fig. 6.7 Runtime comparison using the DFT algorithm for varying P and N with m = 8.
Fig. 6.8 Runtime comparison using the FFT algorithm for varying P and N with m = 8.
Figure 6.9 shows the speedup for the DFT algorithm in the eight times rotational
symmetry case. What is interesting is that for smaller problems the speedup begins
leveling off past a certain point. The DFT and FFT algorithms have no increasing
dependence on P in their communication terms, and, therefore, this must be due to
the size of the linear systems. This shows that for a fixed problem size, the advantage
of additional processors becomes negligible after a certain point due to the ratio of
computation to communication in the linear system solve. What is important is that
until this point of leveling off, the speedup increases nearly linearly. This means that the
extra communications in the DFT/FFT algorithm, which are due to the increase in m, do
not overwhelm the algorithm. Indeed, by considering the larger values of N , Figure 6.9
shows that the speedups are nearly linear for larger problem sizes. The speedup in the
case of the FFT algorithm is given in Figure 6.10. The trend for the smaller values of
N appears to extend further than in Figure 6.9 and is most likely due to savings of the
FFT.
Lastly, efficiency is considered and is shown in Figures 6.11 and 6.12 for varying
values of P and N . Again, it is found that the efficiency increases for increasing problem
size. It can be seen that the overall value of the efficiency is slightly less than the
m = 4 case; this is due to two things: the increase in communications due to the
DFT/FFT algorithms, and the size of the linear system solve. However, because m is
fixed, the number of communications by the DFT/FFT algorithm will not grow with
N , even though the message size will grow. Therefore, as the problem size increases,
the computational term of the linear system solve will again begin to dominate, and the
efficiency can be expected to increase.
Fig. 6.9 Speedup comparison using the DFT algorithm for varying P and N when m = 8.
Fig. 6.10 Speedup comparison using the FFT algorithm for varying P and N when m = 8.
Fig. 6.11 Efficiency comparison using the DFT algorithm for varying P and N when m = 8.
The effect of using the FFT over the DFT can be seen by comparing the first data
point N = 13, 000 of Figures 6.11 and 6.12. The efficiencies for the FFT algorithm are
higher at this data point. In this instance, the linear systems are still relatively small, and
the computational term of the ScaLAPACK solve does not yet dominate. This is because
for the given value of N = 13, 000, the communications of the DFT transformations
contribute more. Because the FFT implementation uses fewer communications, the
efficiencies are higher for smaller N .
As in the four times rotational symmetry case, we find that the DFT and FFT
implementations perform similarly. The observed benefits for using the FFT appeared
at the lower bound of our experimental values. The FFT algorithm showed a higher
efficiency when N was small. In this instance, the linear system solves did not yet
dominate, and, therefore, the communications contribute more.
Fig. 6.12 Efficiency comparison using the FFT algorithm for varying P and N when m = 8.
However, in both the m = 4 and m = 8 cases, as N increases, the algorithms exhibit similar performance. In
our case, the assumptions rely on the solution of larger linear systems. In the case of
smaller linear systems and larger m, the FFT algorithm could be expected to produce
better performance results.
Chapter 7
Conclusions
We have proposed a parallel algorithm for the solution of block circulant linear
systems arising from acoustic radiation problems with rotationally symmetric boundary
surfaces. A derivation of the linear system was given along with conditions for application
of the algorithm. The algorithm takes advantage of the ScaLAPACK library and exploits
the embarrassingly parallel nature of block DFTs within ScaLAPACK’s required data
distributions. In addition, by exploiting the block circulant structure of the matrix
in the context of the parallel algorithm, the memory requirements are reduced. The
reduction in the memory requirements allows for the solution of larger block circulant
linear systems. Because the size of the matrix directly correlates with the number of
surface elements in the discretization, problems which require a finer discretization, i.e.,
higher frequency problems, can be explored. In addition, problems with larger overall
structures can be investigated.
The behavior of the DFT and FFT algorithms was similar for large N . The exper-
imental results show near linear speedup for varying problem sizes and that the speedups
become more linear for increasingly large N . We also showed that the efficiency of the
algorithm increases as a function of problem size. The theoretical analysis coupled with
the experimental results showed that in both cases the algorithm becomes dominated by
the ScaLAPACK linear system solve portion of the algorithm. Given the requirements
of the problem, i.e., n ≫ m with m ≤ 30, it is found that for larger problems, the
difference in the two algorithms is negligible. It has also been established that the block
DFT transformations can be performed within the ScaLAPACK data distribution, and
that the necessary communications for the DFT transformations do not overwhelm the
algorithm’s runtime.
In addition, because we developed an algorithm using a matrix multiplication
DFT approach, it can be applied to any rotationally symmetric structure. The parallel
algorithm therefore permits the efficient computation of larger acoustic radiation prob-
lems with rotationally symmetric boundary surfaces. While small gains exist by choosing
the FFT algorithm over the developed DFT algorithm, these gains are negligible given
our assumptions on N and m. The FFT also places additional requirements on the
values of m, i.e., m is a power of 2. Indeed, for the assumption m ≤ 30, there are only
four viable values of m, namely, 2, 4, 8, and 16. Nevertheless, small gains do exist, and,
therefore, one avenue for further investigation is the development of a robust algorithm
which uses FFTs within the context of using ScaLAPACK for the linear systems. If an
elegant domain decomposition can be devised, and if a robust FFT algorithm, such as
Bluestein’s FFT algorithm [7, 8], can be fitted to the problem, the algorithm could be
further improved.
Appendix
BEM Code
The modified code, in its most general form, has four different cases:
1. Sequential with no rotational symmetries.
2. Sequential with rotational symmetries.
3. Parallel with no rotational symmetries.
4. Parallel with rotational symmetries.
Therefore, in the main program, logic exists to direct the program flow through one of
the four cases given above. There are five core functions which have been modified to
support the cases given above. These are:
1. STATIC MULTIPOLE ARRAYS
2. COEFF MATRIX
3. SOURCE AMPLITUDES MODES
4. SOURCE POWER
5. MODAL RESISTANCE
Before describing each function individually, we first define some frequently used ter-
minology. When using the term “distributed data structure”, we are referring to each
processor containing a portion of a global data structure. For example, assume we are
given a matrix A and we have P processors. Instead of one processor containing all of
the matrix A, the elements are split up, and each processor has a data structure which
contains these portions of the matrix. We refer to the collection of these data structures
as a “distributed data structure” and denote it as sub[A]. This is because when all
processors combine their corresponding sub[A], we obtain the global data structure A.
A.1 STATIC MULTIPOLE ARRAYS
This function uses multipole expansions to approximate values which will end up
populating the coefficient matrix. It attempts to speed up future runs by storing the
approximated values in a file. The function initially checks for the existence of the file. If
it is not there, the function proceeds in computing the approximations and creating the
file. If, however, a file containing the approximations exists, the function immediately
returns, performing no computations.
A.1.1 Sequential
A.1.1.1 General Case
In the sequential case, the BEM code does not change with respect to the original
code. The function generates the approximations and writes the data to a file or returns.
The pseudocode for this case is given by Algorithm A.1.
Algorithm A.1 Pseudocode for the STATIC MULTIPOLE ARRAYS general sequential case.
if (Multipole data file exists) then
  return;
else
  Compute multipole expansion approximations;
  Write multipole expansion data to file;
end if
A.1.1.2 Rotationally Symmetric
Rotational symmetry plays no role in the sequential computation, and the function
performs as it does in the general case (see Section A.1.1.1 and Algorithm A.1).
A.1.2 Parallel
A.1.2.1 General Case
The parallel code behaves differently than the sequential code. A distributed data
structure, called sub[U ], is created to store the data in a distributed setting. This data
structure is a three-dimensional array. The first two dimensions vary with respect to the
total number of acoustic elements in the BEM computation. The third dimension has a
fixed value of 5, which corresponds to the number of terms in the multipole expansion.
Each processor then, simultaneously, populates its data structure. When all processors
have populated the corresponding sub[U ] data structure, the function returns. The
main difference in the computation is that no file is generated in the parallel case.
The multipole expansion data is instead held in memory distributed over the available
processors. The pseudocode is given in Algorithm A.2.
Algorithm A.2 Pseudocode for the STATIC MULTIPOLE ARRAYS general parallel case. Define sub[U]n and sub[U]m to be the number of rows and columns of the processor’s owned sub[U] data structure, respectively.
for i = 1 to sub[U]n do
  for j = 1 to sub[U]m do
    Compute multipole expansion approximation;
    Assign sub[U](i, j);
  end for
end for
A.1.2.2 Rotationally Symmetric
The multipole file is written out in this case. Due to the way the computa-
tion proceeds in the generation of the coefficient matrix (see Section A.2), the parallel
rotationally symmetric case behaves in the same fashion as the sequential case (see Al-
gorithm A.1). That is, a file containing the multipole approximations is written out if no
such file already exists, or the function returns. Because the generation of the multipole
expansion data is not time consuming, the benefits for computing the multipole expan-
sions in parallel are lost in the communication back to a single processor for the writing
of the file. Therefore, one processor computes the multipole expansion data and writes
the data out to a file. All other processors wait for the processor performing the cal-
culations and file creation to finish. Once the working processor finishes, the remaining
processors continue with the computation.
A.2 COEFF MATRIX
The COEFF MATRIX routine populates the coefficient matrix A to be used in
the computation of Ax = b. It now uses the multipole data which was computed in the
STATIC MULTIPOLE ARRAYS routine.
A.2.1 Sequential
A.2.1.1 General Case
This routine is the same as the original; it loops through each entry of the ma-
trix reading one row of the multipole data at a time and populates the matrix. Note,
the multipole data is only used if the distance between the points on the surface is
sufficiently large; however, even when the multipole data is not used, the file is still read.
Algorithm A.3 Pseudocode for the COEFF MATRIX general sequential case. Define N to be the number of rows and columns of A.
for i = 1 to N do
  Read row i of multipole data in from file;
  for j = 1 to N do
    if dist(pi, pj) > threshold then
      Compute using multipole data;
    else
      Compute without multipole data;
    end if
    Assign A(i, j);
  end for
end for
A.2.1.2 Rotationally Symmetric
The coefficient matrix for this case is block circulant. As noted previously, block
circulant matrices can be uniquely represented by their first block row. Therefore, only
the first block row of the matrix is generated by this routine. It proceeds in the same
manner as the general sequential version; however, it does not fill in the matrix beyond
the first block row.
Algorithm A.4 Pseudocode for the COEFF MATRIX rotationally symmetric sequential case. Define N to be the number of rows and columns of A. In addition, define m to be the number of symmetries.
for i = 1 to N/m do
  Read row i of multipole data from file;
  for j = 1 to N do
    if dist(pi, pj) > threshold then
      Compute using multipole data;
    else
      Compute without multipole data;
    end if
    Assign A(i, j);
  end for
end for
A.2.2 Parallel
A.2.2.1 General Case
In this case, each processor contains a distributed data structure containing the
global matrix A, and is denoted by sub[A]. Each processor’s data structure contains only
a portion of the data contained in the entire coefficient matrix A. Each processor then
populates its data structure, sub[A], simultaneously. The simultaneous population of the
matrix is due to the sub[U ] data structure populated in STATIC MULTIPOLE ARRAYS.
Without this distributed data structure, the file containing the multipole data would have
to be opened and read sequentially.
A.2.2.2 Rotationally Symmetric
Again, the coefficient matrix can be uniquely defined by its first block row. The
first block row will contain m blocks each of order n. In this case, the number of
processors, defined by P , is assumed to be some multiple of m. That is, P = cm for
c ∈ Z+. From here, m processor grids are defined; these are denoted by Gi for i = 1, . . . ,m.
Algorithm A.5 Pseudocode for the COEFF MATRIX general parallel case. Define sub[A]N and sub[A]M to be the number of rows and columns of sub[A], respectively.
for i = 1 to sub[A]N do
  for j = 1 to sub[A]M do
    if dist(pi, pj) > threshold then
      Compute using multipole data in sub[U];
    else
      Compute without multipole data;
    end if
    Assign sub[A](i, j);
  end for
end for
Each Gi contains c processors and is of dimension √c × √c. If √c is not an integer, the
closest rectangular grid is formed. In addition to defining the grids, each processor
defines variables called pId and gId. The variable pId is the processor number, and
gId identifies which of the m processor grids a processor is a part of; their existence
is acknowledged only because they are used in the coefficient matrix generation. At
this point, m processor grids have been defined. In addition, the first block row of
A is composed of m blocks of order n. Therefore, each of the m blocks in the first
block row of A will be distributed onto a corresponding processor grid. That is, block
Ai is distributed over Gi for i = 1, . . . ,m. In order to distribute Ai onto grid Gi for
i = 1, . . . ,m, the processors belonging to grid Gi must define a distributed data structure
for the corresponding Ai. The distributed data structure is denoted by sub[Ai]. Note,
only the processors which are a part of Gi contain the data structure sub[Ai]. That
is, processors belonging to G1 use the distributed data structure sub[A1]; processors
belonging to G2 use the distributed data structure sub[A2], and so on and so forth.
The grids are populated simultaneously with respect to one another, but the distributed
data structures, sub[Ai], i = 1, . . . ,m, are not populated in parallel within a grid. This is due to the file containing the multipole
approximations. The file must be read sequentially, and only one row is read at a time.
Therefore, the loops are of length n, which is the order of each Ai, i = 1, . . . ,m. The
computation proceeds as follows: for element Ai(j, k), j, k = 1, . . . , n, the function
BCCMPT L INDX is called and returns the processor whose local data structure, sub[Ai],
contains element Ai(j, k). In addition, the function returns the index into the local data
structure, denoted by (lj , lk). Therefore, in iteration (j, k), sub[Ai](lj , lk) is populated,
and this happens simultaneously for each grid. Algorithm A.6 shows the pseudocode for
the routine.
Algorithm A.6 Pseudocode for the COEFF MATRIX rotationally symmetric parallel case. The variable aP denotes the processor whose data structure will be assigned in a given iteration. Note, the variable kG allows the grids to be populated in parallel.
Define processor grids;
for j = 1 to n do
  Read row of multipole data in from file;
  for k = 1 to n do
    kG = n ∗ gId + k;
    aP = processor containing AgId(j, kG);
    Compute index (lj , lkG) into sub[AgId] using global index (j, kG);
    if pId == aP then
      Assign sub[AgId](lj , lkG);
    end if
  end for
end for
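The essence of BCCMPT L INDX is a block-cyclic global-to-local index map. The following Python sketch is a hypothetical analogue written only for illustration (it follows the usual ScaLAPACK-style block-cyclic convention, applied independently to the row and column indices of a 2D distribution).

def owner_and_local(g, nb, nprocs, src=0):
    # Map a 0-based global index g to (owning process, 0-based local index)
    # for a 1D block-cyclic distribution with blocking factor nb.
    block = g // nb                                # global block number
    proc = (block + src) % nprocs                  # process owning that block
    local = (block // nprocs) * nb + g % nb        # local index within that process
    return proc, local

# Example: 10 indices, blocking factor 2, 3 processes; blocks are dealt out round-robin.
print([owner_and_local(g, nb=2, nprocs=3) for g in range(10)])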
A.3 SOURCE AMPLITUDES MODES
A.3.1 Sequential
A.3.1.1 General Case
The general sequential case makes no changes to the original routine. The right-
hand side, b, is populated by a double for loop. It should be noted that, in general, there
will be multiple right-hand sides. That is, b will not be a single column vector. Following
this, the system Ax = b is solved using the LAPACK routine ZGESV. Algorithm A.7
gives the pseudocode.
Algorithm A.7 Pseudocode for the SOURCE AMPLITUDES MODES general sequential case. Let N and rhsn denote the number of rows and columns of b, respectively.
for i = 1 to N do
  for j = 1 to rhsn do
    Assign b(i, j);
  end for
end for
Solve Ax = b using LAPACK routine ZGESV;
A.3.1.2 Rotationally Symmetric
In the initial section of the routine, the right-hand side, b, is populated from a
simple double for loop. Following this, the solve of the system Ax = b is performed. The
rotationally symmetric system solve has been discussed in detail in Section 3.2; therefore,
the discussion in this section will be somewhat terse. The inversion formula for a block
circulant matrix A is given by
$A^{-1} = F_b^*\,\mathrm{diag}\!\left[(\hat{A}_1)^{-1}, (\hat{A}_2)^{-1}, \ldots, (\hat{A}_m)^{-1}\right] F_b,$ (A.1)
where Fb is the block Fourier matrix, and diag[(Â1)−1, (Â2)−1, . . . , (Âm)−1] is a block
diagonal matrix. In the context of solving the linear system Ax = b, we obtain
$\mathrm{diag}\!\left[\hat{A}_1, \hat{A}_2, \ldots, \hat{A}_m\right] F_b^*\,x = F_b^*\,b.$ (A.2)
Now, let X̂ be a block column vector constructed from the elements of the block diagonal
matrix in (A.2). That is,
$\hat{X} = \begin{bmatrix} \hat{A}_1 \\ \hat{A}_2 \\ \hat{A}_3 \\ \vdots \\ \hat{A}_m \end{bmatrix}.$ (A.3)
In addition, let X be the column vector of the first block row of A. We then have the
relation X̂ = FbX; therefore, the elements of the block diagonal matrix are precisely the
values obtained from computing the block DFT of the first block row of A. With these
relations, the solve is computed by the following steps:
1. Compute b̂ = F∗b b.
2. Compute X̂ = FbX.
3. Solve Âj x̂j = b̂j, j = 1, . . . ,m.
4. Compute x = Fb x̂.
The pseudocode for this case of the SOURCE AMPLITUDES MODES routine is given by
Algorithm A.8.
Algorithm A.8 Pseudocode for the SOURCE AMPLITUDES MODES rotationally symmetric sequential case. Let N and rhsn denote the number of rows and columns of b, respectively. In addition, let m be the number of blocks in the first block row of A.
for i = 1 to N do
  for j = 1 to rhsn do
    Assign b(i, j);
  end for
end for
Compute the inverse DFT of the right-hand side by b̂ = F∗b b;
Compute the elements of diag[Â1, Â2, . . . , Âm] by X̂ = FbX;
for k = 1 to m do
  Solve Âk x̂k = b̂k using LAPACK routine ZGESV;
end for
Compute the DFT of the solution vector by x = Fb x̂;
A.3.2 Parallel
A.3.2.1 General Case
The general parallel case is very similar to the sequential general case. The
only difference is that the global matrix A has been distributed over the processors
and resides in the distributed data structure sub[A]. Recall that this data structure
was populated by the COEFF MATRIX routine (see Section A.2.2.1). In addition, a
distributed data structure, sub[b], is defined for the global right-hand side b. Each
processor simultaneously populates its corresponding data structure. The pseudocode
is given in Algorithm A.9. Note the existence of the variable x in line 6. This variable
is only for clarity of presentation of the algorithm. The routine PZGESV overwrites
the distributed data structure sub[b] with the result. In this way, there is no need to
maintain a distributed data structure for the variable x.
Algorithm A.9 Pseudocode for the SOURCE AMPLITUDES MODES general parallel case. Let sub[b]n and sub[b]m denote the number of rows and columns of sub[b], respectively.
1: for i = 1 to sub[b]n do
2:   for j = 1 to sub[b]m do
3:     Assign sub[b](i, j);
4:   end for
5: end for
6: Solve sub[A]x = sub[b] using ScaLAPACK routine PZGESV;
A.3.2.2 Rotationally Symmetric
With the exception of one additional operation, this routine performs the same
operations as the preceding cases. That is, it populates the right-hand side, and solves
the linear system. The extra operation comes from moving the distributed solution vector
into a different distributed format needed for a later parallel matrix multiplication. The
parallel rotationally symmetric system solve is discussed in detail and is the main topic
of the paper; therefore, this section will not discuss the details of the solve. Rather, this
section will detail the population of the right-hand side in a way which is amenable to
the parallel block circulant system solve. Recall the routine in Section A.2.2.2 defined m
processor grids, Gi, for i = 1, . . . ,m. In addition, the routine distributed each Ai onto
grid Gi. In a similar fashion, this routine will block b into m blocks of corresponding
size, denoted by bi, i = 1, . . . ,m, and distribute each bi onto Gi for i = 1, . . . ,m. Each
bi is in C^{n×rhsn}, where n is the order of each block Ai, and rhsn denotes the number
of right-hand sides, i.e., columns of b. In order to distribute each bi onto grid Gi for
i = 1, . . . ,m, a distributed data structure sub[bi] is defined. Note, only the processors
which are part of Gi contain the data structure sub[bi]. That is, processors belonging
to G1 use the distributed data structure sub[b1]; processors belonging to G2 use the
distributed data structure sub[b2], and so on and so forth. All of the distributed data
structures are populated simultaneously by looping over their corresponding distributed
data structures. At this point, each Ai and bi reside on grid Gi for i = 1, . . . ,m,
and this is precisely the setting which is needed for the parallel block circulant system
solve. Once the block circulant linear system has been solved, the solution will reside
in each sub[bi] for i = 1, . . . ,m. However, following this subroutine, a parallel matrix
vector multiplication will be performed using the solution vector. The multiplication is
performed in the context of one large process grid containing all processors and therefore,
the right-hand side, b, is distributed over the processor grid using a distributed data
structure sub[b]. Since the solution resides on the distributed data structures sub[bi] for
i = 1, . . . ,m, the data needs to be communicated to the appropriate format for sub[b].
In order to perform this reorganization of data, the routine uses a double for loop to loop
through the global b, computes which processor currently owns each entry and which processor
needs it, and performs the communication. Once this reorganization is finished, the
routine is complete.
A.4 SOURCE POWER
At this point in the program, the system Ax = b has been solved, and the resultant
vector, x, has been obtained. Because LAPACK and ScaLAPACK overwrite b with the
solution vector x, the data structures containing b now have the solution x. Therefore,
any data structures previously denoted by a b will be denoted by x. The matrix A and
the right-hand side b are no longer needed (in terms of their initial values). This routine
populates a matrix S with the intent of computing s = x∗Sx, in which x is the solution
obtained from the SOURCE AMPLITUDES MODES routine.
Algorithm A.10 Pseudocode for the SOURCE AMPLITUDES MODES rotationally symmetric parallel case. The variables pId and gId denote the processor number and the grid the processor belongs to, respectively. Let sub[bgId]n and sub[bgId]m denote the number of rows and columns of sub[bgId], respectively.
for j = 1 to sub[bgId]n do
  for k = 1 to sub[bgId]m do
    Assign sub[bgId](j, k);
  end for
end for
Solve Ax = b using the block circulant solve (Algorithm 4.2);
for j = 1 to N do
  for k = 1 to rhsn do
    SendProc = Processor which has data b(j, k) in sub[bgId];
    RecvProc = Processor which needs b(j, k) data;
    Compute index into sub[bgId]; denote by (sj , sk);
    Compute index into sub[b]; denote by (lj , lk);
    if SendProc == RecvProc then
      sub[b](lj , lk) = sub[bgId](sj , sk)
    else
      if pId == SendProc then
        Send sub[bgId](sj , sk) to processor RecvProc;
      else if pId == RecvProc then
        Recv temp = sub[bSendProc](sj , sk) from processor SendProc;
        sub[b](lj , lk) = temp;
      end if
    end if
  end for
end for
A.4.1 Sequential
A.4.1.1 General Case
The general sequential case simply populates the matrix; however, the method
of population differs from the previous routines. Instead of populating the matrix by
looping over each element in the matrix, the routine loops over the sources used in
the overall computation for populating S. That is, for each source, it computes which
element, S(i, j), uses that source, and adds the source’s contribution to S(i, j). In
general, there can be multiple sources per element and, therefore, S(i, j) will be updated
multiple times.
There are three different types of sources: simple, dipole, and a coupled simple and
dipole source which will be called a tripole source. The contribution of each source type
is done separately. That is, the simple source contributions are computed, followed by
computation of the dipole sources, and finally by computation of the tripole sources. In
the SOURCE POWER routine, there is a separate routine for each source type; however,
the algorithmic idea for populating the matrix S is the same in all cases.
In addition, the matrix S is Hermitian, and so only the upper triangular portion
is computed using the source contributions discussed above. After the upper triangular
portion is populated, the routine fills in the second half of the matrix by copying the
conjugate of the elements into the lower triangular portion of the matrix. Algorithm A.11
gives the pseudocode for the routine.
Algorithm A.11 Pseudocode for the SOURCE POWER general sequential case. Let N1, N2, and N3 be the number of simple, dipole, and tripole sources, respectively. Let N be the number of rows and columns of S.
//Fill upper triangular portion of S
for l = 1 to 3 do
  for k = 1 to Nl do
    Let (i, j) be the element to which source k contributes;
    if i ≤ j then
      Update S(i, j);
    end if
  end for
end for
//S is Hermitian, copy the data
for i = 2 to N do
  for j = 1 to i − 1 do
    S(i, j) = Conj(S(j, i));
  end for
end for
A.4.1.2 Rotationally Symmetric
In the rotationally symmetric case, the matrix is also block circulant. Because
the matrix is block circulant, only the first block row of S is filled. Then, using the first
block row of S, the matrix is filled. The fact that S is Hermitian is also used, but in this
case, it is used only for the first block in the first block row of S. The pseudocode given
in Algorithm A.12 is very similar to Algorithm A.11 except for a change in the bounds.
A.4.2 Parallel
A.4.2.1 General Case
Essentially, this routine populates the matrix S in parallel. It reuses the dis-
tributed data structure sub[A], which will now be denoted as sub[S], and populates it
by having each processor simultaneously loop over their corresponding data structures.
However, as discussed in the sequential cases, the original routines loop over sources,
Algorithm A.12 Pseudocode for the SOURCE POWER rotationally symmetric sequential case. Let N1, N2, and N3 be the number of simple, dipole, and tripole sources, respectively. Let m be the number of blocks in the first block row of S and N be the number of rows and columns of S.
//Fill the first block row of S except for the lower triangular portion of the first block.
for l = 1 to 3 do
  for k = 1 to Nl do
    Let (i, j) be the element to which source k contributes;
    if i ≤ j and i ≤ N/m then
      Update S(i, j);
    end if
  end for
end for
//The first block of S is Hermitian, copy the data
Let n = N/m;
for i = 1 to n do
  for j = 1 to i do
    S(i, j) = Conj(S(j, i));
  end for
end for
//Fill the remainder of S knowing it is block circulant
for k = 1 to m − 1 do
  for i = 1 to n do
    l = n ∗ k + i;
    for j = 1 to N do
      t = n ∗ k + j;
      if t > N then
        t = t − N;
      end if
      S(l, t) = S(i, j);
    end for
  end for
end for
not matrix elements. Therefore, this routine’s computations proceed by taking a pro-
cessor’s local indices, (li, lj), into sub[S], converting the local indices into global matrix
indices, (i, j), finding which sources are owned by this S(i, j), and looping through these
sources to compute their contributions to S(i, j). As in the sequential case, there are
three types of sources and, therefore, there are three separate routines for computing
their contributions. However, the overall algorithmic structure is the same.
Again, the matrix S is Hermitian. In contrast to the sequential case, instead of
computing only the upper triangular portion of S, all of S is computed using the source
contributions. While this adds some extra computation, the computation is being done
in parallel. If data were to be copied from the upper triangular section to the lower
triangular section, a large number of communications would have to take place and
would result in a bottleneck.
Algorithm A.13 Pseudocode for the SOURCE POWER general parallel case. Let m be the number of blocks in the first block row of S, and let sub[S]N and sub[S]M be the number of rows and columns of sub[S], respectively.
for li = 1 to sub[S]N do
  for lj = 1 to sub[S]M do
    Compute global indices (i, j) corresponding to (li, lj);
    Let SourceList = Sources corresponding to S(i, j);
    for each source type t do
      for each source of type t in SourceList do
        Update sub[S](li, lj)
      end for
    end for
  end for
end for
A.4.2.2 Rotationally Symmetric
The rotationally symmetric case is identical to the general parallel case in Sec-
tion A.4.2.1 with one modification. Because the matrix S is block circulant, the global
indices are modified to stay within the first block row of S when accessing the source
list. For example, suppose a processor's local index corresponds to an entry residing in the first block of the second block row. Because the matrix is block circulant, the second block row is simply a circular shift of the first block row, which means that the first block of the second row is the last block of the first row. Therefore, the indices corresponding to the first block of the second row are modified to point to the last block of the first row. By performing the computations this way, no communication between the processors is
necessary to fill in the matrix S. The pseudocode for the algorithm, including the index
modifications, is given by Algorithm A.14.
Algorithm A.14 Pseudocode for the SOURCE POWER rotationally symmetric parallel case. Let N be the number of rows and columns of S, m the number of blocks in the first block row of S, and n the order of each block. In addition, define sub[S]N and sub[S]M to be the number of rows and columns of sub[S], respectively.

for li = 1 to sub[S]N do
  for lj = 1 to sub[S]M do
    Compute global indices (i, j) corresponding to (li, lj);
    j = [m − (i − 1)/n] ∗ n − j;
    if j > N then
      j = j − N;
    end if
    i = mod(i − 1, n) + 1;
    Let SourceList = Sources corresponding to S(i, j);
    for each source type in SourceList do
      t = SourceType;
      for each source of type t in SourceList do
        Update sub[S](li, lj);
      end for
    end for
  end for
end for
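For reference, the basic identity behind this kind of index reduction can be stated independently of Algorithm A.14: for an N × N block circulant matrix with m blocks of order n, the entry at global position (i, j) in block row r (0-based, r = (i − 1)/n) equals the entry of the first block row at row (i − 1) mod n + 1 and column (j − rn − 1) mod N + 1. The following sketch illustrates only this generic identity; Algorithm A.14 states its index modification in the thesis's own convention, which also reflects how the source list is organized.

#include <stdio.h>

/*
 * Generic block circulant index reduction (1-based indices, matching the
 * pseudocode): map a global entry (i, j) to the equivalent entry of the
 * first block row.  This illustrates the identity only, not the exact
 * modification used in Algorithm A.14.
 */
static void reduce_to_first_block_row(int i, int j, int n, int N,
                                      int *i0, int *j0)
{
    int r = (i - 1) / n;                      /* 0-based block row index    */
    *i0 = (i - 1) % n + 1;                    /* row within the first block */
    *j0 = ((j - r * n - 1) % N + N) % N + 1;  /* circularly shifted column  */
}

int main(void)
{
    int m = 3, n = 2, N = m * n, i0, j0;

    /* An entry in the first block of the second block row maps to the
       last block of the first block row, as described in the text. */
    reduce_to_first_block_row(3, 1, n, N, &i0, &j0);
    printf("S(3,1) corresponds to first-block-row entry S(%d,%d)\n", i0, j0);
    return 0;
}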
A.5 MODAL RESISTANCE
This routine is quite straightforward. Using the solution vector x obtained from the SOURCE AMPLITUDES MODES routine (see Section A.3) and the matrix S from the SOURCE POWER routine (see Section A.4), it computes s = x∗Sx. Because previous routines have already populated the needed data structures, this routine simply performs the required multiplications using LAPACK or ScaLAPACK.
A.5.1 Sequential
A.5.1.1 General Case
The sequential routine performs two multiplications and uses one intermediate data structure, W, to hold the result of the first multiplication, i.e., W = Sx. After the first multiplication, the second multiplication S = x∗W is performed, reusing the data structure S to hold the solution. The LAPACK routine ZGEMM is used for both multiplications; ZGEMM performs general matrix-matrix multiplication and is used here because, in general, the solution vector x will contain multiple columns. For completeness, the pseudocode for this operation is given by Algorithm A.15.
Algorithm A.15 Pseudocode for the MODAL RESISTANCE general sequential case.

Compute W = Sx using LAPACK routine ZGEMM;
Compute S = x∗W using LAPACK routine ZGEMM;
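Algorithm A.15 maps directly onto two GEMM calls. The following C sketch performs the same two multiplications through the CBLAS interface (cblas_zgemm) rather than the Fortran ZGEMM called by the actual routine; the matrix sizes and data are made up for illustration, with a two-column x to reflect a solution vector containing multiple columns.

/*
 * Illustrative sketch of s = x*Sx via two ZGEMM-style multiplications using
 * the CBLAS interface.  S is N x N (Hermitian) and x is N x K, both stored
 * in column-major order; the data below are arbitrary.
 * Compile with, e.g.,  cc modal.c -lcblas  (or link against any BLAS).
 */
#include <complex.h>
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    enum { N = 3, K = 2 };
    double complex S[N * N] = {                /* column-major, Hermitian   */
        2.0,         1.0 - 1.0*I,  0.0,
        1.0 + 1.0*I, 3.0,          0.5*I,
        0.0,        -0.5*I,        1.0
    };
    double complex x[N * K] = {                /* two "mode" columns        */
        1.0, 0.0, 1.0*I,
        0.5, 1.0, 0.0
    };
    double complex W[N * K];                   /* intermediate W = Sx       */
    double complex R[K * K];                   /* result    R = x*W         */
    double complex one = 1.0, zero = 0.0;

    /* W = Sx */
    cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, K, N, &one, S, N, x, N, &zero, W, N);

    /* R = x*W = x*Sx (conjugate transpose of x) */
    cblas_zgemm(CblasColMajor, CblasConjTrans, CblasNoTrans,
                K, K, N, &one, x, N, W, N, &zero, R, K);

    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            printf("R(%d,%d) = %g + %gi\n", i + 1, j + 1,
                   creal(R[i + j * K]), cimag(R[i + j * K]));
    return 0;
}

The parallel version in Section A.5.3.1 replaces these two calls with PZGEMM calls on the distributed data structures.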
A.5.2 Rotationally Symmetric
This case behaves in exactly the same way as the general sequential case (see Section A.5.1.1).
A.5.3 Parallel
A.5.3.1 General Case
As in the sequential case, an additional data structure is required for the inter-
mediate multiplication. Therefore, the distributed data structure sub[W ] is defined. Be-
cause the distributed data structures needed for the parallel multiplications, i.e., sub[S]
and sub[x], have already been populated, this routine simply uses ScaLAPACK to per-
form the parallel matrix multiplications. The routine calls PZGEMM to compute W = Sx using the distributed data structures. Following the first multiplication, S = x∗W is computed, reusing the distributed data structure
sub[S] to store the solution. Algorithm A.16 shows the pseudocode for the routine.
Algorithm A.16 Pseudocode for the MODAL RESISTANCE general parallel case.

Compute sub[W] = sub[S]sub[x] using ScaLAPACK routine PZGEMM;
Compute sub[S] = sub[x]∗sub[W] using ScaLAPACK routine PZGEMM;
A.5.3.2 Rotationally Symmetric
The parallel rotationally symmetric case is exactly the same as the general parallel
case (see Section A.5.3.1 and Algorithm A.16).