Fourier Transforms
for the
BlueGene/L Communication Network
Heike Jagode
MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2006
ABSTRACT
A computational kernel of particular importance for many scientific applications is
the Fast Fourier Transform (FFT) of multi-dimensional data. A fundamental
challenge is the design and implementation of such parallel numerical algorithms to
utilise efficiently thousands of nodes. The BlueGene/L is a massively parallel high
performance computer organised as a three-dimensional torus of compute nodes. To
maintain application performance and scaling, the correct mapping of MPI tasks onto
the three-dimensional torus communication network is a critical factor. This paper
presents the design and implementation of the parallel two-dimensional and three-
dimensional FFT. For the three-dimensional case we compare one-dimensional with
two-dimensional decomposition of the complex data. The applications call the one-
dimensional single-processor FFT kernel routine provided by the Fastest Fourier
Transform in the West (FFTW) library. We present experimental results of different
node mappings onto the BlueGene/L’s torus on up to 1,024 nodes. The
implementation of the FFT algorithm using two-dimensional decomposition scales well up to 1,024 nodes for a variety of problem sizes (128³, 256³, 512³). Our
experiments clearly indicate that a carefully chosen mapping of MPI tasks onto the
torus network that takes the network characteristics into account is beneficial in
obtaining improved performance for this type of application.
CONTENTS

1 INTRODUCTION
2 OVERVIEW OF THE BLUEGENE/L ARCHITECTURE
   2.1 Hardware Architecture
   2.2 Software Architecture
3 FOURIER TRANSFORM
   3.1 Continuous Fourier Transform
   3.2 Discrete Fourier Transform
   3.3 Fast Fourier Transform
   3.4 Fastest Fourier Transform in the West
4 TWO-DIMENSIONAL FAST FOURIER TRANSFORMS
   4.1 Parallel FFT in Two Dimensions
   4.2 Taskfarm of Parallel FFTs
   4.3 Algorithm Details
   4.4 Verification of Results
   4.5 Performance Analysis
      4.5.1 Mesh versus Torus Network
      4.5.2 Virtual Node Mode on BlueGene/L
      4.5.3 Double FPU on BlueGene/L
      4.5.4 MPI Task Mapping Strategies
         4.5.4.1 Mappings on the 32-node Partition
         4.5.4.2 Mappings on the 128-node Partition
         4.5.4.3 Mappings on the 512-node Partition
5 THREE-DIMENSIONAL FAST FOURIER TRANSFORMS
   5.1 Parallelisation
   5.2 Verification of Results
   5.3 Performance Analysis
      5.3.1 1D-Decomposition versus 2D-Decomposition
      5.3.2 MPI Task Mapping Strategies
         5.3.2.1 Mappings on the 32-node Partition
         5.3.2.2 Mappings on the 128-node Partition
         5.3.2.3 Mappings on the 512-node Partition
         5.3.2.4 Mappings on the 1024-node Partition
6 CONCLUSION
APPENDIX A
APPENDIX B
APPENDIX C
BIBLIOGRAPHY
LIST OF TABLES

4.1 Times measured in seconds for a problem size of 16384² using 128 nodes
4.2 Times measured in seconds for a problem size of 16384² using 512 nodes
4.3 Execution times in seconds for the 2D-FFT computation for different problem sizes using coprocessor mode and virtual node mode on BlueGene/L
4.4 Summary of the investigated node mappings for problem sizes between 2048² and 16384²
4.5 Communication and 2D-FFT computation costs (seconds) for different problem sizes with the mapping yielding best results for mesh and torus
5.1 Performance improvement of the slab decomposition compared to 2D-decomposition
5.2 Summary of the investigated node mappings for different subdivisions of the 2D virtual processor grid
5.3 Communication costs measured in seconds for different problem sizes using the best mapping for each particular subdivision of the 2D virtual processor grid
5.4 Cost for entire forward 3D-FFT computation measured in seconds for different problem sizes using the best mapping for each particular subdivision of the 2D virtual processor grid
A.1 Performance improvement of communication costs for slab decomposition compared to 2D-decomposition
B.1 Performance measurements in seconds for the 3D-FFT implementations using 1D and 2D decomposition for problem size 128³
C.1 Performance measurements in seconds for a ping-pong application sending/receiving 1000 messages of 10 integers (4 bytes each) between 2 nodes differently mapped on the torus network along a line, diagonal, and volume diagonal
C.2 Performance measurements in seconds for a ping-pong application sending/receiving 1000 messages of 100 integers (4 bytes each) between 2 nodes differently mapped on the torus network along a line, diagonal, and volume diagonal
C.3 Performance measurements in seconds for a ping-pong application sending/receiving 1000 messages of 1,000 integers (4 bytes each) between 2 nodes differently mapped on the torus network along a line, diagonal, and volume diagonal
C.4 Performance measurements in seconds for a ping-pong application sending/receiving 1000 messages of 10,000 integers (4 bytes each) between 2 nodes differently mapped on the torus network along a line, diagonal, and volume diagonal
LIST OF FIGURES

2.1 Torus network with periodic boundary conditions
2.2 Node mappings on torus network along a line, diagonal, and volume diagonal
2.3 Performance measurements for ping-pong application sending/receiving messages of 100 integers between 2 nodes differently mapped on torus network along a line, diagonal, and volume diagonal
4.1 Computational steps of the two-dimensional FFT implementation
4.2 FFTW library functions and resort strategies for the two-dimensional FFT computation
4.3 Comparison of mesh vs torus network for a variety of problem sizes
4.4 Comparison of mesh vs torus network
4.5 Times of the forward 2D-FFT for a problem size of 2048²
4.6 Customised versus default mapping on 32-node partition
4.7 Performance impact of customised versus default mapping on 32-node partition
4.8 Two customised mappings versus default mapping on 128-node partition
4.9 Performance impact of customised versus default mapping on 128-node partition
4.10 Customised mappings versus default mapping on 128-node partition
4.11 Performance impact of customised versus default mapping on 128-node partition
4.12 Two customised mappings versus default mapping on 512-node partition
4.13 Performance impact of customised versus default mapping on 512-node partition (mesh)
4.14 Performance impact of customised versus default mapping on 512-node partition (torus)
4.15 Customised mappings versus default mapping on 512-node partition
4.16 Performance impact of customised versus default mapping on 512-node partition (torus)
5.1 Computational steps of the 3D-FFT implementation using 1D-decomposition
5.2 Computational steps of the 3D-FFT implementation using 2D-decomposition
5.3 Speedup of the 3D-FFT implementation using 1D-decomposition
5.4 Speedup of the 3D-FFT implementation using 2D-decomposition
5.5 (a) Performance measurements for the 3D-FFT implementations using 1D and 2D decomposition for five different problem sizes, respectively
5.5 (b) Performance measurements for the communication times of the 3D-FFT implementations using 1D and 2D decomposition for five different problem sizes, respectively
5.6 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 32-node partition using dims={8, 4} for the 2D virtual processor grid
5.7 Performance impact of customised node mapping for 3D-FFT on a 32-node partition for various problem sizes
5.8 Two customised and default node mappings for the 1st and 2nd all-to-all communication on a 128-node partition using dims={16, 8} and dims={8, 16} for the 2D virtual processor grid
5.9 Performance impact of customised node mapping for 3D-FFT on a 128-node partition using dims={16, 8} for the 2D virtual processor grid
5.10 Performance impact of customised node mapping for 3D-FFT on a 128-node partition using dims={8, 16} for the 2D virtual processor grid
5.11 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={32, 16} for the 2D virtual processor grid
5.12 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={32, 16} for the 2D virtual processor grid on the torus network
5.13 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={64, 8} for the 2D virtual processor grid
5.14 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={64, 8} for the 2D virtual processor grid on the mesh network
5.15 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={64, 8} for the 2D virtual processor grid on the torus network
5.16 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={8, 64} for the 2D virtual processor grid
5.17 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={8, 64} for the 2D virtual processor grid on the mesh network
5.18 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={8, 64} for the 2D virtual processor grid on the torus network
5.19 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={128, 4} for the 2D virtual processor grid
5.20 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={128, 4} for the 2D virtual processor grid on the torus network
5.21 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={32, 32} for the 2D virtual processor grid
5.22 Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={32, 32} for the 2D virtual processor grid on the torus network
5.23 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={8, 128} for the 2D virtual processor grid
5.24 Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={8, 128} for the 2D virtual processor grid on the torus network
5.25 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={256, 4} for the 2D virtual processor grid
5.26 Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={4, 256} for the 2D virtual processor grid on the torus network
A.1 Speedup of the 3D-FFT implementation using 1D-decomposition
A.2 Speedup of the 3D-FFT implementation using 2D-decomposition
ACKNOWLEDGEMENTS
I wish to thank Dr Joachim Hein for his excellent guidance, support, patience and
encouragement throughout the duration of this project.
Jon Bashor is greatly acknowledged for his proofreading assistance.
A special thank-you goes to Professor Dr Wolfgang E. Nagel for making it all possible.
I don’t want to miss the opportunity to thank the “HM-Team” for many enjoyable
hours we spent together and for making this time unforgettable.
And, I would like to thank my family for their unbelievable support and
understanding through the entire year I spent at the University of Edinburgh.
1. INTRODUCTION
The Fast Fourier Transforms (FFTs) of multi-dimensional data are of particular
importance in a variety of different scientific applications, but are often one of the
most computationally expensive components. Parallel FFTs are communication intensive and often prevent an application from scaling to a very large number of processors.
The BlueGene/L system architecture was designed to support efficient execution of
massively parallel message-passing programs [13]. The system consists of thousands
of compute nodes which operate at a moderate clock frequency of 700 MHz [3]. This
vast parallelism is characterised by lower power consumption compared to current
supercomputer systems.
A fundamental challenge of parallel numerical algorithms – such as the FFTs of
multi-dimensional data – is their design and implementation to utilise efficiently
thousands of nodes. Our starting point is the description of the design and
implementation of the parallel two-dimensional and three-dimensional FFT. For the
three-dimensional case we investigated two different implementations which are
presently widely discussed in the literature [5, 6]. The first implementation uses a
one-dimensional decomposition of the data and the second uses a two-dimensional
decomposition. An implementation that decomposes the data in only one dimension is limited in that it cannot use more processors than there are data elements along a single dimension. With a two-dimensional decomposition, on the other hand, up to N² processors can be utilised (where N is the size of the data along a single axis). We compare the performance of both implementations with respect to the
problem size and number of processors used.
Another important architectural characteristic of BlueGene/L is the organisation of
compute nodes as a three-dimensional torus. The main feature of the torus
communication network is that every node is connected to its six neighbour nodes
through bidirectional links.
To maintain application performance and scaling, the correct mapping of MPI tasks
onto the torus network is a critical factor. We explore the impact of a variety of node
mappings on the performance of the three-dimensional FFT computation using two-
dimensional decomposition of the data.
Before we consider the three-dimensional FFT, a number of investigations on the
two-dimensional FFT computation have been carried out. There are several reasons for exploring the two-dimensional case exhaustively. For instance, the two-dimensional computation constitutes half of the three-dimensional computation that uses two-dimensional decomposition, because the communication kernel of the parallel three-dimensional FFT with two-dimensional decomposition consists of two all-to-all communications. With the investigations carried out for the two-dimensional FFT, we can study in isolation the impact of node mappings on the performance of a single all-to-all communication, which helps to uncover potential performance issues for the three-dimensional case.
The rest of this paper is organised as follows. Next is an overview of the hardware
and software architecture of BlueGene/L. Chapter 3 contains mathematical
background information of the Fourier transforms. It is followed by a mathematical
description of the two-dimensional FFT implementation. The node mapping strategies for the two-dimensional case are briefly discussed in chapter 4. The description of the design and implementation of the three-dimensional FFT is broken down into two versions, one decomposing the data array in one dimension and the other in two dimensions. Both are covered in chapter 5. We continue
with our investigations of a number of node mappings onto the BlueGene/L’s torus
on up to 1,024 nodes for the three-dimensional FFT computation that uses two-
dimensional decomposition. In chapter 6, we then describe and discuss the
experimental results and draw our conclusions.
2. OVERVIEW OF THE BLUEGENE/L
ARCHITECTURE
2.1 Hardware Architecture
The BlueGene/L supercomputer is a massively parallel system developed by IBM in
partnership with Lawrence Livermore National Laboratory (LLNL) [5, 8, 11, 13].
This system-on-a-chip design that integrates embedded low-power processors, high-
performance network interfaces and embedded memory [13] results in extremely
high power and space efficiency [8]. The full details of the system architecture are
extensively described elsewhere [11, 14] and we provide a brief overview with the
focus on the features that are particularly relevant to our project.
The University of Edinburgh BlueGene/L machine, BlueSky, offers a total of 1024
compute chips in a single cabinet. Each chip has two processors (nodes), which means BlueSky offers a total of 2048 processors [3]. Operating at a moderate clock
frequency of 700 MHz, BlueSky delivers a theoretical peak computing power of
5.7 TFlops [3], when both processors in each chip are used. Each chip incorporates
two standard 32-bit embedded IBM PowerPC 440 processors with private L1
instruction and data cache, a small (2 KB) L2 cache and prefetch buffer, 4 MB of
embedded dynamic random access memory [13, 14] acting as a L3 cache, and 512
MB of main memory. The L2 and L3 cache as well as the main memory are shared
by the two compute nodes on a chip.
The dual-processor compute chip can operate in one of two modes [3, 13]. In coprocessor mode, which spans the entire memory of the chip, the first processor is used for computation and the second for communication. In virtual node mode, two single-threaded processes, each effectively using half of the chip memory, run on one compute node [13].
Each processor in a chip has a dual floating-point unit (FPU) – also known as
“Double Hummer” [3] – consisting of two 64-bit FPUs operating in parallel to
Figure 2.1: torus network with periodic boundary conditions
Figure 2.2: Node mappings on torus network along a line, diagonal, and volume diagonal
mainly support complex number arithmetic. For the efficient use of the double FPU,
16-byte alignment of the data is required [3]. To generate code that uses the dual FPU, the compiler has to know the alignment properties of the data [3]; how this is actually implemented is described elsewhere [3]. However, for FFT
implementations this feature can be utilised to save a significant number of
arithmetic operations, leading to improved performance [4].
The BlueGene/L architecture features five different networks (not all of which are
described here). For the FFT computation, the most important network is the three-
dimensional torus. The 512-node partition forms the smallest 8 × 8 × 8 torus. Each of
the 512 compute nodes is connected to its six neighbours through 154 MB/s/link
bidirectional channels [13] (see figure 2.1).
To maintain application performance and scaling, the correct mapping of MPI tasks
onto the torus network plays a crucial role. A performance analysis of a ping-pong
application has underlined that the communication times can be minimised when a
particular node mapping is used which takes the torus network features into account.
Figure 2.3: Performance measurements for ping-pong application sending / receiving messages of 100 integers between 2 nodes differently mapped on torus network along a line, diagonal, and volume diagonal
Figure 2.3 shows the results from sending and receiving messages (100 integers of 4 bytes each) from rank 0 to one of the 7 remaining nodes separately
mapped along a line (a), the diagonal (b) and volume diagonal (c). The reported
times are for 1000 full cycles. The node mapping of all three cases is illustrated in
figure 2.2. It shows that, in all three cases, communication is fastest between nearest neighbours and slows down as the nodes are located further apart within the network. The measurements for a variety of message sizes have been
added to Appendix C.
2.2 Software Architecture
Here we focus on the Message Passing Interface (MPI) implementation rather than
the system software. However, we briefly mention two pieces of system information relevant to this project. The C compiler on the University of Edinburgh's
BlueGene/L is IBM Visual Age C compiler version 7.0 and the driver version is
v1r2m1 (V1R2M1_020_2006-060110). For a detailed discussion of the system
software we refer to [11, 13, 14]. We briefly summarise the main features of the MPI
for BlueGene/L implementation – relevant to our project – which are extensively
discussed in [13].
The BlueGene/L supercomputer was designed to support efficient execution of
massively parallel message-passing programs. Part of this support is an optimised
implementation of the Message Passing Interface, which takes the hardware features
of BlueGene/L into account [13]. MPI for BlueGene/L is implemented on top of
MPICH2 library [15] from Argonne National Laboratory.
The MPI and MPICH2 libraries are used in both BlueGene/L modes of operation: the
coprocessor mode and virtual node mode. In coprocessor mode, to support the
concurrent operation of the two non-cache-coherent processors in a compute node,
the message layer allows the use of the second processor as a communication
coprocessor [13]. The message layer provides a non-L1-cached – and hence coherent
– area of the memory to coordinate the two processors [13]. In virtual node mode,
two separate processes run, one on each processor of a chip. Hence, some resources, such as memory and the torus network, are split evenly between the processors, while others, such as the L3 cache, are shared [13]. The two MPI tasks not only share the network, but also communicate
with each other. Therefore, the MPI for BlueGene/L implementation provides a
virtual torus device, served by a virtual packet layer [13].
On a machine such as BlueGene/L, the correct mapping of MPI tasks to the torus
network is a critical factor in maintaining application performance and scaling [13].
For that reason, the message layer allows arbitrary mapping of torus coordinates to
ranks. This mapping can be specified via an input file – the so-called mapfile –
listing the torus coordinates of each process in increasing rank order.
Within the torus network, the data packets are routed on an individual basis using
one of two routing strategies. The algorithm, in which all packets follow the same
path along the x, y, and z dimension (in this order), is called the deterministic routing
algorithm [13]. The second is a minimal adaptive routing algorithm, which allows
individual packets to make decisions about routing, resulting in potential out-of-order
delivery of packets [13]. This potential out-of-order delivery forces the MPI library
to reorder them in software. A packet reordering is expensive because it involves
memory copies and requires packets to carry additional information [13]. On the
other hand, deterministic routing leads to more network congestion, even on lightly
used networks. For our implementations, data packets are routed entirely in hardware
from the source to the destination node.
Most MPI implementations, including MPICH2, typically implement collective
communication in terms of point-to-point messages [12, 13]. On the BlueGene/L
platform, the default collective implementations of MPICH2 suffer from low
performance because they are written for a crossbar-type network, not for special
network topologies such as the BlueGene/L torus network [13]. For all-to-all communication, both the MPI_Alltoall and MPI_Alltoallv algorithms are optimised for the BlueGene/L architecture. The optimised algorithm uses the message layer
directly and optimises the injection of packets to achieve high network efficiency
[13]. For this investigation of the two-dimensional and three-dimensional FFT
computations, the MPI_Alltoall algorithm is used. For future studies, it would be useful to investigate FFT computations using MPI_Alltoallv.
3. FOURIER TRANSFORM

To better explain the two-dimensional and the two three-dimensional Fast Fourier Transform (FFT) implementations, covered in chapters 4.3 and 5.1, some background information about Fourier Transforms (FTs) is provided at the level relevant to this project. Fourier Transforms are of enormous importance for many applications in applied and engineering science. This mathematical tool is a linear transform which converts, for example, spatial information into information lying in the frequency domain and vice versa [26, 27]. All periodic signals may be represented by an infinite sum or integral of trigonometric sines and cosines, which are associated with the symmetrical and asymmetrical information, respectively [26, 27]. Alternatively to trigonometric functions, one can use exponentials to formulate Fourier Transforms. The connection between the two is via Euler's formula:

$$e^{i\theta} = \cos(\theta) + i\sin(\theta), \qquad \cos(\theta) = \frac{1}{2}\left(e^{i\theta} + e^{-i\theta}\right), \qquad \sin(\theta) = \frac{1}{2i}\left(e^{i\theta} - e^{-i\theta}\right) \tag{3.1}$$

3.1 Continuous Fourier Transform

If one considers the one-dimensional case, the FT converts a function f(x) of a single variable x in the spatial domain into a function F(u) of frequencies u in the frequency domain, in order to analyse the frequencies present in a sampled signal [17]. In general, one has two descriptions of the same physical process, each defined through a function. Consider a continuous function f(x) of a single variable. The Fourier Transform of that function is defined by:

$$F(u) = \int_{-\infty}^{+\infty} f(x)\, e^{-2\pi i x u}\, dx \tag{3.2}$$

Generally, the Fourier Transform F(u) will be a complex quantity, even if the original data is real. To regenerate the original function f(x) from its Fourier Transform (3.2), the inverse Fourier Transform (3.3) comes into play, which looks fairly similar, except that the exponential term has the opposite sign:
$$f(x) = \int_{-\infty}^{+\infty} F(u)\, e^{2\pi i x u}\, du \tag{3.3}$$

The two- and three-dimensional FT equations can be developed from equations (3.2) and (3.3) in a fairly straightforward way. The following equations (3.4) present the Fourier Transform and its inverse for the two-dimensional case:

$$F(u,v) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y)\, e^{-2\pi i (xu + yv)}\, dx\, dy, \qquad f(x,y) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} F(u,v)\, e^{2\pi i (xu + yv)}\, du\, dv \tag{3.4}$$

For the sake of completeness, and since the emphasis of this project is on the three-dimensional case, equation (3.5) shows the three-dimensional continuous Fourier Transform including its inverse:

$$F(u,v,w) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y,z)\, e^{-2\pi i (xu + yv + zw)}\, dx\, dy\, dz, \qquad f(x,y,z) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} F(u,v,w)\, e^{2\pi i (xu + yv + zw)}\, du\, dv\, dw \tag{3.5}$$

3.2 Discrete Fourier Transform

For computational calculations one often needs functions defined on discrete instead of continuous domains. In the most common situation, the function's values are obtained by sampling at evenly spaced intervals [17]. One has to approximate the integrals in (3.2) and (3.3) (taking the one-dimensional case as an example) by discrete sums. The discrete Fourier Transform and its inverse for the one-dimensional case of L samples at values of x from 0 to L-1 are of the form:

$$F(u) = \sum_{x=0}^{L-1} f(x)\, e^{-2\pi i \frac{ux}{L}}, \qquad f(x) = \frac{1}{L} \sum_{u=0}^{L-1} F(u)\, e^{2\pi i \frac{ux}{L}} \tag{3.6}$$

Again, the two-dimensional discrete Fourier Transform works in a similar way. For an L × M grid in the x and y directions, one gets the following equations:

$$F(u,v) = \sum_{y=0}^{M-1}\sum_{x=0}^{L-1} f(x,y)\, e^{-2\pi i \left(\frac{ux}{L} + \frac{vy}{M}\right)}, \qquad f(x,y) = \frac{1}{L \cdot M} \sum_{v=0}^{M-1}\sum_{u=0}^{L-1} F(u,v)\, e^{2\pi i \left(\frac{ux}{L} + \frac{vy}{M}\right)} \tag{3.7}$$
The three-dimensional discrete Fourier Transform for an L × M × N data grid in the x, y and z directions is shown in (3.8):

$$F(u,v,w) = \sum_{z=0}^{N-1}\sum_{y=0}^{M-1}\sum_{x=0}^{L-1} f(x,y,z)\, e^{-2\pi i \left(\frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N}\right)}, \qquad f(x,y,z) = \frac{1}{L \cdot M \cdot N} \sum_{w=0}^{N-1}\sum_{v=0}^{M-1}\sum_{u=0}^{L-1} F(u,v,w)\, e^{2\pi i \left(\frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N}\right)} \tag{3.8}$$
3.3 Fast Fourier Transform
The computational cost of the discrete Fourier Transform of N points follows from the fact that each of the N output points is computed from all N points of the original function [26, 27]. Mathematically, this is a matrix-vector multiplication requiring N² complex multiplications, so the discrete Fourier Transform is an O(N²) process.
In the mid-1960s, J. W. Cooley and J. W. Tukey published a discrete Fourier
Transform algorithm, known as Fast Fourier Transform (FFT), which computes the
discrete Fourier Transform in O(N log2 N) operations [17]. One of the clearest derivations of the FFT algorithm, known as the Danielson-Lanczos Lemma, shows that a discrete Fourier Transform of length N can be rewritten as the sum of two discrete Fourier Transforms of length N/2 [17]. One sum is formed from the even-numbered points and the other from the odd-numbered points. Both transforms are periodic with length N/2. For the proof of this derivation, the
reader is referred to the “Numerical Recipes” book [17]. In the “Numerical Recipes”
it is also recommended that one use FFTs with N as an integer power of two to
maintain O(N log2 N), although other cases can also be treated. With this restriction
on N, the Danielson-Lanczos Lemma can be applied until the data has been
subdivided all the way down to transforms of length one. To illustrate further, the next steps would be to recursively subdivide the two sums of even-numbered and odd-numbered points of length N/2 into respective sub-sums of even-even-numbered, even-odd-numbered, odd-even-numbered and odd-odd-numbered points, each of length N/4. So, for every pattern of log2 N even's (e) and odd's (o), there is a one-point transform that is just one of the input numbers $f_n$ [17], e.g.

$$F^{eoeeoeo \cdots oee}(u) = f_n \quad \text{for some } n \tag{3.9}$$
The next necessary part of the Fast Fourier Transform is the so-called bit reversal reordering, which matches the even-odd patterns of equation (3.9) to the values of n [17]. The mathematically exact treatment of this method is beyond the scope of this
project, more details can be found in the “Numerical Recipes” book [17]. However,
we summarise the main steps. First, one has to reverse the pattern of evens and odds.
Secondly, one has to write even and odd as binary notation, which means even = 0
and odd = 1. Once these two steps are done, one has the value of n in binary notation.
The points as given are the one-point transforms, which is simply the operation that
copies the one input number into its one output slot. Now the Danielson-Lanczos
Lemma can be applied, which combines pairs of one-point transforms to get two-
point transforms, and so on, until the first and second halves of the entire data set are
combined into the final transform [17]. Each combination is an order N process and
there are log2 N combinations. So, in summary the entire algorithm is of order O(N
log2 N) [17].
3. 4 Fastest Fourier Transform in the West library
The applications used here call one-dimensional single-processor FFT kernel
routines. The portable open-source Fastest Fourier Transform in the West (FFTW)
2.1.5 library has been used. FFTW is a state-of-the-art C subroutine library for
computing the discrete Fourier Transform in one or more dimensions, for both real
and complex data, and of arbitrary input size [28]. FFTW uses empirical approaches
to automatically optimise FFT computation on a wide range of architectures [8]. The
current version installed on the University of Edinburgh’s BlueGene/L is FFTW
2.1.5, available in two releases. One is a version of FFTW-GEL from the Vienna
University of Technology [10], which is based on FFTW 2.1.5 and optimised for the
double floating point unit specially designed for each processing core on
BlueGene/L. The other is the standard FFTW 2.1.5 library [3]. Both versions have
been tried, and the one yielding the best performance has been used for all further
implementations and investigations.
FFTW implements a two-step algorithm to calculate a transform [28]. At first, a plan
is computed which serves as input for the second step. In order to create the plan, all
data necessary for the Fourier Transform computation is needed. During the plan
computation, several FFTs are run and measured at run time in order to find the best
way to compute the requested transform of a given size [29, 19]. That makes plan
computation more expensive than the actual transform. However, once a plan is
created, it can be reused many times for a fixed problem size, which overall
speeds up FFTW significantly [28]. In the second step, the created plan is used to
compute the actual transform.
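The two-step plan/execute pattern, as exposed by the FFTW 2.x API, can be sketched as follows. This is an illustrative fragment (not runnable without the FFTW 2 library); the function name and the in-place flag choice are example assumptions:

```c
#include <fftw.h>   /* FFTW 2.x header (version used in this project) */

/* Illustrative sketch: transform 'howmany' contiguous 1D arrays of
 * length n stored back-to-back in 'data'. */
void transform_many(fftw_complex *data, int n, int howmany)
{
    /* Step 1: create a plan once. FFTW_MEASURE runs and times several
     * candidate FFTs to pick the fastest strategy for this size, which
     * makes plan creation expensive but the plan reusable. */
    fftw_plan plan = fftw_create_plan(n, FFTW_FORWARD,
                                      FFTW_MEASURE | FFTW_IN_PLACE);

    /* Step 2: reuse the plan for many transforms of the same size.
     * Here: 'howmany' transforms, stride 1, distance n between them. */
    fftw(plan, howmany, data, 1, n, NULL, 1, n);

    fftw_destroy_plan(plan);
}
```

In a long-running application the plan would typically be created once at start-up and the expensive measurement cost amortised over many transform calls.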
4. TWO-DIMENSIONAL FAST FOURIER
TRANSFORMS
4. 1 Parallel FFTs in Two Dimensions
Before the three-dimensional Fast Fourier Transform (FFT) was implemented
and different mapping strategies of MPI tasks onto the physical processor grid
were investigated, fairly extensive investigations of taskfarms of two-dimensional
FFTs were carried out. There are at least three principal reasons for this. First,
two-dimensional computation is half of the three-dimensional computation. This is
because the communication kernel for the parallel two-dimensional FFT computation
[21] is one all-to-all communication between the two one-dimensional FFT
calculations. For the parallel three-dimensional case, two implementations have been
investigated – one where the three-dimensional complex data array is decomposed in
one dimension and for the other version in two dimensions. For the first, only one
all-to-all communication is needed. For the second implementation, two all-to-all
communications are necessary. We study separately the impact of several node
mappings on the performance of a single all-to-all communication in the two-
dimensional case in order to uncover possible performance issues for the three-dimensional
case.
The second principal reason for extensively investigating the two-dimensional case is
the taskfarm, to which we return in the following section. The third reason is
the partial Fast Fourier Transform, which is also related to taskfarms.
4. 2 Taskfarm of parallel FFTs
For the two-dimensional case, taskfarms are of considerable importance since the
BlueGene/L is characterised by constraints on the partition sizes. More precisely, if
an application yields best performance results on 256 nodes, one has to request the
512-node partition. However, to enforce the use of the entire partition requested, the
partition has been filled up as a taskfarm by simultaneously running the same
program with the investigated mapping strategies several times. It also verifies the
reproducibility of the execution times.
Partial Fast Fourier Transforms are also important for many scientific applications.
Consider the function f(x, y, z) and the Fast Fourier Transform computation over only
two dimensions, e.g. y and z. The partial FFT equation and its inverse are then:

    F(x, v, w) = Σ_{z=0}^{N-1} Σ_{y=0}^{M-1} f(x, y, z) · e^{-2πi(vy/M + wz/N)}
                                                                                          (4.1)
    f(x, y, z) = 1/(M·N) · Σ_{w=0}^{N-1} Σ_{v=0}^{M-1} F(x, v, w) · e^{2πi(vy/M + wz/N)}

For partial FFTs one can perform taskfarm computations in the sense of
simultaneous runs of the same application, but for different values of x.
4. 3 Algorithm Details
Consider A_{x,y} as a two-dimensional array of L × M complex numbers with:

    A_{x,y} ∈ ℂ,   ∀x: 0 ≤ x < L,   ∀y: 0 ≤ y < M

The two-dimensional FFT is computed by the equation described in (3.7):

    B_{u,v} = Σ_{x=0}^{L-1} Σ_{y=0}^{M-1} A_{x,y} · e^{-2πi(ux/L + vy/M)}          (4.2)

In other words, the two-dimensional FFT is an array B_{u,v} of L × M complex
numbers. This computation is performed in two separate stages: first, the one-
dimensional FFT is computed along the y dimension and, secondly, along the x
dimension. Therefore, (4.2) can be written as:

    B_{u,v} = Σ_{x=0}^{L-1} ( Σ_{y=0}^{M-1} A_{x,y} · e^{-2πi vy/M} ) · e^{-2πi ux/L}          (4.3)

where C_{x,v} = Σ_{y=0}^{M-1} A_{x,y} · e^{-2πi vy/M} is the 1D-FFT of A_{x,:} for all x
values (the 1st one-dimensional computation, along the y dimension), and
B_{u,v} = Σ_{x=0}^{L-1} C_{x,v} · e^{-2πi ux/L} is the 1D-FFT of C_{:,v} for all v values
(the 2nd one-dimensional computation, along the x dimension).
Figure 4.1: Computational steps of the two-dimensional FFT implementation
Figure 4.1 illustrates the described implementation of the two-dimensional FFT of an
array of size L × M – where a data size equal in each dimension, i.e. L = M, has
been used. More precisely, A(0 : L_x − 1, 0 : M − 1) is an L_x × M array of complex
numbers distributed onto P nodes, so each node stores a section of size L_x × M
(L_x = L/P) of the data array A in its local memory. At first (a), L_x independent one-
dimensional FFTs of size M along the y dimension are calculated. Secondly (b),
M_y independent one-dimensional FFTs of size L along the x dimension are calculated
(M_y = M/P).
For calculating the independent one-dimensional FFTs, the FFTW library function
fftw() has been used. The justification for using FFTW has been covered in chapter
3.4. Before starting with the actual investigations, two different input parameters for
the FFTW library function fftw() have been compared and the one yielding the best
performance for the entire two-dimensional forward FFT computation has been used.
In this context, re-sorting strategies of the data play a crucial role.
In general, the input parameters for the fftw() library function are fftw( plan,
howmany, in_array, in_stride, in_distance, out_array, out_stride, out_distance ). We
compare results using different values for the stride and distance parameters
((stride=1 AND distance≠1) OR (stride≠1 AND distance=1)). The same values used
for the input parameters in_stride and in_distance were used for the output
parameters out_stride and out_distance. For both versions, we considered the
advantages and disadvantages regarding plan creation, re-sorting the data and the
overall times for the entire forward FFT computation.
Figure 4.2: FFTW library functions and resort strategies for the two-dimensional FFT computation
Consider the L × M two-dimensional data array shown in figure 4.2. The numbering
of the first two columns of the data grid provides a better understanding of the two
different re-sort strategies which are partially sequential. For the first
implementation, shown in figure 4.2.1 (a) and (b), for both Fast Fourier Transform
calculations along the y and x dimensions the fftw() library function with stride=1 is
used. This means that after the first FFT computation two re-sort methods are
necessary, one before and one after the all-to-all communication, as can be seen in
figure 4.2.1.b. The first re-sort method before the all-to-all communication sorts the
data along rows first. This means it does not access data in the order it is stored in memory. If
data is accessed in non-sequential order, cache misses will occur at every single step,
since the data in a cache block is evicted before it is used [30]. The second re-sort
method after the all-to-all communication becomes necessary to get the data in the
correct order for using the fftw() library function with stride=1 to compute the
second one-dimensional FFT along the x dimension. It is interesting to see whether
not re-sorting and calling strided fftw() is more efficient.
For this reason, we have investigated a second implementation using fftw() with
stride=1 only for the first FFT computation along the y dimension, and fftw() with
stride≠1 for the second FFT computation along the x dimension. It has the advantage
that only one re-sort method before the all-to-all communication is needed which is
still partially sequential. A further, and not inconsiderable, advantage is that this re-sort
method does not experience cache misses at every single step, but only at every (M/P)-th
step (see figures 4.2.2.a and b). Since this becomes beneficial for large problem
sizes, tables 4.1 and 4.2 summarise times for a data problem size of 16384² using 128
and 512 nodes. All presented times have been measured on a hot L3 cache, e.g. by
transforming the same data multiple times. It also verifies reproducibility of the times
shown in table 4.1 and 4.2. After discarding the very first run, the fastest of the
remaining runs has been chosen for all future times presented in this paper. To get an
idea where the differences in the times for the entire forward FFT come from, all
main steps have been measured separately (creation of plans for the FFT
computation, the fftw() calls, the re-sort methods, and the all-to-all communication). The timing
of the entire forward two-dimensional FFT computation is encapsulated in a
MPI_Barrier() pair. It starts after getting data arrays ready for the first FFT
computation and ends directly after the final fftw() call.
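The timing harness can be sketched as follows. This is an illustrative fragment assuming an MPI environment; `forward_2dfft` is a hypothetical name standing in for the sequence of fftw() calls, re-sorts and the all-to-all:

```c
#include <mpi.h>

extern void forward_2dfft(void);   /* hypothetical: fftw() calls,
                                      re-sorts, all-to-all exchange */

/* Time one forward 2D-FFT, synchronising all tasks before and after
 * so the measured interval covers the slowest task. */
double time_forward_fft(void)
{
    MPI_Barrier(MPI_COMM_WORLD);   /* all tasks ready to start */
    double t0 = MPI_Wtime();

    forward_2dfft();

    MPI_Barrier(MPI_COMM_WORLD);   /* wait until the last task finishes */
    return MPI_Wtime() - t0;
}
```

Without the barriers, tasks that finish their local work early would report shorter, misleading times.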
TIME                         STRIDE=1   STRIDE≠1 WITHOUT NEW PLAN   STRIDE≠1 WITH NEW PLAN
for plan creations           15.491     15.436                      31.293
for FFTW()s                  0.509      1.416                       1.255
for re-sort methods          0.414      0.050                       0.051
for all-to-all comm          0.232      0.232                       0.226
for entire forward 2D-FFT    1.157      1.698                       1.534
Table 4.1: Times measured in seconds for a problem size of 16384² using 128 nodes
TIME                         STRIDE=1   STRIDE≠1 WITHOUT NEW PLAN   STRIDE≠1 WITH NEW PLAN
for plan creations           15.483     15.464                      30.989
for FFTW()s                  0.128      0.331                       0.293
for re-sort methods          0.075      0.017                       0.017
for all-to-all comm          0.057      0.057                       0.058
for entire forward 2D-FFT    0.265      0.410                       0.379
Table 4.2: Times measured in seconds for a problem size of 16384² using 512 nodes
Times for three different implementations are shown in table 4.1 and 4.2. The first
case with the table header “stride=1” is illustrated in figure 4.2.1.a and b. The second
and third case with table header “stride≠1 without new plan” and “stride≠1 with new
plan” is illustrated in figure 4.2.2.a and b. The reason for a second new plan
computation is the use of two different FFTW routines, one with stride=1 and the
other with stride≠1. Since the FFTW plan creation is expensive, it has also been
investigated whether there is a substantial advantage in providing additional plans
for the forward and backward FFTs, evaluated with the fftw() library function with
stride≠1.
For both runs using 128 and 512 nodes respectively, the total amount of time spent in
the fftw() library function is a considerable fraction of the total time. As expected, the
reorganisation of the data which avoids cache misses in every single step becomes
extremely cheap if one uses fftw() with stride≠1 for the second FFT computation
along the x dimension. However, the fftw() call with stride≠1 is much more
expensive, so that even the overall time for the entire two-dimensional FFT
computation is affected. The advantage won by the cheap reorganisation of the data is
lost again in the strided fftw() calls. The results presented in both tables also show that the
additional plan creation for the strided fftw() calls has a beneficial impact on
performance; it would be worth considering a second plan creation, since it is only
done once and the plan can be reused many times.
However, the overall performance of the strided fftw() is poor and for all further
implementations and investigations, the fftw() library functions with stride=1 and
expensive re-sort strategies have been used – hence no second plan computation is
needed.
4. 4 Verification of Results
Before any investigations were made, we ensured that the two-dimensional FFT
computation is correct. The test function, which has also been implemented for the
three-dimensional FFT computation, is described only once. To provide a
complete mathematical description, the test function is specified for the more
complex three-dimensional case in chapter 5.2.
4. 5 Performance Analysis
Various runs of the two-dimensional FFT computation have been performed on
BlueGene/L. Two parameters have been varied – the number of nodes used and the
size of the data being transformed. For the implementation, only one processor of the
chip has been used to run one MPI task (coprocessor mode). A more detailed
description of the implementation has been covered earlier in chapter 4.3.
For the parallel two-dimensional Fast Fourier Transform computation, the effect of
the three-dimensional mesh and torus communication networks on BlueGene/L has
been investigated. The mesh and torus are the same network with the exception that
the torus has periodic boundary conditions in all three dimensions [3]. More
precisely, for the 512-node partition, each of the 512 compute nodes is connected to
its six neighbours through bidirectional links [13]. The mesh is characterised by open
boundary conditions in all three dimensions. The network provides fastest
communication between processors close to each other [3].
When running parallel codes on BlueGene/L using the mpirun command, the MPI
tasks are mapped to the physical processor grid of the machine [3]. The performance
measurements of the ping-pong application (see chapter 2.1) have shown that the
communication times can be minimised when a particular node mapping is used
which takes the network characteristics into account. To benefit from these features,
for the two-dimensional FFT computation, where all-to-all communications become
extremely expensive for very large node counts, we have explored a variety of MPI
task mappings on the physical processor grid on BlueGene/L. This optimisation has
to be carried out for each partition size on BlueGene/L since the shape of the
partitions changes with size [3, 8]. The partitions on BlueGene/L have three
dimensions and 32, 128, 512, and 1024 are the total numbers of chips available in the
partitions. The possibility to choose between mesh and torus is only for the 512- and
1024-node partitions. For the smaller partitions only mesh is offered. The default
node mapping on the machine is done by filling up a three-dimensional array first in
the x direction, then in the y, and finally in the z direction. The mpitrace tool reports that, for
instance, the 128-node partition consists of an 8 × 4 × 4 block of nodes, whereas a
512-node partition is an 8 × 8 × 8 block [3]. Consequently, the optimal mapping for
Figure 4.3: Comparison of mesh vs torus network for a variety of problem sizes
one partition size can differ substantially from the optimal mapping for another
partition [8]. This process is facilitated by the capability to specify a node mapping
at run time using the -mapfile option to the mpirun command. Unless
specifically noted, all of the following performance results are from the coprocessor
mode. The performance measurements achieved from the customised node mappings
have been compared with the results obtained from the default node mapping on
BlueGene/L.
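A mapfile is a plain-text file with one line per MPI task, giving the torus coordinates that task is placed on. A hypothetical fragment for the first eight tasks of a custom mapping (columns x y z t, where t is the processor within a chip) might look like:

```
0 0 0 0
1 0 0 0
0 1 0 0
1 1 0 0
0 0 1 0
1 0 1 0
0 1 1 0
1 1 1 0
```

This example places the first 8 MPI tasks on a 2 × 2 × 2 cube of neighbouring nodes, with only the first processor of each chip used (coprocessor mode).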
4. 5. 1 mesh versus torus network
Before particular node mappings were considered, the performance impact of mesh
and torus on the two-dimensional FFT computation has been investigated. Figure
4.3 shows the performance measurements for different problem sizes on the 512-
node partition. It demonstrates clearly that the three-dimensional torus is highly
efficient and even becomes more beneficial as the problem size grows. For the all-to-
all communication and a fairly large problem size, the torus network is about 20%
faster than mesh.
As the problem size is increased, the length of the messages which are going to be
exchanged gets longer. In fact, the bandwidth utilisation becomes higher and is a
more critical factor for performance differences. Within the torus network, the data
packets are routed on an individual basis, using one of two routing strategies. The
algorithm, in which all packets follow the same path along the x, y, z dimensions (in
this order) is called deterministic routing algorithm [13]. The second is a minimal
Figure 4.4: Comparison of mesh vs torus network
adaptive routing algorithm, which allows individual packets to make routing
decisions [13]. The BlueGene/L torus network features have been extensively
described elsewhere and will not be discussed here in more detail.
Figure 4.4 illustrates, in an extremely simplified way, how point-to-point packets
possibly travel through the mesh and torus networks. This simplified example, using
4 nodes for an all-to-all communication, shows clearly the impact of the additional
link for the torus network on the bandwidth utilisation. On the other hand, we
assume that on the mesh network deterministic routing is used, which leads to more
network congestion and increased messages latency, even on lightly used networks
[13]. This effect becomes even more pronounced as more packets are sent through
the network.
4. 5. 2 Virtual Node Mode on BlueGene/L
Another important architectural feature of BlueGene/L is its dual-processor compute
chip which can operate in one of two modes [3, 13]. The coprocessor mode spans the
entire memory of the chip and can use both processors by running one thread on each
[13]. In virtual node mode, two single-threaded processes which share the chip’s
memory, run on one compute node [13]. Each process is bound to one processor.
Table 4.3 compares the execution times for the two-dimensional FFT computation
for different problem sizes using coprocessor mode and virtual node mode.
PROBLEM    512²                  1024²                 2048²
NODES      CO mode   VN mode     CO mode   VN mode     CO mode   VN mode
1          0.182     0.092       0.793     0.418       3.596     1.887
2          0.091     0.048       0.417     0.206       1.875     0.992
4          0.048     0.024       0.205     0.099       0.985     0.493
8          0.024     0.012       0.099     0.050       0.489     0.236
16         0.011     0.006       0.049     0.024       0.232     0.114
32         0.006     0.003       0.024     0.014       0.113     0.063
Table 4.3: Execution times in seconds for the 2D-FFT computation for different problem sizes using
coprocessor mode (CO) and virtual node mode (VN) on BlueGene/L
The benchmarks show that the virtual node mode has a positive impact on our
application performance. The execution times are almost halved when
dedicating the same number of chips but running one MPI task on each of the two
processors of a chip. When virtual node mode is used, the two MPI tasks
running on the two processors of one chip also share the network. More precisely,
the bandwidth utilisation becomes higher than if only one node is dedicated to one
MPI task. This may be the reason that the execution times using virtual node mode
are slightly higher than half of the execution times achieved with coprocessor mode.
More research on the virtual node mode would constitute an interesting future
project. However, due to limited time, for all the further investigations performed in
this paper the coprocessor mode has been used.
4. 5. 3 Double FPU on BlueGene/L
As mentioned in chapter 2 the BlueGene/L processing cores have a specially
designed double floating point unit – also known as “Double Hummer”[1] – which
mainly provides support for complex arithmetic [4]. For FFT implementations this
feature can be utilised to save a significant number of arithmetic operations, leading
to improved performance [4]. Therefore, for efficient exploitation of the double
FPUs for our FFT computations, 16-byte alignment [3, 4, 13] for the data arrays has
been declared to the compiler which allows the compiler to issue “Double Hummer”
instructions. Additionally, we used the optimised FFTW2 library available on
BlueGene/L, which also benefits from the “Double Hummer” feature. It is a version
of FFTW-GEL from the Vienna University of Technology [4] which is based on
FFTW 2.1.5 and built by IBM. Figure 4.5 shows the superior performance impact of
both, declaring the data alignment for the local arrays and utilising the optimised
Figure 4.5: Times of the forward 2D-FFT for a problem size of 2048²
FFTW2 library routines. For instance, the application using both optimisations is
about 40% faster than the version using the standard FFTW 2.1.5. For this reason,
this optimised version is used for all future investigations performed on BlueGene/L.
4. 5. 4 MPI task mapping strategies
Two main node mapping patterns have been investigated – contiguous and
discontiguous blocks. To better explain the reasons for the investigations of the two
main patterns, we jump forward a bit to chapter 5 where the three-dimensional FFT
computation is explored. For the three-dimensional FFT computation where the data
array is decomposed in two dimensions, the MPI tasks have been organised in a two-
dimensional processor grid using the MPI Cartesian grid topology [22] construct.
More precisely, for a subdivision dims = {e, f} (where e × f = P) of the two-
dimensional virtual processor grid, we have f subgroups of nodes each consisting of e
nodes. We have two all-to-all communications, the first within each subgroup of
nodes and the second between the subgroups of nodes. This means that if the MPI
tasks for the first all-to-all communication are mapped onto nodes which are as close
as possible to each other, then the mapping for the second communication between
the subgroups would be fragmentary. Analysing both node mapping patterns
separately is supposed to clarify performance results achieved for the three-
dimensional FFT computation which uses two-dimensional decomposition of the
Figure 4.6: Customised versus default mapping on 32-node partition
Figure 4.7: Performance impact of customised versus default mapping on 32-node partition
data. The following figures illustrate how MPI tasks are mapped onto the processor
grid for a single run. However, for all investigations, the entire requested partition is
filled using the same mapping pattern.
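The two-dimensional processor grid and its two communication subgroups can be set up with the MPI Cartesian topology construct roughly as follows. This is an illustrative fragment assuming an MPI environment; letting MPI_Dims_create choose the e × f factorisation is an example choice:

```c
#include <mpi.h>

/* Organise P tasks as an e x f Cartesian grid and split it into the
 * two communicators used for the two all-to-all exchanges. */
void make_grid(MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[2] = {0, 0};                 /* let MPI choose e x f = P */
    MPI_Dims_create(nprocs, 2, dims);

    int periods[2] = {0, 0};
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    /* 1st all-to-all: within each subgroup of e nodes (keep dim 0) */
    int keep_rows[2] = {1, 0};
    MPI_Cart_sub(grid, keep_rows, row_comm);

    /* 2nd all-to-all: between the subgroups (keep dim 1) */
    int keep_cols[2] = {0, 1};
    MPI_Cart_sub(grid, keep_cols, col_comm);
}
```

Passing reorder = 1 to MPI_Cart_create permits the MPI library to renumber ranks to suit the physical topology; the explicit mapfiles studied here override this placement.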
4. 5. 4 . 1 Mappings on the 32-node partition
Figure 4.6 shows the customised and default node mapping used on the 32-node
partition, the smallest partition available on BlueGene/L. For the first example, 4
processors are used, which means the two-dimensional FFT application has been run
8 times, simultaneously, within the 32-node partition. The timing of the entire
forward FFT computation is encapsulated in a MPI_Barrier(MPI_COMM_WORLD)
pair.
To ensure that the default mapping is really what we assume, a mapfile which
represents the supposed default mapping has been provided and used. We compared
the results from two runs, one using the mapfile and the second run without using a
mapfile. Both runs yield the same results (within a tolerance of a few
microseconds) for a variety of problem sizes. Figure 4.7 presents the
performance variations of customised versus default node mapping. Independent of
the size of the data array being transformed, the total time spent for communication
is about 10% less if the default mapping is used. To explain this, we have considered
both latency and bandwidth.
Figure 4.8: Two customised mappings versus default mapping on 128-node partition
The fragmentary mapping increases message latency considerably, which leads to
poor performance. With increasing problem size, the effect of bandwidth utilisation
dominates since data packets which are going to be exchanged are bigger. The
overall performance impact on the entire two-dimensional FFT computation becomes
smaller as the problem size grows since the computation dominates over
communication.
4. 5. 4 . 2 Mappings on the 128-node partition
The potential to explore several different node mappings on the 32-node partition is
limited due to its size. Therefore, ongoing investigations have been performed on the
128-node partition. Two choices of node mappings using 8 processors for the 128-
node partition have been studied and illustrated in figure 4.8. Again, to fill up the
partition ensuring all 128 chips are in use, the application runs 16 times at once using
8 processors. In case (a), for the customised mapping, non-contiguous blocks spanning all
three dimensions, with a total of 8 processors, are used. Case (b) uses an 8-node
cube shape. The performance of both customised mappings is compared with the
default case, which places the nodes contiguously in a line.
Figure 4.9 presents the performance variations of the discontiguous and contiguous
node mapping, normalised to the default mapping, i.e. the 100% line (reference
line) represents the times achieved using the default mapping. The
cube mapping (b) has a big positive impact on performance due to lower latency for
communication between nearest-neighbour nodes.
Figure 4.9: Performance impact of customised versus default mapping on 128-node partition
However, the trend drops off as the data array being transformed grows. We assume
this is due to network congestion since bandwidth utilisation becomes higher because
more data packets are delivered. More precisely, the communication between nodes
in the nearest neighbourhood yields excellent performance gain due to decreased
message latency, but shows a turn-over as the problem size grows because of higher
bandwidth utilisation leading to network congestion. In summary, there is always a
trade-off between latency and bandwidth.
The fragmentary node mapping (a) is characterised by higher latency for short
messages. But nevertheless, the path messages have to travel between nodes furthest
away from each other is still shorter than for the communication pattern in line. This
becomes more profitable as the problem size grows since the communication is more
affected by network congestion for the default mapping. The opposite occurs for the
next mapping on the 128-node partition, shown in figure 4.10.
The application is simultaneously run 8 times, each run with 16 processors. Here the
discontiguous mapping of nodes has the same extent along the x dimension as the
Figure 4.12: Two customised mappings versus default mapping on 512-node partition
Figure 4.11: Performance impact of customised versus default mapping on 128-node partition
default mapping. Therefore, the fragmentary mapping shows poor performance
because of increased message latency compared to the communication in line.
This becomes less and less of an issue as problem size increases since the default
mapping experiences higher network congestion than the distributed mapping does.
Figure 4.11 displays the performance measurements for communication and the
impact on overall performance for the entire two-dimensional FFT computation.
4. 5. 4 . 3 Mappings on the 512-node partition
Further investigations have been carried out on the 512-node partition for a variety of
larger problems.
Figure 4.14 (*): Performance impact of customised versus default mapping on 512-node partition (torus)
Figure 4.13: Performance impact of customised versus default mapping on 512-node partition (mesh)
Figure 4.12 shows the discontiguous (a) and contiguous (b) customised node
mapping compared with default mapping. The application has been run 8 times, at
the same time, within the 512-node partition. The 512-node partition is the smallest
partition where all 512 nodes are connected to their six neighbours through
bidirectional links [13]. Hence, all the mapping patterns have been investigated for
the mesh and torus network.
The results for the two customised mappings versus default mapping on the mesh
network are shown in figures 4.13. Since the same fragmentary and contiguous
patterns have been explored on the 128-node partition and are simply extended to the
512-node partition, the outputs obtained on the mesh are straightforward and will not
be repeated here. On the other hand, exploring the same node mapping patterns on
the torus network yields entirely different results which are presented in figure 4.14.
(*) Unfortunately, we don’t have measurement results for the problem sizes 256², 512², 1024²
We have discovered with the investigation discussed in chapter 4.5.1 that the torus
network has a remarkably profitable impact on the performance because the
bandwidth utilisation is balanced as equally as possible over the network to avoid
congestion. For the contiguous node mapping (b), we assume that the additional links
which account for the torus are not in use. On the other hand, for the default node
mapping, the torus links can be utilised in two dimensions. This fact leads to
significant worsening of the performance achieved with customised contiguous
mapping. Clearly, the performance drops further as the problem size grows, since
higher bandwidth utilisation is responsible for congestion within the cube-shaped
node block.
The discontiguous node mapping (a) can take advantage of the torus links in all three
dimensions. However, the poor performance of the fragmentary mapping is
characterised by increased message latency. A marginal improvement of the
customised mapping over the default mapping can be experienced for large problem
sizes. So, we assume that the fragmentary mapping becomes beneficial if messages
are very long, which leads to network congestion for the default mapping.
So far, a variety of customised versus default mappings of MPI tasks have been
explored for mesh and torus on the 512-node partition. Our interest lies in finding the
mapping pattern yielding the best performance results. Table 4.4 summarises the
results from all the investigated node mappings for mesh and torus on the 512-node
partition on BlueGene/L.
NETWORK   CONTIGUOUS NODE MAPPING   DISCONTIGUOUS NODE MAPPING   DEFAULT NODE MAPPING
mesh      x                                                      x
torus                               x                            x
Table 4.4: Summary of the investigated node mappings for problem sizes between 2048² and 16384²
Table 4.5 shows the communication times and the amount of time required for the
entire two-dimensional FFT computation for the mappings yielding best performance
results for mesh and torus. It points out that the discontiguous node mapping on the torus
leads to the best performance results, especially for large problem sizes, because of the
beneficial feature of the torus network: balancing bandwidth utilisation as equally as
possible over the entire network.

Figure 4.15: Customised mappings versus default mapping on 512-node partition
PROBLEM   COMM COSTS (mesh)   2D-FFT COSTS (mesh)   COMM COSTS (torus)   2D-FFT COSTS (torus)
256²      0.00022             0.00038               -                    -
512²      0.00061             0.0014                -                    -
1024²     0.0022              0.0055                -                    -
2048²     0.0085              0.024                 0.007                0.023
4096²     0.035               0.149                 0.027                0.142
8192²     0.139               0.792                 0.110                0.754
16384²    0.557               3.421                 0.444                3.295
Table 4.5: Communication and 2D-FFT computation costs (seconds) for different problem sizes with
the mapping yielding best results for mesh and torus
As a final investigation before we consider the more complex three-dimensional FFT
computation, we double the 64-node cube and compare an 8 × 4 × 4 node block with
blocks containing two planes of the 512-node partition. Figure 4.15 illustrates the two
node mappings being compared. Here, only the torus has been investigated, since the
performance measurements presented above have shown that the mesh is not
particularly relevant for the 512-node partition in our case.
The results presented in figure 4.16 strengthen the previous outcome of the node
mapping comparison shown in figure 4.12 (b). Here, the customised mapping can
utilise the torus links in one dimension; we assume that the additional links in the
other two dimensions are not in use. Again, the default node mapping can utilise the
links in two dimensions. Since the torus connectivity comes into play for at least one
dimension in the customised case, the
performance difference between customised and default is reduced: for the mapping
comparison illustrated in figure 4.12 (b), the default mapping led to performance
improvements of 15%, 17%, 18% and 19%, whereas now it is about 7%, 8%, 12%
and 13% for the problem sizes 2048², 4096², 8192² and 16384², respectively.
However, the customised mapping still leads to poorer performance, most likely due
to high bandwidth utilisation that causes network congestion. That is also the reason
why the performance of the customised node mapping falls off as the problem size
grows.

Figure 4.16: Performance impact of customised versus default mapping on 512-node partition (torus)
5. THREE-DIMENSIONAL FAST FOURIER TRANSFORMS

5. 1 Parallelisation

As with the two-dimensional case, and to better explain the three-dimensional Fast
Fourier Transform (FFT) computation, a mathematical description is provided first.

Consider A_{x,y,z} a three-dimensional array of L × M × N complex numbers with

A_{x,y,z} \in \mathbb{C} \quad \forall\; 0 \le x < L,\; 0 \le y < M,\; 0 \le z < N.

The three-dimensional FFT is computed using the equation described in (3.8):

B_{u,v,w} = \sum_{x=0}^{L-1} \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} A_{x,y,z} \cdot e^{-2\pi i \left( \frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N} \right)}    (5.1)

In other words, the three-dimensional FFT is an array B_{u,v,w} of L × M × N complex
numbers. This computation is performed in three single stages. First, the one-
dimensional FFT along the y dimension is computed for all (x, z) pairs, then along
the z dimension for all (x, y) pairs, and finally along the x dimension for all (y, z)
pairs. Therefore, (5.1) can be written as:

B_{u,v,w} = \sum_{x=0}^{L-1} e^{-2\pi i \frac{ux}{L}} \sum_{z=0}^{N-1} e^{-2\pi i \frac{wz}{N}} \sum_{y=0}^{M-1} A_{x,y,z} \cdot e^{-2\pi i \frac{vy}{M}}    (5.2)

where the innermost sum is the 1st one-dimensional computation (along the y
dimension), the middle sum the 2nd (along the z dimension) and the outermost sum
the 3rd (along the x dimension):

C_{x,v,z} = \sum_{y=0}^{M-1} A_{x,y,z} \cdot e^{-2\pi i \frac{vy}{M}}   (C is the 1D-FFT of A_{x,:,z} for all (x, z) pairs)

D_{x,v,w} = \sum_{z=0}^{N-1} C_{x,v,z} \cdot e^{-2\pi i \frac{wz}{N}}   (D is the 1D-FFT of C_{x,v,:} for all (x, v) pairs)

B_{u,v,w} = \sum_{x=0}^{L-1} D_{x,v,w} \cdot e^{-2\pi i \frac{ux}{L}}   (B is the 1D-FFT of D_{:,v,w} for all (v, w) pairs)
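The factorisation in (5.2) can be checked numerically: three passes of one-dimensional FFTs, one per dimension, reproduce the full three-dimensional transform. A minimal sketch using NumPy as a stand-in for the one-dimensional FFTW kernel routine used in the actual implementation:

```python
import numpy as np

# Small sizes; the factorisation is size-independent.
L, M, N = 4, 8, 16
rng = np.random.default_rng(0)
A = rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))

C = np.fft.fft(A, axis=1)   # 1st: FFTs of size M along y, all (x, z) pairs
D = np.fft.fft(C, axis=2)   # 2nd: FFTs of size N along z, all (x, v) pairs
B = np.fft.fft(D, axis=0)   # 3rd: FFTs of size L along x, all (v, w) pairs

# The three passes reproduce the full 3D transform of equation (5.1).
assert np.allclose(B, np.fft.fftn(A))
```

Mathematically the order of the three passes is irrelevant; in the parallel implementation it is the data distribution that dictates the order y, z, x.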
Figure 5.1: Computational steps of the 3D-FFT implementation using 1D-decomposition
For the three-dimensional case, two different implementations have been considered.
Performance data for volumetric fast Fourier Transform computations on the
BlueGene/L architecture has been published earlier [5, 6, 7]. One common approach
for computing the FFT of an L × M × N data array in parallel on multiple nodes is a
technique called slab decomposition [5]. For the investigations here, a data size
equal in each dimension, i.e. L = M = N, has been used. For slab decomposition, the
data is distributed along one single axis; therefore it is also called one-dimensional
decomposition. Figure 5.1 illustrates the described implementation of the three-
dimensional FFT using slab decomposition for a data array of size L × M × N. More
precisely, A(0 : L_x − 1, 0 : M − 1, 0 : N − 1) is an L_x × M × N array of complex
numbers distributed onto P nodes. So, each node stores a section of size L_x × M × N
(L_x = L/P) of the data array A in its local memory. First (a), L_x × N independent
one-dimensional FFTs of size M along the y dimension and L_x × M independent one-
dimensional FFTs of size N along the z dimension are calculated. Secondly (b),
M_y × N (M_y = M/P) independent one-dimensional FFTs of size L along the x
dimension are calculated. For calculating the independent one-dimensional FFTs,
the FFTW library function fftw() with stride=1 and the expensive re-sort strategies
have been used. More about re-sort strategies was covered earlier in chapter 4.3. It
must be pointed out that, within the figures used here, the coordinate system has been
rotated for simplification.
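The slab-decomposition pipeline described above can be mimicked in a few lines of serial NumPy code, with a Python list standing in for the P nodes and a simple concatenation standing in for the all-to-all re-sort (a sketch only; the real code uses FFTW and MPI):

```python
import numpy as np

L = M = N = 8
P = 4                                   # simulated nodes; Lx = L / P planes each
rng = np.random.default_rng(1)
A = rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))

# Distribute A along x: node p holds the slab A[p*Lx:(p+1)*Lx, :, :].
Lx = L // P
slabs = [A[p * Lx:(p + 1) * Lx].copy() for p in range(P)]

# (a) Local work: Lx*N FFTs of size M along y, then Lx*M FFTs of size N along z.
slabs = [np.fft.fft(np.fft.fft(s, axis=1), axis=2) for s in slabs]

# All-to-all re-sort (modelled globally): re-distribute along y so that each
# node owns My = M / P complete lines of data in x.
full = np.concatenate(slabs, axis=0)
My = M // P
pencils = [full[:, p * My:(p + 1) * My, :] for p in range(P)]

# (b) Local work: My*N FFTs of size L along x.
pencils = [np.fft.fft(c, axis=0) for c in pencils]

B = np.concatenate(pencils, axis=1)
assert np.allclose(B, np.fft.fftn(A))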
Figure 5.2: Computational steps of the 3D-FFT implementation using 2D-decomposition
Let us assume the computation is performed on P = L nodes and the data is
decomposed along the x axis as shown in figure 5.1, with each node assigned a
slab of size 1 × M × N. There are two perspectives to consider. First, from the
performance perspective [5], two of the three one-dimensional FFTs can be
performed locally on each node without any communication. Secondly, from the
scalability perspective, the scalability of this method is limited by the extent of the
data along a single axis [5]; in this example it is limited by L. This becomes a non-
negligible problem if one wants to exploit a very large number of nodes, such as
BlueGene/L is designed for.
A more scalable implementation of three-dimensional FFTs, called volume
decomposition, has been presented in [5] and implemented here for further
investigations with respect to mapping strategies for MPI tasks. For simplicity, we
assume that data is distributed in two dimensions so that it is ready for the first
one-dimensional FFT computation without any communication in advance. It
would be an interesting future step to extend this implementation in such a way that
data is distributed in three dimensions, since a three-dimensional decomposition
would be the more likely decomposition for many scientific applications. However,
that would involve another expensive all-to-all communication to get the data ready
for the first one-dimensional FFT evaluation.
Figure 5.2 illustrates the described implementation of the three-dimensional FFT
using two-dimensional decomposition for a data array of size L M N× × .
f(x,y,z) = \sin\left( \frac{2\pi a x}{L} + \frac{2\pi b y}{M} + \frac{2\pi c z}{N} \right) \quad \text{with } a, b, c \in \mathbb{N}    (5.3)

f(x,y,z) = \frac{1}{2i} \left( e^{\,i\left(\frac{2\pi ax}{L} + \frac{2\pi by}{M} + \frac{2\pi cz}{N}\right)} - e^{-i\left(\frac{2\pi ax}{L} + \frac{2\pi by}{M} + \frac{2\pi cz}{N}\right)} \right)    (5.4)
The MPI tasks have been organised in a two-dimensional virtual processor grid using
the MPI Cartesian grid topology [22] construct. More precisely, for a subdivision
dims={P_x, P_z} (where P_x × P_z = P) of the two-dimensional virtual processor grid, we
have P_z subgroups of nodes, each consisting of P_x nodes. Let
A(0 : L_x − 1, 0 : M − 1, 0 : N_z − 1) be an L_x × M × N_z array of complex numbers
distributed onto a P_x × P_z grid of nodes. So, each node stores a section of size
L_x × M × N_z (L_x = L/P_x, N_z = N/P_z) of the data array A in its local memory. First
(a), L_x × N_z independent one-dimensional FFTs of size M along the y dimension are
calculated. Then, within each subgroup of nodes – in figure 5.2 marked with
four main colours – an all-to-all communication is performed to get the data ready for
the second one-dimensional FFTs. Secondly (b), L_x × M_y independent one-
dimensional FFTs of size N along the z dimension are performed. To evaluate the
third one-dimensional FFT, a second all-to-all communication between the
subgroups of nodes becomes necessary. Finally (c), M_y × N_z independent one-
dimensional FFTs of size L along the x dimension are calculated.
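The two-dimensional (volume) decomposition with its two all-to-all steps can likewise be modelled serially. The helper functions `split` and `join` below are illustrative stand-ins: they cut the global array into per-node bricks and glue them back together, so each "all-to-all" is modelled as a global re-distribution rather than real message passing:

```python
import numpy as np

L = M = N = 8
Px, Pz = 2, 4                      # dims = {Px, Pz}; P = Px * Pz nodes
rng = np.random.default_rng(2)
A = rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))

def split(X, ax_i, n_i, ax_j, n_j):
    """Local bricks of X on an n_i x n_j node grid, split along ax_i and ax_j."""
    return [[np.array_split(np.array_split(X, n_i, ax_i)[i], n_j, ax_j)[j]
             for j in range(n_j)] for i in range(n_i)]

def join(bricks, ax_i, ax_j):
    return np.concatenate([np.concatenate(row, axis=ax_j) for row in bricks],
                          axis=ax_i)

# Start: x split over Px, z split over Pz; local shape (L/Px, M, N/Pz).
bricks = split(A, 0, Px, 2, Pz)
# (a) y is complete locally: FFTs of size M along y on every node.
bricks = [[np.fft.fft(b, axis=1) for b in row] for row in bricks]
# 1st all-to-all (intra-subgroup), modelled as a global re-distribution:
# trade the z split for a y split; local shape becomes (L/Px, M/Pz, N).
bricks = split(join(bricks, 0, 2), 0, Px, 1, Pz)
# (b) z is complete locally: FFTs of size N along z.
bricks = [[np.fft.fft(b, axis=2) for b in row] for row in bricks]
# 2nd all-to-all (inter-subgroup): trade the x split for a z split;
# local shape becomes (L, M/Pz, N/Px).
bricks = split(join(bricks, 0, 1), 2, Px, 1, Pz)
# (c) x is complete locally: FFTs of size L along x.
bricks = [[np.fft.fft(b, axis=0) for b in row] for row in bricks]

assert np.allclose(join(bricks, 2, 1), np.fft.fftn(A))
```

In the real implementation each re-distribution is an MPI_Alltoall within the appropriate sub-communicator, but the sequence of local shapes is the same.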
5. 2 Verification of Results

Before any investigations can be made, we must ensure that the three-dimensional
FFT computation is correct. For that reason, synthetic input data was chosen to
guarantee reliable verification of the results of the implementation against
analytically calculated results. The chosen input function (5.3) delivers not only the
advantage that the results can be safely verified, but also that the size of the problem
to be investigated can easily be modified by simply changing the values of L, M and
N. With the Euler formula (3.1), equation (5.3) can be rewritten as equation (5.4).
From equation (5.4) the three-dimensional discrete Fourier Transform has been
calculated analytically:

F(u,v,w) = \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1} f(x,y,z) \cdot e^{-2\pi i \left( \frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N} \right)}

= -\frac{i}{2} \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1} \left( e^{\,i\left(\frac{2\pi ax}{L} + \frac{2\pi by}{M} + \frac{2\pi cz}{N}\right)} - e^{-i\left(\frac{2\pi ax}{L} + \frac{2\pi by}{M} + \frac{2\pi cz}{N}\right)} \right) \cdot e^{-2\pi i \left( \frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N} \right)}

= -\frac{i}{2} \left( \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1} e^{\,2\pi i \left( \frac{(a-u)x}{L} + \frac{(b-v)y}{M} + \frac{(c-w)z}{N} \right)} - \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1} e^{-2\pi i \left( \frac{(a+u)x}{L} + \frac{(b+v)y}{M} + \frac{(c+w)z}{N} \right)} \right)

For the further calculation, the periodic discrete delta function δ (Kronecker symbol)
is brought into play, using the periodicity of the exponential function. For
g(x) = e^{\,2\pi i a x / L}:

G(u) = \sum_{x=0}^{L-1} g(x) \cdot e^{-2\pi i u x / L} = \sum_{x=0}^{L-1} e^{\,\frac{2\pi i (a-u)x}{L}} = \begin{cases} L, & \text{if } u = a + nL,\; n \in \mathbb{Z} \\ 0, & \text{else} \end{cases} = L \cdot \delta_{a,u}

Applying this delta function in each of the three dimensions yields

F(u,v,w) = -\frac{i}{2} \left[ L \cdot M \cdot N \cdot \delta_{a,u}\,\delta_{b,v}\,\delta_{c,w} - L \cdot M \cdot N \cdot \delta_{L-a,u}\,\delta_{M-b,v}\,\delta_{N-c,w} \right]

and hence

F(u,v,w) = \begin{cases} -\frac{1}{2}\, i \cdot L \cdot M \cdot N, & \text{if } u = a,\; v = b,\; w = c \\ \phantom{-}\frac{1}{2}\, i \cdot L \cdot M \cdot N, & \text{if } u = L-a,\; v = M-b,\; w = N-c \\ \phantom{-}0, & \text{else} \end{cases}    (5.5)

Equation (5.5) presents the result of the three-dimensional Fourier Transform for the
particular input function (5.3). Now it becomes obvious that, within our
implementation, it is only necessary to verify that the two peaks in (5.5) are located at
the correct place and have the right height.
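This verification is easy to reproduce with a serial FFT; a sketch using NumPy's fftn, which follows the same sign convention as (5.1):

```python
import numpy as np

L, M, N = 16, 8, 8
a, b, c = 3, 2, 1
x, y, z = np.meshgrid(np.arange(L), np.arange(M), np.arange(N), indexing='ij')
f = np.sin(2 * np.pi * (a * x / L + b * y / M + c * z / N))

F = np.fft.fftn(f)

# Equation (5.5): exactly two non-zero coefficients are expected,
#   F[a, b, c]       = -i * L*M*N / 2
#   F[L-a, M-b, N-c] = +i * L*M*N / 2
peak = 1j * L * M * N / 2
assert np.isclose(F[a, b, c], -peak)
assert np.isclose(F[L - a, M - b, N - c], peak)
F[a, b, c] = F[L - a, M - b, N - c] = 0
assert np.allclose(F, 0)    # all other coefficients vanish
```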
Figure 5.3: Speedup of the 3D-FFT implementation using 1D-decomposition
5. 3 Performance Analysis
5. 3. 1 1D-Decomposition versus 2D-Decomposition
Various runs of the two three-dimensional FFT implementations – one decomposing
the three-dimensional data array in one dimension, the other in two dimensions –
have been performed on BlueGene/L. Two parameters have been varied: the number
of nodes used and the size of the data being transformed. For these investigations we
have used the coprocessor mode on BlueGene/L. For the FFT implementation where
data is decomposed in two dimensions, the MPI tasks were organised in a two-
dimensional logical processor grid using the MPI Cartesian grid topology [22]
construct. MPI_Cart_sub() creates new communicators which allow all-to-all
communications within and between the subgroups of nodes. A more detailed
description of both implementations was given in chapter 5.1.
Figures 5.3 and 5.4 present the speedup curves for each of the two implementations
for different problem sizes starting from 32³ up to 512³, run on the 512-node partition
with torus. In both cases, it is assumed that the applications scale ideally up to 4
nodes for a 256³ problem and up to 32 nodes for a 512³ problem, since computation
using 1 node is not feasible due to the limited amount of memory available on the
system. The features of the speedup curves for the implementation using one-
dimensional decomposition are more explicit with logarithmic scaling, while – in my
opinion – this is not the case for the implementation using two-dimensional
decomposition.
Figure 5.4: Speedup of the 3D-FFT implementation using 2D-decomposition
However, the respective figures can be found in Appendix A. Both figures show that
the implementations scale well with increasing numbers of nodes for problem sizes
greater than 32³. They even show a super linear speedup for particular problem
sizes.
If one compares the curves for the FFT implementation using one-dimensional
decomposition with those using two-dimensional decomposition with respect to the
number of nodes used, it is clearly observable that the scalability of the
one-dimensional decomposition is limited, although it scales well up to that limit.
For instance, the curve for the 64³ problem size shows clearly that the limit to
exploiting an increasing number of processors efficiently is the extent of the data
along a single axis, i.e. for the 64³ problem the scalability is limited to 64 nodes.
Looking at the curve which presents the speedup for the same problem size but
decomposing the data in two dimensions rather than only one, we can theoretically
use up to 64² nodes for the FFT computation. Admittedly, a 64³ problem is not the
ideal example for efficiently exploiting 64² nodes, since even on 512 nodes the
application no longer scales linearly. It scales linearly up to 256 processors, though
it still gets faster all the way to 1,024 processors. Here, Gustafson's law [1, 2] comes
into play: to efficiently utilise a larger number of processors, a bigger problem size
is needed.
P = \frac{\text{TotalMemory}}{\text{size of L3 cache}}    (5.6)

P = \frac{256^3 \text{ complex numbers} \cdot 6 \text{ arrays}}{4\,\text{MB}} = \frac{1{,}536\,\text{MB}}{4\,\text{MB}} = 384
One possible explanation for the observed super linear speedup in our case is a
hardware effect, namely caching. Cache effects come into play when the local
problem size is small enough that the frequently accessed variables fit into the cache.
However, there is a trade-off between problem size and fitting variables into the
cache since, as mentioned before, if the problem size is too small, relatively more
time is spent in communication than in computation, which hampers the efficient
utilisation of an increasing number of processors.
Each chip on BlueGene/L has three levels of cache [7, 11, 13]. We expect the 4 MB
for the L3 cache to be relevant here. More precisely, if one considers a fixed data
size, e.g. 256³ of complex numbers, with equation (5.6), one can easily calculate the
number of processors for which the local data fits into the L3 cache.
The code has 6 work arrays, and for the example problem size of 256³ complex
numbers equation (5.6) gives P = 1,536 MB / 4 MB = 384.
It means that for transforming a fixed size of 256³ complex numbers using the FFT
implementation with two-dimensional decomposition on 384 or more nodes, data fits
entirely into the cache on each node, which speeds up the computation super linearly.
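The arithmetic of equation (5.6) can be captured in a few lines; nodes_to_fit_l3 is a hypothetical helper name, and the calculation assumes double-precision complex numbers (16 bytes each):

```python
# Smallest node count P for which the per-node working set (all work arrays)
# fits into BlueGene/L's 4 MB L3 cache -- equation (5.6).
BYTES_PER_COMPLEX = 16          # double-precision complex
L3_BYTES = 4 * 2**20            # 4 MB L3 cache per chip

def nodes_to_fit_l3(n, work_arrays=6):
    total = n**3 * BYTES_PER_COMPLEX * work_arrays
    return total // L3_BYTES    # P = TotalMemory / size of L3 cache

# 256^3 complex numbers with 6 work arrays -> 1536 MB -> 384 nodes.
print(nodes_to_fit_l3(256))     # -> 384
```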
The problem sizes here increase by a factor of eight from one curve to the next. In
figure 5.3 (as well as in figures 5.4 and A.2) it is clearly observable that the speedup
curves for the different problem sizes become super linear at node counts that
likewise differ by a factor of eight, which supports our formula. However, the actual
jumps occur somewhat earlier, hence not all work arrays seem to be relevant.
Figure 5.5 (a): Performance measurements for the 3D-FFT implementations using 1D and 2D
decomposition for five different problem sizes, respectively

Figure 5.5 (a) compares the performance of both FFT implementations, using one-
dimensional and two-dimensional decomposition. As mentioned earlier, each
experiment was run several times, and the run measured on a hot L3 cache and
yielding the best total time for the entire three-dimensional forward FFT
computation is presented in this paper. The corresponding table containing all the
execution times for the example problem size 128³ can be found in Appendix B. The
results show that the slab decomposition is faster than the two-dimensional
decomposition, independent of the problem size and
MPI task counts. Earlier performance data published in [5] showed that slab
decomposition is faster on small task counts (< 64) and volumetric FFT is faster on
large task counts (> 64). Those measurements were performed on a cluster of IBM
POWER4 servers connected by a slower interconnect (the SP2 interconnect [24],
where each node has two 320 MB/s bidirectional channels per link [5]), which most
likely explains the difference. The performance measurements summarised in
table 5.1 show that slab decomposition is faster until it stops scaling because of the
limited number of data elements along a single dimension.
NODES   32³      64³      128³     256³     512³     1024³
1       21.603   22.168   -1.294   -        -        -
2       20.534   14.183   -1.289   -        -        -
4       25.232   21.263    6.432    2.647   -        -
8       26.663   21.188   19.260    6.586   -        -
16      32.692   31.393   23.035    9.447   -        -
32      27.655   30.965   24.544   14.523    7.671   -
64      -        26.700   25.907   23.897    7.105   -
128     -        -        25.597   21.715   14.185   -
256     -        -        -        22.995   14.543    8.277
512     -        -        -        -        22.713   10.420
Table 5.1: Performance improvement of the slab decomposition compared to 2D-decomposition
Figure 5.5 (b): Performance measurements for the communication times of the 3D-FFT
implementations using 1D and 2D decomposition for five different problem sizes, respectively

Figure 5.5 (b) compares the communication times of both FFT implementations,
using one-dimensional and two-dimensional decomposition. The
respective table, containing the performance improvement of the communication
costs for slab decomposition compared to 2D-decomposition, is included in
Appendix A. The total amount of time spent on communication for the slab
decomposition is on average 45% of the communication time of the implementation
using two-dimensional decomposition. Figure 5.5 (b) shows precisely this superior
impact on communication costs, which merits future research in terms of using
possibly faster FFT packages. Combined with these packages, the lower
communication costs will have a greater impact on the overall FFT performance.
The straight trend for very small node counts is due to latency effects caused by very
long messages.
Since the scalability of the slab decomposition is limited by the number of data
elements along a single dimension, this is the point at which the two-dimensional
decomposition comes beneficially into play. It has the advantage that, for a
particular problem size at which slab decomposition reaches its limit, a number of
additional processors can be utilised efficiently using two-dimensional
decomposition. This is of interest for smaller problem sizes, for instance on the
512-node partition for problem sizes smaller than 512³. To achieve a possible
additional performance benefit for the three-dimensional FFT implementation when
using two-dimensional decomposition, different strategies for MPI task placement
on the physical processor grid on BlueGene/L will be described in the following
section.
Figure 5.6: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 32-node partition using dims={8, 4} for the 2D-virtual processor grid
5. 3. 2 MPI task mapping strategies
The investigations of the two-dimensional FFT computation have shown that the
performance of the application can depend on the particular mapping, especially
because communication times can be minimised [3, 9]. Therefore, for the three-
dimensional FFT computation using two-dimensional decomposition, a variety of
MPI task mappings on the physical processor grid on BlueGene/L has been explored.
This optimisation has to be carried out for each partition size on BlueGene/L since
the shape of the partitions changes with size [3, 8]. For instance, the 128-node
partition consists of an 8 4 4× × block of nodes, whereas a 512-node partition is an
8 8 8× × block [3]. As a consequence, the optimal mapping for one partition size can
differ substantially from the optimal mapping for another partition [8]. In the same
way as for the two-dimensional case, this process is facilitated by the capability to
specify a node mapping at run time using the –mapfile option to the mpirun
command. For all of the following performance results, the coprocessor mode where
one processor in each chip is available for computation [3] has been used, unless it is
explicitly mentioned otherwise.
5. 3. 2. 1 Mappings on the 32-node partition
Figure 5.6 shows the customised and default node mapping used on a 32-node
partition. MPI sub-communicators are defined appropriately in order to facilitate all-
to-all communication within each subgroup of nodes to obtain data locally on each
processor for performing the second transform (intra-subgroup communication). A
second set of MPI sub-communicators is defined to afford all-to-all communication
between the subgroups of nodes to obtain data locally on each processor for the third
Fourier transform computation (inter-subgroup communication). From here it
becomes apparent that the investigations carried out for the two-dimensional case
help to understand the performance differences obtained for the three-dimensional
FFT computation using two-dimensional decomposition. More precisely, two
mapping patterns investigated individually in the two-dimensional case are now used
together to compute the three-dimensional FFTs. Figure 5.7 presents the difference
in performance between the customised node mapping and the default mapping,
normalised to the default mapping. So, the 100% line (reference line) represents the
times achieved by the computation using the default node mapping. The two intra-
and inter-subgroup all-to-all communications have been analysed separately, which
helps to decode the performance effects of the entire three-dimensional forward
Fourier Transform.

Figure 5.7: Performance impact of customised node mapping for 3D-FFT on a 32-node partition for
various problem sizes
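The grouping produced by the two sets of sub-communicators can be sketched without MPI; the rank layout below (row-major order over a dims={8, 4} virtual grid) is an assumption for illustration, not taken from the actual implementation:

```python
# Stand-in for two MPI_Cart_sub() calls on a 2D Cartesian communicator:
# ranks in the same row of the virtual grid form an intra-subgroup
# communicator (1st all-to-all), ranks in the same column an inter-subgroup
# communicator (2nd all-to-all).
Px, Pz = 8, 4          # dims = {Px, Pz}, row-major rank order assumed

def subgroups(px, pz):
    ranks = range(px * pz)
    intra = [[r for r in ranks if r // px == g] for g in range(pz)]   # rows
    inter = [[r for r in ranks if r % px == g] for g in range(px)]    # columns
    return intra, inter

intra, inter = subgroups(Px, Pz)
print(intra[0])   # -> [0, 1, 2, 3, 4, 5, 6, 7]
print(inter[0])   # -> [0, 8, 16, 24]
```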
For a possible explanation of how different mappings affect the performance, both
latency and bandwidth have to be considered. As the problem size increases, the
impact of bandwidth utilisation becomes greater, since the messages to be exchanged
are longer or are split into a larger number of packets. How exactly this is done
within the MPI library is beyond the scope of this project and will not be discussed
further.
As expected from the investigations carried out for the two-dimensional case, the
customised mapping for the intra-subgroup communication yields excellent
performance, since the communication is performed between nodes in closest
neighbourhood. More precisely, the communication time is beneficially affected by
lower latency, while latency is higher for intra-subgroup communication with default
node mapping. If one considers the trend of the line in figure 5.7 representing the
communication costs for the first all-to-all communication, it is readily identifiable
that, as the problem size grows, the impact of higher bandwidth utilisation comes
into play. So, the positive effect of the customised node mapping for the intra-
subgroup communication diminishes as the problem size increases, due to
congestion caused by higher bandwidth utilisation.
For the second communication, the communication between the subgroups of nodes,
the results are the exact opposite – here the latency of the customised mapping is
higher than for default mapping. Regarding the effect of higher bandwidth utilisation
for larger problem sizes, the trend shows a slow rise since we will have more
congestion for the default mapping than for the customised mapping. However, while
there is a trade-off between the node mappings used for the first and second
communication, it turns out that the benefit achieved from the customised mapping
for the intra-subgroup communication balances the deterioration of performances
achieved for the inter-subgroup communication. Therefore, this balancing still leads
to a slightly beneficial performance impact on the entire three-dimensional forward
FFT – about 4% for small problem sizes and 2% for the biggest problem (512³) size
used here.
5. 3. 2. 2 Mappings on the 128-node partition
The 32-node partition is too small to explore several different node mappings.
Hence, the 128-node partition has been used for ongoing mapping investigations.
Using 128 nodes instead of 32 also allows us to compute bigger problems. Two
choices of node mappings for the 128-node partition have been studied and
illustrated in Figure 5.8. As mentioned before, the MPI tasks have been organised in
a two-dimensional virtual processor grid using the MPI Cartesian grid topology [22]
construct. For both mapping patterns, dimension sizes as close as possible to each
other were used; they simply differ in swapping the sizes of the dimensions from
dims={16, 8} to dims={8, 16}.
Figure 5.8: Two customised and default node mappings for the 1st and 2nd all-to-all communication on a 128-node partition using dims={16,8} and dims={8,16} for the 2D-virtual processor grid
Figure 5.9: Performance impact of customised node mapping for 3D-FFT on a 128-node partition
using dims={16, 8} for the 2D-virtual processor grid
From figures 5.9 and 5.10 it is recognisable that, for the intra-subgroup
communication, both customised node mappings take advantage of the lower latency
of communication between nearest-neighbour nodes. However, a different trend is
observed for bigger problems. For the mapping strategy using the subdivision
dims={8, 16} for the virtual processor grid (case (b)), higher bandwidth utilisation
caused by longer messages is most likely the reason for the slowdown of the curve,
since the customised mapping is affected by congestion. For the node mapping using
the subdivision dims={16, 8} (case (a)), however, we expected a fairly similar trend,
given the results of the investigations carried out on the 32-node partition. Instead,
figure 5.9 shows that the customised mapping improves even when messages become
longer.
Figure 5.10: Performance impact of customised node mapping for 3D-FFT on a 128-node partition using dims={8, 16} for the 2D-virtual processor grid
To explain this, another important fact needs to be considered: the size of the
messages exchanged in both all-to-all communications. The default implementation
of MPI_Alltoall uses different algorithms for different message sizes [15, 25]. The
main reason for using different algorithms for collective communications, depending
on the message size, is to reduce bandwidth utilisation, especially if messages are
long [25]. For more details, we refer to the appropriate MPI library implementation.
For a better explanation in our particular
case, we consider the problem size 256³ and the two different subdivisions of the
virtual processor grid, dims={16, 8} and dims={8, 16}. That means we have
(256/16) · (256/8) · 256 complex numbers locally on each processor, independent of
the order of the dimensions of the virtual processor grid. Case (a) involves 16 MPI
tasks in the first all-to-all intra-subgroup communication, while case (b) involves 8
MPI tasks. More precisely, the size of the messages for the intra-subgroup
communication in case (a) is half the size of the messages being exchanged in
case (b). We assume that this fact might be a possible justification for the two
different trends. More detailed investigations might be of reasonable interest for
future research work.
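A rough calculation illustrates the difference between the two subdivisions; the per-message size below assumes a plain pairwise-exchange model of MPI_Alltoall, not the actual algorithm selected by the library:

```python
# Back-of-the-envelope message sizes for the 256^3 problem on 128 nodes,
# for the two subdivisions of the 2D virtual processor grid.
n = 256
bytes_per_complex = 16          # double-precision complex

for px, pz in [(16, 8), (8, 16)]:
    local = n**3 // (px * pz)                  # complex numbers per node
    tasks = px                                 # tasks in the 1st all-to-all
    msg = local // tasks * bytes_per_complex   # bytes per pairwise message
    print(f"dims={{{px},{pz}}}: {tasks} tasks, {msg} bytes per message")
```

For dims={16, 8} this gives 16 tasks exchanging 131,072-byte messages; for dims={8, 16}, 8 tasks exchanging 262,144-byte messages, so the two cases may well fall on different sides of an MPI_Alltoall algorithm-selection threshold.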
For the second communication, between the subgroups of nodes, the MPI tasks are
mapped onto the physical processor grid in an entirely fragmentary way, which leads
to higher latency and affects the communication time adversely. However, with
increasing problem sizes, which means longer messages being sent through the
network, the poor performance of the fragmentary mappings improves and even
comes close to the performance achieved with the default mapping. Again, a possible
explanation might be that the communication along a line of nodes is affected by
congestion for large message sizes. This assumption needs to be investigated in more
detail and is of value for ongoing future research. However, for both node mappings
on the 128-node partition, the beneficial performance impact of the low-latency
intra-subgroup communication is cancelled out by the high latency of the inter-
subgroup communication. The impact on the total time used for the entire three-
dimensional forward FFT computation is negligible for case (a) and in a range of 8%
down to 4% for case (b), depending on the size of the complex data being
transformed.

Figure 5.11: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={32, 16} for the 2D-virtual processor grid
5. 3. 2. 3 Mappings on the 512-node partition
Further investigations have been carried out on the 512-node partition using different
sizes of the dimensions for the two-dimensional virtual processor grid, starting from
sizes as close as possible to each other down to dims={256, 2}.
Figure 5.11 shows the customised and default node mapping used on the 512-node
partition for the subdivision of the two-dimensional virtual processor grid
dims={32,16}. This investigation has been carried out on the torus network. For the
intra-subgroup communication using customised mapping, the torus cables are only
utilised in one dimension, while for the default mapping they can be exploited for
two dimensions. We have learnt from the investigations carried out for the two-
dimensional FFT computation that the switched-on cables, representing the torus
network, have an enormous impact on bandwidth utilisation. However, figure 5.12
shows that the customised mapping is, despite utilising the torus cables in one
dimension only, beneficial for smaller problems because of its lower latency. But
with increasing problem sizes, higher bandwidth utilisation causes more congestion
in the customised case.

Figure 5.12: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={32, 16} for the 2D-virtual processor grid on the torus network

Figure 5.13: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={64, 8} for the 2D-virtual processor grid
The same can be experienced for the second all-to-all communication, so that in
summary, the default mapping used on BlueGene/L wins when the sizes of the
dimensions for the virtual processor grid are as close as possible to each other.
Figure 5.13 shows the determined node mapping for the virtual processor grid
subdivision dims={64,8}. We are aware of load imbalance consequences, caused by
choosing the sizes of the dimensions not as close as possible to each other. Impacts
of both networks, mesh and torus, have been investigated.
Figure 5.14: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={64, 8} for the 2D-virtual processor grid on the mesh network

Figure 5.15: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={64, 8} for the 2D-virtual processor grid on the torus network
The results (figure 5.14) obtained using the mesh network are fairly in line with
what we have discussed above and hence are only briefly summarised.
Clearly, for the intra-subgroup all-to-all communication, the customised node
mapping gains from low latency because of communication between nodes in the
nearest neighbourhood. For the second all-to-all communication, the customised
mapping is affected by very high latency and is not expected to deliver a positive
impact on performance. Concerning the overall performance, the poor mapping for
the inter-subgroup communication cancels the beneficial mapping for the intra-
subgroup communication which, in summary, yields almost no performance impact
on the entire forward FFT computation.
Figure 5.16: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={8,64} for the 2D-virtual processor grid
Using exactly the same node mappings on the torus network leads to entirely
different results which are presented in Figure 5.15. We have learnt that the torus
network has a remarkably profitable impact on the performance due to balancing the
bandwidth utilisation as equally as possible over the network to avoid congestion.
This may be the reason for the poor performance achieved from the customised
mapping used for the first all-to-all communication, since we assume there is no
torus cable in use, while for the default node mapping, cables are likely to be utilised
in two dimensions. The customised mapping used for the second communication is
characterised by very high latency. Even the torus cables in all three dimensions do
not yield better results. Our assumption for the turnover observed for the problem
size 128³ is that the MPI implementation uses different algorithms for the all-to-all
communication depending on message sizes. This was already discussed above.
Figure 5.16 shows the investigated node mappings where for the two-dimensional
virtual processor grid the sizes of the dimensions are simply swapped from
dims={64, 8} to dims={8, 64}. The customised mapping of MPI tasks for the inter-
subgroup communication is now more evenly distributed over the entire network.
Again, investigations have been performed for both types of networks, mesh and
torus, and the results are shown in Figure 5.17 and 5.18. For the intra-subgroup
communication where, with customised mapping, communication between nodes in
nearest neighbourhood is guaranteed, the amount of time spent in the MPI_Alltoall
routine is clearly smaller than for the default mapping. Even when there is no torus
connectivity, the path that messages have to travel in the default case is still longer
for communication between the nodes furthest away from each other. Therefore, if the
smallest cube node mapping comes into play, it wins over all the other node
mappings, independent of mesh or torus.

Figure 5.17: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={8, 64} for the 2D-virtual processor grid on the mesh network

Figure 5.18: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={8, 64} for the 2D-virtual processor grid on the torus network
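The latency argument here rests on hop counts: wrap-around links make the torus diameter roughly half that of a mesh. A small illustrative sketch (not code from this project) of the two distance measures:

```python
def mesh_hops(a, b):
    """Hop count between node coordinates a and b on a mesh
    (no wrap-around links): plain Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def torus_hops(a, b, dims):
    """Hop count on a torus: each dimension may wrap around, so the
    per-dimension distance is at most half the dimension size."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

# On an 8x8x8 (512-node) machine, opposite corners are 21 hops apart on
# a mesh but only 3 hops on a torus, thanks to the wrap-around cables.
dims = (8, 8, 8)
print(mesh_hops((0, 0, 0), (7, 7, 7)))         # 21
print(torus_hops((0, 0, 0), (7, 7, 7), dims))  # 3
```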
The customised node mapping for the second all-to-all communication is
characterised by high latency for both network types. However, if one uses the torus
network, the effect of higher bandwidth utilisation for larger problems results in a
trend which comes closer to the performance obtained for the default mapping.
Concerning the impact of communication costs on the entire three-dimensional FFT
computation, a benefit of up to 9% on the torus network and 15% on the mesh
network can be achieved from the customised node mapping. In both cases, the peak
performance improvement was obtained for the 128³ problem. Both figures show a
turnover of trends between smaller and bigger problem sizes. Here, too, we assume
that one reason may be the use of different algorithms for different message sizes.

Figure 5.19: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={128, 4} for the 2D-virtual processor grid
We have seen that using the same subdivision size for the two-dimensional virtual
processor grid but completely different node mapping strategies causes significant
performance differences.
If one is working with the 512-node partition on BlueGene/L, the torus network
becomes of more interest than mesh. Hence, we focus on torus rather than mesh for
our next investigations. In figure 5.19 the node mapping strategies are presented for a
very unbalanced subdivision dims={128,4} for the two-dimensional processor grid.
Both mappings – default and customised – are of fragmentary design. Here, the
default mapping has a complete mapping in two dimensions but a relatively large gap
in the third dimension. The customised mapping has a complete pattern in only one
dimension and shows gaps in the second and third dimensions. Both are
characterised by high latency. However, as one can see in figure 5.20, the mapping
with large gaps between the two dense planes is more affected by high latency than
the pattern with gaps in two dimensions.
Figure 5.20: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={128, 4} for the 2D-virtual processor grid on the torus network
Also, for the second communication between the subgroups, the customised mapping
performs better than the default mapping. We assume that for both mappings (for the
inter-subgroup communication) the torus cables are not in use. The reason for the
superior customised performance is again lower latency, since the messages which
have to be sent between nodes furthest away from each other have a longer way to go
in the case of default mapping than for customised mapping. The imbalanced
subdivision of the two-dimensional processor grid allows the customised mapping to
achieve a performance gain from both all-to-all communications. This improvement
of the communication times has a respectable impact on the entire forward FFT,
ranging from 10% down to 4% depending on the size of complex data being
transformed.
So far, a variety of customised versus default node mappings have been investigated
for different subdivisions of the two-dimensional virtual processor grid. Our interest
is in finding the best mapping strategy for the torus network on the 512-node
partition. Table 5.2 summarises the results from all the investigated node mappings
for the torus network on the 512-node partition on BlueGene/L.
SUBDIVISION VIRTUAL GRID | CUSTOMISED NODE MAPPING | DEFAULT NODE MAPPING
dims = {32, 16}          |                         | x
dims = {64, 8}           | x                       | x
dims = {8, 64}           | x                       |
dims = {128, 4}          | x                       |
dims = {256, 2}          | --                      | --
dims = {512, 1} - 1D     | --                      | --
Table 5.2: Summary of the investigated node mappings for different subdivisions of the 2D virtual processor grid
The next step is to take the node mapping yielding best performance results for each
of the different subdivision sizes and compare them to each other. The following two
tables show the times spent for communication (5.3) and the execution times for the
entire three-dimensional forward FFT computation – each for the mappings yielding
best performance results for each of the different subdivision sizes – for different
problem sizes. However, it seems that the trend heads towards more unbalanced
subdivisions. This is not ideal, since with subdivision sizes not as equal as possible
to each other, efficiently utilising more processors becomes more and more difficult,
especially on a large-scale computing platform [8] such as BlueGene/L.
SUBDIVISION VIRTUAL GRID | 64³      | 128³     | 256³   | 512³   | 1024³
dims = {32, 16}          | 0.000286 | 0.001101 | 0.0071 | 0.0559 | 0.454
dims = {8, 64}           | 0.000278 | 0.000969 | 0.0065 | 0.0539 | 0.415
dims = {128, 4}          | --       | 0.001208 | 0.0061 | 0.0476 | 0.381
dims = {256, 2}          | --       | --       | 0.0065 | 0.0466 | 0.357
dims = {512, 1} - 1D     | --       | --       | --     | 0.0299 | 0.223
Table 5.3: Communication costs measured in seconds for different problem sizes using the best
mapping for each particular subdivision of the 2D virtual processor grid
SUBDIVISION VIRTUAL GRID | 64³      | 128³     | 256³   | 512³  | 1024³
dims = {32, 16}          | 0.000397 | 0.002478 | 0.0209 | 0.194 | 1.828
dims = {8, 64}           | 0.000394 | 0.002410 | 0.0205 | 0.193 | 1.827
dims = {128, 4}          | --       | 0.002717 | 0.0203 | 0.191 | 1.826
dims = {256, 2}          | --       | --       | 0.0206 | 0.185 | 1.815
dims = {512, 1} - 1D     | --       | --       | --     | 0.150 | 1.639
Table 5.4: Cost for entire forward 3D-FFT computation measured in seconds for different problem
sizes using the best mapping for each particular subdivision of the 2D virtual processor grid
5.3.2.4 Mappings on the 1024-node partition
Further investigations have been carried out on the 1024-node partition. We decided
on three different subdivisions of the two-dimensional virtual processor grid for
which a customised and the default mapping have been explored. The first
subdivision was chosen with sizes as close as possible to each other, dims={32, 32}.
Figure 5.21: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={32, 32} for the 2D-virtual processor grid
Figure 5.22: Performance impact of customised node mapping for 3D-FFT on a 1024-node partition
using dims={32, 32} for the 2D-virtual processor grid on the torus network
The reason for the second and third choice, dims={8, 128} and dims={256, 4}, was
based on the results achieved from the same node mapping pattern used on the 512-
node partition.
Figure 5.21 illustrates the customised and default node mappings for both all-to-all
communications on the 1024-node partition with the subdivision dims={32, 32}. The
outcome is close to what we expected because of the results achieved from the
mapping investigations on the 512-node partition using the subdivision dimensions
as close as possible to each other, dims={32, 16} (see figure 5.11).
The customised mapping for the intra-subgroup communication, which is the same
as used for the 512-node partition, shows a great impact on performance compared to
Figure 5.23: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={8, 128} for the 2D-virtual processor grid
the respective default mapping. The reason for this is the poor pattern of the default
mapping, which is highly fragmentary and hence characterised by very high latency
costs. The opposite can be applied to the mappings used for the inter-subgroup
communication. Here the customised mapping features higher latency costs than the
default mapping, equal to the mapping pattern for the 512-node partition. But now
there are more MPI tasks distributed over a 16×8 plane, compared to tasks arranged
in lines of eight in length. This has the consequence that high latency impacts the
performance results. It is useful to compare the curves for the inter-subgroup
communication achieved for the investigations on the 512-node partition (figure
5.12) with the current ones for the 1024-node partition (figure 5.22). This illustrates
that the poor performance obtained for the same fragmentary mapping pattern used
over a larger surface is worse due to higher latency. The impact on the performance
of the entire three-dimensional FFT computation is the same as that obtained on the
512-node partition.
For the second investigation on the 1024-node partition, using the subdivision
dims={8, 128}, the default mappings for both all-to-all communications, within
subgroups and between subgroups, show unsuitable mapping patterns. For the first
communication, the MPI tasks are arranged in a line but cannot efficiently make use
of the torus as they could for the same mapping on the 512-node partition (Figure
5.16). Clearly, this yields better performance for the cube-shaped mapping, since
both mappings can now be regarded as if a mesh were being used.
Figure 5.24: Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={8, 128} for the 2D-virtual processor grid on the torus network
The results achieved for the second all-to-all communication are not obvious from
the start, since the mapping differs from the previously investigated patterns.
However, since a variety of investigations have been carried out, the following
assumption is consistent with the previous speculations. For the research done on the
512-node partition (see figure 5.16), we came to the conclusion that the worse
performance of the fragmentary mapping pattern is due to higher latency costs.
However, in this particular case, we compared a mapping characterised by gaps in all
three dimensions with another mapping where MPI tasks were arranged continuously
over a surface. But what we want to compare here on the 1024-node partition is
rather related to the mapping shown in figure 5.19, where we have the same pattern
for both mappings along z-dimension, small gaps versus no gaps along the y-
dimension, and small gaps versus large gaps along the x-dimension. For this
mapping, we have achieved a slight improvement compared to the one illustrated in
figure 5.16. However, with the current mappings on the 1024-node partition, we take
a further step and compare a mapping completely fragmentary in all three
dimensions versus a mapping characterised by two completely filled planes with a
large gap in between. The result, presented in figure 5.24, shows that on the 1024-node
partition, a completely fragmentary mapping is more affected by high latency costs
than a consistent mapping with a very large gap. However, the enormous
performance gain achieved for the first all-to-all communication entirely balances
out the poor performance. In summary, on the 512-node partition an overall
performance improvement for the three-dimensional FFT computation of up to 9%
was obtained. Here, on the 1024-node partition we have an additional improvement
of 2% for each problem size apart from 128³.
Figure 5.25: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={256, 4} for the 2D-virtual processor grid
Figure 5.26: Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={256, 4} for the 2D-virtual processor grid on the torus network
For the third investigation carried out on the 1024-node partition, the virtual
processor grid is subdivided into dims={256, 4}. Both mappings, customised and
default, are illustrated in figure 5.25 and show the same pattern as was investigated
on the 512-node partition (see figure 5.19). Using a dimension subdivision
dims={256, 4}, the smallest three-dimensional complex data array which can be
transformed, is of the size 256³.
This example shows very precisely that the same results can be achieved by using the
same node mapping pattern extended to a twice-as-large partition. The results,
presented in figure 5.26 and compared to the results achieved on the 512-node
partition (see figure 5.20), show the expected behaviour.
6. CONCLUSION
We have demonstrated the excellent scalability of the three-dimensional FFT code
for large problem sizes on the BlueGene/L platform on up to 1,024 processors. For
relatively small problem sizes (32³ complex numbers), the three-dimensional FFT
using slab decomposition is typically 20% to 30% faster than the FFT computation
where data is decomposed in two dimensions. On the other hand, the efficient
utilisation of a larger number of processors for slab decomposition is limited to the
data elements along one dimension. At this point, the FFT computation using two-dimensional
decomposition comes beneficially into play. To further improve this
performance, a variety of mappings of MPI tasks onto the three-dimensional torus
communication network have been explored.
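The scalability limits referred to above can be stated compactly. The following sketch (the function name is ours) encodes only the node-count bound, not the decompositions themselves: a slab (1D) decomposition assigns whole planes of an n³ array, a two-dimensional (pencil) decomposition assigns lines.

```python
def max_nodes(n, decomposition):
    """Upper bound on usable nodes for an n^3 FFT: slab decomposition
    is limited to n nodes (one plane per node), while a 2D (pencil)
    decomposition can employ up to n^2 nodes (one line per node)."""
    return n if decomposition == "slab" else n * n

# A 128^3 problem saturates slab decomposition at 128 nodes, but the
# 2D decomposition can still occupy a full 1024-node partition.
assert max_nodes(128, "slab") < 1024 <= max_nodes(128, "2d")
```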
Our experiments clearly indicate that a carefully chosen mapping of MPI tasks on the
torus network that takes the network characteristics into account is beneficial in
obtaining improved performance for our type of application. This is especially
important for scientific applications that call FFT routines many times.
For the FFT computation using two-dimensional decomposition, we have seen that
when the dimension sizes of the two-dimensional processor grid are chosen as equal
as possible (for instance, dims={32, 16} on the 512-node partition or dims={32, 32}
on the 1024-node partition), the default node mapping utilises the torus well and is
difficult to improve upon. On the other hand, choosing the dimension sizes less
balanced (in other words, coming closer and closer to the one-dimensional
decomposition) delivers performance improvements through customised node
mappings. Our results show excellent performance
enhancements for the 8-cube (typically 20% to 45% improvement in communication
costs) and the 4-square (typically 50% to 65% improvement in communication costs)
patterns, compared to a row of processors which utilises the torus links in one
dimension only. Even if the communication costs are higher for the second all-to-all
communication because of an unprofitable node mapping, the gain obtained from the
first communication is enough to cancel the poor performance completely.
If the mapping for the intra-subgroup communication is small and dense (as it is for
the 8-cube and the 4-square), then the mapping for the inter-subgroup communication
is discontiguous. In general, discontiguous mappings are expensive. Even if an
evenly distributed fragmentary node mapping is spanned over the whole 512-node
partition, which allows utilisation of the torus links in all three dimensions, the
higher latency costs are hardly reduced.
These small and dense shapes – where communication between nodes in the nearest
neighbourhood is responsible for the benefit – can be exploited if the subdivision of
the two-dimensional virtual processor grid is less balanced. For
instance, for a subdivision of the virtual processor grid dims={128, 4} the customised
node mapping shows a 10% improvement of the FFT (for the problem sizes 128³ and
256³), while the total communication costs are improved by 25% - compared to the
default node mapping on BlueGene/L. This fact leads to an additional conclusion that
ongoing developments in FFT libraries as well as re-sort strategies have the potential
to improve some of these results. More precisely, this significant impact on
communication costs, combined with the use of possibly faster FFT libraries and
more efficient re-sort methods, would have a greater impact on the entire FFT
performance. Therefore, looking for possibilities to reduce the computation costs
(FFT as well as re-sort methods) could be valuable for future studies.
In the following section we discuss additional likely sources of performance
improvements and possible future research projects.
The impact of the virtual node mode on BlueGene/L has only been briefly
investigated, for the two-dimensional case. The results have shown that the execution
times can be almost halved by dedicating the same number of chips but running one
MPI task on each of the two processors of a chip. This makes virtual node mode
interesting for future investigations of parallel FFTs for multi-dimensional data.
The following was not discussed in the previous chapters. However, within this
project, we spent some time on auxiliary research on how we can beneficially exploit
the dual-processor compute nodes operating in coprocessor mode on BlueGene/L. In
the coprocessor mode, all computations are performed on the first processor of each
chip [3]. The second processor is used for communications [3]. It has the advantage
that computations and communications can be overlapped. For the two-dimensional
FFT computation, we wrote a customised My_MPI_Alltoall routine that allows
overlapping of communication and computation. The computation of the one-dimensional
FFTs is broken down into two halves. While the first half of the
complex data array is being transformed, the second processor is idle. After finishing
the computation of the first half, the second processor is in charge of communication
while the first processor continues with transforming the second half of the complex
data array. However, since this optimisation is not the major part of this project, the
customised My_MPI_Alltoall is written in a very naïve way consisting of a non-
blocking standard send-and-receive pair. It does not take into account different
algorithms depending on the size of the messages. The results show a marginal
improvement only for very large problems (32,768²). Future work may involve a
successive splitting up of the one-dimensional FFT computation, rather than only
dividing it in two halves, so that the second processor can be exploited earlier.
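As a purely illustrative model (this is not the My_MPI_Alltoall code itself, and the timings are hypothetical), splitting the batch of one-dimensional FFTs into more chunks lets communication on the coprocessor hide behind computation on the compute processor:

```python
def total_time(chunks, t_fft, t_comm):
    """Idealised timeline for splitting a batch of 1D FFTs into `chunks`
    pieces: while the compute processor transforms chunk i+1, the second
    processor communicates chunk i.  With one chunk there is no overlap."""
    per_fft, per_comm = t_fft / chunks, t_comm / chunks
    t = per_fft                      # the first chunk must be computed first
    for _ in range(chunks - 1):
        t += max(per_fft, per_comm)  # compute and communicate in parallel
    return t + per_comm              # the last chunk's communication drains

# No split (1 chunk) costs t_fft + t_comm in full; finer splits hide
# progressively more of the communication behind computation.
print(total_time(1, 4.0, 2.0))  # 6.0
print(total_time(2, 4.0, 2.0))  # 5.0
```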
The MPI implementation for BlueGene/L is a particularly optimised port of the
MPICH2 [15] library for the BlueGene/L architecture. It comes with, amongst
others, an optimised MPI_Alltoall as well as MPI_Alltoallv algorithm
which optimise the injection of packets to achieve high network efficiency [13]. It
may be useful to investigate the performance impact of using the optimised
MPI_Alltoallv for the communication kernel of the multi-dimensional parallel
FFT computation since within our implementations we have only used the
MPI_Alltoall algorithm. The MPI_Alltoallv algorithm allows more
flexibility with respect to the structure and size of the input and output data.
Therefore, expensive re-sorting of the data before and after the all-to-all
communication can be eliminated in many cases.
The recently published parallel three-dimensional FFT library for BlueGene/L
(BGL3DFFT) is specifically designed to take advantage of the IBM BlueGene/L
architecture by enabling applications that use three-dimensional FFTs to scale to
thousands of BlueGene/L processors [16]. Most of the alternative parallel libraries
compute three-dimensional FFTs by using the slab decomposition technique [16].
We know that the scalability of the slab-based methods is limited by the size of the
data of a single axis. In BGL3DFFT, the three-dimensional FFT implementation is
based on a two-dimensional decomposition which enables scalability to N²
processors [16] (N := number of elements along a single dimension). At the time of
this writing, the BGL3DFFT library has not yet been installed on BlueSky. However,
utilising this library for projects with a three dimensional FFT core might be of
particular importance in the near future.
APPENDIX A
Figure A.1: Speedup of the 3D-FFT implementation using 1D-decomposition (speedup versus number of nodes, up to 512 nodes, for the problem sizes 32³, 64³, 128³, 256³ and 512³, plotted against the ideal speedup)
Figure A.2: Speedup of the 3D-FFT implementation using 2D-decomposition (log-log plot of speedup versus number of nodes, up to 1,024 nodes, for the problem sizes 32³, 64³, 128³, 256³ and 512³, plotted against the ideal speedup)
NODES | 32³    | 64³    | 128³   | 256³   | 512³   | 1024³
1     | 49.875 | 53.479 | 51.586 | --     | --     | --
2     | 29.538 | 36.361 | 29.581 | --     | --     | --
4     | 36.237 | 35.987 | 37.949 | 39.053 | --     | --
8     | 35.750 | 38.636 | 40.531 | 47.739 | --     | --
16    | 48.288 | 50.000 | 50.868 | 51.735 | --     | --
32    | 39.067 | 47.340 | 48.173 | 50.172 | 49.625 | --
64    | --     | 41.306 | 48.563 | 49.576 | 50.845 | --
128   | --     | --     | 44.251 | 48.060 | 49.876 | --
256   | --     | --     | --     | 42.318 | 48.034 | 48.969
512   | --     | --     | --     | --     | 46.813 | 51.072
Table A.1: Performance improvement of communication costs for slab decomposition compared to
2D-decomposition
APPENDIX B
(Values are given as "1D DECOM / 2D DECOM"; "--" marks entries that exist for the 2D decomposition only.)

NODES = 1
Time for plan (s): 5.970136 / 5.967375
Time for 1st FFTW (row) (s): 0.124627 / 0.122882
Time for 2nd FFTW (col) (s): 0.124999 / 0.122423
Time for 3rd FFTW (plane) (s): 0.116200 / 0.123429
Time for 1st Resort (s): 0.372843 / 0.366375
Time for 2nd Resort (s): 0.553968 / 0.048459
Time for 3rd Resort (s): -- / 0.382529
Time for 4th Resort (s): -- / 0.054857
Time for 1st Comm (s): -- / 0.052827
Time for 2nd Comm (s): -- / 0.052834
Time for FFTW (s): 0.365826 / 0.368734
Time for Resort (s): 0.926811 / 0.852221
Time for Comms (s): 0.051155 / 0.105662
Time for forward 3D-FFT BEF.BARR: 1.343793 / 1.326617
Time for forward 3D-FFT AFT.BARR: 1.343794 / 1.326619
Time for backward 3D-FFT (s): 1.388044 / 1.528895
SPEEDUP: 1.000000 / 1.000000
EFFICIENCY: 1.000000 / 1.000000

NODES = 2
Time for plan (s): 5.970110 / 5.963465
Time for 1st FFTW (row) (s): 0.062167 / 0.061846
Time for 2nd FFTW (col) (s): 0.062371 / 0.061728
Time for 3rd FFTW (plane) (s): 0.058333 / 0.062184
Time for 1st Resort (s): 0.188852 / 0.178932
Time for 2nd Resort (s): 0.274624 / 0.027077
Time for 3rd Resort (s): -- / 0.189501
Time for 4th Resort (s): -- / 0.028056
Time for 1st Comm (s): -- / 0.067745
Time for 2nd Comm (s): -- / 0.026816
Time for FFTW (s): 0.182871 / 0.185758
Time for Resort (s): 0.463476 / 0.423566
Time for Comms (s): 0.066588 / 0.094561
Time for forward 3D-FFT BEF.BARR: 0.712936 / 0.703885
Time for forward 3D-FFT AFT.BARR: 0.712966 / 0.703887
Time for backward 3D-FFT (s): 0.744644 / 0.811360
SPEEDUP: 1.884793 / 1.884704
EFFICIENCY: 0.942396 / 0.942352

NODES = 4
Time for plan (s): 5.968986 / 5.960823
Time for 1st FFTW (row) (s): 0.029588 / 0.031450
Time for 2nd FFTW (col) (s): 0.029709 / 0.031374
Time for 3rd FFTW (plane) (s): 0.030861 / 0.031672
Time for 1st Resort (s): 0.092216 / 0.090013
Time for 2nd Resort (s): 0.134422 / 0.014205
Time for 3rd Resort (s): -- / 0.093685
Time for 4th Resort (s): -- / 0.015688
Time for 1st Comm (s): -- / 0.058314
Time for 2nd Comm (s): -- / 0.034014
Time for FFTW (s): 0.090158 / 0.094497
Time for Resort (s): 0.226638 / 0.213591
Time for Comms (s): 0.057290 / 0.092328
Time for forward 3D-FFT BEF.BARR: 0.374087 / 0.400416
Time for forward 3D-FFT AFT.BARR: 0.374661 / 0.400419
Time for backward 3D-FFT (s): 0.393519 / 0.457348
SPEEDUP: 3.586693 / 3.313077
EFFICIENCY: 0.896673 / 0.828226

NODES = 8
Time for plan (s): 5.965137 / 5.957422
Time for 1st FFTW (row) (s): 0.015103 / 0.015085
Time for 2nd FFTW (col) (s): 0.014656 / 0.014852
Time for 3rd FFTW (plane) (s): 0.015364 / 0.014812
Time for 1st Resort (s): 0.032551 / 0.031501
Time for 2nd Resort (s): 0.033830 / 0.007976
Time for 3rd Resort (s): -- / 0.033048
Time for 4th Resort (s): -- / 0.007492
Time for 1st Comm (s): -- / 0.031318
Time for 2nd Comm (s): -- / 0.016695
Time for FFTW (s): 0.045123 / 0.044749
Time for Resort (s): 0.066381 / 0.080017
Time for Comms (s): 0.028552 / 0.048012
Time for forward 3D-FFT BEF.BARR: 0.140056 / 0.172778
Time for forward 3D-FFT AFT.BARR: 0.140119 / 0.173544
Time for backward 3D-FFT (s): 0.141519 / 0.170658
SPEEDUP: 9.590376 / 7.644280
EFFICIENCY: 1.198797 / 0.955535

NODES = 16
Time for plan (s): 5.960476 / 5.964690
Time for 1st FFTW (row) (s): 0.006392 / 0.006972
Time for 2nd FFTW (col) (s): 0.006802 / 0.006403
Time for 3rd FFTW (plane) (s): 0.006949 / 0.006406
Time for 1st Resort (s): 0.013645 / 0.013502
Time for 2nd Resort (s): 0.015878 / 0.003292
Time for 3rd Resort (s): -- / 0.014431
Time for 4th Resort (s): -- / 0.003288
Time for 1st Comm (s): -- / 0.014190
Time for 2nd Comm (s): -- / 0.014119
Time for FFTW (s): 0.020143 / 0.019781
Time for Resort (s): 0.029523 / 0.034512
Time for Comms (s): 0.013909 / 0.028310
Time for forward 3D-FFT BEF.BARR: 0.063575 / 0.082603
Time for forward 3D-FFT AFT.BARR: 0.063577 / 0.082606
Time for backward 3D-FFT (s): 0.061973 / 0.079165
SPEEDUP: 21.136480 / 16.059596
EFFICIENCY: 1.321030 / 1.003724

NODES = 32
Time for plan (s): 5.961193 / 5.958275
Time for 1st FFTW (row) (s): 0.003062 / 0.003202
Time for 2nd FFTW (col) (s): 0.003024 / 0.003037
Time for 3rd FFTW (plane) (s): 0.003165 / 0.003038
Time for 1st Resort (s): 0.005870 / 0.005966
Time for 2nd Resort (s): 0.006985 / 0.001764
Time for 3rd Resort (s): -- / 0.006118
Time for 4th Resort (s): -- / 0.001594
Time for 1st Comm (s): -- / 0.007249
Time for 2nd Comm (s): -- / 0.006987
Time for FFTW (s): 0.009252 / 0.009276
Time for Resort (s): 0.012855 / 0.015441
Time for Comms (s): 0.007378 / 0.014236
Time for forward 3D-FFT BEF.BARR: 0.029485 / 0.038954
Time for forward 3D-FFT AFT.BARR: 0.029488 / 0.039080
Time for backward 3D-FFT (s): 0.029450 / 0.036021
SPEEDUP: 45.570876 / 33.946238
EFFICIENCY: 1.424089 / 1.060819

NODES = 64
Time for plan (s): 5.968804 / 5.958727
Time for 1st FFTW (row) (s): 0.001689 / 0.001517
Time for 2nd FFTW (col) (s): 0.001474 / 0.001515
Time for 3rd FFTW (plane) (s): 0.001530 / 0.001514
Time for 1st Resort (s): 0.002949 / 0.002988
Time for 2nd Resort (s): 0.003201 / 0.000882
Time for 3rd Resort (s): -- / 0.003048
Time for 4th Resort (s): -- / 0.000886
Time for 1st Comm (s): -- / 0.003535
Time for 2nd Comm (s): -- / 0.003637
Time for FFTW (s): 0.004692 / 0.004546
Time for Resort (s): 0.006150 / 0.007804
Time for Comms (s): 0.003689 / 0.007172
Time for forward 3D-FFT BEF.BARR: 0.014531 / 0.019522
Time for forward 3D-FFT AFT.BARR: 0.014534 / 0.019616
Time for backward 3D-FFT (s): 0.014526 / 0.018692
SPEEDUP: 92.458648 / 67.629435
EFFICIENCY: 1.444666 / 1.056709
APPENDIX B
68
NODES = 128 (1D DECOM / 2D DECOM)
Time for plan (s): 5.978766 / 5.965477
Time for 1st FFTW (row) (s): 0.000761 / 0.000783
Time for 2nd FFTW (col) (s): 0.000760 / 0.000781
Time for 3rd FFTW (plane) (s): 0.000791 / 0.000781
Time for 1st Resort (s): 0.001482 / 0.001512
Time for 2nd Resort (s): 0.001518 / 0.000614
Time for 3rd Resort (s): -- / 0.001496
Time for 4th Resort (s): -- / 0.000488
Time for 1st Comm (s): -- / 0.001823
Time for 2nd Comm (s): -- / 0.001804
Time for FFTW (s): 0.002312 / 0.002345
Time for Resort (s): 0.003000 / 0.004110
Time for Comms (s): 0.002022 / 0.003627
Time for forward 3D-FFT BEF.BARR: 0.007335 / 0.010082
Time for forward 3D-FFT AFT.BARR: 0.007502 / 0.010083
Time for backward 3D-FFT (s): 0.007607 / 0.009644
SPEEDUP: 179.124766 / 131.569870
EFFICIENCY: 1.399412 / 1.027889

NODES = 256 (2D DECOM only)
Time for plan (s): 5.954514
Time for 1st FFTW (row) (s): 0.000380
Time for 2nd FFTW (col) (s): 0.000378
Time for 3rd FFTW (plane) (s): 0.000379
Time for 1st Resort (s): 0.000571
Time for 2nd Resort (s): 0.000254
Time for 3rd Resort (s): 0.000580
Time for 4th Resort (s): 0.000260
Time for 1st Comm (s): 0.001006
Time for 2nd Comm (s): 0.000935
Time for FFTW (s): 0.001137
Time for Resort (s): 0.001665
Time for Comms (s): 0.001940
Time for forward 3D-FFT BEF.BARR: 0.004743
Time for forward 3D-FFT AFT.BARR: 0.004786
Time for backward 3D-FFT (s): 0.004675
SPEEDUP: 277.187421
EFFICIENCY: 1.082763

NODES = 512 (2D DECOM only)
Time for plan (s): 5.956440
Time for 1st FFTW (row) (s): 0.000193
Time for 2nd FFTW (col) (s): 0.000192
Time for 3rd FFTW (plane) (s): 0.000191
Time for 1st Resort (s): 0.000268
Time for 2nd Resort (s): 0.000171
Time for 3rd Resort (s): 0.000260
Time for 4th Resort (s): 0.000096
Time for 1st Comm (s): 0.000535
Time for 2nd Comm (s): 0.000536
Time for FFTW (s): 0.000576
Time for Resort (s): 0.000795
Time for Comms (s): 0.001071
Time for forward 3D-FFT BEF.BARR: 0.002442
Time for forward 3D-FFT AFT.BARR: 0.002507
Time for backward 3D-FFT (s): 0.002474
SPEEDUP: 529.165935
EFFICIENCY: 1.033527

NODES = 1024 (2D DECOM only)
Time for plan (s): 5.953987
Time for 1st FFTW (row) (s): 0.000099
Time for 2nd FFTW (col) (s): 0.000097
Time for 3rd FFTW (plane) (s): 0.000097
Time for 1st Resort (s): 0.000133
Time for 2nd Resort (s): 0.000078
Time for 3rd Resort (s): 0.000128
Time for 4th Resort (s): 0.000055
Time for 1st Comm (s): 0.000584
Time for 2nd Comm (s): 0.000339
Time for FFTW (s): 0.000293
Time for Resort (s): 0.000394
Time for Comms (s): 0.000922
Time for forward 3D-FFT BEF.BARR: 0.001610
Time for forward 3D-FFT AFT.BARR: 0.001610
Time for backward 3D-FFT (s): 0.001598
SPEEDUP: 798.686900
EFFICIENCY: 0.779900
Table B.1: Performance measurements in seconds for the 3D-FFT implementations using 1D
and 2D decomposition for problem size 128³
APPENDIX C
Comms between nodes      Line        Diagonal    Volume diagonal
0-1                    0.007230     0.007444       0.007634
0-2                    0.007425     0.007835       0.008192
0-3                    0.007659     0.008200       0.008776
0-4                    0.007940     0.008723       0.009484
0-5                    0.007688     0.008291       0.008879
0-6                    0.007466     0.007891       0.008306
0-7                    0.007276     0.007536       0.007767

Table C.1: Performance measurements in seconds for the ping-pong application: sending/receiving
1,000 messages of 10 integers (4 bytes each) between two nodes mapped onto the torus network
along a line, a diagonal, and a volume diagonal.
Comms between nodes      Line        Diagonal    Volume diagonal
0-1                    0.012639     0.012845       0.013064
0-2                    0.013009     0.013274       0.013609
0-3                    0.013239     0.013592       0.014159
0-4                    0.013389     0.014059       0.014904
0-5                    0.013254     0.013662       0.014271
0-6                    0.013042     0.013320       0.013689
0-7                    0.012670     0.012900       0.013252

Table C.2: Performance measurements in seconds for the ping-pong application: sending/receiving
1,000 messages of 100 integers (4 bytes each) between two nodes mapped onto the torus network
along a line, a diagonal, and a volume diagonal.
Comms between nodes      Line        Diagonal    Volume diagonal
0-1                    0.263972     0.263337       0.263979
0-2                    0.264637     0.264388       0.265659
0-3                    0.265207     0.265586       0.267373
0-4                    0.266122     0.267029       0.269522
0-5                    0.265270     0.265800       0.267732
0-6                    0.264750     0.264657       0.266029
0-7                    0.264121     0.263597       0.264356

Table C.3: Performance measurements in seconds for the ping-pong application: sending/receiving
1,000 messages of 1,000 integers (4 bytes each) between two nodes mapped onto the torus network
along a line, a diagonal, and a volume diagonal.
Comms between nodes      Line        Diagonal    Volume diagonal
0-1                    0.737561     0.722970       0.723543
0-2                    0.738190     0.724083       0.725269
0-3                    0.738716     0.725243       0.727019
0-4                    0.739368     0.726798       0.729185
0-5                    0.738797     0.725518       0.727335
0-6                    0.738221     0.724304       0.725663
0-7                    0.737681     0.723136       0.723962

Table C.4: Performance measurements in seconds for the ping-pong application: sending/receiving
1,000 messages of 10,000 integers (4 bytes each) between two nodes mapped onto the torus network
along a line, a diagonal, and a volume diagonal.
For long messages, such as 10,000 integers, the mapping along a line becomes even more
expensive than the mapping along the diagonal or the volume diagonal. We attribute this to
network congestion, since the diagonal mappings offer a greater choice of paths between
the two nodes.
BIBLIOGRAPHY
[1] Gustafson, J.L., Reevaluating Amdahl's Law, CACM, 31(5), 1988, pp. 532-533.
[2] Amdahl, G.M., Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the AFIPS Conference, Reston, VA, 1967, pp. 483-485.
[3] University of Edinburgh, BlueGene/L User Information http://www.epcc.ed.ac.uk/~bgapps/user_info.html.
[4] Franchetti F., Kral S., Lorenz J., Püschel M., Ueberhuber C. W., Automatically Tuned FFTs for BlueGene/L’s Double FPU, High Performance Computing for Computational Science - VECPAR 2004, pp. 23-36.
[5] Eleftheriou, M., Moreira, J. E., Fitch, B. G., Germain, R. S., A Volumetric FFT for BlueGene/L, Lecture Notes in Computer Science, volume 2913, 2003, pp. 194-203.
[6] Eleftheriou, M., Fitch, B., Rayshubskiy, A., Ward, T. J. C., Germain, R. S.,
Performance Measurements of the 3D FFT on the Blue Gene/L Supercomputer, Euro-Par 2005, pp. 795-803.
[7] Davis, K., Hoisie, A., Johnson, G., Kerbyson, D. J., Lang, M., Pakin, S.,
Petrini, F., A Performance and Scalability Analysis of the BlueGene/L Architecture, Proceedings of the ACM/IEEE Conference on Supercomputing, 2004.
[8] Gygi, F., Yates, R. K., Lorenz, J., Draeger, E. W., Franchetti, F., Ueberhuber, C., W., de Supinski, B., R., Kral, S., Gunnels, J. A., Sexton, J. C., Large-Scale First-Principles Molecular Dynamics simulations on the BlueGene/L Platform using the Qbox code, Conference on High Performance Networking and Computing, 2005, pp. 24 et sqq.
[9] Fang, B., Deng, Y., Performance of 3D FFT on 6D QCDOC Torus Parallel Supercomputer, J. Comp. Phys. Submitted, 2005.
[10] Kral, S., FFTW-GEL Homepage, http://www.complang.tuwien.ac.at/skral/fftwgel.html.
[11] Gara, A., Blumrich, M. A., Chen, D., Chiu, G. L.-T., Coteus, P., Giampapa,
M. E., Haring, R. A., Heidelberger, P., Hoenicke, D., Kopcsay, G. V., Liebsch, T. A., Ohmacht, M., Steinmacher-Burow, B. D., Takken, T., Vranas, P., Overview of the Blue Gene/L system architecture, IBM Journal of Research and Development, Volume 49, Number 2/3, 2005.
[12] Message Passing Interface Forum, MPI: A Message-Passing Interface
Standard, University of Tennessee, 1995, see http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html.
[13] Almási, G., Archer, C., Castaños, J. G., Gunnels, J. A., Erway, C. C., Heidelberger, P., Martorell, X., Moreira, J. E., Pinnow, K., Ratterman, J., Steinmacher-Burow, B. D., Gropp, W., Toonen, B., Design and implementation of message-passing services for the Blue Gene/L supercomputer, IBM Journal of Research and Development, Volume 49, Number 2/3, 2005.
[14] Adiga N. R. et al., An Overview of the Blue Gene/L Supercomputer, Proceedings of the ACM/IEEE Conference on Supercomputing, 2002, pp. 1–22, see http://www.sc-conference.org/sc2002/.
[15] MPICH and MPICH2 homepage, see http://www-unix.mcs.anl.gov/mpi/mpich.
[16] Eleftheriou, M., 3D Fast Fourier Transform Library for Blue Gene/L, http://www.alphaworks.ibm.com/tech/bgl3dfft.
[17] Numerical Recipes in C, http://www.library.cornell.edu/nr/bookcpdf.html.
[18] Fourier Theory, http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/MARSHALL/node17.html.
[19] Fastest Fourier Transform in the West (FFTW), http://www.dl.ac.uk/TCSC/Subjects/Parallel_Algorithms/FFTreport/node82.html
[20] Franchetti, F., FFTs on BlueGene/L machines, http://www.llnl.gov/asci/platforms/bluegene/talks/franchetti.pdf.
[21] Allan, R.J., Taylor, K., Parallel Application Software on High Performance Computers, Serial and Parallel FFT Routines, http://www.dl.ac.uk/TCSC/Subjects/Parallel_Algorithms/FFTreport/.
[22] MPI Routines, http://www-unix.mcs.anl.gov/mpi/www/www3/.
[23] Multiprocessing by Message Passing MPI, http://scv.bu.edu/tutorials/MPI/.
[24] Agerwala, T., Martin, J. L., Mirza, J. H., Sadler, D. C., Dias, D. M., Snir, M., SP2 system architecture, IBM Systems Journal, Volume 34, Number 2, 1995, see http://www.research.ibm.com/journal/sj/342/agerwala.html.
[25] MPI over InfiniBand Project homepage, default implementation of the MPI_Alltoall algorithm, https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/trunk/src/mpi/coll/alltoall.c.

[26] University of Edinburgh, EPCC Course Slides, Applied Numerical Algorithms.

[27] James, J. F., A Student's Guide to Fourier Transforms: With Applications in Physics and Engineering, Cambridge University Press, 2002.

[28] Kallies, B., FFTW, 2004, http://www.hlrn.de/doc/fftw/index.html.

[29] FFTW Homepage, http://www.fftw.org/.

[30] Hennessy, J. L., Patterson, D. A., Computer Architecture: A Quantitative Approach, Third Edition, 2003.