Fourier Transforms
for the
BlueGene/L Communication Network
Heike Jagode
MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2006
ABSTRACT
A computational kernel of particular importance for many scientific applications is
the Fast Fourier Transform (FFT) of multi-dimensional data. A fundamental
challenge is the design and implementation of such parallel numerical algorithms to
utilise efficiently thousands of nodes. The BlueGene/L is a massively parallel high
performance computer organised as a three-dimensional torus of compute nodes. To
maintain application performance and scaling, the correct mapping of MPI tasks onto
the three-dimensional torus communication network is a critical factor. This paper
presents the design and implementation of the parallel two-dimensional and three-
dimensional FFT. For the three-dimensional case we compare one-dimensional with
two-dimensional decomposition of the complex data. The applications call the one-
dimensional single-processor FFT kernel routine provided by the Fastest Fourier
Transform in the West (FFTW) library. We present experimental results of different
node mappings onto the BlueGene/L’s torus on up to 1,024 nodes. The
implementation of the FFT algorithm using two-dimensional decomposition scales well up to 1,024 nodes for a variety of problem sizes (128³, 256³, 512³). Our
experiments clearly indicate that a carefully chosen mapping of MPI tasks onto the
torus network that takes the network characteristics into account is beneficial in
obtaining improved performance for this type of application.
CONTENTS

1 INTRODUCTION
2 OVERVIEW OF THE BLUEGENE/L ARCHITECTURE
   2.1 Hardware Architecture
   2.2 Software Architecture
3 FOURIER TRANSFORM
   3.1 Continuous Fourier Transform
   3.2 Discrete Fourier Transform
   3.3 Fast Fourier Transform
   3.4 Fastest Fourier Transform in the West
4 TWO-DIMENSIONAL FAST FOURIER TRANSFORMS
   4.1 Parallel FFT in Two Dimensions
   4.2 Taskfarm of Parallel FFTs
   4.3 Algorithm Details
   4.4 Verification of Results
   4.5 Performance Analysis
      4.5.1 Mesh versus Torus Network
      4.5.2 Virtual Node Mode on BlueGene/L
      4.5.3 Double FPU on BlueGene/L
      4.5.4 MPI Task Mapping Strategies
         4.5.4.1 Mappings on the 32-node Partition
         4.5.4.2 Mappings on the 128-node Partition
         4.5.4.3 Mappings on the 512-node Partition
5 THREE-DIMENSIONAL FAST FOURIER TRANSFORMS
   5.1 Parallelisation
   5.2 Verification of Results
   5.3 Performance Analysis
      5.3.1 1D-Decomposition versus 2D-Decomposition
      5.3.2 MPI Task Mapping Strategies
         5.3.2.1 Mappings on the 32-node Partition
         5.3.2.2 Mappings on the 128-node Partition
         5.3.2.3 Mappings on the 512-node Partition
         5.3.2.4 Mappings on the 1024-node Partition
6 CONCLUSION
APPENDIX A
APPENDIX B
APPENDIX C
BIBLIOGRAPHY
LIST OF TABLES

4.1 Times measured in seconds for a problem size of 16384² using 128 nodes
4.2 Times measured in seconds for a problem size of 16384² using 512 nodes
4.3 Execution times in seconds for the 2D-FFT computation for different problem sizes using coprocessor mode and virtual node mode on BlueGene/L
4.4 Summary of the investigated node mappings for problem sizes between 2048² and 16384²
4.5 Communication and 2D-FFT computation costs (seconds) for different problem sizes with the mapping yielding best results for mesh and torus
5.1 Performance improvement of the slab decomposition compared to 2D-decomposition
5.2 Summary of the investigated node mappings for different subdivisions of the 2D virtual processor grid
5.3 Communication costs measured in seconds for different problem sizes using the best mapping for each particular subdivision of the 2D virtual processor grid
5.4 Cost for entire forward 3D-FFT computation measured in seconds for different problem sizes using the best mapping for each particular subdivision of the 2D virtual processor grid
A.1 Performance improvement of communication costs for slab decomposition compared to 2D-decomposition
B.1 Performance measurements in seconds for the 3D-FFT implementations using 1D and 2D decomposition for problem size 128³
C.1 Performance measurements in seconds for a ping-pong application sending/receiving 1000 messages of 10 integers (4 bytes each) between 2 nodes differently mapped on the torus network along a line, diagonal, and volume diagonal
C.2 Performance measurements in seconds for a ping-pong application sending/receiving 1000 messages of 100 integers (4 bytes each) between 2 nodes differently mapped on the torus network along a line, diagonal, and volume diagonal
C.3 Performance measurements in seconds for a ping-pong application sending/receiving 1000 messages of 1,000 integers (4 bytes each) between 2 nodes differently mapped on the torus network along a line, diagonal, and volume diagonal
C.4 Performance measurements in seconds for a ping-pong application sending/receiving 1000 messages of 10,000 integers (4 bytes each) between 2 nodes differently mapped on the torus network along a line, diagonal, and volume diagonal
LIST OF FIGURES

2.1 Torus network with periodic boundary conditions
2.2 Node mappings on torus network along a line, diagonal, and volume diagonal
2.3 Performance measurements for ping-pong application sending/receiving messages of 100 integers between 2 nodes differently mapped on torus network along a line, diagonal, and volume diagonal
4.1 Computational steps of the two-dimensional FFT implementation
4.2 FFTW library functions and resort strategies for the two-dimensional FFT computation
4.3 Comparison of mesh vs torus network for a variety of problem sizes
4.4 Comparison of mesh vs torus network
4.5 Times of the forward 2D-FFT for a problem size of 2048²
4.6 Customised versus default mapping on 32-node partition
4.7 Performance impact of customised versus default mapping on 32-node partition
4.8 Two customised mappings versus default mapping on 128-node partition
4.9 Performance impact of customised versus default mapping on 128-node partition
4.10 Customised mappings versus default mapping on 128-node partition
4.11 Performance impact of customised versus default mapping on 128-node partition
4.12 Two customised mappings versus default mapping on 512-node partition
4.13 Performance impact of customised versus default mapping on 512-node partition (mesh)
4.14 Performance impact of customised versus default mapping on 512-node partition (torus)
4.15 Customised mappings versus default mapping on 512-node partition
4.16 Performance impact of customised versus default mapping on 512-node partition (torus)
5.1 Computational steps of the 3D-FFT implementation using 1D-decomposition
5.2 Computational steps of the 3D-FFT implementation using 2D-decomposition
5.3 Speedup of the 3D-FFT implementation using 1D-decomposition
5.4 Speedup of the 3D-FFT implementation using 2D-decomposition
5.5 (a) Performance measurements for the 3D-FFT implementations using 1D and 2D decomposition for five different problem sizes, respectively
5.5 (b) Performance measurements for the communication times of the 3D-FFT implementations using 1D and 2D decomposition for five different problem sizes, respectively
5.6 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 32-node partition using dims={8, 4} for the 2D virtual processor grid
5.7 Performance impact of customised node mapping for 3D-FFT on a 32-node partition for various problem sizes
5.8 Two customised and default node mappings for the 1st and 2nd all-to-all communication on a 128-node partition using dims={16, 8} and dims={8, 16} for the 2D virtual processor grid
5.9 Performance impact of customised node mapping for 3D-FFT on a 128-node partition using dims={16, 8} for the 2D virtual processor grid
5.10 Performance impact of customised node mapping for 3D-FFT on a 128-node partition using dims={8, 16} for the 2D virtual processor grid
5.11 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={32, 16} for the 2D virtual processor grid
5.12 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={32, 16} for the 2D virtual processor grid on the torus network
5.13 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={64, 8} for the 2D virtual processor grid
5.14 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={64, 8} for the 2D virtual processor grid on the mesh network
5.15 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={64, 8} for the 2D virtual processor grid on the torus network
5.16 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={8, 64} for the 2D virtual processor grid
5.17 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={8, 64} for the 2D virtual processor grid on the mesh network
5.18 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={8, 64} for the 2D virtual processor grid on the torus network
5.19 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={128, 4} for the 2D virtual processor grid
5.20 Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={128, 4} for the 2D virtual processor grid on the torus network
5.21 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={32, 32} for the 2D virtual processor grid
5.22 Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={32, 32} for the 2D virtual processor grid on the torus network
5.23 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={8, 128} for the 2D virtual processor grid
5.24 Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={8, 128} for the 2D virtual processor grid on the torus network
5.25 Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={256, 4} for the 2D virtual processor grid
5.26 Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={4, 256} for the 2D virtual processor grid on the torus network
A.1 Speedup of the 3D-FFT implementation using 1D-decomposition
A.2 Speedup of the 3D-FFT implementation using 2D-decomposition
ACKNOWLEDGEMENTS
I wish to thank Dr Joachim Hein for his excellent guidance, support, patience and
encouragement throughout the duration of this project.
Jon Bashor is greatly acknowledged for his proofreading assistance.
A special thank-you goes to Professor Dr Wolfgang E. Nagel for making it all possible.
I don’t want to miss the opportunity to thank the “HM-Team” for many enjoyable
hours we spent together and for making this time unforgettable.
And, I would like to thank my family for their unbelievable support and
understanding through the entire year I spent at the University of Edinburgh.
1. INTRODUCTION
The Fast Fourier Transforms (FFTs) of multi-dimensional data are of particular
importance in a variety of different scientific applications, but are often one of the
most computationally expensive components. Parallel FFTs are communication intensive and often prevent an application from scaling to a very large number of processors.
The BlueGene/L system architecture was designed to support efficient execution of
massively parallel message-passing programs [13]. The system consists of thousands
of compute nodes which operate at a moderate clock frequency of 700 MHz [3]. This
vast parallelism is characterised by lower power consumption compared to current
supercomputer systems.
A fundamental challenge of parallel numerical algorithms – such as the FFTs of
multi-dimensional data – is their design and implementation to utilise efficiently
thousands of nodes. Our starting point is the description of the design and
implementation of the parallel two-dimensional and three-dimensional FFT. For the
three-dimensional case we investigated two different implementations which are
presently widely discussed in the literature [5, 6]. The first implementation uses a
one-dimensional decomposition of the data and the second uses a two-dimensional
decomposition. An implementation that decomposes the data in only one dimension is limited in that it cannot use more processors than there are data elements along a single dimension. With a two-dimensional decomposition, on the other hand, up to N² processors can be utilised (where N is the size of the data along a single axis). We compare the performance of both implementations with respect to the
problem size and number of processors used.
Another important architectural characteristic of BlueGene/L is the organisation of
compute nodes as a three-dimensional torus. The main feature of the torus
communication network is that every node is connected to its six neighbour nodes
through bidirectional links.
To maintain application performance and scaling, the correct mapping of MPI tasks
onto the torus network is a critical factor. We explore the impact of a variety of node
mappings on the performance of the three-dimensional FFT computation using two-
dimensional decomposition of the data.
Before we consider the three-dimensional FFT, a number of investigations on the
two-dimensional FFT computation have been carried out. There are several reasons for exploring the two-dimensional case exhaustively. For instance, the two-dimensional computation constitutes half of the three-dimensional computation that uses two-dimensional decomposition, because the communication kernel of the parallel three-dimensional FFT with two-dimensional decomposition consists of two all-to-all communications. With the investigations carried out for the two-dimensional FFT, we can study in isolation the impact of node mappings on the performance of a single all-to-all communication, which helps to uncover potential performance issues for the three-dimensional case.
The rest of this paper is organised as follows. Next is an overview of the hardware
and software architecture of BlueGene/L. Chapter 3 contains mathematical
background information of the Fourier transforms. It is followed by a mathematical
description of the two-dimensional FFT implementation. The node mapping strategies for the two-dimensional case are briefly discussed in chapter 4. The description of the design and implementation of the three-dimensional FFT is broken down into two versions, one decomposing the data array in one dimension and the other in two dimensions. Both are covered in chapter 5. We continue
with our investigations of a number of node mappings onto the BlueGene/L’s torus
on up to 1,024 nodes for the three-dimensional FFT computation that uses two-
dimensional decomposition. In chapter 6, we then describe and discuss the
experimental results and draw our conclusions.
2. OVERVIEW OF THE BLUEGENE/L
ARCHITECTURE
2.1 Hardware Architecture
The BlueGene/L supercomputer is a massively parallel system developed by IBM in
partnership with Lawrence Livermore National Laboratory (LLNL) [5, 8, 11, 13].
This system-on-a-chip design that integrates embedded low-power processors, high-
performance network interfaces and embedded memory [13] results in extremely
high power and space efficiency [8]. The full details of the system architecture are
extensively described elsewhere [11, 14] and we provide a brief overview with the
focus on the features that are particularly relevant to our project.
The University of Edinburgh BlueGene/L machine, BlueSky, offers a total of 1024
compute chips in a single cabinet. Each chip has two processors (nodes), which means BlueSky offers a total of 2048 processors [3]. Operating at a moderate clock
frequency of 700 MHz, BlueSky delivers a theoretical peak computing power of
5.7 TFlops [3], when both processors in each chip are used. Each chip incorporates
two standard 32-bit embedded IBM PowerPC 440 processors with private L1
instruction and data cache, a small (2 KB) L2 cache and prefetch buffer, 4 MB of
embedded dynamic random access memory [13, 14] acting as a L3 cache, and 512
MB of main memory. The L2 and L3 cache as well as the main memory are shared
by the two compute nodes on a chip.
The dual-processor compute chip can operate in one of two modes [3, 13]. In coprocessor mode, which spans the entire memory of the chip, the first processor is used for computation and the second for communication. In virtual node mode, two single-threaded processes, each effectively using half of the chip memory, run on one compute node [13].
Each processor in a chip has a dual floating-point unit (FPU) – also known as
“Double Hummer” [3] – consisting of two 64-bit FPUs operating in parallel to
Figure 2.1: torus network with periodic boundary conditions
Figure 2.2: Node mappings on torus network along a line, diagonal, and volume diagonal
mainly support complex number arithmetic. For the efficient use of the double FPU,
16-byte alignment of the data is required [3]. To generate code that uses the dual FPU, the compiler has to know the alignment properties of the data [3]; how this is actually implemented is described elsewhere [3]. However, for FFT
implementations this feature can be utilised to save a significant number of
arithmetic operations, leading to improved performance [4].
The BlueGene/L architecture features five different networks (not all of which are
described here). For the FFT computation, the most important network is the three-
dimensional torus. The 512-node partition forms the smallest 8 × 8 × 8 torus. Each of
the 512 compute nodes is connected to its six neighbours through 154 MB/s/link
bidirectional channels [13] (see figure 2.1).
To maintain application performance and scaling, the correct mapping of MPI tasks
onto the torus network plays a crucial role. A performance analysis of a ping-pong
application has underlined that the communication times can be minimised when a
particular node mapping is used which takes the torus network features into account.
Figure 2.3: Performance measurements for ping-pong application sending / receiving messages of 100 integers between 2 nodes differently mapped on torus network along a line, diagonal, and volume diagonal
Figure 2.3 shows the results from sending and receiving messages (100 integers of 4 bytes each) from rank 0 to one of the 7 remaining nodes separately
mapped along a line (a), the diagonal (b) and volume diagonal (c). The reported
times are for 1000 full cycles. The node mapping of all three cases is illustrated in
figure 2.2. It shows that, in all three cases, communication is fastest between nearest neighbours and slows down as the nodes are located further apart within the network. The measurements for a variety of message sizes have been
added to Appendix C.
2.2 Software Architecture
Here we focus on the Message Passing Interface (MPI) implementation rather than
the system software. However, we briefly mention two pieces of system information relevant to this project. The C compiler on the University of Edinburgh's
BlueGene/L is IBM Visual Age C compiler version 7.0 and the driver version is
v1r2m1 (V1R2M1_020_2006-060110). For a detailed discussion of the system
software we refer to [11, 13, 14]. We briefly summarise the main features of the MPI
for BlueGene/L implementation – relevant to our project – which are extensively
discussed in [13].
The BlueGene/L supercomputer was designed to support efficient execution of
massively parallel message-passing programs. Part of this support is an optimised
implementation of the Message Passing Interface, which takes the hardware features
of BlueGene/L into account [13]. MPI for BlueGene/L is implemented on top of
MPICH2 library [15] from Argonne National Laboratory.
The MPI and MPICH2 libraries are used in both BlueGene/L modes of operation: the
coprocessor mode and virtual node mode. In coprocessor mode, to support the
concurrent operation of the two non-cache-coherent processors in a compute node,
the message layer allows the use of the second processor as a communication
coprocessor [13]. The message layer provides a non-L1-cached – and hence coherent
– area of the memory to coordinate the two processors [13]. In virtual node mode,
two separate processes run, one on each processor of a chip. Hence, some resources, such as memory and the torus network, are split evenly between the processors, while others, such as the L3 cache, are shared [13]. The two MPI tasks not only share the network, but also communicate
with each other. Therefore, the MPI for BlueGene/L implementation provides a
virtual torus device, served by a virtual packet layer [13].
On a machine such as BlueGene/L, the correct mapping of MPI tasks to the torus
network is a critical factor in maintaining application performance and scaling [13].
For that reason, the message layer allows arbitrary mapping of torus coordinates to
ranks. This mapping can be specified via an input file – the so-called mapfile –
listing the torus coordinates of each process in increasing rank order.
Within the torus network, the data packets are routed on an individual basis using
one of two routing strategies. The algorithm, in which all packets follow the same
path along the x, y, and z dimension (in this order), is called the deterministic routing
algorithm [13]. The second is a minimal adaptive routing algorithm, which allows
individual packets to make decisions about routing, resulting in potential out-of-order
delivery of packets [13]. This potential out-of-order delivery forces the MPI library
to reorder them in software. A packet reordering is expensive because it involves
memory copies and requires packets to carry additional information [13]. On the
other hand, deterministic routing leads to more network congestion, even on lightly
used networks. For our implementations, data packets are routed entirely in hardware
from the source to the destination node.
Most MPI implementations, including MPICH2, typically implement collective
communication in terms of point-to-point messages [12, 13]. On the BlueGene/L
platform, the default collective implementations of MPICH2 suffer from low
performance because they are written for a crossbar-type network, not for special
network topologies such as the BlueGene/L torus network [13]. For all-to-all communication, both the MPI_Alltoall and MPI_Alltoallv algorithms are optimised for the BlueGene/L architecture. The optimised algorithm uses the message layer
directly and optimises the injection of packets to achieve high network efficiency
[13]. For this investigation of the two-dimensional and three-dimensional FFT
computations, the MPI_Alltoall algorithm is used. For future studies, it would be useful to investigate FFT computations using MPI_Alltoallv.
3. FOURIER TRANSFORM

To better explain the two-dimensional and the two three-dimensional Fast Fourier Transform (FFT) implementations, covered in chapters 4.3 and 5.1, some background information about Fourier Transforms (FTs) is provided at the level relevant to this project. Fourier Transforms are of enormous importance for many applications in applied and engineering science. This mathematical tool is a linear transform which converts, for example, spatial information into information lying in the frequency domain and vice versa [26, 27]. All periodic signals may be represented by an infinite sum or integral of trigonometric sines and cosines, which are associated with the symmetrical and asymmetrical information, respectively [26, 27]. Alternatively to trigonometric functions, one can use exponentials to formulate Fourier Transforms. The connection between the two is via Euler's formula:

$$e^{i\theta} = \cos(\theta) + i\sin(\theta), \qquad \cos(\theta) = \frac{1}{2}\left(e^{i\theta} + e^{-i\theta}\right), \qquad \sin(\theta) = \frac{1}{2i}\left(e^{i\theta} - e^{-i\theta}\right) \tag{3.1}$$

3.1 Continuous Fourier Transform

If one considers the one-dimensional case, the FT converts a function f(x) of a single variable x in the spatial domain into a function F(u) of frequencies u in the frequency domain, in order to analyse the frequencies present in a sampled signal [17]. In general, one has two descriptions of the same physical process, each defined through a function. Consider a continuous function f(x) of a single variable. The Fourier Transform of that function is defined by:

$$F(u) = \int_{-\infty}^{+\infty} f(x)\, e^{-2\pi i x u}\, dx \tag{3.2}$$

Generally, the Fourier Transform F(u) will be a complex quantity, even if the original data is real. To regenerate the original function f(x) from its Fourier Transform (3.2), the inverse Fourier Transform (3.3) comes into play, which looks fairly similar, except that the exponential term has the opposite sign:
$$f(x) = \int_{-\infty}^{+\infty} F(u)\, e^{2\pi i x u}\, du \tag{3.3}$$

The two- and three-dimensional FT equations can be developed from equations (3.2) and (3.3) in a fairly straightforward way. The following equations (3.4) present the Fourier Transform and its inverse for the two-dimensional case:

$$F(u,v) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y)\, e^{-2\pi i (xu + yv)}\, dx\, dy, \qquad f(x,y) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} F(u,v)\, e^{2\pi i (xu + yv)}\, du\, dv \tag{3.4}$$

For the sake of completeness, and since the emphasis of this project is on the three-dimensional case, equation (3.5) shows the three-dimensional continuous Fourier Transform including its inverse:

$$F(u,v,w) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y,z)\, e^{-2\pi i (xu + yv + zw)}\, dx\, dy\, dz, \qquad f(x,y,z) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} F(u,v,w)\, e^{2\pi i (xu + yv + zw)}\, du\, dv\, dw \tag{3.5}$$

3.2 Discrete Fourier Transform

For computational calculations one often needs functions defined on discrete instead of continuous domains. In the most common situation, the function's values are obtained by sampling at evenly spaced intervals [17]. One has to approximate the integrals in (3.2) and (3.3) (taking the one-dimensional case as an example) by discrete sums. The discrete Fourier Transform and its inverse for the one-dimensional case of L samples at values of x from 0 to L-1 are of the form:

$$F(u) = \sum_{x=0}^{L-1} f(x)\, e^{-2\pi i \frac{ux}{L}}, \qquad f(x) = \frac{1}{L} \sum_{u=0}^{L-1} F(u)\, e^{2\pi i \frac{ux}{L}} \tag{3.6}$$

Again, the two-dimensional discrete Fourier Transform works in a similar way. For an L × M grid in the x and y directions, one gets the following equations:

$$F(u,v) = \sum_{y=0}^{M-1}\sum_{x=0}^{L-1} f(x,y)\, e^{-2\pi i \left(\frac{ux}{L} + \frac{vy}{M}\right)}, \qquad f(x,y) = \frac{1}{L \cdot M} \sum_{v=0}^{M-1}\sum_{u=0}^{L-1} F(u,v)\, e^{2\pi i \left(\frac{ux}{L} + \frac{vy}{M}\right)} \tag{3.7}$$
The three-dimensional discrete Fourier Transform for an L × M × N data grid in the x, y and z directions is shown in (3.8):

$$F(u,v,w) = \sum_{z=0}^{N-1}\sum_{y=0}^{M-1}\sum_{x=0}^{L-1} f(x,y,z)\, e^{-2\pi i \left(\frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N}\right)}, \qquad f(x,y,z) = \frac{1}{L \cdot M \cdot N} \sum_{w=0}^{N-1}\sum_{v=0}^{M-1}\sum_{u=0}^{L-1} F(u,v,w)\, e^{2\pi i \left(\frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N}\right)} \tag{3.8}$$
3.3 Fast Fourier Transform
The computational cost of the discrete Fourier Transform of N points follows from the fact that each of the N output points is computed from all N points of the original function [26, 27]. Mathematically, this is a matrix-vector multiplication requiring N² complex multiplications, so the discrete Fourier Transform is an O(N²) process.
In the mid-1960s, J. W. Cooley and J. W. Tukey published a discrete Fourier
Transform algorithm, known as Fast Fourier Transform (FFT), which computes the
discrete Fourier Transform in O(N log2 N) operations [17]. One of the clearest derivations of the FFT algorithm, known as the Danielson-Lanczos Lemma, shows that a discrete Fourier Transform of length N can be rewritten as the sum of two discrete Fourier Transforms of length N/2 [17]. One sum is formed from the even-numbered points and the other from the odd-numbered points. Both transforms are periodic with length N/2. For the proof of this derivation, the
reader is referred to the “Numerical Recipes” book [17]. In the “Numerical Recipes”
it is also recommended that one use FFTs with N as an integer power of two to
maintain O(N log2 N), although other cases can also be treated. With this restriction
on N, the Danielson-Lanczos Lemma can be applied until the data has been
subdivided all the way down to transforms of length one. To illustrate further, the next steps would be to recursively subdivide the two sums of even-numbered and odd-numbered points of length N/2 into respective sub-sums of even-even-numbered, even-odd-numbered, odd-even-numbered and odd-odd-numbered points, each of length N/4. So, for every pattern of log2 N even's (e) and odd's (o), there is a one-point transform that is just one of the input numbers $f_n$ [17], e.g.

$$F^{eoeeoeo \cdots oee}(u) = f_n \quad \text{for some } n \tag{3.9}$$
The next necessary part of the Fast Fourier Transform is the so-called bit reversal reordering, which matches the even-odd patterns of equation (3.9) to the values of n [17]. The mathematically exact treatment of this method is beyond the scope of this
project, more details can be found in the “Numerical Recipes” book [17]. However,
we summarise the main steps. First, one has to reverse the pattern of evens and odds.
Secondly, one has to write even and odd as binary notation, which means even = 0
and odd = 1. Once these two steps are done, one has the value of n in binary notation.
The points as given are the one-point transforms, which is simply the operation that
copies the one input number into its one output slot. Now the Danielson-Lanczos
Lemma can be applied, which combines pairs of one-point transforms to get two-
point transforms, and so on, until the first and second halves of the entire data set are
combined into the final transform [17]. Each combination is an order N process and
there are log2 N combinations. So, in summary the entire algorithm is of order O(N
log2 N) [17].
3. 4 Fastest Fourier Transform in the West library
The applications used here call one-dimensional single-processor FFT kernel
routines. The portable open-source Fastest Fourier Transform in the West (FFTW)
2.1.5 library has been used. FFTW is a state-of-the-art C subroutine library for
computing the discrete Fourier Transform in one or more dimensions, for both real
and complex data, and of arbitrary input size [28]. FFTW uses empirical approaches
to automatically optimise FFT computation on a wide range of architectures [8]. The
current version installed on the University of Edinburgh’s BlueGene/L is FFTW
2.1.5, available in two releases. One is a version of FFTW-GEL from the Vienna
University of Technology [10], which is based on FFTW 2.1.5 and optimised for the
double floating point unit specially designed for each processing core on
BlueGene/L. The other is the standard FFTW 2.1.5 library [3]. Both versions have
been tried, and the one yielding the best performance has been used for all further
implementations and investigations.
FFTW implements a two-step algorithm to calculate a transform [28]. At first, a plan
is computed which serves as input for the second step. In order to create the plan, all
data necessary for the Fourier Transform computation is needed. During the plan
computation, several FFTs are run and measured at run time in order to find the best
way to compute the requested transform of a given size [29, 19]. That makes plan
computation more expensive than the actual transform. However, once a plan is
created, it can be reused many times for a fixed problem size, which overall
speeds up FFTW significantly [28]. In the second step, the created plan is used to
compute the actual transform.
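The two-step plan/execute pattern, as exposed by the FFTW 2.x API, can be sketched as follows. This is an illustrative fragment (not runnable without the FFTW 2 library); the function name and the in-place flag choice are example assumptions:

```c
#include <fftw.h>   /* FFTW 2.x header (version used in this project) */

/* Illustrative sketch: transform 'howmany' contiguous 1D arrays of
 * length n stored back-to-back in 'data'. */
void transform_many(fftw_complex *data, int n, int howmany)
{
    /* Step 1: create a plan once. FFTW_MEASURE runs and times several
     * candidate FFTs to pick the fastest strategy for this size, which
     * makes plan creation expensive but the plan reusable. */
    fftw_plan plan = fftw_create_plan(n, FFTW_FORWARD,
                                      FFTW_MEASURE | FFTW_IN_PLACE);

    /* Step 2: reuse the plan for many transforms of the same size.
     * Here: 'howmany' transforms, stride 1, distance n between them. */
    fftw(plan, howmany, data, 1, n, NULL, 1, n);

    fftw_destroy_plan(plan);
}
```

In a long-running application the plan would typically be created once at start-up and the expensive measurement cost amortised over many transform calls.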
4. TWO-DIMENSIONAL FAST FOURIER
TRANSFORMS
4. 1 Parallel FFTs in Two Dimensions
Before the three-dimensional Fast Fourier Transform (FFT) was implemented
and different mapping strategies of MPI tasks onto the physical processor grid
were investigated, fairly extensive investigations of taskfarms of two-dimensional
FFTs were carried out. There are at least three principal reasons for this. First,
two-dimensional computation is half of the three-dimensional computation. This is
because the communication kernel for the parallel two-dimensional FFT computation
[21] is one all-to-all communication between the two one-dimensional FFT
calculations. For the parallel three-dimensional case, two implementations have been
investigated – one where the three-dimensional complex data array is decomposed in
one dimension and for the other version in two dimensions. For the first, only one
all-to-all communication is needed. For the second implementation, two all-to-all
communications are necessary. We study separately the impact of several node
mappings on the performance of a single all-to-all communication in the two-
dimensional case in order to uncover possible performance issues for the three-dimensional
case.
The second principal reason for extensively investigating the two-dimensional case is
the taskfarm, to which we return in the following section. The third reason is
the partial Fast Fourier Transform, which is also related to taskfarms.
4. 2 Taskfarm of parallel FFTs
For the two-dimensional case, taskfarms are of considerable importance since the
BlueGene/L is characterised by constraints on the partition sizes. More precisely, if
an application yields best performance results on 256 nodes, one has to request the
512-node partition. However, to enforce the use of the entire partition requested, the
partition has been filled up as a taskfarm by simultaneously running the same
program with the investigated mapping strategies several times. It also verifies the
reproducibility of the execution times.
Partial Fast Fourier Transforms are also important for many scientific applications.
Consider the function f(x, y, z) and the Fast Fourier Transform computation over only
two dimensions, e.g. y and z. The partial FFT equation and its inverse are then:

    F(x, v, w) = Σ_{z=0}^{N-1} Σ_{y=0}^{M-1} f(x, y, z) · e^{-2πi(vy/M + wz/N)}
                                                                                          (4.1)
    f(x, y, z) = 1/(M·N) · Σ_{w=0}^{N-1} Σ_{v=0}^{M-1} F(x, v, w) · e^{2πi(vy/M + wz/N)}

For partial FFTs one can perform taskfarm computations in the sense of
simultaneous runs of the same application, but for different values of x.
4. 3 Algorithm Details
Consider A_{x,y} as a two-dimensional array of L × M complex numbers with:

    A_{x,y} ∈ ℂ,   ∀x: 0 ≤ x < L,   ∀y: 0 ≤ y < M

The two-dimensional FFT is computed by the equation described in (3.7):

    B_{u,v} = Σ_{x=0}^{L-1} Σ_{y=0}^{M-1} A_{x,y} · e^{-2πi(ux/L + vy/M)}          (4.2)

In other words, the two-dimensional FFT is an array B_{u,v} of L × M complex
numbers. This computation is performed in two separate stages: first, the one-
dimensional FFT is computed along the y dimension and, secondly, along the x
dimension. Therefore, (4.2) can be written as:

    B_{u,v} = Σ_{x=0}^{L-1} ( Σ_{y=0}^{M-1} A_{x,y} · e^{-2πi vy/M} ) · e^{-2πi ux/L}          (4.3)

where C_{x,v} = Σ_{y=0}^{M-1} A_{x,y} · e^{-2πi vy/M} is the 1D-FFT of A_{x,:} for all x
values (the 1st one-dimensional computation, along the y dimension), and
B_{u,v} = Σ_{x=0}^{L-1} C_{x,v} · e^{-2πi ux/L} is the 1D-FFT of C_{:,v} for all v values
(the 2nd one-dimensional computation, along the x dimension).
Figure 4.1: Computational steps of the two-dimensional FFT implementation
Figure 4.1 illustrates the described implementation of the two-dimensional FFT of an
array of size L × M – where a data size equal in each dimension, i.e. L = M, has
been used. More precisely, A(0 : L_x − 1, 0 : M − 1) is an L_x × M array of complex
numbers distributed onto P nodes, so each node stores a section of size L_x × M
(L_x = L/P) of the data array A in its local memory. At first (a), L_x independent one-
dimensional FFTs of size M along the y dimension are calculated. Secondly (b),
M_y independent one-dimensional FFTs of size L along the x dimension are calculated
(M_y = M/P).
For calculating the independent one-dimensional FFTs, the FFTW library function
fftw() has been used. The justification for using FFTW has been covered in chapter
3.4. Before starting with the actual investigations, two different input parameters for
the FFTW library function fftw() have been compared and the one yielding the best
performance for the entire two-dimensional forward FFT computation has been used.
In this context, re-sorting strategies of the data play a crucial role.
In general, the input parameters for the fftw() library function are fftw( plan,
howmany, in_array, in_stride, in_distance, out_array, out_stride, out_distance ). We
compare results using different values for the stride and distance parameters
((stride=1 AND distance≠1) OR (stride≠1 AND distance=1)). The same values used
for the input parameters in_stride and in_distance were used for the output
parameters out_stride and out_distance. For both versions, we considered the
advantages and disadvantages regarding plan creation, re-sorting the data and the
overall times for the entire forward FFT computation.
Figure 4.2: FFTW library functions and resort strategies for the two-dimensional FFT computation
Consider the L × M two-dimensional data array shown in figure 4.2. The numbering
of the first two columns of the data grid provides a better understanding of the two
different re-sort strategies which are partially sequential. For the first
implementation, shown in figure 4.2.1 (a) and (b), for both Fast Fourier Transform
calculations along the y and x dimensions the fftw() library function with stride=1 is
used. This means that after the first FFT computation two re-sort methods are
necessary, one before and one after the all-to-all communication, as can be seen in
figure 4.2.1.b. The first re-sort method before the all-to-all communication sorts the
data along rows first. This means it does not access data in the order it is stored in memory. If
data is accessed in non-sequential order, cache misses will occur at every single step,
since the data in a cache block is evicted before it is used [30]. The second re-sort
method after the all-to-all communication becomes necessary to get the data in the
correct order for using the fftw() library function with stride=1 to compute the
second one-dimensional FFT along the x dimension. It is interesting to see whether
not re-sorting and calling strided fftw() is more efficient.
For this reason, we have investigated a second implementation using fftw() with
stride=1 only for the first FFT computation along the y dimension, and fftw() with
stride≠1 for the second FFT computation along the x dimension. It has the advantage
that only one re-sort method before the all-to-all communication is needed which is
still partially sequential. A further, and not inconsiderable, advantage is that this re-sort
method does not experience cache misses at every single step, but only at every (M/P)-th
step (see figures 4.2.2.a and b). Since this becomes beneficial for large problem
sizes, tables 4.1 and 4.2 summarise times for a data problem size of 16384² using 128
and 512 nodes. All presented times have been measured on a hot L3 cache, e.g. by
transforming the same data multiple times. It also verifies reproducibility of the times
shown in table 4.1 and 4.2. After discarding the very first run, the fastest of the
remaining runs has been chosen for all future times presented in this paper. To get an
idea where the differences in the times for the entire forward FFT come from, all
main steps have been measured separately (creation of plans for the FFT
computation, the fftw() calls, the re-sort methods, and the all-to-all communication). The timing
of the entire forward two-dimensional FFT computation is encapsulated in a
MPI_Barrier() pair. It starts after getting data arrays ready for the first FFT
computation and ends directly after the final fftw() call.
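The timing harness can be sketched as follows. This is an illustrative fragment assuming an MPI environment; `forward_2dfft` is a hypothetical name standing in for the sequence of fftw() calls, re-sorts and the all-to-all:

```c
#include <mpi.h>

extern void forward_2dfft(void);   /* hypothetical: fftw() calls,
                                      re-sorts, all-to-all exchange */

/* Time one forward 2D-FFT, synchronising all tasks before and after
 * so the measured interval covers the slowest task. */
double time_forward_fft(void)
{
    MPI_Barrier(MPI_COMM_WORLD);   /* all tasks ready to start */
    double t0 = MPI_Wtime();

    forward_2dfft();

    MPI_Barrier(MPI_COMM_WORLD);   /* wait until the last task finishes */
    return MPI_Wtime() - t0;
}
```

Without the barriers, tasks that finish their local work early would report shorter, misleading times.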
TIME                         STRIDE=1   STRIDE≠1 WITHOUT NEW PLAN   STRIDE≠1 WITH NEW PLAN
for plan creations           15.491     15.436                      31.293
for FFTW()s                  0.509      1.416                       1.255
for re-sort methods          0.414      0.050                       0.051
for all-to-all comm          0.232      0.232                       0.226
for entire forward 2D-FFT    1.157      1.698                       1.534
Table 4.1: Times measured in seconds for a problem size of 16384² using 128 nodes
TIME                         STRIDE=1   STRIDE≠1 WITHOUT NEW PLAN   STRIDE≠1 WITH NEW PLAN
for plan creations           15.483     15.464                      30.989
for FFTW()s                  0.128      0.331                       0.293
for re-sort methods          0.075      0.017                       0.017
for all-to-all comm          0.057      0.057                       0.058
for entire forward 2D-FFT    0.265      0.410                       0.379
Table 4.2: Times measured in seconds for a problem size of 16384² using 512 nodes
Times for three different implementations are shown in table 4.1 and 4.2. The first
case with the table header “stride=1” is illustrated in figure 4.2.1.a and b. The second
and third case with table header “stride≠1 without new plan” and “stride≠1 with new
plan” is illustrated in figure 4.2.2.a and b. The reason for a second new plan
computation is the use of two different FFTW routines, one with stride=1 and the
other with stride≠1. Since the FFTW plan creation is expensive, it has also been
investigated whether there is a substantial advantage in providing additional plans
for the forward and backward FFTs, evaluated with the fftw() library function with
stride≠1.
For both runs using 128 and 512 nodes respectively, the total amount of time spent in
the fftw() library function is a considerable fraction of the total time. As expected, the
reorganisation of the data which avoids cache misses in every single step becomes
extremely cheap if one uses fftw() with stride≠1 for the second FFT computation
along the x dimension. However, the fftw() call with stride≠1 is much more
expensive, so that even the overall time for the entire two-dimensional FFT
computation is affected. The advantage won by the cheap reorganisation of the data is
lost again in the strided fftw() calls. The results presented in both tables also show that the
additional plan creation for the strided fftw() calls has a beneficial impact on
performance; it would be worth considering a second plan creation, since it is only
done once and the plan can be reused many times.
However, the overall performance of the strided fftw() is poor and for all further
implementations and investigations, the fftw() library functions with stride=1 and
expensive re-sort strategies have been used – hence no second plan computation is
needed.
4. 4 Verification of Results
Before any investigations were made, we ensured that the two-dimensional FFT
computation is correct. The test function, which has also been implemented for the
three-dimensional FFT computation, is described only once. To provide a
complete mathematical description, the test function is specified for the more
complex three-dimensional case in chapter 5.2.
4. 5 Performance Analysis
Various runs of the two-dimensional FFT computation have been performed on
BlueGene/L. Two parameters have been varied – the number of nodes used and the
size of the data being transformed. For the implementation, only one processor of the
chip has been used to run one MPI task (coprocessor mode). A more detailed
description of the implementation has been covered earlier in chapter 4.3.
For the parallel two-dimensional Fast Fourier Transform computation, the effect of
the three-dimensional mesh and torus communication networks on BlueGene/L has
been investigated. The mesh and torus are the same network with the exception that
the torus has periodic boundary conditions in all three dimensions [3]. More
precisely, for the 512-node partition, each of the 512 compute nodes is connected to
its six neighbours through bidirectional links [13]. The mesh is characterised by open
boundary conditions in all three dimensions. The network provides fastest
communication between processors close to each other [3].
When running parallel codes on BlueGene/L using the mpirun command, the MPI
tasks are mapped to the physical processor grid of the machine [3]. The performance
measurements of the ping-pong application (see chapter 2.1) have shown that the
communication times can be minimised when a particular node mapping is used
which takes the network characteristics into account. To benefit from these features,
for the two-dimensional FFT computation, where all-to-all communications become
extremely expensive for very large node counts, we have explored a variety of MPI
task mappings on the physical processor grid on BlueGene/L. This optimisation has
to be carried out for each partition size on BlueGene/L since the shape of the
partitions changes with size [3, 8]. The partitions on BlueGene/L have three
dimensions and 32, 128, 512, and 1024 are the total numbers of chips available in the
partitions. The possibility to choose between mesh and torus is only for the 512- and
1024-node partitions. For the smaller partitions only mesh is offered. The default
node mapping on the machine is done by filling up a three-dimensional array first in
the x direction, then in the y, and finally in the z direction. The mpitrace tool reports that, for
instance, the 128-node partition consists of an 8 × 4 × 4 block of nodes, whereas a
512-node partition is an 8 × 8 × 8 block [3]. Consequently, the optimal mapping for
Figure 4.3: Comparison of mesh vs torus network for a variety of problem sizes
one partition size can differ substantially from the optimal mapping for another
partition [8]. This process is facilitated by the capability to specify a node mapping
at run time using the -mapfile option to the mpirun command. Unless
specifically noted, all of the following performance results are from the coprocessor
mode. The performance measurements achieved from the customised node mappings
have been compared with the results obtained from the default node mapping on
BlueGene/L.
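A mapfile is a plain-text file with one line per MPI task, giving the torus coordinates that task is placed on. A hypothetical fragment for the first eight tasks of a custom mapping (columns x y z t, where t is the processor within a chip) might look like:

```
0 0 0 0
1 0 0 0
0 1 0 0
1 1 0 0
0 0 1 0
1 0 1 0
0 1 1 0
1 1 1 0
```

This example places the first 8 MPI tasks on a 2 × 2 × 2 cube of neighbouring nodes, with only the first processor of each chip used (coprocessor mode).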
4. 5. 1 mesh versus torus network
Before particular node mappings were considered, the performance impact of mesh
and torus on the two-dimensional FFT computation has been investigated. Figure
4.3 shows the performance measurements for different problem sizes on the 512-
node partition. It demonstrates clearly that the three-dimensional torus is highly
efficient and even becomes more beneficial as the problem size grows. For the all-to-
all communication and a fairly large problem size, the torus network is about 20%
faster than mesh.
As the problem size is increased, the length of the messages which are going to be
exchanged gets longer. In fact, the bandwidth utilisation becomes higher and is a
more critical factor for performance differences. Within the torus network, the data
packets are routed on an individual basis, using one of two routing strategies. The
algorithm, in which all packets follow the same path along the x, y, z dimensions (in
this order) is called deterministic routing algorithm [13]. The second is a minimal
Figure 4.4: Comparison of mesh vs torus network
adaptive routing algorithm, which allows individual packets to make routing
decisions [13]. The BlueGene/L torus network features have been extensively
described elsewhere and will not be discussed here in more detail.
Figure 4.4 illustrates, in an extremely simplified way, how point-to-point packets
possibly travel through the mesh and torus networks. This simplified example, using
4 nodes for an all-to-all communication, shows clearly the impact of the additional
link for the torus network on the bandwidth utilisation. On the other hand, we
assume that on the mesh network deterministic routing is used, which leads to more
network congestion and increased messages latency, even on lightly used networks
[13]. This effect becomes even more pronounced as more packets are sent through
the network.
4. 5. 2 Virtual Node Mode on BlueGene/L
Another important architectural feature of BlueGene/L is its dual-processor compute
chip which can operate in one of two modes [3, 13]. The coprocessor mode spans the
entire memory of the chip and can use both processors by running one thread on each
[13]. In virtual node mode, two single-threaded processes which share the chip’s
memory, run on one compute node [13]. Each process is bound to one processor.
Table 4.3 compares the execution times for the two-dimensional FFT computation
for different problem sizes using coprocessor mode and virtual node mode.
PROBLEM    512²                  1024²                 2048²
NODES      CO mode   VN mode     CO mode   VN mode     CO mode   VN mode
1          0.182     0.092       0.793     0.418       3.596     1.887
2          0.091     0.048       0.417     0.206       1.875     0.992
4          0.048     0.024       0.205     0.099       0.985     0.493
8          0.024     0.012       0.099     0.050       0.489     0.236
16         0.011     0.006       0.049     0.024       0.232     0.114
32         0.006     0.003       0.024     0.014       0.113     0.063
Table 4.3: Execution times in seconds for the 2D-FFT computation for different problem sizes using
coprocessor mode (CO) and virtual node mode (VN) on BlueGene/L
The benchmarks show that the virtual node mode has a positive impact on our
application performance. The execution times are almost halved when
dedicating the same number of chips but running one MPI task on each of the two
processors of a chip. When virtual node mode is used, the two MPI tasks
running on the two processors of one chip also share the network. More precisely,
the bandwidth utilisation becomes higher than if only one node is dedicated to one
MPI task. This may be the reason that the execution times using virtual node mode
are slightly higher than half of the execution times achieved with coprocessor mode.
More research on the virtual node mode would constitute an interesting future
project. However, due to limited time, for all the further investigations performed in
this paper the coprocessor mode has been used.
4. 5. 3 Double FPU on BlueGene/L
As mentioned in chapter 2 the BlueGene/L processing cores have a specially
designed double floating point unit – also known as “Double Hummer”[1] – which
mainly provides support for complex arithmetic [4]. For FFT implementations this
feature can be utilised to save a significant number of arithmetic operations, leading
to improved performance [4]. Therefore, for efficient exploitation of the double
FPUs for our FFT computations, 16-byte alignment [3, 4, 13] for the data arrays has
been declared to the compiler which allows the compiler to issue “Double Hummer”
instructions. Additionally, we used the optimised FFTW2 library available on
BlueGene/L, which also benefits from the “Double Hummer” feature. It is a version
of FFTW-GEL from the Vienna University of Technology [4] which is based on
FFTW 2.1.5 and built by IBM. Figure 4.5 shows the superior performance impact of
both, declaring the data alignment for the local arrays and utilising the optimised
Figure 4.5: Times of the forward 2D-FFT for a problem size of 2048²
FFTW2 library routines. For instance, the application using both optimisations is
about 40% faster than the version using the standard FFTW 2.1.5. For this reason,
this optimised version is used for all future investigations performed on BlueGene/L.
4. 5. 4 MPI task mapping strategies
Two main node mapping patterns have been investigated – contiguous and
discontiguous blocks. To better explain the reasons for the investigations of the two
main patterns, we jump forward a bit to chapter 5 where the three-dimensional FFT
computation is explored. For the three-dimensional FFT computation where the data
array is decomposed in two dimensions, the MPI tasks have been organised in a two-
dimensional processor grid using the MPI Cartesian grid topology [22] construct.
More precisely, for a subdivision dims = {e, f} (where e × f = P) of the two-
dimensional virtual processor grid, we have f subgroups of nodes each consisting of e
nodes. We have two all-to-all communications, the first within each subgroup of
nodes and the second between the subgroups of nodes. This means that if the MPI
tasks for the first all-to-all communication are mapped onto nodes which are as close
as possible to each other, then the mapping for the second communication between
the subgroups would be fragmentary. Analysing both node mapping patterns
separately is supposed to clarify performance results achieved for the three-
dimensional FFT computation which uses two-dimensional decomposition of the
Figure 4.6: Customised versus default mapping on 32-node partition
Figure 4.7: Performance impact of customised versus default mapping on 32-node partition
data. The following figures illustrate how MPI tasks are mapped onto the processor
grid for a single run. However, for all investigations, the entire requested partition is
filled using the same mapping pattern.
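The two-dimensional processor grid and its two communication subgroups can be set up with the MPI Cartesian topology construct roughly as follows. This is an illustrative fragment assuming an MPI environment; letting MPI_Dims_create choose the e × f factorisation is an example choice:

```c
#include <mpi.h>

/* Organise P tasks as an e x f Cartesian grid and split it into the
 * two communicators used for the two all-to-all exchanges. */
void make_grid(MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[2] = {0, 0};                 /* let MPI choose e x f = P */
    MPI_Dims_create(nprocs, 2, dims);

    int periods[2] = {0, 0};
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    /* 1st all-to-all: within each subgroup of e nodes (keep dim 0) */
    int keep_rows[2] = {1, 0};
    MPI_Cart_sub(grid, keep_rows, row_comm);

    /* 2nd all-to-all: between the subgroups (keep dim 1) */
    int keep_cols[2] = {0, 1};
    MPI_Cart_sub(grid, keep_cols, col_comm);
}
```

Passing reorder = 1 to MPI_Cart_create permits the MPI library to renumber ranks to suit the physical topology; the explicit mapfiles studied here override this placement.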
4. 5. 4 . 1 Mappings on the 32-node partition
Figure 4.6 shows the customised and default node mapping used on the 32-node
partition, the smallest partition available on BlueGene/L. For the first example, 4
processors are used, which means the two-dimensional FFT application has been run
8 times, simultaneously, within the 32-node partition. The timing of the entire
forward FFT computation is encapsulated in a MPI_Barrier(MPI_COMM_WORLD)
pair.
To ensure that the default mapping is really what we assume, a mapfile which
represents the supposed default mapping has been provided and used. We compared
the results from two runs, one using the mapfile and the second run without using a
mapfile. Both runs yield the same results (within a tolerance of a few
microseconds) for a variety of problem sizes. Figure 4.7 presents the
performance variations of customised versus default node mapping. Independent of
the size of the data array being transformed, the total time spent for communication
is about 10% less if the default mapping is used. To explain this, we have considered
both latency and bandwidth.
Figure 4.8: Two customised mappings versus default mapping on 128-node partition
The fragmentary mapping increases message latency considerably, which leads to
poor performance. With increasing problem size, the effect of bandwidth utilisation
dominates since data packets which are going to be exchanged are bigger. The
overall performance impact on the entire two-dimensional FFT computation becomes
smaller as the problem size grows since the computation dominates over
communication.
4. 5. 4 . 2 Mappings on the 128-node partition
The potential to explore several different node mappings on the 32-node partition is
limited due to its size. Therefore, ongoing investigations have been performed on the
128-node partition. Two choices of node mappings using 8 processors for the 128-
node partition have been studied and illustrated in figure 4.8. Again, to fill up the
partition ensuring all 128 chips are in use, the application runs 16 times at once using
8 processors. In case (a), for the customised mapping, non-contiguous blocks spanning all
three dimensions, with a total of 8 processors, are used. Case (b) uses an 8-node
cube shape. The performance of both customised mappings is compared with the
default case, which places the nodes contiguously in a line.
Figure 4.9 presents the performance variations of the discontiguous and contiguous
node mapping, normalised to the default mapping, i.e. the 100% line (reference
line) represents the times achieved using the default mapping. The
cube mapping (b) has a big positive impact on performance due to lower latency for
communication between nearest-neighbour nodes.
Figure 4.9: Performance impact of customised versus default mapping on 128-node partition
However, the trend drops off as the data array being transformed grows. We assume
this is due to network congestion since bandwidth utilisation becomes higher because
more data packets are delivered. More precisely, the communication between nodes
in the nearest neighbourhood yields excellent performance gain due to decreased
message latency, but shows a turn-over as the problem size grows because of higher
bandwidth utilisation leading to network congestion. In summary, there is always a
trade-off between latency and bandwidth.
The fragmentary node mapping (a) is characterised by higher latency for short
messages. But nevertheless, the path messages have to travel between nodes furthest
away from each other is still shorter than for the communication pattern in line. This
becomes more profitable as the problem size grows since the communication is more
affected by network congestion for the default mapping. The opposite occurs for the
next mapping on the 128-node partition, shown in figure 4.10.
The application is simultaneously run 8 times, each run with 16 processors. Here the
discontiguous mapping of nodes has the same extent along the x dimension as the
Figure 4.12: Two customised mappings versus default mapping on 512-node partition
Figure 4.11: Performance impact of customised versus default mapping on 128-node partition
default mapping. Therefore, the fragmentary mapping shows poor performance
because of increased message latency compared to the communication in line.
This becomes less and less of an issue as problem size increases since the default
mapping experiences higher network congestion than the distributed mapping does.
Figure 4.11 displays the performance measurements for communication and the
impact on overall performance for the entire two-dimensional FFT computation.
4. 5. 4 . 3 Mappings on the 512-node partition
Further investigations have been carried out on the 512-node partition for a variety of
larger problems.
Figure 4.14 (*): Performance impact of customised versus default mapping on 512-node partition (torus)
Figure 4.13: Performance impact of customised versus default mapping on 512-node partition (mesh)
Figure 4.12 shows the discontiguous (a) and contiguous (b) customised node
mapping compared with default mapping. The application has been run 8 times, at
the same time, within the 512-node partition. The 512-node partition is the smallest
partition where all 512 nodes are connected to their six neighbours through
bidirectional links [13]. Hence, all the mapping patterns have been investigated for
the mesh and torus network.
The results for the two customised mappings versus default mapping on the mesh
network are shown in figures 4.13. Since the same fragmentary and contiguous
patterns have been explored on the 128-node partition and are simply extended to the
512-node partition, the outputs obtained on the mesh are straightforward and will not
be repeated here. On the other hand, exploring the same node mapping patterns on
the torus network yields entirely different results which are presented in figure 4.14.
(*) Unfortunately, we don’t have measurement results for the problem sizes 256², 512², 1024²
We have discovered with the investigation discussed in chapter 4.5.1 that the torus
network has a remarkably profitable impact on the performance because the
bandwidth utilisation is balanced as equally as possible over the network to avoid
congestion. For the contiguous node mapping (b), we assume that the additional links
which account for the torus are not in use. On the other hand, for the default node
mapping, the torus links can be utilised in two dimensions. This fact leads to
significant worsening of the performance achieved with customised contiguous
mapping. Clearly, the performance drops further as the problem size grows, since
higher bandwidth utilisation is responsible for congestion within the cube-shaped
node block.
The discontiguous node mapping (a) can take advantage of the torus links in all three
dimensions. However, the poor performance of the fragmentary mapping is
characterised by increased message latency. A marginal improvement of the
customised mapping over the default mapping can be experienced for large problem
sizes. So, we assume that the fragmentary mapping becomes beneficial if messages
are very long, which leads to network congestion for the default mapping.
So far, a variety of customised versus default mappings of MPI tasks have been
explored for mesh and torus on the 512-node partition. Our interest lies in finding the
mapping pattern yielding the best performance results. Table 4.4 summarises the
results from all the investigated node mappings for mesh and torus on the 512-node
partition on BlueGene/L.
NETWORK   CONTIGUOUS NODE MAPPING   DISCONTIGUOUS NODE MAPPING   DEFAULT NODE MAPPING
mesh      x                                                      x
torus                               x                            x
Table 4.4: Summary of the investigated node mappings for problem sizes between 2048² and 16384²
Table 4.5 shows the communication times and the amount of time required for the
entire two-dimensional FFT computation for the mappings yielding best performance
results for mesh and torus. It points out that the discontiguous node mapping on the torus
leads to the best performance results, especially for large problem sizes, because of the
beneficial feature of the torus network: balancing bandwidth utilisation as equally as
possible over the entire network.

Figure 4.15: Customised mappings versus default mapping on 512-node partition
PROBLEM   COMM COSTS (mesh)   2D-FFT COSTS (mesh)   COMM COSTS (torus)   2D-FFT COSTS (torus)
256²      0.00022             0.00038               -                    -
512²      0.00061             0.0014                -                    -
1024²     0.0022              0.0055                -                    -
2048²     0.0085              0.024                 0.007                0.023
4096²     0.035               0.149                 0.027                0.142
8192²     0.139               0.792                 0.110                0.754
16384²    0.557               3.421                 0.444                3.295
Table 4.5: Communication and 2D-FFT computation costs (seconds) for different problem sizes with
the mapping yielding best results for mesh and torus
As a final investigation before we consider the more complex three-dimensional FFT
computation, we double the 64-node cube and compare an 8 × 4 × 4 node block with
blocks containing two planes of the 512-node partition. Figure 4.15 illustrates the two
node mappings being compared. Here, only the torus has been investigated, since the
performance measurements presented above have shown that the mesh is not
particularly relevant for the 512-node partition in our case.
The results presented in figure 4.16 strengthen the previous outcome of the node
mapping comparison shown in figure 4.12 (b). Here, the customised mapping can
utilise the torus links in one dimension; we assume that the additional links in the
other two dimensions are not in use. Again, the default node mapping can utilise the
links in two dimensions. Since the torus connectivity comes into play for at least one
dimension in the customised case, the
performance difference between customised and default is reduced: for the mapping
comparison illustrated in figure 4.12 (b), the default mapping led to performance
improvements of 15%, 17%, 18% and 19%, whereas now it is about 7%, 8%, 12%
and 13% for the problem sizes 2048², 4096², 8192² and 16384², respectively.
However, the customised mapping still leads to poorer performance, most likely due
to high bandwidth utilisation that causes network congestion. That is also the reason
why the performance of the customised node mapping falls off as the problem size
grows.

Figure 4.16: Performance impact of customised versus default mapping on 512-node partition (torus)
5. THREE-DIMENSIONAL FAST FOURIER TRANSFORMS

5. 1 Parallelisation

As with the two-dimensional case, and to better explain the three-dimensional Fast
Fourier Transform (FFT) computation, a mathematical description is provided first.

Consider A_{x,y,z} a three-dimensional array of L × M × N complex numbers with

A_{x,y,z} \in \mathbb{C} \quad \forall\; 0 \le x < L,\; 0 \le y < M,\; 0 \le z < N.

The three-dimensional FFT is computed using the equation described in (3.8):

B_{u,v,w} = \sum_{x=0}^{L-1} \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} A_{x,y,z} \cdot e^{-2\pi i \left( \frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N} \right)}    (5.1)

In other words, the three-dimensional FFT is an array B_{u,v,w} of L × M × N complex
numbers. This computation is performed in three single stages. First, the one-
dimensional FFT along the y dimension is computed for all (x, z) pairs, then along
the z dimension for all (x, y) pairs, and finally along the x dimension for all (y, z)
pairs. Therefore, (5.1) can be written as:

B_{u,v,w} = \sum_{x=0}^{L-1} e^{-2\pi i \frac{ux}{L}} \sum_{z=0}^{N-1} e^{-2\pi i \frac{wz}{N}} \sum_{y=0}^{M-1} A_{x,y,z} \cdot e^{-2\pi i \frac{vy}{M}}    (5.2)

where the innermost sum is the 1st one-dimensional computation (along the y
dimension), the middle sum the 2nd (along the z dimension) and the outermost sum
the 3rd (along the x dimension):

C_{x,v,z} = \sum_{y=0}^{M-1} A_{x,y,z} \cdot e^{-2\pi i \frac{vy}{M}}   (C is the 1D-FFT of A_{x,:,z} for all (x, z) pairs)

D_{x,v,w} = \sum_{z=0}^{N-1} C_{x,v,z} \cdot e^{-2\pi i \frac{wz}{N}}   (D is the 1D-FFT of C_{x,v,:} for all (x, v) pairs)

B_{u,v,w} = \sum_{x=0}^{L-1} D_{x,v,w} \cdot e^{-2\pi i \frac{ux}{L}}   (B is the 1D-FFT of D_{:,v,w} for all (v, w) pairs)
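The factorisation in (5.2) can be checked numerically: three passes of one-dimensional FFTs, one per dimension, reproduce the full three-dimensional transform. A minimal sketch using NumPy as a stand-in for the one-dimensional FFTW kernel routine used in the actual implementation:

```python
import numpy as np

# Small sizes; the factorisation is size-independent.
L, M, N = 4, 8, 16
rng = np.random.default_rng(0)
A = rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))

C = np.fft.fft(A, axis=1)   # 1st: FFTs of size M along y, all (x, z) pairs
D = np.fft.fft(C, axis=2)   # 2nd: FFTs of size N along z, all (x, v) pairs
B = np.fft.fft(D, axis=0)   # 3rd: FFTs of size L along x, all (v, w) pairs

# The three passes reproduce the full 3D transform of equation (5.1).
assert np.allclose(B, np.fft.fftn(A))
```

Mathematically the order of the three passes is irrelevant; in the parallel implementation it is the data distribution that dictates the order y, z, x.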
Figure 5.1: Computational steps of the 3D-FFT implementation using 1D-decomposition
For the three-dimensional case, two different implementations have been considered.
Performance data for volumetric fast Fourier Transform computations on the
BlueGene/L architecture has been published earlier [5, 6, 7]. One common approach
for computing the FFT of an L × M × N data array in parallel on multiple nodes is a
technique called slab decomposition [5]. For the investigations here, a data size
equal in each dimension, i.e. L = M = N, has been used. For slab decomposition, the
data is distributed along one single axis; therefore it is also called one-dimensional
decomposition. Figure 5.1 illustrates the described implementation of the three-
dimensional FFT using slab decomposition for a data array of size L × M × N. More
precisely, A(0 : L_x − 1, 0 : M − 1, 0 : N − 1) is an L_x × M × N array of complex
numbers distributed onto P nodes. So, each node stores a section of size L_x × M × N
(L_x = L/P) of the data array A in its local memory. First (a), L_x × N independent
one-dimensional FFTs of size M along the y dimension and L_x × M independent one-
dimensional FFTs of size N along the z dimension are calculated. Secondly (b),
M_y × N (M_y = M/P) independent one-dimensional FFTs of size L along the x
dimension are calculated. For calculating the independent one-dimensional FFTs,
the FFTW library function fftw() with stride=1 and the expensive re-sort strategies
have been used. More about re-sort strategies was covered earlier in chapter 4.3. It
must be pointed out that, within the figures used here, the coordinate system has been
rotated for simplification.
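The slab-decomposition pipeline described above can be mimicked in a few lines of serial NumPy code, with a Python list standing in for the P nodes and a simple concatenation standing in for the all-to-all re-sort (a sketch only; the real code uses FFTW and MPI):

```python
import numpy as np

L = M = N = 8
P = 4                                   # simulated nodes; Lx = L / P planes each
rng = np.random.default_rng(1)
A = rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))

# Distribute A along x: node p holds the slab A[p*Lx:(p+1)*Lx, :, :].
Lx = L // P
slabs = [A[p * Lx:(p + 1) * Lx].copy() for p in range(P)]

# (a) Local work: Lx*N FFTs of size M along y, then Lx*M FFTs of size N along z.
slabs = [np.fft.fft(np.fft.fft(s, axis=1), axis=2) for s in slabs]

# All-to-all re-sort (modelled globally): re-distribute along y so that each
# node owns My = M / P complete lines of data in x.
full = np.concatenate(slabs, axis=0)
My = M // P
pencils = [full[:, p * My:(p + 1) * My, :] for p in range(P)]

# (b) Local work: My*N FFTs of size L along x.
pencils = [np.fft.fft(c, axis=0) for c in pencils]

B = np.concatenate(pencils, axis=1)
assert np.allclose(B, np.fft.fftn(A))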
Figure 5.2: Computational steps of the 3D-FFT implementation using 2D-decomposition
Let us assume the computation is performed on P = L nodes and the data is
decomposed along the x axis as shown in figure 5.1, with each node assigned a
slab of size 1 × M × N. There are two perspectives to consider. First, from the
performance perspective [5], two of the three one-dimensional FFTs can be
performed locally on each node without any communication. Secondly, from the
scalability perspective, the scalability of this method is limited by the extent of the
data along a single axis [5]; in this example it is limited by L. This becomes a non-
negligible problem if one wants to exploit a very large number of nodes, such as
BlueGene/L is designed for.
A more scalable implementation of three-dimensional FFTs, called volume
decomposition, has been presented in [5] and implemented here for further
investigations with respect to mapping strategies for MPI tasks. For simplicity, we
assume that data is distributed in two dimensions so that it is ready for the first
one-dimensional FFT computation without any communication in advance. It
would be an interesting future step to extend this implementation in such a way that
data is distributed in three dimensions, since a three-dimensional decomposition
would be the more likely decomposition for many scientific applications. However,
that would involve another expensive all-to-all communication to get the data ready
for the first one-dimensional FFT evaluation.
Figure 5.2 illustrates the described implementation of the three-dimensional FFT
using two-dimensional decomposition for a data array of size L M N× × .
f(x,y,z) = \sin\left( \frac{2\pi a x}{L} + \frac{2\pi b y}{M} + \frac{2\pi c z}{N} \right) \quad \text{with } a, b, c \in \mathbb{N}    (5.3)

f(x,y,z) = \frac{1}{2i} \left( e^{\,i\left(\frac{2\pi ax}{L} + \frac{2\pi by}{M} + \frac{2\pi cz}{N}\right)} - e^{-i\left(\frac{2\pi ax}{L} + \frac{2\pi by}{M} + \frac{2\pi cz}{N}\right)} \right)    (5.4)
The MPI tasks have been organised in a two-dimensional virtual processor grid using
the MPI Cartesian grid topology [22] construct. More precisely, for a subdivision
dims={P_x, P_z} (where P_x × P_z = P) of the two-dimensional virtual processor grid, we
have P_z subgroups of nodes, each consisting of P_x nodes. Let
A(0 : L_x − 1, 0 : M − 1, 0 : N_z − 1) be an L_x × M × N_z array of complex numbers
distributed onto a P_x × P_z grid of nodes. So, each node stores a section of size
L_x × M × N_z (L_x = L/P_x, N_z = N/P_z) of the data array A in its local memory. First
(a), L_x × N_z independent one-dimensional FFTs of size M along the y dimension are
calculated. Then, within each subgroup of nodes – in figure 5.2 marked with
four main colours – an all-to-all communication is performed to get the data ready for
the second one-dimensional FFTs. Secondly (b), L_x × M_y independent one-
dimensional FFTs of size N along the z dimension are performed. To evaluate the
third one-dimensional FFT, a second all-to-all communication between the
subgroups of nodes becomes necessary. Finally (c), M_y × N_z independent one-
dimensional FFTs of size L along the x dimension are calculated.
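The two-dimensional (volume) decomposition with its two all-to-all steps can likewise be modelled serially. The helper functions `split` and `join` below are illustrative stand-ins: they cut the global array into per-node bricks and glue them back together, so each "all-to-all" is modelled as a global re-distribution rather than real message passing:

```python
import numpy as np

L = M = N = 8
Px, Pz = 2, 4                      # dims = {Px, Pz}; P = Px * Pz nodes
rng = np.random.default_rng(2)
A = rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))

def split(X, ax_i, n_i, ax_j, n_j):
    """Local bricks of X on an n_i x n_j node grid, split along ax_i and ax_j."""
    return [[np.array_split(np.array_split(X, n_i, ax_i)[i], n_j, ax_j)[j]
             for j in range(n_j)] for i in range(n_i)]

def join(bricks, ax_i, ax_j):
    return np.concatenate([np.concatenate(row, axis=ax_j) for row in bricks],
                          axis=ax_i)

# Start: x split over Px, z split over Pz; local shape (L/Px, M, N/Pz).
bricks = split(A, 0, Px, 2, Pz)
# (a) y is complete locally: FFTs of size M along y on every node.
bricks = [[np.fft.fft(b, axis=1) for b in row] for row in bricks]
# 1st all-to-all (intra-subgroup), modelled as a global re-distribution:
# trade the z split for a y split; local shape becomes (L/Px, M/Pz, N).
bricks = split(join(bricks, 0, 2), 0, Px, 1, Pz)
# (b) z is complete locally: FFTs of size N along z.
bricks = [[np.fft.fft(b, axis=2) for b in row] for row in bricks]
# 2nd all-to-all (inter-subgroup): trade the x split for a z split;
# local shape becomes (L, M/Pz, N/Px).
bricks = split(join(bricks, 0, 1), 2, Px, 1, Pz)
# (c) x is complete locally: FFTs of size L along x.
bricks = [[np.fft.fft(b, axis=0) for b in row] for row in bricks]

assert np.allclose(join(bricks, 2, 1), np.fft.fftn(A))
```

In the real implementation each re-distribution is an MPI_Alltoall within the appropriate sub-communicator, but the sequence of local shapes is the same.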
5. 2 Verification of Results

Before any investigations can be made, we must ensure that the three-dimensional
FFT computation is correct. For that reason, synthetic input data was chosen to
guarantee reliable verification of the results of the implementation against
analytically calculated results. The chosen input function (5.3) delivers not only the
advantage that the results can be safely verified, but also that the size of the problem
to be investigated can easily be modified by simply changing the values of L, M and
N. With the Euler formula (3.1), equation (5.3) can be rewritten as equation (5.4).
From equation (5.4) the three-dimensional discrete Fourier Transform has been
calculated analytically:

F(u,v,w) = \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1} f(x,y,z) \cdot e^{-2\pi i \left( \frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N} \right)}

= -\frac{i}{2} \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1} \left( e^{\,i\left(\frac{2\pi ax}{L} + \frac{2\pi by}{M} + \frac{2\pi cz}{N}\right)} - e^{-i\left(\frac{2\pi ax}{L} + \frac{2\pi by}{M} + \frac{2\pi cz}{N}\right)} \right) \cdot e^{-2\pi i \left( \frac{ux}{L} + \frac{vy}{M} + \frac{wz}{N} \right)}

= -\frac{i}{2} \left( \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1} e^{\,2\pi i \left( \frac{(a-u)x}{L} + \frac{(b-v)y}{M} + \frac{(c-w)z}{N} \right)} - \sum_{z=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1} e^{-2\pi i \left( \frac{(a+u)x}{L} + \frac{(b+v)y}{M} + \frac{(c+w)z}{N} \right)} \right)

For the further calculation, the periodic discrete delta function δ (Kronecker symbol)
is brought into play, using the periodicity of the exponential function. For
g(x) = e^{\,2\pi i a x / L}:

G(u) = \sum_{x=0}^{L-1} g(x) \cdot e^{-2\pi i u x / L} = \sum_{x=0}^{L-1} e^{\,\frac{2\pi i (a-u)x}{L}} = \begin{cases} L, & \text{if } u = a + nL,\; n \in \mathbb{Z} \\ 0, & \text{else} \end{cases} = L \cdot \delta_{a,u}

Applying this delta function in each of the three dimensions yields

F(u,v,w) = -\frac{i}{2} \left[ L \cdot M \cdot N \cdot \delta_{a,u}\,\delta_{b,v}\,\delta_{c,w} - L \cdot M \cdot N \cdot \delta_{L-a,u}\,\delta_{M-b,v}\,\delta_{N-c,w} \right]

and hence

F(u,v,w) = \begin{cases} -\frac{1}{2}\, i \cdot L \cdot M \cdot N, & \text{if } u = a,\; v = b,\; w = c \\ \phantom{-}\frac{1}{2}\, i \cdot L \cdot M \cdot N, & \text{if } u = L-a,\; v = M-b,\; w = N-c \\ \phantom{-}0, & \text{else} \end{cases}    (5.5)

Equation (5.5) presents the result of the three-dimensional Fourier Transform for the
particular input function (5.3). Now it becomes obvious that, within our
implementation, it is only necessary to verify that the two peaks in (5.5) are located at
the correct place and have the right height.
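This verification is easy to reproduce with a serial FFT; a sketch using NumPy's fftn, which follows the same sign convention as (5.1):

```python
import numpy as np

L, M, N = 16, 8, 8
a, b, c = 3, 2, 1
x, y, z = np.meshgrid(np.arange(L), np.arange(M), np.arange(N), indexing='ij')
f = np.sin(2 * np.pi * (a * x / L + b * y / M + c * z / N))

F = np.fft.fftn(f)

# Equation (5.5): exactly two non-zero coefficients are expected,
#   F[a, b, c]       = -i * L*M*N / 2
#   F[L-a, M-b, N-c] = +i * L*M*N / 2
peak = 1j * L * M * N / 2
assert np.isclose(F[a, b, c], -peak)
assert np.isclose(F[L - a, M - b, N - c], peak)
F[a, b, c] = F[L - a, M - b, N - c] = 0
assert np.allclose(F, 0)    # all other coefficients vanish
```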
Figure 5.3: Speedup of the 3D-FFT implementation using 1D-decomposition
5. 3 Performance Analysis
5. 3. 1 1D-Decomposition versus 2D-Decomposition
Various runs of the two three-dimensional FFT implementations – one decomposing
the three-dimensional data array in one dimension, the other in two dimensions –
have been performed on BlueGene/L. Two parameters have been varied: the number
of nodes used and the size of the data being transformed. For these investigations we
have used the coprocessor mode on BlueGene/L. For the FFT implementation where
data is decomposed in two dimensions, the MPI tasks were organised in a two-
dimensional logical processor grid using the MPI Cartesian grid topology [22]
construct. MPI_Cart_sub() creates new communicators which allow all-to-all
communications within and between the subgroups of nodes. A more detailed
description of both implementations was given in chapter 5.1.
Figures 5.3 and 5.4 present the speedup curves for each of the two implementations
for different problem sizes starting from 32³ up to 512³, run on the 512-node partition
with torus. In both cases, it is assumed that the applications scale ideally up to 4
nodes for a 256³ problem and up to 32 nodes for a 512³ problem, since computation
using 1 node is not feasible due to the limited amount of memory available on the
system. The features of the speedup curves for the implementation using one-
dimensional decomposition are more explicit with logarithmic scaling, while – in my
opinion – this is not the case for the implementation using two-dimensional
decomposition.
Figure 5.4: Speedup of the 3D-FFT implementation using 2D-decomposition
However, the respective figures can be found in Appendix A. Both figures show that
the implementations scale well with increasing numbers of nodes for problem sizes
greater than 32³. They even show a super linear speedup for particular problem
sizes.
If one compares the curves for the FFT implementation using one-dimensional
decomposition with those using two-dimensional decomposition with respect to the
number of nodes used, it is clearly observable that the scalability of the
one-dimensional decomposition is limited, although it scales well up to that limit.
For instance, the curve for the 64³ problem size shows clearly that the limit to
exploiting an increasing number of processors efficiently is the extent of the data
along a single axis, i.e. for the 64³ problem the scalability is limited to 64 nodes.
Looking at the curve which presents the speedup for the same problem size but
decomposing the data in two dimensions rather than only one, we can theoretically
use up to 64² nodes for the FFT computation. Admittedly, a 64³ problem is not the
ideal example for efficiently exploiting 64² nodes, since even on 512 nodes the
application no longer scales linearly. It scales linearly up to 256 processors, though
it still gets faster all the way to 1,024 processors. Here, Gustafson's law [1, 2] comes
into play: to efficiently utilise a larger number of processors, a bigger problem size
is needed.
P = \frac{\text{TotalMemory}}{\text{size of L3 cache}}    (5.6)

P = \frac{256^3 \text{ complex numbers} \cdot 6 \text{ arrays}}{4\,\text{MB}} = \frac{1{,}536\,\text{MB}}{4\,\text{MB}} = 384
One possible explanation for the observed super linear speedup in our case is a
hardware effect, namely caching. Cache effects come into play when the local
problem size is small enough that the frequently accessed variables fit into the cache.
However, there is a trade-off between problem size and fitting variables into the
cache since, as mentioned before, if the problem size is too small, relatively more
time is spent in communication than in computation, which hampers the efficient
utilisation of an increasing number of processors.
Each chip on BlueGene/L has three levels of cache [7, 11, 13]. We expect the 4 MB
for the L3 cache to be relevant here. More precisely, if one considers a fixed data
size, e.g. 256³ of complex numbers, with equation (5.6), one can easily calculate the
number of processors for which the local data fits into the L3 cache.
The code has 6 work arrays, and for the example problem size of 256³ complex
numbers equation (5.6) gives P = 1,536 MB / 4 MB = 384.
It means that for transforming a fixed size of 256³ complex numbers using the FFT
implementation with two-dimensional decomposition on 384 or more nodes, data fits
entirely into the cache on each node, which speeds up the computation super linearly.
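The arithmetic of equation (5.6) can be captured in a few lines; nodes_to_fit_l3 is a hypothetical helper name, and the calculation assumes double-precision complex numbers (16 bytes each):

```python
# Smallest node count P for which the per-node working set (all work arrays)
# fits into BlueGene/L's 4 MB L3 cache -- equation (5.6).
BYTES_PER_COMPLEX = 16          # double-precision complex
L3_BYTES = 4 * 2**20            # 4 MB L3 cache per chip

def nodes_to_fit_l3(n, work_arrays=6):
    total = n**3 * BYTES_PER_COMPLEX * work_arrays
    return total // L3_BYTES    # P = TotalMemory / size of L3 cache

# 256^3 complex numbers with 6 work arrays -> 1536 MB -> 384 nodes.
print(nodes_to_fit_l3(256))     # -> 384
```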
The problem sizes here increase by a factor of eight from one curve to the next. In
figure 5.3 (as well as in figures 5.4 and A.2) it is clearly observable that the speedup
curves for the different problem sizes become super linear at node counts that
likewise differ by a factor of eight, which supports our formula. However, the actual
jumps occur somewhat earlier, hence not all work arrays seem to be relevant.
Figure 5.5 (a): Performance measurements for the 3D-FFT implementations using 1D and 2D
decomposition for five different problem sizes, respectively

Figure 5.5 (a) compares the performance of both FFT implementations, using one-
dimensional and two-dimensional decomposition. As mentioned earlier, each
experiment was run several times, and the run measured on a hot L3 cache and
yielding the best total time for the entire three-dimensional forward FFT
computation is presented in this paper. The corresponding table containing all the
execution times for the example problem size 128³ can be found in Appendix B. The
results show that the slab decomposition is faster than the two-dimensional
decomposition, independent of the problem size and
MPI task counts. Earlier performance data published in [5] showed that slab
decomposition is faster on small task counts (< 64) and volumetric FFT is faster on
large task counts (> 64). Those measurements were performed on a cluster of IBM
POWER4 servers connected by a slower interconnect (the SP2 interconnect [24],
where each node has two 320 MB/s bidirectional channels per link [5]), which most
likely explains the difference. The performance measurements summarised in
table 5.1 show that slab decomposition is faster until it stops scaling because of the
limited number of data elements along a single dimension.
NODES   32³      64³      128³     256³     512³     1024³
1       21.603   22.168   -1.294   -        -        -
2       20.534   14.183   -1.289   -        -        -
4       25.232   21.263    6.432    2.647   -        -
8       26.663   21.188   19.260    6.586   -        -
16      32.692   31.393   23.035    9.447   -        -
32      27.655   30.965   24.544   14.523    7.671   -
64      -        26.700   25.907   23.897    7.105   -
128     -        -        25.597   21.715   14.185   -
256     -        -        -        22.995   14.543    8.277
512     -        -        -        -        22.713   10.420
Table 5.1: Performance improvement of the slab decomposition compared to 2D-decomposition
Figure 5.5 (b): Performance measurements for the communication times of the 3D-FFT
implementations using 1D and 2D decomposition for five different problem sizes, respectively

Figure 5.5 (b) compares the communication times of both FFT implementations,
using one-dimensional and two-dimensional decomposition. The
respective table, containing the performance improvement of the communication
costs for slab decomposition compared to 2D-decomposition, is included in
Appendix A. The total amount of time spent on communication for the slab
decomposition is on average 45% of the communication time of the implementation
using two-dimensional decomposition. Figure 5.5 (b) shows precisely this superior
impact on communication costs, which merits future research in terms of using
possibly faster FFT packages. Combined with these packages, the lower
communication costs will have a greater impact on the overall FFT performance.
The straight trend for very small node counts is due to latency effects caused by very
long messages.
Since the scalability of the slab decomposition is limited by the number of data
elements along a single dimension, this is the point at which the two-dimensional
decomposition comes beneficially into play. It has the advantage that, for a
particular problem size at which slab decomposition reaches its limit, a number of
additional processors can be utilised efficiently using two-dimensional
decomposition. This is of interest for smaller problem sizes, for instance on the
512-node partition for problem sizes smaller than 512³. To achieve a possible
additional performance benefit for the three-dimensional FFT implementation when
using two-dimensional decomposition, different strategies for MPI task placement
on the physical processor grid on BlueGene/L will be described in the following
section.
Figure 5.6: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 32-node partition using dims={8, 4} for the 2D-virtual processor grid
5. 3. 2 MPI task mapping strategies
The investigations of the two-dimensional FFT computation have shown that the
performance of the application can depend on the particular mapping, especially
because communication times can be minimised [3, 9]. Therefore, for the three-
dimensional FFT computation using two-dimensional decomposition, a variety of
MPI task mappings on the physical processor grid on BlueGene/L has been explored.
This optimisation has to be carried out for each partition size on BlueGene/L since
the shape of the partitions changes with size [3, 8]. For instance, the 128-node
partition consists of an 8 4 4× × block of nodes, whereas a 512-node partition is an
8 8 8× × block [3]. As a consequence, the optimal mapping for one partition size can
differ substantially from the optimal mapping for another partition [8]. In the same
way as for the two-dimensional case, this process is facilitated by the capability to
specify a node mapping at run time using the –mapfile option to the mpirun
command. For all of the following performance results, the coprocessor mode where
one processor in each chip is available for computation [3] has been used, unless it is
explicitly mentioned otherwise.
5. 3. 2. 1 Mappings on the 32-node partition
Figure 5.6 shows the customised and default node mapping used on a 32-node
partition. MPI sub-communicators are defined appropriately in order to facilitate all-
to-all communication within each subgroup of nodes to obtain data locally on each
processor for performing the second transform (intra-subgroup communication). A
second set of MPI sub-communicators is defined to afford all-to-all communication
between the subgroups of nodes to obtain data locally on each processor for the third
Fourier transform computation (inter-subgroup communication). From here it
becomes apparent that the investigations carried out for the two-dimensional case
help to understand the performance differences obtained for the three-dimensional
FFT computation using two-dimensional decomposition. More precisely, two
mapping patterns investigated individually in the two-dimensional case are now used
together to compute the three-dimensional FFTs. Figure 5.7 presents the difference
in performance between the customised node mapping and the default mapping,
normalised to the default mapping. So, the 100% line (reference line) represents the
times achieved by the computation using the default node mapping. The two intra-
and inter-subgroup all-to-all communications have been analysed separately, which
helps to decode the performance effects of the entire three-dimensional forward
Fourier Transform.

Figure 5.7: Performance impact of customised node mapping for 3D-FFT on a 32-node partition for
various problem sizes
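The grouping produced by the two sets of sub-communicators can be sketched without MPI; the rank layout below (row-major order over a dims={8, 4} virtual grid) is an assumption for illustration, not taken from the actual implementation:

```python
# Stand-in for two MPI_Cart_sub() calls on a 2D Cartesian communicator:
# ranks in the same row of the virtual grid form an intra-subgroup
# communicator (1st all-to-all), ranks in the same column an inter-subgroup
# communicator (2nd all-to-all).
Px, Pz = 8, 4          # dims = {Px, Pz}, row-major rank order assumed

def subgroups(px, pz):
    ranks = range(px * pz)
    intra = [[r for r in ranks if r // px == g] for g in range(pz)]   # rows
    inter = [[r for r in ranks if r % px == g] for g in range(px)]    # columns
    return intra, inter

intra, inter = subgroups(Px, Pz)
print(intra[0])   # -> [0, 1, 2, 3, 4, 5, 6, 7]
print(inter[0])   # -> [0, 8, 16, 24]
```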
For a possible explanation of how different mappings affect the performance, both
latency and bandwidth have to be considered. As the problem size increases, the
impact of bandwidth utilisation becomes greater, since the messages to be exchanged
are longer or are split into a larger number of packets. How exactly this is done
within the MPI library is beyond the scope of this project and will not be discussed
further.
As expected from the investigations carried out for the two-dimensional case, the
customised mapping for the intra-subgroup communication yields excellent
performance, since the communication is performed between nodes in closest
neighbourhood. More precisely, the communication time is beneficially affected by
lower latency, while latency is higher for intra-subgroup communication with default
node mapping. If one considers the trend of the line in figure 5.7 representing the
communication costs for the first all-to-all communication, it is readily identifiable
that, as the problem size grows, the impact of higher bandwidth utilisation comes
into play. So, the positive effect of the customised node mapping for the intra-
subgroup communication diminishes as the problem size increases, due to
congestion caused by higher bandwidth utilisation.
For the second communication, the communication between the subgroups of nodes,
the results are the exact opposite – here the latency of the customised mapping is
higher than for default mapping. Regarding the effect of higher bandwidth utilisation
for larger problem sizes, the trend shows a slow rise since we will have more
congestion for the default mapping than for the customised mapping. However, while
there is a trade-off between the node mappings used for the first and second
communication, it turns out that the benefit achieved from the customised mapping
for the intra-subgroup communication balances the deterioration of performances
achieved for the inter-subgroup communication. Therefore, this balancing still leads
to a slightly beneficial performance impact on the entire three-dimensional forward
FFT – about 4% for small problem sizes and 2% for the biggest problem (512³) size
used here.
5. 3. 2. 2 Mappings on the 128-node partition
The 32-node partition is too small to explore several different node mappings.
Hence, the 128-node partition has been used for ongoing mapping investigations.
Using 128 nodes instead of 32 also allows us to compute bigger problems. Two
choices of node mappings for the 128-node partition have been studied and
illustrated in Figure 5.8. As mentioned before, the MPI tasks have been organised in
a two-dimensional virtual processor grid using the MPI Cartesian grid topology [22]
construct. For both mapping patterns, dimension sizes as close as possible to each
other were used; they simply differ in swapping the sizes of the dimensions from
dims={16, 8} to dims={8, 16}.
Figure 5.8: Two customised and default node mappings for the 1st and 2nd all-to-all communication on a 128-node partition using dims={16,8} and dims={8,16} for the 2D-virtual processor grid
Figure 5.9: Performance impact of customised node mapping for 3D-FFT on a 128-node partition
using dims={16, 8} for the 2D-virtual processor grid
From figures 5.9 and 5.10 it is recognisable that, for the intra-subgroup
communication, both customised node mappings take advantage of the lower latency
of communication between nearest-neighbour nodes. However, a different trend is
observed for bigger problems. For the mapping strategy using the subdivision
dims={8, 16} for the virtual processor grid (case (b)), higher bandwidth utilisation
caused by longer messages is most likely the reason for the slowdown of the curve,
since the customised mapping is affected by congestion. For the node mapping using
the subdivision dims={16, 8} (case (a)), however, we expected a fairly similar trend,
given the results of the investigations carried out on the 32-node partition. Instead,
figure 5.9 shows that the customised mapping improves even when messages become
longer.
Figure 5.10: Performance impact of customised node mapping for 3D-FFT on a 128-node partition using dims={8, 16} for the 2D-virtual processor grid
To explain this, another important fact needs to be considered: the size of the
messages exchanged in both all-to-all communications. The default implementation
of MPI_Alltoall uses different algorithms for different message sizes [15, 25]. The
main reason for using different algorithms for collective communications, depending
on the message size, is to reduce bandwidth utilisation, especially if messages are
long [25]. For more details, we refer to the appropriate MPI library implementation.
For a better explanation in our particular
case, we consider the problem size 256³ and the two different subdivisions of the
virtual processor grid, dims={16, 8} and dims={8, 16}. That means we have
(256/16) · (256/8) · 256 complex numbers locally on each processor, independent of
the order of the dimensions of the virtual processor grid. Case (a) involves 16 MPI
tasks in the first all-to-all intra-subgroup communication, while case (b) involves 8
MPI tasks. More precisely, the size of the messages for the intra-subgroup
communication in case (a) is half the size of the messages being exchanged in
case (b). We assume that this fact might be a possible justification for the two
different trends. More detailed investigations might be of reasonable interest for
future research work.
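A rough calculation illustrates the difference between the two subdivisions; the per-message size below assumes a plain pairwise-exchange model of MPI_Alltoall, not the actual algorithm selected by the library:

```python
# Back-of-the-envelope message sizes for the 256^3 problem on 128 nodes,
# for the two subdivisions of the 2D virtual processor grid.
n = 256
bytes_per_complex = 16          # double-precision complex

for px, pz in [(16, 8), (8, 16)]:
    local = n**3 // (px * pz)                  # complex numbers per node
    tasks = px                                 # tasks in the 1st all-to-all
    msg = local // tasks * bytes_per_complex   # bytes per pairwise message
    print(f"dims={{{px},{pz}}}: {tasks} tasks, {msg} bytes per message")
```

For dims={16, 8} this gives 16 tasks exchanging 131,072-byte messages; for dims={8, 16}, 8 tasks exchanging 262,144-byte messages, so the two cases may well fall on different sides of an MPI_Alltoall algorithm-selection threshold.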
For the second communication, between the subgroups of nodes, the MPI tasks are
mapped onto the physical processor grid in an entirely fragmentary way, which leads
to higher latency and affects the communication time adversely. However, with
increasing problem sizes, which means longer messages being sent through the
network, the poor performance of the fragmentary mappings improves and even
comes close to the performance achieved with the default mapping. Again, a possible
explanation might be that the communication along a line of nodes is affected by
congestion for large message sizes. This assumption needs to be investigated in more
detail and is of value for ongoing future research. However, for both node mappings
on the 128-node partition, the beneficial performance impact of the low-latency
intra-subgroup communication is cancelled out by the high latency of the inter-
subgroup communication. The impact on the total time used for the entire three-
dimensional forward FFT computation is negligible for case (a) and in a range of 8%
down to 4% for case (b), depending on the size of the complex data being
transformed.

Figure 5.11: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={32, 16} for the 2D-virtual processor grid
5. 3. 2. 3 Mappings on the 512-node partition
Further investigations have been carried out on the 512-node partition using different
sizes of the dimensions for the two-dimensional virtual processor grid, starting from
sizes as close as possible to each other down to dims={256, 2}.
Figure 5.11 shows the customised and default node mapping used on the 512-node
partition for the subdivision of the two-dimensional virtual processor grid
dims={32,16}. This investigation has been carried out on the torus network. For the
intra-subgroup communication using customised mapping, the torus cables are only
utilised in one dimension, while for the default mapping they can be exploited for
two dimensions. We have learnt from the investigations carried out for the two-
dimensional FFT computation that the switched-on cables, representing the torus
network, have an enormous impact on bandwidth utilisation. However, figure 5.12
shows that the customised mapping is, despite utilising the torus cables in one
dimension only, beneficial for smaller problems because of its lower latency. But
with increasing problem sizes, higher bandwidth utilisation causes more congestion
in the customised case.

Figure 5.12: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={32, 16} for the 2D-virtual processor grid on the torus network

Figure 5.13: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={64, 8} for the 2D-virtual processor grid
The same can be experienced for the second all-to-all communication, so that in
summary, the default mapping used on BlueGene/L wins when the sizes of the
dimensions for the virtual processor grid are as close as possible to each other.
Figure 5.13 shows the determined node mapping for the virtual processor grid
subdivision dims={64,8}. We are aware of load imbalance consequences, caused by
choosing the sizes of the dimensions not as close as possible to each other. Impacts
of both networks, mesh and torus, have been investigated.
Figure 5.14: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={64, 8} for the 2D-virtual processor grid on the mesh network

Figure 5.15: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={64, 8} for the 2D-virtual processor grid on the torus network
The results (figure 5.14) obtained using the mesh network are fairly in line with
what we have discussed above and hence are only briefly summarised.
Clearly, for the intra-subgroup all-to-all communication, the customised node
mapping gains from low latency because of communication between nodes in the
nearest neighbourhood. For the second all-to-all communication, the customised
mapping is affected by very high latency and is not expected to deliver a positive
impact on performance. Concerning the overall performance, the poor mapping for
the inter-subgroup communication cancels the beneficial mapping for the intra-
subgroup communication which, in summary, yields almost no performance impact
on the entire forward FFT computation.
Figure 5.16: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={8,64} for the 2D-virtual processor grid
Using exactly the same node mappings on the torus network leads to entirely
different results which are presented in Figure 5.15. We have learnt that the torus
network has a remarkably profitable impact on the performance due to balancing the
bandwidth utilisation as equally as possible over the network to avoid congestion.
This may be the reason for the poor performance achieved from the customised
mapping used for the first all-to-all communication, since we assume there is no
torus cable in use, while for the default node mapping, cables are likely to be utilised
in two dimensions. The customised mapping used for the second communication is
characterised by very high latency. Even the torus cables in all three dimensions do
not yield better results. Our assumption for the turnover observed for the problem
size 128³ is that the MPI implementation uses different algorithms for the all-to-all
communication depending on message sizes. This was already discussed above.
Figure 5.16 shows the investigated node mappings where for the two-dimensional
virtual processor grid the sizes of the dimensions are simply swapped from
dims={64, 8} to dims={8, 64}. The customised mapping of MPI tasks for the inter-
subgroup communication is now more evenly distributed over the entire network.
Again, investigations have been performed for both types of networks, mesh and
torus, and the results are shown in Figure 5.17 and 5.18. For the intra-subgroup
communication where, with customised mapping, communication between nodes in
nearest neighbourhood is guaranteed, the amount of time spent in the MPI_Alltoall
routine is clearly smaller than for the default mapping. Even when there is no torus
connectivity, the path that messages have to travel in the default case is still longer
for communication between the nodes furthest away from each other. Therefore, if the
smallest cube node mapping comes into play, it wins over all the other node
mappings, independent of mesh or torus.

Figure 5.17: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={8, 64} for the 2D-virtual processor grid on the mesh network

Figure 5.18: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={8, 64} for the 2D-virtual processor grid on the torus network
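The latency argument here rests on hop counts: wrap-around links make the torus diameter roughly half that of a mesh. A small illustrative sketch (not code from this project) of the two distance measures:

```python
def mesh_hops(a, b):
    """Hop count between node coordinates a and b on a mesh
    (no wrap-around links): plain Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def torus_hops(a, b, dims):
    """Hop count on a torus: each dimension may wrap around, so the
    per-dimension distance is at most half the dimension size."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

# On an 8x8x8 (512-node) machine, opposite corners are 21 hops apart on
# a mesh but only 3 hops on a torus, thanks to the wrap-around cables.
dims = (8, 8, 8)
print(mesh_hops((0, 0, 0), (7, 7, 7)))         # 21
print(torus_hops((0, 0, 0), (7, 7, 7), dims))  # 3
```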
The customised node mapping for the second all-to-all communication is
characterised by high latency for both network types. However, if one uses the torus
network, the effect of higher bandwidth utilisation for larger problems results in a
trend which comes closer to the performance obtained for the default mapping.
Concerning the impact of communication costs on the entire three-dimensional FFT
computation, a benefit of up to 9% on the torus network and 15% on the mesh
network can be achieved from the customised node mapping. In both cases, the peak
performance improvement was obtained for the 128³ problem. Both figures show a
turnover of trends between smaller and bigger problem sizes. Here, too, we assume
that one reason may be the use of different algorithms for different message sizes.

Figure 5.19: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 512-node partition using dims={128, 4} for the 2D-virtual processor grid
We have seen that using the same subdivision size for the two-dimensional virtual
processor grid but completely different node mapping strategies causes significant
performance differences.
If one is working with the 512-node partition on BlueGene/L, the torus network
becomes of more interest than mesh. Hence, we focus on torus rather than mesh for
our next investigations. In figure 5.19 the node mapping strategies are presented for a
very unbalanced subdivision dims={128,4} for the two-dimensional processor grid.
Both mappings – default and customised – are of fragmentary design. Here, the
default mapping has a complete mapping in two dimensions but a relatively large gap
in the third dimension. The customised mapping has a complete pattern in only one
dimension and shows gaps in the second and third dimensions. Both are
characterised by high latency. However, as one can see in figure 5.20, the mapping
with large gaps between the two dense planes is more affected by high latency than
the pattern with gaps in two dimensions.
Figure 5.20: Performance impact of customised node mapping for 3D-FFT on a 512-node partition using dims={128, 4} for the 2D-virtual processor grid on the torus network
Also, for the second communication between the subgroups, the customised mapping
performs better than the default mapping. We assume that for both mappings (for the
inter-subgroup communication) the torus cables are not in use. The reason for the
superior customised performance is again lower latency, since the messages which
have to be sent between nodes furthest away from each other have a longer way to go
in the case of default mapping than for customised mapping. The imbalanced
subdivision of the two-dimensional processor grid allows the customised mapping to
achieve a performance gain from both all-to-all communications. This improvement
of the communication times has a respectable impact on the entire forward FFT,
ranging from 10% down to 4% depending on the size of complex data being
transformed.
So far, a variety of customised versus default node mappings have been investigated
for different subdivisions of the two-dimensional virtual processor grid. Our interest
is in finding the best mapping strategy for the torus network on the 512-node
partition. Table 5.2 summarises the results from all the investigated node mappings
for the torus network on the 512-node partition on BlueGene/L.
SUBDIVISION VIRTUAL GRID | CUSTOMISED NODE MAPPING | DEFAULT NODE MAPPING
dims = {32, 16}          |                         | x
dims = {64, 8}           | x                       | x
dims = {8, 64}           | x                       |
dims = {128, 4}          | x                       |
dims = {256, 2}          | --                      | --
dims = {512, 1} - 1D     | --                      | --
Table 5.2: Summary of the investigated node mappings for different subdivisions of the 2D virtual processor grid
The next step is to take the node mapping yielding best performance results for each
of the different subdivision sizes and compare them to each other. The following two
tables show the times spent for communication (5.3) and the execution times for the
entire three-dimensional forward FFT computation – each for the mappings yielding
best performance results for each of the different subdivision sizes – for different
problem sizes. However, it seems that the trend heads towards more unbalanced
subdivisions. This is not ideal, since with subdivision sizes not as equal as possible
to each other, efficiently utilising more processors becomes more and more difficult,
especially on a large-scale computing platform [8] such as BlueGene/L.
SUBDIVISION VIRTUAL GRID | 64³      | 128³     | 256³   | 512³   | 1024³
dims = {32, 16}          | 0.000286 | 0.001101 | 0.0071 | 0.0559 | 0.454
dims = {8, 64}           | 0.000278 | 0.000969 | 0.0065 | 0.0539 | 0.415
dims = {128, 4}          | --       | 0.001208 | 0.0061 | 0.0476 | 0.381
dims = {256, 2}          | --       | --       | 0.0065 | 0.0466 | 0.357
dims = {512, 1} - 1D     | --       | --       | --     | 0.0299 | 0.223
Table 5.3: Communication costs measured in seconds for different problem sizes using the best
mapping for each particular subdivision of the 2D virtual processor grid
SUBDIVISION VIRTUAL GRID | 64³      | 128³     | 256³   | 512³  | 1024³
dims = {32, 16}          | 0.000397 | 0.002478 | 0.0209 | 0.194 | 1.828
dims = {8, 64}           | 0.000394 | 0.002410 | 0.0205 | 0.193 | 1.827
dims = {128, 4}          | --       | 0.002717 | 0.0203 | 0.191 | 1.826
dims = {256, 2}          | --       | --       | 0.0206 | 0.185 | 1.815
dims = {512, 1} - 1D     | --       | --       | --     | 0.150 | 1.639
Table 5.4: Cost for entire forward 3D-FFT computation measured in seconds for different problem
sizes using the best mapping for each particular subdivision of the 2D virtual processor grid
5.3.2.4 Mappings on the 1024-node partition
Further investigations have been carried out on the 1024-node partition. We decided
on three different subdivisions of the two-dimensional virtual processor grid for
which a customised and the default mapping have been explored. The first
subdivision was chosen with sizes as close as possible to each other, dims={32, 32}.
Figure 5.21: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={32, 32} for the 2D-virtual processor grid
Figure 5.22: Performance impact of customised node mapping for 3D-FFT on a 1024-node partition
using dims={32, 32} for the 2D-virtual processor grid on the torus network
The reason for the second and third choice, dims={8, 128} and dims={256, 4}, was
based on the results achieved from the same node mapping pattern used on the 512-
node partition.
Figure 5.21 illustrates the customised and default node mappings for both all-to-all
communications on the 1024-node partition with the subdivision dims={32, 32}. The
outcome is close to what we expected because of the results achieved from the
mapping investigations on the 512-node partition using the subdivision dimensions
as close as possible to each other, dims={32, 16} (see figure 5.11).
The customised mapping for the intra-subgroup communication, which is the same
as used for the 512-node partition, shows a great impact on performance compared to
Figure 5.23: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={8, 128} for the 2D-virtual processor grid
the respective default mapping. The reason for this is the poor pattern of the default
mapping, which is highly fragmentary and hence characterised by very high latency
costs. The opposite can be applied to the mappings used for the inter-subgroup
communication. Here the customised mapping features higher latency costs than the
default mapping, equal to the mapping pattern for the 512-node partition. But now
there are more MPI tasks distributed over a 16×8 plane, compared to tasks arranged
in lines of eight in length. This has the consequence that high latency impacts the
performance results. It is useful to compare the curves for the inter-subgroup
communication achieved for the investigations on the 512-node partition (figure
5.12) with the current ones for the 1024-node partition (figure 5.22). This illustrates
that the poor performance obtained for the same fragmentary mapping pattern used
over a larger surface is worse due to higher latency. The impact on the performance
of the entire three-dimensional FFT computation is the same as that obtained on the
512-node partition.
For the second investigation on the 1024-node partition, using the subdivision
dims={8, 128}, the default mappings for both all-to-all communications, within
subgroups and between subgroups, show unsuitable mapping patterns. For the first
communication, the MPI tasks are arranged in a line but cannot efficiently make use
of the torus as they could for the same mapping on the 512-node partition (Figure
5.16). Clearly, this yields better performance for the cube-shaped mapping, since
both mappings can now be regarded as if a mesh were being used.
Figure 5.24: Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={8, 128} for the 2D-virtual processor grid on the torus network
The results achieved for the second all-to-all communication are not obvious from
the start, since the mapping differs from the previously investigated patterns.
However, since a variety of investigations have been carried out, the following
assumption is consistent with the previous speculations. For the research done on the
512-node partition (see figure 5.16), we came to the conclusion that the worse
performance of the fragmentary mapping pattern is due to higher latency costs.
However, in this particular case, we compared a mapping characterised by gaps in all
three dimensions with another mapping where MPI tasks were arranged continuously
over a surface. But what we want to compare here on the 1024-node partition is
rather related to the mapping shown in figure 5.19, where we have the same pattern
for both mappings along z-dimension, small gaps versus no gaps along the y-
dimension, and small gaps versus large gaps along the x-dimension. For this
mapping, we have achieved a slight improvement compared to the one illustrated in
figure 5.16. However, with the current mappings on the 1024-node partition, we take
a further step and compare a mapping completely fragmentary in all three
dimensions versus a mapping characterised by two completely filled planes with a
large gap in between. The result, presented in figure 5.24, shows that on the 1024-node
partition, a completely fragmentary mapping is more affected by high latency costs
than a consistent mapping with a very large gap. However, the enormous
performance gain achieved for the first all-to-all communication entirely balances
out the poor performance. In summary, on the 512-node partition an overall
performance improvement for the three-dimensional FFT computation of up to 9%
was obtained. Here, on the 1024-node partition we have an additional improvement
of 2% for each problem size apart from 128³.
Figure 5.25: Customised and default node mappings for the 1st and 2nd all-to-all communication on a 1024-node partition using dims={256, 4} for the 2D-virtual processor grid
Figure 5.26: Performance impact of customised node mapping for 3D-FFT on a 1024-node partition using dims={256, 4} for the 2D-virtual processor grid on the torus network
For the third investigation carried out on the 1024-node partition, the virtual
processor grid is subdivided into dims={256, 4}. Both mappings, customised and
default, are illustrated in figure 5.25 and show the same pattern as was investigated
on the 512-node partition (see figure 5.19). Using a dimension subdivision
dims={256, 4}, the smallest three-dimensional complex data array which can be
transformed, is of the size 256³.
This example shows very precisely that the same results can be achieved by using the
same node mapping pattern extended to a twice-as-large partition. The results,
presented in figure 5.26 and compared to the results achieved on the 512-node
partition (see figure 5.20), show the expected behaviour.
6. CONCLUSION
We have demonstrated the excellent scalability of the three-dimensional FFT code
for large problem sizes on the BlueGene/L platform on up to 1,024 processors. For
relatively small problem sizes (32³ complex numbers), the three-dimensional FFT
using slab decomposition is typically 20% to 30% faster than the FFT computation
where data is decomposed in two dimensions. On the other hand, the efficient
utilisation of a larger number of processors for slab decomposition is limited to the
data elements along one dimension. At this point, the FFT computation using two-dimensional
decomposition comes beneficially into play. To further improve this
performance, a variety of mappings of MPI tasks onto the three-dimensional torus
communication network have been explored.
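The scalability limits referred to above can be stated compactly. The following sketch (the function name is ours) encodes only the node-count bound, not the decompositions themselves: a slab (1D) decomposition assigns whole planes of an n³ array, a two-dimensional (pencil) decomposition assigns lines.

```python
def max_nodes(n, decomposition):
    """Upper bound on usable nodes for an n^3 FFT: slab decomposition
    is limited to n nodes (one plane per node), while a 2D (pencil)
    decomposition can employ up to n^2 nodes (one line per node)."""
    return n if decomposition == "slab" else n * n

# A 128^3 problem saturates slab decomposition at 128 nodes, but the
# 2D decomposition can still occupy a full 1024-node partition.
assert max_nodes(128, "slab") < 1024 <= max_nodes(128, "2d")
```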
Our experiments clearly indicate that a carefully chosen mapping of MPI tasks on the
torus network that takes the network characteristics into account is beneficial in
obtaining improved performance for our type of application. This is especially
important for scientific applications that call FFT routines many times.
For the FFT computation using two-dimensional decomposition, we have seen that
when the dimension sizes of the two-dimensional processor grid are chosen as equal
as possible (for instance, dims={32, 16} on the 512-node partition or dims={32, 32}
on the 1024-node partition), the default node mapping utilises the torus well and is
difficult to improve upon. On the other hand, choosing the dimension sizes less
balanced (in other words, coming closer and closer to the one-dimensional
decomposition) delivers performance improvements through customised node
mappings. Our results show excellent performance
enhancements for the 8-cube (typically 20% to 45% improvement in communication
costs) and the 4-square (typically 50% to 65% improvement in communication costs)
patterns, compared to a row of processors which utilises the torus links in one
dimension only. Even if the communication costs are higher for the second all-to-all
communication because of an unprofitable node mapping, the gain obtained from the
first communication is enough to cancel the poor performance completely.
If the mapping for the intra-subgroup communication is small and dense (as it is for
the 8-cube and the 4-square), then the mapping for the inter-subgroup communication
is discontiguous. In general, discontiguous mappings are expensive. Even if an
evenly distributed fragmentary node mapping is spanned over the whole 512-node
partition, which allows utilisation of the torus links in all three dimensions, the
higher latency costs are hardly reduced.
These small and dense shapes – where communication between nodes in the nearest
neighbourhood is responsible for the benefit – can be exploited if the subdivision of
the two-dimensional virtual processor grid is less balanced. For
instance, for a subdivision of the virtual processor grid dims={128, 4} the customised
node mapping shows a 10% improvement of the FFT (for the problem sizes 128³ and
256³), while the total communication costs are improved by 25% - compared to the
default node mapping on BlueGene/L. This fact leads to an additional conclusion that
ongoing developments in FFT libraries as well as re-sort strategies have the potential
to improve some of these results. More precisely, this significant impact on
communication costs, combined with the use of possibly faster FFT libraries and
more efficient re-sort methods, would have a greater impact on the entire FFT
performance. Therefore, looking for possibilities to reduce the computation costs
(FFT as well as re-sort methods) could be valuable for future studies.
In the following section we discuss additional likely sources of performance
improvements and possible future research projects.
The impact of the virtual node mode on BlueGene/L has only been briefly
investigated, for the two-dimensional case. The results have shown that the execution
times can be almost halved by dedicating the same number of chips but running one
MPI task on each of the two processors of a chip. This makes virtual node mode
interesting for future investigations of parallel FFTs for multi-dimensional data.
The following was not discussed in the previous chapters. However, within this
project, we spent some time on auxiliary research on how we can beneficially exploit
the dual-processor compute nodes operating in coprocessor mode on BlueGene/L. In
the coprocessor mode, all computations are performed on the first processor of each
chip [3]. The second processor is used for communications [3]. It has the advantage
that computations and communications can be overlapped. For the two-dimensional
FFT computation, we wrote a customised My_MPI_Alltoall routine that allows
overlapping of communication and computation. The computation of the one-dimensional
FFTs is broken down into two halves. While the first half of the
complex data array is being transformed, the second processor is idle. After finishing
the computation of the first half, the second processor is in charge of communication
while the first processor continues with transforming the second half of the complex
data array. However, since this optimisation is not the major part of this project, the
customised My_MPI_Alltoall is written in a very naïve way consisting of a non-
blocking standard send-and-receive pair. It does not take into account different
algorithms depending on the size of the messages. The results show a marginal
improvement only for very large problems (32,768²). Future work may involve a
successive splitting up of the one-dimensional FFT computation, rather than only
dividing it in two halves, so that the second processor can be exploited earlier.
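As a purely illustrative model (this is not the My_MPI_Alltoall code itself, and the timings are hypothetical), splitting the batch of one-dimensional FFTs into more chunks lets communication on the coprocessor hide behind computation on the compute processor:

```python
def total_time(chunks, t_fft, t_comm):
    """Idealised timeline for splitting a batch of 1D FFTs into `chunks`
    pieces: while the compute processor transforms chunk i+1, the second
    processor communicates chunk i.  With one chunk there is no overlap."""
    per_fft, per_comm = t_fft / chunks, t_comm / chunks
    t = per_fft                      # the first chunk must be computed first
    for _ in range(chunks - 1):
        t += max(per_fft, per_comm)  # compute and communicate in parallel
    return t + per_comm              # the last chunk's communication drains

# No split (1 chunk) costs t_fft + t_comm in full; finer splits hide
# progressively more of the communication behind computation.
print(total_time(1, 4.0, 2.0))  # 6.0
print(total_time(2, 4.0, 2.0))  # 5.0
```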
The MPI implementation for BlueGene/L is a particularly optimised port of the
MPICH2 [15] library for the BlueGene/L architecture. It comes with, amongst
others, an optimised MPI_Alltoall as well as MPI_Alltoallv algorithm
which optimise the injection of packets to achieve high network efficiency [13]. It
may be useful to investigate the performance impact of using the optimised
MPI_Alltoallv for the communication kernel of the multi-dimensional parallel
FFT computation since within our implementations we have only used the
MPI_Alltoall algorithm. The MPI_Alltoallv algorithm allows more
flexibility with respect to the structure and size of the input and output data.
Therefore, expensive re-sorting of the data before and after the all-to-all
communication can be eliminated in many cases.
The recently published parallel three-dimensional FFT library for BlueGene/L
(BGL3DFFT) is specifically designed to take advantage of the IBM BlueGene/L
architecture by enabling applications that use three-dimensional FFTs to scale to
thousands of BlueGene/L processors [16]. Most of the alternative parallel libraries
compute three-dimensional FFTs by using the slab decomposition technique [16].
We know that the scalability of the slab-based methods is limited by the size of the
data of a single axis. In BGL3DFFT, the three-dimensional FFT implementation is
based on a two-dimensional decomposition which enables scalability to N²
processors [16] (N := number of elements along a single dimension). At the time of
this writing, the BGL3DFFT library has not yet been installed on BlueSky. However,
utilising this library for projects with a three dimensional FFT core might be of
particular importance in the near future.
APPENDIX A
Figure A.1: Speedup of the 3D-FFT implementation using 1D-decomposition (speedup versus number of nodes, up to 512 nodes, for the problem sizes 32³, 64³, 128³, 256³ and 512³, plotted against the ideal speedup)
Figure A.2: Speedup of the 3D-FFT implementation using 2D-decomposition (log-log plot of speedup versus number of nodes, up to 1,024 nodes, for the problem sizes 32³, 64³, 128³, 256³ and 512³, plotted against the ideal speedup)
NODES | 32³    | 64³    | 128³   | 256³   | 512³   | 1024³
1     | 49.875 | 53.479 | 51.586 | --     | --     | --
2     | 29.538 | 36.361 | 29.581 | --     | --     | --
4     | 36.237 | 35.987 | 37.949 | 39.053 | --     | --
8     | 35.750 | 38.636 | 40.531 | 47.739 | --     | --
16    | 48.288 | 50.000 | 50.868 | 51.735 | --     | --
32    | 39.067 | 47.340 | 48.173 | 50.172 | 49.625 | --
64    | --     | 41.306 | 48.563 | 49.576 | 50.845 | --
128   | --     | --     | 44.251 | 48.060 | 49.876 | --
256   | --     | --     | --     | 42.318 | 48.034 | 48.969
512   | --     | --     | --     | --     | 46.813 | 51.072
Table A.1: Performance improvement of communication costs for slab decomposition compared to
2D-decomposition
APPENDIX B
(Values are given as "1D DECOM / 2D DECOM"; "--" marks entries that exist for the 2D decomposition only.)

NODES = 1
Time for plan (s): 5.970136 / 5.967375
Time for 1st FFTW (row) (s): 0.124627 / 0.122882
Time for 2nd FFTW (col) (s): 0.124999 / 0.122423
Time for 3rd FFTW (plane) (s): 0.116200 / 0.123429
Time for 1st Resort (s): 0.372843 / 0.366375
Time for 2nd Resort (s): 0.553968 / 0.048459
Time for 3rd Resort (s): -- / 0.382529
Time for 4th Resort (s): -- / 0.054857
Time for 1st Comm (s): -- / 0.052827
Time for 2nd Comm (s): -- / 0.052834
Time for FFTW (s): 0.365826 / 0.368734
Time for Resort (s): 0.926811 / 0.852221
Time for Comms (s): 0.051155 / 0.105662
Time for forward 3D-FFT BEF.BARR: 1.343793 / 1.326617
Time for forward 3D-FFT AFT.BARR: 1.343794 / 1.326619
Time for backward 3D-FFT (s): 1.388044 / 1.528895
SPEEDUP: 1.000000 / 1.000000
EFFICIENCY: 1.000000 / 1.000000

NODES = 2
Time for plan (s): 5.970110 / 5.963465
Time for 1st FFTW (row) (s): 0.062167 / 0.061846
Time for 2nd FFTW (col) (s): 0.062371 / 0.061728
Time for 3rd FFTW (plane) (s): 0.058333 / 0.062184
Time for 1st Resort (s): 0.188852 / 0.178932
Time for 2nd Resort (s): 0.274624 / 0.027077
Time for 3rd Resort (s): -- / 0.189501
Time for 4th Resort (s): -- / 0.028056
Time for 1st Comm (s): -- / 0.067745
Time for 2nd Comm (s): -- / 0.026816
Time for FFTW (s): 0.182871 / 0.185758
Time for Resort (s): 0.463476 / 0.423566
Time for Comms (s): 0.066588 / 0.094561
Time for forward 3D-FFT BEF.BARR: 0.712936 / 0.703885
Time for forward 3D-FFT AFT.BARR: 0.712966 / 0.703887
Time for backward 3D-FFT (s): 0.744644 / 0.811360
SPEEDUP: 1.884793 / 1.884704
EFFICIENCY: 0.942396 / 0.942352

NODES = 4
Time for plan (s): 5.968986 / 5.960823
Time for 1st FFTW (row) (s): 0.029588 / 0.031450
Time for 2nd FFTW (col) (s): 0.029709 / 0.031374
Time for 3rd FFTW (plane) (s): 0.030861 / 0.031672
Time for 1st Resort (s): 0.092216 / 0.090013
Time for 2nd Resort (s): 0.134422 / 0.014205
Time for 3rd Resort (s): -- / 0.093685
Time for 4th Resort (s): -- / 0.015688
Time for 1st Comm (s): -- / 0.058314
Time for 2nd Comm (s): -- / 0.034014
Time for FFTW (s): 0.090158 / 0.094497
Time for Resort (s): 0.226638 / 0.213591
Time for Comms (s): 0.057290 / 0.092328
Time for forward 3D-FFT BEF.BARR: 0.374087 / 0.400416
Time for forward 3D-FFT AFT.BARR: 0.374661 / 0.400419
Time for backward 3D-FFT (s): 0.393519 / 0.457348
SPEEDUP: 3.586693 / 3.313077
EFFICIENCY: 0.896673 / 0.828226

NODES = 8
Time for plan (s): 5.965137 / 5.957422
Time for 1st FFTW (row) (s): 0.015103 / 0.015085
Time for 2nd FFTW (col) (s): 0.014656 / 0.014852
Time for 3rd FFTW (plane) (s): 0.015364 / 0.014812
Time for 1st Resort (s): 0.032551 / 0.031501
Time for 2nd Resort (s): 0.033830 / 0.007976
Time for 3rd Resort (s): -- / 0.033048
Time for 4th Resort (s): -- / 0.007492
Time for 1st Comm (s): -- / 0.031318
Time for 2nd Comm (s): -- / 0.016695
Time for FFTW (s): 0.045123 / 0.044749
Time for Resort (s): 0.066381 / 0.080017
Time for Comms (s): 0.028552 / 0.048012
Time for forward 3D-FFT BEF.BARR: 0.140056 / 0.172778
Time for forward 3D-FFT AFT.BARR: 0.140119 / 0.173544
Time for backward 3D-FFT (s): 0.141519 / 0.170658
SPEEDUP: 9.590376 / 7.644280
EFFICIENCY: 1.198797 / 0.955535

NODES = 16
Time for plan (s): 5.960476 / 5.964690
Time for 1st FFTW (row) (s): 0.006392 / 0.006972
Time for 2nd FFTW (col) (s): 0.006802 / 0.006403
Time for 3rd FFTW (plane) (s): 0.006949 / 0.006406
Time for 1st Resort (s): 0.013645 / 0.013502
Time for 2nd Resort (s): 0.015878 / 0.003292
Time for 3rd Resort (s): -- / 0.014431
Time for 4th Resort (s): -- / 0.003288
Time for 1st Comm (s): -- / 0.014190
Time for 2nd Comm (s): -- / 0.014119
Time for FFTW (s): 0.020143 / 0.019781
Time for Resort (s): 0.029523 / 0.034512
Time for Comms (s): 0.013909 / 0.028310
Time for forward 3D-FFT BEF.BARR: 0.063575 / 0.082603
Time for forward 3D-FFT AFT.BARR: 0.063577 / 0.082606
Time for backward 3D-FFT (s): 0.061973 / 0.079165
SPEEDUP: 21.136480 / 16.059596
EFFICIENCY: 1.321030 / 1.003724

NODES = 32
Time for plan (s): 5.961193 / 5.958275
Time for 1st FFTW (row) (s): 0.003062 / 0.003202
Time for 2nd FFTW (col) (s): 0.003024 / 0.003037
Time for 3rd FFTW (plane) (s): 0.003165 / 0.003038
Time for 1st Resort (s): 0.005870 / 0.005966
Time for 2nd Resort (s): 0.006985 / 0.001764
Time for 3rd Resort (s): -- / 0.006118
Time for 4th Resort (s): -- / 0.001594
Time for 1st Comm (s): -- / 0.007249
Time for 2nd Comm (s): -- / 0.006987
Time for FFTW (s): 0.009252 / 0.009276
Time for Resort (s): 0.012855 / 0.015441
Time for Comms (s): 0.007378 / 0.014236
Time for forward 3D-FFT BEF.BARR: 0.029485 / 0.038954
Time for forward 3D-FFT AFT.BARR: 0.029488 / 0.039080
Time for backward 3D-FFT (s): 0.029450 / 0.036021
SPEEDUP: 45.570876 / 33.946238
EFFICIENCY: 1.424089 / 1.060819

NODES = 64
Time for plan (s): 5.968804 / 5.958727
Time for 1st FFTW (row) (s): 0.001689 / 0.001517
Time for 2nd FFTW (col) (s): 0.001474 / 0.001515
Time for 3rd FFTW (plane) (s): 0.001530 / 0.001514
Time for 1st Resort (s): 0.002949 / 0.002988
Time for 2nd Resort (s): 0.003201 / 0.000882
Time for 3rd Resort (s): -- / 0.003048
Time for 4th Resort (s): -- / 0.000886
Time for 1st Comm (s): -- / 0.003535
Time for 2nd Comm (s): -- / 0.003637
Time for FFTW (s): 0.004692 / 0.004546
Time for Resort (s): 0.006150 / 0.007804
Time for Comms (s): 0.003689 / 0.007172
Time for forward 3D-FFT BEF.BARR: 0.014531 / 0.019522
Time for forward 3D-FFT AFT.BARR: 0.014534 / 0.019616
Time for backward 3D-FFT (s): 0.014526 / 0.018692
SPEEDUP: 92.458648 / 67.629435
EFFICIENCY: 1.444666 / 1.056709
APPENDIX B
68
NODES = 128 (1D DECOM / 2D DECOM)
Time for plan (s): 5.978766 / 5.965477
Time for 1st FFTW (row) (s): 0.000761 / 0.000783
Time for 2nd FFTW (col) (s): 0.000760 / 0.000781
Time for 3rd FFTW (plane) (s): 0.000791 / 0.000781
Time for 1st Resort (s): 0.001482 / 0.001512
Time for 2nd Resort (s): 0.001518 / 0.000614
Time for 3rd Resort (s): -- / 0.001496
Time for 4th Resort (s): -- / 0.000488
Time for 1st Comm (s): -- / 0.001823
Time for 2nd Comm (s): -- / 0.001804
Time for FFTW (s): 0.002312 / 0.002345
Time for Resort (s): 0.003000 / 0.004110
Time for Comms (s): 0.002022 / 0.003627
Time for forward 3D-FFT BEF.BARR: 0.007335 / 0.010082
Time for forward 3D-FFT AFT.BARR: 0.007502 / 0.010083
Time for backward 3D-FFT (s): 0.007607 / 0.009644
SPEEDUP: 179.124766 / 131.569870
EFFICIENCY: 1.399412 / 1.027889

NODES = 256 (2D DECOM only)
Time for plan (s): 5.954514
Time for 1st FFTW (row) (s): 0.000380
Time for 2nd FFTW (col) (s): 0.000378
Time for 3rd FFTW (plane) (s): 0.000379
Time for 1st Resort (s): 0.000571
Time for 2nd Resort (s): 0.000254
Time for 3rd Resort (s): 0.000580
Time for 4th Resort (s): 0.000260
Time for 1st Comm (s): 0.001006
Time for 2nd Comm (s): 0.000935
Time for FFTW (s): 0.001137
Time for Resort (s): 0.001665
Time for Comms (s): 0.001940
Time for forward 3D-FFT BEF.BARR: 0.004743
Time for forward 3D-FFT AFT.BARR: 0.004786
Time for backward 3D-FFT (s): 0.004675
SPEEDUP: 277.187421
EFFICIENCY: 1.082763

NODES = 512 (2D DECOM only)
Time for plan (s): 5.956440
Time for 1st FFTW (row) (s): 0.000193
Time for 2nd FFTW (col) (s): 0.000192
Time for 3rd FFTW (plane) (s): 0.000191
Time for 1st Resort (s): 0.000268
Time for 2nd Resort (s): 0.000171
Time for 3rd Resort (s): 0.000260
Time for 4th Resort (s): 0.000096
Time for 1st Comm (s): 0.000535
Time for 2nd Comm (s): 0.000536
Time for FFTW (s): 0.000576
Time for Resort (s): 0.000795
Time for Comms (s): 0.001071
Time for forward 3D-FFT BEF.BARR: 0.002442
Time for forward 3D-FFT AFT.BARR: 0.002507
Time for backward 3D-FFT (s): 0.002474
SPEEDUP: 529.165935
EFFICIENCY: 1.033527

NODES = 1024 (2D DECOM only)
Time for plan (s): 5.953987
Time for 1st FFTW (row) (s): 0.000099
Time for 2nd FFTW (col) (s): 0.000097
Time for 3rd FFTW (plane) (s): 0.000097
Time for 1st Resort (s): 0.000133
Time for 2nd Resort (s): 0.000078
Time for 3rd Resort (s): 0.000128
Time for 4th Resort (s): 0.000055
Time for 1st Comm (s): 0.000584
Time for 2nd Comm (s): 0.000339
Time for FFTW (s): 0.000293
Time for Resort (s): 0.000394
Time for Comms (s): 0.000922
Time for forward 3D-FFT BEF.BARR: 0.001610
Time for forward 3D-FFT AFT.BARR: 0.001610
Time for backward 3D-FFT (s): 0.001598
SPEEDUP: 798.686900
EFFICIENCY: 0.779900
Table B.1: Performance measurements in seconds for the 3D-FFT implementations using 1D
and 2D decomposition for problem size 128³
APPENDIX C
Comms between nodes      Line        Diagonal    Volume diagonal
0-1                    0.007230     0.007444       0.007634
0-2                    0.007425     0.007835       0.008192
0-3                    0.007659     0.008200       0.008776
0-4                    0.007940     0.008723       0.009484
0-5                    0.007688     0.008291       0.008879
0-6                    0.007466     0.007891       0.008306
0-7                    0.007276     0.007536       0.007767

Table C.1: Performance measurements in seconds for the ping-pong application: sending/receiving
1,000 messages of 10 integers (4 bytes each) between two nodes mapped onto the torus network
along a line, a diagonal, and a volume diagonal.
Comms between nodes      Line        Diagonal    Volume diagonal
0-1                    0.012639     0.012845       0.013064
0-2                    0.013009     0.013274       0.013609
0-3                    0.013239     0.013592       0.014159
0-4                    0.013389     0.014059       0.014904
0-5                    0.013254     0.013662       0.014271
0-6                    0.013042     0.013320       0.013689
0-7                    0.012670     0.012900       0.013252

Table C.2: Performance measurements in seconds for the ping-pong application: sending/receiving
1,000 messages of 100 integers (4 bytes each) between two nodes mapped onto the torus network
along a line, a diagonal, and a volume diagonal.
Comms between nodes      Line        Diagonal    Volume diagonal
0-1                    0.263972     0.263337       0.263979
0-2                    0.264637     0.264388       0.265659
0-3                    0.265207     0.265586       0.267373
0-4                    0.266122     0.267029       0.269522
0-5                    0.265270     0.265800       0.267732
0-6                    0.264750     0.264657       0.266029
0-7                    0.264121     0.263597       0.264356

Table C.3: Performance measurements in seconds for the ping-pong application: sending/receiving
1,000 messages of 1,000 integers (4 bytes each) between two nodes mapped onto the torus network
along a line, a diagonal, and a volume diagonal.
Comms between nodes      Line        Diagonal    Volume diagonal
0-1                    0.737561     0.722970       0.723543
0-2                    0.738190     0.724083       0.725269
0-3                    0.738716     0.725243       0.727019
0-4                    0.739368     0.726798       0.729185
0-5                    0.738797     0.725518       0.727335
0-6                    0.738221     0.724304       0.725663
0-7                    0.737681     0.723136       0.723962

Table C.4: Performance measurements in seconds for the ping-pong application: sending/receiving
1,000 messages of 10,000 integers (4 bytes each) between two nodes mapped onto the torus network
along a line, a diagonal, and a volume diagonal.
For long messages, such as 10,000 integers, the mapping along a line becomes even more
expensive than the mapping along the diagonal or the volume diagonal. We attribute this to
network congestion, since the diagonal mappings offer a greater choice of paths between
the two nodes.
BIBLIOGRAPHY
[1] Gustafson, J.L., Reevaluating Amdahl's Law, CACM, 31(5), 1988, pp. 532-533.
[2] Amdahl, G.M., Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the AFIPS Conference, Reston, VA, 1967, pp. 483-485.
[3] University of Edinburgh, BlueGene/L User Information http://www.epcc.ed.ac.uk/~bgapps/user_info.html.
[4] Franchetti F., Kral S., Lorenz J., Püschel M., Ueberhuber C. W., Automatically Tuned FFTs for BlueGene/L’s Double FPU, High Performance Computing for Computational Science - VECPAR 2004, pp. 23-36.
[5] Eleftheriou, M., Moreira, J. E., Fitch, B. G., Germain, R. S., A Volumetric FFT for BlueGene/L, Lecture Notes in Computer Science, volume 2913, 2003, pp. 194-203.
[6] Eleftheriou, M., Fitch, B., Rayshubskiy, A., Ward, T. J. C., Germain, R. S.,
Performance Measurements of the 3D FFT on the Blue Gene/L Supercomputer, Euro-Par 2005, pp. 795-803.
[7] Davis, K., Hoisie, A., Johnson, G., Kerbyson, D. J., Lang, M., Pakin, S.,
Petrini, F., A Performance and Scalability Analysis of the BlueGene/L Architecture, Proceedings of the ACM/IEEE Conference on Supercomputing, 2004.
[8] Gygi, F., Yates, R. K., Lorenz, J., Draeger, E. W., Franchetti, F., Ueberhuber, C., W., de Supinski, B., R., Kral, S., Gunnels, J. A., Sexton, J. C., Large-Scale First-Principles Molecular Dynamics simulations on the BlueGene/L Platform using the Qbox code, Conference on High Performance Networking and Computing, 2005, pp. 24 et sqq.
[9] Fang, B., Deng, Y., Performance of 3D FFT on 6D QCDOC Torus Parallel Supercomputer, J. Comp. Phys. Submitted, 2005.
[10] Kral, S., FFTW-GEL Homepage, http://www.complang.tuwien.ac.at/skral/fftwgel.html.
[11] Gara, A., Blumrich, M. A., Chen, D., Chiu, G. L.-T., Coteus, P., Giampapa,
M. E., Haring, R. A., Heidelberger, P., Hoenicke, D., Kopcsay, G. V., Liebsch, T. A., Ohmacht, M., Steinmacher-Burow, B. D., Takken, T., Vranas, P., Overview of the Blue Gene/L system architecture, IBM Journal of Research and Development, Volume 49, Number 2/3, 2005.
[12] Message Passing Interface Forum, MPI: A Message-Passing Interface
Standard, University of Tennessee, 1995, see http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html.
[13] Almási, G., Archer, C., Castaños, J. G., Gunnels, J. A., Erway, C. C., Heidelberger, P., Martorell, X., Moreira, J. E., Pinnow, K., Ratterman, J., Steinmacher-Burow, B. D., Gropp, W., Toonen, B., Design and implementation of message-passing services for the Blue Gene/L supercomputer, IBM Journal of Research and Development, Volume 49, Number 2/3, 2005.
[14] Adiga N. R. et al., An Overview of the Blue Gene/L Supercomputer, Proceedings of the ACM/IEEE Conference on Supercomputing, 2002, pp. 1–22, see http://www.sc-conference.org/sc2002/.
[15] MPICH and MPICH2 homepage, see http://www-unix.mcs.anl.gov/mpi/mpich.
[16] Eleftheriou, M., 3D Fast Fourier Transform Library for Blue Gene/L, http://www.alphaworks.ibm.com/tech/bgl3dfft.
[17] Numerical Recipes in C, http://www.library.cornell.edu/nr/bookcpdf.html.
[18] Fourier Theory, http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/MARSHALL/node17.html.
[19] Fastest Fourier Transform in the West (FFTW), http://www.dl.ac.uk/TCSC/Subjects/Parallel_Algorithms/FFTreport/node82.html
[20] Franchetti, F., FFTs on BlueGene/L machines, http://www.llnl.gov/asci/platforms/bluegene/talks/franchetti.pdf.
[21] Allan, R.J., Taylor, K., Parallel Application Software on High Performance Computers, Serial and Parallel FFT Routines, http://www.dl.ac.uk/TCSC/Subjects/Parallel_Algorithms/FFTreport/.
[22] MPI Routines, http://www-unix.mcs.anl.gov/mpi/www/www3/.
[23] Multiprocessing by Message Passing MPI, http://scv.bu.edu/tutorials/MPI/.
[24] Agerwala, T., Martin, J. L., Mirza, J. H., Sadler, D. C., Dias, D. M., Snir, M., SP2 system architecture, IBM Systems Journal, Volume 34, Number 2, 1995, see http://www.research.ibm.com/journal/sj/342/agerwala.html.
[25] MPI over InfiniBand Project homepage, default implementation of the MPI_Alltoall algorithm, https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/trunk/src/mpi/coll/alltoall.c.

[26] University of Edinburgh, EPCC Course Slides, Applied Numerical Algorithms.

[27] James, J. F., A Student's Guide to Fourier Transforms: With Applications in Physics and Engineering, Cambridge University Press, 2002.

[28] Kallies, B., FFTW, 2004, http://www.hlrn.de/doc/fftw/index.html.

[29] FFTW Homepage, http://www.fftw.org/.

[30] Hennessy, J. L., Patterson, D. A., Computer Architecture: A Quantitative Approach, Third Edition, 2003.