AN EFFICIENT IMPLEMENTATION OF ADAPTIVE GAUSS
QUADRATURE RULE FOR PARALLEL ENVIRONMENT
by
VIVEK JAISWAL, B.E.
A THESIS
IN
COMPUTER SCIENCE
Submitted to the Graduate Faculty of Texas Tech University in
Partial Fulfillment of the Requirements for
the Degree of
MASTER OF SCIENCE
Approved
Chairperson of the Committee
Accepted
Dean of the Graduate School
December, 2004
ACKNOWLEDGEMENTS
I would like to express my gratitude to all those who made it possible for me to
complete this thesis. I would like to thank Dr. Noe Lopez-Benitez for his help, support,
and valuable discussions. His great enthusiasm and integral view of research have made
a deep impression on me.
I am deeply indebted to Prof. Dr. Philip Smith, Director of the High Performance
Computing Center, whose stimulating suggestions and encouragement helped me
throughout the research and the writing of this thesis. I would like to thank the
entire HPCC unit, especially Stephanie, Sri Rangam, Dr. James Abbott, and Dr. David
Chaffin, for their constant support and encouragement.
I would like to thank The Virtual Vietnam Archive for the financial support I have
received during my tenure as a student. I would especially like to thank Steve Maxner,
Mary, Justin, Ty, and all other members for their support and encouragement.
I would like to thank my parents and other family members, who have always
supported me and encouraged me to aim higher.
My special thanks to Sachin, Rohit, Gauri, Vamshi, Doc, Pum, Vinay, Anuya, and
other members of the Raapchick group. Finally, a big thank you to the members of the
Distributed Computing Group, especially Vijay, Rajkumar, and Nitin, for their help,
suggestions, comments, and coffee.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ii
LIST OF FIGURES v
CHAPTER
I. INTRODUCTION 1
1.1 Motivation 3
1.2 Document Organization 5
II. PRELIMINARY STUDIES 7
2.1 Definitions and Terminologies 7
2.2 Legendre-Gauss Quadrature 9
2.3 Adaptive Quadrature 11
2.4 Preliminary Work on Adaptive Quadrature 13
III. MESSAGE PASSING INTERFACE 16
3.1 Packing and Unpacking Variables in MPI 17
3.2 Synchronous Communication in MPI 18
3.3 Asynchronous Communication in MPI 20
3.4 SGI Origin 2000 24
IV. DESIGN ISSUES AND IMPLEMENTATION 26
4.1 Algorithm Design 27
4.2 Serial Implementation of Adaptive Quadrature 28
4.2.1 Generation of Intervals and Behavior of Stack 29
4.2.2 Implementation for two and three-dimensional functions 30
4.3 Parallel Implementation of Adaptive Quadrature using blocking calls 33
4.4 Drawbacks 42
4.5 Parallel Implementation of Adaptive Quadrature using non-blocking calls 46
V. EXPERIMENTS AND RESULTS 51
VI. CONCLUSIONS 60
REFERENCES 62
LIST OF FIGURES
2.1 Adaptive Quadrature using Trapezoid Rule 12
3.1 Hyper Cube Arrangement of SGI Origin 24
4.1 Working of the stack 30
4.2 Two and Three Dimensional Integration 31
4.3 Initial distribution of intervals 38
4.4 Sub-intervals stored on the stack 39
4.5 Sub-intervals distributed to the slave processors 40
4.6 Termination of the parallel program 41
4.7 Problem with blocking communication calls 44
5.1 Speedup for P processors over 2 processors 53
5.2 Number of subintervals with P processors for PAQNBC 54
5.3 PAQNBC with normalized speedup 56
5.4 Speedup for P processors over 2 processors for PAQNBC and PAQBC 57
5.5 Speedup for P processors over 2 processors with processor P2 executing a dummy for loop 58
5.6 Speedup for P processors over 2 processors with the slave processors executing a dummy for loop based on the value of a random number 59
CHAPTER 1
INTRODUCTION
In scientific computing and numerical analysis, the term numerical integration is used
to describe a broad family of algorithms for calculating the numerical value of a definite
integral, and by extension, the term is also sometimes used to describe numerical
algorithms for solving differential equations.
Numerical integration of a function is performed for three main reasons. First, the
function may be known only at certain discrete points, such as those obtained by
sampling; several embedded systems and other computer applications may need a lot of
numerical integration for this reason. Second, while the function is known, it may be
impossible to calculate the integral analytically, because a primitive function or
antiderivative, which is needed for the integration, cannot be obtained. One example of
such a function is the probability density function of the normal distribution. Third, the
function is known, but it is too hard to solve analytically, and we want to fall back on
approximation.
The Newton-Cotes formulas are the most commonly used numerical integration
methods. They are based on replacing a complicated function by some approximating
function that can be integrated easily. The Trapezoidal rule and the Simpson rule are
examples of the Newton-Cotes formulas [13]. Newton-Cotes formulas require the
evaluation of the integral at equal intervals. Alternative methods termed Gaussian
Quadrature [13] methods have been proposed that select irregularly-placed evaluation
points, chosen to determine the integral as accurately as possible. The most popular form
of quadrature uses Legendre Polynomials [13] to approximate a function f(x). Some other
Gauss Quadrature formulas are Gauss-Chebyshev formulas, Gauss-Hermite formulas,
and Gauss-Laguerre formulas.
In general, care must be taken to match the numerical integration method to the
expected nature of the function f(x). For example, it may be known that f(x) is regular.
On the other hand f(x) may be singular or oscillatory and will then need special
treatment. Often a special method called a product integration method can be developed
for the integration of functions of the form f(x) = w(x)g(x) where w(x) is a pre-set
function and the function g(x) is known to be a relatively nice function.
Adaptive quadrature is a numerical integration procedure in which the interval of
integration is recursively subdivided until a specified error tolerance is met for the
approximate integral on each subinterval. The error estimate for a given subinterval is
based on the difference between two different quadrature mles applied on the subinterval.
This research features an algorithm that calculates the integral of a given function,
using Adaptive Gaussian Quadrature in a parallel environment, with intervals stored in a
list at a central repository accessible to all processors. Our featured implementation is
designed for dynamic load balancing to increase the efficiency of the algorithm. We
specifically compare our algorithm both to a serial implementation (to assess "speed-up")
and to an implementation using synchronous communication (to assess efficiency).
1.1 Motivation
We begin any parallelization process by identifying the parts of the program that
consume the most run time. The goal is to know which code should be parallelized and
which code should be recycled from the serial program. After the parts are identified, the
problem is partitioned into smaller tasks that can be executed in parallel.
There are two primary methods for partitioning a problem: Data Decomposition and
Functional Decomposition [10]. Data Decomposition requires partitioning the data and
then partitioning the computation based on the partitioned data. Functional
Decomposition requires partitioning the computation into smaller tasks and then
partitioning the data based on these tasks. This is common in problems where there are
no obvious data structures to partition, or where the data structures are highly
unstructured. Our featured algorithm and the algorithm referred to in [1] use data
decomposition to attain parallelism.
In recent years, motivated by different programming models, two approaches to
parallel numerical integration have emerged. One is based on adapting the ideas of
sequential globally adaptive algorithms [1] to the parallel context by selecting a number
of sub-regions of integration rather than simply the one with the largest associated error
estimate. The other approach proceeds by imposing an initial partitioning of the region of
integration and treats the resulting sub-problems as independent, and therefore capable of
concurrent solution. The more sophisticated of these latter algorithms include a
mechanism for detecting load imbalance and for redistributing work to neighboring
processors [1,2].
Parallelization of Adaptive Quadrature Based Integration Methods [14] is based on
the second method, i.e. the algorithm divides the initial partition into sub-regions and
gives these sub-regions to individual processors for computation. This implementation is
inefficient under certain conditions. Let us assume a simple integration problem: a
function and an interval are provided, and the objective is to calculate the integral over
that interval. The implementation designed in [14] proceeds by dividing the interval into
P-1 sub-regions, where P is the number of processors specified to calculate the integral.
In a master/slave paradigm, processor 0 is not used for computational purposes, hence P-1.
The calculated integral over a sub-region is either accepted as a valid answer, or the
sub-region is subdivided again, depending on the specified error condition. In both cases,
the result or sub-region is sent to the master processor (P0), and a new interval is
received for computation. Consider a case where P1 is slow: P0 executes a receive, but P1
does not execute the send until some time later. The function used by [14] for the
receive, MPI_Recv, is blocking. This means that when P0 calls MPI_Recv, if the message
is not available, P0 will not return from the receive function call. As a result, a
relatively faster processor, say P2, that has already executed its MPI_Send call will be
blocked, because P0 is blocked and is not able to execute or post the corresponding
MPI_Recv for the MPI_Send from P2. This has two implications: first, it would result in a
lot of idle time for faster processors, and second, the stack at P0 may have a lot of
intervals ready to be distributed across idle processors, but because of the blocking
nature of the call P0 is unable to do so. A similar situation will arise when a particular
sub-region takes a longer time to evaluate.
Many Computational Science applications show dynamic behavior. When a workload
is divided between processors, it is not necessary that all the processors complete the
work at the same time. Some processors will take less time, while others may take more.
Consider the case of a heterogeneous cluster, where processors are of different speeds;
the time required to complete the computation therefore differs, resulting in a load
imbalance across the system. Load imbalance can also result from the data decomposition
across the processors: there is a fair chance that the computation time is longer for a
certain data set after applying data decomposition. As a result, the processor handling
this data set would continue to work, while the other processors wait. MPI inherently
does not provide a solution to handle imbalance of such a nature. The situation worsens
when synchronous communication is used to design the solution to a problem, as in [14]:
because of one processor's Send/Receive, all the other processors will have to wait to
execute their communication calls, resulting in many idle processors. There is a need for
an efficient implementation that can handle load imbalances of this kind.
Many scientific applications, such as Digital Signal Processing, Fourier
Transformation, and Fluid and Gas Dynamics, involve integrating a function. It is
important that the scientific community have an algorithm that can integrate any given
function of one or more variables, in the fastest possible way, and with a desired
accuracy.
1.2 Organization of the Document
The rest of the document is organized as follows. Chapter 2 provides definitions and
terminologies used in this thesis. It also provides the mathematical background of Gauss
Quadrature and of Adaptive Quadrature. Section 2.4 of Chapter 2 gives a summary of the
work reported in the field of parallelizing quadrature rules. Chapter 3 deals with MPI
and the architecture of the SGI Origin machine. Chapter 4 contains the problem statement
of the thesis and gives details of implementation issues. Chapter 5 explains the
experiments conducted and the results obtained using the asynchronous (non-blocking)
algorithm and the synchronous algorithm. References are the last part of the report.
CHAPTER 2
PRELIMINARY STUDIES
This chapter defines, in section 2.1, some basic terms and notations used in the rest
of the report. Section 2.2 discusses Gauss Quadrature rules, the mathematical background
required to understand this report. Section 2.3 explains Adaptive Quadrature. Section 2.4
briefly summarizes the work already done in the field of adaptive quadrature and
parallelizing integration.
2.1 Definitions and Terminologies
Shared Memory: A model where parallel tasks all have the same picture of memory
and can address and access the same logical memory locations regardless of where the
physical memory actually exists. Pleione, an SGI Onyx2 with 46 300-MHz processors,
46 GB of shared RAM, and an HPC Linpack rating of 24 Gflops, will be used as the
testing bed for this research.
Distributed Memory: In contrast to shared memory, distributed memory is associated
with individual processors and a processor can only address its own memory [4].
Load Sharing: The division of load or tasks among subsystem components. Many times it
is used synonymously with load balancing.
Nonblocking: A procedure is nonblocking if it may return before the operation
completes, and before the user is allowed to reuse resources (such as buffers)
specified in the call. A nonblocking request is started by the call that initiates it,
e.g., MPI_ISEND.
Blocking: A communication routine is blocking if the completion of the call is dependent
on certain "events." For sends, the data must be successfully sent or safely copied to
system buffer space, so that the application buffer that contained the data is available for
reuse. For receives, the data must be safely stored in the receive buffer, so that it is ready
for use.
Local: A procedure is local if completion of the procedure depends only on the local
executing process.
Non-local: A procedure is non-local if completion of the operation may require the
execution of some MPI procedure on another process. Such an operation may require
communication occurring with another user process.
Collective: A procedure is collective if all processes in a process group need to invoke
the procedure. A collective call may or may not be synchronizing.
Singularity: In mathematics a singularity [13] is in general a point at which a given
mathematical object is not defined or lacks some "nice" property, such as
differentiability. In mathematics, the derivative of a function is one of the two central
concepts of calculus. The inverse of a derivative is called the antiderivative, or indefinite
integral.
Gradient: The gradient [13] can be defined as the rate of change, or slope, of a
function, generally referred to as dy/dx.
2.2 Legendre-Gauss Quadrature
This section gives the mathematical background of Gauss Quadrature [13] required
for understanding the thesis.
The Newton-Cotes formulas [13] are an extremely useful and straightforward family
of numerical integration techniques. To integrate a function f(x) over some interval [a, b],
divide it into n equal parts with endpoints x_0, x_1, ..., x_n, and let
f_n = f(x_n). Then find polynomials, which approximate the tabulated function, and
integrate them to approximate the area under the curve. To find the fitting polynomials,
use Lagrange interpolating polynomials. The resulting formulas are called Newton-Cotes
formulas, or quadrature formulas.
The Newton-Cotes [13] methods require the evaluation of the integrand at equal
intervals. Gauss Quadrature requires the evaluation of the integrand at specified, but
unequal, intervals. Gauss Quadrature is a powerful method of numerical integration and
its accuracy is much higher than that of the Newton-Cotes formulas. As such, it is not used to
integrate functions that are given in tabular form with equispaced intervals. The most
popular form of Gauss Quadrature is Gauss Legendre Quadrature. The method uses the
roots of Legendre polynomials to locate the points at which the integrand is evaluated.
In Gauss integration, the integral is evaluated by using the formula

    ∫_{-1}^{+1} f(x) dx = Σ_{i=1}^{n} w_i f(x_i)    (2.2.1)
where n is called the number of Gauss points, w_i are the unknown coefficients, also
called weights, and x_i are specific values of x, also called Gauss points, at which the
integrand is evaluated. For any specified n, the values of w_i and x_i are chosen so that
the formula will be exact for polynomials up to and including degree (2n - 1).
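As a numerical sketch of this exactness property (in Python, using NumPy's `leggauss` routine for the nodes and weights; the helper name and the test polynomial are our own), a 3-point rule integrates a degree-5 polynomial on [-1, +1] exactly, since 2n - 1 = 5 for n = 3:

```python
import numpy as np

def gauss_integrate(f, n):
    """Integrate f over [-1, +1] with an n-point Gauss-Legendre rule."""
    x, w = np.polynomial.legendre.leggauss(n)  # Gauss points x_i and weights w_i
    return np.sum(w * f(x))

# Degree 5 = 2n - 1 for n = 3, so the rule reproduces this integral exactly.
f = lambda x: 7 * x**5 + 3 * x**2 + 1
approx = gauss_integrate(f, 3)
exact = 4.0  # the odd term integrates to 0; 3x^2 contributes 2; the constant 2
```

The odd x^5 term vanishes because the Gauss points are symmetric about 0 with equal weights, and the even terms are reproduced exactly by the rule.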
As can be seen from equation 2.2.1, Gauss integration requires the range of
integration to be from -1 to +1. For convenience of notation, let the original coordinate
be y and the range of integration of f(y) be from a to b. Then the transformation

    x = (2y - a - b) / (b - a)    (2.2.2)

gives the normalized coordinates x = -1 when y = a, and x = +1 when y = b. The
transformation from x to y is given by

    y = ((b - a)x + a + b) / 2    (2.2.3)

Noting that dy = ((b - a)/2) dx, the original integral of f(y) dy from a to b can be
rewritten as

    ∫_a^b f(y) dy = ∫_{-1}^{+1} f(((b - a)x + a + b)/2) ((b - a)/2) dx    (2.2.4)

If x_i is a Gauss point of the normalized coordinate, the corresponding value y_i can be
determined as

    y_i = ((b - a)x_i + a + b) / 2    (2.2.5)

Since the weights w_i remain the same, the integral can be evaluated using the right-
hand side expression of Eq. (2.2.4).
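The transformation in Eqs. (2.2.3) and (2.2.4) can be sketched in Python (the function name is our own; `leggauss` again supplies the nodes and weights on [-1, +1]):

```python
import numpy as np

def gauss_ab(f, a, b, n=7):
    """Integrate f over [a, b] by mapping the Gauss points per Eq. (2.2.3)."""
    x, w = np.polynomial.legendre.leggauss(n)
    y = ((b - a) * x + a + b) / 2.0          # y_i from x_i, Eq. (2.2.3)
    return (b - a) / 2.0 * np.sum(w * f(y))  # factor (b - a)/2 comes from dy

# Example: the integral of cos(y) from 0 to pi/2 is exactly 1.
approx = gauss_ab(np.cos, 0.0, np.pi / 2)
```

The weights are untouched; only the evaluation points are mapped and the Jacobian factor (b - a)/2 is applied, exactly as in Eq. (2.2.4).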
In many engineering and other practical situations, we need to evaluate integrals over
two- or three-dimensional domains. Gauss Quadrature can be extended easily to two- and
three-dimensional integrations over rectangles. For example, we can evaluate a three-
dimensional integral with limits -1 and +1 using the tensor-product extension of
equation 2.2.1:

    ∫_{-1}^{+1} ∫_{-1}^{+1} ∫_{-1}^{+1} f(x, y, z) dx dy dz = Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} w_i w_j w_k f(x_i, y_j, z_k)    (2.2.6)
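The triple sum above can be sketched directly as three nested loops over the one-dimensional Gauss points (Python; the function name and the test integrand are our own):

```python
import numpy as np

def gauss3d(f, n=5):
    """Tensor-product Gauss rule over the cube [-1, +1]^3."""
    x, w = np.polynomial.legendre.leggauss(n)
    total = 0.0
    for i in range(n):              # triple sum over i, j, k as in the formula
        for j in range(n):
            for k in range(n):
                total += w[i] * w[j] * w[k] * f(x[i], x[j], x[k])
    return total

# Example: the integral of x^2 y^2 z^2 over the cube is (2/3)^3.
approx = gauss3d(lambda x, y, z: x**2 * y**2 * z**2)
```

Note that the cost grows as n^3 function evaluations, which is why efficient interval subdivision matters even more in higher dimensions.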
2.3 Adaptive Quadrature
"Adaptive quadrature is a numerical integration procedure in which the interval of
integration is recursively subdivided until an error tolerance is met for the approximate
integral on each subinterval," Michael T. Heath [8]. The error estimate for a given
subinterval is based on the difference between two distinct quadrature rules applied on
the subinterval. The two quadrature rules used in this research are the 7 point Gauss
rule and the 15 point Gauss Kronrod rule. A graphical explanation of adaptive quadrature
is available at [9]; the work is part of the Computational Science and Engineering
Program at the University of Illinois at Urbana-Champaign.
Adaptive quadrature using the trapezoidal rule can be explained with the help of the
following steps:
1. Define f(x), (a, b), and an error tolerance e.
2. Calculate I(a, b), I(a, (a+b)/2), and I((a+b)/2, b), where I indicates the integral
calculated using the trapezoidal rule.
3. If |I(a, b) - (I(a, (a+b)/2) + I((a+b)/2, b))| < e, then accept I(a, b) as the result.
4. Else divide (a, b) into (a, (a+b)/2) and ((a+b)/2, b).
5. Repeat the steps with the subintervals, if created.
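A minimal sketch of these five steps in Python (the recursion, the depth guard, and the choice to return the finer two-panel estimate are our own additions):

```python
import math

def trap(f, a, b):
    """One-panel trapezoidal estimate of the integral of f over [a, b]."""
    return (b - a) * (f(a) + f(b)) / 2.0

def adaptive_trap(f, a, b, eps, depth=0):
    """Steps 2-5: subdivide [a, b] until the error estimate is below eps."""
    m = (a + b) / 2.0
    whole = trap(f, a, b)                    # I(a, b)
    halves = trap(f, a, m) + trap(f, m, b)   # I(a, (a+b)/2) + I((a+b)/2, b)
    if abs(whole - halves) < eps or depth > 50:
        return halves                        # step 3: accept
    return (adaptive_trap(f, a, m, eps / 2.0, depth + 1) +   # step 4: divide
            adaptive_trap(f, m, b, eps / 2.0, depth + 1))    # step 5: repeat

approx = adaptive_trap(math.sin, 0.0, math.pi, 1e-6)  # the exact value is 2
```

Halving the tolerance at each split keeps the sum of per-subinterval error estimates bounded by the original eps.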
Figure 2.1 illustrates the function and the sub-division performed if the error
condition fails. In this figure, f(x) is integrated between the limits [a, b],
[a, (a+b)/2], and [(a+b)/2, b]; failure of the error tolerance condition results in the
two sub-divisions (a, (a+b)/2) and ((a+b)/2, b).
Figure 2.1 Adaptive Quadrature using Trapezoid Rule
The Trapezoid rule requires each interval to be sub-divided into n equal subdivisions
before calculating the integral; since the integral is computed over three intervals at
each step, 3n subdivisions are required.
The Gauss Quadrature rule integrates a function f(x) over the interval [a, b] by
evaluating the integrand at 7 unequally spaced points in [a, b]. The Gauss Kronrod 15
point rule uses the 7 points of the Gauss Quadrature rule and adds 8 more points to
improve the accuracy.
Therefore to reduce the number of functional evaluations the following strategy is
used:
1. Define f(x), (a, b), and an error tolerance e.
2. Calculate gk_15(a, b) and gq_7(a, b), where gk_15 denotes the integral value
calculated using the Gauss Kronrod 15 point rule and gq_7 denotes the integral value
calculated using the 7 point Gauss Quadrature rule.
3. If |gk_15(a, b) - gq_7(a, b)| < e, accept gk_15 as the result.
4. Else divide (a, b) into (a, (a+b)/2) and ((a+b)/2, b).
5. Repeat the steps with the subintervals, if created.
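The same pattern can be sketched in Python with two Gauss-Legendre rules of different order standing in for the Gauss/Gauss-Kronrod pair. Note this is a simplification of the strategy above: a true Kronrod rule reuses the 7 Gauss points, so the pair costs only 15 evaluations, whereas the 7- and 15-point Gauss rules below share no points:

```python
import numpy as np

def gauss_rule(f, a, b, n):
    """n-point Gauss-Legendre estimate of the integral of f over [a, b]."""
    x, w = np.polynomial.legendre.leggauss(n)
    y = ((b - a) * x + a + b) / 2.0
    return (b - a) / 2.0 * np.sum(w * f(y))

def adaptive_gauss(f, a, b, eps):
    """Accept the high-order estimate where the two rules agree to within eps."""
    low = gauss_rule(f, a, b, 7)     # plays the role of gq_7
    high = gauss_rule(f, a, b, 15)   # plays the role of gk_15
    if abs(high - low) < eps:        # step 3: error estimate passes
        return high
    m = (a + b) / 2.0                # step 4: split at the midpoint
    return adaptive_gauss(f, a, m, eps / 2) + adaptive_gauss(f, m, b, eps / 2)

# sqrt has unbounded derivatives at 0, so the interval [0, 1] is refined
# heavily near the left endpoint; the exact integral is 2/3.
approx = adaptive_gauss(np.sqrt, 0.0, 1.0, 1e-10)
```

The recursion concentrates subdivisions where the integrand is difficult, which is exactly the behavior that later makes the parallel workload unpredictable and motivates dynamic load balancing.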
The adaptive Gauss Quadrature uses 15 function evaluations per interval, which is
significantly fewer than the 3n function evaluations used by the Trapezoid rule. This
thesis uses adaptive Gauss Quadrature to calculate the integral of a given function.
2.4 Preliminary Work on Adaptive Quadrature
Much work has been reported in the field of parallel quadrature. Parallel Algorithms
for Multidimensional Integration [1] categorizes the algorithms as single list algorithms
and multiple list algorithms. The essence of single list algorithms is a selection phase
during which a number of sub-regions that require further refinement are identified,
followed by the concurrent application of the numerical integration (quadrature) rule to
these identified sub-regions. Multiple list algorithms are based on an initial static
distribution of the work; for example, for a machine with P processors the region of
integration is divided into P sub-regions and each sub-integral is assigned to a separate
processor.
Partitioning techniques are discussed in [2], which explains the static and dynamic
partitioning of intervals. Parallel Globally Adaptive Quadrature on the KSR-1 (Kendall
Square Research) [2] also discusses the numerical performance of their algorithm on the
KSR-1 machine. The dynamic partitioning of the integration intervals is addressed in
[3]; this paper also gives a broad idea of the advantages involved in subdividing the
interval.
Napierala and Gladwell [6] developed a parallel adaptive quadrature algorithm,
building on the Quadpack code QAG for one-dimensional problems and the NAG
D01FCF Genz-Malik code for multidimensional problems. They replaced the error
estimate ranking strategy with a tabulation approach to ranking.
The Center for Research on Parallel Computing and Supercomputers, CPS [4],
developed several parallel software packages for Multidimensional Quadrature. Their
packages are designed for MIMD distributed-memory platforms, such as massively
parallel processor systems and/or clusters of workstations and PCs, and are developed
using standard tools such as the FORTRAN and C languages and the BLACS or MPI
communication systems. The CPS package provides subroutines based on adaptive
algorithms. Such an approach attempts to evaluate the function mainly where the
integrand shows some difficulty (peaks, singularities, ...). The PAMIHR [4] subroutine
is based on a degree-nine
formula for integrals up to ten dimensions on hyper rectangular regions. CPS also
provides subroutine(s) based on Quasi Monte Carlo methods, suitable for a class of (very)
high dimensional integrals and subroutine(s) based on a deterministic approach. Such an
approach uses very regular sequences of nodes called Lattices, generalizing the
trapezoidal rule in high dimensions.
The aim of Parallelization of Adaptive Quadrature Based Integration Methods [14]
was to develop an adaptive, parallel algorithm for Gaussian Quadrature to evaluate
integrals in one and subsequently more dimensions. That research was divided into the
following parts:
1. Developing a serial algorithm for adaptive quadrature.
2. Parallelizing the serial algorithm using blocking communication calls.
3. Extending the algorithm to work in 2 dimensions.
4. Comparing the results of the parallel algorithm with a serial algorithm to verify the
efficiency gained.
5. Comparing the behavior of the parallel algorithm on the Origin 2000 machine.
CHAPTER 3
MESSAGE PASSING INTERFACE (MPI)
This chapter introduces some basic concepts of MPI. Section 3.1 discusses the MPI
Pack and Unpack functions. Section 3.2 discusses synchronous communication in MPI.
Section 3.3 treats asynchronous communication in MPI. Section 3.4 gives a brief
architectural description of the SGI Origin machine.
Message Passing Interface [11] is a paradigm, a standard, used widely on certain
classes of parallel machines, especially those with distributed memory. The attractiveness
of the message-passing paradigm at least partially stems from its wide portability.
Programs expressed in MPI may run on distributed-memory multiprocessors, networks
of workstations, and combinations of all of these. In addition, shared-memory
implementations are possible.
The interface is suitable for use by fully general MIMD programs, as well as those
written in the more restricted style of SPMD. Although no explicit support for threads is
provided, the interface has been designed so as not to prejudice their use.
A message passing function is simply a function that explicitly transfers data from
one process to another. The MPI communication calls assume that the processes are
statically allocated, i.e., the number of processes is set at the beginning of execution
and no additional processes are created during execution. An important design goal of MPI
was to allow efficient implementations across machines of differing characteristics. MPI
carefully avoids specifying how operations will take place. It only specifies what an
operation does logically. As a result, MPI can be easily implemented on systems that
buffer messages at the sender or receiver, or do no buffering at all. Implementations can
take advantage of specific features of the communication subsystem of various machines.
MPI guarantees that the underlying transmission of messages is reliable. The user need
not check whether a message is received correctly, thus relieving the programmer of
worrying about underlying communication details.
3.1 Packing and Unpacking Variables in MPI
MPI contains routines to pack and unpack data. These routines are MPI_Pack [11],
MPI_Unpack [11] and MPI_Pack_size [11]. An MPI implementation must provide these;
further, a user may send data that has been constructed with MPI_Pack with datatype
MPI_PACKED and receive it either with datatype MPI_PACKED or with any MPI
datatype with the same type signature that went into the packed data. Because of this,
the device must provide the routines to pack and unpack data. Of course, many
implementations of the device may use the model implementation's version of these
routines.
The MPI pack and unpack routines are designed to handle data on a communicator-
wide basis. That is, data is packed relative to a communicator; a natural implementation
is to pick a data representation that is a good choice for all members of the communicator
(including the sender). However, a common use of these routines in an implementation is
to pack and unpack data sent with the point-to-point operations [11].
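The idea of packing heterogeneous data into one contiguous buffer before a single send can be illustrated with Python's standard `struct` module (an analogy only, not the MPI API; the interval-plus-tag layout is a hypothetical example of what a master process might pack):

```python
import struct

# Pack one int and three doubles contiguously, as a master process might
# pack a work tag plus an interval (a, b) and a tolerance before one send.
buf = struct.pack("=i3d", 42, 0.0, 1.0, 1e-6)

# The receiver unpacks with the same type signature, mirroring MPI_Unpack.
tag, a, b, eps = struct.unpack("=i3d", buf)
```

As with MPI_PACKED data, the receiving side must unpack with a type signature matching what was packed.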
3.2 Synchronous Communication in MPI
Sending and receiving of messages by processes is the basic MPI communication
mechanism. The communication mechanism is synchronous [11] if the completion of the
call is dependent on certain "events". For sends, the data must be successfully sent or
safely copied to system buffer space, so that the application buffer that contained the data
is available for reuse. For receives, the data must be safely stored in the receive buffer, so
that it is ready for use.
The syntax of the blocking send is given below:
int MPI_Send( void* buf,
int count,
MPI_Datatype datatype,
int dest,
int tag,
MPI_Comm comm
)
The send buffer specified by the MPI_SEND [11] operation consists of count
successive entries of the type indicated by datatype, starting with the entry at address buf.
Note that we specify the message length in terms of number of elements, not number of
bytes. The data part of the message consists of a sequence of count values, each of the
type indicated by datatype. Count may be zero, in which case the data part of the message
is empty. The basic datatypes that can be specified for message data values correspond to
the basic datatypes of the host language.
In addition to the data part, messages carry information that can be used to distinguish
messages and selectively receive them. This information consists of a fixed number of
fields, which we collectively call the message envelope. These fields are source,
destination, tag, and communicator.
The syntax of the blocking receive is given below:
int MPI_Recv( void* buf,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm comm,
MPI_Status *status
)
The receive buffer consists of the storage containing count consecutive elements of
the type specified by datatype, starting at address buf. The length of the received message
must be less than or equal to the length of the receive buffer.
An overflow error occurs if all incoming data does not fit, without truncation, into the
receive buffer. If a message that is shorter than the receive buffer arrives, then only
those locations corresponding to the (shorter) message are modified. Even though no
specific behavior is mandated by MPI for erroneous programs, the recommended handling of
overflow situations is to return, in status, information about the source and tag of the
incoming message. The receive operation will return an error code. A quality
implementation will also ensure that no memory that is outside the receive buffer will
ever be overwritten.
3.3 Asynchronous Communication in MPI
One can improve performance on many systems by overlapping communication and
computation. This is especially true on systems where communication can be executed
autonomously by an intelligent communication controller. Light-weight threads are one
mechanism for achieving such overlap. An alternative mechanism that often leads to
better performance is Nonblocking Communication [11].
int MPI_Isend( void* buf,
int count,
MPI_Datatype datatype,
int dest,
int tag,
MPI_Comm comm,
MPI_Request *request
)
int MPI_Irecv( void* buf,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm comm,
MPI_Request *request
)
A call to a non-blocking send or receive simply starts, or posts, the communication
operation. It is then up to the user program to explicitly complete the communication at
some later point in the program.
Non-blocking send start calls can use the same four modes as blocking sends:
standard, buffered, synchronous, and ready. These carry the same meaning. Sends of all
modes, except ready, can be started whether a matching receive has been posted or not; a
nonblocking ready send can be started only if a matching receive is posted. In all cases,
the send start call is local: it returns immediately, irrespective of the status of other
processes. If the call causes some system resource to be exhausted, then it will fail and
return an error code. The send-complete call returns when data has been copied out of the
send buffer. It may carry additional meaning, depending on the send mode.
If the send mode is synchronous, then the send can complete only if a matching
receive has started. That is, a receive has been posted, and has been matched with the
send. In this case, the send-complete call is non-local. Note that a synchronous,
nonblocking send may complete, if matched by a nonblocking receive, before the receive
complete call occurs. (It can complete as soon as the sender "knows" the transfer will
complete, but before the receiver "knows" the transfer will complete.)
If the send mode is buffered then the message must be buffered if there is no pending
receive. In this case, the send-complete call is local, and must succeed irrespective of the
status of a matching receive. If the send mode is standard then the send-complete call
may return before a matching receive occurred, if the message is buffered. On the other
hand, the send-complete may not complete until a matching receive occurred, and the
message was copied into the receive buffer.
These calls allocate a communication request object and associate it with the request
handle (the argument request). The request can be used later to query the status of the
communication or wait for its completion. A nonblocking send call indicates that the
system may start copying data out of the send buffer. The sender should not access any
part of the send buffer after a nonblocking send operation is called, until the send
completes.
A non-blocking receive call indicates that the system may start writing data into the
receive buffer. The receiver should not access any part of the receive buffer after a
nonblocking receive operation is called, until the receive completes.
When using non-blocking communication it is essential to ensure that the
communication has completed before making use of the result of the communication or
re-using the communication buffer. Completion tests come in two types:
WAIT type These routines block until the communication has completed. They are
useful when the data from the communication is required for the computations or the
communication buffer is about to be re-used. Therefore a non-blocking communication
immediately followed by a WAIT-type test is equivalent to the corresponding blocking
communication.
MPI_Wait(request, status)
This routine blocks until the communication specified by the handle request has
completed. The request handle will have been returned by an earlier call to a non-
blocking communication routine.
TEST type These routines return a TRUE or FALSE value depending on whether or
not the communication has completed. They do not block and are useful in situations
where we want to know if the communication has completed but do not yet need the
result or to re-use the communication buffer i.e. the process can usefully perform some
other task in the meantime.
MPI_Test(request, flag, status)
In this case the communication specified by the handle request is simply queried to
see if the communication has completed, and the result of the query (TRUE or FALSE) is
returned immediately in flag.
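The distinction between WAIT-type and TEST-type completion can be illustrated, by analogy only, with Python futures (this is not MPI code): `done()` polls without blocking, like MPI_Test, while `result()` blocks until completion, like MPI_Wait. The function name `work` and the sleep durations are arbitrary choices for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def work():
    time.sleep(0.05)          # stands in for a message transfer in progress
    return 42

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(work)   # like posting a non-blocking operation

    # TEST style: poll without blocking; the process could do other
    # useful work between polls (analogous to MPI_Test returning FALSE)
    while not fut.done():
        time.sleep(0.005)

    # WAIT style: result() blocks until completion, like MPI_Wait
    print(fut.result())       # -> 42
```

The polling loop is where a master process would, for example, distribute more work instead of idling.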
3.4 SGI Origin 2000
The SGI Origin is a Scalable Shared-Memory Processor (SSMP) system, offering the
benefits of both shared-memory and distributed system architectures. The CPUs in the
system are connected using a hub to form a node. Multiple nodes are connected together
to form a complete system. The Origin architecture is an instantiation of the Cache-Coherent
Non-Uniform Memory Access (CC-NUMA) architecture.
Figure 3.1 Hypercube Arrangement of SGI Origin
CC-NUMA stands for cache coherent Non-Uniform Memory Access. Memory is
physically distributed throughout the system; as a result memory and peripherals are
globally addressable. Local memory accesses are faster than remote accesses. Local
accesses on different nodes do not interfere with each other.
The MIPS R10000 is a superscalar RISC processor used in several SGI product lines,
from desktops to large parallel systems. The R10000 is not much faster than its
predecessor, the R8000, but it was designed to operate efficiently with cache and in the
NUMA environment.
The R10000 is 4-way superscalar; it can fetch and decode 4 instructions per cycle to be
scheduled to run on its five independent, pipelined execution units:
1. a non-blocking load/store unit,
2. two 64-bit integer ALUs (Arithmetic and Logic Units),
3. a 32/64-bit pipelined floating-point adder,
4. a 32/64-bit pipelined floating-point multiplier.
The R10000 is a cache-based RISC CPU, and programs must utilize the cache well if
they are to run efficiently on the Origin. The Origin 2000 at the HPCC at Texas Tech
University is a 46-node machine. The operating system is IRIX 6.4 and it uses LSF
(Load Sharing Facility) for load distribution. The processing speed of each processor is
300 MHz.
CHAPTER 4
DESIGN ISSUES AND IMPLEMENTATION
This chapter describes in detail the various strategies that were developed and
implemented for this research work. The issues addressed are the creation of a serial
algorithm for one to three dimensions, parallelizing the serial algorithm using blocking
communication calls, the drawbacks of this parallelized code, and designing and
implementing a parallel algorithm using asynchronous communication. This latter
algorithm is demonstrably superior in many situations, and never significantly worse than
its competitors.
One of the objectives accomplished in this research is the implementation of
adaptive quadrature integration in an efficient manner such that sub-intervals of the
integration (tasks) can be allocated depending on the turnaround time of each processor.
All processors ready for work are assigned new tasks as long as there are tasks to be
assigned. In [14], integrals of functions of up to two variables were reported. This work
extends the integration dimension to three variables, increasing the scope of practical
utilization of a parallelized adaptive quadrature integration method. More importantly,
this work implements and benchmarks a load-balancing algorithm that is significantly
more efficient than the previous implementation, which was based on synchronous
communication. Since heterogeneous execution behavior cannot be predicted in many
cases, we feel that this programming paradigm deserves more study.
4.1 Algorithm Design
The aim of this work is to increase efficiency with respect to the implementation
reported in [15]; therefore certain implementation-based enhancements are required.
These enhancements are the use of asynchronous communication and a reduction in the
number of send/receive calls.
The previous work [15] uses a master/slave paradigm to implement a synchronous
blocking algorithm. Under this paradigm the master processor (P0) is responsible for storing
the intervals (tasks) and distributing them across idle processors. The master processor
uses a stack to store sub-intervals. The slave processor (Pi, where i is any processor
other than P0) is responsible for performing the computation and reporting the result to
P0.
The design and implementation of this work can be broadly classified into the
following three stages:
1. Design and implementation of a serial adaptive quadrature algorithm,
2. Design and implementation of the parallel adaptive quadrature algorithm using
blocking calls,
3. Design and implementation of the parallel adaptive algorithm using non-blocking
calls.
We benchmark (2) and (3) and demonstrate that (3) results in an implementation that
is able to handle variable load conditions across the processors.
4.2 Serial Implementation of Adaptive Quadrature
The development of the serial code formed the basis of this research work. For
simplicity we will present the serial implementation for one dimension only. The serial
code for adaptive quadrature is explained in the following paragraphs.
For any given function, interval [a, b], and error tolerance e, gk15 (the 15-point
Gauss-Kronrod estimate) and gq7 (the 7-point Gauss estimate) are calculated. The
integral value abs(gk15) is also calculated, that is, the absolute value of the
integral using the 15-point Gauss-Kronrod rule.
The values of the integrals obtained are then checked for the error tolerance e in order
to decide whether to accept the result or reject it and consequently create additional sub-
intervals.
The following condition is used in the implementation of the adaptive quadrature:

if (fabs(gk15 - gq7) > e(sqrt(e)(lb - la) + abs(gk15)))

then sub-intervals are created; else the result gk15 is accepted. The limits la and lb are
local (sub-interval) limits at the slave processor. The right-hand side of the condition
distributes the error for an interval/sub-interval. The error e specified for the initial
interval [a, b] is scaled for any additional sub-intervals that are created. The condition
stated above provides the error scaling by multiplying the specified error with the
sub-interval length. Note that the intervals created in this process are stored on the stack,
and the entire process is repeated until the stack is empty.
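The stack-driven loop above can be sketched in Python. This is a minimal illustration, not the thesis code: Simpson's rule on the whole interval and on the two halves stands in for the gq7/gk15 pair, and the error test mirrors the scaled condition of this section.

```python
import math

def adaptive_quad(f, a, b, eps):
    """Stack-driven adaptive quadrature, mirroring section 4.2.
    `coarse` plays the role of gq7 and `fine` the role of gk15."""
    def simpson(lo, hi):
        mid = 0.5 * (lo + hi)
        return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))

    result = 0.0
    stack = [(a, b)]                      # intervals awaiting processing
    while stack:                          # repeat until the stack is empty
        la, lb = stack.pop()
        c = 0.5 * (la + lb)
        coarse = simpson(la, lb)                  # low-order estimate
        fine = simpson(la, c) + simpson(c, lb)    # high-order estimate
        # scaled error test, analogous to the condition above
        if abs(fine - coarse) > eps * (math.sqrt(eps) * (lb - la) + abs(fine)):
            stack.append((la, c))         # reject: push both halves
            stack.append((c, lb))
        else:
            result += fine                # accept the interval's estimate
    return result
```

For example, `adaptive_quad(math.sin, 0.0, math.pi, 1e-8)` converges to 2 to within the requested tolerance.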
4.2.1 Generation of Intervals and Behavior of Stack
Consider any function f(x) with an initial interval [a, b]. In one dimension, [a, b] can
be considered as a segment with end points a and b. If the calculated values of the
integral do not satisfy the error tolerance, then [a, b] is partitioned into [a, c] and [c, b],
where c = (a + b)/2.
The interval [a, c] is stored on the stack and integrals are calculated for the interval [c,
b]. If the calculated values fail to satisfy the error condition (section 4.2), then interval [c,
b] is subdivided into intervals [c, d] and [d, b]. The interval [c, d] is stored on the stack
and the integral values are again calculated for [d, b]. If the integral values for interval
[d, b] satisfy the error condition, then the integral value gk15 for this interval is added to
the result. The interval [c, d] is then popped from the stack and the integral values are
calculated for it. If these integral values satisfy the error condition, then [a, c] is
popped for processing; otherwise interval [c, d] is subdivided as explained before. This
process is repeated until the stack is empty.
The algorithm starts with an empty stack, as shown in figure 4.1(a). The stack while
interval [d, b] is being processed is shown in figure 4.1(b). When the program
terminates the stack is again as shown in figure 4.1(a). While interval [c, d] is being
processed the stack looks as shown in figure 4.1(c). The left column of the stack is the
lower limit and the right column is the upper limit of the integral.
[Figure 4.1 Working of the Stack: (a) the empty stack; (b) the stack holding [a, c] and
[c, d] while [d, b] is processed; (c) the stack holding [a, c] while [c, d] is processed.]
4.2.2 Implementation for Two and Three-Dimensional Functions
This section discusses the implementation details for two and three-dimensional
functions. The basic approach is the same, but the complexity increases from the fact that
the intervals created as a result of subdivision will be rectangles in the case of two-dimensional
integration (figure 4.2(a)) and cubes in the case of three-dimensional integration (figure 4.2(b)).
[Figure 4.2 Two and Three-Dimensional Integration: (a) a two-dimensional region with
corners [a, b] and [c, d]; (b) a three-dimensional region with corners [a, b, c] and
[d, e, f]; (c) sub-division of a two-dimensional region along its longest edge.]
In the case of two-dimensional integration the error tolerance condition is modified as
follows:

if (fabs(gk15 - gq7) > e(sqrt(e)(b - a)(d - c) + abs(gk15)))

where (b - a)(d - c) represents the area of the rectangle over which the integration is carried out.
For three-dimensional integration the error tolerance condition is modified as follows:

if (fabs(gk15 - gq7) > e(sqrt(e)(d - a)(e - b)(f - c) + abs(gk15)))

where (d - a)(e - b)(f - c) represents the volume of the region over which integration is carried
out.
The stack operation for two and three-dimensional integration handles four and six
variables respectively, compared to two in one-dimensional integration.
An important feature to discuss here is the way sub-divisions are created when the
error tolerance condition fails for two and three-dimensional integration. In one-dimensional
integration, interval [a, b] results in [a, c] and [c, b], where c = (a + b)/2. The
implementation could have proceeded for two and three-dimensional integration without
any changes, i.e., sub-dividing along the first dimension. After experimenting with
different functions, it was concluded that the sub-intervals should instead be created
along the longest edge, as seen in figure 4.2(c) for a two-dimensional case.
Consider a three-dimensional integration: the longest edge is the dimension for
which the absolute value of the difference between the upper and lower limit is greatest.
For the intervals shown in figure 4.2(b), the longest edge will be

max(abs(d - a), abs(e - b), abs(f - c)).

If the third dimension is returned as the longest edge, then the new sub-intervals will be
[a, b, c], [d, e, k] and [a, b, k], [d, e, f], where k = (c + f)/2. Note that the remaining two
dimensions are the same after the sub-division is performed.
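The longest-edge rule can be sketched as follows. This is an illustrative helper of my own naming (`split_longest_edge`), not thesis code; a box is represented by its lower and upper corner coordinates.

```python
def split_longest_edge(lo, hi):
    """Split the box with lower corner `lo` and upper corner `hi` along its
    longest edge, as described in section 4.2.2. Returns two (lo, hi) boxes."""
    # dimension with the greatest |upper - lower|
    d = max(range(len(lo)), key=lambda i: abs(hi[i] - lo[i]))
    k = 0.5 * (lo[d] + hi[d])             # midpoint along the longest edge
    hi1, lo2 = list(hi), list(lo)
    hi1[d] = k                            # first half ends at k in dimension d
    lo2[d] = k                            # second half starts at k in dimension d
    return (list(lo), hi1), (lo2, list(hi))
```

For the box [0, 1] x [0, 2] x [0, 4], the longest edge is the third dimension, so the split point is k = 2 and the remaining two dimensions are unchanged in both halves.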
4.3 Parallel Implementation of Adaptive Quadrature using Blocking Calls
After creating the serial algorithm, MPI is used to write a parallel version of the
adaptive quadrature algorithm. Most of the work in this part of the thesis is based on the
work reported in [15]. In [15] the parallel implementation of adaptive quadrature using
blocking calls is reported, but due to efficiency considerations, the algorithm was
re-implemented for this thesis.
Consider the case of a three-dimensional integration. The computation requires
six variables to be sent and received between the slave and the master processor.
Six variables are required because finite integration requires every variable/dimension
to have a lower and an upper limit, and there are three dimensions in this problem.
MPI_Pack and MPI_Unpack are used for packing and unpacking variables. Packing the
variables for the send operation and unpacking them upon receive reduces the number of
message-passing calls to one. In contrast, in [15] the MPI_Pack and MPI_Unpack
functions are not used to reduce the number of send/receive operations.
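The effect of packing can be sketched with Python's `struct` module, as an analogy for MPI_Pack/MPI_Unpack rather than the MPI API itself: the six limits of a three-dimensional region travel in one buffer, so a single send/receive carries them all. The tuple of limit values here is arbitrary example data.

```python
import struct

# Pack the six limits of a 3-D integration region into one buffer, so a
# single message carries them all (analogous to MPI_Pack on the sender).
limits = (0.0, 1.0, -1.0, 1.0, 0.5, 2.5)    # (xa, xb, ya, yb, za, zb)
buf = struct.pack('6d', *limits)            # one 48-byte message payload
assert len(buf) == 48                       # 6 doubles x 8 bytes each

# Receiver side: recover the six limits (analogous to MPI_Unpack).
unpacked = struct.unpack('6d', buf)
assert unpacked == limits
```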
Let's assume there are P processors, P0 to Pp-1, where processor P0 acts as the master.
All the other processors P1 ... Pp-1 interact only with the master processor P0. The parallel
implementation can be explained with the help of the following pseudo codes:
1. Pseudo code to explain the initial distribution of intervals,
2. Pseudo code to explain the work assigned to the slave processors,
3. Pseudo code to explain the role of the master processor.
1. Initial distribution of intervals.
The following pseudo code describes the procedure employed to distribute the initial
interval [a, b] across P-1 processors, for a one-dimensional integration problem.
if (my_rank != 0)
{
    if (number of dimensions < 2)
    {
        /* Divide the initial interval between p-1 processors */
        la = a + ((b - a)/(p - 1)) * (my_rank - 1);
        lb = la + (b - a)/(p - 1);
    }
    else /* two or three dimensions */
    {
        find the longest edge and divide along the longest edge
    }
}
Variables la and lb are the local lower and upper limits. In the pseudo code,
(b - a)/(P - 1) defines the size of the interval given to every processor P1 ... Pp-1. So
a + ((b - a)/(P - 1))(my_rank - 1) determines the lower limit of the interval for processor
my_rank, and adding (b - a)/(P - 1) to this determines the upper limit of the interval. Let's
assume we use three processors, P0, P1 and P2, to calculate an integral for function f(x)
across the interval [-1, 1]. The interval [-1, 1] is distributed to P1 and P2; P0 is not
given any interval since it is the master processor. The lower limit for P1 will be -1, as
seen by evaluating -1 + ((1 - (-1))/2)(1 - 1) = -1, and the
upper limit for P1 will be 0, as obtained from -1 + (1 - (-1))/2 = 0. Similarly, for P2 the lower
limit will be 0 and the upper limit will be 1.
For two and three-dimensional problems the initial distribution of intervals is created
along the longest edge (section 4.2.2); the remaining dimensions are unchanged. It is
important to note here that the master processor has no involvement at the time of the initial
distribution of the intervals.
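The one-dimensional distribution formula above can be written as a small helper (an illustrative function of my own naming, not thesis code):

```python
def initial_interval(a, b, my_rank, p):
    """Initial one-dimensional sub-interval assigned to slave `my_rank`
    (1 <= my_rank <= p-1) out of p processors, per the formula above.
    The master (rank 0) receives no interval."""
    size = (b - a) / (p - 1)          # equal share per slave processor
    la = a + size * (my_rank - 1)     # local lower limit
    return la, la + size              # (la, lb)
```

With three processors and the interval [-1, 1], this reproduces the worked example: P1 gets (-1, 0) and P2 gets (0, 1).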
2. Work allocation to slave processors
The following pseudo code gives a brief description of the work allocation to the
slave processors. The slave processors receive the interval(s), calculate the integral, and,
depending on the error tolerance condition, send the result or sub-intervals to P0.
if (my_rank != 0)
{
    while (1)
    {
        Receive a flag from P0 using MPI_Recv;
        if (flag == 1)
            /* indicates P0 has no work to send, so break the while loop */
            break;
        if (flag == 2)
        {
            receive interval in MPI_Pack format from P0 using MPI_Recv;
        }
        Calculate gk15 and gq7 for the sub-interval(s) received;
        if the results are within the tolerance limit, send the result using MPI_Send;
        else subdivide and send the two sub-intervals to P0 using MPI_Send;
    }
}
There are two important points to note here. First, the slave processors already have an
interval to work on during the first iteration of the while loop, as explained in the previous
paragraph(s), so the code is designed in such a way that the receive calls are not executed
during the first iteration of the while loop. Second, the sub-division results in two sub-
intervals, and both of them are sent to the master processor P0.
The intervals are received in the MPI_Pack format, so they are unpacked into
variables before the computation starts. The packing and unpacking reduce the number
of send/receive operations and increase the efficiency of the program.
3. The role of the master processor
The following pseudo code describes the role of the master processor, Po.
if (my_rank == 0)
{
while(l)
{
for i = 1 to p-1
{ Receive result/sub-intervals (data) in MPI_Pack format from the slave processors using MPI_Recv.
Unpack the data using MPI_Unpack.
If the data is an integral value then add the value to the temporary result.
Else store the intervals on the stack.
}
If the stack is empty,
- inform the slave processors; this acts as a terminating condition.
- break from the while loop.
Else distribute tasks (intervals) to the slave processors based on the availability of intervals and
processors.
}
}
As seen from the pseudo code, the master processor does the following: receives the
results/intervals from every slave processor, stores the intervals on the stack, and
distributes the intervals (if any) to the slave processors. The intervals are distributed to
every slave processor provided there are enough intervals on the stack to distribute. There
could be a situation where the number of intervals on the stack is less than P-1. The
code is adapted to this situation: before sending an interval the master processor sends a
flag to the slave processor, informing the slave processor of what it is sending. The slave
processor, depending on the value of this flag, knows whether to post a receive call for the
intervals.
The distribution of intervals by the master processor takes into account the dimension
of the application under execution; so, for one-dimensional integration, the master
processor pops two values from the stack, packs the values using MPI_Pack, and sends
this packet to the slave processor.
The working of the master/slave processor(s) configuration and the stack operations
during the execution of the program is shown in four stages. Let's assume a case where an
integral is being calculated for a function f(x) by 5 processors over an interval [a, b].
Figure 4.3 shows the diagram after the initial distribution and before the computing
starts at the slave processors P1 ... P4. It shows that the stack is empty and the slaves have
intervals to process.
[Figure 4.3 Initial distribution of intervals: the master P0 holds an empty stack; the
slaves hold P1: [a, b1], P2: [b1, b2], P3: [b2, b3], P4: [b3, b].]
Figure 4.4 shows the diagram when the slaves have calculated the integral values,
checked the error tolerance, and sent the sub-intervals/results back to the master
processor P0. The figure also shows the state of the stack at the master processor after it
has stored the intervals. Let's assume that P1 and P3 are the processors sending results,
and P2 and P4 are sending sub-intervals. So [b1, b2] is sub-divided into [b1, b4] and
[b4, b2], and [b3, b] is sub-divided into [b3, b5] and [b5, b].
[Figure 4.4 Sub-intervals stored on the stack: the master P0 holds, from the top of the
stack down, [b5, b], [b3, b5], [b4, b2], [b1, b4]; the slaves P1-P4 are idle.]
Figure 4.5 shows the diagram after the master has distributed the intervals to the slave
processors. The distribution of intervals uses a for loop that pops an interval from the
stack and distributes it to P1, then pops another and distributes it to P2, and so on. After
distributing all intervals, the stack is empty. It is possible that there are more intervals on
the stack than the number of processors; in that case, some intervals are left on the stack
until the distribution algorithm is next called by the master processor. A case may also
arise where the number of intervals is less than P-1; in that case the slaves are informed
that intervals are not sent and they take proper measures to handle the situation (such as
sending dummy results to the master).
[Figure 4.5 Sub-intervals distributed to the slave processors: P1: [b5, b], P2: [b3, b5],
P3: [b4, b2], P4: [b1, b4]; the stack is empty.]
Figure 4.6 shows the stage when the computation is complete, all processors have
returned their results to the master processor, and the program is ready to terminate.
[Figure 4.6 Termination of the parallel program: all slaves P1-P4 are idle.]
4.4 Drawbacks
The algorithm discussed in the previous section implements a parallel algorithm using
blocking communication calls. The algorithm follows a linear speedup when compared
with a serial algorithm. The performance reported in [14] show a considerable speedup
and accuracy, expected from the parallel implementation of adaptive quadrature. As with
every implementation there are certain drawbacks associated with the implementation
reported in [14]. These drawbacks are because of the use of blocking communication
calls (MPIRecv and MPISend).
The algorithm discussed in section 4.3 for the blocking case, and in [15], performs
inefficiently under the following conditions:
1. When the program runs in a heterogeneous environment. An example of this
condition is a parallel code running on a grid, where processors with
different architectures are networked for the purpose of solving complex, time-
consuming computations.
2. When the program runs on a cluster with processors of varying speed. In this
case the architectures are the same but the speeds of the processors differ.
3. When the program runs on a set of processors that are competing for resources,
implying variable load conditions. This condition can arise on a
heterogeneous/homogeneous cluster as well as on a supercomputing machine like
Pleione (section 2.4).
All the conditions stated above result in uneven times required by the slave
processors to do the required processing on an interval. The processing includes
calculating integral values, checking the error tolerance condition, sub-dividing if
required, sending results/sub-intervals, and waiting for the next set of intervals. The
problem can clearly be understood by looking at figure 4.7. Consider a case where
four processors are used for the computation of an integral. The parallelization, as already
explained, is achieved by data decomposition, so the same copy of the code is
executed on the different slave processors, and all the slave processors work on
different data sets.
Figure 4.7 shows a situation that will decrease the efficiency of the parallel program.
The boxes on the right-hand side of the figure represent the slave processors; the box on
the left-hand side represents the master processor. All the boxes in the figure have brief
descriptions of the activity the master/slave configuration performs during the
entire course of execution. Let's refer to the boxes by their names, i.e., P1, P2, P3 and P0. It
is possible that the slave processors are executing different parts of the same code at a
particular time; bold lines inside P1, P2, P3 describe this situation.
[Figure 4.7 summary. Master processor P0: for i = 1 to p-1, receive intervals/results;
store intervals on the stack; if a result, add it to the temporary value; if the stack is
empty, intimate the slaves to terminate; else inform the slaves that an interval will be
sent and, for j = 1 to p-1, pop an interval from the stack and send it to the jth processor.
Slave processors P1, P2, P3: receive interval; calculate integral; check for error; send
results/intervals; repeat until termination.]
Figure 4.7 Problem with the blocking communication calls
Let's assume that the master is waiting for the results/intervals from P1, but P1 is still
calculating the integral for some interval; as a result, the receive operation at the master
processor blocks. This does not allow the send operations at P2 and P3 to complete,
because the respective receives at the master processor are not posted (the for loop receives
the results/intervals serially). Hence P2 and P3 are idle, while the master processor may
have intervals to distribute. However, because of the blocking nature of the receive call,
the master is not able to distribute more intervals. This situation results in a loss of
efficiency.
There can be many situations similar to the one mentioned in the previous paragraph
where the blocking nature of the communication calls is responsible for the low efficiency of the
parallel program.
The for loop run by the master processor, which is responsible for receiving the
intervals/results from the slave processors, is identified as the bottleneck for the entire
communication in this parallel implementation. Successfully changing the behavior of
this for loop will increase the efficiency of the parallel algorithm. The desired behavior is
that none of the receives should block, so that the sends by the slave processors are
independent of each other.
The other modification required in the implementation concerns the sends issued when the
master processor distributes intervals to the slave processors. This change will make the
distribution of intervals faster. It is important to note here that the corresponding receive
at the slave processor should remain blocking, because computation should not be allowed
to proceed at a slave processor until it receives an interval from the master processor.
The use of nonblocking send/receive will alleviate most of the blocking problems
discussed so far. The next section discusses the implementation of the adaptive
quadrature algorithm using non-blocking communication.
4.5 Parallel Implementation of Adaptive Quadrature using Non-Blocking Calls
This section discusses the implementation details of parallel adaptive quadrature
using non-blocking communication calls. This implementation not only removes the
drawbacks associated with blocking communication calls, but also adds a new feature
that enhances the efficiency of the algorithm.
The implementation has the following features:
1. The initial distribution of intervals,
2. The use of MPI_Pack and MPI_Unpack to reduce the amount of communication,
3. The stack representation and handling by the master processor.
Implementation details of the behavior of the master/slave configuration are discussed
below.
1. Role of the slave processors
The following pseudo code explains the working of the slave processors.
while (1)
{
    /* this is the slave code */
    Receive message from P0 in MPI_Pack format using MPI_Recv;
    Unpack the message using MPI_Unpack;
    If the message contains the termination condition then break from the while loop;
    Calculate integral values gk15, abs(gk15) and gq7;
    While (the error tolerance condition for these values fails)
    {
        Sub-divide the interval into two sub-intervals;
        Send one half of the interval to the master processor using MPI_Isend;
        Calculate the integral values for the second interval;
    }
    Add the accepted integral to the temporary result;
    Inform the master that an interval is required;
}
This implementation is different from the working of the slaves in [15] and in the
blocking algorithm (section 4.3). In those implementations the slave processor sends both
sub-intervals to the master processor if the error condition fails after the integral is
calculated using the Gauss-Kronrod rule and the Gauss quadrature rule for an interval.
Consequently, the slaves are prevented from being overloaded. The overloading happens
because of some property of the function on that interval (e.g., a singularity) such that large
numbers of subsequent sub-divisions are created. The non-blocking implementation is
more efficient because only one of the sub-intervals is sent to the master processor; the other
sub-interval is retained by the slave processor for processing. The non-blocking
communication calls allow the slaves to start processing immediately on the next sub-
interval.
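The "keep one half, ship the other" behavior can be sketched as follows. This is an illustration with names of my own choosing (`slave_process`, `send_to_master`), not thesis code: the `send_to_master` callback stands in for the non-blocking MPI send, and Simpson's rule stands in for the GK15/G7 pair.

```python
import math

def slave_process(f, la, lb, eps, send_to_master):
    """One unit of slave work in the non-blocking scheme: when the error
    test fails, one half is shipped to the master via `send_to_master` and
    the slave keeps the other half, so it never sits idle waiting for work.
    Returns the accepted estimate for the half it finally accepts."""
    def simpson(lo, hi):
        m = 0.5 * (lo + hi)
        return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(m) + f(hi))

    while True:
        c = 0.5 * (la + lb)
        coarse = simpson(la, lb)
        fine = simpson(la, c) + simpson(c, lb)
        if abs(fine - coarse) <= eps * (math.sqrt(eps) * (lb - la) + abs(fine)):
            return fine                # accepted: add to the local result
        send_to_master((la, c))        # ship one half to the master
        la, lb = c, lb                 # keep processing the other half
```

For example, starting from [0, pi] with f = sin, the first shipped half is (0, pi/2) while the slave continues on (pi/2, pi).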
In [14] and in the blocking algorithm, a flag is used to indicate whether a result or a
sub-interval is being sent to the master processor. This implementation instead maintains a
local result variable; this variable contains the cumulative accepted results for all the intervals
the slave processor has worked upon. The master processor uses the MPI_Reduce operation
to collect these local values from the slave processors and sum them to get the final result.
The blocking call MPI_Recv is used to receive the intervals from the master
processor. This ensures that computation is not allowed to proceed on a slave
processor unless it has an interval to work upon.
The parallel code executed by the master processor proved much more difficult to
implement than anticipated. The pseudo code below describes the
working of the master processor. It is important to note here that non-blocking
communication calls have a parameter of type MPI_Request (section 3.3) that takes care
of completing the send/receive communication call, so execution may proceed to the
next line of code without waiting for the send/receive operation to complete. This
parameter is exploited by this implementation to get the desired improvement in
efficiency, permitting communication to overlap with computation and eliminating the
drawbacks associated with the blocking algorithm.
2. Role of the Master Processor
The notation used in the following pseudo code is as follows: P is the number of
processors used for the computation; i is any value between 1 and P-1 (0 is not included
because P0 is the master processor):
{   /* this is the master code */
    for i = 1 to p-1
        Post a receive for a message in MPI_Pack format from the ith slave processor using MPI_Irecv;
    while ((one or more processors are busy) or (stack has elements)) {
        for i = 1 to p-1 {
            MPI_Test the receive request associated with the ith slave processor;
            If the request is complete {
                MPI_Unpack the message;
                If the message indicates a request for intervals
                    If the stack has elements, send another interval to the ith processor;
                    Else put this processor in the free processor list;
                Else if the message is an interval, store it on the stack;
                Call MPI_Irecv for the next interval/request from the slave;
            } /* end if for checking request completion */
            If the ith processor is in the free processor list and the stack has elements,
                send an interval to the ith processor;
        } /* end for */
    } /* end while */
    Inform the slave processors that the program should terminate;
    MPI_Reduce the final result from the local variables;
} /* end of master code */
The algorithm for the master processor works as follows:
1. The master processor calls MPI_Irecv, a non-blocking call, to receive data from
the slave processors. There is negligible wait involved because of the nature of the non-
blocking calls.
2. Using a for loop construct, check the receive status for every slave processor.
The MPI_Test function is used to test the status (complete/pending) of the non-blocking
communication call. The non-blocking communication call has a parameter of type
MPI_Request; this parameter is called the communication handle and is responsible for
completing the communication call. The MPI_Test function takes as a parameter the
request handle associated with the receive request and returns true or false depending on
the status of the communication call. If the MPI_Test function returns true, it means that
the receive operation is complete.
3. Unpack the data using MPI_Unpack.
4. If the data contains a request for an interval, and there are intervals on the stack, pop
an interval and send it to the processor.
5. If the stack is empty, place the slave processor in the free processor list.
6. Else, if the data is an interval, store the data on the stack.
7. Post a receive call (MPIlrecv) to receive the next data packet from the slave
processor. This step ends the //construct started by MPITest function if it remms ttiae.
8. Check if any processor is free and there are intervals on the stack. If both
conditions are true, pop an interval from the stack and send it to the processor. This
step is executed only if step 5 was executed at the master processor. This step ends the for loop
started at step 2.
9. Check for the termination condition, i.e., the stack is empty and all processors are
free; if these conditions are true, send a termination signal (in the form of a message)
to the slave processors and collect the result from the slave processors using the
MPI_Reduce function.
Steps 2 to 7 remove the bottleneck (the for loop) in the communication between the
master and the slave processor(s). By using a non-blocking receive the program proceeds
without waiting, and MPI_Test makes sure that there is no more than one receive for each
corresponding send at the slave processor. In contrast, the implementation for the
blocking case (Section 4.2) and that in [1] must wait for each send and receive to complete.
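The master's stack and free-list bookkeeping described in steps 1 to 9 can be sketched in plain C. This is a minimal sketch, not the thesis code: the MPI calls are elided as comments, and the Interval, Stack and handle_message names are illustrative.

```c
#define MAX_INTERVALS 1024
#define MAX_PROCS 16

/* A 1-D interval [a, b]; the real code packs limits for three dimensions. */
typedef struct { double a, b; } Interval;

typedef struct {
    Interval items[MAX_INTERVALS];
    int top;
} Stack;

static void push(Stack *s, Interval iv) { s->items[s->top++] = iv; }
static Interval pop(Stack *s)           { return s->items[--s->top]; }
static int empty(const Stack *s)        { return s->top == 0; }

/* One pass of the master loop for slave i (steps 2-8).
 * msg_is_request: the slave asked for work; otherwise msg is a new interval. */
void handle_message(Stack *stack, int free_list[], int i,
                    int msg_is_request, Interval msg)
{
    if (msg_is_request) {
        if (!empty(stack)) {
            Interval iv = pop(stack);   /* step 4: pop and send */
            (void)iv;                   /* MPI_Send(iv) to slave i */
        } else {
            free_list[i] = 1;           /* step 5: mark slave i free */
        }
    } else {
        push(stack, msg);               /* step 6: store interval on stack */
    }
    /* step 7: re-post MPI_Irecv for slave i here */
    if (free_list[i] && !empty(stack)) {          /* step 8 */
        Interval iv = pop(stack);
        (void)iv;                       /* MPI_Send(iv) to slave i */
        free_list[i] = 0;
    }
}
```

The key property shown here is that a slave is marked free only when it requests work while the stack is empty, and step 8 re-dispatches it as soon as an interval arrives.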
CHAPTER 5
EXPERIMENTS AND RESULTS
This chapter presents the results of the experiments conducted on the parallel adaptive
quadrature algorithm using blocking communication calls (PAQBC) and the parallel
adaptive quadrature algorithm using non-blocking calls (PAQNBC).
The objective function used in the experiments is
f{x,y,z) = l,ifx- +y^ +z^ <\ = 0, otherwise
the limits for the integral are [x: -1, 1], [y: -1, 1] and [z: -1, 1]. The algorithm was tested
on different functions for the accuracy of the results obtained for one-, two- and three-dimensional
integration. The function f(x, y, z) is chosen as the objective function
because the discontinuities in this function on the boundary of the unit sphere result in
many sub-divisions.
The experiments performed can be categorized as follows:
1. Evaluation of f(x,y,z) using PAQNBC to demonstrate speedup.
2. Evaluation of f(x,y,z) using PAQNBC and PAQBC to compare speedup.
3. Evaluation of f(x,y,z) using PAQNBC and PAQBC to compare speedup in a
heterogeneous environment.
4. Evaluation of f(x,y,z) using PAQNBC and PAQBC to compare speedup under
variable load conditions.
A very important point to keep in mind while observing these results is that the
master processor does not participate in the actual computation or generation of sub-intervals.
The master is responsible only for allocating work and receiving
results/intervals. So when we say that we use two processors, it actually means just one
slave processor for computation and a master processor for allocation of work and
collection of results. The actual number of processors utilized for computation is always
one less than the total number of processors. The use of two processors corresponds to
the serial execution of the PAQNBC with one slave processor, but with added
communication overhead between the master and the slave.
Speedup is defined as the ratio of execution time using one processor to the execution
time using P processors [7]. Under the master/slave paradigm, speedup is defined as
follows:

S = (Execution time using 2 processors) / (Execution time using P processors)
A parallel program is said to have a linear speedup if the speedup S is equal to P. In
all our experiments, the maximum speedup that we can possibly see is P-1, as P0
(processor zero) is not involved in any computation.
Figure 5.1 shows that a speedup is obtained with an increase in the number of
processors used in PAQNBC. Speedup(2, P), where P = 2...10, is the ratio of the time taken
with 2 processors to the time taken with P processors, T(2)/T(P). The experiment was conducted 10 times
and the averaged data was used to plot the graph.
Figure 5.1 Speedup for P processors over 2 processors
The graph is expected to show a linear speedup, but at P = 4, 6, 8, 10 a sudden surge in
the speedup is observed. To find out the reason for this behavior, we calculated the total
number of sub-divisions created during the entire computation for P = 2 to 10. The total
number of subdivisions, C(P), created by PAQNBC is calculated by declaring a local
counter for every slave processor and incrementing this counter whenever a sub-interval is
created by the slave processor. Before the termination of the program, when all the slave
processors are free and the stack at the master processor is empty, the master processor
collects these counter values and adds them together to give the value of C(P). The
collection and addition are carried out simultaneously with the help of a collective
communication call, MPI_Reduce. The value C(P) is equal to the total number of
iterations performed by the slave processors. Figure 5.2 shows the graph between P and
C(P).
Note that the value of C(P) is not the same for every value of P. The reason for this
dissimilarity is the way intervals are distributed at the beginning of the execution (Section
4.2).
Figure 5.2 Number of subintervals with P processors for PAQNBC
The ideal way of starting the process of parallel integration would be placing the
initial interval on the stack at the master processor and then calling the interval
distribution algorithm. The distribution algorithm would have given this un-divided
interval to P1. The remaining processors, if P > 2, would have to wait until the call to the
distribution algorithm is again made by the master processor and there are enough
intervals on the stack to be distributed. This method is not the most efficient method of
distributing the intervals because this results in too many idle processors. The advantage
of this method is that the number of sub-divisions/iterations performed by the algorithm
is the same when we vary P from 2 to 10.
The blocking and non-blocking algorithms implemented for this thesis use the initial
distribution of the interval. Depending on the value of P, the initial interval is divided
into P-1 sub-intervals (panels), and each panel is given to one of the P-1 slave processors.
This initial distribution is responsible for the variation in the number of sub-intervals/iterations
created. Thus the sudden surge in the speedup at P = 4, 6, 8, 10 is
attributed to the initial distribution of the interval; at these values of P the initial
distribution results in a smaller number of subdivisions.
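The initial distribution step can be sketched as follows. This is a minimal 1-D version under stated assumptions: the function name, array-based output and the rounding guard on the last panel are illustrative, not taken from the thesis code.

```c
/* Split [a, b] into nslaves = P-1 equal panels; panel k is
 * [a + k*w, a + (k+1)*w] with w = (b - a) / nslaves. */
void initial_distribution(double a, double b, int nslaves,
                          double lo[], double hi[])
{
    double w = (b - a) / nslaves;
    for (int k = 0; k < nslaves; k++) {
        lo[k] = a + k * w;
        hi[k] = a + (k + 1) * w;
        /* in the real code: MPI_Send(panel k) to slave k+1 */
    }
    hi[nslaves - 1] = b;   /* guard against floating-point rounding at the right end */
}
```

Because the panel width depends on P, the adaptive subdivision proceeds differently for each P, which is exactly the variation in C(P) discussed above.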
The variation in the number of iterations required the normalization of the speedup by
dividing the time by the amount of work done. By dividing the time by the number of
iterations, we get time per iteration, which is essential to find out the amount of work
done for different values of P. The normalized speedup is the ratio of the time required per
iteration when the number of processors is 2 to the time required per iteration when the
number of processors is P. Therefore the normalized speedup can be mathematically
defined as

S_N = (T(2)/C(2)) / (T(P)/C(P)) = (T(2)/T(P)) * (C(P)/C(2))

Note that if the algorithm were to start with one interval on the stack, C(P) and C(2)
would be equal and the factor C(P)/C(2) would cancel.
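The normalized speedup can be computed directly from measured times and iteration counts; a small helper (the function name and the sample numbers are illustrative only):

```c
/* Normalized speedup S_N = (T(2)/C(2)) / (T(P)/C(P))
 *                        = (T(2)/T(P)) * (C(P)/C(2)).
 * t2, tp: wall-clock times with 2 and P processors;
 * c2, cp: total subdivision counts C(2) and C(P). */
double normalized_speedup(double t2, long c2, double tp, long cp)
{
    return (t2 / c2) / (tp / cp);
}
```

For example, if 2 processors take 100 s for 8000 subdivisions and P processors take 30 s for 9600 subdivisions, the time per iteration drops from 0.0125 s to 0.003125 s, a normalized speedup of 4.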
Figure 5.3 shows the graph obtained with normalized speedup using PAQNBC. For
all further experiments on PAQNBC and PAQBC normalized speedup is calculated and
plotted against P to check the performance of the algorithm. The speedup as expected is
P-1 because in the master/slave paradigm only P-1 processors are involved in the actual
computation. The speedup is linear in nature.
Figure 5.3 PAQNBC with normalized speedup
Similar experiments were conducted with the synchronous/blocking algorithm
PAQBC. Figure 5.4 shows the comparison of the speedup between PAQBC and
PAQNBC. Note that the PAQNBC algorithm performs slightly better than the PAQBC
algorithm.
Figure 5.4 Speedup for P processors over 2 processors for PAQNBC and PAQBC
The next experiment was conducted on the PAQNBC and PAQBC algorithms with the
processor P2 executing a dummy for loop. The processor P2, second in rank, is chosen taking into
account that the processor with rank 0 is the master processor. This would in theory make
processor P2 the slowest of all the processors. Figure 5.5 shows the comparison of the
speedup between PAQNBC and PAQBC.
Figure 5.5 Speedup for P processors over 2 processors with processor P2 executing a dummy for loop.
The graph shows that PAQNBC is adaptive to the situation where a processor is
running slow. The PAQBC algorithm completely loses its linearity in speedup, whereas
PAQNBC retains its linearity when the number of processors is increased to 4 and
beyond. It is important to note that for PAQNBC the cases with 2 and 3 processors
represent approximately the same situation. This can be attributed to the fact that when
P = 3, P1 is doing the maximum work because P2 is running slow. Therefore the speedup
for PAQNBC in this case is P-2 instead of P-1.
In the next experiment a slave processor executes a dummy for loop depending on the
value of a random number generated. This is achieved by inserting the following code
before the slave processor calculates integral values for an interval:
{
    srand(time(0));
    v = rand() / (RAND_MAX + 1.0);
    if (v < 0.2)
        my_sleep();
}
The above code generates a random number v that lies between 0.0 and 1.0 (exclusive)
and, depending on the value, executes a dummy for loop defined in the my_sleep() function.
The effect of inserting this code is that an unspecified number of the slave processors,
from 0 to P, can execute the dummy for loop. With probability 0.2, i.e. one fifth of the time,
a slave processor executes a dummy for loop before calculating the integral values for an interval. This
is very similar to a condition where the load is variable across the processors and
unpredictable. The load here also means that certain intervals take a longer computation
time than others. The graph in Figure 5.6 shows that PAQNBC performs better than
PAQBC under such conditions.
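The behaviour of the inserted code can be checked empirically: over many draws, rand()/(RAND_MAX + 1.0) always falls in [0.0, 1.0), and the dummy-loop branch is taken roughly one fifth of the time. A small simulation (count_sleeps is an illustrative helper with my_sleep() replaced by a counter, not part of the thesis code):

```c
#include <stdlib.h>
#include <time.h>

/* Draw n values the way the slave does and count how often the
 * dummy-loop branch (v < 0.2) would be taken. */
long count_sleeps(long n)
{
    long sleeps = 0;
    srand((unsigned)time(0));
    for (long i = 0; i < n; i++) {
        double v = rand() / (RAND_MAX + 1.0);
        if (v < 0.2)
            sleeps++;           /* my_sleep() would run here */
    }
    return sleeps;
}
```

Over 100,000 draws the observed fraction should be close to 0.2, confirming the one-fifth probability stated above.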
Figure 5.6 Speedup for P processors over 2 processors with the slave processors executing a dummy for loop based on the value of a random number.
CHAPTER 6
CONCLUSIONS
This thesis discussed the design and implementation of two parallel algorithms
PAQNBC and PAQBC. The PAQNBC algorithm uses non-blocking communication
calls, whereas the PAQBC algorithm uses blocking communication calls. The algorithms
were tested and from the experiments, the following conclusions can be drawn:
1. The PAQNBC algorithm performs better in a heterogeneous environment and in
variable load conditions.
2. The PAQNBC algorithm adaptively allocates tasks to processors such that the
fastest processor gets the maximum work. The PAQNBC minimizes the idle time or the
wait time for every processor to get a new task.
3. The PAQBC is a simple algorithm to design and use. PAQBC will perform
nearly as well as PAQNBC in a homogeneous environment or under invariable
load conditions.
There were certain issues related to non-blocking communication that were taken
care of during the implementation of the parallel code using non-blocking calls. These
issues were specific to the platform on which the code was running. This thesis used an
SGI machine with IRIX as the OS to experiment and validate the results. The number of
MPI non-blocking communication requests this environment is able to handle is limited.
By default this value is 32,768, but the function
f{x,y,z) = \,ifx^+y^+z^<\
= 0, othervrise
for the interval [x: -1, 1], [y: -1, 1] and [z: -1, 1], and for an error tolerance value of
10"^, required more requests than the default value. As a result the program failed every
time. After careful study of MPI, it was found that the MPI_Request_free function
supported by MPI should be used to free requests that are completed. Every receive
request that returns true at step 2 of the explanation given for the non-blocking master
processor is freed, and non-blocking send requests are freed immediately after the call.
The parallel adaptive quadrature algorithm can be implemented as a multi-list
algorithm, thus saving time in passing intervals/results between the master and slaves and
utilizing all P processors for computation instead of P-1. In this case, however, certain
issues such as task migration will come up in balancing load across processors. The results
of the multi-list algorithm can be compared with the single-list algorithm to determine the
most efficient way of parallelizing algorithms that use data decomposition to achieve
parallelism. The implementation can also be generically extended to support multi-dimensional
integration problems.
REFERENCES
1. Bull J.M., Freeman T.L., Parallel Algorithms for Multi-Dimensional Integration, Parallel and Distributed Computing Practices, vol. 1, no. 1, pp. 89-102, 1998.
2. Bull J.M., Freeman T.L., Parallel Globally Adaptive Quadrature on the KSR-1, Advances in Computational Mathematics, vol. 2, pp. 347-373, 1994.
3. Bull J.M., Freeman T.L. and Gladwell I., Parallel Quadrature Algorithms for Singular Integrals, Proceedings 14th World Congress on Computational and Applied Mathematics, IMACS, pp. 1136-1139, 1994.
4. Center for Research on Parallel Computing and Supercomputers, http://pixel.dma.unina.it/RESEARCH/pamihr.html, 1996.
5. Computational Science Education Project, http://csep1.phy.ornl.gov/ca/node21.html, 1996.
6. Gladwell Ian, Napierala Malgorzata, Multidimensional Numerical Integration, Mathematics Department, Southern Methodist University, http://www-fp.mcs.anl.gov/ccst/research/reports_pre1998/mcs/numerical_integration/napierala.html, January 2000.
7. Grobe M., "The Architecture and Use of the SGI Origin 2000", http://www.cc.ukans.edu/~grobe/docs/sgi-short-intro/index.shtml, University of Kansas, August 1997.
8. Heath, Michael T., Scientific Computing: An Introductory Survey, Second Edition, McGraw-Hill, New York, 2002.
9. Interactive Educational Modules in Computational Science, http://www.cse.uiuc.edu/eot/modules/integration/adaptivq, Computational Science and Engineering Program, University of Illinois at Urbana-Champaign, October 2004.
10. Maui High Performance Computing Center, http://www.mhpcc.edu/training/workshop/parallel_develop/MAIN.html#overview, September 2003.
11. MPI Forum, MPI: A Message-Passing Interface Standard, http://www.mpi-forum.org/docs/mpi-11-html/node2.html, University of Tennessee, Knoxville, 1994.
12. NA Research Area, http://www.maths.man.ac.uk/DeptWeb/Groups/NA/Parallel.html, Department of Mathematics, University of Manchester, UK, February 2001.
13. Rao Singiresu S., Applied Numerical Techniques for Engineers and Scientists, Prentice Hall Publications, 2002.
14. Pacheco P.S., Parallel Programming with MPI, Morgan Kaufmann Publishers Inc., San Mateo, CA, 1997.
15. Walawalkar, Milind, Parallelization of Adaptive Quadrature Rule-Based Integration Methods, Master's Thesis, Computer Science Department, Texas Tech University, 2003.
PERMISSION TO COPY
In presenting this thesis in partial fulfillment of the requirements for a master's
degree at Texas Tech University or Texas Tech University Health Sciences Center, I
agree that the Library and my major department shall make it freely available for
research purposes. Permission to copy this thesis for scholarly purposes may be
granted by the Director of the Library or my major professor. It is understood that any
copying or publication of this thesis for financial gain shall not be allowed without my
further written permission and that any user may be liable for copyright infringement.

Agree (Permission is granted.)

Student Signature    Date

Disagree (Permission is not granted.)

Student Signature    Date