AN EFFICIENT IMPLEMENTATION OF ADAPTIVE GAUSS
QUADRATURE RULE FOR PARALLEL ENVIRONMENT
by
VIVEK JAISWAL, B.E.
A THESIS
IN
COMPUTER SCIENCE
Submitted to the Graduate Faculty of Texas Tech University in
Partial Fulfillment of the Requirements for
the Degree of
MASTER OF SCIENCE
Approved
Chairperson of the Committee
Accepted
Dean of the Graduate School
December, 2004
ACKNOWLEDGEMENTS
I would like to express my gratitude to all those who made it possible for me to
complete this thesis. I would like to thank Dr. Noe Lopez-Benitez for his help, support,
and valuable discussions. His great enthusiasm and integral view of research have made
a deep impression on me.
I am deeply indebted to Prof. Dr. Philip Smith, Director of the High Performance
Computing Center, whose stimulating suggestions and encouragement helped me
throughout the research and the writing of this thesis. I would like to thank the
entire HPCC unit, especially Stephanie, Sri Rangam, Dr. James Abbott, and Dr. David
Chaffin, for their constant support and encouragement.
I would like to thank The Virtual Vietnam Archive for the financial support I have
received during my tenure as a student. I would especially like to thank Steve Maxner,
Mary, Justin, Ty, and all other members for their support and encouragement.
I would like to thank my parents and other family members, who have always
supported me and encouraged me to aim higher.
My special thanks to Sachin, Rohit, Gauri, Vamshi, Doc, Pum, Vinay, Anuya, and
other members of the Raapchick group. Finally, a big thank you to the members of the
Distributed Computing Group, especially Vijay, Rajkumar, and Nitin, for their help,
suggestions, comments, and coffee.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ii
LIST OF FIGURES v
CHAPTER
I. INTRODUCTION 1
1.1 Motivation 3
1.2 Document Organization 5
II. PRELIMINARY STUDIES 7
2.1 Definitions and Terminologies 7
2.2 Legendre-Gauss Quadrature 9
2.3 Adaptive Quadrature 11
2.4 Preliminary Work on Adaptive Quadrature 13
III. MESSAGE PASSING INTERFACE 16
3.1 Packing and Unpacking Variables in MPI 17
3.2 Synchronous Communication in MPI 18
3.3 Asynchronous Communication in MPI 20
3.4 SGI Origin 2000 24
IV. DESIGN ISSUES AND IMPLEMENTATION 26
4.1 Algorithm Design 27
4.2 Serial Implementation of Adaptive Quadrature 28
4.2.1 Generation of Intervals and Behavior of Stack 29
4.2.2 Implementation for two and three-dimensional functions 30
4.3 Parallel Implementation of Adaptive Quadrature using blocking calls 33
4.4 Drawbacks 42
4.5 Parallel Implementation of Adaptive Quadrature using non-blocking calls 46
V. EXPERIMENTS AND RESULTS 51
VI. CONCLUSIONS 60
REFERENCES 62
LIST OF FIGURES
2.1 Adaptive Quadrature using Trapezoid Rule 12
3.1 Hyper Cube Arrangement of SGI Origin 24
4.1 Working of the stack 30
4.2 Two and Three Dimensional Integration 31
4.3 Initial distribution of intervals 38
4.4 Sub-intervals stored on the stack 39
4.5 Sub-intervals distributed to the slave processors 40
4.6 Termination of the parallel program 41
4.7 Problem with blocking communication calls 44
5.1 Speedup for P processors over 2 processors 53
5.2 Number of subintervals with P processors for PAQNBC 54
5.3 PAQNBC with normalized speedup 56
5.4 Speedup for P processors over 2 processors for PAQNBC and PAQBC 57
5.5 Speedup for P processors over 2 processors with processor P2 executing a dummy for loop 58
5.6 Speedup for P processors over 2 processors with the slave processors executing a dummy for loop based on the value of a random number 59
CHAPTER 1
INTRODUCTION
In scientific computing and numerical analysis, the term numerical integration is used
to describe a broad family of algorithms for calculating the numerical value of a definite
integral, and by extension, the term is also sometimes used to describe numerical
algorithms for solving differential equations.
Numerical integration of a function is performed for three main reasons. First, the
function may be known only at certain discrete points, such as those obtained by
sampling; several embedded systems and other computer applications may need a lot of
numerical integration for this reason. Second, while the function is known, it may be
impossible to calculate the integral analytically, because a primitive function or
antiderivative, which is needed for the integration, cannot be obtained. One example of
such a function is the probability density function of the normal distribution. Third, the
function is known, but it is too hard to solve analytically, and we want to fall back on
approximation.
The Newton-Cotes formulas are the most commonly used numerical integration
methods. They are based on replacing a complicated function by some approximating
function that can be integrated easily. The Trapezoidal rule and the Simpson rule are
examples of the Newton-Cotes formulas [13]. Newton-Cotes formulas require the
evaluation of the integral at equal intervals. Alternative methods termed Gaussian
Quadrature [13] methods have been proposed that select irregularly-placed evaluation
points, chosen to determine the integral as accurately as possible. The most popular form
of quadrature uses Legendre Polynomials [13] to approximate a function f(x). Some other
Gauss Quadrature formulas are Gauss-Chebyshev formulas, Gauss-Hermite formulas,
and Gauss-Laguerre formulas.
In general, care must be taken to match the numerical integration method to the
expected nature of the function f(x). For example, it may be known that f(x) is regular.
On the other hand f(x) may be singular or oscillatory and will then need special
treatment. Often a special method called a product integration method can be developed
for the integration of functions of the form f(x) = w(x)g(x) where w(x) is a pre-set
function and the function g(x) is known to be a relatively nice function.
Adaptive quadrature is a numerical integration procedure in which the interval of
integration is recursively subdivided until a specified error tolerance is met for the
approximate integral on each subinterval. The error estimate for a given subinterval is
based on the difference between two different quadrature mles applied on the subinterval.
This research features an algorithm that calculates the integral of a given function,
using Adaptive Gaussian Quadrature in a parallel environment, with intervals stored in a
list at a central repository accessible to all processors. Our featured implementation is
designed for dynamic load balancing to increase the efficiency of the algorithm. We
specifically compare our algorithm both to a serial implementation (to assess "speed-up")
and to an implementation using synchronous communication (to assess efficiency).
1.1 Motivation
We begin any parallelization process by identifying the parts of the program that
consume the most run time. The goal is to know which code should be parallelized and
which code should be recycled from the serial program. After the parts are identified, the
problem is partitioned into smaller tasks that can be executed in parallel.
There are two primary methods for partitioning a problem: Data Decomposition and
Functional Decomposition [10]. Data Decomposition requires partitioning the data and
then partitioning the computation based on the partitioned data. Functional
Decomposition requires partitioning the computation into smaller tasks and then
partitioning the data based on these tasks. This is common in problems where there are
no obvious data structures to partition, or where the data structures are highly
unstructured. Our featured algorithm and the algorithm referred to in [1] use data
decomposition to attain parallelism.
In recent years, motivated by different programming models, two approaches to
parallel numerical integration have emerged. One is based on adapting the ideas of
sequential globally adaptive algorithms [1] to the parallel context by selecting a number
of sub-regions of integration rather than simply the one with the largest associated error
estimate. The other approach proceeds by imposing an initial partitioning of the region of
integration and treats the resulting sub-problems as independent, and therefore capable of
concurrent solution. The more sophisticated of these latter algorithms include a
mechanism for detecting load imbalance and for redistributing work to neighboring
processors [1,2].
Parallelization of Adaptive Quadrature Based Integration Methods [14] is based on
the second method, i.e. the algorithm divides the initial partition into sub-regions and
gives these sub-regions to individual processors for computation. This implementation is
inefficient under certain conditions. Let us assume a simple integration problem: a
function and an interval are provided, and the objective is to calculate the integral over
that interval. The implementation designed in [14] proceeds by dividing the interval into
P-1 sub-regions, where P is the number of processors specified to calculate the integral.
In a master/slave paradigm, processor 0 is not used for computational purposes, hence P-1.
The calculated integral over a sub-region is either accepted as a valid answer, or the
sub-region is subdivided again, depending on the specified error condition. In both cases,
the result or sub-region is sent to the master processor (P0), and a new interval is
received for computation. Consider a case where P1 is slow: P0 executes a receive, but P1
does not execute the send until some time later. The function used by [14] for the
receive, MPI_Recv, is blocking. This means that when P0 calls MPI_Recv, if the message
is not available, P0 will not return from the receive function call. As a result, a
relatively faster processor, say P2, that has already executed its MPI_Send call will be
blocked, because P0 is blocked and is not able to execute or post the corresponding
MPI_Recv for the MPI_Send from P2. This has two implications: first, it would result in a
lot of idle time for faster processors, and second, the stack at P0 may have a lot of
intervals ready to be distributed across idle processors, but because of the blocking
nature of the call P0 is unable to do so. A similar situation will arise when a particular
sub-region takes a longer time to evaluate.
Many Computational Science applications show dynamic behavior. When a workload
is divided between processors, it is not necessary that all the processors complete the
work at the same time. Some processors will take less time, while others may take more.
Consider the case of a heterogeneous cluster, where processors are of different speeds;
the time required to complete the computation therefore differs, resulting in a load
imbalance across the system. Load imbalance can also result from the data decomposition
across the processors: there is a fair chance that the computation time is longer for a
certain data set after applying data decomposition. As a result, the processor handling
this data set would continue to work, while the other processors wait. MPI inherently
does not provide a solution to handle imbalance of such a nature. The situation worsens
when synchronous communication is used to design the solution to a problem, as in [14]:
because of one processor's Send/Receive, all the other processors will have to wait to
execute their communication calls, resulting in many idle processors. There is a need for
an efficient implementation that can handle load imbalances of this kind.
Many scientific applications, such as Digital Signal Processing, Fourier
Transformation, and Fluid and Gas Dynamics, involve integrating a function. It is
important that the scientific community have an algorithm that can integrate any given
function of one or more variables, in the fastest possible way, and with a desired
accuracy.
1.2 Organization of the Document
The rest of the document is organized as follows. Chapter 2 provides definitions and
terminologies used in this thesis. It also provides the mathematical background of Gauss
Quadrature and of Adaptive Quadrature. Section 2.4 of Chapter 2 gives a summary of the
work reported in the field of parallelizing quadrature rules. Chapter 3 deals with MPI
and the architecture of the SGI Origin machine. Chapter 4 contains the problem statement
of the thesis and gives details of implementation issues. Chapter 5 explains the
experiments conducted and the results obtained using the asynchronous (non-blocking)
algorithm and the synchronous algorithm. References are the last part of the report.
CHAPTER 2
PRELIMINARY STUDIES
This chapter defines, in section 2.1, some basic terms and notations used in the rest
of the report. Section 2.2 discusses Gauss Quadrature rules, the mathematical background
required to understand this report. Section 2.3 explains Adaptive Quadrature. Section 2.4
briefly summarizes the work already done in the field of adaptive quadrature and
parallelizing integration.
2.1 Definitions and Terminologies
Shared Memory: A model where parallel tasks all have the same picture of memory
and can address and access the same logical memory locations regardless of where the
physical memory actually exists. Pleione, an SGI Onyx2 with 46 300-MHz processors,
46 GB of shared RAM, and an HPC Linpack rating of 24 Gflops, will be used as the
testing bed for this research.
Distributed Memory: In contrast to shared memory, distributed memory is associated
with individual processors and a processor can only address its own memory [4].
Load Sharing: The division of load or tasks among subsystem components. Many times it
is used synonymously with load balancing.
Nonblocking: A procedure is nonblocking if it may return before the operation
completes, and before the user is allowed to reuse resources (such as buffers)
specified in the call. A nonblocking request is started by the call that initiates it,
e.g., MPI_ISEND.
Blocking: A communication routine is blocking if the completion of the call is dependent
on certain "events." For sends, the data must be successfully sent or safely copied to
system buffer space, so that the application buffer that contained the data is available for
reuse. For receives, the data must be safely stored in the receive buffer, so that it is ready
for use.
Local: A procedure is local if completion of the procedure depends only on the local
executing process.
Non-local: A procedure is non-local if completion of the operation may require the
execution of some MPI procedure on another process. Such an operation may require
communication occurring with another user process.
Collective: A procedure is collective if all processes in a process group need to invoke
the procedure. A collective call may or may not be synchronizing.
Singularity: In mathematics a singularity [13] is in general a point at which a given
mathematical object is not defined or lacks some "nice" property, such as
differentiability. In mathematics, the derivative of a function is one of the two central
concepts of calculus. The inverse of a derivative is called the antiderivative, or indefinite
integral.
Gradient: The gradient [13] can be defined as the rate of change, or slope, of a
function, generally referred to as dy/dx.
2.2 Legendre-Gauss Quadrature
This section gives the mathematical background of Gauss Quadrature [13] required
for understanding the thesis.
The Newton-Cotes formulas [13] are an extremely useful and straightforward family
of numerical integration techniques. To integrate a function f(x) over some interval [a, b],
divide it into n equal parts with endpoints x_0, x_1, ..., x_n, and let
f_n = f(x_n). Then find polynomials, which approximate the tabulated function, and
integrate them to approximate the area under the curve. To find the fitting polynomials,
use Lagrange interpolating polynomials. The resulting formulas are called Newton-Cotes
formulas, or quadrature formulas.
The Newton-Cotes [13] methods require the evaluation of the integrand at equal
intervals. Gauss Quadrature requires the evaluation of the integrand at specified, but
unequal, intervals. Gauss Quadrature is a powerful method of numerical integration and
its accuracy is much higher than that of the Newton-Cotes formulas. As such, it is not used to
integrate functions that are given in tabular form with equispaced intervals. The most
popular form of Gauss Quadrature is Gauss Legendre Quadrature. The method uses the
roots of Legendre polynomials to locate the points at which the integrand is evaluated.
In Gauss integration, the integral is evaluated by using the formula

    ∫_{-1}^{+1} f(x) dx = Σ_{i=1}^{n} w_i f(x_i)    (2.2.1)
where n is called the number of Gauss points, w_i are the unknown coefficients, also
called weights, and x_i are specific values of x, also called Gauss points, at which the
integrand is evaluated. For any specified n, the values of w_i and x_i are chosen so that
the formula will be exact for polynomials up to and including degree (2n - 1).
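As a numerical sketch of this exactness property (in Python, using NumPy's `leggauss` routine for the nodes and weights; the helper name and the test polynomial are our own), a 3-point rule integrates a degree-5 polynomial on [-1, +1] exactly, since 2n - 1 = 5 for n = 3:

```python
import numpy as np

def gauss_integrate(f, n):
    """Integrate f over [-1, +1] with an n-point Gauss-Legendre rule."""
    x, w = np.polynomial.legendre.leggauss(n)  # Gauss points x_i and weights w_i
    return np.sum(w * f(x))

# Degree 5 = 2n - 1 for n = 3, so the rule reproduces this integral exactly.
f = lambda x: 7 * x**5 + 3 * x**2 + 1
approx = gauss_integrate(f, 3)
exact = 4.0  # the odd term integrates to 0; 3x^2 contributes 2; the constant 2
```

The odd x^5 term vanishes because the Gauss points are symmetric about 0 with equal weights, and the even terms are reproduced exactly by the rule.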
As can be seen from equation 2.2.1, Gauss integration requires the range of
integration to be from -1 to +1. For convenience of notation, let the original coordinate
be y and the range of integration of f(y) be from a to b. Then the transformation

    x = (2y - a - b) / (b - a)    (2.2.2)

gives the normalized coordinates x = -1 when y = a, and x = +1 when y = b. The
transformation from x to y is given by

    y = ((b - a)x + a + b) / 2    (2.2.3)

Noting that dy = ((b - a)/2) dx, the original integral of f(y) dy from a to b can be
rewritten as

    ∫_a^b f(y) dy = ∫_{-1}^{+1} f(((b - a)x + a + b)/2) ((b - a)/2) dx    (2.2.4)

If x_i is a Gauss point of the normalized coordinate, the corresponding value y_i can be
determined as

    y_i = ((b - a)x_i + a + b) / 2    (2.2.5)

Since the weights w_i remain the same, the integral can be evaluated using the right-
hand side expression of Eq. (2.2.4).
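The transformation in Eqs. (2.2.3) and (2.2.4) can be sketched in Python (the function name is our own; `leggauss` again supplies the nodes and weights on [-1, +1]):

```python
import numpy as np

def gauss_ab(f, a, b, n=7):
    """Integrate f over [a, b] by mapping the Gauss points per Eq. (2.2.3)."""
    x, w = np.polynomial.legendre.leggauss(n)
    y = ((b - a) * x + a + b) / 2.0          # y_i from x_i, Eq. (2.2.3)
    return (b - a) / 2.0 * np.sum(w * f(y))  # factor (b - a)/2 comes from dy

# Example: the integral of cos(y) from 0 to pi/2 is exactly 1.
approx = gauss_ab(np.cos, 0.0, np.pi / 2)
```

The weights are untouched; only the evaluation points are mapped and the Jacobian factor (b - a)/2 is applied, exactly as in Eq. (2.2.4).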
In many engineering and other practical situations, we need to evaluate integrals over
two- or three-dimensional domains. Gauss Quadrature can be extended easily to two- and
three-dimensional integrations over rectangles. For example, we can evaluate a three-
dimensional integral with limits -1 and +1 using the tensor-product extension of
equation 2.2.1:

    ∫_{-1}^{+1} ∫_{-1}^{+1} ∫_{-1}^{+1} f(x, y, z) dx dy dz = Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} w_i w_j w_k f(x_i, y_j, z_k)    (2.2.6)
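The triple sum above can be sketched directly as three nested loops over the one-dimensional Gauss points (Python; the function name and the test integrand are our own):

```python
import numpy as np

def gauss3d(f, n=5):
    """Tensor-product Gauss rule over the cube [-1, +1]^3."""
    x, w = np.polynomial.legendre.leggauss(n)
    total = 0.0
    for i in range(n):              # triple sum over i, j, k as in the formula
        for j in range(n):
            for k in range(n):
                total += w[i] * w[j] * w[k] * f(x[i], x[j], x[k])
    return total

# Example: the integral of x^2 y^2 z^2 over the cube is (2/3)^3.
approx = gauss3d(lambda x, y, z: x**2 * y**2 * z**2)
```

Note that the cost grows as n^3 function evaluations, which is why efficient interval subdivision matters even more in higher dimensions.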
2.3 Adaptive Quadrature
"Adaptive quadrature is a numerical integration procedure in which the interval of
integration is recursively subdivided until an error tolerance is met for the approximate
integral on each subinterval," Michael T. Heath [8]. The error estimate for a given
subinterval is based on the difference between two distinct quadrature rules applied on
the subinterval. The two quadrature rules used in this research are the 7 point Gauss
rule and the 15 point Gauss Kronrod rule. A graphical explanation of adaptive quadrature
is available at [9]; the work is part of the Computational Science and Engineering
Program at the University of Illinois at Urbana-Champaign.
Adaptive quadrature using the trapezoidal rule can be explained with the help of the
following steps:
1. Define f(x), (a, b), and an error tolerance e.
2. Calculate I(a, b), I(a, (a+b)/2), and I((a+b)/2, b), where I indicates the integral
calculated using the trapezoidal rule.
3. If |I(a, b) - (I(a, (a+b)/2) + I((a+b)/2, b))| < e, then accept I(a, b) as the result.
4. Else divide (a, b) into (a, (a+b)/2) and ((a+b)/2, b).
5. Repeat the steps with the subintervals, if created.
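A minimal sketch of these five steps in Python (the recursion, the depth guard, and the choice to return the finer two-panel estimate are our own additions):

```python
import math

def trap(f, a, b):
    """One-panel trapezoidal estimate of the integral of f over [a, b]."""
    return (b - a) * (f(a) + f(b)) / 2.0

def adaptive_trap(f, a, b, eps, depth=0):
    """Steps 2-5: subdivide [a, b] until the error estimate is below eps."""
    m = (a + b) / 2.0
    whole = trap(f, a, b)                    # I(a, b)
    halves = trap(f, a, m) + trap(f, m, b)   # I(a, (a+b)/2) + I((a+b)/2, b)
    if abs(whole - halves) < eps or depth > 50:
        return halves                        # step 3: accept
    return (adaptive_trap(f, a, m, eps / 2.0, depth + 1) +   # step 4: divide
            adaptive_trap(f, m, b, eps / 2.0, depth + 1))    # step 5: repeat

approx = adaptive_trap(math.sin, 0.0, math.pi, 1e-6)  # the exact value is 2
```

Halving the tolerance at each split keeps the sum of per-subinterval error estimates bounded by the original eps.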
Figure 2.1 illustrates the function and the sub-division performed if the error
condition fails. In this figure, f(x) is integrated between the limits [a, b],
[a, (a+b)/2], and [(a+b)/2, b]; failure of the error tolerance condition results in the
two sub-divisions (a, (a+b)/2) and ((a+b)/2, b).
Figure 2.1 Adaptive Quadrature using Trapezoid Rule
The Trapezoid rule requires each interval to be sub-divided into n equal subdivisions
before calculating the integral; since the integral is computed over three intervals at
each step, 3n subdivisions are required.
The Gauss Quadrature rule integrates a function f(x) over the interval [a, b] by
evaluating the integrand at 7 unequally spaced points in [a, b]. The Gauss Kronrod 15
point rule uses the 7 points of the Gauss Quadrature rule and adds 8 more points to
improve the accuracy.
Therefore to reduce the number of functional evaluations the following strategy is
used:
1. Define f(x), (a, b), and an error tolerance e.
2. Calculate gk_15(a, b) and gq_7(a, b), where gk_15 denotes the integral value
calculated using the Gauss Kronrod 15 point rule and gq_7 denotes the integral value
calculated using the 7 point Gauss Quadrature rule.
3. If |gk_15(a, b) - gq_7(a, b)| < e, accept gk_15 as the result.
4. Else divide (a, b) into (a, (a+b)/2) and ((a+b)/2, b).
5. Repeat the steps with the subintervals, if created.
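The same pattern can be sketched in Python with two Gauss-Legendre rules of different order standing in for the Gauss/Gauss-Kronrod pair. Note this is a simplification of the strategy above: a true Kronrod rule reuses the 7 Gauss points, so the pair costs only 15 evaluations, whereas the 7- and 15-point Gauss rules below share no points:

```python
import numpy as np

def gauss_rule(f, a, b, n):
    """n-point Gauss-Legendre estimate of the integral of f over [a, b]."""
    x, w = np.polynomial.legendre.leggauss(n)
    y = ((b - a) * x + a + b) / 2.0
    return (b - a) / 2.0 * np.sum(w * f(y))

def adaptive_gauss(f, a, b, eps):
    """Accept the high-order estimate where the two rules agree to within eps."""
    low = gauss_rule(f, a, b, 7)     # plays the role of gq_7
    high = gauss_rule(f, a, b, 15)   # plays the role of gk_15
    if abs(high - low) < eps:        # step 3: error estimate passes
        return high
    m = (a + b) / 2.0                # step 4: split at the midpoint
    return adaptive_gauss(f, a, m, eps / 2) + adaptive_gauss(f, m, b, eps / 2)

# sqrt has unbounded derivatives at 0, so the interval [0, 1] is refined
# heavily near the left endpoint; the exact integral is 2/3.
approx = adaptive_gauss(np.sqrt, 0.0, 1.0, 1e-10)
```

The recursion concentrates subdivisions where the integrand is difficult, which is exactly the behavior that later makes the parallel workload unpredictable and motivates dynamic load balancing.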
The adaptive Gauss Quadrature uses 15 function evaluations per interval, which is
significantly fewer than the 3n function evaluations used by the Trapezoid rule. This
thesis uses adaptive Gauss Quadrature to calculate the integral of a given function.
2.4 Preliminary Work on Adaptive Quadrature
Much work has been reported in the field of parallel quadrature. Parallel Algorithms
for Multidimensional Integration [1] categorizes the algorithms as single list algorithms
and multiple list algorithms. The essence of single list algorithms is a selection phase
during which a number of sub-regions that require further refinement are identified,
followed by the concurrent application of the numerical integration (quadrature) rule to
these identified sub-regions. Multiple list algorithms are based on an initial static
distribution of the work; for example, for a machine with P processors the region of
integration is divided into P sub-regions and each sub-integral is assigned to a separate
processor.
Partitioning techniques are discussed in [2], which explains the static and dynamic
partitioning of intervals. Parallel Globally Adaptive Quadrature on the KSR-1 (Kendall
Square Research) [2] also discusses the numerical performance of their algorithm on the
KSR-1 machine. The dynamic partitioning of the integration intervals is addressed in
[3]; this paper also gives a broad idea of the advantages involved in subdividing the
interval.
Napierala and Gladwell [6] developed a parallel adaptive quadrature algorithm,
building on the Quadpack code QAG for one-dimensional problems and the NAG
D01FCF Genz-Malik code for multidimensional problems. They replaced the error
estimate ranking strategy with a tabulation approach to ranking.
The Center for Research on Parallel Computing and Supercomputers, CPS [4],
developed several parallel software packages for Multidimensional Quadrature. Their
packages are designed for MIMD distributed-memory platforms, such as massively
parallel processor systems and/or clusters of workstations and PCs, and are developed
using standard tools such as the FORTRAN and C languages and the BLACS or MPI
communication systems. The CPS package provides subroutines based on adaptive
algorithms. Such an approach attempts to evaluate the function mainly where the
integrand shows some difficulty (peaks, singularities, ...). The PAMIHR [4] subroutine
is based on a degree-nine
formula for integrals up to ten dimensions on hyper rectangular regions. CPS also
provides subroutine(s) based on Quasi Monte Carlo methods, suitable for a class of (very)
high dimensional integrals and subroutine(s) based on a deterministic approach. Such an
approach uses very regular sequences of nodes called Lattices, generalizing the
trapezoidal rule in high dimensions.
The aim of Parallelization of Adaptive Quadrature Based Integration Methods [14]
was to develop an adaptive, parallel algorithm for Gaussian Quadrature to evaluate
integrals in one and subsequently more dimensions. That research was divided into the
following parts:
1. Developing a serial algorithm for adaptive quadrature.
2. Parallelizing the serial algorithm using blocking communication calls.
3. Extending the algorithm to work in 2 dimensions.
4. Comparing the results of the parallel algorithm with a serial algorithm to verify the
efficiency gained.
5. Comparing the behavior of the parallel algorithm on the Origin 2000 machine.
CHAPTER 3
MESSAGE PASSING INTERFACE (MPI)
This chapter introduces some basic concepts of MPI. Section 3.1 discusses the MPI
Pack and Unpack functions. Section 3.2 discusses synchronous communication in MPI.
Section 3.3 treats asynchronous communication in MPI. Section 3.4 gives a brief
architectural description of the SGI Origin machine.
Message Passing Interface [11] is a paradigm, a standard, used widely on certain
classes of parallel machines, especially those with distributed memory. The attractiveness
of the message-passing paradigm at least partially stems from its wide portability.
Programs expressed in MPI may run on distributed-memory multiprocessors, networks
of workstations, and combinations of all of these. In addition, shared-memory
implementations are possible.
The interface is suitable for use by fully general MIMD programs, as well as those
written in the more restricted style of SPMD. Although no explicit support for threads is
provided, the interface has been designed so as not to prejudice their use.
A message passing function is simply a function that explicitly transfers data from
one process to another. The MPI communication calls assume that the processes are
statically allocated, i.e., the number of processes is set at the beginning of execution
and no additional processes are created during execution. An important design goal of MPI
was to allow efficient implementations across machines of differing characteristics. MPI
carefully avoids specifying how operations will take place. It only specifies what an
operation does logically. As a result, MPI can be easily implemented on systems that
buffer messages at the sender or receiver, or do no buffering at all. Implementations can
take advantage of specific features of the communication subsystem of various machines.
MPI guarantees that the underlying transmission of messages is reliable. The user need
not check whether a message is received correctly, thus relieving the programmer of
worrying about underlying communication details.
3.1 Packing and Unpacking Variables in MPI
MPI contains routines to pack and unpack data. These routines are MPI_Pack [11],
MPI_Unpack [11] and MPI_Pack_size [11]. An MPI implementation must provide these;
further, a user may send data that has been constructed with MPI_Pack with datatype
MPI_PACKED and receive it either with datatype MPI_PACKED or with any MPI
datatype with the same type signature that went into the packed data. Because of this,
the device must provide the routines to pack and unpack data. Of course, many
implementations of the device may use the model implementation's version of these
routines.
The MPI pack and unpack routines are designed to handle data on a communicator-
wide basis. That is, data is packed relative to a communicator; a natural implementation
is to pick a data representation that is a good choice for all members of the communicator
(including the sender). However, a common use of these routines in an implementation is
to pack and unpack data sent with the point-to-point operations [11].
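The idea of packing heterogeneous data into one contiguous buffer before a single send can be illustrated with Python's standard `struct` module (an analogy only, not the MPI API; the interval-plus-tag layout is a hypothetical example of what a master process might pack):

```python
import struct

# Pack one int and three doubles contiguously, as a master process might
# pack a work tag plus an interval (a, b) and a tolerance before one send.
buf = struct.pack("=i3d", 42, 0.0, 1.0, 1e-6)

# The receiver unpacks with the same type signature, mirroring MPI_Unpack.
tag, a, b, eps = struct.unpack("=i3d", buf)
```

As with MPI_PACKED data, the receiving side must unpack with a type signature matching what was packed.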
3.2 Synchronous Communication in MPI
Sending and receiving of messages by processes is the basic MPI communication
mechanism. The communication mechanism is synchronous [11] if the completion of the
call is dependent on certain "events". For sends, the data must be successfully sent or
safely copied to system buffer space, so that the application buffer that contained the data
is available for reuse. For receives, the data must be safely stored in the receive buffer, so
that it is ready for use.
The syntax of the blocking send is given below:
int MPI_Send( void* buf,
int count,
MPI_Datatype datatype,
int dest,
int tag,
MPI_Comm comm
)
The send buffer specified by the MPI_SEND [11] operation consists of count
successive entries of the type indicated by datatype, starting with the entry at address buf.
Note that we specify the message length in terms of number of elements, not number of
bytes. The data part of the message consists of a sequence of count values, each of the
type indicated by datatype. Count may be zero, in which case the data part of the message
is empty. The basic datatypes that can be specified for message data values correspond to
the basic datatypes of the host language.
In addition to the data part, messages carry information that can be used to distinguish
messages and selectively receive them. This information consists of a fixed number of
fields, which we collectively call the message envelope. These fields are source,
destination, tag, and communicator.
The syntax of the blocking receive is given below:
int MPI_Recv( void* buf,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm comm,
MPI_Status *status
)
The receive buffer consists of the storage containing count consecutive elements of
the type specified by datatype, starting at address buf. The length of the received message
must be less than or equal to the length of the receive buffer.
An overflow error occurs if all incoming data does not fit, without truncation, into the
receive buffer. If a message that is shorter than the receive buffer arrives, then only
those locations corresponding to the (shorter) message are modified. Even though no
specific behavior is mandated by MPI for erroneous programs, the recommended handling of
overflow situations is to return, in status, information about the source and tag of the
incoming message. The receive operation will return an error code. A quality
implementation will also ensure that no memory that is outside the receive buffer will
ever be overwritten.
3.3 Asynchronous Communication in MPI
One can improve performance on many systems by overlapping communication and
computation. This is especially true on systems where communication can be executed
autonomously by an intelligent communication controller. Light-weight threads are one
mechanism for achieving such overlap. An alternative mechanism that often leads to
better performance is Nonblocking Communication [11].
int MPI_Isend( void* buf,
int count,
MPI_Datatype datatype,
int dest,
int tag,
MPI_Comm comm,
MPI_Request *request
)
int MPI_Irecv( void* buf,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm comm,
MPI_Request *request
)
A call to a non-blocking send or receive simply starts, or posts, the communication
operation. It is then up to the user program to explicitly complete the communication at
some later point in the program.
Non-blocking send start calls can use the same four modes as blocking sends:
standard, buffered, synchronous, and ready. These carry the same meaning. Sends of all
modes, except ready, can be started whether a matching receive has been posted or not; a
nonblocking ready send can be started only if a matching receive is posted. In all cases,
the send start call is local: it returns immediately, irrespective of the status of other
processes. If the call causes some system resource to be exhausted, then it will fail and
return an error code. The send-complete call returns when data has been copied out of the
send buffer. It may carry additional meaning, depending on the send mode.
If the send mode is synchronous, then the send can complete only if a matching
receive has started. That is, a receive has been posted, and has been matched with the
send. In this case, the send-complete call is non-local. Note that a synchronous,
nonblocking send may complete, if matched by a nonblocking receive, before the receive
complete call occurs. (It can complete as soon as the sender "knows" the transfer will
complete, but before the receiver "knows" the transfer will complete.)
If the send mode is buffered then the message must be buffered if there is no pending
receive. In this case, the send-complete call is local, and must succeed irrespective of the
status of a matching receive. If the send mode is standard then the send-complete call
may return before a matching receive occurred, if the message is buffered. On the other
hand, the send-complete may not complete until a matching receive occurred, and the
message was copied into the receive buffer.
These calls allocate a communication request object and associate it with the request
handle (the argument request). The request can be used later to query the status of the
communication or wait for its completion. A nonblocking send call indicates that the
system may start copying data out of the send buffer. The sender should not access any
part of the send buffer after a nonblocking send operation is called, until the send
completes.
A non-blocking receive call indicates that the system may start writing data into the
receive buffer. The receiver should not access any part of the receive buffer after a
nonblocking receive operation is called, until the receive completes.
When using non-blocking communication it is essential to ensure that the
communication has completed before making use of the result of the communication or
re-using the communication buffer. Completion tests come in two types:
WAIT type These routines block until the communication has completed. They are
useful when the data from the communication is required for the computations or the
communication buffer is about to be re-used. Therefore a non-blocking communication
immediately followed by a WAIT-type test is equivalent to the corresponding blocking
communication.
MPI_Wait(request, status)
This routine blocks until the communication specified by the handle request has
completed. The request handle will have been returned by an earlier call to a non-
blocking communication routine.
TEST type These routines return a TRUE or FALSE value depending on whether or
not the communication has completed. They do not block and are useful in situations
where we want to know if the communication has completed but do not yet need the
result or to re-use the communication buffer i.e. the process can usefully perform some
other task in the meantime.
MPI_Test(request, flag, status)
In this case the communication specified by the handle request is simply queried to
see if the communication has completed, and the result of the query (TRUE or FALSE) is
returned immediately in flag.
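The distinction between WAIT-type and TEST-type completion can be illustrated, by analogy only, with Python futures (this is not MPI code): `done()` polls without blocking, like MPI_Test, while `result()` blocks until completion, like MPI_Wait. The function name `work` and the sleep durations are arbitrary choices for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def work():
    time.sleep(0.05)          # stands in for a message transfer in progress
    return 42

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(work)   # like posting a non-blocking operation

    # TEST style: poll without blocking; the process could do other
    # useful work between polls (analogous to MPI_Test returning FALSE)
    while not fut.done():
        time.sleep(0.005)

    # WAIT style: result() blocks until completion, like MPI_Wait
    print(fut.result())       # -> 42
```

The polling loop is where a master process would, for example, distribute more work instead of idling.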
3.4 SGI Origin 2000
The SGI Origin is a Scalable Shared-Memory Processor (SSMP) system, offering the
benefits of both shared-memory and distributed system architectures. The CPUs in the
system are connected using a hub to form a node. Multiple nodes are connected together
to form a complete system. The Origin architecture is an instantiation of the Cache-Coherent
Non-Uniform Memory Access (CC-NUMA) architecture.
Figure 3.1 Hypercube Arrangement of SGI Origin
CC-NUMA stands for cache coherent Non-Uniform Memory Access. Memory is
physically distributed throughout the system; as a result memory and peripherals are
globally addressable. Local memory accesses are faster than remote accesses. Local
accesses on different nodes do not interfere with each other.
The MIPS R10000 is a superscalar RISC processor used in several SGI product lines,
from desktops to large parallel systems. The R10000 is not much faster than its
predecessor, the R8000, but it was designed to operate efficiently with cache and in the
NUMA environment.
The R10000 is 4-way superscalar; it can fetch and decode 4 instructions per cycle to be
scheduled to run on its five independent, pipelined execution units:
1. a non-blocking load/store unit,
2. two 64-bit integer ALUs (Arithmetic and Logic Units),
3. a 32/64-bit pipelined floating-point adder,
4. a 32/64-bit pipelined floating-point multiplier.
The R10000 is a cache-based RISC CPU, and programs must utilize the cache well if
they are to run efficiently on the Origin. The Origin 2000 at the HPCC at Texas Tech
University is a 46-node machine. The operating system is IRIX 6.4 and it uses LSF
(Load Sharing Facility) for load distribution. The processing speed of each processor is
300 MHz.
CHAPTER 4
DESIGN ISSUES AND IMPLEMENTATION
This chapter describes in detail the various strategies that were developed and
implemented for this research work. The issues addressed are the creation of a serial
algorithm for one to three dimensions, parallelizing the serial algorithm using blocking
communication calls, the drawbacks of this parallelized code, and designing and
implementing a parallel algorithm using asynchronous communication. This latter
algorithm is demonstrably superior in many situations, and never significantly worse than
its competitors.
One of the objectives accomplished in this research is the implementation of
adaptive quadrature integration in an efficient manner such that sub-intervals of the
integration (tasks) can be allocated depending on the turnaround time of each processor.
All processors ready for work are assigned new tasks as long as there are tasks to be
assigned. In [14], integrals of functions of up to two variables were reported. This work
extends the integration dimension to three variables, increasing the scope of practical
utilization of a parallelized adaptive quadrature integration method. More importantly,
this work implements and benchmarks a load-balancing algorithm that is significantly
more efficient than the previous implementation, which was based on synchronous
communication. Since heterogeneous execution behavior cannot be predicted in many
cases, we feel that this programming paradigm deserves more study.
4.1 Algorithm Design
The aim of this work is to increase efficiency with respect to the implementation
reported in [15]; therefore certain implementation-based enhancements are required.
These enhancements are the use of asynchronous communication and a reduction in the
number of send/receive calls.
The previous work [15] uses a master/slave paradigm to implement a synchronous
blocking algorithm. Under this paradigm the master processor (P0) is responsible for storing
the intervals (tasks) and distributing them across idle processors. The master processor
uses a stack to store sub-intervals. The slave processor (Pi, where i is any processor
other than P0) is responsible for performing the computation and reporting the result to
P0.
The design and implementation of this work can be broadly classified into the
following three stages:
1. Design and implementation of a serial adaptive quadrature algorithm,
2. Design and implementation of the parallel adaptive quadrature algorithm using
blocking calls,
3. Design and implementation of the parallel adaptive algorithm using non-blocking
calls.
We benchmark (2) and (3) and demonstrate that (3) results in an implementation that
is able to handle variable load conditions across the processors.
4.2 Serial Implementation of Adaptive Quadrature
The development of the serial code formed the basis of this research work. For
simplicity we will present the serial implementation for one dimension only. The serial
code for adaptive quadrature is explained in the following paragraphs.
For any given function, interval [a, b], and error tolerance e, gk15 (the 15-point
Gauss-Kronrod estimate) and gq7 (the 7-point Gauss estimate) are calculated. The
integral value abs(gk15) is also calculated, that is, the absolute value of the
integral using the 15-point Gauss-Kronrod rule.
The values of the integrals obtained are then checked for the error tolerance e in order
to decide whether to accept the result or reject it and consequently create additional sub-
intervals.
The following condition is used in the implementation of the adaptive quadrature:

if (fabs(gk15 - gq7) > e(sqrt(e)(lb - la) + abs(gk15)))

then sub-intervals are created; else the result gk15 is accepted. The limits la and lb are
local (sub-interval) limits at the slave processor. The right-hand side of the condition
distributes the error for an interval/sub-interval. The error e specified for the initial
interval [a, b] is scaled for any additional sub-intervals that are created. The condition
stated above provides the error scaling by multiplying the specified error with the
sub-interval length. Note that the intervals created in this process are stored on the stack,
and the entire process is repeated until the stack is empty.
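The stack-driven loop above can be sketched in Python. This is a minimal illustration, not the thesis code: Simpson's rule on the whole interval and on the two halves stands in for the gq7/gk15 pair, and the error test mirrors the scaled condition of this section.

```python
import math

def adaptive_quad(f, a, b, eps):
    """Stack-driven adaptive quadrature, mirroring section 4.2.
    `coarse` plays the role of gq7 and `fine` the role of gk15."""
    def simpson(lo, hi):
        mid = 0.5 * (lo + hi)
        return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))

    result = 0.0
    stack = [(a, b)]                      # intervals awaiting processing
    while stack:                          # repeat until the stack is empty
        la, lb = stack.pop()
        c = 0.5 * (la + lb)
        coarse = simpson(la, lb)                  # low-order estimate
        fine = simpson(la, c) + simpson(c, lb)    # high-order estimate
        # scaled error test, analogous to the condition above
        if abs(fine - coarse) > eps * (math.sqrt(eps) * (lb - la) + abs(fine)):
            stack.append((la, c))         # reject: push both halves
            stack.append((c, lb))
        else:
            result += fine                # accept the interval's estimate
    return result
```

For example, `adaptive_quad(math.sin, 0.0, math.pi, 1e-8)` converges to 2 to within the requested tolerance.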
4.2.1 Generation of Intervals and Behavior of Stack
Consider any function f(x) with an initial interval [a, b]. In one dimension, [a, b] can
be considered as a segment with end points a and b. If the calculated values of the
integral do not satisfy the error tolerance, then [a, b] is partitioned into [a, c] and [c, b],
where c = (a + b)/2.
The interval [a, c] is stored on the stack and integrals are calculated for the interval [c,
b]. If the calculated values fail to satisfy the error condition (section 4.2), then interval [c,
b] is subdivided into intervals [c, d] and [d, b]. The interval [c, d] is stored on the stack
and the integral values are again calculated for [d, b]. If the integral values for interval
[d, b] satisfy the error condition, then the integral value gk15 for this interval is added to
the result. The interval [c, d] is then popped from the stack and the integral values are
calculated for it. If these integral values satisfy the error condition, then [a, c] is
popped for processing; otherwise interval [c, d] is subdivided as explained before. This
process is repeated until the stack is empty.
The algorithm starts with an empty stack, as shown in figure 4.1(a). The stack while
interval [d, b] is being processed is shown in figure 4.1(b). When the program
terminates the stack is again as shown in figure 4.1(a). While interval [c, d] is being
processed the stack looks as shown in figure 4.1(c). The left column of the stack is the
lower limit and the right column is the upper limit of the integral.
[Figure 4.1 Working of the Stack: (a) the empty stack; (b) the stack holding [a, c] and
[c, d] while [d, b] is processed; (c) the stack holding [a, c] while [c, d] is processed.]
4.2.2 Implementation for Two and Three-Dimensional Functions
This section discusses the implementation details for two and three-dimensional
functions. The basic approach is the same, but the complexity increases from the fact that
the intervals created as a result of subdivision will be rectangles in the case of two-dimensional
integration (figure 4.2(a)) and cubes in the case of three-dimensional integration (figure 4.2(b)).
[Figure 4.2 Two and Three-Dimensional Integration: (a) a two-dimensional region with
corners [a, b] and [c, d]; (b) a three-dimensional region with corners [a, b, c] and
[d, e, f]; (c) sub-division of a two-dimensional region along its longest edge.]
In the case of two-dimensional integration the error tolerance condition is modified as
follows:

if (fabs(gk15 - gq7) > e(sqrt(e)(b - a)(d - c) + abs(gk15)))

where (b - a)(d - c) represents the area of the rectangle over which the integration is carried out.
For three-dimensional integration the error tolerance condition is modified as follows:

if (fabs(gk15 - gq7) > e(sqrt(e)(d - a)(e - b)(f - c) + abs(gk15)))

where (d - a)(e - b)(f - c) represents the volume of the region over which integration is carried
out.
The stack operation for two and three-dimensional integration handles four and six
variables respectively, compared to two in one-dimensional integration.
An important feature to discuss here is the way sub-divisions are created when the
error tolerance condition fails for two and three-dimensional integration. In one-dimensional
integration, interval [a, b] results in [a, c] and [c, b], where c = (a + b)/2. The
implementation could have proceeded for two and three-dimensional integration without
any changes, i.e., sub-dividing along the first dimension. After experimenting with
different functions, it was concluded that the sub-intervals should instead be created
along the longest edge, as seen in figure 4.2(c) for a two-dimensional case.
Consider a three-dimensional integration: the longest edge is the dimension for
which the absolute value of the difference between the upper and lower limit is greatest.
For the intervals shown in figure 4.2(b), the longest edge will be

max(abs(d - a), abs(e - b), abs(f - c)).

If the third dimension is returned as the longest edge, then the new sub-intervals will be
[a, b, c], [d, e, k] and [a, b, k], [d, e, f], where k = (c + f)/2. Note that the remaining two
dimensions are the same after the sub-division is performed.
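The longest-edge rule can be sketched as follows. This is an illustrative helper of my own naming (`split_longest_edge`), not thesis code; a box is represented by its lower and upper corner coordinates.

```python
def split_longest_edge(lo, hi):
    """Split the box with lower corner `lo` and upper corner `hi` along its
    longest edge, as described in section 4.2.2. Returns two (lo, hi) boxes."""
    # dimension with the greatest |upper - lower|
    d = max(range(len(lo)), key=lambda i: abs(hi[i] - lo[i]))
    k = 0.5 * (lo[d] + hi[d])             # midpoint along the longest edge
    hi1, lo2 = list(hi), list(lo)
    hi1[d] = k                            # first half ends at k in dimension d
    lo2[d] = k                            # second half starts at k in dimension d
    return (list(lo), hi1), (lo2, list(hi))
```

For the box [0, 1] x [0, 2] x [0, 4], the longest edge is the third dimension, so the split point is k = 2 and the remaining two dimensions are unchanged in both halves.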
4.3 Parallel Implementation of Adaptive Quadrature using Blocking Calls
After creating the serial algorithm, MPI is used to write a parallel version of the
adaptive quadrature algorithm. Most of the work in this part of the thesis is based on the
work reported in [15]. In [15] the parallel implementation of adaptive quadrature using
blocking calls is reported, but due to efficiency considerations, the algorithm was
re-implemented for this thesis.
Consider the case of a three-dimensional integration. The computation requires
six variables to be sent and received between the slave and the master processor.
Six variables are required because finite integration requires every variable/dimension
to have a lower and an upper limit, and there are three dimensions in this problem.
MPI_Pack and MPI_Unpack are used for packing and unpacking variables. Packing the
variables for the send operation and unpacking them upon receive reduces the number of
message-passing calls to one. In contrast, in [15] the MPI_Pack and MPI_Unpack
functions are not used to reduce the number of send/receive operations.
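The effect of packing can be sketched with Python's `struct` module, as an analogy for MPI_Pack/MPI_Unpack rather than the MPI API itself: the six limits of a three-dimensional region travel in one buffer, so a single send/receive carries them all. The tuple of limit values here is arbitrary example data.

```python
import struct

# Pack the six limits of a 3-D integration region into one buffer, so a
# single message carries them all (analogous to MPI_Pack on the sender).
limits = (0.0, 1.0, -1.0, 1.0, 0.5, 2.5)    # (xa, xb, ya, yb, za, zb)
buf = struct.pack('6d', *limits)            # one 48-byte message payload
assert len(buf) == 48                       # 6 doubles x 8 bytes each

# Receiver side: recover the six limits (analogous to MPI_Unpack).
unpacked = struct.unpack('6d', buf)
assert unpacked == limits
```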
Let's assume there are P processors, P0 to Pp-1, where processor P0 acts as the master.
All the other processors P1 ... Pp-1 interact only with the master processor P0. The parallel
implementation can be explained with the help of the following pseudo codes:
1. Pseudo code to explain the initial distribution of intervals,
2. Pseudo code to explain the work assigned to the slave processors,
3. Pseudo code to explain the role of the master processor.
1. Initial distribution of intervals.
The following pseudo code describes the procedure employed to distribute the initial
interval [a, b] across P-1 processors, for a one-dimensional integration problem.
if (my_rank != 0)
{
    if (number of dimensions < 2)
    {
        /* Divide the initial interval between p-1 processors */
        la = a + ((b - a)/(p - 1)) * (my_rank - 1);
        lb = la + (b - a)/(p - 1);
    }
    else /* two or three dimensions */
    {
        find the longest edge and divide along the longest edge
    }
}
Variables la and lb are the local lower and upper limits. In the pseudo code,
(b - a)/(P - 1) defines the size of the interval given to every processor P1 ... Pp-1. So
a + ((b - a)/(P - 1))(my_rank - 1) determines the lower limit of the interval for processor
my_rank, and adding (b - a)/(P - 1) to this determines the upper limit of the interval. Let's
assume we use three processors, P0, P1 and P2, to calculate an integral for function f(x)
across the interval [-1, 1]. The interval [-1, 1] is distributed to P1 and P2; P0 is not
given any interval since it is the master processor. The lower limit for P1 will be -1, as
seen by evaluating -1 + ((1 - (-1))/2)(1 - 1) = -1, and the
upper limit for P1 will be 0, as obtained from -1 + (1 - (-1))/2 = 0. Similarly, for P2 the lower
limit will be 0 and the upper limit will be 1.
For two and three-dimensional problems the initial distribution of intervals is created
along the longest edge (section 4.2.2); the remaining dimensions are unchanged. It is
important to note here that the master processor has no involvement at the time of the initial
distribution of the intervals.
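The one-dimensional distribution formula above can be written as a small helper (an illustrative function of my own naming, not thesis code):

```python
def initial_interval(a, b, my_rank, p):
    """Initial one-dimensional sub-interval assigned to slave `my_rank`
    (1 <= my_rank <= p-1) out of p processors, per the formula above.
    The master (rank 0) receives no interval."""
    size = (b - a) / (p - 1)          # equal share per slave processor
    la = a + size * (my_rank - 1)     # local lower limit
    return la, la + size              # (la, lb)
```

With three processors and the interval [-1, 1], this reproduces the worked example: P1 gets (-1, 0) and P2 gets (0, 1).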
2. Work allocation to slave processors
The following pseudo code gives a brief description of the work allocation to the
slave processors. The slave processors receive the interval(s), calculate the integral, and,
depending on the error tolerance condition, send the result or sub-intervals to P0.
if (my_rank != 0)
{
    while (1)
    {
        Receive a flag from P0 using MPI_Recv;
        if (flag == 1)
            /* indicates P0 has no work to send, so break the while loop */
            break;
        if (flag == 2)
        {
            receive interval in MPI_Pack format from P0 using MPI_Recv;
        }
        Calculate gk15 and gq7 for the sub-interval(s) received;
        if the results are within the tolerance limit, send the result using MPI_Send;
        else subdivide and send the two sub-intervals to P0 using MPI_Send;
    }
}
There are two important points to note here. First, the slave processors already have an
interval to work on during the first iteration of the while loop, as explained in the previous
paragraph(s), so the code is designed in such a way that the receive calls are not executed
during the first iteration of the while loop. Second, the sub-division results in two sub-
intervals, and both of them are sent to the master processor P0.
The intervals are received in the MPI_Pack format, so they are unpacked into
variables before the computation starts. The packing and unpacking reduce the number
of send/receive operations and increase the efficiency of the program.
3. The role of the master processor
The following pseudo code describes the role of the master processor, Po.
if (my_rank == 0)
{
while(l)
{
for i = 1 to p-1
{ Receive result/sub-intervals (data) in MPI_Pack format from the slave processors using MPI_Recv.
Unpack the data using MPI_Unpack.
If the data is an integral value then add the value to the temporary result.
Else store the intervals on the stack.
}
If the stack is empty,
- inform the slave processors; this acts as a terminating condition.
- break from the while loop.
Else distribute tasks (intervals) to the slave processors based on the availability of intervals and
processors.
}
}
As seen from the pseudo code, the master processor does the following: receives the
results/intervals from every slave processor, stores the intervals on the stack, and
distributes the intervals (if any) to the slave processors. The intervals are distributed to
every slave processor provided there are enough intervals on the stack to distribute. There
could be a situation where the number of intervals on the stack is less than P-1. The
code is adapted to this situation: before sending an interval the master processor sends a
flag to the slave processor, informing the slave processor of what it is sending. The slave
processor, depending on the value of this flag, knows whether to post a receive call for the
intervals.
The distribution of intervals by the master processor takes into account the dimension
of the application under execution; so, for one-dimensional integration, the master
processor pops two values from the stack, packs the values using MPI_Pack, and sends
this packet to the slave processor.
The working of the master/slave processor(s) configuration and the stack operations
during the execution of the program is shown in four stages. Let's assume a case where an
integral is being calculated for a function f(x) by 5 processors over an interval [a, b].
Figure 4.3 shows the diagram after the initial distribution and before the computing
starts at the slave processors P1 ... P4. It shows that the stack is empty and the slaves have
intervals to process.
[Figure 4.3 Initial distribution of intervals: the master P0 holds an empty stack; the
slaves hold P1: [a, b1], P2: [b1, b2], P3: [b2, b3], P4: [b3, b].]
Figure 4.4 shows the diagram when the slaves have calculated the integral values,
checked the error tolerance, and sent the sub-intervals/results back to the master
processor P0. The figure also shows the state of the stack at the master processor after it
has stored the intervals. Let's assume that P1 and P3 are the processors sending results,
and P2 and P4 are sending sub-intervals. So [b1, b2] is sub-divided into [b1, b4] and
[b4, b2], and [b3, b] is sub-divided into [b3, b5] and [b5, b].
[Figure 4.4 Sub-intervals stored on the stack: the master P0 holds, from the top of the
stack down, [b5, b], [b3, b5], [b4, b2], [b1, b4]; the slaves P1-P4 are idle.]
Figure 4.5 shows the diagram after the master has distributed the intervals to the slave
processors. The distribution of intervals uses a for loop that pops an interval from the
stack and distributes it to P1, then pops another and distributes it to P2, and so on. After
distributing all intervals, the stack is empty. It is possible that there are more intervals on
the stack than the number of processors; in that case, some intervals are left on the stack
until the distribution algorithm is next called by the master processor. A case may also
arise where the number of intervals is less than P-1; in that case the slaves are informed
that intervals are not sent and they take proper measures to handle the situation (such as
sending dummy results to the master).
[Figure 4.5 Sub-intervals distributed to the slave processors: P1: [b5, b], P2: [b3, b5],
P3: [b4, b2], P4: [b1, b4]; the stack is empty.]
Figure 4.6 shows the stage when the computation is complete, all processors have
returned their results to the master processor, and the program is ready to terminate.
[Figure 4.6 Termination of the parallel program: all slaves P1-P4 are idle.]
4.4 Drawbacks
The algorithm discussed in the previous section implements a parallel algorithm using
blocking communication calls. The algorithm follows a linear speedup when compared
with a serial algorithm. The performance reported in [14] show a considerable speedup
and accuracy, expected from the parallel implementation of adaptive quadrature. As with
every implementation there are certain drawbacks associated with the implementation
reported in [14]. These drawbacks are because of the use of blocking communication
calls (MPIRecv and MPISend).
The algorithm discussed in section 4.3 for the blocking case, and in [15], performs
inefficiently under the following conditions:
1. When the program runs in a heterogeneous environment. An example of this
condition is a parallel code running on a grid, where processors with
different architectures are networked for the purpose of solving complex, time-
consuming computations.
2. When the program runs on a cluster with processors of varying speed. In this
case the architectures are the same but the speeds of the processors differ.
3. When the program runs on a set of processors that are competing for resources,
implying variable load conditions. This condition can arise on a
heterogeneous/homogeneous cluster as well as on a supercomputing machine like
Pleione (section 2.4).
All the conditions stated above result in uneven times required by the slave
processors to do the required processing on an interval. The processing includes
calculating integral values, checking the error tolerance condition, sub-dividing if
required, sending results/sub-intervals, and waiting for the next set of intervals. The
problem can clearly be understood by looking at figure 4.7. Consider a case where
four processors are used for the computation of an integral. The parallelization, as already
explained, is achieved by data decomposition, so the same copy of the code is
executed on the different slave processors, and all the slave processors work on
different data sets.
Figure 4.7 shows a situation that will decrease the efficiency of the parallel program.
The boxes on the right-hand side of the figure represent the slave processors; the box on
the left-hand side represents the master processor. All the boxes in the figure have brief
descriptions of the activity the master/slave configuration performs during the
entire course of execution. Let's refer to the boxes by their names, i.e., P1, P2, P3 and P0. It
is possible that the slave processors are executing different parts of the same code at a
particular time; bold lines inside P1, P2, P3 describe this situation.
[Figure 4.7 summary. Master processor P0: for i = 1 to p-1, receive intervals/results;
store intervals on the stack; if a result, add it to the temporary value; if the stack is
empty, intimate the slaves to terminate; else inform the slaves that an interval will be
sent and, for j = 1 to p-1, pop an interval from the stack and send it to the jth processor.
Slave processors P1, P2, P3: receive interval; calculate integral; check for error; send
results/intervals; repeat until termination.]
Figure 4.7 Problem with the blocking communication calls
Let's assume that the master is waiting for the results/intervals from P1, but P1 is still
calculating the integral for some interval; as a result, the receive operation at the master
processor blocks. This does not allow the send operations at P2 and P3 to complete,
because the respective receives at the master processor are not posted (the for loop receives
the results/intervals serially). Hence P2 and P3 are idle, while the master processor may
have intervals to distribute. However, because of the blocking nature of the receive call,
the master is not able to distribute more intervals. This situation results in a loss of
efficiency.
There can be many situations similar to the one mentioned in the previous paragraph
where the blocking nature of the communication calls is responsible for the low efficiency of the
parallel program.
The for loop run by the master processor, which is responsible for receiving the
intervals/results from the slave processors, is identified as the bottleneck for the entire
communication in this parallel implementation. Successfully changing the behavior of
this for loop will increase the efficiency of the parallel algorithm. The desired behavior is
that none of the receives should block, so that the sends by the slave processors are
independent of each other.
The other modification required in the implementation concerns the sends issued when the
master processor distributes intervals to the slave processors. This change will make the
distribution of intervals faster. It is important to note here that the corresponding receive
at the slave processor should remain blocking, because computation should not be allowed
to proceed at a slave processor until it receives an interval from the master processor.
The use of nonblocking send/receive will alleviate most of the blocking problems
discussed so far. The next section discusses the implementation of the adaptive
quadrature algorithm using non-blocking communication.
4.5 Parallel Implementation of Adaptive Quadrature using Non-Blocking Calls
This section discusses the implementation details of parallel adaptive quadrature
using non-blocking communication calls. This implementation not only removes the
drawbacks associated with blocking communication calls, but also adds a new feature
that enhances the efficiency of the algorithm.
The implementation has the following features:
1. The initial distribution of intervals,
2. The use of MPI_Pack and MPI_Unpack to reduce the amount of communication,
3. The stack representation and handling by the master processor.
Implementation details of the behavior of the master/slave configuration are discussed
below.
1. Role of the slave processors
The following pseudo code explains the working of the slave processors.
while (1)
{
    /* this is the slave code */
    Receive message from P0 in MPI_Pack format using MPI_Recv;
    Unpack the message using MPI_Unpack;
    If the message contains the termination condition then break from the while loop;
    Calculate integral values gk15, abs(gk15) and gq7;
    While (the error tolerance condition for these values fails)
    {
        Sub-divide the interval into two sub-intervals;
        Send one half of the interval to the master processor using MPI_Isend;
        Calculate the integral values for the second interval;
    }
    Add the accepted integral to the temporary result;
    Inform the master that an interval is required;
}
This implementation is different from the working of the slaves in [15] and in the
blocking algorithm (section 4.3). In those implementations the slave processor sends both
sub-intervals to the master processor if the error condition fails after the integral is
calculated using the Gauss-Kronrod rule and the Gauss quadrature rule for an interval.
Consequently, the slaves are prevented from being overloaded. The overloading happens
because of some property of the function on that interval (e.g., a singularity) such that large
numbers of subsequent sub-divisions are created. The non-blocking implementation is
more efficient because only one of the sub-intervals is sent to the master processor; the other
sub-interval is retained by the slave processor for processing. The non-blocking
communication calls allow the slaves to start processing immediately on the next sub-
interval.
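The "keep one half, ship the other" behavior can be sketched as follows. This is an illustration with names of my own choosing (`slave_process`, `send_to_master`), not thesis code: the `send_to_master` callback stands in for the non-blocking MPI send, and Simpson's rule stands in for the GK15/G7 pair.

```python
import math

def slave_process(f, la, lb, eps, send_to_master):
    """One unit of slave work in the non-blocking scheme: when the error
    test fails, one half is shipped to the master via `send_to_master` and
    the slave keeps the other half, so it never sits idle waiting for work.
    Returns the accepted estimate for the half it finally accepts."""
    def simpson(lo, hi):
        m = 0.5 * (lo + hi)
        return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(m) + f(hi))

    while True:
        c = 0.5 * (la + lb)
        coarse = simpson(la, lb)
        fine = simpson(la, c) + simpson(c, lb)
        if abs(fine - coarse) <= eps * (math.sqrt(eps) * (lb - la) + abs(fine)):
            return fine                # accepted: add to the local result
        send_to_master((la, c))        # ship one half to the master
        la, lb = c, lb                 # keep processing the other half
```

For example, starting from [0, pi] with f = sin, the first shipped half is (0, pi/2) while the slave continues on (pi/2, pi).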
In [14] and in the blocking algorithm, a flag is used to indicate whether a result or a
sub-interval is being sent to the master processor. This implementation instead maintains a
local result variable; this variable contains the cumulative accepted results for all the intervals
the slave processor has worked upon. The master processor uses the MPI_Reduce operation
to collect these local values from the slave processors and sum them to get the final result.
The blocking call MPI_Recv is used to receive the intervals from the master
processor. This ensures that computation is not allowed to proceed on a slave
processor unless it has an interval to work upon.
The parallel code executed by the master processor proved much more difficult to
implement than anticipated. The pseudo code below describes the
working of the master processor. It is important to note here that non-blocking
communication calls have a parameter of type MPI_Request (section 3.3) that takes care
of completing the send/receive communication call, so execution may proceed to the
next line of code without waiting for the send/receive operation to complete. This
parameter is exploited by this implementation to get the desired improvement in
efficiency, permitting communication to overlap with computation and eliminating the
drawbacks associated with the blocking algorithm.
2. Role of the Master Processor
The notation used in the following pseudo code is as follows: P is the number of
processors used for the computation; i is any value between 1 and P-1 (0 is not included
because P0 is the master processor):
{   /* this is the master code */
    for i = 1 to p-1
        Post a receive for a message in MPI_Pack format from the ith slave processor using MPI_Irecv;
    while ((one or more processors are busy) or (stack has elements)) {
        for i = 1 to p-1 {
            MPI_Test the receive request associated with the ith slave processor;
            If the request is complete {
                MPI_Unpack the message;
                If the message indicates a request for intervals
                    If the stack has elements, send another interval to the ith processor;
                    Else put this processor in the free processor list;
                Else if the message is an interval, store it on the stack;
                Call MPI_Irecv for the next interval/request from the slave;
            } /* end if for checking request completion */
            If the ith processor is in the free processor list and the stack has elements,
                send an interval to the ith processor;
        } /* end for */
    } /* end while */
    Inform the slave processors that the program should terminate;
    MPI_Reduce the final result from the local variables;
} /* end of master code */
The algorithm for the master processor works as follows:
1. The master processor calls MPI_Irecv, a non-blocking call, to receive data from
the slave processors. There is negligible wait involved because of the nature of the non-
blocking calls.
2. Using a for loop construct, check the receive status for every slave processor.
The MPI_Test function is used to test the status (complete/pending) of the non-blocking
communication call. The non-blocking communication call has a parameter of type
MPI_Request; this parameter is called the communication handle and is responsible for
completing the communication call. The MPI_Test function takes as a parameter the
request handle associated with the receive request and returns true or false depending on
the status of the communication call. If the MPI_Test function returns true, it means that
the receive operation is complete.
3. Unpack the data using MPI_Unpack.
4. If the data contains a request for an interval, and there are intervals on the stack, pop
an interval and send it to the processor.
5. If the stack is empty, place the slave processor in the free processor list.
6. Else, if the data is an interval, store the data on the stack.
7. Post a receive call (MPIlrecv) to receive the next data packet from the slave
processor. This step ends the //construct started by MPITest function if it remms ttiae.
8. Check if any processor is free and there are intervals on the stack. If both
conditions are true, pop an interval from the stack and send it to the processor. This
step is executed only if step 5 was executed at the master processor. This step ends the for loop
started at step 2.
9. Check for the termination condition, i.e., the stack is empty and all processors are
free; if these conditions are true, send a termination signal (in the form of a message)
to the slave processors and collect the result from the slave processors using the
MPI_Reduce function.
Steps 2 to 7 remove the bottleneck (the for loop) in the communication between the
master and the slave processor(s). By using a non-blocking receive the program proceeds
without waiting, and MPI_Test makes sure that there is no more than one receive for each
corresponding send at the slave processor. In contrast, the implementation for the
blocking case (Section 4.2) and that in [1] must wait for each send and receive to complete.
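The master's stack and free-list bookkeeping described in steps 1 to 9 can be sketched in plain C. This is a minimal sketch, not the thesis code: the MPI calls are elided as comments, and the Interval, Stack and handle_message names are illustrative.

```c
#define MAX_INTERVALS 1024
#define MAX_PROCS 16

/* A 1-D interval [a, b]; the real code packs limits for three dimensions. */
typedef struct { double a, b; } Interval;

typedef struct {
    Interval items[MAX_INTERVALS];
    int top;
} Stack;

static void push(Stack *s, Interval iv) { s->items[s->top++] = iv; }
static Interval pop(Stack *s)           { return s->items[--s->top]; }
static int empty(const Stack *s)        { return s->top == 0; }

/* One pass of the master loop for slave i (steps 2-8).
 * msg_is_request: the slave asked for work; otherwise msg is a new interval. */
void handle_message(Stack *stack, int free_list[], int i,
                    int msg_is_request, Interval msg)
{
    if (msg_is_request) {
        if (!empty(stack)) {
            Interval iv = pop(stack);   /* step 4: pop and send */
            (void)iv;                   /* MPI_Send(iv) to slave i */
        } else {
            free_list[i] = 1;           /* step 5: mark slave i free */
        }
    } else {
        push(stack, msg);               /* step 6: store interval on stack */
    }
    /* step 7: re-post MPI_Irecv for slave i here */
    if (free_list[i] && !empty(stack)) {          /* step 8 */
        Interval iv = pop(stack);
        (void)iv;                       /* MPI_Send(iv) to slave i */
        free_list[i] = 0;
    }
}
```

The key property shown here is that a slave is marked free only when it requests work while the stack is empty, and step 8 re-dispatches it as soon as an interval arrives.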
CHAPTER 5
EXPERIMENTS AND RESULTS
This chapter presents the results of the experiments conducted on the parallel adaptive
quadrature algorithm using blocking communication calls (PAQBC) and the parallel
adaptive quadrature algorithm using non-blocking calls (PAQNBC).
The objective function used in the experiments is
f{x,y,z) = l,ifx- +y^ +z^ <\ = 0, otherwise
the limits for the integral are [x: -1, 1], [y: -1, 1] and [z: -1, 1]. The algorithm was tested
on different functions for the accuracy of the results obtained for one-, two- and three-dimensional
integration. The function f(x, y, z) is chosen as the objective function
because the discontinuities in this function on the boundary of the unit sphere result in
many sub-divisions.
The experiments performed can be categorized as follows:
1. Evaluation of f(x,y,z) using PAQNBC to demonstrate speedup.
2. Evaluation of f(x,y,z) using PAQNBC and PAQBC to compare speedup.
3. Evaluation of f(x,y,z) using PAQNBC and PAQBC to compare speedup in a
heterogeneous environment.
4. Evaluation of f(x,y,z) using PAQNBC and PAQBC to compare speedup under
variable load conditions.
A very important point to keep in mind while observing these results is that the
master processor does not participate in the actual computation or generation of sub-intervals.
The master is responsible only for allocating work and receiving
results/intervals. So when we say that we use two processors, it actually means just one
slave processor for computation and a master processor for allocation of work and
collection of results. The actual number of processors utilized for computation is always
one less than the total number of processors. The use of two processors corresponds to
the serial execution of the PAQNBC with one slave processor, but with added
communication overhead between the master and the slave.
Speedup is defined as the ratio of execution time using one processor to the execution
time using P processors [7]. Under the master/slave paradigm, speedup is defined as
follows:

S = (Execution time using 2 processors) / (Execution time using P processors)
A parallel program is said to have a linear speedup if the speedup S is equal to P. In
all our experiments, the maximum speedup that we can possibly see is P-1, as P0
(processor zero) is not involved in any computation.
Figure 5.1 shows that a speedup is obtained with an increase in the number of
processors used in PAQNBC. Speedup(2, P), where P = 2...10, is the ratio of the time taken
with 2 processors to the time taken with P processors, T(2)/T(P). The experiment was conducted 10 times
and the averaged data was used to plot the graph.
Figure 5.1 Speedup for P processors over 2 processors
The graph is expected to show a linear speedup, but at P = 4, 6, 8, 10 a sudden surge in
the speedup is observed. To find out the reason for this behavior, we calculated the total
number of sub-divisions created during the entire computation for P = 2 to 10. The total
number of subdivisions, C(P), created by PAQNBC is calculated by declaring a local
counter for every slave processor and incrementing this counter whenever a sub-interval is
created by the slave processor. Before the termination of the program, when all the slave
processors are free and the stack at the master processor is empty, the master processor
collects these counter values and adds them together to give the value of C(P). The
collection and addition are carried out simultaneously with the help of a collective
communication call, MPI_Reduce. The value C(P) is equal to the total number of
iterations performed by the slave processors. Figure 5.2 shows the graph between P and
C(P).
Note that the value of C(P) is not the same for every value of P. The reason for this
dissimilarity is the way intervals are distributed at the beginning of the execution (Section
4.2).
Figure 5.2 Number of subintervals with P processors for PAQNBC
The ideal way of starting the process of parallel integration would be placing the
initial interval on the stack at the master processor and then calling the interval
distribution algorithm. The distribution algorithm would have given this un-divided
interval to P1. The remaining processors, if P > 2, would have to wait until the call to the
distribution algorithm is again made by the master processor and there are enough
intervals on the stack to be distributed. This method is not the most efficient method of
distributing the intervals because this results in too many idle processors. The advantage
of this method is that the number of sub-divisions/iterations performed by the algorithm
is the same when we vary P from 2 to 10.
The blocking and non-blocking algorithms implemented for this thesis use the initial
distribution of the interval. Depending on the value of P, the initial interval is divided
into P-1 sub-intervals (panels), and each panel is given to one of the P-1 slave processors.
This initial distribution is responsible for the variation in the number of sub-intervals/iterations
created. Thus the sudden surge in the speedup at P = 4, 6, 8, 10 is
attributed to the initial distribution of the interval; at these values of P the initial
distribution results in a smaller number of subdivisions.
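The initial distribution step can be sketched as follows. This is a minimal 1-D version under stated assumptions: the function name, array-based output and the rounding guard on the last panel are illustrative, not taken from the thesis code.

```c
/* Split [a, b] into nslaves = P-1 equal panels; panel k is
 * [a + k*w, a + (k+1)*w] with w = (b - a) / nslaves. */
void initial_distribution(double a, double b, int nslaves,
                          double lo[], double hi[])
{
    double w = (b - a) / nslaves;
    for (int k = 0; k < nslaves; k++) {
        lo[k] = a + k * w;
        hi[k] = a + (k + 1) * w;
        /* in the real code: MPI_Send(panel k) to slave k+1 */
    }
    hi[nslaves - 1] = b;   /* guard against floating-point rounding at the right end */
}
```

Because the panel width depends on P, the adaptive subdivision proceeds differently for each P, which is exactly the variation in C(P) discussed above.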
The variation in the number of iterations required the normalization of the speedup by
dividing the time by the amount of work done. By dividing the time by the number of
iterations, we get time per iteration, which is essential to find out the amount of work
done for different values of P. The normalized speedup is the ratio of the time required per
iteration when the number of processors is 2 to the time required per iteration when the
number of processors is P. Therefore the normalized speedup can be mathematically
defined as

S_N = (T(2)/C(2)) / (T(P)/C(P)) = (T(2)/T(P)) * (C(P)/C(2))

Note that if the algorithm were to start with one interval on the stack, C(P) and C(2)
would be equal and the factor C(P)/C(2) would cancel.
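The normalized speedup can be computed directly from measured times and iteration counts; a small helper (the function name and the sample numbers are illustrative only):

```c
/* Normalized speedup S_N = (T(2)/C(2)) / (T(P)/C(P))
 *                        = (T(2)/T(P)) * (C(P)/C(2)).
 * t2, tp: wall-clock times with 2 and P processors;
 * c2, cp: total subdivision counts C(2) and C(P). */
double normalized_speedup(double t2, long c2, double tp, long cp)
{
    return (t2 / c2) / (tp / cp);
}
```

For example, if 2 processors take 100 s for 8000 subdivisions and P processors take 30 s for 9600 subdivisions, the time per iteration drops from 0.0125 s to 0.003125 s, a normalized speedup of 4.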
Figure 5.3 shows the graph obtained with normalized speedup using PAQNBC. For
all further experiments on PAQNBC and PAQBC normalized speedup is calculated and
plotted against P to check the performance of the algorithm. The speedup as expected is
P-1 because in the master/slave paradigm only P-1 processors are involved in the actual
computation. The speedup is linear in nature.
Figure 5.3 PAQNBC with normalized speedup
Similar experiments were conducted with the synchronous/blocking algorithm
PAQBC. Figure 5.4 shows the comparison of the speedup between PAQBC and
PAQNBC. Note that the PAQNBC algorithm performs slightly better than the PAQBC
algorithm.
Figure 5.4 Speedup for P processors over 2 processors for PAQNBC and PAQBC
The next experiment was conducted on the PAQNBC and PAQBC algorithms with the
processor P2 executing a dummy for loop. The processor P2, second in rank, is chosen taking into
account that the processor with rank 0 is the master processor. This would in theory make
processor P2 the slowest of all the processors. Figure 5.5 shows the comparison of the
speedup between PAQNBC and PAQBC.
Figure 5.5 Speedup for P processors over 2 processors with processor P2 executing a dummy for loop.
The graph shows that PAQNBC is adaptive to the situation where a processor is
running slow. The PAQBC algorithm completely loses its linearity in speedup, whereas
PAQNBC retains its linearity when the number of processors is increased to 4 and
beyond. It is important to note that for PAQNBC the cases with 2 and 3 processors
represent approximately the same situation. This can be attributed to the fact that when
P = 3, P1 is doing the maximum work because P2 is running slow. Therefore the speedup
for PAQNBC in this case is P-2 instead of P-1.
In the next experiment a slave processor executes a dummy for loop depending on the
value of a random number generated. This is achieved by inserting the following code
before the slave processor calculates integral values for an interval:
{
    srand(time(0));
    v = rand() / (RAND_MAX + 1.0);
    if (v < 0.2)
        my_sleep();
}
The above code generates a random number v that lies between 0.0 and 1.0 (exclusive)
and, depending on the value, executes a dummy for loop defined in the my_sleep() function.
The effect of inserting this code is that an unspecified number of the slave processors,
from 0 to P, can execute the dummy for loop. With probability 0.2, i.e. one fifth of the time,
a slave processor executes a dummy for loop before calculating the integral values for an interval. This
is very similar to a condition where the load is variable across the processors and
unpredictable. The load here also means that certain intervals take a longer computation
time than others. The graph in Figure 5.6 shows that PAQNBC performs better than
PAQBC under such conditions.
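The behaviour of the inserted code can be checked empirically: over many draws, rand()/(RAND_MAX + 1.0) always falls in [0.0, 1.0), and the dummy-loop branch is taken roughly one fifth of the time. A small simulation (count_sleeps is an illustrative helper with my_sleep() replaced by a counter, not part of the thesis code):

```c
#include <stdlib.h>
#include <time.h>

/* Draw n values the way the slave does and count how often the
 * dummy-loop branch (v < 0.2) would be taken. */
long count_sleeps(long n)
{
    long sleeps = 0;
    srand((unsigned)time(0));
    for (long i = 0; i < n; i++) {
        double v = rand() / (RAND_MAX + 1.0);
        if (v < 0.2)
            sleeps++;           /* my_sleep() would run here */
    }
    return sleeps;
}
```

Over 100,000 draws the observed fraction should be close to 0.2, confirming the one-fifth probability stated above.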
Figure 5.6 Speedup for P processors over 2 processors with the slave processors executing a dummy for loop based on the value of a random number.
CHAPTER 6
CONCLUSIONS
This thesis discussed the design and implementation of two parallel algorithms
PAQNBC and PAQBC. The PAQNBC algorithm uses non-blocking communication
calls, whereas the PAQBC algorithm uses blocking communication calls. The algorithms
were tested and from the experiments, the following conclusions can be drawn:
1. The PAQNBC algorithm performs better in a heterogeneous environment and in
variable load conditions.
2. The PAQNBC algorithm adaptively allocates tasks to processors such that the
fastest processor gets the maximum work. The PAQNBC minimizes the idle time or the
wait time for every processor to get a new task.
3. The PAQBC is a simple algorithm to design and use. PAQBC will perform
nearly as well as PAQNBC in a homogeneous environment or under invariable
load conditions.
There were certain issues related to non-blocking communication that were taken
care of during the implementation of the parallel code using non-blocking calls. These
issues were specific to the platform on which the code was running. This thesis used an
SGI machine with IRIX as the OS to experiment and validate the results. The number of
MPI non-blocking communication requests this environment is able to handle is limited.
By default this value is 32,768, but the function
f{x,y,z) = \,ifx^+y^+z^<\
= 0, othervrise
for the interval [x: -1, 1], [y: -1, 1] and [z: -1, 1], and for an error tolerance value of
10"^, required more requests than the default value. As a result the program failed every
time. After careful study of MPI, it was found that the MPI_Request_free function
supported by MPI should be used to free requests that are completed. Every receive
request that returns true at step 2 of the explanation given for the non-blocking master
processor is freed, and non-blocking send requests are freed immediately after the call.
The parallel adaptive quadrature algorithm can be implemented as a multi-list
algorithm, thus saving time in passing intervals/results between the master and slaves and
utilizing all P processors for computation instead of P-1. In this case, however, certain
issues such as task migration will come up in balancing load across processors. The results
of the multi-list algorithm can be compared with the single-list algorithm to determine the
most efficient way of parallelizing algorithms that use data decomposition to achieve
parallelism. The implementation can also be generically extended to support multi-dimensional
integration problems.
REFERENCES
1. Bull J.M., Freeman T.L., Parallel Algorithms for Multi-Dimensional Integration, Parallel and Distributed Computing Practices, vol. 1, no. 1, pp. 89-102, 1998.
2. Bull J.M., Freeman T.L., Parallel Globally Adaptive Quadrature on the KSR-1, Advances in Computational Mathematics, vol. 2, pp. 347-373, 1994.
3. Bull J.M., Freeman T.L. and Gladwell I., Parallel Quadrature Algorithms for Singular Integrals, Proceedings 14th World Congress on Computational and Applied Mathematics, IMACS, pp. 1136-1139, 1994.
4. Center for Research on Parallel Computing and Supercomputers, http://pixel.dma.unina.it/RESEARCH/pamihr.html, 1996.
5. Computational Science Education Project, http://csep1.phy.ornl.gov/ca/node21.html, 1996.
6. Gladwell Ian, Napierala Malgorzata, Multidimensional Numerical Integration, Mathematics Department, Southern Methodist University, http://www-fp.mcs.anl.gov/ccst/research/reports_pre1998/mcs/numerical_integration/napierala.html, January 2000.
7. Grobe M., "The Architecture and Use of the SGI Origin 2000", http://www.cc.ukans.edu/~grobe/docs/sgi-short-intro/index.shtml, University of Kansas, August 1997.
8. Heath, Michael T., Scientific Computing: An Introductory Survey, Second Edition, McGraw-Hill, New York, 2002.
9. Interactive Educational Modules in Computational Science, http://www.cse.uiuc.edu/eot/modules/integration/adaptivq, Computational Science and Engineering Program, University of Illinois at Urbana-Champaign, October 2004.
10. Maui High Performance Computing Center, http://www.mhpcc.edu/training/workshop/parallel_develop/MAIN.html#overview, September 2003.
11. MPI Forum, MPI: A Message-Passing Interface Standard, http://www.mpi-forum.org/docs/mpi-11-html/node2.html, University of Tennessee, Knoxville, 1994.
12. NA Research Area, http://www.maths.man.ac.uk/DeptWeb/Groups/NA/Parallel.html, Department of Mathematics, University of Manchester, UK, February 2001.
13. Rao Singiresu S., Applied Numerical Techniques for Engineers and Scientists, Prentice Hall Publications, 2002.
14. Pacheco P.S., Parallel Programming with MPI, Morgan Kaufmann Publishers Inc., San Mateo, CA, 1997.
15. Walawalkar, Milind, Parallelization of Adaptive Quadrature Rule-Based Integration Methods, Master's Thesis, Computer Science Department, Texas Tech University, 2003.
PERMISSION TO COPY
In presenting this thesis in partial fulfillment of the requirements for a master's
degree at Texas Tech University or Texas Tech University Health Sciences Center, I
agree that the Library and my major department shall make it freely available for
research purposes. Permission to copy this thesis for scholarly purposes may be
granted by the Director of the Library or my major professor. It is understood that any
copying or publication of this thesis for financial gain shall not be allowed without my
further written permission and that any user may be liable for copyright infringement.

Agree (Permission is granted.)

Student Signature    Date

Disagree (Permission is not granted.)

Student Signature    Date