
Applied Mathematics and Computation 161 (2005) 1027–1036

www.elsevier.com/locate/amc

An experience using different synchronisation mechanisms on a shared memory multiprocessors

Dogan Kaya

Department of Mathematics, Firat University, Elazig 23119, Turkey

Abstract

In this paper, we are concerned with a number of parallel algorithms for comparing

three different synchronization mechanisms when applied to a particular problem. The

problem is the reduction of a general matrix to upper Hessenberg form. The algorithms

were written in the C++ programming language using the Encore Parallel Threads [Encore Computer Corporation, Encore Parallel Threads Manual. No. 724-06210 Rev. A, 1988] package on a shared memory MIMD computer.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Parallel algorithms; Hessenberg reduction; C++ programming language; The Encore

Parallel Threads

1. Introduction

The multiprocessor system used to perform the experiments is a bus-con-

nected shared memory Encore Multimax computer running the UMAX

operating system. The machine (Newton) has 14 NS32532 processors, each

with 256 Kb processor cache memory. The program does not have direct

control of the allocation of either processors or storage. The transfer of data

between shared and cache memory is controlled by the hardware. Variables can be declared locally or globally, but in either case will be stored in shared memory with possible copies in cache.

E-mail address: dkaya@firat.edu.tr (D. Kaya).

0096-3003/$ - see front matter © 2004 Elsevier Inc. All rights reserved.

doi:10.1016/j.amc.2003.12.064


1.1. Overview of the Encore Parallel Threads (EPT) package

EPT is a library of routines that enables a programmer to exploit the shared memory and parallel features of the Encore Multimax. It is an extension of the Threads package developed by Doeppner at Brown University [1].

The package has many facilities, but we have not used all of them. The library provides the programmer with user-level routines to manipulate threads of control and to connect them so that threads can share information, thus enabling a program to be parallelised. The threads are well suited to a parallel environment that is independent of the number of processors. The EPT routines can be accessed from C++ by using the C linkage convention: a C++ function must be declared extern "C" and must include the file thread.h. A threads environment is initialized in EPT by calling the function

THREADgo(nprocs, datasize, func, args, argsize, stacksize, priority).

This function allows the programmer to specify the number of processors nprocs to use. The argument datasize sets up a pool of memory, the total amount of data space. The function func is initiated as the first thread of execution, and the arguments args and argsize supply the parameters to be passed to func. The newly created thread is given a maximum stack size of stacksize and a runtime priority of priority. This function is usually called in the main program.
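As an illustration only (not taken from the EPT manual or from our programs), the pieces above might be combined as follows; the exact prototypes in thread.h, the function-pointer cast, the data sizes, and the identifiers first_thread and WorkArgs are assumptions.

    // Minimal sketch: starting an EPT threads environment from C++.
    extern "C" {
    #include <thread.h>          // EPT declarations, accessed via C linkage
    }

    struct WorkArgs { int n; };  // hypothetical argument block for the first thread

    // First thread of execution; EPT passes it the args block given to THREADgo.
    extern "C" void first_thread(WorkArgs* a)
    {
        // ... set up shared data and create worker threads here ...
    }

    int main()
    {
        WorkArgs args = { 500 };               // e.g. matrix order
        THREADgo(8,                            // nprocs: processors to use
                 1 << 20,                      // datasize: shared data pool (bytes)
                 (void (*)())first_thread,     // func: first thread of execution
                 &args, sizeof(args),          // args, argsize
                 64 * 1024,                    // stacksize for the first thread
                 1);                           // priority
        return 0;
    }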

As mentioned above, the THREADgo function provides a multi-thread environment; when any thread requires a new thread to be created, this can be achieved by calling THREADcreate with the following parameters:

THREADcreate(funca, args, argsize, ATTACHED, stacksize, priority).

The arguments funca, stacksize and priority play the same roles as in the THREADgo function, and args and argsize supply the parameters to be passed to the function funca. The additional argument in the THREADcreate function is ATTACHED, which establishes a parent-child relationship between the threads, so that the parent thread will only end when its child threads have completed their work; the parent waits for this by calling the THREADjoin function, which signals completion and has no parameters.
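As a hedged sketch of this pattern (the identifiers update_column, ColArgs and spawn_column_updates are made-up names, and the stack size and priority values are illustrative; only the call order of THREADcreate and THREADjoin follows the description above), a parent thread might farm out column updates like this:

    // Sketch: a parent creating ATTACHED children and waiting for them.
    struct ColArgs { int col; };

    extern "C" void update_column(ColArgs* a)
    {
        // ... update column a->col of the matrix ...
    }

    void spawn_column_updates(int k, int n, ColArgs* argv)
    {
        // One ATTACHED child per column beyond the pivotal column k.
        for (int j = k + 1; j <= n; ++j) {
            argv[j].col = j;
            THREADcreate((void (*)())update_column, &argv[j], sizeof(ColArgs),
                         ATTACHED, 16 * 1024, 1);
        }
        THREADjoin();   // parent resumes only after all attached children finish
    }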

1.2. Inter-thread communication

The EPT package also provides mechanisms for synchronization which

entails a thread suspending its own execution, usually waiting for some other


thread to cause its execution to resume [2]. The mechanisms used in our study are semaphores, monitors, and locks. Locks are not provided by the EPT package itself but are available as extension code. The simplest form of these synchronization mechanisms is used to provide mutually exclusive access to a particular object or data structure (to shared data).

1.2.1. Locks

A lock prevents a thread from entering a critical region while another thread is accessing that region, so the newly arrived thread waits. When a thread leaves a critical region, another waiting thread is allowed to enter. The critical region gives programs a means of ensuring that shared variables are accessed by only one thread at a time. Locks implement synchronization using busy-waiting, which is a simple way to implement synchronization with shared variables. The system uses the lock and unlock operations to provide mutually exclusive access to a particular object or shared data.
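The lock extension code itself is not reproduced here; the following is only a rough sketch of the busy-waiting idea in modern C++, where std::atomic_flag stands in for whatever test-and-set primitive the extension used.

    #include <atomic>

    // Illustrative spin lock based on busy-waiting; this is NOT the EPT
    // extension code, only a sketch of the technique it implements.
    class SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* busy-wait */ } }
        void unlock() { flag.clear(std::memory_order_release); }
    };

    // Usage: guard a critical region that updates shared data.
    //   spin.lock();  shared_counter++;  spin.unlock();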

1.2.2. Semaphores

A semaphore is a synchronization mechanism which provides an alternative way of obtaining mutual exclusion. These operations were first developed by Dijkstra in the mid-1960s [3]; the only logical operations on semaphores are P and V, which some people call wait and signal respectively. The names P and V are abbreviations of the Dutch words for waiting and signalling.

A semaphore is a shared integer variable that may only be accessed using one of three possible operations: THREADseminit, THREADpsem, and THREADvsem. The last two perform the P and V primitive operations on semaphores. The first function, written in the form

sem = THREADseminit(initialvalue)

creates a new semaphore, initializes it to initialvalue, and returns a reference to the created semaphore. The wait operation THREADpsem tests the semaphore value: if it is positive, the calling thread decrements it and continues execution; if it is zero or negative, the calling thread is suspended and placed in a waiting queue. The function is written in the form

THREADpsem(sem).

The signal operation THREADvsem tests whether any threads are waiting; if so, the semaphore value is incremented by 1 and a thread on the semaphore's queue is released. The system has to ensure that each of these operations executes atomically, that is, if a wait and a signal operation occur simultaneously they are executed one at a time, although the programmer does not know in what order they are executed.
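A hedged sketch of mutual exclusion with these three calls follows; the type of the semaphore handle, the shared counter done (used again in Section 2), and the function names init_sync and column_finished are assumptions, not the paper's code.

    // Sketch: using an EPT semaphore as a mutex around a shared counter.
    static void* mutex;         // semaphore handle returned by THREADseminit
    static int   done = 0;      // shared counter of completed columns

    void init_sync()
    {
        mutex = THREADseminit(1);   // initial value 1: at most one thread inside
    }

    void column_finished()
    {
        THREADpsem(mutex);          // P (wait): enter the critical region
        done = done + 1;            // update shared data exclusively
        THREADvsem(mutex);          // V (signal): leave the critical region
    }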


1.2.3. Monitors

A monitor is a synchronization mechanism that encapsulates the mutual exclusion and provides convenient facilities for signalling and waking up processes. Monitors are special memory locations used as an alternative mechanism to semaphores in the EPT package. This mechanism was originally proposed by Hoare in the early 1970s, and was implemented in the Concurrent Pascal programming language [4].

A monitor consists of a set of variables representing the state of some resource and the functions that implement operations on that resource. When a thread requires a monitor, it must create one before using it. This is accomplished by the call

mon = THREADmonitorinit(conditions, resetfunc).

A monitor must provide condition variables: conditions gives the number of condition queues, each of which has associated suspend and continue operations. The second parameter, the reset function resetfunc, is used for the orderly reorganization of the monitor should a thread be terminated. In its simplest form a monitor provides exclusive access to shared data in a similar way to a lock or a semaphore. The required control of access can be accomplished by using the following functions:

THREADmonitorentry(mon, manager) and THREADmonitorexit(mon).

If there are threads waiting to enter the monitor, the current thread must relinquish the monitor with the THREADmonitorexit function before another thread can be moved from the entry queue and pass through its THREADmonitorentry call. The first parameter mon, used in each function, is the handle of the monitor. The second parameter manager allows the caller the option of managing the monitor control block space.

Monitors also provide wait and signal operations in the following way. If a thread enters a monitor and finds that a required condition is not true, it can suspend itself by executing a wait statement of the form

THREADmonitorwait(mon, condition).

THREADmonitorwait removes the thread from the monitor and places it on a queue waiting for the condition to become true. When another thread enters the monitor and changes the condition to true, it can execute a signal statement of the form

statement of the form

THREADmonitorsignalandexit(mon, condition).

This function withdraws a waiting thread from the condition's queue and wakes it up. If no threads are waiting on the condition queue the thread simply continues.
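The following hedged sketch puts these calls together for the column-counting idea used in Section 2; the monitor handle type, the condition index 0, the manager value 0, and the counters done and ncols are all assumptions rather than the paper's code.

    // Sketch: waiting inside a monitor until all columns of a stage are finished.
    static void* mon;            // created once, e.g. mon = THREADmonitorinit(1, resetfunc)
    static int   done  = 0;      // columns completed at the current stage
    static int   ncols = 0;      // columns that must complete before rows k+1..n start

    void column_finished()
    {
        THREADmonitorentry(mon, 0);              // exclusive access to done
        done = done + 1;
        if (done == ncols)
            THREADmonitorsignalandexit(mon, 0);  // wake a thread waiting on condition 0
        else
            THREADmonitorexit(mon);
    }

    void wait_for_columns()
    {
        THREADmonitorentry(mon, 0);
        while (done < ncols)
            THREADmonitorwait(mon, 0);           // suspend until signalled
        THREADmonitorexit(mon);                  // (a real program must release every waiter)
    }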


2. Parallel implementation

We consider two different algorithms [5-7] and use the three synchronization mechanisms in the implementations. The first one is a simple implementation carrying out the columns and rows corresponding to a pivotal column together in parallel, with the pivotal columns treated sequentially. All the column updates are independent of each other, as are all the row updates, but at least some of the row updates need to be carried out after the column updates have been completed. Once the pivotal column has been completed, the later columns are updated and this is terminated by a THREADjoin; similarly, the consecutive rows are updated and terminated with a THREADjoin. This implementation is for comparison with the other algorithms, and thus does not use the synchronization mechanisms described in Section 1.

The next implementation is based on the observation that at the kth stage the updates to rows 1, 2, ..., k can be carried out at the same time as the column updates, while the later row updates cannot. Synchronization is needed to ensure that the updates to the columns and rows are carried out in the correct order. The parallel algorithm has two steps. In the first step, each process chooses a column number until there are no more columns to allocate to the processors. If all columns are started, then a row number is chosen from rows 1 to n in order and the updates are carried out as long as the row number is less than k. If the row number is greater than k, the process waits until all the columns have been completed; row updates (k+1 to n) are carried out only when the columns have been completed.

When a column is completed the counter done is incremented, and this is used to check when all columns are completed so that the thread can allow rows k+1 to n to be processed. In this second implementation the columns and rows are allocated dynamically; details of the programs are given in [5-7].
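A hedged sketch of one worker thread in this dynamic scheme follows. It reuses the semaphore-protected counter from Section 1.2.2; StageArgs, take_next_column, take_next_row and the update routines are made-up names intended only to show the two-step structure, not the code of [5-7].

    // Sketch of one worker thread at stage k of the dynamic implementation.
    struct StageArgs { int k; int n; };

    int  take_next_column(StageArgs* s);   // hands out the next column, 0 when none left
    int  take_next_row(StageArgs* s);      // hands out the next row (1..n), 0 when none left
    void update_column(StageArgs* s, int j);
    void update_row(StageArgs* s, int j);
    void column_finished();                // increments the shared counter 'done' (Section 1.2.2)
    void wait_for_columns();               // blocks until every column of the stage is done

    extern "C" void worker(StageArgs* s)
    {
        int j;
        // Step 1: dynamic allocation of column updates.
        while ((j = take_next_column(s)) != 0) {
            update_column(s, j);
            column_finished();
        }
        // Step 2: dynamic allocation of row updates, taken in order 1..n.
        while ((j = take_next_row(s)) != 0) {
            if (j > s->k)
                wait_for_columns();        // rows k+1..n must wait for all columns
            update_row(s, j);
        }
    }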

3. Experimental results

The results use the notation He for our implementation of parallel upper Hessenberg reduction. The numerical results were obtained from four different versions, which are outlined in Section 2. The dynamic implementation is indicated by md (dynamic monitor), ld (dynamic lock) and smd (dynamic semaphore), and the simple implementation by ss. The representation of the matrix by columns is indicated by t.

Each version was run a number of times and the smallest value of the elapsed time used in the results, as the time depends on the load on the system at the time of measurement. The runs were made at times of low load to reduce the effect of other programs running at the same time.


We tested the algorithms using from one up to 10 processors (1, 2, 4, 6, 8, 10) with matrices of sizes 100(100)500. Graphs of efficiencies are displayed in Figs. 1-8, where the efficiency Ep is defined by

Ep = Ts / (p Tp),

where Ts and Tp are the times for the sequential and parallel versions and p is the number of processors. The sequential times were obtained from the simple sequential algorithm Heseq, compiled without array bound checking and using the row representation; this version was significantly better than the column representation versions and than the versions compiled with array bound checking. The parallel and sequential times were obtained for the same sized matrix. Normally we expect the efficiency to be less than one.
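For illustration only (the numbers are hypothetical, not measured results from this study): a sequential time Ts = 24 s and a parallel time Tp = 4 s on p = 8 processors give Ep = 24/(8 x 4) = 0.75, i.e. 75% efficiency.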

To show the performance of the two different parallel versions, we plot in Figs. 1-4 the mean efficiencies against the number of processors. Figs. 5 and 6 show actual efficiencies using two processors, without and with array bound checking respectively, and Figs. 7 and 8 are similar plots for eight processors. The second method with the "Lock" version (Held) displays a satisfactory parallel efficiency in all cases.

4. Conclusions

If we compare the "Locks" and "Semaphores" synchronization for the second algorithm (dynamic allocation), we find, with no array bound checking and with both row and column representation, that the versions Held and Hesmd are very close, particularly for four and six processors. With array bound checking the lock version is significantly better than the semaphore versions.

Fig. 1. No check.

Fig. 2. No check.

Fig. 3. Check.

Fig. 4. Check.

Fig. 5. No check for two processors.

The monitor version is worse than the semaphore version with no check, but better than it with array bound checking. The monitor version in the checking case is still significantly poorer than the lock version, and the difference increases with the number of processors.

Also, with array bound checking the semaphore versions are sometimes even poorer than the simple version Hess. With no array bound checking the monitor versions Hemd and Hemdt give similar efficiency to the simple implementation Hess.

Fig. 6. Check for two processors.

Fig. 7. No check for eight processors.


There is a clear conclusion about the synchronization mechanisms: in all cases the "Locks" implementations are more efficient than those using "Monitors" or "Semaphores". The lock version is consistently at least equal to the others and gives the best efficiency in most cases.

Fig. 8. Check for eight processors.


References

[1] Encore Computer Corporation, Encore Parallel Threads Manual No. 724-06210 Rev. A, 1988.

[2] T.W. Doeppner Jr., Threads, a system for the support of concurrent programming, Brown

University Department of Computer Science Technical Report CS-87-11, 1987.

[3] E.W. Dijkstra, The structure of the 'THE'-multiprogramming system, Communications of the ACM 11 (1968) 341-346.

[4] C.A.R. Hoare, Monitors: an operating system structuring concept, Communications of the

ACM 17 (1974) 549–557.

[5] D. Kaya, K. Wright, Parallel algorithms for reduction of a general matrix to upper Hessenberg

form on shared memory multiprocessor, Technical Report Series No. 490, University of

Newcastle upon Tyne, Computing Science, October, 1994.

[6] K. Wright, D. Kaya, Parallel algorithms for linear algebra on a shared memory multiprocessor,

3rd Int. Coll. on Numerical Analysis, VSP, 1995, pp. 209-218.

[7] D. Kaya, Parallel Algorithms for numerical linear algebra on a shared memory multiprocessor,

Ph.D. dissertation, University of Newcastle upon Tyne, 1995.