Applied Mathematics and Computation 161 (2005) 1027–1036
www.elsevier.com/locate/amc
An experience using different synchronisation mechanisms on a shared memory multiprocessor
Dogan Kaya
Department of Mathematics, Firat University, Elazig 23119, Turkey
Abstract
In this paper, we are concerned with a number of parallel algorithms for comparing
three different synchronization mechanisms when applied to a particular problem:
the reduction of a general matrix to upper Hessenberg form. The algorithms
were written in the C++ programming language using the Encore Parallel Threads
package [Encore Computer Corporation, Encore Parallel Threads Manual, No. 724-06210
Rev. A, 1988] on a shared memory MIMD computer.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Parallel algorithms; Hessenberg reduction; C++ programming language; Encore
Parallel Threads
1. Introduction
The multiprocessor system used to perform the experiments is a bus-connected
shared memory Encore Multimax computer running the UMAX operating system.
The machine (Newton) has 14 NS32532 processors, each with 256 Kb of processor
cache memory. The program does not have direct control of the allocation of
either processors or storage. The transfer of data between shared and cache
memory is controlled by the hardware. Variables can be declared locally or
globally, but in either case they will be stored in shared memory, with
possible copies in cache.
E-mail address: [email protected] (D. Kaya).
0096-3003/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.amc.2003.12.064
1.1. Overview of the Encore Parallel Threads (EPT) package
EPT is a library of routines that enables a programmer to employ the
shared memory and parallel features of the Encore Multimax. It is an extension
of the Threads package developed by Doeppner at Brown University [1].
The package has many facilities, although we have not used all of them. The
library provides a programmer with user-level routines to manipulate threads
of control and to provide connections so that threads can share information,
thus enabling a program to be parallelised. The threads are very suitable for
use in a parallel environment which is independent of the number of
processors. The EPT routines can be accessed from C++ by using the C linkage
convention: a C++ function must be declared extern "C" and must include the
file thread.h. A threads environment is initialized in EPT by calling the function
THREADgo(nprocs, datasize, func, args, argsize, stacksize, priority).
This function allows the programmer to specify the number of processors
nprocs to use. The argument datasize sets up a pool of memory, or total
amount of data space. The function func is initiated as the first thread of
execution, and the arguments args and argsize supply the parameters to be
passed to func. The maximum stack size for the newly created thread is
stacksize, and the newly created thread is given a runtime priority of priority.
This function is usually called in the main program.
As we mentioned above, the THREADgo function provides a multi-thread
environment; when any thread requires a new thread to be created, this can
be achieved by calling THREADcreate with the following parameters:
THREADcreate(funca, args, argsize, ATTACHED, stacksize, priority).
The arguments stacksize and priority are similar to those of the THREADgo
function, and the arguments args and argsize supply the parameters to be
passed to the function funca. The additional argument in the THREADcreate
function is ATTACHED. This argument means that there is a parent-child
relationship between the threads, so that the parent thread only ends when
the child thread has completed its work, using the THREADjoin function
to signal the completion. The THREADjoin function has no parameters.
1.2. Inter-thread communication
The EPT package also provides mechanisms for synchronization, which entails
a thread suspending its own execution, usually waiting for some other
thread to cause its execution to resume [2]. The mechanisms used in our study
are semaphores, monitors, and locks. Locks are not provided by the EPT
package itself, but are available as extension code. The simplest form of each
of these synchronization mechanisms is used to provide mutually exclusive
access to a particular object or data structure (to shared data).
1.2.1. Locks
A lock prevents a thread from entering a critical region while another thread
is accessing that region, so that the newly arrived thread waits. When a thread
leaves a critical region, a waiting thread is allowed to enter. The critical
region provides programs with a means of ensuring that shared variables are
accessed by only one thread at a time. Locks implement synchronization using
busy-waiting, which is a simple way to implement synchronization using shared
variables. The system uses the lock and unlock operations to provide mutually
exclusive access to a particular object or shared data.
1.2.2. Semaphores
A semaphore is a synchronization mechanism which provides an alternative
way of obtaining mutual exclusion. Semaphores were first developed by
Dijkstra in the mid-1960s [3]; the only logical operations on semaphores are
P and V, which some people call wait and signal respectively. The names P and
V are abbreviations of the Dutch words for waiting and signaling.
A semaphore is a shared integer variable that may only be accessed using
one of three possible operations: THREADseminit, THREADpsem,
and THREADvsem. The last two functions perform the corresponding P and V
primitive operations on semaphores. The first function, written in the form
sem = THREADseminit(initialvalue),
creates a new semaphore, initializes it to the value initialvalue, and returns
a reference to the created semaphore. The wait operation THREADpsem tests
the semaphore value: if it is positive, the thread decrements it and continues
execution; if it is zero or negative, the semaphore suspends the calling thread
and places it in a waiting queue. The function is written in the form
THREADpsem(sem).
The signal operation THREADvsem tests whether any threads are waiting; if so,
the semaphore value is incremented by 1 and a thread on the semaphore's queue
is released. However, the system has to ensure that each of these operations
executes atomically; that is, if a wait and a signal operation occur
simultaneously, they are executed one at a time, though the programmer does
not know in what order they are executed.
1.2.3. Monitors
A monitor is a synchronization mechanism that encapsulates mutual exclusion
and provides convenient facilities for signaling and waking up processes.
Monitors are special memory locations used as an alternative mechanism to
semaphores in the EPT package. This mechanism was originally proposed by
Hoare in the early 1970s, and was implemented in the Concurrent Pascal
programming language [4].
A monitor consists of a set of variables representing the state of some
resource, together with functions that implement operations on that resource.
When a thread requires a monitor, it must create one before using it. This is
accomplished by the call
mon = THREADmonitorinit(conditions, resetfunc).
The argument conditions gives the number of condition queues, each of which
has associated suspend and continue operations. The second parameter, the
reset function resetfunc, is used for orderly reorganization of the monitor
should the thread be terminated. In its simplest form a monitor provides
exclusive access to shared data, in a similar way to a lock or a semaphore.
The required control of access can be accomplished by using the following
functions:
THREADmonitorentry(mon, manager) and THREADmonitorexit(mon).
If there are threads waiting to enter the monitor, the current thread must
relinquish the monitor with the THREADmonitorexit function before another
thread can be moved from the entry queue and pass through
THREADmonitorentry. The first parameter mon, used in each function, is the
handle of the monitor. The second parameter manager allows the caller the
option of managing the monitor control block space.
Monitors also provide wait and signal operations in the following way. If a
thread enters a monitor and finds that a required condition is not true, it can
suspend itself by executing a wait statement of the form
THREADmonitorwait(mon, condition).
THREADmonitorwait removes the thread from the monitor and places it
on a queue waiting for the condition to become true. When another thread
enters the monitor and changes the condition to true, it can execute a signal
statement of the form
THREADmonitorsignalandexit(mon, condition).
This function withdraws a waiting thread from the condition's queue and
wakes it up. If no threads are waiting on the condition queue, the thread
simply continues.
2. Parallel implementation
We consider two different algorithms [5-7] and use the three synchronization
mechanisms for the implementations. The first one is a simple implementation
carrying out the columns and rows corresponding to a pivotal column together
in parallel, with the pivotal columns treated sequentially. All the column
updates are independent of each other, as are all the row updates, but at
least some of the row updates need to be carried out after the column updates
have been completed. Once the pivotal column has been completed, later
columns are updated, and this is terminated by a "THREADjoin". Similarly,
consecutive rows are updated and terminated with a "THREADjoin". This
implementation is for comparison with the other algorithms, and thus does not
use the synchronization mechanisms described in Section 1.
The next implementation is based on the observation that at the kth stage the
updates to rows 1, 2, ..., k can be carried out at the same time as the
column updates, while the later row updates cannot. Synchronization is needed
to ensure that the updates to the columns and rows are carried out in the
correct order. The parallel algorithm has two steps. In the first step, the
process chooses a column number until there are no more columns to allocate
to the processors. If all columns are started, then a row number is chosen
from rows 1 to n in order and updates are carried out as long as the row
number is less than k. If the row number is greater than k, the process waits
until all the columns have been completed. Row updates (k + 1 to n) are
carried out only when the columns have been completed.
When a column is completed the counter done is incremented, and this is used
to check when all columns are completed, so that the thread can allow rows
k + 1 to n to be processed. In the second implementation the columns and rows
are allocated dynamically; details of the programs are given in [5-7].
3. Experimental results
The results use the notation He for our implementation of parallel upper
Hessenberg reduction. The numerical results were obtained from four different
versions, outlined in Section 2. The dynamic implementations are indicated by
md (dynamic monitor), ld (dynamic lock) and smd (dynamic semaphore), and the
simple implementation by ss. The representation of the matrix by columns is
indicated by t. Each version was run a number of times and the smallest value
of the elapsed time was used in the results, as the time is dependent upon the
load on the system at the time of measurement. The runs were made at times of
low load to reduce the effect of other programs running at the same time.
We tested the algorithms using from one up to 10 processors (1, 2, 4, 6, 8,
10) with matrices of sizes 100 to 500 in steps of 100. Graphs of efficiencies
are displayed in Figs. 1-8, where the efficiency Ep is defined by
Ep = Ts / (p Tp),
where Ts and Tp are the times for the sequential and parallel versions and p
is the number of processors. The sequential times were obtained from the
simple sequential algorithm Heseq, compiled without array bound checking and
using the row representation, since this was significantly better than the
column representation versions and the versions with array bound checking.
The parallel and sequential times were obtained for the same sized matrix.
Normally we expect the efficiency to be less than one.
To show the performance of the two different parallel versions, we plot in
Figs. 1-4 the mean efficiencies against the number of processors. Figs. 5 and
6 show actual efficiencies using two processors, without and with array bound
checking respectively, and Figs. 7 and 8 are similar plots for eight
processors. The second method with the "Lock" version (Held) displays a
satisfactory parallel efficiency across all our versions.
4. Conclusions
If we compare the "Locks" and "Semaphores" synchronization for the second
algorithm (dynamic allocation), we find that with no array bound checking,
for both the row and column representations, the versions Held and Hesmd are
very close, particularly for four and six processors. With array bound
checking the lock version is significantly better than the semaphore versions.
The monitor version is worse than the semaphore version with no check, but
better than it with array bound checking. The monitor version in the checking
case is still significantly poorer than the lock version, and the difference
increases with the number of processors.
Fig. 1. No check.
Fig. 2. No check.
Fig. 3. Check.
Fig. 4. Check.
Fig. 5. No check for two processors.
Also, with array bound checking the semaphore versions are sometimes even
poorer than the simple version Hess. With no array bound checking the monitor
versions Hemd and Hemdt give similar efficiency to the simple implementation
Hess.
Fig. 6. Check for two processors.
Fig. 7. No check for eight processors.
There is a clear conclusion about the use of the synchronization mechanisms:
in all cases the "Locks" implementations are more efficient than those using
"Monitors" or "Semaphores". The lock version is consistently at least equal
to the others and gives the best efficiency in most cases.
Fig. 8. Check for eight processors.
References
[1] Encore Computer Corporation, Encore Parallel Threads Manual No. 724-06210 Rev. A, 1988.
[2] T.W. Doeppner Jr., Threads, a system for the support of concurrent programming, Brown
University Department of Computer Science Technical Report CS-87-11, 1987.
[3] E.W. Dijkstra, The structure of the "THE"-multiprogramming system, Communications of the
ACM 11 (1968) 341-346.
[4] C.A.R. Hoare, Monitors: an operating system structuring concept, Communications of the
ACM 17 (1974) 549–557.
[5] D. Kaya, K. Wright, Parallel algorithms for reduction of a general matrix to upper Hessenberg
form on shared memory multiprocessor, Technical Report Series No. 490, University of
Newcastle upon Tyne, Computing Science, October, 1994.
[6] K. Wright, D. Kaya, Parallel algorithms for linear algebra on a shared memory multiprocessor,
3rd Int. Coll. on Numerical Analysis, VSP, 1995, pp. 209-218.
[7] D. Kaya, Parallel Algorithms for numerical linear algebra on a shared memory multiprocessor,
Ph.D. dissertation, University of Newcastle upon Tyne, 1995.