


1989 International Conference on Parallel Processing

A DISTRIBUTED MANAGEMENT SCHEME FOR PARTITIONABLE PARALLEL COMPUTERS

Menkae Jeng*

Computer Science Department
University of Houston

Houston, TX 77204

Howard Jay Siegel**

Parallel Processing Laboratory
School of Electrical Engineering

Purdue University
West Lafayette, IN 47907

Abstract

Many large-scale parallel computers, such as those based on hypercube or multistage cube interconnection networks, can be partitioned to process applications with different computation structures and different degrees of parallelism simultaneously. However, partitioning may create resource fragments and may result in a loss of computation power. Dynamic partitioning is an effective method to alleviate the fragmentation problem. A distributed scheme for dynamic partitioning is investigated in this paper. Distributed procedures to split a subsystem and to combine subsystems are presented. Correctness of each of these two procedures is shown and the complexity is analyzed. The procedures presented are applicable to parallel computers that use interconnection networks such as hypercube, omega, multistage cube, and extra stage cube networks.

Key words: distributed management, interconnection networks, partitionable systems, system reconfiguration, task allocation.

1. Introduction

Parallel computers can provide enormous computing power to solve many of today's scientific and industrial applications. Due to the large number of resources involved, how to efficiently utilize these computers has become an important issue. One of the major problems comes from the fact that different applications may have different degrees of parallelism and may prefer different computation structures. To make better use of a parallel computer, a good management scheme should be able to allocate more resources to jobs (or tasks) that have more parallelism and fewer resources to jobs with fewer parallel operations. Such a management scheme can be easily implemented if the underlying architecture can be partitioned to execute jobs of various sizes. In recent years, designing a parallel system which can be partitioned and reconfigured into several subsystems with various sizes and computation structures to solve a broad range of applications has received increasing interest [3, 6, 14, 16, 19, 21, 27]. Examples of parallel computers which have the potential to be partitioned are RP3 [18], Butterfly [4], Ultracomputer [8], PASM [24], Cosmic Cube [20], NCUBE [9], and Connection Machine [25]. Multiple tasks with various sizes and computation structures can be executed simultaneously in a partitionable system by partitioning the system into several independent subsystems and interconnecting the PEs in each subsystem according to the computation structure desired.

"This material is based in part upon work supported by Texas Ad-vanced Research Program under Grant No. 2009.**This material is based in part upon "work supported by the AirForce Office of Scientific Research under Grant NO. F49620-86-K--0006.

Two models of parallel systems containing N = 2^m processors and an interconnection network are shown in Figure 1. Figure 1(a) shows a system model in which each PE (processing element) consists of a processor and a local memory, and communicates with other PEs through the interconnection network. The interconnection network can be a single-stage network, such as the m-dimensional hypercube [9, 22], or a multistage network, such as the omega [12] and multistage cube [23] networks. Figure 1(b) shows another system configuration in which processors and memories are on different sides of a multistage network. The model in Figure 1(b) will be used to investigate the partitioning problem. The other model can be treated similarly.

When a parallel system is partitioned, some subsystems may be busy executing jobs, and some may be idle. A busy subsystem is released and becomes idle upon the completion of a job. A released subsystem should recombine with other idle subsystems; otherwise a severe resource fragmentation problem may arise. Combining idle resources can be done either at the PE level or at the subsystem level. One approach to combining idle resources in a system with N PEs is the modified Quine-McCluskey procedure [13], which disjoins every idle subsystem into individual PEs and applies m iterations of the modified procedure on the addresses of idle PEs to combine them into subsystems. Combining resources is done at the PE level. Another approach to managing PEs stores all PE addresses in the order of a Gray code and searches the Gray code to determine what idle PEs can be combined into a subsystem [3]. Multiple Gray codes are needed in this

Figure 1. System models: (a) Each processing element (PE) consists of a processor and a local memory and communicates with other PEs through the interconnection network (IN). The IN can be a single-stage network or a multistage network. (b) Processors (P's) and memories (M's) are on different sides of a multistage interconnection network (MIN).



method for complete subsystem recognition. An approach that combines idle resources at the subsystem level can be found in [10]. In that approach, the whole system is considered as a lattice and subsystems as sublattices. Each subsystem is treated as an object (sublattice) with an identifier. Partitioning and recombining subsystems are done by manipulating the identifiers.

In the three approaches mentioned above, management of resources is centralized. The status of each PE or subsystem is maintained in a global table with a particular data structure. A PE or a controller is assigned to update the system table and to perform management procedures. Disadvantages of the centralized scheme are that errors in the system table could result in deadlock or a system crash, and that management procedures are executed in only one PE or one controller even though there are many idle PEs. It would be beneficial if the status information could be distributed to individual PEs and the procedures for resource management migrated to the idle PEs. In this paper, a distributed dynamic partitioning (DDP) scheme is studied. The DDP approach consists of two processes: the distributed splitting process and the distributed combining process. The distributed splitting process splits a system or a subsystem into subsystems of smaller sizes to execute multiple jobs simultaneously. The combining process combines released subsystems and idle subsystems to reduce the fragmentation problem. The DDP approach is analyzed based on a lattice model. However, no global table is needed to store the lattice or any status information. Instead, each PE stores its own status information. The procedures for splitting and combining are performed locally.

In Section 2, an overview of partitionable systems is given and the lattice model is presented. Section 3 presents the distributed splitting process. Section 4 shows the validity of combining subsystems in the distributed scheme. Finally, in Section 5, the implementation of the distributed combining process is presented and its complexity is analyzed. This research was motivated by a study on automatic reconfiguration of the PASM parallel processing system for dynamic task allocation.

2. Partitionable Parallel Systems

Parallel computers can be classified as either speedup-oriented or throughput-oriented [5]. They can either speed up the execution of a single job by partitioning the job into a set of cooperating processes, or maximize the throughput of many jobs by executing multiple jobs simultaneously. A partitionable parallel system can serve both purposes. It can minimize the execution time of a single job by allocating as many resources to the job as possible. When multiple jobs exist, a partitionable system can be partitioned into several subsystems so that each job can be allocated to a subsystem of suitable size.

A parallel system containing N PEs is partitionable if it can be partitioned into two subsystems of size N/2, each having all the properties of the original system of the same size [21, 27]. Each subsystem can further be partitioned independently if it has two or more PEs. Let X = x_{m-1}...x_1x_0 be the binary address of a PE. It has been shown in [21, 22] that many network-based systems can be partitioned into two subsystems based on an address bit position x_i such that all PE addresses in a subsystem have the same value in x_i. Parallel systems based on multistage networks such as the multistage cube, omega, Gamma [17, 26], ADM [15], Dynamic Redundancy [11], and Extra Stage Cube [1], or single-stage networks such as the hypercube, are partitionable.

A system is statically partitionable if it can be partitioned into two subsystems based on only one address bit position. Parallel systems based on an ADM, Gamma, or Dynamic Redundancy network are statically partitionable. They can be partitioned based on x_0 only; then each subsystem can be partitioned based on x_1 only, etc. A system is dynamically partitionable if more than one address bit can be chosen to partition the system into two subsystems. Systems based on the m-dimensional hypercube or the multistage cube are fully dynamically partitionable because any address bit can be used for partitioning. Some other systems may be partially dynamically partitionable if some address bits cannot be used for partitioning. Examples are systems based on the Extra Stage Cube. Both of these two types of systems can be treated in the same way by considering only those address bits that can be chosen for partitioning. Hence, without losing generality, only fully dynamically partitionable systems are discussed.

After a system is partitioned to execute multiple jobs, some subsystems may complete execution earlier than others. With dynamic partitioning, any two subsystems can be combined as long as the combined subsystem is a valid subsystem. In a system of size N (= 2^m), a valid subsystem of size K (= 2^k) contains K PEs whose addresses agree in exactly m-k bit positions. A dynamically partitionable system can be modeled by using a lattice. A valid subsystem then can be represented by a sublattice.

Let P = {0, 1, ..., N-1} be the set containing all PE addresses and let X = x_{m-1}...x_1x_0 and Y = y_{m-1}...y_1y_0 be two elements in P, where x_i ∈ {0,1} and y_i ∈ {0,1}, 0 ≤ i < m. If x_i ≤ y_i for all i, 0 ≤ i ≤ m-1, then X is less than or equal to Y, denoted by X ≤ Y; otherwise, X ≰ Y. The relation "≤" is reflexive, antisymmetric, and transitive. A relation possessing these three properties is called a partial ordering relation. The set P together with the partial ordering relation "≤" forms a partially ordered set (poset). In a poset, an element W is said to be a least upper bound (lub) of X and Y if X ≤ W and Y ≤ W and there is no other element E in P such that X ≤ E ≤ W and Y ≤ E ≤ W; an element Z is said to be a greatest lower bound (glb) of X and Y if Z ≤ X and Z ≤ Y and there is no other element E in P such that Z ≤ E ≤ X and Z ≤ E ≤ Y. A poset in which any two elements have a unique lub and a unique glb is called a lattice [7]. The poset defined above is a lattice. Figure 2 shows a lattice for N = 16.
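On m-bit PE addresses, this order and both bounds reduce to bit-wise operations. The following is an illustrative sketch (function names are not from the paper):

```python
def leq(x, y):
    """X <= Y in the poset: X has no 1-bit where Y has a 0-bit."""
    return (x & ~y) == 0

def lub(x, y):
    """Least upper bound of two addresses: bit-wise OR."""
    return x | y

def glb(x, y):
    """Greatest lower bound of two addresses: bit-wise AND."""
    return x & y

# 0010 <= 1110, while 0100 and 1011 are incomparable.
print(leq(0b0010, 0b1110))  # True
print(leq(0b0100, 0b1011))  # False
print(lub(0b0010, 0b0100))  # 6  (0110)
print(glb(0b0110, 0b1010))  # 2  (0010)
```

Uniqueness of the lub and glb follows because OR and AND are computed independently per bit.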

Figure 2. A lattice for N = 16.



The lattice consists of m+1 levels. Level j contains all elements whose binary addresses have j 1's. An element X at level j, 0 ≤ j < m, has one chain (link) to an element Y at level j+1 if X ≤ Y. These two elements are then said to be adjacent to each other.

When a system is represented by a lattice, a valid subsystem can be represented by a sublattice. Any valid subsystem is uniquely defined by the glb and lub of a sublattice. For any two elements X and Y in P, the set [X,Y] = {E | X ≤ E ≤ Y, E ∈ P} uniquely defines a nonempty valid subsystem if and only if X ≤ Y [10]. For example, because 0010 ≤ 1110, the subsystem [0010, 1110] = {0010, 0110, 1010, 1110} is uniquely defined. Figure 3 shows the subsystem [0010, 1110] in a lattice. As another example, because 0100 ≰ 1011, no subsystem can be defined by these two elements.
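The members of [X,Y] can be enumerated by letting the bit positions in which X and Y differ range over all combinations. A sketch (the function name is hypothetical):

```python
def subsystem(x, y):
    """Return [X,Y] = {E | X <= E <= Y} as a set; empty if X is not <= Y."""
    if x & ~y:                       # X not <= Y: no valid subsystem
        return set()
    free = x ^ y                     # bit positions free to vary
    bits = [b for b in range(free.bit_length()) if (free >> b) & 1]
    members = set()
    for mask in range(1 << len(bits)):
        e = x
        for i, b in enumerate(bits):
            if (mask >> i) & 1:
                e |= 1 << b
        members.add(e)
    return members

print(sorted(subsystem(0b0010, 0b1110)))  # [2, 6, 10, 14]
print(subsystem(0b0100, 0b1011))          # set()
```

The first call reproduces the paper's example [0010, 1110] = {0010, 0110, 1010, 1110}.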

Figure 3. The subsystem [A,B] = [0010, 1110] in a lattice for N = 16.

Based on the lattice model, a partitionable system can be split into sublattices, each representing a subsystem. Smaller sublattices may be combined into larger sublattices. Both the splitting and combining processes can be done by manipulating the glb's and lub's, thereby defining new sublattices.

3. Distributed Splitting Process

A partitionable system can execute multiple jobs simultaneously. When many jobs are waiting for execution, they will be allocated one after another. Job scheduling is not a direct concern here. It is simply assumed that jobs are scheduled and stored in a first-in-first-out (FIFO) queue, and that each job requests a subsystem with 2^j PEs, where 0 ≤ j ≤ m.

When the FIFO queue is not empty, jobs in the queue will be allocated to subsystems one after another based on a best-fit policy. All allocated jobs will be executed in parallel. Consider a system in a state where some subsystems are busy and some subsystems are idle, and the first job in the FIFO queue requests a subsystem of size 2^j. If there exists an idle subsystem of size 2^j, then the job will be allocated to that subsystem for execution. Otherwise, a splitting process is started, which splits a larger idle subsystem, if any, to create a subsystem of this size. With the best-fit policy, the splitting process will try to find the smallest idle subsystem of size greater than 2^j. Let the size of that subsystem be 2^k, where k > j. The subsystem will be split into two subsystems of size 2^{k-1}; one of them is then split again, and so on. Finally two subsystems of size 2^j are obtained; one of them then executes the job.

In order to perform the splitting process in a distributed way, each subsystem has to maintain its status locally. In the distributed scheme, every node (PE) in a subsystem [X,Y] contains a status word consisting of a BUSY/IDLE flag and a bit vector T, where T = t_{m-1}...t_1t_0 = X ⊕ Y. The bit vector T is called the bit difference vector (BDV) of the subsystem. In a subsystem [X,Y], if t_i = 0, then x_i = y_i; otherwise x_i = 0 and y_i = 1 (because X ≤ Y). For any element E in [X,Y], e_i = x_i (= y_i) if t_i = 0, and 0 ≤ e_i ≤ 1 if t_i = 1. With the BDV, each node E can calculate the glb and the lub of the subsystem, i.e., glb = E • T̄ and lub = E + T, where "•" and "+" are the bit-wise logical AND and OR operations, respectively, and T̄ is the bit-wise complement of T. Hence, any node knows what subsystem it belongs to by analyzing its BDV.
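A sketch of this calculation (m = 4 assumed, names illustrative): the glb clears the bits that vary within the subsystem (AND with the complement of T) and the lub sets them (OR with T):

```python
M = 4
MASK = (1 << M) - 1          # 1111 for m = 4

def bounds_from_bdv(e, t):
    """Any node E of a subsystem recovers (glb, lub) from its BDV T."""
    return e & ~t & MASK, e | t

# Node 0110 of the subsystem [0010, 1110], whose BDV is T = 1100:
print(bounds_from_bdv(0b0110, 0b1100))  # (2, 14)
```

Every node of a subsystem computes the same pair, so no global table is consulted.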

In the distributed scheme, each subsystem [X,Y] is controlled by its leading node, i.e., the glb X. A leading node knows the size of its subsystem. When a subsystem changes its status, the BUSY/IDLE flag and/or the BDV in every node of the subsystem must be updated. For example, if an idle subsystem [X,Y] is to execute a job, then node X will broadcast a message to set the BUSY/IDLE flag in each node of the subsystem [X,Y] to BUSY.

Initially, the whole system can be considered as a subsystem of size 2^m, i.e., all BUSY/IDLE flags are IDLE and BDV = 11...1. To execute multiple jobs simultaneously, the system should be properly split so that as many jobs as possible can be allocated. It can be shown that if the total size of all jobs does not exceed N, then with the best-fit policy, every job can always be allocated to a subsystem regardless of the order in which these jobs are assigned and the way in which the system is partitioned.

Theorem 1: Let s_i be the size of job J_i and k be the total number of jobs. If s_1 + s_2 + ... + s_k ≤ N, then with the best-fit policy, all jobs can be allocated and executed in parallel.

Proof: This will be a proof by contradiction. Assume that the first job which fails to be allocated has a size of 2^j. In this case, all idle subsystems are smaller than 2^j. It will be shown first that under the best-fit policy, any two different idle subsystems must have different sizes. Then it will be shown that the assumption of failing to allocate a job results in a contradiction of the hypothesis. Without losing generality, it can be assumed that if a subsystem is assigned to execute a job, then it will be busy all the time, i.e., no busy subsystem becomes idle.
(1) In the splitting process, a subsystem of size 2^k will be split only when a job requests a subsystem of size 2^j and no idle subsystem is of size 2^i, j ≤ i < k. After splitting a subsystem of size 2^k, k-j idle subsystems will be created. These idle subsystems have different sizes, ranging from 2^j to 2^{k-1}. Because no busy subsystem becomes idle, idle subsystems are created only through the splitting process. In other words, if one idle subsystem of a certain size exists, no idle subsystem of the same size will be created. Therefore, different idle subsystems must have different sizes.
(2) Because the job that fails to be allocated has a size of 2^j, no idle subsystem with size greater than or equal to 2^j exists. Because different idle subsystems must have different sizes, at each size 2^i, 0 ≤ i < j, at most one idle subsystem exists. Let S_idle be the sum of the sizes of all idle subsystems; then

S_idle ≤ 2^0 + 2^1 + ... + 2^{j-1} = 2^j - 1 < 2^j.

The total size of all busy subsystems is N - S_idle > N - 2^j, and under the best-fit policy each busy subsystem has exactly the size of the job it executes, so the allocated jobs together with the job of size 2^j that failed have a total size greater than N,



which is a contradiction to the hypothesis that the total request size is less than or equal to N. Therefore, every job can be allocated. □
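The splitting argument can be exercised with a small best-fit allocator over power-of-two sizes. This is an illustrative sketch, not the paper's procedure; it tracks only idle-subsystem sizes, not addresses:

```python
def allocate_all(job_sizes, n):
    """Best-fit allocate each job (each a power of two) in order;
    return True if every job gets a subsystem."""
    idle = {n: 1}                       # size -> count of idle subsystems
    for s in job_sizes:
        # best fit: the smallest idle size >= s
        fits = [k for k, c in idle.items() if c > 0 and k >= s]
        if not fits:
            return False
        k = min(fits)
        idle[k] -= 1
        while k > s:                    # split down to the requested size
            k //= 2
            idle[k] = idle.get(k, 0) + 1
        # a subsystem of size s is now busy with the job
    return True

print(allocate_all([4, 2, 8, 1, 1], 16))  # True: total 16 <= N
print(allocate_all([8, 8, 4], 16))        # False: total 20 > N
```

As Theorem 1 predicts, any request sequence whose total does not exceed N succeeds, independent of order.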

If the total request size is greater than N, then some requests have to wait until some busy subsystems complete their jobs. When a busy subsystem of size 2^j is released, it will inspect the FIFO queue. If there is any waiting job whose size is equal to or less than 2^j, then the released subsystem can be assigned or split to execute the job. Otherwise the released subsystem will combine with other idle subsystems, if possible, to form a larger subsystem in order to execute large jobs. The combining process will be discussed later.

Splitting a subsystem [X,Y] of size 2^k can be done based on any bit position i where t_i = 1. When node X wants to split its subsystem, it can simply choose the lowest possible bit position. After splitting, node X is still the leading node of one of the two new subsystems. The leading node of the other new subsystem and the BDVs of both subsystems have to be derived. Let W be the leading node to be calculated and T_new be the BDV for both subsystems. It has been shown in [10] that W can be easily calculated if splitting is done based on the lowest possible bit position. Corollary 1 shows how W and T_new are derived.

Corollary 1: W = X + (T • T′) and T_new = T • ¬(T • T′), where T is the BDV of the subsystem [X,Y], T′ is the two's complement of T, "¬" denotes the bit-wise complement, and "+" and "•" denote the bit-wise logical OR and AND operations, respectively.

Proof: From [10], if t_s is the lowest nonzero bit of T, then t_i • t′_i = 1 for i = s and t_i • t′_i = 0 for i ≠ s. Because x_s = 0, w_s = 1 (and w_i = x_i if i ≠ s). Within each of the two new subsystems, all nodes now agree in bit s, so T_new = T • ¬(T • T′), where "¬" denotes the bit-wise complement. □

From Corollary 1, both W and T_new can be derived in a few simple calculations. For example, if X = 0000 and T = 0111, then T′ = 1001, W = 0000 + (0111 • 1001) = 0001, and T_new = 0111 • ¬(0111 • 1001) = 0110. After calculating W and T_new, node X sends T_new to node W and spawns a process in node W to update the BDV of the subsystem containing node W. Node X may need to split the other subsystem again into two smaller subsystems. The procedure executed by node X to split an idle subsystem [X,Y] of size 2^k for a job of size 2^j, where 0 ≤ j < m and j < k ≤ m, is given below.

(S2.3) Spawn a process at node W to broadcast T to all nodes in the subsystem that contains node W.

(S2.4) psize ← psize - 1;
end (of while)

(S3) Broadcast T and a BUSY message to update the status words in all nodes of the subsystem that contains node X.

The correctness of the procedure SPLIT follows directly from Corollary 1. In the procedure, the while loop contains k-j iterations. At each iteration, node X calculates W and T_new (which is T on the left-hand side of S2.2) and then spawns a process at node W, which becomes the leading node of one of the two new idle subsystems. Node W then broadcasts T_new to update the status information of the new idle subsystem. The leading node and the BDV of the other idle subsystem are still node X and T_new, respectively. Hence node X can split the subsystem again, if needed. When the while loop terminates, node X becomes the leading node of an idle subsystem of size 2^j. It will broadcast T_new and a BUSY message to update the status of this subsystem and execute the job.
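The per-iteration calculation of Corollary 1 reduces to isolating the lowest 1-bit of T, since T • T′ is the familiar two's-complement identity T & (-T). A sketch with m = 4 (function name illustrative):

```python
MASK = 0b1111                     # m = 4

def split_step(x, t):
    """One split step: return the new leading node W and T_new."""
    low = t & -t & MASK           # T . T' isolates the lowest set bit of T
    w = x | low                   # W = X + (T . T')
    t_new = t & ~low & MASK       # T_new = T . not(T . T')
    return w, t_new

# The worked example above: X = 0000, T = 0111.
w, t_new = split_step(0b0000, 0b0111)
print(format(w, '04b'), format(t_new, '04b'))  # 0001 0110
```

Iterating split_step k-j times shortens the BDV one bit at a time, matching the while loop of SPLIT.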

Broadcasting a message has a time complexity of O(log N) in both the multistage cube and hypercube networks. In a multistage cube network, a broadcast transfer can be done by using a broadcast routing tag scheme that allows a node to broadcast to 2^d nodes whose addresses differ in at most d bit positions [23]. The broadcast transfer can be done in one pass through the multistage cube network. Because a multistage cube has log_2 N stages, the time complexity for the broadcast transfer is O(log N). In a hypercube computer, the broadcast transfer can be implemented by using the divide-and-conquer approach with a time complexity of O(log N). In both networks, T_new can be used as the broadcast tag, or to calculate the tag, depending on the particular network implementation.
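The divide-and-conquer hypercube broadcast doubles the set of informed nodes once per dimension, hence log_2 N steps. A simulation sketch (this is the generic recursive-doubling scheme, not the routing-tag scheme of [23]):

```python
def hypercube_broadcast(m, source):
    """Simulate recursive-doubling broadcast on an m-cube; return
    the set of informed nodes and the number of steps taken."""
    informed = {source}
    for d in range(m):
        # every informed node forwards across dimension d in parallel
        informed |= {node ^ (1 << d) for node in informed}
    return informed, m

informed, steps = hypercube_broadcast(4, 0b0101)
print(len(informed), steps)  # 16 4
```

After m = log_2 N steps every node is informed, confirming the O(log N) bound.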

The while loop in the procedure SPLIT contains at most log N iterations. This happens when k = m (= log_2 N) and j = 0. Each iteration of the while loop (not including the broadcast transfer) can be done in constant time. Hence, the worst-case complexity is O(log N). Furthermore, note that all the broadcast transfers are performed in different (and disjoint) subsystems and hence can be overlapped, i.e., the initiated broadcasts can be executed simultaneously with each other and with the continued execution of the while loop by node X. Therefore, the total complexity for all the broadcast transfers is O(log N). The complexity of the procedure SPLIT is the asymptotic sum of the two complexities above. Hence, the procedure SPLIT has a complexity of O(log N).

4. Distributed Combining Process

When a system is partitioned into several subsystems to execute multiple jobs, some subsystems may finish their jobs earlier than others. When a busy subsystem completes a job and is released, it starts a combining process. The combining process will combine the released subsystem with idle subsystems into a larger one so that jobs with large sizes can be allocated and executed. The process is in a critical section so that two released subsystems will not combine with the same idle subsystems. The combining process is more complicated than the



splitting process because a global search for combinative idle subsystems is needed. An idle subsystem is said to be combinative with the released subsystem if they can be combined into a valid subsystem. When a subsystem is released, some idle subsystems may be combinative, some may not. The sufficient and necessary conditions for two subsystems to be combinative were derived in [10]. The conditions are given in the following corollary without proof.

Corollary 2: Two subsystems [A,B] and [C,D] are combinative if and only if A is adjacent to C and T_A = T_C, where T_A and T_C are the BDVs of [A,B] and [C,D], respectively.

When a subsystem is released and there is a combinative idle subsystem, the released subsystem will combine with the idle subsystem. An intermediate subsystem is formed as a result of this combining step. The BDV of the intermediate subsystem can be easily calculated by using the following corollary.

Corollary 3: Let [A,B] and [C,D] be two combinative subsystems, T_A be the BDV of [A,B], and T_E be the BDV of the combined subsystem. Then T_E = T_A + (A ⊕ C).

Proof: T_E is generated from T_A by setting to 1 the bit position in which A and C differ but A and B do not. The above calculation does exactly that. □

After the BDV T_E is calculated, the leading node of the intermediate subsystem can be computed. When an intermediate subsystem is formed, it should combine with an idle subsystem again, if possible, into a larger intermediate subsystem. The goal is to form an intermediate subsystem as large as possible. In the simplest case, where the released subsystem and each of the intermediate subsystems have only one combinative idle subsystem, the combining process is a simple sequence of combining steps, each of which can easily be done by using Corollaries 2 and 3. However, it is possible that more than one idle subsystem is combinative with the released subsystem, each forming an intermediate subsystem; it is also possible that each intermediate subsystem is combinative with more than one idle subsystem. The snowball effect created by trying to form the largest possible subsystem could make the combining process very complicated. Fortunately, the complexity can be managed by performing the combining process in a distributed way.

In the distributed scheme, the combining process is performed in the released subsystem. To combine a released subsystem [X,Y], the leading node X can access the status words contained in its adjacent nodes to determine if an idle subsystem is combinative. From Corollary 2, if an adjacent node is the leading node of an idle subsystem whose BDV is equal to that of the released subsystem, then these two subsystems are combinative. However, when an intermediate subsystem is formed, node X may not be the leading node of the intermediate subsystem. Hence, node X may not be able to apply Corollary 2 again to continue the combining process for the intermediate subsystem. For example, if [2,3] is the released subsystem and [0,1] is idle, then from Corollaries 2 and 3, these two subsystems can be combined into an intermediate subsystem [0,3] whose leading node is now node 0, not node 2. Node 2 is now only an inner node of the intermediate subsystem [0,3], and hence Corollary 2 cannot be applied again if the combining process is to be continued at node 2.
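The combining step in this example can be checked directly from Corollaries 2 and 3. A sketch with 2-bit addresses (helper names are illustrative):

```python
def combinative(a, t_a, c, t_c):
    """Corollary 2: glbs adjacent (differ in one bit) and equal BDVs."""
    return bin(a ^ c).count('1') == 1 and t_a == t_c

def combined_bdv(t_a, a, c):
    """Corollary 3: T_E = T_A + (A xor C)."""
    return t_a | (a ^ c)

# Released [2,3] (glb A = 10, T_A = 01) and idle [0,1] (glb C = 00, T_C = 01):
a, t_a, c, t_c = 0b10, 0b01, 0b00, 0b01
print(combinative(a, t_a, c, t_c))       # True
t_e = combined_bdv(t_a, a, c)
print(a & ~t_e & 0b11, t_e)              # 0 3  -> intermediate subsystem [0,3]
```

The last line recomputes the new glb as A with the bits of T_E cleared, showing that node 2 is no longer the leading node of [0,3].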

One way to solve the problem of changing leading nodes is to switch control to the new leading node and let the new leading node access status words from its adjacent nodes to resume the combining process. However, when the snowball effect occurs, i.e., when many intermediate subsystems exist during the process, switching control to new leading nodes may create multiple combining subprocesses. In addition, each new leading node has to access status words from its adjacent nodes, which increases network traffic.

It is conjectured that, to combine the released subsystem [X,Y], node X can continue the combining process based on the status words in its adjacent nodes even after becoming an inner node of an intermediate subsystem. If this is true, then node X can complete the combining process by itself without switching control to new leading nodes of intermediate subsystems. The complexity of this process would be significantly reduced. In the following, this possibility is investigated. The analysis contains two parts. The first part proves that it is sufficient to consider only those status words contained in the adjacent nodes of the leading node X during the entire process. The second part then proves that node X can determine if an idle subsystem is combinative to the released subsystem, or to an intermediate subsystem, even after node X is only an inner node of an intermediate subsystem.

Theorem 2: Let J be a node in a subsystem [A,B] and K1 and K2 be any two different adjacent nodes of J. If K1 ∉ [A,B] and K2 ∉ [A,B], then K1 and K2 must be in two different subsystems.
Proof: (By contradiction)
Assume that K1 and K2 are in one subsystem [C,D]. Because all subsystems are disjoint, [C,D] ∩ [A,B] = ∅. Let glb(K1,K2) be the glb of K1 and K2, and lub(K1,K2) be the lub of K1 and K2 (see Section 2). Then C ≤ glb(K1,K2) and lub(K1,K2) ≤ D. Because both K1 and K2 are adjacent to J, each differs from J in exactly one bit position. Let bit u be the bit position in which J and K1 differ and bit v be the bit position in which J and K2 differ. Then u ≠ v. Without loss of generality, assume u > v. Let J = j_{m-1}...j_u...j_v...j_0. Then K1 and K2 agree with J everywhere except in bits u and v, respectively, so glb(K1,K2) is J with bits u and v both set to 0 and lub(K1,K2) is J with bits u and v both set to 1, i.e.,

	glb(K1,K2) ≤ J ≤ lub(K1,K2).

Because C ≤ glb(K1,K2) and glb(K1,K2) ≤ J, then C ≤ J. Similarly, J ≤ lub(K1,K2) and lub(K1,K2) ≤ D imply J ≤ D. Therefore, C ≤ J ≤ D, i.e., J is in [C,D]. However, J is a node in [A,B] and [A,B] ∩ [C,D] = ∅, so J cannot be in [C,D]. Therefore, K1 and K2 must be in two different subsystems. □

Theorem 3: Let [A,B] and [C,D] be any two different subsystems. If [A,B] and [C,D] are combinative, then for any node J in [A,B], exactly one node K in [C,D] is adjacent to node J.
Proof:
(1) Existence
Because [A,B] and [C,D] are combinative, from Corollary 2, A and C must be adjacent and T_A = T_C. Let bit u be the bit position in which A and C differ and let T_A = t_{m-1}...t_0. Then t_u must be 0 and

	C = c_{m-1}...c_u...c_0 = a_{m-1}...ā_u...a_0.	(1)

Because C and D differ only in those bit positions where t_i is 1,

	d_i = 1 if t_i = 1; d_i = c_i if t_i = 0.	(2)

Because J ∈ [A,B],

	j_i = a_i for every bit i with t_i = 0.	(3)

Let K = k_{m-1}...k_1k_0 = j_{m-1}...j̄_u...j_0. Then K is adjacent to J. From equation 3,

	k_u = j̄_u = ā_u = c_u, and k_i = j_i = a_i for every i ≠ u with t_i = 0.

From equations 1 and 2, C ≤ K ≤ D. Hence K is in [C,D]. Therefore, at least one node in [C,D] is adjacent to node J.
(2) Uniqueness
From Theorem 2, J can have only one adjacent node in [C,D]. Therefore, exactly one node in [C,D] is adjacent to node J. □

Corollary 4: Let [A,B] and [C,D] be two different subsystems and J ∈ [A,B]. If no node in [C,D] is adjacent to node J, then [A,B] and [C,D] are not combinative.

Proof: It follows directly from Theorem 3. □

Theorems 2 and 3 are two important theorems in the distributed combining process. Let [X,Y] be the released subsystem. Theorem 2 ensures that the status information contained in each adjacent node of node X represents an independent subsystem if the adjacent node is not in [X,Y]. Theorem 3 shows that any subsystem combinative to [X,Y] must contain one adjacent node of node X. Therefore, when combining [X,Y], it is sufficient to examine only the status words contained in the adjacent nodes of X. It is shown next that any node in a subsystem can determine if a subsystem is combinative.

Theorem 4: Let J ∈ [A,B] and K ∈ [C,D], where [A,B] ∩ [C,D] = ∅. If J is adjacent to K and T_A = T_C, then [A,B] and [C,D] are combinative.
Proof: Let T = T_A = T_C = t_{m-1}...t_0. Because J ∈ [A,B] and K ∈ [C,D],

	a_i = j_i and b_i = j_i if t_i = 0; a_i = 0 and b_i = 1 if t_i = 1,	(4)

	c_i = k_i and d_i = k_i if t_i = 0; c_i = 0 and d_i = 1 if t_i = 1.	(5)

Let bit u be the bit position in which J and K differ. Then t_u = 0; otherwise, from equations 4 and 5, A = C and B = D, and hence [A,B] = [C,D]. Furthermore, because T_A = T_C and t_u = 0, from equations 4 and 5, A and C are different in bit u only, i.e., A is adjacent to C. From Corollary 2, [A,B] and [C,D] are combinative. □

Because, in the distributed scheme, every node in a subsystem contains the status words of the subsystem, any node in [A,B] is able to check if an idle subsystem is combinative to [A,B] by using Corollary 2. In addition, from the proof of Theorem 4, if J and K are different in bit u, then A and C are different in bit u, i.e.,

	A ⊕ C = J ⊕ K = 2^u,	(6)

where J, K, A, and C are as defined in Theorem 4. From Theorem 4, node J can be any node of the subsystem. Hence, from Corollary 3 and equation 6, if a node knows the address of its adjacent node in the combining subsystem, it can calculate the new BDV when an intermediate subsystem is formed. In summary, when a subsystem [X,Y] is released, node X can continue to perform the combining process even if it is not a leading node after a combining step. From Corollary 4, it is sufficient to examine the status words in the adjacent nodes of node X. A global search for idle subsystems is not necessary.
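Equation 6 makes the BDV update a one-line bit operation. The following Python sketch illustrates it; the helper names are mine, and the leading-node computation assumes, consistent with the proofs above, that the leading node is the glb of its subsystem.

```python
def new_bdv(node_j, node_k, bdv):
    """BDV of the intermediate subsystem formed when the subsystem
    containing J (with the given BDV) combines with the adjacent
    subsystem containing K: set the bit in which J and K differ."""
    return bdv | (node_j ^ node_k)

def leading_node(node, bdv):
    """Leading node of any node's subsystem: clear the bits that vary
    within the subsystem (assumes the leading node is the glb)."""
    return node & ~bdv

# Combining [2,3] (BDV 01) with [0,1] through adjacent nodes 2 and 0
# yields [0,3]: BDV 11, whose leading node is 0.
t = new_bdv(2, 0, 0b01)
print(bin(t), leading_node(2, t))  # prints: 0b11 0
```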

5. The Procedure for Distributed Combining

When a subsystem [X,Y] is released, node X will access the status words from its adjacent nodes. Each adjacent node not contained in [X,Y] represents an independent subsystem. From Theorem 4, if the BUSY/IDLE flag in an adjacent node K of node X is IDLE and the BDV of node K is equal to that of [X,Y], then the subsystem containing node K is combinative to [X,Y]. Let [E,F] be the intermediate subsystem and let T_X and T_E be the BDVs of [X,Y] and [E,F], respectively. Then, from Corollary 2 and equation 6,

	T_E = T_X ∨ (X ⊕ K),	(7)

where ∨ denotes bitwise OR. Because node X contains T_X, it can calculate T_E by using equation 7. After combining these two subsystems, it is obvious that node X is also in the intermediate subsystem [E,F]. Hence, from Theorem 4, node X can examine the status words of the remaining adjacent nodes again to find idle subsystems that are combinative to [E,F]. It is important to note that once node X has all the status words from its adjacent nodes, it can perform the combining process by itself. When more than one idle subsystem is combinative to the released subsystem or to any intermediate subsystem, node X can keep track of each possible combining sequence and find a combining sequence that leads to the largest intermediate subsystem. Let

	K_i be the adjacent node of X whose address differs from X in the ith bit position;
	FLAG[i] be the value of the BUSY/IDLE flag in node K_i;
	T[i] be the BDV in node K_i;
	T_X be the BDV in node X;
	T_X[i] be the ith bit of T_X;
	NT[j] be the BDV of an intermediate subsystem;
	[G,H] be the largest intermediate subsystem.


The procedure for the distributed combining process is given below. It must be executed in a critical section so that only one released subsystem can execute it at a time. The procedure is followed by a description of how it works.
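The original listing of COMBINE is not reproduced in this copy, so the following Python sketch reconstructs its behavior from the description that follows. The statement labels (S1)-(S7) mirror that description, the data layout (`flag`/`bdv` dictionaries standing in for the status words read from adjacent nodes) is an assumption of this sketch, and the status-word updates of (S6) and (S7) are only indicated in comments.

```python
def combine(x, t_x, m, flag, bdv):
    """Sketch of COMBINE run by leading node x of a released subsystem
    with BDV t_x in a system of 2**m nodes. flag[k] and bdv[k] are the
    BUSY/IDLE flag and BDV read from adjacent node k. Returns NT[j],
    the BDV of the largest intermediate subsystem formed."""
    nt = [t_x]                      # (S1) NT[0]: BDV of the released subsystem
    j, oldj = 0, 0
    # (S2) node x has read the status words of its m adjacent nodes
    while oldj <= j:                # (S3) one iteration per derived BDV
        for i in range(m):
            k = x ^ (1 << i)        # adjacent node differing in bit i
            if (nt[oldj] >> i) & 1:
                continue            # k is already inside this subsystem
            # (S3.1.1)/(S3.1.2): Theorem 4 test, equation 7 update
            if flag.get(k) == 'IDLE' and bdv.get(k) == nt[oldj]:
                j += 1
                nt.append(nt[oldj] | (1 << i))
        oldj += 1                   # (S3.2)
    # (S6)/(S7): set the BUSY/IDLE flag of each node of [X,Y] to IDLE
    # and broadcast NT[j] to the largest intermediate subsystem [G,H];
    # these network operations are not modeled in this sketch.
    return nt[j]
```

For example, releasing [2,3] (x = 2, BDV 001) in an 8-node system where [0,1] and [4,7] are idle lets node 2 derive NT[1] = 011 and then NT[2] = 111, i.e., the whole cube.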

The correctness of the procedure COMBINE was proven in Section 4. In statement (S2) of the procedure, node X accesses the status information from all its adjacent nodes. In (S3), because both oldj and j are set to 0 initially, at least one iteration of the while loop will be performed. In the first iteration, node X searches for any combinative idle subsystems. If no idle subsystem is combinative to [X,Y], then (S3.1.1) and (S3.1.2) are skipped, i.e., j remains 0. The value of oldj becomes 1 after (S3.2) and the loop is terminated.

If there are idle subsystems combinative to [X,Y] in the first iteration of the while loop, then for each combinative idle subsystem, node X will increase j by one and calculate the BDV of the intermediate subsystem formed. Therefore, after (S3.2), oldj ≤ j and more iterations will be performed. The while loop is terminated when all derived NTs have been used in (S3.1) and no new NT is obtained from (S3.1.2). In this case, oldj = j + 1 and the while loop terminates. It has been shown in [10] that at most m − k intermediate subsystems can be formed in combining a released subsystem of size 2^k. Hence, at most m − k NTs can be obtained. Therefore, at most m − k + 1 iterations will be performed: one iteration for the released subsystem and m − k iterations for the intermediate subsystems.

When the while loop terminates, the number j indicates the total number of intermediate subsystems formed. In addition, the later an intermediate subsystem is formed, the larger the intermediate subsystem is. Similarly, the later a new BDV is obtained in the while loop, the larger the subsystem the new BDV represents. Hence, when the while loop is terminated, NT[j] is the BDV of one of the largest subsystems formed. Therefore, in (S6) and (S7), node X can update the status words in [X,Y] and in [G,H], which is one of the largest intermediate subsystems, by setting the BUSY/IDLE flag in each node of [X,Y] to IDLE and broadcasting NT[j] to all nodes in [G,H].

The complexity of the procedure COMBINE can be calculated by summing the complexity of each step. The complexity of (S1) is O(1). The complexity of (S2) depends on what interconnection network is used. In a multistage cube or omega network, the access time of one status word is proportional to the number of stages, and hence has a complexity of O(logN). In a system based on a hypercube, the complexity of reading one status word from an adjacent node is O(1). Because node X has m address bits and the address of each adjacent node differs from X in one bit position, node X has exactly m adjacent nodes. Hence, the complexity of (S2) is O(log²N) in a multistage cube network and O(logN) in a hypercube network. In (S3), each iteration is a for loop that has a complexity of O(logN). There are at most m − k + 1 iterations, so the worst-case complexity of (S3) is O(log²N). (S6) and (S7) are two broadcast transfers, each having a complexity of O(logN). By summing these complexities, the procedure COMBINE has a complexity of O(log²N).

6. Discussion

With dynamic partitioning, a partitionable system can be efficiently managed by allocating more resources to applications with a high degree of parallelism, and fewer resources to applications with a low degree of parallelism. Applications with different degrees of parallelism can be executed simultaneously if there are enough resources in the system. Multitasking of a single job can also be supported. A job may be multithreaded into several tasks with a precedence relation among them. Ready tasks can be executed concurrently in a partitionable system. Examples can be found in [2].

Splitting and combining subsystems are the two fundamental processes in the problem of dynamic partitioning. Distributed procedures to split a subsystem and to combine a released subsystem with idle subsystems were presented. No global table is needed to store the identifiers of subsystems. Instead, each node in the system stores a status word consisting of a BUSY/IDLE flag and a bit difference vector.
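The per-node state this implies is small. A Python sketch of the status word and of how any node recovers its subsystem boundaries from it; the field layout is assumed (the paper specifies only the flag and the BDV), and the bounds computation assumes the leading node and its counterpart are the glb and lub of the subsystem, as in the proofs above.

```python
from dataclasses import dataclass

@dataclass
class StatusWord:
    busy: bool  # BUSY/IDLE flag
    bdv: int    # bit difference vector (BDV) of this node's subsystem

def subsystem_bounds(node, bdv):
    """Recover the subsystem [A,B] from any member node's address and
    its stored BDV: A clears the bits that vary within the subsystem,
    B sets them."""
    return node & ~bdv, node | bdv

# Node 2 of the intermediate subsystem [0,3] from Section 4 stores BDV 11.
print(subsystem_bounds(2, 0b11))  # prints: (0, 3)
```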

The splitting process can be easily implemented in a distributed scheme because the process is basically local to the subsystem involved. The worst-case complexity of the distributed splitting process is O(logN).

The distributed combining process is initiated as soon as a subsystem is released. One alternative is to allow a released subsystem to inspect the FIFO queue before starting the combining process; if the released subsystem can find a suitable job to execute, then the combining process can be avoided. The combining process has to be in a critical section because it actually requests more resources to combine together. If two or more subsystems are released at the same time, only one released subsystem is allowed to perform the combining procedure at a time.

The distributed combining process is executed in the leading node of the released subsystem. It has been shown that it is sufficient to examine only those status words contained in PEs adjacent to the leading node of the released subsystem, and that the entire process can be executed in the leading node without switching control to other nodes. These properties reduce the amount of data to be analyzed and hence significantly reduce the complexity of the distributed procedure. The distributed combining process has a worst-case complexity of O(log²N).

In summary, both the splitting and combining procedures developed in this study are applicable to systems that use interconnection networks such as hypercube, multistage cube, and omega networks. They can also be easily modified into a centralized scheme by using a global table that contains the status words of all nodes.

References

[1] G. B. Adams III and H. J. Siegel, "The extra stage cube: A fault-tolerant interconnection network for supersystems," IEEE Trans. on Computers, Vol. C-31, May 1982, pp. 443-454.

[2] C. H. Chu, E. J. Delp, L. H. Jamieson, H. J. Siegel, F. J. Weil, and A. B. Whinston, "A model for an intelligent operating system for executing image understanding tasks on a reconfigurable parallel architecture," Journal of Parallel and Distributed Computing, June 1989.

[3] M-S. Chen and K. G. Shin, "Processor allocation in an N-Cube multiprocessor using gray codes," IEEE Trans. on Computers, Vol. C-36, Dec. 1987, pp. 1396-1407.

[4] W. Crowther, J. Goodhue, E. Starr, R. Thomas, W. Milliken, and T. Blackadar, "Performance measurements on a 128-node Butterfly Parallel Processor," 1985 Int'l Conf. Parallel Processing, Aug. 1985, pp. 531-540.

[5] M. Dubois, C. Scheurich, and F. A. Briggs, "Synchronization, coherence, and event ordering in multiprocessors," Computer, Vol. 21, Feb. 1988, pp. 9-21.

[6] J. P. Fishburn and R. A. Finkel, "Quotient networks," IEEE Trans. on Computers, Vol. C-31, Apr. 1982, pp. 288-295.

[7] J. L. Gersting, Mathematical Structures for Computer Science, 2nd ed., W. H. Freeman and Co., New York, NY, 1987.

[8] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer—Designing an MIMD shared memory parallel computer," IEEE Trans. on Computers, Vol. C-32, Feb. 1983, pp. 175-189.

[9] J. P. Hayes, T. Mudge, and Q. F. Stout, "A microprocessor-based hypercube supercomputer," IEEE Micro, Vol. 6, Oct. 1986, pp. 6-17.

[10] M. Jeng and H. J. Siegel, "Dynamic partitioning in a class of parallel systems," 8th Int'l Conf. Distributed Computing Systems, June 1988, pp. 33-40.

[11] M. Jeng and H. J. Siegel, "Design and analysis of dynamic redundancy networks," IEEE Trans. on Computers, Vol. C-37, Sep. 1988, pp. 1019-1029.

[12] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. on Computers, Vol. C-24, Dec. 1975, pp. 1145-1155.

[13] W. Lin and C-L. Wu, "An object-based resource management for a high-performance distributed computing system - STAR," 8th Int'l Computer Software and Applications Conf., Nov. 1984, pp. 208-216.

[14] W. Lin and C-L. Wu, "Reconfiguration procedures for a polymorphic and partitionable multiprocessor," IEEE Trans. on Computers, Vol. C-35, Oct. 1986, pp. 910-916.

[15] R. J. McMillen and H. J. Siegel, "Routing schemes for the augmented data manipulator network in an MIMD system," IEEE Trans. on Computers, Vol. C-31, Dec. 1982, pp. 1202-1214.

[16] E. Opper and M. Malek, "Resource allocation for a class of problem structures in multistage interconnection network-based systems," 3rd Int'l Conf. Distributed Computing Systems, Oct. 1982, pp. 106-113.

[17] D. S. Parker and C. S. Raghavendra, "The gamma network," IEEE Trans. on Computers, Vol. C-33, Apr. 1984, pp. 367-373.

[18] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss, "The IBM research parallel processor prototype (RP3): Introduction and architecture," 1985 Int'l Conf. Parallel Processing, Aug. 1985, pp. 764-771.

[19] U. V. Premkumar and J. C. Browne, "Resource allocation in rectangular SW Banyans," 9th Ann. Symp. Computer Architecture, Oct. 1982, pp. 326-333.

[20] C. L. Seitz, "The Cosmic Cube," Communications of the ACM, Vol. 28, Jan. 1985, pp. 22-33.

[21] H. J. Siegel, "The theory underlying the partitioning of permutation networks," IEEE Trans. on Computers, Vol. C-29, Sept. 1980, pp. 791-801.

[22] H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies, Lexington Books, D.C. Heath and Company, Lexington, MA, 1985.

[23] H. J. Siegel, W. T-Y. Hsu, and M. Jeng, "An introduction to the multistage cube family of interconnection networks," The Journal of Supercomputing, Vol. 1, 1987, pp. 13-42.

[24] H. J. Siegel, T. Schwederski, J. T. Kuehn, and N. J. Davis IV, "An overview of the PASM parallel processing system," in Computer Architecture, D. D. Gajski, V. M. Milutinovic, H. J. Siegel, and B. P. Furht, eds., IEEE Computer Society Press, Washington, DC, 1987, pp. 387-407.

[25] L. W. Tucker and G. G. Robertson, "Architecture and applications of the Connection Machine," Computer, Vol. 21, Aug. 1988, pp. 26-38.

[26] A. Varma and C. S. Raghavendra, "On permutations passable by the gamma network," Journal of Parallel and Distributed Computing, Vol. 3, Mar. 1986, pp. 72-91.

[27] S. Yalamanchili and J. K. Aggarwal, "Reconfiguration strategies for parallel architectures," Computer, Vol. 18, Dec. 1985, pp. 44-61.
