1990 International Conference on Parallel Processing

OPTIMIZING TASK MIGRATION TRANSFERS USING MULTISTAGE CUBE NETWORKS

Thomas Schwederski
Institute for Microelectronics Stuttgart
Allmandring 30a
D-7000 Stuttgart 80, West Germany

Howard Jay Siegel
[email protected]
Parallel Processing Laboratory
School of Electrical Engineering
Purdue University
West Lafayette, IN 47907, USA

Thomas L. Casavant
[email protected]
Dept. of Electrical and Computer Engineering
University of Iowa
Iowa City, IA 52242, USA

Abstract — As hardware and software technology progresses, the interest in large-scale parallel processing systems is increasing. Making such a system partitionable into independent subsystems has many advantages. To maximize these benefits, it may be necessary to move (migrate) a job from one submachine (partition) to another. Here, machines based on multistage cube networks are considered. Assume the task is to be migrated from a given set of K source PEs, P_s, to a given set of K destination PEs, P_d. A mapping to determine which PE in P_d is to receive data from each PE in P_s is given. It is proven that this mapping will allow the task migration to be performed in the minimum amount of time. An equation for determining this minimum time is derived. The results are shown for both packet- and circuit-switched multistage cube networks. The techniques presented can be used as part of a strategy for making decisions as to whether to migrate a task and which partitions to use as source and destination of the migration.

1. Introduction

This research builds on, but is distinct from, the work presented in [19]. The comments on motivation in this section, the qualitative discussion in Section 2, the model in Section 3, and the terminology in Section 4 are needed background material based on [19], and are included here so that this paper is self-contained. The results in Sections 5 and 6 are new and are the contributions of this paper.

As hardware and software technology progresses, the interest in large-scale parallel processing systems is increasing. Making such a system partitionable into independent subsystems has many advantages. These include exploitation of subtask parallelism, allowing multiple simultaneous users, and facilitating more efficient utilization of resources [22]. To maximize these benefits, it may be necessary to move (migrate) a job from one submachine (partition) to another. Here it is assumed that memory is physically distributed, i.e., each processor has some local memory associated with it, forming a PE (processing element), as is the case in most current large-scale parallel systems. Existing commercial and research distributed memory parallel machines that are partitionable to some degree (i.e., they can be subdivided into independent submachines, and each of these machines can perform a separate task) include the BBN Butterfly* [6], Connection Machine 2 [10, 24], IBM RP3* [16, 17], Intel iPSC/2 [13], NCube system [9], and PASM [22].

Task migration is the movement of a task (or subtask) executing on one partition (set of PEs) to another partition. The area of task migration in partitionable systems has received little attention because only recently have large-scale partitionable parallel systems become available. The overhead of task migration is studied here. In particular, the amount of time required to transfer data for the migration is investigated. This work builds on [19], deriving a new methodology that is proven optimal for both packet- and circuit-switched multistage cube networks. An equation for determining this minimum time is derived. The techniques presented can be used as part of a strategy for making decisions as to whether to migrate a task and which partitions to use as source and destination of the migration. These research results are applicable to MIMD, partitionable-SIMD, and partitionable-SIMD/MIMD systems that use multistage cube networks. This research is motivated by the study of the PASM parallel processing system [7, 22].

Partition restructuring, the movement of small tasks to make larger partitions available, is one motivation for task migration. In parallel processing systems with cube interconnection networks, the ways in which processors can be combined into independent submachines obey certain restrictions [20]. As a consequence, busy partitions and/or faulty PEs can cause a fragmentation problem that prohibits larger partitions from being formed. Depending on the situation, the migration of a task from one partition to another may make available a larger size partition.

This research was supported in part by the Office of Naval Research under grant no. N00014-90-J-1483, by the National Science Foundation under grant no. CCR-8809600 and CCR-8704826, and by the Naval Ocean Systems Center under the High Performance Computing Block, ONT.

*These machines can use a shared memory addressing scheme, but the memories are physically distributed with each memory associated with one local processor.


Consider a hypercube network with eight nodes, as illustrated in Fig. 1 (the situation is similar for a multistage cube network). If an independent partition with four nodes is desired, all four nodes must lie on the same surface of the cube. For example, nodes 1, 3, 5, and 7 can form a partition of size four. However, if two one-node tasks occupy nodes 0 and 7, each of the six surfaces contains one task. No size four partition can be formed, even though six nodes are available. By moving one task, e.g., by moving the task in node 0 to node 4, a size four partition becomes possible, consisting of nodes 0, 1, 2, and 3. This migration can result in an increase of the overall machine performance, or it can be used to meet real-time constraints that apply to the incoming task.

Fig. 1: Three-dimensional cube structure, with vertices labeled from 0 to 7 in binary.
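To make the fragmentation effect concrete, the following sketch (illustrative Python, not part of the original paper; the function name and interface are assumptions made here) searches an n-cube for a free k-dimensional subcube, i.e., a set of 2^k nodes that agree in n-k bit positions and avoid all occupied nodes.

    from itertools import combinations, product

    def free_subcube(n, k, occupied):
        # Search for a k-dimensional subcube of an n-cube (a set of 2**k
        # nodes agreeing in n-k bit positions) that avoids every occupied
        # node; return one such node set, or None if none exists.
        for fixed_pos in combinations(range(n), n - k):
            free_pos = [b for b in range(n) if b not in fixed_pos]
            for fixed_vals in product((0, 1), repeat=n - k):
                base = sum(v << b for v, b in zip(fixed_vals, fixed_pos))
                nodes = {base | sum(((w >> j) & 1) << b
                                    for j, b in enumerate(free_pos))
                         for w in range(2 ** k)}
                if not nodes & set(occupied):
                    return sorted(nodes)
        return None

    # Fig. 1 scenario: tasks on nodes 0 and 7 block every size-four partition,
    # but after the task on node 0 moves to node 4 one becomes available.
    print(free_subcube(3, 2, {0, 7}))   # None
    print(free_subcube(3, 2, {4, 7}))   # [0, 1, 2, 3]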

A second motivation is load balancing [3]. Consider a parallel processing system that permits multiprocessing in its processors. For example, if a partition A shares its time between five tasks, while another partition B of the same size has only three tasks, migration of one task from A to B may represent a more favorable distribution of system load.

The task migration mechanism is also applicable to certain "fork" (spawning) situations. For example, if a task needs to spawn a copy of itself onto a partition of equal size, this is analogous to migrating the task.

Section 2 discusses the parameters of task migration qualitatively. The system model that is used to analyze task migration is presented in Section 3. In Section 4, terminology and parameters to describe task migration in multistage cube networks are defined. Section 5 derives the lower time bounds for task migrations. In Section 6, a mapping between source partition and destination partition PEs is given and proven to minimize task transfer time.

2. Qualitative Discussion of Task Migration Overhead

Several phases can be identified during the migration of a task from a source partition to a destination partition in a distributed memory machine. First, a decision must be made whether to migrate a task, and to which destination to migrate. In some situations, for this decision, the migration overhead must be known so that a task is migrated only if the migration cost is outweighed by the migration gain (this may require estimating the expected completion time of a task, techniques for which are outside of the scope of this paper). If a task can be migrated to one of several destinations, the cost of migration to each of these must be known so that the best choice can be made. Furthermore, there may be more than one task which can be selected to be migrated. Making the migration decision might require computing resources. These are available, however, because a migration can be performed only if free resources (i.e., the potential destination of the migrating task) exist. The decision could be part of an overall automatic system reconfiguration scheme [4].

After the decision has been made, the task currently active in the source partition P_s must be suspended, and the destination partition P_d must be allocated. All necessary information must then be transferred from P_s through the interconnection network to P_d. The time to accomplish the data transfer depends on a combination of the amount of data to be transmitted, the location of source and destination partitions, which source PE is associated with each destination PE, the use of the network by other tasks, the type of interconnection network, and system implementation details. After the data transfer has been completed, the source partition is freed. The system controller can reassign it to a new task, usually as part of a larger partition. At P_d, the migrated task is resumed, and the migration process is completed. The components of the migration process are overviewed in [18].

In general, the data transfers will be the most significant component of the migration cost. Furthermore, it is relatively straightforward to determine the migration costs associated with the suspension and resumption of a task compared to the complexity of determining the cost of data transfers needed for migrating the task. This is a result of the fact that migration messages may conflict with one another in the network and with messages issued by other tasks. Here, only the transfer time of moving a task from a source partition to a destination partition is examined; interference from other tasks is not considered. This is because: (1) it is assumed that two simultaneous migrations would rarely occur, and, furthermore, two that would interfere with one another would be serialized, and (2) interference with normal inter-PE message traffic generated by task(s) executing in partition(s) that the task migration "passes through" is additive in nature and very limited relative to the task migration traffic. In contrast, poor task migration choices can increase transfer time by a multiplicative factor (e.g., by two in the example in Section 4). Migration time using multistage cube networks is studied here (hypercube systems were considered in [5]).

3. System Model

In this section, the model of parallel systems and multistage cube networks used here is briefly overviewed to define the terminology to be employed. It is assumed that the reader is familiar with the basic SIMD and MIMD models of parallelism [8], as well as with the multistage cube network [20, 21].

The partitionable-SIMD/MIMD machine organization shown in Fig. 2 is used as the general system model. It consists of N = 2^n PEs numbered 0 through N-1. The binary representation of a PE number P is denoted p_{n-1}p_{n-2}...p_1p_0. PEs communicate with each other through an interconnection network. Sets of PEs can be grouped to form a partition; the ways in which partitions may be formed depend on the interconnection network and are described below. Each partition can operate in SIMD or in MIMD mode and switch between modes. A multiple-SIMD machine is similar except MIMD operation is not permitted. The research described here is applicable to MIMD, multiple-SIMD, and partitionable-SIMD/MIMD machines; in all three classes PEs can be clustered together to form independent submachines (partitions).

The multistage cube network is part of a large topologically equivalent family of networks including baseline [25], delta (a=b=2) [14], flip [2], generalized cube [20], indirect binary n-cube [15], multistage shuffle-exchange [23], omega [11], and SW-banyan (with S=F=2) [12], and has been used or proposed for use in many machines (e.g., BBN Butterfly, IBM RP3, PASM) [21]. Therefore, the results presented here are directly applicable to physically distributed memory systems using these networks.

Fig. 2: Partitionable-SIMD/MIMD machine model.

The multistage cube interconnection network used in the analyses is a Generalized Cube (GC) network [20, 21]. This network can operate in the SIMD, multiple-SIMD, MIMD, and partitionable-SIMD/MIMD modes of parallelism. An N input/output GC network with N = 8 is shown in Fig. 3a. It has n = log_2 N stages, where each stage consists of a set of N lines (links) connected to N/2 interchange boxes. Each interchange box is a two-input, two-output switch, and can be set to one of the states shown in Fig. 3b (broadcast (one-to-many) connections are not relevant to this study). The links are labeled from 0 to N-1. Links that differ only in bit position i are paired at interchange boxes in stage i. Each interchange box is controlled independently through the use of routing tags. PE i is connected to network input port i and output port i. In general, to go from a source S = s_{n-1}...s_1s_0 to a destination D = d_{n-1}...d_1d_0, the stage i box in the path from S to D must be set to swap if s_i ≠ d_i and to straight if s_i = d_i. There is only one path from a given source to a given destination, because only stage i can determine the i-th bit of the destination address.
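The path-setting rule can be summarized in a few lines of code. The sketch below (illustrative Python; the function name is an assumption made here) derives the per-stage switch settings for the unique path from a source to a destination in an n-stage GC network.

    def gc_path(src, dst, n):
        # Per-stage switch settings for the unique path from src to dst in an
        # n-stage Generalized Cube network (N = 2**n): the stage-i box is set
        # to 'swap' if the two addresses differ in bit i, 'straight' otherwise.
        settings = []
        for i in range(n - 1, -1, -1):          # stages are traversed n-1, ..., 0
            differ = ((src >> i) & 1) != ((dst >> i) & 1)
            settings.append((i, 'swap' if differ else 'straight'))
        return settings

    # Hypothetical N = 8 example: route from PE 5 (101) to PE 7 (111).
    print(gc_path(5, 7, 3))   # [(2, 'straight'), (1, 'swap'), (0, 'straight')]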

One implementation aspect of all multistage networks is the way in which paths through the network are established and released [21]. A packet-switched network divides a message into a sequence of fixed-size packets, and each packet makes its way from stage to stage, releasing links and interchange boxes immediately after using them. Thus, a packet uses only one interchange box at a time. Blocked packets can be stored in queues in interchange boxes. In a circuit-switched network, a complete circuit is established from the network input port to the desired network output port, and then data items are sent through the network. A total of n+1 links and n boxes are held for the duration of the transmission of the entire message. The simplifying assumption is made that if a path is blocked by another transmission, the blocked path is dropped and tried again later.

The GC network control is distributed among the PEs by using a routing tag as a header on each packet in a packet-switched network and to establish each path in a circuit-switched network. Because each source PE generates its own tag, it is possible that a conflict will occur in the network, i.e., the messages at the two input links of a box both require the same box output link.

The partitionability of a GC interconnection network of size N is the ability to divide the network into independent subnetworks of different sizes (where a network of "size" N has N I/O ports) [20, 21]. Each subnetwork of size N' < N must have all of the interconnection capabilities of a GC network originally built to be of size N'. The GC can be partitioned with the constraints that the size of each subnetwork must be a power of two, the physical addresses of the input/output ports of a subnetwork of size 2^k must all agree in some fixed set of n-k bit positions, and each input/output port can belong to at most one subnetwork. Each subnetwork forms a partition composed of the PEs associated with the input/output ports of the subnetwork, and within each partition, the PEs and input/output ports are logically numbered from 0 to 2^k - 1.
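As an illustration of the partitioning constraint, the following sketch (illustrative Python; the helper name and interface are assumptions made here) checks whether a set of PE addresses can form a valid GC partition.

    def is_valid_partition(pes, n):
        # GC partitioning constraint: the set must hold 2**k distinct addresses
        # that agree in some fixed set of n-k bit positions (the remaining k
        # positions, the partition bits, then take on all combinations).
        pes = set(pes)
        K = len(pes)
        if K == 0 or K & (K - 1):               # size must be a power of two
            return False
        k = K.bit_length() - 1
        agree = [b for b in range(n)
                 if len({(p >> b) & 1 for p in pes}) == 1]
        return len(agree) >= n - k              # forces all 2**k combinations

    # Example from the text (N = 32): P_d = {2,6,...,30} is a valid size-8 partition.
    print(is_valid_partition({2, 6, 10, 14, 18, 22, 26, 30}, 5))   # True
    print(is_valid_partition({2, 6, 10, 14, 18, 22, 26, 31}, 5))   # False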

4. Task Migration Using Multistage Cube Networks

Consider the analysis of moving task migration data through an interconnection network in a system of size N = 2^n PEs. The numbers of all PEs in a partition of size K = 2^k agree in n-k bit positions, and the other k bits assume all possible combinations. These k bits are called partition bits. It is assumed that all PEs in P_s must transfer the same amount of data, and P_s and P_d are of the same size (i.e., both contain k partition bits). Further, assume that logical PE i in the source partition transfers its data to logical PE i in the destination partition. However, logical labeling of PEs within a partition is restricted only by general constraints [20], and thus a partition bit in the source partition can correspond to any partition bit in the destination partition, or to its complement. This correspondence of partition bits is the mapping from the source partition to the destination partition. (The logical labeling of PEs within a partition may affect which paths through the network conflict when performing intra-partition transfers, but the number of conflict-free simultaneous paths does not change.)

As an example of a mapping, let N = 32, P_s = {17,19,21,23,25,27,29,31}, and P_d = {2,6,10,14,18,22,26,30}. One possible mapping is to move the contents of PE s_4s_3s_2s_1s_0 = 1 X_2 X̄_4 X_3 1 of P_s to PE d_4d_3d_2d_1d_0 = X_4 X_3 X_2 1 0 of P_d, where X_4, X_3, and X_2 are the partition bits and X̄_4 denotes the complement of X_4; i.e., 21 to 2, 29 to 6, 23 to 10, ..., 27 to 30. Thus, s_3 maps to d_2, s_2 to d_4 (complemented), and s_1 to d_3. Both PE 1X_2X̄_4X_31 of P_s and PE X_4X_3X_210 of P_d have the logical number X_4X_3X_2 within their respective partitions, as shown in Table 1.

Logical PE#         0   1   2   3   4   5   6   7
Physical PE#, P_s  21  29  23  31  17  25  19  27
Physical PE#, P_d   2   6  10  14  18  22  26  30

Table 1: Numbering for N = 32, P_s = {17,19,21,...,31}, P_d = {2,6,10,...,30}, and X_4X_3X_2 is the logical PE number corresponding to 1X_2X̄_4X_31 in P_s and X_4X_3X_210 in P_d.
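The correspondence of partition bits in this example can be spelled out programmatically. The sketch below (illustrative Python; the template notation and helper name are choices made here, with "~name" standing for a complemented logical bit) rebuilds Table 1 from the two partition templates and the bit mapping.

    def physical_pe(template, x):
        # Build a physical PE number from a partition template (tokens listed
        # MSB first: '0', '1', a logical-bit name, or '~name' for its
        # complement) and x, a dict giving each logical bit's value.
        pe = 0
        for tok in template:
            pe <<= 1
            if tok in ('0', '1'):
                pe |= int(tok)
            elif tok.startswith('~'):
                pe |= 1 - x[tok[1:]]
            else:
                pe |= x[tok]
        return pe

    # Source 1 X_2 ~X_4 X_3 1 and destination X_4 X_3 X_2 1 0; the logical
    # number is X_4X_3X_2.  This reproduces the rows of Table 1.
    src = ['1', 'X2', '~X4', 'X3', '1']
    dst = ['X4', 'X3', 'X2', '1', '0']
    for logical in range(8):
        x = {'X4': (logical >> 2) & 1, 'X3': (logical >> 1) & 1, 'X2': logical & 1}
        print(logical, physical_pe(src, x), physical_pe(dst, x))
    # 0 21 2, 1 29 6, 2 23 10, ..., 7 27 30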

Fig. 4a shows an example of a task migration in a 16-PE system, where a two-PE task in PEs 5 and 7 migrates to PEs 13 and 15 with no conflicts. Fig. 4b shows an example where the two-PE task in PEs 5 and 7 migrates to PEs 8 and 9, and because there is a conflict and a link must be shared, the migration takes approximately twice as long as in the conflict-free case.

The utilization factor of a link was defined by Agrawal as the number of times a link must be used to pass a permutation [1]. In the context of task migration, the utilization factor UF of a link can be equivalently defined as the number of PEs that send data over that link to do the migration.

From properties of GC networks, a message from PE s_{n-1}...s_0 to PE d_{n-1}...d_0 enters stage i on a link whose label is d_{n-1}...d_{i+1}s_i...s_0 [20, 21]. Assume that the number of partition bits among the destination bits d_{n-1}...d_{i+1} is k'_i, and that the number of partition bits among the source bits s_i...s_0 is k''_i. Further assume that q_i of the k''_i source partition bits in s_i...s_0 correspond (map) to q_i of the k'_i destination partition bits in d_{n-1}...d_{i+1}. Consider the example in Table 1, where data is moved from PE s_4...s_0 = 1X_2X̄_4X_31 to PE d_4...d_0 = X_4X_3X_210. When stage i = 2 is entered, the migration uses links d_4d_3s_2s_1s_0, where the destination partition bits d_4(=X_4) and d_3(=X_3) correspond to the source partition bits s_2(=X̄_4) and s_1(=X_3), respectively. Hence, k'_2 = k''_2 = q_2 = 2. When stage i = 1 is entered, the migration uses links d_4d_3d_2s_1s_0, so k'_1 = 3, k''_1 = 1, and q_1 = 1.

Fig. 4: Task migration in a system with 16 PEs. (a) Migration from PEs 5 and 7 to 13 and 15. (b) Migration from PEs 5 and 7 to 8 and 9.

The parameters k, k'_i, k''_i, and q_i are the same for all stage i links that are used during the migration. The utilization factor UF_i on each link used during the migration that connects a stage i+1 box to a stage i box is given by:

UF_i = 2^(k - k'_i - k''_i + q_i)

[19]. For a given P_s and P_d, different mappings of source partition bits to destination partition bits can result in different values of q_i, and hence in different utilization factors. For the example in Table 1, the maximum UF_i occurs for UF_2, where k'_2 = k''_2 = q_2 = 2, so UF_2 = 2. If a different mapping which pairs bits d_4 and s_3, d_3 and s_2, and d_2 and s_1 (i.e., maps source PEs 1X_4X_3X_21 to destination PEs X_4X_3X_210) is used, UF_i = 1 for all i, and the migration can be performed more quickly.
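Utilization factors can also be computed directly from the link-label property, by counting how many (source, destination) pairs share each stage-i link. The following sketch (illustrative Python; the helper name is an assumption made here) reproduces the maximum UF_2 = 2 for the Table 1 mapping.

    def utilization_factors(pairs, n):
        # A message from s to d enters stage i on the link labeled
        # d_{n-1}...d_{i+1} s_i...s_0; UF_i is the largest number of
        # migrating PEs that share any one stage-i link.
        ufs = {}
        for i in range(n - 1, -1, -1):
            counts = {}
            for s, d in pairs:
                link = (d >> (i + 1) << (i + 1)) | (s & ((1 << (i + 1)) - 1))
                counts[link] = counts.get(link, 0) + 1
            ufs[i] = max(counts.values())
        return ufs

    # Table 1 mapping (N = 32): the maximum is UF_2 = 2, as stated above.
    pairs = list(zip([21, 29, 23, 31, 17, 25, 19, 27],
                     [2, 6, 10, 14, 18, 22, 26, 30]))
    print(utilization_factors(pairs, 5))   # {4: 1, 3: 1, 2: 2, 1: 1, 0: 1}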

5. Lower Bounds of Migration Time

To determine whether or not the method of task migration presented in Section 6 is indeed an optimum one, a lower bound on the time required for migrating a task through a network has to be determined. For this reason, the optimum utilization factor is introduced:

UF_opt = max over all stages i, 0 ≤ i ≤ n-1, of 2^(k - k'_i - k''_i).

The maximum utilization factor for a migration is:

UF_max = max over all stages i, 0 ≤ i ≤ n-1, of UF_i.

Because UF_i = 2^(k - k'_i - k''_i + q_i) and q_i ≥ 0, UF_opt is clearly a lower bound of the maximum utilization factor. From UF_opt, lower bounds of the migration time in packet-switched and circuit-switched GC networks can be derived.
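Given the partition-bit positions of P_s and P_d, UF_opt can be computed by a direct scan over the stages, as in the sketch below (illustrative Python; the helper name is an assumption made here). It reproduces the UF_opt = 4 value of the example worked in Section 6.

    def uf_opt(src_bits, dst_bits, n):
        # UF_opt = max over i of 2**(k - k'_i - k''_i), where k'_i counts the
        # destination partition bits above position i and k''_i counts the
        # source partition bits at or below position i.
        k = len(src_bits)
        best = 0
        for i in range(n):
            k1 = sum(1 for b in dst_bits if b > i)     # k'_i
            k2 = sum(1 for b in src_bits if b <= i)    # k''_i
            best = max(best, k - k1 - k2)
        return 2 ** best

    # Section 6 example: P_s = X0XX10X, P_d = 1X00XXX in an N = 128 system.
    print(uf_opt({6, 4, 3, 0}, {5, 2, 1, 0}, 7))   # 4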

Consider a packet-switched network first. Examine a link L between stage i+1 and i where UF_i = UF_max. Assume that each of the source PEs has to send M data packets during the migration, and let the network cycle time (i.e., the time to pass a single packet along a link and its associated box circuitry) be T_NC. Further, recall it is assumed that no conflicts with messages from other tasks interfere with the migration. Then link L must carry M x UF_max packets, because UF_max PEs use link L and each PE transfers M packets. A lower bound on the number of packets is given by using UF_opt as the lower bound for UF_max. That is, a lower bound on the number of packets that must traverse link L is M x UF_opt. This will require time T_NC x M x UF_opt.

Each packet has not only to traverse link L, but must go from a source PE to the destination PE and traverse a total of n+1 links on the way (including the network input link). It is assumed that packets can be sent through one after another, i.e., can be "pipelined" through the network. Before the first packet reaches link L, it must traverse n-i other links. After the last packet leaves link L, it must traverse i other links. Therefore, a lower bound of the migration time is given by:

T_MIGR,PACK ≥ T_NC x (M x UF_opt + n).

In a circuit-switched network, the data items must pass through link L as well, but in addition, a path through the network has to be established for each source PE sending to link L. Let the circuit-switched network cycle time (i.e., the time to pass a single word through the network) be T'_NC. Assume that each of the source PEs has to send Y words during the migration, and that it takes S network cycles to establish a path. Recall that UF_opt is the minimum number of PEs that will need to use link L to perform the migration, and that blocked paths are dropped and tried again later (see Section 3). Each of the UF_opt PEs must establish the path that goes through link L and then send Y words through it. Therefore, the minimum migration time in circuit-switched networks is given by:

T_MIGR,CIR ≥ T'_NC x (S + Y) x UF_opt.
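For concreteness, the two lower bounds can be evaluated with a few lines of arithmetic; the sketch below is illustrative Python and the numerical values are hypothetical, not taken from the paper.

    def t_packet(M, UF_opt, n, T_NC):
        # Packet-switched lower bound: M*UF_opt pipelined packets through the
        # most heavily used link, plus n cycles to fill and drain the pipeline.
        return T_NC * (M * UF_opt + n)

    def t_circuit(Y, S, UF_opt, T_NC_circuit):
        # Circuit-switched lower bound: UF_opt path set-ups of S cycles each,
        # each followed by Y words through the shared link.
        return T_NC_circuit * (S + Y) * UF_opt

    # Illustrative numbers only.
    print(t_packet(M=64, UF_opt=4, n=7, T_NC=1))              # 263
    print(t_circuit(Y=64, S=3, UF_opt=4, T_NC_circuit=1))     # 268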

6. Optimum Mapping Rule

Many alternatives to map source partition PEs to destination partition PEs exist. In [19], one mapping was shown that was proven to result in the smallest possible utilization factor and avoid "dual conflicts." However, it was not shown that with this mapping the lower bound on migration time can be achieved. Here, a different mapping is presented that is proven to be optimum because it can be used to perform a migration in the smallest possible time. This mapping rule is now discussed.

Assume that a task with 2^k PEs must be migrated. The optimum utilization factor is UF_opt = 2^u; i.e., u = log_2 UF_opt. First, map the highest (leftmost) u source partition bits to the u lowest (rightmost) destination partition bits in some arbitrary way. Then map the i-th highest remaining source partition bit to the i-th highest destination partition bit, for all remaining source partition bits.
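A sketch of the mapping rule follows (illustrative Python; u = log_2 UF_opt is passed in, the function name is an assumption made here, and the pairing within the lowest u destination bits is one arbitrary choice, as the rule permits).

    def optimum_mapping(src_bits, dst_bits, u):
        # Mapping rule: the u highest source partition bits go to the u lowest
        # destination partition bits (any pairing), and the i-th highest
        # remaining source bit goes to the i-th highest remaining destination
        # bit.  u = log2(UF_opt).
        src_sorted = sorted(src_bits, reverse=True)
        dst_low = sorted(dst_bits)[:u]
        dst_high = sorted(dst_bits, reverse=True)[:len(dst_bits) - u]
        mapping = dict(zip(src_sorted[:u], dst_low))   # one arbitrary pairing
        mapping.update(zip(src_sorted[u:], dst_high))
        return mapping

    # N = 128 example below (u = 2): s_3 -> d_5 and s_0 -> d_2, with s_6 and
    # s_4 sent to the two lowest destination partition bits d_0 and d_1.
    print(optimum_mapping({6, 4, 3, 0}, {5, 2, 1, 0}, 2))
    # {6: 0, 4: 1, 3: 5, 0: 2}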

This is illustrated by the following example. Consider a system with N = 128 PEs, and a task that needs a partition with K = 16 PEs (k = 4). Let the source partition be:

s_6s_5s_4s_3s_2s_1s_0 = X0XX10X,

and the destination partition be:

d_6d_5d_4d_3d_2d_1d_0 = 1X00XXX,

where X denotes a partition bit. UF_opt occurs for i = 2, when the link label is d_6d_5d_4d_3s_2s_1s_0, and where k'_2 = 1 (for d_5) and k''_2 = 1 (for s_0). Thus, UF_opt = 2^(4-1-1) = 2^2 = 4. The use of the optimum mapping rule is shown in Fig. 5.

Fig. 5: Optimum mapping with N = 128, K = 16, and UF_opt = 4.

For the above example, UF_opt = 4, so u = 2. Therefore, the u = 2 highest source partition bits, s_6 and s_4, map to the u = 2 lowest destination partition bits, d_1 and d_0. Then the i-th highest remaining source partition bit maps to the i-th highest remaining destination partition bit; i.e., s_3 maps to d_5, and s_0 maps to d_2.

The migration will be performed in UF_opt steps such that the u highest source partition bits are kept at a fixed value j during step j, 0 ≤ j < UF_opt. The remaining k-u partition bits determine the set of PEs that is active during each step.

For the example above, the source partition PEs will be divided by fixing bit positions s_6 and s_4, and letting bit positions s_3 and s_0 vary. Thus, the K = 16 source partition PEs are divided into four subsets, each subset consisting of four PEs. At each step, one subset will transmit data. In step 0, s_6 = s_4 = 0, and source partition PEs 000X10X (i.e., 0000100, 0000101, 0001100, and 0001101) transmit data. In step 1, s_6 = 0, s_4 = 1, and source PEs 001X10X (i.e., 0010100, 0010101, 0011100, and 0011101) transmit data. In step 2, s_6 = 1, s_4 = 0, and source partition PEs 100X10X (i.e., 1000100, 1000101, 1001100, and 1001101) transmit data. Finally, in step 3, s_6 = s_4 = 1, and source partition PEs 101X10X (i.e., 1010100, 1010101, 1011100, and 1011101) transmit data.
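The step schedule can be generated mechanically from the source-partition template and u, as in the sketch below (illustrative Python; the template-string convention is an assumption made here). For P_s = X0XX10X and u = 2 it lists exactly the four subsets enumerated above, in decimal: {4,5,12,13}, {20,21,28,29}, {68,69,76,77}, {84,85,92,93}.

    def step_schedule(src_template, u):
        # Enumerate the 2**u migration steps: in step j the u highest source
        # partition bits are fixed to the bits of j and the remaining
        # partition bits range over all values.  src_template is a string
        # over {'0','1','X'}, MSB first, with X marking a partition bit.
        n = len(src_template)
        part = [n - 1 - pos for pos, c in enumerate(src_template) if c == 'X']
        fixed = sum(1 << (n - 1 - pos)
                    for pos, c in enumerate(src_template) if c == '1')
        high, low = part[:u], part[u:]               # positions, highest first
        for j in range(2 ** u):
            pes = []
            for v in range(2 ** len(low)):
                pe = fixed
                for b, pos in enumerate(reversed(high)):
                    pe |= ((j >> b) & 1) << pos
                for b, pos in enumerate(reversed(low)):
                    pe |= ((v >> b) & 1) << pos
                pes.append(pe)
            print('step', j, sorted(pes))

    # P_s = X0XX10X with u = 2 gives the four subsets listed above.
    step_schedule('X0XX10X', 2)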

It will be shown that the transfers performed by the PEs during each step do not interfere with each other. Because the overall migration process is performed in steps where only a specific subset of PEs is active, each step can be treated as a task migration by itself. For each of these step task migrations the utilization factor, denoted here with a leading underscore as _UF, determines the amount of interference between migration messages in a step. Clearly, if the migration in each step is to be conflict-free, the utilization factor _UF must be one for all stages and all steps. This is stated in Theorem 1. Lemma 1 is used in the proof of Theorem 1.

Lemma 1: Consider a source partition bit s_r that is one of the k-u partition bits that vary during a migration step. Let s_r map to destination partition bit d_p. Then, because of the mapping rule, p ≥ r.

Proof: The lemma is proven by contradiction. Assume that p < r, and consider the links that connect stage r with stage r-1. From Section 4, at these links, the bits d_{n-1}...d_r contain k'_{r-1} destination partition bits, and the bits s_{r-1}...s_0 contain k''_{r-1} source partition bits. Because the highest u source partition bits are fixed during a step, only k-u bits will vary. Therefore, w_{r-1} = k - u - k''_{r-1} non-fixed source partition bits are in s_{n-1}...s_r. The rightmost of these source partition bits is s_r, and because p < r, this bit was not mapped to one of the k'_{r-1} destination partition bits in d_{n-1}...d_r. Recall from the mapping rule that the u highest source partition bits are mapped to the u lowest destination partition bits and that the next highest w_{r-1} source partition bits are mapped to the w_{r-1} highest destination bits. Because s_r is one of the w_{r-1} non-fixed source partition bits and does not map to one of the k'_{r-1} destination partition bits in d_{n-1}...d_r, it follows from the mapping rule that there must be at least k'_{r-1} non-fixed source partition bits in s_{n-1}...s_{r+1} that do map to these destination partition bits. Thus, w_{r-1} > k'_{r-1}. Therefore,

k'_{r-1} < w_{r-1} = k - u - k''_{r-1}, or

k - k'_{r-1} - k''_{r-1} > u.

However, from the definition of UF_opt in Section 5, u = log_2 UF_opt is the maximum value of k - k'_i - k''_i over all i, and hence the lemma is proven by contradiction. □

Lemma 1 is now used to prove that each step migration can be done with no conflicts.

Theorem 1: If the above mapping rule is followed, the step utilization factor _UF = 1 for all steps and all stages.

Proof: Consider the migration performed in each step separately. For a given step task migration, the values for k, k', k'', and q will be distinguished by a leading underscore. Thus, the number of partition bits for the step task migration is _k = k - u, because u of the k partition bits are fixed during each step. Links connecting stage b+1 and stage b are numbered d_{n-1}...d_{b+1}s_b...s_0. In the step task migration, d_{n-1}...d_{b+1} contain _k'_b destination partition bits, and s_b...s_0 contain _k''_b source partition bits. For the step task migration, neither _k'_b nor _k''_b contains any of the u fixed partition bits. The number of the _k'_b destination partition bits in d_{n-1}...d_{b+1} that correspond to any of the _k''_b source partition bits in s_b...s_0 is _q_b. Let w = k - u - _k''_b. Because of the mapping, each of the _k'_b destination partition bits in d_{n-1}...d_{b+1} must correspond to either one of the _k''_b source partition bits in s_b...s_0 (exactly _q_b do), or one of the w source partition bits in s_{n-1}...s_{b+1}. From Lemma 1, each of the w source partition bits in s_{n-1}...s_{b+1} must correspond to one of the _k'_b destination partition bits in d_{n-1}...d_{b+1}. Therefore, _k'_b = w + _q_b. By definition, w = k - u - _k''_b and _k = k - u, so

_k - _k'_b - _k''_b + _q_b = 0.

Consequently, the step utilization factor is:

_UF_b = 2^(_k - _k'_b - _k''_b + _q_b) = 2^0 = 1

for all stages. □
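As an illustration (not part of the original proof), Theorem 1 can be checked by brute force on the N = 128 example of this section: under one mapping allowed by the rule, no two PEs that transmit in the same step share a link at any stage. The sketch below is illustrative Python with hypothetical variable names.

    from itertools import product

    # Brute-force check of Theorem 1 on the N = 128 example: under the mapping
    # s6->d1, s4->d0, s3->d5, s0->d2 (one choice allowed by the rule), no two
    # PEs transmitting in the same step share a link at any stage.
    n, k, u = 7, 4, 2
    src_bits, mapping = [6, 4, 3, 0], {6: 1, 4: 0, 3: 5, 0: 2}
    src_fixed, dst_fixed = 1 << 2, 1 << 6        # templates X0XX10X and 1X00XXX

    pairs = []
    for vals in product((0, 1), repeat=k):       # every logical PE
        s, d = src_fixed, dst_fixed
        for b, v in zip(src_bits, vals):
            s |= v << b
            d |= v << mapping[b]                 # same value on the mapped bit
        pairs.append((s, d))

    for j in range(2 ** u):                      # step j fixes (s6, s4)
        active = [(s, d) for s, d in pairs
                  if ((s >> 6) & 1, (s >> 4) & 1) == ((j >> 1) & 1, j & 1)]
        for i in range(n):                       # stage-i link label
            links = [(d >> (i + 1) << (i + 1)) | (s & ((1 << (i + 1)) - 1))
                     for s, d in active]
            assert len(links) == len(set(links)), (j, i)
    print('all steps conflict-free')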

Therefore, the migration can be performed in UF_opt steps, with no conflicts in any step. Theorem 2 proves that this implies the migration can be performed in the time given by the lower bounds in Section 5.

Theorem 2: If the step task migration approach described above is followed, the migration will be done in the minimum time possible.

Proof: The minimum migration times for packet- andcircuit-switching were given in Section 5. First, considerpacket-switching, then circuit switching.

From Theorem 1, each step of the migration is done with no conflicts. There are UF_opt steps. At each step, M data packets must be sent. Thus, a total of M x UF_opt packets must be sent (pipelined) through the network. Immediately (in the next network cycle) after one step task migration is completed, the next one is begun. Thus, when the last packets for step task migration i are in stage n-2, the first packets for step task migration i+1 are in stage n-1. Therefore, the total number of network cycles required is M x UF_opt for all the packets to enter the network (in a pipelined fashion), plus n more cycles for the last packet to exit the network. Hence, the total migration time is the lower bound:

T_MIGR,PACK = T_NC x (M x UF_opt + n).

For the circuit-switched case, again from Theorem 1, each step of the migration can be done with no conflicts.


There are UF_opt step task migrations, each requiring time T'_NC x (S + Y). Thus, the total migration time is the lower bound:

T_MIGR,CIR = T'_NC x (S + Y) x UF_opt. □

As an example of the use of this methodology, consider N = 8 and migrating a task from P_s = {0,2,4,6} to P_d = {4,5,6,7}; i.e., P_s = XX0 and P_d = 1XX. UF_opt = 2, so u = 1, and using the mapping rule PE X_0X_10 sends data to PE 1X_1X_0. This is done in two conflict-free steps using the step task migration approach: first X_0 is fixed at 0 and PEs 0 and 2 send data, and then X_0 is fixed at 1 and PEs 4 and 6 send data. This is shown in Fig. 6. Thus, if the network were circuit-switched, the migration time would be:

T'_NC x (S + Y) x 2.

Consider what would happen if, instead of using this methodology, PE X_1X_00 sent data to PE 1X_1X_0 and all PEs attempted to send data simultaneously. Then PE 0 maps to 4, 2 to 5, 4 to 6, and 6 to 7. For both this mapping and the one above, UF_1 = UF_0 = 2. As shown in Fig. 7, if the 0 to 4 path is established first, it can cause the other paths to be blocked (depending on the timing of the other path establishment attempts). The 2 to 5 and 4 to 6 paths can next be established simultaneously, but they block the 6 to 7 path. Finally, the 6 to 7 path is established. Thus, three sets of transfers are needed: (1) from PE 0, (2) from PEs 2 and 4, and (3) from PE 6. Therefore, the migration time would be:

T'_NC x (S + Y) x 3,

50% longer than the minimum, which is attainable with the mapping rule and step migration technique presented.

7. Conclusions

A methodology for performing task migration transfers in the minimum possible time was presented. This methodology consisted of two components: a mapping of source partition PEs to destination partition PEs, and a way to group source PEs when data is transferred. The optimality of the methodology was proven under the assumptions of the model used.

The techniques and equations presented can be used to help determine if a given migration will save more time than it takes. For example, if a task is expected to complete in less time than it takes to migrate it, the migration should not be done.

These techniques can also be used to help select source and destination partitions. If the source partition is fixed, there may be a choice among several possible destination partitions. The optimum choice can be based solely on UF_opt, because the smallest UF_opt will require the shortest transfer time. If multiple destination partitions have the same smallest UF_opt, the choice between these can be made randomly, or other measures that take the effects of other tasks in the system into account can be employed [18].

If there are multiple possible source partitions in addition to multiple destinations, the amount of data that must be transmitted is, in general, different for every source partition. Therefore, UF_opt and the amount of data (M or Y) are used with the equations derived to determine the actual transfer time as a criterion for which source-destination partition pair to use.
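A sketch of such a selection criterion follows (illustrative Python, reusing the uf_opt and t_packet helpers sketched earlier; the candidate format and function name are assumptions made here).

    # Select a source-destination pair by estimated transfer time.  Each
    # candidate is a (src_bits, dst_bits, M) triple, M being the packets each
    # source PE must send; smaller estimated time wins.
    def best_migration(candidates, n, T_NC):
        return min(candidates,
                   key=lambda c: t_packet(c[2], uf_opt(c[0], c[1], n), n, T_NC))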

In summary, as large-scale parallel processing systems become a reality, issues such as partitionability and task migration become important. The technique presented was proven to produce the minimum data transfer time for task migration using multistage cube networks.

Acknowledgement: The authors thank Mark A. Nichols for his comments.

References

[1] D. P. Agrawal, "Graph theoretical analysis and design of multistage interconnection networks," IEEE Trans. Computers, Vol. C-32, July 1983, pp. 637-648.
[2] K. E. Batcher, "The flip network in STARAN," 1976 Int'l Conf. Parallel Processing, Aug. 1976, pp. 65-71.
[3] T. L. Casavant and J. G. Kuhl, "A taxonomy of scheduling in general-purpose distributed computing systems," IEEE Trans. Software Engineering, Vol. SE-14, Feb. 1988, pp. 141-154.
[4] C. H. Chu, E. J. Delp, L. H. Jamieson, H. J. Siegel, J. Weil, and A. B. Whinston, "A model for an intelligent operating system for executing image understanding tasks on a reconfigurable parallel architecture," Journal of Parallel and Distributed Computing, Vol. 6, June 1989, pp. 598-622.
[5] G.-I. Chen and T.-H. Lai, "Virtual subcubes and job migration in a hypercube," 1989 Int'l Conf. Parallel Processing, Vol. II, Aug. 1989, pp. 73-76.
[6] W. Crowther, J. Goodhue, R. Thomas, W. Milliken, and T. Blackadar, "Performance measurements on a 128-node butterfly parallel processor," 1985 Int'l Conf. Parallel Processing, Aug. 1985, pp. 531-540.
[7] S. A. Fineberg, T. L. Casavant, and H. J. Siegel, "Experimental analysis of a mixed-mode parallel architecture performing sequence sorting," 1990 Int'l Conf. Parallel Processing, Aug. 1990, to appear.
[8] M. J. Flynn, "Very high-speed computing systems," Proceedings of the IEEE, Vol. 54, Dec. 1966, pp. 1901-1909.
[9] J. P. Hayes, T. N. Mudge, Q. F. Stout, and S. Colley, "Architecture of a hypercube supercomputer," 1986 Int'l Conf. Parallel Processing, Aug. 1986, pp. 653-660.
[10] W. D. Hillis, The Connection Machine, MIT Press, Cambridge, MA, 1985.
[11] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Computers, Vol. C-24, Dec. 1975, pp. 1145-1155.
[12] G. J. Lipovski and M. Malek, Parallel Computing: Theory and Comparisons, John Wiley & Sons, Inc., NY, NY, 1987.
[13] S. F. Nugent, "The iPSC/2 direct-connect communications technology," 3rd Conf. Hypercube Computers and Applications, Jan. 1988, pp. 51-60.
[14] J. H. Patel, "Performance of processor-memory interconnections for multiprocessors," IEEE Trans. Computers, Vol. C-30, Oct. 1981, pp. 771-780.
[15] M. C. Pease III, "The indirect binary n-cube microprocessor array," IEEE Trans. Computers, Vol. C-26, May 1977, pp. 458-473.
[16] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss, "The IBM Research Parallel Processor Prototype (RP3): introduction and architecture," 1985 Int'l Conf. Parallel Processing, Aug. 1985, pp. 764-771.
[17] G. F. Pfister and V. A. Norton, "'Hot spot' contention and combining in multistage interconnection networks," 1985 Int'l Conf. Parallel Processing, Aug. 1985, pp. 790-797.
[18] T. Schwederski, H. J. Siegel, and T. L. Casavant, "A model of task migration in partitionable parallel processing systems," Frontiers '88: 2nd Symp. on the Frontiers of Massively Parallel Computation, Oct. 1988, pp. 211-214.
[19] T. Schwederski, H. J. Siegel, and T. L. Casavant, "Task migration transfers in multistage cube based parallel systems," 1989 Int'l Conf. Parallel Processing, Vol. I, Aug. 1989, pp. 296-304.
[20] H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies, 2nd Edition, McGraw-Hill, NY, NY, 1990.
[21] H. J. Siegel, W. G. Nation, C. P. Kruskal, and L. M. Napolitano, Jr., "Using the multistage cube network topology in parallel supercomputers," Proceedings of the IEEE, Vol. 77, Dec. 1989, pp. 1932-1953.
[22] H. J. Siegel, T. Schwederski, J. T. Kuehn, and N. J. Davis IV, "An overview of the PASM parallel processing system," in Computer Architecture, D. D. Gajski, V. M. Milutinovic, H. J. Siegel, and B. P. Furht, eds., IEEE Computer Society Press, Washington, DC, 1987, pp. 387-407.
[23] S. Thanawastien and V. P. Nelson, "Interference analysis of shuffle/exchange networks," IEEE Trans. Computers, Vol. C-30, Aug. 1981, pp. 545-556.
[24] L. W. Tucker and G. G. Robertson, "Architectures and applications of the Connection Machine," Computer, Vol. 21, Aug. 1988, pp. 26-38.
[25] C.-L. Wu and T. Y. Feng, "On a class of multistage interconnection networks," IEEE Trans. Computers, Vol. C-29, Aug. 1980, pp. 694-702.
