Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors

YU-KWONG KWOK
The University of Hong Kong

AND

ISHFAQ AHMAD
The Hong Kong University of Science and Technology

Static scheduling of a program represented by a directed task graph on a multiprocessor system to minimize the program completion time is a well-known problem in parallel processing. Since finding an optimal schedule is an NP-complete problem in general, researchers have resorted to devising efficient heuristics. A plethora of heuristics have been proposed based on a wide spectrum of techniques, including branch-and-bound, integer programming, searching, graph theory, randomization, genetic algorithms, and evolutionary methods. The objective of this survey is to describe various scheduling algorithms and their functionalities in a contrasting fashion as well as examine their relative merits in terms of performance and time-complexity. Since these algorithms are based on diverse assumptions, they differ in their functionalities, and hence are difficult to describe in a unified context. We propose a taxonomy that classifies these algorithms into different categories. We consider 27 scheduling algorithms, with each algorithm explained through an easy-to-understand description followed by an illustrative example to demonstrate its operation. We also outline some of the novel and promising optimization approaches and current research trends in the area. Finally, we give an overview of the software tools that provide scheduling/mapping functionalities.

Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); Parallel processors; D.1.3 [Programming Techniques]: Concurrent Programming; Parallel programming; D.4.1 [Operating Systems]: Process Management—Multiprocessing/multiprogramming; Scheduling; F.1.2 [Computation by Abstract Devices]: Modes of Computation—Parallelism and concurrency

General Terms: Algorithms, Design, Performance, Theory

Additional Key Words and Phrases: Automatic parallelization, DAG, multiprocessors, parallel processing, software tools, static scheduling, task graphs

This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST 734/96E, HKUST 6076/97E, and HKU 7124/99E.
Authors' addresses: Y.-K. Kwok, Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong; email: [email protected]; I. Ahmad, Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2000 ACM 0360-0300/99/1200–0406 $5.00

ACM Computing Surveys, Vol. 31, No. 4, December 1999



1. INTRODUCTION

Parallel processing is a promising approach to meet the computational requirements of a large number of current and emerging applications [Hwang 1993; Kumar et al. 1994; Quinn 1994]. However, it poses a number of problems that are not encountered in sequential processing, such as designing a parallel algorithm for the application, partitioning of the application into tasks, coordinating communication and synchronization, and scheduling of the tasks onto the machine. A large body of research efforts addressing these problems has been reported in the literature [Amdahl 1967; Chu et al. 1984; Gajski and Peir 1985; Hwang 1993; Lewis and El-Rewini 1993; Lo et al. 1991; Lord et al. 1983; Manoharan and Topham 1995; Pease et al. 1991; Quinn 1994; Shirazi et al. 1993; Wu and Gajski 1990; Yang and Gerasoulis 1992]. Scheduling and allocation is a highly important issue since an inappropriate scheduling of tasks can fail to exploit the true potential of the system and can offset the gain from parallelization. In this paper we focus on the scheduling aspect.

The objective of scheduling is to minimize the completion time of a parallel application by properly allocating the tasks to the processors. In a broad sense, the scheduling problem exists in two forms: static and dynamic. In static scheduling, which is usually done at compile time, the characteristics of a parallel program (such as task processing times, communication, data dependencies, and synchronization requirements) are known before program execution [Chu et al. 1984; Gajski and Peir 1985]. A parallel program, therefore, can be represented by a node- and edge-weighted directed acyclic graph (DAG), in which the node weights represent task processing times and the edge weights represent data dependencies as well as the communication times between tasks. In dynamic scheduling, only a few assumptions about the parallel program can be made before execution, and thus scheduling decisions have to be made on-the-fly [Ahmad and Ghafoor 1991; Palis et al. 1995]. The goal of a dynamic scheduling algorithm as such includes not only the minimization of the program completion time but also the minimization of the scheduling overhead, which constitutes a significant portion of the cost paid for running the scheduler. We address only the static scheduling problem. Hereafter, we refer to the static scheduling problem as simply scheduling.

CONTENTS

1. Introduction
2. The DAG Scheduling Problem
   2.1 The DAG Model
   2.2 Generation of a DAG
   2.3 Variations in the DAG Model
   2.4 The Multiprocessor Model
3. NP-Completeness of the DAG Scheduling Problem
4. A Taxonomy of DAG Scheduling Algorithms
5. Basic Techniques in DAG Scheduling
   Computing a t-level
   Computing a b-level
   Computing ALAP
6. Description of the Algorithms
   6.1 Scheduling DAGs with Restricted Structures
   6.2 Scheduling Arbitrary DAGs Without Communication
   6.3 UNC Scheduling
   6.4 BNP Scheduling
   6.5 TDB Scheduling
       Constructing the CPN-Dominant Sequence
   6.6 APN Scheduling
       The BSA Algorithm
   6.7 Scheduling in Heterogeneous Environments
   6.8 Mapping Clusters to Processors
7. Some Scheduling Tools
   7.1 Hypertool
   7.2 PYRROS
   7.3 Parallax
   7.4 OREGAMI
   7.5 PARSA
   7.6 CASCH
   7.7 Commercial Tools
8. New Ideas and Research Trends
   8.1 Scheduling Using Genetic Algorithms
   8.2 Randomization Techniques
   8.3 Parallelizing a Scheduling Algorithm
   8.4 Future Research Directions
9. Summary and Concluding Remarks

The scheduling problem is NP-complete for most of its variants except for a few simplified cases (these cases will be elaborated in later sections) [Chretienne 1989; Coffman 1976; Coffman and Graham 1972; El-Rewini et al. 1995; Garey and Johnson 1979; Gonzales, Jr. 1977; Graham et al. 1979; Hu 1961; Kasahara and Narita 1984; Papadimitriou and Ullman 1987; Papadimitriou and Yannakakis 1979; 1990; Rayward-Smith 1987b; Sethi 1976; Ullman 1975]. Therefore, many heuristics with polynomial-time complexity have been suggested [Ahmad et al. 1996; Casavant and Kuhl 1988; Coffman 1976; El-Rewini et al. 1995; El-Rewini et al. 1994; Gerasoulis and Yang 1992; Khan et al. 1994; McCreary et al. 1994; Pande et al. 1994; Prastein 1987; Shirazi et al. 1990; Simons and Warmuth 1989]. However, these heuristics are highly diverse in terms of their assumptions about the structure of the parallel program and the target parallel architecture, and thus are difficult to explain in a unified context.

Common simplifying assumptions include uniform task execution times, zero inter-task communication times, contention-free communication, full connectivity of parallel processors, and availability of an unlimited number of processors. These assumptions may not hold in practical situations for a number of reasons. For instance, it is not always realistic to assume that the task execution times of an application are uniform because the amounts of computation encapsulated in tasks usually vary. Furthermore, parallel and distributed architectures have evolved into various types such as distributed-memory multicomputers (DMMs) [Hwang 1993], shared-memory multiprocessors (SMMs) [Hwang 1993], clusters of symmetric multiprocessors (SMPs) [Hwang 1993], and networks of workstations (NOWs) [Hwang 1993]. Therefore, their more detailed architectural characteristics must be taken into account. For example, intertask communication in the form of message-passing or shared-memory access inevitably incurs a non-negligible amount of latency. Moreover, contention-free communication and full connectivity of processors cannot be assumed for a DMM, an SMP cluster, or a NOW. Thus, scheduling algorithms relying on such assumptions are apt to have restricted applicability in real environments.

Multiprocessor scheduling has been an active research area and, therefore, many different assumptions and terminologies have been independently suggested. Unfortunately, some of the terms and assumptions are neither clearly stated nor consistently used by most of the researchers. As a result, it is difficult to appreciate the merits of various scheduling algorithms and to quantitatively evaluate their performance. To avoid this problem, we first introduce the directed acyclic graph (DAG) model of a parallel program, and then proceed to describe the multiprocessor model. This is followed by a discussion about the NP-completeness of variants of the DAG scheduling problem. Some basic techniques used in scheduling are introduced. Then we describe a taxonomy of DAG scheduling algorithms and use it to classify several reported algorithms.

The problem of scheduling a set of tasks to a set of processors can be divided into two categories: job scheduling and scheduling and mapping (see Figure 1(a)). In the former category, independent jobs are to be scheduled among the processors of a distributed computing system to optimize overall system performance [Bozoki and Richard 1970; Chen and Lai 1988a; Cheng et al. 1986]. In contrast, the scheduling and mapping problem requires the allocation of multiple interacting tasks of a single parallel program in order to minimize the completion time on the parallel computer system [Adam et al. 1974; Ahmad et al. 1996; Bashir et al. 1983; Casavant and Kuhl 1988; Coffman 1976; Veltman et al. 1990]. While job scheduling requires dynamic run-time scheduling that is not a priori decidable, the scheduling and mapping problem can be addressed in both static [El-Rewini et al. 1995; 1994; Gerasoulis and Yang 1992; Hochbaum and Shmoys 1987; 1988; Khan et al. 1994; McCreary et al. 1994; Shirazi et al. 1990] as well as dynamic contexts [Ahmad and Ghafoor 1991; Norman and Thanisch 1993]. When the characteristics of the parallel program, including its task execution times, task dependencies, task communications, and synchronization, are known a priori, scheduling can be accomplished off-line during compile-time. On the contrary, dynamic scheduling in the absence of a priori information is done on-the-fly according to the state of the system.

Two distinct models of the parallel program have been considered extensively in the context of static scheduling: the task interaction graph (TIG) model and the task precedence graph (TPG) model (see Figure 1(b) and Figure 1(c)).

The task interaction graph model, in which vertices represent parallel processes and edges denote the interprocess interaction [Bokhari 1981], is usually used in static scheduling of loosely coupled communicating processes (since all tasks are considered as simultaneously and independently executable, there is no temporal execution dependency) to a distributed system. For example, a TIG is commonly used to model the finite element method (FEM) [Bokhari 1979]. The objective of scheduling is to minimize parallel program completion time by properly mapping the tasks to the processors. This requires balancing the computation load uniformly among the processors while simultaneously keeping communication costs as low as possible. The research in this area was pioneered by Bokhari [1979] and Stone [1977]: Stone [1977] applied network-flow algorithms to solve the assignment problem, whereas Bokhari [1981] described the mapping problem as being equivalent to graph isomorphism, quadratic assignment, and sparse matrix bandwidth reduction problems.

Figure 1. (a) A simplified taxonomy of the approaches to the scheduling problem; (b) a task interaction graph; (c) a task precedence graph.

The task precedence graph model (or simply the DAG), in which the nodes represent the tasks and the directed edges represent the execution dependencies as well as the amount of communication, is commonly used in static scheduling of a parallel program with tightly coupled tasks on multiprocessors. For example, in the task precedence graph shown in Figure 1(c), task n4 cannot commence execution before tasks n1 and n2 finish execution and it gathers all the communication data from n2 and n3. The scheduling objective is to minimize the program completion time (or maximize the speedup, defined as the time required for sequential execution divided by the time required for parallel execution). For most parallel applications, a task precedence graph can model the program more accurately because it captures the temporal dependencies among tasks. This is the model we use in this paper.
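The data-gathering rule above can be sketched in code. The node names, finish-times, edge weights, and processor placements below are hypothetical (Figure 1(c)'s actual values are not reproduced here); the sketch only illustrates that a node's earliest start is bounded by the latest message arrival among its parents, and that messages between tasks on the same processor are commonly assumed to cost nothing.

```python
# Earliest possible start of a child node under DAG precedence: the child may
# begin only after every parent has finished and its message has arrived.
# All names, costs, and placements here are hypothetical.

finish_time = {"n1": 2, "n2": 3}                  # finish-times of n4's parents
comm_cost = {("n1", "n4"): 1, ("n2", "n4"): 5}    # edge weights into n4
proc_of = {"n1": 0, "n2": 1}                      # processor of each parent

def earliest_start(child, child_proc):
    """Latest message-arrival time among the parents of `child`."""
    arrivals = []
    for (parent, dst), c in comm_cost.items():
        if dst != child:
            continue
        if proc_of[parent] == child_proc:
            c = 0                                 # same-processor messages are free
        arrivals.append(finish_time[parent] + c)
    return max(arrivals)

print(earliest_start("n4", 1))   # n2 local: max(2 + 1, 3 + 0) = 3
print(earliest_start("n4", 0))   # n2 remote: max(2 + 0, 3 + 5) = 8
```

Note how the best placement of the child depends on which parent sends the heaviest message, which is precisely the trade-off the scheduling algorithms in this survey exploit.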

As mentioned above, earlier static scheduling research made simplifying assumptions about the architecture of the parallel program and the parallel machine, such as uniform node weights, zero edge weights, and the availability of an unlimited number of processors. However, even with some of these assumptions, the scheduling problem has been proven to be NP-complete except for a few restricted cases [Garey and Johnson 1979]. Indeed, the problem is NP-complete even in two simple cases: (1) scheduling tasks with uniform weights to an arbitrary number of processors [Ullman 1975] and (2) scheduling tasks with weights equal to one or two units to two processors [Ullman 1975]. There are only three special cases for which there exist optimal polynomial-time algorithms. These cases are (1) scheduling tree-structured task graphs with uniform computation costs on an arbitrary number of processors [Hu 1961]; (2) scheduling arbitrary task graphs with uniform computation costs on two processors [Coffman and Graham 1972]; and (3) scheduling an interval-ordered task graph [Fishburn 1985] with uniform node weights to an arbitrary number of processors [Papadimitriou and Yannakakis 1979]. However, even in these cases, communication among tasks of the parallel program is assumed to take zero time [Coffman 1976]. Given these observations, the general scheduling problem cannot be solved in polynomial time unless P = NP.

Due to the intractability of the general scheduling problem, two distinct approaches have been taken: sacrificing efficiency for the sake of optimality and sacrificing optimality for the sake of efficiency. To obtain optimal solutions under relaxed constraints, state-space search and dynamic programming techniques have been suggested. However, these techniques are not useful because most of them are designed to work under restricted environments and, most importantly, they incur an exponential time in the worst case. In view of the ineffectiveness of optimal techniques, many heuristics have been suggested to tackle the problem under more pragmatic situations. While these heuristics are shown to be effective in experimental studies, they usually cannot generate optimal solutions, and there is no guarantee about their performance in general. Most of the heuristics are based on a list scheduling approach [Coffman 1976], which is explained below.
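As a rough illustration of the list scheduling idea that most of the surveyed heuristics build on, the following sketch keeps a priority-ordered list of ready tasks and greedily assigns each to the processor offering the earliest start. The priority function, the toy DAG, and the zero-communication assumption are simplifications chosen for brevity, not features of any particular algorithm described later.

```python
# A minimal list-scheduling skeleton: repeatedly pick the highest-priority
# ready task and place it on the processor where it can start earliest.
# Communication costs are taken as zero here for simplicity.

def list_schedule(tasks, preds, cost, priority, num_procs):
    proc_free = [0] * num_procs          # time each processor becomes free
    finish = {}                          # task -> finish time
    unscheduled = set(tasks)
    while unscheduled:
        ready = [t for t in unscheduled if all(p in finish for p in preds[t])]
        t = max(ready, key=priority)     # pick highest-priority ready task
        data_ready = max((finish[p] for p in preds[t]), default=0)
        pe = min(range(num_procs), key=lambda i: max(proc_free[i], data_ready))
        start = max(proc_free[pe], data_ready)
        finish[t] = start + cost[t]
        proc_free[pe] = finish[t]
        unscheduled.remove(t)
    return max(finish.values())          # schedule length (makespan)

# A hypothetical four-task diamond DAG with computation cost as the priority.
tasks = ["a", "b", "c", "d"]
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
makespan = list_schedule(tasks, preds, cost, priority=lambda t: cost[t], num_procs=2)
print(makespan)   # 7
```

The algorithms surveyed later differ chiefly in how the priority is computed (e.g., from b-levels or t-levels) and in whether communication, processor counts, and network topology are modeled.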

2. THE DAG SCHEDULING PROBLEM

The objective of DAG scheduling is to minimize the overall program finish-time by proper allocation of the tasks to the processors and arrangement of execution sequencing of the tasks. Scheduling is done in such a manner that the precedence constraints among the program tasks are preserved. The overall finish-time of a parallel program is commonly called the schedule length or makespan. Some variations to this goal have been suggested. For example, some researchers proposed algorithms to minimize the mean flow-time or mean finish-time, which is the average of the finish-times of all the program tasks [Bruno et al. 1974; Leung and Young 1989]. The significance of the mean finish-time criterion is that minimizing it in the final schedule leads to the reduction of the mean number of unfinished tasks at each point in the schedule. Some other algorithms try to reduce the setup costs of the parallel processors [Sumichrast 1987]. We focus on algorithms that minimize the schedule length.
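The difference between the two objectives can be seen on a toy schedule; the finish-times below are made up purely for illustration.

```python
# Schedule length (makespan) vs. mean finish-time for one set of task
# finish-times; the values are illustrative only.

finish_times = [3, 4, 4, 9, 10]

schedule_length = max(finish_times)                        # latest finish-time
mean_finish_time = sum(finish_times) / len(finish_times)   # average finish-time
print(schedule_length, mean_finish_time)   # 10 6.0
```

A schedule can lower one measure while raising the other, which is why the two criteria lead to different algorithms.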

2.1 The DAG Model

A parallel program can be represented by a directed acyclic graph (DAG) G = (V, E), where V is a set of v nodes and E is a set of e directed edges. A node in the DAG represents a task, which in turn is a set of instructions that must be executed sequentially without preemption in the same processor. The weight of a node ni is called the computation cost and is denoted by w(ni). The edges in the DAG, each of which is denoted by (ni, nj), correspond to the communication messages and precedence constraints among the nodes. The weight of an edge is called the communication cost of the edge and is denoted by c(ni, nj). The source node of an edge is called the parent node while the sink node is called the child node. A node with no parent is called an entry node and a node with no child is called an exit node. The communication-to-computation ratio (CCR) of a parallel program is defined as its average edge weight divided by its average node weight. Hereafter, we use the terms node and task interchangeably. We summarize in Table I the notation used throughout the paper.
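The definitions above map directly onto a small adjacency representation. The four-node graph below is hypothetical; it only shows how w(ni), c(ni, nj), entry/exit nodes, and the CCR are derived from the two weight maps.

```python
# A node- and edge-weighted DAG stored as two maps, plus the derived
# quantities defined in the text. The graph itself is made up.

w = {"n1": 2, "n2": 4, "n3": 3, "n4": 3}                  # computation costs w(ni)
c = {("n1", "n2"): 1, ("n1", "n3"): 2,                    # communication costs
     ("n2", "n4"): 4, ("n3", "n4"): 5}                    # c(ni, nj)

entry_nodes = [n for n in w if all(dst != n for (_, dst) in c)]   # no parents
exit_nodes = [n for n in w if all(src != n for (src, _) in c)]    # no children

# CCR: average edge weight divided by average node weight.
ccr = (sum(c.values()) / len(c)) / (sum(w.values()) / len(w))

print(entry_nodes, exit_nodes, ccr)   # ['n1'] ['n4'] 1.0
```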

The precedence constraints of a DAG dictate that a node cannot start execution before it gathers all of the messages from its parent nodes. The communication cost between two tasks assigned to the same processor is assumed to be zero. If node ni is scheduled to some processor, then ST(ni) and FT(ni) denote the start-time and finish-time of ni, respectively. After all the nodes have been scheduled, the schedule length is defined as max_i{FT(ni)} across all processors. The goal of scheduling is to minimize max_i{FT(ni)}.

The node and edge weights are usually obtained by estimation at compile-time [Ahmad et al. 1997; Chu et al. 1984; Ha and Lee 1991; Cosnard and Loi 1995; Wu and Gajski 1990]. Generation of the generic DAG model and some of the variations are described below.

2.2 Generation of a DAG

A parallel program can be modeled by a DAG. Although program loops cannot be explicitly represented by the DAG model, the parallelism in data-flow computations in loops can be exploited to subdivide the loops into a number of tasks by the loop-unraveling technique [Beck et al. 1990; Lee and Feng 1991]. The idea is that all iterations of the loop are started or fired together, and operations in various iterations can execute when their input data are ready for access. In addition, for a large class of data-flow computation problems and many numerical algorithms (such as matrix multiplication), there are very few, if any, conditional branches or indeterminism in the program. Thus, the DAG model can be used to accurately represent these applications so that the scheduling techniques can be applied. Furthermore, in many numerical applications, such as Gaussian elimination or fast Fourier transform (FFT), the loop bounds are known during compile-time. As such, one or more iterations of a loop can be deterministically encapsulated in a task and, consequently, be represented by a node in a DAG.
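A minimal sketch of the encapsulation step: when the loop bound is known at compile-time, iterations can be grouped deterministically into tasks that become DAG nodes. The chunking granularity and the assumption that the iterations are mutually independent are illustrative choices, not part of the technique's definition.

```python
# Encapsulating iterations of a known-bound loop into DAG tasks: each task
# node covers a fixed-size block of iterations. Chunk size is an assumption.

def unravel(n_iterations, chunk):
    """One task per block of `chunk` iterations; the blocks are assumed
    independent here, so no edges are generated between them."""
    tasks = []
    for start in range(0, n_iterations, chunk):
        stop = min(start + chunk, n_iterations)
        tasks.append({"iters": (start, stop), "cost": stop - start})
    return tasks

tasks = unravel(10, 4)
print([t["cost"] for t in tasks])   # [4, 4, 2]
```

In a real compiler, data dependencies between iterations would add edges between these task nodes, and the per-task cost would come from profiling rather than the iteration count.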

The node- and edge-weights are usually obtained by estimation using profiling information of operations such as numerical operations, memory access operations, and message-passing primitives [Jiang et al. 1990]. The granularity of tasks usually is specified by the programmers [Ahmad et al. 1997]. Nevertheless, the final granularity of the scheduled parallel program is to be refined by using a scheduling algorithm, which clusters the communication-intensive tasks to a single processor [Ahmad et al. 1997; Yang and Gerasoulis 1992].

2.3 Variations in the DAG Model

There are a number of variations in the generic DAG model described above. The more important variations are: preemptive scheduling vs. nonpreemptive scheduling, parallel tasks vs. nonparallel tasks, and DAG with conditional branches vs. DAG without conditional branches.

Preemptive Scheduling vs. Nonpreemptive Scheduling: In preemptive scheduling, the execution of a task may be interrupted so that the unfinished portion of the task can be re-allocated to a different processor [Chen and Lai 1988b; Gonzales and Sahni 1978; Horvath et al. 1977; Rayward-Smith 1987a]. On the contrary, algorithms assuming nonpreemptive scheduling must allow a task to execute until completion on a single processor. From a theoretical perspective, a preemptive scheduling approach allows more flexibility for the scheduler so that a higher utilization of processors may result. Indeed, a preemptive scheduling problem is commonly reckoned as "easier" than its nonpreemptive counterpart in that there are cases in which polynomial-time solutions exist for the former while the latter is proved to be NP-complete [Coffman and Graham 1972; Gonzalez, Jr. 1977]. However, in practice, interrupting a task and transferring it to another processor can lead to significant processing overhead and communication delays. In addition, a preemptive scheduler itself is usually more complicated since it has to consider when to split a task and where to insert the necessary communication induced by the splitting. We concentrate on the nonpreemptive approaches.

Table I. Notation

Symbol        Definition
ni            The node number of a node in the parallel program task graph
w(ni)         The computation cost of node ni
(ni, nj)      An edge from node ni to nj
c(ni, nj)     The communication cost of the directed edge from node ni to nj
v             Number of nodes in the task graph
e             Number of edges in the task graph
p             The number of processors or processing elements (PEs) in the target system
CP            A critical path of the task graph
CPN           Critical Path Node
IBN           In-Branch Node
OBN           Out-Branch Node
sl            Static level of a node
b-level       Bottom level of a node
t-level       Top level of a node
ASAP          As-soon-as-possible start time of a node
ALAP          As-late-as-possible start time of a node
Ts(ni)        The actual start time of node ni
DAT(ni, P)    The possible data available time of ni on target processor P
ST(ni, P)     The start time of node ni on target processor P
FT(ni, P)     The finish time of node ni on target processor P
VIP(ni)       The parent node of ni whose data arrives last
Pivot_PE      The target processor from which nodes are migrated
Proc(ni)      The processor accommodating node ni
Lij           The communication link between PE i and PE j
CCR           Communication-to-computation ratio
SL            Schedule length
UNC           Unbounded Number of Clusters scheduling algorithms
BNP           Bounded Number of Processors scheduling algorithms
TDB           Task Duplication Based scheduling algorithms
APN           Arbitrary Processor Network scheduling algorithms

Parallel Tasks vs. Nonparallel Tasks: A parallel task is a task that requires more than one processor at the same time for its execution [Wang and Cheng 1991]. Blazewicz et al. [1986; 1984] investigated the problem of scheduling a set of independent parallel tasks to identical processors under preemptive and nonpreemptive scheduling assumptions. Du and Leung [1989] also explored the same problem but with one more flexibility: a task can be scheduled to no more than a certain predefined maximum number of processors. However, in Blazewicz et al.'s approach, a task must be scheduled to a fixed predefined number of processors. Wang and Cheng [1991] further extended the model to allow precedence constraints among tasks. They devised a list scheduling approach to construct a schedule based on the earliest completion time (ECT) heuristic. We concentrate on scheduling DAGs with nonparallel tasks.

DAG with Conditional Branches vs. DAG without Conditional Branches: Towsley [1986] addressed the problem of scheduling a DAG with probabilistic branches and loops to heterogeneous distributed systems. Each edge in the DAG is associated with a nonzero probability that the child will be executed immediately after the parent. He introduced two algorithms based on the shortest path method for determining the optimal assignments of tasks to processors. El-Rewini and Ali [1995] also investigated the problem of scheduling DAGs with conditional branches. Similar to Towsley's approach, they also used a two-step method to construct a final schedule. However, unlike Towsley's model, they modeled a parallel program by using two DAGs: a branch graph and a precedence graph. This model differentiates the conditional branching and the precedence relations among the parallel program tasks. The objective of the first step of the algorithm is to reduce the amount of indeterminism in the DAG by capturing the similarity of different instances of the precedence graph. After this preprocessing step, a reduced branch graph and a reduced precedence graph are generated. In the second step, all the different instances of the precedence graph are generated according to the reduced branch graph, and the corresponding schedules are determined. Finally, these schedules are merged to produce a unified final schedule [El-Rewini and Ali 1995]. Since modeling branching and looping in DAGs is an inherently difficult problem, little work has been reported in this area. We concentrate on DAGs without conditional branching in this research.

2.4 The Multiprocessor Model

In DAG scheduling, the target system is assumed to be a network of processing elements (PEs), each of which is composed of a processor and a local memory unit so that the PEs do not share memory and communication relies solely on message-passing. The processors may be heterogeneous or homogeneous. Heterogeneity of processors means the processors have different speeds or processing capabilities. However, we assume every module of a parallel program can be executed on any processor even though the completion times on different processors may be different. The PEs are connected by an interconnection network with a certain topology. The topology may be fully connected or of a particular structure such as a hypercube or mesh.


3. NP-COMPLETENESS OF THE DAG SCHEDULING PROBLEM

The DAG scheduling problem is in general an NP-complete problem [Garey and Johnson 1979], and algorithms for optimally scheduling a DAG in polynomial time are known only for three simple cases [Coffman 1976]. The first case is to schedule a uniform node-weight free-tree to an arbitrary number of processors. Hu [1961] proposed a linear-time algorithm to solve the problem. The second case is to schedule an arbitrarily structured DAG with uniform node-weights to two processors. Coffman and Graham [1972] devised a quadratic-time algorithm to solve this problem. Both Hu's algorithm and Coffman and Graham's algorithm are based on node-labeling methods that produce optimal scheduling lists leading to optimal schedules. Sethi [1976] then improved the time-complexity of Coffman and Graham's algorithm to almost linear time by suggesting a more efficient node-labeling process. The third case is to schedule an interval-ordered DAG with uniform node-weights to an arbitrary number of processors. Papadimitriou and Yannakakis [1979] designed a linear-time algorithm to tackle the problem. A DAG is called interval-ordered if every two precedence-related nodes can be mapped to two nonoverlapping intervals on the real number line [Fishburn 1985].
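To give the flavor of the first case, here is a sketch of the level-labeling idea behind Hu's algorithm for unit-weight trees: label every node with its distance from the root, then at each unit time step run up to p ready tasks, highest level first. The tree, the tie-breaking, and the data layout are illustrative assumptions; this is the underlying idea, not a faithful reproduction of the 1961 algorithm.

```python
# Level-based scheduling of a unit-weight in-tree on p identical processors.

def hu_schedule(children, root, p):
    # Level labels: the root has level 1; each predecessor (a "child" in the
    # in-tree sense) is one level farther from the root.
    level, stack = {root: 1}, [root]
    while stack:
        n = stack.pop()
        for ch in children.get(n, []):
            level[ch] = level[n] + 1
            stack.append(ch)
    done, time = set(), 0
    while len(done) < len(level):
        # A node is ready once all of its predecessors have executed.
        ready = [n for n in level
                 if n not in done and all(ch in done for ch in children.get(n, []))]
        ready.sort(key=lambda n: -level[n])
        done.update(ready[:p])      # run up to p highest-level ready tasks
        time += 1
    return time                     # schedule length in unit-time steps

# Hypothetical in-tree: c and d feed a; e feeds b; a and b feed the root r.
children = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
print(hu_schedule(children, "r", p=2))   # 4 unit steps
```

For this six-node tree no two-processor schedule can finish in three steps (the first step would need three leaves running at once), so the highest-level-first rule attains the optimum here.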

In all of the above three cases, communication between tasks is ignored. Ali and El-Rewini [1993] showed that an interval-ordered DAG with uniform edge weights, which are equal to the node weights, can also be optimally scheduled in polynomial time. These optimality results are summarized in Table II.

Ullman [1975] showed that scheduling a DAG with unit computation to p processors is NP-complete. He also showed that scheduling a DAG with one or two unit computation costs to two processors is NP-complete [Coffman 1975; Ullman 1975]. Papadimitriou and Yannakakis [1979] showed that scheduling an interval-ordered DAG with arbitrary computation costs to two processors is NP-complete. Garey et al. [1983] showed that scheduling an opposing forest with unit computation to p processors is NP-complete. Finally, Papadimitriou and Yannakakis [1990] showed that scheduling a DAG with unit computation to p processors, possibly with task-duplication, is also NP-complete.

4. A TAXONOMY OF DAG SCHEDULING ALGORITHMS

To outline the variations of scheduling algorithms and to describe the scope of our survey, we introduce in Figure 2 a taxonomy of static parallel scheduling

Table II. Summary of Optimal Scheduling Under Various Simplified Situations

Researcher(s)                       | Complexity   | p         | w(ni)        | Structure        | c(ni, nj)
Hu [1961]                           | O(v)         | —         | Uniform      | Free-tree        | NIL
Coffman and Graham [1972]           | O(v^2)       | 2         | Uniform      | —                | NIL
Sethi [1976]                        | O(vα(v) + e) | 2         | Uniform      | —                | NIL
Papadimitriou and Yannakakis [1979] | O(v + e)     | —         | Uniform      | Interval-ordered | NIL
Ali and El-Rewini [1993]            | O(ve)        | —         | Uniform (=c) | Interval-ordered | Uniform (=c)
Papadimitriou and Yannakakis [1979] | NP-complete  | —         | —            | Interval-ordered | NIL
Garey and Johnson [1979]            | Open         | Fixed, >2 | Uniform      | —                | NIL
Ullman [1975]                       | NP-complete  | —         | Uniform      | —                | NIL
Ullman [1975]                       | NP-complete  | Fixed, >1 | = 1 or 2     | —                | NIL


[Ahmad et al. 1996; Ahmad et al. 1997]. Note that unlike the taxonomy suggested by Casavant and Kuhl [1988], which describes the general scheduling problem (including partitioning and load balancing issues) in parallel and distributed systems, the focus of our taxonomy is on the static scheduling problem, and therefore is only partial.

The highest level of the taxonomy divides the scheduling problem into two categories, depending upon whether the task graph is of an arbitrary structure or a special structure such as trees. Earlier algorithms have made simplifying assumptions about the task graph representing the program and the model of the parallel processor system [Coffman 1976; Gonzalez Jr. 1977]. Most of these algorithms assume the graph to be of a special structure such as a tree, fork-join, etc. In general, however, parallel programs come in a variety of structures, and as such, many recent algorithms are designed to tackle arbitrary graphs. These algorithms can be further divided into two categories. Some algorithms assume the computational costs of all the tasks to be uniform. Others assume the computational costs of tasks to be arbitrary.

Some of the scheduling algorithms that consider intertask communication assume the availability of an unlimited number of processors, while some other algorithms assume a limited number of processors. The former class of algorithms are called the UNC (unbounded number of clusters) scheduling algorithms [Kim and Browne 1988; Kwok and Ahmad 1996; Sarkar 1989; Wong and Morris 1989; Yang and Gerasoulis 1994] and the latter the BNP (bounded number of processors) scheduling algorithms [Adam et al. 1974; Anger et al. 1990; Kim and Yi 1994; Kwok and Ahmad 1997; McCreary and Gill 1989; Palis et al. 1996; Sih and Lee 1993b]. In both classes of algorithms, the processors are assumed to be fully connected, and no attention is paid to link contention or routing strategies used for communication. The technique employed by the UNC algorithms is also called clustering [Kim and Browne 1988; Liou and Palis 1996; Palis et al. 1996; Sarkar 1989; Yang and Gerasoulis 1994]. At the beginning of the scheduling process, each node is considered a cluster. In the subsequent steps, two clusters are merged if the merging reduces the completion time. This merging procedure continues until no cluster can be merged. The rationale behind the UNC algorithms is that they can take advantage of using more processors to further reduce the schedule length. However, the clusters generated by a UNC algorithm need a postprocessing step that maps the clusters onto the processors, because the number of processors available may be less than the number of clusters. As a result, the final solution quality also highly depends on the cluster-mapping step. On the other hand, the BNP algorithms do not need such a postprocessing step. It is an open question as to whether UNC or BNP is better.

We use the terms cluster and processor interchangeably since, in the UNC scheduling algorithms, merging a single-node cluster to another cluster is analogous to scheduling a node to a processor.

There have been a few algorithms designed with the most general model, in which the system is assumed to consist of an arbitrary network topology whose links are not contention-free. These algorithms are called the APN (arbitrary processor network) scheduling algorithms. In addition to scheduling tasks, the APN algorithms also schedule messages on the network communication links. Scheduling of messages may be dependent on the routing strategy used by the underlying network. Optimizing schedule lengths under such an unrestricted environment makes the design of an APN scheduling algorithm intricate and challenging.

The TDB (Task-Duplication Based) scheduling algorithms also assume the availability of an unbounded number of processors but schedule tasks with duplication to further reduce the schedule lengths. The rationale behind the TDB scheduling algorithms is to reduce the communication overhead by redundantly allocating some tasks to multiple processors. In duplication-based scheduling, different strategies can be employed to select ancestor nodes for duplication. Some of the algorithms duplicate only the direct predecessors while others try to duplicate all possible ancestors. For a recent quantitative comparison of TDB scheduling algorithms, the reader is referred to Ahmad and Kwok [1999].

Figure 2. A partial taxonomy of the multiprocessor scheduling problem.


5. BASIC TECHNIQUES IN DAG SCHEDULING

Most scheduling algorithms are based on the so-called list scheduling technique [Adam et al. 1974; Ahmad et al. 1996; Casavant and Kuhl 1988; Coffman 1976; El-Rewini et al. 1995; El-Rewini 1994; Gerasoulis and Yang 1992; Khan et al. 1994; Kwok and Ahmad 1997; McCreary et al. 1994; Shirazi et al. 1990; Yang and Miller 1988]. The basic idea of list scheduling is to make a scheduling list (a sequence of nodes for scheduling) by assigning them some priorities, and then repeatedly execute the following two steps until all the nodes in the graph are scheduled:

(1) Remove the first node from the scheduling list;

(2) Allocate the node to a processor which allows the earliest start-time.
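The two-step loop above can be sketched in a few lines. The following is an illustrative sketch only, not any particular published algorithm: it assumes the scheduling list is already built in a precedence-respecting order, ignores communication costs, and uses a hypothetical four-node graph.

```python
# Minimal list-scheduling sketch (illustrative): repeatedly remove the first
# node from the scheduling list and place it on the processor that allows
# the earliest start-time. Communication costs are ignored here, and the
# list is assumed to respect precedence constraints.

def list_schedule(sched_list, preds, w, num_procs):
    proc_free = [0] * num_procs   # time at which each processor becomes free
    finish = {}                   # finish time of each scheduled node
    for n in sched_list:
        # earliest moment all predecessors of n have finished
        ready = max((finish[p] for p in preds[n]), default=0)
        # pick the processor allowing the earliest start-time for n
        q = min(range(num_procs), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[q], ready)
        finish[n] = start + w[n]
        proc_free[q] = finish[n]
    return max(finish.values())   # the schedule length

# Hypothetical diamond DAG: a -> b, a -> c, b -> d, c -> d
preds = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}
w = {'a': 1, 'b': 2, 'c': 2, 'd': 1}
```

With two processors and the list ['a', 'b', 'c', 'd'], the sketch yields a schedule length of 4, since b and c run in parallel after a; on one processor the length is the serial sum, 6.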

There are various ways to determine the priorities of nodes, such as HLF (Highest Level First) [Coffman 1976]; LP (Longest Path) [Coffman 1976]; LPT (Longest Processing Time) [Friesen 1987; Gonzalez, Jr. 1977]; and CP (Critical Path) [Graham et al. 1979].

Recently a number of scheduling algorithms based on a dynamic list scheduling approach have been suggested [Kwok and Ahmad 1996; Sih and Lee 1993a; Yang and Gerasoulis 1994]. In a traditional scheduling algorithm, the scheduling list is statically constructed before node allocation begins, and, most importantly, the sequencing in the list is not modified. In contrast, after each allocation, these recent algorithms recompute the priorities of all unscheduled nodes, which are then used to rearrange the sequencing of the nodes in the list. Thus, these algorithms essentially employ the following three-step approach:

(1) Determine new priorities of all unscheduled nodes;

(2) Select the node with the highest priority for scheduling;

(3) Allocate the node to the processor which allows the earliest start-time.

Scheduling algorithms that employ this three-step approach can potentially generate better schedules. However, a dynamic approach can increase the time-complexity of the scheduling algorithm.

Two frequently used attributes for assigning priority are the t-level (top level) and b-level (bottom level) [Adam et al. 1974; Ahmad et al. 1996; Gerasoulis and Yang 1992]. The t-level of a node ni is the length of a longest path (there can be more than one longest path) from an entry node to ni (excluding ni). Here, the length of a path is the sum of all the node and edge weights along the path. As such, the t-level of ni highly correlates with ni's earliest start-time, denoted by Ts(ni), which is determined after ni is scheduled to a processor. This is because after ni is scheduled, its Ts(ni) is simply the length of the longest path reaching it. The b-level of a node ni is the length of a longest path from ni to an exit node. The b-level of a node is bounded from above by the length of a critical path. A critical path (CP) of a DAG, which is an important structure in the DAG, is a longest path in the DAG. Clearly, a DAG can have more than one CP. Consider the task graph shown in Figure 3(a). In this task graph, nodes n1, n7, and n8 are the nodes of the only CP and are called CPNs (Critical-Path Nodes). The edges on the CP are shown with thick arrows. The values of the priorities discussed above are shown in Figure 3(b).

Below is a procedure for computing the t-levels:

Computing a t-level

(1) Construct a list of nodes in topological order. Call it TopList.
(2) for each node ni in TopList do
(3)   max = 0
(4)   for each parent nx of ni do
(5)     if t-level(nx) + w(nx) + c(nx, ni) > max then
(6)       max = t-level(nx) + w(nx) + c(nx, ni)
(7)     endif
(8)   endfor
(9)   t-level(ni) = max
(10) endfor

The time-complexity of the above procedure is O(e + v). A similar procedure, which also has time-complexity O(e + v), for computing the b-levels is shown below:

Computing a b-level

(1) Construct a list of nodes in reversed topological order. Call it RevTopList.
(2) for each node ni in RevTopList do
(3)   max = 0
(4)   for each child ny of ni do
(5)     if c(ni, ny) + b-level(ny) > max then
(6)       max = c(ni, ny) + b-level(ny)
(7)     endif
(8)   endfor
(9)   b-level(ni) = w(ni) + max
(10) endfor
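The two procedures above translate almost line-for-line into code. The sketch below is illustrative; the four-node graph and its weights are hypothetical, and the values computed are the initial t-levels and b-levels (edge zeroing during scheduling is not modeled).

```python
# t-levels and b-levels, following the two procedures in the text.
# succ: adjacency lists; w: node weights; c: edge weights.

def topological_order(succ):
    indeg = {n: 0 for n in succ}
    for n in succ:
        for m in succ[n]:
            indeg[m] += 1
    order = [n for n in succ if indeg[n] == 0]
    for n in order:                   # list grows while iterating (Kahn's algorithm)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                order.append(m)
    return order

def t_levels(succ, w, c):
    pred = {n: [] for n in succ}
    for n in succ:
        for m in succ[n]:
            pred[m].append(n)
    t = {}
    for n in topological_order(succ):            # forward topological order
        t[n] = max((t[p] + w[p] + c[(p, n)] for p in pred[n]), default=0)
    return t

def b_levels(succ, w, c):
    b = {}
    for n in reversed(topological_order(succ)):  # reversed topological order
        b[n] = w[n] + max((c[(n, m)] + b[m] for m in succ[n]), default=0)
    return b

# Hypothetical DAG: a -> b, a -> c, b -> d, c -> d
succ = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
w = {'a': 2, 'b': 3, 'c': 1, 'd': 2}
c = {('a', 'b'): 1, ('a', 'c'): 4, ('b', 'd'): 1, ('c', 'd'): 1}
```

For this graph, t-level(d) = 8 (via the path a, c, d) and b-level(a) = 10, which is also the critical-path length.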

In the scheduling process, the t-level of a node varies while the b-level is usually a constant, until the node has been scheduled. The t-level varies because the weight of an edge may be zeroed when the two incident nodes are scheduled to the same processor. Thus, the path reaching a node, whose length determines the t-level of the node, may cease to be the longest one. On the other hand, there are some variations in the computation of the b-level of a node. Most algorithms examine a node for scheduling only after all the parents of the node have been scheduled. In this case, the b-level of a node is a constant until after it is scheduled to a processor. Some scheduling algorithms allow the scheduling of a child before its parents, however, in which case the b-level of a node is also a dynamic attribute. It should be noted that some scheduling algorithms do not take into account the edge weights in computing the b-level. In such a case, the b-level does not change throughout the scheduling process. To distinguish this definition of b-level from the one we described above, we call it the static b-level or simply static level (sl).

Figure 3. (a) A task graph; (b) the static levels (sls), t-levels, b-levels and ALAPs of the nodes.

Different algorithms use the t-level and b-level in different ways. Some algorithms assign a higher priority to a node with a smaller t-level, while some algorithms assign a higher priority to a node with a larger b-level. Still some algorithms assign a higher priority to a node with a larger (b-level − t-level). In general, scheduling in a descending order of b-level tends to schedule critical-path nodes first, while scheduling in an ascending order of t-level tends to schedule nodes in a topological order. The composite attribute (b-level − t-level) is a compromise between the previous two cases. If an algorithm uses a static attribute, such as b-level or static b-level, to order nodes for scheduling, it is called a static algorithm; otherwise, it is called a dynamic algorithm.

Note that the procedure for computing the t-levels can also be used to compute the start-times of nodes on processors during the scheduling process. Indeed, some researchers call the t-level of a node the ASAP (As-Soon-As-Possible) start-time because the t-level is the earliest possible start-time.

Some of the DAG scheduling algorithms employ an attribute called the ALAP (As-Late-As-Possible) start-time [Kwok and Ahmad 1996; Wu and Gajski 1990]. The ALAP start-time of a node is a measure of how far the node's start-time can be delayed without increasing the schedule length. An O(e + v) time procedure for computing the ALAP times is shown below:

Computing ALAP

(1) Construct a list of nodes in reversed topological order. Call it RevTopList.
(2) for each node ni in RevTopList do
(3)   min_ft = CP_Length
(4)   for each child ny of ni do
(5)     if alap(ny) − c(ni, ny) < min_ft then
(6)       min_ft = alap(ny) − c(ni, ny)
(7)     endif
(8)   endfor
(9)   alap(ni) = min_ft − w(ni)
(10) endfor
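As an illustrative sketch (the graph is hypothetical; the topologically sorted node list and the critical-path length are supplied as inputs), the procedure can be coded as:

```python
# ALAP start-times, following the procedure above. The node list is assumed
# to be already in topological order; cp_length is the critical-path length.

def alap_times(toposorted, succ, w, c, cp_length):
    alap = {}
    for n in reversed(toposorted):        # exit nodes are processed first
        min_ft = min([cp_length] +
                     [alap[m] - c[(n, m)] for m in succ[n]])
        alap[n] = min_ft - w[n]
    return alap

# Hypothetical DAG: a -> b, a -> c, b -> d, c -> d; critical path a, c, d
succ = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
w = {'a': 2, 'b': 3, 'c': 1, 'd': 2}
c = {('a', 'b'): 1, ('a', 'c'): 4, ('b', 'd'): 1, ('c', 'd'): 1}
alap = alap_times(['a', 'b', 'c', 'd'], succ, w, c, 10)
```

Nodes on the critical path (a, c, d) get ALAP times equal to their t-levels (zero slack), while node b gets ALAP 4, one unit later than its t-level of 3.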

After the scheduling list is constructed by using the node priorities, the nodes are then scheduled to suitable processors. Usually a processor is considered suitable if it allows the earliest start-time for the node. However, in some sophisticated scheduling heuristics, a suitable processor may not be the one that allows the earliest start-time. These variations are described in detail in Section 6.

6. DESCRIPTION OF THE ALGORITHMS

In this section, we briefly survey algorithms for DAG scheduling reported in the literature. We first describe some of the earlier scheduling algorithms that assume a restricted DAG model, before proceeding to algorithms that remove all such simplifying assumptions. The performance of these algorithms on some primitive graph structures is also discussed. Analytical performance bounds reported in the literature are also briefly surveyed where appropriate. We first discuss the UNC class of algorithms, followed by BNP algorithms and TDB algorithms. Next we describe a few of the relatively unexplored APN class of DAG scheduling algorithms. Finally, we discuss the issues of scheduling in heterogeneous environments and the mapping problem.

6.1 Scheduling DAGs with Restricted Structures

Early scheduling algorithms were typically designed with simplifying assumptions about the DAG and processor network model [Adam et al. 1974; Bruno et al. 1974; Fujii et al. 1969; Gabow 1982]. For instance, the nodes in the DAG were assumed to be of unit computation and communication was not considered; that is, w(ni) = 1 for all i, and c(ni, nj) = 0. Furthermore, some algorithms were designed for specially structured DAGs such as a free-tree [Coffman 1976; Hu 1961].


6.1.1 Hu’s Algorithm for Tree-Structured DAGs. Hu [1961] proposed a polynomial-time algorithm to construct optimal schedules for in-tree structured DAGs with unit computations and without communication. The number of processors is assumed to be limited and is equal to p. The crucial step in the algorithm is a node-labelling process. Each node ni is labelled αi, where αi = xi + 1 and xi is the length of the path from ni to the exit node in the DAG. Here, the notion of length is the number of edges in the path. The labelling process begins with the exit node, which is labelled 1.

Using the above labelling procedure,an optimal schedule can be obtained forp processors by processing a tree-struc-tured task graph in the following steps:

(1) Schedule the first p (or fewer) nodes with the highest numbered label, i.e., the entry nodes, to the processors. If the number of entry nodes is greater than p, choose p nodes whose αi is greater than the others. In case of a tie, choose a node arbitrarily.

(2) Remove the p scheduled nodes from the graph. Treat the nodes with no predecessor as the new entry nodes.

(3) Repeat steps (1) and (2) until all nodes are scheduled.

The labelling process of the algorithm partitions the task graph into a number of levels. In the scheduling process, each level of tasks is assigned to the available processors. Schedules generated using the above steps are optimal under the stated constraints. The reader is referred to Hu [1961] for the proof of optimality. This is illustrated by the simple task graph and its optimal schedule shown in Figure 4. The complexity of the algorithm is linear in terms of the number of nodes because each node in the task graph is visited a constant number of times.
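The three steps above can be sketched compactly for an in-tree of unit-cost tasks. The tree and the processor count below are hypothetical: each node's label is its distance to the exit node plus one, and at each time step up to p ready nodes with the highest labels are scheduled.

```python
# Hu's algorithm sketch: in-tree of unit-cost tasks, no communication.
# parent[n] is the unique successor of n (None for the exit node).

def hu_schedule_length(parent, p):
    def label(n):                       # distance to the exit node, plus 1
        d = 1
        while parent[n] is not None:
            n = parent[n]
            d += 1
        return d

    labels = {n: label(n) for n in parent}
    remaining = set(parent)
    time = 0
    while remaining:
        # a node is ready when no remaining node still points to it
        blocked = {parent[m] for m in remaining}
        ready = [n for n in remaining if n not in blocked]
        # run up to p ready nodes with the highest labels in this time step
        step = sorted(ready, key=lambda n: -labels[n])[:p]
        remaining -= set(step)
        time += 1
    return time

# Hypothetical in-tree: e, f -> c; g -> d; c, d -> r (the exit node)
parent = {'e': 'c', 'f': 'c', 'g': 'd', 'c': 'r', 'd': 'r', 'r': None}
```

For this tree with p = 2, the sketch returns the optimal length of 4 (three steps are impossible, since the exit node's two predecessors cannot both be ready before step 3); with p = 3 all leaves run at once and the length drops to 3.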

Kaufman [1974] devised an algorithm for preemptive scheduling that also works on an in-tree DAG with arbitrary computation costs. The algorithm is based on principles similar to those in Hu's algorithm. The main idea of the algorithm is to break down the non-unit-weighted tasks into unit-weighted tasks. Optimal schedules can be obtained since the resulting DAG is still an in-tree.

Figure 4. (a) A simple tree-structured task graph with unit-cost tasks and without communication among tasks; (b) the optimal schedule of the task graph using three processors.

6.1.2 Coffman and Graham’s Algorithm for Two-Processor Scheduling. Optimal static scheduling has also been addressed by Coffman and Graham [1972]. They developed an algorithm for generating optimal schedules for arbitrarily structured task graphs with unit-weighted tasks and zero-weighted edges on a two-processor system. The algorithm works on similar principles as in Hu's algorithm. The algorithm first assigns labels to each node in the task graph. The assignment process proceeds "up the graph" in a way that considers as candidates for the assignment of the next label all the nodes whose successors have already been assigned a label. After all the nodes are assigned a label, a list is formed by ordering the tasks in decreasing label numbers, beginning with the last label assigned. The optimal schedule is then obtained by scheduling ready tasks in this list to idle processors. This is elaborated in the following steps.

(1) Assign label 1 to one of the exit nodes.

(2) Assume that labels 1, 2, . . . , j − 1 have been assigned. Let S be the set of unassigned nodes with no unlabeled successors. Select an element of S to be assigned label j as follows. For each node x in S, let y1, y2, . . . , yk be the immediate successors of x. Then, define l(x) to be the decreasing sequence of integers formed by ordering the set of y's labels. Suppose that l(x) ≤ l(x′) lexicographically for all x′ in S. Assign the label j to x.

(3) After all tasks have been labeled, use the list of tasks in descending order of labels for scheduling. Beginning from the first task in the list, schedule each task to one of the two given processors that allows the earlier execution of the task.

Schedules generated using the above algorithm are optimal under the given constraints. For the proof of optimality, the reader is referred to Coffman and Graham [1972]. An example is illustrated in Figure 5. Through counter-examples, Coffman and Graham also demonstrated that their algorithm can generate sub-optimal solutions when the number of processors is increased to three or more, or when the number of processors is two and tasks are allowed to have arbitrary computation costs. This is true even when computation costs are allowed to be one or two units. The complexity of the algorithm is O(v^2) because the labelling process and the scheduling process each takes O(v^2) time.
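The labeling phase (steps (1) and (2)) can be sketched as follows. This is an illustrative, naive version: the candidate scan makes it quadratic rather than an optimized implementation, and the DAG is hypothetical. Step (3) then list-schedules the nodes in descending label order.

```python
# Coffman-Graham labeling sketch for a unit-cost DAG. succ: adjacency lists.
# Labels are assigned bottom-up; ties are broken by comparing the decreasing
# sequences of successor labels lexicographically.

def cg_labels(succ):
    labels = {}
    while len(labels) < len(succ):
        # candidates: unlabeled nodes all of whose successors are labeled
        cand = [n for n in succ if n not in labels
                and all(m in labels for m in succ[n])]
        # assign the next label to the candidate whose decreasing sequence
        # of successor labels is lexicographically smallest
        best = min(cand, key=lambda n: sorted((labels[m] for m in succ[n]),
                                              reverse=True))
        labels[best] = len(labels) + 1
    return labels

# Hypothetical DAG: a -> b, a -> c, b -> d, c -> d
labels = cg_labels({'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []})
```

Here the exit node d receives label 1 and the entry node a the highest label 4; scheduling in descending label order on two processors then runs b and c in parallel.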

Figure 5. (a) A simple task graph with unit-cost tasks and no-cost communication edges; (b) the optimal schedule of the task graph on a two-processor system.

Sethi [1976] reported an algorithm to determine the labels in O(v + e) time and also gave an algorithm to construct a schedule from the labeling in O(vα(v) + e) time, where α(v) is an almost constant function of v. The main idea of the improved labeling process is based on the observation that the labels of a set of nodes with the same height depend only on their children. Thus, instead of constructing the lexicographic ordering information from scratch, the labeling process can infer such information by visiting the edges connecting the nodes and their children. As a result, the time-complexity of the labeling process is O(v + e) instead of O(v^2). The construction of the final schedule is done with the aid of a set data structure, for which v access operations can be performed in O(vα(v)) time, where α(v) is the inverse Ackermann's function.

6.1.3 Scheduling Interval-Ordered DAGs. Papadimitriou and Yannakakis [1979] investigated the problem of scheduling unit-computational interval-ordered tasks to multiprocessors. In an interval-ordered DAG, two nodes are precedence-related if and only if the nodes can be mapped to non-overlapping intervals on the real line [Fishburn 1985]. An example of an interval-ordered DAG is shown in Figure 6. Based on the interval-ordered property, the number of successors of a node can be used as a priority to construct a list. An optimal list schedule can be constructed in O(v + e) time. However, as mentioned earlier, the problem becomes NP-complete if the DAG is allowed to have arbitrary computation costs. Ali and El-Rewini [1993] worked on the problem of scheduling interval-ordered DAGs with unit computation costs and unit communication costs. They showed that the problem is tractable and devised an O(ve) algorithm to generate optimal schedules. In their algorithm, which is similar to that of Papadimitriou and Yannakakis [1979], the number of successors is used as a node priority for scheduling.
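The successor-count priority scheme for unit-cost tasks can be sketched as below. The interval-ordered DAG is hypothetical, built from the intervals a = [0,1], b = [0,2], c = [2,3], d = [3,4]; this naive sketch is not the O(v + e) implementation.

```python
# List scheduling for a unit-cost interval-ordered DAG: the priority of a
# node is its number of (transitive) successors. At each unit time step,
# up to p ready nodes with the highest priorities are executed.

def interval_order_schedule_length(succ, p):
    desc = {}
    def descendants(n):                 # all (transitive) successors of n
        if n not in desc:
            s = set()
            for m in succ[n]:
                s |= {m} | descendants(m)
            desc[n] = s
        return desc[n]

    prio = {n: len(descendants(n)) for n in succ}
    pred = {n: [] for n in succ}
    for n in succ:
        for m in succ[n]:
            pred[m].append(n)
    done, remaining, time = set(), set(succ), 0
    while remaining:
        ready = [n for n in remaining if all(q in done for q in pred[n])]
        step = sorted(ready, key=lambda n: -prio[n])[:p]
        done |= set(step)
        remaining -= set(step)
        time += 1
    return time

# x precedes y iff x's interval lies entirely before y's interval
succ = {'a': ['c', 'd'], 'b': ['c', 'd'], 'c': ['d'], 'd': []}
```

With p = 2 the sketch finishes in 3 time units (a and b in parallel, then c, then d), matching the longest path; with p = 1 it takes 4.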

6.2 Scheduling Arbitrary DAGs Without Communication

In this section, we discuss algorithms for scheduling arbitrarily structured DAGs in which computation costs are arbitrary but communication costs are zero.

6.2.1 Level-based Heuristics. Adam et al. [1974] performed an extensive simulation study of the performance of a number of level-based list scheduling heuristics. The heuristics examined are:

Figure 6. (a) A unit-computational interval-ordered DAG; (b) an optimal schedule of the DAG.

● HLFET (Highest Level First with Estimated Times): The notion of level is the sum of computation costs of all the nodes along the longest path from the node to an exit node.

● HLFNET (Highest Levels First with No Estimated Times): In this heuristic, all nodes are scheduled as if they were of unit cost.

● Random: The nodes in the DAG areassigned priorities randomly.

● SCFET (Smallest Co-levels First with Estimated Times): A co-level of a node is determined by computing the sum of the computation costs along the longest path from an entry node to the node. A node has a higher priority if it has the smaller co-level.

● SCFNET (Smallest Co-levels First with No Estimated Times): This heuristic is the same as SCFET except that it schedules the nodes as if they were of unit cost.
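The level used by HLFET (the sum of computation costs along the longest path from a node to an exit node, which is just the static b-level with edge weights ignored) can be computed by a simple bottom-up pass. The graph and weights below are hypothetical.

```python
# Static levels for HLFET: sum of node weights along the longest path from
# each node to an exit node (edge weights ignored).

def static_levels(succ, w):
    memo = {}
    def sl(n):
        if n not in memo:
            memo[n] = w[n] + max((sl(m) for m in succ[n]), default=0)
        return memo[n]
    return {n: sl(n) for n in succ}

# Hypothetical DAG: a -> b, a -> c, b -> d, c -> d
levels = static_levels({'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []},
                       {'a': 2, 'b': 3, 'c': 1, 'd': 2})
```

HLFET would then schedule a first (level 7), then b (level 5), then c (level 3), then d. The co-level used by SCFET is the mirror image, computed top-down from the entry nodes.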

In Adam et al. [1974], an extensive simulation study was conducted using randomly generated DAGs. The performance of the heuristics was ranked in the following order: HLFET, HLFNET, SCFNET, Random, and SCFET. The study provided strong evidence that the CP (critical path) based algorithms have near-optimal performance. In another study conducted by Kohler [1975], the performance of the CP-based algorithms improved as the number of processors increased.

Kasahara et al. [1984] proposed an algorithm called CP/MISF (critical path/most immediate successors first), which is a variation of the HLFET algorithm. The major improvement of CP/MISF over HLFET is that when assigning priorities, ties are broken by selecting the node with a larger number of immediate successors.

In a recent study, Shirazi et al. [1990] proposed two algorithms for scheduling DAGs to multiprocessors without communication. The first algorithm, called HNF (Heavy Node First), is based on a simple local analysis of the DAG nodes at each level. The second algorithm, WL (Weighted Length), considers a global view of a DAG by taking into account the relationship among the nodes at different levels. Compared to a critical-path-based algorithm, Shirazi et al. showed that the HNF algorithm is preferable for its low complexity and good performance.

6.2.2 A Branch-and-Bound Approach. In addition to CP/MISF, Kasahara et al. [1984] also reported a scheduling algorithm based on a branch-and-bound approach. Using Kohler and Steiglitz's [1974] general representation for branch-and-bound algorithms, Kasahara et al. devised a depth-first search procedure to construct near-optimal schedules. Prior to the depth-first search process, priorities are assigned to those nodes in the DAG which may be generated during the search process. The priorities are determined using the priority list of the CP/MISF method. In this way the search procedure can be more efficient both in terms of computing time and memory requirement. Since the search technique is augmented by a heuristic priority assignment method, the algorithm is called DF/IHS (depth-first with implicit heuristic search). The DF/IHS algorithm was shown to give near-optimal performance.

6.2.3 Analytical Performance Bounds for Scheduling without Communication. Graham [1966] proposed a bound on the schedule length obtained by general list scheduling methods. Using a level-based method for generating a list for scheduling, the schedule length SL and the optimal schedule length SL_opt are related by the following:

SL ≤ (2 − 1/p) SL_opt.
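As a quick numeric reading of the bound: the guarantee weakens from a factor of 1 at p = 1 toward a factor of 2 as p grows.

```python
# Graham's bound: any list schedule satisfies SL <= (2 - 1/p) * SL_opt.

def graham_bound(sl_opt, p):
    # worst-case schedule length guaranteed by the bound
    return (2 - 1 / p) * sl_opt
```

For example, with p = 4 and an optimal length of 100, no list schedule is longer than 175.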

Rammamoorthy et al. [1972] used the concept of precedence partitions to generate bounds on the schedule length and the number of processors for DAGs with unit computation costs. An earliest precedence partition Ei is a set of nodes that can be started in parallel at the same earliest possible time constrained by the precedence relations. A latest precedence partition Li is a set of nodes which must be executed at the same latest possible time constrained by the precedence relations. For any i and j, Ei ∩ Ej = ∅ and Li ∩ Lj = ∅. The precedence partitions group tasks into subsets to indicate the earliest and latest times during which tasks can be started and still guarantee minimum execution time for the graph. This time is given by the number of partitions and is a measure of the longest path in the graph. For a graph of l levels, the minimum execution time is l units. In order to execute a graph in the minimum time, the absolute minimum number of processors required is given by max_{1 ≤ i ≤ l} |Ei ∩ Li|.

Rammamoorthy et al. [1972] also developed algorithms to determine the minimum number of processors required to process a graph in the least possible amount of time, and to determine the minimum time necessary to process a task graph given k processors. Since a dynamic programming approach is employed, the computational time required to obtain the optimal solution is quite considerable.
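The partitions can be computed from the earliest and latest start levels of each node: nodes whose earliest and latest levels coincide are exactly those in Ei ∩ Li. A sketch for a hypothetical unit-cost DAG:

```python
# Precedence partitions for a unit-cost DAG: Ei holds nodes whose earliest
# start level is i; Li holds nodes whose latest start level is i. The minimum
# number of processors for a minimum-time schedule is max_i |Ei intersect Li|.
from collections import Counter

def min_processors(succ):
    pred = {n: [] for n in succ}
    for n in succ:
        for m in succ[n]:
            pred[m].append(n)
    early = {}
    def e(n):                      # earliest start level (1-based)
        if n not in early:
            early[n] = 1 + max((e(q) for q in pred[n]), default=0)
        return early[n]
    l_total = max(e(n) for n in succ)   # number of partitions = graph depth
    late = {}
    def lt(n):                     # latest start level without stretching time
        if n not in late:
            late[n] = min((lt(m) - 1 for m in succ[n]), default=l_total)
        return late[n]
    overlap = Counter(e(n) for n in succ if e(n) == lt(n))
    return max(overlap.values())

# Hypothetical diamond DAG: a -> b, a -> c, b -> d, c -> d
p_min = min_processors({'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []})
```

Here l = 3 and E2 ∩ L2 = {b, c}, so two processors are necessary (and sufficient) to finish in three time units.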

Fernandez and Bussell [1973] devised improved bounds on the minimum number of processors required to achieve the optimal schedule length, and on the minimum increase in schedule length if only a certain number of processors are available. The most important contribution is that the DAG is assumed to have unequal computational costs. Although for such a general model partitions similar to those in Rammamoorthy et al.'s work could be defined, Fernandez and Bussell used the concepts of activity and load density, defined below.

Definition 1. The activity of a node ni is defined as

f(ti, t) = 1 if t ∈ [ti − w(ni), ti], and 0 otherwise,

where ti is the finish time of ni.

Definition 2. The load density function is defined by

F(t, t) = Σ_{i=1}^{v} f(ti, t),

where t = (t1, . . . , tv) denotes the vector of task finish times.

Then f(ti, t) indicates the activity of node ni along time, according to the precedence constraints in the DAG, and F(t, t) indicates the total activity of the graph as a function of time. Of particular importance are F(te, t), the earliest load density function, for which all tasks are completed at their earliest times, and F(tl, t), the load density function for which all tasks are completed at their latest times. Now let R(θ1, θ2, t) be the load density function of the tasks, or parts of tasks, remaining within [θ1, θ2] after all tasks have been shifted to form minimum overlap within the interval. Thus, a lower bound on the minimum number of processors to execute the program (represented by the DAG) within the minimum time is given by:

p_min = max_{[θ1, θ2]} ⌈ (1 / (θ2 − θ1)) ∫_{θ1}^{θ2} R(θ1, θ2, t) dt ⌉.

The maximum value obtained over all possible intervals indicates that the whole computation graph cannot be executed with a number of processors smaller than this maximum. Supposing that only p′ processors are available, Fernandez and Bussell [1973] also showed that the schedule length will be longer than the optimal schedule length by no less than the following amount:

max_{w(n1) ≤ w(nk) ≤ CP} ⌈ −w(nk) + (1/p′) ∫_{0}^{w(nk)} F(tl, t) dt ⌉.

In a recent study, Jain and Rajaraman [1994] reported sharper bounds using the above expressions. The idea is that the intervals considered for the integration are not just between the earliest and latest start-times but are based on a partitioning of the graph into a set of disjoint sections. They also devised an upper bound on the schedule length, which is useful in determining the worst-case behavior of a DAG. Not only are their new bounds easier to compute, but they are also tighter, because the DAG partitioning strategy enhances the accuracy of the load activity function.

6.3 UNC Scheduling

In this section, we survey the UNC class of scheduling algorithms. In particular, we will describe in more detail five UNC scheduling algorithms: the EZ, LC, DSC, MD, and DCP algorithms. The DAG shown in Figure 3 is used to illustrate the scheduling process of these algorithms. In order to examine the approximate optimality of the algorithms, we will first describe the scheduling of two primitive DAG structures: the fork set and the join set. Some work on theoretical performance analysis of UNC scheduling is also discussed in the last subsection.

6.3.1 Scheduling of Primitive Graph Structures. To highlight the different characteristics of the algorithms described below, it is useful to consider how the algorithms work on some primitive graph structures. Two commonly used primitive graph structures are fork and join [Gerasoulis and Yang 1992], examples of which are shown in Figure 7. These two graph primitives are useful for understanding the optimality of scheduling algorithms because any task graph can be decomposed into a collection of forks and joins. In the following, we derive the optimal schedule lengths for these primitive structures. The optimal schedule lengths can then be used as a basis for comparing the functionality of the scheduling algorithms described later in this section.

Without loss of generality, assume that for the fork structure, we have:

$$c(n_x, n_1) + w(n_1) \ge c(n_x, n_2) + w(n_2) \ge \cdots \ge c(n_x, n_k) + w(n_k).$$

Then the optimal schedule length is equal to:

$$\max\left\{ w(n_x) + \sum_{i=1}^{j} w(n_i),\; w(n_x) + c(n_x, n_{j+1}) + w(n_{j+1}) \right\},$$

where j is given by the following conditions:

$$\sum_{i=1}^{j} w(n_i) \le c(n_x, n_j) + w(n_j)$$

and

$$\sum_{i=1}^{j+1} w(n_i) > c(n_x, n_{j+1}) + w(n_{j+1}).$$

Figure 7. (a) A fork set; and (b) a join set.


In addition, assume that for the join structure, we have:

$$w(n_1) + c(n_1, n_x) \ge w(n_2) + c(n_2, n_x) \ge \cdots \ge w(n_k) + c(n_k, n_x).$$

Then the optimal schedule length for the join is equal to:

$$\max\left\{ \sum_{i=1}^{j} w(n_i) + w(n_x),\; w(n_{j+1}) + c(n_{j+1}, n_x) + w(n_x) \right\},$$

where j is given by the following conditions:

$$\sum_{i=1}^{j} w(n_i) \le w(n_j) + c(n_j, n_x)$$

and

$$\sum_{i=1}^{j+1} w(n_i) > w(n_{j+1}) + c(n_{j+1}, n_x).$$

From the above expressions, it is clear that an algorithm has to be able to recognize the longest path in the graph in order to generate optimal schedules. Thus, algorithms that consider only b-level or only t-level cannot guarantee optimal solutions. To make proper scheduling decisions, an algorithm has to dynamically examine both b-level and t-level. In the coming subsections, we discuss the performance of the algorithms on these two primitive graph structures.

6.3.2 The EZ Algorithm. The EZ (Edge-Zeroing) algorithm [Sarkar 1989] selects clusters for merging based on edge weights. At each step, the algorithm finds the edge with the largest weight. The two clusters incident on the edge are merged if the merging (thereby zeroing the largest weight) does not increase the completion time.

After two clusters are merged, the ordering of nodes in the resulting cluster is based on the static b-levels of the nodes. The algorithm is briefly described below.

(1) Sort the edges of the DAG in a descending order of edge weights.

(2) Initially all edges are unexamined.
Repeat

(3) Pick an unexamined edge that has the largest edge weight. Mark it as examined. Zero the highest edge weight if the completion time does not increase. In this zeroing process, two clusters are merged, so other edges across these two clusters also need to be zeroed and marked as examined. The ordering of nodes in the resulting cluster is based on their static b-levels.
Until all edges are examined.
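The edge-zeroing loop above can be sketched as follows. This is a simplified illustration, not Sarkar's implementation: clusters are tracked with union-find, and `schedule_length` is a hypothetical, user-supplied evaluator that returns the completion time when the edges in `zeroed` carry zero communication cost.

```python
def find(parent, x):
    # Union-find root lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def edge_zeroing(nodes, edges, schedule_length):
    """edges: list of (u, v, weight); schedule_length(zeroed) -> completion time."""
    parent = {n: n for n in nodes}
    zeroed = set()
    best = schedule_length(zeroed)
    # Examine edges in descending order of weight.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if find(parent, u) == find(parent, v):
            zeroed.add((u, v))       # endpoints already share a cluster
            continue
        trial = zeroed | {(u, v)}
        length = schedule_length(trial)
        if length <= best:           # merge only if completion time does not grow
            parent[find(parent, u)] = find(parent, v)
            zeroed = trial
            best = length
    return zeroed, best
```

The evaluator hides the node-ordering and completion-time machinery; any clustering-aware scheduler can play that role.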

The time-complexity of the EZ algorithm is O(e(e + v)). For the DAG shown in Figure 3, the EZ algorithm generates the schedule shown in Figure 8(a). The steps of scheduling are shown in Figure 8(b).

Performance on fork and join: Since the EZ algorithm considers only the communication costs among nodes to make scheduling decisions, it does not guarantee optimal schedules for either fork or join structures.

6.3.3 The LC Algorithm. The LC (Linear Clustering) algorithm [Kim and Browne 1988] merges nodes to form a single cluster based on the CP. The algorithm first determines the set of nodes constituting the CP, then schedules all the CP nodes to a single processor at once. These nodes and all edges incident on them are then removed from the DAG. The algorithm is briefly described below.

(1) Initially, mark all edges as unexamined.
Repeat


(2) Determine the critical path composed of unexamined edges only.

(3) Create a cluster by zeroing all the edges on the critical path.

(4) Mark all the edges incident on the critical path and all the edges incident to the nodes in the cluster as examined.
Until all edges are examined.

The time-complexity of the LC algorithm is O(v(e + v)). For the DAG shown in Figure 3, the LC algorithm generates the schedule shown in Figure 9(a); the scheduling steps are shown in Figure 9(b).

Performance on fork and join: Since the LC algorithm does not schedule nodes on different paths to the same processor, it cannot guarantee optimal solutions for either fork or join structures.

6.3.4 The DSC Algorithm. The DSC (Dominant Sequence Clustering) algorithm [Yang and Gerasoulis 1993] considers the Dominant Sequence (DS) of a graph. The DS is the CP of the partially scheduled DAG. The algorithm is briefly described below.

(1) Initially, mark all nodes as unexamined. Initialize a ready node list L to contain all entry nodes. Compute the b-level of each node. Set the t-level of each ready node.
Repeat

(2) If the head of L, ni, is a node on the DS, zero the edge between ni and one of its parents so that the t-level of ni is minimized. If no zeroing is accepted, the node remains in a single-node cluster.

(3) If the head of L, ni, is not a node on the DS, zero the edge between ni and one of its parents so that the t-level of ni is minimized under the constraint called the Dominant Sequence Reduction Warranty (DSRW). If some of its parents are entry nodes that do not have any child other than ni, merge part of them so that the t-level of ni is minimized. If no zeroing is accepted, the node remains in a single-node cluster.

Figure 8. (a) The schedule generated by the EZ algorithm (schedule length = 18); (b) a scheduling trace of the EZ algorithm.


(4) Update the t-level and b-level of the successors of ni and mark ni as examined.
Until all nodes are examined.

DSRW: Zeroing the incoming edges of a ready node should not affect the future reduction of t-level(ny), where ny is a not-yet-ready node with a higher priority, if t-level(ny) is reducible by zeroing an incoming DS edge of ny.

The time-complexity of the DSC algorithm is O((e + v) log v). For the DAG shown in Figure 3, the DSC algorithm generates the schedule shown in Figure 10(a). The steps of scheduling are given in the table shown in Figure 10(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

Performance on fork and join: The DSC algorithm dynamically tracks the critical path in the DAG using both t-level and b-level. In addition, it schedules each node to start as early as possible. Thus, for both fork and join structures, the DSC algorithm can guarantee optimal solutions.

Yang and Gerasoulis [1993] also investigated the granularity issue of clustering. They considered a DAG consisting of fork (Fx) and join (Jx) structures such as the two shown in Figure 7. Suppose we have:

$$g(F_x) = \frac{\min\{w(n_i)\}}{\max\{c(n_x, n_i)\}}, \qquad g(J_x) = \frac{\min\{w(n_i)\}}{\max\{c(n_i, n_x)\}}.$$

Then the granularity of a DAG is defined as g = min{g_x}, where g_x = min{g(F_x), g(J_x)}. A DAG is called coarse-grain if g ≥ 1. Based on this definition of granularity, Yang and Gerasoulis proved that the DSC algorithm has the following performance bound:

$$SL_{DSC} \le \left(1 + \frac{1}{g}\right) SL_{opt}.$$

Thus, for a coarse-grain DAG, the DSC algorithm can generate a schedule length within a factor of two of the optimal. Yang and Gerasoulis also proved that the DSC algorithm is optimal for any coarse-grain in-tree, and for any single-spawn out-tree with uniform computation costs and uniform communication costs.

Figure 9. (a) The schedule generated by the LC algorithm (schedule length = 19); (b) a scheduling trace of the LC algorithm.
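The granularity measure and the DSC bound can be sketched as follows. This is an illustrative reconstruction: each fork and join is summarized by the set of its member node costs and the set of its edge costs, which are treated as given inputs (the function and parameter names are ours, not the paper's).

```python
def granularity(forks, joins):
    """forks/joins: lists of (node_costs, edge_costs) pairs, i.e.,
    ({w(n_i)}, {c(n_x, n_i)}) for each fork F_x and
    ({w(n_i)}, {c(n_i, n_x)}) for each join J_x."""
    gs = [min(ws) / max(cs) for ws, cs in forks + joins]
    return min(gs)  # g = min over all fork/join structures

def dsc_bound(sl_opt, g):
    # Upper bound on the DSC schedule length for a DAG of granularity g:
    # SL_DSC <= (1 + 1/g) * SL_opt.
    return (1 + 1 / g) * sl_opt
```

For a coarse-grain DAG (g ≥ 1), `dsc_bound` never exceeds twice the optimal length, matching the factor-of-two statement above.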

6.3.5 The MD Algorithm. The MD (Mobility Directed) algorithm [Wu and Gajski 1990] selects a node ni for scheduling based on an attribute called the relative mobility, defined as:

$$\frac{Cur\_CP\_Length - (b\text{-}level(n_i) + t\text{-}level(n_i))}{w(n_i)}.$$

If a node is on the current CP of the partially scheduled DAG, the sum of its b-level and t-level is equal to the current CP length. Thus, the relative mobility of a node is zero if it is on the current CP. The algorithm is described below.
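As a minimal illustration of the attribute (the function and argument names are ours, not from the paper), relative mobility is zero exactly when the node lies on the current critical path:

```python
def relative_mobility(cur_cp_length, b_level, t_level, w):
    """Relative mobility of a node with cost w; zero iff the node is on
    the current critical path (b_level + t_level == cur_cp_length)."""
    return (cur_cp_length - (b_level + t_level)) / w
```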

(1) Mark all nodes as unexamined. Initially, there is no cluster.
Repeat

(2) Compute the relative mobility of each node.

(3) Let L′ be the group of unexamined nodes with the minimum relative mobility. Let ni be a node in L′ that does not have any predecessors in L′. Starting from the first cluster, check whether there is any cluster that can accommodate ni. In the checking process, all idle time slots in a cluster are examined until one is found that is large enough to hold ni. A large enough idle time slot may be created by pulling already scheduled nodes downward, because the start-times of the already scheduled nodes are not yet fixed. If ni cannot be scheduled to the first cluster, try the second cluster, and so on. If ni cannot be scheduled to any existing cluster, leave it as a new cluster.

Figure 10. (a) The schedule generated by the DSC algorithm (schedule length = 17); (b) a scheduling trace of the DSC algorithm (N.C. indicates "not considered").

(4) When ni is scheduled to cluster m, all edges connecting ni and other nodes already scheduled to cluster m are changed to zero. If ni is scheduled before node nj on cluster m, add an edge with weight zero from ni to nj in the DAG. If ni is scheduled after node nj on the cluster, add an edge with weight zero from nj to ni; then check whether the added edges form a loop. If so, schedule ni to the next available space.

(5) Mark ni as examined.
Until all nodes are examined.

The time-complexity of the MD algorithm is O(v³). For the DAG shown in Figure 3, the MD algorithm generates the schedule shown in Figure 11(a). The steps of scheduling are given in the table shown in Figure 11(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

Performance on fork and join: Using the notion of relative mobility, the MD algorithm is also able to track the critical path of the DAG in the scheduling process. Thus, the algorithm can generate optimal schedules for fork and join as well.

Figure 11. (a) The schedule generated by the MD algorithm (schedule length = 17); (b) a scheduling trace of the MD algorithm (N.C. indicates "not considered," N.R. indicates "no room").

6.3.6 The DCP Algorithm. The DCP (Dynamic Critical Path) algorithm, proposed by Kwok and Ahmad [1996], is designed based on an attribute slightly different from the relative mobility used in the MD algorithm. Essentially, the DCP algorithm examines a node ni for scheduling if, among all nodes, ni has the smallest difference between its ALST (Absolute Latest Start-Time) and AEST (Absolute Earliest Start-Time). The value of this difference is equivalent to the node's mobility, defined as:

$$Cur\_CP\_Length - (b\text{-}level(n_i) + t\text{-}level(n_i)).$$

The DCP algorithm uses a lookahead strategy to find a better cluster for a given node. The DCP algorithm is briefly described below.

Repeat

(1) Compute (Cur_CP_Length − (b-level(ni) + t-level(ni))) for each node ni.

(2) Suppose that nx is the node with the largest priority. Let nc be the child node (i.e., the critical child) of nx that has the largest priority.

(3) Select a cluster P such that the sum Ts(nx) + Ts(nc) (the start-time of nx plus that of its critical child) is the smallest among all the clusters holding nx's parents or children. In examining a cluster, first try not to pull down any node to create or enlarge an idle time slot. If this is not successful in finding a slot for nx, scan the cluster for a suitable idle time slot again, possibly by pulling some already scheduled nodes downward.

(4) Schedule nx to P.
Until all nodes are scheduled.

The time-complexity of the DCP algorithm is O(v³). For the DAG shown in Figure 3, the DCP algorithm generates the schedule shown in Figure 12(a). The steps of scheduling are given in the table shown in Figure 12(b). In the table, the composite start-times of the node (i.e., the start-time of the node plus that of its critical child) on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

Performance on fork and join: Since the DCP algorithm examines the first unscheduled node on the current critical path by using mobility measures, it constructs optimal solutions for fork and join graph structures.

6.3.7 Other UNC Approaches. Kim and Yi [1994] proposed a two-pass scheduling algorithm with time-complexity O(v log v). The idea of the algorithm comes from the scheduling of in-trees. Kim and Yi observed that an in-tree can be efficiently scheduled by iteratively merging a node to the parent node that allows the earliest completion time. To extend this idea to arbitrarily structured DAGs, Kim and Yi devised a two-pass algorithm. In the first pass, an independent v-graph is constructed for each exit node and an iterative scheduling process is carried out on the v-graphs. This phase is called forward-scheduling. Since some intermediate nodes may be assigned to different processors in different schedules, a backward-scheduling phase, the second pass of the algorithm, is needed to resolve the conflicts. In their simulation study, the two-pass algorithm outperformed a simulated annealing approach. Moreover, as the principles of the algorithm originated from scheduling trees, the algorithm is optimal for both fork and join structures.

6.3.8 Theoretical Analysis for UNC Scheduling. In addition to the granularity analysis performed for the DSC algorithm, Yang and Gerasoulis [1993] worked on a general analysis for UNC scheduling. They introduced a notion called δ-lopt, which is defined below.

Definition 2. Let $SL_i^{lopt}$ be the optimum schedule length at step i of a UNC scheduling algorithm. A UNC scheduling algorithm is called δ-lopt if $\max_i \{SL_i - SL_i^{lopt}\} \le \delta$, where δ is a given constant.

In their study, they examined two simple critical-path-based UNC scheduling heuristics called RCP and RCP*. Essentially, both heuristics use b-level as the scheduling priority, with the slight difference that RCP* uses (b-level − w(ni)) as the priority. They showed that both heuristics are δ-lopt, and thus demonstrated that critical-path-based scheduling algorithms are near-optimal.

6.4 BNP Scheduling

In this section we survey the BNP class of scheduling algorithms. In particular, we discuss in detail six BNP scheduling algorithms: the HLFET, ISH, MCP, ETF, DLS, and LAST algorithms. Again, the DAG shown in Figure 3 is used to illustrate the scheduling process of these algorithms. The analytical performance bounds of BNP scheduling are also discussed in the last subsection.

6.4.1 The HLFET Algorithm. The HLFET (Highest Level First with Estimated Times) algorithm [Adam et al. 1974] is one of the simplest list scheduling algorithms and is described below.

(1) Calculate the static b-level (i.e., sl or static level) of each node.

Figure 12. (a) The schedule generated by the DCP algorithm (schedule length = 16); (b) a scheduling trace of the DCP algorithm (N.C. indicates "not considered," N.R. indicates "no room").


(2) Make a ready list in a descending order of static b-level. Initially, the ready list contains only the entry nodes. Ties are broken randomly.
Repeat

(3) Schedule the first node in the ready list to a processor that allows the earliest execution, using the non-insertion approach.

(4) Update the ready list by inserting the nodes that are now ready.
Until all nodes are scheduled.
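The steps above can be sketched as a compact list scheduler. This is an illustrative implementation under simplifying assumptions, not the original code: `nodes` is assumed topologically ordered, `succ` maps each node to its children, `w` and `c` hold computation and communication costs, `p` is the processor count, and ties are broken deterministically rather than randomly. Non-insertion means each processor is only appended to at its current ready time.

```python
def static_b_level(nodes, succ, w):
    """sl(n) = w(n) + max over children of sl (communication costs excluded)."""
    sl = {}
    for n in reversed(nodes):  # reverse topological order
        sl[n] = w[n] + max((sl[s] for s in succ[n]), default=0)
    return sl

def hlfet(nodes, succ, w, c, p):
    sl = static_b_level(nodes, succ, w)
    pred = {n: [] for n in nodes}
    for u in nodes:
        for v in succ[u]:
            pred[v].append(u)
    ready = sorted((n for n in nodes if not pred[n]), key=lambda n: -sl[n])
    proc_free = [0] * p          # each processor's current ready time
    finish, where = {}, {}
    while ready:
        n = ready.pop(0)         # highest static b-level first
        best = None
        for q in range(p):       # earliest execution over all processors
            est = proc_free[q]
            for u in pred[n]:    # data-ready time on processor q
                est = max(est, finish[u] + (0 if where[u] == q else c[(u, n)]))
            if best is None or est < best[0]:
                best = (est, q)
        start, q = best
        where[n], finish[n] = q, start + w[n]
        proc_free[q] = finish[n]
        for s in succ[n]:        # insert newly ready nodes
            if s not in finish and s not in ready and all(u in finish for u in pred[s]):
                ready.append(s)
        ready.sort(key=lambda m: -sl[m])
    return max(finish.values())
```

For a small fork (root cost 1, two children of cost 2, edge costs 1) on two processors, the sketch places one child with its parent and one remotely, giving schedule length 4.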

The time-complexity of the HLFET algorithm is O(v²). For the DAG shown in Figure 3, the HLFET algorithm generates the schedule shown in Figure 13(a). The steps of scheduling are given in the table shown in Figure 13(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

Performance on fork and join: Since the HLFET algorithm schedules nodes based on b-level only, it cannot guarantee optimal schedules for fork and join structures even if given sufficient processors.

Figure 13. (a) The schedule generated by the HLFET algorithm (schedule length = 19); (b) a scheduling trace of the HLFET algorithm (N.C. indicates "not considered").

6.4.2 The ISH Algorithm. The ISH (Insertion Scheduling Heuristic) algorithm [Kruatrachue and Lewis 1987] uses the "scheduling holes," i.e., the idle time slots, in the partial schedules. The algorithm tries to fill the holes by scheduling other nodes into them, and uses static b-level as the priority of a node. The algorithm is briefly described below.

(1) Calculate the static b-level of each node.

(2) Make a ready list in a descending order of static b-level. Initially, the ready list contains only the entry nodes. Ties are broken randomly.
Repeat

(3) Schedule the first node in the ready list to the processor that allows the earliest execution, using the non-insertion algorithm.

(4) If scheduling of this node causes an idle time slot, then find as many nodes as possible from the ready list that can be scheduled into the idle time slot but cannot be scheduled earlier on other processors.

(5) Update the ready list by inserting the nodes that are now ready.
Until all nodes are scheduled.

The time-complexity of the ISH algorithm is O(v²). For the DAG shown in Figure 3, the ISH algorithm generates the schedule shown in Figure 14(a). The steps of scheduling are given in the table shown in Figure 14(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk. Hole tasks are the nodes considered for scheduling into the idle time slots.

Performance on fork and join: Since the ISH algorithm schedules nodes based on b-level only, it cannot guarantee optimal schedules for fork and join structures even if given sufficient processors.

Figure 14. (a) The schedule generated by the ISH algorithm (schedule length = 19); (b) a scheduling trace of the ISH algorithm (N.C. indicates "not considered").

6.4.3 The MCP Algorithm. The MCP (Modified Critical Path) algorithm [Wu and Gajski 1990] uses the ALAP time of a node as the scheduling priority. The MCP algorithm first computes the ALAP times of all the nodes, then constructs a list of nodes in an ascending order of ALAP times. Ties are broken by considering the ALAP times of the children of a node. The MCP algorithm then schedules the nodes on the list one by one such that a node is scheduled to a processor that allows the earliest start-time, using the insertion approach. The MCP algorithm and the ISH algorithm have different philosophies in utilizing the idle time slots: MCP looks for an idle time slot for a given node, while ISH looks for a hole node to fit into a given idle time slot. The algorithm is briefly described below.

(1) Compute the ALAP time of each node.

(2) For each node, create a list consisting of the ALAP times of the node itself and all its children in a descending order.

(3) Sort these lists in an ascending lexicographical order. Create a node list according to this order.
Repeat

(4) Schedule the first node in the node list to a processor that allows the earliest execution, using the insertion approach.

(5) Remove the node from the node list.
Until the node list is empty.
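The ALAP computation in step (1) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: ALAP(n) is taken as the critical-path length minus the b-level of n (the longest path from n to an exit, including communication costs), computed in reverse topological order; `succ`, `w`, and `c` are the same hypothetical inputs used in the earlier sketches.

```python
def alap_times(nodes, succ, w, c):
    """ALAP(n) = CP length - b-level(n); nodes must be topologically ordered."""
    blevel = {}
    for n in reversed(nodes):
        blevel[n] = w[n] + max(
            (c[(n, s)] + blevel[s] for s in succ[n]), default=0)
    cp = max(blevel.values())  # critical-path length of the DAG
    return {n: cp - blevel[n] for n in nodes}
```

Sorting nodes by ascending ALAP then yields the MCP node list (with the children-based tie-breaking described above applied on top).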

The time-complexity of the MCP algorithm is O(v² log v). For the DAG shown in Figure 3, the MCP algorithm generates the schedule shown in Figure 15(a). The steps of scheduling are given in the table shown in Figure 15(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

Performance on fork and join: Since the MCP algorithm schedules nodes based on ALAP (effectively based on b-level) only, it cannot guarantee optimal schedules for fork and join structures even if given sufficient processors.

6.4.4 The ETF Algorithm. The ETF (Earliest Time First) algorithm [Hwang et al. 1989] computes, at each step, the earliest start-times of all ready nodes and then selects the one with the smallest start-time. Here, the earliest start-time of a node is computed by examining the start-times of the node on all processors exhaustively. When two nodes have the same value of earliest start-time, the ETF algorithm breaks the tie by scheduling the one with the higher static priority. The algorithm is described below.

(1) Compute the static b-level of each node.

(2) Initially, the pool of ready nodes includes only the entry nodes.
Repeat

(3) Calculate the earliest start-time on each processor for each node in the ready pool. Pick the node-processor pair that gives the earliest time, using the non-insertion approach. Ties are broken by selecting the node with a higher static b-level. Schedule the node to the corresponding processor.

(4) Add the newly ready nodes to the ready node pool.
Until all nodes are scheduled.

The time-complexity of the ETF algorithm is O(pv²). For the DAG shown in Figure 3, the ETF algorithm generates the schedule shown in Figure 16(a). The scheduling steps are given in the table shown in Figure 16(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

Performance on fork and join: Since the ETF algorithm schedules nodes based on b-level only, it cannot guarantee optimal schedules for fork and join structures even if given sufficient processors.

Hwang et al. [1989] also analyzed the performance bound of the ETF algorithm. They showed that the schedule length SL_ETF produced by the ETF algorithm satisfies the following relation:

$$SL_{ETF} \le \left(2 - \frac{1}{p}\right) SL_{opt}^{nc} + C,$$

where $SL_{opt}^{nc}$ is the optimal schedule length without considering communication delays, and C is the communication requirement over some parent-parent pairs along a path. An algorithm is also provided to compute C.

Figure 15. (a) The schedule generated by the MCP algorithm (schedule length = 20); (b) a scheduling trace of the MCP algorithm (N.C. indicates "not considered").

6.4.5 The DLS Algorithm. The DLS (Dynamic Level Scheduling) algorithm [Sih and Lee 1993a] uses an attribute called the dynamic level (DL), which is the difference between the static b-level of a node and its earliest start-time on a processor. At each scheduling step, the algorithm computes the DL for every node in the ready pool on all processors. The node-processor pair that gives the largest value of DL is selected for scheduling. This mechanism is similar to the one used by the ETF algorithm. However, there is one subtle difference between the ETF algorithm and the DLS algorithm: the ETF algorithm always schedules the node with the minimum earliest start-time and uses static b-level merely to break ties. In contrast, the DLS algorithm tends to schedule nodes in a descending order of static b-levels at the beginning of the scheduling process, but tends to schedule nodes in an ascending order of t-levels (i.e., the earliest start-times) near the end of the scheduling process. The algorithm is briefly described below.

(1) Calculate the b-level of each node.

(2) Initially, the ready node pool includes only the entry nodes.
Repeat

(3) Calculate the earliest start-time of every ready node on each processor. Then compute the DL of every node-processor pair by subtracting the earliest start-time from the node's static b-level.

(4) Select the node-processor pair that gives the largest DL. Schedule the node to the corresponding processor.

(5) Add the newly ready nodes to the ready pool.
Until all nodes are scheduled.
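One selection step of the loop above can be sketched as follows. This is illustrative only: `sl` maps nodes to static b-levels, and `est` is a hypothetical earliest-start-time evaluator for a (node, processor) pair.

```python
def dls_pick(ready, processors, sl, est):
    """Return the (node, processor) pair maximizing
    DL = static b-level - earliest start-time."""
    return max(
        ((n, q) for n in ready for q in processors),
        key=lambda pair: sl[pair[0]] - est(pair[0], pair[1]),
    )
```

Early on, start-times are small and the b-level term dominates (HLFET-like behavior); late in the schedule, b-levels are small and the start-time term dominates, as the paragraph above describes.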

Figure 16. (a) The schedule generated by the ETF algorithm (schedule length = 19); (b) a scheduling trace of the ETF algorithm (N.C. indicates "not considered").

The time-complexity of the DLS algorithm is O(pv³). For the DAG shown in Figure 3, the DLS algorithm generates the schedule shown in Figure 17(a). The steps of scheduling are given in the table shown in Figure 17(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

Performance on fork and join: Even though the DLS algorithm schedules nodes based on dynamic levels, it cannot guarantee optimal schedules for fork and join structures even if given sufficient processors.

6.4.6 The LAST Algorithm. The LAST (Localized Allocation of Static Tasks) algorithm [Baxter and Patel 1989] is not a list scheduling algorithm; it uses for node priority an attribute called D_NODE, which depends only on the incident edges of a node. D_NODE is defined below:

$$D\_NODE(n_i) = \frac{\sum_k c(n_k, n_i)\, D\_EDGE(n_k, n_i) + \sum_j c(n_i, n_j)\, D\_EDGE(n_i, n_j)}{\sum_k c(n_k, n_i) + \sum_j c(n_i, n_j)}.$$

In the above definition, D_EDGE is equal to 1 if one of the nodes on the edge is assigned to some processor. The main goal of the LAST algorithm is to minimize the overall communication. The algorithm is briefly described below.

(1) For each entry node, set its D_NODE value to 1. Set all other D_NODE values to 0.
Repeat

(2) Let candidate be the node with the highest D_NODE value.

(3) Schedule candidate to the processor that allows the minimum start-time.

(4) Update the D_EDGE and D_NODE values of all nodes adjacent to candidate.
Until all nodes are scheduled.

Figure 17. (a) The schedule generated by the DLS algorithm (schedule length = 19); (b) a scheduling trace of the DLS algorithm (N.C. indicates "not considered").


The time-complexity of the LAST algorithm is O(v(e + v)). For the DAG shown in Figure 3, the LAST algorithm generates the schedule shown in Figure 18(a). The steps of scheduling are given in the table shown in Figure 18(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

Performance on fork and join: Since the LAST algorithm schedules nodes based on edge costs only, it cannot guarantee optimal schedules for fork and join structures even if given sufficient processors.

Figure 18. (a) The schedule generated by the LAST algorithm (schedule length = 19); (b) a scheduling trace of the LAST algorithm (N.C. indicates "not considered").

6.4.7 Other BNP Approaches. McCreary and Gill [1989] proposed a BNP scheduling technique based on the clustering method. In the algorithm, the DAG is first parsed into a set of CLANs. Informally, two nodes ni and nj are members of the same CLAN if and only if parents of nj outside the CLAN are also parents of ni, and children of ni outside the CLAN are also children of nj. Essentially, a CLAN is a subset of nodes where every element outside the set is related in the same way to each member in the set. The CLANs so derived are hierarchically related by a parse tree. That is, a CLAN can be a subset of another CLAN of larger size. Trivial CLANs include the single nodes and the whole DAG. Depending upon the number of processors available, the CLAN parse tree is traversed to determine the optimal CLAN size for assignment so as to reduce the schedule length.

Sih and Lee [1993b] reported a BNP scheduling scheme that is also based on clustering. The algorithm is called declustering because, upon forming a hierarchy of clusters, the optimal cluster size is determined possibly by cracking some large clusters in order to gain more parallelism while minimizing the schedule length. Thus, using similar principles as in McCreary and Gill's approach, Sih and Lee's scheme also traverses the cluster hierarchy from top to bottom in order to match the level of cluster granularity to the characteristics of the target architecture. The crucial difference between their methods is in the cluster formation stage. While McCreary and Gill's method is based on CLAN construction, Sih and Lee's approach is to isolate a collection of edges that are likely candidates for separating the nodes at both ends onto different processors. These cut-edges are temporarily removed from the DAG, and the algorithm designates each remaining connected component as an elementary cluster.

Lee et al. [1991] reported a BNP scheduling algorithm targeted at dataflow multiprocessors, based on a vertical layering method for the DAG. In their scheme, the DAG is first partitioned into a set of vertical layers of nodes. The initial set of vertical layers is built around the critical path of the DAG and is then optimized by considering various cases of accounting for possible interprocessor communication, which may in turn induce new critical paths. Finally, the vertical layers of nodes are mapped to the given processors in order to minimize the schedule length.

Zhu and McCreary [1992] reported a set of BNP scheduling algorithms for trees. They first devised an algorithm for finding optimal schedules for trees, in particular binary trees. Nonetheless, the algorithm is of exponential complexity, since optimal scheduling of trees is an NP-complete problem. They then proposed a number of heuristic approaches that can generate reasonably good solutions within a much shorter amount of time. The heuristics are all greedy in nature in that they attempt to minimize the completion times of paths in the tree and exploit only a small number of possible paths in the search for a good schedule.

Varvarigou et al. [1996] proposed a BNP scheduling scheme for in-forests and out-forests. However, their algorithm assumes that the trees have unit computation costs and unit communication costs. Another distinctive feature of their algorithm is that its time-complexity, O(v^{2p}), is pseudopolynomial, which is polynomial if p is fixed and small. The idea of their algorithm is to first transform the trees into delay-free trees, which are then scheduled using an optimal merging algorithm. This transformation step is crucial and is done as follows. For each node, a successor node is selected to be scheduled immediately after the node. Then, since the communication costs are unit, the communication costs between the node and all other successors can be dropped. Only an extra communication-free edge needs to be added between the chosen successor and the other successors. The successor node is selected so that the resulting DAG does not violate the precedence constraints of the original DAG.

Pande et al. [1994] proposed a BNP scheduling scheme using a thresholding technique. The algorithm first computes the earliest start-times and latest start-times of the nodes. The threshold for a node is then the difference between its earliest and latest start-times. A global threshold is varied between the minimum threshold among the nodes and the maximum. For a node with threshold less than the global value, a new processor is allocated to the node, if any is available. For a node with threshold above the global value, the node is scheduled to the same processor as the parent which allows the earliest start-time. The rationale of the scheme is that, as the threshold of a node represents the delay its execution can tolerate without increasing the overall schedule length, a node with a smaller threshold deserves a new processor so that it can start as early as possible. Depending upon the number of given processors, there is a trade-off between parallelism and schedule length, and the global threshold is adjusted accordingly.
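As a sketch of the thresholding idea (not Pande et al.'s actual code; the graph encoding, node weights w, and edge costs c are illustrative assumptions), the earliest and latest start-times can be computed by a forward and a backward pass over the DAG:

```python
# A sketch of the thresholding idea (not Pande et al.'s actual code):
# a node's threshold is the gap between its latest and earliest
# start-times. The graph encoding, node weights w, and edge costs c
# are illustrative assumptions.

def thresholds(topo_order, preds, succs, w, c):
    est = {}  # earliest start-times: forward pass
    for n in topo_order:
        est[n] = max((est[p] + w[p] + c[(p, n)] for p in preds[n]),
                     default=0)
    # Length of the longest path (earliest completion time of the DAG).
    cp_len = max(est[n] + w[n] for n in topo_order)
    lst = {}  # latest start-times: backward pass
    for n in reversed(topo_order):
        lst[n] = min((lst[s] - c[(n, s)] for s in succs[n]),
                     default=cp_len) - w[n]
    return {n: lst[n] - est[n] for n in topo_order}
```

Nodes on a critical path get threshold 0, so under the scheme they are the first candidates to receive their own processors.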

6.4.8 Analytical Performance Bounds of BNP Scheduling. For the BNP class of scheduling algorithms, Al-Mouhamed [1990] extended Fernandez and Bussell's work [1973] (described in Section 6.2.3) and devised a bound on the minimum number of processors for optimal schedule length, as well as a bound on the minimum increase in schedule length if only a certain smaller number of processors is available. Essentially, Al-Mouhamed extended the techniques of Fernandez et al. to arbitrary DAGs with communication. Furthermore, the expressions for the bounds are similar to the ones reported by Fernandez and Bussell, except that Al-Mouhamed conjectured that the bounds need not be computed across all possible integer intervals within the earliest completion time of the DAG. However, Jain and Rajaraman [1995], in a subsequent study, found that the computation of these bounds does need to consider all the integer intervals within the earliest completion time of the DAG. They also reported a technique to partition the DAG into nodes with non-overlapping intervals so that a tighter bound is obtained; in addition, the new bounds can take less time to compute. Jain and Rajaraman also found that such a partitioning facilitates considering all possible integer intervals in order to compute a tighter bound.

6.5 TDB Scheduling

In this section, we survey the TDB class of DAG scheduling algorithms. We describe in detail six TDB scheduling algorithms: the PY, LWB, DSH, BTDH, LCTD, and CPFD algorithms. The DAG shown in Figure 3 is used to illustrate the scheduling process of these algorithms.

In the following, we do not discuss the performance of the TDB algorithms on fork and join sets separately because, with duplication, the TDB scheduling schemes can inherently produce optimal solutions for these two primitive structures. For a fork set, a TDB algorithm duplicates the root on every processor so that each child starts at the earliest possible time. For a join set, although no duplication is needed to start the sink node at the earliest time, all the TDB algorithms surveyed in this section employ a similar recursive scheduling process to minimize the start-times of nodes so that an optimal schedule results.

6.5.1 The PY Algorithm. The PY algorithm (named after Papadimitriou and Yannakakis [1990]) is an approximation algorithm which uses an attribute, called the e-value, to approximate the absolute achievable lower bound of the start-time of a node. This attribute is computed recursively, beginning from the entry nodes and moving toward the exit nodes. A procedure for computing the e-values is given below.

(1) Construct a list of nodes in topological order. Call it TopList.
(2) for each node ni in TopList do
(3)   if ni has no parent then e(ni) = 0
(4)   else
(5)     for each parent nx of ni do f(nx) = e(nx) + c(nx, ni) endfor
(6)     Construct a list of the parents in decreasing f. Call it ParentList.


(7)     Let min_e = the f value of the first parent in ParentList.
(8)     Make ni a single-node cluster. Call it Cluster(ni).
(9)     for each parent nx in ParentList do
(10)      Include Cluster(nx) in Cluster(ni).
(11)      Compute the new min_e (i.e., start-time) of ni in Cluster(ni).
(12)      if new min_e > original min_e then exit this for-loop endif
(13)    endfor
(14)    e(ni) = min_e
(15)  endif
(16) endfor

After computing the e-values, the algorithm inserts each node into a cluster, in which a group of ancestors are duplicated such that the ancestors have data arrival times larger than the e-value of the node. Papadimitriou and Yannakakis also showed that the schedule length generated is within a factor of two of the optimal. The PY algorithm is briefly described below.

(1) Compute e-values for all nodes.
(2) for each node ni do
(3)   Assign ni to a new processor PEi.
(4)   for all ancestors of ni, duplicate an ancestor nx if:

        e(nx) + w(nx) + c(nx, ni) > e(ni)

(5)   Order the nodes in PEi so that a node starts as soon as all its data is available.
(6) endfor

The time-complexity of the PY algorithm is O(v^2(e + v log v)). For the DAG shown in Figure 3, the PY algorithm generates the schedule shown in Figure 19(a). The e-values are also shown in Figure 19(b).
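The ancestor-duplication test in step (4) can be sketched as follows (an illustrative helper, not part of the published algorithm; it checks only direct parents, and e, w, and c are assumed precomputed):

```python
# Hedged sketch of the PY duplication criterion: a parent nx is
# duplicated on ni's processor when its remotely computed result
# would arrive after ni's approximate lower-bound start-time e(ni).
# Shown for direct parents only; e, w, and c are assumed precomputed.

def parents_to_duplicate(ni, parents, e, w, c):
    return [nx for nx in parents
            if e[nx] + w[nx] + c[(nx, ni)] > e[ni]]
```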

6.5.2 The LWB Algorithm. We call this algorithm the LWB (Lower Bound) algorithm [Colin and Chretienne 1991] based on its main principle: it first determines the lower bound start-time for each node, and then identifies a set of critical edges in the DAG. A critical edge is one for which a parent's message-available time for the child is greater than the lower bound start-time of the child. Colin and Chretienne [1991] showed that the LWB algorithm can generate optimal schedules for DAGs in which node weights are strictly larger than any edge weight. The LWB algorithm is briefly described below.

(1) For each node ni, compute its lower bound start-time, denoted by lwb(ni), as follows:

Figure 19. (a) The schedule generated by the PY algorithm (schedule length = 21); (b) the e-values of the nodes computed by the PY algorithm.


(a) For any entry node ni, lwb(ni) is zero.

(b) For any node ni other than an entry node, consider the set of its parents. Let nx be the parent such that lwb(nx) + w(nx) + c(nx, ni) is the largest among all parents. Then the lower bound of ni, lwb(ni), is given by, with ny ≠ nx:

lwb(ni) = MAX{lwb(nx) + w(nx), MAX{lwb(ny) + w(ny) + c(ny, ni)}}

(2) Consider the set of edges in the task graph. An edge (nx, ni) is labelled as "critical" if lwb(nx) + w(nx) + c(nx, ni) > lwb(ni).

(3) Assign each path of critical edges to a distinct processor such that each node is scheduled to start at its lower bound start-time.

The time-complexity of the LWB algorithm is O(v^2). For the DAG shown in Figure 3, the LWB algorithm generates the schedule shown in Figure 20(a). The lower bound values are also shown in Figure 20(b).
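A minimal sketch of steps (1) and (2) (our own illustrative encoding, not the authors' code; the nodes are assumed to be given in topological order):

```python
# A sketch of the LWB lower-bound computation and critical-edge
# labelling; the graph encoding and topological order are assumptions.

def lower_bounds(topo_order, preds, w, c):
    lwb = {}
    for ni in topo_order:
        if not preds[ni]:
            lwb[ni] = 0
            continue
        # nx: the parent whose message would arrive last.
        nx = max(preds[ni], key=lambda p: lwb[p] + w[p] + c[(p, ni)])
        arrivals = [lwb[p] + w[p] + c[(p, ni)]
                    for p in preds[ni] if p != nx]
        # ni may share nx's processor, so nx's edge cost is dropped.
        lwb[ni] = max([lwb[nx] + w[nx]] + arrivals)
    return lwb

def critical_edges(edges, lwb, w, c):
    return [(nx, ni) for (nx, ni) in edges
            if lwb[nx] + w[nx] + c[(nx, ni)] > lwb[ni]]
</imports>```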

6.5.3 The DSH Algorithm. The DSH (Duplication Scheduling Heuristic) algorithm [Kruatrachue and Lewis 1988] considers the nodes in descending order of their priorities. In examining the suitability of a processor for a node, the DSH algorithm first determines the start-time of the node on the processor without duplicating any ancestor. It then considers duplication into the idle time period between the finish-time of the last scheduled node on the processor and the start-time of the node currently under consideration. The algorithm tries to duplicate the ancestors of the node into this duplication time slot until either the slot is used up or the start-time of the node no longer improves. The algorithm is briefly described below.

(1) Compute the static b-level for each node.
Repeat

(2) Let ni be an unscheduled node with the largest static b-level.

Figure 20. (a) The schedule generated by the LWB algorithm (schedule length = 16); (b) the lwb (lower bound) values of the nodes computed by the LWB algorithm.

(3) For each processor P, do:
(a) Let the ready time of P, denoted by RT, be the finish-time of the last node on P. Compute the start-time of ni on P and denote it by ST. Then the duplication time slot on P has length (ST - RT). Let candidate be ni.

(b) Consider the set of candidate's parents. Let nx be the parent of candidate which is not scheduled on P and whose message for candidate has the latest arrival time. Try to duplicate nx into the duplication time slot.

(c) If the duplication is unsuccessful, then record ST for this processor and try another processor; otherwise, let ST be candidate's new start-time and let candidate be nx. Go to step (b).

(4) Let P' be the processor that gives the earliest start-time of ni. Schedule ni to P' and perform all the necessary duplications on P'.
Until all nodes are scheduled.

The time-complexity of the DSH algorithm is O(v^4). For the DAG shown in Figure 3, the DSH algorithm generates the schedule shown in Figure 21(a). The steps of scheduling are given in the table shown in Figure 21(b). In the table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.
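A highly simplified sketch of one duplication attempt into the idle slot [RT, ST) follows (our own illustration: the real algorithm repeats this recursively per processor and accounts for the duplicated ancestor's own data arrival, which we here assume is available by RT; all names and values are hypothetical):

```python
# Highly simplified sketch of one DSH-style duplication round for the
# idle slot [RT, ST) on a processor: duplicate the latest-arriving
# remote parent if it fits. We assume the duplicate can start at RT;
# the actual DSH algorithm recurses on the duplicate's own parents.

def try_duplication(RT, ST, parents, w, arrival, on_processor):
    remote = [p for p in parents if p not in on_processor]
    if not remote:
        return ST
    nx = max(remote, key=lambda p: arrival[p])   # latest message
    if RT + w[nx] > ST:
        return ST            # slot too small: duplication fails
    on_processor.add(nx)     # nx now runs locally inside the slot
    finish = RT + w[nx]
    still_remote = [arrival[p] for p in parents if p not in on_processor]
    return min(ST, max([finish] + still_remote))
```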

6.5.4 The BTDH Algorithm. The BTDH (Bottom-Up Top-Down Duplication Heuristic) algorithm [Chung and Ranka 1992] is an extension of the DSH algorithm described above. The major improvement of the BTDH algorithm over the DSH algorithm is that it keeps duplicating ancestors of a node even if the duplication time slot is totally used up (i.e., the start-time of the node temporarily increases), with the hope that the start-time will eventually be minimized. That is, the BTDH algorithm is the same as the DSH algorithm except for step (3)(c) of the latter, in that the duplication of an ancestor is considered successful even if the duplication time slot is used up. The process stops only when the final start-time of the node is greater than before the duplication. The time-complexity of the BTDH algorithm is also O(v^4). For the DAG shown in Figure 3, the BTDH algorithm generates the same schedule as the DSH algorithm, which is shown in Figure 21(a). The scheduling process is also the same except at step (5): when node n6 is considered for scheduling on PE 2, the start-time computed by the BTDH algorithm is 5 instead of the 6 computed by the DSH algorithm. This is because the BTDH algorithm does not stop the duplication process even though the start-time temporarily increases.

Figure 21. (a) The schedule generated by the DSH algorithm (schedule length = 15); (b) a scheduling trace of the DSH algorithm.

6.5.5 The LCTD Algorithm. The LCTD (Linear Clustering with Task Duplication) algorithm [Chen et al. 1993] is based on linear clustering of the DAG. After performing the clustering step, the LCTD algorithm identifies the edges among clusters that determine the completion time. Then, it tries to duplicate the parents corresponding to these edges to reduce the start-times of some nodes in the clusters. The algorithm is described below.

(1) Apply the LC algorithm to the DAG to generate a set of linear clusters.

(2) Schedule each linear cluster to a distinct processor and let the nodes start as early as possible on the processors.

(3) For each linear cluster Ci do:
(a) Let the first node in Ci be nx.
(b) Consider the set of nx's parents. Select the parent that allows the largest reduction of nx's start-time. Duplicate this parent and all the necessary ancestors to Ci.
(c) Let nx be the next node in Ci. Go to step (b).

(4) Consider each pair of processors. If their schedules have enough common nodes so that they can be merged without increasing the schedule length, then merge the two schedules and discard one processor.

The time-complexity of the LCTD algorithm is O(v^3 log v). For the DAG shown in Figure 3, the LCTD algorithm generates the schedule shown in Figure 22(a). The steps of scheduling are given in the table shown in Figure 22(b). The table lists the original start-times of the nodes on the processors after the linear clustering step, as well as the improved start-times after duplication.

6.5.6 The CPFD Algorithm. The CPFD (Critical Path Fast Duplication) algorithm [Ahmad and Kwok 1998a] is based on partitioning the DAG into three categories: critical path nodes (CPNs), in-branch nodes (IBNs), and out-branch nodes (OBNs). An IBN is a node from which there is a path reaching a CPN. An OBN is a node which is neither a CPN nor an IBN. Using this partitioning of the graph, the nodes can be ordered in decreasing priority as a list called the CPN-Dominant Sequence. In the following, we first describe the construction of this sequence.

In a DAG, the CP nodes (CPNs) are the most important nodes, since their finish-times effectively determine the final schedule length. Thus, the CPNs in a task graph should be considered as early as possible in the scheduling process. However, we cannot consider all the CPNs without first considering other nodes, because the start-times of the CPNs are determined by their parent nodes. Therefore, before we can consider a CPN for scheduling, we must first consider all of its parent nodes. In order to determine a scheduling order in which all the CPNs can be scheduled as early as possible, we classify the nodes of the DAG into the three categories given in the following definition.

Definition 4. In a connected graph, an In-Branch Node (IBN) is a node which is not a CPN and from which there is a path reaching a Critical Path Node (CPN). An Out-Branch Node (OBN) is a node which is neither a CPN nor an IBN.

After the CPNs, the most important nodes are the IBNs, because their timely scheduling can help reduce the start-times of the CPNs. The OBNs are relatively less important because they usually do not affect the schedule length. Based on this reasoning, we make a sequence of nodes called the CPN-Dominant sequence, which can be constructed by the following procedure:

Constructing the CPN-Dominant Sequence

(1) Make the entry CPN the first node in the sequence. Set Position to 2. Let nx be the next CPN.
Repeat
(2) If nx has all its parent nodes in the sequence then
(3)   Put nx at Position in the sequence and increment Position.
(4) else
(5)   Let ny be the parent node of nx which is not in the sequence and has the largest b-level. Ties are broken by choosing the parent with a smaller t-level. If ny has all its parent nodes in the sequence, put ny at Position in the sequence and increment Position. Otherwise, recursively include all the ancestor nodes of ny in the sequence so that the nodes with larger communication costs are considered first.
(6)   Repeat the above step until all the parent nodes of nx are in the sequence. Then put nx in the sequence at Position.
(7) endif
(8) Make nx the next CPN.
Until all CPNs are in the sequence.
(9) Append all the OBNs to the sequence in decreasing order of b-level.

Figure 22. (a) The schedule generated by the LCTD algorithm (schedule length = 17); (b) a scheduling trace of the LCTD algorithm.

The CPN-Dominant sequence preserves the precedence constraints among nodes, as the IBNs reaching a CPN are always inserted before that CPN in the CPN-Dominant sequence. In addition, the OBNs are appended to the sequence in a topological order, so that a parent OBN is always in front of a child OBN.

The CPN-Dominant sequence of the DAG shown in Figure 3 is constructed as follows. Since n1 is the entry CPN, it is placed in the first position of the CPN-Dominant sequence. The second node is n2 because it has only one parent node. After n2 is appended to the CPN-Dominant sequence, all of n7's parent nodes have been considered, so n7 can also be added to the sequence. Now the last CPN, n9, is considered. It cannot be appended to the sequence yet because some of its parent nodes (i.e., the IBNs) have not been examined. Since n6 and n8 have the same b-level but n8 has a smaller t-level, n8 is considered first. However, the parent nodes of n8 have not been examined either; thus, its two parent nodes, n3 and n4, are appended to the CPN-Dominant sequence first. Next, n8 is appended, followed by n6. The only OBN, n5, is the last node in the CPN-Dominant sequence. The final CPN-Dominant sequence is as follows: n1, n2, n7, n4, n3, n8, n6, n9, n5 (see Figure 3(b); the CPNs are marked by an asterisk). Note that using sl (static level) as a priority measure would generate a different ordering of nodes: n1, n4, n2, n3, n5, n6, n7, n8, n9.
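The construction can be sketched as follows (an illustrative implementation, not the authors' code; the b-level and t-level values in the usage below are made-up numbers chosen only to reproduce the ordering relations described above):

```python
# Sketch of the CPN-Dominant sequence construction; blevel and tlevel
# are assumed precomputed, and cpns lists the CPNs in critical-path
# order. Recursion replaces the paper's explicit Position counter.

def cpn_dominant_sequence(cpns, preds, blevel, tlevel, all_nodes):
    seq, placed = [], set()

    def place(n):
        # Place all of n's unplaced ancestors first (largest b-level
        # first, ties broken by smaller t-level), then n itself.
        while True:
            unplaced = [p for p in preds[n] if p not in placed]
            if not unplaced:
                break
            place(max(unplaced, key=lambda p: (blevel[p], -tlevel[p])))
        seq.append(n)
        placed.add(n)

    for nx in cpns:
        if nx not in placed:
            place(nx)
    # Step (9): append the OBNs in decreasing b-level order.
    for n in sorted((m for m in all_nodes if m not in placed),
                    key=lambda m: -blevel[m]):
        seq.append(n)
    return seq
```

With level values mimicking the Figure 3 walkthrough (n4 above n3, n6 and n8 tied with n8 earlier by t-level), the sketch reproduces the sequence n1, n2, n7, n4, n3, n8, n6, n9, n5.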

Based on the CPN-Dominant sequence, the CPFD algorithm is briefly described below.

(1) Determine a critical path. Partition the task graph into CPNs, IBNs, and OBNs. Let candidate be the entry CPN.
Repeat

(2) Let P_SET be the set of processors containing the ones accommodating the parents of candidate, plus an empty processor.

(3) For each processor P in P_SET, do:
(a) Determine candidate's start-time on P and denote it by ST.
(b) Consider the set of candidate's parents. Let m be the parent which is not scheduled on P and whose message for candidate has the latest arrival time.
(c) Try to duplicate m into the earliest idle time slot on P. If the duplication is successful and the new start-time of candidate is less than ST, then let ST be the new start-time of candidate, change candidate to m, and go to step (a). If the duplication is unsuccessful, then return control to examine another parent of the previous candidate.

(4) Schedule candidate to the processor P' that gives the earliest start-time and perform all the necessary duplications.

(5) Let candidate be the next CPN.
Until all CPNs are scheduled.

(6) Repeat the process from step (2) to step (5) for each OBN, with P_SET containing all the processors in use together with an empty processor. The OBNs are considered one by one in topological order.

The time-complexity of the CPFD algorithm is O(v^4). For the DAG shown in Figure 3, the CPFD algorithm generates the schedule shown in Figure 23(a). The steps of scheduling are given in the table shown in Figure 23(b). In this table, the start-times of the node on the processors at each scheduling step are given, and the node is scheduled to the processor on which the start-time is marked by an asterisk.

6.5.7 Other TDB Approaches. Anger et al. [1990] reported a TDB scheduling scheme called JLP/D (Joint Latest Predecessor with Duplication). The algorithm is optimal if the communication costs are strictly less than any computation cost and there are sufficient processors available. The basic idea of the algorithm is to schedule every node to the same processor as its latest parent. Since a node can be the latest parent of several successors, duplication is necessary.

Markenscoff and Li [1993] reported a TDB scheduling approach based on an optimal technique for scheduling in-trees. In their scheme, a DAG is first transformed into a set of in-trees. A node in the DAG may appear in more than one in-tree after the transformation. Each tree is then optimally scheduled independently; hence, duplication comes into play.

In a recent study, Darbha and Agrawal [1995] proposed a TDB scheduling algorithm using similar principles as the LCTD algorithm. In the algorithm, a DAG is first parsed into a set of linear clusters. Each cluster is then examined to determine the critical nodes for duplication. Critical nodes are the nodes that determine the data arrival times of the nodes in the cluster but are themselves outside the cluster. Similar to the LCTD algorithm, the number of processors required is also optimized by merging schedules with the same set of "prefix" schedules.

Figure 23. (a) The schedule generated by the CPFD algorithm (schedule length = 15); (b) a scheduling trace of the CPFD algorithm.

Palis et al. [1996] also investigated the problem of scheduling task graphs to processors using duplication. They proposed an approximation TDB algorithm which produces schedule lengths at most twice the optimal. They also showed that the quality of the schedule improves as the granularity of the task graph becomes larger. For example, if the granularity is at least 1/2, the schedule length is at most 5/3 times the optimal. The time-complexity of the algorithm is O(v(v log v + e)), which is v times faster than the PY algorithm proposed by Papadimitriou and Yannakakis [1990]. In Palis et al. [1996], similar algorithms were also developed that produce: (1) optimal schedules for coarse-grain graphs; (2) 2-optimal schedules for trees with no task duplication; and (3) optimal schedules for coarse-grain trees with no task duplication.

6.6 APN Scheduling

In this section, we survey the APN class of DAG scheduling algorithms. In particular, we describe in detail four APN algorithms: the MH (Mapping Heuristic) algorithm [El-Rewini and Lewis 1990], the DLS (Dynamic Level Scheduling) algorithm [Sih and Lee 1993a], the BU (Bottom-Up) algorithm [Mehdiratta and Ghose 1994], and the BSA (Bubble Scheduling and Allocation) algorithm [Kwok and Ahmad 1995]. Before we discuss these algorithms, it is necessary to examine one of the most important issues in APN scheduling: the message routing issue.

6.6.1 The Message Routing Issue. In APN scheduling, a processor network is not necessarily fully connected, and contention for communication channels needs to be addressed. This in turn implies that message routing and scheduling must also be considered. Recent high-performance architectures (e.g., the nCUBE-2 [Hwang 1993], iWarp [Hwang 1993], and Intel Paragon [Quinn 1994]) employ wormhole routing, in which the header flit of a message establishes the path, intermediate flits follow the path, and the tail flit releases the path. Once the header gets blocked due to link contention, the entire message waits in the network, occupying all the links it is traversing. Hence, when scheduling computations onto wormhole-routed systems, it becomes increasingly important to take link contention, rather than distance, into account. Routing strategies can be classified as either deterministic or adaptive. Deterministic schemes, such as e-cube routing for the hypercube topology, construct fixed routes for messages and cannot avoid contention if two messages are using the same link, even when other links are free. Yet deterministic schemes are easy to implement, and routing decisions can be made efficiently. On the other hand, adaptive schemes construct optimized routes for different messages depending upon the current channel allocation in order to avoid link contention. However, adaptive schemes are usually more complex, as they require much state information to make routing decisions.
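As a concrete example of a deterministic scheme, e-cube routing on a hypercube corrects the differing address bits in a fixed dimension order (a standard textbook sketch, not tied to any particular algorithm in this survey):

```python
# A textbook sketch of deterministic e-cube routing on a hypercube:
# processor addresses are bit strings, and the route corrects the
# differing address bits in a fixed low-to-high dimension order.

def ecube_route(src, dst, dim):
    path, cur = [src], src
    for d in range(dim):                 # fixed dimension order
        if (cur ^ dst) & (1 << d):       # bit d still differs
            cur ^= (1 << d)              # traverse the dimension-d link
            path.append(cur)
    return path
```

Because the dimension order is fixed, two messages whose routes share a link always contend for it; this inflexibility is what the adaptive schemes surveyed below try to avoid.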

Wang [1990] suggested two adaptive routing schemes suitable for use in APN scheduling algorithms. The first scheme is a greedy algorithm which seeks a locally optimal route for each message to be sent between tasks. Instead of searching for a path with the least waiting time, the message is sent through the link which yields the least waiting time among the links that the processor can choose from for sending the message. Thus, the route is only locally optimal. Using this algorithm, Wang observed that there are two types of possible blockings: (i) a later message blocks an earlier message (called LBE blocking), and (ii) an earlier message blocks a later message (called EBL blocking).


LBE blocking is always more costly than EBL blocking. In the case that several messages are competing for a link and blocking becomes unavoidable, LBE blockings should be avoided as much as possible. Given this observation, Wang proposed a second algorithm, called the least blocking algorithm, which works by trying to avoid LBE blocking. The basic idea of the algorithm is to use Dijkstra's shortest path algorithm to arrange optimized routes for messages so as to avoid LBE blockings.

Having determined routes for the messages, the scheduling of the different messages on the links is also an important aspect. Dixit-Radiya and Panda [1993] proposed a scheme for ordering the messages in a link so as to further minimize the extent of link contention. Their scheme is based on the Temporal Communication Graph (TCG), which, in addition to task precedence, captures the temporal relationships of the communication messages. Using the TCG model, the earliest start-times and latest start-times of messages can be computed; these values are then used to heuristically schedule the messages on the links, with the objective of minimizing link contention.

6.6.2 The MH Algorithm. The MH (Mapping Heuristic) algorithm [El-Rewini and Lewis 1990] first assigns priorities by computing the sl of all nodes. A ready node list is then initialized to contain all entry nodes, ordered by decreasing priority. Each node is scheduled to the processor that gives the smallest start-time. In calculating the start-time of a node, a routing table is maintained for each processor. The table contains information as to which path to use to route messages from the parent nodes to the node under consideration. After a node is scheduled, all of its ready successor nodes are appended to the ready node list. The MH algorithm is briefly described below.

(1) Compute the sl of each node ni in the task graph.

(2) Initialize a ready node list by inserting all entry nodes in the task graph. The list is ordered according to node priorities, with the highest-priority node first.
Repeat

(3) ni ← the first node in the list.

(4) Schedule ni to the processor which gives the smallest start-time. In determining the start-time on a processor, all messages from the parent nodes are scheduled and routed by consulting the routing tables associated with each processor.

(5) Append all ready successor nodes of ni, according to their priorities, to the ready node list.
Until the ready node list is empty.

The time-complexity of the MH algorithm is shown to be O(v(p^3 v + e)), where p is the number of processors in the target system.

For the DAG shown in Figure 3(a), the schedule generated by the MH algorithm for a 4-processor ring is shown in Figure 24. Here, Lij denotes the communication link between PE i and PE j. The MH algorithm schedules the nodes in the following order: n1, n4, n3, n5, n2, n8, n7, n6, n9. Note that the MH algorithm does not strictly schedule nodes according to a descending order of sls (static levels), in that it uses the sl order to break ties. As can be seen from the schedule shown in Figure 24, the MH algorithm schedules n4 before n2 and n7, which are more important nodes. This is due to the fact that both algorithms rank nodes according to a descending order of their sls. The nodes n2 and n7 are more important because n7 is a CPN and n2 critically affects the start-time of n7. As n4 has a larger static level, both algorithms examine n4 first and schedule it to an early time slot on the same processor as n1. As a result, n2 cannot start at the earliest possible time, i.e., the time just after n1 finishes.

6.6.3 The DLS Algorithm. The DLS (Dynamic Level Scheduling) algorithm [Sih and Lee 1993a], described in Section 6.4.5, can also be used as an APN scheduling algorithm. However, the DLS algorithm requires a message routing method to be supplied by the user. It then computes the earliest start-time of a node on a processor by tentatively scheduling and routing all messages from the parent nodes using the given routing table.

For APN scheduling, the time-complexity of the DLS algorithm is shown to be O(v^3 p f(p)), where f(p) is the time-complexity of the message routing algorithm. For the DAG shown in Figure 3(a), the schedule generated by the DLS algorithm for a 4-processor ring is the same as that generated by the MH algorithm, shown in Figure 24. The DLS algorithm also schedules the nodes in the following order: n1, n4, n3, n5, n2, n8, n7, n6, n9.

6.6.4 The BU Algorithm. The BU (Bottom-Up) algorithm [Mehdiratta and Ghose 1994] first determines the critical path (CP) of the DAG and then assigns all the nodes on the CP to the same processor. Afterwards, the algorithm assigns the remaining nodes to the processors in a reversed topological order of the DAG. The node assignment is guided by a load-balancing processor selection heuristic which attempts to balance the load across all processors. The BU algorithm examines the nodes at each topological level in descending order of their b-levels. After all the nodes are assigned to the processors, the BU algorithm tries to schedule the communication messages among them using a channel allocation heuristic which tries to keep the hop count of every message roughly constant, constrained by the processor network topology. Different network topologies require different channel allocation heuristics. The BU algorithm is briefly described below.

Figure 24. The schedule generated by the MH and DLS algorithms (schedule length = 20, total comm. costs incurred = 16).

(1) Find a critical path. Assign the nodes on the critical path to the same processor. Mark these nodes as assigned and update the load of the processor.

(2) Compute the b-level of each node. If the two nodes of an edge are assigned to the same processor, the communication cost of the edge is taken to be zero.

(3) Compute the p-level (precedence level) of each node, which is defined as the maximum number of edges along a path from an entry node to the node.

(4) In decreasing order of p-level, for each value of p-level, do:
(a) In decreasing order of b-level, for each node at the current p-level, assign the node to a processor such that the processing load is balanced across all the given processors.
(b) Re-compute the b-levels of all nodes.

(5) Schedule the communication messages among the nodes such that the hop count of each message is kept constant.

The time-complexity of the BU algorithm is shown to be O(v^2 log v).
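Step (3)'s p-level is a simple longest-path (in edge count) computation; a sketch, assuming the nodes are given in topological order:

```python
# A sketch of the p-level computation in step (3): the maximum number
# of edges on any path from an entry node to each node, assuming the
# nodes are supplied in topological order.

def p_levels(topo_order, preds):
    pl = {}
    for n in topo_order:
        pl[n] = max((pl[p] + 1 for p in preds[n]), default=0)
    return pl
```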

For the DAG shown in Figure 3(a), the schedule generated by the BU algorithm¹ for a 4-processor ring is shown in Figure 25. As can be seen, the schedule length is considerably longer than that of the MH and DLS algorithms. This is because the BU algorithm employs a processor selection heuristic which works by attempting to balance the load across all the processors.

6.6.5 The BSA Algorithm. The BSA (Bubble Scheduling and Allocation) algorithm [Kwok and Ahmad 1995] was proposed by us and is based on an incremental technique, which works by improving the schedule through migration of tasks from one processor to a neighboring processor. The algorithm first allocates all the tasks to a single processor which has the highest connectivity in the processor network and is called the pivot processor. In the first phase of the algorithm, the tasks are arranged on the processor according to the CPN-Dominant sequence discussed earlier in Section 6.5.6. In the second phase of the algorithm, the tasks migrate from the pivot processor to the neighboring processors if their start-times improve. This task migration process proceeds in a breadth-first order of the processor network, in that after the migration process is complete for the first pivot processor, one of the neighboring processors becomes the next pivot processor, and the process repeats.

¹In this example, we have used the PSH2 processor selection heuristic with p = 1.5. This combination is shown [Mehdiratta and Ghose 1994] to give the best performance.

Figure 25. The schedule generated by the BU algorithm (schedule length = 24, total comm. costs incurred = 27).

In the following outline of the BSA algorithm, the Build_processor_list( ) procedure constructs a list of processors in a breadth-first order from the first pivot processor. The Serial_injection( ) procedure constructs the CPN-Dominant sequence of the nodes and injects this sequence into the first pivot processor.

The BSA Algorithm

(1) Load processor topology and input task graph
(2) Pivot_PE ← the processor with the highest degree
(3) Build_processor_list(Pivot_PE)
(4) Serial_injection(Pivot_PE)
(5) while Processor_list_not_empty do
(6)   Pivot_PE ← first processor of Processor_list
(7)   for each ni on Pivot_PE do
(8)     if ST(ni, Pivot_PE) > DAT(ni, Pivot_PE) or Proc(VIP(ni)) = Pivot_PE then
(9)       Determine DAT and ST of ni on each adjacent processor PE′
(10)      if there exists a PE′ s.t. ST(ni, PE′) < ST(ni, Pivot_PE) then
(11)        Make ni migrate from Pivot_PE to PE′
(12)        Update start-times of nodes and messages
(13)      else if ST(ni, PE′) = ST(ni, Pivot_PE) and Proc(VIP(ni)) = PE′ then
(14)        Make ni migrate from Pivot_PE to PE′
(15)        Update start-times of nodes and messages
(16)      end if
(17)    end if
(18)  end for
(19) end while

The time-complexity of the BSA algorithm is O(p²ev).
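The migration test in lines (8)-(16) can be expressed compactly as below. ST, DAT, proc_of, and vip here are hypothetical helpers standing in for the start-time, data-arrival-time, processor-assignment, and most-important-parent functions used in the outline; this is a sketch of the decision logic only, not of the schedule update.

```python
def try_migrate(node, pivot, neighbors, ST, DAT, proc_of, vip):
    """Sketch of BSA's per-node migration test. Returns the processor the
    node should migrate to, or None if it stays on the pivot."""
    # Only consider nodes whose start is delayed beyond their data
    # arrival time, or whose VIP already sits on the pivot processor.
    if not (ST(node, pivot) > DAT(node, pivot)
            or proc_of[vip(node)] == pivot):
        return None
    best = min(neighbors, key=lambda p: ST(node, p))
    if ST(node, best) < ST(node, pivot):
        return best          # strictly earlier start-time: migrate
    if ST(node, best) == ST(node, pivot) and proc_of[vip(node)] == best:
        return best          # tie, but the node follows its VIP
    return None
```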

The BSA algorithm, as shown in Figure 26(a), injects the CPN-Dominant sequence into the first pivot processor PE 0. In the first phase, nodes n1, n2, and n7 do not migrate because they are already scheduled to start at the earliest possible times. However, as shown in Figure 26(b), node n4 migrates to PE 1 because its start-time improves. Similarly, as depicted in Figure 26(c), node n3 also migrates to a neighboring processor PE 3. Figure 26(d) shows the intermediate schedule after n8 migrates to PE 1 following its VIP n4. Similarly, n6 also migrates to PE 3 following its VIP n3, as shown in Figure 27(a). The last CPN, n9, migrates to PE 1, to which its VIP n8 is scheduled. Such migration allows the only OBN, n5, to bubble up. The resulting schedule is shown in Figure 27(b). This is the final schedule as no more nodes can improve their start-times through migration.

6.6.6 Other APN Approaches. Kon'ya and Satoh [1993] reported an APN scheduling algorithm for hypercube architectures. Their algorithm, called the LST (Latest Starting Time) algorithm, works by using a list scheduling approach where the priorities of nodes are first computed and a list is constructed based on these priorities. The priority of a node is defined as its latest starting time, which is determined before scheduling starts. Thus, the list is static and does not capture the dynamically changing importance of nodes, which is crucial in APN scheduling.

In a later study, Selvakumar and Murthy [1994] reported an APN scheduling scheme which is an extension of Sih and Lee's DLS algorithm. The distinctive new feature of their algorithm is that it exploits schedule holes in processors and communication links in order to produce better schedules. Essentially, it differs from the DLS algorithm in two respects: (i) the way in which the priority of a task with respect to a processor in a partial schedule is computed; and (ii) the way in which a task and all communications from its parents are scheduled. The priority of a node is modified to be the difference between the static level and the earliest finish-time. During the scheduling of a node, a router is used to determine the best possible path between the processors that need communication. In their simulation study, the improved scheduling algorithm outperformed both the DLS algorithm and the MH algorithm.

Figure 26. Intermediate schedules produced by BSA after (a) serial injection (schedule length = 30, total comm. cost = 0); (b) n4 migrates from PE 0 to PE 1 (schedule length = 26, total comm. cost = 2); (c) n3 migrates from PE 0 to PE 3 (schedule length = 23, total comm. cost = 4); (d) n8 migrates from PE 0 to PE 1 (schedule length = 22, total comm. cost = 9).

6.7 Scheduling in Heterogeneous Environments

Heterogeneity has been shown to be an important attribute in improving the performance of multiprocessors [Ercegovac 1988; Freund and Siegel 1993; Menasce and Almeida 1990; Siegel et al. 1992; Siegel et al. 1996; Wang et al. 1996]. In parallel computations, the serial part is the bottleneck, according to Amdahl's law [Amdahl 1967]. In homogeneous multiprocessors, if one or more faster processors are used to replace a set of cost-equivalent processors, the serial computations and other critical computations can be scheduled to such faster processors and performed at a greater rate so that speedup can be increased.

As we have seen in earlier parts of this section, most DAG scheduling approaches assume the target system is homogeneous. Introducing heterogeneity into the model inevitably makes the problem more complicated to handle. This is because the scheduling algorithm has to take into account the different execution rates of different processors when computing the potential start-times of tasks on the processors. Another complication is that the resulting schedule for a given heterogeneous system immediately becomes invalid if some of the processing elements are replaced, even though the number of processors remains the same. This is because the scheduling decisions are based not only on the number of processors but also on the capabilities of the processors.

Static scheduling targeted for heterogeneous environments was unexplored until recently. Menasce et al. [Menasce and Porto 1993; Menasce et al. 1994; Menasce et al. 1992; Menasce et al. 1995] investigated the problem of scheduling computations on heterogeneous multiprocessing environments. The heterogeneous environment was modeled as a system with one fast processor plus a number of slower processors. In their study, both dynamic and static scheduling schemes were examined, but DAGs without communication were used to model the computations [Almeida et al. 1992]. Markov chains were used to analyze the performance of the different scheduling schemes. In their findings, out of all the static scheduling schemes, LTF/MFT (Largest Task First/Minimizing Finish-Time) significantly outperformed all the others, including WL (Weighted Level), CPM (Critical Path Method), and HNF (Heavy Node First). The LTF/MFT algorithm works by picking the largest task from the ready task list and scheduling it on the processor which allows the minimum finish-time, while the other three strategies select candidate processors based on the execution time of the task. Thus, based on their observations, an efficient scheduling algorithm for heterogeneous systems should concentrate on reducing the finish-times of tasks. Nonetheless, if communication delays are also considered, different strategies may be needed.

Figure 27. (a) Intermediate schedule produced by BSA after n6 migrates from PE 0 to PE 3 (schedule length = 22, total comm. cost = 15); (b) final schedule produced by BSA after n9 migrates from PE 0 to PE 1 and n5 is bubbled up (schedule length = 16, total comm. cost = 21).
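A sketch of the LTF/MFT idea for independent tasks (the model above has no communication edges). The speed-factor representation of heterogeneity is our assumption; the paper's model of one fast processor plus slower ones is a special case.

```python
def ltf_mft(task_times, proc_speeds):
    """Sketch of LTF/MFT: repeatedly take the largest remaining task and
    place it on the processor giving the minimum finish-time.
    task_times: nominal execution time per task; proc_speeds: relative
    speeds (a speed of 2.0 halves a task's execution time)."""
    ready = [0.0] * len(proc_speeds)   # when each processor frees up
    finish = {}
    for tid, t in sorted(task_times.items(), key=lambda kv: -kv[1]):
        # choose the processor minimizing this task's finish-time
        p = min(range(len(proc_speeds)),
                key=lambda i: ready[i] + t / proc_speeds[i])
        ready[p] += t / proc_speeds[p]
        finish[tid] = (p, ready[p])
    return finish, max(ready)          # per-task placement and makespan
```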

6.8 Mapping Clusters to Processors

As discussed earlier, mapping of clusters to physical processors is necessary for UNC scheduling when the number of clusters is larger than the number of physical processors. However, the mapping of clusters to processors is a relatively unexplored research topic [Lee and Aggarwal 1987]. In the following we discuss a number of approaches reported in the literature.

Upon obtaining a schedule by using the EZ algorithm, Sarkar [1989] used a list-scheduling-based method to map the clusters to physical processors. In the mapping algorithm, each task is considered in turn according to its static level. A processor is allocated to the task if it allows the earliest execution; the whole cluster containing the task is then also assigned to that processor and all the member tasks are marked as assigned. In this scheme, two clusters can be merged onto a single processor but a cluster is never cracked. Furthermore, the allocation of channels to communication messages was not considered.

Kim and Browne [1988] also proposed a mapping scheme for the UNC schedules obtained from their LC algorithm. In their scheme, the linear UNC clusters are first merged so that the number of clusters is at most the same as the number of processors. Two clusters are candidates for merging if one can start after another finishes, or if the member tasks of one cluster can be merged into the idle time slots of another cluster. Then a dominant request tree (DRT) is constructed from the UNC schedule, which is a cluster graph. The DRT contains the connectivity information of the schedule and is, therefore, useful for the mapping stage, in which two communicating UNC clusters are mapped to two neighboring processors, if possible. However, if this connectivity mapping heuristic fails for some clusters, two other heuristics, called perturbation mapping and foster mapping, are invoked. For both mapping strategies, a processor is chosen which has the most appropriate number of channels among the currently unallocated processors. Finally, to further optimize the mapping, a restricted pairwise exchange step is invoked.

Wu and Gajski [1990] suggested a mapping scheme for assigning the UNC clusters generated in scheduling to processors. They realized that for the best mapping results, a dedicated traffic scheduling algorithm that balances the network traffic should be used. However, traffic scheduling requires flexible-path routing, which incurs higher overhead. Thus, they concluded that if network traffic is not heavy, a simpler algorithm which minimizes total network traffic can be used. The algorithm they used is a heuristic designed by Hanan and Kurtzberg [1972] to minimize the total communication traffic. It generates an initial assignment by a constructive method; the assignment is then iteratively improved to obtain a better mapping.

Young and Gerasoulis [1993] employed a work profiling method for merging UNC clusters. The merging process proceeds by first sorting the clusters in an increasing order of aggregate computational load. Then a load-balancing algorithm is invoked to map the clusters to the processors so that every processor has about the same load. To take care of the topology of the underlying processor network, the graph of merged clusters is then mapped to the network topology using Bokhari's algorithm.
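The merging step can be sketched as follows. Since the specific load-balancing algorithm is not detailed above, the greedy least-loaded placement below is our assumption; only the increasing-load ordering comes from the description.

```python
def map_clusters_by_load(cluster_loads, num_procs):
    """Sketch of the work-profiling merge: sort clusters by aggregate
    computational load, then greedily place each one on the currently
    least-loaded processor so every processor ends up with roughly the
    same load."""
    mapping = {}
    load = [0.0] * num_procs
    for cid, w in sorted(cluster_loads.items(), key=lambda kv: kv[1]):
        p = min(range(num_procs), key=lambda i: load[i])
        mapping[cid] = p
        load[p] += w
    return mapping, load
```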

Yang et al. [1993] reported an algorithm for mapping cluster graphs to processor graphs which is suitable for use as a post-processing step for BNP scheduling algorithms. The mapping scheme is not suitable for UNC scheduling because it assumes the scheduling algorithm has already produced a number of clusters which is less than or equal to the number of processors available. The objective of the mapping method is to reduce contention and optimize the schedule length when the clusters are mapped to a topology which is not fully connected, as is assumed by the BNP algorithms. The idea of the mapping algorithm is based on determining a set of critical edges, each of which is assigned a single communication link. Substantial improvement over random mapping was obtained in their simulation study.

In a recent study, Liou and Palis [1997] investigated the problem of mapping clusters to processors. One of the major objectives of their study was to compare the effectiveness of one-phase scheduling (i.e., BNP scheduling) to that of the two-phase approach (i.e., UNC scheduling followed by cluster mapping). To this end, they proposed a new UNC algorithm called CASS-II (Clustering And Scheduling System II), which was applied to randomly generated task graphs in an experimental study using three cluster-mapping schemes, namely, the LB (load-balancing) algorithm, the CTM (communication traffic minimizing) algorithm, and the RAND (random) algorithm. The LB algorithm uses processor workload as the criterion for matching clusters to processors. By contrast, the CTM algorithm tries to minimize the communication costs between processors. The RAND algorithm simply makes random choices at each mapping step. To compare the one-phase method with the two-phase method, in one set of test cases the task graphs were scheduled using CASS-II with the three mapping algorithms, while in the other set the mapping algorithms were used alone. Liou and Palis found that two-phase scheduling is better than one-phase scheduling, and that the utilization of processors in the former is more efficient than in the latter. Furthermore, they found that the LB algorithm finds significantly better schedules than the CTM algorithm.

7. SOME SCHEDULING TOOLS

Software tools providing automated functionalities for scheduling/mapping can make the parallel programming task easier. Despite the vast volume of research on scheduling that exists, building useful scheduling tools has only recently been addressed. A scheduling tool should allow a programmer to specify a parallel program in a certain textual or graphical form, and then perform automatic partitioning and scheduling of the program. The tool should also allow the user to specify the target architecture. Performance evaluation and debugging functions are also highly desirable. Some tools provide interactive environments for performance evaluation of various popular parallel machines but do not generate an executable scheduled code [Pease et al. 1991]. Under the above definition, such tools provide other functionalities but cannot be classified as scheduling tools.


In the following, we survey some of the recently reported prototype scheduling tools.

7.1 Hypertool

Hypertool takes a user-partitioned sequential program as input and automatically allocates and schedules the partitions to processors [Wu and Gajski 1990]. Proper synchronization primitives are also automatically inserted. Hypertool is a code generation tool since the user program is compiled into a parallel program for the iPSC/2 hypercube computer using parallel code synthesis and optimization techniques. The tool also generates performance estimates, including execution time, communication and suspension times for each processor, as well as network delay for each communication channel. Scheduling is done using the MD algorithm or the MCP algorithm.

7.2 PYRROS

PYRROS is a compile-time scheduling and code generation tool [Yang and Gerasoulis 1992]. Its input is a task graph and the associated sequential C code. The output is a static schedule and parallel C code for a given architecture (the iPSC/2). PYRROS consists of a task graph language with an interface to C, a scheduling system which uses only the DSC algorithm, an X-Windows based graphic displayer, and a code generator. The task graph language allows the user to define partitioned programs and data. The scheduling system is used for clustering the task graph, performing load-balanced mapping, and computation/communication ordering. The graphic displayer is used for displaying task graphs and scheduling results in the form of Gantt charts. The code generator inserts synchronization primitives and performs parallel code optimization for the target parallel machine.

7.3 Parallax

Parallax incorporates seven classical scheduling heuristics designed in the seventies [Lewis and El-Rewini 1993], providing an environment for parallel program developers to find out how the schedulers affect program performance on various parallel architectures. Users must provide the input program as a task graph and estimate task execution times. Users must also express the target machine as an interconnection topology graph. Parallax then generates schedules in the form of Gantt charts, speedup curves, and processor and communication efficiency charts using an X-Windows interface. In addition, an animated display of the simulated running program is provided to help developers evaluate the differences among the scheduling heuristics. Parallax, however, is not reported to generate executable code.

7.4 OREGAMI

OREGAMI is designed for use in conjunction with parallel programming languages that support a communication model, such as OCCAM, C*, or C and FORTRAN with communication extensions [Lo et al. 1991]. As such, it is a set of tools that includes a LaRCS compiler to compile textual user task descriptions into specialized task graphs, called TCGs (Temporal Communication Graphs) [Lo 1992]. In addition, OREGAMI includes a mapper tool for mapping tasks onto a variety of target architectures, and a metrics tool for analyzing and displaying the performance. The suite of tools is implemented in C for SUN workstations with an X-Windows interface. However, precedence constraints among tasks are not considered in OREGAMI. Moreover, no target code is generated. Thus, like Parallax, OREGAMI is rather a design tool for parallel program development.


7.5 PARSA

PARSA is a software tool developed for automatic scheduling and partitioning of sequential user programs [Shirazi et al. 1993]. PARSA consists of four components: an application specification tool, an architecture specification tool, a partitioning and scheduling tool, and a performance assessment tool. PARSA does not generate any target code. The application specification tool accepts a sequential program written in the SISAL functional language and converts it into a DAG, which is represented in textual form by the IF1 (Intermediate Form 1) acyclic graphical language. The architecture specification tool allows the user to interactively specify the target system in graphical form. The execution delay for each task is also generated based on the architecture specification. The partitioning and scheduling tool consists of the HNF algorithm, the LC algorithm, and the LCTD algorithm. The performance assessment tool displays the expected run-time behavior of the scheduled program. The expected performance is generated by analysis of the scheduled program trace file, which contains information on where each task is assigned for execution and exactly when each task is expected to start execution, stop execution, or send a message to a remote task.

7.6 CASCH

CASCH (Computer-Aided SCHeduling) [Ahmad et al. 1997] aims to be a complete parallel programming environment including parallelization, partitioning, scheduling, mapping, communication, synchronization, code generation, and performance evaluation. Parallelization is performed by a compiler that automatically converts sequential applications into parallel code. The parallel code is optimized through proper scheduling and mapping, and is executed on a target machine. CASCH provides an extensive library of state-of-the-art scheduling algorithms from the recent literature. The library is organized into different categories which are suitable for different architectural environments.

The scheduling and mapping algorithms are used for scheduling the task graph generated from the user program, which can be created interactively or read from disk. The weights on the nodes and edges of the task graph are computed using a database that contains the timings of various computation, communication, and I/O operations for different machines. These timings are obtained through benchmarking. An attractive feature of CASCH is its easy-to-use GUI for analyzing various scheduling and mapping algorithms using task graphs generated randomly, interactively, or directly from real programs. The best schedule generated by an algorithm can be used by the code generator for generating a parallel program for a particular machine, and the same process can be repeated for another machine.

7.7 Commercial Tools

There are only a few commercially available tools for scheduling and program parallelization. Examples include ATEXPERT by Cray Research [1991]; PARASPHERE by DEC [Digital Equipment Corp.]; IPD by Intel [1991]; MPPE by MasPar [1992]; and PRISM by TMC [1991]. Most of these tools provide software development and debugging environments. Some of them also provide performance tuning tools and other program development facilities.

8. NEW IDEAS AND RESEARCH TRENDS

With the advancements in processor and networking hardware technologies, parallel processing can be accomplished on a wide spectrum of platforms, ranging from tightly-coupled MPPs to loosely-coupled networks of autonomous workstations. Designing an algorithm for such diverse platforms makes the scheduling problem even more complex and challenging. In summary, in the design of scheduling algorithms for efficient parallel processing, we have to address four fundamental aspects: performance, time-complexity, scalability, and applicability. These aspects are elaborated below.

Performance: A scheduling algorithm must exhibit high performance and be robust. By high performance we mean the scheduling algorithm should produce high-quality solutions. A robust algorithm is one which can be used under a wide range of input parameters (e.g., an arbitrary number of available processors and diverse task graph structures).

Time-complexity: The time-complexity of an algorithm is an important factor insofar as the quality of the solution is not compromised. As real workloads are typically of large size [Ahmad et al. 1997], a fast algorithm is necessary for finding good solutions efficiently.

Scalability: A scheduling algorithm must possess problem-size scalability, that is, the algorithm has to consistently give good performance even for large inputs. On the other hand, a scheduling algorithm must also possess processing-power scalability, that is, given more processors for a problem, the algorithm should produce solutions of comparable quality in a shorter period of time.

Applicability: A scheduling algorithm must be applicable in practical environments. To achieve this goal, it must take into account realistic assumptions about the program and multiprocessor models, such as arbitrary computation and communication weights, link contention, and processor network topology.

It is clear that the above-mentioned goals are conflicting and thus pose a number of challenges to researchers. To combat these challenges, several new ideas have been suggested recently. These new ideas, which include genetic algorithms, randomization approaches, and parallelization techniques, are employed to enhance the solution quality, to lower the time-complexity, or both. In the following, we briefly outline some of these recent advancements. At the end of this section, we also indicate some current research trends in scheduling.

8.1 Scheduling Using Genetic Algorithms

Genetic algorithms (GAs) [Davis 1991; Filho et al. 1994; Forrest and Mitchell 1993; Goldberg 1989; Holland 1975; Srinivas and Patnaik 1994] have recently found many applications in optimization problems, including scheduling [Ali et al. 1994; Benten and Sait 1994; Chandrasekharam 1993; Dhodhi et al. 1995; Hou et al. 1994; Schwehm et al. 1994]. GAs use global search techniques to explore different regions of the search space simultaneously by keeping track of a set of potential solutions of diverse characteristics, called a population. As such, GAs are widely recognized as effective techniques for solving numerous optimization problems, because they can potentially locate better solutions at the expense of longer running time. Another merit of a genetic search is that its inherent parallelism can be exploited to further reduce its running time. Thus, a parallel genetic search technique is a viable approach for producing high-quality schedules in short running times.

Ali et al. [1994] proposed a genetic algorithm for scheduling a DAG on a limited number of fully connected processors with a contention-free communication network. In their scheme, each solution, or schedule, is encoded as a chromosome containing v alleles, each of which is an ordered pair of a task index and its assigned processor index. With such an encoding, the design of genetic operators is straightforward. Standard crossover is used because it always produces valid schedules as offspring and is computationally efficient. Mutation is simply a swapping of the assigned processors between two randomly chosen alleles. For generating an initial population, Ali et al. use a technique called "prescheduling" in which Np random permutations of the numbers from 1 to v are generated. Each number in a random permutation represents a task index of the task graph. The tasks are then assigned to the PEs uniformly: the first v/p tasks in a permutation are assigned to PE 0, the next v/p tasks to PE 1, and so on. In their simulation study using randomly generated task graphs with a few tens of nodes, their algorithm was shown to outperform the ETF algorithm proposed by Hwang et al. [1989].
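A sketch of this encoding and its operators. The representation details beyond what is described above (e.g., using Python tuples for alleles) are our own.

```python
import random

def preschedule(v, p, rng):
    """Prescheduling: a random permutation of tasks 1..v, assigned to
    processors in contiguous blocks of roughly v/p tasks each."""
    perm = list(range(1, v + 1))
    rng.shuffle(perm)
    block = v // p
    # allele = (task index, assigned processor index)
    return [(task, min(i // block, p - 1)) for i, task in enumerate(perm)]

def mutate(chrom, rng):
    """Mutation: swap the assigned processors of two random alleles."""
    i, j = rng.sample(range(len(chrom)), 2)
    (ti, pi), (tj, pj) = chrom[i], chrom[j]
    chrom[i], chrom[j] = (ti, pj), (tj, pi)
```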

Hou et al. [1994] also proposed a scheduling algorithm using a genetic search in which each chromosome is a collection of lists, and each list represents the schedule on a distinct processor. Thus, each chromosome is not a linear structure but a two-dimensional structure: one dimension is a particular processor index and the other is the ordering of tasks scheduled on that processor. Using such an encoding scheme poses a restriction on the schedules being represented: the list of tasks within each processor in a schedule is ordered in ascending order of their topological height, which is defined as the largest number of edges from an entry node to the node itself. This restriction also facilitates the design of the crossover operator. In a crossover, two processors are selected from each of two chromosomes. The list of tasks on each processor is cut into two parts, and then the two chromosomes exchange the two lower parts of their task lists correspondingly. It is shown that this crossover mechanism always produces valid offspring. However, the height restriction in the encoding may render the search incapable of obtaining the optimal solution, because the optimal solution may not obey the height-ordering restriction at all.

Hou et al. incorporated a heuristic technique to lower the likelihood of such a pathological situation. Mutation is simpler in design: in a mutation, two randomly chosen tasks with the same height are swapped in the schedule. As to the generation of the initial population, Np randomly permuted schedules obeying the height-ordering restriction are generated. In their simulation study using randomly generated task graphs with a few tens of nodes, their algorithm was shown to produce schedules within 20 percent degradation from the optimal solutions.
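The height-preserving mutation can be sketched as follows, assuming the topological heights have already been computed from the DAG; swapping two equal-height tasks keeps every per-processor list in ascending height order, so the encoding restriction is preserved.

```python
import random

def mutate_same_height(chrom, height, rng):
    """Sketch of Hou et al.'s mutation: pick two tasks with the same
    topological height and swap their positions in the schedule.
    chrom: list of per-processor task lists; height: task -> height."""
    # group the positions of tasks by height across the whole schedule
    spots = {}
    for p, tasks in enumerate(chrom):
        for i, t in enumerate(tasks):
            spots.setdefault(height[t], []).append((p, i))
    candidates = [v for v in spots.values() if len(v) >= 2]
    if not candidates:
        return False                   # no two tasks share a height
    (pa, ia), (pb, ib) = rng.sample(rng.choice(candidates), 2)
    chrom[pa][ia], chrom[pb][ib] = chrom[pb][ib], chrom[pa][ia]
    return True
```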

Ahmad and Dhodhi [1995] proposed a scheduling algorithm using a variant of the genetic algorithm called simulated evolution. They employ a problem-space neighborhood formulation in which a chromosome represents a list of task priorities. Since task priorities depend on the input DAG, different sets of task priorities represent different problem instances. First, a list of priorities is obtained from the input DAG. Then the initial population of chromosomes is generated by randomly perturbing this original list. Standard genetic operators are applied to these chromosomes to determine the fittest chromosome, which is the one giving the shortest schedule length for the original problem. The genetic search, therefore, operates on the problem-space instead of the solution-space, as is commonly done. The rationale of this approach is that good solutions of the problem instances in the problem-space neighborhood are expected to be good solutions for the original problem as well [Storer et al. 1992].

Recently, we proposed a parallel genetic algorithm for scheduling [Kwok and Ahmad 1997], called the Parallel Genetic Scheduling (PGS) algorithm, using a novel encoding scheme, an effective initial population generation strategy, and computationally efficient genetic search operators. The major motivation for using a genetic search approach is that the recombinative nature of a genetic algorithm can potentially determine an optimal scheduling list leading to an optimal schedule. As such, a scheduling list (i.e., a topological ordering of the input DAG) is encoded as a genetic string. Instead of generating the initial population totally randomly, we generate the initial set of strings based on a number of effective scheduling lists, such as the ALAP list, b-level list, t-level list, etc. We use a novel crossover operator, which is a variant of the order crossover operator, in the scheduling context. The proposed crossover operator has the potential to effectively combine the good characteristics of two parent strings in order to generate a scheduling string leading to a schedule with a shorter schedule length. The crossover operator is easy to implement and is computationally efficient.

In our experimental studies [Kwok and Ahmad 1997], we found that the PGS algorithm generates optimal solutions for more than half of all the cases in which random task graphs were used. In addition, the PGS algorithm demonstrates an almost linear speedup, and is therefore scalable. While the DCP algorithm [Kwok and Ahmad 1996] has already been shown to outperform many leading algorithms, the PGS algorithm is even better, since it generates solutions of comparable quality while using significantly less time due to its effective parallelization. The PGS algorithm outperforms the well-known DSC algorithm in terms of both solution quality and running time. An extra advantage of the PGS algorithm is scalability, that is, by using more parallel processors, the algorithm can be used for scheduling larger task graphs.

8.2 Randomization Techniques

The time-complexity of an algorithm and its solution quality are in general conflicting goals in the design of efficient scheduling algorithms. Our previous study [Kwok and Ahmad 1999b] indicates that not only does the quality of existing algorithms differ considerably, but their running times can also vary by large margins. Indeed, designing an algorithm which is fast and can produce high-quality solutions requires some low-complexity algorithmic techniques. One promising approach is to employ randomization. As indicated by Karp [1991], Motwani and Raghavan [1995], and other researchers, an optimization algorithm which makes random choices can be very fast and simple to implement. However, there has been very little work done in this direction.

Recently, we [Kwok 1997; Kwok and Ahmad 1999a; Kwok et al. 1996] proposed a BNP scheduling algorithm based on a random neighborhood search technique [Johnson et al. 1988; Papadimitriou and Steiglitz 1982]. The algorithm is called the Fast Assignment and Scheduling of Tasks using an Efficient Search Technique (FASTEST) algorithm, which has a time-complexity of only O(e), where e is the number of edges in the DAG [Kwok and Ahmad 1999a]. The FASTEST algorithm first constructs an initial schedule quickly in linear time and then refines it by using multiple physical processors, each of which operates on a disjoint subset of blocking-nodes as a search neighborhood. The physical processors communicate periodically to exchange the best solution found thus far. As the number of search steps required is a small constant, which is independent of the size of the input DAG, the algorithm effectively takes linear time to determine the final schedule.
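The underlying idea of random neighborhood search can be sketched generically. The following is our own illustration, not the FASTEST algorithm itself: a search loop that repeatedly samples a random neighbor of the current solution and keeps it only if it shortens the cost, exercised here on a toy load-balancing instance (no precedence constraints) with made-up task weights.

```python
import random

def random_neighborhood_search(initial, neighbors, cost, steps=100, seed=0):
    """Generic random neighborhood search: repeatedly sample a random
    neighbor and keep it only if it improves the cost."""
    rng = random.Random(seed)
    best = initial
    best_cost = cost(best)
    for _ in range(steps):
        cand = neighbors(best, rng)
        c = cost(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost

tasks = [4, 3, 2, 7, 1]           # toy task weights, no precedence
assign0 = [0] * len(tasks)        # start with every task on processor 0

def makespan(assign):
    loads = [0, 0]
    for w, p in zip(tasks, assign):
        loads[p] += w
    return max(loads)

def move_one(assign, rng):
    """Neighborhood move: flip one randomly chosen task to the other PE."""
    i = rng.randrange(len(assign))
    new = list(assign)
    new[i] ^= 1
    return new

best, length = random_neighborhood_search(assign0, move_one, makespan, steps=200)
```

Since each accepted move strictly improves the makespan, the search never worsens the initial schedule; with a bounded number of steps the refinement cost stays linear in the step count, which is the spirit of FASTEST's low-complexity design.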

In our performance study [Kwok 1997; Kwok and Ahmad 1999a], we compared the FASTEST algorithm with a number of well-known efficient scheduling algorithms. The FASTEST algorithm has been shown to be better than the other algorithms in terms of both solution quality and running time. Since the algorithm takes linear time, it is the fastest algorithm to our knowledge. In experiments using random task graphs for which optimal solutions are known, the FASTEST algorithm generates optimal solutions for a significant portion of all the test cases, and close-to-optimal solutions for the remaining cases. The FASTEST algorithm also exhibits good scalability in that it gives a consistent performance when applied to large task graphs. An interesting finding about the FASTEST algorithm is that parallelization can sometimes improve its solution quality. This is due to the partitioning of the blocking-nodes set, which implies a partitioning of the search neighborhood. The partitioning allows the algorithm to explore the search space simultaneously, thereby enhancing the likelihood of finding better solutions.

8.3 Parallelizing a Scheduling Algorithm

Parallelizing a scheduling algorithm is a novel as well as natural way to reduce the time-complexity. This approach is novel in that no previous work has been done on the parallelization of a scheduling algorithm. Indeed, as indicated by Norman and Thanisch [1993], it is strange that there has been hardly any attempt to parallelize the scheduling and mapping process itself. Parallelization is natural in that parallel processing is realized only when a parallel processing platform is available. Furthermore, parallelization can be utilized not only to speed up the scheduling process further but also to improve the solution quality. Recently there have been a few parallel algorithms proposed for DAG scheduling [Ahmad and Kwok 1999; Kwok 1997; Kwok and Ahmad 1997].

In a recent study [Ahmad and Kwok 1998b], we have proposed two parallel state-space search algorithms for finding optimal or bounded solutions. The first algorithm, which is based on the A* search technique, uses a computationally efficient cost function for quickly guiding the search. The A* algorithm is also parallelized, using static and dynamic load-balancing schemes to evenly distribute the search states to the processors. A number of effective state-pruning techniques are also incorporated to further enhance the efficiency of the algorithm. The proposed algorithm outperforms a previously reported branch-and-bound algorithm while using considerably less computation time. The second algorithm is an approximate algorithm that guarantees a bounded deviation from the optimal solution, but executes in a considerably shorter turnaround time. Based on both theoretical analysis and experimental evaluation [Ahmad and Kwok 1998b] using randomly generated task graphs, we have found that the approximate algorithm is highly scalable and is an attractive choice if slightly degraded solutions are acceptable.
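The way a cost function f(s) = g(s) + h(s) guides an A* state-space search can be shown with a generic skeleton. This is our own sketch, not the parallel scheduling algorithm of the study: the scheduling-specific state expansion is abstracted behind callbacks, and the toy state graph below is made up.

```python
import heapq

def a_star(start, expand, h, is_goal):
    """Generic A*: pop states in order of f(s) = g(s) + h(s), where g is
    the cost accumulated so far and h is a lower bound on the rest."""
    open_list = [(h(start), 0, start)]
    seen = {}                       # state -> best g found so far
    while open_list:
        f, g, state = heapq.heappop(open_list)
        if is_goal(state):
            return state, g
        if seen.get(state, float("inf")) <= g:
            continue                # already expanded more cheaply
        seen[state] = g
        for nxt, step_cost in expand(state):
            g2 = g + step_cost
            heapq.heappush(open_list, (g2 + h(nxt), g2, nxt))
    return None, float("inf")

# Toy state graph; with h = 0 this degenerates to uniform-cost search.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)]}
goal, cost = a_star("A", lambda s: graph.get(s, []), lambda s: 0,
                    lambda s: s == "D")
# goal == "D", cost == 3 (path A -> B -> C -> D)
```

In a scheduling search, a state would be a partial schedule, step costs the increase in finish time, and h a lower bound such as the remaining critical-path length; a tighter admissible h prunes more states.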

We have also proposed [Ahmad and Kwok 1999; Kwok 1997] a parallel APN scheduling algorithm called the Parallel Bubble Scheduling and Allocation (PBSA) algorithm. The proposed PBSA algorithm is based on considerations such as a limited number of processors, link contention, heterogeneity of processors, and processor network topology. As a result, the algorithm is useful for distributed systems, including clusters of workstations. The major strength of the PBSA algorithm lies in its incremental strategy of scheduling nodes and messages together. It first uses the CPN-Dominant sequence to serialize the task graph to one PE, and then allows the nodes to migrate to other PEs to improve their start-times. In this manner, the start-times of the nodes, and hence the schedule length, are optimized incrementally. Furthermore, in the course of migration, the routing and scheduling of communication messages between tasks are also optimized. The PBSA algorithm first partitions the input DAG into a number of disjoint subgraphs. The subgraphs are then scheduled independently on multiple physical processors, each of which runs a sequential BSA algorithm. The final schedule is constructed by concatenating the subschedules produced. The proposed algorithm is, therefore, the first attempt of its kind in that it is a parallel algorithm and it also solves the scheduling problem by considering all the important scheduling parameters.

We have evaluated the PBSA algorithm [Ahmad and Kwok 1999; Kwok 1997] by testing it in experiments using extensive variations of input parameters, including graph types, graph sizes, CCRs, and target network topologies. Comparisons with three other APN scheduling algorithms have also been made. Based on the experimental results, we find that the PBSA algorithm can provide a scalable schedule, and can be useful for scheduling large task graphs which are virtually impossible to schedule using sequential algorithms. Furthermore, the PBSA algorithm exhibits superlinear speedup in that, given q physical processors, the algorithm can produce solutions of comparable quality with a speedup of roughly O(q^2) over the sequential case.

Other researchers have also suggested techniques for some restricted forms of the scheduling problem. Recently, Pramanick and Kuhl [1995] proposed a paradigm, called Parallel Dynamic Interaction (PDI), for developing parallel search algorithms for many NP-hard optimization problems. The PDI method is applied to the job-shop scheduling problem, in which a set of independent jobs is scheduled to homogeneous machines. De Falco et al. [1997] have suggested using parallel simulated annealing and parallel tabu search algorithms for the task allocation problem, in which a Task Interaction Graph (TIG), representing communicating processes in a distributed system, is to be mapped to homogeneous processors. As mentioned earlier, a TIG is different from a DAG in that the former is an undirected graph with no precedence constraints among the tasks. Parallel branch-and-bound techniques [Ferreira and Pardalos 1996] have also been used to tackle some simplified scheduling problems.

8.4 Future Research Directions

Research in DAG scheduling can be extended in several directions. One of the most challenging directions is to extend DAG scheduling to heterogeneous computing platforms. Heterogeneous computing (HC), using physically distributed diverse machines connected via a high-speed network for solving complex applications, is likely to dominate the next era of high-performance computing. One class of HC environment is a suite of sequential machines known as a network of workstations (NOW). Another class, known as the distributed heterogeneous supercomputing system (DHSS), is a suite of machines comprising a variety of sequential and parallel computers, providing an even higher level of parallelism. In general, it is impossible for a single machine architecture, with its associated compiler, operating system, and programming tools, to satisfy all the computational requirements of an application equally well. However, a heterogeneous computing environment that consists of a heterogeneous suite of machines, high-speed interconnections, interfaces, operating systems, communication protocols, and programming environments provides a variety of architectural capabilities, which can be orchestrated to perform an application that has diverse execution requirements. Due to the latest advances in networking technologies, HC is likely to flourish in the near future.

The goal of HC using a NOW or a DHSS is to achieve the minimum completion time for an application. A challenging future research problem is to design efficient algorithms for scheduling and mapping of applications to the machines in a HC environment. Task-to-machine mapping in a HC environment is beyond doubt more complicated than in a homogeneous environment. In a HC environment, a computation can be decomposed into tasks, each of which may have substantially different processing requirements. For example, a signal processing task may strictly require a machine possessing DSP processing capability. While the PBSA algorithm proposed [Ahmad and Kwok 1999] is a first step in this direction, more work is needed. One possible research direction is to first devise a new model of heterogeneous parallel applications as well as new models of HC environments. Based on these new models, more optimized algorithms can be designed.

Another avenue of further research is to extend the applicability of the existing randomization and evolutionary scheduling algorithms [Ali et al. 1994; Hou et al. 1994; Kwok 1997]. While they are targeted to be used in BNP scheduling, the algorithms may be extended to handle APN scheduling as well. However, some novel efficient algorithmic techniques for scheduling messages to links need to be sought, lest the time-complexity of the randomization algorithms increase. Further improvements in the genetic and evolutionary algorithms may be possible if we can determine an optimal set of control parameters, including crossover rate, mutation rate, population size, number of generations, and number of parallel processors used. However, finding an optimal parameter set for a particular genetic algorithm is still an open research problem.

9. SUMMARY AND CONCLUDING REMARKS

In this paper, we have presented an extensive survey of algorithms for the static scheduling problem. Processors and communication links are in general the most important resources in parallel and distributed systems, and their efficient management through proper scheduling is essential for obtaining high performance. We first introduced the DAG model and the multiprocessor model, followed by the problem statement of scheduling. In the DAG model, a node denotes an atomic program task, and an edge denotes the communication and data dependency between two program tasks. Each node is labeled with a weight denoting the amount of computation time required by the task. Each edge is also labeled with a weight denoting the amount of communication time required. The target multiprocessor system is modeled as a network of processing elements (PEs), each of which comprises a processor and a local memory unit, so that communication is achieved solely by message-passing. The objective of scheduling is to minimize the schedule length by properly allocating the nodes to the PEs and sequencing their start-times so that the precedence constraints are preserved.
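This model can be made concrete with a small sketch (our own illustration; the four-node DAG and its weights are made up). Node weights are computation times, edge weights are communication times, and communication is paid only when the two endpoints of an edge sit on different PEs; the schedule length is the finish time of the last task.

```python
# Toy DAG: node weights are computation times, edge weights are
# communication times (paid only across different PEs).
node_w = {"n1": 2, "n2": 3, "n3": 3, "n4": 1}
edges = {("n1", "n2"): 4, ("n1", "n3"): 1, ("n2", "n4"): 1, ("n3", "n4"): 1}

def schedule_length(alloc, order):
    """Finish time of the last task, given a PE allocation and a
    topological execution order. Each PE runs its tasks sequentially."""
    finish = {}          # task -> finish time
    pe_free = {}         # PE -> time it becomes free
    for t in order:
        # Earliest start: all predecessors done, plus a communication
        # delay when a predecessor ran on a different PE.
        ready = 0
        for (u, v), c in edges.items():
            if v == t:
                delay = c if alloc[u] != alloc[t] else 0
                ready = max(ready, finish[u] + delay)
        start = max(ready, pe_free.get(alloc[t], 0))
        finish[t] = start + node_w[t]
        pe_free[alloc[t]] = finish[t]
    return max(finish.values())

# All four tasks on one PE: fully serialized, no communication cost.
serial = schedule_length({"n1": 0, "n2": 0, "n3": 0, "n4": 0},
                         ["n1", "n2", "n3", "n4"])   # 2+3+3+1 = 9
```

Moving n3 to a second PE shortens the schedule to 8 despite the communication delays it introduces, which is exactly the trade-off a scheduling algorithm navigates.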

We have also presented a scrutiny of the NP-completeness results for various simplified variants of the problem, thereby illustrating that static scheduling is a hard optimization problem. As the problem is intractable even for moderately general cases, heuristic approaches are commonly sought.

To better understand the design of the heuristic scheduling schemes, we have also described and explained a set of basic techniques used in most algorithms. With these techniques, the task graph structure is carefully exploited to determine the relative importance of the nodes in the graph. More important nodes get a higher consideration priority for scheduling first. An important structure in a task graph is the critical path (CP). The nodes of the CP can be identified by the nodes' b-levels and t-levels. In order to put the representative work with different assumptions reported in the literature into a unified framework, we described a taxonomy of scheduling algorithms which classifies the algorithms into four categories: UNC (unbounded number of clusters) scheduling, BNP (bounded number of processors) scheduling, TDB (task duplication based) scheduling, and APN (arbitrary processor network) scheduling. Analytical results as well as scheduling examples have been shown to illustrate the functionality and characteristics of the surveyed algorithms. Task scheduling for heterogeneous systems, which are widely considered as promising platforms for high-performance computing, is also briefly discussed. As a postprocessing step of some scheduling algorithms, the mapping process is also examined. Various experimental software tools for scheduling and mapping are also described.
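The t-level and b-level computations used to locate the critical path can be sketched in a few lines. This is our own illustration on a made-up four-node DAG: the t-level of a node is the length of the longest path from an entry node to it (excluding its own weight), the b-level is the longest path from it to an exit node (including its own weight), and CP nodes are those whose t-level plus b-level equals the CP length.

```python
node_w = {"n1": 2, "n2": 3, "n3": 3, "n4": 1}
succ = {"n1": {"n2": 4, "n3": 1}, "n2": {"n4": 1}, "n3": {"n4": 1}, "n4": {}}
topo = ["n1", "n2", "n3", "n4"]   # a topological order of the DAG

# t-level: longest path from an entry node to n, excluding n's own weight.
t_level = {n: 0 for n in topo}
for n in topo:
    for m, c in succ[n].items():
        t_level[m] = max(t_level[m], t_level[n] + node_w[n] + c)

# b-level: longest path from n to an exit node, including n's own weight.
b_level = {}
for n in reversed(topo):
    b_level[n] = node_w[n] + max(
        (c + b_level[m] for m, c in succ[n].items()), default=0)

# Critical-path nodes: t-level + b-level equals the CP length.
cp_len = max(t_level[n] + b_level[n] for n in topo)
cp = [n for n in topo if t_level[n] + b_level[n] == cp_len]
# Here cp_len is 11 and the CP is n1 -> n2 -> n4.
```

Both passes visit each edge once, so the levels are computed in O(v + e) time, which is why so many list-scheduling heuristics can afford to use them as node priorities.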

Finally, we have surveyed a number of new techniques which have recently been proposed for achieving these goals. These techniques include genetic and evolutionary algorithms, randomization techniques, and parallelized scheduling approaches.

ACKNOWLEDGMENT

The authors would like to thank the referees for their comments.

REFERENCES

ADAM, T. L., CHANDY, K. M., AND DICKSON, J. R. 1974. A comparison of list scheduling for parallel processing systems. Commun. ACM 17, 12 (Dec.), 685–690.

AHMAD, I. AND DHODHI, M. K. 1995. Task assignment using a problem-space genetic algorithm. Concurrency: Pract. Exper. 7, 5 (Aug.), 411–428.

AHMAD, I. AND GHAFOOR, A. 1991. Semi-distributed load balancing for massively parallel multicomputer systems. IEEE Trans. Softw. Eng. 17, 10 (Oct. 1991), 987–1004.

AHMAD, I. AND KWOK, Y.-K. 1998a. On exploiting task duplication in parallel program scheduling. IEEE Trans. Parallel Distrib. Syst. 9, 9, 872–892.

AHMAD, I. AND KWOK, Y.-K. 1998b. Optimal and near-optimal allocation of precedence-constrained tasks to parallel processors: Defying the high complexity using effective search technique. In Proceedings of the 1998 International Conference on Parallel Processing (Aug.).

AHMAD, I. AND KWOK, Y.-K. 1999. On parallelizing the multiprocessor scheduling problem. IEEE Trans. Parallel Distrib. Syst. 10, 4 (Apr.), 414–432.

AHMAD, I., KWOK, Y.-K., AND WU, M.-Y. 1996. Analysis, evaluation, and comparison of algorithms for scheduling task graphs on parallel processors. In International Symposium on Parallel Architectures, Algorithms, and Networks (June), 207–213.

AHMAD, I., KWOK, Y.-K., WU, M.-Y., AND SHU, W. 1997. Automatic parallelization and scheduling of programs on multiprocessors using CASCH. In Proceedings of the International Conference on Parallel Processing (ICPP, Aug.), 288–291.

ALI, H. H. AND EL-REWINI, H. 1993. The time complexity of scheduling interval orders with communication is polynomial. Para. Proc. Lett. 3, 1, 53–58.

ALI, S., SAIT, S. M., AND BENTEN, M. S. T. 1994. GSA: Scheduling and allocation using genetic algorithm. In Proceedings of the Conference on EURO-DAC’94, 84–89.

AL-MOUHAMED, M. A. 1990. Lower bound on the number of processors and time for scheduling precedence graphs with communication costs. IEEE Trans. Softw. Eng. 16, 12 (Dec. 1990), 1390–1401.

ALMEIDA, V. A. F., VASCONCELOS, I. M. M., ÁRABE, J. N. C., AND MENASCÉ, D. A. 1992. Using random task graphs to investigate the potential benefits of heterogeneity in parallel systems. In Proceedings of the 1992 Conference on Supercomputing (Supercomputing ’92, Minneapolis, MN, Nov. 16–20), R. Werner, Ed. IEEE Computer Society Press, Los Alamitos, CA, 683–691.

AMDAHL, G. 1967. Validity of the single processor approach to achieving large scale computing capability. In Proceedings of the AFIPS Spring Joint Computer Conference (Reston, VA), AFIPS Press, Arlington, VA, 483–485.

ANGER, F. D., HWANG, J.-J., AND CHOW, Y.-C. 1990. Scheduling with sufficient loosely coupled processors. J. Parallel Distrib. Comput. 9, 1 (May 1990), 87–92.

BASHIR, A. F., SUSARLA, V., AND VAIRAVAN, K. 1983. A statistical study of the performance of a task scheduling algorithm. IEEE Trans. Comput. C-32, 8 (Aug.), 774–777.

BAXTER, J. AND PATEL, J. H. 1989. The LAST algorithm: A heuristic-based static task allocation algorithm. In Proceedings of the International Conference on Parallel Processing (ICPP ’89, Aug.), Pennsylvania State University, University Park, PA, 217–222.

BECK, M., PINGALI, K., AND NICOLAU, A. 1990. Static scheduling for dynamic dataflow machines. J. Parallel Distrib. Comput. 10, 4 (Dec. 1990), 279–288.

BENTEN, M. S. T. AND SAIT, S. M. 1994. Genetic scheduling of task graphs. Int. J. Electron. 77, 4 (Oct.), 401–415.

BLAZEWICZ, J., DRABOWSKI, M., AND WEGLARZ, J. 1986. Scheduling multiprocessor tasks to minimize schedule length. IEEE Trans. Comput. C-35, 5 (May 1986), 389–393.

BLAZEWICZ, J., WEGLARZ, J., AND DRABOWSKI, M. 1984. Scheduling independent 2-processor tasks to minimize schedule length. Inf. Process. Lett. 18, 5 (June 1984), 267–273.

BOKHARI, S. H. 1979. Dual processor scheduling with dynamic reassignment. IEEE Trans. Softw. Eng. SE-5, 4 (July), 341–349.

BOKHARI, S. H. 1981. On the mapping problem. IEEE Trans. Comput. C-30, 5, 207–214.

BOZOKI, G. AND RICHARD, J. P. 1970. A branch-and-bound algorithm for continuous-process task shop scheduling problem. AIIE Trans. 2, 246–252.

BRUNO, J., COFFMAN, E. G., AND SETHI, R. 1974. Scheduling independent tasks to reduce mean finishing time. Commun. ACM 17, 7 (July), 382–387.

CASAVANT, T. L. AND KUHL, J. G. 1988. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Trans. Softw. Eng. 14, 2 (Feb.), 141–154.

CHANDRASEKHARAM, R., SUBHRAMANIAN, S., AND CHAUDHURY, S. 1993. Genetic algorithm for node partitioning problem and applications in VLSI design. IEE Proc. Comput. Digit. Tech. 140, 5 (Sept.), 255–260.

CHEN, G. AND LAI, T. H. 1988a. Scheduling independent jobs on hypercubes. In Proceedings of the Conference on Theoretical Aspects of Computer Science, 273–280.

CHEN, G.-I. AND LAI, T.-H. 1988b. Preemptive scheduling of independent jobs on a hypercube. Inf. Process. Lett. 28, 4 (July 29, 1988), 201–206.

CHEN, H., SHIRAZI, B., AND MARQUIS, J. 1993. Performance evaluation of a novel scheduling method: Linear clustering with task duplication. In Proceedings of the 2nd International Conference on Parallel and Distributed Systems (Dec.), 270–275.

CHENG, R., GEN, M., AND TSUJIMURA, Y. 1996. A tutorial survey of job-shop scheduling problems using genetic algorithms—I: representation. Comput. Ind. Eng. 30, 4, 983–997.

CHRETIENNE, P. 1989. A polynomial algorithm to optimally schedule tasks on a virtual distributed system under tree-like precedence constraints. Europ. J. Oper. Res. 43, 225–230.

CHU, W. W., LAN, M.-T., AND HELLERSTEIN, J. 1984. Estimation of intermodule communication (IMC) and its applications in distributed processing systems. IEEE Trans. Comput. C-33, 8 (Aug.), 691–699.

CHUNG, Y.-C. AND RANKA, S. 1992. Applications and performance analysis of a compile-time optimization approach for list scheduling algorithms on distributed memory multiprocessors. In Proceedings of the 1992 Conference on Supercomputing (Supercomputing ’92, Minneapolis, MN, Nov. 16–20), R. Werner, Ed. IEEE Computer Society Press, Los Alamitos, CA, 512–521.

COFFMAN, E. G. 1976. Computer and Job-Shop Scheduling Theory. John Wiley and Sons, Inc., New York, NY.

COFFMAN, E. G. AND GRAHAM, R. L. 1972. Optimal scheduling for two-processor systems. Acta Inf. 1, 200–213.

COLIN, J. Y. AND CHRETIENNE, P. 1991. C.P.M. scheduling with small computation delays and task duplication. Oper. Res. 39, 4, 680–684.

COSNARD, M. AND LOI, M. 1995. Automatic task graph generation techniques. Para. Proc. Lett. 5, 4 (Dec.), 527–538.

CRAY RESEARCH, INC. 1991. UNICOS Performance Utilities Reference Manual, SR2040. Cray Research, Inc., Chippewa Falls, WI.

DALLY, W. J. 1992. Virtual-channel flow control. IEEE Trans. Parallel Distrib. Syst. 3, 3 (Mar.), 194–205.

DARBHA, S. AND AGARWAL, D. P. 1995. A fast and scalable scheduling algorithm for distributed memory systems. In Proceedings of the 7th Symposium on Parallel and Distributed Processing (Oct.), 60–63.

DAVIS, T., Ed. 1991. The Handbook of Genetic Algorithms. Van Nostrand Reinhold Co., New York, NY.

DE FALCO, I., DEL BALIO, R., AND TARANTINO, E. 1997. An analysis of parallel heuristics for task allocation in multicomputers. Computing 59, 3, 259–275.

DHODHI, M. K., AHMAD, I., AND AHMAD, I. 1995. A multiprocessor scheduling scheme using problem-space genetic algorithms. In Proceedings of the IEEE International Conference on Evolutionary Computation, IEEE Computer Society Press, Los Alamitos, CA, 214–219.

DIGITAL EQUIPMENT CORP. 1992. PARASPHERE User’s Guide. Digital Equipment Corp., Maynard, MA.

DIXIT-RADYA, V. A. AND PANDA, D. K. 1993. Task assignment on distributed-memory systems with adaptive wormhole routing. In Proceedings of the 2nd International Conference on Parallel and Distributed Systems (Dec.), 674–681.

DU, J. AND LEUNG, J. Y.-T. 1989. Complexity of scheduling parallel task systems. SIAM J. Discrete Math. 2, 4 (Nov. 1989), 473–487.

EL-REWINI, H. AND ALI, H. H. 1995. Static scheduling of conditional branches in parallel programs. J. Parallel Distrib. Comput. 24, 1 (Jan. 1995), 41–54.

EL-REWINI, H., ALI, H. H., AND LEWIS, T. G. 1995. Task scheduling in multiprocessor systems. IEEE Computer 28, 12 (Dec.), 27–37.

EL-REWINI, H. AND LEWIS, T. G. 1990. Scheduling parallel program tasks onto arbitrary target machines. J. Parallel Distrib. Comput. 9, 2 (June 1990), 138–153.

EL-REWINI, H., LEWIS, T. G., AND ALI, H. H. 1994. Task Scheduling in Parallel and Distributed Systems. Prentice-Hall Series in Innovative Technology. Prentice-Hall, Inc., Upper Saddle River, NJ.

ERCEGOVAC, M. D. 1988. Heterogeneity in supercomputer architectures. Parallel Comput. 7, 367–372.

FERNANDEZ, E. B. AND BUSSELL, B. 1973. Bounds on the number of processors and time for multiprocessor optimal schedules. IEEE Trans. Comput. C-22, 8 (Aug.), 745–751.


FERREIRA, A. AND PARDALOS, P., Eds. 1996. Solving Combinatorial Optimization Problems in Parallel: Methods and Techniques. Lecture Notes in Computer Science, vol. 1054. Springer-Verlag, New York, NY.

FILHO, J. L. R., TRELEAVEN, P. C., AND ALIPPI, C. 1994. Genetic-algorithm programming environments. IEEE Computer 27, 6 (June 1994), 28–43.

FISHBURN, P. C. 1985. Interval Orders and Interval Graphs. John Wiley and Sons, Inc., New York, NY.

FORREST, S. AND MITCHELL, M. 1993. What makes a problem hard for a genetic algorithm?: some anomalous results and their explanation. Mach. Learn. 13, 2/3 (Nov./Dec. 1993), 285–319.

FREUND, R. F. AND SIEGEL, H. J. 1993. Heterogeneous processing. IEEE Computer 26, 6 (June), 13–17.

FRIESEN, D. K. 1987. Tighter bounds for LPT scheduling on uniform processors. SIAM J. Comput. 16, 3 (June 1987), 554–560.

FUJII, M., KASAMI, T., AND NINOMIYA, K. 1969. Optimal sequencing of two equivalent processors. SIAM J. Appl. Math. 17, 1.

GABOW, H. 1982. An almost linear algorithm for two-processor scheduling. J. ACM 29, 3 (July), 766–780.

GAJSKI, D. D. AND PIER, J. 1985. Essential issues in multiprocessors. IEEE Computer 18, 6 (June).

GAREY, M. AND JOHNSON, D. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York, NY.

GAREY, M. R., JOHNSON, D., TARJAN, R., AND YANNAKAKIS, M. 1983. Scheduling opposing forests. SIAM J. Algebr. Discret. Methods 4, 1, 72–92.

GERASOULIS, A. AND YANG, T. 1992. A comparison of clustering heuristics for scheduling DAG’s on multiprocessors. J. Parallel Distrib. Comput. 16, 4 (Dec.), 276–291.

GERASOULIS, A. AND YANG, T. 1993. On the granularity and clustering of directed acyclic task graphs. IEEE Trans. Parallel Distrib. Syst. 4, 6 (June), 686–701.

GOLDBERG, D. E. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Co., Inc., Redwood City, CA.

GONZALEZ, M. J., JR. 1977. Deterministic processor scheduling. ACM Comput. Surv. 9, 3 (Sept.), 173–204.

GONZALEZ, T. AND SAHNI, S. 1978. Preemptive scheduling of uniform processor systems. J. ACM 25, 1 (Jan.), 92–101.

GRAHAM, R. L. 1966. Bounds for certain multiprocessing anomalies. Bell Syst. Tech. J. 45, 1563–1581.

GRAHAM, R. L., LAWLER, E. L., LENSTRA, J. K., AND RINNOOY KAN, A. H. G. 1979. Optimization and approximation in deterministic sequencing and scheduling: A survey. Ann. Discrete Math. 5, 287–326.

HA, S. AND LEE, E. A. 1991. Compile-time scheduling and assignment of data-flow program graphs with data-dependent iteration. IEEE Trans. Comput. 40, 11 (Nov. 1991), 1225–1238.

HANAN, M. AND KURTZBERG, J. 1972. A review of the placement and quadratic assignment problems. SIAM Rev. 14 (Apr.), 324–342.

HOCHBAUM, D. S. AND SHMOYS, D. B. 1987. Using dual approximation algorithms for scheduling problems: theoretical and practical results. J. ACM 34, 1 (Jan. 1987), 144–162.

HOCHBAUM, D. S. AND SHMOYS, D. B. 1988. A polynomial approximation scheme for scheduling on uniform processors: Using the dual approximation approach. SIAM J. Comput. 17, 3 (June 1988), 539–551.

HOLLAND, J. H. 1992. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. 2nd ed. MIT Press, Cambridge, MA.

HORVATH, E. C., LAM, S., AND SETHI, R. 1977. A level algorithm for preemptive scheduling. J. ACM 24, 1 (Jan.), 32–43.

HOU, E. S. H., ANSARI, N., AND REN, H. 1994. A genetic algorithm for multiprocessor scheduling. IEEE Trans. Parallel Distrib. Syst. 5, 2 (Feb.), 113–120.

HU, T. C. 1961. Parallel sequencing and assembly line problems. Oper. Res. 9, 6 (Nov.), 841–848.

HWANG, K. 1993. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, Inc., New York, NY.

HWANG, J.-J., CHOW, Y.-C., ANGER, F. D., AND LEE, C.-Y. 1989. Scheduling precedence graphs in systems with interprocessor communication times. SIAM J. Comput. 18, 2 (Apr. 1989), 244–257.

INTEL SUPERCOMPUTER SYSTEMS DIVISION. 1991. iPSC/2 and iPSC/860 Interactive Parallel Debugger Manual.

JAIN, K. K. AND RAJARAMAN, V. 1994. Lower and upper bounds on time for multiprocessor optimal schedules. IEEE Trans. Parallel Distrib. Syst. 5, 8 (Aug.), 879–886.

JAIN, K. K. AND RAJARAMAN, V. 1995. Improved lower bounds on time and processors for scheduling precedence graphs on multicomputer systems. J. Parallel Distrib. Comput. 28, 1 (July 1995), 101–108.

JIANG, H., BHUYAN, L. N., AND GHOSAL, D. 1990. Approximate analysis of multiprocessing task graphs. In Proceedings of the International Conference on Parallel Processing (Aug.), 228–235.

JOHNSON, D. S., PAPADIMITRIOU, C. H., AND YANNAKAKIS, M. 1988. How easy is local search? J. Comput. Syst. Sci. 37, 1 (Aug. 1988), 79–100.


KARP, R. M. 1991. An introduction to random-ized algorithms. Discrete Appl. Math. 34, 1-3(Nov. 1991), 165–201.

KASAHARA, H. AND NARITA, S. 1984. Practicalmultiprocessor scheduling algorithms for effi-cient parallel processing. IEEE Trans. Com-put. C-33, 11 (Nov.), 1023–1029.

KAUFMAN, M. 1974. An almost-optimal algo-rithm for the assembly line scheduling problem.IEEE Trans. Comput. C-23, 11 (Nov.), 1169–1174.

KHAN, A., MCCREARY, C. L., AND JONES, M. S.1994. A comparison of multiprocessor sched-uling heuristics. In Proceedings of the 1994International Conference on Parallel Process-ing, CRC Press, Inc., Boca Raton, FL, 243–250.

KIM, S. J. AND BROWNE, J. C. 1988. A generalapproach to mapping of parallel computationupon multiprocessor architectures. In Pro-ceedings of International Conference on Paral-lel Processing (Aug.), 1–8.

KIM, D. AND YI, B.-G. 1994. A two-pass schedul-ing algorithm for parallel programs. ParallelComput. 20, 6 (June 1994), 869–885.

KOHLER, W. H. 1975. A preliminary evaluationof the critical path method for schedulingtasks on multiprocessor systems. IEEETrans. Comput. C-24, 12 (Dec.), 1235–1238.

KOHLER, W. H. AND STEIGLITZ, K. 1974.Characterization and theoretical comparisonof branch-and-bound algorithms for permuta-tion problems. J. ACM 21, 1 (Jan.), 140–156.

KON’YA, S. AND SATOH, T. 1993. Task schedulingon a hypercube with link contentions. InProceedings of International Parallel Process-ing Symposium (Apr.), 363–368.

KRUATRACHUE, B. AND LEWIS, T. G. 1987.Duplication Scheduling Heuristics (DSH): ANew Precedence Task Scheduler for ParallelProcessor Systems. Oregon State Univer-sity, Corvallis, OR.

KRUATRACHUE, B. AND LEWIS, T. G. 1988. Grainsize determination for parallel process-ing. IEEE Software 5, 1 (Jan.), 23–32.

KUMAR, V., GRAMA, A., GUPTA, A., AND KARYPIS, G.1994. Introduction to Parallel Computing:Design and Analysis of Algorithms. Ben-jamin/Cummings, Redwood City, CA.

KWOK, Y.-K. 1997. High-performance algo-rithms for compile-time scheduling of parallelprocessors. Ph.D. Dissertation. Depart-ment of Computer Science, Hong Kong Uni-versity of Science and Technology, HongKong.

KWOK, Y.-K. AND AHMAD, I. 1995. Bubble sched-uling: A quasi-dynamic algorithm for staticallocation of tasks to parallel architec-tures. In Proceedings of 7th Symposium onParallel and Distributed Processing(Oct.), 36–43.

KWOK, Y.-K. AND AHMAD, I. 1996. Dynamic crit-ical-path scheduling: An effective techniquefor allocating task graphs to multiprocessors.

IEEE Trans. Parallel Distrib. Syst. 7, 5, 506–521.

KWOK, Y.-K. AND AHMAD, I. 1997. Efficientscheduling of arbitrary task graphs to multi-processors using a parallel genetic algo-rithm. J. Parallel Distrib. Comput. 47, 1,58–77.

KWOK, Y.-K. AND AHMAD, I. 1999a. FASTEST: Apractical low-complexity algorithm for com-pile-time assignment of parallel programs tomultiprocessors. IEEE Trans. Parallel Dis-trib. Syst. 10, 2 (Feb.), 147–159.

KWOK, Y.-K. AND AHMAD, I. 1999b. Bench-marking and comparison of the task graphscheduling algorithms. J. Parallel Distrib.Comput. 59, 3 (Dec.), 381–422.

KWOK, Y.-K., AHMAD, I., AND GU, J. 1996. FAST:A low-complexity algorithm for efficientscheduling of DAGs on parallel processors.In Proceedings of 25th International Confer-ence on Parallel Processing (Aug.), 150–157.

LEE, S.-Y. AND AGGARWAL, J. K. 1987. A map-ping strategy for parallel processing. IEEETrans. Comput. C-36, 4 (Apr. 1987), 433–442.

LEE, B., HURSON, A. R., AND FENG, T. Y. 1991. Avertically layered allocation scheme for dataflow systems. J. Parallel Distrib. Comput.11, 3 (Mar. 1991), 175–187.

LEUNG, J. Y.-T. AND YOUNG, G. H. 1989. Minimizing schedule length subject to minimum flow time. SIAM J. Comput. 18, 2 (Apr. 1989), 314–326.

LEWIS, T. G. AND EL-REWINI, H. 1993. Parallax: A tool for parallel program scheduling. IEEE Parallel Distrib. Technol. 1, 2 (May), 64–76.

LIOU, J.-C. AND PALIS, M. A. 1996. An efficient task clustering heuristic for scheduling DAGs on multiprocessors. In Workshop on Resource Management, Symposium on Parallel and Distributed Processing.

LIOU, J.-C. AND PALIS, M. A. 1997. A comparison of general approaches to multiprocessor scheduling. In Proceedings of the 11th International Parallel Processing Symposium (Apr.), 152–156.

LO, V. M. 1992. Temporal communication graphs: Lamport’s process-time graphs augmented for the purpose of mapping and scheduling. J. Parallel Distrib. Comput. 16, 4 (Dec.), 378–384.

LO, V. M., RAJOPADHYE, S., GUPTA, S., KELDSEN, D., MOHAMED, M. A., NITZBERG, B., TELLE, J. A., AND ZHONG, X. 1991. OREGAMI: Tools for mapping parallel computations to parallel architectures. Int. J. Parallel Program. 20, 3, 237–270.

LORD, R. E., KOWALIK, J. S., AND KUMAR, S. P. 1983. Solving linear algebraic equations on an MIMD computer. J. ACM 30, 1 (Jan.), 103–117.

MANOHARAN, S. AND TOPHAM, N. P. 1995. An assessment of assignment schemes for dependency graphs. Parallel Comput. 21, 1 (Jan. 1995), 85–107.


MARKENSCOFF, P. AND LI, Y. Y. 1993. Scheduling a computational DAG on a parallel system with communication delays and replication of node execution. In Proceedings of the International Parallel Processing Symposium (Apr.), 113–117.

MASPAR COMPUTER. 1992. MPPE User’s Guide.

MCCREARY, C. AND GILL, H. 1989. Automatic determination of grain size for efficient parallel processing. Commun. ACM 32, 9 (Sept. 1989), 1073–1078.

MCCREARY, C., KHAN, A. A., THOMPSON, J. J., AND MCARDLE, M. E. 1994. A comparison of heuristics for scheduling DAG’s on multiprocessors. In Proceedings of the International Parallel Processing Symposium, 446–451.

MEHDIRATTA, N. AND GHOSE, K. 1994. A bottom-up approach to task scheduling on distributed memory multiprocessors. In Proceedings of the 1994 International Conference on Parallel Processing, CRC Press, Inc., Boca Raton, FL, 151–154.

MENASCÉ, D. AND ALMEIDA, V. 1990. Cost-performance analysis of heterogeneity in supercomputer architectures. In Proceedings of Supercomputing ’90 (New York, NY, Nov. 12–16, 1990), J. L. Martin, Ed. IEEE Computer Society Press, Los Alamitos, CA, 169–177.

MENASCÉ, D. A. AND PORTO, S. C. 1993. Scheduling on heterogeneous message passing architectures. J. Comput. Softw. Eng. 1, 3.

MENASCÉ, D. A., PORTO, S. C., AND TRIPATHI, S. K. 1992. Processor assignment in heterogeneous parallel architectures. In Proceedings of the International Parallel Processing Symposium.

MENASCÉ, D. A., PORTO, S. C., AND TRIPATHI, S. K. 1994. Static heuristic processor assignment in heterogeneous message passing architectures. Int. J. High Speed Comput. 6, 1 (Mar.), 115–137.

MENASCÉ, D. A., SAHA, D., PORTO, S. C. D. S., ALMEIDA, V. A. F., AND TRIPATHI, S. K. 1995. Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures. J. Parallel Distrib. Comput. 28, 1 (July 1995), 1–18.

MOTWANI, R. AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, New York, NY.

NORMAN, M. G. AND THANISCH, P. 1993. Models of machines and computation for mapping in multicomputers. ACM Comput. Surv. 25, 3 (Sept. 1993), 263–302.

PALIS, M. A., LIOU, J.-C., RAJASEKARAN, S., SHENDE, S., AND WEI, D. S. L. 1995. Online scheduling of dynamic trees. Parallel Process. Lett. 5, 4 (Dec.), 635–646.

PALIS, M. A., LIOU, J.-C., AND WEI, D. S. L. 1996. Task clustering and scheduling for distributed memory parallel architectures. IEEE Trans. Parallel Distrib. Syst. 7, 1, 46–55.

PANDE, S. S., AGRAWAL, D. P., AND MAUNEY, J. 1994. A threshold scheduling strategy for Sisal on distributed memory machines. J. Parallel Distrib. Comput. 21, 2 (May 1994), 223–236.

PAPADIMITRIOU, C. H. AND STEIGLITZ, K. 1982. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., Upper Saddle River, NJ.

PAPADIMITRIOU, C. H. AND ULLMAN, J. D. 1987. A communication-time tradeoff. SIAM J. Comput. 16, 4 (Aug. 1987), 639–646.

PAPADIMITRIOU, C. H. AND YANNAKAKIS, M. 1979. Scheduling interval-ordered tasks. SIAM J. Comput. 8, 405–409.

PAPADIMITRIOU, C. H. AND YANNAKAKIS, M. 1990. Towards an architecture-independent analysis of parallel algorithms. SIAM J. Comput. 19, 2 (Apr. 1990), 322–328.

PEASE, D., GHAFOOR, A., AHMAD, I., ANDREWS, D. L., FOUDIL-BEY, K., KARPINSKI, T. E., MIKKI, M. A., AND ZERROUKI, M. 1991. PAWS: A performance evaluation tool for parallel computing systems. IEEE Computer 24, 1 (Jan. 1991), 18–29.

PRAMANICK, I. AND KUHL, J. G. 1995. An inherently parallel method for heuristic problem-solving: Part I–General framework. IEEE Trans. Parallel Distrib. Syst. 6, 10 (Oct.), 1006–1015.

PRASTEIN, M. 1987. Precedence-constrained scheduling with minimum time and communication. Master’s Thesis. University of Illinois at Urbana-Champaign, Champaign, IL.

QUINN, M. J. 1994. Parallel Computing: Theory and Practice (2nd ed.). McGraw-Hill, Inc., New York, NY.

RAMAMOORTHY, C. V., CHANDY, K. M., AND GONZALEZ, M. J. 1972. Optimal scheduling strategies in a multiprocessor system. IEEE Trans. Comput. C-21, 2 (Feb.), 137–146.

RAYWARD-SMITH, V. J. 1987a. The complexity of preemptive scheduling given interprocessor communication delays. Inf. Process. Lett. 25, 2 (6 May 1987), 123–125.

RAYWARD-SMITH, V. J. 1987b. UET scheduling with unit interprocessor communication delays. Discrete Appl. Math. 18, 1 (Jan. 1987), 55–71.

SARKAR, V. 1989. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA.

SCHWEHM, M., WALTER, T., BUCHBERGER, B., AND VOLKERT, J. 1994. Mapping and scheduling by genetic algorithms. In Proceedings of the 3rd Joint International Conference on Vector and Parallel Processing (CONPAR ’94), Springer-Verlag, New York, NY, 832–841.

SELVAKUMAR, S. AND MURTHY, C. S. R. 1994. Scheduling precedence constrained task graphs with non-negligible intertask communication onto multiprocessors. IEEE Trans. Parallel Distrib. Syst. 5, 3 (Mar.), 328–336.


SETHI, R. 1976. Scheduling graphs on two processors. SIAM J. Comput. 5, 1 (Mar.), 73–82.

SHIRAZI, B., KAVI, K., HURSON, A. R., AND BISWAS, P. 1993. PARSA: A parallel program scheduling and assessment environment. In Proceedings of the International Conference on Parallel Processing, CRC Press, Inc., Boca Raton, FL, 68–72.

SHIRAZI, B., WANG, M., AND PATHAK, G. 1990. Analysis and evaluation of heuristic methods for static task scheduling. J. Parallel Distrib. Comput. 10, 3 (Nov. 1990), 222–232.

SIEGEL, H. J., ARMSTRONG, J. B., AND WATSON, D. W. 1992. Mapping computer-vision-related tasks onto reconfigurable parallel-processing systems. IEEE Computer 25, 2 (Feb. 1992), 54–64.

SIEGEL, H. J., DIETZ, H. G., AND ANTONIO, J. K. 1996. Software support for heterogeneous computing. ACM Comput. Surv. 28, 1, 237–239.

SIH, G. C. AND LEE, E. A. 1993a. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans. Parallel Distrib. Syst. 4, 2 (Feb.), 75–87.

SIH, G. C. AND LEE, E. A. 1993b. Declustering: A new multiprocessor scheduling technique. IEEE Trans. Parallel Distrib. Syst. 4, 6 (June), 625–637.

SIMONS, B. B. AND WARMUTH, M. K. 1989. A fast algorithm for multiprocessor scheduling of unit-length jobs. SIAM J. Comput. 18, 4 (Aug. 1989), 690–710.

SRINIVAS, M. AND PATNAIK, L. M. 1994. Genetic algorithms: A survey. IEEE Computer 27, 6 (June 1994), 17–26.

STONE, H. S. 1977. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Trans. Softw. Eng. SE-3, 1 (Jan.), 85–93.

STORER, R. H., WU, S. D., AND VACCARI, R. 1992. New search spaces for sequencing problems with application to job shop scheduling. Manage. Sci. 38, 10 (Oct. 1992), 1495–1509.

SUMICHRAST, R. T. 1987. Scheduling parallel processors to minimize setup time. Comput. Oper. Res. 14, 4 (Oct. 1987), 305–313.

THINKING MACHINES CORPORATION. 1991. PRISM User’s Guide. Thinking Machines Corp., Bedford, MA.

TOWSLEY, D. 1986. Allocating programs containing branches and loops within a multiple processor system. IEEE Trans. Softw. Eng. SE-12, 10 (Oct. 1986), 1018–1024.

VARVARIGOU, T. A., ROYCHOWDHURY, V. P., KAILATH, T., AND LAWLER, E. 1996. Scheduling in and out forests in the presence of communication delays. IEEE Trans. Parallel Distrib. Syst. 7, 10, 1065–1074.

ULLMAN, J. 1975. NP-complete scheduling problems. J. Comput. Syst. Sci. 10, 384–393.

VELTMAN, B., LAGEWEG, B. J., AND LENSTRA, J. K. 1990. Multiprocessor scheduling with communication delays. Parallel Comput. 16, 173–182.

WANG, L., SIEGEL, H. J., AND ROYCHOWDHURY, V. P. 1996. A genetic-algorithm-based approach for task matching and scheduling in heterogeneous computing environments. In Proceedings of the ’96 Workshop on Heterogeneous Computing, IEEE Computer Society Press, Los Alamitos, CA, 72–85.

WANG, M.-F. 1990. Message routing algorithms for static task scheduling. In Proceedings of the 1990 Symposium on Applied Computing (SAC ’90), 276–281.

WANG, Q. AND CHENG, K. H. 1991. List scheduling of parallel tasks. Inf. Process. Lett. 37, 5 (Mar. 14, 1991), 291–297.

WONG, W. S. AND MORRIS, R. J. T. 1989. A new approach to choosing initial points in local search. Inf. Process. Lett. 30, 2 (Jan. 1989), 67–72.

WU, M.-Y. AND GAJSKI, D. D. 1990. Hypertool: A programming aid for message-passing systems. IEEE Trans. Parallel Distrib. Syst. 1, 3 (1990), 330–343.

YANG, C.-Q. AND MILLER, B. P. 1988. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the 8th International Conference on Distributed Computing Systems (ICDCS ’88, Washington, D.C., June), IEEE Computer Society Press, Los Alamitos, CA, 366–373.

YANG, J., BIC, L., AND NICOLAU, A. 1993. A mapping strategy for MIMD computers. Int. J. High Speed Comput. 5, 1, 89–103.

YANG, T. AND GERASOULIS, A. 1992. PYRROS: Static task scheduling and code generation for message passing multiprocessors. In Proceedings of the 1992 International Conference on Supercomputing (ICS ’92, Washington, DC, July 19–23, 1992), K. Kennedy and C. D. Polychronopoulos, Eds. ACM Press, New York, NY, 428–437.

YANG, T. AND GERASOULIS, A. 1993. List scheduling with and without communication delays. Parallel Comput. 19, 12 (Dec. 1993), 1321–1344.

YANG, T. AND GERASOULIS, A. 1994. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Trans. Parallel Distrib. Syst. 5, 9 (Sept.), 951–967.

ZHU, Y. AND MCCREARY, C. L. 1992. Optimal and near optimal tree scheduling for parallel systems. In Proceedings of the Symposium on Parallel and Distributed Processing, IEEE Computer Society Press, Los Alamitos, CA, 112–119.

Received: December 1997; revised: July 1998; accepted: September 1998
