To appear in MICRO-35, November 2002, Istanbul, Turkey
Convergent Scheduling
Walter Lee, Diego Puppin, Shane Swenson, Saman Amarasinghe
Massachusetts Institute of Technology
{walt, diego, sswenson, [email protected]}
Abstract
Convergent scheduling is a general framework for cluster assignment and instruction scheduling on spatial architectures. A convergent scheduler is composed of independent passes, each implementing a heuristic that addresses a particular problem or constraint. The passes share a simple, common interface that provides spatial and temporal preference for each instruction. Preferences are not absolute; instead, the interface allows a pass to express the confidence of its preferences, as well as preferences for multiple space and time slots. A pass operates by modifying these preferences. By applying a series of passes that address all the relevant constraints, the convergent scheduler can produce a schedule that satisfies all the important constraints. Because all passes are independent and need to understand only one interface to interact with each other, convergent scheduling simplifies the problem of handling multiple constraints and co-developing different heuristics. We have applied convergent scheduling to two spatial architectures: the Raw processor and a clustered VLIW machine. It is able to successfully handle traditional constraints such as parallelism, load balancing, and communication minimization, as well as constraints due to preplaced instructions, which are instructions with predetermined cluster assignment. Convergent scheduling is able to obtain an average performance improvement of 21% over the existing space-time scheduler of the Raw processor, and an improvement of 14% over state-of-the-art assignment and scheduling techniques on a clustered VLIW architecture.
1 Introduction
Instruction scheduling on microprocessors is becoming a more and more difficult problem. In almost all practical instances, it is NP-complete, and it often faces multiple contradictory constraints. For superscalars or VLIWs, the two primary issues are parallelism and register pressure. Code sequences that expose more instruction-level parallelism (ILP) also have longer live ranges and higher register pressure. To generate good schedules, the instruction scheduler must somehow exploit as much ILP as possible without leading to a large number of register spills.
On spatial architectures, instruction scheduling is even more complicated. Examples of spatial architectures include clustered VLIWs, Raw [24], Grid processors [22], and ILDPs [12]. Spatial architectures are architectures that distribute their computing resources. Communication between distant resources can incur one or more cycles of delay. On these architectures, the instruction scheduler has to partition instructions across the computing resources. Thus, instruction scheduling becomes both a spatial problem and a temporal problem.
To make partitioning decisions, the scheduler has to understand the proper tradeoff between parallelism and locality. Figure 1 shows an example of this tradeoff. Spatial scheduling by itself is already a more difficult problem than temporal scheduling, because a small spatial mistake is generally more costly than a small temporal mistake. If a critical instruction is scheduled one cycle later than desired, only one cycle is lost. But if a critical instruction is scheduled one unit of distance farther away than desired, cycles can be lost from unnecessary communication delays, additional communication resource contention, and an increase in register pressure.
In addition, some instructions on spatial architectures have specific spatial requirements. These requirements arise from two sources. First, some load and store instructions must access memory banks on specific clusters, either for correctness or for performance reasons [2, 6, 13]. Second, when a value is live across scheduling regions, its definitions and uses must be mapped to a consistent cluster [14]. We call instructions with these spatial requirements preplaced instructions. A good scheduler must be sensitive to
Figure 1. An example of the tradeoff between parallelism and locality on spatial architectures. Each node color represents a different cluster. Consider an architecture with three clusters, each with one functional unit, where communication takes one cycle of latency due to the "receive" instruction. In (a), conservative partitioning that maximizes locality and minimizes communication leads to an eight-cycle schedule. In (b), aggressive partitioning has high communication requirements and leads to an eight-cycle schedule. The optimal schedule, in (c), takes only seven cycles: it is a careful tradeoff between locality and parallelism.
constraints imposed by preplaced instructions in order to generate a good schedule.
A scheduler also faces difficulties because different heuristics work well for different types of graphs. Figure 2 depicts representative data dependence graphs from two ends of a spectrum. In the graphs, nodes represent instructions and edges represent data dependences between instructions. Graph (a) is typical of graphs seen in non-numeric programs, while graph (b) is representative of graphs coming from applying loop unrolling to numeric programs. Consider the problem of scheduling these graphs onto a spatial architecture. Long, narrow graphs are dominated by a few critical paths. For these graphs, critical-path based heuristics are likely to work well. Fat, parallel graphs have coarse-grained parallelism available and many critical paths. For these graphs it is more important to minimize communication and exploit the coarse-grained parallelism. To perform well for arbitrary graphs, a scheduler may require multiple heuristics in its arsenal.
Traditional scheduling frameworks handle conflicting constraints and heuristics in an ad hoc manner. One approach is to direct all efforts toward the most serious problem. For example, modern RISC superscalars can issue up to four instructions and have tens of registers. Furthermore, most integer programs tend to have little ILP. Therefore, many RISC schedulers
Figure 2. Different data dependence graphs have different characteristics. Some are thin and dominated by a few critical paths (a), while others are fat and parallel (b).
focus on finding ILP and ignore register pressure altogether. Another approach is to address the constraints one at a time in a sequence of phases. This approach, however, introduces phase ordering problems, as decisions made by the early phases are based on partial information and can adversely affect the quality of decisions made by subsequent phases. A third approach is to attempt to address all the problems together. For example, there have been reasonable attempts to perform instruction scheduling and register allocation at the same time [21]. However, extending such frameworks to support preplaced instructions is difficult; no such extension exists today.
This paper presents convergent scheduling, a radical departure from traditional scheduling methods. Convergent scheduling is a general scheduling framework that makes it easy to specify arbitrary constraints and scheduling heuristics. Figure 3 illustrates this framework. A convergent scheduler is composed of independent phases. Each phase implements a heuristic that addresses a particular problem such as ILP or register pressure. Multiple heuristics may address the same problem.

[Figure 3: the convergent scheduler takes a dependence graph, preplaced instruction info, and a machine model and other constraints as input, applies a collection of heuristics (critical path, communication, load balance, ...), and produces a space-time schedule.]

Figure 3. Convergent scheduling framework.
All phases in the convergent scheduler share a common interface. The input or output to each phase is a collection of spatial and temporal preferences of instructions. A phase operates by analyzing the current preferences and modifying them. As the scheduler applies the phases in succession, the preference distribution converges to a final schedule that incorporates the preferences of all the constraints and heuristics. Logically, preferences are specified as a three-input function that maps an instruction, space, and time triple to a weight.
The contributions of this paper are:

• a novel interface between scheduling passes based on weighted preferences;

• a novel approach to address the combined problems of cluster assignment, scheduling, and register pressure;

• the formulation of a set of powerful heuristics to address both general constraints and architecture-specific issues;

• a demonstration of the effectiveness of convergent scheduling compared to traditional techniques.
The rest of this paper is organized as follows. Section 2 introduces convergent scheduling and uses an example to illustrate how it works. Section 3 describes the convergent scheduling interface between passes. Section 4 describes the collection of passes currently implemented in our framework. Section 5 presents experimental results for two architectures: a clustered VLIW and the Raw processor [24]. Section 6 discusses related work. Section 7 concludes.
2 Convergent scheduling
In the convergent scheduling framework, passes communicate their choices as changes in the relative preferences of instructions for clusters and time slots.1 The spatial and temporal preferences of each instruction are represented as weights in a preference map; a pass influences the scheduling of an instruction by changing these weights. When convergent scheduling completes, the slot in the map with the highest weight is designated as the preferred slot, which includes both a preferred cluster and a preferred time. The instruction is assigned to the preferred cluster; the preferred time is used as the instruction priority for list scheduling.
Different heuristics work to improve the schedule in different ways. The critical path strengthening heuristic (PATH), for example, expresses a preference to keep all the instructions in critical paths together in the same cluster. The communication minimization heuristic (COMM) tries to keep dependent neighboring instructions in the same cluster. The preplacement heuristic (PLACE) prefers that preplaced instructions and their neighbors are placed on the clusters selected by the preplaced instructions. The load balance heuristic (LOAD) reduces the preferences on highly loaded clusters and increases them on the less loaded ones. Other heuristics will be introduced in Section 4.
Figure 4 shows how convergent scheduling operates on a small code sequence from fpppp. Initially, the weights are evenly distributed, as shown in (b). We apply the noise introduction heuristic (NOISE) to break symmetry, resulting in (c). This heuristic helps increase parallelism by distributing instructions to different clusters. Then, we run critical path strengthening (PATH), which increases the weight of the instructions in the critical path (i.e., instructions 23, 25, 26, etc.) in the first cluster (d). Then we run the communication minimization (COMM) and the load balance (LOAD) heuristics, resulting in (e). These heuristics lead to several changes: the first few instructions are pushed out of the first cluster, and groups of instructions start to assemble in specific clusters (e.g., instructions 19, 20, 21, and 22 in cluster 3).
Next, we run PLACE and PLACEPROP, which bias instructions using information from preplaced nodes. The result is shown in (f). These passes cause a lot of disturbance: preplaced instructions strongly attract their neighbors to their clusters.
1 In the paper, the following terms will be used interchangeably: phases and passes, tile and cluster, and cycle and time slot.
Figure 4. Convergent scheduling operates on a code sequence from fpppp. Figure (a) shows the data dependence graph representation of the scheduled code. Nodes represent instructions; edges represent dependences between instructions. Triangular nodes are preplaced, with different shades corresponding to different clusters. Figures (b)-(g) show how the convergent schedule is modified by a series of passes. This example only illustrates space scheduling, not time scheduling. Each figure in Figures 4b-g is a cluster preference map. A row represents an instruction. The row numbers correspond to the instruction numbers in (a). A column represents a cluster. The color of each entry represents the level of preference an instruction has for that cluster. The lighter the color, the stronger the preference.
Observe how the group 19-22 is attracted to cluster 4. Finally we run communication minimization (COMM) another time. The final schedule is shown in (g).
Convergent scheduling has the following features:
1. Its scheduling decisions are made cooperatively rather than exclusively.
2. The interface allows a phase to express confidence about its decisions. A phase need not make a poor and unrecoverable decision just because it has to make a decision. On the other hand, any pass can strongly affect the final choice if needed.
3. Convergent scheduling can naturally recover from a temporary wrong decision by one phase. In the example, when we apply noise to (b), most nodes are initially moved away from the first cluster. Subsequently, however, nodes with strong ties to cluster one, such as nodes 1-6, are eventually moved back, while nodes without strong ties, such as node 0, remain away.
4. Most compilers allow only very limited exchange of information among passes. In contrast, the weight-based interface of convergent scheduling is very expressive.
5. The framework allows a heuristic to be applied multiple times, either independently or as part of an iterative process. This feature is useful to provide feedback between phases and to avoid phase ordering problems.
6. The simple interface (preference maps) between passes makes it easy for the compiler writer to handle new constraints or design new heuristics. Phases for different heuristics are written independently, and the expressive, common interface reduces design complexity. This offers an easy way to retarget a compiler and to address peculiarities of the underlying architecture. If, for example, an architecture can exploit auto-increment on memory accesses with a specific instruction, one pass could try to keep memory accesses and increments together, so that the scheduler will find them adjacent and will be able to exploit the advanced instruction.
3 Convergent interface
This section describes the convergent scheduling interface between passes. The passes themselves will be presented in Section 4.
Convergent scheduling operates on individual scheduling units, which may be basic blocks, traces [7], superblocks [10], hyperblocks [19], or treegions [9]. It stores preferences in a three-dimensional matrix W[i,c,t], where i spans over all instructions in the scheduling unit, c spans over the clusters in the architecture, and t spans over time. We allocate as many time slots as the critical-path length (CPL).
Initially, all the weights are distributed evenly. A pass examines the dependence graph and the weight matrix to determine the characteristics of the preferred schedule so far. Then, it expresses its preferences by manipulating the preference map. Passes are not required to perform changes that affect the preferred schedule. If they are indifferent to one or more choices, they can keep the weights the same.
Let i span over instructions, c over clusters, and t over time slots. The following invariants are maintained:

∀ i, t, c : 0 ≤ W[i,t,c] ≤ 1

∀ i : Σ_{t,c} W[i,t,c] = 1
Given an instruction i, we define the following:2

preferred_time(i) = argmax{t : Σ_c W[i,t,c]}

preferred_cluster(i) = argmax{c : Σ_t W[i,t,c]}

runnerup_cluster(i) = argmax{c : Σ_t W[i,t,c], c ≠ preferred_cluster(i)}

confidence(i) = Σ_t W[i,t,preferred_cluster(i)] / Σ_t W[i,t,runnerup_cluster(i)]

2 The function argmax returns the value of the variable that maximizes the expression for a given set of values (while max returns the value of the expression). For instance, max{0 ≤ x ≤ 2 : 10 − x} is 10, and argmax{0 ≤ x ≤ 2 : 10 − x} is 0.
Preferred values are those that maximize the sum of the preferences over time and clusters. The confidence of an instruction measures how confident the convergent scheduler is about its current spatial assignment. It is computed as the ratio of the weights of the top two clusters.
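These definitions are mechanical enough that a short sketch may help. The following is not the authors' implementation; the nested-list layout W[i][t][c] and the example weights are assumptions for illustration.

```python
# Sketch (not the authors' code) of the Section 3 queries over a
# preference map W[i][t][c], stored here as nested Python lists.

def preferred_time(W, i):
    """Time slot t maximizing sum_c W[i][t][c]."""
    return max(range(len(W[i])), key=lambda t: sum(W[i][t]))

def preferred_cluster(W, i):
    """Cluster c maximizing sum_t W[i][t][c]."""
    clusters = range(len(W[i][0]))
    return max(clusters, key=lambda c: sum(row[c] for row in W[i]))

def runnerup_cluster(W, i):
    """Best cluster other than preferred_cluster(i)."""
    best = preferred_cluster(W, i)
    rest = [c for c in range(len(W[i][0])) if c != best]
    return max(rest, key=lambda c: sum(row[c] for row in W[i]))

def confidence(W, i):
    """Ratio of the weights of the top two clusters."""
    top = sum(row[preferred_cluster(W, i)] for row in W[i])
    second = sum(row[runnerup_cluster(W, i)] for row in W[i])
    return top / second

# One instruction, 2 time slots x 3 clusters; weights sum to 1.
W = [[[0.10, 0.05, 0.05],
      [0.50, 0.20, 0.10]]]
print(preferred_time(W, 0))           # slot 1 (row sum 0.80 vs 0.20)
print(preferred_cluster(W, 0))        # cluster 0 (column sum 0.60)
print(round(confidence(W, 0), 2))     # 0.60 / 0.25 = 2.4
```

Note how confidence falls out of the same sums: the example instruction prefers cluster 0 over runner-up cluster 1 by a ratio of 2.4.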
The following basic operations are available on the weights:

• Any weight W[i,t,c] can be increased or decreased by a constant, or multiplied by a factor.
• The matrices of two or more instructions can be combined linearly to produce a matrix of another instruction. Given input instructions i1, i2, ..., in, an output instruction j, and weights for the input instructions w1, ..., wn where Σ_{k∈[1,n]} wk = 1, the linear combination is as follows:

for each (c, t): W[j,t,c] ← Σ_{k∈[1,n]} wk · W[ik,t,c]

Our current implementation uses a simpler form of this operation, with n = 2 and i1 = j:

for each (c, t): W[i1,t,c] ← w1 · W[i1,t,c] + (1 − w1) · W[i2,t,c]

We never perform this full operation because it is expensive. Instead, we only do it on part of the matrices, e.g., only along the space dimension, or only within a small range along the time dimension.
• The system incrementally keeps track of the sums of the weights over both space and time, so that they can be determined in O(1) time. It also memoizes the preferred time and preferred cluster of each instruction.
• The preferences can be normalized to guarantee our invariants; the normalization simply performs:

for each (i, c, t): W[i,t,c] ← W[i,t,c] / Σ_{t,c} W[i,t,c]
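As a concrete sketch of the last two operations, here is the simplified (n = 2, i1 = j) linear combination and the normalization step; the list-of-lists layout and the example weights are illustrative assumptions, not the paper's code.

```python
# Sketch of two weight operations from Section 3: the simplified
# (n = 2, i1 = j) linear combination and normalization.
# Layout W[i][t][c] is an assumption; values are invented.

def combine(W, i1, i2, w1):
    """W[i1] <- w1 * W[i1] + (1 - w1) * W[i2], entry by entry."""
    for t in range(len(W[i1])):
        for c in range(len(W[i1][t])):
            W[i1][t][c] = w1 * W[i1][t][c] + (1.0 - w1) * W[i2][t][c]

def normalize(W, i):
    """Rescale instruction i's weights to sum to 1 (the invariant)."""
    total = sum(sum(row) for row in W[i])
    if total > 0:
        for row in W[i]:
            for c in range(len(row)):
                row[c] /= total

# Two instructions, 2 time slots x 2 clusters.
W = [[[2.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 2.0]]]
normalize(W, 0)          # W[0] -> [[0.5, 0.25], [0.25, 0.0]]
normalize(W, 1)          # W[1] -> [[0.0, 0.25], [0.25, 0.5]]
combine(W, 0, 1, 0.5)    # W[0] -> [[0.25, 0.25], [0.25, 0.25]]
```

Because combine preserves the unit-sum invariant when both inputs are normalized, a pass can blend two instructions' maps without an extra normalization step.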
4 Collection of Heuristics
This section presents a collection of heuristics we have implemented for convergent scheduling. Each heuristic attempts to address a single constraint and only communicates with other heuristics via the weight matrix. There are no restrictions on the order or the number of times each heuristic is applied. Currently, the following parameters are selected by trial and error: the set of heuristics we use, the weights used in the heuristics, and the order in which the heuristics are run. We expect to implement more systematic heuristic selection in the future.
Whenever necessary, we run normalization at the end of every pass to ensure the invariants described in Section 3. For brevity, this step is not included in the descriptions below.
Initial time assignment (INITTIME) Instructions in the middle of the dependence graph cannot be scheduled before their predecessors, nor after their successors. So, if CPL is the length of the critical path, lp is the length of the longest path from the top of the graph (latency of the predecessor chain), and ls is the longest path to any leaf (latency of the successor chain), the instruction can be scheduled only in the time slots between lp and CPL − ls. If an instruction is part of the critical path, only one time slot will be feasible. This pass squashes to zero all the weights outside this range:

for each (i, t < lp ∨ t > CPL − ls, c): W[i,t,c] ← 0

A pass similar to this one can address the fact that some instructions cannot be scheduled on all clusters in some architectures, simply by squashing the weights for the infeasible clusters.
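A minimal sketch of the squashing step, under the assumption that lp[i] and ls[i] hold per-instruction longest-path latencies (the example numbers are invented):

```python
# Sketch of INITTIME. Assumptions: lp[i] is the longest-path latency
# from a root to instruction i, ls[i] from i to a leaf, and CPL is
# the critical-path length.

def inittime(W, lp, ls, CPL):
    for i, mat in enumerate(W):
        for t in range(len(mat)):
            # Slot t is infeasible before i's predecessor chain can
            # finish, or too late for its successor chain to complete.
            if t < lp[i] or t > CPL - ls[i]:
                for c in range(len(mat[t])):
                    mat[t][c] = 0.0

# One instruction, 4 time slots x 2 clusters, uniform weights.
W = [[[0.125, 0.125] for _ in range(4)]]
inittime(W, lp=[1], ls=[2], CPL=3)
# Only t = 1 survives, since the feasible range is 1 <= t <= 3 - 2.
```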
Noise introduction (NOISE) This pass introduces a small amount of noise in the weight distribution. The noise helps break symmetry and spreads instructions around to facilitate scheduling for parallelism.

for each (i, t, c): W[i,t,c] ← W[i,t,c] + rand()/RAND_MAX
Preplacement (PLACE) This pass increases the weight for preplaced instructions to be placed in their home cluster. Since this condition is required for correctness, the weight increase is large. Given a preplaced instruction i, let cp(i) be its preplaced cluster. Then,

for each (i ∈ PREPLACED, t): W[i,t,cp(i)] ← 100 · W[i,t,cp(i)]
Push to first cluster (FIRST) In our clustered VLIW infrastructure, an invariant is that all data are available in the first cluster at the beginning of every scheduling unit. For this architecture, we want to favor schedules that make greater use of the first cluster, where data are already available, over the other clusters, where copies may be needed. We express this preference as follows:

for each (i, t): W[i,t,1] ← 1.2 · W[i,t,1]
Critical path strengthening (PATH) This pass tries to keep all the instructions on a critical path (CP) in the same cluster. If instructions in the path have a bias for a particular cluster, the path is moved to that cluster. Otherwise the least loaded cluster is selected. If different portions of the path have strong biases toward different clusters (e.g., when there are two or more preplaced instructions on the path), the critical path is broken into two or more pieces, each kept close to the relevant home cluster. Let cc(i) be the chosen cluster for the CP.

for each (i ∈ CP, t): W[i,t,cc(i)] ← 3 · W[i,t,cc(i)]
Communication minimization (COMM) This pass reduces communication load by increasing the weight for an instruction to be in the same clusters where most of its neighbors (successors and predecessors in the dependence graph) are. This is done by summing the weights of all the neighbors in a specific cluster, and using that to skew weights in the correct direction.

for each (i, t, c): W[i,t,c] ← W[i,t,c] · Σ_{n ∈ neighbors(i)} W[n,t,c]
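The skew can be sketched as follows; the two-instruction graph and the weights are invented for illustration, and a real pass would follow with normalization.

```python
# Sketch of the COMM skew: every weight W[i][t][c] is scaled by the
# weight that i's dependence-graph neighbors place on cluster c at
# time t.

def comm(W, neighbors):
    return [[[W[i][t][c] * sum(W[n][t][c] for n in neighbors[i])
              for c in range(len(W[i][t]))]
             for t in range(len(W[i]))]
            for i in range(len(W))]

# Two dependent instructions, 1 time slot x 2 clusters.
W = [[[0.5, 0.5]],   # instruction 0: undecided
     [[0.9, 0.1]]]   # instruction 1: leaning toward cluster 0
W = comm(W, neighbors={0: [1], 1: [0]})
# Instruction 0 is now skewed toward cluster 0 (0.45 vs 0.05);
# a normalization pass would rescale each instruction to sum to 1.
```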
We have also implemented a variant of this pass that considers grandparents and grandchildren, and we usually run it together with COMM.

for each i: W[i,ti,ci] ← 2 · W[i,ti,ci]
Preplacement propagation (PLACEPROP) This pass propagates preplacement information to all instructions. For each non-preplaced instruction i, we divide its weight for each cluster c by its distance to the closest preplaced instruction in c. Let dist(i, c) be this distance. Then,

for each (i ∉ PREPLACED, t, c): W[i,t,c] ← W[i,t,c] / dist(i, c)
Load balance (LOAD) This pass performs load balancing across clusters. Each weight on a cluster is divided by the total load on that cluster:

for each (i, t, c): W[i,t,c] ← W[i,t,c] / load(c)
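A sketch of the division; note that the paper does not define load(c) precisely, so treating it as the total preference weight currently sitting on cluster c is an assumption of this example.

```python
# Sketch of the LOAD division. load(c) is assumed (not specified in
# the text) to be the summed preference all instructions place on
# cluster c.

def load(W, c):
    return sum(W[i][t][c] for i in range(len(W)) for t in range(len(W[i])))

def load_balance(W):
    loads = [load(W, c) for c in range(len(W[0][0]))]
    for mat in W:
        for row in mat:
            for c in range(len(row)):
                if loads[c] > 0:
                    row[c] /= loads[c]

# Two instructions crowding cluster 0 (1 time slot x 2 clusters).
W = [[[0.8, 0.2]], [[0.8, 0.2]]]
load_balance(W)
# The heavily loaded cluster 0 (load 1.6) is penalized more than
# cluster 1 (load 0.4); both rows end up at [0.5, 0.5].
```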
Level distribute (LEVEL) This pass distributes instructions at the same level across clusters. Given an instruction i, we define level(i) to be its distance from the furthest root. Level distribution has two goals. The primary goal is to distribute parallelism across clusters. The second goal is to minimize potential communication. To this end, the pass tries to distribute instructions that are far apart, while keeping together instructions that are near each other.
To perform the dual goals of instruction distribution without excessive communication, instructions on a level are partitioned into bins. Initially, the bin Bc for each cluster c contains instructions whose preferred cluster is c, and whose confidence is greater than a threshold (2.0). Then, we perform the following:
LevelDistribute(int l, int g):
  Il = {i | level(i) = l}
  foreach cluster c: Il = Il − Bc
  Ig = {i | i ∈ Il, distance(i, find_closest_bin(i)) > g}
  while Il ≠ ∅:
    B = round_robin_next_bin()
    iclosest = argmax{i ∈ Ig : distance(i, B)}
    B = B ∪ {iclosest}
    Il = Il − {iclosest}
    update Ig
The parameter g controls the minimum distance granularity at which we distribute instructions across bins. The distance between an instruction i and a bin B is the minimum distance between i and any instruction in B.
LEVEL can be applied multiple times to different levels. Currently we apply it every four levels on Raw. The four levels correspond approximately to the minimum granularity of parallelism that Raw can profitably exploit given its communication cost.
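The bin-filling loop above can be sketched as follows, deliberately simplified: every instruction is eligible (g = 0), distance() is a stub, and the confidence-based seeding of bins is omitted; a real pass would measure distance over the dependence graph.

```python
# Rough sketch of the round-robin bin filling in LEVEL, with a stub
# distance function and no confidence-based seeding (simplifying
# assumptions, not the paper's algorithm verbatim).

from itertools import cycle

def distance(i, bin_):
    # Stub: distance from i to the closest instruction already in bin_.
    return 1 if bin_ else 0

def level_distribute(level_insts, n_clusters):
    bins = [[] for _ in range(n_clusters)]
    remaining = list(level_insts)
    for b in cycle(bins):            # round-robin over the bins
        if not remaining:
            break
        # Add the remaining instruction farthest from this bin, so
        # nearby instructions are not scattered across clusters.
        pick = max(remaining, key=lambda i: distance(i, b))
        b.append(pick)
        remaining.remove(pick)
    return bins

bins = level_distribute([10, 11, 12, 13], n_clusters=2)
# With the stub distance, instructions alternate: [[10, 12], [11, 13]].
```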
Path propagation (PATHPROP) This pass selects high-confidence instructions and propagates their convergent matrices along a path. The confidence threshold t is an input parameter. Let ih be the selected confident instruction. The following propagates ih along a downward path:
find i | i ∈ successors(ih), confidence(i) < confidence(ih)
while i ≠ nil:
  for each (c, t): W[i,t,c] ← 0.5 · W[i,t,c] + 0.5 · W[ih,t,c]
  find in | in ∈ successors(i), confidence(in) < confidence(ih)
  i ← in
A similar function that visits predecessors propagates ih along an upward path.
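The downward walk can be sketched as below; the successor map, confidence values, and weights are invented for illustration.

```python
# Sketch of PATHPROP's downward walk: a high-confidence instruction ih
# averages its preference map into lower-confidence successors along
# one path.

def propagate_down(W, ih, successors, conf):
    i = next((s for s in successors[ih] if conf[s] < conf[ih]), None)
    while i is not None:
        for t in range(len(W[i])):
            for c in range(len(W[i][t])):
                W[i][t][c] = 0.5 * W[i][t][c] + 0.5 * W[ih][t][c]
        i = next((s for s in successors[i] if conf[s] < conf[ih]), None)

W = [[[0.8, 0.2]],   # ih = 0: confident about cluster 0
     [[0.4, 0.6]]]   # 1: weakly prefers cluster 1
propagate_down(W, 0, successors={0: [1], 1: []}, conf={0: 4.0, 1: 1.5})
# W[1][0] is pulled toward ih's cluster: approximately [0.6, 0.4].
```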
Emphasize critical path distance (EMPHCP) This pass attempts to help the convergence of the time information by emphasizing the level of each instruction. The level of an instruction is a good time approximation because it is when the instruction can be scheduled if a machine has infinite resources.

for each (i, c): W[i,level(i),c] ← 1.2 · W[i,level(i),c]
5 Results
We have implemented convergent scheduling in two systems: the Raw architecture [24] and the Chorus clustered VLIW infrastructure developed at MIT [20].
Figure 5. The Raw machine.
Experimental environment Figure 5 shows a picture of the Raw machine. The Raw machine comprises tiles organized in a two-dimensional mesh. The actual Raw prototype has 16 tiles in a 4x4 mesh. Each tile has its own instruction memory, data memory, registers, processor pipeline, and ALUs. Its instruction set is based on the MIPS R4000. The tiles are connected via point-to-point mesh networks. In addition to a traditional wormhole-routed dynamic network, Raw has a programmable, compiler-controlled static network that can be used to route scalar values between the register files/ALUs on different tiles (for details, please refer to [14]). To reduce communication overhead, the static network ports are register-mapped. Latency on the static network is three cycles between two neighboring tiles; each additional hop takes one extra cycle of latency.
Rawcc, the Raw compiler, takes a sequential C or Fortran program and parallelizes it across Raw tiles. It is built on top of the Machsuif intermediate representation [18]. Rawcc divides each input program into one or more scheduling traces. For each trace, Rawcc constructs the data precedence graph and performs space-time scheduling on each graph. Then, it applies a traditional register allocator to the code on each tile.
The Chorus clustered VLIW system is a flexible compiler/simulator environment that can simulate many different configurations of clustered VLIW machines. We use it to simulate a clustered VLIW machine with four identical clusters. Each cluster has four function units: one integer ALU, one integer ALU/memory unit, one floating point unit, and one transfer unit. Instruction latencies are based on the MIPS R4000. The transfer unit moves values between register files on different clusters. It takes one cycle to copy a register value from one cluster to another. Memory addresses are interleaved across clusters for maximum parallelism. Memory operations can request remote data, with a penalty of one cycle.
The Chorus compiler shares with Rawcc the same high-level structure. Like Rawcc, it is implemented on top of Machsuif. It first performs space-time scheduling, followed by traditional single-cluster register allocation [8].

Table 1. Sequence of heuristics used by the convergent scheduler for (a) the Raw machine and (b) clustered VLIW.

(a) Raw: INITTIME, PLACEPROP, LOAD, PLACE, PATH, PATHPROP, LEVEL, PATHPROP, COMM, PATHPROP, EMPHCP

(b) Clustered VLIW: INITTIME, NOISE, FIRST, PATH, COMM, PLACE, PLACEPROP, COMM, EMPHCP
Both Rawcc and the Chorus compiler employ congruence transformation and analysis to increase and analyze the predictability of memory references [13]. This analysis creates preplaced memory reference instructions that must be placed on specific tiles or clusters. For dense matrix loops, the congruence pass usually unrolls the loops by the number of clusters or tiles. This unrolling also increases the size of the scheduling regions, so that no additional unrolling is necessary to expose parallelism.
In both compilers, when a value is live across multiple scheduling regions, its definitions and uses must be mapped to a consistent cluster. On Rawcc, this cluster is the cluster of the first definition/use encountered by the compiler; subsequent definitions and uses become preplaced instructions.3 On Chorus, all values that are live across multiple scheduling regions are mapped to the first cluster.
Convergent schedulers Table 1 lists the heuristics used by the convergent scheduler for Raw and Chorus. The heuristics are run in the order given.
The convergent scheduler interfaces with the existing schedulers as follows. The output of the convergent scheduler is split into two parts:

1. A map describing the preferred partition, i.e., an assignment of every instruction to a specific cluster.

2. The temporal assignment of each instruction.
3 Rawcc does use SSA renaming to eliminate false dependences, which in turn reduces these preplacement constraints.
Figure 6. Performance comparisons between Rawcc and convergent scheduling on a 16-tile Raw machine.
Both Chorus and Rawcc use the spatial assignments given by the convergent scheduler. Chorus uses the temporal assignments as priorities for the list scheduler. For Rawcc, however, the temporal assignments are computed independently by its own instruction scheduler.
Benchmarks Our sources of benchmarks include the Raw benchmark suite (jacobi, life) [1], Nasa7 of Spec92 (cholesky, vpenta, and mxm), and Spec95 (tomcatv, fpppp-kernel). Fpppp-kernel is the inner loop of fpppp, which consumes 50% of its run-time. Sha is an implementation of the Secure Hash Algorithm. Fir is a FIR filter. Rbsorf is a Red-Black SOR relaxation. Vvmul is a simple matrix multiplication. Yuv does RGB to YUV color conversion. Some problem sizes have been changed to cut down on simulation time, but they do not affect the results qualitatively.
Performance comparisons We compared our results with the baseline Rawcc and Chorus compilers. Table 2 compares the performance of convergent scheduling to Rawcc for two to 16 tiles. Figure 6 plots the same data for 16 tiles. Results show that convergent scheduling consistently outperforms baseline Rawcc for all tile configurations for most of the benchmarks, with an average improvement of 21% for 16 tiles.
Many of our benchmarks are dense matrix codes with preplaced memory instructions from congruence analysis. For these benchmarks, convergent scheduling always outperforms baseline Rawcc. The reason is that convergent scheduling is able to actively take advantage
                       Base                     Convergent
Benchmark/Tiles    2     4     8     16      2     4     8     16
cholesky         1.14  2.21  3.29  4.33    1.44  2.75  4.94   7.06
tomcatv          1.18  1.83  2.88  3.94    1.37  2.12  3.33   5.15
vpenta           1.86  2.85  4.58  8.03    1.96  3.23  5.82   9.71
mxm              1.77  2.40  3.78  7.09    1.89  2.54  4.04   7.77
fpppp            1.54  3.09  5.13  6.76    1.42  2.04  3.87   5.39
sha              1.11  2.05  1.96  2.29    1.05  1.33  1.51   1.45
swim             1.40  2.04  3.62  6.23    1.63  2.69  4.24   8.30
jacobi           1.33  2.43  4.13  6.39    1.40  2.74  4.92   9.30
life             1.65  3.02  5.56  8.48    1.76  3.35  6.34  11.97
Table 2. Rawcc speedup. Speedup is relative to performance on one tile.
[Line plot: percentage of instructions whose preferred tiles have changed after each pass (PLACEPROP, LOAD, PLACE, PATH, PATHPROP, LEVEL, PATHPROP, COMM, PATHPROP), for cholesky, tomcatv, vpenta, mxm, fpppp-kernel, sha, swim, jacobi, and life.]
Figure 7. Convergence of spatial assignments on Raw.
of preplacement information to guide the placement of other instructions. This information turns out to lead to very good natural assignments of instructions.
For fpppp-kernel and sha, convergent scheduling performs worse than baseline Rawcc because preplaced instructions do not suggest many good assignments. Attaining good speedup on these benchmarks requires finding and exploiting very fine-grained parallelism. Our level distribution pass has been less efficient in this regard than the clustering counterpart in Rawcc; we expect that integrating a clustering pass into convergent scheduling will address this problem.
Figure 7 shows the percentage of instructions whose preferred tiles are changed by each convergent pass on Raw. The plots measure static instruction counts, and they exclude passes that only modify temporal preferences. For benchmarks with useful preplacement information, the convergent scheduler is able to converge to good solutions quickly by propagating the preplacement information and using the load balancing heuristic. In contrast, preplacement provides little useful information for fpppp-kernel and sha. These benchmarks thus require the critical-path, parallelism, and communication heuristics to converge to good assignments.
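The convergence metric of Figure 7 is straightforward to compute: around each pass, snapshot every instruction's top spatial preference and count how many moved. A sketch, assuming preferences are summarized as a preferred-tile map:

```python
def fraction_changed(before, after):
    """Fraction of static instructions whose preferred tile differs
    across one convergent pass.

    before, after: {instr: preferred_tile} snapshots taken around the pass.
    """
    changed = sum(1 for i in before if before[i] != after[i])
    return changed / len(before)
```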
Figure 8 compares the performance of convergent scheduling to two existing assignment/scheduling techniques for clustered VLIW: UAS [23] and PCC [5]. We
[Bar chart: speedup (0 to 4) for vvmul, rbsorf, yuv, tomcatv, mxm, fir, and cholesky under PCC, UAS, and Convergent scheduling.]
Figure 8. Performance comparisons between PCC, UAS, and Convergent scheduling on a four-clustered VLIW. Speedup is relative to a single-cluster machine.
[Line plot: percentage of instructions whose preferred clusters have changed after each pass (NOISE, FIRST, PATH, COMM, PLACE+PLACEPROP, COMM), for cholesky, fir, mxm, tomcatv, vvmul, and yuv.]
Figure 9. Convergence of spatial assignments on Chorus.
[Log-scale plot: scheduling time in seconds versus number of instructions scheduled, for PCC, UAS, and Convergent.]
Figure 10. Comparison of compile-time vs. input size for algorithms on Chorus.
augment each existing algorithm with preplacement information. For UAS, we modify the CPSC heuristic described in the original paper to give the highest priority to the home cluster of preplaced instructions. For PCC, the algorithm for estimating schedule lengths and communication costs properly accounts for preplacement information, by modeling the extra costs incurred by the clustered VLIW machine for a non-local memory access. Convergent scheduling outperforms UAS and PCC by 14% and 28%, respectively, on a four-clustered VLIW machine. As on Raw, the convergent scheduler is able to use preplacement information to find good natural partitions for our dense matrix benchmarks. Figure 9 shows the percentage of static instructions whose preferred clusters are changed by each convergent pass on Chorus. Passes that only modify temporal preferences are excluded.
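The UAS modification amounts to reordering the candidate clusters tried for each instruction. A minimal sketch (Python; the function name, the `home` map, and the per-cluster completion-time estimate are assumptions standing in for the CPSC heuristic, not the actual UAS code):

```python
def cluster_priority(instr, clusters, home, est_completion):
    """Order candidate clusters for one instruction, UAS-style.

    Preplaced instructions (those with an entry in `home`) get their home
    cluster first; all other clusters, and all clusters for ordinary
    instructions, are tried by earliest estimated completion time.
    """
    if home.get(instr) is not None:
        pinned = home[instr]
        rest = sorted((c for c in clusters if c != pinned),
                      key=lambda c: est_completion[c])
        return [pinned] + rest            # home cluster always wins
    return sorted(clusters, key=lambda c: est_completion[c])
```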
Compile-time scalability We examine the scalability of convergent scheduling. Scalability is important because there is an increasing need for instruction assignment and scheduling algorithms to handle larger and larger numbers of instructions. This need arises for two reasons: first, because of improvements in compiler representation techniques such as hyperblocks and treegions; and second, because a large scope is necessary to support the level of ILP provided by modern and future microprocessors.
Figure 10 compares the compile-time of convergent scheduling with that of UAS and PCC on Chorus. Both convergent scheduling and PCC use an independent list scheduler after instruction assignment; our measurements include time spent in the scheduler. The figure shows that convergent scheduling and UAS take about the same amount of time. They both scale considerably better than PCC. We note that PCC is highly sensitive to the number of components it initially divides the instructions into. Compile-time can be dramatically reduced if the number of components is kept small. However, we find that for our benchmarks, reducing the number of components also results in much poorer assignment quality.
6 Related work
Spatial architectures require cluster assignment, scheduling, and register allocation. We have provided a general framework that can perform all three tasks together (by adding preference maps for registers as well), but the focus of this paper is the application of convergent scheduling to cluster assignment.
Many compilers for spatial architectures address the three problems separately. Much research has focused on novel ways to do cluster assignment, coupled with traditional list scheduling and register allocation methods. The pioneering work in cluster assignment is BUG [6]. BUG uses a two-phase algorithm. First, the algorithm traverses a dependence graph bottom-up to propagate information about preplaced instructions. Then, it traverses the graph top-down and greedily maps each instruction to the cluster that can execute it earliest. The Multiflow compiler uses a variant of BUG [17], without the support for preplaced instructions. PCC is an iterative assignment approach based on partial components [5]. It builds partial components by visiting the data dependence graph bottom-up, critical-path first. The maximum size of a component is capped by a parameter, θth. It uses simple heuristics to select a value for θth that balances the tradeoff between performance and compile-time, although the exact method is not explained. The components are initially assigned to clusters based on simple load balancing and communication criteria. The assignments are subsequently improved through iterative descent, by checking whether moving a sub-component to another cluster improves the schedule. Rawcc leverages techniques developed for multiprocessor task graph scheduling [14]. Assignment is performed in three steps: clustering groups together instructions that have little parallelism; merging reduces the number of clusters; placement maps clusters to tiles. During placement, Rawcc also handles constraints from preplaced instructions.
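The two-phase shape of BUG can be made concrete with a minimal Python sketch (not the actual Bulldog code; the DAG encoding, `NUM_CLUSTERS`, and the `earliest` cost model are assumptions): preplacement hints flow bottom-up, then a greedy top-down pass picks the earliest cluster.

```python
NUM_CLUSTERS = 4  # assumed machine size for the sketch

def topo_order(succs):
    """Topological order of a DAG given as {instr: [successors]}."""
    indeg = {i: 0 for i in succs}
    for ss in succs.values():
        for s in ss:
            indeg[s] += 1
    order = [i for i, d in indeg.items() if d == 0]
    for i in order:                      # list grows while we iterate
        for s in succs[i]:
            indeg[s] -= 1
            if indeg[s] == 0:
                order.append(s)
    return order

def bug_assign(succs, preplaced, earliest):
    """Two-phase BUG-style cluster assignment (hypothetical sketch).

    Phase 1 walks the DAG bottom-up, propagating the clusters of
    preplaced instructions upward as hints. Phase 2 walks top-down and
    greedily picks, among hinted clusters (or all clusters when there is
    no hint), the one that can execute the instruction earliest.
    """
    hints = {}
    for i in reversed(topo_order(succs)):             # bottom-up pass
        hint = {preplaced[i]} if i in preplaced else set()
        for s in succs[i]:
            hint |= hints[s]
        hints[i] = hint
    assign = {}
    for i in topo_order(succs):                       # top-down pass
        if i in preplaced:
            assign[i] = preplaced[i]
        else:
            cands = hints[i] or set(range(NUM_CLUSTERS))
            assign[i] = min(cands, key=lambda c: earliest(i, c, assign))
    return assign
```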
In [3], the approach differs from the above approaches in the ordering of phases. It performs scheduling before assignment. The assignment phase uses a min-cut algorithm adapted from circuit partitioning that tries to minimize communication. This algorithm, however, does not directly attempt to optimize the execution length of input DAGs.
To avoid phase ordering problems, recent works have proposed combined solutions. Leupers describes an iterative combined approach to perform scheduling and partitioning on a VLIW DSP [16]. The approach is based on simulated annealing. UAS performs assignment and scheduling together by integrating assignment into a cycle-driven list scheduler [23]. CARS performs all three tasks (assignment, scheduling, and register allocation) in one step, by integrating both assignment and register allocation into a mostly cycle-driven list scheduler [11]. In all these approaches, however, every decision is irrevocable and final. In contrast, convergent scheduling provides a general framework that allows decisions to be postponed or reversed, and we have used it to address the phase ordering problem we found in cluster assignment when we have preplaced instructions.
Few assignment approaches are designed to take into account preplaced instructions. Of the above, only BUG and Rawcc directly support them.
Phase ordering on clustered architectures is a relatively new problem area; a more classical phase ordering problem occurs in scalar optimizations. Some approaches in those areas share similar goals and features with convergent scheduling. Cooper uses a genetic-algorithm solution to evolve the order of passes to optimize code size [4]. The approach finds good general solutions, and it performs even better when the evolution is applied independently on each benchmark. Lerner proposes an interesting interface to different passes based on graph replacement [15]. His approach enables independently designed dataflow passes to be composed and run together. The composed pass is able to achieve the precision of iterating independent passes, but without the compile-time cost of iteration.
7 Conclusion
This paper introduces convergent scheduling, a flexible framework for performing instruction assignment and scheduling. Its interface between passes is simple and expressive. The simplicity makes it easy for the compiler developer to handle new constraints or integrate new heuristics. It also allows passes to be executed multiple times or even iteratively, which helps alleviate phase ordering problems. The expressiveness allows passes to specify confidence about their decisions, thus avoiding poor irreversible decisions.
This paper presents evidence that convergent scheduling is a promising framework. We have implemented it on two spatial architectures, Raw and clustered VLIW, and demonstrated performance improvements of 21% and 14%, respectively. One reason for these performance gains comes from the ability of the convergent scheduler to support, in an integrated manner, both traditional constraints (parallelism, locality, and communication) and a new constraint (preplaced instructions).
References
[1] J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, and A. Agarwal. The RAW Benchmark Suite: Computation Structures for General Purpose Computing. In 5th Symposium on FPGA-Based Custom Computing Machines (FCCM), pages 134-143, 1997.

[2] R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal. Maps: A Compiler-Managed Memory System for Raw Machines. In 26th International Symposium on Computer Architecture (ISCA), pages 4-15, 1999.
[3] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned Register Files for VLIWs: A Preliminary Analysis of Tradeoffs. In 25th International Symposium on Microarchitecture (MICRO), pages 292-300, 1992.

[4] K. D. Cooper, P. J. Schielke, and D. Subramanian. Optimizing for Reduced Code Space using Genetic Algorithms. In Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES), pages 1-9, 1999.

[5] G. Desoli. Instruction Assignment for Clustered VLIW DSP Compilers: A New Approach. Technical Report HPL-98-13, Hewlett Packard Laboratories, 1998.

[6] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, 1986.

[7] J. A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers, C-30(7):478-490, July 1981.

[8] L. George and A. W. Appel. Iterated Register Coalescing. ACM Transactions on Programming Languages and Systems, 18(3):300-324, 1996.

[9] W. A. Havanki, S. Banerjia, and T. M. Conte. Treegion Scheduling for Wide Issue Processors. In 4th International Symposium on High Performance Computer Architecture (HPCA), pages 266-276, 1998.

[10] W. Hwu, S. Mahlke, W. Chen, P. Chang, N. Warter, R. Bringmann, R. Ouellette, R. Hank, T. Kiyohara, G. Haab, J. Holm, and D. Lavery. The Superblock: An Effective Technique for VLIW and Superscalar Compilation. The Journal of Supercomputing, 7(1):229-248, Jan 1993.

[11] K. Kailas, K. Ebcioglu, and A. K. Agrawala. CARS: A New Code Generation Framework for Clustered ILP Processors. In 7th International Symposium on High Performance Computer Architecture (HPCA), pages 133-143, 2001.

[12] H.-S. Kim and J. E. Smith. An Instruction Set and Microarchitecture for Instruction Level Distributed Processing. In 29th International Symposium on Computer Architecture (ISCA), pages 71-81, 2002.

[13] S. Larsen and S. Amarasinghe. Increasing and Detecting Memory Address Congruence. In 11th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2002.

[14] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine. In 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 46-57, 1998.

[15] S. Lerner, D. Grove, and C. Chambers. Composing Dataflow Analyses and Transformations. In 29th Symposium on Principles of Programming Languages (POPL), pages 270-282, 2002.

[16] R. Leupers. Instruction Scheduling for Clustered VLIW DSPs. In 9th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 291-300, 2000.

[17] P. Lowney, S. Freudenberger, T. Karzes, W. Lichtenstein, R. Nix, J. O'Donnell, and J. Ruttenberg. The Multiflow Trace Scheduling Compiler. In Journal of Supercomputing, pages 51-142, 1993.

[18] Machsuif. http://www.eecs.harvard.edu/hube.

[19] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective Compiler Support for Predicated Execution Using the Hyperblock. In 25th Annual International Symposium on Microarchitecture (MICRO), pages 45-54, 1992.

[20] D. Maze. Compilation Infrastructure for VLIW Machines. Master's thesis, Massachusetts Institute of Technology, September 2001.

[21] R. Motwani, K. V. Palem, V. Sarkar, and S. Reyen. Combining Register Allocation and Instruction Scheduling. Technical Report CS-TN-95-22, 1995.

[22] R. Nagarajan, K. Sankaralingam, D. Burger, and S. Keckler. A Design Space Evaluation of Grid Processor Architectures. In 34th International Symposium on Microarchitecture (MICRO), pages 40-51, 2001.

[23] E. Ozer, S. Banerjia, and T. M. Conte. Unified Assign and Schedule: A New Approach to Scheduling for Clustered Register File Microarchitectures. In 31st International Symposium on Microarchitecture (MICRO), pages 308-315, 1998.

[24] M. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, J.-W. Lee, P. Johnson, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs. IEEE Micro, pages 25-35, March/April 2002.