To appear in MICRO-35, November 2002, Istanbul, Turkey
Convergent Scheduling
Walter Lee, Diego Puppin, Shane Swenson, Saman Amarasinghe
Massachusetts Institute of Technology
{walt, diego, sswenson, [email protected]}
Abstract
Convergent scheduling is a general framework for cluster assignment and instruction scheduling on spatial architectures. A convergent scheduler is composed of independent passes, each implementing a heuristic that addresses a particular problem or constraint. The passes share a simple, common interface that provides spatial and temporal preference for each instruction. Preferences are not absolute; instead, the interface allows a pass to express the confidence of its preferences, as well as preferences for multiple space and time slots. A pass operates by modifying these preferences. By applying a series of passes that address all the relevant constraints, the convergent scheduler can produce a schedule that satisfies all the important constraints. Because all passes are independent and need to understand only one interface to interact with each other, convergent scheduling simplifies the problem of handling multiple constraints and co-developing different heuristics. We have applied convergent scheduling to two spatial architectures: the Raw processor and a clustered VLIW machine. It is able to successfully handle traditional constraints such as parallelism, load balancing, and communication minimization, as well as constraints due to preplaced instructions, which are instructions with predetermined cluster assignment. Convergent scheduling is able to obtain an average performance improvement of 21% over the existing space-time scheduler of the Raw processor, and an improvement of 14% over state-of-the-art assignment and scheduling techniques on a clustered VLIW architecture.
1 Introduction
Instruction scheduling on microprocessors is becoming a more and more difficult problem. In almost all practical instances, it is NP-complete, and it often faces multiple contradictory constraints. For superscalars or VLIWs, the two primary issues are parallelism and register pressure. Code sequences that expose more instruction-level parallelism (ILP) also have longer live ranges and higher register pressure. To generate good schedules, the instruction scheduler must somehow exploit as much ILP as possible without leading to a large number of register spills.
On spatial architectures, instruction scheduling is even more complicated. Examples of spatial architectures include clustered VLIWs, Raw [24], Grid processors [22], and ILDPs [12]. Spatial architectures are architectures that distribute their computing resources. Communication between distant resources can incur one or more cycles of delay. On these architectures, the instruction scheduler has to partition instructions across the computing resources. Thus, instruction scheduling becomes both a spatial problem and a temporal problem.
To make partitioning decisions, the scheduler has to understand the proper tradeoff between parallelism and locality. Figure 1 shows an example of this tradeoff. Spatial scheduling by itself is already a more difficult problem than temporal scheduling, because a small spatial mistake is generally more costly than a small temporal mistake. If a critical instruction is scheduled one cycle later than desired, only one cycle is lost. But if a critical instruction is scheduled one unit of distance farther away than desired, cycles can be lost from unnecessary communication delays, additional communication resource contention, and an increase in register pressure.
In addition, some instructions on spatial architectures have specific spatial requirements. These requirements arise from two sources. First, some load and store instructions must access memory banks on specific clusters, either for correctness or for performance reasons [2, 6, 13]. Second, when a value is live across scheduling regions, its definitions and uses must be mapped to a consistent cluster [14]. We call instructions with these spatial requirements preplaced instructions. A good scheduler must be sensitive to
Figure 1. An example of the tradeoff between parallelism and locality on spatial architectures. Each node color represents a different cluster. Consider an architecture with three clusters, each with one functional unit, where communication takes one cycle of latency due to the "receive" instruction. In (a), conservative partitioning that maximizes locality and minimizes communication leads to an eight-cycle schedule. In (b), aggressive partitioning has high communication requirements and leads to an eight-cycle schedule. The optimal schedule, in (c), takes only seven cycles: it is a careful tradeoff between locality and parallelism.
constraints imposed by preplaced instructions in order to generate a good schedule.
A scheduler also faces difficulties because different heuristics work well for different types of graphs. Figure 2 depicts representative data dependence graphs from two ends of a spectrum. In the graphs, nodes represent instructions and edges represent data dependences between instructions. Graph (a) is typical of graphs seen in non-numeric programs, while graph (b) is representative of graphs coming from applying loop unrolling to numeric programs. Consider the problem of scheduling these graphs onto a spatial architecture. Long, narrow graphs are dominated by a few critical paths. For these graphs, critical-path based heuristics are likely to work well. Fat, parallel graphs have coarse-grained parallelism available and many critical paths. For these graphs it is more important to minimize communication and exploit the coarse-grained parallelism. To perform well for arbitrary graphs, a scheduler may require multiple heuristics in its arsenal.
Traditional scheduling frameworks handle conflicting constraints and heuristics in an ad hoc manner. One approach is to direct all efforts toward the most serious problem. For example, modern RISC superscalars can issue up to four instructions and have tens of registers. Furthermore, most integer programs tend to have little ILP. Therefore, many RISC schedulers
Figure 2. Different data dependence graphs have different characteristics. Some are thin and dominated by a few critical paths (a), while others are fat and parallel (b).
focus on finding ILP and ignore register pressure altogether. Another approach is to address the constraints one at a time in a sequence of phases. This approach, however, introduces phase ordering problems, as decisions made by the early phases are based on partial information and can adversely affect the quality of decisions made by subsequent phases. A third approach is to attempt to address all the problems together. For example, there have been reasonable attempts to perform instruction scheduling and register allocation at the same time [21]. However, extending such frameworks to support preplaced instructions is difficult; no such extension exists today.
This paper presents convergent scheduling, a radical departure from traditional scheduling methods. Convergent scheduling is a general scheduling framework that makes it easy to specify arbitrary constraints and scheduling heuristics. Figure 3 illustrates this framework. A convergent scheduler is composed of independent phases. Each phase implements a heuristic that addresses a particular problem such as ILP or register pressure. Multiple heuristics may address the same problem.

[Figure 3: the convergent scheduler takes a dependence graph, preplaced instruction info, and a machine model and other constraints as input, applies a collection of heuristics (critical path, communication, load balance, ...), and produces a space-time schedule.]

Figure 3. Convergent scheduling framework.
All phases in the convergent scheduler share a common interface. The input or output to each phase is a collection of spatial and temporal preferences of instructions. A phase operates by analyzing the current preferences and modifying them. As the scheduler applies the phases in succession, the preference distribution converges to a final schedule that incorporates the preferences of all the constraints and heuristics. Logically, preferences are specified as a three-input function that maps an instruction, space, and time triple to a weight.
The contributions of this paper are:

• a novel interface between scheduling passes based on weighted preferences;

• a novel approach to address the combined problems of cluster assignment, scheduling, and register pressure;

• the formulation of a set of powerful heuristics to address both general constraints and architecture-specific issues;

• a demonstration of the effectiveness of convergent scheduling compared to traditional techniques.
The rest of this paper is organized as follows. Section 2 introduces convergent scheduling and uses an example to illustrate how it works. Section 3 describes the convergent scheduling interface between passes. Section 4 describes the collection of passes currently implemented in our framework. Section 5 presents experimental results for two architectures: a clustered VLIW and the Raw processor [24]. Section 6 discusses related work. Section 7 concludes.
2 Convergent scheduling
In the convergent scheduling framework, passes communicate their choices as changes in the relative preferences of instructions for clusters and time slots.1 The spatial and temporal preferences of each instruction are represented as weights in a preference map; a pass influences the scheduling of an instruction by changing these weights. When convergent scheduling completes, the slot in the map with the highest weight is designated as the preferred slot, which includes both a preferred cluster and a preferred time. The instruction is assigned to the preferred cluster; the preferred time is used as the instruction priority for list scheduling.
Different heuristics work to improve the schedule in different ways. The critical path strengthening heuristic (PATH), for example, expresses a preference to keep all the instructions in critical paths together in the same cluster. The communication minimization heuristic (COMM) tries to keep dependent neighboring instructions in the same cluster. The preplacement heuristic (PLACE) prefers that preplaced instructions and their neighbors are placed on the clusters selected by the preplaced instructions. The load balance heuristic (LOAD) reduces the preferences on highly loaded clusters and increases them on the less loaded ones. Other heuristics will be introduced in Section 4.
Figure 4 shows how convergent scheduling operates on a small code sequence from fpppp. Initially, the weights are evenly distributed, as shown in (b). We apply the noise introduction heuristic (NOISE) to break symmetry, resulting in (c). This heuristic helps increase parallelism by distributing instructions to different clusters. Then, we run critical path strengthening (PATH), which increases the weight of the instructions in the critical path (i.e., instructions 23, 25, 26, etc.) in the first cluster (d). Then we run the communication minimization (COMM) and the load balance (LOAD) heuristics, resulting in (e). These heuristics lead to several changes: the first few instructions are pushed out of the first cluster, and groups of instructions start to assemble in specific clusters (e.g., instructions 19, 20, 21, and 22 in cluster 3).
Next, we run PLACE and PLACEPROP, which bias instructions using information from preplaced nodes. The result is shown in (f). These passes cause a lot of disturbance: preplaced instructions strongly attract their neighbors to their clusters.
1 In the paper, the following terms will be used interchangeably: phases and passes, tile and cluster, and cycle and time slot.
Figure 4. Convergent scheduling operates on a code sequence from fpppp. Figure (a) shows the data dependence graph representation of the scheduled code. Nodes represent instructions; edges represent dependences between instructions. Triangular nodes are preplaced, with different shades corresponding to different clusters. Figures (b)-(g) show how the convergent schedule is modified by a series of passes. This example only illustrates space scheduling, not time scheduling. Each figure in Figures 4b-g is a cluster preference map. A row represents an instruction. The row numbers correspond to the instruction numbers in (a). A column represents a cluster. The color of each entry represents the level of preference an instruction has for that cluster. The lighter the color, the stronger the preference.
Observe how the group 19-22 is attracted to cluster 4. Finally we run communication minimization (COMM) another time. The final schedule is shown in (g).
Convergent scheduling has the following features:
1. Its scheduling decisions are made cooperatively rather than exclusively.
2. The interface allows a phase to express confidence about its decisions. A phase need not make a poor and unrecoverable decision just because it has to make a decision. On the other hand, any pass can strongly affect the final choice if needed.
3. Convergent scheduling can naturally recover from a temporary wrong decision by one phase. In the example, when we apply noise to (b), most nodes are initially moved away from the first cluster. Subsequently, however, nodes with strong ties to cluster one, such as nodes 1-6, are eventually moved back, while nodes without strong ties, such as node 0, remain away.
4. Most compilers allow only very limited exchange of information among passes. In contrast, the weight-based interface of convergent scheduling is very expressive.
5. The framework allows a heuristic to be applied multiple times, either independently or as part of an iterative process. This feature is useful to provide feedback between phases and to avoid phase ordering problems.
6. The simple interface (preference maps) between passes makes it easy for the compiler writer to handle new constraints or design new heuristics. Phases for different heuristics are written independently, and the expressive, common interface reduces design complexity. This offers an easy way to retarget a compiler and to address peculiarities of the underlying architecture. If, for example, an architecture can exploit auto-increment on memory accesses with a specific instruction, one pass could try to keep memory accesses and increments together, so that the scheduler will find them adjacent and will be able to exploit the advanced instruction.
3 Convergent interface
This section describes the convergent scheduling interface between passes. The passes themselves will be presented in Section 4.
Convergent scheduling operates on individual scheduling units, which may be basic blocks, traces [7], superblocks [10], hyperblocks [19], or treegions [9]. It stores preferences in a three-dimensional matrix W[i,c,t], where i spans over all instructions in the scheduling unit, c spans over the clusters in the architecture, and t spans over time. We allocate as many time slots as the critical-path length (CPL).
Initially, all the weights are distributed evenly. A pass examines the dependence graph and the weight matrix to determine the characteristics of the preferred schedule so far. Then, it expresses its preferences by manipulating the preference map. Passes are not required to perform changes that affect the preferred schedule. If they are indifferent to one or more choices, they can keep the weights the same.
Let i span over instructions, c over clusters, and t over time slots. The following invariants are maintained:

∀ i, t, c : 0 ≤ W[i,t,c] ≤ 1

∀ i : Σ_{t,c} W[i,t,c] = 1
Given an instruction i, we define the following:2

preferred_time(i) = argmax{t : Σ_c W[i,t,c]}

preferred_cluster(i) = argmax{c : Σ_t W[i,t,c]}

runnerup_cluster(i) = argmax{c : Σ_t W[i,t,c], c ≠ preferred_cluster(i)}

confidence(i) = Σ_t W[i,t,preferred_cluster(i)] / Σ_t W[i,t,runnerup_cluster(i)]

2 The function argmax returns the value of the variable that maximizes the expression for a given set of values (while max returns the value of the expression). For instance, max{0 ≤ x ≤ 2 : 10 − x} is 10, and argmax{0 ≤ x ≤ 2 : 10 − x} is 0.
Preferred values are those that maximize the sum of the preferences over time and clusters. The confidence of an instruction measures how confident the convergent scheduler is about its current spatial assignment. It is computed as the ratio of the weights of the top two clusters.
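These definitions are mechanical enough that a short sketch may help. The following is not the authors' implementation; the nested-list layout W[i][t][c] and the example weights are assumptions for illustration.

```python
# Sketch (not the authors' code) of the Section 3 queries over a
# preference map W[i][t][c], stored here as nested Python lists.

def preferred_time(W, i):
    """Time slot t maximizing sum_c W[i][t][c]."""
    return max(range(len(W[i])), key=lambda t: sum(W[i][t]))

def preferred_cluster(W, i):
    """Cluster c maximizing sum_t W[i][t][c]."""
    clusters = range(len(W[i][0]))
    return max(clusters, key=lambda c: sum(row[c] for row in W[i]))

def runnerup_cluster(W, i):
    """Best cluster other than preferred_cluster(i)."""
    best = preferred_cluster(W, i)
    rest = [c for c in range(len(W[i][0])) if c != best]
    return max(rest, key=lambda c: sum(row[c] for row in W[i]))

def confidence(W, i):
    """Ratio of the weights of the top two clusters."""
    top = sum(row[preferred_cluster(W, i)] for row in W[i])
    second = sum(row[runnerup_cluster(W, i)] for row in W[i])
    return top / second

# One instruction, 2 time slots x 3 clusters; weights sum to 1.
W = [[[0.10, 0.05, 0.05],
      [0.50, 0.20, 0.10]]]
print(preferred_time(W, 0))           # slot 1 (row sum 0.80 vs 0.20)
print(preferred_cluster(W, 0))        # cluster 0 (column sum 0.60)
print(round(confidence(W, 0), 2))     # 0.60 / 0.25 = 2.4
```

Note how confidence falls out of the same sums: the example instruction prefers cluster 0 over runner-up cluster 1 by a ratio of 2.4.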
The following basic operations are available on the weights:

• Any weight W[i,t,c] can be increased or decreased by a constant, or multiplied by a factor.
• The matrices of two or more instructions can be combined linearly to produce a matrix of another instruction. Given input instructions i1, i2, ..., in, an output instruction j, and weights for the input instructions w1, ..., wn where Σ_{k∈[1,n]} wk = 1, the linear combination is as follows:

for each (c, t): W[j,t,c] ← Σ_{k∈[1,n]} wk · W[ik,t,c]

Our current implementation uses a simpler form of this operation, with n = 2 and i1 = j:

for each (c, t): W[i1,t,c] ← w1 · W[i1,t,c] + (1 − w1) · W[i2,t,c]

We never perform this full operation because it is expensive. Instead, we only do it on part of the matrices, e.g., only along the space dimension, or only within a small range along the time dimension.
• The system incrementally keeps track of the sums of the weights over both space and time, so that they can be determined in O(1) time. It also memoizes the preferred time and preferred cluster of each instruction.
• The preferences can be normalized to guarantee our invariants; the normalization simply performs:

for each (i, c, t): W[i,t,c] ← W[i,t,c] / Σ_{t,c} W[i,t,c]
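As a concrete sketch of the last two operations, here is the simplified (n = 2, i1 = j) linear combination and the normalization step; the list-of-lists layout and the example weights are illustrative assumptions, not the paper's code.

```python
# Sketch of two weight operations from Section 3: the simplified
# (n = 2, i1 = j) linear combination and normalization.
# Layout W[i][t][c] is an assumption; values are invented.

def combine(W, i1, i2, w1):
    """W[i1] <- w1 * W[i1] + (1 - w1) * W[i2], entry by entry."""
    for t in range(len(W[i1])):
        for c in range(len(W[i1][t])):
            W[i1][t][c] = w1 * W[i1][t][c] + (1.0 - w1) * W[i2][t][c]

def normalize(W, i):
    """Rescale instruction i's weights to sum to 1 (the invariant)."""
    total = sum(sum(row) for row in W[i])
    if total > 0:
        for row in W[i]:
            for c in range(len(row)):
                row[c] /= total

# Two instructions, 2 time slots x 2 clusters.
W = [[[2.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 2.0]]]
normalize(W, 0)          # W[0] -> [[0.5, 0.25], [0.25, 0.0]]
normalize(W, 1)          # W[1] -> [[0.0, 0.25], [0.25, 0.5]]
combine(W, 0, 1, 0.5)    # W[0] -> [[0.25, 0.25], [0.25, 0.25]]
```

Because combine preserves the unit-sum invariant when both inputs are normalized, a pass can blend two instructions' maps without an extra normalization step.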
4 Collection of Heuristics
This section presents a collection of heuristics we have implemented for convergent scheduling. Each heuristic attempts to address a single constraint and only communicates with other heuristics via the weight matrix. There are no restrictions on the order or the number of times each heuristic is applied. Currently, the following parameters are selected by trial and error: the set of heuristics we use, the weights used in the heuristics, and the order in which the heuristics are run. We expect to implement more systematic heuristic selection in the future.
Whenever necessary, we run normalization at the end of every pass to ensure the invariants described in Section 3. For brevity, this step is not included in the descriptions below.
Initial time assignment (INITTIME) Instructions in the middle of the dependence graph cannot be scheduled before their predecessors, nor after their successors. So, if CPL is the length of the critical path, lp is the length of the longest path from the top of the graph (latency of the predecessor chain), and ls is the longest path to any leaf (latency of the successor chain), the instruction can be scheduled only in the time slots between lp and CPL − ls. If an instruction is part of the critical path, only one time slot will be feasible. This pass squashes to zero all the weights outside this range:

for each (i, t < lp ∨ t > CPL − ls, c): W[i,t,c] ← 0

A pass similar to this one can address the fact that some instructions cannot be scheduled on all clusters in some architectures, simply by squashing the weights for the infeasible clusters.
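A minimal sketch of the squashing step, under the assumption that lp[i] and ls[i] hold per-instruction longest-path latencies (the example numbers are invented):

```python
# Sketch of INITTIME. Assumptions: lp[i] is the longest-path latency
# from a root to instruction i, ls[i] from i to a leaf, and CPL is
# the critical-path length.

def inittime(W, lp, ls, CPL):
    for i, mat in enumerate(W):
        for t in range(len(mat)):
            # Slot t is infeasible before i's predecessor chain can
            # finish, or too late for its successor chain to complete.
            if t < lp[i] or t > CPL - ls[i]:
                for c in range(len(mat[t])):
                    mat[t][c] = 0.0

# One instruction, 4 time slots x 2 clusters, uniform weights.
W = [[[0.125, 0.125] for _ in range(4)]]
inittime(W, lp=[1], ls=[2], CPL=3)
# Only t = 1 survives, since the feasible range is 1 <= t <= 3 - 2.
```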
Noise introduction (NOISE) This pass introduces a small amount of noise in the weight distribution. The noise helps break symmetry and spreads instructions around to facilitate scheduling for parallelism.

for each (i, t, c): W[i,t,c] ← W[i,t,c] + rand()/RAND_MAX
Preplacement (PLACE) This pass increases the weight for preplaced instructions to be placed in their home cluster. Since this condition is required for correctness, the weight increase is large. Given a preplaced instruction i, let cp(i) be its preplaced cluster. Then,

for each (i ∈ PREPLACED, t): W[i,t,cp(i)] ← 100 · W[i,t,cp(i)]
Push to first cluster (FIRST) In our clustered VLIW infrastructure, an invariant is that all data are available in the first cluster at the beginning of every scheduling unit. For this architecture, we want to favor schedules that make greater use of the first cluster, where data are already available, over the other clusters, where copies may be needed. We express this preference as follows:

for each (i, t): W[i,t,1] ← 1.2 · W[i,t,1]
Critical path strengthening (PATH) This pass tries to keep all the instructions on a critical path (CP) in the same cluster. If instructions in the path have a bias for a particular cluster, the path is moved to that cluster. Otherwise the least loaded cluster is selected. If different portions of the path have strong biases toward different clusters (e.g., when there are two or more preplaced instructions on the path), the critical path is broken into two or more pieces, each kept close to the relevant home cluster. Let cc(i) be the chosen cluster for the CP.

for each (i ∈ CP, t): W[i,t,cc(i)] ← 3 · W[i,t,cc(i)]
Communication minimization (COMM) This pass reduces communication load by increasing the weight for an instruction to be in the same clusters where most of its neighbors (successors and predecessors in the dependence graph) are. This is done by summing the weights of all the neighbors in a specific cluster, and using that to skew weights in the correct direction.

for each (i, t, c): W[i,t,c] ← W[i,t,c] · Σ_{n ∈ neighbors(i)} W[n,t,c]
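The skew can be sketched as follows; the two-instruction graph and the weights are invented for illustration, and a real pass would follow with normalization.

```python
# Sketch of the COMM skew: every weight W[i][t][c] is scaled by the
# weight that i's dependence-graph neighbors place on cluster c at
# time t.

def comm(W, neighbors):
    return [[[W[i][t][c] * sum(W[n][t][c] for n in neighbors[i])
              for c in range(len(W[i][t]))]
             for t in range(len(W[i]))]
            for i in range(len(W))]

# Two dependent instructions, 1 time slot x 2 clusters.
W = [[[0.5, 0.5]],   # instruction 0: undecided
     [[0.9, 0.1]]]   # instruction 1: leaning toward cluster 0
W = comm(W, neighbors={0: [1], 1: [0]})
# Instruction 0 is now skewed toward cluster 0 (0.45 vs 0.05);
# a normalization pass would rescale each instruction to sum to 1.
```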
We have also implemented a variant of this pass that considers grandparents and grandchildren, and we usually run it together with COMM.

for each i: W[i,ti,ci] ← 2 · W[i,ti,ci]
Preplacement propagation (PLACEPROP) This pass propagates preplacement information to all instructions. For each non-preplaced instruction i, we divide its weight for each cluster c by its distance to the closest preplaced instruction in c. Let dist(i, c) be this distance. Then,

for each (i ∉ PREPLACED, t, c): W[i,t,c] ← W[i,t,c] / dist(i, c)
Load balance (LOAD) This pass performs load balancing across clusters. Each weight on a cluster is divided by the total load on that cluster:

for each (i, t, c): W[i,t,c] ← W[i,t,c] / load(c)
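A sketch of the division; note that the paper does not define load(c) precisely, so treating it as the total preference weight currently sitting on cluster c is an assumption of this example.

```python
# Sketch of the LOAD division. load(c) is assumed (not specified in
# the text) to be the summed preference all instructions place on
# cluster c.

def load(W, c):
    return sum(W[i][t][c] for i in range(len(W)) for t in range(len(W[i])))

def load_balance(W):
    loads = [load(W, c) for c in range(len(W[0][0]))]
    for mat in W:
        for row in mat:
            for c in range(len(row)):
                if loads[c] > 0:
                    row[c] /= loads[c]

# Two instructions crowding cluster 0 (1 time slot x 2 clusters).
W = [[[0.8, 0.2]], [[0.8, 0.2]]]
load_balance(W)
# The heavily loaded cluster 0 (load 1.6) is penalized more than
# cluster 1 (load 0.4); both rows end up at [0.5, 0.5].
```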
Level distribute (LEVEL) This pass distributes instructions at the same level across clusters. Given an instruction i, we define level(i) to be its distance from the furthest root. Level distribution has two goals. The primary goal is to distribute parallelism across clusters. The second goal is to minimize potential communication. To this end, the pass tries to distribute instructions that are far apart, while keeping together instructions that are near each other.
To perform the dual goals of instruction distribution without excessive communication, instructions on a level are partitioned into bins. Initially, the bin Bc for each cluster c contains instructions whose preferred cluster is c, and whose confidence is greater than a threshold (2.0). Then, we perform the following:
LevelDistribute(int l, int g):
  Il = {i | level(i) = l}
  foreach cluster c: Il = Il − Bc
  Ig = {i | i ∈ Il, distance(i, find_closest_bin(i)) > g}
  while Il ≠ ∅:
    B = round_robin_next_bin()
    iclosest = argmax{i ∈ Ig : distance(i, B)}
    B = B ∪ {iclosest}
    Il = Il − {iclosest}
    update Ig
The parameter g controls the minimum distance granularity at which we distribute instructions across bins. The distance between an instruction i and a bin B is the minimum distance between i and any instruction in B.
LEVEL can be applied multiple times to different levels. Currently we apply it every four levels on Raw. The four levels correspond approximately to the minimum granularity of parallelism that Raw can profitably exploit given its communication cost.
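The bin-filling loop above can be sketched as follows, deliberately simplified: every instruction is eligible (g = 0), distance() is a stub, and the confidence-based seeding of bins is omitted; a real pass would measure distance over the dependence graph.

```python
# Rough sketch of the round-robin bin filling in LEVEL, with a stub
# distance function and no confidence-based seeding (simplifying
# assumptions, not the paper's algorithm verbatim).

from itertools import cycle

def distance(i, bin_):
    # Stub: distance from i to the closest instruction already in bin_.
    return 1 if bin_ else 0

def level_distribute(level_insts, n_clusters):
    bins = [[] for _ in range(n_clusters)]
    remaining = list(level_insts)
    for b in cycle(bins):            # round-robin over the bins
        if not remaining:
            break
        # Add the remaining instruction farthest from this bin, so
        # nearby instructions are not scattered across clusters.
        pick = max(remaining, key=lambda i: distance(i, b))
        b.append(pick)
        remaining.remove(pick)
    return bins

bins = level_distribute([10, 11, 12, 13], n_clusters=2)
# With the stub distance, instructions alternate: [[10, 12], [11, 13]].
```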
Path propagation (PATHPROP) This pass selects high-confidence instructions and propagates their convergent matrices along a path. The confidence threshold t is an input parameter. Let ih be the selected confident instruction. The following propagates ih along a downward path:
find i | i ∈ successors(ih), confidence(i) < confidence(ih)
while i ≠ nil:
  for each (c, t): W[i,t,c] ← 0.5 · W[i,t,c] + 0.5 · W[ih,t,c]
  find in | in ∈ successors(i), confidence(in) < confidence(ih)
  i ← in
A similar function that visits predecessors propagates ih along an upward path.
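The downward walk can be sketched as below; the successor map, confidence values, and weights are invented for illustration.

```python
# Sketch of PATHPROP's downward walk: a high-confidence instruction ih
# averages its preference map into lower-confidence successors along
# one path.

def propagate_down(W, ih, successors, conf):
    i = next((s for s in successors[ih] if conf[s] < conf[ih]), None)
    while i is not None:
        for t in range(len(W[i])):
            for c in range(len(W[i][t])):
                W[i][t][c] = 0.5 * W[i][t][c] + 0.5 * W[ih][t][c]
        i = next((s for s in successors[i] if conf[s] < conf[ih]), None)

W = [[[0.8, 0.2]],   # ih = 0: confident about cluster 0
     [[0.4, 0.6]]]   # 1: weakly prefers cluster 1
propagate_down(W, 0, successors={0: [1], 1: []}, conf={0: 4.0, 1: 1.5})
# W[1][0] is pulled toward ih's cluster: approximately [0.6, 0.4].
```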
Emphasize critical path distance (EMPHCP) This pass attempts to help the convergence of the time information by emphasizing the level of each instruction. The level of an instruction is a good time approximation because it is when the instruction can be scheduled if a machine has infinite resources.

for each (i, c): W[i,level(i),c] ← 1.2 · W[i,level(i),c]
5 Results
We have implemented convergent scheduling in two systems: the Raw architecture [24] and the Chorus clustered VLIW infrastructure developed at MIT [20].
Figure 5. The Raw machine.
Experimental environment Figure 5 shows a picture of the Raw machine. The Raw machine comprises tiles organized in a two-dimensional mesh. The actual Raw prototype has 16 tiles in a 4x4 mesh. Each tile has its own instruction memory, data memory, registers, processor pipeline, and ALUs. Its instruction set is based on the MIPS R4000. The tiles are connected via point-to-point mesh networks. In addition to a traditional wormhole-routed dynamic network, Raw has a programmable, compiler-controlled static network that can be used to route scalar values between the register files/ALUs on different tiles (for details, please refer to [14]). To reduce communication overhead, the static network ports are register-mapped. Latency on the static network is three cycles between two neighboring tiles; each additional hop takes one extra cycle of latency.
Rawcc, the Raw compiler, takes a sequential C or Fortran program and parallelizes it across Raw tiles. It is built on top of the Machsuif intermediate representation [18]. Rawcc divides each input program into one or more scheduling traces. For each trace, Rawcc constructs the data precedence graph and performs space-time scheduling on each graph. Then, it applies a traditional register allocator to the code on each tile.
The Chorus clustered VLIW system is a flexible compiler/simulator environment that can simulate many different configurations of clustered VLIW machines. We use it to simulate a clustered VLIW machine with four identical clusters. Each cluster has four function units: one integer ALU, one integer ALU/memory unit, one floating point unit, and one transfer unit. Instruction latencies are based on the MIPS R4000. The transfer unit moves values between register files on different clusters. It takes one cycle to copy a register value from one cluster to another. Memory addresses are interleaved across clusters for maximum parallelism. Memory operations can request remote data, with a penalty of one cycle.
The Chorus compiler shares with Rawcc the same high-level structure. Like Rawcc, it is implemented on top of Machsuif. It first performs space-time scheduling, followed by traditional single-cluster register allocation [8].

Table 1. Sequence of heuristics used by the convergent scheduler for (a) the Raw machine and (b) clustered VLIW.

(a) Raw: INITTIME, PLACEPROP, LOAD, PLACE, PATH, PATHPROP, LEVEL, PATHPROP, COMM, PATHPROP, EMPHCP

(b) Clustered VLIW: INITTIME, NOISE, FIRST, PATH, COMM, PLACE, PLACEPROP, COMM, EMPHCP
Both Rawcc and the Chorus compiler employ congruence transformation and analysis to increase and analyze the predictability of memory references [13]. This analysis creates preplaced memory reference instructions that must be placed on specific tiles or clusters. For dense matrix loops, the congruence pass usually unrolls the loops by the number of clusters or tiles. This unrolling also increases the size of the scheduling regions, so that no additional unrolling is necessary to expose parallelism.
In both compilers, when a value is live across multiple scheduling regions, its definitions and uses must be mapped to a consistent cluster. On Rawcc, this cluster is the cluster of the first definition/use encountered by the compiler; subsequent definitions and uses become preplaced instructions.3 On Chorus, all values that are live across multiple scheduling regions are mapped to the first cluster.
Convergent schedulers Table 1 lists the heuristics used by the convergent scheduler for Raw and Chorus. The heuristics are run in the order given.
The convergent scheduler interfaces with the existing schedulers as follows. The output of the convergent scheduler is split into two parts:

1. A map describing the preferred partition, i.e., an assignment of every instruction to a specific cluster.

2. The temporal assignment of each instruction.
3 Rawcc does use SSA renaming to eliminate false dependences, which in turn reduces these preplacement constraints.
Figure 6. Performance comparisons between Rawcc and convergent scheduling on a 16-tile Raw machine.
Both Chorus and Rawcc use the spatial assignments given by the convergent scheduler. Chorus uses the temporal assignments as priorities for the list scheduler. For Rawcc, however, the temporal assignments are computed independently by its own instruction scheduler.
Benchmarks Our sources of benchmarks include the Raw benchmark suite (jacobi, life) [1], Nasa7 of Spec92 (cholesky, vpenta, and mxm), and Spec95 (tomcatv, fpppp-kernel). Fpppp-kernel is the inner loop of fpppp, which consumes 50% of its run-time. Sha is an implementation of the Secure Hash Algorithm. Fir is a FIR filter. Rbsorf is a Red-Black SOR relaxation. Vvmul is a simple matrix multiplication. Yuv does RGB to YUV color conversion. Some problem sizes have been changed to cut down on simulation time, but they do not affect the results qualitatively.
Performance comparisons We compared our results with the baseline Rawcc and Chorus compilers. Table 2 compares the performance of convergent scheduling to Rawcc for two to 16 tiles. Figure 6 plots the same data for 16 tiles. Results show that convergent scheduling consistently outperforms baseline Rawcc for all tile configurations for most of the benchmarks, with an average improvement of 21% for 16 tiles.
Many of our benchmarks are dense matrix codes with preplaced memory instructions from congruence analysis. For these benchmarks, convergent scheduling always outperforms baseline Rawcc. The reason is that convergent scheduling is able to actively take advantage
                       Base                     Convergent
Benchmark/Tiles    2     4     8     16      2     4     8     16
cholesky         1.14  2.21  3.29  4.33    1.44  2.75  4.94   7.06
tomcatv          1.18  1.83  2.88  3.94    1.37  2.12  3.33   5.15
vpenta           1.86  2.85  4.58  8.03    1.96  3.23  5.82   9.71
mxm              1.77  2.40  3.78  7.09    1.89  2.54  4.04   7.77
fpppp            1.54  3.09  5.13  6.76    1.42  2.04  3.87   5.39
sha              1.11  2.05  1.96  2.29    1.05  1.33  1.51   1.45
swim             1.40  2.04  3.62  6.23    1.63  2.69  4.24   8.30
jacobi           1.33  2.43  4.13  6.39    1.40  2.74  4.92   9.30
life             1.65  3.02  5.56  8.48    1.76  3.35  6.34  11.97
Table 2. Rawcc speedup. Speedup is relative to performance on one tile.
[Line plot: percentage of instructions whose preferred tiles have changed after each pass (PLACEPROP, LOAD, PLACE, PATH, PATHPROP, LEVEL, PATHPROP, COMM, PATHPROP), for cholesky, tomcatv, vpenta, mxm, fpppp-kernel, sha, swim, jacobi, and life.]
Figure 7. Convergence of spatial assignments on Raw.
of preplacement information to guide the placement of other instructions. This information turns out to lead to very good natural assignments of instructions.
For fpppp-kernel and sha, convergent scheduling performs worse than baseline Rawcc because preplaced instructions do not suggest many good assignments. Attaining good speedup on these benchmarks requires finding and exploiting very fine-grained parallelism. Our level distribution pass has been less efficient in this regard than the clustering counterpart in Rawcc; we expect that integrating a clustering pass into convergent scheduling will address this problem.
Figure 7 shows the percentage of instructions whose preferred tiles are changed by each convergent pass on Raw. The plots measure static instruction counts, and they exclude passes that only modify temporal preferences. For benchmarks with useful preplacement information, the convergent scheduler is able to converge to good solutions quickly by propagating the preplacement information and using the load balancing heuristic. In contrast, preplacement provides little useful information for fpppp-kernel and sha. These benchmarks thus require the critical-path, parallelism, and communication heuristics to converge to good assignments.
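The convergence metric of Figure 7 is straightforward to compute: around each pass, snapshot every instruction's top spatial preference and count how many moved. A sketch, assuming preferences are summarized as a preferred-tile map:

```python
def fraction_changed(before, after):
    """Fraction of static instructions whose preferred tile differs
    across one convergent pass.

    before, after: {instr: preferred_tile} snapshots taken around the pass.
    """
    changed = sum(1 for i in before if before[i] != after[i])
    return changed / len(before)
```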
Figure 8 compares the performance of convergent scheduling to two existing assignment/scheduling techniques for clustered VLIW: UAS [23] and PCC [5]. We
[Bar chart: speedup (0 to 4) for vvmul, rbsorf, yuv, tomcatv, mxm, fir, and cholesky under PCC, UAS, and Convergent scheduling.]
Figure 8. Performance comparisons between PCC, UAS, and Convergent scheduling on a four-clustered VLIW. Speedup is relative to a single-cluster machine.
[Line plot: percentage of instructions whose preferred clusters have changed after each pass (NOISE, FIRST, PATH, COMM, PLACE+PLACEPROP, COMM), for cholesky, fir, mxm, tomcatv, vvmul, and yuv.]
Figure 9. Convergence of spatial assignments on Chorus.
[Log-scale plot: scheduling time in seconds versus number of instructions scheduled, for PCC, UAS, and Convergent.]
Figure 10. Comparison of compile-time vs. input size for algorithms on Chorus.
augment each existing algorithm with preplacement information. For UAS, we modify the CPSC heuristic described in the original paper to give the highest priority to the home cluster of preplaced instructions. For PCC, the algorithm for estimating schedule lengths and communication costs properly accounts for preplacement information, by modeling the extra costs incurred by the clustered VLIW machine for a non-local memory access. Convergent scheduling outperforms UAS and PCC by 14% and 28%, respectively, on a four-clustered VLIW machine. As on Raw, the convergent scheduler is able to use preplacement information to find good natural partitions for our dense matrix benchmarks. Figure 9 shows the percentage of static instructions whose preferred clusters are changed by each convergent pass on Chorus. Passes that only modify temporal preferences are excluded.
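The UAS modification amounts to reordering the candidate clusters tried for each instruction. A minimal sketch (Python; the function name, the `home` map, and the per-cluster completion-time estimate are assumptions standing in for the CPSC heuristic, not the actual UAS code):

```python
def cluster_priority(instr, clusters, home, est_completion):
    """Order candidate clusters for one instruction, UAS-style.

    Preplaced instructions (those with an entry in `home`) get their home
    cluster first; all other clusters, and all clusters for ordinary
    instructions, are tried by earliest estimated completion time.
    """
    if home.get(instr) is not None:
        pinned = home[instr]
        rest = sorted((c for c in clusters if c != pinned),
                      key=lambda c: est_completion[c])
        return [pinned] + rest            # home cluster always wins
    return sorted(clusters, key=lambda c: est_completion[c])
```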
Compile-time scalability We examine the scalability of convergent scheduling. Scalability is important because there is an increasing need for instruction assignment and scheduling algorithms to handle larger and larger numbers of instructions. This need arises for two reasons: first, because of improvements in compiler representation techniques such as hyperblocks and treegions; and second, because a large scope is necessary to support the level of ILP provided by modern and future microprocessors.
Figure 10 compares the compile-time of convergent scheduling with that of UAS and PCC on Chorus. Both convergent scheduling and PCC use an independent list scheduler after instruction assignment; our measurements include time spent in the scheduler. The figure shows that convergent scheduling and UAS take about the same amount of time. They both scale considerably better than PCC. We note that PCC is highly sensitive to the number of components it initially divides the instructions into. Compile-time can be dramatically reduced if the number of components is kept small. However, we find that for our benchmarks, reducing the number of components also results in much poorer assignment quality.
6 Related work
Spatial architectures require cluster assignment, scheduling, and register allocation. We have provided a general framework that can perform all three tasks together (by adding preference maps for registers as well), but the focus of this paper is the application of convergent scheduling to cluster assignment.
Many compilers for spatial architectures address the three problems separately. Much research has focused on novel ways to do cluster assignment, coupled with traditional list scheduling and register allocation methods. The pioneering work in cluster assignment is BUG [6]. BUG uses a two-phase algorithm. First, the algorithm traverses a dependence graph bottom-up to propagate information about preplaced instructions. Then, it traverses the graph top-down and greedily maps each instruction to the cluster that can execute it earliest. The Multiflow compiler uses a variant of BUG [17], without the support for preplaced instructions. PCC is an iterative assignment approach based on partial components [5]. It builds partial components by visiting the data dependence graph bottom-up, critical-path first. The maximum size of a component is capped by a parameter, θth. It uses simple heuristics to select a value for θth that balances the tradeoff between performance and compile-time, although the exact method is not explained. The components are initially assigned to clusters based on simple load balancing and communication criteria. The assignments are subsequently improved through iterative descent, by checking whether moving a sub-component to another cluster improves the schedule. Rawcc leverages techniques developed for multiprocessor task graph scheduling [14]. Assignment is performed in three steps: clustering groups together instructions that have little parallelism; merging reduces the number of clusters; placement maps clusters to tiles. During placement, Rawcc also handles constraints from preplaced instructions.
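The two-phase shape of BUG can be made concrete with a minimal Python sketch (not the actual Bulldog code; the DAG encoding, `NUM_CLUSTERS`, and the `earliest` cost model are assumptions): preplacement hints flow bottom-up, then a greedy top-down pass picks the earliest cluster.

```python
NUM_CLUSTERS = 4  # assumed machine size for the sketch

def topo_order(succs):
    """Topological order of a DAG given as {instr: [successors]}."""
    indeg = {i: 0 for i in succs}
    for ss in succs.values():
        for s in ss:
            indeg[s] += 1
    order = [i for i, d in indeg.items() if d == 0]
    for i in order:                      # list grows while we iterate
        for s in succs[i]:
            indeg[s] -= 1
            if indeg[s] == 0:
                order.append(s)
    return order

def bug_assign(succs, preplaced, earliest):
    """Two-phase BUG-style cluster assignment (hypothetical sketch).

    Phase 1 walks the DAG bottom-up, propagating the clusters of
    preplaced instructions upward as hints. Phase 2 walks top-down and
    greedily picks, among hinted clusters (or all clusters when there is
    no hint), the one that can execute the instruction earliest.
    """
    hints = {}
    for i in reversed(topo_order(succs)):             # bottom-up pass
        hint = {preplaced[i]} if i in preplaced else set()
        for s in succs[i]:
            hint |= hints[s]
        hints[i] = hint
    assign = {}
    for i in topo_order(succs):                       # top-down pass
        if i in preplaced:
            assign[i] = preplaced[i]
        else:
            cands = hints[i] or set(range(NUM_CLUSTERS))
            assign[i] = min(cands, key=lambda c: earliest(i, c, assign))
    return assign
```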
In [3], the approach differs from the above approaches in the ordering of phases. It performs scheduling before assignment. The assignment phase uses a min-cut algorithm adapted from circuit partitioning that tries to minimize communication. This algorithm, however, does not directly attempt to optimize the execution length of input DAGs.
To avoid phase ordering problems, recent works have proposed combined solutions. Leupers describes an iterative combined approach to perform scheduling and partitioning on a VLIW DSP [16]. The approach is based on simulated annealing. UAS performs assignment and scheduling together by integrating assignment into a cycle-driven list scheduler [23]. CARS performs all three tasks (assignment, scheduling, and register allocation) in one step, by integrating both assignment and register allocation into a mostly cycle-driven list scheduler [11]. In all these approaches, however, every decision is irrevocable and final. In contrast, convergent scheduling provides a general framework that allows decisions to be postponed or reversed, and we have used it to address the phase ordering problem we found in cluster assignment when we have preplaced instructions.
Few assignment approaches are designed to take into account preplaced instructions. Of the above, only BUG and Rawcc directly support them.
Phase ordering on clustered architectures is a relatively new problem area; a more classical phase ordering problem occurs in scalar optimizations. Some approaches in those areas share similar goals and features with convergent scheduling. Cooper uses a genetic-algorithm solution to evolve the order of passes to optimize code size [4]. The approach finds good general solutions, and it performs even better when the evolution is applied independently on each benchmark. Lerner proposes an interesting interface to different passes based on graph replacement [15]. His approach enables independently designed dataflow passes to be composed and run together. The composed pass is able to achieve the precision of iterating independent passes, but without the compile-time cost of iteration.
7 Conclusion
This paper introduces convergent scheduling, a flexible framework for performing instruction assignment and scheduling. Its interface between passes is simple and expressive. The simplicity makes it easy for the compiler developer to handle new constraints or integrate new heuristics. It also allows passes to be executed multiple times or even iteratively, which helps alleviate phase ordering problems. The expressiveness allows passes to specify confidence about their decisions, thus avoiding poor irreversible decisions.
This paper presents evidence that convergent scheduling is a promising framework. We have implemented it on two spatial architectures, Raw and clustered VLIW, and demonstrated performance improvements of 21% and 14%, respectively. One reason for these performance gains comes from the ability of the convergent scheduler to support, in an integrated manner, both traditional constraints (parallelism, locality, and communication) and a new constraint (preplaced instructions).
References
[1] J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, and A. Agarwal. The RAW Benchmark Suite: Computation Structures for General Purpose Computing. In 5th Symposium on FPGA-Based Custom Computing Machines (FCCM), pages 134-143, 1997.

[2] R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal. Maps: A Compiler-Managed Memory System for Raw Machines. In 26th International Symposium on Computer Architecture (ISCA), pages 4-15, 1999.
[3] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned Register Files for VLIWs: A Preliminary Analysis of Tradeoffs. In 25th International Symposium on Microarchitecture (MICRO), pages 292-300, 1992.

[4] K. D. Cooper, P. J. Schielke, and D. Subramanian. Optimizing for Reduced Code Space using Genetic Algorithms. In Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES), pages 1-9, 1999.

[5] G. Desoli. Instruction Assignment for Clustered VLIW DSP Compilers: A New Approach. Technical Report HPL-98-13, Hewlett Packard Laboratories, 1998.

[6] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, 1986.

[7] J. A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers, C-30(7):478-490, July 1981.

[8] L. George and A. W. Appel. Iterated Register Coalescing. ACM Transactions on Programming Languages and Systems, 18(3):300-324, 1996.

[9] W. A. Havanki, S. Banerjia, and T. M. Conte. Treegion Scheduling for Wide Issue Processors. In 4th International Symposium on High Performance Computer Architecture (HPCA), pages 266-276, 1998.

[10] W. Hwu, S. Mahlke, W. Chen, P. Chang, N. Warter, R. Bringmann, R. Ouellette, R. Hank, T. Kiyohara, G. Haab, J. Holm, and D. Lavery. The Superblock: An Effective Technique for VLIW and Superscalar Compilation. The Journal of Supercomputing, 7(1):229-248, Jan 1993.

[11] K. Kailas, K. Ebcioglu, and A. K. Agrawala. CARS: A New Code Generation Framework for Clustered ILP Processors. In 7th International Symposium on High Performance Computer Architecture (HPCA), pages 133-143, 2001.

[12] H.-S. Kim and J. E. Smith. An Instruction Set and Microarchitecture for Instruction Level Distributed Processing. In 29th International Symposium on Computer Architecture (ISCA), pages 71-81, 2002.

[13] S. Larsen and S. Amarasinghe. Increasing and Detecting Memory Address Congruence. In 11th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2002.

[14] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine. In 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 46-57, 1998.

[15] S. Lerner, D. Grove, and C. Chambers. Composing Dataflow Analyses and Transformations. In 29th Symposium on Principles of Programming Languages (POPL), pages 270-282, 2002.

[16] R. Leupers. Instruction Scheduling for Clustered VLIW DSPs. In 9th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 291-300, 2000.

[17] P. Lowney, S. Freudenberger, T. Karzes, W. Lichtenstein, R. Nix, J. O'Donnell, and J. Ruttenberg. The Multiflow Trace Scheduling Compiler. In Journal of Supercomputing, pages 51-142, 1993.

[18] Machsuif. http://www.eecs.harvard.edu/hube.

[19] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective Compiler Support for Predicated Execution Using the Hyperblock. In 25th Annual International Symposium on Microarchitecture (MICRO), pages 45-54, 1992.

[20] D. Maze. Compilation Infrastructure for VLIW Machines. Master's thesis, Massachusetts Institute of Technology, September 2001.

[21] R. Motwani, K. V. Palem, V. Sarkar, and S. Reyen. Combining Register Allocation and Instruction Scheduling. Technical Report CS-TN-95-22, 1995.

[22] R. Nagarajan, K. Sankaralingam, D. Burger, and S. Keckler. A Design Space Evaluation of Grid Processor Architectures. In 34th International Symposium on Microarchitecture (MICRO), pages 40-51, 2001.

[23] E. Ozer, S. Banerjia, and T. M. Conte. Unified Assign and Schedule: A New Approach to Scheduling for Clustered Register File Microarchitectures. In 31st International Symposium on Microarchitecture (MICRO), pages 308-315, 1998.

[24] M. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, J.-W. Lee, P. Johnson, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs. IEEE Micro, pages 25-35, March/April 2002.