ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology


Page 1: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

ECE 4100/6100 Advanced Computer Architecture

Lecture 15 Static Scheduling Machines

Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Page 2: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Static Scheduling
• Compiler performs instruction scheduling
• VLIW: Very Long Instruction Word
• An alternative to dynamic scheduling processors
• Pack multiple operations into one instruction
• Move scheduling to the compiler (software approach)
• Can reduce the complexity of a hardware-based instruction scheduler
• Cydrome, Multiflow, EPIC

Page 3: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Very Long Instruction Word (VLIW)

• Relies on compilers
• Simple hardware
• Dependencies are explicitly represented in the instructions
• The instruction window is, supposedly, much larger than a hardware scheduling window
– How about loop boundaries?
– How about function boundaries?
– Interprocedural optimization is generally difficult
• Might lead to compatibility or performance issues if instruction latencies change
• EPIC/Itanium closely follows the VLIW philosophy; many embedded and DSP processors embrace VLIW

Page 4: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Intel Itanium ISA
• Itanium Instruction “Bundle” (VLIW)
– 128 bits each
– Contains three Itanium instructions (aka syllables)
– Template bits in each bundle specify dependencies both within a bundle and between sequential bundles
– A collection of independent bundles forms a “group” (delimited by stops)
• Each Itanium Instruction
– Fixed length, 41 bits
– Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU)
– Contains at most three 7-bit register specifiers
– Contains a 6-bit field specifying one of the 64 one-bit qualifying predicate registers

Bundle layout (128 bits): three 41-bit instruction slots in bits 127-87, 86-46 and 45-5, followed by the 5-bit template in bits 4-0.
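The bundle layout above can be unpacked with plain shifts and masks. The following is a minimal sketch, not a real decoder: the function names are made up for illustration, the slots are numbered 0 to 2 starting from the low-order end, and the qualifying predicate is assumed to sit in the low 6 bits of each syllable (the slide gives only its width).

```python
SLOT_BITS = 41      # each syllable is 41 bits
TEMPLATE_BITS = 5   # template occupies bits 4-0

def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit syllables into 128 bits."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, syllable in enumerate(slots):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        bundle |= (syllable & ((1 << SLOT_BITS) - 1)) << shift
    return bundle

def decode_bundle(bundle):
    """Return (template, [slot0, slot1, slot2]) for a 128-bit bundle."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = []
    for i in range(3):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        slots.append((bundle >> shift) & ((1 << SLOT_BITS) - 1))
    return template, slots

def decode_syllable(syllable):
    """Return (major opcode from bits 40-37, qualifying predicate).

    The predicate position (bits 5-0) is an assumption of this sketch.
    """
    major_opcode = (syllable >> 37) & 0xF
    qp = syllable & 0x3F
    return major_opcode, qp
```

Template value 0x02 would correspond to the MI_I format mentioned on the next slide; the sketch does not interpret template values itself.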

Page 5: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Encoding Instruction Bundle

• Use “;;” as the “stop bit” in assembly code to separate dependent instructions
• Instructions between “;;” belong to the same “instruction group”
– RAW and WAW are not allowed in the same instruction group
– WAR is allowed except for a special case: writing p63 with a modulo-scheduled branch (e.g. br.ctop) after reading p63 (e.g. as a qualifying predicate) in a B-type instruction
• Each instruction slot can represent one (out of 5) functional unit type based on encoding (e.g. slot 0 can be M-unit or B-unit)
• 12 basic templates provided, each with 2 versions depending on the stop bit
– MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
– MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_

{ .mii
  ld4 r28 = [r8]
  add r9 = 2, r1 ;;
  add r30 = 1, r9
}
MI_I format, template encoded “02”

Page 6: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Itanium Instruction Example

{ .mii
  add r1 = r2, r3
  sub r4 = r4, r5 ;;
  shr r1 = r4, r1 ;;
}
{ .mmi
  ld8 r2 = [r1] ;;
  st8 [r1] = r23
  tbit p1, p2 = r4, 5
}
{ .mbb
  ld8 r45 = [r55]
  (p3) br.call b1 = func1
  (p4) br.cond Label1
}
{ .mfi
  st4 [r45] = r6
  fmac f1 = f2, f3
  add r3 = r3, 8 ;;
}

Page 7: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

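Itanium Register Files

[Figure: the three register files]
• General Purpose Registers: r0-r127, 64 bits each (bits 63-0); r0-r31 static, r32-r127 stacked (rotating)
• FP Registers: f0-f127, 82 bits each (bits 81-0); f0-f31 static, f32-f127 rotating
• Predicate Registers: p0-p63, 1 bit each; p0-p15 static, p16-p63 rotating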

Page 8: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Register Stack Engine

• Avoids spills/fills during function call/return
• Callee uses the instruction alloc r1=ar.pfs, i, l, o, r upon entering a function

[Figure: general register frame] r0-r31 are static. The current frame starts at r32 and holds the inputs, then the locals, then the outputs; registers beyond the frame, up to r127, are illegal.

Current Frame Marker (CFM), 38 bits. Fields: sof, sol, sor, rrb.gr, rrb.fr, rrb.pr
– size of frame (sof)
– size of locals (sol = i+l)
– size of rotating (sor)

Page 9: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

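Function Call Example

main(){
  a=foo(i*i, b[i]);
}

int foo(int ii, int bb){
}

main: alloc r32=ar.pfs,0,12,2,0
foo: alloc r26=ar.pfs,2,5,0,0

[Figure: GPR frames of caller and callee]
• Caller (main): locals r32-r43, outputs r44 (i*i) and r45 (b[i]); registers up to r127 lie beyond the frame
• Callee (foo): the caller's outputs appear as inputs r32 (i*i) and r33 (b[i]); the frame extends to r38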

Page 10: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

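RSE: A Function Call

[Figure: register stack before and after a call]
• Before the call: frame r32-r52, with locals r32-r45 and outputs r46-r52; CFM: sol = 14, sof = 21; PFS.pfm: sol = x, sof = x
• After the call: the caller's outputs become r32-r38; CFM: sol = 0, sof = 7; PFS.pfm: sol = 14, sof = 21

pfm: Previous frame marker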

Page 11: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

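RSE: Alloc

[Figure: register stack before the call, after the call, and after alloc]
• Before the call: frame r32-r52, with locals r32-r45 and outputs r46-r52; CFM: sol = 14, sof = 21; PFS.pfm: sol = x, sof = x
• After the call: outputs r32-r38; CFM: sol = 0, sof = 7; PFS.pfm: sol = 14, sof = 21
• After alloc r32=ar.pfs,7,9,3,0: frame r32-r50, with inputs and locals r32-r47 and outputs r48-r50; CFM: sol = 16, sof = 19; PFS.pfm: sol = 14, sof = 21

alloc copies PFM to GR (r32)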

Page 12: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

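RSE: Return

[Figure: register stack before the call, after the call, after alloc, and after return]
• Before the call: frame r32-r52, with locals r32-r45 and outputs r46-r52; CFM: sol = 14, sof = 21; PFS.pfm: sol = x, sof = x
• After the call: outputs r32-r38; CFM: sol = 0, sof = 7; PFS.pfm: sol = 14, sof = 21
• After alloc: frame r32-r50, outputs r48-r50; CFM: sol = 16, sof = 19; PFS.pfm: sol = 14, sof = 21
• After the return: the caller's frame r32-r52 is restored, with locals r32-r45 and outputs r46-r52; CFM: sol = 14, sof = 21; PFS.pfm: sol = 14, sof = 21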
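The call / alloc / return sequence on the RSE slides can be traced with a few lines of bookkeeping. This is a minimal sketch under simplifying assumptions: the class and method names are invented for illustration, only one level of call is tracked (real code saves ar.pfs in a general register and restores it before returning), and the rotating-region argument of alloc is ignored.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (inputs + locals + outputs)

class RegisterStack:
    """Tracks CFM and PFS.pfm across one call, alloc and return."""

    def __init__(self, sol, sof):
        self.cfm = Frame(sol, sof)
        self.pfm = None    # previous frame marker (PFS.pfm)
        self.bottom = 0    # stacked-register offset that r32 maps to

    def call(self):
        # Save the caller's marker; the caller's outputs become the
        # callee's initial frame, renamed to start at r32.
        self.pfm = Frame(self.cfm.sol, self.cfm.sof)
        outputs = self.cfm.sof - self.cfm.sol
        self.bottom += self.cfm.sol
        self.cfm = Frame(0, outputs)

    def alloc(self, i, l, o):
        # Resize the current frame; alloc also copies pfm into a GR,
        # modelled here by returning it.
        self.cfm = Frame(i + l, i + l + o)
        return self.pfm

    def ret(self):
        # Restore the caller's frame from the previous frame marker.
        self.bottom -= self.pfm.sol
        self.cfm = Frame(self.pfm.sol, self.pfm.sof)

rs = RegisterStack(sol=14, sof=21)   # caller frame from the slides
rs.call()                            # CFM becomes sol=0, sof=7
saved = rs.alloc(7, 9, 3)            # CFM becomes sol=16, sof=19
rs.ret()                             # CFM is back to sol=14, sof=21
```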

Page 13: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Itanium Pipelines

• Performance improvement due to pipeline shortening: 4% to 6%
• The large integer register file causes an extra stage, WLD (Word Line Decode), in Itanium; the circuit was improved for Itanium 2
• Inter-group latency is enforced by a scoreboard
– Latency due to scheduling that failed to space instructions out
– Due to cache misses

[Figure: Itanium and Itanium 2 pipelines. Labels: Front-end; Ckt improved; Dependency Scoreboard Stall checked here prior to EXE]

Page 14: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Itanium 2 Eight-stage Pipeline

Core: IPG ROT EXP REN REG EXE DET WB
FP: FP1 FP2 FP3 FP4 WB
L2: L2N L2I L2A L2M L2D L2C L2W

IPG: IP Generate, L1I cache (6 inst) and TLB access
ROT: Instruction Rotate and Buffer (6 inst)
EXP: Expand, Port assignment and routing
REN: INT and FP register rename
REG: INT and FP register file read
EXE: ALU Execute, L1D Cache and TLB Access + L2 Cache Tag Access
DET: Exception Detect, Branch Correction
WB: Writeback, INT register update
FP1-WB: FP FMAC pipeline (2) + register write
L2N-L2I: L2 Queue Nominate/Issue (4) (speculatively issued with L1 request)
L2A-L2W: L2 Access, Rotate, Correct, Write (4)

Page 15: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Itanium 2 Microarchitecture

[Figure: Itanium 2 block diagram. Recoverable labels:]
• L1 I-Cache & Fetch/Prefetch engine, I-TLB, Branch Prediction, IA-32 Decode & Control
• Instruction Queue: 8 bundles
• 11 issue ports: B B B M M M M I I F F
• Register stack engine / remapping
• Branch & Predicate registers, 128 INT Registers, 128 FP Registers
• Branch Units, INT & MM Units, Floating Point Units
• Scoreboard, Predicate, NaT, Exceptions
• Quad-port (INT) L1 PIPT Data Cache (WT), D-TLB, ALAT
• PIPT Unified L2 Cache, Quad-Port (ECC)
• On-chip PIPT Unified L3 Cache, Single-ported (ECC)
• Bus Controller (ECC)

Page 16: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines


Page 17: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Control Speculation (Speculative Load)

Elevate loads above a branch

• To hide memory latency by control speculation at compile time
• Defer exceptions by setting NaT (GR’s 65th bit), which indicates:
– Whether or not an exception has occurred
– Branch to fixup code required
• NaT set during ld.s, checked by chk.s

Conventional Architectures (the branch is a barrier):
  instr 1
  instr 2
  . . .
  br
  Load
  use

Itanium:
  ld.s
  instr 1
  instr 2
  br
  chk.s
  use

Page 18: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Control Speculation (Hoist Uses)

• The uses of speculative data can be executed speculatively
– Distinguishes speculation from simple prefetch
• NaT bit propagates down the dependent instruction chain

IA-64:
  ld.s
  instr 1
  instr 2
  br
  chk.s
  use

Page 19: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Control Speculation (Recovery)
• All computation instructions propagate NaTs to the consumers to reduce the number of checks
• Cmp propagates “false” if NaT is set when writing predicates (“0” for both target predicates)

  ld8.s r3 = (r9)
  ld8.s r4 = (r10)
  add r6 = r3, r4
  ld8.s r5 = (r6)
  p1,p2 = cmp(...)
  chk.s r5, recv
  sub r7 = r5,r2

Allows single chk on result

Recovery code:
  ld8
  ld8
  add
  ld8
  br home
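The NaT propagation described on the control speculation slides can be modelled in a few lines. This is a minimal sketch, not the hardware behavior in full: the helper names are invented, memory is a dictionary, and a load from an address missing from that dictionary stands in for any faulting access.

```python
from typing import Dict, NamedTuple

class GR(NamedTuple):
    """A general register: a value plus its NaT (Not a Thing) bit."""
    value: int
    nat: bool = False

NAT = GR(0, True)

def ld_s(memory: Dict[int, int], addr: GR) -> GR:
    """Speculative load: a faulting access sets NaT instead of trapping."""
    if addr.nat or addr.value not in memory:
        return NAT
    return GR(memory[addr.value])

def add(a: GR, b: GR) -> GR:
    """Computation propagates NaT from either source to the result."""
    if a.nat or b.nat:
        return NAT
    return GR(a.value + b.value)

def chk_s(reg: GR) -> bool:
    """Return True when control must branch to the recovery code."""
    return reg.nat

def speculative_chain(memory: Dict[int, int], r9: GR, r10: GR) -> GR:
    """Mirror of the slide's chain: two loads, an add, a dependent load."""
    r3 = ld_s(memory, r9)
    r4 = ld_s(memory, r10)
    r6 = add(r3, r4)
    return ld_s(memory, r6)   # r5, the only register that needs a chk.s

memory = {100: 1, 200: 2, 3: 42}
ok = speculative_chain(memory, GR(100), GR(200))    # no fault anywhere
bad = speculative_chain(memory, GR(100), GR(999))   # second load faults
```

Because the NaT bit flows through the add and the dependent load, checking the final register alone is enough to detect a fault anywhere in the chain.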

Page 20: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Data Speculation (Advanced Loads)
• Compiler can hoist a load above a preceding, possibly conflicting store
• ALAT (Advanced Load Address Table) is used to check every store address in between
• Can be done by a superscalar machine using store coloring

Conventional Architectures (the store is a barrier):
  instr 1
  instr 2
  . . .
  st8
  ld8
  use

Itanium:
  ld8.a
  instr 1
  instr 2
  st8
  ld.c
  use

Page 21: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Data Speculation (load.a + chk.a)
• Compiler hoists a load and its subsequent consumers above a preceding, possibly conflicting store
• Needs recovery code to patch up a mis-speculation

Hoisting only the load (ld.c):
  ld8.a r3=
  instr 1
  instr 2
  st8
  ld.c
  add =r3,

Hoisting the load and its consumer (chk.a):
  ld8.a r3=
  instr 1
  add =r3,
  instr 2
  st8
  chk.a
L1:

Recovery code:
  ld8 r3=
  add =r3,
  br L1
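The interplay of ld.a, intervening stores, ld.c and chk.a can be sketched with a small table keyed by target register. This is an illustrative model only: the class and method names are invented, entries match on exact address (a real ALAT also considers access size and has limited capacity), and memory is a plain dictionary.

```python
class ALAT:
    """Toy Advanced Load Address Table: register name -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: perform the load and record its address."""
        self.entries[reg] = addr
        return memory[addr]

    def st(self, memory, addr, value):
        """Store: write memory and drop every entry with this address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """Return True when the entry is gone, i.e. recovery code must run."""
        return reg not in self.entries

    def ld_c(self, memory, reg, addr, current):
        """Check load: keep the speculative value if the entry survived,
        otherwise reload from memory."""
        if reg in self.entries:
            return current
        return memory[addr]

memory = {8: 10, 16: 0}
alat = ALAT()
r3 = alat.ld_a(memory, "r3", 8)      # hoisted above the store
alat.st(memory, 16, 5)               # different address: no conflict
r3 = alat.ld_c(memory, "r3", 8, r3)  # speculation held, r3 stays 10
alat.st(memory, 8, 99)               # same address: entry invalidated
needs_recovery = alat.chk_a("r3")
r3_reloaded = alat.ld_c(memory, "r3", 8, r3)
```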

Page 22: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Parallel Compare Types

• Three new types of compares:
– and: both target predicates set FALSE if compare is false
– or: both target predicates set TRUE if compare is true
– DeMorgan: if true, sets one TRUE, sets other FALSE
• Not to be confused with the “parallel compare” instructions pcmp1/pcmp2/pcmp4

[Figure: compares A, B and C ahead of D, first as a serial chain, then evaluated in parallel. Caption: Reduces Critical Path]

Page 23: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Eight Queen Example (Unconditional Compares)

Source: Crawford & Huck

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Cycle 1:  R1=&b[j]
          R3=&a[i+j]
          R5=&c[i-j+7]
Cycle 2:  ld R2=[R1]
          ld.s R4=[R3]
          ld.s R6=[R5]
Cycle 4:  p1,p2=cmp.unc(R2==true)
Cycle 5:  (p1) chk.s R4
          (p1) p3,p4=cmp.unc(R4==true)
Cycle 6:  (p3) chk.s R6
          (p3) p5,p6=cmp.unc(R6==true)
Cycle 7:  (p5) br then
          else

[Figure: 8 queens control flow. The predicate pairs P1/P2, P3/P4 and P5/P6 select between Then and Else]

Page 24: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Eight Queen Example (Parallel Compares)

Source: Crawford & Huck

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Cycle 1:  R1=&b[j]
          R3=&a[i+j]
          R5=&c[i-j+7]
          p1 <- true
Cycle 2:  ld R2=[R1]
          ld R4=[R3]
          ld R6=[R5]
Cycle 4:  p1,p2 <- cmp.and(R2==true)
          p1,p2 <- cmp.and(R4==true)
          p1,p2 <- cmp.and(R6==true)
Cycle 5:  (p1) br then
          else

Reduced from 7 cycles to 5

[Figure: control flow. Then is taken when P1 = true, Else when P1 = false]

Page 25: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

More Example of Parallel Compare

if (c1 && c2 && c3 && c4)
  r1 = r2 + r3;
else
  r4 = r5 - r6;

Itanium Code

Cycle 0:  cmp.eq p1,p2 = r0,r0;;
Cycle 1:  cmp.ne.and.orcm p1,p2 = c1,r0
          cmp.ne.and.orcm p1,p2 = c2,r0
          cmp.ne.and.orcm p1,p2 = c3,r0
          cmp.ne.and.orcm p1,p2 = c4,r0
Cycle 2:  (p1) add r1=r2,r3
          (p2) sub r4=r5,r6

[Figure: control flow testing c1, c2, c3, c4 in turn, leading to then or else]

• Parallel cmp.crel.and or cmp.crel.or write the same values to both predicates
• Use cmp.crel.and.orcm or cmp.crel.or.andcm for writing complementary predicates
• Also called DeMorgan type (for complementary output)
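The parallel compare types can be modelled as small update rules on a predicate pair. This is a sketch of the semantics as commonly documented for IA-64 (and.orcm clears p1 and sets p2 when the relation is false; or.andcm sets p1 and clears p2 when it is true); the function names are invented, and each condition is treated as true when nonzero.

```python
def cmp_and(p1, p2, rel):
    """and type: clear both predicates when the relation is false."""
    return (p1, p2) if rel else (False, False)

def cmp_or(p1, p2, rel):
    """or type: set both predicates when the relation is true."""
    return (True, True) if rel else (p1, p2)

def cmp_and_orcm(p1, p2, rel):
    """DeMorgan type: when the relation is false, p1 = 0 and p2 = 1."""
    return (p1, p2) if rel else (False, True)

def cmp_or_andcm(p1, p2, rel):
    """DeMorgan type: when the relation is true, p1 = 1 and p2 = 0."""
    return (True, False) if rel else (p1, p2)

def if_all(conditions):
    """Evaluate c1 && c2 && ... into complementary (then, else) predicates."""
    p1, p2 = True, False              # initialisation compare: p1=1, p2=0
    for c in conditions:              # order does not matter, so all of
        p1, p2 = cmp_and_orcm(p1, p2, c != 0)   # these can issue together
    return p1, p2
```

Each compare either leaves the pair untouched or writes the same fixed values, so the result is the same in any order; that is what lets the compares share one instruction group.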

Page 26: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Multiway Branches

w/o Speculation:
  ld8 r6 = (ra)
  (p1) br exit1
  ld8 r7 = (rb)
  (p3) br exit2
  ld8 r8 = (rc)
  (p5) br exit3

Hoisting Loads:
  ld8 r6 = (ra)
  ld8.s r7 = (rb)
  ld8.s r8 = (rc)
  (p1) br exit1
  chk r7, rec1
  (p3) br exit2
  chk r8, rec2
  (p5) br exit3

Multi-way Branches:
  ld8 r6 = (ra)
  ld8.s r7 = (rb)
  ld8.s r8 = (rc)
  (p2) chk r7, rec1
  (p4) chk r8, rec2
  (p1) br exit1
  (p3) br exit2
  (p5) br exit3

The sequential versions take 3 branch cycles; the multi-way version takes 1 branch cycle.

[Figure: control flow with predicate pairs P1/P2, P3/P4 and P5/P6]

• Multiway branches: more than 1 branch in a single cycle
– Itanium allows multiple “consecutive” B instructions in the same instruction group
– Allows n-way branching per cycle (Itanium and Itanium 2 have 3 branch units)
– Ordering matters if branch predicates are not mutually exclusive
• E.g. the BBB template enables 3 branches in one bundle

Page 27: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Branch and Prefetch Hints

• Compiler provides hints for the branch predictor by
– Completer in branch instructions, e.g. br.call.sptk
    • 4 completer types for static and dynamic predictions: sptk, spnt, dptk, dpnt
– Explicit brp instructions
• Compiler provides hints for instruction sequential prefetching
– Use completer in branch instructions, e.g. br.call.sptk.many
    • 2 completer types: many, few
    • many and few are implementation-specific
• Compiler directs predictor allocation
– For managing branch predictor resources
– Use completer in branch instructions, e.g. br.call.sptk.many.none
    • 2 completer types: none, clr
    • none: don’t deallocate; clr: deallocate branch info

Page 28: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Modulo Scheduling Support

• Will be discussed next
• Itanium features support modulo scheduling (or software pipelining)
– Full predication
– Special branch handling features
    • br.ctop (for for-loop with known loop count)
    • br.wtop (for while-loop)
– Register rotation: removes loop copy overhead
    • No modulo variable expansion, tighter code
– Predicate rotation/generation
    • Removes prologue & epilogue

Page 29: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

List Scheduling

P = Mem[A++] + C1;
Q = P * C2;
Y = P * C3 + (P + Q) * (P * C3);
Mem[B++] = Y;

Latency: Mem: 1 cycle; Adder: 2 cycles; Multiplier: 2 cycles

[Figure: dependency graph with operations X1 (ld), A1, A2, A3 (+), M1, M2, M3 (x) and X2 (st), and constants C1, C2, C3]

• Build dependency graph
• Assign a priority of “0” to all operations having no successors
• Assign each remaining operation the sum of the priority and latency of its successor. If there is more than one successor, assign the maximum.
• Schedule instructions based on priority

Priorities: X2 = 0, A3 = 1, M3 = 3, A2 = 5, M2 = 5, M1 = 7, A1 = 9, X1 = 11

Schedule = {X1, A1, M1, A2, M2, M3, A3, X2}
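The priority rule and the resulting schedule can be reproduced with a short script. This is a sketch under stated assumptions: the dependency edges are read off the source statements (M1 is taken to be P * C2 and M2 to be P * C3), each unit is assumed to be pipelined so that it can start one operation per cycle, and ties are broken by priority only.

```python
LATENCY = {"MEM": 1, "ADDER": 2, "MULT": 2}
UNIT = {"X1": "MEM", "A1": "ADDER", "M1": "MULT", "M2": "MULT",
        "A2": "ADDER", "M3": "MULT", "A3": "ADDER", "X2": "MEM"}
# Successors of each operation in the dependency graph (assumed edges).
SUCC = {"X1": ["A1"], "A1": ["M1", "M2", "A2"], "M1": ["A2"],
        "M2": ["M3", "A3"], "A2": ["M3"], "M3": ["A3"],
        "A3": ["X2"], "X2": []}

def priorities():
    """Priority = max over successors of (their priority + their latency)."""
    memo = {}
    def pri(op):
        if op not in memo:
            memo[op] = max((pri(s) + LATENCY[UNIT[s]] for s in SUCC[op]),
                           default=0)
        return memo[op]
    return {op: pri(op) for op in UNIT}

def list_schedule():
    """Greedy cycle-by-cycle scheduling, highest priority first."""
    pri = priorities()
    preds = {op: [p for p in SUCC if op in SUCC[p]] for op in UNIT}
    start, cycle = {}, 0
    while len(start) < len(UNIT):
        ready = [op for op in UNIT if op not in start and all(
            p in start and start[p] + LATENCY[UNIT[p]] <= cycle
            for p in preds[op])]
        busy = set()                      # units already used this cycle
        for op in sorted(ready, key=lambda o: -pri[o]):
            if UNIT[op] not in busy:
                start[op] = cycle
                busy.add(UNIT[op])
        cycle += 1
    return start

PRIORITY = priorities()
SCHEDULE = list_schedule()
```

Under these assumptions the priorities and start times agree with the values on the slides, with the last operation, X2, issuing at cycle 11.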

Page 30: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

List Scheduling

[Figure: the same dependency graph, with priorities X2 = 0, A3 = 1, M3 = 3, A2 = 5, M2 = 5, M1 = 7, A1 = 9, X1 = 11]

• LS (a heuristic) provides a near-optimal schedule
• But there is no guarantee of optimality, especially in terms of throughput

Reservation Table

Time | MEM | ADDER | MULT
0 | X1 | - | -
1 | - | A1 | -
2 | - | - | -
3 | - | - | M1
4 | - | - | M2
5 | - | A2 | -
6 | - | - | -
7 | - | - | M3
8 | - | - | -
9 | - | A3 | -
10 | - | - | -
11 | X2 | - | -

Page 31: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

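Scheduling
• If I want to use the same schedule, what is the minimum initiation interval?
• In the example, do I need to wait for 12 cycles?
• If not, how do I avoid collision?

Time | MEM | ADDER | MULT
0 | X1 | - | -
1 | - | A1 | -
2 | - | - | -
3 | - | - | M1
4 | - | - | M2
5 | - | A2 | -
6 | - | - | -
7 | - | - | M3
8 | - | - | -
9 | - | A3 | -
10 | - | - | -
11 | X2 | - | -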

Page 32: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Modulo Scheduling [RauGlaeser’81]

• A.k.a. “Polycyclic scheduling” or “Software pipelining”
• Exploit ILP among loop iterations to maximize
– Machine utilization
– Throughput
• Use a common schedule for the majority of iterations
• Overlap execution of consecutive iterations
• Constant initiation rate: Initiation Interval (II)
• Minimum II (MII) generates an optimal schedule with maximum throughput
• Originally developed for polycyclic architecture (or horizontal architecture, later known as VLIW) at TRW/ESL

Page 33: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Modulo Scheduling: Resource Constraint
• The optimal schedule is constrained by the number of available resources
• Determine ResII (Resource minimal initiation interval)
– Successive iterations will be scheduled ResII cycles apart
• N(i) is the number of uses of resource i in a loop
• C(i) is the number of resources of type i

$$\text{ResII} = \max\left(\frac{N(1)}{C(1)},\ \frac{N(2)}{C(2)},\ \frac{N(3)}{C(3)},\ \ldots\right)$$

Page 34: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Resource II

[Figure: the dependency graph of the list scheduling example, with operations X1 (ld), A1, A2, A3 (+), M1, M2, M3 (x) and X2 (st)]

• Assume 3 FUs
– 1 adder with 2-cycle latency
– 1 mult with 2-cycle latency
– 1 mem unit with 1-cycle latency
• Determine MII = Resource II

$$\text{MII} = \text{ResII} = \max\left(\frac{2}{1},\ \frac{3}{1},\ \frac{3}{1}\right) = 3$$
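The resource bound is a one-line computation. The sketch below uses a hypothetical helper name and rounds each ratio up, on the assumption that an initiation interval must be a whole number of cycles (the slides show the ratios without an explicit ceiling).

```python
from math import ceil

def res_ii(uses, units):
    """Resource-constrained minimum initiation interval.

    uses[r]  = N(r), how many times one iteration uses resource r
    units[r] = C(r), how many copies of resource r exist
    """
    return max(ceil(uses[r] / units[r]) for r in uses)

# The running example: 2 memory ops, 3 adds, 3 multiplies, one unit each.
example = res_ii({"MEM": 2, "ADDER": 3, "MULT": 3},
                 {"MEM": 1, "ADDER": 1, "MULT": 1})

# A loop with 3 adds on 2 adders and 2 multiplies on 1 multiplier.
other = res_ii({"ADDER": 3, "MULT": 2}, {"ADDER": 2, "MULT": 1})
```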

Page 35: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

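Modulo Reservation Table (MRT)

Original schedule, with each time slot mapped to (time mod 3):

Time | Modulo | MEM | ADDER | MULT
0 | 0 | X1 | - | -
1 | 1 | - | A1 | -
2 | 2 | - | - | -
3 | 0 | - | - | M1
4 | 1 | - | - | M2
5 | 2 | - | A2 | -
6 | 0 | - | - | -
7 | 1 | - | - | M3
8 | 2 | - | - | -
9 | 0 | - | A3 | -
10 | 1 | - | - | -
11 | 2 | X2 | - | -

New Schedule for 1 iteration (times 0 to 14) and the MRT (modulo 0 to 2, columns MEM, ADDER, MULT) are still empty at this point.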

Page 36: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

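Modulo Reservation Table (MRT)

Original schedule and new schedule for 1 iteration (X operations use MEM, A operations use ADDER, M operations use MULT):

Time | Modulo | Original | New
0 | 0 | X1 | X1
1 | 1 | A1 | A1
2 | 2 | - | -
3 | 0 | M1 | M1
4 | 1 | M2 | M2
5 | 2 | A2 | A2
6 | 0 | - | -
7 | 1 | M3 | -
8 | 2 | - | M3
9 | 0 | A3 | -
10 | 1 | - | -
11 | 2 | X2 | -
12 | 0 | - | A3
13 | 1 | - | -
14 | 2 | - | X2

MRT

Modulo | MEM | ADDER | MULT
0 | X1 | A3 | M1
1 | - | A1 | M2
2 | X2 | A2 | M3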
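Filling a modulo reservation table amounts to delaying each operation until its slot (time mod II) is free on its unit. The sketch below is one simple way to do that, not necessarily the procedure used to build the slides: the operation order is taken from the list schedule, the dependency edges are assumed from the source statements, loop-carried dependences are ignored, and II is assumed to be at least ResII (otherwise the search for a free slot would never end).

```python
LATENCY = {"MEM": 1, "ADDER": 2, "MULT": 2}
UNIT = {"X1": "MEM", "A1": "ADDER", "M1": "MULT", "M2": "MULT",
        "A2": "ADDER", "M3": "MULT", "A3": "ADDER", "X2": "MEM"}
# Predecessors of each operation (assumed edges of the dependency graph).
PREDS = {"X1": [], "A1": ["X1"], "M1": ["A1"], "M2": ["A1"],
         "A2": ["A1", "M1"], "M3": ["A2", "M2"],
         "A3": ["M2", "M3"], "X2": ["A3"]}
ORDER = ["X1", "A1", "M1", "M2", "A2", "M3", "A3", "X2"]

def modulo_schedule(ii, order=ORDER):
    """Place each op at the earliest time whose MRT slot is free."""
    mrt = {unit: [None] * ii for unit in LATENCY}
    start = {}
    for op in order:
        # Earliest time at which all operands are available.
        t = max((start[p] + LATENCY[UNIT[p]] for p in PREDS[op]), default=0)
        # Slide forward until this unit is free at (t mod ii).
        while mrt[UNIT[op]][t % ii] is not None:
            t += 1
        mrt[UNIT[op]][t % ii] = op
        start[op] = t
    return start, mrt

START, MRT = modulo_schedule(3)
```

With II = 3 this delays M3 to time 8, A3 to time 12 and X2 to time 14, and every MRT slot on the adder and multiplier ends up occupied.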

Page 37: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

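Modulo Scheduled Loop

Prolog: times 0 to 11. Kernel, steady state (MRT schedule): from time 12 on, repeating every 3 cycles. Numbers in parentheses are iteration numbers.

Modulo | Time | MEM | ADDER | MULT
0 | 0 | X1 (1) | - | -
1 | 1 | - | A1 (1) | -
2 | 2 | - | - | -
0 | 3 | X1 (2) | - | M1 (1)
1 | 4 | - | A1 (2) | M2 (1)
2 | 5 | - | A2 (1) | -
0 | 6 | X1 (3) | - | M1 (2)
1 | 7 | - | A1 (3) | M2 (2)
2 | 8 | - | A2 (2) | M3 (1)
0 | 9 | X1 (4) | - | M1 (3)
1 | 10 | - | A1 (4) | M2 (3)
2 | 11 | - | A2 (3) | M3 (2)
0 | 12 | X1 (5) | A3 (1) | M1 (4)
1 | 13 | - | A1 (5) | M2 (4)
2 | 14 | X2 (1) | A2 (4) | M3 (3)
0 | 15 | X1 (6) | A3 (2) | M1 (5)
1 | 16 | - | A1 (6) | M2 (5)
2 | 17 | X2 (2) | A2 (5) | M3 (4)
0 | 18 | X1 (7) | A3 (3) | M1 (6)
1 | 19 | - | A1 (7) | M2 (6)
2 | 20 | X2 (3) | A2 (6) | M3 (5)
0 | 21 | X1 (8) | A3 (4) | M1 (7)
1 | 22 | - | A1 (8) | M2 (7)
2 | 23 | X2 (4) | A2 (7) | M3 (6)
0 | 24 | X1 (9) | A3 (5) | M1 (8)
1 | 25 | - | A1 (9) | M2 (8)
2 | 26 | X2 (5) | A2 (8) | M3 (7)
0 | 27 | X1 (10) | A3 (6) | M1 (9)
1 | 28 | - | A1 (10) | M2 (9)
2 | 29 | X2 (6) | A2 (9) | M3 (8)
0 | 30 | X1 (11) | A3 (7) | M1 (10)
1 | 31 | - | A1 (11) | M2 (10)
2 | 32 | X2 (7) | A2 (10) | M3 (9)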

Page 38: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Modulo Scheduled Loop

The loop starts with the same prolog and kernel as before (times 0 to 32). The table below shows how it ends for N iterations: the last kernel pass starts iteration N, and the epilog drains the remaining iterations without starting new ones.

Modulo | Time | MEM | ADDER | MULT
0 | T+0 | X1 (N-2) | A3 (N-6) | M1 (N-3)
1 | T+1 | - | A1 (N-2) | M2 (N-3)
2 | T+2 | X2 (N-6) | A2 (N-3) | M3 (N-4)
0 | T+3 | X1 (N-1) | A3 (N-5) | M1 (N-2)
1 | T+4 | - | A1 (N-1) | M2 (N-2)
2 | T+5 | X2 (N-5) | A2 (N-2) | M3 (N-3)
0 | T+6 | X1 (N) | A3 (N-4) | M1 (N-1)
1 | T+7 | - | A1 (N) | M2 (N-1)
2 | T+8 | X2 (N-4) | A2 (N-1) | M3 (N-2)
0 | T+9 | - | A3 (N-3) | M1 (N)
1 | T+10 | - | - | M2 (N)
2 | T+11 | X2 (N-3) | A2 (N) | M3 (N-1)
0 | T+12 | - | A3 (N-2) | -
1 | T+13 | - | - | -
2 | T+14 | X2 (N-2) | - | M3 (N)
0 | T+15 | - | A3 (N-1) | -
1 | T+16 | - | - | -
2 | T+17 | X2 (N-1) | - | -
0 | T+18 | - | A3 (N) | -
1 | T+19 | - | - | -
2 | T+20 | X2 (N) | - | -

Page 39: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Another Modulo Schedule Example

[Figure: dependency graph with inputs A, B, C, D, E and output Z; operations A1, A2, A3 (+) and M1, M2 (x); priorities 0, 1, 1, 3, 3]

Given 2 adders (1-cycle) & 1 multiplier (2-cycle)

MII = max(3/2, 2/1) = 2

Modulo Reservation Table

Modulo | ADDER1 | ADDER2 | MULT
0 | A1 (3) | A2 (3) | M2 (2)
1 | A3 (1) | - | M1 (3)

[Figure: the resulting schedule with prolog, 5x kernel and epilog. Multiplier is fully utilized]

Page 40: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

How to Perform Register Allocation?
• We are overlapping multiple iterations in one schedule.
– Example: iterations 1 to 5 are alive at the same time
• Registers from multiple iterations are alive during a period of time

MRT

Modulo | MEM | ADDER | MULT
0 | X1 (5) | A3 (1) | M1 (4)
1 | - | A1 (5) | M2 (4)
2 | X2 (1) | A2 (4) | M3 (3)

Page 41: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Modulo Variable Expansion

• Analyze the “lifetime” of each architectural register
• Unroll the loop to enable the modulo schedule
• R5 needs to stay alive for 8 cycles = ⌈8/3⌉ = 3 MII (i.e. unroll 3 times)

Register lifetimes in cycles: r1 (1), r2 (4), r3 (2), r4 (3), r5 (8), r6 (4), r7 (2)

The cycle numbers assume WAR is allowed in the same cycle

Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | - | -
1 | 1 | - | add r2, r1, $c1 | -
2 | 2 | - | - | -
0 | 3 | ld r11, (A)++ | - | mul r3, r2, $c2
1 | 4 | - | add r12, r11, $c1 | mul r5, r2, $c3
2 | 5 | - | add r4, r2, r3 | -
0 | 6 | X1 (3) | - | mul r13, r12, $c2
1 | 7 | - | A1 (3) | mul r15, r12, $c3
2 | 8 | - | add r14, r12, r13 | mul r6, r4, r5
0 | 9 | X1 (4) | - | M1 (3)
1 | 10 | - | A1 (4) | M2 (3)
2 | 11 | - | A2 (3) | mul r16, r14, r15
0 | 12 | X1 (5) | add r7, r5, r6 | M1 (4)
1 | 13 | - | A1 (5) | M2 (4)
2 | 14 | st r7, (B)++ | A2 (4) | M3 (3)

Page 42: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Post MVE code

Kernel (unrolled 3 times): times 12 to 20.

Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | - | -
1 | 1 | - | add r2, r1, $c1 | -
2 | 2 | - | - | -
0 | 3 | ld r11, (A)++ | - | mul r3, r2, $c2
1 | 4 | - | add r12, r11, $c1 | mul r5, r2, $c3
2 | 5 | - | add r4, r2, r3 | -
0 | 6 | ld r21, (A)++ | - | mul r13, r12, $c2
1 | 7 | - | add r22, r21, $c1 | mul r15, r12, $c3
2 | 8 | - | add r14, r12, r13 | mul r6, r4, r5
0 | 9 | ld r1, (A)++ | - | mul r23, r22, $c2
1 | 10 | - | add r2, r1, $c1 | mul r25, r22, $c3
2 | 11 | - | add r24, r22, r23 | mul r16, r14, r15
0 | 12 | ld r11, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
1 | 13 | - | add r12, r11, $c1 | mul r5, r2, $c3
2 | 14 | st r7, (B)++ | add r4, r2, r3 | mul r26, r24, r25
0 | 15 | ld r21, (A)++ | add r17, r15, r16 | mul r13, r12, $c2
1 | 16 | - | add r22, r21, $c1 | mul r15, r12, $c3
2 | 17 | st r17, (B)++ | add r14, r12, r13 | mul r6, r4, r5
0 | 18 | ld r1, (A)++ | add r27, r25, r26 | mul r23, r22, $c2
1 | 19 | - | add r2, r1, $c1 | mul r25, r22, $c3
2 | 20 | st r27, (B)++ | add r24, r22, r23 | mul r16, r14, r15

Times 21 to 29 repeat the nine bundles of times 12 to 20.

Page 43: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Register Allocation for MVE

• To save registers, not every register needs to be expanded
• Calculate the lifetime of each register to determine if a new register is needed across iterations (the formula assumes WAR in the same instruction bundle is allowed)
• # of copies = (MII % ⌈lifetime/MII⌉ == 0) ? ⌈lifetime/MII⌉ : MII
– R1 is alive for 1 cycle = ⌈1/3⌉ = 1 MII (need 1 copy)
– R2 is alive for 4 cycles = ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
– R3 is alive for 2 cycles = ⌈2/3⌉ = 1 MII (need 1 copy)
– R4 is alive for 3 cycles = ⌈3/3⌉ = 1 MII (need 1 copy)
– R5 is alive for 8 cycles = ⌈8/3⌉ = 3 MII (need 3 copies)
– R6 is alive for 4 cycles = ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
– R7 is alive for 2 cycles = ⌈2/3⌉ = 1 MII (need 1 copy)
• 13 registers used, instead of 21 with the same unrolling degree
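The copy-count rule can be checked numerically. In this sketch the helper name is made up, and the unroll degree is passed as its own parameter: the slide writes MII in both positions of the formula, which coincides with the unroll degree in this example because both equal 3. Treating them as separate quantities is an assumption of the sketch.

```python
from math import ceil

def copies(lifetime, ii, unroll):
    """Registers needed for one value under modulo variable expansion."""
    q = ceil(lifetime / ii)          # lifetime measured in initiation intervals
    # q copies suffice only if they cycle evenly within the unrolled kernel.
    return q if unroll % q == 0 else unroll

II = 3
LIFETIMES = {"r1": 1, "r2": 4, "r3": 2, "r4": 3, "r5": 8, "r6": 4, "r7": 2}

# Unroll degree: the longest lifetime, measured in initiation intervals.
UNROLL = max(ceil(life / II) for life in LIFETIMES.values())

COPIES = {reg: copies(life, II, UNROLL) for reg, life in LIFETIMES.items()}
TOTAL = sum(COPIES.values())
NAIVE = UNROLL * len(LIFETIMES)      # expanding every register
```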

Page 44: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

MVE (reallocate registers)

Kernel (unrolled 3 times): times 12 to 20.

The cycle numbers assume WAR is allowed in the same cycle

Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | - | -
1 | 1 | - | add r2, r1, $c1 | -
2 | 2 | - | - | -
0 | 3 | ld r1, (A)++ | - | mul r3, r2, $c2
1 | 4 | - | add r12, r1, $c1 | mul r5, r2, $c3
2 | 5 | - | add r4, r2, r3 | -
0 | 6 | ld r1, (A)++ | - | mul r3, r12, $c2
1 | 7 | - | add r22, r1, $c1 | mul r15, r12, $c3
2 | 8 | - | add r4, r12, r3 | mul r6, r4, r5
0 | 9 | ld r1, (A)++ | - | mul r3, r22, $c2
1 | 10 | - | add r2, r1, $c1 | mul r25, r22, $c3
2 | 11 | - | add r4, r22, r3 | mul r16, r4, r15
0 | 12 | ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
1 | 13 | - | add r12, r1, $c1 | mul r5, r2, $c3
2 | 14 | st r7, (B)++ | add r4, r2, r3 | mul r26, r4, r25
0 | 15 | ld r1, (A)++ | add r7, r15, r16 | mul r3, r12, $c2
1 | 16 | - | add r22, r1, $c1 | mul r15, r12, $c3
2 | 17 | st r7, (B)++ | add r4, r12, r3 | mul r6, r4, r5
0 | 18 | ld r1, (A)++ | add r7, r25, r26 | mul r3, r22, $c2
1 | 19 | - | add r2, r1, $c1 | mul r25, r22, $c3
2 | 20 | st r7, (B)++ | add r4, r22, r3 | mul r16, r4, r15

Times 21 to 29 repeat the nine bundles of times 12 to 20.

Page 45: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Final Modulo Schedule

Prolog Code (12 instruction bundles)

Kernel: 9 instruction bundles (**Branch instruction not shown)

MEM | ADDER | MULT
ld r11, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
- | add r12, r11, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r26, r24, r25
ld r21, (A)++ | add r17, r15, r16 | mul r13, r12, $c2
- | add r22, r21, $c1 | mul r15, r12, $c3
st r17, (B)++ | add r14, r12, r13 | mul r6, r4, r5
ld r1, (A)++ | add r27, r25, r26 | mul r23, r22, $c2
- | add r2, r1, $c1 | mul r25, r22, $c3
st r27, (B)++ | add r24, r22, r23 | mul r16, r14, r15

Epilog Code (12 instruction bundles)

Page 46: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Final Modulo Schedule (Reallocate Registers)

Prolog Code (12 instruction bundles)

Kernel: 9 instruction bundles (**Branch instruction not shown)

MEM | ADDER | MULT
ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
- | add r12, r1, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r26, r4, r25
ld r1, (A)++ | add r7, r15, r16 | mul r3, r12, $c2
- | add r22, r1, $c1 | mul r15, r12, $c3
st r7, (B)++ | add r4, r12, r3 | mul r6, r4, r5
ld r1, (A)++ | add r7, r25, r26 | mul r3, r22, $c2
- | add r2, r1, $c1 | mul r25, r22, $c3
st r7, (B)++ | add r4, r22, r3 | mul r16, r4, r15

Epilog Code (12 instruction bundles)

Page 47: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Issues with Modulo Variable Expansion
• Many architecture registers are needed
• Code size gets bigger when more unrolling is needed
• Alternative solution: Rotating register file
– A hardware technique
– Solves the problem without code duplication
– Similar to register window plus renaming: keep old iteration values on the stack (Itanium calls the hardware Register Stack Engine or RSE)

Page 48: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Intention of Using Rotation Registers
• Use exactly the same schedule (below) for all code, including
– Kernel code
– Prolog code
– Epilog code
• The “registers” need to be re-allocated
• Registers “rotate” per iteration!!!

**Branch instruction not shown

MEM | ADDER | MULT
ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
- | add r2, r1, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r6, r4, r5

Page 49: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines


Idea of Rotation Register (Original Schedule)

ite

Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

1 3 mul r43, r42, $c2

4 mul r45, r42, $c3

5 add r44, r42, r43

2 6

7

8 mul r46, r44, r45

3 9

10

11

4 12 add r47, r45, r46

13

14 st r47, (B)++

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 50: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines


Original Code Schedule

ite

Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

1 3 mul r43, r42, $c2

4 mul r45, r42, $c3

5 add r44, r42, r43

2 6

7

8 mul r46, r44, r45

3 9

10

11

4 12 add r47, r45, r46

13

14 st r47, (B)++

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 51: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines


Assume HW Rotation Registers

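ite | Time | Mem | Adder | Multiplier
0 | 0 | ld r41, (A)++ | - | -
0 | 1 | - | add r42, r41, $c1 | -
0 | 2 | - | - | -
1 | 3 | - | - | mul r44, r43, $c2
1 | 4 | - | - | mul r45, r43, $c3
1 | 5 | - | add r52, r43, r44 | -
2 | 6 | - | - | -
2 | 7 | - | - | -
2 | 8 | - | - | mul r48, r53, r46
3 | 9 | - | - | -
3 | 10 | - | - | -
3 | 11 | - | - | -
4 | 12 | - | add r51, r48, r50 | -
4 | 13 | - | - | -
4 | 14 | st r51, (B)++ | - | -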

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers
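The renaming in the rotated schedule follows one rule: a value written as rD in stage s is read as r(D + k) in stage s + k. The Python sketch below checks the slide's register names against that rule. The encoding of each operation as (destination, sources, stage) and the helper names are assumptions made for illustration; constants and memory operands are left out.

```python
# (dest, sources, stage) for each operation of the rotated schedule.
OPS = [
    ("r41", [], 0),              # ld  r41, (A)++
    ("r42", ["r41"], 0),         # add r42, r41, $c1
    ("r44", ["r43"], 1),         # mul r44, r43, $c2
    ("r45", ["r43"], 1),         # mul r45, r43, $c3
    ("r52", ["r43", "r44"], 1),  # add r52, r43, r44
    ("r48", ["r53", "r46"], 2),  # mul r48, r53, r46
    ("r51", ["r48", "r50"], 4),  # add r51, r48, r50
    (None, ["r51"], 4),          # st  r51, (B)++
]


def reg_num(name):
    return int(name[1:])


def find_producers(src, use_stage, ops=OPS):
    """Return (dest, def_stage) pairs whose value is visible as `src` in `use_stage`."""
    hits = []
    for dest, _, def_stage in ops:
        if dest is None or def_stage > use_stage:
            continue
        # One rotation per stage: the name advances by one for each stage elapsed.
        if reg_num(dest) + (use_stage - def_stage) == reg_num(src):
            hits.append((dest, def_stage))
    return hits


for dest, sources, stage in OPS:
    for src in sources:
        print(f"stage {stage}: {src} <- {find_producers(src, stage)}")
```

Every source resolves to exactly one producer. For example, the r48 read by the stage-4 add is the value the stage-1 mul wrote as r45, three rotations earlier.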

Page 52: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

52

Rotation Registers in Itanium Processors

General Purpose Registers: 0 to 31 static, 32 to 127 stacked (rotating); bits 63..0
FP Registers: 0 to 31 static, 32 to 127 stacked (rotating); bits 81..0
Predicate Registers: 0 to 15 static, 16 to 63 stacked (rotating); bit 0

Page 53: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

53

Register Rotation (Prolog i0)

ite

Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 54: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

54

Register Rotation (Prolog i1)

ite

Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

1 3 ld r41, (A)++ mul r44, r43, $c2

4 add r42, r41, $c1 mul r45, r43, $c3

5 add r52, r43, r44

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 55: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

55

Register Rotation (Prolog i2)

ite

Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

1 3 ld r41, (A)++ mul r44, r43, $c2

4 add r42, r41, $c1 mul r45, r43, $c3

5 add r52, r43, r44

2 6 ld r41, (A)++ mul r44, r43, $c2

7 add r42, r41, $c1 mul r45, r43, $c3

8 add r52, r43, r44 mul r48, r53, r46

3 9

10

11

4 12

13

14

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 56: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

56

Register Rotation (Prolog i3)

ite

Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

1 3 ld r41, (A)++ mul r44, r43, $c2

4 add r42, r41, $c1 mul r45, r43, $c3

5 add r52, r43, r44

2 6 ld r41, (A)++ mul r44, r43, $c2

7 add r42, r41, $c1 mul r45, r43, $c3

8 add r52, r43, r44 mul r48, r53, r46

3 9 ld r41, (A)++ mul r44, r43, $c2

10 add r42, r41, $c1 mul r45, r43, $c3

11 add r52, r43, r44 mul r48, r53, r46

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 57: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

57

Register Rotation (Kernel Steady State i4)

ite | Time | Mem | Adder | Multiplier
0 | 0 | ld r41, (A)++ | - | -
0 | 1 | - | add r42, r41, $c1 | -
0 | 2 | - | - | -
1 | 3 | ld r41, (A)++ | - | mul r44, r43, $c2
1 | 4 | - | add r42, r41, $c1 | mul r45, r43, $c3
1 | 5 | - | add r52, r43, r44 | -
2 | 6 | ld r41, (A)++ | - | mul r44, r43, $c2
2 | 7 | - | add r42, r41, $c1 | mul r45, r43, $c3
2 | 8 | - | add r52, r43, r44 | mul r48, r53, r46
3 | 9 | ld r41, (A)++ | - | mul r44, r43, $c2
3 | 10 | - | add r42, r41, $c1 | mul r45, r43, $c3
3 | 11 | - | add r52, r43, r44 | mul r48, r53, r46
4 | 12 | ld r41, (A)++ | add r51, r48, r50 | mul r44, r43, $c2
4 | 13 | - | add r42, r41, $c1 | mul r45, r43, $c3
4 | 14 | st r51, (B)++ | add r52, r43, r44 | mul r48, r53, r46

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Registers wrapped around if exceeding specified bound

Page 58: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

58

• Execute many iterations in the kernel …

Register Rotation (Kernel)

Page 59: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

59

Register Rotation (Kernel to Epilog, i<-4>)

ite

Time Mem Adder Multiplier

-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

N-13 add r42, r41, $c1 mul r45, r43, $c3

N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-3 N-11

N-10

N-9

-2 N-8

N-7

N-6

-1 N-5

N-4

N-3

0 N-2

N-1

N

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 60: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

60

Register Rotation (Kernel to Epilog, i<-3>)

ite

Time Mem Adder Multiplier

-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

N-13 add r42, r41, $c1 mul r45, r43, $c3

N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-3 N-11 add r51, r48, r50 mul r44, r43, $c2

N-10 mul r45, r43, $c3

N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-2 N-8

N-7

N-6

-1 N-5

N-4

N-3

0 N-2

N-1

N

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 61: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

61

Register Rotation (Kernel to Epilog, i<-2>)

ite

Time Mem Adder Multiplier

-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

N-13 add r42, r41, $c1 mul r45, r43, $c3

N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-3 N-11 add r51, r48, r50 mul r44, r43, $c2

N-10 mul r45, r43, $c3

N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-2 N-8 add r51, r48, r50

N-7

N-6 st r51, (B)++ mul r48, r53, r46

-1 N-5

N-4

N-3

0 N-2

N-1

N

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 62: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

62

Register Rotation (Kernel to Epilog, i<-1>)

ite

Time Mem Adder Multiplier

-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

N-13 add r42, r41, $c1 mul r45, r43, $c3

N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-3 N-11 add r51, r48, r50 mul r44, r43, $c2

N-10 mul r45, r43, $c3

N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-2 N-8 add r51, r48, r50

N-7

N-6 st r51, (B)++ mul r48, r53, r46

-1 N-5 add r51, r48, r50

N-4

N-3 st r51, (B)++

0 N-2

N-1

N

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 63: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

63

Register Rotation (Kernel to Epilog, final ite)

ite

Time Mem Adder Multiplier

-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

N-13 add r42, r41, $c1 mul r45, r43, $c3

N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-3 N-11 add r51, r48, r50 mul r44, r43, $c2

N-10 mul r45, r43, $c3

N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

-2 N-8 add r51, r48, r50

N-7

N-6 st r51, (B)++ mul r48, r53, r46

-1 N-5 add r51, r48, r50

N-4

N-3 st r51, (B)++

0 N-2 add r51, r48, r50

N-1

N st r51, (B)++

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 64: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

64

Modulo Schedule with Rotating Register Support

• No loop unrolling required (but requires careful register allocation)
• Tighter code, saving space
• However, there are still prolog and epilog codes
• Can we use the same schedule for prolog/epilog?
– Use stage predicates to execute instructions conditionally
– Requires new ISA support (Itanium)

Mem | Adder | Multiplier
ld r41, (A)++ | add r51, r48, r50 | mul r44, r43, $c2
- | add r42, r41, $c1 | mul r45, r43, $c3
st r51, (B)++ | add r52, r43, r44 | mul r48, r53, r46
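A small Python sketch of what stage predicates buy: on kernel pass k, stage s is enabled only if source iteration k - s exists, so the prolog and epilog fall out of the same three-row kernel. The stage assignment below is read off the schedule (stage 3 issues nothing); the function names and the iteration count `N` are hypothetical.

```python
# Stage of each kernel operation; stage s is guarded by predicate p(16 + s).
STAGE_OPS = {
    0: ["ld r41, (A)++", "add r42, r41, $c1"],
    1: ["mul r44, r43, $c2", "mul r45, r43, $c3", "add r52, r43, r44"],
    2: ["mul r48, r53, r46"],
    3: [],                                   # no operation issues in stage 3
    4: ["add r51, r48, r50", "st r51, (B)++"],
}


def active_stages(kernel_pass, n_iterations):
    """Stages whose predicate is 1 on a given kernel pass (passes count from 0)."""
    return [s for s in STAGE_OPS if 0 <= kernel_pass - s < n_iterations]


def issued_ops(kernel_pass, n_iterations):
    """Operations that actually execute on this pass; the rest are squashed."""
    return [op for s in active_stages(kernel_pass, n_iterations) for op in STAGE_OPS[s]]


N = 196  # hypothetical number of source iterations
for k in (0, 1, 2, 4, N, N + 3):
    print(k, [f"p{16 + s}" for s in active_stages(k, N)])
```

Passes 0 to 3 behave as the prolog (stages switch on one by one), passes from 4 onward run the full kernel, and the last four passes behave as the epilog (stages switch off one by one).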

Page 65: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

65

Predicated Instruction Execution (Prolog i0)

ite

Time

Mem Adder Multiplier

0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

1 (p16) add r42, r41, $c1 mul r45, r43, $c3

2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

1 3

4

5

2 6

7

8

3 9

10

11

4 12

13

14

Don’t execute shaded instructions

cc0: only issue ld

cc1: only issue add

cc2: no issue

Page 66: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

66

Predicated Prolog (Prolog i1)ite

Time

Mem Adder Multiplier

0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

1 (p16) add r42, r41, $c1 mul r45, r43, $c3

2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2

4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46

2 6

7

8

3 9

10

11

4 12

13

14

cc3: ld(i1) & mul(i0)

cc4: add(i1) & mul(i0)

cc5: add(i0)

Note that stage predicates also “rotate” per iteration

Page 67: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

67

Predicated Prolog (Prolog i2)ite

Time

Mem Adder Multiplier

0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

1 (p16) add r42, r41, $c1 mul r45, r43, $c3

2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2

4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46

2 6 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2

7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

8 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

3 9

10

11

4 12

13

14

cc6: ld(i2) & mul(i1)

cc7: add(i2) & mul(i1)

cc8: add(i1) & mul(i0)

Page 68: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

68

Predicated Prolog (Prolog i3)ite

Time

Mem Adder Multiplier

0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2

1 (p16) add r42, r41, $c1 mul r45, r43, $c3

2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46

1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2

4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46

2 6 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2

7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

8 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

3 9 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2

10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

11 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

4 12

13

14

cc9: ld(i3) & mul(i2)

cc10: add(i3) & mul(i2)

cc11: add(i2) & mul(i1)

Page 69: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

69

Predicated Kernel (i4)

ite | Time | Mem | Adder | Multiplier
0 | 0 | (p16) ld r41, (A)++ | add r51, r48, r50 | mul r44, r43, $c2
0 | 1 | - | (p16) add r42, r41, $c1 | mul r45, r43, $c3
0 | 2 | st r51, (B)++ | add r52, r43, r44 | mul r48, r53, r46
1 | 3 | (p16) ld r41, (A)++ | add r51, r48, r50 | (p17) mul r44, r43, $c2
1 | 4 | - | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
1 | 5 | st r51, (B)++ | (p17) add r52, r43, r44 | mul r48, r53, r46
2 | 6 | (p16) ld r41, (A)++ | add r51, r48, r50 | (p17) mul r44, r43, $c2
2 | 7 | - | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
2 | 8 | st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46
3 | 9 | (p16) ld r41, (A)++ | add r51, r48, r50 | (p17) mul r44, r43, $c2
3 | 10 | - | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
3 | 11 | st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46
4 | 12 | (p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
4 | 13 | - | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
4 | 14 | (p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

cc12: ld(i4) & add(i0) & mul(i3)
cc13: add(i4) & mul(i3)
cc14: st(i0) & add(i3) & mul(i2)

(p20), not (p19), guards the final add and st: the predicate keeps rotating through stage 3, where no operation issues, so an iteration's bit has reached p20 by iteration 4

Page 70: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

70

• Execute many iterations in the kernel …

Register Rotation (Kernel)

Page 71: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

71

Predicated Epilog (i<-4>)ite

Time Mem Adder Multiplier

-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-3 N-11

N-10

N-9

-2 N-8

N-7

N-6

-1 N-5

N-4

N-3

0 N-2

N-1

N

Page 72: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

72

Predicated Epilog (i<-3>)ite

Time Mem Adder Multiplier

-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-2 N-8

N-7

N-6

-1 N-5

N-4

N-3

0 N-2

N-1

N

Page 73: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

73

Predicated Epilog (i<-2>)ite

Time Mem Adder Multiplier

-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-1 N-5

N-4

N-3

0 N-2

N-1

N

Page 74: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

74

Predicated Epilog (i<-1>)ite

Time Mem Adder Multiplier

-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-1 N-5 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-3 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

0 N-2

N-1

N

Page 75: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

75

Predicated Epilog (final iteration)ite

Time Mem Adder Multiplier

-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

-1 N-5 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N-3 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

0 N-2 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2

N-1 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3

N (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46

Page 76: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

76

Final Modulo Schedule (Itanium-like)

• Before entering the loop, set p16 = 1 (p16 is the first rotating predicate register)
• When the modulo-scheduled loop branch (e.g. br.ctop) is encountered
– p63 is set to 1 by hardware in the prolog code (see next slide)
– All registers (rotating registers and rotating predicate registers) rotate as each stage (iteration) advances
• Only 3 Itanium instruction bundles (= 3 VLIWs) needed
– No prolog or epilog code
– No modulo variable expansion that stresses registers and blows up code size

mov ar.lc = 196 // loop count
mov ar.ec = 5 // epilog stages + 1
mov pr.rot = 0x10000 // special inst: set pr[16]=1 and p[63:17]=0

L1top:
(p16) r41 = (A)++ | (p20) r51 = r48 + r50 | (p17) r44 = r43 * $c2
- | (p16) r42 = r41 + $c1 | (p17) r45 = r43 * $c3
(p20) (B)++ = r51 | (p17) r52 = r43 + r44 | (p18) r48 = r53 * r46
br.ctop L1top
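A minimal Python sketch of how a counted loop closed by br.ctop sequences the stage predicates. The branch rule coded here is an assumption based on the commonly documented Itanium behavior: the branch is taken while LC is nonzero or EC is greater than 1, p63 is set to 1 only while LC is nonzero, and the predicates rotate once per pass. `run_ctop_loop` is a hypothetical name. Under that rule the kernel runs LC + EC times, so the sketch uses LC = 195 to reproduce a 200-pass loop with 196 loads and stores; an LC of 196 would give 201 passes.

```python
def run_ctop_loop(lc, ec, n_stages=5):
    """Return the stage predicates [p16, p17, ...] seen by each kernel pass."""
    # pr.rot = 0x10000 sets p16 = 1 and clears the other rotating predicates.
    preds = [1] + [0] * (n_stages - 1)
    history = []
    while True:
        history.append(list(preds))
        if lc != 0:               # prolog/kernel: start another source iteration
            lc -= 1
            p63, taken = 1, True
        elif ec > 1:              # epilog: drain the pipeline
            ec -= 1
            p63, taken = 0, True
        else:                     # last pass: fall out of the loop
            ec = max(ec - 1, 0)
            p63, taken = 0, False
        # Rotation: p63 becomes p16, p16 becomes p17, and so on.
        preds = [p63] + preds[:-1]
        if not taken:
            return history


passes = run_ctop_loop(lc=195, ec=5)
print(len(passes))                      # kernel passes
print(sum(p[0] for p in passes))        # passes with p16 = 1 (loads issued)
print(sum(p[4] for p in passes))        # passes with p20 = 1 (stores issued)
print(passes[0], passes[4], passes[-1])
```

The first pass sees only p16 set, the fifth pass sees all five stage predicates set, and the last pass sees only p20 set, which matches the prolog, kernel and epilog behavior without any separate prolog or epilog code.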

Page 77: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20, p19, p18, p17 = 0; p16 = 1; p63 = 1; p62 = 0

Stage 0 (Prolog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

After the first iteration: LC = 195, EC = 5

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 78: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20, p19, p18, p17 = 0; p16 = 1; p63 = 1; p62 = 0

Stage 1 (Prolog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 2nd iteration

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 79: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20, p19, p18, p17 = 0; p16 = 1; p63, p62 = 1; p61 = 0

Stage 2 (Prolog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 3rd iteration

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 80: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20, p19, p18, p17 = 0; p16 = 1; p63, p62, p61 = 1; p60 = 0

Stage 3 (Prolog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 4th iteration

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 81: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20, p19, p18, p17 = 0; p16 = 1; p63, p62, p61, p60 = 1; p59 = 0

Stage 4 (Kernel)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 5th iteration

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 82: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

In the Kernel

• After Another 191 Iterations …..

Page 83: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20 to p16 = 1; p63 to p55 = 1

Stage 195 (Kernel)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 196th iteration: LC = 0, EC = 5

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 84: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20 to p16 = 1; p63, p62, p61 = 1; p60 = 0; p59 to p55 = 1

Stage 195 (Kernel)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

After the 196th iteration: EC = 4

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 85: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20 to p16 = 1; p63, p62, p61 = 1; p60 = 0; p59 to p55 = 1

Stage 196 (Epilog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 197th iteration: EC = 4

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 86: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20 to p16 = 1; p63, p62, p61 = 1; p60, p59 = 0; p58 to p55 = 1

Stage 197 (Epilog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 198th iteration: EC = 3

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 87: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20 to p16 = 1; p63, p62, p61 = 1; p60, p59, p58 = 0; p57, p56, p55 = 1

Stage 198 (Epilog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 199th iteration: EC = 2

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 88: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20 to p16 = 1; p63, p62, p61 = 1; p60, p59, p58, p57 = 0; p56, p55 = 1

Stage 199 (Epilog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

Before the 200th iteration (last iteration): EC = 1

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

Page 89: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

Counted Modulo-scheduled Loop

Predicate values (figure): p20 to p16 = 1; p63, p62, p61 = 1; p60 to p56 = 0; p55 = 1

Stage 199 (Epilog)

Mem | Adder | Multiplier
(p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
- | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
(p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

After the 200th iteration (last iteration): EC = 0

Rotating predicate registers (figure labels): p16, p63, p17, p18, p19, p20

• The “br.ctop” instruction exits the loop

Page 90: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

90

Modulo Scheduling Example

Loop {
P = A + B
Q = C + D
X = P x E
Y = P x Q
Z = X + Y
}

Step 1: Data flow graph

[Figure: data flow graph. A1 (+) adds A and B; A2 (+) adds C and D; M1 (x) multiplies the A1 result by E; M2 (x) multiplies the A1 and A2 results; A3 (+) adds the M1 and M2 results to produce Z.]

Page 91: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

91

Modulo Scheduling

Step 2: Generate a list schedule

[Figure: data flow graph with nodes A1, A2, M1, M2 and A3 (inputs A, B, C, D, E; output Z), annotated with the numbers 0, 1, 1, 3 and 3.]

Execution units:
2 Adders – 1 cycle latency
1 Multiplier – 2 cycle latency
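A minimal Python sketch of a greedy list scheduler for this example, using the unit counts and latencies given above. It assumes the multiplier is pipelined (it can accept a new operation every cycle) and that operations are prioritized in program order; the table encoding and the function name are hypothetical.

```python
LATENCY = {"add": 1, "mul": 2}
UNITS = {"add": 2, "mul": 1}            # 2 adders, 1 multiplier (assumed pipelined)
OPS = {                                  # name: (unit kind, predecessors)
    "A1": ("add", []),                   # P = A + B
    "A2": ("add", []),                   # Q = C + D
    "M1": ("mul", ["A1"]),               # X = P x E
    "M2": ("mul", ["A1", "A2"]),         # Y = P x Q
    "A3": ("add", ["M1", "M2"]),         # Z = X + Y
}


def list_schedule(ops=OPS):
    """Greedy cycle-by-cycle scheduling; dict order serves as the priority order."""
    start = {}
    cycle = 0
    while len(start) < len(ops):
        used = {kind: 0 for kind in UNITS}
        for name, (kind, preds) in ops.items():
            if name in start:
                continue
            # An op is ready once every predecessor's result is available.
            ready = all(
                p in start and start[p] + LATENCY[ops[p][0]] <= cycle
                for p in preds
            )
            if ready and used[kind] < UNITS[kind]:
                start[name] = cycle
                used[kind] += 1
        cycle += 1
    return start


print(list_schedule())
```

M2 is ready at cycle 1 but waits one cycle for the single multiplier, and A3 has to wait for M2's two-cycle latency.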

Page 92: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

92

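Modulo Scheduling

Step 2: Generate a list schedule

[Figure: data flow graph with nodes A1, A2, M1, M2 and A3 (inputs A, B, C, D, E; output Z), annotated with the numbers 0, 1, 1, 3 and 3.]

Reservation Table

Time | Adder1 | Adder2 | Mult
0 | A1 | A2 | -
1 | - | - | M1
2 | - | - | M2
3 | - | - | -
4 | A3 | - | -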

Page 93: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

93

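Modulo Scheduling

Generating Modulo Schedule:

1. Determine the MII:

$\mathrm{MII} = \max_{\text{resources}} \left\lceil \dfrac{N}{C} \right\rceil$, where N: Resource_demand and C: Resource_availability

MII = max(⌈3/2⌉, ⌈2/1⌉) = 2 (3 additions on 2 adders, 2 multiplications on 1 multiplier)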
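The resource bound can be computed directly. This short Python sketch takes the ceiling of demand over availability for each resource and returns the largest value. The function name and dictionary layout are hypothetical, and recurrence constraints are ignored, which is safe here only because the example loop has no loop-carried dependence.

```python
import math


def resource_mii(demand, available):
    """Resource-constrained minimum initiation interval."""
    return max(math.ceil(demand[r] / available[r]) for r in demand)


# 3 additions on 2 adders, 2 multiplications on 1 multiplier.
print(resource_mii({"adder": 3, "multiplier": 2}, {"adder": 2, "multiplier": 1}))
```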

Page 94: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

94

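Modulo Scheduling

Mapping from list schedule to modulo schedule

List schedule

Time | Adder1 | Adder2 | Mult
0 | A1 | A2 | -
1 | - | - | M1
2 | - | - | M2
3 | - | - | -
4 | A3 | - | -

Modulo schedule for 1 iteration

Time | Modulo 2 | Adder1 | Adder2 | Mult
0 | 0 | A1 | A2 | -
1 | 1 | - | - | M1
2 | 0 | - | - | M2
3 | 1 | - | - | -
4 | 0 | A3 | - | -
5 | 1 | - | - | -
6 | 0 | - | - | -

A3 cannot stay at time 4: slot 0 (time modulo 2) already holds A1 and A2 on both adders, so A3 moves to time 5 (slot 1).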

Page 95: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

95

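Modulo Scheduling

Time | Modulo 2 | Adder1 | Adder2 | Mult
0 | 0 | 1:A1 | 1:A2 | -
1 | 1 | - | - | 1:M1
2 | 0 | 2:A1 | 2:A2 | 1:M2
3 | 1 | - | - | 2:M1
4 | 0 | - | - | 2:M2
5 | 1 | 1:A3 | - | -
6 | 0 | - | - | -
7 | 1 | 2:A3 | - | -
8 | 0 | - | - | -

Inserting iteration 2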

Page 96: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

96

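Modulo Scheduling

Time | Modulo 2 | Adder1 | Adder2 | Mult
0 | 0 | 1:A1 | 1:A2 | -
1 | 1 | - | - | 1:M1
2 | 0 | 2:A1 | 2:A2 | 1:M2
3 | 1 | - | - | 2:M1
4 | 0 | 3:A1 | 3:A2 | 2:M2
5 | 1 | 1:A3 | - | 3:M1
6 | 0 | - | - | 3:M2
7 | 1 | 2:A3 | - | -
8 | 0 | - | - | -
9 | 1 | 3:A3 | - | -

Inserting iteration 3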
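A minimal Python sketch of the placement step with a modulo reservation table (MRT): each operation goes at the earliest time at or after its ready time whose slot (time modulo II) still has a free unit of the right kind. It assumes II = 2, the latencies and unit counts from the list-schedule slide, and a pipelined multiplier; all names are hypothetical, and a production scheduler would also handle recurrences and backtracking.

```python
II = 2
UNITS = {"add": 2, "mul": 1}
LATENCY = {"add": 1, "mul": 2}
OPS = [  # (name, unit kind, predecessors), in list-schedule order
    ("A1", "add", []),
    ("A2", "add", []),
    ("M1", "mul", ["A1"]),
    ("M2", "mul", ["A1", "A2"]),
    ("A3", "add", ["M1", "M2"]),
]


def modulo_schedule(ops=OPS, ii=II):
    kind_of = {name: kind for name, kind, _ in ops}
    mrt = {(slot, kind): 0 for slot in range(ii) for kind in UNITS}
    start = {}
    for name, kind, preds in ops:
        # Earliest time at which all operands are available.
        t = max((start[p] + LATENCY[kind_of[p]] for p in preds), default=0)
        for _ in range(ii):               # try each modulo slot once
            if mrt[(t % ii, kind)] < UNITS[kind]:
                break
            t += 1
        else:
            raise RuntimeError(f"no free slot for {name}; try a larger II")
        mrt[(t % ii, kind)] += 1
        start[name] = t
    return start


def overlap(start, iterations, ii=II):
    """Issue time of every op of every iteration (iterations numbered from 1)."""
    return {
        f"{i}:{name}": t + (i - 1) * ii
        for i in range(1, iterations + 1)
        for name, t in start.items()
    }


one_iteration = modulo_schedule()
print(one_iteration)
print(overlap(one_iteration, 3))
```

A3 lands at time 5 rather than 4 because both adders are already taken in slot 0, and starting a new iteration every II = 2 cycles gives the overlapped issue times for iterations 1 to 3.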

Page 97: ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines

97

Modulo Scheduled Loop

[Figure: overlapped iterations divided into prolog, 5x kernel, and epilog]

MRT

Modulo 2 | Adder 1 | Adder 2 | Mult
0 | 3:A1 | 3:A2 | 2:M2
1 | 1:A3 | - | 3:M1