CSCI 620 NOTE8: Instruction Level Parallelism and Tomasulo’s Approach


Page 1: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

CSCI 620 NOTE8

Instruction Level Parallelism and Tomasulo’s Approach

Page 2: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Instruction Level Parallelism

• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

• Reduce stalls, reduce CPI

• Reduce CPI, increase IPC

• Instruction-level parallelism (ILP) seeks to reduce stalls

• The importance of ILP is most visible in loop-level parallelism, where every iteration is independent of the others:

    for (i = 1; i < 1000; i = i + 1) {
        x[i] = x[i] + y[i];
    }
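As a rough illustration of the CPI equation above, the following Python sketch plugs in made-up stall figures and shows how removing stalls lowers CPI and raises IPC. The function name `pipeline_cpi` and all the numbers are assumptions chosen for illustration, not measurements from these notes.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal CPI + stall cycles per instruction from each source."""
    return ideal_cpi + structural + data_hazard + control

# Hypothetical stall cycles per instruction (illustrative values only).
before = pipeline_cpi(1.0, 0.25, 0.5, 0.25)
# Same pipeline after a technique removes the data hazard stalls.
after = pipeline_cpi(1.0, 0.25, 0.0, 0.25)

# IPC is the reciprocal of CPI, so a lower CPI means a higher IPC.
ipc_before = 1.0 / before
ipc_after = 1.0 / after
print(before, after, ipc_before, round(ipc_after, 3))
```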

Page 3: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Major Techniques to Increase ILP

Technique | Reduces
Forwarding and bypassing | Potential data hazard stalls
Delayed branches and simple branch scheduling | Control hazard stalls
Basic dynamic scheduling (scoreboarding) | Data hazard stalls from true dependences
Dynamic scheduling with renaming | Data hazard stalls and stalls from antidependences and output dependences
Dynamic branch prediction | Control stalls
Issuing multiple instructions per cycle | Ideal CPI
Speculation | Data hazard and control hazard stalls
Dynamic memory disambiguation | Data hazard stalls with memory
Loop unrolling | Control hazard stalls
Basic compiler pipeline scheduling | Data hazard stalls
Compiler dependence analysis | Ideal CPI, data hazard stalls
Software pipelining, trace scheduling | Ideal CPI, data hazard stalls
Compiler speculation | Ideal CPI, data hazard stalls, control stalls

Page 4: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Instruction Level Parallelism

• ILP is exploited by SW (static, compile-time) or HW (dynamic, run-time) techniques

• HW-intensive ILP dominates the desktop and server markets

• Compiler-intensive SW approaches are more likely to be seen in embedded systems, although IA-64 also uses this approach

Page 5: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Dependences

• Two instructions are parallel if they can execute simultaneously in a pipeline without causing any stalls (assuming no structural hazards) and can be reordered

• Two instructions that are dependent are not parallel and cannot be reordered: they must execute in order, even though their execution may partially overlap

• Three types of dependences

– Data dependences (= true data dependences)

– Name dependences

– Control dependences

Page 6: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Dependences

• Dependences are properties of programs

• Whether a dependence results in an actual hazard (and how long the stall is) is a property of the pipeline organization

• A dependence:
1) Indicates the potential for a hazard
2) Determines the order in which results must be calculated
3) Sets an upper bound on ILP

• Problems caused by dependences can be addressed by:
1) Avoiding the hazard by rescheduling the code
2) Eliminating the dependence by transforming (altering) the code

• The compiler is concerned with dependences in the program; whether a HW hazard actually occurs depends on the given pipeline

Page 7: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Review of Data Hazards

Consider instructions i and j, where i occurs before j.

RAW (read after write): j tries to read a source before i writes it, so j gets the old value

WAW (write after write): j tries to write an operand before it is written by i, so the writes happen in the wrong order and the value from i is left behind (only possible in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled)

WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value (only possible when some instructions can write results early in the pipeline and other instructions can read sources late in the pipeline)
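The three definitions above can be sketched as a small Python check. This is only an illustration: the tuple encoding `(dest, sources)` and the function name `hazards` are assumptions made here, and the function reports which hazards could arise if j were allowed to overtake i, not whether a given pipeline would actually stall.

```python
def hazards(i, j):
    """Return the hazard types possible between earlier instruction i and later j.

    Each instruction is (dest, sources): one destination register name and a
    tuple of source register names.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    found = set()
    if i_dest in j_srcs:
        found.add("RAW")   # j reads a register that i writes
    if j_dest in i_srcs:
        found.add("WAR")   # j writes a register that i reads
    if i_dest == j_dest:
        found.add("WAW")   # i and j write the same register
    return found

add = ("F0", ("F2", "F4"))                   # ADD.D F0, F2, F4
print(hazards(add, ("F6", ("F0", "F8"))))    # SUB.D F6, F0, F8
print(hazards(add, ("F2", ("F6", "F8"))))    # SUB.D F2, F6, F8
print(hazards(add, ("F0", ("F6", "F8"))))    # SUB.D F0, F6, F8
```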

Page 8: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

(1) Data Dependences

• (True) data dependences: instruction j is data dependent on instruction i if

– Instruction i produces a result used by instruction j (directly: i → j), or

– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (indirectly: i → k → j)

• Easy to determine for registers (fixed names)

• Harder to determine for memory:

– Do 100(R4) and 20(R6) refer to the same address?

– In different loop iterations, does 20(R4) refer to the same address as 20(R4)?

– A hardware technique for this appears in Chapter 2

Example (j is data dependent on i through F0):

i: ADD.D F0, F2, F4

j: SUB.D F6, F0, F8
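The memory questions above come down to comparing effective addresses, which depend on register contents known only at run time. The sketch below assumes MIPS-style displacement addressing (address = offset + base register); the register values and the helper names are hypothetical.

```python
def effective_address(offset, base_value):
    """Displacement addressing: effective address = offset + base register value."""
    return offset + base_value

def same_address(off_a, base_a, off_b, base_b):
    """True when two memory operands resolve to the same effective address."""
    return effective_address(off_a, base_a) == effective_address(off_b, base_b)

# 100(R4) vs 20(R6): the answer depends on the (hypothetical) register values.
print(same_address(100, 1000, 20, 1080))   # R4 = 1000, R6 = 1080
print(same_address(100, 1000, 20, 2000))   # R4 = 1000, R6 = 2000

# 20(R4) in two loop iterations, where R4 changed by 8 in between.
print(same_address(20, 1000, 20, 1008))
```

The operand text alone cannot settle the question, which is why memory dependences need either compiler analysis or a run-time address comparison.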

Page 9: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

(2) Name Dependences

• The second type of dependence is the name dependence: two instructions use the same name (the same register or memory location) but do not exchange data

• Antidependence

– Instruction j writes a register or memory location that instruction i reads, and instruction i must be executed first; if not, a WAR hazard results

i: ADD.D F0, F2, F4

j: SUB.D F2, F6, F8

• Output dependence

– Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved; if not, a WAW hazard results

i: ADD.D F0, F2, F4

j: SUB.D F0, F6, F8

• Name dependences are harder to handle for memory accesses:

– Do 100(R4) and 20(R6) refer to the same address?

– In different loop iterations, does 20(R4) refer to the same address as 20(R4)?

Page 10: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Register Renaming Eliminates WAR & WAW

Assuming temporary registers S and T:

Original                 Renamed
DIV.D F0, F2, F4         DIV.D F0, F2, F4
ADD.D F6, F0, F8         ADD.D S, F0, F8
S.D   F6, 0(R1)          S.D   S, 0(R1)
SUB.D F8, F10, F14       SUB.D T, F10, F14
MUL.D F6, F10, F8        MUL.D F6, F10, T

Questions: which pairs are (true) data dependences, antidependences (WAR), and output dependences (WAW)? Which dependences are eliminated by renaming?

(True) data dependences: (1) DIV.D → ADD.D (F0), (2) ADD.D → S.D (F6), (3) SUB.D → MUL.D (F8)

Antidependences: ADD.D and SUB.D (F8); S.D and MUL.D (F6)

Output dependences: ADD.D and MUL.D (F6)

Any subsequent use of F8 must be replaced by T. What about F6? It does not need to be replaced the way F8 is, because MUL.D overwrites F6.

WAR and WAW hazards are eliminated by register renaming, which will be implemented in hardware.
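A minimal Python sketch of the renaming idea, applied to the five-instruction sequence above. It is a simplification under stated assumptions: every destination gets a fresh name (P0, P1, ...) rather than only the two temporaries S and T used on the slide, and the tuple encoding and function names are invented for this example.

```python
from itertools import count

# (op, dest, sources). S.D writes memory, so it has no register destination.
PROGRAM = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

def name_dependences(program):
    """List (kind, earlier_index, later_index) for every WAR and WAW pair."""
    deps = []
    for a, (_, d1, s1) in enumerate(program):
        for b in range(a + 1, len(program)):
            d2 = program[b][1]
            if d2 is None:
                continue
            if d2 in s1:
                deps.append(("WAR", a, b))   # later write, earlier read
            if d2 == d1:
                deps.append(("WAW", a, b))   # two writes to the same name
    return deps

def rename(program):
    """Give every destination a fresh name and redirect later reads to it."""
    fresh = (f"P{n}" for n in count())
    current = {}                   # architectural name -> latest fresh name
    renamed = []
    for op, dest, srcs in program:
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

print(name_dependences(PROGRAM))
print(name_dependences(rename(PROGRAM)))
```

The true data dependences survive renaming because each read is redirected to the fresh name of its producer; only the name reuse disappears.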

Page 11: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

(3) Control Dependence

• The final kind of dependence is the control dependence

• Example:

    if p1 { S1; };
    if p2 { S2; }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Note that S2 could be data dependent on S1.

Page 12: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Control Dependences

• Two (obvious) constraints on control dependences:

– An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch

– An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch

Example for the first constraint (S1 must not be moved above "if p1"):

    if p1 { S1; };
    if p2 { S2; }

Example for the second constraint (S3 is not control dependent on p1 or p2, so it must not be moved into either if body):

    if p1 { S1; };
    S3;
    if p2 { S2; }

Page 13: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Limitations of Scoreboarding (scoreboard hardware on next slide)

• No forwarding hardware

• Limited to instructions in a basic block (small window)

• Small number of functional units (structural hazards), especially integer/load/store units: only one of each

• Cannot issue if there is a structural or WAW hazard

• Must wait until WAR hazards are resolved

• Imprecise exceptions due to out-of-order execution

Improvement? Tomasulo’s approach

Page 14: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Scoreboard Hardware: centralized control by the scoreboard

[Figure A.50: The basic structure of a MIPS processor with a scoreboard. Blocks: scoreboard, integer unit, FP add, FP divide, two FP mult units, registers, data buses. Arrows: data flows and control/status flows.]

The scoreboard was originally proposed for the CDC 6600 (Seymour Cray, 1964)

Page 15: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Busy – Indicates whether the unit is busy or not
Op – Operation to perform in the unit (e.g., add or subtract)
Fi – Destination register
Fj, Fk – Source-register numbers
Qj, Qk – Functional units producing source registers Fj, Fk
Rj, Rk – Flags indicating when Fj, Fk are available and not yet read

Page 16: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Tomasulo’s Algorithm

For the IBM 360/91, about 3 years after the CDC 6600 (late 1960s)

Goal: high performance without special compilers

Differences between Tomasulo’s algorithm and the scoreboard (similar to scoreboarding, but with register renaming added):

– Control and buffers (called “reservation stations”) are distributed with the functional units rather than centralized in the scoreboard: the scoreboard/instruction buffer becomes reservation stations for each FU

– Registers in instructions are replaced by pointers to reservation station buffers

– HW renaming of registers avoids WAR and WAW hazards

– The common data bus (CDB) broadcasts results to the functional units

– Loads and stores are treated as functional units as well

Very importantly, Tomasulo’s algorithm has been adopted in many later CPUs: Alpha 21264, HP PA-8000, MIPS R10K, Pentium III, Pentium 4, PowerPC 604, etc.

Page 17: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Key Concept: Reservation Stations (RS)

• Distributed (rather than centralized) control scheme

– Bypassing (data goes directly to the RS rather than via the registers) is provided by the Common Data Bus (CDB)

– Register renaming eliminates WAR/WAW hazards

• Scoreboard/instruction buffer => reservation stations

– Fetch and buffer operands as soon as they are available, which eliminates the need to always get values from registers at execute

– Pending instructions designate the reservation stations that will provide their inputs

– With successive writes to a register, only the last one updates the register

Page 18: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

[Figure: MIPS floating-point unit using Tomasulo’s algorithm]

Page 19: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Details

• Each reservation station holds an instruction that has been issued and is waiting for execution. The instruction either already has all its operands or holds the name(s) of the RSs or load buffers that will provide them. These name fields are called “tags”: 4 bits each, enough to denote one of 5 RSs and 6 load buffers. RSs are used for renaming

• Load buffers and store buffers behave almost exactly like RSs

• All results from the FUs and from memory are sent on the Common Data Bus, which is connected to everything except the load buffers

Page 20: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Three Stages of Tomasulo’s Algorithm

1. Issue: get the next instruction from the FP operation queue (FIFO)

If a reservation station is free, issue the instruction and send the operands (if they are available in the registers; otherwise record the name of the FU that will produce them, which is the renaming step). If no reservation station is free, stall (a structural hazard). Avoids WAR and WAW hazards.

2. Execute: operate on operands (EX)

When both operands are ready (already in Vj/Vk or captured from the CDB), execute; if they are not ready, watch the common data bus for the result. Avoids RAW hazards.

3. Write result: finish execution (WB)

Write the result on the common data bus so that all waiting FUs can pick it up; mark the reservation station as available.

Common data bus: 64 bits of data + 4 bits of source tag (“come from”)
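The three stages can be sketched in Python as below. This is a toy model under loud assumptions, not the 360/91 design: stations are generic rather than tied to an FU type, there is no clock or latency, stages 2 and 3 are fused into one call, and the `exclude` parameter is a hypothetical way to pretend that a station is still executing. The class and method names are invented for this example.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}

class Station:
    def __init__(self, name):
        self.name, self.busy = name, False
        self.op = self.vj = self.vk = self.qj = self.qk = None

class Tomasulo:
    def __init__(self, regs, station_names):
        self.regs = dict(regs)        # register file values
        self.status = {}              # register -> tag of its pending producer
        self.stations = [Station(n) for n in station_names]

    def _operand(self, reg):
        """Return (value, tag): the value if ready, else the producer's tag."""
        tag = self.status.get(reg)
        return (None, tag) if tag else (self.regs[reg], None)

    def issue(self, op, dest, src1, src2):
        """Stage 1: claim a free station, or return None (structural stall)."""
        rs = next((s for s in self.stations if not s.busy), None)
        if rs is None:
            return None
        rs.busy, rs.op = True, op
        rs.vj, rs.qj = self._operand(src1)
        rs.vk, rs.qk = self._operand(src2)
        self.status[dest] = rs.name   # renaming: dest now refers to this station
        return rs.name

    def execute_and_write(self, exclude=()):
        """Stages 2 and 3: run one ready station, then broadcast (tag, result)."""
        rs = next((s for s in self.stations
                   if s.busy and s.qj is None and s.qk is None
                   and s.name not in exclude), None)
        if rs is None:
            return None
        result = OPS[rs.op](rs.vj, rs.vk)
        for other in self.stations:   # waiting stations match on the source tag
            if other.qj == rs.name:
                other.vj, other.qj = result, None
            if other.qk == rs.name:
                other.vk, other.qk = result, None
        for reg, tag in list(self.status.items()):
            if tag == rs.name:        # only the latest writer updates the register
                self.regs[reg] = result
                del self.status[reg]
        tag, rs.busy = rs.name, False
        rs.op = rs.vj = rs.vk = None
        return tag, result

m = Tomasulo({"F0": 0.0, "F2": 6.0, "F4": 2.0, "F6": 0.0, "F8": 10.0},
             ["RS1", "RS2", "RS3"])
m.issue("DIV", "F0", "F2", "F4")             # RS1: both operands ready
m.issue("ADD", "F6", "F0", "F8")             # RS2: waits on RS1, copies old F8
m.issue("SUB", "F8", "F2", "F4")             # RS3: will overwrite F8
stalled = m.issue("MUL", "F2", "F2", "F4")   # no free station
first = m.execute_and_write(exclude={"RS1"}) # pretend the divide is still running
second = m.execute_and_write()               # divide finishes and wakes RS2
third = m.execute_and_write()                # ADD still uses the old F8 value
print(stalled, first, second, third, m.regs)
```

In this run SUB writes F8 before ADD executes, yet ADD still computes with the old F8 because it copied the value into Vk at issue; that is the WAR case being avoided.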

Page 21: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Data Buses in Tomasulo’s Algorithm

• A normal data bus carries data + destination (a “go to” bus)

• The CDB (Common Data Bus) carries data + source (a “come from” bus)

– 64 bits of data + 4 bits of functional unit source address (the RS’s number)

– Any receiving unit (store buffer, RSs, FP registers) accepts (writes) the data if the RS number on the bus matches the number it is waiting for

Page 22: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Reservation Station Components

Op – Operation to perform in the unit (e.g., + or –)

Qj, Qk – Names of the reservation stations that will produce the source operands; no values are stored here

Vj, Vk – Values of the source operands; these act as temporary registers for renaming

Busy – Indicates that the reservation station and its FU are busy

Register result status – Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.


Page 25: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Load and store require 2 steps:

Step 1: Compute the effective address (EA)

Step 2: Place the EA in the load or store buffer

Execution (load or store) can start when the memory unit is not busy


Page 40: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Wait until DIVD finishes (divide takes 40 cycles)


Page 44: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

[Side-by-side timing tables: Scoreboard, Tomasulo]

Assuming (for the scoreboard): add takes 2 clock cycles, multiply 10, divide 40

• Why does the CDC 6600 scoreboard take longer?
– Structural hazards
– Lack of forwarding

• Both use in-order issue and out-of-order execution
• The scoreboard cannot eliminate WAR and WAW hazards; it stalls on them
• Tomasulo eliminates them with register renaming
• Both stall on branch instructions; Tomasulo with speculation is covered later

Page 45: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Let’s try this site: http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo/AppletTomasulo.html


Page 47: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Tomasulo’s Algorithm: A Loop-Based Example

Loop: LD    F0 0(R1)
      MULTD F4 F0 F2
      SD    F4 0(R1)
      SUBI  R1 R1 #8
      BNEZ  R1 Loop

• Multiply takes 4 clocks

• Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit); on a cache miss, a block (several words) is brought into the cache

• Reality: integer instructions run ahead


Page 49: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Cache miss occurs, so LD must wait for 8 cycles


Page 52: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Since SUBI is executed by the integer unit, it is not shown here; only the FP unit is shown

Page 53: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Since BNEZ is executed by the integer unit, it is not shown here; only the FP unit is shown

Page 54: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

This is “register renaming”


Page 56: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Higher ILP!

Page 57: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

The cache is finally ready, so the load reads from memory


Page 69: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Tomasulo Summary

Reservation stations: renaming to a larger set of registers + buffering of source operands

– Prevents registers from becoming a bottleneck

– Distributes RAW hazard detection to the RSs

– Avoids the WAR and WAW hazards of the scoreboard through register renaming

– Allows loop unrolling in HW

– Tag matching on the CDB requires many associative compares

– The Common Data Bus is the Achilles heel of Tomasulo: multiple writebacks (multiple CDBs) are expensive

Page 70: CSCI 620 NOTE8 1 Instruction Level Parallelism and Tomasulo’s approach

Tomasulo Summary

Lasting contributions (most modern processors employ the algorithm):

– Dynamic scheduling

– Register renaming

– Load/store disambiguation: the load address is compared with the store addresses in the store buffer. If a match is found, the load instruction is not sent to the load buffer. Which hazard does this avoid?

RAW
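The address check described above can be sketched as follows. This is a minimal illustration under assumptions: the class `StoreBuffer` and its methods are hypothetical names, addresses are plain integers, and a held-back load simply retries later instead of receiving the store data by forwarding.

```python
class StoreBuffer:
    """Tracks the addresses of stores that have issued but not yet written memory."""

    def __init__(self):
        self.pending = []             # addresses in program order

    def add_store(self, address):
        self.pending.append(address)

    def complete_store(self, address):
        self.pending.remove(address)  # the oldest matching store has written memory

    def load_may_proceed(self, address):
        """A load goes to the load buffer only if no pending store matches its address."""
        return address not in self.pending

sb = StoreBuffer()
sb.add_store(0x100)                   # earlier store to address 0x100 still pending
blocked = sb.load_may_proceed(0x100)  # same address: the load must wait (RAW through memory)
free = sb.load_may_proceed(0x108)     # different address: the load may run ahead
sb.complete_store(0x100)
after = sb.load_may_proceed(0x100)    # the store has written, so the load can proceed
print(blocked, free, after)
```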

Descendants of the 360/91 approach include the Pentium III and Pentium 4, PowerPC 604, MIPS R10000, HP PA-8000, and Alpha 21264