29
•1 CPE 631: ILP, Dynamic Exploitation Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenković [email protected] http://www.ece.uah.edu/~milenka 2 ©AM LaCASA Outline Instruction Level Parallelism (ILP) Recap: Data Dependencies Extended MIPS Pipeline and Hazards Dynamic scheduling with a scoreboard 3 ©AM LaCASA ILP: Concepts and Challenges ILP (Instruction Level Parallelism) – overlap execution of unrelated instructions Techniques that increase amount of parallelism exploited among instructions reduce impact of data and control hazards increase processor ability to exploit parallelism Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls Reducing each of the terms of the right-hand side minimize CPI and thus increase instruction throughput 4 ©AM LaCASA Two approaches to exploit parallelism Dynamic techniques largely depend on hardware to locate the parallelism Static techniques relay on software

LaCASA - ece.uah.edumilenka/cpe631-06S/lectures/cpe631ilpdyn_4p1.pdf20 ©AM LaCASA Three Parts of the Scoreboard 1. Instruction status—which of 4 steps the instruction is in (Capacity

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

  • •1

    CPE 631: ILP, Dynamic Exploitation

    Electrical and Computer EngineeringUniversity of Alabama in Huntsville

    Aleksandar Milenković[email protected]

    http://www.ece.uah.edu/~milenka

    2

    ©AM

    LaCASA

    Outline

    Instruction Level Parallelism (ILP)Recap: Data DependenciesExtended MIPS Pipeline and HazardsDynamic scheduling with a scoreboard

    3

    ©AM

    LaCASA

    ILP: Concepts and Challenges

    ILP (Instruction Level Parallelism) –overlap execution of unrelated instructionsTechniques that increase amount of parallelism exploited among instructions

    reduce impact of data and control hazardsincrease processor ability to exploit parallelism

    Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

    Reducing each of the terms of the right-hand side minimize CPI and thus increase instruction throughput

    4

    ©AM

    LaCASA

    Two approaches to exploit parallelism

    Dynamic techniqueslargely depend on hardware to locate the parallelism

    Static techniquesrelay on software

  • •2

    5

    ©AM

    LaCASA

    Techniques to exploit parallelism

    DH stallsBasic compiler pipeline scheduling (A.2, 4.1)

    CH stallsLoop Unrolling (4.1)

    WAR and WAW stallsDynamic scheduling with register renaming (3.2)

    DH stalls (RAW)Basic dynamic scheduling (A.8)

    Control hazard stallsDelayed branches (A.2)

    Data hazard (DH) stallsForwarding and bypassing (Section A.2)

    Ideal CPI, and D/CH stallsCompiler speculation (4.4)

    Ideal CPI and DH stallsSoftware pipelining and trace scheduling (4.3)

    Ideal CPI, DH stallsCompiler dependence analysis (4.4)

    RAW stalls w. memoryDynamic memory disambiguation (3.2, 3.7)

    Data and control stallsSpeculation (3.7)

    Ideal CPIIssuing multiple instruction per cycle (3.6)

    CH stallsDynamic branch prediction (3.4)

    Reduces Technique (Section in the textbook)

    6

    ©AM

    LaCASA

    Where to look for ILP?

    Amount of parallelism available within a basic block BB: straight line code sequence of instructions with no branchesin except to the entry, and no branches out except at the exitExample: Gcc (Gnu C Compiler): 17% control transfer

    5 or 6 instructions + 1 branchDependencies => amount of parallelism in a basic block is likely to be much less than 5=> look beyond single block to get more instruction level parallelism

    Simplest and most common way to increase amount of parallelism among instruction is to exploit parallelism among iterations of a loop =>Loop Level Parallelism

    Vector Processing: see Appendix G

    for(i=1; i

  • •3

    9

    ©AM

    LaCASA

    Definition: Name Dependencies

    Two instructions use same name (register or memory location) but don’t exchange data

    Antidependence (WAR if a hazard for HW)Instruction j writes a register or memory location that instruction i reads from and instruction i is executed firstOutput dependence (WAW if a hazard for HW)Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved. If dependent, can’t execute in parallel

    Renaming to remove data dependenciesAgain Name Dependencies are Hard for Memory Accesses

    Does 100(R4) = 20(R6)?From different loop iterations, does 20(R6) = 20(R6)?

    10

    ©AM

    LaCASA

    Where are the name dependencies?1 Loop:L.D F0,0(R1)2 ADD.D F4,F0,F23 S.D 0(R1),F4 ;drop DSUBUI & BNEZ4 L.D F0,-8(R1)5 ADD.D F4,F0,F26 S.D -8(R1),F4 ;drop DSUBUI & BNEZ7 L.D F0,-16(R1)8 ADD.D F4,F0,F29 S.D -16(R1),F4 ;drop DSUBUI & BNEZ10 L.D F0,-24(R1)11 ADD.D F4,F0,F212 S.D -24(R1),F413 SUBUI R1,R1,#32 ;alter to 4*814 BNEZ R1,LOOP15 NOP

    How can remove them?

    11

    ©AM

    LaCASA

    Where are the name dependencies?1 Loop:L.D F0,0(R1)2 ADD.D F4,F0,F23 S.D 0(R1),F4 ;drop DSUBUI & BNEZ4 L.D F6,-8(R1)5 ADD.D F8,F6,F26 S.D -8(R1),F8 ;drop DSUBUI & BNEZ7 L.D F10,-16(R1)8 ADD.D F12,F10,F29 S.D -16(R1),F12 ;drop DSUBUI & BNEZ10 L.D F14,-24(R1)11 ADD.D F16,F14,F212 S.D -24(R1),F1613 DSUBUI R1,R1,#32 ;alter to 4*814 BNEZ R1,LOOP15 NOP

    The Orginal“register renaming”

    12

    ©AM

    LaCASA

    Definition: Control Dependencies

    Example: if p1 {S1;}; if p2 {S2;};S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1Two constraints on control dependences:

    An instruction that is control dep. on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branchAn instruction that is not control dep. on a branch cannot be moved to after the branch so that its execution is controlled by the branch

    DADDU R5, R6, R7

    ADD R1, R2, R3

    BEQZ R4, LSUB R1, R5, R6

    L: OR R7, R1, R8

  • •4

    Dynamically Scheduled Pipelines

    14

    ©AM

    LaCASA

    Overcoming Data Hazards with Dynamic Scheduling

    Why in HW at run time?Works when can’t know real dependence at compile timeSimpler compiler Code for one machine runs well on another

    Example

    Key idea: Allow instructions behind stall to proceed

    DIV.D F0,F2,F4ADD.D F10,F0,F8SUB.D F12,F8,F12

    SUB.D cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet SUBD is not data dependent on anything!

    15

    ©AM

    LaCASA

    Overcoming Data Hazards with Dynamic Scheduling (cont’d)

    Enables out-of-order execution => out-of-order completionOut-of-order execution divides ID stage:

    1. Issue—decode instructions, check for structural hazards2. Read operands—wait until no data hazards, then read operands

    Scoreboarding –technique for allowing instructions to execute out of order when there are sufficient resources and no data dependencies (CDC 6600, 1963)

    16

    ©AM

    LaCASA

    Scoreboarding Implications

    Out-of-order completion => WAR, WAW hazards?

    Solutions for WARQueue both the operation and copies of its operandsRead registers only during Read Operands stage

    For WAW, must detect hazard: stall until other completesNeed to have multiple instructions in execution phase => multiple execution units or pipelined execution unitsScoreboard keeps track of dependencies, state or operationsScoreboard replaces ID, EX, WB with 4 stages

    DIV.D F0,F2,F4

    ADD.D F10,F0,F8

    SUB.D F8,F8,F12

    DIV.D F0,F2,F4

    ADD.D F10,F0,F8

    SUB.D F10,F8,F12

  • •5

    17

    ©AM

    LaCASA

    Four Stages of Scoreboard Control

    ID1: Issue — decode instructions & check for structural hazardsID2: Read operands — wait until no data hazards, then read operandsEX: Execute — operate on operands; when the result is ready, it notifies the scoreboard that it hascompleted executionWB: Write results — finish execution; the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instructionDIV.D F0,F2,F4ADD.D F10,F0,F8

    SUB.D F8,F8,F12

    Scoreboarding stalls the the SUBD in its write result stage until ADDD reads its operands

    18

    ©AM

    LaCASA

    Four Stages of Scoreboard Control

    1. Issue—decode instructions & check for structural hazards (ID1)If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.

    2. Read operands—wait until no data hazards, then read operands (ID2)

    A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

    19

    ©AM

    LaCASA

    Four Stages of Scoreboard Control

    3. Execution—operate on operands (EX)The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it hascompleted execution.

    4. Write result—finish execution (WB)Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction.Example:

    CDC 6600 scoreboard would stall SUBD until ADD.D reads operands

    DIV.D F0,F2,F4ADD.D F10,F0,F8SUB.D F8,F8,F14

    20

    ©AM

    LaCASA

    Three Parts of the Scoreboard

    1. Instruction status—which of 4 steps the instruction is in (Capacity = window size)2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit

    Busy—Indicates whether the unit is busy or notOp—Operation to perform in the unit (e.g., + or –)Fi—Destination registerFj, Fk—Source-register numbersQj, Qk—Functional units producing source registers Fj, FkRj, Rk—Flags indicating when Fj, Fk are ready

    3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register

  • •6

    21

    ©AM

    LaCASA

    MIPS with a Scoreboard

    Add1Add2Add3

    FP Mult

    Registers

    Control/StatusScoreboard

    FP Mult

    FP Div

    FP Div

    FP Div

    Control/Status

    22

    ©AM

    LaCASA

    Detailed Scoreboard Pipeline Control

    Read operandsExecution complete

    Instruction status

    Write result

    Issue

    Bookkeeping

    Rj← No; Rk← No

    ∀f(if Qj(f)=FU then Rj(f)← Yes);∀f(if Qk(f)=FU then Rj(f)← Yes);

    Result(Fi(FU))← 0; Busy(FU)← No

    Busy(FU)← yes; Op(FU)← op; Fi(FU)← ’D’; Fj(FU)← ’S1’;

    Fk(FU)← ’S2’; Qj← Result(’S1’); Qk← Result(’S2’); Rj← not Qj; Rk← not Qk; Result(’D’)← FU;

    Rj and Rk

    Functional unit done

    Wait until

    ∀f((Fj( f )≠Fi(FU) or Rj( f )=No) &

    (Fk( f ) ≠Fi(FU) or Rk( f )=No))

    Not busy (FU) and not result (D)

    23

    ©AM

    LaCASA

    Scoreboard ExampleInstruction status Read ExecutioWriteInstruction j k Issue operandcompleteResultL.D F6 34+ R2L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd NoDivide No

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    FU

    24

    ©AM

    LaCASA

    Scoreboard Example: Cycle 1Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    1 FU Integer

    Issue 1st L.D!

  • •7

    25

    ©AM

    LaCASA

    Scoreboard Example: Cycle 2Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    2 FU Integer

    Issue 2nd L.D? Structural hazard!No further instructions will issue!

    26

    ©AM

    LaCASA

    Scoreboard Example: Cycle 3Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    3 FU Integer

    Issue MUL.D?

    27

    ©AM

    LaCASA

    Scoreboard Example: Cycle 4Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    4 FU Integer

    Check for WAR hazards!If none, write result!

    28

    ©AM

    LaCASA

    Scoreboard Example: Cycle 5Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3 YesMult1 NoMult2 NoAdd NoDivide No

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    5 FU Integer

    Issue 2nd L.D!

  • •8

    29

    ©AM

    LaCASA

    Scoreboard Example: Cycle 6Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6MUL.D F0 F2 F4 6SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd NoDivide No

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    6 FU Mult1 Integer

    Issue MUL.D!

    30

    ©AM

    LaCASA

    Scoreboard Example: Cycle 7Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7MUL.D F0 F2 F4 6SUB.D F8 F6 F2 7DIV.D F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide No

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    7 FU Mult1 Integer Add

    Issue SUB.D!

    31

    ©AM

    LaCASA

    Scoreboard Example: Cycle 8Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6SUB.D F8 F6 F2 7DIV.D F10 F0 F6 8ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    8 FU Mult1 Integer Add Divide

    Issue DIV.D!

    32

    ©AM

    LaCASA

    Scoreboard Example: Cycle 9Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9DIV.D F10 F0 F6 8ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    10 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No

    2 Add Yes Sub F8 F6 F2 Integer Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    9 FU Mult1 Add Divide

    Read operands for MUL.D and SUB.D!Assume we can feed Mult1 and Add units in the same clock cycle.Issue ADD.D? Structural Hazard (unit is busy)!

  • •9

    33

    ©AM

    LaCASA

    Scoreboard Example: Cycle 11Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9 11DIV.D F10 F0 F6 8ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    8 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No

    0 Add Yes Sub F8 F6 F2 Integer Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    11 FU Mult1 Add Divide

    Last cycle of SUB.D execution.

    34

    ©AM

    LaCASA

    Scoreboard Example: Cycle 12Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    7 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    12 FU Mult1 Add Divide

    Check WAR on F8. Write F8.

    35

    ©AM

    LaCASA

    Scoreboard Example: Cycle 13Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    6 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    13 FU Mult1 Add Divide

    Issue ADD.D!

    36

    ©AM

    LaCASA

    Scoreboard Example: Cycle 14Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    5 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No

    2 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    14 FU Mult1 Add Divide

    Read operands for ADD.D!

  • •10

    37

    ©AM

    LaCASA

    Scoreboard Example: Cycle 15Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    4 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No

    1 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    14 FU Mult1 Add Divide

    38

    ©AM

    LaCASA

    Scoreboard Example: Cycle 16Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    3 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No

    0 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    16 FU Mult1 Add Divide

    39

    ©AM

    LaCASA

    Scoreboard Example: Cycle 17Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    2 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    17 FU Mult1 Add Divide

    Why cannot write F6?

    40

    ©AM

    LaCASA

    Scoreboard Example: Cycle 19Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

    0 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    17 FU Mult1 Add Divide

  • •11

    41

    ©AM

    LaCASA

    Scoreboard Example: Cycle 20Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    20 FU Mult1 Add Divide

    42

    ©AM

    LaCASA

    Scoreboard Example: Cycle 21Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8 21ADD.D F6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 Yes Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    21 FU Add Divide

    43

    ©AM

    LaCASA

    Scoreboard Example: Cycle 22Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8 21ADD.D F6 F8 F2 13 14 16 22Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 Yes Yes

    40 Divide Yes Div F10 F0 F6 Mult1 Yes YesRegister result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    22 FU Add Divide

    Write F6?

    44

    ©AM

    LaCASA

    Scoreboard Example: Cycle 61Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8 21 61ADD.D F6 F8 F2 13 14 16 22Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd No

    0 Divide Yes Div F10 F0 F6 Mult1 Yes YesRegister result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    61 FU Divide

  • •12

    45

    ©AM

    LaCASA

    Scoreboard Example: Cycle 62Instruction status Read ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+ R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6 F2 7 9 11 12DIV.D F10 F0 F6 8 21 61 62ADD.D F6 F8 F2 13 14 16 22Functional unit status dest S1 S2 FU for jFU for kFj? Fk?

    Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd NoDivide Yes Div F10 F0 F6 Mult1 Yes Yes

    Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30

    62 FU Divide

    46

    ©AM

    LaCASA

    Scoreboard Results

    For the CDC 660070% improvement for Fortran150% improvement for hand coded assembly languagecost was similar to one of the functional units

    surprisingly lowbulk of cost was in the extra busses

    Still this was in ancient timeno caches & no main semiconductor memoryno software pipeliningcompilers?

    So, why is it coming backperformance via ILP

    47

    ©AM

    LaCASA

    Scoreboard Limitations

    Amount of parallelism among instructionscan we find independent instructions to execute

    Number of scoreboard entrieshow far ahead the pipeline can look for independent instructions (we assume a window does not extend beyond a branch)

    Number and types of functional unitsavoid structural hazards

    Presence of antidependences and output dependences

    WAR and WAW stalls become more important

    48

    ©AM

    LaCASA

    Things to Remember

    Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stallsData dependenciesDynamic scheduling to minimise stallsDynamic scheduling with a scoreboard

  • •13

    49

    ©AM

    LaCASA

    Scoreboard Limitations

    Amount of parallelism among instructionscan we find independent instructions to execute

    Number of scoreboard entrieshow far ahead the pipeline can look for independent instructions (we assume a window does not extend beyond a branch)

    Number and types of functional unitsavoid structural hazards

    Presence of antidependences and output dependences

    WAR and WAW stalls become more important

    50

    ©AM

    LaCASA

    Tomasulo’s Algorithm

    Used in IBM 360/91 FPU (before caches)Goal: high FP performance without special compilersConditions:

    Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operationsLong memory accesses and long FP delaysThis led Tomasulo to try to figure out how to get more effective registers — renaming in hardware!

    Why Study 1966 Computer? The descendants of this have flourished!

    Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …

    51

    ©AM

    LaCASA

    Tomasulo’s Algorithm (cont’d)

    Control & buffers distributed with Function Units (FU)FU buffers called “reservation stations” => buffer the operands of instructions waiting to issue;

    Registers in instructions replaced by values or pointers to reservation stations (RS) => register renaming

    avoids WAR, WAW hazardsMore reservation stations than registers, so can do optimizations compilers can’t

    Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUsLoad and Stores treated as FUs with RSs as wellInteger instructions can go past branches, allowing FP ops beyond basic block in FP queue

    52

    ©AM

    LaCASA

    Tomasulo-based FPU for MIPS

    FP addersFP adders

    Add1Add2Add3

    FP multipliersFP multipliers

    Mult1Mult2

    From Mem FP Registers

    Reservation Stations

    Common Data Bus (CDB)

    To Mem

    FP OpQueue

    Load Buffers

    Store Buffers

    Load1Load2Load3Load4Load5Load6

    From Instruction Unit

    Store1Store2Store3

  • •14

    53

    ©AM

    LaCASA

    Reservation Station Components

    Op: Operation to perform in the unit (e.g., + or –)Vj, Vk: Value of Source operands

    Store buffers has V field, result to be storedQj, Qk: Reservation stations producing source registers (value to be written)

    Note: Qj/Qk=0 => source operand is already available in Vj /VkStore buffers only have Qi for RS producing result

    Busy: Indicates reservation station or FU is busy

    Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

    54

    ©AM

    LaCASA

    Three Stages of Tomasulo Algorithm

    1. Issue—get instruction from FP Op QueueIf reservation station free (no structural hazard), control issues instr & sends operands (renames registers)

    2. Execute—operate on operands (EX)When both operands ready then execute;if not ready, watch Common Data Bus for result

    3. Write result—finish execution (WB)Write it on Common Data Bus to all awaiting units; mark reservation station available

    Normal data bus: data + destination (“go to” bus)Common data bus: data + source (“come from” bus)

    64 bits of data + 4 bits of Functional Unit source addressWrite if matches expected Functional Unit (produces result)Does the broadcast

    Example speed: 2 clocks for Fl .pt. +,-; 10 for * ; 40 clks for /

    55

    ©AM

    LaCASA

    Tomasulo ExampleInstruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    0 FU

    Clock cycle counter

    FU countdown

    Instruction stream

    3 Load/Buffers

    3 FP Adder R.S.2 FP Mult R.S.

    56

    ©AM

    LaCASA

    Tomasulo Example Cycle 1Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    1 FU Load1

  • •15

    57

    ©AM

    LaCASA

    Tomasulo Example Cycle 2Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    2 FU Load2 Load1

    Note: Can have multiple loads outstanding

    58

    ©AM

    LaCASA

    Tomasulo Example Cycle 3Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    3 FU Mult1 Load2 Load1

    • Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued

    • Load1 completing; what is waiting for Load1?

    59

    ©AM

    LaCASA

    Tomasulo Example Cycle 4Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    4 FU Mult1 Load2 M(A1) Add1

    • Load2 completing; what is waiting for Load2? 60

    ©AM

    LaCASA

    Tomasulo Example Cycle 5Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No

    10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    5 FU Mult1 M(A2) M(A1) Add1 Mult2

    • Timer starts down for Add1, Mult1

  • •16

    61

    ©AM

    LaCASA

    Tomasulo Example Cycle 6Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No

    9 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    6 FU Mult1 M(A2) Add2 Add1 Mult2

    • Issue ADDD here despite name dependency on F6?

    62

    ©AM

    LaCASA

    Tomasulo Example Cycle 7Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No

    8 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    7 FU Mult1 M(A2) Add2 Add1 Mult2

    • Add1 (SUBD) completing; what is waiting for it?

    63

    ©AM

    LaCASA

    Tomasulo Example Cycle 8Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 No2 Add2 Yes ADDD (M-M) M(A2)

    Add3 No7 Mult1 Yes MULTD M(A2) R(F4)

    Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    8 FU Mult1 M(A2) Add2 (M-M) Mult2

    64

    ©AM

    LaCASA

    Tomasulo Example Cycle 9Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 No1 Add2 Yes ADDD (M-M) M(A2)

    Add3 No6 Mult1 Yes MULTD M(A2) R(F4)

    Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    9 FU Mult1 M(A2) Add2 (M-M) Mult2

  • •17

    65

    ©AM

    LaCASA

    Tomasulo Example Cycle 10Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 No0 Add2 Yes ADDD (M-M) M(A2)

    Add3 No5 Mult1 Yes MULTD M(A2) R(F4)

    Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    10 FU Mult1 M(A2) Add2 (M-M) Mult2

    • Add2 (ADDD) completing; what is waiting for it? 66

    ©AM

    LaCASA

    Tomasulo Example Cycle 11Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 No

    4 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    11 FU Mult1 M(A2) (M-M+M(M-M) Mult2

    • Write result of ADDD here?• All quick instructions complete in this cycle!

    67

    ©AM

    LaCASA

    Tomasulo Example Cycle 12Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 No

    3 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    12 FU Mult1 M(A2) (M-M+M(M-M) Mult2

    68

    ©AM

    LaCASA

    Tomasulo Example Cycle 13Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 No

    2 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    13 FU Mult1 M(A2) (M-M+M(M-M) Mult2

  • •18

    69

    ©AM

    LaCASA

    Tomasulo Example Cycle 14Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 No

    1 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    14 FU Mult1 M(A2) (M-M+M(M-M) Mult2

    70

    ©AM

    LaCASA

    Tomasulo Example Cycle 15Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 No

    0 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    15 FU Mult1 M(A2) (M-M+M(M-M) Mult2

    • Mult1 (MULTD) completing; what is waiting for it?

    71

    ©AM

    LaCASA

    Tomasulo Example Cycle 16Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 NoMult1 No

    40 Mult2 Yes DIVD M*F4 M(A1)

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    16 FU M*F4 M(A2) (M-M+M(M-M) Mult2

    • Just waiting for Mult2 (DIVD) to complete

    72

    ©AM

    LaCASA

    Tomasulo Example Cycle 55Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 NoMult1 No

    1 Mult2 Yes DIVD M*F4 M(A1)

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    55 FU M*F4 M(A2) (M-M+M(M-M) Mult2

  • •19

    73

    ©AM

    LaCASA

    Tomasulo Example Cycle 56Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 NoMult1 No

    0 Mult2 Yes DIVD M*F4 M(A1)

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    56 FU M*F4 M(A2) (M-M+M(M-M) Mult2

    • Mult2 (DIVD) is completing; what is waiting for it? 74

    ©AM

    LaCASA

    Tomasulo Example Cycle 57Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11

    Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

    Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 Yes DIVD M*F4 M(A1)

    Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

    56 FU M*F4 M(A2) (M-M+M(M-M) Result

    • Once again: In-order issue, out-of-order execution and out-of-order completion.

    75

    ©AM

    LaCASA

    Tomasulo Drawbacks

    Complexitydelays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!

    Many associative stores (CDB) at high speedPerformance limited by Common Data Bus

    Each CDB must go to multiple functional units ⇒ high capacitance, high wiring densityNumber of functional units that can complete per cycle limited to one!

    Multiple CDBs ⇒ more FU logic for parallel assoc storesNon-precise interrupts!

    We will address this later

    76

    ©AM

    LaCASA

    Tomasulo Loop Example

    This time assume Multiply takes 4 clocksAssume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)To be clear, will show clocks for SUBI, BNEZ

    Reality: integer instructions ahead of Fl. Pt. Instructions

    Show 2 iterations

    Loop: LD F0 0(R1)MULTD F4 F0 F2SD F4 0 R1SUBI R1 R1 #8BNEZ R1 Loop

  • •20

    77

    ©AM

    LaCASA

    Loop ExampleInstruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 Load1 No1 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    0 80 Fu

    Added Store Buffers

    Value of Register used for address, iteration control

    Instruction Loop

    Iter-ationCount

    78

    ©AM

    LaCASA

    Loop Example Cycle 1Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 Load1 Yes 80Load2 NoLoad3 NoStore1 NoStore2 NoStore3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    1 80 Fu Load1

    79

    ©AM

    LaCASA

    Loop Example Cycle 2Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No

    Load3 NoStore1 NoStore2 NoStore3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    2 80 Fu Load1 Mult1

    80

    ©AM

    LaCASA

    Loop Example Cycle 3Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

    Store1 Yes 80 Mult1Store2 NoStore3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    3 80 Fu Load1 Mult1

    Implicit renaming sets up data flow graph

  • •21

    81

    ©AM

    LaCASA

    Loop Example Cycle 4Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

    Store1 Yes 80 Mult1Store2 NoStore3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    4 80 Fu Load1 Mult1

    82

    ©AM

    LaCASA

    Loop Example Cycle 5Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

    Store1 Yes 80 Mult1Store2 NoStore3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    5 72 Fu Load1 Mult1

    83

    ©AM

    LaCASA

    Loop Example Cycle 6Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult1

    Store2 NoStore3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    6 72 Fu Load2 Mult1

    84

    ©AM

    LaCASA

    Loop Example Cycle 7Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 No

    Store3 NoReservation Stations: S1 S2 RS

    Time Name Busy Op Vj Vk Qj Qk Code:Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    7 72 Fu Load2 Mult2

  • •22

    85

    ©AM

    LaCASA

    Loop Example Cycle 8Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    8 72 Fu Load2 Mult2

    86

    ©AM

    LaCASA

    Loop Example Cycle 9Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    9 72 Fu Load2 Mult2

    87

    ©AM

    LaCASA

    Loop Example Cycle 10Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 10 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

    4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    10 64 Fu Load2 Mult2

    88

    ©AM

    LaCASA

    Loop Example Cycle 11Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

    3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #84 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    11 64 Fu Load3 Mult2

  • •23

    89

    ©AM

    LaCASA

    Loop Example Cycle 12Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

    2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #83 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    12 64 Fu Load3 Mult2

    90

    ©AM

    LaCASA

    Loop Example Cycle 13Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

    1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #82 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    13 64 Fu Load3 Mult2

    91

    ©AM

    LaCASA

    Loop Example Cycle 14Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

    0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #81 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    14 64 Fu Load3 Mult2

    92

    ©AM

    LaCASA

    Loop Example Cycle 15Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8

    0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 LoopRegister result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    15 64 Fu Load3 Mult2

  • •24

    93

    ©AM

    LaCASA

    Loop Example Cycle 16Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 No

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

    4 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    16 64 Fu Load3 Mult1

    94

    ©AM

    LaCASA

    Loop Example Cycle 17Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    17 64 Fu Load3 Mult1

    95

    ©AM

    LaCASA

    Loop Example Cycle 18Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    18 64 Fu Load3 Mult1

    96

    ©AM

    LaCASA

    Loop Example Cycle 19Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 19 Store3 Yes 64 Mult1

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    19 56 Fu Load3 Mult1

  • •25

    97

    ©AM

    LaCASA

    Loop Example Cycle 20Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

    1 LD F0 0 R1 1 9 10 Load1 Yes 561 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 No2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1

    Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

    Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

    Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

    20 56 Fu Load1 Mult1

    • Once again: In-order issue, out-of-order execution and out-of-order completion.

    98

    ©AM

    LaCASA

    Why can Tomasulo overlap iterations of loops?

    Register renamingMultiple iterations use different physical destinations for registers (dynamic loop unrolling)

    Reservation stations Permit instruction issue to advance past integer control flow operationsAlso buffer old values of registers - totally avoiding the WAR stall that we saw in the scoreboard

    Other perspective: Tomasulo building data flow dependency graph on the fly

    99

    ©AM

    LaCASA

    Tomasulo’s scheme offers 2 major advantages

    (1) the distribution of the hazard detection logicdistributed reservation stations and the CDBIf multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB If a centralized register file were used, the units would have to read their results from the registers when register buses are available.

    (2) the elimination of stalls for WAW and WAR hazards

    100

    ©AM

    LaCASA

    Multiple Issue

    Allow multiple instructions to issue in a single clock cycle (CPI < 1)Two flavors

    SuperscalarIssue varying number of instruction per clockCan be statically (compiler tech.) or dynamically(Tomasulo) scheduled

    VLIW (Very Long Instruction Word)Issue a fixed number of instructions formatted as a single long instruction or as a fixed instruction packet

  • •26

    101

    ©AM

    LaCASA

    Multiple Issue with Dynamic Scheduling

    FP addersFP adders

    Add1Add2Add3

    FP multipliersFP multipliers

    Mult1Mult2

    From Mem FP Registers

    Reservation Stations

    To Mem

    FP OpQueue

    Load Buffers

    Store Buffers

    Load1Load2Load3Load4Load5Load6

    From Instruction Unit

    Store1Store2Store3

    Issue: 2 instructions per clock cycle102

    ©AM

    LaCASA

    Multiple Issue with Dynamic Scheduling: An Example

    Loop: L.D F0, 0(R1)ADD.D F4,F0,F2S.D 0(R1), F4DADDIU R1,R1,-#8 BNE R1,R2,Loop

    Assumptions:

    2-issue processor: can issue any pair of instructions if reservation stations are available

    Resources: ALU (int + effective address),a separate pipelined FP for each operation type,branch prediction hardware, 1 CDB

    2 cc for loads, 3 cc for FP Add

    Branches single issue, branch prediction is perfect

    103

    ©AM

    LaCASA

    Execution in Dual-issue Tomasulo Pipeline

    Wait for BNE1413127LD.D F0,0(R1)3Wait for LD.D18157ADD.D F4,F0,F23Wait for ADD.D19138S.D 0(R1), F43Wait for ALU15148DADDIU R1,R1,-#83Wait for DAIDU169BNE R1,R2,Loop3

    Wait for BNE9874LD.D F0,0(R1)2

    Wait for LD.D13104ADD.D F4,F0,F22

    Wait for ADD.D1485S.D 0(R1), F42Wait for ALU1095DADDIU R1,R1,-#82Wait for DAIDU116BNE R1,R2,Loop2

    Wait for DAIDU63BNE R1,R2,Loop1

    Wait for ALU542DADDIU R1,R1,-#81

    Wait for ADD.D932S.D 0(R1), F41

    Wait for LD.D851ADD.D F4,F0,F21

    first issue4321LD.D F0,0(R1)1

    Com.Write at CDBMem.Access

    Exe. (begins)IssueInst.Iter.

    104

    ©AM

    LaCASA

    Multiple Issue with Dynamic Scheduling: Resource Usage

    3/S.D19

    3/ADD.D18

    17

    16

    3/DADDIU3/ADD.D15

    3/L.D2/S.D3/ DADDIU14

    2/ADD.D3/L.D3/S.D13

    3/L.D12

    11

    2/DADDIU2/ADD.D10

    2/L.D1/S.D2/ DADDIU9

    1/ADD.D2/L.D2/S.D8

    2/L.D7

    6

    1/DADDIU1/ADD.D5

    1/L.D1/DADDIU4

    1/L.D1/S.D3

    1/L.D2

    CDBData CacheFP ALUInt ALUClock

  • •27

    105

    ©AM

    LaCASA

    Multiple Issue with Dynamic Scheduling

    DADDIU waits for ALU used by S.DAdd one ALU dedicated to effective address calculationUse 2 CDBs

    Draw table for the dual-issue version of Tomasulo’s pipeline

    106

    ©AM

    LaCASA

    Multiple Issue with Dynamic Scheduling

    Wait for BNE111097LD.D F0,0(R1)315127ADD.D F4,F0,F23

    16108S.D 0(R1), F431098DADDIU R1,R1,-#83

    119BNE R1,R2,Loop3

    Wait for BNE8764LD.D F0,0(R1)2

    Wait for LD.D1294ADD.D F4,F0,F22

    Wait for ADD.D1375S.D 0(R1), F42Executes earlier765DADDIU R1,R1,-#82

    86BNE R1,R2,Loop2

    Wait for DAIDU53BNE R1,R2,Loop1

    Executes earlier432DADDIU R1,R1,-#81

    Wait for ADD.D932S.D 0(R1), F41

    Wait for LD.D851ADD.D F4,F0,F21

    first issue4321LD.D F0,0(R1)1

    Com.Write at CDBMem.Access

    Exe. (begins)IssueInst.Iter.

    107

    ©AM

    LaCASA

    Multiple Issue with Dynamic Scheduling: Resource Usage

    3/ADD.D

    2/ADD.D

    3/L.D

    3/DADDIU

    1/ADD.D

    2/DADDIU

    1/L.D

    CDB#1

    3/S.D

    3/L.D

    2/S.D

    2/L.D

    1/S.D

    1/L.D

    Adr. Adder

    3/S.D16

    15

    14

    2/S.D13

    3/ADD.D12

    11

    3/L.D10

    1/S.D2/ADD.D3/ DADDIU9

    2/L.D8

    2/L.D7

    2/ DADDIU6

    1/ADD.D5

    1/DADDIU4

    1/L.D1/DADDIU3

    2

    CDB#2Data CacheFP ALUInt ALUClock

    108

    ©AM

    LaCASA

    What about Precise Interrupts?

    Tomasulo had:In-order issue, out-of-order execution, and out-of-order completionNeed to “fix” the out-of-order completion aspect so that we can find precise breakpoint in instruction stream

  • •28

    109

    ©AM

    LaCASA

    Hardware-based Speculation

    With wide issue processors control dependences become a burden, even with sophisticated branch predictorsSpeculation: speculate on the outcome of branches and execute the program as if our guesses were correct => need a mechanism to handle situations when the speculations were incorrect

    110

    ©AM

    LaCASA

    Relationship between precise interrupts and speculation

    Speculation is a form of guessingImportant for branch prediction:

    Need to “take our best shot” at predicting branch direction

    If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:

    This is exactly same as precise exceptions!Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

    111

    ©AM

    LaCASA

    HW support for precise interruptsNeed HW buffer for results of uncommitted instructions: reorder buffer (ROB)

    4 fields: instr. type, destination, value, readyUse reorder buffer number instead of reservation station when execution completesSupplies operands between execution complete & commit(Reorder buffer can be operand source => more registers like RS)Instructions commitOnce instruction commits, result is put into registerAs a result, easy to undo speculated instructions on mispredicted branches or exceptions

    ReorderBuffer

    FPOp

    Queue

    FP Adder FP Adder

    Res Stations Res Stations

    FP Regs

    112

    ©AM

    LaCASA

    Four Steps of Speculative TomasuloAlgorithm

    1. Issue—get instruction from FP Op QueueIf reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)

    2. Execution—operate on operands (EX)When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)

    3. Write result—finish execution (WB)Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

    4. Commit—update register with reorder resultWhen instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)

  • •29

    113

    ©AM

    LaCASA

    What are the hardware complexities with reorder buffer (ROB)?How do you find the latest version of a register?

    (As specified by Smith paper) need associative comparison networkCould use future file or just use the register result status buffer to track which specific reorder buffer has received the value

    Need as many ports on ROB as register file

    ReorderBuffer

    FPOp

    Queue

    FP Adder FP Adder

    Res Stations Res Stations

    FP Regs

    Compar network

    Reorder Table

    Des

    t Re

    g

    Resu

    lt

    Exce

    ptions

    ?

    Valid

    Prog

    ram C

    ount

    er

    114

    ©AM

    LaCASA

    Summary

    Reservations stations: implicit register renaming to larger set of registers + buffering source operands

    Prevents registers as bottleneckAvoids WAR, WAW hazards of ScoreboardAllows loop unrolling in HW

    Not limited to basic blocks (integer units gets ahead, beyond branches)Today, helps cache misses as well

    Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)Lasting Contributions

    Dynamic schedulingRegister renamingLoad/store disambiguation

    360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264