35
MIPS Pipeline with Tomasulo’s Algorithm Issue ADD MULT DIV LSQ IR WB ADD RS Dispatch REG FILE Common Data Bus (CDB) MEM

MIPS Pipeline with Tomasulo’s Algorithm · A ea SD LSQ B LSQ Buffers ID: Identifies buffer ... Waiting in LSQ since operand F4 not yet available. Requires MUL to produce the value

  • Upload
    others

  • View
    15

  • Download
    0

Embed Size (px)

Citation preview

  • MIPS Pipeline with Tomasulo’s Algorithm

    Issue

    ADD

    MULT

    DIV

    LSQ

    IR WB

    ADD

    RS

    Dispatch

    REGFILE

    Common DataBus (CDB)

    MEM

  • Renaming Source RegistersExample

    A: DIVD F0, F2, F4 RAW dependency between (A,B)B: ADDD F6, F0, F8 WAR dependency between (B,C)C: SUBD F8, F10, F12 RAW dependency between (C,D)D: DIVD F0, F8, F10

    • A source register that is not the destination of an in-flight instruction is copied to thenew storage location allocated for the source operand

    • A source register that is the destination of an in-flight instruction is tagged with the idof the producer

    • When the producer instruction does a write the value and its id are broadcast onthe CDB

    • The waiting instruction copies the value from the CDB to the storage allocated forthe source operand

    2

  • Example (Issue Unit)

    Issue A: DIVD F0, F2, F4

    TAG SOURCE1 SOURCE2 OP

    • v2, v4: Values read from registers F2, F4 during issue

    • Record A as the current writer of F0

    RSA DIVDv2 v4

    RSA

    Issue B: ADDD F6, F0, F8

    ADDDv8

    • v8: Value read from F8

    • Other operand will be result of instruction A

    RSA

    RSB

    RSA

    RSB

    RSBF6

    F0

    3

  • Example (Issue Unit)

    Issue C: SUBD F8, F10, F12

    TAG SOURCE1 SOURCE2 OP

    v10, v12: Values read from registers F10, F12 during instruction issue

    RSC SUBDv10 v12

    Issue D: DIVD F14, F8, F10

    DIVDv10

    • v10: Value read from F10

    • Other operand will be result of instruction C

    RSC

    RSC

    RSD RSD

    RSCRSD

    F14

    F8

    4

  • Dispatch and WB Units• Dispatch unit selects instructions from RS whose :

    operands are available and functional unit is free• In the example A, C are ready to execute. • When execution is complete the Write Unit is notified

    Write Unit selects one of the completed instruction in EX/WB register for WB

    For the selected instruction, say RSI

    • Broadcast its result on the CDB• Broadcast the TAG (RSI) along with the value

    • All units that are waiting on the result of the completing instruction (tagcomparison): copy the broadcast value

    • Reservation stations copy the value into the RS registers with matching tags• Register file copies the value into the destination register of the instruction.

    5

  • Snapshot after Issue of A, B, C, DIssue A: DIVD F0, F2, F4

    TAG SOURCE1 SOURCE2 OPRSA DIVDv2 v4

    RSA

    Issue B: ADDD F6, F0, F8

    ADDDv8

    RSA

    RSB

    RSA

    RSB

    RSBF6

    F0

    6

    TAG SOURCE1 SOURCE2 OPRSC SUBDv10 v12RSC

    Issue D: DIVD F14, F8, F10

    DIVDv10RSD RSD

    RSCF14

    F8RSC

    Issue C: SUBD F8, F10, F12

    RSD

  • Example (contd ...)

    Event: C completes executionWrite Unit broadcasts result (RESC) of C together with its tag RSCC releases Reservation Station RSCAll registers of the RS and all registers in the Register File monitor CDB for broadcast TAGIf the TAG matches copy the broadcast value into the register

    ID SOURCE1 SOURCE2 OP F8RSC SUBDv10 v12 RESC

    RSD DIVDRESC v10

    RSD

    RSD

    • F8 updated with result of C even though B has not yet started execution.

    • A in execution

    • D ready to be dispatched

    RSC

    F14

    7

  • Example (contd ...)

    Event: A completes executionWrite Unit broadcasts result (RESA) of A along with its tag RSAA releases RSA

    ID SOURCE1 SOURCE2 OP F0

    RSA DIVDv2 v4 RESA

    RSB DIVDRESA v8

    RSB

    RSB

    • D in execution.• B ready to dispatch

    • When D and B complete execution, their results are broadcast and used to update F14 andF6 respectively.

    RSA

    F6

    8

  • WAW HazardsA: DIVD F0, F2, F4B: ADDD F6, F0, F8C: DIVD F0, F10, F12D: ADDD F8, F0, F14

    ID SOURCE1 SOURCE2 OP

    RSB ADDDv8

    RSB

    • Write of F0 by A effectively canceled.

    • Intermediate instructions (like B) get operands directly from A bypassing F0.

    RSA DIVDv2 v4

    RSA

    RSC DIVDv10 v12

    RSC

    RSA

    RSA

    RSB

    RSC

    F6

    F0

    F0

    9

  • WAW Hazards

    A: DIVD F0, F2, F4B: ADDD F6, F0, F8C: DIVD F0, F10, F12D: ADDD F8, F0, F14

    ID SOURCE1 SOURCE2 OP

    RSB ADDDv8

    RSB

    • Write of F0 by A effectively canceled.

    • Intermediate instructions (like B) get operands directly from A bypassing F0.

    RSA DIVDv2 v4

    RSA

    RSC DIVDv10 v12

    RSC

    RSA

    RSA

    RSB

    RSC

    F6

    F0

    F0

    10

  • WAW HazardsA: DIVD F0, F2, F4B: ADDD F6, F0, F8C: DIVD F0, F10, F12D: ADDD F8, F0, F14

    ID SOURCE1 SOURCE2 OP

    RSB ADDDv8

    RSB

    • Write of F0 by A effectively canceled.• Intermediate instructions (like B) get operands directly from A bypassing F0.

    RSA DIVDv2 v4

    RSC DIVDv10 v12

    RSC

    RSA

    RSA

    RSB

    RSC

    F0

    F6

    F0

    RSDF8

    RSC

    RSD ADDDv14

    RSBRSD

    RSD 11

  • Load Store Queue (LSQ)

    LOAD/STORE BUFFERS

    FIFO QUEUE

    CDB

    MEM

    DISPATCH

    ISSUE WRITE

    • In-flight LOAD or STORE instructions are held in Load/Store Buffers in the LSQ unit

    • LOAD dispatched to memory when MEM is free

    • STORE dispatched when MEM is free and value to be stored is available in the Load/Store Buffer

    • STORE value either copied from REG during issue or copied from CDB while waiting

    • We will assume that the effective address is calculated during the ISSUE stage

    LSQ

    12

  • Load Store Queue (LSQ)

    LSQ: FIFO Queue that holds descriptors of issued LOAD and STORE instructions LOAD and STORE instructions wait in LSQ for memory access.

    ID MEM ADDR OPERAND OP

    LSQA SDeaLSQA

    LSQB

    LSQ Buffers

    ID: Identifies buffer (which also serves as the identification of the issued instruction)

    OP: Load or Store (may be implicit if the queues for LOAD and STORE are separate)

    MEM ADDR: Holds the effective address of the memory location to be accessed

    OPERAND: (STORE only) Holds value to be stored in memory

    When a SD instruction is issued the Buffer may either receive the:

    • Actual operand value by copying it from the source register or

    • Tag of the instruction producing the value if it (producer instruction) is still in flight

    ID MEM ADDR OPLSQB ea LD 13

  • Load/Store Buffers

    Load and Store Buffers: Hold descriptors for Load and Store instructions

    A: SD 0(R1), F0 I: ADDD F0, F2, F4B: LD 0(R2), F2 A: SD 0(R1), F0

    B: LD 0(R2), F2

    ID ADDRESS OPERAND OP

    SQA SDea v0

    LQB LDea F2

    LQB

    SQA

    LQB

    SQA SDea

    RSI

    SQA

    ea

    ea

    14

  • Example Schedule

    LOOP:A LD F0, 0(R1) | temp = x[i]B MUL F4, F0, F2 | temp = temp * aC SD F4, 0(R1) | x[i] = tempD ADDI R1, R1, #8 | i++E BNE R1, R2, LOOP | branch if R1.ne.R2

    • Within an iteration RAW between (A, B) and (C, D) [focusing on FP registers only]• Across iterations WAR and WAW dependencies become apparent

    A1 LD F0, 0(R1)B1 MUL F4, F0, F2C1 SD F4, 0(R1)D1 ADDI R1, R1, #8E1 BNE R1, R2, LOOP

    A2 LD F0, 0(R1) WAR with B1, WAW with A1B2 MUL F4, F0, F2 WAR with C1, WAW with B1C2 SD F4, 0(R1)D2 ADDI R1, R1, #8E2 BNE R1, R2, LOOP

    15

  • Assumptions in Constructing Schedule

    • MUL unit is 4 cycles fully pipelined

    • There are an unlimited number of buffers in LSQ• Hence we never stall ISSUE for want of a buffer in the LSQ unit

    • Assume that Branches are predicted as Taken• The Target Address is calculated during the ISSUE stage of the Branch• Hence there is a 1 cycle stall between the issue of the Branch instruction and the issue of the

    instruction at the target address

    • When a LD or SD instruction is issued the effective memory address is calculated in the Issue stage• That is, the base address register Rn is read and the offset added to it• The effective address is put into the appropriate field of the Buffer in the LSQ

    • The integer ALU instructions do not go through the FP or MEM pipeline• They read the source registers in the ID stage• Execute at the next cycle in the EX stage• Write on the following cycle to the integer register• Forwarding is assumed to be used whenever beneficial to reduce latency of integer instructions

    16

  • Example Schedule

    LOOP: | Assume R1 = 1000 initially; Assume F2 = 200

    A1 LD F0, 0(R1) | temp = x[i]B1 MUL F4, F0, F2 | temp = temp * aC1 SD F4, 0(R1) | x[i] = tempD1 ADDI R1, R1, #4 | Index next element of the arrayE1 BNE R1, R2, LOOP | branch if R1.ne.R2 R2 is the address past end of array

    Cycle 2. Issue A1F0LDLQA1 &x[0] = 1000

    Cycle 3. Issue B1F4

    RSB1

    MULRSB1 v2= 200

    LQALQA1

    Cycle 4. Issue C1SDSQC1

    RSB1

    LQA1

    r1&x[0] = 1000

    v2 = 200

    17

  • Schedule2120

    E2

    D2

    C2

    B2

    A2

    E1

    IFD1

    IIFC1

    DIIFB1

    MDIIFA1

    19181716151413121110987654321

    Cycle 4:

    A1: Memory ReadB1: In RS in Dispatch stageC1: In I stage. Will issued to LOAD Buffer at end of cycle

    • Effective Address &x[0] = 1000 and tag LQC18

  • Example Schedule

    Cycle 3A: in D stage. Ready to be dispatched to MEMB: in I stage. Will issue to RS at end of cycle

    Cycle 4A: Memory ReadB: Waiting in RS since operand not yet availableC: In I stage. Will issued to LOAD Buffer at end of cycle

    • Effective Address &x[0] = 1000 and tag LQCCycle 5

    A: In WRITE stage. WRITE unit broadcasts value read from memory (i.e x[0] value) along with tag LQAB: Finds a match between the broadcast tag LQA and copies broadcast data value into the RS fieldC: Waiting in LSQ since operand F4 not yet available. Requires MUL to produce the value.D: Integer instruction issued (reads register R1)

    Cycle 6B: All operands available; MUL unit free. Dispatched from the RS to the MUL unitC: Still waiting for the MUL value to be broadcastD: ADDI instruction in integer EX stageE: Branch instruction in Issue Stage. Calculates effective address (that of LOOP) and will put the address into PC at

    the end of the cycleCycle 7

    B: Second cycle of MULC: Still waiting for the MUL value to be broadcastD: ADDI instruction writes updated R1 (1004)E: Branch instruction in EX Stage. Compares registers and finds its prediction is correctA2: Second iteration begins in IF stage.

    19

  • Schedule2120

    E2

    D2

    C2

    B2

    A2

    IFE1

    IIFD1

    DIIFC1

    DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    Cycle 5A: In WB stage. WB unit broadcasts value read from memory (i.e x[0] value) along with tag LQA1B: Finds a match between the broadcast tag LQA and copies broadcast data value into the RS fieldC: Waiting in LSQ since operand F4 not yet available. Requires MUL to produce the value.D: Integer instruction issued (reads register R1)

    20

  • Example Schedule

    Cycle 5.WB broadcasts value of x[0] along with TAG LQA1F0 and RSB1 copy the broadcast value.LOAD buffer held by instruction A1 is released

    FO: x[0]

    RSB1

    MULRSB1 v2 = 200

    SDSQC1 &x[0]

    RSB1

    x[0] F4

    LQA1

    21

  • E2

    D2

    C2

    B2

    IFA2

    IIFE1

    EIIFD1

    DDIIFC1

    *DDIIFB1

    WMDIIFA1

    654321

    Cycle 6:B1: All operands available; MUL unit is free. So B is begins execution in the MUL unitD1: ADDI instruction in integer EX stageE1: Branch instruction in Issue Stage. Calculates effective address (that of LOOP) and will put the address into PC at clock edge

    22

  • 2120

    E2

    D2

    C2

    B2

    IFA2

    E IIFE1

    WEIIFD1

    DDDIIFC1

    **DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    Cycle 7B1: Second cycle of MUL C: Still waiting for the MUL value to be broadcast D1: ADDI instruction writes updated R1 (1004)E1: Branch instruction in EX Stage. Compares registers and finds its prediction is correct A2: Second iteration begins in IF stage.

    23

  • Example Schedule

    F0LDLQA2 &x[1]

    Cycle 9. Issue B2

    F4

    RSB2

    MULRSB2 v2 = 200

    LQA2

    Cycle 10. Issue C2

    SDSQC2

    RSB2

    LQA2 Cycle 8. Issue A2

    &x[1]

    Instructions issued during iteration 2 denoted by A2, B2, C2 etc

    24

  • Schedule

    2120

    E2

    D2

    IFC2

    IIFB2

    DIIFA2

    WEIIFE1

    WEIIFD1

    DDDDDIIFC1

    ****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    25

  • Snapshot of RS and LSQ Buffers at Cycle 10

    LDLQA2 &x[1] = 1004

    F4

    MULRSB2 v2 = 200

    LQA2

    SDSQC2

    RSB2

    LQA2

    &x[1] = 1004

    FO: x[0]

    MULv2 = 200

    SDSQC1 &x[0] = 1000

    RSB1 x[0]

    LQA

    F0: x[0]

    RSB1

    F4

    RSB1

    RSB2

    F4 was to be written with the result of B. However when C2 was issuedthis write was CANCELLED. That is B2 became the new writer for F4.

    Canceling the write of F4 by B isnot a problem . Why?

    C which wants that value will get itdirectly from the CDB when the WBunit broadcasts the value along withthe tag RSB

    26

  • Schedule

    / 2120

    E2

    IFD2

    IIFC2

    DIIFB2

    DDIIFA2

    WEIIFE1

    WEIIFD1

    DDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    27NOTE: LD of A2 at cycle 10 has stalled for earlier SD (of C1) since we assume LSQdispatches requests in strict FIFO order.

  • Schedule

    / 2120

    IFE2

    IIFD2

    DIIFC2

    DDIIFB2

    DDDIIFA2

    WEIIFE1

    WEIIFD1

    MDDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    27NOTE: LD of A2 at cycle 11 still stalled for earlier SD (of C1) since we assume LSQdispatches requests in strict FIFO order.

  • Schedule

    / 2120

    IIFE2

    EIIFD2

    DDIIFC2

    DDDIIFB2

    MDDDIIFA2

    WEIIFE1

    WEIIFD1

    MDDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    27NOTE: LD of A2 at cycle 10 has stalled for earlier SD (of C1) since we assume LSQdispatches requests in strict FIFO order.

  • Schedule

    /

    W

    21

    M

    20

    DDDDDIIFA3

    WEIIFE2

    WEIIFD2

    MDDDDDDDDIIFC2

    W****DDDDIIFB2

    WMDDDIIFA2

    WEIIFE1

    WEIIFD1

    MDDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    27NOTE: Completes SD every 8 cycles apart

    8(n-1) + 11 = 8n +3 cycles. Additional stalls will arise as RS fills up

  • Schedule

    /2120

    E2

    IFD2

    IIFC2

    DIIFB2

    MDIIFA2

    WEIIFE1

    WEIIFD1

    DDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    27WARNING: In the schedule above M stage (LD) of A2 at cycle 10 has overtakenearlier SD (of C1). Not valid if LSQ dispatches requests in strict FIFO order.

  • Schedule

    2120

    IFE2

    IIFD2

    DIIFC2

    DDIIFB2

    WMDIIFA2

    WEIIFE1

    WEIIFD1

    MDDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    T = 6(n-1) + 11 = 6n + 5 cycles for n iterations28

  • Schedule

    2120

    IFE2

    IIFD2

    DIIFC2

    DDIIFB2

    WMDIIFA2

    WEIIFE1

    WEIIFD1

    MDDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    T = 6(n-1) + 11 = 6n + 5 cycles for n iterations28

  • Schedule

    2120

    IIFE2

    EIIFD2

    DDIIFC2

    *DDIIFB2

    WMDIIFA2

    WEIIFE1

    WEIIFD1

    MDDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    T = 6(n-1) + 11 = 6n + 5 cycles for n iterations29

  • Schedule

    2120

    WMDIIF

    WEIIFE2

    WEIIFD2

    MDDDDDDIIFC2

    W****DDIIFB2

    WMDIIFA2

    WEIIFE1

    WEIIFD1

    MDDDDDDIIFC1

    W****DDIIFB1

    WMDIIFA1

    19181716151413121110987654321

    T = 6(n-1) + 11 = 6n + 5 cycles for n iterations30