ECE 486/586 Computer Architecture Lecture # 15 Spring 2019 Portland State University

ECE 486/586 Computer Architecture Lecture # 15 · web.cecs.pdx.edu/~zeshan/ece586_lec15.pdf · 2019. 5. 22. · Tomasulo’s Algorithm • Used in IBM 360/91 – Employed in floating point


ECE 486/586

Computer Architecture

Lecture # 15

Spring 2019

Portland State University


Lecture Topics

• Instruction-Level Parallelism

– Dynamic Scheduling via Tomasulo Algorithm

Reference:

• Chapter 3: Sections 3.1, 3.4, and 3.5


Dynamic Scheduling

• Key Idea:

– Hardware rearranges the order of instruction execution to reduce stalls while maintaining data flow and exception behavior

• Advantages:

– Code compiled with one pipeline in mind can run efficiently on another pipeline => no need to recompile for a different microarchitecture

– Handles dependences that are unknown at compile time

– Enables the processor to tolerate unpredictable delays, such as cache misses

• Example:

LW R1, 0(R2)

DADDU R3, R1, R4

DSUBU R5, R2, R6

– If LW misses in the cache, an in-order pipeline (in-order issue and in-order execution) stalls: DADDU must wait for R1, and DSUBU is stuck behind it

– Dynamic scheduling allows DSUBU to proceed (no dependence on LW)
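The example above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `can_proceed` and the tuple encoding of instructions are hypothetical, and only RAW dependences are tracked (WAR and WAW are ignored here).

```python
# Each instruction: (text, destination register, source registers).
# This encoding is an assumption made for illustration.
program = [
    ("LW R1, 0(R2)",     "R1", ["R2"]),
    ("DADDU R3, R1, R4", "R3", ["R1", "R4"]),
    ("DSUBU R5, R2, R6", "R5", ["R2", "R6"]),
]

def can_proceed(program, stalled_index):
    """Return later instructions with no RAW dependence (direct or
    transitive) on the result of the stalled instruction."""
    pending = {program[stalled_index][1]}    # registers whose values are delayed
    free = []
    for text, dest, srcs in program[stalled_index + 1:]:
        if pending & set(srcs):
            pending.add(dest)                # this result is delayed as well
        else:
            free.append(text)                # independent of the stalled load
    return free

independent = can_proceed(program, 0)
print(independent)   # ['DSUBU R5, R2, R6']
```

With LW stalled on a cache miss, DADDU is held back because it reads R1, while DSUBU is reported as free to execute.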


Dynamic Scheduling

• Example:

LW R1, 0(R2)

DADDU R3, R1, R4

DSUBU R5, R2, R6

• If we are to allow DSUBU to proceed, we need to separate issue into two parts:

– Check for structural hazards

– Wait for the absence of data hazards

• In-order issue but out-of-order execution (and completion)

• But out-of-order execution introduces the possibility of WAW and WAR hazards

• Previously we studied scoreboarding to allow out-of-order execution

• Today, we’ll look at Tomasulo’s algorithm (more sophisticated than scoreboard)


Tomasulo’s Algorithm

• Used in IBM 360/91

– Employed in floating point unit

• FP units were the major source of hazards at that time, since there were no caches

• Variations of Tomasulo’s algorithm are in use in modern processors

• Key common characteristics

– Track instruction dependences to allow execution as soon as operands are available

– Rename registers to avoid WAR and WAW hazards


Tomasulo’s Algorithm

• Issue

– Get next instruction from the head of the queue

– If a matching reservation station is empty and the operands are available

  • Issue the instruction to the reservation station, indicating that all operands are available

– If there is no empty reservation station, a structural hazard occurs

  • Stall the instruction

– If a reservation station is empty but the operands are not available, keep track of the functional units producing them

  • Issue the instruction to the reservation station, indicating the FU that will provide each missing operand

– Effectively renames registers eliminating WAR and WAW hazards
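The issue rules above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the names `new_station`, `issue`, `regs`, and `reg_stat` are hypothetical, all stations form one pool (so "matching" is not modeled), and `None` stands in for "operand already available".

```python
def new_station(name):
    # Qj/Qk hold the producer's tag; None means the value is already in Vj/Vk.
    return {"name": name, "busy": False, "op": None,
            "vj": None, "vk": None, "qj": None, "qk": None}

def issue(op, rd, rs, rt, stations, regs, reg_stat):
    """Issue one instruction; return the station name, or None to stall."""
    station = next((s for s in stations if not s["busy"]), None)
    if station is None:
        return None                       # structural hazard: no empty station
    for src, v, q in ((rs, "vj", "qj"), (rt, "vk", "qk")):
        producer = reg_stat.get(src)
        if producer is None:
            station[v], station[q] = regs[src], None   # operand available now
        else:
            station[v], station[q] = None, producer    # wait for this tag
    station["op"], station["busy"] = op, True
    reg_stat[rd] = station["name"]        # destination is renamed to the tag
    return station["name"]

stations = [new_station("Add1"), new_station("Add2")]
regs = {"F0": 1.0, "F2": 2.0, "F4": 4.0, "F6": 6.0}
reg_stat = {}                             # register -> tag of pending producer

first = issue("ADD.D", "F0", "F2", "F4", stations, regs, reg_stat)
second = issue("SUB.D", "F6", "F0", "F2", stations, regs, reg_stat)
third = issue("ADD.D", "F2", "F4", "F6", stations, regs, reg_stat)
```

Here the first instruction issues with both operands ready, the second issues but records `Add1` as the producer of F0, and the third stalls because both stations are busy. Sources are read before `reg_stat[rd]` is updated so that an instruction whose destination is also a source still sees the older producer.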


Tomasulo’s Algorithm

• Execute

– If one or more operands unavailable, monitor CDB for them

– When an operand becomes available, it is placed in corresponding reservation station

– When all operands are available, operation can be executed at the functional unit

– Several instructions could become ready in the same clock cycle for the same functional unit => Need a selection heuristic to choose which instruction to execute first

• Write Result

– When result is available, broadcast it on the CDB

– CDB communicates the result to register file and all reservation stations

– Stores write data to memory during this step
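The Execute and Write Result steps can be sketched as a broadcast followed by a selection. This is a minimal sketch with hypothetical names (`write_result`, `select_ready`, the `seq` field); choosing the oldest ready instruction is only one possible selection heuristic and is an assumption, since the slides do not name one. Freeing the producing station is omitted.

```python
def write_result(tag, value, stations, regs, reg_stat):
    """Broadcast (tag, value) on the CDB to all stations and the register file."""
    for s in stations:
        if s["qj"] == tag:
            s["vj"], s["qj"] = value, None
        if s["qk"] == tag:
            s["vk"], s["qk"] = value, None
    for reg in [r for r, t in reg_stat.items() if t == tag]:
        regs[reg] = value                 # register was waiting on this tag
        del reg_stat[reg]

def select_ready(stations):
    """Pick the oldest station whose operands are all available (assumed heuristic)."""
    ready = [s for s in stations
             if s["busy"] and s["qj"] is None and s["qk"] is None]
    return min(ready, key=lambda s: s["seq"]) if ready else None

stations = [
    {"name": "Add1", "busy": True, "seq": 0,
     "vj": None, "qj": "Load1", "vk": 4.0, "qk": None},
    {"name": "Add2", "busy": True, "seq": 1,
     "vj": 2.0, "qj": None, "vk": None, "qk": "Load1"},
]
regs = {"F0": 0.0}
reg_stat = {"F0": "Load1"}

before = select_ready(stations)           # nothing is ready yet
write_result("Load1", 8.0, stations, regs, reg_stat)
after = select_ready(stations)            # both ready; the older one wins
```

A single broadcast makes both waiting stations ready in the same cycle, which is exactly the situation where a selection heuristic is needed.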


Load/Store Buffers

• Load Buffers

– Hold components of effective address until it is computed

– Track outstanding loads waiting on memory

– Hold results of completed loads waiting on CDB

• Store Buffers

– Hold components of effective address until it is computed

– Hold destination addresses for stores waiting for data value

– Hold address and value until memory is available


Load/Store Operations

• Proceed in two steps:

– Compute effective address when base register is available

– Effective address is placed in the load or store buffer

– Loads in load buffer execute as soon as memory is available

– Stores in store buffer must wait for value being stored
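The two steps can be sketched as two small functions operating on a buffer entry. This is an illustrative sketch only: the entry layout and the names `compute_address` and `access_memory` are hypothetical, memory is a plain dictionary, and ordering between loads and stores to the same address is not modeled.

```python
def compute_address(entry, regs, reg_stat):
    """Step 1: form the effective address once the base register is ready."""
    if reg_stat.get(entry["base"]) is not None:
        return False                       # base register still being produced
    entry["A"] += regs[entry["base"]]      # A held the immediate; now the address
    entry["addr_ready"] = True
    return True

def access_memory(entry, memory, memory_free=True):
    """Step 2: perform the access; return True when the entry has completed."""
    if not entry["addr_ready"] or not memory_free:
        return False
    if entry["kind"] == "load":
        entry["result"] = memory[entry["A"]]   # held until the CDB is available
        return True
    if entry["value"] is None:
        return False                       # store still waiting for its data
    memory[entry["A"]] = entry["value"]
    return True

regs = {"R2": 100}
reg_stat = {}
memory = {108: 42}

load = {"kind": "load", "base": "R2", "A": 8, "addr_ready": False, "result": None}
store = {"kind": "store", "base": "R2", "A": 16, "addr_ready": False, "value": None}

compute_address(load, regs, reg_stat)
load_done = access_memory(load, memory)

compute_address(store, regs, reg_stat)
store_blocked = not access_memory(store, memory)   # no data value yet
store["value"] = 7                                  # value arrives (e.g. from the CDB)
store_done = access_memory(store, memory)
```

The load completes as soon as its address is known and memory is free, while the store with a known address still waits until its data value arrives.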


Register Renaming via Tags

• Register names are effectively replaced with tags

– Identify reservation station which will produce operand

– Values broadcast on CDB include the tag

• Reservation stations with pending instructions awaiting operands monitor CDB for tags and values

• This allows direct communication of data between producer and consumer instructions => can read operands as soon as they become available

• It also enables WAR and WAW hazard avoidance (will discuss later)


Data Structures

• Reservation Station

– Op: operation to perform on source operands

– Qj, Qk: reservation stations that will produce the source operands (a value of 0 means the source operand is already in Vj or Vk, or is not required)

– Vj, Vk: actual value of source operands if available

– A: used for effective address calculation for loads/stores (initially holds the immediate field of the instruction, then holds the effective address once calculated)

– Busy: indicates reservation station is busy and FU occupied

• Register Status:

– Qi: Reservation station that will write this register. If the value of Qi is blank, no currently active instruction is computing a result that is destined for this register
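The fields listed above map naturally onto a small record type. The sketch below is one possible encoding, not the lecture's: the class name `ReservationStation`, the method `operands_ready`, and the dictionary `register_status` are hypothetical, and `None` plays the role of the 0 / blank value used for Qj, Qk, and Qi.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ReservationStation:
    name: str                      # tag used on the CDB
    busy: bool = False             # Busy
    op: Optional[str] = None       # Op
    vj: Optional[float] = None     # Vj: value of first source operand
    vk: Optional[float] = None     # Vk: value of second source operand
    qj: Optional[str] = None       # Qj: producer tag, None if value is in Vj
    qk: Optional[str] = None       # Qk: producer tag, None if value is in Vk
    a: Optional[int] = None        # A: immediate, later the effective address

    def operands_ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None

# Register status: Qi per register; None means no pending writer.
register_status: Dict[str, Optional[str]] = {"F0": None, "F6": None}

# State after issuing a load into F6 followed by a multiply that reads F6.
load1 = ReservationStation("Load1", busy=True, op="L.D", a=32)
register_status["F6"] = "Load1"
mult1 = ReservationStation("Mult1", busy=True, op="MUL.D", qj="Load1", vk=2.5)
register_status["F0"] = "Mult1"
```

In this state `mult1.operands_ready()` is false because Qj still names `Load1`, and the register status table shows which station will eventually write each register.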


How does Tomasulo differ from Scoreboard?

1) Data structures are distributed among reservation stations rather than centrally located

2) Common Data Bus (CDB) permits bypassing rather than waiting for writes to the register file

3) Scoreboard stalls an instruction until all of its operands are available, while in Tomasulo’s algorithm instructions can read operands from the CDB as they become available

• This eliminates WAR hazards because the earlier instruction in a WAR dependence reads its operands directly from its producer rather than from the source register, so the later instruction can safely overwrite that register
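A small worked case makes the WAR argument concrete. The sketch below is illustrative only: the helper names `issue` and `write_result`, the register values, and the instruction sequence (DIV.D F0,F2,F4; ADD.D F6,F0,F8; SUB.D F8,F10,F14) are assumptions chosen for the example. It shows the case where the operand value is copied into the reservation station at issue.

```python
regs = {"F0": 0.0, "F2": 6.0, "F4": 3.0, "F8": 1.0, "F10": 10.0, "F14": 4.0}
reg_stat = {}                                   # register -> pending producer tag

def issue(name, dest, src1, src2):
    """Copy ready operands into the station; record tags for pending ones."""
    def operand(r):
        tag = reg_stat.get(r)
        return {"v": None, "q": tag} if tag else {"v": regs[r], "q": None}
    station = {"name": name, "j": operand(src1), "k": operand(src2)}
    reg_stat[dest] = name
    return station

def write_result(tag, value, waiting):
    """CDB broadcast to waiting stations and to the register file."""
    for st in waiting:
        for key in ("j", "k"):
            if st[key]["q"] == tag:
                st[key] = {"v": value, "q": None}
    for r, t in list(reg_stat.items()):
        if t == tag:
            regs[r] = value
            del reg_stat[r]

div = issue("Mult1", "F0", "F2", "F4")          # DIV.D F0, F2, F4 (slow)
add = issue("Add1", "F6", "F0", "F8")           # ADD.D F6, F0, F8 reads F8
sub = issue("Add2", "F8", "F10", "F14")         # SUB.D F8, F10, F14 writes F8

# SUB.D finishes first and overwrites F8 while ADD.D is still waiting on DIV.D.
write_result("Add2", sub["j"]["v"] - sub["k"]["v"], [add])
write_result("Mult1", div["j"]["v"] / div["k"]["v"], [add])

add_result = add["j"]["v"] + add["k"]["v"]      # uses the old F8 value
```

F8 in the register file becomes 6.0 before ADD.D executes, yet ADD.D still computes 2.0 + 1.0 because the old value of F8 was captured in its reservation station at issue.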


How does Tomasulo differ from Scoreboard?

4) Scoreboard stalls instructions to prevent WAW hazard while Tomasulo’s algorithm effectively renames the output register and allows the instruction to proceed

• The later instruction in a WAW dependence overwrites RegisterStat[rd], even if the earlier instruction recorded its FU there

• This is OK because subsequent instructions which needed the value from earlier instruction already recorded the FU for that instruction

• When the FU of the earlier instruction writes its result to the CDB:

– Reservation stations needing the result, retrieve it

– Register file not written, since RegisterStat[reg] is no longer that FU
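The WAW case can be traced with a deliberately tiny model. This is a sketch under assumptions: the state is set up by hand rather than through a full issue stage, and the names `add1`, `write_result`, and the values used are hypothetical. The sequence modeled is DIV.D F0,...; ADD.D F6,F0,...; SUB.D F0,...

```python
regs = {"F0": 0.0}
reg_stat = {}

reg_stat["F0"] = "Mult1"                      # DIV.D F0, ... issued (earlier writer)
add1 = {"qj": reg_stat["F0"], "vj": None}     # ADD.D F6, F0, ... records Mult1
reg_stat["F0"] = "Add2"                       # SUB.D F0, ... issued; entry overwritten

def write_result(tag, value):
    """CDB broadcast: waiting station and register file each check the tag."""
    if add1["qj"] == tag:
        add1["vj"], add1["qj"] = value, None
    if reg_stat.get("F0") == tag:             # only the latest writer updates F0
        regs["F0"] = value
        del reg_stat["F0"]

write_result("Add2", 6.0)                     # later instruction finishes first
write_result("Mult1", 2.0)                    # earlier instruction finishes last
```

After both broadcasts, `add1` holds 2.0 from the earlier instruction, while F0 keeps 6.0 from the later one: the late result from `Mult1` is delivered to its consumer but does not reach the register file.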