Page 1: ECE200 – Computer Organization

ECE200 – Computer Organization

Chapter 6 – Enhancing Performance with

Pipelining

Page 2: ECE200 – Computer Organization

Homework 6

6.2, 6.3, 6.5, 6.9, 6.11, 6.19, 6.27, 6.30

Page 3: ECE200 – Computer Organization

Outline for Chapter 6 lectures

Pipeline motivation: increasing instruction throughput

MIPS 5-stage pipeline

Hazards

Handling exceptions

Superscalar execution

Dynamic scheduling (out-of-order execution)

Real pipeline designs

Page 4: ECE200 – Computer Organization

Pipeline motivation

Need both low CPI and high frequency for best performance

Want a multicycle implementation for high frequency, but need a better CPI

Idea behind pipelining is to have a multicycle implementation that operates like a factory assembly line

Each “worker” in the pipeline performs a particular task, hands off to the next “worker”, while getting new work

Page 5: ECE200 – Computer Organization

Pipeline motivation

Tasks should take about the same time – if one “worker” is much slower than the rest, then other “workers” will stand idle

Once the assembly line is full, a new “product” (instruction) comes out of the back-end of the line each time period

In a computer assembly line (pipeline), each task is called a stage and the time period is one clock cycle

Page 6: ECE200 – Computer Organization

MIPS 5-stage pipeline

Like single cycle datapath but with registers separating each stage

Page 7: ECE200 – Computer Organization

MIPS 5-stage pipeline

5 stages for each instruction
IF: instruction fetch
ID: instruction decode and register file read
EX: instruction execution or effective address calculation
MEM: memory access for load and store
WB: write back results to register file

Delays of all 5 stages are relatively the same

Staging registers are used to hold data and control as instructions pass between stages

All instructions pass through all 5 stages

As an instruction leaves a stage in a particular clock period, the next instruction enters it
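
To make the overlap concrete, here is a small sketch (Python, added for illustration; it is not part of the original slides) that prints which stage each of five instructions occupies in each clock cycle, assuming an ideal pipeline with no hazards:

# Minimal sketch of pipelined overlap (assumes an ideal 5-stage pipeline, no hazards).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(num_instructions):
    """Print the stage each instruction occupies in every clock cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    for i in range(num_instructions):
        row = []
        for cycle in range(total_cycles):
            stage_index = cycle - i          # instruction i enters IF in cycle i
            if 0 <= stage_index < len(STAGES):
                row.append(STAGES[stage_index].ljust(4))
            else:
                row.append("    ")
        print(f"instr {i}: " + " ".join(row))

pipeline_diagram(5)   # once the pipeline is full, one instruction completes per cycle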

Page 8: ECE200 – Computer Organization

Pipeline operation for lw

Stage 1: Instruction fetch

Page 9: ECE200 – Computer Organization

Pipeline operation for lw

Stage 2: Instruction decode and register file read

What happens to the instruction info in IF/ID?

Page 10: ECE200 – Computer Organization

Pipeline operation for lw

Stage 3: Effective address calculation

Page 11: ECE200 – Computer Organization

Pipeline operation for lw

Stage 4: Memory access

Page 12: ECE200 – Computer Organization

Pipeline operation for lw

Stage 5: Write back

Instruction info in IF/ID is gone – won’t work

Page 13: ECE200 – Computer Organization

Modified pipeline with write back fix

Write register bits from the instruction must be carried through the pipeline with the instruction

Page 14: ECE200 – Computer Organization

Pipeline operation for lw

Pipeline usage in each stage for lw

Page 15: ECE200 – Computer Organization

Pipeline operation for sw

Stage 3: Effective address calculation

Page 16: ECE200 – Computer Organization

Pipeline operation for sw

Stage 4: Memory access

Page 17: ECE200 – Computer Organization

Pipeline operation for sw

Stage 5: Write back (nothing)

Page 18: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 19: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 20: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 21: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 22: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 23: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 24: ECE200 – Computer Organization

Graphical pipeline representation

Represent overlap of pipelined instructions as multiple pipelines skewed by a cycle

Page 25: ECE200 – Computer Organization

Another useful shorthand form

Page 26: ECE200 – Computer Organization

Pipeline control

Basic pipeline control is similar to the single cycle implementation

Page 27: ECE200 – Computer Organization

Pipeline control

Control for an instruction is generated in ID and travels with the instruction and data through the pipeline

When an instruction enters a stage, its control signals set the operation of that stage

Page 28: ECE200 – Computer Organization

Pipeline control

Page 29: ECE200 – Computer Organization

Multiple instruction example

For the following code fragment

show the datapath and control usage as the instruction sequence travels down the pipeline

lw  $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or  $13, $6, $7
add $14, $8, $9

Page 30: ECE200 – Computer Organization

Multiple instruction example

Page 31: ECE200 – Computer Organization

Multiple instruction example

Page 32: ECE200 – Computer Organization

Multiple instruction example

Page 33: ECE200 – Computer Organization

Multiple instruction example

Page 34: ECE200 – Computer Organization

Multiple instruction example

Page 35: ECE200 – Computer Organization

Multiple instruction example

Page 36: ECE200 – Computer Organization

Multiple instruction example

Page 37: ECE200 – Computer Organization

Multiple instruction example

Page 38: ECE200 – Computer Organization

Multiple instruction example

Page 39: ECE200 – Computer Organization

How the MIPS ISA simplifies pipelining

Fixed length instructions simplify
Fetch – just get the next 32 bits
Decode – single step; don’t have to decode the opcode before figuring out where to get the rest of the fields

Source register fields are always in the same location
Can read source registers during decode

Load/store architecture
ALU can be used for both arithmetic and EA calculation
Memory instructions require about the same amount of work as arithmetic ones, easing pipelining of the two together

Memory data must be aligned
Read or write accesses can be done in one cycle

Page 40: ECE200 – Computer Organization

Pipeline hazards

A hazard is a conflict involving data, control, or hardware resources

Data hazards are conflicts for register values

Control hazards occur due to the delay in executing branch and jump instructions

Structural hazards are conflicts for hardware resources, such as
A single memory for instructions and data
A multi-cycle, non-pipelined functional unit (such as a divider)

Page 41: ECE200 – Computer Organization

Data dependences

A read after write (RAW) dependence occurs when the register written by an instruction is a source register of a subsequent instruction

Also have write after read (WAR) and write after write (WAW) data dependences (later)

lw $10, 20($1)

sub $11, $10, $3

and $12, $4, $11

or $13, $11, $4

add $14, $13, $9

Page 42: ECE200 – Computer Organization

Pipelining and RAW dependences

RAW dependences that are close by may cause data hazards in the pipeline

Consider the following code sequence:

What are the RAW dependences?

sub $2, $1, $3
and $12, $2, $6
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Page 43: ECE200 – Computer Organization

Pipelining and RAW dependences

Data hazards with first three instructions

(pipeline diagram: the and and or dependences on $2 are hazards; the add and sw dependences are ok)

Page 44: ECE200 – Computer Organization

Forwarding

Most RAW hazards can be eliminated by forwarding results between pipe stages

at this point, result of sub is available

Page 45: ECE200 – Computer Organization

Forwarding datapaths

Bypass paths feed data from MEM and WB back to MUXes at the EX ALU inputs

Do we still have to write the register file in WB?

Page 46: ECE200 – Computer Organization

Detecting forwarding

Rd of the instruction in MEM or WB must match Rs and/or Rt of the instruction in EX

The instruction in MEM or WB must have RegWrite=1 (why?)

Rd must not be $0 (why?)

Page 47: ECE200 – Computer Organization

Detecting forwarding from MEM to EX

To the upper ALU input (ALUupper):
EX/MEM.RegWrite = 1
EX/MEM.RegisterRd ≠ 0
EX/MEM.RegisterRd = ID/EX.RegisterRs

To the lower ALU input (ALUlower):
EX/MEM.RegWrite = 1
EX/MEM.RegisterRd ≠ 0
EX/MEM.RegisterRd = ID/EX.RegisterRt

Page 48: ECE200 – Computer Organization

Detecting forwarding from WB to EX

To the upper ALU input:
MEM/WB.RegWrite = 1
MEM/WB.RegisterRd ≠ 0
MEM/WB.RegisterRd = ID/EX.RegisterRs
The value is not being forwarded from MEM (why?)

To the lower ALU input:
MEM/WB.RegWrite = 1
MEM/WB.RegisterRd ≠ 0
MEM/WB.RegisterRd = ID/EX.RegisterRt
The value is not being forwarded from MEM
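
Taken together, the conditions on this and the previous page map almost directly onto the forwarding unit’s logic. The following Python sketch is only an illustration (the staging registers are modeled as dictionaries; everything outside the named signal fields is an assumption):

# Illustrative sketch of the forwarding-unit conditions from the two pages above.
# ex_mem and mem_wb are dicts standing in for the EX/MEM and MEM/WB staging registers.

def forward_a(id_ex_rs, ex_mem, mem_wb):
    """Select the source for the upper ALU input (ALUupper)."""
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 and ex_mem["RegisterRd"] == id_ex_rs:
        return "EX/MEM"          # forward from MEM stage (most recent value)
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 and mem_wb["RegisterRd"] == id_ex_rs:
        return "MEM/WB"          # forward from WB only if MEM did not already match
    return "REGFILE"             # value read in ID is still correct

def forward_b(id_ex_rt, ex_mem, mem_wb):
    """Select the source for the lower ALU input (ALUlower)."""
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 and ex_mem["RegisterRd"] == id_ex_rt:
        return "EX/MEM"
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 and mem_wb["RegisterRd"] == id_ex_rt:
        return "MEM/WB"
    return "REGFILE"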

Page 49: ECE200 – Computer Organization

Forwarding control

Control is handled by the forwarding unit

Page 50: ECE200 – Computer Organization

Forwarding example

Show forwarding for the code sequence:

sub $2, $1, $3

and $4, $2, $5

or $4, $4, $2

add $9, $4, $2

Page 51: ECE200 – Computer Organization

Forwarding example

sub produces result in EX

Page 52: ECE200 – Computer Organization

Forwarding example

sub forwards result from MEM to ALUupper

Page 53: ECE200 – Computer Organization

Forwarding example

sub forwards result from WB to ALUlower

and forwards result from MEM to ALUupper

Page 54: ECE200 – Computer Organization

Forwarding example

or forwards result from MEM to ALUupper

Page 55: ECE200 – Computer Organization

RAW hazards involving loads

Loads produce results in MEM – can’t forward to an immediately following R-type instruction

Called a load-use hazard

Page 56: ECE200 – Computer Organization

RAW hazards involving loads

Solution: stall the stages behind the load for one cycle, after which the result can be forwarded

Page 57: ECE200 – Computer Organization

Detecting load-use hazards

Instruction in EX is a load:
ID/EX.MemRead = 1

Instruction in ID has a source register that matches the load destination register:
ID/EX.RegisterRt = IF/ID.RegisterRs OR
ID/EX.RegisterRt = IF/ID.RegisterRt
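
As a rough illustration, the same test can be written as a single boolean expression (Python sketch; the dictionary representation of the staging registers is assumed, not from the slides):

# Illustrative check for a load-use hazard, using the conditions above.
def load_use_hazard(id_ex, if_id):
    return (id_ex["MemRead"] and
            (id_ex["RegisterRt"] == if_id["RegisterRs"] or
             id_ex["RegisterRt"] == if_id["RegisterRt"]))

# If the hazard is detected, the actions on the next page apply: force a nop into EX
# (zero its control signals) and hold the PC and IF/ID for one cycle.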

Page 58: ECE200 – Computer Organization

Stalling the stages behind the load

Force a nop (“no operation”) instruction into the EX stage on the next clock cycle
Force the ID/EX.MemWrite input to zero
Force the ID/EX.RegWrite input to zero

Hold the instructions in the ID and IF stages for one clock cycle
Hold the contents of the PC
Hold the contents of IF/ID

Page 59: ECE200 – Computer Organization

Control for load-use hazards

Control is handled by the hazard detection unit

Page 60: ECE200 – Computer Organization

Load-use stall example

Code sequence:

lw $2, 20($1)

and $4, $2, $5

or $4, $4, $2

add $9, $4, $2

Page 61: ECE200 – Computer Organization

Load-use stall example

lw enters ID

Page 62: ECE200 – Computer Organization

Load-use stall example

Load-use hazard detected

Page 63: ECE200 – Computer Organization

Load-use stall example

Force nop into EX and hold ID and IF stages

Page 64: ECE200 – Computer Organization

Load-use stall example

lw result in WB forwarded to and in EX

or reads operand $2 from register file

Page 65: ECE200 – Computer Organization

Load-use stall example

Pipeline advances normally

Page 66: ECE200 – Computer Organization

Control hazards

Taken branches and jumps change the PC to the target address from which the next instruction is to be fetched

In our pipeline, the PC is changed when the taken beq instruction is in the MEM stage

This creates a control hazard in which sequential instructions in earlier stages must be discarded

Page 67: ECE200 – Computer Organization

beq instruction that is taken

instr i+1, instr i+2, instr i+3 must be discarded

beq $2,$3,7

Page 68: ECE200 – Computer Organization

beq instruction that is taken

In this example, the branch delay is three

Why is the branch immediate field a 7?

Page 69: ECE200 – Computer Organization

Reducing the branch delay

Reducing the branch delay reduces the number of instructions that have to be discarded on a taken branch

We can reduce the branch delay to one for beq by moving both the equality test and the branch target address calculation into ID

We need to insert a nop between the beq and the correctly fetched instruction

Page 70: ECE200 – Computer Organization

Reducing the branch delay

Page 71: ECE200 – Computer Organization

beq with one branch delay

Register equality test is done in ID by exclusive-ORing the register values and NORing the result

Instruction in ID forced to nop by zeroing the IF/ID register

Next fetched instruction will be from PC+4 or branch target depending on the beq outcome

Page 72: ECE200 – Computer Organization

beq with one branch delay

beq in ID; next sequential instruction (and) in IF

Page 73: ECE200 – Computer Organization

beq with one branch delay

bubble in ID; lw (from taken address) in IF

Page 74: ECE200 – Computer Organization

Forwarding and stalling changes

Results in MEM and WB must be forwarded to ID for use as possible beq source operand values

beq may have to stall in ID to wait for source operand values to be produced

Examples:

addi $2, $2, -1
beq  $2, $0, 20

lw   $8, 20($1)
beq  $4, $8, 6

Stall beq one cycle; forward $2 from MEM to the upper equality input in ID

Stall beq two cycles; forward $8 from WB to the lower equality input in ID

Page 75: ECE200 – Computer Organization

Forwarding from MEM to ID

(pipeline diagram: addi $2,$2,-1 followed by one bubble, with beq $2,$0,20 receiving $2 forwarded from MEM to ID)

How could we eliminate the bubble?

Page 76: ECE200 – Computer Organization

Forwarding from WB to ID

(pipeline diagram: lw $8,20($1) followed by two bubbles, with beq $4,$8,6 receiving $8 forwarded from WB to ID)

Page 77: ECE200 – Computer Organization

Further reducing the branch delay

Insert a bubble only if the branch is taken
Allow the next sequential instruction to proceed if the branch is not taken
AND the IF.Flush signal with the result of the equality test
Still have a bubble for taken branches (~2/3 of all branches)

Delayed branching

Page 78: ECE200 – Computer Organization

Delayed branching

The ISA states that the instruction following the branch is always executed regardless of the branch outcome
Hardware must adhere to this rule!

The compiler finds an appropriate instruction to place after the branch (in the branch delay slot)

beq $4, $8, 6
sub $1, $2, $3    # branch delay slot (always executed after the branch)

Page 79: ECE200 – Computer Organization

Delayed branching

Three places compiler may find a delay slot instruction

Page 80: ECE200 – Computer Organization

Prior example without delayed branch

beq in ID; next sequential instruction (and) in IF

What do you notice about the sub instruction?

Page 81: ECE200 – Computer Organization

Prior example without delayed branch

bubble in ID; lw (from taken address) in IF

Page 82: ECE200 – Computer Organization

Prior example with delayed branch

beq in ID; delay slot instruction (sub) in IF

sub $10, $4, $8

Page 83: ECE200 – Computer Organization

Prior example with delayed branch

sub in ID; lw (from taken address) in IF

sub $10, $4, $8

What would happen if the branch was not taken?

Page 84: ECE200 – Computer Organization

Limitations of delayed branching

About 50% of the time the compiler can’t fill the delay slot with a useful instruction while maintaining correctness (it has to insert a nop instead)

High performance pipelines may have >10 delay slots
Many cycles for instruction fetch and decode
Multiple instructions in each pipeline stage

Example:
Pipeline: IF1-IF2-ID1-ID2
Branch calculation performed in ID2
Four instructions in each stage
12 delay slots

Solution: branch prediction (later)

Page 85: ECE200 – Computer Organization

Precise exceptions

Exceptions require a change of control to a special exception handler routine

The PC of the user program is saved in EPC and restored after the handler completes so that the user program can resume at that instruction

For the user program to work correctly after resuming,
All instructions before the excepting one must have written their results
All subsequent instructions must not have written their results

Exceptions handled this way are called precise

Page 86: ECE200 – Computer Organization

Pipelining and precise exceptions

There may be instructions from before the excepting one and from after it in the pipeline when the exception occurs

Exceptions may be detected out of program order

Which should be handled first?

(pipeline diagram: two in-flight instructions each raise an exception)

Page 87: ECE200 – Computer Organization

Supporting precise exceptions

Each instruction in the pipeline has an exception field that travels with it

When an exception is detected, the type of exception is encoded in the exception field

The RegWrite and MemWrite control signals for the instruction are set to 0

At the end of MEM, the exception field is checked to see if an exception occurred

If so, the instructions in IF, ID, and EX are made into nops, and the address of the exception handler is loaded into the PC
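
A minimal sketch of this flush mechanism, assuming the pipeline is modeled as a dictionary of stages and that the handler address shown is only a placeholder (Python, for illustration only):

# Illustrative sketch of the exception mechanism described above: an exception
# field travels with each instruction; at the end of MEM it is checked, and if
# set, the younger instructions in IF, ID, and EX are turned into nops.

HANDLER_ADDRESS = 0x80000180       # assumed handler address, for illustration

def end_of_mem(pipeline, pc):
    """pipeline is a dict of stage name -> instruction record (or None)."""
    instr = pipeline["MEM"]
    if instr and instr["exception"]:
        instr["RegWrite"] = 0              # excepting instruction must not write results
        instr["MemWrite"] = 0
        for stage in ("IF", "ID", "EX"):   # squash the younger instructions
            pipeline[stage] = {"op": "nop", "exception": None,
                               "RegWrite": 0, "MemWrite": 0}
        return HANDLER_ADDRESS             # fetch from the exception handler next
    return pc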

Page 88: ECE200 – Computer Organization

Supporting precise exceptions


Page 89: ECE200 – Computer Organization

Superscalar pipelines

In a superscalar pipeline, each pipeline stage holds multiple instructions (4-6 instructions in modern high performance microprocessors)

Performance is increased because every clock period more than one instruction completes (increased parallelism)

Superscalar pipelines have a CPI less than 1

Page 90: ECE200 – Computer Organization

Simple 2-way superscalar MIPS

Page 91: ECE200 – Computer Organization

Simple 2-way superscalar MIPS

Two instructions fetched and decoded each cycle

Conditions for executing a pair of instructions together:
First instruction is an integer or branch instruction, the second a load or store
No RAW dependence from the first to the second

Otherwise, second instruction is executed the cycle after the first
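
A small sketch of the pairing rule (Python; the tuple encoding of instructions is purely illustrative):

# Illustrative pairing test for the simple 2-way superscalar described above.
# An instruction is represented as (kind, dest_reg, src_regs), e.g.
#   ("alu", 8, [6, 7]) for and $8,$6,$7 and ("load", 2, [1]) for lw $2,100($1).

def can_dual_issue(first, second):
    kind1, dest1, _ = first
    kind2, _, srcs2 = second
    slot_ok = kind1 in ("alu", "branch") and kind2 in ("load", "store")
    no_raw = dest1 not in srcs2          # no RAW dependence from first to second
    return slot_ok and no_raw

print(can_dual_issue(("alu", 8, [6, 7]), ("load", 2, [1])))          # True
print(can_dual_issue(("alu", 2, [1, 3]), ("store", None, [2, 1])))   # False: RAW on $2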

Page 92: ECE200 – Computer Organization

Compiler code scheduling

The compiler can improve performance by changing the order of the instructions in the program (code scheduling)

Examples:
Fill branch delay slots
Move instructions between two dependent instructions to eliminate the stall cycles
Reorder instructions to increase the number executed in parallel

Page 93: ECE200 – Computer Organization

Scheduling example – before

Load-use stall
Stall after addi
First three instructions must execute serially due to dependences
Last two must also execute serially for the same reason
Have a branch delay slot to fill

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

Page 94: ECE200 – Computer Organization

Scheduling example – after

All stall cycles are eliminated
Last two instructions can now execute in parallel on the 2-way superscalar MIPS
First two could also, but we would introduce a stall cycle before the addu (the loop is too short – not enough instructions to schedule)

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4      # moved into load delay slot
      addu $t0, $t0, $s2
      bne  $s1, $zero, Loop
      sw   $t0, 4($s1)       # moved into branch delay slot

Page 95: ECE200 – Computer Organization

Loop unrolling

Idea is to take multiple iterations of a loop (“unroll” it) and combine them into one bigger loop

Gives the compiler many instructions to move between dependent instructions and to increase parallel execution

Reduces the overhead of branching

Page 96: ECE200 – Computer Organization

Loop unrolling

Example of prior loop unrolled 4 times:

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t0, -4($s1)
      addu $t0, $t0, $s2
      sw   $t0, -4($s1)
      lw   $t0, -8($s1)
      addu $t0, $t0, $s2
      sw   $t0, -8($s1)
      lw   $t0, -12($s1)
      addu $t0, $t0, $s2
      sw   $t0, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop

Original code:

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

Page 97: ECE200 – Computer Organization

Loop unrolling

Problem: reuse of $t0 constrains instruction order

Write after read (WAR) and write after write (WAW) hazards

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t0, -4($s1)
      addu $t0, $t0, $s2
      sw   $t0, -4($s1)
      lw   $t0, -8($s1)
      addu $t0, $t0, $s2
      sw   $t0, -8($s1)
      lw   $t0, -12($s1)
      addu $t0, $t0, $s2
      sw   $t0, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop

Page 98: ECE200 – Computer Organization

Loop unrolling

Solution: different registers for each computation

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t1, -4($s1)
      addu $t1, $t1, $s2
      sw   $t1, -4($s1)
      lw   $t2, -8($s1)
      addu $t2, $t2, $s2
      sw   $t2, -8($s1)
      lw   $t3, -12($s1)
      addu $t3, $t3, $s2
      sw   $t3, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop

Page 99: ECE200 – Computer Organization

Loop unrolling

Unrolled loop after scheduling:

New sw offsets due to moving the addi

Loop: lw   $t0, 0($s1)
      lw   $t1, -4($s1)
      lw   $t2, -8($s1)
      lw   $t3, -12($s1)
      addu $t0, $t0, $s2
      addu $t1, $t1, $s2
      addu $t2, $t2, $s2
      addu $t3, $t3, $s2
      addi $s1, $s1, -16
      sw   $t0, 16($s1)
      sw   $t1, 12($s1)
      sw   $t2, 8($s1)
      bne  $s1, $zero, Loop
      sw   $t3, 4($s1)

Page 100: ECE200 – Computer Organization

Modern superscalar processors

Today’s superscalar processors attempt to issue (initiate the execution of) 4-6 instructions each clock cycle

Such processors have multiple integer ALUs, integer multipliers, and floating point units that operate in parallel on different instructions

Because most of these units are pipelined, there is the potential to have tens of instructions simultaneously executing

We must remove several barriers to achieve this

Page 101: ECE200 – Computer Organization

Modern processor challenges

Handling branches in a way that prevents instruction fetch from becoming a bottleneck

Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution

Removing register hazards due to the reuse of registers so that instructions can execute in parallel

Page 102: ECE200 – Computer Organization

Instruction fetch challenges

Branches comprise about 20% of the executed instructions in SPEC integer programs

The branch delay may be >10 instructions in a highly pipelined, superscalar processor

Delayed branches are useless with so many delay slots

Solution: dynamic branch prediction with speculative execution

Page 103: ECE200 – Computer Organization

Dynamic branch prediction

When fetching the branch, predict what the branch outcome and target will be

Fetch instructions from the predicted direction

After executing the branch, verify whether the prediction was correct

If so, continue without any performance penalty

If not, undo and fetch from the other direction

Page 104: ECE200 – Computer Organization

Bimodal branch predictor

Predicts the branch outcome

Works under the assumption that most branches are either taken most of the time or not taken most of the time

Prediction accuracy is ~85-95% with 2048 entries

Page 105: ECE200 – Computer Organization

Bimodal branch predictor

Consists of a small memory and a state machine

Each memory location has 2 bits

The address of the memory is the low-order log2(n) PC bits of a fetched branch instruction

(figure: branch predictor memory with n entries of 2 bits each, addressed by the PC of the fetched branch instruction)

Page 106: ECE200 – Computer Organization

Bimodal branch predictor

When a branch is fetched, the 2-bit memory entry is retrieved

The prediction is based on the high-order bit:
1 = predict taken
0 = predict not taken

Page 107: ECE200 – Computer Organization

Bimodal branch predictor

Once the branch is executed, the state bits are updated and written back into the memory

In the 00 or 11 state, have to be wrong twice in a row to change the prediction

(figure: 2-bit saturating counter state diagram with states 00, 01, 10, 11; transitions are driven by the actual branch outcome)
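
The whole predictor can be sketched in a few lines (Python; the 2048-entry size and 2-bit update rule follow the slides, while the indexing details are assumptions):

# Illustrative sketch of a bimodal predictor: n two-bit saturating counters
# indexed by the low-order PC bits; the prediction comes from the high-order bit.

class BimodalPredictor:
    def __init__(self, n=2048):
        self.n = n
        self.counters = [1] * n            # start in a weakly not-taken state (01)

    def _index(self, pc):
        return (pc >> 2) % self.n          # low-order bits of the branch PC

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # high-order bit set => taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

p = BimodalPredictor()
for outcome in [True, True, False, True]:   # a mostly-taken branch at an assumed PC
    print(p.predict(0x400100), outcome)
    p.update(0x400100, outcome)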

Page 108: ECE200 – Computer Organization

Branch target buffer

Predicts the branch target address
Is this as critical as predicting the branch outcome?

Small memory (typically 256-512 entries) addressed by the low-order branch PC bits

Each entry holds the last target address of the branch

When a branch is fetched, the BTB is accessed and the target address is used if the bimodal predictor predicts “taken”
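
A rough sketch of how the BTB combines with the bimodal predictor at fetch time (Python; it reuses the BimodalPredictor sketch from the previous page and the 512-entry size mentioned above, everything else is assumed):

# Illustrative branch target buffer: maps low-order PC bits to the last taken target.
class BranchTargetBuffer:
    def __init__(self, entries=512):
        self.entries = entries
        self.targets = {}                       # index -> last target address

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        return self.targets.get(self._index(pc))   # None if no target recorded yet

    def record(self, pc, target):
        self.targets[self._index(pc)] = target

# At fetch: use the BTB target only when the bimodal predictor predicts "taken".
def next_fetch_pc(pc, predictor, btb):
    target = btb.lookup(pc)
    if predictor.predict(pc) and target is not None:
        return target
    return pc + 4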

Page 109: ECE200 – Computer Organization

Speculative execution

The execution of the branch, and verification of the prediction, may take many cycles due to RAW dependences with long-latency instructions

We cannot write the register file or data memory until we know the prediction is correct

Execution will eventually stall

lw  $2, 100($1)    # can take >100 cycles
beq $2, $0, Label

Page 110: ECE200 – Computer Organization

Speculative execution

In speculative execution, results are first written to temporary buffers (NOT the register file or data memory)

The results are copied from the buffers to the register file or data memory if the branch prediction has been verified and is correct

If the prediction is incorrect, we discard the results

Page 111: ECE200 – Computer Organization

Speculative execution

Writeback now consists of two stages: instruction completion and instruction commit

Completion: execution is complete, write results to buffer

Commit: branch prediction is verified and correct, copy results from buffers to register file or data memory

Modern processors can speculate through 4-8 branches

(figure: execute → completion writes results to buffers → commit copies results from the buffers to the register file once the branch prediction is verified as correct)
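
As an illustration of the completion/commit split, here is a minimal Python sketch in which completed results sit in a buffer until the branch prediction is verified (the class and method names are assumptions, not processor terminology):

# Illustrative sketch of completion vs. commit with speculative results.
# Completed results go to a buffer; they reach the register file only once the
# branch they depend on has been verified as correctly predicted.

class SpeculativeBuffers:
    def __init__(self):
        self.pending = []            # (dest_reg, value) completed but not committed
        self.register_file = {}

    def complete(self, dest_reg, value):
        self.pending.append((dest_reg, value))      # completion: write the buffer only

    def commit(self):
        for dest_reg, value in self.pending:        # commit: copy to the register file
            self.register_file[dest_reg] = value
        self.pending.clear()

    def squash(self):
        self.pending.clear()                        # misprediction: discard the results

b = SpeculativeBuffers()
b.complete(9, 42)                # a speculatively executed add writes its buffer
b.commit()                       # branch prediction verified as correct
print(b.register_file)           # {9: 42}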

Page 112: ECE200 – Computer Organization

Modern processor challenges

Handling branches in a way that prevents instruction fetch from becoming a bottleneck

Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution

Removing register hazards due to the reuse of registers so that instructions can execute in parallel

Page 113: ECE200 – Computer Organization

Long latency operations

Long latency operations, especially loads that have to access main memory, may stall subsequent instructions

Solution: allow instructions to issue (start executing) out of their original program order but update registers/memory in program order

or  $5, $6, $7     # completed
and $8, $6, $7     # completed
lw  $2, 100($1)    # data not found in on-chip memory, have to get it from main memory
add $9, $2, $2     # waiting for lw
sub $10, $5, $8    # can’t execute even though its operands are available!

Page 114: ECE200 – Computer Organization

Out-of-order issue

Fetched and decoded instructions are placed in a special hardware queue called the issue queue

An instruction waits in the IQ until
Its source operands are available
A suitable functional unit is available

The instruction can then issue

(figure: IF → ID → issue queue → EX → completion → commit, with buffers and the register file at the end)

Page 115: ECE200 – Computer Organization

Out-of-order issue

Every cycle, the destination register numbers (rd or rt) of issuing instructions are broadcast to all instructions in the IQ

A match with a source register number (rs or rt) of an instruction in the IQ indicates the operand will be available

or  $5, $6, $7
and $8, $6, $7
lw  $2, 100($1)    # can take >100 cycles!
add $9, $2, $2
sub $10, $5, $8

(figure: the issued instructions broadcast their destination register numbers to the issue queue, so both operands of a waiting instruction become available)
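
A toy sketch of the wakeup mechanism (Python; the issue-queue entries and the "ready register" set are modeled loosely and only for illustration):

# Illustrative issue-queue sketch: destination register numbers of issuing
# instructions are broadcast; an entry can issue once all its sources are ready.

class IssueQueueEntry:
    def __init__(self, name, sources, ready_regs):
        self.name = name
        self.waiting = {r for r in sources if r not in ready_regs}

    def wakeup(self, broadcast_dest):
        self.waiting.discard(broadcast_dest)   # a source just produced by an issuing instr

    def ready(self):
        return not self.waiting

ready_regs = {1, 5, 6, 7, 8}                   # registers already available
iq = [IssueQueueEntry("add $9,$2,$2", [2, 2], ready_regs),
      IssueQueueEntry("sub $10,$5,$8", [5, 8], ready_regs)]

print([e.name for e in iq if e.ready()])       # sub can issue ahead of the older add
for e in iq:
    e.wakeup(2)                                # lw finally issues and broadcasts $2
print([e.name for e in iq if e.ready()])       # now add is ready too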

Page 116: ECE200 – Computer Organization

Out-of-order issue

Instructions with available source operands can issue ahead of earlier instructions (out of original program order)

(figure: issue queue holding add $9,$2,$2, which is waiting for the lw, and sub $10,$5,$8; the or and and instructions were just issued => issue sub)

Page 117: ECE200 – Computer Organization

Out-of-order issue, in-order commit

Once instructions complete, they write results into the buffers used for speculative execution

However, instructions are written to the register file and data memory in original program order

Why do we need to do this?

(figure: execute → completion → buffers → commit → register file; completion may be out-of-order, commit must be in-order)

or  $5, $6, $7     # commits first
and $8, $6, $7
lw  $2, 100($1)
add $9, $2, $2
sub $10, $5, $8    # completes first

Page 118: ECE200 – Computer Organization

Modern processor challenges

Handling branches in a way that prevents instruction fetch from becoming a bottleneck

Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution

Removing register hazards due to the reuse of registers so that instructions can execute in parallel

Page 119: ECE200 – Computer Organization

Register hazards

The reuse of registers creates WAW and WAR hazards that limit out-of-order issue and parallel execution

Example

Potential for multiple iterations to be executed in parallel
The branch could be predicted as taken with high accuracy
Problem: WAW and WAR hazards involving $t0 and $s1

Solution: register renaming

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

Page 120: ECE200 – Computer Organization

Register renaming

Idea is for the hardware to reassign registers like the compiler does in loop unrolling

Requires implementing more registers than specified in the ISA (e.g., 128 integer registers rather than 32)

Allows every instruction in the pipeline to be given a unique destination register number to eliminate all WAR and WAW register conflicts

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t1, -4($s1)
      addu $t1, $t1, $s2
      sw   $t1, -4($s1)

Page 121: ECE200 – Computer Organization

Register renaming

A register renaming stage is added between decode and the register file access

The original architectural destination register number is replaced by a unique physical register number that is not used by any other instruction

A lookup is done for each source register to find the corresponding physical register number

(figure: decode → rename → reg file; architectural register numbers are used up to the rename stage, physical register numbers after this point)
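
A minimal renaming sketch (Python; the 128-register figure echoes the earlier slide, and the free-list management is simplified for illustration):

# Illustrative register-renaming sketch: each destination gets a fresh physical
# register; sources are looked up in the current map (architectural -> physical).

class Renamer:
    def __init__(self, num_arch=32, num_phys=128):
        self.map = {r: r for r in range(num_arch)}            # initial identity mapping
        self.free = list(range(num_arch, num_phys))            # currently unused physical regs

    def rename(self, dest, sources):
        phys_sources = [self.map[s] for s in sources]           # look up sources first
        phys_dest = self.free.pop(0)                             # fresh physical register
        self.map[dest] = phys_dest                                # later readers see the new mapping
        return phys_dest, phys_sources

r = Renamer()
# Two iterations of "addi $s1, $s1, -4" ($s1 is register 17) no longer share a
# destination register, so the WAW/WAR hazards between them disappear.
print(r.rename(17, [17]))    # (32, [17])
print(r.rename(17, [17]))    # (33, [32])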

Page 122: ECE200 – Computer Organization

Register renaming

Example: two iterations of the loop with branch predicted taken

WAR hazard involving $s1 is removed, allowing the addi to complete before the first iteration is completed

The WAW and WAR hazards involving $t0 are removed
Removing both of these restrictions allows the second iteration to proceed in parallel with the first

BEFORE:

lw   $t0, 0($s1)
addu $t0, $t0, $s2
sw   $t0, 0($s1)
addi $s1, $s1, -4
<bne predicted taken>
lw   $t0, 0($s1)
addu $t0, $t0, $s2
sw   $t0, 0($s1)
addi $s1, $s1, -4
<bne predicted taken>

AFTER:

lw   $p1, 0($p3)
addu $p2, $p1, $p10
sw   $p2, 0($p3)
addi $p4, $p3, -4
<bne predicted taken>
lw   $p7, 0($p4)
addu $p23, $p7, $p10
sw   $p23, 0($p4)
addi $p11, $p4, -4
<bne predicted taken>

Page 123: ECE200 – Computer Organization

The MIPS R12000 microprocessor

4-way superscalar

Five execution units:
2 integer
2 floating point
1 load/store for effective address calculation and data memory access

Dynamic branch prediction and speculative execution

ooo issue, in-order commit

Register renaming

Page 124: ECE200 – Computer Organization

R12000 pipeline (ALU operations)

Fetch stages 1 and 2
Fetch 4 instructions each cycle
Predict branches
Split into two stages to enable higher clock rates (R10K had 1)

Decode stage
Decode and rename 4 instructions each cycle
Put into issue queues

Issue stage
Check source operand availability
Read source operands from register file (or bypass paths) for issued instructions

Execute stage
Execute and complete

Write stage
Write results to physical registers

Page 125: ECE200 – Computer Organization

R12000 branch prediction

2048-entry bimodal predictor

32 entry branch target address cache

Speculation through four branches

Page 126: ECE200 – Computer Organization

R12000 ooo completion, in-order commit

Separate 16-entry issue queues for integer, floating point, and memory (load and store) instructions

Hardware tracks the program order and status (completed, caused exception, etc) of up to 48 instructions

Page 127: ECE200 – Computer Organization

R12000 register renaming

64 integer and 64 floating point physical registers

Hardware lookup table to correlate architectural registers with physical registers

Hardware maintains list of currently unused registers that can be assigned as destination registers

Page 128: ECE200 – Computer Organization

R10000 die photo

Page 129: ECE200 – Computer Organization

R12000 summary

R10000 was one of the 1st microprocessors to implement the “issue queue” approach to ooo superscalar execution
PowerPC processors use the “reservation station” approach discussed in the book

Clock rate was slow
R12000 provided a slight improvement with some redesign

Pentium and Alpha processors are ooo but with much faster clock rates

Very hard to get significant improvement beyond 4-6 way issue
Branch prediction accuracy needs to be extremely high
Finding parallel operations in many programs is difficult
Long latency of loads creates an operand supply problem
Keeping the clock rate high is tough

Page 130: ECE200 – Computer Organization

Questions?