Page 1: ECE200 – Computer Organization

ECE200 – Computer Organization

Chapter 6 – Enhancing Performance with

Pipelining

Page 2: ECE200 – Computer Organization

Homework 6

6.2, 6.3, 6.5, 6.9, 6.11, 6.19, 6.27, 6.30

Page 3: ECE200 – Computer Organization

Outline for Chapter 6 lectures

Pipeline motivation: increasing instruction throughput

MIPS 5-stage pipeline

Hazards

Handling exceptions

Superscalar execution

Dynamic scheduling (out-of-order execution)

Real pipeline designs

Page 4: ECE200 – Computer Organization

Pipeline motivation

Need both low CPI and high frequency for best performance

Want a multicycle implementation for high frequency, but need a better CPI

Idea behind pipelining is to have a multicycle implementation that operates like a factory assembly line

Each “worker” in the pipeline performs a particular task, hands off to the next “worker”, while getting new work

Page 5: ECE200 – Computer Organization

Pipeline motivation

Tasks should take about the same time – if one “worker” is much slower than the rest, then other “workers” will stand idle

Once the assembly line is full, a new “product” (instruction) comes out of the back-end of the line each time period

In a computer assembly line (pipeline), each task is called a stage and the time period is one clock cycle

Page 6: ECE200 – Computer Organization

MIPS 5-stage pipeline

Like single cycle datapath but with registers separating each stage

Page 7: ECE200 – Computer Organization

MIPS 5-stage pipeline

5 stages for each instruction
IF: instruction fetch
ID: instruction decode and register file read
EX: instruction execution or effective address calculation
MEM: memory access for load and store
WB: write back results to register file

Delays of all 5 stages are relatively the same

Staging registers are used to hold data and control as instructions pass between stages

All instructions pass through all 5 stages

As an instruction leaves a stage in a particular clock period, the next instruction enters it
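
To make the overlap concrete, here is a small sketch (Python, added for illustration; it is not part of the original slides) that prints which stage each of five instructions occupies in each clock cycle, assuming an ideal pipeline with no hazards:

# Minimal sketch of pipelined overlap (assumes an ideal 5-stage pipeline, no hazards).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(num_instructions):
    """Print the stage each instruction occupies in every clock cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    for i in range(num_instructions):
        row = []
        for cycle in range(total_cycles):
            stage_index = cycle - i          # instruction i enters IF in cycle i
            if 0 <= stage_index < len(STAGES):
                row.append(STAGES[stage_index].ljust(4))
            else:
                row.append("    ")
        print(f"instr {i}: " + " ".join(row))

pipeline_diagram(5)   # once the pipeline is full, one instruction completes per cycle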

Page 8: ECE200 – Computer Organization

Pipeline operation for lw

Stage 1: Instruction fetch

Page 9: ECE200 – Computer Organization

Pipeline operation for lw

Stage 2: Instruction decode and register file read

What happens to the instruction info in IF/ID?

Page 10: ECE200 – Computer Organization

Pipeline operation for lw

Stage 3: Effective address calculation

Page 11: ECE200 – Computer Organization

Pipeline operation for lw

Stage 4: Memory access

Page 12: ECE200 – Computer Organization

Pipeline operation for lw

Stage 5: Write back

Instruction info in IF/ID is gone – won’t work

Page 13: ECE200 – Computer Organization

Modified pipeline with write back fix

Write register bits from the instruction must be carried through the pipeline with the instruction

Page 14: ECE200 – Computer Organization

Pipeline operation for lw

Pipeline usage in each stage for lw

Page 15: ECE200 – Computer Organization

Pipeline operation for sw

Stage 3: Effective address calculation

Page 16: ECE200 – Computer Organization

Pipeline operation for sw

Stage 4: Memory access

Page 17: ECE200 – Computer Organization

Pipeline operation for sw

Stage 5: Write back (nothing)

Page 18: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 19: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 20: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 21: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 22: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 23: ECE200 – Computer Organization

Pipeline operation for lw, sub sequence

Page 24: ECE200 – Computer Organization

Graphical pipeline representation

Represent overlap of pipelined instructions as multiple pipelines skewed by a cycle

Page 25: ECE200 – Computer Organization

Another useful shorthand form

Page 26: ECE200 – Computer Organization

Pipeline control

Basic pipeline control is similar to the single cycle implementation

Page 27: ECE200 – Computer Organization

Pipeline control

Control for an instruction is generated in ID and travels with the instruction and data through the pipeline

When an instruction enters a stage, its control signals set the operation of that stage

Page 28: ECE200 – Computer Organization

Pipeline control

Page 29: ECE200 – Computer Organization

Multiple instruction example

For the following code fragment

show the datapath and control usage as the instruction sequence travels down the pipeline

lw  $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or  $13, $6, $7
add $14, $8, $9

Page 30: ECE200 – Computer Organization

Multiple instruction example

Page 31: ECE200 – Computer Organization

Multiple instruction example

Page 32: ECE200 – Computer Organization

Multiple instruction example

Page 33: ECE200 – Computer Organization

Multiple instruction example

Page 34: ECE200 – Computer Organization

Multiple instruction example

Page 35: ECE200 – Computer Organization

Multiple instruction example

Page 36: ECE200 – Computer Organization

Multiple instruction example

Page 37: ECE200 – Computer Organization

Multiple instruction example

Page 38: ECE200 – Computer Organization

Multiple instruction example

Page 39: ECE200 – Computer Organization

How the MIPS ISA simplifies pipelining

Fixed length instructions simplify
Fetch – just get the next 32 bits
Decode – single step; don’t have to decode the opcode before figuring out where to get the rest of the fields

Source register fields are always in the same location
Can read source registers during decode

Load/store architecture
ALU can be used for both arithmetic and EA calculation
Memory instructions require about the same amount of work as arithmetic ones, easing pipelining of the two together

Memory data must be aligned
Read or write accesses can be done in one cycle

Page 40: ECE200 – Computer Organization

Pipeline hazards

A hazard is a conflict involving data, control, or hardware resources

Data hazards are conflicts for register values

Control hazards occur due to the delay in executing branch and jump instructions

Structural hazards are conflicts for hardware resources, such as
A single memory for instructions and data
A multi-cycle, non-pipelined functional unit (such as a divider)

Page 41: ECE200 – Computer Organization

Data dependences

A read after write (RAW) dependence occurs when the register written by an instruction is a source register of a subsequent instruction

Also have write after read (WAR) and write after write (WAW) data dependences (later)

lw $10, 20($1)

sub $11, $10, $3

and $12, $4, $11

or $13, $11, $4

add $14, $13, $9

Page 42: ECE200 – Computer Organization

Pipelining and RAW dependences

RAW dependences that are close by may cause data hazards in the pipeline

Consider the following code sequence:

What are the RAW dependences?

sub $2, $1, $3
and $12, $2, $6
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Page 43: ECE200 – Computer Organization

Pipelining and RAW dependences

Data hazards with first three instructions

(pipeline diagram: the and and or dependences on $2 are hazards; the add and sw dependences are ok)

Page 44: ECE200 – Computer Organization

Forwarding

Most RAW hazards can be eliminated by forwarding results between pipe stages

at this point, result of sub is available

Page 45: ECE200 – Computer Organization

Forwarding datapaths

Bypass paths feed data from MEM and WB back to MUXes at the EX ALU inputs

Do we still have to write the register file in WB?

Page 46: ECE200 – Computer Organization

Detecting forwarding

Rd of the instruction in MEM or WB must match Rs and/or Rt of the instruction in EX

The instruction in MEM or WB must have RegWrite=1 (why?)

Rd must not be $0 (why?)

Page 47: ECE200 – Computer Organization

Detecting forwarding from MEM to EX

To the upper ALU input (ALUupper):
EX/MEM.RegWrite = 1
EX/MEM.RegisterRd ≠ 0
EX/MEM.RegisterRd = ID/EX.RegisterRs

To the lower ALU input (ALUlower):
EX/MEM.RegWrite = 1
EX/MEM.RegisterRd ≠ 0
EX/MEM.RegisterRd = ID/EX.RegisterRt

Page 48: ECE200 – Computer Organization

Detecting forwarding from WB to EX

To the upper ALU input:
MEM/WB.RegWrite = 1
MEM/WB.RegisterRd ≠ 0
MEM/WB.RegisterRd = ID/EX.RegisterRs
The value is not being forwarded from MEM (why?)

To the lower ALU input:
MEM/WB.RegWrite = 1
MEM/WB.RegisterRd ≠ 0
MEM/WB.RegisterRd = ID/EX.RegisterRt
The value is not being forwarded from MEM
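
Taken together, the conditions on this and the previous page map almost directly onto the forwarding unit’s logic. The following Python sketch is only an illustration (the staging registers are modeled as dictionaries; everything outside the named signal fields is an assumption):

# Illustrative sketch of the forwarding-unit conditions from the two pages above.
# ex_mem and mem_wb are dicts standing in for the EX/MEM and MEM/WB staging registers.

def forward_a(id_ex_rs, ex_mem, mem_wb):
    """Select the source for the upper ALU input (ALUupper)."""
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 and ex_mem["RegisterRd"] == id_ex_rs:
        return "EX/MEM"          # forward from MEM stage (most recent value)
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 and mem_wb["RegisterRd"] == id_ex_rs:
        return "MEM/WB"          # forward from WB only if MEM did not already match
    return "REGFILE"             # value read in ID is still correct

def forward_b(id_ex_rt, ex_mem, mem_wb):
    """Select the source for the lower ALU input (ALUlower)."""
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 and ex_mem["RegisterRd"] == id_ex_rt:
        return "EX/MEM"
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 and mem_wb["RegisterRd"] == id_ex_rt:
        return "MEM/WB"
    return "REGFILE"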

Page 49: ECE200 – Computer Organization

Forwarding control

Control is handled by the forwarding unit

Page 50: ECE200 – Computer Organization

Forwarding example

Show forwarding for the code sequence:

sub $2, $1, $3

and $4, $2, $5

or $4, $4, $2

add $9, $4, $2

Page 51: ECE200 – Computer Organization

Forwarding example

sub produces result in EX

Page 52: ECE200 – Computer Organization

Forwarding example

sub forwards result from MEM to ALUupper

Page 53: ECE200 – Computer Organization

Forwarding example

sub forwards result from WB to ALUlower

and forwards result from MEM to ALUupper

Page 54: ECE200 – Computer Organization

Forwarding example

or forwards result from MEM to ALUupper

Page 55: ECE200 – Computer Organization

RAW hazards involving loads

Loads produce results in MEM – can’t forward to an immediately following R-type instruction

Called a load-use hazard

Page 56: ECE200 – Computer Organization

RAW hazards involving loads

Solution: stall the stages behind the load for one cycle, after which the result can be forwarded

Page 57: ECE200 – Computer Organization

Detecting load-use hazards

Instruction in EX is a load:
ID/EX.MemRead = 1

Instruction in ID has a source register that matches the load destination register:
ID/EX.RegisterRt = IF/ID.RegisterRs OR
ID/EX.RegisterRt = IF/ID.RegisterRt
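
As a rough illustration, the same test can be written as a single boolean expression (Python sketch; the dictionary representation of the staging registers is assumed, not from the slides):

# Illustrative check for a load-use hazard, using the conditions above.
def load_use_hazard(id_ex, if_id):
    return (id_ex["MemRead"] and
            (id_ex["RegisterRt"] == if_id["RegisterRs"] or
             id_ex["RegisterRt"] == if_id["RegisterRt"]))

# If the hazard is detected, the actions on the next page apply: force a nop into EX
# (zero its control signals) and hold the PC and IF/ID for one cycle.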

Page 58: ECE200 – Computer Organization

Stalling the stages behind the load

Force a nop (“no operation”) instruction into the EX stage on the next clock cycle
Force the ID/EX.MemWrite input to zero
Force the ID/EX.RegWrite input to zero

Hold the instructions in the ID and IF stages for one clock cycle
Hold the contents of the PC
Hold the contents of IF/ID

Page 59: ECE200 – Computer Organization

Control for load-use hazards

Control is handled by the hazard detection unit

Page 60: ECE200 – Computer Organization

Load-use stall example

Code sequence:

lw $2, 20($1)

and $4, $2, $5

or $4, $4, $2

add $9, $4, $2

Page 61: ECE200 – Computer Organization

Load-use stall example

lw enters ID

Page 62: ECE200 – Computer Organization

Load-use stall example

Load-use hazard detected

Page 63: ECE200 – Computer Organization

Load-use stall example

Force nop into EX and hold ID and IF stages

Page 64: ECE200 – Computer Organization

Load-use stall example

lw result in WB forwarded to and in EX

or reads operand $2 from register file

Page 65: ECE200 – Computer Organization

Load-use stall example

Pipeline advances normally

Page 66: ECE200 – Computer Organization

Control hazards

Taken branches and jumps change the PC to the target address from which the next instruction is to be fetched

In our pipeline, the PC is changed when the taken beq instruction is in the MEM stage

This creates a control hazard in which sequential instructions in earlier stages must be discarded

Page 67: ECE200 – Computer Organization

beq instruction that is taken

instr i+1, instr i+2, instr i+3 must be discarded

beq $2,$3,7

Page 68: ECE200 – Computer Organization

beq instruction that is taken

In this example, the branch delay is three

Why is the branch immediate field a 7?

Page 69: ECE200 – Computer Organization

Reducing the branch delay

Reducing the branch delay reduces the number of instructions that have to be discarded on a taken branch

We can reduce the branch delay to one for beq by moving both the equality test and the branch target address calculation into ID

We need to insert a nop between the beq and the correctly fetched instruction

Page 70: ECE200 – Computer Organization

Reducing the branch delay

Page 71: ECE200 – Computer Organization

beq with one branch delay

Register equality test is done in ID by exclusive-ORing the register values and NORing the result

Instruction in ID forced to nop by zeroing the IF/ID register

Next fetched instruction will be from PC+4 or branch target depending on the beq outcome

Page 72: ECE200 – Computer Organization

beq with one branch delay

beq in ID; next sequential instruction (and) in IF

Page 73: ECE200 – Computer Organization

beq with one branch delay

bubble in ID; lw (from taken address) in IF

Page 74: ECE200 – Computer Organization

Forwarding and stalling changes

Results in MEM and WB must be forwarded to ID for use as possible beq source operand values

beq may have to stall in ID to wait for source operand values to be produced

Examples:

addi $2, $2, -1
beq  $2, $0, 20

lw   $8, 20($1)
beq  $4, $8, 6

Stall beq one cycle; forward $2 from MEM to the upper equality input in ID

Stall beq two cycles; forward $8 from WB to the lower equality input in ID

Page 75: ECE200 – Computer Organization

Forwarding from MEM to ID

(pipeline diagram: addi $2,$2,-1 followed by one bubble, with beq $2,$0,20 receiving $2 forwarded from MEM to ID)

How could we eliminate the bubble?

Page 76: ECE200 – Computer Organization

Forwarding from WB to ID

(pipeline diagram: lw $8,20($1) followed by two bubbles, with beq $4,$8,6 receiving $8 forwarded from WB to ID)

Page 77: ECE200 – Computer Organization

Further reducing the branch delay

Insert a bubble only if the branch is taken
Allow the next sequential instruction to proceed if the branch is not taken
AND the IF.Flush signal with the result of the equality test
Still have a bubble for taken branches (~2/3 of all branches)

Delayed branching

Page 78: ECE200 – Computer Organization

Delayed branching

The ISA states that the instruction following the branch is always executed regardless of the branch outcome
Hardware must adhere to this rule!

The compiler finds an appropriate instruction to place after the branch (in the branch delay slot)

beq $4, $8, 6
sub $1, $2, $3    # branch delay slot (always executed after the branch)

Page 79: ECE200 – Computer Organization

Delayed branching

Three places compiler may find a delay slot instruction

Page 80: ECE200 – Computer Organization

Prior example without delayed branch

beq in ID; next sequential instruction (and) in IF

What do you notice about the sub instruction?

Page 81: ECE200 – Computer Organization

Prior example without delayed branch

bubble in ID; lw (from taken address) in IF

Page 82: ECE200 – Computer Organization

Prior example with delayed branch

beq in ID; delay slot instruction (sub) in IF

sub $10, $4, $8

Page 83: ECE200 – Computer Organization

Prior example with delayed branch

sub in ID; lw (from taken address) in IF

sub $10, $4, $8

What would happen if the branch was not taken?

Page 84: ECE200 – Computer Organization

Limitations of delayed branching

About 50% of the time the compiler can’t fill the delay slot with a useful instruction while maintaining correctness (it has to insert a nop instead)

High performance pipelines may have >10 delay slots
Many cycles for instruction fetch and decode
Multiple instructions in each pipeline stage

Example:
Pipeline: IF1-IF2-ID1-ID2
Branch calculation performed in ID2
Four instructions in each stage
12 delay slots

Solution: branch prediction (later)

Page 85: ECE200 – Computer Organization

Precise exceptions

Exceptions require a change of control to a special exception handler routine

The PC of the user program is saved in EPC and restored after the handler completes so that the user program can resume at that instruction

For the user program to work correctly after resuming,
All instructions before the excepting one must have written their results
All subsequent instructions must not have written their results

Exceptions handled this way are called precise

Page 86: ECE200 – Computer Organization

Pipelining and precise exceptions

There may be instructions from before the excepting one and from after it in the pipeline when the exception occurs

Exceptions may be detected out of program order

Which should be handled first?

(pipeline diagram: two in-flight instructions each raise an exception)

Page 87: ECE200 – Computer Organization

Supporting precise exceptions

Each instruction in the pipeline has an exception field that travels with it

When an exception is detected, the type of exception is encoded in the exception field

The RegWrite and MemWrite control signals for the instruction are set to 0

At the end of MEM, the exception field is checked to see if an exception occurred

If so, the instructions in IF, ID, and EX are made into nops, and the address of the exception handler is loaded into the PC
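
A minimal sketch of this flush mechanism, assuming the pipeline is modeled as a dictionary of stages and that the handler address shown is only a placeholder (Python, for illustration only):

# Illustrative sketch of the exception mechanism described above: an exception
# field travels with each instruction; at the end of MEM it is checked, and if
# set, the younger instructions in IF, ID, and EX are turned into nops.

HANDLER_ADDRESS = 0x80000180       # assumed handler address, for illustration

def end_of_mem(pipeline, pc):
    """pipeline is a dict of stage name -> instruction record (or None)."""
    instr = pipeline["MEM"]
    if instr and instr["exception"]:
        instr["RegWrite"] = 0              # excepting instruction must not write results
        instr["MemWrite"] = 0
        for stage in ("IF", "ID", "EX"):   # squash the younger instructions
            pipeline[stage] = {"op": "nop", "exception": None,
                               "RegWrite": 0, "MemWrite": 0}
        return HANDLER_ADDRESS             # fetch from the exception handler next
    return pc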

Page 88: ECE200 – Computer Organization

Supporting precise exceptions


Page 89: ECE200 – Computer Organization

Superscalar pipelines

In a superscalar pipeline, each pipeline stage holds multiple instructions (4-6 instructions in modern high performance microprocessors)

Performance is increased because every clock period more than one instruction completes (increased parallelism)

Superscalar pipelines have a CPI less than 1

Page 90: ECE200 – Computer Organization

Simple 2-way superscalar MIPS

Page 91: ECE200 – Computer Organization

Simple 2-way superscalar MIPS

Two instructions fetched and decoded each cycle

Conditions for executing a pair of instructions together:
First instruction is an integer or branch instruction, the second a load or store
No RAW dependence from the first to the second

Otherwise, second instruction is executed the cycle after the first
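
A small sketch of the pairing rule (Python; the tuple encoding of instructions is purely illustrative):

# Illustrative pairing test for the simple 2-way superscalar described above.
# An instruction is represented as (kind, dest_reg, src_regs), e.g.
#   ("alu", 8, [6, 7]) for and $8,$6,$7 and ("load", 2, [1]) for lw $2,100($1).

def can_dual_issue(first, second):
    kind1, dest1, _ = first
    kind2, _, srcs2 = second
    slot_ok = kind1 in ("alu", "branch") and kind2 in ("load", "store")
    no_raw = dest1 not in srcs2          # no RAW dependence from first to second
    return slot_ok and no_raw

print(can_dual_issue(("alu", 8, [6, 7]), ("load", 2, [1])))          # True
print(can_dual_issue(("alu", 2, [1, 3]), ("store", None, [2, 1])))   # False: RAW on $2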

Page 92: ECE200 – Computer Organization

Compiler code scheduling

The compiler can improve performance by changing the order of the instructions in the program (code scheduling)

Examples:
Fill branch delay slots
Move instructions between two dependent instructions to eliminate the stall cycles
Reorder instructions to increase the number executed in parallel

Page 93: ECE200 – Computer Organization

Scheduling example – before

Load-use stall
Stall after addi
First three instructions must execute serially due to dependences
Last two must also execute serially for the same reason
Have a branch delay slot to fill

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

Page 94: ECE200 – Computer Organization

Scheduling example – after

All stall cycles are eliminated
Last two instructions can now execute in parallel on the 2-way superscalar MIPS
First two could also, but we would introduce a stall cycle before the addu (the loop is too short – not enough instructions to schedule)

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4      # moved into load delay slot
      addu $t0, $t0, $s2
      bne  $s1, $zero, Loop
      sw   $t0, 4($s1)       # moved into branch delay slot

Page 95: ECE200 – Computer Organization

Loop unrolling

Idea is to take multiple iterations of a loop (“unroll” it) and combine them into one bigger loop

Gives the compiler many instructions to move between dependent instructions and to increase parallel execution

Reduces the overhead of branching

Page 96: ECE200 – Computer Organization

Loop unrolling

Example of prior loop unrolled 4 times:

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t0, -4($s1)
      addu $t0, $t0, $s2
      sw   $t0, -4($s1)
      lw   $t0, -8($s1)
      addu $t0, $t0, $s2
      sw   $t0, -8($s1)
      lw   $t0, -12($s1)
      addu $t0, $t0, $s2
      sw   $t0, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop

Original code:

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

Page 97: ECE200 – Computer Organization

Loop unrolling

Problem: reuse of $t0 constrains instruction order

Write after read (WAR) and write after write (WAW) hazards

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t0, -4($s1)
      addu $t0, $t0, $s2
      sw   $t0, -4($s1)
      lw   $t0, -8($s1)
      addu $t0, $t0, $s2
      sw   $t0, -8($s1)
      lw   $t0, -12($s1)
      addu $t0, $t0, $s2
      sw   $t0, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop

Page 98: ECE200 – Computer Organization

Loop unrolling

Solution: different registers for each computation

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t1, -4($s1)
      addu $t1, $t1, $s2
      sw   $t1, -4($s1)
      lw   $t2, -8($s1)
      addu $t2, $t2, $s2
      sw   $t2, -8($s1)
      lw   $t3, -12($s1)
      addu $t3, $t3, $s2
      sw   $t3, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop

Page 99: ECE200 – Computer Organization

Loop unrolling

Unrolled loop after scheduling:

New sw offsets due to moving the addi

Loop: lw   $t0, 0($s1)
      lw   $t1, -4($s1)
      lw   $t2, -8($s1)
      lw   $t3, -12($s1)
      addu $t0, $t0, $s2
      addu $t1, $t1, $s2
      addu $t2, $t2, $s2
      addu $t3, $t3, $s2
      addi $s1, $s1, -16
      sw   $t0, 16($s1)
      sw   $t1, 12($s1)
      sw   $t2, 8($s1)
      bne  $s1, $zero, Loop
      sw   $t3, 4($s1)

Page 100: ECE200 – Computer Organization

Modern superscalar processors

Today’s superscalar processors attempt to issue (initiate the execution of) 4-6 instructions each clock cycle

Such processors have multiple integer ALUs, integer multipliers, and floating point units that operate in parallel on different instructions

Because most of these units are pipelined, there is the potential to have tens of instructions simultaneously executing

We must remove several barriers to achieve this

Page 101: ECE200 – Computer Organization

Modern processor challenges

Handling branches in a way that prevents instruction fetch from becoming a bottleneck

Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution

Removing register hazards due to the reuse of registers so that instructions can execute in parallel

Page 102: ECE200 – Computer Organization

Instruction fetch challenges

Branches comprise about 20% of the executed instructions in SPEC integer programs

The branch delay may be >10 instructions in a highly pipelined, superscalar processor

Delayed branches are useless with so many delay slots

Solution: dynamic branch prediction with speculative execution

Page 103: ECE200 – Computer Organization

Dynamic branch prediction

When fetching the branch, predict what the branch outcome and target will be

Fetch instructions from the predicted direction

After executing the branch, verify whether the prediction was correct

If so, continue without any performance penalty

If not, undo and fetch from the other direction

Page 104: ECE200 – Computer Organization

Bimodal branch predictor

Predicts the branch outcome

Works under the assumption that most branches are either taken most of the time or not taken most of the time

Prediction accuracy is ~85-95% with 2048 entries

Page 105: ECE200 – Computer Organization

Bimodal branch predictor

Consists of a small memory and a state machine

Each memory location has 2 bits

The address of the memory is the low-order log2(n) PC bits of a fetched branch instruction

(figure: branch predictor memory with n entries of 2 bits each, addressed by the PC of the fetched branch instruction)

Page 106: ECE200 – Computer Organization

Bimodal branch predictor

When a branch is fetched, the 2-bit memory entry is retrieved

The prediction is based on the high-order bit:
1 = predict taken
0 = predict not taken

Page 107: ECE200 – Computer Organization

Bimodal branch predictor

Once the branch is executed, the state bits are updated and written back into the memory

In the 00 or 11 state, have to be wrong twice in a row to change the prediction

(figure: 2-bit saturating counter state diagram with states 00, 01, 10, 11; transitions are driven by the actual branch outcome)
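
The whole predictor can be sketched in a few lines (Python; the 2048-entry size and 2-bit update rule follow the slides, while the indexing details are assumptions):

# Illustrative sketch of a bimodal predictor: n two-bit saturating counters
# indexed by the low-order PC bits; the prediction comes from the high-order bit.

class BimodalPredictor:
    def __init__(self, n=2048):
        self.n = n
        self.counters = [1] * n            # start in a weakly not-taken state (01)

    def _index(self, pc):
        return (pc >> 2) % self.n          # low-order bits of the branch PC

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # high-order bit set => taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

p = BimodalPredictor()
for outcome in [True, True, False, True]:   # a mostly-taken branch at an assumed PC
    print(p.predict(0x400100), outcome)
    p.update(0x400100, outcome)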

Page 108: ECE200 – Computer Organization

Branch target buffer

Predicts the branch target address
Is this as critical as predicting the branch outcome?

Small memory (typically 256-512 entries) addressed by the low-order branch PC bits

Each entry holds the last target address of the branch

When a branch is fetched, the BTB is accessed and the target address is used if the bimodal predictor predicts “taken”
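
A rough sketch of how the BTB combines with the bimodal predictor at fetch time (Python; it reuses the BimodalPredictor sketch from the previous page and the 512-entry size mentioned above, everything else is assumed):

# Illustrative branch target buffer: maps low-order PC bits to the last taken target.
class BranchTargetBuffer:
    def __init__(self, entries=512):
        self.entries = entries
        self.targets = {}                       # index -> last target address

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        return self.targets.get(self._index(pc))   # None if no target recorded yet

    def record(self, pc, target):
        self.targets[self._index(pc)] = target

# At fetch: use the BTB target only when the bimodal predictor predicts "taken".
def next_fetch_pc(pc, predictor, btb):
    target = btb.lookup(pc)
    if predictor.predict(pc) and target is not None:
        return target
    return pc + 4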

Page 109: ECE200 – Computer Organization

Speculative execution

The execution of the branch, and verification of the prediction, may take many cycles due to RAW dependences with long-latency instructions

We cannot write the register file or data memory until we know the prediction is correct

Execution will eventually stall

lw  $2, 100($1)    # can take >100 cycles
beq $2, $0, Label

Page 110: ECE200 – Computer Organization

Speculative execution

In speculative execution, results are first written to temporary buffers (NOT the register file or data memory)

The results are copied from the buffers to the register file or data memory if the branch prediction has been verified and is correct

If the prediction is incorrect, we discard the results

Page 111: ECE200 – Computer Organization

Speculative execution

Writeback now consists of two stages: instruction completion and instruction commit

Completion: execution is complete, write results to buffer

Commit: branch prediction is verified and correct, copy results from buffers to register file or data memory

Modern processors can speculate through 4-8 branches

(figure: execute → completion writes results to buffers → commit copies results from the buffers to the register file once the branch prediction is verified as correct)
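
As an illustration of the completion/commit split, here is a minimal Python sketch in which completed results sit in a buffer until the branch prediction is verified (the class and method names are assumptions, not processor terminology):

# Illustrative sketch of completion vs. commit with speculative results.
# Completed results go to a buffer; they reach the register file only once the
# branch they depend on has been verified as correctly predicted.

class SpeculativeBuffers:
    def __init__(self):
        self.pending = []            # (dest_reg, value) completed but not committed
        self.register_file = {}

    def complete(self, dest_reg, value):
        self.pending.append((dest_reg, value))      # completion: write the buffer only

    def commit(self):
        for dest_reg, value in self.pending:        # commit: copy to the register file
            self.register_file[dest_reg] = value
        self.pending.clear()

    def squash(self):
        self.pending.clear()                        # misprediction: discard the results

b = SpeculativeBuffers()
b.complete(9, 42)                # a speculatively executed add writes its buffer
b.commit()                       # branch prediction verified as correct
print(b.register_file)           # {9: 42}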

Page 112: ECE200 – Computer Organization

Modern processor challenges

Handling branches in a way that prevents instruction fetch from becoming a bottleneck

Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution

Removing register hazards due to the reuse of registers so that instructions can execute in parallel

Page 113: ECE200 – Computer Organization

Long latency operations

Long latency operations, especially loads that have to access main memory, may stall subsequent instructions

Solution: allow instructions to issue (start executing) out of their original program order but update registers/memory in program order

or  $5, $6, $7     # completed
and $8, $6, $7     # completed
lw  $2, 100($1)    # data not found in on-chip memory, have to get it from main memory
add $9, $2, $2     # waiting for lw
sub $10, $5, $8    # can’t execute even though its operands are available!

Page 114: ECE200 – Computer Organization

Out-of-order issue

Fetched and decoded instructions are placed in a special hardware queue called the issue queue

An instruction waits in the IQ until
Its source operands are available
A suitable functional unit is available

The instruction can then issue

(figure: IF → ID → issue queue → EX → completion → commit, with buffers and the register file at the end)

Page 115: ECE200 – Computer Organization

Out-of-order issue

Every cycle, the destination register numbers (rd or rt) of issuing instructions are broadcast to all instructions in the IQ

A match with a source register number (rs or rt) of an instruction in the IQ indicates the operand will be available

or  $5, $6, $7
and $8, $6, $7
lw  $2, 100($1)    # can take >100 cycles!
add $9, $2, $2
sub $10, $5, $8

(figure: the issued instructions broadcast their destination register numbers to the issue queue, so both operands of a waiting instruction become available)
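
A toy sketch of the wakeup mechanism (Python; the issue-queue entries and the "ready register" set are modeled loosely and only for illustration):

# Illustrative issue-queue sketch: destination register numbers of issuing
# instructions are broadcast; an entry can issue once all its sources are ready.

class IssueQueueEntry:
    def __init__(self, name, sources, ready_regs):
        self.name = name
        self.waiting = {r for r in sources if r not in ready_regs}

    def wakeup(self, broadcast_dest):
        self.waiting.discard(broadcast_dest)   # a source just produced by an issuing instr

    def ready(self):
        return not self.waiting

ready_regs = {1, 5, 6, 7, 8}                   # registers already available
iq = [IssueQueueEntry("add $9,$2,$2", [2, 2], ready_regs),
      IssueQueueEntry("sub $10,$5,$8", [5, 8], ready_regs)]

print([e.name for e in iq if e.ready()])       # sub can issue ahead of the older add
for e in iq:
    e.wakeup(2)                                # lw finally issues and broadcasts $2
print([e.name for e in iq if e.ready()])       # now add is ready too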

Page 116: ECE200 – Computer Organization

Out-of-order issue

Instructions with available source operands can issue ahead of earlier instructions (out of original program order)

(figure: issue queue holding add $9,$2,$2, which is waiting for the lw, and sub $10,$5,$8; the or and and instructions were just issued => issue sub)

Page 117: ECE200 – Computer Organization

Out-of-order issue, in-order commit

Once instructions complete, they write results into the buffers used for speculative execution

However, instructions are written to the register file and data memory in original program order

Why do we need to do this?

(figure: execute → completion → buffers → commit → register file; completion may be out-of-order, commit must be in-order)

or  $5, $6, $7     # commits first
and $8, $6, $7
lw  $2, 100($1)
add $9, $2, $2
sub $10, $5, $8    # completes first

Page 118: ECE200 – Computer Organization

Modern processor challenges

Handling branches in a way that prevents instruction fetch from becoming a bottleneck

Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution

Removing register hazards due to the reuse of registers so that instructions can execute in parallel

Page 119: ECE200 – Computer Organization

Register hazards

The reuse of registers creates WAW and WAR hazards that limit out-of-order issue and parallel execution

Example

Potential for multiple iterations to be executed in parallel
The branch could be predicted as taken with high accuracy
Problem: WAW and WAR hazards involving $t0 and $s1

Solution: register renaming

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

Page 120: ECE200 – Computer Organization

Register renaming

Idea is for the hardware to reassign registers like the compiler does in loop unrolling

Requires implementing more registers than specified in the ISA (e.g., 128 integer registers rather than 32)

Allows every instruction in the pipeline to be given a unique destination register number to eliminate all WAR and WAW register conflicts

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t1, -4($s1)
      addu $t1, $t1, $s2
      sw   $t1, -4($s1)

Page 121: ECE200 – Computer Organization

Register renaming

A register renaming stage is added between decode and the register file access

The original architectural destination register number is replaced by a unique physical register number that is not used by any other instruction

A lookup is done for each source register to find the corresponding physical register number

(figure: decode → rename → reg file; architectural register numbers are used up to the rename stage, physical register numbers after this point)
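
A minimal renaming sketch (Python; the 128-register figure echoes the earlier slide, and the free-list management is simplified for illustration):

# Illustrative register-renaming sketch: each destination gets a fresh physical
# register; sources are looked up in the current map (architectural -> physical).

class Renamer:
    def __init__(self, num_arch=32, num_phys=128):
        self.map = {r: r for r in range(num_arch)}            # initial identity mapping
        self.free = list(range(num_arch, num_phys))            # currently unused physical regs

    def rename(self, dest, sources):
        phys_sources = [self.map[s] for s in sources]           # look up sources first
        phys_dest = self.free.pop(0)                             # fresh physical register
        self.map[dest] = phys_dest                                # later readers see the new mapping
        return phys_dest, phys_sources

r = Renamer()
# Two iterations of "addi $s1, $s1, -4" ($s1 is register 17) no longer share a
# destination register, so the WAW/WAR hazards between them disappear.
print(r.rename(17, [17]))    # (32, [17])
print(r.rename(17, [17]))    # (33, [32])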

Page 122: ECE200 – Computer Organization

Register renaming

Example: two iterations of the loop with branch predicted taken

WAR hazard involving $s1 is removed, allowing the addi to complete before the first iteration is completed

The WAW and WAR hazards involving $t0 are removed
Removing both of these restrictions allows the second iteration to proceed in parallel with the first

BEFORE:

lw   $t0, 0($s1)
addu $t0, $t0, $s2
sw   $t0, 0($s1)
addi $s1, $s1, -4
<bne predicted taken>
lw   $t0, 0($s1)
addu $t0, $t0, $s2
sw   $t0, 0($s1)
addi $s1, $s1, -4
<bne predicted taken>

AFTER:

lw   $p1, 0($p3)
addu $p2, $p1, $p10
sw   $p2, 0($p3)
addi $p4, $p3, -4
<bne predicted taken>
lw   $p7, 0($p4)
addu $p23, $p7, $p10
sw   $p23, 0($p4)
addi $p11, $p4, -4
<bne predicted taken>

Page 123: ECE200 – Computer Organization

The MIPS R12000 microprocessor

4-way superscalar

Five execution units:
2 integer
2 floating point
1 load/store for effective address calculation and data memory access

Dynamic branch prediction and speculative execution

ooo issue, in-order commit

Register renaming

Page 124: ECE200 – Computer Organization

R12000 pipeline (ALU operations)

Fetch stages 1 and 2
Fetch 4 instructions each cycle
Predict branches
Split into two stages to enable higher clock rates (R10K had 1)

Decode stage
Decode and rename 4 instructions each cycle
Put into issue queues

Issue stage
Check source operand availability
Read source operands from register file (or bypass paths) for issued instructions

Execute stage
Execute and complete

Write stage
Write results to physical registers

Page 125: ECE200 – Computer Organization

R12000 branch prediction

2048-entry bimodal predictor

32 entry branch target address cache

Speculation through four branches

Page 126: ECE200 – Computer Organization

R12000 ooo completion, in-order commit

Separate 16-entry issue queues for integer, floating point, and memory (load and store) instructions

Hardware tracks the program order and status (completed, caused exception, etc) of up to 48 instructions

Page 127: ECE200 – Computer Organization

R12000 register renaming

64 integer and 64 floating point physical registers

Hardware lookup table to correlate architectural registers with physical registers

Hardware maintains list of currently unused registers that can be assigned as destination registers

Page 128: ECE200 – Computer Organization

R10000 die photo

Page 129: ECE200 – Computer Organization

R12000 summary

R10000 was one of the 1st microprocessors to implement the “issue queue” approach to ooo superscalar execution
PowerPC processors use the “reservation station” approach discussed in the book

Clock rate was slow
R12000 provided a slight improvement with some redesign

Pentium and Alpha processors are ooo but with much faster clock rates

Very hard to get significant improvement beyond 4-6 way issue
Branch prediction accuracy needs to be extremely high
Finding parallel operations in many programs is difficult
Long latency of loads creates an operand supply problem
Keeping the clock rate high is tough

Page 130: ECE200 – Computer Organization

Questions?