60
CPE 631 Review: Pipelining Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic, [email protected] http:// www.ece.uah.edu/~milenka

CPE 631 Review: Pipelining

  • Upload
    presta

  • View
    47

  • Download
    1

Embed Size (px)

DESCRIPTION

CPE 631 Review: Pipelining. Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic, [email protected] http://www.ece.uah.edu/~milenka. Outline. Pipelined Execution 5 Steps in MIPS Datapath Pipeline Hazards Structural Data Control. - PowerPoint PPT Presentation

Citation preview

Page 1: CPE 631 Review: Pipelining

CPE 631 Review: Pipelining

Electrical and Computer EngineeringUniversity of Alabama in Huntsville

Aleksandar Milenkovic, [email protected]

http://www.ece.uah.edu/~milenka

Page 2: CPE 631 Review: Pipelining

2

AM

LaCASA

Outline

Pipelined Execution 5 Steps in MIPS Datapath Pipeline Hazards

Structural Data Control

Page 3: CPE 631 Review: Pipelining

3

AM

LaCASA

Laundry Example (by David Patterson)

Four loads of clothes: A, B, C, D

Task: each one to wash, dry, and fold Resources

Washer takes 30 minutes

Dryer takes 40 minutes

“Folder” takes 20 minutes

A B C D

Page 4: CPE 631 Review: Pipelining

4

AM

LaCASA

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads If they learned pipelining,

how long would laundry take?

A

B

C

D

30 40 2030 40 2030 40 2030 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

Page 5: CPE 631 Review: Pipelining

5

AM

LaCASA

Pipelined Laundry

Pipelined laundry takes 3.5 hours for 4 loads

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20

Page 6: CPE 631 Review: Pipelining

6

AM

LaCASA

Pipelining Lessons Pipelining doesn’t help

latency of single task, it helps throughput of entire workload

Pipeline rate is limited by slowest pipeline stage

Multiple tasks operating simultaneously

Potential speedup = Number pipe stages

Unbalanced lengths of pipe stages reduces speedup

Time to “fill” pipeline and time to “drain” reduce speedup

A

B

C

D

6 PM 7 8 9

Task

Order

Time

30 40 40 40 40 20

Page 7: CPE 631 Review: Pipelining

7

AM

LaCASA

Computer Pipelines

Execute billions of instructions, so throughput is what matters

What is desirable in instruction sets for pipelining? Variable length instructions vs.

all instructions same length? Memory operands part of any operation vs.

memory operands only in loads or stores? Register operand many places in instruction

format vs. registers located in same place?

Page 8: CPE 631 Review: Pipelining

8

AM

LaCASA

A "Typical" RISC

Registers 32 64-bit general-purpose (integer) registers (R0-R31)

32 64-bit floating-point registers (F0-F31)

Data types 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit

double words for integer data 32-bit single- or 64-bit double-precision numbers

Addressing Modes for MIPS Data Transfers Load-store architecture: Immediate, Displacement Memory is byte addressable with a 64-bit address Mode bit to select Big Endian or Little Endian

Page 9: CPE 631 Review: Pipelining

9

AM

LaCASA

MIPS64 Instruction Formats

Op31 26 01516202125

Rs Rt immediate

Op31 26 025

Op31 26 01516202125

Rs Rt

address

Rd funct

Register-Register561011

Register-Immediate

Jump / Call

shamt

Op31 26 01516202125

Fmt Ft Fs funct

561011

Fd

Floating-point (FR)

Op31 26 01516202125

Fmt Ft immediate

Floating-point (FI)

Page 10: CPE 631 Review: Pipelining

10

AM

LaCASA

MIPS64 Instructions

MIPS Operations(See Appendix B, Figure B.26)

Data Transfers (LB, LBU, SB, LH, LHU, SH, LW, LWU, SW, LD, SD, L.S, L.D, S.S, S.D, MFCO, MTCO, MOV.S, MOV.D, MFC1, MTC1)

Arithmetic/Logical (DADD, DADDI, DADDU, DADDIU, DSUB, DSUBU, DMUL, DMULU, DDIV, DDIVU, MADD, AND, ANDI, OR, ORI, XOR, XORI, LUI, DSLL, DSRL, DSRA, DSLLV, DSRLV, DSRAV, SLT, SLTI, SLTU, SLTIU)

Control (BEQZ, BNEZ, BEQ, BNE, BC1T, BC1F, MOVN, MOVZ, J, JR, JAL, JALR, TRAP, ERET)

Floating Point (ADD.D, ADD.S, ADD.PS, SUB.D, SUB.S, SUB.PS, MUL.D, MUL.S, MUL.PS, MADD.D, MADD.S, MADD.PS, DIV.D, DIV.S, DIV.PS, CVT._._, C._.D, C._.S

Page 11: CPE 631 Review: Pipelining

11

AM

LaCASA

5 Steps of Simple RISC Datapath

MemoryAccess

Write

Back

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

LMD

ALU

MU

X

Mem

ory

Reg File

MU

XM

UX

Data

Mem

ory

MU

X

SignExtend

4

Ad

der Zero?

Next SEQ PC

Addre

ss

Next PC

WB Data

Inst

RD

RS1

RS2

Imm

Page 12: CPE 631 Review: Pipelining

12

AM

LaCASA

5 Steps of Simple RISC Datapath (cont’d)

MemoryAccess

Write

Back

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MU

XM

UX

Data

Mem

ory

MU

X

SignExtend

Zero?

IF/ID

ID/E

X

MEM

/WB

EX

/MEM

4

Ad

der

Next SEQ PC Next SEQ PC

RD RD RD WB

Data

• Data stationary control– local decode for each instruction phase / pipeline stage

Next PC

Addre

ss

RS1

RS2

Imm

MU

X

Page 13: CPE 631 Review: Pipelining

13

AM

LaCASA

Visualizing Pipeline

Reg AL

U

DMIM Reg

Reg AL

U

DMIM Reg

Reg AL

U

DMIM Reg

Instr.

Order

Time (clock cycles)

Reg AL

U

DMIM Reg

CC 2 CC 3 CC 4 CC 6 CC 7CC 5CC 1

Page 14: CPE 631 Review: Pipelining

14

AM

LaCASA

Instruction Flow through Pipeline

Reg

ALU

DM

IMR

eg

CC 2 CC 3CC 1

Reg

ALU

DM

IMR

eg

Reg

ALU

DM

IMR

eg

Add R1,R2,R3

Nop

Nop

Nop

Add R1,R2,R3

Nop

Nop

Lw R4,0(R2)

Add R1,R2,R3

Nop

Lw R4,0(R2)

Sub R6,R5,R7

Reg

ALU

DM

IMR

eg

Add R1,R2,R3

Lw R4,0(R2)

Sub R6,R5,R7

CC 4

Xor R9,R8,R1

Time (clock cycles)

Page 15: CPE 631 Review: Pipelining

15

AM

LaCASA

Simple RISC Pipeline Definition: IF, ID

Stage IF IF/ID.IR Mem[PC]; if EX/MEM.cond {IF/ID.NPC, PC

EX/MEM.ALUOUT} else {IF/ID.NPC, PC PC + 4};

Stage ID ID/EX.A Regs[IF/ID.IR6…10];

ID/EX.B Regs[IF/ID.IR11…15]; ID/EX.Imm (IF/ID.IR16)16 ## IF/ID.IR16…31; ID/EX.NPC IF/ID.NPC;

ID/EX.IR IF/ID.IR;

Page 16: CPE 631 Review: Pipelining

16

AM

LaCASA

Simple RISC Pipeline Definition: IE

ALU EX/MEM.IR ID/EX.IR; EX/MEM.ALUOUT ID/EX.A func ID/EX.B; or

EX/MEM.ALUOUT ID/EX.A func ID/EX.Imm; EX/MEM.cond 0;

load/store EX/MEM.IR ID/EX.IR;

EX/MEM.B ID/EX.B; EX/MEM.ALUOUT ID/EX.A ID/EX.Imm; EX/MEM.cond 0;

branch EX/MEM.Aluout ID/EX.NPC (ID/EX.Imm<< 2); EX/MEM.cond (ID/EX.A func 0);

Page 17: CPE 631 Review: Pipelining

17

AM

LaCASA

Simple RISC Pipeline Def.: MEM, WB

Stage MEM ALU

MEM/WB.IR EX/MEM.IR; MEM/WB.ALUOUT EX/MEM.ALUOUT;

load/store MEM/WB.IR EX/MEM.IR; MEM/WB.LMD Mem[EX/MEM.ALUOUT] or

Mem[EX/MEM.ALUOUT] EX/MEM.B; Stage WB

ALU Regs[MEM/WB.IR16…20] MEM/WB.ALUOUT; or

Regs[MEM/WB.IR11…15] MEM/WB.ALUOUT; load

Regs[MEM/WB.IR11…15] MEM/WB.LMD;

Page 18: CPE 631 Review: Pipelining

18

AM

LaCASA

Its Not That Easy for Computers

Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this

combination of instructions Data hazards: Instruction depends on result of

prior instruction still in the pipeline Control hazards: Caused by delay between

the fetching of instructions and decisions about changes in control flow (branches and jumps)

Page 19: CPE 631 Review: Pipelining

19

AM

LaCASA

One Memory Port/Structural Hazards

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Instr 3

Instr 4

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5

Reg

ALU

DMemIfetch Reg

Page 20: CPE 631 Review: Pipelining

20

AM

LaCASA

One Memory Port/Structural Hazards (cont’d)

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Stall

Instr 3

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5

Reg

ALU

DMemIfetch Reg

Bubble Bubble Bubble BubbleBubble

Page 21: CPE 631 Review: Pipelining

21

AM

LaCASA

Instr.

Order

dadd r1,r2,r3

dsub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Data Hazard on R1Time (clock cycles)

IF ID/RF EX MEM WB

Page 22: CPE 631 Review: Pipelining

22

AM

LaCASA

Three Generic Data Hazards

Read After Write (RAW) InstrJ tries to read operand before InstrI writes it

Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

I: add r1,r2,r3J: sub r4,r1,r3

Page 23: CPE 631 Review: Pipelining

23

AM

LaCASA

Three Generic Data Hazards

Write After Read (WAR) InstrJ writes operand before InstrI reads it

Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.

Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5

I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7

Page 24: CPE 631 Review: Pipelining

24

AM

LaCASA

Three Generic Data Hazards

Write After Write (WAW) InstrJ writes operand before InstrI writes it.

Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Can’t happen in MIPS 5 stage pipeline because:

All instructions take 5 stages, and Writes are always in stage 5

I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7

Page 25: CPE 631 Review: Pipelining

25

AM

LaCASA

Time (clock cycles)

Forwarding to Avoid Data Hazard

Inst

r.

Order

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Page 26: CPE 631 Review: Pipelining

26

AM

LaCASA

HW Change for Forwarding

MEM

/WR

ID/E

X

EX

/MEM

DataMemory

ALU

mux

mux

Registe

rs

NextPC

Immediate

mux

Page 27: CPE 631 Review: Pipelining

27

AM

LaCASA

Forwarding to DM input

lw R4,0(R1)

sw 12(R1),R4

add R1,R2,R3

Inst.

Order

Time (clock cycles)

- Forward R1 from EX/MEM.ALUOUT to ALU input (lw)- Forward R1 from MEM/WB.ALUOUT to ALU input (sw)- Forward R4 from MEM/WB.LMD to memory input (memory output to memory input)

Reg AL

U

DMIM Reg

Reg AL

U

DMIM Reg

Reg AL

U

DMIM Reg

CC 2 CC 3 CC 4 CC 6 CC 7CC 5CC 1

Page 28: CPE 631 Review: Pipelining

28

AM

LaCASA

Forwarding to DM input (cont’d)

sw 0(R4),R1

add R1,R2,R3

Inst.

Order

Time (clock cycles)

Forward R1 from MEM/WB.ALUOUT to DM input

Reg AL

U

DMIM Reg

Reg AL

U

DMIM Reg

CC 2 CC 3 CC 4 CC 6CC 5CC 1

Page 29: CPE 631 Review: Pipelining

29

AM

LaCASA

Forwarding to Zero

beqz R1,50

add R1,R2,R3

Instruction

Order

Time (clock cycles)

Forward R1 from EX/MEM.ALUOUT to Zero

sub R4,R5,R6

bneq R1,50

add R1,R2,R3

Forward R1 from MEM/WB.ALUOUT to Zero

Reg AL

U

DMIM Reg

Reg AL

U

DMIM Reg

CC 2 CC 3 CC 4 CC 6CC 5CC 1

Reg AL

U

DMIM Reg

Reg AL

U

DMIM Reg

Reg AL

U

DMIM Reg

Z

Z

Page 30: CPE 631 Review: Pipelining

30

AM

LaCASA

Time (clock cycles)

Instr.

Order

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

or r8,r1,r9

Data Hazard Even with Forwarding

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Page 31: CPE 631 Review: Pipelining

31

AM

LaCASA

Data Hazard Even with Forwarding

Time (clock cycles)

or r8,r1,r9

Instr.

Order

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

Reg

ALU

DMemIfetch Reg

RegIfetch

ALU

DMem RegBubble

Ifetch

ALU

DMem RegBubble Reg

Ifetch

ALU

DMemBubble Reg

Page 32: CPE 631 Review: Pipelining

32

AM

LaCASA

Try producing fast code for

a = b + c;

d = e – f;

assuming a, b, c, d ,e, and f in memory. Slow code:

LW Rb,b

LW Rc,c

ADD Ra,Rb,Rc

SW a,Ra

LW Re,e

LW Rf,f

SUB Rd,Re,Rf

SW d,Rd

Software Scheduling to Avoid Load Hazards

Fast code:

LW Rb,b

LW Rc,c

LW Re,e

ADD Ra,Rb,Rc

LW Rf,f

SW a,Ra

SUB Rd,Re,Rf

SW d,Rd

Page 33: CPE 631 Review: Pipelining

33

AM

LaCASA

Control Hazard on BranchesThree Stage Stall

10: beq r1,r3,36

14: and r2,r3,r5

18: or r6,r1,r7

22: add r8,r1,r9

36: xor r10,r1,r11

Reg ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Page 34: CPE 631 Review: Pipelining

34

AM

LaCASA

Example: Branch Stall Impact

If 30% branch, Stall 3 cycles significant Two part solution:

Determine branch taken or not sooner, AND Compute taken branch address earlier

MIPS branch tests if register = 0 or 0 MIPS Solution:

Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3

Page 35: CPE 631 Review: Pipelining

35

AM

LaCASA

Ad

der

IF/ID

Pipelined Simple RISC DatapathMemoryAccess

Write

Back

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MU

X

Data

Mem

ory

MU

X

SignExtend

Zero?

MEM

/WB

EX

/MEM

4

Ad

der

Next SEQ PC

RD RD RD WB

Data

• Data stationary control– local decode for each instruction phase / pipeline stage

Next PC

Addre

ss

RS1

RS2

ImmM

UX

ID/E

X

Page 36: CPE 631 Review: Pipelining

36

AM

LaCASA

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear #2: Predict Branch Not Taken

Execute successor instructions in sequence “Squash” instructions in pipeline if branch

actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next

instruction

Page 37: CPE 631 Review: Pipelining

37

AM

LaCASA

Branch not Taken

Time [clocks]MemIF ID Ex WB

IF ID

branch(not taken)

5

IF ID Ex Mem WB

Ex Mem WB

Ii+1

Ii+2

MemIF ID Ex WB

Instructions

branch(taken)

5

IF idle idle idle idle

IF ID Ex Mem WB

Ii+1

branchtarget

IF ID Ex Mem WBbranchtarget+1

Branch is untaken(determined during ID),we have fetched the fall-through and just continue no wasted cycles

Branch is taken(determined during ID),restart the fetch from at the branch target one cycle wasted

Page 38: CPE 631 Review: Pipelining

38

AM

LaCASA

Four Branch Hazard Alternatives

#3: Predict Branch Taken Treat every branch as taken 53% MIPS branches taken on average But haven’t calculated branch target address

in MIPS MIPS still incurs 1 cycle branch penalty

Make sense only when branch target is known before branch outcome

Page 39: CPE 631 Review: Pipelining

39

AM

LaCASA

Four Branch Hazard Alternatives

#4: Delayed Branch Define branch to take place AFTER a following

instruction

branch instructionsequential successor1sequential successor2........sequential successornbranch target if taken

1 slot delay allows proper decision and branch target address in 5 stage pipeline

MIPS uses this

Branch delay of length n

Page 40: CPE 631 Review: Pipelining

40

AM

LaCASA

Delayed Branch

Where to get instructions to fill branch delay slot? Before branch instruction From the target address:

only valuable when branch taken From fall through:

only valuable when branch not taken

Page 41: CPE 631 Review: Pipelining

41

AM

LaCASA

Scheduling the branch delay slot: From Before

Delay slot is scheduled with an independent instruction from before the branch

Best choice, always improves performance

ADD R1,R2,R3

if(R2=0) then

<Delay Slot>

Becomes

if(R2=0) then

<ADD R1,R2,R3>

Page 42: CPE 631 Review: Pipelining

42

AM

LaCASA

Scheduling the branch delay slot: From Target

Delay slot is scheduled from the target of the branch

Must be OK to execute that instruction if branch is not taken

Usually the target instruction will need to be copied because it can be reached by another path programs are enlarged

Preferred when the branch is taken with high probability

SUB R4,R5,R6

...

ADD R1,R2,R3

if(R1=0) then

<Delay Slot>

Becomes

...

ADD R1,R2,R3

if(R2=0) then

<SUB R4,R5,R6>

Page 43: CPE 631 Review: Pipelining

43

AM

LaCASA

Scheduling the branch delay slot:From Fall Through

Delay slot is scheduled from thetaken fall through

Must be OK to execute that instruction if branch is taken

Improves performance when branch is not taken

ADD R1,R2,R3

if(R2=0) then

<Delay Slot>

SUB R4,R5,R6

Becomes

ADD R1,R2,R3

if(R2=0) then

<SUB R4,R5,R6>

Page 44: CPE 631 Review: Pipelining

44

AM

LaCASA

Delayed Branch Effectiveness

Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch

delay slots useful in computation About 50% (60% x 80%) of slots usefully filled

Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Page 45: CPE 631 Review: Pipelining

45

AM

LaCASA

Example: Branch Stall Impact

Assume CPI = 1.0 ignoring branches Assume solution was stalling for 3 cycles If 30% branch, Stall 3 cycles

Op Freq Cycles CPI(i) (% Time)

Other 70% 1 .7 (37%) Branch 30% 4 1.2 (63%)

=> new CPI = 1.9, or almost 2 times slower

Page 46: CPE 631 Review: Pipelining

46

AM

LaCASA

Example 2: Speed Up Equation for Pipelining

pipelined

dunpipeline

TimeCycle

TimeCycle

CPI stall Pipeline CPI Idealdepth Pipeline CPI Ideal

Speedup

pipelined

dunpipeline

TimeCycle

TimeCycle

CPI stall Pipeline 1depth Pipeline

Speedup

Instper cycles Stall Average CPI Ideal CPIpipelined

For simple RISC pipeline, CPI = 1:

Page 47: CPE 631 Review: Pipelining

47

AM

LaCASA

Example 3: Evaluating Branch Alternatives (for 1 program)

Scheduling Branch CPI speedup v. scheme penalty stall

Stall pipeline 3 1.42 1.0 Predict taken 1 1.14 1.26 Predict not taken 1 1.09 1.29 Delayed branch 0.5 1.07 1.31

Conditional & Unconditional = 14%, 65% change PC

Page 48: CPE 631 Review: Pipelining

48

AM

LaCASA

Example 4: Dual-port vs. Single-port

Machine A: Dual ported memory (“Harvard Architecture”)

Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate

Ideal CPI = 1 for both Loads&Stores are 40% of instructions executed

Page 49: CPE 631 Review: Pipelining

49

AM

LaCASA

Extended Simple RISC Pipeline

IF ID

EXInt

EXFP/I Mult

EXFP Add

EXFP/I Div

Mem WB

DLX pipe with three unpipelined, FP functional units

In reality, the intermediate results are probably not cycled around the EX unit; instead the EX stages has some number of clock delays larger than 1

Page 50: CPE 631 Review: Pipelining

50

AM

LaCASA

Extended Simple RISC Pipeline (cont’d)

Initiation or repeat interval: number of clock cycles that must elapse between issuing two operations

Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result

Functional unit Latency Initiation interval

Integer ALU 0 1

Data Memory 1 1

FP Add 3 1

FP/Integer Multiply 6 1

FP/Integer Divide 24 25

Page 51: CPE 631 Review: Pipelining

51

AM

LaCASA

Extended Simple RISC Pipeline (cont’d)

IF ID

Ex

M1 M2 M3 M4 M5 M6 M7

A1 A2 A3

..

M WB

A4

Page 52: CPE 631 Review: Pipelining

52

AM

LaCASA

Extended Simple RISC Pipeline (cont’d)

Multiple outstanding FP operations FP/I Adder and Multiplier are fully pipelined FP/I Divider is not pipelined

Pipeline timing for independent operationsMUL.D IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB

ADD.D IF ID A1 A2 A3 A4 Mem WB

L.D IF ID Ex Mem WB

S.D IF ID Ex Mem WB

Page 53: CPE 631 Review: Pipelining

53

AM

LaCASA

Hazards and Forwarding in Longer Pipes

Structural hazard: divide unit is not fully pipelined detect it and stall the instruction

Structural hazard: number of register writes can be larger than one due to varying running times

WAW hazards are possible Exceptions!

instructions can complete in different order than they were issued

RAW hazards will be more frequent

Page 54: CPE 631 Review: Pipelining

54

AM

LaCASA

Examples

Stalls arising from RAW hazards

Three instructions that want to perform a write back to the FP register file simultaneously

L.D F4, 0(R2) IF ID EX Mem WB

MUL.D F0, F4, F6

IF ID stall M1 M2 M3 M4 M5 M6 M7 Mem WB

ADD.D F2, F0, F8

IF stall ID stall stall stall stall stall stall A1 A2 A3 A4 Mem WB

S.D 0(R2), F2 IF stall stall stall stall stall stall ID EX stall stall stall Mem

MUL.D F0, F4, F6

IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB

... IF ID EX Mem WB

... IF ID EX Mem WB

ADD.D F2, F4, F6

IF ID A1 A2 A3 A4 Mem WB

... IF ID EX Mem WB

... IF ID EX Mem WBL.D F2, 0(R2) IF ID EX Mem WB

Page 55: CPE 631 Review: Pipelining

55

AM

LaCASA

Solving Register Write Conflicts

First approach: track the use of the write port in the ID stage and stall an instruction before it issues

use a shift register that indicates when already issued instructions will use the register file

if there is a conflict with an already issued instruction, stall the instruction for one clock cycle

on each clock cycle the reservation register is shifted one bit Alternative approach: stall a conflicting instruction when it tries

to enter MEM or WB stage we can stall either instruction e.g. give priority to the unit with the longest latency Pros: does not require to detect the conflict until the entrance of

MEM or WB stage Cons: complicates pipeline control; stalls now can arise from two

different places

Page 56: CPE 631 Review: Pipelining

56

AM

LaCASA

WAW Hazards

Result of ADD.D is overwritten without any instruction ever using it WAWs occur when useless instruction is executed still, we must detect them and provide correct

executionWhy?

IF ID EX Mem WB

ADD.D F2, F4, F6 IF ID A1 A2 A3 A4 Mem WB

IF ID EX Mem WB

L.D F2, 0(R2) IF ID EX Mem WB

BNEZ R1, foo

DIV.D F0, F2, F4 ; delay slot from fall-through

...

foo: L.D F0, qrs

Page 57: CPE 631 Review: Pipelining

57

AM

LaCASA

Solving WAW Hazards

First approach: delay the issue of load instruction until ADD.D enters MEM

Second approach: stamp out the result of the ADD.D by detecting the hazard and changing the control so that ADDD does not write; LD issues right away

Detect hazard in ID when LD is issuing stall LD, or make ADDD no-op

Luckily this hazard is rare

Page 58: CPE 631 Review: Pipelining

58

AM

LaCASA

Hazard Detection in ID Stage

Possible hazards hazards among FP instructions hazards between an FP instruction and an

integer instr. FP and integer registers are distinct,

except for FP load-stores, and FP-integer moves

Assume that pipeline does all hazard detection in ID stage

Page 59: CPE 631 Review: Pipelining

59

AM

LaCASA

Hazard Detection in ID Stage (cont’d)

Check for structural hazards wait until the required functional unit is not busy and

make sure that the register write port is available Check for RAW data hazards

wait until source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result

Check for WAW data hazards determine if any instruction in A1, .. A4, M1, .. M7, D

has the same register destination as this instruction; if so, stall the issue of the instruction in ID

Page 60: CPE 631 Review: Pipelining

60

AM

LaCASA

Forwarding Logic

Check if the destination register in any of EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of a FP instruction

If so, the appropriate input multiplexer will have to be enabled so as to choose the forwarded data