Pipelining IV

Topics
  Implementing pipeline control
  Pipelining and performance analysis

Systems I

Implementing Pipeline Control

Combinational logic generates pipeline control signals
Action occurs at start of following cycle

[Figure: pipeline registers F, D, E, M, and W connected to the pipe control logic block. The logic reads D_icode, d_srcA, d_srcB, E_icode, E_dstM, e_Bch, and M_icode, and produces F_stall, D_stall, D_bubble, and E_bubble.]

Initial Version of Pipeline Control

bool F_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode };

bool D_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Bch) ||
    # Bubble for ret
    IRET in { D_icode, E_icode, M_icode };

bool E_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Bch) ||
    # Load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };
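The four HCL equations of the initial version can be exercised with a small Python model. This is only a sketch: the instruction codes are represented by hypothetical string constants (the real Y86 encodings are numeric), registers are plain strings, and the function name `initial_control` is made up for illustration.

```python
# Hypothetical stand-ins for the Y86 instruction codes; real encodings differ.
IMRMOVL, IPOPL, IJXX, IRET, INOP, IOPL = "mrmovl", "popl", "jXX", "ret", "nop", "opl"

def initial_control(D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Bch):
    """Return the four control signals of the initial (uncorrected) logic."""
    # Load in E writes a register that the instruction in D wants to read.
    load_use = E_icode in (IMRMOVL, IPOPL) and E_dstM in (d_srcA, d_srcB)
    # A ret is somewhere between decode and memory.
    ret_in_flight = IRET in (D_icode, E_icode, M_icode)
    # A conditional jump in E turned out not to be taken.
    mispredict = E_icode == IJXX and not e_Bch
    return {
        "F_stall": load_use or ret_in_flight,
        "D_stall": load_use,
        "D_bubble": mispredict or ret_in_flight,
        "E_bubble": mispredict or load_use,
    }

# Load/use hazard: mrmovl writing %eax in E, an ALU op reading %eax in D.
signals = initial_control(IOPL, IMRMOVL, INOP, "%eax", "%eax", "%ebx", False)
print(signals)
```

For this load/use case the model asserts F_stall, D_stall, and E_bubble and leaves D_bubble clear, matching the intent of the equations.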

Control Combinations

Special cases that can arise on same clock cycle

Combination A
  Not-taken branch
  ret instruction at branch target

Combination B
  Instruction that reads from memory to %esp
  Followed by ret instruction

[Figure: pipeline states for each special condition.
  Load/use: load in E, use in D
  Mispredict: JXX in E
  ret 1: ret in D
  ret 2: ret in E, bubble in D
  ret 3: ret in M, bubbles in D and E
 Combination A overlaps Mispredict with ret 1; Combination B overlaps Load/use with ret 1.]

Control Combination A

Should handle as mispredicted branch
Stalls F pipeline register
But PC selection logic will be using M_valA anyhow, so the stalled predicted PC is never used

[Figure: Combination A. JXX in E (Mispredict) in the same cycle as ret in D (ret 1).]

Condition            F       D       E       M       W
Processing ret       stall   bubble  normal  normal  normal
Mispredicted Branch  normal  bubble  bubble  normal  normal
Combination          stall   bubble  bubble  normal  normal
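The Combination row can be derived mechanically by merging the two condition rows register by register. The sketch below assumes a simple merge rule that is not stated on the slide: "normal" yields to any other action, identical actions stay as they are, and a simultaneous stall and bubble is flagged as a conflict. The names `merge`, `RET`, and `MISPREDICT` are illustrative only.

```python
STAGES = ("F", "D", "E", "M", "W")

# Per-register actions copied from the table rows.
RET = dict(zip(STAGES, ("stall", "bubble", "normal", "normal", "normal")))
MISPREDICT = dict(zip(STAGES, ("normal", "bubble", "bubble", "normal", "normal")))

def merge(a, b):
    """Merge two actions requested for the same pipeline register."""
    if a == b or b == "normal":
        return a
    if a == "normal":
        return b
    return "conflict"  # stall and bubble requested together

combination_a = {s: merge(RET[s], MISPREDICT[s]) for s in STAGES}
print(combination_a)
```

Under this rule Combination A produces stall, bubble, bubble, normal, normal with no conflict, which is why it needs no special handling.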

Stall in F

Your book provides two inconsistent meanings for “stall in F”
  1. Instruction remains in F and injects a bubble into D
  2. Instruction squashed into D, same PC fetched again

Figure 4.61

Use the one that keeps 1 instruction per pipeline stage

[Figure: pipeline stage diagrams (F D E M W)]

JXX + ret works great!

                                        1  2  3  4  5  6  7  8  9  10
0x000: xorl %eax,%eax                   F  D  E  M  W
0x002: jne target   # Not taken            F  D  E  M  W
0x011: t: ret       # Target                  F  D
       bubble                                       E  M  W
0x012: nop          # Target + 1                 F
       bubble                                       D  E  M  W
0x007: irmovl $1,%eax # Fall through                F  D  E  M  W
0x00d: nop                                             F  D  E  M  W

Control Combination B

Would attempt to bubble and stall pipeline register D
Signaled by processor as pipeline error

[Figure: Combination B. Load in E and use in D (Load/use) in the same cycle as ret in D (ret 1).]

Condition        F      D               E       M       W
Processing ret   stall  bubble          normal  normal  normal
Load/Use Hazard  stall  stall           bubble  normal  normal
Combination      stall  bubble + stall  bubble  normal  normal

Handling Control Combination B

Load/use hazard should get priority
ret instruction should be held in decode stage for additional cycle

[Figure: Combination B. Load in E and use in D (Load/use) in the same cycle as ret in D (ret 1).]

Condition        F      D       E       M       W
Processing ret   stall  bubble  normal  normal  normal
Load/Use Hazard  stall  stall   bubble  normal  normal
Combination      stall  stall   bubble  normal  normal

Corrected Pipeline Control Logic

Load/use hazard should get priority
ret instruction should be held in decode stage for additional cycle

Condition        F      D       E       M       W
Processing ret   stall  bubble  normal  normal  normal
Load/Use Hazard  stall  stall   bubble  normal  normal
Combination      stall  stall   bubble  normal  normal

bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Bch) ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode }
    # but not condition for a load/use hazard
    && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB });
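To see what the extra term changes, the D register signals can be evaluated for Combination B under both versions of D_bubble. This is a sketch only: the string constants stand in for the real instruction codes, and the helper `d_signals` and its `corrected` flag are hypothetical names introduced for the comparison.

```python
# Hypothetical stand-ins for the Y86 instruction codes.
IMRMOVL, IPOPL, IJXX, IRET, INOP = "mrmovl", "popl", "jXX", "ret", "nop"

def d_signals(D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Bch,
              corrected=True):
    """Return D_stall and D_bubble for the initial or corrected logic."""
    load_use = E_icode in (IMRMOVL, IPOPL) and E_dstM in (d_srcA, d_srcB)
    ret_in_flight = IRET in (D_icode, E_icode, M_icode)
    mispredict = E_icode == IJXX and not e_Bch
    if corrected:
        # The ret bubble is suppressed while a load/use hazard is pending.
        d_bubble = mispredict or (ret_in_flight and not load_use)
    else:
        d_bubble = mispredict or ret_in_flight
    return {"D_stall": load_use, "D_bubble": d_bubble}

# Combination B: a load into %esp is in E while ret (which reads %esp) is in D.
combo_b = dict(D_icode=IRET, E_icode=IMRMOVL, M_icode=INOP,
               E_dstM="%esp", d_srcA="%esp", d_srcB="%esp", e_Bch=False)

before = d_signals(**combo_b, corrected=False)
after = d_signals(**combo_b, corrected=True)
print("initial:  ", before)
print("corrected:", after)
```

The initial logic asserts both D_stall and D_bubble, which is the pipeline error described earlier. With the added term only D_stall is asserted, so ret is held in decode for one more cycle.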

Load/use hazard with ret

Cycle       1  2  3  4  5
mrmovl      F  D  E  M  W
  bubble             E  M
ret            F  D  D  E
addl              F  F
  bubble                D
addl                    F

Pipeline Summary

Data Hazards
  Most handled by forwarding
    No performance penalty
  Load/use hazard requires one cycle stall

Control Hazards
  Cancel instructions when a mispredicted branch is detected
    Two clock cycles wasted
  Stall fetch stage while ret passes through pipeline
    Three clock cycles wasted

Control Combinations
  Must analyze carefully
  First version had subtle bug
    Only arises with unusual instruction combination

Performance Analysis with Pipelining

Ideal pipelined machine: CPI = 1
  One instruction completed per cycle
  But much faster cycle time than unpipelined machine

However, hazards work against the ideal
  Hazards resolved using forwarding are fine
  Stalling degrades performance because the instruction completion rate is interrupted

CPI is a measure of the “architectural efficiency” of the design

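\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]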

CPI for PIPE

CPI ≈ 1.0
  Fetch instruction each clock cycle
  Effectively process new instruction almost every cycle
    Although each individual instruction has latency of 5 cycles

CPI > 1.0
  Sometimes must stall or cancel branches

Computing CPI
  C clock cycles
  I instructions executed to completion
  B bubbles injected (C = I + B)

  CPI = C/I = (I+B)/I = 1.0 + B/I
  Factor B/I represents average penalty due to bubbles
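The relation CPI = 1.0 + B/I is easy to check numerically. The sketch below uses made-up counts (1000 instructions, 270 bubbles) purely for illustration; they are not measurements from any program.

```python
def cpi(instructions, bubbles):
    """CPI = C / I, where the cycle count C is instructions plus bubbles."""
    cycles = instructions + bubbles
    return cycles / instructions

# Hypothetical run: 1000 instructions completed, 270 bubbles injected.
I, B = 1000, 270
print(cpi(I, B))    # C / I
print(1.0 + B / I)  # same value via 1.0 + B/I
```

Both expressions give the same result because they are algebraically identical; the second form makes the bubble penalty B/I explicit.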

Computing CPI

CPI
  Function of useful instructions and bubbles
  C_b/C_i represents the pipeline penalty due to stalls

\[ \text{CPI} = \frac{C_i + C_b}{C_i} = 1.0 + \frac{C_b}{C_i} \]

Can reformulate to account for
  load penalties (lp)
  branch misprediction penalties (mp)
  return penalties (rp)

\[ \text{CPI} = 1.0 + lp + mp + rp \]

Computing CPI - II

So how do we determine the penalties?
  Depends on how often each situation occurs on average
  How often does a load occur and how often does that load cause a stall?
  How often does a branch occur and how often is it mispredicted?
  How often does a return occur?

We can measure these
  simulator
  hardware performance counters

We can estimate through historical averages
  Then use to make early design tradeoffs for architecture

Computing CPI - III

Cause       Name  Instruction Frequency  Condition Frequency  Stalls  Product
Load/Use    lp    0.30                   0.3                  1       0.09
Mispredict  mp    0.20                   0.4                  2       0.16
Return      rp    0.02                   1.0                  3       0.06
Total penalty                                                         0.31

CPI = 1 + 0.31 = 1.31, which is 31% worse than ideal

This gets worse when we:
  Account for non-ideal memory access latency
  Use deeper pipelines (where stalls per hazard increase)
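The Product column and the total can be reproduced with a few lines of Python. The numbers are the ones in the table; the variable names are only illustrative.

```python
# (name, instruction frequency, condition frequency, stall cycles)
rows = [
    ("lp", 0.30, 0.3, 1),  # load/use
    ("mp", 0.20, 0.4, 2),  # mispredicted branch
    ("rp", 0.02, 1.0, 3),  # return
]

# Each penalty is the product of its three factors.
penalties = {name: freq * cond * stalls for name, freq, cond, stalls in rows}
total = sum(penalties.values())
cpi = 1.0 + total

for name, value in penalties.items():
    print(f"{name}: {value:.2f}")
print(f"total penalty: {total:.2f}")
print(f"CPI: {cpi:.2f}")
```

Changing any one factor shows how sensitive CPI is to it. For example, halving the misprediction rate halves mp, the largest of the three terms.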

CPI for PIPE (Cont.)

B/I = LP + MP + RP
                                                      Typical Values
LP: Penalty due to load/use hazard stalling
  Fraction of instructions that are loads             0.25
  Fraction of load instructions requiring stall       0.20
  Number of bubbles injected each time                1
  LP = 0.25 * 0.20 * 1 = 0.05

MP: Penalty due to mispredicted branches
  Fraction of instructions that are cond. jumps       0.20
  Fraction of cond. jumps mispredicted                0.40
  Number of bubbles injected each time                2
  MP = 0.20 * 0.40 * 2 = 0.16

RP: Penalty due to ret instructions
  Fraction of instructions that are returns           0.02
  Number of bubbles injected each time                3
  RP = 0.02 * 3 = 0.06

Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27
  CPI = 1.27 (Not bad!)

Summary

Today
  Pipeline control logic
  Effect on CPI and performance

Next Time
  Further mitigation of branch mispredictions
  State machine design
