Data Hazard Solution 2: Data Forwarding Our naïve pipeline would experience many data stalls Register isn’t written until completion of write-back stage

Data Hazard Solution 2: Data Forwarding Our naïve pipeline would experience many data stalls

Register isn’t written until completion of write-back stage Source operands are read from register file in decode stage

Values need to be in register file at start of stage Leads to many more stall cycles than necessary

Key observation The value we want is generated in execute or memory stage It is “available” 1-2 cycles before write-back

Trick: go get it! Pass value directly from stage of generating instruction to decode

stage Must be available before end of decode stage to avoid stall

Detecting Stall Condition

0x000: irmovl $10,%edx

1 2 3 4 5 6 7 8 9

F D E M W

0x006: irmovl $3,%eax F D E M W

0x00c: nop F D E M W

bubble

F

E M W

0x00e: addl %edx,%eax D D E M W

0x010: halt F D E M W

10# demo-h2.ys

F

F D E M W0x00d: nop

11

Cycle 6

W

D

•••

W_dstE = %eaxW_valE = 3

srcA = %edxsrcB = %eax

Data Forwarding Example

irmovl in write-back stage Destination value in W pipeline

register “Forward” as valB for decode stage addl instruction can proceed

without stalling When do we actually know

the values of %eax and %edx?

0x000: irmovl $10,%edx

1 2 3 4 5 6 7 8 9

F D E M WF D E M W

0x006: irmovl $3,%eax F D E M WF D E M W

0x00c: nop F D E M WF D E M W

0x00d: nop F D E M WF D E M W

0x00e: addl% edx ,%eax F D E M WF D E M W

0x010: halt F D E M WF D E M W

10# demo-h2.ys

Cycle 6

W

R[%eax] f 3

D

valA

f R[%edx] = 10valB

f W_ valE

= 3

•••

W_dstE= % eaxW_valE = 3

srcA= % edxsrcB= % eax

Data Forwarding Example #2

Register %edx Generated by ALU during previous

cycle Forward from memory as valA

Register %eax Value just generated by ALU Forward from execute as valB

Forwarding Hardware

Feedback paths from E, M, and W registers to decode stage

Logic blocks to select source for valA and valB in decode stage

Note: we either do forwarding or stall on data hazards

Forwarding has better performance, higher cost

M

F

D

Instructionmemory

Instructionmemory

PCincr.PC

incr.

Registerfile

Registerfile

CCCC ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Writeback

data out

data in

BM

E

M_valA

W_valE

W_valM

W_valE

M_valA

W_valM

f_pc

PredictPC

Cndicode valE valA dstEdstM

E icode ifun valC valA valB dstE dstM srcA srcBstat

valC valPicodeifun rA

predPC

d_srcBd_srcA

e_Cnd

M_Cnd

Sel+FwdA

FwdB

W icode valE valM dstEdstM

m_valM

W_valM

M_valE

e_valE

stat

stat

dstE

stat

statimem_errorinstr_valid

dmem_error

e_dstE

m_stat

stat

stat

Stat PIPE

## Actions of “Sel+Fwd A” block ## Pick the correct A value## Order is important!int new_E_valA = [ # Use incremented PC

D_icode in {ICALL,IJXX} : D_valP; # Forward valE from execute

d_srcA == E_dstE : e_valE; # Forward valM from memory

d_srcA == M_dstM : m_valM; # Forward valE from memory

d_srcA == M_dstE : M_valE; # Forward valM from write back

d_srcA == W_dstM : W_valM; # Forward valE from write back

d_srcA == W_dstE : W_valE; # Use value read from register file 1 : d_rvalA;];

M

F

D

Instructionmemory

Instructionmemory

PCincr.PC

incr.

Registerfile

Registerfile

CCCC ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Writeback

data out

data in

BM

E

M_valA

W_valE

W_valM

W_valE

M_valA

W_valM

f_pc

PredictPC




predPC

d_srcBd_srcA

e_Cnd

M_Cnd

Sel+FwdA

FwdB


m_valM

W_valM

M_valE

e_valE

stat

stat

dstE

stat


dmem_error

e_dstE

m_stat

stat

stat

Stat

Forwarding Control

PIPE

At clock cycle 4d_srcA =d_srcB =E_dstE =e_valE =M_dstE =M_valE =

What are the forwarding conditions?

Highlight the data path in D, E, M stages.

M

F

D

Instructionmemory

Instructionmemory

PCincr.PC

incr.

Registerfile

Registerfile

CCCC ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Writeback

data out

data in

BM

E

M_valA

W_valE

W_valM

W_valE

M_valA

W_valM

f_pc

PredictPC




predPC

d_srcBd_srcA

e_Cnd

M_Cnd

Sel+FwdA

FwdB


m_valM

W_valM

M_valE

e_valE

stat

stat

dstE

stat


dmem_error

e_dstE

m_stat

stat

stat

Stat

Forwarding ExamplePIPE

At clock cycle 4d_srcA = ecxd_srcB = edxM_dstE = edxm_valE = 128E_dstE = ecxe_valE = 3

What are the forwarding conditions?

M

F

D

Instructionmemory

Instructionmemory

PCincr.PC

incr.

Registerfile

Registerfile

CCCC ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Writeback

data out

data in

BM

E

M_valA

W_valE

W_valM

W_valE

M_valA

W_valM

f_pc

PredictPC




predPC

d_srcBd_srcA

e_Cnd

M_Cnd

Sel+FwdA

FwdB


m_valM

W_valM

M_valE

e_valE

stat

stat

dstE

stat


dmem_error

e_dstE

m_stat

stat

stat

Stat

Forwarding ExamplePIPE

Limitation of Forwarding

Load-use dependency Value needed by end of decode

stage in cycle 7 Value read from memory in

memory stage of cycle 8 Terminology

This is a load hazard Only solution is a load stall Preferred: compiler avoid

Load/Use Hazard: Desired Behavior

Best we can do in hardware Stall reading instruction for one cycle Then forward value from memory stage

Better yet Have compiler avoid in code it generates

Addressing Load/Use Hazard

M

F

D

Instructionmemory

Instructionmemory

PCincr.PC

incr.

Registerfile

Registerfile

CCCC ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Writeback

data out

data in

BM

E

M_valA

W_valE

W_valM

W_valE

M_valA

W_valM

f_pc

PredictPC




predPC

d_srcBd_srcA

e_Cnd

M_Cnd

Sel+FwdA

FwdB


m_valM

W_valM

M_valE

e_valE

stat

stat

dstE

stat


dmem_error

e_dstE

m_stat

stat

stat

Stat

Detection Previous instr. is loading from

memory to src register dstM in E register matches

srcA or srcB (and not 0xF) Action

Stall instruction in decode

Interrupts and Exceptions Basic interrupt mechanism

…instr iinstr i+1instr i+2instr i+3instr i+4instr i+5instr i+6…

CPU runningcurrent process Event occurs that

needs attention

(e.g., Disc read finishes)HW asserts CPU

interrupt line

Control transferred to interrupt handler(think HW-induced

function call)

instr 1instr 2instr 3…

Handler

How is state of interrupted process saved?

How is location of handler determined?

Interrupt Handling Calling handler

Save return address (PC) on stack Address of next instruction to be executed for this process

– Depending on event, either current or next instruction PC usually passed through pipeline along with instruction Precise exception: all instructions to PC executed, none past (“Clean Break”)

Jump to handler address Usually obtained from table stored at fixed memory address Index to table entry determined by exception type/interrupt priority level Interrupt vector table written by software, accessed by hardware

Implementation Critical for real hardware Seldom implemented in simulators: no OS running to pass control to!

Exceptions Events occurring within processor under which pipeline cannot

continue normal operation Possible causes

Halt instruction executed (Current)

Bad address for instruction or data (Previous)

Invalid instruction (Previous)

Pipeline control error (Previous)

System calls, page faults, math errors (not in Y86) Desired action

Complete instructions to specific point Either current or previous (depends on exception type)

Discard instructions that follow Transfer control to exception handler in OS

Save return address, get handler address from table

Exception Examples Detect in fetch stage

irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address (for Y86 tools)

jmp $-1 # Invalid jump target

.byte 0xFF # Invalid instruction code

halt # Halt instruction

Detect in memory stage

Exceptions in Pipeline Processor (#1)

Desired behavior rmmovl should cause exception (1st in sequential machine) Tricky because invalid instruction code detected first

# demo-exc1.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # Invalid address nop .byte 0xFF # Invalid instruction code

0x000: irmovl $100,%eax

1 2 3 4

F D E M

F D E0x006: rmmovl %eax,0x10000(%eax)

0x00c: nop

0x00d: .byte 0xFF

F D

F

W

5

M

E

D

Exception detected

Exception detected

Exceptions in Pipeline Processor (#2)

Desired behavior No exception should occur Must match behavior and results of sequential execution

# demo-exc2.ys 0x000: xorl %eax,%eax # Set condition codes 0x002: jne t # Not taken 0x007: irmovl $1,%eax 0x00d: irmovl $2,%edx 0x013: halt 0x014: t: .byte 0xFF # Target

0x000: xorl %eax,%eax

1 2 3

F D E

F D0x002: jne t

0x014: t: .byte 0xFF

0x???: (I’m lost!)

F

Exception detected


4

M

E

F

D

W

5

M

D

F

E

M

E

W

7

W

M

8

W

9

E

D

M

6

W

Correct Exception Handling

Challenges: respond to exceptions in program order, and only those that “really” occur Motivation for exception status field (stat) in pipeline registers Fetch stage sets to either “AOK,” “ADR” (bad fetch address), or “INS”

(illegal instruction) Decode & execute stages pass values through Memory stage either passes through or sets to “ADR” CPU responds to exception only when instruction reaches write back

F predPC

W icode valE valM dstE dstMstat

M Cndicode valE valA dstE dstMstat


D rB valC valPicode ifun rAstat

Avoiding Side Effects

Desired behavior rmmovl should cause exception No following instruction should change any state

Note special challenge of condition codes!

# demo-exc3.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address addl %eax,%eax # Sets condition codes


1 2 3 4

F D E M

F D E0x006: rmmovl %eax,0x10000(%eax)

0x00c: addl %eax,%eax F D

W

5

M

E

Exception detected

Condition code set

Avoiding Side Effects Exception should disable state update for following

instructions When exception detected in memory stage

Disable condition code setting in execute Must happen in same clock cycle

When exception passes to write-back stage Disable memory write in memory stage Disable condition code setting in execute stage

Let’s see how these are handled in PIPE processor

PIPE: Fetch Details Main points

Branch prediction Branch misprediction

recovery Return handling Stat initialization

F

D rB

M_icode

PredictPC

valC valPicode ifun rA

Instructionmemory

Instructionmemory

PCincr.PC

incr.

predPC

Needregids

NeedvalCInstr

valid

AlignAlignSplitSplit

Bytes 1-5Byte 0

SelectPC

M_CndM_valA

W_icode

W_valM

f_pc

stat

stat

imem_error

icode ifun

PIPE: Decode and Write-back

Main points Forwarding logic,

paths Forwarding priority valA and valP

merged

d_srcB

DrB

Eicode ifun valC valA valB dstE dstM srcA srcB


Registerfile

A B

M

E

dstE

dstM

srcA srcB

d_srcA

d_rvalA

Sel+FwdA

FwdB

W_v

alM

W_d

stE

e_dst

E

d_rvalB

e_va

lE

M_d

stE

M_v

alE

M_d

stM

m_v

alM

W_d

stM

W_v

alE

dstE dstM srcA srcB

stat

stat

PIPE: Execute

Main points CC update inhibited

by prior exceptions Values for

forwarding Special handling for

dstE (?)

E

M Cndicode valE valA dstE dstM

icode ifun valC valA valB dstE dstM srcA srcB

CCCC ALUALU

ALUA

ALUB

ALUfun.

SetCC

condcond

e_valEe_Cnde_dstE

stat

stat

dstE

m_statW_stat

PIPE: Memory and Write-back

Main points Values for forwarding Stat update logic Feedback for branch

misprediction recovery

M

W

Addr

icode

data in

M_valA

valE valM dstE dstM

Cndicode valE valA dstE dstM

Datamemory

Datamemory

Mem.read

read

write

data out

Mem.write

M_valE

W_dstM

W_valE

W_valM

W_dstE

W_icode

M_icode

M_dstM

M_dstE

m_valM

M_Cnd

stat

stat

statdmem_error

m_stat

stat

Stat

Pipeline Control: Register ModesRisingclockRisingclock_ _ Output = y

yy

RisingclockRisingclock_ _ Output = x

xx

xxnop

RisingclockRisingclock_ _ Output = nop

Output = xInput = y

stall = 0

bubble= 0

xxNormal

Output = xInput = y

stall = 1

bubble= 0

xxStall

Output = xInput = y

stall = 0

bubble= 1

Bubble

PIPE Control Logic

Handles special cases Handles ret, load/use hazards, misprediction recovery, exceptions Existing PIPE logic handles forwarding, branch prediction

E

M

W

F

D

CCCC

rB

srcAsrcB

icode valE valM dstE dstM

Cndicode valE valA dstE dstM

icode ifun valC valA valB dstE dstM srcA srcB


predPC

d_srcB

d_srcA

e_Cnd

D_icode

E_icode

M_icode

E_dstMPipecontrollogic

D_bubble

D_stall

E_bubble

F_stall

M_bubble

W_stall

set_cc

stat

stat

stat

stat

W_stat

stat

m_stat

PIPE: Actual ret handling

0x000: irmovl Stack,%edx

1 2 3 4 5 6 7 8 9

F D E M W0x006: call proc F D E M W

E M W

10# prog7

0x020: ret

bubble

F D E M W

D

E M W bubble D

E M W bubble D0x00b: irmovl $10,%edx # Return point F D E M W

11

F

F

F

0x000: irmovl Stack,%edx

1 2 3 4 5 6 7 8 9

F D E M W0x006: call proc F D E M W

FE M W

10# prog7

0x020: ret

0x021: rrmovl %edx,%ebx # Not executed

bubble

F D E M W

DF

E M W


bubble DF

E M W


bubble D0x00b: irmovl $10,%edx # Return point F D E M W

11

Simplified view

What hardwareactually does

PIPE: Actual exception handling Scenario: pushl uses bad memory address

Actions: disable CC, inject bubbles into memory stage, stall write-back


1 2 3 4 5 6 7 8 9

F D E M W

0x006: xorl %esp,%esp #CC = 100 F D E M W

10# prog10

0x008: pushl %eax

0x00a: addl %eax,%eax F D

0x00c: irmovl $2, %eax

F D E M W

E

F D E

W W W

Cycle 6

M

mem_error = 1

E

New CC = 000

set_cc 0

Special Control Cases: Exceptions Detection

Action (on next cycle)

Also: disable setting of condition codes in execute in current cycle

Condition Trigger

Exception m_stat is in {SADR, SINS, SHLT} || W_stat is in {SADR, SINS, SHLT}

Condition F D E M W

Exception normal normal normal bubble stall

Special Control Cases: Non-exceptions Detection

Action (on next cycle)

Condition Trigger

Processing ret IRET in { D_icode, E_icode, M_icode }

Load/Use Hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }

Mispredicted Branch E_icode = IJXX & !e_Cnd

Condition F D E M W

Processing ret stall bubble normal normal normal

Load/Use Hazard stall stall bubble normal normal

Mispredicted Branch normal bubble bubble normal normal

Pipeline Control, rev. 1.0bool F_stall =

# Conditions for a load/use hazardE_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||# Stalling at fetch while ret passes through pipelineIRET in { D_icode, E_icode, M_icode };

bool D_stall = # Conditions for a load/use hazardE_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

bool D_bubble =# Mispredicted branch(E_icode == IJXX && !e_Bch) ||# Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode };

bool E_bubble =# Mispredicted branch(E_icode == IJXX && !e_Bch) ||# Load/use hazardE_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB};

How do we know this works?

Analysis: Control Combinations

Special cases that can arise on same clock cycle Combination A

Not-taken branch ret instruction at branch target

Combination B Instruction that reads from memory to %esp Followed by ret instruction

Control Combination A

Should be handled as mispredicted branch Combination will also stall F pipeline register But PC selection logic will be using M_valA anyway Correct action taken!

JXXE

D

M

Mispredict

JXXE

D

M

Mispredict

E

retD

M

ret 1

E

retD

M

ret 1

E

retD

M

ret 1

Combination A

Condition F D E M W


Mispredicted branch normal bubble bubble normal normal

Combination stall bubble bubble normal normal

Control Combination B

Would assert both bubble and stall for pipeline register D Should be signaled by processor as pipeline error

Combination not handled correctly in control code 1.0 But it passed many simulation tests; caught only with systematic analysis

LoadE

UseD

M

Load/use

E

retD

M

ret 1

E

retD

M

ret 1

E

retD

M

ret 1

Combination B

Condition F D E M W


Load/use hazard stall stall bubble normal normal

Combination stall bubble + stall

bubble normal normal

Control Combination B: Correct Handling

LoadE

UseD

M

Load/use

E

retD

M

ret 1

E

retD

M

ret 1

E

retD

M

ret 1

Combination B

Condition F D E M W



Combination stall stall bubble normal normal

Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle

Corrected Pipeline Control Logic

Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle

Condition F D E M W



Combination stall stall bubble normal normal

bool D_bubble =# Mispredicted branch(E_icode == IJXX && !e_Bch) ||# Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode } # but not condition for a load/use hazard && !(E_icode in { IMRMOVL, IPOPL }

&& E_dstM in { d_srcA, d_srcB });New

Lesson Learned Extensive and thorough testing is good, but it can’t prove

a design correct Formal verification important, but field not mature

enough for large-scale designs Important and active research area

Performance Metrics Clock rate

Measured in Megahertz or Gigahertz Function of stage partitioning and circuit design

To increase: keep amount of work per stage small Rate at which instructions executed

CPI: cycles per instruction On average, how many clock cycles does each instruction require

(after completion of previous instruction)? CPI a function of pipeline design and the program

How frequently are branches mispredicted? How frequent are load stalls? How frequent are ret instructions?

CPI for PIPE Ideal CPI = 1.0

Fetch instruction each clock cycle Process new instruction every cycle

Although each individual instruction has latency of 5 cycles Actual CPI > 1.0

Due to pipeline stalls, branch mispredictions Computing CPI

C clock cycles I instructions executed to completion B bubbles injected (C = I + B)

CPI = C/I = (I+B)/I = 1.0 + B/I B/I represents average penalty (per instruction) due to bubbles

CPI for PIPE (Cont.)B/I = LP + MP + RP LP: Penalty due to load/use hazard stalling

Fraction of instructions that are loads0.25

Fraction of load instructions requiring stall0.20

Number of bubbles injected each time 1 LP = 0.25 * 0.20 * 1 = 0.05

MP: Penalty due to mispredicted branches Fraction of instructions that are cond. jumps

0.20 Fraction of cond. jumps mispredicted

0.40 Number of bubbles injected each time 2 MP = 0.20 * 0.40 * 2 = 0.16

RP: Penalty due to ret instructions Fraction of instructions that are returns

0.02 Number of bubbles injected each time 3 RP = 0.02 * 3 = 0.06

Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27 CPI = 1.27 (Not bad! Assumes perfect memories.)

Typical Values

State-of-the-Art Pipelining What have we ignored in our Y86 implementation?

Balancing delay in each stage Which stage is longest, how might we speed it up?

Multicycle instructions Realistic memory systems

Fetch Logic Revisited

During fetch cycle1. Select PC2. Read bytes from

instruction memory3. Examine icode to

determine instruction length

4. Increment PC Timing

Steps 2 & 4 require significant amount of time

F

D rB

M_icode

PredictPC


Instructionmemory

Instructionmemory

PCincr.PC

incr.

predPC

Needregids

NeedvalCInstr

valid

AlignAlignSplitSplit

Bytes 1-5Byte 0

SelectPC

M_CndM_valA

W_icode

W_valM

f_pc

stat

stat

imem_error

icode ifun

Standard Fetch Timing

Must perform everything in sequence: Can’t compute incremented PC until we know value to increment it

with Why is increment slow? How could we speed this up?

Select PC

Mem. Read Increment

need_regids, need_valC

1 clock cycle

A Fast PC Increment Circuit

3-bit adder

need_ValC

need_regids

0

29-bitincre-

menter

MUX

High-order 29 bits

Low-order 3 bits

High-order 29 bits Low-order 3 bits

0 1

PC

incrPC

Slow Fast

carry

1

carry

Modified Fetch Timing

29-bit incrementer Acts as soon as PC selected Output not needed until final MUX Works in parallel with memory read

Select PC

Mem. Read

Incrementer

need_regids, need_valC3-bit add

MUX

1 clock cycle

Standard cycle

State-of-the-Art Pipelining Other issues to consider

More complex instructions: consider FP divide, sqrt Take many cycles to execute Forwarding can’t resolve hazards: more data stalls Important for compiler to schedule code

Deeper pipelines to allow faster cycle times Increased penalty from misprediction, data hazard stalls, etc. Increased emphasis on branch prediction

Actual memory hierarchy issues (will increase CPI) Difficult to complete memory access in one cycle! Possibility of cache misses, TLB misses, page faults

Superscalar/VLIW: process multiple instructions/cycle Dynamic scheduling (discussed in Chapter 5)

Scheduling = determining instruction execution order Hardware decides based on data dependencies, resources

Pentium 4 Pipeline Very deep pipeline

Enables very high clock rates, but 20+ cycle branch penalty Slower than Pentium III for a given clock rate

1 2 3 4 5 6 7 8 9 10 11 12

TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch

13 14

Disp Disp

15 16 17 18 19 20

RF Ex Flgs Br Ck DriveRF

Multicycle FP Operations Multiple functional units: one approach

Special-purpose hardware for FP operations Increased latency causes more frequent stalls

can cause “structural” hazards

Single cycle integer unit

F D M W

Fully pipelined multiplier

Non-pipelined divider

Fully pipelined FP adder

Dynamic scheduling Out-of-order execution engine: one view (Pentium 4)

Image from www.xbitlabs. com

Fetching,decoding,translation

of x86 instrs to

uops

to support precise exceptions and recovery from mispredicted branches

Branch Prediction: Simplistic Branch history

Encode history about prior history of each individual branch instruction; store as hash table on instr. address

Use history to predict branch outcome

State machine stores history Each time branch taken, move to left Each time branch not taken, move to right In state Yes*, predict taken; in state No*, predict not taken Can be encoded using 2 bits per table entry

T T T

Yes! Yes? No? No!

NT

T

NT NT

NT

Branch Prediction: Realistic Alpha 21264 “tournament” predictor

Addr. of branch instr

12 bits

4K entries, each 2 bits,

to select predictor

GlobalPredictor

LocalPredictor

12 bit shift register of last branch

outcomes globally

10 bits of branch address

4K entries, standard 2-bit

branch predictors

1K entries, each 10 bits, history of behavior of this

branch

1K entries, each a 3-bit saturation

counter

Predictor size: 8K + 8K + 10K + 3K = 29K bits!

8K

8K

10K 3K

Discussion Questions Data hazards on register values can be dealt with by

stalling or by forwarding. Can hazards occur on condition codes? Can hazards occur on data memory accesses?

Can software be responsible for pipeline correctness? Schedule instructions, use nops DSP chips historically have had exposed pipelines

Relationship of hazards and dependencies Does every data dependence cause a data hazard? Is every data hazard caused by a data dependence?

Documents

Data Hazard Solution 2: Data Forwarding Our naïve pipeline would experience many data stalls Register isn’t written until completion of write-back stage