Arquitectura de Computadoras - CINVESTAVadiaz/ArqComp/12-Pipelining1.pdf · Tecnologías de Información. Arquitectura de Computadoras. Pipelining- 30. Solution 1: Insert “Bubble”

Laboratorio deTecnologías de Información

Arquitectura de Computadoras Pipelining- 1


Introduction to Pipelining:Introduction to Pipelining: DatapathDatapath

Arquitectura de ComputadorasArquitectura de ComputadorasArturo DArturo Dííaz Paz Péérezrez

Centro de InvestigaciCentro de Investigacióón y de Estudios Avanzados del IPNn y de Estudios Avanzados del IPNLaboratorio de TecnologLaboratorio de Tecnologíías de Informacias de Informacióónn

[email protected]@cinvestav.mx



Laboratorio deTecnologías de InformaciónPipeliningPipelining

1 2 3 4

A way

of

exploiting

instruction

level

parallelism

1 2 3 4

1 2 3 4

1 2 3 4

timein

strn

s Throughput

Latency

1 2 3 4

1 2 3 4

1 2 3 4

time

inst

rns Throughput

Latency



Laboratorio deTecnologías de InformaciónObservationsObservations

♦

Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload

♦

Pipeline rate limited by slowest pipeline stage♦

Multiple tasks operating simultaneously

♦

Potential speedup = Number pipe stages♦

Unbalanced lengths of pipe stages reduces speedup

♦

Time to “fill”

pipeline and time to “drain”

it reduces speedup



Laboratorio deTecnologías de Información5 Steps of DLX Datapath5 Steps of DLX Datapath

PCInst.

Memory

Add

Registers

SignExtend

Add

Mux

Mux

Zero ?

Mux

DataMemory

Mux

NPC

IR

16

A

B

SM D

ALUOutput

LM D

Instruction Fetch Instruction Decode/Register Fetch

ExecuteAddr. Calc.

MemoryAccess

WriteBack

32

4



Laboratorio deTecnologías de InformaciónPipelinedPipelined DLX DLX DatapathDatapath

Data stationary

control-

local decode

for

each

phase

/ pipeline

stage




Visualizing PipeliningVisualizing Pipelining

Single Cycle, Multiple Cycle, vs. PipelineSingle Cycle, Multiple Cycle, vs. Pipeline

Clk

Cycle 1

Multiple Cycle Implementation:

Ifetch Reg Exec Mem Wr

Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10

Load Ifetch Reg Exec Mem Wr

Ifetch Reg Exec MemLoad Store

Pipeline Implementation:

Ifetch Reg Exec Mem WrStore

Clk

Single Cycle Implementation:

Load Store Waste

IfetchR-type

Ifetch Reg Exec Mem WrR-type

Cycle 1 Cycle 2

R-type



Laboratorio deTecnologías de InformaciónWhy Pipeline?Why Pipeline?

♦

Suppose we execute 100 instructions♦

Single Cycle Machine■

45 ns/cycle x 1 CPI x 100 inst = 4500 ns

♦

Multicycle Machine■

10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns

♦

Ideal pipelined machine■

10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns



Laboratorio deTecnologías de InformaciónLimits to PipeliningLimits to Pipelining

♦

Its not that easy for computers

♦

Limits to pipelining: Hazards

prevent next instruction from executing during its designated clock cycle■

Structural hazards: HW cannot support this combination of instructions

■

Data hazards: instruction depends on result of prior instruction still in the pipeline

■

Control hazards: pipelining of branches & other instructions that change the PC

♦

Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles”

in the pipeline




1 1 MemoryMemory isis StructuralStructural HazardHazard

Detection is easy in this case!

(right half highlight means read, left half write)




1 Memory is Structural Hazard1 Memory is Structural Hazard



Laboratorio deTecnologías de InformaciónData Hazard on r1Data Hazard on r1

• Dependencies backwards in time are hazards

Instr.

Order

Time (clock cycles)

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

IF ID/RF EX MEM WBAL

UIm Reg Dm Reg

AL

UIm Reg Dm RegA

LUIm Reg Dm Reg

Im

AL

UReg Dm Reg

AL

UIm Reg Dm Reg




Data Hazard SolutionData Hazard Solution

• “Forward” result from one stage to another

Instr.

Order

Time (clock cycles)

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11


UIm Reg Dm Reg

AL

UIm Reg Dm RegA

LUIm Reg Dm Reg

Im

AL

UReg Dm Reg

AL

UIm Reg Dm Reg

•“or” OK if define read/write properly



Laboratorio deTecnologías de Información3 Generic Data Hazards3 Generic Data Hazards

♦

Instri

followed by Instrj

♦

Read After Write (RAW)■

Instrj

tries to read operand before Instri

writes it♦

Write After Read (WAR)■

Instrj tries to write operand before Instri

reads it■

Can’t happen in DLX 5 stage pipeline because

» all instructions take 5 stages» reads are always in stage 2, and» writes are always in stage 5

♦

Write After Write (WAW)■

Instrj tries to write operand before Instri writes it

» Leaves wrong result (Instri not Instrj)■

Can’t happen in DLX 5 stage pipeline because

» all instructions take 5 stages» writes are always in stage 5




Control Hazard: WaitControl Hazard: Wait

♦

Stall: wait until decision is clear♦

Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow

♦

Move decision to end of decode■

save 1 cycle per branch

Instr.

Order

Time (clock cycles)

Add

Beq

Load

AL

UMem Reg Mem Reg

AL

UMem Reg Mem RegA

LUReg Mem RegMemLostpotential




Control Hazard: PredictControl Hazard: Predict

Instr.

Order

Time (clock cycles)

Add

Beq

Load

AL

UMem Reg Mem Reg

AL

UMem Reg Mem Reg

Mem

AL

UReg Mem Reg

♦

Predict:

guess one direction then back up if wrong♦

Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right -

50% of time)■

Need to “Squash”

and restart following instruction if wrong■

Produce CPI on branch of (1 *.5 + 2 * .5) = 1.5■

Total CPI might then be: 1.5 * .2 + 1 * .8 = 1.1 (20% branch)♦

More dynamic scheme: history of 1 branch (-

90%)




Control Hazard: Delayed BranchControl Hazard: Delayed Branch

Instr.

Order

Time (clock cycles)

Add

Beq

Misc

AL

UMem Reg Mem Reg

AL

UMem Reg Mem Reg

Mem

AL

UReg Mem Reg

Load Mem

AL

UReg Mem Reg

♦

Delayed Branch: Redefine branch behavior (takes place after next instruction)

♦

Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot”

(-

50% of time)

♦

As launch more instruction per clock cycle, less useful




Forwarding to Avoid Data HazardForwarding to Avoid Data Hazard




Data Hazard even with ForwardingData Hazard even with Forwarding




Forwarding and LoadsForwarding and Loads

Reg

• Dependencies backwards in time are hazards

Time (clock cycles)

lw r1,0(r2)

sub r4,r1,r3


UIm Reg Dm

AL

UIm Reg Dm Reg

Can’t solve with forwarding: Must delay/stall instruction dependent on loads




Forwarding and LoadsForwarding and Loads

Reg

Time (clock cycles)

lw r1,0(r2)

sub r4,r1,r3


UIm Reg DmA

LUIm Reg Dm RegStall

Dependencies backwards in time are hazards

Can’t solve with forwarding: Must delay/stall instruction dependent on loads




Designing a Pipelined ProcessorDesigning a Pipelined Processor

♦

Go back and examine your datapath

and control diagram♦

associated resources with states

♦

ensure that flows do not conflict, or figure out how to resolve

♦

assert control in appropriate stage

Control and Control and DatapathDatapath: Split state : Split state diagdiag into 5 piecesinto 5 pieces

IR <- Mem[PC]; PC <– PC+4;

A <- R[rs]; B<– R[rt]

S <– A + B;

R[rd] <– S;

S <– A + SX;

M <– Mem[S]

R[rd] <– M;

S <– A or ZX;

R[rt] <– S;

S <– A + SX;

Mem[S] <- B

If CondPC < PC+SX;

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

A

B

SReg

File

Equ

al

PC

Nex

t PC

IR

Inst

. Mem

D

M




Pipelined ProcessorPipelined Processor

♦

What happens if we start a new instruction every cycle?

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

A

B

S

M

Reg

File

Equ

al

PC

Nex

t PC

IR

Inst

. Mem

Valid

IRex

Dcd

Ctrl

IRm

em

Ex

Ctrl

IRw

b

Mem

Ctrl

WB

Ctrl

D



Laboratorio deTecnologías de InformaciónPipelinedPipelined DatapathDatapath

Data stationary

control-

local decode

for

each

phase

/ pipeline

stage




Pipelining the Load InstructionPipelining the Load Instruction

♦

The five independent functional units in the pipeline datapath

are:■

Instruction Memory for the Ifetch

stage

■

Register File’s Read ports (bus A and busB) for the Reg/Dec

stage■

ALU for the

Exec

stage

■

Data Memory for the Mem

stage■

Register File’s Write

port (bus W) for the Wr

stage

Clock

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Ifetch Reg/Dec Exec Mem Wr1st lw

Ifetch Reg/Dec Exec Mem Wr2nd lw

Ifetch Reg/Dec Exec Mem Wr3rd lw




The Four Stages of RThe Four Stages of R--typetype

♦

Ifetch: Instruction Fetch■

Fetch the instruction from the Instruction Memory

♦

Reg/Dec: Registers Fetch and Instruction Decode♦

Exec: ■

ALU operates on the two register operands

■

Update PC

♦

Wr: Write the ALU output back to the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4

Ifetch Reg/Dec Exec WrR-type




Pipelining the RPipelining the R--type and Load type and Load InstructionInstruction

♦

We have pipeline conflict or structural hazard:■

Two instructions try to write to the register file at the same time!

■

Only one write port

ClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9



Ifetch Reg/Dec Exec Mem WrLoad






Important ObservationImportant Observation

♦

Each functional unit can only be used once

per instruction

♦

Each functional unit must be used at the same

stage for all instructions:■

Load uses Register File’s Write Port during its 5th

stage

■

R-type uses Register File’s Write Port during its 4th

stage

Ifetch Reg/Dec Exec Mem WrLoad1 2 3 4 5

Ifetch Reg/Dec Exec WrR-type1 2 3 4

°

2 ways to solve this pipeline hazard.




Solution 1: Insert Solution 1: Insert ““BubbleBubble”” into into the Pipelinethe Pipeline

♦

Insert a “bubble”

into the pipeline to prevent 2 writes at the same cycle■

The control logic can be complex.

■

Lose instruction fetch and issue opportunity.

♦

No instruction is started in Cycle 6!

Clock



Ifetch Reg/Dec Exec


Ifetch Reg/Dec Exec WrR-typeIfetch Reg/Dec Exec WrR-type Pipeline

Bubble

Ifetch Reg/Dec Exec Wr




Solution 2: Delay RSolution 2: Delay R--typetype’’s Write s Write by One Cycleby One Cycle

♦

Delay R-type’s register write by one cycle:■

Now R-type instructions also use Reg

File’s write port at Stage 5

■

Mem

stage is a NOOP

stage: nothing is being done.

Clock


Ifetch Reg/Dec Mem WrR-type





Ifetch Reg/Dec Exec WrR-type Mem

Exec

Exec

Exec

Exec

1 2 3 4 5

Modified Control & Modified Control & DatapathDatapathIR <- Mem[PC]; PC <– PC+4;

A <- R[rs]; B<– R[rt]

S <– A + B;

R[rd] <– M;

S <– A + SX;

M <– Mem[S]

R[rd] <– M;

S <– A or ZX;

R[rt] <– M;

S <– A + SX;

Mem[S] <- B

if Cond PC < PC+SX;

M <– S

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

A

B

SReg

File

Equ

al

PC

Nex

t PC

IR

Inst

. Mem

D

M

M <– S




The Four Stages of StoreThe Four Stages of Store

♦



♦

Reg/Dec: Registers Fetch and Instruction Decode♦

Exec: Calculate the memory address

♦

Mem: Write the data into the Data Memory


Ifetch Reg/Dec Exec MemStore Wr



Laboratorio deTecnologías de InformaciónThe Three Stages of The Three Stages of BeqBeq

♦



♦

Reg/Dec: ■

Registers Fetch and Instruction Decode

♦

Exec: ■

compares the two register operand,

■

select correct branch target address■

latch into PC


Ifetch Reg/Dec Exec MemBeq Wr

Control DiagramControl DiagramIR <- Mem[PC]; PC < PC+4;

A <- R[rs]; B<– R[rt]

S <– A + B;

R[rd] <– S;

S <– A + SX;

M <– Mem[S]

R[rd] <– M;

S <– A or ZX;

R[rt] <– S;

S <– A + SX;

Mem[S] <- B

If Cond PC < PC+SX;

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

Reg

File

Equ

al

PC

Nex

t PC

IR

Inst

. Mem

M <– S M <– S

A

B

S

D

M




DatapathDatapath + Data Stationary + Data Stationary ControlControl

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

A

B

SReg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

M

rs rt

oprsrt

fun

im

exmewbrwv

mewbrwv

wbrwv




LetLet’’s Try it Outs Try it Out

10 lw r1, r2(35)14 addI r2, r2, 320 sub r3, r4, r524 beq r6, r7, 10030 ori r8, r9, 1734 add r10, r11, r12

100 and r13, r14, 15

these addresses are octal

Start: Fetch 10Start: Fetch 10

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

A

B

SReg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

M

rs rt im

10 lw r1, r2(35)14 addI r2, r2, 320 sub r3, r4, r524 beq r6, r7, 10030 ori r8, r9, 1734 add r10, r11, r12

100 and r13, r14, 15

n n n n

10

Fetch 14, Decode 10Fetch 14, Decode 10

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

A

B

SReg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

M

2 rt im

10

lw

r1, r2(35)

14 addI r2, r2, 320 sub r3, r4, r524 beq r6, r7, 10030 ori r8, r9, 1734 add r10, r11, r12

100 and r13, r14, 15

n n n

14

lwr1

, r2(

35)

Fetch 20, Decode 14, Exec 10Fetch 20, Decode 14, Exec 10

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

r2

B

SReg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

M

2 rt 35

10

lw

r1, r2(35)

14

addI

r2, r2, 3

20 sub r3, r4, r524 beq r6, r7, 10030 ori r8, r9, 1734 add r10, r11, r12

100 and r13, r14, 15

n n

20

lwr1

addI

r2, r

2, 3

Fetch 24, Decode 20, Exec 14, Fetch 24, Decode 20, Exec 14, MemMem 1010

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

r2

B

r2+3

5

Reg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

M

4 5 3

10

lw

r1, r2(35)

14

addI

r2, r2, 3

20

sub

r3, r4, r5

24 beq r6, r7, 10030 ori r8, r9, 1734 add r10, r11, r12

100 and r13, r14, 15

n

24lw

r1

sub

r3, r

4, r5

addI

r2, r

2, 3

Fetch 30, Fetch 30, DcdDcd 24, Ex 20, 24, Ex 20, MemMem 14, WB 1014, WB 10

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

r4

r5

r2+3

Reg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

M[r2

+35]

6 7

10

lw

r1, r2(35)

14

addI

r2, r2, 3

20

sub

r3, r4, r5

24

beq

r6, r7, 100

30 ori r8, r9, 1734 add r10, r11, r12

100 and r13, r14, 15

30

lwr1

beq

r6, r

7 10

0

addI

r2

sub

r3


Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

r6

r7

r2+3

Reg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

r1=M

[r2+3

5]

9 xx

10

lw

r1, r2(35)

14

addI

r2, r2, 3

20

sub

r3, r4, r5

24

beq

r6, r7, 100

30

ori

r8, r9, 17

34 add r10, r11, r12

100 and r13, r14, 15

34

beq

addI

r2

sub

r3r4

-r5

100

orir

8, r9

17

Fetch 100, Fetch 100, DcdDcd 3434, Ex 30, , Ex 30, MemMem 24, WB 2024, WB 20

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

r9

x

Reg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

r1=M[r2+35]

11 12

10

lw

r1, r2(35)

14

addI

r2, r2, 3

20

sub

r3, r4, r5

24

beq

r6, r7, 100

30

ori

r8, r9, 17

34

add

r10, r11, r12

100 and r13, r14, 15

100

beq

r2 = r2+3

sub

r3r4

-r5

17or

ir8

xxx

add

r10,

r11,

r12

ooops, we should have only one delayed instruction

Fetch 104, Fetch 104, DcdDcd 100, 100, Ex 34Ex 34, , MemMem 30, WB 2430, WB 24

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

r11

r12

Reg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

r1=M[r2+35]

14 15

10

lw

r1, r2(35)

14

addI

r2, r2, 3

20

sub

r3, r4, r5

24

beq

r6, r7, 100

30

ori

r8, r9, 17

34

add

r10, r11, r12

100 and r13, r14, 15

104

beq

r2 = r2+3r3 = r4-r5

xx

orir

8

xxx

add

r10

and

r13,

r14,

r15

n

Squash the extra instruction in the branch shadow!

r9 |

17


Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

r14

r15

Reg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

r1=M[r2+35]

10

lw

r1, r2(35)

14

addI

r2, r2, 3

20

sub

r3, r4, r5

24

beq

r6, r7, 100

30

ori

r8, r9, 17

34

add

r10, r11, r12

100 and r13, r14, 15

110

r2 = r2+3r3 = r4-r5

xx

orir

8

add

r10

and

r13

n

Squash the extra instruction in the branch shadow!r9

| 17

r11+

r12

Fetch 114, Fetch 114, DcdDcd 110, 110, Ex 104Ex 104, , MemMem 100, 100, WB 34WB 34

Exe

c

Reg

. Fi

le

Mem

Acc

ess

Dat

aM

em

Reg

File

PC

Nex

t PC

IR

Inst

. Mem

D

Dec

ode

MemCtrl

WB Ctrl

r1=M[r2+35]

10

lw

r1, r2(35)

14

addI

r2, r2, 3

20

sub

r3, r4, r5

24

beq

r6, r7, 100

30

ori

r8, r9, 17

34

add

r10, r11, r12

100 and r13, r14, 15

114

r2 = r2+3r3 = r4-r5r8 = r9 | 17

add

r10

and

r13

n

Squash the extra instruction in the branch shadow!r1

1+r1

2

NO WBNO Ovflow

r14

& R

15




Summary: PipeliningSummary: Pipelining

♦

What makes it easy■

all instructions are the same length

■

just a few instruction formats■

memory operands appear only in loads and stores

♦

What makes it hard?■

structural hazards: suppose we had only one memory

■

control hazards: need to worry about branch instructions■

data hazards: an instruction depends on a previous instruction

♦

We’ll build a simple pipeline and look at these issues

♦

We’ll talk about modern processors and what really makes it hard:■

exception handling

■

trying to improve performance with out-of-order execution, etc.




SummarySummary

♦

Pipelining is a fundamental concept■

multiple steps using distinct resources

♦

Utilize capabilities of the Datapath

by pipelined instruction processing■

start next instruction while working on the current one

■

limited by length of longest stage (plus fill/flush)■

detect and resolve hazards

Documents

Arquitectura de Computadoras - CINVESTAVadiaz/ArqComp/12-Pipelining1.pdf · Tecnologías de Información. Arquitectura de Computadoras. Pipelining- 30. Solution 1: Insert “Bubble”