Arquitectura de Computadoras - CINVESTAVadiaz/ArqComp/12-Pipelining1.pdf · Tecnologías de...

Laboratorio de Tecnologías de Información

Arquitectura de Computadoras: Pipelining

Introduction to Pipelining: Datapath

Arquitectura de Computadoras
Arturo Díaz Pérez

Centro de Investigación y de Estudios Avanzados del IPN
Laboratorio de Tecnologías de Información

adiaz@cinvestav.mx

Pipelining

[Figure: four tasks (1 2 3 4) overlapped along a time axis, exploiting instruction parallelism; annotations mark throughput and latency.]

Observations

• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
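The observations above can be checked with a small timing model. This is a minimal sketch, assuming a linear pipeline clocked at the speed of its slowest stage; the function names and the stage times are illustrative and do not come from the slides.

```python
def pipeline_time(stage_times_ns, n_tasks):
    """Time for n_tasks in a pipeline whose clock equals the slowest stage."""
    cycle = max(stage_times_ns)          # rate limited by the slowest stage
    k = len(stage_times_ns)
    # k cycles to fill for the first task, then one task completes per cycle
    return cycle * (k + n_tasks - 1)


def unpipelined_time(stage_times_ns, n_tasks):
    """Time when each task runs all stages before the next task starts."""
    return sum(stage_times_ns) * n_tasks


def speedup(stage_times_ns, n_tasks):
    return unpipelined_time(stage_times_ns, n_tasks) / pipeline_time(stage_times_ns, n_tasks)


balanced = [10, 10, 10, 10, 10]     # hypothetical stage times in ns
unbalanced = [5, 5, 20, 10, 10]     # same total work, one slow stage

print(pipeline_time(balanced, 1))   # 50: latency of a single task is not improved
print(speedup(balanced, 4))         # 2.5: fill/drain dominates short workloads
print(speedup(balanced, 1000))      # approaches the number of stages (5)
print(speedup(unbalanced, 1000))    # roughly half of that, due to the 20 ns stage
```

With only four tasks the fill and drain time keeps the speedup well below five, and the unbalanced pipeline never gets close to it no matter how many tasks are run.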

5 Steps of DLX Datapath

[Figure: DLX datapath (PC, instruction memory, registers, sign extend, ALU with zero test, data memory) divided into five steps:]
1. Instruction Fetch
2. Instruction Decode / Register Fetch
3. Execute / Address Calculation
4. Memory Access
5. Write Back

Pipelined DLX Datapath

• Data stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipelining

Single Cycle, Multiple Cycle, vs. Pipeline

[Figure: timing of a Load, Store, R-type sequence over Cycles 1-10 in three implementations.]
• Single Cycle Implementation: one long cycle per instruction (Cycle 1: Load, Cycle 2: Store); part of the cycle is wasted when an instruction needs less than the full cycle
• Multiple Cycle Implementation: Load uses Ifetch, Reg, Exec, Mem, Wr; Store uses Ifetch, Reg, Exec, Mem; the R-type begins its Ifetch only after the Store finishes
• Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, starting one cycle apart

Why Pipeline?

Suppose we execute 100 instructions:

• Single Cycle Machine
  » 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
  » 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
• Ideal pipelined machine
  » 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
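The three totals above follow from cycle time x cycle count. A short sketch that reproduces the arithmetic, using only the numbers given on the slide (the variable names are illustrative):

```python
N_INSTR = 100

# Single-cycle machine: one long 45 ns cycle per instruction.
single_cycle_ns = 45 * 1 * N_INSTR

# Multicycle machine: short 10 ns cycles, but 4.6 cycles per instruction on average.
multicycle_ns = 10 * 4.6 * N_INSTR

# Ideal pipeline: one instruction completes per 10 ns cycle,
# plus 4 extra cycles for the last instruction to drain.
pipelined_ns = 10 * (1 * N_INSTR + 4)

print(single_cycle_ns)                  # 4500
print(round(multicycle_ns))             # 4600
print(pipelined_ns)                     # 1040
print(single_cycle_ns / pipelined_ns)   # about 4.3x faster than single cycle
```

The pipelined machine keeps the short clock of the multicycle design while approaching the CPI of 1 of the single-cycle design.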

Limits to Pipelining

It's not that easy for computers.

• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  » Structural hazards: HW cannot support this combination of instructions
  » Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  » Control hazards: pipelining of branches & other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" in the pipeline

1 Memory is Structural Hazard

• Detection is easy in this case! (In the figure, a right-half highlight means read and a left-half highlight means write.)

Data Hazard on r1

• Dependencies backwards in time are hazards

[Figure: pipeline diagram, time in clock cycles, stages IF, ID/RF, EX, MEM, WB, for the instruction sequence:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Data Hazard Solution

• "Forward" result from one stage to another

[Figure: pipeline diagram, time in clock cycles, with forwarding paths for the instruction sequence:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

• "or" is OK if register read/write is defined properly
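Which of the instructions after `add r1,r2,r3` actually see a stale r1 can be worked out from stage numbers alone. The sketch below assumes the five-stage timing used in these slides (register read in stage 2, register write in stage 5) and one instruction issued per cycle; the function names and the `split_cycle_regfile` flag are illustrative.

```python
ID_STAGE = 2   # stage in which source registers are read
WB_STAGE = 5   # stage in which the destination register is written


def raw_hazard(distance, split_cycle_regfile=True):
    """True if an instruction `distance` slots after the producer reads too early.

    With a split-cycle register file (write in the first half of the cycle,
    read in the second half) a read in the same cycle as the write is safe.
    """
    write_cycle = 1 + (WB_STAGE - 1)              # producer issues in cycle 1
    read_cycle = (1 + distance) + (ID_STAGE - 1)  # consumer issues `distance` later
    if split_cycle_regfile:
        return read_cycle < write_cycle
    return read_cycle <= write_cycle


followers = ["sub r4,r1,r3", "and r6,r1,r7", "or r8,r1,r9", "xor r10,r1,r11"]
for distance, text in enumerate(followers, start=1):
    print(text, "-> needs forwarding" if raw_hazard(distance) else "-> OK")
```

Under these assumptions `sub` and `and` need the forwarded value, `or` is safe only because of the split-cycle register file, and `xor` is always safe.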

3 Generic Data Hazards

Instr_i followed by Instr_j

• Read After Write (RAW)
  » Instr_j tries to read operand before Instr_i writes it
• Write After Read (WAR)
  » Instr_j tries to write operand before Instr_i reads it
  » Can't happen in DLX 5 stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5
• Write After Write (WAW)
  » Instr_j tries to write operand before Instr_i writes it
  » Leaves wrong result (Instr_i's, not Instr_j's)
  » Can't happen in DLX 5 stage pipeline because all instructions take 5 stages and writes are always in stage 5
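The three definitions can be expressed as set intersections over the registers each instruction reads and writes. This is a sketch with a hypothetical `Instr` record; it reports the dependences that could become hazards, and whether they do depends on the pipeline (in the 5-stage DLX pipeline only RAW matters).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset   # registers written
    reads: frozenset    # registers read


def dependences(instr_i, instr_j):
    """Classify dependences between instr_i and a later instr_j."""
    kinds = set()
    if instr_i.writes & instr_j.reads:
        kinds.add("RAW")   # j reads what i writes
    if instr_i.reads & instr_j.writes:
        kinds.add("WAR")   # j writes what i reads
    if instr_i.writes & instr_j.writes:
        kinds.add("WAW")   # both write the same register
    return kinds


add = Instr("add r1,r2,r3", frozenset({"r1"}), frozenset({"r2", "r3"}))
sub = Instr("sub r4,r1,r3", frozenset({"r4"}), frozenset({"r1", "r3"}))
mov = Instr("add r2,r5,r6", frozenset({"r2"}), frozenset({"r5", "r6"}))  # made-up example

print(dependences(add, sub))   # {'RAW'}
print(dependences(add, mov))   # {'WAR'}
```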

Control Hazard: Wait

• Stall: wait until decision is clear
• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
• Move decision to end of decode
  » saves 1 cycle per branch

[Figure: pipeline diagram, time in clock cycles, showing the lost potential while the branch stalls.]

Control Hazard: Predict

[Figure: pipeline diagram, time in clock cycles.]

• Predict: guess one direction, then back up if wrong
• Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right about 50% of the time)
  » Need to "squash" and restart the following instruction if wrong
  » Produces CPI on branch of (1 * .5 + 2 * .5) = 1.5
  » Total CPI might then be: 1.5 * .2 + 1 * .8 = 1.1 (20% branch)

More dynamic scheme: history of 1 branch (-
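The CPI figures for the stall and predict schemes can be reproduced with a two-line model. This is a sketch under the slide's assumptions (base CPI of 1, 20% branches); the function names are illustrative.

```python
def branch_cpi(penalty_cycles, p_correct=0.0):
    """CPI of a branch: 1 cycle plus the penalty whenever the guess is wrong.

    p_correct = 0.0 models always stalling (no prediction at all).
    """
    return 1 + (1 - p_correct) * penalty_cycles


def total_cpi(cpi_branch, branch_fraction=0.2, cpi_other=1.0):
    return cpi_branch * branch_fraction + cpi_other * (1 - branch_fraction)


stall_2 = total_cpi(branch_cpi(2))               # wait for the decision: about 1.4
stall_1 = total_cpi(branch_cpi(1))               # decision at end of decode: about 1.2
predict = total_cpi(branch_cpi(1, p_correct=0.5))  # right half of the time: about 1.1

print(stall_2, stall_1, predict)
```

Raising `p_correct` toward 1 moves the total CPI toward 1.0, which is the motivation for the more dynamic prediction schemes.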

Control Hazard: Delayed Branch

[Figure: pipeline diagram, time in clock cycles.]

• Delayed Branch: redefine branch behavior (takes place after next instruction)
• Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the "slot" (about 50% of the time)
• As more instructions are launched per clock cycle, less useful

Forwarding to Avoid Data Hazard

Data Hazard even with Forwarding

Forwarding and Loads

[Figure: pipeline diagram, time in clock cycles; the sub is stalled one cycle behind the lw:]
lw r1,0(r2)
sub r4,r1,r3

• Dependencies backwards in time are hazards
• Can't solve with forwarding: must delay/stall instruction dependent on loads
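A load-use stall can be modelled as inserting one bubble whenever an instruction reads the register loaded by the instruction immediately before it. The sketch below assumes forwarding handles every other case and uses a hypothetical `(op, dest, sources)` tuple per instruction.

```python
NOP = ("nop", None, ())


def insert_load_stalls(program):
    """Return the program with one bubble after each load whose result
    is used by the very next instruction."""
    scheduled = []
    prev = None
    for instr in program:
        op, dest, sources = instr
        if prev is not None and prev[0] == "lw" and prev[1] in sources:
            scheduled.append(NOP)    # loaded data is not available in time
        scheduled.append(instr)
        prev = instr
    return scheduled


program = [
    ("lw", "r1", ("r2",)),         # lw  r1,0(r2)
    ("sub", "r4", ("r1", "r3")),   # sub r4,r1,r3
]

for op, dest, sources in insert_load_stalls(program):
    print(op, dest, sources)
```

A compiler could often avoid the bubble by placing an independent instruction between the load and its first use.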

Designing a Pipelined Processor

• Go back and examine your datapath and control diagram
• Associate resources with states
• Ensure that flows do not conflict, or figure out how to resolve them
• Assert control in the appropriate stage

Control and Datapath: Split state diag into 5 pieces

All instructions:
  IR <- Mem[PC]; PC <- PC + 4
  A <- R[rs]; B <- R[rt]
R-type:
  S <- A + B;  R[rd] <- S
Load:
  S <- A + SX;  M <- Mem[S];  R[rd] <- M
Or immediate:
  S <- A or ZX;  R[rt] <- S
Store:
  S <- A + SX;  Mem[S] <- B
Branch:
  if Cond: PC <- PC + SX

Pipelined Processor

What happens if we start a new instruction every cycle?

Pipelined Datapath

• Data stationary control: local decode for each instruction phase / pipeline stage

Pipelining the Load Instruction

The five independent functional units in the pipeline datapath are:
• Instruction Memory for the Ifetch stage
• Register File's Read ports (bus A and bus B) for the Reg/Dec stage
• ALU for the Exec stage
• Data Memory for the Mem stage
• Register File's Write port (bus W) for the Wr stage

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw    Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw             Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                      Ifetch   Reg/Dec  Exec     Mem      Wr
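The staggered timing for back-to-back loads can be generated mechanically: instruction i occupies stage s in cycle i + s. A small sketch (the function name is illustrative) that rebuilds the diagram for three `lw` instructions:

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def pipeline_table(n_instr, stages=STAGES):
    """Return one row per instruction; row[c] is the stage used in cycle c+1."""
    n_cycles = n_instr + len(stages) - 1
    rows = []
    for i in range(n_instr):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name        # instruction i starts one cycle after i-1
        rows.append(row)
    return rows


for label, row in zip(["1st lw", "2nd lw", "3rd lw"], pipeline_table(3)):
    print(f"{label:8}" + "".join(f"{cell:9}" for cell in row))
```

Reading any column top to bottom shows that no two loads use the same functional unit in the same cycle, which is why identical five-stage instructions pipeline without conflict.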

The Four Stages of R-type

• Ifetch: Instruction Fetch
  » Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec:
  » ALU operates on the two register operands
  » Update PC
• Wr: Write the ALU output back to the register file

          Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type    Ifetch   Reg/Dec  Exec     Wr

Pipelining the R-type and Load Instruction

• We have a pipeline conflict or structural hazard:
  » Two instructions try to write to the register file at the same time!
  » Only one write port

[Figure: pipeline diagram over clock Cycles 1-9, including a Load (Ifetch, Reg/Dec, Exec, Mem, Wr) among R-type instructions.]

Important Observation

• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
  » Load uses Register File's Write Port during its 5th stage
  » R-type uses Register File's Write Port during its 4th stage

          1        2        3     4     5
Load      Ifetch   Reg/Dec  Exec  Mem   Wr
R-type    Ifetch   Reg/Dec  Exec  Wr

• 2 ways to solve this pipeline hazard.
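The write-port conflict can be found by computing the cycle in which each instruction reaches its write stage. This sketch assumes one instruction issued per cycle starting at cycle 1; the `WRITE_STAGE` table and function name are illustrative.

```python
WRITE_STAGE = {"load": 5, "rtype": 4}   # stage in which the register file is written


def write_port_conflicts(sequence, write_stage=WRITE_STAGE):
    """Map each cycle with more than one register write to the
    indices of the instructions that collide in it."""
    writers_by_cycle = {}
    for index, kind in enumerate(sequence):
        if kind not in write_stage:
            continue                       # e.g. stores never write a register
        issue_cycle = index + 1
        cycle = issue_cycle + write_stage[kind] - 1
        writers_by_cycle.setdefault(cycle, []).append(index)
    return {c: idx for c, idx in writers_by_cycle.items() if len(idx) > 1}


sequence = ["rtype", "rtype", "load", "rtype", "rtype"]
print(write_port_conflicts(sequence))    # {7: [2, 3]}: the load and the next R-type

# Making every instruction write in stage 5 removes the conflict.
uniform = {"load": 5, "rtype": 5}
print(write_port_conflicts(sequence, uniform))   # {}
```

The second call corresponds to giving R-type instructions an idle Mem stage so that all register writes happen in the same stage.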

Solution 1: Insert "Bubble" into the Pipeline

• Insert a "bubble" into the pipeline to prevent 2 writes at the same cycle
  » The control logic can be complex.
  » Lose instruction fetch and issue opportunity.
• No instruction is started in Cycle 6!

[Figure: pipeline diagram of a Load followed by R-type instructions, with a pipeline bubble inserted.]

Solution 2: Delay R-type's Write by One Cycle

• Delay R-type's register write by one cycle:
  » Now R-type instructions also use Reg File's write port at Stage 5
  » Mem stage is a NOOP stage: nothing is being done.

          1        2        3     4     5
R-type    Ifetch   Reg/Dec  Exec  Mem   Wr

Modified Control & Datapath

All instructions:
  IR <- Mem[PC]; PC <- PC + 4
  A <- R[rs]; B <- R[rt]
R-type:
  S <- A + B;  M <- S;  R[rd] <- M
Load:
  S <- A + SX;  M <- Mem[S];  R[rd] <- M
Or immediate:
  S <- A or ZX;  M <- S;  R[rt] <- M
Store:
  S <- A + SX;  Mem[S] <- B
Branch:
  if Cond: PC <- PC + SX

The Four Stages of Store

• Ifetch: Instruction Fetch
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Write the data into the Data Memory

Store: Ifetch, Reg/Dec, Exec, Mem (Wr stage unused)

The Three Stages of Beq

• Ifetch: Instruction Fetch
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec:
  » compares the two register operands,
  » selects the correct branch target address,
  » latches it into PC

Beq: Ifetch, Reg/Dec, Exec (Mem and Wr stages unused)

Control Diagram

All instructions:
  IR <- Mem[PC]; PC <- PC + 4
  A <- R[rs]; B <- R[rt]
R-type:
  S <- A + B;  M <- S;  R[rd] <- M
Load:
  S <- A + SX;  M <- Mem[S];  R[rd] <- M
Or immediate:
  S <- A or ZX;  M <- S;  R[rt] <- M
Store:
  S <- A + SX;  Mem[S] <- B
Branch:
  if Cond: PC <- PC + SX

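Datapath + Data Stationary Control

[Figure: pipelined datapath in which the decoded control fields (op, rs, rt; then ex, me, wb, rw, v) travel with each instruction through the pipeline registers to the Mem Ctrl and WB Ctrl logic.]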

Let's Try it Out

10   lw   r1, r2(35)
14   addI r2, r2, 3
20   sub  r3, r4, r5
24   beq  r6, r7, 100
30   ori  r8, r9, 17
34   add  r10, r11, r12

100  and  r13, r14, 15

(these addresses are octal)

Start: Fetch 10

• Fetch: lw r1, r2(35); all later stages are empty

Fetch 14, Decode 10

• Decode: lw r1, r2(35)
• Fetch: addI r2, r2, 3

Fetch 20, Decode 14, Exec 10

• Exec: lw r1, r2(35) (immediate 35)
• Decode: addI r2, r2, 3
• Fetch: sub r3, r4, r5

Fetch 24, Decode 20, Exec 14, Mem 10

• Mem: lw r1, r2(35)
• Exec: addI r2, r2, 3
• Decode: sub r3, r4, r5
• Fetch: beq r6, r7, 100

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10

• WB: lw r1, r2(35)
• Mem: addI r2, r2, 3
• Ex: sub r3, r4, r5
• Dcd: beq r6, r7, 100
• Fetch: ori r8, r9, 17

Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14

• WB: addI r2, r2, 3
• Mem: sub r3, r4, r5
• Ex: beq r6, r7, 100
• Dcd: ori r8, r9, 17
• Fetch: add r10, r11, r12

Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20

• WB: sub r3, r4, r5
• Mem: beq r6, r7, 100
• Ex: ori r8, r9, 17
• Dcd: add r10, r11, r12
• Fetch: and r13, r14, 15
• Results so far: r1 = M[r2+35]; r2 = r2+3
• Ooops, we should have only one delayed instruction

Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24

• Results so far: r1 = M[r2+35]; r2 = r2+3; r3 = r4-r5
• Squash the extra instruction in the branch shadow!

Fetch 110, Dcd 104, Ex 100, Mem 34, WB 30

• Results so far: r1 = M[r2+35]; r2 = r2+3; r3 = r4-r5
• Squash the extra instruction in the branch shadow!

Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34

• Results so far: r1 = M[r2+35]; r2 = r2+3; r3 = r4-r5; r8 = r9 | 17
• Squash the extra instruction in the branch shadow: NO WB, NO Ovflow
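The stage occupancy in this walkthrough can be regenerated from the fetch order alone. The sketch below assumes the fetch order implied by the slide titles (the taken branch at 24 redirects fetch to 100 after 30 and 34 have already been fetched) and treats the addresses as octal, as stated; the function name is illustrative.

```python
STAGES = ["Fetch", "Dcd", "Ex", "Mem", "WB"]


def trace(fetch_addrs):
    """One line per cycle listing which address occupies each stage."""
    lines = []
    for cycle in range(len(fetch_addrs)):
        parts = []
        for depth, stage in enumerate(STAGES):
            index = cycle - depth          # instruction fetched `depth` cycles ago
            if index >= 0:
                parts.append(f"{stage} {fetch_addrs[index]:o}")  # print in octal
        lines.append(", ".join(parts))
    return lines


# Addresses are octal, so parse them with base 8.
fetch_order = [int(a, 8) for a in "10 14 20 24 30 34 100 104 110 114".split()]

for line in trace(fetch_order):
    print(line)
```

The squashed instruction at 34 still occupies a slot in every stage in this model; squashing only prevents it from writing back or raising an overflow.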

Summary: Pipelining

• What makes it easy
  » all instructions are the same length
  » just a few instruction formats
  » memory operands appear only in loads and stores
• What makes it hard?
  » structural hazards: suppose we had only one memory
  » control hazards: need to worry about branch instructions
  » data hazards: an instruction depends on a previous instruction
• We'll build a simple pipeline and look at these issues
• We'll talk about modern processors and what really makes it hard:
  » exception handling
  » trying to improve performance with out-of-order execution, etc.

Summary

• Pipelining is a fundamental concept
  » multiple steps using distinct resources
• Utilize capabilities of the Datapath by pipelined instruction processing
  » start next instruction while working on the current one
  » limited by length of longest stage (plus fill/flush)
  » detect and resolve hazards
