ELEC 5200/6200 Computer Architecture and Design, Spring 2017 (Lecture 5, 2/20/2017)

ELEC 5200/6200

Computer Architecture and Design

Spring 2017 Lecture 5: Pipelining

2/20/2017 ELEC 5200-001/6200-001 Lecture 5 1

Ujjwal Guin, Assistant Professor

Department of Electrical and Computer Engineering

Auburn University, Auburn, AL 36849

http://www.auburn.edu/~uzg0005/

Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University)

[Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]


ILP: Instruction Level Parallelism

Single-cycle and multi-cycle datapaths execute one

instruction at a time.

How can we get better performance?

Answer: Execute multiple instructions at the same

time.

– Pipelining – Enhance a multi-cycle datapath to fetch one

instruction every cycle.

– Parallelism – Fetch multiple instructions every cycle.


Automobile Team Assembly


1 car assembled every four hours

6 cars per day

180 cars per month

2,160 cars per year

[Figure: one team performs four tasks of 1 hour each on every car]

Automobile Assembly Line

Task 1 (1 hour) → Task 2 (1 hour) → Task 3 (1 hour) → Task 4 (1 hour)

First car assembled in 4 hours (pipeline latency)

1 car completed per hour thereafter

21 cars on first day, thereafter 24 cars per day

717 cars per month

8,637 cars per year

What gives 4X increase?

Mechanical Electrical Painting Testing

Throughput: Team Assembly

[Figure: two cars assembled one after the other, each passing through Mechanical, Electrical, Painting, Testing]

Time of assembling one car = n hours

where n is the number of nearly equal subtasks,

each requiring 1 unit of time

Throughput = 1/n cars per unit time

[Figure timeline: red car started, red car completed; then blue car started, blue car completed]

Throughput: Assembly Line


Time to complete first car = n time units (latency)

Cars completed in time T = T – n + 1

Throughput = 1 – (n – 1)/ T cars per unit time

\[
\frac{\text{Throughput (assembly line)}}{\text{Throughput (team assembly)}} = \frac{1 - (n-1)/T}{1/n} = n - \frac{n(n-1)}{T} \to n \quad \text{as } T \to \infty
\]

[Figure: Car 1, Car 2, Car 3, Car 4, ... overlapped in the Mechanical, Electrical, Painting, Testing stages along the time axis; Car 1 completes first and Car 2 completes one time unit later]

Key idea: overlap execution

of multiple tasks
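As a quick numerical check of the throughput formulas above, the sketch below counts completed cars for both schemes. The function names are illustrative only, and the 30-day month and 360-day year are assumptions, chosen because they reproduce the 717 and 8,637 figures quoted for the assembly line.

```python
def cars_team(T, n):
    """Cars finished by a single team in T time units (n units per car)."""
    return T // n


def cars_line(T, n):
    """Cars finished by an n-stage assembly line in T time units: T - n + 1."""
    return max(0, T - n + 1)


def throughput_ratio(T, n):
    """Assembly-line throughput divided by team throughput (1/n)."""
    return (cars_line(T, n) / T) / (1 / n)


n = 4  # four 1-hour tasks
for label, hours in [("day", 24), ("month", 30 * 24), ("year", 360 * 24)]:
    print(label, cars_team(hours, n), cars_line(hours, n),
          round(throughput_ratio(hours, n), 3))
```

The ratio approaches n = 4 as T grows, which is the 4X increase asked about earlier.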

Some Features of Assembly Line

[Figure: assembly line of four tasks, 1 hour each, annotated as follows]

– Electrical parts are delivered just in time (JIT)

– Defect found: stall the assembly line to fix the cause of the defect

– The 3 cars in the assembly line are suspects, to be removed (flush pipeline)

Pros and Cons

Advantages: Efficient use of labor.

Specialists can do better job.

Just in time (JIT) methodology eliminates warehouse cost.

Disadvantages: Penalty of defect latency.

Lack of flexibility in production.

Assembly line work is monotonous and boring.

https://www.youtube.com/watch?v=IjarLbD9r30

https://www.youtube.com/watch?v=ANXGJe6i3G8

https://www.youtube.com/watch?v=5lp4EbfPAtI




Pipelining a Digital System

Key idea: break big computation up into pieces

Separate each piece with a pipeline register

[Figure: a 1 ns computation split into five 200 ps pieces, separated by pipeline registers]

Pipelining a Digital System

Why do this? Because it's faster for repeated

computations

Non-pipelined (one 1 ns block): 1 operation finishes every 1 ns

Pipelined (five 200 ps stages): 1 operation finishes every 200 ps

Pipelining a Processor

Recall the 5 steps in instruction execution:

1. Instruction Fetch (IF)

2. Instruction Decode and Register Read (ID)

3. Execute operation or calculate address (ALU or EX)

4. Memory access (MEM)

5. Write result into register (WB)

Review: Single-Cycle Processor

– All 5 steps done in a single clock cycle

– Dedicated hardware required for each step

What happens if we break execution into multiple cycles, and add extra hardware?

– Recall that in multi-cycle, the datapath hardware differs from single-cycle

Review - Single-Cycle Processor

[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory) divided into five regions: IF (Instruction Fetch), ID (Instruction Decode), EX (Execute / Address Calc.), MEM (Memory Access), WB (Write Back)]

Pipelining - Key Idea

Question: What happens if we break execution into

multiple cycles, and add the extra hardware?

Answer: in the best case, we can start executing a

new instruction on each clock cycle

– this is pipelining

Pipelining stages:

– IF - Instruction Fetch

– ID - Instruction Decode

– EX - Execute / Address Calculation

– MEM - Memory Access (read / write)

– WB - Write Back (results into register file)


Project Summary

A RISC CPU is to be designed in the VHDL modeling language, verified via the Mentor Graphics "ModelSim" or Aldec “Active-HDL” simulator, and implemented on the Altera DE2 FPGA board using Altera’s Quartus II software.

The project consists of six parts. Due dates will be listed above as the semester progresses. You should read the problem definitions of all six parts before actually starting with Part 1, i.e., Instruction Set Architecture (ISA).

Please submit only the List Format (do not submit wave format) of the simulation results in part 3, part 4, and part 5. Always annotate your simulation results. Maintain a single folder for submitting the project parts. When submitting a later part, all the previous parts need to be in the folder.

Instruction Set Architecture Classes

[Figure: four ISA classes, each showing how the ALU in the processor obtains its operands from memory]

a) Stack b) Accumulator c) Register-Memory d) Register-Register

Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

Basic Pipelined Processor

[Figure: the single-cycle datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the stages]

Single-Cycle vs. Pipelined Execution

Non-pipelined (time axis 0 to 1800 ps): lw $1, 100($0), lw $2, 200($0) and lw $3, 300($0) execute one after another. Each goes through Instruction Fetch, REG RD, ALU, MEM, REG WR, and a new instruction starts every 800 ps.

Pipelined (time axis 0 to 1600 ps): the same three instructions overlap. Each stage takes 200 ps, and a new instruction starts every 200 ps.

Note: REGRD is at the end of a stage but REGWR is at the beginning of a stage
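The timing comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming an ideal pipeline with no hazards; the function names are illustrative and not part of the lecture.

```python
def time_nonpipelined(num_instr, instr_time_ps):
    """Each instruction runs to completion before the next one starts."""
    return num_instr * instr_time_ps


def time_pipelined(num_instr, num_stages, stage_time_ps):
    """The first instruction needs num_stages cycles; each later one adds one cycle."""
    return (num_stages + num_instr - 1) * stage_time_ps


# Three lw instructions: 800 ps each without pipelining,
# five 200 ps stages with pipelining.
print(time_nonpipelined(3, 800))   # 2400 ps
print(time_pipelined(3, 5, 200))   # 1400 ps
```

For a long instruction stream the ratio of the two totals tends to 800/200 = 4 rather than 5, because the pipelined clock is set by the slowest stage.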

Single-Cycle vs. Pipelined Execution (cont.)

Time taken in each pipeline stage is limited by the slowest operation

– Either ALU operation or memory access

Time taken in the ALU stage (i.e., EX) is used as the pipeline clock cycle in the following discussion

– If most memory accesses are cache accesses, MEM < ALU

Assumptions (Fig 4.27 on p.276)

– Write to the register/memory occurs in the first half of the clock cycle

– Read from register/memory occurs in the second half of the clock cycle

– Without this assumption, Clock Cycle 5 of the following example (Executing Multiple Instructions) has a conflict: the register file is used by 2 instructions in different stages (ID and WB)

– How can the register file be designed to satisfy this assumption?

[Figure: pipelined execution of the three lw instructions with 200 ps stages, repeated from the previous slide]

Comments about Pipelining

The good news

– Multiple instructions are being processed at the same time

– This works because stages are isolated by registers

– Best-case speedup equals the number of stages

The bad news

– Instructions interfere with each other - Hazards

Different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle --- Structural Hazard

The next instruction to fetch (IF) is not known until EX of the branch instruction --- Control Hazard

An instruction may require a result produced by an earlier instruction that is not yet complete --- Data Hazard

– Worst case: Must suspend execution - Stall

Example - Executing Multiple Instructions

Consider the following instruction sequence

lw $r0, 10($r1)

sw $r3, 20($r4)

add $r5, $r6, $r7

sub $r8, $r9, $r10


Executing Multiple Instructions

[Figures: one datapath snapshot per clock cycle; stage occupancy summarized below]

Clock Cycle 1: LW in IF

Clock Cycle 2: SW in IF, LW in ID

Clock Cycle 3: ADD in IF, SW in ID, LW in EX

Clock Cycle 4: SUB in IF, ADD in ID, SW in EX, LW in MEM

Clock Cycle 5: SUB in ID, ADD in EX, SW in MEM, LW in WB

Clock Cycle 6: SUB in EX, ADD in MEM, SW in WB

Clock Cycle 7: SUB in MEM, ADD in WB

Clock Cycle 8: SUB in WB

Compact View

                     CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)     IM    REG   ALU   DM    REG
sw  $r3, 20($r4)           IM    REG   ALU   DM    REG
add $r5, $r6, $r7                IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                     IM    REG   ALU   DM    REG
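The compact view can be generated mechanically for any hazard-free instruction sequence. The sketch below is only an illustration: `pipeline_chart` and `STAGES` are names introduced here, and the model assumes an ideal five-stage pipeline with no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipeline.

    Instruction number i (0-based) occupies stage number s (0-based)
    during clock cycle i + s + 1.
    """
    chart = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            chart.setdefault(i + s + 1, {})[stage] = instr
    return chart


program = ["lw", "sw", "add", "sub"]
chart = pipeline_chart(program)
for cycle in sorted(chart):
    row = ", ".join(f"{instr} in {stage}" for stage, instr in chart[cycle].items())
    print(f"CC {cycle}: {row}")
```

Clock cycle 5 shows lw in WB and sub in ID at the same time, which is the register file sharing case covered by the split-cycle read/write assumption.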

Pipeline Hazards

Where one instruction cannot immediately follow

another

Types of hazards

– Structural hazards - attempt to use same resource twice

– Control hazards - attempt to make decision before

condition is evaluated

– Data hazards - attempt to use data before it is ready

We can always resolve hazards by waiting

– i.e. stall


Structural Hazards

Attempt to use same resource twice at same time

Example: A single memory for both instructions and data

– Accessed by IF stage

– Accessed at same time by MEM stage

Solutions

– Delay second access by one clock cycle, OR

– Provide separate memories for instructions and data (IM and DM)

This is what MIPS does

Recall “Harvard Architecture”

Real pipelined processors have separate caches

Structural Hazard - Single Memory

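[Figure: four instructions, each passing through IF ID EX MEM WB and starting one stage after the previous one (time axis 0 to 14). With a single memory, the MEM stage of the first instruction and the IF stage of the fourth fall in the same time slot: Memory Conflict.]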

Control Hazards

Attempt to make a decision before condition is evaluated

Example: beq $s0, $s1, offset

– Must begin fetching the instruction following the branch on the very next

clock cycle

– But the pipeline does not know what is the next instruction since it only just

received the branch instruction from memory

– Possible solutions: Stall, predict, or delayed decision

If we add hardware to second stage to:

– Compare fetched registers for equality

– Compute branch target and update PC

– This allows the branch to be taken at the end of the second clock cycle

This may not be possible in longer pipelines, where the branch may not be resolved in the 2nd stage; the slowdown is then larger

– Must make sure that the additional hardware does not increase the pipeline clock cycle.

Control Hazard Solutions

Stall - Stop loading instructions until result is available

Predict - Assume an outcome and continue fetching (undo if prediction is wrong)

– Always assume branch untaken

– Or predict some branches taken and others untaken

Delayed branch (used in MIPS)

– Always executes the next SAFE instruction in the sequence

A safe instruction is an instruction that is not affected by the branch

– MIPS software will place such a safe instruction immediately after the delayed branch

This step is hidden from the MIPS assembly programmer

– If the branch is taken, it changes the address of the instruction that follows the safe instruction

Control Hazard – Stall

All following discussions assume the extra hardware in the 2nd stage

[Figure: pipeline diagram, time axis 0 to 18]

add $r4,$r5,$r6: IF ID EX MEM WB

beq $r0,$r1,tgt: IF ID EX MEM WB (beq writes PC at the end of ID with the extra hardware)

STALL: BUBBLE BUBBLE BUBBLE BUBBLE BUBBLE

tgt: sw $s4,200($t5): IF ID EX MEM WB (new PC used here, in IF)

Control Hazard - Correct Prediction

[Figure: pipeline diagram, time axis 0 to 18. add $r4,$r5,$r6, the branch, and tgt: sw $s4,200($t5) each go through IF ID EX MEM WB, one stage apart. The sw at tgt is fetched assuming branch taken; the prediction is correct, so no bubble is inserted.]

Control Hazard - Incorrect Prediction

[Figure: pipeline diagram, time axis 0 to 18. add $r4,$r5,$r6 and the branch go through IF ID EX MEM WB. tgt: sw $s4,200($t5) is fetched under the incorrect prediction and becomes a “squashed” instruction (BUBBLE in its remaining four stages, i.e., a STALL); or $r8,$r8,$r9 is then fetched.]

Control Hazard - Delayed Branch

[Figure: pipeline diagram, time axis 0 to 18. add $r4,$r5,$r6, the branch, the branch SLOT instruction and $r6,$r6,$r7, and tgt: sw $s4,200($t5) each go through IF ID EX MEM WB, one stage apart. The slot instruction always executes; the correct PC is available in time for the fetch that follows it.]

Or re-arrange the code to execute the previous “add” in the branch slot

Summary - Control Hazard Solutions

Stall - stop fetching instruction until result is available

– Significant performance penalty

– Hardware required to stall

Predict - assume an outcome and continue fetching

(undo if prediction is wrong)

– Performance penalty only when guess wrong

– Hardware required to "squash" instructions

Delayed branch - specify in architecture that following

instruction is always executed

– Compiler re-orders instructions into delay slot

– Insert "NOP" (no-op) operations when the delay slot cannot be used (~50%)

– This is how original MIPS worked

Example: Delayed branch

Loop: lw $8, 100($7)

addi $7, $7, 4

beq $7, $4, Loop

addi is not a “safe” instruction to be placed in the branch slot (i.e., the instruction after beq)

– Because of the dependence on $7 between addi and beq.

lw seems a safe instruction candidate, but its location does not allow it to be moved to the branch slot

– Because “addi $7, $7, 4” is after “lw $8, 100($7)”; i.e., if lw is moved to the branch slot, the value of $7 is off by 4.

Example: delayed branch (cont.)

Changes made to the MIPS code

– Swapping the addi and lw locations

– Changing the offset from 100 to 100-4=96

In order to keep the results of the two programs identical

– The value of $7 at the new location should be the value prior to “addi $7,$7,4”

Loop: addi $7, $7, 4

lw $8, 96($7)

beq $7, $4, Loop

After the above swapping and changing of the offset, lw can be safely moved to the delay slot

Loop: addi $7, $7, 4

beq $7, $4, Loop

lw $8, 96($7) # delay slot
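The offset change from 100 to 96 can be checked by modelling one loop iteration of each version. This is a small sketch with hypothetical helper names; it tracks only the load address and the value of $7, not memory contents or the branch itself.

```python
def original_iteration(r7):
    """lw $8, 100($7) followed by addi $7, $7, 4."""
    load_addr = r7 + 100
    r7 = r7 + 4
    return load_addr, r7


def rewritten_iteration(r7):
    """addi $7, $7, 4 followed by lw $8, 96($7)."""
    r7 = r7 + 4
    load_addr = r7 + 96
    return load_addr, r7


for start in range(0, 64, 4):
    assert original_iteration(start) == rewritten_iteration(start)
print("both versions load from the same address and leave $7 with the same value")
```

Because lw neither writes $7 nor $4, executing it after beq in the delay slot does not change the branch decision.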

Data Hazards

Attempt to use data before it is ready

Solutions

– Stalling - wait until result is available

– Forwarding (Bypassing) - make data available inside datapath

– Re-ordering instructions - use compiler to avoid hazards

Examples:

add $s0, $t0, $t1 ; $s0 = $t0+$t1

sub $t2, $s0, $t3 ; $t2 = $s0-$t3

lw $s0, 0($t0) ; $s0 = MEM[$t0]

sub $t2, $s0, $t3 ; $t2 = $s0-$t3

Data Hazard - Stalling

[Figure: pipeline diagram, time axis 0 to 18]

add $s0,$t0,$t1: IF ID EX MEM WB ($s0 written here, in WB)

STALL: BUBBLE BUBBLE BUBBLE BUBBLE BUBBLE

STALL: BUBBLE BUBBLE BUBBLE BUBBLE BUBBLE

sub $t2,$s0,$t3: IF ID EX MEM WB ($s0 read here, in ID)

May need one more STALL, i.e., a 3rd, to be absolutely free of the data hazard, if a register file that writes and then reads in the same cycle cannot be designed

Data Hazards - Forwarding

Key idea: connect new value directly to next stage

Still read s0, but ignore in favor of new result

Forwarding is valid only if the destination stage is later in time than the source stage

– Problem: what about load instructions?

If the “add” is replaced by “lw”, the data won’t be available until the MEM stage.

Data Hazards - Forwarding

STALL still required for LOAD instruction

– Because data available after MEM

The MIPS architecture calls this a delayed load; initial implementations required the compiler to deal with it

[Figure: pipeline diagram, time axis 0 to 18]

lw $s0,20($t1): IF ID EX MEM WB (new value of $s0 available after MEM)

STALL: BUBBLE BUBBLE BUBBLE BUBBLE BUBBLE

sub $t2,$s0,$t3: IF ID EX MEM WB (new value of $s0 forwarded to EX)

Data Hazards - Reordering Instructions

What are the hazards in this code?

lw $t0, 0($t1)

lw $t2, 4($t1)

sw $t2, 0($t1)

sw $t0, 4($t1)

Data forwarding resolves the data hazard but still introduces a STALL

Reorder the instructions to remove the hazard without any STALL when using data forwarding:

lw $t0, 0($t1)

lw $t2, 4($t1)

sw $t0, 4($t1)

sw $t2, 0($t1)
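The effect of the reordering can be checked with a small stall counter. The sketch below follows the simple model used in this lecture, which is an assumption here: with forwarding, one stall is charged whenever the instruction right after a lw uses the loaded register. The parser is hypothetical and only handles the lw, sw and three-register formats that appear in these slides.

```python
import re


def parse(instr):
    """Return (opcode, destination register or None, set of source registers)."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "lw":                       # lw rt, off(rs): writes rt, reads rs
        return op, regs[0], {regs[1]}
    if op == "sw":                       # sw rt, off(rs): reads rt and rs
        return op, None, set(regs)
    return op, regs[0], set(regs[1:])    # R-type: rd, rs, rt


def load_use_stalls(program):
    """Count lw instructions whose result is used by the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(prev)
        _, _, cur_srcs = parse(cur)
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(load_use_stalls(original), load_use_stalls(reordered))
```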


Summary - Pipelining Overview

Pipelining increases throughput (but does not reduce latency)

Hazards limit performance

– Structural hazards

– Control hazards

– Data hazards


Summary: Hazards

Structural hazards

– Cause: resource conflict

– Remedies: (i) hardware resources, (ii) stall (bubble)

Data hazards

– Cause: data unavailability

– Remedies: (i) forwarding, (ii) stall (bubble), (iii) code reordering

Control hazards

– Cause: out-of-sequence execution (branch or jump)

– Remedies: (i) stall (bubble), (ii) branch prediction/pipeline flush, (iii) delayed branch/pipeline flush

Control Unit for Pipelined MIPS

Single-Cycle Control Logic

Inputs: opcode (instruction bits 31 30 29 28 27 26). Outputs: control signals.

Instr.  Opcode        RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
type    (bits 31-26)
R       0 0 0 0 0 0   1       0       0         1         0        0         0       1       0       0
lw      1 0 0 0 1 1   0       1       1         1         1        0         0       0       0       0
sw      1 0 1 0 1 1   X       1       X         0         0        1         0       0       0       0
beq     0 0 0 1 0 0   X       0       X         0         0        0         1       0       1       0
J       0 0 0 0 1 0   X       X       X         0         X        0         X       X       X       1
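The control logic table can be read as a lookup keyed by the 6-bit opcode. The following sketch is one possible software model of it, not the hardware implementation: don't-care entries (X) are stored as `None`, and the names `SIGNALS`, `CONTROL_TABLE` and `decode` are introduced here for illustration.

```python
SIGNALS = ["RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0", "Jump"]
X = None  # don't care

CONTROL_TABLE = {
    0b000000: ("R",   [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]),
    0b100011: ("lw",  [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]),
    0b101011: ("sw",  [X, 1, X, 0, 0, 1, 0, 0, 0, 0]),
    0b000100: ("beq", [X, 0, X, 0, 0, 0, 1, 0, 1, 0]),
    0b000010: ("J",   [X, X, X, 0, X, 0, X, X, X, 1]),
}


def decode(instruction_word):
    """Return (instruction type, {signal: value}) for a 32-bit instruction word."""
    opcode = (instruction_word >> 26) & 0x3F   # bits 31..26
    name, values = CONTROL_TABLE[opcode]
    return name, dict(zip(SIGNALS, values))


# lw $1, 100($0): opcode 35, rs 0, rt 1, immediate 100
name, signals = decode((35 << 26) | (0 << 21) | (1 << 16) | 100)
print(name, signals)
```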

Single-Cycle Control Circuit

[Figure: control circuit. The opcode bits Op5 to Op0 are decoded into the R, lw, sw, beq and J lines, which are combined to produce RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0 and Jump.]

ALU Control Logic

Inputs: ALUOp1, ALUOp0 from the CU and the funct. code from IR (bits 0-5). Outputs to ALU: 3-bit code.

Instr.   ALUOp1  ALUOp0  F5  F4  F3  F2  F1  F0  3-bit  Operation
type                                             code
lw, sw   0       0       X   X   X   X   X   X   010    Add
B        0       1       X   X   X   X   X   X   110    Subtract
R        1       X       X   X   0   0   0   0   010    Add
R        1       X       X   X   0   0   1   0   110    Subtract
R        1       X       X   X   0   1   0   0   000    AND
R        1       X       X   X   0   1   0   1   001    OR
R        1       X       X   X   1   0   1   0   111    slt
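The ALU control table translates directly into a small function. This is a sketch of the truth table only; `alu_control` and its argument encoding (ALUOp as a 2-bit integer, funct as the 6-bit field) are assumptions made for the example.

```python
def alu_control(alu_op, funct=0):
    """Return the 3-bit ALU operation select for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:          # lw, sw: add for address calculation
        return 0b010
    if alu_op == 0b01:          # branch: subtract for comparison
        return 0b110
    # R-type (ALUOp1 = 1, ALUOp0 don't care): decided by F3..F0; F5, F4 ignored
    by_funct = {
        0b0000: 0b010,  # add
        0b0010: 0b110,  # subtract
        0b0100: 0b000,  # AND
        0b0101: 0b001,  # OR
        0b1010: 0b111,  # slt
    }
    return by_funct[funct & 0b1111]


for label, op, funct in [("lw/sw", 0b00, 0), ("beq", 0b01, 0),
                         ("add", 0b10, 0b100000), ("slt", 0b10, 0b101010)]:
    print(label, format(alu_control(op, funct), "03b"))
```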

ALU Control

[Figure: the ALU control block takes F3, F2, F1, F0 and ALUOp1, ALUOp0 (from the control circuit) and drives the 3-bit operation select input of the ALU. ALU outputs: zero, result, overflow.]

Operation select   ALU function
000                AND
001                OR
010                Add
110                Subtract
111                Set on less than

Returning to Pipelined Control

Opcode input to control is supplied by the pipeline register IF/ID in the ID (instruction decode) cycle.

Nine control signals are generated in the ID cycle, but none is used. They are saved in the pipeline register ID/EX.

ALUSrc, RegDst and ALUOp (2 bits) are used in the EX (execute) cycle. Remaining 5 control signals are saved in the pipeline register EX/MEM.

Branch, MemWrite and MemRead are used in the MEM (memory access) cycle. Remaining 2 control signals are saved in the pipeline register MEM/WB.

MemtoReg and RegWrite are used in the WB (write back) cycle.

Pipelined control is shown without Jump.

Pipelined Datapath with Control Signals

[Figure: pipelined datapath (PC, instruction memory, register file, sign extend, ALU, data memory) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and the control signals RegDst, ALUSrc, ALUOp, ALU control, Branch, MemRead, MemWrite, PCSrc, RegWrite and MemtoReg]

Control

Basic approach:

– Based on single-cycle control

– Place control unit in ID stage

– Pass control signals to following stages

Later: extra features to deal with:

– Data forwarding

– Stalls

– Exceptions


Control for Pipelined Datapath

[Figure: the Control block generates all signals in the ID stage. ID/EX carries the EX, M and WB groups; EX/MEM carries M and WB; MEM/WB carries WB.]

EX: RegDst, ALUOp[1:0], ALUSrc

M: MemRead, MemWrite, Branch

WB: RegWrite, MemtoReg

Control for Pipelined Datapath

              Execution/Address Calculation     Memory access stage       Write-back stage
              stage control lines               control lines             control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0         1         0
lw            0       0       0       1         0       1        0         1         1
sw            X       0       0       1         0       0        1         0         X
beq           X       0       1       0         1       0        0         0         X

Datapath and Control Unit

[Figure: pipelined datapath together with the control unit]

Tracking Control Signals - Cycles 1 to 5

[Figures: one datapath snapshot per clock cycle showing the control signals carried in the pipeline registers. Instructions are listed from the earliest occupied stage to the latest.]

Cycle 1: LW

Cycle 2: SW, LW

Cycle 3: ADD, SW, LW

Cycle 4: SUB, ADD, SW, LW

Cycle 5: SUB, ADD, SW, LW

Data Hazards Revisited…

Data hazards occur when data is used before it is

stored

– RAW (read after write).

Time (in clock cycles):   CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:     10    10    10    10    10/–20  –20   –20   –20   –20

Program execution order (in instructions); each instruction passes through IM, Reg, ALU, DM, Reg and starts one clock cycle after the previous one:

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)
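Which of the instructions after sub would read a stale value of $2 can be worked out from the cycle numbers. The sketch below assumes the ideal timing used in this lecture (instruction i, counted from 0, reads registers in cycle i + 2 and writes back in cycle i + 5) and a register file that writes in the first half of a cycle and reads in the second half. The tuple encoding of the program is made up for the example.

```python
# (opcode, destination register or None, source registers)
program = [
    ("sub", "$2",  ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or",  "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw",  None,  ["$15", "$2"]),
]


def raw_hazards(program, producer=0):
    """Return the opcodes that read the producer's register before it is written."""
    dest = program[producer][1]
    wb_cycle = producer + 5                 # producer writes the register here
    stale = []
    for i in range(producer + 1, len(program)):
        op, _, sources = program[i]
        id_cycle = i + 2                    # consumer reads the register file here
        # Reading in the same cycle as the write is safe (write first, then read).
        if dest in sources and id_cycle < wb_cycle:
            stale.append(op)
    return stale


print(raw_hazards(program))
```

Under these assumptions only and and or need help (stalling or forwarding); add reads $2 in CC 5, the same cycle in which sub writes it.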

Data Hazards Revisited… (cont.)

Data hazards can be classified into 3 types, depending on the order of read and write accesses in the instructions.

Consider two instructions i and j, with i occurring before j

– RAW (read after write)

j tries to read a source before i writes it

So j incorrectly gets the old value

– WAR (write after read)

j tries to write a destination before it is read by i

So i incorrectly gets the new value

WAR never happens in MIPS because all READs are early, in the ID stage, and all WRITEs are later, in the WB stage

WAR can occur, for example, with auto-increment addressing, where some instructions write results early in the pipeline and other instructions read a source late in the pipeline

– WAW (write after write)

j tries to write an operand before it is written by i

The writes end up performed in the wrong order, leaving the value written by i rather than the value written by j in the destination

The MIPS pipeline writes a register only in the WB stage and so avoids WAW

WAW only occurs in pipelines that write in more than one pipeline stage, or allow an instruction to proceed even when a previous instruction is stalled

Can RAR (read after read) be a data hazard?
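The three categories depend only on which registers each instruction reads and writes, so they can be identified mechanically. The sketch below is an illustration with a made-up `(dest, sources)` encoding; it reports dependences between two instructions, and whether a dependence becomes an actual hazard depends on the pipeline, as noted above.

```python
def classify_dependences(i, j):
    """i and j are (dest, sources) pairs, with i earlier in program order than j."""
    i_dest, i_sources = i
    j_dest, j_sources = j
    kinds = set()
    if i_dest is not None and i_dest in j_sources:
        kinds.add("RAW")    # j reads what i writes
    if j_dest is not None and j_dest in i_sources:
        kinds.add("WAR")    # j writes what i reads
    if i_dest is not None and i_dest == j_dest:
        kinds.add("WAW")    # both write the same register
    return kinds


sub_i = ("$2", ["$1", "$3"])     # sub $2, $1, $3
and_j = ("$12", ["$2", "$5"])    # and $12, $2, $5
print(classify_dependences(sub_i, and_j))
```

Two instructions that only read the same register produce an empty set, which matches the usual answer to the RAR question: reads alone do not conflict.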

Data Hazard Solution: Forwarding

Key idea: connect data internally before it's stored

[Figure: forwarding paths for the EX Hazard and the MEM Hazard]

Data Hazard Solution: Forwarding

Add hardware to feed back ALU and MEM results to

both ALU inputs


Forwarding Unit


Controlling Forwarding

Data hazard at “EX” stage: (EX Hazard)

– EX/MEM - test whether the instruction in EX/MEM writes the register file and examine its rd register

– ID/EX - test whether the instruction in ID/EX reads an rs or rt register that matches the rd register in EX/MEM

Data hazard at “MEM” stage: (MEM Hazard)

– MEM/WB - test whether the instruction in MEM/WB writes the register file and examine its rd (or rt) register

– ID/EX - test whether the instruction in ID/EX reads an rs or rt register that matches the rd (or rt) register in MEM/WB

Forwarding Unit Detail - EX Hazard

if (EX/MEM.RegWrite

and (EX/MEM.RegisterRd ≠ 0)

and (EX/MEM.RegisterRd = ID/EX.RegisterRs))

ForwardA = 10

if (EX/MEM.RegWrite

and (EX/MEM.RegisterRd ≠ 0)

and (EX/MEM.RegisterRd = ID/EX.RegisterRt))

ForwardB = 10


Forwarding Unit Detail - MEM Hazard

if (MEM/WB.RegWrite

and (MEM/WB.RegisterRd ≠ 0)

and (MEM/WB.RegisterRd = ID/EX.RegisterRs))

ForwardA = 01

if (MEM/WB.RegWrite

and (MEM/WB.RegisterRd ≠ 0)

and (MEM/WB.RegisterRd = ID/EX.RegisterRt))

ForwardB = 01


MEM Hazard Complication

One complication is potential data hazards between

the result of the instruction in WB stage, the result

of the instruction in MEM stage and the source

operand of the instruction in ALU stage.

Example: What if a register is changed more than once?

– add $1, $1, $2;

– add $1, $1, $3;

– add $1, $1, $4;

Answer: forward most recent result (in MEM stage)


Forwarding Unit Detail - MEM Hazard

Revised

if (MEM/WB.RegWrite

and (MEM/WB.RegisterRd ≠ 0)

and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)

and (MEM/WB.RegisterRd = ID/EX.RegisterRs))

ForwardA = 01

if (MEM/WB.RegWrite

and (MEM/WB.RegisterRd ≠ 0)

and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)

and (MEM/WB.RegisterRd = ID/EX.RegisterRt))

ForwardB = 01
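The EX and MEM hazard conditions can be combined into one function that returns ForwardA and ForwardB. This is a behavioural sketch only: pipeline registers are modelled as plain dictionaries with made-up keys (`RegWrite`, `Rd`, `Rs`, `Rt`), and priority is handled by checking the MEM hazard first and letting a matching EX hazard override it, which is one way to express "forward the most recent result".

```python
def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) as 2-bit values: 00 none, 10 EX/MEM, 01 MEM/WB."""
    forward_a = forward_b = 0b00

    mem_writes = mem_wb["RegWrite"] and mem_wb["Rd"] != 0
    ex_writes = ex_mem["RegWrite"] and ex_mem["Rd"] != 0

    # MEM hazard: older result, lower priority
    if mem_writes and mem_wb["Rd"] == id_ex["Rs"]:
        forward_a = 0b01
    if mem_writes and mem_wb["Rd"] == id_ex["Rt"]:
        forward_b = 0b01

    # EX hazard: most recent result overrides the MEM hazard choice
    if ex_writes and ex_mem["Rd"] == id_ex["Rs"]:
        forward_a = 0b10
    if ex_writes and ex_mem["Rd"] == id_ex["Rt"]:
        forward_b = 0b10

    return forward_a, forward_b


# Third instruction of: add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4
third_in_ex = {"Rs": 1, "Rt": 4}
second_in_mem = {"RegWrite": True, "Rd": 1}
first_in_wb = {"RegWrite": True, "Rd": 1}
print(forwarding_unit(third_in_ex, second_in_mem, first_in_wb))
```

For the repeated-write example, the operand $1 is taken from EX/MEM (ForwardA = 10), not from the older value in MEM/WB.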

Hazard Detection Unit - Control Detail

if (ID/EX.MemRead and

((ID/EX.RegisterRt = IF/ID.RegisterRs) or

(ID/EX.RegisterRt = IF/ID.RegisterRt)))

stall
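The stall condition above is a single boolean expression. The sketch below models it with dictionaries standing in for the pipeline registers; the key names and the returned `PCWrite` / `IFIDWrite` flags are assumptions that mirror the signal names used in this lecture, not a definitive design.

```python
def hazard_detection(id_ex, if_id):
    """Detect a load-use hazard between the load in EX and the instruction in ID."""
    stall = bool(id_ex["MemRead"]) and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])
    return {
        "stall": stall,
        "PCWrite": not stall,      # PC keeps its value during a stall
        "IFIDWrite": not stall,    # IF/ID keeps the stalled instruction
    }


# lw $2, 20($1) in EX, followed by and $4, $2, $5 in ID
print(hazard_detection({"MemRead": 1, "Rt": 2}, {"Rs": 2, "Rt": 5}))
```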

Pipelined Processor with Hazard Detection

[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite and the mux that zeros the control signals; this is how “stall” is implemented. The forwarding unit takes Rs, Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]

Hazard Detection Unit

How “stall” is implemented

MUX zeros out control signals for instruction in ID

– “squashes” the instruction

– “no-op” propagates through following stages

IF/ID holds stalled instruction until next clock cycle

PC holds current value until next clock cycle (reloads first instruction)

Control (Branch) Hazards

Just stalling for each branch is not practical

Common assumption: branch not taken

When assumption fails: flush three instructions

– Note that the following figure does not assume the extra hardware to reduce the control hazard in ID stage.

Reducing Branch Delay

Key idea: move branch logic to ID stage of pipeline

– New adder calculates branch target (PC + 4 + (extend(IMM) << 2))

– New hardware tests rs == rt immediately after register read

– Add flush signal to squash instruction in IF/ID register

Reduced penalty (1 cycle) when branch taken

Example on the next slide: Figure 4.62, p. 320

– Assume that branch is taken (i.e., $1==$3)

One bubble

– i.e., one instruction is flushed

36 sub $10, $4, $8

40 beq $1, $3, 7 # PC-relative branch 40+4+7*4 = 72

44 and $12, $2, $5

......

72 lw $4, 50($7)
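The branch target arithmetic in the example can be written out as follows. This is a minimal sketch of the address calculation only; `sign_extend16` and `branch_target` are names chosen for the example.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """PC + 4 + (sign-extended immediate << 2), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


print(branch_target(40, 7))        # beq at address 40 with offset 7
print(branch_target(40, 0xFFFF))   # offset -1 branches back to the beq itself
```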

A couple of details are ignored:

(i) IF.Flush comes from the control unit;

(ii) the output of the equivalence check of rs and rt should be fed into the control unit, which then determines the branch control for the MUX in front of PC

Branch Prediction

Key idea: instead of always assuming branch not

taken, use a prediction based on previous history

– Branch history table: a small memory

Indexed by lower bits of the address of the branch instruction

Using one bit to save the history of “what happened” on last

execution

– branch taken (‘1’)

– branch not taken (‘0’)

– Use history to make prediction


Branch Prediction

Useful for program loops.

A one-bit prediction scheme: a one-bit buffer carries a “history bit” that tells what happened on the last branch instruction

History bit = 1, branch was taken

History bit = 0, branch was not taken

[State diagram: two states]

State 0 (predict branch not taken): stays in 0 on not taken, goes to 1 on taken

State 1 (predict branch taken): stays in 1 on taken, goes to 0 on not taken

Branch Prediction

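[Figure: branch prediction hardware. The low-order bits of the PC are used as an index into a table holding the addresses of recent branch instructions, their target addresses, and history bit(s). On an address match (=), the prediction logic selects between PC+4 and the target address as the next PC.]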

Branch Prediction for a Loop

[Flowchart: a (I = 0); loop body b, c (X = X + R(I); I = I + 1); d (test I – 10 = 0?); e (Store X in memory). After d, execution either returns to b (loop continues) or proceeds to e (loop exits).]

Execution of Instruction d (h.bit = 0 branch not taken, h.bit = 1 branch taken):

Execution  Old hist.  Next instr.         New hist.  Prediction
seq.       bit        Pred.  I    Act.    bit
1          0          e      1    b       1          Bad
2          1          b      2    b       1          Good
3          1          b      3    b       1          Good
4          1          b      4    b       1          Good
5          1          b      5    b       1          Good
6          1          b      6    b       1          Good
7          1          b      7    b       1          Good
8          1          b      8    b       1          Good
9          1          b      9    b       1          Good
10         1          b      10   e       0          Bad

Prediction Accuracy

One-bit predictor: 2 errors out of 10 predictions

Prediction accuracy = 80%

To improve prediction accuracy, use two-bit

predictor: A prediction must be wrong twice before it is changed


Two-Bit Prediction Buffer

Implemented as a two-bit counter.

Can improve correct prediction statistics.

[State diagram: four states, each with one outgoing “taken” transition and one outgoing “not taken” transition]

00: predict branch not taken

01: predict branch not taken

10: predict branch taken

11: predict branch taken

Branch Prediction for a Loop

[Flowchart: 1 (I = 0); loop body 2, 3 (X = X + R(I); I = I + 1); 4 (test I – 10 = 0?); 5 (Store X in memory). After 4, execution either returns to 2 (loop continues) or proceeds to 5 (loop exits).]

Execution of Instruction 4:

Execution  Old pred.  Next instr.         New pred.  Prediction
seq.       buf        Pred.  I    Act.    buf
1          10         2      1    2       11         Good
2          11         2      2    2       11         Good
3          11         2      3    2       11         Good
4          11         2      4    2       11         Good
5          11         2      5    2       11         Good
6          11         2      6    2       11         Good
7          11         2      7    2       11         Good
8          11         2      8    2       11         Good
9          11         2      9    2       11         Good
10         11         2      10   5       10         Bad
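Both loop tables can be reproduced by simulating the predictors on the ten branch outcomes (nine taken, then one not taken). The sketch below is an illustration: the two-bit predictor is modelled as a saturating counter that predicts taken in states 10 and 11, which is an assumption consistent with the table rows but not necessarily identical to every transition in the state diagram.

```python
def one_bit_accuracy(outcomes, history=0):
    """History bit = outcome of the last execution; predict the same again."""
    correct = 0
    for taken in outcomes:
        correct += (history == 1) == taken
        history = 1 if taken else 0
    return correct / len(outcomes)


def two_bit_accuracy(outcomes, counter=0b10):
    """Two-bit saturating counter; states 10 and 11 predict taken."""
    correct = 0
    for taken in outcomes:
        correct += (counter >= 0b10) == taken
        counter = min(0b11, counter + 1) if taken else max(0b00, counter - 1)
    return correct / len(outcomes)


loop_branch = [True] * 9 + [False]   # branch back 9 times, then exit the loop
print(one_bit_accuracy(loop_branch))  # initial history bit 0, as in the first table
print(two_bit_accuracy(loop_branch))  # initial buffer 10, as in the second table
```

With these starting states the one-bit scheme mispredicts the first and last iterations, while the two-bit scheme mispredicts only the loop exit.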

Performance Comparison

*More on performance evaluation will be discussed in a future lecture.

Single-Cycle Performance

Assume 200 ps for memory access

100 ps for ALU operation

50 ps for register file read or write

Cycle time set according to longest instruction:

lw ≡ IF + ID/RegRead + ALU + MEM + RegWrite

= 200 + 50 +100 + 200 + 50

= 600 ps

Cycles Per Instruction (CPI) = 1

Av. instruction execution time = clock cycle time

= 600 ps


Multicycle Performance

Consider SPECINT2000* instruction mix:

25% lw 5 cycles

10% sw 4 cycles

11% branch 3 cycles

2% jump 3 cycles

52% ALU instr. 4 cycles

Av. CPI = 0.25×5 + 0.10×4 + 0.11×3 + 0.02×3 + 0.52×4

= 4.12

Clock cycle time determined from longest operation (memory access) = 200 ps

Av. instruction execution time = 4.12×200 = 824 ps


Pipeline Performance

Neglect initial latency (reasonable for long programs).

One instruction is completed every clock cycle unless delayed by a hazard. Average cycles per instruction type:

lw: 2 cycles in 50% of cases due to hazard, average 1.5 cycles

sw: 1 cycle

ALU: 1 cycle

branch: 2 cycles in 25% of cases due to hazard, average 1.25 cycles

jump: 2 cycles

For SPECINT2000

Av. CPI = 0.25×1.5 + 0.10×1 + 0.11×1.25 + 0.02×2.0 + 0.52×1

= 1.17

Clock cycle time (longest operation: memory access) = 200 ps

Av. instruction execution time = 1.17×200 = 234 ps


Comparing Alternatives

Type of datapath   Clock cycle   Average   Av. instruction
and control        time          CPI       execution time
Single-cycle       600 ps        1.00      600 ps
Multicycle         200 ps        4.12      824 ps
Pipelined          200 ps        1.17      234 ps
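The entries of the comparison table follow from the instruction mix and the per-type cycle counts given earlier. The sketch below recomputes them; the dictionary names are illustrative. Note that the unrounded pipelined CPI is 1.1725, so the 234 ps in the table comes from rounding the CPI to 1.17 before multiplying.

```python
MIX = {"lw": 0.25, "sw": 0.10, "branch": 0.11, "jump": 0.02, "alu": 0.52}

MULTICYCLE_CYCLES = {"lw": 5, "sw": 4, "branch": 3, "jump": 3, "alu": 4}
PIPELINED_CYCLES = {"lw": 1.5, "sw": 1, "branch": 1.25, "jump": 2, "alu": 1}


def average_cpi(cycles):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(MIX[kind] * cycles[kind] for kind in MIX)


single_cycle_time = 200 + 50 + 100 + 200 + 50     # lw path in ps, CPI = 1
multicycle_cpi = average_cpi(MULTICYCLE_CYCLES)
pipelined_cpi = average_cpi(PIPELINED_CYCLES)

print("single-cycle:", single_cycle_time, "ps")
print("multicycle:  CPI", round(multicycle_cpi, 2), "->", round(multicycle_cpi * 200), "ps")
print("pipelined:   CPI", round(pipelined_cpi, 2), "->", round(round(pipelined_cpi, 2) * 200), "ps")
```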

Exceptions

A typical exception occurs when ALU produces an overflow signal.

Control asserts the following actions on exception:

– Change the PC address to 4000 0040hex. This is the location of the exception routine. This is done by adding an additional input to the PC input multiplexer.

– Overflow is detected in the EX cycle. Similar to data hazard and pipeline flush, set IF/ID to 0 (nop).

– Generate ID.Flush and EX.Flush signals to set all control signals to 0 in ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle.

Next Class: Memory Organization
