
Automobile Manufacturing 1. Build frame. 60 min. 2. Add engine. 50 min. 3. Build body. 80 min. 4. Paint. 40 min. 5. Finish.45 min. 275 min. Latency: Time



Automobile Manufacturing

1. Build frame. 60 min.

2. Add engine. 50 min.

3. Build body. 80 min.

4. Paint. 40 min.

5. Finish. 45 min.

Total: 275 min.

Latency: Time from start to finish for one car (smaller is better).

275 minutes per car.

Throughput: Number of finished cars per time unit (larger is better).

1 car/275 min = 0.218 cars/hour

Issues: How can we make the process better by adding more workers?

6.1


An Assembly line

6.1

[Figure: cars 1 to 4 moving through the five stages (60, 50, 80, 40, 45 min) over time, with every stage paced at 80 min.]

First two stages can't produce faster than one car/80 min, or a backlog will occur at the third stage.

Last two stages only receive one car/80 min to work on.

Latency: 400 min/car (5 stages × 80 min)

Throughput: 4 cars/640 min (1 car/160 min)

Will approach 1 car/80 min as time goes on.
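The assembly-line numbers above can be checked with a short Python sketch. It assumes, as the slide does, that every stage is paced by the slowest one (80 min), so the first car needs 5 × 80 min and each later car finishes one pace interval after the previous one; the function names are illustrative only.

```python
# Sketch of the assembly-line arithmetic; stage times are the slide's numbers.
STAGE_MINUTES = [60, 50, 80, 40, 45]  # frame, engine, body, paint, finish


def one_at_a_time_minutes(cars: int) -> int:
    """No overlap: each car spends the full 275 min before the next starts."""
    return cars * sum(STAGE_MINUTES)


def assembly_line_minutes(cars: int) -> int:
    """Overlapped stages, each paced by the slowest stage."""
    pace = max(STAGE_MINUTES)                 # 80 min
    latency = len(STAGE_MINUTES) * pace       # 400 min for the first car
    return latency + (cars - 1) * pace        # one more car every 80 min


if __name__ == "__main__":
    print(one_at_a_time_minutes(1))           # 275
    print(round(60 / sum(STAGE_MINUTES), 3))  # 0.218 cars/hour
    print(assembly_line_minutes(4))           # 640
    for n in (10, 100, 1000):
        # Average minutes per car falls toward the 80-min pace.
        print(n, assembly_line_minutes(n) / n)
```

For 10, 100 and 1000 cars the average comes to 112, 83.2 and 80.32 min per car, which is the "approach 1 car/80 min" effect.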


Applying Assembly Lines to CPUs

• The single-cycle design did everything “at once”

• Can we break the single-cycle design up into stages?

6.1

• Issues:

• Car assembly works well. Will it be as easy to apply the same technique to a CPU?


Breaking up the Single-Cycle Datapath

6.2

[Figure: the single-cycle datapath (PC, instruction memory, register file, sign extend, shift left 2, ALU, adders, data memory, and muxes) divided into five stages from the multi-cycle design: Instr. Fetch, PC=PC+4 / Instr. Decode, Register Fetch / Execute, Address Calc. / Memory / Reg. Write-back.]


The Key - Pipeline Registers

6.2

[Figure: the same five-stage datapath, now with clocked pipeline registers between the stages; PC+4 is among the values latched into them.]


Example: R-type Instruction

6.2

[Figure: an R-type instruction flowing through the pipelined datapath. Its result reaches the register file in the write-back stage, but the write-register number still comes from the instruction currently being decoded.]

Writes the correct data to the wrong register.

In general, arrows that go backwards across pipeline stages may be bad news...


Correcting the Write Register Problem

6.2

[Figure: corrected datapath in which the write-register number (Rt [20-16] or Rd [15-11], chosen by a mux) travels through the pipeline registers with its instruction and returns to the register file in the write-back stage.]
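To see why the write-register number has to travel with the instruction, here is a small cycle-by-cycle Python sketch of the pipeline registers. It is a hypothetical model, not the slides' hardware: it only handles independent `add rd, rs, rt` instructions (no hazards), and the flag `carry_write_reg` switches between the naive datapath (register number taken from the instruction in RF) and the corrected one (number carried in the pipeline registers).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Add:
    """Hypothetical R-type instruction: add rd, rs, rt."""
    rd: int
    rs: int
    rt: int


def run(program: List[Add], regs: List[int], carry_write_reg: bool) -> List[int]:
    """Simulate independent adds through IF, RF, EX, M, WB; return final registers."""
    regs = list(regs)
    # Pipeline registers (None = empty). All four are latched together at the
    # end of each cycle, from values computed during that cycle.
    if_id: Optional[Add] = None
    id_ex: Optional[Tuple[int, int, int]] = None   # (value A, value B, write reg)
    ex_mem: Optional[Tuple[int, int]] = None       # (ALU result, write reg)
    mem_wb: Optional[Tuple[int, int]] = None       # (result, write reg)

    for cycle in range(len(program) + 4):          # +4 cycles to drain
        # WB: write the result sitting in MEM/WB.
        if mem_wb is not None:
            result, carried_reg = mem_wb
            if carry_write_reg:
                wreg: Optional[int] = carried_reg  # corrected datapath
            else:
                # Naive datapath: Rd comes from the instruction now in RF.
                wreg = if_id.rd if if_id is not None else None
            if wreg:                               # skip None and $0
                regs[wreg] = result
        # Compute the next contents of every pipeline register.
        new_mem_wb = ex_mem                        # M: adds don't touch memory
        new_ex_mem = (id_ex[0] + id_ex[1], id_ex[2]) if id_ex else None
        new_id_ex = (regs[if_id.rs], regs[if_id.rt], if_id.rd) if if_id else None
        new_if_id = program[cycle] if cycle < len(program) else None
        if_id, id_ex, ex_mem, mem_wb = new_if_id, new_id_ex, new_ex_mem, new_mem_wb
    return regs


if __name__ == "__main__":
    regs0 = list(range(16))  # register i starts out holding i
    prog = [Add(1, 2, 3), Add(4, 5, 6), Add(7, 8, 9), Add(10, 11, 12), Add(13, 14, 15)]
    good = run(prog, regs0, carry_write_reg=True)
    bad = run(prog, regs0, carry_write_reg=False)
    print([good[r] for r in (1, 4, 7, 10, 13)])  # [5, 11, 17, 23, 29]
    print([bad[r] for r in (1, 4, 7, 10, 13)])   # [1, 4, 7, 5, 11]
```

In the naive run the first two sums land in $10 and $13 (the destinations of whichever instruction happened to be in RF), and the last three are lost because RF is empty by then.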


Assembly-line Control Signals

In an assembly line, the manufacturing instructions can be attached to the car. The instructions then move along with the car.

[Figure: instruction cards riding along with the cars, e.g. "F: Standard, E: 135 HP, B: 2-door, P: Green, F: Leather" and "E: 190 HP, B: 4-door, P: Blue, F: Cotton"; each card only lists the stages its car has not reached yet.]

By separating the control signals by stages, only the signals needed for the current stage must be decoded.

All signals for later stages must be passed along.

6.1


The Pipelined Control Logic

6.3

[Figure: the pipelined datapath with a Control unit that decodes Op [31-26] in the decode stage. Its outputs travel in the pipeline registers as three groups: EX (RegDest, ALUOp, ALUSrc), M (Branch, MemRead, MemWrite), and W (RegWrite, MemToReg). ALU control, PCSrc, and the Rt/Rd write-register mux are also shown.]
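The grouping of control signals by stage can be sketched in Python. The decode table uses the standard MIPS single-cycle control values for R-type, lw, sw and beq (don't-care bits set to 0); the class and function names are hypothetical, and only the signals named on the slide are modeled.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ExCtrl:      # used in EX
    reg_dest: int  # RegDest: 1 = write Rd, 0 = write Rt
    alu_op: int    # ALUOp, 2 bits, refined by ALU control
    alu_src: int   # ALUSrc: 1 = sign-extended immediate


@dataclass(frozen=True)
class MemCtrl:     # used in M
    branch: int
    mem_read: int
    mem_write: int


@dataclass(frozen=True)
class WbCtrl:        # used in WB
    reg_write: int
    mem_to_reg: int  # MemToReg: 1 = write the loaded value


# Op [31-26] -> control bundles, decoded once in the RF stage.
CONTROL: Dict[int, Tuple[ExCtrl, MemCtrl, WbCtrl]] = {
    0x00: (ExCtrl(1, 0b10, 0), MemCtrl(0, 0, 0), WbCtrl(1, 0)),  # R-type
    0x23: (ExCtrl(0, 0b00, 1), MemCtrl(0, 1, 0), WbCtrl(1, 1)),  # lw
    0x2B: (ExCtrl(0, 0b00, 1), MemCtrl(0, 0, 1), WbCtrl(0, 0)),  # sw
    0x04: (ExCtrl(0, 0b01, 0), MemCtrl(1, 0, 0), WbCtrl(0, 0)),  # beq
}


def carried_controls(opcode: int) -> Dict[str, tuple]:
    """What each pipeline register holds for this instruction's control."""
    ex, mem, wb = CONTROL[opcode]
    return {
        "ID/EX": (ex, mem, wb),  # everything still needed
        "EX/MEM": (mem, wb),     # EX bundle already used
        "MEM/WB": (wb,),         # only write-back signals left
    }


if __name__ == "__main__":
    for stage, bundles in carried_controls(0x23).items():
        print(stage, bundles)
```

Each stage consumes its own bundle and forwards the rest, which is the "only decode what the current stage needs, pass the rest along" idea in hardware form.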


How’d we do?

• Compared to Single-cycle

• 5 stages --> Potentially 5x speedup

• Not likely:
  • Stages won't all be equally long
  • Pipeline registers will cause some delays

• Latency --> Greater than in single-cycle design

• More complexity, but nicely divided up
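The two reasons the 5x figure is "not likely" can be put into numbers with a small Python sketch. The stage delays and register overhead in the example are made-up values chosen for illustration, not measurements; the model assumes no stalls and a long instruction stream.

```python
from typing import Sequence


def single_cycle_period(stage_ns: Sequence[float]) -> float:
    """Single-cycle design does everything at once: one long clock."""
    return sum(stage_ns)


def pipelined_period(stage_ns: Sequence[float], reg_overhead_ns: float) -> float:
    """Clock is set by the slowest stage plus pipeline-register delay."""
    return max(stage_ns) + reg_overhead_ns


def ideal_speedup(stage_ns: Sequence[float], reg_overhead_ns: float) -> float:
    """Throughput speedup for a long program with no stalls."""
    return single_cycle_period(stage_ns) / pipelined_period(stage_ns, reg_overhead_ns)


def pipelined_latency(stage_ns: Sequence[float], reg_overhead_ns: float) -> float:
    """Time for one instruction to pass through all stages."""
    return len(stage_ns) * pipelined_period(stage_ns, reg_overhead_ns)


if __name__ == "__main__":
    balanced = [40, 40, 40, 40, 40]    # hypothetical, perfectly equal stages
    uneven = [50, 30, 50, 50, 20]      # hypothetical, same 200 ns total
    print(ideal_speedup(balanced, 0))  # 5.0
    print(ideal_speedup(uneven, 0))    # 4.0
    print(ideal_speedup(uneven, 10))   # about 3.33
    print(single_cycle_period(uneven), pipelined_latency(uneven, 10))  # 200 300
```

The last line shows the latency point: one instruction takes 300 ns through the pipeline against 200 ns in the single-cycle design.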


Example 1

• Consider executing the following code

add $3, $4, $5

and $6, $7, $8

sub $9, $10, $11

on

i) A single-cycle machine with a cycle time of 200 ns

ii) A 5-stage pipeline machine with a cycle time of 50 ns

Which one runs faster?

What if the instructions were 100 instead of 3?
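A worked sketch of Example 1 in Python, assuming the instructions never stall. That holds for the three shown (none of them reads $3, $6 or $9), and for the 100-instruction case it is an added assumption. Each instruction after the first then adds one 50 ns cycle.

```python
def single_cycle_ns(n_instructions: int, cycle_ns: int = 200) -> int:
    """One instruction per 200 ns cycle."""
    return n_instructions * cycle_ns


def pipelined_ns(n_instructions: int, cycle_ns: int = 50, stages: int = 5) -> int:
    """First instruction takes `stages` cycles; each later one adds one cycle."""
    return (stages + n_instructions - 1) * cycle_ns


if __name__ == "__main__":
    for n in (3, 100):
        print(n, single_cycle_ns(n), pipelined_ns(n))
    # 3 instructions:   600 ns vs 350 ns
    # 100 instructions: 20000 ns vs 5200 ns
```

With more instructions the fill time matters less, and the speedup (about 3.85 for 100 instructions) approaches the 200/50 = 4 ratio of cycle times.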


Analyzing Pipelines

6.4

ADD $10, $14, $0
SUB $12, $13, $2
AND $1, $6, $11
SW $3, 200($9)
OR $9, $13, $7

Cycle   1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB
SUB         IF  RF  EX  M   WB
AND             IF  RF  EX  M   WB
SW                  IF  RF  EX  M   WB
OR                      IF  RF  EX  M   WB
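A diagram like this one can be generated with a few lines of Python. The sketch assumes one instruction enters IF per cycle and nothing stalls, so n instructions finish after n + 4 cycles; the function names are made up for illustration.

```python
from typing import List

STAGES = ["IF", "RF", "EX", "M", "WB"]


def pipeline_diagram(names: List[str]) -> List[str]:
    """One text row per instruction, shifted right one cycle per instruction."""
    width = max(len(name) for name in names)
    rows = []
    for i, name in enumerate(names):
        cells = ["  "] * i + [f"{stage:<2}" for stage in STAGES]
        rows.append(f"{name:<{width}}  " + " ".join(cells).rstrip())
    return rows


def total_cycles(n_instructions: int, stages: int = len(STAGES)) -> int:
    """Cycles until the last instruction leaves WB, with no stalls."""
    return n_instructions + stages - 1


if __name__ == "__main__":
    print("\n".join(pipeline_diagram(["ADD", "SUB", "AND", "SW", "OR"])))
    print(total_cycles(5))  # 9
```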


Data Hazards

6.4

ADD $13, $14, $0    Writes register $13
SUB $12, $13, $2    Reads wrong $13
AND $1, $6, $13     Reads wrong $13
SW $3, 200($13)     Reads ? $13 (its RF is in the same cycle as ADD's WB)
OR $9, $13, $7      Reads correct $13

Cycle   1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB
SUB         IF  RF  EX  M   WB
AND             IF  RF  EX  M   WB
SW                  IF  RF  EX  M   WB
OR                      IF  RF  EX  M   WB


Preventing Data Hazards

6.4

ADD $13, $14, $0
NOP
NOP
NOP
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

Insert NOPs into the instruction stream to allow WB to happen before RF.

Assume we can't write a register and read the new value in the same cycle.

Cycle   1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB
NOP         IF  RF  EX  M   WB
NOP             IF  RF  EX  M   WB
NOP                 IF  RF  EX  M   WB
SUB                     IF  RF  EX  M   WB
AND                         IF  RF  EX  M
SW                              IF  RF  EX
OR                                  IF  RF
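The three NOPs follow from the stage timing. Here is a small Python sketch, assuming no forwarding and the slide's rule that a register can't be written and read in the same cycle. The optional flag shows the count for a register file that writes first and reads later within one cycle, a common alternative design but not the one assumed here.

```python
def nops_needed(distance: int, same_cycle_write_read: bool = False) -> int:
    """NOPs to put between a producer and a consumer `distance` instructions later.

    distance = 1 means the consumer immediately follows the producer.
    Producer fetched in cycle t writes in WB during cycle t + 4.
    Consumer fetched in cycle t + distance reads in RF during cycle t + distance + 1.
    """
    # RF must come strictly after WB (t + d + 1 > t + 4, i.e. d >= 4),
    # or may share the cycle if the register file allows it (d >= 3).
    min_distance = 3 if same_cycle_write_read else 4
    return max(0, min_distance - distance)


if __name__ == "__main__":
    print(nops_needed(1))  # 3: SUB right after ADD
    print(nops_needed(3))  # 1: SW, the "?" case
    print(nops_needed(4))  # 0: OR is four instructions after ADD
    print(nops_needed(3, same_cycle_write_read=True))  # 0
```

The SW line is why that read is marked "?": it depends only on whether the register file can hand over a value in the cycle it is written.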


Detecting Hazards

6.5

ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

Check each instruction as it is being decoded (RF/ID stage). If it reads a register that will be written by any instruction ahead of it (in the EX, M, or WB stages), there is a hazard.

Cycle   1   2   3   4   5   6   7
ADD     IF  RF  EX  M   WB
SUB         IF  RF  EX  M   WB
AND             IF  RF  EX  M   WB
SW                  IF  RF  EX  M
OR                      IF  RF  EX

ADD writes $13 (Write); SUB reads it as Read A, AND as Read B, and SW as Read A.

• Compare write reg # in EX with read reg # in RF
• Compare write reg # in M with read reg # in RF
• Compare write reg # in WB with read reg # in RF
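The three comparisons amount to a small check, sketched here in Python with hypothetical names. It assumes the slides' rule that an instruction still in WB counts as a hazard, and, as MIPS defines it, that $0 is hardwired to zero so writing it never creates one.

```python
from typing import Iterable, Optional


def rf_hazard(
    reads: Iterable[int],
    ex_write: Optional[int],
    m_write: Optional[int],
    wb_write: Optional[int],
) -> bool:
    """True if the instruction in RF reads a register that an instruction
    ahead of it (in EX, M or WB) is going to write.

    *_write is that instruction's destination register, or None if it
    writes no register (e.g. SW, or a bubble).
    """
    # $0 is hardwired to zero, so writing it never creates a hazard.
    pending = {w for w in (ex_write, m_write, wb_write) if w not in (None, 0)}
    return any(r in pending for r in reads)


if __name__ == "__main__":
    # Cycle 5: SW $3, 200($13) in RF reads $13 and $3;
    # AND (writes $1) in EX, SUB (writes $12) in M, ADD (writes $13) in WB.
    print(rf_hazard([13, 3], ex_write=1, m_write=12, wb_write=13))    # True
    # Cycle 6: OR $9, $13, $7 in RF; SW (no write) in EX, AND in M, SUB in WB.
    print(rf_hazard([13, 7], ex_write=None, m_write=1, wb_write=12))  # False
```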


Stalling with Bubbles

6.5

ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

[Figure: SUB is held back by repeatedly killing it and fetching it again (three bubbles) until ADD has written $13; SUB then proceeds through RF, EX, M, and WB, followed by AND, SW, and OR. Equality comparators (=) check the register numbers.]

Stalling:
• Kill the current execution by "neutralizing" all the control signals so that it won't write any registers.
• Don't write PC+4 into PC --> Stay at the current instruction and try again.


Register Forwarding

6.6

ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $2

Register $13's value is computed in the EX stage of the ADD, even though it isn't written into the register file until the WB stage.

--> The pipeline register following the EX stage holds the value of $13 that's needed in the SUB instruction's EX stage.

Cycle   1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB
SUB         IF  RF  EX  M   WB
AND             IF  RF  EX  M   WB
SW                  IF  RF  EX  M   WB
OR                      IF  RF  EX  M   WB
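The forwarding choice for one ALU operand can be sketched in Python, following the usual textbook MIPS forwarding-unit conditions: take the value from EX/MEM if the instruction there writes the source register, else from MEM/WB, else use the register-file value. Names are hypothetical, and the sketch leaves out the separate question of whether the register file itself passes a value written in WB to a read in the same cycle (the SW case).

```python
from typing import Optional


def forward_source(src: int, ex_mem_dest: Optional[int], mem_wb_dest: Optional[int]) -> str:
    """Choose where the EX stage takes operand register `src` from.

    ex_mem_dest / mem_wb_dest: destination register of the instruction in that
    pipeline register, or None if it does not write a register.
    """
    if src == 0:
        return "REGFILE"        # $0 is always zero; never forward it
    if src == ex_mem_dest:
        return "EX/MEM"         # newest value: the instruction one ahead
    if src == mem_wb_dest:
        return "MEM/WB"         # the instruction two ahead
    return "REGFILE"


if __name__ == "__main__":
    # SUB $12, $13, $2 in EX while ADD ($13) sits in EX/MEM:
    print(forward_source(13, ex_mem_dest=13, mem_wb_dest=None))  # EX/MEM
    # AND $1, $6, $13 in EX: SUB ($12) in EX/MEM, ADD ($13) in MEM/WB:
    print(forward_source(13, ex_mem_dest=12, mem_wb_dest=13))    # MEM/WB
    print(forward_source(6, ex_mem_dest=12, mem_wb_dest=13))     # REGFILE
```

Checking EX/MEM first matters when both pipeline registers hold a write to the same register: the younger instruction's value is the correct one.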


Unforwardable Loads

6.6

LW $2, 30($2)
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $1

[Figure: LW followed by AND, SW, and OR; LW's loaded value is not available until the end of its M stage, which is the same cycle in which AND would need it in EX.]

Loads don't produce the value to write back until the Memory stage. This is one stage too late for the next instruction. --> We can't avoid a stall if the instruction immediately following a load uses the result of the load.
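The load-use case is usually caught by one extra check in the decode stage. Here is a Python sketch of the common textbook form (names hypothetical): stall if the instruction in EX is a load and its destination is one of the registers being read in RF.

```python
from typing import Iterable, Optional


def load_use_stall(ex_is_load: bool, ex_dest: Optional[int], rf_reads: Iterable[int]) -> bool:
    """True if the instruction in RF must wait one cycle for a load ahead of it.

    After that one-cycle bubble, the loaded value can be forwarded from
    MEM/WB into the waiting instruction's EX stage.
    """
    return ex_is_load and ex_dest not in (None, 0) and ex_dest in rf_reads


if __name__ == "__main__":
    # LW $2, 30($2) in EX while AND $1, $2, $13 is in RF: stall.
    print(load_use_stall(True, 2, [2, 13]))      # True
    # Next cycle a bubble (not a load) is in EX and AND is still in RF: go.
    print(load_use_stall(False, None, [2, 13]))  # False
```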


Example 2

• Consider executing the following code on a 5-stage pipeline datapath

add $3, $4, $5

lw $7, 100($3)

sub $8, $7, $9

1. Identify any potential data dependencies

2. How many cycles will it take to execute this code assuming no register forwarding?

3. How many cycles will it take to execute this code assuming register forwarding is available?
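One way to check questions 2 and 3 is a small in-order timing model in Python. It is a sketch under stated assumptions, not the slides' official solution: one fetch per cycle; a stalled instruction waits in RF; without forwarding, RF must come strictly after the producer's WB (the "can't write and read in the same cycle" rule); with forwarding, only a load's result forces a wait, since its value is ready after M. The dependencies it follows are add to lw through $3 and lw to sub through $7.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[int]       # register written, or None
    srcs: Tuple[int, ...]     # registers read
    is_load: bool = False


def count_cycles(program: List[Instr], forwarding: bool) -> int:
    """Return the cycle in which the last instruction finishes WB (first IF = cycle 1)."""
    rf_cycle: List[int] = []
    m_cycle: List[int] = []
    wb_cycle: List[int] = []
    last_writer: Dict[int, int] = {}  # register -> index of its latest producer
    for i, ins in enumerate(program):
        rf = i + 2                          # fetched in cycle i + 1 if nothing stalls
        if i:
            rf = max(rf, rf_cycle[-1] + 1)  # in order: stay behind the previous instr
        for r in ins.srcs:
            p = last_writer.get(r)
            if r == 0 or p is None:
                continue
            if not forwarding:
                rf = max(rf, wb_cycle[p] + 1)  # read strictly after the write
            elif program[p].is_load:
                rf = max(rf, m_cycle[p])       # EX (= RF + 1) right after load's M
            # ALU results forward from EX/MEM or MEM/WB with no wait.
        rf_cycle.append(rf)
        m_cycle.append(rf + 2)
        wb_cycle.append(rf + 3)
        if ins.dest not in (None, 0):
            last_writer[ins.dest] = i
    return wb_cycle[-1]


EXAMPLE_2 = [
    Instr("add $3, $4, $5", 3, (4, 5)),
    Instr("lw $7, 100($3)", 7, (3,), is_load=True),
    Instr("sub $8, $7, $9", 8, (7, 9)),
]

if __name__ == "__main__":
    print(count_cycles(EXAMPLE_2, forwarding=False))  # 13
    print(count_cycles(EXAMPLE_2, forwarding=True))   # 8
```

Three hazard-free instructions would take 7 cycles, so under these assumptions the code costs 6 stall cycles without forwarding and 1 (the load-use stall) with it.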


Branch Hazards

6.7

BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)

Cycle   1   2   3   4   5   6   7   8   9
BEQ     IF  RF  EX  M   WB
AND         IF  RF  EX  M   WB
SW              IF  RF  EX  M   WB
OR                  IF  RF  EX  M   WB
LW                      IF  RF  EX  M   WB

Don't know the result of the branch until the end of the M stage.

If the branch is taken, we've blown it by executing the intervening instructions.


Solution 1: Stall

6.5

[Figure: the branch-not-taken case. After BEQ, AND is fetched and discarded three times until BEQ resolves at the end of its M stage; then AND, SW, OR, and ADD proceed normally.]

BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)

Stalling always solves the problem. If we didn't have so many branches in programs, it would not be a problem.


Solution 2: Assume not Taken

6.7

BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)

[Figure: BEQ followed by AND, SW, and OR, fetched on the assumption that the branch is not taken. The branch turns out to be taken, so these three instructions must be undone, and LW is fetched after BEQ's M stage.]

If we guess right, we win --> No stall at all.

If we guessed wrong:
1. We have to undo all that we did (fortunately, no write-backs have occurred yet).
2. We still take all the time of a stall.
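The difference between Solution 1 and Solution 2 shows up in the average cycles per instruction (CPI). A Python sketch, assuming the 3-cycle penalty implied by resolving the branch at the end of M, an otherwise stall-free pipeline (CPI of 1), and made-up branch statistics in the example:

```python
BRANCH_PENALTY = 3  # BEQ resolves at the end of M: 3 wasted fetch slots


def cpi_always_stall(branch_frac: float) -> float:
    """Solution 1: every branch costs the full penalty."""
    return 1 + branch_frac * BRANCH_PENALTY


def cpi_assume_not_taken(branch_frac: float, taken_frac: float) -> float:
    """Solution 2: only taken branches pay; the wrong-path work is squashed."""
    return 1 + branch_frac * taken_frac * BRANCH_PENALTY


if __name__ == "__main__":
    # Hypothetical program: 20% branches, 60% of them taken.
    print(cpi_always_stall(0.2))           # about 1.6
    print(cpi_assume_not_taken(0.2, 0.6))  # about 1.36
```

Assume-not-taken is never worse than stalling under this model, since a wrong guess costs the same time as a stall.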


6.7

Solution 3: Better Prediction

• Predict that the branch goes the same way as the last time

• Works great for loops

• Works great for “special-case” code

• Need to keep track of the information for each branch, though...

• One or two bits will do

• Keep a small table of recently used branches and which way they went
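A table of one- or two-bit counters is simple to model. This Python sketch uses the standard saturating-counter scheme, indexed by low bits of the branch address; the table size, initial state and example outcome pattern are assumptions. With `bits=1` it is exactly "predict the same way as last time".

```python
from typing import Iterable


class CounterPredictor:
    """Small table of saturating counters indexed by branch address bits."""

    def __init__(self, bits: int = 2, index_bits: int = 4):
        self.max = (1 << bits) - 1           # 1 for one bit, 3 for two bits
        self.threshold = 1 << (bits - 1)     # counter >= threshold -> predict taken
        self.mask = (1 << index_bits) - 1
        # Start every entry just below "taken" (an arbitrary choice).
        self.table = [self.threshold - 1] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask         # instructions are 4-byte aligned

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.max, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def mispredictions(pred: CounterPredictor, pc: int, outcomes: Iterable[bool]) -> int:
    """Count wrong guesses for one branch over a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        if pred.predict(pc) != taken:
            misses += 1
        pred.update(pc, taken)
    return misses


if __name__ == "__main__":
    # A loop branch taken 9 times then not taken, with the loop run twice.
    outcomes = ([True] * 9 + [False]) * 2
    print(mispredictions(CounterPredictor(bits=1), 0x40, outcomes))  # 4
    print(mispredictions(CounterPredictor(bits=2), 0x40, outcomes))  # 3
```

The second bit helps loops: one not-taken exit does not flip the prediction, so re-entering the loop costs no extra miss.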


6.7

Solution 4: Delayed Branches

XOR $1, $3, $3
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
BEQ $10, $11, SKIP
LW $4, 60($2)
SKIP: AND $1, $2, $3

If we had some warning, we could compute the branch ahead of time...

XOR $1, $3, $3
Branch-After-Three-EQ $10, $11, SKIP
ADD $2, $3, $4      (delay slot 1)
SUB $4, $3, $1      (delay slot 2)
OR $3, $2, $0       (delay slot 3)
LW $4, 60($2)
SKIP: AND $1, $2, $3

3 delay slots: these instructions are always executed. The branch can't depend on them...
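The effect of the three delay slots on execution order can be traced with a tiny Python model. It treats instructions as labels only, writes the slide's hypothetical Branch-After-Three-EQ as B3E, and takes "taken" as an input instead of comparing $10 and $11.

```python
from typing import List

# The reordered code from the slide; index 6 is SKIP.
PROGRAM = ["XOR", "B3E", "ADD", "SUB", "OR", "LW", "AND"]
SKIP = 6


def trace(program: List[str], taken: bool, slots: int = 3, target: int = SKIP) -> List[str]:
    """Order in which instructions execute around a delayed branch."""
    order: List[str] = []
    pc = 0
    while pc < len(program):
        order.append(program[pc])
        if program[pc] == "B3E":
            # The delay-slot instructions always execute...
            order.extend(program[pc + 1 : pc + 1 + slots])
            # ...and only then does control leave the straight line.
            pc = target if taken else pc + 1 + slots
        else:
            pc += 1
    return order


if __name__ == "__main__":
    print(trace(PROGRAM, taken=True))   # ['XOR', 'B3E', 'ADD', 'SUB', 'OR', 'AND']
    print(trace(PROGRAM, taken=False))  # ['XOR', 'B3E', 'ADD', 'SUB', 'OR', 'LW', 'AND']
```

Either way ADD, SUB and OR run exactly as in the original ordering, so the program keeps its meaning while the branch's resolution time is filled with useful work.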


3-slot Delayed Branch

6.7

Branch-After-Three-EQ $10, $11, SKIP
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
LW $4, 60($2)
SKIP: AND $1, $2, $3

Cycle       1   2   3   4   5   6   7   8   9
B3E         IF  RF  EX  M   WB
ADD             IF  RF  EX  M   WB
SUB                 IF  RF  EX  M   WB
OR                      IF  RF  EX  M   WB
LW or AND                   IF  RF  EX  M   WB


Branch summary

• Two decent solutions:

• Branch prediction
  • Requires more hardware
  • Used in modern microprocessors

• Delayed branch
  • Requires special software manipulation
  • Often doesn't deliver on its promise
  • Used often in CPUs 4-10 years ago


Example 3

• Consider executing the following code

LOOP: add $3, $4, $5
      and $6, $7, $8
      bne $12, $8, LOOP

on
i) A single-cycle machine with a cycle time of 200 ns
ii) A 5-stage pipeline machine with a cycle time of 50 ns

A. Assume the loop executes 10 times
B. Assume the loop executes 100 times
C. Assume the loop executes 1000 times

Which one runs faster?
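A sketch of Example 3's arithmetic in Python. The assumptions are not given on the slide: the loop has no data hazards (bne reads $12 and $8, which add and and do not write); the pipeline stalls 3 cycles after every bne as in Solution 1 (set `branch_penalty=0` to model perfect prediction); and timing stops when the last bne leaves WB.

```python
def single_cycle_ns(iterations: int, instrs_per_iter: int = 3, cycle_ns: int = 200) -> int:
    """Every instruction takes one 200 ns cycle."""
    return iterations * instrs_per_iter * cycle_ns


def pipelined_ns(iterations: int, instrs_per_iter: int = 3, cycle_ns: int = 50,
                 branch_penalty: int = 3, stages: int = 5) -> int:
    # Every iteration but the last is followed by `branch_penalty` bubbles;
    # the final instruction then needs stages - 1 more cycles to leave WB.
    issue_cycles = iterations * instrs_per_iter + (iterations - 1) * branch_penalty
    return (issue_cycles + stages - 1) * cycle_ns


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, single_cycle_ns(n), pipelined_ns(n), pipelined_ns(n, branch_penalty=0))
```

Under these assumptions the pipeline wins in all three cases (for 10 iterations, 3050 ns against 6000 ns). The advantage approaches 2x with stalls (6 cycles of 50 ns per iteration against 600 ns) and 4x with perfect prediction.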


Example 4

• Consider executing the following code on a 5-stage pipeline datapath

addi $3, $0, 10
LOOPSTART: lw $5, ARRAY($3)
addi $5, $5, 1
sw $5, ARRAY
addi $3, $3, -1
bne $3, $0, LOOPSTART
add $3, $5, $6
sub $7, $8, $9
addi $4, $6, 3

1. Identify potential data dependencies
2. How many cycles will it take to execute this code?
   A. With nops/stalls
   B. With branch prediction assuming branch not taken
   C. With branch prediction based on one previous result
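For part 1, read-after-write dependencies can be listed mechanically. This Python sketch scans the code once in program order. It therefore shows the dependencies within one pass and ignores the loop-carried one (lw in the next iteration reading the $3 written by addi $3, $3, -1). It assumes `sw $5, ARRAY` reads only $5, since no base register is shown, and never counts $0 as a dependency.

```python
from typing import Dict, List, Optional, Tuple

# (text, destination register or None, registers read)
PROGRAM: List[Tuple[str, Optional[int], Tuple[int, ...]]] = [
    ("addi $3, $0, 10", 3, (0,)),
    ("lw $5, ARRAY($3)", 5, (3,)),
    ("addi $5, $5, 1", 5, (5,)),
    ("sw $5, ARRAY", None, (5,)),
    ("addi $3, $3, -1", 3, (3,)),
    ("bne $3, $0, LOOPSTART", None, (3, 0)),
    ("add $3, $5, $6", 3, (5, 6)),
    ("sub $7, $8, $9", 7, (8, 9)),
    ("addi $4, $6, 3", 4, (6,)),
]


def raw_dependencies(program) -> List[Tuple[int, int, int, int]]:
    """(producer index, consumer index, register, distance) for each RAW pair."""
    last_writer: Dict[int, int] = {}
    deps = []
    for i, (_text, dest, srcs) in enumerate(program):
        for r in dict.fromkeys(srcs):  # each source register once, in order
            if r != 0 and r in last_writer:
                p = last_writer[r]
                deps.append((p, i, r, i - p))
        if dest not in (None, 0):
            last_writer[dest] = i
    return deps


if __name__ == "__main__":
    for p, c, r, d in raw_dependencies(PROGRAM):
        print(f"{PROGRAM[c][0]!r} reads ${r} from {PROGRAM[p][0]!r} ({d} apart)")
```

Pairs one instruction apart are the ones that stall longest; the lw to addi $5, $5, 1 pair is a load-use case that forwarding alone cannot hide.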