Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture

Professor Nigel Topham

Director, Institute for Computing Systems Architecture

School of Informatics

Edinburgh University

Informatics 3

Computer Architecture

Inf3 Computer Architecture - 2006-2007 2

Terminology

Instruction fetch: fetch instruction from memory

Instruction issue: decode instruction, check for

structural hazards, and send to execution units

Instruction execution: execute instruction once

dependences are cleared

Instruction completion (or retire or commit): finish

instruction and update processor state

Some combinations are possible:

– In-order issue, execution and completion

– Out-of-order issue and execution and in-order completion

– Out-of-order issue, execution and completion


Dynamic Scheduling 1: Scoreboarding

Handles all RAW, WAR, and WAW with proper stalls, but allows independent instructions to proceed

Step 1: Issue (part of original ID stage)– Issue instruction to functional unit iff functional unit is free

and no earlier instruction writes to the same destination register (WAW)

Step 2: Read operands (part of original ID stage)– Wait until source registers become available from earlier

instructions through register file (RAW)

Step 3: Execute (original EXE stage)– Execute instruction and notify scoreboard when done

Step 4: Write result (original WB stage)– Wait until earlier instructions read operands before writing

to register file (WAR)


Scoreboard Organization

Instruction status: either one of the four steps of the instruction operation (Issue, Read Op, Execute, Write)

Functional unit status:– Busy – functional unit is being used– Op – type of operation to be performed (e.g., add,

subtract, etc)

– Fi – destination register

– Fj, Fk – source registers

– Qj, Qk – functional units producing Fj and Fk

– Rj, Rk – flag to indicate when Fj and Fk are ready (set to No after operand is read)

Register result status: indicate which functional unit will write the register next (one per register)


Instruction sequence:

Latencies:– Integer 1 cycle– FP add 2 cycles– FP multiply 10 cycles– FP divide 40 cycles

Functional units: 1 integer (also for ld/st), 1 FP adder, 2 FP multipliers, 1 FP divider

L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2

Scoreboard Example


Instruction


Issue Read Op Execute Write

Busy Op Fi FjName

IntegerMult1Mult2Add

Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 1

Y Load F6 R2 Y

Integer

Scoreboard Example


cycle 3cycle 4cycle 2Scoreboard Example

Instruction



Busy Op Fi FjName


Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

Y Load F6 R2 N

Integer

MultY F0 F2 F4 N Y

Mult1

Integer

SubY F8 F6 F2 N N

Add

Y Div F10 F0 F6 Mult1 NInteger N

Divide


Scoreboard Example

Instruction



Busy Op Fi FjName


Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 5

Y Load F2 R3 Y

Integer

MultY F0 F2 F4 N Y

Mult1

SubY F8 F6 F2 Y N

Add

Y Div F10 F0 F6 Mult1 N Y

Divide

Integer

Integer


cycle 6cycle 7cycle 8Scoreboard Example

Instruction



Busy Op Fi FjName


Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

Y Load F2 R3 N

Integer

MultY F0 F2 F4 N Y

Mult1

SubY F8 F6 F2 Y N

Add


Divide

Integer

Integer

H&P A.52


Scoreboard Example

Instruction



Busy Op Fi FjName


Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 9

MultY F0 F2 F4 N N

Mult1

SubY F8 F6 F2 N N

Add


Divide

cycle 10

cycle 11

cycle 12


Scoreboard Example

Instruction



Busy Op Fi FjName


Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 13

MultY F0 F2 F4 N N

Mult1

AddY F6 F8 F2 Y Y

Add


Divide


cycle 14cycle 15cycle 16cycle 17cycle 18cycle 19cycle 20Scoreboard Example

Instruction



Busy Op Fi FjName


Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

MultY F0 F2 F4 N N

Mult1

AddY F6 F8 F2 N N

Add


Divide

H&P A.53


Scoreboard Example

Instruction



Busy Op Fi FjName


Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 21

AddY F6 F8 F2 N N

Add

Y Div F10 F0 F6 N N

Divide

cycle 22


Scoreboard Example

Instruction



Busy Op Fi FjName


Divide

Fk Qj Qk Rj Rk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 23

Y Div F10 F0 F6 N N

Divide

H&P A.54

cycle 62


Scoreboard Summary

Dynamically schedules instructions

Forces instructions to wait on RAW, WAR, WAW dependences and structural hazards

First used in the CDC 6600 in 1964 and yielded performance improvements of 1.7 to 2.5 times

Hardware cost (size) of scoreboard equivalent to one of the functional units

Many more buses required to move around operands, results, and instructions


Dynamic Scheduling 2: Tomasulo’s Algorithm

Handles RAW with proper stalls and eliminates WAR and WAW through register renaming

Step 1: Issue– Issue instruction to the reservation stations if there is a

free reservation station– Read operands if available or rename operands if pending

(WAR and WAW)

Step 2: Execute– Execute instruction when both operands are ready in the

reservation station (RAW)

Step 3: Write result– Send results to register file and to all reservation stations

requiring the values (RAW)

Results are communicated through common data bus (CDB) (forwarding)


Address unit

IBM S/360 model 91 used Tomasulo’s Algorithm

Dynamic O-O-O execution

Tags used to name flow dependencies

5 reservation stations

6 load buffers

Issue instructions to reservation stations, load buffers and store buffers

Instructions wait in reservation stations or store buffers until all their operands are collected

Functional units broadcast result and tag on the Common Data Bus (CDB) for all reservation stations, store buffers and FP registers to pick up.

FP adders FP multipliersMemory unit

Address unit

From instruction fetch unit

InstructionQueue

FP registers

Store buffersLoad buffers

Reservationstations

1

23

4

5

6

11

.

.

.

ld f1, 4(r1)

st f4, 8(r2)

mul f3, f1, f2

add f4, f5, f3


Data Structures for Tomasulo’s Algorithm

Reservation station (RS) fields:Op The operation to be performedQj, Qk Tags to identify the producing RS of each source operandVj, Vk Actual values of the two source operandsBusy Indicates if RS contains a valid instruction

Register file fields:Qi Tag to identify producing RS of next value for this

register

Load buffers:A The computed address from which to load

Store buffers:A The computed address to which a store will take

placeQj Tag to identify the producing RS of the store data


Operation of Tomasulo’s Algorithm

Instruction Issue:Get next instruction from head of the issue queue

If reservation station RS is available then:For each p in { i, j } representing operand register u

If Reg[u].Qi == 0 then RS.Vp = Reg[u].value // value ready now

If Reg[u].Qi != 0 then RS.Qp = Reg[u].Qi // value not yet ready

RS.Busy = 1 // reserve this RS

RS.Op = instruction opcode // set the operation

Execution:Wait until (RS.Qj == 0) and (RS.Qk == 0), and whilst waiting:

For each p in { i, j } If CDB.tag == RS.Qp then { RS.Vp = CDB.value; RS.Qp = 0 }

When (RS.Qj == 0) and (RS.Qk == 0), perform operation in RS.Op

Write Result:When CDB is free, broadcast CDB = { tag = RS.id, value = RS.result }

and clear RS.Busy


Tomasulo’s Example

Instruction


Issue Execute Write

Busy Op VjName

Add_1Add_2Add_3Mult_1Mult_2

Vk Qj Qk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 1

Load_1

Y Mult R[F4]

Load_1

Mult_1

cycle 2cycle 3

Y Sub M[34+R[R2]] Load_1

Add_1



Instruction


Issue Execute Write

Busy Op VjName


Vk Qj Qk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 4

Load_1

Y Mult R[F4]

Load_1

Mult_1


Add_1

Y Div Mult_1R[F6]

Mult_2

cycle 5cycle 6

Y Add M[45+R[R3]] Add_1

Add_2



Instruction


Issue Execute Write

Busy Op VjName


Vk Qj Qk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 7

Y Mult R[F4]

Load_1

Mult_1


Add_1

Y Div Mult_1R[F6]

Mult_2


Add_2

cycle 8



Instruction


Issue Execute Write

Busy Op VjName


Vk Qj Qk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 9

Y Mult R[F4]

Load_1

Mult_1

Y Div Mult_1R[F6]

Mult_2


Add_2

cycle 10

cycle 11



Instruction


Issue Execute Write

Busy Op VjName


Vk Qj Qk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 12

Y Mult R[F4]

Load_1

Mult_1

Y Div Mult_1R[F6]

Mult_2

cycle 15

cycle 16



Instruction


Issue Execute Write

Busy Op VjName


Vk Qj Qk

F0 F2 F4 F6 F8 F10 F12 … F30

cycle 55

Y Div Mult_1R[F6]

Mult_2


Tomasulo’s Advantages

Register renaming:

– Qj and Qk can come from any reservation station independent of the register file

in fact we could have many more reservation stations than registers

– Vj and Vk store the actual value to be used

Parallel release of all instructions dependent as soon as the earlier

instruction completes (both SUB.D and ADD.D get the value from Load_1 )

No need to wait on WAR and WAW (notice that ADD.D has issued before DIV.D

has read its f6 operand and will execute as soon as the SUB.D finishes)

Dynamic unrolling of loops in hardware with dynamic register renaming

Documents

Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture