Upload
chester-kevin-gordon
View
220
Download
0
Embed Size (px)
Citation preview
Professor Nigel Topham
Director, Institute for Computing Systems Architecture
School of Informatics
Edinburgh University
Informatics 3
Computer Architecture
Inf3 Computer Architecture - 2006-2007 2
Terminology
Instruction fetch: fetch instruction from memory
Instruction issue: decode instruction, check for
structural hazards, and send to execution units
Instruction execution: execute instruction once
dependences are cleared
Instruction completion (or retire or commit): finish
instruction and update processor state
Some combinations are possible:
– In-order issue, execution and completion
– Out-of-order issue and execution and in-order completion
– Out-of-order issue, execution and completion
Inf3 Computer Architecture - 2006-2007 3
Dynamic Scheduling 1: Scoreboarding
Handles all RAW, WAR, and WAW with proper stalls, but allows independent instructions to proceed
Step 1: Issue (part of original ID stage)– Issue instruction to functional unit iff functional unit is free
and no earlier instruction writes to the same destination register (WAW)
Step 2: Read operands (part of original ID stage)– Wait until source registers become available from earlier
instructions through register file (RAW)
Step 3: Execute (original EXE stage)– Execute instruction and notify scoreboard when done
Step 4: Write result (original WB stage)– Wait until earlier instructions read operands before writing
to register file (WAR)
Inf3 Computer Architecture - 2006-2007 4
Scoreboard Organization
Instruction status: either one of the four steps of the instruction operation (Issue, Read Op, Execute, Write)
Functional unit status:– Busy – functional unit is being used– Op – type of operation to be performed (e.g., add,
subtract, etc)
– Fi – destination register
– Fj, Fk – source registers
– Qj, Qk – functional units producing Fj and Fk
– Rj, Rk – flag to indicate when Fj and Fk are ready (set to No after operand is read)
Register result status: indicate which functional unit will write the register next (one per register)
Inf3 Computer Architecture - 2006-2007 5
Instruction sequence:
Latencies:– Integer 1 cycle– FP add 2 cycles– FP multiply 10 cycles– FP divide 40 cycles
Functional units: 1 integer (also for ld/st), 1 FP adder, 2 FP multipliers, 1 FP divider
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Scoreboard Example
Inf3 Computer Architecture - 2006-2007 6
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 1
Y Load F6 R2 Y
Integer
Scoreboard Example
Inf3 Computer Architecture - 2006-2007 7
cycle 3cycle 4cycle 2Scoreboard Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
Y Load F6 R2 N
Integer
MultY F0 F2 F4 N Y
Mult1
Integer
SubY F8 F6 F2 N N
Add
Y Div F10 F0 F6 Mult1 NInteger N
Divide
Inf3 Computer Architecture - 2006-2007 8
Scoreboard Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 5
Y Load F2 R3 Y
Integer
MultY F0 F2 F4 N Y
Mult1
SubY F8 F6 F2 Y N
Add
Y Div F10 F0 F6 Mult1 N Y
Divide
Integer
Integer
Inf3 Computer Architecture - 2006-2007 9
cycle 6cycle 7cycle 8Scoreboard Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
Y Load F2 R3 N
Integer
MultY F0 F2 F4 N Y
Mult1
SubY F8 F6 F2 Y N
Add
Y Div F10 F0 F6 Mult1 N Y
Divide
Integer
Integer
H&P A.52
Inf3 Computer Architecture - 2006-2007 10
Scoreboard Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 9
MultY F0 F2 F4 N N
Mult1
SubY F8 F6 F2 N N
Add
Y Div F10 F0 F6 Mult1 N Y
Divide
cycle 10
cycle 11
cycle 12
Inf3 Computer Architecture - 2006-2007 11
Scoreboard Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 13
MultY F0 F2 F4 N N
Mult1
AddY F6 F8 F2 Y Y
Add
Y Div F10 F0 F6 Mult1 N Y
Divide
Inf3 Computer Architecture - 2006-2007 12
cycle 14cycle 15cycle 16cycle 17cycle 18cycle 19cycle 20Scoreboard Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
MultY F0 F2 F4 N N
Mult1
AddY F6 F8 F2 N N
Add
Y Div F10 F0 F6 Mult1 N Y
Divide
H&P A.53
Inf3 Computer Architecture - 2006-2007 13
Scoreboard Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 21
AddY F6 F8 F2 N N
Add
Y Div F10 F0 F6 N N
Divide
cycle 22
Inf3 Computer Architecture - 2006-2007 14
Scoreboard Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Read Op Execute Write
Busy Op Fi FjName
IntegerMult1Mult2Add
Divide
Fk Qj Qk Rj Rk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 23
Y Div F10 F0 F6 N N
Divide
H&P A.54
cycle 62
Inf3 Computer Architecture - 2006-2007 15
Scoreboard Summary
Dynamically schedules instructions
Forces instructions to wait on RAW, WAR, WAW dependences and structural hazards
First used in the CDC 6600 in 1964 and yielded performance improvements of 1.7 to 2.5 times
Hardware cost (size) of scoreboard equivalent to one of the functional units
Many more buses required to move around operands, results, and instructions
Inf3 Computer Architecture - 2006-2007 16
Dynamic Scheduling 2: Tomasulo’s Algorithm
Handles RAW with proper stalls and eliminates WAR and WAW through register renaming
Step 1: Issue– Issue instruction to the reservation stations if there is a
free reservation station– Read operands if available or rename operands if pending
(WAR and WAW)
Step 2: Execute– Execute instruction when both operands are ready in the
reservation station (RAW)
Step 3: Write result– Send results to register file and to all reservation stations
requiring the values (RAW)
Results are communicated through common data bus (CDB) (forwarding)
Inf3 Computer Architecture - 2006-2007 17
Address unit
IBM S/360 model 91 used Tomasulo’s Algorithm
Dynamic O-O-O execution
Tags used to name flow dependencies
5 reservation stations
6 load buffers
Issue instructions to reservation stations, load buffers and store buffers
Instructions wait in reservation stations or store buffers until all their operands are collected
Functional units broadcast result and tag on the Common Data Bus (CDB) for all reservation stations, store buffers and FP registers to pick up.
FP adders FP multipliersMemory unit
Address unit
From instruction fetch unit
InstructionQueue
FP registers
Store buffersLoad buffers
Reservationstations
1
23
4
5
6
11
.
.
.
ld f1, 4(r1)
st f4, 8(r2)
mul f3, f1, f2
add f4, f5, f3
Inf3 Computer Architecture - 2006-2007 18
Data Structures for Tomasulo’s Algorithm
Reservation station (RS) fields:Op The operation to be performedQj, Qk Tags to identify the producing RS of each source operandVj, Vk Actual values of the two source operandsBusy Indicates if RS contains a valid instruction
Register file fields:Qi Tag to identify producing RS of next value for this
register
Load buffers:A The computed address from which to load
Store buffers:A The computed address to which a store will take
placeQj Tag to identify the producing RS of the store data
Inf3 Computer Architecture - 2006-2007 19
Operation of Tomasulo’s Algorithm
Instruction Issue:Get next instruction from head of the issue queue
If reservation station RS is available then:For each p in { i, j } representing operand register u
If Reg[u].Qi == 0 then RS.Vp = Reg[u].value // value ready now
If Reg[u].Qi != 0 then RS.Qp = Reg[u].Qi // value not yet ready
RS.Busy = 1 // reserve this RS
RS.Op = instruction opcode // set the operation
Execution:Wait until (RS.Qj == 0) and (RS.Qk == 0), and whilst waiting:
For each p in { i, j } If CDB.tag == RS.Qp then { RS.Vp = CDB.value; RS.Qp = 0 }
When (RS.Qj == 0) and (RS.Qk == 0), perform operation in RS.Op
Write Result:When CDB is free, broadcast CDB = { tag = RS.id, value = RS.result }
and clear RS.Busy
Inf3 Computer Architecture - 2006-2007 20
Tomasulo’s Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Execute Write
Busy Op VjName
Add_1Add_2Add_3Mult_1Mult_2
Vk Qj Qk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 1
Load_1
Y Mult R[F4]
Load_1
Mult_1
cycle 2cycle 3
Y Sub M[34+R[R2]] Load_1
Add_1
Inf3 Computer Architecture - 2006-2007 21
Tomasulo’s Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Execute Write
Busy Op VjName
Add_1Add_2Add_3Mult_1Mult_2
Vk Qj Qk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 4
Load_1
Y Mult R[F4]
Load_1
Mult_1
Y Sub M[34+R[R2]] Load_1
Add_1
Y Div Mult_1R[F6]
Mult_2
cycle 5cycle 6
Y Add M[45+R[R3]] Add_1
Add_2
Inf3 Computer Architecture - 2006-2007 22
Tomasulo’s Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Execute Write
Busy Op VjName
Add_1Add_2Add_3Mult_1Mult_2
Vk Qj Qk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 7
Y Mult R[F4]
Load_1
Mult_1
Y Sub M[34+R[R2]] Load_1
Add_1
Y Div Mult_1R[F6]
Mult_2
Y Add M[45+R[R3]] Add_1
Add_2
cycle 8
Inf3 Computer Architecture - 2006-2007 23
Tomasulo’s Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Execute Write
Busy Op VjName
Add_1Add_2Add_3Mult_1Mult_2
Vk Qj Qk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 9
Y Mult R[F4]
Load_1
Mult_1
Y Div Mult_1R[F6]
Mult_2
Y Add M[45+R[R3]] Add_1
Add_2
cycle 10
cycle 11
Inf3 Computer Architecture - 2006-2007 24
Tomasulo’s Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Execute Write
Busy Op VjName
Add_1Add_2Add_3Mult_1Mult_2
Vk Qj Qk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 12
Y Mult R[F4]
Load_1
Mult_1
Y Div Mult_1R[F6]
Mult_2
cycle 15
cycle 16
Inf3 Computer Architecture - 2006-2007 25
Tomasulo’s Example
Instruction
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F6, F2DIV.D F10, F0, F6ADD.D F6, F8, F2
Issue Execute Write
Busy Op VjName
Add_1Add_2Add_3Mult_1Mult_2
Vk Qj Qk
F0 F2 F4 F6 F8 F10 F12 … F30
cycle 55
Y Div Mult_1R[F6]
Mult_2
Inf3 Computer Architecture - 2006-2007 26
Tomasulo’s Advantages
Register renaming:
– Qj and Qk can come from any reservation station independent of the register file
in fact we could have many more reservation stations than registers
– Vj and Vk store the actual value to be used
Parallel release of all instructions dependent as soon as the earlier
instruction completes (both SUB.D and ADD.D get the value from Load_1 )
No need to wait on WAR and WAW (notice that ADD.D has issued before DIV.D
has read its f6 operand and will execute as soon as the SUB.D finishes)
Dynamic unrolling of loops in hardware with dynamic register renaming