Upload
nigel-morgan
View
216
Download
2
Embed Size (px)
Citation preview
F1
E1
F2
E2
F3
E3
F1 E1
F2 E2
F3 E3
I1 I2 I3
I1
I2
I3
Instruction
(a) Sequential execution
(c) Pipelined execution
Figure 8.1. Basic idea of instruction pipelining.
Clock cycle 1 2 3 4
Instructionfetchunit
Executionunit
Interstage bufferB1
(b) Hardware organization
Time
Time
F4I4
F1
F2
F3
I1
I2
I3
D1
D2
D3
D4
E1
E2
E3
E4
W1
W2
W3
W4
Instruction
Figure 8.2. A 4-stage pipeline.
Clock cycle 1 2 3 4 5 6 7
(a) Instruction execution divided into four steps
F : Fetchinstruction
D : Decodeinstructionand fetchoperands
E: Executeoperation
W : Writeresults
Interstage buffers
(b) Hardware organization
B1 B2 B3
Time
F1
F2
F3
I1
I2
I3
D1
D2
D3
E1
E2
E3
W1
W2
W3
Instruction
Figure 8.4. Pipeline stall caused by a cache miss in F2.
1 2 3 4 5 6 7 8 9Clock cycle
(a) Instruction execution steps in successive clock cycles
1 2 3 4 5 6 7 8Clock cycle
Stage
F: Fetch
D: Decode
E: Execute
W: Write
F1 F2 F3
D1 D2 D3idle idle idle
E1 E2 E3idle idle idle
W1 W2idle idle idle
(b) Function performed by each processor stage in successive clock cycles
9
W3
F2 F2 F2
Time
Time
Registerfile
SRC1 SRC2
RSLT
Destination
Source 1
Source 2
(a) Datapath
ALU
E: Execute(ALU)
W: Write(Register file)
SRC1,SRC2 RSLT
(b) Position of the source and result registers in the processor pipeline
Figure 8.7. Operand forw arding in a pipelined processor.
Forwarding path
E:Execute (ALU)
(b) Position of the source and result registers in the processor pipeline
X
Figure 8.9. Branch timing.
F1 D1 E1 W1
I2 (Branch)
I1
1 2 3 4 5 6 7Clock cycle
F2 D2
F3 X
Fk Dk Ek
Fk+1 Dk+1
I3
Ik
Ik+1
Wk
Ek+1
(b) Branch address computed in Decode stage
F1 D1 E1 W1
I2 (Branch)
I1
1 2 3 4 5 6 7Clock cycle
F2 D2
F3
Fk Dk Ek
Fk+1 Dk+1
I3
Ik
Ik+1
Wk
Ek+1
(a) Branch address computed in Execute stage
E2
D3
F4 XI4
8Time
Time
F E
F E
F E
F E
F E
F E
F E
Instruction
Decrement
Branch
Shift (delay slot)
Figure 8.13. Execution timing showing the delay slot being filledduring the last two passes through the loop in Figure 8.12.
Decrement (Branch taken)
Branch
Shift (delay slot)
Add (Branch not taken)
1 2 3 4 5 6 7 8Clock cycleTime
F1
F2
I1 (Compare)
I2 (Branch>0)
I3
D1 E1 W1
F3
F4
Fk Dk
D3 X
XI4
Ik
Instruction
Figure 8.14. Timing when a branch decision has been incorrectly predictedas not taken.
E2
Clock cycle 1 2 3 4 5 6
D2 /P2
Time
Figure 8.15. State-machine representation of branch prediction algorithms.
BTBNT
BNT
BT
BNT
Branch taken (BT)
Branch not taken (BNT)
(a) A 2-state algorithm
(b) A 4-state algorithm
BT
BNT
BTBNT LNT LT
LNT
LT ST
SNT
BT
X + [R1]
F
F D
D E
F D
F
F
F D
D
D
E
X + [R1] [X +[R1]] [[X +[R1]]]
[X +[R1]]
[[X +[R1]]]
Load
Next instruction
Add
Load
Load
Next instruction
(a) Complex addressing mode
(b) Simple addressing mode
Figure 8.16. Equivalent operations using complex and simple addressing modes.
W
W
1 2 3 4 5 6 7Clock cycleTime
W
Forward
W
W
W
Instruction cache
Figure 8.18. Datapath modified for pipelined execution, with
Bu
s A
Bu
s B
Control signal pipeline
IMAR
PC
Registerfile
ALU
Instruction
A
B
R
decoder
Incrementer
MDR/Write
Instructionqueue
Bu
s C
Data cache
Memory address
MDR/ReadDMAR
Memory address
(Instruction fetches)
(Data access)
interstage buffers at the input and output of the ALU.Figure 8.18. Datapath modified for pipelined execution, with
Interstage buffers at the input and output of the ALU.
I1 (Fadd) D1
D2
D3
D4
E1A E1B E1C
E2
E3A E3B E3C
E4
W1
W2
W3
W4
I2 (Add)
I3 (Fsub)
I4 (Sub)
Figure 8.21. Instruction completion in program order.
1 2 3 4 5 6Clock cycleTime
(a) Delayed write
I1 (Fadd) D1
D2
D3
D4
E1A E1B E1C
E2
E3A E3B E3C
E4
W1
W2
W3
W4
I2 (Add)
I3 (Fsub)
I4 (Sub)
1 2 3 4 5 6Clock cycleTime
(b) Using temporary registers
TW2
TW4
F1
F2
F3
F4
7
7
F1
F2
F3
F4
Figure 8.23. Main building blocks of the UltraSPARC II procesor.
External
cache unit
E-Cache
Prefetch and
dispatch unit
I-Cache
Instruction buffer
Loadqueue
Storequeue
D-Cache
Memory
management
unit
iTLB dTLB
Floating-
point
unit
Integer
execution
unit
Integer
registers
DataInstructions
System interconnection bus
Floating-
point
registers
Figure 8.23. Main building blocks of the UltraSPARC II processor.
ADDcc R3,R4, R7 R7 [R3]+[R4],Setconditioncodes
BRZ,a Label Branch if zero,setAnnul bit to 1FCMP F1, F5 FP: Compare[F2]and[F5]FADD F2,F3, F6 FP: F6 [F2]+[F3]FMOVs F3, F4 MovesingleprecisionoperandfromF3 to F4...
Label FSUB F2,F3, F6 FP: F6 [F2] [F3]LDSW R3,R4, R7 Loadsinglewordat location[R3]+[R4]into R7...
(a) Program fragment
ADDcc R3,R4, R7BRZ,a LabelFCMP F1, F5FSUB F2,F3, F6
(b) Instruction grouping, branch taken
ADDcc R3,R4, R7BRZ,a LabelFCMP F1, F5FADD F2, F3, F6
(c) Instruction grouping, branch not taken
Figure 8.25. Example of instruction grouping.
Figure 8.30. Execution flow.
Internal
registers and
execution units
Data
cache
External
cache
Main
memory
Instruction
cacheLoad/store
Data
Instructions
Elastic interf ace
queue
Instructionbuffer
Table 8.1 Examples of SPARC instructions.
Instruction Description
ADD R5, R6, R7 Integeradd: R7 [R5]+ [R6]
ADDcc R2, R3, R5 R5 [R2] + [R3],setcondition codeflags
SUB R5, Imm, R7 Integersubtract:R7 [R5] Imm (sign-extended)
AND R3, Imm, R5 Bitwise AND:R5 [R3] ANDImm (sign-extended)
XOR R3, R4, R5 Bitwise Exclusive OR: R5 [R3] XOR [R4]
FADDq F4, F12, F16 Floating-pointadd,quadprecision:F12 [F4]+ [F12]
FSUBs F2, F5, F7 Floating-pointsubtract,singleprecision:F7 [F2] [F5]
FDIVs F5, F10, F18 Floating-pointdivide,singleprecision,F18 [F5]/[F10]
LDSW R3, R5, R7 R7 32-bit wordat [R3]+[R5]signextendedto a64-bitvalue
LDX R3, R5, R7 R7 64-bitextendedwordat [R3] +[R5]
LDUB R4, Imm, R5 Loadunsignedbytefrommemorylocation[R4]+Imm,thebyte isloadedintotheleastsignificant 8bits ofregisterR5,and allhigher-orderbits arefilled with 0s
STW R3, R6, R12 Store wordfromregister R3 intomemory location[R6] +[R12]
LDF R5, R6, F3 Load a32-bit word ataddress [R5] + [R6] intofloatingpointregisterF3
LDDF R5, R6, F8 Loaddoubleword (two32-bit words)ataddress[R5]+ [R6]intofloating pointregistersF8 and F9
STF F14, R6, Imm Store wordfromfloating-registerF14 intomemorylocation[R6] +Imm
BLE icc, Label Testthe iccflagsandbranch to Label if lessthan orequalto zero
BZ,pn xcc, Label Testthe xccflagsandbranch to Label ifequal tozero,branch ispredictednot taken
BGT,a,pt icc, Label Testthe32-bit integercondition codesandbranchtoLabelif greaterthan zero,setannulbit,branch ispredictedtaken
FBNE,pn Label Testfloating-pointstatusflagsandbranch if not equal,Theannul bit is setto zeroand thebranch ispredictednottaken