Figure 8.1. Basic idea of instruction pipelining.
(a) Sequential execution: instructions I1, I2, I3 are processed one after another, each as a fetch step (F) followed by an execute step (E): F1 E1 F2 E2 F3 E3.
(b) Hardware organization: instruction fetch unit, interstage buffer B1, execution unit.
(c) Pipelined execution:

Clock cycle   1     2     3     4
I1            F1    E1
I2                  F2    E2
I3                        F3    E3

Figure 8.2. A 4-stage pipeline.
(a) Instruction execution divided into four steps:

Clock cycle   1     2     3     4     5     6     7
I1            F1    D1    E1    W1
I2                  F2    D2    E2    W2
I3                        F3    D3    E3    W3
I4                              F4    D4    E4    W4

(b) Hardware organization: four stages separated by interstage buffers B1, B2, B3.
F: Fetch instruction
D: Decode instruction and fetch operands
E: Execute operation
W: Write results
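The timing in Figures 8.1 and 8.2 can be reproduced with a short sketch. It assumes an ideal pipeline in which every stage takes exactly one clock cycle and nothing stalls; the function names are illustrative and do not come from the source.

```python
def pipeline_schedule(num_instructions, stages=("F", "D", "E", "W")):
    """Map each instruction number to {clock_cycle: step_label}."""
    schedule = {}
    for i in range(1, num_instructions + 1):
        # Instruction i enters the first stage in cycle i,
        # then advances by one stage per clock cycle.
        schedule[i] = {i + s: f"{name}{i}" for s, name in enumerate(stages)}
    return schedule


def total_cycles(num_instructions, num_stages):
    """Cycles needed to finish all instructions in an ideal pipeline."""
    return num_stages + num_instructions - 1


def render(schedule):
    """Format the schedule as a text timing diagram."""
    last = max(max(row) for row in schedule.values())
    lines = ["Cycle " + " ".join(f"{c:>3}" for c in range(1, last + 1))]
    for i, row in schedule.items():
        cells = " ".join(f"{row.get(c, ''):>3}" for c in range(1, last + 1))
        lines.append(f"I{i:<5}" + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(pipeline_schedule(4)))
```

Under these assumptions, four instructions in a four-stage pipeline finish in 4 + 4 - 1 = 7 cycles, which matches the seven clock cycles of Figure 8.2. Three instructions in the two-stage arrangement of Figure 8.1 finish in 4 cycles instead of the 6 steps of sequential execution.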

Figure 8.4. Pipeline stall caused by a cache miss in F2.
(a) Instruction execution steps in successive clock cycles (instructions I1 to I3, clock cycles 1 to 9).
(b) Function performed by each processor stage in successive clock cycles:

Clock cycle   1     2     3     4     5     6     7     8     9
F: Fetch      F1    F2    F2    F2    F2    F3
D: Decode           D1    idle  idle  idle  D2    D3
E: Execute                E1    idle  idle  idle  E2    E3
W: Write                        W1    idle  idle  idle  W2    W3
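The stall in Figure 8.4 can be sketched as a small timing model. It assumes in-order execution, one instruction per stage, and a cache miss that adds three extra cycles to F2, so that F2 occupies four cycles as in the figure. The function name and the `extra_latency` parameter are illustrative, and the model ignores back-pressure from later stages.

```python
def completion_cycles(num_instructions, num_stages=4, extra_latency=None):
    """Return finish[i][s]: the clock cycle in which instruction i
    completes stage s (both 1-based; stage 1 is Fetch).

    extra_latency maps (instruction, stage) to additional cycles,
    for example {(2, 1): 3} for a cache miss during F2.
    """
    extra_latency = extra_latency or {}
    # Row 0 and column 0 are padding so the recurrence needs no special cases.
    finish = [[0] * (num_stages + 1) for _ in range(num_instructions + 1)]
    for i in range(1, num_instructions + 1):
        for s in range(1, num_stages + 1):
            latency = 1 + extra_latency.get((i, s), 0)
            # A stage can start once this instruction has left the previous
            # stage and the preceding instruction has finished this stage.
            ready = max(finish[i][s - 1], finish[i - 1][s])
            finish[i][s] = ready + latency
    return finish


if __name__ == "__main__":
    stalled = completion_cycles(3, extra_latency={(2, 1): 3})
    print("I3 completes Write in cycle", stalled[3][4])
```

With the miss in F2 the third instruction completes its Write step in cycle 9, compared with cycle 6 when no stall occurs, so the three-cycle delay propagates to every later instruction.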

Figure 8.7. Operand forwarding in a pipelined processor.
(a) Datapath. Labels: Register file, Source 1, Source 2, SRC1, SRC2, ALU, RSLT, Destination.
(b) Position of the source and result registers in the processor pipeline: SRC1, SRC2 at the input of E: Execute (ALU); RSLT between E: Execute (ALU) and W: Write (Register file); forwarding path.

Figure 8.9. Branch timing.
(a) Branch address computed in Execute stage:

Clock cycle   1     2     3     4     5     6     7     8
I1            F1    D1    E1    W1
I2 (Branch)         F2    D2    E2
I3                        F3    D3    X
I4                              F4    X
Ik                                    Fk    Dk    Ek    Wk
Ik+1                                        Fk+1  Dk+1  Ek+1

(b) Branch address computed in Decode stage:

Clock cycle   1     2     3     4     5     6     7
I1            F1    D1    E1    W1
I2 (Branch)         F2    D2
I3                        F3    X
Ik                              Fk    Dk    Ek    Wk
Ik+1                                  Fk+1  Dk+1  Ek+1

X marks an instruction that is discarded.
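The two cases in Figure 8.9 differ only in how many fetched instructions are discarded. A minimal sketch of that arithmetic follows, assuming the four stages F, D, E, W and one discarded instruction for every stage the branch passes through before its target is known. The names and the `branch_fraction` parameter are illustrative, not taken from the source.

```python
STAGES = ("F", "D", "E", "W")


def branch_penalty(resolve_stage, stages=STAGES):
    """Cycles lost when a taken branch is resolved in resolve_stage.

    One instruction is fetched, and later discarded, for each stage
    the branch occupies after Fetch up to and including resolve_stage.
    """
    return stages.index(resolve_stage)


def average_cycles_per_instruction(branch_fraction, resolve_stage):
    """Average cycles per instruction, where branch_fraction is the
    assumed share of instructions that are taken branches."""
    return 1 + branch_fraction * branch_penalty(resolve_stage)


if __name__ == "__main__":
    for stage in ("E", "D"):
        print(stage, branch_penalty(stage),
              average_cycles_per_instruction(0.2, stage))
```

Resolving the branch in the Execute stage costs two cycles (I3 and I4 are discarded), while resolving it in the Decode stage costs one cycle (only I3 is discarded). The 0.2 branch fraction in the usage lines is a made-up value for illustration only.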

Figure 8.13. Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 8.12.
Timing over clock cycles 1 to 8. Each instruction has a fetch step (F) and an execute step (E). Instructions, in order:
Decrement
Branch
Shift (delay slot)
Decrement (Branch taken)
Branch
Shift (delay slot)
Add (Branch not taken)

Figure 8.14. Timing when a branch decision has been incorrectly predicted as not taken.

Clock cycle   1     2     3     4     5     6
I1 (Compare)  F1    D1    E1    W1
I2 (Branch>0)       F2    D2/P2 E2
I3                        F3    D3    X
I4                              F4    X
Ik                                    Fk    Dk

Figure 8.15. State-machine representation of branch prediction algorithms.
Transition labels: BT = Branch taken, BNT = Branch not taken.
(a) A 2-state algorithm: states LNT and LT.
(b) A 4-state algorithm: states SNT, LNT, LT, and ST.
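The state machines of Figure 8.15 can be sketched in code. The 2-state version follows directly from its two states: remember the last outcome. For the 4-state version the exact transitions are not recoverable from the labels above, so this sketch assumes the common saturating-counter arrangement (each BT moves one state toward ST, each BNT one state toward SNT), which may differ from the transitions drawn in the figure. All class and function names are illustrative.

```python
class TwoStatePredictor:
    """States LNT and LT: predict whatever the branch did last time."""

    def __init__(self, state="LNT"):
        self.state = state

    def predict(self):
        return self.state == "LT"  # True means "predict taken"

    def update(self, taken):
        self.state = "LT" if taken else "LNT"


class FourStatePredictor:
    """States SNT, LNT, LT, ST as a saturating counter (an assumption)."""

    ORDER = ("SNT", "LNT", "LT", "ST")

    def __init__(self, state="LNT"):
        self.level = self.ORDER.index(state)

    @property
    def state(self):
        return self.ORDER[self.level]

    def predict(self):
        return self.level >= 2  # LT or ST predicts taken

    def update(self, taken):
        if taken:
            self.level = min(self.level + 1, 3)
        else:
            self.level = max(self.level - 1, 0)


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a list of booleans (True = branch taken)."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop branch taken three times, then not taken, executed twice.
    outcomes = ([True] * 3 + [False]) * 2
    print(count_mispredictions(TwoStatePredictor(), outcomes))
    print(count_mispredictions(FourStatePredictor(), outcomes))
```

On this loop-like pattern the 2-state predictor mispredicts twice per pass through the loop (on entry and on exit), while the 4-state predictor under the assumed transitions mispredicts only on exit once it has warmed up.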

Figure 8.16. Equivalent operations using complex and simple addressing modes.
Timing over clock cycles 1 to 7. Steps performed by each instruction:
(a) Complex addressing mode:
Load: F, D, X + [R1], [X + [R1]], [[X + [R1]]], W
Next instruction: F, D, E, W (operand forwarded from the Load)
(b) Simple addressing mode:
Add: F, D, X + [R1], W
Load: F, D, [X + [R1]], W
Load: F, D, [[X + [R1]]], W
Next instruction: F, D, E, W
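The equivalence shown in Figure 8.16 can be checked with a small sketch that models memory as a dictionary. Register R1 and the offset X come from the figure; the destination register R2, the memory contents, and the function names are hypothetical values chosen for illustration.

```python
def load_complex(memory, registers, x):
    """One Load with the complex mode: fetch [[X + [R1]]]."""
    address = x + registers["R1"]      # X + [R1]
    pointer = memory[address]          # [X + [R1]]
    return memory[pointer]             # [[X + [R1]]]


def load_simple(memory, registers, x):
    """The same result using Add, Load, Load with simple modes."""
    regs = dict(registers)             # work on a copy of the register file
    regs["R2"] = x + regs["R1"]        # Add:  R2 <- X + [R1]
    regs["R2"] = memory[regs["R2"]]    # Load: R2 <- [[R2]]
    regs["R2"] = memory[regs["R2"]]    # Load: R2 <- [[R2]]
    return regs["R2"]


if __name__ == "__main__":
    memory = {108: 200, 200: 42}       # hypothetical memory contents
    registers = {"R1": 8}
    print(load_complex(memory, registers, 100))
    print(load_simple(memory, registers, 100))
```

Both sequences perform the same three operations (one address addition and two memory reads), which is why the figure shows them taking a comparable number of clock cycles.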

Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.
Components shown: PC, Incrementer, IMAR, Instruction cache (memory address for instruction fetches), Instruction queue, Instruction decoder, Control signal pipeline, Register file, Bus A, Bus B, Bus C, buffers A, B, and R, ALU, DMAR, Data cache (memory address for data access), MDR/Read, MDR/Write.

Figure 8.21. Instruction completion in program order.
Timing over clock cycles 1 to 7. Steps performed by each instruction:
(a) Delayed write:
I1 (Fadd): F1, D1, E1A, E1B, E1C, W1
I2 (Add): F2, D2, E2, W2
I3 (Fsub): F3, D3, E3A, E3B, E3C, W3
I4 (Sub): F4, D4, E4, W4
(b) Using temporary registers:
I1 (Fadd): F1, D1, E1A, E1B, E1C, W1
I2 (Add): F2, D2, E2, TW2, W2
I3 (Fsub): F3, D3, E3A, E3B, E3C, W3
I4 (Sub): F4, D4, E4, TW4, W4

Figure 8.23. Main building blocks of the UltraSPARC II processor.
Blocks shown: Prefetch and dispatch unit, I-Cache, Instruction buffer, Integer execution unit, Integer registers, Floating-point unit, Floating-point registers, Load queue, Store queue, D-Cache, Memory management unit (iTLB, dTLB), External cache unit, E-Cache, System interconnection bus.
Paths shown: Instructions, Data.

Figure 8.25. Example of instruction grouping.
(a) Program fragment:

        ADDcc   R3, R4, R7     R7 ← [R3] + [R4], set condition codes
        BRZ,a   Label          Branch if zero, set Annul bit to 1
        FCMP    F1, F5         FP: Compare [F2] and [F5]
        FADD    F2, F3, F6     FP: F6 ← [F2] + [F3]
        FMOVs   F3, F4         Move single precision operand from F3 to F4
        ...
Label   FSUB    F2, F3, F6     FP: F6 ← [F2] - [F3]
        LDSW    R3, R4, R7     Load single word at location [R3] + [R4] into R7
        ...

(b) Instruction grouping, branch taken:

        ADDcc   R3, R4, R7
        BRZ,a   Label
        FCMP    F1, F5
        FSUB    F2, F3, F6

(c) Instruction grouping, branch not taken:

        ADDcc   R3, R4, R7
        BRZ,a   Label
        FCMP    F1, F5
        FADD    F2, F3, F6

Figure 8.30. Execution flow.
Blocks shown: Main memory, External cache, Instruction cache, Instruction buffer, Elastic interface, Internal registers and execution units, Load/store queue, Data cache.
Paths shown: Instructions, Data.

Table 8.1 Examples of SPARC instructions.

Instruction              Description

ADD     R5, R6, R7       Integer add: R7 ← [R5] + [R6]
ADDcc   R2, R3, R5       R5 ← [R2] + [R3], set condition code flags
SUB     R5, Imm, R7      Integer subtract: R7 ← [R5] - Imm (sign-extended)
AND     R3, Imm, R5      Bitwise AND: R5 ← [R3] AND Imm (sign-extended)
XOR     R3, R4, R5       Bitwise Exclusive OR: R5 ← [R3] XOR [R4]
FADDq   F4, F12, F16     Floating-point add, quad precision: F16 ← [F4] + [F12]
FSUBs   F2, F5, F7       Floating-point subtract, single precision: F7 ← [F2] - [F5]
FDIVs   F5, F10, F18     Floating-point divide, single precision: F18 ← [F5] / [F10]
LDSW    R3, R5, R7       R7 ← 32-bit word at [R3] + [R5], sign-extended to a 64-bit value
LDX     R3, R5, R7       R7 ← 64-bit extended word at [R3] + [R5]
LDUB    R4, Imm, R5      Load unsigned byte from memory location [R4] + Imm; the byte is loaded into the least significant 8 bits of register R5, and all higher-order bits are filled with 0s
STW     R3, R6, R12      Store word from register R3 into memory location [R6] + [R12]
LDF     R5, R6, F3       Load a 32-bit word at address [R5] + [R6] into floating-point register F3
LDDF    R5, R6, F8       Load doubleword (two 32-bit words) at address [R5] + [R6] into floating-point registers F8 and F9
STF     F14, R6, Imm     Store word from floating-point register F14 into memory location [R6] + Imm
BLE     icc, Label       Test the icc flags and branch to Label if less than or equal to zero
BZ,pn   xcc, Label       Test the xcc flags and branch to Label if equal to zero; branch is predicted not taken
BGT,a,pt icc, Label      Test the 32-bit integer condition codes and branch to Label if greater than zero; set annul bit; branch is predicted taken
FBNE,pn Label            Test floating-point status flags and branch if not equal; the annul bit is set to zero and the branch is predicted not taken
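The difference between the sign-extending load (LDSW) and the zero-filling load (LDUB) in Table 8.1 can be illustrated with plain integer arithmetic. This is a sketch of the extension rules only, not of the SPARC instructions themselves; the function names are illustrative.

```python
def sign_extend(value, from_bits, to_bits=64):
    """Extend a from_bits-wide value to to_bits by copying its sign bit,
    as described for LDSW (32-bit word to 64-bit value)."""
    value &= (1 << from_bits) - 1              # keep only the low bits
    if value & (1 << (from_bits - 1)):         # sign bit set: negative value
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)        # two's-complement bit pattern


def zero_extend(value, from_bits):
    """Keep the low from_bits and fill all higher-order bits with 0s,
    as described for LDUB (unsigned byte)."""
    return value & ((1 << from_bits) - 1)


if __name__ == "__main__":
    print(hex(sign_extend(0xFFFFFFFE, 32)))    # the 32-bit pattern for -2
    print(hex(zero_extend(0xFE, 8)))
```

A 32-bit word holding -2 becomes a 64-bit register value with all upper bits set, whereas an unsigned byte keeps its value in the low 8 bits with zeros above it.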
