
Computer Architectures M


Parallel architectures

Parallelism

Architecture

• Architecture: the functional behaviour of a computer. For instance a processor which executes DLX code

The architecture is defined by the machine language, that is the instruction set (assembly language). Instruction Set Architecture -> ISA

• Implementation: a logical network implementing the architecture. It is also called microarchitecture. There are many implementations of the same architecture. Example: the x86 family

• Synthesis: a physical realisation. There are many possible syntheses of the same implementation (for instance in different technologies)

The ISA varies slowly while the implementations change rapidly (see for instance IA8, IA16, IA32…). The longer an ISA remains in use, the more programs are written for it, and therefore compatibility becomes the main issue.

Parallelism

Instruction level parallelism

• Sequential: a single instruction executed at a time

• Pipelined: multiple instructions executed simultaneously

• Superpipelined: multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV)

• Scalar: a single pipeline

• Superscalar: multiple pipelines; many instructions started at the same time. Possible out-of-order execution (run-time decision)

• Very Long Instruction Word (VLIW): multiple pipelines; many instructions started at the same time. Instruction order decided at compile time

• Superscalar superpipelined (e.g. Pentium IV, i5, i7 etc.)

Parallel architectures

• Memory level parallelism: a memory able to provide multiple data at different addresses at the same time (outstanding requests: DDR2, DDR3 etc.)

• Multicore (core level parallelism): many processors on the same chip (e.g. Core Duo, Nehalem, Sandy Bridge…)

• Multithread (thread level parallelism): pipelines of the same processor used by different processes at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.)

Deep Pipeline (Superpipeline)

[Figure: the stages Fetch, Decode, Execute, Memory, Writeback shown for a normal and for a deep pipeline, with the branch penalty marked on each]

• Each stage is subdivided into three substages: higher clock frequency but higher branch penalty

• Higher power consumption!

Parallel pipelines

Sequential

Time parallelism: pipeline

Space parallelism: VLIW

Space-time parallelism (e.g. i5, i7…)

Diversified pipelines - 1

[Figure: common IF, ID and RD stages followed by four dedicated EX pipelines, ALU, MEM1-MEM2, FP1-FP2-FP3 (F => floating point) and BR, all ending in WB]

A multi-instruction buffer avoids blocking the pipelines.

Dedicated pipelines. The instruction sequence is defined at compile time. Careful compilation is fundamental in order to avoid underexploiting the pipelines.

Problems: different execution times; instruction interdependency

Diversified pipelines - 2

[Figure: IF, ID and RD stages ("in order" execution) feed a Dispatch Buffer; the dedicated EX pipelines (ALU, MEM1-MEM2, FP1-FP2-FP3, BR) execute "out of order"; a Reorder Buffer precedes WB ("in order" retirement)]

Floating Point DLX – F instructions

[Figure: IF and ID feed four execution units (Integer, FP multiplier, FP adder, FP/Int. divider), followed by MEM and WB]

Multicycle stages

Integer: EX
FP Multiply: M1 M2 M3 M4 M5 M6 M7 (pipelined)
FP Add: A1 A2 A3 A4 (pipelined)
FP/INT Divide: e.g. 24 clock cycles, one instruction executed at a time

DLX revisited

• Example (no interdependency between instructions in this sequence)

FMUL F1, F2, F2
FADD F3, F4, F5
FLD F6, 10(R8)
FST 40(R10), F9

FMUL IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
FADD IF ID A1 A2 A3 A4 MEM WB
FLD IF ID EX MEM WB
FST IF ID EX MEM (WB)

[Figure legend: the stages where the operands are needed are drawn in violet, the stages where new results are produced in green, execution stages as red squares. Annotations: "Data written"; "Data required for computing the address"; "nop"; "Same destination register: write sequence error"]

• Because of the different execution times of the instructions, Read After Write hazards (RAW, as in DLX) are more frequent

• Very important structural changes (more intermediate registers, a more complex ID stage to send each instruction to the appropriate execution stage)

• Hazard problems: the instructions do not end in the order in which they were issued

• Since the divider is normally a single functional unit, stalls of up to 40 clocks may occur in this case

• Multiple instructions can be in the same stage at the same time (in particular in WB)

• Write After Write hazards (WAW): e.g. if a FADD F6, F4, F5 (four EX cycles) directly preceded a FLD F6, 10(R8) (one EX cycle), although in this case the FADD would have been dropped by the compiler since it is useless

• Instructions are not completed in order
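The out-of-order completion of the four-instruction example can be checked with a small Python sketch. It assumes (this is not stated on the slide) that one instruction is fetched per clock and that no stalls occur; the function names are illustrative only.

```python
# Stage sequences copied from the example; one fetch per clock is an assumption.
PIPELINES = {
    "FMUL": ["IF", "ID", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "MEM", "WB"],
    "FADD": ["IF", "ID", "A1", "A2", "A3", "A4", "MEM", "WB"],
    "FLD": ["IF", "ID", "EX", "MEM", "WB"],
    "FST": ["IF", "ID", "EX", "MEM", "WB"],
}


def writeback_clock(opcode, fetch_clock):
    """Clock in which an instruction fetched at fetch_clock reaches its last stage."""
    return fetch_clock + len(PIPELINES[opcode]) - 1


def completion_order(program):
    """Return (opcode, last-stage clock) pairs sorted by completion time."""
    timed = [(op, writeback_clock(op, clock))
             for clock, op in enumerate(program, start=1)]
    return sorted(timed, key=lambda pair: pair[1])


order = completion_order(["FMUL", "FADD", "FLD", "FST"])
print(order)
```

Under these assumptions the instruction issued first (FMUL) is the last one to complete, which is the situation the bullets above describe.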

DLX revisited

• For WAW hazards consider the following example

Clock              1   2   3   4   5   6   7   8   9   10  11
FMUL F0, F4, F6    IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
…                      IF  ID  EX  MEM WB
…                          IF  ID  EX  MEM WB
FADD F2, F4, F6                IF  ID  A1  A2  A3  A4  MEM WB
…                                  IF  ID  EX  MEM WB
…                                      IF  ID  EX  MEM WB
FLD F2, 0(R2)                              IF  ID  EX  MEM WB

If FADD had been started one clock later, a Write After Write hazard would have taken place!

Multiple RF write operations

• To cope with simultaneous writes to different registers, either the number of input ports of the RF can be increased (expensive) or stalls must be introduced (normally in the MEM or WB stages, so as to choose the instructions to be stalled). More complex pipelines

• RAW hazards are solved through forwarding

Normally the hazards are detected in the ID stage, considering the preceding and following instructions, so as to introduce the required stalls (in this case FLD would have been stalled one clock)

Hazards normally occur among homogeneous registers (FP or integer), except for FLOAD and FSTOR, which use integer registers for address computation
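The two problems of this example (several register-file writes in the same clock, and the WAW that would arise with a later FADD) can be illustrated with a short Python sketch. It assumes one fetch per clock, so that FMUL, FADD and FLD are fetched in clocks 1, 4 and 7 (their positions in the sequence), and that in the "late" variant only the fetch clock of FADD changes. All names are hypothetical.

```python
from collections import defaultdict

# Clocks between IF and WB, from the stage lists (FMUL 11 stages, FADD 8, FLD 5).
LATENCY = {"FMUL": 10, "FADD": 7, "FLD": 4}


def wb_schedule(program):
    """program: list of (opcode, destination, fetch clock) -> {WB clock: [destinations]}."""
    schedule = defaultdict(list)
    for opcode, dest, fetch in program:
        schedule[fetch + LATENCY[opcode]].append(dest)
    return dict(schedule)


def waw_hazards(program):
    """Pairs in program order where the earlier instruction writes the register last."""
    hazards = []
    for i, (op_a, dest_a, fetch_a) in enumerate(program):
        for op_b, dest_b, fetch_b in program[i + 1:]:
            if dest_a == dest_b and fetch_a + LATENCY[op_a] > fetch_b + LATENCY[op_b]:
                hazards.append((op_a, op_b, dest_a))
    return hazards


slide = [("FMUL", "F0", 1), ("FADD", "F2", 4), ("FLD", "F2", 7)]
late = [("FMUL", "F0", 1), ("FADD", "F2", 5), ("FLD", "F2", 7)]  # FADD one clock later

print(wb_schedule(slide))   # all three instructions reach WB in the same clock
print(waw_hazards(late))    # FADD would overwrite the newer value loaded by FLD
```

In the first case the register file receives three write requests in one clock; in the second the older FADD result would be written after the FLD result, which is why a stall (or extra write ports plus ordering logic) is needed.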

DLX revisited

• How can we guarantee that the final result is the one intended by the program?

• In the previous case FLD F2, 0(R2) must be stalled until FADD F2, F4, F6 has reached the MEM stage. It must however be assumed that between the two instructions there is at least one instruction using, through forwarding, the result of FADD F2, F4, F6; otherwise the compiler would have dropped the FADD!

• The situation would have been even worse if FLD had completed before FADD.

• In any case it is always possible that instructions are completed in an order different from that of their issue

Compiler

Let's consider these high-level language statements

X = Y + Z
A = B * C

to be executed in a processor with the following pipeline

Fetch (F), Decode (D), Issue (I), Execute (E, three stages), Writeback (W)

In-order emission. The issue of the addition (multiply) is possible only AFTER the execution of the previous instruction, the one calculating R2 (R5), that is after the last EX stage of R2 <= Z (R5 <= C), possibly with forwarding

[Figure: timing diagram of the compiled sequence. Annotations: "Busy decoder"; "The issue is possible here since the data for R1 and R2 have already been produced"; "Multiply: waits for results (RAW)"; "Stalls: decoder occupied, data not available"; "D freed by the previous addition instruction"; "Busy decoder, RAW"; "Addition result not yet ready"; "At the end of this stage the addition result is available"]

Compiler

But we can modify the emission order without modifying the result

16 cycles instead of 22!

[Figure: timing diagrams before and after the reordering, on the same pipeline (Fetch F, Decode D, Issue I, Execute E x 3, Writeback W). Annotations: "Waiting for R5"; "Waiting for R6"; "Busy decoder"; "Emission possible since R1 and R2 are already available"]

Multicycle hazards

Let's suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).

I1 F1 = F2 + F3
I2 F2 = F4 x F5
I3 F3 = F3 + F4
I4 F6 = F6 x F6
I5 F1 = F3 + F5
I6 F2 = F3 + F4

[Figure: execution of I1-I6 over the clock cycles T1-T10]

Dependency graph:

I1 -> I2: WAR (F2)
I1 -> I3: WAR (F3)
I1 -> I5: WAW (F1)
I2 -> I6: WAW (F2)
I3 -> I5: RAW (F3)
I3 -> I6: RAW (F3)

NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require
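The potential hazards of the sequence I1-I6 can be derived mechanically from the register names alone. The following Python sketch is one possible way to do it (the function name and data layout are assumptions): for every register it remembers the last writer and the readers since that write, and reports RAW, WAR and WAW pairs accordingly.

```python
def classify_hazards(program):
    """program: list of (name, dest, sources) -> list of (kind, earlier, later, register)."""
    last_writer = {}   # register -> most recent instruction writing it
    readers = {}       # register -> instructions that read it since its last write
    found = []
    for name, dest, sources in program:
        unique_sources = list(dict.fromkeys(sources))  # drop duplicates, keep order
        for reg in unique_sources:
            if reg in last_writer:
                found.append(("RAW", last_writer[reg], name, reg))
        for reader in readers.get(dest, []):
            found.append(("WAR", reader, name, dest))
        if dest in last_writer:
            found.append(("WAW", last_writer[dest], name, dest))
        for reg in unique_sources:
            readers.setdefault(reg, []).append(name)
        last_writer[dest] = name
        readers[dest] = []   # a new value starts a new set of readers
    return found


PROGRAM = [
    ("I1", "F1", ["F2", "F3"]),
    ("I2", "F2", ["F4", "F5"]),
    ("I3", "F3", ["F3", "F4"]),
    ("I4", "F6", ["F6", "F6"]),
    ("I5", "F1", ["F3", "F5"]),
    ("I6", "F2", ["F3", "F4"]),
]

hazards = classify_hazards(PROGRAM)
for hazard in hazards:
    print(hazard)
```

As in the graph, the result depends only on register names: whether a potential hazard becomes a real one depends on the execution times.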

Dynamic instruction scheduling

• Systems with out-of-order execution but commitment always in order

• Temporal dependencies (hazards) are not known at compile time

• It allows the code to be executed on different pipelines and on superscalar processors with no implications for the compiler

• It allows instructions to be executed ahead of their position (in the following case FSUB F12, F8, F14) if the conditions allow it

FDIV F0, F2, F4
FADD F10, F0, F8 (RAW: must wait for F0)
FSUB F12, F8, F14 (can be executed anyway)

Scoreboard

• Consider the following sequence

FDIV F0, F2, F4
FADD F10, F0, F8
FSUB F8, F8, F14

[Figure annotations: Read After Write (RAW) on F0 between FDIV and FADD; Write After Read (WAR) on F8 between FADD and FSUB, which must read the same value]

• There is an antidependency (WAR hazard) between FADD and FSUB: should FSUB end before FADD has read F8, an error would occur (F8 already updated)

• A Write After Write (WAW) hazard would be possible if FSUB used F10 instead of F8 as destination (in case FSUB ended before FADD; but FADD would probably have been dropped by the compiler)

• “Scoreboard” technique: the goal is to complete one instruction per clock, executing each instruction as soon as possible

Scoreboard

[Figure: the registers connected to the functional units (two FP MUL, FP DIV, FP ADD, INTEG), all supervised by the scoreboard]

The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution. The scoreboard considers all system state changes and decides when the first instruction in the FIFO queue (as produced by the compiler) can be started.

Scoreboard

• Obviously some stalls can be induced because the number of buses available for transfers is small

• The four stages equivalent to ID, EX and WB in DLX are:

1. Emission: if a functional unit for the instruction is free, the instruction is issued unless another functional unit already holds an instruction which must write into the same destination register. Therefore no WAW hazards. In this latter case the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue even when all other conditions for them are met!

2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction must still write them, the operands are read; otherwise the instruction stalls in the functional unit

3. Execution: when the result has been computed and stored, the scoreboard is informed so as to unblock a possibly waiting instruction

4. Writeback: in case of a possible WAR the instruction is stalled and does not write the result if there is a previous instruction which has not yet read its operands and one (or both) of them is the destination register of the considered instruction. Once the operands have been read, the result can be written

• Note that with this organisation forwarding is avoided, since the results are written as soon as they are produced (except for the WAR wait of point 4)

The scoreboard technique allows instructions to be transferred directly from the EX to the WB stage (reducing the RAW risks).
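The emission, operand-read and writeback rules can be written as three small predicates. This is a minimal Python sketch, not the slides' own formulation: the class and function names are hypothetical, and the flags `rj`/`rk` are taken to mean "operand ready and not yet read".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionalUnit:
    name: str
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    rj: bool = False           # Fj ready and not yet read
    rk: bool = False           # Fk ready and not yet read


def can_issue(unit, dest, result_status):
    """Rule 1: the unit is free and no other unit is going to write dest (no WAW)."""
    return not unit.busy and result_status.get(dest) is None


def can_read_operands(unit):
    """Rule 2: every operand the instruction uses is available in the registers."""
    return (unit.fj is None or unit.rj) and (unit.fk is None or unit.rk)


def can_write_result(unit, all_units):
    """Rule 4: no other unit still has to read the register being overwritten (no WAR)."""
    for other in all_units:
        if other is unit or not other.busy:
            continue
        if (other.fj == unit.fi and other.rj) or (other.fk == unit.fi and other.rk):
            return False
    return True


# Hypothetical state: an adder that wants to write F6 while a divider,
# still waiting for F0, has not yet read its second operand F6.
add = FunctionalUnit("Add", busy=True, fi="F6", fj="F8", fk="F2")
divide = FunctionalUnit("Divide", busy=True, fi="F10", fj="F0", fk="F6", rj=False, rk=True)
result_status = {"F0": "Mult1", "F6": "Add", "F10": "Divide"}

print(can_write_result(add, [add, divide]))   # False: WAR on F6
```

Because operands are read together, the divider keeps `rk` set until F0 also arrives, and the adder's write is held back for the whole time.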

An example

Hypothetical timing for different instructions (which includes the operand read and the execution)

FLD 1 cycle
FADD, FSUB 2 cycles
FMUL 10 cycles
FDIV 40 cycles

LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4 (MULTD)
FSUB F8, F6, F2 (SUBD)
FDIV F10, F0, F6 (DIVD)
FADD F6, F8, F2 (ADDD)

[Figure annotations: RAW and WAR arrows between the instructions; the LDs use the integer unit]

Do you find more hypothetical hazards? For instance, what about F0?

Scoreboard entities

Instruction stages: emission, operands read, execution and writeback

Statuses of the functional units (FU): 9 parameters

Busy     Unit busy
Op       Operation code presently executed
Fi       Instruction destination (result) register
Fj, Fk   Operand source registers
Qj, Qk   Functional units producing the required operands (if not yet ready) for the registers Fj and Fk
Rj, Rk   Flags (yes) indicating whether Fj, Fk have already been updated

Result status register: indicates which functional unit will write each register. Empty when no functional unit is going to write the specific register

N.B. Remember that in case of a possible WAW the emission of instructions is stalled (point 1 of the rules)

N.B. In the following example we suppose that two multiplication/division units are available

Example (here we assume that F0 is a “normal” register and not always “0”)

Clock 0

Instruction status (progression of the instruction states)
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status (FU = Functional Unit)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer
-     Mult1
-     Mult2
-     Add
-     Divide

Time: number of clock cycles of execution yet to elapse. Fi: destination. Fj, Fk: source 1, source 2. Qj, Qk: FU for j, FU for k. Rj, Rk: Fj ready? Fk ready?

Register result status (functional unit producing the result for each floating point register Fx)
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
0      -      -        -   -        -    -       -    ...  -

Available units: 1 integer unit, 2 multiplication units, 1 add/sub unit, 1 division unit

Rj and Rk indicate whether the data can be read (possibly in the next cycle, if just produced) from the operand source registers of the instruction which must be executed. Qj and Qk are the functional units which produce them (if not yet ready). Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored in the next clock cycle; the data are available if the corresponding Rj or Rk is yes) to be used by the executed instruction

NB: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD

FLD 1 cycle
FADD, FSUB 2 cycles
FMUL 10 cycles
FDIV 40 cycles

Cycle 1

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
1      -      -        -   Integer  -    -       -    ...  -

At clock 1 the instruction state of LD F6, 34(R2) is Issue. LD uses the integer unit, which is therefore the functional unit used for producing the result in F6. R2 is supposed to be already available and can therefore be used in the next clock. (In the slides the state changes are shown in brown.)

Cycle 2

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    -
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
2      -      -        -   Integer  -    -       -    ...  -

Data ready in R2: the instruction can proceed to execution.

NB: The second LD cannot be emitted because the only integer unit is busy. The same applies to MULTD and the following instructions, because instructions must be emitted in order, although their functional units are free!

Cycle 3

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    -
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
3      -      -        -   Integer  -    -       -    ...  -

Cycle 4

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    -
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
4      -      -        -   Integer  -    -       -    ...  -

The register F6 has been written at the end of the period, and the integer functional unit is freed at the end of the period.

The change of status of the FUs indicates their value at the positive clock edge ending the current cycle (future status). For instance the integer functional unit is freed at the end of cycle 4, together with the result writeback. LD F6, 34(R2) disappears completely from the scoreboard at the positive clock edge concluding the current cycle 4.

Cycle 5

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
5      -      Integer  -   -        -    -       -    ...  -

At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can be emitted and start. R3 is supposed to be already available, as in the previous case. The integer functional unit must produce a new value for F2.

Cycle 6

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6
MULTD F0, F2, F4    6
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    -
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No
-     Add      No
-     Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
6      Mult1  Integer  -   -        -    -       -    ...  -

MULTD F0, F2, F4 can start because its FU is free and no other unit has F0 as destination register. F4 is supposed to be already present. MULTD waits for F2 from the integer unit!

Cycle 7

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7
MULTD F0, F2, F4    6
SUBD F8, F6, F2     7
DIVD F10, F0, F6
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    -
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No
-     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
-     Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
7      Mult1  Integer  -   -        Add  -       -    ...  -

MULTD is stalled in the execution unit because F2 is not yet ready.

SUBD F8, F6, F2 can start because the FP add/subtract unit is free (NB: the FP adder executes FP subtractions too). SUBD needs F2.

Cycle 8

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6
SUBD F8, F6, F2     7
DIVD F10, F0, F6    8
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No
-     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
8      Mult1  -        -   -        Add  Divide  -    ...  -

F2 is written and therefore the integer unit is free. F2 available! The Rj of MULTD and the Rk of SUBD are updated at the end of the cycle: F2 written allows MULTD and SUBD to read their operands during the next cycle.

DIVD F10, F0, F6 can start because the FP divide FU is free. F0 is not yet available.

Cycles 9 - 10

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9
DIVD F10, F0, F6    8
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
10    Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
2     Add      Yes   Sub   F8   F6   F2   -        -        -    -
40    Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
9-10   Mult1  -        -   -        Add  Divide  -    ...  -

N.B.: MULTD and SUBD can read their operands because F2 is available (see cycle 8). DIVD is still stalled because of F0.

ADDD cannot start because SUBD uses the adder FU.

Cycle 11

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9        11
DIVD F10, F0, F6    8
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
8     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
0     Add      Yes   Sub   F8   F6   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
11     Mult1  -        -   -        Add  Divide  -    ...  -

Note: the Add FU requires 2 cycles for the SUBD, and therefore nothing happens in cycle 10 while MULTD is still processing its data.

NB: ADDD will use the result of the SUBD, but it has not yet started because of SUBD (the FU is busy).

Cycle 12

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
7     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
-     Add      No
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
12     Mult1  -        -   -        -    Divide  -    ...  -

SUBD ends, freeing the FU: F8 is written and the ADD/SUB FU is freed. In the next period ADDD can start.

Cycle 13

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2     13

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
6     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
13     Mult1  -        -   Add      -    Divide  -    ...  -

Now ADDD can start because SUBD has finished its execution and has freed the FU.

Cycle 14

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2     13     14

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
5     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
2     Add      Yes   Add   F6   F8   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
14     Mult1  -        -   Add      -    Divide  -    ...  -

Cycle 15

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2     13     14

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
4     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
1     Add      Yes   Add   F6   F8   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
15     Mult1  -        -   Add      -    Divide  -    ...  -

ADDD requires two cycles, and therefore there is no system status change.

Cycle 16

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2     13     14       16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
3     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
0     Add      Yes   Add   F6   F8   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
16     Mult1  -        -   Add      -    Divide  -    ...  -

ADDD has ended its EX stage, while MULTD keeps executing and DIVD keeps waiting for F0.

Cycle 17

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2     13     14       16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
2     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
17     Mult1  -        -   Add      -    Divide  -    ...  -

NB! ADDD is stalled (it cannot write) because of a WAR with DIVD on F6. DIVD has not read F6 because it is waiting for F0, produced by MULTD (operands are read in parallel). MULTD keeps executing and DIVD keeps waiting.

Cycle 18

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2     13     14       16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
1     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
18     Mult1  -        -   Add      -    Divide  -    ...  -

MULTD still executing. DIVD still stalled.

Cycle 19

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9        19
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2     13     14       16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
0     Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
19     Mult1  -        -   Add      -    Divide  -    ...  -

MULTD ends its execution (after 10 cycles) and will write in cycle 20, which will unblock DIVD and then ADDD.

Cycle 20

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9        19             20
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8
ADDD F6, F8, F2     13     14       16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
20     -      -        -   Add      -    Divide  -    ...  -

MULTD writes F0, unblocking DIVD.

Cycle 21

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9        19             20
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8      21
ADDD F6, F8, F2     13     14       16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2   -        -        -    -
-     Divide   Yes   Div   F10  F0   F6   -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
21     -      -        -   Add      -    Divide  -    ...  -

DIVD reads both F0 and F6 (which could not be written by ADDD because of the WAR), unblocking ADDD, which can write F6 in the next cycle.

Cycle 22

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9        19             20
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8      21
ADDD F6, F8, F2     13     14       16             22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   Yes   Div   F10  F0   F6   -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
22     -      -        -   -        -    Divide  -    ...  -

Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. For 6 cycles ADDD could not write F6 although its result was available.

Cycle 61

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9        19             20
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8      21       61
ADDD F6, F8, F2     13     14       16             22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   Yes   Div   F10  F0   F6   -        -        -    -

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
61     -      -        -   -        -    Divide  -    ...  -

DIVD execution ends after 40 cycles.

Cycle 62

Instruction status
Instruction         Issue  Read Op  Exec complete  Write Result
LD F6, 34(R2)       1      2        3              4
LD F2, 45(R3)       5      6        7              8
MULTD F0, F2, F4    6      9        19             20
SUBD F8, F6, F2     7      9        11             12
DIVD F10, F0, F6    8      21       61             62
ADDD F6, F8, F2     13     14       16             22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      No
0     Divide   No

Register result status
Clock  F0     F2       F4  F6       F8   F10     F12  ...  F31
62     -      -        -   -        -    -       -    ...  -

All executions ended.
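The whole walkthrough can be replayed with a compact simulation. The following Python sketch is a simplification written for this example, not a general scoreboard: it keeps only the clock of each stage per instruction and applies the four rules using events of earlier clocks only. The unit counts and latencies are those of the example; the function and variable names are assumptions.

```python
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}
UNIT_OF = {"LD": "Integer", "MULTD": "Mult", "SUBD": "Add", "ADDD": "Add", "DIVD": "Divide"}
UNIT_COUNT = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}

PROGRAM = [  # (operation, destination, source registers)
    ("LD", "F6", ["R2"]),
    ("LD", "F2", ["R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def before(event, clock):
    """True if the event already happened in an earlier clock."""
    return event is not None and event < clock


def simulate(program):
    n = len(program)
    issue, read, done, write = ([None] * n for _ in range(4))
    clock = 0
    while any(w is None for w in write):
        clock += 1
        for i, (op, dest, sources) in enumerate(program):
            if issue[i] is None:
                # Rule 1: in order, free unit, no pending write to the same register.
                in_order = i == 0 or before(issue[i - 1], clock)
                active = [j for j in range(n)
                          if issue[j] is not None and not before(write[j], clock)]
                used = sum(UNIT_OF[program[j][0]] == UNIT_OF[op] for j in active)
                no_waw = all(program[j][1] != dest for j in active)
                if in_order and used < UNIT_COUNT[UNIT_OF[op]] and no_waw:
                    issue[i] = clock
            elif read[i] is None:
                # Rule 2: every earlier producer of a source has already written it.
                if all(before(write[j], clock)
                       for j in range(i) if program[j][1] in sources):
                    read[i] = clock
            elif done[i] is None:
                # Rule 3: execution takes LATENCY clocks after the operand read.
                if clock == read[i] + LATENCY[op]:
                    done[i] = clock
            elif write[i] is None:
                # Rule 4: every earlier reader of the destination has read it (no WAR).
                if all(before(read[j], clock)
                       for j in range(i) if dest in program[j][2]):
                    write[i] = clock
    return list(zip(issue, read, done, write))


table = simulate(PROGRAM)
for (op, dest, _), row in zip(PROGRAM, table):
    print(op, dest, row)
```

Each printed row gives the Issue, Read Op, Exec complete and Write Result clocks of one instruction, so the output can be compared line by line with the final instruction status table of cycle 62.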

Scoreboard limits

• An instruction can be emitted only if all previous instructions have been emitted

FDIV F0, F2, F4
FADD F6, F0, F8
FSTOR F6, 0(R1)
FSUB F8, F10, F14
FMUL F6, F10, F8

[Figure annotations: RAW, WAR and WAW arrows between the instructions]

N.B. The hazards of the sequence are only potential: their occurrence depends on the execution times of the instructions

• Register values must in any case be read in parallel and only from the register file (which means that they must have already been stored in the registers: no RAW problem)

Renaming – Tomasulo Algorithm

Tomasulo algorithm: “renaming” is based on the concept of “reservation stations”, which are buffers of the functional units where instructions can be «parked» waiting for the availability of the requested FU and of the needed data.

«Renaming» indicates a location different from the RF where a requested datum is produced/stored and can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed

A reservation station is a place in a FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written in the RF). For its operands EITHER the source register data OR the reservation stations producing them are indicated (whence renaming). The renaming occurs at run time

The following benefits occur

A reservation station captures a required operand exactly when and where it is produced (without waiting until it is written, thus avoiding the register file access). Similar to the case of forwarding

When multiple writes to the same register occur (WAW, possible only if multiple buses between FUs and RF are available) only the most recently produced data are written (for each register a TAG is used, indicating the FU which has the right to write)

Hazard detection and execution control are distributed (not centralised as in the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the data is being produced, if not yet in the RF) and NOT the RF is indicated. RAW hazards are no longer possible, since the requested data are provided as soon as they are produced. The same holds for WAR (data are read by the reservation stations while they are written)

Results are transferred directly to the reservation stations of the waiting FUs through the common data buses, without the need to read the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple buses are available)

Page 48: Computer Architectures M

48

Tomasulo Algorithm

Tomasulo eliminates not only WAWs but also WARs

FLD  F6, 32(R2)
FLD  F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2

Possible WAW: FLD F6 and FADD F6. Possible WAR: FADD F6 with FSUB and FDIV, which read F6.

As far as the WAW between FLD and FADD on F6 is concerned, the mechanism grants that only the most recent instruction in the RS using a destination register can write the register.

After renaming (T = functional unit producing the data):

FLD  [T/F6], 32(R2)
FLD  F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2

NB: When an instruction is inserted in a RS it is checked whether one or more of its operands are being produced elsewhere by other RS: if yes, then renaming

For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV has read its operands (F8 of FSUB and F2 of FLD could both be immediately available for FADD), but since FDIV points for F6 to the RS of FLD F6, 32(R2) and not to the RF the problem does not occur. The same holds for FSUB.
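The dependences of the sequence above can be listed mechanically. The following is a minimal sketch, not part of the original slides: the tuple encoding of the instructions and the name `find_hazards` are assumptions made for illustration, and the check is purely pairwise (it does not filter pairs separated by an intervening write to the same register).

```python
# Each instruction is (opcode, destination register, source registers).
PROGRAM = [
    ("FLD",  "F6",  ["R2"]),
    ("FLD",  "F2",  ["R3"]),
    ("FMUL", "F0",  ["F2", "F4"]),
    ("FSUB", "F8",  ["F2", "F6"]),
    ("FDIV", "F10", ["F0", "F6"]),
    ("FADD", "F6",  ["F8", "F2"]),
]


def find_hazards(program):
    """Return (kind, earlier index, later index, register) for every pair."""
    hazards = []
    for i, (_, dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dest_j, srcs_j = program[j]
            if dest_i in srcs_j:      # later instruction reads what i writes
                hazards.append(("RAW", i, j, dest_i))
            if dest_j in srcs_i:      # later instruction overwrites what i reads
                hazards.append(("WAR", i, j, dest_j))
            if dest_i == dest_j:      # both write the same register
                hazards.append(("WAW", i, j, dest_i))
    return hazards


HAZARDS = find_hazards(PROGRAM)
for kind, i, j, reg in HAZARDS:
    print(f"{kind} on {reg}: {PROGRAM[i][0]} (#{i}) -> {PROGRAM[j][0]} (#{j})")
```

On this sequence the sketch reports, among others, the WAW on F6 between the first FLD and FADD and the WAR on F6 between FDIV and FADD, which are the two cases discussed above.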

Page 49: Computer Architectures M

49

Tomasulo Algorithm

Very high performance without special compilers

Differences with scoreboard

• Buffers and controls are directly distributed in the FUs (there is no centralized control): the buffers are called “reservation stations”

• Source register names are substituted by pointers to the buffers of the reservation stations (if the requested data are being produced there)

• “Renaming”: a direct pointer to the sources and not to the register

• One or more Common Data Busses for sending results to all FUs requiring them

• Loads and Stores are considered as FUs (a STORE can also be a source for a RS executing a LOAD)

Page 50: Computer Architectures M

50

Tomasulo Algorithm

In this example it is assumed that the MUL unit executes the DIVs too and that the ADD unit executes the SUBs too. LOADs and STOREs are handled as other instructions.

In this example: 3 RS for add/sub, 2 RS for mult/div, 5 RS for store, 5 RS for load.

In this example there is only one Common Data Bus. Please notice that the same Common Data Bus, which carries the data produced by the FUs, is also used by the RS waiting for data.

Each RS (more than one for each FU) stores an emitted instruction and, for each operand, either of two elements: the operand value (i.e. read from the RF) or the name of the RS which is producing it (renaming)

Page 51: Computer Architectures M

51

Tomasulo Algorithm

• Load buffers are used to store the load addresses

• Store buffers contain the computed addresses and the data to be written in memory

• Loads and stores must be executed in sequence if they refer to the same address. In the other cases it is possible to anticipate the LOADs (never the STOREs)

• In the figure there are 3 phases (each of which can last several clocks):

• Emission: the instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition), otherwise the instruction queue stalls. Operands are extracted from the RF or from the producing FU, as indicated. In case of WAW it must be determined which instruction must provide the data

• Execution: if one or more operands are not yet available the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available: RAWs are therefore avoided (we are sure not to read stale data in the RF).

• Writeback: as soon as a datum is produced, it is transferred over one CDB (when more than one are available) to the RF and to the RS waiting for it.

Page 52: Computer Architectures M

52

Tomasulo Algorithm

Let’s see the scoreboard example in a Tomasulo architecture. Let’s suppose that the execution times are the same as in the scoreboard example (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). NB: “+1” is for the writeback

LD   F6, 34(R2)
LD   F2, 45(R3)
FMUL F0, F2, F4
FSUB F8, F6, F2
FDIV F10, F0, F6
FADD F6, F8, F2

Page 53: Computer Architectures M

53

Reservation Station

Op: opcode of the instruction to be executed

Vj, Vk: the source operands (read either from the RF or from the FUs producing them). If blank, the datum is being produced by the corresponding Qj or Qk

Qj, Qk: functional units producing the operands. A blank indicates that the source operands are already in Vj or Vk or that they are not required

Busy: the FU is busy

Register File Status: indicates which FU will write the register (if needed). A blank means that there are no instructions which must write the register and therefore its value can be directly used

N.B. One instruction per clock is emitted from the general instruction queue when a RS of the FU for that instruction is available, otherwise stall. In our example we assume only one CDB.
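The fields above can be turned into a small data structure. The following is a minimal sketch under stated assumptions: the names `ReservationStation`, `emit` and `broadcast`, and the dictionaries `regs` (register file) and `status` (register result status), are hypothetical and chosen only for illustration; timing, FU selection and load/store buffers are not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                    # e.g. "Mult1"
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # RS producing a still missing operand
    qk: Optional[str] = None


def emit(rs, op, dest, src_j, src_k, regs, status):
    """Park an instruction in a free RS; rename sources still being produced."""
    rs.busy, rs.op = True, op
    rs.vj = rs.vk = rs.qj = rs.qk = None
    if src_j in status:
        rs.qj = status[src_j]    # renaming: point to the producer, not the RF
    else:
        rs.vj = regs[src_j]
    if src_k in status:
        rs.qk = status[src_k]
    else:
        rs.vk = regs[src_k]
    status[dest] = rs.name       # only the latest writer keeps the tag (WAW)


def broadcast(tag, value, stations, regs, status):
    """Writeback on the CDB: waiting RS and the RF capture the value."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg in [r for r, producer in status.items() if producer == tag]:
        regs[reg] = value
        del status[reg]


# MULTD F0, F2, F4 emitted while Load2 is still producing F2.
regs = {"F0": 0.0, "F2": 0.0, "F4": 4.0}
status = {"F2": "Load2"}
mult1 = ReservationStation("Mult1")
emit(mult1, "MULT", "F0", "F2", "F4", regs, status)
print(mult1)                     # Qj = Load2, Vk = 4.0
broadcast("Load2", 3.0, [mult1], regs, status)
print(mult1, regs, status)       # Vj captured from the CDB, F2 updated
```

The register is written by `broadcast` only if its tag still names the broadcasting unit, which is the TAG mechanism that resolves WAW.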

Page 54: Computer Architectures M

54

Cycle 0

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2
LD     F2    45   R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Reservation stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk)
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status (F0, F2, F4, F6, F8, F10, F12 ... F31)
Clock 0, FU: (all blank)

• Register result status: the FU producing the new value
• Qj, Qk: producing FU; if blank it means that the datum is in the RF
• Vj, Vk: operand registers. If blank the datum is produced in the corresponding Q FU
• NB. For LD (ST is not used here) there is a limited number of RS. Their BUSY status is here displayed differently from that of the FUs (see next slide)
• For the sake of simplicity Rj and Rk (ready/not ready) of the scoreboard are not displayed, since their values are implicit in the status of Qj and Qk
• Load/store are not indicated in the status table
• FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles. NB: “+1” is for the writeback

Page 55: Computer Architectures M

55

Cycle 1

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1
LD     F2    45   R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers (5 RS for the LOAD)
Name   Busy  Address
Load1  Yes   34+R2

Reservation stations (3 RS for adder/sub, 2 RS for mul/div)
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status
Clock 1, FU: F6 = Load1

NB: Here it is assumed that R2 and R3 are already available

Page 56: Computer Architectures M

56

Cycle 2

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-
LD     F2    45   R3   2
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers (5 RS for LOAD)
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status
Clock 2, FU: F2 = Load2, F6 = Load1

The second LD is emitted. One instruction per clock is emitted (when possible)

N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. The R3 value is already available in the RF

NB: Load -> 2 cycles: the first one for computing the address and the second for reading the data

Page 57: Computer Architectures M

57

Cycle 3

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3
LD     F2    45   R3   2      3-
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
10    Mult1  Yes   Mult      F4  Load2
      Mult2  No

Register result status
Clock 3, FU: F0 = Mult1, F2 = Load2, F6 = Load1

• LD: two cycles
• MULTD emitted (free RS). It can be emitted although F2 is NOT yet available: F2 -> renaming
• F4: data supposed already in the RF
• Mult1: yet 10 cycles

Page 58: Computer Architectures M

58

Cycle 4

The FUs execute both sums and subtractions

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers
Name   Busy  Address
Load2  Yes   45+R3

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
3     Add1   Yes   Sub   F6               Load2
      Add2   No
      Add3   No
10    Mult1  Yes   Mult      F4  Load2
      Mult2  No

Register result status
Clock 4, FU: F0 = Mult1, F2 = Load2, F8 = Add1

• SUBD is emitted (RS free). F6 is captured on the fly and is available in the RF at the end of the cycle
• The data read from memory by LD F6 34(R2) is written both in the RF and in the RS of SUBD, which needs it
• Load1: FU freed at the end of the clock cycle
• Add1: yet 3 cycles. Mult1: yet 10 cycles

Page 59: Computer Architectures M

59

Cycle 5

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time  Name   Busy  Op    Vj          Vk          Qj     Qk
3     Add1   Yes   Sub   F6 (capt.)  F2 (capt.)
0     Add2   No
      Add3   No
10    Mult1  Yes   Mult  F2 (capt.)  F4
      Mult2  Yes   Div               F6          Mult1

Register result status
Clock 5, FU: F0 = Mult1, F8 = Add1, F10 = Mult2

• DIVD is emitted (RS free). It waits for F0
• Load2: FU freed
• The datum read from memory with LD F2 45(R3) is written both in register F2 and in the RS of SUBD and MULTD, which are waiting for it

Page 60: Computer Architectures M

60

Cycle 6

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-
SUBD   F8    F6   F2   4      6-
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time  Name   Busy  Op    Vj          Vk          Qj     Qk
2     Add1   Yes   Sub   F6 (capt.)  F2 (capt.)
      Add2   Yes   Add               F2          Add1
      Add3   No
9     Mult1  Yes   Mult  F2          F4
40    Mult2  Yes   Div               F6          Mult1

Register result status
Clock 6, FU: F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

• Now MULTD can execute (F2 and F4 available)
• ADDD is emitted (RS free). It waits for F8
• DIVD waits for F0 (yet 40 cycles)

Page 61: Computer Architectures M

61

Cycle 7

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-
SUBD   F8    F6   F2   4      6-7
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6

Reservation stations
Time  Name   Busy  Op    Vj          Vk          Qj     Qk
1     Add1   Yes   Sub   F6 (capt.)  F2 (capt.)
      Add2   Yes   Add               F2          Add1
      Add3   No
8     Mult1  Yes   Mult  F2          F4
40    Mult2  Yes   Div               F6          Mult1

Register result status
Clock 7, FU: F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

• SUBD (as ADDD): two cycles
• ADDD stalled waiting for SUBD (F8)
• The data in F6 will be overwritten by ADDD, but it was already read and is present in the RS of DIVD

Page 62: Computer Architectures M

62

Cycle 8

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-
SUBD   F8    F6   F2   4      6-7        8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
0     Add1   No
2     Add2   Yes   Add   F8  F2
      Add3   No
7     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status
Clock 8, FU: F0 = Mult1, F6 = Add2, F10 = Mult2

• NB: SUBD ends before MULTD and allows ADDD (which captures the result F8) to start executing
• Add1: FU freed

Page 63: Computer Architectures M

63

Cycle 9

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-
SUBD   F8    F6   F2   4      6-7        8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      9-

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
2     Add2   Yes   Add   F8  F2
      Add3   No
6     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status
Clock 9, FU: F0 = Mult1, F6 = Add2, F10 = Mult2

ADDD executing

Page 64: Computer Architectures M

64

Cycle 10

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-
SUBD   F8    F6   F2   4      6-7        8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      9-10

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
1     Add2   Yes   Add   F8  F2
      Add3   No
5     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status
Clock 10, FU: F0 = Mult1, F6 = Add2, F10 = Mult2

ADDD: two execution cycles

Page 65: Computer Architectures M

65

Cycle 11

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-
SUBD   F8    F6   F2   4      6-7        8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      9-10       11

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
0     Add2   No
      Add3   No
4     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status
Clock 11, FU: F0 = Mult1, F10 = Mult2

• ADDD too ends before MULTD and DIVD
• Add2: FU freed

Page 66: Computer Architectures M

66

Cycle 12

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-
SUBD   F8    F6   F2   4      6-7        8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      9-10       11

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status
Clock 12, FU: F0 = Mult1, F10 = Mult2

DIVD is waiting for the data produced by MULTD

Page 67: Computer Architectures M

67

Cycle 15

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-15
SUBD   F8    F6   F2   4      6-7        8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      9-10       11

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   Mult  F2  F4
40    Mult2  Yes   Div       F6  Mult1

Register result status
Clock 15, FU: F0 = Mult1, F10 = Mult2

DIVD is waiting for the data produced by MULTD

Page 68: Computer Architectures M

68

Cycle 16

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-15       16
SUBD   F8    F6   F2   4      6-7        8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      9-10       11

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
0     Mult1  No
40    Mult2  Yes   Div   F0  F6

Register result status
Clock 16, FU: F10 = Mult2

• Now DIVD can execute
• Mult1: FU freed

Page 69: Computer Architectures M

69

Cycle 56

Instruction status
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1      2-3        4
LD     F2    45   R3   2      3-4        5
MULTD  F0    F2   F4   3      6-15       16
SUBD   F8    F6   F2   4      6-7        8
DIVD   F10   F0   F6   5      17-56
ADDD   F6    F8   F2   6      9-10       11

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   Div   F0  F6

Register result status
Clock 56, FU: F10 = Mult2

Page 70: Computer Architectures M

70

Cycle 57

Instruction status
Instruction  j    k    Issue  Execution complete  Write result
LD     F6    34   R2   1      2-3                 4
LD     F2    45   R3   2      3-4                 5
MULTD  F0    F2   F4   3      6-15                16
SUBD   F8    F6   F2   4      6-7                 8
DIVD   F10   F0   F6   5      17-56               57
ADDD   F6    F8   F2   6      9-10                11

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status
Clock 57, FU: (all blank)
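The issue, execution and writeback cycles of the walkthrough can be re-derived with a few lines of code. This is a minimal sketch under stated assumptions, not the author's simulator: the names `LATENCY`, `PROGRAM` and `schedule` are hypothetical; one instruction is emitted per cycle; execution starts the cycle after both the emission and the writeback of the last missing operand; structural limits (number of RS, busy FUs, CDB conflicts) are ignored because they never arise in this example. The load execution latency is taken as 2 cycles, as in the tables (address computation plus memory read).

```python
# Execution latencies in cycles, excluding the final writeback cycle.
LATENCY = {"LD": 2, "MULTD": 10, "SUBD": 2, "DIVD": 40, "ADDD": 2}

# (opcode, destination, floating point source registers)
PROGRAM = [
    ("LD",    "F6",  []),
    ("LD",    "F2",  []),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]


def schedule(program, latency):
    """Return (op, dest, issue, exec start, exec end, write) per instruction."""
    producer_write = {}   # register -> writeback cycle of its latest producer
    rows = []
    for n, (op, dest, srcs) in enumerate(program):
        issue = n + 1
        # Renaming: each source is tied to the producer known at emission time.
        operands_ready = max((producer_write.get(s, 0) for s in srcs), default=0)
        start = max(issue, operands_ready) + 1
        end = start + latency[op] - 1
        write = end + 1
        rows.append((op, dest, issue, start, end, write))
        producer_write[dest] = write
    return rows


ROWS = schedule(PROGRAM, LATENCY)
for op, dest, issue, start, end, write in ROWS:
    print(f"{op:6}{dest:4} issue={issue} exec={start}-{end} write={write}")
```

Because the producers are recorded in program order, DIVD is tied to the F6 written by the first LD and not to the one written later by ADDD, which is why ADDD can complete at cycle 11 without creating a WAR.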


Page 71: Computer Architectures M

71

A demo can be found at

http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo_files/tomasulo.htm


Page 72: Computer Architectures M

Limits of Tomasulo Algorithm

73

• Very complex

• Each CDB must be connected to each RS: complex cabling. Reducing the number of CDBs reduces efficiency

• If a single CDB is present only one instruction per cycle can end

• Out-of-order instruction completion!

• Interrupts are NOT precise

Page 73: Computer Architectures M

Exceptions

74

• Exception/interrupt: non-programmed control transfer
  - Return address and all other information necessary to restore the interrupted situation must be saved
  - A «response» subroutine (handler) must be executed

• Two exception types: interrupt and trap

• Interrupts: external causes
  - The user program is interrupted and then restored
  - Asynchronous to the current process
  - Acknowledged at the end of the current instruction (if interrupts are enabled)
  - The handler is the responsibility of the user program

• Traps: internal causes
  - Exceptional conditions (overflow, division by zero etc.)
  - Errors (i.e. parity)
  - Page fault (or, see later, segment fault): data not available in memory
  - Synchronous to the current process
  - Operating system handler
  - The instruction can be interrupted during its execution (i.e. page fault) and therefore must be «restartable». The executing program is normally temporarily aborted.

Page 74: Computer Architectures M

Examples

75

Instruction Restart


Page 75: Computer Architectures M

Precise exceptions/interrupts

76

• Exceptions must be “precise”, that is their behaviour must be the same that would occur in a “non-pipelined” architecture

• Precise: the machine status is saved as if the code had been executed up to the exception:
  - All preceding instructions must be terminated
  - All instructions following the instruction which provoked the exception must be handled as if they never started
  - The same code must execute identically on different architectures

• Complex problem with pipeline, OOO execution (see later) etc.

• Scoreboard and Tomasulo have in-order emission but out-of-order execution (and therefore out-of-order termination)

• Precise exceptions (interrupts): instruction commitment in order

Page 76: Computer Architectures M

78

Reorder Buffer (ROB)

[Figure: FP Op Queue, ROB, FP Regs, and two FP adders with their reservation stations]

• FIFO queue

• Stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB

• When instructions terminate, the results are stored in the ROB (instead of the RF), which also provides the operands to the other instructions which require them (renaming!)

• Commitment: the results of the instruction which has reached the top slot of the FIFO are transferred to the architectural registers (registers which could be read by a test program)

• Easy “undo” of speculated instructions (see later), of erroneously predicted branches or of exceptions

• Automatic WAW avoidance

Page 77: Computer Architectures M

79

Tomasulo again


Page 78: Computer Architectures M

80

Tomasulo in 4 steps

• Emission: an instruction is emitted from the instruction queue when a RS and a ROB slot are available. The RS indicates the operand sources and the ROB slot where the instruction will be “parked” after its execution (this phase is called «dispatch»). The results are NOT written in the RF until the commitment phase. NB: the lack of either of the two conditions blocks the emission of the following instructions

• Execution: operands transformation. If the operands are not yet ready they can be in the ROB (in this case the operand values computed by the nearest previous instructions are used) or still being computed in the FU. This phase is indicated as “issue”.

• Result writeback: execution ends. The result is transmitted on the CDB to the RS waiting for it and to the ROB.

• Commitment: the architectural registers (or memory) are updated with the results stored in the ROB when the instruction is at the top of the ROB FIFO. In case of an erroneously predicted branch the ROB results are just dropped (“graduation”).

N.B. Sometimes several instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one) the result of the most recent instruction is used.

EMISSION IN ORDER
COMMITMENT IN ORDER
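The in-order emission and in-order commitment around an out-of-order writeback can be sketched as follows. This is an illustrative model only: the class `ReorderBuffer` and its method names are assumptions, the queue is a plain FIFO rather than a hardware circular buffer, and exceptions and the CDB are not modelled.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.slots = deque()        # oldest instruction on the left (ROB top)
        self.next_id = 1

    def allocate(self, dest):
        """Emission: reserve a slot in program order, or None if the ROB is full."""
        if len(self.slots) >= self.size:
            return None
        entry = {"id": self.next_id, "dest": dest, "value": None, "done": False}
        self.next_id += 1
        self.slots.append(entry)
        return entry["id"]

    def writeback(self, rob_id, value):
        """Result writeback: may happen in any order."""
        for entry in self.slots:
            if entry["id"] == rob_id:
                entry["value"], entry["done"] = value, True

    def commit(self, regs):
        """Commitment: only from the top of the FIFO, hence in order."""
        committed = []
        while self.slots and self.slots[0]["done"]:
            entry = self.slots.popleft()
            if entry["dest"] is not None:
                regs[entry["dest"]] = entry["value"]
            committed.append(entry["id"])
        return committed

    def flush_after(self, rob_id):
        """Drop every instruction younger than rob_id (mispredicted branch)."""
        while self.slots and self.slots[-1]["id"] > rob_id:
            self.slots.pop()


regs = {}
rob = ReorderBuffer(size=4)
first = rob.allocate("F0")
second = rob.allocate("F10")
third = rob.allocate("F2")
rob.writeback(second, 5.0)          # the younger instruction ends first
early = rob.commit(regs)            # nothing commits: the top is not done
rob.writeback(first, 1.0)
late = rob.commit(regs)             # now both commit, in program order
print(early, late, regs)
```

The architectural registers change only inside `commit`, so dropping the younger slots with `flush_after` leaves no trace of the cancelled instructions.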

Page 79: Computer Architectures M

81

HW with ROB

[Figure: FP Op Queue, Reorder Buffer, FP Regs, comparison network, and two FP adders with their reservation stations. Each ROB slot holds: Destination Register, Result, Exception?, Valid (terminated), Program Counter]

• ROB is a circular queue
• Program counter: i.e. used for branches

Page 80: Computer Architectures M

82

Example

LD   F0, 10(R2)     3 cycles
FADD F10, F4, F0    5 cycles
FDIV F2, F10, F6    20 cycles
BRNE F2, +100
LD   F4, 0(R3)
FADD F0, F4, F6
ST   0(R3), F4

Page 81: Computer Architectures M

83

Tomasulo with ROB – cycle 1

LD   F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD   F4, 0(R3)
FADD F0, F4, F6
ST   0(R3), F4

ROB (fed by the FP Op queue; slots ROB1 ... ROB7)
Slot  Dest.  Source  Instruction      Completed?
ROB1  F0             LD F0, 10(R2)    N            (ROB top, ROB end)

Reservation stations (ROB position, Op. code, Operands)
FP adders: empty
FP multipliers: empty
Memory unit M1 (to/from memory): ROB position 1, address 10+R2

Page 82: Computer Architectures M

84

Tomasulo with ROB – cycle 2

LD   F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD   F4, 0(R3)
FADD F0, F4, F6
ST   0(R3), F4

ROB
Slot  Dest.  Source  Instruction               Completed?
ROB1  F0             LD F0, 10(R2)             Ex           (top)
ROB2  F10    ROB1    FADD F10, F4, F0 [ROB1]   N            (end)

Reservation stations (ROB position, Op. code, Operands)
FP adders: 2, FADD F10, F4, ROB1
FP multipliers: empty
Memory unit M1 (to/from memory): ROB position 1, address 10+R2

• Renaming!! (RAW on F0)
• Memory: 2 clocks
• Three slots for memory operations
• There can be also two ROB sources

Page 83: Computer Architectures M

85

Tomasulo with ROB – cycle 3

LD   F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD   F4, 0(R3)
FADD F0, F4, F6
ST   0(R3), F4

ROB
Slot  Dest.  Source  Instruction               Completed?
ROB1  F0             LD F0, 10(R2)             Ex           (top)
ROB2  F10    ROB1    FADD F10, F4, F0 [ROB1]   N
ROB3  F2     ROB2    FDIV F2, F10 [ROB2], F6   N            (end)

Reservation stations (ROB position, Op. code, Operands)
FP adders: 2, FADD F10, F4, ROB1
FP multipliers: 3, FDIV F2, ROB2, F6
Memory unit M1 (to/from memory): ROB position 1, address 10+R2

Page 84: Computer Architectures M

86

Tomasulo with ROB – cycle 5

LD   F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD   F4, 0(R3)
FADD F0, F4, F6
ST   0(R3), F4

ROB
Slot  Dest.  Source  Instruction                Completed?
ROB1  F0             LD F0, 10(R2)              Completed and committed (F0)
ROB2  F10            FADD F10, F4, F0           Ex           (top)
ROB3  F2     ROB2    FDIV F2, F10 [ROB2], F6    N
ROB4  --     ROB3    BRNE F2 [ROB3], +100       N
ROB5  F4             LD F4, 0(R3)               Ex
ROB6  F0     ROB5    FADD F0, F4 [ROB5], F6     N            (end)

Reservation stations (ROB position, Op. code, Operands)
FP adders: 2, FADD F10, F4, F0; 6, FADD F0, ROB5, F6
FP multipliers: 3, FDIV F2, ROB2, F6
Memory unit M1 (to/from memory): ROB position 5, address 0+R3

• FP registers: F0 updated by memory op ROB1. The slots from ROB2 on are not yet committed
• In cycle 4 (end of the first LD) FADD F10, F4, F0 started executing. Its F0 operand was captured on the fly and is no longer present in the ROB
• BRNE was emitted in cycle 4 in parallel with LD F4, 0(R3)

Page 85: Computer Architectures M

87

Tomasulo with ROB – cycle 6

LD   F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD   F4, 0(R3)
FADD F0, F4, F6
ST   0(R3), F4

ROB
Slot  Dest.  Source  Instruction                Completed?
ROB1  F0             LD F0, 10(R2)              committed (F0 updated by memory op ROB1)
ROB2  F10            FADD F10, F4, F0           Ex           (top)
ROB3  F2     ROB2    FDIV F2, F10 [ROB2], F6    N
ROB4  --     ROB3    BRNE F2 [ROB3], +100       N
ROB5  F4             LD F4, 0(R3)               Ex
ROB6  F0     ROB5    FADD F0, F4 [ROB5], F6     N
ROB7  --     ROB5    ST 0(R3), F4 [ROB5]        N            (end)

Reservation stations (ROB position, Op. code, Operands)
FP adders: 2, FADD F10, F4, F0; 6, FADD F0, ROB5, F6
FP multipliers: 3, FDIV F2, ROB2, F6
Memory unit M1 (to/from memory): ROB position 5, address 0+R3

NB: ST can start its execution when LD F4, 0(R3) has terminated its execution, NOT when it is committed

Page 86: Computer Architectures M

88

Register Renaming

• But when an emitted instruction must use a register, where can it be found? In the ROB or in the RF? The entire ROB should be analysed and the most recent slot (if any) whose destination is the required register should be found: the instruction should point either to that slot or to the RF. Complex and slow procedure

• Solution: use a number of physical registers greater than that of the architectural registers (the registers known to the assembly language programmer: ISA) and keep a pointer to the most recent one (possibly not yet architectural)

• Whenever an instruction inserted in the ROB must write a register (i.e. F17), it points to a new physical register associated to the involved register (F17), where the result will be temporarily stored. Any following instruction which must use the register (F17) will use that physical register

• At each commitment the pointer to the architectural register is made to point to the physical register linked to the committed instruction. When a new instruction regarding the same architectural register is committed the pointer is changed (and the physical register previously embodying the architectural register is freed).

Page 87: Computer Architectures M

89

An example with R2

[Figure: circular queue of the physical registers R2-0 ... R2-8 associated to register R2. R2-1 is the architectural register; the pointer to the first free register R2 points to R2-4 when LD R2, 10(R5) is emitted. After the commitment of the instruction using R2-2 the architectural register moves from R2-1 to R2-2]

Let’s suppose that R2-2 and R2-3 are already occupied by previous, not yet committed, instructions

LD   R2, 10(R5)   ; R2-4 (destination): first free physical register associated to R2
MUL  R8, R2, R5   ; R2-4 (source)
RADD R2, R9, R6   ; R2-5 (destination)
DIV  R2, R2, R10  ; R2-6 (destination) and R2-5 (source) (here commitment of the instruction using R2-2)

• When LD R2, 10(R5) is emitted, register R2-4 is given to it as destination, and it will be used by MUL (as soon as the new datum is computed). Now R2-2, R2-3 and R2-4 are «busy» and the first free register will be R2-5. R2-2, R2-3 and R2-4 will be freed as soon as the related instructions end. If the commitment is “in-order” all hazards disappear. R2-1 is the architectural R2 register. R2-2 will become the architectural R2 register at the commitment of the related instruction. The busy registers are freed when no longer needed

• No more distinction between register file and ROB locations. Normally there are 40-120 physical registers
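The allocation of physical registers in the R2 example can be sketched in a few lines. This is a simplified illustration and not the hardware mechanism: the class `Renamer` is hypothetical, it uses one pool per architectural register as in the figure, and it omits the circular wrap-around, the freeing of registers at commitment and the stall when no register is free. A source with no in-flight producer simply keeps its architectural name.

```python
class Renamer:
    def __init__(self, next_free):
        # Architectural register -> index of its first free physical copy.
        self.next_free = dict(next_free)
        # Architectural register -> most recent (possibly uncommitted) copy.
        self.latest = {}

    def source(self, reg):
        """Physical register holding the most recent value of reg."""
        return self.latest.get(reg, reg)

    def rename(self, dest, srcs):
        # Sources are looked up BEFORE the destination is allocated, so that
        # DIV R2, R2, R10 reads the old copy and writes a new one.
        physical_srcs = [self.source(s) for s in srcs]
        index = self.next_free.get(dest, 0)
        physical_dest = f"{dest}-{index}"
        self.next_free[dest] = index + 1
        self.latest[dest] = physical_dest
        return physical_dest, physical_srcs


renamer = Renamer({"R2": 4})                 # R2-2 and R2-3 are already busy
ld = renamer.rename("R2", ["R5"])            # LD   R2, 10(R5)
mul = renamer.rename("R8", ["R2", "R5"])     # MUL  R8, R2, R5
add = renamer.rename("R2", ["R9", "R6"])     # RADD R2, R9, R6
div = renamer.rename("R2", ["R2", "R10"])    # DIV  R2, R2, R10
for result in (ld, mul, add, div):
    print(result)
```

Each write to R2 receives its own physical register, so the WAW and WAR hazards on R2 disappear and only the true (RAW) dependences remain.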

Page 88: Computer Architectures M

90

HW support for register renaming

• Great number of physical registers

• Fast mapping between architectural and physical registers (run time)

• Free/busy register table. Two solutions: one pool of physical registers for all architectural registers or one pool for each architectural register.

• If no physical register is available (the pool is used circularly) the instruction is stalled. There is no emission also if no free slot in the ROB or no RS is available

Page 89: Computer Architectures M

ROB «without Tomasulo»

91

• Instructions are emitted as soon as a free slot in the ROB and a physical destination register are available, using the register renaming

• For each FU there is a virtual queue whose slots point to the ROB slots which require that FU.

• The instructions of this queue are executed as soon as the required operands are available.

• When two instructions are ready for execution the FIFO rule is applied (so as to speed up the commitment, which is always in order)
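The FIFO rule among ready instructions amounts to picking the oldest entry of the FU queue whose operands are all available. A minimal sketch follows; the function name `select_ready` and the representation of the queue as (ROB slot, source registers) pairs are assumptions made for illustration.

```python
def select_ready(queue, ready_regs):
    """Return the ROB slot of the oldest instruction whose sources are ready.

    queue: list of (rob_slot, source registers), oldest first.
    ready_regs: set of registers whose values are available.
    """
    for rob_slot, srcs in queue:
        if all(src in ready_regs for src in srcs):
            return rob_slot
    return None


adder_queue = [(2, ["F8", "F2"]), (3, ["F2", "F4"])]
print(select_ready(adder_queue, {"F2", "F4"}))         # slot 2 still waits for F8
print(select_ready(adder_queue, {"F2", "F4", "F8"}))   # both ready: oldest wins
```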

Page 90: Computer Architectures M

92

ROB and speculation

• Dynamic instruction execution granting precise interrupts, which are checked at instruction commitment, always in order

• Cancellation of speculative instructions when a branch is erroneously predicted

The prediction error must be revealed ASAP. The cancellation of the post-branch instructions erroneously executed allows the preceding instructions to keep executing. The erroneously executed instructions are not yet committed

The early branch prediction avoids the execution of useless instructions (sometimes very time expensive). It must be remembered that not only the ROB flush occurs but also the cancellation of all the instructions already in the pipeline

Need for a separate Return Stack Buffer for the speculative calls (otherwise the stack could be damaged). It is a separate stack whose content is copied onto the stack if the branch has been correctly predicted as taken. All instructions following a branch not yet committed use this stack. In case of misprediction the RSB content is cancelled

Page 91: Computer Architectures M

93

Example - 1

FLD  F4, 0(R10)
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

Dependences marked in the figure (all on F4): RAW, WAW, RAW, WAW, RAW

Same execution times as in the previous Tomasulo example

Page 92: Computer Architectures M

94

Clock 0

Tomasulo without ROB and with renaming (RES stations). The multiplication FU executes the divisions too.

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Instruction status
Instruction  j    k    Issue  Exe. compl.  Write result
FLD   F4     0    R10
FDIV  F8     F0   F4
FMUL  F4     F2   F3
FMUL  F4     F4   F4
FADD  F6     F10  F4
FLD   F4     0    R5

Load/store buffers (Busy, Address): all free

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1
      Add2
      Add3
      Mult1
      Mult2

Register result status (F2, F4, F6, F8, F10)
Clock 0, FU: (all blank)

Page 93: Computer Architectures M

95

Clock 1

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Instruction status
Instruction  j    k    Issue  Exe. compl.  Write result
FLD   F4     0    R10  1
FDIV  F8     F0   F4
FMUL  F4     F2   F3
FMUL  F4     F4   F4
FADD  F6     F10  F4
FLD   F4     0    R5

Load/store buffers
Name   Busy  Address
Load1  yes   R10

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1
      Add2
      Add3
      Mult1
      Mult2

Register result status (F2, F4, F6, F8, F10)
Clock 1, FU: F4 = Load1

Page 94: Computer Architectures M

96

Clock 2

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Instruction status
Instruction  j    k    Issue  Exe. compl.  Write result
FLD   F4     0    R10  1      2-
FDIV  F8     F0   F4   2
FMUL  F4     F2   F3
FMUL  F4     F4   F4
FADD  F6     F10  F4
FLD   F4     0    R5

Load/store buffers
Name   Busy  Address
Load1  yes   R10

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1
      Add2
      Add3
      Mult1  yes   div   F0             Load1
      Mult2

Register result status (F2, F4, F6, F8, F10)
Clock 2, FU: F4 = Load1, F8 = Mult1

Page 95: Computer Architectures M

97

Clock 3

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Instruction status
Instruction  j    k    Issue  Exe. compl.  Write result
FLD   F4     0    R10  1      2-3
FDIV  F8     F0   F4   2
FMUL  F4     F2   F3   3
FMUL  F4     F4   F4
FADD  F6     F10  F4
FLD   F4     0    R5

Load/store buffers
Name   Busy  Address
Load1  yes   R10

Reservation stations
Time  Name   Busy  Op    Vj  Vk  Qj     Qk
      Add1
      Add2
      Add3
      Mult1  yes   div   F0             Load1
      Mult2  yes   mul   F2  F3

Register result status (F2, F4, F6, F8, F10)
Clock 3, FU: F4 = Mult2, F8 = Mult1


CLOCK 4

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

Instruction status
Instruction   j     k      Issue   Exe. Compl.   Write Result
FLD   F4      0     R10    1       2-3           4
FDIV  F8      F0    F4     2       4-
FMUL  F4      F2    F3     3       4-
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

          Busy   Address
Load1
Load2
Load3
Store 1
Store 2

Reservation Stations
Time            Name    Busy   Op    Vj    Vk    Qj    Qk
                Add1
                Add2
                Add3
yet 39 cycles   Mult1   yes    div   F0    F4
yet 9 cycles    Mult2   yes    mul   F2    F3

Register result status
      F2      F4      F6      F8      F10
FU            Mult2           Mult1

FMUL F4 F4 F4 is stalled for lack of a free RS until cycle 13 (the end of the preceding multiplication; the multiply FU has only two slots). This blocks the emission of FADD, which could otherwise be issued since there are two free slots in the corresponding RS.
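The stall at clock 4 follows from in-order issue: an instruction that finds no free reservation station holds back every instruction behind it, even when their own stations are free. The following is a minimal sketch of that rule only. The function name `issue_in_order`, the RS class labels and the free-slot counts are illustrative assumptions, and the one-instruction-per-cycle limit is ignored.

```python
def issue_in_order(pending, free_rs):
    """Issue instructions in program order until one finds no free RS.

    pending: list of (instruction, rs_class) in program order.
    free_rs: dict mapping rs_class to the number of free reservation stations.
    Returns (issued, blocked); the caller's free_rs is not modified.
    """
    free = dict(free_rs)
    issued = []
    for idx, (instr, rs_class) in enumerate(pending):
        if free.get(rs_class, 0) == 0:
            # In-order issue: everything behind the stalled instruction waits too.
            return issued, [name for name, _ in pending[idx:]]
        free[rs_class] -= 1
        issued.append(instr)
    return issued, []


# State at clock 4 (assumed counts): both MUL/DIV stations are busy.
pending = [
    ("FMUL F4,F4,F4", "mult"),
    ("FADD F6,F10,F4", "add"),
    ("FLD F4,0(R5)", "load"),
]
issued, blocked = issue_in_order(pending, {"mult": 0, "add": 2, "load": 2})
print(issued)   # nothing can be issued
print(blocked)  # FMUL stalls; FADD and FLD wait behind it

# Once one MUL/DIV station is released, the three instructions can be issued.
issued_later, blocked_later = issue_in_order(
    pending, {"mult": 1, "add": 2, "load": 2}
)
```

The point of the sketch is that the free `add` and `load` stations are never consulted while the head of the queue is stalled.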


80000000: FLD  F4, 0(R10)
80000004: FDIV F8, F0, F4
80000008: FMUL F4, F2, F3
8000000C: FMUL F4, F4, F4
80000010: FADD F6, F10, F4
80000014: FLD  F4, 0(R5)

Dependences marked on the listing: RAW and WAW (all through F4)
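The dependences in this instruction stream can be derived mechanically from the destination and source registers. The sketch below is an illustration, not part of the slides: the tuple encoding of the program and the function name `find_hazards` are assumptions. It also reports WAR pairs, which the slide does not label.

```python
# Each instruction: (mnemonic, destination register, source registers).
# The loads read only their integer base register (R10, R5).
PROGRAM = [
    ("FLD",  "F4", ["R10"]),
    ("FDIV", "F8", ["F0", "F4"]),
    ("FMUL", "F4", ["F2", "F3"]),
    ("FMUL", "F4", ["F4", "F4"]),
    ("FADD", "F6", ["F10", "F4"]),
    ("FLD",  "F4", ["R5"]),
]


def find_hazards(program):
    """Return a list of (kind, earlier_index, later_index, register)."""
    hazards = []
    for j, (_, dst_j, srcs_j) in enumerate(program):
        # RAW: j reads a register whose most recent earlier writer is i.
        for src in sorted(set(srcs_j)):
            writers = [i for i in range(j) if program[i][1] == src]
            if writers:
                hazards.append(("RAW", writers[-1], j, src))
        # WAW: j overwrites a register already written by an earlier instruction.
        writers = [i for i in range(j) if program[i][1] == dst_j]
        last_writer = writers[-1] if writers else -1
        if writers:
            hazards.append(("WAW", last_writer, j, dst_j))
        # WAR: j overwrites a register read by an instruction issued
        # after the previous writer of that register.
        for i in range(last_writer + 1, j):
            if dst_j in program[i][2]:
                hazards.append(("WAR", i, j, dst_j))
    return hazards


hazards = find_hazards(PROGRAM)
for kind, i, j, reg in hazards:
    print(f"{kind}: {PROGRAM[i][0]} (#{i}) -> {PROGRAM[j][0]} (#{j}) on {reg}")
```

Every dependence found involves F4, which is why renaming F4 onto several physical registers removes the WAW and WAR conflicts and leaves only the true RAW dependences.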

ROB and register renaming.

The instructions are in any case inserted in the ROB as soon as a free slot and a physical register (one of the many associated with the same architectural register) are available, and are then executed when the FU and the operands are available (the policy of all modern processors). In this way instructions not only terminate OOO (with their results reordered in the ROB) but are also emitted even if the FU is not available. The execution is totally OOO, but with in-order commitment.
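The in-order commitment rule can be sketched as a queue from which only the oldest entry may retire. This is a simplified illustration under stated assumptions: the class name `ReorderBuffer`, its methods, the size of 5 and the completion order used in the example are all hypothetical, and only the field names (Addr, Op, Dest, Src) follow the slides.

```python
from collections import deque


class ReorderBuffer:
    """Minimal ROB model: insert in program order, commit from the head only."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()  # oldest instruction on the left

    def insert(self, addr, op, dest, src):
        if len(self.entries) == self.size:
            return False  # no free slot: the instruction cannot be emitted
        self.entries.append(
            {"addr": addr, "op": op, "dest": dest, "src": src, "done": False}
        )
        return True

    def mark_done(self, addr):
        # Execution may finish in any order.
        for entry in self.entries:
            if entry["addr"] == addr:
                entry["done"] = True

    def commit(self):
        """Retire finished instructions, stopping at the first unfinished one."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["addr"])
        return retired


rob = ReorderBuffer(size=5)
rob.insert(0x80000000, "FLD", "P0", "0,R10")
rob.insert(0x80000004, "FDIV", "Z1", "F0,P0")
rob.insert(0x80000008, "FMUL", "P1", "F2,F3")

rob.mark_done(0x80000008)  # hypothetical: FMUL finishes first, out of order
first = rob.commit()       # nothing retires, FLD at the head is not done
rob.mark_done(0x80000000)  # FLD finishes
second = rob.commit()      # FLD retires; FMUL still waits behind FDIV
```

Even though FMUL has finished, it stays in the ROB until FDIV ahead of it commits, so the architectural state is always updated in program order.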

Example - 2

Same instruction stream


Initial situation

ROB
Addr       Op.    Dest   Src
(empty)

RAT (Register Allocation Table): renaming registers for F4, F6 and F8

F4:   P0     P1     P2     P3     P4     P5
      Free   Free   Free   Free   Free   Arch

F6:   Q0     Q1     Q2     Q3     Q4     Q5
      Busy   Free   Free   Free   Free   Arch

F8:   Z0     Z1     Z2     Z3     Z4     Z5
      Busy   Free   Free   Free   Free   Arch

The first Free entries (P0, Q1, Z1) are the top free registers of the circular queues.

Arch: these are the architectural registers, which a program monitor would display.

Busy: these are registers in use by not yet committed instructions. They will become architectural registers when the related instructions are committed.

Here we assume that the instruction using Z0 precedes the instruction using Q0. The RAT entries for R5, R10, F0, F2 and F10 are not displayed.
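The Free/Busy/Arch life cycle of one RAT row can be sketched as a small circular queue. This is an illustrative model only: the class name `RenameQueue`, its methods and the choice of starting with the last register as architectural (as P5 is in the initial situation) are assumptions made for the example.

```python
FREE, BUSY, ARCH = "Free", "Busy", "Arch"


class RenameQueue:
    """Circular queue of physical registers backing one architectural register."""

    def __init__(self, prefix, size=6):
        self.names = [f"{prefix}{i}" for i in range(size)]
        self.state = [FREE] * size
        self.state[-1] = ARCH  # the committed value starts in the last register
        self.top = 0           # index of the top free register

    def allocate(self):
        """Rename a new destination: take the top free register, if any."""
        if self.state[self.top] != FREE:
            return None  # no renaming register available
        idx = self.top
        self.state[idx] = BUSY
        self.top = (self.top + 1) % len(self.names)
        return self.names[idx]

    def commit(self, name):
        """The instruction writing `name` commits: it becomes architectural."""
        old = self.state.index(ARCH)
        self.state[old] = FREE  # the previous architectural copy is released
        self.state[self.names.index(name)] = ARCH


f4 = RenameQueue("P")
allocated = [f4.allocate() for _ in range(3)]  # three in-flight writers of F4
f4.commit("P0")                                # the oldest writer commits
print(allocated)
print(list(zip(f4.names, f4.state)))
```

After the commit, P0 is the register a program monitor would display for F4, P5 is free again, and P1 and P2 stay Busy until their instructions commit.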


CLOCK 1

Instruction status
Instruction   j     k      Issue   Exe. Compl.   Write Result
FLD   F4      0     R10    1
FDIV  F8      F0    F4
FMUL  F4      F2    F3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

          Busy   Address
Load1     yes    R10
Load2
Load3
Store 1
Store 2

Reservation Stations
Time   Name    Busy   Op    Vj    Vk    Qj    Qk
       Add1
       Add2
       Add3
       Mult1
       Mult2

ROB
Addr       Op.    Dest   Src
80000000   FLD    P0     0,R10

RAT
F4:   P0     P1     P2     P3     P4     P5
      Busy   Free   Free   Free   Free   Arch

F6:   Q0     Q1     Q2     Q3     Q4     Q5
      Busy   Free   Free   Free   Free   Arch

F8:   Z0     Z1     Z2     Z3     Z4     Z5
      Busy   Free   Free   Free   Free   Arch

Renaming: the first available register for F4 (P0) is used.


CLOCK 2

Instruction status
Instruction   j     k      Issue   Exe. Compl.   Write Result
FLD   F4      0     R10    1       2-
FDIV  F8      F0    F4     2
FMUL  F4      F2    F3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

          Busy   Address
Load1     yes    R10
Load2
Load3
Store 1
Store 2

Reservation Stations
Time   Name    Busy   Op    Vj    Vk    Qj    Qk
       Add1
       Add2
       Add3
       Mult1   yes    div   F0                P0
       Mult2

ROB
Addr       Op.    Dest   Src
80000000   FLD    P0     0,R10
80000004   FDIV   Z1     F0,P0

RAT
F4:   P0     P1     P2     P3     P4     P5
      Busy   Free   Free   Free   Free   Arch

F6:   Q0     Q1     Q2     Q3     Q4     Q5
      Busy   Free   Free   Free   Free   Arch

F8:   Z0     Z1     Z2     Z3     Z4     Z5
      Busy   Busy   Free   Free   Free   Arch

Renaming: P0 in the FDIV sources is the most recently assigned physical register for F4.
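The renaming rule used in these slides (each destination takes the first available physical register, and each source reads the most recently assigned one) can be sketched as follows. The function name `rename`, the tuple encoding and the iterator-based free lists are assumptions for illustration. The free lists for F6 and F8 start at Q1 and Z1 because Q0 and Z0 are in use by earlier instructions.

```python
def rename(program, next_free):
    """Rename destinations and sources of a program.

    program: list of (op, dest, sources) using architectural registers.
    next_free: dict mapping an architectural register to an iterator over
               its free physical registers, in allocation order.
    """
    latest = {}   # architectural register -> most recently assigned physical one
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up before the destination mapping is updated,
        # so FMUL F4,F4,F4 reads the old F4 and writes a new one.
        phys_srcs = [latest.get(s, s) for s in srcs]
        phys_dest = next(next_free[dest])
        latest[dest] = phys_dest
        renamed.append((op, phys_dest, phys_srcs))
    return renamed


program = [
    ("FLD",  "F4", ["R10"]),
    ("FDIV", "F8", ["F0", "F4"]),
    ("FMUL", "F4", ["F2", "F3"]),
    ("FMUL", "F4", ["F4", "F4"]),
    ("FADD", "F6", ["F10", "F4"]),
]
free_lists = {
    "F4": iter(["P0", "P1", "P2", "P3"]),
    "F6": iter(["Q1", "Q2"]),
    "F8": iter(["Z1", "Z2"]),
}
renamed = rename(program, free_lists)
for op, dest, srcs in renamed:
    print(op, dest, ",".join(srcs))
```

Registers without an entry in `latest` (F0, F2, F3, F10, R10) are left under their architectural names, mirroring the slides, where their RAT rows are not displayed.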


CLOCK 3

Instruction status
Instruction   j     k      Issue   Exe. Compl.   Write Result
FLD   F4      0     R10    1       2-3
FDIV  F8      F0    F4     2
FMUL  F4      F2    F3     3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

          Busy   Address
Load1     yes    R10
Load2
Load3
Store 1
Store 2

Reservation Stations
Time   Name    Busy   Op    Vj    Vk    Qj    Qk
       Add1
       Add2
       Add3
       Mult1   yes    div   F0                P0
       Mult2   yes    mul   F2    F3

ROB
Addr       Op.    Dest   Src
80000000   FLD    P0     0,R10
80000004   FDIV   Z1     F0,P0
80000008   FMUL   P1     F2,F3

FDIV (80000004): waiting for F4 (P0)

RAT
F4:   P0     P1     P2     P3     P4     P5
      Busy   Busy   Free   Free   Free   Arch

F6:   Q0     Q1     Q2     Q3     Q4     Q5
      Busy   Free   Free   Free   Free   Arch

F8:   Z0     Z1     Z2     Z3     Z4     Z5
      Arch   Busy   Free   Free   Free   Free

The previous instruction using Z0 has been committed: Z0 is now the architectural register.


CLOCK 4

Instruction status
Instruction   j     k      Issue   Exe. Compl.   Write Result
FLD   F4      0     R10    1       2-3           4
FDIV  F8      F0    F4     2       4-
FMUL  F4      F2    F3     3       4-
FMUL  F4      F4    F4     4
FADD  F6      F10   F4
FLD   F4      0     R5

          Busy   Address
Load1
Load2
Load3
Store 1
Store 2

Reservation Stations
Time            Name    Busy   Op    Vj    Vk    Qj    Qk
                Add1
                Add2
                Add3
Yet 39 cycles   Mult1   yes    div   F0    P0
Yet 9 cycles    Mult2   yes    mul   F2    F3

ROB
Addr       Op.    Dest   Src
80000000   FLD    P0     0,R10
80000004   FDIV   Z1     F0,P0
80000008   FMUL   P1     F2,F3
8000000C   FMUL   P2     P1,P1

FLD (80000000): ended but not yet committed!
FMUL (8000000C): not yet executable, but inserted in the ROB anyway. It does not block the emission of the following instructions.

RAT
F4:   P0     P1     P2     P3     P4     P5
      Busy   Busy   Busy   Free   Free   Arch

F6:   Q0     Q1     Q2     Q3     Q4     Q5
      Arch   Free   Free   Free   Free   Free

F8:   Z0     Z1     Z2     Z3     Z4     Z5
      Arch   Busy   Free   Free   Free   Free

The instruction using Q0 has ended its execution: Q0 is now the architectural register.


CLOCK 5

Instruction status
Instruction   j     k      Issue   Exe. Compl.   Write Result
FLD   F4      0     R10    1       2-3           4
FDIV  F8      F0    F4     2       4-
FMUL  F4      F2    F3     3       4-
FMUL  F4      F4    F4     4
FADD  F6      F10   F4     5
FLD   F4      0     R5

          Busy   Address
Load1
Load2
Load3
Store 1
Store 2

Reservation Stations
Time            Name    Busy   Op    Vj    Vk    Qj    Qk
                Add1    yes    add   F10               P2
                Add2
                Add3
Yet 38 cycles   Mult1   yes    div   F0    P0
Yet 8 cycles    Mult2   yes    mul   F2    F3

ROB
Addr       Op.    Dest   Src
80000004   FDIV   Z1     F0,P0
80000008   FMUL   P1     F2,F3
8000000C   FMUL   P2     P1,P1
80000010   FADD   Q1     F10,P2

FMUL (8000000C): waiting for F4 (P1)

RAT
F4:   P0     P1     P2     P3     P4     P5
      Arch   Busy   Busy   Free   Free   Free

F6:   Q0     Q1     Q2     Q3     Q4     Q5
      Arch   Busy   Free   Free   Free   Free

F8:   Z0     Z1     Z2     Z3     Z4     Z5
      Arch   Busy   Free   Free   Free   Free

FLD committed: the architectural register F4 is now P0.