36
Lecture 6 Slide 1 EECS 470 © Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar EECS 470 Lecture 6 Tomasulo’s Algorithm Fall 2019 Prof. Ron Dreslinski http://www.eecs.umich.edu/courses/eecs470 Adder Decoder Floating Point Registers (FLR) Control 0 2 4 8 Floating Operand Stack Floating Point Buffers (FLB) 1 2 3 4 5 6 Store Data 1 2 3 Buffers (SDB) Control Storage Bus Instruction Unit Result Multiply/Divide Common Data Bus (CDB) Point Busy Bits Adder FLB Bus FLR Bus CDB Tags Tags Sink Tag Tag Source Ctrl. Sink Tag Tag Source Ctrl. Sink Tag Tag Source Ctrl. Sink Tag Tag Source Ctrl. Sink Tag Tag Source Ctrl. Result (FLOS) Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides. Portions developed in part by Profs. Austin, Brehob , Falsafi , Hill, Hoe, Lipasti , Shen , Smith, Sohi , Tyson, Vijaykumar , and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.

Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

Lecture 6 Slide 1EECS 470

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

EECS 470Lecture 6Tomasulo’s Algorithm

Fall 2019

Prof. Ron Dreslinski

http://www.eecs.umich.edu/courses/eecs470

Adder

Floating Point

Registers FLR

0

2

4

8

Store

Data

1

2

3

Buffers SDB

Control

Decoder

Floating

Operand

Stack

FLOS

Control

Floating Point

Buffers FLB

1

2

3

4

5

6

Decoder

Floating Point

Registers (FLR)

Control

0

2

4

8

Floating

Operand

Stack

Floating Point

Buffers (FLB)

1

2

3

4

5

6

Store

Data

1

2

3

Buffers (SDB)

Control

Storage Bus Instruction Unit

Result

Multiply/Divide

Common Data Bus (CDB)

Point

BusyBits

Adder

FLB Bus

FLR Bus

CDB ••

Tags

Tags

Sink TagTag Source Ctrl.

Sink TagTag Source Ctrl.

Sink TagTag Source Ctrl.

Sink TagTag Source Ctrl.

Sink TagTag Source Ctrl.

Result

(FLOS)

Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides. Portions developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.

Page 2: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

Lecture 6 Slide 2EECS 470

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Readings

For Today:r H & P Chapter 2.4-2.6r D. Sima “Design Space of Register Renaming Techniques”

For Monday:r Smith & Pleszkun “Implementing Precise Interrupts”

Page 3: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

Lecture 6 Slide 3EECS 470

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

In-Order vs. Scoreboard

• Big speedup? – Only 1 cycle advantage for scoreboard

m Why? addi WAR hazardm Scoreboard issued addi earlier (c8® c5)m But WAR hazard delayed W until c9m Delayed issue of second iteration

In-Order ScoreboardInsn D X W D S X Wldf X(r1),f1 c1 c2 c3 c1 c2 c3 c4mulf f0,f1,f2 c3 c4+ c7 c2 c4 c5+ c8stf f2,Z(r1) c7 c8 c9 c3 c8 c9 c10addi r1,4,r1 c8 c9 c10 c4 c5 c6 c9ldf X(r1),f1 c10 c11 c12 c5 c9 c10 c11mulf f0,f1,f2 c12 c13+ c16 c8 c11 c12+ c15stf f2,Z(r1) c16 c17 c18 c10 c15 c16 c17

Page 4: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 4

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Scoreboard Redux• The good

+ Cheap hardware

• InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area

+ Pretty good performance

• 1.7X for FORTRAN (scientific array) programs

• The less good

– No bypassing

– Limited scheduling scope

• Structural/WAW hazards delay dispatch

– Slow issue of truly-dependent (RAW) insns

• WAR hazards delay writeback

• Fix with hardware register renaming

Page 5: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 5

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Register Renaming• Register renaming (in hardware)

• Change register names to eliminate WAR/WAW hazards• One of most the beautiful things in computer architecture• Key: think of registers (r1,f0…) as names, not storage locations+ Can have more locations than names+ Can have multiple active versions of same name

• How does it work?• Map-table: maps names to most recent locations

• SRAM indexed by name• On a write: allocate new location, note in map-table• On a read: find location of most recent write via map-table lookup• Small detail: must de-allocate locations at some point

Page 6: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 6

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Scheduling Algorithm II: Tomasulo• Tomasulo’s algorithm

• Reservation stations (RS): instruction buffer• Common data bus (CDB): broadcasts results to RS• Register renaming: removes WAR/WAW hazards

• First implementation: IBM 360/91 [1967]• Dynamic scheduling for FP units only• Bypassing

• Our example: “Simple Tomasulo”• Dynamic scheduling for everything, including load/store• No bypassing (for comparison with Scoreboard)• 5 RS: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, pipelined)

Page 7: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 7

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo Data Structures• Reservation Stations (RS#)

• FU, busy, op, R: destination register name

• T: destination register tag (RS# of this RS)

• T1,T2: source register tags (RS# of RS that will produce value)

• V1,V2: source register values

• That’s new

• Map Table• T: tag (RS#) that will write this register

• Common Data Bus (CDB)• Broadcasts <RS#, value> of completed insns

• Tags interpreted as ready-bits++• T==0 ® Value is ready somewhere

• T!=0 ® Value is not ready, wait until CDB broadcasts T

Page 8: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 8

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Simple Tomasulo Data Structures

• Insn fields and status bits• Tags• Values

value

V1 V2

FU

T

T2T1Top========

Map Table

Reservation Stations

CD

B.V

CD

B.T

Fetchedinsns

Regfile

R

T

========

Page 9: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 9

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Simple Tomasulo Pipeline• New pipeline structure: F, D, S, X, W

• D (dispatch)• Structural hazard ? stall : allocate RS entry

• S (issue)• RAW hazard ? wait (monitor CDB) : go to execute

• W (writeback)• Write register, free RS entry• W and RAW-dependent S in same cycle• W and structural-dependent D in same cycle

Page 10: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 10

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo Dispatch (D)

• Stall for structural (RS) hazards• Allocate RS entry• Input register ready ? read value into RS : read tag into RS• Set register status (i.e., rename) for ouput register

value

V1 V2

FU

T

T2T1Top========

Map Table

Reservation Stations

CD

B.V

CD

B.T

Fetchedinsns

Regfile

R

T

========

Page 11: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 11

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo Issue (S)

• Wait for RAW hazards• Read register values from RS

value

V1 V2

FU

T

T2T1Top========

Map Table

Reservation Stations

CD

B.V

CD

B.T

Fetchedinsns

Regfile

R

T

========

Page 12: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 12

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo Execute (X)

value

V1 V2

FU

T

T2T1Top========

Map Table

Reservation Stations

CD

B.V

CD

B.T

Fetchedinsns

Regfile

R

T

========

Page 13: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 13

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo Writeback (W)

• Wait for structural (CDB) hazards• Output Reg Status tag still matches? clear, write result to register• CDB broadcast to RS: tag match ? clear tag, copy value• Free RS entry

value

V1 V2

FU

T

T2T1Top========

Map Table

Reservation Stations

CD

B.V

CD

B.T

Fetchedinsns

Regfile

R

T

========

Page 14: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 14

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Difference Between Scoreboard…

FU Status

R1 R2

XS Insn value

FU

T

T2T1Top========

Reg Status

Fetchedinsns

Regfile

R

T

========

Page 15: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 15

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

…And Tomasulo

• What in Tomasulo implements register renaming?• Value copies in RS (V1, V2)• Insn stores correct input values in its own RS entry+ Future insns can overwrite master copy in regfile, doesn’t matter

value

V1 V2

FU

T

T2T1Top========

Map Table

Reservation Stations

CD

B.V

CD

B.T

Fetchedinsns

Regfile

R

T

========

Page 16: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 16

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Value/Copy-Based Register Renaming• Tomasulo-style register renaming

• Called “value-based” or “copy-based”

• Names: architectural registers

• Storage locations: register file and reservation stations

• Values can and do exist in both

• Register file holds master (i.e., most recent) values

+ RS copies eliminate WAR hazards

• Storage locations referred to internally by RS# tags

• Register table translates names to tags

• Tag == 0 value is in register file

• Tag != 0 value is not ready and is being computed by RS#

• CDB broadcasts values with tags attached

• So insns know what value they are looking at

Page 17: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 17

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo Data StructuresInsn StatusInsn D S X Wldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1) addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)

Map TableReg Tf0f1f2r1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD no3 ST no4 FP1 no5 FP2 no

CDBT V

Page 18: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 18

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 1Insn StatusInsn D S X Wldf X(r1),f1 c1mulf f0,f1,f2stf f2,Z(r1) addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2r1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD yes ldf f1 - - - [r1]3 ST no4 FP1 no5 FP2 no

CDBT V

allocate

Page 19: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 19

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 2Insn StatusInsn D S X Wldf X(r1),f1 c1 c2mulf f0,f1,f2 c2stf f2,Z(r1) addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2 RS#4r1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD yes ldf f1 - - - [r1]3 ST no4 FP1 yes mulf f2 - RS#2 [f0] -5 FP2 no

CDBT V

allocate

Page 20: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 20

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 3Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3mulf f0,f1,f2 c2stf f2,Z(r1) c3addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2 RS#4r1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD yes ldf f1 - - - [r1]3 ST yes stf - RS#4 - - [r1]4 FP1 yes mulf f2 - RS#2 [f0] -5 FP2 no

CDBT V

allocate

Page 21: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 21

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 4Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4stf f2,Z(r1) c3addi r1,4,r1 c4ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2 RS#4r1 RS#1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU yes addi r1 - - [r1] -2 LD no3 ST yes stf - RS#4 - - [r1]4 FP1 yes mulf f2 - RS#2 [f0] CDB.V5 FP2 no

CDBT VRS#2 [f1]

allocate

ldf finished (W) clear f1 RegStatusCDB broadcast

free

RS#2 ready ®grab CDB value

Page 22: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 22

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 5Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5stf f2,Z(r1) c3addi r1,4,r1 c4 c5ldf X(r1),f1 c5mulf f0,f1,f2stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2 RS#4r1 RS#1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU yes addi r1 - - [r1] -2 LD yes ldf f1 - RS#1 - -3 ST yes stf - RS#4 - - [r1]4 FP1 yes mulf f2 - - [f0] [f1]5 FP2 no

CDBT V

allocate

Page 23: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 23

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 6Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+stf f2,Z(r1) c3addi r1,4,r1 c4 c5 c6ldf X(r1),f1 c5mulf f0,f1,f2 c6stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2 RS#4RS#5r1 RS#1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU yes addi r1 - - [r1] -2 LD yes ldf f1 - RS#1 - -3 ST yes stf - RS#4 - - [r1]4 FP1 yes mulf f2 - - [f0] [f1]5 FP2 yes mulf f2 - RS#2 [f0] -

CDBT V

allocate

no D stall on WAW: scoreboard wouldoverwrite f2 RegStatusanyone who needs old f2 tag has it

Page 24: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 24

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 7Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+stf f2,Z(r1) c3addi r1,4,r1 c4 c5 c6 c7ldf X(r1),f1 c5 c7mulf f0,f1,f2 c6stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2 RS#5r1 RS#1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD yes ldf f1 - RS#1 - CDB.V3 ST yes stf - RS#4 - - [r1]4 FP1 yes mulf f2 - - [f0] [f1]5 FP2 yes mulf f2 - RS#2 [f0] -

CDBT VRS#1 [r1]

addi finished (W) clear r1 RegStatusCDB broadcast

RS#1 ready ®grab CDB value

no W wait on WAR: scoreboard wouldanyone who needs old r1 has RS copy

D stall on store RS: structural

Page 25: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 25

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 8Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+ c8stf f2,Z(r1) c3 c8addi r1,4,r1 c4 c5 c6 c7ldf X(r1),f1 c5 c7 c8mulf f0,f1,f2 c6stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2 RS#5r1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD yes ldf f1 - - - [r1]3 ST yes stf - RS#4 - CDB.V [r1]4 FP1 no5 FP2 yes mulf f2 - RS#2 [f0] -

CDBT VRS#4 [f2]

mulf finished (W) don’t clear f2 RegStatusalready overwritten by 2nd mulf (RS#5)

CDB broadcast

RS#4 ready ®grab CDB value

Page 26: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 26

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 9Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+ c8stf f2,Z(r1) c3 c8 c9addi r1,4,r1 c4 c5 c6 c7ldf X(r1),f1 c5 c7 c8 c9mulf f0,f1,f2 c6 c9stf f2,Z(r1)

Map TableReg Tf0f1 RS#2f2 RS#5r1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD no3 ST yes stf - - - [f2] [r1]4 FP1 no5 FP2 yes mulf f2 - RS#2 [f0] CDB.V

CDBT VRS#2 [f1]

RS#2 ready ®grab CDB value

2nd ldf finished (W) clear f1 RegStatusCDB broadcast

Page 27: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 27

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Tomasulo: Cycle 10Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+ c8stf f2,Z(r1) c3 c8 c9 c10addi r1,4,r1 c4 c5 c6 c7ldf X(r1),f1 c5 c7 c8 c9mulf f0,f1,f2 c6 c9 c10stf f2,Z(r1) c10

Map TableReg Tf0f1f2 RS#5r1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD no3 ST yes stf - RS#5 - - [r1]4 FP1 no5 FP2 yes mulf f2 - - [f0] [f1]

CDBT V

free ® allocate

stf finished (W) no output register ® no CDB broadcast

Page 28: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 28

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Scoreboard vs. TomasuloScoreboard Tomasulo

Insn D S X W D S X Wldf X(r1),f1 c1 c2 c3 c4 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+ c8 c2 c4 c5+ c8stf f2,Z(r1) c3 c8 c9 c10 c3 c8 c9 c10addi r1,4,r1 c4 c5 c6 c9 c4 c5 c6 c7ldf X(r1),f1 c5 c9 c10 c11 c5 c7 c8 c9mulf f0,f1,f2 c8 c11 c12+ c15 c6 c9 c10+ c13stf f2,Z(r1) c10 c15 c16 c17 c10 c13 c14 c15Hazard Scoreboard TomasuloInsn buffer stall in D stall in DFU wait in S wait in SRAW wait in S wait in SWAR wait in W noneWAW stall in D none

Page 29: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 29

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Scoreboard vs. Tomasulo II: Cache MissScoreboard Tomasulo

Insn D S X W D S X Wldf X(r1),f1 c1 c2 c3+ c8 c1 c2 c3+ c8mulf f0,f1,f2 c2 c8 c9+ c12 c2 c8 c9+ c12stf f2,Z(r1) c3 c12 c13 c14 c3 c12 c13 c14addi r1,4,r1 c4 c5 c6 c13 c4 c5 c6 c7ldf X(r1),f1 c8 c13 c14 c15 c5 c7 c8 c9mulf f0,f1,f2 c12 c15 c16+ c19 c6 c9 c10+ c13stf f2,Z(r1) c13 c19 c20 c21 c7 c13 c14 c15

• Assume• 5 cycle cache miss on first ldf• Ignore FUST and RS structural hazards+ Advantage Tomasulo

• No addi WAR hazard (c7) means iterations run in parallel

Page 30: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 30

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Improvements with Combined RSInsn StatusInsn D S X Wldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1) addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)

Map TableReg Tf0f1f2r1

Reservation StationsT FU busy op R T1 T2 V1 V21 ALU no2 LD no3 ST no4 FP1 no5 FP2 no

CDBT V

Page 31: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 31

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Can We Add Superscalar?• Dynamic scheduling and multiple issue are orthogonal

• E.g., Pentium4: dynamically scheduled 5-way superscalar

• Two dimensions

• N: superscalar width (number of parallel operations)

• W: window size (number of reservation stations)

• What do we need for an N-by-W Tomasulo?• RS: N tag/value w-ports (D), 2N value r-ports (S), 2N tag CAMs (W)

• Select logic: W®N priority encoder (S)

• MT: 2N r-ports (D), N w-ports (D)

• RF: 2N r-ports (D), N w-ports (W)

• CDB: N (W)

• Which are the expensive pieces?

Page 32: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 32

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Superscalar Select Logic• Superscalar select logic: W®N priority encoder

– Somewhat complicated (N2 logW)• Can simplify using different RS designs

• Split design• Divide RS into N banks: 1 per FU? • Implement N separate W/N®1 encoders+ Simpler: N * logW/N– Less scheduling flexibility

• FIFO design [Palacharla+]• Can issue only head of each RS bank + Simpler: no select logic at all– Less scheduling flexibility (but surprisingly not that bad)

Page 33: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 33

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Can We Add Bypassing?

• Yes, but it’s more complicated than you might think

value

V1 V2

FU

T

T2T1Top========

Map Table

Reservation Stations

CD

B.V

CD

B.T

Fetchedinsns

Regfile

R

T

========

Page 34: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 34

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Why Out-of-Order Bypassing Is Hard

• Bypassing: ldf X in c3 ® mulf X in c4 ® mulf S in c3• But how can mulf S in c3 if ldf W in c4? Must change pipeline

• Modern scheduler• Split CDB tag and value, move tag broadcast to S

• ldf tag broadcast now in cycle 2 ® mulf S in cycle 3• How do multi-cycle operations work? How do cache misses work?

No Bypassing BypassingInsn D S X W D S X Wldf X(r1),f1 c1 c2 c3 c4 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+ c8 c2 c3 c4+ c7stf f2,Z(r1) c3 c8 c9 c10 c3 c6 c7 c8addi r1,4,r1 c4 c5 c6 c7 c4 c5 c6 c7ldf X(r1),f1 c5 c7 c8 c9 c5 c7 c7 c9mulf f0,f1,f2 c6 c9 c10+ c13 c6 c9 c8+ c13stf f2,Z(r1) c10 c13 c14 c15 c10 c13 c11 c15

Page 35: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 35

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Dynamic Scheduling Summary• Dynamic scheduling: out-of-order execution

• Higher pipeline/FU utilization, improved performance• Easier and more effective in hardware than software

+ More storage locations than architectural registers+ Dynamic handling of cache misses

• Instruction buffer: multiple F/D latches• Implements large scheduling scope + “passing” functionality• Split decode into in-order dispatch and out-of-order issue

• Stall vs. wait

• Dynamic scheduling algorithms• Scoreboard: no register renaming, limited out-of-order• Tomasulo: copy-based register renaming, full out-of-order

Page 36: Storage Bus Instruction Unit EECS 470 Lecture 6 Tomasulo ... · Scheduling Algorithm II: Tomasulo •Tomasulo’s algorithm •Reservation stations (RS): instruction buffer •Common

EECS 470Lecture 6 Slide 36

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

Are we done?

• When can Tomasulo go wrong?• Exceptions!!

• No way to figure out relative order of instructions in RS

• Next Lecture: How to make exceptions work