Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters J. Nelson Amaral


Page 1: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters
J. Nelson Amaral

Page 2: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Tomasulo Algorithm

Page 3: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

IBM 360/91 Floating Point Arithmetic Unit

Tomasulo Algorithm: a reservation station for each functional unit.

Baer, p. 97

Free/Occupied bit

Flag = on → Data = value
Flag = off → Data = tag

A tag (pointer) to the ROB entry that will store the result.
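The entry layout above can be sketched as a small data structure (an illustrative Python sketch; the field names are mine, not Baer's):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSEntry:
    """One reservation station entry (illustrative field names)."""
    occupied: bool = False          # Free/Occupied bit
    flag1: bool = False             # on -> data1 holds a value; off -> a ROB tag
    data1: Optional[int] = None     # operand value, or ROB-entry tag while waiting
    flag2: bool = False
    data2: Optional[int] = None
    tag: Optional[int] = None       # ROB entry that will store the result

rs = RSEntry()
rs.occupied, rs.flag1, rs.data1 = True, False, 4  # waiting on ROB entry 4
print(rs.flag1)  # False: data1 currently holds a tag, not a value
```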

Page 4: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Decode-rename Stage

Reservation station available?
  No → structural hazard: stall incoming instructions.
  Yes → free ROB entry?
    No → structural hazard: stall incoming instructions.
    Yes → assign a reservation station and the tail of the ROB to the instruction.

Baer p. 97
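The decision in the flowchart above can be sketched as follows (a hedged sketch; the return strings are illustrative, not from Baer):

```python
def decode_rename(rs_free: bool, rob_free: bool) -> str:
    """Decode-rename decision: stall on a structural hazard when either no
    reservation station or no ROB entry is available; otherwise allocate both."""
    if not rs_free:
        return "stall: no reservation station"
    if not rob_free:
        return "stall: no free ROB entry"
    return "assign RS and tail of ROB to instruction"

print(decode_rename(True, True))  # both resources free: instruction proceeds
```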

Page 5: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Dispatch Stage

Map for each source operand?
  Logical register → forward the value to the Reservation Station (RS); ReadyBit(RS) ← 1.
  ROB entry → ROB entry flag?
    Value → forward the value to the RS; ReadyBit(RS) ← 1.
    Tag → forward the ROB tag to the RS; ReadyBit(RS) ← 0.

Map the result register to a tag and enter the tag into the RS.
Enter the instruction at the tail of the ROB; ResultFlag(tail of ROB) ← 0.

Baer p. 98
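The per-operand decision at dispatch can be sketched like this (an illustrative sketch; the data structures are simplified stand-ins):

```python
def dispatch_operand(reg_map, rob, regfile, reg):
    """Resolve one source operand at dispatch.

    reg_map: register -> ROB index, absent if the logical register is current.
    rob: ROB index -> (flag, data); flag True means data is the value.
    Returns (ready_bit, payload): a value when ready, a ROB tag otherwise."""
    if reg not in reg_map:                 # operand lives in the register file
        return 1, regfile[reg]
    entry = reg_map[reg]
    flag, data = rob[entry]
    if flag:                               # ROB already holds the value
        return 1, data
    return 0, entry                        # forward the ROB tag: not ready

regfile = {"R2": 7}
rob = {4: (False, None), 5: (True, 42)}
reg_map = {"R4": 4, "R6": 5}
print(dispatch_operand(reg_map, rob, regfile, "R2"))  # (1, 7)
print(dispatch_operand(reg_map, rob, regfile, "R4"))  # (0, 4)
print(dispatch_operand(reg_map, rob, regfile, "R6"))  # (1, 42)
```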

Page 6: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Issue Stage

Both flags in the RS are on?
  No → wait.
  Yes → functional unit stalled (waiting for the CDB)?
    Yes → wait.
    No → issue the instruction to the functional unit to start execution.

If multiple functional units of the same type are available, use a scheduling algorithm.

CDB = Common Data Bus

Baer p. 98
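The issue condition reduces to one boolean test (a minimal sketch of the flowchart above):

```python
def can_issue(flag1: bool, flag2: bool, unit_stalled: bool) -> bool:
    """Issue check: both operand flags in the RS are on and the functional
    unit is not stalled waiting for the CDB."""
    return flag1 and flag2 and not unit_stalled

print(can_issue(True, True, False))   # True: instruction starts execution
print(can_issue(True, False, False))  # False: an operand is still a tag
```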

Page 7: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Execute

Last cycle of execution?
  No → keep executing.
  Yes → got ownership of the CDB?
    No → wait.
    Yes → broadcast the result and the associated tag.

If multiple functional units request ownership of the Common Data Bus (CDB) on the same cycle, a hardwired priority protocol picks the winner.

On a broadcast, the ROB stores the result in the entry identified by the tag and sets the corresponding ReadyBit; reservation stations holding the same tag store the result and set the corresponding flag.

Baer p. 98
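One CDB broadcast updates the tagged ROB entry and every waiting reservation station. A sketch (illustrative data structures, not Baer's):

```python
def broadcast(tag, value, rob, stations):
    """CDB broadcast: set the tagged ROB entry's flag and value, then update
    every reservation-station operand still waiting on that tag."""
    rob[tag] = (True, value)               # ResultFlag on, value stored
    for rs in stations:
        if not rs["flag1"] and rs["data1"] == tag:
            rs["flag1"], rs["data1"] = True, value
        if not rs["flag2"] and rs["data2"] == tag:
            rs["flag2"], rs["data2"] = True, value

rob = {1: (False, None)}
stations = [{"flag1": False, "data1": 1, "flag2": True, "data2": 9}]
broadcast(1, 30, rob, stations)
print(rob[1], stations[0]["data1"])  # (True, 30) 30
```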

Page 8: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Commit Stage

Is there a result at the head of the ROB?
  No → wait.
  Yes → store the result in the logical register and delete the ROB entry.

Baer p. 97
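The commit decision can be sketched in a few lines (illustrative; the tuple layout is mine):

```python
def commit(rob_head, regfile):
    """Commit stage sketch: rob_head is (result_flag, value, logical_reg).
    Commit only when the head of the ROB already has its result."""
    flag, value, reg = rob_head
    if not flag:
        return False                      # no result yet: wait
    regfile[reg] = value                  # store result in the logical register
    return True                           # caller then deletes the ROB entry

regfile = {}
print(commit((False, None, "R4"), regfile))             # False: must wait
print(commit((True, 30, "R4"), regfile), regfile["R4"])  # True 30
```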

Page 9: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Operation Timings

Assuming no dependencies

Baer, p. 98

[Timing diagrams, cycles 0–7, for an addition and a multiplication: decoded → dispatched → issued → finish execution → broadcast → commit (if at the head of the ROB).]

Page 10: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Example

i1: R4 ← R0 * R2   # use reservation station 1 of multiplier
i2: R6 ← R4 * R8   # use reservation station 2 of multiplier
i3: R8 ← R2 + R12  # use reservation station 1 of adder
i4: R4 ← R14 + R16 # use reservation station 2 of adder

Page 11: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

[Snapshot: the Register Map maps R4 → E1 and R6 → E2. The ROB holds E1 (flag 0, for R4, head) and E2 (flag 0, for R6, tail). In the multiplier reservation stations, i1 is executing and i2 is dispatched: i2 waits on tag E1 for its first operand, has (R8) for its second, and carries result tag E2. The adder reservation stations are free.]

Page 12: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

[Snapshot: R4 has been renamed again, so the Register Map now maps R4 → E4 (with R6 → E2 and R8 → E3). The ROB holds E1 (R4, head), E2 (R6), E3 (R8), and E4 (R4, tail). i2 still waits in a multiplier reservation station on tag E1, and i4 sits in an adder reservation station with both operands, (R14) and (R16), ready and result tag E4.]

“register R4, which was renamed as ROB entry E1 and tagged as such in the reservation station Mult2, is now mapped to ROB entry E4.” (Baer, p. 102)

Page 13: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

[Snapshot: the adder has broadcast the result of i3, so ROB entry E3 now holds the value (i3) for R8 with its flag set; i1 and i4 are ready to broadcast, i2 remains dispatched, and the adder reservation stations are free again.]

Assume Adder has priority to broadcast.

Page 14: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

[Snapshot: the adder has broadcast the result of i4, so ROB entry E4 now holds (i4) for R4 with its flag set (E3 already holds (i3) for R8); i1 is still ready to broadcast and i2 remains dispatched, waiting on tag E1.]

Assume Adder has priority to broadcast.

Page 15: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

[Snapshot: i1 has broadcast, so ROB entry E1 (head) now holds (i1) for R4 with its flag set, and the multiplier reservation station holding i2 has captured (i1) as its first operand: both of i2's flags are now on.]

Page 16: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

[Snapshot: i1 commits and i2 is executing; all reservation stations are free. The ROB holds (i4) for R4, (i1) for R4, E2 (R6, now the head), and (i3) for R8 (tail).]

Page 17: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

IBM 360/91 – unveiled in 1966

Page 18: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Some variant of the Tomasulo algorithm is the basis for the design of all out-of-order processors.

Baer p. 97

Page 19: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Data dependences between instructions

Where should these instructions wait?

How do they become ready for issue?

Several instructions get to the end of the front end and have to wait for operands.

Baer p. 177

Page 20: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Wakeup Stage

Detects instruction readiness.

We hope for m instructions to be woken up on each cycle.

Baer p. 177

Page 21: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Select Step

• Also called the scheduling step: arbitrates between multiple instructions vying for the same functional unit.
  – Variations of first-come-first-served (or FIFO).

• Bypassing (or forwarding) of operands to units allows earlier selection.

• Critical instructions may have preference for selection.

Baer p. 177

Page 22: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Out-of-Order Architectures

Key idea: allow instructions following a stalled one to start execution out of order.

A FIFO schedule is not a good idea!

Where to store stalled instructions?

Baer p. 178

Page 23: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Two Extreme Solutions

Tomasulo: a separate reservation station for each functional unit (distributed window). Example: the IBM PowerPC series.

Instruction window: a centralized reservation station for all functional units (centralized window). Example: the Intel P6 architecture.

Baer p. 178

Page 24: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

A Hybrid Solution

Reservation stations are shared among groups of functional units (hybrid window).

MIPS R10000: 3 sets of reservation stations:
• address calculations
• floating-point units
• load-store units

Baer p. 178

Page 25: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

How does a design team select between a centralized, distributed, or hybrid window?

What are the compromises?

Baer p. 179

Page 26: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Window design

• Resource allocation: centralized is better
  – static partitioning of resources is worse than dynamic allocation
• Large windows: speed and power come into play

Baer p. 179

Page 27: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Two-Step Instruction Issue

Wakeup: instruction is ready for execution

Select: instruction is assigned to an execution unit.

Page 28: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Wakeup Step

Baer p. 180

[Figure: w window entries connected to f functional units.]

Page 29: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Window entry with buses from 8 exec units

Page 30: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Wakeup Step

Baer p. 180

[Figure: w window entries connected to f functional units.]

We need one bus from each functional unit to each window entry, and two comparators for each functional unit in each window entry. Thus we need 2fw comparators.

If we separate the functional units and window slots into two equal-size groups, we only need fw/2 comparators per group. We will also need fewer (shorter) buses from units to slots.

Page 31: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Select Step

• Priority encoder: a circuit that receives several requests and issues one grant

• Woken-up instructions vying for the same unit send requests.
• Priority is related to position in the window.

• Smaller window → smaller priority encoder

Baer p. 181
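A priority encoder that grants the lowest-indexed requester can be sketched as (an illustrative sketch of the behavior, not a circuit):

```python
def select(requests):
    """Priority encoder sketch: grant the requesting window entry with the
    lowest index (position in the window encodes priority); None if no request."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None

print(select([False, True, True]))  # 1: entry 1 wins over entry 2
```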

Page 32: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

When should a centralized window be replaced by a distributed or hybrid one?

When the wakeup and select steps are on the critical path.

The threshold appears to be windows with around 64 entries on a 4-wide superscalar processor. Baer p. 182

Page 33: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Intel Pentium 4: 2 large windows, 2 schedulers per window.

Intel Pentium III and Intel Core: smaller centralized window.

AMD Opteron: 4 sets of reservation stations.

Baer p. 182

Page 34: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Relation between Select and Wake Up

i: R51 ← R22 + R33

i+1: R43 ← R27 – R51

Example:

The name given to the result of instruction i (R51) must be broadcast as soon as instruction i is selected.

Broadcasting the tag of R51 wakes up instruction i+1.

For single-cycle-latency instructions, the start of execution is too late to broadcast the tag.

Baer p. 183

Page 35: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Speculative Wake Up and Select

i: R51 ← load(R22)
i+1: R43 ← R27 – R51

i+2: R35 ← R51 + R28

Example:

In this case the tag of the destination of instruction i is broadcast.

Instructions i+1 and i+2 are speculatively woken up and selected based on a cache-hit latency.

In the case of a cache miss, all dependent instructions that have been woken up and selected must be aborted.

Baer p. 183

Page 36: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Speculative Selection and the Reservation Stations

• An instruction must remain in a reservation station after it is scheduled.
  – A bit indicates that the instruction has been selected.
  – The station is freed once it is certain that the selection is no longer speculative.
• Windows are large in comparison with the number of functional units.
  – They accommodate many instructions in flight, some speculatively.

Baer p. 183

Page 37: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Integrated Register File

Tomasulo reservation stations: each reservation station holds the opcode and the operands and feeds its functional unit directly.

Integrated register file: the instruction window holds the opcode, and the operands are read from a physical register file when the instruction is selected.

What happens upon selection of an instruction?

Baer p. 183

Page 38: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

The complexity of Bypassing

Example:

i: R51 ← R22 + R33
i+1: R43 ← R27 – R51

[Figure: functional unit A computes i and functional unit B computes i+1. The output of A must be forwarded to B, bypassing storage.]

Baer p. 183

Page 39: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

The complexity of Bypassing

Example:

i: R51 ← R22 + R33
i+1: R43 ← R27 – R51

[Figure: now functional unit A computes i+1 as well, so the bypass must forward A's output back to its own input. But the hardware has to implement both buses.]

Baer p. 183

Page 40: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

The complexity of Bypassing

Example:

i: R51 ← R22 + R33
i+1: R43 ← R27 – R51

[Figure: functional unit A computes i and functional unit B computes i+1. We also need buses to forward the output of B.]

In general, given k functional units we may need k² buses. Buses become long to avoid crossing each other. Forwarding may limit the number of functional units in a processor, and forwarding may need more than one cycle to complete.

Baer p. 184

Page 41: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load Speculation

• Load address speculation
  – Used for data prefetching
• Memory dependence prediction
  – Used to speculate data flow from a store to a subsequent load

Baer p. 185

Page 42: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Store Buffer

• Store buffer: a circular queue
  – An entry is allocated when a store instruction is decoded
  – An entry is removed when the store is committed
  – It keeps data for stores that have not yet committed

Baer p. 185
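The allocate-at-decode, remove-at-commit discipline can be sketched as a FIFO queue (an illustrative sketch; the class and method names are mine):

```python
from collections import deque

class StoreBuffer:
    """Store buffer as a circular queue: an entry is allocated when a store
    decodes and removed, in order, when the store commits."""
    def __init__(self):
        self.q = deque()

    def allocate(self, store_id):
        # At decode: new entry starts in the AV (available/allocated) state.
        self.q.append({"id": store_id, "state": "AV"})

    def commit_head(self):
        # At commit: the oldest store leaves the buffer (FIFO order).
        return self.q.popleft()

sb = StoreBuffer()
sb.allocate("i1")
sb.allocate("i2")
print(sb.commit_head()["id"])  # i1: the oldest store commits first
```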

Page 43: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

States of a Store Buffer Entry

• AV: Available
• AD: Address is known (the data to be stored is still to be computed by another instruction)
• RE: Result and address known
• CO: Committed

Transitions: address computation moves an entry from AV to AD; once the data is available the entry moves to RE; when the store instruction reaches the top of the ROB the entry becomes CO; once the data is written to the cache the entry becomes available again.

What happens with the store buffer on a branch misprediction?

Baer p. 185

Page 44: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Handling Store Buffer on Branch Misprediction and Exceptions.

• Entries preceding the mispredicted branch:
  – are in COMMIT state
  – must be written to cache
• Entries following the misprediction:
  – become AVAILABLE
• Exceptions are handled similarly:
  – the COMMIT entries must be written to cache before handling the exception

Baer p. 186

Page 45: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load Instructions and Load Speculation

Baer p. 187

Page 46: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load/Store Window Implementation – Most Restricted

A single window (FIFO) for both loads and stores: loads and stores are inserted in program order and removed in the same order, at most one per cycle.

Baer p. 187

Page 47: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load Bypassing

• Compare the address of the load with all addresses in the store buffer.
  – Load bypassing: if there is no match → the load can proceed.
  – What happens if the operand address of any entry in the store buffer is not yet computed?
    • The load cannot proceed.
  – What happens if there is a match to an entry that is not committed?
    • The load cannot access the cache.
    • The “match” is the last match in program order.
• This requires an associative search of the operand addresses in the store buffer.

Baer p. 187

Page 48: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load Forwarding

• If these conditions are true:
  – the load matches a store buffer entry, AND
  – the result is available for that entry (the entry is in RE or CO state),
• then the result can be sent to the register specified by the load.
• If the match is with an entry in AD state, the load waits for the entry to reach RE state.
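The bypassing and forwarding rules combine into one decision per load. A hedged sketch (the entry layout and return strings are illustrative):

```python
def resolve_load(load_addr, store_buffer):
    """Load bypassing / forwarding decision.

    store_buffer: oldest-to-youngest entries preceding the load, each a dict
    (state, addr, value); state in {"AV", "AD", "RE", "CO"}; addr is None
    while the store address is still uncomputed."""
    if any(e["addr"] is None for e in store_buffer):
        return "wait"                      # some store address is unknown
    match = None                           # the last match in program order wins
    for e in store_buffer:
        if e["addr"] == load_addr:
            match = e
    if match is None:
        return "bypass: access cache"      # load bypassing: no match
    if match["state"] in ("RE", "CO"):
        return f"forward {match['value']}" # load forwarding: value is ready
    return "wait"                          # matching entry is only in AD state

buf = [{"state": "CO", "addr": 100, "value": 5},
       {"state": "RE", "addr": 100, "value": 8}]
print(resolve_load(100, buf))  # forward 8: last match in program order
print(resolve_load(200, buf))  # bypass: access cache
```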

Page 49: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load Speculation in Out-of-Order Architectures

Dynamic Memory Disambiguation Problem:

Loads are issued speculatively ahead of preceding stores in program order. How do we ensure that data dependences are not violated?

Page 50: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Three approaches:

Pessimistic: wait until it is certain that the load can proceed (as in load forwarding and bypassing).

Optimistic: the load always proceeds speculatively. A recovery mechanism is needed.

Dependence prediction: use a predictor to decide whether or not to speculate, trying to have fewer recoveries.

Page 51: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Example

i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4

Baer p. 188

true dependency

Pessimistic: i3 and i4 cannot issue until i2 has computed its result:
• i2 must be at least in the RE (Result) state.
• i4 proceeds once i1 and i2 are in the AD (Address) state.

Page 52: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Example

i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4

Baer p. 189

true dependency

Optimistic: i3 and i4 issue as soon as possible (load-buffer entries are created).

When a store reaches CO, its address is compared associatively with the load-buffer entries.

Page 53: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Example

i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4

Baer p. 189

true dependency

Store buffer:
AD i1 memadd1
AD i2 memadd2

Load buffer (the 1 indicates a speculative load):
1 i3 memadd3
1 i4 memadd4

When i1 reaches CO, nothing happens because there is no match in the load buffer.

Page 54: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Example

i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4

Baer p. 189

true dependency

Store buffer:
CO i1 memadd1
CO i2 memadd2

Load buffer:
1 i3 memadd3
1 i4 memadd4

When a committing store matches memadd3 in the load buffer, i3 has to be reissued, and i4 has to be reissued because it is after i3 in program order. Some implementations only reissue the instructions that depend on i3.

Page 55: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Example

i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4

Baer p. 189

true dependency

Dependence prediction: with correct predictions, i4 can proceed and we avoid reissuing i3.

Page 56: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Motivation: Optimistic

Memory dependencies are rare: less than 10% of loads depend on an earlier store.

Baer p. 190

Page 57: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Motivation:Dependence Prediction

Load misspeculations are expensive and predictors can reduce them.

What strategy should we use forpredicting profitable speculations?

Baer p. 190

Page 58: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Simple Strategy

Memory dependencies are infrequent, so predict that all loads can be speculated. If a load L is misspeculated, all subsequent instances of L must wait.

We need a bit to remember. Where should this bit be stored?

Baer p. 190

Page 59: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Simple strategy (cont.): a single prediction bit P is associated with the instruction in the cache.

• When the load instruction is brought into the cache → P = 1
• The load is misspeculated → P = 0
• The line is evicted from the cache and reloaded → P = 1

This strategy was used in the DEC Alpha 21264.

Baer p. 190

Page 60: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Principle Behind Load Prediction

“static store-load instruction pairs that cause most of the dynamic data mispredictions are relatively few and exhibit temporal locality.”

Moshovos A. , Breach S. E., Vijaykumar T. N., Sohi G. S.,“Dynamic Speculation and Synchronization of DataDependences,” International Symposium on ComputerArchitecture, (ISCA) 1997, Denver, CO, USA

Page 61: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Ideal load speculation avoids mis-speculation and allows loads to execute as early as possible:

• Loads with no true dependences execute without delay.
• A load with a true dependence executes as soon as the store that produces the data commits.

Moshovos ISCA97.

Page 62: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

A Real Predictor

Moshovos ISCA97:

i. Dynamically identify store-load pairs that are likely to be data dependent.
ii. Provide a synchronization mechanism to instances of these dependences.
iii. Use this mechanism to synchronize the store and the load.

Page 63: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load Predictor Table

Baer p. 190

The table is indexed by a hash of the load PC and holds saturating counters. Predictor states:
• 00: strong no-speculate
• 01: weak no-speculate
• 10: weak speculate
• 11: strong speculate
Incrementing a saturating counter moves it toward strong speculate.

Each load instruction has a loadspec bit. A load-buffer entry holds:
• tag
• op.address: memory address of the operand
• spec.bit: is the load speculative?
• update.bit: should the predictor be updated at commit/abort?
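The saturating-counter table can be sketched as follows (an illustrative sketch: the table size and the modulo "hash" are stand-ins, and the abort behavior follows the slides, which reset the counter to strong no-speculate):

```python
class DependencePredictor:
    """2-bit saturating-counter load predictor.

    Counter states: 0 strong no-speculate, 1 weak no-speculate,
    2 weak speculate, 3 strong speculate."""
    def __init__(self, entries=256):
        self.table = [3] * entries         # start at strong speculate
        self.entries = entries

    def index(self, pc):
        return pc % self.entries           # stand-in for a hash of the load PC

    def speculate(self, pc):
        return self.table[self.index(pc)] >= 2

    def good_speculation(self, pc):        # saturate upward
        i = self.index(pc)
        self.table[i] = min(3, self.table[i] + 1)

    def misspeculation(self, pc):          # on abort: force strong no-speculate
        self.table[self.index(pc)] = 0

p = DependencePredictor()
print(p.speculate(0x40))   # True: counters start at strong speculate
p.misspeculation(0x40)
print(p.speculate(0x40))   # False: this load now waits
```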

Page 64: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load/Decode Stage

• Set loadspec bit according to value of counter associated with the load PC

Baer p. 190

Page 65: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

After Operand Address is Computed

Uncommitted younger stores?
  No → enter in the load buffer (spec.bit = 0, update.bit = 0) and issue the cache access.
  Yes → loadspec bit?
    On → enter in the load buffer (spec.bit = 1, update.bit = 0) and issue the cache access.
    Off → enter in the load buffer (spec.bit = 0, update.bit = 1) and wait (as in the pessimistic solution).

Baer p. 190

Page 66: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Store Commit Stage

For all matches in the load buffer:
  spec.bit?
    On → load abort: predictor ← strong no-speculate; recover from the misspeculated load.
    Off → update.bit ← 1 (it was correct not to speculate, and the load should keep not speculating in the future).

Baer p. 191

Page 67: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Load Commit Stage

spec.bit?
  On → increment the saturating counter (speculating was correct).
  Off → update.bit?
    Off → increment the saturating counter (we would like to speculate in the future).
    On → predictor ← strong no-speculate.

Baer p. 191

Page 68: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Store Sets

Baer p. 191

Page 69: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Motivation for Store Sets

• The past is a good predictor of future memory-order violations.
• Must also predict:
  – when one load is dependent on multiple stores (e.g., stores A, B, and C feeding one load);
  – when multiple loads depend on one store (e.g., one store feeding loads D, E, and F).

Chrysos ISCA98

Chrysos, G. Z. and Emer, J. S., “Memory Dependence Prediction using Store Sets,” International Symposium on Computer Architecture, 1998 pp. 142-153.

Page 70: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Store Set Definition

Given a load L, the store set of L is the set of all stores that L has ever depended upon.

Ideally, any time a store-load dependence isdetected, the store is added to the load’s store set table.

To make a prediction, the store set table ofthe load is searched for all uncommitted younger stores.

Chrysos ISCA98

Too expensive! We need an approximation.

Page 71: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Implementation of Store Sets Memory Dependence Prediction

Both loads and stores have entries in Store Set ID Table.

Chrysos ISCA98

Page 72: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Store Set Examples: multiple loads depend on one store

j: load add1
k: load add2
⋅⋅⋅
i: store add3

[Figure: i, j, and k index the SSIT and share one store-set ID pointing to a single LFST entry.]

Baer p. 192

Page 73: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Store Set Examples: one load depends on multiple stores

i: store add2
j: store add3
⋅⋅⋅
k: load add1

[Figure: i, j, and k index the SSIT and map to the same store set.]

Baer p. 192

Page 74: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Store Set Examples: multiple loads depend on multiple stores

i: store add2
j: store add3
⋅⋅⋅
k: load add1
⋅⋅⋅
l: load add4

[Figure: i, j, k, and l index the SSIT; the LFST entries associated with i and l conflict.]

We have a conflict between the LFST entries associated with i and l. The winner is the entry with the smaller index in the SSIT; the loser is made to point to the winner's entry.

Baer p. 192
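The SSIT/LFST pair can be sketched as two small tables (a hedged sketch: the sizes, the modulo "hash", and the conflict rule of keeping the smaller SSIT index are simplifications of the scheme above):

```python
class StoreSets:
    """SSIT/LFST sketch of store-set memory dependence prediction.

    ssit maps a hashed load/store PC to a store-set ID; lfst maps a
    store-set ID to the last fetched, not-yet-completed store in that set."""
    def __init__(self, size=64):
        self.ssit = [None] * size
        self.lfst = {}

    def idx(self, pc):
        return pc % len(self.ssit)         # stand-in for a PC hash

    def violation(self, store_pc, load_pc):
        # On a detected memory-order violation, put both PCs in one set;
        # on a conflict the smaller SSIT index wins.
        s, l = self.idx(store_pc), self.idx(load_pc)
        ssid = self.ssit[s] if self.ssit[s] is not None else min(s, l)
        self.ssit[s] = self.ssit[l] = ssid

    def store_fetched(self, store_pc):
        ssid = self.ssit[self.idx(store_pc)]
        if ssid is not None:
            self.lfst[ssid] = store_pc     # last fetched store of this set

    def load_must_wait_for(self, load_pc):
        ssid = self.ssit[self.idx(load_pc)]
        return self.lfst.get(ssid) if ssid is not None else None

ss = StoreSets()
ss.violation(store_pc=10, load_pc=30)      # this load once collided with store 10
ss.store_fetched(10)
print(ss.load_must_wait_for(30))  # 10: the load now synchronizes with this store
```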

Page 75: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Evaluating Load Speculation

• Performance benefits from load speculation depend on:
  – the speculation miss rate
  – the cost of misspeculation recovery

Baer p. 194

Page 76: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Evaluating Load Speculation - Terminology

Conflicting load: at the time the load is ready to issue there is a previous store in the instruction window whose operand address is unknown.

Colliding load: the load is dependent on one of the stores with which it conflicts.

Baer p. 194

Page 77: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Evaluating Load Speculation – Typical measurements

• In a 32-entry load/store window:
  – 25% of loads are non-conflicting
  – of the 75% conflicting loads, only 10% actually collide

• In larger windows:
  – the percentage of non-conflicting loads increases
  – the percentage of colliding loads decreases

Baer p. 194

Page 78: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Back-End Optimizations

• Branch prediction
  – “a must”
• Load speculation (loads bypassing stores)
  – “important”, because other instructions depend on the load
• Prediction of load latency
  – “common”, to hide load latency in the cache hierarchy

Baer p. 195

Page 79: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Other Back-End Optimizations

• Value prediction
  – Predict the value that an instruction will compute.
  – May be restricted to the values loaded by loads.
• Critical instructions
  – Predict which instructions are on the critical path.

Baer p. 196-201

Page 80: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Clustered Microarchitectures

Baer p. 201

Page 81: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Back-End Limitations to m

• Large windows: a large m requires large windows, which are expensive in hardware and power dissipation.
• Many functional units: many (long) buses, which affect forwarding.
• Centralized resources (e.g., the register file): large structures with many ports.

Baer p. 201

Page 82: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Definition of a Cluster

• A cluster is formed by:
  – a set of functional units
  – a register file
  – an instruction window (or reservation stations)

Baer p. 201

Page 83: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Clustered Microarchitecture

Baer p. 202

Page 84: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Register File Replication

• A copy of the register file in each cluster
  – Small number of clusters
  – Can use a crossbar switch for interconnection
  – Example (Alpha 21264):
    • the integer unit is two clusters;
    • each cluster has a full copy of the 80 registers

Baer p. 202

Page 85: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Changes because of Clustering

• Front end
  – steer each instruction to the window of a cluster
    • static: compile-time decision
    • dynamic: by hardware at runtime
• Back end
  – copy results into the registers of other clusters
  – intercluster latency affects wakeup and select

Baer p. 202

Page 86: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Effect of Clustering in Performance

• Latency to forward results between clusters
• Sensitive to load balancing between clusters
• Conflicting goals:
  – keep producers and consumers of data in the same cluster
  – balance the workload

Baer p. 202

Page 87: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Distributed Register Files

• Steering affects renaming.
  – Assume that an instruction a is assigned to cluster ci.
    • A free register from ci will be used for the result of a.
  – If an operand of a is produced by an instruction b in a cluster cj, what needs to be done?
    1. Another free register of ci is assigned to this operand.
    2. A copy instruction is inserted in cj immediately after b.
    3. The copy is kept in ci for use by other instructions.

Baer p. 203

Page 88: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Clustered microarchitectures can be seen as astep in the evolution from monolithic processorsto multiprocessors.

Page 89: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Chapter Summary: the back end is important for performance

– Tomasulo algorithm
– Centralized/distributed/hybrid windows
– Wakeup/select steps
– Scheduling: critical instructions first
– Loads:
  • bypassing stores
  • forwarding values
  • speculating on the absence of dependences with stores
– Clustering to reduce wiring complexity