133
Advanced Computer Architecture Part I: General Purpose Exploiting ILP Dynamically [email protected] EPFL I&C LAP EPFL I&C LAP

Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Advanced Computer Architecture—

Part I: General PurposeExploiting ILP Dynamicallyp g y y

[email protected] – I&C – LAPEPFL I&C LAP

Page 2: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

ILP? The Traditional WayILP? The Traditional Way

(Let’s Make It Fast!)

Page 3: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Speed: Main Goal in General Purpose Computer ArchitecturePurpose Computer Architecture

Reduce delay per gate Technology (~ x1.2/year) Reduce delay per gate Technology ( x1.2/year) Improve architecture Parallelism (~ x1.3/year)

1n,

© M

K 20

11y

& P

atte

rson

rce:

Hen

ness

y

Architectural and organisational ideas are the

Sour

© Ienne 2003-123 AdvCompArch — Exploiting ILP Dynamically

gmain performance drivers since the mid-1980s

Page 4: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Clock RateDoes Not Grow Much (Anymore!)Does Not Grow Much (Anymore!)

1n,

© M

K 20

11y

& P

atte

rson

rce:

Hen

ness

ySo

ur

© Ienne 2003-124 AdvCompArch — Exploiting ILP Dynamically

Page 5: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Sources of ParallelismSources of Parallelism

Bit-level Bit level Wider processor datapaths (8163264…)

Word-level (SIMD) Vector processors Multimedia instruction sets (Intel’s MMX and SSE, Sun’s VIS, etc.)

Instruction-level Instruction level Pipelining Superscalar

d

This lesson:ILP = Instruction Level Parallelism

VLIW and EPIC Task- and Application-levels…

Explicit parallel programming Explicit parallel programming Multiple threads Multiple applications…

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically5

Page 6: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Starting Point (Programmer Model)Starting Point (Programmer Model)

Sequential multicycle processorSequential multicycle processor

CyclesCycles1:

2:

3:Instructions 3:

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically6

Page 7: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

ILP?ILP?

Instructions

Cycles

??© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically7

Standard

Page 8: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

First Step:PipeliningPipelining

CyclesIF ID EX WBIF ID EX MEM

IF ID EX MEM WB

Cycles

ons

1:

2:

3

MEMWB

IF ID EX MEM WBIF ID

IF IDnstr

uctio 3:

4:

5:

EX MEMEXWB

IF IDIn

5: EX

Simplest form of Instruction Level Parallelism (ILP): Several instructions are being executed at once

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically8

being executed at once

Page 9: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Simple PipelineSimple Pipeline

RF

F D E M1 M2 W

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically9

Page 10: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Simple PipeliningSimple Pipelining

Scope for parallelism is limited:Scope for parallelism is limited: Control hazards limit the usability of the pipeline

Must squash fetched and decoded instruction following a branch Must squash fetched and decoded instruction following a branch

Data hazards limit the usability of the pipelineWhenever the next instruction cannot be executed, the pipeline , p p

is stalled and no new useful work is done until the “problem” is solved (e.g., cache miss)

Rigid seq encing Rigid sequencing Special “slots” for everything even if sometimes useless (e.g.,

MEM before WB)) Every instruction must be coerced to the same framework Structural hazards avoided “by construction”

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically10

Page 11: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Simple Pipeline with ForwardingSimple Pipeline with Forwarding

RF

F E M1 M2 WDF E M1 M2 WD

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically11

Page 12: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

ILP So FarILP So Far…

Instructions

Cycles

??Pipelining

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically12Standard

Page 13: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Dynamic Scheduling: The IdeaDynamic Scheduling: The Idea

Extend the scope to extract parallelism:Extend the scope to extract parallelism:divd $f0, $f2, $f4addd $f10, $f0, $f8addd $ 0, $ 0, $ 8subd $f12, $f8, $f14

Why not to execute subd while addd waits for y ot to e ecute subd e addd a ts othe result of divd?

Relax a fundamental rule: instructions can beRelax a fundamental rule: instructions can be executed out of program order! (but the result must still be correct )result must still be correct…)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically13

Page 14: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Break the Rigidity of the Basic PipeliningPipelining

Continue fetching and decoding even and especially Continue fetching and decoding even and especially if one cannot execute previous instructions

Keep writeback waiting if there is a structural hazard, Keep writeback waiting if there is a structural hazard, without slowing down execution

Solution: Split the tasks in independent units/pipelines

Fetch and decode ExecuteWriteback

Clearly, instructions will now produce results out-of-order (OOO)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically14

Page 15: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Dynamically Scheduled ProcessorDynamically Scheduled Processor

RF

ALURS

F D ROB W

MEM(3)RS

F D E/M1/ W

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically15

F D E/M1/… W

Page 16: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reservation StationsReservation Stations

Fetch&Decode Unit and Register File (1) Fetched operation descriptions and

(2a) known operands (from RF)All Execution Units

(1) Tags of the executed operationsor (2b) source-operation tags and (2) corresponding results

ReservationSt tiStation

Dependent Execution Unit(1) Description of operations ready to execute

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically16

(1) Description of operations ready to executewith (2) corresponding tags and (3) operands

Page 17: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reservation StationsReservation Stations

A reservation station checks that the operands are available (RAW) and that A reservation station checks that the operands are available (RAW) and that the Execution Unit is free (Structural Hazards), then starts execution

ff

Tag1 Tag2 Arg1 Arg2Op

fromEUs and RF

fromF&D Unit

addd – MUL3 ???1subd ALU1 – ??? 0xffff fee11

ALU1:

ALU2:

0xa87f b351

0ALU3:

a bALU

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically17

Page 18: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reservation StationsReservation Stations

Unavailable operands are identified by theUnavailable operands are identified by the name of the entry in the reservation station in charge of the originatingstation in charge of the originating instructionI li it i t i thImplicit register renaming, thus

removing WAR and WAW hazardsNew results are seen at their inputs

through special result bus(es)g p ( )Writeback into the registers is not on the

critical execution path

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically18

critical execution path

Page 19: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

A Problem with ExceptionsA Problem with Exceptions…

Precise exceptions Precise Precise exceptions Reordering at commit; user

view is that of a fully in-

andi $t4, $t2, 0xffandi $t5, $t3, 0xffaddi $v0, $v0, 1srl $t2, $t2, 8

order processor

Imprecise exceptions N d i t f d

srl $t2, $t2, 8 srl $t3, $t3, 8

andi $t4, $v0, 3addi $t0, $t0, 4addi $t1, $t1, 4 No reordering; out-of-order

completion visible to the user

addi $t1, $t1, 4

Impreciseandi $t4, $t2, 0xffandi $t5 $t3 0xff

The OS/programmer must be aware of the problem and take appropriate action

andi $t5, $t3, 0xffaddi $v0, $v0, 1srl $t2, $t2, 8

srl $t3, $t3, 8andi $t4 $v0 3and take appropriate action

(e.g., execute again the complete subroutine where the problem occurred)

andi $t4, $v0, 3addi $t0, $t0, 4addi $t1, $t1, 4

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically19

the problem occurred)

Page 20: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Out-of-order Commitment and ExceptionsExceptions

Exception handlers should know exactly where aException handlers should know exactly where a problem has occurred, especially for nonterminating exceptions (e g page fault)nonterminating exceptions (e.g., page fault) so that they can correct the problem and resume exactly where the exception occurredresume exactly where the exception occurred

Of course, one assumes that everything before th f lt i t ti t d dthe faulty instruction was executed and everything after was not

With OOO dynamic execution it might no longer be true…

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically20

Page 21: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

ReorderingReordering

Fundamental observation: a processor canFundamental observation: a processor can do whatever it wants provided that it gives the appearance of sequential execution (i e theappearance of sequential execution (i.e., the architectural machine state is updated in program order)program order)

New phase: COMMIT or RETIRE or GRADUATE (b id th l F D E W)(besides the usual F, D, E, W)

This observation is fundamental because it allows many techniques (precise interrupts, speculation, multithreading, etc.)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically21

Page 22: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Dynamically Scheduled ProcessorDynamically Scheduled Processor

InstructionFetch & Decode

Unit

Computation advances independently from machine

state updatesUnit

Reservation Reservation Reservation ReservationReservationStn.

L d/StRegister File

B h

ReservationStn.

ReservationStn.

ReservationStn.

Load/StoreUnitFP UnitALU

gBranchUnit

M hi t t i

CommitUnit

Machine state is updated in order

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically22

Unit

Page 23: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reorder BufferReorder Buffer

Fetch&Decode Unit(1) Fetched-operation tags in original

order, (2) destination register orAll Execution Units

(1) Tags of the executed operationsaddress, and (3) PC and (2) corresponding results

Commit Unit(R d B ff )(Reorder Buffer)

Register File and MemoryFor each instruction in the original fetch order

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically23

For each instruction, in the original fetch order,(1) destination register or address and (2) value to write

Page 24: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reordering Instructions at WritebackWriteback

Needs a reorder buffer in the Commit Unit Needs a reorder buffer in the Commit Unit

from from EUsF&D Unit

0

Register Address ValueTagPC

to MEM

00

head to MEMand RF1

1

— $f3 0x627f ba5a

ALU1 ???0xa87f b351

head

0x1000 0008

0x1000 0004

10

$f5MUL2 ???tail

0x1000 000c

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically24

Page 25: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Origins of ReorderingOrigins of Reordering

Paper Smith & Pleszkun on precise interrupts 1988 Paper Smith & Pleszkun on precise interrupts, 1988

© I

EEE

1988

& P

lesz

kun,

©ur

ce:

Smith

&So

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically25

Page 26: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Second Step: Dynamic SchedulingDynamic Scheduling

IF ID EX1 EX2 WB Cycles1 IF ID EX1 EX2 WBIF ID EX1 EX2 EX3 EX4 EX5 MEM

IF ID EX1 EX2

Cycles1:

2:

3: WBIF ID EX1 EX2

EX2IF ID EX1 MEM WB

IF ID EX1 MEMruct

ions

3:

4:

5:

WB

Inst

r

EX2IF ID EX16:

Tangible amount of ILP now possibleWhat’s next?!

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically26

Page 27: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

ILP So FarILP So Far…

?Instructions

Cycles ?Dynamic Scheduling

Pipelining

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically27Standard

Page 28: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Superscalar ExecutionSuperscalar Execution

Why not more than one instructionWhy not more than one instruction beginning execution (issued) per cycle?Key requirements areFetching more instruction in a cycle: no bigFetching more instruction in a cycle: no big

difficulty provided that the instruction cache can sustain the bandwidthcan sustain the bandwidthDecide on data and control dependencies:

dynamic scheduling already takes care of thisdynamic scheduling already takes care of this

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically28

Page 29: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Superscalar ProcessorSuperscalar Processor

InstructionFetch & Decode Unit

(Multiple Instructions per Cycle)Multiple Buses

ReservationStn

ReservationStn

ReservationStn

ReservationStn

Multiple Buses

Stn.Stn.

FP UnitALU 1 Load/Store

Stn.

Branch

Stn.

ALU 2 FP UnitALU 1Register File

/UnitUnitALU 2

Commit Unit(Multiple Instructions per Cycle)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically29

Multiple Buses

Page 30: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Third Step: Superscalar ExecutionThird Step: Superscalar Execution

IF ID EX1 EX2 EX3 WB Cycles1: IF ID EX1 EX2 EX3 WBIF ID EX1 EX2 EX3 EX4 EX5IF ID EX1 MEM WB

y1:

2:

3:

MEM

IF ID EX1 MEMIF ID EX1 EX2 EX3 EX4 EX5 WB

s

3:

4:

5:

IF ID EX1 MEMIF ID EX1 EX2 EX3 WB

IF ID EX1 EX2 EX3 EX4 EX5 EX6 WBtruc

tions 6:

7:

IF ID EX1 EX2 EX3 EX4 EX5 EX6 WB

Inst 8:

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically30

Page 31: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Several Steps in Exploiting ILPSeveral Steps in Exploiting ILP

Instructions

Cycles

SuperscalarSuperscalar

Dynamic Scheduling

Pipelining

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically31Standard

Page 32: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

References on ILPReferences on ILP

AQA 5th ed Appendix C AQA 5 ed., Appendix C CAR, Chapter 4—Introduction J E Smith and A R Pleszkun Implementation of J. E. Smith and A. R. Pleszkun, Implementation of

Precise Interrupts in Pipelined Processors, IEEE Transactions on Computers, 37(5):562-73, May 1988p , ( ) , y

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically32

Page 33: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Register RenamingRegister Renaming

(How Do I Get Rid of WAR and WAW?!…)

Page 34: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Register RenamingRegister Renaming

Importance of removing WAR and WAW dependences Importance of removing WAR and WAW dependences with “close-to-ideal” instruction windows (2K entries) and maximum issue rate (64 per cycle) 19

96( p y )

Kauf

fman

©

Mor

gan

urce

: AQ

A,

So

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically34

Page 35: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

A Little History of (Modern) RenamingRenaming

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically35

Source: Sima, © IEEE 2000

First: IBM 360/91 (1967, FP partial renaming)

Page 36: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Main Dimensions in Renaming PoliciesPolicies

1 Scope of register renaming1. Scope of register renaming Simple: only some classes of registers are

renamed (e g integer or FP only)renamed (e.g., integer or FP only)2. Layout of the renamed registers Where are they?

3. Method of register mappingg pp g Allocation, tracking, and deallocation

4 Rename rate4. Rename rate How many instructions can be renamed at

once?

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically36

once?

Page 37: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Where Are the Rename Registers?Where Are the Rename Registers?

Four possibilities:Four possibilities:1. Merged rename and architectural RF2. Split rename and architectural RFs3 Renamed al es in the eo de b ffe3. Renamed values in the reorder buffer4. Renamed values in the reservation

stations (a.k.a. shelving buffers)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically37

Page 38: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Four Possible Locations for Rename RegistersRegisters

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically38

Source: Sima, © IEEE 2000

Page 39: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Dynamically Scheduled ProcessorDynamically Scheduled Processor

InstructionFetch & Decode

Unit

ArchitecturalRegistersUnit

Reservation Reservation Reservation Reservation

g

ReservationStn.

L d/StRegister File

B h

ReservationStn.

ReservationStn.

ReservationStn.

Load/StoreUnitFP UnitALU

gBranchUnit

CommitUnit

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically39

UnitRenameRegisters

Page 40: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Typical ROBTypical ROB

fromF&D Unit

Register Address ValueTag

from EUs

PC

00

Register Address ValueTagPC

to MEMand RF

011

— $f3 0x627f ba5a

ALU1 ???0 87f b351

head

0 1000 0008

0x1000 0004

11

ALU1 ???$f5MUL2 ???

0xa87f b351

tail

0x1000 000c

0x1000 0008

0tail

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically40

Page 41: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Possible States of each Registerin a Merged Filein a Merged File

EEE

2000

e:Si

ma,

© I

ESo

urce

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically41

Page 42: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

State Transitions in a Merged FileState Transitions in a Merged File

Initialisation: Initialisation: First N registers are “AR”, others “Available”

1. Available Renamed Invalid Instruction enters the Reservation Stations and/or the ROB:

register allocated for the result (i.e., register uninitialised)2 Renamed Invalid Renamed Valid2. Renamed Invalid Renamed Valid

Instruction completes (i.e., register initialised)3. Renamed Valid Architectural Registerg

Instruction commits (i.e., register “exists”)4. Architectural Register Available

h h ( Another instruction commits to the same AR (i.e., register is dead)

5. Renamed Invalid and Renamed Valid Available

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically42

Squashing

Page 43: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Tracking the Mapping: Where is Physically an Architectural Register?Physically an Architectural Register?

Mapping in aMapping Table

Mapping in theRename Buffer

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically43

Source: Sima, © IEEE 2000

Page 44: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reminder: Reservation StationsReminder: Reservation Stations

A reservation station checks that the operands are available (RAW) A reservation station checks that the operands are available (RAW) and that the Execution Unit is free (Control), then starts execution

ff

Tag1 Tag2 Arg1 Arg2Op

fromEUs and RF

fromF&D Unit

addd – MUL3 ???1subd ALU1 – ??? 0xffff fee11

ALU1:

ALU2:

0xa87f b351

0ALU3:

a bALU

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically44

Page 45: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

MIPS R10000:Merged RF with Mapping TableMerged RF with Mapping Table

1996

ager

, © I

EEE

Sour

ce:

Yea

Remark the complexity of the Mapping Table: 4-issue processor ( 4x above scheme in parallel)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically45

4 issue processor ( 4x above scheme in parallel) 16 parallel accesses: 16 read ports and 4 write ports!

Page 46: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

State Transitions Replaced by Copying in Stand-alone RRFCopying in Stand-alone RRF

Initialisation: Initialisation: All Rename Registers are “Available”

1. Available Renamed Invalid1. Available Renamed Invalid Instruction enters the Reservation Stations and/or the ROB:

register allocated for the result (i.e., register uninitialised)

2. Renamed Invalid Renamed Valid Instruction completes (i.e., register initialised)

3 d l d l bl3. Renamed Valid Available Instruction commits (i.e., register “exists”) Value is copied in the Architectural RF Value is copied in the Architectural RF

4. Renamed Invalid and Renamed Valid Available Squashing (no copy to the Architectural RF)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically46

Squashing (no copy to the Architectural RF)

Page 47: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

State of the Rename Registers in the Commit Unit (ROB)the Commit Unit (ROB)

fromF&D Unit

Register Address ValueTag

from EUs

PC

00

Register Address ValueTagPC

to MEMand RF

011

— $f3 0x627f ba5a

ALU1 ???0 87f b351

head

0 1000 0008

0x1000 0004

11

ALU1 ???$f5MUL2 ???

0xa87f b351

tail

0x1000 000c

0x1000 0008

0tail

R d I lidRenamed Valid

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically47

Renamed InvalidRenamed ValidAvailable

Page 48: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

MIPS R10000:32 AR 64 PhR Merged Register File32 AR, 64 PhR, Merged Register File

Mapping Table:Mapping Table:fD AR PhR

Free Register Table:Up to 32 empty PhR

6

Status Table:Invalid PhR

© I

EEE

1996

I-Queue / Resv. Station ROB

Invalid PhR

urce

: Ye

ager

,

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically48

Sou

New PhRto Hold fD

Previous PhRHolding fD

fD AR

Page 49: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

MIPS R10000:Information FlowInformation Flow

1. Available Renamed Invalida ab a d a d Read new PhR from top of Free Register Table Create new mapping LogDest Dest in the Mapping Table Set corresponding Busy-Bit (=invalid) in the Status Table Set corresponding Busy Bit (=invalid) in the Status Table

2. Renamed Invalid Renamed Valid Write PhR Dest indicated in the I-Queue R t di B Bit ( lid) i th St t T bl Reset corresponding Busy-Bit (=valid) in the Status Table Mark as Done in the corresponding entry in the ROB

3. Renamed Valid Architectural Register Implicit (removal of historical mapping LogDest Dest)

4. Architectural Register Available Free PhR indicated by OldDest in the entry removed from the ROBd a d by O d y o d o O

5. Renamed Invalid and Renamed Valid Available Restore mapping from all squashed ROB entries (from tail to head) as

LogDest Dest

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically49

LogDest Dest Reset corresponding Busy-Bit (=valid) in the Status Table

Page 50: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

How Many Rename Registers?How Many Rename Registers?

In-Flight instructions: In Flight instructions:

STLDEURSflightin NNNNN

Rename Registers:

f g

ROB size:

LDEURSrename NNNN ROB size:

fli htiROB NN flightinROB NN

Note: if strictly < then structural stalls can occur

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically50

Note: if strictly < then structural stalls can occur

Page 51: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Number of Rename RegistersNumber of Rename Registers

EEE

2000

e:Si

ma,

© I

ESo

urce

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically51

Page 52: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Actual Choices in Commercial ImplementationsImplementations

000

ma,

© I

EEE

2

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically52 Sour

ce:

Sim

Page 53: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Current High End

Source: Microprocessor Report, © Cahners 2009

High-End Processors

No renamingNo renaming only in

UltraSparc:Use of registerUse of register windows for

integers makes it diffi ltoo difficult to

implement renaming (and no OOO either!)

Nor in Itanium, of course…

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically53

Page 54: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

References on Register RenamingReferences on Register Renaming

AQA 5th ed Appendix C and Chapter 3 AQA 5 ed., Appendix C and Chapter 3 PA, Sections 6.3, 6.4, and 6.5 CAR Chapter 5—Introduction CAR, Chapter 5—Introduction D. Sima, The Design Space of Register Renaming

Techniques IEEE Micro (20):5 Sept -Oct 2000Techniques, IEEE Micro, (20):5, Sept. Oct. 2000 K. C. Yeager, The MIPS R10000 Superscalar

Microprocessor, IEEE Micro, 16(2):28-40, April 1996c op ocesso , c o, 6( ) 8 0, p 996

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically54

Page 55: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Prediction and SpeculationPrediction and Speculation

(Don’t Know It? Don’t Wait but Guess…)

Page 56: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Prediction & Speculation: The IdeaPrediction & Speculation: The Idea

Some operation takes awfully long?Some operation takes awfully long?The processor needs the result to proceed?To fetch the next instruction, one needs to know

which one must be fetchedT f t ti d th dTo perform a computation, one needs the operands

’ iDon’t wait!!!1. Make a guess ( Predict) and

2. Proceed tentatively ( Speculate)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically56

Page 57: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

General ProblemsGeneral Problems

1 How do I make a good guess?1. How do I make a good guess? Remember some history…

2 Wh t d I d if th ?2. What do I do if the guess was wrong? Undo speculatively-executed instructions (“squash”) May cost nothing—e.g.,

Squash some results

M t thi May cost something—e.g., Empty pipelines Restore saved state Restore saved state Execute compensation code

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically57

Page 58: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Branch PredictionBranch Prediction

The idea:The idea:Start fetching and decoding beforeknowing the outcome of a branch

Rudimentary form of speculationRudimentary form of speculationGuess if branch will be taken or notT i PC d d ffTentative PC advancements do not affect

machine state (no execution) Backing up is easy…

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically58

Page 59: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Branch PredictionBranch Prediction

Branch outcome and additional info

Predicted direction(Taken/Not Taken)

BranchPredictor

LogicLogic

Current PCPredicted target address

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically59

Page 60: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Branch Target BuffersBranch Target Buffers

PC

031CAM

031

Target AddressBranch Address0xa123 fee4

…One needs to know if a just fetched and yet

0x1234 56780x1235 ef5a

……

just fetched and yet undecoded instruction

is a branch and what is the destination

……

……

…(computed branch, relative address,

return, etc.)……

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically60

……

Page 61: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

More Complex But CheaperBranch Target BuffersBranch Target Buffers

PC (Branch Address)

07831Typical Cache/TLB

i ti07831 organisations

0xa123 fee40x1234 560x1235 ef

Target Addr.Tag (31..8)0x7834 38470x5678 23

0x1235 78

Target Addr.Tag (31..8)0000 0000:

0000 0001 ……

…0x1235 ef

……

0x1235 78…

0000 0001:

0000 0011:

……

……

……

……

1111 1110:

1111 1111:

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically61

= Etc.

Page 62: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Which Strategy to Predict?Which Strategy to Predict?

Static predictions: ignore history Static predictions: ignore history1. Never-taken or always-taken2 Al t k b k d ( l )2. Always-taken-backward (e.g., loops)3. Compiler-specified, etc. Still a form of dynamic control speculation,

because the squashing process is done in h dhardware

Dynamic prediction: learn from history Record how often a branch was taken in the

past

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically62

Page 63: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Which Strategy to Predict Dynamically?Dynamically?

1 Same outcome as last time1. Same outcome as last time Keep one bit of history per recently visited branches

Needs an associative memory expensive

Keep one bit of history per hashed address Needs only a RAM inexpensive Different branches alias mistakes but we are only guessing Different branches alias mistakes, but we are only guessing,

anyway…

2. Same outcome as last few times (hysteresis)( y ) Keep a two-bit saturating history counter per hashed

address; use sign as a predictor Tuned to for loops: one misprediction is normal (last iteration) Tuned to for loops: one misprediction is normal (last iteration)

and should not modify the successive prediction (first iteration of a new execution)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically63

Page 64: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Branch History TableBranch History Table

PC (Branch Address)

07831 One-bit Prediction

Two-bit Prediction

0

07831

not taken 01 not taken0000 0000:

11

taken

taken

1011

taken

taken

0000 0001:

0000 0010:

0

1…

not taken OR 01

10

not taken0000 0011:

10

taken

not taken

1000

taken

not taken

1111 1110:

1111 1111:

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically64

Page 65: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Two-bit Prediction SchemeTwo-bit Prediction Scheme

1998

Strong Prediction Weak Prediction

Kauf

fman

©

Mor

gan

urce

: CO

D,

So

Slightly modified saturating two-bit counterTwo mispredictions Strong reversal

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically65

p g(e.g., UltraSPARC-I)

Page 66: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Prediction AccuracyPrediction Accuracy

1996

Kauf

fman

tion

s

© M

orga

n

pred

ict

urce

: AQ

A,

Mis

p

So

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically66

Page 67: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

More Complex StrategiesMore Complex Strategies…

Important novelty in speculative techniques: One does Important novelty in speculative techniques: One does not care to be always right but just to be right most of the time!

Cheap but clever techniques can be used…

Pattern History Table (PHT):

Exploit correlation:

ffm

an 1

996

Pattern History Table (PHT): n-bit standard predictors

(m,n) Branch Predictor BufferA global m-bit predictor uses the

outcome of the last two branches to

© M

orga

n Ka

uf

select one among four different predictors

Sour

ce:

AQA,

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically67

Branch History Register (BHR): m-bit shift register

Page 68: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Return Address StackReturn Address Stack

Special elementary case of branch prediction:Special elementary case of branch prediction:Small stack (e.g., 8-16 values)Each call (CALL JAL etc ) pushes a valueEach call (CALL, JAL, etc.) pushes a valueEach return (RET, JP $ra, etc.) pops a predicted

return addressreturn address

Functionally identical to the “real” stack but avoids any SP manipulation memory accessesavoids any SP manipulation, memory accesses, argument bypassing, etc.

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically68

Page 69: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Dynamic Control SpeculationDynamic Control Speculation

The next step after Dynamic Branch Prediction isThe next step after Dynamic Branch Prediction is Control SpeculationLimited scope if one limits to fetch and decodeLimited scope if one limits to fetch and decode

speculativelyMore aggressively one could execute tooMore aggressively, one could execute too…

If instructions are issued and execute before the branch target is known one needs to avoidbranch target is known, one needs to avoid changing the state of the processor until the correctness of the prediction is assessed orthe correctness of the prediction is assessed or to restore the state preceding the branch

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically69

Page 70: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Remember Reordering Problem?Remember Reordering Problem?

Fundamental observation: a processor canFundamental observation: a processor can do whatever it wants provided that it gives the appearance of sequential execution (i e theappearance of sequential execution (i.e., the architectural machine state is updated in program order)program order)

Idea was to commit in order, so that one could t d thi t t d i fpretend something was not executed in case of

an exception in previous instructionsPrediction and Speculation?!

Yes! Prediction: “no exceptions will occur”

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically70

p

Page 71: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Double Use of the Reorder BufferDouble Use of the Reorder Buffer

OOO execution was hidden from the userOOO execution was hidden from the user by preventing instruction to commit until in order execution would have taken placein-order execution would have taken placeIf there is an exception, one squashes all

uncommitted instructions in the reorder bufferuncommitted instructions in the reorder buffer which followed the instruction which raised the exceptionthe exception

One can easily extend this functionality to squash all uncommitted instructionssquash all uncommitted instructions following a mispredicted branch!

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically71

Page 72: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Commit Unit Now Includes Also BranchesBranches

Something more in the Commit Unit to check branch outcomes Something more in the Commit Unit to check branch outcomes

from from EUsF&D Unit

0

Register Address ValueTagPC

to MEM

00

head to MEMand RF1

1

— $f3 0x627f ba5a

BR1 ???0x2012 1111

head

0x1000 0008

0x1000 0004

10

$f5MUL2 ???tail

0x2012 1111

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically72

Page 73: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Misprediction Has High Cost Lots of Efforts in Improving Accuracyof Efforts in Improving Accuracy

Pipelines become more and more deep (e g upPipelines become more and more deep (e.g., up to 22-24 cycles in Pentium 4)

I idth (t i ll 3 8)Issue width grows (typically 3-8)Large number of in-flight instructions (up to

100-200!)Many predicted branches in-flight at oncey p gProbability of executing speculatively something

useful reduces quicklyuseful reduces quickly

predall _

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically73

i

itot pp

Page 74: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Current High End

Source: Microprocessor Report, © Cahners 2009

High-End Processors

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically74

Page 75: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Speculation Is Not Necessarily a Run-Time ConceptRun-Time Concept

Dynamic: in hardware no interactionDynamic: in hardware, no interaction whatsoever from the compilerBinary code is unmodified

Static: in software, planned beforehandStatic: in software, planned beforehand by the compilerBi d i itt i h t dBinary code is written in such a way as to do

speculation (with or without some hardware t i th ISA)support in the ISA)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically75

Page 76: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Static Control SpeculationExampleExample

We need to compute:We need to compute:if (A==0) A=B; else A=A+4;

I blIn assembly:LW R1, 0(R3) ; load ABNEZ R1, L1 ; test A, possibly skip thenBNEZ R1, L1 ; test A, possibly skip thenLW R1, 0(R2) ; ‘then’ clause: load BJ L2 ; skip else

L1: ADD R1, R1, 4 ; ‘else’ clause: compute A+4L2: SW 0(R3), R1 ; store new A

If we know that the ‘then’ clause is almostIf we know that the then clause is almost always executed, can we optimise this code?

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically76

Page 77: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Static Control SpeculationExampleExample

We could speculatively start earlier to load B intoWe could speculatively start earlier to load B into another register and, if needed, squash the value with the right oneg

In assembly:LW R1, 0(R3) ; load ALW R14, 0(R2) ; speculative load BBEQZ R1, L3 ; test A, possibly skip elseADD R14, R1, 4 ; ‘else’ clause: compute A+4, , p

L3: SW 0(R3), R14 ; store new A

Advantages: now we load B while the test is f d ( i ll l)performed ( in parallel)

Any problem? As usual: exceptions…

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically77

Page 78: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Exceptions and SpeculationExceptions and Speculation

Some ways to handle exceptions inSome ways to handle exceptions in speculative execution:Static renaming: Hardware and operating

systems cooperatively ignore exceptionsPoison bits: Mark results as speculative and

delay exception at first usey pSpeculative instructions: Mark instruction as

speculative and do not commit the result untilspeculative and do not commit the result until speculation is solved

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically78

Page 79: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Static Renaming and Hardware-Software CooperationSoftware Cooperation

Back to our example: Back to our example:LW R1, 0(R3) ; load ALW R14, 0(R2) ; speculative load BBEQZ R1, L3 ; test A, possibly skip elseADD R14, R1, 4 ; ‘else’ clause: compute A+4

L3: SW 0(R3), R14 ; store new AL3: SW 0(R3), R14 ; store new A

OS “helps” with two policies: Nonterminating exceptions (e.g., Page Fault): resume g p ( g , g )

independently from speculativeness performance penalty, but execution ok

Terminating exceptions (e g Divide by Zero): ignore and return Terminating exceptions (e.g., Divide by Zero): ignore and return an undefined value if it was speculated, it will be unused

Problem: nonspeculative terminating exceptions?

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically79

p g p

Page 80: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Static Renaming and Poison BitsStatic Renaming and Poison Bits

Special marker for speculative instructions: Special marker for speculative instructions:LW R1, 0(R3) ; load ALW* R14, 0(R2) ; speculative load BBEQZ R1, L3 ; test A, possibly skip elseADD R14, R1, 4 ; ‘else’ clause: compute A+4

L3: SW 0(R3), R14 ; store new A; report exceptionsL3: SW 0(R3), R14 ; store new A; report exceptions

The processor knows the load is speculative and turns on R14’s Poison Bit if it raises a terminating exception, g p ,and suppresses the exception

The add, if executed, resets the R14’s Poison BitWhen R14 is used, a deferred terminal exception is

raised if its Poison Bit is set

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically80

Page 81: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Example:Speculative Loads in Itanium (I)Speculative Loads in Itanium (I)

Goal: move loads as early as possible, even speculatively before Goal: move loads as early as possible, even speculatively before preceding branches (i.e., without being sure they are really needed)

<some code>(p1) br.cond somewhere// ------ barrierld 1 [ 2] Exceptions?ld r1 = [r2]<some code using r1>

Exceptions?

ld r1 = [r2] // load could be speculated// if old value r1 not needed

<some code> // <- neither here nor(p1) br.cond somewhere // in “somewhere”// ------ barrier<some code using r1> // but

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically81

<some code using r1> // but…

Page 82: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Example:Speculative Loads in Itanium (II)Speculative Loads in Itanium (II)

Speculative loads and deferred exceptions to explicit compiler- Speculative loads and deferred exceptions to explicit compilergenerated fix-up code

ld.s r1 = [r2] // speculative loads do not raise// exceptions but mark the register// with the additional NaT bit

<some code><some code using r1> // NaT is propagated in further

// calculations, which also// defer exceptions

(p1) br.cond somewhere(p )// ------ barrier<some more code using r1>chk.s r1, fix_code_r1 // call exception handler if needed

// to fix-up execution// to fix up execution

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically82

Page 83: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Predication (= Guarded) ExecutionPredication (= Guarded) Execution

A special form of static control speculation? A special form of static control speculation?“I cannot make a good prediction? I will avoid gambling

and will do both”and will do both

A bit more than that: removes control flow change A bit more than that: removes control flow change altogether

Not always a good idea: compiler trade-off (Almost) free if one uses execution units which where not used ( )

otherwise (e.g., because of limited ILP) Not free at all in the general case: more than needed is always

executed

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically83

executed

Page 84: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Weak- and Strong-dependence ModelsModels

Typical model for data dependences is:Typical model for data dependences is:No dependence (B does not depend on A)St d d (B d d A)Strong dependence (B depends on A)If A and B are executed OOO, one must be

always right about the dependencealways right about the dependence

Weak-dependence model:A d d b t il i l t dA dependence can be temporarily violated

(predicted negative and speculated) if means are in place to recover correct executionare in place to recover correct executionIf A and B are executed OOO, one must be

mostly right about the dependence

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically84

mostly right about the dependence

Page 85: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Dynamic Data Dependence PredictionPrediction

Examples:Examples:Memory Dependence Prediction: predicts whether a

load is dependent on a pending store (memoryload is dependent on a pending store (memory aliasing or disambiguation)

Alias Prediction: predicts which pending storeAlias Prediction: predicts which pending store contains the right value for a load

???

MEM

???

MEM

=

R WW WW WR R MEM R WW WW WR R MEM

???

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically85

???

Page 86: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Static Data Dependence Speculation (I)Speculation (I)

Potential RAW dependencies through memory are to be Potential RAW dependencies through memory are to be conservatively assumed as real dependencies Loss of useful reordering possibilities

G l l d l ibl l ti l b f Goal: move loads as early as possible, even speculatively before preceding stores (i.e., without being sure that the value is right)

<some code>st [r3] = r4// ------ barrierld r1 = [r2] NO!ld r1 [r2]<some code using r1>

NO!

ld r1 = [r2] // load could be speculated…<some code>st [r3] = r4 // …but if r2==r3, r1 is WRONG!

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically86

[ ] ,// ------ barrier<some code using r1>

Page 87: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Static Data Dependence Speculation (II)Speculation (II)

Speculative Loads get executed but mark the destination register as Speculative Loads get executed but mark the destination register as “speculatively” loaded and track subsequent stores for a conflict

ld.a r1 = [r2] // speculative loads are normal[ ] // p// but mark always the register// with the additional NaT bit

<some code><some code using r1> // NaT is propagated in further

// l l ti// calculations

st [r3] = r4 // successive stores are checked// to see if they rewrite locations// which were object of speculative// which were object of speculative// loads

// ------ barrier<some more code using r1>chk.a r1, fix_code_r1 // if violated RAW dependence, call

//

Important advantage because loads (slow operations) can now be started earlier

_ _// special fix-up routine

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically87

started earlier

Page 88: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Dynamic Data Value PredictionDynamic Data Value Prediction

Examples:Examples:Source Operand Value Prediction: predict

quasi-constant input operands Many constant values during program execution History table recording last value

Value Stride Prediction: predict constant pincrements across input operands History table recording stride between last two y g

values

Load Addresses and Load Values

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically88

Page 89: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Types of Prediction & SpeculationTypes of Prediction & Speculation

Dynamic (by the hardware) Static (by the compiler)

ExceptionsOOO execution and reorderingI i ti i DBTExceptions Imprecise exceptions in DBT (e.g., Transmeta Crusoe)

Trace Scheduling (AQA, 4.4)

Control Branch Prediction (AQA, 4.3)HyperblocksPredication (AQA, 4.6)Speculative Loads (e.g., Itanium)

Data Availability Virtual memory —

Data Research? Advanced Loads (e g Itanium)Dependence Research? Advanced Loads (e.g., Itanium)

Data Value Research Dynamic compilers (e.g., DyC, Calpa)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically89

p )

Not all of these are traditionally called “speculation”!

Page 90: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

References on Prediction & SpeculationSpeculation

AQA 5th ed Chapter 3 and Appendix H AQA 5 ed., Chapter 3 and Appendix H PA, Sections 4.3 and 5.3 J E Smith A Study of Branch Prediction Strategies J. E. Smith, A Study of Branch Prediction Strategies,

Eight International Symposium on Computer Architecture (ISCA), 135-48, May 1981( ), , y

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically90

Page 91: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Simultaneous MultithreadingSimultaneous Multithreading

(How do I fill my issue slots?!…)

Page 92: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Sources of ParallelismSources of Parallelism

Bit-level Bit level Wider processor datapaths (8163264…)

Word-level (SIMD) Vector processors Multimedia instruction sets (Intel’s MMX and SSE, Sun’s VIS, etc.)

Instruction-level Instruction level Pipelining Superscalar

d VLIW and EPIC

Task- and Application-levels… Explicit parallel programming Explicit parallel programming Multiple threads Multiple applications…

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically92

Page 93: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Simple Sequential ProcessorSimple Sequential Processor

op 1op 1

cycl

esc

op 2

op 3

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically93

Page 94: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Pipelined ProcessorPipelined Processor

op 1op 1op 2

cycl

es

op 3

c

op 4

5op 5op 6 (br)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically94

Page 95: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Superscalar ProcessorSuperscalar Processor

op 1 op 2

functional units

op 1 op 2

cycl

es

op 3op 4

c

op 5op 6 (br)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically95

Page 96: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

OOO Superscalar ProcessorOOO Superscalar Processor

op 1 op 2

functional units

op 1op 4

op 5

op 2

cycl

es

op 3p

op 6 (br)

c

7op 7op 11

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically96

Page 97: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Speculative ExecutionSpeculative Execution

op 1 op 2

functional units

op 1op 4

op 5

op 2

op 10 ?

cycl

es

op 3p

op 6 (br)op 7 ?

p

c

11op 9 ?op 8 ?

op 11op 12 op 13

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically97

Page 98: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Limits of ILPLimits of ILP

op 1 op 2

functional units

op 1op 4

op 5

op 2

op 10 ?

cycl

es

op 3p

op 6 (br)op 7 ?

p

?c

11op 9 ?op 8 ? ?

op 11op 12 op 13

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically98

Page 99: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Sources of Unused Issue SlotsSources of Unused Issue Slots

IEEE

1995

en e

t al

., ©

ISo

urce

: Tu

lls

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically99

Page 100: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Horizontal and Vertical WasteHorizontal and Vertical Waste

f ti l it

op 1 op 2

functional units

op 3 op 4op 7

IEEE

1995

cycl

es

op 3 op 4

Vertical Waste61% en

et

al.,

© I

op 10op 11op 5op 6

61%

Sour

ce:

Tulls

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically100

Horizontal Waste39%

Page 101: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Multithreading: The IdeaThe Idea

ranc

h

Issueranc

h

ranc

h

ranc

h

ranc

h

BrBrBrBrBr

future

Bran

ch

Bran

ch

Rather than enlarging the depth of

Issueanch

BBRather than enlarging the depth of the instruction window (more

speculation with lowering fid !) l i “ id h”

Issue

Br

nch

nch

confidence!), enlarge its “width”

fetch from multiple threads!

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically101Br

an

Bran

Page 102: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Basic Needs of a Multithreaded ProcessorMultithreaded Processor

Processor must be aware of severalProcessor must be aware of several independent states, one per each thread:Program CounterRegister File (and Flags)g ( g )(Memory)

Either multiple resources in the processorEither multiple resources in the processor or a fast way to switch across states

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically102

Page 103: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Thread SchedulingThread Scheduling

When one switches thread?When one switches thread?Which thread will be run next?

Simple interleaving options:Cycle by cycle multithreadingCycle-by-cycle multithreading Round-robin selection between a set of threads

Bl k l i h diBlock multithreading Keep executing a thread until something happens

Long latency instruction foundSome indication of scheduling difficultiesMaximum number of cycles per thread executed

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically103

Maximum number of cycles per thread executed

Page 104: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Cycle-by-cycle Interleaving(or Fine-Grain) Multithreading(or Fine-Grain) Multithreading

op 1 op 5op 2

functional units

op 1 op 5op 2

op 1op 2op 5op 2op 1

s

cycl

es

op 3 op 4ppp

op 7op 3 op 4op 6

switc

hes

c

op 10 op 8op 3

5 10op 11 co

ntex

t

op 4op 6 op 7op 9op 5 op 10

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically104

Page 105: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Block Interleaving (or Coarse-Grain) MultithreadingMultithreading

op 1 op 5op 2

functional units

op 1op 3 op 4

op 5op 2

op 6 s

cycl

es

op 2op 1

p

switc

hes

op 8

c

125

op 7op 3 op 4

cont

ext

op 1op 3

op 2op 5op 6 op 7op 9

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically105

Page 106: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Fundamental RequirementFundamental Requirement

Key issue in general purpose processorsKey issue in general-purpose processors which has prevented for many years

l h d d h bmultithreaded techniques to become commercially relevanty

It i t t bl th t i l th dIt is not acceptable that single-thread performance goes significantly down

or at all

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically106

Page 107: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Problems of Cycle-by-Cycle MultithreadingMultithreading

Null time to switch context Null time to switch context Multiple Register Files

No need for forwarding paths if threads supported are No need for forwarding paths if threads supported are more than pipeline depth! Simple(r) hardware

Fills well short vertical waste (other threads hide latencies ~ no. of threads)

Fills much less well long vertical waste (the thread is rescheduled no matter what)

Does not reduce significantly horizontal waste (per thread, the instruction window is not much different…)

Si ifi t d t i ti f i l th d j b

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically107

Significant deterioration of single thread job

Page 108: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Block Interleaving TechniquesBlock Interleaving Techniques

9

Block Interleaving

Sprin

ger

1999

Static Dynamic ilc e

t al

., ©

S

y

pted

from

: Ši

Explicit Switch Implicit SwitchSwitch-on-LoadSwitch-on-Store

Explicit SwitchConditional-Switch

Implicit SwitchSwitch-on-MissSwitch-on-Use

Ada

Switch-on-Branch (a.k.a. Lazy-Switch-on-Miss)Switch-on-Signal

(interrupt, trap,…)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically108

Page 109: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Problems of Block MultithreadingProblems of Block Multithreading

Scheduling of threads not self evident:Scheduling of threads not self-evident:What happens of thread #2 if thread #1 executes

perfectly well and leaves no gap?perfectly well and leaves no gap?Explicit techniques require ISA modifications Bad…

More time allowable for context switchMore time allowable for context switchFills very well long vertical waste (other threads

i )come in)Fills poorly short vertical waste (if not sufficient

to switch context)Does not reduce almost at all horizontal waste

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically109

Page 110: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Simultaneous Multithreading (SMT):The IdeaThe Idea

op 1 op 5op 2

functional units

op 1op 1op 1op 4

op 5op 2 op 1op 3

op 2op 5op 7 op 5op 2

op 1

cycl

es

op 3 op 6op 7

ppop 4op 6 op 8

p p

op 3 op 4

c

op 9op 10

12

op 11op 8op 7 op 10

11

op 8op 6

10op 12op 13op 14 op 9

op 11op 10op 11

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically110

Page 111: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Several Simple Scheduling PossibilitiesPossibilities

Prioritised scheduling?Prioritised scheduling?Thread #0 schedules freelyThread #1 is allowed to use #0 empty slotsThread #2 is allowed to use #0 and #1Thread #2 is allowed to use #0 and #1

empty slots, etc.

Fair scheduling?Fair scheduling?All threads compete for resourcesIf several threads want the same resource,

round-robin assignment

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically111

Page 112: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Superscalar ProcessorSuperscalar Processor

InstructionFetch & Decode Unit

(Multiple Instructions per Cycle)Multiple Buses

PC

ReservationStn

ReservationStn

ReservationStn

ReservationStn

Multiple Buses

Stn.Stn.

FP UnitALU 1 Load/Store

Stn.

Branch

Stn.

ALU 2 FP UnitALU 1Register File

/UnitUnitALU 2

Commit Unit(Multiple Instructions per Cycle)

IQ

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically112

Multiple Buses

Page 113: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reorder BufferReorder Buffer

fromF&D Unit

Register Address ValueTag

from EUs

PC

00

Register Address ValueTagPC

to MEMand RF

011

— $f3 0x627f ba5a

ALU1 ???0 87f b351

head

0 1000 0008

0x1000 0004

11

ALU1 ???$f5MUL2 ???

0xa87f b351

tail

0x1000 000c

0x1000 0008

0tail

Register Renaming

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically113

Register Renaming(between fetch/decode and commit)

Page 114: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

What Must Be Added to a Superscalar to Achieve SMT?Superscalar to Achieve SMT?

Multiple program counters (= threads) and a Multiple program counters (= threads) and a policy for the instruction fetch unit(s) to decide which thread(s) to fetch

F/D

( ) Multiple or larger register file(s) with at least as

many registers as logical registers for all threads

mm

it

Multiple instruction retirement (e.g., per thread squashing) C

om

No changes needed in the execution pathAnd also: Thread-aware branch predictors (BTBs, etc.) Per-thread Return Address Stacks

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically114

Page 115: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

SMT Processor as a Natural Extension of a SuperscalarExtension of a Superscalar

InstructionFetch & Decode Unit

(Multiple Instructions per Cycle)

PCPCPCPC

ReservationStn

ReservationStn

ReservationStn

ReservationStnStn.Stn.

FP UnitALU 1 Load/Store

Stn.

Branch

Stn.

ALU 2 FP UnitALU 1 /UnitUnitALU 2

Register File(s)

Commit Unit(Multiple Instructions per Cycle)

IQIQIQIQ

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically115

IQ

Page 116: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reorder Buffer Remembers the Thread of OriginThread of Origin

Some changes to the reorder buffer in the Commit Unit—e.g.: Some changes to the reorder buffer in the Commit Unit e.g.:

fromF&D Unit

from EUs

0

Register Address ValueTagPC Thread

to MEMand RF1

1

— $f3 0x627f ba5a

ALU1 ???0xa87f b351

head

0x1000 0008

0x1000 0004 2

2

1 $f5MUL2 ???0x1000 000c

1 MUL30x2001 1234

2

1 $f3 ???

0tail

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically116

Architectural Register Identifier:Reg # + Thread #

Page 117: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Reservation StationsReservation Stations

Reservation stations do not need to know which thread an Reservation stations do not need to know which thread an instruction belongs to

Remember: operand sources are renamed—physical regs, tags, etc.

Tag1 Tag2 Arg1 Arg2Op

fromEUs and RF

fromF&D Unit

Threadaddd – MUL3 ???1subd ALU1 – ??? 0xffff fee11

ALU1:

ALU2:

0xa87f b351Thread#2?!

Thread

0ALU3: #1?!

a bALU

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically117

Page 118: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Does It Work?!Does It Work?!

IEEE

1996

en e

t al

., ©

ISo

urce

: Tu

lls

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically118

Page 119: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Main Results on ImplementabilitySMT vs SuperscalarSMT vs. Superscalar

From [TullsenJun96]:From [TullsenJun96]:Instruction scheduling not more complexRegister File datapaths not more complex (but

much larger register file!)Instruction Fetch Throughput is attainable even

without more fetch bandwidthUnmodified cache and branch predictors are

appropropriate also for SMTappropropriate also for SMTSMT achieves better results than aggressive

superscalar

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically119

superscalar

Page 120: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Where to Fetch?Where to Fetch?

Static solutions: Round-robinStatic solutions: Round robinEach cycle 8 instructions from 1 threadEach cycle 4 instructions from 2 threads 2 from 4Each cycle 4 instructions from 2 threads, 2 from 4,…Each cycle 8 instructions from 2 threads, and forward

as many as possible from #1 then when long latency instruction in #1 pick rest from #2

Dynamic solutions: Check execution queues!Favour threads with minimal # of in-flight branchesFavour threads with minimal # of outstanding missesF th d ith i i l # f i fli htFavour threads with minimal # of in-flight

instructionsFavour threads with instructions far from queue head

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically120

Favour threads with instructions far from queue head

Page 121: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

What to Issue?What to Issue?

Not exactly the same as in superscalars Not exactly the same as in superscalars… In superscalar: oldest is the best (least speculation, more dependent

ones waiting, etc.) In SMT not so clear: branch speculation level and optimism (cache hit In SMT not so clear: branch-speculation level and optimism (cache-hit

speculation) vary across threads

One can think of many selection strategies: Oldest first Cache-hit speculated last Branch speculated last Branch speculated last Branches first…

Important result: doesn’t matter too much!

Issue Logic (critical in superscalars) can be left alone

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically121

Page 122: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Importance of Accurate Branch Prediction in SMT vs SuperscalarPrediction in SMT vs. Superscalar

Reduce the impact of Branch Prediction was oneReduce the impact of Branch Prediction was one of the qualitative initial motivations

R lt f [T ll J 96]Results from [TullsenJun96]:Perfect branch prediction advantage

25% 1 h d 25% at 1 thread 15% at 4 threads 9% at 8 threads 9% at 8 threads

Losses due to suppression of speculative execution -7% at 8 threads7% at 8 threads -38% at 1 thread ( speculation was a good idea…)

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically122

Page 123: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

BottlenecksSources of Unused Issue SlotsSources of Unused Issue Slots

ger

1999

al.,

© S

prin

gou

rce:

Šilc

et

Completion queue not very relevant (remember: this is out of the execution path…)

So

Rename register count important Most critical: number of register writeback ports

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically123

SMT for utilisation rate (EUs) not bandwidth!…

Page 124: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

BottlenecksBottlenecks

Fetch and memory throughput Fetch and memory throughput are still bottlenecks Fetch: branches, etc. M t dd d Memory not addressed

Performance vs. # of rename registers (8T) in addition to

© I

EEE

1996

g ( )the architectural ones Infinite: +2% 100: ref lls

en e

t al

., ©

100: ref. 90: -1% 80 -3% So

urce

: Tu

70 -6%

Register file access time likely limit to # of threads IPC vs. # threads

200 h i l i t

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically124

limit to # of threads 200 physical registers

Page 125: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Performance (IPC) per Unit CostPerformance (IPC) per Unit Cost

ger

1999

al.,

© S

prin

gou

rce:

Šilc

et

Superscalars are cheap only for relatively small issue bandwidth,

So

Superscalars are cheap only for relatively small issue bandwidth, then quickly down

SMT improves significantly the picture already with 2 threads and maximum moves to larger issue bandwidths with more threads

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically125

maximum moves to larger issue bandwidths with more threads

Page 126: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Introduction of SMT in Commercial Processorsin Commercial Processors

Compaq Alpha 21464 (EV8)Compaq Alpha 21464 (EV8)4T SMTP j t kill d J 2001Project killed June 2001

Intel Pentium IV (Xeon)2T SMTAvailability since 2002 y

(already there before, but not enabled)10-30% gains expectedg p

SUN Ultra III2 core CMP 4T SMT

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically126

2-core CMP, 4T SMT

Page 127: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Intel SMT: Xeon Hyper-ThreadingPipelinePipeline

d l d

F t d

duplicatedresources

Front-end(TC hit)

or(TC miss)

ntel

2002

freely sharedresources

splitresources

time-sharedresources rr

et

al.,

© I

nSo

urce

: M

ar

OOOExecution

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically127

Page 128: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Intel SMT: Xeon Hyper-ThreadingSwitching Off ThreadsSwitching Off Threads

What happens when there is only one thread? WhatWhat happens when there is only one thread? What does the OS when there is nothing to do? Ahem… Four modes: Low-power, ST0, ST1, and MT Four modes: Low power, ST0, ST1, and MT

002

al.,

© I

ntel

20rc

e: M

arr

et a

Halt0

Halt1

Int1Int0

Halt1

Halt0

Int0Int1

Sour

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically128

Low-Power Mode

Halt0 Halt1

Page 129: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

Intel SMT: Xeon Hyper-ThreadingGoals and ResultsGoals and Results

Minimum additional cost: SMT = approx. 5% area Minimum additional cost: SMT approx. 5% area No impact on single-thread performance

Recombine partitioned resources

Fair behaviour with 2 threads

tel2

002

r et

al.,

© I

ntSo

urce

: M

arr

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically129

Page 130: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

And Now What’s Next?And Now, What s Next?

Key ingredients for success so far:Key ingredients for success so far:Maximise compatibility, no info from programmers

beyond straight sequential code and coarse threadsbeyond straight sequential code and coarse threadsAggressive prediction and speculation of anything

predictablepredictableUse irregular, fine-grained parallelism (ILP): it is

“easier” to extract, can be done at runtime,…, ,

Problems:Branch prediction accuracy hard to improveBranch prediction accuracy hard to improveHard to exploit ILP any further within a thread

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically130

Page 131: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

SMT Research DirectionsSMT Research Directions

Dynamic Multithreading (DMT) [AkkariNov98] Dynamic Multithreading (DMT) [AkkariNov98] Automatic generation of threads from loops (backward

branches) and procedure call

IEEE

1998

ary

et a

l., ©

ISo

urce

: Ak

ka

Execution on a “typical” SMT microarchitecture Speculative execution of threads with interthread data

dependence and value speculation

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically131

dependence and value speculation

Page 132: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

SMT Research DirectionsSMT Research Directions

What Dynamic Multithreading represents?What Dynamic Multithreading represents? Introduces, from a single program, not only out-of-order Issue

but now also out-of-order Fetch Lowers the dependence on correct Branch Prediction

Experiments show good advantages even without more f t h b d idthfetch bandwidth

EEE

1998

ry e

t al

., ©

IE

Sour

ce:

Akka

r

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically132

S

Page 133: Advanced Computer Architecture - EPFLlap2.epfl.ch/courses/advcomparch/slides/Exploiting ILP Dynamically.pdfAdvanced Computer Architecture — Part I: General Purpose Exppgloiting ILP

References on Simultaneous MultithreadingMultithreading

AQA 5th ed Chapter 3 AQA 5 ed., Chapter 3 PA, Sections 6.3, 6.4, and 6.5 CAR, Chapter 5—Introduction, p D. M. Tullsen et al., Exploiting Choice: Instruction Fetch

and Issue on an Implementable Simultaneous M ltith di P ISCA 1996Multithreading Processor, ISCA, 1996

H. Marr et al., Hyper-Threading Technology Architecture and Microarchitecture Intel Technology Journal Q1and Microarchitecture, Intel Technology Journal, Q1, 2002

H. Accary and M. A. Driscoll, A Dynamic Multithreaded y , yProcessor, MICRO, 1998

© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically133