Advanced Computer Architecture —
Part I: General Purpose, Exploiting ILP Dynamically
[email protected], EPFL, I&C, LAP
ILP? The Traditional Way
(Let's Make It Fast!)

Speed: Main Goal in General Purpose Computer Architecture
- Reduce delay per gate: technology (~ ×1.2/year)
- Improve architecture: parallelism (~ ×1.3/year)
[Figure: performance growth over time. Source: Hennessy & Patterson, © Morgan Kaufmann 2011]

Architectural and organisational ideas are the main performance drivers since the mid-1980s.

© Ienne 2003-12, AdvCompArch — Exploiting ILP Dynamically
Clock Rate Does Not Grow Much (Anymore!)

[Figure: clock rate trends. Source: Hennessy & Patterson, © Morgan Kaufmann 2011]
Sources of Parallelism

- Bit-level: wider processor datapaths (8 → 16 → 32 → 64 …)
- Word-level (SIMD): vector processors, multimedia instruction sets (Intel's MMX and SSE, Sun's VIS, etc.)
- Instruction-level: pipelining, superscalar, VLIW and EPIC
  (this lesson: ILP = Instruction-Level Parallelism)
- Task- and application-levels: explicit parallel programming, multiple threads, multiple applications…
Starting Point (Programmer Model)

Sequential multicycle processor: instructions execute one after the other, each taking several cycles.
ILP?

How many instructions can we overlap across cycles?
First Step: Pipelining

[Pipeline diagram: successive instructions overlap their IF, ID, EX, MEM, and WB stages, one new instruction starting per cycle.]

Simplest form of Instruction-Level Parallelism (ILP): several instructions are being executed at once.
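The benefit can be sanity-checked with a little arithmetic. The sketch below is not from the slides; the stage count and instruction count are illustrative, assuming an ideal pipeline with no hazards:

```python
def multicycle_cycles(n_instructions, cycles_per_instr=5):
    """A strictly sequential multicycle processor: one instruction at a time."""
    return n_instructions * cycles_per_instr

def pipelined_cycles(n_instructions, n_stages=5):
    """An ideal 5-stage pipeline with no hazards: fill once, then one
    instruction completes per cycle."""
    return n_stages + (n_instructions - 1)

# With many instructions, the speedup approaches the number of stages.
n = 1000
speedup = multicycle_cycles(n) / pipelined_cycles(n)
```

For long instruction streams the speedup approaches the stage count, which is why hazards (treated on the next slides) matter so much in practice.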
Simple PipelineSimple Pipeline
RF
F D E M1 M2 W
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically9
Simple Pipelining

Scope for parallelism is limited:
- Control hazards limit the usability of the pipeline: one must squash the fetched and decoded instructions following a branch.
- Data hazards limit the usability of the pipeline: whenever the next instruction cannot be executed, the pipeline is stalled and no new useful work is done until the "problem" is solved (e.g., a cache miss).
- Rigid sequencing: special "slots" for everything, even if sometimes useless (e.g., MEM before WB); every instruction must be coerced into the same framework; structural hazards are avoided "by construction".
Simple Pipeline with Forwarding

[Datapath: F, D, E, M1, M2, W stages around the RF, with results forwarded from later stages back to E.]
ILP So Far…

Instructions vs. cycles: pipelining improves on the standard sequential processor.
Dynamic Scheduling: The Idea

Extend the scope to extract parallelism:

    divd $f0, $f2, $f4
    addd $f10, $f0, $f8
    subd $f12, $f8, $f14

Why not execute subd while addd waits for the result of divd?

Relax a fundamental rule: instructions can be executed out of program order! (But the result must still be correct…)
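The hardware's first job is spotting which instructions truly depend on earlier ones. A minimal sketch of that RAW check on the example above (the tuple encoding is a made-up illustration, not the slides' notation):

```python
# Hypothetical 3-address instruction tuples: (op, dest, src1, src2).
program = [
    ("divd", "$f0",  "$f2", "$f4"),
    ("addd", "$f10", "$f0", "$f8"),   # RAW on $f0: must wait for divd
    ("subd", "$f12", "$f8", "$f14"),  # independent: may run out of order
]

def raw_dependences(prog):
    """Return pairs (i, j): instruction j reads a register written earlier by i."""
    deps = []
    for j, (_, _, s1, s2) in enumerate(prog):
        for i in range(j):
            dest = prog[i][1]
            if dest in (s1, s2):
                deps.append((i, j))
    return deps
```

Only (divd, addd) shows up as a dependence; subd is free to start as soon as an execution unit is available.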
Break the Rigidity of the Basic Pipelining

- Continue fetching and decoding even, and especially, if one cannot execute previous instructions.
- Keep writeback waiting if there is a structural hazard, without slowing down execution.

Solution: split the tasks into independent units/pipelines: fetch and decode, execute, writeback.

Clearly, instructions will now produce results out of order (OOO).
Dynamically Scheduled Processor

[Datapath: F, D, then per-unit reservation stations (RS) feeding an ALU pipeline (F D E/M1/… W) and a 3-stage MEM pipeline, with a reorder buffer (ROB) before writeback into the RF.]
Reservation Stations

- From the Fetch & Decode Unit and Register File: (1) fetched operation descriptions and (2a) known operands (from the RF) or (2b) source-operation tags.
- From all Execution Units: (1) tags of the executed operations and (2) the corresponding results.
- To the dependent Execution Unit: (1) descriptions of operations ready to execute, with (2) the corresponding tags and (3) operands.
Reservation Stations

A reservation station checks that the operands are available (RAW) and that the Execution Unit is free (structural hazards), then starts execution.

Entries (filled from the F&D Unit; tags resolved from the EUs and RF):

          Op     Tag1   Tag2   Arg1          Arg2
    ALU1: addd   –      MUL3   0xa87f b351   ???
    ALU2: subd   ALU1   –      ???           0xffff fee1
    ALU3: (free)
Reservation Stations

Unavailable operands are identified by the name of the entry in the reservation station in charge of the originating instruction: implicit register renaming, thus removing WAR and WAW hazards.

New results are seen at the station inputs through special result bus(es). Writeback into the registers is not on the critical execution path.
A Problem with Exceptions…

Precise exceptions: reordering at commit; the user's view is that of a fully in-order processor.

Imprecise exceptions: no reordering; out-of-order completion is visible to the user. The OS/programmer must be aware of the problem and take appropriate action (e.g., execute again the complete subroutine where the problem occurred).

Example sequence (shown in program order; with imprecise exceptions it may have completed in a different order at the time of the exception):

    andi $t4, $t2, 0xff
    andi $t5, $t3, 0xff
    addi $v0, $v0, 1
    srl  $t2, $t2, 8
    srl  $t3, $t3, 8
    andi $t4, $v0, 3
    addi $t0, $t0, 4
    addi $t1, $t1, 4
Out-of-order Commitment and Exceptions

Exception handlers should know exactly where a problem has occurred, especially for nonterminating exceptions (e.g., a page fault), so that they can correct the problem and resume exactly where the exception occurred.

Of course, one assumes that everything before the faulty instruction was executed and everything after was not. With OOO dynamic execution this might no longer be true…
Reordering

Fundamental observation: a processor can do whatever it wants provided that it gives the appearance of sequential execution (i.e., the architectural machine state is updated in program order).

New phase: COMMIT, or RETIRE, or GRADUATE (besides the usual F, D, E, W).

This observation is fundamental because it allows many techniques (precise interrupts, speculation, multithreading, etc.).
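The commit discipline is easy to capture in a few lines. A toy sketch (not the slides' hardware, just the FIFO discipline, with made-up tag names):

```python
from collections import deque

class ReorderBuffer:
    """Minimal sketch: entries complete out of order but commit strictly in order."""
    def __init__(self):
        self.entries = deque()          # FIFO in original fetch order

    def dispatch(self, tag):
        self.entries.append({"tag": tag, "done": False})

    def complete(self, tag):
        for e in self.entries:          # out-of-order completion
            if e["tag"] == tag:
                e["done"] = True

    def commit(self):
        committed = []                  # in-order commit from the head only
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["tag"])
        return committed

rob = ReorderBuffer()
for t in ["divd", "addd", "subd"]:
    rob.dispatch(t)
rob.complete("subd")        # subd finishes first...
early = rob.commit()        # ...but cannot commit past the unfinished divd
rob.complete("divd")
rob.complete("addd")
order = rob.commit()
```

Even though subd finishes first, nothing commits until divd is done, so the architectural state is always updated in program order.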
Dynamically Scheduled ProcessorDynamically Scheduled Processor
InstructionFetch & Decode
Unit
Computation advances independently from machine
state updatesUnit
Reservation Reservation Reservation ReservationReservationStn.
L d/StRegister File
B h
ReservationStn.
ReservationStn.
ReservationStn.
Load/StoreUnitFP UnitALU
gBranchUnit
M hi t t i
CommitUnit
Machine state is updated in order
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically22
Unit
Reorder Buffer

- From the Fetch & Decode Unit: (1) fetched-operation tags in original order, (2) destination register or address, and (3) PC.
- From all Execution Units: (1) tags of the executed operations and (2) the corresponding results.
- From the Commit Unit (Reorder Buffer) to the Register File and Memory: for each instruction, in the original fetch order, (1) destination register or address and (2) the value to write.
Reordering Instructions at Writeback

Needs a reorder buffer in the Commit Unit (filled from the F&D Unit, completed from the EUs; committed head entries go to MEM and the RF):

            PC            Tag    Register/Address   Value         Done
    head →  0x1000 0004   –      $f3                0x627f ba5a   1
            0x1000 0008   ALU1   0xa87f b351        ???           0
    tail →  0x1000 000c   MUL2   $f5                ???           0
Origins of Reordering

Paper by Smith & Pleszkun on precise interrupts, 1988.

[Figure. Source: Smith & Pleszkun, © IEEE 1988]
Second Step: Dynamic SchedulingDynamic Scheduling
IF ID EX1 EX2 WB Cycles1 IF ID EX1 EX2 WBIF ID EX1 EX2 EX3 EX4 EX5 MEM
IF ID EX1 EX2
Cycles1:
2:
3: WBIF ID EX1 EX2
EX2IF ID EX1 MEM WB
IF ID EX1 MEMruct
ions
3:
4:
5:
WB
Inst
r
EX2IF ID EX16:
Tangible amount of ILP now possibleWhat’s next?!
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically26
ILP So Far…

Instructions vs. cycles: standard → pipelining → dynamic scheduling.
Superscalar Execution

Why not have more than one instruction beginning execution (issued) per cycle? The key requirements are:
- Fetching more instructions per cycle: no big difficulty, provided that the instruction cache can sustain the bandwidth.
- Deciding on data and control dependencies: dynamic scheduling already takes care of this.
Superscalar ProcessorSuperscalar Processor
InstructionFetch & Decode Unit
(Multiple Instructions per Cycle)Multiple Buses
ReservationStn
ReservationStn
ReservationStn
ReservationStn
Multiple Buses
Stn.Stn.
FP UnitALU 1 Load/Store
Stn.
Branch
Stn.
ALU 2 FP UnitALU 1Register File
/UnitUnitALU 2
Commit Unit(Multiple Instructions per Cycle)
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically29
Multiple Buses
Third Step: Superscalar Execution

[Pipeline diagram: several instructions are fetched, decoded, executed, and committed per cycle, with execution latencies of different lengths.]
Several Steps in Exploiting ILP

Instructions vs. cycles: standard → pipelining → dynamic scheduling → superscalar.
References on ILP

- AQA, 5th ed., Appendix C
- CAR, Chapter 4—Introduction
- J. E. Smith and A. R. Pleszkun, Implementation of Precise Interrupts in Pipelined Processors, IEEE Transactions on Computers, 37(5):562-73, May 1988
Register Renaming
(How Do I Get Rid of WAR and WAW?!…)
Register Renaming

Importance of removing WAR and WAW dependences with "close-to-ideal" instruction windows (2K entries) and maximum issue rate (64 per cycle).

[Figure. Source: AQA, © Morgan Kaufmann 1996]
A Little History of (Modern) Renaming

[Timeline figure. Source: Sima, © IEEE 2000]

First: IBM 360/91 (1967, partial renaming of the FP registers).
Main Dimensions in Renaming Policies

1. Scope of register renaming: simple, only some classes of registers are renamed (e.g., integer or FP only)
2. Layout of the renamed registers: where are they?
3. Method of register mapping: allocation, tracking, and deallocation
4. Rename rate: how many instructions can be renamed at once?
Where Are the Rename Registers?

Four possibilities:
1. Merged rename and architectural RF
2. Split rename and architectural RFs
3. Renamed values in the reorder buffer
4. Renamed values in the reservation stations (a.k.a. shelving buffers)
Four Possible Locations for Rename Registers

[Figure. Source: Sima, © IEEE 2000]
Dynamically Scheduled Processor

[Block diagram as before, with the Register File now split into Architectural Registers and Rename Registers; the Commit Unit moves values from the rename registers to the architectural ones.]
Typical ROB

Contents (filled from the F&D Unit, completed from the EUs; committed head entries go to MEM and the RF):

            PC            Tag    Register/Address   Value         Done
    head →  0x1000 0004   –      $f3                0x627f ba5a   1
            0x1000 0008   ALU1   0xa87f b351        ???           0
    tail →  0x1000 000c   MUL2   $f5                ???           0
Possible States of each Registerin a Merged Filein a Merged File
EEE
2000
e:Si
ma,
© I
ESo
urce
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically41
State Transitions in a Merged FileState Transitions in a Merged File
Initialisation: Initialisation: First N registers are “AR”, others “Available”
1. Available Renamed Invalid Instruction enters the Reservation Stations and/or the ROB:
register allocated for the result (i.e., register uninitialised)2 Renamed Invalid Renamed Valid2. Renamed Invalid Renamed Valid
Instruction completes (i.e., register initialised)3. Renamed Valid Architectural Registerg
Instruction commits (i.e., register “exists”)4. Architectural Register Available
h h ( Another instruction commits to the same AR (i.e., register is dead)
5. Renamed Invalid and Renamed Valid Available
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically42
Squashing
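The lifetime above forms a small state machine, which can be written down directly. A sketch with made-up state and event names encoding the five transitions:

```python
# Hypothetical encoding of the merged-file register states; the event names
# (allocate, complete, commit, overwrite, squash) are illustrative labels
# for transitions 1-5 above.
TRANSITIONS = {
    ("Available",      "allocate"):  "RenamedInvalid",
    ("RenamedInvalid", "complete"):  "RenamedValid",
    ("RenamedValid",   "commit"):    "ArchReg",
    ("ArchReg",        "overwrite"): "Available",   # later commit to the same AR
    ("RenamedInvalid", "squash"):    "Available",
    ("RenamedValid",   "squash"):    "Available",
}

def run(state, events):
    """Drive one physical register through a sequence of events."""
    for ev in events:
        state = TRANSITIONS[(state, ev)]
    return state

# Full lifetime: allocated, written, committed, eventually recycled.
lifetime = run("Available", ["allocate", "complete", "commit", "overwrite"])
# Misspeculated instruction: register goes straight back to the free pool.
squashed = run("Available", ["allocate", "squash"])
```

Both paths end in "Available", which is exactly why a merged file never leaks physical registers: every allocation is balanced by either an overwrite at commit or a squash.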
Tracking the Mapping: Where Is an Architectural Register, Physically?

Either the mapping is kept in a dedicated mapping table, or the mapping is kept in the rename buffer itself.

[Figure. Source: Sima, © IEEE 2000]
Reminder: Reservation Stations

A reservation station checks that the operands are available (RAW) and that the Execution Unit is free (structural hazards), then starts execution.

Entries (filled from the F&D Unit; tags resolved from the EUs and RF):

          Op     Tag1   Tag2   Arg1          Arg2
    ALU1: addd   –      MUL3   0xa87f b351   ???
    ALU2: subd   ALU1   –      ???           0xffff fee1
    ALU3: (free)
MIPS R10000: Merged RF with Mapping Table

[Figure. Source: Yeager, © IEEE 1996]

Remark the complexity of the Mapping Table: a 4-issue processor (4× the above scheme in parallel) needs 16 parallel accesses: 16 read ports and 4 write ports!
State Transitions Replaced by Copying in Stand-alone RRFCopying in Stand-alone RRF
Initialisation: Initialisation: All Rename Registers are “Available”
1. Available Renamed Invalid1. Available Renamed Invalid Instruction enters the Reservation Stations and/or the ROB:
register allocated for the result (i.e., register uninitialised)
2. Renamed Invalid Renamed Valid Instruction completes (i.e., register initialised)
3 d l d l bl3. Renamed Valid Available Instruction commits (i.e., register “exists”) Value is copied in the Architectural RF Value is copied in the Architectural RF
4. Renamed Invalid and Renamed Valid Available Squashing (no copy to the Architectural RF)
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically46
Squashing (no copy to the Architectural RF)
State of the Rename Registers in the Commit Unit (ROB)the Commit Unit (ROB)
fromF&D Unit
Register Address ValueTag
from EUs
PC
00
Register Address ValueTagPC
to MEMand RF
011
— $f3 0x627f ba5a
ALU1 ???0 87f b351
head
0 1000 0008
0x1000 0004
11
ALU1 ???$f5MUL2 ???
0xa87f b351
tail
0x1000 000c
0x1000 0008
0tail
R d I lidRenamed Valid
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically47
Renamed InvalidRenamed ValidAvailable
MIPS R10000: 32 AR, 64 PhR, Merged Register File

[Figure. Source: Yeager, © IEEE 1996]

- Mapping Table: AR → PhR for each destination fD
- Free Register Table: up to 32 empty PhR
- Status Table: marks invalid PhR
- I-Queue / Reservation Station: holds the new PhR assigned to hold fD
- ROB: holds the previous PhR that held fD, and the fD AR
MIPS R10000:Information FlowInformation Flow
1. Available Renamed Invalida ab a d a d Read new PhR from top of Free Register Table Create new mapping LogDest Dest in the Mapping Table Set corresponding Busy-Bit (=invalid) in the Status Table Set corresponding Busy Bit (=invalid) in the Status Table
2. Renamed Invalid Renamed Valid Write PhR Dest indicated in the I-Queue R t di B Bit ( lid) i th St t T bl Reset corresponding Busy-Bit (=valid) in the Status Table Mark as Done in the corresponding entry in the ROB
3. Renamed Valid Architectural Register Implicit (removal of historical mapping LogDest Dest)
4. Architectural Register Available Free PhR indicated by OldDest in the entry removed from the ROBd a d by O d y o d o O
5. Renamed Invalid and Renamed Valid Available Restore mapping from all squashed ROB entries (from tail to head) as
LogDest Dest
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically49
LogDest Dest Reset corresponding Busy-Bit (=valid) in the Status Table
How Many Rename Registers?How Many Rename Registers?
In-Flight instructions: In Flight instructions:
STLDEURSflightin NNNNN
Rename Registers:
f g
ROB size:
LDEURSrename NNNN ROB size:
fli htiROB NN flightinROB NN
Note: if strictly < then structural stalls can occur
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically50
Note: if strictly < then structural stalls can occur
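Plugging illustrative numbers into these sizing rules makes them concrete. The machine parameters below are made up for the example, not taken from any real processor:

```python
# Hypothetical machine: 36 reservation-station slots, 8 execution-unit
# slots, 16 load-buffer and 16 store-buffer entries.
n_rs, n_eu, n_ld, n_st = 36, 8, 16, 16

n_in_flight = n_rs + n_eu + n_ld + n_st  # every place an instruction can sit
n_rename    = n_rs + n_eu + n_ld         # stores need no result register
n_rob       = n_in_flight                # one ROB entry per in-flight instruction
```

Sizing the rename file or the ROB below these bounds does not break correctness, but, as the note says, it introduces structural stalls whenever the smaller resource fills up first.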
Number of Rename RegistersNumber of Rename Registers
EEE
2000
e:Si
ma,
© I
ESo
urce
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically51
Actual Choices in Commercial ImplementationsImplementations
000
ma,
© I
EEE
2
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically52 Sour
ce:
Sim
Current High End
Source: Microprocessor Report, © Cahners 2009
High-End Processors
No renaming only in the UltraSparc: the use of register windows for the integers makes it too difficult to implement renaming (and there is no OOO either!). Nor in Itanium, of course…
References on Register RenamingReferences on Register Renaming
AQA 5th ed Appendix C and Chapter 3 AQA 5 ed., Appendix C and Chapter 3 PA, Sections 6.3, 6.4, and 6.5 CAR Chapter 5—Introduction CAR, Chapter 5—Introduction D. Sima, The Design Space of Register Renaming
Techniques IEEE Micro (20):5 Sept -Oct 2000Techniques, IEEE Micro, (20):5, Sept. Oct. 2000 K. C. Yeager, The MIPS R10000 Superscalar
Microprocessor, IEEE Micro, 16(2):28-40, April 1996c op ocesso , c o, 6( ) 8 0, p 996
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically54
Prediction and Speculation
(Don't Know It? Don't Wait but Guess…)
Prediction & Speculation: The Idea

Some operation takes awfully long? The processor needs the result to proceed?
- To fetch the next instruction, one needs to know which one must be fetched.
- To perform a computation, one needs the operands.

Don't wait!!!
1. Make a guess (⇒ predict), and
2. Proceed tentatively (⇒ speculate)
General ProblemsGeneral Problems
1 How do I make a good guess?1. How do I make a good guess? Remember some history…
2 Wh t d I d if th ?2. What do I do if the guess was wrong? Undo speculatively-executed instructions (“squash”) May cost nothing—e.g.,
Squash some results
M t thi May cost something—e.g., Empty pipelines Restore saved state Restore saved state Execute compensation code
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically57
Branch PredictionBranch Prediction
The idea:The idea:Start fetching and decoding beforeknowing the outcome of a branch
Rudimentary form of speculationRudimentary form of speculationGuess if branch will be taken or notT i PC d d ffTentative PC advancements do not affect
machine state (no execution) Backing up is easy…
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically58
Branch PredictionBranch Prediction
Branch outcome and additional info
Predicted direction(Taken/Not Taken)
BranchPredictor
LogicLogic
Current PCPredicted target address
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically59
Branch Target BuffersBranch Target Buffers
PC
031CAM
031
Target AddressBranch Address0xa123 fee4
…One needs to know if a just fetched and yet
0x1234 56780x1235 ef5a
……
…
just fetched and yet undecoded instruction
is a branch and what is the destination
……
…
……
…(computed branch, relative address,
return, etc.)……
…
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically60
……
More Complex But CheaperBranch Target BuffersBranch Target Buffers
PC (Branch Address)
07831Typical Cache/TLB
i ti07831 organisations
0xa123 fee40x1234 560x1235 ef
Target Addr.Tag (31..8)0x7834 38470x5678 23
0x1235 78
Target Addr.Tag (31..8)0000 0000:
0000 0001 ……
…0x1235 ef
…
…
……
…
0x1235 78…
…
0000 0001:
0000 0011:
……
…
……
…
……
…
……
…
1111 1110:
1111 1111:
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically61
= Etc.
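The index/tag split can be sketched in a few lines. This is a toy direct-mapped BTB under simplifying assumptions (integer PCs, a power-of-two table, no set associativity), not the organisation of any particular processor:

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: low PC bits index the table, the
    remaining high bits form the tag that disambiguates aliases."""
    def __init__(self, n_entries=256):
        self.n = n_entries
        self.table = [None] * n_entries   # each entry: (tag, target)

    def _split(self, pc):
        return pc // self.n, pc % self.n  # (tag, index)

    def update(self, branch_pc, target):
        """Record a resolved branch (possibly evicting an aliasing one)."""
        tag, idx = self._split(branch_pc)
        self.table[idx] = (tag, target)

    def lookup(self, pc):
        """At fetch time: predicted target, or None if not a known branch."""
        tag, idx = self._split(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

btb = BranchTargetBuffer()
btb.update(0x12345678, 0xA123FEE4)
hit  = btb.lookup(0x12345678)
miss = btb.lookup(0x12345678 + 0x100)  # same index, different tag: no hit
```

Compared with the CAM version, only the stored tag bits are compared, which is why this organisation is cheaper than a fully associative search.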
Which Strategy to Predict?Which Strategy to Predict?
Static predictions: ignore history Static predictions: ignore history1. Never-taken or always-taken2 Al t k b k d ( l )2. Always-taken-backward (e.g., loops)3. Compiler-specified, etc. Still a form of dynamic control speculation,
because the squashing process is done in h dhardware
Dynamic prediction: learn from history Record how often a branch was taken in the
past
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically62
Which Strategy to Predict Dynamically?Dynamically?
1 Same outcome as last time1. Same outcome as last time Keep one bit of history per recently visited branches
Needs an associative memory expensive
Keep one bit of history per hashed address Needs only a RAM inexpensive Different branches alias mistakes but we are only guessing Different branches alias mistakes, but we are only guessing,
anyway…
2. Same outcome as last few times (hysteresis)( y ) Keep a two-bit saturating history counter per hashed
address; use sign as a predictor Tuned to for loops: one misprediction is normal (last iteration) Tuned to for loops: one misprediction is normal (last iteration)
and should not modify the successive prediction (first iteration of a new execution)
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically63
Branch History TableBranch History Table
PC (Branch Address)
07831 One-bit Prediction
Two-bit Prediction
0
07831
not taken 01 not taken0000 0000:
11
taken
taken
1011
taken
taken
0000 0001:
0000 0010:
0
1…
not taken OR 01
10
…
not taken0000 0011:
10
taken
not taken
1000
taken
not taken
1111 1110:
1111 1111:
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically64
Two-bit Prediction SchemeTwo-bit Prediction Scheme
1998
Strong Prediction Weak Prediction
Kauf
fman
©
Mor
gan
urce
: CO
D,
So
Slightly modified saturating two-bit counterTwo mispredictions Strong reversal
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically65
p g(e.g., UltraSPARC-I)
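The behaviour of the plain saturating counter on a loop branch is easy to demonstrate. A minimal sketch (the classic counter, not the modified UltraSPARC-I variant; the loop pattern is illustrative):

```python
class TwoBitPredictor:
    """Classic saturating two-bit counter: states 0-1 predict not taken,
    states 2-3 predict taken."""
    def __init__(self, state=3):               # start in "strongly taken"
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        """Record the real outcome; return whether the prediction was right."""
        correct = self.predict() == taken
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)
        return correct

# A loop branch: taken 9 times, then not taken once (loop exit), run twice.
pred = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 2
mispredictions = sum(not pred.update(t) for t in outcomes)
```

Only the two loop exits mispredict: thanks to the hysteresis, the single not-taken outcome does not flip the prediction for the first iteration of the next execution, exactly the for-loop tuning described above.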
Prediction AccuracyPrediction Accuracy
1996
Kauf
fman
tion
s
© M
orga
n
pred
ict
urce
: AQ
A,
Mis
p
So
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically66
More Complex StrategiesMore Complex Strategies…
Important novelty in speculative techniques: One does Important novelty in speculative techniques: One does not care to be always right but just to be right most of the time!
Cheap but clever techniques can be used…
Pattern History Table (PHT):
Exploit correlation:
ffm
an 1
996
Pattern History Table (PHT): n-bit standard predictors
(m,n) Branch Predictor BufferA global m-bit predictor uses the
outcome of the last two branches to
© M
orga
n Ka
uf
select one among four different predictors
Sour
ce:
AQA,
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically67
Branch History Register (BHR): m-bit shift register
Return Address StackReturn Address Stack
Special elementary case of branch prediction:Special elementary case of branch prediction:Small stack (e.g., 8-16 values)Each call (CALL JAL etc ) pushes a valueEach call (CALL, JAL, etc.) pushes a valueEach return (RET, JP $ra, etc.) pops a predicted
return addressreturn address
Functionally identical to the “real” stack but avoids any SP manipulation memory accessesavoids any SP manipulation, memory accesses, argument bypassing, etc.
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically68
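A sketch of the mechanism, with a made-up overflow policy (real designs vary; here the oldest entry is silently dropped when the stack is full):

```python
class ReturnAddressStack:
    """Small fixed-depth RAS; on overflow the oldest entry is dropped."""
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def call(self, return_pc):
        """On a call instruction: push the address of the next instruction."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overwrite the oldest entry
        self.stack.append(return_pc)

    def ret(self):
        """On a return: pop the predicted return address (None = no prediction)."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.call(0x400100)   # outer call
ras.call(0x400200)   # nested call
first  = ras.ret()   # nested return predicted first
second = ras.ret()   # then the outer one
```

As long as calls and returns nest properly and the depth is not exceeded, every return is predicted perfectly, which a BTB alone cannot do for functions called from several sites.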
Dynamic Control SpeculationDynamic Control Speculation
The next step after Dynamic Branch Prediction isThe next step after Dynamic Branch Prediction is Control SpeculationLimited scope if one limits to fetch and decodeLimited scope if one limits to fetch and decode
speculativelyMore aggressively one could execute tooMore aggressively, one could execute too…
If instructions are issued and execute before the branch target is known one needs to avoidbranch target is known, one needs to avoid changing the state of the processor until the correctness of the prediction is assessed orthe correctness of the prediction is assessed or to restore the state preceding the branch
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically69
Remember Reordering Problem?Remember Reordering Problem?
Fundamental observation: a processor canFundamental observation: a processor can do whatever it wants provided that it gives the appearance of sequential execution (i e theappearance of sequential execution (i.e., the architectural machine state is updated in program order)program order)
Idea was to commit in order, so that one could t d thi t t d i fpretend something was not executed in case of
an exception in previous instructionsPrediction and Speculation?!
Yes! Prediction: “no exceptions will occur”
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically70
p
Double Use of the Reorder BufferDouble Use of the Reorder Buffer
OOO execution was hidden from the userOOO execution was hidden from the user by preventing instruction to commit until in order execution would have taken placein-order execution would have taken placeIf there is an exception, one squashes all
uncommitted instructions in the reorder bufferuncommitted instructions in the reorder buffer which followed the instruction which raised the exceptionthe exception
One can easily extend this functionality to squash all uncommitted instructionssquash all uncommitted instructions following a mispredicted branch!
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically71
Commit Unit Now Includes Also BranchesBranches
Something more in the Commit Unit to check branch outcomes Something more in the Commit Unit to check branch outcomes
from from EUsF&D Unit
0
Register Address ValueTagPC
to MEM
00
head to MEMand RF1
1
— $f3 0x627f ba5a
BR1 ???0x2012 1111
head
0x1000 0008
0x1000 0004
10
$f5MUL2 ???tail
0x2012 1111
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically72
Misprediction Has High Cost Lots of Efforts in Improving Accuracyof Efforts in Improving Accuracy
Pipelines become more and more deep (e g upPipelines become more and more deep (e.g., up to 22-24 cycles in Pentium 4)
I idth (t i ll 3 8)Issue width grows (typically 3-8)Large number of in-flight instructions (up to
100-200!)Many predicted branches in-flight at oncey p gProbability of executing speculatively something
useful reduces quicklyuseful reduces quickly
predall _
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically73
i
itot pp
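The compounding effect is worth computing once; the accuracy and branch count below are illustrative, not measurements:

```python
def prob_all_useful(p_pred, n_branches):
    """p_tot = p_pred ** i: probability that all i in-flight branch
    predictions are correct, assuming independent predictions."""
    return p_pred ** n_branches

# Even a 95%-accurate predictor struggles with ~20 unresolved branches
# in flight: most of the deepest speculative work is wasted.
p = prob_all_useful(0.95, 20)
```

With 0.95 accuracy and 20 branches in flight, only about a third of the time is everything in flight on the correct path, which is why high-end designs invest so heavily in predictor accuracy.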
Current High-End Processors

[Table. Source: Microprocessor Report, © Cahners 2009]
Speculation Is Not Necessarily a Run-Time ConceptRun-Time Concept
Dynamic: in hardware no interactionDynamic: in hardware, no interaction whatsoever from the compilerBinary code is unmodified
Static: in software, planned beforehandStatic: in software, planned beforehand by the compilerBi d i itt i h t dBinary code is written in such a way as to do
speculation (with or without some hardware t i th ISA)support in the ISA)
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically75
Static Control SpeculationExampleExample
We need to compute:We need to compute:if (A==0) A=B; else A=A+4;
I blIn assembly:LW R1, 0(R3) ; load ABNEZ R1, L1 ; test A, possibly skip thenBNEZ R1, L1 ; test A, possibly skip thenLW R1, 0(R2) ; ‘then’ clause: load BJ L2 ; skip else
L1: ADD R1, R1, 4 ; ‘else’ clause: compute A+4L2: SW 0(R3), R1 ; store new A
If we know that the ‘then’ clause is almostIf we know that the then clause is almost always executed, can we optimise this code?
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically76
Static Control SpeculationExampleExample
We could speculatively start earlier to load B intoWe could speculatively start earlier to load B into another register and, if needed, squash the value with the right oneg
In assembly:LW R1, 0(R3) ; load ALW R14, 0(R2) ; speculative load BBEQZ R1, L3 ; test A, possibly skip elseADD R14, R1, 4 ; ‘else’ clause: compute A+4, , p
L3: SW 0(R3), R14 ; store new A
Advantages: now we load B while the test is f d ( i ll l)performed ( in parallel)
Any problem? As usual: exceptions…
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically77
Exceptions and SpeculationExceptions and Speculation
Some ways to handle exceptions inSome ways to handle exceptions in speculative execution:Static renaming: Hardware and operating
systems cooperatively ignore exceptionsPoison bits: Mark results as speculative and
delay exception at first usey pSpeculative instructions: Mark instruction as
speculative and do not commit the result untilspeculative and do not commit the result until speculation is solved
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically78
Static Renaming and Hardware-Software CooperationSoftware Cooperation
Back to our example: Back to our example:LW R1, 0(R3) ; load ALW R14, 0(R2) ; speculative load BBEQZ R1, L3 ; test A, possibly skip elseADD R14, R1, 4 ; ‘else’ clause: compute A+4
L3: SW 0(R3), R14 ; store new AL3: SW 0(R3), R14 ; store new A
OS “helps” with two policies: Nonterminating exceptions (e.g., Page Fault): resume g p ( g , g )
independently from speculativeness performance penalty, but execution ok
Terminating exceptions (e g Divide by Zero): ignore and return Terminating exceptions (e.g., Divide by Zero): ignore and return an undefined value if it was speculated, it will be unused
Problem: nonspeculative terminating exceptions?
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically79
p g p
Static Renaming and Poison Bits

Special marker for speculative instructions:
LW R1, 0(R3)      ; load A
LW* R14, 0(R2)    ; speculative load B
BEQZ R1, L3       ; test A, possibly skip else
ADD R14, R1, 4    ; ‘else’ clause: compute A+4
L3: SW 0(R3), R14 ; store new A; report exceptions

The processor knows the load is speculative: if it raises a terminating exception, it turns on R14’s Poison Bit and suppresses the exception
The add, if executed, resets R14’s Poison Bit
When R14 is used, a deferred terminal exception is raised if its Poison Bit is set
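The poison-bit mechanism can be modelled in a few lines (toy sketch of the semantics above, not any real ISA; a missing address stands in for a faulting load):

```python
class DeferredFault(Exception):
    """Raised at first use of a poisoned register."""

class Regs:
    def __init__(self):
        self.value, self.poison = {}, {}

    def spec_load(self, rd, mem, addr):
        """LW*: a faulting speculative load poisons rd instead of trapping."""
        if addr in mem:
            self.value[rd], self.poison[rd] = mem[addr], False
        else:
            self.value[rd], self.poison[rd] = None, True   # suppress exception

    def write(self, rd, v):
        """A normal write (e.g., the ADD of the 'else' clause) clears the poison."""
        self.value[rd], self.poison[rd] = v, False

    def use(self, rs):
        """First real use: raise the deferred exception if rs is poisoned."""
        if self.poison.get(rs):
            raise DeferredFault(f"deferred exception on R{rs}")
        return self.value[rs]
```

If the speculative value is never used (the ‘else’ clause overwrote it), the suppressed exception vanishes, exactly as intended.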
Example: Speculative Loads in Itanium (I)

Goal: move loads as early as possible, even speculatively before preceding branches (i.e., without being sure they are really needed)

<some code>
(p1) br.cond somewhere
// ------ barrier
ld r1 = [r2]                    Exceptions?
<some code using r1>

ld r1 = [r2]             // load could be speculated
                         // if old value of r1 not needed
<some code>              // <- neither here nor
(p1) br.cond somewhere   // in “somewhere”
// ------ barrier
<some code using r1>     // but…
Example: Speculative Loads in Itanium (II)

Speculative loads and deferred exceptions → explicit compiler-generated fix-up code

ld.s r1 = [r2]          // speculative loads do not raise
                        // exceptions but mark the register
                        // with the additional NaT bit
<some code>
<some code using r1>    // NaT is propagated in further
                        // calculations, which also
                        // defer exceptions
(p1) br.cond somewhere
// ------ barrier
<some more code using r1>
chk.s r1, fix_code_r1   // call exception handler if needed
                        // to fix up execution
Predication (= Guarded) Execution

A special form of static control speculation?
“I cannot make a good prediction? I will avoid gambling and will do both”

A bit more than that: removes the control-flow change altogether

Not always a good idea: compiler trade-off
(Almost) free if one uses execution units which were not used otherwise (e.g., because of limited ILP)
Not free at all in the general case: more than needed is always executed
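In source terms, if-conversion computes both arms and selects by the predicate; a minimal hypothetical sketch of the trade-off (functions invented for illustration):

```python
def with_branch(a, p):
    # Control flow: only one arm executes, but the branch may mispredict.
    if p:
        return a + 4
    return a * 2

def if_converted(a, p):
    # Predication: both arms always execute, then a predicated select.
    # No control-flow change, but extra work when ILP is limited.
    t_then = a + 4          # executed unconditionally
    t_else = a * 2          # executed unconditionally
    return t_then if p else t_else
```

Both versions compute the same result; the predicated one trades wasted work for the absence of a branch.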
Weak- and Strong-Dependence Models

The typical model for data dependences is:
No dependence (B does not depend on A)
Strong dependence (B depends on A)
If A and B are executed OOO, one must be always right about the dependence

Weak-dependence model:
A dependence can be temporarily violated (predicted negative and speculated) if means are in place to recover correct execution
If A and B are executed OOO, one must be mostly right about the dependence
Dynamic Data Dependence Prediction

Examples:
Memory Dependence Prediction: predicts whether a load is dependent on a pending store (memory aliasing or disambiguation)
Alias Prediction: predicts which pending store contains the right value for a load

[Diagram: a load among pending reads and writes to memory — does any earlier pending store write the same location, and if so, which one?]
Static Data Dependence Speculation (I)

Potential RAW dependences through memory must be conservatively assumed to be real dependences → loss of useful reordering possibilities
Goal: move loads as early as possible, even speculatively before preceding stores (i.e., without being sure that the value is right)

<some code>
st [r3] = r4
// ------ barrier
ld r1 = [r2]             NO!
<some code using r1>

ld r1 = [r2]             // load could be speculated…
<some code>
st [r3] = r4             // …but if r2 == r3, r1 is WRONG!
// ------ barrier
<some code using r1>
Static Data Dependence Speculation (II)

Speculative loads get executed but mark the destination register as “speculatively” loaded and track subsequent stores for a conflict

ld.a r1 = [r2]          // speculative loads are normal
                        // but always mark the register
                        // with the additional NaT bit
<some code>
<some code using r1>    // NaT is propagated in further
                        // calculations
st [r3] = r4            // successive stores are checked
                        // to see if they rewrite locations
                        // which were object of speculative
                        // loads
// ------ barrier
<some more code using r1>
chk.a r1, fix_code_r1   // if a RAW dependence was violated,
                        // call special fix-up routine

Important advantage: loads (slow operations) can now be started earlier
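The hardware side of ld.a/chk.a can be pictured as an Advanced Load Address Table (ALAT); a toy model of the mechanism (illustrative sketch only, not the Itanium implementation):

```python
class ALAT:
    """ld.a records (register, address); each store invalidates matching
    entries; chk.a re-executes the load if its entry was killed."""

    def __init__(self, mem):
        self.mem = mem
        self.entries = {}               # reg -> watched address
        self.regs = {}

    def ld_a(self, rd, addr):
        """Advanced (speculative) load: load now, remember the address."""
        self.regs[rd] = self.mem[addr]
        self.entries[rd] = addr

    def store(self, addr, value):
        """Stores check the table: a matching address kills the entry."""
        self.mem[addr] = value
        for rd, a in list(self.entries.items()):
            if a == addr:               # RAW dependence was violated
                del self.entries[rd]

    def chk_a(self, rd, addr):
        """Check: if the entry was killed, fix up by reloading."""
        if rd not in self.entries:
            self.regs[rd] = self.mem[addr]
        return self.regs[rd]
```

When no intervening store aliases the load address, the early value stands and the load latency was fully hidden; otherwise chk.a pays the reload.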
Dynamic Data Value Prediction

Examples:
Source Operand Value Prediction: predict quasi-constant input operands
  Many values are constant during program execution
  History table recording the last value
Value Stride Prediction: predict constant increments across input operands
  History table recording the stride between the last two values
Load Addresses and Load Values
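Both predictors amount to small history tables indexed by instruction address; a minimal sketch (illustrative only — real predictors add confidence counters and finite, tagged tables):

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""
    def __init__(self):
        self.last = {}                  # pc -> last observed value

    def predict(self, pc):
        return self.last.get(pc)        # None = no prediction yet

    def update(self, pc, value):
        self.last[pc] = value

class StridePredictor:
    """Predicts last value + constant stride (e.g., array address streams)."""
    def __init__(self):
        self.hist = {}                  # pc -> (last value, stride)

    def predict(self, pc):
        if pc in self.hist:
            v, s = self.hist[pc]
            return v + s
        return None

    def update(self, pc, value):
        if pc in self.hist:
            v, _ = self.hist[pc]
            self.hist[pc] = (value, value - v)   # stride from last two values
        else:
            self.hist[pc] = (value, 0)
```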
Types of Prediction & Speculation

                    Dynamic (by the hardware)       Static (by the compiler)
Exceptions          OOO execution and reordering    Imprecise exceptions in DBT
                                                    (e.g., Transmeta Crusoe)
Control             Branch Prediction (AQA, 4.3)    Trace Scheduling (AQA, 4.4)
                                                    Hyperblocks
                                                    Predication (AQA, 4.6)
                                                    Speculative Loads (e.g., Itanium)
Data Availability   Virtual memory                  —
Data Dependence     Research?                       Advanced Loads (e.g., Itanium)
Data Value          Research                        Dynamic compilers (e.g., DyC, Calpa)

Not all of these are traditionally called “speculation”!
References on Prediction & Speculation

AQA 5th ed., Chapter 3 and Appendix H
PA, Sections 4.3 and 5.3
J. E. Smith, A Study of Branch Prediction Strategies, Eighth International Symposium on Computer Architecture (ISCA), 135-48, May 1981
Simultaneous Multithreading

(How do I fill my issue slots?!…)
Sources of Parallelism

Bit-level
  Wider processor datapaths (8 → 16 → 32 → 64…)
Word-level (SIMD)
  Vector processors
  Multimedia instruction sets (Intel’s MMX and SSE, Sun’s VIS, etc.)
Instruction-level
  Pipelining
  Superscalar
  VLIW and EPIC
Task- and application-levels…
  Explicit parallel programming
  Multiple threads
  Multiple applications…
Simple Sequential Processor

[Diagram: ops 1-3 execute strictly one after another, each occupying several cycles]
Pipelined Processor

[Diagram: ops 1-6 overlap in successive cycles on a single pipeline; op 6 is a branch]
Superscalar Processor

[Diagram: multiple functional units; ops 1-6 issue up to two per cycle, in order; op 6 is a branch]
OOO Superscalar Processor

[Diagram: ops issue out of order across the functional units (e.g., ops 4 and 5 ahead of op 3); op 6 is a branch]
Speculative Execution

[Diagram: ops 7-10 execute speculatively beyond the branch (op 6) before it resolves, alongside nonspeculative ops 11-13]
Limits of ILP

[Diagram: the same speculative OOO pipeline, but many issue slots remain empty — dependences and limited speculation confidence leave functional units idle]
Sources of Unused Issue Slots

[Figure: breakdown of wasted issue slots by cause. Source: Tullsen et al., © IEEE 1995]
Horizontal and Vertical Waste

[Figure: issue-slot diagram distinguishing vertical waste (61% — entire cycles in which nothing issues) from horizontal waste (39% — empty slots in partially filled cycles). Source: Tullsen et al., © IEEE 1995]
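The two kinds of waste have a simple operational definition; a minimal sketch for a machine with `width` issue slots per cycle:

```python
def waste(issued_per_cycle, width):
    """Classify empty issue slots: a cycle issuing nothing contributes
    vertical waste; empty slots in a partially filled cycle contribute
    horizontal waste."""
    vertical = horizontal = 0
    for n in issued_per_cycle:
        if n == 0:
            vertical += width           # whole cycle lost
        else:
            horizontal += width - n     # cycle used, slots lost
    return vertical, horizontal
```

For instance, a 4-wide machine issuing 0, 2, 4, 1 instructions over four cycles wastes 4 slots vertically and 5 horizontally.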
Multithreading: The Idea

[Diagram: one deep instruction window speculating across many branches into the future vs. several shallower windows, one per thread, feeding the same issue stage]

Rather than enlarging the depth of the instruction window (more speculation with lowering confidence!), enlarge its “width” → fetch from multiple threads!
Basic Needs of a Multithreaded Processor

The processor must be aware of several independent states, one per thread:
  Program Counter
  Register File (and flags)
  (Memory)

Either multiple resources in the processor or a fast way to switch across states
Thread Scheduling

When does one switch threads? Which thread will run next?

Simple interleaving options:
Cycle-by-cycle multithreading
  Round-robin selection among a set of threads
Block multithreading
  Keep executing a thread until something happens:
    A long-latency instruction is found
    Some indication of scheduling difficulties
    A maximum number of cycles per thread has been executed
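The two policies can be contrasted in a few lines (toy sketch: threads are just instruction lists, and `long_latency` marks the instructions that would trigger a block switch):

```python
def fine_grain(threads):
    """Cycle-by-cycle: round-robin, one instruction per thread per turn."""
    pools = [list(t) for t in threads]
    schedule, i = [], 0
    while any(pools):
        if pools[i % len(pools)]:
            schedule.append(pools[i % len(pools)].pop(0))
        i += 1
    return schedule

def block(threads, long_latency):
    """Block multithreading: run one thread until a long-latency op appears,
    then switch to the next thread that still has work."""
    pools = [list(t) for t in threads]
    schedule, i = [], 0
    while any(pools):
        if not pools[i]:
            i = (i + 1) % len(pools)
            continue
        op = pools[i].pop(0)
        schedule.append(op)
        if op in long_latency:          # e.g., a load that misses in the cache
            i = (i + 1) % len(pools)
    return schedule
```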
Cycle-by-Cycle Interleaving (or Fine-Grain) Multithreading

[Diagram: a context switch every cycle; instructions from several threads alternate cycle by cycle across the functional units]
Block Interleaving (or Coarse-Grain) Multithreading

[Diagram: each thread runs for several consecutive cycles; context switches happen only on specific events]
Fundamental Requirement

The key issue in general-purpose processors which for many years prevented multithreaded techniques from becoming commercially relevant:

It is not acceptable that single-thread performance goes down significantly — or at all
Problems of Cycle-by-Cycle Multithreading

Null time to switch context → multiple Register Files
No need for forwarding paths if the threads supported are more than the pipeline depth! → simple(r) hardware
Fills short vertical waste well (other threads hide latencies ~ number of threads)
Fills long vertical waste much less well (the thread is rescheduled no matter what)
Does not significantly reduce horizontal waste (per thread, the instruction window is not much different…)
Significant deterioration of a single-thread job
Block Interleaving Techniques

Block Interleaving
  Static
    Explicit Switch
    Implicit Switch: Switch-on-Load, Switch-on-Store, Switch-on-Branch
  Dynamic
    Explicit Switch: Conditional-Switch
    Implicit Switch: Switch-on-Miss, Switch-on-Use (a.k.a. Lazy-Switch-on-Miss), Switch-on-Signal (interrupt, trap,…)

[Adapted from: Šilc et al., © Springer 1999]
Problems of Block Multithreading

Scheduling of threads is not self-evident:
  What happens to thread #2 if thread #1 executes perfectly well and leaves no gap?
  Explicit techniques require ISA modifications → bad…
More time is allowable for a context switch
Fills long vertical waste very well (other threads come in)
Fills short vertical waste poorly (if it is not sufficient to switch context)
Does not reduce horizontal waste at all, or almost
Simultaneous Multithreading (SMT): The Idea

[Diagram: in every cycle, issue slots are filled with instructions from several threads at once — both horizontal and vertical waste shrink]
Several Simple Scheduling Possibilities

Prioritised scheduling?
  Thread #0 schedules freely
  Thread #1 is allowed to use #0’s empty slots
  Thread #2 is allowed to use #0’s and #1’s empty slots, etc.

Fair scheduling?
  All threads compete for resources
  If several threads want the same resource, round-robin assignment
Superscalar Processor

[Block diagram: an Instruction Fetch & Decode Unit (multiple instructions per cycle) with a PC and an instruction queue (IQ); reservation stations feeding ALU 1, ALU 2, an FP Unit, a Load/Store Unit, and a Branch Unit; a Register File; multiple buses; and a Commit Unit (multiple instructions per cycle)]
Reorder Buffer

[Diagram: a circular buffer with head and tail pointers; each entry holds PC, Tag, Register, Address, and Value (e.g., an entry tagged ALU1 still waiting for its value, another holding $f3 = 0x627f ba5a); entries are filled from the F&D Unit and the EUs and drained to memory and the register file]

Register Renaming (between fetch/decode and commit)
What Must Be Added to a Superscalar to Achieve SMT?

F/D: multiple program counters (= threads) and a policy for the instruction fetch unit(s) to decide which thread(s) to fetch
Multiple or larger register file(s), with at least as many registers as logical registers for all threads
Commit: multiple instruction retirement (e.g., per-thread squashing)
No changes needed in the execution path
And also:
  Thread-aware branch predictors (BTBs, etc.)
  Per-thread Return Address Stacks
SMT Processor as a Natural Extension of a Superscalar

[Block diagram: the same superscalar datapath, but with multiple PCs and instruction queues in the Fetch & Decode Unit and multiple (or larger) register files]
Reorder Buffer Remembers the Thread of Origin

Some changes to the reorder buffer in the Commit Unit — e.g.:

[Diagram: each ROB entry now also carries a Thread field alongside PC, Tag, Register, Address, and Value, so entries from different threads coexist between head and tail]

Architectural Register Identifier: Reg # + Thread #
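The renaming consequence can be stated directly: the rename map is indexed by the pair (thread #, reg #) instead of the register number alone, so threads never alias each other’s registers. A toy model (illustrative only; a real map recycles a finite pool of physical registers):

```python
class RenameMap:
    def __init__(self):
        self.map = {}                   # (thread, areg) -> physical reg
        self.next_phys = 0

    def rename_dest(self, thread, areg):
        """Allocate a fresh physical register for a destination write."""
        self.map[(thread, areg)] = self.next_phys
        self.next_phys += 1
        return self.map[(thread, areg)]

    def lookup_src(self, thread, areg):
        """Source operands read the current mapping of (thread, reg)."""
        return self.map[(thread, areg)]
```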
Reservation Stations

Reservation stations do not need to know which thread an instruction belongs to
  Remember: operand sources are renamed — physical regs, tags, etc.

[Diagram: reservation-station entries (Op, Tag1, Tag2, Arg1, Arg2) feeding an ALU; entries from different threads — e.g., thread #1’s subd and addd alongside thread #2’s operands — coexist with no thread information]
Does It Work?!

[Figure: SMT throughput results. Source: Tullsen et al., © IEEE 1996]
Main Results on Implementability: SMT vs. Superscalar

From [TullsenJun96]:
  Instruction scheduling is not more complex
  Register-file datapaths are not more complex (but a much larger register file!)
  Instruction-fetch throughput is attainable even without more fetch bandwidth
  Unmodified caches and branch predictors are appropriate also for SMT
  SMT achieves better results than an aggressive superscalar
Where to Fetch?

Static solutions: round-robin
  Each cycle, 8 instructions from 1 thread
  Each cycle, 4 instructions from 2 threads, 2 from 4,…
  Each cycle, 8 instructions from 2 threads, forwarding as many as possible from #1; when a long-latency instruction appears in #1, pick the rest from #2

Dynamic solutions: check the execution queues!
  Favour threads with the minimal number of in-flight branches
  Favour threads with the minimal number of outstanding misses
  Favour threads with the minimal number of in-flight instructions
  Favour threads with instructions far from the queue head
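The third dynamic policy is essentially Tullsen’s ICOUNT heuristic; a minimal sketch (assuming a per-thread count of instructions currently in the decode/rename/issue queues):

```python
def pick_fetch_threads(inflight, k=2):
    """ICOUNT-style fetch policy: each cycle, fetch from the k threads
    with the fewest in-flight (queued) instructions, so fast-moving
    threads get fetch bandwidth and clogged ones do not."""
    order = sorted(inflight, key=inflight.get)   # fewest in-flight first
    return order[:k]
```

A thread that stalls accumulates queued instructions and automatically loses fetch priority until it drains.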
What to Issue?

Not exactly the same as in superscalars…
  In a superscalar: oldest is best (least speculation, more dependent ones waiting, etc.)
  In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads

One can think of many selection strategies:
  Oldest first
  Cache-hit speculated last
  Branch speculated last
  Branches first…

Important result: it doesn’t matter too much! → the issue logic (critical in superscalars) can be left alone
Importance of Accurate Branch Prediction in SMT vs. Superscalar

Reducing the impact of branch prediction was one of the initial qualitative motivations

Results from [TullsenJun96]:
  Perfect branch prediction advantage:
    25% at 1 thread
    15% at 4 threads
    9% at 8 threads
  Losses due to suppression of speculative execution:
    -7% at 8 threads
    -38% at 1 thread (→ speculation was a good idea…)
Bottlenecks: Sources of Unused Issue Slots

[Figure: bottleneck analysis. Source: Šilc et al., © Springer 1999]

Completion queue not very relevant (remember: this is out of the execution path…)
Rename register count important
Most critical: number of register writeback ports
→ SMT for utilisation rate (EUs), not bandwidth!…
Bottlenecks

Fetch and memory throughput are still bottlenecks
  Fetch: branches, etc.
  Memory not addressed

Performance vs. number of rename registers (8 threads), in addition to the architectural ones:
  Infinite: +2%
  100: ref.
  90: -1%
  80: -3%
  70: -6%

Register file access time is the likely limit to the number of threads

[Figure: IPC vs. number of threads, with 200 physical registers. Source: Tullsen et al., © IEEE 1996]
Performance (IPC) per Unit Cost

[Figure: IPC per unit cost vs. issue bandwidth. Source: Šilc et al., © Springer 1999]

Superscalars are cheap only for relatively small issue bandwidths, then quickly go down
SMT improves the picture significantly already with 2 threads, and the maximum moves to larger issue bandwidths with more threads
Introduction of SMT in Commercial Processors

Compaq Alpha 21464 (EV8)
  4T SMT
  Project killed June 2001
Intel Pentium IV (Xeon)
  2T SMT
  Available since 2002 (already there before, but not enabled)
  10-30% gains expected
SUN Ultra III
  2-core CMP, 4T SMT
Intel SMT: Xeon Hyper-Threading Pipeline

[Diagram: the pipeline front-end (TC hit or TC miss paths) and the OOO execution core, with resources labelled as duplicated, freely shared, split, or time-shared between the two threads. Source: Marr et al., © Intel 2002]
What happens when there is only one thread? WhatWhat happens when there is only one thread? What does the OS when there is nothing to do? Ahem… Four modes: Low-power, ST0, ST1, and MT Four modes: Low power, ST0, ST1, and MT
002
al.,
© I
ntel
20rc
e: M
arr
et a
Halt0
Halt1
Int1Int0
Halt1
Halt0
Int0Int1
Sour
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically128
Low-Power Mode
Halt0 Halt1
Intel SMT: Xeon Hyper-ThreadingGoals and ResultsGoals and Results
Minimum additional cost: SMT = approx. 5% area Minimum additional cost: SMT approx. 5% area No impact on single-thread performance
Recombine partitioned resources
Fair behaviour with 2 threads
tel2
002
r et
al.,
© I
ntSo
urce
: M
arr
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically129
And Now, What’s Next?

Key ingredients for success so far:
  Maximise compatibility; no info from programmers beyond straight sequential code and coarse threads
  Aggressive prediction and speculation of anything predictable
  Use irregular, fine-grained parallelism (ILP): it is “easier” to extract, can be done at runtime,…

Problems:
  Branch prediction accuracy is hard to improve
  Hard to exploit ILP any further within a thread
SMT Research Directions

Dynamic Multithreading (DMT) [AkkaryNov98]
  Automatic generation of threads from loops (backward branches) and procedure calls
  Execution on a “typical” SMT microarchitecture
  Speculative execution of threads with interthread data dependence and value speculation

[Figure: DMT thread creation from loop and call boundaries. Source: Akkary et al., © IEEE 1998]
SMT Research Directions

What does Dynamic Multithreading represent?
  It introduces, from a single program, not only out-of-order Issue but now also out-of-order Fetch
  It lowers the dependence on correct Branch Prediction
  Experiments show good advantages even without more fetch bandwidth

[Figure: DMT results. Source: Akkary et al., © IEEE 1998]
References on Simultaneous Multithreading

AQA 5th ed., Chapter 3
PA, Sections 6.3, 6.4, and 6.5
CAR, Chapter 5 — Introduction
D. M. Tullsen et al., Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, ISCA, 1996
D. T. Marr et al., Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, Q1, 2002
H. Akkary and M. A. Driscoll, A Dynamic Multithreading Processor, MICRO, 1998