Advanced Computer Architecture —
Part I: General Purpose, Exploiting ILP Dynamically
[email protected], EPFL, I&C, LAP
ILP? The Traditional Way
(Let's Make It Fast!)

Speed: Main Goal in General Purpose Computer Architecture
- Reduce delay per gate: technology (~ ×1.2/year)
- Improve architecture: parallelism (~ ×1.3/year)
[Figure: performance growth over time. Source: Hennessy & Patterson, © Morgan Kaufmann 2011]

Architectural and organisational ideas are the main performance drivers since the mid-1980s.

© Ienne 2003-12, AdvCompArch — Exploiting ILP Dynamically
Clock Rate Does Not Grow Much (Anymore!)

[Figure: clock rate trends. Source: Hennessy & Patterson, © Morgan Kaufmann 2011]
Sources of Parallelism

- Bit-level: wider processor datapaths (8 → 16 → 32 → 64 …)
- Word-level (SIMD): vector processors, multimedia instruction sets (Intel's MMX and SSE, Sun's VIS, etc.)
- Instruction-level: pipelining, superscalar, VLIW and EPIC
  (this lesson: ILP = Instruction-Level Parallelism)
- Task- and application-levels: explicit parallel programming, multiple threads, multiple applications…
Starting Point (Programmer Model)

Sequential multicycle processor: instructions execute one after the other, each taking several cycles.
ILP?

How many instructions can we overlap across cycles?
First Step: Pipelining

[Pipeline diagram: successive instructions overlap their IF, ID, EX, MEM, and WB stages, one new instruction starting per cycle.]

Simplest form of Instruction-Level Parallelism (ILP): several instructions are being executed at once.
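The benefit can be sanity-checked with a little arithmetic. The sketch below is not from the slides; the stage count and instruction count are illustrative, assuming an ideal pipeline with no hazards:

```python
def multicycle_cycles(n_instructions, cycles_per_instr=5):
    """A strictly sequential multicycle processor: one instruction at a time."""
    return n_instructions * cycles_per_instr

def pipelined_cycles(n_instructions, n_stages=5):
    """An ideal 5-stage pipeline with no hazards: fill once, then one
    instruction completes per cycle."""
    return n_stages + (n_instructions - 1)

# With many instructions, the speedup approaches the number of stages.
n = 1000
speedup = multicycle_cycles(n) / pipelined_cycles(n)
```

For long instruction streams the speedup approaches the stage count, which is why hazards (treated on the next slides) matter so much in practice.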
Simple PipelineSimple Pipeline
RF
F D E M1 M2 W
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically9
Simple Pipelining

Scope for parallelism is limited:
- Control hazards limit the usability of the pipeline: one must squash the fetched and decoded instructions following a branch.
- Data hazards limit the usability of the pipeline: whenever the next instruction cannot be executed, the pipeline is stalled and no new useful work is done until the "problem" is solved (e.g., a cache miss).
- Rigid sequencing: special "slots" for everything, even if sometimes useless (e.g., MEM before WB); every instruction must be coerced into the same framework; structural hazards are avoided "by construction".
Simple Pipeline with Forwarding

[Datapath: F, D, E, M1, M2, W stages around the RF, with results forwarded from later stages back to E.]
ILP So Far…

Instructions vs. cycles: pipelining improves on the standard sequential processor.
Dynamic Scheduling: The Idea

Extend the scope to extract parallelism:

    divd $f0, $f2, $f4
    addd $f10, $f0, $f8
    subd $f12, $f8, $f14

Why not execute subd while addd waits for the result of divd?

Relax a fundamental rule: instructions can be executed out of program order! (But the result must still be correct…)
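The hardware's first job is spotting which instructions truly depend on earlier ones. A minimal sketch of that RAW check on the example above (the tuple encoding is a made-up illustration, not the slides' notation):

```python
# Hypothetical 3-address instruction tuples: (op, dest, src1, src2).
program = [
    ("divd", "$f0",  "$f2", "$f4"),
    ("addd", "$f10", "$f0", "$f8"),   # RAW on $f0: must wait for divd
    ("subd", "$f12", "$f8", "$f14"),  # independent: may run out of order
]

def raw_dependences(prog):
    """Return pairs (i, j): instruction j reads a register written earlier by i."""
    deps = []
    for j, (_, _, s1, s2) in enumerate(prog):
        for i in range(j):
            dest = prog[i][1]
            if dest in (s1, s2):
                deps.append((i, j))
    return deps
```

Only (divd, addd) shows up as a dependence; subd is free to start as soon as an execution unit is available.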
Break the Rigidity of the Basic Pipelining

- Continue fetching and decoding even, and especially, if one cannot execute previous instructions.
- Keep writeback waiting if there is a structural hazard, without slowing down execution.

Solution: split the tasks into independent units/pipelines: fetch and decode, execute, writeback.

Clearly, instructions will now produce results out of order (OOO).
Dynamically Scheduled Processor

[Datapath: F, D, then per-unit reservation stations (RS) feeding an ALU pipeline (F D E/M1/… W) and a 3-stage MEM pipeline, with a reorder buffer (ROB) before writeback into the RF.]
Reservation Stations

- From the Fetch & Decode Unit and Register File: (1) fetched operation descriptions and (2a) known operands (from the RF) or (2b) source-operation tags.
- From all Execution Units: (1) tags of the executed operations and (2) the corresponding results.
- To the dependent Execution Unit: (1) descriptions of operations ready to execute, with (2) the corresponding tags and (3) operands.
Reservation Stations

A reservation station checks that the operands are available (RAW) and that the Execution Unit is free (structural hazards), then starts execution.

Entries (filled from the F&D Unit; tags resolved from the EUs and RF):

          Op     Tag1   Tag2   Arg1          Arg2
    ALU1: addd   –      MUL3   0xa87f b351   ???
    ALU2: subd   ALU1   –      ???           0xffff fee1
    ALU3: (free)
Reservation Stations

Unavailable operands are identified by the name of the entry in the reservation station in charge of the originating instruction: implicit register renaming, thus removing WAR and WAW hazards.

New results are seen at the station inputs through special result bus(es). Writeback into the registers is not on the critical execution path.
A Problem with Exceptions…

Precise exceptions: reordering at commit; the user's view is that of a fully in-order processor.

Imprecise exceptions: no reordering; out-of-order completion is visible to the user. The OS/programmer must be aware of the problem and take appropriate action (e.g., execute again the complete subroutine where the problem occurred).

Example sequence (shown in program order; with imprecise exceptions it may have completed in a different order at the time of the exception):

    andi $t4, $t2, 0xff
    andi $t5, $t3, 0xff
    addi $v0, $v0, 1
    srl  $t2, $t2, 8
    srl  $t3, $t3, 8
    andi $t4, $v0, 3
    addi $t0, $t0, 4
    addi $t1, $t1, 4
Out-of-order Commitment and Exceptions

Exception handlers should know exactly where a problem has occurred, especially for nonterminating exceptions (e.g., a page fault), so that they can correct the problem and resume exactly where the exception occurred.

Of course, one assumes that everything before the faulty instruction was executed and everything after was not. With OOO dynamic execution this might no longer be true…
Reordering

Fundamental observation: a processor can do whatever it wants provided that it gives the appearance of sequential execution (i.e., the architectural machine state is updated in program order).

New phase: COMMIT, or RETIRE, or GRADUATE (besides the usual F, D, E, W).

This observation is fundamental because it allows many techniques (precise interrupts, speculation, multithreading, etc.).
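The commit discipline is easy to capture in a few lines. A toy sketch (not the slides' hardware, just the FIFO discipline, with made-up tag names):

```python
from collections import deque

class ReorderBuffer:
    """Minimal sketch: entries complete out of order but commit strictly in order."""
    def __init__(self):
        self.entries = deque()          # FIFO in original fetch order

    def dispatch(self, tag):
        self.entries.append({"tag": tag, "done": False})

    def complete(self, tag):
        for e in self.entries:          # out-of-order completion
            if e["tag"] == tag:
                e["done"] = True

    def commit(self):
        committed = []                  # in-order commit from the head only
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["tag"])
        return committed

rob = ReorderBuffer()
for t in ["divd", "addd", "subd"]:
    rob.dispatch(t)
rob.complete("subd")        # subd finishes first...
early = rob.commit()        # ...but cannot commit past the unfinished divd
rob.complete("divd")
rob.complete("addd")
order = rob.commit()
```

Even though subd finishes first, nothing commits until divd is done, so the architectural state is always updated in program order.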
Dynamically Scheduled ProcessorDynamically Scheduled Processor
InstructionFetch & Decode
Unit
Computation advances independently from machine
state updatesUnit
Reservation Reservation Reservation ReservationReservationStn.
L d/StRegister File
B h
ReservationStn.
ReservationStn.
ReservationStn.
Load/StoreUnitFP UnitALU
gBranchUnit
M hi t t i
CommitUnit
Machine state is updated in order
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically22
Unit
Reorder Buffer

- From the Fetch & Decode Unit: (1) fetched-operation tags in original order, (2) destination register or address, and (3) PC.
- From all Execution Units: (1) tags of the executed operations and (2) the corresponding results.
- From the Commit Unit (Reorder Buffer) to the Register File and Memory: for each instruction, in the original fetch order, (1) destination register or address and (2) the value to write.
Reordering Instructions at Writeback

Needs a reorder buffer in the Commit Unit (filled from the F&D Unit, completed from the EUs; committed head entries go to MEM and the RF):

            PC            Tag    Register/Address   Value         Done
    head →  0x1000 0004   –      $f3                0x627f ba5a   1
            0x1000 0008   ALU1   0xa87f b351        ???           0
    tail →  0x1000 000c   MUL2   $f5                ???           0
Origins of Reordering

Paper by Smith & Pleszkun on precise interrupts, 1988.

[Figure. Source: Smith & Pleszkun, © IEEE 1988]
Second Step: Dynamic SchedulingDynamic Scheduling
IF ID EX1 EX2 WB Cycles1 IF ID EX1 EX2 WBIF ID EX1 EX2 EX3 EX4 EX5 MEM
IF ID EX1 EX2
Cycles1:
2:
3: WBIF ID EX1 EX2
EX2IF ID EX1 MEM WB
IF ID EX1 MEMruct
ions
3:
4:
5:
WB
Inst
r
EX2IF ID EX16:
Tangible amount of ILP now possibleWhat’s next?!
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically26
ILP So Far…

Instructions vs. cycles: standard → pipelining → dynamic scheduling.
Superscalar Execution

Why not have more than one instruction beginning execution (issued) per cycle? The key requirements are:
- Fetching more instructions per cycle: no big difficulty, provided that the instruction cache can sustain the bandwidth.
- Deciding on data and control dependencies: dynamic scheduling already takes care of this.
Superscalar ProcessorSuperscalar Processor
InstructionFetch & Decode Unit
(Multiple Instructions per Cycle)Multiple Buses
ReservationStn
ReservationStn
ReservationStn
ReservationStn
Multiple Buses
Stn.Stn.
FP UnitALU 1 Load/Store
Stn.
Branch
Stn.
ALU 2 FP UnitALU 1Register File
/UnitUnitALU 2
Commit Unit(Multiple Instructions per Cycle)
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically29
Multiple Buses
Third Step: Superscalar Execution

[Pipeline diagram: several instructions are fetched, decoded, executed, and committed per cycle, with execution latencies of different lengths.]
Several Steps in Exploiting ILP

Instructions vs. cycles: standard → pipelining → dynamic scheduling → superscalar.
References on ILP

- AQA, 5th ed., Appendix C
- CAR, Chapter 4—Introduction
- J. E. Smith and A. R. Pleszkun, Implementation of Precise Interrupts in Pipelined Processors, IEEE Transactions on Computers, 37(5):562-73, May 1988
Register Renaming
(How Do I Get Rid of WAR and WAW?!…)
Register Renaming

Importance of removing WAR and WAW dependences with "close-to-ideal" instruction windows (2K entries) and maximum issue rate (64 per cycle).

[Figure. Source: AQA, © Morgan Kaufmann 1996]
A Little History of (Modern) Renaming

[Timeline figure. Source: Sima, © IEEE 2000]

First: IBM 360/91 (1967, partial renaming of the FP registers).
Main Dimensions in Renaming Policies

1. Scope of register renaming: simple, only some classes of registers are renamed (e.g., integer or FP only)
2. Layout of the renamed registers: where are they?
3. Method of register mapping: allocation, tracking, and deallocation
4. Rename rate: how many instructions can be renamed at once?
Where Are the Rename Registers?

Four possibilities:
1. Merged rename and architectural RF
2. Split rename and architectural RFs
3. Renamed values in the reorder buffer
4. Renamed values in the reservation stations (a.k.a. shelving buffers)
Four Possible Locations for Rename Registers

[Figure. Source: Sima, © IEEE 2000]
Dynamically Scheduled Processor

[Block diagram as before, with the Register File now split into Architectural Registers and Rename Registers; the Commit Unit moves values from the rename registers to the architectural ones.]
Typical ROB

Contents (filled from the F&D Unit, completed from the EUs; committed head entries go to MEM and the RF):

            PC            Tag    Register/Address   Value         Done
    head →  0x1000 0004   –      $f3                0x627f ba5a   1
            0x1000 0008   ALU1   0xa87f b351        ???           0
    tail →  0x1000 000c   MUL2   $f5                ???           0
Possible States of each Registerin a Merged Filein a Merged File
EEE
2000
e:Si
ma,
© I
ESo
urce
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically41
State Transitions in a Merged FileState Transitions in a Merged File
Initialisation: Initialisation: First N registers are “AR”, others “Available”
1. Available Renamed Invalid Instruction enters the Reservation Stations and/or the ROB:
register allocated for the result (i.e., register uninitialised)2 Renamed Invalid Renamed Valid2. Renamed Invalid Renamed Valid
Instruction completes (i.e., register initialised)3. Renamed Valid Architectural Registerg
Instruction commits (i.e., register “exists”)4. Architectural Register Available
h h ( Another instruction commits to the same AR (i.e., register is dead)
5. Renamed Invalid and Renamed Valid Available
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically42
Squashing
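The lifetime above forms a small state machine, which can be written down directly. A sketch with made-up state and event names encoding the five transitions:

```python
# Hypothetical encoding of the merged-file register states; the event names
# (allocate, complete, commit, overwrite, squash) are illustrative labels
# for transitions 1-5 above.
TRANSITIONS = {
    ("Available",      "allocate"):  "RenamedInvalid",
    ("RenamedInvalid", "complete"):  "RenamedValid",
    ("RenamedValid",   "commit"):    "ArchReg",
    ("ArchReg",        "overwrite"): "Available",   # later commit to the same AR
    ("RenamedInvalid", "squash"):    "Available",
    ("RenamedValid",   "squash"):    "Available",
}

def run(state, events):
    """Drive one physical register through a sequence of events."""
    for ev in events:
        state = TRANSITIONS[(state, ev)]
    return state

# Full lifetime: allocated, written, committed, eventually recycled.
lifetime = run("Available", ["allocate", "complete", "commit", "overwrite"])
# Misspeculated instruction: register goes straight back to the free pool.
squashed = run("Available", ["allocate", "squash"])
```

Both paths end in "Available", which is exactly why a merged file never leaks physical registers: every allocation is balanced by either an overwrite at commit or a squash.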
Tracking the Mapping: Where Is an Architectural Register, Physically?

Either the mapping is kept in a dedicated mapping table, or the mapping is kept in the rename buffer itself.

[Figure. Source: Sima, © IEEE 2000]
Reminder: Reservation Stations

A reservation station checks that the operands are available (RAW) and that the Execution Unit is free (structural hazards), then starts execution.

Entries (filled from the F&D Unit; tags resolved from the EUs and RF):

          Op     Tag1   Tag2   Arg1          Arg2
    ALU1: addd   –      MUL3   0xa87f b351   ???
    ALU2: subd   ALU1   –      ???           0xffff fee1
    ALU3: (free)
MIPS R10000: Merged RF with Mapping Table

[Figure. Source: Yeager, © IEEE 1996]

Remark the complexity of the Mapping Table: a 4-issue processor (4× the above scheme in parallel) needs 16 parallel accesses: 16 read ports and 4 write ports!
State Transitions Replaced by Copying in Stand-alone RRFCopying in Stand-alone RRF
Initialisation: Initialisation: All Rename Registers are “Available”
1. Available Renamed Invalid1. Available Renamed Invalid Instruction enters the Reservation Stations and/or the ROB:
register allocated for the result (i.e., register uninitialised)
2. Renamed Invalid Renamed Valid Instruction completes (i.e., register initialised)
3 d l d l bl3. Renamed Valid Available Instruction commits (i.e., register “exists”) Value is copied in the Architectural RF Value is copied in the Architectural RF
4. Renamed Invalid and Renamed Valid Available Squashing (no copy to the Architectural RF)
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically46
Squashing (no copy to the Architectural RF)
State of the Rename Registers in the Commit Unit (ROB)the Commit Unit (ROB)
fromF&D Unit
Register Address ValueTag
from EUs
PC
00
Register Address ValueTagPC
to MEMand RF
011
— $f3 0x627f ba5a
ALU1 ???0 87f b351
head
0 1000 0008
0x1000 0004
11
ALU1 ???$f5MUL2 ???
0xa87f b351
tail
0x1000 000c
0x1000 0008
0tail
R d I lidRenamed Valid
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically47
Renamed InvalidRenamed ValidAvailable
MIPS R10000: 32 AR, 64 PhR, Merged Register File

[Figure. Source: Yeager, © IEEE 1996]

- Mapping Table: AR → PhR for each destination fD
- Free Register Table: up to 32 empty PhR
- Status Table: marks invalid PhR
- I-Queue / Reservation Station: holds the new PhR assigned to hold fD
- ROB: holds the previous PhR that held fD, and the fD AR
MIPS R10000:Information FlowInformation Flow
1. Available Renamed Invalida ab a d a d Read new PhR from top of Free Register Table Create new mapping LogDest Dest in the Mapping Table Set corresponding Busy-Bit (=invalid) in the Status Table Set corresponding Busy Bit (=invalid) in the Status Table
2. Renamed Invalid Renamed Valid Write PhR Dest indicated in the I-Queue R t di B Bit ( lid) i th St t T bl Reset corresponding Busy-Bit (=valid) in the Status Table Mark as Done in the corresponding entry in the ROB
3. Renamed Valid Architectural Register Implicit (removal of historical mapping LogDest Dest)
4. Architectural Register Available Free PhR indicated by OldDest in the entry removed from the ROBd a d by O d y o d o O
5. Renamed Invalid and Renamed Valid Available Restore mapping from all squashed ROB entries (from tail to head) as
LogDest Dest
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically49
LogDest Dest Reset corresponding Busy-Bit (=valid) in the Status Table
How Many Rename Registers?How Many Rename Registers?
In-Flight instructions: In Flight instructions:
STLDEURSflightin NNNNN
Rename Registers:
f g
ROB size:
LDEURSrename NNNN ROB size:
fli htiROB NN flightinROB NN
Note: if strictly < then structural stalls can occur
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically50
Note: if strictly < then structural stalls can occur
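Plugging illustrative numbers into these sizing rules makes them concrete. The machine parameters below are made up for the example, not taken from any real processor:

```python
# Hypothetical machine: 36 reservation-station slots, 8 execution-unit
# slots, 16 load-buffer and 16 store-buffer entries.
n_rs, n_eu, n_ld, n_st = 36, 8, 16, 16

n_in_flight = n_rs + n_eu + n_ld + n_st  # every place an instruction can sit
n_rename    = n_rs + n_eu + n_ld         # stores need no result register
n_rob       = n_in_flight                # one ROB entry per in-flight instruction
```

Sizing the rename file or the ROB below these bounds does not break correctness, but, as the note says, it introduces structural stalls whenever the smaller resource fills up first.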
Number of Rename RegistersNumber of Rename Registers
EEE
2000
e:Si
ma,
© I
ESo
urce
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically51
Actual Choices in Commercial ImplementationsImplementations
000
ma,
© I
EEE
2
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically52 Sour
ce:
Sim
Current High End
Source: Microprocessor Report, © Cahners 2009
High-End Processors
No renaming only in the UltraSparc: the use of register windows for the integers makes it too difficult to implement renaming (and there is no OOO either!). Nor in Itanium, of course…
References on Register RenamingReferences on Register Renaming
AQA 5th ed Appendix C and Chapter 3 AQA 5 ed., Appendix C and Chapter 3 PA, Sections 6.3, 6.4, and 6.5 CAR Chapter 5—Introduction CAR, Chapter 5—Introduction D. Sima, The Design Space of Register Renaming
Techniques IEEE Micro (20):5 Sept -Oct 2000Techniques, IEEE Micro, (20):5, Sept. Oct. 2000 K. C. Yeager, The MIPS R10000 Superscalar
Microprocessor, IEEE Micro, 16(2):28-40, April 1996c op ocesso , c o, 6( ) 8 0, p 996
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically54
Prediction and Speculation
(Don't Know It? Don't Wait but Guess…)
Prediction & Speculation: The Idea

Some operation takes awfully long? The processor needs the result to proceed?
- To fetch the next instruction, one needs to know which one must be fetched.
- To perform a computation, one needs the operands.

Don't wait!!!
1. Make a guess (⇒ predict), and
2. Proceed tentatively (⇒ speculate)
General ProblemsGeneral Problems
1 How do I make a good guess?1. How do I make a good guess? Remember some history…
2 Wh t d I d if th ?2. What do I do if the guess was wrong? Undo speculatively-executed instructions (“squash”) May cost nothing—e.g.,
Squash some results
M t thi May cost something—e.g., Empty pipelines Restore saved state Restore saved state Execute compensation code
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically57
Branch PredictionBranch Prediction
The idea:The idea:Start fetching and decoding beforeknowing the outcome of a branch
Rudimentary form of speculationRudimentary form of speculationGuess if branch will be taken or notT i PC d d ffTentative PC advancements do not affect
machine state (no execution) Backing up is easy…
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically58
Branch PredictionBranch Prediction
Branch outcome and additional info
Predicted direction(Taken/Not Taken)
BranchPredictor
LogicLogic
Current PCPredicted target address
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically59
Branch Target BuffersBranch Target Buffers
PC
031CAM
031
Target AddressBranch Address0xa123 fee4
…One needs to know if a just fetched and yet
0x1234 56780x1235 ef5a
……
…
just fetched and yet undecoded instruction
is a branch and what is the destination
……
…
……
…(computed branch, relative address,
return, etc.)……
…
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically60
……
More Complex But CheaperBranch Target BuffersBranch Target Buffers
PC (Branch Address)
07831Typical Cache/TLB
i ti07831 organisations
0xa123 fee40x1234 560x1235 ef
Target Addr.Tag (31..8)0x7834 38470x5678 23
0x1235 78
Target Addr.Tag (31..8)0000 0000:
0000 0001 ……
…0x1235 ef
…
…
……
…
0x1235 78…
…
0000 0001:
0000 0011:
……
…
……
…
……
…
……
…
1111 1110:
1111 1111:
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically61
= Etc.
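The index/tag split can be sketched in a few lines. This is a toy direct-mapped BTB under simplifying assumptions (integer PCs, a power-of-two table, no set associativity), not the organisation of any particular processor:

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: low PC bits index the table, the
    remaining high bits form the tag that disambiguates aliases."""
    def __init__(self, n_entries=256):
        self.n = n_entries
        self.table = [None] * n_entries   # each entry: (tag, target)

    def _split(self, pc):
        return pc // self.n, pc % self.n  # (tag, index)

    def update(self, branch_pc, target):
        """Record a resolved branch (possibly evicting an aliasing one)."""
        tag, idx = self._split(branch_pc)
        self.table[idx] = (tag, target)

    def lookup(self, pc):
        """At fetch time: predicted target, or None if not a known branch."""
        tag, idx = self._split(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

btb = BranchTargetBuffer()
btb.update(0x12345678, 0xA123FEE4)
hit  = btb.lookup(0x12345678)
miss = btb.lookup(0x12345678 + 0x100)  # same index, different tag: no hit
```

Compared with the CAM version, only the stored tag bits are compared, which is why this organisation is cheaper than a fully associative search.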
Which Strategy to Predict?Which Strategy to Predict?
Static predictions: ignore history Static predictions: ignore history1. Never-taken or always-taken2 Al t k b k d ( l )2. Always-taken-backward (e.g., loops)3. Compiler-specified, etc. Still a form of dynamic control speculation,
because the squashing process is done in h dhardware
Dynamic prediction: learn from history Record how often a branch was taken in the
past
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically62
Which Strategy to Predict Dynamically?Dynamically?
1 Same outcome as last time1. Same outcome as last time Keep one bit of history per recently visited branches
Needs an associative memory expensive
Keep one bit of history per hashed address Needs only a RAM inexpensive Different branches alias mistakes but we are only guessing Different branches alias mistakes, but we are only guessing,
anyway…
2. Same outcome as last few times (hysteresis)( y ) Keep a two-bit saturating history counter per hashed
address; use sign as a predictor Tuned to for loops: one misprediction is normal (last iteration) Tuned to for loops: one misprediction is normal (last iteration)
and should not modify the successive prediction (first iteration of a new execution)
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically63
Branch History TableBranch History Table
PC (Branch Address)
07831 One-bit Prediction
Two-bit Prediction
0
07831
not taken 01 not taken0000 0000:
11
taken
taken
1011
taken
taken
0000 0001:
0000 0010:
0
1…
not taken OR 01
10
…
not taken0000 0011:
10
taken
not taken
1000
taken
not taken
1111 1110:
1111 1111:
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically64
Two-bit Prediction SchemeTwo-bit Prediction Scheme
1998
Strong Prediction Weak Prediction
Kauf
fman
©
Mor
gan
urce
: CO
D,
So
Slightly modified saturating two-bit counterTwo mispredictions Strong reversal
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically65
p g(e.g., UltraSPARC-I)
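The behaviour of the plain saturating counter on a loop branch is easy to demonstrate. A minimal sketch (the classic counter, not the modified UltraSPARC-I variant; the loop pattern is illustrative):

```python
class TwoBitPredictor:
    """Classic saturating two-bit counter: states 0-1 predict not taken,
    states 2-3 predict taken."""
    def __init__(self, state=3):               # start in "strongly taken"
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        """Record the real outcome; return whether the prediction was right."""
        correct = self.predict() == taken
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)
        return correct

# A loop branch: taken 9 times, then not taken once (loop exit), run twice.
pred = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 2
mispredictions = sum(not pred.update(t) for t in outcomes)
```

Only the two loop exits mispredict: thanks to the hysteresis, the single not-taken outcome does not flip the prediction for the first iteration of the next execution, exactly the for-loop tuning described above.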
Prediction AccuracyPrediction Accuracy
1996
Kauf
fman
tion
s
© M
orga
n
pred
ict
urce
: AQ
A,
Mis
p
So
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically66
More Complex StrategiesMore Complex Strategies…
Important novelty in speculative techniques: One does Important novelty in speculative techniques: One does not care to be always right but just to be right most of the time!
Cheap but clever techniques can be used…
Pattern History Table (PHT):
Exploit correlation:
ffm
an 1
996
Pattern History Table (PHT): n-bit standard predictors
(m,n) Branch Predictor BufferA global m-bit predictor uses the
outcome of the last two branches to
© M
orga
n Ka
uf
select one among four different predictors
Sour
ce:
AQA,
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically67
Branch History Register (BHR): m-bit shift register
Return Address StackReturn Address Stack
Special elementary case of branch prediction:Special elementary case of branch prediction:Small stack (e.g., 8-16 values)Each call (CALL JAL etc ) pushes a valueEach call (CALL, JAL, etc.) pushes a valueEach return (RET, JP $ra, etc.) pops a predicted
return addressreturn address
Functionally identical to the “real” stack but avoids any SP manipulation memory accessesavoids any SP manipulation, memory accesses, argument bypassing, etc.
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically68
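A sketch of the mechanism, with a made-up overflow policy (real designs vary; here the oldest entry is silently dropped when the stack is full):

```python
class ReturnAddressStack:
    """Small fixed-depth RAS; on overflow the oldest entry is dropped."""
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def call(self, return_pc):
        """On a call instruction: push the address of the next instruction."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overwrite the oldest entry
        self.stack.append(return_pc)

    def ret(self):
        """On a return: pop the predicted return address (None = no prediction)."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.call(0x400100)   # outer call
ras.call(0x400200)   # nested call
first  = ras.ret()   # nested return predicted first
second = ras.ret()   # then the outer one
```

As long as calls and returns nest properly and the depth is not exceeded, every return is predicted perfectly, which a BTB alone cannot do for functions called from several sites.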
Dynamic Control SpeculationDynamic Control Speculation
The next step after Dynamic Branch Prediction isThe next step after Dynamic Branch Prediction is Control SpeculationLimited scope if one limits to fetch and decodeLimited scope if one limits to fetch and decode
speculativelyMore aggressively one could execute tooMore aggressively, one could execute too…
If instructions are issued and execute before the branch target is known one needs to avoidbranch target is known, one needs to avoid changing the state of the processor until the correctness of the prediction is assessed orthe correctness of the prediction is assessed or to restore the state preceding the branch
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically69
Remember Reordering Problem?Remember Reordering Problem?
Fundamental observation: a processor canFundamental observation: a processor can do whatever it wants provided that it gives the appearance of sequential execution (i e theappearance of sequential execution (i.e., the architectural machine state is updated in program order)program order)
Idea was to commit in order, so that one could t d thi t t d i fpretend something was not executed in case of
an exception in previous instructionsPrediction and Speculation?!
Yes! Prediction: “no exceptions will occur”
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically70
p
Double Use of the Reorder BufferDouble Use of the Reorder Buffer
OOO execution was hidden from the userOOO execution was hidden from the user by preventing instruction to commit until in order execution would have taken placein-order execution would have taken placeIf there is an exception, one squashes all
uncommitted instructions in the reorder bufferuncommitted instructions in the reorder buffer which followed the instruction which raised the exceptionthe exception
One can easily extend this functionality to squash all uncommitted instructionssquash all uncommitted instructions following a mispredicted branch!
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically71
Commit Unit Now Includes Also BranchesBranches
Something more in the Commit Unit to check branch outcomes Something more in the Commit Unit to check branch outcomes
from from EUsF&D Unit
0
Register Address ValueTagPC
to MEM
00
head to MEMand RF1
1
— $f3 0x627f ba5a
BR1 ???0x2012 1111
head
0x1000 0008
0x1000 0004
10
$f5MUL2 ???tail
0x2012 1111
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically72
Misprediction Has High Cost Lots of Efforts in Improving Accuracyof Efforts in Improving Accuracy
Pipelines become more and more deep (e g upPipelines become more and more deep (e.g., up to 22-24 cycles in Pentium 4)
I idth (t i ll 3 8)Issue width grows (typically 3-8)Large number of in-flight instructions (up to
100-200!)Many predicted branches in-flight at oncey p gProbability of executing speculatively something
useful reduces quicklyuseful reduces quickly
predall _
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically73
i
itot pp
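The compounding effect is worth computing once; the accuracy and branch count below are illustrative, not measurements:

```python
def prob_all_useful(p_pred, n_branches):
    """p_tot = p_pred ** i: probability that all i in-flight branch
    predictions are correct, assuming independent predictions."""
    return p_pred ** n_branches

# Even a 95%-accurate predictor struggles with ~20 unresolved branches
# in flight: most of the deepest speculative work is wasted.
p = prob_all_useful(0.95, 20)
```

With 0.95 accuracy and 20 branches in flight, only about a third of the time is everything in flight on the correct path, which is why high-end designs invest so heavily in predictor accuracy.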
Current High-End Processors

[Table. Source: Microprocessor Report, © Cahners 2009]
Speculation Is Not Necessarily a Run-Time ConceptRun-Time Concept
Dynamic: in hardware no interactionDynamic: in hardware, no interaction whatsoever from the compilerBinary code is unmodified
Static: in software, planned beforehandStatic: in software, planned beforehand by the compilerBi d i itt i h t dBinary code is written in such a way as to do
speculation (with or without some hardware t i th ISA)support in the ISA)
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically75
Static Control SpeculationExampleExample
We need to compute:We need to compute:if (A==0) A=B; else A=A+4;
I blIn assembly:LW R1, 0(R3) ; load ABNEZ R1, L1 ; test A, possibly skip thenBNEZ R1, L1 ; test A, possibly skip thenLW R1, 0(R2) ; ‘then’ clause: load BJ L2 ; skip else
L1: ADD R1, R1, 4 ; ‘else’ clause: compute A+4L2: SW 0(R3), R1 ; store new A
If we know that the ‘then’ clause is almostIf we know that the then clause is almost always executed, can we optimise this code?
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically76
Static Control SpeculationExampleExample
We could speculatively start earlier to load B intoWe could speculatively start earlier to load B into another register and, if needed, squash the value with the right oneg
In assembly:LW R1, 0(R3) ; load ALW R14, 0(R2) ; speculative load BBEQZ R1, L3 ; test A, possibly skip elseADD R14, R1, 4 ; ‘else’ clause: compute A+4, , p
L3: SW 0(R3), R14 ; store new A
Advantages: now we load B while the test is f d ( i ll l)performed ( in parallel)
Any problem? As usual: exceptions…
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically77
Exceptions and SpeculationExceptions and Speculation
Some ways to handle exceptions inSome ways to handle exceptions in speculative execution:Static renaming: Hardware and operating
systems cooperatively ignore exceptionsPoison bits: Mark results as speculative and
delay exception at first usey pSpeculative instructions: Mark instruction as
speculative and do not commit the result untilspeculative and do not commit the result until speculation is solved
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically78
Static Renaming and Hardware-Software CooperationSoftware Cooperation
Back to our example: Back to our example:LW R1, 0(R3) ; load ALW R14, 0(R2) ; speculative load BBEQZ R1, L3 ; test A, possibly skip elseADD R14, R1, 4 ; ‘else’ clause: compute A+4
L3: SW 0(R3), R14 ; store new AL3: SW 0(R3), R14 ; store new A
OS “helps” with two policies: Nonterminating exceptions (e.g., Page Fault): resume g p ( g , g )
independently from speculativeness performance penalty, but execution ok
Terminating exceptions (e g Divide by Zero): ignore and return Terminating exceptions (e.g., Divide by Zero): ignore and return an undefined value if it was speculated, it will be unused
Problem: nonspeculative terminating exceptions?
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically79
p g p
Static Renaming and Poison Bits

Special marker for speculative instructions:
LW R1, 0(R3)      ; load A
LW* R14, 0(R2)    ; speculative load B
BEQZ R1, L3       ; test A, possibly skip else
ADD R14, R1, 4    ; ‘else’ clause: compute A+4
L3: SW 0(R3), R14 ; store new A; report exceptions

The processor knows the load is speculative: if it raises a terminating exception, it turns on R14’s Poison Bit and suppresses the exception
The add, if executed, resets R14’s Poison Bit
When R14 is used, a deferred terminal exception is raised if its Poison Bit is set
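The poison-bit mechanism can be modelled in a few lines (toy sketch of the semantics above, not any real ISA; a missing address stands in for a faulting load):

```python
class DeferredFault(Exception):
    """Raised at first use of a poisoned register."""

class Regs:
    def __init__(self):
        self.value, self.poison = {}, {}

    def spec_load(self, rd, mem, addr):
        """LW*: a faulting speculative load poisons rd instead of trapping."""
        if addr in mem:
            self.value[rd], self.poison[rd] = mem[addr], False
        else:
            self.value[rd], self.poison[rd] = None, True   # suppress exception

    def write(self, rd, v):
        """A normal write (e.g., the ADD of the 'else' clause) clears the poison."""
        self.value[rd], self.poison[rd] = v, False

    def use(self, rs):
        """First real use: raise the deferred exception if rs is poisoned."""
        if self.poison.get(rs):
            raise DeferredFault(f"deferred exception on R{rs}")
        return self.value[rs]
```

If the speculative value is never used (the ‘else’ clause overwrote it), the suppressed exception vanishes, exactly as intended.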
Example: Speculative Loads in Itanium (I)

Goal: move loads as early as possible, even speculatively before preceding branches (i.e., without being sure they are really needed)

<some code>
(p1) br.cond somewhere
// ------ barrier
ld r1 = [r2]                    Exceptions?
<some code using r1>

ld r1 = [r2]             // load could be speculated
                         // if old value of r1 not needed
<some code>              // <- neither here nor
(p1) br.cond somewhere   // in “somewhere”
// ------ barrier
<some code using r1>     // but…
Example: Speculative Loads in Itanium (II)

Speculative loads and deferred exceptions → explicit compiler-generated fix-up code

ld.s r1 = [r2]          // speculative loads do not raise
                        // exceptions but mark the register
                        // with the additional NaT bit
<some code>
<some code using r1>    // NaT is propagated in further
                        // calculations, which also
                        // defer exceptions
(p1) br.cond somewhere
// ------ barrier
<some more code using r1>
chk.s r1, fix_code_r1   // call exception handler if needed
                        // to fix up execution
Predication (= Guarded) Execution

A special form of static control speculation?
“I cannot make a good prediction? I will avoid gambling and will do both”

A bit more than that: removes the control-flow change altogether

Not always a good idea: compiler trade-off
(Almost) free if one uses execution units which were not used otherwise (e.g., because of limited ILP)
Not free at all in the general case: more than needed is always executed
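In source terms, if-conversion computes both arms and selects by the predicate; a minimal hypothetical sketch of the trade-off (functions invented for illustration):

```python
def with_branch(a, p):
    # Control flow: only one arm executes, but the branch may mispredict.
    if p:
        return a + 4
    return a * 2

def if_converted(a, p):
    # Predication: both arms always execute, then a predicated select.
    # No control-flow change, but extra work when ILP is limited.
    t_then = a + 4          # executed unconditionally
    t_else = a * 2          # executed unconditionally
    return t_then if p else t_else
```

Both versions compute the same result; the predicated one trades wasted work for the absence of a branch.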
Weak- and Strong-Dependence Models

The typical model for data dependences is:
No dependence (B does not depend on A)
Strong dependence (B depends on A)
If A and B are executed OOO, one must be always right about the dependence

Weak-dependence model:
A dependence can be temporarily violated (predicted negative and speculated) if means are in place to recover correct execution
If A and B are executed OOO, one must be mostly right about the dependence
Dynamic Data Dependence Prediction

Examples:
Memory Dependence Prediction: predicts whether a load is dependent on a pending store (memory aliasing or disambiguation)
Alias Prediction: predicts which pending store contains the right value for a load

[Diagram: a load among pending reads and writes to memory — does any earlier pending store write the same location, and if so, which one?]
Static Data Dependence Speculation (I)

Potential RAW dependences through memory must be conservatively assumed to be real dependences → loss of useful reordering possibilities
Goal: move loads as early as possible, even speculatively before preceding stores (i.e., without being sure that the value is right)

<some code>
st [r3] = r4
// ------ barrier
ld r1 = [r2]             NO!
<some code using r1>

ld r1 = [r2]             // load could be speculated…
<some code>
st [r3] = r4             // …but if r2 == r3, r1 is WRONG!
// ------ barrier
<some code using r1>
Static Data Dependence Speculation (II)

Speculative loads get executed but mark the destination register as “speculatively” loaded and track subsequent stores for a conflict

ld.a r1 = [r2]          // speculative loads are normal
                        // but always mark the register
                        // with the additional NaT bit
<some code>
<some code using r1>    // NaT is propagated in further
                        // calculations
st [r3] = r4            // successive stores are checked
                        // to see if they rewrite locations
                        // which were object of speculative
                        // loads
// ------ barrier
<some more code using r1>
chk.a r1, fix_code_r1   // if a RAW dependence was violated,
                        // call special fix-up routine

Important advantage: loads (slow operations) can now be started earlier
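The hardware side of ld.a/chk.a can be pictured as an Advanced Load Address Table (ALAT); a toy model of the mechanism (illustrative sketch only, not the Itanium implementation):

```python
class ALAT:
    """ld.a records (register, address); each store invalidates matching
    entries; chk.a re-executes the load if its entry was killed."""

    def __init__(self, mem):
        self.mem = mem
        self.entries = {}               # reg -> watched address
        self.regs = {}

    def ld_a(self, rd, addr):
        """Advanced (speculative) load: load now, remember the address."""
        self.regs[rd] = self.mem[addr]
        self.entries[rd] = addr

    def store(self, addr, value):
        """Stores check the table: a matching address kills the entry."""
        self.mem[addr] = value
        for rd, a in list(self.entries.items()):
            if a == addr:               # RAW dependence was violated
                del self.entries[rd]

    def chk_a(self, rd, addr):
        """Check: if the entry was killed, fix up by reloading."""
        if rd not in self.entries:
            self.regs[rd] = self.mem[addr]
        return self.regs[rd]
```

When no intervening store aliases the load address, the early value stands and the load latency was fully hidden; otherwise chk.a pays the reload.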
Dynamic Data Value Prediction

Examples:
Source Operand Value Prediction: predict quasi-constant input operands
  Many values are constant during program execution
  History table recording the last value
Value Stride Prediction: predict constant increments across input operands
  History table recording the stride between the last two values
Load Addresses and Load Values
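Both predictors amount to small history tables indexed by instruction address; a minimal sketch (illustrative only — real predictors add confidence counters and finite, tagged tables):

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""
    def __init__(self):
        self.last = {}                  # pc -> last observed value

    def predict(self, pc):
        return self.last.get(pc)        # None = no prediction yet

    def update(self, pc, value):
        self.last[pc] = value

class StridePredictor:
    """Predicts last value + constant stride (e.g., array address streams)."""
    def __init__(self):
        self.hist = {}                  # pc -> (last value, stride)

    def predict(self, pc):
        if pc in self.hist:
            v, s = self.hist[pc]
            return v + s
        return None

    def update(self, pc, value):
        if pc in self.hist:
            v, _ = self.hist[pc]
            self.hist[pc] = (value, value - v)   # stride from last two values
        else:
            self.hist[pc] = (value, 0)
```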
Types of Prediction & Speculation

                    Dynamic (by the hardware)       Static (by the compiler)
Exceptions          OOO execution and reordering    Imprecise exceptions in DBT
                                                    (e.g., Transmeta Crusoe)
Control             Branch Prediction (AQA, 4.3)    Trace Scheduling (AQA, 4.4)
                                                    Hyperblocks
                                                    Predication (AQA, 4.6)
                                                    Speculative Loads (e.g., Itanium)
Data Availability   Virtual memory                  —
Data Dependence     Research?                       Advanced Loads (e.g., Itanium)
Data Value          Research                        Dynamic compilers (e.g., DyC, Calpa)

Not all of these are traditionally called “speculation”!
References on Prediction & Speculation

AQA 5th ed., Chapter 3 and Appendix H
PA, Sections 4.3 and 5.3
J. E. Smith, A Study of Branch Prediction Strategies, Eighth International Symposium on Computer Architecture (ISCA), 135-48, May 1981
Simultaneous Multithreading

(How do I fill my issue slots?!…)
Sources of Parallelism

Bit-level
  Wider processor datapaths (8 → 16 → 32 → 64…)
Word-level (SIMD)
  Vector processors
  Multimedia instruction sets (Intel’s MMX and SSE, Sun’s VIS, etc.)
Instruction-level
  Pipelining
  Superscalar
  VLIW and EPIC
Task- and application-levels…
  Explicit parallel programming
  Multiple threads
  Multiple applications…
Simple Sequential Processor

[Diagram: ops 1-3 execute strictly one after another, each occupying several cycles]
Pipelined Processor

[Diagram: ops 1-6 overlap in successive cycles on a single pipeline; op 6 is a branch]
Superscalar Processor

[Diagram: multiple functional units; ops 1-6 issue up to two per cycle, in order; op 6 is a branch]
OOO Superscalar Processor

[Diagram: ops issue out of order across the functional units (e.g., ops 4 and 5 ahead of op 3); op 6 is a branch]
Speculative Execution

[Diagram: ops 7-10 execute speculatively beyond the branch (op 6) before it resolves, alongside nonspeculative ops 11-13]
Limits of ILP

[Diagram: the same speculative OOO pipeline, but many issue slots remain empty — dependences and limited speculation confidence leave functional units idle]
Sources of Unused Issue Slots

[Figure: breakdown of wasted issue slots by cause. Source: Tullsen et al., © IEEE 1995]
Horizontal and Vertical Waste

[Figure: issue-slot diagram distinguishing vertical waste (61% — entire cycles in which nothing issues) from horizontal waste (39% — empty slots in partially filled cycles). Source: Tullsen et al., © IEEE 1995]
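The two kinds of waste have a simple operational definition; a minimal sketch for a machine with `width` issue slots per cycle:

```python
def waste(issued_per_cycle, width):
    """Classify empty issue slots: a cycle issuing nothing contributes
    vertical waste; empty slots in a partially filled cycle contribute
    horizontal waste."""
    vertical = horizontal = 0
    for n in issued_per_cycle:
        if n == 0:
            vertical += width           # whole cycle lost
        else:
            horizontal += width - n     # cycle used, slots lost
    return vertical, horizontal
```

For instance, a 4-wide machine issuing 0, 2, 4, 1 instructions over four cycles wastes 4 slots vertically and 5 horizontally.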
Multithreading: The Idea

[Diagram: one deep instruction window speculating across many branches into the future vs. several shallower windows, one per thread, feeding the same issue stage]

Rather than enlarging the depth of the instruction window (more speculation with lowering confidence!), enlarge its “width” → fetch from multiple threads!
Basic Needs of a Multithreaded Processor

The processor must be aware of several independent states, one per thread:
  Program Counter
  Register File (and flags)
  (Memory)

Either multiple resources in the processor or a fast way to switch across states
Thread Scheduling

When does one switch threads? Which thread will run next?

Simple interleaving options:
Cycle-by-cycle multithreading
  Round-robin selection among a set of threads
Block multithreading
  Keep executing a thread until something happens:
    A long-latency instruction is found
    Some indication of scheduling difficulties
    A maximum number of cycles per thread has been executed
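The two policies can be contrasted in a few lines (toy sketch: threads are just instruction lists, and `long_latency` marks the instructions that would trigger a block switch):

```python
def fine_grain(threads):
    """Cycle-by-cycle: round-robin, one instruction per thread per turn."""
    pools = [list(t) for t in threads]
    schedule, i = [], 0
    while any(pools):
        if pools[i % len(pools)]:
            schedule.append(pools[i % len(pools)].pop(0))
        i += 1
    return schedule

def block(threads, long_latency):
    """Block multithreading: run one thread until a long-latency op appears,
    then switch to the next thread that still has work."""
    pools = [list(t) for t in threads]
    schedule, i = [], 0
    while any(pools):
        if not pools[i]:
            i = (i + 1) % len(pools)
            continue
        op = pools[i].pop(0)
        schedule.append(op)
        if op in long_latency:          # e.g., a load that misses in the cache
            i = (i + 1) % len(pools)
    return schedule
```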
Cycle-by-Cycle Interleaving (or Fine-Grain) Multithreading

[Diagram: a context switch every cycle; instructions from several threads alternate cycle by cycle across the functional units]
Block Interleaving (or Coarse-Grain) Multithreading

[Diagram: each thread runs for several consecutive cycles; context switches happen only on specific events]
Fundamental Requirement

The key issue in general-purpose processors which for many years prevented multithreaded techniques from becoming commercially relevant:

It is not acceptable that single-thread performance goes down significantly — or at all
Problems of Cycle-by-Cycle Multithreading

Null time to switch context → multiple Register Files
No need for forwarding paths if the threads supported are more than the pipeline depth! → simple(r) hardware
Fills short vertical waste well (other threads hide latencies ~ number of threads)
Fills long vertical waste much less well (the thread is rescheduled no matter what)
Does not significantly reduce horizontal waste (per thread, the instruction window is not much different…)
Significant deterioration of a single-thread job
Block Interleaving Techniques

Block Interleaving
  Static
    Explicit Switch
    Implicit Switch: Switch-on-Load, Switch-on-Store, Switch-on-Branch
  Dynamic
    Explicit Switch: Conditional-Switch
    Implicit Switch: Switch-on-Miss, Switch-on-Use (a.k.a. Lazy-Switch-on-Miss), Switch-on-Signal (interrupt, trap,…)

[Adapted from: Šilc et al., © Springer 1999]
Problems of Block Multithreading

Scheduling of threads is not self-evident:
  What happens to thread #2 if thread #1 executes perfectly well and leaves no gap?
  Explicit techniques require ISA modifications → bad…
More time is allowable for a context switch
Fills long vertical waste very well (other threads come in)
Fills short vertical waste poorly (if it is not sufficient to switch context)
Does not reduce horizontal waste at all, or almost
Simultaneous Multithreading (SMT): The Idea

[Diagram: in every cycle, issue slots are filled with instructions from several threads at once — both horizontal and vertical waste shrink]
Several Simple Scheduling Possibilities

Prioritised scheduling?
  Thread #0 schedules freely
  Thread #1 is allowed to use #0’s empty slots
  Thread #2 is allowed to use #0’s and #1’s empty slots, etc.

Fair scheduling?
  All threads compete for resources
  If several threads want the same resource, round-robin assignment
Superscalar Processor

[Block diagram: an Instruction Fetch & Decode Unit (multiple instructions per cycle) with a PC and an instruction queue (IQ); reservation stations feeding ALU 1, ALU 2, an FP Unit, a Load/Store Unit, and a Branch Unit; a Register File; multiple buses; and a Commit Unit (multiple instructions per cycle)]
Reorder Buffer

[Diagram: a circular buffer with head and tail pointers; each entry holds PC, Tag, Register, Address, and Value (e.g., an entry tagged ALU1 still waiting for its value, another holding $f3 = 0x627f ba5a); entries are filled from the F&D Unit and the EUs and drained to memory and the register file]

Register Renaming (between fetch/decode and commit)
What Must Be Added to a Superscalar to Achieve SMT?

F/D: multiple program counters (= threads) and a policy for the instruction fetch unit(s) to decide which thread(s) to fetch
Multiple or larger register file(s), with at least as many registers as logical registers for all threads
Commit: multiple instruction retirement (e.g., per-thread squashing)
No changes needed in the execution path
And also:
  Thread-aware branch predictors (BTBs, etc.)
  Per-thread Return Address Stacks
SMT Processor as a Natural Extension of a Superscalar

[Block diagram: the same superscalar datapath, but with multiple PCs and instruction queues in the Fetch & Decode Unit and multiple (or larger) register files]
Reorder Buffer Remembers the Thread of Origin

Some changes to the reorder buffer in the Commit Unit — e.g.:

[Diagram: each ROB entry now also carries a Thread field alongside PC, Tag, Register, Address, and Value, so entries from different threads coexist between head and tail]

Architectural Register Identifier: Reg # + Thread #
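The renaming consequence can be stated directly: the rename map is indexed by the pair (thread #, reg #) instead of the register number alone, so threads never alias each other’s registers. A toy model (illustrative only; a real map recycles a finite pool of physical registers):

```python
class RenameMap:
    def __init__(self):
        self.map = {}                   # (thread, areg) -> physical reg
        self.next_phys = 0

    def rename_dest(self, thread, areg):
        """Allocate a fresh physical register for a destination write."""
        self.map[(thread, areg)] = self.next_phys
        self.next_phys += 1
        return self.map[(thread, areg)]

    def lookup_src(self, thread, areg):
        """Source operands read the current mapping of (thread, reg)."""
        return self.map[(thread, areg)]
```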
Reservation Stations

Reservation stations do not need to know which thread an instruction belongs to
  Remember: operand sources are renamed — physical regs, tags, etc.

[Diagram: reservation-station entries (Op, Tag1, Tag2, Arg1, Arg2) feeding an ALU; entries from different threads — e.g., thread #1’s subd and addd alongside thread #2’s operands — coexist with no thread information]
Does It Work?!

[Figure: SMT throughput results. Source: Tullsen et al., © IEEE 1996]
Main Results on Implementability: SMT vs. Superscalar

From [TullsenJun96]:
  Instruction scheduling is not more complex
  Register-file datapaths are not more complex (but a much larger register file!)
  Instruction-fetch throughput is attainable even without more fetch bandwidth
  Unmodified caches and branch predictors are appropriate also for SMT
  SMT achieves better results than an aggressive superscalar
Where to Fetch?

Static solutions: round-robin
  Each cycle, 8 instructions from 1 thread
  Each cycle, 4 instructions from 2 threads, 2 from 4,…
  Each cycle, 8 instructions from 2 threads, forwarding as many as possible from #1; when a long-latency instruction appears in #1, pick the rest from #2

Dynamic solutions: check the execution queues!
  Favour threads with the minimal number of in-flight branches
  Favour threads with the minimal number of outstanding misses
  Favour threads with the minimal number of in-flight instructions
  Favour threads with instructions far from the queue head
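The third dynamic policy is essentially Tullsen’s ICOUNT heuristic; a minimal sketch (assuming a per-thread count of instructions currently in the decode/rename/issue queues):

```python
def pick_fetch_threads(inflight, k=2):
    """ICOUNT-style fetch policy: each cycle, fetch from the k threads
    with the fewest in-flight (queued) instructions, so fast-moving
    threads get fetch bandwidth and clogged ones do not."""
    order = sorted(inflight, key=inflight.get)   # fewest in-flight first
    return order[:k]
```

A thread that stalls accumulates queued instructions and automatically loses fetch priority until it drains.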
What to Issue?

Not exactly the same as in superscalars…
  In a superscalar: oldest is best (least speculation, more dependent ones waiting, etc.)
  In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads

One can think of many selection strategies:
  Oldest first
  Cache-hit speculated last
  Branch speculated last
  Branches first…

Important result: it doesn’t matter too much! → the issue logic (critical in superscalars) can be left alone
Importance of Accurate Branch Prediction in SMT vs. Superscalar

Reducing the impact of branch prediction was one of the initial qualitative motivations

Results from [TullsenJun96]:
  Perfect branch prediction advantage:
    25% at 1 thread
    15% at 4 threads
    9% at 8 threads
  Losses due to suppression of speculative execution:
    -7% at 8 threads
    -38% at 1 thread (→ speculation was a good idea…)
Bottlenecks: Sources of Unused Issue Slots

[Figure: bottleneck analysis. Source: Šilc et al., © Springer 1999]

Completion queue not very relevant (remember: this is out of the execution path…)
Rename register count important
Most critical: number of register writeback ports
→ SMT for utilisation rate (EUs), not bandwidth!…
Bottlenecks

Fetch and memory throughput are still bottlenecks
  Fetch: branches, etc.
  Memory not addressed

Performance vs. number of rename registers (8 threads), in addition to the architectural ones:
  Infinite: +2%
  100: ref.
  90: -1%
  80: -3%
  70: -6%

Register file access time is the likely limit to the number of threads

[Figure: IPC vs. number of threads, with 200 physical registers. Source: Tullsen et al., © IEEE 1996]
Performance (IPC) per Unit Cost

[Figure: IPC per unit cost vs. issue bandwidth. Source: Šilc et al., © Springer 1999]

Superscalars are cheap only for relatively small issue bandwidths, then quickly go down
SMT improves the picture significantly already with 2 threads, and the maximum moves to larger issue bandwidths with more threads
Introduction of SMT in Commercial Processors

Compaq Alpha 21464 (EV8)
  4T SMT
  Project killed June 2001
Intel Pentium IV (Xeon)
  2T SMT
  Available since 2002 (already there before, but not enabled)
  10-30% gains expected
SUN Ultra III
  2-core CMP, 4T SMT
Intel SMT: Xeon Hyper-Threading Pipeline

[Diagram: the pipeline front-end (TC hit or TC miss paths) and the OOO execution core, with resources labelled as duplicated, freely shared, split, or time-shared between the two threads. Source: Marr et al., © Intel 2002]
What happens when there is only one thread? WhatWhat happens when there is only one thread? What does the OS when there is nothing to do? Ahem… Four modes: Low-power, ST0, ST1, and MT Four modes: Low power, ST0, ST1, and MT
002
al.,
© I
ntel
20rc
e: M
arr
et a
Halt0
Halt1
Int1Int0
Halt1
Halt0
Int0Int1
Sour
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically128
Low-Power Mode
Halt0 Halt1
Intel SMT: Xeon Hyper-ThreadingGoals and ResultsGoals and Results
Minimum additional cost: SMT = approx. 5% area Minimum additional cost: SMT approx. 5% area No impact on single-thread performance
Recombine partitioned resources
Fair behaviour with 2 threads
tel2
002
r et
al.,
© I
ntSo
urce
: M
arr
© Ienne 2003-12AdvCompArch — Exploiting ILP Dynamically129
And Now, What’s Next?

Key ingredients for success so far:
  Maximise compatibility; no info from programmers beyond straight sequential code and coarse threads
  Aggressive prediction and speculation of anything predictable
  Use irregular, fine-grained parallelism (ILP): it is “easier” to extract, can be done at runtime,…

Problems:
  Branch prediction accuracy is hard to improve
  Hard to exploit ILP any further within a thread
SMT Research Directions

Dynamic Multithreading (DMT) [AkkaryNov98]
  Automatic generation of threads from loops (backward branches) and procedure calls
  Execution on a “typical” SMT microarchitecture
  Speculative execution of threads with interthread data dependence and value speculation

[Figure: DMT thread creation from loop and call boundaries. Source: Akkary et al., © IEEE 1998]
SMT Research Directions

What does Dynamic Multithreading represent?
  It introduces, from a single program, not only out-of-order Issue but now also out-of-order Fetch
  It lowers the dependence on correct Branch Prediction
  Experiments show good advantages even without more fetch bandwidth

[Figure: DMT results. Source: Akkary et al., © IEEE 1998]
References on Simultaneous Multithreading

AQA 5th ed., Chapter 3
PA, Sections 6.3, 6.4, and 6.5
CAR, Chapter 5 — Introduction
D. M. Tullsen et al., Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, ISCA, 1996
D. T. Marr et al., Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, Q1, 2002
H. Akkary and M. A. Driscoll, A Dynamic Multithreading Processor, MICRO, 1998