Computer Structure 2014 – Out-Of-Order Execution1
Computer Structure
Out-Of-Order Execution
Lihu Rappoport and Adi Yoaz
What’s Next
Goal: minimize CPU Time
CPU Time = clock cycle × CPI × IC
So far we have learned:
• Minimize clock cycle: add more pipe stages
• Minimize CPI: use a pipeline
• Minimize IC: architecture
In a pipelined CPU:
• CPI without hazards is 1
• CPI with hazards is > 1
Adding more pipe stages reduces the clock cycle but increases CPI:
• Higher penalty due to control hazards
• More data hazards
What can we do? Further reduce the CPI!
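The trade-off between clock cycle and CPI can be checked numerically. The sketch below applies the CPU Time formula to two made-up design points; the function name and all the numbers are illustrative assumptions, not measurements from the lecture.

```python
def cpu_time_ns(clock_cycle_ns, cpi, instruction_count):
    """CPU Time = clock cycle x CPI x IC, in nanoseconds."""
    return clock_cycle_ns * cpi * instruction_count


# Hypothetical numbers: the deeper pipeline has a shorter clock cycle
# but a higher CPI because hazards cost more cycles.
IC = 1_000_000
shallow = cpu_time_ns(clock_cycle_ns=1.0, cpi=1.2, instruction_count=IC)
deep = cpu_time_ns(clock_cycle_ns=0.8, cpi=1.4, instruction_count=IC)

print(f"shallow pipeline: {shallow:.0f} ns")
print(f"deep pipeline:    {deep:.0f} ns")
```

With these assumed values the deeper pipeline still wins, but only because the clock cycle shrinks faster than the CPI grows.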
A Superscalar CPU
Duplicating HW in one pipe stage won’t help
• e.g., with 2 ALUs, the bottleneck moves to other stages
Getting IPC > 1 requires fetching, decoding, executing, and retiring more than 1 instruction per clock:
IF ID EXE MEM WB
IF ID EXE MEM WB
The Pentium Processor
Fetches and decodes 2 instructions per cycle
Before register file read, decide on pairing: can the two instructions be executed in parallel?
Pairing decision is based on:
Data dependencies: the 2nd instruction must be independent of the 1st
Resources: U-pipe and V-pipe are not symmetric (to save HW)
• Common instructions can execute on either pipe
• Some instructions can execute only on the U-pipe
• If the 2nd instruction requires the U-pipe, it cannot pair
• Some instructions use resources of both pipes
[Figure: the IF and ID stages feed a pairing decision that steers instructions to the U-pipe and the V-pipe]
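The pairing rules above can be sketched as a small predicate. This is a simplified model under stated assumptions: the `Instr` type, the pipe labels "either", "u_only", and "both", and the choice to treat both read-after-write and write-after-write as blocking are illustrative, and the real Pentium rules have more cases.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read
    pipe: str = "either"     # hypothetical labels: "either", "u_only", "both"


def can_pair(first: Instr, second: Instr) -> bool:
    """Return True if `second` may issue in the V-pipe alongside `first`."""
    # Resource check: the first instruction takes the U-pipe, so the
    # second must be able to run on the V-pipe, and the first must not
    # need the resources of both pipes.
    if first.pipe == "both" or second.pipe != "either":
        return False
    # Data check: the second must not read or overwrite the first's result.
    if first.dest is not None:
        if first.dest in second.srcs or first.dest == second.dest:
            return False
    return True


add = Instr(dest="r1", srcs=("r2",))
independent = Instr(dest="r3", srcs=("r4",))
dependent = Instr(dest="r3", srcs=("r1",))
u_only = Instr(dest="r5", srcs=("r6",), pipe="u_only")

print(can_pair(add, independent))  # independent, common instruction
print(can_pair(add, dependent))    # reads r1 written by the first
print(can_pair(add, u_only))       # second needs the U-pipe
```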
Misprediction Penalty in a Superscalar CPU
MPI: misses per instruction
MPI = (# incorrectly predicted branches) / (total # of instructions) = MPR × (# predicted branches) / (total # of instructions)
MPI correlates well with performance, e.g., assume:
• MPR = 5%, %branches = 20% ⇒ MPI = 1%
• Without hazards IPC = 2 (2 instructions per cycle)
• Flush penalty of 5 cycles
We get:
• MPI = 1% ⇒ a flush every 100 instructions
• IPC = 2 ⇒ a flush every 100/2 = 50 cycles
• 5 cycles flush penalty every 50 cycles ⇒ 10% performance hit
For IPC = 1 we would get:
• 5 cycles flush penalty per 100 cycles ⇒ 5% performance hit
Flush penalty increases as the machine is deeper and wider
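The arithmetic of this example can be reproduced with a few lines. The sketch follows the slide's approximation (flush cycles relative to the useful cycles between two flushes); the function and parameter names are made up for illustration.

```python
def flush_overhead(mpr, branch_fraction, ipc, flush_penalty_cycles):
    """Return (MPI, performance hit) using the slide's approximation."""
    mpi = mpr * branch_fraction                 # misses per instruction
    instructions_between_flushes = 1.0 / mpi
    cycles_between_flushes = instructions_between_flushes / ipc
    return mpi, flush_penalty_cycles / cycles_between_flushes


mpi, hit_ipc2 = flush_overhead(mpr=0.05, branch_fraction=0.20, ipc=2, flush_penalty_cycles=5)
_, hit_ipc1 = flush_overhead(mpr=0.05, branch_fraction=0.20, ipc=1, flush_penalty_cycles=5)

print(f"MPI = {mpi:.2%}")
print(f"performance hit at IPC=2: {hit_ipc2:.1%}")
print(f"performance hit at IPC=1: {hit_ipc1:.1%}")
```

Doubling the width doubles the relative cost of the same flush penalty, which is the point of the last line of the slide.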
Extract More ILP
ILP: Instruction Level Parallelism
• A given program, executed on given input data, has a given parallelism
• Only independent instructions can execute in parallel
• If, for example, each instruction depends on the previous instruction, the ILP of the program is 1
  • Adding more HW will not change that
Adjacent instructions are usually dependent
• The utilization of the 2nd pipe is usually low
• There are algorithms in which both pipes are highly utilized
Solution: Out-Of-Order Execution
• Look for independent instructions further ahead in the program
• Execute instructions based on data readiness
• Still need to keep the semantics of the original program
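Data Flow Analysis
Example:
(1) r1 ← r4 / r7 ; assume divide takes 20 cycles
(2) r8 ← r1 + r2
(3) r5 ← r5 + 1
(4) r6 ← r6 - r3
(5) r4 ← r5 + r6
(6) r7 ← r8 * r4
Data Flow Graph (each edge is labeled by the register that carries the value):
• (1) → (2) through r1
• (3) → (5) through r5
• (4) → (5) through r6
• (2) → (6) through r8
• (5) → (6) through r4
[Figure: execution timelines of instructions (1) to (6) under in-order execution and under out-of-order execution]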
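The difference between the two timelines can be estimated with a small scheduling sketch. It rests on assumptions that are not in the slide: every instruction other than the divide takes 1 cycle, the out-of-order machine has unlimited execution resources and is limited only by true data dependencies, and the in-order machine starts at most one instruction per cycle in program order.

```python
# (destination, sources, latency in cycles); only the 20-cycle divide
# comes from the slide, the 1-cycle latencies are assumed.
PROGRAM = [
    ("r1", ("r4", "r7"), 20),  # (1) r1 <- r4 / r7
    ("r8", ("r1", "r2"), 1),   # (2) r8 <- r1 + r2
    ("r5", ("r5",), 1),        # (3) r5 <- r5 + 1
    ("r6", ("r6", "r3"), 1),   # (4) r6 <- r6 - r3
    ("r4", ("r5", "r6"), 1),   # (5) r4 <- r5 + r6
    ("r7", ("r8", "r4"), 1),   # (6) r7 <- r8 * r4
]


def schedule(program, in_order):
    """Return the cycle at which each instruction finishes."""
    ready_at = {}      # register -> cycle its newest value is available
    prev_start = -1    # start cycle of the previous instruction
    finish = []
    for dest, srcs, latency in program:
        # An instruction can start once all of its sources are ready.
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        if in_order:
            # In-order: also wait until the previous instruction has started.
            start = max(start, prev_start + 1)
        prev_start = start
        ready_at[dest] = start + latency
        finish.append(start + latency)
    return finish


print("in-order finish cycles:    ", schedule(PROGRAM, in_order=True))
print("out-of-order finish cycles:", schedule(PROGRAM, in_order=False))
```

In this model instructions (3), (4), and (5) complete under the shadow of the divide, so only (2) and (6) remain once the divide result arrives.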
OOOE – General Scheme
Fetch & decode instructions in parallel but in order
• Fill the instruction pool
Execute ready instructions from the instruction pool
• All source data ready + needed execution resources available
Once an instruction is executed
• Signal all dependent instructions that data is ready
Commit instructions in parallel but in order
• State change (memory, register) and fault/exception handling
[Figure: Fetch & Decode (in-order) fills the instruction pool; Execute (out-of-order) works from the pool; Retire (commit) drains it in-order]
Write-After-Write Dependency
(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2
If inst (3) is executed before inst (1), inst (1) writes r1 last, so r1 ends up holding a wrong value.
This is called a write-after-write false dependency.
Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3).
Write-After-Write (WAW) is a false dependency
• Not a real data dependency, but an artifact of OOO execution
Speculative Execution
About 1 in 5 instructions is a branch ⇒ continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path.
This is called “speculative execution”.
Computer Structure 2014 – Out-Of-Order Execution14
(7) r4r3+r1
(1) r1R9/17(2) r2r2+r1
(4) r3r3+r1(5) jcc L2
(6) L2 r135
Write-After-Read Dependency
(3) r123
(8) r32If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3.
Called write-after-read false dependency.
Write-After-Read (WAR) is a false dependencyNot a real data dependency, but an artifact of OOO execution
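The three kinds of dependency between an older and a younger instruction can be told apart by comparing their destination and source registers. The sketch below is a pairwise check only, and it ignores any writers in between; the tuple encoding and the function name are illustrative assumptions.

```python
def classify(older, younger):
    """Return the set of dependencies of `younger` on `older`.

    Each instruction is (dest, srcs): the register it writes (or None)
    and a tuple of the registers it reads.
    """
    o_dest, o_srcs = older
    y_dest, y_srcs = younger
    deps = set()
    if o_dest is not None and o_dest in y_srcs:
        deps.add("RAW")   # true dependency: younger reads older's result
    if y_dest is not None and y_dest in o_srcs:
        deps.add("WAR")   # false dependency: younger overwrites a source
    if o_dest is not None and o_dest == y_dest:
        deps.add("WAW")   # false dependency: both write the same register
    return deps


inst1 = ("r1", ("r9",))        # (1) r1 <- R9 / 17
inst3 = ("r1", ())             # (3) r1 <- 23
inst4 = ("r3", ("r3", "r1"))   # (4) r3 <- r3 + r1
inst7 = ("r4", ("r3", "r1"))   # (7) r4 <- r3 + r1
inst8 = ("r3", ())             # (8) r3 <- 2

print(classify(inst1, inst3))  # WAW on r1
print(classify(inst3, inst4))  # RAW on r1
print(classify(inst7, inst8))  # WAR on r3
```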
Register Renaming
Hold a pool of physical registers
• Map architectural registers into physical registers
When an instruction is allocated into the instruction pool (still in-order)
• Allocate a free physical register from the pool
• The physical register points to the architectural register
When an instruction executes and writes a result
• Write the result value to the physical register
When an instruction needs data from a register
• Read data from the physical register allocated to the latest instruction that writes to the same architectural register and precedes the current instruction
  • If no such instruction exists, read the value from the architectural register
When an instruction commits
• Copy the value from its physical register to the architectural register
Register Renaming
Instruction            Allocation   Renamed instruction
(1) r1 ← 17            r1:pr1       pr1 ← 17
(2) r2 ← r2 + r1       r2:pr2       pr2 ← r2 + pr1
(3) r1 ← 23            r1:pr3       pr3 ← 23
(4) r3 ← r3 + r1       r3:pr4       pr4 ← r3 + pr3
(5) jcc L2
(6) L2: r1 ← 35        r1:pr5       pr5 ← 35
(7) r4 ← r3 + r1       r4:pr6       pr6 ← pr4 + pr5
(8) r3 ← 2             r3:pr7       pr7 ← 2
Register mapping (physical registers allocated to each architectural register, oldest first):
r1: pr1, pr3, pr5
r2: pr2
r3: pr4, pr7
r4: pr6
When an instruction commits: copy its physical register into the architectural register
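The renaming of this example can be reproduced with a short sketch. It assumes an unbounded supply of physical registers named pr1, pr2, ... allocated in program order, and it represents immediates as strings; both are simplifications made for illustration.

```python
# (destination, sources) in program order; immediates are plain strings.
PROGRAM = [
    ("r1", ["17"]),          # (1) r1 <- 17
    ("r2", ["r2", "r1"]),    # (2) r2 <- r2 + r1
    ("r1", ["23"]),          # (3) r1 <- 23
    ("r3", ["r3", "r1"]),    # (4) r3 <- r3 + r1
    (None, []),              # (5) jcc L2 writes no register
    ("r1", ["35"]),          # (6) r1 <- 35
    ("r4", ["r3", "r1"]),    # (7) r4 <- r3 + r1
    ("r3", ["2"]),           # (8) r3 <- 2
]


def rename(program):
    """Rename in program order; return (renamed program, final map)."""
    rename_map = {}   # architectural register -> latest physical register
    next_phys = 1
    renamed = []
    for dest, srcs in program:
        # Sources read the latest mapping; unmapped names (architectural
        # registers with no earlier writer, immediates) stay as they are.
        new_srcs = [rename_map.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = f"pr{next_phys}"   # allocate a free physical register
            next_phys += 1
            rename_map[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed, rename_map


renamed, final_map = rename(PROGRAM)
for number, (dest, srcs) in enumerate(renamed, start=1):
    print(number, dest, srcs)
print(final_map)
```

Because every write gets its own physical register, inst (3) no longer collides with inst (1), and inst (8) no longer collides with the read in inst (7).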
Speculative Execution – Misprediction
If the predicted branch path turns out to be wrong (discovered when the branch is executed):
• The instructions following the branch are flushed before they are committed ⇒ the architectural state is not changed
But the register mapping was already wrongly updated by the wrong path instructions
Jump Misprediction – Flush at Retire
When the mispredicted jump retires
• Flush the pipeline
  • When the branch commits, all the instructions remaining in the pipe are younger than the branch and come from the wrong path
• Reset the renaming map
  • So all registers are mapped to architectural registers
  • This is ok since there are no consumers of physical registers (the pipe is flushed)
• Start fetching instructions from the correct path
Disadvantage
• Very high misprediction penalty
• The misprediction is already known after the jump was executed
We will see ways to recover from a misprediction at execution
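Flush at retire can be sketched with an in-order retirement loop. The entry layout (dicts with "id", "dest", "value", and an optional "mispredicted" flag), the initial register values, and the function name are all assumptions made for this illustration; an empty renaming map stands for "every register is read from the architectural file".

```python
from collections import deque


def retire_all(rob, rename_map, arch_regs):
    """Retire entries oldest first; flush when a mispredicted jump retires.

    Returns (ids retired, ids flushed). Mutates rename_map and arch_regs.
    """
    pending = deque(rob)
    retired = []
    while pending:
        entry = pending.popleft()
        if entry["dest"] is not None:
            # Commit: copy the result into the architectural register.
            arch_regs[entry["dest"]] = entry["value"]
        retired.append(entry["id"])
        if entry.get("mispredicted"):
            # Everything still pending is younger and from the wrong path.
            flushed = [e["id"] for e in pending]
            pending.clear()
            rename_map.clear()   # reset: all registers map to arch. state
            return retired, flushed
    return retired, []


# Hypothetical initial state: r2 = 10, r3 = 3; inst (5) was mispredicted.
arch = {"r1": 0, "r2": 10, "r3": 3, "r4": 0}
mapping = {"r1": "pr5", "r2": "pr2", "r3": "pr7", "r4": "pr6"}
rob = [
    {"id": 1, "dest": "r1", "value": 17},
    {"id": 2, "dest": "r2", "value": 27},
    {"id": 3, "dest": "r1", "value": 23},
    {"id": 4, "dest": "r3", "value": 26},
    {"id": 5, "dest": None, "value": None, "mispredicted": True},
    {"id": 6, "dest": "r1", "value": 35},
    {"id": 7, "dest": "r4", "value": 61},
    {"id": 8, "dest": "r3", "value": 2},
]

retired, flushed = retire_all(rob, mapping, arch)
print(retired, flushed, arch, mapping)
```

The wrong-path results of (6) to (8) never reach the architectural registers, and the wrongly updated mapping is discarded together with them.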
OOO Requires Accurate Branch Predictor
Accurate branch predictor increases the effective scheduling window size
• Speculate across multiple branches (a branch every 5 – 10 instructions)
[Figure: instruction pool containing several branches; instructions before the first branch have high chances to commit, instructions beyond later branches have low chances to commit]
[Chart: % wrong instructions (0% to 60%) as a function of prediction rate (70% to 100%)]
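Why the chance to commit drops deeper in the pool can be shown with a one-line model: an instruction commits only if every predicted branch ahead of it was predicted correctly. The sketch assumes independent predictions with a single fixed prediction rate, which is a simplification and not the model behind the chart.

```python
def commit_probability(prediction_rate, branches_ahead):
    """Chance that an instruction behind `branches_ahead` predicted
    branches is on the correct path (independent predictions assumed)."""
    return prediction_rate ** branches_ahead


for rate in (0.70, 0.90, 0.95):
    chances = [commit_probability(rate, k) for k in range(5)]
    print(rate, [f"{c:.0%}" for c in chances])
```

With a branch every 5 to 10 instructions, a window of a few dozen instructions spans several branches, so even a few points of prediction rate change how much of the window is useful work.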
Interrupts and Faults Handling
Complications for pipelined and OOO execution
• Interrupts occur in the middle of an instruction
• A speculative instruction can get a fault (divide by 0, page fault)
Faults are served in program order, at retirement only
• Mark an instruction that takes a fault at execution
• Instructions older than the faulting instruction are retired
• Only when the faulting instruction retires is the fault handled
  • Flush subsequent instructions
  • Initiate the fault handling code according to the fault type
  • Restart faulting and/or subsequent instructions
Interrupts are served when the next instruction retires
• Let the instruction in the current cycle retire
• Flush subsequent instructions and initiate the interrupt service code
• Fetch the subsequent instructions
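Serving faults at retirement can be sketched as follows. Faults are only marked during (out-of-order) execution; the retirement loop walks the instructions in program order and reports the oldest marked fault. The entry layout and function name are illustrative assumptions.

```python
def retire_until_fault(rob):
    """Retire in program order until an entry marked with a fault.

    Returns (ids retired, (faulting id, fault type) or None, ids flushed).
    """
    retired = []
    for position, entry in enumerate(rob):
        if entry.get("fault"):
            # Older instructions are already retired; younger ones are flushed.
            flushed = [e["id"] for e in rob[position + 1:]]
            return retired, (entry["id"], entry["fault"]), flushed
        retired.append(entry["id"])
    return retired, None, []


# Inst 3 may have executed (and faulted) before inst 2, but the fault of
# inst 2 is the one that is served, because it is older in program order.
rob = [
    {"id": 1},
    {"id": 2, "fault": "page fault"},
    {"id": 3, "fault": "divide by 0"},
    {"id": 4},
]

retired, fault, flushed = retire_until_fault(rob)
print(retired, fault, flushed)
```

A faulting instruction that is flushed before it reaches retirement, for example one on a mispredicted path, never triggers its handler in this scheme.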
Out Of Order Execution Summary
Look ahead in a window of instructions
• Dispatch ready instructions to execution
  • Do not depend on data from previous instructions that have not yet executed
  • Have the required execution resources available
Advantages
• Exploit Instruction Level Parallelism beyond adjacent instructions
• Help cover latencies (e.g., L1 data cache miss, divide)
• Superior/complementary to compiler scheduler
  • Can look for ILP beyond conditional branches
  • In a given control path instructions may be independent
  • Register Renaming: use more than the number of architectural registers
Complex micro-architecture
• Register renaming, complex scheduler, misprediction recovery
• Memory ordering: so far we did not talk about that