Computer Structure 2014 – Out-Of-Order Execution

Lihu Rappoport and Adi Yoaz



What’s Next

Goal: minimize CPU Time
  CPU Time = clock cycle × CPI × IC

So far we have learned
  • Minimize clock cycle ⇒ add more pipe stages
  • Minimize CPI ⇒ use a pipeline
  • Minimize IC ⇒ architecture

In a pipelined CPU
  • CPI without hazards is 1
  • CPI with hazards is > 1

Adding more pipe stages reduces the clock cycle but increases CPI
  • Higher penalty due to control hazards
  • More data hazards

What can we do? Further reduce the CPI!


A Superscalar CPU

Duplicating HW in one pipe stage won’t help
  • e.g., with 2 ALUs, the bottleneck moves to other stages

Getting IPC > 1 requires fetching, decoding, executing, and retiring more than one instruction per clock:

  IF  ID  EXE  MEM  WB
  IF  ID  EXE  MEM  WB


The Pentium Processor

Fetches and decodes 2 instructions per cycle

Before register file read, decide on pairing: can the two instructions be executed in parallel?

Pairing decision is based on
  • Data dependencies: the 2nd instruction must be independent of the 1st
  • Resources: the U-pipe and V-pipe are not symmetric (to save HW)
      • Common instructions can execute on either pipe
      • Some instructions can execute only on the U-pipe
      • If the 2nd instruction requires the U-pipe, it cannot pair
      • Some instructions use resources of both pipes

[Figure: the IF and ID stages feed a pairing decision, which issues into the U-pipe and the V-pipe]
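The pairing decision can be sketched as a small predicate. This is only an illustration of the two rules on the slide, not the actual Pentium pairing logic: the `Instr` record, the `pipe` classes ("either", "u_only", "both") and the choice to treat both read-after-write and write-after-write as dependencies are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    dst: Optional[str]        # destination register, or None
    srcs: Tuple[str, ...]     # source registers
    pipe: str = "either"      # hypothetical class: "either", "u_only" or "both"


def can_pair(first: Instr, second: Instr) -> bool:
    """Return True if `second` may issue to the V-pipe alongside `first` (U-pipe)."""
    # Data dependency: the 2nd instruction must be independent of the 1st.
    if first.dst is not None:
        if first.dst in second.srcs or first.dst == second.dst:
            return False
    # Resources: an instruction that uses both pipes cannot share the cycle,
    # and the 2nd instruction must be able to run on the V-pipe.
    if first.pipe == "both" or second.pipe != "either":
        return False
    return True


a = Instr("r1", ("r2",))
print(can_pair(a, Instr("r3", ("r1",))))            # depends on r1
print(can_pair(a, Instr("r4", ("r5",))))            # independent, either pipe
print(can_pair(a, Instr("r4", ("r5",), "u_only")))  # needs the U-pipe
```

When the predicate is false, the second instruction simply waits and issues in the next cycle, which is why the utilization of the second pipe depends so strongly on the instruction mix.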


Misprediction Penalty in a Superscalar CPU

MPI: misses per instruction

  MPI = (# incorrectly predicted branches) / (total # of instructions)
      = MPR × (# predicted branches) / (total # of instructions)

MPI correlates well with performance. For example, assume
  • MPR = 5%, %branches = 20% ⇒ MPI = 1%
  • Without hazards IPC = 2 (2 instructions per cycle)
  • Flush penalty of 5 cycles

We get
  • MPI = 1% ⇒ one flush in every 100 instructions
  • IPC = 2 ⇒ a flush every 100/2 = 50 cycles
  • 5 cycles of flush penalty every 50 cycles ⇒ 10% performance hit

For IPC = 1 we would get
  • 5 cycles of flush penalty per 100 cycles ⇒ 5% performance hit

Flush penalty increases as the machine gets deeper and wider
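The arithmetic of this example can be reproduced in a few lines. The function name `flush_overhead` and its parameter names are made up for the sketch; the "performance hit" is computed the same way as on the slide, as flush-penalty cycles divided by the hazard-free cycles between two flushes.

```python
def flush_overhead(mpr, branch_fraction, ipc, flush_penalty):
    """Return (MPI, flush-penalty cycles per hazard-free cycle)."""
    mpi = mpr * branch_fraction              # mispredictions per instruction
    instructions_per_flush = 1.0 / mpi       # e.g. one flush per 100 instructions
    cycles_per_flush = instructions_per_flush / ipc
    return mpi, flush_penalty / cycles_per_flush


for ipc in (2, 1):
    mpi, overhead = flush_overhead(mpr=0.05, branch_fraction=0.20,
                                   ipc=ipc, flush_penalty=5)
    print(f"IPC={ipc}: MPI={mpi:.1%}, performance hit ~{overhead:.0%}")
```

With the same MPI, doubling the IPC halves the number of cycles between flushes, so the same 5-cycle penalty costs twice as much in relative terms.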


Extract More ILP

ILP – Instruction Level Parallelism
  • A given program, executed on given input data, has a given parallelism
  • Only independent instructions can execute in parallel
  • If, for example, each instruction depends on the previous instruction, the ILP of the program is 1
      • Adding more HW will not change that

Adjacent instructions are usually dependent
  • The utilization of the 2nd pipe is usually low
  • There are algorithms in which both pipes are highly utilized

Solution: Out-Of-Order Execution
  • Look for independent instructions further ahead in the program
  • Execute instructions based on data readiness
  • Still need to keep the semantics of the original program


Data Flow Analysis

Example:
  (1) r1 ← r4 / r7   ; assume divide takes 20 cycles
  (2) r8 ← r1 + r2
  (3) r5 ← r5 + 1
  (4) r6 ← r6 - r3
  (5) r4 ← r5 + r6
  (6) r7 ← r8 * r4

Data Flow Graph
  (1) → (2) via r1
  (3) → (5) via r5
  (4) → (5) via r6
  (2) → (6) via r8
  (5) → (6) via r4

[Figure: timelines comparing in-order execution and out-of-order execution of instructions 1 to 6]
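The data flow graph can be derived mechanically by tracking the last writer of each register. The sketch below computes, for the six-instruction example, the earliest cycle at which each instruction could start if only true data dependencies mattered. The 20-cycle divide comes from the slide; the 1-cycle latency of all other operations and the unlimited execution resources are assumptions of the sketch.

```python
def dataflow_schedule(program, latency):
    """program: list of (id, dst, srcs, op) in program order.

    Returns the earliest start cycle of each instruction, limited only by
    read-after-write dependencies (unlimited resources assumed).
    """
    last_writer = {}   # register -> id of the most recent instruction writing it
    start, finish = {}, {}
    for iid, dst, srcs, op in program:
        producers = [last_writer[r] for r in srcs if r in last_writer]
        start[iid] = max((finish[p] for p in producers), default=0)
        finish[iid] = start[iid] + latency.get(op, 1)   # default latency: 1 cycle
        last_writer[dst] = iid
    return start


program = [
    (1, "r1", ("r4", "r7"), "div"),
    (2, "r8", ("r1", "r2"), "add"),
    (3, "r5", ("r5",), "add"),
    (4, "r6", ("r6", "r3"), "sub"),
    (5, "r4", ("r5", "r6"), "add"),
    (6, "r7", ("r8", "r4"), "mul"),
]
start = dataflow_schedule(program, latency={"div": 20})
for iid in sorted(start):
    print(f"({iid}) can start at cycle {start[iid]}")
```

Instructions (3), (4) and (5) do not wait for the divide, while (2) and (6) do, which is exactly the slack that out-of-order execution exploits.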


OOOE – General Scheme

Fetch & decode instructions in parallel, but in order
  • Fill the instruction pool

Execute ready instructions from the instruction pool
  • All source data ready + needed execution resources available

Once an instruction is executed
  • Signal all dependent instructions that data is ready

Commit instructions in parallel, but in order
  • State change (memory, register) and fault/exception handling

[Figure: Fetch & Decode (in-order) fills the instruction pool; Execute (out-of-order) picks ready instructions from the pool; Retire/commit (in-order) drains it]
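A toy cycle-by-cycle model of this scheme is sketched below. It tracks timing only, not register values, and makes several simplifying assumptions that are not in the slides: the whole program is already in the instruction pool, false dependencies are ignored (as if renaming had removed them), at most `width` instructions execute and commit per cycle, and the oldest ready instructions are picked first. The function name `simulate_ooo` is made up for the example.

```python
def simulate_ooo(program, latency, width=2):
    """program: list of (dst, srcs, op) in program order.

    Returns (execution order, commit order) as 1-based instruction numbers.
    """
    n = len(program)
    # In-order allocation: record which older instructions produce each source.
    last_writer, producers = {}, []
    for i, (dst, srcs, op) in enumerate(program):
        producers.append([last_writer[r] for r in srcs if r in last_writer])
        last_writer[dst] = i

    ready_at = [None] * n            # cycle at which each result becomes available
    exec_order, commit_order = [], []
    cycle = 0
    while len(commit_order) < n:
        # Commit: in program order, only instructions whose result is available.
        committed = 0
        while committed < width and len(commit_order) < n:
            head = len(commit_order)
            if ready_at[head] is None or ready_at[head] > cycle:
                break
            commit_order.append(head + 1)
            committed += 1
        # Execute: any not-yet-executed instruction whose producers are done.
        issued = 0
        for i in range(n):
            if issued == width:
                break
            if ready_at[i] is not None:
                continue
            if all(ready_at[p] is not None and ready_at[p] <= cycle
                   for p in producers[i]):
                ready_at[i] = cycle + latency.get(program[i][2], 1)
                exec_order.append(i + 1)
                issued += 1
        cycle += 1
    return exec_order, commit_order


example = [
    ("r1", ("r4", "r7"), "div"),
    ("r8", ("r1", "r2"), "add"),
    ("r5", ("r5",), "add"),
    ("r6", ("r6", "r3"), "sub"),
    ("r4", ("r5", "r6"), "add"),
    ("r7", ("r8", "r4"), "mul"),
]
exec_order, commit_order = simulate_ooo(example, latency={"div": 20})
print("executed :", exec_order)
print("committed:", commit_order)
```

Execution order differs from program order, but the commit order is always the program order, so the architectural state changes exactly as in an in-order machine.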


Write-After-Write Dependency

  (1) r1 ← R9 / 17
  (2) r2 ← r2 + r1
  (3) r1 ← 23
  (4) r3 ← r3 + r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3 + r1
  (8) r3 ← 2

If inst (3) is executed before inst (1), r1 ends up having a wrong value. Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3).

This is called a write-after-write (WAW) false dependency: not a real data dependency, but an artifact of OOO execution.

Speculative Execution

About 1 in 5 instructions is a branch ⇒ continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path.

This is called “speculative execution”.

Write-After-Read Dependency

If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3.

This is called a write-after-read (WAR) false dependency: not a real data dependency, but an artifact of OOO execution.
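The true and false dependencies in the example program can be found with a single in-order pass. The sketch below is one possible way to do it, with a made-up function name `find_hazards`; it tracks, for each register, the last writer and the readers since that write. Immediates are left out of the source lists, and R9 is written as "r9".

```python
def find_hazards(program):
    """program: list of (id, dst, srcs) in program order; dst may be None.

    Returns tuples (kind, older_id, younger_id, register),
    where kind is "RAW", "WAR" or "WAW".
    """
    last_writer = {}   # register -> id of its most recent writer
    readers = {}       # register -> ids that read it since the last write
    hazards = []
    for iid, dst, srcs in program:
        for r in srcs:
            if r in last_writer:
                hazards.append(("RAW", last_writer[r], iid, r))   # true dependency
        for r in srcs:
            readers.setdefault(r, []).append(iid)
        if dst is not None:
            for reader in readers.get(dst, []):
                if reader != iid:
                    hazards.append(("WAR", reader, iid, dst))     # false dependency
            if dst in last_writer:
                hazards.append(("WAW", last_writer[dst], iid, dst))  # false dependency
            last_writer[dst] = iid
            readers[dst] = []
    return hazards


program = [
    (1, "r1", ("r9",)),
    (2, "r2", ("r2", "r1")),
    (3, "r1", ()),
    (4, "r3", ("r3", "r1")),
    (5, None, ()),            # jcc L2
    (6, "r1", ()),
    (7, "r4", ("r3", "r1")),
    (8, "r3", ()),
]
hazards = find_hazards(program)
for kind, older, younger, reg in hazards:
    print(f"{kind}: ({older}) -> ({younger}) on {reg}")
```

The WAW pair (1), (3) on r1 and the WAR pair (7), (8) on r3 are the two cases discussed on the slides; only the RAW entries are real data dependencies.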


Register Renaming

Hold a pool of physical registers
  • Map architectural registers into physical registers

When an instruction is allocated into the instruction pool (still in-order)
  • Allocate a free physical register from the pool
  • The physical register points to the architectural register

When an instruction executes and writes a result
  • Write the result value to the physical register

When an instruction needs data from a register
  • Read the data from the physical register allocated to the latest instruction that writes to the same architectural register and precedes the current instruction
      • If no such instruction exists, read the value from the architectural register

When an instruction commits
  • Copy the value from its physical register to the architectural register


Register Renaming

  Program              Allocation   Renamed instruction
  (1) r1 ← 17          r1:pr1       pr1 ← 17
  (2) r2 ← r2 + r1     r2:pr2       pr2 ← r2 + pr1
  (3) r1 ← 23          r1:pr3       pr3 ← 23
  (4) r3 ← r3 + r1     r3:pr4       pr4 ← r3 + pr3
  (5) jcc L2
  (6) L2: r1 ← 35      r1:pr5       pr5 ← 35
  (7) r4 ← r3 + r1     r4:pr6       pr6 ← pr4 + pr5
  (8) r3 ← 2           r3:pr7       pr7 ← 2

Register mapping (successive physical registers allocated to each architectural register)
  r1: pr1, pr3, pr5
  r2: pr2
  r3: pr4, pr7
  r4: pr6

When an instruction commits: copy its physical register into the architectural register
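The renaming example above can be reproduced with a small in-order pass over the program. This is a minimal sketch, not a description of real rename hardware: the function name `rename` is made up, physical registers are simply numbered pr1, pr2, ... and never freed, and immediates are left out of the source lists.

```python
def rename(program):
    """program: list of (dst, srcs) in program order; dst may be None.

    Returns (renamed program, final architectural-to-physical mapping).
    """
    mapping = {}       # architectural register -> latest physical register
    next_pr = 1
    renamed = []
    for dst, srcs in program:
        # Sources read the physical register of the latest preceding writer;
        # with no such writer they read the architectural register itself.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dst = None
        if dst is not None:
            new_dst = f"pr{next_pr}"     # allocate a fresh physical register
            next_pr += 1
            mapping[dst] = new_dst
        renamed.append((new_dst, new_srcs))
    return renamed, mapping


program = [
    ("r1", ()),              # (1) r1 <- 17
    ("r2", ("r2", "r1")),    # (2) r2 <- r2 + r1
    ("r1", ()),              # (3) r1 <- 23
    ("r3", ("r3", "r1")),    # (4) r3 <- r3 + r1
    (None, ()),              # (5) jcc L2
    ("r1", ()),              # (6) r1 <- 35
    ("r4", ("r3", "r1")),    # (7) r4 <- r3 + r1
    ("r3", ()),              # (8) r3 <- 2
]
renamed, mapping = rename(program)
for number, (dst, srcs) in enumerate(renamed, start=1):
    print(f"({number}) dst={dst} srcs={srcs}")
print(mapping)
```

Because every write gets its own physical register, instructions (1), (3) and (6) no longer compete for r1, and (8) can write pr7 while (7) still reads pr4, so the WAW and WAR constraints disappear.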


Speculative Execution – Misprediction

  (1) r1 ← 17
  (2) r2 ← r2 + r1
  (3) r1 ← 23
  (4) r3 ← r3 + r1
  (5) jcc L2
  (6) L2: r1 ← 35
  (7) r4 ← r3 + r1
  (8) r3 ← 2

If the predicted branch path turns out to be wrong (when the branch is executed):
  • The instructions following the branch are flushed before they are committed ⇒ the architectural state is not changed
  • But the register mapping was already wrongly updated by the wrong-path instructions


Jump Misprediction – Flush at Retire

When the mispredicted jump retires
  • Flush the pipeline
      • When the branch commits, all the instructions remaining in the pipe are younger than the branch ⇒ they are all from the wrong path
  • Reset the renaming map
      • So all registers are mapped to architectural registers
      • This is OK since there are no consumers of physical registers (the pipe is flushed)
  • Start fetching instructions from the correct path

Disadvantage
  • Very high misprediction penalty
      • The misprediction is already known after the jump was executed
  • We will see ways to recover from a misprediction at execution
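Flush-at-retire can be sketched as a retirement step that copies results into the architectural registers and, on a mispredicted jump, empties the pool and resets the renaming map. The data layout (a list of dicts for the pool, plain dicts for the register files), the helper names `retire_oldest` and `entry`, and the initial architectural values are all assumptions of the sketch; the physical register values are the ones the renaming example would produce with those initial values.

```python
def retire_oldest(pool, rename_map, arch_regs, phys_regs):
    """Retire the oldest pool entry. Returns True if the pipe was flushed."""
    oldest = pool.pop(0)
    dst, pr = oldest["dst"], oldest["pr"]
    if dst is not None:
        arch_regs[dst] = phys_regs[pr]       # commit: copy physical -> architectural
        if rename_map.get(dst) == pr:
            del rename_map[dst]              # latest value now lives in the arch register
    if oldest["mispredicted"]:
        pool.clear()                         # everything left is younger: wrong path
        rename_map.clear()                   # all registers map to arch registers again
        return True
    return False


def entry(dst, pr, mispredicted=False):
    return {"dst": dst, "pr": pr, "mispredicted": mispredicted}


# Hypothetical initial state; instruction (5) jcc was mispredicted.
arch_regs = {"r1": 0, "r2": 1, "r3": 2, "r4": 3}
phys_regs = {"pr1": 17, "pr2": 18, "pr3": 23, "pr4": 25,
             "pr5": 35, "pr6": 60, "pr7": 2}
rename_map = {"r1": "pr5", "r2": "pr2", "r3": "pr7", "r4": "pr6"}
pool = [entry("r1", "pr1"), entry("r2", "pr2"), entry("r1", "pr3"),
        entry("r3", "pr4"), entry(None, None, mispredicted=True),
        entry("r1", "pr5"), entry("r4", "pr6"), entry("r3", "pr7")]

while pool and not retire_oldest(pool, rename_map, arch_regs, phys_regs):
    pass
print(arch_regs, rename_map, pool)
```

The wrong-path results in pr5, pr6 and pr7 never reach the architectural registers. The cost is that nothing useful is fetched between the execution of the jump and its retirement, which is the disadvantage noted on the slide.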


OOO Requires Accurate Branch Predictor

An accurate branch predictor increases the effective scheduling window size
  • Speculate across multiple branches (a branch every 5 – 10 instructions)

[Figure: instruction pool with branches marked; instructions behind few predicted branches have high chances to commit, instructions behind many predicted branches have low chances to commit]

[Figure: % wrong instructions (0% to 60%) as a function of prediction rate (70% to 100%)]
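A rough model shows why accuracy matters so much when speculating across several branches. Assuming each prediction is correct independently with the same probability (a simplification, not the model behind the chart), an instruction that sits behind k unresolved branches commits with probability rate^k. The window size of 40 instructions and the function names are illustrative assumptions; the branch spacing of 5 comes from the slide.

```python
def commit_probability(prediction_rate, branches_ahead):
    """Chance that an instruction behind `branches_ahead` predicted branches commits."""
    return prediction_rate ** branches_ahead


def expected_wrong_fraction(prediction_rate, window=40, branch_every=5):
    """Expected fraction of wrong-path instructions in a full instruction pool."""
    wrong = 0.0
    for position in range(window):
        branches_ahead = position // branch_every
        wrong += 1.0 - commit_probability(prediction_rate, branches_ahead)
    return wrong / window


for rate in (0.70, 0.80, 0.90, 0.95, 1.00):
    print(f"prediction rate {rate:.0%}: "
          f"~{expected_wrong_fraction(rate):.0%} wrong-path instructions")
```

Under this model the useful part of the window shrinks quickly as the prediction rate drops, so a larger instruction pool pays off only with a more accurate predictor.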


Interrupts and Faults Handling

Complications for pipelined and OOO execution
  • Interrupts occur in the middle of an instruction
  • A speculative instruction can get a fault (divide by 0, page fault)

Faults are served in program order, at retirement only
  • Mark an instruction that takes a fault at execution
  • Instructions older than the faulting instruction are retired
  • Handle the fault only when the faulting instruction retires
      • Flush subsequent instructions
      • Initiate the fault handling code according to the fault type
      • Restart faulting and/or subsequent instructions

Interrupts are served when the next instruction retires
  • Let the instruction in the current cycle retire
  • Flush subsequent instructions and initiate the interrupt service code
  • Fetch the subsequent instructions
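The fault-handling rule can be sketched as an in-order walk over the instruction pool. The sketch assumes every entry has already executed and carries a `fault` mark set at execution time (`None` when there was no fault); the function name `retire_until_fault` and the dict layout are made up for the example.

```python
def retire_until_fault(pool):
    """pool: list of {"id": int, "fault": str or None} in program order.

    Returns (retired ids, (faulting id, fault type) or None, flushed ids).
    """
    retired = []
    for index, instr in enumerate(pool):
        if instr["fault"] is not None:
            # The fault is handled only now, when the marked instruction
            # reaches retirement; everything younger is flushed.
            flushed = [younger["id"] for younger in pool[index + 1:]]
            return retired, (instr["id"], instr["fault"]), flushed
        retired.append(instr["id"])
    return retired, None, []


pool = [
    {"id": 1, "fault": None},
    {"id": 2, "fault": None},
    {"id": 3, "fault": "page_fault"},
    {"id": 4, "fault": None},
    {"id": 5, "fault": "divide_by_zero"},   # marked at execution, but younger than (3)
]
retired, fault, flushed = retire_until_fault(pool)
print(retired, fault, flushed)
```

Instruction (5) took a fault at execution, possibly before (3) did, yet its fault is never served: it is flushed together with (4) and will be seen again only if the instruction is re-executed after the page fault is handled.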


Out Of Order Execution Summary

Look ahead in a window of instructions
  • Dispatch ready instructions to execution
      • Do not depend on data from previous instructions that have not yet executed
      • Have the required execution resources available

Advantages
  • Exploit Instruction Level Parallelism beyond adjacent instructions
  • Help cover latencies (e.g., L1 data cache miss, divide)
  • Superior/complementary to compiler scheduler
      • Can look for ILP beyond conditional branches
      • In a given control path instructions may be independent
      • Register renaming: use more registers than the number of architectural registers

Complex micro-architecture
  • Register renaming, complex scheduler, misprediction recovery
  • Memory ordering – so far we did not talk about that
