Instruction Level Parallelism

● ILP, Loop level Parallelism
● Dependences, Hazards
● Speculation, Branch prediction

Basic Block
● A straight-line code sequence with no branches in except to the entry and no branches out except at the exit

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

      DADDUI R1, 0(R2)
      BEQZ   R2, L1
      LW     R1, 0(R2)
L1:

ILP

● Name dependence
  – antidependence, output dependence
  – Register renaming
● Hazard
  – Overlap during execution would change the order of access to the operand involved in the dependence.

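for (i=0; i<=999; i=i+1)
    x[i] = x[i] + a;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Data Dependence: in the loop, L.D writes F0, which ADD.D reads; ADD.D writes F4, which S.D reads.

Name Dependence: both instructions write F4.
      ADD.D  F4, F0, F2
      ADD.D  F4, F6, F8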

Hazards
● Program Order
  – ILP preserves program order only where it affects the outcome of the program
● Structural Hazards
  – Resource conflicts
● Data Hazards
  – RAW, WAW, WAR
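The three data hazard classes can be told apart by comparing the registers two instructions read and write. The following is a minimal sketch, not part of the slides: the `(dest, sources)` tuple encoding and the function name `classify_hazards` are assumptions made for illustration.

```python
def classify_hazards(first, second):
    """Return the hazards that the later instruction `second` has on `first`.

    Each instruction is a (dest, sources) pair: `dest` is the register it
    writes (or None), `sources` is the set of registers it reads.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    hazards = set()
    if dest1 is not None and dest1 in srcs2:
        hazards.add("RAW")  # true data dependence
    if dest2 is not None and dest2 in srcs1:
        hazards.add("WAR")  # antidependence
    if dest1 is not None and dest1 == dest2:
        hazards.add("WAW")  # output dependence
    return hazards


# Instructions taken from the slides, encoded by hand.
load = ("F0", {"R1"})           # L.D    F0, 0(R1)
add1 = ("F4", {"F0", "F2"})     # ADD.D  F4, F0, F2
store = (None, {"F4", "R1"})    # S.D    F4, 0(R1)
update = ("R1", {"R1"})         # DADDUI R1, R1, #-8
add2 = ("F4", {"F6", "F8"})     # ADD.D  F4, F6, F8

print(classify_hazards(load, add1))     # RAW on F0
print(classify_hazards(store, update))  # WAR on R1
print(classify_hazards(add1, add2))     # WAW on F4
```

WAR and WAW come from name dependences only, which is why register renaming can remove them while RAW hazards remain.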

Structural Hazard

Clock cycle   1    2    3    4    5    6    7    8    9
i1            MEM  ID   EX   MEM  WB
i2                 MEM  ID   EX   MEM  WB
i3                      MEM  ID   EX   MEM  WB
i4                           MEM  ID   EX   MEM  WB
i5                                MEM  ID   EX   MEM  WB
...

HAZARD!!! In cycle 4, i1's data access and i4's instruction fetch both need the memory.

Data Hazard

DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R10, R1, R11

Time (clock cycles)
DADD   IM   REG  ALU  DM   REG
DSUB        IM   REG  ALU  DM   REG
AND              IM   REG  ALU  DM   REG
OR                    IM   REG  ALU  DM
XOR                        IM   REG  ALU

Avoiding Data Hazards – Forwarding

DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R10, R1, R11

Time (clock cycles)
DADD   IM   REG  ALU  DM   REG
DSUB        IM   REG  ALU  DM   REG
AND              IM   REG  ALU  DM   REG
OR                    IM   REG  ALU  DM
XOR                        IM   REG  ALU

Load Delay Slot

LD   R1, 0(R2)
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9

Time (clock cycles)
LD     IM   REG  ALU  DM   REG
DSUB        IM   REG  ALU  DM   REG
AND              IM   REG  ALU  DM   REG
OR                    IM   REG  ALU  DM

LOAD DELAY SLOT: The loaded value might not be available in the destination register for use by the instruction immediately following the load.

Cost of Stalls
● Data references = 40% of instructions. Ideal CPI = 1.
● The processor with the hazard has a clock rate 1.1 times that of the processor without the hazard.
● Which processor is faster?

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
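The question on the Cost of Stalls slide can be worked through with the Pipeline CPI formula. This sketch rests on two assumptions that the slide does not state outright: the factor 1.1 is read as a clock-rate ratio, and every data reference costs exactly one structural stall cycle. The variable and function names are illustrative.

```python
def average_instruction_time(cpi, clock_rate):
    """Average time per instruction, in units of the baseline clock period."""
    return cpi / clock_rate


ideal_cpi = 1.0
data_reference_fraction = 0.40
stall_cycles_per_data_reference = 1  # assumption: one stall per data reference

# Pipeline CPI = ideal CPI + structural stalls (no other stalls considered here).
cpi_with_hazard = ideal_cpi + data_reference_fraction * stall_cycles_per_data_reference

time_without_hazard = average_instruction_time(ideal_cpi, clock_rate=1.0)
time_with_hazard = average_instruction_time(cpi_with_hazard, clock_rate=1.1)

slowdown = time_with_hazard / time_without_hazard
print(f"Processor without the hazard is {slowdown:.2f}x faster")
```

Under these assumptions the processor without the hazard comes out ahead by a factor of about 1.27: the 10% clock advantage does not make up for the 40% rise in CPI.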

Pipeline Scheduling

Reorder the instructions of the program so that dependent instructions are far enough apart.

● Static Instruction Scheduling: done by the compiler, before the program runs
● Dynamic Instruction Scheduling: done by the hardware, while the program is running

Pipeline Scheduling

Original Program                 Scheduled Code
LW   R3, 0(R1)                   LW   R3, 0(R1)
stall                            LW   R13, 0(R11)
ADDI R5, R3, 1                   ADDI R5, R3, 1
ADD  R2, R2, R3                  ADD  R2, R2, R3
LW   R13, 0(R11)                 ADD  R12, R13, R3
stall
ADD  R12, R13, R3

Total Execution Cycles: 7        Total Execution Cycles: 5
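The two cycle counts can be reproduced by charging one cycle per instruction plus one stall whenever an instruction uses the result of the load directly before it. This is a minimal sketch under that single-stall load-delay assumption. The tuple encoding and the function name `execution_cycles` are illustrative and do not come from the slides.

```python
def execution_cycles(program):
    """One cycle per instruction, plus one stall per load-use pair.

    Each instruction is (opcode, dest_register, [source_registers]).
    """
    cycles = 0
    previous = None
    for opcode, dest, sources in program:
        if previous is not None:
            prev_opcode, prev_dest, _ = previous
            if prev_opcode == "LW" and prev_dest in sources:
                cycles += 1  # consumer directly follows the load: stall
        cycles += 1
        previous = (opcode, dest, sources)
    return cycles


original = [
    ("LW", "R3", ["R1"]),            # LW   R3, 0(R1)
    ("ADDI", "R5", ["R3"]),          # ADDI R5, R3, 1
    ("ADD", "R2", ["R2", "R3"]),     # ADD  R2, R2, R3
    ("LW", "R13", ["R11"]),          # LW   R13, 0(R11)
    ("ADD", "R12", ["R13", "R3"]),   # ADD  R12, R13, R3
]

# Hoist the second load so that neither load is followed by its consumer.
scheduled = [original[0], original[3], original[1], original[2], original[4]]

print(execution_cycles(original))   # 7
print(execution_cycles(scheduled))  # 5
```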

Loop-level Parallelism

Original Loop:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Unrolled Loop:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F2, F6
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F2, F10
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F2, F14
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop
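At source level, unrolling by four replicates the loop body four times and adjusts the index once per pass. The sketch below mirrors the assembly by walking the array from the last element down. The function names are illustrative, and the unrolled version assumes the element count is a multiple of four, as with the 1000 elements in the slides.

```python
def add_scalar(x, a):
    """Original loop: one element per iteration, from the last index down."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + a
    return x


def add_scalar_unrolled4(x, a):
    """Loop unrolled by 4: four independent element updates per iteration."""
    assert len(x) % 4 == 0, "sketch assumes the trip count is a multiple of 4"
    for i in range(len(x) - 1, -1, -4):
        x[i] = x[i] + a
        x[i - 1] = x[i - 1] + a
        x[i - 2] = x[i - 2] + a
        x[i - 3] = x[i - 3] + a
    return x


data = [float(i) for i in range(1000)]
print(add_scalar(list(data), 2.5) == add_scalar_unrolled4(list(data), 2.5))
```

The unrolled loop runs a quarter as many index updates and branches, and its four body copies are independent of one another, which is what gives the scheduler room to reorder them.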

Loop Unrolling

Instr producing result   Instr using result   Latency to avoid a stall
FP ALU op                Another FP ALU op    3
FP ALU op                Store Double         2
Load Double              FP ALU op            1
Load Double              Store double         0

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F2, F6
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F2, F10
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F2, F14
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop

Total Cycles: 27

Loop Unrolling

Unrolled loop, scheduled for the same latencies:

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F2, F6
      ADD.D  F12, F2, F10
      ADD.D  F16, F2, F14
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      S.D    F16, 8(R1)
      BNE    R1, R2, Loop

Total Cycles: 14

➢ Code Size
➢ Register pressure
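The 27-cycle and 14-cycle totals can be checked with a small in-order, single-issue model driven by the latency table. This is a sketch under stated assumptions, not the slides' own method: the one-cycle integer-ALU-to-branch latency is not in the table and is added here as an assumption, any producer/consumer pair not listed is treated as latency 0, and the tuple encoding and names are illustrative.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,  # assumption: not listed in the slide's table
}

KIND = {"L.D": "load", "S.D": "store", "ADD.D": "fp_alu",
        "DADDUI": "int_alu", "BNE": "branch"}


def count_cycles(program):
    """Issue cycle of the last instruction on a single-issue, in-order pipeline.

    Each instruction is (opcode, dest_register_or_None, [source_registers]).
    """
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for opcode, dest, sources in program:
        kind = KIND[opcode]
        issue = cycle + 1
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                stall = LATENCY.get((producer_kind, kind), 0)
                issue = max(issue, produced_at + 1 + stall)
        cycle = issue
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]

# Unrolled, unscheduled: load / add / store for each copy, then the loop tail.
unrolled = []
for f_in, f_out in pairs:
    unrolled += [("L.D", f_in, ["R1"]),
                 ("ADD.D", f_out, [f_in, "F2"]),
                 ("S.D", None, [f_out, "R1"])]
unrolled += [("DADDUI", "R1", ["R1"]), ("BNE", None, ["R1", "R2"])]

# Unrolled and scheduled: all loads, all adds, then stores around DADDUI.
scheduled = (
    [("L.D", f_in, ["R1"]) for f_in, _ in pairs]
    + [("ADD.D", f_out, [f_in, "F2"]) for f_in, f_out in pairs]
    + [("S.D", None, ["F4", "R1"]), ("S.D", None, ["F8", "R1"]),
       ("DADDUI", "R1", ["R1"]),
       ("S.D", None, ["F12", "R1"]), ("S.D", None, ["F16", "R1"]),
       ("BNE", None, ["R1", "R2"])]
)

print(count_cycles(unrolled))   # 27
print(count_cycles(scheduled))  # 14
```

The model tracks only true (RAW) dependences, which is enough here because the unrolled copies use distinct registers.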

Exceptions
● Certain exceptional events that occur during program execution, handled by the processor hardware
● Control transfer to specific OS code based on the family of exception
● I/O device requests, System call, Breakpoint, Integer arithmetic overflow, FP arithmetic anomaly, Page fault, Undefined or unimplemented instruction, Hardware malfunctions, Power failure.

Exceptions
● Synchronous vs. Asynchronous
● User requested vs. Coerced
● User maskable vs. User non-maskable
● Within vs. Between instructions
  – Save and restore processor state
  – Restartable pipeline
● Resume vs. Terminate

Stopping and Restarting Execution
● Trap instruction, Turn off writes, Save PC, Save processor state, Exception handler, RFE
● Precise exceptions

Pipeline stage   Problem exceptions occurring
IF               Page fault on instruction fetch; misaligned memory access; memory protection violation
ID               Undefined or illegal opcode
EX               Arithmetic exception
MEM              Page fault on data fetch; misaligned memory access; memory protection violation
WB               None

Precise Exceptions

LD     IF   ID   EX   MEM  WB
DADD        IF   ID   EX   MEM  WB

● Exceptions at the same cycle
● Early exception by a later instruction
● Instruction Status Vector
  – Check before commit
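The "early exception by a later instruction" case can be illustrated with the LD/DADD pair: a fault raised by DADD in IF is detected before a fault raised by LD in MEM, yet a status vector that is only checked at commit handles them in program order. The sketch assumes one instruction issued per cycle with no stalls. The fault placement and the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def detection_cycle(index, stage):
    """Clock cycle (1-based) in which instruction `index` (0-based, one
    instruction issued per cycle, no stalls) occupies `stage`."""
    return index + STAGES.index(stage) + 1


def detection_order(faults):
    """Order in which the hardware first sees the faults."""
    return sorted(faults, key=lambda i: detection_cycle(i, faults[i]))


def handling_order(faults):
    """Each fault is only recorded in its instruction's status vector and is
    examined when that instruction commits, so handling follows program order."""
    return sorted(faults)


# Hypothetical scenario: LD (instruction 0) faults in MEM,
# DADD (instruction 1) faults in IF.
faults = {0: "MEM", 1: "IF"}

print(detection_order(faults))  # DADD's fault is seen first (cycle 2 vs. 4)
print(handling_order(faults))   # LD's fault is handled first
```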

Control Dependences
● Program correctness
  – Data flow and Exception behaviour
● Software Speculation
  – Liveness

      DADDU R2, R3, R4
      BEQZ  R2, L1
      LW    R1, 0(R2)
L1:

      DADDU R1, R2, R3
      BEQZ  R4, L1
      DSUBU R1, R5, R6
L1:   …........
      OR    R7, R1, R8

      DADDU R1, R2, R3
      BEQZ  R12, L1
      DSUBU R4, R5, R6
      DADDU R5, R4, R9
L1:   OR    R7, R8, R9

Branch Hazards

● 1 stall cycle for every branch yields a performance loss of 10% to 30%!

Time (clock cycles)     1    2    3    4    5    6    7    8    9
Branch                  IF   ID   EX   MEM  WB
Branch Successor             IF   IF   ID   EX   MEM  WB
Branch Successor + 1                   IF   ID   EX   MEM  WB
Branch Successor + 2                        IF   ID   EX   MEM  WB

Reducing Pipeline Branch Penalties
● Freeze the pipeline
● Static Prediction
  – Predict Taken, Predict Untaken
● Fill Branch Delay Slot

Time (clock cycles)     1    2    3    4    5    6    7    8    9
Branch                  IF   ID   EX   MEM  WB
Branch Delay Slot            IF   ID   EX   MEM  WB
Branch Successor                  IF   ID   EX   MEM  WB
Branch Successor + 1                   IF   ID   EX   MEM  WB

Branch Delay Slot (from the MIPS ISA Manual): The transfer of control takes place only following the instruction immediately after the control transfer instruction.

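Performance of Branch Schemes

$$\text{Stall cycles}_{\text{branches}} = \text{Branch frequency} \times \text{Branch penalty}$$

$$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$

$$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$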
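The speedup formula is easy to evaluate numerically. In this sketch the pipeline depth of 5, the 20% branch frequency and the 1-cycle penalty are made-up inputs chosen for illustration, not figures from the slides, and the function name is an assumption.

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Speedup from pipelining when branches are the only source of stalls."""
    stall_cycles_per_instruction = branch_frequency * branch_penalty
    return depth / (1 + stall_cycles_per_instruction)


# Illustrative inputs: 5-stage pipeline, 20% branches, 1-cycle penalty.
print(round(pipeline_speedup(5, 0.20, 1), 3))

# With no branch stalls, the speedup equals the pipeline depth.
print(pipeline_speedup(5, 0.0, 1))
```

With a one-cycle penalty, branch frequencies of 10% to 30% raise the CPI from 1 to between 1.1 and 1.3. This is one way to read the 10% to 30% loss quoted on the Branch Hazards slide.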

Classes of Exceptions

Exception type                  Sync vs.   User request   User maskable     Within vs. between   Resume vs.
                                Async      vs. Coerced    vs. nonmaskable   instructions         Terminate
I/O device request              Async      Coerced        Nonmaskable       Between              Resume
Invoke OS                       Sync       User request   Nonmaskable       Between              Resume
Tracing Instruction Execution   Sync       User request   User maskable     Between              Resume
Breakpoint                      Sync       User request   User maskable     Between              Resume
Arithmetic Overflow             Sync       Coerced        User maskable     Within               Resume
FP underflow or overflow        Sync       Coerced        User maskable     Within               Resume
Page fault                      Sync       Coerced        Nonmaskable       Within               Resume
Undefined Instructions          Sync       Coerced        Nonmaskable       Within               Terminate
Hardware malfunctions           Async      Coerced        Nonmaskable       Within               Terminate
Power Failure                   Async      Coerced        Nonmaskable       Within               Terminate

Smith and Pleszkun, Implementing precise interrupts in pipelined processors, IEEE Transactions on Computers, 37(5), 1988.
