Instruction Level Parallelism

● ILP, Loop-level Parallelism
● Dependences, Hazards
● Speculation, Branch prediction

Basic Block

● A straight-line code sequence with no branches in except to the entry and no branches out except at the exit

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

      DADDU  R2, R3, R4
      BEQZ   R2, L1
      LW     R1, 0(R2)
L1:

ILP

● Name dependence
  – Antidependence, output dependence
  – Register renaming
● Hazard
  – Overlap during execution would change the order of access to the operand involved in the dependence.

for (i=0; i<=999; i=i+1)
    x[i] = x[i] + a;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Data Dependence: L.D → ADD.D (through F0), ADD.D → S.D (through F4)

Name Dependence (both instructions write F4):

      ADD.D  F4, F0, F2
      ADD.D  F4, F6, F8

Hazards

● Program Order
  – ILP preserves program order only where it affects the outcome of the program
● Structural Hazards
  – Resource conflicts
● Data Hazards
  – RAW, WAW, WAR
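
The three data hazard kinds can be told apart by comparing the registers two instructions read and write. The sketch below is only an illustration: the `Instr` tuple and the `data_hazards` helper are hypothetical names, and the read/write sets are written out by hand rather than decoded from MIPS.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    text: str
    writes: frozenset   # registers written by the instruction
    reads: frozenset    # registers read by the instruction

def data_hazards(first, second):
    """Hazard kinds possible when `second` follows `first` in program order."""
    kinds = set()
    if first.writes & second.reads:
        kinds.add("RAW")   # true data dependence
    if first.reads & second.writes:
        kinds.add("WAR")   # antidependence
    if first.writes & second.writes:
        kinds.add("WAW")   # output dependence
    return kinds

load = Instr("L.D F0, 0(R1)", frozenset({"F0"}), frozenset({"R1"}))
add1 = Instr("ADD.D F4, F0, F2", frozenset({"F4"}), frozenset({"F0", "F2"}))
add2 = Instr("ADD.D F4, F6, F8", frozenset({"F4"}), frozenset({"F6", "F8"}))
decr = Instr("DADDUI R1, R1, #-8", frozenset({"R1"}), frozenset({"R1"}))

print(data_hazards(load, add1))   # ADD.D reads the F0 that L.D writes
print(data_hazards(add1, add2))   # both ADD.D instructions write F4
print(data_hazards(load, decr))   # DADDUI overwrites the R1 that L.D reads
```

Only the RAW case is a true data dependence; the WAR and WAW cases are name dependences, which is why register renaming can remove them.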

Structural Hazard

Clock cycle   1    2    3    4    5    6    7    8    9
i1            MEM  ID   EX   MEM  WB
i2                 MEM  ID   EX   MEM  WB
i3                      MEM  ID   EX   MEM  WB
i4                           MEM  ID   EX   MEM  WB
i5                                MEM  ID   EX   MEM  WB
...

HAZARD!!! (i1 and i4 both use MEM in clock cycle 4)
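
A resource conflict of this kind can be found mechanically by listing, for every clock cycle, which instructions occupy a given resource. The following is a minimal sketch, assuming the stage sequence drawn on the slide (instruction fetch and data access both use the single memory, labelled MEM) and one instruction issued per cycle; `memory_conflicts` is a hypothetical helper name.

```python
from collections import defaultdict

# Stage sequence as drawn on the slide: fetch and data access share one memory.
STAGES = ["MEM", "ID", "EX", "MEM", "WB"]

def memory_conflicts(n_instructions, stages=STAGES, resource="MEM"):
    """Map each clock cycle to the instructions that need `resource` in it,
    keeping only cycles where more than one instruction does."""
    users = defaultdict(list)
    for i in range(n_instructions):          # instruction i+1 starts in cycle i+1
        for offset, stage in enumerate(stages):
            if stage == resource:
                users[i + 1 + offset].append(f"i{i + 1}")
    return {cycle: who for cycle, who in sorted(users.items()) if len(who) > 1}

conflicts = memory_conflicts(5)
print(conflicts)
```

With five instructions the sketch reports clashes in cycles 4 and 5; in real hardware the later instruction would be stalled, which shifts the remaining rows by one cycle.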

Data Hazard

Time (clock cycles) →
DADD R1, R2, R3      IM   REG  ALU  DM   REG
DSUB R4, R1, R5           IM   REG  ALU  DM   REG
AND  R6, R1, R7                IM   REG  ALU  DM   REG
OR   R8, R1, R9                     IM   REG  ALU  DM
XOR  R10, R1, R11                        IM   REG  ALU

Avoiding Data Hazards – Forwarding

Time (clock cycles) →
DADD R1, R2, R3      IM   REG  ALU  DM   REG
DSUB R4, R1, R5           IM   REG  ALU  DM   REG
AND  R6, R1, R7                IM   REG  ALU  DM   REG
OR   R8, R1, R9                     IM   REG  ALU  DM
XOR  R10, R1, R11                        IM   REG  ALU

Load Delay Slot

Time (clock cycles) →
LD   R1, 0(R2)       IM   REG  ALU  DM   REG
DSUB R4, R1, R5           IM   REG  ALU  DM   REG
AND  R6, R1, R7                IM   REG  ALU  DM   REG
OR   R8, R1, R9                     IM   REG  ALU  DM

The loaded value might not be available in the destination register for use by the instruction immediately following the load: LOAD DELAY SLOT

Cost of Stalls

Data references = 40% of instructions. Ideal CPI = 1.
The processor with the hazard has a clock rate 1.1 times that of the processor without the hazard.
Which processor is faster?

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
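
One way to work the question above is to compare average time per instruction, CPI divided by clock rate. The sketch below rests on two assumptions the slide does not state outright: that "1.1 times faster" refers to clock rate, and that every data reference costs exactly one stall cycle on the processor with the hazard. `avg_instruction_time` is a hypothetical helper.

```python
def avg_instruction_time(ideal_cpi, stall_cycles_per_instr, clock_rate):
    """Average time per instruction, in units of the slower processor's cycle time."""
    return (ideal_cpi + stall_cycles_per_instr) / clock_rate

# Processor without the hazard: CPI = 1, reference clock rate 1.0.
t_without = avg_instruction_time(1.0, 0.0, 1.0)

# Processor with the hazard: assumed 1 stall per data reference (40% of
# instructions), clock rate 1.1 times higher.
t_with = avg_instruction_time(1.0, 0.40 * 1, 1.1)

ratio = t_with / t_without
print(f"with hazard: {t_with:.3f}, without hazard: {t_without:.3f}, ratio: {ratio:.2f}")
```

Under these assumptions the CPI rises to 1.4, so the processor without the hazard comes out about 1.27 times faster despite its lower clock rate.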

Pipeline Scheduling

Reorder the instructions of the program so that dependent instructions are far enough apart.

● Done by the compiler, before the program runs: Static Instruction Scheduling
● Done by the hardware, when the program is running: Dynamic Instruction Scheduling

Pipeline Scheduling

Original Program

      LW   R3, 0(R1)
      stall
      ADDI R5, R3, 1
      ADD  R2, R2, R3
      LW   R13, 0(R11)
      stall
      ADD  R12, R13, R3

Total Execution Cycles: 7

Scheduled Code

      LW   R3, 0(R1)
      LW   R13, 0(R11)
      ADDI R5, R3, 1
      ADD  R2, R2, R3
      ADD  R12, R13, R3

Total Execution Cycles: 5
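
The two cycle counts can be reproduced with a very small model: one instruction issues per cycle, and an instruction that reads the register loaded by the LW immediately before it waits one extra cycle. This is a sketch under that assumption only; `parse` and `issue_cycles` are hypothetical helpers that handle just the operand forms used in this example.

```python
def parse(line):
    """Split 'OP dest, src, src' into (op, dest, set of source registers)."""
    op, rest = line.split(None, 1)
    operands = [x.strip() for x in rest.split(",")]
    srcs = set()
    for operand in operands[1:]:
        if "(" in operand:                        # memory operand such as 0(R1)
            srcs.add(operand[operand.index("(") + 1:operand.index(")")])
        elif operand.startswith("R"):             # plain register, not an immediate
            srcs.add(operand)
    return op, operands[0], srcs

def issue_cycles(program):
    """Count issue cycles, adding one stall for each load-use pair."""
    cycles = 0
    prev = None
    for line in program:
        op, dest, srcs = parse(line)
        if prev is not None and prev[0] == "LW" and prev[1] in srcs:
            cycles += 1                           # load-use stall
        cycles += 1
        prev = (op, dest)
    return cycles

original = ["LW R3, 0(R1)", "ADDI R5, R3, 1", "ADD R2, R2, R3",
            "LW R13, 0(R11)", "ADD R12, R13, R3"]
scheduled = ["LW R3, 0(R1)", "LW R13, 0(R11)", "ADDI R5, R3, 1",
             "ADD R2, R2, R3", "ADD R12, R13, R3"]

print(issue_cycles(original), issue_cycles(scheduled))
```

Moving the second load up fills the slot after the first load and separates it from its own consumer, so both stalls disappear without changing any result.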

Loop-level Parallelism

Original Loop:

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Unrolled Loop:

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F2, F6
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F2, F10
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F2, F14
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop

Loop Unrolling

Instr producing result | Instr using result | Latency to avoid a stall
-----------------------|--------------------|-------------------------
FP ALU op              | Another FP ALU op  | 3
FP ALU op              | Store Double       | 2
Load Double            | FP ALU op          | 1
Load Double            | Store Double       | 0

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F2, F6
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F2, F10
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F2, F14
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop

Total Cycles: 27 cycles

Loop Unrolling

Instr producing result | Instr using result | Latency to avoid a stall
-----------------------|--------------------|-------------------------
FP ALU op              | Another FP ALU op  | 3
FP ALU op              | Store Double       | 2
Load Double            | FP ALU op          | 1
Load Double            | Store Double       | 0

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F2, F6
      ADD.D  F12, F2, F10
      ADD.D  F16, F2, F14
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      S.D    F16, 8(R1)
      BNE    R1, R2, Loop

Total Cycles: 14 cycles

➢ Code Size
➢ Register pressure
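
The 27-cycle and 14-cycle totals can be checked with a small in-order issue model driven by the latency table. This is a sketch, not the slide author's method. It assumes one issue per cycle, a latency of 1 between an integer ALU op and a dependent branch (this is not in the table; it is the value that makes the totals come out), and zero latency for every producer/consumer pair not listed. The scheduled order used here is loads first, then adds, then stores, with DADDUI moved ahead of the last two stores, which is why those stores use offsets 16 and 8. `decode` and `total_cycles` are hypothetical helpers.

```python
KIND = {"L.D": "load", "ADD.D": "fpalu", "S.D": "store",
        "DADDUI": "int", "BNE": "branch"}

# (producer kind, consumer kind) -> stall cycles needed between them.
LATENCY = {("fpalu", "fpalu"): 3, ("fpalu", "store"): 2,
           ("load", "fpalu"): 1, ("load", "store"): 0,
           ("int", "branch"): 1}   # assumption, not on the slide

def decode(line):
    """Return (kind, destination register or None, source registers)."""
    op, rest = line.split(None, 1)
    tokens = [t.strip() for t in rest.replace("(", ",").replace(")", "").split(",")]
    regs = [t for t in tokens if t[:1] in ("R", "F") and t[1:].isdigit()]
    kind = KIND[op]
    if kind in ("store", "branch"):          # these write no register
        return kind, None, regs
    return kind, regs[0], regs[1:]

def total_cycles(program):
    """Issue cycle of the last instruction under in-order, single issue."""
    last_writer = {}                         # register -> (issue cycle, kind)
    cycle = 0
    for line in program:
        kind, dest, srcs = decode(line)
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_writer:
                issued, producer = last_writer[reg]
                earliest = max(earliest,
                               issued + 1 + LATENCY.get((producer, kind), 0))
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle

UNROLLED = """
L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
L.D F6, -8(R1)
ADD.D F8, F2, F6
S.D F8, -8(R1)
L.D F10, -16(R1)
ADD.D F12, F2, F10
S.D F12, -16(R1)
L.D F14, -24(R1)
ADD.D F16, F2, F14
S.D F16, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, Loop
""".strip().splitlines()

SCHEDULED = """
L.D F0, 0(R1)
L.D F6, -8(R1)
L.D F10, -16(R1)
L.D F14, -24(R1)
ADD.D F4, F0, F2
ADD.D F8, F2, F6
ADD.D F12, F2, F10
ADD.D F16, F2, F14
S.D F4, 0(R1)
S.D F8, -8(R1)
DADDUI R1, R1, #-32
S.D F12, 16(R1)
S.D F16, 8(R1)
BNE R1, R2, Loop
""".strip().splitlines()

print(total_cycles(UNROLLED), total_cycles(SCHEDULED))
```

The model ignores memory dependences between stores and later loads, which is safe here only because each unrolled iteration touches a different address.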

Exceptions

● Certain exceptional events that occur during program execution, handled by the processor hardware
● Control transfer to specific OS code based on the family of exception
● I/O device requests, System call, Breakpoint, Integer arithmetic overflow, FP arithmetic anomaly, Page fault, Undefined or unimplemented instruction, Hardware malfunctions, Power failure.

Exceptions

● Synchronous vs. Asynchronous
● User requested vs. Coerced
● User maskable vs. User non-maskable
● Within vs. Between instructions
  – Save and restore processor state
  – Restartable pipeline
● Resume vs. Terminate

Stopping and Restarting Execution

● Trap instruction, Turn off writes, Save PC, Save processor state, Exception handler, RFE
● Precise exceptions

Pipeline stage | Problem exceptions occurring
---------------|-----------------------------
IF             | Page fault on instruction fetch; misaligned memory access; memory protection violation
ID             | Undefined or illegal opcode
EX             | Arithmetic exception
MEM            | Page fault on data fetch; misaligned memory access; memory protection violation
WB             | None

Precise Exceptions

LD      IF   ID   EX   MEM  WB
DADD         IF   ID   EX   MEM  WB

● Exceptions at the same cycle
● Early exception by a later instruction
● Instruction Status Vector
  – Check before commit
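
The status-vector idea can be illustrated with a toy model: each instruction records the exceptions it raises as it moves down the pipeline, and nothing is acted on until the instruction is about to commit, which happens in program order. The sketch below is a simplification under stated assumptions (a five-stage pipeline, one instruction entering per cycle, no stalls); `raise_order` and `first_committed_exception` are hypothetical names, and the fault placement is an example chosen for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def raise_order(status_vectors):
    """List (cycle, instruction index, exception) in the order faults are detected."""
    events = []
    for i, vector in enumerate(status_vectors):
        for stage, exc in vector.items():
            cycle = i + 1 + STAGES.index(stage)   # instruction i enters IF in cycle i+1
            events.append((cycle, i, exc))
    return sorted(events)

def first_committed_exception(status_vectors):
    """Check each status vector at commit, in program order; the first
    non-empty one decides which exception is taken."""
    for i, vector in enumerate(status_vectors):
        if vector:
            earliest_stage = min(vector, key=STAGES.index)
            return i, vector[earliest_stage]
    return None

# Program order: LD (index 0), then DADD (index 1).
# DADD faults early (IF, cycle 2); LD faults later (MEM, cycle 4).
vectors = [
    {"MEM": "data page fault"},            # LD
    {"IF": "instruction page fault"},      # DADD
]

print(raise_order(vectors))                # detection order: DADD first
print(first_committed_exception(vectors))  # handling order: LD first
```

Although the later instruction's fault is detected two cycles earlier, checking at commit means the LD fault is the one handled, which matches what an unpipelined machine would do.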

Control Dependences

● Program correctness
  – Data flow and Exception behaviour
● Software Speculation
  – Liveness

      DADDU  R2, R3, R4
      BEQZ   R2, L1
      LW     R1, 0(R2)
L1:

      DADDU  R1, R2, R3
      BEQZ   R4, L1
      DSUBU  R1, R5, R6
L1:   ...
      OR     R7, R1, R8

      DADDU  R1, R2, R3
      BEQZ   R12, L1
      DSUBU  R4, R5, R6
      DADDU  R5, R4, R9
L1:   OR     R7, R8, R9

Branch Hazards

● 1 stall cycle for every branch yields a performance loss of 10% to 30%!

Time (clock cycles)     1    2    3    4    5    6    7    8    9
Branch                  IF   ID   EX   MEM  WB
Branch Successor             IF   IF   ID   EX   MEM  WB
Branch Successor + 1                   IF   ID   EX   MEM  WB
Branch Successor + 2                        IF   ID   EX   MEM  WB

Reducing Pipeline Branch Penalties

● Freeze the pipeline
● Static Prediction
  – Predict Taken, Predict Untaken
● Fill Branch Delay Slot

Time (clock cycles)     1    2    3    4    5    6    7    8    9
Branch                  IF   ID   EX   MEM  WB
Branch Delay Slot            IF   ID   EX   MEM  WB
Branch Successor                  IF   ID   EX   MEM  WB
Branch Successor + 1                   IF   ID   EX   MEM  WB

From the MIPS ISA Manual: The transfer of control takes place only following the instruction immediately after the control transfer instruction.

Branch Delay Slot

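Performance of Branch Schemes

\[ \text{Stall cycles}_{\text{Branches}} = \text{Branch frequency} \times \text{Branch penalty} \]

\[ \text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}} \]

\[ \text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]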
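
As a quick numeric illustration of the speedup formula, the sketch below plugs in example values. The pipeline depth of 5, the 20% branch frequency, and the penalties are hypothetical numbers chosen for illustration, not figures from the slides.

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Speedup from pipelining when branches are the only source of stalls."""
    stall_cycles_per_instr = branch_frequency * branch_penalty
    return depth / (1 + stall_cycles_per_instr)

# Hypothetical workload: 5-stage pipeline, 20% branches.
for penalty in (0, 1, 2, 3):
    print(penalty, round(pipeline_speedup(5, 0.20, penalty), 2))
```

With a one-cycle penalty the speedup drops from 5 to about 4.17, which falls inside the 10% to 30% loss range quoted for branch hazards.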

Classes of Exceptions

Exception type                | Synchronous vs. Async | User request vs. Coerced | User maskable vs. nonmaskable | Within vs. between instructions | Resume vs. Terminate
------------------------------|-----------------------|--------------------------|-------------------------------|---------------------------------|---------------------
I/O device request            | Async                 | Coerced                  | Nonmaskable                   | Between                         | Resume
Invoke OS                     | Sync                  | User request             | Nonmaskable                   | Between                         | Resume
Tracing Instruction Execution | Sync                  | User request             | User maskable                 | Between                         | Resume
Breakpoint                    | Sync                  | User request             | User maskable                 | Between                         | Resume
Arithmetic Overflow           | Sync                  | Coerced                  | User maskable                 | Within                          | Resume
FP underflow or overflow      | Sync                  | Coerced                  | User maskable                 | Within                          | Resume
Page fault                    | Sync                  | Coerced                  | Nonmaskable                   | Within                          | Resume
Undefined Instructions        | Sync                  | Coerced                  | Nonmaskable                   | Within                          | Terminate
Hardware malfunctions         | Async                 | Coerced                  | Nonmaskable                   | Within                          | Terminate
Power Failure                 | Async                 | Coerced                  | Nonmaskable                   | Within                          | Terminate

Smith and Pleszkun, Implementing precise interrupts in pipelined processors, IEEE Transactions on Computers, 37(5), 1988.