Instruction Level Parallelism

● ILP, Loop-level Parallelism
● Dependences, Hazards
● Speculation, Branch prediction

Basic Block

● A straight-line code sequence with no branches in except to the entry and no branches out except at the exit

Loop: L.D F0, 0(R1)

ADD.D F4, F0, F2

S.D F4, 0(R1)

DADDUI R1, R1, #-8

BNE R1, R2, Loop

DADDU R2, R3, R4

BEQZ R2, L1

LW R1, 0(R2)

● Name dependence
– Antidependence, output dependence
– Register renaming

● Hazard
– Overlap during execution would change the order of access to the operand involved in the dependence.

for (i=0; i<=999; i=i+1)
    x[i] = x[i] + a;

Loop: L.D F0, 0(R1)

ADD.D F4, F0, F2

S.D F4, 0(R1)

DADDUI R1, R1, #-8

BNE R1, R2, Loop

Data Dependence

Name Dependence

ADD.D F4, F0, F2
ADD.D F4, F6, F8

Hazards

● Program Order
– ILP preserves program order only where it affects the outcome of the program

● Structural Hazards
– Resource conflicts

● Data Hazards
– RAW, WAW, WAR
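The three data hazard types can be told apart by comparing the registers two instructions read and write. The sketch below is only illustrative: the `(dest, sources)` tuple encoding and the function name `classify_hazards` are assumptions made for this example, not part of the slides.

```python
def classify_hazards(first, second):
    """Return the data hazards possible if `second` overlaps with the earlier `first`.

    Each instruction is a (dest, sources) pair: the register it writes
    (or None if it writes no register) and the list of registers it reads.
    This encoding is a hypothetical simplification for illustration.
    """
    first_dest, first_sources = first
    second_dest, second_sources = second
    hazards = set()
    # RAW: the later instruction reads what the earlier one writes.
    if first_dest is not None and first_dest in second_sources:
        hazards.add("RAW")
    # WAR: the later instruction writes what the earlier one reads.
    if second_dest is not None and second_dest in first_sources:
        hazards.add("WAR")
    # WAW: both instructions write the same register.
    if first_dest is not None and first_dest == second_dest:
        hazards.add("WAW")
    return hazards


load = ("F0", ["R1"])              # L.D    F0, 0(R1)
add = ("F4", ["F0", "F2"])         # ADD.D  F4, F0, F2
other_add = ("F4", ["F6", "F8"])   # ADD.D  F4, F6, F8
store = (None, ["F4", "R1"])       # S.D    F4, 0(R1)
decrement = ("R1", ["R1"])         # DADDUI R1, R1, #-8

print(classify_hazards(load, add))         # true data dependence
print(classify_hazards(add, other_add))    # output dependence
print(classify_hazards(store, decrement))  # antidependence
```

RAW corresponds to a true data dependence, while WAR and WAW correspond to the antidependence and output dependence listed under name dependences.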

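Structural Hazard

[Figure: pipeline timing diagram over clock cycles 1 to 9. Each instruction passes through MEM (instruction fetch), ID, EX, MEM, WB. A later instruction's fetch from memory falls in the same cycle as an earlier instruction's MEM stage: HAZARD.]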

Data Hazard

DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

[Figure: pipeline diagram, time in clock cycles. Each instruction passes through IM, REG, ALU, DM, REG.]

Avoiding Data Hazards – Forwarding

DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

[Figure: pipeline diagram, time in clock cycles. Each instruction passes through IM, REG, ALU, DM, REG.]

Load Delay Slot

LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9

[Figure: pipeline diagram, time in clock cycles. Each instruction passes through IM, REG, ALU, DM, REG.]

The loaded value might not be available in the destination register for use by the instruction immediately following the load: LOAD DELAY SLOT

Cost of Stalls

Data references = 40%. Ideal CPI = 1. The processor with the hazard has a clock rate 1.1 times that of the processor without the hazard. Which processor is faster?

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
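The Cost of Stalls question can be worked through with the pipeline CPI formula. This is a sketch under two assumptions that the slide does not state explicitly: every data reference costs one stall cycle on the processor with the hazard, and "1.1 times faster" refers to its clock rate. The function and constant names are illustrative.

```python
def average_instruction_time(ideal_cpi, stall_cycles_per_instruction, clock_cycle_time):
    """Average time per instruction = pipeline CPI x clock cycle time."""
    return (ideal_cpi + stall_cycles_per_instruction) * clock_cycle_time


DATA_REFERENCE_FRACTION = 0.40   # from the slide
STALLS_PER_DATA_REFERENCE = 1    # assumption: one stall cycle per data reference
CLOCK_RATE_ADVANTAGE = 1.1       # assumption: "1.1 times faster" means clock rate

# Processor without the hazard: ideal CPI, reference clock cycle of 1 time unit.
time_without_hazard = average_instruction_time(1.0, 0.0, 1.0)

# Processor with the hazard: extra stalls, but a shorter clock cycle.
time_with_hazard = average_instruction_time(
    1.0,
    DATA_REFERENCE_FRACTION * STALLS_PER_DATA_REFERENCE,
    1.0 / CLOCK_RATE_ADVANTAGE,
)

ratio = time_with_hazard / time_without_hazard
print(f"with hazard / without hazard = {ratio:.2f}")
```

Under these assumptions the ratio is 1.4 / 1.1, so the processor without the hazard comes out ahead despite its slower clock.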

Pipeline Scheduling

Reorder the instructions of the program so that dependent instructions are far enough apart.

Static Instruction Scheduling: done by the compiler, before the program runs.

Dynamic Instruction Scheduling: done by the hardware, while the program is running.

Pipeline Scheduling

Original Program (Total Execution Cycles: 7)

LW R3, 0(R1)
ADDI R5, R3, 1
ADD R2, R2, R3
LW R13, 0(R11)
ADD R12, R13, R3

Scheduled Code (Total Execution Cycles: 5)

LW R3, 0(R1)
LW R13, 0(R11)
ADDI R5, R3, 1
ADD R2, R2, R3
ADD R12, R13, R3
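The two cycle totals can be reproduced by counting one issue slot per instruction plus one bubble whenever an instruction reads the register loaded by the instruction immediately before it. This is a minimal sketch that assumes a one-cycle load-use stall and ignores pipeline fill and drain; the tuple encoding and the name `count_cycles` are illustrative.

```python
def count_cycles(program):
    """Count issue slots: one per instruction plus one per load-use stall.

    Each instruction is (opcode, dest, sources). A stall is charged when an
    instruction reads the destination of the LW immediately preceding it.
    """
    cycles = 0
    previous = None
    for opcode, dest, sources in program:
        if previous is not None and previous[0] == "LW" and previous[1] in sources:
            cycles += 1  # bubble: loaded value is not ready yet
        cycles += 1      # the instruction itself
        previous = (opcode, dest, sources)
    return cycles


original = [
    ("LW", "R3", ["R1"]),           # LW   R3, 0(R1)
    ("ADDI", "R5", ["R3"]),         # ADDI R5, R3, 1
    ("ADD", "R2", ["R2", "R3"]),    # ADD  R2, R2, R3
    ("LW", "R13", ["R11"]),         # LW   R13, 0(R11)
    ("ADD", "R12", ["R13", "R3"]),  # ADD  R12, R13, R3
]

scheduled = [
    ("LW", "R3", ["R1"]),
    ("LW", "R13", ["R11"]),
    ("ADDI", "R5", ["R3"]),
    ("ADD", "R2", ["R2", "R3"]),
    ("ADD", "R12", ["R13", "R3"]),
]

print(count_cycles(original), count_cycles(scheduled))
```

Moving the second load up separates each load from its first consumer, which removes both bubbles.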

Loop-level Parallelism

Original Loop:

Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop

Unrolled:

Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
L.D F6, -8(R1)
ADD.D F8, F2, F6
S.D F8, -8(R1)
L.D F10, -16(R1)
ADD.D F12, F2, F10
S.D F12, -16(R1)
L.D F14, -24(R1)
ADD.D F16, F2, F14
S.D F16, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, Loop
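The same transformation can be sketched at source level. The example below unrolls the `x[i] = x[i] + a` loop four times and walks the array downward, as the assembly does with R1. It assumes the element count is a multiple of 4, so no cleanup loop is needed; the function names are illustrative.

```python
def add_scalar_rolled(x, a):
    """One element per iteration, like the original loop."""
    for i in range(len(x)):
        x[i] = x[i] + a


def add_scalar_unrolled_by_4(x, a):
    """Four elements per iteration, with one index update and one loop test.

    Assumes len(x) is a multiple of 4 (no remainder loop).
    """
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    i = len(x) - 1
    while i >= 0:              # single loop test per four elements
        x[i] = x[i] + a        # offset 0
        x[i - 1] = x[i - 1] + a  # offset -8 in the assembly
        x[i - 2] = x[i - 2] + a  # offset -16
        x[i - 3] = x[i - 3] + a  # offset -24
        i -= 4                 # single index update, like DADDUI R1, R1, #-32


rolled = [float(v) for v in range(1000)]
unrolled = [float(v) for v in range(1000)]
add_scalar_rolled(rolled, 2.5)
add_scalar_unrolled_by_4(unrolled, 2.5)
print(rolled == unrolled)
```

The four copies of the body are independent of one another, which is what gives a scheduler room to interleave them.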

Loop Unrolling

Instr producing result    Instr using result    Latency to avoid a stall
FP ALU op                 Another FP ALU op     3
FP ALU op                 Store double          2
Load double               FP ALU op             1
Load double               Store double          0

Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
L.D F6, -8(R1)
ADD.D F8, F2, F6
S.D F8, -8(R1)
L.D F10, -16(R1)
ADD.D F12, F2, F10
S.D F12, -16(R1)
L.D F14, -24(R1)
ADD.D F16, F2, F14
S.D F16, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, Loop

Total Cycles: 27 cycles

Loop Unrolling

Instr producing result    Instr using result    Latency to avoid a stall
FP ALU op                 Another FP ALU op     3
FP ALU op                 Store double          2
Load double               FP ALU op             1
Load double               Store double          0

Loop: L.D F0, 0(R1)
L.D F6, -8(R1)
L.D F10, -16(R1)
L.D F14, -24(R1)
ADD.D F4, F0, F2
ADD.D F8, F2, F6
ADD.D F12, F2, F10
ADD.D F16, F2, F14
S.D F4, 0(R1)
S.D F8, -8(R1)
DADDUI R1, R1, #-32
S.D F12, 16(R1)
S.D F16, 8(R1)
BNE R1, R2, Loop

Total Cycles: 14 cycles

➢ Code Size
➢ Register pressure
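The 27-cycle and 14-cycle totals on the Loop Unrolling slides can be reproduced with a small in-order, single-issue model. This is a sketch, not the slides' own method: it uses the stall counts from the latency table for the pairs that occur in this loop, and it additionally assumes a one-cycle stall between DADDUI and the BNE that reads R1, which the table does not list. The scheduled order used here is loads first, then adds, then stores, with DADDUI placed ahead of the last two stores. All names are illustrative.

```python
# Stall cycles between a producing and a consuming instruction kind.
# Pairs not listed are assumed to need no stall.
STALLS = {
    ("load", "fpalu"): 1,
    ("fpalu", "store"): 2,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,  # assumption: not in the slide's latency table
}


def total_cycles(program):
    """Issue instructions in order, one per cycle, stalling on RAW dependences."""
    producer = {}  # register -> (issue cycle, kind) of its most recent writer
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1
        for register in sources:
            if register in producer:
                issued, producer_kind = producer[register]
                stalls = STALLS.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + 1 + stalls)
        cycle = earliest
        if dest is not None:
            producer[dest] = (cycle, kind)
    return cycle


def load(dest):
    return ("load", dest, ["R1"])            # L.D dest, offset(R1)


def fp_add(dest, src):
    return ("fpalu", dest, [src, "F2"])      # ADD.D dest, F2, src


def store(src):
    return ("store", None, [src, "R1"])      # S.D src, offset(R1)


DADDUI = ("intalu", "R1", ["R1"])
BNE = ("branch", None, ["R1", "R2"])

pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]

unrolled = []
for loaded, summed in pairs:
    unrolled += [load(loaded), fp_add(summed, loaded), store(summed)]
unrolled += [DADDUI, BNE]

scheduled = (
    [load(loaded) for loaded, _ in pairs]
    + [fp_add(summed, loaded) for loaded, summed in pairs]
    + [store("F4"), store("F8"), DADDUI, store("F12"), store("F16"), BNE]
)

print(total_cycles(unrolled), total_cycles(scheduled))
```

Both listings issue the same 14 instructions; the difference between the two totals is entirely stall cycles, which the scheduled order avoids by keeping each consumer far enough from its producer.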

Exceptions

● Certain exceptional events that occur during program execution, handled by the processor hardware
● Control transfer to specific OS code based on the family of exception
● I/O device requests, system call, breakpoint, integer arithmetic overflow, FP arithmetic anomaly, page fault, undefined or unimplemented instruction, hardware malfunctions, power failure

Exceptions

● Synchronous vs. Asynchronous
● User requested vs. Coerced
● User maskable vs. User non-maskable
● Within vs. Between instructions
– Save and restore processor state
– Restartable pipeline
● Resume vs. Terminate

Stopping and Restarting Execution

● Trap instruction, turn off writes, save PC, save processor state, exception handler, RFE
● Precise exceptions

Pipeline stage    Problem exceptions occurring
IF                Page fault on instruction fetch; misaligned memory access; memory protection violation
ID                Undefined or illegal opcode
EX                Arithmetic exception
MEM               Page fault on data fetch; misaligned memory access; memory protection violation
WB                None

Precise Exceptions

LD      IF  ID  EX  MEM  WB
DADD        IF  ID  EX   MEM  WB

● Exceptions at the same cycle
● Early exception by a later instruction
● Instruction Status Vector
– Check before commit
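The "check before commit" idea can be sketched as follows. Each instruction records the stages in which it faulted in its own status vector, and the vectors are examined in program order at commit, so an early fault raised by a later instruction cannot overtake a fault belonging to an earlier one. The timing model (one instruction entering IF per cycle, no stalls) and all names are assumptions made for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def first_detected(instructions, faults):
    """Instruction whose fault is detected earliest in time.

    Instruction number i (0-based) enters IF in cycle i + 1, so it is in
    stage number s during cycle i + 1 + s.
    """
    events = []
    for index, name in enumerate(instructions):
        for stage_index, stage in enumerate(STAGES):
            if (name, stage) in faults:
                events.append((index + 1 + stage_index, name))
    return min(events)[1] if events else None


def handled_at_commit(instructions, faults):
    """Fault taken when status vectors are checked in program order at commit."""
    for name in instructions:
        status_vector = [stage for stage in STAGES if (name, stage) in faults]
        if status_vector:
            # Everything after this instruction is squashed and restarted later.
            return name, status_vector[0]
    return None


program = ["LD", "DADD"]
# LD takes a data page fault in MEM (cycle 4); DADD faults in IF (cycle 2).
faults = {("LD", "MEM"), ("DADD", "IF")}

print(first_detected(program, faults))     # which fault appears first in time
print(handled_at_commit(program, faults))  # which fault is handled first
```

Deferring the decision to commit is what keeps the exception precise: every instruction before the faulting one has completed, and none after it has changed state.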

Control Dependences

● Program correctness
– Data flow and Exception behaviour

● Software Speculation
– Liveness

DADDU R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)

DADDU R1, R2, R3
BEQZ R4, L1
DSUBU R1, R5, R6
L1: …........
OR R7, R1, R8

DADDU R1, R2, R3
BEQZ R12, L1
DSUBU R4, R5, R6
DADDU R5, R4, R9
L1: OR R7, R8, R9

Branch Hazards

● 1 stall cycle for every branch yields a performance loss of 10% to 30%!

[Figure: pipeline timing diagram, time in clock cycles 1 to 9. Rows: Branch, Branch Successor, Branch Successor + 1. Stages: IF ID EX MEM WB.]

Reducing Pipeline Branch Penalties

● Freeze the pipeline
● Static Prediction
– Predict Taken, Predict Untaken
● Fill Branch Delay Slot

[Figure: pipeline timing diagram, time in clock cycles 1 to 9. Rows: Branch, Branch Delay Slot, Branch Successor. Stages: IF ID EX MEM WB.]

Branch Delay Slot

From the MIPS ISA Manual: The transfer of control takes place only following the instruction immediately after the control transfer instruction.

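Performance of Branch Schemes

\[ \text{Stall cycles}_{\text{branches}} = \text{Branch frequency} \times \text{Branch penalty} \]

\[ \text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}} \]

\[ \text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]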
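The speedup formula can be evaluated for a few branch frequencies to see how quickly stalls erode the ideal speedup. The inputs below are illustrative assumptions: a pipeline depth of 5 (the IF, ID, EX, MEM, WB stages drawn in the diagrams), a branch penalty of 1 cycle, and branch frequencies of 10%, 20%, and 30%.

```python
def pipeline_speedup(pipeline_depth, branch_frequency, branch_penalty):
    """Speedup from pipelining when branch stalls are the only stall source."""
    stall_cycles_per_instruction = branch_frequency * branch_penalty
    return pipeline_depth / (1 + stall_cycles_per_instruction)


PIPELINE_DEPTH = 5   # assumption: five-stage pipeline
BRANCH_PENALTY = 1   # assumption: one stall cycle per branch

for branch_frequency in (0.10, 0.20, 0.30):
    speedup = pipeline_speedup(PIPELINE_DEPTH, branch_frequency, BRANCH_PENALTY)
    print(f"branch frequency {branch_frequency:.0%}: speedup {speedup:.2f}")
```

With no branch stalls the function returns the pipeline depth itself, which is the ideal speedup the denominator is eroding.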

Classes of Exceptions

Exception type                   Sync vs. async   User request vs. coerced   User maskable vs. nonmaskable   Within vs. between instructions   Resume vs. terminate
I/O device request               Async            Coerced                    Nonmaskable                     Between                           Resume
Invoke OS                        Sync             User request               Nonmaskable                     Between                           Resume
Tracing instruction execution    Sync             User request               User maskable                   Between                           Resume
Breakpoint                       Sync             User request               User maskable                   Between                           Resume
Arithmetic overflow              Sync             Coerced                    User maskable                   Within                            Resume
FP underflow or overflow         Sync             Coerced                    User maskable                   Within                            Resume
Page fault                       Sync             Coerced                    Nonmaskable                     Within                            Resume
Undefined instructions           Sync             Coerced                    Nonmaskable                     Within                            Terminate
Hardware malfunctions            Async            Coerced                    Nonmaskable                     Within                            Terminate
Power failure                    Async            Coerced                    Nonmaskable                     Within                            Terminate

Smith and Pleszkun, "Implementing precise interrupts in pipelined processors," IEEE Transactions on Computers, 37(5), 1988.
