Instruction Level Parallelism
Appendix C and Chapter 3, HP5e
Outline
● Pipelining, Hazards
● Branch prediction
● Static and Dynamic Scheduling
● Speculation
● Compiler techniques, VLIW
● Limits of ILP
Pipelining Basics
Implementation of RISC ISA - Stages
● Instruction Fetch (IF)
● Instruction Decode/Register Fetch (ID)
– Fixed field decoding
● Execution/Effective address (EX)
● Memory Access (MEM)
● Write back (WB)
MIPS Datapath
[Figure: MIPS datapath showing PC, ADD (+4), IM, NPC, IR, Regs (rs, rt, rd), Sign Extend (16 to 32 bits), A, B, Imm, ALU, ALUOutput, Zero?/Cond, DM, LMD, and MUXes]
Stages: Instruction Fetch (IF), Instruction Decode/Register Fetch (ID), Execute/Address Calculation (EX), Memory Access (MEM), Write Back (WB)
Multiple Issue Integer Pipeline
[Figure: multiple issue integer pipeline showing IM, IR0, IR1, RF Read, A, B, Zero?, DM, and RF Write across the IF, ID, EX, MEM, and WB stages]
Pipeline Performance
An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, because of clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?
Average Instruction Execution time = Clock cycle × Average CPI
$$\mathrm{CPI} = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction Count}} \times \mathrm{CPI}_i$$
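The problem above can be worked through numerically. The sketch below assumes that the 40%, 20%, and 40% frequencies map to ALU operations, branches, and memory operations in that order, and that the pipelined processor reaches an ideal CPI of 1; neither assumption is stated explicitly on the slide, and the variable names are illustrative.

```python
# Instruction mix: name -> (relative frequency, cycles on the unpipelined processor).
# The frequency-to-class mapping is an assumption based on the order in the problem.
MIX = {"alu": (0.40, 4), "branch": (0.20, 4), "memory": (0.40, 5)}

UNPIPELINED_CLOCK_NS = 1.0   # clock cycle of the unpipelined processor
PIPELINE_OVERHEAD_NS = 0.2   # clock skew and setup overhead added by pipelining


def average_cpi(mix):
    """Weighted CPI: sum over classes of frequency * cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


# Unpipelined: average instruction time = clock cycle * average CPI.
unpipelined_time_ns = UNPIPELINED_CLOCK_NS * average_cpi(MIX)

# Pipelined (assumed CPI of 1): one instruction per cycle, with a slower clock.
pipelined_time_ns = UNPIPELINED_CLOCK_NS + PIPELINE_OVERHEAD_NS

speedup = unpipelined_time_ns / pipelined_time_ns
print(f"unpipelined: {unpipelined_time_ns:.2f} ns, "
      f"pipelined: {pipelined_time_ns:.2f} ns, speedup: {speedup:.2f}")
```

Under these assumptions the average unpipelined instruction takes 4.4 ns, the pipelined one 1.2 ns, and the speedup is about 3.7.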
Dependences
Pipeline Hazards – Structural & Data
Outline
● Data dependences
● Name dependences
● Structural hazards
● Data hazards
– Stalling, Forwarding
Basic Block
● A straight-line code sequence with no branches in except to the entry and no branches out except at the exit
Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop
Dependence
● Name dependences
– Register renaming
● Hazard
– Overlap during execution could change the order of access to the operand involved in the dependence.
for (i=0; i<=999; i=i+1)
    x[i] = x[i] + a;
Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop
Data Dependence (RAW)
Name Dependences (WAR, WAW)
ADD.D F4, F0, F2
ADD.D F4, F6, F8
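The three dependence kinds named above can be told apart by comparing the destination and source registers of an instruction pair. The following is a minimal sketch, not a full dependence analyzer: the `Instr` record and the `dependences` helper are hypothetical names introduced for illustration, and memory dependences are ignored.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]      # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Classify register dependences of `second` on the earlier `first`."""
    kinds = set()
    if first.dest is not None and first.dest in second.srcs:
        kinds.add("RAW")       # true data dependence: read after write
    if second.dest is not None and second.dest in first.srcs:
        kinds.add("WAR")       # antidependence: write after read
    if first.dest is not None and first.dest == second.dest:
        kinds.add("WAW")       # output dependence: write after write
    return kinds


# Instructions taken from the loop and the ADD.D pair on the slide.
ld = Instr("L.D", "F0", ("R1",))
add = Instr("ADD.D", "F4", ("F0", "F2"))
sd = Instr("S.D", None, ("F4", "R1"))
daddui = Instr("DADDUI", "R1", ("R1",))
add2 = Instr("ADD.D", "F4", ("F6", "F8"))

print(dependences(ld, add))      # L.D writes F0, ADD.D reads it
print(dependences(sd, daddui))   # S.D reads R1, DADDUI overwrites it
print(dependences(add, add2))    # both ADD.D instructions write F4
```

Only the RAW case carries a value from one instruction to the other; the WAR and WAW cases come from reusing a register name, which is why register renaming can remove them.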
Hazards
● Program Order
– ILP preserves program order only where it affects the outcome of the program
● Structural Hazards
– Resource conflicts
● Data Hazards
– RAW, WAW, WAR
● Control Hazard
– Whether or not an instruction should be executed depends on a control decision made by an earlier instruction
Structural Hazard
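Time (clock cycles)
      1    2    3    4    5    6    7    8    9
i1    MEM  ID   EX   MEM  WB
i2         MEM  ID   EX   MEM  WB
i3              MEM  ID   EX   MEM  WB
i4                   MEM  ID   EX   MEM  WB
i5                        MEM  ID   EX   MEM  WB
...
HAZARD!!! (in cycle 4, the fetch of i4 and the memory access of i1 both use the memory)
● Unified Memory example
● Register File – WB, ID example.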
Cost of a Load Structural Hazard
● Data references constitute 40% of the instruction mix. Ideal CPI = 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate that is 1.1 times higher than the clock rate of the processor without the hazard. Which processor is faster, and by how much?
$$\text{Avg. Instruction Time} = \mathrm{CPI} \times \text{Clock cycle time}$$
$$\text{Avg. Instruction Time}_{ideal} = \mathrm{CPI} \times \text{Clock cycle time}_{ideal} = 1 \times \text{Clock cycle time}_{ideal}$$
With the structural hazard, each data reference costs one stall cycle:
$$\text{Avg. Instruction Time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{ideal}}{1.1}$$
$$\text{Avg. Instruction Time} = 1.27 \times \text{Clock cycle time}_{ideal}$$
The processor without the structural hazard is therefore about 1.27 times faster.
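The same comparison can be checked with a few lines of arithmetic. This is only a sketch of the calculation above, assuming one stall cycle per data reference; the variable names are illustrative.

```python
DATA_REFERENCE_FRACTION = 0.4   # share of instructions that reference data
STALL_CYCLES_PER_REFERENCE = 1  # assumed cost of the structural hazard
CLOCK_RATE_ADVANTAGE = 1.1      # hazard processor clock rate / ideal clock rate

ideal_cpi = 1.0
hazard_cpi = ideal_cpi + DATA_REFERENCE_FRACTION * STALL_CYCLES_PER_REFERENCE

# Times are expressed in units of the ideal processor's clock cycle time.
ideal_time = ideal_cpi * 1.0
hazard_time = hazard_cpi * (1.0 / CLOCK_RATE_ADVANTAGE)

ratio = hazard_time / ideal_time
print(f"hazard processor takes {ratio:.2f}x as long per instruction")
```

Even with a 10% faster clock, the processor with the hazard takes about 1.27 times as long per instruction.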
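Data Hazards
[Figure: pipelined MIPS datapath showing PC, ADD (+4), IM, NPC, Regs (rs, rt, rd), Sign Extend (16 to 32 bits), A, B, Imm, ALU, ALUOutput, Zero?/Cond, DM, LMD, and MUXes, with the IR carried through the pipeline stages]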
R1 ← R2 + R3
R4 ← R1 + R5
R1 is updated in the WB stage.
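Stalled Stages and Pipeline Bubbles
Time (clock cycles)    1    2    3    4    5    6    7    8    9    10   11   12
(I1) R1 ← R2 + R3      IF   ID   EX   MA   WB
(I2) R4 ← R1 + R5           IF   ID   ID   ID   ID   EX   MA   WB
(I3)                             IF   IF   IF   IF   ID   EX   MA   WB
(I4)                                                 IF   ID   EX   MA   WB
(I5)                                                      IF   ID   EX   MA   WB
Stalled Stages: the repeated ID entries of I2 and the repeated IF entries of I3.

Resource usage
Time (clock cycles)    1    2    3    4    5    6    7    8    9    10   11   12
IF                     I1   I2   I3   I3   I3   I3   I4   I5
ID                          I1   I2   I2   I2   I2   I3   I4   I5
EX                               I1   nop  nop  nop  I2   I3   I4   I5
MA                                    I1   nop  nop  nop  I2   I3   I4   I5
WB                                         I1   nop  nop  nop  I2   I3   I4   I5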
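The stall count in the diagram above can be reproduced with a small timing model. The sketch below assumes an in-order five-stage pipeline with no forwarding, in which a consumer may complete ID only in a cycle after the producer's WB; the function name and the third instruction are hypothetical and only illustrate an independent successor.

```python
from typing import List, Optional, Tuple

# (destination register or None, source registers)
Instr = Tuple[Optional[str], Tuple[str, ...]]


def id_and_wb_cycles(program: List[Instr]) -> List[Tuple[int, int, int]]:
    """Return (stall cycles, final ID cycle, WB cycle) for each instruction."""
    written_in = {}   # register -> cycle in which its producer performs WB
    prev_id = 1       # chosen so that the first instruction decodes in cycle 2
    result = []
    for dest, srcs in program:
        earliest_id = prev_id + 1  # in order: decode after the previous one
        # Without forwarding, a source is readable the cycle after its WB.
        ready = max([written_in.get(s, 0) + 1 for s in srcs], default=0)
        id_cycle = max(earliest_id, ready)
        wb_cycle = id_cycle + 3    # EX, MA, WB follow ID
        if dest is not None:
            written_in[dest] = wb_cycle
        result.append((id_cycle - earliest_id, id_cycle, wb_cycle))
        prev_id = id_cycle
    return result


program = [
    ("R1", ("R2", "R3")),   # I1: R1 <- R2 + R3
    ("R4", ("R1", "R5")),   # I2: R4 <- R1 + R5 (RAW on R1)
    ("R6", ("R7", "R8")),   # I3: hypothetical independent instruction
]
timing = id_and_wb_cycles(program)
for name, (stalls, id_cycle, wb_cycle) in zip(("I1", "I2", "I3"), timing):
    print(f"{name}: {stalls} stall cycles, ID done in cycle {id_cycle}, WB in cycle {wb_cycle}")
```

In this model I2 waits three cycles in ID, which corresponds to the three nops that travel through EX, MA, and WB.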
How to overcome this hazard?
Resolving Data Hazards
● Stalling one of the instructions
● Data Forwarding (Bypassing)
● Scheduling hazardous instructions away from each other
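Stalling (Interlocking)
[Figure: pipelined MIPS datapath showing PC, ADD (+4), IM, NPC, Regs (rs, rt, rd), Sign Extend (16 to 32 bits), A, B, Imm, ALU, ALUOutput, Zero?/Cond, DM, LMD, and MUXes, with the IR carried through the pipeline stages]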
R1 ← R2 + R3
R4 ← R1 + R5
NOP
Stall Condition
Pipeline Performance
$$\text{Speedup}_{pipelining} = \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}$$
$$\text{Speedup}_{pipelining} = \frac{\mathrm{CPI}_{unpipelined}}{\mathrm{CPI}_{pipelined}}$$
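The two expressions above are linked by a short derivation. It is sketched here under the usual textbook assumptions, which the slide does not state: the pipelined and unpipelined clock cycles are equal, the ideal pipelined CPI is 1, and the unpipelined CPI equals the pipeline depth.

```latex
\begin{align*}
\text{Speedup}_{pipelining}
  &= \frac{\mathrm{CPI}_{unpipelined}}{\mathrm{CPI}_{pipelined}} \\
  &= \frac{\mathrm{CPI}_{unpipelined}}{1 + \text{Stall cycles per instruction}} \\
  &= \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}
\end{align*}
```

For example, a 5-stage pipeline that averages 0.4 stall cycles per instruction would reach a speedup of 5 / 1.4 ≈ 3.57 under these assumptions.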
Forwarding
DADD R1,R2,R3
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9
XOR  R10,R1,R11
Time (clock cycles)
DADD   IM   REG  ALU  DM   REG
DSUB        IM   REG  ALU  DM   REG
AND              IM   REG  ALU  DM   REG
Forwarding
Before Bypassing (CPI > 1)
Time (clock cycles)
R1 ← R2 + R3    IF   ID   EX   MA   WB
R4 ← R1 + R5         IF   ID   ID   ID   ID   EX   MA   WB
                          IF   IF   IF   IF   ID   EX   MA   WB
Stalled Stages: the repeated ID and IF entries.

After Bypassing (CPI = 1)
Time (clock cycles)
R1 ← R2 + R3    IF   ID   EX   MA   WB
R4 ← R1 + R5         IF   ID   EX   MA   WB
                          IF   ID   EX   MA   WB
Cost of Forwarding
● In longer pipelines?
● In multiple issue pipelines?
● Have all the dependences been resolved?
Forwarding
● Forwarding cannot solve all data dependence problems
LD R2, 4(R1)
ADD R4, R2, R3
Time (clock cycles)
LD     IM   REG  ALU  DM   REG
ADD         IM   REG  ALU  DM   REG
Forwarding - Stall Condition
● Forwarding cannot solve all data dependence problems
LD R2, 4(R1)
ADD R4, R2, R3
Time (clock cycles)
LD     IM   REG  ALU    DM   REG
ADD         IM   REG    STALL  ALU  DM   REG
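The stall condition for this case can be written as a simple check between the instruction in EX and the one in ID. The following is a minimal sketch, assuming full forwarding for ALU results so that only a load followed immediately by a dependent instruction needs a bubble; the `Instr` record and `needs_load_use_stall` are hypothetical names used for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read

LOAD_OPS = {"LD", "LW", "L.D"}


def needs_load_use_stall(in_ex: Instr, in_id: Instr) -> bool:
    """True when the instruction in EX is a load whose result the instruction in ID reads.

    The loaded value only exists at the end of the load's memory stage, so it
    cannot be forwarded in time to the dependent instruction's EX stage.
    """
    return (in_ex.op in LOAD_OPS
            and in_ex.dest is not None
            and in_ex.dest in in_id.srcs)


ld = Instr("LD", "R2", ("R1",))           # LD  R2, 4(R1)
add = Instr("ADD", "R4", ("R2", "R3"))    # ADD R4, R2, R3
dadd = Instr("DADD", "R1", ("R2", "R3"))  # DADD R1, R2, R3
dsub = Instr("DSUB", "R4", ("R1", "R5"))  # DSUB R4, R1, R5

print(needs_load_use_stall(ld, add))      # load followed by a dependent ADD
print(needs_load_use_stall(dadd, dsub))   # ALU result, forwarding suffices
```

The check only looks at adjacent instructions because, with forwarding in place, an instruction two or more slots after the load can receive the loaded value without stalling.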
Instruction Level Parallelism
Static Scheduling
Outline
● ILP
● Multicycle instructions
● Loop unrolling, scheduling
● Superscalar pipelines
ILP
● Instruction-level parallelism: overlap among instructions, through pipelining or multiple instruction execution
● What determines the degree of ILP?
– dependences: property of the program
– hazards: property of the pipeline
Pipeline Scheduling
● Reorder instructions so that dependent instructions are far enough apart
● Done by the compiler, before the program runs: Static Instruction Scheduling
● Done by the hardware, when the program is running: Dynamic Instruction Scheduling
Static vs. Dynamic Scheduling
● Dynamic scheduling:
– requires complex structures to identify independent instructions (scoreboards, issue queue)
– high power consumption
– low clock speed
– high design and verification effort
● Static: Compiler can compute instruction latencies and dependences
Pipeline Scheduling
Original Program
LW R3, 0(R1)
stall
ADDI R5, R3, 1
ADD R2, R2, R3
LW R13, 0(R11)
stall
ADD R12, R13, R3
Total Execution Cycles: 7

Scheduled Code
LW R3, 0(R1)
LW R13, 0(R11)
ADDI R5, R3, 1
ADD R2, R2, R3
ADD R12, R13, R3
Total Execution Cycles: 5
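The cycle totals for the two orderings can be reproduced with a simple counting model. The sketch below assumes forwarding with a one-cycle load-use delay and counts one cycle per issued instruction plus one per stall, ignoring pipeline fill and drain; `execution_cycles` and the `Instr` record are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def execution_cycles(program: List[Instr]) -> int:
    """Count issue cycles, adding one stall when an instruction reads the
    result of the load immediately before it."""
    cycles = 0
    prev = None
    for instr in program:
        if prev is not None and prev.op == "LW" and prev.dest in instr.srcs:
            cycles += 1          # load-use stall
        cycles += 1              # the instruction itself
        prev = instr
    return cycles


lw_r3 = Instr("LW", "R3", ("R1",))
lw_r13 = Instr("LW", "R13", ("R11",))
addi = Instr("ADDI", "R5", ("R3",))
add_r2 = Instr("ADD", "R2", ("R2", "R3"))
add_r12 = Instr("ADD", "R12", ("R13", "R3"))

original = [lw_r3, addi, add_r2, lw_r13, add_r12]
scheduled = [lw_r3, lw_r13, addi, add_r2, add_r12]

print("original: ", execution_cycles(original))
print("scheduled:", execution_cycles(scheduled))
```

Moving the second load up places an independent instruction after each load, so both load-use stalls disappear while every dependence is still respected.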
References
● HP5e. Chapter 3 – Instruction-Level Parallelism and Its Exploitation.
● HP5e. Appendix A – Instruction Set Principles.
● HP5e. Appendix C – Pipelining: Basic and Intermediate Concepts.
● HP5e. Appendix H – Hardware and Software for VLIW and EPIC.