EECE476: Computer Architecture
Lecture 22:Zero-cycle Branches (no text)
Superpipelining (no text)vs.
Superscalar (text 6.8)
The University ofBritish Columbia EECE 476 © 2005 Guy Lemieux
2
Jumps and Unconditional Branches
• BTB: Branch-target buffer– Tells us where a branch will jump to
• BPB: Branch-prediction buffer– Tells us if branch will be taken
• Consider J and (unconditional) BR– Always takes the branch (prediction unnecessary)
ADD $t3, $t1,$t2J LABEL…
LABEL: SUBI $t3, $t3, 1
– This ADD always knows a “J is coming after” (the J is always at PC+4)– Target of JMP/BR is known (not from a register, JR)– Recipe for zero-cycle branches!
3
Zero-cycle Branches?ADD $t3, $t1,$t2J LABEL…
LABEL: SUBI $t3, $t3, 1
• Use BTB for non-branch instruction (eg, ADD)– Any instruction immediately before a J or BR
• BTB reserves entry for ADD, Target: “LABEL” of J, Tag: PC of ADD– ADD now “looks like” a branch/jump
• When executing ADD– BTB says “always fetch from target LABEL”– Requires small change to datapath (BTB can also select next PC, not just BPB)
• Do not have to fetch JMP/BR itself!– Branch executed in zero cycles!
4
Zero-cycle Branch: Limits
• What if target comes from a register?– BTB holds useless value, usually wrong
• What if branch is conditional (eg, BEQ)?– Two paths: taken and untaken– Do not know which path is correct until after executing BEQ
• Actually need to fetch & execute BEQ!• To determine Rs and Rt, and do comparison• Cannot do in zero cycles?
• Can conditional branches ever take zero cycles?– YES!– But I’ll let you figure out how…
Superpipelining
6
Pipeline Trends
• Slowest stages in classic 5-stage pipeline
– Instruction and data memory accesses
– CPUs get faster much more quickly than memory
– Memory accesses continue to be the bottleneck in computer architecture for last 10-15 years
– Instruction and Data Memory replaced with faster caches• A cache is a small, fast on-chip memory
– Keeps a local copy of data from main memory• French: cache means HIDE• Idea: cache memory is hidden from your program (transparent)
– Discuss details later..
7
Superpipelining
• 5-stage pipeline is “classical”– MIPS– Intel 486 has 5-stage pipeline
• First Intel CPU with on-chip cache
• Superpipelining– More pipeline stages
• Basic Idea: faster clock speeds– Do less work per clock cycle– Still complete 1 instruction per cycle
8
MIPS R4000 Superpipeline• 5 stages: I, D, X, M, W
• I stage: read memory• M stage: read memory
– Fast caches are still too slow
• 8 stages: IF, IS, D, X, DF, DS, DC, W– Approx 2x clock speed of 5-stage pipeline
• Split “I” stage in two– IF “I First”– IS “I Second”
• Split “M” stage in three– DF “D First”– DS “D Second”– DC “D CheckTag”
9
MIPS R4000 CheckTag Stage• CheckTag Stage
– Cache is similar to BTB• Contains a TAG specifying the memory address
for the data it is holding
– Access data cache• Must check TAG to verify we got the correct data
– CheckTag takes 1 extra clock cycle!– If CheckTag fails
• pipeline must stall• get data from actual data memory (10-100+ clock cycles)
• MIPS R4000 is very aggressive– Forwarding Units take data out of “DS” stage (can’t take from DF)– If CheckTag fails, it BACKS UP the pipeline 1 cycle (hard to do!)
10
Superpipelining Limits
• Data Hazards– More forwarding
• Eg, X forwarding from DF, DS, DC, and WB stages
– More pipeline stalls• CheckTag failure causes stall
• Load-Use Penalty: 2 cycles– Load instruction: 2 clock cycles (DF, DS)– Use instruction: must wait for load to finish
• Insert 2 instructions between Load and Use• Can use NOP• If no instructions, pipeline will stall
11
Importance of Branch Prediction
• Branch-Delay Penalty– Branch in “D” stage
• Two more instructions are being fetched (IF, IS)• Two branch delay slots!
– Next version of superpipeline…• May have 3 branch delay slots?
• Not a good idea!
• Need BRANCH PREDICTION– MIPS R4000
• Total branch delay: 3 cycles• 1 delay slot (historical), followed by• 2 cycles static branch prediction (predict-untaken)
Superscalar
13
Superscalar
• Basic Idea– Why execute only 1 instruction in a clock cycle?
– How about 2 instructions per cycle?
• Tempting to begin calling it IPC (instructions per cycle)– IPC = 1 / CPI– Compare “IPC” to “MIPS” … both are rates
• Stick to CPI for this course:ExecutionTime = InstructionCount * CPI * ClockPeriod
• Ideal CPI = 0.5 in this case
14
Static Superscalar
• Find 2 instructions every clock cycle!– Pair them up when writing assembly code– Called Static Superscalar
• Compiler does the work ahead of time– Given two instructions, CPU just executes them
• Instructions must be independent• If hard to find independent instruction, use NOP
– Compiler looks for “eligible” pairs• Automagically avoid dependences between instruction pairs• Not much brains in CPU…
15
Static Superscalar:Need to Double All Resources?
• Need to double everything?– Need 2 Instruction Memories?– Need 2 Register Files (4 read ports, 2 write ports) ?– Need 2 ALUs ?– Need 2 Data Memories?
• Too much overhead, not usually done– Just 1 Instruction Memory with 2 x 32bit outputs (8 bytes)– Just 1 ALU– Just 1 Data Memory (need partial ALU to compute address)– Need bigger register file (4 read ports, 2 write ports)
• Practical limits imposed to use fewer resources– Only combine 1 ALU instruction + 1 Memory instruction
• Cannot combine 2 ALU instructions or 2 Memory instructions– Align all instructions in pairs in the instruction memory
• PC%8==0 for ALU instructions, PC%8==4 for memory instructions
16
Static Superscalar
17
Pipeline Diagram for Superscalar• Two instructions per cycle
1a ALU or BR I D X M W
1b LD or ST I D X M W
2a ALU or BR I D X M W
2b LD or ST I D X M W
3a ALU or BR I D X M W
3b LD or ST I D X M W
4a ALU or BR I D X M W
4b LD or ST I D X M W
5a ALU or BR I D X M W
5b LD or ST I D X M W
18
Code Scheduling for Superscalar• Example
Loop: lw $t0, 0($s1)
addi $s1,$s1,-4
addu $t0,$t0,$s2
sw $t0, 4($s1)
bne $s1,$zero, Loop
Regular pipeline:
5 cycles per iteration (assuming no delay slots)
int *p;for( ; p != 0; p-- ) {
*p = *p + CONST;}
19
Code Scheduling for Superscalar
Loop: lw $t0, 0($s1)
addi $s1,$s1,-4
addu $t0,$t0,$s2
sw $t0, 4($s1)
bne $s1,$zero,Loop
LABEL ALU/BR INSTR LD/ST INSTR Cycle
Loop: LW $t0,0($s1) 1
ADDI $s1,$s1,-4 2
ADDU $t0,$t0,$s2 3
BNE $s1,$zero,Loop SW $t0,4($s1) 4
Blank table entries are NOPS.
Load-use delay prevents ADDU being earlier.
Effective CPI is 0.8, not 0.5!
20
Code Scheduling for Superscalar• The compiler can further improve CPI• Loop unrolling
– Example: unroll previous code 4 times (# iterations multiple of 4)– Execute new body ¼ number of iterations
LABEL ALU/BR INSTR LD/ST INSTR Cycle
Loop: LW $t0, 0($s1) 1
LW $t1, -4($s1) 2
ADDU $t0,$t0,$s2 LW $t2, -8($s1) 3
ADDU $t1,$t1,$s2 LW $t3,-12($s1) 4
ADDU $t2,$t2,$s2 SW $t0, 0($s1) 5
ADDU $t3,$t3,$s2 SW $t1, -4($s1) 6
ADDI $s1,$s1,-16 SW $t2, -8($s1) 7
BNE $s1,$zero,Loop SW $t3, 4($s1) 8
21
Code Scheduling for Superscalar• Unroll loop 4 times
– More registers used– Some BNE/ADDI
instructions are gone
• CPI Improved– Before Unrolling: 0.8– After Unrolling 8/14 = 0.57
• InstrCount Improved– Before Unrolling: 20/4 = 5 per iteration– After Unrolling: 14/4 = 3.5 per iteration
• We don’t get this with superpipelining!
• Overall Performance– Pipelined: 5 cycles / iteration– Superscalar before unrolling: 4 cycles / iteration– Superscalar after unrolling: 2 cycles / iteration
• Superpipelined: 2.0x faster than pipelined!• Superscalar unrolled: 2.5x faster than pipelined!
LABEL ALU/BR INSTR LD/ST INSTR Cycle
Loop: LW $t0, 0($s1) 1
LW $t1, -4($s1) 2
ADDU $t0,$t0,$s2 LW $t2, -8($s1) 3
ADDU $t1,$t1,$s2 LW $t3,-12($s1) 4
ADDU $t2,$t2,$s2 SW $t0, 0($s1) 5
ADDU $t3,$t3,$s2 SW $t1, -4($s1) 6
ADDI $s1,$s1,-16 SW $t2, -8($s1) 7
BNE $s1,$zero,Loop SW $t3, 4($s1) 8
22
Importance of Branch Prediction
• Now fetching two instructions every cycle– Given a branch:
• Which two instructions to fetch: Taken or Not-Taken path?
• Misprediction?– Many lost opportunities to execute instructions– Significant performance loss!
• Branch prediction CRUCIAL!
23
Superpipelining vs. Superscalar• Which is better?
– Debate lasted a few years in mid-1990s
• Result: both won!– Can combine superpipelining and superscalar
• Branch prediction is now crucial!– 6 instructions enter pipeline after a branch
• x3 from superpipelining• x2 from superscalar
• Superscalar can be enhanced further– Rely less upon compiler– Hardware finds instructions to pair together
• More hazard detection, etc– Dynamic superscalar (next class!)