
EECE476: Computer Architecture

Lecture 22: Zero-cycle Branches (no text)

Superpipelining (no text) vs.

Superscalar (text 6.8)

The University of British Columbia EECE 476 © 2005 Guy Lemieux


Jumps and Unconditional Branches

• BTB: Branch-target buffer
  – Tells us where a branch will jump to

• BPB: Branch-prediction buffer
  – Tells us if branch will be taken

• Consider J and (unconditional) BR
  – Always takes the branch (prediction unnecessary)

ADD $t3, $t1, $t2
J LABEL
…

LABEL: SUBI $t3, $t3, 1

  – This ADD always knows a “J is coming after” (the J is always at PC+4)
  – Target of JMP/BR is known (not from a register, JR)
  – Recipe for zero-cycle branches!


Zero-cycle Branches?

ADD $t3, $t1, $t2
J LABEL
…

LABEL: SUBI $t3, $t3, 1

• Use BTB for non-branch instruction (e.g., ADD)
  – Any instruction immediately before a J or BR

• BTB reserves an entry for ADD (Tag: PC of ADD, Target: “LABEL” of the J)
  – ADD now “looks like” a branch/jump

• When executing ADD
  – BTB says “always fetch from target LABEL”
  – Requires small change to datapath (BTB can also select next PC, not just BPB)

• Do not have to fetch JMP/BR itself!
  – Branch executed in zero cycles!
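
The fetch behaviour described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the real datapath: the BTB is a dictionary keyed by the PC of the instruction that sits just before a J, and the addresses (0x100, 0x104, 0x200) and the helper names `build_btb` and `fetch_trace` are assumptions made for the example.

```python
# Hypothetical placement: ADD at 0x100, J at 0x104, LABEL at 0x200.
program = {
    0x100: "ADD $t3, $t1, $t2",
    0x104: "J LABEL",
    0x200: "SUBI $t3, $t3, 1",   # LABEL
}
labels = {"LABEL": 0x200}

def build_btb(program, labels):
    """Give every instruction that immediately precedes a J a BTB entry.

    Tag = PC of the preceding instruction, Target = address the J jumps to.
    """
    btb = {}
    for pc in program:
        following = program.get(pc + 4, "")
        if following.startswith("J "):          # unconditional jump to a label
            btb[pc] = labels[following.split()[1]]
    return btb

def fetch_trace(btb, start_pc, count):
    """Return the PCs fetched when a BTB hit overrides the default PC+4."""
    trace, pc = [], start_pc
    for _ in range(count):
        trace.append(pc)
        pc = btb.get(pc, pc + 4)                # BTB hit: go straight to target
    return trace

btb = build_btb(program, labels)
trace = fetch_trace(btb, 0x100, 2)
print([hex(pc) for pc in trace])                # the J at 0x104 is never fetched
```

In this model the ADD is followed directly by the SUBI at LABEL, so the jump costs no fetch slot at all.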


Zero-cycle Branch: Limits

• What if target comes from a register?
  – BTB holds useless value, usually wrong

• What if branch is conditional (e.g., BEQ)?
  – Two paths: taken and untaken
  – Do not know which path is correct until after executing BEQ
    • Actually need to fetch & execute BEQ!
    • To read Rs and Rt, and do the comparison
    • Cannot do in zero cycles?

• Can conditional branches ever take zero cycles?
  – YES!
  – But I’ll let you figure out how…

Superpipelining


Pipeline Trends

• Slowest stages in classic 5-stage pipeline
  – Instruction and data memory accesses
  – CPUs get faster much more quickly than memory
  – Memory accesses have been the bottleneck in computer architecture for the last 10-15 years
  – Instruction and Data Memory replaced with faster caches
    • A cache is a small, fast on-chip memory
  – Keeps a local copy of data from main memory
    • French: cache means HIDE
    • Idea: cache memory is hidden from your program (transparent)
  – Discuss details later…


Superpipelining

• 5-stage pipeline is “classical”
  – MIPS
  – Intel 486 has 5-stage pipeline
    • First Intel CPU with on-chip cache

• Superpipelining
  – More pipeline stages

• Basic Idea: faster clock speeds
  – Do less work per clock cycle
  – Still complete 1 instruction per cycle


MIPS R4000 Superpipeline

• 5 stages: I, D, X, M, W
  • I stage: read memory
  • M stage: read memory
  – Fast caches are still too slow

• 8 stages: IF, IS, D, X, DF, DS, DC, W
  – Approx 2x clock speed of 5-stage pipeline

• Split “I” stage in two
  – IF “I First”
  – IS “I Second”

• Split “M” stage in three
  – DF “D First”
  – DS “D Second”
  – DC “D CheckTag”


MIPS R4000 CheckTag Stage

• CheckTag Stage
  – Cache is similar to BTB
    • Contains a TAG specifying the memory address for the data it is holding
  – Access data cache
    • Must check TAG to verify we got the correct data
  – CheckTag takes 1 extra clock cycle!
  – If CheckTag fails
    • pipeline must stall
    • get data from actual data memory (10-100+ clock cycles)

• MIPS R4000 is very aggressive
  – Forwarding Units take data out of “DS” stage (can’t take from DF)
  – If CheckTag fails, it BACKS UP the pipeline 1 cycle (hard to do!)
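
The tag check can be illustrated with a small Python model. It is a sketch under loud assumptions, not the R4000 design: the cache is taken to be direct-mapped with 8 one-word lines, and the names `NUM_LINES`, `LINE_BYTES`, `split_address` and `load` are invented for the example. The point is only the ordering: data is read out first, and the tag comparison that validates it comes afterwards.

```python
NUM_LINES = 8     # assumed cache size (hypothetical)
LINE_BYTES = 4    # assumed: one 32-bit word per cache line

def split_address(addr):
    """Split a byte address into (tag, index) for a direct-mapped cache."""
    word = addr // LINE_BYTES
    return word // NUM_LINES, word % NUM_LINES

def load(cache, memory, addr):
    """Return (data, hit). On a tag mismatch, refill the line from memory."""
    tag, index = split_address(addr)
    stored_tag, data = cache[index]     # data comes out first (DF, DS)
    if stored_tag == tag:               # tag is compared afterwards (DC)
        return data, True
    data = memory[addr]                 # CheckTag failed: go to main memory
    cache[index] = (tag, data)
    return data, False

memory = {0x00: 11, 0x20: 22}           # made-up memory contents
cache = [(None, None)] * NUM_LINES      # every line starts out invalid

print(load(cache, memory, 0x00))        # first access: miss
print(load(cache, memory, 0x00))        # same address again: hit
print(load(cache, memory, 0x20))        # same index, different tag: miss
```

Addresses 0x00 and 0x20 map to the same line here, which is exactly the case where the data read out in DF/DS is wrong and only the tag comparison can detect it.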


Superpipelining Limits

• Data Hazards
  – More forwarding
    • Eg, X forwarding from DF, DS, DC, and WB stages
  – More pipeline stalls
    • CheckTag failure causes stall

• Load-Use Penalty: 2 cycles
  – Load instruction: 2 clock cycles (DF, DS)
  – Use instruction: must wait for load to finish
    • Insert 2 instructions between Load and Use
    • Can use NOP
    • If no instructions, pipeline will stall
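
As a rough illustration of the 2-cycle load-use penalty, the Python sketch below pads a straight-line instruction list with NOPs wherever a use follows its load too closely. The tuple encoding `(op, dest, sources)` and the function name `insert_nops` are assumptions for this example; a real compiler would first try to fill the slots with useful independent instructions.

```python
LOAD_USE_DELAY = 2   # load-use penalty from the slide, in instruction slots

def insert_nops(instrs, delay=LOAD_USE_DELAY):
    """Pad with NOPs so no instruction reads a load result too early.

    instrs is a list of (op, dest, sources) tuples in program order.
    """
    out = []
    for op, dest, srcs in instrs:
        # Look back over the last `delay` slots for a load we depend on.
        for back in range(1, delay + 1):
            if len(out) >= back:
                prev_op, prev_dest, _ = out[-back]
                if prev_op == "LW" and prev_dest in srcs:
                    # A load `back` slots ago needs delay - back + 1 more slots.
                    out.extend([("NOP", None, ())] * (delay - back + 1))
                    break
        out.append((op, dest, srcs))
    return out

code = [
    ("LW",   "$t0", ("$s1",)),
    ("ADDU", "$t0", ("$t0", "$s2")),   # uses the loaded value immediately
]
print([op for op, _, _ in insert_nops(code)])
```

With an independent instruction already sitting between the load and its use, the same function inserts only one NOP.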


Importance of Branch Prediction

• Branch-Delay Penalty
  – Branch in “D” stage
    • Two more instructions are being fetched (IF, IS)
    • Two branch delay slots!
  – Next version of superpipeline…
    • May have 3 branch delay slots?
    • Not a good idea!

• Need BRANCH PREDICTION
  – MIPS R4000
    • Total branch delay: 3 cycles
    • 1 delay slot (historical), followed by
    • 2 cycles static branch prediction (predict-untaken)

Superscalar


Superscalar

• Basic Idea
  – Why execute only 1 instruction in a clock cycle?
  – How about 2 instructions per cycle?

• Tempting to begin calling it IPC (instructions per cycle)
  – IPC = 1 / CPI
  – Compare “IPC” to “MIPS” … both are rates

• Stick to CPI for this course:
  ExecutionTime = InstructionCount * CPI * ClockPeriod

• Ideal CPI = 0.5 in this case
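
A short Python check of the formula, with made-up numbers (1000 instructions, a 1 ns clock) chosen only to make the arithmetic easy to follow. The function names `execution_time` and `ipc` are likewise assumptions for this sketch.

```python
def execution_time(instr_count, cpi, clock_period):
    """ExecutionTime = InstructionCount * CPI * ClockPeriod."""
    return instr_count * cpi * clock_period

def ipc(cpi):
    """IPC is simply the reciprocal of CPI."""
    return 1.0 / cpi

# Hypothetical workload: 1000 instructions, 1 ns clock period.
scalar         = execution_time(1000, 1.0, 1.0)   # classic pipeline, CPI = 1
superscalar    = execution_time(1000, 0.5, 1.0)   # ideal 2-way superscalar
superpipelined = execution_time(1000, 1.0, 0.5)   # CPI = 1, clock twice as fast

print(scalar, superscalar, superpipelined)
print(ipc(0.5))
```

Under these ideal assumptions, halving the CPI (superscalar) and halving the clock period (superpipelining) give the same execution time; they simply attack different terms of the product.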


Static Superscalar

• Find 2 instructions every clock cycle!
  – Pair them up when writing assembly code
  – Called Static Superscalar

• Compiler does the work ahead of time
  – Given two instructions, CPU just executes them
    • Instructions must be independent
    • If hard to find independent instruction, use NOP
  – Compiler looks for “eligible” pairs
    • Automagically avoid dependences between instruction pairs
    • Not much brains in CPU…


Static Superscalar: Need to Double All Resources?

• Need to double everything?
  – Need 2 Instruction Memories?
  – Need 2 Register Files (4 read ports, 2 write ports)?
  – Need 2 ALUs?
  – Need 2 Data Memories?

• Too much overhead, not usually done
  – Just 1 Instruction Memory with 2 x 32-bit outputs (8 bytes)
  – Just 1 ALU
  – Just 1 Data Memory (need partial ALU to compute address)
  – Need bigger register file (4 read ports, 2 write ports)

• Practical limits imposed to use fewer resources
  – Only combine 1 ALU instruction + 1 Memory instruction
    • Cannot combine 2 ALU instructions or 2 Memory instructions
  – Align all instructions in pairs in the instruction memory
    • PC%8==0 for ALU instructions, PC%8==4 for memory instructions
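
The pairing restriction amounts to a simple structural check, sketched below in Python. The opcode sets `ALU_BR_OPS` and `MEM_OPS` and the function `is_legal_pair` are assumptions for illustration, and the check deliberately covers only slot type and alignment; dependences between instructions are the compiler's job and are left out.

```python
# Assumed opcode classes (illustrative, not a complete MIPS list).
ALU_BR_OPS = {"ADD", "ADDI", "ADDU", "SUB", "SUBI", "BEQ", "BNE", "NOP"}
MEM_OPS = {"LW", "SW", "NOP"}

def is_legal_pair(first_op, second_op, pc):
    """Check the structural rule for one issue pair.

    first_op sits at `pc` (the ALU/BR slot), second_op at `pc + 4`
    (the LD/ST slot), and the pair must start on an 8-byte boundary.
    """
    if pc % 8 != 0:
        return False
    return first_op in ALU_BR_OPS and second_op in MEM_OPS

print(is_legal_pair("ADDU", "LW", 0x40))    # ALU + memory, aligned
print(is_legal_pair("ADDU", "ADDI", 0x40))  # two ALU instructions
print(is_legal_pair("ADDU", "LW", 0x44))    # pair not 8-byte aligned
```

Because a NOP is accepted in either slot, a pair with only one useful instruction is still legal; it just wastes half of the issue bandwidth.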


Static Superscalar


Pipeline Diagram for Superscalar

• Two instructions per cycle

1a  ALU or BR  I D X M W
1b  LD or ST   I D X M W

Code Scheduling for Superscalar

• Example

Loop: lw $t0, 0($s1)

addi $s1,$s1,-4

addu $t0,$t0,$s2

sw $t0, 4($s1)

bne $s1,$zero, Loop

Regular pipeline: 5 cycles per iteration (assuming no delay slots)

int *p;
for ( ; p != 0; p-- ) {
    *p = *p + CONST;
}


Code Scheduling for Superscalar


LABEL   ALU/BR INSTR          LD/ST INSTR       Cycle
Loop:   NOP                   LW $t0,0($s1)     1
        ADDI $s1,$s1,-4       NOP               2
        ADDU $t0,$t0,$s2      NOP               3
        BNE $s1,$zero,Loop    SW $t0,4($s1)     4

Empty issue slots are filled with NOPs.

Load-use delay prevents ADDU from issuing any earlier.

Effective CPI is 4 cycles / 5 instructions = 0.8, not 0.5!
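
The CPI figure can be recomputed directly from the schedule. In the Python sketch below each row is one cycle, given as an (ALU/BR slot, LD/ST slot) pair with `None` standing for a NOP; the data layout and the function name `effective_cpi` are assumptions for illustration.

```python
# One row per cycle: (ALU/BR slot, LD/ST slot); None marks a NOP.
schedule = [
    (None,                 "LW $t0,0($s1)"),
    ("ADDI $s1,$s1,-4",    None),
    ("ADDU $t0,$t0,$s2",   None),
    ("BNE $s1,$zero,Loop", "SW $t0,4($s1)"),
]

def effective_cpi(schedule):
    """Cycles divided by useful (non-NOP) instructions."""
    cycles = len(schedule)
    useful = sum(slot is not None for row in schedule for slot in row)
    return cycles / useful

def slot_utilization(schedule):
    """Fraction of the two issue slots per cycle doing useful work."""
    useful = sum(slot is not None for row in schedule for slot in row)
    return useful / (2 * len(schedule))

print(effective_cpi(schedule))      # 4 cycles / 5 instructions
print(slot_utilization(schedule))   # 5 of 8 issue slots used
```

Three of the eight issue slots hold NOPs, which is why the loop falls short of the ideal CPI of 0.5.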


Code Scheduling for Superscalar

• The compiler can further improve CPI

• Loop unrolling
  – Example: unroll previous code 4 times (assumes the iteration count is a multiple of 4)
  – Execute the new loop body ¼ as many times

LABEL   ALU/BR INSTR          LD/ST INSTR        Cycle
Loop:   NOP                   LW $t0, 0($s1)     1
        NOP                   LW $t1, -4($s1)    2
        ADDU $t0,$t0,$s2      LW $t2, -8($s1)    3
        ADDU $t1,$t1,$s2      LW $t3,-12($s1)    4
        ADDU $t2,$t2,$s2      SW $t0, 0($s1)     5
        ADDU $t3,$t3,$s2      SW $t1, -4($s1)    6
        ADDI $s1,$s1,-16      SW $t2, -8($s1)    7
        BNE $s1,$zero,Loop    SW $t3, 4($s1)     8


Code Scheduling for Superscalar

• Unroll loop 4 times
  – More registers used
  – Some BNE/ADDI instructions are gone

• CPI Improved
  – Before Unrolling: 0.8
  – After Unrolling: 8/14 = 0.57

• InstrCount Improved
  – Before Unrolling: 20/4 = 5 per iteration
  – After Unrolling: 14/4 = 3.5 per iteration

• We don’t get this with superpipelining!

• Overall Performance
  – Pipelined: 5 cycles / iteration
  – Superscalar before unrolling: 4 cycles / iteration
  – Superscalar after unrolling: 8 cycles / 4 iterations = 2 cycles / iteration

• Superpipelined: 2.0x faster than pipelined!
• Superscalar unrolled: 2.5x faster than pipelined!
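
The speedup figures follow from simple arithmetic, reproduced in the Python sketch below. Times are expressed in units of the 5-stage clock period, and the superpipelined case assumes a clock exactly twice as fast with no extra stalls, which is an idealization. The helper name `time_per_iteration` is made up for this example.

```python
def time_per_iteration(cycles, iterations=1, clock_period=1.0):
    """Time for one original loop iteration, in 5-stage clock periods."""
    return cycles / iterations * clock_period

pipelined   = time_per_iteration(5)                  # 5 cycles per iteration
superscalar = time_per_iteration(4)                  # 4-cycle schedule
unrolled    = time_per_iteration(8, iterations=4)    # 8 cycles cover 4 iterations
# Assumption: 2x clock speed and no additional stalls from the deeper pipeline.
superpipelined = time_per_iteration(5, clock_period=0.5)

print(pipelined / superpipelined)   # speedup of superpipelining
print(pipelined / unrolled)         # speedup of unrolled superscalar
print(round(8 / 14, 2))             # CPI after unrolling
print(14 / 4)                       # instructions per original iteration
```

Part of the superscalar gain comes from the lower instruction count (3.5 instead of 5 per iteration), which is the improvement that superpipelining alone cannot deliver.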




Importance of Branch Prediction

• Now fetching two instructions every cycle
  – Given a branch:
    • Which two instructions to fetch: Taken or Not-Taken path?

• Misprediction?
  – Many lost opportunities to execute instructions
  – Significant performance loss!

• Branch prediction CRUCIAL!


Superpipelining vs. Superscalar

• Which is better?
  – Debate lasted a few years in mid-1990s

• Result: both won!
  – Can combine superpipelining and superscalar

• Branch prediction is now crucial!
  – 6 instructions enter pipeline after a branch
    • x3 from superpipelining
    • x2 from superscalar

• Superscalar can be enhanced further
  – Rely less upon compiler
  – Hardware finds instructions to pair together
    • More hazard detection, etc
  – Dynamic superscalar (next class!)
