3.13. Fallacies and Pitfalls
• Fallacy: Processors with lower CPIs will always be faster
• Fallacy: Processors with faster clock rates will always be faster
– A balance must be found:
• E.g. a sophisticated pipeline: CPI ↓, clock cycle ↑
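Both fallacies follow from the basic performance equation: CPU time = instruction count × CPI ÷ clock rate, so neither CPI nor clock rate alone determines speed. A minimal sketch, using made-up illustrative numbers rather than measurements of any real processor:

```c
#include <assert.h>

/* CPU time = instruction count x CPI / clock rate (Hz).
   The figures used below are illustrative, not real measurements. */
static double cpu_time_s(double insn_count, double cpi, double clock_hz) {
    return insn_count * cpi / clock_hz;
}
```

For 10^9 instructions, a machine with CPI 1.2 at 1 GHz finishes in 1.2 s, while one with CPI 3.0 at 2 GHz takes 1.5 s: the machine with the slower clock wins.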
Fallacies and Pitfalls
• Pitfall: Emphasising improved CPI through a higher issue rate while sacrificing clock rate can decrease performance
– Again, a question of balance
• SuperSPARC –vs– HP PA 7100
– Complex interactions between cycle time and organisation
Fallacies and Pitfalls
• Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement
– Amdahl’s Law!
– Boosting performance of one area may uncover problems in another
Fallacies and Pitfalls
• Pitfall: Sometimes bigger and dumber is better!
– Alpha 21264: sophisticated multilevel tournament branch predictor
– Alpha 21164: simple two-bit predictor
– The 21164 performs better for a transaction-processing application!
• It can handle twice as many local branch predictions
Concluding Remarks
• Lots of open questions!
– Clock speed –vs– CPI
– Power issues
– Exploiting parallelism
• ILP –vs– explicit
Characteristics of Modern (2001) Processors
• Figure 3.61:
– 3–4-way superscalar
– 4–22-stage pipelines
– Branch prediction
– Register renaming (except UltraSPARC)
– 400 MHz – 1.7 GHz
– 7–130 million transistors
4.1. Compiler Techniques for Exposing ILP
• Compilers can improve the performance of simple pipelines
– Reduce data hazards
– Reduce control hazards
Loop Unrolling
• Compiler technique to increase ILP
– Duplicate the loop body
– Decrease the number of iterations
• Example:
– Basic code: 10 cycles per iteration
– Scheduled: 6 cycles
for (int k = 0; k < 1000; k++) {
    x[k] = x[k] + s;
}

Unrolled four times:

for (int k = 0; k < 1000; k += 4) {
    x[k]     = x[k]     + s;
    x[k + 1] = x[k + 1] + s;
    x[k + 2] = x[k + 2] + s;
    x[k + 3] = x[k + 3] + s;
}
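The unrolled version above relies on the trip count (1000) being a multiple of four. A sketch of the general pattern, with a cleanup loop for leftover elements (function name is mine, not from the original):

```c
#include <assert.h>

/* Unroll by four, with a cleanup loop so trip counts that are not a
   multiple of four are still handled. In the example above n = 1000,
   so the cleanup loop never runs. */
void add_scalar_unrolled(double *x, double s, int n) {
    int k = 0;
    for (; k + 3 < n; k += 4) {        /* unrolled body: four elements */
        x[k]     = x[k]     + s;
        x[k + 1] = x[k + 1] + s;
        x[k + 2] = x[k + 2] + s;
        x[k + 3] = x[k + 3] + s;
    }
    for (; k < n; k++)                 /* cleanup: remaining 0-3 elements */
        x[k] = x[k] + s;
}
```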
Loop Unrolling
• Unrolled, unscheduled: 7 cycles per “iteration” (i.e. per element)
• Unrolled and scheduled: 3.5 cycles per element (no stalls!)
Loop Unrolling
• Requires clever compilers
– Analysing data dependences, name dependences and control dependences
• Limitations
– Code size
– Decrease in amortisation of overheads
– “Register pressure”
– Compiler limitations
• Useful for any architecture
Superscalar Performance
• Two-issue MIPS (int + FP)
• 2.4 cycles per “iteration”
– Unrolled five times
4.2. Static Branch Prediction
• Useful:
– where behaviour can be predicted at compile-time
– to assist dynamic prediction
• Architectural support
– Delayed branches
Static Branch Prediction
• Simple: predict taken
– Average misprediction rate of 34% (SPEC)
– Range: 9% – 59%
• Better: predict backward taken, forward not-taken
– But worse for SPEC!
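The backward-taken heuristic exploits the fact that backward branches usually close loops. A minimal sketch of the decision rule (function name and signature are mine):

```c
#include <assert.h>
#include <stdbool.h>

/* "Backward taken, forward not-taken": a branch with a negative
   displacement usually closes a loop, so predict it taken.
   Sketch of the heuristic only, not of any real predictor. */
static bool predict_taken(long branch_displacement) {
    return branch_displacement < 0;
}
```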
Static Branch Prediction
• Advanced compiler analysis can do better
• Profiling is very useful
– FP: 9% ± 4% misprediction
– Int: 15% ± 5%
4.3. Static Multiple Issue: VLIW
• Compiler groups instructions into “packets”, checking for dependences
– Remove dependences
– Flag dependences
• Simplifies hardware
VLIW
• First machines used a wide instruction with multiple operations per instruction
– Hence Very Long Instruction Word (VLIW)
– 64–128 bits
• Alternative: group several instructions into an issue packet
VLIW Architectures
• Multiple functional units
• Compiler selects instructions for each unit to create one long instruction (an issue packet)
• Example: five operations
– Integer/branch, 2 × FP, 2 × memory access
• Need lots of parallelism
– Use loop unrolling, or global scheduling
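The five-slot packet above can be pictured as a fixed record of operation fields; the layout below is hypothetical, not a real VLIW encoding:

```c
#include <assert.h>
#include <stdint.h>

/* One issue packet for the five-slot example: one integer/branch
   operation, two FP operations, two memory accesses. Hypothetical
   encoding; real VLIW formats pack fields more tightly. */
typedef uint32_t op_t;          /* one encoded operation; 0 = NOP slot */

typedef struct {
    op_t int_branch;            /* integer ALU / branch slot */
    op_t fp[2];                 /* two floating-point slots  */
    op_t mem[2];                /* two load/store slots      */
} vliw_packet;
```

Five 32-bit slots make a 160-bit instruction word, in the same spirit as the 64–128-bit words of the early machines; unused slots become NOPs, which is exactly the wasted space discussed below.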
Example
• Loop unrolled seven times!
• 1.29 cycles per result
• 60% of available instruction “slots” filled
for (int k = 0; k < 1000; k++) {
    x[k] = x[k] + s;
}
Summary of Improvements
Technique            Unscheduled   Scheduled
Basic code           10            6
Loop unrolled (×4)   7             3.5
Superscalar (×5)     n/a           2.4
VLIW (×7)            n/a           1.29

All figures are cycles per element.
Drawbacks of Original VLIWs
• Large code size
– Need to use loop unrolling
– Wasted space for unused slots
• Clever encoding techniques, compression
• Lock-step execution– Stalling one unit stalls them all
• Binary code compatibility– Variations on structure required recompilation
4.4. Compiler Support for Exploiting ILP
• We will not cover this section in detail
• Loop unrolling
– Loop-carried dependences
• Software pipelining
– Interleave instructions from different iterations
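Software pipelining can be sketched at the C level on the running example `x[k] = x[k] + s`; this restructuring is mine, shown only to illustrate the interleaving:

```c
#include <assert.h>

/* Software-pipelined sketch of `x[k] = x[k] + s`: in the steady state
   each loop body stores the result of iteration k, forwards iteration
   k+1, and starts iteration k+2 - instructions from three original
   iterations are interleaved. */
void add_scalar_swp(double *x, double s, int n) {
    if (n < 3) {                       /* too short to pipeline */
        for (int k = 0; k < n; k++) x[k] = x[k] + s;
        return;
    }
    double t1 = x[0] + s;              /* prologue: fill the pipeline */
    double t2 = x[1] + s;
    for (int k = 0; k < n - 2; k++) {
        x[k] = t1;                     /* store result of iteration k */
        t1   = t2;                     /* forward iteration k+1       */
        t2   = x[k + 2] + s;           /* start iteration k+2         */
    }
    x[n - 2] = t1;                     /* epilogue: drain the pipeline */
    x[n - 1] = t2;
}
```

Unlike unrolling, this keeps the code size close to the original: only a prologue and epilogue are added.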
4.5. Hardware Support for Extracting More Parallelism
• Techniques like loop-unrolling work well when branch behaviour can be predicted at compile time
• If not, we need more advanced techniques:
– Conditional instructions
– Hardware support for compiler speculation
Conditional or Predicated Instructions
• Instructions have an associated condition
– If the condition is true, execution proceeds normally
– If not, the instruction becomes a no-op
• Removes control hazards
if (a == 0) b = c;
    bnez  %r8, L1
    nop
    mov   %r1, %r2
L1: ...

With a conditional move, the branch disappears:

    cmovz %r8, %r1, %r2
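The same conversion can be seen at the C level: rewriting the branch as a select turns a control dependence into a data dependence, which a compiler can map onto an instruction like cmovz (function names are mine, for illustration):

```c
#include <assert.h>

/* Branchy and branchless forms of `if (a == 0) b = c;`. */
int assign_branchy(int a, int b, int c) {
    if (a == 0) b = c;                 /* control dependence on a */
    return b;
}

int assign_branchless(int a, int b, int c) {
    return (a == 0) ? c : b;           /* data dependence on a */
}
```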
Conditional Instructions
• Control hazards effectively replaced by data hazards
• Can be used for speculation
– Compiler reorders instructions depending on the likely outcome of branches
Limitations on Conditional Instructions
• Annulled instructions still execute
– But may occupy otherwise stalled time
• Most useful when conditions evaluated early
• Limited usefulness for complex conditions
• May be slower than unconditional operations
Conditional Instructions in Practice
Machine             Conditional instructions
MIPS, Alpha, SPARC  Conditional move
HP PA               Any register-register instruction can annul the following instruction
IA-64               Full predication