3.13. Fallacies and Pitfalls
• Fallacy: Processors with lower CPIs will always be faster
• Fallacy: Processors with faster clock rates will always be faster
– A balance must be found:
• E.g. a sophisticated pipeline: CPI ↓, clock cycle ↑
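Both fallacies follow from the basic performance equation: CPU time = instruction count × CPI ÷ clock rate, so neither CPI nor clock rate alone determines speed. A minimal sketch, using made-up illustrative numbers rather than measurements of any real processor:

```c
#include <assert.h>

/* CPU time = instruction count x CPI / clock rate (Hz).
   The figures used below are illustrative, not real measurements. */
static double cpu_time_s(double insn_count, double cpi, double clock_hz) {
    return insn_count * cpi / clock_hz;
}
```

For 10^9 instructions, a machine with CPI 1.2 at 1 GHz finishes in 1.2 s, while one with CPI 3.0 at 2 GHz takes 1.5 s: the machine with the slower clock wins.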
Fallacies and Pitfalls
• Pitfall: Emphasising improved CPI through a higher issue rate while sacrificing clock rate can decrease performance
– Again, a question of balance
• SuperSPARC –vs– HP PA 7100
– Complex interactions between cycle time and organisation
Fallacies and Pitfalls
• Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement
– Amdahl’s Law!
– Boosting performance of one area may uncover problems in another
Fallacies and Pitfalls
• Pitfall: Sometimes bigger and dumber is better!
– Alpha 21264: sophisticated multilevel tournament branch predictor
– Alpha 21164: simple two-bit predictor
– The 21164 performs better for a transaction-processing application!
• It can handle twice as many local branch predictions
Concluding Remarks
• Lots of open questions!
– Clock speed –vs– CPI
– Power issues
– Exploiting parallelism
• ILP –vs– explicit
Characteristics of Modern (2001) Processors
• Figure 3.61:
– 3–4-way superscalar
– 4–22-stage pipelines
– Branch prediction
– Register renaming (except UltraSPARC)
– 400 MHz – 1.7 GHz
– 7–130 million transistors
4.1. Compiler Techniques for Exposing ILP
• Compilers can improve the performance of simple pipelines
– Reduce data hazards
– Reduce control hazards
Loop Unrolling
• Compiler technique to increase ILP
– Duplicate the loop body
– Decrease the number of iterations
• Example:
– Basic code: 10 cycles per iteration
– Scheduled: 6 cycles
for (int k = 0; k < 1000; k++) {
    x[k] = x[k] + s;
}

Unrolled four times:

for (int k = 0; k < 1000; k += 4) {
    x[k]     = x[k]     + s;
    x[k + 1] = x[k + 1] + s;
    x[k + 2] = x[k + 2] + s;
    x[k + 3] = x[k + 3] + s;
}
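The unrolled version above relies on the trip count (1000) being a multiple of four. A sketch of the general pattern, with a cleanup loop for leftover elements (function name is mine, not from the original):

```c
#include <assert.h>

/* Unroll by four, with a cleanup loop so trip counts that are not a
   multiple of four are still handled. In the example above n = 1000,
   so the cleanup loop never runs. */
void add_scalar_unrolled(double *x, double s, int n) {
    int k = 0;
    for (; k + 3 < n; k += 4) {        /* unrolled body: four elements */
        x[k]     = x[k]     + s;
        x[k + 1] = x[k + 1] + s;
        x[k + 2] = x[k + 2] + s;
        x[k + 3] = x[k + 3] + s;
    }
    for (; k < n; k++)                 /* cleanup: remaining 0-3 elements */
        x[k] = x[k] + s;
}
```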
Loop Unrolling
• Unrolled, unscheduled: 7 cycles per “iteration” (i.e. per element)
• Unrolled and scheduled: 3.5 cycles per element (no stalls!)
Loop Unrolling
• Requires clever compilers
– Analysing data dependences, name dependences and control dependences
• Limitations
– Code size
– Decrease in amortisation of overheads
– “Register pressure”
– Compiler limitations
• Useful for any architecture
Superscalar Performance
• Two-issue MIPS (int + FP)
• 2.4 cycles per “iteration”
– Unrolled five times
4.2. Static Branch Prediction
• Useful:
– where behaviour can be predicted at compile-time
– to assist dynamic prediction
• Architectural support
– Delayed branches
Static Branch Prediction
• Simple: predict taken
– Average misprediction rate of 34% (SPEC)
– Range: 9% – 59%
• Better: predict backward taken, forward not-taken
– But worse for SPEC!
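The backward-taken heuristic exploits the fact that backward branches usually close loops. A minimal sketch of the decision rule (function name and signature are mine):

```c
#include <assert.h>
#include <stdbool.h>

/* "Backward taken, forward not-taken": a branch with a negative
   displacement usually closes a loop, so predict it taken.
   Sketch of the heuristic only, not of any real predictor. */
static bool predict_taken(long branch_displacement) {
    return branch_displacement < 0;
}
```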
Static Branch Prediction
• Advanced compiler analysis can do better
• Profiling is very useful
– FP: 9% ± 4% misprediction
– Int: 15% ± 5%
4.3. Static Multiple Issue: VLIW
• Compiler groups instructions into “packets”, checking for dependences
– Remove dependences
– Flag dependences
• Simplifies hardware
VLIW
• First machines used a wide instruction with multiple operations per instruction
– Hence Very Long Instruction Word (VLIW)
– 64–128 bits
• Alternative: group several instructions into an issue packet
VLIW Architectures
• Multiple functional units
• Compiler selects instructions for each unit to create one long instruction (an issue packet)
• Example: five operations
– Integer/branch, 2 × FP, 2 × memory access
• Need lots of parallelism
– Use loop unrolling, or global scheduling
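The five-slot packet above can be pictured as a fixed record of operation fields; the layout below is hypothetical, not a real VLIW encoding:

```c
#include <assert.h>
#include <stdint.h>

/* One issue packet for the five-slot example: one integer/branch
   operation, two FP operations, two memory accesses. Hypothetical
   encoding; real VLIW formats pack fields more tightly. */
typedef uint32_t op_t;          /* one encoded operation; 0 = NOP slot */

typedef struct {
    op_t int_branch;            /* integer ALU / branch slot */
    op_t fp[2];                 /* two floating-point slots  */
    op_t mem[2];                /* two load/store slots      */
} vliw_packet;
```

Five 32-bit slots make a 160-bit instruction word, in the same spirit as the 64–128-bit words of the early machines; unused slots become NOPs, which is exactly the wasted space discussed below.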
Example
• Loop unrolled seven times!
• 1.29 cycles per result
• 60% of available instruction “slots” filled
for (int k = 0; k < 1000; k++) {
    x[k] = x[k] + s;
}
Summary of Improvements
Technique            Unscheduled   Scheduled
Basic code           10            6
Loop unrolled (×4)   7             3.5
Superscalar (×5)     n/a           2.4
VLIW (×7)            n/a           1.29

All figures are cycles per element.
Drawbacks of Original VLIWs
• Large code size
– Need to use loop unrolling
– Wasted space for unused slots
• Clever encoding techniques, compression
• Lock-step execution– Stalling one unit stalls them all
• Binary code compatibility– Variations on structure required recompilation
4.4. Compiler Support for Exploiting ILP
• We will not cover this section in detail
• Loop unrolling
– Loop-carried dependences
• Software pipelining
– Interleave instructions from different iterations
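Software pipelining can be sketched at the C level on the running example `x[k] = x[k] + s`; this restructuring is mine, shown only to illustrate the interleaving:

```c
#include <assert.h>

/* Software-pipelined sketch of `x[k] = x[k] + s`: in the steady state
   each loop body stores the result of iteration k, forwards iteration
   k+1, and starts iteration k+2 - instructions from three original
   iterations are interleaved. */
void add_scalar_swp(double *x, double s, int n) {
    if (n < 3) {                       /* too short to pipeline */
        for (int k = 0; k < n; k++) x[k] = x[k] + s;
        return;
    }
    double t1 = x[0] + s;              /* prologue: fill the pipeline */
    double t2 = x[1] + s;
    for (int k = 0; k < n - 2; k++) {
        x[k] = t1;                     /* store result of iteration k */
        t1   = t2;                     /* forward iteration k+1       */
        t2   = x[k + 2] + s;           /* start iteration k+2         */
    }
    x[n - 2] = t1;                     /* epilogue: drain the pipeline */
    x[n - 1] = t2;
}
```

Unlike unrolling, this keeps the code size close to the original: only a prologue and epilogue are added.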
4.5. Hardware Support for Extracting More Parallelism
• Techniques like loop-unrolling work well when branch behaviour can be predicted at compile time
• If not, we need more advanced techniques:
– Conditional instructions
– Hardware support for compiler speculation
Conditional or Predicated Instructions
• Instructions have an associated condition
– If the condition is true, execution proceeds normally
– If not, the instruction becomes a no-op
• Removes control hazards
if (a == 0) b = c;
    bnez  %r8, L1
    nop
    mov   %r1, %r2
L1: ...

With a conditional move, the branch disappears:

    cmovz %r8, %r1, %r2
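The same conversion can be seen at the C level: rewriting the branch as a select turns a control dependence into a data dependence, which a compiler can map onto an instruction like cmovz (function names are mine, for illustration):

```c
#include <assert.h>

/* Branchy and branchless forms of `if (a == 0) b = c;`. */
int assign_branchy(int a, int b, int c) {
    if (a == 0) b = c;                 /* control dependence on a */
    return b;
}

int assign_branchless(int a, int b, int c) {
    return (a == 0) ? c : b;           /* data dependence on a */
}
```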
Conditional Instructions
• Control hazards effectively replaced by data hazards
• Can be used for speculation
– Compiler reorders instructions depending on the likely outcome of branches
Limitations on Conditional Instructions
• Annulled instructions still execute
– But may occupy otherwise stalled time
• Most useful when conditions evaluated early
• Limited usefulness for complex conditions
• May be slower than unconditional operations
Conditional Instructions in Practice
Machine             Conditional instructions
MIPS, Alpha, SPARC  Conditional move
HP PA               Any register-register instruction can annul the following instruction
IA-64               Full predication