Hardware Support for Compiler Speculation
• Compiler needs to move instructions before branch, possibly before condition
• Requirements:
  – Instructions that can be moved without disrupting data flow
  – Exceptions that can be ignored until outcome is known
  – Ability to speculatively access memory with potential address conflicts
Exception Support
• Four methods:
  – Hardware and OS cooperate to ignore exceptions for speculative instructions
  – Speculative instructions never raise exceptions; explicit checks must be made
  – Poison bits used to mark registers with invalid results; use causes exception
  – Speculative results are buffered until it is certain the instruction is no longer speculative
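The poison-bit method can be sketched in a few lines of C. This is only an illustrative model, not how any particular processor implements it: the `SpecReg` type and the functions `spec_load`, `spec_add` and `commit` are hypothetical names, and a faulting access is modelled simply as a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical register model: a value plus a poison bit. */
typedef struct {
    long value;
    bool poison;
} SpecReg;

/* Speculative load: a faulting access (modelled here as NULL) poisons
   the destination register instead of raising an exception. */
SpecReg spec_load(const long *addr) {
    SpecReg r = {0, false};
    if (addr == NULL) {
        r.poison = true;
    } else {
        r.value = *addr;
    }
    return r;
}

/* Speculative add: poison propagates from either source register. */
SpecReg spec_add(SpecReg a, SpecReg b) {
    SpecReg r = {a.value + b.value, a.poison || b.poison};
    return r;
}

/* Non-speculative use: returns false (standing in for the deferred
   exception) if the source is poisoned; otherwise writes *out. */
bool commit(SpecReg r, long *out) {
    assert(out != NULL);
    if (r.poison) {
        return false;
    }
    *out = r.value;
    return true;
}
```

In this sketch a fault on a speculative path costs nothing unless a non-speculative instruction later uses the poisoned result.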
Exception Handling
• Nonterminating exceptions can be handled normally (e.g. page fault)
  – May cause serious performance loss
Memory Reference Speculation
• Moving loads across stores is only safe if the addresses do not conflict
• Special instructions check for address conflicts
4.6. Crosscutting Issues: Hardware vs. Software Speculation
• A number of trade-offs and limitations
  – Disambiguating memory references is hard for a compiler
  – Hardware branch prediction is usually better
  – Precise exceptions easier in hardware
  – Hardware does not require “housekeeping” code
  – Compilers can “look” further
  – Hardware techniques are more portable
Hardware/Software Speculation
• Major disadvantage of hardware: complexity!
• Some architectures combine hardware and software approaches
4.7. Putting It All Together: IA-64 and Itanium
• IA-64 – RISC-style
• Register-register
• Emphasis on software-based optimisations
• Features:
  – 128 × 65-bit integer registers
  – 128 × 82-bit FP registers
  – 64 predicate registers; 8 branch registers
Registers
• Integer registers
  – Use windowing mechanism
    • 0–31 always visible
    • Remainder arranged in overlapping windows
      – Local and out areas (variable size)
      – Hardware for over-/underflow
• Int and FP registers support register rotation
  – Supports software pipelining
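Register rotation can be pictured as renaming through a moving base. The following is a simplified sketch under stated assumptions: registers 32 to 127 are treated as one fixed rotating region, and `physical_reg`, `rotate` and the base `rrb` are illustrative names for this model rather than a description of the exact hardware.

```c
#include <assert.h>

/* Assumption for this sketch: r32..r127 form the rotating region. */
enum { FIRST_ROTATING = 32, NUM_ROTATING = 96 };

/* Map a logical register number to a physical one, given the current
   rotating base rrb (0 <= rrb < NUM_ROTATING). */
int physical_reg(int logical, int rrb) {
    assert(logical >= 0 && logical < FIRST_ROTATING + NUM_ROTATING);
    assert(rrb >= 0 && rrb < NUM_ROTATING);
    if (logical < FIRST_ROTATING) {
        return logical;  /* static registers never rotate */
    }
    return FIRST_ROTATING + (logical - FIRST_ROTATING + rrb) % NUM_ROTATING;
}

/* One rotation step: the base decreases by one (modulo the region size),
   so a value written as register X is visible as register X+1 afterwards. */
int rotate(int rrb) {
    return (rrb + NUM_ROTATING - 1) % NUM_ROTATING;
}
```

In a software-pipelined loop, each iteration would rotate once, so successive iterations use fresh physical registers without the compiler unrolling the loop and renaming by hand.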
Instruction Format and VLIW
• Compiler schedules parallel instructions; flags dependences
• Instruction group
  – Sequence of (register) independent instructions
  – Compiler marks boundaries between groups (stop)
• Bundle
  – 128 bits: 5-bit template + 3 × 41-bit instructions
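The bundle arithmetic (5 + 3 × 41 = 128 bits) can be checked with a small packing and unpacking sketch. The bit layout used here, with the template in the lowest five bits followed by the three instruction slots, is stated as an assumption of the sketch, and `Bundle`, `Decoded`, `pack_bundle` and `unpack_bundle` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)  /* 41 one-bits */

typedef struct { uint64_t lo, hi; } Bundle;  /* lo = bits 0..63, hi = bits 64..127 */
typedef struct { unsigned template_bits; uint64_t slot[3]; } Decoded;

/* Assumed layout: template in bits 0..4, slot 0 in bits 5..45,
   slot 1 in bits 46..86, slot 2 in bits 87..127. */
Bundle pack_bundle(unsigned template_bits, uint64_t s0, uint64_t s1, uint64_t s2) {
    assert(template_bits < 32);
    assert(s0 <= SLOT_MASK && s1 <= SLOT_MASK && s2 <= SLOT_MASK);
    Bundle b;
    b.lo = (uint64_t)template_bits | (s0 << 5) | (s1 << 46);  /* low 18 bits of slot 1 */
    b.hi = (s1 >> 18) | (s2 << 23);                           /* high 23 bits of slot 1 */
    return b;
}

Decoded unpack_bundle(Bundle b) {
    Decoded d;
    d.template_bits = (unsigned)(b.lo & 0x1F);
    d.slot[0] = (b.lo >> 5) & SLOT_MASK;
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* slot 1 straddles both words */
    d.slot[2] = b.hi >> 23;
    return d;
}
```

The middle slot straddles the two 64-bit halves, which is why it needs bits from both words.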
Instruction Bundle
• Template specifies stops and execution unit
  – I-unit (int + special: multimedia, etc.)
  – M-unit (int + memory access)
  – F-unit (FP)
  – B-unit (branches)
  – L+X (extended instructions)
Example
• Unrolled seven times
  – Optimised for size:
    • 9 bundles; 15% nops
    • 21 cycles (3 per calculation)
  – Optimised for performance:
    • 11 bundles; 30% nops
    • 12 cycles (1.7 per calculation)
for (int k = 0; k < 1000; k++) { x[k] = x[k] + s; }
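At source level, unrolling this loop seven times could look like the sketch below. It shows only the transformation, not the IA-64 schedule; the function name `add_scalar_unrolled7` and the use of `double` elements are assumptions. Since 1000 is not a multiple of seven (1000 = 142 × 7 + 6), a cleanup loop handles the leftover elements. The per-calculation figures above follow from dividing by the unroll factor: 21 / 7 = 3 cycles and 12 / 7 ≈ 1.7 cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Computes x[k] = x[k] + s for k in [0, n), with the body unrolled
   seven times. Illustrative only; the name is hypothetical. */
void add_scalar_unrolled7(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    size_t k = 0;
    /* Main loop: seven independent updates per iteration. */
    for (; k + 7 <= n; k += 7) {
        x[k]     += s;
        x[k + 1] += s;
        x[k + 2] += s;
        x[k + 3] += s;
        x[k + 4] += s;
        x[k + 5] += s;
        x[k + 6] += s;
    }
    /* Cleanup loop: the n % 7 leftover elements (6 of them for n = 1000). */
    for (; k < n; k++) {
        x[k] += s;
    }
}
```

The seven updates in the main loop touch different elements, so a compiler is free to schedule them in parallel across bundles.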
Instructions
• 41 bits long
  – 4-bit opcode (+ template bits)
  – 6-bit predicate register specifier
• Predication
  – Almost all instructions can be predicated
    • Branch is jump with predicate check!
  – Complex comparisons set two predicate registers
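Predication replaces a branch with guarded instructions. The sketch below models this in C: a compare produces two complementary predicates, and both arms of an if/else execute as predicated moves of which only one takes effect. `PredPair`, `cmp_lt`, `pred_mov` and `select_if_converted` are hypothetical names, and real predicated instructions are of course single machine operations rather than function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two complementary predicates, as written by one compare. */
typedef struct { bool p_true, p_false; } PredPair;

/* Compare: sets one predicate to (a < b) and the other to its complement. */
PredPair cmp_lt(int a, int b) {
    PredPair p = { a < b, !(a < b) };
    return p;
}

/* Predicated move: writes *dst only if pred is true, otherwise does nothing. */
void pred_mov(bool pred, int *dst, int src) {
    assert(dst != NULL);
    if (pred) {
        *dst = src;
    }
}

/* If-converted form of: if (a < b) r = x; else r = y; */
int select_if_converted(int a, int b, int x, int y) {
    int r = 0;
    PredPair p = cmp_lt(a, b);
    pred_mov(p.p_true, &r, x);   /* "then" arm */
    pred_mov(p.p_false, &r, y);  /* "else" arm */
    return r;
}
```

Both moves are always issued, so there is no branch to mispredict; the cost is that one of them is wasted work.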
Speculation
• Exceptions can be deferred
  – Uses poison bits (65-bit registers)
  – Nonspeculative and chk instructions raise exception
• Speculative loads
  – Called advanced load (ld.a)
  – Stores check addresses
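The interplay between an advanced load and later stores can be modelled with a small address table. This is a minimal sketch, assuming a fixed-size table indexed by a tag; the `Alat` type and the functions `advanced_load`, `checked_store` and `check_load` are hypothetical names for the model and do not reproduce the hardware's actual organisation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ALAT_SIZE = 8 };  /* table size chosen arbitrarily for the sketch */

typedef struct {
    const long *addr[ALAT_SIZE];
    bool valid[ALAT_SIZE];
} Alat;

/* Advanced load: load early (above a store) and record the address. */
long advanced_load(Alat *t, int tag, const long *addr) {
    assert(t != NULL && addr != NULL && tag >= 0 && tag < ALAT_SIZE);
    t->addr[tag] = addr;
    t->valid[tag] = true;
    return *addr;
}

/* Store: perform the write, then invalidate any entry with the same address. */
void checked_store(Alat *t, long *addr, long value) {
    assert(t != NULL && addr != NULL);
    *addr = value;
    for (int i = 0; i < ALAT_SIZE; i++) {
        if (t->valid[i] && t->addr[i] == addr) {
            t->valid[i] = false;
        }
    }
}

/* Check at the load's original position: keep the speculative value if
   the entry survived, otherwise reload from memory. */
long check_load(Alat *t, int tag, const long *addr, long speculative) {
    assert(t != NULL && addr != NULL && tag >= 0 && tag < ALAT_SIZE);
    if (t->valid[tag] && t->addr[tag] == addr) {
        return speculative;
    }
    return *addr;
}
```

When no store conflicts, the check is cheap and the load latency has been hidden; when a store does conflict, the only cost is the reload.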
Itanium
• First implementation of IA-64
• Issues up to six instructions per cycle (two bundles)
• Nine functional units
  – 2 × I, 2 × M, 3 × B, 2 × F
• 10-stage pipeline
• Multilevel dynamic branch predictor
Itanium
• Complex hardware with many features of dynamically scheduled pipelines!
  – Branch prediction
  – Register renaming
  – Scoreboarding
  – Deep pipeline
  – etc.
Itanium: Performance
• SPECint not too impressive
  – 85% of Alpha 21264 (older, more power-efficient processor!)
• FP better
  – Faster, even with slower clock!
  – But skewed by one benchmark for Pentium
  – Alpha compilers need improvement
4.8. Another View: ILP in Embedded Processors
• Trimedia (see chapter 2)
  – “Classic” VLIW
  – Hardware decompression of code
• Crusoe
  – Software translation of 80x86 to VLIW
  – Low power
Trimedia TM32 Architecture
• VLIW
  – Instruction specifies five operations
  – Static scheduling
  – No hardware hazard detection
  – 23 functional units (11 types)
Transmeta Crusoe
• Low power design
• Emulates 80x86
• VLIW
  – 64-bit (2 op) and 128-bit (4 op) instructions
  – Five types of operations:
    • ALU (int, register-register)
    • Compute (int ALU, FP, multimedia)
    • Memory
    • Branch
    • Immediate
Crusoe
• Simple, in-order pipeline
  – Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)
  – FP: 10-stage (5 EX stages)
Crusoe
• Software interpretation of 80x86 code:
  – Basic blocks cached
  – Exception handling complicated
• Crusoe has good support for speculative reordering
• Memory writes buffered and committed only when safe
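Buffering writes until they are safe can be sketched as a small store buffer with commit and rollback. This is an illustrative model only: `StoreBuffer`, `buffered_store`, `commit_stores` and `rollback_stores` are hypothetical names, the capacity is arbitrary, and forwarding of buffered values to later loads is left out.

```c
#include <assert.h>
#include <stddef.h>

enum { BUF_CAP = 16 };  /* capacity chosen arbitrarily for the sketch */

typedef struct {
    long *addr[BUF_CAP];
    long value[BUF_CAP];
    size_t count;
} StoreBuffer;

/* Record a speculative write without touching memory.
   Returns 0 on success, -1 if the buffer is full. */
int buffered_store(StoreBuffer *b, long *addr, long value) {
    assert(b != NULL && addr != NULL);
    if (b->count == BUF_CAP) {
        return -1;
    }
    b->addr[b->count] = addr;
    b->value[b->count] = value;
    b->count++;
    return 0;
}

/* Speculation succeeded: apply the writes to memory in program order. */
void commit_stores(StoreBuffer *b) {
    assert(b != NULL);
    for (size_t i = 0; i < b->count; i++) {
        *b->addr[i] = b->value[i];
    }
    b->count = 0;
}

/* Speculation failed (e.g. an exception): discard all pending writes. */
void rollback_stores(StoreBuffer *b) {
    assert(b != NULL);
    b->count = 0;
}
```

Because memory is untouched until commit, a failed speculation can be undone simply by emptying the buffer.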
4.9. Fallacies and Pitfalls
• Fallacy: There is a simple approach to multiple-issue (high performance with low complexity)
  – Big gap between peak and sustained performance for multiple-issue processors
    • Need dynamic scheduling, speculation support, branch prediction, sophisticated prefetch, etc.
    • Sophisticated compilers are required
4.10. Concluding Comments
• “Hardware” techniques migrating to “software” and vice versa
• Multiprocessors may be important in future
Memory Hierarchies
• Not a new idea!
• Takes advantage of the principle of locality
  – Temporal
  – Spatial
• Small, fast memories close to processor
Introduction
• Usually includes responsibility for memory protection
• Performance is a major problem
Characterising Levels of the Memory Hierarchy
• Four questions:
  – Where can a block be placed? (placement)
  – How is a block found? (identification)
  – Which block should be replaced on a miss? (replacement)
  – What happens on a write? (write strategy)
Caches
• Where is a block placed in a cache?
  – Three possible answers, three different types
    • Anywhere: fully associative
    • Only into one block: direct mapped
    • Into subset of blocks: set associative
Cache Categories
• Set associative
  – n-way set associative, where n is number of blocks in set
  – Commonly, n = 2 or n = 4
• Direct-mapped
  – “1-way set associative”
• Fully associative
  – “m-way set associative” (m is total number of blocks in cache)
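The three categories differ only in how many sets the cache has, which a short sketch makes concrete. It assumes the usual placement rule that a block may go into set (block address mod number of sets); `num_sets` and `set_index` are hypothetical helper names.

```c
#include <assert.h>

/* Number of sets in a cache of total_blocks blocks with the given
   associativity (ways = n for an n-way set-associative cache). */
unsigned num_sets(unsigned total_blocks, unsigned ways) {
    assert(total_blocks > 0 && ways > 0 && total_blocks % ways == 0);
    return total_blocks / ways;
}

/* Set in which a block may be placed: block address modulo number of sets.
   ways == 1 gives direct mapped; ways == total_blocks gives fully associative. */
unsigned set_index(unsigned block_addr, unsigned total_blocks, unsigned ways) {
    return block_addr % num_sets(total_blocks, ways);
}
```

For a cache of 8 blocks and block address 12, direct mapped (1-way) allows only block 12 mod 8 = 4, 2-way set associative allows either block of set 12 mod 4 = 0, and fully associative (8-way) has a single set, so the block can go anywhere.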