
Hardware Support for Compiler Speculation

• Compiler needs to move instructions before branch, possibly before condition

• Requirements:
  – Instructions that can be moved without disrupting data flow
  – Exceptions that can be ignored until outcome is known
  – Ability to speculatively access memory with potential address conflicts

Exception Support

• Four methods:
  – Hardware and OS cooperate to ignore exceptions for speculative instructions
  – Speculative instructions never raise exceptions; explicit checks must be made
  – Poison bits used to mark registers with invalid results; use causes exception
  – Speculative results are buffered until it is certain the instruction is no longer speculative
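
The poison-bit method can be sketched in C as a toy model. This is only an illustration under stated assumptions: the type `spec_reg` and the functions `spec_load`, `spec_add` and `commit_use` are hypothetical names, and a fault is modelled simply as a NULL address rather than a real page fault or protection violation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A register value plus a poison bit (hypothetical model, not a real ISA). */
typedef struct {
    int64_t value;
    bool poison;   /* set when a speculative instruction faulted */
} spec_reg;

/* Speculative load: a fault (modelled here as a NULL address) is not
   raised; it only poisons the destination register. */
static spec_reg spec_load(const int64_t *addr) {
    spec_reg r = {0, false};
    if (addr == NULL)
        r.poison = true;
    else
        r.value = *addr;
    return r;
}

/* Speculative add: poison propagates from either source to the result. */
static spec_reg spec_add(spec_reg a, spec_reg b) {
    spec_reg r = { a.value + b.value, a.poison || b.poison };
    return r;
}

/* Non-speculative use: returns false (the deferred exception is raised)
   if the register is poisoned; otherwise stores the value in *out. */
static bool commit_use(spec_reg r, int64_t *out) {
    assert(out != NULL);
    if (r.poison)
        return false;
    *out = r.value;
    return true;
}
```

In this sketch a faulting speculative load costs nothing unless its result is actually used on the non-speculative path, which is the point of deferring the exception.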

Exception Handling

• Nonterminating exceptions can be handled normally (e.g. page fault)
  – May cause serious performance loss

Memory Reference Speculation

• Moving loads across stores is only safe if the addresses do not conflict

• Special instructions check for address conflicts
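
One way to picture the address check is a small table that remembers the addresses of loads hoisted above stores. The sketch below is a simplified software model, not a description of any particular processor: `load_table`, `advanced_load`, `checked_store`, `check_load` and the capacity `TABLE_SIZE` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 8   /* hypothetical capacity */

/* Addresses of loads that were hoisted above stores and not yet verified. */
typedef struct {
    const int *addr[TABLE_SIZE];
    bool valid[TABLE_SIZE];
} load_table;

/* Speculative load moved above a store: record its address in slot `tag`. */
static int advanced_load(load_table *t, size_t tag, const int *addr) {
    assert(tag < TABLE_SIZE);
    t->addr[tag] = addr;
    t->valid[tag] = true;
    return *addr;
}

/* Every store invalidates any recorded load from the same address. */
static void checked_store(load_table *t, int *addr, int value) {
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (t->valid[i] && t->addr[i] == addr)
            t->valid[i] = false;
    *addr = value;
}

/* Check at the load's original position: keep the speculative value if the
   entry survived, otherwise reload from memory. */
static int check_load(load_table *t, size_t tag, const int *addr, int speculative) {
    assert(tag < TABLE_SIZE);
    bool ok = t->valid[tag] && t->addr[tag] == addr;
    t->valid[tag] = false;
    return ok ? speculative : *addr;
}
```

If no intervening store touched the address, the check is cheap and the early load stands; only a real conflict pays for the reload.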

4.6. Crosscutting Issues: Hardware vs. Software Speculation

• A number of trade-offs and limitations
  – Disambiguating memory references is hard for a compiler
  – Hardware branch prediction is usually better
  – Precise exceptions easier in hardware
  – Hardware does not require “housekeeping” code
  – Compilers can “look” further
  – Hardware techniques are more portable

Hardware/Software Speculation

• Major disadvantage of hardware: complexity!

• Some architectures combine hardware and software approaches

4.7. Putting It All Together: IA-64 and Itanium

• IA-64 – RISC-style

• Register-register

• Emphasis on software-based optimisations

• Features:
  – 128 × 65-bit integer registers
  – 128 × 82-bit FP registers
  – 64 predicate registers; 8 branch registers

Registers

• Integer registers
  – Use windowing mechanism
    • 0–31 always visible
    • Remainder arranged in overlapping windows
  – Local and out areas (variable size)
  – Hardware for over-/underflow

• Int and FP registers support register rotation
  – Supports software pipelining

Instruction Format and VLIW

• Compiler schedules parallel instructions; flags dependences

• Instruction group
  – Sequence of (register) independent instructions
  – Compiler marks boundaries between groups (stop)

• Bundle
  – 128 bits: 5-bit template + 3 × 41-bit instructions
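
The arithmetic of the bundle (5 + 3 × 41 = 128 bits) can be made concrete with a small decoder. The sketch below assumes the template occupies the five lowest bits and the three slots follow in ascending bit order; the slides do not state the bit positions, so treat that layout, and the names `bundle`, `bundle_template` and `bundle_slot`, as assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit bundle held as two 64-bit halves:
   lo = bits 0..63, hi = bits 64..127. */
typedef struct { uint64_t lo, hi; } bundle;

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)   /* 41 one-bits */

/* Template: the 5 lowest bits (assumed layout). */
static unsigned bundle_template(bundle b) {
    return (unsigned)(b.lo & 0x1F);
}

/* Instruction slot n (0..2): 41 bits starting at bit 5 + 41*n. */
static uint64_t bundle_slot(bundle b, unsigned n) {
    assert(n < 3);
    unsigned start = 5 + 41 * n;            /* 5, 46 or 87 */
    uint64_t bits;
    if (start >= 64)                         /* slot 2: entirely in hi */
        bits = b.hi >> (start - 64);
    else if (start + 41 <= 64)               /* slot 0: entirely in lo */
        bits = b.lo >> start;
    else                                     /* slot 1: straddles both halves */
        bits = (b.lo >> start) | (b.hi << (64 - start));
    return bits & SLOT_MASK;
}
```

Note that the middle slot straddles the 64-bit boundary, which is why the bundle is fetched and decoded as a single 128-bit unit.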

Instruction Bundle

• Template specifies stops and execution unit
  – I-unit (int + special: multimedia, etc.)
  – M-unit (int + memory access)
  – F-unit (FP)
  – B-unit (branches)
  – L+X (extended instructions)

Example

• Unrolled seven times
  – Optimised for size:
    • 9 bundles; 15% nops
    • 21 cycles (3 per calculation)
  – Optimised for performance:
    • 11 bundles; 30% nops
    • 12 cycles (1.7 per calculation)

for (int k = 0; k < 1000; k++) { x[k] = x[k] + s; }
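
At source level, unrolling the loop above seven times might look like the following sketch. The element type `double`, the function name `add_scalar_unrolled7` and the clean-up loop are assumptions for illustration; the slides only give the original loop and the unroll factor.

```c
#include <assert.h>
#include <stddef.h>

/* Adds s to every element of x, with the loop body unrolled seven times. */
static void add_scalar_unrolled7(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    size_t k = 0;
    /* Main body: seven independent updates per iteration, which gives the
       scheduler seven load/add/store chains to interleave. */
    for (; k + 7 <= n; k += 7) {
        x[k]     += s;
        x[k + 1] += s;
        x[k + 2] += s;
        x[k + 3] += s;
        x[k + 4] += s;
        x[k + 5] += s;
        x[k + 6] += s;
    }
    /* Clean-up loop for the n % 7 leftover elements (6 when n = 1000). */
    for (; k < n; k++)
        x[k] += s;
}
```

Because the seven updates have no dependences on each other, they can be packed into bundles with few stops; the cycle counts quoted for the example are per unrolled iteration, i.e. per seven calculations.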

Instructions

• 41 bits long
  – 4-bit opcode (+ template bits)
  – 6-bit predicate register specifier

• Predication
  – Almost all instructions can be predicated
    • Branch is jump with predicate check!
  – Complex comparisons set two predicate registers
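
Predication can be modelled in C to show how an if/else turns into straight-line code. This is a rough sketch: `pred_file`, `cmp_eq`, `mov_if` and `select_predicated` are hypothetical names, and the `if` inside `mov_if` stands in for hardware that simply discards the result of an instruction whose predicate is false.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PRED 64   /* number of predicate registers listed for IA-64 */

typedef struct { bool p[NUM_PRED]; } pred_file;

/* Comparison that sets two predicates at once:
   p[t] = (a == b) and p[f] = !(a == b). */
static void cmp_eq(pred_file *pf, unsigned t, unsigned f, int a, int b) {
    assert(t < NUM_PRED && f < NUM_PRED);
    pf->p[t] = (a == b);
    pf->p[f] = !(a == b);
}

/* Predicated move: takes effect only when predicate q is true. */
static void mov_if(const pred_file *pf, unsigned q, int *dst, int src) {
    assert(q < NUM_PRED);
    if (pf->p[q])
        *dst = src;
}

/* Equivalent of: if (a == b) r = x; else r = y;
   Both moves are always issued; the predicates decide which one counts. */
static int select_predicated(int a, int b, int x, int y) {
    pred_file pf = {{false}};
    int r = 0;
    cmp_eq(&pf, 1, 2, a, b);
    mov_if(&pf, 1, &r, x);   /* (p1) r = x */
    mov_if(&pf, 2, &r, y);   /* (p2) r = y */
    return r;
}
```

Setting both the predicate and its complement in one comparison is what lets the two arms of the conditional be scheduled together without a branch.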

Speculation

• Exceptions can be deferred
  – Uses poison bits (65-bit registers)
  – Nonspeculative and chk instructions raise exception

• Speculative loads
  – Called advanced load (ld.a)
  – Stores check addresses

Itanium

• First implementation of IA-64

• Issues up to six instructions per cycle (two bundles)

• Nine functional units
  – 2 × I, 2 × M, 3 × B, 2 × F

• 10-stage pipeline

• Multilevel dynamic branch predictor

Itanium

• Complex hardware with many features of dynamically scheduled pipelines!
  – Branch prediction
  – Register renaming
  – Scoreboarding
  – Deep pipeline
  – etc.

Itanium: Performance

• SPECint not too impressive
  – 85% of Alpha 21264 (older, more power-efficient processor!)

• FP better
  – Faster, even with slower clock!
  – But skewed by one benchmark for Pentium
  – Alpha compilers need improvement

4.8. Another View: ILP in Embedded Processors

• Trimedia (see chapter 2)
  – “Classic” VLIW
  – Hardware decompression of code

• Crusoe
  – Software translation of 80x86 to VLIW
  – Low power

Trimedia TM32 Architecture

• VLIW
  – Instruction specifies five operations
  – Static scheduling
  – No hardware hazard detection
  – 23 functional units (11 types)

Transmeta Crusoe

• Low power design

• Emulates 80x86

• VLIW
  – 64-bit (2 op) and 128-bit (4 op) instructions
  – Five types of operations:
    • ALU (int, register-register)
    • Compute (int ALU, FP, multimedia)
    • Memory
    • Branch
    • Immediate

Crusoe

• Simple, in-order pipeline
  – Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)
  – FP: 10-stage (5 EX stages)

Crusoe

• Software interpretation of 80x86 code:
  – Basic blocks cached
  – Exception handling complicated

• Crusoe has good support for speculative reordering

• Memory writes buffered and committed only when safe
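
The idea of buffering writes until they are safe can be sketched as a small store buffer with commit and rollback. This is a generic illustration rather than a description of Crusoe's actual hardware: `store_buffer`, `buffered_store`, `commit`, `rollback` and the capacity `BUF_CAP` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_CAP 16   /* hypothetical capacity */

typedef struct {
    int *addr[BUF_CAP];
    int value[BUF_CAP];
    size_t count;
} store_buffer;

/* A speculative store goes into the buffer, not to memory. */
static void buffered_store(store_buffer *sb, int *addr, int value) {
    assert(sb->count < BUF_CAP);
    sb->addr[sb->count] = addr;
    sb->value[sb->count] = value;
    sb->count++;
}

/* Commit: speculation succeeded, so drain the buffer to memory
   in program order. */
static void commit(store_buffer *sb) {
    for (size_t i = 0; i < sb->count; i++)
        *sb->addr[i] = sb->value[i];
    sb->count = 0;
}

/* Rollback: speculation failed, so discard the pending writes;
   memory is left untouched. */
static void rollback(store_buffer *sb) {
    sb->count = 0;
}
```

Keeping memory unchanged until commit is what makes aggressive reordering recoverable: a failed speculation only has to throw the buffer away.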

Crusoe Performance

• Hard to measure accurately

• Power consumption is low (⅓ of Pentium)

4.9. Fallacies and Pitfalls

• Fallacy: There is a simple approach to multiple-issue (high performance with low complexity)
  – Big gap between peak and sustained performance for multiple-issue processors
    • Need dynamic scheduling, speculation support, branch prediction, sophisticated prefetch, etc.
    • Sophisticated compilers are required

4.10. Concluding Comments

• “Hardware” techniques migrating to “software” and vice versa

• Multiprocessors may be important in future

Chapter 5: Memory Hierarchy Design

Memory Hierarchies

• Not a new idea!

• Takes advantage of the principle of locality
  – Temporal
  – Spatial

• Small, fast memories close to processor

Memory Hierarchies

[Figure: levels of the memory hierarchy, from top to bottom: Registers, Cache, Memory, I/O Devices (virtual memory); axes labelled Speed, Cost and Size]

Introduction

• Usually includes responsibility for memory protection

• Performance is a major problem

Figure 5.2

Characterising Levels of the Memory Hierarchy

• Four questions:
  – Where can a block be placed? (placement)
  – How is a block found? (identification)
  – Which block should be replaced on a miss? (replacement)
  – What happens on a write? (write strategy)

Example

• The Alpha 21264 is used as an example throughout

Caches

• Where is a block placed in a cache?
  – Three possible answers, giving three different types:
    • Anywhere: fully associative
    • Only into one block: direct mapped
    • Into a subset of blocks: set associative

Cache Categories

• Set associative
  – n-way set associative, where n is number of blocks in set
  – Commonly, n = 2 or n = 4

• Direct-mapped
  – “1-way set associative”

• Fully associative
  – “m-way set associative” (m is total number of blocks in cache)
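
Since all three categories are special cases of n-way set associativity, one function can compute where a block may go. The sketch below uses the usual modulo mapping (set = block address mod number of sets); the function name `cache_set_index` and its parameters are hypothetical, and the slides do not specify the mapping function.

```c
#include <assert.h>
#include <stddef.h>

/* Set that a block maps to in a cache with `num_blocks` blocks in total
   and `ways` blocks per set.
   ways == 1          -> direct mapped (each set holds one block)
   ways == num_blocks -> fully associative (a single set)          */
static size_t cache_set_index(size_t block_addr, size_t num_blocks, size_t ways) {
    assert(ways > 0 && num_blocks > 0 && num_blocks % ways == 0);
    size_t num_sets = num_blocks / ways;
    return block_addr % num_sets;
}
```

For a cache of 8 blocks and block address 12, this gives block 4 when direct mapped, set 0 when 2-way set associative (4 sets), and set 0 when fully associative, where the single set spans the whole cache and the block can go anywhere.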
