1/36 by Martin Labrecque How to Fake 1000 Registers Oehmke, Binkert, Mudge, Reinhart to appear in Nov @ Micro 2005

1/36

by Martin Labrecque

How to Fake 1000 Registers

Oehmke, Binkert, Mudge, Reinhart

to appear in Nov @ Micro 2005

2/36

Outline

● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications

3/36

Some definitions

● Activation record:

Data structure {
● variables belonging to one particular scope (e.g. a procedure body)
● links to other activation records
};

Synonyms: "data frame", "stack frame"

● Context:
– Activation record of a thread of execution

A register is only meaningful to the current activation record
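The definition above can be illustrated with a minimal sketch. The class name `ActivationRecord`, its fields, and the helper `call` are hypothetical names chosen for illustration; they do not come from the slides or the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActivationRecord:
    # Variables belonging to one particular scope (e.g. a procedure body).
    variables: dict = field(default_factory=dict)
    # Link to another activation record (here: the caller's record).
    caller: Optional["ActivationRecord"] = None


def call(caller: ActivationRecord, **local_vars) -> ActivationRecord:
    """Create the callee's record, linked back to the caller's record."""
    return ActivationRecord(variables=dict(local_vars), caller=caller)


main_record = ActivationRecord(variables={"x": 1})
callee_record = call(main_record, y=2)
```

In this sketch the name `x` is not visible in `callee_record.variables`, which mirrors the point that a register is only meaningful to the current activation record.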

4/36

Key observation

● Virtual Memory:
– From the ISA standpoint: each process has an 'infinite' amount of memory available
– Memory is managed in caches, RAM and disk
– Memory is context free
● This is not true for registers
– Limited resource

Need to virtualize registers

5/36

How registers are used

● Compiler
– Source code: variables
– IR: virtual registers
– Register allocation
– Binary: logical registers
● Pipeline
– Decode/Rename
– Data path: physical registers

6/36

Registers are useful

● Can't get rid of registers:
– Efficient address encoding in instructions
– Unambiguous data dependences
– Efficient integration in the micro-architecture

7/36

Dawn of a New Idea

Attach a memory address to the content of the register!

8/36

Virtualizing registers

9/36

Mapping registers to memory

● Registers are virtualized because they hold the content of a memory location

● 2 options
– At register allocation, map compiler virtual registers to memory
● Memory to memory operations
● Doesn't make use of ISA registers
– Map ISA registers to memory
● Key Idea of the Virtual Context Architecture

10/36

Programming the VCA

● Where are the registers mapped in memory?

● The Stack Pointer is the Reference
– Allows memory to be 'allocated' dynamically
– Efficient way of passing parameters to a function
– Need some architectural support to address with offsets to the stack pointer

11/36

Renaming

● To get the register memory address, combine:
– the source/destination register index of the binary program
– base pointer (stack pointer)
● ISA register index → register memory address → physical register

12/36

Register memory address → physical reg.

● The address = base pointer + offset
● Exploit locality of the addresses to compress the number of bits in the conversion, low probability of capacity miss
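The two-step conversion (ISA register index, then register memory address, then physical register) can be sketched as follows. This is only an illustration under stated assumptions: the register width `REG_BYTES`, the upward-growing offset, and the dictionary used as a rename table are all hypothetical and not taken from the slides.

```python
REG_BYTES = 8  # assumed register width in bytes (not stated in the slides)


def register_address(base_pointer: int, reg_index: int) -> int:
    """Register memory address = base pointer + offset.

    The offset is assumed to be reg_index * REG_BYTES.
    """
    return base_pointer + reg_index * REG_BYTES


# Hypothetical rename table: register memory address -> physical register number.
rename_table = {}
rename_table[register_address(0x1000, 3)] = 17

# The same ISA index under a different base pointer names a different address,
# so it is looked up independently of the first mapping.
same_context = rename_table.get(register_address(0x1000, 3))
other_context = rename_table.get(register_address(0x2000, 3))
```

Here `other_context` is `None` because no physical register has been bound to that address yet, which is the "cache miss" case.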

13/36

Register File is a Cache

● Hardware controlled cache
● An instruction requires its source operands and destination register to execute

What happens on a “cache” miss?
We need some hardware control!

14/36

Some additional HW

● Each register has 3 new attributes:

1) A reference count:
● Incremented when an instruction using it goes through rename
● Decremented when the instruction is committed
● Non zero value means that the register cannot be reallocated to other logical registers
● Guarantees correct instruction execution

15/36

Some additional HW (ctnd)

2) A 'committed' bit
● Valid, non speculative value

3) A 'dirty' bit
● Value more up-to-date than memory

• Using those attributes, a state machine controls which registers are available or not

• Branch recovery works by having a duplicate renaming table containing the committed architectural state
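The three per-register attributes can be sketched as a small record. The slides do not spell out the state machine, so the rules below are assumptions made for illustration: a register with a zero reference count is treated as a replacement candidate, and a committed, dirty value is treated as needing a spill before reuse. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysicalRegister:
    ref_count: int = 0       # in-flight instructions that reference this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than memory

    def on_rename(self) -> None:
        """An instruction using this register goes through rename."""
        self.ref_count += 1

    def on_commit(self) -> None:
        """An instruction using this register commits."""
        if self.ref_count == 0:
            raise RuntimeError("commit without a matching rename")
        self.ref_count -= 1

    def can_reallocate(self) -> bool:
        # Non-zero count: the register cannot be given to another logical register.
        return self.ref_count == 0

    def needs_spill(self) -> bool:
        # Assumed rule: only committed, dirty values must be written back first.
        return self.committed and self.dirty


reg = PhysicalRegister()
reg.on_rename()
busy = reg.can_reallocate()
reg.on_commit()
free = reg.can_reallocate()
```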

16/36

Source operand to physical register conversion

17/36

Destination logical register to physical register conversion

18/36

Allocation of an entry for destination register

● Replacement policy in rename table

19/36

Pipeline modifications

● Changes in the renaming
● ATSQ: architectural state transfer queue
– Adds to the queue upon fills and spills
– Has priority on the instruction to execute
– Addresses for fills and spills are pre-calculated
– No memory disambiguation required
– No data dependences
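A rough sketch of the ATSQ behaviour listed above, under assumptions: the queue is modelled as a FIFO, and "priority" is read as meaning that a pending fill or spill is served before any ready instruction. The names `Transfer`, `ATSQ` and `next_memory_operation` are hypothetical, and the real hardware arbitration may differ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    kind: str      # "fill" (memory to register) or "spill" (register to memory)
    address: int   # pre-calculated register memory address
    phys_reg: int  # physical register involved in the transfer


class ATSQ:
    """Architectural state transfer queue, modelled as a simple FIFO."""

    def __init__(self) -> None:
        self._queue = deque()

    def enqueue(self, transfer: Transfer) -> None:
        if transfer.kind not in ("fill", "spill"):
            raise ValueError("kind must be 'fill' or 'spill'")
        self._queue.append(transfer)

    def __len__(self) -> int:
        return len(self._queue)

    def pop(self) -> Transfer:
        return self._queue.popleft()


def next_memory_operation(atsq: ATSQ, ready_instructions: deque):
    """Serve queued transfers first (assumed priority), then ready instructions."""
    if len(atsq) > 0:
        return atsq.pop()
    if ready_instructions:
        return ready_instructions.popleft()
    return None
```

Because each transfer carries its own pre-calculated address and touches exactly one register, the sketch needs no dependence tracking or address disambiguation between queue entries.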

20/36

Outline

● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications
– Baseline & Methodology
– Register windows w/ results
– SMT w/ results
– Combined register windows + SMT

21/36

Baseline machine

22/36

More on methodology

● Uses SimPoints to find representative simulation intervals

● SPEC CPU 2000
● Baseline doesn't have register windows
– (Alpha’s register remapping with issue queues)
● Window overflow/underflow: 10 cycles

23/36

Applications

● Register windows
● Multithreading

http://en.wikipedia.org/wiki/Register_window
http://www.sics.se/~psm/sparcstack.html

24/36

Register Windows

● Global register allocation
– How many registers should we reserve for the current procedure versus the rest of the program?
– SPARC example:
● usually contains as many as 128 GPRs
● At any point only 32 are available:
– 8 global, 8 params in, 8 params out, 8 local values
– Up to 32 windows
– Windows changed by an instruction usually along with 'call' and 'return'
– Partial overlap: 'params out' of caller are 'params in' of callee
– Also used in Itanium (variable sized window)
– Alternative is e.g.: renaming with reservation stations

Save some memory (stack) traffic on function calls
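The partial overlap between caller and callee windows can be sketched with a simplified index calculation. This is an illustrative model, not the SPARC specification: the window count `NWINDOWS` is an assumed value, and a call is modelled as moving to the next window number, whereas real SPARC hardware updates its window pointer with its own convention. The logical numbering (0 to 7 global, 8 to 15 out, 16 to 23 local, 24 to 31 in) follows the usual SPARC register naming.

```python
NWINDOWS = 8               # assumed number of windows (illustrative)
WINDOWED = NWINDOWS * 16   # each window adds 8 locals and 8 new outs


def physical_index(window: int, reg: int):
    """Map (window, logical register 0..31) to a slot in the register file."""
    if not 0 <= reg < 32:
        raise ValueError("logical register must be in 0..31")
    if reg < 8:
        return ("global", reg)              # globals are shared by all windows
    base = window * 16
    if reg < 16:
        offset = base + 16 + (reg - 8)      # outs: the next window's ins
    elif reg < 24:
        offset = base + 8 + (reg - 16)      # locals: private to this window
    else:
        offset = base + (reg - 24)          # ins: the previous window's outs
    return ("windowed", offset % WINDOWED)  # the window file is circular


caller_out0 = physical_index(0, 8)    # first 'params out' register of the caller
callee_in0 = physical_index(1, 24)    # first 'params in' register of the callee
```

In this model `caller_out0` and `callee_in0` name the same slot, so passing a parameter needs no copy and no stack traffic.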

25/36

Register Windows Caveats

● Problem:
– Overflow of windows: call depth too deep
– Underflow of window: need to restore a window from memory
● Solution
– Operating system handler
– typical scheme saves and restores windows
– VCA handles registers individually

Performance Advantage of the Register Stack in Intel® Itanium™ Processors

26/36

Register windows evaluation

‘Ideal’: fills and spills are free
VCA is especially good with few registers
Close to ideal at 256 registers
VCA 4% faster than baseline @ 256 regs
Less registers means less in-flight instructions and less branch misprediction increase
For others decrease

27/36

Single data cache port experiment

● Normalized to 2-port baseline
● 7% faster than baseline @ 256 regs
● 0.5% slower than ideal @ 256 regs

28/36

2nd App: multi-threading

29/36

SMT: simultaneous multi-threading

● Lots of replicated resources (larger register file)

● VCA: renaming table is not replicated, only base thread pointer

● VCA:
– # of in-flight instructions determines number of registers required
– not # of threads
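A minimal sketch of why thread count does not size the register file: threads differ only in their base pointer, and physical registers are taken from one shared pool as addresses are actually used. The class name, the register width `REG_BYTES`, and the error raised when the pool is empty are assumptions for illustration; a real design would spill a register instead of failing.

```python
REG_BYTES = 8  # assumed register width in bytes


class SharedRenameTable:
    """One rename table shared by all threads, keyed by register memory address."""

    def __init__(self, num_phys_regs: int) -> None:
        self.free = list(range(num_phys_regs))
        self.mapping = {}  # register memory address -> physical register

    def rename(self, thread_base: int, reg_index: int) -> int:
        address = thread_base + reg_index * REG_BYTES
        if address not in self.mapping:
            if not self.free:
                # Placeholder for the spill path, which this sketch omits.
                raise RuntimeError("no free physical register; a spill is needed")
            self.mapping[address] = self.free.pop()
        return self.mapping[address]


table = SharedRenameTable(num_phys_regs=4)
thread_bases = {0: 0x10000, 1: 0x20000}  # hypothetical per-thread base pointers

t0_r1 = table.rename(thread_bases[0], 1)
t1_r1 = table.rename(thread_bases[1], 1)
t0_r1_again = table.rename(thread_bases[0], 1)
```

Two threads using logical register 1 occupy two entries, and adding more idle threads would add only base pointers, not table entries.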

30/36

SMT: 2 and 4 threads

● Normalized to single thread baseline 256 regs (not shown)

● @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%)

● @192 regs, VCA 4T is at 98.7% of baseline @448 regs

31/36

Combined SMT w/ register windows

● Normalized to single thread baseline @ 256 regs
● VCA 4T: 98% of peak performance @ 192 regs

32/36

SMT + register windows

● Register window reduces cache accesses while SMT increases them

● VCA 4T non-windowed @ 192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses to 5% below baseline

33/36

VCA summarized

● unifies support for both multiple independent threads and register windowing within each thread;

● backwards compatible with existing ISAs at the application level for multithreaded contexts;

● requires only minimal ISA changes for register windowing;

● requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop;

● builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages;

34/36

VCA summarized (ctnd)

● completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file;

● does not involve speculation or prediction, avoiding the need for recovery mechanisms.

35/36

Conclusions

● A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation.

● VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.

36/36

Conclusions (ctnd)

● VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture.

● VCA allows SMT to be combined with register windows with no additional physical registers.

● A 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.
