1/36 by Martin Labrecque How to Fake 1000 Registers Oehmke, Binkert, Mudge, Reinhart to appear in Nov @ Micro 2005


by Martin Labrecque

How to Fake 1000 Registers

Oehmke, Binkert, Mudge, Reinhart

to appear in Nov @ Micro 2005


Outline

● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications


Some definitions

● Activation record:
Data structure {
  ● variables belonging to one particular scope (e.g. a procedure body)
  ● links to other activation records
};
Synonyms: "data frame", "stack frame"
● Context:
– Activation record of a thread of execution

A register is only meaningful to the current activation record


Key observation

● Virtual Memory:
– From the ISA standpoint: each process has an 'infinite' amount of memory available
– Memory is managed in caches, RAM and disk
– Memory is context free
● This is not true for registers
– Limited resource

Need to virtualize registers


How registers are used

● Compiler: source code (variables) → IR (virtual registers) → register allocation → binary (logical registers)
● Pipeline: binary (logical registers) → decode/rename → data path (physical registers)


Registers are useful

● Can't get rid of registers:
– Efficient address encoding in instructions
– Unambiguous data dependences
– Efficient integration in the microarchitecture


Dawn of a New Idea

Attach a memory address to the content of the register!


Virtualizing registers


Mapping registers to memory

● Registers are virtualized because they hold the content of a memory location

● 2 options
– At register allocation, map compiler virtual registers to memory
  ● Memory to memory operations
  ● Doesn't make use of ISA registers
– Map ISA registers to memory
  ● Key Idea of the Virtual Context Architecture


Programming the VCA

● Where are the registers mapped in memory?

● The Stack Pointer is the Reference
– Allows memory to be 'allocated' dynamically
– Efficient way of passing parameters to a function
– Need some architectural support to address with offsets to the stack pointer


Renaming

● To get the register memory address, combine:
– the source/destination register index of the binary program
– base pointer (stack pointer)
● ISA register index → register memory address → physical register


Register memory address → physical reg.

● The address = base pointer + offset
● Exploit locality of the addresses to compress the number of bits in the conversion, low probability of capacity miss
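
The two conversion steps (ISA register index, then register memory address, then physical register) can be sketched roughly as below. This is only an illustration: the 8-byte register size, the dictionary-based table, and the names `register_address` and `RenameTable` are assumptions made for the example and are not taken from the slides or the paper.

```python
REG_BYTES = 8  # assumed register width in bytes (hypothetical)


def register_address(base_pointer, reg_index):
    """Memory address backing a logical register: base pointer + offset."""
    return base_pointer + reg_index * REG_BYTES


class RenameTable:
    """Toy rename table keyed by register memory address (hypothetical)."""

    def __init__(self):
        self.addr_to_phys = {}

    def lookup(self, base_pointer, reg_index):
        """Return the physical register holding this logical register,
        or None when the value is not in the register file (a miss)."""
        return self.addr_to_phys.get(register_address(base_pointer, reg_index))

    def bind(self, base_pointer, reg_index, phys_reg):
        """Record that phys_reg now holds the register at this address."""
        self.addr_to_phys[register_address(base_pointer, reg_index)] = phys_reg


table = RenameTable()
table.bind(base_pointer=0x1000, reg_index=3, phys_reg=17)
hit = table.lookup(0x1000, 3)    # same context: found
miss = table.lookup(0x2000, 3)   # different base pointer: not resident
```

The same register index under a different base pointer yields a different address, which is why one table can serve many contexts.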


Register File is a Cache

● Hardware controlled cache
● An instruction requires its source operands and destination register to execute

What happens on a “cache” miss? We need some hardware control!


Some additional HW

● Each register has 3 new attributes:

1) A reference count:
● Incremented when an instruction using it goes through rename
● Decremented when the instruction is committed
● Non zero value means that the register cannot be reallocated to other logical registers
● Guarantees correct instruction execution


Some additional HW (ctnd)

2) A 'committed' bit
● Valid, non speculative value
3) A 'dirty' bit
● Value more up-to-date than memory

• Using those attributes, a state machine controls which registers are available or not

• Branch recovery works by having a duplicate renaming table containing the committed architectural state
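
A minimal sketch of the three per-register attributes follows. The slides do not spell out the state machine, so the two policy methods (`can_reallocate`, `needs_spill`) and the class name `PhysRegState` are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    ref_count: int = 0       # in-flight instructions using this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than memory

    def on_rename(self):
        """An instruction using this register passed through rename."""
        self.ref_count += 1

    def on_commit(self):
        """An instruction using this register was committed."""
        self.ref_count -= 1

    def can_reallocate(self):
        # Assumed policy: only unreferenced, committed registers are candidates.
        return self.ref_count == 0 and self.committed

    def needs_spill(self):
        # Assumed policy: a dirty value must be written back before reuse.
        return self.can_reallocate() and self.dirty


reg = PhysRegState(committed=True, dirty=True)
reg.on_rename()
busy = reg.can_reallocate()   # False while an instruction still uses it
reg.on_commit()
free = reg.can_reallocate()   # True once the count drops back to zero
spill = reg.needs_spill()     # True: the memory copy is stale
```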


Source operand to physical register conversion


Destination logical register to physical register conversion


Allocation of an entry for destination register

● Replacement policy in rename table


Pipeline modifications

● Changes in the renaming
● ASTQ: architectural state transfer queue
– Entries are added to the queue upon fills and spills
– Has priority over the instructions to execute
– Addresses for fills and spills are pre-calculated
– No memory disambiguation required
– No data dependences
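
To make the fill/spill traffic concrete, here is a toy model of a register file used as a cache, with a transfer queue that collects the resulting memory operations. It is a sketch under stated assumptions: LRU replacement is assumed (the slides only say a replacement policy exists), reference counts are ignored, and the class and field names are hypothetical.

```python
from collections import OrderedDict, deque


class RegisterCache:
    """Toy register file acting as a cache over register memory addresses."""

    def __init__(self, num_phys_regs):
        self.capacity = num_phys_regs
        self.resident = OrderedDict()   # address -> dirty flag, in LRU order
        self.transfer_queue = deque()   # pending ("fill" | "spill", address)

    def access(self, address, is_write):
        """Touch a register; queue a fill on a read miss and a spill
        when a dirty victim has to be evicted."""
        if address in self.resident:
            self.resident.move_to_end(address)
        else:
            if len(self.resident) >= self.capacity:
                victim, dirty = self.resident.popitem(last=False)
                if dirty:
                    self.transfer_queue.append(("spill", victim))
            if not is_write:
                self.transfer_queue.append(("fill", address))
            self.resident[address] = False
        if is_write:
            self.resident[address] = True


cache = RegisterCache(num_phys_regs=2)
cache.access(0x1000, is_write=True)    # destination: no fill needed
cache.access(0x1008, is_write=False)   # source miss: fill
cache.access(0x1010, is_write=False)   # evicts dirty 0x1000: spill, then fill
pending = list(cache.transfer_queue)
```

Because each queued entry already carries its address, nothing in the queue depends on another instruction's result, which matches the "pre-calculated addresses, no data dependences" points above.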


Outline

● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications
– Baseline & Methodology
– Register windows w/ results
– SMT w/ results
– Combined register windows + SMT


Baseline machine


More on methodology

● Uses SimPoints to find representative simulation intervals

● SPEC CPU 2000
● Baseline doesn't have register windows
– (Alpha’s register remapping with issue queues)
● Window overflow/underflow: 10 cycles


Applications

● Register windows
● Multithreading

http://en.wikipedia.org/wiki/Register_window
http://www.sics.se/~psm/sparcstack.html


Register Windows

● Global register allocation
– How many registers should we reserve for the current procedure versus the rest of the program?
– SPARC example:
  ● usually contains as many as 128 GPRs
  ● At any point only 32 are available:
    – 8 global, 8 params in, 8 params out, 8 local values
    – Up to 32 windows
    – Windows changed by an instruction usually along with 'call' and 'return'
    – Partial overlap: 'params out' of caller are 'params in' of callee
– Also used in Itanium (variable sized window)
– Alternative is e.g.: renaming with reservation stations

Save some memory (stack) traffic on function calls
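
The partial overlap between caller and callee windows can be illustrated with a small index calculation. This is a simplified sketch, not the exact SPARC mechanism: the window count `NWINDOWS = 8`, the use of call depth instead of the hardware window pointer, and the function name `physical_index` are assumptions. The logical numbering (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in) follows the usual SPARC convention.

```python
NWINDOWS = 8          # assumed number of windows (hypothetical)
REGS_PER_WINDOW = 16  # each call level adds 8 locals + 8 outs


def physical_index(depth, reg):
    """Map logical register `reg` (0..31) at call depth `depth`
    to an index in a windowed register file."""
    if reg < 8:
        return reg                    # globals are shared by all windows
    if reg >= 24:
        slot = reg - 24               # ins: first 8 slots of this window
    elif reg >= 16:
        slot = 8 + (reg - 16)         # locals: next 8 slots
    else:
        slot = 16 + (reg - 8)         # outs: first 8 slots of the next window
    windowed = (depth * REGS_PER_WINDOW + slot) % (NWINDOWS * REGS_PER_WINDOW)
    return 8 + windowed


caller_out0 = physical_index(depth=0, reg=8)    # caller's first 'params out'
callee_in0 = physical_index(depth=1, reg=24)    # callee's first 'params in'
caller_local0 = physical_index(depth=0, reg=16)
callee_local0 = physical_index(depth=1, reg=16)
```

The caller's outs and the callee's ins land on the same physical registers, so parameters are passed without touching the stack, while locals stay private to each window.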


Register Windows Caveats

● Problem:
– Overflow of windows: call depth too deep
– Underflow of window: need to restore a window from memory
● Solution
– Operating system handler
– typical scheme saves and restores windows
– VCA handles registers individually

Performance Advantage of the Register Stack in Intel® Itanium™ Processors


Register windows evaluation

● ‘Ideal’: fills and spills are free
● VCA is especially good with few registers
● Close to ideal at 256 registers
● VCA 4% faster than baseline @ 256 regs
● Fewer registers means fewer in-flight instructions and less branch misprediction
● (chart annotations: "increase"; "For others decrease")


Single data cache port experiment

● Normalized to 2-port baseline
● 7% faster than baseline @ 256 regs
● 0.5% slower than ideal @ 256 regs


2nd App: multi-threading


SMT: simultaneous multi-threading

● Lots of replicated resources (larger register file)

● VCA: renaming table is not replicated, only base thread pointer

● VCA:
– # of in-flight instructions determines number of registers required
– not # of threads
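
One way to picture why only the per-thread base pointer needs replicating: if every thread's registers live at distinct memory addresses, a single address-keyed rename table can serve all threads. The sketch below assumes an 8-byte register size, made-up base pointers, and a plain dictionary standing in for the rename table; none of these details come from the slides.

```python
REG_BYTES = 8  # assumed register width in bytes (hypothetical)

# Hypothetical per-thread base pointers; only these are replicated.
thread_base = {0: 0x10000, 1: 0x20000}

shared_rename_table = {}  # register memory address -> physical register


def rename(thread_id, reg_index, next_free):
    """Map (thread, logical register) to a physical register through
    the single shared table, allocating `next_free` on first use."""
    address = thread_base[thread_id] + reg_index * REG_BYTES
    if address not in shared_rename_table:
        shared_rename_table[address] = next_free
    return shared_rename_table[address]


p0 = rename(0, 5, next_free=40)        # thread 0, logical register 5
p1 = rename(1, 5, next_free=41)        # thread 1, same logical register
p0_again = rename(0, 5, next_free=42)  # already resident: no new allocation
```

Physical registers are consumed only by registers that are actually touched, so the demand tracks in-flight instructions rather than the number of threads.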


SMT: 2 and 4 threads

● Normalized to single thread baseline 256 regs (not shown)

● @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%)

● @192 regs, VCA 4T is at 98.7% of baseline @448 regs


Combined SMT w/ register windows

● Normalized to single thread baseline @ 256 regs
● VCA 4T: 98% of peak performance @ 192 regs


SMT + register windows

● Register window reduces cache accesses while SMT increases them

● VCA 4T non-windowed @ 192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses 5% below baseline


VCA summarized

● unifies support for both multiple independent threads and register windowing within each thread;

● backwards compatible with existing ISAs at the application level for multithreaded contexts;

● requires only minimal ISA changes for register windowing;

● requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop;

● builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages;


VCA summarized (ctnd)

● completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file;

● does not involve speculation or prediction, avoiding the need for recovery mechanisms.


Conclusions

● A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation.

● VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.


Conclusions (ctnd)

● VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture.

● VCA allows SMT to be combined with register windows with no additional physical registers.

● a 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.