1/36 by Martin Labrecque How to Fake 1000 Registers Oehmke, Binkert, Mudge, Reinhart to appear in...

by Martin Labrecque

How to Fake 1000 Registers

Oehmke, Binkert, Mudge, Reinhart
To appear in Nov. at MICRO 2005

Outline

● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications

Some definitions

● Activation record:
– Data structure holding:
  ● variables belonging to one particular scope (e.g. a procedure body)
  ● links to other activation records
– Synonyms: "data frame", "stack frame"
● Context:
– Activation record of a thread of execution

A register is only meaningful to the current activation record

Key observation

● Virtual Memory:
– From the ISA standpoint: each process has an 'infinite' amount of memory available
– Memory is managed in caches, RAM and disk
– Memory is context free
● This is not true for registers
– Limited resource

Need to virtualize registers

How registers are used

Compiler:
Source code: variables → IR: virtual registers → (register allocation) → Binary: logical registers

Pipeline:
Binary: logical registers → (decode/rename) → Data path: physical registers

Registers are useful

● Can't get rid of registers:
– Efficient address encoding in instructions
– Unambiguous data dependences
– Efficient integration in the micro-architecture

Attach a memory address to the content of the register!

Dawn of a New Idea

Virtualizing registers

Mapping registers to memory

● Registers are virtualized because they hold the content of a memory location

● 2 options
– At register allocation, map compiler virtual registers to memory
  ● Memory to memory operations
  ● Doesn't make use of ISA registers
– Map ISA registers to memory
  ● Key Idea of the Virtual Context Architecture

Programming the VCA

● Where are the registers mapped in memory?

● The Stack Pointer is the Reference
– Allows memory to be 'allocated' dynamically
– Efficient way of passing parameters to a function
– Need some architectural support to address with offsets to the stack pointer

Renaming

● To get the register memory address, combine:
– the source/destination register index of the binary program
– base pointer (stack pointer)
● ISA register index → register memory address → physical register

Register memory address → physical reg.

● The address = base pointer + offset
● Exploit locality of the addresses to compress the number of bits in the conversion, low probability of capacity miss
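The two-step conversion above can be illustrated with a minimal sketch. The register width (`WORD_BYTES`), the `RenameTable` class and the dictionary-based lookup are hypothetical names chosen for illustration; the slides only state that the address is the base pointer plus an offset and that the rename stage maps that address to a physical register.

```python
from typing import Dict, Optional

WORD_BYTES = 8  # assumed register width in bytes (not given in the slides)


def register_address(base_pointer: int, reg_index: int) -> int:
    """Memory address backing a logical register: base pointer + offset."""
    return base_pointer + reg_index * WORD_BYTES


class RenameTable:
    """Hypothetical rename table keyed by register memory address."""

    def __init__(self) -> None:
        self._by_address: Dict[int, int] = {}

    def bind(self, base_pointer: int, reg_index: int, phys_reg: int) -> None:
        """Record that a physical register now holds this logical register."""
        self._by_address[register_address(base_pointer, reg_index)] = phys_reg

    def lookup(self, base_pointer: int, reg_index: int) -> Optional[int]:
        """Return the physical register, or None on a register-cache miss."""
        return self._by_address.get(register_address(base_pointer, reg_index))


table = RenameTable()
table.bind(base_pointer=0x1000, reg_index=3, phys_reg=17)  # context A, r3
table.bind(base_pointer=0x2000, reg_index=3, phys_reg=42)  # context B, r3
```

Because the key is a memory address rather than a bare register index, the same logical register `r3` in two contexts resolves to two different physical registers, and an unmapped register simply shows up as a miss.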

Register File is a Cache

● Hardware controlled cache
● An instruction requires its source operands and destination register to execute

What happens on a “cache” miss?
We need some hardware control!

Some additional HW

● Each register has 3 new attributes:

1) A reference count:
● Incremented when an instruction using it goes through rename
● Decremented when the instruction is committed
● Non-zero value means that the register cannot be reallocated to other logical registers
● Guarantees correct instruction execution

Some additional HW (cont'd)

2) A 'committed' bit
● Valid, non-speculative value

3) A 'dirty' bit
● Value more up-to-date than memory

• Using those attributes, a state machine controls which registers are available or not

• Branch recovery works by having a duplicate renaming table containing the committed architectural state
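The three per-register attributes can be sketched as a small state record. This is an illustration only: the class name `PhysRegState`, the method names and the three outcomes of `eviction_action` are hypothetical. In particular, treating a dirty but uncommitted value as not evictable is an assumption of this sketch, not something the slides state.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Hypothetical bookkeeping for one physical register."""

    ref_count: int = 0       # in-flight instructions that use this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up-to-date than memory

    def on_rename(self) -> None:
        """An instruction using this register went through rename."""
        self.ref_count += 1

    def on_commit(self) -> None:
        """An instruction using this register was committed."""
        if self.ref_count == 0:
            raise ValueError("commit without a matching rename")
        self.ref_count -= 1

    def eviction_action(self) -> str:
        """Decide what must happen before this register can be reused."""
        if self.ref_count > 0:
            return "pinned"  # non-zero count: cannot be reallocated
        if self.dirty and not self.committed:
            return "pinned"  # assumption: speculative values are never spilled
        if self.dirty:
            return "spill"   # write back to memory, then reuse
        return "free"        # memory already holds the value
```

A register that is committed and dirty stays pinned while an instruction still references it, and becomes a spill candidate once that instruction commits.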

Source operand to physical register conversion

Destination logical register to physical register conversion

Allocation of an entry for destination register

● Replacement policy in rename table

Pipeline modifications

● Changes in the renaming
● ATSQ: architectural state transfer queue
– Entries are added to the queue upon fills and spills
– Has priority over the instruction to execute
– Addresses for fills and spills are pre-calculated
– No memory disambiguation required
– No data dependences
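One way to picture the ATSQ is as a FIFO of fill and spill requests that is drained before ordinary instruction memory operations. The sketch below assumes that reading of "has priority"; the `Transfer` record, the `next_cache_access` function and the string stand-ins for instruction operations are all hypothetical.

```python
from collections import deque
from typing import Deque, NamedTuple, Optional, Union


class Transfer(NamedTuple):
    kind: str      # "fill" (memory to register) or "spill" (register to memory)
    address: int   # pre-calculated register memory address
    phys_reg: int  # physical register involved in the transfer


def next_cache_access(
    atsq: Deque[Transfer], instr_ops: Deque[str]
) -> Optional[Union[Transfer, str]]:
    """Pick the next data-cache access: queued transfers go first (assumed)."""
    if atsq:
        return atsq.popleft()
    if instr_ops:
        return instr_ops.popleft()
    return None


atsq: Deque[Transfer] = deque()
atsq.append(Transfer("spill", 0x1018, 17))  # evict the old value
atsq.append(Transfer("fill", 0x2018, 17))   # bring in the new one
instr_ops: Deque[str] = deque(["load r5"])

order = []
while True:
    access = next_cache_access(atsq, instr_ops)
    if access is None:
        break
    order.append(access)
```

Since each transfer carries its address up front and does not depend on other instructions, the queue needs no disambiguation or dependence tracking in this model; it is simply served in order.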

Outline

● Motivation:
– Observations on registers
● Idea
– Virtual Context Architecture
● Evaluation in 2 types of applications
– Baseline & Methodology
– Register windows w/ results
– SMT w/ results
– Combined register windows + SMT

Baseline machine

More on methodology

● Uses SimPoints to find representative simulation intervals

● SPEC CPU 2000
● Baseline doesn't have register windows
– (Alpha’s register remapping with issue queues)
● Window overflow/underflow: 10 cycles

Applications

● Register windows
● Multithreading

http://en.wikipedia.org/wiki/Register_window
http://www.sics.se/~psm/sparcstack.html

Register Windows

● Global register allocation
– How many registers should we reserve for the current procedure versus the rest of the program?
– SPARC example:
  ● usually contains as many as 128 GPRs
  ● At any point only 32 are available:
    – 8 global, 8 params in, 8 params out, 8 local values
    – Up to 32 windows
    – Windows changed by an instruction usually along with 'call' and 'return'
    – Partial overlap: 'params out' of caller are 'params in' of callee
– Also used in Itanium (variable sized window)
– Alternative is e.g.: renaming with reservation stations

Save some memory (stack) traffic on function calls
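The overlap between a caller's 'params out' and a callee's 'params in' can be illustrated with a small index-mapping sketch. It assumes the SPARC register numbering (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in) and, as a simplification, that a call moves to the next higher window number; `physical_index` and `register_file_size` are hypothetical helper names.

```python
GLOBALS = 8          # r0-r7, shared by every window
NEW_PER_WINDOW = 16  # 8 outs + 8 locals; the 8 ins are the caller's outs


def physical_index(logical: int, window: int, n_windows: int) -> int:
    """Map a logical register (0-31) in a given window to a physical index.

    Assumes a call moves from window w to window w + 1 (a simplification),
    with windows arranged circularly.
    """
    if not 0 <= logical < 32:
        raise ValueError("logical register must be in 0..31")
    if logical < GLOBALS:
        return logical  # globals are the same in every window
    if logical < 24:
        # outs (r8-r15) and locals (r16-r23) live in this window's own block
        return GLOBALS + (window % n_windows) * NEW_PER_WINDOW + (logical - 8)
    # ins (r24-r31) alias the outs of the caller's window
    return GLOBALS + ((window - 1) % n_windows) * NEW_PER_WINDOW + (logical - 24)


def register_file_size(n_windows: int) -> int:
    """Physical registers needed for a given number of overlapping windows."""
    return GLOBALS + NEW_PER_WINDOW * n_windows
```

With this layout, `physical_index(8, 0, 8)` and `physical_index(24, 1, 8)` name the same physical register, so a parameter written by the caller is visible to the callee without any copy through the stack.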

Register Windows Caveats

● Problem:
– Overflow of windows: call depth too deep
– Underflow of window: need to restore a window from memory
● Solution
– Operating system handler
– typical scheme saves and restores windows
– VCA handles registers individually

Performance Advantage of the Register Stack in Intel® Itanium™ Processors

Register windows evaluation

‘Ideal’: fills and spills are free
VCA is especially good with few registers
Close to ideal at 256 registers
VCA 4% faster than baseline @ 256 regs

Fewer registers means fewer in-flight instructions and fewer branch mispredictions
increase
For others decrease

Single data cache port experiment

● Normalized to 2-port baseline
● 7% faster than baseline @ 256 regs
● 0.5% slower than ideal @ 256 regs

2nd App: multi-threading

SMT: simultaneous multi-threading

● Lots of replicated resources (larger register file)

● VCA: renaming table is not replicated, only base thread pointer

● VCA:
– # of in-flight instructions determines number of registers required
– not # of threads

2 and 4 threads

● Normalized to single thread baseline 256 regs (not shown)

● @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%)

● @ 192 regs, VCA 4T is at 98.7% of baseline @ 448 regs
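As a quick worked check of the register counts quoted on this slide, the ratio of VCA registers to conventional SMT registers can be computed directly. Only the three figures (192, 320, 448) come from the slide; the variable names are illustrative.

```python
# Register counts quoted on the slide: (threads, VCA regs, conventional regs)
configs = [(2, 192, 320), (4, 192, 448)]

# Fraction of the conventional register file that VCA needs in each case
ratios = {threads: vca / conventional for threads, vca, conventional in configs}

for threads, ratio in ratios.items():
    print(f"{threads} threads: VCA uses {ratio:.0%} of the conventional register count")
```

The ratios come out to 60% for two threads and about 43% for four threads, which is where the "as few as half the registers" claim for SMT comes from.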

Combined SMT w/ register windows

● Normalized to single thread baseline @ 256 regs
● VCA 4T: 98% of peak performance @ 192 regs

SMT + register windows

● Register window reduces cache accesses while SMT increases them

● VCA 4T non-windowed @ 192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses 5% below baseline

VCA summarized

● unifies support for both multiple independent threads and register windowing within each thread;

● backwards compatible with existing ISAs at the application level for multithreaded contexts;

● requires only minimal ISA changes for register windowing;

● requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop;

● builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages;

VCA summarized (cont'd)

● completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file;

● does not involve speculation or prediction, avoiding the need for recovery mechanisms.

Conclusions

● A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation.

● VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.

Conclusions (cont'd)

● VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture.

● VCA allows SMT to be combined with register windows with no additional physical registers.

● A 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.
