Fall 2000 CS6241 / ECE8833A - (6-1) Topic 6 Register Allocation

Fall 2000 CS6241 / ECE8833A - (6-1)

Topic 6: Register Allocation

Revisiting A Typical Optimizing Compiler

[Figure: compiler pipeline. Source Program, Front End, Intermediate Language, Back End (Scheduling, Register Allocation).]

Rationale for Separating Register Allocation from Scheduling

• Scheduling and register allocation are each hard to solve individually, let alone globally as a combined optimization.

• So, solve each optimization locally and heuristically “patch up” the two stages.

Why Register Allocation?

• Accessing variables in registers is much faster than accessing data in memory.

In load/store (RISC) processors, operations are performed on register operands.

• Therefore, in the interests of performance, if not by necessity, variables ought to be stored in registers.

• For performance reasons, it is useful to keep variables in registers as long as possible once they are loaded.

• Registers are bounded in number (say, 32).

• Therefore, “register-sharing” is needed over time.

The Goal

• Primarily to assign registers to variables.

• However, the allocator runs out of registers quite often.

• Decide which variables to “flush” out of registers to free them up, so that other variables can be brought in.

This important indirect consequence of allocation is referred to as spilling.

Register Allocation and Assignment

Allocation: identifying program values (virtual registers, live ranges) and program points at which values should be stored in a physical register.

Program values that are not allocated to registers are said to be spilled.

Assignment: identifying which physical register should hold an allocated value at each program point.

Live Ranges

Live range of virtual register a = (BB1, BB2, BB3, BB4, BB5, BB6, BB7).

Def-Use chain of virtual register a = (BB1, BB3, BB5, BB7).

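[Figure: control flow graph over basic blocks BB1 to BB7, with one definition of a (a := ...), three uses of a (:= a), and a conditional branch with T and F edges.]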

Computing Live Ranges

Using data flow analysis, we compute for each basic block:

• In the forward direction, the reaching attribute.

A variable is reaching block i if a definition or use of the variable reaches the basic block along the edges of the CFG.

• In the backward direction, the liveness attribute.

A variable is live at block i if there is a direct reference to the variable at block i or at some block j that succeeds i in the CFG, provided the variable in question is not redefined in the interval between i and j.

Computing Live Ranges (Contd.)

The live range of a variable is the intersection of two sets of CFG nodes (basic blocks): the set in which the variable is live and the set that the variable reaches.
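The two data flow passes and their intersection can be sketched in Python as follows. This is a minimal block-level sketch for a single variable, not the course's implementation: the function name `live_range`, the dictionary-based CFG encoding, and the rule that a block which both uses and defines the variable counts as using it first are all assumptions made for illustration. Every block must appear as a key of `succ`.

```python
def live_range(succ, defs, uses):
    """Blocks in which one variable is both live and reached.

    succ: dict mapping each block to its list of CFG successors
          (every block must be a key, possibly with an empty list).
    defs, uses: sets of blocks that define / use the variable.
    """
    blocks = list(succ)
    pred = {b: [] for b in blocks}
    for b, successors in succ.items():
        for s in successors:
            pred[s].append(b)
    refs = defs | uses

    # Forward pass: a block is reached if it references the variable
    # or if any predecessor is reached.
    reach = set(refs)
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b not in reach and any(p in reach for p in pred[b]):
                reach.add(b)
                changed = True

    # Backward pass: the variable is live on entry to a block if the block
    # uses it, or if it is live on entry to a successor and not redefined here.
    live_in = set(uses)
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b in live_in or b in defs:
                continue
            if any(s in live_in for s in succ[b]):
                live_in.add(b)
                changed = True

    # A block is live if it references the variable or the variable
    # is live on entry to one of its successors.
    live = {b for b in blocks
            if b in refs or any(s in live_in for s in succ[b])}

    return live & reach
```

With a definition in BB1 and uses in BB3, BB5 and BB7 on a seven-block CFG of an assumed shape (BB2 branching to BB3 and BB4, which rejoin at BB5), the sketch returns all seven blocks.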

Local Register Allocation

• Perform register allocation one basic block (or super-block or hyper-block) at a time

• Must note virtual registers that are live on entry and live on exit of a basic block - later perform reconciliation

• During allocation, spill out all live outs and bring all live ins from memory

• Advantage: very fast

Local Register Allocation - Reconciliation Code

• After completing allocation for all blocks, we need to reconcile differences in allocation

• If there are spare registers between two basic blocks, replace the spill out and spill in code with register moves

Linear Scan Register Allocation

• A simple local allocation algorithm

• Assume code is already scheduled

• Build a linear ordering of live ranges (also called live intervals or lifetimes)

• Scan the live ranges in order and assign registers until you run out of them - then spill

Linear Scan RA

[Figure: live ranges laid out in scan order, sorted according to the start time of the first def. Live ranges that do not overlap may use the same physical register.]

The Linear Scan Algo

Source: M. Poletto and V. Sarkar, “Linear Scan Register Allocation”, ACM TOPLAS, Sep 1999.

[Figure: pseudocode of the linear scan algorithm, with the actual spill step marked.]
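A minimal Python sketch in the spirit of the linear scan algorithm: intervals are visited by increasing start point, expired intervals return their registers, and when no register is free the interval that ends last is spilled. The function name `linear_scan`, the `(start, end)` interval encoding, and the integer register numbering are assumptions made for illustration, not details from the slide.

```python
import bisect


def linear_scan(intervals, num_regs):
    """intervals: dict name -> (start, end). Returns (registers, spilled)."""
    free = list(range(num_regs))   # free physical registers
    active = []                    # (end, name) pairs, kept sorted by end
    reg = {}                       # name -> physical register
    spilled = set()

    for name, (start, end) in sorted(intervals.items(),
                                     key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(reg[old])

        if free:
            reg[name] = free.pop(0)
            bisect.insort(active, (end, name))
        elif not active:
            spilled.add(name)      # no registers at all
        else:
            # Spill whichever interval ends last: the current one
            # or the last active one.
            last_end, last = active[-1]
            if last_end > end:
                reg[name] = reg.pop(last)   # take over its register
                spilled.add(last)
                active.pop()
                bisect.insort(active, (end, name))
            else:
                spilled.add(name)

    return reg, spilled
```

For example, with two registers and intervals a = (0, 5), b = (1, 3), c = (2, 8), d = (4, 6), the sketch spills c and lets d reuse the register released by b.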

Combining Local IS and RA

• J.R. Goodman and W.-C. Hsu, “Code Scheduling and Register Allocation in Large Basic Blocks”, 1988

• Recall phase ordering problem

• One solution: combined IS and RA

The Goodman & Hsu Algo

• While the ready set is not empty do:
  – if we are not “running out” of registers
    • select a node from the ready set based on some scheduling priority and mark it as ready (plain list scheduling)
    • decrease the free register count by 1 if the instruction writes to a register
    • if the instruction ends some live ranges, return their registers to the free register pool
  – else
    • select the ready node that will free up the most registers, and do the same as above
    • if no such instruction exists, spill! (similar to linear scan)

Global Register Allocation

• Local register allocation does not store data in registers across basic blocks.

Local allocation has poor register utilization, so global register allocation is essential.

• Simple global register allocation: allocate most “active” values in each inner loop.

• Full global register allocation: identify live ranges in control flow graph, allocate live ranges, and split ranges as needed.

Goal: select allocation so as to minimize number of load/store instructions performed by optimized program.

Topological Sorting

• Given a directed graph $G = (V, E)$, we can define a topological ordering of the nodes

• Let $T = \{v_1, v_2, \ldots, v_n\}$ be an enumeration of the nodes of $V$. $T$ is a topological ordering if, whenever $(v_i, v_j) \in E$, then $i < j$ (i.e., $v_i$ comes before $v_j$ in $T$)

• A topological order linearizes a partial order

Global Linear Scan RA

• Ignoring back-edges, perform a topological sort of the basic blocks using the CFG

• Compute the live range over the entire topological order
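One way to obtain such an ordering is the reverse postorder of a depth-first search from the entry block, which is a topological order of the CFG once back edges are ignored. The Python sketch below assumes a dictionary-based CFG; the function name `topo_order_ignoring_back_edges` is hypothetical.

```python
def topo_order_ignoring_back_edges(succ, entry):
    """Reverse postorder of a DFS over the CFG, starting at the entry block.

    succ: dict mapping each block to its list of successors.
    An edge to an already visited block is never followed, so back
    edges do not affect the result.
    """
    visited = set()
    postorder = []

    def dfs(block):
        visited.add(block)
        for s in succ.get(block, []):
            if s not in visited:
                dfs(s)
        postorder.append(block)

    dfs(entry)
    return postorder[::-1]
```

Live ranges can then be expressed as intervals over positions in the returned list, which is the form a linear scan expects.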

Global Live Ranges

[Figure: CFG over blocks B1, B2, B3, B4 annotated with the global live ranges of A and B, and the same live ranges laid out along the topological order B1 B2 B3 B4.]

Simple Example of Global Register Allocation

Live range of a = {B1, B3}

Live range of b = {B2, B4}

No interference! a and b can be assigned to the same register.

[Figure: Control Flow Graph over B1, B2, B3, B4 with a conditional branch (T/F), containing the statements a = ..., b = ..., ... = a, and ... = b.]

Second Chance Bin-packing

• O. Traub, G. Holloway, and M.D. Smith, “Quality and Speed in Linear-Scan Register Allocation”, SIGPLAN ‘98

• Considers “holes” in live ranges

• Considers register allocation as a bin-packing problem

• Performs register allocation and spill code generation at the same time

Holes in Live Ranges

Bin-packing

• The bin-packing problem: determine how to put the most objects in the fewest fixed-size bins

• More formally, find a partition and assignment of a set of objects such that a constraint is satisfied or an objective function is minimized (or maximized)

• In register allocation, the constraint is that overlapping live ranges cannot be assigned to the same bin (register)

• Another way of looking at linear scan

Working with holes

• We can allocate two non-overlapping live ranges to the same physical register

• We can assign two live ranges to the same physical register if one fits entirely into the hole of another
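Both cases reduce to one check if each live range is represented as a list of segments, with the holes being the gaps between segments. The Python sketch below assumes inclusive `(start, end)` segments over a linear instruction numbering; the function name `can_share_register` and this representation are illustrative assumptions.

```python
def can_share_register(segments_a, segments_b):
    """True if no segment of one live range overlaps a segment of the other.

    Each live range is a list of inclusive (start, end) segments.
    """
    return all(a_end < b_start or b_end < a_start
               for a_start, a_end in segments_a
               for b_start, b_end in segments_b)
```

A range with segments (0, 2) and (8, 10) has a hole from 3 to 7, so a range occupying (3, 7) fits entirely inside it and the two can share a register.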

Second Chance Linear Scan

• Suppose we encounter variable t, and we assigned a register to it by spilling out variable u currently occupying that register

• When u is needed again, it may be loaded into a different register (it gets a “second chance”)

• It will stay till its lifetime ends or it is evicted again

A Problem

resolution code

Execution Frequency & Global Register Allocation

Live range of a = {B1, B2, B3, B4}

Live range of b = {B2, B4}

Live range of c = {B3}

In this example, a and c interfere, and c should be given priority because it has a higher usage count.

[Figure: Control Flow Graph over B1, B2, B3, B4 with branches labeled T and F, containing the statements a = ..., b = ..., c = c + 1, and ... = a + b.]

Cost and Savings

Compilation Cost: running time and space of the global allocation algorithm.

Execution Savings: cycles saved due to register residence of variables in optimized program execution.

Contrast with memory-residence, which leads to longer execution times.

Interference Graph

Definition: An interference graph G is an undirected graph with the following properties:

(a) each node x denotes exactly one distinct live range X, and

(b) an edge exists between nodes x and y iff $X \cap Y \neq \emptyset$, where X and Y are the live ranges corresponding to nodes x and y.
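The definition translates directly into code when each live range is a set of program points or basic blocks. A minimal Python sketch, where the function name `build_interference_graph` and the adjacency-set representation are assumptions:

```python
from itertools import combinations


def build_interference_graph(live_ranges):
    """live_ranges: dict name -> set of blocks (or program points).

    Returns an undirected graph as a dict name -> set of neighbours;
    two nodes are adjacent iff their live ranges intersect.
    """
    graph = {name: set() for name in live_ranges}
    for x, y in combinations(live_ranges, 2):
        if live_ranges[x] & live_ranges[y]:
            graph[x].add(y)
            graph[y].add(x)
    return graph
```

For live ranges a = {B1, B3} and b = {B2, B4} the graph has no edges, so a and b may share a register.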

Interference Graph Example

Live Ranges (instruction sequence):

a := …
b := …
c := …
:= a
:= b
d := …
:= c
:= d

[Figure: Interference Graph over nodes a, b, c, d. Nodes model live ranges; live ranges that overlap interfere and are joined by an edge.]

The Classical Approach

“Register Allocation and Spilling via Graph Coloring”, G. Chaitin, Proceedings SIGPLAN-82 Symposium on Compiler Construction, 98-105, 1982.

“Register Allocation via Coloring”, G. Chaitin, M. Auslander, A. Chandra, J. Cocke, M. Hopkins and P. Markstein, Computer Languages, vol. 6, 47-57, 1981.

more…

The Classical Approach (Contd.)

• These works introduced the key notion of an interference graph for encoding conflicts between the live ranges.

• This notion was defined for the global control flow graph.

• They also introduced the notion of graph coloring to model the idea of register allocation.

Execution Time and Spill-cost

Spilling: Moving a variable that is currently register-resident to memory when no more registers are available and a new live range needs to be allocated; each such move is one spill.

Minimizing Execution Cost: Given an optimistic assignment, i.e., one where all the variables are register-resident, minimize spilling.

Graph Coloring

• Given an undirected graph G and a set of k distinct colors, compute a coloring of the nodes of the graph i.e., assign a color to each node such that no two adjacent nodes get the same color.

Recall that two nodes are adjacent iff they have an edge between them.

• A given graph might not be k-colorable.

• In general, it is a computationally hard problem to color a given graph using a given number k of colors.

• Register allocators therefore use good heuristics for coloring.

Register Allocation as Coloring

• Given k registers, interpret each register as a color.

• The graph G is the interference graph of the given program.

• The nodes of the interference graph are the executable live ranges on the target platform.

• A coloring of the interference graph is an assignment of registers (colors) to live ranges (nodes).

• Running out of colors implies not enough registers and hence a need to spill in the above model.
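The model above can be illustrated with a simple greedy heuristic in Python: visit the nodes in some order, give each the lowest color not used by an already colored neighbour, and report the nodes for which no color is left as spill candidates. This is only an illustrative heuristic, not the algorithm of Chaitin or of Chow and Hennessy; the function name `greedy_color` and the default alphabetical order are assumptions.

```python
def greedy_color(graph, k, order=None):
    """graph: dict node -> set of neighbours; k: number of registers.

    Returns (color, spilled): color maps nodes to register numbers
    0..k-1, spilled lists the nodes that could not be colored.
    """
    color = {}
    spilled = []
    for node in (order or sorted(graph)):
        # Colors already taken by colored neighbours.
        used = {color[n] for n in graph[node] if n in color}
        available = [c for c in range(k) if c not in used]
        if available:
            color[node] = available[0]
        else:
            spilled.append(node)   # ran out of colors: needs a spill
    return color, spilled
```

With three mutually interfering nodes and k = 2, one node is reported as spilled; with k = 3 every node receives a register.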

The Approach Discussed Here

“The Priority Based Coloring Approach to Register Allocation”, F. Chow and J. Hennessy, ACM Transactions on Programming Languages and Systems, vol. 12, 501-536, 1990.

Important Modeling Difference

• The first difference from the classical approach is that now, we assume that the “home location” of a live range is in memory.

– Conceptually, values are always in memory unless promoted to a register; this is also referred to as the pessimistic approach.

– In the classical approach, the dual of this model is used where values are always in registers except when spilled; recall that this is referred to as the optimistic approach.

more...

Important Modeling Difference

• A second major difference is the granularity at which code is modeled.
  – In the classical approach, individual instructions are modeled, whereas
  – Now, basic blocks are the primitive units modeled as nodes in live ranges and the interference graph.

• The final major difference is the place of the register allocation in the overall compilation process.
  – In the present approach, the interference graph is considered earlier in the compilation process using intermediate level statements; compiler generated temporaries are known.
  – In contrast, in the previous work the allocation is done at the level of the machine code.

The Main Information to be Used by the Register Allocator

• For each live range, we have a bit vector LIVE of the basic blocks in it.

• Also we have INTERFERE which gives for the live range, the set of all other live ranges that interfere with it.

• Recall that two live ranges interfere if they intersect in at least one basic block.

• If the size of INTERFERE for a node i is smaller than the number of available registers, then i is unconstrained; it is constrained otherwise.

more...

The Main Information to be Used by the Register Allocator

• An unconstrained node can be safely assigned a register since conflicting live ranges do not use up the available registers.

• We associate a (possibly empty) set FORBIDDEN with each live range that represents the set of colors that have already been assigned to the members of its INTERFERE set.

The above representation is essentially a detailed interference graph representation.
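A small Python sketch of how the INTERFERE and FORBIDDEN sets might be used. The function names `classify` and `forbidden_set`, and the dictionary encoding of INTERFERE and of the colors assigned so far, are assumptions made for illustration.

```python
def classify(interfere, num_regs):
    """Split live ranges into (unconstrained, constrained).

    interfere: dict live range -> set of interfering live ranges.
    A live range is unconstrained if |INTERFERE| < num_regs.
    """
    unconstrained = {x for x, others in interfere.items()
                     if len(others) < num_regs}
    return unconstrained, set(interfere) - unconstrained


def forbidden_set(x, interfere, color):
    """FORBIDDEN for x: colors already given to members of INTERFERE(x)."""
    return {color[y] for y in interfere[x] if y in color}
```

An unconstrained live range always has at least one color outside its FORBIDDEN set, which is why it can be colored last without risk.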

Prioritizing Live Ranges

In the memory bound approach, given live ranges with a choice of assigning registers, we do the following:

• Choose a live range that is “likely” to yield greater savings in execution time.

• This means that we need to estimate the savings of each basic block in a live range.

Estimate the Savings

Given a live range X for variable x, the estimated savings in a basic block i is determined as follows:

1. First compute CyclesSaved, which is the number of loads and stores of x in i scaled by the number of cycles taken for each load/store.

2. Compensate for the single load and/or store that might be needed to bring the variable in and/or store the variable at the end, and denote it by Setup.

Note that Setup is derived from a single load or store or a load plus a store.

more...

Estimate the Savings (Contd.)

3. $\mathrm{Savings}(X, i) = \mathrm{CyclesSaved} - \mathrm{Setup}$

These indicate the actual savings in cycles after accounting for the possible loads/stores needed to move x at the beginning/end of i.

4. $\mathrm{TotalSavings}(X) = \sum_{i \in X} \mathrm{Savings}(X, i) \times W(i)$.

(a) X is the set of all basic blocks in the live range of X.
(b) $W(i)$ is the execution frequency of variable x in block i.

more...

Estimate the Savings (Contd.)

5. Note however that some live ranges might span only a few blocks but yield a large savings due to frequent use of the variable, while others might yield the same cumulative gain over a larger number of basic blocks. We prioritize the former case and define:

$\mathrm{Priority}(X) = \mathrm{TotalSavings}(X) / \mathrm{Span}(X)$

where Span(X) is the number of basic blocks in X.
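Steps 1 to 5 can be combined into a short Python sketch. The per-block record with keys `accesses`, `cycles_per_access`, `setup` and `weight` is a hypothetical encoding chosen for illustration, and the numbers in the example are made up.

```python
def savings(accesses, cycles_per_access, setup):
    """Savings(X, i) = CyclesSaved - Setup for one basic block."""
    cycles_saved = accesses * cycles_per_access
    return cycles_saved - setup


def priority(blocks):
    """Priority(X) = TotalSavings(X) / Span(X).

    blocks: one dict per basic block in the live range, with keys
    'accesses', 'cycles_per_access', 'setup' and 'weight' (W(i)).
    """
    total_savings = sum(
        savings(b["accesses"], b["cycles_per_access"], b["setup"]) * b["weight"]
        for b in blocks
    )
    return total_savings / len(blocks)


example = [
    {"accesses": 3, "cycles_per_access": 2, "setup": 2, "weight": 10},
    {"accesses": 1, "cycles_per_access": 2, "setup": 4, "weight": 1},
]
print(priority(example))  # (4 * 10 + (-2) * 1) / 2 = 19.0
```

Dividing by the span means a short, hot live range outranks a long one with the same total savings, since the long one ties up a register across more blocks.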

The Algorithm

For all constrained live ranges, execute the following steps:

1. Compute Priority(X) if it has not already been computed.

2. For the live range X with the highest priority:

(a) If its priority is negative or if no basic block i in X can be assigned a register, because every color has been assigned to a basic block that interferes with i, then delete X from the list and modify the interference graph.

(b) Else, assign it a color that is not in its forbidden set.

(c) Update the forbidden sets of the members of INTERFERE for X.

more...

The Algorithm (Contd.)

3. For each live range X’ that is in INTERFERE for X do:

(a) If the FORBIDDEN of X’ is the set of all colors, i.e., if no colors are available, SPLIT(X’). Procedure SPLIT breaks a live range into smaller live ranges with the intent of reducing the interference of X’; it will be described next.

4. Repeat the above steps till all constrained live ranges are colored or till there is no color left to color any basic block.
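A simplified Python sketch of the coloring loop. It deliberately omits SPLIT: a live range with a negative priority or with no color left is simply left in memory, so this illustrates steps 1, 2 and 4 only and is not the full priority-based algorithm. The function name `priority_color` and its inputs are assumptions.

```python
def priority_color(priorities, interfere, num_regs):
    """Color live ranges in decreasing priority order (no splitting).

    priorities: dict live range -> Priority(X).
    interfere:  dict live range -> set of interfering live ranges.
    Returns (color, uncolored).
    """
    forbidden = {x: set() for x in priorities}   # FORBIDDEN sets
    color = {}
    uncolored = []

    for x in sorted(priorities, key=priorities.get, reverse=True):
        available = [c for c in range(num_regs) if c not in forbidden[x]]
        if priorities[x] < 0 or not available:
            # Step 2(a): leave X in memory (the full algorithm
            # would consider splitting here).
            uncolored.append(x)
            continue
        color[x] = available[0]                  # step 2(b)
        for y in interfere[x]:                   # step 2(c)
            forbidden[y].add(color[x])

    return color, uncolored
```

With three mutually interfering live ranges of priority 30, 20 and 10 and two registers, the two highest-priority ranges get registers and the third stays in memory.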

The Idea Behind Splitting

• Splitting ensures that we break a live range up into increasingly smaller live ranges.

• The limit is of course when we are down to the size of a single basic block.

• The intuition is that we start out with coarse-grained interference graphs with few nodes.

• This makes the interference node degree possibly high.

• We increase the problem size via splitting on an as-needed basis.

• This strategy lowers the interference.

The Splitting Strategy

A sketch of an algorithm for splitting:

1. Choose a split point.

Note that we are guaranteed that X has at least one basic block i which can be assigned a color i.e., its forbidden set does not include all the colors. The earliest such in the order of control flow can be the split point.

2. Separate the live range X into X1 and X2 around the split point.

3. Update the sets INTERFERE for X1 and X2 and those for the live ranges that interfered with X

more...

The Splitting Strategy (Contd.)

4. Recompute priorities and reprioritize the list.

Other bookkeeping activities to realize a safe implementation are also executed.

Live Range Splitting Example

Live Ranges:
a: BB1, BB2, BB3, BB4, BB5
b: BB1, BB2, BB3, BB4, BB5, BB6
c: BB2, BB3, BB4, BB5

Assume the number of physical registers = 2.

[Figure: control flow graph over BB1 to BB6 containing the definitions a :=, b :=, c := and the uses := a, := b, := b, together with the interference graph over a, b, c.]

Live Range Splitting Example

New live ranges:
a: BB1, BB2, BB3, BB4, BB5
b: BB1
c: BB2, BB3, BB4, BB5
b2: BB6

b and b2 are logically the same program variable; b2 is a renamed equivalent of b. All nodes are now unconstrained.

[Figure: the same control flow graph over BB1 to BB6 after splitting b (with the spill introduced and a T/F branch), together with the interference graph over a, b, c, b2.]

Interaction Between Allocation and Scheduling

• The allocator and the scheduler are typically patched together heuristically.

• This leads to the “phase ordering problem”: should allocation be done before scheduling or vice versa?

• Saving on spilling or “good allocation” is only indirectly connected to the actual execution time.

Contrast with instruction scheduling.

• Factoring register allocation into scheduling and solving the problem “globally” is a research issue.

Example Basic Block

Source Code:

z = x(i)

temp = x( i+1+N)

Intermediate Code:

v1: VR1 ← ADDR(X) + i
v2: VR2 ← LOAD @(VR1)
v3: z ← STORE VR2
v4: VR4 ← VR1 + 1
v5: VR5 ← LOAD N
v6: VR6 ← LOAD @(VR4+VR5)

[Figure: dependence graph over v1 to v6 with edge latencies of 0 and 1.]

Instruction Scheduling followed by Register Allocation

[Figure: schedule v1 v2 v5 v3 v4 v6 with the live ranges of VR1, VR2, VR4 and VR5.]

Completion time = 6 cycles.

Maximum register width = 3.

Register Allocation followed by Instruction Scheduling

[Figure: schedule of v1 to v6 with the live ranges of VR1, VR2, VR4 and VR5.]

Completion time = 8 cycles.

Maximum register width = 2.

Combined Register Allocation and Instruction Scheduling

[Figure: schedule of v1 to v6 with the live ranges of VR1, VR2, VR4 and VR5.]

Completion time = 7 cycles.

Maximum register width = 2.

Region Based Register Allocation for Explicit Parallel Instruction Computing (EPIC)

Hansoo Kim

React-ILP Laboratory

New York University

Evolution of EPIC

Directions of EPIC

• Explicitly controlled architectures
  – Simplify architecture as much as possible
  – Architectural template is a known, conventional one
  – Compiler handles a lot of processor’s decision making
    • Explicitly control issue, scheduling, allocation
  – Explicitly parallel instruction computing (EPIC)
    • Subset of explicitly controlled architectures

Compiler vs. Processor

[Figure: division of work between Compiler and Hardware for Superscalar, Dataflow, Indep. Arch., EPIC and TTA, across the steps Frontend and Optimizer, Determine Dependences, Determine Independences, Bind Operations to Function Units, Bind Transports to Busses, and Execute.]

B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.

Compilation Time as a Resource

• Compiler plays even more critical role in EPIC in managing hardware
  – Hardware resource management is under compiler control

• In EPIC processors, the compiler needs to optimize for Instruction Level Parallelism (ILP)
  – To exploit high levels of parallelism

• Drives compilation time up significantly

• With EPIC becoming stable technology, impact of compilation time is significant.
  – Just-in-time compilation makes this more significant as well

[Figure: Performance (axis from 0 to 35) versus Compilation Unit Size, from Small (BB) to Large (Procedure), with curves for Execution Time and Compile Time.]

Focus for My Work

• One of the important steps during compiler optimization is register allocation
  – “Because of the central role that register allocation plays, it is one of the most important - if not the most important - optimization” (D. Patterson and J. Hennessy, “Computer Architecture: A Quantitative Approach”)
  – Poor register allocation will generate many memory access operations and degrade execution performance

Definitions

• Local optimization
  – Basic block is the base unit of compilation

• Global optimization
  – Compilation scope beyond basic blocks
  – Ex. function, loop

• Compilation time
  – Time required for compilation

• Execution time
  – Time required to run the compiled binary
  – Determines the “quality of code”

Containing Compilation Time

• Richard Hank, Wen-Mei Hwu and Bob Rau “Region-based compilation: an introduction and motivation”, Micro-28, 1995

• Richard Hank, “Region-based compilation”, Ph.D thesis, UIUC, 1996

• Region-based compilation provides flexible compilation unit size
  – Compilation unit is the part of the program that an optimizer works with

• The compiler is allowed to repartition the program into a new set of compilation units, called regions

• Typically, a region is constructed to get significant ILP
  – Aids scheduling (i.e. hyper-block, super-block and basic-blocks)

Example: Region-based Compilation

[Figure: regions formed over basic blocks 1 to 8 of function A and function B.]

Benefits of Region Based Compilation

• Region size is typically smaller than function size
  – Reducing the impact of the algorithmic complexity

• Use execution frequency information to select regions
  – allows the compiler to select compilation units that more accurately reflect dynamic behavior of the program
  – allows the compiler to produce more compact optimized code

• Each region is compiled completely before compilation proceeds to next region
  – All function-oriented compiler transformations may be applied before moving from one region to the next

The Research Goal

• Achieving runtime performance via region-based compilation comparable to global optimization

• With smaller compilation units and hence faster compilation

Technical Contributions of My Thesis

• Leverage frequencies already used in region-based approach
  – frequency based propagation
  – live-range splitting
  – enhanced priority

• Region restructuring based on
  – frequency
  – and register pressure as well (key to our improvements)
  – first quantitative characterization of register pressure

• Collectively achieve the goal as I will demonstrate

Background of Register Allocation

What Is Register Allocation ?

• The goal of register allocation is to minimize the number of memory accesses for the variables used in the program.
  – As many variables and temporaries as possible should be mapped to physical registers

Register Allocation: Classical Approach, Graph Coloring

• A coloring of a graph is an assignment of a color to each node of the graph such that any two nodes connected by an edge do not have the same color

• For register allocation, an interference graph is constructed
  – Each node represents a live range of each variable in the program
    • Live range is a subset or region of the program where a particular definition of that variable is live
  – Two nodes in the graph are connected if they interfere with each other and hence cannot reside in the same register

Live Ranges & Interference Graph

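[Figure: live ranges of x, y, z and w in a control flow graph (definitions x=, y=, z=, w= and uses =x, =y, =z, =w) and the corresponding interference graph over nodes x, y, z, w.]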

Two Major Coloring Approaches

• Chaitin style, 1981
  – Graph simplification driven
  – Led to Briggs’ style register allocator 1992

• Chow and Hennessy style, 1984, 1990
  – Priority based, spill cost driven
  – Our approach will build on this idea

Priority Based Register Allocation: Background

F. Chow and J. Hennessy

Framework of Priority Based Register Allocation

• Live range construction
• Interference graph construction
• Remove unconstrained
• Priority function
• Coloring decision
  – Register binding
  – Spilling
  – Live range splitting
• Spill and reconcile (shuffle) code insertion

[Figure: flowchart with the stages Build, Simplify, Compute priority, Select live range, Binding, Splitting, Spilling.]


Definition of Priority Function

• Priority function is an empirical measure of the relative benefit of assigning a live range to a register.

• Colors are assigned in order of priority.
– Giving the higher priority to the more important live range makes it more likely to be bound to a register.

• Priority function for global register allocation can be defined as:

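$$PR = \sum_{i} \left\{ d_i \times \mathrm{SAVE\_COST} + u_i \times \mathrm{LOAD\_COST} \right\} \times w_i$$

d_i: number of definitions in block i
u_i: number of uses in region i
w_i: frequency of region i

(SAVE_COST is the cost of a store; it is written STORE_COST in the example that follows.)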


Priority Function: Example

[Figure: control-flow graph. B1 (frequency 10) defines x and y; B2 (frequency 100) uses x; B3 (frequency 10) uses x and y]

• PR(x) = STORE_COST × 10 + LOAD_COST × 110

• PR(y) = STORE_COST × 10 + LOAD_COST × 10
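The two priorities above can be checked with a short Python sketch. It assumes the priority is the sum, over the blocks of a live range, of (definitions × STORE_COST + uses × LOAD_COST) weighted by the block frequency, which is how the example reads. The `priority` function, the dictionary layout, and the cost values 2 and 3 are illustrative choices, not values from the slides.

```python
def priority(blocks, store_cost, load_cost):
    """blocks: list of dicts with 'defs', 'uses', and 'freq' for one live range."""
    return sum(
        (b["defs"] * store_cost + b["uses"] * load_cost) * b["freq"]
        for b in blocks
    )

# Block data read off the example: B1 (10), B2 (100), B3 (10).
X_BLOCKS = [
    {"defs": 1, "uses": 0, "freq": 10},   # B1: x=
    {"defs": 0, "uses": 1, "freq": 100},  # B2: =x
    {"defs": 0, "uses": 1, "freq": 10},   # B3: =x
]
Y_BLOCKS = [
    {"defs": 1, "uses": 0, "freq": 10},   # B1: y=
    {"defs": 0, "uses": 1, "freq": 10},   # B3: =y
]

if __name__ == "__main__":
    # Hypothetical costs: a store costs 2, a load costs 3.
    print(priority(X_BLOCKS, store_cost=2, load_cost=3))
    print(priority(Y_BLOCKS, store_cost=2, load_cost=3))
```

Because x is used in the hot block B2, its priority is much higher than that of y, so x is the live range that should be colored first.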


Live Range Splitting

• Live range splitting
– Breaking a live range into several smaller pieces, each a separate live range, can produce an interference graph that is colorable

[Figure: splitting example over blocks B1, B2, B4, and B5, with a definition and a use of x, y bound to R1, and z bound to R2; the live range of x is split]


Frequency Driven Region Based Approach

My contributions


Roadmap of the Presentation

I. Improving compilation time and execution performance

II. Improving compilation time

Register binding with frequency-driven propagation

Priority function based on register binding propagation

Frequency-based live range splitting and re-materialization

Region restructuring based on register pressure

III. Justification of my experimental methodology


New Framework

[Figure: flowchart of the new framework. The phases Build, Simplify, Compute priority, Select live range, Coloring, Splitting, and Spilling are extended with Region restructuring, Re-materialization, Frequency-Based Splitting, Delaying, Propagation, and New priority]


I.I Register Binding With Frequency Based Propagation

Improving compilation time and execution performance

Improving compilation time

Register binding with frequency driven propagation

Priority function based on register binding propagation

Frequency based live range splitting and re-materialization

Region restructuring based on register pressure


Problems

• The limited scope of region-based register allocation can cause coloring mismatches at region boundaries
– shuffle code (compensation code) is required
– compilation time overhead
– slow execution time

• Propagation of register binding information can help.

• The propagation order and the cost of the compensation code inserted are also important
– We will explore these issues in the context of frequency


Register Binding: First Issue, Color Mismatches

• Issue
– Limited scope of region-based compilation as compared to its global counterpart

[Figure: regions B1 and B2 with the bindings y {R1} and z {R2}; y is bound to R1 in one region and to R3 in the other, so the shuffle code R3:=R1 is inserted at the boundary]

• Possible Solution
– Coloring information of a region can be propagated to the other regions


Issue 1: Frequency Based Register Binding Propagation

[Figure: control-flow graph over regions B1, B2, and B3 with block and edge frequencies annotated; x is bound to R1 in B1 and to R2 in B3, and is used in B2]

• Register binding for x is R1 in B1 while it is R2 in B3
– R1 is used for another variable in B3

• Q: Should we use R1 or R2 for x in B2?

• A: Register binding information is propagated from B3 by the control flow frequency
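The decision for B2 can be sketched as follows. This is a simplified illustration that assumes the binding for an undecided region is simply taken from the already-colored neighbor reached over the highest-frequency control-flow edge. The function name, the data layout, and the edge frequencies are hypothetical and are not read off the figure.

```python
def propagate_binding(region, edges, bindings):
    """Pick the binding of the colored neighbor on the hottest edge.

    edges: dict {(src, dst): frequency}
    bindings: dict region -> register, for regions that are already colored
    Returns the chosen register, or None when no neighbor is colored yet.
    """
    best = None  # (frequency, neighbor)
    for (src, dst), freq in edges.items():
        if region not in (src, dst):
            continue
        neighbor = dst if src == region else src
        if neighbor in bindings and (best is None or freq > best[0]):
            best = (freq, neighbor)
    return bindings[best[1]] if best else None

# Hypothetical frequencies: the B2-B3 edge is hotter than the B1-B2 edge.
EDGES = {("B1", "B2"): 50, ("B2", "B3"): 100}
BINDINGS = {"B1": "R1", "B3": "R2"}

if __name__ == "__main__":
    print(propagate_binding("B2", EDGES, BINDINGS))
```

Taking the binding from the hotter edge leaves any remaining mismatch, and therefore the shuffle code, on the colder edge.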


Register Binding: Second Issue, Pass-through Live Ranges

[Figure: regions B1, B2, and B3 with a definition and uses of x, and y defined and used with binding R1]

• Issue: Pass-through live range
– Some values are live in a region but not referenced in the region itself
– When there are not enough registers available for several pass-through live ranges, it is hard to decide which variables to spill
– Even with a large number of registers, it may be more beneficial to spill the live range

• Possible Solution
– Coloring decisions of pass-through live ranges are delayed


Issue 2: Frequency Based Propagation for Delayed Binding

[Figure: control-flow graph over regions B1 to B5 with edge frequencies annotated; x is defined in one region, used with binding R1 in one region, and used from its spill location (SP) in another]

• Q: Should we spill or use R1 for x in region B4?

• Using frequency based propagation, the register binding for x in B4 will be propagated from B3 (spilling)

• Register binding for x is delayed in B4 while x is bound to R1 in B2 and spilled in B3


Register Binding: Third Issue, Non-preferred Registers

[Figure: regions B1, B2, and B3; x is defined in B1 and used in B3, y is defined and used in B2 with binding R1, and the propagated annotations are x:{NA: R1} and y:{R1}]

• Issue
– In region B2, register R1 is used and x is delayed, so R1 is forbidden for x in B2
– When we choose a register for variable x in B1 or B3, the unavailable register information is used, so using register R1 for x can be avoided

• Possible Solution
– Unavailable registers, in addition to binding information, need to be propagated to avoid unnecessary coloring mismatches


Issue 3: Propagation of Non-preferred Registers With Frequency

[Figure: control-flow graph over regions B1 to B6 with edges e1, e2, and e3 and their frequencies; x is defined in B1 with binding R1, a is bound to R1 in B3, c is bound to R2, and x carries the annotation x:{NA: R1,R2}]

• Register R1 is used in B3 while R2 is used in B2

• Unavailable registers (R1/R2) are propagated to B1

• However, variable x needs to be bound to R1/R2 in B1, because all other registers are used.

• Q: Is x to be spilled in B2? Or bound to R1 or R2?

• A: Frequency based propagation
– Propagate each unavailable register's information together with its frequency
– x can be bound to R1 in B1 and B2, and spill code is inserted in e1


Summary of Frequency Based Propagation

• Register binding information is propagated from the higher-frequency region to the next, based on control flow frequency
– Coloring mismatches, if there are any, will be moved to the lowest-frequency point

• Pass-through live ranges are delayed until
– The neighboring live range with the highest control flow frequency is decided, or all its neighbors are processed (pessimistic delayed binding)

• Unavailable register information is propagated selectively, with the minimal-frequency point on the path


Performance Improvement of Frequency Based Propagation Compared to C&H

[Figure: bar chart. x-axis: Benchmarks (008.espresso, 023.eqntott, 072.sc, 085.gcc, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, cccp, cmp, eqn, lex, tbl, yacc); y-axis: performance improvement compared to Chow and Hennessy, 0% to 35%; series: Frequency Based]


I.II Priority Function in Region Based Register Allocation

Improving compilation time and execution performance

Improving compilation time

Register binding with frequency driven propagation

Priority function based on register binding propagation

Frequency based live range splitting and re-materialization

Region restructuring based on register pressure


Issues of Priority Function Related to Regions

• Current approaches based on priority functions consider only the scope of compilation (intra-region) for the cost of spilling

• The benefit of binding a live range to a register in a given region should take into account the register bindings in its neighboring regions


Different Priority Functions When Applied to a Region

[Figure: regions B1 (100), B2 (90), and B3 (100); x is used with binding R1 in B1 and B3; B2 defines y and uses x and y]

Priority functions for region B2

• PR(y)B2 = STORE_COST × 90 + LOAD_COST × 90

• PR(x)B2 = LOAD_COST × 90

[Figure: the same regions, with "Store x" and "Load x" inserted at the boundaries of B2]

But if x is spilled at the boundaries of B2, the actual spill cost is larger because of the region boundaries

• PR(x)B2 = STORE_COST × 90 + LOAD_COST × 90 + LOAD_COST × 90


Priority Function for Region Based Register Allocation

• Priority function should consider the shuffle cost for region boundaries.

• When a live range is bound to a register, its live-in and live-out points have an effect similar to def/use points in region based register allocation.

• Priority function is refined with neighboring live-range information using register propagation and edge frequency.


Other Difficulties With Priority Functions When Applied to a Region

[Figure: regions B1 (100), B2 (90), and B3 (100); x is used with binding R1 in B1 and B3; B2 defines y and uses x and y]

Previous priority with region consideration

• PR(y)B2 = STORE_COST × 90 + LOAD_COST × 90

• PR(x)B2 = STORE_COST × 90 + LOAD_COST × 90 + LOAD_COST × 90

[Figure: the same regions, with x bound to R1 in B1 but spilled in B3]

But if x is spilled in B3

• PR(x)B2 = STORE_COST × 10 + LOAD_COST × 90


Priority Function for Region-based Approach: Formula

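B_i(x): live range of x in region B_i
D(B_i(x)): Defs of x in B_i
U(B_i(x)): Uses of x in B_i

$$\mathrm{IN}(B_i(x)) = \{\, B_j(x) \mid (\mathrm{freq}(B_j) \;\ldots\; \mathrm{freq}(B_i)) \text{ and } B_j(x) \text{ is preceding } B_i(x) \text{ and } B_j(x) \text{ is bound to a register} \,\}$$

$$\mathrm{OUT}(B_i(x)) = \{\, B_j(x) \mid (\mathrm{freq}(B_j) \;\ldots\; \mathrm{freq}(B_i)) \text{ and } B_j(x) \text{ is following } B_i(x) \text{ and } B_j(x) \text{ is bound to a register} \,\}$$

$$PR(B_i(x)) = D(B_i(x)) \times \mathrm{STORE\_COST} + U(B_i(x)) \times \mathrm{LOAD\_COST} + \sum_{B_j(x) \in \mathrm{IN}(B_i(x))} \left(\mathrm{STORE\_COST} \times \mathrm{freq}(B_j, B_i)\right) + \sum_{B_k(x) \in \mathrm{OUT}(B_i(x))} \left(\mathrm{LOAD\_COST} \times \mathrm{freq}(B_i, B_k)\right)$$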
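The region-based priority can be sketched in Python as follows. This is a reading of the formula under stated assumptions: the def/use term is weighted by the region frequency so that it matches the earlier B1/B2/B3 example, each bound predecessor contributes a store weighted by the entry edge frequency, each bound successor contributes a load weighted by the exit edge frequency, and the frequency condition inside IN and OUT is ignored. The function name, parameters, and cost values are hypothetical.

```python
def region_priority(defs, uses, freq, in_edge_freqs, out_edge_freqs,
                    store_cost, load_cost):
    """Priority of one live range inside one region.

    defs, uses: counts of definitions and uses inside the region
    freq: execution frequency of the region (assumed weighting)
    in_edge_freqs: edge frequencies from predecessors where x is in a register
    out_edge_freqs: edge frequencies to successors where x is in a register
    """
    local = (defs * store_cost + uses * load_cost) * freq
    entry = sum(store_cost * f for f in in_edge_freqs)   # store x on entry
    exit_ = sum(load_cost * f for f in out_edge_freqs)   # reload x on exit
    return local + entry + exit_

if __name__ == "__main__":
    S, L = 2, 3  # hypothetical store and load costs
    # x in B2: no defs, one use, B2 runs 90 times.
    print(region_priority(0, 1, 90, [], [], S, L))      # intra-region view
    print(region_priority(0, 1, 90, [90], [90], S, L))  # with both boundaries
```

The second call adds one store and one load at the boundaries of B2, which is the extra cost that a purely intra-region priority misses.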


Performance Improvement of New Priority Function

[Figure: bar chart. x-axis: Benchmarks (008.espresso, 023.eqntott, 072.sc, 085.gcc, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, cccp, cmp, eqn, lex, tbl, yacc); y-axis: performance improvement, -2% to 12%]


I.III Frequency Based Splitting and Rematerialization

Improving compilation time and execution performance

Improving compilation time

Register binding with frequency driven propagation

Priority function based on register binding propagation

Frequency based live range splitting and re-materialization

Region restructuring based on register pressure


Main Idea

• Difficulties of live range splitting
– Choosing the right live ranges to split
– Finding the right places to split them

• Frequency based splitting
– In region-based RA, region construction is based on frequency
  • Therefore, frequency information is readily available
– Frequency information can guide live range splitting
  • To find a good place to split


Live Range Splitting (Previous Approaches)

• Chow and Hennessy
– Search for the split point in BFS order
– Expand while it is colorable

• Jeanne Ferrante and Mike Lake
– Splitting based on dominance
– Split points are based on the location of φ-nodes in the SSA graph

• Preston Briggs
– Loop based splitting

• None of these approaches considers execution frequency


Frequency Based Splitting: Our Approach

• For a given live range LR, start from the most frequent sub-region (the seed region) and create live range LR1

• Expand LR1 in the order of control flow edge frequency between LR1 and (LR-LR1) while
– LR1 is colorable
– The edge is the highest frequency entry/exit edge

• This helps make the split points the lowest frequency control flow edges.
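The expansion loop can be sketched in Python. This is a simplified illustration, not the implementation evaluated in these slides: the live range is assumed to be a set of block names, colorability is supplied by the caller as a predicate, and expansion stops at the first highest-frequency edge whose far end cannot be added. The names `split_live_range` and `is_colorable`, and the sample frequencies, are hypothetical.

```python
def split_live_range(blocks, freq, edges, is_colorable):
    """Grow LR1 from the hottest block along the hottest boundary edges.

    blocks: set of block names making up the live range LR
    freq: dict block -> execution frequency
    edges: dict {(src, dst): frequency} of control-flow edges
    is_colorable: callable taking a set of blocks, returning True or False
    Returns (LR1, LR - LR1).
    """
    seed = max(blocks, key=lambda b: freq[b])
    lr1 = {seed}
    while True:
        # Entry/exit edges of LR1 that stay inside the original live range.
        frontier = [
            (f, u, v) for (u, v), f in edges.items()
            if u in blocks and v in blocks and (u in lr1) != (v in lr1)
        ]
        if not frontier:
            break
        _, u, v = max(frontier)  # highest-frequency boundary edge
        candidate = v if u in lr1 else u
        if not is_colorable(lr1 | {candidate}):
            break
        lr1.add(candidate)
    return lr1, blocks - lr1

BLOCKS = {"B1", "B2", "B3", "B4"}
FREQ = {"B1": 10, "B2": 100, "B3": 90, "B4": 5}
EDGES = {("B1", "B2"): 10, ("B2", "B3"): 90, ("B3", "B4"): 5}

if __name__ == "__main__":
    # Hypothetical colorability test: at most two blocks fit.
    print(split_live_range(BLOCKS, FREQ, EDGES, lambda s: len(s) <= 2))
```

In this sample the hot blocks B2 and B3 end up together, and the split points fall on the two coldest edges.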


Rematerialization

• Rematerialization is a spill reduction technique that replaces spills and reloads with instructions that recompute values

• Rematerialization is possible for values that are easily recomputable
– Constants, address arithmetic, stack frame offsets
– P. Briggs ‘92

• Rematerialization is cheaper than a load/store operation


Frequency Based Rematerialization (FBR)

• FBR is a natural extension of live range splitting
– Rematerialize at split points only

• Optimistic approach
– Insert rematerialization code at split points even if the rematerialized code is not colorable
– Apply FBR recursively so that rematerialization can be optimized

• Advantages
– Smaller number of rematerialization instructions (the split point is the only place we need rematerialization instructions)
– The split point is a good place for rematerialization code because it is a low execution frequency point
– Rematerialization can reduce the sizes of split live ranges


Performance Improvement of FBS and FBR Compared to C&H

[Figure: bar chart. x-axis: Benchmarks (008.espresso, 023.eqntott, 072.sc, 085.gcc, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, cccp, cmp, eqn, lex, tbl, yacc); y-axis: execution performance improvement, -2% to 16%; series: SBR, FBS, FBS+FBR]


Performance of Region Based Register Allocation Compared to C&H

[Figure: bar chart. x-axis: Benchmarks; y-axis: time compared to function based, 0.2 to 1.1; series: Compile, Execution]


II. Register Pressure Sensitive Region Restructuring

Improving compilation time and execution performance

Improving compilation time

Register binding with frequency driven propagation

Priority function based on register binding propagation

Frequency based live range splitting and re-materialization

Region restructuring based on register pressure


Experiment of Register Pressure

• Register pressure ratio = (live_range_bandwidth) / (number_of_registers)
– A higher ratio means higher register pressure
– When the ratio is lower than 1, we have more registers than a perfect coloring needs


Analysis of Register Pressure

[Figure: four histograms, one each for 124.m88ksim, 072.sc, 023.eqntott, and 008.espresso; y-axis: percentage, 0% to 60% (0% to 50% for 008.espresso); x-axis values start at 0 and run up to between 3.5 and 7 depending on the benchmark]


The Effect of Compilation Unit Size

[Figure: four charts of relative performance versus region size, one each for YACC (BB), 023.eqntott (BB), 130.li, and CCCP (HB); x-axis: Region Size, from 1 up to 64, 96, or MAX; y-axis: relative performance (relative time for CCCP); series: Execution, Compile]


The Issues Related to Region Size

• As the size of the region increases, in general
– Performance of the optimized code improves
– Compilation time increases

• When a region is constructed without register pressure considerations
– It can be too large
  • High register pressure
  • Needs many live range splitting steps
  • Can degrade compilation time performance
– It can be too small
  • May need unnecessary compensation code
  • May need more propagation time


Summary: Effect of Region Size

                       Small Region      Large Region
Propagation            More effective    Less effective
Priority Function      More effective    Less effective
Live Range Splitting   Less effective    More effective


Region Restructuring: Idea

• Estimate register pressure by the number of operations.
– Limit the region size based on the register file size.

• Base regions are given regions such as hyperblocks or superblocks.
– Build new regions by grouping them together.
– Region restructuring stops its expansion whenever the next highest path is blocked.
  • A region is considered to be blocked if it has already been grouped.


Performance of Region Restructuring Compared to HB Based Regions

[Figure: bar chart. x-axis: Benchmarks; y-axis: time relative to region based, 0.82 to 1.02; series: Execution Time, Compile Time]


Summary of Contributions

• Using frequency information in region-based register allocation

• Studied the effect of the region size on allocation process

• Proposed the concept of region restructuring based on register pressure

• Accomplished considerable compilation time savings with execution time comparable to global approach


III. Experimental Methodology


Experimental Framework Based on Trimaran

• Trimaran is a compiler infrastructure for supporting state of the art research in compiling for instruction level parallel (ILP) architectures.

• Supports compiler research in optimizing compilation techniques such as instruction scheduling, register allocation, and machine-dependent optimizations.

• Collaboratively developed by
– Compiler and Architecture Research group at Hewlett Packard.
– IMPACT group at the University of Illinois.
– ReaCT-ILP laboratory at New York University and now the Center for Research in Embedded Systems and Technology (CREST) at Georgia Tech.


Stability of Experiments

• In our register allocation framework, the compilation time is limited by
– Interference graph construction: O(n^2)
– Register binding with propagation: O(nR)
– Number of live range splits: O(N)
– Computing the priority function: O(nN)
– Where O(n)=O(N), n is the number of live ranges, N is the number of operations, and R is the number of registers

• Interference graph construction and priority computation are the dominating terms: O(n^2)

• So, it is expected that compilation time will show the same trends in any register allocation implementation based on our region-based approach


Further Rationale

• Two broad philosophies
– Chow and Hennessy approach
– Chaitin’s approach and Briggs’ extension

• Chaitin-Briggs is not region-based
– Thus compile time is comparable to Chow-Hennessy

• Our frequency-based innovations are also applicable to Chaitin-Briggs

• Comparisons are done only in the Chow-Hennessy context
– A framework that uses frequency and hence is a natural starting point for our approach


Assumption

• It is assumed that the execution frequency used for region restructuring and register allocation is accurate
– We used perfect profiling


Future Work

• Region formation considering instruction scheduling as well as register allocation

• Register allocation for non-unit assumed latency (NUAL)

• Integrating instruction scheduling and register allocation


Supported by

• Hewlett Packard laboratory.

• Panasonic inc.

• NYU research challenge grant.

• DARPA contract number DADT63-96-C-0049 and 25-74100-F0944.


Register Allocation for Rotating Registers


Architectural Model

• VLIW
• Non-rotating registers
– Predicate Registers
– General Purpose Registers
• Rotating Registers
– Iteration Control Register File
– Rotating Register File
  • Indexed by the Iteration Control Pointer (ICP)
  • ICP is automatically decremented by the brtop instruction


Rotating Registers

[Figure: diagram of the rotating registers and the iteration control registers, showing the ICP together with a register specifier r and a predicate specifier p]


Why Rotating Registers

• The lifetime of a value generated by an operation in one iteration can co-exist with the lifetimes of values generated in other iterations

• ICR
– if-conversion
– allows fine control of filling and draining of the software pipeline
– side effect: reduces the size of prologues and epilogues


Notation

• r1 := r2[1] + r3[2] means: add the version of r2 produced in the previous iteration to the version of r3 produced two iterations back
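The notation can be made concrete with a small simulation. This is a sketch under stated assumptions, not a model of any particular machine: the physical register is taken to be (specifier + ICP) modulo the file size, and `brtop` is modeled only as decrementing the ICP. With that mapping, the value written to specifier r in the previous iteration is found at specifier r+1 in the current one. The class, method names, and file size are hypothetical.

```python
class RotatingRegisterFile:
    """Toy rotating register file indexed through an iteration control pointer."""

    def __init__(self, size):
        self.regs = [0] * size
        self.icp = 0

    def _physical(self, spec):
        # Assumed mapping: physical index = (specifier + ICP) mod size.
        return (spec + self.icp) % len(self.regs)

    def write(self, spec, value):
        self.regs[self._physical(spec)] = value

    def read(self, spec, back=0):
        # r[back]: the version of r written `back` iterations ago.
        return self.regs[self._physical(spec + back)]

    def brtop(self):
        # Model only the ICP decrement performed by the loop-closing branch.
        self.icp = (self.icp - 1) % len(self.regs)

def running_sum(values, size=8):
    """Evaluate s := a + s[1] once per iteration, as in r35 := r34 + r35[1]."""
    rf = RotatingRegisterFile(size)
    s = 3                 # hypothetical specifier for s
    rf.write(s + 1, 0)    # seed the "previous iteration" version of s
    for a in values:
        rf.write(s, a + rf.read(s, back=1))
        rf.brtop()
    return rf.read(s, back=1)

if __name__ == "__main__":
    print(running_sum([1, 2, 3, 4]))
```

Each iteration writes a fresh physical register for s, so values from several iterations can be live at once without explicit copies.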


Example

subroutine foo(a,s)
real a(10), s
do i = 1,35
  s = s + a(i)
  a(i) = s*s*a(i)
enddo
stop
end

II=2

time  operation
0     r34 := mem[r33[1]]
13    r35 := r34 + r35[1]
15    r36 := r35 * r35
18    r37 := r36 * r34
20    mem[r33[1]] := r37
0     r33 := r33[1] + 4
0     r39 := brtop


Lifetimes of Loop Variants

• Lifetimes are specified as (start,end,omega,alpha)

• Start - the issue time of a producer of the value

• End - the latest completion time of the consumer of the value

• Omega - number of iterations spanned by this value

• Alpha - number of iterations live out from the end of the loop


Lifetimes of Loop Variants

Register     Start  End  Omega  Alpha
r35 (s)      13     16   1      1
r37          18     20   0      0
r36          15     19   0      0
r34 (a(i))   0      19   0      0
r33 (a)      0      22   1      0

time  operation
0     r34 := mem[r33[1]]
13    r35 := r34 + r35[1]
15    r36 := r35 * r35
18    r37 := r36 * r34
20    mem[r33[1]] := r37
0     r33 := r33[1] + 4
0     r39 := brtop


Vector Lifetime

[Figure: a vector lifetime drawn in the plane of time versus registers, with the labels livein, liveout, leading blade, trailing blade, diagonal wand, and II]


Register Allocation

• For loop variants only

• If a rotating register is allocated to physical register r in the first iteration, then iteration i writes to register r-i+1

• Bin-packing problem for vector lifetimes

• Encoded using distance matrix


Distance Matrix

• One row and one column for each vector lifetime

• Given a matrix DIST, DIST[A,B] denotes the minimal register distance allowable between A and B


Computing the Distance Matrix

• d1 = ⌈(end(A) - start(B)) / II⌉
• d2 = d1 if omega(B) = 0
• d2 = max(d1,omega(A)) otherwise
• d3 = d2 if alpha(A) = 0
• d3 = max(d2,alpha(A)) otherwise

• DIST[A,B] = d3
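The rules above translate directly into Python. This sketch assumes that d1 is the ceiling of (end(A) - start(B)) / II, so that the result is a whole number of registers, and it reuses the lifetimes from the earlier example with II = 2. The function names and the tuple layout (start, end, omega, alpha) are illustrative choices.

```python
import math

def dist(a, b, ii):
    """Minimal register distance DIST[A,B] for lifetimes a and b.

    a, b: tuples (start, end, omega, alpha); ii: initiation interval.
    """
    start_a, end_a, omega_a, alpha_a = a
    start_b, end_b, omega_b, alpha_b = b
    d1 = math.ceil((end_a - start_b) / ii)   # assumed ceiling, see lead-in
    d2 = d1 if omega_b == 0 else max(d1, omega_a)
    d3 = d2 if alpha_a == 0 else max(d2, alpha_a)
    return d3

def distance_matrix(lifetimes, ii):
    """One entry per ordered pair of distinct lifetimes."""
    return {
        (a, b): dist(lifetimes[a], lifetimes[b], ii)
        for a in lifetimes for b in lifetimes if a != b
    }

# Lifetimes from the example: (start, end, omega, alpha).
LIFETIMES = {
    "r35": (13, 16, 1, 1),
    "r37": (18, 20, 0, 0),
    "r36": (15, 19, 0, 0),
    "r34": (0, 19, 0, 0),
    "r33": (0, 22, 1, 0),
}

if __name__ == "__main__":
    DIST = distance_matrix(LIFETIMES, ii=2)
    for pair in sorted(DIST):
        print(pair, DIST[pair])
```

The matrix is not symmetric in general: DIST[A,B] and DIST[B,A] describe placing B above A and A above B, respectively.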


Algorithm Framework

• Given a set of register lifetimes, create a feasible schedule

procedure allocate;
  order by criteria;
  lifetimes[1].location := 0;
  for lt := 2 to number of lifetimes do
    update the set of disallowed allocations for every unallocated lifetime;
    fit(selectedLT, selectedLocation);
    lifetimes[selectedLT].location := selectedLocation;


Discussion

• Criteria - order by
  • start time
  • conflict (conflict[i,j] = dist[i,j] + dist[j,i] - 1)
    – choose the lifetime with the smallest total conflict
  • adjacency ((start[B] - end[A]) + dist[A,B] * II)
    – A is last allocated
    – choose B such that adjacency is minimized


Discussion cont.

• Fit algorithm can be
– best fit
– first fit
– end fit

• Best fit tries to minimize the number of registers needed at each step

• First fit chooses the lowest location such that the next selected lifetime can be allocated

• End fit chooses the closest location from the last lifetime such that the next selected lifetime can be allocated
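A first-fit allocator over the distance matrix can be sketched as follows. This is a simplified illustration under several assumptions: lifetimes are ordered by start time, locations are non-negative integers with no wrap-around of the rotating file, d1 is taken as a ceiling, and B may be placed relative to an allocated A when either loc(B) - loc(A) ≥ DIST[A,B] or loc(A) - loc(B) ≥ DIST[B,A]. The function names and data layout are hypothetical.

```python
import math

def dist(a, b, ii):
    """DIST[A,B] for lifetimes given as (start, end, omega, alpha)."""
    d1 = math.ceil((a[1] - b[0]) / ii)        # assumed ceiling
    d2 = d1 if b[2] == 0 else max(d1, a[2])
    return d2 if a[3] == 0 else max(d2, a[3])

def fits(name, loc, location, lifetimes, ii):
    """True if `name` at `loc` respects the distances to all allocated lifetimes."""
    return all(
        loc - location[other] >= dist(lifetimes[other], lifetimes[name], ii)
        or location[other] - loc >= dist(lifetimes[name], lifetimes[other], ii)
        for other in location
    )

def first_fit(lifetimes, ii):
    """Allocate in start-time order, each at the lowest feasible location."""
    order = sorted(lifetimes, key=lambda n: lifetimes[n][0])
    location = {}
    for name in order:
        loc = 0
        while not fits(name, loc, location, lifetimes, ii):
            loc += 1
        location[name] = loc
    return location

LIFETIMES = {
    "r35": (13, 16, 1, 1),
    "r37": (18, 20, 0, 0),
    "r36": (15, 19, 0, 0),
    "r34": (0, 19, 0, 0),
    "r33": (0, 22, 1, 0),
}

if __name__ == "__main__":
    print(first_fit(LIFETIMES, ii=2))
```

Swapping the inner search for one that minimizes the highest location used so far would give a best-fit variant; starting the search at the location of the last allocated lifetime would approximate end fit.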


Additional Reading:

1. “URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures”, D. Berson, R. Gupta, and M. L. Soffa, Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, IFIP Transactions A-23, 243-254, January 1993.

2. “Rematerialization”, P. Briggs, K.D. Cooper, and L. Torczon, Proceedings of the SIGPLAN-92 Conference on Programming Language Design and Implementation, SIGPLAN Notices 27(7), 311-321, July 1992.


Additional Reading:

3. “Improving Register Allocation for Subscripted Variables”, D. Callahan, S. Carr and K. Kennedy, Proceedings SIGPLAN-90 Conference on Programming Language Design and Implementations, 53-65, 1990.

4. “Register Allocation via Hierarchical Graph Coloring”, D. Callahan and B. Koblenz, Proceedings SIGPLAN-91 Conference on Programming Language Design and Implementation, 192-203, June 1991.

5. “A Portable Machine Independent Global Optimizer--Design and Measurements”, F. Chow, Ph.D. Thesis, Tech. Rep. 83-254, Stanford University, 1983.


Additional Reading:

6. “Register Allocation”, F. Chow, K. Knobe, A. Meltzer, R.Morgan and K. Zadeck, in Optimizing Compilers, F. Allen, B. Rosen and K. Zadeck Eds. ACM Press and Addison-Wesley, to appear.

7. “Register Allocation via Usage Counts”, R. Freiburghouse, Communications of the ACM, vol. 17, 638-642, 1974.

8. “Code Scheduling and Register Allocation in Large Basic Blocks”, J. Goodman and W. Hsu, Proceedings of ACM Conference on Supercomputing, 442-452, 1988.


Additional Reading:

9. “Efficient Instruction Scheduling for Delayed-Load Architectures”, S. Kurlander, T. Proebsting and C. Fischer, ACM Transactions on Programming Languages and Systems, vol. 17, no 5., 740-776, 1995.

10. “Combining Register Allocation and Instruction Scheduling”, R. Motwani, K.V. Palem, V. Sarkar, S. Reyen, TR 698, Courant Institute, NYU, July 1995.

11. “A Scheduler-Sensitive Global Register Allocator”, C. Norris and L. Pollock, Supercomputing, November 1993. Portland, Oregon.


Additional Reading:

12. “Register allocation with instruction scheduling: a new approach”, S.S. Pinter, Proceedings SIGPLAN-93 Conference on Programming Language Design and Implementation, June 1993.

13. “Linear-time optimal code scheduling for delayed-load architectures”, T. Proebsting and C. Fischer, Proceedings SIGPLAN-91 Conference on Programming Language Design and Implementation, 256-267, June 1991.

14. “The generation of optimal code for arithmetic expressions”, R. Sethi and J. Ullman, Journal of the ACM, vol 17, no 4, 715-728, 1970.


Additional Reading:

15. “Register Allocation for Software Pipelined Loops”, B. R. Rau, M. Lee, P. P. Tirumalai, M. S. Schlansker, PLDI 92