40
Spring 2014 Jim Hogg - UW - CSE - P501 P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global allocation Register coloring

Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Embed Size (px)

Citation preview

Page 1: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Spring 2014 Jim Hogg - UW - CSE - P501 P-1

CSE P501 – Compiler Construction

Register allocation constraints

Local allocation Fast, but poorer code

Global allocation Register coloring

Page 2: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Spring 2014 Jim Hogg - UW - CSE - P501 P-2

k

IR typically assumes infinite number of virtual registers

Real machine has k physical registers available x86: k = 8 x64: k = 16

Goals Produce correct code that uses k or less registers Minimize added loads and stores Minimize space needed for spilled values Do this efficiently – O(n), O(n log n), maybe O(n2)

Page 3: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Spring 2014 Jim Hogg - UW - CSE - P501 P-3

Register Allocation

At each point in the code, pick which values to keep in registers

Insert code to move values between registers and memory

No additional transformations – scheduling already ran

But will usually re-run scheduling afterwards (to include the effects of spilling)

Minimize inserted save/restores ("spill" code)

Page 4: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Spring 2014 Jim Hogg - UW - CSE - P501 P-4

Allocation vs Assignment

Allocation: which values to keep in registers?

Assignment: which specific registers?

Compiler must do both

Page 5: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Spring 2014 Jim Hogg - UW - CSE - P501 P-5

Applied to basic blocks (ie: "local")

Produces good register usage inside a block But can have inefficiencies at boundaries between

blocks

Two variations: top-down, bottom-up

Local Register Allocation

Page 6: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

"Top-down" Local Allocation

We assume a RISC target machine ("load/store architecture") load operands into two registers (if not already there) perform instruction (eg: add, lshift) store result back to memory (if required) ie: operands for instructions MUST be in registers, rather than memory (note, by contrast, that x86 allows operands directly from memory)

Principle: keep most heavily used values in registers Priority = # of times virtual-register used (destination and/or source) in

block

If more virtual registers than physical (the common case) Reserve some registers for values allocated to memory

Need enough to address and load two operands and store result (R1 = R2 + R3)

Other registers dedicated to “hot” values But are tied up for entire block with particular value, even if only needed for

part of the block

Spring 2014 Jim Hogg - UW - CSE - P501 P-6

Page 7: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

"Bottom-up" Local Allocation, 1

Keep a list of free physical registers initially all physical registers at beginning of block except frame-pointer, stack-pointer - pre-allocated

Scan code, instruction-by-instruction Allocate a register when one is needed

if one is free, allocate it otherwise, pick a victim - eg: VR3 using R7 save contents of R7 allocate R7 to requesting virtual register

Emit instructions to use physical register Free physical register as soon as possible

In x = y op z, free y and z if no longer needed before allocating x

Spring 2014 Jim Hogg - UW - CSE - P501 P-7

Page 8: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Bottom-up Local Allocation, 2

How to pick a victim?

Scan ahead thru code Look for next use of each currently-allocated-to virtual

register Pick the virtual register whose next use is furthest off.

Why? Save the victim register's to memory (restore later)

Spring 2014 Jim Hogg - UW - CSE - P501 P-8

Virtual RegisterVR

Physical Register VR Next Use

Instruction

14 3 5

23 2 8

2 4 5

5 1

Example: CPU with 4 available registers

Page 9: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Local "bottom-up" Register Allocation, -1

1. ; load v2 from memory2. ; load v3 from memory3. v1 = v2 + v34. ; load v5, v6 from

memory5. v4 = v5 - v66. v7 = v2 - 297. ; load v9 from memory8. v8 = - v99. v10 = v6 * v4 10. v11 = v10 - v3

Spring 2014 Jim Hogg - UW - CSE - P501 P-9

• Still in LIR. So lots (too many!) virtual registers required (v2, etc).• Grey instructions (1,2,4,7) load operands from memory into virtual

registers.• We will ignore these going forward. Focus on mapping virtual to

physical.

Page 10: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Local "bottom-up" Register Allocation, 0

1. v1 = v2 + v32. v4 = v5 - v63. v7 = v2 - 294. v8 = - v95. v10 = v6 * v4 6. v11 = v10 - v3

Spring 2014 Jim Hogg - UW - CSE - P501 P-10

v1 1v2 1v3 1v4 2v5 2v6 2v7 3v8 4v9 4v10 5v11 6

vReg NextRef

R1 -R2 -R3 -R4 -

pReg vReg

Page 11: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Local "bottom-up" Register Allocation, 1

1. v1 = v2 + v32. v4 = v5 - v63. v7 = v2 - 294. v8 = - v95. v10 = v6 * v4 6. v11 = v10 - v3

Spring 2014 Jim Hogg - UW - CSE - P501 P-11

v1 1 v2 1 3v3 1 6v4 2v5 2v6 2v7 3v8 4v9 4v10 5v11 6

vReg NextRef

R1 v2R2 v3R3 v1R4 -

pReg vReg

R3 = R1 + R2

Page 12: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Local "bottom-up" Register Allocation, 2

1. v1 = v2 + v32. v4 = v5 - v63. v7 = v2 - 294. v8 = - v95. v10 = v6 * v4 6. v11 = v10 - v3

Spring 2014 Jim Hogg - UW - CSE - P501 P-12

v1 v2 3v3 6v4 2 5v5 2 v6 2 5v7 3v8 4v9 4v10 5v11 6

vReg NextRef

R1 v2R2 v3 v4R3 v1 v6R4 v5

pReg vReg

R3 = R1 + R2; spill R3; spill R2? - no - still cleanR2 = R4 - R3

Page 13: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Local "bottom-up" Register Allocation, 3

1. v1 = v2 + v32. v4 = v5 - v63. v7 = v2 - 294. v8 = - v95. v10 = v6 * v4 6. v11 = v10 - v3

Spring 2014 Jim Hogg - UW - CSE - P501 P-13

v1 v2 3 v3 6v4 5v5 v6 5v7 3 v8 4v9 4v10 5v11 6

vReg NextRef

R1 v2R2 v4R3 v6R4 v5 v7

pReg vReg

And so on . . .

R3 = R1 + R2; spill R3; spill R2? - no!R2 = R4 - R3; spill R4? - no!R4 = R1 - 29

Page 14: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Spring 2014 Jim Hogg - UW - CSE - P501 P-14

Bottom-Up Allocator

Invented about once per decade Sheldon Best, 1955, for Fortran I Laslo Belady, 1965, for analyzing paging algorithms William Harrison, 1975, ECS compiler work Chris Fraser, 1989, LCC compiler Vincenzo Liberatore, 1997, Rutgers

Will be reinvented again, no doubt

Many arguments for optimality of this

Page 15: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Standard technique is graph coloring

Use control-graph and dataflow-graph to derive interference graph

Nodes are live ranges (not registers!) Edge between nodes when live at same time Connected nodes cannot use same register

Then color the nodes in the graph Two nodes connected by an edge may not have same color (ie:

be allocated to same physical register) If more than k colors are needed, insert spill code

Spring 2014 Jim Hogg - UW - CSE - P501 P-15

Global Register Allocation

Page 16: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Graph Coloring, 1

Spring 2014 Jim Hogg - UW - CSE - P501 P-16

1

32 4

5

• Assign a color to each node, such that

• No adjacent nodes have same color

Page 17: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Graph Coloring, 2

Spring 2014 Jim Hogg - UW - CSE - P501 P-17

1

32 4

5

• 2 colors are enough for this graph• called a 2-coloring• this graph's chromatic number = 2

Page 18: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Graph Coloring, 3

Spring 2014 Jim Hogg - UW - CSE - P501 P-18

1

32 4

5

1

32 4

5

small changes forces a 3rd color

Page 19: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Live Ranges

1 a = ...2 b = ...3 c = ...4 a = a + b + c5 if a < 10 then6 d = c + 97 print(c)8 elseif a < 20 then9 e = 1010 d = e + a11 print(e)12 else13 f = 1214 d = f + a15 print(f)16 endif17 print(d)

Spring 2014 Jim Hogg - UW - CSE - P501 P-19

• Live Range of a variable is between a def and all subsequent possible uses

• if a < 10, then a is live lines 1..5• if a < 20, then a is live lines 1..10• otherwise, a is live lines 1..14

• Mmm - clearly nonsense! We need a flowgraph for a correct specification

• Note: line 14: a goes dead on RHS; d comes live on LHS. So death and birth happen 'half-way thru' the instruction

• This example is simplified: variables are never re-def'd. And there are no temps

Page 20: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Live Ranges in Flowgraphs

a = ...b = ...c = ...a = a + b + cif a < 10 then d = c + 9 print(c)elseif a < 20 then e = 10 d = e + a print(e)else f = 12 d = f + a print(f)endifprint(d)

Spring 2014 Jim Hogg - UW - CSE - P501 P-20

a = ...b = ...c = ...a = a + b + c

a < 10

a < 20d = c + 8print(c)

e = 10d = e + aprint(e)

f = 12d = f + aprint(f)

print(d)

ab

c

Page 21: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Register Interference Graph

Spring 2014 Jim Hogg - UW - CSE - P501 P-21

f

e

ba

d c

• If live range of 2 variables overlap, or interfere, they cannot use same register

• Eg: a and b cannot both be stored in register R5• Eg: c and e can be stored in register R5 (at different, non-overlapping

times)

• How many registers will we need? - Guesses?

Page 22: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Coloring

Spring 2014 Jim Hogg - UW - CSE - P501 P-22

f

e

ba

d c

• Problem: color this interference graph, where each color represents a different physical register

• Solution: simplify graph: recursively remove easiest nodes (lowest degree)

NodeDegreea 3

b 2c 3d 3e 2f 2

Page 23: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Remove b

Spring 2014 Jim Hogg - UW - CSE - P501 P-23

f

e

ba

d c

NodeDegreea 2

b 2c 2d 3e 2f 2

Removed: b

Page 24: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Remove a

Spring 2014 Jim Hogg - UW - CSE - P501 P-24

f

e

ba

d c

NodeDegreea 2

b 2c 1d 3e 1f 1

Removed: b a

Page 25: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Remove c

Spring 2014 Jim Hogg - UW - CSE - P501 P-25

f

e

ba

d c

NodeDegreea 2

b 2c 1d 2e 1f 1

Removed: b a c

Page 26: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Remove e

Spring 2014 Jim Hogg - UW - CSE - P501 P-26

f

e

ba

d c

NodeDegreea 2

b 2c 1d 2e 1f 1

Removed: b a c e

Page 27: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Color!

Spring 2014 Jim Hogg - UW - CSE - P501 P-27

f

e

ba

d c

Removed: b a c e

Page 28: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Add e

Spring 2014 Jim Hogg - UW - CSE - P501 P-28

f

e

ba

d c

Removed: b a c

Page 29: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Add c

Spring 2014 Jim Hogg - UW - CSE - P501 P-29

f

e

ba

d c

Removed: b a

Page 30: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Add a

Spring 2014 Jim Hogg - UW - CSE - P501 P-30

f

e

ba

d c

Removed: b

Page 31: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Interference Graph - Add b

Spring 2014 Jim Hogg - UW - CSE - P501 P-31

f

e

ba

d c

• 3 colors are enough• eg: red = R1, green = R2, blue = R3

• If less registers available - need to spill

Page 32: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Live Ranges: More Realistic Example1. varp = ...

2. vw = [varp + 0]

3. v2 = 2

4. vx = [varp + @x]

5. vy = [varp + @y]

6. vz = [varp + @z]

7. vw = vw * v2

8. vw = vw * vx

9. vw = vw * vy

10. vw = vw * vz

11. [varp + 0] = vw

varp 1..11

vw 2..7

vw 7..8

vw 8..9

vw 9..10

vw 10..11

v2 3..7

vx 4..8

vy 5..9

vz 6..10

P-32

RegisterInterval

In this example, virtual registers are re-def'd. Eg: vw participates in 5 live ranges Eg: vw [7..8] interferes with vx [4..8] but not vw [9..10]But no control flow!

Page 33: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Build Live Ranges, 1

Use dataflow information to build interference graph

Nodes = live ranges Add an edge for each pair of live ranges that overlap

Beware copy instructions: rx = ry does not create interference between rx and ry : can use same register if the ranges do not otherwise

interfere

Spring 2014 Jim Hogg - UW - CSE - P501 P-33

Page 34: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Build Live Ranges, 2

Construct the interference graph

Find live ranges – SSA

Build SSA form of IR Live range initially includes single SSA name' At a -function, form union of live ranges Either rewrite code to use live range names or keep a mapping

between SSA names and live-range names

Spring 2014 Jim Hogg - UW - CSE - P501 P-34

Page 35: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Simplify & Color Algorithm

Spring 2014 Jim Hogg - UW - CSE - P501 P-35

while graph cannot be colored remove easiest node (fewest neighbors = lowest degree) if degree of easiest node > k insert spill code remember that node into stack recalculate degree of each remaining node

Simplify

color the sole remaining nodewhile stack is non-empty pop node from stack color newly expanded graph

Color

Page 36: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Coalescing Live Ranges

Idea: if two live ranges are connected by a copy operation (rx = ry) do not otherwise interfere, then coalesce

Rewrite all references to rx to use ry

Remove the copy instruction

Then fix up interference graph

Spring 2014 Jim Hogg - UW - CSE - P501 P-36

Page 37: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Advantages of Coalescing?

Makes the code smaller, faster (no copy operation)

Shrinks number of live ranges

Reduces the degree of any live range that interfered with both live ranges rx and ry

But: coalescing two live ranges can prevent coalescing of others, so ordering matters

coalesce most frequently executed ranges first (eg: inner loops)

Can have a substantial payoff – do it!

Spring 2014 Jim Hogg - UW - CSE - P501 P-37

Page 38: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Overall Structure

Spring 2014 Jim Hogg - UW - CSE - P501 P-38

Find live

ranges

Build int.

graph

Coalesce

Spill Costs

Find Coloring

Insert Spills

No Spills

More Coalescing Possible

Spills

Page 39: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Real-Life Complications

Need to deal with irregularities in the register set

Some operations require dedicated registers Eg: idiv in x86 Eg: split address/data registers in M68k

Register conventions like function results Eg: use of registers across calls (return in EAX)

Different register classes Eg: integer, floating-point, SIMD

Overlapping registers Eg: x86 AL, AH, AX, EAX, RAX

Model by precoloring nodes, adding constraints in the graph, etc.

Spring 2014 Jim Hogg - UW - CSE - P501 P-39

Page 40: Spring 2014Jim Hogg - UW - CSE - P501P-1 CSE P501 – Compiler Construction Register allocation constraints Local allocation Fast, but poorer code Global

Graph Representation

Recall: Register allocation is one of the most important optimizations Finding optimum solution for typical interference graph will take a

million years So we use heuristics to find a good solution a million times each day

Interference graph representation drives the time & space requirements for the allocator (may even dominate the entire compiler)

Not unknown to have ~5k nodes and ~1M edges

Spring 2014 Jim Hogg - UW - CSE - P501 P-40