Dealing With Register Hierarchies -...

Dealing with Register Hierarchies

FP Register

Matthias Braun (MatzeB) / LLVM Developers' Meeting 2016

r0,r1,r2,r3

r1,r2,r3,r4 r2,r3,r4,r5

r3,r4,r5,r6 r4,r5,r6,r7

r5,r6,r7,r8 r6,r7,r8,r9

4 Tuple Class

r0;r1;r2;r3r1;r2;r3;r4

r1;r2;r3r2;r3

r3r3;r4

r2;r3;r4r3;r4;r5

Register Allocation• Rewrite program with unlimited number of virtual registers to use

actual registers

• Techniques: Interference Checks, Assignment, Spilling, Splitting, Rematerialization

%0 = const 5 %1 = const 7 %2 = add %0, %1 return %2

r0 = const 5 r1 = const 7 r0 = add r0, r1 return r0

Register Allocation for GPUs• Hundreds of registers available, but using fewer increases

parallelism

• Mix of Scalar (single value) and Vector (multiple values) operations

• Load/Store instructions work on multiple registers(high latency, high throughput)

r[0:3] = load_x4 # Load r0, r1, r2, r3 r4 = add r0, 1 r5 = add r1, 2 r6 = add r2, 3 r7 = add r3, 4 store_x4 r[4:7] # Store r4, r5, r6, r7

Liveness Tracking• Linearize program

• Number instructions consecutively (SlotIndexes)

b1: %1 = const 5 jmp b3

%0 = def cmp ... jeq b2

b2: store %0 %1 = def

b3: 2% = add %1, 1

Liveness Tracking

b3: 2% = add %1, 1

SlotIdx0 1 2

• Linearize program

Liveness Tracking• Linearize program

• Liveness as sorted list of intervals (segments)

SlotIdx%0 %1 %2

… %1: [4:6)[8:9)[9:10) …

b3: 2% = add %1, 1

Modeling Register Hierarchies

r[0:3] = load_x4 r4 = add r0, 1 r5 = add r1, 2 r6 = add r2, 3 r7 = add r3, 4 store_x4 r[4:7]

Tuple Registers

%0,%1,%2,%3 = load_x4 %4 = add %0, 1 %5 = add %1, 2 %6 = add %2, 3 %7 = add %3, 4 store_x4 %4,%5,%6,%7

Tuple Registers

%0,%1,%2,%3 = load_x4 %4 = add %0, 1 %5 = add %1, 2 %6 = add %2, 3 %7 = add %3, 4 store_x4 %4,%5,%6,%7

❌ No relation between virtual registers but need to be consecutive

Tuple Registers

%0 = load_x4 %1.sub0 = add %0.sub0, 1 %1.sub1 = add %0.sub1, 2 %1.sub2 = add %0.sub2, 3 %1.sub3 = add %0.sub3, 4 store_x4 %1

Tuple Registers

r0,r1,r2,r3

r1,r2,r3,r4 r2,r3,r4,r5

r3,r4,r5,r6 r4,r5,r6,r7

r5,r6,r7,r8 r6,r7,r8,r9

4 Tuple Class

• Register class contains tuples

• Allocator picks a single (tuple) register

• Parts called subregisters or lanes

• Select parts with subregister index (.xxx Syntax)

Construction%0 = load %1 = const 42 %2 = reg_sequence %0, sub1, %1, sub0 store_x2 %2

• reg_sequence defines multiple subregisters (for SSA)(there is also insert_subreg, extract_subreg)

%0 = load %1 = const 42 %2.sub0<undef> = copy %0 %2.sub1 = copy %1 store_x2 %2

• TwoAddressInstruction pass translates to copy sequence

%0 = load %1 = const 42 %2.sub0<undef> = copy %0 %2.sub1 = copy %1 store_x2 %2

• RegisterCoalescing pass eliminates copies

%2.sub0<undef> = load %2.sub1 = const 42 store_x2 %2

• TwoAddressInstruction pass translates to copy sequence

Improving Register Allocation

Subregister Liveness

Can allocate v0 and v1 to the same register tuple

%0sub0 sub1 sub2 sub3

%1sub0 sub1 sub2 sub3

Subregister Liveness: Lane Masks• Lane Mask: 1 bit per subregister

• Annotate subregister liveness parts with lane mask

• Start with whole virtual register; Split and refine as necessary

Lane Masks:sub0: 0b0001 sub1: 0b0010 sub2: 0b0100

sub1_sub2: 0b0110 sub3: 0b1000 all: 0b1111

%0 = load_x4 store_x4 %0

%1 = load_x4 %1.sub0 = const 13 %1.sub3 = const 42 store_x4 %1

Subregister Liveness: Lane Masks• Lane Mask: 1 bit per subregister

%01111

Subregister Liveness: Lane Masks

Lane Mask:

• Lane Mask: 1 bit per subregister

01010001 1000Lane Mask:%01111

Subregister Liveness: Lane Masks

Assignment Heuristics

• Default: Assign in program order

To Assignr0 r1 r2 r3 r4 r5

Assignment Heuristics

r0 r1 r2 r3 r4 r5

Assignment HeuristicsTo Assign

r0 r1 r2 r3 r4 r5

• Wide pieces may not fit in holes left by small ones

r0 r1 r2 r3 r4 r5

• Tweak: Prioritize bigger classes

r0 r1 r2 r3 r4 r5

Interference Checks: Register Units• Tuples multiply number of registers

• Interference check of single register in target with 1-10 tuples:45 aliases!

r0,r1,r2,r3 r1,r2,r3,r4 r2,r3,r4,r5 r3,r4,r5,r6

r1,r2,r3

r2,r3,r4 r3,r4,r5

Interference Checks: Register Units

• Each register mapped to one or more units:Registers alias iff they share a unit

• Liveness/Interference checks of actual registers uses register units

u0 u1 u2 u3 u4 u5

r0;r1;r2;r3

r1;r2;r3;r4

r1;r2;r3

r2;r3;r4

r3;r4;r5

Usage, Results, Future Work

Use in LLVM

• Declare Subregister Indexes + Subregisters in XXXRegisterInfo.td

• TableGen computes register units and combined subregister indexes/classes

• Enable fine grained liveness tracking by overriding TargetSubtargetInfo::enableSubRegLiveness()

• AllocationPriority part of register class specification

Results: Apple GPU Compiler

• Compared various benchmarks and captured application shaders

• Average 20% reduction in register usage (-6% up to 50%)!

• Speedup 2-3% (-4% up to 70%)

Results: AMDGPU Target

Future Work

• Support partially dead/undef operands

• Early splitting and rematerialization (before register limit)

• Partial registers spilling

• Consider partial liveness in register pressure tracking

• Missed optimizations (no obvious use/def relation for lanes)

Thank You for Your Attention!

Thank you for your attention!

Backup Slides

Register Hierarchies• CPU registers can overlap. Partial register accessible by subregister.

Also called lanes (Vector Regs)

X86 GP Register

ARM FP Registermovw 0xABCD, %ax # Put 16bits into %ax movb %al, x # Uses lower 8 bits: 0xCD movb %ah, y # Uses upper 8 bits: 0xAB

Register Allocation Pipeline

PHIElimination

TwoAddressInstruction

RegisterCoalescer

ProcessImplicitDefs

DetectDeadLanes

RenameIndependentSubregs

MachineScheduler

RegAllocGreedy

VirtRegRewriter

StackSlotColoring

Subregister Indexes• Subregister indexes relate wide/

small registers on virtual registers

• Writes may be marked undef if other parts of register do not matter

• LLVM synthesizes combined indexes (`sub0_low16bits`)

Register Allocation:

%0 = load_x4 %1.sub0<undef> = add %0.sub2, 13 %1.sub1 = const 42 store_x2 %1

r4_r5_r6_r7 = load_x4 r0 = add r6, 13 r1 = const 42 store_x2 r0_r1

Slot Indexes• Position in a program; Each instruction is assigned a number (incremented

by 4 so we need to renumber less often when inserting instructions)

• Slots describe position in the instruction:

• Block/Base (Block begin/end, PHI-defs)

• EarlyClobber (early point to force interference with normal def/use)

• Register (normal def/uses use this)

• Dead (liveness of dead definitions ends here)

Constraints & Classes• A register class is set of registers; Models register constraints

• Class defined for each register operand of LLVM MI Instruction (MCInstrDesc)

• Each virtual register has a class EAX EBX

ECX EDX ESI

ESP EBPEDI

GR32 class

R8 ...

Dealing With Register Hierarchies -...

Documents

Benefit Guide · Key Benefits OUT OF HOSPITAL BENEFITS MATERNITY BENEFITS ... Income Bracket Main Member Adult Child R0 – R7 540 R902 R892 R376 R7 541 – R8 796 R1 258 R1 258 R456

for Ghidra Implementing a New CPU Architecture · # MOV Rn,Rm - 0000_nnnn_mmmm_0000 define register offset=0 size=4 [ r0 r1 ]; define token instr(16)

2010 - Weierstrass Institute · const vec3s& p1, const vec3s& p2, const vec3s& p3, const vec3s& p4, const Expression& u, const Expression& v, ... 2010 Author: Terrence Masson Created

const eqns_PZT

COMP 122/L Lecture 24 Output = Input1 + Input2 mov r0, #5 add r1, r0, r0-...however, adding numbers is only one part of the problem-For a processor, we need a way to store values in

Application of the R0-R3 formulas using the ECToolbox ...ECToolbox R0, R1, R2 and R3 were compared with those obtained by ERNV. Using LVEF≥50% obtained by ERNV as the gold standard,

32x16 and 32x32 RGB LED Matrix - Mouser Electronics · R1/G1/B1/R2/G2/B2 are changed to R0/G0/B0/R1/G1/B1… but again, no functional difference, it’s just ink. Our earliest 32x32

History of Version - dena-de.com · 28 R6 data bit r6 29 R5 data bit r5 30 R4 data bit r4 31 R3 data bit r3 32 R2 data bit r2 33 R1 data bit r1 34 R0 data bit r0 35 VSS Ground 36

:t=- S CB Q S t-.tc,tq Ex-Tc s lt* · a) EOR R1, R0, f3 b) MLA R4, R3, R7, R8 cICMP R0, R1 d) ADD R0, R2, R3, tSL F1 e) MVN R0,ll4 Q5 B) Draw and explain internal structure of portO

MXIC MX10F201FC Data Sheet - Keil pointers are R0 and R1 of the selected register bank. - RAM 128 to 255 can only be addressed indirectly . Address pointers are R0 and R1 of the selected

JSR 292 on Android · invoke-frame-mh rv1 m4 r0 1 // call f2i const r0 rv1 :end invoke-frame-mh 0 null r0 1 // return} gwt.nb_r = 1 gwt.nb_rv = 2 code + 8

Analytical Toolbox for Smart City Applications: Garbage Collection … · 2020-04-09 · Preparation: Implicit Order Join r0 r1 r2 r3 1 1 10 d 2 4 20 d 1 2 30 e 2 3 50 e rank of r1

151455000 GUIA USO IAM SP2230BT-SP2240BT BL R0...151455001 GUIA USO IAM SP2230BT-SP2240BT BL R1.pdf 2 04/09/18 17:42 Title 151455000 GUIA USO IAM SP2230BT-SP2240BT BL R0 Created Date

Central Processing Unit - Neff · 2017-03-14 Module 4.1.2 –Central Processing Unit 3 Operand Fetch Source R1, R2 + R0 Execute Operation ... •Pipelining •Example –SUN SPARC

The HD7970 and Graphics Core Next - Uppsala University · v_mul_f32 r2,r2,r1 label1: s_mov_b64 exec,s0 r2,r1,r0 Generates a comparison bit vector is all 0, skip the Scalar execution

CONST BENCH

More Loop Unrolling and Vectorization - Computer …jad5ju/cs4501/More Loop Unrolling.pdf · L1: ble r1 r0 L2 add r2

17th ESO-ESMO Masterclass Clinical Oncology · Current opinion on resection margins. R0 ideal = 5mm R1, 2 high risk . R0

Adv. No. 2015-020 R1 File: 1613.9 - BC Hydro · Adv. No. 2015-020 R1 Supersedes: R0 Issue Date: Aug 19, 2015 File: 1613.9.1 Information Bulletin: Duct Spacers and Warning Boards for

2.ALU Design - Linköping · PDF file2.ALU Design Olle Seger (olles@isy.liu.se) ... REPEAT loop,32 LD r1,(ar0++) NOP ABS r2,r1 MAX r0,r2,r0 ... SW should test all corner cases : About