Retargeting of VPO to the tms320c54x - a status report

Presented by Joshua George
Advisor: Dr. Jack Davidson

Status
- Register assignment and allocation
- Common sub-expression elimination
- Constant propagation/Copy propagation
- Induction variable elimination
- Code motion
- Recurrence detection

Status (continued)
- Strength reduction
- Instruction selection
- Dead code elimination
- Constant folding (simp())
- Branch minimization
- Support for repeat blocks

The tms320c54x
- 1 40-bit ALU, 2 40-bit accumulators (A, B) (r[0], r[2] in vpo)
- 1 17x17-bit parallel multiplier with adder for single-cycle MAC operation
- 1 barrel shifter
- 8 16-bit address registers (AR0-AR7) (w[0]..w[7] in vpo)

Compiler writer woes
- Address arithmetic: can only add a constant to an address register. Causes complications in the optimizer (e.g. in the strength reduction code).

Interesting note:
r[0]=(w[0]{24)}24;
r[0]=r[0]+1;
w[1]=r[0];     /* w[1]=w[0]+1 gets rejected */
W[w[1]]=50;    /* by instruction selection */
---------------------------
w[0]=w[0]+1; W[w[0]+1]=50;

The first sequence cannot normally collapse into the more efficient second sequence. But after minimize_registers, instruction selection is able to fold them into a single instruction.

Compiler writer woes
- 16-bit word addressing: required special-case handling in the lcc frontend.
- Only 2 accumulator registers. The Local Register Assigner had to be fixed to handle this.
- Lots of spills. Refined vpo to use memory disambiguation techniques in instruction selection (maybe_same()).

Compiler writer woes
- No pipeline interlocks => unprotected pipeline conflicts.
- 40-bit accumulator. Needed a major change to simp(). Complicated machine description with sign-extends and ANDs.
- Global data is placed in a special cinit section and is relocated to RAM at run-time. VISTA/EASE code instrumentation had to be done differently from other targets.
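The sign-extends and ANDs mentioned above can be illustrated with a small model. This is only a sketch, assuming that the RTL `(w{24)}24` means a left shift by 24 followed by an arithmetic right shift by 24 inside a 40-bit register, and that `r&65535` keeps the low 16 bits when a value goes back to an address register. The function names are made up for this illustration and are not part of vpo.

```python
ACC_BITS = 40   # accumulator width on the tms320c54x
WORD_BITS = 16  # width of an address register / memory word


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sign_extend_word(word):
    """Model (w{24)}24: shift left 24, then arithmetic shift right 24, in 40 bits."""
    shift = ACC_BITS - WORD_BITS  # 24
    shifted = to_signed((word & 0xFFFF) << shift, ACC_BITS)
    return shifted >> shift  # >> on a Python int is an arithmetic shift


def store_low_word(acc):
    """Model w = r & 65535: keep only the low 16 bits of the accumulator."""
    return acc & 0xFFFF


if __name__ == "__main__":
    for word in (0x0001, 0x7FFF, 0x8000, 0xFFFF):
        acc = sign_extend_word(word)
        print(hex(word), "->", acc, "->", hex(store_low_word(acc)))
```

Every move between a 16-bit register and a 40-bit accumulator picks up one of these two wrappers, which is why the machine description fills up with them.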

Compiler writer woes
- Compare and jump has the induction variable and the value to compare with spread over two instructions. All targets till now had a simple compare and jump. Resulted in a small change to the vpo lib/md interface.
- E.g. AR1 (w[1]) is the induction variable and runs from 0 to 9. The loop exit check:

SSBX SXM       // s[0]=1; (set sign-ext on)
LD *(AR1),A ;  // r[0]=(w[1]{24)}24;
SUB #10,A,A ;  // r[0]=r[0]-10;
BC L1,ALT ;    // PC=r[0],0?L1;
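The loop exit check above can be traced with a short simulation. This is a sketch under stated assumptions: `ALT` is taken to mean "accumulator A less than zero", the load is sign-extended because SXM is set, and the increment of w[1] (not shown on the slide) is assumed to happen just before the check. The function names are illustrative only.

```python
def sign_extend_16(word):
    """Sign-extend a 16-bit value, as LD does when SXM is set."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def loop_exit_check(w1, limit=10):
    """Return True if BC L1,ALT would branch back to L1."""
    acc = sign_extend_16(w1)  # LD *(AR1),A    r[0]=(w[1]{24)}24;
    acc = acc - limit         # SUB #10,A,A    r[0]=r[0]-10;
    return acc < 0            # BC L1,ALT      branch while A < 0


def run_loop(limit=10):
    """Collect the values of w[1] seen by the loop body."""
    w1 = 0
    seen = []
    while True:
        seen.append(w1)
        w1 = (w1 + 1) & 0xFFFF  # assumed induction step, not shown on the slide
        if not loop_exit_check(w1, limit):
            break
    return seen


if __name__ == "__main__":
    print(run_loop())
```

The induction variable appears in the LD and the bound appears in the SUB, so an optimizer that expects both in one compare instruction has to look at two instructions instead.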

Timeline of progress on this project: Spring 2002
- Code-expander completed. Only basic addressing modes and instructions supported.
- Stack layout, calling sequence, data declarations, structure operations.
- Passes ctests/ptests with instruction selection.
- Support for stdargs added.

Timeline of progress on this project: Fall 2002
- Major changes to simp() to handle 40-bit arithmetic.
- Enabled Register Coloring and CSE.
- Lot of work on comp() to allow better instruction selection and other optimizations (e.g. w[1]=( (w[1]{24)}24)+1 ) & 65535 folds down to w[1]=w[1]+1; only now can strength reduction detect the induction variable).
- Integrated VISTA into mainline vpo.
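The fold mentioned in the comp() item can be checked exhaustively. The sketch below assumes `{24` / `}24` are a left shift and an arithmetic right shift in a 40-bit register, and that a 16-bit address register wraps modulo 65536; the helper names are hypothetical and this is not vpo code.

```python
def sign_extend_16_in_40(word):
    """Model (w{24)}24 on a 40-bit register."""
    shifted = (word & 0xFFFF) << 24            # bit 15 moves to bit 39
    if shifted & (1 << 39):                    # reinterpret as signed 40-bit
        shifted -= 1 << 40
    return shifted >> 24                       # arithmetic shift right


def unfolded(w1):
    """w[1] = (((w[1]{24)}24) + 1) & 65535"""
    return (sign_extend_16_in_40(w1) + 1) & 65535


def folded(w1):
    """w[1] = w[1] + 1 on a 16-bit register."""
    return (w1 + 1) & 0xFFFF


def first_mismatch():
    """Return the first 16-bit value where the two forms differ, or None."""
    for w1 in range(1 << 16):
        if unfolded(w1) != folded(w1):
            return w1
    return None


if __name__ == "__main__":
    print("first mismatch:", first_mismatch())
```

Because the two forms agree on every 16-bit input, replacing the long form with the short one is safe, and the short form has the `w = w + constant` shape that induction variable detection looks for.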

Timeline of progress on this project: Spring 2003
- Enabled Code motion & Strength reduction.
- Further refined the machine description/grammar.
- Started work on Zero Overhead Loop Buffer (ZOLB) support.
- Second merge of VISTA with vpo done.
- Retargeted VISTA to the tms320c54x.

To-Dos/Future work
- Parallel instructions
- Issues with ZOLB (details later)
- Scheduling
- The banz instruction (very useful for loops): allows comparison of an address register with zero
- Circular addressing
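To show why a compare-with-zero branch is attractive for loops, here is a rough model of a count-down loop closed by a banz-style test. It assumes the register is tested against zero and then post-decremented by the instruction's operand, and that the register is preloaded with the trip count minus one; both are assumptions of this sketch, and `banz_loop` is a made-up name.

```python
def banz_loop(trip_count, body):
    """Run body(i) trip_count times using a count-down address register."""
    if trip_count <= 0:
        return 0
    ar = (trip_count - 1) & 0xFFFF  # assumed preload: trip count minus one
    executed = 0
    while True:
        body(executed)
        executed += 1
        branch = ar != 0            # banz: branch if the register is not zero
        ar = (ar - 1) & 0xFFFF      # assumed post-decrement of the register
        if not branch:
            break
    return executed


if __name__ == "__main__":
    total = []
    count = banz_loop(10, total.append)
    print(count, sum(total))
```

The loop needs one instruction for the test, decrement and branch, with no accumulator involved, in contrast to the LD/SUB/BC sequence used for a count-up loop.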

TI’s compiler cl500 has..
- Inter-procedural analysis. E.g. if the parameters to a function are constants or globals, the actual parameters are substituted into the function, thus avoiding expensive stack frame setup.
- Inline expansion of runtime-support library functions.

Code comparison
Code fragment: get address of local _a

VPO:
r[2]=(w[7]{24)}24;
r[2]=r[2]+_l0_2_a;
w[3]=r[2]&65535;   // w[3]=w[7]+_l0_2_a
----------------------------
cl500 (TI-compiler):
w[3]=w[7];
w[3]=w[3]+_l0_2_a;

Code comparison
Code fragment:
for (i = 0; i < STRUCTSIZE; i++) // STRUCTSIZE=2
    sum += b.field[i];

Because vpo maintains the running sum in a 16-bit register (address register), we use 2 extra instructions and lose the opportunity for converting into a repeat single instruction. The TI-compiler maintains the sum in an accumulator register.

VPO:
AR3 (w[3]) points to start of array. AR1 (w[1]) maintains the running sum.
brc=1;
rptb .L10_rpt_end-1
.L10:
ld *(AR1),A     // r[0]=(w[1]{24)}24;
add *AR3+,A     // r[0]=r[0]+(W[w[3]]{24)}24;w[3]=w[3]+1;
stl A,*(AR1)    // w[1]=r[0]&65535;
.L10_rpt_end:
--------------------------------------------
cl500 (TI-compiler):
AR3 (w[3]) points to start of array. A (r[0]) maintains the running sum.
RPT #1
L5:
ADD *AR3+,A     // r[0]=r[0]+(W[w[3]]{24)}24;w[3]=w[3]+1;
L6:
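The two instruction sequences can be compared with a small simulation. This is a sketch, assuming that a repeat block with brc=1 and a `RPT #1` both run their body 1 + 1 = 2 times, that loads are sign-extended, and that a Python list index stands in for AR3. The function names and the instruction counter are illustrative only.

```python
def sign_extend_16(word):
    """Sign-extend a 16-bit memory word into the accumulator."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def vpo_loop(field, w1=0):
    """Repeat block, brc=1: the sum lives in the 16-bit register w[1]."""
    brc = 1
    w3 = 0                                 # index standing in for AR3
    executed = 0
    for _ in range(brc + 1):
        r0 = sign_extend_16(w1)            # ld *(AR1),A
        r0 += sign_extend_16(field[w3])    # add *AR3+,A
        w3 += 1
        w1 = r0 & 0xFFFF                   # stl A,*(AR1)
        executed += 3
    return w1, executed


def cl500_loop(field, r0=0):
    """RPT #1: the sum stays in accumulator A for the whole loop."""
    w3 = 0
    executed = 0
    for _ in range(1 + 1):
        r0 += sign_extend_16(field[w3])    # ADD *AR3+,A
        w3 += 1
        executed += 1
    return r0, executed


if __name__ == "__main__":
    field = [5, 0xFFFE]                    # 5 and -2 as 16-bit words
    print(vpo_loop(field), cl500_loop(field))
```

Both versions produce the same low 16 bits of the sum, but the vpo body has three instructions per iteration against one, which is what rules out the single-instruction repeat.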

Zero Overhead Loop Buffers
- Loops are buffered in a special internal buffer using a rpt instruction whose parameters are start label, end label and loop count. Access to this buffer may be faster than fetching the instructions from memory.
- The usual branch instruction at the end of the loop is no longer necessary when using a repeat instruction, and hence pipeline bubbles are avoided.
- On the tms320c54x a single-instruction rpt allows memory block copies/initializations without using an address register.

Detail on ZOLB
Advantage of doing it in vpo:
- Can make use of all the information that vpo has already collected about the loop.
- Easily retargetable. Code in the machine-independent part is reused. Code in the machine-dependent part for one target provides a framework for the new target.
- After conversion to a Repeat Block, registers may be freed up. Other optimizations may get enabled.

Status of ZOLB
- Repeat Blocks with a compile-time-known loop iteration count implemented.
- Plan to implement the banz instruction, which is the next best option to ZOLB.
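For a loop whose bounds are compile-time constants, the repeat count can be computed ahead of time. The sketch below is one way this might look, assuming a loop of the form `for (i = start; i < limit; i += step)`, a block-repeat counter that holds the iteration count minus one, and a 16-bit counter; `trip_count` and `repeat_block_count` are hypothetical names, not vpo functions.

```python
def trip_count(start, limit, step):
    """Number of iterations of: for (i = start; i < limit; i += step)."""
    if step <= 0:
        raise ValueError("only positive constant steps are handled in this sketch")
    if limit <= start:
        return 0
    return (limit - start + step - 1) // step  # ceiling division


def repeat_block_count(start, limit, step, max_count=0xFFFF):
    """Return the counter value for a repeat block, or None if not convertible."""
    n = trip_count(start, limit, step)
    if n == 0:
        return None          # body never runs; a repeat block would run it once
    if n - 1 > max_count:
        return None          # does not fit the assumed 16-bit counter
    return n - 1             # the block executes counter + 1 times


if __name__ == "__main__":
    # The STRUCTSIZE=2 loop from the code comparison
    print(repeat_block_count(0, 2, 1))
```

For the STRUCTSIZE=2 loop this gives a counter of 1, matching the `brc=1` in the vpo output shown earlier.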

Acknowledgements

Dr. Jack Davidson (advisor)
Jason Hiser
Clark Coleman
