Generating Truly Optimal Code Using a Metaprogramming Library Don Clugston First D Programming Conference, 24 August 2007

String mixins in D – undercooked, but very tasty

char [] greet(char [] greeting) {
    return `writefln("` ~ greeting ~ `, world!");`;
}

void main() {
    mixin( greet("Hello") );
}

• Compiles to:

void main() {
    writefln("Hello, world!");
}

• Vindicates built-in string operations
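The same idea can be sketched in present-day D (D2), where string mixins also work at module scope. This is a minimal illustration only: `makeAdder` and `addFive` are hypothetical names that do not appear in the slides, and `std.conv.to` stands in for a hand-written integer-to-string helper.

```d
import std.conv : to;

// Hypothetical helper (not from the slides): returns D source text.
// It is an ordinary function, evaluated at compile time when used in mixin().
string makeAdder(string name, int delta)
{
    return "int " ~ name ~ "(int x) { return x + " ~ to!string(delta) ~ "; }";
}

// The compiler parses the returned text as if it had been typed here,
// declaring: int addFive(int x) { return x + 5; }
mixin(makeAdder("addFive", 5));

unittest
{
    assert(addFive(10) == 15);
}
```

Because `makeAdder` is a plain function, it can also be called at run time to inspect the generated source, which helps when debugging a mixin.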

The Challenge

• Fortran: BLAS (a standard set of highly optimised routines). The crucial functions are coded in asm. Example: DAXPY computes y[] += a * x[]

void DAXPY(double [] y, double [] x, double a) {
    for (int i = 0; i < y.length; ++i)
        y[i] += x[i] * a;
}

• But BLAS is limited – nothing for simple things:
  – x[] = y[] - z[]
  – a[] = r[]*0.3 + g[]*0.5 + b[]*0.2;

Operator overloading

• Gives ideal syntax, always works
• Can’t operate on built-in types
• Inefficient because:
  – Creates unnecessary temporaries.
  – Multiple loops, eg a[]=b[]+c[]+d[] becomes:

double [] temp1 = new double[b.length], temp2 = new double[b.length];
for (int i = 0; i < b.length; ++i)
    temp1[i] = b[i] + c[i];
for (int i = 0; i < temp1.length; ++i)
    temp2[i] = temp1[i] + d[i];
a = temp2;

• Somehow, we need to get the expression inside the ‘for’ loop!
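For comparison, here is a hand-written sketch of the loop we would like to end up with for a[]=b[]+c[]+d[]: one pass, no temporaries. The function name `sum3` is made up for this illustration, and the code is written in present-day D (D2).

```d
// Hypothetical target code: the whole expression sits inside a single loop,
// so no temporary arrays are allocated and each element is visited once.
void sum3(double[] a, const double[] b, const double[] c, const double[] d)
{
    assert(a.length == b.length && b.length == c.length && c.length == d.length);
    foreach (i; 0 .. a.length)
        a[i] = b[i] + c[i] + d[i];
}
```

The rest of the talk is about producing this kind of loop automatically from the expression, rather than writing one such function per formula.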

The Wizard Solution: Expression Templates (eg, Blitz++)

• Overloaded operators don’t do the calculation: instead, they record the operation as a proxy type, creating a syntax tree.

• Example: (a+b)/(c-d):

DVExpr<DVBinExprOp<
    DVExpr<DVBinExprOp<DVec::iterT, DVec::iterT, DApAdd>>,
    DVExpr<DVBinExprOp<DVec::iterT, DVec::iterT, DApSubtract>>,
    DApDivide>>

• Need a good optimiser.
• Works in D as well as C++. BUT… we are fighting the compiler!
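To make the proxy-type idea concrete, here is a deliberately tiny expression-template sketch in present-day D (D2). It is not Blitz++ and not code from the talk: `Vec`, `BinExpr` and `assign` are names invented for this illustration, and only element-wise binary operators on `double` are handled.

```d
// Interior node: records "lhs op rhs" in its type; nothing is computed
// until an element is requested through opIndex.
struct BinExpr(string op, L, R)
{
    L lhs;
    R rhs;

    double opIndex(size_t i)
    {
        return mixin("lhs[i] " ~ op ~ " rhs[i]");
    }

    auto opBinary(string op2, R2)(R2 other)
    {
        return BinExpr!(op2, typeof(this), R2)(this, other);
    }
}

// Leaf node: a thin wrapper around a slice.
struct Vec
{
    double[] data;

    double opIndex(size_t i) { return data[i]; }

    // The operator only builds a proxy; it does not loop over the data.
    auto opBinary(string op, R)(R other)
    {
        return BinExpr!(op, Vec, R)(this, other);
    }

    // The single loop happens here, walking the syntax tree per element.
    void assign(E)(E expr)
    {
        foreach (i; 0 .. data.length)
            data[i] = expr[i];
    }
}
```

With this sketch, `(a + b) / (c - d)` has the type `BinExpr!("/", BinExpr!("+", Vec, Vec), BinExpr!("-", Vec, Vec))`, which mirrors the nested C++ type shown above. Whether the per-element tree walk collapses into a tight loop is left entirely to the optimiser.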

Representing the Syntax Tree in D

• In D, any expression can be represented in a single template.

• Represent types and values in a tuple. Represent expression in a char []. A..Z correspond to T[0]..T[25]. eg:

void vectorOperation(char [] expression, T...)(T values) { }

vectorOperation!("A+=(B*C)/(A+D)")(x, y, z, u, v);

Note that ‘A’ appears twice in the expression (operator overloading can’t represent that).
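One small consequence of the letter-to-tuple mapping is that the expression string can be checked against the argument list at compile time. The sketch below, in present-day D (D2), shows one way this might look; `operandCount` is a hypothetical helper and the `static assert` is an assumption of this illustration, not part of the original design.

```d
// Hypothetical helper: returns how many tuple slots the expression needs,
// i.e. the index of the highest letter used, plus one (0 if no letters).
size_t operandCount(string expr)
{
    size_t n = 0;
    foreach (c; expr)
    {
        if (c >= 'A' && c <= 'Z')
        {
            size_t needed = c - 'A' + 1;
            if (needed > n)
                n = needed;
        }
    }
    return n;
}

void vectorOperation(string expression, T...)(T values)
{
    // Evaluated during compilation: a letter with no matching argument
    // is rejected before any code is generated.
    static assert(operandCount(expression) <= T.length,
                  "expression refers to more operands than were passed");
    // Code generation would go here.
}
```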

Finding the vectors in a tuple

• It’s a vector if you can index it.
• Imperfection – can’t index tuple in CTFE.
• Workaround – create array of results.

template isVector(T...) {
    static if (T.length == 0)
        const bool [] isVector = [];
    else static if ( is( typeof(T[0][0]) ) )
        const bool [] isVector = true ~ isVector!(T[1..$]);
    else
        const bool [] isVector = false ~ isVector!(T[1..$]);
}

• Usage: if ( isVector!(Tuple)[i]) { … }
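A rough equivalent in present-day D (D2) is sketched below. It keeps the same recursive "array of results" shape, but as an assumption of this sketch it tests indexability on a value of the type (`T[0].init[0]`) and uses `enum` for the compile-time constant.

```d
// Builds a compile-time bool[] with one entry per type in T:
// true where a value of that type can be indexed, false otherwise.
template isVector(T...)
{
    static if (T.length == 0)
        enum bool[] isVector = [];
    else
        enum bool[] isVector =
            [is(typeof(T[0].init[0]))] ~ isVector!(T[1 .. $]);
}

// The result is an ordinary array, so CTFE code can index it with a
// variable, which is the point of the workaround.
static assert(isVector!(double[], double, float[]) == [true, false, true]);
```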

Metaprogramming For Muggles

USAGE:
double [] firstvec, secondvec, thirdvec;
VEC!("A+=B*(C+A*D)")(firstvec, secondvec, thirdvec, 25.7);

char [] muggle(char [] expr, Values...)() {
    char [] code = "for (int i=0; i<values[0].length; ++i) {";
    foreach (c; expr)
        if (c >= 'A' && c <= 'Z') {
            // A-Z become tuple members.
            code ~= "values[" ~ itoa(c-'A') ~ "]";
            // add [i] if it was a vector
            if (isVector!(Values)[c-'A']) code ~= "[i]";
        } else
            code ~= c; // Everything else is unchanged
    return code ~ "; }";
}

template VEC(char [] expr) {
    void VEC(Values...)(Values values) {
        mixin( muggle!(expr, Values) );
    }
}
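For readers who want to try this today, here is a sketch of the same generator adapted to present-day D (D2). The adaptations are assumptions of this sketch rather than the original code: `string` replaces `char []`, `std.conv.to` replaces `itoa`, the loop is emitted as a `foreach`, and `isVector` is restated so the block stands alone.

```d
import std.conv : to;

// One bool per argument type: true if a value of that type can be indexed.
template isVector(T...)
{
    static if (T.length == 0)
        enum bool[] isVector = [];
    else
        enum bool[] isVector =
            [is(typeof(T[0].init[0]))] ~ isVector!(T[1 .. $]);
}

// Runs at compile time: turns "A+=B*C" into the text of a loop over values[0].
string muggle(string expr, Values...)()
{
    string code = "foreach (i; 0 .. values[0].length) { ";
    foreach (c; expr)
    {
        if (c >= 'A' && c <= 'Z')
        {
            // A-Z become tuple members.
            code ~= "values[" ~ to!string(c - 'A') ~ "]";
            // Vectors are indexed per element; scalars are used as-is.
            if (isVector!(Values)[c - 'A'])
                code ~= "[i]";
        }
        else
            code ~= c; // Everything else is unchanged.
    }
    return code ~ "; }";
}

template VEC(string expr)
{
    void VEC(Values...)(Values values)
    {
        mixin(muggle!(expr, Values)());
    }
}
```

Calling `VEC!("A+=B*C")(a, b, 2.0)` with `double[]` arguments `a` and `b` should mix in `foreach (i; 0 .. values[0].length) { values[0][i]+=values[1][i]*values[2]; }`. The slices share storage with the caller's arrays, so the update is visible after the call.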

Trivial enhancements

• Ensure all vectors are the same length.

foreach (int i, bool b; isVector!(Values)[1..$]) {
    // i counts from the start of the slice, so the tuple index is i+1
    if (b)
        code ~= "assert(values[" ~ itoa(i+1) ~ "].length == values[0].length);";
}

• Assert no aliasing (vectors don’t overlap).
• Equalize with hand-coded asm BLAS routines.

static if ( expr == "A+=B*C"
        && is( Values[0] == double[] )
        && is( Values[1] == double[] )
        && is( Values[2] : double ) ) {
    return "DAXPY(values[0].length, values[0].ptr, values[1].ptr, values[2]);";
}
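The length check can be isolated as a small string-building function, sketched here in present-day D (D2). `lengthChecks` is a hypothetical name, and taking the `isVector` result as a plain `bool[]` parameter is an assumption made so the function can be exercised on its own.

```d
import std.conv : to;

// Given the isVector flags for each argument, emit one assert per vector
// (other than the first) requiring it to match the length of values[0].
string lengthChecks(const bool[] isVec)
{
    string code;
    foreach (i, b; isVec)
    {
        if (i > 0 && b)
            code ~= "assert(values[" ~ to!string(i)
                  ~ "].length == values[0].length);";
    }
    return code;
}
```

The returned text would be prepended to the generated loop before it is mixed in, so the checks run once per call rather than once per element.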

Asm code via perturbation

• It’s hard to determine the optimal asm for an algorithm, much easier to modify existing code.

• Begin with Agner Fog’s optimal asm code for DAXPY. Use same loop design and register allocation strategy.

• Ignore difficult cases – fall back to D code.

X87 (stack-based)

• Convert the infix expression into postfix. Split += into + and =.

• Swap operands to avoid FMUL latency.
  A += B - C * D becomes A = (A+B) - (C*D)
  Postfix: C D * A B + - A =

• Avoid gaps in the instruction set
  – Eg, fewer instructions for 80-bit reals, so load them first whenever possible.
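The infix-to-postfix step can be sketched with the textbook shunting-yard algorithm, written here in present-day D (D2) so that it can also run at compile time. This is an assumption about how the step might be done, not the library's actual code: `toPostfix` is a hypothetical name, only single-letter operands, parentheses and + - * / are handled, and neither the += splitting nor the operand swapping described above is attempted.

```d
// Precedence of a binary operator: * and / bind tighter than + and -.
int prec(char op)
{
    return (op == '*' || op == '/') ? 2 : 1;
}

// Standard shunting-yard conversion for single-letter operands.
string toPostfix(string infix)
{
    string output;
    char[] stack;

    foreach (c; infix)
    {
        if (c >= 'A' && c <= 'Z')
            output ~= c;
        else if (c == '(')
            stack ~= c;
        else if (c == ')')
        {
            // Pop operators back to the matching '(' and discard it.
            while (stack.length && stack[$ - 1] != '(')
            {
                output ~= stack[$ - 1];
                stack = stack[0 .. $ - 1];
            }
            if (stack.length)
                stack = stack[0 .. $ - 1];
        }
        else if (c == '+' || c == '-' || c == '*' || c == '/')
        {
            // Left-associative: pop operators of equal or higher precedence.
            while (stack.length && stack[$ - 1] != '('
                   && prec(stack[$ - 1]) >= prec(c))
            {
                output ~= stack[$ - 1];
                stack = stack[0 .. $ - 1];
            }
            stack ~= c;
        }
        // Anything else (such as spaces) is ignored.
    }
    while (stack.length)
    {
        output ~= stack[$ - 1];
        stack = stack[0 .. $ - 1];
    }
    return output;
}

static assert(toPostfix("B*(A+D)") == "BAD+*");
```

Note that for `(A+B)-(C*D)` this plain conversion gives `AB+CD*-`. Reordering so the multiply is issued first is a separate scheduling decision layered on top.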

X87 code generation

• Directly convert postfix to inline asm.

VEC!("C+=B*(A+D)")( 2213.3, vec1, floatvec, vec2);
// Postfix : BAD+*C+C=

L1: fld double ptr [EAX + 8*ESI];        // B
    fld double ptr [EAX + 8*ESI];        // A
    fadd double ptr [EDX + 8*ESI];       // D+
    fmulp ST(1), ST;                     // *
    fadd float ptr [ECX + 4*ESI];        // C+
    fxch ST(1), ST;
    fstp float ptr [ECX + 4*ESI - 4];    // C=
L2: inc ESI;
    jnz L1;

SSE/SSE2 (register-based)

• Can’t do mixed-precision operations.

• Unroll loop by 2 or 4, to take advantage of SIMD.

• Instruction scheduling is less critical, but register allocation is more complicated than for x87.
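The shape of a loop unrolled by 2 can be shown in plain D, leaving aside the SSE instructions themselves. This is only a structural sketch in present-day D (D2): `axpyUnrolled2` is a made-up name, and a real generator would emit packed SIMD instructions for the paired statements instead of two scalar updates.

```d
// y[] += a * x[], processed two elements per iteration, with a scalar
// clean-up loop for a possible odd element at the end.
void axpyUnrolled2(double[] y, const double[] x, double a)
{
    assert(x.length == y.length);

    size_t i = 0;
    for (; i + 2 <= y.length; i += 2)
    {
        // These two updates are independent, so they map naturally onto
        // one two-lane packed operation.
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    for (; i < y.length; ++i)
        y[i] += a * x[i];
}
```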

GPGPU

• Use the GPU in modern video cards to perform massively parallel calculations.

• Uses OpenGL or DirectX calls, instead of inline asm.

• Full of hacks (pretend your data is a texture!) – but a rational API should emerge soon.

• This should NOT be built into a compiler!

Adding a front end

• Operator overloading – Same limitations as before

• Mixins eg, mixin(blade("firstvec+=secondvec*2.38"));
  – clumsy syntax BUT:
  – Can detect aliases
  – Allows better error messages
  – Can unroll small loops inline
  – Closer to proposed macro syntax

Front end using mixins

1. Lex: first += second * 2.38 becomes A+=B*C.
2. Determine types, resolve aliases, convert constants to literals.
3. Determine precedence and associativity
4. Perform constant folding

• We can do most of this using mixins
• Compiler help is most required for 4
• __traits could help
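Step 1 can be sketched as a small CTFE-friendly function in present-day D (D2). Everything here is an assumption for illustration: the names `Lexed` and `lex`, the rule that a token is any run of letters, digits, `_` or `.`, and the limit of 26 distinct symbols.

```d
// Result of lexing: the expression with symbols replaced by A, B, C...,
// plus the original text of each symbol in letter order.
struct Lexed
{
    string expr;
    string[] symbols;
}

bool isWordChar(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_' || c == '.';
}

Lexed lex(string src)
{
    Lexed r;
    size_t i = 0;
    while (i < src.length)
    {
        if (src[i] == ' ')
        {
            ++i; // Whitespace is dropped.
        }
        else if (isWordChar(src[i]))
        {
            // Consume one identifier or numeric constant.
            size_t start = i;
            while (i < src.length && isWordChar(src[i]))
                ++i;
            string tok = src[start .. i];

            // Reuse the letter if this symbol was seen before.
            size_t idx = r.symbols.length;
            foreach (k, s; r.symbols)
                if (s == tok) { idx = k; break; }
            if (idx == r.symbols.length)
                r.symbols ~= tok;
            assert(idx < 26, "this sketch supports at most 26 symbols");
            r.expr ~= cast(char)('A' + idx);
        }
        else
        {
            r.expr ~= src[i]; // Operators and parentheses pass through.
            ++i;
        }
    }
    return r;
}
```

Giving a repeated symbol the same letter is what lets later stages notice that an operand appears more than once.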

Determining types

char [] getSymbolTable(char [][] symbols) {
    char [] result = "[";
    for (int i=0; i<symbols.length; ++i) {
        if (i>0) result ~= ",";
        result ~= "[typeof(" ~ symbols[i] ~ `).stringof, ` ~ symbols[i] ~ `.stringof]`;
    }
    result ~= "]";
    return result;
}

• When mixed in, this creates an array[2] of string literals.
• [0] is the type, [1] is the value

Determining precedence

class AST(char [] expr) {
    alias expr text;
    AST!("(" ~ text ~ "+" ~ T.text ~ ")") opAdd(T)(T x) { return null; }
    AST!("(" ~ text ~ "*" ~ T.text ~ ")") opMul(T)(T x) { return null; }
    AST!( text ~ "([" ~ T.text ~ "])" ) opIndex(T)(T x) { return null; }
}

char [] getPrecedence(char [] expr) {
    char [] code = "typeof(";
    for (int i=0; i<expr.length; ++i) {
        if (expr[i]>='A' && expr[i]<='Z')
            code ~= "(cast(AST!(`" ~ expr[i] ~ "`))(null))";
        else
            code ~= expr[i];
    }
    return code ~ ").text";
}

mixin(getPrecedence("A+B*C*D")) yields "A+((B*C)*D)"
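The same trick carries over to present-day D (D2), where binary operators are overloaded through `opBinary`. The sketch below is an adaptation under stated assumptions, not the original code: `AST` is a struct rather than a class, operands are created with `.init` instead of casting `null`, and only binary operators are covered.

```d
// A type whose only job is to carry an expression string. The compiler's
// own parser decides which opBinary call nests inside which, so the
// resulting type records the real precedence and associativity.
struct AST(string expr)
{
    enum text = expr;

    auto opBinary(string op, T)(T rhs)
    {
        return AST!("(" ~ text ~ op ~ T.text ~ ")").init;
    }
}

// Rewrites "A+B*C" into a typeof(...) over AST operands; mixing the
// result in yields the fully parenthesised string.
string getPrecedence(string expr)
{
    string code = "typeof(";
    foreach (c; expr)
    {
        if (c >= 'A' && c <= 'Z')
            code ~= "AST!(\"" ~ c ~ "\").init";
        else
            code ~= c;
    }
    return code ~ ").text";
}

static assert(mixin(getPrecedence("A+B*C*D")) == "(A+((B*C)*D))");
```

Unlike the slide's result, this sketch also wraps the outermost operation in parentheses, because every `opBinary` call adds a pair. Nothing is evaluated at run time: `typeof` only asks the compiler what type the expression would have.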

Conclusion

• Implementation and syntactic issues remain
  – Syntax for runtime and compile-time reflection
  – Macros, and an extended __traits syntax should help.
  – How to clean up mixin(), yet retain its power?

• Yet perfectly optimal code is already possible. Libraries can perform optimisations that previously required a compiler back-end.
