Generating Truly Optimal Code Using a Metaprogramming Library
Don Clugston
First D Programming Conference, 24 August 2007
String mixins in D – undercooked, but very tasty
    char[] greet(char[] greeting) {
        return `writefln("` ~ greeting ~ `, world!");`;
    }
    void main() {
        mixin( greet("Hello") );
    }
• Compiles to:
    void main() {
        writefln("Hello, world!");
    }
• Vindicates built-in string operations
The Challenge
• Fortran: BLAS (a standard set of highly optimised routines). The crucial functions are coded in asm. For example, DAXPY computes y[] += a * x[]:
    void DAXPY(double[] y, double[] x, double a) {
        for (int i = 0; i < y.length; ++i)
            y[i] += x[i] * a;
    }
• But BLAS is limited – nothing for simple things:
  – x[] = y[] - z[]
  – a[] = r[]*0.3 + g[]*0.5 + b[]*0.2;
Operator overloading
• Gives ideal syntax, always works
• Can’t operate on built-in types
• Inefficient because:
  – Creates unnecessary temporaries.
  – Multiple loops, eg a[]=b[]+c[]+d[] becomes:
    double[] temp1 = new double[b.length], temp2 = new double[b.length];
    for (int i = 0; i < b.length; ++i) temp1[i] = b[i] + c[i];
    for (int i = 0; i < temp1.length; ++i) temp2[i] = temp1[i] + d[i];
    a = temp2;
• Somehow, we need to get the expression inside the ‘for’ loop!
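For comparison, the single fused loop that the rest of the talk aims to generate might look like the sketch below. It uses current D2 syntax rather than the 2007 dialect on the slides, and the function name `addThree` is hypothetical.

```d
// Hypothetical helper: a[] = b[] + c[] + d[] as one loop with no temporaries.
double[] addThree(const double[] b, const double[] c, const double[] d) {
    assert(b.length == c.length && c.length == d.length);
    auto a = new double[b.length];
    foreach (i; 0 .. b.length)
        a[i] = b[i] + c[i] + d[i];   // whole expression evaluated inside the loop
    return a;
}

void main() {
    auto a = addThree([1.0, 2.0], [10.0, 20.0], [100.0, 200.0]);
    assert(a == [111.0, 222.0]);
}
```

Compared with the overloaded-operator version, there is one allocation instead of two and one pass over the data instead of two.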
The Wizard Solution: Expression Templates (eg, Blitz++)
• Overloaded operators don’t do the calculation: instead, they record the operation as a proxy type, creating a syntax tree.
• Example: (a+b)/(c-d):
    DVExpr<DVBinExprOp<
        DVExpr<DVBinExprOp<DVec::iterT, DVec::iterT, DApAdd>>,
        DVExpr<DVBinExprOp<DVec::iterT, DVec::iterT, DApSubtract>>,
        DApDivide>>
• Need a good optimiser.
• Works in D as well as C++. BUT… we are fighting the compiler!
Representing the Syntax Tree in D
• In D, any expression can be represented in a single template.
• Represent types and values in a tuple. Represent the expression in a char[]. A..Z correspond to T[0]..T[25]. eg:
    void vectorOperation(char[] expression, T...)(T values) { }
    vectorOperation!("A+=(B*C)/(A+D)")(x, y, z, u, v);
• Note that ‘A’ appears twice in the expression (operator overloading can’t represent that).
Finding the vectors in a tuple
• It’s a vector if you can index it.
• Imperfection – can’t index tuple in CTFE.
• Workaround – create array of results.
• Usage: if ( isVector!(Tuple)[i]) { … }
    template isVector(T...) {
        static if (T.length == 0)
            const bool[] isVector = [];
        else static if ( is( typeof(T[0][0]) ) )
            const bool[] isVector = true ~ isVector!(T[1..$]);
        else
            const bool[] isVector = false ~ isVector!(T[1..$]);
    }
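The same "array of results" workaround can be sketched in current D2 as an ordinary function evaluated at compile time. This is an illustration under assumptions, not the code from the talk: the name `vectorFlags` is hypothetical, and indexability is tested on `T.init` rather than on the type itself.

```d
// Hypothetical CTFE helper: one flag per type, true if a value of it can be indexed.
bool[] vectorFlags(Types...)() {
    bool[] flags;
    foreach (T; Types)                        // unrolled over the type tuple
        flags ~= is(typeof(T.init[0]));       // compiles only for indexable types
    return flags;
}

// Evaluated entirely at compile time; the result is a plain bool[] that
// later compile-time code can index with a computed position.
enum demoFlags = vectorFlags!(double[], double, float[])();
static assert(demoFlags == [true, false, true]);

void main() {}
```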
Metaprogramming For Muggles
USAGE:
    double[] firstvec, secondvec, thirdvec;
    VEC!("A+=B*(C+A*D)")(firstvec, secondvec, thirdvec, 25.7);
    char[] muggle(char[] expr, Values...)() {
        char[] code = "for (int i=0; i<values[0].length; ++i) {";
        foreach (c; expr)
            if (c >= 'A' && c <= 'Z') {
                // A-Z become tuple members.
                code ~= "values[" ~ itoa(c-'A') ~ "]";
                // add [i] if it was a vector
                if (isVector!(Values)[c-'A']) code ~= "[i]";
            } else code ~= c; // Everything else is unchanged
        return code ~ "; }";
    }
    template VEC(char[] expr) {
        void VEC(Values...)(Values values) {
            mixin( muggle!(expr, Values) );
        }
    }
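A minimal sketch of the same idea in current D2 follows, so that it can be compiled and run today. It assumes `string` in place of `char[]`, `std.conv.to` in place of the slides' `itoa`, and an inline indexability test in place of `isVector`. The structure otherwise follows the slide.

```d
import std.conv : to;

// Runs at compile time: turns "A+=B*C" into the source text of a loop.
string muggle(string expr, Values...)() {
    bool[] isVec;                              // one flag per argument type
    foreach (V; Values)
        isVec ~= is(typeof(V.init[0]));

    string code = "foreach (i; 0 .. values[0].length) { ";
    foreach (char c; expr) {
        if (c >= 'A' && c <= 'Z') {
            immutable size_t n = c - 'A';
            code ~= "values[" ~ n.to!string ~ "]";   // A-Z become tuple members
            if (isVec[n])
                code ~= "[i]";                       // vectors are indexed
        } else {
            code ~= c;                               // everything else unchanged
        }
    }
    return code ~ "; }";
}

template VEC(string expr) {
    void VEC(Values...)(Values values) {
        mixin(muggle!(expr, Values)());
    }
}

void main() {
    double[] a = [1.0, 2.0, 3.0];
    double[] b = [10.0, 20.0, 30.0];
    VEC!"A+=B*C"(a, b, 2.0);     // expands to: values[0][i]+=values[1][i]*values[2];
    assert(a == [21.0, 42.0, 63.0]);
}
```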
Trivial enhancements
• Ensure all vectors are the same length.
    foreach (int i, bool b; isVector!(Values)[1..$]) {
        // i counts from 0 within the slice, so the tuple index is i+1
        if (b) code ~= "assert(values[" ~ itoa(i+1) ~ "].length == values[0].length);";
    }
• Assert no aliasing (vectors don’t overlap).
• Match the hand-coded asm BLAS routines.
    static if ( expr == "A+=B*C" && is( Values[0] == double[] ) && is( Values[1] == double[] ) && is( Values[2] : double ) ) {
        return "DAXPY(values[0].length, values[0].ptr, values[1].ptr, values[2]);";
    }
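The length check can be sketched in current D2 as a compile-time function that returns the assertion text to be mixed in ahead of the loop. The name `lengthChecks` is hypothetical, and `std.conv.to` stands in for the slides' `itoa`.

```d
import std.conv : to;

// Hypothetical helper: emits one assert per vector argument after the first.
string lengthChecks(Values...)() {
    string code;
    foreach (i, V; Values) {
        // i is a compile-time constant here, so static if can test it.
        static if (i > 0 && is(typeof(V.init[0])))
            code ~= "assert(values[" ~ i.to!string
                  ~ "].length == values[0].length);";
    }
    return code;
}

// Scalars (the third argument here) produce no check.
static assert(lengthChecks!(double[], double[], double)() ==
    "assert(values[1].length == values[0].length);");

void main() {}
```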
Asm code via perturbation
• It’s hard to determine the optimal asm for an algorithm, much easier to modify existing code.
• Begin with Agner Fog’s optimal asm code for DAXPY. Use the same loop design and register allocation strategy.
• Ignore difficult cases – fall back to D code.
X87 (stack-based)
• Convert the infix expression into postfix. Split += into + and =.
• Swap operands to avoid FMUL latency.
    A += B - C * D  →  A = (A+B) - (C*D)  →  C D * A B + - A =
• Avoid gaps in the instruction set
  – Eg, fewer instructions for 80-bit reals, so load them first whenever possible.
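The infix-to-postfix step can be sketched with the standard shunting-yard algorithm, run at compile time. This is an assumption about how such a conversion might be written, not the talk's implementation: it handles only single-letter operands, the four arithmetic operators and parentheses, and it does not perform the operand swapping or the += splitting described above.

```d
// Hypothetical CTFE helper: converts e.g. "(A+B)-(C*D)" to "AB+CD*-".
string toPostfix(string infix) {
    static int prec(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            default: return 0;               // '(' and anything else
        }
    }
    string output;
    char[] stack;                            // pending operators
    foreach (char c; infix) {
        if (c >= 'A' && c <= 'Z') {
            output ~= c;                     // operands go straight to the output
        } else if (c == '(') {
            stack ~= c;
        } else if (c == ')') {
            while (stack.length && stack[$ - 1] != '(') {
                output ~= stack[$ - 1];
                stack = stack[0 .. $ - 1];
            }
            if (stack.length)
                stack = stack[0 .. $ - 1];   // discard the '('
        } else if (prec(c) > 0) {
            // Left-associative: pop operators of equal or higher precedence.
            while (stack.length && prec(stack[$ - 1]) >= prec(c)) {
                output ~= stack[$ - 1];
                stack = stack[0 .. $ - 1];
            }
            stack ~= c;
        }
    }
    while (stack.length) {
        output ~= stack[$ - 1];
        stack = stack[0 .. $ - 1];
    }
    return output;
}

static assert(toPostfix("(A+B)-(C*D)") == "AB+CD*-");
static assert(toPostfix("B*(A+D)+C") == "BAD+*C+");

void main() {}
```

A scheduler in the spirit of the slides would then reorder the postfix so that the multiply is issued first.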
X87 code generation
• Directly convert postfix to inline asm.
    VEC!("C+=B*(A+D)")(2213.3, vec1, floatvec, vec2);
    // Postfix : BAD+*C+C=
    L1: fld double ptr [EAX + 8*ESI];        // B
        fld double ptr [EAX + 8*ESI];        // A
        fadd double ptr [EDX + 8*ESI];       // D+
        fmulp ST(1), ST;                     // *
        fadd float ptr [ECX + 4*ESI];        // C+
        fxch ST(1), ST;
        fstp float ptr [ECX + 4*ESI - 4];    // C=
    L2: inc ESI;
        jnz L1;
SSE/SSE2 (register-based)
• Can’t do mixed-precision operations.
• Unroll loop by 2 or 4, to take advantage of SIMD.
• Instruction scheduling is less critical, but register allocation is more complicated than for x87.
GPGPU
• Use the GPU in modern video cards to perform massively parallel calculations.
• Uses OpenGL or DirectX calls, instead of inline asm.
• Full of hacks (pretend your data is a texture!) – but a rational API should emerge soon.
• This should NOT be built into a compiler!
Adding a front end
• Operator overloading
  – Same limitations as before
• Mixins, eg mixin(blade("firstvec+=secondvec*2.38"));
  – clumsy syntax BUT:
  – Can detect aliases
  – Allows better error messages
  – Can unroll small loops inline
  – Closer to proposed macro syntax
Front end using mixins
1. Lex: first += second * 2.38  →  A+=B*C.
2. Determine types, resolve aliases, convert constants to literals.
3. Determine precedence and associativity
4. Perform constant folding
• We can do most of this using mixins
• Compiler help is most required for 4
• __traits could help
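Step 1 can be sketched as a small compile-time lexer in current D2. The names `Lexed`, `isWordChar` and `lex` are hypothetical. The sketch assumes that identifiers and numeric literals consist of letters, digits, `_` and `.`, that there are at most 26 distinct symbols, and that a repeated symbol reuses its letter.

```d
struct Lexed {
    string expr;       // e.g. "A+=B*C"
    string[] symbols;  // symbols[0] is A, symbols[1] is B, ...
}

bool isWordChar(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_' || c == '.';
}

Lexed lex(string src) {
    Lexed result;
    size_t i = 0;
    while (i < src.length) {
        if (src[i] == ' ') { ++i; continue; }       // skip whitespace
        if (!isWordChar(src[i])) {                  // operator or bracket
            result.expr ~= src[i++];
            continue;
        }
        immutable size_t start = i;
        while (i < src.length && isWordChar(src[i])) ++i;
        string token = src[start .. i];
        size_t index = result.symbols.length;       // assume it is new
        foreach (k, s; result.symbols)
            if (s == token) { index = k; break; }   // seen before: reuse letter
        if (index == result.symbols.length)
            result.symbols ~= token;
        result.expr ~= cast(char)('A' + index);
    }
    return result;
}

static assert(lex("first += second * 2.38").expr == "A+=B*C");
static assert(lex("first += second * 2.38").symbols == ["first", "second", "2.38"]);
static assert(lex("a += b * a").expr == "A+=B*A");

void main() {}
```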
Determining types
    char[] getSymbolTable(char[][] symbols) {
        char[] result = "[";
        for (int i=0; i<symbols.length; ++i) {
            if (i>0) result ~= ",";
            result ~= "[typeof(" ~ symbols[i] ~ `).stringof, ` ~ symbols[i] ~ `.stringof]`;
        }
        result ~= "]";
        return result;
    }
• When mixed in, this creates an array[2] of string literals.
• [0] is the type, [1] is the value
Determining precedence
    class AST(char[] expr) {
        alias expr text;
        AST!("(" ~ text ~ "+" ~ T.text ~ ")") opAdd(T)(T x) { return null; }
        AST!("(" ~ text ~ "*" ~ T.text ~ ")") opMul(T)(T x) { return null; }
        AST!( text ~ "([" ~ T.text ~ "])" ) opIndex(T)(T x) { return null; }
    }
    char[] getPrecedence(char[] expr) {
        char[] code = "typeof(";
        for (int i=0; i<expr.length; ++i) {
            if (expr[i]>='A' && expr[i]<='Z')
                code ~= "(cast(AST!(`" ~ expr[i] ~ "`))(null))";
            else code ~= expr[i];
        }
        return code ~ ").text";
    }
    mixin(getPrecedence("A+B*C*D"))  →  "A+((B*C)*D)"
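The same trick can be sketched in current D2, where `opBinary` replaces `opAdd` and `opMul`. This is an adaptation under assumptions rather than the talk's code: it uses `string` instead of `char[]`, covers every binary operator with one template, and wraps each operation in parentheses, so the result carries an outer pair as well.

```d
// The compiler's own parser decides precedence; the result type records it.
class AST(string expr) {
    enum text = expr;
    // Never executed: only the return type matters, inside typeof().
    AST!("(" ~ text ~ op ~ T.text ~ ")") opBinary(string op, T)(T rhs) {
        return null;
    }
}

// Builds: typeof( (cast(AST!(`A`))(null)) + ... ).text
string getPrecedence(string expr) {
    string code = "typeof(";
    foreach (char c; expr) {
        if (c >= 'A' && c <= 'Z')
            code ~= "(cast(AST!(`" ~ c ~ "`))(null))";
        else
            code ~= c;
    }
    return code ~ ").text";
}

static assert(mixin(getPrecedence("A+B*C*D")) == "(A+((B*C)*D))");
static assert(mixin(getPrecedence("A-B-C")) == "((A-B)-C)");

void main() {}
```

No precedence table is written anywhere in the library code, since precedence and associativity come from the compiler's expression grammar.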
Conclusion
• Implementation and syntactic issues remain
  – Syntax for runtime and compile-time reflection
  – Macros, and an extended __traits syntax should help.
  – How to clean up mixin(), yet retain its power?
• Yet perfectly optimal code is already possible. Libraries can perform optimisations that previously required a compiler back-end.