Fundamentals of High Performance Programming
[email protected] www.netlib.org/atlas
R. Clint Whaley, ATLAS group, Innovative Computing Lab, University of Tennessee
Outline of Talk
• Intro to examples
• Optimization considerations
• Overview of memory and computational opt
• Memory opt in detail
  – Memory hierarchy
  – Cache basics
  – Mem opt examples and analysis
• Computational opt in detail
  – FPU basics
  – Computational optimization techniques with examples
• Optimized examples
• Escape before audience recovers
Basic Analysis Definitions
• Number of memory references
• Number of FLOPS (floating point operations)
• Number of input data words
  – Get reuse when number of memory references is greater than number of input data words
Example Operations

DDOT: Dot product of vectors X & Y
    for (dot=0.0, i=0; i < N; i++)
        dot += x[i] * y[i];
  2N FLOPS, 2N mem ref, 2N data

GEMM: matmul C += A*B
    for (j=0; j < N; j++)
        for (i=0; i < N; i++)
            for (k=0; k < N; k++)
                C[i+j*ldc] += A[i+k*lda] * B[k+j*ldb];
  2N^3 FLOPS, 4N^3 mem ref, 3N^2 data
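The two example operations can be written as complete functions. This is a minimal sketch, assuming column-major storage as implied by the `i+j*ldc` indexing on the slide; the names `ddot_ref` and `gemm_ref` are hypothetical and are not part of ATLAS.

```c
#include <assert.h>
#include <stddef.h>

/* DDOT: dot product of x and y. 2N flops over 2N input words, no reuse. */
double ddot_ref(size_t n, const double *x, const double *y)
{
    double dot = 0.0;
    for (size_t i = 0; i < n; i++)
        dot += x[i] * y[i];
    return dot;
}

/* GEMM: C += A*B for N x N column-major matrices.
 * 2N^3 flops and 4N^3 memory references over only 3N^2 input words. */
void gemm_ref(size_t n, const double *A, size_t lda,
              const double *B, size_t ldb, double *C, size_t ldc)
{
    assert(lda >= n && ldb >= n && ldc >= n);
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                C[i + j*ldc] += A[i + k*lda] * B[k + j*ldb];
}
```

Because GEMM touches each input word about N times, it is the operation where the blocking techniques later in the talk pay off; DDOT touches each word once, so only the fetch pattern can be tuned.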
Optimization Considerations
• Remember, two ways to optimize:
  – Change algorithm
  – Performance tuning
• 90/10 rule
  – Use profiling to find kernels
• Ask if identified kernels can be recast as an already optimized kernel such as GEMM
• Remember: “Optimized for performance” and “portable/maintainable” are antonyms
Hand-tuned Optimization Facts
• Purely a priori optimization is a pipe dream
  – Hardware component interaction difficult enough to predict
  – Compiler, OS, ISA get in the way of hardware
• Will rewrite kernel for every new arch
  – Use assembler sparingly
  – Try to find collaborators who also use kernel
• The more you hand-optimize, the less the compiler can do for you
Two Types of Optimization
• Memory/cache optimization
  – Theoretical peak: (bus width) * (bus speed)
    • PII: (32 bits) * (66 MHz) = 264 MB/s = 33 MW/s
    • Athlon: (64 bits) * (200 MHz) = 1600 MB/s = 200 MW/s
    • Power3: (128 bits) * (100 MHz) = 1600 MB/s = 200 MW/s
• Computational optimizations
  – Theoretical peak: (# fpus) * (flops/cycle) * MHz
    • PII: (1 fpu) * (1 flop/cycle) * (450 MHz) = 450 MFLOPS
    • Athlon: (2 fpu) * (1 flop/cycle) * (600 MHz) = 1200 MFLOPS
    • Power3: (2 fpu) * (2 flops/cycle) * (375 MHz) = 1500 MFLOPS
• Memory at least an order of magnitude slower
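The peak figures above are simple products, which can be checked with a few lines of C. This is a sketch only: the function names are hypothetical, and MW/s is taken to mean millions of 8-byte double-precision words per second, which is what the slide's numbers imply (264 MB/s = 33 MW/s).

```c
#include <assert.h>

/* Peak bus bandwidth in MB/s: (bus width in bytes) * (bus speed in MHz). */
double peak_mbytes_per_s(int bus_bits, double bus_mhz)
{
    assert(bus_bits > 0 && bus_bits % 8 == 0);
    return (bus_bits / 8) * bus_mhz;
}

/* The same peak in millions of 8-byte words per second (assumed unit). */
double peak_mwords_per_s(int bus_bits, double bus_mhz)
{
    return peak_mbytes_per_s(bus_bits, bus_mhz) / 8.0;
}

/* Peak floating point rate in MFLOPS: (# fpus) * (flops/cycle) * MHz. */
double peak_mflops(int nfpu, int flops_per_cycle, double cpu_mhz)
{
    return nfpu * flops_per_cycle * cpu_mhz;
}
```

Dividing `peak_mflops` by `peak_mwords_per_s` gives the number of flops the FPU could retire in the time one word arrives from memory, which is the gap that reuse has to cover.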
Memory Hierarchy
Level            Latency      Size
Registers        1            8, 16, or 32 words
Level 1 Cache    4-16         8-128 KB (1-16 KW)
Level 2 Cache    80-100       512 KB – 8 MB (64 KW – 1 MW)
Main Memory      500          .5-8 GB
Disk             Millertime   X TB
Cache Basics
• Write policy (write-through, non-wt)
  – Writes more expensive than reads
• Replacement policy (LRU)
• Line size (2-8 dp words)
• Associativity
• Level (1-3)
• Separate or combined data/instruction
• Latency between levels
• Bandwidth between levels
• # of outstanding requests before blocking
• Prefetch strategy/units
• TLB: Translation Lookaside Buffer
  – Virtual-to-physical memory address buffer
  – >= 32 vm pages
Cache Basics
• Do:
  – Start on cache line boundary
  – Use entire line
  – Make stride (lda) a multiple of cache line size
  – Issue as many nonblocking fetches as cache supports
  – If you have reuse:
    • Block for cache size
    • Copy to contiguous storage
• Don’t:
  – Use power of 2 for non-contiguous matrix stride (lda)
  – Access strided memory
  – Access more separate memory locations than the TLB can map
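The "copy to contiguous storage" advice can be illustrated with a small packing routine. This is a minimal sketch assuming column-major storage; `copy_block` is a hypothetical helper, not an ATLAS routine. After the copy, the block has stride `mb` instead of `lda`, so it occupies whole cache lines and few TLB pages regardless of how the original matrix was laid out.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the mb x nb block whose top-left element is A[0] (column-major,
 * leading dimension lda) into buf, packed with leading dimension mb. */
void copy_block(size_t mb, size_t nb, const double *A, size_t lda,
                double *buf)
{
    assert(lda >= mb);
    for (size_t j = 0; j < nb; j++)
        for (size_t i = 0; i < mb; i++)
            buf[i + j*mb] = A[i + j*lda];
}
```

The copy costs one extra pass over the block, so it only pays when the block is reused many times afterward, as in GEMM.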
Memory optimized DDOT
• 2N flops, 2N data fetches (X and Y not reused)
• Can still unroll for outstanding fetches and multiple prefetch units

Straightforward loop:
    for (dot=0.0, i=0; i < N; i++)
        dot += x[i] * y[i];

4 fetches:
    /* assumes N is a multiple of 8; result is dot0 + dot1 */
    for (dot0=dot1=0.0, i=0; i < N; i += 8)
    {
        dot0 += x[0] * y[0]; dot1 += x[4] * y[4];
        dot0 += x[1] * y[1]; dot1 += x[5] * y[5];
        dot0 += x[2] * y[2]; dot1 += x[6] * y[6];
        dot0 += x[3] * y[3]; dot1 += x[7] * y[7];
        x += 8; y += 8;
    }

4 prefetch units:
    /* assumes N is even; result is dot0 + dot1 */
    n2 = N / 2;
    xx = x + n2; yy = y + n2;
    for (dot0=dot1=0.0, i=0; i < n2; i++)
    {
        dot0 += x[i] * y[i];
        dot1 += xx[i] * yy[i];
    }
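The two memory-oriented DDOT variants assume N is a multiple of 8 or of 2. The sketch below adds cleanup code so they accept any N; the function names are hypothetical, and whether either form is faster than the plain loop depends on the machine's line size and prefetch hardware, which is not something this example can establish.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled by 8 with two accumulators: each pass touches x[i] and x[i+4]
 * (and the same for y), so up to four line fetches can be outstanding
 * when a cache line holds four doubles (an assumption). */
double ddot_unroll8(size_t n, const double *x, const double *y)
{
    const size_t n8 = n - n % 8;
    double dot0 = 0.0, dot1 = 0.0;
    size_t i;
    for (i = 0; i < n8; i += 8) {
        dot0 += x[i]   * y[i];   dot1 += x[i+4] * y[i+4];
        dot0 += x[i+1] * y[i+1]; dot1 += x[i+5] * y[i+5];
        dot0 += x[i+2] * y[i+2]; dot1 += x[i+6] * y[i+6];
        dot0 += x[i+3] * y[i+3]; dot1 += x[i+7] * y[i+7];
    }
    for (; i < n; i++)              /* cleanup for n not a multiple of 8 */
        dot0 += x[i] * y[i];
    return dot0 + dot1;
}

/* Walk both halves of each vector at once: four independent sequential
 * streams, one per hardware prefetch unit. */
double ddot_split(size_t n, const double *x, const double *y)
{
    const size_t n2 = n / 2;
    const double *xx = x + n2, *yy = y + n2;
    double dot0 = 0.0, dot1 = 0.0;
    for (size_t i = 0; i < n2; i++) {
        dot0 += x[i]  * y[i];
        dot1 += xx[i] * yy[i];
    }
    if (n & 1)                      /* odd n: last element is left over */
        dot1 += x[n-1] * y[n-1];
    return dot0 + dot1;
}
```

Both variants change the order of the floating point additions, so results can differ from the plain loop in the last bits for general data.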
Mem opt for GEMM
(RA, RB, RC = read of A, B, C; WC = write of C)
• (N^2 elts of C) * (N adds + N mults) = 2N^3 FLOPS
• (N^2 elts of C) * (N (RA RB RC WC)) = 3N^3 R + N^3 W = 4N^3 mem ref
• Since there are 4N^3 mem ref, but only 3N^2 data words, reuse is possible
• # of mem ref irreducible, but # that hit main mem is not
• Number of main mem ref:
  – [3N^2, 4N^3]
• Can block for reuse at each level of mem hierarchy
• Cache blocking can be varied thru parameterization
• Register blocking requires differing codes to vary
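The point that cache blocking can be varied through a parameter can be sketched as follows. This is an illustrative loop nest, not the ATLAS kernel: the block size `nb` is a run-time argument, the function name is hypothetical, and storage is assumed column-major as in the earlier GEMM example.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A*B on N x N column-major matrices, computed one nb x nb block
 * at a time so that the three active blocks can stay in cache. */
void gemm_blocked(size_t n, size_t nb,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    assert(nb > 0);
    for (size_t jj = 0; jj < n; jj += nb) {
        const size_t jend = min_sz(jj + nb, n);
        for (size_t kk = 0; kk < n; kk += nb) {
            const size_t kend = min_sz(kk + nb, n);
            for (size_t ii = 0; ii < n; ii += nb) {
                const size_t iend = min_sz(ii + nb, n);
                /* multiply one block of A by one block of B */
                for (size_t j = jj; j < jend; j++)
                    for (size_t i = ii; i < iend; i++) {
                        double c = C[i + j*ldc];
                        for (size_t k = kk; k < kend; k++)
                            c += A[i + k*lda] * B[k + j*ldb];
                        C[i + j*ldc] = c;
                    }
            }
        }
    }
}
```

Changing `nb` retunes the code for a different cache size without touching the loops, whereas changing a register block means rewriting the innermost body, which is the contrast the last two bullets draw.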
Register Blocking
• Typically 8-32 registers, so can keep only one array in reg
• Remember, cost is: N^3 RA + N^3 RB + (N^3 RC + N^3 WC)
• So C has at least twice the cost of A or B, so put K as inner loop
  – No reuse along K of A & B, so reuse must come within register block
• Can use differing number of registers for A & B, but near-square blocks are theoretically superior, so for simplicity let the number of registers blocking A and B be N0
• Main mem access now:
  – N^2 RC + N^2 WC + (N^3/N0) RA + (N^3/N0) RB
Ex.: 2x2 Register Blocking
    /* assumes N is even */
    for (j=0; j < N; j += 2)
    {
        for (i=0; i < N; i += 2)
        {
            c00 = C[i+j*ldc];     c01 = C[i+(j+1)*ldc];
            c10 = C[i+1+j*ldc];   c11 = C[i+1+(j+1)*ldc];
            for (k=0; k < N; k++)
            {
                a0 = A[i+k*lda];  a1 = A[i+1+k*lda];
                b0 = B[k+j*ldb];  b1 = B[k+(j+1)*ldb];
                c00 += a0 * b0;   c10 += a1 * b0;
                c01 += a0 * b1;   c11 += a1 * b1;
            }
            C[i+j*ldc] = c00;     C[i+(j+1)*ldc] = c01;
            C[i+1+j*ldc] = c10;   C[i+1+(j+1)*ldc] = c11;
        }
    }
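A complete version of the 2x2 register-blocked loop needs cleanup code when N is odd. The sketch below is one way to do that, under the same column-major assumption; the function name and the choice to finish the last row and column with a plain dot-product loop are illustrative, not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* C += A*B with a 2x2 register block: each a and b value loaded in the
 * k-loop is used twice, halving the loads of A and B (N0 = 2). */
void gemm_rb2x2(size_t n, const double *A, size_t lda,
                const double *B, size_t ldb, double *C, size_t ldc)
{
    const size_t n2 = n - (n & 1);      /* largest even number <= n */
    size_t i, j, k;

    for (j = 0; j < n2; j += 2) {
        for (i = 0; i < n2; i += 2) {
            double c00 = C[i + j*ldc],   c01 = C[i + (j+1)*ldc];
            double c10 = C[i+1 + j*ldc], c11 = C[i+1 + (j+1)*ldc];
            for (k = 0; k < n; k++) {
                const double a0 = A[i + k*lda], a1 = A[i+1 + k*lda];
                const double b0 = B[k + j*ldb], b1 = B[k + (j+1)*ldb];
                c00 += a0 * b0; c10 += a1 * b0;
                c01 += a0 * b1; c11 += a1 * b1;
            }
            C[i + j*ldc] = c00;   C[i + (j+1)*ldc] = c01;
            C[i+1 + j*ldc] = c10; C[i+1 + (j+1)*ldc] = c11;
        }
    }
    if (n & 1) {                        /* odd n: finish last row and column */
        const size_t l = n - 1;
        for (j = 0; j < n; j++) {       /* last row, every column */
            double c = C[l + j*ldc];
            for (k = 0; k < n; k++)
                c += A[l + k*lda] * B[k + j*ldb];
            C[l + j*ldc] = c;
        }
        for (i = 0; i < l; i++) {       /* last column, remaining rows */
            double c = C[i + l*ldc];
            for (k = 0; k < n; k++)
                c += A[i + k*lda] * B[k + l*ldb];
            C[i + l*ldc] = c;
        }
    }
}
```

The local scalars `c00` through `c11` are what let the compiler keep the block in registers; writing through `C[...]` inside the k-loop would force a store on every iteration.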
Blocking for Level 1 Cache
• Register block for dot product to C access:
  – ((2N^3)/N0 + N^2) R + N^2 W
• Assume N0^2 + 2N N0 <= L1, access reduces:
  – (N^2 RA + (N^3/N0) RB + N^2 RC) + N^2 WC
  – Still cubic access, mem cost dominant
• So we need to explicitly block for L1
Blocking for Level 1 Cache
(RM/WM = main memory reads/writes; R1/W1 and R2/W2 = L1 and L2 reads/writes)
• Choose blocking factor N1, such that:
  – 3 N1^2 <= L1
    • (2N^3/N1 + N^2) RM + N^2 WM +
      (2N^3/N0) R1 + (N^3/N1)(R1 + W1)
  – N^2 + N1 N0 + N0^2 <= L1
    • (2N^3/N1 + N^2) RM + N^2 WM +
      (3N^3/N1) R2 + (N^3/N1) W2 +
      (2N^3/N0) R1 + (N^3/N1)(R1 + W1)
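The first condition, 3 N1^2 <= L1, fixes the largest usable blocking factor once the cache size is known. The helper below is a sketch of that arithmetic; it assumes L1 is expressed in double-precision words and that the factor 3 stands for one N1 x N1 block each of A, B and C. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Largest N1 with 3*N1*N1 <= l1_words, found by integer search so the
 * result is exact and no floating point square root is needed. */
size_t l1_block_factor(size_t l1_words)
{
    size_t n1 = 0;
    while (3 * (n1 + 1) * (n1 + 1) <= l1_words)
        n1++;
    return n1;
}
```

For example, a 16 KB L1 cache holds 2048 doubles, for which the function returns 26, since 3 * 26^2 = 2028 fits and 3 * 27^2 = 2187 does not. In practice the best N1 is usually found by timing several candidates near this bound rather than trusting the formula alone.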
Blocking for Level 2 Cache
• Implicit blocking occurs if
  – 2N N1 + N1^2 <= L2
  – Main memory access reduced from:
    • (2N^3/N1 + N^2) R + N^2 W
  – to:
    • (N^3/N1 + 2N^2) R + N^2 W
• Can do explicit L2 block by cutting K so:
  – 2 N0 N1 + N1^2 <= L2
• Or can block all dimensions with N2 just as with N1 for L1
• Can do same for arbitrary levels of cache
FPU Basics

• Number of FPUs
• Pipelined or lame
• Repeat rate (1 cycle or not fully pipelined)
• Instruction type:
  – muladd
  – multiply - add
  – multiply or add

    double ddot(const int N, const double *X, const double *Y)
    {
        const int N4 = (N>>2)<<2, nr = N-N4;   /* N4: N rounded down to a multiple of 4 */
        const double *stX = X+N4;
        register double dot=0.0;

        if (N > 0)
        {
            if (N4)   /* skip the unrolled loop when N < 4 */
            {
                do
                {
                    dot += *X * *Y;
                    dot += X[1] * Y[1];
                    dot += X[2] * Y[2];
                    dot += X[3] * Y[3];
                    X += 4; Y += 4;
                }
                while(X != stX);
            }
            if (nr)   /* remaining 1-3 elements */
            {
                stX += nr;
                do { dot += *X++ * *Y++; } while (X != stX);
            }
        }
        return(dot);
    }
Computational Optimization
• Use “const” if variable not changing
• Use bitwise shift to avoid integer mult and div
• Tmp arrays should come from heap not stack
• Eliminate unused loop vars – ptr controlled loops
• Unroll loop for reduced loop overhead
• Increment ptrs only at end of loop
Computational Optimization

• Never use < or > if you can safely use != or ==
  – for (i=0; i < N; i++)
  – for (i=0; i != N; i++)
  – for (i=N; i; i--)
• Branch prediction usually guesses “true”
  – In if/else, put most common case first
  – Use do-while not while

Max of a vector:
    for (max=0.0, i=0; i < N; i++)
        if (X[i] > max) max = X[i];

Becomes:
    const double *stX = X + N;   /* one past the last element */
    if (N > 0)
    {
        max = *X;
        if (N != 1)
        {
            X++;
            do
            {
                if (*X <= max) continue;   /* common case first */
                else max = *X;
            }
            while(++X != stX);
        }
    }
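Wrapped in a function, the pointer-controlled max loop looks like the sketch below. The function name and the choice to return 0.0 for an empty vector are assumptions made for the example. Note that this form seeds `max` with the first element, so unlike the `max=0.0` loop it gives the true maximum for an all-negative vector.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum element of X[0..n-1]; returns 0.0 when n == 0 (assumption). */
double vec_max(size_t n, const double *X)
{
    double max = 0.0;
    if (n > 0) {
        const double *stX = X + n;      /* one past the last element */
        max = *X;
        if (n != 1) {
            X++;
            do {
                if (*X <= max) continue;    /* common case listed first */
                else max = *X;
            } while (++X != stX);           /* != test, no loop counter */
        }
    }
    return max;
}
```

The `continue` inside a do-while jumps to the loop condition, so `++X` still executes on every pass.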
Computational Optimization

• If operating on multiple memory addresses separated by a non-compile-time constant, use multiple ptrs
• Use local vars (registers) to avoid aliasing problems and provide register blocking
• Do not recompute intermediate computations

    /* straightforward version */
    for (j=0; j != (N/2)*2; j += 2)
    {
        for (i=0; i != N; i++)
        {
            for (k=0; k != N; k++)
            {
                C[i+j*ldc] += A[k*lda+i] * B[k+j*ldb];
                C[i+(j+1)*ldc] += A[k*lda+i] * B[k+(j+1)*ldb];
            }
        }
    }

    /* with multiple pointers, local variables, and precomputed increments */
    const int n2 = (N>>1)<<1, incC=(ldc<<1), incB=ldb<<1, incA=1-N*lda;
    double *pC0=C, *pC1=C+ldc;
    const double *pB0=B, *pB1=B+ldb, *pA0=A;
    register double rA0, rB0, rB1, rC0, rC1;

    for (j=0; j != n2; j += 2)
    {
        for (i=0; i != N; i++)
        {
            rC0 = pC0[i]; rC1 = pC1[i];
            for (k=0; k != N; k++)
            {
                rA0 = *pA0; pA0 += lda;
                rB0 = pB0[k]; rB1 = pB1[k];
                rC0 += rA0 * rB0; rC1 += rA0 * rB1;
            }
            pC0[i] = rC0; pC1[i] = rC1;
            pA0 += incA;   /* back to the first column of A, next row */
        }
        pA0 = A;
        pC0 += incC; pC1 += incC;
        pB0 += incB; pB1 += incB;
    }
Simple MULADD Pipelining
• For each FPU, use at least pipelen accumulators to avoid pipe stalls

    for (i=0; i < N; i++) dot += X[i] * Y[i];

4 length pipeline example:
    /* n4 is N rounded down to a multiple of 4, stX = X + n4;
       the result is dot0 + dot1 + dot2 + dot3 */
    register double dot0=0.0, dot1=0.0, dot2=0.0, dot3=0.0;
    if (n4)
    {
        do
        {
            dot0 += *X * *Y;
            dot1 += X[1] * Y[1];
            dot2 += X[2] * Y[2];
            dot3 += X[3] * Y[3];
            X += 4; Y += 4;
        }
        while (X != stX);
    }
Pipelining/loop skewing for separate mult/add units
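    /* assumes stX = X + n4, with n4 a multiple of 4 and n4 >= 8 */
    register double m0, m1, m2, m3;
    register double dot0=0.0, dot1=0.0, dot2=0.0, dot3=0.0;
    /* start pipeline: issue the first four multiplies */
    m0 = *X * *Y; m1 = X[1] * Y[1]; m2 = X[2] * Y[2]; m3 = X[3] * Y[3];
    X += 4; Y += 4;
    do
    {
        /* add the previous product while the next multiply is issued */
        dot0 += m0; m0 = *X * *Y;
        dot1 += m1; m1 = X[1] * Y[1];
        dot2 += m2; m2 = X[2] * Y[2];
        dot3 += m3; m3 = X[3] * Y[3];
        X += 4; Y += 4;
    }
    while (X != stX);
    /* drain pipeline */
    dot0 += m0; dot1 += m1; dot2 += m2; dot3 += m3;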
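The loop-skewed fragment can be completed into a function that accepts any N. This is a sketch under stated assumptions: the function name is hypothetical, a `while` loop replaces the `do-while` so that a vector of exactly four elements does not run past its end, and the leftover one to three elements are handled by a scalar loop.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with multiplies and adds skewed by one iteration, for
 * machines with separate multiply and add units. */
double ddot_skewed(size_t n, const double *X, const double *Y)
{
    const size_t n4 = n - n % 4;        /* n rounded down to a multiple of 4 */
    double dot0 = 0.0, dot1 = 0.0, dot2 = 0.0, dot3 = 0.0;

    if (n4) {
        const double *stX = X + n4;
        /* start pipeline: first four products */
        double m0 = X[0] * Y[0], m1 = X[1] * Y[1];
        double m2 = X[2] * Y[2], m3 = X[3] * Y[3];
        X += 4; Y += 4;
        while (X != stX) {
            /* add the previous product, issue the next multiply */
            dot0 += m0; m0 = X[0] * Y[0];
            dot1 += m1; m1 = X[1] * Y[1];
            dot2 += m2; m2 = X[2] * Y[2];
            dot3 += m3; m3 = X[3] * Y[3];
            X += 4; Y += 4;
        }
        /* drain pipeline */
        dot0 += m0; dot1 += m1; dot2 += m2; dot3 += m3;
    }
    for (size_t r = n % 4; r; r--)      /* X, Y now point at the remainder */
        dot0 += *X++ * *Y++;
    return (dot0 + dot1) + (dot2 + dot3);
}
```

On hardware with a fused muladd, or with a compiler that schedules well, the skewing may bring nothing; it is meant for the separate multiply and add unit case named in the slide title.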
void ATL_dJIK60x60x60TN60x60x0_a1_b1
   (const int M, const int N, const int K, const double alpha,
    const double *A, const int lda, const double *B, const int ldb,
    const double beta, double *C, const int ldc)
/*
 * matmul with TA=T, TB=N, MB=60, NB=60, KB=60,
 * lda=60, ldb=60, ldc=0, mu=6, nu=3, ku=1
 */
{
   const double *stM = A + 3600;
   const double *stN = B + 3600;
   #define incAk 1
   const int incAm = 300, incAn = -3600;
   #define incBk 1
   const int incBm = -60, incBn = 180;
   #define incCm 6
   const int incCn = (((ldc) << 1)+ldc) - 60;
   double *pC0=C, *pC1=pC0+(ldc), *pC2=pC1+(ldc);
   const double *pA0=A;
   const double *pB0=B;
   register int k;
   register double rA0, rA1, rA2, rA3, rA4, rA5;
   register double rB0, rB1, rB2;
   register double rC00, rC10, rC20, rC30, rC40, rC50,
                   rC01, rC11, rC21, rC31, rC41, rC51,
                   rC02, rC12, rC22, rC32, rC42, rC52;

   do /* N-loop */
   {
      do /* M-loop */
      {
         rC00 = *pC0; rC10 = pC0[1]; rC20 = pC0[2];
         rC30 = pC0[3]; rC40 = pC0[4]; rC50 = pC0[5];
         rC01 = *pC1; rC11 = pC1[1]; rC21 = pC1[2];
         rC31 = pC1[3]; rC41 = pC1[4]; rC51 = pC1[5];
         rC02 = *pC2; rC12 = pC2[1]; rC22 = pC2[2];
         rC32 = pC2[3]; rC42 = pC2[4]; rC52 = pC2[5];

         for (k=60; k; k--) /* easy loop to unroll */
         {
            rB0 = *pB0; rB1 = pB0[60]; rB2 = pB0[120];
            rA0 = *pA0; rA1 = pA0[60]; rA2 = pA0[120];
            rA3 = pA0[180]; rA4 = pA0[240]; rA5 = pA0[300];
            rC00 += rA0 * rB0; rC10 += rA1 * rB0; rC20 += rA2 * rB0;
            rC30 += rA3 * rB0; rC40 += rA4 * rB0; rC50 += rA5 * rB0;
            rC01 += rA0 * rB1; rC11 += rA1 * rB1; rC21 += rA2 * rB1;
            rC31 += rA3 * rB1; rC41 += rA4 * rB1; rC51 += rA5 * rB1;
            rC02 += rA0 * rB2; rC12 += rA1 * rB2; rC22 += rA2 * rB2;
            rC32 += rA3 * rB2; rC42 += rA4 * rB2; rC52 += rA5 * rB2;
            pA0 += incAk; pB0 += incBk;
         }

         *pC0 = rC00; pC0[1] = rC10; pC0[2] = rC20;
         pC0[3] = rC30; pC0[4] = rC40; pC0[5] = rC50;
         *pC1 = rC01; pC1[1] = rC11; pC1[2] = rC21;
         pC1[3] = rC31; pC1[4] = rC41; pC1[5] = rC51;
         *pC2 = rC02; pC2[1] = rC12; pC2[2] = rC22;
         pC2[3] = rC32; pC2[4] = rC42; pC2[5] = rC52;
         pC0 += incCm; pC1 += incCm; pC2 += incCm;
         pA0 += incAm; pB0 += incBm;
      }
      while(pA0 != stM);
      pC0 += incCn; pC1 += incCn; pC2 += incCn;
      pA0 += incAn; pB0 += incBn;
   }
   while(pB0 != stN);
}
   do /* N-loop */
   {
      do /* M-loop */
      {
         rC00 = *pC0; rC10 = pC0[1];
/*
 *       Start pipeline
 */
         rA0 = *pA0; rB0 = *pB0; rA1 = pA0[40];
         m0 = rA0 * rB0; m1 = rA1 * rB0;
         rA0 = pA0[1]; rB0 = pB0[1]; rA1 = pA0[41];
         m2 = rA0 * rB0; m3 = rA1 * rB0;
         rA0 = pA0[2]; rB0 = pB0[2]; rA1 = pA0[42];
         m4 = rA0 * rB0;
/*
 *       Completely unrolled K-loop
 */
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[3]; rB0 = pB0[3]; rA1 = pA0[43];
         rC10 += m1; m1 = rA0 * rB0; rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[4]; rB0 = pB0[4]; rA1 = pA0[44];
         rC10 += m3; m3 = rA0 * rB0; rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[5]; rB0 = pB0[5]; rA1 = pA0[45];
         rC10 += m0; m0 = rA0 * rB0; rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[6]; rB0 = pB0[6]; rA1 = pA0[46];
         rC10 += m2; m2 = rA0 * rB0; rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[7]; rB0 = pB0[7]; rA1 = pA0[47];
         rC10 += m4; m4 = rA0 * rB0; rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[8]; rB0 = pB0[8]; rA1 = pA0[48];
         rC10 += m1; m1 = rA0 * rB0; rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[9]; rB0 = pB0[9]; rA1 = pA0[49];
         rC10 += m3; m3 = rA0 * rB0; rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[10]; rB0 = pB0[10]; rA1 = pA0[50];
         rC10 += m0; m0 = rA0 * rB0; rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[11]; rB0 = pB0[11]; rA1 = pA0[51];
         rC10 += m2; m2 = rA0 * rB0; rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[12]; rB0 = pB0[12]; rA1 = pA0[52];
         rC10 += m4; m4 = rA0 * rB0; rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[13]; rB0 = pB0[13]; rA1 = pA0[53];
         rC10 += m1; m1 = rA0 * rB0; rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[14]; rB0 = pB0[14]; rA1 = pA0[54];
         rC10 += m3; m3 = rA0 * rB0; rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[15]; rB0 = pB0[15]; rA1 = pA0[55];
         rC10 += m0; m0 = rA0 * rB0; rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[16]; rB0 = pB0[16]; rA1 = pA0[56];
         rC10 += m2; m2 = rA0 * rB0; rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[17]; rB0 = pB0[17]; rA1 = pA0[57];
         rC10 += m4; m4 = rA0 * rB0; rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[18]; rB0 = pB0[18]; rA1 = pA0[58];
         rC10 += m1; m1 = rA0 * rB0; rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[19]; rB0 = pB0[19]; rA1 = pA0[59];
         rC10 += m3; m3 = rA0 * rB0; rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[20]; rB0 = pB0[20]; rA1 = pA0[60];
         rC10 += m0; m0 = rA0 * rB0; rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[21]; rB0 = pB0[21]; rA1 = pA0[61];
         rC10 += m2; m2 = rA0 * rB0; rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[22]; rB0 = pB0[22]; rA1 = pA0[62];
         rC10 += m4; m4 = rA0 * rB0; rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[23]; rB0 = pB0[23]; rA1 = pA0[63];
         rC10 += m1; m1 = rA0 * rB0; rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[24]; rB0 = pB0[24]; rA1 = pA0[64];
         rC10 += m3; m3 = rA0 * rB0; rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[25]; rB0 = pB0[25]; rA1 = pA0[65];
         rC10 += m0; m0 = rA0 * rB0; rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[26]; rB0 = pB0[26]; rA1 = pA0[66];
         rC10 += m2; m2 = rA0 * rB0; rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[27]; rB0 = pB0[27]; rA1 = pA0[67];
         rC10 += m4; m4 = rA0 * rB0; rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[28]; rB0 = pB0[28]; rA1 = pA0[68];
         rC10 += m1; m1 = rA0 * rB0; rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[29]; rB0 = pB0[29]; rA1 = pA0[69];
         rC10 += m3; m3 = rA0 * rB0; rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[30]; rB0 = pB0[30]; rA1 = pA0[70];
         rC10 += m0; m0 = rA0 * rB0; rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[31]; rB0 = pB0[31]; rA1 = pA0[71];
         rC10 += m2; m2 = rA0 * rB0; rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[32]; rB0 = pB0[32]; rA1 = pA0[72];
         rC10 += m4; m4 = rA0 * rB0; rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[33]; rB0 = pB0[33]; rA1 = pA0[73];
         rC10 += m1; m1 = rA0 * rB0; rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[34]; rB0 = pB0[34]; rA1 = pA0[74];
         rC10 += m3; m3 = rA0 * rB0; rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[35]; rB0 = pB0[35]; rA1 = pA0[75];
         rC10 += m0; m0 = rA0 * rB0; rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[36]; rB0 = pB0[36]; rA1 = pA0[76];
         rC10 += m2; m2 = rA0 * rB0; rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[37]; rB0 = pB0[37]; rA1 = pA0[77];
         rC10 += m4; m4 = rA0 * rB0; rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[38]; rB0 = pB0[38]; rA1 = pA0[78];
         rC10 += m1; m1 = rA0 * rB0; rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[39]; rB0 = pB0[39]; rA1 = pA0[79];
         rC10 += m3; m3 = rA0 * rB0;
/*
 *       Drain pipe on last iteration of K-loop
 */
         rC00 += m4; m4 = rA1 * rB0;
         rC10 += m0; rC00 += m1; rC10 += m2; rC00 += m3; rC10 += m4;
         pA0 += incAk; pB0 += incBk;
         *pC0 = rC00; pC0[1] = rC10;
         pC0 += incCm; pA0 += incAm; pB0 += incBm;
      }
      while(pA0 != stM);
      pC0 += incCn; pA0 += incAn; pB0 += incBn;
   }
   while(pB0 != stN);