Applying Data Copy To Improve Memory Performance of General Array Computations Qing Yi
University of Texas at San Antonio
Improving Memory Performance
[Figure: memory hierarchy -- two processors, each with registers, cache, and memory, connected through shared memory over a network connection]
- Locality optimizations
- Parallelization
- Communication optimizations
Compiler Optimizations For Locality
- Computation optimizations
  - Loop blocking, fusion, unroll-and-jam, interchange, unrolling
  - Rearrange computations for better spatial and temporal locality
- Data-layout optimizations
  - Static layout transformation: a single layout for the entire application; no additional overhead, but a tradeoff between different layout choices
  - Dynamic layout transformation: a different layout for each computation phase; flexible but could be expensive
- Combining computation and data transformations
  - Static layout transformation: transform the layout first, then the computation
  - Dynamic layout transformation: transform the computation first, then dynamically rearrange the layout
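As a concrete instance of the computation-side optimizations above, a minimal loop-blocking (tiling) sketch for a column-major dgemm-style loop nest; the block size `BS`, the helper `min_int`, and the function name are illustrative, not output of any particular framework:

```c
/* Hypothetical sketch of loop blocking (tiling) applied to the j and k
   loops of a dgemm-style nest.  A is m x l, B is l x n, C is m x n,
   all column-major; computes C += alpha * A * B one BS x BS tile of
   (j, k) iterations at a time, improving temporal reuse of A and B. */
#define BS 32   /* illustrative block size */

static int min_int(int a, int b) { return a < b ? a : b; }

void dgemm_blocked(int m, int n, int l, double alpha,
                   const double *A, const double *B, double *C) {
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < l; kk += BS)
            for (int j = jj; j < min_int(jj + BS, n); ++j)
                for (int k = kk; k < min_int(kk + BS, l); ++k)
                    for (int i = 0; i < m; ++i)
                        C[i + j*m] += alpha * A[i + k*m] * B[k + j*l];
}
```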
Array Copying --- Related Work
- Dynamic layout transformation for arrays
  - Copy arrays into local buffers before the computation
  - Copy modified local buffers back to the arrays
- Previous work: Lam, Rothberg and Wolf; Temam, Granston and Jalby
  - Copy arrays after loop blocking
  - Assume arrays are accessed through affine expressions, so array copy is always safe
- Static data layout transformations: a single layout throughout the application --- always safe
- Optimizing irregular applications: data access patterns not known until runtime; dynamic layout transformation through libraries
- Scalar replacement: equivalent to copying a single array element into a scalar (Carr and Kennedy: applied to inner loops)
- Question: how expensive is array copy? How difficult is it?
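The copy-in / copy-out pattern underlying dynamic layout transformation can be sketched with two small helpers; the names are illustrative, and the `stride` parameter models gathering a non-contiguous array section into a contiguous buffer:

```c
/* Illustrative copy-in / copy-out helpers: before a computation phase,
   a (possibly strided) array section is gathered into a contiguous
   local buffer; after the phase, modified values are scattered back. */
void copy_in(const double *a, double *buf, int n, int stride) {
    for (int i = 0; i < n; ++i)
        buf[i] = a[i * stride];
}

void copy_out(double *a, const double *buf, int n, int stride) {
    for (int i = 0; i < n; ++i)
        a[i * stride] = buf[i];
}
```

The copy cost is linear in the section size, which is why the profitability heuristics later in the talk matter.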
What is new
- Array copy for arbitrary loop computations
  - Stand-alone optimization, independent of blocking
  - Works for computations with non-affine array accesses
- General array copy algorithm
  - Works on arbitrarily shaped loops
  - Automatically inserts copy operations to ensure safety
  - Heuristics to reduce buffer size and copy cost
  - Performs scalar replacement as a special case
- Applications: improve cache and register locality, both combined with blocking and without blocking
- Future work
  - Communication and parallelization
  - Empirical tuning of optimizations
  - Interface that allows different application heuristics
Apply Array Copy: Matrix Multiplication
Step 1: build the dependence graph
- True, output, anti and input dependences between array accesses
- Is each dependence precise?

for (j=0; j<n; ++j)
  for (k=0; k<l; ++k)
    for (i=0; i<m; ++i)
      C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Figure: dependence graph over the references C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], B[k+j*l]]
Apply Array Copy: Matrix Multiplication
Steps 2-3: separate array references connected by imprecise dependences
- Impose an order on all array references: C[i+j*m] (R) -> A[i+k*m] -> B[k+j*l] -> C[i+j*m] (W)
- Remove all back edges
- Apply typed fusion

for (j=0; j<n; ++j)
  for (k=0; k<l; ++k)
    for (i=0; i<m; ++i)
      C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Figure: same dependence graph as the previous slide]
Array Copy with Imprecise Dependences
- For array references connected by imprecise dependences:
  - A precise mapping between their subscripts cannot be determined
  - They may refer to the same location on some iterations and different locations on others
  - It is not safe to copy them into a single buffer
  - The typed-fusion algorithm is applied to separate them

for (j=0; j<n; ++j)
  for (k=0; k<l; ++k)
    for (i=0; i<m; ++i)
      C[f(i,j,m)] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Figure: the references C[f(i,j,m)], C[i+j*m], A[i+k*m], B[k+j*l]; the dependence between C[f(i,j,m)] and C[i+j*m] is imprecise]
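Why the imprecise case is unsafe can be demonstrated with an index array standing in for the non-affine subscript f(i,j,m); the helper `refs_ever_alias` is hypothetical, written only to show that the two references sometimes name the same element and sometimes do not:

```c
#include <stdbool.h>

/* With a non-affine subscript -- modeled here by an index array idx[]
   standing in for f(i, ...) -- the references C[idx[i]] and C[i] may or
   may not touch the same element on a given iteration, so a compiler
   cannot map both into one copy buffer.  Illustrative helper only. */
bool refs_ever_alias(const int *idx, int n) {
    for (int i = 0; i < n; ++i)
        if (idx[i] == i)   /* both references touch element i */
            return true;
    return false;
}
```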
Profitability of Applying Array Copy
- Each buffer should be copied at most twice (splitting groups otherwise)
- Determine the outermost location to perform the copy
  - No interference with other groups
  - Enforce a size limit on the buffer (constant size => scalar replacement)
  - Ensure reuse of copied elements: lower the copy position if no extra reuse is gained

for (j=0; j<n; ++j)
  for (k=0; k<l; ++k)
    for (i=0; i<m; ++i)
      C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Figure: the loop nest annotated with copy locations -- A outside the j loop, C inside the j loop, B inside the k loop]
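When a buffer's size limit degenerates to a constant, the copy becomes scalar replacement. A minimal sketch of that special case (assuming the loops have been interchanged so that k is innermost, which makes C[i+j*m] loop-invariant; the function name is illustrative):

```c
/* Scalar replacement as the constant-size special case of array copy:
   with k innermost, C[i+j*m] is invariant in the inner loop, so its
   one-element buffer is a plain scalar the compiler can keep in a
   register across the whole k loop. */
void dgemm_scalar_replaced(int m, int n, int l, double alpha,
                           const double *A, const double *B, double *C) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double c = C[i + j*m];            /* copy in: one element */
            for (int k = 0; k < l; ++k)
                c += alpha * A[i + k*m] * B[k + j*l];
            C[i + j*m] = c;                   /* copy out: one element */
        }
}
```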
Array Copy Result: Matrix Multiplication
- Dimensionality of the buffer is enforced by command-line options
- Can be applied to arbitrarily shaped loop structures
- Can be applied independently or after blocking

Copy A[0:m,0:l] to A_buf;
for (j=0; j<n; ++j) {
  Copy C[0:m,j*m] to C_buf;
  for (k=0; k<l; ++k) {
    Copy B[k+j*l] to B_buf;
    for (i=0; i<m; ++i)
      C_buf[i] = C_buf[i] + alpha * A_buf[i+k*m]*B_buf;
  }
  Copy C_buf[0:m] back to C[0:m,j*m];
}
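A runnable C rendering of the transformed pseudocode above, assuming column-major storage (m x l A, l x n B, m x n C); the buffer names mirror the slide, but this is a hand-written sketch rather than the tool's actual output:

```c
#include <stdlib.h>

/* Copy-optimized dgemm: A is copied once into a contiguous buffer,
   one column of C is buffered per j iteration, and B degenerates to a
   one-element (scalar) buffer per k iteration. */
void dgemm_copied(int m, int n, int l, double alpha,
                  const double *A, const double *B, double *C) {
    double *A_buf = malloc((size_t)m * l * sizeof(double));
    double *C_buf = malloc((size_t)m * sizeof(double));
    for (int x = 0; x < m * l; ++x)            /* copy A[0:m,0:l] once */
        A_buf[x] = A[x];
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i)            /* copy column j of C in */
            C_buf[i] = C[i + j*m];
        for (int k = 0; k < l; ++k) {
            double B_buf = B[k + j*l];         /* one-element buffer for B */
            for (int i = 0; i < m; ++i)
                C_buf[i] += alpha * A_buf[i + k*m] * B_buf;
        }
        for (int i = 0; i < m; ++i)            /* copy column j of C back */
            C[i + j*m] = C_buf[i];
    }
    free(A_buf);
    free(C_buf);
}
```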
Experiments
- Implementation
  - Loop transformation framework by Yi, Kennedy and Adve
  - ROSE compiler infrastructure at LLNL (Quinlan et al.)
- Benchmarks
  - dgemm (matrix multiplication, LAPACK)
  - dgetrf (LU factorization with partial pivoting, LAPACK)
  - tomcatv (mesh generation with Thompson solver, SPEC95)
- Machines
  - A Dell PC with two 2.2GHz Intel Xeon processors: 512KB cache on each processor, 2GB memory
  - An SGI workstation with a 195MHz R10000 processor: 32KB 2-way L1, 1MB 4-way L2, 256MB memory
  - A single 8-way P655+ node of an IBM terascale machine: 32KB 2-way L1, 0.75MB 4-way L2, 16MB 8-way L3, 16GB memory
DGEMM on Dell PC
[Figure: dgemm performance on 2000*2000 and 2048*2048 matrices for the versions orig, orig-cp0, orig-cp1, blk, blk-cp0, blk-cp1, blk-cp2]
DGEMM on SGI Workstation
[Figure: dgemm performance on 2000*2000 and 2048*2048 matrices for the same seven versions]
DGEMM on IBM
[Figure: dgemm performance on 2000*2000 and 2048*2048 matrices for the same seven versions]
Summary
- When should we apply scalar replacement?
  - Profitable unless register pressure is too high
  - 3-12% improvement observed; no overhead
- When should we apply array copy?
  - When regular conflict misses occur
  - When prefetching of array elements is needed
  - 10-40% improvement observed; 0.5-8% overhead when not beneficial
- Optimizations were not beneficial on the IBM node
  - Neither blocking nor 2-dimensional copying was profitable; integer operation overhead too high
  - Will be investigated further