Applying Data Copy To Improve Memory Performance of General Array Computations Qing Yi
University of Texas at San Antonio
Improving Memory Performance
[Figure: memory hierarchy -- two processors, each with registers, cache, and memory, connected through shared memory over a network connection]
- Locality optimizations
- Parallelization
- Communication optimizations
Compiler Optimizations For Locality
- Computation optimizations
  - Loop blocking, fusion, unroll-and-jam, interchange, unrolling
  - Rearrange computations for better spatial and temporal locality
- Data-layout optimizations
  - Static layout transformation: a single layout for the entire application; no additional overhead, but a tradeoff between different layout choices
  - Dynamic layout transformation: a different layout for each computation phase; flexible but could be expensive
- Combining computation and data transformations
  - Static layout transformation: transform the layout first, then the computation
  - Dynamic layout transformation: transform the computation first, then dynamically rearrange the layout
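As a concrete instance of the computation-side optimizations above, a minimal loop-blocking (tiling) sketch for a column-major dgemm-style loop nest; the block size `BS`, the helper `min_int`, and the function name are illustrative, not output of any particular framework:

```c
/* Hypothetical sketch of loop blocking (tiling) applied to the j and k
   loops of a dgemm-style nest.  A is m x l, B is l x n, C is m x n,
   all column-major; computes C += alpha * A * B one BS x BS tile of
   (j, k) iterations at a time, improving temporal reuse of A and B. */
#define BS 32   /* illustrative block size */

static int min_int(int a, int b) { return a < b ? a : b; }

void dgemm_blocked(int m, int n, int l, double alpha,
                   const double *A, const double *B, double *C) {
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < l; kk += BS)
            for (int j = jj; j < min_int(jj + BS, n); ++j)
                for (int k = kk; k < min_int(kk + BS, l); ++k)
                    for (int i = 0; i < m; ++i)
                        C[i + j*m] += alpha * A[i + k*m] * B[k + j*l];
}
```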
Array Copying --- Related Work
- Dynamic layout transformation for arrays
  - Copy arrays into local buffers before the computation
  - Copy modified local buffers back to the arrays
- Previous work: Lam, Rothberg and Wolf; Temam, Granston and Jalby
  - Copy arrays after loop blocking
  - Assume arrays are accessed through affine expressions, so array copy is always safe
- Static data layout transformations: a single layout throughout the application --- always safe
- Optimizing irregular applications: data access patterns not known until runtime; dynamic layout transformation through libraries
- Scalar replacement: equivalent to copying a single array element into a scalar (Carr and Kennedy: applied to inner loops)
- Question: how expensive is array copy? How difficult is it?
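The copy-in / copy-out pattern underlying dynamic layout transformation can be sketched with two small helpers; the names are illustrative, and the `stride` parameter models gathering a non-contiguous array section into a contiguous buffer:

```c
/* Illustrative copy-in / copy-out helpers: before a computation phase,
   a (possibly strided) array section is gathered into a contiguous
   local buffer; after the phase, modified values are scattered back. */
void copy_in(const double *a, double *buf, int n, int stride) {
    for (int i = 0; i < n; ++i)
        buf[i] = a[i * stride];
}

void copy_out(double *a, const double *buf, int n, int stride) {
    for (int i = 0; i < n; ++i)
        a[i * stride] = buf[i];
}
```

The copy cost is linear in the section size, which is why the profitability heuristics later in the talk matter.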
What is new
- Array copy for arbitrary loop computations
  - Stand-alone optimization, independent of blocking
  - Works for computations with non-affine array accesses
- General array copy algorithm
  - Works on arbitrarily shaped loops
  - Automatically inserts copy operations to ensure safety
  - Heuristics to reduce buffer size and copy cost
  - Performs scalar replacement as a special case
- Applications: improve cache and register locality, both combined with blocking and without blocking
- Future work
  - Communication and parallelization
  - Empirical tuning of optimizations
  - Interface that allows different application heuristics
Apply Array Copy: Matrix Multiplication
Step 1: build the dependence graph
- True, output, anti and input dependences between array accesses
- Is each dependence precise?

for (j=0; j<n; ++j)
  for (k=0; k<l; ++k)
    for (i=0; i<m; ++i)
      C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Figure: dependence graph over the references C[i+j*m] (read), C[i+j*m] (write), A[i+k*m], B[k+j*l]]
Apply Array Copy: Matrix Multiplication
Steps 2-3: separate array references connected by imprecise dependences
- Impose an order on all array references: C[i+j*m] (R) -> A[i+k*m] -> B[k+j*l] -> C[i+j*m] (W)
- Remove all back edges
- Apply typed fusion

for (j=0; j<n; ++j)
  for (k=0; k<l; ++k)
    for (i=0; i<m; ++i)
      C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Figure: same dependence graph as the previous slide]
Array Copy with Imprecise Dependences
- For array references connected by imprecise dependences:
  - A precise mapping between their subscripts cannot be determined
  - They may refer to the same location on some iterations and different locations on others
  - It is not safe to copy them into a single buffer
  - The typed-fusion algorithm is applied to separate them

for (j=0; j<n; ++j)
  for (k=0; k<l; ++k)
    for (i=0; i<m; ++i)
      C[f(i,j,m)] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Figure: the references C[f(i,j,m)], C[i+j*m], A[i+k*m], B[k+j*l]; the dependence between C[f(i,j,m)] and C[i+j*m] is imprecise]
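Why the imprecise case is unsafe can be demonstrated with an index array standing in for the non-affine subscript f(i,j,m); the helper `refs_ever_alias` is hypothetical, written only to show that the two references sometimes name the same element and sometimes do not:

```c
#include <stdbool.h>

/* With a non-affine subscript -- modeled here by an index array idx[]
   standing in for f(i, ...) -- the references C[idx[i]] and C[i] may or
   may not touch the same element on a given iteration, so a compiler
   cannot map both into one copy buffer.  Illustrative helper only. */
bool refs_ever_alias(const int *idx, int n) {
    for (int i = 0; i < n; ++i)
        if (idx[i] == i)   /* both references touch element i */
            return true;
    return false;
}
```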
Profitability of Applying Array Copy
- Each buffer should be copied at most twice (splitting groups otherwise)
- Determine the outermost location to perform the copy
  - No interference with other groups
  - Enforce a size limit on the buffer (constant size => scalar replacement)
  - Ensure reuse of copied elements: lower the copy position if no extra reuse is gained

for (j=0; j<n; ++j)
  for (k=0; k<l; ++k)
    for (i=0; i<m; ++i)
      C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Figure: the loop nest annotated with copy locations -- A outside the j loop, C inside the j loop, B inside the k loop]
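When a buffer's size limit degenerates to a constant, the copy becomes scalar replacement. A minimal sketch of that special case (assuming the loops have been interchanged so that k is innermost, which makes C[i+j*m] loop-invariant; the function name is illustrative):

```c
/* Scalar replacement as the constant-size special case of array copy:
   with k innermost, C[i+j*m] is invariant in the inner loop, so its
   one-element buffer is a plain scalar the compiler can keep in a
   register across the whole k loop. */
void dgemm_scalar_replaced(int m, int n, int l, double alpha,
                           const double *A, const double *B, double *C) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double c = C[i + j*m];            /* copy in: one element */
            for (int k = 0; k < l; ++k)
                c += alpha * A[i + k*m] * B[k + j*l];
            C[i + j*m] = c;                   /* copy out: one element */
        }
}
```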
Array Copy Result: Matrix Multiplication
- Dimensionality of the buffer is enforced by command-line options
- Can be applied to arbitrarily shaped loop structures
- Can be applied independently or after blocking

Copy A[0:m,0:l] to A_buf;
for (j=0; j<n; ++j) {
  Copy C[0:m,j*m] to C_buf;
  for (k=0; k<l; ++k) {
    Copy B[k+j*l] to B_buf;
    for (i=0; i<m; ++i)
      C_buf[i] = C_buf[i] + alpha * A_buf[i+k*m]*B_buf;
  }
  Copy C_buf[0:m] back to C[0:m,j*m];
}
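A runnable C rendering of the transformed pseudocode above, assuming column-major storage (m x l A, l x n B, m x n C); the buffer names mirror the slide, but this is a hand-written sketch rather than the tool's actual output:

```c
#include <stdlib.h>

/* Copy-optimized dgemm: A is copied once into a contiguous buffer,
   one column of C is buffered per j iteration, and B degenerates to a
   one-element (scalar) buffer per k iteration. */
void dgemm_copied(int m, int n, int l, double alpha,
                  const double *A, const double *B, double *C) {
    double *A_buf = malloc((size_t)m * l * sizeof(double));
    double *C_buf = malloc((size_t)m * sizeof(double));
    for (int x = 0; x < m * l; ++x)            /* copy A[0:m,0:l] once */
        A_buf[x] = A[x];
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i)            /* copy column j of C in */
            C_buf[i] = C[i + j*m];
        for (int k = 0; k < l; ++k) {
            double B_buf = B[k + j*l];         /* one-element buffer for B */
            for (int i = 0; i < m; ++i)
                C_buf[i] += alpha * A_buf[i + k*m] * B_buf;
        }
        for (int i = 0; i < m; ++i)            /* copy column j of C back */
            C[i + j*m] = C_buf[i];
    }
    free(A_buf);
    free(C_buf);
}
```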
Experiments
- Implementation
  - Loop transformation framework by Yi, Kennedy and Adve
  - ROSE compiler infrastructure at LLNL (Quinlan et al.)
- Benchmarks
  - dgemm (matrix multiplication, LAPACK)
  - dgetrf (LU factorization with partial pivoting, LAPACK)
  - tomcatv (mesh generation with Thompson solver, SPEC95)
- Machines
  - A Dell PC with two 2.2GHz Intel Xeon processors: 512KB cache on each processor, 2GB memory
  - An SGI workstation with a 195MHz R10000 processor: 32KB 2-way L1, 1MB 4-way L2, 256MB memory
  - A single 8-way P655+ node of an IBM terascale machine: 32KB 2-way L1, 0.75MB 4-way L2, 16MB 8-way L3, 16GB memory
DGEMM on Dell PC
[Figure: dgemm performance on 2000*2000 and 2048*2048 matrices for the versions orig, orig-cp0, orig-cp1, blk, blk-cp0, blk-cp1, blk-cp2]
DGEMM on SGI Workstation
[Figure: dgemm performance on 2000*2000 and 2048*2048 matrices for the same seven versions]
DGEMM on IBM
[Figure: dgemm performance on 2000*2000 and 2048*2048 matrices for the same seven versions]
Summary
- When should we apply scalar replacement?
  - Profitable unless register pressure is too high
  - 3-12% improvement observed; no overhead
- When should we apply array copy?
  - When regular conflict misses occur
  - When prefetching of array elements is needed
  - 10-40% improvement observed; 0.5-8% overhead when not beneficial
- Optimizations were not beneficial on the IBM node
  - Neither blocking nor 2-dimensional copying was profitable; integer operation overhead too high
  - Will be investigated further