DATA LOCALITY & ITS OPTIMIZATION TECHNIQUES
Presented by Preethi Rajaram
CSS 548 Introduction to Compilers
Professor Carol Zander
Fall 2012
Why?
• Processor Speed - increasing at a faster rate than the memory speed
• Computer Architectures - more levels of cache memory
• Cache - takes advantage of data locality
• Good Data Locality - good application performance
• Poor Data Locality - reduces the effectiveness of the cache
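Data Locality
• The property that references to the same memory location or to adjacent locations recur within a short period of time
• Temporal locality: the same location is referenced again within a short period of time
• Spatial locality: nearby locations (e.g. in the same cache line) are referenced within a short period of time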
Fig: Program to find the squares of the differences (a) without loop fusion (b) with loop fusion [Image from: The Dragon book 2nd edition]
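The figure itself is not reproduced here, so the following is only a minimal sketch of the idea its caption describes. The function names and sample data are assumptions made for illustration. The unfused version writes every z[i] in one pass and reads it back in a second pass; the fused version uses each difference immediately, while it is still at hand.

```python
def squares_unfused(x, y):
    """Two loops: z[i] is written in the first pass and re-read in the second."""
    n = len(x)
    z = [0] * n
    for i in range(n):
        z[i] = x[i] - y[i]
    for i in range(n):
        z[i] = z[i] * z[i]
    return z


def squares_fused(x, y):
    """One loop: the difference is squared as soon as it is produced."""
    n = len(x)
    z = [0] * n
    for i in range(n):
        d = x[i] - y[i]
        z[i] = d * d
    return z


x = [5, 1, 4, 2]
y = [3, 4, 4, 0]
print(squares_unfused(x, y))  # [4, 9, 0, 4]
print(squares_fused(x, y))    # [4, 9, 0, 4]
```

Both versions compute the same result; only the distance between producing and consuming each z[i] changes.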
Matrix Multiplication - Example
Fig: Basic Matrix Multiplication Algorithm [Image from: The Dragon book 2nd edition]
• Poor data locality
  • N^2 multiply-add operations separate the reuse of the same data element in matrix Y
  • N operations separate the reuse of the same cache line in Y
• Solutions
  • Changing the layout of the data structures
  • Blocking
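The reuse distances quoted above can be checked with a small sketch. It assumes the basic algorithm uses the loop order i, j, k with the statement Z[i][j] += X[i][k] * Y[k][j], and that Y is stored in row-major order so that Y[k][j] and Y[k][j+1] lie in the same cache line. The helper name `y_access_times` is hypothetical.

```python
def y_access_times(n):
    """Record the iteration number of every read of Y[k][j] in i, j, k order."""
    times = {}
    t = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Z[i][j] += X[i][k] * Y[k][j] reads Y[k][j] at step t.
                times.setdefault((k, j), []).append(t)
                t += 1
    return times


n = 8
times = y_access_times(n)

# Steps between two uses of the same element Y[0][0].
same_element_gap = times[(0, 0)][1] - times[(0, 0)][0]
# Steps between Y[0][0] and its row neighbour Y[0][1] (same cache line, by assumption).
same_line_gap = times[(0, 1)][0] - times[(0, 0)][0]

print(same_element_gap, n * n)  # 64 64
print(same_line_gap, n)         # 8 8
```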
Matrix Multiplication – Example Contd…
• Changing the data structure layout
  • Store Y in column-major order
  • Improves reuse of cache lines of matrix Y
  • Limited applicability
• Blocking
  • Changes the execution order of instructions
  • Divide the matrix into submatrices or blocks
  • Order the operations such that an entire block is used over a short period of time
  • Choose the block size B such that one block from each of the matrices fits into cache
Image from: The Dragon book 2nd edition
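As a rough illustration of blocking, the sketch below compares the basic triple loop with a blocked version. It is not the code from the figure: the function names, the parameter `b` (standing in for the block size B), and the test matrices are assumptions. The blocked version performs the same multiply-add operations, only grouped so that one b x b block of each matrix is worked on at a time.

```python
def matmul_basic(x, y, n):
    """Basic i, j, k matrix multiplication."""
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                z[i][j] += x[i][k] * y[k][j]
    return z


def matmul_blocked(x, y, n, b):
    """Same computation, reordered to work on b x b blocks."""
    z = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                # min(...) handles the case where b does not divide n.
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        acc = z[i][j]
                        for k in range(kk, min(kk + b, n)):
                            acc += x[i][k] * y[k][j]
                        z[i][j] = acc
    return z


n = 5
x = [[i + j for j in range(n)] for i in range(n)]
y = [[i * j + 1 for j in range(n)] for i in range(n)]
print(matmul_blocked(x, y, n, 2) == matmul_basic(x, y, n))  # True
```

In a compiled language the benefit comes from the three active blocks staying in cache; in Python the sketch only demonstrates that the reordering preserves the result.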
Data Reuse
• Locality Optimization
  • Identify the set of iterations that access the same data or the same cache line
• Static Access - an instruction in a program, e.g. x = z[i,j]
• Dynamic Access - one execution of that instruction; in a loop nest a static access executes many times
• Types of Reuse
  • Self
    • Iterations using the same data come from the same static access
  • Group
    • Iterations using the same data come from different static accesses
  • Temporal
    • The exact same location is referenced
  • Spatial
    • The same cache line is referenced
Self Temporal Reuse
• Save substantial memory accesses by exploiting self reuse
• Data with ‘k’ dimensions accessed in a loop nest of depth ‘d’ is reused n^(d-k) times
  e.g. if a 3-deep loop nest accesses one column of an array, there is a potential saving factor of n^2 accesses
• Dimensionality of access - rank of the matrix in the access
• Iterations referring to the same location - null space of the matrix
• Rank of a matrix
  • Number of rows or columns that are linearly independent
• Null space of a matrix
  • A reference in a ‘d’-deep loop nest with rank ‘r’ accesses O(n^r) data elements in O(n^d) iterations, so on average O(n^(d-r)) iterations must refer to the same array element
Rank = Dimensionality = 2 (2nd row = 1st + 3rd, 4th row = 3rd - 2 * 1st)
Nullity = 3 - 2 = 1 (Loop depth = 3, Rank = 2)
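The rank and nullity figures above can be reproduced with a short sketch. The matrix on the slide was an image and is not reproduced, so the 4 x 3 matrix below is an assumed example chosen only because it satisfies the stated row relations (2nd row = 1st + 3rd, 4th row = 3rd - 2 * 1st). The `rank` helper is a plain Gaussian elimination over exact fractions written for this illustration.

```python
from fractions import Fraction


def rank(matrix):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivot rows found so far
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * p for a, p in zip(rows[i], rows[r])]
        r += 1
    return r


# Assumed example matrix: 3 columns, one per loop index of a 3-deep nest.
m = [
    [1, 2, 3],
    [5, 7, 9],  # row 1 + row 3
    [4, 5, 6],
    [2, 1, 0],  # row 3 - 2 * row 1
]
loop_depth = 3
r = rank(m)
nullity = loop_depth - r
print(r, nullity)  # 2 1
```

A nullity of 1 means the iterations that touch the same element lie along one direction of the iteration space, so about n iterations share each element.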
Self Spatial Reuse
• Depends on the data layout of the matrix, e.g. row-major order
• In an array of ‘d’ dimensions, array elements share a cache line if they differ only in the last dimension
  e.g. as a first approximation, two array elements share the same cache line if and only if they share the same row in a 2-D array
• The truncated matrix is obtained by dropping the last row from the matrix
• If the resulting matrix has a rank ‘r’ that is less than the loop depth ‘d’, there is an opportunity for spatial reuse
Truncated matrix: r = 1, d = 2; r < d, so there is spatial reuse
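The dependence on layout can be illustrated by counting cache-line switches for two traversal orders. This is a simplified model under stated assumptions: an n x n array in row-major order starting at a line boundary, a line holding `line_size` elements, and n a multiple of `line_size`. The function name and parameters are hypothetical.

```python
def line_changes(n, line_size, inner_is_last_dim):
    """Count how often consecutive accesses fall in a different cache line."""
    changes = 0
    prev_line = None
    for a in range(n):
        for b in range(n):
            # Inner loop walks the last dimension (j) or the first (i).
            i, j = (a, b) if inner_is_last_dim else (b, a)
            address = i * n + j  # row-major element offset
            line = address // line_size
            if line != prev_line:
                changes += 1
            prev_line = line
    return changes


n, line_size = 16, 4
row_order = line_changes(n, line_size, True)
col_order = line_changes(n, line_size, False)
print(row_order)  # 64: a new line only every 4th access
print(col_order)  # 256: every access lands in a different line
```

When the innermost loop varies only the last dimension, consecutive accesses stay within a row and reuse each line; otherwise every access moves to another line.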
Group Reuse
• Group reuse is considered only among accesses in a loop that share the same coefficient matrix
Fig: 2-deep loop nest [Image from: The Dragon book 2nd edition]
• z[i,j] and z[i-1,j] access almost the same set of array elements
• The data read by access z[i-1,j] is the same as the data written by z[i,j], except for i = 1
Rank = 2, no self temporal reuse
Truncated Matrix, Rank = 1, self spatial reuse
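The overlap between the two accesses can be checked by enumerating them. The loop nest in the figure is not reproduced, so this sketch assumes a nest of the form "for i = 1..n, for j = 1..n: z[i,j] = z[i-1,j]"; the bounds and the statement are assumptions consistent with the bullets above.

```python
n = 4
reads = set()
writes = set()
for i in range(1, n + 1):
    for j in range(1, n + 1):
        writes.add((i, j))     # left-hand side z[i,j]
        reads.add((i - 1, j))  # right-hand side z[i-1,j]

# Elements read but never written: only row 0, read when i = 1.
only_read = reads - writes
# Elements touched by both accesses: the group reuse.
shared = reads & writes

print(sorted(only_read))  # [(0, 1), (0, 2), (0, 3), (0, 4)]
print(len(shared))        # 12
```

Every element read with i > 1 was written one outer iteration earlier, which is the reuse that a locality optimization tries to bring closer together.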
Locality Optimization
• Temporal locality of data
  • Use the results as soon as they are generated
Fig: Code excerpt for a multigrid algorithm (a) before partition (b) after partition [Image from: The Dragon book 2nd edition]
Locality Optimization Contd…
• Array Contraction
  • Reduce the dimension of the array and reduce the number of memory locations accessed
Fig: Code excerpt for a multigrid algorithm after partition and after array contraction [Image from: The Dragon book 2nd edition]
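The multigrid excerpt is not reproduced here, so the following is a generic sketch of array contraction rather than the code from the figure. All names are hypothetical. A 2-D temporary `t` is needed when the producing and consuming loops run one after the other; once both are placed in the same outer iteration, only one row of `t` is live at a time and the temporary can be contracted to 1-D.

```python
def before_contraction(a):
    """Producer and consumer are separate loop nests: t must be 2-D."""
    n, m = len(a), len(a[0])
    t = [[0] * m for _ in range(n)]
    out = [0] * n
    for i in range(n):
        for j in range(m):
            t[i][j] = 2 * a[i][j]
    for i in range(n):
        for j in range(m):
            out[i] += t[i][j]
    return out


def after_contraction(a):
    """Both loops share the outer iteration: t shrinks to a single row."""
    n, m = len(a), len(a[0])
    t = [0] * m
    out = [0] * n
    for i in range(n):
        for j in range(m):
            t[j] = 2 * a[i][j]
        for j in range(m):
            out[i] += t[j]
    return out


a = [[1, 2, 3], [4, 5, 6]]
print(before_contraction(a))  # [12, 30]
print(after_contraction(a))   # [12, 30]
```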
Locality Optimization Contd…
• Instead of executing each partition one after the other, we interleave a number of the partitions so that reuse among partitions occurs close together
• Interleaving Inner Loops in a Parallel Loop
• Interleaving Statements in a Parallel Loop
Fig: The statement interleaving transformation [Image from: The Dragon book 2nd edition]
Fig: Interleaving four instances of the inner loop [Image from: The Dragon book 2nd edition]
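Interleaving instances of the inner loop can be sketched by listing the order in which (i, j) iterations execute. This is an illustration under assumptions, not the figure's code: the function names are hypothetical, the default `width` of 4 follows the caption, and the reordering is only legal when the iterations of the outer loop are independent (a parallel loop).

```python
def original_order(n, m):
    """Each outer iteration i runs its whole inner loop before the next begins."""
    return [(i, j) for i in range(n) for j in range(m)]


def interleaved_order(n, m, width=4):
    """Run `width` outer iterations together, stepping their inner loops in lockstep."""
    order = []
    for i in range(0, n, width):
        for j in range(m):
            for ii in range(i, min(n, i + width)):
                order.append((ii, j))
    return order


print(original_order(8, 3)[:4])     # [(0, 0), (0, 1), (0, 2), (1, 0)]
print(interleaved_order(8, 3)[:4])  # [(0, 0), (1, 0), (2, 0), (3, 0)]
```

The same set of iterations is executed in both cases; iterations with the same j from neighbouring values of i now run back to back, so data they share is reused sooner.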
References
• Wolf, Michael E., and Monica S. Lam. "A data locality optimizing algorithm." ACM SIGPLAN Notices 26.6 (1991): 30-44.
• McKinley, Kathryn S., Steve Carr, and Chau-Wen Tseng. "Improving data locality with loop transformations." ACM Transactions on Programming Languages and Systems (TOPLAS) 18.4 (1996): 424-453.
• Bodin, François, et al. "A quantitative algorithm for data locality optimization." Code Generation: Concepts, Tools, Techniques (1992): 119-145.
• Kennedy, Ken, and Kathryn S. McKinley. "Optimizing for parallelism and data locality." Proceedings of the 6th International Conference on Supercomputing. ACM, 1992.
• Aho, A., M. Lam, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools (2nd edition). Addison-Wesley.
Thank You!
Questions??