Upload
anna-caldwell
View
218
Download
0
Embed Size (px)
Citation preview
Optimizing Compilers for Modern Architectures
Creating Coarse-grained Parallelism for Loop Nests
Chapter 6, Sections 6.3 through 6.9
Optimizing Compilers for Modern Architectures
Last time…
• Single loop methods
• Privatization
• Loop distribution
• Alignment
• Loop Fusion
Optimizing Compilers for Modern Architectures
Loop Interchange
• Moves dependence-free loops to outermost level
• Theorem—In a perfect nest of loops, a particular loop can be
parallelized at the outermost level if and only if the column of the direction matrix for that nest contain only ‘=‘ entries
• Vectorization moves loops to innermost level
Optimizing Compilers for Modern Architectures
Loop Interchange
DO I = 1, N
DO J = 1, N
A(I+1, J) = A(I, J) + B(I, J)
ENDDO
ENDDO
• OK for vectorization
• Problematic for parallelization
Optimizing Compilers for Modern Architectures
Loop Interchange
PARALLEL DO J = 1, N
DO I = 1, N
A(I+1, J) = A(I, J) + B(I, J)
ENDDO
END PARALLEL DO
Optimizing Compilers for Modern Architectures
Loop Interchange
• Working with direction matrix—Move loops with all “=“ entries into outermost position
and parallelize it and remove the column from the matrix—Move loops with most “<“ entries into next outermost
position and sequentialize it, eliminate the column and any rows representing carried dependences
—Repeat step 1
Optimizing Compilers for Modern Architectures
while L is not empty
while there exist columns in M with all “=“
success := true;
l:= loop with all “=“ column;
remove l from L;
parallelize l;
eliminate l from M;
end;
if L is not empty
select_loop_and_interchange(L);
l:= outermost loop; remove l from L; sequentialize l;
remove column corresponding to l from M;
remove all rows corresponding to dependences carried by l from M;
Loop Interchange
Optimizing Compilers for Modern Architectures
Loop Selection
• Generate most parallelism with adequate granularity—Key is to select proper loops to run in parallel
• Informal parallel code generation strategy
• While there are loops that can be run in parallel, move them to the outermost position and parallelize them, then
• Select a sequential loop, run it sequentially, and find what new parallelism may have been revealed.
Optimizing Compilers for Modern Architectures
= < << = >< = =
Loop SelectionDO I = 2, N+1
DO J = 2, M+1
DO K = 1, L
A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Loop SelectionDO I = 2, N+1
DO J = 2, M+1
PARALLEL DO K = 1, L
A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
< < = =< = < =< = = <= < = == = < == = = <
Loop Selection
• Is it possible to derive a selection heuristic that provides optimal code?—Probably not possible, NP-complete problem
• Assume simple approach of selecting the loop with the most ‘ < ‘ directions to eliminate the max number of rows from the direction matrix—Applying to this matrix will fail
Optimizing Compilers for Modern Architectures
Loop Selection
• Favor the selection of loops that must be sequentialized before parallelism can be uncovered.
• If there exists a loop that can legally be moved to the outermost position and there is a dependence for which that loop has the only ‘<‘ direction, sequentialize that loop
• If there are several such loops, they will all need to be sequentialized at some point in the process
Optimizing Compilers for Modern Architectures
• To show that loop selection is NP-complete—Bit vector
—Problem of loop selection corresponds to finding a minimal basis among the loops, with “logical or” as the combining operation
—The same as minimum set cover problem, known to be NP-complete
• Loop selection is best done by a heuristic
1 1 0 01 0 1 01 0 0 10 1 0 00 0 1 00 0 0 1
Loop Selection
Optimizing Compilers for Modern Architectures
= < =< = <= < << = >
Loop Selection
• Example of principals involved in heuristic loop selection
DO I = 2, N
DO J = 2, M
DO K = 2, L
A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) +
A(I, J+1, K+1) + A(I-1, J, K+1)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Loop Selection
DO J = 2, M
DO I = 2, N
PARALLEL DO K = 2, L
A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) +
A(I, J+1, K+1) + A(I-1, J, K+1)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
= < >< = >
Loop Reversal
DO I = 2, N+1
DO J = 2, M+1
DO K = 1, L
A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Loop ReversalDO K = L, 1, -1
PARALLEL DO I = 2, N+1
PARALLEL DO J = 2, M+1
A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
END PARALLEL DO
END PARALLEL DO
ENDDO
• Increase the range of options available for loop selection heuristics
Optimizing Compilers for Modern Architectures
= < =< = == = <= = =
Loop SkewingDO I = 2, N+1
DO J = 2, M+1
DO K = 1, L
A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
B(I, J, K+1) = B(I, J, K) + A(I, J, K)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
= < << = <= = <= = =
Loop Skewing
• Skewed using k = K + I + J yield:DO I = 2, N+1
DO J = 2, M+1
DO k = I+J+1, I+J+L
A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Loop Skewing
DO k = 5, N+M+1
PARALLEL DO I = MAX(2, k-M-L-1), MIN(N+1, k-L-2)
PARALLEL DO J = MAX(2, k-I-L), MIN(M+1, k-I-1)
A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Loop Skewing
• Transforms skewed loop into one that can be interchanged to the outermost position without changing the meaning of the program
• Can be used to transform the skewed loop in such a way that, after outward interchange, it will carry all dependences formerly carried by the loop with respect to which it is skewed
Optimizing Compilers for Modern Architectures
Loop Skewing
• Selection Heuristics—Parallelize outermost loop if possible—Sequentializes at most one outer loop to find parallelism in
the next loop—If 1 and 2 fails, try skewing—If 3 fails, sequentialize the loop that can be moved to the
outermost position and cover the most other loops
Optimizing Compilers for Modern Architectures
Unimodular Transformations
• Definitions—A transformation represented by a matrix T is unimodular
if—T is square—All the elements of T are integral and—The absolute value of the determinant of T is 1
• Any composition of unimodular transformations is unimodular
Optimizing Compilers for Modern Architectures
Profitability-Based Methods
• Motivation—Minimum granularity, low sync cost —Many alternatives for parallel code generation
• Static performance estimation function—No need to be accurate—Good at selecting the better of two alternatives
• Key considerations—Cost of memory references—Sufficiency of granularity
Optimizing Compilers for Modern Architectures
Profitability-Based Methods
• Pick the best of all possible permutations and parallelizations—Impractical
– Total number of alternative is exponential in the number of loops in a nest
– Many of the loop upper bounds are unknown at compile time
• Consider only subset of the possible code arrangements, based on properties of the cost function
Optimizing Compilers for Modern Architectures
Profitability-Based Methods
• Subdivide all the references in the loop body into reference groups
• Determine whether subsequent accesses to the same reference are—Loop invariant
– Cost = 1—Unit stride
– Cost = number of iterations / cache line size—Non-unit stride
– Cost = number of iterations
• Assign the loop a cost of the sum of the reference costs times the aggregate number of times the loop will be executed if it is innermost in the loop nest
Optimizing Compilers for Modern Architectures
Profitability-Based MethodsDO I = 1, N
DO J = 1, N
DO K = 1, N
C(I, J) = C(I, J) + A(I, K) * B(K, J)
ENDDO
ENDDO
ENDDO—C = 1 A = N B = N/L— Innermost K loop = N3(1+1/L)+N2— Innermost J loop = 2N3+N2— Innermost I loop = 2N3/L+N2
• Reorder loop from innermost to outermost by increasing loop cost
• Can’t always have desired loop order
Optimizing Compilers for Modern Architectures
Multilevel Loop Fusion
• Commonly used for imperfect loop nests
• Used after maximal loop distribution
Optimizing Compilers for Modern Architectures
Multilevel Loop Fusion
• Decision making needs look-ahead
• Heuristic: Fuse with the loop that cannot be fused with one of its successors
Optimizing Compilers for Modern Architectures
Parallel Code Generation
procedure Parallelize(l, Dl);
ParallelizeNest(l, success);
if not success then begin
if l can be distributed then begin
distribute l into loop nests l1, l2, …, ln;
for i:=1 to n do begin
Parallelize(li, Di);
end
Merge({l1, l2, …, ln});
end
else if l cannot be distributed then begin
Optimizing Compilers for Modern Architectures
Parallel Code Generation
for each outer loop l0 nested in l do begin
let D0 be the set of dependences between statements in l0 less dependences carried by l;
Parallelize(Io,D0);
end
let S be the set of outer loops and statements loops left in l;
If ||S||>1 then Merge(S);
Optimizing Compilers for Modern Architectures
DO J = 1, JMAXD
DO I = 1, IMAXD
F(I, J, 1) = F(I, J, 1) * B(1)
DO K = 2, N-1
DO J = 1, JMAXD
DO I = 1, IMAXD
F(I, J, K) = (F(I, J, K) – A(K) * F(I, J, K-1)) * B(K)
DO J = 1, JMAXD
DO I = 1, IMAXD
TOT(I, J) = 0.0
DO J = 1, JMAXD
DO I = 1, IMAXD
TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
DO K = 2, N-1
DO J = 1, JMAXD
DO I = 1, IMAXD
TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
Erlebacher
Optimizing Compilers for Modern Architectures
ErlebacherPARALLEL DO = 1, JMAXD
DO I = 1, IMAXD
F(I, J, 1) = F(I, J, 1) * B(1)
DO K = 2, N – 1
DO I = 1, IMAXD
F(I, J, K) = ( F(I, J, K) – A(K) * F(I, J, K-1)) * B(K)
DO I = 1, IMAXD
TOT(I, J) = 0.0
DO I = 1, IMAXD
TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
DO K = 2, N-1
DO I = 1, IMAXD
TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
Optimizing Compilers for Modern Architectures
ErlebacherPARALLEL DO J = 1, JMAXD
DO I = 1, IMAXD
F(I, J, 1) = F(I, J, 1) * B(1)
TOT(I, J) = 0.0
TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
ENDDO
DO K = 2, N-1
DO I = 1, IMAXD
F(I, J, K) = ( F(I, J, K) – A(K) * F(I, J, K-1)) * B(K)
TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
ENDDO
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Strip Mining
• Converts available parallelism into a form more suitable for the hardware
DO I = 1, N
A(I) = A(I) + B(I)
ENDDO
k = CEIL (N / P)
PARALLEL DO I = 1, N, k
DO i = I, MIN(I + k-1, N)
A(i) = A(i) + B(i)
ENDDO
END PARALLEL DO
Optimizing Compilers for Modern Architectures
Strip Mining
DO I = 1, N
DO J = 2, I
A(I, J) = A(I, J-1) + B(I)
ENDDO
ENDDO
• Choose smaller unit size to allow more balanced load distribution
Optimizing Compilers for Modern Architectures
Pipeline Parallelism
• Fortran command DOACROSS
• Useful where parallelization is not available
• High synchronization costs
DO I = 2, N-1
DO J = 2, N-1
A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Pipeline ParallelismDOACROSS I = 2, N-1
POST (EV(1))
DO J = 2, N-1
WAIT(EV(J-1))
A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
POST (EV(J))
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Pipeline Parallelism
Optimizing Compilers for Modern Architectures
Pipeline ParallelismDOACROSS I = 2, N-1
POST (E(1))
K = 0
DO J = 2, N-1, 2
K = K+1
WAIT(EV(K))
DO j = J, MAX(J+1, N-1)
A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1)
ENDDO
POST (EV(K+1))
ENDDO
ENDDO
Optimizing Compilers for Modern Architectures
Pipeline Parallelism
Optimizing Compilers for Modern Architectures
Scheduling Parallel Work
• Parallel execution is slower than serial execution if
• Bakery-counter scheduling—High synchronization overhead
Optimizing Compilers for Modern Architectures
Guided Self-Scheduling
• Minimize synchronization overhead—Schedules groups of iterations unlike the bakery counter
method—Going from large to small chunks of work
• Keep all processors busy at all times
• Iterations dispensed at time t follows:
• Alternatively we can have GSS(k) that guarantees that all blocks handed out are of size k or greater
Optimizing Compilers for Modern Architectures
Guided Self-Scheduling
• GSS(1)