Carnegie Mellon Lecture 15 Loop Transformations Chapter 11.10-11.11 Dror E. Maydan CS243: Loop Optimization and Array Analysis 1

Loop Optimization

• Domain
  – Loops: change the order in which we iterate through loops
• Goals
  – Minimize inner-loop dependences that inhibit software pipelining
  – Minimize loads and stores
  – Parallelism
    • SIMD vector today; in general, multiprocessor as well
  – Minimize cache misses
  – Minimize register spilling
• Tools
  – Loop interchange
  – Fusion
  – Fission
  – Outer loop unrolling
  – Cache tiling
  – Vectorization
• Algorithm for putting it all together

Loop Interchange

for i = 1 to n
  for j = 1 to n
    A[j][i] = A[j-1][i-1] * b[i]

• Should I interchange the two loops?
  – Stride-1 accesses are better for caches
  – But one more load in the inner loop
  – But one less register needed to hold the result of the loop

After interchange:

for j = 1 to n
  for i = 1 to n
    A[j][i] = A[j-1][i-1] * b[i]
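The two loop orders above can be checked against each other with a small Python sketch. It is an illustration, not part of the lecture: the function name `run`, the initial array contents and the extra row and column 0 are all assumptions made so the example runs. The only dependence here has distance (1, 1), so it stays lexicographically positive under interchange and both orders should produce the same array.

```python
def run(order, n):
    # Row/column 0 are padding so that A[j-1][i-1] is always in bounds.
    A = [[r * (n + 1) + c for c in range(n + 1)] for r in range(n + 1)]
    b = [i + 2 for i in range(n + 1)]
    if order == "ij":
        # Original order: i outer, j inner (strided access to A).
        iters = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
    else:
        # Interchanged order: j outer, i inner (stride-1 access to A).
        iters = [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]
    for i, j in iters:
        A[j][i] = A[j - 1][i - 1] * b[i]
    return A

print(run("ij", 4) == run("ji", 4))
```

The sketch only demonstrates legality; the cache, load and register trade-offs listed on the slide are not modelled.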

Loop Interchange

for i = 1 to n
  for j = 1 to n
    A[j][i] = A[j+1][i-1] * b[i]

Distance vector is (delta_i, delta_j) = (1, -1)
Direction vector is (>, <)

• A dependence represents that one reference, a_w, must happen before another, a_r
• To permute loops, permute the direction vectors in the same manner
  – Permutation is legal iff all permuted direction vectors are lexicographically positive
• Special case: fully permutable loop nest
  – Either the dependence is “carried” by a loop outside of the nest, or all components are > or =
  – All the loops in the nest can be arbitrarily permuted
  – (>, >, <): inner two loops are fully permutable
  – (>=, =, >): all three loops are fully permutable
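The legality rule for permutation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, direction components are limited to ">", "=" and "<" using the slide's convention that ">" is the positive direction, and an all-"=" vector is treated as acceptable because a loop-independent dependence is not reordered by permuting loops.

```python
def lex_positive(vec):
    """True if the first non-'=' component is '>' (or every component is '=')."""
    for d in vec:
        if d == ">":
            return True
        if d == "<":
            return False
    return True  # all '=': loop-independent, unaffected by permutation (assumption)

def permutation_legal(direction_vectors, perm):
    """perm[k] is the index of the original loop placed at depth k."""
    return all(lex_positive([v[p] for p in perm]) for v in direction_vectors)

# The example above has direction vector (>, <).
print(permutation_legal([(">", "<")], (0, 1)))  # original order
print(permutation_legal([(">", "<")], (1, 0)))  # interchanged order
```

For the (>, <) example, the interchanged vector becomes (<, >), which is lexicographically negative, so the check rejects the interchange.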

Loop Interchange

for i = 1 to n
  for j = 1 to i
    …

• How do I interchange?

for j = 1 to n
  for i = j to n
    …

  – In general ugly but doable
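A quick way to convince oneself that the new bounds are right is to enumerate both iteration spaces and compare them. The Python sketch below does that; the function names are illustrative and not from the lecture.

```python
def original_order(n):
    # for i = 1 to n, for j = 1 to i
    return [(i, j) for i in range(1, n + 1) for j in range(1, i + 1)]

def interchanged_order(n):
    # for j = 1 to n, for i = j to n
    return [(i, j) for j in range(1, n + 1) for i in range(j, n + 1)]

n = 5
same_points = sorted(original_order(n)) == sorted(interchanged_order(n))
print(same_points, len(original_order(n)))
```

Both nests visit the same triangle of (i, j) points, only in a different order; whether that order is legal still depends on the direction vectors.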

Non Perfectly Nested loops

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

• Can’t always interchange
  – Can be expensive when you can

Loop Fusion

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

After fusion:

for i = 1 to n
  for j = 1 to n
    S1
    S2

• Moving S2 across “j” iterations but not any of the “i” iterations
• Pretend to fuse
• Legal as long as there is no direction vector from S2 to S1 with “=” in all the outer loops and “>” in one of the inner loops: (=, =, …, =, >, …)
  – That would imply that S2 is now before S1

Loop Fusion

for i = 1 to n
  for j = 1 to n
    a[i][j] = …
  for j = 1 to n
    … = a[i][j+1]

• Legal as long as there is no direction vector from the read to the write with “=” in all the outer loops and “>” in one of the inner loops: (=, =, …, =, >, …)
  – Here the vector is (=, 1), so we can’t fuse
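The effect of ignoring this rule can be seen by running the two versions on one row of `a`. The Python sketch below is illustrative only: the function names, the written values (`10 * j`) and the zero-initialised padding element `a[n+1]` are assumptions chosen to make the difference visible.

```python
def unfused(n):
    a = [0] * (n + 2)          # a[n+1] is never written and stays 0
    out = []
    for j in range(1, n + 1):
        a[j] = 10 * j          # write loop
    for j in range(1, n + 1):
        out.append(a[j + 1])   # read loop sees the values written above
    return out

def naive_fused(n):
    a = [0] * (n + 2)
    out = []
    for j in range(1, n + 1):
        a[j] = 10 * j
        out.append(a[j + 1])   # reads a[j+1] one iteration before it is written
    return out

print(unfused(4))
print(naive_fused(4))
```

In the fused version every read happens before the matching write, which is exactly the S2-before-S1 situation the direction-vector test rules out.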

Loop Fusion

for i = 1 to n
  for j = 1 to n
    a[i][j] = …
  for j = 1 to n
    … = a[i][j+1]

If the first positive (“+”) component of the dependence is always a small literal constant, we can skew the loop to allow fusion

Bonus: can get rid of a load and maybe a store

Step 1: peel the first write and the last read

for i = 1 to n
  a[i][1] = …
  for j = 2 to n
    a[i][j] = …
  for j = 1 to n-1
    … = a[i][j+1]
  … = a[i][n+1]

Step 2: shift the read loop by one

for i = 1 to n
  a[i][1] = …
  for j = 2 to n
    a[i][j] = …
  for j = 2 to n
    … = a[i][j]
  … = a[i][n+1]

Step 3: fuse

for i = 1 to n
  a[i][1] = …
  for j = 2 to n {
    a[i][j] = …
    … = a[i][j]
  }
  … = a[i][n+1]
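The peel, shift and fuse sequence can be checked on a single row with a short Python sketch. It is not from the lecture: the function names, the written values and the zero padding at `a[n+1]` are assumptions. The fused loop reads `a[j]` right after writing it, which is where the saved load comes from.

```python
def reference(n):
    a = [0] * (n + 2)              # a[n+1] is never written and stays 0
    out = []
    for j in range(1, n + 1):
        a[j] = 10 * j
    for j in range(1, n + 1):
        out.append(a[j + 1])
    return out

def peeled_and_fused(n):
    a = [0] * (n + 2)
    out = []
    a[1] = 10 * 1                  # peeled first write
    for j in range(2, n + 1):
        a[j] = 10 * j
        out.append(a[j])           # shifted read: same element just written
    out.append(a[n + 1])           # peeled last read
    return out

print(reference(4) == peeled_and_fused(4))
```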

Loop Fission

for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2

After fission:

for i = 1 to n
  for j = 1 to n
    S1
for i = 1 to n
  for j = 1 to n
    S2

• Moving S2 across all later “i” iterations
  – Legal as long as there are no dependences from S2 to S1 with > in the fissioned outer loops

Loop Fission

for i = 1 to n
  for j = 1 to n
    … = a[i-1][j]
  for j = 1 to n
    a[i][j] = …

After fission:

for i = 1 to n
  for j = 1 to n
    … = a[i-1][j]
for i = 1 to n
  for j = 1 to n
    a[i][j] = …

• Moving S2 across all later “i” iterations
  – Legal as long as there are no dependences from S2 to S1 with > in the fissioned outer loops
• Here there is a dependence from the write to the read with distance (1) in “i”, so this fission is not legal
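Running both versions makes the violated dependence concrete. The Python sketch below is an illustration with assumed names, initial contents (all ones) and written values (`10 * i`); it records what S1 reads in each version.

```python
def run_original(n):
    a = [[1] * (n + 1) for _ in range(n + 1)]   # every element starts at 1
    reads = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            reads.append(a[i - 1][j])            # S1: read row i-1
        for j in range(1, n + 1):
            a[i][j] = 10 * i                     # S2: write row i
    return reads

def run_fissioned(n):
    a = [[1] * (n + 1) for _ in range(n + 1)]
    reads = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            reads.append(a[i - 1][j])            # all reads now happen first
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a[i][j] = 10 * i
    return reads

print(run_original(2))
print(run_fissioned(2))
```

After fission, the reads in iteration i no longer see the values written in iteration i-1, so the outputs differ.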

Inner Loop Fission

for i = 1 to n
  for j = 1 to n
    … = h[i];
    … = h[i+1];
    …
    … = h[i+49];
    … = h[i+50];

After inner loop fission:

for i = 1 to n
  for j = 1 to n
    … = h[i];
    …
    … = h[i+25];
  for j = 1 to n
    … = h[i+26];
    …
    … = h[i+50];

• Legal as long as there is no dependence from an S2 to an S1 where the first “>” is in the “j” loop.

Inner Loop Fission

for j = 1 to n
  S1
  S2
  S3

[Figure: dependence graph over S1, S2, S3; surviving edge labels: “=”, “=>”]

• Looking at edges carried by the innermost loops
• Strongly connected components cannot be fissioned
• Everything else can be fissioned as long as loops are emitted in topological order
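The SCC rule maps directly onto a standard graph algorithm. The sketch below uses Tarjan's algorithm in Python to group statements and order the groups; it is an illustration, not the lecture's implementation. The function name `fission_groups`, the edge representation (a dict from a statement to the statements that depend on it) and the example graph are all assumptions.

```python
def fission_groups(stmts, deps):
    """Return SCCs of the dependence graph in a topological order.

    stmts: statement names in textual order.
    deps:  dict mapping a statement to the statements that depend on it.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp, key=stmts.index))

    for s in stmts:
        if s not in index:
            visit(s)
    sccs.reverse()                      # Tarjan emits reverse topological order
    return sccs

# Hypothetical graph: S1 and S2 depend on each other, S3 depends on S2.
print(fission_groups(["S1", "S2", "S3"], {"S1": ["S2"], "S2": ["S1", "S3"]}))
```

Each returned group would become one inner loop; statements inside a group stay together because they form a cycle.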

Outer Loop Unrolling

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];

• How many loads in the inner loop? How many MACs?
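One way to answer the question is to count distinct references per inner-loop iteration. The Python sketch below does this for the original body and for a hypothetical unrolling of the i and j loops by factors `ui` and `uj`. It assumes, as a simplification not stated on the slide, that every `c[i][j]` accumulator is kept in a register across the k loop, so only `a` and `b` are loaded.

```python
def inner_loop_counts(ui, uj):
    """Return (distinct loads, MACs) per k iteration for a ui x uj unrolling."""
    loads = set()
    macs = 0
    for di in range(ui):
        for dj in range(uj):
            loads.add(("a", di))   # a[i+di][k]
            loads.add(("b", dj))   # b[k][j+dj]
            macs += 1              # one multiply-accumulate per (di, dj)
    return len(loads), macs

print(inner_loop_counts(1, 1))
print(inner_loop_counts(2, 2))
```

Under these assumptions the load-to-MAC ratio drops from 2:1 to 1:1 with a 2x2 unrolling, which is the motivation for unrolling outer loops.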

Outer Loop Unrolling

for i = 1 to n by 2
  for j = 1 to n by 2
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];
      c[i][j+1] += a[i][k] * b[k][j+1];
      c[i+1][j] += a[i+1][k] * b[k][j];
      c[i+1][j+1] += a[i+1][k] * b[k][j+1];

• Is it legal?

Outer Loop Unrolling

If n = 2

for i = 1 to 2 by 2
  for j = 1 to 2 by 2
    for k = 1 to 2
      c[i][j] += a[i][k] * b[k][j];
      c[i][j+1] += a[i][k] * b[k][j+1];
      c[i+1][j] += a[i+1][k] * b[k][j];
      c[i+1][j+1] += a[i+1][k] * b[k][j+1];

– Original order was
  • (1, 1, 1) (1, 1, 2) (1, 2, 1) (1, 2, 2) (2, 1, 1) (2, 1, 2) (2, 2, 1) (2, 2, 2)
– New order is
  • (1, 1, 1) (1, 2, 1) (2, 1, 1) (2, 2, 1) (1, 1, 2) (1, 2, 2) (2, 1, 2) (2, 2, 2)
• Equivalent to permuting the loops into

for k = 1 to 2
  for i = 1 to 2
    for j = 1 to 2

• If loops are fully permutable, can also outer loop unroll
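The claimed equivalence for n = 2 can be reproduced by enumerating the (i, j, k) tuples each version executes. The Python sketch below is illustrative; the function names are assumptions.

```python
def unrolled_order(n):
    """Order of (i, j, k) updates in the 2x2 outer-unrolled nest (n even)."""
    order = []
    for i in range(1, n + 1, 2):
        for j in range(1, n + 1, 2):
            for k in range(1, n + 1):
                for di in (0, 1):
                    for dj in (0, 1):
                        order.append((i + di, j + dj, k))
    return order

def permuted_order(n):
    """Order of (i, j, k) updates with the loops permuted to k, i, j."""
    return [(i, j, k)
            for k in range(1, n + 1)
            for i in range(1, n + 1)
            for j in range(1, n + 1)]

print(unrolled_order(2) == permuted_order(2))
```

For larger n the unrolled order matches the permuted order only within each 2x2 block of (i, j), which is why full permutability of the nest is the condition given on the slide.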

Unrolling Trapezoidal Loops

for i = 1 to n by 2
  for j = 1 to i

• Ugly
• We unroll two-level trapezoidal loops, but the details are very ugly

Trapezoidal Example

for (i=0; i<n; i++) {
  for (j=2*i; j<n-i; j++) {
    a[i][j] += 1;
  }
}

After unrolling the outer loop by 2:

for(i = 0; i <= (n + -2); i = i + 2) {
  lstar = (i * 2) + 2;
  ustar = (n - (i + 1)) + -1;
  if(((i * 2) + 2) < (n - (i + 1))) {
    for(r2d_i = i; r2d_i <= (i + 1); r2d_i = r2d_i + 1){
      for(j = r2d_i * 2; j <= ((i * 2) + 1); j = j + 1){
        a[r2d_i][j] = a[r2d_i][j] + 1;
      }
    }
    for(j0 = lstar; ustar >= j0; j0 = j0 + 1) {
      a[i][j0] = a[i][j0] + 1;
      a[i + 1][j0] = a[i + 1][j0] + 1;
    };
    for(r2d_i0 = i; r2d_i0 <= (i + 1); r2d_i0 = r2d_i0 + 1) {
      for(j1 = n - (i + 1); j1 < (n - r2d_i0); j1 = j1 + 1) {
        a[r2d_i0][j1] = a[r2d_i0][j1] + 1;
      };
    }
  } else {
    for(r2d_i1 = i; r2d_i1 <= (i + 1); r2d_i1 = r2d_i1 + 1) {
      for(j2 = r2d_i1 * 2; j2 < (n - r2d_i1); j2 = j2 + 1) {
        a[r2d_i1][j2] = a[r2d_i1][j2] + 1;
      }
    }
  }
}
if(n > i) {
  for(j3 = i * 2; j3 < (n - i); j3 = j3 + 1) {
    a[i][j3] = a[i][j3] + 1;
  };
}
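To check that the generated code really covers the same trapezoid, it can be ported line by line and compared with the original nest. The Python sketch below is such a port, written for illustration; the function names and the use of a zero-initialised n x n array are assumptions.

```python
def original(n):
    a = [[0] * max(n, 1) for _ in range(max(n, 1))]
    for i in range(n):
        for j in range(2 * i, n - i):
            a[i][j] += 1
    return a

def unrolled(n):
    a = [[0] * max(n, 1) for _ in range(max(n, 1))]
    i = 0
    while i <= n - 2:
        lstar = 2 * i + 2
        ustar = n - (i + 1) - 1
        if lstar < n - (i + 1):
            for r in (i, i + 1):                  # leading part: only row i is non-empty
                for j in range(2 * r, 2 * i + 2):
                    a[r][j] += 1
            for j in range(lstar, ustar + 1):     # common part: both rows, unrolled by 2
                a[i][j] += 1
                a[i + 1][j] += 1
            for r in (i, i + 1):                  # trailing part
                for j in range(n - (i + 1), n - r):
                    a[r][j] += 1
        else:
            for r in (i, i + 1):                  # no common part: run both rows as-is
                for j in range(2 * r, n - r):
                    a[r][j] += 1
        i += 2
    if n > i:                                     # leftover row when n is odd
        for j in range(2 * i, n - i):
            a[i][j] += 1
    return a

print(all(original(n) == unrolled(n) for n in range(12)))
```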

Cache Tiling

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];

• How many cache misses?

Cache Tiling

for jb = 1 to n by b
  for kb = 1 to n by b
    for i = 1 to n
      for j = jb to jb+b-1
        for k = kb to kb+b-1
          c[i][j] += a[i][k] * b[k][j];

• How many cache misses?
  – Order b reuse for each array
• If loops are fully permutable, can cache tile
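A tiled nest must perform exactly the same updates as the untiled one. The Python sketch below compares the two on small integer matrices. It is an illustration only: it uses 0-based indices, names the tile size `tile` to avoid clashing with the array `b`, and clamps tile bounds with `min` so that `n` need not be a multiple of the tile size (the slide does not show that detail).

```python
def matmul(a, b, n):
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_tiled(a, b, n, tile):
    c = [[0] * n for _ in range(n)]
    for jb in range(0, n, tile):
        for kb in range(0, n, tile):
            for i in range(n):
                for j in range(jb, min(jb + tile, n)):
                    for k in range(kb, min(kb + tile, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c

n = 5
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
print(matmul(a, b, n) == matmul_tiled(a, b, n, 2))
```

Integer inputs are used so that the comparison is exact; with floating point the tiled version adds terms in a different order and may differ in the last bits.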

Vectorization: SIMD

for i = 1 to n
  for j = 1 to n
    a[j][i] = 0;

Vectorized, 8 wide:

for i = 1 to n by 8
  for j = 1 to n
    a[j][i:i+7] = 0;

• N-way parallel, where N is the SIMD width of the machine
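The strip-mined form can be imitated in Python with slice assignment standing in for a vector store. This is only a sketch of the loop structure, not real SIMD code; the function name, the default width of 8 and the handling of a final partial vector with `min` are assumptions.

```python
def zero_vectorized(a, width=8):
    n_rows = len(a)
    n_cols = len(a[0])
    for i in range(0, n_cols, width):          # i steps by the vector width
        hi = min(i + width, n_cols)            # last strip may be partial
        for j in range(n_rows):
            a[j][i:hi] = [0] * (hi - i)        # one "vector store" per row
    return a

grid = [[1] * 10 for _ in range(3)]
print(zero_vectorized(grid))
```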

Vectorization: SIMD

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      S1
      …
      SM

• We have moved later iterations of S1 ahead of earlier iterations of S2, …, SM
• Legal as long as there is no dependence from a later S to an earlier S where that dependence is carried by the vector loop
  – E.g., legal to vectorize “j” above if there is no dependence from a later S to an earlier S with direction (=, >, *)
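The vectorization test can be written as a small predicate over dependence edges. The Python sketch below is one possible reading of the rule, not the lecture's implementation. The function name, the tuple format `(source statement number, sink statement number, direction vector)` and the conservative treatment of "*" (it may stand for "=" in outer positions and for ">" at the vector loop) are all assumptions.

```python
def can_vectorize(deps, loop):
    """deps: list of (src, dst, vec); statements are numbered in textual order.

    loop: depth of the loop to vectorize (0 = outermost).
    """
    for src, dst, vec in deps:
        later_to_earlier = src > dst
        outer_all_equal = all(d in ("=", "*") for d in vec[:loop])
        carried_here = outer_all_equal and vec[loop] in (">", "*")
        if later_to_earlier and carried_here:
            return False
    return True

# Hypothetical dependence from S2 back to S1 with direction (=, >, *).
deps = [(2, 1, ("=", ">", "*"))]
print(can_vectorize(deps, 1))  # vectorizing j
print(can_vectorize(deps, 2))  # vectorizing k
```

With this dependence the j loop is rejected, while the k loop is accepted because the dependence is already carried by j.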

Putting It All Together

• Three-phase algorithm
  1. Use fission and fusion to build perfectly nested loops
     – We prefer fusion, but it is not obvious that that is right
  2. Enumerate possibilities for unrolling, interchanging, cache tiling and vectorizing
  3. Use inner loop fission if necessary to minimize register pressure

Phase 2

Choose a loop to vectorize
  All references that refer to the vector loop must be stride-1
For each possible inner loop
  Compute best possible unrollings for each outer loop
  Compute best possible ordering and tiling

To compute best possible unrolling
  Try all combinations of unrolling up to a max product of 16
  For each possible unrolling
    Estimate the machine cycles for the inner loop (ignoring cache)
    Estimate the register pressure
    Don’t unroll more if too much register pressure

To compute best possible ordering and tiling
  Consider only loops with “reuse”
  Choose best three
  Iterate over all orderings of three with a binary search on cache tile size
  Note and record the total cycle time for this configuration. Pick the best

Estimating cycles
  Could compile every combination, but …
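The unrolling search can be sketched as an enumeration plus a cost function. The Python below is a toy under loudly stated assumptions: `toy_cost` is a made-up model for the matrix-multiply inner loop (one load unit, one MAC unit, one register per accumulator and per loaded value), and the register limit of 32 is arbitrary. Only the "max product of 16" bound comes from the slide.

```python
from itertools import product

def unrollings(num_outer, max_product=16):
    """All unroll-factor tuples for the outer loops whose product is <= max_product."""
    out = []
    for factors in product(range(1, max_product + 1), repeat=num_outer):
        p = 1
        for f in factors:
            p *= f
        if p <= max_product:
            out.append(factors)
    return out

def toy_cost(factors):
    """Hypothetical model: returns (cycles per MAC, registers needed)."""
    ui, uj = factors
    loads = ui + uj                 # distinct a and b elements per k iteration
    macs = ui * uj
    cycles_per_mac = max(loads, macs) / macs
    registers = macs + loads        # accumulators plus loaded values
    return cycles_per_mac, registers

def best_unrolling(max_registers=32):
    candidates = [f for f in unrollings(2) if toy_cost(f)[1] <= max_registers]
    # Prefer fewer cycles, then fewer registers.
    return min(candidates, key=lambda f: (toy_cost(f), f))

print(len(unrollings(2)), best_unrolling())
```

A real implementation would replace `toy_cost` with the machine and register-pressure estimates described in the lecture.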

Machine Modeling

• Recall that software pipelining had resource limits and latency limits
  – Map high-level IR to machine resources
  – Unroll high-level IR operations
    • Remove duplicate loads and stores
    • Count machine resources
    • Build a latency graph of unrolled operations
      – Iterate over inner loop cycles and find worst cycle
    • Assume performance is worst of two limits
• Model register pressure
  – Count loop-invariant loads and stores
  – Count address streams
  – Count cross-iteration CSEs, e.g. … = a[i] + a[i-2]
  – Add machine-dependent constant

Cache Modeling

• Given a loop ordering and a set of tile factors
• Combine array references that differ by a constant, e.g. a[i][j] and a[i+1][j+1]
• Estimate capacity of all array references, multiply by a fudge factor for interference, and stop increasing block sizes if capacity is larger than the cache
• Estimate quantity of data that must be brought into cache

Phase 3: Inner loop fission

• Does the inner loop use too many registers?
  – Break down into SCCs
  – Pick biggest SCC
    • Does it use too many registers?
      – If yes, too bad
      – If no, search for other SCCs to merge in
        » Pick one with most commonality
        » Keep merging while enough registers
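The greedy merge described above might look like the following Python sketch. Much of it is assumed rather than taken from the lecture: each SCC is a tuple `(name, registers, arrays)`, "biggest" means most registers, "commonality" is the number of shared array names, merged register use is the plain sum, and dependence order between SCCs is ignored (a real pass would still have to emit the resulting loops in topological order).

```python
def group_sccs(sccs, reg_limit):
    """Greedily pack SCCs into inner loops that fit in reg_limit registers."""
    remaining = list(sccs)
    groups = []
    while remaining:
        seed = max(remaining, key=lambda s: s[1])      # pick biggest SCC
        remaining.remove(seed)
        names, regs, arrays = [seed[0]], seed[1], set(seed[2])
        if regs <= reg_limit:                          # otherwise: "too bad", emit alone
            while True:
                fitting = [s for s in remaining if regs + s[1] <= reg_limit]
                if not fitting:
                    break
                # Merge the SCC sharing the most arrays with the current group.
                best = max(fitting, key=lambda s: len(arrays & s[2]))
                remaining.remove(best)
                names.append(best[0])
                regs += best[1]
                arrays |= best[2]
        groups.append(names)
    return groups

sccs = [("A", 10, {"x", "y"}), ("B", 4, {"x"}), ("C", 5, {"z"}), ("D", 20, {"w"})]
print(group_sccs(sccs, 16))
```

In this made-up example D exceeds the limit on its own, A absorbs B because they share array x, and C no longer fits alongside them.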

Extra: Reductions

for i = 1 to n
  for j = 1 to n
    a[j] += b[i][j];

Can I unroll?

for i = 1 to n by 2
  for j = 1 to n
    a[j] += b[i][j];
    a[j] += b[i+1][j];

• Legal?
  – Integer: yes
  – Floating point: maybe
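One plausible reading of the “maybe” is that floating-point addition is not associative, so any follow-up step that regroups the two updates (for example into `a[j] += b[i][j] + b[i+1][j]`) can change the result. The Python sketch below illustrates that effect with hypothetical helper names and hand-picked values; it is not a statement about what any particular compiler does.

```python
def sequential(a, b0, b1):
    a += b0          # same association as the two separate updates
    a += b1
    return a

def paired(a, b0, b1):
    a += b0 + b1     # regrouped: the two b values are added first
    return a

# Integers: both groupings agree.
print(sequential(10**16, 1, 1) == paired(10**16, 1, 1))
# Doubles: 1e16 + 1.0 rounds back to 1e16, so the groupings differ.
print(sequential(1e16, 1.0, 1.0) == paired(1e16, 1.0, 1.0))
```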

Extra: Outer Loop Invariants

for i
  for j
    a[i][j] += b[i] * cos(c[j])

Can replace with

for j
  t[j] = cos(c[j])
for i
  for j
    a[i][j] += b[i] * t[j];

• Need to integrate with model
  – Model must assume that invariant computation will be replaced with loads
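The saving from hoisting the outer-loop invariant can be made visible by counting calls to `cos`. The Python sketch below is illustrative; the function names, the call counters and the input values are assumptions.

```python
import math

def direct(a, b, c):
    """Recompute cos(c[j]) in every (i, j) iteration; return the number of calls."""
    calls = 0
    for i in range(len(b)):
        for j in range(len(c)):
            a[i][j] += b[i] * math.cos(c[j])
            calls += 1
    return calls

def hoisted(a, b, c):
    """Compute cos(c[j]) once per j into a temporary t; return the number of calls."""
    t = [math.cos(x) for x in c]
    for i in range(len(b)):
        for j in range(len(c)):
            a[i][j] += b[i] * t[j]     # a load of t[j] replaces the cos call
    return len(c)

b = [1.0, 2.0, 3.0]
c = [0.0, 0.5, 1.0, 1.5]
a1 = [[0.0] * len(c) for _ in b]
a2 = [[0.0] * len(c) for _ in b]
print(direct(a1, b, c), hoisted(a2, b, c), a1 == a2)
```

The inner loop now contains a load of `t[j]` instead of a call, which is the change the cost model has to account for.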
