53
CS420 lecture six Loops

CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Embed Size (px)

Citation preview

Page 1: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

CS420 lecture sixLoops

Page 2: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Time Analysis of loops

Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop body takes constant time2. loop body is executed times

(n − i) =n(n −1)

2i=1

n−1

∑ =O(n2)

Page 3: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Convex hull

Given a set of points in 2D ((x,y) coordinates),find the smallest convex polygon surroundingthem all.

Page 4: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop
Page 5: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Convex hull

Given a set of points in 2D ((x,y) coordinates),find the smallest polygon surrounding them all.

The problem reduces to finding line segments connecting points of the set.

Page 6: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Convex hull

Page 7: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Convex hull: first attempt

Let L be a line segment connecting two points inthe set. For L to be in the convex hull it issufficient that all other points are on the same side of L’s extension to a full line.

Page 8: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Convex hull: first attempt

Let L be a line segment connecting two points inthe set. For L to be in the convex hull it issufficient that all other points are on the same side of L’s extension to a full line.

How do you find out all other points are on thesame side?

Page 9: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Convex hull: first attempt

Let L be a line segment connecting two points in the set. For L tobe in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line.

for i = 1 to n for j= i+1 to n for k = 1 to n if (k!=i&&k!=j) check(pi,pj,pk)

Page 10: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Convex hull: first attempt

for i = 1 to n for j= i+1 to n for k = 1 to n if (k!=i&&k!=j) check(pi,pj,pk)

check is O(1) so this algorithm is O(n3)

Page 11: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

the question that drives us.....

Page 12: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

is there a better algorithm• Find lowest point P1• Sort remaining points by angle they

form with P1 and the horizontal, resulting in a sequence P2…Pn

• Start with P1-P2 in current hull• for i from 3 to n • add Pi in current hull for j from i-1 downto 3 eliminate Pj if P1 and Pi are on different side of line Pj-P(j-1); if Pj stays break

Page 13: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

is there a better algorithm• Find lowest point P1• Sort remaining points by angle they

form with P1 and the horizontal, resulting in a sequence P2…Pn

• Start with P1-P2 in current hull• for i from 3 to n • add Pi in current hull for j from i-1 downto 3 eliminate Pj if P1 and Pi are on different side of line Pj-P(j-1); if Pj stays break1

3

2

4

Page 14: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Complexity?

• find lowest: O(n)• sort O(nlgn)• nested add/eliminate loop outer: i from 3 to n inner: j from i-1 downto 3

O(?)

Page 15: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

nested add/eliminate loop

• O(N) !! why?• n-2 points considered in i loop j loop either eliminates a point, ie it will not

be checked again, or stops. The total number of points considered in all j loop iterations is therefore O(n)

• Complex hull algorithm complexity O(n lg n)

Page 16: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

is there a better algorithm?

• no

Page 17: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

is there a better algorithm?

• no, argument is harder (lower bound arguments usually are)

• it can be shown that sorting can be reduced to convex hull (reduced: translated such that when the convex hull problem is solved the original sorting problem is solved) and we have shown that sorting is Ω(n lg n)

Page 18: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

sort({3, 1, 2}) convex hull({(3,9), (2,4), (1,1)})

3,9

2,4

1,1

reduction:x x,x2

Page 19: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Sub-O Optimizations

• Suppose you have written an asymptotically optimal program, and still want to speed it up.

• Using a profiler identify which parts of your code are the hotspots of your program.

• 10/90 rule of thumb: 90% of the time is spent in 10% of the code: hotspots – Usually some of the innermost loops– Only improve the hotspots. Leave the rest clear

and simple.

Page 20: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Data reorganization

• Create sentinel (value at boundary) to simplify loop control.

found = false; i=0; while (i<n and not found) if (x[i]==T) found = true; else i++;

Page 21: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Data reorganization

• Create sentinel to simplify loop control. found = false; i=0; while (i<n and not found) if (x[i]==T) found = true; else i++; • Sentinel: value at boundary x[n]=T; i=0; while (x[i]!=T)i++; found = (i<n);

Page 22: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Loop unrolling

• Loop unrolling is textually repeating the loop body so that the loop control is executed fewer times– Eg, a median filter operator on an image executes

a 3x3 inner loop for each resulting pixel; this can be fully unrolled

– some compilers (eg CUDA) allow unroll k pragmas– in a linked list, if the last element points at itself,

visiting the elements can be partially unrolled

Page 23: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Loop peeling

• When the body of a loop tests whether it is on a boundary, and has a special case for that boundary, it is often advantageous to have separate code for the boundary avoiding the conditional in the loop body.

• Eg, median filter

Page 24: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Loop unrolling and trivial assignments

fibonacci(n) a=b=c =1; // what happens if the loop gets unrolled once?

for i = 3 to n { c=a+b, a=b; b=c } return c;

Page 25: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Loop unrolling and trivial assignments

fibonacci(n) a=b=c =1; for i = 3 to n { c=a+b, a=b; b=c } return c;

fibonacci(n) a=b=1; for i = 1 to (n/2 -1) {a=a+b; b=a+b} if odd(n) b = a+b; return b;

Page 26: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Memory hierarchy (cache) issues

• Processor are an order of magnitude faster than memories– both have been speeding up exponentially for ~30

years: but with different bases, so their ratio has been growing exponentially as well

– caches keep recently used (temporal locality) and fetch in cache lines (spatial locality)

Page 27: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

cache issues

• memory wall• getting over it: cache• cache line• cache replacement policy: LRU• cache and memory layout of 1D representation

of 2D arrays in C– row access– col access

Page 28: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Data or loop reordering for improve cache performance

Matrix multiply: for i = 1 to n for j= 1 to n C[i,j]=0 for k = 1 to n C[i,j]+=A[i,k]*B[k,j]

Page 29: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Data or loop reordering for improve cache performance

Matrix multiply: for i = 1 to n for j= 1 to n C[i,j]=0 for k = 1 to n C[i,j]+=A[i,k]*B[k,j]

B is accessed in column order. If the arrays are (as in C) stored in row major order, this causes cache misses and unnecessary reads!!

Page 30: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Data or loop reordering for improve cache performance

Matrix multiply: for i = 1 to n for j= 1 to n C[i,j]=0 for k = 1 to n C[i,j]+=A[i,k]*B[k,j]

While one row of A is read, all of B is read

If the cache cannot keep all of B and uses theLeast Recently Used replace policy, all reads of B will cause a cache miss

Page 31: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiling for improved cache behavior

Instead of reading a whole row of A and doing n whole row A column B inner products we canread a block of A and compute smaller inner products with sub columns of B.

(Remember blocked matrix multiply in Strassen)

These partial products are then added up.

Page 32: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Conventional matrix multiply

Page 33: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Conventional matrix multiply

Page 34: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Conventional matrix multiply

Page 35: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Conventional matrix multiply

Page 36: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Conventional matrix multiply

Page 37: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Conventional matrix multiply

Page 38: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Conventional matrix multiply

etc. .....

Page 39: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Conventional matrix multiply

All elements of B are used once, while all of row A[i] are used n times. A[i] may fit in the cache, B will probably not!

Page 40: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 41: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 42: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 43: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 44: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 45: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 46: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 47: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 48: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiled matrix multiply

Page 49: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Reuse of tile of B

• A k x k tile of A (which can fit in the cache) block multiplies with a k x k tile of B (which can fit in the cache) and thus reuses the B tile k times, potentially providing better cache use

• We can parameterize our program with k and experiment

• Data and loop reordering matrix multiply: assignment 2

Page 50: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Experiments you can do

• Transpose B for better cache line behavior• Tile the loop as in the example• In array access A[i*N+j] avoid the multiply by

doing pointer increments and dereferences• You will have a number of versions of your

code. Make a 2D table of results. Then make observations about your results. In a follow up discussion, exchange your experiences

Page 51: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiling

• Loops become nested loops– outer loop visits tile origins– inner loops visit the tile points

Page 52: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Can every loop be tiled???

Tile this for i=1 to n for j=1 to n if (i==1 && j==1) A[i,j]=1 elif j==1 A[i,j]=A[i-1,n] else A[i,j]=A[i,j-1]

Page 53: CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop

Tiling cont'• Tiling is loop reordering. The reordered loop must

obey the data dependences in the original loop. Let iteration i',j',k' occur before iteration i,j,k – true dependence:

• i,j,k uses a value that i',j',k produced

– anti dependence:• i,j,k redefines a value that i',j',k' used

– output dependence:• i,j,k redefines a value that i',j',k' defined

in all these cases the reordered loop must obey the original ordering of the two iterations.