
Communication Lower Bounds and Optimal Algorithms for Programs that Reference Arrays

James Demmel, UC Berkeley

Math and EECS Depts.

Joint work with Michael Christ, Nicholas Knight, Thomas Scanlon, Katherine Yelick

Motivation: Why avoid communication?

• Communication = moving data

– Between levels of memory hierarchy

– Between processors over network

• Running time of an algorithm is sum of 3 terms:

– #flops * time per flop

– #words moved / bandwidth ... communication

– #messages * latency ... communication

• Time per flop ≪ 1/bandwidth ≪ latency

– Gaps growing exponentially

• Avoid communication to save time

• Same story for energy: Avoid communication to save energy

Example: Optimal Sequential Matmul

• Naive code

– for i=1:n, for j=1:n, for k=1:n, C(i, j) += A(i, k) ∗ B(k, j)

– Moves Θ(n^3) words between cache (size M < n^2) and DRAM

• “Blocked” code

– Write A as an n/b × n/b matrix of b × b blocks A[i, j]

– Ditto for B, C

– for i=1:n/b, for j=1:n/b, for k=1:n/b, C[i, j] += A[i, k] ∗ B[k, j] ... b × b matmul (see the sketch below)

• Thm [Hong, Kung]: Choosing b ≲ (M/3)^{1/2} attains the lower bound:

#words moved = Ω(n^3 / M^{1/2})

• Where do the 1/2's come from?
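
A minimal sketch of the blocked code above, assuming NumPy arrays and a hypothetical fast-memory size M (in words); the block size b ≈ (M/3)^{1/2} is chosen so one b × b block of each of A, B, C fits in cache:

```python
import numpy as np

def blocked_matmul(A, B, C, M):
    """C += A*B with b-by-b tiles, b ~ sqrt(M/3), per the blocked code above."""
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # one b-by-b block matmul; slicing also handles ragged edge blocks
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```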

New Theorem, applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i, j) += A(i, k) ∗ B(k, j)

• Record array indices in matrix ∆

∆ =

        i  j  k
   A    1  0  1
   B    0  1  1
   C    1  1  0

• Let x = [x_i, x_j, x_k]^T, 1 = vector of 1's

• Solve LP: maximize 1^T x such that ∆x ≤ 1

• Solution: x = [1/2, 1/2, 1/2], 1^T x = 3/2 ≡ s_HBL

• Thm: #words moved = Ω(n^3 / M^{s_HBL − 1}) = Ω(n^3 / M^{1/2}).

• Attain by blocking index i by Θ(M^{x_i}) = Θ(M^{1/2}), ditto for j, k
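
A small sketch of solving this LP numerically, assuming SciPy is available; hbl_exponents and delta_matmul are names introduced here for illustration only:

```python
import numpy as np
from scipy.optimize import linprog

def hbl_exponents(delta):
    """Maximize 1^T x subject to delta @ x <= 1, x >= 0.

    linprog minimizes, so the objective is negated; returns x and s_HBL = 1^T x.
    """
    d = delta.shape[1]
    res = linprog(c=-np.ones(d), A_ub=delta, b_ub=np.ones(delta.shape[0]),
                  bounds=[(0, None)] * d, method="highs")
    return res.x, float(res.x.sum())

# Matmul: rows A, B, C; columns i, j, k
delta_matmul = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [1, 1, 0]])
x, s_hbl = hbl_exponents(delta_matmul)   # x ≈ [1/2, 1/2, 1/2], s_HBL ≈ 3/2
```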

New Theorem, applied to Direct n-Body

• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))

• Record array indices in matrix ∆

∆ =

         i  j
  F      1  0
  P(i)   1  0
  P(j)   0  1

• Let x = [x_i, x_j]^T, 1 = vector of 1's

• Solve LP: maximize 1^T x such that ∆x ≤ 1

• Solution: x = [1, 1], 1^T x = 2 ≡ s_HBL

• Thm: #words moved = Ω(n^2 / M^{s_HBL − 1}) = Ω(n^2 / M^1).

• Attain by blocking index i by Θ(M^{x_i}) = Θ(M^1), ditto for j
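
A minimal sketch of the corresponding blocked loop, assuming a hypothetical pairwise kernel force() and a particle list P; both loops are tiled by Θ(M), per the LP solution x = [1, 1]:

```python
def blocked_nbody(P, M, force):
    """Tiled direct n-body: accumulate F(i) += force(P(i), P(j)) block by block."""
    n = len(P)
    b = max(1, M)                      # block size Θ(M^1) for both i and j
    F = [0.0] * n
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            # one Θ(M)-sized block of P(i) and one of P(j) in fast memory
            for i in range(i0, min(i0 + b, n)):
                for j in range(j0, min(j0 + b, n)):
                    F[i] += force(P[i], P[j])
    return F
```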

New Theorem, applied to Random Code

• for i1=1:n, ..., for i6=1:n,
      A1(i1, i3, i6) += func1(A2(i1, i2, i4), A3(i2, i3, i5), A4(i3, i4, i6))
      A5(i2, i6) += func2(A6(i1, i4, i5), A3(i3, i4, i6))

• Record array indices in 6 × 6 matrix ∆

– one column per index i1,...,i6

– one row per distinct set of array subscripts A1, ..., A6

– ∆(i, j) = 1 if array subscript i has index j, else 0

• Let x = [x_1, ..., x_6]^T, 1 = vector of 1's

• Solve LP: maximize 1^T x such that ∆x ≤ 1

• Solution: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 ≡ s_HBL

• Thm: #words moved = Ω(n^6 / M^{s_HBL − 1}) = Ω(n^6 / M^{8/7}).

• Attained by block sizes M^{2/7}, M^{3/7}, M^{1/7}, M^{2/7}, M^{3/7}, M^{4/7}
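
Reusing the hbl_exponents sketch from the matmul slide (an illustration of mine, not part of the talk), the 6 × 6 ∆ for this code can be written down and solved the same way:

```python
import numpy as np

# Columns i1..i6; one row per distinct subscript set
delta_random = np.array([
    [1, 0, 1, 0, 0, 1],   # A1(i1, i3, i6)
    [1, 1, 0, 1, 0, 0],   # A2(i1, i2, i4)
    [0, 1, 1, 0, 1, 0],   # A3(i2, i3, i5)
    [0, 0, 1, 1, 0, 1],   # A4(i3, i4, i6), same subscripts as A3(i3, i4, i6)
    [0, 1, 0, 0, 0, 1],   # A5(i2, i6)
    [1, 0, 0, 1, 1, 0],   # A6(i1, i4, i5)
])
x, s_hbl = hbl_exponents(delta_random)
# expect x ≈ [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], s_HBL ≈ 15/7
```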

Summary of Results (1/3)

• Extend communication lower bound proof from linear algebra to any program with

– Inner loop iterations indexed by (i1, ..., id)

– Arrays in inner loop subscripted by linear functions of indices

– Ex: A(i1, i2 − i1, 3i1 − 4i2 + 7i4, ...), B(pntr(i5 + 6i6)), ...

– Can be dense or sparse, sequential or parallel, ...

• Based on recent generalization of Holder, Loomis-Whitney, Brascamp-Lieb inequalities by Bennett/Carbery/Christ/Tao

– Need to count lattice points, not volumes

– Get linear program with one inequality per subgroup H ≤ Z^d

– Solution of linear program (HBL-LP) is s_HBL
– Thm: #words moved = Ω(#loop iterations / M^{s_HBL − 1})

Summary of Results (2/3)

• Can we write down the lower bound?

– One inequality per subgroup H ≤ Z^d, but still finitely many!

– Thm (Bad news): Writing down all inequalities in HBL-LP ⇐⇒ Hilbert's 10th Problem over Q

– Thm (Good news): Another LP has same solution, is decidable (but expensive, so far)

– Thm (Better news): Easy to write down HBL-LP explicitly in many cases of interest (e.g. when subscripts are just subsets of indices)

– Also easy to get upper/lower bounds on solution s_HBL

Summary of Results (3/3)

• Can we attain the lower bound?

– Depends on loop dependencies

– Best case: none, or reductions (like matmul)

– Thm: When subscripts are just subsets of indices, the solution x of the dual HBL-LP tells us the optimal tile sizes M^{x_1}, ..., M^{x_d}

– Ex: linear algebra, n-body, “random code”, database join, ...

– Conjecture: always attainable (modulo dependencies)

Outline

1. Recall lower bound proof for direct linear algebra using Loomis-Whitney

2. Holder-Brascamp-Lieb Linear Program (HBL-LP)

• Continuous case, then discrete case

3. Applying lower bound to more general code

4. Decidability of lower bound

• Where Hilbert's 10th Problem over Q arises, how to avoid it

5. Special Case: When subscripts are just subsets of indices

• Why HBL-LP simpler, why dual tells us optimal algorithm

6. Conclusions and Open Problems

Recall Proof for Direct Linear Algebra (3 Nested Loops)

Geometric Model

Loomis-Whitney

Summary of Lower Bound Proof for 3 Nested Loops

• M = fast memory size, G = total number of flops

• Break instruction stream into segments of M loads/stores

• =⇒ 2M words of data available during segment

• Use Loomis-Whitney to bound F = #multiplies/segment by

F ≤ (#A entries)^{1/2} · (#B entries)^{1/2} · (#C entries)^{1/2} ≤ (2M)^{3/2} = O(M^{3/2})

• F · #segments ≥ G =⇒ #segments ≥ G/F

• #loads/stores = M · #segments ≥ M · G/F = Ω(G/M^{1/2}) ... worked for matmul below

• Result independent of dependencies (so works for LU, etc)

• Result independent of G (so works for sparse, parallel etc)

• Bound decreases with M =⇒ replication may help (2.5D algs)
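
Chaining these steps for matmul specifically (G = n^3 multiplies), which recovers the bound quoted earlier:

```latex
% Segment argument instantiated for matmul, G = n^3:
F \le (2M)^{3/2},\qquad
\#\text{segments} \ge \frac{G}{F} \ge \frac{n^3}{(2M)^{3/2}},\qquad
\#\text{words moved} \ge M\cdot\#\text{segments} \ge \frac{M\,n^3}{(2M)^{3/2}}
  = \Omega\!\left(\frac{n^3}{M^{1/2}}\right).
```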

First Extension Strategy

• Loomis-Whitney =⇒ Holder-Brascamp-Lieb (HBL)

• Volume of E ⊂ R^3 =⇒ Volume of E ⊂ R^d

• Projections from (i, j, k) to (i, j), (i, k), (k, j) =⇒ any linear projections φ_1, ..., φ_m

• vol(E) ≤ (area(E_ij))^{1/2} · (area(E_ik))^{1/2} · (area(E_jk))^{1/2} =⇒
  vol(E) ≤ C · ∏_{i=1}^m vol(φ_i(E))^{s_i}

Where do we get exponents s_i and C < ∞?

Continuous HBL

Continuous HBL Linear Program (C-HBL-LP):

dim(R^d) = d = Σ_{i=1}^m s_i · dim(φ_i(R^d)) = Σ_{i=1}^m s_i · d_i

and for all subspaces H ≤ R^d, dim(H) ≤ Σ_{i=1}^m s_i · dim(φ_i(H))

Note: There exist infinitely many H, but only finitely many possible constraints in C-HBL-LP (at most (d + 1)^{m+1})

Thm (B/C/C/T): s_i ≥ 0 satisfy C-HBL-LP if and only if ∃ C < ∞ such that for all f_i : R^{d_i} → [0, ∞) in L^{1/s_i}

∫_{R^d} ∏_{i=1}^m f_i(φ_i(x)) dx ≤ C · ∏_{i=1}^m ( ∫_{R^{d_i}} [f_i(y)]^{1/s_i} dy )^{s_i} = C · ∏_{i=1}^m ‖f_i‖_{1/s_i}

Continuous HBL - Special case (1/3)

dim(R^d) = d = Σ_{i=1}^m s_i · dim(φ_i(R^d)) = Σ_{i=1}^m s_i · d_i

and for all subspaces H ≤ R^d, dim(H) ≤ Σ_{i=1}^m s_i · dim(φ_i(H))

Thm (B/C/C/T): s_i ≥ 0 satisfy C-HBL-LP if and only if ∃ C < ∞ such that for all f_i : R^{d_i} → [0, ∞) in L^{1/s_i}

∫_{R^d} ∏_{i=1}^m f_i(φ_i(x)) dx ≤ C · ∏_{i=1}^m ‖f_i‖_{1/s_i}

Holder's Inequality: Choose all φ_i = identity, so Σ_{i=1}^m s_i = 1

‖∏_{i=1}^m f_i(x)‖_1 ≤ C · ∏_{i=1}^m ‖f_i‖_{1/s_i} ... can show C = 1

Continuous HBL - Special case (2/3)

dim(R^d) = d = Σ_{i=1}^m s_i · dim(φ_i(R^d)) = Σ_{i=1}^m s_i · d_i    (*)

and for all subspaces H ≤ R^d, dim(H) ≤ Σ_{i=1}^m s_i · dim(φ_i(H))

Thm (B/C/C/T): s_i ≥ 0 satisfy C-HBL-LP if and only if ∃ C < ∞ such that for all f_i : R^{d_i} → [0, ∞) in L^{1/s_i}

∫_{R^d} ∏_{i=1}^m f_i(φ_i(x)) dx ≤ C · ∏_{i=1}^m ‖f_i‖_{1/s_i}

Brascamp-Lieb Inequality: Given only (*), C maximized by f_i(x) = exp(−x^T A_i x) for some s.p.d. A_i (C could be ∞)

Continuous HBL - Special case (3/3)

dim(R^d) = d = Σ_{i=1}^m s_i · dim(φ_i(R^d)) = Σ_{i=1}^m s_i · d_i

and for all subspaces H ≤ R^d, dim(H) ≤ Σ_{i=1}^m s_i · dim(φ_i(H))

Thm (B/C/C/T): s_i ≥ 0 satisfy C-HBL-LP if and only if ∃ C < ∞ such that for all f_i : R^{d_i} → [0, ∞) in L^{1/s_i}

∫_{R^d} ∏_{i=1}^m f_i(φ_i(x)) dx ≤ C · ∏_{i=1}^m ‖f_i‖_{1/s_i}

Loomis-Whitney & beyond: Given bounded E ⊂ R^d, f_i = indicator function of φ_i(E),

vol(E) ≤ C · ∏_{i=1}^m (vol(φ_i(E)))^{s_i}

Illustration of C-HBL-LP

But we want to count lattice points ≡ loop iterations

Second Extension Strategy: Discrete HBL (1/2)

• Count lattice points instead of volumes:

– Lattice points correspond to loop iterations

(i, j, k) ←→ C(i, j) += A(i, k) ∗ B(k, j)

– Projected lattice points correspond to array entries

(i, j) ←→ C(i, j), etc.

• Vector space R^d =⇒ abelian group Z^d under addition

• Subspaces H ≤ R^d =⇒ subgroups H ≤ Z^d

• Linear projection φ_i =⇒ group homomorphism φ_i

• Subspace φ_i(H) =⇒ subgroup φ_i(H)

• dim(H) =⇒ rank(H), dim(φ_i(H)) =⇒ rank(φ_i(H))

• Like C-HBL-LP, but all H, φ_i are integer, not real

Second Extension Strategy: Discrete HBL (2/2)

Discrete HBL Linear Program (D-HBL-LP):
for all subgroups H ≤ Z^d, rank(H) ≤ Σ_{i=1}^m s_i · rank(φ_i(H))

Note: There exist infinitely many H, but only finitely many possible constraints in D-HBL-LP (at most (d + 1)^{m+1})

Thm (B/C/C/T): s_i ≥ 0 satisfy D-HBL-LP if and only if for any finite set E ⊂ Z^d its cardinality |E| is bounded by

|E| ≤ ∏_{i=1}^m |φ_i(E)|^{s_i} ... C = 1!

We want the tightest bound when |φ_i(E)| ≤ 2M, i.e. |E| ≤ (2M)^{Σ_{i=1}^m s_i}

=⇒ Compute s_HBL ≡ min Σ_{i=1}^m s_i subject to D-HBL-LP

Thm: #words moved = Ω(#iterations / M^{s_HBL − 1})
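
As a quick sanity check (my example, not from the talk): for the three matmul projections with s_i = 1/2, taking E to be the full n × n × n index cube makes the bound tight:

```latex
% Matmul, s = (1/2, 1/2, 1/2): each projection of E = [1,n]^3 is an n-by-n square
|E| = n^3 \le \prod_{i=1}^{3} |\phi_i(E)|^{1/2}
    = (n^2)^{1/2}\,(n^2)^{1/2}\,(n^2)^{1/2} = n^3 .
```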

Some ideas in the proof of Discrete HBL (1/2)

∀ H ≤ Z^d, rank(H) ≤ Σ_{i=1}^m s_i · rank(φ_i(H))  ⇐⇒  |E| ≤ ∏_{i=1}^m |φ_i(E)|^{s_i}

• Necessity

– For any H ≤ Z^d, let E_n be the n × n × · · · × n “brick” in H

– |E_n| = Θ(n^{rank(H)}) and |φ_i(E_n)| = O(n^{rank(φ_i(H))})

Θ(n^{rank(H)}) = |E_n| ≤ ∏_{i=1}^m |φ_i(E_n)|^{s_i} = O(∏_{i=1}^m n^{s_i · rank(φ_i(H))}) = O(n^{Σ_{i=1}^m s_i · rank(φ_i(H))})

Some ideas in the proof of Discrete HBL (2/2)

∀ H ≤ Z^d, rank(H) ≤ Σ_{i=1}^m s_i · rank(φ_i(H))  ⇐⇒  |E| ≤ ∏_{i=1}^m |φ_i(E)|^{s_i}

• Sufficiency (hard part)

– Suffices to consider extreme points s = [s_1, ..., s_m] of the polytope defined by D-HBL-LP

– Induction over d

– Def: H ≤ Z^d critical if rank(H) = Σ_{i=1}^m s_i · rank(φ_i(H))

– Given V ≤ Z^d and s an extreme point, then either ∃ critical 0 < H < V (induction on H) or s ∈ {0, 1}^m

Applying Bounds to More General Code (1/5)

• General model:

for all I ∈ Z ⊂ Z^d, in some order
    inner loop(I, A_1(φ_1(I)), ..., A_m(φ_m(I)))

• Ex: LU inner loop: A(i, j) = A(i, j) − L(i, k) ∗ U(k, j)

– Ok to ignore loop scaling columns of L

– Ok to overwrite A: L(i, k) = A(i, k) for i > k, ditto for U

– Same idea applies to BLAS, Cholesky, LDL^T, ...

– Same idea applies to tensor contractions

– QR, eig, SVD need another idea

Applying Bounds to More General Code (2/5)

• General model:

for all I ∈ Z ⊂ Z^d, in some order
    inner loop(I, A_1(φ_1(I)), ..., A_m(φ_m(I)))

• Ex: Computing B = A^k (k odd)
  for i1 = 1 : ⌊k/2⌋, C = A · B, B = A · C

• Imperfectly nested loops

• Can’t just omit B = A · C; infinite data reuse possible, so any lower bound ∝ |Z| must be 0; leads to infeasible HBL-LP

• Solution: impose reads/writes: let A[1] = A, then
  for i1 = 2 : k, A[i1] = A[1] ∗ A[i1 − 1]

• Apply lower bound to new code, subtract added #reads/writes

• #words moved = Ω(k n^3 / M^{1/2} − k n^2) = Ω(k n^3 / M^{1/2})
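
A minimal sketch of the rewritten code with imposed reads/writes, assuming NumPy matrices; matrix_power_chain is a name introduced here for illustration:

```python
import numpy as np

def matrix_power_chain(A, k):
    """Compute A^k via the rewritten loop above: A[i1] = A[1] * A[i1 - 1],
    storing every intermediate so each matmul explicitly reads/writes arrays."""
    powers = [None, A.copy()]          # powers[i] holds A^i (1-indexed)
    for i1 in range(2, k + 1):
        powers.append(powers[1] @ powers[i1 - 1])
    return powers[k]
```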

Applying Bounds to More General Code (3/5)

• General model:

for all I ∈ Z ⊂ Z^d, in some order
    inner loop(I, A_1(φ_1(I)), ..., A_m(φ_m(I)))

• Ex: Database join

for i_1 = 1 : N_1, for i_2 = 1 : N_2
    if predicate(R(i_1), S(i_2)) = true,
        output(i_1, i_2) = func(R(i_1), S(i_2))

– Write Z = Z_T ∪ Z_F, depending on predicate

– Apply lower bound to Z_T, Z_F separately, take max

– #words moved = Ω(max(|Z_T|, |Z|/M))
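
A small sketch of a blocked version of this join, with hypothetical predicate and func kernels; R and S are scanned in Θ(M)-sized blocks so each pair of blocks is brought into fast memory once:

```python
def blocked_join(R, S, M, predicate, func):
    """Tiled database join: evaluate the predicate block by block and
    materialize only the true pairs (the set Z_T above)."""
    b = max(1, M)
    out = {}
    for i0 in range(0, len(R), b):
        for j0 in range(0, len(S), b):
            for i in range(i0, min(i0 + b, len(R))):
                for j in range(j0, min(j0 + b, len(S))):
                    if predicate(R[i], S[j]):
                        out[(i, j)] = func(R[i], S[j])
    return out
```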

Applying Bounds to More General Code (4/5)

• General model:

for all I ∈ Z ⊂ Z^d, in some order
    inner loop(I, A_1(φ_1(I)), ..., A_m(φ_m(I)))

• Ex: Dense or sparse QR decomposition, using orthogonal transformations

• Not one “algorithm,” many variations: un/blocked Givens/Householder, order in which entries zeroed out, ...

• Blocking orth. trans. ⇒ imperfectly nested loops

– Challenge: output of first nest input to second, so need to bound data reuse

Applying Bounds to More General Code (5/5)

• Dense or sparse QR decomposition, continued

• Thm 1: #words moved = Ω(#flops / M^{1/2}) if

– Blocked Householder with any block sizes

– One Householder transform per column

• Thm 2: #words moved = Ω(#flops / M^{1/2}) if

– “Forward Progress”: each entry zeroed out once

– Block size must be 1

• Conjecture: Forward Progress sufficient

• Generalizes to eigenvalue problems

Decidability of the Lower Bound (1/3)

• Recall Continuous HBL-LP: dim(R^d) = d = Σ_{i=1}^m s_i · dim(φ_i(R^d))
  and ∀ H ≤ R^d, dim(H) ≤ Σ_{i=1}^m s_i · dim(φ_i(H))

• To write this down, need to solve:
  Given r_H, r_{H_1}, ..., r_{H_m}, decide if ∃ H ≤ R^d s.t.
  dim(H) = r_H, dim(φ_1(H)) = r_{H_1}, ..., dim(φ_m(H)) = r_{H_m}

• Write H as a d × d matrix

• Write each φ_i as a d_i × d matrix

• Express rank conditions by (non)zero constraints on minors

• Tarski-decidable

– Enough to get upper bound on s_HBL =⇒ valid lower bound on communication (possibly too low)

Decidability of the Lower Bound (2/3)

• What about Discrete HBL-LP?
  ∀ H ≤ Z^d, rank(H) ≤ Σ_{i=1}^m s_i · rank(φ_i(H))

• To write this down, need to solve:
  Given r_H, r_{H_1}, ..., r_{H_m}, decide if ∃ H ≤ Z^d s.t.
  rank(H) = r_H, rank(φ_1(H)) = r_{H_1}, ..., rank(φ_m(H)) = r_{H_m}

• Can encode with minors as before

• Thm: Whether any given system of polynomial equations with rational coefficients has a rational solution or not can be encoded by right choice of φ_1, ..., φ_m.

• Cor: Being able to write down D-HBL-LP ⇐⇒ ∃ decision procedure for Hilbert's 10th Problem over Q
  – Over Q instead of Z because all conditions homogeneous

Decidability of the Lower Bound (3/3)

• What about Discrete HBL-LP?
  ∀ H ≤ Z^d, rank(H) ≤ Σ_{i=1}^m s_i · rank(φ_i(H))

• Constraints define polytope P in space of [s_1, ..., s_m] ∈ R^m

• Enough to get any subset of subgroups H defining P

• Let (H_1, H_2, H_3, ...) be any enumeration of all H ≤ Z^d

• Let P_i be polytope defined by (H_1, ..., H_i)

• “Simple” decidability algorithm:

i = 0, repeat i = i + 1 until P_i = P

• Thm: Decidable whether a vertex of P_i is in P
  – Similar induction idea as before

• Better algorithm: which subgroups H to try first?

Special Case: When subscripts are just subsets of indices (1/3)

• Ex: linear algebra, N-body, database join, ...

– Matmul: (i, j, k) are indices, subscripts are A(i, k), B(k, j), C(i, j)

• Much simpler:

– Easy to write down Discrete HBL-LP to get lower bound

– Easy to attain lower bound (modulo dependencies): Dual of Discrete HBL-LP gives optimal block sizes

– Basis of examples at start of talk

• Extends to subsets of unimodular transformations of indices

– Ex: subsets of (i, 2i + j, 3i + 2j + k)

Special Case: When subscripts are just subsets of indices (2/3)

• Let i_1, ..., i_d be indices, φ_1, ..., φ_m be projections

• Let ∆_{j,k} = 1 if i_k in range of φ_j, else 0

• Thm: Let s = [s_1, ..., s_m] minimize 1^T s ≡ s_HBL such that s^T ∆ ≥ 1^T. Then
  #words moved = Ω(#loop iterations / M^{s_HBL − 1})

• Proof idea

– Constraints s^T ∆ ≥ 1^T are subset of Discrete HBL-LP, for all H spanned by (0, ..., 0, 1, 0, ..., 0) (k-th entry = 1)

– Show this subset implies rank(H) ≤ Σ_{j=1}^m s_j · rank(φ_j(H)) for all H ≤ Z^d
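
The LP in this Thm can be solved the same way as the maximize form earlier; a sketch assuming SciPy, with s_hbl_from_delta a name introduced here (for the matmul ∆ it returns s = [1/2, 1/2, 1/2] and s_HBL = 3/2):

```python
import numpy as np
from scipy.optimize import linprog

def s_hbl_from_delta(delta):
    """Minimize 1^T s subject to s^T delta >= 1^T, s >= 0.
    delta has one row per array (subscript set), one column per loop index."""
    m, d = delta.shape
    # s^T delta >= 1^T  <=>  delta^T s >= 1  <=>  -(delta^T) s <= -1
    res = linprog(c=np.ones(m), A_ub=-delta.T, b_ub=-np.ones(d),
                  bounds=[(0, None)] * m, method="highs")
    return res.x, float(res.fun)

s, s_hbl = s_hbl_from_delta(np.array([[1, 0, 1],    # A(i, k)
                                      [0, 1, 1],    # B(k, j)
                                      [1, 1, 0]]))  # C(i, j)
```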

Special Case: When subscripts are just subsets of indices (3/3)

• Let i_1, ..., i_d be indices, φ_1, ..., φ_m be projections

• Let ∆_{j,k} = 1 if i_k in range of φ_j, else 0

• Dual LP: Let x = [x_1, ..., x_d] maximize 1^T x ≡ s_HBL such that ∆x ≤ 1.

• Thm: The solution x of the Dual LP gives the optimal block sizes to minimize communication: i_k blocked by M^{x_k}

• Proof idea

– Each constraint in ∆x ≤ 1 bounds number of entries of each array by M

– 1^T x = s_HBL says number of inner loop iterations per block is M^{s_HBL}.

• Extends to parallel case, “n.5D” algorithms
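
A tiny sketch of turning the dual solution into concrete tile sizes, assuming x came from the hbl_exponents sketch earlier and M is the fast-memory size in words:

```python
def block_sizes(x, M):
    """Optimal tile sizes from the dual LP solution: index i_k blocked by M^{x_k}."""
    return [max(1, int(round(M ** xk))) for xk in x]

# Matmul, x = [1/2, 1/2, 1/2], M = 1024 words: every index blocked by ~32
# block_sizes([0.5, 0.5, 0.5], 1024) -> [32, 32, 32]
```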

Some improved algorithms that avoid communication

• Work of many people!

– Up to 12x faster for 2.5D matmul on 64K core IBM BG/P

– Up to 3x faster for tensor contractions on 3K core Cray XE6

– Up to 6.2x faster for APSP on 24K core Cray XE6

– Up to 2.1x faster for 2.5D LU on 64K core IBM BG/P

– Up to 11.8x for direct N-body on 32K core IBM BG/P

– Up to 13x for TSQR on Tesla C2050 Fermi NVIDIA GPU

– Up to 6.7x faster for symeig(bandA) on 10 core Intel Westmere

– Up to 2x faster for 2.5D Strassen on 38K core Cray XT4

• Communicate asymptotically < existing algorithms in theory

– SVD, LDL^T, 2.5D QR, Nonsymmetric eigenproblem

– QR with column pivoting, other pivoting schemes

– Sparse Cholesky, for suitable graphs

Conclusions

• Possible to derive decidable communication lower bounds for many widely used algorithms that access arrays

• Possible to achieve these bounds in many cases, leading to faster algorithms

• Open problems

– Make derivation of lower bounds efficient, automate it

– Conjecture: Always attainable, modulo loop dependencies

– Implement in compilers

Key to Success


Don’t Communicate...
