
Scalable numerical algorithms for electronic structure calculations

Edgar Solomonik

UC Berkeley

July, 2012

Edgar Solomonik Cyclops Tensor Framework 1/ 73


Outline

Introduction

Communication-avoiding linear algebra
  • Motivation: Density Functional Theory
  • Matrix multiplication
  • Solving systems of linear equations
  • Solving the symmetric eigenvalue problem

Communication-avoiding tensor computations
  • Motivation: Coupled Cluster
  • Tensor contractions
  • Symmetric and skew-symmetric tensors

Conclusion


Algorithmic roots: communication avoidance

Targeting leadership-class platforms (e.g. BG/Q)

  • Large amount of distributed-memory parallelism
  • Hierarchical parallelism
  • Communication architecture lagging behind compute architecture

These architectures motivate communication-avoiding algorithms, which consider

  • bandwidth cost (amount of data communicated)
  • latency cost (number of messages or synchronizations)
  • critical path (communication load balance)
  • topology (communication contention)


Application roots: electronic structure calculations

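Electronic structure calculations seek approximate solutions to the Schrödinger equation

$$ i\hbar \frac{\partial}{\partial t} |\Psi\rangle = H |\Psi\rangle $$

which satisfy

$$ H |\Psi\rangle = E |\Psi\rangle. $$

Often we want the ground-state (lowest-energy) wave-function $\Psi_0$, such that

$$ \langle \Psi_0 | H | \Psi_0 \rangle = E_0. $$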


Density Functional Theory (DFT)

DFT uses the fact that the ground-state wave-function $\Psi_0$ is a unique functional of the particle density $n(\vec{r})$,

$$ \Psi_0 = \Psi[n_0]. $$

Since $H = T + V + U$, where $T$, $V$, and $U$ are the kinetic, potential, and interaction contributions respectively,

$$ E[n_0] = \langle \Psi[n_0] | T[n_0] + V[n_0] + U[n_0] | \Psi[n_0] \rangle. $$

DFT assumes $U = 0$ and solves the Kohn-Sham equations

$$ \left[ -\frac{\hbar^2}{2m} \nabla^2 + V_s(\vec{r}) \right] \phi_i(\vec{r}) = \epsilon_i \phi_i(\vec{r}) $$

where $V_s$ includes an exchange-correlation potential correction,

$$ V_s(\vec{r}) = V(\vec{r}) + \int \frac{e^2 n_s(\vec{r}\,')}{|\vec{r} - \vec{r}\,'|} \, d^3 r' + V_{XC}[n_s(\vec{r})]. $$


Density Functional Theory (DFT), contd.

DFT approximates the exchange-correlation potential $V_{XC}$ by a functional, which is often system-dependent. This allows the following iterative scheme:

1. Given an (initial guess) $n(\vec{r})$, calculate $V_s$ via Hartree-Fock and the functional
2. Solve (diagonalize) the Kohn-Sham equation to obtain each $\phi_i$
3. Compute a new guess at $n(\vec{r})$ based on the $\phi_i$

Due to the rough approximation of correlation and exchange, DFT is good for weakly-correlated systems (which appear in solid-state physics), but suboptimal for strongly-correlated systems.


Linear algebra in DFT

DFT requires a few core numerical linear algebra kernels

  • Matrix multiplication (of rectangular matrices)
  • Linear equations solver
  • Symmetric eigensolver (diagonalization)

We proceed to study schemes for optimization of these algorithms.


Matrix multiplication (MM)

We consider matrix multiplication

$$ C[i, j] = \sum_{k=0}^{n-1} A[i, k] \, B[k, j] $$

which in algorithmic form is

for i = 0 to n − 1 do
  for j = 0 to n − 1 do
    for k = 0 to n − 1 do
      C[i, j] += A[i, k] · B[k, j]
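The loop nest above translates directly into code. The following is a minimal Python sketch using plain lists of lists; the function name `matmul` and the small test matrices are illustrative choices, not part of the slides.

```python
def matmul(A, B):
    """Multiply two square matrices (lists of lists) with the i, j, k loop nest."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # C[i, j] += A[i, k] * B[k, j]
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul(A, B)
```

The three loops can be reordered freely, which is what the parallel algorithms on the following slides exploit when they assign different (i, j, k) ranges to different processors.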


SUMMA/Cannon blocked algorithm for MM

On an l-by-l processor grid, with b = n/l

for k = 0 to n − 1 do
  for p = 0 to l − 1 in parallel do
    for q = 0 to l − 1 in parallel do
      broadcast A[:, k]
      broadcast B[k, :]
      for i = 0 to b − 1 do
        for j = 0 to b − 1 do
          C[i + pb, j + qb] += A[i + pb, k] · B[k, j + qb]
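The blocked algorithm can be emulated serially to check the index arithmetic. This is only a sketch: it assumes NumPy, assumes l divides n, and replaces the broadcasts with plain array reads, so it illustrates which processor (p, q) updates which entries rather than any real communication. The name `summa_blocked` is hypothetical.

```python
import numpy as np

def summa_blocked(A, B, l):
    """Serially emulate SUMMA on an l-by-l grid with a blocked layout."""
    n = A.shape[0]
    b = n // l  # block size; assumes l divides n
    C = np.zeros((n, n))
    for k in range(n):
        col = A[:, k]  # stands in for "broadcast A[:, k]"
        row = B[k, :]  # stands in for "broadcast B[k, :]"
        for p in range(l):          # processor row (parallel in the real algorithm)
            for q in range(l):      # processor column (parallel in the real algorithm)
                rows = slice(p * b, (p + 1) * b)
                cols = slice(q * b, (q + 1) * b)
                # rank-1 update of the block owned by processor (p, q)
                C[rows, cols] += np.outer(col[rows], row[cols])
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
summa_ok = np.allclose(summa_blocked(A, B, l=4), A @ B)
```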


SUMMA/Cannon cyclic algorithm for MM

Replace

C[i + pb, j + qb] += A[i + pb, k] · B[k, j + qb]

with

C[il + p, jl + q] += A[il + p, k] · B[k, jl + q]

for k = 0 to n − 1 do
  for p = 0 to l − 1 in parallel do
    for q = 0 to l − 1 in parallel do
      broadcast A[:, k]
      broadcast B[k, :]
      for i = 0 to b − 1 do
        for j = 0 to b − 1 do
          C[il + p, jl + q] += A[il + p, k] · B[k, jl + q]
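The only change from the blocked variant is which global indices a processor owns: processor p holds rows p, p + l, p + 2l, and so on. A minimal serial sketch, assuming NumPy and that l divides n (the names `cyclic_rows` and `summa_cyclic` are illustrative):

```python
import numpy as np

def cyclic_rows(p, n, l):
    """Global row (or column) indices i*l + p owned by processor coordinate p."""
    return list(range(p, n, l))

def summa_cyclic(A, B, l):
    """Serially emulate SUMMA on an l-by-l grid with a cyclic layout."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(n):
        for p in range(l):
            for q in range(l):
                rows = slice(p, n, l)  # indices i*l + p
                cols = slice(q, n, l)  # indices j*l + q
                C[rows, cols] += np.outer(A[rows, k], B[k, cols])
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
cyclic_ok = np.allclose(summa_cyclic(A, B, l=4), A @ B)
```

For a full matrix multiplication both layouts do the same work per processor; the cyclic layout matters for factorizations such as LU, where the active part of the matrix shrinks and a blocked layout would leave processors idle.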


SUMMA/Cannon costs

With blocking, on p processors these algorithms move A and B $l = \sqrt{p}$ times. Each process owns $n^2/p$ of the matrices, so the bandwidth cost is

$$ W = O(n^2/\sqrt{p}) $$

and the number of synchronizations necessary is

$$ S = O(\sqrt{p}). $$

For rectangular matrices with dimensions n, m, k, we select an algorithm on an $l_1$-by-$l_2$ process grid that communicates A and B, or A and C, or B and C, for a cost of

$$ W = O\left( \min_{l_1, l_2} \left( nm\,l_1 + mk\,l_2,\; nm\,l_1 + nk\,l_2,\; mk\,l_1 + nk\,l_2 \right) / p \right) $$


3D matrix multiplication

On an l-by-l-by-l processor grid, with b = n/l

broadcast A
broadcast B
for r = 0 to l − 1 in parallel do
  for p = 0 to l − 1 in parallel do
    for q = 0 to l − 1 in parallel do
      for k = 0 to b − 1 do
        for i = 0 to b − 1 do
          for j = 0 to b − 1 do
            C[i + pb, j + qb] += A[i + pb, k + rb] · B[k + rb, j + qb]
reduce C


3D matrix multiplication with block notation

On an l-by-l-by-l processor grid, with b = n/l

broadcast A
broadcast B
for r = 0 to l − 1 in parallel do
  for p = 0 to l − 1 in parallel do
    for q = 0 to l − 1 in parallel do
      C[p, q] += A[p, r] · B[r, q]
reduce C
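In block notation the 3D algorithm is short enough to emulate serially. The sketch below assumes NumPy and that l divides n; each layer r computes one partial product per (p, q), and the final sum over layers plays the role of "reduce C". The function name `mm_3d` is hypothetical.

```python
import numpy as np

def mm_3d(A, B, l):
    """Serially emulate 3D matrix multiplication on an l-by-l-by-l grid."""
    n = A.shape[0]
    b = n // l  # block size; assumes l divides n

    def blk(X, i, j):
        return X[i * b:(i + 1) * b, j * b:(j + 1) * b]

    # partial[r] is the contribution computed by layer r of the processor grid
    partial = np.zeros((l, n, n))
    for r in range(l):
        for p in range(l):
            for q in range(l):
                # processor (p, q, r) computes one block product A[p, r] * B[r, q]
                partial[r, p * b:(p + 1) * b, q * b:(q + 1) * b] = blk(A, p, r) @ blk(B, r, q)
    return partial.sum(axis=0)  # "reduce C" across the r dimension

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
mm3d_ok = np.allclose(mm_3d(A, B, l=2), A @ B)
```

The `partial` array makes the memory overhead visible: l full-size copies of C exist before the reduction.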


3D MM costs

On p processors these algorithms move A, B, and C once. Each process owns $n^2/p^{2/3}$ of the matrices, so the bandwidth cost is

$$ W = O(n^2/p^{2/3}) $$

and the number of synchronizations necessary is

$$ S = O(1). $$

However, the algorithm requires storage (memory usage) of

$$ M = \Omega(n^2/p^{2/3}) $$

which is a factor of $p^{1/3}$ more than minimal.


2.5D matrix multiplication

On an l-by-l-by-c processor grid, with b = n/l

for s = 0 to l/c − 1 do
  broadcast A
  broadcast B
  for r = 0 to c − 1 in parallel do
    for p = 0 to l − 1 in parallel do
      for q = 0 to l − 1 in parallel do
        C[p, q] += A[p, s + r·l/c] · B[s + r·l/c, q]
reduce C
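A serial emulation helps pin down the indexing of the 2.5D variant. The sketch below assumes NumPy, that l divides n, and that c divides l. It also assumes one particular reading of the combined inner index: layer r handles the inner block indices k = r·(l/c) + s for s = 0, ..., l/c − 1. Any assignment that gives each layer l/c distinct inner blocks would serve equally well; the name `mm_25d` is hypothetical.

```python
import numpy as np

def mm_25d(A, B, l, c):
    """Serially emulate 2.5D matrix multiplication on an l-by-l-by-c grid."""
    n = A.shape[0]
    b = n // l       # block size; assumes l divides n
    steps = l // c   # sequential steps per layer; assumes c divides l

    def blk(X, i, j):
        return X[i * b:(i + 1) * b, j * b:(j + 1) * b]

    partial = np.zeros((c, n, n))  # one copy of C per layer
    for s in range(steps):         # sequential loop
        for r in range(c):         # layers work in parallel in the real algorithm
            k = r * steps + s      # assumed mapping of (s, r) to an inner block index
            for p in range(l):
                for q in range(l):
                    partial[r, p * b:(p + 1) * b, q * b:(q + 1) * b] += blk(A, p, k) @ blk(B, k, q)
    return partial.sum(axis=0)     # "reduce C" across the c layers

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
mm25d_ok = np.allclose(mm_25d(A, B, l=4, c=2), A @ B)
```

With c = 1 this degenerates to the 2D algorithm (l sequential steps, one copy of C), and with c = l to the 3D algorithm (one step, l copies).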


2.5D MM costs

On p processors these algorithms move A and B $\sqrt{p/c^3}$ times and C once. Each process owns $c n^2/p$ of the matrices, so the bandwidth cost is

$$ W = O\left( \frac{n^2}{\sqrt{cp}} \right) $$

and the number of synchronizations necessary is

$$ S = O\left( \sqrt{p/c^3} \right), $$

while the memory is now

$$ M = \Omega(c n^2 / p) $$

which is tunable given the architectural constraint, allowing better asymptotic strong scaling.
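The cost expressions can be evaluated directly to see how the replication factor c interpolates between the 2D and 3D algorithms. This is a small sketch with constants dropped (the big-O bounds are treated as exact formulas); the function name `mm_costs` and the sample sizes are illustrative.

```python
import math

def mm_costs(n, p, c=1):
    """Return (W, S, M) for 2.5D matrix multiplication, ignoring constant factors."""
    W = n * n / math.sqrt(c * p)  # words moved per processor
    S = math.sqrt(p / c**3)       # number of synchronizations
    M = c * n * n / p             # words of memory per processor
    return W, S, M

n, p = 4096, 64
costs_2d = mm_costs(n, p, c=1)  # c = 1 recovers the 2D (SUMMA/Cannon) costs
costs_3d = mm_costs(n, p, c=4)  # c = p^(1/3) = 4 recovers the 3D costs
```

For n = 4096 and p = 64, moving from c = 1 to c = 4 halves the bandwidth cost and cuts synchronizations from 8 to 1, at the price of four times the memory per processor.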


3D rectangular matrix multiplication

On an l1-by-l2-by-l3 processor grid, with b = n/l1 = m/l2 = k/l3

broadcast A
broadcast B
for r = 0 to l3 − 1 in parallel do
  for p = 0 to l1 − 1 in parallel do
    for q = 0 to l2 − 1 in parallel do
      C[p, q] += A[p, r] · B[r, q]
reduce C


Blocking matrix multiplication

[Figure: blocked matrix multiplication of A and B.]


2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]

[Figure: blocks of A and B distributed over a 4x4 grid of 16 CPUs.]

  • O(n^3/p) flops
  • O(n^2/√p) words moved
  • O(√p) messages
  • O(n^2/p) bytes of memory


3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89], [McColl and Tiskin 99]

[Figure: blocks of A and B distributed over a 4x4x4 grid of 64 CPUs, with 4 copies of the matrices.]

  • O(n^3/p) flops
  • O(n^2/p^(2/3)) words moved
  • O(1) messages
  • O(n^2/p^(2/3)) bytes of memory


2.5D matrix multiplication [McColl and Tiskin 99]

[Figure: blocks of A and B distributed over a 4x4x2 grid of 32 CPUs, with 2 copies of the matrices.]

  • O(n^3/p) flops
  • O(n^2/√(c·p)) words moved
  • O(√(p/c^3)) messages
  • O(c·n^2/p) bytes of memory


2.5D MM on 65,536 cores

[Figure: "2.5D MM on BG/P (n=65,536)". Percentage of machine peak (0 to 100) versus #nodes (256 to 2048) for 2.5D SUMMA and 2D SUMMA.]


Solutions to linear systems of equations

We want to solve a matrix equation

$$ A \cdot X = B $$

where A and B are known. We can solve it by factorizing A = LU (L lower triangular and U upper triangular) via Gaussian elimination, then computing

$$ X = U^{-1} L^{-1} B $$

via triangular solves (TRSMs). If A is symmetric positive definite, we can use Cholesky factorization. Cholesky and TRSM are no harder than LU.
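The two triangular solves can be written out as forward and back substitution. The sketch below assumes NumPy and takes L and U as given; the tiny factors and the right-hand side are made-up illustrative values, and the helper names are hypothetical.

```python
import numpy as np

def forward_sub(L, B):
    """Solve L Y = B for lower-triangular L by forward substitution."""
    n = L.shape[0]
    Y = np.zeros_like(B, dtype=float)
    for i in range(n):
        Y[i] = (B[i] - L[i, :i] @ Y[:i]) / L[i, i]
    return Y

def back_sub(U, Y):
    """Solve U X = Y for upper-triangular U by back substitution."""
    n = U.shape[0]
    X = np.zeros_like(Y, dtype=float)
    for i in range(n - 1, -1, -1):
        X[i] = (Y[i] - U[i, i + 1:] @ X[i + 1:]) / U[i, i]
    return X

L = np.array([[1.0, 0.0], [0.5, 1.0]])
U = np.array([[4.0, 2.0], [0.0, 3.0]])
A = L @ U                        # A = LU by construction
B = np.array([[2.0], [7.0]])
X = back_sub(U, forward_sub(L, B))  # X = U^{-1} (L^{-1} B)
```

Neither inverse is ever formed explicitly; each solve costs O(n^2) operations per right-hand-side column.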


Non-pivoted LU factorization

for k = 0 to n − 1 do
  U[k, k : n − 1] = A[k, k : n − 1]
  for i = k + 1 to n − 1 do
    L[i, k] = A[i, k]/U[k, k]
    for j = k + 1 to n − 1 do
      A[i, j] −= L[i, k] · U[k, j]

This algorithm has a dependency that requires

k ≤ i, k ≤ j.
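A direct Python transcription of the loop nest, assuming NumPy and a matrix whose leading principal minors are nonzero (otherwise a pivot U[k, k] is zero and the division fails). The unit diagonal of L is an assumption of this sketch, as is the name `lu_nopivot`.

```python
import numpy as np

def lu_nopivot(A):
    """LU factorization without pivoting, following the k, i, j loop nest."""
    A = np.array(A, dtype=float)  # work on a copy
    n = A.shape[0]
    L = np.eye(n)                 # unit lower-triangular factor
    U = np.zeros((n, n))
    for k in range(n):
        U[k, k:] = A[k, k:]
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / U[k, k]
            for j in range(k + 1, n):
                A[i, j] -= L[i, k] * U[k, j]  # Schur complement update
    return L, U

A0 = np.array([[4.0, 3.0], [6.0, 3.0]])
L, U = lu_nopivot(A0)
```

Unlike matrix multiplication, the k loop cannot be reordered freely: iteration k reads entries of A that iterations 0 through k − 1 have already updated.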


Non-pivoted 2D LU factorization

On an l-by-l process grid

Algorithm 1 [L, U] = 2D-LU(A)

for k = 0 to l − 1 do
  Factorize A[k, k] = L[k, k] · U[k, k]
  Broadcast L[k, k] and U[k, k]
  for p = 0 to l − 1 in parallel do
    solve L[p, k] = A[p, k] U[k, k]^{-1}
  for q = 0 to l − 1 in parallel do
    solve U[k, q] = L[k, k]^{-1} A[k, q]
  Broadcast L[p, k] and U[k, q]
  for p = 0 to l − 1 in parallel do
    for q = 0 to l − 1 in parallel do
      A[p, q] −= L[p, k] · U[k, q]
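The block algorithm can be emulated serially to check the structure of each step. This sketch assumes NumPy and that l divides n; it restricts the panel solves and the update to blocks after the diagonal block (p, q > k), which is how a right-looking factorization is usually organized, and it uses `np.linalg.solve` in place of dedicated triangular solves. The names `lu_nopivot` and `block_lu` are hypothetical.

```python
import numpy as np

def lu_nopivot(A):
    """Unblocked LU without pivoting, used on the diagonal blocks."""
    A = A.copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k + 1:, k] = A[k + 1:, k] / A[k, k]
        A[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], A[k, k + 1:])
    return L, np.triu(A)

def block_lu(A, l):
    """Right-looking block LU with an l-by-l block partition (serial emulation)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    b = n // l  # block size; assumes l divides n
    L = np.zeros((n, n))
    U = np.zeros((n, n))
    for k in range(l):
        d = slice(k * b, (k + 1) * b)   # diagonal block k
        rest = slice((k + 1) * b, n)    # trailing blocks
        L[d, d], U[d, d] = lu_nopivot(A[d, d])
        if k < l - 1:
            # L[p, k] = A[p, k] U[k, k]^{-1}, solved as U^T X^T = A^T
            L[rest, d] = np.linalg.solve(U[d, d].T, A[rest, d].T).T
            # U[k, q] = L[k, k]^{-1} A[k, q]
            U[d, rest] = np.linalg.solve(L[d, d], A[d, rest])
            # trailing update A[p, q] -= L[p, k] U[k, q]
            A[rest, rest] -= L[rest, d] @ U[d, rest]
    return L, U

rng = np.random.default_rng(4)
A0 = rng.standard_normal((8, 8)) + 8.0 * np.eye(8)  # shifted so no pivoting is needed
L, U = block_lu(A0, l=4)
```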


3D recursive non-pivoted LU and Cholesky

A 3D recursive algorithm with no pivoting [A. Tiskin 2002]

  • Tiskin gives the algorithm under the BSP model
      • Bulk Synchronous Parallel
      • considers communication and synchronization
  • We give an alternative distributed-memory adaptation and implementation
  • Also, we have a new lower bound for the latency cost


3D non-pivoted LU and Cholesky

On an l-by-l-by-l process grid

for r = 0 to l − 1 do
  [L[r, r], U[r, r]] = 2D-LU(A[r, r])
  Broadcast L[r, r] and U[r, r]
  [L[r + 1 : l − 1, r]] = 2D-TRSM(A[r + 1 : l − 1, r], U[r, r])
  [U[r, r + 1 : l − 1]] = 2D-TRSM(A[r, r + 1 : l − 1], L[r, r])
  for s = 0 to l − 1 in parallel do
    Broadcast L[p, rs] and U[rs, q]
    for p = 0 to l − 1 in parallel do
      for q = 0 to l − 1 in parallel do
        A[p, q] −= L[p, rs] · U[rs, q]


2D blocked LU factorization

[Figure, four animation frames: the matrix A; the factored corner blocks L₀₀ and U₀₀; the L and U panels; the Schur complement update S = A − LU.]


2D block-cyclic decomposition

[Figure: 2D block-cyclic decomposition of a matrix.]


2D block-cyclic LU factorization

[Figure, three animation frames: the block-cyclic matrix; the L and U panels; the Schur complement update S = A − LU.]


2.5D LU factorization

[Figure, three animation frames with steps labeled (A) through (D): blocks L₀₀, U₀₀, L₁₀, L₂₀, L₃₀, U₀₁, U₀₃ and the resulting L and U panels, with L₀₀ and U₀₀ replicated across layers.]


2.5D LU strong scaling (without pivoting)

[Figure: "LU without pivoting on BG/P (n=65,536)". Percentage of machine peak (0 to 100) versus #nodes (256 to 2048) for ideal scaling, 2.5D LU, and 2D LU.]


2.5D LU with pivoting

A = P · L · U, where P is a permutation matrix

  • 2.5D generic pairwise elimination (neighbor/pairwise pivoting or Givens rotations (QR)) [A. Tiskin 2007]
      • pairwise pivoting does not produce an explicit L
      • pairwise pivoting may have stability issues for large matrices
  • Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly
      • pass up rows of A instead of U to avoid error accumulation


Tournament pivoting

Partial pivoting is not communication-optimal on a blocked matrix

  • requires a message/synchronization for each column
  • O(n) messages needed

Tournament pivoting is communication-optimal

  • performs a tournament to determine the best pivot row candidates
  • passes up the 'best rows' of A
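The tournament can be sketched serially for a single n-by-b panel. This is an illustration under several assumptions: NumPy, a binary reduction tree, and partial-pivoted Gaussian elimination as the local selection rule at every tree node. The helper names `gepp_pivot_rows` and `tournament_pivot` are hypothetical. Note that each round re-reads the original rows of the panel (rows of A), not the partially eliminated ones.

```python
import numpy as np

def gepp_pivot_rows(block, b):
    """Local indices of the b pivot rows that partial pivoting picks in `block`."""
    W = block.astype(float).copy()
    idx = np.arange(W.shape[0])
    for k in range(b):
        piv = k + np.argmax(np.abs(W[k:, k]))  # largest entry in column k
        W[[k, piv]] = W[[piv, k]]              # swap rows k and piv
        idx[[k, piv]] = idx[[piv, k]]
        W[k + 1:, k:] -= np.outer(W[k + 1:, k] / W[k, k], W[k, k:])
    return idx[:b]

def tournament_pivot(panel, b, nblocks):
    """Select b pivot rows of an n-by-b panel with a binary-tree tournament."""
    # leaves: every row block nominates its own b best rows
    candidates = [rows[gepp_pivot_rows(panel[rows], b)]
                  for rows in np.array_split(np.arange(panel.shape[0]), nblocks)]
    # internal nodes: pairs of candidate sets play off using the original rows of A
    while len(candidates) > 1:
        winners = []
        for i in range(0, len(candidates), 2):
            rows = np.concatenate(candidates[i:i + 2])
            winners.append(rows[gepp_pivot_rows(panel[rows], b)])
        candidates = winners
    return candidates[0]

rng = np.random.default_rng(5)
panel = rng.standard_normal((32, 4))
piv = tournament_pivot(panel, b=4, nblocks=4)
```

With P row blocks the tree has about log2(P) levels, so selecting all b pivots of a panel takes O(log P) messages instead of one reduction per column.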


2.5D LU factorization with tournament pivoting

[Figure, four animation frames: a tournament of PLU factorizations over the row blocks PA₀ through PA₃; the L and U panels (L₁₀ through L₄₀, U₀₁ through U₀₃) computed from the selected pivot rows; the trailing-matrix updates; the factored corner block L₀₀U₀₀ replicated across layers.]


2.5D LU on 65,536 cores

[Figure: "LU on 16,384 nodes of BG/P (n=131,072)". Time (sec, 0 to 100), split into compute, idle, and communication, for NO-pivot 2D, NO-pivot 2.5D, CA-pivot 2D, and CA-pivot 2.5D; two comparisons are annotated "2X faster".]


Symmetric eigensolve via QR

To solve the symmetric eigenproblem on matrix A, we need to diagonalize

$$ A = U D U^T $$

where the columns of U are the eigenvectors and D holds the eigenvalues. This can be done by a series of two-sided orthogonal transformations

$$ A = U_1 U_2 \cdots U_k \, D \, U_k^T \cdots U_2^T U_1^T. $$

The process may be reduced to three stages: a QR factorization reducing to banded form, a reduction from banded to tridiagonal, and a tridiagonal eigensolve. We consider the QR, which is the most expensive step.
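The target decomposition can be checked numerically on a small example. The sketch below uses NumPy's `numpy.linalg.eigh`, which calls a LAPACK symmetric eigensolver; it only illustrates the factorization A = U D U^T and is not the three-stage parallel scheme described here. The random test matrix is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                    # symmetrize to get a valid test matrix

d, U = np.linalg.eigh(A)             # eigenvalues (ascending) and orthonormal eigenvectors
reconstructed = U @ np.diag(d) @ U.T # U D U^T
orthogonality_error = np.linalg.norm(U.T @ U - np.eye(6))
```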


3D QR factorization

A = Q · R, where Q is orthogonal and R is upper-triangular

  • 3D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]
  • Tiskin minimizes latency and bandwidth by working on slanted panels
  • 3D QR cannot be done with right-looking updates as in 2.5D LU, due to non-commutativity of orthogonalization updates


3D QR factorization using the YT representation

The YT representation of Householder QR factorization is more work-efficient when computing only R

  • We give an algorithm that performs 2.5D QR using the YT representation
  • The algorithm performs left-looking updates on Y
  • Householder with YT needs less computation (roughly 2x) than Givens rotations


3D QR using YT representation

[Figure: 3D QR using the YT representation.]


Latency-optimal 2.5D QR

To reduce latency, we can employ the TSQR algorithm

1. Given an n-by-b panel, partition it into 2b-by-b blocks
2. Perform QR on each 2b-by-b block
3. Stack the computed Rs into an n/2-by-b panel and recurse
4. Q is given in a hierarchical representation

Need YT representation from hierarchical Q...
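Steps 1 through 3 can be sketched in a few lines when only the R factor is needed. The sketch assumes NumPy and uses `numpy.linalg.qr` with `mode="r"` for the local factorizations; it discards the local Q factors, so it does not build the hierarchical Q of step 4. The function name `tsqr_r` is hypothetical.

```python
import numpy as np

def tsqr_r(panel, b):
    """R factor of an n-by-b panel computed with a TSQR reduction tree."""
    n = panel.shape[0]
    if n <= 2 * b:
        return np.linalg.qr(panel, mode="r")
    # steps 1 and 2: QR on each 2b-by-b block (independent, so parallel in practice)
    Rs = [np.linalg.qr(panel[i:i + 2 * b], mode="r") for i in range(0, n, 2 * b)]
    # step 3: stack the b-by-b R factors into an (n/2)-by-b panel and recurse
    return tsqr_r(np.vstack(Rs), b)

rng = np.random.default_rng(7)
panel = rng.standard_normal((64, 4))
R = tsqr_r(panel, b=4)
```

The result can differ from a direct QR of the panel by the signs of its rows, so the check below compares R^T R with A^T A rather than comparing R factors entry by entry.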


YT reconstruction

Yamamoto et al.

I Take Y to be the first b columns of Q minus the identity

I Define T = (I − Q1)−1

I Sacrifices triangular structure of T and Y .

Our first attempt

LU(R−A) = LU(R−(I−YTY T )R) = LU(YTY TR) = (Y )·(TY TR)

was unstable due to being dependent on the condition number ofR. However, performing LU on Yamamoto’s T seems to be stable,

LU(I−Q1) = LU(I−(I−Y1TYT1 )) = LU(Y1TY

T1 ) = (Y1)·(TY T

1 )

and should yield triangular Y and T .
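The LU-based reconstruction can be checked numerically with a small NumPy sketch. The helper names `lu_nopivot` and `reconstruct_yt` are hypothetical, and the sketch assumes Q is the thin factor from a Householder QR (as `np.linalg.qr` returns), so that $I - Q_1$ has an unpivoted LU factorization. It says nothing about stability for general inputs.

```python
import numpy as np

def lu_nopivot(M):
    """Unpivoted LU: M = L @ U with unit lower-triangular L."""
    M = M.astype(float).copy()
    b = M.shape[0]
    L = np.eye(b)
    for k in range(b - 1):
        L[k + 1:, k] = M[k + 1:, k] / M[k, k]
        M[k + 1:, k:] -= np.outer(L[k + 1:, k], M[k, k:])
    return L, np.triu(M)

def reconstruct_yt(Q):
    """Recover Y, T such that the first b columns of I - Y T Y^T equal Q (m-by-b)."""
    m, b = Q.shape
    E = np.eye(m, b)                              # first b columns of the identity
    Y1, U = lu_nopivot(np.eye(b) - Q[:b])         # I - Q1 = Y1 * (T Y1^T)
    Y = np.linalg.solve(U.T, (E - Q).T).T         # Y = (E - Q) U^{-1}
    T = np.linalg.solve(Y1, U.T).T                # T = U Y1^{-T}
    return Y, T

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((12, 4)))
Y, T = reconstruct_yt(Q)
H = np.eye(12) - Y @ T @ Y.T                      # full orthogonal factor
```

In this sketch the top block of Y comes out unit lower triangular and T comes out upper triangular, which is the structure the slide expects the LU step to restore.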


3D algorithms on BG/Q

[Figure: BG/Q matrix multiplication. Teraflops (axis range 8 to 2048) versus number of nodes (256 to 16384), comparing Cyclops TF and ScaLAPACK.]


3D algorithms for DFT

3D matrix multiplication is integrated into QBox.

I QBox is a DFT code developed by Erik Draeger et al.

I Depending on the system/functional, it can spend as much as 80% of its time in matrix multiplication (MM)

I Running on most of Sequoia and getting a significant speed-up from 3D

I 1.75X speed-up on 8192 nodes for 1792 gold atoms, 31 electrons/atom

I Eventually hope to build and integrate a 3D eigensolver into QBox


Coupled Cluster (CC)

Coupled Cluster is a method for electronic structure calculations of strongly-correlated systems. CC rewrites the wave-function $|\Psi\rangle$ as the exponential of an excitation operator $T$ applied to the Slater determinant $|\Psi_0\rangle$

$$|\Psi\rangle = e^{T} |\Psi_0\rangle$$

where $T$ is a sum of $T_n$ (the n'th excitation operators)

$$T_{\mathrm{CCSD}} = T_1 + T_2$$

$$T_{\mathrm{CCSDT}} = T_1 + T_2 + T_3$$

$$T_{\mathrm{CCSDTQ}} = T_1 + T_2 + T_3 + T_4$$


Coupled Cluster (CC)

The CC amplitudes $T_n$ can be solved via the coupled equations

$$\langle \Psi^{ab\ldots}_{ij\ldots} | e^{-T} H e^{T} | \Psi_0 \rangle = 0$$

where we expand out the excitation operator

$$e^{T} = 1 + T + \frac{T^2}{2!} + \ldots$$

By Wick's theorem only fully contracted terms of the expansion will be non-zero, and diagrammatic or algebraic derivations yield many terms such as

$$\sum_{klcd} \langle kl \| cd \rangle \, T^{c}_{k} \, T^{a}_{l} \, T^{db}_{ij}$$

which can be factorized into two-term contractions.
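The factorization into two-term contractions can be illustrated with NumPy `einsum`. The array names, the index layout (`V[k,l,c,d]`, `T1[c,k]`, `T2[d,b,i,j]`), and the small dimensions are assumptions chosen for the sketch, and the order of the pairwise contractions is one possible choice rather than the one any particular code uses.

```python
import numpy as np

rng = np.random.default_rng(0)
no, nv = 3, 4                                   # occupied / virtual orbital counts (illustrative)
V = rng.standard_normal((no, no, nv, nv))       # <kl||cd>, indexed [k, l, c, d]
T1 = rng.standard_normal((nv, no))              # T^c_k, indexed [c, k]
T2 = rng.standard_normal((nv, nv, no, no))      # T^{db}_{ij}, indexed [d, b, i, j]

# All four tensors contracted in a single multi-term sum.
direct = np.einsum("klcd,ck,al,dbij->abij", V, T1, T1, T2)

# The same term as a sequence of two-tensor contractions.
X = np.einsum("klcd,ck->ld", V, T1)             # sum over k, c
Z = np.einsum("ld,al->ad", X, T1)               # sum over l
factored = np.einsum("ad,dbij->abij", Z, T2)    # sum over d
```

Each intermediate (`X`, `Z`) is much smaller than the operands, which is why the pairwise form is cheaper than the single nested sum.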


A parallel algorithm for tensors of any dimension

Dense tensor contractions can be reduced to matrix multiplication

I tensors must be transposed (indices must be reordered)

I parallelized via 2D/3D algorithms

Alternatively, we can keep tensors in a high-dimensional layout and perform recursive SUMMA

I replicate along any dimension for 3D

I SUMMA along each dimension where indices are mismatched.
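The first option, reducing a dense contraction to matrix multiplication by reordering indices, can be sketched as follows. The function name `contract_via_matmul` is hypothetical, and the contraction used is the 4D example $C[r,s,t,u] = \sum_{p,q} A[p,s,q,u] \cdot B[r,p,t,q]$.

```python
import numpy as np

def contract_via_matmul(A, B):
    """C[r,s,t,u] = sum_{p,q} A[p,s,q,u] * B[r,p,t,q] using one matrix multiplication."""
    n_p, n_s, n_q, n_u = A.shape
    n_r, _, n_t, _ = B.shape
    # Transpose so contracted indices (p, q) are adjacent, then flatten to matrices.
    A_mat = A.transpose(1, 3, 0, 2).reshape(n_s * n_u, n_p * n_q)   # rows (s,u), cols (p,q)
    B_mat = B.transpose(1, 3, 0, 2).reshape(n_p * n_q, n_r * n_t)   # rows (p,q), cols (r,t)
    C_mat = A_mat @ B_mat                                           # rows (s,u), cols (r,t)
    # Unflatten and reorder the result to [r, s, t, u].
    return C_mat.reshape(n_s, n_u, n_r, n_t).transpose(2, 0, 3, 1)
```

The two transposes before the multiply and the one after it are the index reorderings the slide refers to; in a distributed setting they become data redistribution.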


4D tensor contraction

On an l-by-l-by-l-by-l processor grid, with b = n/l

for p = 0 to n − 1 do
  for q = 0 to n − 1 do
    for r = 0 to l − 1 in parallel do
      for s = 0 to l − 1 in parallel do
        broadcast A[p, :, :, :]
        broadcast B[:, p, :, :]
        for t = 0 to l − 1 in parallel do
          for u = 0 to l − 1 in parallel do
            broadcast A[:, :, q, :]
            broadcast B[:, :, :, q]
            C[r, s, t, u] += A[p, s, q, u] · B[r, p, t, q]
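A serial emulation of the loop nest above shows why it computes the full contraction: each (p, q) pair contributes one outer product of a slice of A with a slice of B. This sketch assumes all four dimensions have size n, replaces the broadcasts with plain slicing, and uses the hypothetical name `contract_by_slices`.

```python
import numpy as np

def contract_by_slices(A, B):
    """Accumulate C[r,s,t,u] = sum_{p,q} A[p,s,q,u] * B[r,p,t,q] one (p, q) pair at a time."""
    n = A.shape[0]
    C = np.zeros((n, n, n, n))
    for p in range(n):
        for q in range(n):
            a_slice = A[p, :, q, :]      # what would be broadcast from A, indexed [s, u]
            b_slice = B[:, p, :, q]      # what would be broadcast from B, indexed [r, t]
            C += np.einsum("su,rt->rstu", a_slice, b_slice)
    return C
```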


Tensor symmetry

Most physical tensors of interest have symmetric indices

$$W^{ab} = W^{ba}$$

or skew-symmetric indices

$$W^{ab} = -W^{ba}.$$

Multi-index symmetries and partial index symmetries also arise, e.g.

$$W^{ab}_{ijkl}$$

where (a, b) are permutationally symmetric and (i, j, k, l) are permutationally symmetric. Symmetry is a vital computational consideration, since it can save computation and memory by a factor that scales as d!, where d is the number of symmetric indices.
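The d! scaling can be made concrete by counting unique entries. A fully symmetric order-d tensor with n values per index has one unique entry per non-decreasing index tuple, which is the binomial coefficient C(n + d - 1, d). The function names below are hypothetical.

```python
from math import comb, factorial

def unique_entries(n, d):
    """Number of index tuples i1 <= i2 <= ... <= id, each index in range(n)."""
    return comb(n + d - 1, d)

def savings(n, d):
    """Ratio of dense storage n**d to packed symmetric storage."""
    return n**d / unique_entries(n, d)

# The ratio approaches d! from below as n grows.
for n in (10, 100, 1000):
    print(n, round(savings(n, 4), 2), "limit:", factorial(4))
```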


Symmetric contractions

Consider the contraction

$$C^{ab}_{ij} = \sum_c A^{ac}_{ij} \cdot B_{cb},$$

if A and C both have (i, j) symmetry (symmetry preserved), compute

$$C^{ab}_{i<j} = \sum_c A^{ac}_{i<j} \cdot B_{cb}$$

if B is symmetric in (c, b) (broken symmetry), compute

$$C^{ab}_{ij} = \sum_c \left( A^{ac}_{ij} \cdot B_{c<b} + A^{ac}_{ij} \cdot B_{b \le c} \right)$$

if C is skew-symmetric in (a, b) (broken symmetry), symmetrize

$$C^{a<b}_{ij} = \sum_c \left( A^{ac}_{ij} \cdot B_{cb} - A^{bc}_{ij} \cdot B_{ca} \right)$$
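The skew-symmetric case can be sketched in NumPy: form the unsymmetrized product, subtract its (a, b) transpose, and keep only the a < b entries. The function names and the packed storage format (a list of a < b pairs) are assumptions for illustration, not the layout used by any particular library.

```python
import numpy as np

def skew_contract_packed(A, B):
    """Return the a<b entries of C[a,b,i,j] = sum_c A[a,c,i,j] B[c,b] - A[b,c,i,j] B[c,a]."""
    D = np.einsum("acij,cb->abij", A, B)           # unsymmetrized product
    C = D - D.transpose(1, 0, 2, 3)                # subtract the term with a and b swapped
    rows, cols = np.triu_indices(A.shape[0], k=1)  # all pairs with a < b
    return C[rows, cols], (rows, cols)

def unpack(packed, index, nv):
    """Rebuild the full skew-symmetric tensor from its a<b entries."""
    rows, cols = index
    full = np.zeros((nv, nv) + packed.shape[1:])
    full[rows, cols] = packed
    full[cols, rows] = -packed
    return full
```

Only nv(nv - 1)/2 of the nv^2 (a, b) pairs are stored; the diagonal is zero and the lower triangle follows by sign flip.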


Cyclops Tensor Framework (CTF)

Big idea:

I decompose tensors cyclically among processors

I pick cyclic phase to preserve partial/full symmetric structure

Interface:

C["abij"] += A["acij"] · B["cb"]

with symmetries pre-specified for A, B, and C .
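The index-string style of the interface maps directly onto an `einsum` specification. The helper below is a NumPy analogue written for illustration only: `contract` is a hypothetical name, it is not the CTF API, and it ignores symmetry and distribution entirely.

```python
import numpy as np

def contract(C, c_idx, A, a_idx, B, b_idx):
    """Accumulate C[c_idx] += A[a_idx] * B[b_idx], summing over indices absent from c_idx."""
    C += np.einsum(f"{a_idx},{b_idx}->{c_idx}", A, B)
    return C

rng = np.random.default_rng(4)
nv, no = 4, 3
A = rng.standard_normal((nv, nv, no, no))
B = rng.standard_normal((nv, nv))
C = np.zeros((nv, nv, no, no))
contract(C, "abij", A, "acij", B, "cb")   # mirrors C["abij"] += A["acij"] * B["cb"]
```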


NWChem blocked approach

for k = 0 to n − 1 do
  for p = 0 to l − 1 in parallel do
    for q = 0 to p − 1 in parallel do
      broadcast A[:, k]
      broadcast B[k, :]
      for i = 0 to b − 1 do
        for j = 0 to b − 1 do
          C[i + pb, j + qb] += A[i + pb, k] · B[k, j + qb]


Cyclops TF cyclic approach

for k = 0 to n − 1 do
  for p = 0 to l − 1 in parallel do
    for q = 0 to l − 1 in parallel do
      broadcast A[:, k]
      broadcast B[k, :]
      for i = 0 to b − 1 do
        for j = 0 to i − 1 do
          C[il + p, jl + q] += A[il + p, k] · B[k, jl + q]
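The difference between the blocked and cyclic approaches shows up in how many entries of a symmetric result each processor owns. The sketch below counts strictly lower-triangular entries per processor on an l-by-l grid; the function name `owned_counts` is hypothetical and n is assumed to be divisible by l.

```python
import numpy as np

def owned_counts(n, l, layout):
    """Count entries with row > col owned by each processor of an l-by-l grid."""
    b = n // l
    counts = np.zeros((l, l), dtype=int)
    for row in range(n):
        for col in range(row):
            if layout == "blocked":
                p, q = row // b, col // b      # contiguous b-by-b blocks
            else:
                p, q = row % l, col % l        # cyclic: stride l in each dimension
            counts[p, q] += 1
    return counts

blocked = owned_counts(16, 4, "blocked")
cyclic = owned_counts(16, 4, "cyclic")
```

For n = 16 and l = 4, the blocked layout leaves the processors above the diagonal with no work while those below hold 16 entries each, whereas the cyclic layout gives every processor between 6 and 10 entries.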


Blocked vs block-cyclic vs cyclic decompositions

[Figure: three panels showing a blocked layout, a block-cyclic layout, and a cyclic layout. Red denotes padding / load imbalance; green denotes fill (unique values).]


3D tensor contraction


3D tensor cyclic decomposition


3D tensor mapping


A cyclic layout is still challenging

I In order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase

I The contracted dimensions of A and B must be mapped with the same phase

I And yet the virtual mapping needs to be mapped to a physical topology, which can be any shape


Virtual processor grid dimensions

I Our virtual cyclic topology is somewhat restrictive and the physical topology is very restricted

I Virtual processor grid dimensions serve as a new level of indirection

I If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension

I Allows the physical processor grid to be 'stretchable'


Virtual processor grid construction

Matrix multiply on a 2x3 processor grid. Red lines represent the virtualized part of the processor grid. Elements are assigned to blocks by cyclic phase.

[Figure: A X B = C, with each matrix overlaid on the virtualized processor grid.]
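One way to picture the virtualization for the 2x3 example is sketched below. It assumes the common cyclic phase required in both dimensions is the least common multiple of the physical grid dimensions, and that virtual processors are folded cyclically onto physical ones. Both are assumptions made for illustration, the function names are hypothetical, and `math.lcm` needs Python 3.9 or later.

```python
from math import lcm

def virtual_grid(phys_rows, phys_cols):
    """Return the common cyclic phase and the virtual blocks held per physical processor."""
    phase = lcm(phys_rows, phys_cols)
    return phase, (phase // phys_rows, phase // phys_cols)

def owner(i, j, phys_rows, phys_cols):
    """Map matrix element (i, j) to its virtual and physical processor coordinates."""
    phase, _ = virtual_grid(phys_rows, phys_cols)
    vi, vj = i % phase, j % phase                  # virtual processor, by cyclic phase
    return (vi, vj), (vi % phys_rows, vj % phys_cols)

phase, per_proc = virtual_grid(2, 3)   # 6-by-6 virtual grid, 3-by-2 virtual blocks per processor
```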


3D algorithms for tensors

We incorporate data replication for communication minimization into CTF

I Replicate only one tensor/matrix (minimize bandwidth but not latency)

I Autotune over mappings to all possible physical topologies

I Select the mapping with the least amount of communication

I Achieve minimal communication for tensors of widely different sizes
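The idea of selecting a mapping by communication cost can be sketched with a toy model. The code enumerates 2D processor grids and scores each with the standard bandwidth estimate for SUMMA with a stationary C, k·m/pr + k·n/pc words per processor. The function name, the restriction to 2D grids without replication, and the cost model are assumptions for illustration and are not CTF's actual autotuner.

```python
def best_grid(m, n, k, nprocs):
    """Pick the pr-by-pc grid (pr * pc == nprocs) minimizing words received per processor."""
    best = None
    for pr in range(1, nprocs + 1):
        if nprocs % pr:
            continue
        pc = nprocs // pr
        # Each processor receives an (m/pr)-long piece of A and an (n/pc)-long piece of B, k times.
        words = k * m / pr + k * n / pc
        if best is None or words < best[2]:
            best = (pr, pc, words)
    return best
```

Under this model a square multiplication on 64 processors prefers an 8-by-8 grid, while a tall-and-skinny one (m much larger than n) prefers a 64-by-1 grid, which is the sense in which one mapping cannot be best for tensors of widely different sizes.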


Preliminary Coupled Cluster results on Blue Gene/Q

A Coupled Cluster with Double excitations (CCD) implementation is up and running

I Already scaled on up to 1024 nodes of BG/Q, up to 480 virtual orbitals

I Preliminary results already show favorable performance with respect to NWChem

I Spending 30-40% of time in DGEMM, with good strong and weak scalability


Future Work

3D eigensolver

I Working on formalization and error proof of YT reconstruction

I Plan to implement 3D QR and a 3D symmetric eigensolver

I Integrate with QBox

Cyclops Tensor Framework

I Implement CCSD, CCSD(T), CCSDT, CCSDTQ

I Merge with SCF and eigensolver codes

I Sparse tensors

I Consider multi-term factorization/other tensor computations


Rectangular collectives

Backup slides


Performance of multicast (BG/P vs Cray)

[Figure: 1 MB multicast on BG/P, Cray XT5, and Cray XE6. Bandwidth in MB/sec (axis range 128 to 8192) versus number of nodes (8 to 4096), with one curve each for BG/P, XE6, and XT5.]


Why the performance discrepancy in multicasts?

I Cray machines use binomial multicasts
  I Form spanning tree from a list of nodes
  I Route copies of message down each branch
  I Network contention degrades utilization on a 3D torus

I BG/P uses rectangular multicasts
  I Require network topology to be a k-ary n-cube
  I Form 2n edge-disjoint spanning trees
    I Route in different dimensional order
    I Use both directions of bidirectional network
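A generic binomial-tree multicast schedule can be sketched as below to show where the spanning tree comes from. It is a textbook construction over a flat list of ranks with rank 0 as the root; the function name is hypothetical and it is not a description of any vendor's implementation. Because the tree is built from the rank list rather than the network topology, several of its edges can share the same physical links on a torus.

```python
def binomial_schedule(nprocs):
    """Return rounds of (sender, receiver) pairs for a multicast rooted at rank 0."""
    rounds = []
    have = 1                      # ranks 0 .. have-1 already hold the message
    while have < nprocs:
        # Every rank that holds the message forwards it to the rank `have` positions away.
        rounds.append([(s, s + have) for s in range(have) if s + have < nprocs])
        have *= 2
    return rounds
```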
