Communication costs of LU decomposition algorithms for banded matrices

1

Communication costs of LU decomposition algorithms for banded matrices

Razvan Carbunescu

12/02/2011

2

Outline (1/2)

• Sequential general LU factorization (GETRF) and Lower Bounds
  • Definitions and Lower Bounds
  • LAPACK algorithm
  • Communication cost
  • Summary

• Sequential banded LU factorization (GBTRF) and Lower Bounds
  • Definitions and Lower Bounds
  • Banded format
  • LAPACK algorithm
  • Communication cost
  • Summary

• Sequential LU Summary

3

Outline (2/2)

• Parallel LU definitions and Lower bounds

• Parallel Cholesky algorithms (Saad, Schultz ‘85)

• SPIKE Cholesky algorithm (Sameh ’85)

• Parallel banded LU factorization (PGBTRF)
  • ScaLAPACK algorithm
  • Communication cost
  • Summary

• Parallel banded LU and Cholesky Summary

• Future Work

• General Summary

4

GETRF – Definitions and Lower Bounds

• Variables:

n - size of the matrix

r - block size (panel width)

i - current panel number

M - size of fast memory

• fits into pattern of 3-nested loops and has usual lower bounds:
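As a sketch of what these usual bounds look like, assuming the standard result for algorithms built from three nested loops (communication is bounded below by the flop count divided by a power of M) and the Θ(n^3) flop count of dense LU. The symbols W (words moved between fast and slow memory) and S (messages) are introduced here for illustration and are not taken from the slides.

```latex
W = \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right),
\qquad
S = \Omega\!\left(\frac{n^{3}}{M^{3/2}}\right)
```

The message bound follows from the word bound because a single message can carry at most M words.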

5

GETRF - Communication assumptions

• BLAS2 LU on (m x n) matrix takes

• TRSM on (n x m) with LL (n x n) takes

• GEMM in (m x n) - (m x k) (k x n) takes

[Figure: operand shapes for the three kernels: panel factorization P = L U, triangular solve U = LL^-1 A, and the update A = A - L U]

6

GETRF – LAPACK algorithm

• For each panel block:

1) Factorize panel (n x r)
2) Permute matrix
3) Compute U update (TRSM) of size r x (n-ir) with LL of size r x r
4) Compute GEMM update of size:
   (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))
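The four steps above can be sketched in plain Python as a right-looking blocked LU with partial pivoting. This is an illustrative sketch, not the LAPACK source: the function name `blocked_lu` is hypothetical, the matrix is a list of rows, and row swaps are applied to the whole matrix as each pivot is found rather than in a separate pass after the panel.

```python
def blocked_lu(a, r):
    """Blocked LU with partial pivoting on a square list-of-rows matrix.

    Returns (lu, perm): lu holds unit-lower L below the diagonal and U on
    and above it; row i of lu corresponds to row perm[i] of the input.
    """
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k0 in range(0, n, r):
        k1 = min(k0 + r, n)
        # 1) Factorize the panel lu[k0:n][k0:k1] column by column (BLAS2 style).
        for k in range(k0, k1):
            p = max(range(k, n), key=lambda i: abs(lu[i][k]))
            if lu[p][k] == 0.0:
                raise ValueError("matrix is singular")
            # 2) Permute: swapping full rows applies the pivot to the whole matrix.
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                lu[i][k] /= lu[k][k]
                for j in range(k + 1, k1):
                    lu[i][j] -= lu[i][k] * lu[k][j]
        # 3) TRSM: U12 = L11^-1 * A12, with L11 the r x r unit-lower block.
        for j in range(k1, n):
            for i in range(k0, k1):
                for t in range(k0, i):
                    lu[i][j] -= lu[i][t] * lu[t][j]
        # 4) GEMM: A22 = A22 - L21 * U12 on the trailing block.
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for t in range(k0, k1):
                    s += lu[i][t] * lu[t][j]
                lu[i][j] -= s
    return lu, perm
```

Calling `blocked_lu(a, r)` with different panel widths r gives the same factors up to rounding; only the grouping of the work into panel, TRSM and GEMM steps changes, which is what the communication analysis counts.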

7

GETRF – LAPACK algorithm (1/4)

• Factorize panel P

Words:

Total words :

[Figure: panel P of size (n-(i-1)r) x r factored into L and an r x r U]

8

GETRF – LAPACK algorithm (2/4)

• Permute matrix with pivot information from panel

Words:

Total words :

9

GETRF – LAPACK algorithm (3/4)

• Compute U update (TRSM) of size r x (n-ir) with LL of size r x r

Words:

Total words :

[Figure: U of size r x (n-ir) obtained from the r x r LL^-1 applied to A of size r x (n-ir)]

10

GETRF – LAPACK algorithm (4/4)

• Compute GEMM update of size (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))

Words:

Total words :

[Figure: trailing matrix A of size (n-ir) x (n-ir) updated with L of size (n-ir) x r and U of size r x (n-ir)]

11

GETRF – Communication cost

• Communication cost

• Simplified in the big O notation we get:

12

GETRF - General LU Summary

• General LU lower bounds are:

• LAPACK LU algorithm gives:

13

GBTRF - Banded LU factorization

• Variables:

n - size of the matrix

b - matrix bandwidth

r - block size (panel width)

M - size of fast memory

• Also fits into 3-nested loops lower bounds:
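As a sketch, assuming the same three-nested-loop result and a flop count proportional to n b^2 for banded LU with b much smaller than n, the bounds would take the following form. W (words moved) and S (messages) are symbols introduced here for illustration, not notation from the slides.

```latex
W = \Omega\!\left(\frac{n\,b^{2}}{\sqrt{M}}\right),
\qquad
S = \Omega\!\left(\frac{n\,b^{2}}{M^{3/2}}\right)
```

When the band is narrow enough that b^2 is below M, these expressions fall under the trivial cost of reading the roughly n b stored entries once, so the input size becomes the binding bound on words.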

14

Banded Format

• GBTRF uses a special “banded format”

• Packed data format that stores mostly data and very few zeros

• columns map to columns; diagonals map to rows

• easy to retrieve a square block from original A by using a leading dimension of lda – 1
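A minimal sketch of this mapping, assuming the LAPACK general-band convention with kl sub-diagonals and ku super-diagonals (the slides use a single bandwidth b), 0-based indices, and kl extra rows at the top reserved for fill-in from pivoting. The helper names are hypothetical. The last function shows why a stride of lda - 1 recovers dense-style addressing in the column-major array.

```python
def dense_to_band(a, kl, ku):
    """Pack a dense n x n band matrix into a (2*kl + ku + 1) x n array.

    Entry A[i][j] is stored at ab[kl + ku + i - j][j]: columns map to
    columns and each diagonal of A becomes one row of ab.
    """
    n = len(a)
    ldab = 2 * kl + ku + 1
    ab = [[0.0] * n for _ in range(ldab)]
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[kl + ku + i - j][j] = a[i][j]
    return ab


def flatten_column_major(ab):
    """Lay the band array out as Fortran would: one column after another."""
    ldab, n = len(ab), len(ab[0])
    return [ab[row][col] for col in range(n) for row in range(ldab)]


def band_get(ab_flat, ldab, kl, ku, i, j):
    """Read A[i][j] using a leading dimension of ldab - 1.

    The column-major offset (kl + ku + i - j) + j * ldab simplifies to
    (kl + ku) + i + j * (ldab - 1), i.e. dense indexing with stride ldab - 1.
    """
    return ab_flat[(kl + ku) + i + j * (ldab - 1)]
```

With kl = ku = b the array has 3b + 1 rows, which matches the 3b storage mentioned for the sequential layout.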

15

Banded Format

[Figure: conceptual band matrix and its actual packed storage]

• Because of the format, the updates of U and of the Schur complement are split into multiple stages for the parts of the band matrix near the edges of the storage array

16

GBTRF Algorithm

• For each panel block

1) Factorize panel of size b x r
2) Permute rest of matrix affected by panel
3) Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)
4) Compute U update (TRSM) of size r x r with LL of size (r x r)
5) Compute 4 GEMM updates of sizes:
   (b-2r) x (b-2r) + ((b-2r) x r) * (r x (b-2r))
   (b-2r) x r + ((b-2r) x r) * (r x r)
   r x (b-2r) + (r x r) * (r x (b-2r))
   r x r + (r x r) * (r x r)
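As a small worked check of the sizes listed in step 5, the sketch below enumerates the four GEMM shapes and confirms that together they tile one (b-r) x (b-r) block, so their combined flop count equals a single rank-r update of that size. The function names are hypothetical, and the 2mnk flop count per GEMM is the usual convention, assumed here.

```python
def gbtrf_gemm_shapes(b, r):
    """Return (m, n, k) for the four GEMM updates of one panel step."""
    if not 0 < 2 * r < b:
        raise ValueError("this split assumes 0 < 2r < b")
    w = b - 2 * r
    return [(w, w, r), (w, r, r), (r, w, r), (r, r, r)]


def gemm_flops(m, n, k):
    """Multiply-add count of an (m x n) += (m x k) * (k x n) update."""
    return 2 * m * n * k


def total_panel_gemm_flops(b, r):
    """Sum the flops of the four updates for bandwidth b and panel width r."""
    return sum(gemm_flops(m, n, k) for m, n, k in gbtrf_gemm_shapes(b, r))
```

For example, with b = 10 and r = 2 the four updates cost 144 + 48 + 48 + 16 = 256 flops, the same as 2 * r * (b - r)^2. The split changes how the work is staged through the storage array, not how much arithmetic is done.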

17

GBTRF – LAPACK algorithm (1/8)

• Factorize panel P

Words:

Total words :

[Figure: panel P of size b x r]

18

GBTRF – LAPACK algorithm (2/8)

• Apply permutations

Words:

Total words :

19

GBTRF – LAPACK algorithm (3/8)

• Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)

Words:

Total words :

[Figure: TRSM operands, (b-2r) x r blocks and the r x r LL^-1]

20

GBTRF – LAPACK algorithm (4/8)

• Compute U update (TRSM) of size r x r with LL of size (r x r)

Words:

Total words :

[Figure: TRSM operands, r x r blocks and the r x r LL^-1]

21

GBTRF – LAPACK algorithm (5/8)

• Compute GEMM update of size (b-2r) x (b-2r) + ((b-2r) x r)*(r x (b-2r))

Words:

Total words :

[Figure: GEMM operands with dimensions b-2r and r]

22

GBTRF – LAPACK algorithm (6/8)

• Compute GEMM update of size (b-2r) x r + ((b-2r) x r)*(r x r)

Words:

Total words :

[Figure: GEMM operands with dimensions b-2r and r]

23

GBTRF – LAPACK algorithm (7/8)

• Compute GEMM update of size r x (b-2r) + (r x r)*(r x (b-2r))

Words:

Total words :

[Figure: GEMM operands with dimensions b-2r and r]

24

GBTRF – LAPACK algorithm (8/8)

• Compute GEMM update of size r x r + (r x r)*(r x r)

Words:

Total words :

[Figure: GEMM operands, all r x r]

25

GBTRF communication cost

• A full cost would be:

• If we choose r < b/3 this simplifies the leading terms to:

• Since r < b, the other option is b/3 < r < b, in which case we get:

26

GBTRF - Banded LU Summary

• Banded LU lower bounds are:

• LAPACK banded LU algorithm gives:

27

Sequential Summary

28

Parallel banded LU - Definitions

• Variables:

n - size of the matrix

p - number of processors

b - matrix bandwidth

M - size of fast memory

29

Parallel banded LU – Lower Bounds

• Assuming banded matrix is distributed in a 1D layout across n

• Lower Bounds

[Figure: neighboring processors P(i-1) and P(i) in the 1D layout]

30

Parallel banded algorithms – (Saad ‘85)

• (Saad, Schultz ’85) present a computation and communication analysis of banded Cholesky (LL^T) solvers on a 1D ring, a 2D torus and an n-D hypercube, as well as a pipelined approach

• While this is a different computation from LU, Cholesky can be viewed as a minimum cost for LU, since it requires neither pivoting nor the computation of U while still being a form of Gaussian Elimination

• Since most parallel banded algorithms also increase the amount of computation, the algorithms are also compared in terms of the multiplicative factor on the leading computation term.

31

Parallel banded algorithms – RIGBE

32

Parallel banded algorithms – BIGBE

33

Parallel banded algorithms – HBGE

• Same algorithm as BIGBE, but the 2D grid is embedded in the hypercube to allow for faster communication

34

Parallel banded algorithms – WFGE

• Uses the 2D cyclic layout and then performs operations diagonally

35

Parallel banded algorithms – (Saad ‘85)

• Parallel band LU lower bounds:

• Banded Cholesky algorithms:

36

Parallel banded algorithms – SPIKE (1/3)

• Another parallel banded implementation is the SPIKE algorithm (Lawrie, Sameh ‘84), a Cholesky solver, which is just a special case of Gaussian Elimination

• This factorization and solve algorithm is extended to a pivoting LU implementation in (Sameh ’05)

37

Parallel banded algorithms – SPIKE (2/3)

38

Parallel banded algorithms – SPIKE (3/3)

• parallel band LU Lower Bounds

• SPIKE Cholesky algorithm

39

PGBTRF – Data Layout

• Adopts same banded layout as sequential with a slightly higher bandwidth storage (4b instead of 3b) and 1D block distribution

[Figure: band matrix of order n split across processors P1 P2 P3 P4; storage height 2b + 2b]
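To make the 1D block distribution concrete, here is a minimal sketch that assigns contiguous column ranges to processors. It is an assumption-laden illustration: the function name `block_columns` is hypothetical, and the balanced split used here stands in for ScaLAPACK's own block-size parameter, which this sketch does not model.

```python
def block_columns(n, p):
    """Split columns 0..n-1 into p contiguous half-open ranges.

    Block sizes differ by at most one column; the first (n mod p)
    processors receive the extra column.
    """
    if p <= 0:
        raise ValueError("need at least one processor")
    base, extra = divmod(n, p)
    ranges = []
    start = 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

With this layout each processor shares band entries only with its immediate neighbors, which is why the parallel algorithm only needs to pass data from the end of one local block to the next processor.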

40

PGBTRF – Algorithm

• Description from ScaLAPACK code

1) Compute fully independent band LU factorizations of the submatrices located in local memory.

2) Pass the upper triangular matrix from the end of the local storage on to the next processor.

3) From the local factorization and the upper triangular matrix, form a reduced blocked bidiagonal system and store extra data in Af (extra storage)

4) Solve the reduced blocked bidiagonal system to compute extra factors and store them in Af

41

PGBTRF – Communication cost

• Parallel band LU lower bounds:

• ScaLAPACK band LU algorithm:

42

Parallel Summary

• Lower Bounds

• (Saad ’85)

• SPIKE

• ScaLAPACK

43

Future Work

• Checking the lower bounds and implementation details of applying CALU to the panel in the LAPACK algorithm

• Investigate parallel band LU lower bounds for an exact cost

• Heterogeneous analysis of implemented MAGMA sgbtrf and lower bounds for a heterogeneous model

• Looking at Nested Dissection as another Divide and Conquer method for parallel banded LU

• Analysis of cost of applying a parallel banded algorithm to the sequential model to see if we can reduce the communication by increasing computation

44

General Summary

45

Questions?
