Sparse Linear Algebra: Direct Methods
P. Amestoy and A. Buttari (INPT-ENSEEIHT),
A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L'Excellent and Bora Ucar (INRIA-CNRS/LIP-ENS Lyon), and
F.-H. Rouet (Lawrence Berkeley National Laboratory)
2012-2013
1 / 217
Outline
Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination
  Conclusion
2 / 217
A selection of references
I Books
  I Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
  I Dongarra, Duff, Sorensen and van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
  I Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
  I George, Liu, and Ng, Computer Solution of Sparse Positive Definite Systems, book to appear.
I Articles
  I Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX, 1993.
  I Liu, The role of elimination trees in sparse factorization, SIMAX, 1990.
  I Heath, Ng and Peyton, Parallel algorithms for sparse linear systems, SIAM Review, 1991.
3 / 217
Motivations
I Solution of linear systems of equations → key algorithmic kernel

  Continuous problem → Discretization → Solution of a linear system Ax = b

I Main parameters:
  I Numerical properties of the linear system (symmetry, positive definiteness, conditioning, ...)
  I Size and structure:
    I Large (order 10^7 to 10^8), square/rectangular
    I Dense or sparse (structured / unstructured)
  I Target computer (sequential/parallel/multicore)

→ Algorithmic choices are critical
5 / 217
Motivations for designing efficient algorithms
I Time-critical applications
I Solve larger problems
I Decrease elapsed time (code optimization, parallelism)
I Minimize cost of computations (time, memory)
6 / 217
Difficulties
I Access to data:
  I Computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk)
  I Sparse matrix: large irregular dynamic data structures.
  → Exploit the locality of references to data on the computer (design algorithms providing such locality)
I Efficiency (time and memory)
  I The number of operations and the memory needed depend very much on the algorithm used and on the numerical and structural properties of the problem.
  I The algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), multicore).
  → Algorithmic choices are critical
7 / 217
Sparse matrices
Example: the system

  3 x1 + 2 x2        = 5
         2 x2 − 5 x3 = 1
  2 x1        + 3 x3 = 0

can be represented as Ax = b, where

  A = [ 3  2  0
        0  2 −5
        2  0  3 ],  x = (x1, x2, x3)^T, and b = (5, 1, 0)^T.

Sparse matrix: only nonzeros are stored.
9 / 217
Sparse matrix ?
[sparsity pattern plot, nz = 5104]

Matrix dwt 592.rua (N = 592, NZ = 5104); structural analysis of a submarine
10 / 217
Sparse matrix ?
Matrix from Computational Fluid Dynamics (collaboration Univ. Tel Aviv)

[sparsity pattern plot, nz = 43105]

“Saddle-point” problem
11 / 217
Preprocessing sparse matrices
Original matrix (A = lhr01) and preprocessed matrix (A′(lhr01))

[sparsity pattern plots, nz = 18427 in both]

Modified problem: A′x′ = b′ with A′ = Pn P Dr A Dc Q Pt Qn
12 / 217
Factorization process
Solution of Ax = b

I A is unsymmetric:
  I A is factorized as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix.
  I Forward-backward substitution: Ly = b, then Ux = y.
I A is symmetric:
  I positive definite: A = LL^T
  I general: A = LDL^T
I A is rectangular, m × n with m ≥ n, and min_x ‖Ax − b‖_2 is sought:
  I A = QR, where Q is orthogonal (Q^{−1} = Q^T) and R is triangular.
  I Solve: y = Q^T b, then Rx = y.
13 / 217
Difficulties
I Only nonzero values are stored
I Factors L and U have far more nonzeros than A
I Data structures are complex
I Computations are only a small portion of the code (the rest is data manipulation)
I Memory size is a limiting factor (→ out-of-core solvers)
16 / 217
Key numbers:

1. Small size: 500 MB matrix; factors = 5 GB; flops = 100 Gflops
2. Example of a 2D problem: Lab. Geosciences Azur, Valbonne
   I Complex 2D finite-difference matrix, n = 16 × 10^6, 150 × 10^6 nonzeros
   I Storage (single precision): 2 GB (12 GB with the factors)
   I Flops: 10 Teraflops
3. Example of a 3D problem: EDF (Code Aster, structural engineering)
   I Real finite-element matrix, n = 10^6, nz = 71 × 10^6 nonzeros
   I Storage: 3.5 × 10^9 entries (28 GB) for the factors, 35 GB total
   I Flops: 2.1 × 10^13
17 / 217
Example in structural mechanics:
Car body, 227,362 unknowns, 5,757,996 nonzeros, MSC.Software
Size of factors: 51.1 million entries
Number of operations: 44.9 × 10^9
18 / 217
Example in structural mechanics:
BMW crankshaft, 148,770 unknowns, 5,396,386 nonzeros, MSC.Software
Size of factors: 97.2 million entries
Number of operations: 127.9 × 10^9
19 / 217
Data structure for sparse matrices
I The storage scheme depends on the pattern of the matrix and on the type of access required:
  I band or variable-band matrices
  I “block bordered” or block tridiagonal matrices
  I general matrix
  I row, column or diagonal access
20 / 217
Data formats for a general sparse matrix A
What needs to be represented

I Assembled matrices: M × N matrix A with NNZ nonzeros.
I Elemental matrices (unassembled): M × N matrix A with NELT elements.
I Arithmetic: real (4 or 8 bytes) or complex (8 or 16 bytes)
I Symmetric (or Hermitian) → store only part of the data.
I Distributed format?
I Duplicate entries and/or out-of-range values?
21 / 217
Classical Data Formats for Assembled Matrices
I Example of a 3 × 3 matrix with NNZ = 5 nonzeros:

  A = [ a11   0    0
         0   a22  a23
        a31   0   a33 ]

I Coordinate format
  IRN[1:NNZ] = 1 3 2 2 3
  JCN[1:NNZ] = 1 1 2 3 3
  VAL[1:NNZ] = a11 a31 a22 a23 a33
I Compressed Sparse Column (CSC) format
  IRN[1:NNZ] = 1 3 2 2 3
  VAL[1:NNZ] = a11 a31 a22 a23 a33
  COLPTR[1:N+1] = 1 3 4 6
  Column J is stored in IRN/VAL locations COLPTR(J) ... COLPTR(J+1)−1.
I Compressed Sparse Row (CSR) format: similar to CSC, but row by row.
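The correspondence between the coordinate and CSC arrays can be sketched in Python (a hypothetical helper, 0-based indices, mirroring the IRN/VAL/COLPTR arrays of the example):

```python
def coo_to_csc(n, rows, cols, vals):
    """Convert coordinate format (0-based) to CSC arrays (COLPTR, IRN, VAL)."""
    colptr = [0] * (n + 1)
    for j in cols:                      # count entries per column
        colptr[j + 1] += 1
    for j in range(n):                  # prefix sums -> column pointers
        colptr[j + 1] += colptr[j]
    irn = [0] * len(vals)
    val = [0.0] * len(vals)
    next_free = colptr[:n]              # next free slot in each column
    for r, c, v in zip(rows, cols, vals):
        p = next_free[c]
        irn[p], val[p] = r, v
        next_free[c] += 1
    return colptr, irn, val

# 0-based version of the example above: a11, a31, a22, a23, a33
colptr, irn, val = coo_to_csc(3,
                              [0, 2, 1, 1, 2],           # IRN
                              [0, 0, 1, 2, 2],           # JCN
                              [11., 31., 22., 23., 33.])
print(colptr)  # [0, 2, 3, 5] -- the 0-based analogue of COLPTR = 1 3 4 6
```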
22 / 217
Classical Data Formats for Assembled Matrices
I Example of a 3 × 3 matrix with NNZ = 5 nonzeros:

  A = [ a11   0    0
         0   a22  a23
        a31   0   a33 ]

I Diagonal format (M = N):
  NDIAG = 3
  IDIAG = −2 0 1
  VAL = [ na   a11   0
          na   a22  a23
          a31  a33  na ]   (na: not accessed)

  VAL(i,j) corresponds to A(i, i+IDIAG(j)) (for 1 ≤ i + IDIAG(j) ≤ N)
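The access rule VAL(i,j) ↔ A(i, i+IDIAG(j)) can be checked with a small sketch (Python, 0-based indices; numeric values stand in for the entries of the example):

```python
# Reconstruct dense A from the diagonal storage of the 3x3 example
# (0-based; the IDIAG shifts are the same -2, 0, 1 as above).
NA = None  # "not accessed"
IDIAG = [-2, 0, 1]
VAL = [[NA,   11.0,  0.0],
       [NA,   22.0, 23.0],
       [31.0, 33.0,  NA]]

n = 3
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j, d in enumerate(IDIAG):
        col = i + d
        if 0 <= col < n:        # VAL(i,j) corresponds to A(i, i+IDIAG(j))
            A[i][col] = VAL[i][j]
print(A)  # [[11.0, 0.0, 0.0], [0.0, 22.0, 23.0], [31.0, 0.0, 33.0]]
```

The `na` slots are exactly the positions where i + IDIAG(j) falls outside the matrix, so they are never read.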
25 / 217
Sparse Matrix-vector products Y ← AX
Algorithm depends on sparse matrix format:
I Coordinate format:

  Y(1:M) = 0
  DO k = 1, NNZ
    Y(IRN(k)) = Y(IRN(k)) + VAL(k) * X(JCN(k))
  ENDDO

I CSC format:

  Y(1:M) = 0
  DO J = 1, N
    Xj = X(J)
    DO k = COLPTR(J), COLPTR(J+1)-1
      Y(IRN(k)) = Y(IRN(k)) + VAL(k) * Xj
    ENDDO
  ENDDO

I CSR format:

  DO I = 1, M
    Yi = 0
    DO k = ROWPTR(I), ROWPTR(I+1)-1
      Yi = Yi + VAL(k) * X(JCN(k))
    ENDDO
    Y(I) = Yi
  ENDDO
26 / 217
Example of elemental matrix format
A1 (rows/cols 1, 2, 3):
  [ −1  2  3
     2  1  1
     1  1  1 ]

A2 (rows/cols 3, 4, 5):
  [  2 −1  3
     1  2 −1
     3  2  1 ]

A = A1 + A2 =
  [ −1  2  3  0  0
     2  1  1  0  0
     1  1  3 −1  3
     0  0  1  2 −1
     0  0  3  2  1 ]
30 / 217
Example of elemental matrix format
A1 (rows/cols 1, 2, 3) and A2 (rows/cols 3, 4, 5) as above.

I N = 5, NELT = 2, NVAR = 6, A = Σ_{i=1}^{NELT} Ai
  ELTPTR[1:NELT+1] = 1 4 7
  ELTVAR[1:NVAR] = 1 2 3 3 4 5
  ELTVAL[1:NVAL] = −1 2 1 2 1 1 3 1 1 2 1 3 −1 2 2 3 −1 1
I Remarks:
  I NVAR = ELTPTR(NELT+1) − 1
  I Order of element i: Si = ELTPTR(i+1) − ELTPTR(i)
  I NVAL = Σ Si² (unsymmetric) or Σ Si(Si+1)/2 (symmetric)
  I Storage of elements in ELTVAL: by columns
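The assembly A = Σ Ai can be sketched as follows (Python, 0-based variable lists; note the overlap at the shared variable 3, where the two diagonal contributions 1 + 2 sum to 3):

```python
# Assemble the two elemental matrices of the example into a dense 5x5 A.
elements = [
    ([0, 1, 2], [[-1, 2, 3],
                 [ 2, 1, 1],
                 [ 1, 1, 1]]),
    ([2, 3, 4], [[ 2, -1,  3],
                 [ 1,  2, -1],
                 [ 3,  2,  1]]),
]
n = 5
A = [[0] * n for _ in range(n)]
for vars_, Ae in elements:
    # scatter-add the element matrix into the global matrix
    for i, gi in enumerate(vars_):
        for j, gj in enumerate(vars_):
            A[gi][gj] += Ae[i][j]
print(A)
```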
31 / 217
File storage: Rutherford-Boeing
I Standard ASCII format for files
I Header + data (CSC format). Key xyz:
  I x = [rcp] (real, complex, pattern)
  I y = [suhzr] (symmetric, unsymmetric, Hermitian, skew symmetric (A = −A^T), rectangular)
  I z = [ae] (assembled, elemental)
  I ex: M_T1.RSA, SHIP003.RSE
I Supplementary files: right-hand sides, solution, permutations ...
I Canonical format introduced to guarantee a unique representation (order of entries in each column, no duplicates).
32 / 217
File storage: Rutherford-Boeing
DNV-Ex 1 : Tubular joint-1999-01-17 M_T1
1733710 9758 492558 1231394 0
rsa 97578 97578 4925574 0
(10I8) (10I8) (3e26.16)
1 49 96 142 187 231 274 346 417 487
556 624 691 763 834 904 973 1041 1108 1180
1251 1321 1390 1458 1525 1573 1620 1666 1711 1755
1798 1870 1941 2011 2080 2148 2215 2287 2358 2428
2497 2565 2632 2704 2775 2845 2914 2982 3049 3115
...
1 2 3 4 5 6 7 8 9 10
11 12 49 50 51 52 53 54 55 56
57 58 59 60 67 68 69 70 71 72
223 224 225 226 227 228 229 230 231 232
233 234 433 434 435 436 437 438 2 3
4 5 6 7 8 9 10 11 12 49
50 51 52 53 54 55 56 57 58 59
...
-0.2624989288237320E+10 0.6622960540857440E+09 0.2362753266740760E+11
0.3372081648690030E+08 -0.4851430162799610E+08 0.1573652896140010E+08
0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10
0.1813048723097540E+08 0.2955124446119170E+07 -0.2606931100955540E+07
0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09
0.1610636280324100E+08 0.4230082475435230E+07 -0.1951280618776270E+07
0.4498200951891750E+08 0.2066239484615530E+09 0.3792237438608430E+08
0.9819999042370710E+08 0.3881169368090200E+08 -0.4624480572242580E+08
33 / 217
File storage: Matrix-market
I Example
%%MatrixMarket matrix coordinate real general
% Comments
5 5 8
1 1 1.000e+00
2 2 1.050e+01
3 3 1.500e-02
1 4 6.000e+00
4 2 2.505e+02
4 4 -2.800e+02
4 5 3.332e+01
5 5 1.200e+01
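A minimal reader for this format might look as follows (a Python sketch handling only the `matrix coordinate real general` case, with no validation of the banner line):

```python
def read_mm_coordinate(lines):
    """Minimal Matrix Market reader: 'matrix coordinate real general' only.
    Comment lines (and the %%MatrixMarket banner) start with '%'."""
    it = (l for l in lines if not l.startswith('%'))
    m, n, nnz = map(int, next(it).split())
    entries = []
    for _ in range(nnz):
        i, j, v = next(it).split()
        entries.append((int(i), int(j), float(v)))  # 1-based indices
    return m, n, entries

sample = """%%MatrixMarket matrix coordinate real general
% Comments
5 5 8
1 1 1.000e+00
2 2 1.050e+01
3 3 1.500e-02
1 4 6.000e+00
4 2 2.505e+02
4 4 -2.800e+02
4 5 3.332e+01
5 5 1.200e+01""".splitlines()

m, n, entries = read_mm_coordinate(sample)
print(m, n, len(entries))  # 5 5 8
```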
34 / 217
Examples of sparse matrix collections
I The University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/
I Matrix Market: http://math.nist.gov/MatrixMarket/
I Rutherford-Boeing: http://www.cerfacs.fr/algor/Softs/RB/index.html
I TLSE: http://gridtlse.org/
35 / 217
Gaussian elimination
A = A(1), b = b(1), A(1)x = b(1):

  [ a11 a12 a13   [ x1     [ b1
    a21 a22 a23     x2   =   b2
    a31 a32 a33 ]   x3 ]     b3 ]

row 2 ← row 2 − row 1 × a21/a11
row 3 ← row 3 − row 1 × a31/a11

A(2)x = b(2):

  [ a11  a12      a13        [ x1     [ b1
    0    a22^(2)  a23^(2)      x2   =   b2^(2)
    0    a32^(2)  a33^(2) ]    x3 ]     b3^(2) ]

  b2^(2) = b2 − a21 b1 / a11, ...
  a32^(2) = a32 − a31 a12 / a11, ...

Finally, A(3)x = b(3):

  [ a11  a12      a13        [ x1     [ b1
    0    a22^(2)  a23^(2)      x2   =   b2^(2)
    0    0        a33^(3) ]    x3 ]     b3^(3) ]

  a33^(3) = a33^(2) − a32^(2) a23^(2) / a22^(2), ...

Typical Gaussian elimination step k:  a_ij^(k+1) = a_ij^(k) − a_ik^(k) a_kj^(k) / a_kk^(k)
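Applied to the small system from the introduction (3x1 + 2x2 = 5, 2x2 − 5x3 = 1, 2x1 + 3x3 = 0), the elimination steps can be sketched in Python:

```python
# Gaussian elimination on the introductory 3x3 system (no pivoting needed here).
A = [[3.0, 2.0,  0.0],
     [0.0, 2.0, -5.0],
     [2.0, 0.0,  3.0]]
b = [5.0, 1.0, 0.0]
n = 3
for k in range(n):                      # eliminate below pivot a_kk
    for i in range(k + 1, n):
        l = A[i][k] / A[k][k]
        for j in range(k, n):
            A[i][j] -= l * A[k][j]      # a_ij^(k+1) = a_ij^(k) - l_ik * a_kj^(k)
        b[i] -= l * b[k]
x = [0.0] * n
for i in range(n - 1, -1, -1):          # back substitution on the triangular system
    s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
    x[i] = s / A[i][i]
print(x)  # close to [-12.0, 20.5, 8.0]
```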
37 / 217
Relation with A = LU factorization
I One step of Gaussian elimination can be written A(k+1) = L(k) A(k), with

  L(k) = the identity matrix, except that column k holds −l_{k+1,k}, ..., −l_{n,k} below the diagonal, where l_ik = a_ik^(k) / a_kk^(k).

I Then A(n) = U = L(n−1) ... L(1) A, which gives A = LU, with

  L = [L(1)]^{−1} ... [L(n−1)]^{−1} = the unit lower triangular matrix with entries l_ij below the diagonal.

I In dense codes, entries of L and U overwrite entries of A.
I Furthermore, if A is symmetric, A = LDL^T with d_kk = a_kk^(k):
  A = LU = A^T = U^T L^T implies U (L^T)^{−1} = L^{−1} U^T = D diagonal and U = D L^T, thus A = L(D L^T) = LDL^T.
Dense LU factorization

I Step by step, the columns of A are zeroed out below the diagonal and A is updated:
  L(n−1) ... L(1) A = U, leading to A = LU where L = [L(1)]^{−1} ... [L(n−1)]^{−1}
I Zero entries in the columns of A can be replaced by the entries of L; the row entries of U can be stored in the corresponding locations of A.

Algorithm 1 Dense LU factorization
1: for k = 1 to n do
2:   L(k,k) = 1 ; L(k+1:n, k) = A(k+1:n, k) / A(k,k)
3:   U(k, k:n) = A(k, k:n)
4:   for j = k+1 to n do
5:     for i = k+1 to n do
6:       A(i,j) = A(i,j) − L(i,k) × U(k,j)
7:     end for
8:   end for
9: end for

When |A(k,k)| is relatively too small, numerical pivoting is required.
39 / 217
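Algorithm 1 can be transcribed almost literally in Python (a sketch without pivoting, so it assumes every pivot A(k,k) is nonzero):

```python
# Dense LU sketch mirroring Algorithm 1 (no pivoting).
def lu(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    A = [row[:] for row in A]           # work on a copy
    for k in range(n):
        L[k][k] = 1.0
        for i in range(k + 1, n):       # column k of L
            L[i][k] = A[i][k] / A[k][k]
        for j in range(k, n):           # row k of U
            U[k][j] = A[k][j]
        for j in range(k + 1, n):       # rank-1 update of the trailing block
            for i in range(k + 1, n):
                A[i][j] -= L[i][k] * U[k][j]
    return L, U

L, U = lu([[4.0, 3.0], [6.0, 3.0]])
print(L)  # [[1.0, 0.0], [1.5, 1.0]]
print(U)  # [[4.0, 3.0], [0.0, -1.5]]
```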
Gaussian elimination and sparsity
Step k of LU factorization (a_kk pivot):

I For i > k, compute l_ik = a_ik / a_kk (= a′_ik).
I For i > k, j > k, update the remaining rows/columns of the matrix:

  a′_ij = a_ij − (a_ik × a_kj) / a_kk = a_ij − l_ik × a_kj

I If a_ik ≠ 0 and a_kj ≠ 0, then a′_ij ≠ 0.
I If a_ij was zero, its new nonzero value must be stored: fill-in.

[diagram: pivot row k and pivot column k update entry (i, j); a zero entry a_ij becomes nonzero (fill-in)]
40 / 217
Gaussian elimination and sparsity
I Idem for Cholesky:
I For i > k, compute l_ik = a_ik / √a_kk (= a′_ik).
I For i > k, j > k, j ≤ i (lower triangular):

  a′_ij = a_ij − (a_ik × a_jk) / a_kk = a_ij − l_ik × l_jk
41 / 217
Gaussian elimination and sparsity
I Interest of permuting a matrix:

  [ X X X X X              [ X 0 0 0 X
    X X 0 0 0                0 X 0 0 X
    X 0 X 0 0    1 ↔ 5       0 0 X 0 X
    X 0 0 X 0                0 0 0 X X
    X 0 0 0 X ]              X X X X X ]

I Ordering the variables has a strong impact on:
  I fill-in
  I number of operations
  I shape of the dependency graph (tree) and parallelism
I Fill reduction is NP-hard in general [Yannakakis 81]
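The effect of the 1 ↔ 5 permutation on the arrow matrix above can be measured with a rough symbolic-elimination sketch (Python; unsymmetric, row patterns only — a toy model, not a real symbolic factorization):

```python
# Count fill-in produced by elimination in natural order, given the
# row patterns (sets of column indices, 0-based) of a square matrix.
def fill_in(pattern):
    n = len(pattern)
    rows = [set(r) for r in pattern]
    fill = 0
    for k in range(n):
        # rows below k with a nonzero in column k receive pivot row k
        touched = [i for i in range(k + 1, n) if k in rows[i]]
        for i in touched:
            for j in rows[k]:
                if j > k and j not in rows[i]:
                    rows[i].add(j)       # new nonzero: fill-in
                    fill += 1
    return fill

arrow_first = [{0,1,2,3,4}, {0,1}, {0,2}, {0,3}, {0,4}]  # dense row/col first
arrow_last  = [{0,4}, {1,4}, {2,4}, {3,4}, {0,1,2,3,4}]  # after the 1 <-> 5 swap
print(fill_in(arrow_first), fill_in(arrow_last))  # 12 0
```

Eliminating the dense row/column first fills the entire trailing block; pushing it last produces no fill at all.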
42 / 217
Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua
Harwell-Boeing matrix dwt 592.rua, structural computing on a submarine. NZ(LU factors) = 58202

[sparsity pattern plots: original matrix (nz = 5104) and factorized matrix (nz = 58202)]
43 / 217
Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua
NZ(LU factors) = 16924

[sparsity pattern plots: permuted matrix (RCM, nz = 5104) and factorized permuted matrix (nz = 16924)]
44 / 217
Table: Benefits of sparsity on a matrix of order 2021 × 2021 with 7353 nonzeros (Dongarra et al. 91).

  Procedure                     Total storage   Flops         Time (sec.) on CRAY J90
  Full system                   4084 Kwords     5503 × 10^6   34.5
  Sparse system                 71 Kwords       1073 × 10^6   3.4
  Sparse system and reordering  14 Kwords       42 × 10^3     0.9
45 / 217
Sparse LU factorization: a (too) simple algorithm

Only nonzeros are stored and operated on.

Algorithm 2 Simple sparse LU factorization
1: Permute matrix A to reduce fill-in and flops (NP-complete problem)
2: for k = 1 to n do
3:   L(k,k) = 1 ; for nonzeros in column k: L(k+1:n, k) = A(k+1:n, k) / A(k,k)
4:   U(k, k:n) = A(k, k:n)
5:   for j = k+1 to n, limited to nonzeros in row U(k, :) do
6:     for i = k+1 to n, limited to nonzeros in column L(:, k) do
7:       A(i,j) = A(i,j) − L(i,k) × U(k,j)
8:     end for
9:   end for
10: end for

Questions

I Dynamic data structure for A to accommodate fill-in
I Data access efficiency; can we predict the position of fill-in?
I |A(k,k)| too small → numerical permutation needed!
46 / 217
Control of numerical stability: numerical pivoting
I In dense linear algebra, partial pivoting is commonly used (at each step the largest entry in the column is selected).
I In sparse linear algebra, flexibility to preserve sparsity is offered:
  I Partial threshold pivoting: eligible pivots are not too small with respect to the maximum in the column.
    Set of eligible pivots = { r : |a_rk^(k)| ≥ u × max_i |a_ik^(k)| }, where 0 < u ≤ 1.
  I Then, among the eligible pivots, select one that better preserves sparsity.
  I u is called the threshold parameter (u = 1 → partial pivoting).
  I It restricts the maximum possible growth of a_ij = a_ij − (a_ik × a_kj) / a_kk to 1 + 1/u, which is sufficient to preserve numerical stability.
  I u ≈ 0.1 is often chosen in practice.
I For symmetric indefinite problems, 2 × 2 pivots (with threshold) are also used, to preserve symmetry and sparsity.
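The eligible-pivot set can be sketched as follows (Python; `col_k` is a hypothetical map from row index to the nonzero values of column k):

```python
# Eligible pivots under partial threshold pivoting: rows whose entry in
# column k is at least u times the column maximum in absolute value.
def eligible_pivots(col_k, u):
    cmax = max(abs(v) for v in col_k.values())
    return {r for r, v in col_k.items() if abs(v) >= u * cmax}

col = {0: 0.5, 3: -4.0, 7: 1.0}     # nonzeros of column k: row -> value
print(eligible_pivots(col, 1.0))    # partial pivoting: only the max, {3}
print(eligible_pivots(col, 0.1))    # looser threshold: {0, 3, 7}
```

A smaller u enlarges the set, leaving more freedom to pick the pivot that causes the least fill-in.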
Threshold pivoting and numerical accuracy
Table: Effect of variation in the threshold parameter u on a 541 × 541 matrix with 4285 nonzeros (Dongarra et al. 91).

  u        Nonzeros in LU factors   Error
  1.0      16767                    3 × 10^−9
  0.25     14249                    6 × 10^−10
  0.1      13660                    4 × 10^−9
  0.01     15045                    1 × 10^−5
  10^−4    16198                    1 × 10^2
  10^−10   16553                    3 × 10^23
48 / 217
Threshold pivoting and numerical accuracy
Difficulty: numerical pivoting implies dynamic data structures that cannot be predicted symbolically.
49 / 217
Three-phase scheme to solve Ax = b
1. Analysis step
   I Preprocessing of A (symmetric/unsymmetric orderings, scalings)
   I Build the dependency graph (elimination tree, eDAG, ...)
   I Apre = P Dr A Qc Dc P^T
2. Factorization (Apre = LU, LDL^T, LL^T, QR), with numerical pivoting
3. Solution based on the factored matrices
   I Triangular solves: Ly = bpre, then U xpre = y
   I Improvement of the solution (iterative refinement), error analysis
50 / 217
Efficient implementation of sparse algorithms
I Indirect addressing is often used in sparse calculations, e.g. sparse SAXPY:

  do i = 1, m
    A( ind(i) ) = A( ind(i) ) + alpha * w( i )
  enddo

I Even if manufacturers provide hardware for improving indirect addressing, it penalizes performance.
I Identify dense blocks, or switch to dense calculations as soon as the matrix is not sparse enough.
51 / 217
Effect of switch to dense calculations
Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al. 91)

  Density for switch   Order of full   Millions   Time
  to full code         submatrix       of flops   (seconds)
  No switch            0               7          21.8
  1.00                 74              7          21.4
  0.80                 190             8          15.0
  0.60                 235             11         12.5
  0.40                 305             21         9.0
  0.20                 422             50         5.5
  0.10                 531             100        3.7
  0.005                1420            1908       6.1

The sparse structure should be exploited if density < 10%.
52 / 217
Complexity of sparse direct methods
Regular problems (nested dissection)    2D: N × N grid    3D: N × N × N grid
Nonzeros in original matrix             Θ(N²)             Θ(N³)
Nonzeros in factors                     Θ(N² log N)       Θ(N⁴)
Floating-point ops                      Θ(N³)             Θ(N⁶)

3D example in earth science: acoustic wave propagation, 27-point finite-difference grid.
Current goal [Operto and Virieux, http://seiscope.oca.eu/]: LU on the complete earth.
Extrapolation on a 1000 × 1000 × 1000 grid: 15 exaflops, 200 TBytes for the factors, 32 TBytes of working memory!
54 / 217
Summary – sparse matrices main issues
I Widely used in engineering and industry (critical for the performance of simulations)
I Irregular dynamic data structures of very large size (Tera/PetaBytes)
I Strong relations between sparse matrices and graphs
  I Which graph for which matrix (symmetric, unsymmetric, triangular)?
  I Which graph for which algorithm?
I Efficient algorithms needed to:
  I reorder sparse matrices
  I predict the data structures of the factor matrices
  I model factorization and accommodate numerical issues
55 / 217
Outline
Graph definitions and relations to sparse matrices
  Which graph for which matrix?
  Trees and spanning trees
  Ordering and tree traversal
  Peripheral and pseudo-peripheral vertices
  Graph matching and matrix permutation
  Reducibility and connected components
  Definition of hypergraphs
Postorderings and memory usage
56 / 217
Graph notations and definitions
A graph G = (V ,E ) consists of a finite set V , called the vertex setand a finite, binary relation E on V , called the edge set.
Three standard graph models
Undirected graph: The edges are unordered pair of vertices, i.e.,{u, v} ∈ E for some u, v ∈ V .
Directed graph: The edges are ordered pair of vertices, that is, (u, v)and (v , u) are two different edges.
Bipartite graph: G = (U ∪ V ,E ) consists of two disjoint vertex setsU and V such that for each edge (u, v) ∈ E , u ∈ U and v ∈ V .
An ordering or labelling of G = (V, E) having n vertices, i.e., |V| = n, is a mapping of V onto {1, 2, ..., n}.
58 / 217
Matrices and graphs: Rectangular matrices
The rows/columns and nonzeros of a given sparse matrix correspond (withnatural labelling) to the vertices and edges, respectively, of a graph.
Rectangular matrices
A = a 3 × 4 pattern matrix (rows 1–3, columns 1–4, two nonzeros per row)

Bipartite graph

[figure: row vertices r1, r2, r3 and column vertices c1, c2, c3, c4, one edge per nonzero]

The set of rows corresponds to one of the vertex sets, R; the set of columns corresponds to the other vertex set, C, such that for each aij ≠ 0, (ri, cj) is an edge.
59 / 217
Matrices and graphs: Square unsymmetric pattern
The rows/columns and nonzeros of a given sparse matrix correspond (withnatural labelling) to the vertices and edges, respectively, of a graph.
Square unsymmetric pattern matrices

A = a 3 × 3 pattern matrix

Graph models

I Bipartite graph as before.
I Directed graph

  [figure: vertices v1, v2, v3, one directed edge per nonzero]

The set of rows/columns corresponds to the vertex set V such that for each aij ≠ 0, (vi, vj) is an edge. The transposed view is possible too, i.e., the edge (vi, vj) directed from column i to row j. Usually self-loops are omitted.
60 / 217
Matrices and graphs: Square unsymmetric pattern
A special subclass

Directed acyclic graph (DAG): a directed graph with no cycles (except possibly self-loops).

DAGs: we can sort the vertices such that if (u, v) is an edge, then u appears before v in the ordering (a topological order).

Question: what kind of matrices have a DAG structure?
61 / 217
Matrices and graphs: Symmetric pattern
The rows/columns and nonzeros of a given sparse matrix correspond (withnatural labelling) to the vertices and edges, respectively, of a graph.
Square symmetric pattern matrices

A = a 3 × 3 symmetric pattern matrix

Graph models

I Bipartite and directed graphs as before.
I Undirected graph

  [figure: vertices v1, v2, v3, one undirected edge per off-diagonal nonzero pair]

The set of rows/columns corresponds to the vertex set V such that for each aij, aji ≠ 0, {vi, vj} is an edge. No self-loops; usually the main diagonal is assumed to be zero-free.
62 / 217
Definitions: Edges, degrees, and paths
Many definitions for directed and undirected graphs are the same.We will use (u, v) to refer to an edge of an undirected or directedgraph to avoid repeated definitions.
I An edge (u, v) is said to be incident on the vertices u and v.
I For any vertex u, the vertices in adj(u) = {v : (u, v) ∈ E} are called the neighbors of u. The vertices in adj(u) are said to be adjacent to u.
I The degree of a vertex is the number of edges incident on it.
I A path p of length k is a sequence of vertices ⟨v0, v1, ..., vk⟩ where (vi−1, vi) ∈ E for i = 1, ..., k. The two end points v0 and vk are said to be connected by the path p, and the vertex vk is said to be reachable from v0.
63 / 217
Definitions: Components
I An undirected graph is said to be connected if every pair of vertices is connected by a path.
I The connected components of an undirected graph are the equivalence classes of vertices under the “is reachable from” relation.
I A directed graph is said to be strongly connected if every pair of vertices is reachable from each other.
I The strongly connected components of a directed graph are the equivalence classes of vertices under the “are mutually reachable” relation.
64 / 217
Definitions: Trees and spanning trees
A tree is a connected, acyclic, undirected graph. If an undirectedgraph is acyclic but disconnected, then it is a forest.
Properties of trees
I Any two vertices are connected by a unique path.
I |E | = |V | − 1
A rooted tree is a tree with a distinguished vertex r, called the root.

There is a unique path from the root r to every other vertex v. Any vertex y on that path is called an ancestor of v. If y is an ancestor of v, then v is a descendant of y.

The subtree rooted at v is the tree induced by the descendants of v, rooted at v.

A spanning tree of a connected graph G = (V, E) is a tree T = (V, F) such that F ⊆ E.
66 / 217
Ordering of the vertices of a rooted tree
I A topological ordering of a rooted tree is an ordering thatnumbers children vertices before their parent.
I A postorder is a topological ordering which numbers the verticesin any subtree consecutively.
[figures: a connected graph G; a rooted spanning tree with a topological ordering; a rooted spanning tree with a postordering]
68 / 217
How to explore a graph:
Two algorithms such that each edge is traversed exactly once in the forward and reverse directions and each vertex is visited (connected graphs are assumed in the description of the algorithms):

I Depth-first search: starting from a given vertex (mark it), follow a path (marking each vertex on the path) until the current vertex is adjacent only to marked vertices. Return to the previous vertex and continue.
I Breadth-first search: select a vertex and put it into an initially empty queue of vertices to be visited. Repeatedly remove the vertex x at the head of the queue and place all neighbors of x that were not enqueued before into the queue.
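Both explorations can be sketched on an adjacency-list representation (Python; the functions return the order in which vertices are visited):

```python
from collections import deque

def dfs(adj, start):
    """Depth-first search using an explicit stack."""
    visited, order, stack = set(), [], [start]
    while stack:
        v = stack.pop()
        if v not in visited:
            visited.add(v)
            order.append(v)
            stack.extend(reversed(adj[v]))  # keep left-to-right neighbor order
    return order

def bfs(adj, start):
    """Breadth-first search using a FIFO queue."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}  # a 4-cycle
print(dfs(adj, 0))  # [0, 1, 3, 2]
print(bfs(adj, 0))  # [0, 1, 2, 3]
```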
69 / 217
Illustration of DFS and BFS exploration
[figures: breadth-first spanning tree rooted at v and depth-first spanning tree rooted at v, on a graph with vertices s, t, u, v, w, x, y, z]
70 / 217
Properties of DFS and BFS exploration
I If the graph G is not connected, the algorithm is restarted, and in fact it detects the connected components of G.
I Each vertex is visited once.
I The edges visited by the DFS algorithm form a spanning forest of a general graph G, the so-called depth-first spanning forest of G.
I The edges visited by the BFS algorithm form a spanning forest of a general graph G, the so-called breadth-first spanning forest of G.
71 / 217
Graph shapes and peripheral vertices
I The distance d(u, v) between two vertices of a graph is the length of a shortest path joining those vertices.
I The eccentricity of a vertex: l(v) = max{ d(v, w) : w ∈ V }
I The diameter of the graph: δ(G) = max{ l(v) : v ∈ V }
I v is a peripheral vertex if l(v) = δ(G)
I Example: a connected graph with δ(G) = 5 and peripheral vertices s, w, x.
  [figure: graph on vertices s, t, u, v, w, x, y, z]
73 / 217
Pseudo-peripheral vertices
I The rooted level structure L(v) of G is a partitioning of the vertices, L(v) = { L0(v), ..., L_l(v)(v) }, such that L0(v) = {v} and Li(v) = Adj(Li−1(v)) \ Li−2(v).
I The eccentricity l(v) is also called the length of L(v). The width w(v) of L(v) is defined as w(v) = max{ |Li(v)| : 0 ≤ i ≤ l(v) }.
I The aspect ratio of L(v) is defined as l(v)/w(v).
I A pseudo-peripheral vertex is a vertex with a large eccentricity.
I Algorithm to determine a pseudo-peripheral vertex: let i = 0, r0 be an arbitrary vertex, and nblevels = l(r0). The algorithm iteratively selects the vertex r_{i+1} of minimum degree in the last level L_{l(ri)}(ri) and updates nblevels, as long as l(r_{i+1}) > nblevels.
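The level-structure construction and the pseudo-peripheral search can be sketched as follows (Python; `levels` builds L(v) by BFS, and the function names are hypothetical):

```python
def levels(adj, r):
    """Rooted level structure L(r) as a list of levels (BFS layers)."""
    seen, frontier, L = {r}, [r], [[r]]
    while True:
        nxt = [w for v in frontier for w in adj[v] if w not in seen]
        if not nxt:
            return L
        nxt = list(dict.fromkeys(nxt))   # dedupe, keep order
        seen.update(nxt)
        frontier = nxt
        L.append(nxt)

def pseudo_peripheral(adj, r):
    """Restart from a min-degree vertex of the last level while the
    eccentricity keeps growing."""
    ecc = len(levels(adj, r)) - 1
    while True:
        last = levels(adj, r)[-1]
        cand = min(last, key=lambda v: len(adj[v]))  # min degree in last level
        new_ecc = len(levels(adj, cand)) - 1
        if new_ecc <= ecc:
            return r
        r, ecc = cand, new_ecc

# Path graph 0-1-2-3-4: starting from the middle vertex, the search
# should end at an endpoint, whose eccentricity is 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
v = pseudo_peripheral(path, 2)
print(v, len(levels(path, v)) - 1)  # an endpoint, eccentricity 4
```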
74 / 217
Pseudo-peripheral vertices
[figures: first step, select v, giving the level structure L0(v), L1, L2, L3; second step, select w, a minimum-degree vertex of L3, giving a longer level structure L0(w), ..., L5]
75 / 217
Permutation matrices
A permutation matrix is a square (0, 1)-matrix where each row andcolumn has a single 1.
If P is a permutation matrix, PP^T = I, i.e., it is an orthogonal matrix.

Let A be a 3 × 3 sparse matrix, and suppose we want to permute its columns as [2, 1, 3]. Define p21 = 1, p12 = 1, p33 = 1 (if column j is to be at position i, set pji = 1), i.e.,

  P = [ 0 1 0
        1 0 0
        0 0 1 ]

Then B = AP is A with columns 1 and 2 interchanged.
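The example can be checked numerically (a Python sketch with a hypothetical `matmul` helper; a small numeric A stands in for the pattern matrix of the slide):

```python
# Column permutation as a matrix product: B = A P, with p_ji = 1 when
# column j of A should land at position i (0-based arrays below).
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 0, 2],
     [3, 4, 0],
     [0, 5, 0]]
P = [[0, 1, 0],   # permutation [2, 1, 3]: swap the first two columns
     [1, 0, 0],
     [0, 0, 1]]
B = matmul(A, P)
print(B)  # [[0, 1, 2], [4, 3, 0], [5, 0, 0]] -- columns 1 and 2 swapped

# P is orthogonal: P P^T = I
I = matmul(P, [list(r) for r in zip(*P)])
print(I)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```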
76 / 217
Symmetric matrices and graphs
I Remarks:
  I Number of nonzeros in column j = |adjG(j)|
  I Symmetric permutation ≡ renumbering the graph

[figure: a 5 × 5 symmetric matrix and its corresponding undirected graph on vertices 1–5]
77 / 217
Matching in bipartite graphs and permutations
A matching in a graph is a set of edges no two of which share acommon vertex. We will be mostly dealing with matchings inbipartite graphs.
In matrix terms, a matching in the bipartite graph of a matrixcorresponds to a set of nonzero entries no two of which are in thesame row or column.
A vertex is said to be matched if there is an edge in the matchingincident on the vertex, and to be unmatched otherwise. In a perfectmatching, all vertices are matched.
The cardinality of a matching is the number of edges in it. Amaximum cardinality matching or a maximum matching is amatching of maximum cardinality. Solvable in polynomial time.
79 / 217
Matching in bipartite graphs and permutations
Given a square matrix whose bipartite graph has a perfect matching,such a matching can be used to permute the matrix such that thematching entries are along the main diagonal.
[figure: bipartite graph with a perfect matching between rows r1, r2, r3 and columns c1, c2, c3; permuting the columns as [2, 1, 3] places the matched entries on the main diagonal]
80 / 217
Definitions: Reducibility
Reducible matrix: An n × n square matrix A is reducible if there exists an n × n permutation matrix P such that

    PAP^T = ( A11  A12 )
            (  O   A22 ),

where A11 is an r × r submatrix and A22 is an (n − r) × (n − r) submatrix, with 1 ≤ r < n.

Irreducible matrix: there is no such permutation matrix.

Theorem: An n × n square matrix is irreducible iff its directed graph is strongly connected.

Proof: follows from the definitions.
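The theorem can be checked mechanically; a sketch using SciPy's strongly connected components on a hypothetical reducible pattern:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical reducible pattern: no edge from vertex 2 back to {0, 1}
A = csr_matrix(np.array([[1., 1., 1.],
                         [1., 1., 0.],
                         [0., 0., 1.]]))
n_comp, labels = connected_components(A, directed=True, connection='strong')
# More than one strong component <=> the digraph is not strongly
# connected <=> A is reducible
assert n_comp > 1
```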
82 / 217
Graph definitions and relations to sparse matrices
  Which graph for which matrix?
  Trees and spanning trees
  Ordering and tree traversal
  Peripheral and pseudo-peripheral vertices
  Graph matching and matrix permutation
  Reducibility and connected components
  Definition of hypergraphs
Postorderings and memory usage
83 / 217
Definitions: Hypergraphs
Hypergraph: A hypergraph H = (V, N) consists of a finite set V, called the vertex set, and a set N of non-empty subsets of vertices, called the hyperedge set or the net set. Hypergraphs generalize graphs.

For a matrix A, define a hypergraph whose vertices correspond to the rows and whose nets correspond to the columns, such that vertex vi is in net nj iff aij ≠ 0 (the column-net model).
[Figure: a sample 3×4 matrix and its column-net hypergraph model, with vertices v1, v2, v3 (rows) and nets n1, n2, n3, n4 (columns)]
84 / 217
Outline
Reordering sparse matrices
  Fill-in and reordering
  The elimination graph model
  Fill-in characterization
  Fill-reducing heuristics
  Reduction to Block Triangular Form
85 / 217
Reordering sparse matrices
  Fill-in and reordering
  The elimination graph model
  Fill-in characterization
  Fill-reducing heuristics
  Reduction to Block Triangular Form
86 / 217
Fill-in and reordering
Step k of LU factorization (akk pivot):
I For i > k compute lik = aik/akk (= a′ik),
I For i > k, j > k:

    a′ij = aij − (aik × akj)/akk = aij − lik × akj

I If aik ≠ 0 and akj ≠ 0, then a′ij ≠ 0
I If aij was zero → the non-zero a′ij must be stored: fill-in
[Figure: entry a′ij becomes nonzero (fill-in) when aik ≠ 0 and akj ≠ 0]

Interest of permuting a matrix: eliminating the dense "arrow head" first (left) fills the matrix completely, while the reversed ordering (right) produces no fill-in:

    X X X X X        X 0 0 0 X
    X X 0 0 0        0 X 0 0 X
    X 0 X 0 0        0 0 X 0 X
    X 0 0 X 0        0 0 0 X X
    X 0 0 0 X        X X X X X
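The effect of the two orderings can be verified numerically; this sketch builds a hypothetical 5×5 arrow matrix and counts the nonzeros of L and U for both orderings.

```python
import numpy as np
from scipy.linalg import lu

n = 5
# Arrow matrix: dense first row and column, diagonal elsewhere
A = 4.0 * np.eye(n)
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = 4.0

# Reversing the order puts the dense row/column last
p = np.arange(n)[::-1]
B = A[np.ix_(p, p)]

def nnz_factors(M):
    """Count structural nonzeros of L and U after dense LU."""
    _, L, U = lu(M)
    return np.count_nonzero(np.abs(L) > 1e-12) + np.count_nonzero(np.abs(U) > 1e-12)

# Eliminating the dense node first fills everything;
# eliminating it last causes no fill-in
assert nnz_factors(A) > nnz_factors(B)
```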
87 / 217
Symmetric matrices and graphs
I Assumptions: A symmetric and pivots are chosen from the diagonal
I Structure of A symmetric, represented by the graph G = (V, E)
  I Vertices are associated to columns: V = {1, . . . , n}
  I Edges E are defined by: (i, j) ∈ E ↔ aij ≠ 0
I G undirected (symmetry of A)
88 / 217
Symmetric matrices and graphs
I Remarks:
  I Number of nonzeros in column j = |adjG(j)|
  I Symmetric permutation ≡ renumbering the graph
[Figure: a 5×5 symmetric matrix and its corresponding graph on vertices 1–5]
89 / 217
Breadth-First Search (BFS) of a graph to permute a matrix
Exercise 1 (BFS traversal to order a matrix)

Given a symmetric matrix A and its associated graph G = (V, E):

1. Derive from a BFS traversal of G an algorithm to reorder the matrix A.
2. What structural property does the permuted matrix have?
3. Explain why such an ordering can help reduce fill-in during LU factorization.
4. How can the proposed algorithm be used on an unsymmetric matrix?
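As a hint for the exercise, SciPy's reverse Cuthill-McKee (a reversed BFS-based ordering) makes the band structure appear; the random symmetric pattern below is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

rng = np.random.default_rng(0)
n = 50
M = (rng.random((n, n)) < 0.05).astype(float)
A = csr_matrix(np.eye(n) + M + M.T)   # random symmetric pattern

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
B = A[perm][:, perm].toarray()

def bandwidth(M):
    i, j = np.nonzero(M)
    return int(np.abs(i - j).max())

# The BFS-based ordering concentrates nonzeros near the diagonal
assert bandwidth(B) <= bandwidth(A.toarray())
```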
90 / 217
Reordering sparse matrices
  Fill-in and reordering
  The elimination graph model
  Fill-in characterization
  Fill-reducing heuristics
  Reduction to Block Triangular Form
91 / 217
The elimination graph model for symmetric matrices
I Let A be a symmetric positive definite matrix of order n
I The LL^T factorization can be described by the equation:

    A = A0 = H0 = ( d1    v1^T )
                  ( v1    H1   )

      = ( sqrt(d1)        0      ) ( 1    0  ) ( sqrt(d1)   v1^T/sqrt(d1) )
        ( v1/sqrt(d1)   I_{n−1}  ) ( 0   H̄1 ) (    0          I_{n−1}    )

      = L1 A1 L1^T,   where   H̄1 = H1 − v1 v1^T / d1

I The basic step is applied on H̄1, H̄2, . . . to obtain:

    A = (L1 L2 · · · L_{n−1}) I_n (L_{n−1}^T · · · L2^T L1^T) = L L^T
92 / 217
The basic step: H̄1 = H1 − v1 v1^T / d1

What is v1 v1^T in terms of structure?

v1 is a column of A, hence gives the neighbors of the corresponding vertex.

v1 v1^T results in a dense sub-block in H̄1.

If any of the nonzeros in the dense submatrix are not in A, then we have fill-ins.
93 / 217
The elimination process in the graphs
G_U(V, E) ← undirected graph of A
for k = 1 : n − 1 do
  V ← V − {k}   {remove vertex k}
  E ← E − {(k, ℓ) : ℓ ∈ adj(k)} ∪ {(x, y) : x ∈ adj(k) and y ∈ adj(k)}
  G_k ← (V, E)   {for definition}
end for
Gk are the so-called elimination graphs (Parter,’61).
[Figure: the initial graph G0 = H0 of a 6×6 matrix, vertices 1–6]
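The loop above can be sketched with plain adjacency sets (the 6-vertex graph below is hypothetical, not the one in the figure):

```python
# A sketch of Parter's elimination-graph process on an adjacency-set
# representation (small hypothetical example graph, vertices 1..6).
def eliminate(adj, k):
    """Remove vertex k and connect its neighbors pairwise (fill-in)."""
    nbrs = adj.pop(k)
    for x in nbrs:
        adj[x].discard(k)
        adj[x] |= (nbrs - {x})   # clique the remaining neighbors of k
    return adj

adj = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3, 5}, 5: {4, 6}, 6: {5}}
eliminate(adj, 1)                # eliminating 1 adds the fill edge (2, 3)
assert 3 in adj[2] and 2 in adj[3]
```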
94 / 217
A sequence of elimination graphs
[Figure: the elimination graphs G0, G1, G2, G3 on vertices 1–6 and the corresponding reduced matrices H0, H1, H2, H3]
95 / 217
Fill-in and reordering
[Spy plots: matrix lhr01 before permutation (A, nz = 18427), the permuted matrix (A′, nz = 18427), and the factored matrix LU(A′) (nz = 76105)]
96 / 217
Reordering sparse matrices
  Fill-in and reordering
  The elimination graph model
  Fill-in characterization
  Fill-reducing heuristics
  Reduction to Block Triangular Form
97 / 217
Fill-in characterization
Let A be a symmetric matrix, G(A) its associated graph, and L the matrix of factors A = LL^T.

Fill path theorem (Rose, Tarjan, Lueker, 1976)

lij ≠ 0 iff there is a path in G(A) between i and j such that all nodes in the path have indices smaller than both i and j.

[Figure: a fill path between i and j through lower-numbered nodes p1, p2, . . . , pk]
98 / 217
Fill-in characterization (proof intuition)

Let A be a symmetric matrix, G(A) its associated graph, and L the matrix of factors A = LL^T.

Fill path theorem (Rose, Tarjan, Lueker, 1976)

lij ≠ 0 iff there is a path in G(A) between i and j such that all nodes in the path have indices smaller than both i and j.

[Figure: eliminating the lower-numbered nodes p1, . . . , pk of the path one by one contracts it, until (i, j) becomes a fill-in edge: Fill(i,j)]

101 / 217
Introducing the filled graph G+(A)
I Let F = L + L^T be the filled matrix, and G(F) the filled graph of A, denoted by G+(A).

I Lemma (Parter 1961): (vi, vj) ∈ G+ if and only if (vi, vj) ∈ G, or ∃k < min(i, j) such that (vi, vk) ∈ G+ and (vk, vj) ∈ G+.

[Figure: the graph G+(A) = G(F) of the filled matrix F = L + L^T on vertices 1–6]
102 / 217
Reordering sparse matrices
  Fill-in and reordering
  The elimination graph model
  Fill-in characterization
  Fill-reducing heuristics
  Reduction to Block Triangular Form
103 / 217
Fill-reducing heuristics
Three main classes of methods for minimizing fill-in during factorization

I Global approach: the matrix is permuted into a matrix with a given pattern
  I Fill-in is restricted to occur within that structure
  I Cuthill-McKee (block tridiagonal matrix)
    (Remark: BFS traversal of the associated graph)
  I Nested dissections ("block bordered" matrix)
    (Remark: interpretation using the fill-path theorem)

[Figure: graph partitioning with separators S1, S2, S3 and the corresponding permuted matrix in block-bordered form]
104 / 217
Fill-reducing heuristics
I Local heuristics: at each step of the factorization, selection of the pivot that is likely to minimize fill-in.
  I The method is characterized by the way pivots are selected.
  I Markowitz criterion (for a general matrix).
  I Minimum degree or minimum fill-in (for symmetric matrices).

I Hybrid approaches: once the matrix is permuted to block structure, local heuristics are used within the blocks.
105 / 217
Local heuristics to reduce fill-in during factorization
Let G(A) be the graph associated to a matrix A that we want to order using local heuristics. Let Metric be such that Metric(vi) < Metric(vj) implies vi is a better pivot than vj.

Generic algorithm
Loop until all nodes are selected:
  Step 1: select the current node p (the so-called pivot) with minimum metric value,
  Step 2: update the elimination graph,
  Step 3: update Metric(vj) for all non-selected nodes vj.

Step 3 should only be applied to nodes for which the Metric value might have changed.
106 / 217
Reordering unsymmetric matrices: Markowitz criterion
I At step k of Gaussian elimination:

  [Figure: the active submatrix A^k, with the factors L and U already computed in the first k − 1 rows and columns]

I r_i^k = number of non-zeros in row i of A^k
I c_j^k = number of non-zeros in column j of A^k
I a_ij must be large enough and should minimize (r_i^k − 1) × (c_j^k − 1), ∀i, j > k
I Minimum degree: the Markowitz criterion for symmetric diagonally dominant matrices
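A minimal sketch of Markowitz pivot selection with threshold pivoting (both the matrix and the threshold value are hypothetical):

```python
import numpy as np

def markowitz_pivot(A, k, tol=0.1):
    """Pick a pivot in the active submatrix A[k:, k:] minimizing
    (r_i - 1)(c_j - 1) among entries large enough relative to the
    largest entry of their column (threshold pivoting sketch)."""
    S = A[k:, k:]
    nz = S != 0
    r = nz.sum(axis=1)        # nonzeros per row of the active part
    c = nz.sum(axis=0)        # nonzeros per column
    best, best_cost = None, None
    for i, j in zip(*np.nonzero(S)):
        if abs(S[i, j]) < tol * np.max(np.abs(S[:, j])):
            continue          # numerically unacceptable pivot
        cost = (r[i] - 1) * (c[j] - 1)
        if best_cost is None or cost < best_cost:
            best, best_cost = (i + k, j + k), cost
    return best

A = np.array([[4., 1., 0.],
              [1., 4., 1.],
              [0., 1., 4.]])
print(markowitz_pivot(A, 0))
```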
107 / 217
Minimum degree algorithm
I Step 1: select the vertex that possesses the smallest number of neighbors in G^0.

[Figure: (a) a 10×10 sparse symmetric matrix; (b) its elimination graph]
The node/variable selected is 1 of degree 2.
108 / 217
I Notation for the elimination graph:
  I Let G^k = (V^k, E^k) be the graph built at step k.
  I G^k describes the structure of A^k after elimination of k pivots.
  I G^k is non-oriented (A^k is symmetric).
  I Fill-in in A^k ≡ adding edges in the graph.
109 / 217
Illustration
Step 1: elimination of pivot 1

[Figure: (a) the elimination graph after removal of vertex 1; (b) factors and active submatrix. Legend: initial nonzeros, fill-in, nonzeros in factors]
Minimum degree algorithm applied to the graph:
I Step k : Select the node with the smallest number of neighbors
I G^k is built from G^{k−1} by suppressing the pivot and adding edges corresponding to fill-in.
111 / 217
Minimum degree algorithm based on elimination graphs
∀i ∈ [1 . . . n]: t_i = |Adj_{G^0}(i)|
For k = 1 to n Do
  p = argmin_{i ∈ V^{k−1}} (t_i)
  For each i ∈ Adj_{G^{k−1}}(p) Do
    Adj_{G^k}(i) = (Adj_{G^{k−1}}(i) ∪ Adj_{G^{k−1}}(p)) \ {i, p}
    t_i = |Adj_{G^k}(i)|
  EndFor
  V^k = V^{k−1} \ {p}
EndFor
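The pseudocode above translates almost line for line into Python over adjacency sets (the example graph is hypothetical; ties are broken here by vertex label order):

```python
# Sketch of minimum-degree ordering via explicit elimination graphs.
def minimum_degree_order(adj):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while adj:
        p = min(adj, key=lambda v: len(adj[v]))   # smallest degree
        nbrs = adj.pop(p)
        for i in nbrs:
            adj[i] = (adj[i] | nbrs) - {i, p}     # add fill edges
        order.append(p)
    return order

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
print(minimum_degree_order(adj))
```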
112 / 217
Illustration (cont’d)
Graphs G1, G2, G3 and the corresponding reduced matrices.

[Figure: (a) elimination graphs; (b) factors and active submatrices. Legend: original nonzero, original nonzero modified, fill-in, nonzeros in factors]
113 / 217
Minimum Degree does not always minimize fill-in!

Consider the following matrix: using the initial ordering, there is no fill-in. Step 1 of Minimum Degree selects pivot 5 (minimum degree = 2); the updated graph adds the edge (4, 6), i.e. fill-in.

[Figure: the matrix, its corresponding elimination graph, and the updated graph after eliminating pivot 5]
114 / 217
Reduce memory complexity
Decrease the size of the working space. Using the elimination graph, the working space is of order O(#nonzeros in factors).

I Property: let pivot be the pivot at step k.
  If i ∈ Adj_{G^{k−1}}(pivot), then Adj_{G^{k−1}}(pivot) ⊂ Adj_{G^k}(i).

I We can then use an implicit representation of fill-in by defining the notions of element (variable already eliminated) and quotient graph. A variable of the quotient graph is adjacent to variables and elements.

I We can show that ∀k ∈ [1 . . . N], the size of the quotient graph is O(|G^0|).
115 / 217
Influence on the structure of factors
Harwell-Boeing matrix dwt 592.rua, structural computing on a submarine. NZ(LU factors) = 58202.

[Spy plot: nz = 5104]
116 / 217
Minimum fill-in heuristics

Recalling the generic algorithm. Let G(A) be the graph associated to a matrix A that we want to order using local heuristics. Let Metric be such that Metric(vi) < Metric(vj) ≡ vi is a better pivot than vj.

Generic algorithm
Loop until all nodes are selected:
  Step 1: select the current node p (the so-called pivot) with minimum metric value,
  Step 2: update the elimination (or quotient) graph,
  Step 3: update Metric(vj) for all non-selected nodes vj.

Step 3 should only be applied to nodes for which the Metric value might have changed.
117 / 217
Minimum fill based algorithm
I Metric(vi) is the amount of fill-in that vi would introduce if it were selected as a pivot.

I Illustration: r has degree d = 4 and a fill-in metric of d × (d − 1)/2 = 6, whereas s has degree d = 5 but a fill-in metric of d × (d − 1)/2 − 9 = 1.

[Figure: the neighborhoods of r and s; most pairs of s's neighbors are already adjacent]
118 / 217
Minimum fill-in properties
I The situation typically occurs when {i1, i2, i3} and {i2, i3, i4, i5} were adjacent to two already selected nodes (here e2 and e1).

e1 and e2 are previously selected nodes.

[Figure: the graph before and after the selection of e1 and e2]

I The elimination of a node vk affects the degree of the nodes adjacent to vk. The fill-in metric of Adj(Adj(vk)) is also affected.

I Illustration: selecting r affects the fill-in metric of i1 (the fill edge (j3, j4) should be deducted).
Impact of fill-reducing heuristics
Number of operations (millions)

               METIS      SCOTCH        PORD        AMF        AMD
gupta2        2757.8      4510.7      4993.3     2790.3     2663.9
ship 003     83828.2     92614.0    112519.6    96445.2   155725.5
twotone      29120.3     27764.7     37167.4    29847.5    29552.9
wang3         4313.1      5801.7      5009.9     6318.0    10492.2
xenon2       99273.1    112213.4    126349.7   237451.3   298363.5

I METIS (Karypis and Kumar) and SCOTCH (Pellegrini) are global strategies (recursive nested dissection based orderings).
I PORD (Schulze, Paderborn Univ.): recursive dissection based on a bottom-up strategy to build the separator.
I AMD (Amestoy, Davis and Duff): local strategy based on Approximate Minimum Degree.
I AMF (Amestoy): local strategy based on Approximate Minimum Fill.
120 / 217
Impact of fill-reducing heuristics
Time for factorization (seconds)

                     1p     16p    32p    64p   128p
coneshl   METIS     970      60     41     27     14
          PORD     1264     104     67     41     26
audi      METIS    2640     198    108     70     42
          PORD     1599     186    146     83     54

Matrices with quasi-dense rows: impact on the analysis time (seconds) of the gupta2 matrix

              AMD   METIS   QAMD
Analysis      361      52     23
Total         379      76     59

I QAMD (Amestoy): Approximate Minimum Degree (local) strategy designed for matrices with quasi-dense rows.
121 / 217
Reordering sparse matrices
  Fill-in and reordering
  The elimination graph model
  Fill-in characterization
  Fill-reducing heuristics
  Reduction to Block Triangular Form
122 / 217
Reduction to Block Triangular Form (BTF)
I Suppose that there exist permutation matrices P and Q such that

    PAQ = ( B11                          )
          ( B21  B22                     )
          ( B31  B32  B33                )
          (  .    .    .    .            )
          ( BN1  BN2  BN3  . . .  BNN    )

I If N > 1, A is said to be reducible (irreducible otherwise).
I Each Bii is supposed to be irreducible (otherwise a finer decomposition is possible).
I Advantage: to solve Ax = b, only the Bii need be factored:

    Bii yi = (Pb)i − Σ_{j=1}^{i−1} Bij yj,  i = 1, . . . , N,  with y = Q^T x
123 / 217
A two stage approach to compute the reduction
I Stage (1): compute a (column) permutation matrix Q1 such that AQ1 has nonzeros on the diagonal (find a maximum transversal of A), then
I Stage (2): compute a (symmetric) permutation matrix P such that P(AQ1)P^T is BTF.

The diagonal blocks of the BTF are uniquely defined. Techniques exist to directly compute the BTF form; they do not present an advantage over this two-stage approach.
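The two stages can be sketched with SciPy building blocks (the 4×4 pattern is hypothetical; ordering the strong components topologically, which completes stage 2, is left out):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching, connected_components

# Hypothetical 4x4 pattern with a zero on the diagonal
A = csr_matrix(np.array([[0., 1., 0., 0.],
                         [1., 1., 0., 0.],
                         [0., 1., 1., 0.],
                         [0., 0., 1., 1.]]))

# Stage 1: maximum transversal -> column permutation Q1 giving
# a zero-free diagonal
q = maximum_bipartite_matching(A, perm_type='column')
B = A[:, q]
assert all(B[i, i] != 0 for i in range(4))

# Stage 2: the strong components of the digraph of B are the
# diagonal blocks Bii; sorting them topologically yields the BTF
n_blocks, labels = connected_components(B, directed=True, connection='strong')
print(n_blocks)
```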
124 / 217
Main components of the algorithm
I Objective: assume AQ1 has nonzeros on the diagonal and compute P such that P(AQ1)P^T is BTF.
I Use the digraph (directed graph) associated to the matrix; self-loops (diagonal entries) are not needed.
I Symmetric permutations on the digraph ≡ relabelling the nodes of the graph.
I If there is a closed path through all nodes in the digraph, then the digraph cannot be subdivided into two parts.
I The strong components of a graph are the maximal sets of nodes such that every pair of them lies on a closed path.
I The strong components of the graph are the diagonal blocks Bii of the BTF form.
125 / 217
Main components of the algorithm
[Figure: the digraph of a 6×6 matrix A, and PAP^T in block triangular form]
126 / 217
Preamble : Finding the triangular form of a matrix
I Observations: a triangular matrix has a BTF form with blocks of size 1 (BTF can be seen as a generalization of the triangular form where each diagonal entry becomes a strong component). Note that if the matrix has triangular form, the digraph has no cycle (i.e. it is acyclic).

I Sargent and Westerberg algorithm:

  1. Select a starting node and trace a path until finding a node with no outgoing edges.
  2. Number the last node first, then eliminate it (and all edges pointing to it).
  3. Continue from the previous node on the path (or choose a new starting node).
127 / 217
Preamble : Finding the triangular form of a matrix
[Figure: the digraph of A; the path/steps trace of the algorithm; A permuted to triangular form (not unique)]
128 / 217
Generalizing Sargent and Westerberg Algorithm tocompute BTF
A composite node is a set of nodes belonging to a closed path. Start from any node and follow a path until:
(1) a closed path is found: collapse all nodes in the closed path into a composite node; the path continues from the composite node;
(2) a node or composite node with no outgoing path is reached: that node or composite node is numbered next.
129 / 217
Generalizing Sargent and Westerberg Algorithm tocompute BTF
[Figure: the digraph of A and the PATHS/STEPS trace of the generalized algorithm]
130 / 217
Generalizing Sargent and Westerberg Algorithm tocompute BTF
[Figure: the digraph of A, the PATHS/STEPS trace, and the resulting BTF of A, PAP^T]
131 / 217
Concluding remarks
Summary:
Fill-in has been characterized. Ordering strategies and associated graphs have been introduced to reduce fill-in or to permute to a target structure (block tridiagonal, recursive block-bordered, block triangular form), with influence on performance (sequential and parallel).

Next step:

How to model sparse factorization and predict the data structures needed to run the numerical factorization?
132 / 217
Outline
Graphs to model factorization
  The elimination tree
133 / 217
Graphs to model factorization
  The elimination tree
134 / 217
The elimination tree for symmetric matrices
The elimination tree provides structural information relevant to describing both the Gaussian elimination of symmetric matrices and the orthogonal factorization. Its extension to unsymmetric matrices will be illustrated.
135 / 217
Set-up
Reminder
I A spanning tree of a connected graph G = (V, E) is a tree T = (V, F) such that F ⊆ E.
I A topological ordering of a rooted tree is an ordering that numbers children vertices before their parent.
I A postorder is a topological ordering which numbers the vertices in any subtree consecutively.

Let A be an n × n symmetric positive-definite and irreducible matrix, A = LL^T its Cholesky factorization, and G+(A) its filled graph (the graph of F = L + L^T).
136 / 217
A first definition
Since A is irreducible, each of the first n − 1 columns of L has at least one off-diagonal nonzero (can you prove it?).

For each column j < n of L, remove all the nonzeros in column j except the first one below the diagonal.

Let Lt denote the remaining structure and consider the matrix Ft = Lt + Lt^T. The graph G(Ft) is a tree called the elimination tree.
[Figure (from Liu, Figs. 2.1–2.2): an example matrix structure and the graphs G(A), G(F), and G(Ft) = T(A)]
137 / 217
A first definition
The elimination tree of A is a spanning tree of G+(A) satisfying the relation PARENT[j] = min{i > j : ℓij ≠ 0}.
[Figure: the matrix A and the filled matrix F = L + L^T (fill-in entries and entries in T(A) highlighted), the graphs G(A) and G+(A) = G(F), and the elimination tree T(A)]

138 / 217
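The PARENT relation can be computed directly from the symmetric pattern without forming L, via Liu's classic algorithm; a sketch (the 5×5 pattern is hypothetical, and path compression is omitted for clarity):

```python
import numpy as np

def etree(A):
    """parent[j] for the elimination tree of a symmetric pattern A."""
    n = A.shape[0]
    parent = np.full(n, -1)
    ancestor = np.full(n, -1)
    for j in range(n):
        for i in np.nonzero(A[j, :j])[0]:   # a_ij != 0 with i < j
            r = i
            # walk from i up to the current root of its subtree
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j             # shortcut toward j
                r = nxt
            if ancestor[r] == -1:
                ancestor[r] = j
                parent[r] = j
    return parent

# Hypothetical 5x5 symmetric pattern
A = np.array([[1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 0, 1, 1]])
print(etree(A))
```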
A second definition: Represents column dependencies
I Dependency between columns of L:
  I Column i > j depends on column j iff ℓij ≠ 0
  I Use a directed graph to express this dependency (edge from j to i if column i depends on column j)
  I Simplify redundant dependencies (transitive reduction)

I The transitive reduction of the directed filled graph gives the elimination tree structure: remove a directed edge (j, i) if there is a path of length greater than one from j to i.
139 / 217
Directed filled graph and its transitive reduction
[Figure: the directed filled graph and its transitive reduction, the elimination tree T(A)]
140 / 217
The elimination tree: a major tool to model factorization
The elimination tree and its generalization to unsymmetric matrices are compact structures of major importance during sparse factorization:

I They express the order in which variables can be eliminated: because the elimination of a variable only affects the elimination of its ancestors, any topological order of the elimination tree will lead to a correct result and to the same fill-in.
I They express concurrency: because variables in separate subtrees do not affect each other, they can be eliminated in parallel.
I They characterize the structure of the factors more efficiently than elimination graphs.
141 / 217
Outline
Efficiency of the solution phase
  Sparsity in the right-hand side and/or solution
142 / 217
Efficiency of the solution phase
  Sparsity in the right-hand side and/or solution
143 / 217
Exploit sparsity of the right-hand-side/solution
Applications
I Highly reducible matrices and/or sparse right-hand sides (linear programming, seismic processing)
I Null-space basis computation
I Partial computation of A−1:
  I Computing variances of the unknowns of a data fitting problem = computing the diagonal of a so-called variance-covariance matrix.
  I Computing short-circuit currents = computing blocks of a so-called impedance matrix.
  I Approximation of the condition number of an SPD matrix.

Core idea

An efficient algorithm has to take advantage of the sparsity of A and of both the right-hand sides and the solution.
Exploit sparsity in RHS: a quick insight into the main properties

solve y ← L \ b

[Figure: in the triangular solve, only the parts of L on the paths from the nonzero entries of b to the root are accessed]

I In all application cases, only part of the factors needs to be loaded
I Objectives with sparse RHS:
  I Efficient use of the RHS sparsity
  I Characterize the LU factors to be loaded from disk
  I Efficiently load only the needed factors from disk

(1) Predicting the structure of the solution vector, Gilbert-Liu, '93
146 / 217
Application : elements in A−1
AA−1 = I; for a specific entry: a−1_ij = (A−1 ej)_i, where A−1 ej is column j of A−1.

Theorem: structure of x (based on Gilbert and Liu '93)

For any matrix A such that A = LU, the structure of the solution x is given by the set of nodes reachable from the nodes associated with the right-hand side entries by paths in the e-tree.

[Figure: a 14-node elimination tree, with the entries of b, y, and x involved in the computation highlighted]

Which factors are needed to compute a−1_82? a−1_82 = (U−1 (L−1 e2))_8.
We have to load: the L factors associated with nodes 2, 3, 7, 14, and the U factors associated with nodes 14, 13, 9, 8.
Note: only a part of the tree is concerned.
151 / 217
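The reach sets quoted above can be reproduced from the tree's parent pointers; the tree below partially follows the slide (paths 2–3–7–14 and 8–9–13–14), with the remaining parent links hypothetical.

```python
# Sketch: which parts of the factors are needed for a single entry
# a^{-1}_ij = (U^{-1} (L^{-1} e_j))_i, using the elimination tree.
def path_to_root(parent, i):
    path = []
    while i != -1:
        path.append(i)
        i = parent[i]
    return path

# 14-node elimination tree (1-based labels as on the slide; links
# outside the two quoted paths are hypothetical), root = 14
parent = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 14,
          8: 9, 9: 13, 10: 12, 11: 12, 12: 13, 13: 14, 14: -1}
L_nodes = path_to_root(parent, 2)   # forward solve: from j = 2 up to the root
U_nodes = path_to_root(parent, 8)   # backward solve: from the root down to i = 8
print(L_nodes, U_nodes[::-1])
```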
Entries of the inverse: a single one
Notation for later use

P(i) denotes the nodes in the unique path from node i to the root node r (including i and r).
P(S) denotes ∪_{s∈S} P(s) for a set of nodes S.

Use the elimination tree

For each requested (diagonal) entry a−1_ii:
(1) visit the nodes of the elimination tree from node i to the root; at each node, access the necessary parts of L,
(2) visit the nodes from the root back to node i; this time, access the necessary parts of U.
155 / 217
Experiments: interest of exploiting sparsity
Implementation
These ideas have been implemented in the MUMPS solver during Tz. Slavova's PhD.

Experiments: computation of the diagonal of the inverse of matrices from data fitting in Astrophysics (CESR, Toulouse)

    Matrix size    Time (s)
                   No ES       ES
     46,799         6,944      47
    272,358        27,728     408
    148,286          >24h   1,391

Interest

Exploiting the sparsity of the right-hand sides reduces the number of accesses to the factors (in-core: number of flops; out-of-core: accesses to hard disks).
156 / 217
Outline
Factorization of sparse matrices
  Introduction
  Elimination tree and Multifrontal approach
  Equivalent orderings and elimination trees
  Some parallel solvers
  Concluding remarks
157 / 217
Factorization of sparse matrices
Outline
1. Introduction
2. Elimination tree and multifrontal method
3. Postorderings, equivalent orderings and memory usage
4. Concluding remarks
158 / 217
Factorization of sparse matrices
  Introduction
  Elimination tree and Multifrontal approach
  Equivalent orderings and elimination trees
  Some parallel solvers
  Concluding remarks
159 / 217
Recalling the Gaussian elimination
Step k of LU factorization (akk pivot):

I For i > k compute lik = aik/akk (= a′ik),
I For i > k, j > k such that aik and akj are nonzero:

    a′ij = aij − (aik × akj)/akk

I If aik ≠ 0 and akj ≠ 0, then a′ij ≠ 0
I If aij was zero → its non-zero value must be stored
I Orderings (minimum degree, Cuthill-McKee, ND) limit fill-in and the number of operations, and modify the task graph
160 / 217
Three-phase scheme to solve Ax = b
1. Analysis step
   I Preprocessing of A (symmetric/unsymmetric orderings, scalings)
   I Build the dependency graph (elimination tree, eDAG, . . . )
2. Factorization (A = LU, LDL^T, LL^T, QR), with numerical pivoting
3. Solution based on factored matrices
   I Triangular solves: Ly = b, then Ux = y
   I Improvement of the solution (iterative refinement), error analysis
161 / 217
Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachEquivalent orderings and elimination treesSome parallel solversConcluding remarks
162 / 217
Elimination tree and Multifrontal approach
We recall that:
The elimination tree is the dependency graph of the factorization.

Building the elimination tree (small matrices)

I Permute the matrix (to reduce fill-in): PAP^T.
I Build the filled matrix AF = L + L^T, where PAP^T = LL^T.
I Take the transitive reduction of the associated filled graph.

Multifrontal approach

→ Each column corresponds to a node of the tree.
→ (Multifrontal) Each node k of the tree corresponds to the partial factorization of a frontal matrix whose row and column structure is that of column k of AF.
163 / 217
Illustration of multifrontal factorization
We assume pivots are chosen down the diagonal in order.
[Figure: the filled matrix (fill entries marked F) and the elimination graph/tree with frontal matrices {1, 3, 4}, {2, 3, 4}, {3, 4}, {4}]

Treatment at each node:

I Assembly of the frontal matrix using the contributions from the sons.
I Gaussian elimination on the frontal matrix.
164 / 217
I Elimination of variable 1 (a11 pivot)
I Assembly of the frontal matrix:

        1  3  4
    1 [ x  x  x ]
    3 [ x       ]
    4 [ x       ]

I Contributions: a^(1)_ij = −(ai1 × a1j)/a11 for i > 1, j > 1, on a33, a44, a34 and a43:

    a^(1)_33 = −(a31 × a13)/a11    a^(1)_34 = −(a31 × a14)/a11
    a^(1)_43 = −(a41 × a13)/a11    a^(1)_44 = −(a41 × a14)/a11

The terms −(ai1 × a1j)/a11 of the contribution matrix are stored for later updates.
165 / 217
I Elimination of variable 2 (a22 pivot)
I Assembly of the frontal matrix: update of the elements of the pivot row and column using contributions from previous updates (none here)

        2  3  4
    2 [ x  x  x ]
    3 [ x       ]
    4 [ x       ]

I Contributions on a33, a34, a43, and a44:

    a^(2)_33 = −(a32 × a23)/a22    a^(2)_34 = −(a32 × a24)/a22
    a^(2)_43 = −(a42 × a23)/a22    a^(2)_44 = −(a42 × a24)/a22
166 / 217
I Elimination of variable 3.
I Assembly of the frontal matrix using the previous contribution matrices

    ( a^(1)_33  a^(1)_34 )        ( a^(2)_33  a^(2)_34 )
    ( a^(1)_43  a^(1)_44 )  and   ( a^(2)_43  a^(2)_44 ):

    a′33 = a33 + a^(1)_33 + a^(2)_33
    a′34 = a34 + a^(1)_34 + a^(2)_34    (a34 = 0)
    a′43 = a43 + a^(1)_43 + a^(2)_43    (a43 = 0)
    a′44 = a^(1)_44 + a^(2)_44

I Contribution on variable 4: a^(3)_44 = a′44 − (a′43 × a′34)/a′33
167 / 217
        3  4
    3 [ x  x ]
    4 [ x    ]

I Contribution on a44: a^(3)_44 = a′44 − (a′43 × a′34)/a′33
  Note that a44 is only partially summed, since it does not yet include the entry from the initial matrix.

I Elimination of variable 4
  I The front involves only a44: a44 = a44 + a^(3)_44

        4
    4 [ x ]
168 / 217
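The whole worked example can be run numerically; the values below are hypothetical (only the sparsity pattern matches the slides), and the last pivot is checked against a dense factorization.

```python
import numpy as np

# Toy multifrontal run on the slides' 4x4 pattern (hypothetical values)
A = np.array([[4., 0., 1., 1.],
              [0., 4., 1., 1.],
              [1., 1., 4., 0.],
              [1., 1., 0., 4.]])

def partial_factor(F):
    """Eliminate the first variable of a dense front; return the
    Schur complement (the contribution block for the parent)."""
    return F[1:, 1:] - np.outer(F[1:, 0], F[0, 1:]) / F[0, 0]

def leaf_front(A, p, rest):
    """Leaf front for pivot p: only row/column p of A is nonzero."""
    idx = [p] + rest
    F = np.zeros((len(idx), len(idx)))
    F[0, :] = A[p, idx]
    F[:, 0] = A[idx, p]
    return F

C1 = partial_factor(leaf_front(A, 0, [2, 3]))   # front {1, 3, 4}
C2 = partial_factor(leaf_front(A, 1, [2, 3]))   # front {2, 3, 4}
# Front {3, 4}: assemble a33, a34, a43 from A plus both contributions
F3 = np.array([[A[2, 2], A[2, 3]], [A[3, 2], 0.]]) + C1 + C2
C3 = partial_factor(F3)
a44 = A[3, 3] + C3[0, 0]                        # root front {4}

# The last multifrontal pivot equals the last dense LU pivot
assert np.isclose(a44, np.linalg.det(A) / np.linalg.det(A[:3, :3]))
```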
Supernodes/supervariables
Definition 1
A supernode (or supervariable) is a set of contiguous columns in the factors L that share essentially the same sparsity structure. Columns i1, i2, . . . , ip in the filled graph form a supernode if |L_{i_{k+1}}| = |L_{i_k}| − 1.

I All algorithms (ordering, symbolic factorization, factorization, solve) generalize to blocked versions.
I Use of efficient matrix-matrix kernels (improves cache usage).
I Same concept as supervariables for the elimination tree/minimum degree ordering.
I Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.
169 / 217
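The defining condition can be checked mechanically from per-column structures. A sketch on an assumed toy example (the function name `supernodes` is ours):

```python
# Per-column structures of L (sorted row indices, including the diagonal).
# Toy example, assumed for illustration.
struct = {
    1: [1, 2, 3],
    2: [2, 3],
    3: [3],
    4: [4, 5],
    5: [5],
}

def supernodes(struct):
    """Group contiguous columns whose structures are nested and shrink by one."""
    cols = sorted(struct)
    groups, cur = [], [cols[0]]
    for prev, col in zip(cols, cols[1:]):
        s_prev, s_col = struct[prev], struct[col]
        # col joins the supernode iff struct(prev) = {prev} U struct(col)
        if len(s_col) == len(s_prev) - 1 and s_prev[1:] == s_col:
            cur.append(col)
        else:
            groups.append(cur)
            cur = [col]
    groups.append(cur)
    return groups

print(supernodes(struct))   # [[1, 2, 3], [4, 5]]
```

Each resulting group can then be treated as one block column in the factorization, enabling the Level 3 BLAS kernels mentioned above.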
Amalgamation
I Goal
  I Exploit a more regular structure in the original matrix
  I Decrease the amount of indirect addressing
  I Increase the size of frontal matrices
I How?
  I Relax the number of nonzeros of the matrix
  I Amalgamate nodes of the elimination tree
170 / 217
I Consequences?
  I Increase in the total amount of flops
  I But decrease in indirect addressing
  I And increase in performance
I Amalgamation of supernodes (same lower diagonal structure) introduces
  no fill-in
171 / 217
Illustration of amalgamation
[Figure: original 6x6 matrix with fill entries F3 (created when
 eliminating variable 3) and F4 (when eliminating variable 4), and the
 associated elimination tree. Node structures, from the leaves {1,4} and
 {2,4,5} up to the root: 3 -> {3,5,6}, 4 -> {4,5,6}, 5 -> {5,6},
 6 -> {6} (root).]

Structure of node i = frontal matrix, noted i, i1, i2, . . . , if
172 / 217
Illustration of amalgamation
Amalgamation

[Figure: left, an amalgamation WITHOUT fill-in, merging nodes of
 identical lower diagonal structure ({4,5,6} with {5,6}); right, an
 amalgamation WITH fill-in, merging {3,5,6} and {4,5,6} into the front
 {3,4,5,6}. Fill-in: (3,4) and (4,3).]
173 / 217
Amalgamation and Supervariables
Amalgamation of supervariables does not cause fill-in.

[Figure: initial graph on 13 vertices.]

Reordering: 1, 3, 4, 2, 6, 8, 10, 11, 5, 7, 9, 12, 13
Supervariables: {1, 3, 4} ; {2, 6, 8} ; {10, 11} ; {5, 7, 9, 12, 13}
174 / 217
Supervariables and multifrontal method
AT EACH NODE:

    F22 ← F22 − F12^T F11^{−1} F12

Pivots can ONLY be chosen from the F11 block, since F22 is NOT fully
summed.
175 / 217
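This partial factorization can be sketched numerically. A minimal sketch, assuming NumPy; sizes and values are invented. It is written for the general (unsymmetric) update F22 ← F22 − F21 F11⁻¹ F12, which reduces to the symmetric form above when F21 = F12^T:

```python
import numpy as np

# Partial dense factorization of a frontal matrix: eliminate the fully
# summed F11 block and update F22 with the Schur complement.
rng = np.random.default_rng(0)
p, q = 3, 2                          # fully summed / not fully summed sizes
F = rng.standard_normal((p + q, p + q))
F[:p, :p] += (p + q) * np.eye(p)     # make F11 safely nonsingular (assumed)

F11, F12 = F[:p, :p], F[:p, p:]
F21, F22 = F[p:, :p], F[p:, p:]

# The contribution block sent to the parent front.
schur = F22 - F21 @ np.linalg.solve(F11, F12)

# Sanity check via the Schur determinant identity:
# det(F) = det(F11) * det(F22 - F21 F11^{-1} F12).
assert np.allclose(np.linalg.det(F), np.linalg.det(F11) * np.linalg.det(schur))
```

In a real solver this step is done with dense kernels (LU/LLT of F11, then TRSM and GEMM), which is exactly the Level 3 BLAS usage described on the next slide.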
Multifrontal method
From children to parent
I ASSEMBLY: scatter-add operations (indirect addressing)
I ELIMINATION: dense partial Gaussian elimination, Level 3 BLAS
  (TRSM, GEMM)
I CONTRIBUTION block passed to the parent
176 / 217
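The assembly step above is a scatter-add through index lists. A minimal sketch, assuming NumPy; variable lists and values are invented:

```python
import numpy as np

# Scatter-add assembly of a parent front from one child's contribution
# block, using indirect addressing through index lists.
parent_vars = [3, 5, 8, 9]                 # global variables of the parent front
parent = np.zeros((4, 4))

child_vars = [5, 9]                        # variables of the child's block
contrib = np.array([[1.0, 2.0],
                    [3.0, 4.0]])

# Map each child variable to its position in the parent front.
pos = [parent_vars.index(v) for v in child_vars]   # here [1, 3]
parent[np.ix_(pos, pos)] += contrib                # the scatter-add

print(parent[1, 1], parent[1, 3], parent[3, 1], parent[3, 3])  # 1.0 2.0 3.0 4.0
```

The indirection (`pos`, `np.ix_`) is what the slides mean by indirect addressing: it is cheap per entry but memory-bound, in contrast with the BLAS 3 elimination step.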
The multifrontal method [Duff & Reid ’83]
[Figure: 5x5 matrix A and its factors L+U-I, with the elimination tree
 (node structures: 1 -> {1,4,5}, 2 -> {2,3}, 3 -> {3,4}, 4 -> {4,5},
 5 -> {5}).]

Storage is divided into two parts:

I Factors
I Active memory = active frontal matrix + stack of contribution blocks
182 / 217
The multifrontal method [Duff & Reid ’83]
I Factors are incompressible and usually scale fairly well; they can
  optionally be written to disk.
I In the sequential case, the traversal that minimizes active memory is
  known [Liu '86].
I In parallel, active memory becomes dominant.
  Example: share of active storage on the AUDI matrix using MUMPS 4.10.0:
    1 processor: 11%
    256 processors: 59%
195 / 217
Postorderings and memory usage
I Assumptions:
  I Tree processed from the leaves to the root
  I Parents processed as soon as all children have completed
    (postorder of the tree)
  I Each node produces and sends temporary data consumed by its father.
I Exercise: in which sense is a postordering-based tree traversal more
  interesting than a random topological ordering?
I Furthermore, memory usage also depends on the postordering chosen:

[Figure: two postorderings of the same tree, leaves a, b at the top and
 root i at the bottom. Best: (a b c d e f g h i).
 Worst: (h f d b a c e g i).]
196 / 217
Example 1: Processing a wide tree
[Figure: a wide tree (root 7 above leaves and node 6, itself parent of
 4 and 5) and the evolution of memory while its nodes are processed in
 postorder. Legend: unused memory space / stack memory space /
 factor memory space / non-free memory space.]
Example 2: Processing a deep tree
[Figure: a deep tree (leaves 1 and 2 feeding node 3, then root 4) and
 the evolution of memory while its nodes are processed. Legend: unused
 memory space / stack memory space / factor memory space / non-free
 memory space. For node 3 the steps shown are: allocation of 3,
 assembly step for 3, factorization step for 3, stack step for 3.]
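The wide-versus-deep contrast can be simulated with a small stack model. A sketch under assumed simplifications (unit contribution blocks, fronts of size 2; `peak_stack` is our own helper, not solver code):

```python
# Simulate stack (active) memory during a postorder multifrontal
# traversal: children's contribution blocks are stacked, then consumed
# when the parent front is assembled.
def peak_stack(tree, cb, front, node, stack=0, peak=0):
    """tree: node -> children; cb: contribution-block size per node;
    front: frontal-matrix size per node. Returns (stack, peak)."""
    for child in tree.get(node, []):
        stack, peak = peak_stack(tree, cb, front, child, stack, peak)
    # The front is allocated while the children's blocks are still stacked.
    peak = max(peak, stack + front[node])
    # Consume the children's blocks, push this node's own block.
    stack -= sum(cb[c] for c in tree.get(node, []))
    stack += cb[node]
    return stack, peak

# Wide tree: root 7 with children 1..6, versus a deep chain 1-2-3-4.
wide = {7: [1, 2, 3, 4, 5, 6]}
deep = {4: [3], 3: [2], 2: [1]}
cb    = {n: 1 for n in range(1, 8)}   # every contribution block has size 1
front = {n: 2 for n in range(1, 8)}   # every front has size 2

print(peak_stack(wide, cb, front, 7))  # peak 8: six stacked blocks + root front
print(peak_stack(deep, cb, front, 4))  # peak 3: one stacked block + one front
```

With equal node sizes, the wide tree must hold all six sibling contributions at once while the deep chain never holds more than one, which is the point of the two examples above.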
Modelization of the problem
I Mi : memory peak for the complete subtree rooted at i
I tempi : temporary memory produced by node i
I mparent : memory for storing the parent

[Figure: a parent front of size mparent with children subtrees of peaks
 M1, M2, M3, producing temporary blocks temp1, temp2, temp3.]

    Mparent = max( max_{j=1..nbchildren} ( Mj + sum_{k=1..j-1} tempk ),
                   mparent + sum_{j=1..nbchildren} tempj )            (1)

Objective: order the children to minimize Mparent
201 / 217
Memory-minimizing schedules
Theorem 1 ([Liu '86])

The minimum of max_j ( xj + sum_{i=1..j-1} yi ) over all orderings of
the sequence (xi, yi) is obtained when the pairs are sorted in
decreasing order of xi - yi.

Corollary 1

An optimal child sequence is obtained by rearranging the children nodes
in decreasing order of Mi - tempi.

Interpretation: at each level of the tree, a child with a relatively
large memory peak in its subtree (Mi large with respect to tempi)
should be processed first.

⇒ Apply on the complete tree starting from the leaves
  (or from the root with a recursive approach)
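Theorem 1 is easy to verify by brute force on small data. A sketch with assumed toy pairs:

```python
from itertools import permutations

# Check: sorting (x_i, y_i) by decreasing x_i - y_i minimizes
# max_j ( x_j + sum_{i<j} y_i ). The pairs are an assumed toy example.
pairs = [(9, 2), (5, 4), (7, 1), (6, 6)]

def cost(seq):
    total, best = 0, 0
    for x, y in seq:
        best = max(best, total + x)   # peak when processing this element
        total += y                    # its temporary data stays behind
    return best

brute = min(cost(p) for p in permutations(pairs))
liu = cost(sorted(pairs, key=lambda p: p[0] - p[1], reverse=True))
assert liu == brute
print(liu)   # 13
```

With (x, y) = (Mi, tempi) this is exactly the child-ordering rule of Corollary 1.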
Optimal tree reordering
Objective: minimize the peak of stack memory

    Tree_Reorder(T):
      for all i in the set of root nodes do
        Process_Node(i)
      end for

    Process_Node(i):
      if i is a leaf then
        Mi = mi
      else
        for j = 1 to nbchildren do
          Process_Node(j-th child)
        end for
        Reorder the children of i in decreasing order of (Mj - tempj)
        Compute Mi at node i using Formula (1)
      end if
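The recursion above can be sketched directly. A minimal sketch (the tree and the sizes m, temp are assumed toy values; `process_node` mirrors Process_Node):

```python
# Compute each subtree peak M_i with the children ordered by decreasing
# M_j - temp_j, applying formula (1) at each node.
def process_node(i, children, m, temp, M):
    kids = children.get(i, [])
    if not kids:
        M[i] = m[i]
        return
    for j in kids:
        process_node(j, children, m, temp, M)
    kids.sort(key=lambda j: M[j] - temp[j], reverse=True)   # optimal order
    running, peak = 0, 0
    for j in kids:                                          # formula (1)
        peak = max(peak, running + M[j])
        running += temp[j]
    M[i] = max(peak, m[i] + running)

children = {4: [1, 2, 3]}                 # a root with three leaves
m    = {1: 8, 2: 3, 3: 5, 4: 4}           # frontal-matrix memory per node
temp = {1: 1, 2: 2, 3: 4, 4: 0}           # contribution produced per node
M = {}
process_node(4, children, m, temp, M)
print(M[4], children[4])                  # peak and the reordered children
```

Here the leaf with the largest Mi − tempi (node 1) is scheduled first, and M4 = 11 is the minimal peak over all child orders.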
Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachEquivalent orderings and elimination treesSome parallel solversConcluding remarks
205 / 217
Equivalent orderings of symmetric matrices
Let F be the filled matrix of a symmetric matrix A (that is,
F = L + L^T, where A = LL^T); G+(A) = G(F) is the associated filled
graph.

Definition 2 (Equivalent orderings)

P and Q are said to be equivalent orderings iff
G+(PAP^T) = G+(QAQ^T).
By extension, a permutation P is said to be an equivalent ordering of a
matrix A iff G+(PAP^T) = G+(A).

It can be shown that an equivalent ordering also preserves the amount
of arithmetic operations of the sparse Cholesky factorization.
206 / 217
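Whether two orderings are equivalent can be tested with the classical "elimination game": eliminate vertices in order, connecting the remaining neighbors of each eliminated vertex; equivalent orderings produce the same filled graph. A sketch on an assumed small example (a path 1-2-3-4):

```python
# Compute the fill edges produced by eliminating the vertices of an
# undirected graph in a given order (the elimination game).
def filled_edges(adj, order):
    adj = {v: set(ns) for v, ns in adj.items()}   # working copy
    fill = set()
    for v in order:
        nbrs = adj[v]
        for a in nbrs:                             # clique the neighbors
            for b in nbrs:
                if a < b and b not in adj[a]:
                    adj[a].add(b); adj[b].add(a)
                    fill.add((a, b))
        for a in nbrs:
            adj[a].discard(v)
        del adj[v]
    return fill

path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
# Eliminating leaves first never fills: both orders are equivalent.
assert filled_edges(path, [1, 2, 3, 4]) == set()
assert filled_edges(path, [4, 3, 2, 1]) == set()
# Eliminating the interior vertex 2 first creates the fill edge (1,3).
assert filled_edges(path, [2, 1, 3, 4]) == {(1, 3)}
print("ok")
```

Since the filled graph determines the work of sparse Cholesky, equal fill sets illustrate why equivalent orderings preserve the operation count.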
Relation with elimination trees
I Let A be a reordered matrix, and G+(A) be its filled graph.
I In the elimination tree, any topological tree traversal (that
  processes children before parents) corresponds to an equivalent
  ordering P of A, and the elimination tree of PAP^T is identical to
  that of A.

[Figure: the filled graph G+(A) on vertices u, v, w, x, y, z (numbered
 1-6 by a topological ordering), the filled matrix A with fill entries
 F, and the elimination tree of A.]
207 / 217
Tree rotations (Liu, 1988)
Definition 3

An ordering that does not introduce any fill is referred to as a
perfect elimination ordering (PEO for short).

The natural ordering is a PEO of the filled matrix F.

Theorem 2

For any node x of G+(A) = G(F), there exists a PEO of G(F) such that x
is numbered last.

I Essence of tree rotations:
  I Nodes in the clique of x in F are numbered last
  I The relative ordering of the other nodes is preserved.
208 / 217
Example of equivalent orderings
On the right-hand side, a tree rotation is applied on w (the clique of
w is {w, x}; for the other nodes the relative ordering w.r.t. the tree
on the left is preserved).

[Figure: left, the filled graph and elimination tree of F with a
 postordering of u, v, w, x, y, z; right, the rotated tree, where the
 clique {w, x} is numbered last, with the corresponding filled matrix
 F.]

Remark: tree rotations can help reduce the temporary memory usage!
Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachEquivalent orderings and elimination treesSome parallel solversConcluding remarks
211 / 217
Some parallel solvers
212 / 217
Some (shared memory) sparse direct codes
Code           Technique       Scope     Availability (www.)
BCSLIB         Multifrontal    SYM/UNS   Boeing → Access Analytics
HSL MA87       Supernodal      SPD       cse.clrc.ac.uk/Activity/HSL
MA41           Multifrontal    UNS       cse.clrc.ac.uk/Activity/HSL
MA49           Multifr. QR     RECT      cse.clrc.ac.uk/Activity/HSL
PanelLLT       Left-looking    SPD       Ng
PARDISO        Left-right      SYM/UNS   Schenk
PSL†           Left-looking    SPD/UNS   SGI product
SuperLU MT     Left-looking    UNS       nersc.gov/∼xiaoye/SuperLU
SuiteSparseQR  Multifr. QR     RECT      cise.ufl.edu/research/sparse/SPQR
TAUCS          Left/Multifr.   SYM/UNS   tau.ac.il/∼stoledo/taucs
WSMP†          Multifrontal    SYM/UNS   IBM product

† Only object code is available.
213 / 217
Distributed-memory sparse direct codes
Code      Technique        Scope     Availability (www.)
DSCPACK   Multifr./Fan-in  SPD       cse.psu.edu/∼raghavan/Dscpack
MUMPS     Multifrontal     SYM/UNS   graal.ens-lyon.fr/MUMPS
                                     mumps.enseeiht.fr
PaStiX    Fan-in           SPD       labri.fr/perso/ramet/pastix
PSPASES   Multifrontal     SPD       cs.umn.edu/∼mjoshi/pspases
SPOOLES   Fan-in           SYM/UNS   netlib.org/linalg/spooles
SuperLU   Fan-out          UNS       nersc.gov/∼xiaoye/SuperLU
S+        Fan-out          UNS       cs.ucsb.edu/research/S+
WSMP†     Multifrontal     SYM       IBM product

† Only object code is available.
214 / 217
Case study: Comparison of MUMPS and SuperLU
215 / 217
Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachEquivalent orderings and elimination treesSome parallel solversConcluding remarks
216 / 217
Concluding remarks
I Key parameters in selecting a method:
  1. Functionalities of the solver
  2. Characteristics of the matrix
     I Numerical properties and pivoting
     I Symmetric or general
     I Pattern and density
  3. Preprocessing of the matrix
     I Scaling
     I Reordering to minimize fill-in
  4. Target computer (architecture)
I Substantial gains can be achieved with an adequate solver, in terms
  of numerical precision, computing time and storage.
I A good knowledge of both the matrix and the solvers is needed.
I Many challenging problems
I Active research area
217 / 217