
A parallel block LU decomposition method for distributed finite element matrices



Parallel Computing 37 (2011) 742–758


Daniel Maurer, Christian Wieners
Institute for Applied and Numerical Mathematics 3, Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany


Article history: Available online 7 June 2011

Keywords: Parallel computing; Finite elements; Direct solver for linear equations; Block LU decomposition


Abstract. In this work we present a new parallel direct linear solver for matrices resulting from finite element problems. The algorithm follows the nested dissection approach, where the resulting Schur complements are also distributed in parallel. The sparsity structure of the finite element matrices is used to pre-compute an efficient block structure for the LU factors. We demonstrate the performance and the parallel scaling behavior by several test examples.


1. Introduction

Finite element applications often require a fine mesh, which results in several million or even billion unknowns. To solve the resulting equations in a reasonable computing time, a parallel machine is required. For this purpose several applications require parallel direct solvers, since they are more general and more robust than iterative solvers. For example, preconditioned CG methods are restricted to symmetric positive definite systems, and multigrid or domain decomposition preconditioners also require an efficient parallel coarse problem solver. A direct solver often performs well in cases where many iterative solvers fail, e.g., if the problem is indefinite, unsymmetric, or ill-conditioned.

Our concept of the parallel LU decomposition for matrices resulting from finite element problems is based on a nested dissection approach. On $P = 2^S$ processors, the algorithm uses $S + 1$ steps in which sets of processors are combined consistently and the problems between these processors are solved, beginning with sets of a single processor in the first step. The first step itself is comparable to the preprocessing step for non-overlapping domain decomposition methods [29], where the resulting Schur complement is then solved by a suitable preconditioned iteration. Here, a parallel LU decomposition for this Schur complement problem is introduced. A simple nested dissection method fails for finite element applications since, in particular in 3D, the resulting Schur complement problems grow into large dense problems, so that they too have to be solved in a distributed and parallel way.

Block LU factorizations and their analysis have been discussed by various authors, see e.g. [14,10,18,8]. A general purpose algorithm for sparse matrices is realized, e.g., in MUMPS [1] for distributed memory and, e.g., in PARDISO [25,26] on shared memory machines. Parallel solvers can also be used within hybrid methods for solving subproblems, e.g., in SPIKE [23]. Sequential solvers for sparse matrices such as SuperLU [9] and UMFPACK [28] can be used to eliminate local degrees of freedom. A more general discussion of block LU decompositions and of handling linear systems on parallel computers can be found in [7,11,13]. A block LU decomposition method with iterative methods for the Schur complement can be found, e.g., in [4].

Our new parallel solver explicitly uses the structure of the finite element matrix. Thus, it is not a "black box" solver: knowledge of the structure of the finite element matrix and of the decomposition of the domain onto the processors is essential.


Our new contribution is the efficient and transparent use of this structure for the parallel distribution of the elimination steps. In particular, we can identify a priori parts of the resulting LU decomposition which remain zero, so that they can be ignored during the algorithm.

The paper is organized as follows. In Section 2 a general setting for parallel finite elements is introduced which leads to a parallel block structure for the corresponding matrix. The mesh is first distributed to P processors and then refined locally l times, such that each processor handles a local part of the mesh. Then, a suitable block LU decomposition is defined, and the parallel realization is discussed in Section 3. In Section 4 we introduce several finite element problems which are used in Section 5 for the evaluation of the parallel performance of our direct solver. In some cases the results are compared with the parallel direct solver MUMPS [1].

2. Parallel finite elements

Following [30,31], we define a parallel additive representation of the stiffness matrix which directly corresponds to a parallel domain decomposition. This additive representation is the basis for the LU algorithm discussed in the next section.

2.1. The finite element setting

Let $\Omega \subset \mathbb{R}^d$ ($d = 2,3$) be a reference domain, and let $V \subset L_2(\Omega;\mathbb{R}^m)$ be a Hilbert space. We consider finite element approximations of the linear variational equation

find $u \in V$ such that $a(u,v) = \ell(v)$ for all $v \in V$, \quad (1)

where $a(\cdot,\cdot)\colon V \times V \to \mathbb{R}$ is a bilinear form and $\ell(\cdot)\colon V \to \mathbb{R}$ is a functional. Let $V_{\mathcal{I}} = \operatorname{span}\{\phi_i : i \in \mathcal{I}\}$ be a finite element space with basis $(\phi_i)$, where $\mathcal{I}$ is the corresponding index set of size $|\mathcal{I}| = N$.

The finite element approximation is the Galerkin projection of (1):

find $u_{\mathcal{I}} \in V_{\mathcal{I}}$ such that $a(u_{\mathcal{I}}, v_{\mathcal{I}}) = \ell(v_{\mathcal{I}})$ for all $v_{\mathcal{I}} \in V_{\mathcal{I}}$. \quad (2)

Inserting the basis functions $\phi_i$, we obtain the linear matrix problem

find $x \in \mathbb{R}^N$ such that $Ax = b$, \quad (3)

where $A = (A[i,j])_{i,j \in \mathcal{I}} \in \mathbb{R}^{N \times N}$ is the stiffness matrix with entries $A[i,j] = a(\phi_i,\phi_j)$, and $b = (\ell(\phi_i))_{i \in \mathcal{I}} \in \mathbb{R}^N$ is the right-hand side. This defines the finite element solution $u_{\mathcal{I}} = \sum_{i \in \mathcal{I}} x_i \phi_i$.

We assume that the finite element discretization is based on a decomposition into a set of cells $\mathcal{C}$, such that
$$\Omega = \bigcup_{c \in \mathcal{C}} \Omega_c,$$

where the cell domain is denoted by $\Omega_c$. In the finite element context, the bilinear form $a(\cdot,\cdot)$ allows for a corresponding cell-based additive decomposition
$$a(v,w) = \sum_{c \in \mathcal{C}} a_c(v,w), \qquad v,w \in V.$$

We assume that the finite element basis functions $\phi_i$ are associated to nodal points $z_i \in \overline{\Omega}$, and that
$$\operatorname{supp} \phi_i \subset \bigcup_{c \in \mathcal{C}_i} \overline{\Omega}_c, \qquad \mathcal{C}_i = \{c \in \mathcal{C} : z_i \in \overline{\Omega}_c\}. \quad (4)$$

Then, we have $a_c(\phi_i,\phi_j) \neq 0$ only if $c \in \mathcal{C}_i \cap \mathcal{C}_j$.

2.2. Parallel load balancing

Let $\mathcal{P} = \{1,\ldots,P\}$ be the set of processors. For simplicity, we assume $P = 2^S$. On $\mathcal{P}$, a load balancing determines a disjoint decomposition $\mathcal{C} = \mathcal{C}_1 \cup \cdots \cup \mathcal{C}_P$. This corresponds to a non-overlapping domain decomposition such that we have

$$\Omega = \bigcup_{p \in \mathcal{P}} \Omega_p, \qquad \Omega_p = \bigcup_{c \in \mathcal{C}_p} \Omega_c. \quad (5)$$

We assume that $\mathcal{C}_p \neq \emptyset$ for $p = 1,\ldots,P$.

The efficiency of the block LU decomposition relies on small processor interfaces in the final steps. This is a well-known graph-partitioning problem [16,17]. Examples of suitable load balancing algorithms are recursive coordinate bisection [16] or spectral bisection algorithms [24].

Depending on the domain decomposition (5) we define $\mathcal{I}_p = \{i \in \mathcal{I} : z_i \in \overline{\Omega}_p\}$ and $N_p = |\mathcal{I}_p|$ for each processor $p \in \mathcal{P}$. For $i \in \mathcal{I}$ we define
$$\pi(i) = \{p \in \mathcal{P} : i \in \mathcal{I}_p\}.$$
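As a small illustration (an addition to the text, not taken from the paper), the sets $\mathcal{I}_p$ and $\pi(i)$ can be computed directly from the cell-to-processor assignment and the cell-to-node incidence. The following Python sketch assumes a toy mesh given as a dictionary from cell names to a processor number and a node list; all names are hypothetical.

    # Sketch: derive I_p and pi(i) from a toy cell -> (processor, node list) assignment.
    cells = {
        "c1": (1, [0, 1, 3, 4]),
        "c2": (2, [1, 2, 4, 5]),
        "c3": (1, [3, 4, 6, 7]),
        "c4": (2, [4, 5, 7, 8]),
    }

    I_p = {}                              # I_p: all nodal indices touched by the cells of processor p
    for p, nodes in cells.values():
        I_p.setdefault(p, set()).update(nodes)

    pi = {}                               # pi(i) = { p : i in I_p }
    for p, indices in I_p.items():
        for i in indices:
            pi.setdefault(i, set()).add(p)

    print(I_p)                            # {1: {0, 1, 3, 4, 6, 7}, 2: {1, 2, 4, 5, 7, 8}}
    print({i for i, procs in pi.items() if len(procs) > 1})   # interface indices: {1, 4, 7}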


2.2.1. Parallel assembling and parallel block structure

On every processor $p \in \mathcal{P}$ we compute the entries of the local stiffness matrix
$$A^p = (A^p[i,j])_{i,j \in \mathcal{I}_p} \in \mathbb{R}^{N_p \times N_p}, \qquad A^p[i,j] = \sum_{c \in \mathcal{C}_p} a_c(\phi_i,\phi_j).$$

Now we introduce a further block structure which decomposes $A^p$. Let $\Pi = 2^{\mathcal{P}}$ be the set of all possible processor sets. We define the subset of all active processor sets $\Pi_{\mathcal{I}} = \{\pi(i) \in \Pi : i \in \mathcal{I}\}$ with respect to the corresponding matrix graph. A suitable numbering of the active processor sets
$$\Pi_{\mathcal{I}} = \{\pi_1, \pi_2, \ldots, \pi_K\} \quad (6)$$

directly induces a non-overlapping decomposition
$$\mathcal{I} = \mathcal{I}_1 \cup \cdots \cup \mathcal{I}_K \quad \text{with} \quad \mathcal{I}_k = \{i \in \mathcal{I} : \pi(i) = \pi_k\}.$$

Note that we have $\mathcal{I}_k \subset \mathcal{I}_p$ for all $p \in \pi_k$. We have $N = \sum_{k=1}^{K} N_k$ and $N_p = \sum_{k\colon p \in \pi_k} N_k$, where $N_k = |\mathcal{I}_k|$. The corresponding matrix blocks are
$$A^p_{km} = (A^p[i,j])_{i \in \mathcal{I}_k,\, j \in \mathcal{I}_m} \in \mathbb{R}^{N_k \times N_m}, \quad (7)$$

which are part of the local stiffness matrix $A^p$ on processor $p$ and represent the block matrices of the stiffness matrix $A$ additively by
$$A_{km} = \sum_{p \in \mathcal{P}} A^p_{km}. \quad (8)$$

Lemma 1. The block matrix $A^p_{km}$ is zero for $p \notin \pi_k \cap \pi_m$, and therefore $A_{km} = \sum_{p \in \pi_k \cap \pi_m} A^p_{km}$.

Proof. It holds $A^p_{km} = (A^p[i,j])_{i \in \mathcal{I}_k,\, j \in \mathcal{I}_m} = \big(\sum_{c \in \mathcal{C}_p} a_c(\phi_i,\phi_j)\big)_{i \in \mathcal{I}_k,\, j \in \mathcal{I}_m}$. Now we fix some $i \in \mathcal{I}_k$ and $j \in \mathcal{I}_m$.

Let $p \notin \pi_k \cap \pi_m$ and $c \in \mathcal{C}_p$. Assume $a_c(\phi_i,\phi_j) \neq 0$; then $c \in \mathcal{C}_i \cap \mathcal{C}_j$. By (4), we have $z_i \in \overline{\Omega}_c$ and $z_j \in \overline{\Omega}_c$, and since $c \in \mathcal{C}_p$, by (5) also $z_i, z_j \in \overline{\Omega}_p$. This gives $i,j \in \mathcal{I}_p$, i.e. $p \in \pi(i) = \pi_k$ and $p \in \pi(j) = \pi_m$. This contradicts $p \notin \pi_k \cap \pi_m$. □

2.2.2. Example in 2D

We consider a square partitioned into four subdomains as in Fig. 1, where each part is assigned to one processor. This partition defines our active processor sets $\Pi_{\mathcal{I}} = \{\pi_1,\ldots,\pi_9\}$ with $\pi_1 = \{1\}$, $\pi_2 = \{2\}$, $\pi_3 = \{3\}$, $\pi_4 = \{4\}$, $\pi_5 = \{1,2\}$, $\pi_6 = \{3,4\}$, $\pi_7 = \{1,3\}$, $\pi_8 = \{2,4\}$, $\pi_9 = \{1,2,3,4\}$. The index set $\mathcal{I}$ is represented by the "big dots" in Fig. 1, where the different colors indicate the different active processor sets.

The (global) matrix $A$ is divided into a $9 \times 9$ block matrix, where the blocks themselves are represented additively on the four processors. Note that all matrices $A^p_{km}$, $p = 1,\ldots,4$, are of the same size. Most of the block matrices are zero; on each processor only 16 of the 81 blocks are nonzero. Moreover, the global block matrix $A_{km} = A^1_{km} + \cdots + A^4_{km}$ is zero if $\pi_k \cap \pi_m = \emptyset$.

This example shows that in 2D and with a small number of processors the major part of the computation is done completely in parallel in the initial step (the indices corresponding to the interior of the subdomains can be eliminated independently, cf. Section 3). This changes completely in 3D and for a large processor number P, where the efficient parallel LU decomposition of the Schur complements is the challenging task.
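The nine active processor sets of this example can also be enumerated mechanically. A minimal sketch (assuming a 5x5 nodal grid on the unit square and the 2x2 subdomain layout of Fig. 1; the grid size is an assumption chosen only for illustration):

    # Sketch: active processor sets for a 5x5 nodal grid split into 2x2 subdomains (cf. Fig. 1).
    n = 5
    nodes = [(i / (n - 1), j / (n - 1)) for j in range(n) for i in range(n)]

    def pi(x, y):
        """All processors whose closed subdomain contains the nodal point (x, y)."""
        procs = set()
        if x <= 0.5 and y <= 0.5: procs.add(1)
        if x >= 0.5 and y <= 0.5: procs.add(2)
        if x <= 0.5 and y >= 0.5: procs.add(3)
        if x >= 0.5 and y >= 0.5: procs.add(4)
        return frozenset(procs)

    blocks = {}                                   # pi_k -> index set I_k
    for idx, (x, y) in enumerate(nodes):
        blocks.setdefault(pi(x, y), []).append(idx)

    print(len(blocks))                            # 9 active processor sets
    for pk, Ik in sorted(blocks.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(pk), "->", len(Ik), "indices")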

2.3. The block LU decomposition

We define a standard block LU decomposition $A = LU$ of the stiffness matrix $A = (A_{km})_{k,m=1,\ldots,K}$ with a lower and an upper triangular $K \times K$ block matrix $L = (L_{km})_{k,m=1,\ldots,K}$ and $U = (U_{km})_{k,m=1,\ldots,K}$ with $U_{kk} = I_{N_k}$. As usual, the result can be stored in the blocks $A_{km}$ [14, Chap. 3.2]. We do not examine whether further pivoting of the block matrices is necessary in the indefinite case, but each diagonal block itself is decomposed with the pivoting strategies of the SuperLU [9] or LAPACK [3,2] routines.

Fig. 1. A square distributed on four processors.


Nevertheless, in our finite element context with well-posed problems the subproblems related to the Schur complements are invertible.

We require three block matrix operations:

- $A_{kk} := \mathrm{LU}(A_{kk})$. Here (e.g., using a suitable LAPACK routine [3,2]) we compute an LU decomposition of $A_{kk}$, and the result is stored in the matrix entries of $A_{kk}$. Local pivoting is performed by the subroutine.
- $A_{km} := \mathrm{Solve}(A_{kk}, A_{km})$. Here (using the previously determined LU decomposition) we compute $A_{kk}^{-1} A_{km}$. Again, the result is stored in $A_{km}$.
- $A_{nm} := A_{nm} - A_{nk} A_{km}$. This operation is realized by one BLAS3 call.

Then, the block LU decomposition can be written as follows:

Algorithm 1.

1  FOR k = 1, ..., K
2    A_kk := LU(A_kk)
3    FOR m = k+1, ..., K
4      A_km := Solve(A_kk, A_km)
5      FOR n = k+1, ..., K
6        A_nm := A_nm - A_nk A_km

The algorithm ends with $L_{km} := A_{km}$ for $k \geq m$ and $U_{km} := A_{km}$ for $k < m$.
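The three operations and Algorithm 1 can be illustrated with a short serial numpy/scipy sketch (an illustration only, not the distributed implementation of the paper; the block sizes and the test matrix are arbitrary, and the LU factors are kept in a separate list so that the unfactorized diagonal blocks remain available for the check):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(0)
    sizes = [3, 2, 4]                                  # block sizes N_1, ..., N_K
    K, N = len(sizes), sum(sizes)
    ofs = np.cumsum([0] + sizes)
    A_full = rng.standard_normal((N, N)) + N * np.eye(N)     # well-conditioned test matrix
    A = [[A_full[ofs[k]:ofs[k+1], ofs[m]:ofs[m+1]].copy() for m in range(K)] for k in range(K)]

    lu = [None] * K                                    # LU factors of the diagonal blocks
    for k in range(K):                                 # Algorithm 1
        lu[k] = lu_factor(A[k][k])                     # A_kk := LU(A_kk), with local pivoting
        for m in range(k + 1, K):
            A[k][m] = lu_solve(lu[k], A[k][m])         # A_km := Solve(A_kk, A_km) = A_kk^{-1} A_km
            for n in range(k + 1, K):
                A[n][m] -= A[n][k] @ A[k][m]           # A_nm := A_nm - A_nk A_km   (BLAS3)

    # Check A = L U: L_km = A_km for k >= m (the diagonal blocks hold the Schur complements),
    # U_km = A_km for k < m, with unit diagonal blocks in U.
    L, U = np.zeros((N, N)), np.eye(N)
    for k in range(K):
        for m in range(K):
            tgt = L if k >= m else U
            tgt[ofs[k]:ofs[k+1], ofs[m]:ofs[m+1]] = A[k][m]
    print(np.linalg.norm(L @ U - A_full))              # small residual, e.g. ~1e-12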

3. A parallel block LU decomposition

Now we introduce a parallel block LU decomposition based on the blocks associated with the processor sets $\pi_k$. We will see below that our algorithm is equivalent to a recursive Schur complement reduction method.

We start with a suitable numbering of the processors

$$\mathcal{P} = \{p_1, p_2, \ldots, p_{2^S}\}. \quad (9)$$

The block LU decomposition is based on a recursive definition of combined processor sets $\mathcal{P}_{s,t}$ in each step $s$ with a cluster number $t = 1,\ldots,2^{S-s}$, such that $\bigcup_{t=1,\ldots,2^{S-s}} \mathcal{P}_{s,t} = \mathcal{P}$. For $s = 0$ each processor set consists of exactly one processor number, $\mathcal{P}_{0,t} = \{p_t\}$; the processor sets in the further steps $s$ are built from two processor sets of the previous step $s-1$, such that each processor number is located in exactly one set in each step.

In the first step $s = 0$, on every processor $p \in \mathcal{P}$ we eliminate all indices $i$ in the interior of the subdomains, i.e., with $i \in \mathcal{I}_p$ and $\pi(i) = \{p\}$. This can be done independently on each processor. Then, we build locally the Schur complement on the skeleton, which is given by all indices $i \in \mathcal{I}_p$ with $|\pi(i)| > 1$. Again, the Schur complement is represented additively.

In the next step $s = 1$ we eliminate the indices on the interface of two processors, given by the processor set $\mathcal{P}_{1,t}$, i.e. all indices $i$ with $\pi(i) \subset \mathcal{P}_{1,t}$ which have not yet been eliminated. Therefore we first have to sum up the previous Schur complements of the two combined processor sets. Then, the indices can be eliminated and the new Schur complement, given additively on the processor sets $\mathcal{P}_{1,t}$, is built.

This procedure is executed recursively until only one processor set exists in step s = S, where all remaining indices areeliminated.

Since the Schur complements become large in the final steps, we distribute the columns of the (local) Schur complement in a cyclic way to the processors in the processor set $\mathcal{P}_{s,t}$.

3.1. The set of interfaces

Depending on the processor numbering (9), we define for each step $s = 0,\ldots,S$ a set of $T_s = 2^{S-s}$ combined processor sets (clusters)
$$\mathcal{P}_{s,t} = \{p_j : 2^s(t-1) < j \leq 2^s t\}, \qquad t = 1,\ldots,T_s, \quad (10)$$

which results in a disjoint decomposition $\mathcal{P} = \mathcal{P}_{s,1} \cup \cdots \cup \mathcal{P}_{s,T_s}$ for all steps $s$. For every $\pi \in \Pi_{\mathcal{I}}$ we define the associated step by
$$s(\pi) = \min\{s \mid \exists\, t \in \{1,\ldots,T_s\} : \pi \subset \mathcal{P}_{s,t}\}, \quad (11)$$

which results in
$$\Pi_s = \{\pi \in \Pi_{\mathcal{I}} : s(\pi) = s\}, \qquad \Pi_{s,t} = \{\pi \in \Pi_s : \pi \subset \mathcal{P}_{s,t}\}.$$

Now we can define a consecutive numbering for the disjoint decomposition
$$\Pi_{\mathcal{I}} = \Pi_0 \cup \cdots \cup \Pi_S = \bigcup_{s=0}^{S} \bigcup_{t=1}^{T_s} \Pi_{s,t}.$$

For $s = 0$ we set $\pi_k = \{k\}$ and $\Pi_{0,k} = \{\pi_k\}$, $k = 1,\ldots,P$, and for $s = 1,\ldots,S$ we set $\Pi_{s,t} = \{\pi_k : k = K_{s,t-1}+1,\ldots,K_{s,t}\}$ with $K_{0,t} = t$ for $t = 0,\ldots,T_0$ and $K_{s,0} = K_{s-1,T_{s-1}}$. This results in $\Pi_s = \{\pi_k : k = K_{s-1}+1,\ldots,K_s\}$ with $K_{-1} = 0$ and $K_s = K_{s,T_s}$. We set $K = K_S$. Together, this defines the numbering (6) of the active processor sets which is used in the block LU decomposition.

In principle, any processor numbering results in a processor set numbering and a parallel block LU decomposition. Nevertheless, the numbering is important for the parallel efficiency of the block LU decomposition, since the size of the processor interfaces in the final steps strongly depends on this numbering.
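A small bookkeeping sketch (an illustration, not part of the paper) for $P = 2^S = 4$ processors: the clusters $\mathcal{P}_{s,t}$ of (10) and the associated steps $s(\pi)$ of (11) for the active processor sets of Section 2.2.2.

    # Sketch: clusters P_{s,t} from (10) and steps s(pi) from (11) for P = 2^S = 4 processors.
    S = 2

    def cluster(s, t):
        """P_{s,t} = { p_j : 2^s (t - 1) < j <= 2^s t }, with processors numbered 1..2^S."""
        return set(range(2 ** s * (t - 1) + 1, 2 ** s * t + 1))

    def step(pi):
        """s(pi): the first step s in which pi is contained in a single cluster P_{s,t}."""
        for s in range(S + 1):
            if any(pi <= cluster(s, t) for t in range(1, 2 ** (S - s) + 1)):
                return s

    active = [{1}, {2}, {3}, {4}, {1, 2}, {3, 4}, {1, 3}, {2, 4}, {1, 2, 3, 4}]
    for pi in active:
        print(sorted(pi), "-> eliminated in step", step(pi))
    # {1},...,{4} in step 0; {1,2},{3,4} in step 1; {1,3},{2,4},{1,2,3,4} in step 2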

3.2. Reduction of the basic block LU decomposition

With the definition of the step $s$, the cluster $t$ and the numbering for $\Pi_{s,t}$, Algorithm 1 has the following form:

Algorithm 2.

1  FOR s = 0, ..., S
2    FOR t = 1, ..., T_s
3      FOR k = K_{s,t-1}+1, ..., K_{s,t}
4        A_kk := LU(A_kk)
5        FOR m = k+1, ..., K
6          A_km := Solve(A_kk, A_km)
7          FOR n = k+1, ..., K
8            A_nm := A_nm - A_nk A_km

Note that the loops in lines (1...3) together run over all k = 1, ..., K.

Now we identify block matrices which remain zero throughout the algorithm, so that only a subset of the block matrices has to be computed. To this end, we define the actual operating set for the cluster $t = 1,\ldots,T_s$ in step $s$ by
$$\mathcal{K}^{s,t} = \{k \in \{K_{s,t-1}+1, \ldots, K\} : \pi_k \cap \mathcal{P}_{s,t} \neq \emptyset\}.$$

Let $\mathcal{K}^{s,t}_{\mathrm{LU}} := \{K_{s,t-1}+1, \ldots, K_{s,t}\}$ and $\mathcal{K}^{s}_{\mathrm{LU}} := \{K_{s-1}+1, \ldots, K_s\}$. Furthermore, let $\mathcal{K}^{s,t}_{\mathrm{D}} = \{k \in \{K_{s,t-1}+1, \ldots, K\} : \pi_k \cap \mathcal{P}_{s,t} = \emptyset\}$ be the complement of $\mathcal{K}^{s,t}$. We note that $\mathcal{K}^{s,t}_{\mathrm{LU}} \subset \mathcal{K}^{s,t}$, since $\pi_k \subset \mathcal{P}_{s,t}$ for $k \in \mathcal{K}^{s,t}_{\mathrm{LU}}$. The Schur complement in step $s$ is denoted by $A^{\mathrm{Sch}}_s := (A_{nm})_{n,m = K_s+1,\ldots,K}$.
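For the four-processor example of Section 2.2.2 these sets can be listed explicitly; a short sketch (an illustration only, with the numbering $K_{s,t}$ of that example spelled out by hand):

    # Sketch: operating sets K^{s,t}, K^{s,t}_LU and K^{s,t}_D for the example of Section 2.2.2.
    pi = {1: {1}, 2: {2}, 3: {3}, 4: {4}, 5: {1, 2}, 6: {3, 4},
          7: {1, 3}, 8: {2, 4}, 9: {1, 2, 3, 4}}              # active sets pi_1, ..., pi_9
    K = 9
    P_cluster = {(0, 1): {1}, (0, 2): {2}, (0, 3): {3}, (0, 4): {4},
                 (1, 1): {1, 2}, (1, 2): {3, 4}, (2, 1): {1, 2, 3, 4}}
    K_bound = {(0, 0): 0, (0, 1): 1, (0, 2): 2, (0, 3): 3, (0, 4): 4,
               (1, 0): 4, (1, 1): 5, (1, 2): 6, (2, 0): 6, (2, 1): 9}

    for (s, t), P_st in sorted(P_cluster.items()):
        lo = K_bound[(s, t - 1)]
        K_st = {k for k in range(lo + 1, K + 1) if pi[k] & P_st}          # operate
        K_LU = set(range(lo + 1, K_bound[(s, t)] + 1))                    # eliminate
        K_D  = {k for k in range(lo + 1, K + 1) if not (pi[k] & P_st)}    # untouched (Lemma 2)
        print((s, t), "LU:", sorted(K_LU), "operate:", sorted(K_st), "skip:", sorted(K_D))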

Lemma 2. For each step $s$ in Algorithm 2, $A_{km} = 0$ and $A_{mk} = 0$ hold during the whole algorithm for $k \in \mathcal{K}^{s,t}_{\mathrm{LU}}$, $m \in \mathcal{K}^{s,t}_{\mathrm{D}}$.

Remark 3. Since $\mathcal{K}^{s,\tilde{t}}_{\mathrm{LU}} \cap \mathcal{K}^{s,t} = \emptyset$ for $\tilde{t} \neq t$, we obtain from Lemma 2 the block-diagonal structure
$$(A_{kl})_{k,l \in \mathcal{K}^{s}_{\mathrm{LU}}} = \operatorname{diag}_{t=1,\ldots,T_s} (A_{kl})_{k,l \in \mathcal{K}^{s,t}_{\mathrm{LU}}}.$$
Thus, the loop in line 2 of Algorithm 2 can be computed in parallel, resulting in an additive Schur complement. By Lemma 2, the loop in line 5 of Algorithm 2 can also be restricted to $m \in \mathcal{K}^{s,t}$, $m > k$. Furthermore, the matrices $A_{nm}$ are not changed in line 8 of Algorithm 2 if $n$ or $m$ belongs to $\mathcal{K}^{s,t}_{\mathrm{D}}$, since then $A_{km} = 0$ or $A_{nk} = 0$, so that the loop in line 7 of Algorithm 2 can be restricted to $n \in \mathcal{K}^{s,t}$, $n > k$.

Proof. We prove by induction that $A_{km} = 0$ and $A_{mk} = 0$ at the beginning of each step $s$ for $k \in \mathcal{K}^{s,t}_{\mathrm{LU}}$, $m \in \mathcal{K}^{s,t}_{\mathrm{D}}$. Let $s = 0$. Then $\mathcal{K}^{s,t}_{\mathrm{LU}} = \{t\}$ and $\mathcal{P}_{s,t} = \{t\}$ for $t = 1,\ldots,T_0$ by definition, and $\mathcal{K}^{s,t} = \{k \geq t : \pi_k \cap \{t\} \neq \emptyset\}$, $\mathcal{K}^{s,t}_{\mathrm{D}} = \{k \geq t : \pi_k \cap \{t\} = \emptyset\}$. In this step we can identify the element $k \in \mathcal{K}^{s,t}_{\mathrm{LU}}$ with $k = t$. Let $m \in \mathcal{K}^{s,t}_{\mathrm{D}}$. Then $\pi_m \cap \{t\} = \emptyset$. Since $\{t\} = \pi_k$ for $k \in \mathcal{K}^{s,t}_{\mathrm{LU}}$, we have $A_{km} = 0$ and $A_{mk} = 0$ as a consequence of Lemma 1. Furthermore, for the matrices $A_{nm}$ in the Schur complement $A^{\mathrm{Sch}}_0$ we have $A_{nm} = 0$ for $\pi_n \cap \pi_m = \emptyset$, $n,m > K_0 = T_0$, also by Lemma 1.

We show that these zero matrices do not change in this step. There are three operations in the algorithm where matrices are changed. The first one is $A_{kk} := \mathrm{LU}(A_{kk})$, but this only involves $A_{kk}$ with $k \in \mathcal{K}^{s,t}_{\mathrm{LU}}$, hence $k \notin \mathcal{K}^{s,t}_{\mathrm{D}}$ and $A_{kk} \neq 0$. The second one is $A_{km} := \mathrm{Solve}(A_{kk}, A_{km})$. If $A_{km}$ is a zero matrix, the result is again a zero matrix, so this operation does not change the structure. The third one is $A_{nm} := A_{nm} - A_{nk} A_{km}$. To change a matrix $A_{nm}$, both $A_{nk} \neq 0$ and $A_{km} \neq 0$ are required. Thus $k \in \pi_n$ and $k \in \pi_m$, and then $\pi_n \cap \pi_m \supset \{k\} \neq \emptyset$.

For the induction step, consider $s > 0$. Let $n \in \mathcal{K}^{s,t}_{\mathrm{LU}}$, $m \in \mathcal{K}^{s,t}_{\mathrm{D}}$. Then $\pi_n \subset \mathcal{P}_{s,t}$ and $\pi_m \cap \mathcal{P}_{s,t} = \emptyset$, so $\pi_m \cap \pi_n = \emptyset$. Therefore $A_{nm} = 0$ at the beginning of the algorithm. We have to show that $A_{nm}$ has not been changed before step $s$. The argument for the first two operations is the same as before, so we only consider the operation $A_{nm} := A_{nm} - A_{nk} A_{km}$ for $k < n$. There is no change of $A_{nm}$ if either $A_{nk} = 0$ or $A_{km} = 0$, so we have to show that if $A_{nk} \neq 0$, then $A_{km} = 0$ for $k < n$.


Let $k \in \mathcal{K}^{r,\sigma}_{\mathrm{LU}}$ with $r < s$, $\sigma \in \{1,\ldots,T_r\}$, and $A_{nk} \neq 0$. Then $n \in \mathcal{K}^{r,\sigma}$ and therefore $\pi_n \cap \mathcal{P}_{r,\sigma} \neq \emptyset$ by the induction hypothesis. We have to show that $A_{km} = 0$. We argue by contradiction: let $A_{km} \neq 0$. By the induction hypothesis we then have $m \in \mathcal{K}^{r,\sigma}$, hence $\pi_m \cap \mathcal{P}_{r,\sigma} \neq \emptyset$. As one can see from definition (10), we have $\mathcal{P}_{s,t} = \mathcal{P}_{s-1,\tau_1} \cup \mathcal{P}_{s-1,\tau_2}$ for all $s > 0$ and $\mathcal{P}_{s,t_1} \cap \mathcal{P}_{s,t_2} = \emptyset$ for $t_1 \neq t_2$. But this means that $\mathcal{P}_{r,\sigma} \subset \mathcal{P}_{s,t}$, since $\pi_n \cap \mathcal{P}_{r,\sigma} \neq \emptyset$ and $\pi_n \subset \mathcal{P}_{s,t}$. Therefore $\pi_m \cap \mathcal{P}_{s,t} \neq \emptyset$, which is a contradiction. □

Using Lemma 2 and Remark 3, Algorithm 2 can be reduced to

Algorithm 3.

1. FOR s = 0, ..., S
2.   FOR t = 1, ..., T_s
3.     FOR k = K_{s,t-1}+1, ..., K_{s,t}
4.       A_kk := LU(A_kk)
5.       FOR m ∈ K^{s,t}, m > k
6.         A_km := Solve(A_kk, A_km)
7.         FOR n ∈ K^{s,t}, n > k
8.           A_nm := A_nm - A_nk A_km

3.3. The Schur complement for additive matrices

Since the matrices obtained from the interface are decomposed additively, we examine the Schur complement for additive matrices. In a first step, we consider a parallel finite element problem given on two processors, $P = 2$, i.e. $S = 1$. With respect to Lemma 1 this leads to the additive matrix representation

$$A = \begin{pmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} A^1_{11} & 0 & A^1_{13} \\ 0 & A^2_{22} & A^2_{23} \\ A^1_{31} & A^2_{32} & A^1_{33} + A^2_{33} \end{pmatrix}$$

with $\pi_1 = \{1\}$, $\pi_2 = \{2\}$, $\pi_3 = \{1,2\}$. We examine Algorithm 3 with $S = 1$, $T_0 = 2$, $T_1 = 1$, $\mathcal{P}_{0,1} = \{1\}$, $\mathcal{P}_{0,2} = \{2\}$, $\mathcal{P}_{1,1} = \{1,2\}$, $K_{0,0} = 0$, $K_{0,1} = 1$, $K_{0,2} = 2$, $K_{1,0} = 2$, $K_{1,1} = 3$, $\mathcal{K}^{0,1} = \{1,3\}$, $\mathcal{K}^{0,2} = \{2,3\}$, $\mathcal{K}^{1,1} = \{3\}$. We start with

$$A^1 = \begin{pmatrix} A^1_{11} & A^1_{13} \\ A^1_{31} & A^1_{33} \end{pmatrix} \ \text{on } p = 1, \qquad A^2 = \begin{pmatrix} A^2_{22} & A^2_{23} \\ A^2_{32} & A^2_{33} \end{pmatrix} \ \text{on } p = 2.$$

First, we consider step $s = 0$, where the substep $t = 1$ is executed on processor $p = 1$ and the substep $t = 2$ is executed on processor $p = 2$. This results (locally) in

$$\begin{pmatrix} \mathrm{LU}(A^1_{11}) & (A^1_{11})^{-1} A^1_{13} \\ A^1_{31} & A^{1,\mathrm{Sch}}_{33} \end{pmatrix} \ \text{on } p = 1, \qquad \begin{pmatrix} \mathrm{LU}(A^2_{22}) & (A^2_{22})^{-1} A^2_{23} \\ A^2_{32} & A^{2,\mathrm{Sch}}_{33} \end{pmatrix} \ \text{on } p = 2$$

with $A^{1,\mathrm{Sch}}_{33} := A^1_{33} - A^1_{31}(A^1_{11})^{-1}A^1_{13}$ and $A^{2,\mathrm{Sch}}_{33} := A^2_{33} - A^2_{32}(A^2_{22})^{-1}A^2_{23}$.

In the next step $s = 1$, $A^{\mathrm{Sch}}_{33} = A^{1,\mathrm{Sch}}_{33} + A^{2,\mathrm{Sch}}_{33}$ is computed and decomposed on $p = 1$, which finally leads to

$$\begin{pmatrix} \mathrm{LU}(A^1_{11}) & 0 & (A^1_{11})^{-1} A^1_{13} \\ 0 & \mathrm{LU}(A^2_{22}) & (A^2_{22})^{-1} A^2_{23} \\ A^1_{31} & A^2_{32} & A^{\mathrm{Sch}}_{33} \end{pmatrix}$$

with $A^{\mathrm{Sch}}_{33} = A^1_{33} + A^2_{33} - A^1_{31}(A^1_{11})^{-1}A^1_{13} - A^2_{32}(A^2_{22})^{-1}A^2_{23}$.

This illustrates our algorithm for $S = 1$. For $S > 1$, this procedure is applied recursively on the processor sets, where the Schur complements may consist of more than one block matrix. Then, the columns of the Schur complement are distributed over the processors in $\mathcal{P}_{s,t}$ to reduce the computing time of several operations.
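The two-processor computation above is easy to check numerically; a sketch with random, well-conditioned blocks (an illustration, not an actual finite element matrix):

    import numpy as np

    rng = np.random.default_rng(1)
    n1, n2, n3 = 4, 4, 3                                   # block sizes for pi_1, pi_2, pi_3

    def spd(n):
        M = rng.standard_normal((n, n))
        return M @ M.T + n * np.eye(n)                     # symmetric positive definite test block

    # local contributions (by Lemma 1 only blocks with p in pi_k cap pi_m are present)
    A1_11, A1_13 = spd(n1), rng.standard_normal((n1, n3))
    A1_31, A1_33 = rng.standard_normal((n3, n1)), spd(n3)
    A2_22, A2_23 = spd(n2), rng.standard_normal((n2, n3))
    A2_32, A2_33 = rng.standard_normal((n3, n2)), spd(n3)

    # step s = 0: local Schur complements, computed independently on p = 1 and p = 2
    S1 = A1_33 - A1_31 @ np.linalg.solve(A1_11, A1_13)
    S2 = A2_33 - A2_32 @ np.linalg.solve(A2_22, A2_23)

    # step s = 1: the additive sum equals the Schur complement of the assembled global matrix
    A = np.block([[A1_11, np.zeros((n1, n2)), A1_13],
                  [np.zeros((n2, n1)), A2_22, A2_23],
                  [A1_31, A2_32, A1_33 + A2_33]])
    n12 = n1 + n2
    S_glob = A[n12:, n12:] - A[n12:, :n12] @ np.linalg.solve(A[:n12, :n12], A[:n12, n12:])
    print(np.linalg.norm((S1 + S2) - S_glob))              # small residual, e.g. ~1e-13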

3.4. The parallel block LU decomposition

We use a (cyclic) distribution of the columns of the block matrices $A_{kn}$ for $k,n \in \mathcal{K}^{s,t}$. This is done independently for every cluster $t$. Therefore, we select a processor map
$$p_{s,t}\colon \mathcal{K}^{s,t} \to \mathcal{P}_{s,t}, \quad (12)$$
and for $n \in \mathcal{K}^{s,t}$ the block matrices $A_{kn}$ for all $k \in \mathcal{K}^{s,t}$ are computed on processor $p = p_{s,t}(n)$.


This results in the following parallel algorithm:

Algorithm 4.

1.  FOR s = 0, ..., S
2.    FOR t = 1, ..., T_s
3.      IF (s > 0)
4.        FOR n, m ∈ K^{s,t}
5.          p := p_{s,t}(m)
6.          ON q := p_{s-1,2t}(m):   SEND A^q_nm TO p
7.          ON r := p_{s-1,2t+1}(m): SEND A^r_nm TO p
8.          ON p: RECEIVE A^{(q)}_nm AND A^{(r)}_nm, SET A^p_nm := A^{(q)}_nm + A^{(r)}_nm
9.      FOR k = K_{s,t-1}+1, ..., K_{s,t}
10.       p := p_{s,t}(k)
11.       ON p: A^p_kk := LU(A^p_kk)
12.       FOR n ∈ K^{s,t}
13.         ON p: SEND A^p_nk TO q ∈ P_{s,t}
14.         ON q ∈ P_{s,t}: RECEIVE A^{(p)}_nk AND SET A^q_nk := A^{(p)}_nk
15.       FOR m ∈ K^{s,t}, m > k
16.         q := p_{s,t}(m)
17.         ON q: A^q_km := Solve(A^q_kk, A^q_km)
18.         FOR n ∈ K^{s,t}, n > k
19.           ON q: A^q_nm := A^q_nm - A^q_nk A^q_km

Matrices denoted by $A^{(q)}_{nm}$ are matrices received from a processor $q$; operations with these matrices may be executed on a different processor.

In lines (3...8) the Schur complement given on the two clusters 2t and 2t+1 in step s-1 is combined into one cluster t in step s.
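The paper states that the processor map (12) distributes the columns cyclically but gives no explicit formula; a plausible minimal sketch (an assumption, not necessarily the authors' exact choice) is a round-robin assignment:

    # Sketch (assumed round-robin realization of the cyclic map p_{s,t} : K^{s,t} -> P_{s,t}).
    def column_map(K_st, P_st):
        cols, procs = sorted(K_st), sorted(P_st)
        return {n: procs[j % len(procs)] for j, n in enumerate(cols)}

    # cluster (s, t) = (2, 1) of the four-processor example: columns 7, 8, 9 of the Schur complement
    print(column_map({7, 8, 9}, {1, 2, 3, 4}))      # {7: 1, 8: 2, 9: 3}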

3.4.1. Example on 4 processors

We illustrate the steps of Algorithm 4 for the example in Section 2.2.2, where P = 4 and S = 2. For the distributed block matrices the processor numbers are indicated where the matrix is non-zero; zero matrices are denoted by a dot.

In step $s = 0$, the indices associated to $\pi_1,\ldots,\pi_4$ are eliminated on all processors. Then, the Schur complement (beginning with $\pi_5$) is communicated within the new processor sets ($\mathcal{P}_{1,1} = \{1,2\}$ and $\mathcal{P}_{1,2} = \{3,4\}$). After this step, the distribution of the matrices changes accordingly.


For step $s = 1$, the elements associated to $\pi_5$ are eliminated on the processor set $\mathcal{P}_{1,1}$ and those of $\pi_6$ on $\mathcal{P}_{1,2}$, respectively. Note that all matrices belonging to the next Schur complement ($\pi_7,\ldots,\pi_9$) are still distributed additively.

The procedure is repeated for the last step. Here, we only have one processor set $\mathcal{P}_{2,1} = \{1,2,3,4\}$. Note that some block matrices which were zero before now have entries, namely $(\pi_7,\pi_8)$ and $(\pi_8,\pi_7)$, and that the distribution of the columns depends on the step number $s$. This results in the final situation, where each block matrix is represented on exactly one processor.

3.5. Solving linear equations with the block LU decomposition

With the parallel LU decomposition we solve the linear system $Ax = b$, where $A$ is the stiffness matrix given in the parallel additive representation
$$A = (A_{kl})_{k,l=1,\ldots,K} \quad \text{with} \quad A_{kl} = \sum_{p \in \pi_k \cap \pi_l} A^p_{kl},$$
and $b$ is the right-hand side, also represented additively by
$$b = (b_k)_{k=1,\ldots,K} \quad \text{with} \quad b_k = \sum_{p \in \pi_k} b^p_k.$$

We use the following operations for the vectors bk:

- $b_k := \mathrm{Solve}(A_{kk}, b_k)$. Here (using the LU decomposition of $A_{kk}$) we compute $A_{kk}^{-1} b_k$. The result is stored in $b_k$.
- $b_n := b_n - A_{nk} b_k$. This operation is realized by one BLAS2 call.

From a global point of view, the solving routine using the decomposed matrices $A_{kl}$ is given by

Algorithm 5.

1. FOR k = 1, ..., K
2.   b_k := Solve(A_kk, b_k)
3.   FOR n = k+1, ..., K
4.     b_n := b_n - A_nk b_k
5. FOR n = K-1, ..., 1
6.   FOR k = K, ..., n+1
7.     b_n := b_n - A_nk b_k

The algorithm ends with $x := b$.
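A serial numpy sketch of Algorithm 5 (an illustration only): the block factorization of Algorithm 1 is recomputed in a few lines, then the forward and backward substitutions are applied to a right-hand side with known solution.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(2)
    sizes = [3, 4, 2]
    K, N = len(sizes), sum(sizes)
    ofs = np.cumsum([0] + sizes)
    A_full = rng.standard_normal((N, N)) + N * np.eye(N)
    A = [[A_full[ofs[k]:ofs[k+1], ofs[m]:ofs[m+1]].copy() for m in range(K)] for k in range(K)]

    lu = [None] * K
    for k in range(K):                                  # block LU as in Algorithm 1
        lu[k] = lu_factor(A[k][k])
        for m in range(k + 1, K):
            A[k][m] = lu_solve(lu[k], A[k][m])
            for n in range(k + 1, K):
                A[n][m] -= A[n][k] @ A[k][m]

    x_exact = rng.standard_normal(N)
    b = [(A_full @ x_exact)[ofs[k]:ofs[k+1]].copy() for k in range(K)]

    for k in range(K):                                  # forward substitution (lines 1-4)
        b[k] = lu_solve(lu[k], b[k])                    # b_k := Solve(A_kk, b_k)
        for n in range(k + 1, K):
            b[n] -= A[n][k] @ b[k]                      # b_n := b_n - A_nk b_k   (BLAS2)
    for n in range(K - 2, -1, -1):                      # backward substitution (lines 5-7)
        for k in range(K - 1, n, -1):
            b[n] -= A[n][k] @ b[k]

    print(np.linalg.norm(np.concatenate(b) - x_exact))  # small error: x := b solves A x = b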

The parallel realization of Algorithm 5 reads as follows:

Algorithm 6.

1. FOR s = 0, ..., S
2.   FOR t = 1, ..., T_s
3.     SOLVE_L(s, t)
4. FOR s = S, ..., 0
5.   FOR t = T_s, ..., 1
6.     SOLVE_U(s, t)


with the subroutines for the parallel forward and backward substitution

Algorithm 7. SOLVE_L (s,t)

1.  p := p_{s,t}(K_{s,t-1}+1)
2.  FOR k ∈ K^{s,t}
3.    ON q := p_{s-1,2t}(k):   SEND b^q_k TO p
4.    ON r := p_{s-1,2t+1}(k): SEND b^r_k TO p
5.    ON p: RECEIVE b^{(q)}_k AND b^{(r)}_k, SET b^p_k := b^{(q)}_k + b^{(r)}_k
6.  FOR k = K_{s,t-1}+1, ..., K_{s,t}
7.    p := p_{s,t}(k)
8.    ON p: b^p_k := Solve(A^p_kk, b^p_k), SEND b^p_k TO q ∈ P_{s,t}
9.    ON q ∈ P_{s,t}: RECEIVE b^{(p)}_k, SET b^q_k := b^{(p)}_k
10.   FOR n ∈ K^{s,t}, n > k
11.     ON p: b^p_n := b^p_n - A^p_nk b^p_k
12.     IF k < K_{s,t}
13.       ON p: SEND b^p_n TO q := p_{s,t}(k+1)
14.       ON q := p_{s,t}(k+1): RECEIVE b^{(p)}_n, SET b^q_n := b^{(p)}_n
15.     ELSE
16.       ON p: SEND b^p_n TO q ∈ P_{s,t}
17.       ON q ∈ P_{s,t}: RECEIVE b^{(p)}_n, SET b^q_n := b^{(p)}_n


Algorithm 8. SOLVE_U (s, t)

1.  FOR n = K_{s,t}, ..., K_{s,t-1}+1
2.    FOR k ∈ K^{s,t}, k > n
3.      p := p_{s,t}(k)
4.      ON p: b^p_n := b^p_n - A^p_nk b^p_k
5.      IF k > K_{s,t}
6.        ON p: SEND b^p_n TO q := p_{s,t}(k-1)
7.        ON q := p_{s,t}(k-1): RECEIVE b^{(p)}_n, SET b^q_n := b^{(p)}_n
8.      ELSE
9.        ON p: SEND b^p_n TO q ∈ P_{s,t}
10.       ON q ∈ P_{s,t}: RECEIVE b^{(p)}_n, SET b^q_n := b^{(p)}_n

In Algorithm 7, lines (2...5) combine the right-hand sides of the respective two clusters. In lines (8...9) the part $b_k$ of the current right-hand side $b$ is sent to all processors in the current processor set $\mathcal{P}_{s,t}$, while the part of $b$ which will still be changed in the algorithm (i.e., $b_n$ for $k < n \in \mathcal{K}^{s,t}$) is only sent to the next processor $p_{s,t}(k+1)$ (lines (13...14)). At the end, the rest of $b$ is sent to the current processor set (lines (16...17)), such that each processor in this set has the same information about the right-hand side. As in Algorithm 4, a vector denoted by $b^{(p)}_n$ is a vector received from processor $p$.

4. Model problems

We define a series of different model problems for our numerical tests. In our notation, $u = u(x)$ is the solution, where $x = (X,Y)^T$ and $x = (X,Y,Z)^T$ are the coordinates in the 2D and 3D case, respectively.

4.1. The Poisson problem

We start with a commonly used benchmark problem, the Poisson equation $-\Delta u = f$ in $\Omega = (0,1)^d$ ($d = 2,3$) with homogeneous Dirichlet boundary conditions on $\partial\Omega$. For the weak formulation we define
$$a(u,v) = \int_\Omega \nabla u \cdot \nabla v \, dx \quad \text{on } V = H^1_0(\Omega).$$

The discretization with bi-/trilinear finite elements on a uniform mesh results in a 9-point stencil in 2D and a 27-point stencil in 3D [5, Chap. II.§5]. The resulting matrix is a symmetric positive definite M-matrix.
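The 9-point stencil structure in 2D can be reproduced with a few lines of scipy by the usual tensor-product construction of the bilinear Q1 stiffness matrix from 1D stiffness and mass matrices (a standard textbook construction, added here only as an illustration of the sparsity pattern; it is not the assembly code used in the paper):

    import numpy as np
    import scipy.sparse as sp

    m, h = 31, 1.0 / 32                  # interior grid points and mesh size on (0, 1)
    B = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m)) / h      # 1D stiffness (P1 elements)
    M = sp.diags([1.0, 4.0, 1.0], [-1, 0, 1], shape=(m, m)) * (h / 6)  # 1D mass matrix
    A = (sp.kron(B, M) + sp.kron(M, B)).tocsr()                        # 2D bilinear (Q1) stiffness

    print(np.diff(A.indptr).max())       # 9: each row couples to at most 9 neighbours
    print(abs(A - A.T).max())            # 0.0: the matrix is symmetric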


4.2. Linear elasticity

In the next problem we consider 3D linear elasticity. Here, the boundary $\partial\Omega$ is decomposed into a Dirichlet part $\Gamma_D$ and a Neumann part $\Gamma_N$. We compute the displacement $u$, the strain $\varepsilon(u) = \frac{1}{2}(\nabla u + (\nabla u)^T)$, and the stress $\sigma = 2\mu\,\varepsilon(u) + \lambda(\operatorname{div} u)I$, satisfying $-\operatorname{div}\sigma = 0$ in $\Omega$, $\sigma n = t$ on $\Gamma_N$, and $u = 0$ on $\Gamma_D$, where $t$ is the surface traction and $\lambda, \mu > 0$ are the Lamé parameters. The corresponding bilinear form on $V = \{v \in H^1(\Omega)^3 : v = 0 \text{ on } \Gamma_D\}$ is given by
$$a(u,v) = \int_\Omega 2\mu\, \varepsilon(u) : \varepsilon(v) + \lambda(\operatorname{div} u)(\operatorname{div} v)\, dx,$$
cf. [5, Chap. VI.§3]. Again, on a regular mesh with trilinear elements we obtain an 81-point stencil. The resulting stiffness matrix is symmetric positive definite, but not an M-matrix.

4.3. The convection–diffusion equation

For a given vector field $b\colon \Omega \to \mathbb{R}^d$ we consider the convection–diffusion equation
$$-\varepsilon\Delta u + b \cdot \nabla u = f.$$

For $V_{\mathcal{I}} \subset H^1_0(\Omega)$ we use the streamline-diffusion method defined by the (non-symmetric) bilinear form
$$a(u,v) = \varepsilon \int_\Omega \nabla u \cdot \nabla v\, dx + \int_\Omega (b \cdot \nabla u)\, v\, dx + \sum_{c \in \mathcal{C}} \delta_c \int_{\Omega_c} (b \cdot \nabla u)(b \cdot \nabla v)\, dx$$
with suitable mesh-dependent scaling parameters $\delta_c > 0$ [12, Chap. 3.3.2]. This results in a non-symmetric stiffness matrix.

4.4. The Stokes equations

The Stokes problem for the velocity $u$ and the pressure $p$ is given by
$$-\Delta u + \nabla p = 0, \qquad \operatorname{div} u = 0.$$

For the discretization we use Taylor–Hood serendipity $Q_2/Q_1$ elements in $V = H^1_0(\Omega)^d \times L_{2,0}(\Omega)$ [5, Chap. III.§7]. The bilinear form is given by
$$a((u,p),(v,q)) = \int_\Omega \big(\nabla u \cdot \nabla v + (\operatorname{div} u)\, q + p\, \operatorname{div} v\big)\, dx.$$
The corresponding matrix is symmetric, but indefinite.

4.5. The Maxwell cavity resonator

Finally, we consider the Maxwell cavity resonator [20, Chap. 1.4.2]: for a given current density $f$, find a vector field $u$ such that
$$\nabla\times\big(\mu_r^{-1}\,\nabla\times u\big) - \kappa^2 \varepsilon_r u = f \ \text{ in } \Omega, \qquad \nu\times u = 0 \ \text{ on } \partial\Omega.$$

Here, we use Nédélec elements on tetrahedra in $V = H_0(\operatorname{curl},\Omega)$ and the (indefinite) bilinear form
$$a(u,v) = \int_\Omega \mu_r^{-1}\,(\nabla\times u)\cdot(\nabla\times v)\, dx - \kappa^2 \int_\Omega \varepsilon_r\, u \cdot v\, dx$$
[20, Chap. 5.5.1]. The resulting matrix is symmetric, but indefinite and very ill-conditioned if $\kappa$ is large.

5. Results

The numerical tests are performed on the cluster HC3 of the Steinbuch Centre for Computing (SCC) in Karlsruhe with 332 eight-way compute nodes, where each node has two Intel Xeon quad-core sockets with 2.53 GHz frequency. They are connected by an InfiniBand 4X QDR interconnect [15]. In the algorithm we use BLAS and LAPACK routines for dense matrices, provided by the Intel Math Kernel Library [19], and MUMPS [1,21] or SuperLU [9,27] as solvers for sparse matrices, where MUMPS emerges as the better sparse solver for large matrices. All communication between processors is done via MPI.

For the block LU decomposition defined in Section 3 we use the sparse solver in the first step $s = 0$, i.e. for $k \leq K_0$, for the LU decomposition of the diagonal blocks ($A_{kk} := \mathrm{LU}(A_{kk})$), since all matrices are sparse at the beginning. In the further steps the diagonal blocks are dense, so we use BLAS and LAPACK routines there.


Since the communication part (lines 12...14) in Algorithm 4 takes a long time compared to the computation time for small problems on large processor numbers, we define a parameter $p_{\max}$ as a maximum processor number per cluster in each step. This means that the processor map $p_{s,t}$ maps only onto a subset of $\mathcal{P}_{s,t}$, such that the broadcast of the matrices $A^p_{nk}$ from processor $p$ takes less time. In general, for small problems $p_{\max} = 64$ is sufficient; for the larger problems we take $p_{\max} = 128$ or $p_{\max} = 256$.

We compare our algorithm with MUMPS for the Poisson problem on a square or cube, since this is a standard test problem with a very simple parallel distribution, such that each processor has nearly the same number of unknowns. For all other problems we use more general domains which are distributed by a simple recursive coordinate bisection load balancing. Probably, some of the results can be improved by a better load balancing (in particular for the example in Section 5.5), but this is not the focus of our research. Here, we mainly want to demonstrate that our solver works well for different problem classes with general domains.

5.1. The Poisson problem

For the two- or three-dimensional Poisson problem on the unit square/cube we compare the factorization time of our solver with the parallel direct solver MUMPS [1] on refinement level $l$ for $P = 2^S$ processors. Here, we have an optimal distribution of the domain to the processors (see Table 1 and Fig. 2).

The results for the 3D case are presented in Table 2. For MUMPS a standard configuration for a distributed matrix is chosen. We remark that better results may be obtained with an optimized configuration (with the simple configuration used here, MUMPS could not be applied to all test cases).

As can be seen in Table 2, MUMPS performs better than our solver for moderate processor numbers, since then the Schur complements in step $s = 0$ are very large and the computation becomes inefficient. This is also the reason why our solver is inefficient for $P = 2$; beyond that it shows quite good efficiency until a minimum of the factorization time is reached. If the communication part becomes too expensive, the total factorization time (computation + communication time) grows again.

Fig. 2. Solution u of the Poisson problem in 2D (left) and parallel distribution of 64 processors in 3D (right).

Table 1. Number of degrees of freedom $N = \dim V_{\mathcal{I}}$ (d.o.f.) and non-zero matrix entries (nz-entries: $\#\{(i,j) \in \mathcal{I}\times\mathcal{I} : A[i,j] \neq 0\}$) for the Poisson problem (left: 2D; right: 3D).

2D                                    3D
l    d.o.f.     nz-entries            l    d.o.f.     nz-entries
8    263169     2362369               5    35937      912673
9    1050625    9443329               6    274625     7189057
10   4198401    37761025              7    2146149    57066625

Table 2. Factorization time [s] of the 3D Poisson problem on refinement level $l$ with $P = 2^S$ processors.

             l = 5               l = 6               l = 7
S    P       ALG. 4    MUMPS     ALG. 4    MUMPS     ALG. 4     MUMPS
6    64      1.25      3.48      21.95     –         1833.00    –
7    128     1.75      5.72      15.54     47.89     1155.12    1690.39
8    256     2.57      9.54      10.06     29.52     439.98     1855.60
9    512     5.68      20.24     13.26     54.91     228.34     –


Fig. 3. Comparison of the factorization time of the 3D Poisson problem with the parallel direct solver and MUMPS.


This can be observed for level $l = 6$ of the 3D Poisson problem in Table 2. For larger problems, the optimal factorization time is attained for higher processor numbers (see Fig. 3).

Next, we study the solution time. For a better overview, we expand the range of the processor numbers, in particular to see the minimum of the solution time for $l = 6$. Since less data has to be communicated, the optimal efficiency is achieved for a smaller processor number, but overall we obtain a comparable result as for the factorization time (note that computing times of less than about 0.05 s are not significant within our test framework); see Table 3.

With our solver we achieve a residual reduction on level 5 of about $2.8 \cdot 10^{-14}$ on 4 processors and $3.6 \cdot 10^{-14}$ on 1024 processors. With MUMPS we obtain a reduction of about $2.7 \cdot 10^{-14}$ for any processor number.

Table 3. Solution time [s] of the 3D Poisson problem on refinement level $l$ on $P = 2^S$ processors.

             l = 5               l = 6               l = 7
S    P       ALG. 6    MUMPS     ALG. 6    MUMPS     ALG. 6    MUMPS
3    8       0.03      0.03      0.27      –         –         –
4    16      0.03      0.18      0.15      –         –         –
5    32      0.03      0.18      0.09      –         –         –
6    64      0.04      0.20      0.11      –         1.05      –
7    128     0.07      0.21      0.14      0.52      0.98      2.38
8    256     0.15      0.17      0.21      0.74      1.05      3.02
9    512     0.42      0.78      0.52      1.40      1.66      –

Fig. 4. Factorization time of the 2D Poisson problem with the parallel direct solver and MUMPS.


On level 7 the reduction with our solver is between $2.0 \cdot 10^{-12}$ for 128 processors and $2.5 \cdot 10^{-12}$ for 1024 processors. Note that within the finite element context, in most cases a residual reduction of about $10^{-5}$ is sufficient.

Similar results are obtained for the 2D Poisson problem. Here, we observe a good performance for increasing processor numbers if the problem is large enough. We expect the minimal factorization time on level $l = 10$ for far more than 1024 processors (cf. Fig. 4). Again, our new solver outperforms MUMPS for larger processor numbers (see Tables 4 and 5).

5.2. Linear elasticity

Linear elasticity is tested for the geometry illustrated in Fig. 5. On the bottom we choose Dirichlet boundary conditions and on the top we apply a traction load. We use the Lamé parameters $\mu = 80193.80$, $\lambda = 110743.82$ corresponding to stainless steel. The configuration is taken from [22]. The results are presented in Table 6. Similar results as for the 3D Poisson problem are observed (see Fig. 6).

5.3. Convection–diffusion problem

For this problem class we use the configuration described in [12, Example 3.1.1]. We consider a 2D problem on $\Omega = (-1,1)^2$ with the analytic solution $u(X,Y) = X\,\frac{1 - e^{(Y-1)/\varepsilon}}{1 - e^{-2/\varepsilon}}$. The solution is prescribed on $\partial\Omega$. We set $\varepsilon = 1/200$, so that the solution has a strong boundary layer (see Fig. 7).

Table 4. Factorization time [s] of the 2D Poisson problem on refinement level $l$ on $P = 2^S$ processors.

             l = 8               l = 9               l = 10
S    P       ALG. 4    MUMPS     ALG. 4    MUMPS     ALG. 4    MUMPS
5    32      2.88      4.58      18.68     11.97     155.59    47.98
6    64      1.24      4.55      11.01     12.51     70.46     44.41
7    128     0.94      6.22      4.66      12.77     31.81     44.12
8    256     0.83      8.49      2.40      18.64     12.79     49.46

Table 5. Solution time [s] of the 2D Poisson problem on refinement level $l$ on $P = 2^S$ processors.

             l = 8               l = 9               l = 10
S    P       ALG. 6    MUMPS     ALG. 6    MUMPS     ALG. 6    MUMPS
3    8       0.06      0.09      0.24      0.27      –         –
4    16      0.03      0.07      0.12      0.34      –         –
5    32      0.03      0.07      0.07      0.21      0.28      0.79
6    64      0.03      0.08      0.04      0.32      0.14      0.74
7    128     0.01      0.17      0.04      0.24      0.10      0.74
8    256     0.03      0.34      0.04      0.40      0.10      0.79
9    512     0.07      0.54      0.08      0.47      0.12      1.00

Fig. 5. Surface mesh of the triangulation.


Table 6. Factorization time [s] of the linear elasticity problem on refinement level $l$ on $P = 2^S$ processors.

                 l = 3        l = 4         l = 5
d.o.f.           87363        655875        5079555
nz-entries       6499017      50872329      402541065
S    P
6    64          8.46         466.99        –
7    128         10.35        256.89        –
8    256         11.36        176.56        –
9    512         34.09        249.00        5947.00
10   1024        66.82        578.52        3524.00

Fig. 6. Factorization time of the linear elasticity problem.

Fig. 7. Surface plot of the solution of the convection–diffusion problem.

Table 7. Number of unknowns, non-zero entries, and factorization time [s] of the convection–diffusion problem on refinement level $l$ on $P = 2^S$ processors.

                 l = 8        l = 9        l = 10
d.o.f.           263169       1050625      4198401
nz-entries       2362369      9443329      37761025
S    P
7    128         2.29         24.52        167.47
8    256         2.05         15.99        101.15
9    512         2.25         11.29        69.23
10   1024        2.73         8.26         44.29


Again, our algorithm performs well for the 2D convection–diffusion problem for increasing processor numbers. Since we have the same block matrix structure as for the 2D Poisson problem, we obtain similar results (see Table 7 and Fig. 8).

5.4. The Stokes problem

The Stokes problem is tested for the two-dimensional backward facing step configuration [12, Example 5.1.2]. On the left is an inflow boundary, and a no-flow condition is imposed on the walls (top and bottom of the geometry). On the right side a Neumann condition is applied (see Fig. 9). Again, we obtain for this indefinite system similar results as for the 2D scalar problems, cf. Table 8 and Fig. 10.

Fig. 8. Factorization time [s] of the convection–diffusion problem.

Fig. 9. Solution for the Stokes problem.

Table 8. Factorization time [s] of the Stokes problem on refinement level $l$ on $P = 2^S$ processors.

                 l = 4        l = 5        l = 6
d.o.f.           20355        80131        317955
nz-entries       798345       2179273      12688905
S    P
4    16          0.20         3.21         33.23
5    32          0.26         1.20         9.07
6    64          0.61         0.84         3.38
7    128         0.62         1.02         1.98


Fig. 10. Factorization time of the Stokes problem.


5.5. The Maxwell cavity resonator

For the Maxwell cavity resonator we use the geometry illustrated in Fig. 11. The mesh is provided by CST, Darmstadt [6]. The first eigenvalue of this problem is $\lambda_1 \approx 0.0028$, the 30th eigenvalue is $\lambda_{30} \approx 0.014$. We set $\kappa^2 = 0.01$, so we compute a strongly indefinite problem.

The distribution of the domain to the processors is not optimal, which results in large interfaces, so we do not expect results as good as for the other 3D problems. Nevertheless, a decreasing factorization time for larger problems on higher processor numbers is observable (see Table 9).

Fig. 11. Geometry for the Maxwell cavity resonator.

Table 9. Factorization time [s] of the Maxwell problem on refinement level $l$ on $P = 2^S$ processors.

                 l = 0        l = 1
d.o.f.           63561        507012
S    P
4    16          15.49        –
5    32          9.61         863.36
6    64          8.74         561.95
7    128         9.12         256.64
8    256         –            216.89


Acknowledgement

The authors acknowledge the financial support from BMBF Grant 01IH08014A within the joint research project ASIL (Advanced Solvers Integrated Library).

References

[1] P.R. Amestoy, I.S. Duff, J. Koster, J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM J. Matrix Anal. Appl. 23 (1) (2001) 15–41.
[2] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK: a portable linear algebra library for high-performance computers, 1990.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. Sorensen, LAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
[4] C.W. Bomhof, H.A. van der Vorst, A parallel linear system solver for circuit simulation problems, Numer. Linear Algebra Appl. 7 (7–8) (2000) 649–665.
[5] D. Braess, Finite Elements: Theory, Fast Solvers, and Applications in Solid Mechanics, Cambridge University Press, 1997.
[6] CST – Computer Simulation Technology, Darmstadt. Available from: <http://www.cst.com/>.
[7] K. Dackland, E. Elmroth, B. Kågström, C. Van Loan, Design and evaluation of parallel block algorithms: LU factorization on an IBM 3090 VF/600J, in: Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992, pp. 3–10.
[8] J.W. Demmel, N.J. Higham, R.S. Schreiber, Stability of block LU factorization, Numer. Linear Algebra Appl. 2 (1995) 173–190.
[9] J.W. Demmel, S.C. Eisenstat, J.R. Gilbert, X.S. Li, J.W.H. Liu, A supernodal approach to sparse partial pivoting, SIAM J. Matrix Anal. Appl. 20 (3) (1999) 720–755.
[10] J.W. Demmel, N.J. Higham, Stability of block algorithms with fast level-3 BLAS, ACM Trans. Math. Softw. 18 (3) (1992) 274–291.
[11] J.J. Dongarra, I.S. Duff, D.C. Sorensen, H. Van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1990.
[12] H. Elman, D. Silvester, A. Wathen, Finite Elements and Fast Iterative Solvers with Applications in Incompressible Fluid Dynamics, Oxford University Press, 2005.
[13] K.A. Gallivan, R.J. Plemmons, A.H. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Rev. 32 (1990) 54–135.
[14] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD, 1996.
[15] KIT Cluster HP XC3000 (hc3). Available from: <http://www.scc.kit.edu/dienste/hc3.php>.
[16] G. Karypis, V. Kumar, Analysis of multilevel graph partitioning, in: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, 1995.
[17] G. Karypis, V. Kumar, A parallel algorithm for multilevel graph partitioning and sparse matrix ordering, J. Parallel Distrib. Comput. 48 (1998) 71–95.
[18] R.M.M. Mattheij, Stability of block LU-decompositions of matrices arising from BVP, SIAM J. Algebraic Discrete Methods 5 (1984) 314–331.
[19] Intel MKL. Available from: <http://software.intel.com/en-us/intel-mkl/>.
[20] P. Monk, Finite Element Methods for Maxwell's Equations, Clarendon Press, Oxford, 2003.
[21] MUMPS – a MUltifrontal Massively Parallel sparse direct Solver, version 4.9.2. Available from: <http://graal.ens-lyon.fr/MUMPS/>.
[22] P. Neff, A. Sydow, C. Wieners, Numerical approximation of incremental infinitesimal gradient plasticity, Int. J. Numer. Methods Eng. 77 (2009) 414–436.
[23] E. Polizzi, A. Sameh, SPIKE: a parallel environment for solving banded linear systems, Comput. Fluids 36 (1) (2007) 113–120.
[24] A. Pothen, H.D. Simon, K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. Appl. 11 (1990) 430–452.
[25] O. Schenk, K. Gärtner, Solving unsymmetric sparse systems of linear equations with PARDISO, in: P.M.A. Sloot et al. (Eds.), Computational Science – ICCS 2002, Lecture Notes in Computer Science, vol. 2330, Springer, Berlin, 2002, pp. 355–363.
[26] O. Schenk, K. Gärtner, W. Fichtner, A. Stricker, PARDISO: a high-performance serial and parallel sparse linear solver in semiconductor device simulation, Future Gener. Comput. Syst. 18 (1) (2001) 69–78.
[27] SuperLU, version 3.0. Available from: <http://crd.lbl.gov/xiaoye/SuperLU/>.
[28] T.A. Davis, User's guide for the unsymmetric-pattern multifrontal package (UMFPACK), Tech. Report TR-95-004, Computer and Information Sciences Department, University of Florida, Gainesville, FL, 1995.
[29] A. Toselli, O. Widlund, Domain Decomposition Methods – Algorithms and Theory, Springer-Verlag, 2005.
[30] C. Wieners, Distributed point objects. A new concept for parallel finite elements, in: R. Kornhuber et al. (Eds.), Domain Decomposition Methods in Science and Engineering, Lecture Notes in Computational Science and Engineering, vol. 40, Springer, Berlin, 2005, pp. 175–182.
[31] C. Wieners, A geometric data structure for parallel finite elements and the application to multigrid methods with block smoothing, Comput. Vis. Sci. 13 (4) (2010) 161–175.