Parallel Computing 37 (2011) 742–758
Contents lists available at ScienceDirect
Parallel Computing
journal homepage: www.elsevier.com/locate/parco
A parallel block LU decomposition method for distributed finite element matrices
Daniel Maurer, Christian Wieners ⇑
Institute for Applied and Numerical Mathematics 3, Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany
Article info
Article history: Available online 7 June 2011
Keywords: Parallel computing; Finite elements; Direct solver for linear equations; Block LU decomposition

Abstract
In this work we present a new parallel direct linear solver for matrices resulting from finite element problems. The algorithm follows the nested dissection approach, where the resulting Schur complements are also distributed in parallel. The sparsity structure of the finite element matrices is used to pre-compute an efficient block structure for the LU factors. We demonstrate the performance and the parallel scaling behavior by several test examples.
© 2011 Elsevier B.V. All rights reserved.

0167-8191/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2011.05.007
⇑ Corresponding author. Tel.: +49 721 608 42063; fax: +49 721 608 43197. E-mail addresses: [email protected] (D. Maurer), [email protected] (C. Wieners). URL: http://www.math.kit.edu (C. Wieners).

1. Introduction
Finite element applications often require a fine mesh, which results in several million or billion unknowns. Solving such systems in a reasonable computing time is only possible on a parallel machine. For this purpose several applications require parallel direct solvers, since they are more general and more robust than iterative solvers. For example, preconditioned CG methods are restricted to symmetric positive definite systems, and multigrid or domain decomposition preconditioners also require an efficient parallel coarse problem solver. A direct solver often performs well in cases where many iterative solvers fail, e.g., if the problem is indefinite, unsymmetric, or ill-conditioned.
Our concept of the parallel LU decomposition for matrices resulting from finite element problems is based on a nested dissection approach. On $P = 2^S$ processors, the algorithm uses $S + 1$ steps in which sets of processors are combined consistently and problems between these processors are solved, beginning with one processor in the first step. The first step itself is comparable to the preprocessing step for non-overlapping domain decomposition methods [29], where the resulting Schur complement is then solved by a suitable preconditioned iteration. Here, a parallel LU decomposition for this Schur complement problem is introduced. Nevertheless, a simple nested dissection method fails for finite element applications since, in particular in 3D, the resulting Schur complement problems become large dense problems, so that they have to be solved distributed and in parallel, too.
Block LU factorizations and their analysis have been discussed by various authors, see e.g., [14,10,18,8]. A general purpose algorithm for sparse matrices is realized, e.g., in MUMPS [1] for distributed memory, and, e.g., in PARDISO [25,26] on shared memory machines. Parallel solvers can also be used in hybrid methods for solving subproblems, e.g., in SPIKE [23]. Sequential solvers for sparse matrices such as SUPERLU [9] and UMFPACK [28] can be used to eliminate local degrees of freedom. A more general discussion of block LU decomposition and handling linear systems on parallel computers can be found in [7,11,13]. A block LU decomposition method with iterative methods for the Schur complement can be found, e.g., in [4].
Our new parallel solver explicitly uses the structure of the finite element matrix. Thus, it is not a "black box" solver: the knowledge of the structure of the finite element matrix and the decomposition of the domain to the processors is essential. Our new contribution is the efficient and transparent use of this structure for the parallel distribution of the elimination steps. In particular, we can identify a priori parts of the resulting LU decomposition which remain zero, so that they can be ignored during the algorithm.
The paper is organized as follows. In Section 2 a general setting for parallel finite elements is introduced which leads to a parallel block structure for the corresponding matrix. The mesh is first distributed on $P$ processors and then refined locally $l$ times, such that each processor handles a local part of the mesh. Then, a suitable block LU decomposition is defined and the parallel realization is discussed in Section 3. In Section 4 we introduce several finite element problems which are used in Section 5 for the evaluation of the parallel performance of our direct solver. In some cases the results are compared with the parallel direct solver MUMPS [1].
2. Parallel finite elements
Following [30,31], we define a parallel additive representation of the stiffness matrix which directly corresponds to a parallel domain decomposition. This additive representation is the basis for the LU algorithm discussed in the next section.
2.1. The finite element setting
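Let $\Omega \subset \mathbb{R}^d$ ($d = 2, 3$) be a reference domain, and let $V \subset L_2(\Omega; \mathbb{R}^m)$ be a Hilbert space. We consider finite element approximations of the linear variational equation
\[ \text{find } u \in V \text{ such that } a(u, v) = \ell(v) \text{ for all } v \in V, \tag{1} \]
where $a(\cdot,\cdot) : V \times V \to \mathbb{R}$ is a bilinear form and $\ell(\cdot) : V \to \mathbb{R}$ is a functional.
Let $V_{\mathcal{I}} = \operatorname{span}\{\phi_i : i \in \mathcal{I}\}$ be a finite element space with basis $\phi_i$, where $\mathcal{I}$ is the corresponding index set of size $|\mathcal{I}| = N$. The finite element approximation is the Galerkin projection of (1):
\[ \text{find } u_{\mathcal{I}} \in V_{\mathcal{I}} \text{ such that } a(u_{\mathcal{I}}, v_{\mathcal{I}}) = \ell(v_{\mathcal{I}}) \text{ for all } v_{\mathcal{I}} \in V_{\mathcal{I}}. \tag{2} \]
Inserting the basis functions $\phi_i$, we obtain the linear matrix problem
\[ \text{find } x \in \mathbb{R}^N \text{ such that } Ax = b, \tag{3} \]
where $A = (A[i,j])_{i,j \in \mathcal{I}} \in \mathbb{R}^{N \times N}$ is the stiffness matrix with entries $A[i,j] = a(\phi_i, \phi_j)$, and $b = (\ell(\phi_i))_{i \in \mathcal{I}} \in \mathbb{R}^N$ is the right-hand side. This defines the finite element solution $u_{\mathcal{I}} = \sum_{i \in \mathcal{I}} x_i \phi_i$.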
We assume that the finite element discretization is based on a decomposition into a set of cells $\mathcal{C}$, such that
\[ \Omega = \bigcup_{c \in \mathcal{C}} \Omega_c, \]
where the cell domain is denoted by $\Omega_c$. In the finite element context, the bilinear form $a(\cdot,\cdot)$ allows for a corresponding cell-based additive decomposition
\[ a(v, w) = \sum_{c \in \mathcal{C}} a_c(v, w), \qquad v, w \in V. \]
We assume that the finite element basis functions $\phi_i$ are associated to nodal points $z_i \in \Omega$, and that
\[ \operatorname{supp} \phi_i \subset \mathcal{C}_i = \{c \in \mathcal{C} : z_i \in \Omega_c\}. \tag{4} \]
Then, we have $a_c(\phi_i, \phi_j) \neq 0$ only if $c \in \mathcal{C}_i \cap \mathcal{C}_j$.
2.2. Parallel load balancing
Let $\mathcal{P} = \{1, \dots, P\}$ be the set of processors. For simplicity, we assume $P = 2^S$. On $\mathcal{P}$, a load balancing determines a disjoint decomposition $\mathcal{C} = \mathcal{C}^1 \cup \dots \cup \mathcal{C}^P$. This corresponds to a non-overlapping domain decomposition such that we have
\[ \Omega = \bigcup_{p \in \mathcal{P}} \Omega^p, \qquad \Omega^p = \bigcup_{c \in \mathcal{C}^p} \Omega_c. \tag{5} \]
We assume that $\mathcal{C}^p \neq \emptyset$ for $p = 1, \dots, P$.
The efficiency of the block LU decomposition relies on small processor interfaces for the final steps. This is a well-known graph-partitioning problem [16,17]. Examples for suitable load balancing algorithms are recursive coordinate bisection [16] or spectral bisection algorithms [24].
Depending on the domain decomposition (5) we define $\mathcal{I}^p = \{i \in \mathcal{I} : z_i \in \Omega^p\}$ and $N^p = |\mathcal{I}^p|$ for each processor $p \in \mathcal{P}$. For $i \in \mathcal{I}$ we define
\[ \pi(i) = \{p \in \mathcal{P} : i \in \mathcal{I}^p\}. \]
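2.2.1. Parallel assembling and parallel block structure
On every processor $p \in \mathcal{P}$ we compute the entries of the local stiffness matrix
\[ A^p = (A^p[i,j])_{i,j \in \mathcal{I}^p} \in \mathbb{R}^{N^p \times N^p}, \qquad A^p[i,j] = \sum_{c \in \mathcal{C}^p} a_c(\phi_i, \phi_j). \]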
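Now, we introduce a further block structure which decomposes $A^p$. Let $\Pi = 2^{\mathcal{P}}$ be the set of all possible processor sets. We define the subset of all active processor sets $\Pi_{\mathcal{I}} = \{\pi(i) \in \Pi : i \in \mathcal{I}\}$ with respect to the corresponding matrix graph. A suitable numbering of the active processor sets
\[ \Pi_{\mathcal{I}} = \{\pi_1, \pi_2, \dots, \pi_K\} \tag{6} \]
directly induces a non-overlapping decomposition
\[ \mathcal{I} = \mathcal{I}_1 \cup \dots \cup \mathcal{I}_K \quad \text{with} \quad \mathcal{I}_k = \{i \in \mathcal{I} : \pi(i) = \pi_k\}. \]
Note that we have $\mathcal{I}_k \subset \mathcal{I}^p$ for all $p \in \pi_k$. We have $N = \sum_{k=1}^{K} N_k$ and $N^p = \sum_{p \in \pi_k} N_k$ (the sum runs over all $k$ with $p \in \pi_k$), where $N_k = |\mathcal{I}_k|$. The corresponding matrix blocks are
\[ A^p_{km} = (A^p[i,j])_{i \in \mathcal{I}_k,\, j \in \mathcal{I}_m} \in \mathbb{R}^{N_k \times N_m}, \tag{7} \]
which are a part of the local stiffness matrix $A^p$ on processor $p$ and represent the block matrices of the stiffness matrix $A$ additively by
\[ A_{km} = \sum_{p \in \mathcal{P}} A^p_{km}. \tag{8} \]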
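Lemma 1. The block matrix $A^p_{km}$ is zero for $p \notin \pi_k \cap \pi_m$, and therefore $A_{km} = \sum_{p \in \pi_k \cap \pi_m} A^p_{km}$.

Proof. It holds $A^p_{km} = (A^p[i,j])_{i \in \mathcal{I}_k,\, j \in \mathcal{I}_m} = \bigl(\sum_{c \in \mathcal{C}^p} a_c(\phi_i, \phi_j)\bigr)_{i \in \mathcal{I}_k,\, j \in \mathcal{I}_m}$. Now we fix some $i \in \mathcal{I}_k$ and $j \in \mathcal{I}_m$. Let $p \notin \pi_k \cap \pi_m$ and $c \in \mathcal{C}^p$. Assume $a_c(\phi_i, \phi_j) \neq 0$, which implies $c \in \mathcal{C}_i \cap \mathcal{C}_j$. By (4), we have $z_i \in \Omega_c$, $z_j \in \Omega_c$, and since $c \in \mathcal{C}^p$, by (5) also $z_i, z_j \in \Omega^p$. This gives $i, j \in \mathcal{I}^p$, i.e. $p \in \pi(i) = \pi_k$ and $p \in \pi(j) = \pi_m$. This contradicts $p \notin \pi_k \cap \pi_m$. □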
2.2.2. Example in 2D
We consider a square partitioned into four subdomains as in Fig. 1, where each part is distributed to one processor. This partition defines our active processor set $\Pi_{\mathcal{I}} = \{\pi_1, \dots, \pi_9\}$ with $\pi_1 = \{1\}$, $\pi_2 = \{2\}$, $\pi_3 = \{3\}$, $\pi_4 = \{4\}$, $\pi_5 = \{1,2\}$, $\pi_6 = \{3,4\}$, $\pi_7 = \{1,3\}$, $\pi_8 = \{2,4\}$, $\pi_9 = \{1,2,3,4\}$. The index set $\mathcal{I}$ is represented by the "big dots", where the different colors indicate membership in the different active processor sets (see Fig. 1).
The (global) matrix $A$ is divided into a $9 \times 9$ block matrix, where the block matrices themselves are represented additively on the four processors. Note that all matrices $A^p_{km}$, $p = 1, \dots, 4$, are of the same size. Most of the block matrices are zero, e.g., on each processor only 16 of 81 matrices are nonzero. Moreover, the global block matrix $A_{km} = A^1_{km} + \dots + A^4_{km}$ is zero if $\pi_k \cap \pi_m = \emptyset$.
This example shows that in 2D and with a small number of processors the major part of the computation is done completely in parallel in the initial step (the indices corresponding to the interior of the subdomains can be eliminated independently, cf. Section 3). This changes completely in 3D and for a large processor number $P$, where the efficient parallel LU decomposition of the Schur complements is the challenging task.
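The block counts of this example can be checked with a few lines of Python. The following is only a sketch: it encodes the nine active processor sets listed above and applies the criterion of Lemma 1; the function names are hypothetical and not part of the solver.

```python
from itertools import product

# Active processor sets pi_1, ..., pi_9 of the 2x2 example (numbering as in the text).
PI = [{1}, {2}, {3}, {4}, {1, 2}, {3, 4}, {1, 3}, {2, 4}, {1, 2, 3, 4}]


def local_nonzero_blocks(p):
    """Block pairs (k, m), 1-based, that may be nonzero on processor p.

    By Lemma 1, A^p_km vanishes unless p lies in both pi_k and pi_m.
    """
    return [(k + 1, m + 1)
            for k, m in product(range(len(PI)), repeat=2)
            if p in PI[k] and p in PI[m]]


def global_nonzero_blocks():
    """Block pairs (k, m) for which the global block A_km may be nonzero."""
    return [(k + 1, m + 1)
            for k, m in product(range(len(PI)), repeat=2)
            if PI[k] & PI[m]]


for p in (1, 2, 3, 4):
    print("processor", p, "->", len(local_nonzero_blocks(p)), "of", len(PI) ** 2)
print("global blocks that may be nonzero:", len(global_nonzero_blocks()))
```

Each processor belongs to four of the nine sets, which reproduces the 16 of 81 blocks stated above.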
2.3. The block LU decomposition
We define a standard block LU decomposition $A = LU$ of the stiffness matrix $A = (A_{km})_{k,m=1,\dots,K}$ with a lower and an upper triangular $K \times K$ block matrix $L = (L_{km})_{k,m=1,\dots,K}$ and $U = (U_{km})_{k,m=1,\dots,K}$ with $U_{kk} = I_{N_k}$. As usual we can store the result in the blocks $A_{km}$ [14, Chap. 3.2]. We do not examine whether any further pivoting of the block matrices is necessary in the indefinite case, but each diagonal block matrix itself is decomposed with the pivoting strategies of the SUPERLU [9] or LAPACK [3,2] routines. Nevertheless, in our finite element context with well-posed problems the subproblems related to the Schur complements are invertible.

Fig. 1. A square distributed on four processors.
We require three block matrix operations:
• $A_{kk} := \mathrm{LU}(A_{kk})$
Here (e.g., using a suitable LAPACK routine [3,2]), we compute an LU decomposition of $A_{kk}$, and the result is stored in the matrix entries $A_{kk}$. Local pivoting is performed by the subroutine.
• $A_{km} := \mathrm{Solve}(A_{kk}, A_{km})$
Here (using the previously determined LU decomposition) we compute $A_{kk}^{-1} A_{km}$. Again, the result is stored in $A_{km}$.
• $A_{nm} := A_{nm} - A_{nk} A_{km}$
This operation is realized by one BLAS3 call.
Then, the block LU decomposition can be written as follows:
Algorithm 1.
FOR $k = 1, \dots, K$
    $A_{kk} := \mathrm{LU}(A_{kk})$
    FOR $m = k + 1, \dots, K$
        $A_{km} := \mathrm{Solve}(A_{kk}, A_{km})$
        FOR $n = k + 1, \dots, K$
            $A_{nm} := A_{nm} - A_{nk} A_{km}$

The algorithm ends with $L_{km} := A_{km}$ for $k \geq m$ and $U_{km} := A_{km}$ for $k < m$.
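A dense, sequential sketch of Algorithm 1 may help to fix the conventions (identity diagonal blocks in $U$, result stored in place). It uses NumPy instead of LAPACK/BLAS calls, keeps the diagonal block itself rather than its LU factors, and the block sizes and the test matrix are arbitrary assumptions.

```python
import numpy as np


def block_lu(A, sizes):
    """In-place block LU as in Algorithm 1, with U_kk = I. Returns the block offsets."""
    off = np.concatenate(([0], np.cumsum(sizes)))
    K = len(sizes)

    def blk(k, m):
        return slice(off[k], off[k + 1]), slice(off[m], off[m + 1])

    for k in range(K):
        # L_kk is the current diagonal block; a real code would store its LU factors.
        Akk = A[blk(k, k)].copy()
        for m in range(k + 1, K):
            A[blk(k, m)] = np.linalg.solve(Akk, A[blk(k, m)])   # A_km := A_kk^{-1} A_km
            for n in range(k + 1, K):
                A[blk(n, m)] -= A[blk(n, k)] @ A[blk(k, m)]     # Schur complement update
    return off


def split_factors(A, off):
    """Extract L (blocks with k >= m) and U (blocks with k < m, identity diagonal)."""
    K = len(off) - 1
    L = np.zeros_like(A)
    U = np.eye(A.shape[0])
    for k in range(K):
        for m in range(K):
            rows, cols = slice(off[k], off[k + 1]), slice(off[m], off[m + 1])
            if k >= m:
                L[rows, cols] = A[rows, cols]
            else:
                U[rows, cols] = A[rows, cols]
    return L, U


rng = np.random.default_rng(0)
sizes = [3, 2, 4]                                   # hypothetical block sizes N_k
N = sum(sizes)
A0 = rng.standard_normal((N, N)) + N * np.eye(N)    # well-conditioned test matrix
A = A0.copy()
off = block_lu(A, sizes)
L, U = split_factors(A, off)
print(np.allclose(L @ U, A0))
```

The final check compares the product of the extracted block factors with the original matrix.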
3. A parallel block LU decomposition
Now we introduce a parallel block LU decomposition based on the blocks associated to the processor sets $\pi_k$. We will see below that our algorithm is equivalent to a recursive Schur complement reduction method.
We start with a suitable numbering of the processors
\[ \mathcal{P} = \{p_1, p_2, \dots, p_{2^S}\}. \tag{9} \]
The block LU decomposition is based on a recursive definition of combined processor sets $\mathcal{P}^{s,t}$ in each step $s$ with a cluster number $t = 1, \dots, 2^{S-s}$, such that $\bigcup_{t=1,\dots,2^{S-s}} \mathcal{P}^{s,t} = \mathcal{P}$. For $s = 0$ each processor set $\mathcal{P}^{0,t}$ consists of exactly one processor, $\mathcal{P}^{0,t} = \{p_t\}$; the processor sets in the further steps $s$ are built of two processor sets from the previous step $s - 1$, such that each processor is located in exactly one set in each step.
In the first step $s = 0$, on every processor $p \in \mathcal{P}$ we eliminate all indices $i \in \mathcal{I}^p$ in the interior of the subdomains, i.e., with $\pi(i) = \{p\}$. This can be done independently on each processor. Then, we build locally the Schur complement on the skeleton, which is given by all indices $i \in \mathcal{I}^p$ with $|\pi(i)| > 1$. Again, the Schur complement is represented additively.
In the next step $s = 1$ we eliminate the indices on the interface of two processors, given by the processor set $\mathcal{P}^{1,t}$, i.e. all indices $i$ with $\pi(i) \subset \mathcal{P}^{1,t}$ which have not yet been eliminated. Therefore we first have to sum up the previous Schur complements of the two combined processor sets. Then, the indices can be eliminated and the new Schur complement, given additively by the processor sets $\mathcal{P}^{1,t}$, is built.
This procedure is executed recursively until only one processor set exists in step $s = S$, where all remaining indices are eliminated.
Since the Schur complements become large in the final steps, we distribute the columns of the (local) Schur complement in a cyclic way to the processors in the processor set $\mathcal{P}^{s,t}$.
3.1. The set of interfaces
Depending on the processor numbering (9), we define for each step $s = 0, \dots, S$ a cluster of size $T_s = 2^{S-s}$ of combined processor sets
\[ \mathcal{P}^{s,t} = \{p_j : 2^s (t - 1) < j \leq 2^s t\}, \qquad t = 1, \dots, T_s, \tag{10} \]
which results in a disjoint decomposition $\mathcal{P} = \mathcal{P}^{s,1} \cup \dots \cup \mathcal{P}^{s,T_s}$ for all steps $s$.
For every $\pi \in \Pi_{\mathcal{I}}$ we define the associated step by
\[ s(\pi) = \min\{s \mid \exists\, t \in \{1, \dots, T_s\} : \pi \subset \mathcal{P}^{s,t}\}, \tag{11} \]
which results in
\[ \Pi_s = \{\pi \in \Pi_{\mathcal{I}} : s(\pi) = s\}, \qquad \Pi_{s,t} = \{\pi \in \Pi_s : \pi \subset \mathcal{P}^{s,t}\}. \]
Now, we can define a consecutive numbering for the disjoint decomposition
\[ \Pi_{\mathcal{I}} = \Pi_0 \cup \dots \cup \Pi_S = \bigcup_{s=0}^{S} \bigcup_{t=1}^{T_s} \Pi_{s,t}. \]
For $s = 0$ we set $\pi_k = \{k\}$ and $\Pi_{s,k} = \{\pi_k\}$, $k = 1, \dots, P$, and for $s = 1, \dots, S$ we set $\Pi_{s,t} = \{\pi_k : k = K_{s,t-1} + 1, \dots, K_{s,t}\}$ with $K_{0,t} = t$, $t = 0, \dots, P$, and $K_{s,0} = K_{s-1,T_{s-1}}$. This results in $\Pi_s = \{\pi_k : k = K_{s-1} + 1, \dots, K_s\}$ with $K_{-1} = 0$ and $K_s = K_{s,T_s}$. We set $K = K_S$. Together, this defines the numbering (6) of the active processor sets which is used in the block LU decomposition.
In principle, any processor numbering results in a processor set numbering and a parallel block LU decomposition. Nevertheless, the numbering is important for the parallel efficiency of the block LU decomposition, since the size of the processor interfaces in the final steps strongly depends on this numbering.
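The numbering defined by (10) and (11) can be sketched as a sort by step and cluster. The code below is only an illustration: it identifies $p_j$ with $j$, and the order inside one cluster (by size, then lexicographically) is an assumption chosen so that the result matches the numbering of the 2D example in Section 2.2.2.

```python
def combined_sets(S):
    """Combined processor sets P^{s,t} from (10), with processors numbered p_j = j."""
    return {(s, t): set(range(2**s * (t - 1) + 1, 2**s * t + 1))
            for s in range(S + 1)
            for t in range(1, 2**(S - s) + 1)}


def step_and_cluster(pi, S, P_st):
    """Smallest step s, and its cluster t, such that pi is contained in P^{s,t}; cf. (11)."""
    for s in range(S + 1):
        for t in range(1, 2**(S - s) + 1):
            if pi <= P_st[(s, t)]:
                return s, t
    raise ValueError("pi is not a subset of the processor set")


def number_processor_sets(active, S):
    """Order the active processor sets by step, then cluster.

    The tie-break inside a cluster (size, then sorted members) is an assumption.
    """
    P_st = combined_sets(S)
    return sorted(active,
                  key=lambda pi: (*step_and_cluster(pi, S, P_st), len(pi), sorted(pi)))


# Active processor sets of the four-processor example, deliberately unordered.
active = [frozenset(s) for s in
          ({1, 2, 3, 4}, {2, 4}, {3}, {1, 2}, {1}, {3, 4}, {4}, {1, 3}, {2})]
for k, pi in enumerate(number_processor_sets(active, S=2), start=1):
    print(f"pi_{k} = {sorted(pi)}")
```

Sets that first fit into a combined processor set at step $s$ are numbered after all sets of the earlier steps, which is exactly what the elimination order requires.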
3.2. Reduction of the basic block LU decomposition
With the definition of the step $s$, the cluster $t$ and the numbering for $\Pi_{s,t}$, Algorithm 1 has the following form:

Algorithm 2.
1 FOR $s = 0, \dots, S$
2     FOR $t = 1, \dots, T_s$
3         FOR $k = K_{s,t-1} + 1, \dots, K_{s,t}$
4             $A_{kk} := \mathrm{LU}(A_{kk})$
5             FOR $m = k + 1, \dots, K$
6                 $A_{km} := \mathrm{Solve}(A_{kk}, A_{km})$
7                 FOR $n = k + 1, \dots, K$
8                     $A_{nm} := A_{nm} - A_{nk} A_{km}$

Note that the lines (1...3) coincide with $k = 1, \dots, K$.
Now, we identify block matrices which remain zero within the algorithm, so that only a subset of the block matrices has to be computed. To this end, we define the actual operating set for the cluster $t = 1, \dots, T_s$ in step $s$ by
\[ \mathcal{K}^{s,t} = \{k \in \{K_{s,t-1} + 1, \dots, K\} : \pi_k \cap \mathcal{P}^{s,t} \neq \emptyset\}. \]
Let $\mathcal{K}^{s,t}_{LU} := \{K_{s,t-1} + 1, \dots, K_{s,t}\}$ and $\mathcal{K}^{s}_{LU} := \{K_{s-1} + 1, \dots, K_s\}$. Furthermore, let $\mathcal{K}^{s,t}_{D} = \{k \in \{K_{s,t-1} + 1, \dots, K\} : \pi_k \cap \mathcal{P}^{s,t} = \emptyset\}$ be the complement of $\mathcal{K}^{s,t}$. We note that $\mathcal{K}^{s,t}_{LU} \subset \mathcal{K}^{s,t}$ since $\pi_k \subset \mathcal{P}^{s,t}$ for $k \in \mathcal{K}^{s,t}_{LU}$. The Schur complement in step $s$ is denoted by $A^{\mathrm{Sch}}_s := (A_{nm})_{n,m = K_s + 1, \dots, K}$.
Lemma 2. For each step $s$ in Algorithm 2, $A_{km} = 0$ and $A_{mk} = 0$ hold during the whole algorithm for $k \in \mathcal{K}^{s,t}_{LU}$, $m \in \mathcal{K}^{s,t}_{D}$.

Remark 3. Since $\mathcal{K}^{s,\tilde{t}}_{LU} \cap \mathcal{K}^{s,t} = \emptyset$ for $\tilde{t} \neq t$, we obtain from Lemma 2 a block-diagonal structure of $(A_{kl})_{k,l \in \mathcal{K}^{s}_{LU}} = \operatorname{diag}_{t=1,\dots,T_s} (A_{kl})_{k,l \in \mathcal{K}^{s,t}_{LU}}$. Thus, the loop in line 2 of Algorithm 2 can be computed in parallel, resulting in an additive Schur complement. With Lemma 2 the loop in line 5 of Algorithm 2 can also be restricted to $m \in \mathcal{K}^{s,t}$, $m > k$. Furthermore, the matrices $A_{nm}$ are not changed by Algorithm 2 in line 8 if $n$ or $m \in \mathcal{K}^{s,t}_{D}$, since $A_{km} = 0$ or $A_{nk} = 0$, such that the loop in line 7 of Algorithm 2 can be restricted to $n \in \mathcal{K}^{s,t}$, $n > k$.

Proof. We prove by induction that $A_{km} = 0$ and $A_{mk} = 0$ at the beginning of each step $s$ for $k \in \mathcal{K}^{s,t}_{LU}$, $m \in \mathcal{K}^{s,t}_{D}$. Let $s = 0$. Then $\mathcal{K}^{s,t}_{LU} = \{t\}$ and $\mathcal{P}^{s,t} = \{t\}$ for $t = 1, \dots, T_0$ by definition, and $\mathcal{K}^{s,t} = \{k \geq t : \pi_k \cap \{t\} \neq \emptyset\}$, $\mathcal{K}^{s,t}_{D} = \{k \geq t : \pi_k \cap \{t\} = \emptyset\}$. In this step we can identify an element $k \in \mathcal{K}^{s,t}_{LU}$ with $k = t$. Let $m \in \mathcal{K}^{s,t}_{D}$. Then $\pi_m \cap \{t\} = \emptyset$. Since $\{t\} = \pi_k$ for $k \in \mathcal{K}^{s,t}_{LU}$, we have $A_{km} = 0$ and $A_{mk} = 0$ as a consequence of Lemma 1. Furthermore, for the matrices $A_{nm}$ in the Schur complement $A^{\mathrm{Sch}}_0$ we have $A_{nm} = 0$ for $\pi_n \cap \pi_m = \emptyset$, $n, m > K_0 = T_0$, also by Lemma 1.
We prove that all zero matrices do not change in this step. There are three operations in the algorithm where matrices are changed. The first one is $A_{kk} := \mathrm{LU}(A_{kk})$, but this only involves $A_{kk}$ with $k \in \mathcal{K}^{s,t}_{LU}$ and $k \notin \mathcal{K}^{s,t}_{D}$, so $A_{kk} \neq 0$. The second one is $A_{km} := \mathrm{Solve}(A_{kk}, A_{km})$. If $A_{km}$ is a zero matrix, the result is again a zero matrix, such that this matrix does not change its structure. The third one is $A_{nm} := A_{nm} - A_{nk} A_{km}$. To change a matrix $A_{nm}$, both $A_{nk} \neq 0$ and $A_{km} \neq 0$ are required. Thus, $k \in \pi_n$ and $k \in \pi_m$, but then $\pi_n \cap \pi_m \supset \{k\} \neq \emptyset$.
For the induction step, consider $s > 0$. Let $n \in \mathcal{K}^{s,t}_{LU}$, $m \in \mathcal{K}^{s,t}_{D}$. Then, $\pi_n \subset \mathcal{P}^{s,t}$ and $\pi_m \cap \mathcal{P}^{s,t} = \emptyset$, so $\pi_m \cap \pi_n = \emptyset$. Therefore $A_{nm} = 0$ at the beginning of the algorithm. We have to show that $A_{nm}$ has not been changed before step $s$. The argument for the first two operations is the same as before, so we only consider the operation $A_{nm} := A_{nm} - A_{nk} A_{km}$ for $k < n$. There is no change of $A_{nm}$ if either $A_{nk} = 0$ or $A_{km} = 0$, so we have to show that if $A_{nk} \neq 0$, then $A_{km} = 0$, $k < n$.
Let $k \in \mathcal{K}^{r,\tau}_{LU}$ with $r < s$, $\tau \in \{1, \dots, T_r\}$ and $A_{nk} \neq 0$. Then, $n \in \mathcal{K}^{r,\tau}$ and therefore $\pi_n \cap \mathcal{P}^{r,\tau} \neq \emptyset$ by the induction hypothesis. We have to show that $A_{km} = 0$. We show this by contradiction: let $A_{km} \neq 0$. By the induction hypothesis we have $m \in \mathcal{K}^{r,\tau}$, which implies $\pi_m \cap \mathcal{P}^{r,\tau} \neq \emptyset$. As one can see from definition (10), we have $\mathcal{P}^{s,t} = \mathcal{P}^{s-1,\tau_1} \cup \mathcal{P}^{s-1,\tau_2}$ for all $s > 0$ and $\mathcal{P}^{s,t_1} \cap \mathcal{P}^{s,t_2} = \emptyset$ for $t_1 \neq t_2$. But this means that $\mathcal{P}^{r,\tau} \subset \mathcal{P}^{s,t}$, since $\pi_n \cap \mathcal{P}^{r,\tau} \neq \emptyset$ and $\pi_n \subset \mathcal{P}^{s,t}$. Therefore $\pi_m \cap \mathcal{P}^{s,t} \neq \emptyset$, but this is a contradiction. □
Using Lemma 2 and Remark 3, Algorithm 2 can be reduced to
Algorithm 3.
FOR $s = 0, \dots, S$
    FOR $t = 1, \dots, T_s$
        FOR $k = K_{s,t-1} + 1, \dots, K_{s,t}$
            $A_{kk} := \mathrm{LU}(A_{kk})$
            FOR $m \in \mathcal{K}^{s,t}$, $m > k$
                $A_{km} := \mathrm{SOLVE}(A_{kk}, A_{km})$
                FOR $n \in \mathcal{K}^{s,t}$, $n > k$
                    $A_{nm} := A_{nm} - A_{nk} A_{km}$
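To make the reduction concrete, the following sketch computes the operating sets $\mathcal{K}^{s,t}$ for the four-processor example of Section 2.2.2 and counts the block updates $A_{nm} := A_{nm} - A_{nk} A_{km}$ visited by the full loops and by the restricted loops. The helper names are hypothetical, processors are identified with their numbers, and every cluster is assumed to eliminate at least one index set.

```python
# Active processor sets pi_1, ..., pi_9 in the numbering of the 2D example.
PI = [{1}, {2}, {3}, {4}, {1, 2}, {3, 4}, {1, 3}, {2, 4}, {1, 2, 3, 4}]
S = 2


def P_set(s, t):
    """Combined processor set P^{s,t} from (10), with p_j = j."""
    return set(range(2**s * (t - 1) + 1, 2**s * t + 1))


def eliminated_in(s, t):
    """K_LU^{s,t}: indices k whose pi_k first fits into P^{s,t} in step s."""
    def fits(r, pi):
        return any(pi <= P_set(r, u) for u in range(1, 2**(S - r) + 1))
    return [k for k, pi in enumerate(PI, start=1)
            if pi <= P_set(s, t) and not any(fits(r, pi) for r in range(s))]


def operating_set(s, t):
    """K^{s,t}: indices k > K_{s,t-1} whose pi_k intersects P^{s,t}."""
    first = min(eliminated_in(s, t))      # equals K_{s,t-1} + 1 for non-empty clusters
    return [k for k in range(first, len(PI) + 1) if PI[k - 1] & P_set(s, t)]


def block_updates_full():
    """Number of (n, m) updates visited when m and n run over k+1, ..., K."""
    K = len(PI)
    return sum((K - k) ** 2 for k in range(1, K + 1))


def block_updates_reduced():
    """Number of (n, m) updates visited when m and n are restricted to K^{s,t}."""
    total = 0
    for s in range(S + 1):
        for t in range(1, 2**(S - s) + 1):
            K_st = operating_set(s, t)
            for k in eliminated_in(s, t):
                total += sum(1 for m in K_st if m > k) ** 2
    return total


for s in range(S + 1):
    for t in range(1, 2**(S - s) + 1):
        print((s, t), operating_set(s, t))
print(block_updates_full(), block_updates_reduced())
```

Even in this small example most of the block updates of the unrestricted loops act on blocks that are known to be zero.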
3.3. The Schur complement for additive matrices
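Since the matrices obtained from the interface are decomposed additively, we examine the Schur complement for additive matrices. In the first step, we consider a parallel finite element problem given on two processors $P = 2$, i.e. $S = 1$. With respect to Lemma 1 this leads to the additive matrix representation
\[
A = \begin{pmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}
  = \begin{pmatrix} A^1_{11} & 0 & A^1_{13} \\ 0 & A^2_{22} & A^2_{23} \\ A^1_{31} & A^2_{32} & A^1_{33} + A^2_{33} \end{pmatrix}
\]
with $\pi_1 = \{1\}$, $\pi_2 = \{2\}$, $\pi_3 = \{1,2\}$.
We examine Algorithm 3 with $S = 1$, $T_0 = 2$, $T_1 = 1$, $\mathcal{P}^{0,1} = \{1\}$, $\mathcal{P}^{0,2} = \{2\}$, $\mathcal{P}^{1,1} = \{1,2\}$, $K_{0,0} = 0$, $K_{0,1} = 1$, $K_{0,2} = 2$, $K_{1,0} = 2$, $K_{1,1} = 3$, $\mathcal{K}^{0,1} = \{1,3\}$, $\mathcal{K}^{0,2} = \{2,3\}$, $\mathcal{K}^{1,1} = \{3\}$. We start with
\[
A^1 = \begin{pmatrix} A^1_{11} & A^1_{13} \\ A^1_{31} & A^1_{33} \end{pmatrix} \text{ on } p = 1, \qquad
A^2 = \begin{pmatrix} A^2_{22} & A^2_{23} \\ A^2_{32} & A^2_{33} \end{pmatrix} \text{ on } p = 2.
\]
First, we consider step $s = 0$, where the substep $t = 1$ is executed on processor $p = 1$ and the substep $t = 2$ is executed on processor $p = 2$. This results (locally) in
\[
\begin{pmatrix} \mathrm{LU}(A^1_{11}) & (A^1_{11})^{-1} A^1_{13} \\ A^1_{31} & A^{1,\mathrm{Sch}}_{33} \end{pmatrix} \text{ on } p = 1, \qquad
\begin{pmatrix} \mathrm{LU}(A^2_{22}) & (A^2_{22})^{-1} A^2_{23} \\ A^2_{32} & A^{2,\mathrm{Sch}}_{33} \end{pmatrix} \text{ on } p = 2
\]
with $A^{1,\mathrm{Sch}}_{33} := A^1_{33} - A^1_{31} (A^1_{11})^{-1} A^1_{13}$ and $A^{2,\mathrm{Sch}}_{33} := A^2_{33} - A^2_{32} (A^2_{22})^{-1} A^2_{23}$.
In the next step $s = 1$, $A^{\mathrm{Sch}}_{33} = A^{1,\mathrm{Sch}}_{33} + A^{2,\mathrm{Sch}}_{33}$ is computed and decomposed on $p = 1$, which finally leads to
\[
\begin{pmatrix}
\mathrm{LU}(A^1_{11}) & 0 & (A^1_{11})^{-1} A^1_{13} \\
0 & \mathrm{LU}(A^2_{22}) & (A^2_{22})^{-1} A^2_{23} \\
A^1_{31} & A^2_{32} & A^{\mathrm{Sch}}_{33}
\end{pmatrix}
\]
with $A^{\mathrm{Sch}}_{33} = A^1_{33} + A^2_{33} - A^1_{31} (A^1_{11})^{-1} A^1_{13} - A^2_{32} (A^2_{22})^{-1} A^2_{23}$.
This illustrates our algorithm for $S = 1$. For $S > 1$, this procedure is applied recursively on the processor sets, where the Schur complements may consist of more than one block matrix. Then, the columns of the Schur complement are distributed on the processors in $\mathcal{P}^{s,t}$ to reduce the computing time of several operations.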
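The additivity of the Schur complement in the two-processor case can be verified numerically. The sketch below uses random symmetric positive definite stand-ins for the local stiffness matrices $A^1$ and $A^2$; the block sizes and helper names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3 = 4, 3, 2      # hypothetical sizes of the index sets I_1, I_2, I_3


def local_matrix(n_int, n_ifc):
    """Random SPD stand-in for a local matrix A^p (interior block first, interface last)."""
    n = n_int + n_ifc
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)


def local_schur(Ap, n_int):
    """A^{p,Sch}_33 = A^p_33 - A^p_3k (A^p_kk)^{-1} A^p_k3, computed on one processor."""
    Akk, Ak3 = Ap[:n_int, :n_int], Ap[:n_int, n_int:]
    A3k, A33 = Ap[n_int:, :n_int], Ap[n_int:, n_int:]
    return A33 - A3k @ np.linalg.solve(Akk, Ak3)


A1, A2 = local_matrix(n1, n3), local_matrix(n2, n3)

# Step s = 0 on each processor, then the sum formed in step s = 1.
S_additive = local_schur(A1, n1) + local_schur(A2, n2)

# Reference: assemble the global 3x3 block matrix and eliminate blocks 1 and 2 at once.
ni = n1 + n2
A = np.zeros((ni + n3, ni + n3))
A[:n1, :n1], A[:n1, ni:], A[ni:, :n1] = A1[:n1, :n1], A1[:n1, n1:], A1[n1:, :n1]
A[n1:ni, n1:ni], A[n1:ni, ni:], A[ni:, n1:ni] = A2[:n2, :n2], A2[:n2, n2:], A2[n2:, :n2]
A[ni:, ni:] = A1[n1:, n1:] + A2[n2:, n2:]
S_global = A[ni:, ni:] - A[ni:, :ni] @ np.linalg.solve(A[:ni, :ni], A[:ni, ni:])

print(np.allclose(S_additive, S_global))
```

The agreement relies on the zero blocks $A_{12}$ and $A_{21}$, which make the interior part block diagonal so that the two eliminations decouple.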
3.4. The parallel block LU decomposition
We use a (cyclic) distribution of the columns of the block matrices $A_{kn}$ for $k, n \in \mathcal{K}^{s,t}$. This is done independently for every cluster $t$. Therefore, we select a processor map
\[ p^{s,t} : \mathcal{K}^{s,t} \to \mathcal{P}^{s,t} \tag{12} \]
and for $n \in \mathcal{K}^{s,t}$ the block matrices $A_{kn}$ for all $k \in \mathcal{K}^{s,t}$ are computed on processor $p = p^{s,t}(n)$.
This results in the following parallel algorithm:
Algorithm 4.
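1. FOR $s = 0, \dots, S$
2.     FOR $t = 1, \dots, T_s$
3.         IF ($s > 0$)
4.             FOR $n, m \in \mathcal{K}^{s,t}$
5.                 $p := p^{s,t}(m)$
6.                 ON $q := p^{s-1,2t}(m)$: SEND $A^q_{nm}$ TO $p$
7.                 ON $r := p^{s-1,2t+1}(m)$: SEND $A^r_{nm}$ TO $p$
8.                 ON $p$: RECEIVE $A^{(q)}_{nm}$ AND $A^{(r)}_{nm}$, SET $A^p_{nm} := A^{(q)}_{nm} + A^{(r)}_{nm}$
9.         FOR $k = K_{s,t-1} + 1, \dots, K_{s,t}$
10.            $p := p^{s,t}(k)$
11.            ON $p$: $A^p_{kk} := \mathrm{LU}(A^p_{kk})$
12.            FOR $n \in \mathcal{K}^{s,t}$
13.                ON $p$: SEND $A^p_{nk}$ TO $q \in \mathcal{P}^{s,t}$
14.                ON $q \in \mathcal{P}^{s,t}$: RECEIVE $A^{(p)}_{nk}$ AND SET $A^q_{nk} := A^{(p)}_{nk}$
15.            FOR $m \in \mathcal{K}^{s,t}$, $m > k$
16.                $q := p^{s,t}(m)$
17.                ON $q$: $A^q_{km} := \mathrm{SOLVE}(A^q_{kk}, A^q_{km})$
18.                FOR $n \in \mathcal{K}^{s,t}$, $n > k$
19.                    ON $q$: $A^q_{nm} := A^q_{nm} - A^q_{nk} A^q_{km}$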
Matrices denoted by $A^{(q)}_{nm}$ are matrices received from a processor $q$, while any operation with these matrices may be executed on another processor.
In the lines (3...8) the Schur complement given on the two clusters $2t$, $2t + 1$ in step $s - 1$ is combined to one cluster $t$ in step $s$.
3.4.1. Example on 4 processors
We illustrate the steps of Algorithm 4 for the example in Section 2.2.2, where $P = 4$ and $S = 2$. For the distributed matrices the processor numbers are indicated where the matrix is non-zero; zero matrices are denoted by a dot.
In step $s = 0$, the indices associated to $\pi_1, \dots, \pi_4$ are eliminated on all processors. Then, the Schur complement (beginning with $\pi_5$) is communicated within the new processor sets ($\mathcal{P}^{1,1} = \{1,2\}$ and $\mathcal{P}^{1,2} = \{3,4\}$). After this step, the distribution of the matrices changes to
For step $s = 1$, the elements associated to $\pi_5$ are eliminated on the processor set $\mathcal{P}^{1,1}$ and those associated to $\pi_6$ on $\mathcal{P}^{1,2}$, respectively. Note that all matrices belonging to the next Schur complement ($\pi_7, \dots, \pi_9$) are still distributed additively.
The procedure is repeated for the last step. Here, we only have one processor set $\mathcal{P}^{2,1} = \{1,2,3,4\}$. Note that some block matrices which were zero before now have entries, namely $(\pi_7, \pi_8)$ and $(\pi_8, \pi_7)$, and that the distribution of the columns depends on the step number $s$. This results in the final situation, where each block matrix is represented on exactly one processor.
3.5. Solving linear equations with the block LU decomposition
With the parallel LU decomposition we solve the linear system $Ax = b$, where $A$ is the stiffness matrix given in a parallel additive representation
\[ A = (A_{kl})_{k,l=1,\dots,K} \quad \text{with} \quad A_{kl} = \sum_{p \in \pi_k \cap \pi_l} A^p_{kl} \]
and $b$ is the right-hand side, also represented additively by
\[ b = (b_k)_{k=1,\dots,K} \quad \text{with} \quad b_k = \sum_{p \in \pi_k} b^p_k. \]
We use the following operations for the vectors $b_k$:
• $b_k := \mathrm{SOLVE}(A_{kk}, b_k)$
Here (using the LU decomposition of $A_{kk}$) we compute $A_{kk}^{-1} b_k$. The result is stored in $b_k$.
• $b_n := b_n - A_{nk} b_k$
This operation is realized by one BLAS2 call.
From a global point of view, the solving routine using the decomposed matrices Akl is given by
Algorithm 5.
FOR $k = 1, \dots, K$
    $b_k := \mathrm{SOLVE}(A_{kk}, b_k)$
    FOR $n = k + 1, \dots, K$
        $b_n := b_n - A_{nk} b_k$
FOR $n = K - 1, \dots, 1$
    FOR $k = K, \dots, n + 1$
        $b_n := b_n - A_{nk} b_k$

The algorithm ends with $x := b$.
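A sequential sketch of the forward and backward substitution with the stored block factors is given below. It assumes the convention of Section 2.3 (the blocks with $k \geq m$ hold $L$, the blocks with $k < m$ hold $U$ with identity diagonal blocks), uses 0-based block indices, and repeats a compact dense block factorization only to be self-contained; the block sizes and the test data are arbitrary.

```python
import numpy as np


def block_lu(A, off):
    """In-place block LU with U_kk = I; diagonal blocks keep L_kk (no local LU stored)."""
    K = len(off) - 1
    sl = lambda k: slice(off[k], off[k + 1])
    for k in range(K):
        Akk = A[sl(k), sl(k)].copy()
        for m in range(k + 1, K):
            A[sl(k), sl(m)] = np.linalg.solve(Akk, A[sl(k), sl(m)])
            for n in range(k + 1, K):
                A[sl(n), sl(m)] -= A[sl(n), sl(k)] @ A[sl(k), sl(m)]


def block_solve(A, off, b):
    """Solve L U x = b with the block factors stored in A."""
    K = len(off) - 1
    sl = lambda k: slice(off[k], off[k + 1])
    x = b.copy()
    for k in range(K):                       # forward substitution with L
        x[sl(k)] = np.linalg.solve(A[sl(k), sl(k)], x[sl(k)])
        for n in range(k + 1, K):
            x[sl(n)] -= A[sl(n), sl(k)] @ x[sl(k)]
    for n in range(K - 2, -1, -1):           # backward substitution with U, U_nn = I
        for k in range(K - 1, n, -1):
            x[sl(n)] -= A[sl(n), sl(k)] @ x[sl(k)]
    return x


rng = np.random.default_rng(2)
sizes = [3, 2, 4]                            # hypothetical block sizes N_k
off = np.concatenate(([0], np.cumsum(sizes)))
N = int(off[-1])
A0 = rng.standard_normal((N, N)) + N * np.eye(N)
b = rng.standard_normal(N)

A = A0.copy()
block_lu(A, off)
x = block_solve(A, off, b)
print(np.allclose(A0 @ x, b))
```

In the backward sweep every block row $n$ receives contributions from all later blocks, including the last one.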
The parallel realization of Algorithm 5 reads as follows:
Algorithm 6.
FOR $s = 0, \dots, S$
    FOR $t = 1, \dots, T_s$
        SOLVE_L$(s, t)$
FOR $s = S, \dots, 0$
    FOR $t = T_s, \dots, 1$
        SOLVE_U$(s, t)$
with the subroutines for the parallel forward and backward substitution
Algorithm 7. SOLVE_L (s,t)
1. $p := p^{s,t}(K_{s,t-1} + 1)$
2. FOR $k \in \mathcal{K}^{s,t}$
3.     ON $q := p^{s-1,2t}(k)$: SEND $b^q_k$ TO $p$
4.     ON $r := p^{s-1,2t+1}(k)$: SEND $b^r_k$ TO $p$
5.     ON $p$: RECEIVE $b^{(q)}_k$ AND $b^{(r)}_k$, SET $b^p_k := b^{(q)}_k + b^{(r)}_k$
6. FOR $k = K_{s,t-1} + 1, \dots, K_{s,t}$
7.     $p := p^{s,t}(k)$
8.     ON $p$: $b^p_k := \mathrm{SOLVE}(A^p_{kk}, b^p_k)$, SEND $b^p_k$ TO $q \in \mathcal{P}^{s,t}$
9.     ON $q \in \mathcal{P}^{s,t}$: RECEIVE $b^{(p)}_k$, SET $b^q_k := b^{(p)}_k$
10.    FOR $n \in \mathcal{K}^{s,t}$, $n > k$
11.        ON $p$: $b^p_n := b^p_n - A^p_{nk} b^p_k$
12.        IF $k < K_{s,t}$
13.            ON $p$: SEND $b^p_n$ TO $q := p^{s,t}(k + 1)$
14.            ON $q := p^{s,t}(k + 1)$: RECEIVE $b^{(p)}_n$, SET $b^q_n := b^{(p)}_n$
15.        ELSE
16.            ON $p$: SEND $b^p_n$ TO $q \in \mathcal{P}^{s,t}$
17.            ON $q \in \mathcal{P}^{s,t}$: RECEIVE $b^{(p)}_n$, SET $b^q_n := b^{(p)}_n$
Algorithm 8. SOLVE_U (s, t)
FOR $n = K_{s,t}, \dots, K_{s,t-1} + 1$
    FOR $k \in \mathcal{K}^{s,t}$, $k > n$
        $p := p^{s,t}(k)$
        ON $p$: $b^p_n := b^p_n - A^p_{nk} b^p_k$
        IF $k > K_{s,t}$
            ON $p$: SEND $b^p_n$ TO $q := p^{s,t}(k - 1)$
            ON $q := p^{s,t}(k - 1)$: RECEIVE $b^{(p)}_n$, SET $b^q_n := b^{(p)}_n$
        ELSE
            ON $p$: SEND $b^p_n$ TO $q \in \mathcal{P}^{s,t}$
            ON $q \in \mathcal{P}^{s,t}$: RECEIVE $b^{(p)}_n$, SET $b^q_n := b^{(p)}_n$
In Algorithm 7 the lines (2...5) combine the right-hand sides of the respective two clusters. In the lines (8...9) the part $b_k$ of the current right-hand side $b$ is sent to all processors in the current processor set $\mathcal{P}^{s,t}$, while the part of $b$ which will be changed in the algorithm (i.e. $b_n$ for $k < n \in \mathcal{K}^{s,t}$) is just sent to the next processor $p^{s,t}(k + 1)$ (lines (13...14)). At the end the rest of $b$ is sent to the current processor set (lines (16...17)), such that each processor in this set has the same information about the right-hand side. As in Algorithm 4, a vector denoted as $b^{(p)}_n$ is a vector received from processor $p$.
4. Model problems
We define a series of different model problems for our numerical tests. In our notation, $u = u(x)$ is the solution, where $x = (X, Y)^T$ and $x = (X, Y, Z)^T$ are the coordinates in the 2D and 3D case, respectively.
4.1. The Poisson problem
We start with a commonly used benchmark problem, the Poisson equation $-\Delta u = f$ in $\Omega = (0,1)^d$ ($d = 2, 3$) with homogeneous Dirichlet boundary conditions on $\partial\Omega$. For the weak formulation we define
\[ a(u, v) = \int_\Omega \nabla u \cdot \nabla v \, dx \quad \text{on } V = H^1_0(\Omega). \]
The discretization with bi-/trilinear finite elements on a uniform mesh results in a 9-point stencil in 2D and a 27-point stencil in 3D [5, Chap. II.§5]. The resulting matrix is a symmetric positive definite M-matrix.
4.2. Linear elasticity
In the next problem we consider 3D linear elasticity. Here, the boundary $\partial\Omega$ is decomposed into a Dirichlet part $\Gamma_D$ and a Neumann part $\Gamma_N$. We compute the displacement $u$, the strain $\varepsilon(u) = \frac{1}{2}(\nabla u + (\nabla u)^T)$, and the stress $\sigma = 2\mu\,\varepsilon(u) + \lambda (\operatorname{div} u) I$, satisfying $-\operatorname{div} \sigma = 0$ in $\Omega$, $\sigma n = t$ on $\Gamma_N$, and $u = 0$ on $\Gamma_D$, where $t$ is the surface traction and $\lambda, \mu > 0$ are the Lamé parameters. The corresponding bilinear form on $V = \{v \in H^1(\Omega)^3 : v = 0 \text{ on } \Gamma_D\}$ is given by
\[ a(u, v) = \int_\Omega 2\mu\,\varepsilon(u) : \varepsilon(v) + \lambda (\operatorname{div} u)(\operatorname{div} v) \, dx, \]
cf. [5, Chap. VI.§3]. Again, on a regular mesh with trilinear elements we obtain an 81-point stencil. The resulting stiffness matrix is symmetric positive definite, but not an M-matrix.
4.3. The convection–diffusion equation
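For a given vector field $b : \Omega \to \mathbb{R}^d$ we consider the convection–diffusion equation
\[ -\varepsilon \Delta u + b \cdot \nabla u = f. \]
For $V_{\mathcal{I}} \subset H^1_0(\Omega)$ we use the streamline-diffusion method defined by the (non-symmetric) bilinear form
\[ a(u, v) = \varepsilon \int_\Omega \nabla u \cdot \nabla v \, dx + \int_\Omega (b \cdot \nabla u)\, v \, dx + \sum_{c \in \mathcal{C}} \delta_c \int_{\Omega_c} (b \cdot \nabla u)(b \cdot \nabla v) \, dx \]
with suitable mesh-dependent scaling parameters $\delta_c > 0$ [12, Chap. 3.3.2]. This results in a non-symmetric stiffness matrix.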
4.4. The Stokes equations
The Stokes problem for the velocity $u$ and the pressure $p$ is given by
\[ -\Delta u + \nabla p = 0, \qquad \operatorname{div} u = 0. \]
For the discretization we use Taylor–Hood-Serendipity Q2/Q1 elements in $V = H^1_0(\Omega) \times L_{2,0}(\Omega)$ [5, Chap. III.§7]. The bilinear form is given by
\[ a((u, p), (v, q)) = \int_\Omega \bigl(\nabla u \cdot \nabla v + \operatorname{div} u \, q + p \operatorname{div} v\bigr) \, dx. \]
The corresponding matrix is symmetric, but indefinite.
4.5. The Maxwell cavity resonator
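Finally, we consider the Maxwell cavity resonator [20, Chap. 1.4.2]: For a given current density $f$ find a vector field $u$, such that
\[ \nabla \times \bigl(\mu_r^{-1} \nabla \times u\bigr) - \kappa^2 \epsilon_r u = f \quad \text{in } \Omega, \qquad \nu \times u = 0 \quad \text{on } \partial\Omega. \]
Here, we use Nedelec elements on tetrahedra in $V = H_0(\operatorname{curl}, \Omega)$ and the (indefinite) bilinear form
\[ a(u, v) = \int_\Omega \mu_r^{-1}\, \nabla \times u \cdot \nabla \times v \, dx - \kappa^2 \int_\Omega \epsilon_r u \cdot v \, dx \]
[20, Chap. 5.5.1]. The resulting matrix is symmetric, but indefinite and very ill-conditioned if $\kappa$ is large.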
5. Results
The numerical tests are realized on the cluster HC3 of the Steinbuch Centre for Computing (SCC) in Karlsruhe with 332 eight-way compute nodes, where each node has two Intel Xeon Quad Core sockets with 2.53 GHz frequency. They are connected by an InfiniBand 4X QDR interconnect [15]. In the algorithm we use BLAS and LAPACK routines for dense matrices, provided by the Intel Math Kernel Library [19], and MUMPS [1,21] or SUPERLU [9,27] as a solver for sparse matrices, where MUMPS turns out to be the better sparse solver for large matrices. All communication between processors is done by MPI.
For the block LU decomposition defined in Section 3 we use the sparse solver in the first step $s = 0$ for $k \leq K_0$ for the LU decomposition of the diagonal blocks ($A_{kk} := \mathrm{LU}(A_{kk})$), since all matrices are sparse at the beginning. In the further steps the diagonal block matrices are dense, such that we use BLAS and LAPACK routines there.
Since the communication part (lines (12...14)) in Algorithm 4 takes a long time compared to the computation time for small problems on large processor numbers, we define a parameter $p_{\max}$ as a maximum processor number in each step for each cluster. This means that the processor map $p^{s,t}$ maps only to a subset of $\mathcal{P}^{s,t}$, such that the broadcast of the matrices $A^p_{nk}$ from processor $p$ takes less time. In general, for small problems $p_{\max} = 64$ is sufficient; for the larger problems we take $p_{\max} = 128$ or $p_{\max} = 256$.
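One possible realization of a cyclic processor map $p^{s,t}$ with the cap $p_{\max}$ is sketched below. The source only states that the distribution is cyclic and that the map is restricted to a subset of $\mathcal{P}^{s,t}$; which processors are selected and the round-robin order used here are assumptions, and `cyclic_processor_map` is a hypothetical name.

```python
def cyclic_processor_map(operating_set, processors, pmax=None):
    """Assign the block columns n in K^{s,t} cyclically to processors of P^{s,t}.

    Only the first pmax processors (in sorted order) are used; choosing the first
    ones is an assumption of this sketch. pmax=None uses all processors.
    """
    procs = sorted(processors)[:pmax]
    return {n: procs[i % len(procs)] for i, n in enumerate(sorted(operating_set))}


# Cluster (s, t) = (1, 1) of the four-processor example: K^{1,1} = {5, 7, 8, 9}.
print(cyclic_processor_map({5, 7, 8, 9}, {1, 2}))

# Last step of a hypothetical 8-processor run with ten remaining block columns.
print(cyclic_processor_map(range(20, 30), range(1, 9), pmax=4))
```

With a smaller `pmax` fewer processors take part in the broadcast of a block column, at the price of more block columns per participating processor.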
We compare our algorithm with MUMPS for the Poisson problem on a square or cube, since this is a standard test problem with a very simple parallel distribution, such that each processor has nearly the same number of unknowns. For all other problems we use more general domains which are distributed by a simple recursive coordinate bisection load balancing. Probably, some of the results can be improved by a better load balancing (in particular for the example in Section 5.5), but that is not the focus of our research. Here, we mainly want to demonstrate that our solver works well for different problem classes with general domains.
5.1. The Poisson problem
For the two- or three-dimensional Poisson problem on the unit square/cube we compare the factorization time of our solver with the parallel direct solver MUMPS [1] on refinement level $l$ for $P = 2^S$ processors. Here, we have an optimal distribution of the domain to the processors (see Table 1 and Fig. 2).
The results for the 3D case are presented in Table 2. For MUMPS a standard configuration for a distributed matrix is chosen. We remark that better results may be obtained with an optimized configuration (for the simple configuration used here MUMPS could not be applied to all test cases).
As can be seen in Table 2, MUMPS performs better than our solver for a moderate processor number, since then the Schur complements in step $s = 0$ are very large and the computation gets inefficient. This is also the reason why the solver becomes inefficient for $P = 2$, but then it has a quite good efficiency until a minimum of factorization time is achieved. If the communication part becomes too expensive, the total factorization time (computation + communication time) grows.
Fig. 2. Solution u of the Poisson problem in 2D (left) and parallel distribution of 64 processors in 3D (right).
Table 1. Number of degrees of freedom $N = \dim V_I$ (d.o.f.) and non-zero matrix entries (nz-entries: $\#\{(i,j) \in I \times I : A[i,j] \neq 0\}$) for the Poisson problem (left: 2D; right: 3D).

2D                                 3D
 l     d.o.f.   nz-entries          l     d.o.f.   nz-entries
 8     263169      2362369          5      35937       912673
 9    1050625      9443329          6     274625      7189057
10    4198401     37761025          7    2146149     57066625
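The counts in Table 1 can be checked against a simple counting rule. The following Python sketch assumes a uniform tensor-product grid with $n$ nodes per direction in which all nodes carry a degree of freedom and each node couples to all its direct neighbors (9-point stencil in 2D, 27-point stencil in 3D). This mesh description, the values of $n$, and the function name are assumptions inferred from the numbers, not statements from the text. Under this assumption there are $n^d$ unknowns and $(3n-2)^d$ non-zero entries in $d$ dimensions.

```python
def tensor_grid_counts(n, dim):
    """Unknowns and non-zero matrix entries for n nodes per direction.

    A 1D three-point stencil on n nodes has 3n - 2 non-zeros; the
    tensor-product structure raises both counts to the power dim.
    """
    return n ** dim, (3 * n - 2) ** dim


if __name__ == "__main__":
    for n in (513, 1025, 2049):
        print("2D", n, tensor_grid_counts(n, 2))
    for n in (33, 65, 129):
        print("3D", n, tensor_grid_counts(n, 3))
```

With $n = 513, 1025, 2049$ in 2D and $n = 33, 65, 129$ in 3D this rule reproduces the tabulated values, with the exception of the 3D d.o.f. count on level 7, where $129^3 = 2146689$ differs from the tabulated 2146149.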
Table 2. Factorization time [s] of the 3D Poisson problem on refinement level $l$ with $P = 2^S$ processors.

              l = 5              l = 6               l = 7
 S     P    ALG.4   MUMPS     ALG.4   MUMPS      ALG.4     MUMPS
 6    64     1.25    3.48     21.95       –    1833.00         –
 7   128     1.75    5.72     15.54   47.89    1155.12   1690.39
 8   256     2.57    9.54     10.06   29.52     439.98   1855.60
 9   512     5.68   20.24     13.26   54.91     228.34         –
Fig. 3. Comparison of the factorization time of the 3D Poisson problem with the parallel direct solver and MUMPS (time [s] on a logarithmic axis over #cores $2^S$; curves for $l = 5, 6, 7$, each for our solver and for MUMPS).
This can be observed for level $l = 6$ for the 3D Poisson problem in Table 2. For larger problems, the optimal factorization time is attained for higher processor numbers (see Fig. 3).
Next, we study the solution time. For a better overview, we expand the range of the processors, especially to see the minimum of the solution time for $l = 6$. Since less data has to be communicated, the optimal efficiency is achieved for a smaller processor number, but overall we get a result comparable to that for the factorization time (note that computing times less than about 0.05 s are not significant within our test framework); see Table 3.
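The scaling behavior of the factorization can be quantified directly from the timings of Algorithm 4 in Table 2. The following Python sketch (the helper names are illustrative and not part of the solver) computes the speedup and the parallel efficiency relative to the smallest listed processor number and locates the processor number with minimal factorization time.

```python
# Factorization times [s] of Algorithm 4 for the 3D Poisson problem
# (Table 2), keyed by the processor number P.
ALG4_L6 = {64: 21.95, 128: 15.54, 256: 10.06, 512: 13.26}
ALG4_L7 = {64: 1833.00, 128: 1155.12, 256: 439.98, 512: 228.34}


def relative_speedup(times):
    """Speedup T(P0) / T(P) relative to the smallest processor number P0."""
    p0 = min(times)
    return {p: times[p0] / t for p, t in sorted(times.items())}


def relative_efficiency(times):
    """Speedup divided by the increase P / P0 of the processor number."""
    p0 = min(times)
    return {p: s * p0 / p for p, s in relative_speedup(times).items()}


def fastest_processor_number(times):
    """Processor number with the smallest measured time."""
    return min(times, key=times.get)


if __name__ == "__main__":
    print(fastest_processor_number(ALG4_L6))
    print(fastest_processor_number(ALG4_L7))
    for p, eff in relative_efficiency(ALG4_L7).items():
        print(p, round(eff, 2))
```

For $l = 6$ the minimum lies at $P = 256$, whereas for $l = 7$ the time still decreases up to $P = 512$, where the efficiency relative to $P = 64$ is about 1.0.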
Table 3. Solution time [s] of the 3D Poisson problem on refinement level $l$ on $P = 2^S$ processors.

              l = 5              l = 6              l = 7
 S     P    ALG.6   MUMPS     ALG.6   MUMPS     ALG.6   MUMPS
 3     8     0.03    0.03      0.27       –         –       –
 4    16     0.03    0.18      0.15       –         –       –
 5    32     0.03    0.18      0.09       –         –       –
 6    64     0.04    0.20      0.11       –      1.05       –
 7   128     0.07    0.21      0.14    0.52      0.98    2.38
 8   256     0.15    0.17      0.21    0.74      1.05    3.02
 9   512     0.42    0.78      0.52    1.40      1.66       –

Fig. 4. Factorization time of the 2D Poisson problem with the parallel direct solver and MUMPS (time [s] on a logarithmic axis over #cores $2^S$; curves for $l = 8, 9, 10$, each for our solver and for MUMPS).

With our solver we achieve a residual reduction on level 5 of about $2.8 \cdot 10^{-14}$ on 4 processors and $3.6 \cdot 10^{-14}$ on 1024 processors. With MUMPS we have a reduction of about $2.7 \cdot 10^{-14}$ for any processor number. On level 7 the reduction with our solver is between $2.0 \cdot 10^{-12}$ for 128 processors and $2.5 \cdot 10^{-12}$ for 1024 processors. Note that within the finite element context, in most cases a residual reduction of about $10^{-5}$ is sufficient.
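As an illustration of how such a residual reduction can be measured, the following Python sketch solves a small 1D Poisson system with a direct tridiagonal LU solve and evaluates the relative residual $\|b - Ax\|_2 / \|b\|_2$. The definition of the reduction as a relative residual with zero initial guess, the 1D model problem, and all function names are assumptions made for this sketch; they are not taken from the test framework used for the results above.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """LU solve without pivoting (Thomas algorithm) for a tridiagonal system.

    lower[i] and upper[i] are the sub- and superdiagonal entries of row i;
    lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    d = list(diag)
    r = list(rhs)
    for i in range(1, n):  # forward elimination
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        r[i] -= m * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x


def relative_residual(lower, diag, upper, rhs, x):
    """Euclidean norm of rhs - A x divided by the norm of rhs."""
    n = len(diag)
    res_sq = 0.0
    for i in range(n):
        ax = diag[i] * x[i]
        if i > 0:
            ax += lower[i] * x[i - 1]
        if i < n - 1:
            ax += upper[i] * x[i + 1]
        res_sq += (rhs[i] - ax) ** 2
    return math.sqrt(res_sq) / math.sqrt(sum(b * b for b in rhs))


if __name__ == "__main__":
    n = 50
    lower, diag, upper = [-1.0] * n, [2.0] * n, [-1.0] * n
    rhs = [1.0] * n
    x = solve_tridiagonal(lower, diag, upper, rhs)
    print(relative_residual(lower, diag, upper, rhs, x))
```

For a direct solver the computed value is expected to lie close to machine precision and to grow with the condition number of the problem, which is consistent with the larger reduction values reported on the finer level.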
Similar results are obtained for the 2D Poisson problem. Here, we observe a good performance for an increasing processor number if the problem is large enough. We expect the minimal factorization time on level $l = 10$ for far more than 1024 processors (cf. Fig. 4). Again, our new solver outperforms MUMPS for larger processor numbers (see Tables 4 and 5).
5.2. Linear elasticity
Linear elasticity is tested for the geometry illustrated in Fig. 5. On the bottom we choose Dirichlet boundary conditions and on the top we apply a traction load. We use the Lamé parameters $\mu = 80193.80$, $\lambda = 110743.82$ corresponding to stainless steel. The configuration is taken from [22]. The results are presented in Table 6. Similar results as for the 3D Poisson problem are observed (see Fig. 6).
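For orientation, the two Lamé parameters 80193.80 (shear modulus $\mu$) and 110743.82 ($\lambda$) can be converted to Young's modulus $E$ and Poisson's ratio $\nu$ with the standard relations of isotropic linear elasticity, $E = \mu (3\lambda + 2\mu) / (\lambda + \mu)$ and $\nu = \lambda / (2(\lambda + \mu))$. The short Python sketch below performs this conversion; the function names are illustrative, and the result is in the same (unstated) units as the given parameters.

```python
MU = 80193.80       # Lamé parameter mu (shear modulus), value from the text
LAMBDA = 110743.82  # Lamé parameter lambda, value from the text


def youngs_modulus(lam, mu):
    """Young's modulus E from the Lamé parameters (isotropic material)."""
    return mu * (3.0 * lam + 2.0 * mu) / (lam + mu)


def poisson_ratio(lam, mu):
    """Poisson's ratio nu from the Lamé parameters (isotropic material)."""
    return lam / (2.0 * (lam + mu))


if __name__ == "__main__":
    print(round(youngs_modulus(LAMBDA, MU), 1))
    print(round(poisson_ratio(LAMBDA, MU), 4))
```

The conversion gives $\nu \approx 0.29$ and $E \approx 2.069 \cdot 10^5$, which is in the range commonly quoted for steel.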
5.3. Convection–diffusion problem
For this problem class we use the configuration described in [12, Example 3.1.1]. We consider a 2D problem given in $\Omega = (-1,1)^2$ with the analytic solution

$$u(x,y) = x \, \frac{1 - e^{(y-1)/\varepsilon}}{1 - e^{-2/\varepsilon}}.$$

The solution is prescribed on $\partial\Omega$. We set $\varepsilon = 1/200$, so that the solution has a strong boundary layer (see Fig. 7).
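The boundary layer can be made visible by evaluating the analytic solution $u(x,y) = x \, (1 - e^{(y-1)/\varepsilon}) / (1 - e^{-2/\varepsilon})$ for $\varepsilon = 1/200$ near the edge $y = 1$. The following Python sketch is only an illustration of this formula; the function name and the sample points are chosen freely.

```python
import math

EPS = 1.0 / 200.0  # diffusion parameter epsilon from the text


def u(x, y, eps=EPS):
    """Analytic solution of the convection-diffusion test problem."""
    return x * (1.0 - math.exp((y - 1.0) / eps)) / (1.0 - math.exp(-2.0 / eps))


if __name__ == "__main__":
    # Away from y = 1 the solution is essentially u = x; within a layer
    # of width O(eps) below y = 1 it drops to the boundary value 0.
    for y in (-1.0, 0.0, 0.9, 0.99, 0.995, 1.0):
        print(y, u(0.5, y))
```

At $x = 0.5$ the solution stays at 0.5 up to about $y = 0.9$ and falls to 0 at $y = 1$, so the layer has a width of a few multiples of $\varepsilon$.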
Table 4. Factorization time [s] of the 2D Poisson problem on refinement level $l$ on $P = 2^S$ processors.

              l = 8              l = 9              l = 10
 S     P    ALG.4   MUMPS     ALG.4   MUMPS     ALG.4   MUMPS
 5    32     2.88    4.58     18.68   11.97    155.59   47.98
 6    64     1.24    4.55     11.01   12.51     70.46   44.41
 7   128     0.94    6.22      4.66   12.77     31.81   44.12
 8   256     0.83    8.49      2.40   18.64     12.79   49.46
Table 5. Solution time [s] of the 2D Poisson problem on refinement level $l$ on $P = 2^S$ processors.

              l = 8              l = 9              l = 10
 S     P    ALG.6   MUMPS     ALG.6   MUMPS     ALG.6   MUMPS
 3     8     0.06    0.09      0.24    0.27         –       –
 4    16     0.03    0.07      0.12    0.34         –       –
 5    32     0.03    0.07      0.07    0.21      0.28    0.79
 6    64     0.03    0.08      0.04    0.32      0.14    0.74
 7   128     0.01    0.17      0.04    0.24      0.10    0.74
 8   256     0.03    0.34      0.04    0.40      0.10    0.79
 9   512     0.07    0.54      0.08    0.47      0.12    1.00
Fig. 5. Surface mesh of the triangulation.
Table 6. Factorization time [s] of the linear elasticity problem on refinement level $l$ on $P = 2^S$ processors.

  S       P        l = 3        l = 4         l = 5
      d.o.f.       87363       655875       5079555
  nz-entries     6499017     50872329     402541065
  6      64         8.46       466.99
  7     128        10.35       256.89
  8     256        11.36       176.56
  9     512        34.09       249.00       5947.00
 10    1024        66.82       578.52       3524.00
Fig. 6. Factorization time of the linear elasticity problem (time [s] on a logarithmic axis over #cores $2^S$; curves for $l = 3, 4, 5$).
Fig. 7. Surface plot of the solution of the convection–diffusion problem.
Table 7. Number of unknowns and non-zero entries of the convection–diffusion problem.

  S       P        l = 8        l = 9        l = 10
      d.o.f.      263169      1050625       4198401
  nz-entries     2362369      9443329      37761025
  7     128         2.29        24.52        167.47
  8     256         2.05        15.99        101.15
  9     512         2.25        11.29         69.23
 10    1024         2.73         8.26         44.29
Again, our algorithm performs well for the 2D convection–diffusion problem for an increasing processor number. Since we have the same block matrix structure as for the 2D Poisson problem, we obtain similar results (see Table 7 and Fig. 8).
5.4. The Stokes problem
The Stokes problem is tested for the two-dimensional backward facing step configuration [12, Example 5.1.2]. On the left is an inflow boundary, and a no-flow condition is imposed on the walls (top and bottom of the geometry). On the right side a Neumann condition is applied (see Fig. 9). Again, for this indefinite system we obtain similar results as for the 2D scalar problems, cf. Table 8 and Fig. 10.
Fig. 8. Factorization time [s] of the convection–diffusion problem (logarithmic time axis over #cores $2^S$; curves for $l = 8, 9, 10$).
Fig. 9. Solution for the Stokes problem.
Table 8. Factorization time [s] of the Stokes problem on refinement level $l$ on $P = 2^S$ processors.

  S       P        l = 4        l = 5        l = 6
      d.o.f.       20355        80131       317955
  nz-entries      798345      2179273     12688905
  4      16         0.20         3.21        33.23
  5      32         0.26         1.20         9.07
  6      64         0.61         0.84         3.38
  7     128         0.62         1.02         1.98
Fig. 10. Factorization time of the Stokes problem (time [s] on a logarithmic axis over #cores $2^S$; curves for $l = 4, 5, 6$).
5.5. The Maxwell cavity resonator
For the Maxwell cavity resonator we use the geometry illustrated in Fig. 11. The mesh is provided by CST Darmstadt [6]. The first eigenvalue for this problem is $\lambda_1 \approx 0.0028$, the 30th eigenvalue is $\lambda_{30} \approx 0.014$. We set $\kappa^2 = 0.01$, so we compute a strongly indefinite problem.
The distribution of the domain to the processors is not optimal, which results in large interfaces. So we do not expect results as good as for the other 3D problems. Nevertheless, a decreasing factorization time for larger problems on higher processor numbers is observable (see Table 9).
Fig. 11. Geometry for the Maxwell cavity resonator.
Table 9. Factorization time [s] of the Maxwell problem on refinement level $l$ on $P = 2^S$ processors.

  S       P        l = 0        l = 1
      d.o.f.       63561       507012
  4      16        15.49
  5      32         9.61       863.36
  6      64         8.74       561.95
  7     128         9.12       256.64
  8     256                    216.89
Acknowledgement
The authors acknowledge the financial support from BMBF Grant 01IH08014A within the joint research project ASIL (Advanced Solvers Integrated Library).
References
[1] P.R. Amestoy, I.S. Duff, J. Koster, J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM J. Matrix Anal. Appl. 23 (1) (2001) 15–41.
[2] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK: a portable linear algebra library for high-performance computers, 1990.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. Sorensen, LAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
[4] C.W. Bomhof, H.A. van der Vorst, A parallel linear system solver for circuit simulation problems, Numer. Linear Algebr. Appl. 7 (7-8) (2000) 649–665.
[5] D. Braess, Finite Elements: Theory, Fast Solvers, and Applications in Solid Mechanics, Cambridge University Press, 1997.
[6] CST: Computer Simulation Technology, Darmstadt. Available from: <http://www.cst.com/>.
[7] Krister Dackland, Erik Elmroth, Bo Kågström, Charles Van Loan, Design and evaluation of parallel block algorithms: LU factorization on an IBM 3090 VF/600J, in: Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992, pp. 3–10.
[8] J.W. Demmel, N.J. Higham, R.S. Schreiber, Stability of block LU factorization, Numer. Linear Algebr. Appl. 2 (1995) 173–190.
[9] James W. Demmel, Stanley C. Eisenstat, John R. Gilbert, Xiaoye S. Li, Joseph W.H. Liu, A supernodal approach to sparse partial pivoting, SIAM J. Matrix Anal. Appl. 20 (3) (1999) 720–755.
[10] James W. Demmel, Nicholas J. Higham, Stability of block algorithms with fast level-3 BLAS, ACM Trans. Math. Softw. 18 (3) (1992) 274–291.
[11] Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, Henk Van Der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1990.
[12] Howard Elman, David Silvester, Andy Wathen, Finite Elements and Fast Iterative Solvers with Applications in Incompressible Fluid Dynamics, Oxford University Press, 2005.
[13] K.A. Gallivan, R.J. Plemmons, A.H. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Rev. 32 (1990) 54–135.
[14] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Md., 1996.
[15] KIT Cluster HP XC3000 (hc3). Available from: <http://www.scc.kit.edu/dienste/hc3.php>.
[16] G. Karypis, V. Kumar, Analysis of multilevel graph partitioning, in: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, 1995.
[17] G. Karypis, V. Kumar, A parallel algorithm for multilevel graph partitioning and sparse matrix ordering, J. Parall. Distribut. Comput. 48 (1998) 71–95.
[18] R.M.M. Mattheij, Stability of block LU-decompositions of matrices arising from BVP, SIAM J. Algebr. Discrete Methods 5 (1984) 314–331.
[19] Intel MKL. Available from: <http://software.intel.com/en-us/intel-mkl/>.
[20] P. Monk, Finite Element Methods for Maxwell's Equations, Clarendon Press, Oxford, 2003.
[21] MUMPS: a MUltifrontal Massively Parallel sparse direct Solver, version 4.9.2. Available from: <http://graal.ens-lyon.fr/MUMPS/>.
[22] P. Neff, A. Sydow, C. Wieners, Numerical approximation of incremental infinitesimal gradient plasticity, Int. J. Numer. Methods Eng. 77 (2009) 414–436.
[23] Eric Polizzi, Ahmed Sameh, SPIKE: a parallel environment for solving banded linear systems, Comput. Fluids 36 (1) (2007) 113–120.
[24] A. Pothen, H.D. Simon, K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. Appl. 11 (1990) 430–452.
[25] O. Schenk, K. Gärtner, Solving unsymmetric sparse systems of linear equations with PARDISO, in: P.M.A. Sloot et al. (Eds.), Computational Science: ICCS 2002, 2nd International Conference, Amsterdam, The Netherlands, April 21–24, 2002, Proceedings, Part 2, Lect. Notes Comput. Sci., vol. 2330, Springer, Berlin, 2002, pp. 355–363.
[26] O. Schenk, K. Gärtner, W. Fichtner, A. Stricker, PARDISO: a high-performance serial and parallel sparse linear solver in semiconductor device simulation, Future Generat. Comput. Syst. 18 (1) (2001) 69–78.
[27] SuperLU, version 3.0. Available from: <http://crd.lbl.gov/xiaoye/SuperLU/>.
[28] T.A. Davis, User's guide for the unsymmetric-pattern multifrontal package (UMFPACK), Tech. Report TR-95-004, Computer and Information Sciences Department, University of Florida, Gainesville, FL, 1995.
[29] A. Toselli, O. Widlund, Domain Decomposition Methods: Algorithms and Theory, Springer-Verlag, 2005.
[30] C. Wieners, Distributed point objects. A new concept for parallel finite elements, in: R. Kornhuber et al. (Eds.), Domain Decomposition Methods in Science and Engineering, Selected papers of the 15th international conference on domain decomposition, Berlin, Germany, July 21–25, 2003, Lecture Notes in Computational Science and Engineering, vol. 40, Springer, Berlin, 2005, pp. 175–182.
[31] Christian Wieners, A geometric data structure for parallel finite elements and the application to multigrid methods with block smoothing, Comput. Vis. Sci. 13 (4) (2010) 161–175.