10.1.1.99

NODAL REORDERING STRATEGIES TO IMPROVE

PRECONDITIONING FOR FINITE ELEMENT SYSTEMS

Peter S. Hou

Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Master of Science

in

Mathematics

APPROVED:

Jeff Borggaard, Chair Traian Iliescu

Serkan Gugercin

April 27, 2005 Blacksburg, Virginia

Keywords: Nodal Reordering Strategy, Preconditioner, Finite Element Method, Iterative Solver, Scientific Computing, Unstructured Mesh

Copyright 2005, Peter S. Hou

Nodal Reordering Strategies to Improve Preconditioning for Finite Element Systems

Peter S. Hou

Jeff Borggaard, Chair

Mathematics

ABSTRACT

The availability of high performance computing clusters has allowed scientists and engineers to study more challenging problems. However, new algorithms need to be developed to take advantage of the new computer architecture (in particular, distributed memory clusters). Since the solution of linear systems still demands most of the computational effort in many problems (such as the approximation of partial differential equation models) iterative methods and, in particular, efficient preconditioners need to be developed.

In this study, we consider application of incomplete LU (ILU) preconditioners for finite

element models to partial differential equations. Since finite elements lead to large, sparse systems, reordering the node numbers can have a substantial influence on the effectiveness of these preconditioners. We study two implementations of the ILU preconditioner: a stucture-based method and a threshold-based method. The main emphasis of the thesis is to test a variety of breadth-first ordering strategies on the convergence properties of the preconditioned systems. These include conventional Cuthill-McKee (CM) and Reverse Cuthill-McKee (RCM) orderings as well as strategies related to the physical distance between nodes and post-processing methods based on relative sizes of associated matrix entries. Although the success of these methods were problem dependent, a number of tendencies emerged from which we could make recommendations. Finally, we perform a preliminary study of the multi-processor case and observe the importance of partitioning quality and the parallel ILU reordering strategy.

This thesis is dedicated to my mom, my dad, and my sister Ariel.

Acknowledgements

I wish to express my deepest appreciation for my advisor, Dr. Jeff Borggaard. You provided me with research opportunity, guidance, as well as financial support. Without you, I would not even have known what or how fascinating scientific computing is. When I encountered difficulties during my research, your encouraging words always refueled my energy and kept me motivated.

I would also like to thank Dr. Traian Iliescu and Dr. Serkan Gugercin for being on my

thesis committee. I understand that this is an entirely extra task that you willingly committed to amidst your busy schedules. Your expertise has led me further into exploring this field of mathematics.

In addition, I must acknowledge the two people who shaped my life as a mathematician,

although they taught me nothing about advanced mathematics. Thank you for talking me into becoming a math major, Dr. Lee Johnson. Your faith in me gave me the strength to come this far. Oh and I will never forget the torturous math trainings that you imposed on me, Ms. Jing-Lan Fu. I used to hate them so much, but how could I have possibly become good with numbers otherwise?

iv

Table of Contents Acknowledgements........................................................................................................................ iv List of Figures ................................................................................................................................ vi List of Tables................................................................................................................................. vii Chapter 1 Introduction ....................................................................................................................1 Chapter 2 Literature Overview .......................................................................................................2 2.1 Finite Element Methods...................................................................................................2 2.2 Linear System Solvers .....................................................................................................3 2.2.1 LU Decomposition...............................................................................................3 2.2.2 Iterative Solvers ...................................................................................................5 2.2.2.1 Jacobi Method.............................................................................................6 2.2.2.2 Gauss-Seidel Method..................................................................................6 2.2.2.3 Successive Over Relaxation (SOR) ............................................................7 2.2.2.4 Krylov Subspace Methods: Generalized Minimum Residual (GMRES) ..7 2.3 Preconditioners ................................................................................................................9 2.3.1 Jacobi Preconditioner.........................................................................................10 2.3.2 Incomplete LU (ILU) Factorization...................................................................10 2.3.2.1 Structure-Based ILU( ) ...........................................................................11 2.3.2.2 Threshold-Based ILUT .............................................................................13 2.4 Nodal Reordering Strategies for Finite Element Meshes...............................................14 2.4.1 Cuthill-McKee Algorithm..................................................................................15 2.4.2 Reverse Cuthill-McKee Algorithm (RCM) .......................................................16 Chapter 3 Problem Description.....................................................................................................17 Chapter 4 Numerical Experiments................................................................................................18 4.1 Finite Element Meshes...................................................................................................18 4.2 ILU(0) and ILUT ...........................................................................................................19 4.2.1 ILU(0) ................................................................................................................19 4.2.2 ILUT ..................................................................................................................19 4.2.3 Comparisons ......................................................................................................20 4.3 CM and RCM.................................................................................................................21 4.3.1 The Structure......................................................................................................21 4.3.2 The Experiments ................................................................................................22 4.4 Breadth-First Search Orderings .....................................................................................24 Chapter 5 The Parallel Case..........................................................................................................27 5.1 Onto a Parallel Computer...............................................................................................27 5.2 The Reordering Scheme.................................................................................................28 5.3 ILU Analysis ..................................................................................................................29 5.4 A Partitioning Test..........................................................................................................30 5.5 Other Partitioning Considerations .................................................................................31 Chapter 6 Conclusions ..................................................................................................................32 References......................................................................................................................................78 Vita.................................................................................................................................................81

v

List of Figures Figure 1 Incomplete LU Factorization..........................................................................................33 Figure 2 Cuthill-McKee Ordering ................................................................................................34 Figure 3 Cuthill-McKee Starting Node.........................................................................................35 Figure 4 Finite Element Meshes ...................................................................................................36 Figure 5 ILU(0) and ILUT............................................................................................................38 Figure 6 ILU Experiments ............................................................................................................39 Figure 7 CM and RCM .................................................................................................................40 Figure 8 “Umbrella Regions” .......................................................................................................41 Figure 9 Natural v. RCM Ordering on ILUT................................................................................42 Figure 10 Mesh Partitioning .........................................................................................................43 Figure 11 Mesh Partitioning for Parallel ILU...............................................................................44 Figure 12 Parallel ILU ..................................................................................................................46 Figure 13 Mesh Partitioning Test 1...............................................................................................47 Figure 14 Mesh Partitioning Test 2...............................................................................................48

vi

List of Tables Table 1 CM v. RCM and ILU(0) v. ILUT.....................................................................................49 Table 1.1 2d..........................................................................................................................49 Table 1.2 3d..........................................................................................................................50 Table 1.3 two_hole_0...........................................................................................................51 Table 1.4 two_hole_1...........................................................................................................52 Table 1.5 two_hole_2...........................................................................................................53 Table 1.6 two_hole_3...........................................................................................................54 Table 1.7 two_hole_4...........................................................................................................55 Table 1.8 four_hole_0 ..........................................................................................................56 Table 1.9 four_hole_1 ..........................................................................................................57 Table 1.10 four_hole_2 ........................................................................................................58 Table 1.11 four_hole_3 ........................................................................................................59 Table 1.12 cross_dom_0 ......................................................................................................60 Table 1.13 cross_dom_1 ......................................................................................................61 Table 1.14 cross_dom_2 ......................................................................................................62 Table 1.15 cross_dom_3 ......................................................................................................63 Table 1.16 cross_dom_4 ......................................................................................................64 Table 1.17 two_dom_0.........................................................................................................65 Table 1.18 two_dom_1.........................................................................................................66 Table 1.19 two_dom_2.........................................................................................................67 Table 1.20 two_dom_3.........................................................................................................68 Table 1.21 two_dom_4.........................................................................................................69 Table 1.22 two_dom_5.........................................................................................................70 Table 2 15 Ordering Schemes and Their Effects on GMRES Iterations.......................................71 Table 2.1 two_hole_domains ...............................................................................................71 Table 2.2 four_hole_domains...............................................................................................72 Table 2.3 cross_dom_domains.............................................................................................73 Table 2.4 two_dom_domains ...............................................................................................74 Table 3 Mesh Partitioning Tests for Parallel ILU .........................................................................75 Table 3.1 two_hold_0...........................................................................................................75 Table 3.2 two_hold_2...........................................................................................................76 Table 3.3 two_hold_4...........................................................................................................77

vii

1

Chapter 1 Introduction Many computational problems in science or engineering require the solution of large sparse

linear systems [6, 10, 14]. These systems have the form of finding an n-dimensional vector x such that

Ax b= where A is an n-by-n matrix and b is the n-dimensional right hand side vector. Due to the challenge in solving these problems and their importance in real-world modeling and analysis, a wide class of numerical algorithms has been developed to solve them. These algorithms are very specialized, taking advantage of problem structure and computer architecture. A popular class of algorithms is based on Krylov subspaces [18, 28]. This is an iterative method that, under certain conditions (good problem conditioning, good initial guess, appropriate parameter tuning, etc.) is much more efficient than direct methods. It also has the advantage that it lends itself to parallel implementations. Thus, Krylov subspace methods are a popular choice in high performance computing applications.

One of the limitations of the iterative methods is the condition number of the matrix A ,

1( )K A A A−= ,

where i represents one of the matrix norms (usually the 2-norm). There is a correlation between the number of iterations the algorithm requires to converge (hence the computation cost) and the magnitude of the condition number. The closer ( )K A is to 1, the better. The notion of left preconditioning is to premultiply the linear system above by a matrix P that is a good approximation to 1A− , PAx Pb= , such that ( )K PA is closer to 1. The selection of a good preconditioner is critical to developing a high performance algorithm. This is typically problem dependent, though a number of popular strategies have emerged. Not many of these have good parallelism.

In this work, we focus on developing an incomplete LU (ILU) preconditioner for solving

linear systems generated using finite element methods. Unlike other numerical methods for partial differential equations, such as Finite Difference Methods, those that we deal with are each based on an unstructured finite element mesh. Therefore, in addition to the generic methods for solving the linear system, we can try to improve the efficiency of the preconditioner by reordering the nodes of the mesh in a logical manner. We test a number of standard reordering strategies as well as perform a parameter study for the ILU preconditioner. Then, we explore the importance of nodal reordering for ILU in the parallel case.

2

Chapter 2 Literature Overview This chapter introduces and examines some well-known techniques involved in solving the

linear systems of our interest on a computer. Section 2.1 briefly describes finite element methods as the origin of our problem. Section 2.2 examines classic linear system solvers, and leads into more efficient iterative solvers. Section 2.3 discusses preconditioners which preprocess the linear systems to help iterative solvers converge faster. Lastly, Section 2.4 introduces finite element nodal reordering strategies that can potentially make preconditioners more effective.

2.1 Finite Element Methods The finite element methods are a family of extremely powerful numerical techniques

developed to solve complex problems in solid mechanics, fluid dynamics, heat transfer, vibrations, etc [4]. It breaks down a complex, continuous physical geometry into simpler, finite number of components, and uses these components as basis functions to approximate the solution of the problem. Because of their flexibility to adapt to a wide range of applications and their ability to greatly reduce the complexity of each problem, the finite element methods are a very popular tool in the scientific and engineering community.

The first step in a finite element method is to discretize the domain Ω of interest into a

finite element mesh, which is an undirected graph with nodes spaced across the domain. The density of the nodes, or mesh points, may vary depending on the complexity of the subdomains. Each mesh point is associated with an unknown and a basis function ϕ , which has value 1 at the mesh point and zero elsewhere.

Consider the Poisson problem: in u f−Δ = Ω , and 0 on u = ∂Ω . For all functions v

smooth in the domain Ω and vanish on the boundary ∂Ω , the weak formulation of the problem is ( ) ( )uv dA fv dA

Ω Ω−Δ =∫∫ ∫∫ ,

which by Divergence Theorem, becomes ( ) ( )u v dA fv dA

Ω Ω∇ ⋅∇ =∫∫ ∫∫ .

Then we discretize 1

( , ) ( , ) ( , )n

hj j

j

u x y u x y u x yϕ=

≈ =∑ , and choose iv ϕ=

The problem is then rewritten as a list of n partial differential equations, for 1 i n≤ ≤ :

( )1

( , ) ( , ) ( , ) ( , )n

i j j ij

x y x y dA u f x y x y dAϕ ϕ ϕΩ Ω

=

⎡ ⎤∇ ∇ =⎣ ⎦∑ ∫∫ ∫∫

Define ( , ) ( , )ij i jA x y x y dAϕ ϕΩ

= ∇ ∇∫∫ , i ix u= , and ( , ) ( , )i ib f x y x y dAϕΩ

= ∫∫ , this can be

represented as a linear system Ax b= , where x is a column vector of unknown values at the mesh points, while A and b correspond to the left- and right-hand sides of the equation.

3

From here, the complex physical problem has been reduced down to a standard system of equations.

The finite element methods use basis functions iϕ with local support. As a means to

construct these, the problem domain Ω is partitioned into regular subdomains: e.g., intervals in 1-D, triangles or rectangles in 2-D, and tetrahedrons or bricks in 3-D. Nodes are placed on vertices and perhaps edges, faces or interiors on which piecewise polynomial bases are generated. Where there is a natural assignment of unknown numbers in 1-D elements, there are many nodal ordering choices in higher dimensions. As we discuss below, this ordering has one dramatic impact on fill-in for direct linear system solvers and this carries over to preconditioners based on direct solvers. This is the main topic of this research.

2.2 Linear System Solvers The partial differential equations of our interest are presented as a linear system:

Ax b= where A is an n-by-n matrix, x is an n-by-1 vector of unknowns, and b is an n-by-1 vector.

The most trivial and primitive method is to find the inverse of A , assuming it exists.

Direct methods such as Gaussian Elimination are easy to understand and implement, and they produce exact inverses up to finite precision arithmetic. Subsequently, 1 1x A Ax A b− −= = gives us the solution of the system.

This method, however, is rarely used beyond elementary linear algebra class. The reason

is simple: computing the inverse of a matrix is too expensive. Real-world problems can easily have millions of equations with millions of unknowns. Computing the inverse of such a system not only requires tremendous computation power, but also can take up unrealistic amount of memory. In addition, this process is not parallelizable, hence cannot efficiently speed up even with multiple processors. Therefore, we introduce some linear system solvers that can be implemented more effectively.

2.2.1 LU Decomposition The LU decomposition is a process, based on Gaussian Elimination, to decompose a matrix

A into matrix factors L and U : A LU= . Here, L is a lower triangular matrix with ones on the main diagonal. On the other hand, U is an upper triangular matrix. Several versions of this decomposition exist, and Algorithm 1 below is a row-based implementation.

In the 4-by-4 case, the factors look like:

4

11 12 13 14 11 12 13 14

21 22 23 24 21 22 23 24

31 32 33 34 31 32 33 34

41 42 43 44 41 42 43 44

1 0 0 01 0 0 0

1 0 0 01 0 0 0

A A A A U U U UA A A A L U U UA A A A L L U UA A A A L L L U

⎡ ⎤ ⎡ ⎤ ⎡ ⎤⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥=⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥

⎣ ⎦⎣ ⎦ ⎣ ⎦

One advantage of this factorization is the ability to store L and U in the same matrix to conserve storage space. Algorithm 1 takes such advantage and factors matrix A in-place. The output of this algorithm is in the form:

11 12 13 14

21 22 23 24

31 32 33 34

41 42 43 44

U U U UL U U U

LUL L U UL L L U

⎡ ⎤⎢ ⎥⎢ ⎥=⎢ ⎥⎢ ⎥⎣ ⎦

which, when necessary, can be easily broken into two separate matrices. Note that because the value of the main diagonal of L is known, it need not to be stored in LU .

, ,

, , ,

,

For 1 to -1For 1 to

For to

End

EndEnd

i j j j

i k i k j k

i j

j ni j n

A A

k i nA A A

A

α

α

α

== +

=

== −

=

Algorithm 1: An LU decomposition We see from line 3 in the algorithm that numerical accuracy may be at stake if any entry on

the main diagonal of A becomes very small during the factorization. To improve on this situation, we could apply optional pivoting: permute the rows of A so that the absolute maximum element in each column lies on the main diagonal. This rearrangement of equations does not affect the solution, as long as proper permutation is also applied to x and b .

After matrix A has been decomposed into the appropriate factors, we can substitute L

and U to solve the linear system Ax b= : ( ) ( )Ax LU x L Ux b= = =

Let y Ux= . Note that 1

i

ij j ij

L y b=

=∑ for all 1 i n≤ ≤ , so we can solve Ly b= recursively

5

using a forward substitution algorithm.

( )1

1

For i=1 to n1 ( ) 1

End

i

i i ij j iijii

y b L y LL

−

=

= − =∑

Algorithm 2: Forward Substitution

Next, we can solve Ux y= using a backward substitution, which is a reversed version of

the forward substitution. Each triangular substitution reduces the problem down to n equations, which, when solved in order, has only one unknown per equation. This simplifies the solution to the linear system.

Generally, LU decomposition is a more preferred solution to a linear system than 1x A b−= .

However, its memory requirement can be equally unbearable as the problem size increases. This is particularly true when A is sparse (most entries are zero, and only nonzero entries need to be stored), since the factors L and U are dense in general. Hence, modern computational problems have resorted to much more memory-efficient iterative methods, whereas LU decomposition still serves as the underlying idea for some of the most robust preconditioning techniques.

2.2.2 Iterative Solvers In the case when computing the exact solution to a linear system is impossible or infeasible,

iterative solvers can be employed to numerically approximate the true solution x . Let 0x be an initial guess of the solution x . The iterative solver computes a sequence { }kx , with

kx x→ as k →∞ . The residual vector k kr b Ax= − is used to determine how close we are to the true solution. Obviously, for a well-conditioned problem, 0kr ≈ when kx is a good approximation to x . An iterative solver stops when the residual becomes smaller than a specified threshold, or when a certain number of iterations have been reached without a convergence. Good iterative solvers aim to converge { }kx quickly and to minimize the norm of the residual vector.

We introduce four common iterative solvers, in increasing order of sophistication: Jacobi

[30], Gauss-Seidel [20], Successive Over Relaxation (SOR) [31], and Generalized Minimum Residual (GMRES) [28]. The first three are based on the following breakdown of matrix A :

A L D U= + + .

The matrices L and U are the strictly lower and upper triangular parts of A , and D is

6

the main diagonal. As we discussed above, inverting a triangular matrix can be performed by forward or backward substitution. Inverting a diagonal matrix simply involves inverting the diagonal entries. Hence, the inverses appearing below are computationally tractable.

2.2.2.1 Jacobi Method From the matrix break-down above, we rewrite Ax b= ( )Ax L D U x b= + + = .

Then, we move L and U to the right hand side to motivate the Jacobi iteration,

1( ) ( )n nDx b L U x+ = − +

11 ( ( ) )n nx D b L U x−+ = − + .

The Jacobi method solves each equation in the linear system independently [30]. It solves

one variable ix at a time while assuming all other variables x remain fixed. It is extremely parallelizable in nature. Unfortunately, while this simple idea is very easy to implement, it is very unstable. It works well with diagonally dominant tridiagonal matrices, but its convergence is not guaranteed otherwise. The matrix 1( )D L U− + must have all of the eigenvalues inside the unit disk (the smaller, the better).

2.2.2.2 Gauss-Seidel Method The Gauss-Seidel method is fairly similar to the Jacobi method [20]:

1( ) n nL D x b Ux++ = −

1

1 ( ) ( )n nx L D b Ux−+ = + − .

Note that ( )L D+ is a lower triangular matrix, so the iterations can be computed using a

forward substitution (no matrix inversion is necessary). Due to this nature, the computations are sequential: solving each equation requires the solutions from the previous equations. Therefore, this algorithm is not parallelizable like Jacobi method. However, it is relatively more stable, and is applicable to strictly diagonally dominant matrices and symmetric positive definite matrices.

Note: the Gauss-Seidel method may also be implemented as 1

1 ( ) ( )n nx D U b Lx−+ = + − if

1( )D U L−+ has smaller eigenvalues.

7

2.2.2.3 Successive Over Relaxation (SOR) SOR is derived from extrapolating the Gauss-Seidel method, by taking a weighted average

on the two sides of the equal sign [31]:

1( ) (1 )n n nL D x b Ux Dxω ω ω ω++ = − + −

1

1 ( ) ( (1 ) )n n nx L D b Ux Dxω ω ω ω−+ = + − + −

0 2ω< < When ω is chosen properly, this method speeds up convergence rate. The difficult task

is to choose a good value for each specific problem. When 1ω = , SOR reduces to Gauss-Seidel. Also, this method fails to converge if ω falls outside of (0, 2).

2.2.2.4 Krylov Subspace Methods: Generalized Minimum Residual (GMRES) The Krylov subspace methods are a family of iterative solvers that, unlike the three above,

do not have an iterative matrix. Their implementations are based on the minimization of some measure of error over the affine space 0 kx K+ at each iteration k . 0x is the initial iterate vector and kK is the kth Krylov subspace,

{ }10 0 0, ,..., k

kK span r Ar A r−= ,

where 0 0r b Ax= − is the initial residual vector. Many variants of Krylov subspace methods exist, and they possess various strengths and

limitations. Well known versions include the Conjugate Gradient Method, the General Conjugate Residual Method, and the Minimum Residual Method, whose applications limit to symmetric positive definite systems, non-symmetric positive definite systems, and symmetric indefinite systems, respectively [18]. The most popular variant of Krylov subspace methods is the Generalized Minimum Residual (GMRES) Method [28], due to its applicability to non-symmetric indefinite systems. We introduce this method here and use it as our iterative solver in the experiments.

GMRES generates an orthonormal basis explicitly, using a modified Gram-Schmidt

orthonormalization:

1

For 1 to ,

End

i i

i i i k k

i i i

w Avk i

w w w v v

v w w+

==

= −

=

When applied to the Krylov sequence, this method is known as the Arnoldi Algorithm [1].

8

The inner product coefficients ,i kw v and iw are stored in an upper Hessenberg matrix. Suppose we generated the complete set of an orthonormal basis V , we can represent the

solution as 01

n

i ii

x x v y=

= +∑ , where iv are column vectors of V , and iy are scalars chosen to

minimize at each step the norm of the residual vector 01

( )n

i ii

b Ax b A x v y=

− = − +∑ . In other

words, this algorithm always converges to the exact solution in at most n iterations, provided exact arithmetic is used. In practice, however, this fact does not have much value. When n is large, not only is the number of iterations unaffordable, but the storage required to store V and H also becomes prohibitively tremendous.

The restarted version of GMRES overcomes this problem. Given a natural number m n≤ ,

the algorithm stops and “restarts” after m iterations. The intermediate result

01

m

m i ii

x x v y=

= +∑ is used as the new 0x , V and H are cleared from the memory, and the

whole process repeats from the beginning until convergence is achieved.

0

0 0 1 0 0 1 0

,

1,

1,

1 1,

Choose

Compute - ; / ; For 1 to

Compute

For 1 to

,

End

If 0 then ; exit for

EndDefine the ( 1) He

j j

i j j i

j j ij i

j j j

j j

j j j j

x

r b Ax v r r rj m

w Av

i j

h w v

w w h v

h w

h m j

V w h

m m

β

+

+

+ +

= = =

==

=

=

= −

=

= =

=

+ × ,

1 1

0

ssenberg matrix ( )

Compute to minimize -m i j

m m

m m m

H h

y e H y

x x V y

β

=

= +

Algorithm 3: Restarted GMRES

9

The difficult task in the restarted version of GMRES is to choose an appropriate m . When m is too small, the algorithm may converge very slowly or fail to converge. When m is too large, excessive computations and storage make the process unnecessarily expensive. Unfortunately, the optimal m depends entirely on each particular system, and there is no definite rule for choosing this number.

2.3 Preconditioners In nearly every practical example, iterative methods for the original linear system converges

too slowly. Analyses of these algorithms have found that there is a correlation between the conditioning of the system matrix, measured by the condition number, 1( )K A A A−= , and the number of iterations (work) required to converge [18]. Thus, the original system must be preconditioned to improve algorithm performance. However, to be effective, the cost of solving the preconditioned system (including the cost of preconditioning) should be less than the cost of solving the original system. In fact, the reduction needs to be dramatic for iterative methods to be effective.

In this study, we consider left preconditioners. Thus, we “premultiply” both sides of the

equation by a preconditioning matrix P , ( )PA x Pb= . The optimal preconditioner would be the inverse of A (although this would never be a practical preconditioner). The resulting preconditioned matrix would have a condition number of 1 (the smallest possible).

In general, the larger the condition number is, the harder it is to find a good approximate

inverse for the matrix. The base-b logarithm of ( )K A estimates how many base-b digits are lost in solving a linear system with matrix A . The convergence of GMRES is bounded by

0( ) 1( ) 1

k

kK A

r C rK A

⎛ ⎞−≤ ⎜ ⎟⎜ ⎟+⎝ ⎠

,

where kr is the k th residual vector in GMRES [18]. Moreover, the accuracy of any iterative solution is bounded by

( )k kx x rK A

x b−

≤ ,

where x is the true solution and kx is the k th approximation to x [5]. Therefore, reducing the condition number of the system is important for both speed and accuracy of an iterative solver.

Many preconditioners with different strengths and applications have been developed, and

we examine two: Jacobi Preconditioner [18] and the ILU Preconditioner family [13, 27]. They are both based on modified versions of two linear system solvers. We shall see how impractical

10

solvers can be transformed into powerful preconditioners.

2.3.1 Jacobi Preconditioner The Jacobi Preconditioner, as known as the Diagonal Preconditioner, is derived from the

Jacobi Iterative Method. It applies the inverse of the diagonal entries of A to both sides of the equation, with the hope of reducing the condition number. If matrix A were diagonally dominant, the inverse of its diagonal may be a good approximation to the inverse of A itself.

If

11 12 1

21 22 2

1 2

n

n

n n nn

a a aa a a

A

a a a

⎡ ⎤⎢ ⎥⎢ ⎥=⎢ ⎥⎢ ⎥⎣ ⎦

, then Jacobi Preconditioner

111

122

1

0 00 0

0 0 nn

aa

P

a

−

−

−

⎡ ⎤⎢ ⎥⎢ ⎥=⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦

.

It works well on certain diagonally dominant matrices. Like Jacobi Method, this

preconditioning procedure is highly parallelizable but very unstable. Improved versions such as Block Diagonal Preconditioning are available, but suffer from similar limitations [20].

2.3.2 Incomplete LU (ILU) Factorization Recall that in LU factorization, we factor matrix A

A LU= and then compute 1 1 1 1 1 1( ) ( )x U L LU x U L Ax U L b− − − − − −= = = This is a very stable solution for a linear system. However, it is not often used in practice due to one problem – its high memory consumption.

Define a fill-in to be an initially zero entry in matrix A whose value becomes nonzero as a

result of the basic row operations in the LU factorization. When any Gaussian Elimination-based algorithm is applied to a sparse matrix, many fill-ins take place and the resulting product may become very dense. As the number of nonzero entries increase, the memory requirement increases. When the problem size is large, it can easily become unbearably expensive to store all the fill-in entries created by LU. Therefore, Incomplete LU (ILU) Factorization is developed as a practical alternative [22].

First, we “approximate” the lower and upper triangular factors of matrix A : A LU LU= ≈ ; L L≈ ; U U≈ The U and L produced by ILU are upper- and lower-triangular matrices “close” to U

and L factors of A , less some or all of the fill-ins. See Figure 1 to visualize ILU’s reduced memory cost. A major decision in implementing an ILU factorization is to determine which

11

fill-ins to allow, and which to eliminate. Eliminating more fill-ins keeps the factors more sparse, saving memory space and computing power. On the other hand, allowing more fill-ins keeps U and L “less different” from U and L , so that 1 1U L− − stays closer to 1A− .

Then with hope, 1 1 1 1: ( ) ( )A U L A U L LU I− − − −= = ≈ , and 1 1 1 1( )x Ax U L Ax U L b− − − −≈ = = Instead of being a linear system solver like complete LU, the ILU serves as a preconditioner.

With 1 1U L− − being “close” to 1A− , A would be “close” to the identity matrix. In other words, it would have a smaller condition number. When an iterative solver is used to solve this modified system, convergence would be reached faster with higher accuracy. Various ILU implementations use different theories to eliminate fill-ins, trading off memory requirement for conditioning quality and vice versa. We shall study the two major families of ILU algorithms: the structure-based ILU( ) [13] and the threshold-based ILUT [27].

2.3.2.1 Structure-Based ILU( ) The structure-based ILU( ) implementations allow and deny each fill-in based on its

location relative to the structure of the matrix [13]. The first phase determines the locations of permissible fill-in entries by assigning each location a level. A fill-in is allowed if its level is less than or equal to . In the second phase, an LU factorization takes place, using the “incomplete” fill-in pattern determined in the first phase to keep certain zero entries intact.

In Algorithm 4, the matrix Λ contains the level values for the entries of A . Each level

is a nonnegative integer, and ijΛ ≤ indicates that a fill-in is allowed for ijA in the factorization. Entries of Λ are initially set to undefined, and some would stay undefined if the entry is not a possible fill-in (i.e. if the entire column above the entry is zero).

In essence, this algorithm works as follows: if an entry is initially nonzero in A , it has level

0 and no limit is imposed on that entry. If an entry is initially zero, then any possible fill-in at this location depends on a nonzero entry to its left (the pivot in Gauss Elimination) and a nonzero entry above it (the row whose multiple adds to this row). Each entry’s level depends on the levels of the two entries that may be causing its fill-in. Successor entries are considered “less important” than the predecessor entries and therefore have strictly higher levels. There are two popular implementations for weighing a level based on its predecessor entries’ levels:

The sum rule:

( , ){

( 1);}

computeWeight a b

return a b+ +

12

The max rule:

( , ){

(max{ , } 1);}

computeWeight a b

return a b +

We are not to compare the strengths of these two rules in this thesis. However, one should

be able to tell that the succession of levels grows faster under the sum rule. Therefore ILU( ) would allow less fill-in entries under the sum rule, given the same .

Define sparse matrix with undefined entriesFor 1 to

For 1 to If 0

Define Storage for

0

EndEndFor 1 to -1

If 0

For 1 to If 0

( , )

If ( )

D

ij

ij

ij

ij

jt

ij jt

jt

n ni n

j nA

j iA

t j nA

w computeWeight

Undefined

× Λ=

=≠

Λ

Λ =

=≠

= +≠

= Λ Λ

Λ

efine Storage for

Elsemin{ , }

EndEnd

EndEnd

EndEnd

jt

jt

jt jt

w

w

Λ

Λ =

Λ = Λ

Algorithm 4: Level-assigning phase of ILU( )

13

ILU(0) is a special case of the ILU( ) family. Only entries with level 0, or initially

nonzero entries, are allowed to be nonzero after the factorization. In other words, the factors have the same sparsity pattern as the original matrix A . ILU(0) is a popular method since it is intuitively predictable, and its factors require the minimum amount of memory space among the entire ILU( ) family.

2.3.2.2 Threshold-Based ILUT Unlike ILU( ), the threshold-based ILUT algorithms maintain the sparsity of a matrix by

controlling the magnitudes of its entries [27]. In the factors, the significance of a nonzero entry no longer depends on its relative location, but on its absolute value. Imagine that we drop a certain number of the smallest nonzero entries from L and U , the complete LU factors of A , to produce L and U . These incomplete factors are now more sparse, yet still fairly similar to the complete factors L and U .

In general, ILUT( t ) drops any element whose magnitude is smaller than the threshold.

The threshold is often a number t in [0,1] multiplied by the norm of the active row (or column, if the algorithm is column-based) in the factorization process. In other words, ijL in the

approximate factorization is replaced by zero if its absolute value is less than * ,*it t A= .

* ,* 2

, ,

, , ,

, * ,

,

, * ,

For 1 to -1For 1 to

For to

EndFor 1 to

If < Then 0End

If < Then 0

EndEnd

i

i j j j

i k i k j k

i k i k

i j

i j i j

j ni j n

t t A

A A

k i nA A A

k i nA t A

A

A t A

α

α

α

== +

=

=

== −

= +=

=

=

Algorithm 5: An implementation of ILUT

14

Unlike in structure-based ILU( ), a nonzero entry in the original matrix A does not guarantee a nonzero entry at the same location in the factors. Note that all entries can be possibly dropped, except those on the main diagonal are kept intact so the factors still can be nonsingular. A smaller threshold value keeps the factors closer to complete, and ILUT(0) is identical to the exact LU factorization. On the other hand, a very large threshold would eliminate most off-diagonal entries. ILUT(1) produces an identity matrix L and a diagonal matrix U , which, when used as a preconditioner, is identical to the Jacobi Preconditioner.

The structure-based ILU( ) family and the threshold-based ILUT family are the two major

branches of incomplete LU factorizations. A large number of more sophisticated and robust ILU algorithms have been developed for different applications [13, 16, 33], and most of them are based on one of these two basic ideas.

2.4 Nodal Reordering Strategies for Finite Element Meshes We have now seen some iterative techniques to efficiently solve large linear systems and

some preconditioning methods to speed up the performance of iterative solvers. To further improve the overall computing performance, there is one more field to explore – nodal reordering.

The linear systems in this study are constructed from finite element meshes. Before we

construct the linear system, it is possible for us to assign numbers to the mesh nodes in different orders, to make the system “easier to process.” The same effect can be achieved by permuting the rows and columns of the linear system, although permutations of a large system can be incredibly expensive if they involve the swapping of physical entries.

Depending on how the finite element mesh is created, its nodes may be ordered in a way

that makes computing inefficient. What we refer to as “natural ordering” usually numbers the nodes by the order in which they enter the system, which may be completely unrelated to the geometrical structure or connectedness of the mesh. As we shall see, a less intuitive ordering scheme is often desired for many different reasons.

A typical method for developing parallel algorithms for finite element methods is to

distribute groups of elements to different processors. Therefore, if a node is affiliated with elements that lie on different processors, then operation with these nodes require passing data between processors. The cost for inter-processor communication is often very significant. Hence, there are two issues: The first is optimally partitioning the elements to minimize the number of nodes that must be shared (and to achieve good load balancing). The second is the numbering of nodes on the mesh so that the parallel preconditioner is optimal.

Some reordering strategies can conserve computer storage as well as reduce the actual

calculation time, since they influence the performance of some preconditioners. These strategies and their potential benefits to finite element methods are of particular interest to us. We introduce the classic Cuthill-McKee Algorithm here as a starter, and go into further analysis in a later section.

15

2.4.1 Cuthill-McKee Algorithm In order to solve a large system of equations efficiently, one must conserve computer

storage as well as calculation time. E. Cuthill and J. McKee devised a robust algorithm to condition sparse symmetric matrices by reordering the nodes [7].

Given an n-by-n matrix A , we define the bandwidth of A as ,1 ,max { 0}i ji j N

i j a≤ ≤

− ≠ . In other words, it is a measure of how far the nonzero elements lie from the main diagonal. Matrices with small bandwidths have several advantages, as we shall see later. The Cuthill-McKee algorithm is designed to reduce the bandwidth of a matrix. Figure 2 shows an example of its bandwidth reduction ability.

The basic idea is, that for a sparse matrix A , we want to find a permutation matrix P ,

such that the matrix TPAP permutes the rows and columns of A to “move” the nonzero elements as close to the main diagonal as possible, hence reducing the bandwidth. In practice, however, permuting a large matrix would be extremely inefficient, so this algorithm aims to reorder the nodes on the graph of matrix A prior to the matrix’s construction, effectively introducing a permuted index set.

Before moving on to the algorithm, we shall review some basic terminology in graph theory.

A graph consists of a finite set of nodes (or vertices) connected by a finite set of edges. In a weighted graph, each edge is assigned a weight, which is a numerical value. The degree of a node is the sum of weights of the edges connected to it, and in an unweighted graph it is simply the number of edges connected to it. Two vertices are adjacent if there is an edge between them. A path is a sequence of consecutive edges, and two nodes are connected if there is a path from one to the other. A graph is connected if every node in it is connected to every other node. A component is a connected subgraph. A circuit is a path which ends at the starting node. A tree is a graph containing no circuit, and a spanning tree of a graph is a subgraph that is a tree and contains all of the nodes. For more detailed discussion, see [8].

Given a graph G Select a node of minimum degree, label it 1 When k nodes have been labeled, 1 k n≤ < , Select the smallest i such that node i has unlabeled neighbors Locate all of node i ’s unlabeled neighbors ( 1,..., mu u ) In increasing degree order, label these nodes 1,...,k k m+ + Repeat until all n nodes in G have been labeled.

Algorithm 6: Cuthill-McKee Algorithm

In the event that G has more than one component, this algorithm stops after it labels an

entire component’s m nodes with a tree, m n< . Continue by labeling a node of a minimum

16

degree on another component 1m + , and repeat until the all nodes in the graph are labeled. When dealing with finite element meshes that are entirely connected, this algorithm generates a spanning tree across the entire mesh.

In essence, given a labeled node i in the graph, this algorithm labels its next neighbor with

the number j as close to i as possible. The edge connecting these two nodes becomes ijA

in A , and such nodal ordering keeps i j− small. It is easy to see an upper bound for the bandwidth of A is 2 1m − , where m is the maximum number of nodes per level of the spanning tree generated by the algorithm.

The Cuthill-McKee Algorithm reduces, but does not necessarily minimize bandwidth. For

a specific family of matrices that share certain special properties, it is possible to devise an algorithm to reduce bandwidth beyond Cuthill-McKee’s ability. Moreover, even for the same graph, Cuthill-McKee can yield several matrices of different bandwidths, depending on the starting node and the ordering of equal-degree nodes. See Figure 3 for an example. Nonetheless, Cuthill-McKee ordering generally provides significant bandwidth improvement from natural ordering, applies to a wide range of problems, and can be easily automated. Due to these benefits, this algorithm is widely used in scientific computing.

2.4.2 Reverse Cuthill-McKee Algorithm (RCM) While the Cuthill-McKee Algorithm is well-known for its ability to reduce the bandwidth,

many preconditioners implement a more popular variation called the Reverse Cuthill-McKee (RCM) Algorithm [11]. As its name suggests, this variation uses the same ordering pattern but assigns numbers backwards.

Given a graph G Select a node of maximum degree, label it n When k nodes have been labeled, 1 k n≤ < , Select the largest number i such that node i has unlabeled neighbors Locate node i ’s unlabeled neighbors ( 1,..., mu u ) In decreasing degree order, label these nodes ,..., 1n k n k m− − − + Repeat until all n nodes in G have been labeled.

Algorithm 7: Reverse Cuthill-McKee

Although the algorithms seem very similar, the original Cuthill-McKee and the Reverse

Cuthill-McKee behave rather differently. We shall explore their differences in a later section.

17

Chapter 3 Problem Description

In the 2002 SIAM Review article by Oliker, Li, Husbands, and Biswas, “Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations,”

The quality of an ILU preconditioner (in terms of convergence rate) also has a nontrivial dependence on the ordering; however, this is outside the scope of our paper. [25]

This sparked our interest in the relationship between ILU preconditioners and nodal

reordering strategies. Which ILU works best? Which ordering strategy can improve preconditioning quality? Do different ILU preconditioners perform well with different reordering schemes? We want to search for efficient methods for solving finite element linear systems by studying the combined behavior of nodal reordering schemes and ILU preconditioners.

First, we propose a numerical study for this problem using Matlab. The strengths and

weaknesses of structure-based ILU( ) and threshold-based ILUT are compared using a series of examples. Next, classic Cuthill-McKee and Reverse Cuthill-McKee algorithms are analyzed. Then, a detailed numerical study is conducted involving multiple meshes, reordering strategies, and preconditioners. We seek a trend in reordering-preconditioning pair that can most efficiently simplify the solution of our linear systems.

Second, we want to investigate the same problem on parallel computers. In many

real-world scientific applications, the problem of interest is so large that a single desktop computer cannot manage the calculation. When stored onto a parallel machine with more than one processor (with distributed data), the linear system becomes much more sophisticated than before. To make parallel computers effective for these problems, many linear-algebra operations need different, more sophisticated algorithms. We study how the ILU preconditioners can be implemented in parallel, and how nodal reordering strategies can play an important role in this case.

18

Chapter 4 Numerical Experiments

The following numerical experiments are carried out with Matlab 7.0 (R14) on a Windows XP machine with Intel® Pentium® 4 2.02 GHz processor and 1GB of RAM. 4.1 Finite Element Meshes

Meshes and associated matrices are generated by partitioning a region into quadratic triangular elements and simulating the Laplace equation with a standard Galerkin finite element method.

The meshes we use are listed below. Many of them use the same domain with several

different mesh densities. Figure 4 illustrates the coarsest and the most refined mesh for each example. The number of mesh points provides a perspective of the problem size, the larger of which requires the better algorithms to efficiently speed up calculations.

Mesh Family Description

Mesh Name Total Mesh Points

2D Rectangle. 2d 913D Cube. 3d 1075

two_hole_0 511two_hole_1 763two_hole_2 1291two_hole_3 2823

Square domain with two circular holes.

two_hole_4 11113four_hole_0 431four_hole_1 629four_hole_2 1089

Long rectangular domain with four circular holes.

four_hole_3 2497cross_dom_0 401cross_dom_1 647cross_dom_2 1141cross_dom_3 2448

Cross-shaped domain.

cross_dom_4 4261two_dom_0 459two_dom_1 925two_dom_2 1787two_dom_3 2726two_dom_4 4791

Two square domains connected by a long rectangular strip.

two_dom_5 10652

19

4.2 ILU(0) and ILUT

Matlab implements incomplete LU factorization in two ways: the threshold-based ILUT method and the special case ILU(0) of the structure-based ILU( ) methods [21]. To use the ILUT method, the user would call luinc( A , t ), with t being the tolerance, or threshold. Factors of matrix A are computed, and entries smaller than t times the 2-norm of the column vectors of A are dropped. Specifying zero tolerance produces factors with no dropped entries; that is, luinc( A , 0) gives the same result as lu( A ). Another special case with tolerance being 1 drops all entries except those on the main diagonal. On the other hand, calling luinc( A , ‘0’) results in the structure-based ILU(0). The L and U factors from luinc( A , ‘0’) have nonzero entries only at locations where A has nonzero entries.

However, the other cases of structure-based ILU( ), where 1≥ , are unavailable. Not

only does Matlab lack of their implementation, many researchers also neglect to mention any ILU( ) beyond ILU(0). In some literatures, it appears as if ILU(0) and ILUT are the only LU-based preconditioners of importance. The reason is not yet trivial, but for now we should use the tools readily available to learn more about the two classes of ILU. 4.2.1 ILU(0)

Before we make comparisons, let us review the behavior of ILU(0). Its most obvious property is the predictable sparsity pattern: L or U has a nonzero entry if and only if A has a nonzero entry at the same location. On the other hand, while ILUT can produce much denser or sparser factors, the nonzero elements in the factors do not necessarily overlap those in A . Figure 5 shows the U factor of ILU(0) and ILUT at three thresholds, superimposed over the same matrix. Only ILU(0) preserves the original matrix’s sparsity pattern.

This feature can be very important, especially if the memory is limited. Given any square

matrix, we know a priori exactly how much memory is required to store its ILU(0) factors. Besides, because it allows no fill-in, the dependency among the rows stays constant. This simplifies parallel implementations, as discussed in the next section.

The danger of ILU(0), or of any structure-based ILU( ) method, is its insensitivity to the

magnitude of fill-ins. Potential LU elements are dropped only based on their locations, and not values. Therefore, many small, insignificant elements may be preserved rather than large ones. According to Karypis and Kumar, this can cause preconditioners to be ineffective for matrices arising in many realistic applications [16]. 4.2.2 ILUT

The sparsity pattern of the ILUT factors is unpredictable. We study the effect of the threshold parameter through numerical experiments. Four matrices are chosen and preconditioned with ILUT, using a wide range of threshold values between 0 and 1. The

20

condition number of the preconditioned system 1 1( )K U L A− − and the sparsity of the factors ( )L U I+ − are observed. Because we know intuitively that these two values are tradeoffs of each other, we compare them on the same graph. These values are plotted in Figure 6 where the log of the condition number is used to enhance the contrast among the smaller values.

As observed commonly among the four graphs, the precondition number and the number of

nonzero elements are roughly inversely related. However, the factors being less sparse do not always imply a higher preconditioning quality. The local minimums on the precondition curves indicate the best precondition numbers among all ILUT factors of similar sizes. To most efficiently solve a large linear system, it would be of the user’s interest to take advantage of such local minimums. Unfortunately, in practice they cannot be found without excessive computation.

When the precondition number approaches 1, the ILUT factor sizes approach those of a

complete LU and become extremely memory-inefficient. On the other hand, as the memory requirement is minimized ( 1t → ), the precondition number becomes humongous (sometimes even larger than ( )K A ). As discussed in Chapter 2, the ILUT preconditioner with 1t = is identical to the Jacobi (Diagonal) preconditioner. This experiment shows that such preconditioning really does not help when the matrix is not tridiagonal. As a result of this study, we see that ILUT is ineffective when the threshold is near either extreme. 4.2.3 Comparisons

In this section, we investigate how the structure-based factorizations differ from the threshold-based versions in practice. We want to find out how their preconditioning qualities compare. The answer is revealed by simply adding the statistics from ILU(0) in Figure 6.

On all four of the graphs, the blue circle indicates log of the ILU(0) precondition number,

and the red circle indicates the number of nonzero elements in the ILU(0) factors. The blue circle being to the left of the red circle suggests that, in order to achieve the same precondition number, ILUT generates more sparse factors. In other words, when the threshold is chosen so that ILUT factors are as sparse as ILU(0), the ILUT precondition number is much lower. In all four cases in this experiment, our findings show that the ILUT method is a more efficient preconditioner than ILU(0): less memory and smaller precondition number.

Although our findings here really do not support ILU(0), its performance appears to

approach that of ILUT as the matrix size increases. When the matrix size is in the hundreds, and assuming similar factor sizes, ILUT produces a precondition number that is roughly 1/8 of the ILU(0) precondition number. When the matrix size goes over ten thousand, that ratio increases to roughly 1/4. One may wonder as the matrix size grows, whether such ratio approaches or even exceeds 1, in which case ILU(0) becomes the better preconditioner. Unfortunately, since calculating the exact precondition number 1 1( )K U L A− − is impractically resource-consuming, we are unable to carry out the same experiment on much larger matrices.

21

Now, suppose ILU(0)’s weak performance is lack of attention to element magnitudes (as mentioned above), then we have a hypothetical idea of why structure-based ILU( ) with 0> is not used in practice. While still possessing the same structure-based magnitude insensitivity that makes it a weak preconditioner, it loses the clean and obvious sparsity pattern that ILU(0) has. Therefore, ILU( ) with 0> has little value beyond theoretical discussions. 4.3 CM and RCM

The next step would be to investigate the nodal reordering strategies and their effects on ILU preconditioners. We study the classic algorithm published by E. Cuthill and J. McKee in 1969, and the “reversed” version which is more popular in the scientific computing community today. These two algorithms are very similar, making absolutely no difference in certain applications. It is not trivial why a simple reversal of an ordering would make a big difference. However, as we shall see, Reverse Cuthill-McKee and the original CM can show much different strengths for different preconditioners. 4.3.1 The Structure

Suppose we have matrix F , generated from a finite element mesh with the Cuthill-McKee reordering, and matrix R , generated from the same mesh with the Reversed Cuthill-McKee reordering. By casually examining the sparsity patterns of the two matrices, we find that they are mirror images of each other – about the antidiagonal. Moreover, they have the same bandwidth. Their only difference seems to be the alignments of the nonzero entries, as illustrated in Figure 7.

In the Cuthill-McKee Algorithm, from a starting node i , its unassigned neighbors are given

numbers j , 1j + , 2j + … with j i> . Hence, in matrix F , many nonzero entries line up with the pattern ,i jF , , 1i jF + , , 2i jF + … with j i> . On the other hand, in Reverse Cuthill-McKee, unassigned neighbors of node j are given numbers i , 1i − , 2i − … with i j< . Hence, in matrix R , many nonzero entries line up with the pattern ,i jR , 1,i jR − ,

2,i jR − … with i j< . In other words, when we look at the upper triangular half of the matrices, the nonzero entries in F line up horizontally, whereas the nonzero entries in R line up vertically (reversed in the lower triangle).

Define an “umbrella region” of a matrix M as follows: , 0i jM = is in the umbrella

region if i j≤ and ,k jM is nonzero for some k i≤ , or if j i< and ,i kM for some k j≤ . Imagine two light sources shining directly on M : one to the left and one to the top. Suppose each nonzero entry is a solid block, then all the area under shade would be in the “umbrella region.”

When we perform an incomplete LU factorization, each possible fill-in is caused by a

nonzero entry to its left and one directly above it. For upper-triangular positions with no

22

nonzero entries above, or lower-triangular positions with no nonzero entries to the left, no fill-in would ever occur. In other words, fill-ins always happen within the “umbrella region.”

Due to the nonzero alignment patterns, it is easy to see in Figure 8 that matrix F has a

much larger “umbrella region” than matrix R . Hence, more potential fill-ins would/could be eliminated during ILU(0)/ILUT. Intuitively, the more fill-ins we eliminate, the more our incomplete factors differ from the actual factors. Having less accurate factors, the ILU preconditioner produces a less ideal (higher condition number) matrix. 4.3.2 The Experiments

Next, we would like to see the effects of nodal reordering on the ILU preconditioners. Renumbering the nodes in a finite element mesh is equivalent to permuting the equations and unknowns of the corresponding linear system. Although the system as a whole and its solutions stay intact, the physical structure changes and preconditioning qualities could be affected.

We take the four matrices from the previous experiment (Section 4.2.2, Figure 6) and

reorder/permute them using the Reverse Cuthill-McKee algorithm. The ILUT preconditioner is applied with the same threshold values as in the previous experiment, and the precondition numbers and sparsity numbers are recorded accordingly. These numbers of the original (natural ordering) matrices are subtracted from those of the new (RCM ordering) matrices. The differences are plotted in Figure 9.

When the threshold is large, RCM ordering does not seem to benefit ILUT preconditioning.

The precondition numbers differ dramatically without a fixed direction or pattern. At the same time, the sparsities of the factors are not affected much, if any. Therefore, when we use ILUT with large thresholds, the RCM reordering algorithm brings unpredictable effects on the preconditioning quality while saving no memory. In this case, the reordering is an unnecessary waste of effort.

On the other hand, RCM ordering does seem to improve ILUT with small thresholds. The

difference in precondition numbers converges to zero as the threshold decreases, and the difference in sparsity grows apart concurrently. RCM-ordered matrices are preconditioned to the same quality with smaller ILUT factors. However, based on our experiment, the amount of sparsity gained is not proportional to the size of the matrix or the size of the threshold. Comparing graphs (b) and (c) in Figure 9, matrix (c) before preconditioning almost doubles the size of matrix (b), but the ILUT memory it saves with RCM ordering is less than half, when the threshold is small. In addition, graph (d) suggests that for this particular matrix, the memory efficiency caused by RCM is maximized when threshold is slightly less than 10-3, and drops back to zero as the threshold continues to decrease. In essence, RCM ordering could improve the preconditioning qualities of ILUT, although the result is not guaranteed. We say “could” because the red curve does not really start to drop until the threshold reaches below 10-2, in which case the ILUT factors are larger than ILU(0) factors. When the sparsity difference becomes more significant is also when the ILUT factors become quite dense. Since the purpose of incomplete LU is to keep factors sparse, it would probably not be sensible to employ small

23

enough thresholds to see the difference between these two ordering strategies. Recall that our real goal in preconditioning is to speed up the convergence rate of iterative

solvers, and precondition number is only a rough indication for this. The actual effectiveness of preconditioners and reordering strategies need to be tested by actually solving the system with an iterative solver. Tables 1.1~1.22 are the condition number, sparsity, and iterations to convergence of our matrices (meshes described in Section 4.1), each with original ordering, Cuthill-McKee ordering, and Reverse Cuthill-McKee ordering. Preconditioners applied are ILU(0), Jacobi, and ILUT at 18 other threshold levels. The iterative solver used is Matlab’s implementation of GMRES.

For each preconditioner on each table, the best (i.e. smallest) values are highlighted. At a

quick glance, we find the counterintuitive fact that sometimes a smaller condition number is associated with a slower convergence! For example, on Table 1.3 (two_hole_0) with ILUT(5e-2), the RCM produces the most sparse matrix with the smallest condition number, but it took the solver two more iterations to convergence than matrices of natural or CM orderings. In general, however, small condition numbers still lead to faster convergence. This shows us that using the precondition number is a good predictor of the algorithm performance, but the best predictor is to actually test it in the iterative solver.

For ILU(0) preconditioner, RCM reordering consistently produces the best results – lowest

condition number and, more importantly, fastest convergence. RCM-ordered matrices converge in up to 27% fewer steps than natural-ordered matrices. On the other hand, CM tends to be a very poor ordering for ILU(0), giving significantly worse numbers than natural ordering. From our structural analysis, recall that the classic CM ordering generates a much larger “umbrella region” than original ordering, in which ILU(0) eliminates many more potential fill-ins and makes its factors less closer to complete LU. On the other hand, RCM with a smaller “umbrella region” has an ILU(0) factorization much closer to complete LU.

For threshold-based ILUT, however, RCM is no longer so beneficial. Especially for

moderate thresholds (10-1 ~ 10-2), RCM-ordered matrices tend to cost more GMRES iterations than the other two. Only when the thresholds (hence condition numbers) are very small, and ordering makes no notable difference in convergence rates, does RCM has an advantage of having the most sparse ILUT factors. Unless a low threshold level is required and memory is extremely expensive, though, the little benefit of RCM does not even justify the reordering process.

In contrast, the classic CM ordering seems to suit ILUT fairly well. Especially when the

matrix becomes large and the threshold is relatively small, it yields better convergence rate than natural or RCM orderings. It is interesting to observe that such CM-ordered matrices often have the largest precondition numbers, yet they still converge the fastest.

In a nutshell, we learn that the usefulness of each ordering strategy depends entirely on how

it is used. Although nothing is absolute, we do observe a general pattern for best-matching ordering strategies, preconditioners, and problem sizes. Reverse Cuthill-McKee is consistently the best ordering scheme for ILU(0), classic Cuthill-McKee ordering is great for ILUT with

24

moderate to small thresholds on large systems, and natural ordering should suffice by itself on small problems. 4.4 Breadth-First Search Orderings

The Breadth-First Search (BFS) is an elementary algorithm for producing a spanning tree of a graph [12]. In computer science terms, it is a method to traverse the entire graph for some particular data. It starts from any node, searches all of its neighbors and orders them into a waiting queue. Then, it repeats the same process on the next node in queue that still has unsearched neighbors.

The Cuthill-McKee algorithm and its reversed version are essentially two special cases of

BFS, with some special requirements. Since the algorithm was originally developed with bandwidth reduction in mind, not preconditioning quality, one may wonder if there exist other BFS-based ordering schemes that better suit our interest.

Given a graph G , suppose the scheme assigns numbers low to high [high to low] Select a node and label it 1 [ n ] When k nodes have been labeled, 1 k n≤ < , Select the smallest [largest] i such that node i has unlabeled neighbors Locate all of node i ’s unlabeled neighbors ( 1,..., mu u ) In specified sorting order, label these nodes 1,...,k k m+ + [ ,..., 1n k n k m− − − + ] Repeat until all n nodes in G have been labeled.

Algorithm 8: Modified Breadth-First Search

We devise a test consisting of 14 BFS-based ordering schemes, each with some unique

requirements. There are seven ways to traverse the mesh by sorting a node’s neighboring nodes differently. With each traversal, one scheme assigns numbers from low to high while another goes from high to low. CM and RCM are included among these. Below is a table listing the schemes used in our test.

Test # Sort neighbors of node i based on… Number assignment Test01 Low to high Test02

Existing ordering in the given system. High to low

Test03 (CM) Low to high Test04

Degree of the neighboring nodes, in ascending order. High to low

Test05 Low to high Test06 (RCM)

Degree of the neighboring nodes, in descending order. High to low

25

Test07 Low to high Test08

Physical distance of the neighboring nodes from node i, in descending order. High to low

Test09 Low to high Test10

Physical distance of the neighboring nodes from node i, in ascending order. High to low

Test11* Low to high Test12*

For neighboring node j, the value of ijA , in descending order High to low

Test13* Low to high Test14*

For neighboring node j, the value of ijA , in ascending order High to low

On this list, Test01 is the most generic version of Breadth-First Search, so the new ordering

highly depends on the existing natural ordering. Test03 and Test06 are classic CM and the popular RCM, listed here for comparison purposes. The rest are some simple modifications of the existing ideas, and they only represent a small fraction of possible Breadth-First Search variants.

Test11 through Test14 are not finite element mesh orderings in the same sense as the others.

Because they are ordered based on the magnitudes of matrix A ’s entries, the orderings are not available before the matrix is built from the mesh. They can be achieved by matrix permutations and have a positive effect on ILU preconditioning. However, on computers that permute by moving physical entries (such as on distributed-data machines), the cost of such operation could become prohibitive. Disregarding this fact, rearranging a matrix according to the actual magnitude of its elements might be very practically sensible. If deemed useful, these schemes can still be applied to time-dependent problems where the cost of building/permuting the matrix can be sacrificed for a higher efficiency at each time step of solving the problem.

We pick three meshes from each of our four domains, apply the test ordering schemes to

them, precondition using ILU(0) and nine thresholds of ILUT, then record their GMRES convergence iterations. These numbers are recorded in Tables 2.1 ~ 2.4.

The purpose of this experiment is to find better ordering schemes than what we already

have from the previous experiment. Therefore, we use bold borders for columns A, 03, and 06, which are natural, CM, and RCM orderings. For each row, the smallest iteration number among the three orderings is determined. Then, any scheme that produces the same or better iteration is highlighted green or yellow, respectively. In other words, a green entry means that the ordering is as good as the best among natural, CM, and RCM; a yellow entry means that it is better than all of the three. When a column has many green and yellow entries, that particular ordering is probably what we are looking for.

First, we examine the ILU(0) case. RCM is still the best among orderings – with only one

exception. Test12 consistently produces the same quality matrix as RCM, and sometimes even better. This permutation scheme has the same property as RCM which produces a small “umbrella region” because it assigns numbers backwards (high to low). Now, because it arranges the largest elements of A close to the main diagonal, it minimizes the overall

26

magnitudes of the eliminated fill-ins. While ILU(0) performs its structure-based incomplete factorization, Test12 permutation gives it some of the threshold-based advantages. Therefore, it is stronger than the popular RCM ordering.

Next, we look at the ILUT cases. It seems difficult to compare all of the columns at the

first glance, although it is apparent that some of our new ordering schemes are highly comparable to or even better than natural, CM, and RCM. In the event that the threshold is small, the ordering schemes which assign numbers forward perform better than their backward equivalents. However, the distinction is less obvious when the threshold is large. After more detailed examinations, we find that Test05, Test10, Test11, and Test13 have the best overall performances. Disregarding Test11 and Test13, which are not true mesh reordering schemes, there are still two very satisfying results. Test05 is merely a modified Cuthill-McKee algorithm, with nodes sorted in descending degree order rather than ascending. Test10, on the other hand, requires a slightly more sophisticated implementation and is more computationally expensive due to the calculation for the physical distances between nodes.

Our experiment finds some BFS-based nodal reordering schemes that can generate better

numerical results than existing Cuthill-McKee and Reverse Cuthill-McKee algorithms. Our schemes not only can assist ILU preconditioners and GMRES iterative solver into a faster convergence, they are also very easy to understand and implement. What we still realize is that not a single scheme can be perfect, so our choice should be made around each specific problem that we want to solve and the preconditioner that we wish to use.

27

Chapter 5 The Parallel Case

Although ILU preconditioners can significantly improve the performance of iterative solvers by reducing the precondition number, researchers might be reluctant to use them due to their unparallelizable nature. The classic LU factorization is a completely sequential process, and its derivatives are usually far from being parallelizable. Parallel versions of ILU preconditioners do exist, thanks to contemporary research, but they have not become very widespread. The popular parallel scientific computing software package PETSc [2], for example, does not even have a parallel ILU implementation. Therefore, in this section we examine a method of parallelizing ILU factorization.

This parallel ILU factorization algorithm is inseparable from its nodal reordering strategy.

This ordering is much more sophisticated than those in the single-processor cases, and the ILU techniques only count for a small fraction of this algorithm. Although ILU(0) and ILUT are both parallelizable based on this theory, the ILUT case is slightly more complicated. 5.1 Onto a Parallel Computer

Suppose we have a 48-node finite element mesh with its corresponding matrix A (see Figure 10). Under natural ordering, the nonzero elements roughly form a few lines parallel to and near the main diagonal, similar to some of our study cases in the previous section. Solving this system using one processor is straightforward, given the tools that we have already discussed before.

Next, assume that we want to solve this problem on a concurrent computer of four

processors. Then it is not so straightforward. First of all, we need to partition the mesh into four pieces. This partitioning process is a nontrivial field of study itself. There are direct k-way methods, recursive methods based on geometry or graph theory, and multi-level methods [3, 9, 15, 23]. Efficient and robust partitioning algorithms often coarsen the mesh first, partition using one of the simple methods, and then uncoarsen to the original mesh with optimization and local refinement processes at each step. Typically, there are two common goals of these algorithms: 1) partition the mesh into roughly equal sizes, so the processors have balanced workload and 2) minimize the number of edges connecting two partitions, which minimizes the amount of communication across the processors. We are not to go into details of these algorithms, but we shall assume to have an ideal partition of the mesh.

Figure 10 (b) shows an ideal partitioning of this mesh, with each partition holding exactly

12 nodes. Each of the four colors represents a partition, which is held on one processor. Still under natural ordering, we color each row of matrix A with the color that corresponds to the processor where it resides. When the colors interleave, we know that this ordering leads to an inefficient ILU preconditioner. Consider row 22 (the first yellow row). To process this row,

22,27A needs information from 19,27A , which lies on the red processor. And to process row 19,

19,23A requires information from 15,23A , which lies on the blue processor. To factor A under

28

such a configuration, a large number of rows have to be passed back and forth among the processors at each step, causing the factorization to be unbearably expensive.

One intuitive way to improve this situation is to use some simple nodal reordering to group

each processor’s rows together. For example, we can assign a new number to every node in one processor before moving on to another, so matrix A looks like in Figure 10 (c). Now the inter-processor communication is somewhat reduced, and factoring the rows in the first partition requires no information from the other processors. However, while this ordering scheme can benefit sequential algorithms, the rows’ dependence on each other still prevents ILU from running concurrently on all processors. 5.2 The Reordering Scheme

Now, we introduce the nodal reordering scheme for a parallel sparse factorization as described by Karypis and Kumar [16].

First, classify the nodes in each processor as interior or interface nodes. Interior nodes are

those with adjacent nodes residing on the same processor, while interface nodes have adjacent nodes on two or more processors. Figure 11 (a) highlights all the interface nodes. Assign new numbers to the interior nodes, one processor at a time. When this is finished, there is an m n< such that nodes k m≤ are all interior nodes, and nodes 1k m≥ + are all interface nodes. In our case, as illustrated in Figure 11 (b), 26m = .

The next step is to compute maximal independent sets from the remaining interface nodes,

denoted by IA . An independent set I of a graph G is a subgraph in which no two nodes are adjacent. Maximal independent sets can be found using Algorithm 9: Luby’s algorithm [19]. Once we have a maximal independent set I of IA , we assign new numbers, in order, to the nodes in I . Afterwards, \I IA A I= , I = ∅ , and the process repeats until IA is empty. Figure 11 (c) ~ (f) illustrate the successive steps of independent sets, along with their new nodal numbering. Note that at each step, nodes in I are numbered with regard to the order of the processors.

Figure 11 (g) shows the original mesh with new the nodal numbers, and Figure 11 (h) is the

corresponding matrix A . At the first glance, this matrix seems very poorly ordered. The nonzero elements are spread out without a clear pattern, and the bandwidth is huge. On a single processor, as we have already examined, such arrangement yields poor conditioning. Also, each processor holds largely disjoint rows when row m> , which does not seem to improve from the original ordering. However, the independence among these new rows enables this seemingly messy matrix to be highly parallelizable.

29

To compute a maximal independent set I of a given graph G

{ }

{ }

While 0

For 1 to

Label( )=Random()End

For 1 to

If Label( )<Label( ) for all adj( , )

EndEnd

\

\ | ( , )

End

i

i j i j

i

i k i k

G G

G

i G

G

i G

G G G G

I I G

G G I

G G G G I I adj G I

=

>

=

=

= ∪

=

= ∈ ∃ ∈ ∋

Algorithm 9: Luby’s algorithm

5.3 ILU Analysis

As for the mesh, we should treat the interior rows and interface rows of matrix A (Figure 11 (h)) separately. The steps of factorization are illustrated in Figure 12.

The first and entirely parallelizable part of the ILU factorization of A is its interior rows.

As highlighted in Figure 12 (a) with vertical bars, different processors do not have nonzero elements in the same column. In other words, the interior rows in one processor are completely independent from interior rows in other processors. Therefore, all processors can factor their own interior rows simultaneously. No waiting or inter-processor communication is required. In fact, since each set of these rows can be viewed as an independent matrix, we can locally enhance it with reordering strategies, such as those discussed in the previous section.

Then, as illustrated in Figure 12 (b) ~ (d), one independent set of interface rows is factored

at a time. While factoring these rows requires information from the above rows, some on foreign processors, it can still be run in parallel. Because the interface rows within one independent set do not depend on any row that is within the same independent set and on another processor, there is no need for the processors to wait for each other. In the best case, all processors would have equal number of rows within each independent set. Therefore, they can

30

finish together and no processor needs to wait for another before moving on to the next set. Although this nodal ordering strategy can be used for both structure-based ILU( ),

threshold-based ILUT, and even complete LU algorithms in parallel, slight differences apply to each implementation. ILU(0) is the simplest, because it requires no fill-in whatsoever and independent rows remain independent. ILUT is more complicated because its fill-ins introduce new dependencies during factorization. When computing the independent sets, possible fill-ins have to be taken into consideration to keep the sets truly independent. As a consequence, each independent set could be smaller than that of an ordering for ILU(0).

Since the interior rows can be factored completely in parallel, it is to our best interest to

have as many of them as possible. When the mesh/matrix is relatively large compared to the number of partitions/processors, most of the nodes/rows would be interior. The more interior nodes outnumber interface nodes, the closer the factorization is to true parallel. Increasing the number of processors increases the speed of the factorization in parallel, but at the same time reduce the parallelizability by increasing the percentage of interface rows. In any event, a high-quality mesh partitioning algorithm is critical for the effectiveness of the parallel ILU. 5.4 A Partitioning Test

To better illustrate our analysis, we partition some of our test meshes and discuss their behavior when the preconditioner is implemented on parallel processors.

Our partitioning method for square domains is completely location-based, and our goal is to

assign an equal number of nodes/rows to each processor. First, we draw a horizontal line across the middle of the mesh. Then, we compare the numbers of nodes falling on either sides of the line, and move the line toward the smaller side. Eventually, the line bisects the mesh. Repeatedly applying this process to the two halves of the mesh can easily give us 2n equally sized partitions. The quality of our partitioning is fairly high, and can be easily automated. The downside is that this process can be very costly. However, for large problems arising from time-dependent nonlinear PDEs, this preprocessing cost is negligible when compared to the remaining calculations.

We choose to partition three levels of refinement of the same domain, each into 2, 4, and 16

pieces, as graphed in Figure 13. The largest mesh (two_hole_4) has over 25 times more nodes than the smallest mesh (two_hole_0). Because the nodes lay fairly evenly across the domain, each of our partitions takes up nearly the same physical area. The two holes in the domain cause the nodal bisection lines to shift; otherwise we should see perfect grids. On each mesh, different colors indicate the different processors assigned to handle the nodes. Interior nodes are labeled with circles and interface nodes are labeled with asterisks – though the symbols are not visible on the largest mesh where all nodes are crammed together.

We list in Tables 3.1 ~ 3.3 the number of total nodes and the number of interior nodes in

each of the nine graphs. Note that our algorithm distributes nodes among the processors as evenly as possible – although this is not usually feasible. The percentage of interface nodes is

31

listed on the side. The magnitude of this number is the most critical indication of the parallel efficiency.

For each of the three meshes, the percentage of interface nodes increases as the number of

partitions increases. In particular, when the smallest mesh is distributed to 16 processors, well over half of its total nodes are interface nodes! Despite the amount of concurrent computing power available, the majority of our parallel ILU efforts would be wasted transferring data back and forth among processors. It is very possible that 4 processors can factor this system faster than 16 processors together.

Also, when there are more partitions, the amount of interface nodes per partition varies

more. When the smallest mesh is partitioned in 4, the number of interface nodes per partition ranges between 27 and 32. On the other hand, when the same mesh is partitioned into 16, the range increases to be between 6 and 23. Such large difference among partitions is very undesirable, because it means great workload disparity for the processors at every step in the parallel ILU algorithm.

Another thing to observe is the inversely proportional relationship between the mesh size

and the percentage of interface nodes. For all of 2-, 4-, and 16-processor cases, the percentage drops about one-third from Tables 3.1 to 3.2, and another two-thirds from Tables 3.2 to 3.3. When the problem size is large, partitioning the mesh would yield relatively few interface nodes and make it sensible to employ a large number of processors. In Table 3.3 with 16 processors, we see that 83.85% of all nodes are interior and can be ILU-factored simultaneously without inter-processor communications. This speeds up the factorization significantly and justifies the use of many processors.

The main lesson from this experiment is that, the efficiency of parallel ILU does not

necessarily increase with the number of parallel processors. We must regard the nature of the required nodal reordering scheme, and consider the added dependency among processors as the mesh is split into smaller partitions. When we can choose an appropriate level of parallelism, the algorithm can run efficiently in near true parallel. 5.5 Other Partitioning Considerations

While the aforementioned partition strategy is capable of partitioning any mesh into equal sizes, it can often generate undesirable partitions. Figure 14 (a) and (b) are meshes four_hole_1 and cross_dom_2 partitioned in four using such strategy. The partition borders lie parallel to the long sides of the mesh, and are clearly causing more interface nodes than necessary. Only 66.04% and 74.10% of those mesh points are interior, respectively. However, if we partition according to the shapes of the meshes, we can still preserve equal sizes while reducing the amount of interface. Figure 14 (c) and (d), for example, are those two meshes with a new dividing strategy. Now their interior nodes increase to 83.11% and 88.39%, and can run in parallel much more efficiently.

32

Chapter 6 Conclusions

The aim of this thesis is to study nodal reordering strategies for finite element meshes that speed up the convergence rate of an iterative solver. We start out by examining some classic linear system solvers, iterative linear solvers, preconditioners, and reordering strategies. Then, we proceed to numerically experiment some of the schemes and methods, and to analyze the strategies on a single processor and on multiple processors.

In the single-processor case, we first compare the classic Cuthill-McKee ordering and the

popular Reverse Cuthill-McKee. While they equally reduce the matrix’s bandwidth, they behave very differently before preconditioners: RCM only works with structure-based ILU(0), and CM works best with ILUT with small thresholds. Then, we examine a list of similar ordering strategies based on the concept of Breadth-First Search, and find that some of them improve preconditioning qualities even more than the well-known CM and RCM.

In the multi-processor case, we learn that the parallel ILU algorithm is highly dependent on

a nontrivial ordering strategy. Whether or not the algorithm can run efficiently in parallel is determined by the quality of the mesh partitioning, and the goal is to minimize the amount of interface nodes in each partition. Assuming perfect partitions exist, the number of partitions and parallelizability are negatively related. Unless the problem size is large enough, employing too many processors might actually slow down the computations.

To sum up, we discover that nodal reordering strategies go side-by-side with

preconditioners. While one strategy works very poorly for one preconditioner, it can be the best choice for another. No ordering is perfect in all situations, so we must be careful when choosing the right one for each individual problem.

33

Figures and Tables Figure 1 Incomplete LU Factorization The main disadvantage of LU Factorization is the enormous memory space requirement. Incomplete LU highly reduces this cost. Matrix A : 582 nonzero entries.

A LU= : 1950 total nonzero entries.

ILU(0): A LU≈ : 648 total nonzero entries.

34

Figure 2 Cuthill-McKee Ordering This figure demonstrates Cuthill-McKee Algorithm’s ability to reduce bandwidth in matrices. In the second matrix, nonzero elements are rearranged close to the main diagonal, unlike those scattered apart in the first matrix. The mesh should be viewed as a cylinder, where the bottom gray row “wraps around” to the top row. Natural ordering. Bandwidth: 36

Cuthill-McKee ordering. Bandwidth: 5

35

Figure 3 Cuthill-McKee Starting Node To reach the smallest bandwidth using Cuthill-McKee Algorithm, sometimes we do not want to start at a node of lowest degree: Starting with a node of lowest degree. Bandwidth: 5.

Starting with some other node. Bandwidth: 3.

36

Figure 4 Finite Element Meshes

two_hole_0 two_hole_4

four_hole_0

four_hole_3

37

cross_dom_0 cross_dom_4

two_dom_0

two_dom_5

38

Figure 5 ILU(0) and ILUT These figures are for comparison among several ILU implementations’ sparsity patterns. ILU(0) and ILUT with three thresholds are applied on the same matrix. Incomplete factor U is superimposed in red over original matrix in blue.

ILU(0) ILUT(0.1)

ILUT(0.01) ILUT(0.001)

39

Figure 6 ILU Experiments

(d) two_hole_2

t Precond nnz ILU(0) 134.63 11699 ILUT

0.02 41.508 111060.01 21.438 13761

(c) two_hole_1 T Precond nnz ILU(0) 66.007 6407 ILUT

0.02 17.477 61590.01 9.3389 7396

(b) two_hole_0

t Precond nnz ILU(0) 43.118 4081 ILUT

0.02 12.249 39850.01 7.3715 4862

(a) 2d t Precond nnz ILU(0) 39.266 582 ILUT

0.02 7.3877 5290.01 3.2354 670

40

Figure 7 CM and RCM Natural ordering

Cuthill-McKee ordering, where nonzero entries line up “horizontally” in the upper triangle.

Reverse Cuthill-McKee order, where nonzero entries line up “vertically” in the upper triangle.

41

Figure 8 “Umbrella Regions” Although Cuthill-McKee Algorithm and Reverse Cuthill-McKee are very similar, the matrices they generated have dramatically different “umbrella regions.” Therefore, Incomplete LU factorizations behave rather differently on them. Matrix F : Cuthill-McKee ordering. Umbrella region: 1020

Matrix R : Reverse Cuthill-McKee ordering. Umbrella region: 508

42

Figure 9 Natural v. RCM Ordering on ILUT Matrices from the Figure 6 are reordered with Reverse Cuthill-McKee ordering, and ILUT preconditioning is applied with the same thresholds. Precondition numbers and sparsity numbers of the natural-ordering matrices are subtracted from those of the RCM-ordering matrices. The differences are plotted below.

(a) 2d (b) two_hole_0

(c) two_hole_1 (d) two_hole_2

43

Figure 10 Mesh Partitioning

(a) Original mesh

(b) Partitioned mesh

(c) One possible ordering for parallel

44

Figure 11 Mesh Partitioning for Parallel ILU

(a) (b)

(c) (d)

45

(e) (f)

(g) (h)

46

Figure 12 Parallel ILU

(a) (b)

(c) (d)

47

Figure 13 Mesh Partitioning Test 1 2 Processors 4 Processors 16 Processors

(a) two_hole_0: 415 × 415 2 Processors 4 Processors 16 Processors

(b) two_hole_2: 1115 × 1115 2 Processors 4 Processors 16 Processors

(c) two_hole_4: 10597 × 10597

48

Figure 14 Mesh Partitioning Test 2

(a)

(b)

(c)

(d)

49

Table 1 CM v. RCM and ILU(0) v. ILUT Table 1.1

2d

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations

Natural RCM CM Natural RCM CM Natural RCM CM

A 136.77 136.77 136.77 582 582 582 36 36 36

ILU(0) 39.27 22.86 72.39 582 582 582 11 8 11

Jacobi 158.22 158.22 158.22 66 66 66 30 30 30

ILUT(5e-1) 153.46 150.31 150.31 71 71 71 32 31 31

ILUT(4e-1) 143.08 152.19 150.31 77 77 77 33 32 32

ILUT(3e-1) 143.08 152.19 150.31 77 77 77 33 32 32

ILUT(2e-1) 28.61 84.71 82.46 296 300 296 12 13 13

ILUT(1e-1) 34.77 31.43 111.11 321 328 325 12 11 12

ILUT(9e-2) 37.03 31.43 71.30 339 328 340 11 11 12

ILUT(8e-2) 25.97 31.46 99.52 378 329 366 10 11 11

ILUT(7e-2) 20.65 24.47 73.22 400 348 404 9 10 10

ILUT(6e-2) 10.38 32.76 39.02 430 385 456 8 9 9

ILUT(5e-2) 9.83 32.56 19.44 440 424 475 8 9 8

ILUT(4e-2) 10.54 21.18 19.30 452 443 478 8 8 8

ILUT(3e-2) 10.68 10.49 20.54 497 472 508 7 6 8

ILUT(2e-2) 7.39 6.73 8.82 529 483 586 7 6 6

ILUT(1e-2) 3.24 2.97 6.69 670 551 737 6 5 5

ILUT(5e-3) 2.65 2.17 2.25 807 581 917 5 4 4

ILUT(1e-3) 1.15 1.26 1.22 1103 682 1130 3 3 3

ILUT(5e-4) 1.08 1.11 1.10 1183 701 1174 3 3 3

ILUT(1e-4) 1.01 1.02 1.01 1270 763 1191 2 2 2

50

Table 1.2

3d

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 192.32 192.32 192.32 14640 14640 14640 39 39 39

ILU(0) 19.92 18.58 33.99 14640 14640 14640 9 9 12

Jacobi 188.82 188.82 188.82 618 618 618 35 35 35

ILUT(5e-1) 188.82 188.82 188.82 618 618 618 35 35 35

ILUT(4e-1) 188.82 188.82 188.82 618 618 618 35 35 35

ILUT(3e-1) 189.67 188.31 188.46 643 643 643 36 36 34

ILUT(2e-1) 179.67 179.79 187.75 703 703 703 35 35 34

ILUT(1e-1) 190.43 163.76 132.80 3724 3857 3665 19 18 19

ILUT(9e-2) 190.43 163.52 144.34 3724 3973 3720 19 18 19

ILUT(8e-2) 182.19 142.24 148.55 3878 4020 3934 19 17 18

ILUT(7e-2) 179.78 88.88 97.37 5667 5680 5382 14 13 16

ILUT(6e-2) 119.84 71.44 128.72 6117 6376 5944 13 12 15

ILUT(5e-2) 178.40 80.06 146.33 6325 6699 6174 13 11 15

ILUT(4e-2) 155.44 76.12 277.73 6761 6841 7120 11 11 14

ILUT(3e-2) 50.80 37.76 88.41 7388 7471 8138 10 10 12

ILUT(2e-2) 68.65 30.75 132.54 9809 8438 10888 9 9 11

ILUT(1e-2) 39.37 12.91 114.25 14580 12820 17209 8 7 9

ILUT(5e-3) 7.66 9.77 8.36 19481 17333 22371 6 6 7

ILUT(1e-3) 1.89 2.09 2.83 35000 25508 42476 4 4 5

ILUT(5e-4) 1.53 1.48 1.81 42652 30245 53600 4 4 4

ILUT(1e-4) 1.07 1.10 1.17 66287 42460 86616 3 3 3

51

Table 1.3

two_hole_0

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 166.91 166.91 166.91 4081 4081 4081 54 54 54

ILU(0) 43.12 37.47 63.65 4081 4081 4081 14 11 15

Jacobi 180.52 180.52 180.52 415 415 415 53 53 53

ILUT(5e-1) 180.52 180.52 180.52 415 415 415 53 53 53

ILUT(4e-1) 180.52 180.52 180.52 417 417 417 53 53 53

ILUT(3e-1) 185.49 184.38 184.34 437 441 438 52 52 52

ILUT(2e-1) 200.78 196.61 195.63 974 980 971 43 42 43

ILUT(1e-1) 88.10 51.08 87.38 2531 2546 2543 16 18 16

ILUT(9e-2) 92.96 48.41 72.58 2587 2617 2598 15 17 15

ILUT(8e-2) 55.36 48.93 71.78 2632 2722 2652 14 16 15

ILUT(7e-2) 54.55 55.58 71.90 2702 2802 2756 14 16 14

ILUT(6e-2) 54.90 50.38 71.90 2810 2907 2947 13 15 13

ILUT(5e-2) 43.64 39.68 50.83 3042 3001 3218 12 14 12

ILUT(4e-2) 31.04 31.98 35.97 3378 3185 3571 11 13 11

ILUT(3e-2) 19.59 22.68 25.58 3679 3553 3851 10 11 10

ILUT(2e-2) 12.25 11.80 16.63 3985 4003 4205 9 9 9

ILUT(1e-2) 7.37 6.00 10.42 4862 4447 5453 7 7 7

ILUT(5e-3) 4.38 4.67 7.87 5731 5005 6541 6 6 6

ILUT(1e-3) 1.70 1.60 1.56 8791 6357 9873 4 4 4

ILUT(5e-4) 1.25 1.41 1.28 9924 7004 11320 3 4 3

ILUT(1e-4) 1.04 1.05 1.06 12252 8599 14197 3 3 3

52

Table 1.4

two_hole_1

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 252.69 252.69 252.69 6407 6407 6407 66 66 66

ILU(0) 66.01 47.17 73.63 6407 6407 6407 15 14 17

Jacobi 231.77 231.77 231.77 635 635 635 64 64 64

ILUT(5e-1) 231.77 231.77 231.77 635 635 635 64 64 64

ILUT(4e-1) 231.77 231.77 231.77 640 640 640 64 64 64

ILUT(3e-1) 245.91 236.85 245.91 698 707 700 63 63 63

ILUT(2e-1) 315.20 298.96 321.03 1475 1483 1497 56 55 56

ILUT(1e-1) 131.83 69.41 154.31 3962 3993 3970 18 22 19

ILUT(9e-2) 126.96 70.98 139.44 4050 4166 4060 17 21 18

ILUT(8e-2) 122.35 73.71 129.76 4123 4329 4153 16 19 17

ILUT(7e-2) 108.58 78.46 116.07 4215 4471 4269 16 18 17

ILUT(6e-2) 86.98 57.40 101.03 4413 4588 4573 15 17 15

ILUT(5e-2) 86.49 62.21 68.92 4763 4754 5049 14 16 14

ILUT(4e-2) 34.28 42.85 42.20 5246 5038 5614 12 15 12

ILUT(3e-2) 30.21 53.20 36.93 5633 5626 6006 11 12 11

ILUT(2e-2) 17.48 26.80 26.13 6159 6359 6633 10 10 10

ILUT(1e-2) 9.34 11.39 21.15 7396 7089 8631 8 8 8

ILUT(5e-3) 6.85 8.56 8.39 8704 8047 10571 7 7 7

ILUT(1e-3) 2.04 2.06 2.13 12112 10771 16182 4 4 4

ILUT(5e-4) 1.56 1.43 1.55 13589 12160 18575 4 4 4

ILUT(1e-4) 1.07 1.11 1.08 16580 15284 23458 3 3 3

53

Table 1.5

two_hole_2

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 506.16 506.16 506.16 11699 11699 11699 79 79 79

ILU(0) 134.63 85.76 139.69 11699 11699 11699 18 16 22

Jacobi 395.11 395.11 395.11 1115 1115 1115 78 78 78

ILUT(5e-1) 395.11 395.11 395.11 1116 1116 1116 78 78 78

ILUT(4e-1) 395.11 395.11 395.11 1117 1117 1117 78 78 78

ILUT(3e-1) 398.60 398.60 398.60 1133 1136 1134 78 78 78

ILUT(2e-1) 486.35 510.30 527.67 2105 2151 2111 70 67 71

ILUT(1e-1) 234.51 138.12 329.64 7293 7470 7268 22 27 23

ILUT(9e-2) 232.45 120.64 241.77 7342 7683 7336 21 25 23

ILUT(8e-2) 250.78 118.04 291.68 7416 7980 7425 21 22 22

ILUT(7e-2) 184.92 112.49 211.59 7541 8164 7652 20 21 21

ILUT(6e-2) 174.74 114.65 183.70 7773 8291 8299 19 21 19

ILUT(5e-2) 136.67 98.77 163.14 8521 8535 9332 18 20 17

ILUT(4e-2) 81.58 92.77 140.51 9517 9045 10483 15 18 16

ILUT(3e-2) 59.45 63.79 108.37 10291 10322 11231 14 15 14

ILUT(2e-2) 41.51 38.02 68.35 11106 11505 12254 12 12 13

ILUT(1e-2) 21.44 23.97 24.40 13761 12891 16535 10 10 10

ILUT(5e-3) 11.17 12.85 14.65 16379 15044 20295 8 8 7

ILUT(1e-3) 2.81 2.72 3.47 23943 21700 33531 5 5 5

ILUT(5e-4) 1.89 1.98 2.03 27428 24980 40085 4 4 4

ILUT(1e-4) 1.18 1.18 1.15 34474 33011 53683 3 3 3

54

Table 1.6

two_hole_3

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 1065.99 1065.99 1065.99 27883 27883 27883 114 114 114

ILU(0) 328.00 206.14 296.25 27883 27883 27883 25 24 31

Jacobi 934.49 934.49 934.49 2567 2567 2567 111 111 111

ILUT(5e-1) 934.49 934.49 934.49 2567 2567 2567 111 111 111

ILUT(4e-1) 934.49 934.49 934.49 2576 2576 2576 111 111 111

ILUT(3e-1) 935.46 898.72 938.47 2643 2643 2642 109 108 108

ILUT(2e-1) 1247.14 1217.30 1270.78 4390 4475 4426 100 97 98

ILUT(1e-1) 577.94 350.75 548.36 16983 17529 16978 30 38 32

ILUT(9e-2) 449.01 256.82 558.75 17101 18030 17136 30 36 32

ILUT(8e-2) 484.19 282.09 540.38 17260 18884 17323 30 31 31

ILUT(7e-2) 484.15 245.62 505.88 17525 19236 17848 29 30 30

ILUT(6e-2) 424.44 251.24 445.49 18111 19447 19366 28 30 27

ILUT(5e-2) 397.04 266.47 496.63 19901 20173 22176 26 28 24

ILUT(4e-2) 195.99 216.41 300.88 22712 21503 25139 22 27 19

ILUT(3e-2) 132.32 151.38 213.08 24523 24558 26693 20 22 18

ILUT(2e-2) 104.28 80.63 130.51 26820 27617 29470 17 17 17

ILUT(1e-2) 43.35 53.91 53.82 33217 31210 39963 14 15 13

ILUT(5e-3) 33.77 34.61 24.33 40664 37000 49617 11 12 11

ILUT(1e-3) 6.90 9.84 8.80 63671 56666 89423 6 7 6

ILUT(5e-4) 3.81 4.12 5.56 75986 67540 112382 5 6 5

ILUT(1e-4) 1.68 1.55 1.57 103552 94585 167999 4 4 3

55

Table 1.7

two_hole_4

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 118292 118292 118292 208 208 208

ILU(0) 118292 118292 118292 49 43 53

Jacobi 10597 10597 10597 179 179 179

ILUT(5e-1) 10598 10598 10598 179 179 179

ILUT(4e-1) 10626 10626 10626 179 179 180

ILUT(3e-1) 10741 10746 10741 176 175 176

ILUT(2e-1) 13816 13978 13770 175 175 177

ILUT(1e-1) 72393 75864 72463 52 69 54

ILUT(9e-2) 72567 77342 72685 52 66 54

ILUT(8e-2) 72726 81864 72929 52 53 54

ILUT(7e-2) 73010 82410 74264 52 53 52

ILUT(6e-2) 75318 82700 81967 50 53 46

ILUT(5e-2) 85564 86339 95143 43 51 41

ILUT(4e-2) 99038 91476 108045 42 49 37

ILUT(3e-2) 105946 105013 111061 36 40 35

ILUT(2e-2) 114263 115975 119316 32 31 32

ILUT(1e-2) 147046 130612 170180 26 27 23

ILUT(5e-3) 185314 155892 215882 21 22 18

ILUT(1e-3) 359415 268643 436965 12 13 10

ILUT(5e-4) 467321 330159 565932 9 11 8

ILUT(1e-4) 774166 525546 971773 6 6 5

* Precondition numbers are not computed due to excessive memory requirements.

56

Table 1.8

four_hole_0

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 73.20 73.20 73.20 2783 2783 2783 36 36 36

ILU(0) 14.84 14.77 21.45 2783 2783 2783 9 9 11

Jacobi 64.95 64.95 64.95 315 315 315 36 36 36

ILUT(5e-1) 64.95 64.95 64.95 315 315 315 36 36 36

ILUT(4e-1) 65.59 65.59 66.33 319 319 319 36 36 36

ILUT(3e-1) 68.42 68.42 67.24 358 359 355 34 33 34

ILUT(2e-1) 84.56 89.39 83.27 774 776 773 29 29 29

ILUT(1e-1) 35.69 22.61 22.50 1802 1834 1828 12 13 11

ILUT(9e-2) 26.92 17.67 21.30 1835 1907 1869 11 12 11

ILUT(8e-2) 23.05 19.12 22.03 1884 1979 1926 11 11 11

ILUT(7e-2) 18.74 15.98 21.33 1942 2031 1988 10 11 10

ILUT(6e-2) 16.23 15.40 16.82 2016 2069 2084 9 11 10

ILUT(5e-2) 17.06 15.85 11.79 2163 2152 2262 9 10 9

ILUT(4e-2) 6.55 10.73 9.38 2339 2261 2425 8 9 8

ILUT(3e-2) 4.57 10.51 8.83 2506 2485 2606 7 7 7

ILUT(2e-2) 4.45 4.65 5.50 2690 2743 2810 6 6 6

ILUT(1e-2) 2.57 4.33 2.89 3074 3012 3444 5 5 5

ILUT(5e-3) 1.88 1.64 1.90 3372 3255 3907 4 4 4

ILUT(1e-3) 1.16 1.10 1.17 4265 3794 5099 3 3 3

ILUT(5e-4) 1.07 1.06 1.07 4559 4014 5545 3 3 3

ILUT(1e-4) 1.01 1.01 1.01 5137 4487 6242 2 2 2

57

Table 1.9

four_hole_1

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 113.35 113.35 113.35 4647 4647 4647 46 46 46

ILU(0) 20.90 24.32 40.81 4647 4647 4647 10 10 13

Jacobi 102.92 102.92 102.92 493 493 493 44 44 44

ILUT(5e-1) 102.92 102.92 102.92 493 493 493 44 44 44

ILUT(4e-1) 102.92 102.92 103.27 499 499 499 45 45 45

ILUT(3e-1) 113.98 109.15 109.82 543 546 543 42 42 42

ILUT(2e-1) 128.62 125.10 126.60 1110 1123 1124 37 36 37

ILUT(1e-1) 31.38 32.38 50.21 2984 3001 2992 13 15 14

ILUT(9e-2) 32.05 34.55 53.11 3027 3128 3041 13 14 13

ILUT(8e-2) 32.10 31.50 40.38 3071 3232 3094 12 13 13

ILUT(7e-2) 32.32 23.85 34.40 3119 3310 3186 12 13 13

ILUT(6e-2) 29.81 23.70 30.70 3221 3369 3367 12 13 11

ILUT(5e-2) 19.88 21.78 20.34 3504 3475 3721 11 12 10

ILUT(4e-2) 13.94 20.41 14.90 3870 3695 4089 9 10 9

ILUT(3e-2) 9.25 16.27 12.79 4164 4107 4377 9 9 8

ILUT(2e-2) 6.49 6.60 11.92 4446 4597 4707 8 7 8

ILUT(1e-2) 3.22 3.60 3.42 5289 4994 6012 6 6 6

ILUT(5e-3) 2.37 2.26 2.60 6107 5538 6917 5 5 5

ILUT(1e-3) 1.30 1.22 1.26 8432 6737 9665 3 3 3

ILUT(5e-4) 1.14 1.13 1.08 9220 7256 10662 3 3 3

ILUT(1e-4) 1.02 1.02 1.03 10640 8521 12367 2 2 2

58

Table 1.10

four_hole_2

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 194.41 194.41 194.41 9029 9029 9029 57 57 57

ILU(0) 60.81 32.41 52.77 9029 9029 9029 14 12 16

Jacobi 202.06 202.06 202.06 897 897 897 54 54 54

ILUT(5e-1) 202.06 202.06 202.06 897 897 897 54 54 54

ILUT(4e-1) 202.06 202.06 202.06 904 904 904 54 54 54

ILUT(3e-1) 207.16 208.86 207.00 972 981 970 53 53 53

ILUT(2e-1) 221.65 220.91 221.69 2185 2161 2187 46 46 47

ILUT(1e-1) 89.18 71.58 99.84 5570 5669 5568 17 20 18

ILUT(9e-2) 78.53 49.88 97.87 5639 5836 5668 16 18 17

ILUT(8e-2) 78.98 59.91 97.75 5746 6058 5814 15 17 15

ILUT(7e-2) 59.15 60.33 73.36 5893 6207 6005 14 16 14

ILUT(6e-2) 45.78 61.35 56.99 6153 6379 6408 13 16 13

ILUT(5e-2) 50.82 45.81 40.75 6680 6621 7077 13 15 12

ILUT(4e-2) 21.53 36.67 30.76 7385 7029 7866 12 13 11

ILUT(3e-2) 17.31 21.34 30.08 7983 7821 8482 10 11 10

ILUT(2e-2) 12.94 15.22 19.71 8652 8791 9331 9 9 9

ILUT(1e-2) 7.31 8.17 11.19 10637 9794 12074 7 7 7

ILUT(5e-3) 4.02 4.41 4.64 12620 11077 14629 6 6 6

ILUT(1e-3) 1.59 1.59 1.76 18185 14407 21965 4 4 4

ILUT(5e-4) 1.34 1.26 1.28 20433 15893 25008 3 3 3

ILUT(1e-4) 1.04 1.05 1.04 24465 19395 30526 2 3 2

59

Table 1.11

four_hole_3

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 425.08 425.08 425.08 23534 23534 23534 80 80 80

ILU(0) 183.36 92.67 159.24 23534 23534 23534 20 17 21

Jacobi 423.34 423.34 423.34 2217 2217 2217 68 68 68

ILUT(5e-1) 423.34 423.34 423.34 2217 2217 2217 68 68 68

ILUT(4e-1) 424.15 424.15 425.35 2244 2244 2244 69 68 69

ILUT(3e-1) 435.30 435.00 447.75 2328 2331 2325 67 67 67

ILUT(2e-1) 576.67 510.01 535.95 3960 4005 3948 68 66 68

ILUT(1e-1) 245.42 126.05 262.47 14476 14885 14475 22 27 23

ILUT(9e-2) 251.67 142.01 263.60 14619 15318 14619 22 26 22

ILUT(8e-2) 252.37 129.75 263.95 14765 16000 14815 21 24 21

ILUT(7e-2) 228.25 118.87 230.96 15042 16413 15205 21 21 20

ILUT(6e-2) 175.59 90.86 205.27 15620 16654 16520 19 20 18

ILUT(5e-2) 199.08 118.47 166.86 17152 17370 18642 19 19 17

ILUT(4e-2) 117.18 104.19 127.00 19450 18424 21092 15 18 14

ILUT(3e-2) 52.16 73.41 87.25 20934 20850 22023 14 15 14

ILUT(2e-2) 45.27 47.72 54.10 22882 22928 23815 12 12 13

ILUT(1e-2) 25.18 21.92 27.31 28441 25965 33054 10 10 9

ILUT(5e-3) 14.51 13.12 13.43 34369 29749 40951 8 8 8

ILUT(1e-3) 2.85 2.93 2.69 51191 43675 68955 5 5 5

ILUT(5e-4) 1.87 1.97 1.94 59226 50577 82055 4 4 4

ILUT(1e-4) 1.20 1.14 1.14 76381 65779 109674 3 3 3

60

Table 1.12

cross_dom_0

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 217.61 217.61 217.61 3035 3035 3035 50 50 50

ILU(0) 69.48 49.40 67.14 3035 3035 3035 12 10 13

Jacobi 188.68 188.68 188.68 313 313 313 46 46 46

ILUT(5e-1) 188.68 188.68 188.68 313 313 313 46 46 46

ILUT(4e-1) 188.68 188.68 188.68 313 313 313 46 46 46

ILUT(3e-1) 218.40 186.31 193.91 369 373 371 44 43 43

ILUT(2e-1) 245.05 226.06 255.63 767 765 767 36 35 36

ILUT(1e-1) 117.01 69.07 149.47 1854 1872 1884 15 17 15

ILUT(9e-2) 78.31 76.92 125.04 1909 1943 1937 13 16 14

ILUT(8e-2) 84.40 66.99 116.29 1965 2045 1988 13 15 14

ILUT(7e-2) 80.70 68.82 71.08 2015 2109 2051 12 13 13

ILUT(6e-2) 67.77 53.25 59.73 2095 2157 2180 11 13 13

ILUT(5e-2) 66.51 41.16 64.67 2255 2223 2376 10 12 11

ILUT(4e-2) 38.71 32.47 39.87 2438 2374 2605 10 11 10

ILUT(3e-2) 24.57 31.85 35.52 2624 2636 2837 9 10 9

ILUT(2e-2) 15.59 10.58 21.79 2841 2959 3144 8 7 8

ILUT(1e-2) 6.59 7.32 10.96 3381 3284 4056 7 6 7

ILUT(5e-3) 3.87 4.92 6.65 3863 3623 4900 5 5 5

ILUT(1e-3) 1.58 1.67 1.89 5132 4481 7768 4 4 4

ILUT(5e-4) 1.21 1.32 1.29 5643 4830 9050 3 3 3

ILUT(1e-4) 1.04 1.04 1.06 6437 5576 12085 2 3 2

61

Table 1.13

cross_dom_1

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 326.36 326.36 326.36 5571 5571 5571 61 61 61

ILU(0) 77.37 99.01 125.64 5571 5571 5571 14 12 15

Jacobi 306.65 306.65 306.65 547 547 547 53 53 53

ILUT(5e-1) 306.65 306.65 306.65 550 550 550 53 53 54

ILUT(4e-1) 306.65 306.65 306.65 551 551 551 53 53 54

ILUT(3e-1) 331.36 312.05 327.89 581 587 581 53 52 53

ILUT(2e-1) 379.60 370.76 388.61 1138 1157 1135 48 47 48

ILUT(1e-1) 119.54 101.90 148.78 3433 3487 3449 16 19 16

ILUT(9e-2) 101.34 102.94 144.86 3478 3618 3499 15 18 16

ILUT(8e-2) 94.81 83.37 149.46 3529 3772 3563 15 16 15

ILUT(7e-2) 94.70 73.59 126.77 3597 3855 3672 14 16 15

ILUT(6e-2) 59.89 69.86 161.71 3716 3926 3899 14 15 14

ILUT(5e-2) 47.60 62.75 113.60 4117 4050 4417 13 15 12

ILUT(4e-2) 53.43 70.62 61.25 4626 4295 4958 12 13 11

ILUT(3e-2) 26.05 63.77 50.40 5003 4804 5267 11 11 10

ILUT(2e-2) 20.43 24.85 35.86 5390 5417 5703 10 9 9

ILUT(1e-2) 15.30 11.27 14.21 6782 5960 7724 8 7 7

ILUT(5e-3) 8.04 9.25 10.84 8391 6758 9511 7 6 6

ILUT(1e-3) 2.04 2.26 2.32 13619 8896 15937 4 4 4

ILUT(5e-4) 1.58 1.55 1.47 16090 9943 19016 4 4 4

ILUT(1e-4) 1.11 1.12 1.08 21761 12102 26409 3 3 3

62

Table 1.14

cross_dom_2

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 590.76 590.76 590.76 10205 10205 10205 80 80 80

ILU(0) 292.00 144.32 271.82 10205 10205 10205 18 17 21

Jacobi 621.94 621.94 621.94 973 973 973 72 72 72

ILUT(5e-1) 621.94 621.94 621.94 973 973 973 72 72 72

ILUT(4e-1) 621.94 621.94 621.94 973 973 973 72 72 72

ILUT(3e-1) 582.43 581.79 581.33 1009 1012 1009 70 70 70

ILUT(2e-1) 683.70 710.46 689.71 2160 2157 2167 59 59 60

ILUT(1e-1) 436.63 225.40 353.52 6314 6388 6297 20 27 22

ILUT(9e-2) 467.78 244.65 396.78 6393 6590 6381 19 25 22

ILUT(8e-2) 333.59 233.03 386.31 6477 6850 6510 18 23 21

ILUT(7e-2) 272.94 172.34 282.35 6616 7037 6725 17 22 20

ILUT(6e-2) 222.04 161.58 318.11 6872 7200 7174 18 21 18

ILUT(5e-2) 266.25 196.77 304.10 7482 7436 8030 16 20 17

ILUT(4e-2) 120.76 137.44 158.40 8237 7874 9002 14 18 15

ILUT(3e-2) 85.80 179.24 157.15 8900 8957 9839 13 16 13

ILUT(2e-2) 61.83 67.85 105.79 9675 10139 10827 11 12 12

ILUT(1e-2) 27.03 32.18 32.47 11831 11443 14454 10 10 9

ILUT(5e-3) 16.42 20.64 35.30 13934 13313 18201 8 8 8

ILUT(1e-3) 3.35 4.54 4.84 20211 18537 31501 5 5 5

ILUT(5e-4) 2.28 2.37 2.88 23154 21026 38091 4 4 4

ILUT(1e-4) 1.15 1.27 1.25 29217 26772 56069 3 3 3

63

Table 1.15

cross_dom_3

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 1226.96 1226.96 1226.96 24100 24100 24100 117 117 117

ILU(0) 419.58 304.25 453.44 24100 24100 24100 25 22 29

Jacobi 1143.60 1143.60 1143.60 2230 2230 2230 97 97 97

ILUT(5e-1) 1143.60 1143.60 1143.60 2230 2230 2230 97 97 97

ILUT(4e-1) 1143.60 1143.60 1143.60 2243 2243 2243 97 96 97

ILUT(3e-1) 1214.53 1171.94 1214.53 2316 2322 2311 95 95 95

ILUT(2e-1) 1517.22 1584.08 1407.88 3423 3506 3422 95 93 95

ILUT(1e-1) 595.86 266.52 869.64 14845 15386 14867 27 36 29

ILUT(9e-2) 596.07 256.49 872.03 14926 15719 14961 27 33 28

ILUT(8e-2) 596.34 240.60 878.17 15030 16543 15112 27 28 28

ILUT(7e-2) 616.63 245.91 765.31 15172 16785 15399 27 27 28

ILUT(6e-2) 570.65 245.95 570.42 15617 16922 16882 26 27 25

ILUT(5e-2) 533.97 283.08 736.94 17202 17618 19403 24 26 23

ILUT(4e-2) 205.96 262.45 317.64 19714 18697 22031 21 25 20

ILUT(3e-2) 169.57 185.22 225.60 21235 21482 23070 19 20 18

ILUT(2e-2) 126.32 87.13 145.39 23010 23574 24795 17 16 17

ILUT(1e-2) 50.77 69.68 80.13 28701 26565 34858 14 14 13

ILUT(5e-3) 45.98 39.64 42.97 35285 31206 44128 11 11 11

ILUT(1e-3) 8.66 8.18 9.62 55292 48457 83225 7 7 6

ILUT(5e-4) 4.24 4.10 3.99 65713 57115 104197 5 5 5

ILUT(1e-4) 1.54 1.52 1.61 88395 77728 158287 4 4 4

64

Table 1.16

cross_dom_4

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 43341 43341 43341 146 146 146

ILU(0) 43341 43341 43341 31 29 37

Jacobi 3949 3949 3949 123 123 123

ILUT(5e-1) 3949 3949 3949 123 123 123

ILUT(4e-1) 3970 3970 3970 124 124 124

ILUT(3e-1) 4040 4043 4038 122 121 121

ILUT(2e-1) 5435 5533 5458 121 120 121

ILUT(1e-1) 26699 27872 26737 36 47 37

ILUT(9e-2) 26782 28448 26849 35 44 37

ILUT(8e-2) 26866 30035 26948 35 36 37

ILUT(7e-2) 27022 30230 27348 35 35 36

ILUT(6e-2) 27770 30406 29994 34 35 32

ILUT(5e-2) 30574 31553 34794 32 34 30

ILUT(4e-2) 35455 33396 39623 28 33 26

ILUT(3e-2) 38231 38568 41239 25 26 23

ILUT(2e-2) 41734 42753 44295 22 20 21

ILUT(1e-2) 51852 48005 62605 18 18 16

ILUT(5e-3) 64274 56954 79275 14 14 14

ILUT(1e-3) 105425 91653 155658 8 9 7

ILUT(5e-4) 128976 109712 200863 7 7 6

ILUT(1e-4) 186817 159421 330921 4 5 4


65

Table 1.17

two_dom_0

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 153.97 153.97 153.97 3332 3332 3332 49 49 49

ILU(0) 44.06 21.30 44.89 3332 3332 3332 11 10 13

Jacobi 141.03 141.03 141.03 351 351 351 42 42 42

ILUT(5e-1) 141.03 141.03 141.03 351 351 351 42 42 42

ILUT(4e-1) 141.03 141.03 141.03 354 354 354 42 42 42

ILUT(3e-1) 144.81 145.42 150.35 393 395 393 42 42 42

ILUT(2e-1) 210.49 179.76 207.36 723 735 728 37 35 37

ILUT(1e-1) 62.59 37.99 55.00 2151 2162 2151 13 15 14

ILUT(9e-2) 63.86 35.82 53.64 2194 2229 2176 13 15 14

ILUT(8e-2) 65.85 40.65 56.26 2218 2300 2202 12 14 14

ILUT(7e-2) 60.05 42.02 52.00 2267 2359 2280 12 13 13

ILUT(6e-2) 55.34 39.49 52.01 2334 2403 2407 11 13 12

ILUT(5e-2) 28.58 35.02 30.06 2515 2489 2650 11 12 11

ILUT(4e-2) 15.39 26.84 34.49 2749 2623 2967 9 11 10

ILUT(3e-2) 14.78 18.58 20.15 2957 2939 3162 9 9 9

ILUT(2e-2) 10.17 9.23 17.15 3159 3248 3426 8 7 8

ILUT(1e-2) 5.40 4.57 5.96 3752 3540 4462 7 6 7

ILUT(5e-3) 3.12 2.69 3.38 4247 3939 5401 6 5 5

ILUT(1e-3) 1.36 1.32 1.48 5524 4787 8270 4 4 4

ILUT(5e-4) 1.17 1.15 1.27 6022 5189 9448 3 3 3

ILUT(1e-4) 1.03 1.03 1.03 7008 6150 11223 2 3 2

66

Table 1.18

two_dom_1

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 342.46 342.46 342.46 7554 7554 7554 66 66 66

ILU(0) 96.96 60.66 101.78 7554 7554 7554 15 14 18

Jacobi 304.21 304.21 304.21 749 749 749 59 59 59

ILUT(5e-1) 304.21 304.21 304.21 749 749 749 59 59 59

ILUT(4e-1) 304.21 304.21 304.21 749 749 749 59 59 59

ILUT(3e-1) 314.05 307.39 330.18 770 771 768 59 59 59

ILUT(2e-1) 411.69 364.72 400.52 1593 1609 1623 52 50 48

ILUT(1e-1) 206.20 95.02 179.36 4751 4798 4755 18 22 20

ILUT(9e-2) 210.69 91.95 161.49 4803 4979 4815 17 21 19

ILUT(8e-2) 172.68 101.55 163.66 4877 5176 4891 17 19 18

ILUT(7e-2) 148.49 96.05 174.94 4980 5303 5045 16 18 17

ILUT(6e-2) 114.51 87.49 146.97 5156 5399 5392 16 17 16

ILUT(5e-2) 100.01 87.04 106.32 5624 5562 6041 14 17 15

ILUT(4e-2) 74.84 67.62 89.14 6204 5917 6692 13 15 13

ILUT(3e-2) 39.95 48.96 50.93 6658 6696 7227 12 13 12

ILUT(2e-2) 22.99 26.75 32.10 7196 7511 7866 10 10 11

ILUT(1e-2) 11.41 13.28 14.37 8645 8325 10498 9 9 8

ILUT(5e-3) 8.98 7.69 9.45 10037 9525 12703 7 7 7

ILUT(1e-3) 2.04 2.25 2.45 14037 12403 20975 5 5 5

ILUT(5e-4) 1.46 1.60 1.72 15661 13827 25036 4 4 4

ILUT(1e-4) 1.09 1.08 1.10 18913 16976 32415 3 3 3

67

Table 1.19

two_dom_2

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 593.38 593.38 593.38 16406 16406 16406 92 92 92

ILU(0) 187.79 132.20 214.51 16406 16406 16406 21 18 23

Jacobi 652.10 652.10 652.10 1559 1559 1559 83 83 83

ILUT(5e-1) 652.10 652.10 652.10 1559 1559 1559 83 83 83

ILUT(4e-1) 652.10 652.10 652.10 1570 1570 1570 84 83 84

ILUT(3e-1) 696.23 663.97 657.06 1643 1644 1641 76 80 73

ILUT(2e-1) 756.80 739.64 782.55 2867 2883 2855 78 77 79

ILUT(1e-1) 412.75 229.76 391.23 10106 10413 10154 24 29 25

ILUT(9e-2) 265.75 235.51 392.87 10185 10733 10272 23 27 24

ILUT(8e-2) 382.04 292.81 507.99 10281 11190 10384 22 25 23

ILUT(7e-2) 268.44 248.23 362.58 10441 11447 10684 22 23 23

ILUT(6e-2) 340.24 176.57 410.44 10914 11604 11553 21 22 21

ILUT(5e-2) 257.47 152.27 312.33 12143 11994 13076 19 21 18

ILUT(4e-2) 115.61 144.00 184.67 13716 12683 14774 18 20 16

ILUT(3e-2) 80.21 186.68 104.33 14792 14428 15551 16 17 15

ILUT(2e-2) 62.85 59.92 81.66 15869 16041 16853 14 13 14

ILUT(1e-2) 47.91 33.35 42.56 20236 17881 23442 12 11 11

ILUT(5e-3) 18.08 23.34 18.86 24939 20426 29236 10 9 9

ILUT(1e-3) 3.94 4.33 4.78 42262 29173 52184 6 6 6

ILUT(5e-4) 2.18 2.08 3.01 51578 33530 64084 5 5 5

ILUT(1e-4) 1.22 1.26 1.45 70920 43570 91577 3 4 3

68

Table 1.20

two_dom_3

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 911.00 911.00 911.00 26241 26241 26241 112 112 112

ILU(0) 325.57 202.54 450.86 26241 26241 26241 25 21 29

Jacobi 867.11 867.11 867.11 2448 2448 2448 94 94 94

ILUT(5e-1) 867.11 867.11 867.11 2448 2448 2448 94 94 94

ILUT(4e-1) 869.16 869.16 870.86 2475 2475 2475 94 94 94

ILUT(3e-1) 875.73 875.73 875.73 2517 2519 2517 93 93 94

ILUT(2e-1) 1091.74 1024.02 1171.24 3799 3839 3778 88 85 85

ILUT(1e-1) 429.13 295.32 1049.05 16188 16805 16243 28 35 29

ILUT(9e-2) 429.89 298.48 1049.32 16280 17190 16321 28 34 29

ILUT(8e-2) 430.54 316.66 1052.07 16405 18114 16476 28 28 29

ILUT(7e-2) 449.01 317.54 478.93 16567 18402 16888 28 27 28

ILUT(6e-2) 423.82 306.86 453.78 17086 18544 18363 28 26 25

ILUT(5e-2) 398.97 238.50 528.04 18758 19278 21098 25 25 23

ILUT(4e-2) 150.28 176.38 248.10 21492 20496 24045 22 24 20

ILUT(3e-2) 103.01 150.25 207.83 23193 23377 25053 20 20 18

ILUT(2e-2) 86.01 74.64 156.28 25148 25512 26964 18 16 17

ILUT(1e-2) 41.30 56.72 70.86 31399 28715 37720 15 14 13

ILUT(5e-3) 34.68 31.43 30.90 38386 33107 46723 12 12 11

ILUT(1e-3) 6.89 6.12 7.41 59467 50483 85977 7 7 6

ILUT(5e-4) 2.90 3.70 4.51 70458 58708 107980 6 6 5

ILUT(1e-4) 1.34 1.34 1.63 93539 78879 162160 4 4 4

69

Table 1.21

two_dom_4

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 47939 47939 47939 146 146 146

ILU(0) 47939 47939 47939 32 31 38

Jacobi 4387 4387 4387 134 134 134

ILUT(5e-1) 4387 4387 4387 134 134 134

ILUT(4e-1) 4391 4391 4391 134 134 134

ILUT(3e-1) 4445 4448 4445 129 128 129

ILUT(2e-1) 7448 7615 7490 133 130 134

ILUT(1e-1) 29400 30338 29353 38 51 40

ILUT(9e-2) 29611 31094 29578 37 47 40

ILUT(8e-2) 29845 32584 29910 37 40 39

ILUT(7e-2) 30226 33222 30706 38 39 37

ILUT(6e-2) 31334 33638 33434 35 37 34

ILUT(5e-2) 34409 34955 38115 33 36 31

ILUT(4e-2) 39078 37079 42940 29 34 27

ILUT(3e-2) 42133 42442 45355 26 28 25

ILUT(2e-2) 46221 47218 49582 22 22 23

ILUT(1e-2) 57299 53577 68559 18 18 17

ILUT(5e-3) 70066 63575 85997 14 15 14

ILUT(1e-3) 112126 100572 158884 9 9 8

ILUT(5e-4) 135470 119510 203475 7 8 7

ILUT(1e-4) 188559 169342 325391 5 5 4


70

Table 1.22

two_dom_5

Precondition

Number

nnz(A) or

nnz(L+U-I)

GMRES Iterations


A 111647 111647 111647 204 204 204

ILU(0) 111647 111647 111647 47 42 56

Jacobi 10074 10074 10074 167 167 167

ILUT(5e-1) 10074 10074 10074 167 167 167

ILUT(4e-1) 10129 10129 10129 168 167 168

ILUT(3e-1) 10252 10254 10251 165 165 165

ILUT(2e-1) 13205 13369 13184 190 190 191

ILUT(1e-1) 68486 71823 68572 53 71 55

ILUT(9e-2) 68713 73041 68820 52 67 55

ILUT(8e-2) 68920 77217 69180 52 54 55

ILUT(7e-2) 69411 77986 70594 52 52 54

ILUT(6e-2) 71148 78407 77950 51 51 48

ILUT(5e-2) 78269 82026 89993 47 50 44

ILUT(4e-2) 91319 87029 102790 41 48 37

ILUT(3e-2) 98250 99983 106316 37 38 35

ILUT(2e-2) 107307 109151 114544 34 30 33

ILUT(1e-2) 135091 124285 162505 28 26 24

ILUT(5e-3) 169731 146435 205679 21 21 19

ILUT(1e-3) 283448 248270 403349 13 13 11

ILUT(5e-4) 353911 302134 525275 10 11 9

ILUT(1e-4) 533325 461340 913711 6 7 6


71

Table 2 15 Ordering Schemes and Their Effects on GMRES Iterations Table 2.1

Test: A 01 02 03 04 05 06 07 08 09 10 11* 12* 13* 14*

two_hole_0

ILU(0) 13 14 13 14 14 20 11 22 14 15 13 19 11 21 17

ILUT(1.0e-1) 15 15 15 15 15 20 18 21 19 15 14 18 15 20 17

ILUT(7.5e-2) 14 14 13 14 14 16 15 18 17 14 13 16 14 17 15

ILUT(5.0e-2) 12 12 11 12 11 14 13 13 14 11 11 12 11 12 12

ILUT(2.5e-2) 9 9 8 9 8 9 9 9 10 9 8 9 9 8 9

ILUT(1.0e-2) 7 7 7 7 6 7 7 6 7 7 7 7 7 7 6

ILUT(7.5e-3) 7 6 6 6 6 6 6 6 6 6 6 6 6 6 6

ILUT(5.0e-3) 6 6 5 6 5 5 6 5 5 6 6 5 6 6 5

ILUT(2.5e-3) 5 5 4 5 4 4 4 4 4 5 4 5 5 5 4

ILUT(1.0e-3) 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

two_hole_2

ILU(0) 18 22 18 21 20 29 16 34 20 21 19 28 16 32 24

ILUT(1.0e-1) 21 23 21 22 22 28 26 33 29 22 21 26 22 30 22

ILUT(7.5e-2) 20 21 20 21 20 25 21 27 24 21 20 23 22 25 20

ILUT(5.0e-2) 17 16 16 16 16 20 19 20 22 16 15 18 20 17 19

ILUT(2.5e-2) 12 13 13 13 12 14 13 14 14 13 12 12 14 10 12

ILUT(1.0e-2) 10 9 10 10 9 9 10 8 10 10 9 9 10 9 9

ILUT(7.5e-3) 9 9 9 9 8 8 9 8 9 9 9 8 9 9 9

ILUT(5.0e-3) 8 7 7 8 7 7 8 7 8 8 8 7 8 7 7

ILUT(2.5e-3) 7 6 6 6 6 5 7 6 6 6 6 6 7 6 6

ILUT(1.0e-3) 5 5 5 5 5 4 5 4 5 5 5 5 5 4 5

two_hole_3

ILU(0) 25 32 29 31 30 45 24 47 26 27 24 37 23 46 34

ILUT(1.0e-1) 30 32 30 31 30 42 38 50 42 27 26 36 32 43 33

ILUT(7.5e-2) 29 30 30 29 27 39 30 40 34 27 26 35 30 35 28

ILUT(5.0e-2) 25 23 23 23 22 32 28 31 32 22 19 27 27 22 24

ILUT(2.5e-2) 19 18 18 18 18 21 19 26 19 19 18 17 20 14 18

ILUT(1.0e-2) 14 13 14 13 14 12 15 11 14 13 14 12 13 12 13

ILUT(7.5e-3) 13 12 13 11 13 11 13 11 12 12 13 12 13 12 12

ILUT(5.0e-3) 11 11 11 11 11 9 12 10 10 10 11 9 12 10 11

ILUT(2.5e-3) 10 9 10 9 9 7 10 7 9 8 9 8 10 8 9

ILUT(1.0e-3) 7 6 7 6 7 6 7 6 7 6 7 6 7 6 6

72

Table 2.2

Test: A 01 02 03 04 05 06 07 08 09 10 11* 12* 13* 14*

four_hole_1

ILU(0) 10 12 11 13 12 18 10 20 12 13 11 16 10 17 14

ILUT(1.0e-1) 13 13 13 14 13 18 15 19 17 14 13 16 13 18 14

ILUT(7.5e-2) 12 13 12 13 12 15 13 16 14 13 11 15 12 15 13

ILUT(5.0e-2) 11 10 10 10 10 12 12 13 12 10 9 10 11 11 10

ILUT(2.5e-2) 8 8 7 8 8 9 8 9 9 8 7 8 8 8 8

ILUT(1.0e-2) 6 6 5 6 5 6 6 6 6 6 6 6 6 6 6

ILUT(7.5e-3) 6 6 5 5 5 5 5 5 5 6 5 5 5 5 5

ILUT(5.0e-3) 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

ILUT(2.5e-3) 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

ILUT(1.0e-3) 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3

four_hole_2

ILU(0) 14 16 13 16 15 21 12 24 15 16 13 19 11 23 18

ILUT(1.0e-1) 17 18 17 18 17 23 20 24 21 18 16 20 17 22 18

ILUT(7.5e-2) 14 15 14 15 15 20 17 21 19 16 15 17 16 19 16

ILUT(5.0e-2) 13 12 12 12 12 14 15 15 16 12 12 13 13 13 13

ILUT(2.5e-2) 9 9 9 10 9 11 10 11 11 10 9 10 10 9 10

ILUT(1.0e-2) 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

ILUT(7.5e-3) 7 7 6 7 6 6 6 6 6 7 6 6 7 6 6

ILUT(5.0e-3) 6 6 6 6 6 6 6 5 5 6 6 6 6 6 5

ILUT(2.5e-3) 5 5 5 5 5 4 5 4 5 5 5 5 5 4 5

ILUT(1.0e-3) 4 4 4 4 4 4 4 4 4 4 4 3 4 3 4

four_hole_3

ILU(0) 20 22 19 22 21 29 17 35 21 22 20 27 17 33 26

ILUT(1.0e-1) 23 24 23 24 23 31 28 37 30 24 23 28 24 32 25

ILUT(7.5e-2) 22 21 21 22 20 27 23 28 26 22 21 23 23 25 22

ILUT(5.0e-2) 19 17 18 17 17 24 19 23 22 18 18 19 20 18 19

ILUT(2.5e-2) 14 14 13 14 13 15 14 18 15 14 13 14 13 12 14

ILUT(1.0e-2) 10 9 10 9 10 9 10 9 10 10 10 10 10 9 9

ILUT(7.5e-3) 9 9 9 9 9 9 9 8 9 9 9 9 9 9 9

ILUT(5.0e-3) 8 7 8 7 8 7 8 7 7 7 8 7 8 7 8

ILUT(2.5e-3) 7 6 7 6 7 6 7 5 6 6 7 6 7 6 6

ILUT(1.0e-3) 5 4 5 5 5 4 5 4 5 4 5 4 5 4 5

73

Table 2.3

Test: A 01 02 03 04 05 06 07 08 09 10 11* 12* 13* 14*

cross_dom_2

ILU(0) 18 23 19 21 21 30 17 33 20 21 19 28 16 31 25ILUT(1.0e-1) 20 24 22 22 22 29 27 32 29 22 22 26 23 30 24ILUT(7.5e-2) 18 21 20 20 20 25 22 27 25 19 19 23 21 24 21ILUT(5.0e-2) 16 16 17 17 16 20 20 20 22 17 16 16 19 17 18ILUT(2.5e-2) 12 12 13 13 12 13 13 12 15 13 12 12 13 11 12ILUT(1.0e-2) 10 9 9 9 9 9 10 9 10 10 9 9 10 9 9ILUT(7.5e-3) 9 8 9 9 8 8 9 8 9 9 8 8 9 8 8ILUT(5.0e-3) 8 8 7 8 7 7 8 7 7 7 8 7 8 7 7ILUT(2.5e-3) 7 6 6 6 6 6 7 6 6 6 6 6 6 6 6ILUT(1.0e-3) 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

cross_dom_3


cross_dom_4


74

Table 2.4

Test: A 01 02 03 04 05 06 07 08 09 10 11* 12* 13* 14*

two_dom_2


two_dom_4


two_dom_5


75

Table 3 Mesh Partitioning Tests for Parallel ILU Table 3.1

two_hole_0

Processor Nodes Interface Nodes

% Interface Nodes

Two Processors Processor 1 208 27 12.98%

Processor 2 207 31 14.98%

Total 415 58 13.98%

Four Processors

Processor 1 104 32 30.77%

Processor 2 104 27 25.96%

Processor 3 103 31 30.10%

Processor 4 104 29 27.88%

Total 415 119 28.67%

Sixteen Processors

Processor 1 26 6 23.08%

Processor 2 26 13 50.00%

Processor 3 26 18 69.23%

Processor 4 26 21 80.77%

Processor 5 26 20 76.92%

Processor 6 26 13 50.00%

Processor 7 26 22 84.62%

Processor 8 26 17 65.38%

Processor 9 26 17 65.38%

Processor 10 26 23 88.46%

Processor 11 25 15 60.00%

Processor 12 26 13 50.00%

Processor 13 26 22 84.62%

Processor 14 26 17 65.38%

Processor 15 26 18 69.23%

Processor 16 26 16 61.54%

Total 415 271 65.30%

76

Table 3.2

two_hole_2


% Interface Nodes


Processor 2 557 52 9.34%

Total 1115 96 8.61%

Four Processors

Processor 1 279 49 17.56%

Processor 2 279 45 16.13%

Processor 3 279 52 18.64%

Processor 4 278 46 16.55%

Total 1115 192 17.22%

Sixteen Processors

Processor 1 69 14 20.29%

Processor 2 70 31 44.29%

Processor 3 70 24 34.29%

Processor 4 70 39 55.71%

Processor 5 70 34 48.57%

Processor 6 69 28 40.58%

Processor 7 70 36 51.43%

Processor 8 70 27 38.57%

Processor 9 69 33 47.83%

Processor 10 70 36 51.43%

Processor 11 70 25 35.71%

Processor 12 70 34 48.57%

Processor 13 70 51 72.86%

Processor 14 69 34 49.28%

Processor 15 69 33 47.83%

Processor 16 70 21 30.00%

Total 1115 500 44.84%

77

Table 3.3

two_hole_4


% Interface Nodes


Processor 2 5299 147 2.77%

Total 10597 312 2.94%

Four Processors

Processor 1 2649 193 7.29%

Processor 2 2649 121 4.57%

Processor 3 2649 121 4.57%

Processor 4 2650 179 6.75%

Total 10597 614 5.79%

Sixteen Processors

Processor 1 663 44 6.64%

Processor 2 662 103 15.56%

Processor 3 662 95 14.35%

Processor 4 662 144 21.75%

Processor 5 662 93 14.05%

Processor 6 663 86 12.97%

Processor 7 662 123 18.58%

Processor 8 662 112 16.92%

Processor 9 662 112 16.92%

Processor 10 663 120 18.10%

Processor 11 662 73 11.03%

Processor 12 662 113 17.07%

Processor 13 663 203 30.62%

Processor 14 662 100 15.11%

Processor 15 663 115 17.35%

Processor 16 662 75 11.33%

Total 10597 1711 16.15%

78

References [1] W. Arnoldi. "The Principle of Minimized Iterations in the Solution of the Matrix

Eigenvalue Problem." Quarterly of Applied Mathematics. Vol. 9 (1951), pp. 17-29.

[2] S. Balay, W. Gropp, L. McInnes, and B. Smith. The Portable, Extensible Toolkit for Scientific Computing (PETSc). Version 2.2.1, Code and Documentation, 2004. Online. Available 4/18/2005: http://www.mcs.anl.gov/petsc.

[3] E. R. Barnes. “An Algorithm for Partitioning the Nodes of a Graph.” SIAM Journal of Algebraic Discrete Methods. Vol. 3 (1984), No. 4, pp. 541-550.

[4] S. C. Brenner and L. R. Scott. The Mathematical Theory of Finite Element Methods. Second Edition. Springer. New York 2002.

[5] R. L. Burden and J. D. Faires. Numerical Analysis. Seventh Edition. Brooks/Cole. Pacific Grove, CA 2001.

[6] X. C. Cai and D. E. Keyes. “Nonlinearly Preconditioned Inexact Newton Algorithms.” SIAM Journal on Scientific Computing. Vol. 24 (2002), No. 1, pp. 183-200.

[7] E. Cuthill and J. McKee. “Reducing the Bandwidth of Sparse Symmetric Matrices.” Naval Ship Research and Development Center. ACM/CSC-ER Proceedings of the 1969 24th National Conference, pp. 157-172.

[8] R. Diestel. Graph Theory. Graduate Texts in Mathematics. Springer. New York 2000.

[9] EPCC. “Unstructured Mesh Decomposition.” Online. Available 4/18/2005: http://www.epcc.ed.ac.uk/computing/training/document_archive/meshdecomp-slides/MeshDecomp-1.html.

[10] K. A. Gallivan, A. Sameh, and Z. Slatev. "A Parallel Hybrid Sparse Linear System Solver." Computing Systems in Engineering. Vol. 1 (1990), pp. 183-195.

[11] A. George. "Computer Implementation of the Finite Element Method." Technical Report STAN-CS-208, Stanford University, Stanford, CA, 1971.

[12] G. Havas and C. Ramsay. "Breadth-First Search and the Andrews-Curtis Conjecture." International Journal of Algebra and Computation. Vol. 13 (2003), No. 1, pp. 61-68.

[13] D. Hysom and A. Pothen. “Level-based Incomplete LU Factorization: Graph Model and Algorithms.” Submitted to SIAM Journal on Matrix Analysis and Applications. November 2002.

[14] M. Jones and P. Plassmann. "Scalable Iterative Solution of Sparse Linear Systems."

79

Parallel Computing. Vol. 20 (1994), pp. 753-773.

[15] G. Karypis and V. Kumar. “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.” SIAM Journal on Scientific Computing. Vol. 20 (1998), No. 1, pp. 359-392.

[16] G. Karypis and V. Kumar. “Parallel Threshold-based ILU Factorization.” University of Minnesota, Department of Computer Science / Army HPC Research Center, Technical Report #96-061. 1998.

[17] D. K. Kaushik and D. E. Keyes. “Efficient Parallelization of an Unstructured Grid Solver: A Memory-Centric Approach.” Proceedings of the International Conference on Parallel CFD (Istanbul, June 1999). (U. Gulcat & D. R. Emerson, eds.), Istanbul Technical University Press, pp. 55-67.

[18] C. T. Kelly. Iterative Methods for Linear and Nonlinear Equations. SIAM Frontiers in Applied Mathematics. Philadelphia 1995.

[19] M. Luby. "A Simple Parallel Algorithm for the Maximal Independent Set Problem." SIAM Journal on Computing. Vol. 15 (1986), pp. 1036-1053.

[20] J. Mathews. "Jacobi and Gauss-Seidel Iteration." Department of Mathematics of California State University, Fullerton. Online. Available 4/21/2005: http://math.fullerton.edu/mathews/n2003/GaussSeidelMod.html

[21] MathWorks, Inc., The. MATLAB. Version 7.0 (R14). Code and Documentation, 2004.

[22] J. Meijerink and H. Van Der Vorst. "An Iterative Solution Method for Linear Systems of Which the Coefficient Matrix is a Symmetric M-matrix." Mathematics of Computation. Vol. 31 (1997), pp. 148-162.

[23] G. L. Miller, S. H. Teng, W. Thurston, and S. A. Vavasis. “Automatic Mesh Partitioning.” Sparse Matrix Computations: Graph Theory Issues and Algorithms. Springer-Verlag, New York 1993, pp. 57-84.

[24] B. Nour-Omid, A. Raefsky, and G. Lyzenga. “Solving Finite Element Equations on Concurrent Computers.” In American Society of Mechanical Engineers. 1986, pp. 291-307.

[25] L. Oliker, X. Li, P. Husbands, and R. Biswas. “Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations.” SIAM Reivew. Vol 44 (2002), No. 3, pp. 373-393.

[26] Pothen and C. J. Fan. “Computing the Block Triangular Form of a Sparse Matrix.” ACM Transactions on Mathematical Software, Vol. 16 (1990), No. 4, pp 303-324.

80

[27] Y. Saad. "ILUT: A Dual Threshold Incomplete ILU Factorization." Numerical Linear Algebra with Applications. Vol. 1 (1994), pp. 387-402.

[28] Y. Saad and M. Schultz. "GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems." SIAM Journal on Scientific and Statistical Computing. Vol. 7 (1986), pp. 856-869.

[29] G. Strang and G. Fix. An Analysis of the Finite Element Method. Prentice Hall 1973.

[30] E. Weisstein, et al. "Jacobi Method." MathWorld - A Wolfram Web Resource. Online. Available 4/19/2005: http://mathworld.wolfram.com/JacobiMethod.html

[31] E. Weisstein, et al. "Successive Overrelaxation Method." MathWorld - A Wolfram Web Resource. Online. Available 4/19/2005: http://mathworld.wolfram.com/SuccessiveOverrelaxationMethod.html

[32] D. P. Young, R. G. Melvin, F. T. Johnson, J. E. Bussoletti, L. B. Wigton, and S. S. Samant. "Application of Sparse Matrix Solvers as Effective Preconditioners." SIAM Journal on Scientific and Statistical Computing. Vol. 10 (1989), pp. 1186-1199.

[33] J. Zhang. “A Multilevel Dual Reordering Strategy for Robust Incomplete LU Factorization of Indefinite Matrices.” SIAM Journal on Matrix Analysis and Applications. Vol. 22 (2001), No. 3, pp. 925-947.

81

Vita

Peter S. Hou was born in Taipei, Taiwan on December 28th, 1981. He grew up like any ordinary kid who loved watching cartoons and disliked math. However, perhaps fate spoke, he was chosen to be the math teacher’s assistant for five consecutive years. He moved to the United States in 1997 and attended Langley High School in McLean, Virginia. His interest for math grew as he participated in many math-related activities and brought home numerous awards. In 2000, he was accepted into Virginia Tech to study Computer Science. One day he had a strange feeling about a life without taking any more math classes, so he started to double-major in Applied and Discrete Mathematics.

Outside of classes, he enjoyed the challenges from the Putnam Math Competition, Virginia

Tech Regional Math Contest, and Mathematical Contest in Modeling. During school, he tutored part-time at the Math Emporium. Between semesters, he worked for ProfitScience, LLC in McLean, Virginia as a software developer. He joined the Tae Kwon Do Club and the Math Club, quickly became one of the leaders and has remained active to this day. In addition, he served as a webmaster for the Class Program, helped prepare for the Ring Dance, and participated in a number of other community activities.

In May 2004, one year after he was given the honor of Phi Beta Kappa membership, he

earned his two B.S. degrees in Computer Science and Mathematics, as well as a black belt in Chung Do Kwan Tae Kwon Do. Right afterwards, he continued his studies in Virginia Tech as a Master’s student in Mathematics. Under the Five-Year Bachelor-Master program and guidance from Dr. Jeff Borggaard, he will be completing his degree in May 2005. Following his graduation, he is set to join Mercer Human Resources Consulting in New York City as an actuary.

Documents

10.1.1.99