
A Tutorial on

Elliptic PDE Solvers

and their Parallelization

Craig C. Douglas Gundolf Haase Ulrich Langer


Preface

Computing has become a third branch of research, joining the traditional areas of theory and laboratory experimentation and verification. Due to the expense and complexity of actually performing experiments in many situations, simulation must be done first. In addition, computation in some fields leads to new directions in theory. The three areas have formed a symbiotic relationship that is now known as computational sciences. In the best situations, all three areas seamlessly interact.

The “computational” supplement to many classical and new sciences is due to the wide use during the last decade of computational methods in almost all sciences, including life sciences and business sciences.

Most of the models behind these computer simulations are based on partial differential equations (PDE). The first step towards computer simulation consists in the discretization of the PDE model. This usually results in a very large-scale linear system of algebraic equations or even in a sequence of such systems. The fast solution of these systems of linear algebraic equations is crucial for the overall efficiency of the computer simulation. The growing complexity of the models requires more and more the use of parallel computers. Besides expensive parallel computers, clusters of personal computers or workstations are now very popular as an inexpensive alternative parallel hardware configuration for the computer simulation of complex problems (even as home parallel systems).

This tutorial serves as a first introduction to the basic concepts of solving partial differential equations using parallel numerical methods. The ability to understand, develop, and implement parallel PDE solvers requires not only some basic knowledge of PDEs, discretization methods, and solution techniques, but also some knowledge about parallel computers, parallel programming, and the run-time behaviour of parallel algorithms. Our tutorial provides this knowledge in just 8 short chapters. The authors kept the examples simple so that the parallelization strategies are not dominated by technical details. The practical course for the tutorial can be downloaded from the internet (see Chapter 1 for the internet addresses).

The tutorial is intended for advanced undergraduates and graduate students in computational sciences and engineering. However, our book can be helpful for many people who are going to use PDE-based parallel computer simulations in their profession. It is important to know at least something about the possible errors and bottlenecks in parallel scientific computing.

We are indebted to the reviewers who contributed a lot to the improvement of our manuscript. Especially, we would like to thank Michael J. Holst and David E. Keyes for many helpful hints and advice. We would like to acknowledge the Austrian Science Fund for supporting our cooperation within the Special Research Program on “Numerical and Symbolic Scientific Computing” under the grant SFB F013 and the NSF for grants DMS-9707040, CCR-9902022, CCR-9988165, and ACR-9721388.

Finally, we would like to thank our families, friends, and Gassners’ Most & Jausn (http://www.mostundjausn.at), who put up with us, nourished us, and allowed us to write this book.

Greenwich, Connecticut (USA)    Craig C. Douglas
Linz (Austria)    Gundolf Haase and Ulrich Langer

September, 2002


Contents

Preface

1 Introduction

2 A simple example
  2.1 The Poisson equation and its finite difference discretization
  2.2 Sequential solving
    2.2.1 Direct methods
    2.2.2 Iterative methods
  2.3 Parallel solving by means of domain decomposition
  2.4 Some other discretization methods

3 Introduction to parallelism
  3.1 Classifications of parallel computers
    3.1.1 Classification by Flynn
    3.1.2 Classification by memory access
    3.1.3 Communication topologies
  3.2 Specialties of parallel algorithms
    3.2.1 Synchronization
    3.2.2 Message Passing
    3.2.3 Deadlock
    3.2.4 Data coherency
    3.2.5 Parallel extensions of operating systems and programming languages
  3.3 Basic global operations
    3.3.1 Send and Recv
    3.3.2 Exchange
    3.3.3 Gather-Scatter operations
    3.3.4 Broadcast
    3.3.5 Reduce- and Reduce-All operations
    3.3.6 Synchronization by barriers
    3.3.7 Some remarks on portability
  3.4 Performance evaluation of parallel algorithms
    3.4.1 Speedup and Scaleup
    3.4.2 Efficiency
    3.4.3 Communication expenditure

4 Galerkin Finite Element Discretization of Elliptic PDE
  4.1 Variational Formulation of Elliptic BVPs
  4.2 Galerkin Finite-Element-Discretization
    4.2.1 The Galerkin Method
    4.2.2 The Simplest Finite Element Schemes
    4.2.3 Analysis of the Galerkin Finite Element Method
    4.2.4 Iterative Solution of the Galerkin System

5 Basic numerical routines in parallel
  5.1 Storage of sparse matrices
  5.2 Domain decomposition by non-overlapping elements
  5.3 Vector-vector operations
  5.4 Matrix-vector operations

6 Classical solvers
  6.1 Direct methods
    6.1.1 LU factorization
    6.1.2 ILU factorization
  6.2 Smoothers
    6.2.1 ω-Jacobi iteration
    6.2.2 Gauss-Seidel iteration
    6.2.3 ADI methods
  6.3 Roughers
    6.3.1 CG method
    6.3.2 GMRES solver
    6.3.3 BICGSTAB solver
  6.4 Preconditioners

7 Multigrid methods
  7.1 Multigrid methods
  7.2 The multigrid algorithm
    7.2.1 Sequential algorithm
    7.2.2 Parallel components of multigrid
    7.2.3 Parallel algorithm

8 Problems not addressed in this book

A Abbreviations and Notations

B Internet addresses

Bibliography


List of Figures

2.1 Rectangular domain with equidistant grid points.
2.2 Structure of matrix and load vector.
2.3 Two subdomains with five point stencil discretization.

3.1 Types of parallel processes.
3.2 Undefined status.
3.3 Three possible outcomes.
3.4 Blocking communication.
3.5 Non-blocking communication.
3.6 Deadlock in blocking communication.
3.7 Non-blocking Exchange.
3.8 Blocking Exchange.
3.9 Scatter and Gather.
3.10 Broadcast operation.
3.11 Speedup with respect to problem size.
3.12 System times for Gaussian elimination, CG, PCG.

4.1 Computational domain Ω consisting of two different materials.
4.2 Galerkin Scheme.
4.3 Courant's basis function.
4.4 Examples of inadmissible meshes.
4.5 Change in the type of the boundary conditions (BC).
4.6 Capturing interfaces by triangulation.
4.7 Triangle with an obtuse angle.
4.8 Local numbering of the nodes in a triangle.
4.9 Triangulation of the model problem CHIP.
4.10 Mapping between an arbitrary element and the master element.
4.11 Definition of the FE basis function ϕ(8).
4.12 One element matrix is assembled into the global matrix.
4.13 Two element matrices are added into the global matrix.
4.14 Mapping of a triangular edge onto the unit interval [0, 1].

5.1 Non-overlapping domain decomposition.
5.2 Non-overlapping elements.
5.3 Non-overlapping elements with a revised discretization.
5.4 4 subdomains in local numbering with local discretization nx = ny = 4 and global discretization Nx = Ny = 8.

6.1 Illustration of the rank-r modification.
6.2 Scattered distribution of a matrix.
6.3 Data flow of Jacobi iteration.
6.4 Data flow of Gauss-Seidel forward iteration.
6.5 Data flow in Red-Black-Gauss-Seidel forward iteration.
6.6 Domain decomposition of unit square.
6.7 ADI damping factor parameter space.
6.8 Decomposition in 4 strips, • denotes an Edge node.

7.1 V and W cycles.
7.2 Non-overlapping element distribution on two nested grids.


List of Algorithms

2.1 A simple parallel DD solver (P=2 for our example).
6.1 Rank-r modification of LU factorization - sequential.
6.2 Block variant of rank-r modification.
6.3 Block wise rank-r modification of LU factorization - parallel.
6.4 Sequential block ILU factorization.
6.5 Sequential block wise substitution steps for LUw = r.
6.6 Parallelized LU factorization.
6.7 Parallel forward and backward substitution step for LUw = r.
6.8 Parallel IUL factorization.
6.9 Parallel backward and forward substitution step for ULw = r.
6.10 Sequential Jacobi iteration.
6.11 Parallel Jacobi iteration: Jacobi(K, u0, f).
6.12 Sequential Gauss-Seidel forward iteration.
6.13 Update step in Red-Black Gauss-Seidel forward iteration.
6.14 Parallel Gauss-Seidel-ω-Jacobi forward iteration.
6.15 Sequential ADI in two dimensions.
6.16 Parallel ADI in two dimensions: first try.
6.17 Parallel ADI in two dimensions: final version.
6.18 Gauss-Seidel iteration for solving $T_y u^{k+1} = g$.
6.19 Sequential CG with preconditioning.
6.20 Parallelized CG.
6.21 Sequential GMRES with preconditioning.
6.22 Parallel GMRES - no restart.
6.23 Sequential BICGSTAB.
6.24 Parallel BICGSTAB.
7.1 Sequential multigrid: mgm(Kq, uq, fq, q).
7.2 Parallel multigrid: pmgm(Kq, uq, fq, q).


Chapter 1

Introduction

A couple months in the laboratory can save a couple hours in the library.
—Frank H. Westheimer’s Discovery

The computer simulation of physical phenomena and technical processes has become a powerful tool in developing new products and technologies. The “computational” supplement to many classical and new sciences such as computational engineering, computational physics, computational chemistry, computational biology, computational medicine, computational finance, etc. is due to the wide use during the last decade of computational science methods in almost all sciences including life sciences and business sciences. Most of the models behind these computer simulations are based on partial differential equations (PDE). After this modeling phase, which can be very complex and time consuming, the first step towards computer simulation consists in the discretization of the PDE model. This usually results in a large-scale linear system of algebraic equations or even in a sequence of such systems, e.g., in non-linear and/or time-dependent problems. In the latter case, the fast solution of these systems of linear algebraic equations is crucial for the overall efficiency of the computer simulation. The growing complexity of the models requires more and more the use of parallel computers. Clusters of workstations or even personal computers are now very popular as basic hardware configuration for the computer simulation of complex problems, even as home parallel systems. Soon, with the forthcoming multiprocessor-on-a-chip systems, multiprocessor laptops will make small-scale parallel processing commonplace, even on airplanes.

The right handling of computer simulations requires interdisciplinary abilities and knowledge in different areas of applied mathematics, computer sciences, and of course in the concrete field the application belongs to. More precisely, besides some understanding of modeling, it requires at least basic knowledge of PDEs, discretization methods, numerical linear algebra, and last but not least, computer sciences. The aim of this tutorial is to provide such basic knowledge not only to students in


applied mathematics and computer sciences but also to students in different computational sciences or to people who are going to use computational methods in their own research or applications.

Chapter 2 provides the very first ideas about what is going on in a computer simulation based on PDEs. We look at the Poisson equation, which is certainly one of the most important PDEs, not only as the most widely used model problem for second-order elliptic PDEs, but also from a practical point of view. Fast Poisson solvers are needed in many applications such as heat conduction (see also Example 4.3), electrical field computation (potential equation) [108], and pressure correction in CFD simulations [36]. The simplest discretization method for the Poisson equation is certainly the classical finite difference method (FDM). Replacing the computational domain by some uniformly spaced grid, substituting second-order derivatives of the Laplace operator by second-order differences, and taking into account the boundary conditions, we arrive at a large-scale, sparse system of algebraic equations. As simple as the FDM is, many problems and difficulties arise in the case of more complicated computational domains, differential operators, and boundary conditions. These difficulties can be overcome by the finite element method, which is discussed in Chapter 4 at length. Nevertheless, due to its simple structure, the FDM has some advantages in some applications. The efficiency of the solution process is then mainly determined by the efficiency of the method for solving the corresponding linear system of algebraic equations, which is nothing more than the discrete representation of our PDE in the computer. In principle, there are two different classes of solvers: direct methods and iterative solvers. We will examine some representatives from each class in their sequential version in order to see their fortes and weaknesses with respect to our systems of finite difference equations. Finally, we move from a sequential to a parallel algorithm using elementary domain decomposition techniques.

Before using a parallel solver, the reader should be familiar with some basics in parallel computing. Chapter 3 starts with a rough but systematic view of the rapidly developing parallel computer hardware and of how its characteristics have to be taken into account in program and algorithm design. The presentation focuses mainly on distributed memory computers as available in clusters of workstations and PCs. Only a few basic communication routines are introduced in a general way, and they are sufficient to implement parallel solvers for PDEs. The interested reader can try out first parallel programs just by using the standard MPI library and can start to create a personal library of routines necessary in the exercises of the following chapters. The methodology of the parallel approach in this tutorial, together with the very portable MPI library, allows the development and running of these parallel programs on a simple PC or workstation because several parallel processes can run on one CPU. As soon as the parallel program is free from obvious errors, the codes can be recompiled and run on a real parallel computer, i.e., an expensive supercomputer or a much cheaper cluster of PCs (such as the Beowulf project, see http://beowulf.org). If a programmer writes a parallel program, then it will not take long before someone else claims that a newer code is superior. Therefore, a few points on code efficiency in the context of parallel numerical software are given to assist the reader.
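To give a flavor of what such a first parallel program looks like, here is a minimal MPI example (a sketch of our own, not part of the official practical course): it only starts a number of processes that identify themselves, yet it already contains the complete MPI start-up and shut-down pattern used in all later exercises. It can be compiled with any MPI installation and run with several processes on a single workstation, e.g., with mpicc followed by mpirun -np 4.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* start the parallel environment      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* my process number: 0, ..., size-1   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes           */

    printf("Process %d of %d is alive\n", rank, size);

    MPI_Finalize();                          /* shut down the parallel environment  */
    return 0;
}
```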


The finite element method (FEM) is nowadays certainly the most powerful discretization technique on earth for elliptic BVPs. Due to its great flexibility the FEM is more widely used in applications than the FDM considered in our introductory chapter for the Poisson equation. Chapter 4 gives a brief introduction to the basic mathematical knowledge that is necessary for the implementation of a simple finite element code from a mathematical viewpoint. It is not so important to know a lot about Sobolev spaces, but it is important to be able to derive the variational, or weak, formulation of the BVP that we are going to solve from its classical statement. The basic mathematical tool for this is the formula of integration by parts. In contrast to the FDM, where we derive the finite difference equations directly from the PDE by replacing the derivatives by differences, the FEM starts with the variational formulation of the BVP and looks for an approximation to the exact solution of the BVP in the form of a linear combination of a finite number of basis functions with local support. The second part of Chapter 4 provides a precise description of the procedure for deriving the finite element equations in the case of linear triangular finite elements. This procedure can easily be generalized to other types of finite elements. After reading this part, the reader should be able to write a first finite element code that generates the finite element equations approximating the BVP. The final and usually the most time-consuming step in finite element analysis consists of the solution of the finite element equations, which is nothing more than a large, sparse system of algebraic equations. The parallelization of the solver of the finite element equations is the most important part in the parallelization of the entire finite element code. The parallelization of the other parts of a finite element code is more or less straightforward. The solution methods and their parallelization are mainly discussed in Chapters 5 and 6. We already know from Chapter 2 that iterative solvers are preferred for really large-scale systems, especially if we are thinking of their parallel solution. That is the reason why we focus our attention on iterative solvers. Some theoretical aspects which are directly connected with the iterative solution of the finite element equations are already considered at the end of Chapter 4. There is also a subsection that briefly discusses the analysis of the finite element method. This subsection can be skipped by the reader who is only interested in the practical aspects of the FEM.

Before presenting the parallel numerical algorithms for solving the discrete problems, Chapter 5 investigates basic numerical features needed in all of the appropriate solution algorithms with respect to their adaptation to parallel computers. Here, we concentrate on a data decomposition that is naturally based on the FEM discretization. We see that numerical primitives such as the inner product and the matrix-vector multiplication are amazingly easy to parallelize. The exercises refer to Chapter 3 and they will lead the reader directly to a personal parallel implementation of data manipulations on parallel computers.
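As a small illustration of this claim, consider the inner product. The sketch below is our own and assumes the simplest possible data distribution, namely that every vector entry is owned by exactly one process (the FEM-based decomposition used later in the book is slightly more subtle because of nodes shared between subdomains). One local loop plus a single collective operation already gives the global result on all processes.

```c
#include <mpi.h>

/* Global inner product of two distributed vectors x and y.
 * Assumption for this sketch: every entry is owned by exactly one process,
 * so the global product is just the sum of the local partial products.    */
double parallel_dot(const double *x, const double *y, int n_local)
{
    double s_local = 0.0, s_global = 0.0;

    for (int i = 0; i < n_local; i++)
        s_local += x[i] * y[i];

    /* one collective communication sums the local results on all processes */
    MPI_Allreduce(&s_local, &s_global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return s_global;
}
```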

Classical direct and iterative solution methods for solving large systems of algebraic equations are analyzed for their parallelization properties in Chapter 6. Here, the detailed investigation of basic numerical routines from the previous chapter simplifies the parallelization significantly, especially for iterative solvers. Again, the reader can implement a personal parallel solver guided by the exercises.


Classical solution methods suffer from the fact that the solution time rises much faster than the number of unknowns in the system of equations, e.g., it takes the solver 100-1000 times longer when the number of unknowns is increased by a factor of 10. Chapter 7 briefly introduces a multigrid solver that is an optimal solver in the sense that it is only 10 times more expensive with respect to both memory requirements and solution time for 10 times more unknowns. Thanks to the previous chapters, the analysis of the parallelization properties is rather simple for this multigrid algorithm. If the reader has followed the practical course until this stage, then it will be easy to implement a first parallel multigrid algorithm.

Chapter 8 addresses some problems which are not discussed in the book, but which are closely related to the topics of this tutorial. The references given there should help the reader to generalize the parallelization techniques presented in this tutorial to other classes of PDE problems and applications. We also refer the reader to [45], [105], and [53] for further studies in parallel scientific computing.

Finally, Appendices A and B provide a list of notations and a guide through the internet, respectively.

The practical course for the tutorial can be downloaded from

http://www.numa.uni-linz.ac.at/books#Douglas-Haase-Langer

and

http://www.mgnet.org/books/Douglas-Haase-Langer.

The theory cannot replace the practice, but the theory can help very much to understand what is going on in the computations. Do not blindly believe the computational results without knowing something about possible errors such as modeling errors, discretization errors, round-off errors, iteration errors, and last but not least programming errors.


Chapter 2

A simple example

The things of this world cannot be made known without a knowledge of mathematics.
—Roger Bacon (1214-1294)

2.1 The Poisson equation and its finite difference discretization

Let us start with a simple, but at the same time very important example, namely the Dirichlet boundary value problem for the Poisson equation in the rectangular domain Ω = (0, 2) × (0, 1) with the boundary Γ = ∂Ω: Given some real function f in Ω and some real function g on Γ, find a real function $u : \overline{\Omega} \to \mathbb{R}$ defined on the closure $\overline{\Omega} := \Omega \cup \Gamma$ such that

$$-\Delta u(x,y) := -\frac{\partial^2 u}{\partial x^2}(x,y) - \frac{\partial^2 u}{\partial y^2}(x,y) = f(x,y) \quad \forall\,(x,y)\in\Omega,$$
$$u(x,y) = g(x,y) \quad \forall\,(x,y)\in\Gamma. \tag{2.1}$$

The given functions as well as the solution are supposed to be sufficiently smooth. The differential operator ∆ is called the Laplace operator. For simplicity, we consider only homogeneous Dirichlet boundary conditions, i.e., g = 0 on Γ.

The Poisson equation (2.1) is certainly the most prominent representative of second-order elliptic partial differential equations (PDE), not only from a practical point of view, but also as the most frequently used model problem for testing numerical algorithms. Indeed, the Poisson equation is the model problem for elliptic PDE, much like the heat and wave equations are for parabolic and hyperbolic PDE.

The Poisson equation can be solved analytically in special cases, as in rectangular domains with just the right boundary conditions, e.g., with homogeneous Dirichlet boundary conditions as imposed above. Due to the simple structure of the


differential operator and the simple domain in (2.1), a considerable body of analysis is known that can be used to derive or verify solution methods for Poisson's equation and more complicated ones.

If we split the intervals in both the x and y directions into $N_x$ and $N_y$ subintervals of length h each (i.e., $N_x = 2N_y$), then we obtain a grid (or mesh) of nodes like the one presented in Figure 2.1.


Figure 2.1: Rectangular domain with equidistant grid points.

We define the set of subscripts for all nodes by $\overline{\omega}_h := \{(i,j) : i = 0,\dots,N_x,\ j = 0,\dots,N_y\}$, the set of subscripts belonging to interior nodes by $\omega_h := \{(i,j) : i = 1,\dots,N_x-1,\ j = 1,\dots,N_y-1\}$, and the corresponding set of boundary nodes by $\gamma_h := \overline{\omega}_h \setminus \omega_h$. Furthermore, we set $f_{i,j} = f(x_i, y_j)$, and we denote the approximate values to the solution $u(x_i, y_j)$ of (2.1) at the grid points $(x_i, y_j) := (ih, jh)$ by the values $u_{i,j} = u_h(x_i, y_j)$ of some grid function $u_h : \overline{\omega}_h \to \mathbb{R}$. Here and in the following we associate the set of indices with the corresponding set of grid points, i.e., $\overline{\omega}_h \ni (i,j) \leftrightarrow (x_i, y_j) \in \overline{\Omega}_h$. Replacing both second derivatives in (2.1) by second-order finite differences at the grid points

$$\frac{\partial^2 u}{\partial x^2}(x_i, y_j) \approx \frac{1}{h^2}\left(u_{i-1,j} - 2u_{i,j} + u_{i+1,j}\right) \quad\text{and} \tag{2.2}$$

$$\frac{\partial^2 u}{\partial y^2}(x_i, y_j) \approx \frac{1}{h^2}\left(u_{i,j-1} - 2u_{i,j} + u_{i,j+1}\right), \tag{2.3}$$

we immediately arrive at the following five point stencil finite difference scheme that represents the discrete approximation to (2.1) on the grid $\omega_h$: Find the values $u_{i,j}$ of the grid function $u_h : \overline{\omega}_h \to \mathbb{R}$ at all grid points $(x_i, y_j)$, with $(i,j) \in \overline{\omega}_h$, such that

$$\frac{1}{h^2}\left(-u_{i,j-1} - u_{i-1,j} + 4u_{i,j} - u_{i+1,j} - u_{i,j+1}\right) = f_{i,j} \quad \forall\,(i,j)\in\omega_h,$$
$$u_{i,j} = 0 \quad \forall\,(i,j)\in\gamma_h. \tag{2.4}$$
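For readers who prefer code to formulas, the following sketch (our own illustration, not taken from the practical course) applies the five point stencil (2.4) in matrix-free form and computes the residual $r = f - K_h u$ at the interior grid points. The grid function is stored row-wise in a flat array of size $(N_x+1)(N_y+1)$ with the homogeneous boundary values already set to zero.

```c
/* Residual r = f - K_h u for the five point stencil (2.4).
 * u, f, r are flat arrays of size (nx+1)*(ny+1), stored row by row,
 * with u[i*(ny+1)+j] = u_{i,j}; boundary values u_{i,j} = 0 are set
 * beforehand, and boundary entries of r are left untouched.          */
void residual_5pt(int nx, int ny, double h,
                  const double *u, const double *f, double *r)
{
    const int s = ny + 1;                 /* stride between rows i and i+1 */

    for (int i = 1; i < nx; i++)
        for (int j = 1; j < ny; j++)
            r[i*s + j] = f[i*s + j]
                - ( -u[i*s + j-1] - u[(i-1)*s + j] + 4.0 * u[i*s + j]
                    - u[(i+1)*s + j] - u[i*s + j+1] ) / (h * h);
}
```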

Arranging the interior (unknown) values $u_{i,j}$ of the grid function $u_h$ in a proper way in some vector $u_h$, e.g., along the vertical (or horizontal) grid lines, and taking into account the boundary conditions on $\gamma_h$, we observe that the finite difference scheme (2.4) is equivalent to the following system of linear algebraic equations: Find $u_h \in \mathbb{R}^N$, $N = (N_x - 1)\cdot(N_y - 1)$, such that

$$K_h u_h = f_h, \tag{2.5}$$

where the $N \times N$ system matrix $K_h$ and the right-hand side vector $f_h \in \mathbb{R}^N$ can be rewritten from (2.4) in an explicit form. This is illustrated in Fig. 2.2.

Figure 2.2: Structure of matrix and load vector. The matrix $K_h$ is $\frac{1}{h^2}$ times a block tridiagonal matrix whose diagonal blocks are tridiagonal with entries $-1,\,4,\,-1$ and whose off-diagonal blocks are $-I$; the load vector is ordered along the vertical grid lines, $f_h = (f_{1,1}, f_{1,2}, f_{1,3}, f_{2,1}, \dots, f_{7,3})^T$.

The (band as well as profile) structure of the matrix $K_h$ heavily depends on the arrangement of the unknowns, i.e., on the numbering of the grid points. In our example, we prefer the numbering along the vertical grid lines because exactly this numbering gives us the smallest bandwidth in the matrix. Keeping the bandwidth as small as possible is very important for the efficiency of direct solvers.

Before reviewing some solution methods, we first look at further algebraic and analytic properties of the system matrix $K_h$ which may have some impact on the efficiency of solvers, especially for large-scale systems. The finer the discretization, the larger the system. More precisely, the dimension $N$ of the system grows like $O(h^{-m})$, where $m$ is the dimension of the computational domain Ω, i.e., $m = 2$ for our model problem (2.1). Fortunately, the matrix $K_h$ is sparse. A sparse matrix is one with only a few nonzero elements per matrix row and column independent of its dimension. Our matrix $K_h$ in (2.5) has at most 5 nonzero entries per row and per column independent of the fineness of the discretization. This property is certainly the most important one with respect to the efficiency of iterative solvers as well as direct solvers.

A smart discretization technique should preserve the inherent properties of the differential operator involved in the BVP. In our case, the matrix $K_h$ is symmetric ($K_h = K_h^T$) and positive definite ($(K_h u_h, u_h) > 0$ for all $u_h \neq 0$). These properties result from the symmetry (formal self-adjointness) and uniform ellipticity of the Laplace operator. Symmetric and positive definite (SPD) matrices are regular. Regular matrices are invertible. Thus, our system of finite difference equations (2.4) has a unique solution.

Unfortunately, the matrix $K_h$ is badly conditioned. The spectral condition number $\kappa(K_h) := \lambda_{\max}(K_h)/\lambda_{\min}(K_h)$, defined by the ratio of the maximal eigenvalue $\lambda_{\max}(K_h)$ and the minimal eigenvalue $\lambda_{\min}(K_h)$ of the matrix $K_h$, behaves like $O(h^{-2})$ if $h$ tends to 0. That behavior affects the convergence rate of all classical iterative methods, and it can deteriorate the accuracy of the solution obtained by some direct method due to accumulated round-off errors, especially on fine grids. These properties are not only typical features of matrices arising from the finite difference discretization but are also characteristic for matrices arising from the finite element discretization discussed in Chapter 4.

2.2 Sequential solving

There exist two vastly different approaches for solving (2.5): direct methods and iterative methods. For notational purposes we consider an additive splitting of the matrix $K = E + D + F$ into its strictly lower part $E$, its diagonal part $D = \mathrm{diag}(K)$, and its strictly upper part $F$. Here and in the following we omit the subscript $h$.

2.2.1 Direct methods

Direct methods produce the exact solution of systems of algebraic equations just like (2.5) in a finite number of arithmetical operations, provided that exact arithmetic without round-off errors is used.

The classical direct method is the Gauss elimination process, which successively replaces all entries in $E$ by 0 and updates in each step the remaining part of the matrix and the right-hand side $f$ [46]. This step is called the elimination step. The resulting transformed system of equations is triangular,

$$(D + F)\,u = f, \tag{2.6}$$

and can easily be solved by computing the unknowns from the bottom to the top. This step is called the backward substitution.

The Gauss elimination produces in fact the so-called LU-factorization $K = LU$ of the matrix $K$ into a lower triangular matrix $L$, where all main diagonal entries are equal to 1, and an upper triangular matrix $U$. After this factorization step the solution of our system (2.5) reduces to the following forward and backward substitution steps for the triangular systems

$$Lx = f \quad\text{and}\quad Uu = x, \tag{2.7}$$

respectively. If the matrix $K$ is symmetric and regular, then the LU-factorization can be put into the form $K = LDL^T$, extracting some (block) diagonal matrix $D$ [81].


In the SPD case, the Cholesky factorization $K = U^T U$ is quite popular [46]. There exist several modifications of the Gaussian elimination method and the factorization techniques, but all of them use basically the same principles [46, 81, 95].
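As a small illustration of the substitution steps in (2.7), the sketch below (our own simplification, using dense storage and no pivoting; a practical solver for $K_h$ would of course exploit the band structure and sparsity) performs the forward and backward substitution for an LU factorization stored in one square array.

```c
/* Solve L x = f (forward) and U u = x (backward) for an LU factorization
 * stored in one dense n x n array A (row-major): strictly lower part = L
 * (its unit diagonal is not stored), upper part incl. diagonal = U.      */
void lu_solve(int n, const double *A, const double *f, double *u)
{
    double *x = u;                        /* x may safely overwrite u      */

    for (int i = 0; i < n; i++) {         /* forward substitution          */
        x[i] = f[i];
        for (int j = 0; j < i; j++)
            x[i] -= A[i * n + j] * x[j];
    }
    for (int i = n - 1; i >= 0; i--) {    /* backward substitution         */
        for (int j = i + 1; j < n; j++)
            x[i] -= A[i * n + j] * x[j];
        x[i] /= A[i * n + i];
    }
}
```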

Each of these direct methods suffers from the so-called fill-in during the elimination, or factorization, step. After the factorization the triangular factors are much denser than the original matrix: nonzero entries appear at places within the band or profile of the triangular factors where zero entries were found in the original matrix. The sparsity of the original matrix gets lost. The fill-in phenomenon causes a superlinear growth of complexity. More precisely, the number of arithmetical operations and the storage requirements grow like $O(h^{-3m+2})$ and $O(h^{-2m+1})$, respectively, as the discretization parameter $h$ tends to 0. In addition to this, one must be aware of the loss of about $\log(\kappa(K))$ valid digits in the solution due to round-off errors.

The cost of computing the LU-factorization and the substitutions in (2.7) is always less than the cost of computing $K^{-1}f$ [65]. Therefore $K^{-1}$ should never be explicitly computed.

2.2.2 Iterative methods

Iterative methods produce a sequence $u^k$ of iterates that should converge to the exact solution $u$ of our system (2.5) for an arbitrarily chosen initial guess $u^0 \in \mathbb{R}^N$ as $k$ tends to infinity, where the superscript $k$ is the iteration index.

In this introductory section to iterative methods we only consider some classical iterative methods as special cases of stationary iterative methods of the form

$$C\,\frac{u^{k+1} - u^k}{\tau} + K u^k = f, \qquad k = 0, 1, \dots, \tag{2.8}$$

where $\tau$ is some properly chosen iteration (relaxation) parameter, and $C$ is a non-singular matrix (sometimes called preconditioner) that can be inverted easily and hopefully improves the convergence rate. If $K$ as well as $C$ are SPD, then the iteration process (2.8) converges for an arbitrarily chosen initial guess $u^0 \in \mathbb{R}^N$ provided that $\tau$ is picked from the interval $(0,\, 2/\lambda_{\max}(C^{-1}K))$, where $\lambda_{\max}(C^{-1}K)$ denotes the maximal eigenvalue of the matrix $C^{-1}K$ (see also Subsection 4.2.4). The setting $C := I$ results in the classical Richardson iteration

$$u^{k+1} := u^k + \tau\left(f - K u^k\right). \tag{2.9}$$

An improvement consists in the choice $C := D$ and leads to the $\omega$-Jacobi iteration ($\tau = \omega$)

$$u^{k+1} := u^k + \omega D^{-1}\left(f - K u^k\right) \tag{2.10}$$

that is called the Jacobi iteration for $\omega = 1$.

Choosing $C = D + E$ (the lower triangular part of matrix $K$, which can be easily inverted by forward substitution) and $\tau = 1$ yields the forward Gauss-Seidel iteration

$$(D + E)\,u^{k+1} = f - F u^k. \tag{2.11}$$


Replacing $C = D + E$ by $C = D + F$ (i.e., changing the places of $E$ and $F$ in (2.11)) gives the backward Gauss-Seidel iteration.

The slightly different choice $C = D + \omega E$ and $\tau = \omega$ results in the Successive Over Relaxation (SOR) iteration

$$(D + \omega E)\,u^{k+1} = (1 - \omega)D u^k + \omega\left(f - F u^k\right). \tag{2.12}$$

Changing again the places of $E$ and $F$ in (2.12) gives the backward version of the SOR iteration in analogy to the backward version of the Gauss-Seidel iteration. The forward SOR step (2.12) followed by a backward SOR step gives the so-called Symmetric Successive Over Relaxation (SSOR) iteration. The SSOR iteration corresponds to our basic iteration scheme (2.8) with the SSOR preconditioner

$$C = (D + \omega E)\,D^{-1}\,(D + \omega F) \tag{2.13}$$

and the relaxation parameter $\tau = \omega(2 - \omega)$. The right choice of the relaxation parameter $\omega \in (0, 2)$ is certainly one crucial point for obtaining reasonable convergence rates. The interested reader can find more information about these basic iterative methods and, especially, about the right choice of the relaxation parameters, e.g., in [93], pp. 95-116.
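To make these iterations concrete, here is a sketch (our own illustration) of one $\omega$-Jacobi sweep (2.10) applied directly to the five point stencil system (2.4); the diagonal of $K_h$ is simply $4/h^2$ at every interior node. Note that u_new and u_old must be different arrays, which is exactly what distinguishes the Jacobi from the Gauss-Seidel iteration.

```c
/* One omega-Jacobi sweep (2.10) for the five point stencil (2.4):
 * u_new = u_old + omega * D^{-1} (f - K u_old), with D = (4/h^2) I
 * on the interior nodes; boundary values remain 0.                 */
void jacobi_sweep(int nx, int ny, double h, double omega,
                  const double *u_old, const double *f, double *u_new)
{
    const int    s = ny + 1;            /* stride of the (nx+1) x (ny+1) array */
    const double d = 4.0 / (h * h);     /* diagonal entry of K_h               */

    for (int i = 1; i < nx; i++)
        for (int j = 1; j < ny; j++) {
            double Ku = ( -u_old[i*s + j-1] - u_old[(i-1)*s + j]
                          + 4.0 * u_old[i*s + j]
                          - u_old[(i+1)*s + j] - u_old[i*s + j+1] ) / (h * h);
            u_new[i*s + j] = u_old[i*s + j] + omega * (f[i*s + j] - Ku) / d;
        }
}
```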

The Laplace operator in (2.1) is the sum of two one-dimensional (1D) differential operators. Both have been discretized separately by a 3-point stencil resulting in regular $N \times N$ matrices $K_x$ and $K_y$. Therefore, we can express the matrix in (2.5) as the sum $K = K_x + K_y$. Each row and column in the discretization (Fig. 2.1) corresponds to one block in the block diagonal matrices $K_x$ and $K_y$. All these blocks are tridiagonal after a temporary and local reordering of the unknowns. We rewrite (2.5) as

$$(K_x + K_y)\,u = (K_x + \rho I)\,u + (K_y - \rho I)\,u = f$$

giving us the equation

$$(K_x + \rho I)\,u = f - (K_y - \rho I)\,u$$

that can easily be converted into the iteration form

$$(K_x + \rho I)\,u^{l+1} = f - (K_x + K_y)\,u^l + (K_x + \rho I)\,u^l.$$

This results in an iteration in the x-direction that fits into scheme (2.8):

$$(K_x + \rho I)\left(u^{l+1} - u^l\right) = f - K u^l.$$

A similar iteration can be derived for the y-direction. Combining both iterations as half steps in one iteration, just as we combined the forward and backward SOR iterations, defines the Alternating-Direction Implicit iterative method, known as the ADI method:

$$(K_x + \rho I)\,u^{k+1/2} = f - (K_y - \rho I)\,u^k,$$
$$(K_y + \rho I)\,u^{k+1} = f - (K_x - \rho I)\,u^{k+1/2}. \tag{2.14}$$

Again, we refer to Subsection 6.2.3 and to [93, pp. 116-118] for more information about the ADI methods.
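Each half step in (2.14) amounts to a set of independent tridiagonal solves, one per grid line, since $K_x + \rho I$ (respectively $K_y + \rho I$) is tridiagonal after the local reordering mentioned above. One such system can be solved by the Thomas algorithm sketched below (our own illustration; for the model problem the coefficients would be $a_i = c_i = -1/h^2$ and $b_i = 2/h^2 + \rho$).

```c
/* Thomas algorithm: solve one tridiagonal system T x = rhs with
 * sub-diagonal a[1..n-1], diagonal b[0..n-1], super-diagonal c[0..n-2].
 * An ADI half step (2.14) performs one such solve per grid line.
 * Work arrays cp and dp of length n are provided by the caller.        */
void thomas_solve(int n, const double *a, const double *b, const double *c,
                  const double *rhs, double *x, double *cp, double *dp)
{
    cp[0] = c[0] / b[0];                       /* forward elimination     */
    dp[0] = rhs[0] / b[0];
    for (int i = 1; i < n; i++) {
        double m = b[i] - a[i] * cp[i - 1];
        cp[i] = (i < n - 1) ? c[i] / m : 0.0;
        dp[i] = (rhs[i] - a[i] * dp[i - 1]) / m;
    }
    x[n - 1] = dp[n - 1];                      /* back substitution       */
    for (int i = n - 2; i >= 0; i--)
        x[i] = dp[i] - cp[i] * x[i + 1];
}
```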

The classical iteration methods suffer from the bad conditioning of the matrices arising from the finite difference or finite element discretization. An appropriate preconditioning and the acceleration by Krylov space methods can be very helpful. On the other hand, the classical (eventually, properly damped) iteration methods usually have good smoothing properties, i.e., the high frequencies in a Fourier decomposition of the iteration error are damped out much faster than the low frequencies. This smoothing property combined with the coarse grid approximation of the smooth parts leads us directly to multigrid methods, which are discussed in Chapter 7. The parallelization of the classical and the advanced iteration methods (e.g., the multigrid methods, where we need the classical iteration methods as smoothers and the direct methods as coarse grid solvers) is certainly the most important step towards the efficient solution of really large-scale problems. Therefore, in the next section, we briefly consider the domain decomposition method that is nowadays the basic tool for constructing parallel solution methods.

2.3 Parallel solving by means of domain decomposition

The simplest non-overlapping domain decomposition (DD) for our rectangle consists in a decomposition of Ω into the 2 unit squares $\Omega_1 = [0,1] \times [0,1]$ and $\Omega_2 = [1,2] \times [0,1]$, see Fig. 2.3. The interface between the two subdomains is $\Gamma_C$. This implies 2 classes of nodes. The coupling nodes on the interface are denoted by subscript “C” and the interior nodes by subscript “I”.

Figure 2.3: Two subdomains with five point stencil discretization. Here $\Omega = \Omega_1 \cup \Omega_2$, $\Gamma_D = \partial\Omega$, and $\Gamma_C = \overline{\Omega}_1 \cap \overline{\Omega}_2 \setminus \Gamma_D$; • marks the interior (“I”) nodes, and the nodes on $\Gamma_C$ are the coupling (“C”) nodes.


These classes of nodes yield the block structure (2.15) in system (2.5):

$$\begin{pmatrix} K_C & K_{CI,1} & K_{CI,2} \\ K_{IC,1} & K_{I,1} & 0 \\ K_{IC,2} & 0 & K_{I,2} \end{pmatrix} \begin{pmatrix} u_C \\ u_{I,1} \\ u_{I,2} \end{pmatrix} = \begin{pmatrix} f_C \\ f_{I,1} \\ f_{I,2} \end{pmatrix}. \tag{2.15}$$

There are no matrix entries between interior nodes of different subdomains because of the locality of the five point stencil used in our finite difference discretization.

A block Gaussian elimination of the upper right entries of the matrix (that corresponds to the elimination of the interior unknowns) results in the block triangular system of equations

$$\begin{pmatrix} S_C & 0 \\ K_{IC} & K_I \end{pmatrix} \begin{pmatrix} u_C \\ u_I \end{pmatrix} = \begin{pmatrix} g_C \\ f_I \end{pmatrix}, \tag{2.16}$$

with the Schur complement

$$S_C = K_C - K_{CI} K_I^{-1} K_{IC} = \left(K_{C,1} - K_{CI,1} K_{I,1}^{-1} K_{IC,1}\right) + \left(K_{C,2} - K_{CI,2} K_{I,2}^{-1} K_{IC,2}\right) \tag{2.17}$$

and a modified right-hand side

$$g_C = f_C - K_{CI} K_I^{-1} f_I = \left(f_{C,1} - K_{CI,1} K_{I,1}^{-1} f_{I,1}\right) + \left(f_{C,2} - K_{CI,2} K_{I,2}^{-1} f_{I,2}\right). \tag{2.18}$$

Therein, $K_C = K_{C,1} + K_{C,2}$, $f_C = f_{C,1} + f_{C,2}$, $K_{IC} = \begin{pmatrix} K_{IC,1} \\ K_{IC,2} \end{pmatrix}$ and $K_I = \mathrm{blockdiag}\{K_{I,1}, K_{I,2}\}$. The local Schur complements $K_{C,s} - K_{CI,s} K_{I,s}^{-1} K_{IC,s}$ are completely dense in general.

I)   $g_C := f_C$
     for s := 1 to P do in parallel
         $g_C := g_C - K_{CI,s} \cdot K_{I,s}^{-1} \cdot f_{I,s}$
     end

II)  Solve $S_C\, u_C = g_C$

III) for s := 1 to P do in parallel
         $u_{I,s} := K_{I,s}^{-1} \cdot \left(f_{I,s} - K_{IC,s} \cdot u_C\right)$
     end

Algorithm 2.1: A simple parallel DD solver (P=2 for our example).

This procedure of eliminating the interior unknowns can obviously be extended to the more general case of the decomposition into P subdomains. This generalization is depicted in Alg. 2.1 (P=2 for our example of the decomposition of Ω into the 2 subdomains $\Omega_1$ and $\Omega_2$). The main problem in Alg. 2.1 consists in forming and solving the Schur complement system in step II). In our simple example,


we can calculate $S_C$ explicitly and solve it directly. This is exactly the same approach that was used in the classical finite element substructuring technique. In general, the forming and the direct solution of the Schur complement system in step II) of Alg. 2.1 is too expensive. The iterative solution of the Schur complement system (sometimes also called iterative substructuring method) requires only the matrix-by-vector multiplication $S_C \cdot u_C^k$ and eventually a preconditioning operation $w_C^k = C_C^{-1} d_C^k$ as basic operations, where $d_C^k = g_C - S_C \cdot u_C^k$ denotes the defect after $k$ iterations (see Subsection 2.2.2). The matrix-by-vector multiplication $S_C \cdot u_C^k$ involves the solution of small systems with the matrices $K_{I,s}$ that can be carried out completely in parallel. The construction of really good Schur complement preconditioners is a challenging task. Nowadays, optimal, or at least almost optimal, Schur complement preconditioners are available [98]. In our simple example, we can use a preconditioner proposed by M. Dryja [33]. Dryja's preconditioner replaces the Schur complement by the square root of the discretized 1D Laplacian $K_y$ along the interface (see Subsection 2.2.2) using the fact that $K_y$ has the discrete eigenvectors

$\mu_l = [\mu_l(i)]_{i=0}^{N_y} = \left[\sqrt{2}\,\sin(l\pi i h)\right]_{i=0}^{N_y}$ and eigenvalues $\lambda_l(K_y) = \frac{4}{h^2}\sin^2(l\pi h/2)$, with $l = 1, 2, \dots, N_y - 1$. This allows us to express the defect $d_C$ and the solution $w_C$ of the preconditioning equation (we omit the iteration index $k$ for simplicity) as linear combinations of the eigenvectors $\mu_l$:

$$d_C = \sum_{l=1}^{N_y - 1} \gamma_l \cdot \mu_l \quad\text{and}\quad w_C = \sum_{l=1}^{N_y - 1} \beta_l \cdot \mu_l.$$

Now the preconditioning operation $w_C = C_C^{-1} d_C$ can be rewritten as follows:

1. Express $d_C$ in terms of the eigenfrequencies $\mu_l$ and calculate the Fourier coefficients $\gamma_l$ (Fourier analysis).

2. Divide these Fourier coefficients by the square root of the eigenvalues of $K_y$, i.e., $\beta_l := \gamma_l / \sqrt{\lambda_l(K_y)}$.

3. Calculate the preconditioned defect $w_C$ by Fourier synthesis.

If we now denote the Fourier transformation by the square matrix $F = [\mu_l(i)]_{l,i=1}^{N_y-1}$, then the above procedure can be written as

$$C_C = 2 K_y^{1/2} = \frac{2}{h} \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}^{1/2} = F^T \Lambda F,$$

where $\Lambda = \mathrm{diag}[\lambda_l(C_C)]$ with $\lambda_l(C_C) = 2\sqrt{\lambda_l(K_y)}$. Due to the property $F = F^{-1} = F^T$ we can solve the system $C_C w_C = F^T \Lambda F\, w_C = d_C$ in three simple steps:

F-analysis: $\alpha = F d_C = F^T d_C$,
Scaling: $\beta = \Lambda^{-1} \alpha$,
F-synthesis: $w_C = F \beta = F^{-1} \beta$.

Usually, the Fourier transformations in the Fourier analysis and the Fourier synthesis require a considerable amount of computing power, but this can be dramatically accelerated by the Fast Fourier Transformation (FFT), especially if $N_y$ is a power of 2, i.e., $N_y = 2^r$. We refer to the original paper [23] and the book by W. Briggs and V. Henson [15] for more details.
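The three steps can be written down almost literally in code. The sketch below is our own illustration: it uses the plain $O(N_y^2)$ sine transforms instead of an FFT and scales the eigenvectors by $\sqrt{2h}$ so that the transformation matrix becomes orthogonal, $F = F^T = F^{-1}$. It applies $w_C = C_C^{-1} d_C$ on the $N_y - 1$ interface nodes.

```c
#include <math.h>

/* Apply the Dryja preconditioner: w = C_C^{-1} d on the n = Ny-1 interface
 * nodes, with C_C = 2 K_y^{1/2} = F^T Lambda F.  Here F[l][i] is taken as
 * sqrt(2h) sin(l*pi*i*h), h = 1/Ny, so that F = F^T = F^{-1}, and
 * lambda_l(C_C) = 2 sqrt(lambda_l(K_y)).  Plain O(n^2) transforms are used;
 * an FFT-based sine transform would reduce the cost to O(n log n).        */
void dryja_precond(int Ny, const double *d, double *w)
{
    const int    n  = Ny - 1;
    const double h  = 1.0 / Ny;
    const double pi = 3.14159265358979323846;

    for (int i = 0; i < n; i++) w[i] = 0.0;

    for (int l = 1; l <= n; l++) {
        /* F-analysis: Fourier coefficient alpha_l of the defect d_C */
        double alpha = 0.0;
        for (int i = 1; i <= n; i++)
            alpha += sqrt(2.0 * h) * sin(l * pi * i * h) * d[i - 1];

        /* scaling: divide by lambda_l(C_C) = 2 sqrt(lambda_l(K_y)) */
        double sl   = sin(l * pi * h / 2.0);
        double beta = alpha / (2.0 * sqrt(4.0 / (h * h) * sl * sl));

        /* F-synthesis: accumulate beta_l * mu_l(i) into w_C */
        for (int i = 1; i <= n; i++)
            w[i - 1] += beta * sqrt(2.0 * h) * sin(l * pi * i * h);
    }
}
```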

2.4 Some other discretization methods

Throughout the remainder of this book we emphasize the Finite Element Method (FEM). Before the FEM is defined in complete detail in Chapter 4, we want to show the reader one of the primary differences between the two methods. Both FDM and FEM offer radically different ways of evaluating approximate solutions to the original PDE at arbitrary points in the domain.

The FDM provides solution values at grid points, namely $u_{i,j}$. Hence the solution lies in a vector space, not a function space.

The FEM computes coefficients for basis functions and produces a solution function in a function space, not a vector space. The FEM solution $u_h$ can be written as

$$u_h(x, y) = \sum u^{(i,j)}\, \varphi^{(i,j)}(x, y), \tag{2.19}$$

where $u^{(i,j)}$ is computed by the FEM for a given set of basis functions $\varphi^{(i,j)}$.

To get the solution at a random point $(x, y)$ in the domain using the FDM solution data $u_{i,j}$, there are two possibilities:

• If $(x, y)$ lies on the grid (i.e., $x = x_i$ and $y = y_j$ for some $i, j$), then $u_h(x, y) = u_{i,j}$.

• Otherwise, interpolation must be used. This leads to another error term, which may be larger than the truncation error of the FDM and may lead to a much less accurate approximate solution than is hoped for.

For the FEM, the basis functions usually have very compact support. Hence, only a fraction of the $\varphi^{(i,j)}$ are nonzero at any given point. A bookkeeping procedure identifies the nonzero $\varphi^{(i,j)}$ and evaluates (2.19) locally. For orthonormal basis sets, the evaluation can be particularly inexpensive. There is no interpolation error involved, so whatever the error is in the FEM is all that is in the evaluation anywhere in the domain.
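As an illustration of such a local evaluation (a sketch of our own, assuming the piecewise linear triangular elements introduced in Chapter 4), the value of $u_h$ inside one triangle is just the barycentric combination of its three nodal values; the bookkeeping then reduces to finding the triangle that contains the point $(x, y)$.

```c
/* Evaluate a piecewise linear FE function at (x,y) inside the triangle with
 * vertices (x1,y1), (x2,y2), (x3,y3) and nodal values u1, u2, u3.  The
 * barycentric coordinates l1, l2, l3 of (x,y) are exactly the values of the
 * three local basis functions, so u_h(x,y) = l1*u1 + l2*u2 + l3*u3.        */
double fe_eval_triangle(double x, double y,
                        double x1, double y1, double x2, double y2,
                        double x3, double y3,
                        double u1, double u2, double u3)
{
    double det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1);  /* 2 * area */
    double l1  = ((x2 - x) * (y3 - y) - (x3 - x) * (y2 - y)) / det;
    double l2  = ((x3 - x) * (y1 - y) - (x1 - x) * (y3 - y)) / det;
    double l3  = 1.0 - l1 - l2;

    return l1 * u1 + l2 * u2 + l3 * u3;
}
```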

We do not treat the Finite Volume Methods (FVM) in this book. There are two classes of FVM:

• Ones that are the Box Scheme composed with an FDM [62].

• Ones that are the Box Scheme composed with an FEM [6, 59].

In addition, we do not consider the boundary element method (BEM), which is also widely used in some applications [114].


Chapter 3

Introduction to parallelism

The question of whether computers can think is just like the question of whether submarines can swim.
—Edsger W. Dijkstra (1930-2002)

3.1 Classifications of parallel computers

3.1.1 Classification by Flynn

In 1966, Michael Flynn [37] categorized parallel computer architectures according to how the data stream and instruction stream are organized, as shown in Table 3.1.

                          Instruction Stream
                          Single      Multiple

  Data        Single      SISD        MISD
  Stream      Multiple    SIMD        MIMD

Table 3.1. Classification by Flynn (Flynn's taxonomy)

The MISD (Multiple Instruction, Single Data) class describes an empty set. The SISD (Single Instruction, Single Data) class contains the normal single processor computer with potential internal parallel features. The SIMD (Single Instruction, Multiple Data) class has parallelism at the instruction level. This class contains computers with vector units like the Cray T-90 and systolic array computers like Thinking Machines (TMC) CM2 and machines by MasPar. These parallel computers execute one program in equal steps and are not in the main scope of our investigations. TMC and MasPar expired in the mid 1990's. Cray was absorbed by SGI in the late 1990's. Like a phoenix, Cray rose from its ashes in 2001. Besides


recent processors by NEC and Fujitsu, even state-of-the-art PC processors by Intel and AMD contain a minor vector unit.

Definition 3.1 (MIMD). MIMD means Multiple Instructions on Multiple Data and it characterizes parallelism at the level of program execution where each process runs its own code.

Usually, those MIMD codes do not run fully independently, so that we have to distinguish between competitive processes that have to use shared resources (center part in Fig. 3.1) and communicating processes, where each process possesses its own data stream and requires data exchange at certain points in the program (right part in Fig. 3.1).

Figure 3.1: Types of parallel processes. Left: a single operation. Center: 2 parallel operations with competitive access on data (no prediction on program speed). Right: 2 parallel operations on distributed data which are updated at certain synchronization points in the code.

Normally each process runs the same code on different data sets. This allows us to introduce a sub-class of MIMD, namely the class SPMD.

Definition 3.2 (SPMD). The Single Program on Multiple Data programming model is the main class used in parallel programming today, especially if very many processors are used and an inherent data parallelism is definable.

There is no need for a synchronous execution of the codes at instruction level, thus an SPMD machine is not an SIMD machine. However, there are usually some synchronization points in a code to update data that is needed by other processors in order to preserve data consistency in implemented algorithms.

We distinguish between multiple processes and multiple processors. Processes are individual programs. There may be one or more processes running on one individual processor. In each process, there may be one or more threads. Threads are "lightweight" processes in the sense that they share all of the memory. While almost all of the algorithms described in the tutorial can be implemented in a threaded manner, we normally only assume parallelism at the processor level.

3.1.2 Classification by memory access

Definition 3.3 (Shared Memory). Shared memory is memory that is accessed by several competitive processes “at the same time”. The multiple memory requests are handled by hardware or software protocols.

A shared memory parallel computer has advantages and disadvantages:

+ Each process has access to all of the data. Therefore, a sequential code can be easily ported to parallel computers with shared memory and usually leads to a first increase in performance with a small number of processors (2 . . . 8).

− As the number of processors increases, the number of memory bank conflicts and other access conflicts usually rises. Thus, scalability (i.e., performance proportional to the number of processors, or, equivalently, wall clock time inversely proportional to the number of processors) cannot be guaranteed unless there is a lot of memory and memory bandwidth.

− Very efficient access administration and bus systems are necessary to decrease access conflicts. This is one reason why these memory subsystems are more expensive.

We introduce briefly three different models to construct shared memory systems [64].

Definition 3.4 (UMA). In the Uniform Memory Access Model for shared memory, all processors have equal access time to the whole memory, which is uniformly shared by all processors.

UMA is what many vendors of small shared memory computers (IBM SP3, 2-4 processor Linux servers) try to implement. IBM and commodity Linux shared memory machines have not expired, as of 2002.

Definition 3.5 (NUMA). In the Non-Uniform Memory Access Model for shared memory, the access time to the shared memory varies with the location of the processor.

The NUMA model has become the standard model used in shared memory supercomputers today, such as those made today by HP, IBM, and SGI.

Definition 3.6 (COMA). In the Cache Only Memory Access Model for shared memory, all processors use only their local cache memory, so that this memory model is a special case of the NUMA model.


The Kendall Square Research KSR-1 and KSR-2 had such a COMA memory model. Kendall Square expired in the mid 1990's.

Definition 3.7 (Distributed Memory). Distributed memory is a collection of memory pieces where each of them can be accessed by only one processor. If one processor requires data stored in the memory of another processor, then communication between these processors (communicating processes) is necessary.

We have to take into account the following items for programming on distributed memory computers:

+ There are no access conflicts between processors since data are locally stored.

+ The hardware is relatively inexpensive.

+ The code is potentially nearly optimally scalable.

− There is no direct access to data stored on other processors and so communication via special channels (links) is necessary. Hence, a sequential code does not run and special parallel algorithms are required.

· The ratio between arithmetic work and communication is one criterion for the quality of a parallel algorithm. The time needed for communication is underestimated quite frequently.

· Bandwidth and transfer rate of the network between processors are of extreme importance and are always overrated.

Recent parallel computer systems by vendors such as IBM, SUN, SGI, HP, NEC, and Fujitsu are no longer machines with purely distributed or purely shared memory. They usually combine 2-16 processors into one computing node with shared memory for these processors. The memory of different computing nodes behaves again as distributed memory.

Definition 3.8 (Distributed Shared Memory). The distributed shared memory model (DSM) is a compromise between the shared and distributed memory models. The memory is distributed over all processors but the program can handle the memory as shared. Therefore, this model is also called the virtual shared memory model.

In DSM, the distributed memory is combined with an operating system based on a message passing system (see Section 3.2.2) which simulates the presence of a global shared memory, e.g., the “Sea of addresses” by KSR and SGI's “Interconnection fabric”.

+ The great advantage of DSM consists in the fact that a sequential code runs immediately on this memory model. If the algorithms take advantage of the local properties of the data, i.e., most data accesses of a process can be served from its own local memory, then a good scalability can be achieved. On the other hand, the parallel computer can also be handled as a pure distributed memory computer.


For example, SGI's parallel machines Origin 2000 and Origin 3000 provide Symmetric Multiprocessing (SMP). Each processor (or a small group of processors) has its own local memory. However, the parallel machine handles the whole memory as one huge shared memory. The realization was made possible by using a very fast crossbar switch (CrayLink). SGI has not expired, as of 2002.

Besides hardware solutions to DSM, there are a number of software systems that simulate DSM by managing the entire memory using sophisticated database techniques.

Definition 3.9 (Terascale GRID Computers). Several computer centers cooperate with huge systems, connected by very, very fast networks capable of moving a terabyte in a few seconds. Each node is a forest of DSM's, possibly with 1000's of local DSM's. Making one of these work is a trial (comic relief?).

3.1.3 Communication topologies

How can we connect the processors of a parallel computer? We are particularly interested in computers with distributed memory since clusters of fast PCs are really easy to construct and to use. A more detailed discussion of topologies can be found in [76].

Definition 3.10 (Link). A link is a connection between 2 processors. We distinguish between a unidirectional link, which can be used only in one direction at the same time, and a bidirectional link, which can be used in both directions at any time. Two unidirectional links are not a bidirectional link, however, even if the function is identical.

Definition 3.11 (Topology). An interconnection network of the processes is called a topology.

Definition 3.12 (Physical topology). A physical topology is the interconnection network of the processors (nodes of the graph) given in hardware by the manufacturer.

This network can be configured by changing the cable connections (IBM SP) or by a software reconfiguration of the hardware (Xplorer, MultiCluster-I). Recent parallel computers usually have a fixed connection of the processor nodes, but they are also able to reconfigure the network for commonly used topologies by means of a crossbar switch that allows potentially all connections.

As an example, the now ancient Transputer T805 was a very specialized processor for parallel computing in the early 1990s. It had 4 hardware links so that it could handle a 2D-grid topology (and a 4D-hypercube) directly in hardware. Using one processor per grid point of the finite difference discretization of Poisson's equation in the square (2.4) resulted in a very convenient parallel implementation on this special parallel computer.

Definition 3.13 (Logical topology). The logical topology refers to how processes (or processor nodes) are connected to each other. This may be given by the user or the operating system. Typically it is derived from the data relations or the communication structure of the program.

The mapping from the logical to the physical topology is done via a parallel operating system or by parallel extensions to the operating system (e.g., see Sec. 3.2.5). For example, a 4-process program might use physical processors 1, 89, 126, and 1023 in a 1024-processor system. Logically, the parallel program would assume it is using processors 0–3. On most shared memory computers, the physical processors can change during the course of a long computation.

Definition 3.14 (Diameter). The diameter of a topology is the maximal number of links that have to be used by a message sent from an arbitrary node p to another node q.

The diameter is sometimes difficult to measure precisely. On some machines, more than one path is possible when sending a message between nodes p and q. The paths do not necessarily have the same number of links and depend on the network traffic load. In fact, on some machines, a message can fail because it is not delivered within a set maximum period of time.

3.2 Specialties of parallel algorithms

3.2.1 Synchronization

Remark 3.15. Sequential programming is only the expression of our inability to transfer the natural parallelism of the world to a machine.

Administrating parallel processes provides several challenges that are not typical on single processor systems. Synchronizing certain operations so that conflicts do not occur is a particularly hard problem that cannot be solved with software alone.

Definition 3.16 (Undefined status). An undefined status occurs when the result of a data manipulation cannot be predicted in advance. It happens mostly when several processes have to access restricted resources, i.e., shared memory.

For example, consider what happens when the two processes in Fig. 3.2, running simultaneously on different processors, modify the same variable. The machine code instruction inc increments a register variable and dec decrements it. The value of N depends on the execution speed of processes A and B and therefore its value is not predictable, i.e., the status of N is undefined, as shown in Fig. 3.3.


Figure 3.2: Undefined status. Process A computes N := N + 1 via the machine instructions (1) load N, (2) inc N, (3) store N; process B computes N := N − 1 via (1) load N, (2) dec N, (3) store N.

Figure 3.3: Three possible outcomes. Depending on how the steps (1)–(3) of processes A and B interleave in time, the final value is N + 1 (case a), N − 1 (case b), or N (case c).

In order to avoid programs that can suffer from data with an undefined status, we have to treat operations A and B as atomic operations, i.e., they cannot be split any further. This requires exclusive access to N for one process during the time needed to finish the operation. This exclusive access is handled by synchronization.

Definition 3.17 (Synchronization). Synchronization prevents an undefined status of a variable by means of a mechanism which allows processes to access that variable in a well defined way.

Synchronization is handled by a semaphore mechanism (see [25, 41]) on shared memory machines and by message passing on distributed memory machines.
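As a concrete illustration (not part of the tutorial's course material), the following minimal C sketch uses a POSIX mutex in place of the abstract semaphore mechanism to make the updates N := N + 1 and N := N − 1 atomic with respect to each other; the thread functions and the iteration count are purely illustrative.

```c
/* Minimal sketch (not from the tutorial): protecting a shared counter on a
 * shared memory machine with a POSIX mutex so that the two updates of
 * Fig. 3.2 become atomic with respect to each other.                      */
#include <pthread.h>
#include <stdio.h>

static long N = 0;                          /* shared variable             */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *process_A(void *arg)           /* N := N + 1, many times      */
{
    for (int i = 0; i < 1000000; ++i) {
        pthread_mutex_lock(&lock);          /* enter critical section      */
        N = N + 1;                          /* load, inc, store as a unit  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *process_B(void *arg)           /* N := N - 1, many times      */
{
    for (int i = 0; i < 1000000; ++i) {
        pthread_mutex_lock(&lock);
        N = N - 1;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, process_A, NULL);
    pthread_create(&b, NULL, process_B, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("N = %ld\n", N);                 /* always 0 with the mutex     */
    return 0;
}
```

Compiled with a C compiler and the -pthread flag, the program should always print N = 0; removing the lock/unlock calls typically reproduces the undefined status of Fig. 3.3.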

Definition 3.18 (Barrier). The barrier is a special synchronization mechanism consisting of a certain point in a program that must be passed by all processes (or a group of processes) before the execution continues. This guarantees that each single process has to wait until all remaining processes have reached that program point.

3.2.2 Message Passing

Message passing is a mechanism for transferring data directly from one process to another one. We distinguish between blocking and non-blocking communication.

Blocking communication makes all (often two) involved processes wait until all processes signal their readiness for the data/message exchange. Usually the processes wait further until the exchange is completed.

Figure 3.4: Blocking communication. Both processes block in send/recv until the transfer has taken place.

Figure 3.5: Non-blocking communication. The send and recv calls return immediately; the completion status can be queried later.

Non-blocking communication allows all processes to send or receive their data/messages totally independently of the remaining processes.

• This guarantees high efficiency in communication without deadlock (contrary to blocking communication, which can deadlock under very subtle conditions).

• The sending process expects no acknowledgment that the operation was executed successfully. Output parameters of the receiving process must not have an undefined value; otherwise the whole program may crash. Preventing this is the programmer's responsibility.

• Other operations can be executed between the call to a communication routine and the corresponding, but not required, status query.

The pure semaphore concept is typical for shared memory computers, and the resource management is usually unnoticed by the programmer. On the other hand, message passing occurs on distributed memory computers and must be specified explicitly by the programmer.


3.2.3 Deadlock

A general problem in the access to shared resources is the prevention of deadlocks.

Definition 3.19 (Deadlock). A deadlock occurs when several processes are waiting for an event which can be released only by one of these waiting processes.

The simplest example in which a deadlock may occur is the exchange of data between two processes if blocking communication is used. Here, both processes send their data and cannot proceed until both get an acknowledgment that the data has been received by the other process. But neither process can send that acknowledgment since neither process ever reaches that program statement. Therefore, each process has to wait for an acknowledgment from the other one and never reaches the recv statement. One solution to overcome this deadlock is presented in Sec. 3.4. Communication routines of more complexity than the simple send/receive between two processes in Fig. 3.6 might have deadlocks which are not as obvious.

Figure 3.6: Deadlock in blocking communication. Both processes call send first and wait forever for the matching recv.

3.2.4 Data coherency

Definition 3.20 (Data coherency). Data coherency is achieved if all copies of a data set always contain the same data values.

• On distributed memory machines, the programmer is responsible for data coherency.

• On classical (older) machines with huge shared memory, data coherency was guaranteed by semaphores. This limited the number of processors to about 40.

• Nowadays, shared memory computers also have a local cache on each processor (sometimes a distributed shared memory model is used for programming). Here, cache coherency is additionally required, which is much harder to implement than simple data coherency. Several mechanisms/protocols are used to guarantee that cache coherency. In particular, the most recent distributed shared memory machines use an appropriate cache coherent NUMA model with internal protocols such as:

snoopy-based protocol: Each process sends its memory accesses to the whole network. This leads to problems with the communication bandwidth.

directory-based protocol: Each memory block (cache line) possesses an entry in a table which stores the state of that block and a bit vector including all processes with copies of that block. By analyzing these entries, it can be determined which caches have to be updated [97].

3.2.5 Parallel extensions of operating systems and programming languages

In the early years of parallel programming, mostly in research, only a few hardware-dependent calls were available. As a consequence, source code was not portable and one had to create one's own hardware-independent interface. Nowadays, all vendors of parallel computers offer standard parallel libraries and/or similar extensions of the operating system which include all necessary compilers, libraries, and so on.

Some parallel versions of high level programming languages are available, e.g., High Performance Fortran (HPF)^1, Vienna Fortran^2 and OpenMP^3 [21] extensions to C, C++ and Fortran. They all include language extensions that support parallelism on structured data, and the compilers attempt an automatic parallelization.

Another, and more promising, approach consists in the development of general and portable interfaces for MIMD computers. The most common library in this field is the Message Passing Interface (MPI), which is based on earlier non-commercial and commercial interfaces like PVM, Express, etc. The MPI^4 consortium is supported by all big manufacturers of parallel machines (they work actively in that committee) and the library is available on all software platforms, e.g., LINUX^5, even on some variants of Windows (e.g., 2000 and XP). Therefore, we focus in the course material on MPI. Section 3.3 is strongly recommended for a deeper understanding of the calls in MPI^6; see also [38, 49, 85].

^1 http://www.mhpcc.edu/training/workshop/hpf/MAIN.html
^2 http://www.vcpc.univie.ac.at/activities
^3 http://www.openmp.org
^4 http://www.mcs.anl.gov/mpi/index.html
^5 http://www.suse.com
^6 http://nexus.cs.usfca.edu/mpi

3.3 Basic global operations

3.3.1 Send and Recv

Each parallel program needs basic routines for sending and receiving data from one process to another one. We use symbolic names for these routines that can be mapped to concrete implementations. Let us introduce the routine



Send(nwords, data, ProcNo)

for sending nwords of data stored in data from the calling process to process ProcNo. The appropriate receiving routine is

Recv(nwords, data, ProcNo).

The specification of what sort of data (DOUBLE, INTEGER, etc.) is transferred is of no interest for our general investigations. These details have to be taken into account by choosing the right parameters in the point-to-point communication of a given parallel library. We use exclusively blocking communication, implemented as MPI_Send and MPI_Recv in the MPI library. Using only the blocking versions of the communication routines assists a beginner in parallel programming in such a way that only one message thread has to be debugged at a certain program stage. Once the code runs perfectly, it can be accelerated by using non-blocking communication.
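A minimal sketch of how these symbolic routines could map onto the blocking MPI point-to-point calls (assuming double precision data, a fixed message tag, and the default communicator; this is an illustration, not the tutorial's course code):

```c
/* Sketch (assumed mapping, not the book's course code): the symbolic
 * Send/Recv routines expressed with blocking MPI point-to-point calls.    */
#include <mpi.h>

/* send nwords doubles stored in data to process ProcNo */
void Send(int nwords, double *data, int ProcNo)
{
    MPI_Send(data, nwords, MPI_DOUBLE, ProcNo, /*tag=*/0, MPI_COMM_WORLD);
}

/* receive nwords doubles into data from process ProcNo */
void Recv(int nwords, double *data, int ProcNo)
{
    MPI_Recv(data, nwords, MPI_DOUBLE, ProcNo, /*tag=*/0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}
```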

3.3.2 Exchange

The routines Send and Recv can be combined into a routine

Exchange(ProcNo, nWords, SendData, RecvData)

which exchanges data between processes p and q.

Non-blocking communication

The variant in Fig. 3.7 works only in the case of non-blocking communication, i.e., the sending process does not require an acknowledgment that the message was received. Using blocking communication here ends in a deadlock.

Figure 3.7: Non-blocking Exchange. Both processes post recv and send immediately and then continue with their arithmetic.

Blocking communication

In the case of blocking communication, a deadlock-free Exchange between p and q needs a unique function determining which one of the two processes may send/receive first. The function test(p, q) = (p > q) provides an ordering mechanism such that both processes achieve the same result independently. The convention that the process with the larger process number first sends and then receives (and vice versa on the other process) guarantees a deadlock-free Exchange procedure.


Figure 3.8: Blocking Exchange. Process "p" evaluates test(p,q) and process "q" evaluates test(q,p); the process with the positive test sends first and then receives, while the other one receives first and then sends.

An appropriate MPI call is MPI_Sendrecv, which handles this ordering automatically.
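The following sketch (an illustration under the assumptions of double precision data, a fixed tag, and the default communicator; not the tutorial's course code) implements the deadlock-free blocking Exchange with the ordering test(p, q) = (p > q):

```c
/* Sketch (illustrative, not the book's course code): deadlock-free Exchange
 * with blocking communication. The process with the larger rank sends first
 * and then receives; the other one does the opposite.                       */
#include <mpi.h>

void Exchange(int ProcNo, int nWords, double *SendData, double *RecvData)
{
    int myid;
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid > ProcNo) {          /* test(p,q) = (p > q): send first */
        MPI_Send(SendData, nWords, MPI_DOUBLE, ProcNo, 0, MPI_COMM_WORLD);
        MPI_Recv(RecvData, nWords, MPI_DOUBLE, ProcNo, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {                      /* smaller rank: receive first, then send */
        MPI_Recv(RecvData, nWords, MPI_DOUBLE, ProcNo, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(SendData, nWords, MPI_DOUBLE, ProcNo, 0, MPI_COMM_WORLD);
    }
}
```

The same effect can be obtained in a single call with MPI_Sendrecv.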

3.3.3 Gather-Scatter operations

A parallel program usually possesses a root process, which has exclusive access to resources. This process has to gather and scatter specific data from or to the other processes. In the function

Scatter(root, SendData, MyData)

each process (including root) receives from the root process its specific data MyData out of the global data set SendData stored only at the root process. On the other hand, the function

Gather(root, RecvData, MyData)

collects the specific data in MyData on the root process and stores it in RecvData. Both functions can be implemented via Send and Recv in various ways. Appropriate MPI calls are MPI_Gather and MPI_Scatter.
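A small, self-contained sketch of the Scatter/Gather pattern with one double per process (the buffer sizes and the doubling of MyData are illustrative assumptions, not part of the course material):

```c
/* Sketch (illustrative, not the book's course code): scatter one double per
 * process from the root and gather one result back.                         */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int myid, nprocs, root = 0;
    double *SendData = NULL, *RecvData = NULL, MyData;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (myid == root) {                        /* global data lives on root */
        SendData = malloc(nprocs * sizeof(double));
        RecvData = malloc(nprocs * sizeof(double));
        for (int i = 0; i < nprocs; ++i) SendData[i] = (double)i;
    }

    /* Scatter(root, SendData, MyData): each process gets its own piece */
    MPI_Scatter(SendData, 1, MPI_DOUBLE, &MyData, 1, MPI_DOUBLE,
                root, MPI_COMM_WORLD);

    MyData *= 2.0;                             /* some local work */

    /* Gather(root, RecvData, MyData): root collects the pieces again */
    MPI_Gather(&MyData, 1, MPI_DOUBLE, RecvData, 1, MPI_DOUBLE,
               root, MPI_COMM_WORLD);

    free(SendData); free(RecvData);            /* free(NULL) is a no-op */
    MPI_Finalize();
    return 0;
}
```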

Figure 3.9: Scatter and Gather. The root process distributes pieces of the global data set to P0, P1, . . . , Pp−1 (Scatter) and collects them again (Gather).

Sometimes it is necessary to gather and scatter data from/to all processes. The simplest realization of these routines, Gather_All and Scatter_All, can be implemented by using combined gather/scatter calls.

3.3.4 Broadcast

Often, all processes receive the same information from one process (or from all processes). In this case the root process acts like a broadcast station and the appropriate routine is

Broadcast(root, nWords, Data).

This is a simplification of the Scatter routine, which distributes individual data to the processes. The MPI call is MPI_Bcast.
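A minimal sketch of the Broadcast routine on top of MPI_Bcast, assuming double precision data and the default communicator (illustrative, not the tutorial's course code):

```c
/* Sketch (illustrative, not the book's course code): the root process
 * broadcasts nWords doubles stored in Data to all other processes.      */
#include <mpi.h>

void Broadcast(int root, int nWords, double *Data)
{
    /* after the call, Data holds the root's values on every process */
    MPI_Bcast(Data, nWords, MPI_DOUBLE, root, MPI_COMM_WORLD);
}
```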

Figure 3.10: Broadcast operation. The root process sends the same data to all processes P0, P1, . . . , Pp−1.

3.3.5 Reduce- and Reduce-All operations

Here, we perform an arithmetical/logical operation on the distributed data of all processes (e.g., $s = \sum_{i=0}^{p-1} a_i$). At the end, the result is available on one root process (Reduce operation) or on all processes (Reduce_All).

The function could look like Reduce(n, X, Y, VtOp), and it can handle vectors and operations VtOp on them in the parameter list. So the above sum s could be calculated via Reduce(1, S, A, Double_Plus).

The simplest implementation of Reduce consists in a Gather followed by Double_Plus on the root process and then a Scatter. More sophisticated and faster implementations can be derived for certain topologies. The calls MPI_Reduce and MPI_Allreduce are the appropriate functions in the MPI library.
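A small sketch of a Reduce_All for the global sum $s = \sum_i a_i$ using MPI_Allreduce (the local values are illustrative; this is not the tutorial's course code):

```c
/* Sketch (illustrative, not the book's course code): a global sum
 * s = sum_i a_i over all processes, available on every process.      */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int myid;
    double a, s;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    a = (double)myid;                      /* local contribution a_i   */

    /* Reduce_All with the operation Double_Plus corresponds to:        */
    MPI_Allreduce(&a, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (myid == 0) printf("global sum s = %g\n", s);
    MPI_Finalize();
    return 0;
}
```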

3.3.6 Synchronization by barriers

The synchronization of all processes is called a barrier and is provided by MPI as MPI_Barrier.

• Each operation in 3.3.2–3.3.5 can be used as a barrier, e.g., we can construct a barrier via Reduce(1, I, J, Int_Plus).

• Most parallel algorithms quite naturally contain a barrier, since a global data exchange (e.g., a summation) is necessary at some point in the algorithm.


3.3.7 Some remarks on portability

The operations in 3.3.2–3.3.6 have been built on the following assumptions:

1. There is a given topology, e.g., farm, binary tree or hypercube [76, 90]. If the assumed or required topology is not physically available, then it must be constructed logically on the given topology.

2. Send to a process: Send.

3. Receive from a process: Recv.

Only these 3 items depend on hardware and operating system. All remaining routines should be based on these three basic routines.

Parallel communication libraries come and go. TCP/IP, UDP, PVM, P4, TCMSG, and MPI are just a few of the once popular libraries. Before writing any parallel application, stop and either get a portable wrapper library from a friend or colleague or write your own. Do not hard code any particular library calls into your codes. This simplifies the codes for debugging and they become more portable. Such a portable parallel program is only trivially slower than a hardware-optimized parallel code in real production runs of very large applications.

3.4 Performance evaluation of parallel algorithms

How do we fairly compare the suitability (or unsuitability) of parallel algorithms for a large or huge number of processors?

3.4.1 Speedup and Scaleup

Definition 3.21 (Scalability). The property of a program to adapt automatically to a given number of processors is called scalability.

Scalability is more sought after than highest efficiency (i.e., gain of computing time by parallelism) on any specific architecture/topology.

Definition 3.22 (Granularity). Granularity is a measure of the size of program sections which are executable without communication with other processes. The two extremes are fine grain and coarse grain algorithms.

Definition 3.23 (Speedup). The speedup $S_P$ is a measure of the performance gain of a parallel code running on P processors in comparison to the sequential program version. The global problem size $N_{glob}$ remains unchanged. This measure can be expressed quantitatively by

\[ S_P = \frac{t_1}{t_P} \tag{3.1} \]

with $t_1$ the system time of process A on one processor and $t_P$ the system time of process A on P processors.

A speedup of P is desirable, i.e., P processes are P times faster than one. The global problem size $N_{glob}$ is constant and therefore the local problem size on each processor decreases like $N_i \sim N_{glob}/P$, $\forall i = 1, \dots, P$, i.e., the more processors are used, the less an individual processor has to do.

$S_P$ can be calculated in a variety of ways. Our speedup measures only the parallelization quality of one algorithm that is slightly adapted to P. The quality of the numerical algorithm itself is measured with the numerical efficiency in (3.6). Only the combination of the two measures allows a fair evaluation of an algorithm.

Example 3.24 (Speedup-Worker). Let us use workers instead of processors. P = 1 worker digs a pit of 1 m³ [$N_{glob}$] in $t_1$ = 1 h by stepping into the pit, taking a shovel of soil and dropping it a few meters away. Two workers can reduce the time for digging that hole in the described way by a factor of two, i.e., it takes them $t_2$ = 0.5 h. How long [$t_P$] does it take P = 1000 workers to perform the same work? Obviously, the workers obstruct each other and too high an organization overhead develops. Therefore we gain very little speedup when using many workers.

This example leads us directly to Amdahl’s theorem.

Theorem 3.25 (Amdahl's theorem (1967)). Each algorithm contains parts which cannot be parallelized. Let s be the sequential part of an algorithm and p its parallel part (both normalized):

\[ s + p = 1. \tag{3.2} \]

Thus, we can express the system time on one processor by $t_1 \longrightarrow s + p$ and (under the assumption of an optimal parallelization) the system time on P processors by $t_P \longrightarrow s + p/P$. This results in an upper bound for the maximally attainable speedup

\[ S_{P,\max} = \frac{s + p}{s + \frac{p}{P}} = \frac{1}{s + \frac{1-s}{P}} \le \frac{1}{s}. \tag{3.3} \]

If we can parallelize 99% of a code, then the remaining 1% of sequential code restricts the maximal parallel performance such that only a maximal speedup of $S_{\infty,\max} = 100$ can be achieved by formula (3.3).

P      10    100    1000    10000
S_P     9     50      91       99

The use of more than 100 processors appears senseless, since the costs of 1000 processors are (approximately) 10 times higher than the costs of 100 processors but the performance does not even double. This result seems to be discouraging.
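The table above can be checked with a few lines of C evaluating the bound (3.3) for a sequential fraction s = 0.01 (a small illustration, not part of the book's material):

```c
/* Small check (not from the book): evaluate Amdahl's bound (3.3),
 * S_P = 1 / (s + (1-s)/P), for a sequential fraction s = 0.01.      */
#include <stdio.h>

int main(void)
{
    const double s = 0.01;                 /* 1% sequential code */
    const int    P[] = {10, 100, 1000, 10000};

    for (int k = 0; k < 4; ++k) {
        double sp = 1.0 / (s + (1.0 - s) / P[k]);
        printf("P = %5d   S_P = %6.1f\n", P[k], sp);
    }
    return 0;
}
```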


But is Amdahl's Theorem suitable for us?

There is an observation that certain algorithms achieve a speedup even higher than P for smaller processor numbers (< 20). This is usually caused by a more intensive use of the cache memory on the processors when the size of the local problem gets smaller, or by a different number of iterations per processor in a domain decomposition algorithm.

On the other hand, Amdahl's Theorem does not take into account any time for communication, so the real behavior of the speedup is always worse; see Fig. 3.11 for a qualitative illustration. We can formulate the consequence that the attainable speedup depends on the size and communication properties of the underlying global problem.

Figure 3.11: Speedup with respect to problem size. Qualitatively, the larger the problem, the closer the measured speedup follows the ideal speedup as the number of processors grows; a problem that is too small saturates early.

Example 3.26 (Scaleup-Worker). If each worker digs a pit of 1 m³, then 1000 workers reach nearly a 1000 times higher performance during that hour. They all work on one huge project.

Usually the sequential part of a code is rather insensitive to the overall problem size. On the other hand, the parallel parts cover most of the arithmetic in a code. Therefore, increasing the time a code spends in the parallel parts by increasing the problem size decreases the importance of the sequential part. As a conclusion, scaling the problem size with the number of processors is a way to overcome Amdahl's Theorem.

Definition 3.27 (Scaled speedup). The scaled speedup $S_C$ is a measure of the parallel performance gain when the global problem size $N_{glob}$ increases with the number of processors used. It assumes additionally that the parallel part of the algorithm is optimal with respect to the problem size, i.e., $O(N_{glob})$. We can express the scaled speedup quantitatively as

\[ S_C(P) = \frac{s_1 + P \cdot p_1}{s_1 + p_1} \overset{s_1 + p_1 = 1}{=} s_1 + P(1 - s_1) \tag{3.4} \]

using the notation that $s_1 + p_1$ is the normalized system time on a parallel computer and $s_1 + P \cdot p_1$ the system time on a sequential computer.

Equation (3.4) means that a sequential part of 1% leads to a scaled speedup of $S_C(P) \approx P$ because the serial part decreases with the problem size. This theoretical forecast is confirmed in practice.

The (scaled) speedup may not be the only criterion for the usefulness of an algorithm; e.g., one can see in Fig. 3.12 (courtesy of Matthias Pester, Chemnitz, 1992) by comparing Gaussian elimination with a diagonally preconditioned CG (PCG) that

\[ S_P(\text{Gauss}) > S_P(\text{PCG}) \quad \text{BUT} \quad t(\text{Gauss}) \gg t(\text{PCG}), \]

i.e., a good parallelization is worthless for a numerically inefficient algorithm.

If one compares algorithms implemented on different parallel machines, then the costs of the machines should also be taken into account, e.g., the price per unknown variable.

Figure 3.12: System times for Gaussian elimination, CG, PCG.


Remark 3.28. Some people only want to measure flops (floating point operations per second), others only want to measure speedup. Actually, only wall clock time counts for a parallel code. If you do not want the solution now, why bother with the nuisance of parallel programming?

3.4.2 Efficiency

Definition 3.29 (Parallel efficiency). The parallel efficiency tells us how close our parallel implementation is to the optimal speedup. Here we present only the formula for the scaled parallel efficiency; the classical efficiency can be calculated similarly:

\[ E_{C,par} = \frac{S_C(P)}{P} \tag{3.5} \]

A desirable efficiency would be 1 (= 100%), i.e., the use of P processors accelerates the code by a factor of P. The parallel efficiency for the scaleup can be given explicitly:

\[ E_{C,par} = \frac{S_C(P)}{P} = \frac{s_1 + P(1 - s_1)}{P} = 1 - s_1 + \frac{s_1}{P} > 1 - s_1. \]

We get a slightly decreasing efficiency with an increasing number of processors; the efficiency is, however, at least as high as the parallel part of the program.

Definition 3.30 (Numerical efficiency, scaled efficiency). The numerical efficiency compares the fastest sequential algorithm with the fastest parallel algorithm implemented on one processor via the relationship

\[ E_{num} = \frac{t_{parallel}}{t_{serial}}. \tag{3.6} \]

From that follows the scaled efficiency of a parallel algorithm:

\[ E = E_{C,par} \cdot E_{num}. \tag{3.7} \]

Formula (3.7) does not contain any losses due to communication, and it also assumes a uniform distribution of the global problem on all processors. This uniform distribution can rarely be achieved in practice, and the work load per processor can additionally change during non-linear and/or adaptive calculations. Therefore, the attained efficiency is lower than the theoretical one, often vastly so. One way to preserve a good theoretical efficiency in practice consists in the use of load balancing strategies.


Definition 3.31 (Load balance). The equal distribution of the computational work load on all processors is called load balance.

Opportunities to achieve a load balance close to the optimum:

• A static load balancing can be done a priori:

– A simple distribution of meshes or data based on an arbitrary numbering is not efficient for the parallel code since it does not take advantage of spatial data locality.

– Distribution of the data with bisection techniques (e.g., coordinate bisection, recursive spectral bisection [7], the Kernighan–Lin algorithm [103]) produces much better data splittings, also with respect to communication. The better the desired data distribution, the more expensive the individual bisection technique becomes. Static load balancing is especially suited for optimization problems or for the distribution of coarse meshes/grids in multigrid algorithms.

• A dynamic load balancing involves a redistribution of data during the calculations [9, 112]:

– A high effort for the load distribution is needed, and good heuristics are necessary to avoid a communication overhead.

– Dynamic load balancing is required in highly adaptive algorithms.

Remark 3.32. METIS, PARMETIS, JOSTLE and PARTI are the most commonly used load balancing codes. Many others exist, however.

3.4.3 Communication expenditure

We mostly neglected the time needed for communication (exchange of data) in all investigations on speedup and efficiency.

Let $t_{Startup}$ be the latency (the time until the communication starts) and $t_{Word}$ the time for transferring one word (determined by the bandwidth); then the time to transfer a message of N words is

\[ t_N = t_{Startup} + N \cdot t_{Word}. \tag{3.8} \]

Regarding (3.8) and practical experience, the transfer of one long message is faster than the transfer of several short messages. Operating system and/or hardware split huge messages into several smaller ones. The size of those smaller messages is vendor/OS dependent, but is usually > 1 kByte. Table 3.2 presents a collection of latency and bandwidth figures. The sustained bisection bandwidth is the bandwidth that can be achieved when one half of the processors communicates with the other half. As an example, the bisection bandwidth of the Origin2000 scales linearly up to 10 GB/sec on 32 processors. This value remains the same for 64 processors and doubles to 20 GB/sec with 128 processors.


Machine/OS                  Latency     Bandwidth
Xplorer/Parix               100 µs      1.3 MB/sec (measured)
Intel iPSC/860              82.5 µs     2.5 MB/sec
Convex SPP-1                0.2 µs      16.0 MB/sec
nCube2                                  71 MB/sec
Cray T3D                    6 µs        120 MB/sec
Intel Paragon                           200 MB/sec
SGI Origin2000 R10k/250                 680 MB/sec
HP 9000 N-class             105 ns      160 MB/sec (in Hyperfabric)
HP Superdome                360 ns      12.8 GB/sec
IBM RS/6000                 1000 ns     0.5 GB/sec
SGI Origin3800              45 ns       1.6 GB/sec
SUN E10000                  500 ns      12.8 GB/sec

Table 3.2. Comparison of latency and bandwidth
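As a small illustration of (3.8) and of why one long message beats many short ones, the following sketch uses the Xplorer/Parix numbers from Table 3.2 as assumed latency and bandwidth values (illustrative only, not measured data):

```c
/* Sketch (assumed latency/bandwidth values, not measurements): model (3.8)
 * shows why one long message is faster than many short ones.               */
#include <stdio.h>

int main(void)
{
    const double t_startup = 100e-6;       /* 100 microseconds latency      */
    const double t_word    = 8.0 / 1.3e6;  /* seconds per 8-byte word at
                                              1.3 MB/sec bandwidth          */
    const int    N = 10000;                /* total words to transfer       */

    double one_message   = t_startup + N * t_word;
    double hundred_small = 100.0 * (t_startup + (N / 100) * t_word);

    printf("one message of %d words : %.4f s\n", N, one_message);
    printf("100 messages of %d words: %.4f s\n", N / 100, hundred_small);
    return 0;
}
```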

Exercises

The following MPI functions require a communicator as a parameter. This communicator describes the group of processes that are addressed by the corresponding MPI function. By default, all processes are collected in MPI_COMM_WORLD, which is one of the constants supplied by MPI. We restrict the examples to those global operations only. For this purpose, create the special MPI-type variable

MPI_Comm icomm = MPI_COMM_WORLD;

which is used as a parameter.

E3.1. Compile the program in Example/firstc via make. Start the program.

E3.2. Write your first parallel program by using MPI_Init and MPI_Finalize, compile the program with make, and start 4 processes via mpirun -np 4 first.LINUX.

E3.3. Determine the number of your parallel processes and the local process rank by using the routines MPI_Comm_rank and MPI_Comm_size. Let the root process (rank = 0) write the number of running processes. Start with different numbers of processes.

E3.4. The file greetings.c includes a routine Greetings(myid,numprocs,icomm) that prints the names of the hosts your processes are running on. Call that routine from your main program and change the routine such that the output is ordered with respect to the process ranks. Study the routines MPI_Send and MPI_Recv.

E3.5. Write a routine

Send_ProcD(to,nin,xin,icomm)

which sends nin double precision numbers of the array xin to the process to. Note that the receiving process to does not normally have any information about the length of the data it receives. Also write a corresponding routine

Recv_ProcD(from,nout,xout,maxbuf,icomm)

which receives nout double precision numbers of the array xout from the process from. A priori, the receiving process does not have any information about the length of the data to be received, i.e., nout is an output parameter. maxbuf stands for the maximum length of the array xout.

E3.6. Test the routines from E3.5 with two processes first. Let process 1 send data and process 0 receive them. Extend the test to more processes.

E3.7. Combine the routines from E3.5 into one routine

ExchangeD(yourid,nin,xin,nout,xout,maxbuf,icomm),

which exchanges double precision data between the calling process and another process yourid. The remaining parameters are the same as in E3.5. Test your routines with two and more processes.


Chapter 4

Galerkin Finite Element Discretization of Elliptic PDE

If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is. – John von Neumann (1903–1957)

In Chapter 2, we considered the Poisson equation and the finite difference method (FDM) as the simplest method for its discretization. In this chapter, we give a brief introduction to the finite element method (FEM), which is the standard discretization technique for elliptic boundary value problems (BVP). As we will see, the FEM is nothing more than a Galerkin method with special basis functions. In contrast to the FDM, where we used the classical formulation of the elliptic BVP as the starting point for the discretization, the Galerkin FEM starts with the variational, or weak, formulation of the BVP that we want to solve.

4.1 Variational Formulation of Elliptic BVPs

In this section, we derive the variational (weak) formulation of some model BVP for a scalar elliptic second-order partial differential equation (PDE). The same procedure is used for elliptic PDE systems (e.g., for the linear elasticity problem) and for higher-order equations (e.g., for the biharmonic problem).

The abstract form of the variational formulation of any linear elliptic problem in divergence form reads as follows:

\[ \text{Find } u \in V_g : \quad a(u, v) = \langle F, v \rangle \quad \forall v \in V_0, \tag{4.1} \]

with a given continuous (= bounded) bilinear form $a(\cdot,\cdot) : V \times V \to \mathbb{R}^1$ defined on some Hilbert space V equipped with the scalar product $(\cdot,\cdot)$ and the corresponding norm $\|\cdot\|$, and a given continuous linear functional $\langle F, \cdot \rangle : V_0 \to \mathbb{R}^1$ defined on some subspace $V_0$ of V. We search for the solution u in the set $V_g := g + V_0 \equiv \{v \in V : \exists\, w \in V_0,\ v = g + w\}$ of admissible functions, which is usually a linear manifold defined by the inhomogeneous ($g \neq 0$) Dirichlet boundary conditions. As we will see, integration by parts is the basic technique for deriving the variational formulation from the classical statement of the boundary value problem.


Let us first consider the classical formulation of a mixed boundary value problem for a second-order linear elliptic PDE in divergence form:

Find $u \in X := C^2(\Omega) \cap C^1(\Omega \cup \Gamma_2 \cup \Gamma_3) \cap C(\Omega \cup \Gamma_1)$ such that the partial differential equation

\[ -\sum_{i,j=1}^{m} \frac{\partial}{\partial x_i}\Big(a_{ij}(x)\,\frac{\partial u}{\partial x_j}\Big) + \sum_{i=1}^{m} a_i(x)\,\frac{\partial u}{\partial x_i} + a(x)\,u(x) = f(x) \tag{4.2} \]

holds for all $x \in \Omega$ and that the boundary conditions (BC)

• $u(x) = g_1(x), \ \forall x \in \Gamma_1$ (Dirichlet (1st-kind) BC),

• $\displaystyle \frac{\partial u}{\partial N} := \sum_{i,j=1}^{m} a_{ij}(x)\,\frac{\partial u(x)}{\partial x_j}\, n_i(x) = g_2(x), \ \forall x \in \Gamma_2$ (Neumann (2nd-kind) BC),

• $\displaystyle \frac{\partial u}{\partial N} + \alpha(x)\,u(x) = g_3(x), \ \forall x \in \Gamma_3$ (Robin (3rd-kind) BC)

are satisfied.

The computational domain $\Omega \subset \mathbb{R}^m$ (m = 1, 2, 3) is supposed to be bounded and the boundary $\Gamma = \partial\Omega = \Gamma_1 \cup \Gamma_2 \cup \Gamma_3$ should be sufficiently smooth. Throughout this chapter, $n = (n_1, \dots, n_m)^T$ denotes the outer unit normal. $\partial u / \partial N$ is sometimes called the co-normal derivative. It is clear that the Dirichlet problem (2.1) for the Poisson equation discussed in Chapter 2 is a special case of (4.2). The stationary heat conduction equation is another special case of (4.2); see Example 4.3.

Remark 4.1.

1. A function u ∈ X solving (4.2) is called a classical solution of the BVP (4.2).

2. Note that the data $a_{ij}, a_i, a, \alpha, f, g, \Omega$ should satisfy classical smoothness assumptions, such as $a_{ij} \in C^1(\Omega) \cap C(\Omega \cup \Gamma_2 \cup \Gamma_3)$, etc.

3. We assume uniform ellipticity of (4.2) in $\Omega$, i.e.,

• $a_{ij}(x) = a_{ji}(x), \ \forall x \in \Omega, \ \forall i, j = 1, \dots, m$,

• $\exists\, \mu_1 = \text{const} > 0$ such that

\[ \Lambda(x, \xi) := \sum_{i,j=1}^{m} a_{ij}(x)\,\xi_i\,\xi_j \ge \mu_1 |\xi|^2, \quad \forall \xi \in \mathbb{R}^m, \ \forall x \in \Omega. \]

4. The Robin boundary condition is often written in the form

\[ -\frac{\partial u}{\partial N} = \alpha(x)\,\big(u(x) - u_e(x)\big), \]

where $g_3(x) = \alpha(x)\,u_e(x)$ with given smooth functions $\alpha$ and $u_e$.

As mentioned in Remark 4.1, we have to assume that the data of the BVP (4.2) are smooth. However, in practice, the computational domain Ω is usually composed of several subdomains with different material properties. In this case, the differential equation is only valid in the subdomains, where the data are supposed to be smooth (e.g., constant). On the interfaces between these subdomains, we have to impose interface conditions forcing the function u (the temperature in the case of a heat conduction problem) and its co-normal derivative (the negative heat flux) to be continuous. The variational formulation that we are going to study now covers this practically important case automatically. A heat conduction problem of this type will be introduced shortly and will be used as a model problem throughout this chapter.

So, let us first derive the variational formulation of the BVP (4.2). The formal procedure for the derivation of the variational formulation consists of the following 5 steps:

1. Choose the space of test functions $V_0 = \{v \in V = H^1(\Omega) : v = 0 \text{ on } \Gamma_1\}$, where $V = H^1(\Omega)$ is the basic space for scalar second-order PDEs. The Sobolev space $H^1(\Omega)$, sometimes also denoted by $W_2^1(\Omega)$, consists of all quadratically integrable functions with quadratically integrable derivatives, and it is equipped with the scalar product

\[ (u, v)_{1,\Omega} := \int_\Omega \big( u\,v + \nabla^T u\, \nabla v \big)\, dx \quad \forall u, v \in H^1(\Omega) \]

and the corresponding norm $\|\cdot\|_{1,\Omega} := \sqrt{(\cdot,\cdot)_{1,\Omega}}$ (see [1, 22] for more details about Sobolev spaces).

2. Multiply the PDE (4.2) by test functions $v \in V_0$ and integrate over $\Omega$:

\[ \int_\Omega \Big( -\sum_{i,j=1}^{m} \frac{\partial}{\partial x_i}\Big(a_{ij}\frac{\partial u}{\partial x_j}\Big) + \sum_{i=1}^{m} a_i\,\frac{\partial u}{\partial x_i} + a\,u \Big)\, v\, dx = \int_\Omega f\, v\, dx, \quad \forall v \in V_0. \]

3. Use integration by parts,

\[ \int_\Omega \frac{\partial w}{\partial x_i}\, v\, dx = -\int_\Omega w\, \frac{\partial v}{\partial x_i}\, dx + \int_{\partial\Omega} w\, v\, n_i\, ds, \]

in the principal term (= the term with the second-order derivatives):

\[ \int_\Omega \Big( \sum_{i,j=1}^{m} a_{ij}\,\frac{\partial u}{\partial x_j}\,\frac{\partial v}{\partial x_i} + \sum_{i=1}^{m} a_i\,\frac{\partial u}{\partial x_i}\, v + a\,u\,v \Big)\, dx - \int_{\partial\Omega} \underbrace{\sum_{i,j=1}^{m} a_{ij}\,\frac{\partial u}{\partial x_j}\, n_i}_{=:\ \frac{\partial u}{\partial N}\ \text{(co-normal derivative)}}\, v\, ds = \int_\Omega f\, v\, dx \quad \forall v \in V_0. \]

4. Incorporate the natural (Neumann and Robin) BC on $\Gamma_2$ and $\Gamma_3$:

\[ \int_\Gamma \frac{\partial u}{\partial N}\, v\, ds = \int_{\Gamma_1} \frac{\partial u}{\partial N}\, v\, ds + \int_{\Gamma_2} g_2\, v\, ds + \int_{\Gamma_3} (g_3 - \alpha u)\, v\, ds. \]

5. Define the linear manifold where the solution u is searched for:

\[ V_g = \{v \in V = H^1(\Omega) : v = g_1 \text{ on } \Gamma_1\}. \]


Summarizing, we arrive at the variational formulation

Find $u \in V_g$ such that $a(u, v) = \langle F, v \rangle \ \forall v \in V_0$, where

\[
\begin{aligned}
a(u, v) &:= \int_\Omega \Big( \sum_{i,j=1}^{m} a_{ij}\frac{\partial u}{\partial x_j}\frac{\partial v}{\partial x_i} + \sum_{i=1}^{m} a_i\,\frac{\partial u}{\partial x_i}\, v + a\,u\,v \Big)\, dx + \int_{\Gamma_3} \alpha\,u\,v\, ds, \\
\langle F, v \rangle &:= \int_\Omega f\,v\, dx + \int_{\Gamma_2} g_2\,v\, ds + \int_{\Gamma_3} g_3\,v\, ds, \\
V_g &:= \{v \in V = H^1(\Omega) : v = g_1 \text{ on } \Gamma_1\}, \\
V_0 &:= \{v \in V : v = 0 \text{ on } \Gamma_1\}.
\end{aligned}
\tag{4.3}
\]

Remark 4.2.

1. The solution u ∈ Vg of (4.3) is called the weak or generalized solution.

2. For the variational formulation (4.3), the assumptions imposed on the data can be weakened (the integrals involved in (4.3) should exist), e.g.:

\[
\begin{aligned}
&1)\ a_{ij}, a_i, a \in L_\infty(\Omega),\ \alpha \in L_\infty(\Gamma_3),\ \text{i.e., bounded},\\
&2)\ f \in L_2(\Omega),\ g_i \in L_2(\Gamma_i),\ i = 2, 3,\\
&3)\ g_1 \in H^{1/2}(\Gamma_1),\ \text{i.e.,}\ \exists\, \tilde g_1 \in H^1(\Omega) : \tilde g_1|_{\Gamma_1} = g_1,\\
&4)\ \Omega \subset \mathbb{R}^m \text{ is supposed to be bounded with a Lipschitz-continuous boundary } \Gamma = \partial\Omega \in C^{0,1},\\
&5)\ \text{uniform ellipticity, i.e., } \exists\, \mu_1 = \text{const} > 0 \text{ such that}\\
&\qquad \sum_{i,j=1}^{m} a_{ij}(x)\,\xi_i\,\xi_j \ge \mu_1 |\xi|^2\ \ \forall \xi \in \mathbb{R}^m,\qquad a_{ij}(x) = a_{ji}(x),\ \forall i, j = 1, \dots, m,\quad \text{a.e. in } \Omega.
\end{aligned}
\tag{4.4}
\]

Example 4.3 (The plain heat conduction problem CHIP). Find the temperature field $u \in V_g$ such that the variational equation

\[ a(u, v) = \langle F, v \rangle \quad \forall v \in V_0 \]

holds, where

\[
\begin{aligned}
a(u, v) &:= \int_\Omega \Big( \lambda\,\frac{\partial u(x)}{\partial x_1}\frac{\partial v(x)}{\partial x_1} + \lambda\,\frac{\partial u(x)}{\partial x_2}\frac{\partial v(x)}{\partial x_2} + a(x)\,u(x)\,v(x) \Big)\, dx + \int_{\Gamma_3} \alpha\,u\,v\, ds, \\
\langle F, v \rangle &:= \int_\Omega f\,v\, dx + \int_{\Gamma_2} g_2\,v\, ds + \int_{\Gamma_3} g_3\,v\, ds, \\
V_g &:= \{v \in V = H^1(\Omega) : v = g_1 \text{ on } \Gamma_1\}, \\
V_0 &:= \{v \in V = H^1(\Omega) : v = 0 \text{ on } \Gamma_1\}.
\end{aligned}
\]

The computational domain Ω consists of the two subdomains $\Omega_I$ and $\Omega_{II}$ with different material properties (see Fig. 4.1). In the silicon subdomain $\Omega_I$, $\lambda = \lambda_I = 1\ \mathrm{W\,(mK)^{-1}}$ (heat conduction coefficient), $a = 0$ (no heat transfer through the top and bottom surfaces), and $f = 0$, whereas in the copper subdomain $\Omega_{II}$, $\lambda = \lambda_{II} = 371\ \mathrm{W\,(mK)^{-1}}$, $a = 0$ and $f = 0$, i.e., the domain data are piecewise constant. The boundary data on $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ (see Fig. 4.1 for the precise location of the boundary pieces) are prescribed as follows:

$g_1 = 500\,\mathrm{K}$ (prescribed temperature on $\Gamma_1$),
$g_2 = 0$, i.e., heat flux isolation on $\Gamma_2$,
$g_3 = \alpha\,u_e$ (cf. Remark 4.1),
$\alpha = 5.6\ \mathrm{W\,(m^2 K)^{-1}}$ (heat transfer coefficient),
$u_e = 300\,\mathrm{K}$ (exterior temperature).

Figure 4.1: Computational domain Ω consisting of two different materials ($\Omega_I$ – Material 1, $\Omega_{II}$ – Material 2), with boundary pieces $\Gamma_1$, $\Gamma_2$, $\Gamma_3$ and interface $\Gamma_{In}$.

It is clear that the variational formulation of the problem CHIP given in Example 4.3 fulfills the requirements formulated in Remark 4.2. The solution $u \in V_g$ of the variational problem (4.1) has the form $u = g + w$, with some given admissible function $g \in V_g$ (e.g., g = 500) and some unknown function $w \in V_0$. Therefore, we can rewrite (4.1) in the form

Find w ∈ V0 : a(w, v) = 〈F, v〉 − a(g, v), ∀ v ∈ V0. (4.5)

This procedure is called homogenization of the essential boundary conditions. Indeed, for our variational problem (4.3), it is sufficient to find some function $g \in H^1(\Omega)$ satisfying the Dirichlet boundary conditions $g = g_1$ on $\Gamma_1$. This is always possible if the function $g_1$ is from $H^{1/2}(\Gamma_1)$ (see, e.g., [1, 22]). Then the function w is searched for in $V_0$ such that (4.5) is satisfied, where the bilinear form and the linear form are defined as in (4.3). The homogenized variational formulation (4.5) simplifies the existence and uniqueness analysis. For simplicity, let us denote the right hand side of (4.5) again by $\langle F, v \rangle$. Therefore, for the time being, we consider the variational problem in homogenized form

Find w ∈ V0 : a(w, v) = 〈F, v〉, ∀ v ∈ V0, (4.6)

where $a(\cdot,\cdot) : V \times V \to \mathbb{R}^1$ and $F \in V_0^*$ are supposed to be a given continuous (= bounded) bilinear form and a given continuous (= bounded) linear functional, respectively. Let us now consider the homogenized variational problem (4.6) under the standard assumptions:

\[
\begin{aligned}
&1)\ F \in V_0^* \ \text{(space of linear, continuous functionals)},\\
&2)\ a(\cdot,\cdot) : V_0 \times V_0 \to \mathbb{R}^1 \ \text{bilinear form on } V_0:\\
&\quad 2a)\ V_0\text{-elliptic}:\ \ \mu_1 \|v\|^2 \le a(v, v) \quad \forall v \in V_0,\\
&\quad 2b)\ V_0\text{-bounded}:\ \ |a(u, v)| \le \mu_2 \|u\|\,\|v\| \quad \forall u, v \in V_0,
\end{aligned}
\tag{4.7}
\]

where the constants $\mu_1$ and $\mu_2$ are supposed to be positive. Recall that $V_0 \subset V$ is usually an infinite dimensional, closed, non-trivial subspace of the Hilbert space V equipped with the scalar product $(\cdot,\cdot)$ and the corresponding norm $\|\cdot\|$. The standard assumptions (4.7) are sufficient for existence and uniqueness, as the Lax–Milgram theorem states:

Theorem 4.4 (Lax–Milgram). Let us assume that our standard assumptions (4.7) are fulfilled. Then there exists a unique solution $u \in V_0$ of the variational problem (4.6).

Proof. Using the operator representation of (4.6) and the Riesz isomorphism between $V_0$ and its dual space $V_0^*$, we can easily rewrite problem (4.6) as an operator fixed point problem in $V_0$. Then the proof of the results stated in the theorem immediately follows from Banach's fixed point theorem (see [22]). Moreover, Banach's fixed point theorem provides us with a fixed point iteration for approximating the solution, together with a priori and a posteriori iteration error estimates. Later on we will use this iteration method for solving the Galerkin equations.

4.2 Galerkin Finite Element Discretization

This section provides the basic knowledge of the finite element method (FEM). First we consider the Galerkin method for solving the abstract variational problem (4.1) introduced in Section 4.1. Then we show that the FEM is nothing more than the Galerkin method with special basis functions. We focus on the practical aspects of the implementation of the FEM. To illustrate the main implementation steps, we consider our model problem CHIP introduced in Section 4.1. The section is concluded by some introductory analysis of the Galerkin FEM and an overview of some iterative solvers derived from Banach's fixed point theorem. More information on the FEM can be found in the standard monograph [22] on "The Finite Element Method for Elliptic Problems" (1978) by Ph. Ciarlet as well as in lecture notes on numerical methods for PDEs [3], [11], [50].

4.2.1 The Galerkin Method

The Galerkin method looks for some approximation $u_h \in V_{gh} = g_h + V_{0h} \subset V_g$ to the exact solution $u \in V_g$ of the variational problem (4.1) in some finite dimensional counterpart $V_{gh}$ of the linear manifold $V_g$ such that $u_h$ satisfies the variational equation (4.1) for all test functions $v_h$ from the finite dimensional subspace $V_{0h} \subset V_0$ generating $V_{gh}$.

To be more specific, let us first introduce the finite dimensional space

\[ V_h = \mathrm{span}\{\varphi^{(i)} : i \in \overline{\omega}_h\} = \Big\{ v_h = \sum_{i \in \overline{\omega}_h} v^{(i)} \varphi^{(i)} \Big\} = \mathrm{span}\,\Phi \subset V \tag{4.8} \]


spanned by the (linearly independent) basis functions $\Phi = [\varphi^{(i)} : i \in \overline{\omega}_h] = [\varphi_1, \dots, \varphi_{\overline{N}_h}]$, where $\overline{\omega}_h$ is the index set providing the "numbering" of the basis functions. Due to the linear independence of the basis functions, the dimension $\dim V_h$ of the space $V_h$ is given by $\dim V_h = |\overline{\omega}_h| = \overline{N}_h = N_h + \partial N_h < \infty$. After introducing the basic finite dimensional space $V_h$, we can define the finite dimensional linear manifold

\[ V_{gh} = V_h \cap V_g = g_h + V_{0h} = \Big\{ v_h = \sum_{i \in \gamma_h} u_*^{(i)} \varphi^{(i)} + \sum_{i \in \omega_h} v^{(i)} \varphi^{(i)} \Big\} \subset V_g \tag{4.9} \]

and the corresponding finite dimensional test space

\[ V_{0h} = V_h \cap V_0 = \Big\{ v_h = \sum_{i \in \omega_h} v^{(i)} \varphi^{(i)} \Big\} = \mathrm{span}\,\Phi \subset V_0, \tag{4.10} \]

where $\dim V_{0h} = |\omega_h| = N_h$. Note that we implicitly assume that $V_h \cap V_g \neq \emptyset$ and that at least one element $g_h = \sum_{i \in \gamma_h} u_*^{(i)} \varphi^{(i)} \in V_g \cap V_h$ is available, where $\gamma_h := \overline{\omega}_h \setminus \omega_h$. For instance, the 1D linear basis functions are presented in the following picture:

[Picture: the 1D piecewise linear basis functions $\varphi^{(1)}, \varphi^{(2)}, \varphi^{(3)}$ (hat functions) on the interval [0, 1].]

Once the basis functions are defined, it is easy to verify that the Galerkin scheme is equivalent to the Galerkin system of algebraic equations for defining the Galerkin coefficient vector, which is named the skeleton solution (SS), as illustrated in Fig. 4.2.

Before we analyze the Galerkin scheme (existence, uniqueness, and discretization error estimates) and discuss solvers for the system of equations

\[ K_h \underline{u}_h = \underline{f}_h \tag{4.11} \]

arising from the Galerkin discretization, we introduce the FEM as a Galerkin method with special basis functions, resulting in probably the most powerful method in the world for solving elliptic PDEs.

4.2.2 The Simplest Finite Element Schemes

The basic idea for overcoming the principal difficulties of the classical Galerkin method (i.e., the use of basis and test functions with global supports, e.g., polynomials) goes back at least to the paper [24] by R. Courant (1943), in a strong mathematical sense, and was rediscovered by engineers in the mid-fifties (see also the historical review [4] by I. Babuska): Use basis and test functions $\varphi^{(i)} = \varphi^{(i)}(x)$ with local supports, where the $\varphi^{(i)}$ can be defined element-wise by shape functions.

In order to construct such basis functions, the computational domain Ω is decomposed into "sufficiently small" subdomains $\delta_r$, the so-called finite elements (e.g., triangles or quadrilaterals in 2D; tetrahedra or hexahedra in 3D). R. Courant defined his basis function $\varphi^{(i)}$ on some triangulation (Ω is decomposed into triangles) in such a way that


Figure 4.2: Galerkin Scheme. Choosing the basis $\Phi = [\varphi^{(i)} : i \in \overline{\omega}_h]$, writing $u_h = \sum_{i \in \omega_h} u^{(i)}\varphi^{(i)} + \sum_{i \in \gamma_h} u_*^{(i)}\varphi^{(i)}$ (with given Dirichlet values $u_*^{(i)}$), and testing with $v_h = \varphi^{(k)}$, $k \in \omega_h$, turns the variational problem "Find $u_h \in V_{gh} : a(u_h, v_h) = \langle F, v_h \rangle\ \forall v_h \in V_{0h}$" into the Galerkin system $K_h \underline{u}_h = \underline{f}_h$ with the $(N_h \times N_h)$ stiffness matrix $K_h = [a(\varphi^{(i)}, \varphi^{(k)})]_{k,i \in \omega_h}$ and the load vector $\underline{f}_h = \big[\langle F, \varphi^{(k)} \rangle - \sum_{i \in \gamma_h} u_*^{(i)} a(\varphi^{(i)}, \varphi^{(k)})\big]_{k \in \omega_h} \in \mathbb{R}^{N_h}$ (Galerkin isomorphism).

it is continuous, linear on each triangle, 1 at the nodal point (vertex of the triangle) $x^{(i)}$ to which it belongs, and 0 at all the other nodal points of the triangulation; see Fig. 4.3. Due to its shape, Courant's basis function is also called a "hat function" or "chapeau function".

Using such basis functions, we search for the Galerkin solution of our variational problem in the form

\[ u_h(x) = \sum_{i \in \overline{\omega}_h} u^{(i)} \varphi^{(i)}(x) \in V_h. \]

The piecewise linear finite element function $u_h$ is obviously continuous, but not continuously differentiable across the edges of the elements. Due to the nodal basis property $\varphi^{(i)}(x^{(j)}) = \delta_{ij}$, the value $u_h(x^{(i)})$ of the finite element function $u_h$ at the nodal point $x^{(i)}$ is equal to the coefficient $u^{(i)}$ for all nodal points $x^{(i)} \in \overline{\Omega}$, where $\delta_{ij}$ denotes Kronecker's symbol ($\delta_{ij} = 1$ for $i = j$ and $\delta_{ij} = 0$ for $i \neq j$, respectively). The use of such finite element basis functions with local supports instead of basis functions with global supports is advantageous from various points of view:

Figure 4.3: Courant's basis function ("hat function") $\varphi^{(i)}$, defined element-wise by shape functions on its local support $\mathrm{supp}(\varphi^{(i)})$; $x^{(i)}$ is the node belonging to $\varphi^{(i)}$ and $\varphi^{(i)}(x^{(j)}) = \delta_{ij}$ (nodal basis).

1. There is a great flexibility in satisfying the Dirichlet boundary conditions.

2. The system matrix (= stiffness matrix) is sparse because of the locality of the supports of the basis and test functions, i.e., $a(\varphi^{(i)}, \varphi^{(k)}) = 0$ if $\mathrm{supp}(\varphi^{(i)}) \cap \mathrm{supp}(\varphi^{(k)}) = \emptyset$.

3. The calculation of the coefficients of the stiffness matrix $K_h$ and the load vector $\underline{f}_h$ is based on the quite efficient element-wise numerical integration technique.

4. It is easy to construct better and better finite element approximations $u_h$ to the exact solution u of our variational problem by refining the finite elements, even within an adaptive procedure ("limiting completeness" of the family of f.e. spaces $V_{0h}$).

For the sake of simplicity of the presentation, we assume in the following that our computational domain Ω is plain and polygonally bounded. It is clear that such a computational domain can be decomposed exactly into triangular finite elements $\delta_r$. Furthermore, we restrict ourselves to Courant's basis functions mentioned above (the linear triangular finite elements). Of course, this approach can easily be generalized to more complicated computational domains in 2D and 3D as well as to other finite elements (see [22, 73, 117]).

Triangulation (Domain Discretization)

The first step in the finite element Galerkin approach consists in the discretization of the computational domain, the triangulation. The triangulation $T_h = \{\delta_r : r \in R_h := \{1, 2, \dots, R_h\}\}$ is obtained by decomposing the computational domain Ω into triangles $\delta_r$ such that the following conditions are fulfilled:

1. $\overline{\Omega} = \bigcup_{r \in R_h} \overline{\delta}_r$.

2. For all $r, r' \in R_h$, $r \neq r'$, the intersection

\[ \overline{\delta}_r \cap \overline{\delta}_{r'} = \begin{cases} \emptyset, \text{ or} \\ \text{a common edge, or} \\ \text{a common vertex.} \end{cases} \]

Among other things, the manner of the decomposition of Ω depends on:

1. The geometry of the domain (e.g., re-entrant corners on the boundary ∂Ω and micro scales in the domain).

2. The input data of the elliptic boundary value problem (e.g., jumping coefficients → interfaces, behavior of the right hand side, mixed boundary conditions → points where the type of the boundary conditions changes).

3. The accuracy imposed on the finite element solution (→ choice of the fineness of the mesh and/or the degree of the shape functions forming the finite element basis functions).

The following general hints should be taken into account by the triangulation procedure used:

1. Due to the second condition mentioned above, the triangulations in Fig. 4.4 are inadmissible.

Figure 4.4: Examples of inadmissible meshes (a vertex of one triangle lies in the interior of an edge of a neighboring triangle).

2. The mesh should be adapted to the change in the type of the boundary conditions (see Fig. 4.5).

3. The interfaces should be captured by the triangulation (see Fig. 4.6).


Figure 4.5: Change in the type of the boundary conditions (BC). The point where the BC changes from 1st kind to 2nd kind must be a mesh node (right), not lie in the interior of an element edge (wrong).

Figure 4.6: Capturing interfaces by triangulation. Element edges must follow the interface (right); elements must not be crossed by it (wrong).

4. The mesh should be refined appropriately in regions with a strongly changing gradient. This strongly changing gradient is caused by re-entrant corners on the boundary, corners in the interfaces, boundary points where the boundary conditions change, etc.

5. Avoid triangles with too acute or too obtuse angles, especially triangles of the form shown in Fig. 4.7 (they cause a bad condition number of the stiffness matrix and possibly bad approximation properties of the finite element solution).

Figure 4.7: Triangle with an obtuse angle.

The triangulation procedure (mesh generator) for plain, polygonally bounded computational domains Ω produces the coordinates of the vertices (or nodes) of the triangles. We construct both a global and a local numbering of the vertices and the connectivity of the vertices. This is called a discrete decomposition of the domain.

Global: Numbering of all nodes and elements:

$\overline{\omega}_h = \{1, 2, \dots, \overline{N}_h\}$, $\overline{N}_h$ – number of nodes,
$R_h = \{1, 2, \dots, R_h\}$, $R_h$ – number of elements.

Determination of the nodal coordinates:

$x^{(i)} = (x_{1,i}, x_{2,i})$, $i = 1, 2, \dots, \overline{N}_h$.

Local: Anticlockwise numbering of the nodes in every triangle, see Fig. 4.8:

Nodal coordinates of the nodes $x_\alpha^{(r)}$: $x_\alpha^{(r)} = (x_{1,\alpha}^{(r)}, x_{2,\alpha}^{(r)})$, $\alpha \in A = A^{(r)} = \{1, 2, 3\}$.

Figure 4.8: Local numbering of the nodes $x_1^{(r)}, x_2^{(r)}, x_3^{(r)}$ in a triangle $\delta^{(r)}$.

Connectivity: Determination of the connectivity between the local and global numberings of the nodes for all triangles $\delta^{(r)}$, $r \in R_h$:

\[ r :\ \alpha \leftrightarrow i = i(r, \alpha), \quad \alpha \in A^{(r)},\ i \in \overline{\omega}_h. \]

Therefore, a finite element code usually requires at least two arrays describing the triangulation, namely the so-called finite element connectivity array $r : \alpha \leftrightarrow i$, $r \in R_h$, $\alpha \in A^{(r)} = A$, $i \in \overline{\omega}_h$, and the real array of all nodal coordinates $i : (x_{1,i}, x_{2,i})$. These two arrays are presented in Tables 4.1 and 4.2 for the triangulation of the model problem CHIP shown in Fig. 4.9.

Further arrays for characterizing nodes or elements are required by many finite element codes, e.g., in order to assign material properties (MP) to the elements (see Table 4.1), or in order to describe the boundary ∂Ω and to code the boundary conditions (see, e.g., [73]).
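A minimal sketch of how these arrays might be stored in a C code; the struct layout and field names are illustrative assumptions, not the data structures of an actual finite element package:

```c
/* Sketch (assumed data layout, not the book's course code): minimal mesh
 * arrays for linear triangles as described above.                         */
typedef struct {
    int     NNodes;      /* number of nodes                                */
    int     NElems;      /* number of elements (R_h)                       */
    double *x1, *x2;     /* nodal coordinates, each of length NNodes       */
    int    *conn;        /* connectivity: conn[3*r + alpha] = global node
                            number i(r, alpha) for alpha = 0, 1, 2         */
    int    *material;    /* material property MP per element, length NElems */
} Mesh;
```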

There are several methods for generating the mesh data from some geometrical description of the computational domain and from the PDE data. If it is possible to decompose the computational domain into a coarse mesh, as for our model problem CHIP, then one can simply edit the arrays described above by hand. More complicated 2D or 3D computational domains require the use of mesh generators that automatically produce meshes from the input data (see [39, 101, 73]). These hand generated or automatically generated meshes can then be refined further via some a priori or a posteriori refinement procedures [110].

Definition of the FE Basis Function

Once the mesh is generated, the finite element basis functions $\varphi^{(i)}$ can be locally defined via the element shape functions $\varphi_\alpha^{(r)}$ that are obtained from the master element shape functions by mapping.


Figure 4.9: Triangulation of the model problem CHIP: 21 nodes (nodes 18–21 lying on the Dirichlet boundary $\Gamma_1$) and 24 triangular elements with local anticlockwise node numbering; the remaining boundary consists of pieces of $\Gamma_2$ and $\Gamma_3$, and the elements belong to the two material regions MP = 1 and MP = 2.

Element number   global nodal numbers of the local nodes   Material Property (MP)
                 $\varphi_1^{(r)}$   $\varphi_2^{(r)}$   $\varphi_3^{(r)}$
1                1    2    6                              1
2                2    7    6                              1
...
19               8    14   13                             1
20               8    9    14                             1
...
$R_h$ = 24       12   13   17                             2

Table 4.1. Element connectivity table.



    i    |  1    2     3    · · ·   N̄h = 21
    x1,i | 0.0  0.17  0.5   · · ·   0.65
    x2,i | 0.0  0.0   0.0   · · ·   0.3

Table 4.2. Table of the nodal coordinates.


Figure 4.10: Mapping between an arbitrary element and the master element

The affine linear transformation x = Jδ(r) ξ + x(i) is given by

    ( x1 )   ( x1,j − x1,i   x1,k − x1,i ) ( ξ1 )   ( x1,i )
    ( x2 ) = ( x2,j − x2,i   x2,k − x2,i ) ( ξ2 ) + ( x2,i ).                      (4.12)

This transformation maps the master element ∆ (= unit triangle) onto an arbitrary triangle δ(r) of our triangulation Th, see Fig. 4.10. The relation

    |det Jδ(r)| = 2 meas δ(r)

immediately follows from the elementary calculation

    meas δ(r) = ∫_δ(r) dx = ∫_∆ |det Jδ(r)| dξ = |det Jδ(r)| ∫_∆ dξ = (1/2) |det Jδ(r)|,

where meas δ(r) denotes the area of the element δ(r). The inverse mapping ξ = Jδ(r)⁻¹ (x − x(i)) is given by the formula

    ( ξ1 )        1        (   x2,k − x2,i    −(x1,k − x1,i) ) ( x1 − x1,i )
    ( ξ2 ) = ----------- · ( −(x2,j − x2,i)     x1,j − x1,i  ) ( x2 − x2,i ).
             det Jδ(r)

Now using the element connectivity relation r : α ↔ i and the affine linear transformations, we can obviously represent the finite element basis functions ϕ(i) by the formula

    ϕ(i)(x) = { ϕ(r)α(x),  x ∈ δ(r), r ∈ Bi   (shape function),
              { 0,         otherwise, i.e., ∀ x ∈ Ω \ ∪_{r∈Bi} δ(r),               (4.13)


where Bi = {r ∈ Rh : x(i) ∈ δ(r)}, ϕ(r)α(x) = φα(ξδ(r)(x)), x ∈ δ(r), and φα denotes the master element shape function that corresponds to the element shape function ϕ(r)α. For Courant's triangle, we have the three master element shape functions

    φ1 = 1 − ξ1 − ξ2,
    φ2 = ξ1,                                                                       (4.14)
    φ3 = ξ2,

that correspond to the nodes (0, 0), (1, 0), and (0, 1), respectively.
Let us consider the triangulation of our model problem CHIP given in Fig. 4.9 as an example. If we choose, e.g., the node with the global number 8, then B8 = {8, 9, 18, 19, 20}. The element-wise definition of the basis function ϕ(8) is now illustrated in Fig. 4.11.

On the triangle δ(20), for instance, ϕ(8)(x)|δ(20) = ϕ(20)1(x) := φ1(ξδ(20)(x)).

Figure 4.11: Definition of the FE basis function ϕ(8)

Now it is clear that

    ϕ(i)(x(j)) = δij  for all i, j ∈ ω̄h   (→ nodal basis),

    Vh   = { vh(x) : vh(x) = Σ_{i∈ω̄h} v(i) ϕ(i)(x) } ⊂ V = H¹(Ω),

    V0h  = { vh(x) : vh(x) = Σ_{i∈ωh} v(i) ϕ(i)(x) } ⊂ V0,

    Vg1h = { vh(x) : vh(x) = Σ_{i∈ωh} v(i) ϕ(i)(x) + Σ_{i∈γh} g1(x(i)) ϕ(i)(x) }.

The nodal basis relation ϕ(i)(x(j)) = δij implies that vh(x(i)) = v(i). For the triangulation given in Fig. 4.9, we have

    ω̄h = {1, 2, . . . , 21}  and  ωh = ω̄h \ γh,   γh = {18, 19, 20, 21}.


Assembling of the Finite Element Equations

Assembling the finite element equations is a simple procedure, but it is sometimes quite confusing the first time someone sees the process. Strang and Fix [99] provide many details for a one-dimensional boundary value problem that the reader may find easier to follow than what follows.

We are now going to describe the element-wise assembling of the system of finite element equations, which is nothing more than the Galerkin system for our finite element basis functions (cf. Section 4.2.1): Find uh ∈ R^{Nh} such that

    Kh uh = fh,                                                                    (4.15)

with

    Kh = [ a(ϕ(i), ϕ(k)) ]_{i,k∈ωh},
    fh = [ ⟨F, ϕ(k)⟩ − Σ_{i∈γh} u(i)* a(ϕ(i), ϕ(k)) ]_{k∈ωh}.

For our model problem CHIP that we consider shortly, the bilinear form a(·, ·) and the linear form ⟨F, ·⟩ are defined as follows (cf. again Section 4.2.1):

    a(ϕ(i), ϕ(k)) = ∫_Ω [ λ1 ∂ϕ(i)/∂x1 · ∂ϕ(k)/∂x1 + λ2 ∂ϕ(i)/∂x2 · ∂ϕ(k)/∂x2 ] dx + ∫_Γ3 α ϕ(i) ϕ(k) ds,

    ⟨F, ϕ(k)⟩ = ∫_Ω f(x) ϕ(k)(x) dx + ∫_Γ2 g2(x) ϕ(k)(x) ds + ∫_Γ3 α g3 ϕ(k) ds.

In the first step, we ignore the boundary conditions and generate the extended matrix

    K̄h = [ a(ϕ(i), ϕ(k)) ]_{i,k∈ω̄h} = [ ∫_Ω [ λ1 ∂ϕ(i)/∂x1 · ∂ϕ(k)/∂x1 + λ2 ∂ϕ(i)/∂x2 · ∂ϕ(k)/∂x2 ] dx ]_{i,k∈ω̄h},

and the extended right-hand side

    f̄h = [ ⟨F̄, ϕ(k)⟩ ]_{k∈ω̄h} = [ ∫_Ω f(x) ϕ(k)(x) dx ]_{k∈ω̄h}.

The finite element technology for the element-wise generation of the matrix K̄h and the right-hand side f̄h is based on the fundamental identities

    (K̄h ūh, v̄h) = a(uh, vh),   ∀ uh ↔ ūh, vh ↔ v̄h,  uh, vh ∈ Vh,  ūh, v̄h ∈ R^{N̄h},

and

    (f̄h, v̄h) = ⟨F̄, vh⟩,   ∀ vh ↔ v̄h,  vh ∈ Vh,  v̄h ∈ R^{N̄h},

respectively.

Generation of the element stiffness matrices

Inserting uh = Σ_{i∈ω̄h} u(i) ϕ(i)(x) and vh = Σ_{k∈ω̄h} v(k) ϕ(k)(x) ∈ Vh into the bilinear form a(·, ·), we obtain the representation

    a(uh, vh) = a( Σ_{i∈ω̄h} u(i) ϕ(i)(x), Σ_{k∈ω̄h} v(k) ϕ(k)(x) )

              = ∫_Ω [ λ1 ∂/∂x1( Σ_{i∈ω̄h} u(i) ϕ(i)(x) ) · ∂/∂x1( Σ_{k∈ω̄h} v(k) ϕ(k)(x) )
                    + λ2 ∂/∂x2( Σ_{i∈ω̄h} u(i) ϕ(i)(x) ) · ∂/∂x2( Σ_{k∈ω̄h} v(k) ϕ(k)(x) ) ] dx

              = Σ_{r∈Rh} Σ_{i∈ω̄h} Σ_{k∈ω̄h} u(i) v(k) ∫_δ(r) [ λ1 ∂ϕ(i)/∂x1 · ∂ϕ(k)/∂x1 + λ2 ∂ϕ(i)/∂x2 · ∂ϕ(k)/∂x2 ] dx.

Introducing the index set ω(r)h = {j : x(j) = (x1,j, x2,j) ∈ δ(r)}, and taking into account that the integral

    ∫_δ(r) [ λ1 ∂ϕ(i)/∂x1 · ∂ϕ(k)/∂x1 + λ2 ∂ϕ(i)/∂x2 · ∂ϕ(k)/∂x2 ] dx = 0

for x(i) ∉ δ(r) or x(k) ∉ δ(r), we arrive at the representation

    a(uh, vh) = Σ_{r∈Rh} Σ_{i∈ω(r)h} Σ_{k∈ω(r)h} u(i) v(k) ∫_δ(r) [ λ1 ∂ϕ(i)/∂x1 · ∂ϕ(k)/∂x1 + λ2 ∂ϕ(i)/∂x2 · ∂ϕ(k)/∂x2 ] dx.

Taking into account the correspondences r : i ↔ α, k ↔ β with α, β ∈ A(r) = {1, 2, 3}, we finally obtain (see also relation (4.13))

    a(uh, vh) = Σ_{r∈Rh} Σ_{α∈A(r)} Σ_{β∈A(r)} u(r)α v(r)β ∫_δ(r) [ λ1 ∂ϕ(r)α/∂x1 · ∂ϕ(r)β/∂x1 + λ2 ∂ϕ(r)α/∂x2 · ∂ϕ(r)β/∂x2 ] dx

              = Σ_{r∈Rh} ( v(r)1  v(r)2  v(r)3 ) K(r) ( u(r)1, u(r)2, u(r)3 )ᵀ,

where the element stiffness matrices K(r) are defined as follows:

    K(r) = [ ∫_δ(r) [ λ1 ∂ϕ(r)α/∂x1 · ∂ϕ(r)β/∂x1 + λ2 ∂ϕ(r)α/∂x2 · ∂ϕ(r)β/∂x2 ] dx ]³_{β,α=1}.

The coefficients of the element stiffness matrices K(r) will be calculated by transforming the integrals over δ(r) into integrals over the master element ∆ = {(ξ1, ξ2) : 0 ≤ ξ1 ≤ 1, 0 ≤ ξ2 ≤ 1, ξ1 + ξ2 ≤ 1}. The transformation xδ(r) = xδ(r)(ξ) of the master element onto the finite element δ(r) is explicitly given by the formula x = Jδ(r) ξ + x(i), i.e.,

    ( x1 )   ( x1,j − x1,i   x1,k − x1,i ) ( ξ1 )   ( x1,i )
    ( x2 ) = ( x2,j − x2,i   x2,k − x2,i ) ( ξ2 ) + ( x2,i ),

where (x1,i, x2,i), (x1,j, x2,j), (x1,k, x2,k) denote the vertices of the triangle δ(r), and the correspondence i ↔ 1, j ↔ 2, k ↔ 3 between the global and local nodal numbering is taken from the element connectivity table r : i ↔ α. The inverse mapping ξδ(r) = ξδ(r)(x) is obviously given by the formula

    ( ξ1 )        1        (   x2,k − x2,i    −(x1,k − x1,i) ) ( x1 − x1,i )
    ( ξ2 ) = ----------- · ( −(x2,j − x2,i)     x1,j − x1,i  ) ( x2 − x2,i )
             det Jδ(r)

           = ( ∂ξ1/∂x1   ∂ξ1/∂x2 ) ( x1 − x1,i )
             ( ∂ξ2/∂x1   ∂ξ2/∂x2 ) ( x2 − x2,i ).

The transformation formulas

    ∂/∂x1 = ∂/∂ξ1 · ∂ξ1/∂x1 + ∂/∂ξ2 · ∂ξ2/∂x1   and   ∂/∂x2 = ∂/∂ξ1 · ∂ξ1/∂x2 + ∂/∂ξ2 · ∂ξ2/∂x2

for the partial derivatives give the relations

    ∂/∂x1 = 1/(det Jδ(r)) [ (x2,k − x2,i) ∂/∂ξ1 − (x2,j − x2,i) ∂/∂ξ2 ],

    ∂/∂x2 = 1/(det Jδ(r)) [ −(x1,k − x1,i) ∂/∂ξ1 + (x1,j − x1,i) ∂/∂ξ2 ].

Using these transformation formulas and the relation (4.13), we obtain the formulas for the coefficients K(r)βα of the element stiffness matrix K(r):

    K(r)βα = ∫_δ(r) [ λ1 ∂ϕ(r)α/∂x1 · ∂ϕ(r)β/∂x1 + λ2 ∂ϕ(r)α/∂x2 · ∂ϕ(r)β/∂x2 ] dx

           = ∫_∆ [ λ1(xδ(r)(ξ)) ( ∂ϕ(r)α(xδ(r)(ξ))/∂ξ1 · ∂ξ1/∂x1 + ∂ϕ(r)α(xδ(r)(ξ))/∂ξ2 · ∂ξ2/∂x1 )
                               · ( ∂ϕ(r)β(xδ(r)(ξ))/∂ξ1 · ∂ξ1/∂x1 + ∂ϕ(r)β(xδ(r)(ξ))/∂ξ2 · ∂ξ2/∂x1 )
                 + λ2(xδ(r)(ξ)) ( ∂ϕ(r)α(xδ(r)(ξ))/∂ξ1 · ∂ξ1/∂x2 + ∂ϕ(r)α(xδ(r)(ξ))/∂ξ2 · ∂ξ2/∂x2 )
                               · ( ∂ϕ(r)β(xδ(r)(ξ))/∂ξ1 · ∂ξ1/∂x2 + ∂ϕ(r)β(xδ(r)(ξ))/∂ξ2 · ∂ξ2/∂x2 ) ] |det Jδ(r)| dξ

           = |det Jδ(r)| ∫_∆ [ λ1(xδ(r)(ξ)) ( ∂φα(ξ)/∂ξ1 · ∂ξ1/∂x1 + ∂φα(ξ)/∂ξ2 · ∂ξ2/∂x1 ) ( ∂φβ(ξ)/∂ξ1 · ∂ξ1/∂x1 + ∂φβ(ξ)/∂ξ2 · ∂ξ2/∂x1 )
                             + λ2(xδ(r)(ξ)) ( ∂φα(ξ)/∂ξ1 · ∂ξ1/∂x2 + ∂φα(ξ)/∂ξ2 · ∂ξ2/∂x2 ) ( ∂φβ(ξ)/∂ξ1 · ∂ξ1/∂x2 + ∂φβ(ξ)/∂ξ2 · ∂ξ2/∂x2 ) ] dξ.

In the case of the linear shape functions (4.14)

    φ1(ξ) = 1 − ξ1 − ξ2,   ∂φ1/∂ξ1 = −1,   ∂φ1/∂ξ2 = −1,
    φ2(ξ) = ξ1,            ∂φ2/∂ξ1 =  1,   ∂φ2/∂ξ2 =  0,
    φ3(ξ) = ξ2,            ∂φ3/∂ξ1 =  0,   ∂φ3/∂ξ2 =  1,


we derive from the above formulas the explicit relations

    K(r)11 = 1/|det Jδ(r)| ∫_∆ [ λ1(xδ(r)(ξ)) (−(x2,k − x2,i) + (x2,j − x2,i))² + λ2(xδ(r)(ξ)) ((x1,k − x1,i) − (x1,j − x1,i))² ] dξ
           = 1/|det Jδ(r)| ( (x2,j − x2,k)² ∫_∆ λ1(xδ(r)(ξ)) dξ + (x1,k − x1,j)² ∫_∆ λ2(xδ(r)(ξ)) dξ ),

    K(r)22 = 1/|det Jδ(r)| ( (x2,k − x2,i)² ∫_∆ λ1(xδ(r)(ξ)) dξ + (x1,i − x1,k)² ∫_∆ λ2(xδ(r)(ξ)) dξ ),

    K(r)33 = 1/|det Jδ(r)| ( (x2,i − x2,j)² ∫_∆ λ1(xδ(r)(ξ)) dξ + (x1,j − x1,i)² ∫_∆ λ2(xδ(r)(ξ)) dξ ),

    K(r)12 = 1/|det Jδ(r)| ( (x2,k − x2,i)(x2,j − x2,k) ∫_∆ λ1(xδ(r)(ξ)) dξ + (x1,i − x1,k)(x1,k − x1,j) ∫_∆ λ2(xδ(r)(ξ)) dξ ),

    K(r)13 = 1/|det Jδ(r)| ( (x2,i − x2,j)(x2,j − x2,k) ∫_∆ λ1(xδ(r)(ξ)) dξ + (x1,j − x1,i)(x1,k − x1,j) ∫_∆ λ2(xδ(r)(ξ)) dξ ),

    K(r)23 = 1/|det Jδ(r)| ( (x2,i − x2,j)(x2,k − x2,i) ∫_∆ λ1(xδ(r)(ξ)) dξ + (x1,j − x1,i)(x1,i − x1,k) ∫_∆ λ2(xδ(r)(ξ)) dξ ).


These formulae can be implemented easily if the integrals ∫_∆ λ1 dξ and ∫_∆ λ2 dξ are easy to calculate. If λ1 and λ2 are constant on the elements, then ∫_∆ λ1 dξ = (1/2) λ1 and ∫_∆ λ2 dξ = (1/2) λ2. Otherwise, we use the mid-point rule

    ∫_∆ w(ξ) dξ ≈ (1/2) w(ξs),   ξs = (1/3, 1/3),

and therefore

    ∫_∆ λ1(xδ(r)(ξ)) dξ ≈ (1/2) λ1(xδ(r)(ξs))   and   ∫_∆ λ2(xδ(r)(ξ)) dξ ≈ (1/2) λ2(xδ(r)(ξs)),   ξs = (1/3, 1/3).

We note that the mid-point rule exactly integrates first order polynomials, i.e., polynomials of the form ϕ(ξ) = a0 + a1 ξ1 + a2 ξ2. Setting k1 = (1/2) λ1, or k1 = (1/2) λ1(xδ(r)(ξs)), and k2 = (1/2) λ2, or k2 = (1/2) λ2(xδ(r)(ξs)), we can rewrite the element stiffness matrix K(r) in the form

    K(r) = (K(r))ᵀ = 1/|det Jδ(r)| ·

      ( k1(x2,j − x2,k)² + k2(x1,k − x1,j)²   k1(x2,k − x2,i)(x2,j − x2,k) + k2(x1,i − x1,k)(x1,k − x1,j)   k1(x2,i − x2,j)(x2,j − x2,k) + k2(x1,j − x1,i)(x1,k − x1,j) )
      (            symmetric                 k1(x2,k − x2,i)² + k2(x1,i − x1,k)²                           k1(x2,i − x2,j)(x2,k − x2,i) + k2(x1,j − x1,i)(x1,i − x1,k) )
      (            symmetric                            symmetric                                          k1(x2,i − x2,j)² + k2(x1,j − x1,i)²                         ),

where we again assume the correspondence i ↔ 1, j ↔ 2, k ↔ 3 between the global and the local nodal numbering as in Fig. 4.11.

We note that we have only used the coefficients λi of the PDE, the mapping x = Jδ(r) ξ + x(i), and the shape functions φα over the master element for calculating the coefficients K(r)αβ of the element stiffness matrix K(r). The explicit form of the shape functions over the finite element δ(r) is not required. This is also valid for higher order shape functions.
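To make this concrete, the closed form above can be coded in a few lines. The following C sketch is not taken from the book's course material; it assumes coefficients that are constant (or evaluated by the mid-point rule) and passed in as k1 = λ1/2 and k2 = λ2/2, with the vertex coordinates given in the local order 1 ↔ i, 2 ↔ j, 3 ↔ k:

    #include <math.h>

    /* Element stiffness matrix K(r) of a linear (Courant) triangle,
     * evaluated from the closed form derived above.
     * xc[3], yc[3] : vertex coordinates in local numbering 1,2,3 (= i,j,k),
     * k1, k2       : (1/2)*lambda_1 and (1/2)*lambda_2 on this element,
     * Ke           : 3x3 element matrix (output).                          */
    void ElementStiffness(const double xc[3], const double yc[3],
                          double k1, double k2, double Ke[3][3])
    {
        /* b[alpha], c[alpha] are the coordinate differences appearing in
         * the entries of K(r), e.g. b[0] = x_{2,j} - x_{2,k}.              */
        double b[3], c[3], detJ;
        b[0] = yc[1] - yc[2];   c[0] = xc[2] - xc[1];
        b[1] = yc[2] - yc[0];   c[1] = xc[0] - xc[2];
        b[2] = yc[0] - yc[1];   c[2] = xc[1] - xc[0];
        detJ = (xc[1] - xc[0]) * (yc[2] - yc[0])
             - (xc[2] - xc[0]) * (yc[1] - yc[0]);

        for (int beta = 0; beta < 3; beta++)
            for (int alpha = 0; alpha < 3; alpha++)
                Ke[beta][alpha] = (k1 * b[alpha] * b[beta]
                                 + k2 * c[alpha] * c[beta]) / fabs(detJ);
    }

Only the vertex coordinates and the (averaged) coefficients enter, in agreement with the remark above.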

Generation of the element load vectors

The element load vectors f(r) will be generated in the same way as we have just generated the element stiffness matrices K(r), i.e., the generation is based on the relation

    (f̄h, v̄h) = ⟨F̄, vh⟩ = ∫_Ω f(x) vh(x) dx = ∫_Ω f(x) Σ_{k∈ω̄h} v(k) ϕ(k)(x) dx

              = Σ_{r∈Rh} Σ_{k∈ω̄h} v(k) ∫_δ(r) f(x) ϕ(k)(x) dx.

Introducing again the index set ω(r)h = {k : x(k) = (x1,k, x2,k) ∈ δ(r)} and taking into account that ϕ(k)(x)|δ(r) = 0 if x(k) ∉ δ(r), we obtain the representation

    (f̄h, v̄h) = Σ_{r∈Rh} Σ_{k∈ω(r)h} v(k) ∫_δ(r) f(x) ϕ(k)(x) dx.

This can be rewritten in the form

    (f̄h, v̄h) = Σ_{r∈Rh} Σ_{β∈A(r)} v(r)β ∫_δ(r) f(x) ϕ(r)β(x) dx = Σ_{r∈Rh} ( v(r)1  v(r)2  v(r)3 ) f(r),

with the element connectivity relation r : k ↔ β, where the element load vectors f(r) are defined by the relation

    f(r) = [ ∫_δ(r) f(x) ϕ(r)β(x) dx ]³_{β=1}.

We again transform the integrals over δ(r) into integrals over the master element:

    ∫_δ(r) f(x) ϕ(r)β(x) dx = ∫_∆ f(xδ(r)(ξ)) ϕ(r)β(xδ(r)(ξ)) |det Jδ(r)| dξ = ∫_∆ f(xδ(r)(ξ)) φβ(ξ) |det Jδ(r)| dξ,

which can be approximated by the mid-point rule

    ∫_∆ f(xδ(r)(ξ)) φβ(ξ) |det Jδ(r)| dξ ≈ (1/2) f(xδ(r)(ξs)) φβ(ξs) |det Jδ(r)|,

with ξs = (1/3, 1/3).
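A corresponding sketch for the element load vector, again hypothetical and using the mid-point rule with f passed as a function pointer (note that φβ(ξs) = 1/3 for all three shape functions and that x(ξs) is the centroid of the triangle):

    #include <math.h>

    /* Element load vector f(r) of a linear triangle via the mid-point rule. */
    void ElementLoad(const double xc[3], const double yc[3],
                     double (*f)(double, double), double fe[3])
    {
        double detJ = (xc[1] - xc[0]) * (yc[2] - yc[0])
                    - (xc[2] - xc[0]) * (yc[1] - yc[0]);
        /* image of the master mid-point xi_s = (1/3,1/3) = centroid */
        double xs = (xc[0] + xc[1] + xc[2]) / 3.0;
        double ys = (yc[0] + yc[1] + yc[2]) / 3.0;
        double w  = 0.5 * fabs(detJ) * f(xs, ys);

        fe[0] = fe[1] = fe[2] = w / 3.0;   /* phi_beta(xi_s) = 1/3 */
    }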

Assembling the element stiffness matrices and the element load vectors

To assemble the element stiffness matrices K(r) and the element load vectors f(r) into the global stiffness matrix K̄h and the global load vector f̄h, we only need the element connectivity table r : i ↔ α that relates the local nodal numbers to the global ones. We recall that the element connectivity table is generated by the mesh generator.

For our model problem CHIP, the element connectivity table belonging to the mesh shown in Fig. 4.9 is given in Table 4.1. Using this information, we now assemble the corresponding global stiffness matrix K̄h. First we set all coefficients of K̄h to zero. Then we calculate the element stiffness matrices K(r) one by one and immediately add them to the global stiffness matrix K̄h (loop over all elements). After adding K(1) to K̄h, we obtain the matrix in Fig. 4.12.

Only the rows and columns with the global numbers 1, 2, and 6 (the nodes of element 1) receive nonzero entries K(1)αβ at this stage; all other coefficients are still zero.

Figure 4.12: One element matrix is assembled into the global matrix.

The incorporation of K(2) into K̄h yields the matrix in Fig. 4.13. After finishing this assembly process, we obtain the global stiffness matrix K̄h without considering the boundary conditions.

The assembling of the right-hand side f̄h follows the same algorithm, i.e., we first set all coefficients of the vector f̄h to zero, then we calculate the element load vector f(1) and add it to f̄h, obtaining

    f̄h = ( f(1)1, f(1)2, 0, . . . , 0, f(1)3, 0, 0, . . . , 0, 0, 0 )ᵀ,

where the nonzero entries stand in the positions of the global node numbers 1, 2, and 6. After adding f(2), we have

    f̄h = ( f(1)1, f(1)2 + f(2)1, 0, . . . , 0, f(1)3 + f(2)3, f(2)2, 0, . . . , 0, 0, 0 )ᵀ,

with the nonzero entries now in the positions 1, 2, 6, and 7.


Now the rows and columns 1, 2, 6, and 7 carry nonzero entries; the coefficients shared by elements 1 and 2 are the sums K(1)22 + K(2)11, K(1)23 + K(2)13, K(1)32 + K(2)31, and K(1)33 + K(2)33.

Figure 4.13: Two element matrices are added into the global matrix.

Repeating this assembly process for all r = 3, 4, . . . , Rh (looping over all elements), we arrive at the global load vector f̄h that does not take into account the boundary conditions. For instance, the coefficient f̄(8) is composed as follows:

    f̄(8) = f(8)2 + f(9)3 + f(18)1 + f(19)1 + f(20)1.

We note that the matrix K̄h and the right-hand side f̄h can also be represented by the use of the so-called element connectivity matrices C(r), defined by the relation

    C(r)ij = { 1  if i is a global nodal number of some vertex of the triangle δ(r),
             {    and j is the corresponding local nodal number,
             { 0  otherwise,

with i = 1, 2, . . . , N̄h and j = 1, 2, 3. Therefore, we have the representations

    K̄h = Σ_{r∈Rh} C(r) K(r) (C(r))ᵀ   and   f̄h = Σ_{r∈Rh} C(r) f(r).

For instance, the matrix (C(1))ᵀ has the form

    (C(1))ᵀ = ( 1  0  0  · · ·  0  0  0  0  · · ·  0  0  0 )
              ( 0  1  0  · · ·  0  0  0  0  · · ·  0  0  0 )
              ( 0  0  0  · · ·  0  1  0  0  · · ·  0  0  0 ),

i.e., its three rows pick out the global node numbers 1, 2, and 6. This immediately follows from the element connectivity in Table 4.1.
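In an actual code the matrices C(r) are never formed; the connectivity table itself drives a scatter-add loop. A minimal sketch with dense global storage and constant coefficients, reusing the hypothetical element routines sketched above (all names are illustrative, not those of the book's course code):

    void ElementStiffness(const double[3], const double[3], double, double, double[3][3]);
    void ElementLoad(const double[3], const double[3], double (*)(double, double), double[3]);

    /* Element-wise assembly of the global matrix K (dense, row-wise) and the
     * load vector f; con[r][3] holds the 0-based global node numbers of the
     * local nodes 1,2,3 of element r (the element connectivity table).      */
    void Assemble(int nelem, int nnode, int (*con)[3],
                  const double *x, const double *y,
                  double k1, double k2, double (*rhs)(double, double),
                  double *K, double *f)
    {
        for (int i = 0; i < nnode * nnode; i++) K[i] = 0.0;
        for (int i = 0; i < nnode; i++)         f[i] = 0.0;

        for (int r = 0; r < nelem; r++) {
            double xc[3], yc[3], Ke[3][3], fe[3];
            for (int a = 0; a < 3; a++) {              /* gather coordinates */
                xc[a] = x[con[r][a]];
                yc[a] = y[con[r][a]];
            }
            ElementStiffness(xc, yc, k1, k2, Ke);
            ElementLoad(xc, yc, rhs, fe);

            for (int a = 0; a < 3; a++) {              /* scatter-add        */
                int i = con[r][a];
                f[i] += fe[a];
                for (int b = 0; b < 3; b++)
                    K[i * nnode + con[r][b]] += Ke[a][b];
            }
        }
    }

In practice the global matrix would of course be stored in a sparse format such as the CRS scheme of Section 5.1.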


Up to now, we did not take into account the contributions from the boundary conditions:

    ∫_Γ2 g2(x) vh(x) ds,   ∫_Γ3 α(x) g3(x) vh(x) ds   and   ∫_Γ3 α(x) uh(x) vh(x) ds,

resulting from the Neumann (2nd kind) and Robin (3rd kind) boundary conditions. The same holds for the Dirichlet (1st kind) boundary condition that we have to integrate into the global stiffness matrix and the global load vector, too.

Including 2nd and 3rd kind boundary conditions

Recall that we assumed that our computational domain Ω is polygonally bounded. The conditions imposed on the mesh generation above imply that

    Γ2 = ∪_{e2∈E(2)h} Γ(e2)2   and   Γ3 = ∪_{e3∈E(3)h} Γ(e3)3,

where Γ(e2)2 and Γ(e3)3 are those edges of the mesh triangles that lie on Γ2 and Γ3, respectively. Now we number all edges that lie on Γ2 and Γ3 such that we obtain the index sets E(2)h = {1, 2, . . . , N(2)h} and E(3)h = {1, 2, . . . , N(3)h}. Furthermore, we define an edge connectivity table e : i ↔ α, with α ∈ A(e) = {1, 2}, that relates the local nodal number α to the corresponding global nodal number i of the edge e. For our model problem CHIP, the following edge connectivity table describes the edges that lie on Γ3:

    Number of the edge     global numbers of the local nodes
    that lies on Γ3        P(e3)1    P(e3)2
    1                      1         2
    2                      2         3
    3                      3         4
    4                      4         5

Table 4.3. Connectivity table for the edges that lie on Γ3.

The line integrals over Γ3 (similarly over Γ2) are calculated as follows:

    ∫_Γ3 α(x) uh(x) vh(x) ds = Σ_{e3∈E(3)h} ∫_Γ(e3)3 α(x) uh(x) vh(x) ds
                             = Σ_{e3∈E(3)h} Σ_{i∈ω̄h} Σ_{k∈ω̄h} u(i) v(k) ∫_Γ(e3)3 α(x) ϕ(i)(x) ϕ(k)(x) ds.

Introducing the index set ω(e3)h = {j : x(j) = (x1,j, x2,j) ∈ Γ(e3)3}, and taking into account that ϕ(j)(x)|Γ(e3)3 = 0 for x(j) ∉ Γ(e3)3, we can rewrite the line integral in the form

    ∫_Γ3 α(x) uh(x) vh(x) ds = Σ_{e3∈E(3)h} Σ_{i∈ω(e3)h} Σ_{k∈ω(e3)h} u(i) v(k) ∫_Γ(e3)3 α(x) ϕ(i)(x) ϕ(k)(x) ds,

which together with the correspondences e : i ↔ α and e : k ↔ β gives the representation

    ∫_Γ3 α(x) uh(x) vh(x) ds = Σ_{e3∈E(3)h} Σ_{α∈A(e3)} Σ_{β∈A(e3)} u(e3)α v(e3)β ∫_Γ(e3)3 α(x) ϕ(e3)α(x) ϕ(e3)β(x) ds
                             = Σ_{e3∈E(3)h} ( v(e3)1  v(e3)2 ) K(e3) ( u(e3)1, u(e3)2 )ᵀ,

with

    K(e3) = [ ∫_Γ(e3)3 α(x) ϕ(e3)α(x) ϕ(e3)β(x) ds ]²_{β,α=1}.

In order to compute the coefficients of the matrix K(e3), we again transform the line integrals over the edge Γ(e3)3 into line integrals over the unit interval [0, 1], as shown in Fig. 4.14. This is an affine linear transformation given by the formula

    x = x(ξ1):   ( x1 )   ( x1,k − x1,i )        ( x1,i )
                 ( x2 ) = ( x2,k − x2,i ) ξ1  +  ( x2,i ).


Figure 4.14: Mapping of a triangular edge onto the unit interval [0, 1].

Using the integral transformation formulas, we obtain

    ∫_Γ(e3)3 α(x) ϕ(e3)α(x) ϕ(e3)β(x) ds
        = ∫₀¹ α(x(ξ1)) ϕ(e3)α(x(ξ1)) ϕ(e3)β(x(ξ1)) √( (dx1(ξ1)/dξ1)² + (dx2(ξ1)/dξ1)² ) dξ1
        = ∫₀¹ α(x(ξ1)) φα(ξ1) φβ(ξ1) √( (x1,k − x1,i)² + (x2,k − x2,i)² ) dξ1.

In the case of linear shape functions, we have

    φ1(ξ1) = 1 − ξ1   and   φ2(ξ1) = ξ1.
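If the Robin coefficient α is constant along the edge, the remaining integrals are ∫₀¹ (1 − ξ1)² dξ1 = ∫₀¹ ξ1² dξ1 = 1/3 and ∫₀¹ ξ1(1 − ξ1) dξ1 = 1/6, so K(e3) can be written down directly. A hypothetical helper in the spirit of the element routines above, valid only under exactly this assumption:

    #include <math.h>

    /* Edge matrix K(e3) of a boundary edge on Gamma_3, assuming a constant
     * Robin coefficient alpha along the edge; (x1,y1), (x2,y2) are the two
     * edge nodes in local numbering 1,2.                                    */
    void EdgeStiffness(double x1, double y1, double x2, double y2,
                       double alpha, double Ke[2][2])
    {
        double len = sqrt((x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1));

        Ke[0][0] = Ke[1][1] = alpha * len / 3.0;  /* int (1-xi)^2 = int xi^2 = 1/3 */
        Ke[0][1] = Ke[1][0] = alpha * len / 6.0;  /* int xi*(1-xi)          = 1/6  */
    }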

In the same way, we compute the integrals ∫_Γ3 α(x) g3(x) vh(x) ds and ∫_Γ2 g2(x) vh(x) ds, i.e.,

    ∫_Γ3 α(x) g3(x) vh(x) ds = Σ_{e3∈E(3)h} Σ_{k∈ω(e3)h} v(k) ∫_Γ(e3)3 α(x) g3(x) ϕ(k)(x) ds
                             = Σ_{e3∈E(3)h} Σ_{β∈A(e3)} v(e3)β ∫_Γ(e3)3 α(x) g3(x) ϕ(e3)β(x) ds
                             = Σ_{e3∈E(3)h} ( v(e3)1  v(e3)2 ) f(e3),

with

    f(e3) = [ ∫_Γ(e3)3 α(x) g3(x) ϕ(e3)β(x) ds ]²_{β=1}.

Analogously, we obtain

    f(e2) = [ ∫_Γ(e2)2 g2(x) ϕ(e2)β(x) ds ]²_{β=1}.

Now we assemble the edge stiffness matrices K(e3) as well as the edge load vectors f(e3) and f(e2), e3 = 1, 2, . . . , N(3)h, e2 = 1, 2, . . . , N(2)h, into the global stiffness matrix and into the global load vector in the same way as we assembled K(r) and f(r) above.


Including 1st kind boundary conditions

First Method: Homogenization.
We first put u(j) = g1(x(j)) for all j ∈ γh = ω̄h \ ωh, i.e., for all indices j belonging to the nodes on the Dirichlet boundary Γ1, and then we correct the right-hand side as follows:

    f(i) := f(i) − Σ_{j∈γh} Kij u(j),   ∀ i ∈ ωh,                                  (4.16)

where Kij denotes the coefficients of the global stiffness matrix after assembling the element stiffness matrices K(r) and the edge stiffness matrices K(e3), and f(i) are the coefficients of the global load vector after assembling the element load vectors f(r) and the edge load vectors f(e2) and f(e3). Afterwards we cancel the columns j, j ∈ γh, and the rows i, i ∈ γh, from the global stiffness matrix, and the components with the indices i, i ∈ γh, from the global load vector. The resulting system Kh uh = fh of finite element equations has dimension Nh.

Second Method: Homogenization without row canceling.
Instead of canceling columns and rows in the matrix, we apply (4.16) and set for all j ∈ γh:

    f(j) = g1(x(j)),   Kjj = 1   and   Kij = Kji = 0,   ∀ i ∈ ωh.

The resulting system Kh uh = fh of finite element equations has dimension N̄h.

Third Method: Penalty technique (see also [73]).
For all i ∈ γh, we put

    Kii := Kii + K̃,   K̃ = 10^p · max_{j=1,2,...,N̄h} |Kij| · N̄h,
    f(i) := f(i) + K̃ g1(x(i)),

with a sufficiently large natural number p. The resulting system Kh uh = fh of finite element equations has dimension N̄h.
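Of the three variants, the second method is particularly easy to code because the dimension of the system does not change. A minimal dense-matrix sketch (hypothetical routine; dirichlet[] flags the nodes of γh and g1[] holds the prescribed boundary values; couplings between two Dirichlet nodes are zeroed as well, so that each Dirichlet row reduces to u(j) = g1(x(j))):

    /* Second method: homogenization without row canceling.
     * n            : total number of nodes,
     * K[n*n], f[n] : assembled global matrix (dense, row-wise) and load vector. */
    void ApplyDirichlet(int n, double *K, double *f,
                        const int *dirichlet, const double *g1)
    {
        for (int j = 0; j < n; j++) {
            if (!dirichlet[j]) continue;
            for (int i = 0; i < n; i++) {
                if (i == j) continue;
                if (!dirichlet[i])
                    f[i] -= K[i * n + j] * g1[j];   /* correction (4.16) */
                K[i * n + j] = 0.0;                 /* K_ij = 0          */
                K[j * n + i] = 0.0;                 /* K_ji = 0          */
            }
            K[j * n + j] = 1.0;                     /* K_jj = 1          */
            f[j] = g1[j];                           /* f(j) = g1(x(j))   */
        }
    }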

4.2.3 Analysis of the Galerkin Finite Element Method

The following results can be concluded directly from the Lax-Milgram theory applied to the Galerkin scheme, replacing V0 by V0h. Without loss of generality, we consider again the homogenized variational problem (4.6) under our standard assumptions (4.7).

Let V0h = span{ϕ(i) : i ∈ ωh} ⊂ V0 be some finite dimensional subspace of V0, where Φ = {ϕ(i) : i ∈ ωh} denotes some basis of V0h. More precisely, we think of a family of subspaces with h ∈ Θ such that dim V0h = Nh → ∞ as h → 0 (|h| → 0) (see Section 4.2.2 for one example of such finite element spaces).


Theorem 4.5 (Lax-Milgram for the Galerkin scheme). Let us assume that our standard assumptions (4.7) are fulfilled and that V0h ⊂ V0 is a finite dimensional subspace of V0 as introduced above. Then the following statements hold:

1. There exists a unique solution uh ∈ V0h of the Galerkin scheme

       Find uh ∈ V0h :  a(uh, vh) = ⟨F, vh⟩,   ∀ vh ∈ V0h.                         (4.17)

2. The sequence {uh^n} ⊂ V0h, uniquely defined by the equations

       Find uh^{n+1} ∈ V0h such that
       (uh^{n+1}, vh) = (uh^n, vh) − ϱ ( a(uh^n, vh) − ⟨F, vh⟩ ),   ∀ vh ∈ V0h,    (4.18)

   converges to the solution uh ∈ V0h of (4.17) in V0(‖·‖, (·,·)) for an arbitrary initial guess uh^0 ∈ V0h and for some fixed ϱ ∈ (0, 2µ1/µ2²).

3. The following iteration error estimates are valid:

   (a) ‖uh − uh^n‖ ≤ qⁿ ‖uh − uh^0‖                 (a priori estimate),
   (b) ‖uh − uh^n‖ ≤ qⁿ/(1 − q) ‖uh^1 − uh^0‖       (a priori bound),
   (c) ‖uh − uh^n‖ ≤ q/(1 − q) ‖uh^n − uh^{n−1}‖    (a posteriori bound),

   with 0 ≤ qopt = q(ϱopt) ≤ q(ϱ) := (1 − 2µ1ϱ + µ2²ϱ²)^{1/2} < 1 for ϱ ∈ (0, 2µ1/µ2²); ϱopt = µ1/µ2², qopt = √(1 − ξ²), ξ = µ1/µ2.

The proof of this theorem is analogous to the proof of the Lax-Milgram Theorem 4.4. Note that V0h ⊂ V0 ⊂ V is also a closed, non-trivial subspace of V. It is easy to see that the iteration process (4.18) is a fixed point iteration that converges for ϱ ∈ (0, 2µ1/µ2²) due to Banach's fixed point theorem [116].

The discretization error estimate in the norm ‖·‖ of the space V (V0) is based on Cea's lemma.

Theorem 4.6 (Cea's lemma). Let us assume that our standard assumptions (4.7) are fulfilled. Furthermore, we assume that Vgh ⊂ Vg is a finite dimensional submanifold of Vg, and that V0h ⊂ V0 is a finite dimensional subspace of V0 as introduced above. Then the discretization error estimate

    ‖u − uh‖ ≤ (µ2/µ1) inf_{wh∈Vgh} ‖u − wh‖                                       (4.19)

holds, where u ∈ V0 and uh ∈ V0h are the solutions of the abstract variational problem (4.1) and its Galerkin approximation (see Subsection 4.2.1), respectively.

Proof. Consider (4.1) for an arbitrary test function vh ∈ Vgh ⊂ Vg. If we subtract the Galerkin equations (see the scheme given in Subsection 4.2.1) from (4.1), then we obtain the so-called Galerkin orthogonality relation

    a(u − uh, vh) = 0,   ∀ vh ∈ V0h.                                               (4.20)

Choosing an arbitrary wh ∈ Vgh and inserting vh = u − uh − (u − wh) = wh − uh ∈ V0h into (4.20), we get the identity

    a(u − uh, u − uh) = a(u − uh, u − wh),   ∀ wh ∈ Vgh.                           (4.21)

Now, using the V0-ellipticity and the V0-boundedness of the bilinear form a(·, ·), we can estimate (4.21) from above and below as follows:

    µ1 ‖u − uh‖² ≤ a(u − uh, u − uh) = a(u − uh, u − wh) ≤ µ2 ‖u − uh‖ ‖u − wh‖,   ∀ wh ∈ Vgh.

This completes the proof.

Cea's result means that the discretization error problem with respect to the norm of the space V is reduced to a pure approximation error problem (= stability). Error estimates of the form (4.19) are called quasioptimal because they lead to estimates of the discretization error from both above and below:

    inf_{wh∈Vgh} ‖u − wh‖ ≤ ‖u − uh‖ ≤ (µ2/µ1) inf_{wh∈Vgh} ‖u − wh‖.              (4.22)

It follows immediately from (4.22) that the Galerkin solution uh converges asymptotically to the solution u of our variational problem (4.1) as the discretization parameter h tends to 0 if and only if inf_{wh∈Vgh} ‖u − wh‖ → 0 for h → 0. The family {V0h}_{h∈Θ} of finite dimensional subspaces of the space V0 is called complete in the limit if and only if

    lim_{h→0, h∈Θ}  inf_{vh∈V0h} ‖u − vh‖ = 0,   ∀ u ∈ V0.

If g = 0, or g ∈ Vgh ∀ h ∈ Θ, then the limiting completeness ensures the convergence of the Galerkin method.

Let us return to our model problem CHIP described in Section 4.2. We obtain a family of triangulations by recursively refining all triangles, e.g., by dividing each triangle into 4 triangles connecting the midpoints of the edges. The latter procedure results in a family of regular triangulations ensuring that the edges of the triangles are all of the same order O(h) and that the angles of all triangles are not less than some fixed angle θ0 ∈ (0, π/2) and not greater than π − θ0. If u ∈ H²(Ω) ∩ Vg, i.e., u also has square integrable second order derivatives, then we can prove that there exists an h-independent, positive constant c such that the approximation estimate

    inf_{wh∈Vgh} ‖u − wh‖1,Ω ≤ c h ‖u‖2,Ω                                          (4.23)


holds, where the norm ‖·‖2,Ω in H²(Ω) is defined by

    ‖w‖²2,Ω = ‖w‖²1,Ω + ∫_Ω [ (∂²w/∂x1²)² + 2 (∂²w/∂x1∂x2)² + (∂²w/∂x2²)² ] dx.

Cea's estimate (4.19) and the approximation error estimate (4.23) imply the a priori discretization error estimate

    ‖u − uh‖1,Ω ≤ (µ2/µ1) c h ‖u‖2,Ω,                                              (4.24)

i.e., the convergence rate in the H¹-norm is O(h). However, due to the obtuse angles in the boundary Γ of the computational domain Ω of our model problem CHIP, we can expect less regularity of the solution u than H². This reduces the order of convergence to O(h^λ) with some positive λ less than 1. Using appropriately graded grids, we can recover the O(h) convergence rate.

The detailed discretization error analysis and discretization error estimates in other norms (e.g., the L2- and L∞-norms) can be found in [22]. A survey of a posteriori error estimates and adaptive mesh refinement techniques is given in [110].

4.2.4 Iterative Solution of the Galerkin System

The convergence rate of the classical iterative methods, such as the Richardson method, the Jacobi iteration, and the Gauss-Seidel iteration (cf. Chapter 2), strongly depends on the condition number κ(Kh) of the stiffness matrix Kh [81]. In the symmetric and positive definite (SPD) case, the spectral condition number is defined by the quotient

    κ(Kh) := λmax(Kh) / λmin(Kh)                                                   (4.25)

of the maximal eigenvalue λmax(Kh) over the minimal eigenvalue λmin(Kh) of the stiffness matrix Kh. The minimal and maximal eigenvalues of the SPD matrix Kh can be characterized by the Rayleigh quotients

    λmin(Kh) := min_{vh∈R^{Nh}} (Kh vh, vh)/(vh, vh)   and   λmax(Kh) := max_{vh∈R^{Nh}} (Kh vh, vh)/(vh, vh),   (4.26)

respectively. A careful analysis of the Rayleigh quotients (4.26) leads us to the relations

    λmin(Kh) = O(h^m)   and   λmax(Kh) = O(h^{m−2})                                (4.27)

for the minimal and maximal eigenvalues of the finite element stiffness matrix Kh derived from symmetric, second order, elliptic boundary value problems, including our model problem CHIP (m = 2), on a regular triangulation, where m denotes the dimension of the computational domain Ω ⊂ R^m. This analysis is based on the fundamental relation

    (Kh uh, vh) = a(uh, vh),   ∀ uh, vh ∈ R^{Nh},  uh, vh ←→ uh, vh ∈ V0h,         (4.28)


relating the stiffness matrix Kh to the underlying bilinear form a(·, ·), on the standard assumptions (4.7), and on the representation of the integrals over the computational domain Ω as a sum of integrals over all finite elements (triangles). The behavior (4.27) of the minimal and maximal eigenvalues and (4.25) imply that the spectral condition number of the finite element stiffness matrix behaves like

    κ(Kh) = O(h⁻²)                                                                 (4.29)

as h tends to 0. Therefore, the classical iteration methods mentioned above show a slow convergence on fine grids. More precisely, the number I(ε) of iterations needed for reducing the initial iteration error by the factor ε⁻¹ behaves like κ(Kh) ln(ε⁻¹), i.e., like h⁻² ln(ε⁻¹).

To avoid this drawback, say in the case of the classical Richardson method, we need some preconditioning. Let us look at the fixed point iteration process (4.18). Once some basis is chosen, the iteration process (4.18) is equivalent to the iteration method (cf. also (2.8))

    Bh (uh^{n+1} − uh^n)/ϱ + Kh uh^n = fh,   n = 0, 1, . . . ,   uh^0 ∈ R^{Nh},    (4.30)

in R^{Nh} with the symmetric and positive definite matrix (the Gram matrix with respect to the scalar product (·, ·) in V0 (V))

    Bh = [ (ϕ(i), ϕ(k)) ]_{k,i∈ωh}.                                                (4.31)

Indeed, because of the Galerkin-Ritz isomorphism uh ↔ uh, we have the correspondence

    uh^{n+1} ∈ V0h :  (uh^{n+1}, vh) = (uh^n, vh) − ϱ ( a(uh^n, vh) − ⟨F, vh⟩ ),   ∀ vh ∈ V0h,

which, after inserting uh^j = Σ_{i∈ωh} u(i)j ϕ(i) (j = n, n+1; here j is the iteration index) and testing with vh = ϕ(k), k ∈ ωh, reads

    Σ_{i∈ωh} u(i)n+1 (ϕ(i), ϕ(k)) = Σ_{i∈ωh} u(i)n (ϕ(i), ϕ(k)) − ϱ [ Σ_{i∈ωh} u(i)n a(ϕ(i), ϕ(k)) − ⟨F, ϕ(k)⟩ ],   ∀ k ∈ ωh,

i.e.,

    Bh uh^{n+1} = Bh uh^n − ϱ ( Kh uh^n − fh ).

Due to the identity

    ‖vh‖ = ‖vh‖Bh := (Bh vh, vh)^{1/2}_{R^{Nh}},   ∀ vh ↔ vh ∈ R^{Nh},

we can rewrite the iteration error estimates 3(a)–3(c) of Theorem 4.5 in the Bh-energy norm ‖·‖Bh, where (·, ·)_{R^{Nh}} denotes the Euclidean inner product.


The iteration (4.30) is effective if and only if the system

    Bh wh^n = dh^n := fh − Kh uh^n,
    uh^{n+1} = uh^n + ϱ wh^n,

which has to be handled at each iteration step, can be solved quickly. That is, if we denote by ops(Bh⁻¹ ∗ dh^n) the number of arithmetical operations for solving the system Bh wh^n = dh^n, then this effort should be of the order O(ops(Kh ∗ uh^n)) of the number of arithmetical operations needed for one matrix-vector multiplication; the solver is called asymptotically optimal in this case. Unfortunately, this is not true in practical applications (there exist a few exceptions for special problems, namely the so-called fast direct methods [81]). Note that the convergence rate q = qopt = √(1 − ξ²) < 1 is, independently of h, less than 1, because ξ = µ1/µ2 does not depend on h.

To avoid the drawback of a non-optimal solver, we introduce an SPD preconditioner Ch that should satisfy the following two conditions:

    1. ops(Ch⁻¹ ∗ dh^n) = O( ops(Kh ∗ uh^n) ).

    2. Ch is spectrally equivalent to Bh, i.e.,
       ∃ γ1, γ2 = const > 0 :  γ1 Ch ≤ Bh ≤ γ2 Ch,  i.e.,  ∀ vh ∈ R^{Nh}           (4.32)
       γ1 (Ch vh, vh)_{R^{Nh}} ≤ (Bh vh, vh)_{R^{Nh}} ≤ γ2 (Ch vh, vh)_{R^{Nh}},
       where γ2/γ1 should not grow too much as h → 0.

Then we can use the iteration method

    Ch (uh^{n+1} − uh^n)/τ + Kh uh^n = fh,   n = 0, 1, . . . ;   uh^0 ∈ R^{Nh},    (4.33)

instead of (4.30). The iteration method (4.33) is sometimes called the preconditioned Richardson iteration. Moreover, under the assumptions (4.32), the following theorem shows that the iteration method (4.33) is efficient.

Theorem 4.7 (Preconditioned Richardson iteration). Let us assume that our standard assumptions (4.7) are fulfilled and that V0h ⊂ V0 is a finite dimensional subspace of V0 as introduced above. Furthermore, the SPD preconditioner Ch should satisfy the assumptions (4.32). Then the following statements hold:

1. The iteration (4.33) converges to the solution

       uh ∈ R^{Nh} :  Kh uh = fh   ⟷   uh ∈ V0h :  a(uh, vh) = ⟨F, vh⟩  ∀ vh ∈ V0h,

   for every fixed τ ∈ (0, 2ν1/ν2²), with ν1 = µ1γ1 and ν2 = µ2γ2.

2. The following iteration error estimates are valid:

   (a) ‖uh − uh^n‖Ch ≤ qⁿ ‖uh − uh^0‖Ch,
   (b) ‖uh − uh^n‖Ch ≤ qⁿ/(1 − q) ‖uh^1 − uh^0‖Ch,
   (c) ‖uh − uh^n‖Ch ≤ q/(1 − q) ‖uh^n − uh^{n−1}‖Ch,

   with ‖vh‖Ch = (vh, vh)^{1/2}_{Ch} := (Ch vh, vh)^{1/2}_{R^{Nh}} and 0 ≤ qopt = q(τopt) ≤ q(τ) := (1 − 2ν1τ + ν2²τ²)^{1/2} < 1 for τ ∈ (0, 2ν1/ν2²), τopt = ν1/ν2², and qopt = √(1 − ξ²), ξ = ν1/ν2.

3. If γ2/γ1 ≠ c(h), then I(ε) = O(ln ε⁻¹) iterations and Q(ε) = O( ops(Kh ∗ uh^n) · ln ε⁻¹ ) arithmetical operations are needed in order to reduce the initial error by some factor ε ∈ (0, 1), i.e.,

       ‖uh − uh^{I(ε)}‖Ch ≤ ε ‖uh − uh^0‖Ch.

The proof of this theorem is easy. Indeed, replacing ‖vh‖ = ‖vh‖Bh by ‖vh‖Ch and taking into account the inequalities

    (a)  a(vh, vh) = (Kh vh, vh) ≥ µ1 ‖vh‖²Bh ≥ µ1 γ1 ‖vh‖²Ch,   ∀ vh ∈ R^{Nh},
    (b)  |a(uh, vh)| = |(Kh uh, vh)| ≤ µ2 ‖uh‖Bh ‖vh‖Bh ≤ µ2 γ2 ‖uh‖Ch ‖vh‖Ch,   ∀ uh, vh ∈ R^{Nh},

we can easily prove the first two statements of the theorem. Alternatively, we can consider the error iteration scheme

    zh^{n+1} := uh − uh^{n+1} = ( Ih − τ Ch⁻¹ Kh ) zh^n.

Then we can directly estimate the iteration error in the Ch-energy norm:

    ‖zh^{n+1}‖²Ch = ‖( Ih − τ Ch⁻¹ Kh ) zh^n‖²Ch
                  = (zh^n, zh^n)Ch − 2τ (Ch⁻¹ Kh zh^n, zh^n)Ch + τ² (Ch⁻¹ Kh zh^n, Ch⁻¹ Kh zh^n)Ch
                  ≤ ( 1 − 2µ1γ1 τ + (µ2γ2)² τ² ) ‖zh^n‖²Ch.

The last statement follows from the estimate qⁿ ≤ ε if n ≥ ln ε⁻¹ / ln q⁻¹.
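For illustration, a minimal C sketch of the preconditioned Richardson iteration (4.33); the matrix-vector product and the preconditioner solve are passed in as callbacks, and all names are hypothetical rather than part of the book's course code (a defect-based stopping test would normally replace the fixed iteration count):

    /* Preconditioned Richardson iteration (4.33):
     *   C_h (u^{n+1} - u^n)/tau + K_h u^n = f_h.
     * matvec(u, r) computes r = K_h u;  precond(d, w) solves C_h w = d.
     * d and w are work vectors of length n.                                 */
    void PrecRichardson(int n, double tau, int maxit,
                        void (*matvec)(const double *, double *),
                        void (*precond)(const double *, double *),
                        const double *f, double *u, double *d, double *w)
    {
        for (int it = 0; it < maxit; it++) {
            matvec(u, d);                          /* d = K_h u^n             */
            for (int i = 0; i < n; i++)
                d[i] = f[i] - d[i];                /* defect d^n = f - K u^n  */
            precond(d, w);                         /* C_h w^n = d^n           */
            for (int i = 0; i < n; i++)
                u[i] += tau * w[i];                /* u^{n+1} = u^n + tau w^n */
        }
    }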

Remark 4.8.
1. Candidates for Ch are, for instance, the following practically very important preconditioners:

   • SSOR: see Subsection 2.2.2 and [60, 81, 93],
   • ILU, MILU, IC, MIC: see Section 6.1.2 and [60, 81, 93],
   • Multigrid preconditioners: see Chapter 7 and [12, 60],
   • Multilevel preconditioners (e.g., BPX): see [12, 60].

   We will define and consider some of these preconditioning procedures, and their parallel versions, in Chapters 6 and 7.

2. It is not assumed in Theorem 4.7 that Kh = Khᵀ. However, we immediately conclude from the V0-ellipticity of a(·, ·) that the stiffness matrix Kh is positive definite. In practice, the GMRES iteration or other Krylov space methods (see Subsection 6.3.2 and [81, 93]) are preferred to the Richardson iteration.


3. If the stiffness matrix Kh is both symmetric and positive definite, then

   (a) the convergence rate estimate can be improved [81, 93]: if ν1 Ch ≤ Kh ≤ ν2 Ch, then τopt = 2/(ν1 + ν2), qopt = (1 − ξ)/(1 + ξ), ξ = ν1/ν2, and

   (b) the conjugate gradient acceleration is feasible and, of course, used in practice (see also Subsection 6.2.1).

The main topic of the following chapters consists in the parallelization of the iteration schemes and the preconditioning operations.

Exercises

E4.1. Let us suppose that the solution u of the model problem CHIP (Example 4.3) is sufficiently smooth, at least in both material subdomains ΩI and ΩII. Derive the classical formulation of the problem, i.e., the partial differential equations in ΩI and ΩII, the interface conditions on ΓIn, and the boundary conditions on Γ1, Γ2, and Γ3.
Hint: Use (back) integration by parts in the main part.

E4.2. Show that Example 4.3 satisfies the standard assumptions formulated in (4.7). Compute µ1 and µ2 such that the quotient µ2/µ1 is as small as possible. Show that this example has a unique solution.
Hint: Use Theorem 4.4.

E4.3. Let us consider the Poisson equation (2.1) in the rectangle Ω = (0, 2) × (0, 1) under homogeneous Dirichlet boundary conditions on Γ1 := Γ. Derive the variational formulation (4.6) of this problem, and show that the standard assumptions (4.7) are fulfilled.
Hint: You will need Friedrichs' inequality

    ∫_Ω |v|² dx ≤ cF ∫_Ω |∇v|² dx,   ∀ v ∈ H¹0(Ω) := {v ∈ V = H¹(Ω) : v = 0 on Γ},

for proving the V0-ellipticity [22].

E4.4. Let us again consider the Poisson equation (2.1) in the rectangle Ω = (0, 2) × (0, 1) under homogeneous Dirichlet boundary conditions on Γ1 := Γ (see also Exercise E4.3 for the variational formulation). Generate the table of nodal coordinates i : (x1,i, x2,i) (cf. Table 4.2) and the table of the element connectivities r : α ↔ i (cf. Table 4.1) for the triangulation derived from Figure 2.1 by dividing all cells of the mesh into two triangles connecting the left bottom node with the right top node of each cell, where the nodes should be numbered from left to right and from bottom to top. Furthermore, derive the finite element equations Kh uh = fh for linear triangular elements (Courant's elements) according to the element-wise assembling technique presented above. Compare the finite element stiffness matrix and the load vector with their finite difference counterparts given in (2.4).
Hint: Use the homogenization method (First Method) for implementing the Dirichlet boundary conditions.


Chapter 5

Basic numerical routines in parallel

Program testing can be used to show the presence of bugs, but never to show their absence.
—Edsger W. Dijkstra (1930–2002)

5.1 Storage of sparse matrices

The finite element method introduced in Section 4.2 combines the local approximations of the PDE into one huge system of linear equations, which is usually extremely sparse. The term sparse is used for a class of systems of equations, e.g., arising from FEM discretization, where the small number of nonzero entries per row does not grow with the number of equations.

If we think of the matrix as a table and fill all of the locations with nonzero elements, then we get the nonzero pattern of the matrix. Depending on the properties of that pattern, we call a matrix structured or unstructured. The length of the interval from the diagonal element to the last nonzero element of a row/column is called the band/profile.

In the following, sparse matrices are always thought of as unstructured, i.e., we will not take advantage of special structures inside sparse matrices.

There exist many storage schemes for matrices; we focus our attention on the compressed row storage (CRS) scheme. Here only the nonzero elements of a sparse matrix are stored. Therefore, an indirect addressing of the matrix entries is needed. The matrix

    Kn×m = ( 10  0  0  −2 )
           (  3  9  0   0 )
           (  0  7  8   7 )
           (  3  0  8   7 )

can be stored using just two integer vectors and one real/double vector.

    Values:        val     = ( 10  −2   3   9   7   8   7   3   8   7 )
    Column index:  col_ind = (  1   4   1   2   2   3   4   1   3   4 )
    Row pointer:   row_ptr = (  1   3   5   8  11 )    (pointing at the beginning of each row)

The row pointer may also point at the end of the row or, in the case of symmetric matrices, at the diagonal. Further storage schemes for sparse matrices, e.g., the Compressed Column Storage (CCS), the Compressed Diagonal Storage (CDS), the Jagged Diagonal Storage (JDS) or the Skyline Storage (SKS), are nicely introduced in [8, §4.3.1]. The SKS stores all matrix elements within the variable band/profile of that matrix and usually requires much more memory than CRS. It is most commonly used when direct solvers like Gaussian elimination or LU factorization are used to solve the system of algebraic equations. The concrete choice whether to prefer band or profile storage depends on the algorithm used. The proper band/profile will be preserved during the solution process.
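As an illustration, a CRS data structure and the corresponding matrix-vector multiplication might look as follows in C (0-based indices, in contrast to the 1-based arrays above; the names are hypothetical and not those of the accompanying course code):

    /* Compressed row storage (CRS) of a sparse n x m matrix. */
    typedef struct {
        int     n;        /* number of rows                                 */
        int    *row_ptr;  /* entries of row i: row_ptr[i] .. row_ptr[i+1]-1 */
        int    *col_ind;  /* column index of each stored entry              */
        double *val;      /* value of each stored entry                     */
    } CRSMatrix;

    /* w = K * u for a matrix K in CRS format. */
    void CRSMatVec(const CRSMatrix *K, const double *u, double *w)
    {
        for (int i = 0; i < K->n; i++) {
            double s = 0.0;
            for (int j = K->row_ptr[i]; j < K->row_ptr[i + 1]; j++)
                s += K->val[j] * u[K->col_ind[j]];
            w[i] = s;
        }
    }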

More storage schemes for sparse matrices can be found in [8]. A change in the storage scheme will change the routines mentioned in this chapter and can be applied analogously in the following chapters.

5.2 Domain decomposition by non-overlapping elements

The data distribution with respect to domain decomposition uses geometrical connections for the parallelization. The domain Ω is split into P subdomains Ωi (i = 1, . . . , P). Each Ωi is mapped onto a process. Fig. 5.1 presents a part of a non-overlapping domain decomposition. We denote edges between subdomains as interfaces.


Figure 5.1: Non-overlapping domain decomposition.

We distinguish between element based and node based data, which we may store overlapping or non-overlapping (the domains themselves remain non-overlapping). This leads to 4 types of data distributions, namely overlapping elements, overlapping nodes, non-overlapping elements, and non-overlapping nodes. We investigate the non-overlapping distribution of (finite) elements in detail.

Assume that our domain Ω is split into 4 subdomains which are discretized by linear triangular finite elements (see Fig. 5.2).

(Markers in the figure: "V" = cross points (vertices), "E" = interface-edge nodes, "I" = interior nodes; the labeled nodes [ℓ], [k], [n], [m], [p], [q], [r], [s] are referred to in the examples below.)

Figure 5.2: Non-overlapping elements.

We distinguish between 3 sets of nodes, denoted by the appropriate subscripts:

"I"  nodes located in the interior of a subdomain  [NI := Σ_{i=1}^{P} NI,i],

"E"  nodes located in the interior of an interface edge  [NE := Σ_{j=1}^{ne} NE,j], and

"V"  cross points (vertices), i.e., nodes at the start and end of an interface edge  [NV].

The two latter sets are often combined as coupling nodes with the subscript "C" [NC = NV + NE]. The total number of nodes is N = NV + NE + NI = Σ_{i=1}^{P} Ni.

For simplicity we number the cross points first, then the edge nodes, and finally the inner nodes. The nodes of one edge possess a sequential numbering; the same holds for the inner nodes of each subdomain, so that all vectors and matrices have a block structure of the form

    vᵀ = (vV, vE,1, . . . , vE,ne, vI,1, . . . , vI,P).

According to the mapping of nodes to the P subdomains Ωi (i = 1, 2, . . . , P), all entries of matrices and vectors are distributed onto the appropriate process Pi. The coincidence matrices Ai (i = 1, . . . , P) represent this mapping and can be considered as subdomain connectivity matrices, similar to the element connectivities in Table 4.1. In detail, an Ni × N matrix Ai is a Boolean matrix which maps the global vector v onto the local vector vi.

Properties of Ai:
– Entries for inner nodes appear exactly once per row and column.
– Entries for coupling nodes appear once in Ai if this node belongs to process Pi.

Now, we define two types of vectors, the accumulated vector (Type-I) and the distributed vector (Type-II):

Type-I:  u and w are stored in process Pi (= Ωi) such that ui = Ai u and wi = Ai w, i.e., each process Pi owns the full values of its vector components.

Type-II: r and f are stored as ri, fi in Pi such that r = Σ_{i=1}^{P} Aiᵀ ri holds, i.e., each node on the interface (rC,i) owns only its contribution to the full values of its vector components.

Starting with a local FEM discretization of the PDE, we naturally obtain a distributed matrix K. If there are convective terms in the PDE and one uses higher order upwind methods, then additional information is needed from neighboring elements (communication) in order to calculate the local matrix entries.

Matrix K is stored in a distributed way, analogously to a Type-II vector, and it is classified as a Type-II matrix:

    K = Σ_{i=1}^{P} Aiᵀ Ki Ai,                                                     (5.1)

with Ki denoting the stiffness matrix belonging to subdomain Ωi (to be more precise: the support of the FEM test functions is restricted to Ωi). If one thinks of Ωi as one large finite element, then the distributed storage of the matrix is equivalent to the representation of that element matrix prior to the finite element accumulation.

The special numbering of the nodes implies the following block representation of the equation Ku = f:

    ( KV    KVE   KVI )  ( uV )     ( fV )
    ( KEV   KE    KEI )  ( uE )  =  ( fE ).                                        (5.2)
    ( KIV   KIE   KI  )  ( uI )     ( fI )

Therein, KI is a block diagonal matrix with entries KI,i. Similar block structures are valid for KEI, KIE, KIV, and KVI.

If we really perform the global accumulation of the matrix K, then the result is a Type-I matrix M, and we can write

    Mi := Ai M Aiᵀ.                                                                (5.3)

Although K = M holds, we need to distinguish between both representations because of the different local storage (i.e., Ki ≠ Mi).


The diagonal matrix

    R = Σ_{i=1}^{P} Aiᵀ Ai                                                         (5.4)

contains for each node the number of subdomains it belongs to (the priority of a node), e.g., in Fig. 5.2, R[k] := Rkk = 4, R[n] = R[m] = R[p] = 2, R[q] = 1.

We use subscripts and superscripts in the remainder of this section in the following way: v[n]C,i denotes the n-th component (in local or global numbering) of a vector v stored in process Pi. The subscript "C" indicates a subvector belonging to the interface. A similar notation is used for matrices.

5.3 Vector-vector operations

Change of vector types

Obviously, we can apply addition/subtraction and similar operations without communication to vectors of the same type.

The conversion of a Type-II (distributed) vector into a Type-I (accumulated) vector requires communication:

    wi = Ai Σ_{j=1}^{P} Ajᵀ rj.                                                    (5.5)

The other direction, the conversion of a Type-I vector into a Type-II vector, is not unique. One possibility consists of locally dividing each vector component by its priority (the number of subdomains the node belongs to), i.e.,

    ri = R⁻¹ wi,                                                                   (5.6)

with R defined in (5.4).
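Locally, the conversion (5.6) is a pure scaling, whereas (5.5) requires an accumulation of the interface values over the neighboring processes, e.g., by a routine like VecAccu from Exercise E5.5. A minimal sketch under these assumptions (the priority vector prio[] is the accumulated diagonal of R restricted to the process, cf. Exercise E5.6):

    /* Type-I (accumulated) -> Type-II (distributed), cf. (5.6):
     * divide every local component by its priority.                         */
    void ToDistributed(int n, const double *prio, const double *w, double *r)
    {
        for (int i = 0; i < n; i++)
            r[i] = w[i] / prio[i];
    }

    /* Type-II (distributed) -> Type-I (accumulated), cf. (5.5):
     * copy the local part and add up the interface contributions of the
     * neighbors, e.g., via VecAccu(nx,ny,w,neigh,color,myid,icomm).          */
    void ToAccumulated(int n, const double *r, double *w)
    {
        for (int i = 0; i < n; i++)
            w[i] = r[i];
        /* ... neighbor communication (VecAccu) acts on w here ... */
    }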

Inner product.

The inner product of two vectors of different type requires communication only for the global sum of one scalar:

    (w, r) = wᵀ r = wᵀ Σ_{i=1}^{P} Aiᵀ ri = Σ_{i=1}^{P} (Ai w)ᵀ ri = Σ_{i=1}^{P} (wi, ri).        (5.7)

Every other combination of vectors requires a conversion of the type, possibly with communication.
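Relation (5.7) is exactly what Exercise E5.3 asks for: one local scalar product per process followed by a single global reduction with MPI_Allreduce. A minimal sketch (w accumulated, r distributed, ns the local length):

    #include <mpi.h>

    /* Global inner product (5.7) of an accumulated vector w (Type-I) and a
     * distributed vector r (Type-II).                                       */
    double Skalar(int ns, const double *w, const double *r, MPI_Comm icomm)
    {
        double s_local = 0.0, s_global = 0.0;

        for (int i = 0; i < ns; i++)        /* local contribution (w_i, r_i) */
            s_local += w[i] * r[i];

        MPI_Allreduce(&s_local, &s_global, 1, MPI_DOUBLE, MPI_SUM, icomm);

        return s_global;
    }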


5.4 Matrix-vector operations

Here, we investigate matrix-vector multiplication with matrices and vectors of various types. A more abstract presentation of this topic, with proofs of the following statements, can be found in [48, 52, 56].

1. Type-II matrix × Type-I vector results in a Type-II vector. Indeed, we obtain from definition (5.1):

       K · w = Σ_{i=1}^{P} Aiᵀ Ki Ai · w = Σ_{i=1}^{P} Aiᵀ (Ki · wi) = Σ_{i=1}^{P} Aiᵀ ri = r.    (5.8)

   The (not necessary) execution of the summation (implying communication) results in a Type-I vector.

2. Type-II matrix × Type-II vector needs a conversion of the vector prior to the multiplication in item 1.

3. The operation Type-I matrix × Type-I vector cannot be performed with arbitrary Type-I matrices M. For a detailed analysis, we study the operation u = M · w at node n in Fig. 5.2. The local multiplication ui = Mi · wi should result in a Type-I vector u at node n in both subdomains:

       u[n]2 = M[n,n]2 w[n]2 + M[n,m]2 w[m]2 + M[n,k]2 w[k]2 + M[n,p]2 w[p]2 + M[n,s]2 w[s]2,
       u[n]4 = M[n,n]4 w[n]4 + M[n,m]4 w[m]4 + M[n,k]4 w[k]4 + M[n,q]4 w[q]4 + M[n,r]4 w[r]4.

   Terms 4 and 5 of the right-hand sides in the above equations differ, so that processes 2 and 4 obtain different results instead of a unique one. The reason lies in the transport of information via the matrix from a process Pa to a node belonging additionally to another process Pb.

   Denote by σ[i] = {s : x(i) ∈ Ω̄s} the set of all processors a node i belongs to, e.g., in Fig. 5.2, σ[n] = {2, 4}, σ[p] = {1, 2}, σ[s] = {2}. Then the transport of information from a node j to a node i is only admissible if σ[i] ⊆ σ[j]. Representing the matrix entries as a directed graph, e.g., the entries q → k, q → n, s → n, s → m, s → p, p → k, n → k, p → n, n → p, ℓ → k, k → ℓ in Fig. 5.2 are not admissible.

   As shown in [52], the operation u = M · w can be performed without communication only for the following special structure:

       M = ( MV    0    0  )
           ( MEV   ME   0  )          ⟹   u = M · w,                               (5.9)
           ( MIV   MIE  MI )

   with block-diagonal matrices MI, ME, and MV. In particular, the submatrices must not contain entries between nodes belonging to different sets of processors. The proof of (5.9) can be found in [53, 54].


The block-diagonality of MI and the correct block structure of MIE and MIV are guaranteed by the data decomposition. However, the mesh must fulfill the following two requirements in order to guarantee the correct block structure for the remaining three matrices:

(a) There is no connection between vertices belonging to different sets of subdomains. This guarantees the block-diagonality of MV (often MV is a diagonal matrix). There is no connection between vertices and opposite edges, e.g., in Fig. 5.2, ℓ ↛ p. This ensures the admissible structure of MEV (and MVE).

(b) There is no connection between edge nodes belonging to different sets of subdomains. This guarantees the block-diagonality of ME.

Requirement (a) can easily be fulfilled if at least one node is located on all edges between two vertices and one node lies in the interior of each subdomain (or has been added therein). A modification of a given mesh generator also guarantees that requirement (b) is fulfilled, e.g., the edge between nodes p and n is omitted. Fig. 5.3 presents the revised mesh.

• This strategy can be applied only to triangular or tetrahedral finite elements.

• In general, rectangular elements do not fulfill condition (b).


Figure 5.3: Non-overlapping elements with a revised discretization.


4. Similar to the previous point, Type-I matrix × Type-II vector cannot be performed with general Type-I matrices M, although we would expect a Type-II vector as the result. Assuming that the necessary requirements (a) and (b) are fulfilled by the mesh (see Fig. 5.3), we focus our interest on the operation fI = MIC · rC. Performing this operation locally on processor 4 gives

       f[r]4 = M[r,n]IC,4 · r[n]4 + M[r,m]IC,4 · r[m]4.

   Due to the Type-II vector property, we miss the entries M[r,n]IC,2 · r[n]2 and M[r,m]IC,2 · r[m]2 in the result. The difference with respect to the previous point is that the transport of information from a node j to a node i is permitted only if σ[i] ⊇ σ[j]. Thus, just a matrix of the shape

       M = ( MV   MVE  MVI )
           ( 0    ME   MEI )        ⟹   f = Σ_{i=1}^{P} Aiᵀ fi = Σ_{i=1}^{P} Aiᵀ (Mi ri),         (5.10)
           ( 0    0    MI  )

   with block-diagonal matrices MV, ME, MI can be used in this type of matrix-vector multiplication.

Remark: Items 3 and 4 can be combined under the assumption that MV, ME, MI are block-diagonal matrices (guaranteed by the mesh in Fig. 5.3). Denote by ML, MU and MD the strictly lower, strictly upper and diagonal parts of M. Then we can perform the Type-I matrix-vector multiplication for all types of vectors:

    w = M · u := (ML + MD) · u + Σ_{i=1}^{P} Aiᵀ MU,i Ri⁻¹ · ui                    (5.11a)
    w = M · r := (ML + MD) Σ_{i=1}^{P} Aiᵀ · ri + Σ_{i=1}^{P} Aiᵀ MU,i · ri        (5.11b)
    f = M · u := R⁻¹ (ML + MD) · u + MU R⁻¹ · u                                    (5.11c)
    f = M · r := R⁻¹ (ML + MD) Σ_{i=1}^{P} Aiᵀ · ri + MU · r.                      (5.11d)

Note that in any case the above multiplications require two type conversions, but the choice of the vectors directly influences the amount of communication needed.

If the matrix M is available as the factored product M = L⁻¹U⁻¹, with the lower and upper triangular matrices L⁻¹ and U⁻¹ fulfilling conditions (a) and (b), then we write

    w = L⁻¹U⁻¹ · r := L⁻¹ Σ_{i=1}^{P} Aiᵀ Ui⁻¹ · ri                                (5.12a)
    f = L⁻¹U⁻¹ · u := R⁻¹ L⁻¹ Σ_{i=1}^{P} Aiᵀ Ui⁻¹ Ri⁻¹ · ui.                       (5.12b)

The second case, i.e., M = U⁻¹L⁻¹, results in

    w = U⁻¹L⁻¹ · r := Σ_{i=1}^{P} Aiᵀ Ui⁻¹ Ri⁻¹ Ai L⁻¹ · ( Σ_{j=1}^{P} Ajᵀ rj )     (5.13a)
    f = U⁻¹L⁻¹ · u := U⁻¹ R⁻¹ L⁻¹ · u.                                             (5.13b)

The equations for the remaining combinations of vector types in (5.12) and (5.13) can easily be derived by type conversion. Here, the number and the kind of type conversions are also determined by the factoring method used with the matrix M.

A similar methodology can be applied if other data distributions, e.g., overlapping nodes, are considered.

Exercises

Let some Double Precision vector u be stored blockwise disjoint, i.e., distributed over all processes s (s = 0, . . . , P−1) such that u = (u0ᵀ, . . . , uP−1ᵀ)ᵀ.

E5.1. Write a routine
    DebugD(nin,xin,icomm)
that prints nin Double Precision numbers of the array xin. Start the program with several processes. As a result, all processes will write their local vectors at roughly the same time. Hence, you must look carefully through the output for the data of process s.
Improve the routine DebugD so that process 0 reads from the terminal the number of the process that should write its vector. Use
    MPI_Bcast
to broadcast this information and let the processes react appropriately. If necessary, use MPI_Barrier to synchronize the output.

E5.2. Exchange the global minimum and maximum of the vector u. Use MPI_Gather, MPI_Scatter/MPI_Bcast and ExchangeD.

How can you reduce the amount of communication?

Hint: Compute the local minimum/maximum. Afterwards let some process determine the global quantities.

Alternatively, you can use MPI_Allreduce and the operations MPI_MINLOC or MPI_MAXLOC.

E5.3. Write a routine
    Skalar(ns,x,y,icomm)
for computing the global scalar product ⟨x, y⟩ of two Double Precision vectors x and y with local lengths ns. Use
    MPI_Allreduce with the operation MPI_SUM.

Let the unit square [0, 1]² be partitioned uniformly into procx × procy rectangles Ωi, numbered row by row. The numbering of the subdomains coincides with the corresponding process id's (or ranks, in MPI jargon).


(The processes form a procx × procy grid numbered row by row, from 0, 1, . . . , procx−1 in the first row up to procx∗(procy−1), . . . , procx∗procy−1 in the last row; the four sides of each subdomain are referred to as South, North, West, and East.)

from example/accuc generates the topological relations corresponding to the do-main decomposition defined above. This information is stored in the integer arrayneigh(4). A checkerboard coloring is defined in color. Moreover, the function

IniCoord(myid,procx,procy,xl,xr,yb,yt)can be used to generate the coordinates of the lower left corner (xl, yb) and theupper right corner (xr, yt) of each subdomain.

E5.4. Implement a local data exchange of a double precision number between each processor and all of its neighbors (connected by a common edge). Use the routine ExchangeD from E3.7.

Let each subdomain Ωi be uniformly discretized into nx ∗ ny rectangles generating a triangular mesh (nx, ny are the same for all subdomains).
If we use linear f.e. test functions, then each vertex of the triangles represents one component of the solution vector, e.g., the temperature at this point, and we have nd := (nx + 1) ∗ (ny + 1) local unknowns within one subdomain. We propose a local ordering of the unknowns in rows. Note that the global number of unknowns is N = (procx ∗ nx + 1) ∗ (procy ∗ ny + 1) (< procx ∗ procy ∗ nd).
The function
    GetBound(id,nx,ny,w,s)
copies the values of w corresponding to the boundary South (id=1), East (id=2), North (id=3), West (id=4) into the auxiliary vector s. Vice versa, the function
    AddBound(id,nx,ny,w,s)
adds the values of s to the components of w corresponding to the nodes on the boundary South (id=1), East (id=2), North (id=3), and West (id=4). These functions can be used for the accumulation (summation) of values corresponding to the nodes on the interfaces between two adjacent subdomains, which is a typical and necessary operation.

E5.5. Write a routine which accumulates a distributed Double Precision vector w.The call of such a routine could look like

VecAccu(nx,ny,w,neigh,color,myid,icomm)where w is both the input and output vector.

E5.6. Use the routine from E5.5 to determine the number of subdomains sharing anode.



Figure 5.4: Four subdomains in local numbering with local discretization nx = ny = 4 and global discretization Nx = Ny = 8.

E5.7. Write a routine that implements the parallel multiplication of a distributed matrix in compressed row storage (CRS) format with an accumulated vector (a sketch of a possible local CRS kernel is given after E5.8).

E5.8. Determine the accumulated diagonal of a distributed matrix in CRS format.
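For E5.7, one possible local kernel is sketched below; applying it to the local CRS matrix and the accumulated local vector yields the local part of a distributed result without any communication (5.8). The array names follow the conventions of the accompanying course code (id: row pointers, ik: column indices, sk: values), while the 0-based indexing is an assumption of this sketch.

    /* w := K * u for one process: K is the local CRS matrix, u the
       accumulated local vector.  The result w is of distributed type;
       no communication is needed. */
    void CrsMultLocal(int n, const int *id, const int *ik, const double *sk,
                      const double *u, double *w)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = id[i]; j < id[i + 1]; j++)   /* nonzeros of row i */
                sum += sk[j] * u[ik[j]];
            w[i] = sum;
        }
    }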


Chapter 6

Classical solvers

The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.
—Ada Augusta, Countess of Lovelace (1815-1851).

In this chapter we solve n×n sparse systems of linear equations

Ku = f. (6.1)

We explicitly assume that the system of equations above was generated by discretizing an elliptic PDE. The matrix K is sparse if FEM, FDM, or FVM has been used for the discretization. In the following, we assume exactly this sparsity pattern of the matrix. A detailed description of the numerical algorithms used in this chapter can be found in [2, 8, 81]. A different introduction to the parallelization of these algorithms is presented in [26].

All parallel algorithms refer to the non-overlapping element distribution and the basic routines in Sections 5.2-5.4. The appropriate vector and matrix types are denoted in the same way as therein. The use of any other data distribution in the parallelization will be emphasized explicitly.

If we know in advance the amount of arithmetic work needed to solve (6.1), not taking into account round-off errors, then we call this solution method a direct one. Otherwise, it is an iterative method, since the number of iterations is data dependent.

Gaussian elimination could be used in principle to solve (6.1) that results from a discretization with a typical discretization parameter h in the m = 2, 3 dimensional space. Unfortunately, the solution time is of order O(n³) = O(h^{−3m}) for dense matrices, i.e., doubling the number of unknowns increases the solution time by a factor of 8. Even if we take into account the bandwidth of sparse matrices, BW = O(h^{−m+1}), the arithmetical costs are only reduced to O(n · BW²) = O(h^{−3m+2}). Therefore, iterative techniques are introduced which require only O(h^{−m−2} ln(ε)) operations for sparse systems (with an achieved accuracy of ε at termination). It is even possible to achieve O(n) = O(h^{−m}) operations (see Chapter 7).

6.1 Direct methods

6.1.1 LU factorization

K - dense matrix, row-wise stored,
L - lower triangular matrix with normalization ℓ_{i,i} = 1, column-wise stored,
U - upper triangular matrix, row-wise stored,

and ∑_{j=1}^{n} L_{i,j} · U_{j,k} = K_{i,k} holds. We investigate the rank-r modification in Alg. 6.1 for the LU factorization without pivot search, as depicted in Fig. 6.1.

DO i = 1, n
   DO k = i, n                          Determine ith row of U
      U_{i,k} := K_{i,k}
   END DO
   L_{i,i} := 1
   DO k = i+1, n                        Determine ith column of L
      L_{k,i} := K_{k,i} / U_{i,i}
   END DO
   DO k = i+1, n                        Transformation of the remainder matrix
      DO j = i+1, n
         K_{k,j} := K_{k,j} − L_{k,i} · U_{i,j}
      END DO
   END DO
END DO

Algorithm 6.1: Rank-r modification of LU factorization - sequential.
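A compact C sketch of Alg. 6.1 for a dense, row-wise stored matrix is given below (purely illustrative; the factorization is done in place and no pivoting is performed):

    /* In-place LU factorization without pivot search, K is n x n, row-wise
       stored: K[i*n+j].  After the call, U occupies the upper triangle and
       L (with unit diagonal, not stored) the strict lower triangle. */
    void lu_factor(int n, double *K)
    {
        for (int i = 0; i < n; i++) {
            for (int k = i + 1; k < n; k++) {
                K[k*n + i] /= K[i*n + i];             /* ith column of L   */
                for (int j = i + 1; j < n; j++)       /* update remainder  */
                    K[k*n + j] -= K[k*n + i] * K[i*n + j];
            }
        }
    }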

Figure 6.1: Illustration of the rank-r modification (the ith row U_{i,k}, the ith column L_{k,i}, and the remainder matrix).

Parallelization of the LU factorization

For the purpose of parallelization on a distributed memory computer, a block variant of the rank-r modification as in Alg. 6.2 seems preferable. Now, K_{k,i} denotes the rows and columns of the block splitting of the matrix K.

DO i = 1, n
   a) Factor K_{k,i} such that K_{k,i} = L_{k,i} · U_{i,i}   (k = i,...,n)
         −→ U_{i,i}, L_{i,i}, L_{i+1,i}, ..., L_{n,i}
   b) Determine U_{i,ℓ} such that K_{i,ℓ} = L_{i,i} · U_{i,ℓ}   (ℓ = i+1,...,n)
         −→ U_{i,i+1}, ..., U_{i,n}
   c) Rank-r modification of the remainder matrix
         K_{k,ℓ} := K_{k,ℓ} − L_{k,i} · U_{i,ℓ}   (k, ℓ = i+1,...,n)
END DO

Algorithm 6.2: Block variant of rank-r modification

If we use a row-wise or column-wise distribution of the matrix K on the processors, then the smaller the remainder matrix gets, the fewer processors are used (see Fig. 6.1). Therefore, we use a square block scattered decomposition as it is implemented in ScaLAPACK, a parallel version of LAPACK.

Figure 6.2: Scattered distribution of a matrix: 5 × 5 matrix blocks on a 3 × 3 processor grid; block K_{i,j} is stored on processor P_{(i−1) mod Px, (j−1) mod Py}.
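In code, the owner of a block in this scattered (block-cyclic) distribution can be computed directly from the block indices. A small sketch, assuming the Px × Py processes are numbered row by row:

    /* Rank of the process that stores block K_{i,j} (1-based block indices)
       in a block-cyclic distribution over a Px x Py process grid. */
    int block_owner(int i, int j, int Px, int Py)
    {
        int pr = (i - 1) % Px;        /* process row    */
        int pc = (j - 1) % Py;        /* process column */
        return pr * Py + pc;          /* row-wise process numbering */
    }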

Now we can formulate the parallel block version by using the scattered distribution of K.

Non-blocking communication seems advantageous from the implementation point of view. A distribution of the blocks in the direction of rows and columns similar to the hypercube numbering (a ring is embedded in the hypercube [76, 90]) is also possible.

Process P(K_{i,j}) stores block K_{i,j}.

DO i = 1, n
   a) P(K_{i,i}):
         Determine L_{i,i} and U_{i,i} from K_{i,i} = L_{i,i} · U_{i,i}.
         Send L_{i,i} to all P(K_{i,ℓ}) via the ring in direction of the columns (ℓ = i+1,...,n).
         Send U_{i,i} to all P(K_{k,i}) via the ring in direction of the rows (k = i+1,...,n).
   b) P(K_{k,i}) [k = i+1,...,n]:
         Determine L_{k,i} from K_{k,i} = L_{k,i} · U_{i,i}.
         Send L_{k,i} to all P(K_{k,ℓ}) via the ring in direction of the columns (ℓ = i+1,...,n).
      P(K_{i,ℓ}) [ℓ = i+1,...,n]:
         Determine U_{i,ℓ} from K_{i,ℓ} = L_{i,i} · U_{i,ℓ}.
         Send U_{i,ℓ} to all P(K_{k,ℓ}) via the ring in direction of the rows (k = i+1,...,n).
   c) P(K_{k,ℓ}) [k, ℓ = i+1,...,n]:
         K_{k,ℓ} := K_{k,ℓ} − L_{k,i} · U_{i,ℓ}
END DO

Algorithm 6.3: Block wise rank-r modification of LU factorization - parallel.

6.1.2 ILU factorization

If we use the LU factorization from Section 6.1.1 for solving (6.1), then the matrix K will be factored into an upper triangular matrix U and a lower triangular matrix L. Solving the resulting staggered systems of equations with the triangular matrices is fast, but the time for the factorization

K =: LU,    (6.2)

behaves like O(n · BW²). If the matrix is sparse (FEM, FDM, ...), then the fill-in in the factorization increases the memory requirements unacceptably when storing the factored matrix.

Therefore, one tries to factor the matrix K into triangular matrices L and U with the same matrix pattern as K. Equation (6.2) has to be modified by taking the remainder matrix P into account:

K = LU − P.    (6.3)

There exist special modifications for symmetric and positive definite matrices (IC = Incomplete Cholesky) and for factorizations allowing a certain fill-in in the pattern of the triangular matrices (ILU(m)) [26, 106]. They are often used in combination with lumping techniques, i.e., entries not fitting into the admissible pattern are added to the main diagonal (e.g., MAF [82]).

In the following, we investigate the incomplete factorization (6.3) for non-symmetric matrices K without fill-in, i.e., the classical ILU(0) factorization, and the preconditioning step LUw = r.
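For illustration, a minimal ILU(0) sketch is given below. It uses dense storage together with an explicit pattern mask only to keep the example short; the course code stores the matrices in CRS format instead.

    /* ILU(0): in-place incomplete factorization of the n x n matrix K,
       restricted to the given sparsity pattern (pattern[i*n+j] != 0 iff
       K_{i,j} is a structural nonzero).  On return the strict lower triangle
       holds L (unit diagonal not stored) and the upper triangle holds U. */
    void ilu0(int n, double *K, const int *pattern)
    {
        for (int i = 1; i < n; i++) {
            for (int k = 0; k < i; k++) {
                if (!pattern[i*n + k]) continue;      /* stay inside the pattern */
                K[i*n + k] /= K[k*n + k];
                for (int j = k + 1; j < n; j++)
                    if (pattern[i*n + j])             /* discard any fill-in     */
                        K[i*n + j] -= K[i*n + k] * K[k*n + j];
            }
        }
    }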

Sequential algorithm

We write the block-wise LU factorization of a matrix K using the block structure of vectors and matrices introduced in (5.2) from Sec. 5.2.

Determine
   L_V, U_V :   K_V   = L_V  · U_V  − P_V
   L_{EV}   :   K_{EV} = L_{EV} · U_V − P_{EV}
   L_{IV}   :   K_{IV} = L_{IV} · U_V − P_{IV}
   U_{VE}   :   K_{VE} = L_V · U_{VE} − P_{VE}
   U_{VI}   :   K_{VI} = L_V · U_{VI} − P_{VI}

   K_E    := K_E    − L_{EV} · U_{VE}
   K_{EI} := K_{EI} − L_{EV} · U_{VI}
   K_{IE} := K_{IE} − L_{IV} · U_{VE}
   K_I    := K_I    − L_{IV} · U_{VI}

Determine
   L_E, U_E :   K_E   = L_E · U_E − P_E
   L_{IE}   :   K_{IE} = L_{IE} · U_E − P_{IE}
   U_{EI}   :   K_{EI} = L_E · U_{EI} − P_{EI}

   K_I := K_I − L_{IE} · U_{EI}

Determine
   L_I, U_I :   K_I = L_I · U_I − P_I

Algorithm 6.4: Sequential block ILU-Factorization.

The remainder matrices P in Alg. 6.4 are only used for notational purposes and are omitted in the implementation. Pointwise ILU is used within the single blocks. After the factorization in Algorithm 6.4, the solution is obtained using Algorithm 6.5.

I) Solve Lu = r:
      u_V := L_V^{-1} r_V
      u_E := L_E^{-1} (r_E − L_{EV} u_V)
      u_I := L_I^{-1} (r_I − L_{IE} u_E − L_{IV} u_V)

II) Solve Uw = u:
      w_I := U_I^{-1} u_I
      w_E := U_E^{-1} (u_E − U_{EI} w_I)
      w_V := U_V^{-1} (u_V − U_{VE} w_E − U_{VI} w_I)

Algorithm 6.5: Sequential block wise substitution steps for LUw = r.

Parallel ILU-Factorization

The first idea for the parallelization of ILU is the simple rewriting of Alg. 6.4 with an accumulated matrix K. Naturally, L and U are then also accumulated matrices. However, due to (5.9) and (5.10), we have to fulfill requirements 3a and 3b (from page 77) on the mesh. The redundantly stored matrices K_V, K_E, K_{EV}, and K_{VE} are not updated before their (pointwise) factorization, so that the matrix accumulation can be performed at the beginning.

The forward and backward substitution step w = U^{-1} · L^{-1} · r contains the matrix-vector multiplications (5.9) and (5.10), requiring 3 vector type changes in total.

Start:   K = ∑_{i=1}^{P} A_i^T K_i A_i

Determine                                              Why parallel?
   L_V, U_V :  K_V   = L_V · U_V − P_V                 mesh → diagonal matrix
   L_{EV}   :  K_{EV} = L_{EV} · U_V − P_{EV}          DD, mesh → block matrices
   L_{IV}   :  K_{IV} = L_{IV} · U_V − P_{IV}          DD, mesh → block matrices
   U_{VE}   :  K_{VE} = L_V · U_{VE} − P_{VE}          DD, mesh → block matrices
   U_{VI}   :  K_{VI} = L_V · U_{VI} − P_{VI}          DD, mesh → block matrices
Modify
   K_E    := K_E    − L_{EV} · U_{VE}                  same matrix type
   K_{EI} := K_{EI} − L_{EV} · U_{VI}                  same matrix type
   K_{IE} := K_{IE} − L_{IV} · U_{VE}                  same matrix type
   K_I    := K_I    − L_{IV} · U_{VI}                  same matrix type
Determine
   L_E, U_E :  K_E   = L_E · U_E − P_E                 DD, mesh → block matrices
   L_{IE}   :  K_{IE} = L_{IE} · U_E − P_{IE}          DD → block matrices
   U_{EI}   :  K_{EI} = L_E · U_{EI} − P_{EI}          DD → block matrices
Modify
   K_I := K_I − L_{IE} · U_{EI}                        same matrix type
Determine
   L_I, U_I :  K_I = L_I · U_I − P_I                   DD → block matrices

Algorithm 6.6: Parallelized LU-factorization.

The diagonal matrix R, which contains for each node the number of subdomains sharing it, was defined in (5.4).

I)   r := ∑_{i=1}^{P} A_i^T r_i
II)  u_V := L_V^{-1} r_V
     u_E := L_E^{-1} (r_E − L_{EV} u_V)
     u_I := L_I^{-1} (r_I − L_{IE} u_E − L_{IV} u_V)
III) u := R^{-1} u
IV)  w_I := U_I^{-1} u_I
     w_E := U_E^{-1} (u_E − U_{EI} w_I)
     w_V := U_V^{-1} (u_V − U_{VE} w_E − U_{VI} w_I)
V)   w := ∑_{i=1}^{P} A_i^T w_i

Algorithm 6.7: Parallel forward and backward substitution step for LUw = r.

The disadvantage of Alg. 6.7 is that the wrong vector type is always available. This leads to 3 vector type changes, including 2 accumulations, and means doubling the communication costs per iteration compared with the ω-Jacobi iteration. This communication behavior can be improved: according to (5.12a), the substitution step w = L^{-1} · U^{-1} · r should require only one vector type change (including one accumulation), as demonstrated in the next paragraph.


Parallel IUL factorization

In order to reduce the communication in Alg. 6.7, we use an incomplete UL factorization where the unknowns are passed in the reverse order. Thus the redundantly stored matrices K_V, K_E, K_{EV}, and K_{VE} are updated locally during the factorization in Alg. 6.8 (page 89), so that their accumulation may only be performed immediately before their factorization. The substitution step in Algorithm 6.9 is a simple application of (5.12a).

Start:   K

Determine                                              Why parallel?
   U_I, L_I :  K_I   = U_I · L_I − P_I                 DD → block matrices
   L_{IE}   :  K_{IE} = U_I · L_{IE} − P_{IE}          DD → block matrices
   L_{IV}   :  K_{IV} = U_I · L_{IV} − P_{IV}          DD → block matrices
   U_{EI}   :  K_{EI} = U_{EI} · L_I − P_{EI}          DD → block matrices
   U_{VI}   :  K_{VI} = U_{VI} · L_I − P_{VI}          DD → block matrices
Modify
   K_E    := K_E    − U_{EI} · L_{IE}                  same matrix type
   K_{EV} := K_{EV} − U_{EI} · L_{IV}                  same matrix type
   K_{VE} := K_{VE} − U_{VI} · L_{IE}                  same matrix type
   K_V    := K_V    − U_{VI} · L_{IV}                  same matrix type
Accumulate K_E, K_{EV}, K_{VE},
   e.g., K_{EV} := ∑_{i=1}^{P} A_{E,i}^T K_{EV,i} A_{V,i}              —
Determine
   U_E, L_E :  K_E   = U_E · L_E − P_E                 DD, mesh → block matrices
   U_{VE}   :  K_{VE} = U_{VE} · L_E − P_{VE}          DD, mesh → block matrices
   L_{EV}   :  K_{EV} = U_E · L_{EV} − P_{EV}          DD, mesh → block matrices
Modify
   K_V := K_V − U_{VE} · R_E^{-1} · L_{EV}             same matrix type (5.13a)
Accumulate K_V                                         —
Determine
   U_V, L_V :  K_V = U_V · L_V − P_V                   mesh → diagonal matrix

Algorithm 6.8: Parallel IUL factorization.

I)   u_I := U_I^{-1} r_I
     u_E := U_E^{-1} (r_E − U_{EI} u_I)
     u_V := U_V^{-1} (r_V − U_{VE} u_E − U_{VI} u_I)
II)  u := ∑_{i=1}^{P} A_i^T u_i
III) w_V := L_V^{-1} u_V
     w_E := L_E^{-1} (u_E − L_{EV} w_V)
     w_I := L_I^{-1} (u_I − L_{IE} w_E − L_{IV} w_V)

Algorithm 6.9: Parallel backward and forward substitution step for ULw = r.


Comparison of ILU- and IUL-factorization

The two algorithms introduced in the previous two sections differ mainly in their communication behavior, which is summarized in Table 6.1.

                        ILU: LU                      IUL: UL
 start up               matrix accumulation          —
 factorization          —                            matrix accumulation
 substitution           2× vector accumulation       1× vector accumulation

Table 6.1. Communication in ILU and IUL

The accumulation of the matrix is split into separate calls for the accumulation of cross points and interface nodes. Therefore, the communication loads in Algs. 6.8 and 6.6 are the same. According to the FE discretization, the stiffness matrix K is stored distributed (Sec. 5.2), so that we cannot save any communication there.

With respect to the communication, we recommend using the IUL factorization in parallel implementations.

IUL-factorization with reduced pattern

The requirements on the matrix from Sec. 5.4 cannot be fulfilled in the 3D case or in the 2D case with bilinear/quadratic/... elements. Instead of the original matrix K, we have to factor a nearby matrix K̃ with the proper reduced pattern. This matrix K̃ can be derived by deleting all entries not included in the reduced pattern or by lumping these entries. We also have to change the internal storage pattern of K (usually stored in some index arrays) in any case, since merely setting the non-admissible entries to zero causes errors in the factorization of K̃.

6.2 Smoothers

In the multigrid literature, the term smoother has become synonymous with all iterative methods. However, it is a term which has been abused frequently by many authors (all of us included at one time or another). The term smoother was used in [14] to describe the effect of one iteration of an iterative method on each of the components of the error vector.

For many relaxation methods, the norm of each error component is reduced in each iteration: hence, the term smoother. This means that if the error components are each thought of as wave functions, then the waves are smoothed out somewhat in each iteration. The most effective smoothers are variants of Gaussian elimination, which are one-iteration smoothers (note that Gaussian elimination is not usually referred to as a smoother in the multigrid literature, but as a direct method instead). Other examples of smoothers include Jacobi, Gauss-Seidel, underrelaxation methods, and ADI.

Not all iterative methods are smoothers, however. Some are roughers [28, 29]. For many iterative methods, while the norm of the error vector is reduced in each iteration, the norm of some of the components of the error may grow in each iteration; hence, the term rougher. For example, all conjugate gradient-like methods, SSOR, and overrelaxation methods are roughers. Roughers are treated in Sec. 6.3.

6.2.1 ω-Jacobi iteration

Sequential algorithm

Let us denote the main diagonal of K by D = diag(K), and the residual and the correction by r and w, respectively. Then we can write the ω-Jacobi iteration, using the KD^{-1}K-norm in the stopping criterion, in the following way:

Choose u^0
r := f − K · u^0
w := D^{-1} · r
σ := σ_0 := (w, r)
k := 0
while σ > tolerance² · σ_0 do
   k := k + 1
   u^k := u^{k−1} + ω · w
   r := f − K · u^k
   w := D^{-1} · r
   σ := (w, r)
end

Algorithm 6.10: Sequential Jacobi iteration.
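A sequential C sketch of Alg. 6.10 is given below; the dense matrix storage is an assumption made only to keep the example self-contained (the course code uses CRS matrices):

    #include <stdlib.h>

    /* One possible sequential omega-Jacobi iteration for a dense n x n
       matrix K, following Alg. 6.10: r = f - K*u, w = D^{-1} r, sigma = (w,r). */
    void jacobi(int n, const double *K, const double *f, double *u,
                double omega, double tol, int maxit)
    {
        double *r = malloc(n * sizeof(double));
        double *w = malloc(n * sizeof(double));
        double sigma = 0.0, sigma0;
        for (int i = 0; i < n; i++) {                /* initial residual and correction */
            double s = f[i];
            for (int j = 0; j < n; j++) s -= K[i*n + j] * u[j];
            r[i] = s;  w[i] = s / K[i*n + i];
            sigma += w[i] * r[i];
        }
        sigma0 = sigma;
        for (int k = 0; k < maxit && sigma > tol * tol * sigma0; k++) {
            for (int i = 0; i < n; i++) u[i] += omega * w[i];   /* u := u + omega*w */
            sigma = 0.0;
            for (int i = 0; i < n; i++) {            /* recompute r, w, sigma */
                double s = f[i];
                for (int j = 0; j < n; j++) s -= K[i*n + j] * u[j];
                r[i] = s;  w[i] = s / K[i*n + i];
                sigma += w[i] * r[i];
            }
        }
        free(r); free(w);
    }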

Data flow of Jacobi iteration

There are no data dependencies between any components of the vector u^k (the kth iterative solution), so the parallelization of this algorithm is quite easy. The data flow chart is presented in Fig. 6.3.


Figure 6.3: Data flow of Jacobi iteration.


Parallel algorithm

The parallelization strategy is based on non-overlapping elements.

• Store the matrix K as a distributed matrix K (5.1). Otherwise, we have to assume restrictions on the matrix pattern.

• The vectors u^k → u^k are stored accumulated (Type-I). Therefore, the matrix-vector product results in a Type-II vector without any communication (5.8). Consequently, the vectors r, f → r, f are stored distributed (Type-II) and are never accumulated during the iteration.

• By using an accumulated vector w we can perform the update u^{k−1} → u^k by means of vector operations without any communication.

• Due to the different vector types used in the inner product (w, r), that operation requires only a very small amount of communication (5.7).

• The calculation of the correction w needs the inversion of D. This can only be done on an accumulated matrix. The calculation of

      D^{-1} = diag^{-1}( ∑_{s=1}^{P} A_s^T K_s A_s ) = ( ∑_{s=1}^{P} A_s^T diag(K_s) A_s )^{-1}

  requires one communication step in the iteration setup.

• D^{-1} is a diagonal matrix and therefore we can apply D^{-1} not only to an accumulated vector but also to a distributed one. This leads to the following numerically equivalent (and equally expensive) formulations for the parallel calculation of the correction

      w := ∑_{s=1}^{P} A_s^T D_s^{-1} · r_s = D^{-1} · ∑_{s=1}^{P} A_s^T · r_s ,

  requiring only one communication per iteration sweep.

If we store the diagonal matrix D^{-1} as a vector d, then the operation D^{-1} · ( ∑_{s=1}^{P} A_s^T r_s ) has to be implemented as a vector accumulation (communication) followed by the component-wise multiplication of the two accumulated vectors, and it results in a vector of the same type. Each step of the Jacobi iteration requires only one vector accumulation and one All Reduce of a real number in the inner product.

If we define w := ∑_{s=1}^{P} A_s^T r_s as an auxiliary vector, this changes the update step into

      u^k := u^{k−1} + ω · D^{-1} · w = u^{k−1} + ω · d ⊛ w .

Here, we denote by d ⊛ w the component-wise multiplication of two vectors, i.e., {d_i · w_i}_{i=1,...,n}, which represents the multiplication of a vector by a diagonal matrix.

D^{-1} := diag^{-1}( ∑_{s=1}^{P} A_s^T K_s A_s )
Choose u^0
r := f − K · u^0
w := D^{-1} · ∑_{s=1}^{P} A_s^T r_s
σ := σ_0 := (w, r)
k := 0
while σ > tolerance² · σ_0 do
   k := k + 1
   u^k := u^{k−1} + ω · w
   r := f − K · u^k
   w := D^{-1} · ∑_{s=1}^{P} A_s^T r_s
   σ := (w, r)
end

Algorithm 6.11: Parallel Jacobi iteration: Jacobi(K, u0, f).

The calculation of the inner product is no longer necessary if the Jacobi iteration is used with a fixed number of iterations (this is typical for multigrid smoothers). This saves the All Reduce operation in the parallel code, and the vectors r and w can be stored in one place.

6.2.2 Gauss-Seidel iteration

Denoting by E and F the strictly lower and upper triangular submatrices and by D the main diagonal of the sparse matrix K from (6.1), we can write K = E + D + F. We use a residual norm in the stopping criterion.

Sequential algorithm

The reason for the generally better convergence behavior of the Gauss-Seidel iteration compared to the ω-Jacobi iteration (Alg. 6.10) consists in the continuous update of the residual whenever a new component u_i^k of the kth iterate has been calculated:

      r_i := f_i − ∑_{j=1}^{i−1} K_{i,j} u_j^k − ∑_{j=i}^{n} K_{i,j} u_j^{k−1}
      u_i^k := u_i^{k−1} + K_{i,i}^{-1} · r_i        (6.4)

Now the pointwise residual r_i, and therefore the iterate u_i^k in (6.4), depends on the numbering of the unknowns in the vector u. Hence, the whole Gauss-Seidel iteration depends on the numbering of the unknowns.

Choose u^0
r := f − K · u^0
σ := σ_0 := (r, r)
k := 0
while σ > tolerance² · σ_0 do
   k := k + 1
   u^k := u^{k−1} + D^{-1} · ( f − E · u^k − (D + F) · u^{k−1} )
   r := f − K · u^k
   σ := (r, r)
end

Algorithm 6.12: Sequential Gauss-Seidel forward iteration.

In order to save arithmetic work, the norm (r, r) of the true residual is replaced by the norm built from the pointwise residuals r_i of (6.4).

Data flow of the Gauss-Seidel forward iteration

As presented in Alg. 6.12 and Fig. 6.4, the components of u^k are coupled, e.g., u_i^k cannot be calculated before u_{i−1}^k has been determined. In general, all u_s^k (s < i) for which K_{i,s} ≠ 0 holds (graph/pattern of the matrix) have to be calculated before u_i^k can be updated. Due to this coupling of the components of u^k, the direct parallelization becomes fine grain, and therefore the pointwise Gauss-Seidel iteration is nearly impossible to parallelize.


Figure 6.4: Data flow of Gauss-Seidel forward iteration.


Red-Black Gauss-Seidel iteration

To get rid of these unfavorable properties with respect to the data flow, we have to split the subscripts of the vector into at least two disjoint subsets ω_red and ω_black such that K_{i,j} ≡ 0 holds for all i ≠ j belonging to the same subset ω_red (resp. ω_black). This means there is no coupling between elements of the same subset and allows us to reformulate the update step in Alg. 6.12 into Alg. 6.13. The proper data flow is presented in Fig. 6.5.

Due to the decoupling property within the subsets of “red” and “black” data, Alg. 6.13 can be parallelized. This pointwise red/black splitting can be applied to a 5-point stencil discretization in 2D and also to a 7-point stencil discretization in 3D.

u^k_red   := u^{k−1}_red + D_red^{-1} · ( f_red − (E_rb + F_rb) · u^{k−1}_black − D_red · u^{k−1}_red )
           = D_red^{-1} · ( f_red − (E_rb + F_rb) · u^{k−1}_black )

u^k_black := u^{k−1}_black + D_black^{-1} · ( f_black − (E_br + F_br) · u^k_red − D_black · u^{k−1}_black )
           = D_black^{-1} · ( f_black − (E_br + F_br) · u^k_red )

Algorithm 6.13: Update step in Red-Black Gauss-Seidel forward iteration.

Figure 6.5: Data flow in the Red-Black Gauss-Seidel forward iteration (i, j ∈ ω_black; i−1, j−1 ∈ ω_red).

A block version of the iteration above is recommended in the parallel case. The blocks should be chosen such that the update of all blocks of one color requires no communication. This produces a checkerboard coloring of the unit square coinciding with the corresponding data distribution (again, 2 colors are sufficient when a 5-point stencil is used).
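For the model problem with a 5-point stencil, the pointwise red/black splitting can be realized by two simple loops over the grid; a minimal sketch (grid layout, scaling by h², and variable names are assumptions of this sketch):

    /* One red-black Gauss-Seidel sweep for the 5-point Laplace stencil on an
       (nx+1) x (ny+1) grid with lexicographic storage u[i + j*(nx+1)].
       Interior points only; (i + j) even = "red", odd = "black". */
    void rb_gauss_seidel(int nx, int ny, double h, double *u, const double *f)
    {
        int stride = nx + 1;
        for (int color = 0; color < 2; color++)      /* 0: red, 1: black */
            for (int j = 1; j < ny; j++)
                for (int i = 1; i < nx; i++)
                    if ((i + j) % 2 == color)
                        u[i + j*stride] = 0.25 * ( h*h*f[i + j*stride]
                                         + u[i-1 + j*stride] + u[i+1 + j*stride]
                                         + u[i + (j-1)*stride] + u[i + (j+1)*stride] );
    }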

Parallel algorithms

The update step of the Gauss-Seidel iteration is the significant difference with respect to the Jacobi iteration, so we restrict further investigations to it. A data distribution with non-overlapping elements (Sec. 5.2) is assumed and represented in Fig. 6.6.

• A formal application of the parallelization strategy used for the ω-Jacobi iteration results in a distributed matrix K, an accumulated diagonal matrix D, distributed vectors f, r, and an accumulated vector u^k.

• The update step requires a vector accumulation.

• The inner product requires different vector types.

• Therefore, we choose an accumulated auxiliary vector w = ∑_{s=1}^{P} A_s^T r_s fulfilling both requirements above.


Figure 6.6: Domain decomposition of the unit square into the four subdomains Ω_1, ..., Ω_4. The legend of the figure marks vertex nodes (“V”), edge nodes (“E”), and interior nodes (“I”); the nodes q, k−1, k, k+1, i−1, i, i+1, and ℓ referred to in the text are indicated in the figure.

• The derived algorithm changes only slightly if we define w as a correction (as was done in Sec. 6.2.1).

The parallelization of the original Gauss-Seidel method has to solve the problem of data dependencies. Moreover, the processors would have to wait for each other depending on the global numbering of the data (a crucial dependency). This effect can be reduced by a special block-wise ordering (coloring) of the nodes.

We use the marked nodes in Fig. 6.6 to illustrate these data dependencies and the algorithmic modifications. First we order the nodes (and the unknowns connected with them) such that the vertex nodes (“V”) are followed by the edge-wise ordered nodes (“E”) and then by the interior nodes (“I”), i.e., q < ... < k − 1 < k < k + 1 < ... < i − 1 < i < i + 1 < ... < ℓ.

1. If there are no matrix connections between k + 1 and i − 1 (valid for linear test functions on the triangles), then we have no crucial data dependencies. However, we need the accumulated value u_i^k (and so on) for the calculation of u_{i+1}^k. This results in a fine grain parallel Gauss-Seidel iteration on each edge.

2. If we regard the nodes i − 1, i, i + 1 and k − 1, k, k + 1 in each case as one block, then we can formulate a parallel block Gauss-Seidel iteration. No matrix connection between the two blocks is admissible in this example.

3. To overcome the above restriction on the matrix pattern, we combine a block Gauss-Seidel iteration with Jacobi iterations on the interface nodes “V” and “E”. This version will now be presented.


Variant: Gauss-Seidel-ω-Jacobi

Between the blocks “V, E, I” and inside the subdomains (block “I”) a Gauss-Seidel iteration is performed. Inside the blocks “V, E” a Jacobi iteration is used. Due to the partial use of the Jacobi iteration, the convergence rate is in general worse than for the usual Gauss-Seidel iteration. The operation d ⊛ w in Alg. 6.14 denotes the component-wise multiplication of two vectors.

d := { diag(K_{[i,i]}) }_{i=1,...,n}
d := ∑_{s=1}^{P} A_s^T d_s
d := { ω / d_{[i]} }_{i=1,...,n}
Choose u^0
r := f − K · u^0
w := ∑_{s=1}^{P} A_s^T r_s
σ := σ_0 := (w, r)
k := 0
while σ > tolerance² · σ_0 do
   k := k + 1
   r_V := f_V − K_V · u^{k−1}_V − K_{VE} · u^{k−1}_E − K_{VI} · u^{k−1}_I
   w_V := ∑_{s=1}^{P} A_{V,s}^T r_{V,s}
   u^k_V := u^{k−1}_V + d_V ⊛ w_V
   r_E := f_E − K_{EV} · u^k_V − K_E · u^{k−1}_E − K_{EI} · u^{k−1}_I
   w_E := ∑_{s=1}^{P} A_{E,s}^T r_{E,s}
   u^k_E := u^{k−1}_E + d_E ⊛ w_E
   r_I := f_I − K_{IV} · u^k_V − K_{IE} · u^k_E
   u^k_{I,i} := u^{k−1}_{I,i} + d_{I,i} · ( r_{I,i} − ∑_{j=N_E+1}^{i−1} K_{I,ij} · u^k_{I,j} − ∑_{j=i}^{N} K_{I,ij} · u^{k−1}_{I,j} )
   w_I := r_I − K_I · u^k_I
   σ := (w, r)
end

Algorithm 6.14: Parallel Gauss-Seidel-ω-Jacobi forward iteration.

We also took into account that in the interior of the domains (“I”) accumulated and distributed vectors/matrices are identical. The accumulation of cross point data (“V”) and interface data (“E”) is usually performed separately, so that we need exactly the same amount of communication as in the ω-Jacobi iteration (Alg. 6.11).


6.2.3 ADI methods

Alternating Direction Implicit (ADI) methods are useful when a regular mesh is employed in two or more space dimensions. The first paper that ever appeared with ADI in it was [32], rather than the commonly referenced [86]. The general ADI results for N space variables are found in [31]. The only paper ever written that shows convergence without resorting to requiring commutativity of the operators was written by Pearcy [87]. A collection of papers was written applying ADI to finite element methods. An extensive treatment can be found in [30]. Wachspress [111] has a very nice treatise on ADI.

The algorithm is dimension dependent, though the basic techniques are similar. The sparse matrix K is decomposed into a sum of matrices that can be permuted into tridiagonal form. These tridiagonal systems can be solved easily, e.g., by the Progonka algorithm [95]. The permutation normally requires a data transpose that is painful to perform on a parallel computer with distributed memory.
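For reference, a C sketch of such a tridiagonal solver (Thomas/Progonka algorithm) is given below; the array layout is an assumption of this sketch:

    /* Thomas algorithm: solves the tridiagonal system with lower diagonal a,
       main diagonal b, and upper diagonal c (all of length n; a[0] is not
       referenced and c[n-1] does not influence the result) for the
       right-hand side f.  The solution overwrites f; w is a work array. */
    void thomas(int n, const double *a, const double *b, const double *c,
                double *f, double *w)
    {
        w[0] = c[0] / b[0];
        f[0] = f[0] / b[0];
        for (int i = 1; i < n; i++) {                /* forward elimination */
            double m = b[i] - a[i] * w[i-1];
            w[i] = c[i] / m;
            f[i] = (f[i] - a[i] * f[i-1]) / m;
        }
        for (int i = n - 2; i >= 0; i--)             /* back substitution   */
            f[i] -= w[i] * f[i+1];
    }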

In this section, we investigate ADI for two-dimensional problems on a square domain with a uniform or tensor product mesh. The techniques used work equally well with general line relaxation iterative methods.

Sequential algorithm

Let us begin with the two-dimensional model problem (2.1). Suppose that we have K = H + V, where H and V correspond to the discretization in the horizontal and vertical directions only. Hence, H corresponds to the discretization of the term ∂²u(x,y)/∂x² in (2.2) and V corresponds to the discretization of the term ∂²u(x,y)/∂y² in (2.3).

Choose u^0
r := f − K · u^0
σ := σ_0 := (r, r)
k := 0
while σ > tolerance² · σ_0 do
   (H + ρ_{k+1} I) u^{k*}  = (ρ_{k+1} I − V) · u^k  + f
   (V + ρ_{k+1} I) u^{k+1} = (ρ_{k+1} I − H) · u^{k*} + f
   r := f − K · u^{k+1}
   σ := (r, r)
   k := k + 1
end

Algorithm 6.15: Sequential ADI in two dimensions

The parameters ρ_k are acceleration parameters that are chosen to speed up the convergence rate of ADI. For (2.1), we know the eigenvalues and eigenvectors of K:

      K μ_{jm} = λ_{jm} μ_{jm},   where   μ_{jm} = sin(jπy) sin(mπx/2).

We also know that

      H μ_{jm} = λ_m μ_{jm}   and   V μ_{jm} = λ_j μ_{jm},

where λ_j + λ_m = λ_{jm}. It is also readily verified that H and V commute: HV = VH. This condition can be met in practice for all separable elliptic equations with an appropriate discretization of the problem. The error iteration matrix

      T_ρ = (V + ρI)^{-1} (H − ρI) (H + ρI)^{-1} (V − ρI)

satisfies

      T_ρ μ_{jm} = [ (λ_j − ρ)(λ_m − ρ) / ((λ_j + ρ)(λ_m + ρ)) ] · μ_{jm}.

Suppose that

      0 < α ≤ λ_j, λ_m ≤ β.

A plot (see Fig. 6.7) of which eigenvalues provide good or bad damping of the errors leads us to note that at the extreme values of the eigenvalues there is not much damping, but there is an impressive amount at some point in between.

Figure 6.7: ADI damping factor parameter space (rate of convergence as a function of λ_x and λ_y).

For a sequence of length γ of acceleration parameters, we define

      c = α/β,   δ = (√2 − 1)²,   and   n = ⌈log_δ c⌉ + 1.

Cyclically choose

      ρ_j = β c^{(j−1)/(n−1)},   j = 1, ..., n.    (6.5)

Then the error every n iterations is reduced by a factor of δ. For (2.1) with an (N_x + 1) × (N_y + 1) mesh, we have

      α = (1/h²)(2 − 2 cos(πh)) ≈ π²   and   β = (1/h²)(2 + 2 cos(πh)) ≈ 4/h²,

which, when substituted into (6.5), gives us

      ρ_j ≈ (4/h²) δ^{j−1},   j = 1, ..., n.
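A small C sketch that computes the cyclic acceleration parameters (6.5) from given spectral bounds α and β (a direct transcription of the formulas above):

    #include <math.h>
    #include <stdlib.h>

    /* Cyclic ADI acceleration parameters (6.5):
       rho_j = beta * c^((j-1)/(n-1)), j = 1,...,n, with c = alpha/beta,
       delta = (sqrt(2)-1)^2 and n = ceil(log_delta(c)) + 1.
       Returns the parameter array; its length is stored in *n_out. */
    double *adi_parameters(double alpha, double beta, int *n_out)
    {
        double c     = alpha / beta;
        double delta = (sqrt(2.0) - 1.0) * (sqrt(2.0) - 1.0);
        int    n     = (int)ceil(log(c) / log(delta)) + 1;
        double *rho  = malloc(n * sizeof(double));
        for (int j = 1; j <= n; j++)
            rho[j-1] = (n > 1) ? beta * pow(c, (double)(j-1) / (double)(n-1))
                               : beta;
        *n_out = n;
        return rho;
    }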

Parallel algorithm

We illustrate the parallelization of the ADI method on the strip-wise decomposition in Fig. 6.8. In this special case, we have only interior nodes and edge nodes, but no cross points. Additionally, there are no matrix connections between nodes from different edges.


Figure 6.8: Decomposition in 4 strips, • denotes an Edge node.

This implies the following block structure of the stiffness matrix:

      K = ( K_E    K_{EI} )
          ( K_{IE} K_I    )

with

      K_E = blockdiag( K_{E01}, K_{E12}, K_{E23}, K_{E34}, K_{E40} ),
      K_I = blockdiag( K_{I1}, K_{I2}, K_{I3}, K_{I4} ),

      K_{EI} = ( K_{E01I1}   0          0          0
                 K_{E12I1}   K_{E12I2}  0          0
                 0           K_{E23I2}  K_{E23I3}  0
                 0           0          K_{E34I3}  K_{E34I4}
                 0           0          0          K_{E40I4} ),

      K_{IE} = ( K_{I1E01}  K_{I1E12}  0          0          0
                 0          K_{I2E12}  K_{I2E23}  0          0
                 0          0          K_{I3E23}  K_{I3E34}  0
                 0          0          0          K_{I4E34}  K_{I4E40} ),

with exactly that block structure for K_E which is required in point 3a on page 77. This means that formula (5.11a) can be applied with an accumulated matrix K by setting

      M_L := ( 0       0 )       M_D := ( K_E  0   )       M_U := ( 0  K_{EI} )
             ( K_{IE}  0 ) ,            ( 0    K_I ) ,             ( 0  0      ) .

Keeping in mind that R_{I,i} = I_{I,i}, we get

      w = K · u := ( K_E · u_E + ∑_{i=1}^{P} A_i^T K_{EI,i} · u_{I,i} )
                   ( K_{IE} · u_E + K_I · u_I                          ) .    (6.6)

Hence, we need only one communication step in the matrix-vector multiplication. The great advantage of using an accumulated matrix in ADI is that we only need one storage scheme for all the matrices in Alg. 6.15. The special features of

      V = ( K_{E,y}  K_{EI}  )        and        H = ( K_{E,x}  0       )
          ( K_{IE}   K_{I,y} )                       ( 0        K_{I,x} )

are that K_{E,y} is a diagonal matrix and that K_{E,x}, K_{I,x}, K_{I,y}, and T_x := H + ρ_{k+1} I can be inverted in parallel. Therefore we can write the whole parallel ADI algorithm in terms of accumulated vectors and matrices.

Alg. 6.16 has the disadvantage of 2 communication steps in matrix-vector multiplications per iteration. This can be reduced to one communication step because the involved subvectors and submatrices are identical. We introduce the auxiliary vector v for this purpose and rewrite the algorithm. Note that v contains parts of the residual r and that it is related to u^{k+1}. Hence, v can be reused in the next iteration to reduce the computational cost. The inversions T_{E,x}^{-1} and T_{I,x}^{-1} in Alg. 6.16 and the inversion T_x^{-1} in Alg. 6.17 only use a sequential solver for tridiagonal band matrices along the lines in the x-direction [95]. Hence, there is no communication at all.

On the other hand, a parallel solver for the tridiagonal system T_y u^{k+1} = g in the y-direction is needed. For simplicity we use an adapted version of Alg. 6.14 for this purpose. It is obvious that T_{y,E} is a diagonal matrix and that T_{y,I} is equivalent to a block diagonal matrix with tridiagonal blocks. In this case, we can solve systems with these matrices by means of a direct solver at low cost. This changes

      u^{ℓ+1}_E := u^ℓ_E + T_{y,E}^{-1} · ( g_E − T_{y,E} · u^ℓ_E − ∑_{s=1}^{P} A_s^T T_{y,EI,s} · u^ℓ_{I,s} )

Choose u^0

r := ( f_E − K_E · u^0_E − ∑_{i=1}^{P} A_i^T K_{EI,i} · u^0_{I,i} )
     ( f_I − K_{IE} · u^0_E − K_I · u^0_I                          )

σ := σ_0 := (r, R^{-1} r)
k := 0
while σ > tolerance² · σ_0 do
   T_x := H + ρ_{k+1} I

   u^{k*} := ( T_{E,x}^{-1} [ f_E + ρ_{k+1} u^k_E − K_{E,y} · u^k_E − ∑_{i=1}^{P} A_i^T K_{EI,i} · u^k_{I,i} ] )
             ( T_{I,x}^{-1} [ f_I + ρ_{k+1} u^k_I − K_{IE} · u^k_E − K_{I,y} · u^k_I ]                         )

   T_y := V + ρ_{k+1} I

   g := ( f_E + ρ_{k+1} u^{k*}_E − K_{E,x} · u^{k*}_E )
        ( f_I + ρ_{k+1} u^{k*}_I − K_{I,x} · u^{k*}_I )

   Solve T_y u^{k+1} = g

   r := ( f_E − K_E · u^{k+1}_E − ∑_{i=1}^{P} A_i^T K_{EI,i} · u^{k+1}_{I,i} )
        ( f_I − K_{IE} · u^{k+1}_E − K_I · u^{k+1}_I                          )

   σ := (r, R^{-1} r)
   k := k + 1
end

Algorithm 6.16: Parallel ADI in two dimensions: first try.

from the straightforward Gauss-Seidel implementation into

      u^{ℓ+1}_E := T_{y,E}^{-1} · ( g_E − ∑_{s=1}^{P} A_s^T T_{y,EI,s} · u^ℓ_{I,s} ),

which is applied in a block iteration for solving the system. Again we use an auxiliary variable v to save communication steps.

6.3 Roughers

As noted in the introduction to Sec. 6.2, not all iterative methods are smoothers. Some are roughers.

A rougher differs from a smoother in how the individual error components are treated. While the error norm of a rougher should decrease in each iteration, some of the individual components may increase in magnitude. Roughers differ from proper smoothers (where no error component increases in magnitude in an iteration) in how the error components behave during the iterative procedure.

Choose u^0

v := ( f_E − ∑_{i=1}^{P} A_i^T K_{EI,i} · u^0_{I,i} )
     ( f_I − K_{IE} · u^0_E                          )

r := v − blockdiag(K_E, K_I) · u^0
σ := σ_0 := (r, R^{-1} r)
k := 0
while σ > tolerance² · σ_0 do
   T_x := H + ρ_{k+1} I
   u^{k*} := T_x^{-1} · [ v + ρ_{k+1} u^k − blockdiag(K_{E,y}, K_{I,y}) · u^k ]
   T_y := V + ρ_{k+1} I
   g := f + ρ_{k+1} u^{k*} − blockdiag(K_{E,x}, K_{I,x}) · u^{k*}
   Solve T_y u^{k+1} = g

   v := ( f_E − ∑_{i=1}^{P} A_i^T K_{EI,i} · u^{k+1}_{I,i} )
        ( f_I − K_{IE} · u^{k+1}_E                          )

   r := v − blockdiag(K_E, K_I) · u^{k+1}
   σ := (r, R^{-1} r)
   k := k + 1
end

Algorithm 6.17: Parallel ADI in two dimensions: final version.

6.3.1 CG method

Sequential algorithm

If the matrix K in (6.1) is SPD ((Kv, v) > 0 for all v ≠ 0), then we may use the CG method for solving (6.1). Alg. 6.19 presents this method in combination with an SPD preconditioner C.

Parallelized CG

Investigating the operations in the classical CG (Alg. 6.19 with C = I) with respect to the results of Sections 5.3-5.4, we use the following strategy for the parallelization. One objective therein is the minimization of communication steps.

u^0 := u^{k*},   ℓ := 0
v_E := g_E − ∑_{s=1}^{P} A_s^T T_{y,EI,s} · u^0_{I,s}
while σ > tolerance² · σ_0 do
   u^{ℓ+1}_E := T_{y,E}^{-1} · v_E
   v_I := g_I − T_{y,IE} · u^{ℓ+1}_E
   u^{ℓ+1}_I := T_{y,I}^{-1} · v_I
   v_E := g_E − ∑_{s=1}^{P} A_s^T T_{y,EI,s} · u^{ℓ+1}_{I,s}
   r := v − blockdiag(T_{y,E}, T_{y,I}) · u^{ℓ+1}
   σ := (r, R^{-1} r)
   ℓ := ℓ + 1
end
u^{k+1} := u^ℓ

Algorithm 6.18: Gauss-Seidel iteration for solving Tyuk+1 = g.

Choose u^0
r := f − K · u^0
w := C^{-1} · r
s := w
σ := σ_old := σ_0 := (w, r)
repeat
   v := K · s
   α := σ / (s, v)
   u := u + α · s
   r := r − α · v
   w := C^{-1} · r
   σ := (w, r)
   β := σ / σ_old
   σ_old := σ
   s := w + β · s
until √(σ/σ_0) < tolerance

Algorithm 6.19: Sequential CG with preconditioning.

Choose u^0
r := f − K · u^0
w := ∑_{j=1}^{P} A_j^T r_j
s := w
σ := σ_old := σ_0 := (w, r)
repeat
   v := K · s
   α := σ / (s, v)
   u := u + α · s
   r := r − α · v
   w := ∑_{j=1}^{P} A_j^T r_j
   σ := (w, r)
   β := σ / σ_old
   σ_old := σ
   s := w + β · s
until √(σ/σ_0) < tolerance

Algorithm 6.20: Parallel CG.


First, we store the matrix K as a distributed matrix K (5.1), because otherwise we would have to assume restrictions on the matrix pattern. Second, the vectors s, u → s, u are stored accumulated (Type-I), and so the appropriate matrix-vector product results in a Type-II vector without any communication (5.8). This leads to the obvious conclusion that the vectors v, r, f → v, r, f should be stored distributed (Type-II) and are never accumulated during the iteration. Additionally, if w → w is an accumulated vector, then all daxpy operations require no communication. Due to the different types of the vectors used in the inner products, those operations require only a very small amount of communication (5.7).

Note that choosing C = I in Alg. 6.19 results in the setting w := r, which includes one change of the vector type via accumulation (5.5). Here we need communication between all the processes sharing the proper data. The resulting parallel CG in Alg. 6.20 requires two All Reduce operations with one real number and one vector accumulation per iteration.
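The following C/MPI sketch of one iteration of Alg. 6.20 shows where these communication steps occur. The helper routines crs_mult (local matrix-vector product, cf. E5.7) and vec_accu (accumulation of a distributed vector over the subdomain interfaces, cf. E5.5) are assumptions of this sketch and not part of the course code:

    #include <mpi.h>

    /* Assumed helpers (signatures are illustrative only):                    */
    void crs_mult(int n, const int *id, const int *ik, const double *sk,
                  const double *x, double *y);       /* y := K_s * x (local)  */
    void vec_accu(int n, double *x, MPI_Comm comm);  /* x := sum_s A_s^T x_s  */

    /* One CG iteration (C = I) on the local data of one process.
       s, u, w are accumulated vectors; v, r, f are distributed vectors.      */
    void cg_step(int n, const int *id, const int *ik, const double *sk,
                 double *u, double *r, double *s, double *w, double *v,
                 double *sigma, MPI_Comm comm)
    {
        double loc, glob, alpha, beta;

        crs_mult(n, id, ik, sk, s, v);               /* v := K*s, no comm.    */

        loc = 0.0;                                   /* (s, v): accumulated   */
        for (int i = 0; i < n; i++) loc += s[i] * v[i];   /* x distributed    */
        MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE, MPI_SUM, comm);  /* 1st     */
        alpha = *sigma / glob;

        for (int i = 0; i < n; i++) {                /* daxpy, no comm.       */
            u[i] += alpha * s[i];
            r[i] -= alpha * v[i];
            w[i]  = r[i];
        }
        vec_accu(n, w, comm);                        /* w := sum A_s^T r_s    */

        loc = 0.0;                                   /* (w, r)                */
        for (int i = 0; i < n; i++) loc += w[i] * r[i];
        MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE, MPI_SUM, comm);  /* 2nd     */

        beta = glob / *sigma;
        *sigma = glob;
        for (int i = 0; i < n; i++) s[i] = w[i] + beta * s[i];
    }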

6.3.2 GMRES solver

Sequential algorithm

For solving a system of linear equations with a non-symmetric matrix K (K ≠ K^T), we can use the General Conjugate Residual Method [94]. If K is additionally not positive definite, then a variety of methods like GMRES, QMR, Bi-CG, or Bi-CGSTAB can be used instead.

Choose u^0
r := f − K · u^0
w^1 := C^{-1} · r
z_1 := √((w^1, r))
k := 0
repeat
   k := k + 1
   r := K · w^k
   for i := 1 to k do
      h_{i,k} := (w^i, r)
      r := r − h_{i,k} · w^i
   end
   w^{k+1} := C^{-1} · r
   h_{k+1,k} := √((w^{k+1}, r))
   w^{k+1} := w^{k+1} / h_{k+1,k}
   for i := 1 to k − 1 do
      ( h_{i,k}   )      ( c_{i+1}   s_{i+1} )   ( h_{i,k}   )
      ( h_{i+1,k} )  :=  ( s_{i+1}  −c_{i+1} ) · ( h_{i+1,k} )
   end
   α := √( h_{k,k}² + h_{k+1,k}² )
   s_{k+1} := h_{k+1,k}/α ;   c_{k+1} := h_{k,k}/α ;   h_{k,k} := α
   z_{k+1} := s_{k+1} z_k ;   z_k := c_{k+1} z_k
until |z_{k+1}/z_1| < tolerance
z_k := z_k / h_{k,k}
for i := k − 1 down to 1 do   z_i := ( z_i − ∑_{j=i+1}^{k} h_{i,j} z_j ) / h_{i,i}
u^k := u^0 + ∑_{i=1}^{k} z_i · w^i

Algorithm 6.21: Sequential GMRES with preconditioning.

Let us consider the GMRES method in Alg. 6.21. The number of stored vectors w^k and of entries in the matrix H = {h_{i,j}}_{i,j=1,...,k} increases with the number of iterations k and may require excessive amounts of memory. To solve this problem, the REPEAT loop is stopped after m iterations and restarted with u^m as the initial guess. This shortening of the cycle leads to an unstable behavior of the method. For preconditioning techniques, see Section 6.2.2.

Parallel GMRES

In Alg. 6.21 we have to handle not only the functionals f, r and the state variables u, w^k, but also the vectors {z_i}_{i=1,...,k+1}, {s_i}_{i=1,...,k+1}, {c_i}_{i=1,...,k+1} and the matrix {h_{i,j}}_{i,j=1,...,k}. The latter do not fit into that scheme. Therefore we handle them simply as scalar values in the parallelization.

Our strategy for the parallelization of GMRES without preconditioning (C = I) is the following: the vectors f, r → f, r are stored distributed and the vectors u, w^k → u, w^k are stored accumulated. We store the scalar values {z_i}_{i=1,...,k+1}, {s_i}_{i=1,...,k+1}, {c_i}_{i=1,...,k+1}, and {h_{i,j}}_{i,j=1,...,k} redundantly on each processor.

Choose u^0
r := f − K · u^0
w^1 := ∑_{s=1}^{P} A_s^T r_s
z_1 := √((w^1, r))
k := 0
repeat
   k := k + 1
   r := K · w^k
   for i := 1 to k do
      h_{i,k} := (w^i, r)
      r := r − h_{i,k} · R^{-1} · w^i
   end
   w^{k+1} := ∑_{s=1}^{P} A_s^T r_s
   h_{k+1,k} := √((w^{k+1}, r))
   w^{k+1} := w^{k+1} / h_{k+1,k}
   for i := 1 to k − 1 do
      ( h_{i,k}   )      ( c_{i+1}   s_{i+1} )   ( h_{i,k}   )
      ( h_{i+1,k} )  :=  ( s_{i+1}  −c_{i+1} ) · ( h_{i+1,k} )
   end
   α := √( h_{k,k}² + h_{k+1,k}² )
   s_{k+1} := h_{k+1,k}/α ;   c_{k+1} := h_{k,k}/α ;   h_{k,k} := α
   z_{k+1} := s_{k+1} z_k ;   z_k := c_{k+1} z_k
until |z_{k+1}/z_1| < tolerance
z_k := z_k / h_{k,k}
for i := k − 1 down to 1 do   z_i := ( z_i − ∑_{j=i+1}^{k} h_{i,j} z_j ) / h_{i,i}
u^k := u^0 + ∑_{i=1}^{k} z_i · w^i

Algorithm 6.22: Parallel GMRES - no restart.

Then all inner products require only a minimum of communication (5.7) and the matrix-vector operation requires no communication at all (5.8). Setting w := r requires one change of the vector type via accumulation (5.5). Here we need communication between all processes sharing the proper data. All operations on the scalar values c_i, s_i, z_i and h_{i,j} are performed locally on each processor. These operations are redundant. Nearly all daxpy operations use the same vector types (and scalar values), excluding the operation

      r := r − h_{i,k} · w^i ,

which combines different vector types. But the necessary change of the accumulated vector w^i into a distributed one can be done without any communication (5.6). Therefore, no daxpy operation requires communication.

The kth iteration of Alg. 6.22 includes (k + 1) All Reduce operations with one real number and one vector accumulation. Due to changing the type of the vector w^i we have to perform k · n additional multiplications (n is the length of the vector w).


A parallelization of GMRES(m) with a restart after m iterations can be done in the same way as above.

6.3.3 BICGSTAB solver

Sequential algorithm

One method for solving a system of linear equations with a non-symmetric and not positive definite matrix K is BICGSTAB, presented in Alg. 6.23.

Choose an initial guess u
r := f − K · u
Choose q such that (q, r) ≠ 0
Set ρ := 1, α := 1, ω := 1, v := 0, p := 0
while (r, r) > ε² do
   ρ_old := ρ ;   ρ := (q, r)
   β := ρ/ρ_old · α/ω
   p := r + β · (p − ω · v)
   v := K · p
   α := ρ / (q, v)
   s := r − α · v
   t := K · s
   ω := (t, s) / (t, t)
   u := u + α · p + ω · s
   r := s − ω · t
end

Algorithm 6.23: Sequential BICGSTAB

Parallel BICGSTAB

If we choose the same parallelization strategy as in Section 6.3.2, then the parallel BICGSTAB in Alg. 6.24 requires three type conversions with communication per iteration and two additional vectors. This can be reduced by using mostly accumulated vectors in the parallel code.

6.4 Preconditioners

The parallel versions of the classical iterative schemes listed in subsections 6.1.2-6.2.3 can be used as preconditioners in subsections 6.3.1-6.1.2. Additionally, the symmetric multigrid iteration of Chapter 7 can also be used as a preconditioner [68, 67].

Choose an initial guess u
r := ∑_{s=1}^{P} A_s^T ( f_s − K_s · u_s )
Choose q such that (q, r) ≠ 0
Set ρ := 1, α := 1, ω := 1, v := 0, p := 0
while (r, R^{-1} r) > ε² do
   ρ_old := ρ ;   ρ := (q, r)
   β := ρ/ρ_old · α/ω
   p := r + β · (p − ω · v)
   v := ∑_{s=1}^{P} A_s^T K_s · p_s
   α := ρ / (q, v)
   s := r − α · v
   t := K · s
   t := ∑_{s=1}^{P} A_s^T t_s
   ω := (t, s) / (t, t)
   u := u + α · p + ω · s
   r := s − ω · t
end

Algorithm 6.24: Parallel BICGSTAB.

Exercises

We want to solve the Laplace problem using the weak formulation with homogeneous Dirichlet boundary conditions on the discretization given in Fig. 5.4 on page 81.

E6.1. You will find a sequential version of the Jacobi solver in the directory Solution/jacseqc with the following functions in addition to the functions on pages 79-80.

GetMatrix(nx, ny, xl, xr, yb, yt, sk, id, ik, f)
calculates the matrix K and the right hand side f using the function FunctF(x,y) describing the right-hand side function f. Note that only the coordinates and element connectivities are related to the rectangular domain; all other parts of this routine are written for general 2D domains.
The Dirichlet boundary conditions are set in SetU(nx, ny, u). Alternatively, we could use FunctU(x,y). These boundary conditions are applied in
ApplyDirichletBC(nx, ny, neigh, u, sk, id, ik, f)
via a penalty method.
The solver itself is implemented in
JacobiSolve(nx, ny, sk, id, ik, f, u)
and uses
GetDiag(nx, ny, sk, id, d)
to get the diagonal of the matrix K. The matrix-vector multiplication is implemented in
CrsMult(iza, ize, w, u, id, ik, sk, alfa).

A vector u can be saved in a file named name by calling
SaveVector(name, u, nx, ny, xl, xr, yb, yt, ierr),
such that gnuplot name can be used to visualize that vector.

Implement a parallel ω-Jacobi solver based on the sequential code.

E6.2. Implement a parallel CG without preconditioning.

E6.3. Implement a parallel CG with diagonal preconditioning.


Chapter 7

Multigrid methods

Theory attracts practice as the magnet attracts iron.—Johann C. F. Gauss (1777-1855).

7.1 Multigrid methods

We assume that there exists a series of regular (FE) meshes/grids {T_q}_{q=1,...,ℓ}, where the finer grid T_{q+1} is derived from the coarser grid T_q. The simplest case in 2D is to subdivide all triangles into 4 congruent ones, i.e., by connecting the bisection points of the three edges.

We assume that the grids are nested, T_1 ⊂ T_2 ⊂ · · · ⊂ T_ℓ. There are convergence results for nonnested triangulations (and more general griddings and discretizations) [27, 28], but we do not consider this case. This simplifies the theory considerably [5, 58, 115].

Discretizing the given differential equation on each grid T_q results in a series of systems of equations

      K_q u_q = f_q    (7.1)

with symmetric, positive definite sparse stiffness matrices K_q (see Chapter 4 for FE meshes).

When analyzing the non-optimal convergence behavior of the iterative methods presented in Secs. 6.2.1 and 6.2.2 by representing the error e := u*_q − u^k_q as a sum over the eigenfrequencies of the matrix K, we obtain the following results (see also the nicely written multigrid introduction [16]). First, those parts of the error belonging to high eigenfrequencies are reduced very quickly. Second, the low frequency parts of the error lead to the slow convergence behavior of classical iterative solvers.

We need an inexpensive method reducing the low frequency errors e_low without losing the good convergence behavior for the high frequency errors e_high. This leads directly to the two grid idea:

      Reduce e_high on the grid T_ℓ (smoothing), project the remaining error onto the coarse grid T_{ℓ−1}, and solve there exactly. Interpolate the coarse grid solution (called the defect correction) back to the fine grid T_ℓ and add it to the approximation obtained after smoothing.


Figure 7.1: V and W cycles (between the finest grid, the auxiliary grids, and the coarsest grid).

Usually, additional smoothing reduces the high frequency errors once more on T_ℓ. The two grid method is repeated until convergence is reached. We defer until page 115 the proper definition of solve exactly, but until then, assume a direct solver is adequate.

The multigrid idea substitutes the exact solver in the defect correction on the coarse grid T_{ℓ−1} recursively by another two grid method until the coarsest grid T_1 is reached. There we solve the remaining small system of equations exactly by a direct method.

7.2 The multigrid algorithm

7.2.1 Sequential algorithm.

For notational purposes we have to introduce the following vectors and matrices:

I^{q−1}_q        Restriction/projection operator mapping fine grid data onto the coarse grid.
I^q_{q−1}        Prolongation/interpolation operator mapping coarse grid data onto the fine grid.
S^{ν_pre}_pre    Presmoothing operator reducing e_high, applied ν_pre times.
S^{ν_post}_post  Postsmoothing operator reducing e_high, applied ν_post times.
d_q              Defect on the qth grid.
w_q              Correction on the qth grid.
mgm^γ            Recursive multigrid procedure, applied γ times (γ = 1: V-cycle, γ = 2: W-cycle).

We choose one or more iterative methods from Secs. 6.2.1 - 6.1.2 as the presmoother or postsmoother in Alg. 7.1. Note that they can be different. If we use multigrid as a preconditioner in a CG algorithm, then the resulting multigrid operator must be symmetric and positive definite. In this case we must choose the grid transfer operators, i.e., interpolation and restriction, as I^{q−1}_q = (I^q_{q−1})^T, and the iteration operator of the postsmoother has to be adjoint to the iteration operator of the presmoother with respect to the energy product (·, ·)_{K_q}. Additionally, the number of smoothing steps has to be the same, i.e., ν_pre = ν_post, and the postsmoother as well as the presmoother has to start with an initial guess equal to zero [68, 67]. The defect system on the coarsest grid is solved exactly by a direct method, or it can be solved by an iterative method with a symmetric iteration operator, e.g., ω-Jacobi.

7.2.2 Parallel components of multigrid

It is obvious that we can parallelize the whole multigrid algorithm if we can parallelize each single component of it, i.e., interpolation, restriction, smoothing, and the coarse grid solver.


if (q == 1) then
   Solve K_1 u_1 = f_1 exactly                             Coarse grid solver
else
   u_q := S^{ν_pre}_pre (K_q, u_q, f_q)                    Presmoothing
   d_q := f_q − K_q · u_q                                  Defect calculation
   d_{q−1} := I^{q−1}_q · d_q                              Restriction of defect (new rhs)
   w^0_{q−1} := 0                                          Initial guess
   w_{q−1} := mgm^γ(K_{q−1}, w^0_{q−1}, d_{q−1}, q − 1)    Solve defect system
   w_q := I^q_{q−1} · w_{q−1}                              Interpolation of correction
   u_q := u_q + w_q                                        Add correction
   u_q := S^{ν_post}_post (K_q, u_q, f_q)                  Postsmoothing
endif

Algorithm 7.1: Sequential multigrid: mgm(K_q, u_q, f_q, q).
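A recursive C sketch of Alg. 7.1 is given below; the component routines and the level bookkeeping are placeholders (assumptions of this sketch) standing for the ingredients discussed in the next section:

    #include <stdlib.h>

    /* Placeholder component routines (assumed, not part of the course code): */
    void smooth(int q, double *u, const double *f, int nu);          /* smoothing        */
    void defect(int q, const double *u, const double *f, double *d); /* d := f - K_q * u */
    void restrict_defect(int q, const double *d_fine, double *d_coarse);
    void interpolate_add(int q, const double *w_coarse, double *u_fine);
    void coarse_solve(double *u1, const double *f1);
    extern int n_dof[];                       /* unknowns per level (assumed) */

    /* Recursive multigrid cycle mgm (Alg. 7.1); gamma = 1: V-cycle, 2: W-cycle. */
    void mgm(int q, double *u, const double *f, int nu_pre, int nu_post, int gamma)
    {
        if (q == 1) { coarse_solve(u, f); return; }   /* coarse grid solver */

        smooth(q, u, f, nu_pre);                      /* presmoothing       */

        double *d  = calloc(n_dof[q],   sizeof(double));
        double *dc = calloc(n_dof[q-1], sizeof(double));
        double *w  = calloc(n_dof[q-1], sizeof(double));   /* initial guess 0 */

        defect(q, u, f, d);                           /* defect calculation */
        restrict_defect(q, d, dc);                    /* new coarse rhs     */

        for (int i = 0; i < gamma; i++)               /* solve defect system */
            mgm(q - 1, w, dc, nu_pre, nu_post, gamma);

        interpolate_add(q, w, u);                     /* add correction     */
        smooth(q, u, f, nu_post);                     /* postsmoothing      */

        free(d); free(dc); free(w);
    }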

If we use the non-overlapping element distribution from Sec. 5.2 on the coarsest grid T_1, then this distribution is kept on all finer grids.

Figure 7.2: Non-overlapping element distribution on two nested grids (the legend distinguishes “C” and “I” nodes on the coarse and on the fine grid).

Once again, we store the matrices K_q as distributed ones. From the experience of the previous sections (especially Sec. 6.3.1) the following classification of the vectors from Alg. 7.1 into the two parallel vector types seems to be advantageous.

• Accumulated vectors: u_q, u_q, w_q.

• Distributed vectors: f_q, d_q.

• Not yet assigned to a vector type: u_q, w_{q−1}, w^0_{q−1}, d_{q−1}.

Interpolation

Investigating Fig. 7.2 and the interpolation resulting from the mesh refinement, we can derive the following statements for the three sets of nodes V, E, I:

1. The interpolation I^q_{V,q−1} within the set of cross points is the (potentially scaled) identity matrix I.

2. The complete set of cross points belongs to the coarse grid and therefore they are never interpolated from nodes belonging to E or I, i.e., I^q_{VE,q−1} = I^q_{VI,q−1} = 0.

3. The interpolation matrix on the edges breaks up into blocks, and so we get I^q_{E,q−1} = blockdiag{ I^q_{E_j,q−1} }_{j=1,...,NumEdges}.

4. Nodes on the edges are never interpolated from inner nodes, i.e., I^q_{EI,q−1} = 0.

5. Obviously, the interpolation matrix of the inner nodes (I) is also block diagonal, I^q_{I,q−1} = blockdiag{ I^q_{I_s,q−1} }_{s=1,...,P}.

As a consequence, the interpolation matrix I^q_{q−1} possesses a block structure in which the submatrices in the upper right are equal to 0. The structure of the matrix allows the application of (5.9), and we conclude that

• the interpolation matrix has to be accumulated: I^q_{q−1}, and

• the vector w_{q−1} is accumulated.

In principle it is also possible to define the interpolation matrix as distributed. But in that case an additional type conversion with communication has to be performed after the interpolation or in the subsequent addition of the correction.

No communication in interpolation! :-)

Restriction

By choosing the restriction as the transposed interpolation, i.e., I^{q−1}_q = (I^q_{q−1})^T, we achieve a block triangular structure of the restriction matrix as in (5.10), leading to

• an accumulated restriction matrix I^{q−1}_q, and

• distributed stored vectors d_{q−1} and d_q.

Again, it is also possible to define the restriction matrix as distributed. In this case an additional type conversion with communication has to be performed on d_q before the restriction. Using injection as restriction does not change any of the previous statements.

No communication in restriction! :-)

Remark 7.1. Choosing I^{q−1}_q = (I^q_{q−1})^T has an odd side effect for the finite difference method (FDM) (recall FDM from Chapter 2). The resulting K_{q−1} resembles a FEM matrix: instead of maintaining a 5-point stencil (see (2.4)), a 9-point stencil is created. A much more complicated system is needed for FDM to maintain the 5-point stencil on all grids (see [27] and [28] for unifying approaches).


Smoothing

The smoothing operators Spre and Spost may require extra input vectors. Iterative methodslike ω-Jacobi, Gauss-Seidel and ILU (6.2.1 - 6.1.2) require as input an accumulated initialsolution u0, a distributed stored right hand side f, and a distributed matrix K. The iteratedsolution u is accumulated in any case. Each iteration of a smoother requires at least oneaccumulation. The appropriate storage classes will be:

• vector uq is accumulated, uq and uq are accumulated in any case,

• vector w0q is also accumulated.

Communication in smoothing :-(

Coarse grid solver

The main problem in parallelizing a multigrid or multilevel algorithm consists of solving the coarse grid system exactly:

      ∑_{s=1}^{P} A_{s,1}^T K_{s,1} A_{s,1} u_1 = ∑_{s=1}^{P} A_{s,1}^T f_{s,1} .    (7.2)

The following ideas are common:

1. Direct solver - sequential:
   The coarsest grid is so small that one processor P_0 can store it additionally.
   • Accumulate K_1 (Reduce).
   • Accumulate f_1 (Reduce).
   • Processor P_0 solves K_1 u_1 = f_1.
   • Distribute u_1 to all other processors (Scatter).
   This algorithm is purely sequential!

2. Direct solver - parallel:
   We store parts of the accumulated matrix on all processors and solve the coarse grid system by some direct parallel solver. Therein, load imbalances can appear.

3. Iterative solver - parallel:
   Here, no accumulation of K_1 and f_1 is necessary, but in each iteration at least one accumulation must be performed (see 6.3.1 - 6.1.2).
   Caution: If CG-like methods are used as coarse grid solvers, then a symmetric multigrid operator cannot be achieved. To preserve the symmetry one should combine, e.g., a Gauss-Seidel forward iteration with the appropriate backward iteration (SSOR), or one should use algebraic multigrid (AMG) [92, 16, 56] as the coarse grid solver.

The usually poor ratio between the communication and arithmetic performance of parallel machines normally results in a poor parallel efficiency of the coarse grid solver. This can be observed when using a W-cycle on 32 or more processors, since the W-cycle works on the coarser grids more often than a V-cycle.

One idea to overcome this bottleneck consists in the distribution of the coarsest (or the coarser) grids on fewer than P processors. The disadvantage of this approach is the appearance of communication between certain grids ℓ_0 and ℓ_0 + 1 in the intergrid transfer.

Communication in coarse grid solver :-(


7.2.3 Parallel algorithm

if (q == 1) then
   Solve ∑_{s=1}^{P} A_{s,1}^T K_{s,1} A_{s,1} u_1 = ∑_{s=1}^{P} A_{s,1}^T f_{s,1}    Coarse grid solver
else
   u_q := S^{ν_pre}_pre (K_q, u_q, f_q)                    Presmoothing
   d_q := f_q − K_q · u_q                                  Defect calculation (new rhs)
   d_{q−1} := I^{q−1}_q · d_q                              Restriction of defect
   w^0_{q−1} := 0                                          Initial guess
   w_{q−1} := pmgm^γ(K_{q−1}, w^0_{q−1}, d_{q−1}, q − 1)   Solve defect system
   w_q := I^q_{q−1} · w_{q−1}                              Interpolation of correction
   u_q := u_q + w_q                                        Addition of correction
   u_q := S^{ν_post}_post (K_q, u_q, f_q)                  Postsmoothing
endif

Algorithm 7.2: Parallel multigrid: pmgm(K_q, u_q, f_q, q).

Alg. 7.2 contains a concise parallelization of multigrid. The interested reader can find additional information on geometric and algebraic multigrid methods and their parallelization in the survey paper [57].

Exercises

We want to solve the Laplace problem using the weak formulation with homogeneous Dirichlet boundary conditions on the discretization given in Fig. 5.4 on page 81. Here we solve the resulting system of equations by means of a multigrid iteration.

E7.1. You will find a sequential version of the multigrid solver, using ω-Jacobi iterations as a smoother and, for simplicity, also as a coarse grid solver, in the directory Solution/mgseqc. The following functions are added to the functions from pages 79-80 and 109.

AllocLevels(nlevels, nx, ny, pp, ierr)
allocates the memory for all levels and stores the pointers in the variable pp, which is a structure PointerStruct. For details, see mg.h. The matrices and right-hand sides will be calculated in

IniLevels(xl, xr, yb, yt, nlevels, neigh, pp, ierr).
FreeLevels(nlevels, pp, ierr)

deallocates the memory at the end. Note that we need a different procedure for multigrid to apply the Dirichlet b.c.:

ApplyDirichletBC1(nx, ny, neigh, u, sk, id, ik, f).
The multigrid solver

MGMSolver(pp, crtl, level, neigh)
uses crtl, a predefined structure of type ControlStruct, to control the multigrid cycle (see mg.h). The same holds for each multigrid iteration

MGM(pp, crtl, level, neigh).
The multigrid components smoothing, coarse grid solver, interpolation, and restriction are defined in


JacobiSmooth(nx, ny, sk, id, ik, f, u, dd, aux, maxiter)
JacobiSolve2(nx, ny, sk, id, ik, f, u, dd, aux)
Interpolate(nx, ny, w, nxc, nyc, wc, neigh)
Restrict(nx, ny, w, nxc, nyc, wc, neigh).

A vector u on level level can be saved in a file named name by calling
SaveVector(name, u, nx, ny, xl, xr, yb, yt, ierr)
with u = pp->u[level], nx = pp->nx[level], and ny = pp->ny[level], such that gnuplot name can be used to visualize that vector.

Your assignment, should you decide to accept it, is to implement a parallel version of the sequential multigrid code. These instructions will self-destruct in ten seconds.

E7.2. Replace the parallel ω-Jacobi solver on the coarse grid by the sequential ω-Jacobi solver on one process. This requires a gather of the right-hand side and a scatter of the solution after the solve. Compare the run time of this parallel multigrid for solving the system of equations on the finest grid with the run time of the parallel multigrid from E7.1. Interpret the different run time behavior.
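
For the run time comparison, one possibility is to time the solver call with MPI_Wtime and to report the maximum over all processes. The helper below is a generic sketch, not part of the course code; it takes the solver as a callback so that no assumptions about the argument types of MGMSolver from mg.h are needed.

    #include <mpi.h>
    #include <stdio.h>

    /* Measures the wall-clock time of one solver call and returns the
     * maximum over all processes of the given communicator. */
    double TimeSolve(void (*solve)(void *ctx), void *ctx, MPI_Comm comm)
    {
        double t0, t_local, t_max;
        int rank;

        MPI_Comm_rank(comm, &rank);
        MPI_Barrier(comm);                 /* common starting point            */
        t0 = MPI_Wtime();
        solve(ctx);                        /* e.g., a wrapper around MGMSolver */
        t_local = MPI_Wtime() - t0;
        MPI_Allreduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, comm);
        if (rank == 0)
            printf("solver time: %f s (max over all processes)\n", t_max);
        return t_max;
    }

In the parallel main program one would wrap the MGMSolver(pp, crtl, level, neigh) call from E7.1 in a small function of this callback type and pass it, together with a pointer to its arguments, to TimeSolve; comparing the reported times for E7.1 and E7.2 then gives the requested run time comparison.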


Chapter 8

Problems not addressed in this book

Prediction is difficult, especially of the future.—Niels Bohr (1885-1962).

It is not the goal of this tutorial to compile a complete compendium on PDEs, solvers, and parallelization. Each section of this tutorial could be extended by many, many more methods, as well as by theory covering other types of PDEs. Depending on the concrete problem, the following topics, not treated in this book, may be of interest:

• Nonsymmetric problems, time-dependent PDEs, nonlinear PDEs, and coupled PDE problems [47, 104, 109].

• Error estimators and adaptive solvers [91, 110].

• Algorithms and programs that calculate the decomposition of nodes or elements to achieve a static load balancing, e.g., the programs Chaco [63], METIS [69], and JOSTLE [113]. See also [7, 74, 55] for basics and extensions of the techniques used.

• Dynamic load balancing [9, 17].

• The wide range of domain decomposition algorithms. A good overview of research activities in this field can be found in the proceedings of the Domain Decomposition Conferences, see [42, 18, 19, 43, 70, 88, 72, 44, 10, 79, 77, 20], and in the collection [71]. Two pioneering monographs give a good introduction to domain decomposition methods from two different points of view: the algebraic one [98] and the analytic one [89].

• ................................... (add a topic of your choice)

As already mentioned in the introduction to this tutorial, (PDE-based) parallel computational methods become more and more important in many sciences where they were not, or almost not, used 20 or even 10 years ago. This is especially true for the so-called life sciences, like computational biology and computational medicine (see, e.g., [35]), and the business sciences, like computational finance (see, e.g., [96]), where PDE models now play an important role.


Appendix A

Abbreviations and Notations

Nobody can say what a variable is.—Hermann Weyl (1885-1955).

ADI        Alternating Directions Iterative method
AMG        Algebraic Multigrid
a.e.       almost everywhere
BEM        Boundary Element Method
BICGSTAB   BIConjugate Gradient STABilized (method, algorithm, ...)
BVP        Boundary Value Problem
BPX        Bramble, Pasciak, and Xu preconditioner
CG         Conjugate Gradient (method, algorithm, ...)
COMA       Cache Only Memory Access
CRS        Compressed Row Storage
DD         Domain Decomposition
DSM        Distributed Shared Memory
FDM        Finite Difference Method
FEM        Finite Element Method
FVM        Finite Volume Method
FFT        Fast Fourier Transformation
GMRES      Generalized Minimum RESidual (method, algorithm, ...)
HPF        High Performance Fortran
ILU        Incomplete LU factorization
LU         LU factorization
MIC        Modified Incomplete Cholesky factorization
MILU       Modified Incomplete LU factorization
MIMD       Multiple Instructions on Multiple Data
MISD       Multiple Instructions on Single Data
MPI        Message Passing Interface
NUMA       Non-Uniform Memory Access
PCG        Preconditioned Conjugate Gradient method
PDE        Partial Differential Equations


SIMD       Single Instruction on Multiple Data
SISD       Single Instruction on Single Data
SMP        Symmetric MultiProcessing
SOR        Successive Over Relaxation
SPMD       Single Program on Multiple Data
SSOR       Symmetric Successive Over Relaxation
SPD        Symmetric and Positive Definite
UMA        Uniform Memory Access

(·, ·)         inner (scalar) product in some Hilbert space V
(·, ·)_{1,Ω}   inner product in H^1(Ω): (u, v)_{1,Ω} = (u, v)_1 := ∫_Ω (∇^T u ∇v + u v) dx
‖·‖            norm in some normed space V: ‖u‖^2 := (u, u) if V is a Hilbert space
‖·‖_{1,Ω}      norm in the function space H^1: ‖u‖^2_{1,Ω} = ‖u‖^2_1 := (u, u)_1
⟨F, ·⟩         continuous linear functional, see page 40
a(·, ·)        bilinear form, see page 40
(u, v)         Euclidean inner product in R^{N_h}: (u, v) = Σ_{i=1}^{N_h} u_i v_i  for all u, v ∈ R^{N_h}
‖u‖_B          B-energy norm: ‖u‖^2_B := (Bu, u), where B is an N_h × N_h SPD matrix
R              real numbers
R^m            m-dimensional Euclidean space
Ω              computational domain, Ω ⊂ R^m (m = 1, 2, 3)
Ω_s            subdomain Ω_s ⊂ Ω
Γ = ∂Ω         boundary of a domain Ω
Γ_1            Dirichlet boundary (page 38)
Γ_2            Neumann boundary (page 38)
Γ_3            Robin boundary (page 38)
L_2(Ω)         function space L_2(Ω) := {v | ∫_Ω v(x)^2 dx < ∞}
L_∞(Ω)         function space L_∞(Ω) := {v | ess sup_{x∈Ω} |v(x)| < ∞}
C(Ω)           space of continuous functions f: Ω → R
C^m(Ω)         space of m-times continuously differentiable functions f: Ω → R
X              general function space for classical solutions
H^1(Ω)         Sobolev space H^1(Ω) := {v(x) : ‖v‖_{1,Ω} < ∞}
H^{1/2}(Γ_1)   trace space H^{1/2}(Γ_1) := {g | ∃ v ∈ H^1(Ω): v|_{Γ_1} = g}
V              Hilbert space
V_0 ⊂ V        subspace of V
V_g ⊂ V        linear manifold in V generated by a given g ∈ V and V_0
V_0^*          dual space of all linear and continuous functionals on V_0
V_h, V_{0h}    finite dimensional subspaces: V_h ⊂ V, V_{0h} ⊂ V_0
V_{gh}         finite dimensional submanifold: V_{gh} ⊂ V_g
dim V_h        dimension of the space V_h (page 43)
ω̄_h            set of nodes, or their indices, from the discretized Ω̄
γ_h            set of nodes, or their indices, from the discretized Γ_1
ω_h            set of nodes, or their indices, from the discretized Ω̄ \ Γ_1
N̄_h            number of nodes: N̄_h = |ω̄_h|
N_h            number of nodes: N_h = |ω_h|
u              classical solution u ∈ X, or weak solution u ∈ V_g
u_h            finite element (Galerkin) solution u_h ∈ V_{gh}
u_h            discrete solution (vector of coefficients) u_h ∈ R^{N_h}
supp(ϕ)        support of ϕ(x): supp(ϕ) := {x ∈ Ω | ϕ(x) ≠ 0}
T_h            triangulation
R_h            set of finite elements


δ^{(r)}        arbitrary triangular element
∆              master (= reference) element
K_h, K         stiffness matrix
C_h, C         preconditioning matrix
κ(K)           condition number of a matrix K
δ_{ij}         Kronecker symbol: δ_{ij} = 1 for i = j and 0 otherwise
O(N^p)         some quantity that behaves proportionally to the p-th power of N for N → ∞
O(h^p)         some quantity that behaves proportionally to the p-th power of h for h → 0


Appendix B

Internet addresses

Software for parallelization

• The MPI homepage (1) should be used as a reference address for MPI. The MPI Forum (2) contains the official MPI documents, especially the function calls in MPI (3); see also another very good reference (4).

• There exist several different implementations of the MPI standard. We use mainly the MPICH (5) implementation or the LAM (6) version.

• The Parallel Virtual Machine (PVM (7)) is an older parallel computing environment, but it is still alive.

• The OpenMP (8) homepage contains the specifications of this parallelization approach.

Other software

• The Fast Fourier Transformation, or FFT (9), homepage should be visited for more information on that approach.

• Numerical software can be found in the Netlib (10).

• Tools and benchmarks (11) for parallelization are also available.

• METIS (12), JOSTLE (13), and CHACO (14) are well-known software packages for data distribution.

(1) http://www-unix.mcs.anl.gov/mpi
(2) http://www.mpi-forum.org
(3) http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
(4) http://www-unix.mcs.anl.gov/mpi/www
(5) http://www-unix.mcs.anl.gov/mpi/mpich
(6) http://www.lam-mpi.org
(7) http://www.csm.ornl.gov/pvm/pvm_home.html
(8) http://www.openmp.org
(9) http://www.fftw.org
(10) http://www.netlib.org
(11) http://www.nhse.org/ptlib
(12) http://www-users.cs.umn.edu/~karypis/metis
(13) http://www.gre.ac.uk/jostle
(14) http://www.cs.sandia.gov/CRF/chac.html

Parallel computers

• The Beowulf (15) Project allows parallel programming on LINUX clusters.

• One interesting project to construct some sort of super parallel computer is called Globus (16).

• You can try to find your supercomputer in the TOP 500 (17) list.

• The main vendors for parallel computers (year 2002) are Cray (18), Fujitsu (19), HP (20), IBM (21), NEC (22), SGI (23), and SUN (24).

People

• To find more information on, or to contact, the authors of this tutorial: Craig C. Douglas (25), Gundolf Haase (26), and Ulrich Langer (27).

Further links

• A very nice overview of Parallel and High Performance Computing Resources (28).

• The book by Pacheco (29) [85] is available via the internet.

• The porting of commercial software to parallel computers has been done in the EUROPORT (30) project.

• Find more information on multigrid (31) and domain decomposition (32) methods.

• One of the good sources for recent developments in scientific computing is the German Scientific Computing Page (33).

(15) http://www.beowulf.org
(16) http://www.globus.org
(17) http://www.top500.org
(18) http://www.cray.com
(19) http://primepower.fujitsu.com/hpc/en/products-e/index-e.html
(20) http://www.hp.com/products1/servers/scalableservers/
(21) http://www.rs6000.ibm.com/hardware/largescale
(22) http://www.sw.nec.co.jp/hpc/sx-e/index.html
(23) http://www.sgi.com/origin/
(24) http://www.sun.com/servers/highend
(25) http://www.mgnet.org/~douglas
(26) http://www.numa.uni-linz.ac.at/Staff/haase/haase.html
(27) http://www.numa.uni-linz.ac.at/Staff/langer/langer.html
(28) http://www.jics.utk.edu/parallel.html
(29) http://www.usfca.edu/mpi
(30) http://www.gmd.de/SCAI/europort
(31) http://www.mgnet.org
(32) http://www.ddm.org
(33) http://www.scicomp.uni-erlangen.de


Bibliography

[1] R. A. Adams, Sobolev Spaces, Academic Press, San Francisco, 1975.

[2] O. Axelsson, Iterative Solution Methods, Cambridge University Press, Cambridge, 1994.

[3] O. Axelsson and V. A. Barker, Finite Element Solution of Boundary Value Problems, Academic Press, New York, 1984.

[4] I. Babuska, Courant element: Before and after, in [75], Marcel Dekker, Inc., New York, Basel, Hong Kong, 1994, pp. 37–51.

[5] R. E. Bank and C. C. Douglas, Sharp estimates for multigrid rates of convergence with general smoothing and acceleration, SIAM J. Numer. Anal., 22 (1985), pp. 617–633.

[6] R. E. Bank and D. J. Rose, Some error estimates for the box method, SIAM J. Numer. Anal., 24 (1987), pp. 777–787.

[7] S. T. Barnard and H. D. Simon, Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems, Concurrency: Practice and Experience, 6 (1994), pp. 101–107.

[8] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition (34), SIAM, Philadelphia, PA, 1994.

[9] P. Bastian, Load balancing for adaptive multigrid methods, SIAM J. Sci. Comp., 19 (1998), pp. 1303–1321.

[10] P. E. Bjørstad, M. Espedal, and D. Keyes, eds., Domain Decomposition Methods in Sciences and Engineering, John Wiley & Sons, 1997. Proceedings of the Ninth International Conference, June 1996, Bergen, Norway.

[11] D. Braess, Finite Elements: Theory, Fast Solvers and Applications in Solid Mechanics, Cambridge University Press, Cambridge, 1997.

[12] J. H. Bramble, Multigrid Methods, Longman Scientific & Technical, Pitman Research Notes in Mathematics Series No. 294, 1993.

[13] J. H. Bramble, J. E. Pasciak, and A. H. Schatz, The construction of preconditioners for elliptic problems by substructuring I–IV, Math. Comp., 47 (1986), pp. 103–134; 49 (1987), pp. 1–16; 51 (1988), pp. 415–430; 53 (1989), pp. 1–24.

(34) http://www.netlib.org/linalg/html_templates/report.html


[14] A. Brandt, Multi-level adaptive solutions to boundary-value problems, Math. Comp., 31 (1977), pp. 333–390.

[15] W. L. Briggs and V. E. Henson, The DFT: An Owner's Manual for the Discrete Fourier Transform, SIAM, Philadelphia, 1995.

[16] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial, SIAM, Philadelphia, 2000. Second Edition.

[17] A. Caglar, M. Griebel, M. A. Schweitzer, and G. Zumbusch, Dynamic load-balancing of hierarchical tree algorithms on a cluster of multiprocessor PCs and on the Cray T3E, in Proceedings 14th Supercomputer Conference, Mannheim, H. W. Meuer, ed., Mateo, 1999.

[18] T. F. Chan, R. Glowinski, J. Periaux, and O. B. Widlund, eds., Second International Symposium on Domain Decomposition Methods for Partial Differential Equations, Philadelphia, 1989, SIAM. Los Angeles, California, January 14-16, 1988.

[19] ---, eds., Third International Symposium on Domain Decomposition Methods for Partial Differential Equations, Philadelphia, 1990, SIAM. Houston, March 20-22, 1989.

[20] T. F. Chan, T. Kako, H. Kawarada, and O. Pironneau, eds., Domain Decomposition Methods in Sciences and Engineering, Chiba University, 1999, ddm.org. Proceedings of the 12th Int. Conference on Domain Decomposition, October 25-29, 1999, Chiba, Japan.

[21] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, Parallel Programming in OpenMP, Morgan Kaufmann Publishers, 2000.

[22] P. Ciarlet, The finite element method for elliptic problems, North-Holland Publishing Company, Amsterdam-New York-Oxford, 1978. Republished by SIAM in April 2002.

[23] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, 19 (1965), pp. 297–301.

[24] R. Courant, Variational methods for the solution of problems of equilibrium and vibrations, Bull. Amer. Math. Soc., 49 (1943), pp. 1–23.

[25] E. W. Dijkstra, Cooperating sequential processes, Technical Report EWD-123, Technological University of Eindhoven, 1965. Reprinted in [41], pp. 43–112.

[26] J. Dongarra, I. S. Duff, D. Sorensen, and H. van der Vorst, Numerical Linear Algebra For High-Performance Computers, vol. 7 of Software, Environments, and Tools, SIAM, Philadelphia, PA, 1998.

[27] C. C. Douglas, Multi-grid algorithms with applications to elliptic boundary-value problems, SIAM J. Numer. Anal., 21 (1984), pp. 236–254.

[28] C. C. Douglas and J. Douglas, A unified convergence theory for abstract multigrid or multilevel algorithms, serial and parallel, SIAM J. Numer. Anal., 30 (1993), pp. 136–158.

[29] C. C. Douglas, J. Douglas, and D. E. Fyfe, A multigrid unified theory for non-nested grids and/or quadrature, East-West J. Numer. Math., 2 (1994), pp. 285–294.

[30] J. Douglas and T. Dupont, Alternating-direction Galerkin methods on rectangles, in Numerical Solution of Partial Differential Equations II, Academic Press, New York, 1971, pp. 133–214.


[31] J. Douglas, R. B. Kellogg, and R. S. Varga, Alternating direction iteration methods for n space variables, Math. Comp., 17 (1963), pp. 279–282.

[32] J. Douglas and D. W. Peaceman, Numerical solution of two-dimensional heat flow problems, American Institute of Chemical Engineering Journal, 1 (1955), pp. 505–512.

[33] M. Dryja, A finite element-capacitance matrix method for elliptic problems on regions partitioned into substructures, Numer. Math., 34 (1984), pp. 153–168.

[34] D. J. Evans, G. R. Joubert, and H. Lidell, eds., Parallel Computing '91, North-Holland, 1992. Proceedings of the Int. Conference on Parallel Computing, London, U.K., 3-6 September 1991.

[35] C. Fall, E. Marland, J. Wagner, and J. Tyson, Computational Cell Biology, Springer, New York, 2002.

[36] J. H. Ferziger and M. Peric, Computational Methods for Fluid Dynamics, Springer-Verlag, Berlin, 1999.

[37] M. Flynn, Very high-speed computing systems, in Proceedings of the IEEE, vol. 54, December 1966, pp. 1901–1909.

[38] I. Foster, Designing and Building Parallel Programs, Addison-Wesley, 1994.

[39] P. J. Frey and P. L. George, Mesh generation: applications to finite elements, Hermes Science Publ., Paris, Oxford, 2000.

[40] K. Gallivan, M. Heath, E. Ng, B. Peyton, R. Plemmons, J. Ortega, C. Romine, A. Sameh, and R. Voigt, Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, 1990.

[41] F. Genuys, ed., Programming Languages, Academic Press, London, England, 1968.

[42] R. Glowinski, G. H. Golub, G. A. Meurant, and J. Periaux, eds., First International Symposium on Domain Decomposition Methods for Partial Differential Equations, Philadelphia, 1988, SIAM. Paris, January 1987.

[43] R. Glowinski, Y. Kuznetsov, G. Meurant, J. Periaux, and O. B. Widlund, eds., Fourth International Symposium on Domain Decomposition Methods for Partial Differential Equations, Philadelphia, 1991, SIAM. Moscow, May 21-25, 1990.

[44] R. Glowinski, J. Periaux, and Z. Shi, eds., Domain Decomposition Methods in Sciences and Engineering, John Wiley & Sons, 1997. Proceedings of the Eighth International Conference, Beijing, P.R. China.

[45] G. Golub and J. M. Ortega, Scientific Computing: An Introduction with Parallel Computing, Academic Press, New York, 1993.

[46] G. H. Golub and C. F. van Loan, Matrix Computations, Johns Hopkins Univ. Press, Baltimore, MD, 1996. Third edition.

[47] M. Griebel, T. Dornseifer, and T. Neunhoeffer, eds., Numerical Simulation in Fluid Dynamics: A Practical Introduction, vol. 3 of SIAM Monographs on Mathematical Modeling and Computation, SIAM, Philadelphia, 1997.

[48] U. Groh, Local realization of vector operations on parallel computers, Preprint SPC 94_2 (35), TU Chemnitz, 1994. In German.

[49] W. Gropp, E. Lusk, and A. Skjellum, Using MPI, The MIT Press, Cambridge, London, 1994.

(35) http://www.tu-chemnitz.de/ftp-home/pub/Local/mathematik/SPC/spc94_2.ps.gz


[50] C. Grossmann and H.-G. Roos, Numerik partieller Differentialgleichungen, B.G. Teubner, Stuttgart, 1992. In German.

[51] I. Gustafsson, A class of first order factorization methods, BIT, 18 (1978), pp. 142–156.

[52] G. Haase, Parallel incomplete Cholesky preconditioners based on the non-overlapping data distribution, Parallel Computing, 24 (1998), pp. 1685–1703.

[53] ---, Parallelisierung numerischer Algorithmen für partielle Differentialgleichungen, Teubner, Leipzig, 1999. In German.

[54] ---, A parallel AMG for overlapping and non-overlapping domain decomposition, Electronic Transactions on Numerical Analysis (ETNA), 10 (2000), pp. 41–55.

[55] G. Haase and M. Kuhn, Preprocessing in 2D FE-BE domain decomposition methods, Computing and Visualization in Science, 2 (1999), pp. 25–35.

[56] G. Haase, M. Kuhn, and S. Reitzinger, Parallel AMG on distributed memory computers, SIAM SISC, 24 (2002), pp. 410–427.

[57] G. Haase and U. Langer, Multigrid methods: From geometrical to algebraic versions, in Modern Methods in Scientific Computing and Applications, Kluwer Academic Press, Dordrecht, 2002. Vol. 75 in the NATO Science Ser. II, Chapter X.

[58] W. Hackbusch, Multi-Grid Methods and Applications, vol. 4 of Computational Mathematics, Springer-Verlag, Berlin, 1985.

[59] ---, On first and second order box schemes, Computing, 41 (1989), pp. 277–296.

[60] ---, Iterative solution of large sparse systems, Springer, Berlin, 1994.

[61] D. W. Heermann and A. N. Burkitt, Parallel Algorithms in Computational Science, vol. 24 of Information Sciences, Springer, Berlin, 1991.

[62] B. Heinrich, Finite Difference Methods on Irregular Networks, vol. 82 of International Series of Numerical Mathematics, Birkhauser-Verlag, Basel, 1986.

[63] B. Hendrickson and R. Leland, The Chaco user's guide – version 2.0, tech. rep., Sandia National Laboratories, Albuquerque, NM, 1992. SAND94–2692.

[64] K. Hwang, Advanced computer architecture: parallelism, scalability, programmability, McGraw-Hill computer engineering series, McGraw-Hill, 1993.

[65] E. Isaacson and H. Keller, Analysis of Numerical Methods, Dover Publications, Mineola, NY, 1994. Reprint of a 1966 text.

[66] M. Jung, On the parallelization of multi-grid methods using a non-overlapping domain decomposition data structure, Applied Numerical Mathematics, 20 (1996).

[67] M. Jung and U. Langer, Application of multilevel methods to practical problems, Surveys on Mathematics for Industry, 1 (1991), pp. 217–257.

[68] M. Jung, U. Langer, A. Meyer, W. Queck, and M. Schneider, Multigrid preconditioners and their applications, in Third Multigrid Seminar, Biesenthal 1988, G. Telschow, ed., Berlin, 1989, Karl-Weierstrass-Institut, pp. 11–52. Report R-MATH-03/89.

[69] G. Karypis and V. Kumar, METIS: Unstructured graph partitioning and sparse matrix ordering system - version 2.0, tech. rep., Department of Computer Science, University of Minnesota, 1995.


[70] D. Keyes, T. Chan, G. Meurant, J. Scroggs, and R. Voigt, eds., Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations, Philadelphia, 1992, SIAM. Norfolk, May 6-8, 1991.

[71] D. E. Keyes, Y. Saad, and D. G. Truhlar, eds., Domain-Based Parallelism and Problem Decomposition Methods in Computational Science and Engineering, SIAM, 1995.

[72] D. E. Keyes and J. Xu, eds., Domain Decomposition Methods in Scientific and Engineering Computing, Providence, Rhode Island, 1994, AMS. Proceedings of the Seventh Int. Conference on Domain Decomposition, October 27-30, 1993, The Pennsylvania State University.

[73] N. Kikuchi, Finite Element Method in Mechanics, Cambridge University Press, Cambridge, 1986.

[74] R. Koppler, G. Kurka, and J. Volkert, Exdasy - a user-friendly and extendable data distribution system, in Euro-Par'97 Parallel Processing, C. Lengauer, M. Griebl, and S. Gorlatsch, eds., vol. 1300 of Lecture Notes in Computer Science, Berlin, 1997, Springer-Verlag, pp. 118–127.

[75] M. Krizek, P. Neittaanmaki, and R. Stenberg, eds., Finite Element Methods: Fifty years of the Courant Element, New York, Basel, Hong Kong, 1994, Marcel Dekker, Inc. Jyvaskyla, September 1993.

[76] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings Publishing Company, Redwood City, California, 1994.

[77] C.-H. Lai, P. E. Bjørstad, M. Cross, and O. B. Widlund, eds., Eleventh International Conference on Domain Decomposition Methods, 1998. Proceedings of the 11th International Conference on Domain Decomposition Methods in Greenwich, England, July 20-24, 1998.

[78] K. H. Law, A parallel finite element solution method, Computers and Structures, 23 (1989), pp. 845–858.

[79] J. L. Mandel, ed., Domain Decomposition Methods 10 - The tenth International Conference on Domain Decomposition Methods, Providence, RI, 1998, AMS. Proceedings of the tenth International Conference on Domain Decomposition Methods, August 10-14, 1997, Boulder, Colorado, U.S.A.

[80] J. Meijerink and H. van der Vorst, An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix, Math. Comp., 31 (1977), pp. 148–162.

[81] G. Meurant, Computer Solution of Large Linear Systems, North Holland, Amsterdam, 1999.

[82] A. Meyer, On the O(h^{-1})-property of MAF, Preprint 45, TU Karl-Marx-Stadt, Karl-Marx-Stadt, 1987.

[83] ---, A parallel preconditioned conjugate gradient method using domain decomposition and inexact solvers on each subdomain, Computing, 45 (1990), pp. 217–234.

[84] R. Miller and Q. F. Stout, Parallel Algorithms for Regular Architectures: Meshes and Pyramids, The MIT Press, 1996.

[85] P. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, 1997.


[86] D. W. Peaceman and H. H. Rachford, The numerical solution of parabolic and elliptic differential equations, Journal of the Society for Industrial and Applied Mathematics, 3 (1955), pp. 28–41.

[87] C. M. Pearcy, On convergence of alternating direction procedures, Numerische Mathematik, 4 (1962), pp. 172–176.

[88] A. Quarteroni, J. Periaux, Y. Kuznetsov, and O. B. Widlund, eds., Sixth International Conference on Domain Decomposition Methods in Science and Engineering, Providence, Rhode Island, 1994, AMS, Contemporary Mathematics. Como, June 15-19, 1992.

[89] A. Quarteroni and A. Valli, Domain Decomposition Methods for Partial Differential Equations, Oxford Science Publications, Oxford, 1999.

[90] S. H. Roosta, Parallel Programming and Parallel Algorithms: Theory and Computation, Springer, New York, 2000.

[91] U. Rude, Mathematical and computational techniques for multilevel adaptive methods, vol. 13 of Frontiers in Applied Mathematics, SIAM, Philadelphia, 1993.

[92] J. W. Ruge and K. Stuben, Algebraic multigrid (AMG), in Multigrid Methods, S. McCormick, ed., vol. 5 of Frontiers in Applied Mathematics, SIAM, Philadelphia, 1986, pp. 73–130.

[93] Y. Saad, Iterative methods for sparse linear systems, PWS Publishing Company, Boston, 1995.

[94] Y. Saad and M. H. Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., 7 (1986), pp. 856–869.

[95] A. A. Samarski and E. S. Nikolaev, Numerical methods for grid equations. Vol. I/II, Birkhauser-Verlag, Basel-Boston-Berlin, 1989.

[96] R. Seydel, Tools for Computational Finance, Springer, Heidelberg, 2002.

[97] SGI, Origin servers, technical report, Silicon Graphics Computer Systems, Mountain View, April 1997.

[98] B. F. Smith, P. E. Bjørstad, and W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 1996.

[99] G. Strang and G. J. Fix, An Analysis of the Finite Element Method, Prentice Hall, New York, 1973.

[100] G. Thomas and C. W. Ueberhuber, Visualization of Scientific Parallel Programs, Springer, Berlin, 1994.

[101] J. F. Thompson, B. S. Bharat, and N. P. Weatherrill, Handbook of Grid Generation, CRC Press, Philadelphia, 1998.

[102] C. H. Tong, T. F. Chan, and C. C. Kuo, Multilevel filtering preconditioners: Extensions to more general elliptic problems, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 227–242.

[103] B. Topping and A. Khan, Parallel Finite Element Computations, Saxe-Coburg Publications, Edinburgh, 1996.

[104] A. Tveito and R. Winther, Introduction to Partial Differential Equations: A Computational Approach, Springer, 1998.


[105] E. van de Velde, Concurrent Scientific Computing, Texts in Applied Mathematics, Springer, New York, 1994.

[106] H. A. van der Vorst, High performance preconditioning, SIAM J. Sci. Stat. Comput., 10 (1989), pp. 1174–1185.

[107] C. van Loan, Computational Framework for the Fast Fourier Transformation, vol. 10 of Frontiers in Applied Mathematics, SIAM, Philadelphia, 1992.

[108] U. van Rienen, Numerical Methods in Computational Electrodynamics, Springer-Verlag, Berlin, Heidelberg, 2000.

[109] S. Vandewalle, Parallel Multigrid Waveform Relaxation for Parabolic Problems, Teubner Skripten zur Numerik, Teubner, 1993.

[110] R. Verfurth, A review of a posteriori error estimation and adaptive mesh-refinement techniques, Wiley-Teubner, Chichester, 1996.

[111] E. L. Wachspress, The ADI Model Problem, Wachspress, Windsor, CA, 1995.

[112] C. Walshaw and M. Cross, Mesh partitioning: A multilevel balancing and refinement algorithm, SIAM SISC, 22 (2000), pp. 63–80.

[113] C. Walshaw, M. Cross, S. P. Johnson, and M. G. Everett, JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines, in Parallel Computational Fluid Dynamics: New Algorithms and Applications, N. Satofuka, ed., Elsevier, Amsterdam, 1995, pp. 273–280. (Proc. Parallel CFD'94, Kyoto, 1994).

[114] W. L. Wendland, ed., Boundary Element Topics, Berlin, Heidelberg, 1998, Springer. Proceedings of the Final Conference of the Priority Research Programme Boundary Element Methods 1989–1995 of the German Research Foundation.

[115] P. Wesseling, An Introduction to Multigrid Methods, http://www.MGNet.org, Cos Cob, CT, 2001. Reprint of the 1992 edition.

[116] E. Zeidler, Applied Functional Analysis: Applications to Mathematical Physics, vol. 108 of Applied Mathematical Sciences, Teubner, 1995.

[117] O. C. Zienkiewicz, The Finite Element Method in Engineering Science, McGraw-Hill, London, 1971.