
Numerical Linear Algebra
Lecture Notes 2014

Bärbel Janssen

October 15, 2014

Department of High Performance Computing

School of Computer Science and Communication

Royal Institute of Technology, KTH


Contents

1 Introduction

2 Condition and Stability
  2.1 Errors and machine precision
    2.1.1 Errors
    2.1.2 Floating-point representation
    2.1.3 Arithmetic operations and cancellation
  2.2 Conditioning
  2.3 Stability

3 Linear Algebra and Analysis
  3.1 Key definitions and notation
    3.1.1 Examples
  3.2 Linear systems
  3.3 Vector norms
  3.4 Matrix norms
  3.5 Orthogonal systems
  3.6 Non-quadratic linear systems
  3.7 Eigenvalues and eigenvectors
    3.7.1 Geometric and algebraic multiplicity
    3.7.2 Singular Value Decomposition
  3.8 Conditioning of linear algebraic systems
  3.9 Conditioning of eigenvalue problems

4 Direct Solution Methods
  4.1 Triangular linear systems
  4.2 Gaussian elimination
    4.2.1 An example for why pivoting is necessary
    4.2.2 Conditioning of Gaussian elimination
  4.3 Symmetric matrices
    4.3.1 Cholesky decomposition
  4.4 Orthogonal decomposition
    4.4.1 Householder Transformation
  4.5 Direct determination of eigenvalues

5 Iterative Solution Methods
  5.1 Fixed point iteration and defect correction
    5.1.1 Construction of iterative methods
  5.2 Descent methods
    5.2.1 Gradient method
    5.2.2 Conjugate gradient method (cg method)

6 Iterative Methods for Eigenvalue Problems
  6.1 Methods for the partial eigenvalue problem
    6.1.1 The Power method
    6.1.2 The inverse iteration
  6.2 Krylov space methods
    6.2.1 Reduction by Galerkin approximation
    6.2.2 Lanczos and Arnoldi method


1 Introduction

These lecture notes are based on a course given at the Royal Institute of Technology, KTH, in autumn 2014. During a pre-study week, problems are solved theoretically and small programs are implemented in Matlab. Then a compact course follows in which each day a new topic of numerical linear algebra is considered. The course ends with a week for solving bigger projects with Matlab.


2 Condition and Stability

Numerical methods require the use of computers, and therefore precision is limited. The methods we use have to be analyzed in view of this finite precision.

2.1 Errors and machine precision

In approximations, we would like to decide how close we are to a (possibly) exact solution. This usually means we are interested in the difference between the approximate and the exact solution.

2.1.1 Errors

Definition 2.1 (Absolute error) Let x̃ be an approximation to x. The absolute error e is defined as

    e = |x̃ − x|,  x ∈ R,        e = ‖x̃ − x‖,  x ∈ R^n.

The absolute error e might be large just because |x| or ‖x‖ is large. This gives rise to the following definition.

Definition 2.2 (Relative error) For x ≠ 0, we define the relative error between x and its approximation x̃ as

    e_R = |x̃ − x| / |x|,  x ∈ R,        e_R = ‖x̃ − x‖ / ‖x‖,  x ∈ R^n.

Errors may have various causes, but one of them is the already mentioned finite precision. This is connected with the computer's representation of numbers.

2.1.2 Floating-point representation

In scientific computations, floating-point numbers are typically used. The three-digit floating-point representation of π = 3.1415926... is

    +.314 × 10^1.

Definition 2.3 (Decimal floating-point number) The fraction .314 is called the mantissa, 10 is called the base, and 1 is called the exponent. Generally, n-digit decimal floating-point numbers have the form

    ±.d_1 d_2 · · · d_n × 10^p,


where the digits d_1 through d_n are integers between 0 and 9 and d_1 is never zero except in the case d_1 = d_2 = · · · = d_n = 0.

Generally, a floating-point number can be represented using a base b as

    ±.d_1 d_2 · · · d_n × b^p,

where each digit is between 0 and b − 1. For the decimal system b = 10, while b = 2 in the binary system, where the digits are 0 or 1. If a number can be written in this form, we call it representable. Before a computation with any number x, we have to convert it into a representable number x̃:

    x̃ = fl(x).

The relation between the exact value x and its representable version fl(x) has the form

fl(x) = x(1 + δ),

where δ may be positive, negative or zero; its magnitude is bounded by the machine precision.

2.1.3 Arithmetic operations and cancellation

Errors in finite arithmetic do not only occur in representing numbers but also in performing arithmetic operations. The result of an operation may not be representable even if both input numbers were representable. The result then needs to be rounded or truncated to be representable. The result of a floating-point operation satisfies

fl(a op b) = (a op b)(1 + δ)

where op is one of the basic operations +, −, ·, /. Usually, properties of exact mathematical operations do not carry over to arithmetic operations; for example,

    fl(fl(x + y) + z) ≠ fl(x + fl(y + z)).

A well-known example of loss of precision occurs in the computation of the exponential of an arbitrary number x. Since the convergence radius of the exponential series is infinite, which means it converges for all numbers x ∈ R, the power-series expansion

    exp(x) = Σ_{k=0}^∞ x^k / k! = 1 + x + x^2/2 + x^3/3! + · · ·        (2.1)

is a good candidate for computations. However, for negative values of x the convergence is very slow and comes with a high relative error. When the expansion eq. (2.1) is applied to a negative argument, the terms alternate in sign, so subtraction is required to compute the approximation. The large relative error is caused by rounding in combination with subtraction.

When close floating-point numbers are subtracted, the leading digits of their mantissas cancel (hence the term cancellation). With exact arithmetic, the remaining low-order digits would in general be nonzero. However, this information has been discarded before the subtraction in order to represent the numbers.
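As a small illustration in MATLAB (the course's tool), the following sketch evaluates the truncated series (2.1) for a negative argument and compares it with the built-in exp; the argument −20 and the 60-term truncation are arbitrary choices for the demonstration, not values from the notes.

    % Evaluate exp(x) by the power series (2.1) and compare with the built-in exp.
    x = -20;                  % negative argument: the terms alternate in sign
    s = 1; term = 1;
    for k = 1:60              % 60 terms are plenty for the series to have converged
        term = term * x / k;  % term = x^k / k!
        s = s + term;
    end
    rel_err = abs(s - exp(x)) / abs(exp(x));
    fprintf('series: %g   exp: %g   relative error: %g\n', s, exp(x), rel_err)
    % The relative error is huge although every operation was done in double
    % precision: cancellation between the large alternating terms has destroyed
    % the result. Summing the series for +20 and taking the reciprocal would
    % avoid the subtractions and give full accuracy.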

Remark 2.1 Mathematically equivalent methods may differ dramatically in their sensitivity to rounding errors.


2.2 Conditioning

When analyzing algorithms for mathematical problems we also have to know about the conditioning of the problem itself.

Definition 2.4 (Condition) A problem is well conditioned if small changes in problem parameters produce small changes in the outcome.

Thus, to decide whether a problem is well conditioned, we make a list of the problem's parameters, change each parameter slightly, and observe the outcome. If the new outcome deviates from the old one just a little, the problem is well conditioned.

Remark 2.2 Being well or ill conditioned is a property of the problem itself. It has nothing to do with the algorithm applied to solve the problem. In particular, it is independent of computation and rounding errors.

Example 2.1 Let us consider the problem of determining the roots of the polynomial

    (x − 1)^4 = 0,        (2.2)

whose four roots are all exactly equal to 1. Now we suppose that the right-hand side is perturbed by 10^{-8}, so that the perturbed problem of eq. (2.2) reads

    (x − 1)^4 = 10^{-8},        (2.3)

which can equivalently be written as

    (x − 1)^4 − 10^{-8} = (x − 99/100)(x − 101/100)(x − 1 + i/100)(x − 1 − i/100),

where i ∈ C is the imaginary unit with i^2 = −1. That means the roots have changed by 10^{-2} compared to the roots of eq. (2.2). The change in the solution is six orders of magnitude larger than the size of the perturbation. So we say that this problem is ill conditioned.
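The sensitivity can be checked quickly in MATLAB; the sketch below expands (x − 1)^4 − 10^{-8} into its monomial coefficients only for the purpose of calling roots.

    p  = [1 -4 6 -4 1];                  % coefficients of (x - 1)^4
    pp = p;  pp(end) = pp(end) - 1e-8;   % right-hand side perturbed by 1e-8
    r  = roots(pp);                      % roots of the perturbed polynomial
    disp(abs(r - 1))                     % every root has moved by about 1e-2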

2.3 Stability

Let us turn to the analysis of algorithms. We expect them to give a correct answer even if small errors are made.

Definition 2.5 (Stability) An algorithm is stable if small changes in algorithm parameters have a small effect on the algorithm's output.

Example 2.2 Consider the example of solving the equation

    x = e^{−x},   or equivalently   x = −ln(x).


The iteration procedure

    x_{k+1} = −ln(x_k)        (2.4)

produces approximations which diverge from the root unless the starting guess x_1 is the solution itself. This algorithm is unstable since each iteration amplifies the error. An iteration cobweb in fig. 2.1 shows a few iteration steps beginning with x_1 = 1/2. In contrast to the unstable iteration in eq. (2.4), let us look at a stable iteration,

Figure 2.1: Cobweb for the iteration eq. (2.4).

which is given as

    x_{k+1} = exp(−x_k).        (2.5)

Beginning with the starting value x_1 = 1, this algorithm produces iterates converging to the root. The first few iterates are shown in a cobweb in fig. 2.2. This is an example of a stable algorithm.


Figure 2.2: Cobweb for the iteration eq. (2.5).
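The two iterations are easy to compare numerically; a short MATLAB sketch with the starting values of the figures:

    % Unstable iteration (2.4): x_{k+1} = -log(x_k), starting at 1/2.
    x = 0.5;
    for k = 1:4
        x = -log(x);     % the iterates move away from the fixed point
    end                  % and soon become negative, so the next logarithm is not real
    % Stable iteration (2.5): x_{k+1} = exp(-x_k), starting at 1.
    y = 1;
    for k = 1:30
        y = exp(-y);     % converges to the root of x = exp(-x)
    end
    disp(y)              % approximately 0.5671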


3 Linear Algebra and Analysis

In this chapter we introduce the notation which is used throughout these lecture notes. For convenience, we recall the most important definitions and results. However, we refer to standard literature for proofs.

3.1 Key definitions and notation

The elements of finite-dimensional real vector spaces are denoted by

    x = (x_1, . . . , x_n)^T ∈ R^n   with x_i ∈ R.

If we want to refer to this vector as a row vector, we simply write x^T = (x_1, . . . , x_n). For two vectors x, y ∈ R^n, addition and scalar multiplication are defined as

    x + y := (x_1 + y_1, . . . , x_n + y_n),      αx := (αx_1, . . . , αx_n).

Definition 3.1 (Linear independence) A set of vectors v_1, . . . , v_k ∈ R^n is called linearly independent if

    Σ_{i=1}^k c_i v_i = 0  ⇒  c_i = 0  for i = 1, . . . , k.

The rank of a matrix A, rank(A), is the number of linearly independent columns of A.

This is the ingredient for the next definition of a basis.

Definition 3.2 (Basis) A set of n vectors v_1, . . . , v_n ∈ R^n is called a basis of R^n if they are linearly independent. We say that the basis spans R^n. Any element of R^n can be uniquely written as

    x = Σ_{i=1}^n c_i v_i,   c_i ∈ R, for i = 1, . . . , n.

Using Zorn’s Lemma, one can show that each vector space possesses a basis.

A special basis of R^n is formed by the unit vectors {e_1, . . . , e_n}, where

    e_i = (δ_{1i}, . . . , δ_{ni})^T,      δ_ij = 1 for i = j,   δ_ij = 0 for i ≠ j.

The symbol δ_ij is called the Kronecker symbol.


Definition 3.3 (Euclidean scalar product) The function (·, ·) : R^n × R^n → R defined by

    (x, y) = y^T x = x^T y = Σ_{i=1}^n x_i y_i

is called the Euclidean scalar product. Two vectors x and y of R^n are orthogonal if (x, y) = 0.

This whole lecture deals with linear transformations. The definition is the following:

Definition 3.4 (Linearity) Let A be a transformation between two vector spaces,

    A : R^n → R^m.

We call A linear if

    A(αx + βy) = A(αx) + A(βy) = αA(x) + βA(y)

holds for every α, β ∈ R and every x, y ∈ R^n.

In the next definition we turn to linear transformations and matrices.

Definition 3.5 (Linear transformation and matrices) Let {v_1, . . . , v_n} and {w_1, . . . , w_m} be bases of R^n and R^m, respectively. Relative to these bases, a linear transformation

    A : R^n → R^m

is represented by the matrix A having m rows and n columns,

    A = [ a_11  a_12  · · ·  a_1n ]
        [ a_21  a_22  · · ·  a_2n ]
        [  ⋮     ⋮     ⋱      ⋮   ]
        [ a_m1  a_m2  · · ·  a_mn ],

where the coefficients a_ij of the matrix A are uniquely defined by the relations

    A v_j = Σ_{i=1}^m a_ij w_i,   j = 1, . . . , n.

A matrix with elements a_ij is written as A = (a_ij). Given a matrix A, (A)_ij denotes the element in the ith row and the jth column. The transpose of A is uniquely defined by the relations

    (Ax, y) = (x, A^T y)   for every x ∈ R^n, y ∈ R^m,


which imply that (A^T)_ij = a_ji. A matrix is called symmetric if A = A^T. For (square) matrices, addition and scalar multiplication are defined elementwise. Let

    A = (a_ij),   B = (b_ij),   α ∈ R;

then we have

    αA + B = (α a_ij + b_ij),

as well as multiplications between a matrix and a vector or between two matrices,

    (Ax)_i = Σ_{k=1}^n a_ik x_k,  Ax ∈ R^n,      (AB)_ij = Σ_{k=1}^n a_ik b_kj,  AB ∈ R^{n×n}.

3.1.1 Examples

In order to better understand the definitions, let us look at some examples.

Example 3.1 (Transpose) Given the matrix

    A = [ 1   2  −1 ]
        [ 3  −4   6 ],

its transpose is

    A^T = [  1   3 ]
          [  2  −4 ]
          [ −1   6 ].

Only matrices of the same dimensions can be added as the next example shows.

Example 3.2 (Addition of matrices)

    [ 1  2  3 ]   [ 0  6  2 ]   [ 1  8  5 ]
    [ 3  4  5 ] + [ 1  1  1 ] = [ 4  5  6 ]

In the case of matrix-vector multiplication the dimensions have to be compatible.

Example 3.3 (Matrix-vector multiplication)

    [ a_11  a_12 ]            [ a_11 x_1 + a_12 x_2 ]
    [ a_21  a_22 ] [ x_1 ]  = [ a_21 x_1 + a_22 x_2 ]
    [ a_31  a_32 ] [ x_2 ]    [ a_31 x_1 + a_32 x_2 ],

or, as a more concrete example, if

    A = [ 3   1  −2 ]          [ 1 ]
        [ 1  −1   6 ]    x  =  [ 2 ]
                               [ 3 ],

then

    Ax = [ 3 + 2 − 6  ]   [ −1 ]
         [ 1 − 2 + 18 ] = [ 17 ].

In the same manner two matrices of suitable dimensions are multiplied.


Example 3.4 (Matrix-matrix multiplication) Let

    A = [ 3   1  −2 ]          B = [ 1  −1 ]
        [ 1  −1   6 ],             [ 2   2 ]
                                   [ 3  −1 ],

then

    AB = [ −1   1 ]
         [ 17  −9 ].

3.2 Linear systems

One focus of this course is to solve linear systems like

    a_11 x_1 + a_12 x_2 + · · · + a_1n x_n = b_1
    a_21 x_1 + a_22 x_2 + · · · + a_2n x_n = b_2
        ⋮                                                  (3.1)
    a_m1 x_1 + a_m2 x_2 + · · · + a_mn x_n = b_m

numerically. Direct and iterative methods for computing a solution numerically will be presented and explained. Before applying sophisticated methods, we have to make sure that a solution exists. Therefore, we recall the known results from basic linear algebra.

We present criteria of solvability for rectangular as well as quadratic systems. The system eq. (3.1) is solvable if and only if rank(A) = rank(A, b) with the composed matrix

    (A, b) = [ a_11  · · ·  a_1n  b_1 ]
             [  ⋮     ⋱      ⋮     ⋮  ]
             [ a_m1  · · ·  a_mn  b_m ].

In case the matrix A is quadratic, the following criteria for solvability of eq. (3.1) are equivalent:

• Ax = 0 implies x = 0.

• rank(A) = n.

• det(A) ≠ 0.

• All eigenvalues of A are nonzero.

• The inverse A^{-1} of A exists and satisfies A^{-1}A = AA^{-1} = I.

For the definition of an eigenvalue, see section 3.7.
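The rank criterion can be checked directly in MATLAB; a small sketch with an arbitrary singular example matrix:

    A = [1 2; 2 4];  b = [1; 1];     % the second row of A is a multiple of the first
    if rank([A b]) == rank(A)
        x = A \ b;                   % a solution exists (it may not be unique)
    else
        disp('the system Ax = b has no solution')   % this branch is taken here
    end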


3.3 Vector norms

Using numerical methods with limited precision and round-off errors, we have to decide how far away from the exact solution the computed solution is. This means we have to measure distances between vectors, for example. In this section we provide the necessary tools.

Definition 3.6 (Norm) A mapping ‖·‖ : R^n → R is a (vector) norm if it fulfills the following properties:

(i) ‖x‖ ≥ 0 for any vector x ∈ Rn, ‖x‖ = 0⇒ x = 0.

(ii) For any scalar α ∈ R and any vector x ∈ Rn, it holds ‖αx‖ = |α|‖x‖.

(iii) For any two vectors x ∈ Rn and y ∈ Rn, the triangle inequality holds:‖x+ y‖ ≤ ‖x‖+ ‖y‖.

A vector space equipped with a norm is called a normed vector space.

Example 3.5 (Vector norms) A standard example of a vector norm is the Euclidean norm

    ‖x‖_2 := ( Σ_{i=1}^n |x_i|^2 )^{1/2}.

Other examples are the maximum norm

    ‖x‖_∞ := max_{1≤i≤n} |x_i|,

the l_1 norm ‖x‖_1 := Σ_{i=1}^n |x_i|, and more generally the l_p norms for 1 ≤ p < ∞,

    ‖x‖_p := ( Σ_{i=1}^n |x_i|^p )^{1/p}.

Let us compute different norms of the same vector in different vector norms.

Example 3.6 Consider the vector x = (1, −2)^T. Then ‖x‖_1 = 3, ‖x‖_2 = √5 and ‖x‖_∞ = 2.
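MATLAB's norm function reproduces these values:

    x = [1; -2];
    [norm(x,1), norm(x,2), norm(x,Inf)]   % returns 3, sqrt(5) ≈ 2.2361, and 2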

Different norms of the same vector can give different results. However, the norms cannot differ arbitrarily, as the following theorem shows.

Theorem 3.1 (Equivalence of norms) All norms on the finite-dimensional vector space R^n are equivalent to the maximum norm, i.e., for each norm ‖·‖ there exist positive constants m and M such that

    m ‖x‖_∞ ≤ ‖x‖ ≤ M ‖x‖_∞,   x ∈ R^n.


In the following we state some of the most important inequalities used throughout this course.

Lemma 3.1 (Hölder's inequality) For 1 < p, q < ∞ with 1/p + 1/q = 1 and x, y ∈ R^n, the inequality

    Σ_{i=1}^n |x_i y_i| ≤ ( Σ_{i=1}^n |x_i|^p )^{1/p} ( Σ_{i=1}^n |y_i|^q )^{1/q}

holds. In the case p = q = 2 this inequality is called the Cauchy-Schwarz inequality.

Lemma 3.2 (Cauchy-Schwarz inequality)

    Σ_{i=1}^n |x_i y_i| ≤ ( Σ_{i=1}^n |x_i|^2 )^{1/2} ( Σ_{i=1}^n |y_i|^2 )^{1/2}.

In other references, you can find this inequality also written as

    |(x, y)|^2 ≤ (x, x)(y, y) = ‖x‖_2^2 ‖y‖_2^2,

or as

    x^T y ≤ ‖x‖_2 ‖y‖_2.

Lemma 3.3 (Minkowski inequality) For 1 ≤ p ≤ ∞ the inequality

    ‖x + y‖_p ≤ ‖x‖_p + ‖y‖_p,   x, y ∈ R^n,

holds.

For scalars the following inequality is often used.

Lemma 3.4 (Young's inequality) For 1 < p, q < ∞ with 1/p + 1/q = 1, the inequality

    |xy| ≤ |x|^p / p + |y|^q / q,   x, y ∈ R,

holds.

3.4 Matrix norms

Definition 3.7 (Matrix norm) A mapping ‖·‖ : R^{n×n} → R is a matrix norm if it fulfills the following properties:

(i) ‖A‖ ≥ 0 for any matrix A ∈ Rn×n, ‖A‖ = 0⇒ A = 0.

(ii) For any scalar α ∈ R and any matrix A ∈ Rn×n, it holds ‖αA‖ = |α|‖A‖.

(iii) For any two matrices A ∈ Rn×n and B ∈ Rn×n, the triangle inequality holds:‖A+B‖ ≤ ‖A‖+ ‖B‖.


If a matrix norm in addition also satisfies

‖AB‖ ≤ ‖A‖‖B‖,

it is called submultiplicative or consistent.

A matrix norm for any A ∈ R^{n×n} can be created from any vector norm ‖·‖ on R^n by

    ‖A‖ := sup_{x ∈ R^n, x ≠ 0} ‖Ax‖ / ‖x‖.        (3.2)

All matrix norms generated in this way are called natural matrix norms. These norms all satisfy

(i) ‖Ax‖ ≤ ‖A‖‖x‖ for every x ∈ Rn, which follows directly from eq. (3.2).

(ii) ‖AB‖ ≤ ‖A‖‖B‖ for every A,B ∈ Rn×n.

(iii) the norm of the identity matrix ‖I‖ = 1.

Theorem 3.2 (Natural matrix norms) Let A = (a_ij) be a square matrix. Then

    ‖A‖_1 = sup_{x≠0} ‖Ax‖_1 / ‖x‖_1 = max_{1≤j≤n} Σ_{i=1}^n |a_ij|   (maximum absolute column sum),

    ‖A‖_2 = sup_{x≠0} ‖Ax‖_2 / ‖x‖_2   (spectral norm),

    ‖A‖_∞ = sup_{x≠0} ‖Ax‖_∞ / ‖x‖_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|   (maximum absolute row sum).
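A quick MATLAB check of the column-sum and row-sum formulas against the built-in norm, using an arbitrary example matrix:

    A = [3 1 -2; 1 -1 6; 0 2 1];
    col = max(sum(abs(A),1));            % maximum absolute column sum
    row = max(sum(abs(A),2));            % maximum absolute row sum
    [col, norm(A,1); row, norm(A,Inf)]   % each row of this 2-by-2 output holds two equal numbers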

3.5 Orthogonal systems

We introduced the definition of orthogonality: two vectors x, y ∈ R^n are orthogonal if (x, y) = 0. For two subspaces N, M ⊂ R^n we define orthogonality as

    (x, y) = 0   for all x ∈ N, y ∈ M.

Definition 3.8 (Orthogonal projection) Let M ⊂ R^n be a nontrivial subspace. For any vector x ∈ R^n the orthogonal projection P_M x ∈ M is defined by

    ‖x − P_M x‖_2 = min_{y∈M} ‖x − y‖_2.

This best-approximation property is equivalent to the relation

    (x − P_M x, y) = 0   for all y ∈ M,

which can be used to compute P_M x.


A set of vectors {v_1, . . . , v_m}, v_i ≠ 0, of R^n which are mutually orthogonal, (v_k, v_l) = 0 for k ≠ l, is necessarily linearly independent.

Definition 3.9 (Orthogonal system) A set of vectors {v_1, . . . , v_m}, v_i ≠ 0, of R^n which are mutually orthogonal, (v_k, v_l) = 0 for k ≠ l, is called an orthogonal system and, in the case m = n, an orthogonal basis. If in addition (v_k, v_k) = 1 for k = 1, . . . , m, the system is called an orthonormal system or orthonormal basis, respectively.

Often it is necessary to transform an existing system of vectors into an orthogonal or orthonormal one. The following theorem states how.

Theorem 3.3 (Gram-Schmidt algorithm) Let {v_1, . . . , v_n} be any basis of R^n. Then

    b_1 := v_1 / ‖v_1‖_2,

    b̃_k := v_k − Σ_{j=1}^{k−1} (v_k, b_j) b_j,      b_k := b̃_k / ‖b̃_k‖_2,      k = 2, . . . , n,

yields an orthonormal basis {b_1, . . . , b_n} of R^n. This algorithm is called the Gram-Schmidt algorithm.
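A direct MATLAB transcription of the classical Gram-Schmidt algorithm (as remarked in section 4.4, this variant loses orthogonality in finite precision and is shown only to illustrate the formulas):

    function B = gram_schmidt(V)
    % V: matrix whose n columns form a basis of R^n; B: orthonormal basis.
    n = size(V,2);
    B = zeros(size(V));
    B(:,1) = V(:,1) / norm(V(:,1));
    for k = 2:n
        bt = V(:,k);
        for j = 1:k-1
            bt = bt - (V(:,k)' * B(:,j)) * B(:,j);   % subtract projections onto b_1,...,b_{k-1}
        end
        B(:,k) = bt / norm(bt);
    end
    end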

Definition 3.10 (Orthonormal matrix) A matrix Q ∈ R^{m×n} is called orthogonal or orthonormal if its column vectors form an orthogonal or orthonormal system, respectively. In the case m = n such a matrix is called unitary.

Lemma 3.5 A unitary matrix Q ∈ R^{n×n} is regular and its inverse is Q^{-1} = Q^T. Further, it holds that

    (Qx, Qy) = (x, y),   x, y ∈ R^n,
    ‖Qx‖_2 = ‖x‖_2,      x ∈ R^n.        (3.3)

Definition 3.11 (Range and null space) For a general matrix A ∈ R^{m×n}, we define its range and null space (or kernel) as

    range(A) := { y ∈ R^m : y = Ax, x ∈ R^n },
    kern(A)  := { x ∈ R^n : Ax = 0 },

respectively.

Definition 3.12 A quadratic matrix A ∈ R^{n×n}

• is called normal if A^T A = A A^T,

• is called positive semi-definite if (Ax, x) ≥ 0 for all x ∈ R^n,

• and positive definite if (Ax, x) > 0 for all x ∈ R^n \ {0}.


Remark 3.1 In view of (3.3), we remark that the Euclidean scalar product and the Euclidean vector norm are invariant under unitary transformations.

Example 3.7 (Unitary matrix) The real unitary matrix

    Q_θ^{(ij)} = [ 1    0      0    0     0 ]
                 [ 0   cos θ   0  −sin θ  0 ]   ← row i
                 [ 0    0      1    0     0 ]
                 [ 0   sin θ   0   cos θ  0 ]   ← row j
                 [ 0    0      0    0     1 ]

describes a rotation in the (x_i, x_j)-plane about the origin by the angle θ ∈ [0, 2π).

3.6 Non-quadratic linear systems

Let A ∈ R^{m×n} be a not necessarily quadratic matrix and b ∈ R^m a right-hand side vector. In this section we consider solving the non-quadratic system

    Ax = b        (3.4)

for x ∈ R^n, where m ≠ n. We need a new notion of a solution to the linear system eq. (3.4), as the following example shows:

    A = [ 1 ]          b = [ 1 ]
        [ 1 ]              [ 1 ]
        [ 0 ],             [ 1 ].

There exists no vector x such that eq. (3.4) is exactly fulfilled. Instead of looking for an exact solution, we seek a vector x such that Ax is as close to b as possible. In mathematical terms, this means we look for a vector x that minimizes ‖b − Ax‖_2. In the case that x is an exact solution of eq. (3.4), it also minimizes ‖b − Ax‖_2.

Theorem 3.4 (Least squares) There exists a solution x̄ ∈ R^n of eq. (3.4) in the least-squares sense, i.e.,

    ‖A x̄ − b‖_2 = min_{x ∈ R^n} ‖Ax − b‖_2.

This is equivalent to x̄ being a solution of the normal equation

    A^T A x̄ = A^T b.
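In MATLAB the least-squares solution can be computed either from the normal equation or, more stably, with the backslash operator; a sketch using the small example above:

    A = [1; 1; 0];   b = [1; 1; 1];
    x_ne = (A'*A) \ (A'*b);    % solution of the normal equation A'A x = A'b
    x_ls = A \ b;              % least-squares solution via backslash
    norm(b - A*x_ls)           % the minimal defect norm, here equal to 1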


3.7 Eigenvalues and eigenvectors

In the following we consider quadratic matrices A ∈ Rn×n.

Definition 3.13 (Eigenvalue) A number λ ∈ C is called an eigenvalue of the matrix A ∈ R^{n×n} if there exists a vector w ≠ 0, w ∈ C^n, such that

    Aw = λw.

The vector w is called an eigenvector. Eigenvalues can also be obtained as the zeros of the characteristic polynomial

    χ_A(z) := det(A − zI).        (3.5)

Here I denotes the identity matrix of the same size as the matrix A. The set of all eigenvalues of a matrix A is called its spectrum and is denoted by σ(A) ⊂ C. Any matrix A ∈ R^{n×n} has n eigenvalues {λ_1, . . . , λ_n}, which are not necessarily distinct.

Definition 3.14 (Similar matrices) Two matrices A, B ∈ R^{n×n} are called similar (to each other) if there is a regular matrix T ∈ R^{n×n} such that

    B = T^{-1} A T.

Lemma 3.6 For two similar matrices A, B ∈ R^{n×n} there holds:

(i) det(A) = det(B).

(ii) σ(A) = σ(B).

(iii) trace(A) = trace(B),

where the trace of A is defined as trace(A) := Σ_{i=1}^n a_ii.

Let λ_i, i = 1, . . . , n, denote the eigenvalues of the matrix A ∈ R^{n×n} and let u_i, i = 1, . . . , n, denote the corresponding normalized eigenvectors, so that

    A u_i = λ_i u_i,   i = 1, . . . , n.        (3.6)

Write the u_i as the column vectors of a matrix U and let Λ = diag(λ_1, . . . , λ_n) be the diagonal matrix with the eigenvalues on the diagonal. With these matrices, eq. (3.6) can be written in compact form as

    AU = UΛ  ⇔  A = U Λ U^T        (3.7)

provided U is orthonormal (which is the case for symmetric A), so that U^T = U^{-1}. The form on the right of eq. (3.7) is called the spectral decomposition of A. On the other hand, if a spectral decomposition with an orthonormal matrix U is available, the entries of the diagonal matrix Λ are the eigenvalues of A. In general not all matrices are diagonalizable in this way. If, for


example, A does not have n linearly independent eigenvectors, it is not diagonalizable. However, there exists an invertible matrix P such that

    P^{-1} A P = [ J_1          ]
                 [      ⋱       ]
                 [          J_r ],

where each submatrix J_i of order n_i has the form

    J_i = λ_i I  if n_i = 1,       J_i = [ λ_i  1            ]
                                         [      ⋱   ⋱        ]   if n_i ≥ 2.        (3.8)
                                         [          λ_i   1  ]
                                         [               λ_i ]

This is the Jordan normal form of A. Another normal form is given by the Schur decomposition.

Theorem 3.5 (Real Schur decomposition) If A ∈ R^{n×n}, then there exists an orthogonal Q ∈ R^{n×n} such that

    Q^T A Q = [ R_11  R_12  · · ·  R_1m ]
              [  0    R_22  · · ·  R_2m ]
              [  ⋮     ⋮     ⋱      ⋮   ]
              [  0     0    · · ·  R_mm ],

where each R_ii is either a 1-by-1 matrix or a 2-by-2 matrix having complex conjugate eigenvalues.

For a proof, we refer to [2].

Lemma 3.7 (Spectral norm) For an arbitrary square matrix A ∈ R^{n×n} the product matrix A^T A ∈ R^{n×n} is always symmetric and positive semi-definite. For the spectral norm of A there holds

    ‖A‖_2 = max{ √|λ| : λ ∈ σ(A^T A) }.

If A is symmetric, then there holds

    ‖A‖_2 = max{ |λ| : λ ∈ σ(A) }.

Let ‖·‖ be an arbitrary vector norm with corresponding natural matrix norm. Then, for a normalized eigenvector u, ‖u‖ = 1, of a matrix A corresponding to the eigenvalue λ, there holds

    |λ| = |λ| ‖u‖ = ‖λu‖ = ‖Au‖ ≤ ‖A‖ ‖u‖ = ‖A‖,


i.e., all eigenvalues of A are contained in a circle in C with center at the origin and radius ‖A‖. In particular, with ‖A‖_∞ we obtain the eigenvalue bound

    max_{λ∈σ(A)} |λ| ≤ ‖A‖_∞ = max_{i=1,...,n} Σ_{j=1}^n |a_ij|.        (3.9)

Since the eigenvalues of A^T and A coincide, σ(A) = σ(A^T), using the bound eq. (3.9) simultaneously for A^T and for A we get the refined bound

    max_{λ∈σ(A)} |λ| ≤ min( ‖A‖_∞, ‖A^T‖_∞ ) = min( max_{i=1,...,n} Σ_{j=1}^n |a_ij| , max_{j=1,...,n} Σ_{i=1}^n |a_ij| ).

Theorem 3.6 (Gerschgorin-Hadamard theorem) All eigenvalues of a matrix A ∈ R^{n×n} are contained in the union of the corresponding Gerschgorin circles

    K_j := { z ∈ C : |z − a_jj| ≤ Σ_{k=1, k≠j}^n |a_jk| },   j = 1, . . . , n.
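A short MATLAB sketch that compares the Gerschgorin circles with the actual eigenvalues of an arbitrary example matrix:

    A = [4 1 0; 1 3 1; 0 1 2];
    centers = diag(A);
    radii   = sum(abs(A),2) - abs(centers);    % off-diagonal row sums
    lambda  = eig(A);
    for j = 1:length(lambda)                   % every eigenvalue lies in at least one circle
        fprintf('lambda = %7.4f   inside a circle: %d\n', lambda(j), ...
                any(abs(lambda(j) - centers) <= radii))
    end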

Example 3.8 (Eigenvalues and eigenvectors) Let us consider the following matrices and vectors:

    A = [ 2  1 ]      B = [ 3  1 ]      u = [ 1 ]      v = [ 1 ]
        [ 1  2 ],         [ 0  2 ],         [ 1 ],         [ 0 ].

Then we have

    Au = [ 3 ] = 3u,        Bv = [ 3 ] = 3v.
         [ 3 ]                   [ 0 ]

This shows that λ = 3 is an eigenvalue of both A and B, that u is an eigenvector of A, and that v is an eigenvector of B.

3.7.1 Geometric and algebraic multiplicity

The set of eigenvectors corresponding to a single eigenvalue, together with the zero vector, forms a subspace of C^n. This subspace is called the eigenspace. If λ is an eigenvalue of A, we denote the corresponding eigenspace by E_λ. An eigenspace is an example of an invariant subspace of A, i.e., A E_λ ⊂ E_λ. The dimension of E_λ can be interpreted as the maximum number of linearly independent eigenvectors corresponding to the same eigenvalue λ. We call the dimension of E_λ the geometric multiplicity of λ. The geometric multiplicity can also be described as the dimension of the null space of A − λI, since that null space is exactly E_λ.

By the fundamental theorem of algebra, we can write χA in the form

χA(z) = (z − λ1)(z − λ2) · · · (z − λn)

for some numbers λ_i ∈ C. By definition 3.13, each λ_i is an eigenvalue of A, and all eigenvalues of A appear somewhere in the list. In general, an eigenvalue might appear more than once. The algebraic multiplicity of an eigenvalue λ of A is its multiplicity as a root of χ_A. An eigenvalue is simple if its algebraic multiplicity is 1.


Example 3.9 Consider the matrices

    A = [ 2       ]        B = [ 2  1    ]
        [    2    ],           [    2  1 ]
        [       2 ]            [       2 ].

Both matrices A and B have the same characteristic polynomial χ_A(z) = χ_B(z) = (z − 2)^3, so there is a single eigenvalue λ = 2 of algebraic multiplicity 3. In the case of A we can choose three independent eigenvectors, for example e_1, e_2 and e_3, so the geometric multiplicity is also 3. For B, on the other hand, we can find only a single independent eigenvector (a scalar multiple of e_1), so the geometric multiplicity of the eigenvalue is only 1.

3.7.2 Singular Value Decomposition

Theorem 3.7 (Singular value decomposition) Let A ∈ R^{m×n} be an arbitrary matrix. There exist orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} such that

    A = U S V^T,    S = diag(σ_1, . . . , σ_p) ∈ R^{m×n},   p = min(m, n),        (3.10)

where σ_1 ≥ σ_2 ≥ · · · ≥ σ_p ≥ 0. Depending on whether m ≤ n or m ≥ n, S consists of the diagonal block diag(σ_1, . . . , σ_p) padded by n − m zero columns or by m − n zero rows, respectively. The nonnegative values {σ_i} are called the singular values of A and eq. (3.10) is called the singular value decomposition (SVD) of A.

From eq. (3.10) we observe that

    A v_i = σ_i u_i,    A^T u_i = σ_i v_i,    i = 1, . . . , min(m, n),

with u_i and v_i the column vectors of U and V, respectively. Moreover, we obtain

    A^T A v_i = σ_i^2 v_i,    A A^T u_i = σ_i^2 u_i,

which shows that the values σ_i, i = 1, . . . , min(m, n), are the square roots of eigenvalues of the symmetric positive semi-definite matrices A^T A ∈ R^{n×n} and A A^T ∈ R^{m×m} corresponding to the eigenvectors v_i and u_i, respectively.

(i) In the case m ≥ n, the matrix A^T A ∈ R^{n×n} has the p = n eigenvalues {σ_i^2, i = 1, . . . , n}, while the matrix A A^T ∈ R^{m×m} has the m eigenvalues {σ_1^2, . . . , σ_n^2, 0, . . . , 0} (with m − n zeros).

(ii) In the case m ≤ n, the matrix A^T A ∈ R^{n×n} has the n eigenvalues {σ_1^2, . . . , σ_m^2, 0, . . . , 0} (with n − m zeros), while the matrix A A^T ∈ R^{m×m} has the p = m eigenvalues {σ_i^2, i = 1, . . . , m}.

The existence of the decomposition eq. (3.10) is a consequence of the fact that A^T A is orthonormally diagonalizable,

    Q^T (A^T A) Q = diag(σ_i^2).

There are important consequences of the decomposition eq. (3.10). Suppose that the singular values are ordered as

    σ_1 ≥ · · · ≥ σ_r > σ_{r+1} = · · · = σ_p = 0,   p = min(m, n).

Then there holds

(i) rank(A) = r,

(ii) kern(A) = span{v_{r+1}, . . . , v_n},

(iii) range(A) = span{u_1, . . . , u_r},

(iv) A = Σ_{i=1}^r σ_i u_i v_i^T,

(v) ‖A‖_2 = σ_1 = σ_max,

(vi) ‖A‖_F = ( σ_1^2 + · · · + σ_r^2 )^{1/2}  (Frobenius norm).
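These consequences can be verified numerically; a MATLAB sketch with an arbitrary rank-deficient matrix:

    A = [1 2 3; 2 4 6; 1 0 1];        % rank 2: the second row is twice the first
    [U,S,V] = svd(A);
    sigma = diag(S);
    r = sum(sigma > 1e-12);           % numerical rank, compare with rank(A)
    [sigma(1), norm(A,2)]             % largest singular value equals the spectral norm
    [norm(sigma), norm(A,'fro')]      % the Frobenius norm is the 2-norm of the sigma_i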

3.8 Conditioning of linear algebraic systems

Let us consider a linear system to be solved,

    Ax = b,

where A ∈ R^{n×n} and b ∈ R^n. Let us further assume that this system is perturbed by small errors δA and δb, so that the system actually solved is

    Ã x̃ = b̃

with Ã = A + δA, b̃ = b + δb and x̃ = x + δx. We are interested in the impact of these perturbations on the solution.

Theorem 3.8 (Perturbation theorem) Let the matrix A ∈ R^{n×n} be regular and let the perturbation satisfy ‖δA‖ < ‖A^{-1}‖^{-1}. Then the perturbed matrix Ã = A + δA is also regular, and for the resulting relative error in the solution there holds

    ‖δx‖ / ‖x‖ ≤ cond(A) / ( 1 − cond(A) ‖δA‖/‖A‖ ) · ( ‖δb‖/‖b‖ + ‖δA‖/‖A‖ ),        (3.11)

with the so-called condition number cond(A) := ‖A‖ ‖A^{-1}‖ of the matrix A.


The condition number cond(A) depends on the vector norm chosen in the estimate eq. (3.11). Most often, the maximum norm ‖·‖_∞ or the Euclidean norm ‖·‖_2 is used. In the first case, there holds

    cond_∞(A) := ‖A‖_∞ ‖A^{-1}‖_∞.

Especially for symmetric (hermitian) matrices we have

    cond_2(A) := ‖A‖_2 ‖A^{-1}‖_2 = |λ_max| / |λ_min|,

with the eigenvalues λ_max and λ_min of A of largest and smallest modulus, respectively.

Example 3.10 Consider the following matrix A and its inverse A^{-1}:

    A = [ 1.2969  0.8648 ]        A^{-1} = 10^8 [  0.1441  −0.8648 ]
        [ 0.2161  0.1441 ],                     [ −0.2161   1.2969 ].

We derive

    ‖A‖_∞ = 2.1617,    ‖A^{-1}‖_∞ = 1.513 · 10^8,

and therefore we obtain

    cond_∞(A) ≈ 3.3 · 10^8.

Hence, this matrix is very ill-conditioned.
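A quick MATLAB check of this example:

    A = [1.2969 0.8648; 0.2161 0.1441];
    cond(A, Inf)                       % approximately 3.3e8
    norm(A,Inf) * norm(inv(A),Inf)     % the same number, assembled by hand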

3.9 Conditioning of eigenvalue problems

The most natural way of computing the eigenvalues of a matrix A ∈ R^{n×n} appears to be via their definition as zeros of the characteristic polynomial, see eq. (3.5). The computation of corresponding eigenvectors could then be achieved by solving the singular system

    (A − λI) w = 0.

This approach is not advisable in general, since the determination of the zeros of a polynomial may be highly ill-conditioned. The computation of the zeros of the Wilkinson polynomial is a famous example, see [1, chapter 3.3].

Theorem 3.9 (Stability theorem) Let A ∈ R^{n×n} be a diagonalizable matrix, let {w_1, . . . , w_n} be n linearly independent eigenvectors of A, and let B ∈ R^{n×n} be an arbitrary second matrix. Then, for each eigenvalue λ(B) of B there is a corresponding eigenvalue λ(A) of A such that, with the matrix W = (w_1, . . . , w_n), there holds

    |λ(A) − λ(B)| ≤ cond_2(W) ‖A − B‖_2.        (3.12)

For symmetric matrices A ∈ R^{n×n} there exists an orthonormal basis of R^n consisting of eigenvectors, so that W in eq. (3.12) can be assumed unitary, W W^T = I. That leads to

    cond_2(W) = ‖W‖_2 ‖W^T‖_2 = 1.

In conclusion, the eigenvalue problem of symmetric (or, more generally, normal) matrices is well conditioned. For general matrices, the conditioning of the eigenvalue problem may be arbitrarily bad.

4 Direct Solution Methods

In this chapter, we study direct methods for the solution of a system of linear equations. In contrast to iterative methods, to which we will turn later in chapter 5, direct methods deliver a solution in a finite number of steps if we assume exact arithmetic. This can be seen as an advantage over iterative methods. However, in order to get a useful result from a direct method, it has to be carried out to the last step. Iterative methods, which theoretically converge only after an infinite number of steps, might produce a satisfying solution within a few iterations. Direct methods are known to be very robust. They are well applicable to moderately sized problems but become infeasible for large systems due to their usually high storage requirements and high computational complexity. Today, moderate size means systems of dimension up to n ≈ 10^5 – 10^6.

We will consider solving systems of the form

Ax = b (4.1)

with a quadratic matrix A and vectors x and b of suitable sizes.

4.1 Triangular linear systems

Systems which have a triangular form are particularly easy to solve. In the case of an upper triangular system, the coefficient matrix A has the form

    A = [ a_11  a_12  · · ·  a_1n ]
        [       a_22  · · ·  a_2n ]
        [              ⋱      ⋮   ]
        [                    a_nn ]

and the corresponding linear system looks like

    a_11 x_1 + a_12 x_2 + · · · + a_1n x_n = b_1
               a_22 x_2 + · · · + a_2n x_n = b_2
                              ⋱        ⋮
                                  a_nn x_n = b_n.

For a_nn ≠ 0, we obtain x_n = b_n / a_nn and plug this into the second-to-last equation to compute x_{n−1}. If we progress in this manner, and provided a_jj ≠ 0 for j = 1, . . . , n, we obtain the solution of (4.1) as

    x_n = b_n / a_nn,      x_j = (1 / a_jj) ( b_j − Σ_{k=j+1}^n a_jk x_k ),   j = n−1, . . . , 1.
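A direct MATLAB transcription of this backward substitution (a sketch without any check for vanishing diagonal entries):

    function x = back_substitution(A, b)
    % Solve A x = b for an upper triangular matrix A.
    n = length(b);
    x = zeros(n,1);
    x(n) = b(n) / A(n,n);
    for j = n-1:-1:1
        x(j) = (b(j) - A(j,j+1:n) * x(j+1:n)) / A(j,j);
    end
    end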


The same operations can be applied to a lower triangular system in the opposite order (forward substitution).

Let us take a look at the following example:

Example 4.1

    x_1 + x_2 + 2x_3 =   3
          x_2 − 3x_3 =  −4
              −19x_3 = −19.

In the first step we obtain x_3 = 1. After inserting it in the second equation we derive

    x_1 + x_2 + 2 =  3
          x_2 − 3 = −4
              x_3 =  1

and find that x_2 = −1. In the last step we solve the first equation and get

    x_1 =  2
    x_2 = −1
    x_3 =  1.

4.2 Gaussian elimination

The elimination method of Gauss transforms the system Ax = b in several elimination steps into an upper triangular form and then applies the procedure described in section 4.1. Only elimination steps that do not change the solution are allowed. These are the following:

• permutation of two rows of the matrix,

• permutation of two columns of the matrix, including the corresponding renumbering of the unknowns x_i,

• addition of a scalar multiple of a row to another row of the matrix.

The procedure is applied to the composed system (A, b). For its realization, we assume that the matrix A is regular. We start by setting A^{(0)} = A, b^{(0)} = b. Since we assumed the matrix to be regular, there exists an element a^{(0)}_{r1} ≠ 0, 1 ≤ r ≤ n. In the first step, we permute the first and the rth row and denote the result again by (A^{(0)}, b^{(0)}). From each remaining row j we then subtract q_{j1} = a^{(0)}_{j1} / a^{(0)}_{11} times the first row. After this first step, we arrive at

    (A^{(1)}, b^{(1)}) = [ a^{(0)}_11  a^{(0)}_12  · · ·  a^{(0)}_1n | b^{(0)}_1 ]
                         [ 0           a^{(1)}_22  · · ·  a^{(1)}_2n | b^{(1)}_2 ]
                         [ ⋮            ⋮           ⋱      ⋮         | ⋮         ]
                         [ 0           a^{(1)}_n2  · · ·  a^{(1)}_nn | b^{(1)}_n ],


where the new entries are given as

    a^{(1)}_{ji} = a^{(0)}_{ji} − q_{j1} a^{(0)}_{1i},      b^{(1)}_j = b^{(0)}_j − q_{j1} b^{(0)}_1,      2 ≤ i, j ≤ n.

Again, the submatrix (a^{(1)}_{jk})_{j,k=2,...,n} is regular and we can repeat the same steps which we applied to the matrix A^{(0)}. After n − 1 of these elimination steps we arrive at the desired upper triangular form

    (A^{(n−1)}, b^{(n−1)}) = [ a^{(0)}_11  a^{(0)}_12  a^{(0)}_13  · · ·  a^{(0)}_1n  | b^{(0)}_1   ]
                             [ 0           a^{(1)}_22  a^{(1)}_23  · · ·  a^{(1)}_2n  | b^{(1)}_2   ]
                             [ 0           0           a^{(2)}_33  · · ·  a^{(2)}_3n  | b^{(2)}_3   ]
                             [ ⋮           ⋮                        ⋱      ⋮          | ⋮           ]
                             [ 0           0           0           · · ·  a^{(n−1)}_nn | b^{(n−1)}_n ].        (4.2)

For the solution of this upper triangular system we refer to section 4.1. For better understanding, we present the following example:

Example 4.2

    A = [ 3  1  6 ]          b = [ 2 ]
        [ 2  1  3 ]              [ 7 ]
        [ 1  1  1 ],             [ 4 ].

The first nonzero element in the first column is a^{(0)}_11 = 3, which means that we do not have to permute any rows at the beginning. For the second and third row, we compute

    q_21 = a_21 / a^{(0)}_11 = 2/3,      q_31 = a_31 / a^{(0)}_11 = 1/3.

In the next step, we get

    (A^{(1)}, b^{(1)}) = [ 3   1    6  |  2   ]
                         [ 0  1/3  −1  | 17/3 ]
                         [ 0  2/3  −1  | 10/3 ],

and again no swapping of rows is required since the entry a^{(1)}_22 ≠ 0. With one more elimination step, we get the upper triangular form

    (A^{(2)}, b^{(2)}) = [ 3   1    6  |  2   ]
                         [ 0  1/3  −1  | 17/3 ]
                         [ 0   0    1  | −8   ].

For an example of how to finally derive the solution, have a look at example 4.1. Since all the permitted actions are linear manipulations, they can be described in terms of matrices:

    (A^{(0)}, b^{(0)}) = P_1 (A, b),      (A^{(1)}, b^{(1)}) = G_1 (A^{(0)}, b^{(0)}),


where P_1 is a permutation matrix which interchanges rows 1 and r, and G_1 is a Frobenius matrix:

    P_1 = [ 0  · · ·  1              ]  ← row 1        G_1 = [  1              ]
          [    1                     ]                       [ −q_21   1       ]
          [        ⋱                 ]                       [   ⋮        ⋱    ]
          [ 1  · · ·  0              ]  ← row r              [ −q_n1         1 ].
          [               ⋱          ]
          [                    1     ]

Both matrices P_1 and G_1 are regular with determinants det(P_1) = ±1, det(G_1) = 1, and there holds

    P_1^{-1} = P_1,      G_1^{-1} = [ 1             ]
                                    [ q_21   1      ]
                                    [  ⋮        ⋱   ]
                                    [ q_n1        1 ].

Since there holds

    Ax = b  ⇔  A^{(1)} x = G_1 P_1 A x = G_1 P_1 b = b^{(1)},

the systems Ax = b and A^{(1)} x = b^{(1)} have the same solution. After n − 1 elimination steps, we arrive at

    (A, b) → (A^{(1)}, b^{(1)}) → · · · → (A^{(n−1)}, b^{(n−1)}) =: (R, c),

where

    (A^{(i)}, b^{(i)}) = G_i P_i (A^{(i−1)}, b^{(i−1)}),      (A^{(0)}, b^{(0)}) := (A, b),

with permutation matrices P_i and Frobenius matrices G_i of the following form: P_i is the identity matrix with rows i and r (r ≥ i) interchanged, and

    G_i = [ 1                           ]
          [    ⋱                        ]
          [       1                     ]  ← row i
          [       −q_{i+1,i}   1        ]
          [          ⋮             ⋱    ]
          [       −q_{n,i}           1  ].


The end result

    (R, c) = G_{n−1} P_{n−1} · · · G_1 P_1 (A, b)

is an upper triangular system which has the same solution as the original system.

4.2.1 An example for why pivoting is necessary

To illustrate a potential pitfall of Gaussian elimination, we solve the system

    .001 x_1 + x_2 = 1,
          x_1 + x_2 = 2.        (4.3)

If we use exact arithmetic, we obtain the upper triangular system

.001x1 + x2 = 1,

−999x2 = −998,

which has the solution

    x_2 = 998/999 ≈ .999,
    x_1 = (1 − x_2)/.001 = (1 − 998/999)/.001 = 1000/999 ≈ 1.001.

However, if we use two-digit decimal floating-point arithmetic with rounding, we derive the following upper triangular system:

    .001 x_1 + x_2 = 1,
    (1 − 1000) x_2 = 2 − 1000.        (4.4)

Evaluating the first difference, the computer produces

    1 − 1000 ≈ fl(1 − 1000) = fl(−999) = fl(−.999 × 10^3) = −1.0 × 10^3.

For the second difference, the computer derives

    2 − 1000 ≈ fl(2 − 1000) = fl(−998) = fl(−.998 × 10^3) = −1.0 × 10^3.

The upper triangular system that results from two-digit decimal floating-point arithmetic is therefore

    .001 x_1 + x_2 = 1,
       −1000 x_2 = −1000.        (4.5)

By backward substitution, the solution is given as

    x_2 = −1000 / (−1000) = 1,
    x_1 = (1 − x_2)/.001 = 0.


The error in x_1 is 100%, since the computed x_1 is zero while the correct value of x_1 is close to 1. As you can observe, the upper triangular systems in eq. (4.3) and eq. (4.5) do not differ much from each other. The crucial step is the backward substitution, where we compute (1 − x_2)/.001. Here cancellation happens, because x_2 is close to one. A way to avoid this dilemma is to interchange the two equations in eq. (4.3):

    x_1 + x_2 = 2,
    .001 x_1 + x_2 = 1.

Applying two-digit decimal floating-point arithmetic to this system yields the upper triangular system

    x_1 + x_2 = 2,
          x_2 = 1.

With backward substitution the solution x_2 = x_1 = 1 is derived. This is in much better agreement with the exact solution. This process is called pivoting.
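In double precision the numbers above do not yet cause trouble, but the same mechanism appears for a sufficiently small pivot; a MATLAB sketch in which the pivot 1e-20 is chosen only to make the effect visible:

    A = [1e-20 1; 1 1];   b = [1; 2];
    m = A(2,1) / A(1,1);                  % huge multiplier without pivoting
    U = A;  U(2,:) = U(2,:) - m*U(1,:);   % U(2,2) = 1 - 1e20 rounds to -1e20
    c = b;  c(2)   = c(2)   - m*c(1);     % c(2)   = 2 - 1e20 rounds to -1e20
    x2 = c(2) / U(2,2);                   % = 1
    x1 = (b(1) - x2) / A(1,1);            % = 0, completely wrong
    [x1, x2; (A\b)']                      % backslash pivots and returns approximately [1 1]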

Definition 4.1 (Pivot element) The element a_{r1}, which becomes a^{(0)}_11 in eq. (4.2) after the permutation, is called the pivot element, and the whole substep of its determination is called the pivot search. For reasons of numerical stability, usually the choice

    |a_{r1}| = max_{j=1,...,n} |a_{j1}|

is made. The whole process, including the permutation of rows, is called column pivoting. If the elements of the matrix A are of very different size, total pivoting is advisable. This consists in the choice

    |a_{rs}| = max_{k,j=1,...,n} |a_{jk}|

and subsequent permutation of the 1st row with the rth row and the 1st column with the sth column. According to the column permutation, the unknowns x_k also have to be renumbered. Total pivoting is costly, so column pivoting is usually preferred.

Let us recall the end result of a Gaussian elimination (with pivoting). We can use the zero entries in the lower triangular part to store the elements q_{k+1,k}, . . . , q_{nk}, which are renamed λ_{k+1,k}, . . . , λ_{nk} after possible permutations of rows due to pivoting:

    [ r_11  r_12  r_13  · · ·      r_1n | c_1 ]
    [ λ_21  r_22  r_23  · · ·      r_2n | c_2 ]
    [ λ_31  λ_32  r_33  · · ·      r_3n | c_3 ]
    [  ⋮     ⋮      ⋱     ⋱         ⋮   |  ⋮  ]
    [ λ_n1  λ_n2  · · ·  λ_{n,n−1} r_nn | c_n ].        (4.6)


Theorem 4.1 (LR factorization) The matrices

    L = [ 1                        ]        R = [ r_11  r_12  · · ·  r_1n ]
        [ l_21   1                 ]            [       r_22  · · ·  r_2n ]
        [  ⋮        ⋱    ⋱         ]            [              ⋱      ⋮   ]
        [ l_n1  · · ·  l_{n,n−1} 1 ]            [                    r_nn ]

are factors in the (multiplicative) decomposition of the matrix PA,

    PA = LR,      P := P_{n−1} · · · P_1.

If such a decomposition exists with P = I, then it is uniquely determined. Once an LR decomposition is computed, the solution of the linear system Ax = b can be obtained by successively solving two triangular systems,

    Ly = Pb,      Rx = y,

by forward and backward substitution, respectively.
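MATLAB's lu computes exactly this decomposition (it names the upper factor U instead of R); a usage sketch with the matrix of example 4.2:

    A = [3 1 6; 2 1 3; 1 1 1];   b = [2; 7; 4];
    [L, U, P] = lu(A);            % P*A = L*U with column pivoting
    y = L \ (P*b);                % forward substitution
    x = U \ y;                    % backward substitution, x = [19; -7; -8]
    norm(A*x - b)                 % residual close to machine precision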

We revisit example 4.2 to illustrate the LR decomposition.

Example 4.3 We carry out the elimination on the composed tableau, storing the multipliers q_ij in the eliminated positions as in eq. (4.6):

    [ 3  1  6 | 2 ]
    [ 2  1  3 | 7 ]
    [ 1  1  1 | 4 ]

pivoting (no row swap needed, the pivot is 3), then elimination:

    [  3    1    6  |  2   ]
    [ 2/3  1/3  −1  | 17/3 ]
    [ 1/3  2/3  −1  | 10/3 ]

pivoting in the second column (the largest element 2/3 sits in the third row), then permutation of rows 2 and 3:

    [  3    1    6  |  2   ]
    [ 1/3  2/3  −1  | 10/3 ]
    [ 2/3  1/3  −1  | 17/3 ]

and a final elimination:

    [  3    1    6    |  2   ]
    [ 1/3  2/3  −1    | 10/3 ]
    [ 2/3  1/2  −1/2  |  4   ]

We obtain the solution

    x_3 = 4 / (−1/2) = −8,      x_2 = (3/2)(10/3 + x_3) = −7,      x_1 = (1/3)(2 − x_2 − 6x_3) = 19,

and the LR decomposition

    P_1 = I,      P_2 = [ 1  0  0 ]
                        [ 0  0  1 ]
                        [ 0  1  0 ],

    PA = [ 3  1  6 ]   = LR = [  1    0   0 ] [ 3   1    6   ]
         [ 1  1  1 ]          [ 1/3   1   0 ] [ 0  2/3  −1   ]
         [ 2  1  3 ]          [ 2/3  1/2  1 ] [ 0   0   −1/2 ].


4.2.2 Conditioning of Gaussian elimination

For any regular matrix A there exists an LR decomposition of the form PA = LR. We derive

    R = L^{-1} P A,      R^{-1} = (PA)^{-1} L

for R and its inverse R^{-1}. Due to column pivoting, all the elements of L and L^{-1} are less than or equal to one in modulus, and there holds

    cond_∞(L) = ‖L‖_∞ ‖L^{-1}‖_∞ ≤ n^2.

Therefore,

    cond_∞(R) = ‖R‖_∞ ‖R^{-1}‖_∞ = ‖L^{-1} P A‖_∞ ‖(PA)^{-1} L‖_∞
              ≤ ‖L^{-1}‖_∞ ‖PA‖_∞ ‖(PA)^{-1}‖_∞ ‖L‖_∞ ≤ n^2 cond_∞(PA).

Applying the perturbation theorem, theorem 3.8, we obtain for the solution of the equation LRx = Pb (considering only perturbations of the right-hand side b, i.e., δA = 0) the estimate

    ‖δx‖_∞ / ‖x‖_∞ ≤ cond_∞(L) cond_∞(R) ‖δ(Pb)‖_∞ / ‖Pb‖_∞ ≤ n^4 cond_∞(PA) ‖δ(Pb)‖_∞ / ‖Pb‖_∞.

Hence, the conditioning of the original system Ax = b is amplified by the LR decomposition by at most n^4 in the worst case. However, this is an extremely pessimistic estimate which can be significantly improved; see, for example, [4, theorem 2.2].

4.3 Symmetric matrices

For symmetric matrices A = A^T we would like the symmetry to be reflected in the LR factorization as well; it should take only half the work to compute it. Let us assume that there exists a decomposition A = LR, where L is a unit lower triangular matrix and R is an upper triangular matrix. Writing R = D R̃, we obtain from the symmetry of A that

    LR = A = A^T = (L D R̃)^T = R̃^T D L^T,

where

    R̃ = [ 1   r_12/r_11   · · ·   r_1n/r_11           ]        D = [ r_11         ]
         [        ⋱          ⋱         ⋮               ]            [       ⋱      ]
         [                  1   r_{n−1,n}/r_{n−1,n−1}  ]            [         r_nn ].
         [                             1               ]

The uniqueness of the LR decomposition implies that L = R̃^T and R = D L^T, such that A may be written as

    A = L D L^T.


4.3.1 Cholesky decomposition

Symmetric positive definite matrices allow for a Cholesky decomposition

    A = L D L^T = L̃ L̃^T

with the matrix L̃ := L D^{1/2} = (l̃_jk). For computing the Cholesky decomposition it suffices to compute the matrices D and L. This reduces the required work to half of that for the LR decomposition. The algorithm starts from the relation A = L̃ L̃^T, which reads

    [ l̃_11               ] [ l̃_11  · · ·  l̃_n1 ]     [ a_11  · · ·  a_1n ]
    [   ⋮       ⋱         ] [          ⋱     ⋮   ]  =  [   ⋮     ⋱      ⋮  ]
    [ l̃_n1   · · ·  l̃_nn ] [   0          l̃_nn ]     [ a_n1  · · ·  a_nn ]

and yields equations that determine the first column of L̃:

    l̃_11^2 = a_11,      l̃_21 l̃_11 = a_21,   . . . ,   l̃_n1 l̃_11 = a_n1,

from which

    l̃_11 = √a_11,      l̃_j1 = a_j1 / l̃_11 = a_j1 / √a_11,   j = 2, . . . , n,

is obtained. Let now, for some i ∈ {2, . . . , n}, the elements l̃_jk, k = 1, . . . , i − 1, j = k, . . . , n, be already computed. Then, via

    l̃_i1^2 + l̃_i2^2 + . . . + l̃_ii^2 = a_ii,   l̃_ii > 0,
    l̃_j1 l̃_i1 + l̃_j2 l̃_i2 + . . . + l̃_ji l̃_ii = a_ji,

the next elements l̃_ii and l̃_ji, j = i + 1, . . . , n, can be obtained as

    l̃_ii = ( a_ii − l̃_i1^2 − l̃_i2^2 − . . . − l̃_{i,i−1}^2 )^{1/2},
    l̃_ji = ( a_ji − l̃_j1 l̃_i1 − l̃_j2 l̃_i2 − . . . − l̃_{j,i−1} l̃_{i,i−1} ) / l̃_ii,   j = i + 1, . . . , n.
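These formulas translate directly into a short MATLAB function (a sketch that assumes A is symmetric positive definite and performs no checks); the built-in chol(A,'lower') returns the same factor.

    function Lt = cholesky(A)
    % Factor A = Lt*Lt' with Lt lower triangular and positive diagonal.
    n = size(A,1);
    Lt = zeros(n);
    for i = 1:n
        Lt(i,i) = sqrt(A(i,i) - sum(Lt(i,1:i-1).^2));
        for j = i+1:n
            Lt(j,i) = (A(j,i) - Lt(j,1:i-1)*Lt(i,1:i-1)') / Lt(i,i);
        end
    end
    end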

Example 4.4 The matrix

    A = [  4    12  −16 ]
        [ 12    37  −43 ]
        [−16   −43   98 ]

has the uniquely determined Cholesky decomposition A = L D L^T = L̃ L̃^T given as

    A = [  1  0  0 ] [ 4  0  0 ] [ 1  3  −4 ]   [  2  0  0 ] [ 2  6  −8 ]
        [  3  1  0 ] [ 0  1  0 ] [ 0  1   5 ] = [  6  1  0 ] [ 0  1   5 ]
        [ −4  5  1 ] [ 0  0  9 ] [ 0  0   1 ]   [ −8  5  3 ] [ 0  0   3 ].


4.4 Orthogonal decomposition

Let A ∈ R^{m×n} be a not necessarily quadratic matrix and b ∈ R^m a right-hand side vector. We consider the system

    Ax = b

for x ∈ R^n. As discussed in section 3.6, we seek a vector x ∈ R^n with minimal defect norm ‖b − Ax‖_2. A solution is obtained by solving the normal equation

    A^T A x = A^T b.

Theorem 4.2 (QR decomposition) Let A ∈ R^{m×n} be a rectangular matrix with m ≥ n and rank(A) = n. Then there exists a uniquely determined orthonormal matrix Q ∈ R^{m×n} with the property

    Q^T Q = I

and a uniquely determined upper triangular matrix R ∈ R^{n×n} with positive diagonal entries r_ii > 0, i = 1, . . . , n, such that

    A = QR.

In theorem 3.3 we introduced an algorithm which transforms a basis into an orthonormal one. However, in practical applications this algorithm is infeasible because of its instability: due to round-off effects the orthonormality of the columns of Q is quickly lost after only a few orthonormalization steps. A more stable algorithm for this purpose is the Householder transformation described below.

4.4.1 Householder Transformation

Definition 4.2 (Householder transformation) For any normalized vector v ∈ R^m, ‖v‖_2 = 1, the matrix

    S = I − 2 v v^T ∈ R^{m×m}

is called a Householder transformation and the vector v is called the Householder vector. Householder transformations are symmetric, S = S^T, and orthonormal, S^T S = S S^T = I.

For the geometric interpretation of the Householder transformation S, let us consider an arbitrary normalized vector v ∈ R^2, ‖v‖_2 = 1. The two vectors {v, v^⊥} form a basis of R^2, where v^T v^⊥ = 0. For an arbitrary vector u = αv + βv^⊥ ∈ R^2, there holds

    S u = (I − 2 v v^T)(αv + βv^⊥) = αv + βv^⊥ − 2α v (v^T v) − 2β v (v^T v^⊥) = −αv + βv^⊥,

since v^T v = 1 and v^T v^⊥ = 0. Hence S describes a reflection about the axis span{v^⊥}.

Starting from a matrix A ∈ R^{m×n}, the Householder algorithm generates a sequence of matrices

    A := A^{(0)} → · · · → A^{(i)} → · · · → A^{(n)} = R,


where A^{(i)} has upper triangular form in its first i columns,

    A^{(i)} = [ ∗  · · ·  · · ·  · · ·  ∗ ]
              [    ⋱                   ⋮ ]
              [       ∗   ∗   · · ·    ∗ ]
              [       0   ∗   · · ·    ∗ ]
              [       ⋮   ⋮             ⋮ ]
              [       0   ∗   · · ·    ∗ ],

i.e., all entries below the diagonal in the first i columns are already zero.

In the ith step the Householder transformation S_i ∈ R^{m×m} is determined such that

    S_i A^{(i−1)} = A^{(i)}.

After n steps the result is

    R = A^{(n)} = S_n S_{n−1} · · · S_1 A = Q^T A,

where Q := S_1^T · · · S_n^T ∈ R^{m×m}, as a product of unitary matrices, is itself unitary, and R ∈ R^{m×n} has the form

    R = [ r_11  · · ·  r_1n ]
        [        ⋱      ⋮   ]    (n rows)
        [              r_nn ]
        [  0    · · ·   0   ]
        [  ⋮            ⋮   ]    (m − n zero rows)
        [  0    · · ·   0   ].

Therefore we have the representation

    A = S_1^T · · · S_n^T R = Q R.

We obtain the desired QR decomposition of A in the sense of theorem 4.2 by removing the last m − n columns of Q and the last m − n rows of R:

    Q = ( Q̃  ∗ ),      R = [ R̃ ]
                            [ 0 ],        A = Q̃ R̃.

In the following, we describe the elimination process in more detail.

Step 1: As a first step of this process, we construct a Householder matrix S_1 which transforms a_1 (the first column of A) into a multiple of e_1, the first coordinate vector. This results in a vector whose only nonzero entry is the first component. Depending on the sign of a_11,


we choose one of the axes span{a_1 + ‖a_1‖_2 e_1} or span{a_1 − ‖a_1‖_2 e_1} in order to reduce round-off errors. In case a_11 ≥ 0 we choose

    v_1 = ( a_1 + ‖a_1‖_2 e_1 ) / ‖ a_1 + ‖a_1‖_2 e_1 ‖_2 ,      v_1^⊥ = ( a_1 − ‖a_1‖_2 e_1 ) / ‖ a_1 − ‖a_1‖_2 e_1 ‖_2 .

Then the matrix A^{(1)} = (I − 2 v_1 v_1^T) A has the column vectors

    a^{(1)}_1 = (I − 2 v_1 v_1^T) a_1 = −‖a_1‖_2 e_1,
    a^{(1)}_k = (I − 2 v_1 v_1^T) a_k = a_k − 2 (a_k, v_1) v_1,   k = 2, . . . , n.

In contrast to the first step of the Gaussian elimination, the first row of the resulting matrix is changed.

Step i: Let the transformed matrix A^{(i−1)} already be computed. The Householder transformation is given as

    S_i = I − 2 ṽ_i ṽ_i^T = [ I   0                ]          ṽ_i = [ 0   ]  } i − 1 entries
                             [ 0   I − 2 v_i v_i^T  ],               [ v_i ]  } m − i + 1 entries.

The application of the (orthonormal) matrix S_i to A^{(i−1)} leaves the first i − 1 rows and columns of A^{(i−1)} unchanged. For the construction of v_i, we apply the considerations of step 1 to the submatrix

    Ã^{(i−1)} := [ a^{(i−1)}_{ii}   · · ·   a^{(i−1)}_{in} ]
                 [      ⋮             ⋱           ⋮        ]  =  ( ã^{(i−1)}_i , . . . , ã^{(i−1)}_n ),
                 [ a^{(i−1)}_{mi}   · · ·   a^{(i−1)}_{mn} ]

which consists of the last m − i + 1 rows and n − i + 1 columns of A^{(i−1)}. Analogously to step 1, depending on the sign of a^{(i−1)}_{ii} we set

    v_i = ( ã^{(i−1)}_i + ‖ã^{(i−1)}_i‖_2 e_1 ) / ‖ ã^{(i−1)}_i + ‖ã^{(i−1)}_i‖_2 e_1 ‖_2    or
    v_i = ( ã^{(i−1)}_i − ‖ã^{(i−1)}_i‖_2 e_1 ) / ‖ ã^{(i−1)}_i − ‖ã^{(i−1)}_i‖_2 e_1 ‖_2 ,

and the matrix A^{(i)} has the column vectors

    a^{(i)}_k = a^{(i−1)}_k ,   k = 1, . . . , i − 1,

    a^{(i)}_i = ( a^{(i−1)}_{1i}, . . . , a^{(i−1)}_{i−1,i}, ±‖ã^{(i−1)}_i‖_2, 0, . . . , 0 )^T,

    a^{(i)}_k = a^{(i−1)}_k − 2 ( a^{(i−1)}_k , ṽ_i ) ṽ_i ,   k = i + 1, . . . , n.

For a quadratic matrix A ∈ R^{n×n} the cost of a QR decomposition is about twice the cost of an LR decomposition.
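A compact MATLAB sketch of the Householder algorithm for a quadratic matrix (the sign choice follows step 1; MATLAB's built-in qr implements the same idea with more safeguards):

    function [Q, R] = householder_qr(A)
    % QR decomposition of a square matrix A by Householder transformations.
    n = size(A,1);
    Q = eye(n);  R = A;
    for i = 1:n-1
        a = R(i:n, i);                        % first column of the active submatrix
        v = a;
        v(1) = v(1) + sign0(a(1)) * norm(a);  % sign chosen to avoid cancellation
        v = v / norm(v);
        R(i:n, i:n) = R(i:n, i:n) - 2*v*(v' * R(i:n, i:n));  % apply S_i from the left
        Q(:, i:n)   = Q(:, i:n)   - 2*(Q(:, i:n)*v)*v';      % accumulate Q = S_1 S_2 ... S_{n-1}
    end
    end

    function s = sign0(x)
    s = sign(x) + (x == 0);   % like sign, but returns 1 for x = 0
    end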


4.5 Direct determination of eigenvalues

Let us consider a general square matrix A ∈ R^{n×n}. The direct way of computing eigenvalues of A would be to follow the definition of an eigenvalue and to compute the zeros of the characteristic polynomial χ_A(z) = det(zI − A) by a suitable method, e.g., Newton's method. We already discussed that the determination of the zeros of a polynomial, especially in monomial expansion, may be highly ill-conditioned. In general the eigenvalues should therefore not be computed via the characteristic polynomial; this is viable only in special cases where the polynomial does not need to be built up explicitly, as for tri-diagonal or Hessenberg matrices.

    Tri-diagonal matrix                        Hessenberg matrix

    [ a_1  b_1              ]                  [ a_11   a_12   · · ·    a_1n      ]
    [ c_2   ⋱    ⋱          ]                  [ a_21    ⋱      ⋱        ⋮        ]
    [       ⋱    ⋱   b_n−1  ]                  [         ⋱      ⋱     a_{n−1,n}   ]
    [            c_n   a_n  ]                  [  0     · · ·  a_{n,n−1}   a_nn   ]

Reduction methods

Let us recall some properties of similar matrices. Two matrices A, B ∈ C^{n×n} are called similar (A ∼ B) if there holds

    A = T^{-1} B T

with a regular matrix T ∈ C^{n×n}. In view of

    det(A − zI) = det(T^{-1}(B − zI)T) = det(T^{-1}) det(B − zI) det(T) = det(B − zI),

similar matrices have the same characteristic polynomial and therefore the same eigenvalues. For any eigenvalue λ of A with a corresponding eigenvector w there holds

    Aw = T^{-1} B T w = λw,

i.e., Tw is an eigenvector of B corresponding to the same eigenvalue λ. Further, the algebraic and geometric multiplicities of the eigenvalues of similar matrices are the same. A reduction method reduces a given matrix A ∈ C^{n×n} by a sequence of similarity transformations to a simply structured matrix for which the eigenvalue problem is easier to solve. In general, the direct transformation of a given matrix into Jordan normal form (see eq. (3.8)) or into the real Schur decomposition, see theorem 3.5, in finitely many steps is possible only if all its eigenvectors are known a priori. Therefore, one transforms the matrix in finitely many steps into a matrix of simpler structure, e.g., Hessenberg form. The classical method for computing the eigenvalues of a tri-diagonal or Hessenberg matrix is Hyman's method. This method computes the characteristic polynomial of a Hessenberg matrix without explicitly determining the coefficients of its monomial expansion. With Sturm's method, one can then determine the zeros of the characteristic polynomial. For further reading on Hyman's and Sturm's methods, we refer to [4].

5 Iterative Solution Methods

Direct methods for solving linear systems of the form

    Ax = b        (5.1)

with a real square matrix A = (a_ij)_{i,j=1}^n ∈ R^{n×n} and a vector b = (b_j)_{j=1}^n ∈ R^n become infeasible when the system is large, i.e., n ≫ 10^3. The Gaussian elimination method would take a long time to solve a system of that size. In this chapter, we discuss iterative methods which generate a sequence x^1, x^2, x^3, . . . (hopefully) converging to the solution of the linear system.

5.1 Fixed point iteration and defect correction

For the construction of cheap iterative methods for solving linear systems, we rewrite eq. (5.1) in an equivalent form as a fixed-point problem,

    Ax = b  ⇔  Cx = Cx − Ax + b  ⇔  x = (I − C^{-1}A)x + C^{-1}b,

with a suitable regular matrix C ∈ R^{n×n}, which is called a preconditioner. Starting from an initial value x^0, the fixed-point iteration reads

    x^t = (I − C^{-1}A) x^{t−1} + C^{-1} b,   t = 1, 2, . . . .        (5.2)

The matrix B = I − C^{-1}A is called the iteration matrix of the fixed-point iteration. Its properties determine whether the method converges. In practice, such a fixed-point iteration is organized in the form of a defect correction iteration. It essentially requires only one matrix-vector multiplication and the solution of one linear system with the matrix C in each step:

    d^{t−1} = b − A x^{t−1}  (defect),      C δx^t = d^{t−1}  (correction),      x^t = x^{t−1} + δx^t  (update).

Example 5.1 (Richardson method) The simplest method of this type is the (damped) Richardson method. The matrices used are given as

    C = (1/θ) I,      B = I − θA,

for a suitable parameter θ ∈ (0, 2/λ_max(A)]. For a given initial value x^0, the iteration looks like

    x^t = x^{t−1} + θ (b − A x^{t−1}),   t = 1, 2, . . . .
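A minimal MATLAB sketch of the damped Richardson iteration (the test matrix and the choice θ = 1/λ_max(A) are arbitrary illustrations):

    A = [4 1; 1 3];   b = [1; 2];     % small symmetric positive definite test matrix
    theta = 1 / max(eig(A));          % admissible damping parameter
    x = zeros(2,1);
    for t = 1:200
        x = x + theta * (b - A*x);    % defect correction with C = (1/theta) I
    end
    norm(A*x - b)                     % small residual after sufficiently many steps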


A sufficient condition for convergence of the fixed-point iteration eq. (5.2) is given by the contraction property of the corresponding fixed-point mapping g(x) := Bx + C^{-1}b,

    ‖g(x) − g(y)‖ = ‖B(x − y)‖ ≤ ‖B‖ ‖x − y‖,   ‖B‖ < 1,        (5.3)

in some vector norm ‖·‖. By the Banach fixed-point theorem, convergence is guaranteed if the contraction property in eq. (5.3) is fulfilled. This means the convergence criterion depends on the choice of the norm. Therefore, we introduce a norm-independent criterion, the spectral radius:

spr(B) := max{|λ|, λ ∈ σ(B)}.

For any natural matrix norm ‖·‖, there holds

spr(B) ≤ ‖B‖,

for symmetric B, the spectral radius satisfies

    spr(B) = ‖B‖_2 = sup_{x ∈ R^n \ {0}} ‖Bx‖_2 / ‖x‖_2.

Remark 5.1 The spectral radius spr(·) does not define a norm on R^{n×n} since the triangle inequality does not hold in general.

Theorem 5.1 (Fixed-point iteration) The fixed-point iteration in eq. (5.2) converges for any starting value x0 if and only if

ρ := spr(B) < 1.

In case of convergence, the limit is the uniquely determined fixed-point x. The asymptotic convergence behavior with respect to any vector norm ‖·‖ is characterized by

sup_{x0∈Rn} lim sup_{t→∞} ( ‖xt − x‖ / ‖x0 − x‖ )^{1/t} = ρ.

Hence, the number of iteration steps necessary for an asymptotic reduction by a small factor ε > 0 is approximately given by

t(ε) ≈ ln(1/ε) / ln(1/ρ).

5.1.1 Construction of iterative methods

By specification of the preconditioner C, we can construct iterative methods for solving the linear system Ax = b. As already discussed previously, we need that spr(I − C−1A) < 1 for convergence of the method. Moreover, we ask

(i) spr(I − C−1A) to be as small as possible,


(ii) that the correction equation Cδxt = b − Axt−1 is solvable with O(n) arithmetic operations requiring storage space not much exceeding that for the matrix A.

Unfortunately, these requirements contradict each other. The two extreme cases are

C = A ⇒ spr(I − C−1A) = 0

C = (1/θ) I ⇒ spr(I − C−1A) ∼ 1.

The simplest preconditioners are defined using the following natural additive decomposition of the matrix A = L + D + R:

D = diag(a11, . . . , ann) is the diagonal part of A, L is the strictly lower triangular part of A (the entries aij with i > j), and R is the strictly upper triangular part of A (the entries aij with i < j).

Further, we assume that the main diagonal elements of A are nonzero, aii ≠ 0. We present some iterative methods in the following.
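As a small illustration (not from the notes), the splitting can be formed with standard NumPy routines; in practice A would be stored in a sparse format, but the dense version shows the structure.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
D = np.diag(np.diag(A))      # diagonal part
L = np.tril(A, k=-1)         # strictly lower triangular part
R = np.triu(A, k=1)          # strictly upper triangular part
assert np.allclose(A, L + D + R)
```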

(i) Jacobi method

C = D, B = −D−1(L+R) =: J (Jacobi iteration matrix).

The iteration of the Jacobi method reads

Dxt = b− (L+R)xt−1, t = 1, 2, . . . ,

or written componentwise

a_ii x_i^t = b_i − ∑_{j=1, j≠i}^n a_ij x_j^{t−1},   i = 1, . . . , n.

(ii) Gauss-Seidel method

C = D + L, B = −(D + L)−1R =: H1 (Gauss-Seidel iteration matrix).

The iteration of the Gauss-Seidel method reads as follows

(D + L)xt = b−Rxt−1, t = 1, 2, . . . .

Writing the iteration componentwise

a_ii x_i^t = b_i − ∑_{j<i} a_ij x_j^t − ∑_{j>i} a_ij x_j^{t−1},   i = 1, . . . , n,

we observe that the Jacobi and Gauss-Seidel methods have exactly the same arithmetical complexity per iteration step and require the same amount of storage.


Remark 5.2 Since the Gauss-Seidel method uses a better approximation of the matrix A as a preconditioner, it is expected to have an iteration matrix with smaller spectral radius, i.e. it converges faster.

(iii) SOR method (Successive overrelaxation) For ω ∈ (0, 2), we define

C = (1/ω)(D + ωL),   B = −(D + ωL)−1 ((ω − 1)D + ωR) .

The SOR method is designed to accelerate the Gauss-Seidel method by introducing a relaxation parameter ω ∈ R which can be optimized in order to minimize the spectral radius of the corresponding iteration matrix. Its iteration reads as follows

(D + ωL)xt = ωb− ((ω − 1)D + ωR)xt−1, t = 1, 2, . . . .

The algorithmic complexity is about that of the Jacobi and Gauss-Seidel methods, but the parameter ω can be optimized for a certain class of matrices, resulting in significantly faster convergence than that of the other two simple methods. A small code sketch of one step of each of the three methods is given below.
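The following sketch (hypothetical Python/NumPy code, not from the notes) performs one step of each of the three methods in the matrix-splitting form given above; for brevity the systems with C are solved with a dense solver, whereas a real implementation would use forward substitution and sparse storage.

```python
import numpy as np

def split(A):
    """Additive splitting A = L + D + R."""
    return np.tril(A, k=-1), np.diag(np.diag(A)), np.triu(A, k=1)

def jacobi_step(A, b, x):
    L, D, R = split(A)
    # D x^t = b - (L + R) x^{t-1}
    return np.linalg.solve(D, b - (L + R) @ x)

def gauss_seidel_step(A, b, x):
    L, D, R = split(A)
    # (D + L) x^t = b - R x^{t-1}
    return np.linalg.solve(D + L, b - R @ x)

def sor_step(A, b, x, omega):
    L, D, R = split(A)
    # (D + omega L) x^t = omega b - ((omega - 1) D + omega R) x^{t-1}
    return np.linalg.solve(D + omega * L,
                           omega * b - ((omega - 1.0) * D + omega * R) @ x)

# Strictly diagonally dominant test matrix (theorem 5.2 below applies).
A = np.array([[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = np.zeros(3)
for t in range(60):
    x = sor_step(A, b, x, omega=1.05)
print(np.linalg.norm(A @ x - b))   # defect should be close to zero
```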

Convergence of the Jacobi and Gauss-Seidel methods

We do not give a complete analysis of the convergence of the Jacobi and Gauss-Seidel methods. For further reading, we refer to [4].

Theorem 5.2 (Strong row sum criterion) If the row sums or the column sums of the matrix A ∈ Rn×n satisfy the condition

∑_{k=1, k≠j}^n |a_jk| < |a_jj|   or   ∑_{k=1, k≠j}^n |a_kj| < |a_jj|,   j = 1, . . . , n,   (5.4)

then spr(J) < 1 and spr(H1) < 1, i.e., the Jacobi and Gauss-Seidel methods converge.

The property in eq. (5.4) is called strict diagonal dominance. The strict diagonal dominance of A or AT required in theorem 5.2 is too restrictive a criterion for the needs of many applications. In most cases only simple diagonal dominance is given.

Convergence of the SOR method

The SOR method can be interpreted as a combination of the Gauss-Seidel method and an extra relaxation step. Starting from a standard Gauss-Seidel step in the tth iteration,

x̃_j^t = (1/a_jj) ( b_j − ∑_{k<j} a_jk x_k^t − ∑_{k>j} a_jk x_k^{t−1} ),

the next iterate x_j^t is generated as a convex combination (relaxation) of the form

x_j^t = ω x̃_j^t + (1 − ω) x_j^{t−1},


with a parameter ω ∈ (0, 2). For ω = 1 this is just the Gauss-Seidel iteration. For ω < 1 we speak of underrelaxation and for ω > 1 of overrelaxation. The iteration matrix of the SOR method is obtained from the relation

xt = ωD−1(b− Lxt −Rxt−1) + (1− ω)xt−1

as

Hω = −(D + ωL)−1((ω − 1)D + ωR).

Hence, the iteration reads

xt = Hωxt−1 + ω(D + ωL)−1b,

or in componentwise notation

x_i^t = (1 − ω) x_i^{t−1} + (ω/a_ii) ( b_i − ∑_{j<i} a_ij x_j^t − ∑_{j>i} a_ij x_j^{t−1} ),   i = 1, . . . , n.

The following lemma shows that the relaxation parameter has to be in the range (0, 2) to guarantee convergence.

Lemma 5.1 (Relaxation) For an arbitrary matrix A ∈ Rn×n with regular D there holds

spr(Hω) ≥ |ω − 1|, ω ∈ R.

For positive definite matrices, we have the following result.

Theorem 5.3 (Theorem of Ostrowski-Reich) For a positive definite matrix A ∈ Rn×n there holds

spr(Hω) < 1, for 0 < ω < 2.

For the derivation of an optimal relaxation parameter ωopt, we refer to [3] and [4].

5.2 Descent methods

We consider a class of iterative methods which are especially designed for linear systems with symmetric and positive definite matrices A, but can be extended to more general situations. In the following, we use the abbreviations (·, ·) := (·, ·)2 and ‖·‖ := ‖·‖2 for the Euclidean scalar product and norm.

Let A ∈ Rn×n be a symmetric and positive definite (and hence regular) matrix

(Ax, y) = (x, Ay),  x, y ∈ Rn,   (Ax, x) > 0,  x ∈ Rn \ {0}.

In this way, we can generate a scalar product and a norm based on the matrix A, called A-scalar product and A-norm, respectively

(x, y)A := (Ax, y),   ‖x‖A := (Ax, x)^{1/2},   x, y ∈ Rn.


Accordingly, vectors with the property (x, y)A = 0 are called A-orthogonal. The positive definite matrix A has important properties. Its eigenvalues are real and positive, 0 < λ := λ1 ≤ . . . ≤ λn =: Λ, and there exists an ONB of eigenvectors {w1, . . . , wn}. For its spectral radius and spectral condition number, there holds

spr(A) = Λ,   cond2(A) = Λ/λ.

The basis for the descent methods discussed below is provided by the following theorem, which characterizes the solution of the linear system Ax = b as the minimum of a quadratic functional.

Theorem 5.4 (Minimization property) Let the matrix A ∈ Rn×n be symmetric and positive definite. The uniquely determined solution of the linear system Ax = b is characterized by the property

Q(x) < Q(y) for all y ∈ Rn \ {x},   Q(y) := (1/2)(Ay, y)2 − (b, y)2.

Remark 5.3 We note that the gradient of Q in a point y ∈ Rn is given by

grad Q(y) = (1/2)(A + AT)y − b = Ay − b.

This coincides, up to the sign, with the defect of the point y with respect to the equation Ax = b.

Starting from some initial point x0 ∈ Rn, descent methods determine a sequence of iterates xt, t ≥ 1, by the prescription

xt+1 = xt + αt rt,   Q(xt+1) = min_{α∈R} Q(xt + αrt).

The descent directions rt are a priori determined or adaptively chosen during the iteration. The procedure of choosing the step length αt is called line search. In view of

d/dαt Q(xt + αt rt) = grad Q(xt + αt rt) · rt = (Axt − b, rt) + αt(Art, rt),

we obtain

αt = − (gt, rt)/(Art, rt),   gt = Axt − b = grad Q(xt).

General descent algorithm

Starting value x0 ∈ Rn, g0 := Ax0 − b.
Iterate for t ≥ 0:
  Determine descent direction rt.
  Compute step length αt = −(gt, rt)/(Art, rt).
  Perform descent step xt+1 = xt + αt rt.
  Calculate gradient gt+1 = gt + αt Art.

The minimization of the functional Q(·) is equivalent to the minimization of the defect norm ‖Ay − b‖A−1 or the error norm ‖y − x‖A since there holds

2Q(y) = (Ay, y)2 − 2(b, y)2 = ‖Ay − b‖_{A−1}^2 − ‖b‖_{A−1}^2 = ‖y − x‖_A^2 − ‖x‖_A^2.


The various descent methods differ by the choice of the descent directions rt. A very simple strategy uses the cartesian coordinate vectors {e1, . . . , en} in a cyclic way. This method is called coordinate relaxation; it is in a certain sense equivalent to the Gauss-Seidel method and therefore is very slow.

5.2.1 Gradient method

A natural choice for the rt are the directions of steepest descent of Q(·) at the points xt

rt = − gradQ(xt) = −gt.

Algorithm for the gradient method

Starting value x0 ∈ Rn, g0 := Ax0 − b.
Iterate for t ≥ 0:
  Determine descent direction rt = −gt.
  Compute step length αt = ‖gt‖^2/(Agt, gt).
  Perform descent step xt+1 = xt − αt gt.
  Calculate gradient gt+1 = gt − αt Agt.

In case that (Agt, gt) = 0 for some t ≥ 0, there must hold gt = 0, which means that Axt = b and the iteration can terminate.
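A minimal NumPy sketch of the gradient method as listed above (not part of the notes; it assumes A is symmetric positive definite and stops once the gradient is small):

```python
import numpy as np

def gradient_method(A, b, x0, maxiter=10000, tol=1e-10):
    """Steepest descent for SPD A: r^t = -g^t, alpha_t = ||g^t||^2 / (A g^t, g^t)."""
    x = x0.copy()
    g = A @ x - b                       # gradient, i.e. defect with opposite sign
    for t in range(maxiter):
        if np.linalg.norm(g) < tol:
            break
        Ag = A @ g
        alpha = (g @ g) / (g @ Ag)      # exact line search
        x = x - alpha * g               # descent step in direction -g
        g = g - alpha * Ag              # updated gradient
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # small SPD example
b = np.array([1.0, 1.0])
x = gradient_method(A, b, np.zeros(2))
print(np.linalg.norm(A @ x - b))
```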

Theorem 5.5 (Convergence and error of the gradient method) For a symmetric positive definite matrix A ∈ Rn×n the gradient method converges for any starting point x0 ∈ Rn to the solution of the linear system Ax = b. The following error estimate holds

‖xt − x‖A ≤ ( (1 − 1/κ) / (1 + 1/κ) )^t ‖x0 − x‖A,   t ∈ N,

with the spectral condition number κ = cond2(A) = Λ/λ of A. For reducing the initial error by a factor ε

t(ε) ≈ (1/2) κ ln(1/ε)

iterations are required.

The relation

(gt+1, gt) = (gt − αt Agt, gt) = ‖gt‖^2 − αt(Agt, gt) = 0

shows that the descent directions rt = −gt used in consecutive steps of the gradient method are orthogonal to each other, while gt+2 may be far away from being orthogonal to gt. This leads to strong oscillations in the convergence behaviour of the gradient method, especially for matrices A with large condition number, i.e. λ ≪ Λ.


5.2.2 Conjugate gradient method (cg method)

The gradient method utilizes the particular structure of the functional Q(·) only locally, from one iterate xt to the next xt+1. It seems more appropriate to use the information about the global structure of Q(·) which was already obtained to determine the next descent direction. This means the descent directions are mutually orthogonal. The cg method generates a sequence of descent directions which are mutually A-orthogonal, i.e. orthogonal with respect to the scalar product (·, ·)A.

We start from a set of linearly independent vectors di

Bt := span{d0, . . . , dt−1}.

The iterates

xt = x0 + ∑_{i=0}^{t−1} αi di ∈ x0 + Bt,

are determined such that

Q(xt) = min_{y∈x0+Bt} Q(y)   ⇔   ‖Axt − b‖A−1 = min_{y∈x0+Bt} ‖Ay − b‖A−1.

Setting the derivatives of Q(·) with respect to the αi to zero, we see that this is equivalent to solving the Galerkin equations

(Axt − b, dj) = 0, j = 0, . . . , t− 1,

or in compact form Axt − b = gt ⊥ Bt. Inserting xt into this orthogonality condition, we obtain a regular linear system

∑_{i=0}^{t−1} αi (Adi, dj) = (b, dj) − (Ax0, dj),   j = 0, . . . , t − 1,

for the coefficients αi, i = 0, . . . , t − 1. Choices for the spaces Bt are the Krylov spaces

Bt = Kt(d0, A) := span{d0, Ad0, . . . , At−1d0},

with some vector d0, e.g. the (negative) initial defect d0 = b − Ax0 of an arbitrary vector x0.

Algorithm for the cg method

Starting value x0 ∈ Rn, d0 = −g0 = b − Ax0.
Iterate for t ≥ 0:
  Compute step length αt = ‖gt‖^2/(dt, Adt).
  Perform descent step xt+1 = xt + αt dt.
  Calculate gradient gt+1 = gt + αt Adt.
  Determine descent direction dt+1 = −gt+1 + (‖gt+1‖^2/‖gt‖^2) dt.
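A corresponding sketch of the cg algorithm (again hypothetical Python/NumPy code, not from the notes, assuming A symmetric positive definite); it follows the update formulas above literally, with a tolerance-based stopping test instead of the exact termination after n steps:

```python
import numpy as np

def cg(A, b, x0, maxiter=None, tol=1e-12):
    """Conjugate gradient sketch following the algorithm above (SPD A assumed)."""
    x = x0.copy()
    g = A @ x - b                      # gradient g^0
    d = -g                             # first descent direction d^0
    maxiter = maxiter or len(b)
    for t in range(maxiter):
        gg = g @ g
        if np.sqrt(gg) < tol:
            break
        Ad = A @ d
        alpha = gg / (d @ Ad)          # step length
        x = x + alpha * d              # descent step
        g = g + alpha * Ad             # new gradient
        d = -g + (g @ g) / gg * d      # new A-orthogonal direction
    return x

# In exact arithmetic cg terminates after at most n steps (theorem 5.6 below).
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = cg(A, b, np.zeros(3))
print(np.linalg.norm(A @ x - b))
```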

The cg method generates by construction a sequence of descent directions dt which are automatically A-orthogonal. This implies that the vectors {d0, . . . , dn−1} are linearly independent and therefore span{d0, . . . , dn−1} = Rn.

Theorem 5.6 (Conjugate gradient method) Let the matrix A ∈ Rn×n be symmetric and positive definite. If we assume exact arithmetic, the cg method terminates for any starting vector x0 ∈ Rn after at most n steps at xn = x. In each step there holds

Q(xt) = min_{y∈x0+Bt} Q(y),

and equivalently

‖xt − x‖A = ‖Axt − b‖A−1 = min_{y∈x0+Bt} ‖Ay − b‖A−1 = min_{y∈x0+Bt} ‖y − x‖A,

where Bt := span{d0, . . . , dt−1}. There holds the error estimate

‖xt − x‖A ≤ 2 ( (1 − 1/√κ) / (1 + 1/√κ) )^t ‖x0 − x‖A,   t ∈ N,

with the spectral condition number κ = cond2(A) = Λ/λ of A. For reducing the initial error by a factor ε

t(ε) ≈ (1/2) √κ ln(2/ε)

iterations are required.

Remark 5.4 In view of the result of theorem 5.6, the cg method formally belongs to the class of direct methods. In practice, however, it is used like an iterative method, since

(i) the descent directions dt are not exactly A-orthogonal due to round-off errors such that the iteration does not terminate.

(ii) accurate approximations for large matrices are already obtained after t ≪ n iterations.

Since κ = condnat(A) > 1, we have √κ < κ. Observing that the function f(λ) = (1 − 1/λ)/(1 + 1/λ) is strictly monotonically increasing for λ > 0 since f'(λ) > 0, there holds

(1 − 1/√κ)/(1 + 1/√κ) < (1 − 1/κ)/(1 + 1/κ).

This shows that the cg method should converge faster than the gradient method. In practice this is actually the case. Both methods converge faster the smaller the condition number is. In case Λ ≫ λ, which is often the case in practice, also the cg method is slow. An acceleration can be achieved. For further reading about acceleration using preconditioning and generalized cg methods, we refer to [4].


6 Iterative Methods for Eigenvalue Problems

6.1 Methods for the partial eigenvalue problem

In the following, we discuss iterative methods for solving the partial eigenvalue problem of a given matrix A ∈ Cn×n.

6.1.1 The Power method

Definition 6.1 Suppose A ∈ Cn×n is diagonalizable, that is X−1AX = diag(λ1, . . . , λn) with X = (x1, . . . , xn), and |λn| > |λn−1| ≥ · · · ≥ |λ1|. Given a starting value z0 ∈ Cn with ‖z0‖2 = 1, the power method generates a sequence of vectors zt ∈ Cn, t = 1, 2, . . . , by

z̃t = Azt−1,   zt := z̃t / ‖z̃t‖2.

The corresponding eigenvalue approximation is given by

λt := (Azt)_r / (zt)_r,   r ∈ {1, . . . , n} :  |(zt)_r| = max_{j=1,...,n} |(zt)_j|.

Theorem 6.1 (Power method) Let the matrix A be diagonalizable and assume that the eigenvalue with the largest modulus is separated from the other eigenvalues, i.e., |λn| > |λi|, i = 1, . . . , n − 1. Let us assume that the eigenvectors {w1, . . . , wn} are associated to the eigenvalues ordered according to their modulus. Furthermore, let the starting vector z0 have a nontrivial component with respect to the eigenvector wn

z0 = ∑_{i=1}^n αi wi,   αn ≠ 0.

Then, there exist numbers σt ∈ C, |σt| = 1 such that

‖zt − σtwn‖2 → 0 (t→∞),

and the maximum eigenvalue λmax = λn is approximated with the convergence speed

λt = λmax + O( |λn−1/λmax|^t )   (t→∞).


For hermitian matrices, the eigenvalues can be obtained through an improved approximation using the Rayleigh quotient

λt := (Azt, zt)2, ‖zt‖2 = 1.

In this case {x1, . . . , xn} = {w1, . . . , wn} can be chosen as an ONB of eigenvectors such that there holds

λt = (A^{t+1}z0, A^t z0)2 / ‖A^t z0‖2^2 = λmax + O( |λn−1/λmax|^{2t} ).

Here, the convergence of the eigenvalue approximation is twice as fast as in the non-hermitian case.
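A compact sketch of the power method of definition 6.1 in Python/NumPy (not from the notes); the example matrix is symmetric, so the Rayleigh quotient of the final iterate gives the improved approximation discussed above:

```python
import numpy as np

def power_method(A, z0, maxiter=1000, tol=1e-10):
    """Power method: returns approximations of the dominant eigenpair."""
    z = z0 / np.linalg.norm(z0)
    lam = 0.0
    for t in range(maxiter):
        w = A @ z                        # z~^t = A z^{t-1}
        z_new = w / np.linalg.norm(w)    # normalization
        r = np.argmax(np.abs(z_new))     # component of largest modulus
        lam_new = (A @ z_new)[r] / z_new[r]
        if abs(lam_new - lam) < tol:
            return lam_new, z_new
        lam, z = lam_new, z_new
    return lam, z

A = np.array([[2.0, 1.0], [1.0, 3.0]])           # symmetric example
lam, z = power_method(A, np.array([1.0, 0.0]))
print(lam, (A @ z) @ z)                          # component quotient vs Rayleigh quotient
```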

The convergence of the power method is better the more clearly the eigenvalue with largest modulus is separated from the other eigenvalues. The power method is only of limited value as its convergence is very slow in general if |λn−1/λn| ≈ 1. Moreover, it delivers only the largest eigenvalue. In most practical applications, the eigenvalue with the smallest modulus is wanted. This is accomplished by the inverse iteration. Here, it is assumed that a good approximation λ for an eigenvalue λk of the matrix A to be computed is already known (obtained by other methods, theorem 3.6 of Gerschgorin-Hadamard, etc.) such that

|λk − λ| ≪ |λi − λ|,   i = 1, . . . , n, i ≠ k.

In case λ ≠ λk the matrix (A − λI)−1 has the eigenvalues µi = 1/(λi − λ), i = 1, . . . , n, and there holds

|µk| = |1/(λk − λ)| ≫ |1/(λi − λ)| = |µi|,   i = 1, . . . , n, i ≠ k.

6.1.2 The inverse iteration

Definition 6.2 The application of the power method to the matrix (A − λI)−1 is called inverse iteration. The shift λ is taken as an approximation to the desired eigenvalue λk. Starting from the initial point z0 the method generates iterates zt as solutions of the linear systems

(A − λI)z̃t = zt−1,   zt = z̃t/‖z̃t‖2,   t = 1, 2, . . . .

The corresponding eigenvalue approximation is determined by

µt := ((A − λI)−1 zt)_r / (zt)_r,   r ∈ {1, . . . , n} : |(zt)_r| = max_{j=1,...,n} |(zt)_j|,

or in the hermitian case, by the Rayleigh quotient

µt := ((A− λI)−1zt, zt)2, ‖zt‖2 = 1.


In the evaluation of the eigenvalue, the vector (A − λI)−1 zt is needed. This can be avoided by using the formulas

λt := (Azt)_r / (zt)_r,   or in the hermitian case λt := (Azt, zt)2,  ‖zt‖2 = 1,

instead. This is justified since zt is supposed to be an approximation to an eigenvector of (A − λI)−1 corresponding to the eigenvalue µk, which is also an eigenvector of A corresponding to the eigenvalue λk of A.
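A minimal sketch of the inverse iteration with a fixed shift (hypothetical Python/NumPy code, not from the notes; the example uses a real symmetric matrix, so the Rayleigh quotient of A can be used for the eigenvalue approximation as suggested above):

```python
import numpy as np

def inverse_iteration(A, shift, z0, maxiter=200, tol=1e-12):
    """Inverse iteration with fixed shift (sketch).

    In practice one would factor A - shift*I once and reuse the factorization;
    here np.linalg.solve is called in every step for brevity.
    """
    n = A.shape[0]
    M = A - shift * np.eye(n)
    z = z0 / np.linalg.norm(z0)
    lam = shift
    for t in range(maxiter):
        w = np.linalg.solve(M, z)        # (A - shift I) z~^t = z^{t-1}
        z = w / np.linalg.norm(w)
        lam_new = (A @ z) @ z            # Rayleigh quotient lambda_t = (A z, z)
        if abs(lam_new - lam) < tol:
            return lam_new, z
        lam = lam_new
    return lam, z

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, z = inverse_iteration(A, shift=1.0, z0=np.array([1.0, 1.0]))
print(lam)   # approximates the eigenvalue of A closest to the shift 1.0
```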

Theorem 6.2 (Inverse iteration) Let the matrix A be diagonalizable and assume that the eigenvalue with smallest modulus is separated from the other eigenvalues, i.e., |λ1| < |λi|, i = 2, . . . , n. Moreover, let the starting vector z0 have a nontrivial component with respect to the eigenvector w1

z0 = ∑_{i=1}^n αi wi,   α1 ≠ 0.

With shift λ = 0, there exist numbers σt ∈ C, |σt| = 1 such that

‖zt − σtw1‖2 → 0 (t→∞),

and the eigenvalue with the smallest modulus λmin = λ1 of A is in general approximated with the convergence speed

λt = λmin + O( |λmin/λ2|^t )

and in the hermitian case twice as fast (with power 2t).

The inverse iteration allows the approximation of any eigenvalue of the matrix A for which a sufficiently good approximation is already known, which depends on the separation of the desired eigenvalue of A from the others. The price to be paid for this flexibility is that each iteration step requires the solution of a linear system. The matrix A − λI is very ill-conditioned with condition number (λ ≈ λk)

cond2(A − λI) = |λmax(A − λI)| / |λmin(A − λI)| = max_{j=1,...,n} |λj − λ| / |λk − λ| ≫ 1.

For methods for the full eigenvalue problem, we refer to [4].

6.2 Krylov space methods

Krylov methods for solving eigenvalue problems follow essentially the same idea as in the case of the solution of linear systems. The original high-dimensional problem is reduced to smaller dimension by applying the Galerkin approximation to subspaces, e.g. the Krylov spaces, which are successively constructed using the given matrix and sometimes also its transpose. In the following, we will focus on the Arnoldi method for general, not necessarily hermitian matrices and its specialization for hermitian matrices, the Lanczos method.


6.2.1 Reduction by Galerkin approximation

First, we introduce the general concept of a model reduction by Galerkin approximation. Consider a general eigenvalue problem

Az = λz

with a high-dimensional matrix A ∈ Cn×n, n ≥ 10^4, which may have resulted from the discretization of the eigenvalue problem of a partial differential operator. This eigenvalue problem can equivalently be written in variational form as

z ∈ Cn, λ ∈ C : (Az, y)2 = λ(z, y)2 for all y ∈ Cn. (6.1)

Let Km = span{q1, . . . , qm} be an appropriately chosen subspace of Cn of smaller dimension dim(Km) = m ≪ n. Then, the n-dimensional eigenvalue problem in eq. (6.1) is approximated by the m-dimensional Galerkin eigenvalue problem

z ∈ Km, λ ∈ C : (Az, y)2 = λ(z, y)2 for all y ∈ Km.

Expanding the eigenvector z ∈ Km with respect to the given basis

z = ∑_{j=1}^m αj qj,

the Galerkin system takes the form

∑_{j=1}^m αj (Aqj, qi)2 = λ ∑_{j=1}^m αj (qj, qi)2,   i = 1, . . . , m.

Within the framework of Galerkin approximation this is usually written in compact form as a generalized eigenvalue problem

Aα = λMα,

for the vector α = (αj)_{j=1}^m, involving the matrices A = ((Aqj, qi)2)_{i,j=1}^m and M = ((qj, qi)2)_{i,j=1}^m. With the cartesian representation of the basis vectors qi = (q^i_j)_{j=1}^n the Galerkin eigenvalue problem reads

∑_{j=1}^m αj ∑_{k,l=1}^n a_kl q^j_k q^i_l = λ ∑_{j=1}^m αj ∑_{k=1}^n q^j_k q^i_k,   i = 1, . . . , m.

Using the matrix Q(m) := (q1, . . . , qm) ∈ Cn×m and the vector α = (αj)_{j=1}^m ∈ Cm, this can be written in compact form as

Q(m)TAQ(m)α = λQ(m)TQ(m)α.


If {q1, . . . , qm} were an ONB of Km this reduces to the normal eigenvalue problem

Q(m)TAQ(m)α = λα (6.2)

of the reduced matrix H(m) := Q(m)TAQ(m) ∈ Cm×m.

If the reduced matrix H(m) has a particular structure, e.g., a Hessenberg form or a symmetric tridiagonal matrix, then the lower-dimensional eigenvalue problem eq. (6.2) can be efficiently solved. Its eigenvalues may be considered as approximations to some of the dominant eigenvalues of the original matrix A. They are called Ritz eigenvalues of A.

In view of this preliminary consideration, the Krylov methods consist of the following steps:

(i) Choose an appropriate subspace Km ⊂ Cn, m ≪ n, a Krylov space using the matrix A and powers of it.

(ii) Construct an ONB {q1, . . . , qm} of Km and set Q(m) = (q1, . . . , qm).

(iii) Form the matrix H(m) := Q(m)TAQ(m) which by construction is a Hessenberg matrix, or in the hermitian case, a hermitian tridiagonal matrix.

(iv) Solve the eigenvalue problem of the reduced matrix H(m) ∈ Cm×m by the QR method.

(v) Take the eigenvalues of H(m) as approximations to the dominant (e.g. the largest) eigenvalues of A. If the smallest eigenvalues (e.g. those closest to the origin) are to be determined, the whole process has to be applied to the inverse matrix A−1, which possibly makes the construction of the subspaces Km expensive.

Remark 6.1 In the above form, the Krylov method for the eigenvalue problem is analogous to its version for (real) linear systems. Starting from the variational form of the linear system

x ∈ Rn : (Ax, y)2 = (b, y)2 for all y ∈ Rn, (6.3)

we obtain the following reduced system for xm = ∑_{j=1}^m αj qj

∑_{j=1}^m αj ∑_{k,l=1}^n a_kl q^j_k q^i_l = ∑_{k=1}^n b_k q^i_k,   i = 1, . . . , m.

This is equivalent to the m-dimensional algebraic system

Q(m)TAQ(m)α = Q(m)T b.


6.2.2 Lanczos and Arnoldi method

The power method for computing the largest eigenvalue of a matrix only uses the current iterate Amq, m ≪ n, for some normalized starting vector q ∈ Cn, ‖q‖2 = 1, but it ignores the information contained in the already obtained iterates {q, Aq, . . . , Am−1q}. This suggests forming the Krylov matrix

Km = (q, Aq,A2q, . . . , Am−1q), 1 ≤ m ≤ n.

The columns of this matrix are not orthogonal. In fact, since Atq converges to the direction of the eigenvector corresponding to the largest (in modulus) eigenvalue of A, this matrix tends to be badly conditioned with increasing dimension m. Therefore, one constructs an orthogonal basis by the Gram-Schmidt algorithm. This basis is expected to yield good approximations of the eigenvectors corresponding to the m largest eigenvalues, for the same reason that Am−1q approximates the dominant eigenvector. However, in this simplistic form the method is unstable due to the instability of the standard Gram-Schmidt algorithm. The Arnoldi method instead uses a stabilized version of the Gram-Schmidt algorithm to produce a sequence of orthonormal vectors {q1, q2, q3, . . .} called the Arnoldi vectors such that for every m, the vectors {q1, . . . , qm} span the Krylov subspace Km. In the following, we make use of the projection operator defined as

proj_u(v) := ( (v, u)2 / ‖u‖2^2 ) u,

which projects the vector v onto span{u}. With this notation the classical Gram-Schmidt orthonormalization process uses the recurrence formulas

q1 = q/‖q‖2,   t = 2, . . . , m :   q̃t = At−1q − ∑_{j=1}^{t−1} projqj(At−1q),   qt = q̃t/‖q̃t‖2.

Here, the tth step projects out the components of At−1q in the directions of the already determined orthonormal vectors {q1, . . . , qt−1}. This algorithm is numerically unstable due to round-off error accumulation. There is a simple modification where the tth step projects out the components of Aqt−1 in the directions of {q1, . . . , qt−1}. This is called the modified Gram-Schmidt algorithm.

q1 = q/‖q‖2,   t = 2, . . . , m :   q̃t = Aqt−1 − ∑_{j=1}^{t−1} projqj(Aqt−1),   qt = q̃t/‖q̃t‖2.

Since q̃t and qt are aligned and qt ⊥ Kt−1, we have

(q̃t, qt)2 = ‖q̃t‖2 = (qt, Aqt−1 − ∑_{j=1}^{t−1} projqj(Aqt−1))2 = (qt, Aqt−1)2.


In practice the algorithm is implemented in the following equivalent recursive form

q1 = q/‖q‖2,   t = 2, . . . , m :

  q̃t,1 = Aqt−1,
  j = 1, . . . , t − 1 :   q̃t,j+1 = q̃t,j − projqj(q̃t,j),
  qt = q̃t,t/‖q̃t,t‖2.

The modified algorithm gives the same result as the original one in exact arithmetic but introduces smaller errors in finite precision arithmetic.
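The difference between the two variants can be observed numerically. The following sketch (hypothetical Python/NumPy code, not from the notes) orthonormalizes the columns of a badly conditioned test matrix with both variants and prints the orthogonality defect ‖QᵀQ − I‖, which is typically visibly larger for the classical variant:

```python
import numpy as np

def classical_gs(V):
    """Orthonormalize the columns of V with the classical Gram-Schmidt process."""
    Q = np.zeros_like(V)
    for t in range(V.shape[1]):
        # project the ORIGINAL column against all previously computed q_j at once
        v = V[:, t] - Q[:, :t] @ (Q[:, :t].T @ V[:, t])
        Q[:, t] = v / np.linalg.norm(v)
    return Q

def modified_gs(V):
    """Orthonormalize the columns of V with the modified Gram-Schmidt process."""
    Q = V.astype(float).copy()
    for t in range(V.shape[1]):
        for j in range(t):
            # project the CURRENT, already updated vector against q_j
            Q[:, t] -= (Q[:, t] @ Q[:, j]) * Q[:, j]
        Q[:, t] /= np.linalg.norm(Q[:, t])
    return Q

# Badly conditioned test matrix: the 10 x 10 Hilbert matrix (cond ~ 1e13),
# standing in for an ill-conditioned Krylov matrix.
i = np.arange(1, 11, dtype=float)
V = 1.0 / (i[:, None] + i[None, :] - 1.0)
for gs in (classical_gs, modified_gs):
    Q = gs(V)
    print(gs.__name__, np.linalg.norm(Q.T @ Q - np.eye(V.shape[1])))
```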

Definition 6.3 (Arnoldi Algorithm) For a general matrix A ∈ Cn×n the Arnoldi method determines a sequence of orthonormal vectors qt ∈ Cn, 1 ≤ t ≤ m ≪ n, which is called the Arnoldi basis. These vectors are delivered by the modified Gram-Schmidt method applied to the basis {q, Aq, . . . , Am−1q} of the Krylov space Km.

Starting vector: q1 = q/‖q‖2.
Iterate for 2 ≤ t ≤ m:
  q̃t,1 = Aqt−1,
  j = 1, . . . , t − 1 :   hj,t−1 = (q̃t,j, qj)2,   q̃t,j+1 = q̃t,j − hj,t−1 qj,
  ht,t−1 = ‖q̃t,t‖2,   qt = q̃t,t/ht,t−1.

Let Q(m) denote the n×m matrix formed by the first m Arnoldi vectors {q1, q2, . . . , qm}, and let H(m) be the (upper Hessenberg) m×m matrix formed by the numbers hjk

Q(m) = (q1, q2, . . . , qm),   H(m) = (hjk)_{j,k=1}^m with hjk = 0 for j > k + 1, i.e.

H(m) =
( h11   h12   h13   · · ·   h1m  )
( h21   h22   h23   · · ·   h2m  )
(  0    h32   h33   · · ·   h3m  )
(  ⋮     ⋱     ⋱     ⋱      ⋮   )
(  0    · · ·   0   hm,m−1  hmm  ).

The matrices Q(m) are orthonormal and there holds the Arnoldi relation

AQ(m) = Q(m)H(m) + hm+1,m(0, . . . , 0, qm+1).

Multiplying by Q(m)T from the left and observing that Q(m)TQ(m) = I and Q(m)T qm+1 = 0, we obtain that

H(m) = Q(m)TAQ(m).
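A compact sketch of the Arnoldi iteration with the modified Gram-Schmidt inner loop (hypothetical Python/NumPy code, not from the notes, written with 0-based indices); the Ritz eigenvalues of H(m) are compared with the dominant eigenvalues of A:

```python
import numpy as np

def arnoldi(A, q, m):
    """Arnoldi iteration (0-based indices): returns Q (n x (m+1)) and
    H ((m+1) x m) with A @ Q[:, :m] = Q @ H, H upper Hessenberg."""
    n = A.shape[0]
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    Q[:, 0] = q / np.linalg.norm(q)
    for t in range(m):
        v = A @ Q[:, t]                     # candidate vector A q_t
        for j in range(t + 1):              # modified Gram-Schmidt sweep
            H[j, t] = v @ Q[:, j]
            v -= H[j, t] * Q[:, j]
        H[t + 1, t] = np.linalg.norm(v)
        if H[t + 1, t] == 0.0:              # invariant subspace found
            return Q[:, :t + 1], H[:t + 2, :t + 1]
        Q[:, t + 1] = v / H[t + 1, t]
    return Q, H

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))
A = A + A.T                                 # symmetric test matrix (Lanczos case)
m = 30
Q, H = arnoldi(A, rng.standard_normal(200), m)
ritz = np.linalg.eigvals(H[:m, :m])         # Ritz eigenvalues of H(m)
exact = np.linalg.eigvalsh(A)
print(np.sort(ritz.real)[-3:])              # largest Ritz values ...
print(exact[-3:])                           # ... approximate the largest eigenvalues of A
```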

In the limit case m = n the matrix H(n) is similar to A and therefore has the same eigenvalues. This suggests that even for m ≪ n the eigenvalues of the reduced matrix H(m) may be good approximations to some eigenvalues of A. When the algorithm stops (in exact arithmetic) for some m < n by hm+1,m = 0, then the Krylov space Km is an invariant subspace of the matrix A and the reduced matrix H(m) = Q(m)TAQ(m) has m eigenvalues in common with A, i.e.

σ(H(m)) ⊂ σ(A).


The following lemma provides an a posteriori bound for the accuracy in approximating the eigenvalues of A by those of H(m).

Lemma 6.1 Let {µ, w} be an eigenpair of the Hessenberg matrix H(m) and let v = Q(m)w be such that {µ, v} is an approximate eigenpair of A. Then, there holds

‖Av − µv‖2 = |hm+1,m| |wm|,

where wm is the last component of the eigenvector w.


Bibliography

[1] P. Gill, W. Murray, and M. Wright. Numerical linear algebra and optimization. Number v. 1 in Numerical Linear Algebra and Optimization. Addison-Wesley Pub. Co., Advanced Book Program, 1991.

[2] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, 1996.

[3] W. Hager. Applied numerical linear algebra. Prentice Hall PTR, 1988.

[4] R. Rannacher. Numerical linear algebra. Lecture Notes, 2014.
