
NOTES ON A BASIC (PARTIAL) COURSE ON NUMERICAL LINEAR ALGEBRA

FERNANDO DE TERÁN†

1. Presentation of the course. The main four problems in numerical linear algebra (NLA) are the following (we set F = R or C):

1. Solution of systems of linear equations (SLE): Compute x such that:

Ax = b, with A ∈ Fn×n, b ∈ Fn.

2. Least squares problems (LSP): Compute:

min_x ‖Ax − b‖2,   with A ∈ Fm×n, b ∈ Fm.

3. Eigenvalue problems (EP): Compute eigenvalues (and eigenvectors) of matrices, namely, pairs (λ, v) such that:

Av = λv, λ ∈ F, v ∈ Fn, A ∈ Fn×n.

4. Compute the singular value decomposition (SVD) of a matrix A ∈ Fm×n:

A = UΣV ∗,

with U, V unitary (if F = C) or orthogonal (if F = R), and

Σ = diag(σ1, . . . , σr, 0, . . . , 0).

The values σ1 ≥ . . . ≥ σr > 0 are the singular values of A.

(For the notion of unitary and orthogonal matrices, see Definition 4.1.)

1.1. Background. Throughout this course we will use the following notions and notations.

Given a matrix A = [aij ] ∈ Cm×n, the notation |A| := [|aij |] stands for the matrix whose entries are the absolute values of the entries of A, A> denotes the transpose of A, and A∗ denotes its conjugate transpose. The n × n identity matrix is denoted by In. Sometimes we introduce some specific notation for the entries of a matrix, like above, but otherwise we denote by A(i, j) the (i, j) entry of A.

1.1.1. Matrix norms and spectral radius. Matrix norms will be used throughout this course. For a nice introduction to this topic we refer to [6, Lecture 3].

Definition 1.1. (Consistent family of matrix norms) A map ‖ · ‖ : Cm×n −→ R is a consistent family of matrix norms if it satisfies the following properties:

(i) ‖A‖ ≥ 0, and ‖A‖ = 0 if and only if A = 0.

†Departamento de Matemáticas, Universidad Carlos III de Madrid, Avenida de la Universidad 30, 28911 Leganés, Spain ([email protected])


(ii) ‖αA‖ = |α| · ‖A‖, for any A ∈ Cm×n and α ∈ C.
(iii) ‖A + B‖ ≤ ‖A‖ + ‖B‖, for any A, B ∈ Cm×n.
(iv) ‖AB‖ ≤ ‖A‖ · ‖B‖, for any A ∈ Cm×n, B ∈ Cn×p.

Example 1. The standard matrix norms in this course are the following. Let [xij ] ∈ Cm×n. Then:

• The 1-norm: ‖[xij ]‖1 := max_{j=1,...,n} Σ_{i=1}^{m} |xij |.

• The infinity norm: ‖[xij ]‖∞ := max_{i=1,...,m} Σ_{j=1}^{n} |xij |.

• The Frobenius norm: ‖[xij ]‖F := √( Σ_{i,j} |xij |² ).

• The 2-norm (or spectral norm): ‖X‖2 := max{ √|λ| : λ is an eigenvalue of X∗X }.

These maps also define norms in Cn, since Cn ≡ Cn×1. Note, in particular, that if x = [x1 . . . xn]> ∈ Rn, then ‖x‖F = √(|x1|² + · · · + |xn|²) = ‖x‖2 (the standard Euclidean norm). It is also straightforward to see that this also coincides with the spectral norm.

Note also that ‖[xij ]‖1 is the maximum 1-norm of the columns of [xij ], and that ‖[xij ]‖∞ is the maximum 1-norm of the rows of [xij ].

The reason to call the map ‖·‖ in Definition 1.1 a “family” of norms is that we want to introduce a map valid for all values of m, n, so it is not a single map from a single space Cm×n, but a “family” of maps from infinitely many spaces. In particular, property (iv) in Definition 1.1 is valid for all matrices A, B such that the product is well-defined.

An absolute norm is a matrix norm ‖·‖ such that ‖A‖ = ‖|A|‖, for any A ∈ Cm×n.

Exercise 1. Which of the norms in Example 1 are absolute norms?

Definition 1.2. (Spectral radius). The spectral radius of the matrix A ∈ Cn×n is the nonnegative real number:

ρ(A) := max{|λ| : λ is an eigenvalue of A}.

The spectral radius is not a matrix norm, but it is closely related to matrix norms. In particular, we have the following result.

Lemma 1.3. Let ‖ · ‖ be a matrix norm and A ∈ Cn×n. Then ρ(A) ≤ ‖A‖.

Proof. By definition, there is some eigenvalue λ of A such that |λ| = ρ(A). By definition of eigenvalue, there is some v ≠ 0 such that Av = λv. Then ‖Av‖ = ‖λv‖ and, using properties (ii) and (iv) in Definition 1.1, we arrive at |λ| · ‖v‖ ≤ ‖A‖ · ‖v‖ and, since v ≠ 0, this implies ρ(A) = |λ| ≤ ‖A‖.
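As a quick numerical sanity check of Lemma 1.3, the following MATLAB sketch (variable names are ours, the matrix is just a random example) compares the spectral radius with the norms of Example 1; each of them should bound ρ(A) from above.

    % Check rho(A) <= ||A|| for the standard norms (illustrative sketch).
    A   = randn(5);                          % a random 5 x 5 test matrix
    rho = max(abs(eig(A)));                  % spectral radius rho(A)
    fprintf('rho(A)   = %.4f\n', rho);
    fprintf('1-norm   = %.4f\n', norm(A,1));
    fprintf('inf-norm = %.4f\n', norm(A,inf));
    fprintf('2-norm   = %.4f\n', norm(A,2));
    fprintf('F-norm   = %.4f\n', norm(A,'fro'));
    % Every printed norm should be greater than or equal to rho(A).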

The spectral radius is a relevant notion in matrix analysis and applied linear algebra. Among its many properties, we recall the following one, that will be used later.

Lemma 1.4. If ‖A‖ < 1 then the matrix I + A is invertible and

‖(I + A)−1‖ ≤ 1/(1 − ‖A‖).


Proof. By contradiction, if I + A is not invertible then 0 is an eigenvalue of I + A, so there is a nonzero vector v ∈ Cn such that Av = −v. But this implies that λ = −1 is an eigenvalue of A, in contradiction with the hypothesis that ρ(A) ≤ ‖A‖ < 1.

To prove the second claim, notice that, for each N ≥ 0,

(I +A) · (I −A+A2 + · · ·+ (−1)NAN ) = I + (−1)NAN+1,

so

(I + A)−1 (I + (−1)^N A^{N+1}) = Σ_{k=0}^{N} (−1)^k A^k.

From this we get

(I + A)−1 = (−1)^{N+1} (I + A)−1 A^{N+1} + Σ_{k=0}^{N} (−1)^k A^k.

Taking norms, and using properties (ii)–(iv) in Definition 1.1, we arrive at

‖(I + A)−1‖ ≤ ‖(I + A)−1‖ · ‖A‖^{N+1} + Σ_{k=0}^{N} ‖A‖^k.

When N → ∞, ‖A‖^N → 0, because ‖A‖ < 1, so

‖(I + A)−1‖ ≤ Σ_{k=0}^{∞} ‖A‖^k = 1/(1 − ‖A‖),

and this concludes the proof.
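The bound of Lemma 1.4 is easy to check numerically. The following MATLAB sketch (our own illustration, using the 2-norm) rescales a random matrix so that ‖A‖ < 1 and compares ‖(I + A)−1‖ with 1/(1 − ‖A‖).

    % Numerical check of Lemma 1.4 (illustrative sketch).
    n = 6;
    A = randn(n);
    A = 0.5 * A / norm(A);              % rescale so that ||A||_2 = 0.5 < 1
    lhs = norm(inv(eye(n) + A));        % ||(I + A)^{-1}||_2
    rhs = 1 / (1 - norm(A));            % bound from Lemma 1.4, here equal to 2
    fprintf('||(I+A)^{-1}|| = %.4f <= %.4f\n', lhs, rhs);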

1.1.2. Basic notions on matrices. In Lesson 2 we will use the following notion.

Definition 1.5. (Triangular matrix). A matrix A = [aij ] ∈ Fn×n is said to be:
• upper triangular if aij = 0 for i > j,
• lower triangular if aij = 0 for i < j,
• triangular if it is either upper triangular or lower triangular.

We will use the following elementary properties of the inverses of triangular matrices.

Lemma 1.6. If A is an invertible lower (respectively, upper) triangular matrix, then A−1 is also a lower (resp., upper) triangular matrix. Moreover, if all diagonal entries of A are equal to 1, then all diagonal entries of A−1 are equal to 1.

Exercise 2. Prove Lemma 1.6.

Permutation matrices are used in Lesson 3.

Definition 1.7. (Permutation matrix). A permutation matrix is an n × n matrix which is obtained by permuting the rows of the n × n identity matrix.

It is interesting to notice that a permutation matrix P is obtained after permuting some rows of In, but also by permuting some columns of In. Actually, the column permutation and the row permutation of In are the same. Related to this, if P is an


n × n permutation matrix, then PIn is obtained from In by permuting some rows, and InP is obtained from In by permuting exactly the same columns, since PIn = InP.

Some elementary properties of permutation matrices that will be used in this course are listed in the following lemma, whose proof is straightforward.

Lemma 1.8. Let P, P1, P2 be n × n permutation matrices and X an arbitrary n × n matrix. Then:

(i) PX is obtained from X by permuting the same rows as in PIn.
(ii) XP is obtained from X by permuting the same columns as in InP.
(iii) P−1 = P>.
(iv) detP = ±1.
(v) P1P2 is also a permutation matrix.
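A minimal MATLAB illustration of Lemma 1.8 (the permutation p below is just an example of ours):

    % Build a permutation matrix by permuting the rows of the identity.
    p = [3 1 4 2];
    P = eye(4); P = P(p,:);       % rows of I_4 taken in the order given by p
    X = magic(4);                 % an arbitrary 4 x 4 matrix
    norm(P*X - X(p,:))            % (i): P*X permutes the rows of X; should be 0
    norm(inv(P) - P')             % (iii): P^{-1} = P^T; should be 0
    det(P)                        % (iv): equals +1 or -1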

1.2. Some indications about the content. This course includes some exercises, which are either theoretical or numerical (or both). Those requiring numerical computations with a computer are indicated with a computer symbol.

1.3. Some comments on the bibliography. For Lesson 2 we have mainly followed [3] and [6]. Lesson 3 is mainly based on [3], and some parts are also taken from [6]. As for Lesson 4, we have essentially followed [4] and [6]. The remaining references are excellent basic textbooks on numerical linear algebra and matrix computations.


2. Lesson 1: Conditioning, floating point arithmetic, and stability. Though some of the notions introduced in this lesson are also valid for complex numbers, we will focus on the set of real numbers (namely, F = R). In particular, in Section 2.2, we only look at real arithmetic. Of course, all computers are able to deal with complex numbers, but this is, in the end, based on real arithmetic. Though the analysis for complex arithmetic does not require too much additional effort, we omit it for two reasons. The first one is a time restriction, and the second one is that it is not frequent in the literature to find an explicit numerical analysis of complex arithmetic for backward error and conditioning of elementary algorithms and problems in NLA.

2.1. Conditioning. When working with computers, we cannot be sure that the original data are the exact ones. When introducing our (hopefully exact) data in the computer, these data are affected by rounding errors (see Section 2.2).

We can think of a problem as a function from a set of data (X) to a set of solutions (Y ), where each set is endowed with a norm.

Problem f:
    f : X −→ Y
    x (data) ↦ f(x) (solution)

We will refer to the elements in X as “points”, even though X is not necessarily an affine space. Hence, you can replace sentences like “a given point x ∈ X” by “a given data x ∈ X”. Moreover, for simplicity we will write ‖ · ‖ for both norms in X and Y, even though they are not necessarily the same.

In this setting, x denotes the exact data, which are affected by some perturbation, δx. These perturbations may come not only from rounding errors, but also from uncertainty in the measurements, or from any other source of error. Then, the error in the solution is given by

(2.1) δf := f(x+ δx)− f(x)

� We want to compare δf with δx, where δf is as in (2.1).

Definition 2.1. (Normwise absolute condition number) The condition number of the problem f at a given point x ∈ X is the non-negative number

(2.2) κ̂f (x) := lim_{δ→0} sup_{‖δx‖≤δ} ‖δf‖/‖δx‖,

with δf as in (2.1).

Remark 1. Some observations about Definition 2.1 are in order:
(a) The condition number κ̂f (x) compares absolute changes in the solutions with absolute changes in the data.
(b) In general, to get an acceptable measure of the condition number κ̂f (x) it is not necessary to take the limit, but just to get

κ̂f (x) ≈ sup_{‖δx‖≤δ} ‖δf‖/‖δx‖,

for δ small enough.


(c) If f is a differentiable function of the data x ∈ X, then

κ̂f (x) = ‖Jf (x)‖,

where Jf is the Jacobian matrix of f (namely, the linear operator associated to the differential).

If we look at relative variations of the data, and look, accordingly, for relative variations in the solution, then we arrive at the following notion.

Definition 2.2. (Normwise relative condition number) The relative condition number of the problem f at the point x ∈ X is the nonnegative number:

(2.3) κf (x) := lim_{δ→0} sup_{‖δx‖/‖x‖ ≤ δ} (‖δf‖/‖f(x)‖) / (‖δx‖/‖x‖).

Remark 2. Again, if f is a differentiable function of the data x ∈ X, then we have

κf (x) = (‖x‖/‖f(x)‖) · ‖Jf (x)‖.

� One of the central questions in NLA is to measure condition numbers. We will analyze the condition number for some of the main four problems of NLA mentioned in Section 1.

We will frequently say that a given problem is well-conditioned or ill-conditioned, depending on how big the condition number is. There is no fixed quantity that establishes the difference between these two categories. In general, whether a problem is well-conditioned or ill-conditioned depends on the context. Is a relative condition number of 10³ enough to say that a problem is ill-conditioned? There is no satisfactory answer to this question. What the relative condition number tells us in this case is that a relative error of 1 unit in the data could cause a relative error of 10³ units in the solution or, in other words, that the error in the solution of the problem can be 10³ times bigger than the error in the data.

Remark 3. (Choice of the norm). Condition numbers depend on the norms in X and Y. This is not, in general, a problem, and it should be clear from the very beginning which norm we are using. In this course, both X and Y will be either Fn or Fn×n, and the standard norms in these spaces are the norms in Example 1.

Example 2. Let f be the function (problem) assigning to each nonzero real value its inverse:

f : R \ {0} −→ R,   x ↦ f(x) = 1/x.

This function is differentiable, so

κ̂f (x) = 1/|x|².

Note that when x → 0, the condition number κ̂f (x) goes to infinity, so inverting a number close to zero is ill-conditioned in absolute terms.


However, in relative terms

κf (x) = (|x|/|f(x)|) · |f ′(x)| = 1,

so the problem of inverting nonzero real numbers is well conditioned in relative terms.

Example 3. Let us consider the following particular values in Example 2. Set x = 10^{−6} and x̃ = 10^{−6} + 10^{−10}. Then the absolute error in the inversion of x is given by:

|1/x − 1/x̃| / |x − x̃| = 1/(10^{−12} + 10^{−16}) ≈ 10^{12},

which is a big number. However, the relative error is

(|1/x − 1/x̃| ÷ |1/x|) / (|x − x̃|/|x|) = |x|/|x̃| ≈ 1.

The reason for such a difference is that

|1/x − 1/x̃| = 10^6 − 10^6/(1 + 10^{−4}) ≈ 10²,

which is a big quantity, compared just with the original data, which are of order 10^{−6}. However, it is a small error compared to the magnitude of 1/x.
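These quantities can be reproduced directly in MATLAB (a small sketch of ours, using the numbers of Example 3):

    % The quantities of Example 3, evaluated in double precision.
    x  = 1e-6;
    xt = 1e-6 + 1e-10;                                           % perturbed data
    abs_ratio = abs(1/x - 1/xt) / abs(x - xt);                   % about 1e12
    rel_ratio = (abs(1/x - 1/xt)/abs(1/x)) / (abs(x - xt)/abs(x));   % about 1
    fprintf('absolute ratio = %.3e, relative ratio = %.4f\n', abs_ratio, rel_ratio);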

In the following, for the sake of simplicity, and since the function f is clear from the context, we will replace the subscript f in the notation for condition numbers by a reference to the norm. Then, for instance, κ̂∞ is the absolute condition number in the ∞-norm, and κ1 is the relative condition number in the 1-norm.

Example 4. Let us now consider the function assigning to each pair of real numbers their difference:

f : R² −→ R,   (x, y) ↦ f(x, y) = x − y.

Again, this function is differentiable, and

κ̂∞(x, y) = ‖J(x, y)‖∞ = ‖[∂f/∂x, ∂f/∂y]‖∞ = ‖[1, −1]‖∞ = 2,   κ̂1(x, y) = 1.

However,

κ1(x, y) = (‖(x, y)‖1/|x − y|) · ‖J(x, y)‖1 = (|x| + |y|)/|x − y|.

Therefore, κ1(x, y) can be arbitrarily large if x ≈ y and x, y are not close to zero.


Example 4 highlights the well-known fact that subtracting nearby numbers can be problematic. But note that this only gives problems in relative terms, since the absolute condition number is not big. Summarizing:

� The problem of subtracting two numbers is relatively ill-conditioned when these two numbers are close to each other (and not close to zero), though it is absolutely well-conditioned. The effect produced by ill-conditioning when subtracting two nearby numbers is known as cancellation.
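The loss of relative accuracy caused by cancellation is easy to observe in MATLAB (a small sketch of ours; the particular numbers are just an illustration):

    % Cancellation: subtracting two nearby numbers in double precision.
    a = 1 + 1e-8;                     % two numbers close to each other
    b = 1 + 2e-8;
    exact    = 1e-8;                  % exact value of b - a
    computed = b - a;                 % computed value
    fprintf('computed       = %.16e\n', computed);
    fprintf('relative error = %.2e\n', abs(computed - exact)/abs(exact));
    % The relative error is of order 1e-8, much larger than the unit roundoff
    % 1e-16: the rounding errors in a and b are amplified by the subtraction.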

Exercise 3. Let x = 99, y = 100, x + δx = 99, y + δy = 101, and set f(x, y) = x − y.

(a) Compute the absolute condition number κ̂1(x, y).
(b) Compute the relative condition number κ1(x, y).
(c) Compute the absolute error quotient:

‖f(x + δx, y + δy) − f(x, y)‖1 / ‖(δx, δy)‖1.

(d) Compute the relative error quotient:

( ‖f(x + δx, y + δy) − f(x, y)‖1 / ‖f(x, y)‖1 ) / ( ‖(δx, δy)‖1 / ‖(x, y)‖1 ).

(e) Compare the results obtained in (c) and (d).

In the previous examples we have included the absolute condition number just for the sake of comparison. However, it is not very illustrative, because it does not take into account the magnitude of the original data or the solution. In the following, we will only focus on the relative condition number. Nonetheless, the next example shows that this is not the whole story, and that the normwise relative condition number is not always the best way to get an idea of how sensitive the problem is to realistic perturbations.

Example 5. Let f be the function

f : R² \ {(x, 0) : x ∈ R} −→ R,   (x, y) ↦ f(x, y) = x/y.

This is again a differentiable function, so

κ2(x, y) = (‖(x, y)‖2 / |x/y|) · ‖J(x, y)‖2 = (√(x² + y²) / |x/y|) · ‖[1/y, −x/y²]‖2 = (x² + y²)/|xy| = |x/y| + |y/x|.

Example 5 shows that the division of two real numbers x, y is ill-conditioned when either |x/y| or |y/x| is large, namely, when x, y are quite different from each


other. However, this is not in accordance with practical experience. The reason explaining the results of Example 5 is that in this example we are considering all perturbations of x, y, and this is not realistic. To be more precise, the usual perturbations in practice are small perturbations of x and y independently. However, in Example 5 we are allowing all kinds of perturbations and this, when x, y are very different from each other, may produce unrealistic perturbations. An example of a non-realistic perturbation is the following.

Example 5. (Continued). Set x = 1, y = 10^{−16}, and x + δx = 1, y + δy = 10^{−10}. Note that the perturbation in the variable y is huge in relative terms if we just look at y, but compared to the norm of the whole pair (x, y) it is a small perturbation:

‖(δx, δy)‖∞ / ‖(x, y)‖∞ = 10^{−10} − 10^{−16}.

However, it produces a big (relative) change in the function f, namely:

( |f(x + δx, y + δy) − f(x, y)| / |f(x, y)| ) / ( ‖(δx, δy)‖∞ / ‖(x, y)‖∞ ) = ( (10^{16} − 10^{10})/10^{16} ) / ( 10^{−10} − 10^{−16} ) ≈ 10^{10}.

Exercise 4. We have seen in Example 2 that inverting a nonzero number is a well-conditioned problem in relative terms. However, in Example 5 we have seen that, even if we fix the numerator to be 1, the condition number of the division of two real numbers x, y can be arbitrarily large. What is the reason for this?

A realistic perturbation of the pair (x, y) in Example 5 would be a small relative perturbation in both x and y, namely, a perturbation of the form (δx, δy) with both |δx|/|x| and |δy|/|y| being small.

The normwise relative condition number, however, does not take this issue into account. For this, we need to look at what happens in each coordinate of the data. This leads to the notion of componentwise relative condition number, which we are not going to define formally because it is beyond the scope of this course but, following Definition 2.2, it can be informally defined as

(2.4) κf^{(comp)}(x) := lim_{δ→0} sup_{Erc(x)≤δ} (‖δf‖/‖f(x)‖) / Erc(x),

where Erc(x) is some measure of the componentwise relative error in the data x. In the case of Example 5, this componentwise relative error is:

Erc := Erc(x, y) = max{ |δx|/|x| , |δy|/|y| }.

Then, with some elementary manipulations, we get

|δf(x, y)| = |(x + δx)/(y + δy) − x/y| = |x/y| · |(δx/x − δy/y)/(1 + δy/y)| ≤ |x/y| · 2Erc/(1 − Erc),


so

|δf(x, y)| / |f(x, y)| ≤ 2 Erc/(1 − Erc) ≈ 2 Erc.

As a consequence, κf^{(comp)}(x, y) = 2, which means that the problem of dividing two real numbers x, y, with y ≠ 0, is well-conditioned componentwise.

Though componentwise conditioning is relevant in NLA, we will not pay much attention to it, due to time restrictions.

Exercise 5. In order to check the effect of cancellation, we are going to compute the value of e^{−30} using the Taylor polynomial of the exponential:

(2.5) e^x ≈ Σ_{k=0}^{N} x^k/k!.

In particular, we look for an approximation to e^{−30} with absolute error less than 10^{−10}.
(a) Use the bound for the residual of the Nth Taylor expansion of the function f(x) around x0,

R_N(x) ≤ ( |f^{(N+1)}(c)|/(N + 1)! ) |x − x0|^{N+1}, for some c with |x − c| ≤ |x − x0|,

to determine the smallest value of N in (2.5) that guarantees an approximation (in absolute terms) with error at most 10^{−10}.

(b) Write a MATLAB code to compute e^{−30} using the sum (2.5) for the value of N obtained in (a), and compute it using this routine.
(c) Compute the value of e^{−30} by just typing exp(−30) in MATLAB, and compare with the result obtained in (b). Explain what happens and why it happens.
(d) Compute e^{−30} by first computing e^{30}, using again (2.5) with the appropriate value of N, and then inverting the result.
(e) Compare the result in (d) with the ones in (b)–(c). Is this in accordance with your explanations in (c)?


HIGHLIGHTS:

� The conditioning of a problem measures the sensitivity of the solution of this problem to (small) changes in the data.

� The word “conditioning” refers to a problem, and not to an algorithm. Moreover, it is a mathematical property, and it has nothing to do with the computer.

� The normwise condition number measures the sensitivity of the solution with respect to changes in norm of the set of data. In some cases, it is more appropriate to look at the componentwise condition number, which measures the sensitivity of the solution with respect to changes in each of the components of the data, independently.

� If the problem, f, is differentiable, then the condition number at a given data x is the norm of the Jacobian of f at x.

� Subtracting two nearby numbers is ill-conditioned.

2.2. Floating point arithmetic. Since any machine (computer) has a finite memory, it can only represent a finite subset of the set of real (or complex) numbers. In this section we provide an overview of some of the basic features of the number system in a computer: How many numbers can a computer store? How large is the distance between two consecutive numbers? How does the computer operate with numbers?

The standard number system in current computers is the floating point system. The underlying idea of such a system is that the relative separation between two consecutive numbers in the system is essentially constant throughout the whole range of the system.

Definition 2.3. (Floating point system). A floating point system (FPS) is a subset F ⊆ R defined as

(2.6) F := {±0.d1d2 · · · dt · βe : 0 ≤ di ≤ β − 1; d1 ≥ 1, emin ≤ e ≤ emax} ∪ {0},

where
• β ∈ N, β ≥ 2 is called the base of the system;
• t ∈ N, t ≥ 1 is called the precision of the system;
• e, emin, emax ∈ Z.

Numbers in F are usually called floating point numbers, or machine numbers.

Remark 4. The decimal notation in (2.6) stands for the following number:

0.d1d2 · · · dt := d1/β + d2/β² + · · · + dt/β^t.

Remark 5. The condition d1 ≥ 1 in (2.6) guarantees the uniqueness of the representation of any number in the system (2.6).

Exercise 6. (Integer representation of F). Prove that any number x ∈ F, with F as in (2.6), can be written in the form

(2.7) x = ±mβe−t, with βt−1 ≤ m ≤ βt − 1.


Note that the number m in (2.7) is a whole number, and that any whole number in the indicated range provides a number in the system (2.6). Then, Equation (2.7) provides an equivalent expression for the numbers in (2.6).

Let us summarize some basic facts about floating point systems.

• The standard floating point systems in most computers are the IEEE single precision and IEEE double precision systems. Actually, most computers have the double precision system by default, though single precision can also be chosen (in MATLAB just type single(a), where a is the number you want to convert to single precision). The basic ingredients of each system are indicated in Table 2.1.
• The floating point system has an additional position for the zero number.
• Any floating point system contains the number 1. It is the first number with exponent e = 1.
• The distance between 1 and the next floating point number is β^{1−t}. This distance β^{1−t} is known as the machine epsilon (machep).

system         base (β)   precision (t)   emin    emax    machep (β^{1−t})    unit roundoff (β^{−t})
IEEE single       2            24         −126     127      1.2 · 10^{−7}        6 · 10^{−8}
IEEE double       2            53        −1022    1023      2.2 · 10^{−16}       1.1 · 10^{−16}

Table 2.1: Basic ingredients of IEEE systems
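In MATLAB, the parameters of Table 2.1 for the working double precision system (and for single precision) can be queried directly; the following lines are just an illustration:

    eps                  % machine epsilon of double precision: 2.2204e-16
    eps('single')        % machine epsilon of single precision: about 1.1921e-07
    realmin, realmax     % smallest and largest positive normalized doubles
    a = single(0.1);     % conversion to single precision, as mentioned above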

When shifting from base-10 to base-2 representation (or vice versa) it may happen that a number with only a finite amount of nonzero digits becomes a periodic number. This is, surprisingly, what happens with 0.1.

Exercise 7. Find the expression for the decimal number 0.1 in base 2.

2.2.1. Rounding numbers: the float function. When the user inserts a number x ∈ R in a computer, the computer automatically rounds it to a number in its floating point system F (as long as x lies within the range of F). We use the standard notation:

fl(x) := number in the system (2.6) which is closest to x.

In the case where there are two closest numbers to x, there are several rules to choose one of them, but we are not going to get into these details.

The first natural question to ask is: how far away is the number fl(x) from x? The next result provides a bound for this distance.

Theorem 2.4. Let x ∈ R be within the range of F . Then

(a) fl(x) = x(1 + δ), with |δ| ≤ ½ β^{1−t},
(b) fl(x) = x/(1 + α), with |α| ≤ ½ β^{1−t}.

Proof. If x ∈ [β^{e−1}, β^e], for some emin ≤ e ≤ emax, then |fl(x) − x| ≤ β^{e−t}/2, by (2.7). Now:
(a) Let δ = (fl(x) − x)/x; then |δ| ≤ (β^{e−t}/2)/β^{e−1} = ½ β^{1−t}.
(b) Let α = (x − fl(x))/fl(x); then |α| ≤ (β^{e−t}/2)/β^{e−1} = ½ β^{1−t}.


The number

u := ½ β^{1−t},

appearing in Theorem 2.4, is known as the unit roundoff of the system F. If β = 2, it is just u = β^{−t}.

Remark 6. Some remarks are in order:
(a) The zero number is always rounded to zero.
(b) If the number x ∈ R does not lie within the range of F, then it is either because it is less (in absolute value) than the smallest nonzero number in the system, or because it is larger than the largest one in F. In the first case, fl(x) = 0 (this is known as underflow), whereas in the second case, the computer does not round it and produces an error message (this is known as overflow).

2.2.2. Absolute and relative separation of numbers in floating point systems. We restrict ourselves in this section to positive numbers, for simplicity. We want to know the absolute and relative distance between two consecutive numbers in F, xn < xn+1. We first analyze the absolute distance.

Let us fix an exponent emin ≤ e ≤ emax. Looking at (2.7), we see that the difference between two consecutive numbers with exponent e in (2.6) is

xn+1 − xn = βe−t.

Indeed, the numbers in F with exponent e are of the form

β−1βe, (β−1 + β−t)βe, (β−1 + 2β−t)βe, . . . , (1− β−t)βe.

Now, the first number in F with exponent e + 1 is β^{−1}β^{e+1} = β^e, so the difference between this number and the preceding one is

xn+1 − xn = βe − (1− β−t)βe = βe−t.

As a consequence, any two consecutive numbers in F between β^{e−1} and β^e are at the same distance, namely β^{e−t}. This means that, each time we increase the exponent by one unit, the absolute separation between floating point numbers increases by a factor of β.

Figure 2.1 displays all positive numbers in a floating point system (2.6) with β = 2, t = 3, emin = −4, emax = 3. As can be seen, the further to the right the numbers are, the more separated they are.

Now, let us look at the relative distance. Let us assume that β^{e−1} ≤ xn < xn+1 ≤ β^e. Then, the relative separation, sr, between xn and xn+1 satisfies

β^{−t} = β^{e−t}/β^e ≤ sr = (xn+1 − xn)/xn = β^{e−t}/xn ≤ β^{e−t}/β^{e−1} = β^{1−t}.

Moreover, the maximum of the relative distance, β^{1−t}, is attained for x = β^{e−1} (the smallest number with exponent e), and it decreases until the minimum, β^{−t}, which is attained at x = (1 − β^{−t})β^e (the largest number with exponent e). This is illustrated in Figure 2.2 for a binary system with precision t = 24.
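This behaviour can be inspected in MATLAB with the function eps(x), which returns the distance from x to the next larger double precision number (a small sketch of ours):

    x = [1 2 1e8 1e16];
    abs_gap = eps(x)          % absolute gaps: they grow with the magnitude of x
    rel_gap = eps(x) ./ x     % relative gaps: all of the order of 1e-16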


Fig. 2.1. β = 2, t = 3, emin = −4, emax = 3

Fig. 2.2. Relative distance from x to the next larger machine number (β = 2, t = 24). Taken from [3, p. 40].

2.2.3. Floating point operations. By “floating point operations” we understand the following arithmetic operations: + (sum), − (subtraction), × (multiplication), ÷ (division), and √ (square root). Clearly, the system F in (2.6) is not closed under such arithmetic operations. Then, the question that arises is: what does the computer do when the result of operating two numbers in F is not a number in F? The answer is not easy and, essentially, depends on the computer. However, it is possible to model what most computers do as follows.

� In the following, we denote by ∗ any of the arithmetic operations mentioned above, and by ⊛ its floating point counterpart, that is: if x, y ∈ F, then x ⊛ y is the output provided


by the computer as x ∗ y.

Floating point arithmetic model

Let x, y ∈ F . Then

(2.8) x ⊛ y = (x ∗ y)(1 + δ), with |δ| ≤ u,

where u is the unit roundoff.

Some comments on the floating point arithmetic model are in order:
• Compared with the natural assumption x ⊛ y = fl(x ∗ y), the assumption (2.8) is more pessimistic, because in the case x ∗ y ∈ F the model does not require δ = 0.
• However, and having in mind the preceding comment, we will sometimes write fl(x ∗ y) instead of x ⊛ y. This will simplify the developments.
• The model does not tell us what exactly x ⊛ y is. It only provides a bound on it with respect to the exact value x ∗ y. This allows us to deal easily with rounding errors, at the expense of working with unknown quantities (δ).
• Note that the model does not guarantee that the standard arithmetic laws (associative, distributive) are still true in floating point arithmetic. For instance, in general x ⊗ (y ⊕ z) ≠ (x ⊗ y) ⊕ (x ⊗ z), and x ⊗ (y ⊗ z) ≠ (x ⊗ y) ⊗ z.
• In general, we start with two real numbers, x, y ∈ R, not necessarily in F. In this case, the operation performed by the computer would be fl(x) ⊛ fl(y).

Example 6. Let x, y ∈ R+. We are going to analyze the error when computing x + y in a computer with unit roundoff u.

As mentioned before, the computer performs:

fl(x)⊕ fl(y) ≡ fl(x+ y) = x(1 + u1)⊕ y(1 + u2) = [x(1 + u1) + y(1 + u2)] (1 + u3),

with |u1|, |u2|, |u3| ≤ u, by (2.8). Then

(2.9) fl(x+ y) = x(1 + u1)(1 + u3) + y(1 + u2)(1 + u3) = x(1 + δ1) + y(1 + δ2),

with |δ1|, |δ2| ≤ 2u + u². As a consequence, fl(x + y) − (x + y) = xδ1 + yδ2, so

|fl(x + y) − (x + y)| ≤ |x||δ1| + |y||δ2| ≤ (|x| + |y|)(2u + u²) = |x + y|(2u + u²),

where in the last equality we have used that x, y are both positive. Equivalently,

(2.10) |fl(x + y) − (x + y)| / |x + y| ≤ 2u + u².

Equation (2.10) provides a bound on the relative error when computing the sum of two positive numbers in floating point arithmetic.
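As a rough check, the following MATLAB lines (our illustration) evaluate 0.1 + 0.2 and compare with 0.3; the measured relative error is consistent with the bound 2u + u² ≈ 2.2 · 10^{−16} (only approximately, since the reference value 0.3 is itself rounded when typed):

    s = 0.1 + 0.2;                       % fl(fl(0.1) + fl(0.2))
    rel_err = abs(s - 0.3) / 0.3;        % about 1.8e-16
    fprintf('computed sum = %.17f, relative error = %.2e\n', s, rel_err);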

HIGHLIGHTS:

� A floating point system aims for the relative distance between two consecutive numbers to be essentially constant for all numbers in the system.

� The relative distance between two consecutive numbers is bounded between the unit roundoff, β^{−t}, and the machine epsilon, β^{1−t}.

� The absolute separation increases as the numbers increase in absolute value. It increases by a factor of β as the exponent of the numbers increases by one unit.

� The floating point arithmetic model assumes that the relative error when rounding an elementary arithmetic operation of machine numbers is at most the unit roundoff of the computer.

2.3. Computational cost. Another relevant feature of numerical algorithms is their computational cost (or computational complexity). This is measured as the number of elementary arithmetic operations. Each floating point operation counts as one operation in the computer, termed a flop. Then, the computational cost is measured in number of flops. The computational cost of competitive algorithms must be a polynomial in the size of the problem. In the four basic NLA problems mentioned at the beginning of these notes, the size of the problem refers to the size of the matrix A and, when it is square, this value is n. We say that an algorithm is, for instance, of order n³, and denote it by O(n³), when the amount of operations involved in the algorithm is a cubic polynomial in n. Sometimes we also indicate the coefficient of the leading term of this polynomial and say, for instance, that an algorithm is O((2/3)n³).

2.4. Order of approximation. The “big-O” notation will also be used in this course with a different meaning from the one in the previous section, related to small quantities instead of big quantities. This notation is relevant to measure the accuracy of algorithms.

By O(u) we denote a function of u such that, when u approaches 0, there is a quantity C, independent of u, such that O(u) ≤ Cu. However, the quantity C may depend on other quantities (like the size of the problem).

2.5. Stability of numerical algorithms. Following the developments in Section 2.1, we represent both a problem, f, and an algorithm to solve this problem, f̃, as functions from a set of data (X) to a set of solutions (Y ):

Problem f:
    f : X −→ Y
    x (data) ↦ f(x) (exact solution)

Algorithm f̃:
    f̃ : X −→ Y
    x (data) ↦ f̃(x) (computed solution)

For a given data x ∈ X, the algorithm computes an approximate solution f̃(x) to the exact solution f(x).

� The GOAL of this section is to compare f̃(x) and f(x) in a relative way, that is, to measure the relative error

(2.11) ‖f̃(x) − f(x)‖ / ‖f(x)‖.


The first attempt to introduce a notion of a good algorithm in this respect is the following definition.

Definition 2.5. The algorithm f̃ for the problem f is accurate if, for any x ∈ X,

(2.12) ‖f̃(x) − f(x)‖ / ‖f(x)‖ = O(u).

However, condition (2.12) is too ambitious, since it is a condition on the algorithm that disregards the conditioning of the problem. More precisely, even in the ideal situation where

(2.13) f̃(x) = f(fl(x)),

the relative error (2.11) can be large if the problem is ill-conditioned (and this last property has nothing to do with the algorithm). We also want to note that (2.13) means that the only error committed by the computer is the roundoff error in the data, which is unavoidable.

The following definition is more realistic, since it takes into account the conditioning of the problem.

Definition 2.6. The algorithm f̃ for the problem f is stable if for any x ∈ X there is some x̃ ∈ X satisfying

‖f̃(x) − f(x̃)‖ / ‖f(x̃)‖ = O(u),   and   ‖x̃ − x‖ / ‖x‖ = O(u).

In other words, a stable algorithm produces an approximate solution for approximate data. We can go even further in this direction.

Definition 2.7. The algorithm f̃ for the problem f is backward stable if for any x ∈ X there is some x̃ ∈ X satisfying

f̃(x) = f(x̃),   and   ‖x̃ − x‖ / ‖x‖ = O(u).

In other words, a backward stable algorithm provides the exact solution for a nearby data. This is the most ambitious requirement for an algorithm, since data are always subject to rounding errors.

Definitions 2.5, 2.6, and 2.7 depend on two objects: (a) the norm in the sets of data and solutions, and (b) the quantity C such that O(u) ≤ Cu when u approaches 0. Regarding the norm, in all problems of this course, X and Y are finite dimensional vector spaces. Since all norms in finite dimensional spaces are equivalent, the notions of accuracy, stability, and backward stability do not depend on the norm. As for the quantity C, it must be independent of u, but it will usually depend on the dimension of the spaces X, Y. However, in order for the quantity O(u) not to be too large, this dependence is expected to be, at most, polynomial.

A natural question regarding Definitions 2.5, 2.6, and 2.7 is about the relationships between them. For instance: is an accurate algorithm necessarily backward stable, or vice versa? It is immediate that a backward stable algorithm is stable, and that an accurate algorithm is stable. However, it is not so easy to see whether the other


implications are true or not. Actually, none of them is true. In particular, an accurate algorithm is not necessarily backward stable.

Example 7. Equation (2.10) tells us that the sum of two real numbers of the same sign in floating point arithmetic is an accurate algorithm, and Equation (2.9) implies that it is backward stable as well.

In terms of the goal mentioned at the beginning of this section, one may wonder what the connection is between stability (or backward stability) and the relative error (2.11). The following result provides an answer to this question.

Theorem 2.8. Let f̃ be a stable algorithm for the problem f : X → Y, whose relative condition number at the data x ∈ X is denoted by κ(x). Then

‖f̃(x) − f(x)‖ / ‖f(x)‖ = O(u κ(x)).

The proof of Theorem 2.8 for f̃ being backward stable is immediate (see [6, Th. 15.1]). For f̃ being just stable it is more involved, and is out of the scope of this course.

What Theorem 2.8 tells us is that when using a stable algorithm for solving a problem, the forward error of the algorithm depends on the conditioning. In particular, this can be summarized as:

f̃ stable + f well-conditioned ⇒ f̃ accurate

HIGHLIGHTS:

� Stability is a property of an algorithm used to solve a given problem.

� The best property one can expect from an algorithm is to be backward stable, because accuracy is too ambitious due to the presence of rounding errors.

� A stable algorithm applied to solve a well-conditioned problem produces an accurate solution.


3. Lesson 2: Solution of systems of linear equations. The main object of this Lesson is a (real) system of linear equations, namely:

(3.1) Ax = b, A ∈ Fn×n, b ∈ Fn.

Moreover, we will assume that the system (3.1) has a unique solution, so the coefficient matrix A is invertible.

The goals of this Lesson are:

1. To review the standard method for solving (3.1).
2. To analyze the conditioning of the problem of solving (3.1).
3. To analyze the backward error of the standard algorithm to solve (3.1).

3.1. Conditioning. In this section, we analyze both the relative normwise and the relative componentwise conditioning of the general problem

(3.2) f : Fn×n × Fn −→ Fn,   (A, b) ↦ x,

where x ∈ Fn is the solution of (3.1). In other words, the problem f associates to each pair (A, b) ∈ Fn×n × Fn (data) the solution of the corresponding system (3.1).

3.1.1. Normwise conditioning. We begin with the following result.

Theorem 3.1. Let A ∈ Fn×n be an invertible matrix and b, δb ∈ Fn. Let δA ∈ Fn×n be such that ‖A−1‖ · ‖δA‖ < 1, for some matrix norm ‖ · ‖. Let x be a solution of (3.1) and δx be such that

(3.3) (A+ δA)(x+ δx) = b+ δb.

Then

(3.4) ‖δx‖/‖x‖ ≤ ( ‖A−1‖ · ‖A‖ / (1 − ‖A−1 · δA‖) ) · ( ‖δA‖/‖A‖ + ‖δb‖/‖b‖ ).

Proof. We first prove that the condition ‖A−1‖ · ‖δA‖ < 1 implies that the matrix A + δA is invertible. For this, note that A + δA is invertible if and only if I + A−1 · δA is invertible. From Lemma 1.3 and property (iv) in Definition 1.1 we get ρ(A−1 · δA) ≤ ‖A−1 · δA‖ ≤ ‖A−1‖ · ‖δA‖ < 1, and then Lemma 1.4 implies that I + A−1 · δA is invertible.

Now, from (3.3) we immediately get δA · x + (A + δA)δx = δb, which implies δx = (A + δA)−1(δb − δA · x), and taking norms and applying the second claim in the statement of Lemma 1.4 we arrive at

‖δx‖ ≤ ( ‖A−1‖ / (1 − ‖A−1 · δA‖) ) (‖δA‖ · ‖x‖ + ‖δb‖).

Now, dividing by ‖x‖ and using that ‖b‖ ≤ ‖A‖ · ‖x‖ in the rightmost summand, we arrive at (3.4).

Definition 3.2. The number

(3.5) κ‖·‖(A) = ‖A−1‖ · ‖A‖


is the condition number for inversion of the matrix A in the norm ‖ · ‖.

Note that, using (3.5) and ‖A−1 · δA‖ ≤ ‖A−1‖ · ‖δA‖, inequality (3.4) leads to

(3.6) ‖δx‖/‖x‖ ≤ ( κ‖·‖(A) / (1 − κ‖·‖(A) · ‖δA‖/‖A‖) ) · ( ‖δA‖/‖A‖ + ‖δb‖/‖b‖ ).

When ‖δA‖/‖A‖ is small (which should be the case if the perturbation δA is small enough), then

κ‖·‖(A) / (1 − κ‖·‖(A) · ‖δA‖/‖A‖) ≈ κ‖·‖(A),

so (3.6) is approximately

(3.7) ‖δx‖/‖x‖ ≤ κ‖·‖(A) · ( ‖δA‖/‖A‖ + ‖δb‖/‖b‖ ).

Bound (3.7) indicates that the condition number for inversion of A, κ‖·‖(A), is a measure of the condition number of the problem of solving the general SLE (3.2). If the quantity ‖δA‖/‖A‖ + ‖δb‖/‖b‖ is considered as a normwise measure of the relative error in the data (A, b), then the relative error in the solution is not magnified by more than κ‖·‖(A) times.
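The amplification predicted by (3.7) can be observed in MATLAB, for instance with the notoriously ill-conditioned Hilbert matrix (a small sketch of ours):

    n = 12;
    A = hilb(n);                        % Hilbert matrix, very ill-conditioned
    x_exact = ones(n,1);
    b = A * x_exact;                    % the rounding of b acts as a perturbation
    x_comp = A \ b;                     % computed solution
    fprintf('cond(A)        = %.2e\n', cond(A));
    fprintf('relative error = %.2e\n', norm(x_comp - x_exact)/norm(x_exact));
    % The relative error is many orders of magnitude larger than the unit
    % roundoff, in line with the amplification factor kappa(A) in (3.7).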

Exercise 8. Set

A = [ 1, 1 ; 1 + 10^{−n}, 1 ],   b = [ 1 + 10^n ; 2 + 10^n ].

(a) Give an explicit expression, depending on n, for the solution, x, of (3.1), with A, b as above.
(b) Compute κ1(A), depending on n.
(c) Use MATLAB to compute the solution of (3.1), with A, b as above, for n = 8, 10, 12, 14.
(d) Are your results in accordance with (3.7)?
(In order to answer (d) you should know that the default unit roundoff in MATLAB, which works in IEEE double precision by default, is u ≈ 10^{−16}, see Table 2.1. This means that the rightmost factor in the right-hand side of (3.7) is approximately 10^{−16}.)

3.1.2. Componentwise conditioning. In the following, a matrix inequality A ≤ B, with A, B ∈ Fn×n, is understood entrywise, that is, aij ≤ bij, for any entries aij, bij of A and B, respectively.

Let us assume that the matrix δA satisfies

(3.8) |δA| ≤ ε|A|,

for some ε > 0, and, for simplicity, we assume δb = 0 (so the right-hand side of (3.1) is not perturbed).

Then

A · δx = −δA(x + δx) ⇒ δx = A−1(−δA(x + δx)) ⇒ |δx| ≤ |A−1| (|δA| · |x + δx|).


For the last implication we have used that |AB| ≤ |A| · |B|, for any pair of matrices A ∈ Cm×n, B ∈ Cn×p. Now, from (3.8) we get

|δx| ≤ ε|A−1| · |A| · |x+ δx|.

Then, if ‖ · ‖ is any norm with the property

|A| ≤ |B| ⇒ ‖A‖ ≤ ‖B‖

(like, for instance, any absolute norm, as well as the 2-norm), this implies

‖δx‖ ≤ ε‖|A−1| · |A|‖ · ‖x+ δx‖,

which in turn implies

(3.9) ‖δx‖ / ‖x + δx‖ ≤ ε ‖|A−1| · |A|‖.

The factor κCR(A) := ‖|A−1| · |A|‖ in the right-hand side of (3.9) is known as the componentwise condition number, or Bauer-Skeel condition number.

Inequality (3.9) also provides an estimation of the componentwise relative conditioning of the problem (3.1). This estimation is, precisely, the Bauer-Skeel condition number. The advantage of inequality (3.9) over (3.7) is that the Bauer-Skeel condition number κCR(A) is smaller than the condition number for inversion κ(A), and in some cases it is much smaller.
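The following MATLAB lines (an illustration of ours, with a deliberately badly row-scaled matrix) show a case where the Bauer-Skeel condition number is much smaller than the condition number for inversion:

    A = [1e8 1e8; 1 2];                           % badly row-scaled example
    kappa_inv = norm(A,inf) * norm(inv(A),inf)    % about 2e8
    kappa_BS  = norm(abs(inv(A)) * abs(A), inf)   % about 7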

� Of course, both (3.4) and (3.9) are bounds, which means that the right-hand side could be much greater than the left-hand side. However, these bounds are attainable, which means that, for any value of n, there are data (A, b) for which these inequalities become equalities.

3.2. Gaussian elimination and LU factorization. Gaussian elimination is still the standard algorithm for solving systems of linear equations. The goal of this section is to analyze this method from the point of view of the computational cost and the backward error. We will also see that this method is equivalent to factorizing a matrix as a product of a lower (L) and an upper (U) triangular matrix, which leads to the LU factorization.

3.2.1. Triangular systems. A general strategy to solve a SLE is to transform the original system (3.1) into an equivalent one (i.e., one having exactly the same solutions),

(3.10) Tx = b′,

whose solution can be obtained easily. The standard choice for such a system (3.10) is one whose coefficient matrix T is a triangular matrix. Both upper triangular and lower triangular are equivalent alternatives, but here we concentrate on upper triangular matrices. The arguments and developments for lower triangular matrices are completely analogous.

The standard algorithm to solve an upper triangular system (3.10) is as follows. We denote T = [tij] and, for simplicity, b′ = [bi]. Then the system (3.10) becomes

(3.11)
t11 x1 + t12 x2 + t13 x3 + · · · + t1n xn = b1
         t22 x2 + t23 x3 + · · · + t2n xn = b2
                                       ...
              tn−1,n−1 xn−1 + tn−1,n xn = bn−1
                                 tnn xn = bn


The following algorithm solves (3.11) through the standard procedure of backward substitution (here ⊖, ⊗, and ⊘ denote the floating point counterparts of −, ×, and ÷).

Algorithm 1: Algorithm to solve an upper triangular system (3.11) by backward substitution.

x̂n = bn ⊘ tnn
for i = n − 1 : −1 : 1
    x̂i = ( bi ⊖ (ti,i+1 ⊗ x̂i+1) ⊖ (ti,i+2 ⊗ x̂i+2) ⊖ · · · ⊖ (ti,n ⊗ x̂n) ) ⊘ tii
end

In Algorithm 1, and in what follows, we use a hat ( ˆ ) for the computed (approximate) quantities. Note that in Algorithm 1 we are assuming that the data (T, b) are the exact values, so they are machine numbers. This will be a general assumption in this course, and it is only relevant for the backward error analysis.

For lower triangular systems Algorithm 1 works in a similar way, with forward substitution instead.
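For reference, a minimal MATLAB version of Algorithm 1 could look as follows (a sketch of ours; the inner product in the loop performs the same chain of multiplications and subtractions as Algorithm 1, and T is assumed upper triangular with nonzero diagonal):

    function x = backsub(T, b)
    % Solve T*x = b by backward substitution, with T upper triangular.
    n = length(b);
    x = zeros(n,1);
    x(n) = b(n) / T(n,n);
    for i = n-1:-1:1
        x(i) = (b(i) - T(i,i+1:n) * x(i+1:n)) / T(i,i);
    end
    end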

Computational cost. Note that the first step of Algorithm 1 involves just one flop (a division). At each new step we add 2 new flops, namely, one subtraction and one multiplication. Hence, the total cost of Algorithm 1 is

1 + 3 + 5 + · · · + (2n − 1) = n².

This cost is exact, and not an approximation.

Backward error analysis. The following result shows that Algorithm 1 is backward stable.

Theorem 3.3. Let us assume that Algorithm 1 computes a solution x̂ of the system Tx = b, with T upper triangular. Then there is an upper triangular matrix ∆T such that

(3.12) (T + ∆T) x̂ = b,

with

|∆T| ≤ (nu + O(u²)) |T|,

and therefore ‖∆T‖1,∞,F ≤ (nu + O(u²)) ‖T‖1,∞,F.

Proof. From the third line in Algorithm 1 we get

ti,i+k ⊗ x̂i+k = ti,i+k x̂i+k (1 + εi+k),

with |εi+k| ≤ u and 1 ≤ k ≤ n − i. We also have

bi ⊖ (ti,i+1 ⊗ x̂i+1) = (bi − ti,i+1 x̂i+1(1 + εi+1))(1 + δ1)

and, iterating,

( bi ⊖ (ti,i+1 ⊗ x̂i+1) ⊖ · · · ⊖ (ti,i+k−1 ⊗ x̂i+k−1) ) ⊖ (ti,i+k ⊗ x̂i+k) = ( (bi ⊖ (ti,i+1 ⊗ x̂i+1) ⊖ · · · ⊖ (ti,i+k−1 ⊗ x̂i+k−1)) − ti,i+k ⊗ x̂i+k )(1 + δk),


with |δk| ≤ u, and 1 ≤ k ≤ n− i. Then we arrive at

x̂i = ( bi(1 + δ1) · · · (1 + δn−i) − Σ_{k=1}^{n−i} x̂i+k ti,i+k (1 + εi+k)(1 + δk) · · · (1 + δn−i) ) / ( tii(1 + αi) ),

with |αi| ≤ u. Operating we get

( tii(1 + αi) / ((1 + δ1) · · · (1 + δn−i)) ) · x̂i = bi − Σ_{k=1}^{n−i} ( ti,i+k(1 + εi+k) / ((1 + δ1) · · · (1 + δk−1)) ) · x̂i+k.

Now, setting

tii + ∆tii = tii(1 + αi) / ((1 + δ1) · · · (1 + δn−i)) = tii(1 + θii),

ti,i+k + ∆ti,i+k = ti,i+k(1 + εi+k) / ((1 + δ1) · · · (1 + δk−1)) = ti,i+k(1 + θi,i+k),

with |θii| ≤ (n − i + 1)u + O(u²) ≤ nu + O(u²), and |θi,i+k| ≤ ku + O(u²) ≤ nu + O(u²), the result follows.

Some relevant remarks about Theorem 3.3:

• Theorem 3.3 guarantees that the backward substitution algorithm for upper triangular systems is backward stable, not only normwise, but also componentwise.

• It also says that the computed solution is a solution of a nearby system which has the same right-hand side as the original system, namely b.

• Moreover, this nearby system is also upper triangular.

Exercise 9. With the notation from Theorem 3.3, prove that

(3.13) ‖x̂ − x‖1,∞,F / ‖x‖1,∞,F ≤ (nu + O(u²)) ‖|T−1| · |T|‖1,∞,F.

Equation (3.13) says that the Bauer-Skeel condition number of T is the condition number of solving the triangular system (3.10). For a more general result in this direction see [3, p. 155].

3.2.2. General systems. Gaussian elimination is a numerical method to obtain an upper triangular matrix from a given matrix A ∈ Fn×n using elementary row operations. It performs at most n steps, where each single step can be depicted as follows:

(3.14) A = [ a, b> ; c, A22 ] −→ Ã = [ a, b> ; 0, Ã22 ],   Ã22 = A22 − (1/a) c · b>,

where a ≠ 0 and b, c ∈ Fn−1.

Let us review the Gaussian elimination method with an elementary example. We want to solve the SLE:

2x1 + 4x2 + 4x3 = 2,
x1 − x2 + 2x3 = 2,
3x1 + x2 + 2x3 = 0.


This system can be written in the form (3.1), with

A = [ 2, 4, 4 ; 1, −1, 2 ; 3, 1, 2 ],   b = [ 2 ; 2 ; 0 ].

The Gaussian elimination method (or, the Gauss method, or Gaussian elimination, for short) proceeds by choosing the (1, 1) entry of the coefficient matrix A (which is the pivot entry of this step) and performs elementary row operations to make zero all the entries below the pivot entry. Then the process is iterated with the (2, 2) entry, then the (3, 3) entry, and so on. In the case of the previous system this goes as follows:

[ 2, 4, 4 | 2 ; 1, −1, 2 | 2 ; 3, 1, 2 | 0 ]
  −→ (F21(−1/2), F31(−3/2))   [ 2, 4, 4 | 2 ; 0, −3, 0 | 1 ; 0, −5, −4 | −3 ]
  −→ (F32(−5/3))              [ 2, 4, 4 | 2 ; 0, −3, 0 | 1 ; 0, 0, −4 | −14/3 ],

where Fij(α) is the elementary row operation that consists of replacing the ith row by the ith row plus α times the jth row, that is: Fij(α) ≡ Rowi + α Rowj ↦ Rowi. Let us just look at the coefficient matrix, forgetting for a moment about the right-hand side term b. If we recall that elementary row operations are equivalent to multiplying on the left by the corresponding elementary matrix, then the previous transformations on the matrix A can be written as matrix multiplications in the form:

(3.15) [ 1, 0, 0 ; 0, 1, 0 ; 0, −5/3, 1 ] · [ 1, 0, 0 ; −1/2, 1, 0 ; −3/2, 0, 1 ] · A = [ 2, 4, 4 ; 0, −3, 0 ; 0, 0, −4 ].

Multiplying by the inverses of the two lower triangular matrices in the left-hand side of (3.15), we arrive at

(3.16) A = [ 1, 0, 0 ; 1/2, 1, 0 ; 3/2, 0, 1 ] · [ 1, 0, 0 ; 0, 1, 0 ; 0, 5/3, 1 ] · [ 2, 4, 4 ; 0, −3, 0 ; 0, 0, −4 ].

(3.17) A =

1 0 012 1 032

53 1

2 4 40 −3 00 0 −4

.Now, if we denote

L =

1 0 012 1 032

53 1

, U =

2 4 40 −3 00 0 −4

,then Equation (3.17) gives

(3.18) A = LU, with

{L: lower triangular with 1’s in the diagonal,U : upper triangular.


� The matrix U is the upper triangular matrix obtained after applying Gaussian elimination to A, and the matrix L encodes in each column the multipliers used at each step to make zeroes below the corresponding pivot (with opposite signs).
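The factors above can be checked in MATLAB (a small verification of ours; note that MATLAB's lu command uses row pivoting, so its factors need not coincide with the L and U of (3.17), but they satisfy P*A = L2*U2):

    A = [2 4 4; 1 -1 2; 3 1 2];
    L = [1 0 0; 1/2 1 0; 3/2 5/3 1];
    U = [2 4 4; 0 -3 0; 0 0 -4];
    norm(A - L*U)              % should be 0 up to roundoff
    [L2, U2, P] = lu(A);       % MATLAB's own (pivoted) factorization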

The previous example illustrates the LU factorization and its connection with Gaussian elimination in a particular case. However, it is immediate to realize that this procedure may run into problems for a general matrix A. As we will see, not every n × n matrix can be decomposed in the form (3.18). It may happen that at some step of the Gaussian elimination, say the ith step, the (i, i) entry of the updated matrix is zero, so it cannot act as a pivot to make zeroes below it. Nonetheless, it is also easy to find a solution for this problem: just look at some nonzero entry in the ith column and below the ith row (namely, at the (i + j, i) position), and permute the ith row with the (i + j)th row. Note that we must permute with a row which is below the ith one, since otherwise we would place a nonzero entry below and to the left of the (i, i) entry, and we would lose the upper triangular structure of the resulting matrix. A row permutation is equivalent to multiplying on the left by a permutation matrix, by Lemma 1.8. Then, it is natural to replace (3.18) by

(3.19) A = PLU, with

P: permutation matrix,
L: lower triangular with 1’s on the diagonal,
U: upper triangular.

If a given n × n matrix A can be written in the form (3.19), then the following algorithm computes the solution of (3.1) by solving only triangular systems.

Algorithm 2: Solves (3.1) using Gaussian elimination.

1. Factorize A as in (3.19).
2. Solve PLy = b as Ly = P>b, using forward substitution.
3. Solve Ux = y by backward substitution.
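A compact MATLAB version of Algorithm 2 (a sketch of ours) uses the built-in lu command, which returns P*A = L*U, i.e. A = P>LU in the notation of (3.19):

    A = [2 4 4; 1 -1 2; 3 1 2];
    b = [2; 2; 0];
    [L, U, P] = lu(A);      % P*A = L*U
    y = L \ (P*b);          % forward substitution on the lower triangular system
    x = U \ y;              % backward substitution on the upper triangular system
    norm(A*x - b)           % residual; should be of the order of the unit roundoff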

Algorithm 2 provides a method to solve linear systems (3.1) provided that the coefficient matrix can be factorized in the form (3.19). The natural question now is: can every n × n invertible matrix be decomposed in the form (3.19)? Answering this question is the goal of the following section.

3.2.3. Existence and uniqueness of LU and PLU factorizations for invertible matrices. By an LU factorization of a matrix we mean a factorization like (3.18), and a PLU factorization is a factorization like (3.19).

Theorem 3.4. Let A ∈ Fn×n be invertible. If A has an LU factorization, then it is unique.

Proof. Let A = LU = L̃Ũ be two LU factorizations of A. Then, since A is invertible, both factors are invertible and L̃−1L = ŨU−1. Since L̃−1L is lower triangular and ŨU−1 is upper triangular, it must be L̃−1L = ŨU−1 = D, with D diagonal. But, since the main diagonal of L̃−1L contains only 1’s, it must be D = In, and this implies L = L̃, U = Ũ.

The following result characterizes the set of matrices having an LU factorization.

Theorem 3.5. An invertible matrix A ∈ Fn×n has an LU factorization if and only if the matrix A(1 : j, 1 : j) is invertible, for all j = 1, . . . , n.


Proof. ⇒ Let us assume that A has an LU factorization and, for a fixed 1 ≤ j ≤ n, let us partition it as

(3.20) $$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}\begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix},$$

with $A_{11} = A(1:j, 1:j)$. Then (3.20) implies that $A_{11} = L_{11}U_{11}$. Since A is invertible, both $L_{11}$ and $U_{11}$ must be invertible, so $A_{11}$ must be invertible as well.

⇐ Let us proceed by induction on n. For n = 1 the result is trivial. Let us assume the result true for n − 1, and let us partition A as

$$A = \begin{bmatrix} B & b \\ c^\top & a_{nn} \end{bmatrix},$$

with B = A(1 : n−1, 1 : n−1). By hypothesis, the leading principal submatrices of B coincide with those of A, so they are all invertible, and we can apply the induction hypothesis to B; hence B has an LU factorization, namely $B = L_B U_B$. We are going to see that there is an LU factorization of A of the form

$$A = \begin{bmatrix} L_B & 0 \\ \ell^\top & 1 \end{bmatrix}\begin{bmatrix} U_B & u \\ 0 & u_{nn} \end{bmatrix} = \begin{bmatrix} L_B U_B & L_B u \\ \ell^\top U_B & \ell^\top u + u_{nn} \end{bmatrix},$$

for some $\ell, u \in \mathbb F^{n-1}$. For this, just equate blocks in the previous identity to get

$$L_B u = b \;\Rightarrow\; u = L_B^{-1} b, \qquad \ell^\top U_B = c^\top \;\Rightarrow\; \ell^\top = c^\top U_B^{-1}, \qquad \ell^\top u + u_{nn} = a_{nn} \;\Rightarrow\; u_{nn} = a_{nn} - \ell^\top u,$$

where in the first two equations we have used that $L_B$ and $U_B$ are invertible, because B is invertible. As a consequence, A has an LU factorization.

Theorem 3.5 is in accordance with the intuition that matrices like $A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ do not have an LU factorization, since it is not possible to perform Gaussian elimination (without permutation of rows). However, the following result guarantees that, if we allow for such row permutations, then we can always get a PLU factorization.

Theorem 3.6. Let A ∈ Fn×n be an invertible matrix. Then A has a PLU factorization.

Proof. Let us prove it by induction on n. The proof is constructive and actually provides the standard algorithm to compute the LU factorization of A.

For n = 1, we just take P = L = 1, U = A ∈ F. Let us assume the result true for matrices of order n − 1.

Let A be of size n × n and let P1 be a permutation matrix such that $P_1A(1, 1) = a_{11}$ is nonzero. Such a P1 exists because A is invertible. Now we solve the block matrix equation

$$P_1 A = \begin{bmatrix} a_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ L_{21} & I_{n-1} \end{bmatrix}\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde A_{22} \end{bmatrix} = \begin{bmatrix} u_{11} & U_{12} \\ L_{21}u_{11} & L_{21}U_{12} + \widetilde A_{22} \end{bmatrix},$$

where $A_{21}, A_{12}^\top \in \mathbb F^{(n-1)\times 1}$, for the unknowns $L_{21} \in \mathbb F^{(n-1)\times 1}$, $u_{11} \in \mathbb F$, $U_{12} \in \mathbb F^{1\times(n-1)}$, and $\widetilde A_{22} \in \mathbb F^{(n-1)\times(n-1)}$ (note that $\widetilde A_{22}$ is not necessarily upper triangular). Equating above gives

(3.21) $$u_{11} = a_{11} \neq 0, \qquad U_{12} = A_{12}, \qquad L_{21}u_{11} = A_{21} \;\Rightarrow\; L_{21} = \frac{1}{a_{11}}A_{21}, \qquad \widetilde A_{22} = A_{22} - L_{21}U_{12} = A_{22} - \frac{1}{a_{11}}A_{21}A_{12}.$$


Since A is invertible,

$$0 \neq \det(P_1A) = u_{11}\det \widetilde A_{22} \;\Rightarrow\; \det \widetilde A_{22} \neq 0,$$

so $\widetilde A_{22}$ is invertible as well. Since it has size (n−1) × (n−1), we can apply the induction hypothesis, which guarantees that there is some permutation matrix $\widetilde P$ such that

(3.22) $$\widetilde P \widetilde A_{22} = \widetilde L\widetilde U.$$

As a consequence,

$$P_1A = \begin{bmatrix} 1 & 0 \\ L_{21} & I_{n-1} \end{bmatrix}\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde A_{22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ L_{21} & I_{n-1} \end{bmatrix}\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde P^\top\widetilde L\widetilde U \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ L_{21} & I_{n-1} \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & \widetilde P^\top\widetilde L \end{bmatrix}\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde U \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & \widetilde P^\top \end{bmatrix}\begin{bmatrix} 1 & 0 \\ \widetilde P L_{21} & \widetilde L \end{bmatrix}\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde U \end{bmatrix}.$$

Finally, adding up the first permutation we arrive at the PLU decomposition of A:

(3.23) $$\underbrace{\left(\begin{bmatrix} 1 & 0 \\ 0 & \widetilde P \end{bmatrix}P_1\right)}_{P}A = \underbrace{\begin{bmatrix} 1 & 0 \\ \widetilde P L_{21} & \widetilde L \end{bmatrix}}_{L}\underbrace{\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde U \end{bmatrix}}_{U},$$

so $A = P^\top LU$.

Exercise 10. Adapt the proof of Theorem 3.6 to prove that it is also true for any A ∈ Fn×n, invertible or not. What do L and U look like in the singular case?

Remark 7. Since the only operations used in the proof of Theorem 3.6 are sums, subtractions, multiplications, and divisions, the LU (and PLU) factorization is valid over any field F.

Remark 8. Note that a result similar to Theorem 3.6 is true for a factorization of the form AP = LU, that is, multiplication on the right by a permutation (equivalently, performing column operations instead of row operations). Moreover, we can get a decomposition A = P1LUP2, with P1, P2 being permutation matrices. This allows us to choose the pivot at the ith step not only from the ith column, but from the whole trailing submatrix with indices (i : n, i : n).

3.2.4. The LU algorithm using Gaussian elimination. The following practical comments on the proof of Theorem 3.6 are key to get the algorithm for computing the PLU factorization:

• Instead of A = PLU we apply the permutations on A, to get PA = LU, so we end up with an LU factorization of some matrix obtained from A by row permutation.

• At the ith step, the permutation (if any) only permutes rows from i to n. Moreover, we see from (3.23) that the successive permutations can be stored independently and applied to the whole matrix A at the end of the algorithm.

• At the ith step, once we have computed $\widetilde A_{22}$ using (3.21), the first row of U and the first column of L remain fixed until the end of the process, so they are the final row of U and the final column of L. This is iterated at each step, so at the ith step we get the ith row of U and the ith column of L.


• Up to permutations, we see from (3.23) that the first column of L and the first row of U at each step are computed as in (3.21), which is just one step of Gaussian elimination (3.14). More precisely, the rows of the matrix U are the rows of the upper triangular matrix obtained by Gaussian elimination, whereas the ith column of L encodes the multipliers used to make zeroes below the pivot at the ith step.

• Similarly, the columns of A with indices from 1 to i − 1 are not used anymore from the ith step on, so we can replace them by the columns of L obtained in steps 1 to i − 1, in order to save storage.

With all these observations in mind we are in the position of stating the algorithm to compute the LU factorization of A ∈ Fn×n.

Algorithm 3: LU factorization with Gaussian elimination.

% INPUT: A ∈ Fn×n invertible.
% OUTPUT: A permutation matrix P and a matrix A such that:
%         U = triu(A), L = In + tril(A,−1), and PA = LU

P = In
for i = 1 : n − 1
    Permute (if needed) the ith and the jth rows of A, with j ≥ i, so that A(i, i) ≠ 0
    Exchange the ith and jth rows in P
    A(i+1 : n, i) = A(i+1 : n, i)/A(i, i)
    A(i+1 : n, i+1 : n) = A(i+1 : n, i+1 : n) − A(i+1 : n, i) · A(i, i+1 : n)
end

Though in this course we do not pay special attention to storage requirements, Algorithm 3 only requires 2n² storage, since at each step it works with a couple of n × n matrices, namely P and A.

� Algorithm 3 is a very simple algorithm. Note that, forgetting about permutations, it just contains two lines of code!!
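The following is a minimal NumPy sketch of Algorithm 3 (an illustrative translation, not a library routine): it overwrites a copy of A with the L and U factors and records the row exchanges, choosing the pivot row by the partial pivoting rule of Section 3.2.7.

```python
import numpy as np

def plu(A):
    """Sketch of Algorithm 3: returns a permutation p and a matrix LU with
    U = triu(LU) and L = I + tril(LU, -1), so that A[p, :] = L @ U."""
    LU = np.array(A, dtype=float, copy=True)
    n = LU.shape[0]
    p = np.arange(n)                                   # encodes the permutation matrix P
    for i in range(n - 1):
        j = i + np.argmax(np.abs(LU[i:, i]))           # pivot: largest modulus in column i
        if j != i:
            LU[[i, j], :] = LU[[j, i], :]              # exchange rows i and j of A ...
            p[[i, j]] = p[[j, i]]                      # ... and record it in P
        LU[i+1:, i] /= LU[i, i]                                  # multipliers (column i of L)
        LU[i+1:, i+1:] -= np.outer(LU[i+1:, i], LU[i, i+1:])     # update the trailing block
    return p, LU

A = np.array([[2., 4., 4.], [1., -1., 2.], [3., 1., 2.]])
p, LU = plu(A)
L, U = np.eye(3) + np.tril(LU, -1), np.triu(LU)
print(np.allclose(A[p], L @ U))   # True
```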

3.2.5. Computational cost of solving (3.1) using the LU factorization. We want to estimate the computational cost of Algorithm 2, which involves Algorithm 3 for computing the PLU factorization of A. It consists of three steps:

Step 1. Compute P, L, U as in (3.19) using Algorithm 3.
Step 2. Solve PLy = b.
Step 3. Solve Ux = y.

In the estimation of the computational complexity we will not take the permutations into account, and just focus on counting the number of flops.

Let us start by counting the number of flops of Algorithm 3. At the ith step, the first update line involves n − i divisions, and the second one involves (n − i)² products and (n − i)² subtractions. Then, the cost of Algorithm 3 is:

Cost of Algorithm 3:

$$\sum_{i=1}^{n-1}\bigl(2(n-i)^2 + (n-i)\bigr) = 2\sum_{j=1}^{n-1}j^2 + \sum_{j=1}^{n-1}j = 2\,\frac{(n-1)n(2n-1)}{6} + \frac{(n-1)n}{2} = \frac{(n-1)n(4n+1)}{6} = \frac23 n^3 + O(n^2).$$


The cost of Step 2 is n² − n flops (exact), and the cost of Step 3 is n² flops (exact) (see the cost of Algorithm 1).

Then, adding up:

� The cost of Algorithm 2 (solving (3.1) with Gaussian elimination / LU factorization) is $\frac23 n^3 + O(n^2)$.

3.2.6. Backward error analysis of Gaussian elimination. Let us start with the following result, whose proof is beyond the scope of this course.

Theorem 3.7. Let A ∈ Fn×n be invertible and $\widehat L, \widehat U$ be the factors computed through Algorithm 3 in a machine with unit roundoff u. Then there is some matrix E such that

(3.24) $$A + E = \widehat L\widehat U, \quad\text{and}\quad |E| \le (nu + O(u^2))\,|\widehat L|\cdot|\widehat U|.$$

As a consequence of Theorem 3.7 we get the following result for the solution of (3.1) using Algorithm 2.

Theorem 3.8. Let A ∈ Fn×n be invertible and b ∈ Fn. Let us assume that the SLE (3.1) is solved using Algorithm 2 in a computer with unit roundoff u. Then the computed solution $\widehat x$ satisfies

(3.25) $$(A + \Delta A)\widehat x = b, \quad\text{with}\quad |\Delta A| \le (3nu + O(u^2))\,|\widehat L|\cdot|\widehat U|,$$

with $\widehat L, \widehat U$ being the computed factors of the LU factorization of A from Algorithm 3.

Exercise 11. Prove Theorem 3.8 using Theorem 3.7.

Some relevant observations on Theorems 3.7 and 3.8 are:

• Taking norms ‖ · ‖ (absolute norms, so that ‖ |A| ‖ = ‖A‖) in both (3.24) and (3.25) we arrive at

$$\|E\| \le (nu + O(u^2))\,\|\widehat L\|\cdot\|\widehat U\|$$

in (3.24), and

(3.26) $$\|\Delta A\| \le (3nu + O(u^2))\,\|\widehat L\|\cdot\|\widehat U\|$$

from (3.25). This gives us a bound for the backward error of Algorithm 3 and Algorithm 2, respectively.

• The previous bounds do not guarantee that computing the LU factorization using Algorithm 3 is backward stable. For that, we should guarantee that the computed factors $\widehat L, \widehat U$ are the LU factors of a nearby matrix A + ∆A, that is, of some ∆A such that

$$\|\Delta A\| \le O(u)\,\|A\| \ \text{(normwise)}, \quad\text{or}\quad |\Delta A| \le O(u)\,|A| \ \text{(componentwise)}.$$

Hence, the bound (3.24) does not guarantee that Algorithm 3 is backward stable. In Section 3.2.8 we will analyze this issue in more depth. First, we are going to motivate it by pointing at the place where backward stability may fail.


3.2.7. The need for an appropriate pivoting strategy. Let us consider the following example. Set

$$A = \begin{bmatrix} 10^{-20} & 1 \\ 1 & 1 \end{bmatrix}.$$

The exact LU factorization of A is

$$L = \begin{bmatrix} 1 & 0 \\ 10^{20} & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} 10^{-20} & 1 \\ 0 & 1 - 10^{20} \end{bmatrix}.$$

However, if we apply Algorithm 3 without row permutations (since there is no need for them), then we get

$$\widehat L = \begin{bmatrix} 1 & 0 \\ \mathrm{fl}(10^{20}) & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 10^{20} & 1 \end{bmatrix}, \qquad \widehat U = \begin{bmatrix} 10^{-20} & 1 \\ 0 & \mathrm{fl}(1 - 10^{20}) \end{bmatrix} = \begin{bmatrix} 10^{-20} & 1 \\ 0 & -10^{20} \end{bmatrix}.$$

You can check in MATLAB (with double precision) that fl(10^20) = 10^20 and fl(1 − 10^20) = −10^20. Then this gives

$$\widehat L\widehat U = \begin{bmatrix} 10^{-20} & 1 \\ 1 & 0 \end{bmatrix}.$$

As a consequence,

$$\Delta A = A - \widehat L\widehat U = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix},$$

so the backward error in the ∞-norm of this computed LU factorization is

$$\frac{\|\Delta A\|_\infty}{\|A\|_\infty} = \frac12 \neq O(u).$$

This is a huge backward error, which indicates that Algorithm 3 is not backward stable.
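This failure is easy to reproduce numerically. The sketch below (a minimal NumPy experiment) builds the computed factors of the example in double precision and evaluates the backward error of the factorization, as well as the useless solution of the system Ax = [1, 0]> discussed next.

```python
import numpy as np

A = np.array([[1e-20, 1.0],
              [1.0,   1.0]])

# Computed factors of Gaussian elimination *without* pivoting (as in the text):
L_hat = np.array([[1.0,  0.0], [1e20, 1.0]])
U_hat = np.array([[1e-20, 1.0], [0.0, -1e20]])      # fl(1 - 1e20) = -1e20

dA = A - L_hat @ U_hat
print(dA)                                                        # ~[[0, 0], [0, 1]]
print(np.linalg.norm(dA, np.inf) / np.linalg.norm(A, np.inf))    # 0.5, nowhere near u ~ 1e-16

# Using these factors to solve Ax = [1, 0]^T gives a useless answer:
b = np.array([1.0, 0.0])
y = np.linalg.solve(L_hat, b)
x_hat = np.linalg.solve(U_hat, y)
print(x_hat)                         # [0, 1]; the exact solution is close to [-1, 1]
```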

Furthermore, let us apply this LU factorization to solve a particular SLE using Algorithm 2. Let us consider, for instance,

$$Ax = \begin{bmatrix} 1 \\ 0 \end{bmatrix},$$

with A as above. The exact solution is

(3.27) $$x = \frac{1}{1 - 10^{-20}}\begin{bmatrix} -1 \\ 1 \end{bmatrix}.$$

However, if we follow Algorithm 2 with the computed factors $\widehat L, \widehat U$, then we first solve

$$\widehat L y = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \;\Rightarrow\; y = \begin{bmatrix} 1 \\ -10^{20} \end{bmatrix},$$

and then we solve

(3.28) $$\widehat U\widehat x = y \;\Rightarrow\; \widehat x = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$


Comparing x and $\widehat x$ in (3.27) and (3.28) we realize that there is a huge error, since the computed solution has nothing to do with the exact one (they point in completely different directions!).

Let us explain what has happened in this case. The problem arose when approximating fl(1 − 10^20) ≡ fl(a22 − 10^20) = −10^20 (actually, we would have the same problem with any value of a22 for which fl(a22 − 10^20) = −10^20).

Exercise 12. Describe the set of real numbers 0 < a < 10^20 such that fl(a − 10^20) = −10^20 in a floating point system with unit roundoff u = 10^−16.

This problem arises because some entries of L and/or U are huge compared to the entries of A. Then, when rounding these numbers, we lose all information about some of the entries of A, and the product of the computed factors $\widehat L, \widehat U$ is completely different from A in those entries.

How can we avoid this problem? The idea is to keep the size of the entries of L and U under control, relative to the size of the entries of A. But the question is: how can we do this?

Looking at Algorithm 3 we see that there are different choices for the permutation P. In the preceding example, we have not performed any permutation, since the (1, 1) entry of A is nonzero. This has produced a huge number in the (2, 1) position of L. But this can be avoided if we exchange the first and second rows, namely, starting with

$$PA = \begin{bmatrix} 1 & 1 \\ 10^{-20} & 1 \end{bmatrix}.$$

Now, the exact LU factorization of PA is

$$PA = \underbrace{\begin{bmatrix} 1 & 0 \\ 10^{-20} & 1 \end{bmatrix}}_{L}\underbrace{\begin{bmatrix} 1 & 1 \\ 0 & 1 - 10^{-20} \end{bmatrix}}_{U},$$

and the computed factors would be

$$\widehat L = \begin{bmatrix} 1 & 0 \\ 10^{-20} & 1 \end{bmatrix}, \qquad \widehat U = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.$$

Now we have

$$\Delta A = \widehat L\widehat U - PA = \begin{bmatrix} 0 & 0 \\ 0 & 10^{-20} \end{bmatrix},$$

so the backward error of the LU factorization is

$$\frac{\|\Delta A\|_\infty}{\|PA\|_\infty} = \frac{10^{-20}}{2} = O(u).$$

The underlying problem here is that Algorithm 3 does not specify how to permute the rows, at each step, if there is more than one admissible nonzero pivot in the ith column. This should be specified in the algorithm.

The way of choosing how the rows are permuted in Algorithm 3 is known as a pivoting strategy. There are several options, but here we are only going to consider the following one:


Partial pivoting strategy: At the ith step of Algorithm 3, choose a permutation Pi such that (PiA)(i, i) is the entry with largest modulus in A(i : n, i).

When we use the partial pivoting strategy in Algorithm 3 we end up with what is known as Gaussian elimination with partial pivoting.

3.2.8. Is Algorithm 3 with partial pivoting backward stable? Algorithm 3 is currently the standard algorithm (with some improvements) to compute the LU factorization of a given matrix A ∈ Fn×n. Even though it is widely used and works well in practice, it is still an open question to determine whether it is backward stable or not. Though this answers the question in the title of this section, we are going to explain the natural attempts to analyze the backward stability of Algorithm 3.

Looking at (3.26), Algorithm 2 would be normwise backward stable if

(3.29) ‖L‖ · ‖U‖ = O(‖A‖).

Gaussian elimination with partial pivoting ensures that |L(i, j)| ≤ 1, for i ≥ j, which

implies ‖L‖∞ ≤ n, and (3.25) gives (in the ∞-norm):

(3.30) ‖∆A‖∞ ≤ (3n2u+O(u2))‖U‖∞.

Therefore, the problem reduces to analyze what happens with ‖U‖∞. In order toachieve (3.29), and following (3.30), we need to analyze the ratio between ‖A‖∞ and

‖U‖∞. This leads to the following definition.

Definition 3.9. (Growth factor). Let A ∈ Fn×n be invertible and let $\widehat U$ be its LU factor computed using Gaussian elimination with partial pivoting. Then the growth factor of A is

$$g_{pp}(A) := \frac{\max_{i,j}|\widehat U(i, j)|}{\max_{i,j}|A(i, j)|}.$$

Note that $\frac1n\|\widehat U\|_\infty \le \max_{i,j}|\widehat U(i, j)| = (\max_{i,j}|A(i, j)|)\cdot g_{pp}(A) \le g_{pp}(A)\,\|A\|_\infty$. As a consequence, using (3.26) and Theorem 3.8, Gaussian elimination with partial pivoting provides a computed solution $\widehat x$ of (3.1) such that

$$(A + \Delta A)\widehat x = b, \quad\text{with}\quad \|\Delta A\|_\infty \le (3n^2u + O(u^2))\,g_{pp}(A)\,\|A\|_\infty.$$

Then, in order to guarantee backward stability of Gaussian elimination with partial pivoting, we should guarantee that the growth factor $g_{pp}(A)$ is not too big. The following result establishes a bound on the growth factor.

Proposition 3.10. If $g_{pp}(A)$ is the growth factor of Gaussian elimination with partial pivoting applied to a matrix A ∈ Fn×n, then

$$g_{pp}(A) \le 2^{n-1}.$$

Exercise 13. Prove Proposition 3.10.

Unfortunately, the bound in Proposition 3.10 is sharp, since there are n × n matrices for which it is attained.


Exercise 14. Find a matrix Mn, of size n × n, such that $g_{pp}(M_n) = 2^{n-1}$, and prove that, indeed, $g_{pp}(M_n) = 2^{n-1}$. (Warning: Do not try to find Mn by yourself! Look at the bibliography.)

From Theorems 3.7 and 3.8, Equation (3.30), and using Proposition 3.10, we can conclude the following.

Theorem 3.11. Let A ∈ Fn×n be invertible and $\widehat L, \widehat U$ be the factors computed through Algorithm 3 with partial pivoting in a machine with unit roundoff u. Then there is some matrix E such that

$$A + E = \widehat L\widehat U, \quad\text{and}\quad \|E\|_\infty \le (n^2 2^{n-1}u + O(u^2))\,\|A\|_\infty.$$

As a consequence of Theorem 3.11 we get the following result for the solution of (3.1) using Algorithm 2.

Theorem 3.12. Let A ∈ Fn×n be invertible and b ∈ Fn. Let us assume that the SLE (3.1) is solved using Algorithm 2 (with partial pivoting) in a computer with unit roundoff u. Then the computed solution $\widehat x$ satisfies

(3.31) $$(A + \Delta A)\widehat x = b, \quad\text{with}\quad \|\Delta A\|_\infty \le (3n^2 2^{n-1}u + O(u^2))\,\|A\|_\infty,$$

with $\widehat L, \widehat U$ being the computed factors of the LU factorization of A from Algorithm 3.

Theorems 3.11 and 3.12 say, in some sense, that the LU factorization and Gaussian elimination with partial pivoting are backward stable, since the factors $n^2 2^{n-1}$ and $3n^2 2^{n-1}$ depend only on the size of the problem, n. However, they are not polynomial in n, but exponential, so they are huge for large values of n. Therefore, for n large enough, these bounds cannot guarantee a single digit of accuracy in the computations. Showing this is the goal of the next exercise.

Exercise 15. Let n = 100 and assume that we apply Gaussian elimination (Algorithm 2) with partial pivoting to solve (3.1) in a machine with IEEE double precision (unit roundoff u = 10^−16).

(a) Estimate the bound in (3.31).
(b) Theorem 2.8 indicates that

$$\frac{\|x - \widehat x\|}{\|x\|} \le (3n^2 2^{n-1}u + O(u^2))\,\kappa_\infty(A).$$

Assume κ∞(A) = 1 (which is a very optimistic assumption!). Then compute the bound for the relative error in the solution given by the inequality above. How many digits of accuracy in the computed solution $\widehat x$ are guaranteed by this bound?

The value n = 100 in Exercise 15 is just a moderate value (the standard sizes of matrices in applications are of the order of thousands). Therefore, the situation can be even much worse than the one in this exercise. However, it is very rare in practice that $g_{pp}$ is as big as $2^{n-1}$.

One of the most challenging open problems concerning the LU factorization is to provide an average estimation of the growth factor. In practice, $g_{pp}(A) \lesssim \sqrt n$, and Gaussian elimination with partial pivoting is considered backward stable in practice.
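The following minimal experiment (assuming NumPy/SciPy; the sample of Gaussian random matrices is only illustrative) estimates the growth factor of partial pivoting on random matrices, which typically stays far below the worst-case bound $2^{n-1}$.

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(1)
for n in (50, 100, 200):
    growth = []
    for _ in range(20):
        A = rng.standard_normal((n, n))
        _, _, U = lu(A)                                # partial pivoting: A = P @ L @ U
        growth.append(np.abs(U).max() / np.abs(A).max())
    print(n, max(growth))      # typically O(sqrt(n)), nowhere near 2**(n-1)
```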


HIGHLIGHTS:

� The standard algorithm to solve an SLE (3.1) is Gaussian elimination. This is equivalent to computing the LU factorization of A and then solving two triangular systems.

� Every square matrix A has an LU factorization after permuting adequately the rows of A.

� The pivot at each step is not unique. There are several strategies to choose it, but the standard one is known as partial pivoting, which chooses the entry with largest modulus among all possible pivots that can be obtained by row permutations.

� The computational complexity of Gaussian elimination is $\frac23 n^3 + O(n^2)$.

� Though Gaussian elimination with partial pivoting has not been proved to be backward stable, it is considered backward stable in practice.

3.3. Positive definite matrices and Cholesky factorization. The Gaussian elimination method admits many refinements when applied to matrices enjoying special structures. Among these structures, one of particular relevance is that of symmetric positive definite matrices. These matrices appear frequently in applications in different areas of science. Besides its algorithmic simplifications, the refined Gaussian elimination method has an important property when applied to symmetric positive definite matrices, namely, it is backward stable. This fact is really surprising, because the algorithm is no more than an adjustment of Gaussian elimination to the particular structure.

3.3.1. Properties of symmetric positive definite and symmetric positive semidefinite matrices. Let us start with the basic definition of these kinds of matrices.

Definition 3.13. (Symmetric positive definite and symmetric positive semidefinite matrix). A matrix A ∈ Rn×n is symmetric positive definite (respectively, symmetric positive semidefinite) if

(i) A = A⊤, and
(ii) x⊤Ax > 0 (resp. x⊤Ax ≥ 0), for all x ∈ Rn with x ≠ 0.

Condition (i) in Definition 3.13 is the symmetry condition, and condition (ii) is the positive definite (resp. positive semidefinite) condition.

In the following, we will use the abbreviations spd (symmetric positive definite) and sps (symmetric positive semidefinite). We will also use the following properties of spd and sps matrices.

Lemma 3.14.
(i) If X ∈ Rn×n is invertible, then A is spd (resp., sps) if and only if X⊤AX is spd (resp., sps).
(ii) If A is spd (resp., sps) and H := A(i : j, i : j) is a principal submatrix of A, then H is spd (resp., sps).
(iii) If A is spd (resp., sps) then
(iii-a) A(i, i) > 0 (resp., A(i, i) ≥ 0).
(iii-b) $|A(i, j)| < \sqrt{A(i, i)A(j, j)}$ (resp., $|A(i, j)| \le \sqrt{A(i, i)A(j, j)}$) if i ≠ j.


(iii-c) $\max_{i,j}|A(i, j)| \le \max_i A(i, i)$.
(iii-d) $\max_{i\neq j}|A(i, j)| < \max_i A(i, i)$.

Exercise 16. Prove Lemma 3.14.

3.3.2. Cholesky factorization. The following result is the analogue of the LU decomposition for spd and sps matrices.

Theorem 3.15. (Cholesky factorization). A matrix A ∈ Rn×n is spd (resp., sps) if and only if A can be factorized as

(3.32) $$A = LL^\top,$$

for some lower triangular matrix L ∈ Rn×n with positive (resp., nonnegative) elements on the main diagonal. Moreover, if A is spd, then the factorization (3.32) is unique.

The factorization (3.32) is called the Cholesky factorization of A, and L is called the Cholesky factor of A.

Proof. ⇐ Let A = LL⊤. Then $x^\top Ax = x^\top LL^\top x = (L^\top x)^\top(L^\top x) = \|L^\top x\|_2^2$, which is ≥ 0 for all x and, when the diagonal of L is positive (so that L⊤ is invertible and L⊤x ≠ 0), > 0 for all x ≠ 0. Hence A is sps, respectively spd.

⇒ Let us first prove the existence of the factorization (3.32). We proceed by induction on n. The case n = 1 is trivial, since $A = \sqrt A\sqrt A \equiv LL^\top$. Let us assume the result true for matrices of size (n − 1) × (n − 1).

Let A ∈ Rn×n be spd. Using Gaussian elimination (3.21) we can decompose A as

(3.33) $$A = \begin{bmatrix} a_{11} & A_{21}^\top \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \frac{1}{a_{11}}A_{21} & I_{n-1} \end{bmatrix}\begin{bmatrix} a_{11} & A_{21}^\top \\ 0 & A_{22} - \frac{1}{a_{11}}A_{21}A_{21}^\top \end{bmatrix} = \begin{bmatrix} \sqrt{a_{11}} & 0 \\ \frac{A_{21}}{\sqrt{a_{11}}} & I_{n-1} \end{bmatrix}\begin{bmatrix} \sqrt{a_{11}} & \frac{A_{21}^\top}{\sqrt{a_{11}}} \\ 0 & A_{22} - \frac{A_{21}A_{21}^\top}{a_{11}} \end{bmatrix} = \begin{bmatrix} \sqrt{a_{11}} & 0 \\ \frac{A_{21}}{\sqrt{a_{11}}} & I_{n-1} \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & A_{22} - \frac{A_{21}A_{21}^\top}{a_{11}} \end{bmatrix}\begin{bmatrix} \sqrt{a_{11}} & \frac{A_{21}^\top}{\sqrt{a_{11}}} \\ 0 & I_{n-1} \end{bmatrix}$$

(note that, by Lemma 3.14(iii-a), a11 > 0). Now, using Lemma 3.14, the submatrix $A_{22} - \frac{A_{21}A_{21}^\top}{a_{11}}$ is spd, so, applying the induction hypothesis, there is some $\widetilde L$ in the conditions of the statement such that $A_{22} - \frac{A_{21}A_{21}^\top}{a_{11}} = \widetilde L\widetilde L^\top$. Replacing this in (3.33) we arrive at

(3.34) $$A = \underbrace{\begin{bmatrix} \sqrt{a_{11}} & 0 \\ \frac{A_{21}}{\sqrt{a_{11}}} & \widetilde L \end{bmatrix}}_{L}\underbrace{\begin{bmatrix} \sqrt{a_{11}} & \frac{A_{21}^\top}{\sqrt{a_{11}}} \\ 0 & \widetilde L^\top \end{bmatrix}}_{L^\top},$$

which is a decomposition of the form (3.32).

If A is sps, then the arguments are exactly the same as in the spd case if a11 ≠ 0. If a11 = 0, Lemma 3.14(iii-b) implies that A21 = 0 in (3.33), so A21⊤ = 0 as well, and then $A = \begin{bmatrix} 0 & 0 \\ 0 & A_{22} \end{bmatrix}$, with A22 being sps by Lemma 3.14(ii). Then we can apply the induction hypothesis to A22, to get

$$A = \begin{bmatrix} 0 & 0 \\ 0 & A_{22} \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & \widetilde L\widetilde L^\top \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & \widetilde L \end{bmatrix}\begin{bmatrix} 0 & 0 \\ 0 & \widetilde L^\top \end{bmatrix},$$

which is a decomposition like (3.32).


As for the uniqueness of (3.32), we proceed by showing that equation (3.32) has a unique solution, provided that L is lower triangular with positive diagonal. Therefore, we just need to solve for the entries (i, j) with i ≥ j. Equating the (i, j) entries in (3.32) we arrive at

(3.35) $$A(i, j) = \sum_{k=1}^{j} L(i, k)(L^\top)(k, j) = \sum_{k=1}^{j} L(i, k)L(j, k).$$

Now, for i = j = 1 in (3.35) we get $A(1, 1) = L(1, 1)^2 \Rightarrow L(1, 1) = \sqrt{A(1, 1)}$, and for j = 1, i ≥ 2 in (3.35) we get $L(i, 1) = \frac{A(i, 1)}{L(1, 1)}$. This uniquely determines the first column of L. Now, let us assume that we have uniquely determined the first j − 1 columns of L. Then (3.35) gives

$$A(j, j) = \sum_{k=1}^{j} L(j, k)^2 \;\Rightarrow\; L(j, j) = \sqrt{A(j, j) - \sum_{k=1}^{j-1} L(j, k)^2},$$

which is uniquely determined. Moreover, for i > j, (3.35) also gives

$$L(i, j)L(j, j) = A(i, j) - \sum_{k=1}^{j-1} L(i, k)L(j, k) \;\Rightarrow\; L(i, j) = \frac{A(i, j) - \sum_{k=1}^{j-1} L(i, k)L(j, k)}{L(j, j)},$$

which is again uniquely determined. As a consequence, the jth column of L is also uniquely determined.

Remark 9. The Cholesky factorization (3.32) for spd matrices is the analogous result to the LU factorization for general matrices. Note, however, that (3.32) requires just one matrix (L), instead of two (L, U) in (3.18). This is completely natural and expected, due to the symmetry of A.

The uniqueness claim in (3.32) is no longer true for sps matrices. As a counterexample, consider the following two different decompositions (3.32) of the same sps matrix:

$$\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & \sqrt2 \end{bmatrix}\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & \sqrt2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}.$$

Equation (3.33) provides the following algorithm to compute the Cholesky factor of A. It is exactly the same as Gaussian elimination (Algorithm 3), without permutations and with an additional line for the diagonal entry.

Algorithm 4: Cholesky factorization.

% INPUT: A ∈ Rn×n spd.
% OUTPUT: L such that A = LL⊤

for i = 1 : n
    A(i, i) = √A(i, i)
    A(i+1 : n, i) = A(i+1 : n, i)/A(i, i)
    A(i+1 : n, i+1 : n) = tril(A(i+1 : n, i+1 : n) − A(i+1 : n, i) · A(i+1 : n, i)⊤)
end
L = tril(A)

Computational cost: Since Algorithm 4 performs roughly half of the operations in Algorithm 3, the computational cost of the Cholesky factorization is $\frac13 n^3 + O(n^2)$ flops.
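A minimal NumPy sketch of Algorithm 4 follows (illustrative only; in practice one would call a library routine such as `numpy.linalg.cholesky`).

```python
import numpy as np

def cholesky(A):
    """Sketch of Algorithm 4: returns lower triangular L with A = L @ L.T."""
    A = np.array(A, dtype=float, copy=True)
    n = A.shape[0]
    for i in range(n):
        A[i, i] = np.sqrt(A[i, i])                            # diagonal entry of L
        A[i+1:, i] /= A[i, i]                                 # column i of L
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i+1:, i])     # Schur complement update
    return np.tril(A)

# quick check on a random spd matrix
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)                    # spd by construction
L = cholesky(A)
print(np.allclose(A, L @ L.T))                 # True
print(np.allclose(L, np.linalg.cholesky(A)))   # matches the library factor
```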


3.3.3. Backward stability of the Cholesky factorization. The most important property of Algorithm 4 is that it is backward stable. This is in stark contrast with Algorithm 3 for general n × n matrices. In order to approach this result, we first state the following theorem, which is just the analogue of Theorem 3.7.

Theorem 3.16. Let A ∈ Rn×n be an spd matrix, and let us assume that Algorithm 4 produces a matrix $\widehat L$ in a machine with unit roundoff u. Then

(3.36) $$\widehat L\widehat L^\top = A + E, \quad\text{with}\quad |E| \le ((n+1)u + O(u^2))\,|\widehat L|\cdot|\widehat L^\top|.$$

Similarly, the following result is the analogue of Theorem 3.8.

Theorem 3.17. Let A ∈ Rn×n be an spd matrix, and let $\widehat L$ be its Cholesky factor computed with Algorithm 4 in a machine with unit roundoff u. Let $\widehat x$ be the computed solution of the SLE (3.1), obtained using the Cholesky factorization and then solving two triangular systems. Then, there is some matrix ∆A such that

$$(A + \Delta A)\widehat x = b, \quad\text{with}\quad |\Delta A| \le ((3n+1)u + O(u^2))\,|\widehat L|\cdot|\widehat L^\top|.$$

So far, there is nothing new concerning the backward error for linear systems with spd coefficient matrices compared to general systems, as should be expected. However, the following key inequality makes the difference:

(3.37) $$\| |\widehat L|\cdot|\widehat L^\top| \|_2 \le \| |\widehat L| \|_2^2 \le \| |\widehat L| \|_F^2 = \|\widehat L\|_F^2 \le n\|\widehat L\|_2^2 = n\|A + E\|_2,$$

with E as in (3.36). Now, joining (3.36) and (3.37),

(3.38) $$\|E\|_2 \le \| |E| \|_2 \le ((n+1)u + O(u^2))\,\| |\widehat L|\cdot|\widehat L^\top| \|_2 \le ((n+1)u + O(u^2))\,n\,(\|A\|_2 + \|E\|_2),$$

and solving for ‖E‖2 we arrive at

(3.39) $$\|E\|_2 \le (n(n+1)u + O(u^2))\,\|A\|_2.$$

Exercise 17. Obtain (3.39) from (3.38).

Replacing (3.39) in (3.36) we obtain the main results of this section, which guarantee backward stability of Algorithm 4 and of the solution of (3.1), with A being spd, using the Cholesky factorization.

Theorem 3.18. (Computing Cholesky with Gaussian elimination is backward stable). Let A ∈ Rn×n be an spd matrix, and let us assume that Algorithm 4 produces an output matrix $\widehat L$ in a machine with unit roundoff u. Then

$$\widehat L\widehat L^\top = A + E, \quad\text{with}\quad \|E\|_2 \le (n(n+1)u + O(u^2))\,\|A\|_2.$$

Theorem 3.19. (Solving an SLE with Cholesky is backward stable). Let A ∈ Rn×n be an spd matrix, and let $\widehat L$ be its Cholesky factor computed with Algorithm 4 in a machine with unit roundoff u. Let $\widehat x$ be the computed solution of the SLE (3.1), obtained using the Cholesky factorization and then solving two triangular systems. Then, there is some matrix ∆A such that

$$(A + \Delta A)\widehat x = b, \quad\text{with}\quad \|\Delta A\|_2 \le ((3n+1)nu + O(u^2))\,\|A\|_2.$$


HIGHLIGHTS:

� For symmetric positive definite matrices the Cholesky factorization plays the role of the LU factorization.

� The computational cost of computing the Cholesky factorization is $\frac13 n^3 + O(n^2)$.

� The Cholesky algorithm for positive definite matrices is backward stable.


4. Lesson 3: The QR factorization.

4.1. The importance of unitary/orthogonal matrices. Another important class of matrices is the following.

Definition 4.1. (Unitary and orthogonal matrix). A matrix U ∈ Cn×n is said to be
• unitary if U∗U = In,
• orthogonal if U⊤U = In.

Unitary matrices play a relevant role in matrix computations, due essentially to the following properties.

Lemma 4.2. Let U, V be unitary matrices of appropriate sizes. Then
(i) ‖U‖2 = 1.
(ii) ‖UAV‖2 = ‖A‖2, for any matrix A of appropriate size.
(iii) UV is also a unitary matrix.

Exercise 18. Prove Lemma 4.2.

Real orthogonal matrices are the counterpart of (complex) unitary matrices over the real field. In particular, all claims in Lemma 4.2 can be directly translated to real orthogonal matrices replacing ∗ by ⊤. For this reason, orthogonal matrices are especially relevant over the real field. Note, however, that orthogonal matrices can also have complex entries.

The relevance of unitary matrices in matrix computations relies on claim (ii) in Lemma 4.2. Most problems in NLA involve matrix multiplications. Think, for instance, of Gaussian elimination: in order to get the upper triangular matrix U from a given matrix A we need to perform elementary row operations, which consist, as we have seen, of multiplying A on the left by elementary matrices. The problem with matrix multiplications is that they can significantly alter the condition number of the original matrix (consider, for instance, the trivial example A = In and multiply it by another n × n matrix with large condition number). However, property (ii) in Lemma 4.2 guarantees that this does not happen with unitary matrices and the 2-norm. For this reason, and from the point of view of conditioning, working with unitary (or real orthogonal) matrices is preferable to working with general matrices.

4.2. QR factorization. In the previous lesson we reviewed the standard way to solve linear systems (3.1) (Gaussian elimination), which consists of transforming the coefficient matrix A into an upper triangular matrix using elementary row operations. These transformations are not unitary (which means that the L matrix in the LU factorization is not unitary). This fact is, precisely, the one that prevents the LU factorization from being backward stable. More precisely, the norm of the matrix U obtained after applying these elementary transformations may be much larger than that of the original matrix A. Therefore, one may wonder whether it is possible to achieve an upper triangular form using unitary transformations. The answer is yes, and it is given by the QR factorization. As in the previous lesson, we focus on the real case (namely, F = R).

Theorem 4.3. (QR factorization). Every invertible matrix A ∈ Rn×n has a factorization A = QR, where Q ∈ Rn×n is an orthogonal matrix and R ∈ Rn×n is an upper triangular matrix. The factorization is unique if R has positive diagonal elements.


Proof. The proof is essentially the one in [4, Fact 3.32]. Let us first prove the existence of a QR factorization. Since A is nonsingular, Ax ≠ 0 for every x ≠ 0, so $x^\top A^\top Ax = \|Ax\|_2^2 > 0$. Since the matrix M = A⊤A is symmetric, the previous inequality implies that it is spd. Then, by Theorem 3.15, there is a lower triangular matrix L with positive diagonal elements such that M = LL⊤. Moreover, L is invertible, because M is. Then M = A⊤A = LL⊤. Multiplying on the left by $A^{-\top} := (A^{-1})^\top$ gives A = QR, with $Q = A^{-\top}L$ and $R = L^\top$, which is upper triangular with positive diagonal entries. It remains to prove that Q defined in this way is orthogonal. For this, let us compute

$$Q^\top Q = (A^{-\top}L)^\top A^{-\top}L = L^\top A^{-1}A^{-\top}L = L^\top(A^\top A)^{-1}L = L^\top M^{-1}L = L^\top(LL^\top)^{-1}L = L^\top(L^{-\top}L^{-1})L = (L^\top L^{-\top})(L^{-1}L) = I_n,$$

so Q is, indeed, orthogonal.

Now, let us prove the uniqueness. Let us assume that $A = QR = \widetilde Q\widetilde R$ are two factorizations of A as in the statement. Then $S := Q^{-1}\widetilde Q = R\widetilde R^{-1}$. The matrix S is both orthogonal, by Lemma 4.2(iii), and upper triangular with positive diagonal elements. Now Exercise 19 implies that S = In, which in turn implies $Q = \widetilde Q$ and $R = \widetilde R$.

Exercise 19. Prove that if G ∈ Rn×n is both orthogonal and upper triangular with positive diagonal elements, then it must be G = In. Why is this no longer true if we remove the condition of having positive diagonal elements?

The QR factorization provides an alternative method to Algorithm 2 for solving the SLE (3.1), namely:

1. Compute the QR factorization A = QR.
2. Compute y = Q⊤b.
3. Solve Rx = y.

However, in order to decide whether or not it is competitive with Algorithm 2 (with the factorization computed by Algorithm 3), we should first derive an algorithm to compute the QR factorization and analyze it from the point of view of backward stability and computational cost.

The proof of Theorem 4.3 provides a construction based on the Cholesky factorization of the matrix M = A⊤A, and it involves the following three steps: (a) the computation of the product A⊤A; (b) the computation of the Cholesky factorization of A⊤A; and (c) the computation of the product A^{−⊤}L to get the matrix Q. This procedure presents several disadvantages from the computational point of view. The most important one is that the computation of A⊤A is subject to rounding errors, which may produce a non-symmetric (computed) matrix, so that the Cholesky factorization of M could not be computed. As a consequence, this procedure is not appropriate.

4.2.1. Computation of the QR factorization: Householder reflectors. There are several ways to compute the QR factorization of A ∈ Rn×n. The one we present here uses Householder reflectors and goes as follows.

We multiply the matrix A on the left by a sequence of orthogonal transformations, Q1, Q2, . . . , Qn (Householder reflectors), such that Qk, for k = 1, . . . , n, makes zeroes below the diagonal in the kth column while preserving the zeroes introduced by the matrices Q1, . . . , Qk−1. In this sense, it works like Gaussian elimination, but with orthogonal transformations instead of elementary row transformations. The resulting


matrix Qn · · · Q2Q1A = R is the upper triangular matrix of the QR factorization, while the transpose of the matrix Q is Q⊤ = Qn · · · Q2Q1.

Let us explain how the Householder reflector Qk works. The matrix Qk, for k = 1, . . . , n, is of the form

(4.1) $$Q_k = \begin{bmatrix} I_{k-1} & 0 \\ 0 & H_k \end{bmatrix},$$

where Hk is an orthogonal matrix. Assume that we have already constructed k − 1 matrices Q1, . . . , Qk−1 such that, for 1 ≤ i ≤ k − 1, the ith column of (Qk−1 · · · Q1)A has null entries below the ith coordinate. Then, we want to construct an orthogonal matrix Qk such that the same holds for the first k columns of (Qk · · · Q1)A. If Qk has the form (4.1), this guarantees that multiplying on the left by Qk does not alter the first k − 1 columns of (Qk−1 · · · Q1)A. Now, let v(k) be the kth column of (Qk−1 · · · Q1)A and let us denote by

$$x^{(k)} = \begin{bmatrix} x_1 \\ \vdots \\ x_{n-k+1} \end{bmatrix} \in \mathbb R^{n-k+1}$$

the vector whose coordinates are the last (bottom) n − k + 1 coordinates of v(k). In other words, x(k) is the vector obtained from v(k) after removing the first k − 1 coordinates. Now, we look for an orthogonal transformation Hk such that

(4.2) $$H_k x^{(k)} = \pm\|x^{(k)}\|_2\, e_1,$$

where e1 is the first column of In−k+1. The only nonzero entry of the right-hand side vector in (4.2) is the first one, which means that the kth column of (Qk · · · Q1)A will have null entries below the kth coordinate, as wanted. The factor ‖x(k)‖2 must appear because we want Hk to be orthogonal, and then ‖Hk x(k)‖2 = ‖x(k)‖2, by Lemma 4.2(ii), taking U = Hk, V = I, and A = x(k) in that lemma. A natural choice for such an Hk is the "reflection" across the hyperplane orthogonal to ‖x(k)‖2 e1 − x(k). Figure 4.1 depicts this reflection in R2 (the orthogonal hyperplane is the dashed line H, and x plays the role of x(k)).

Fig. 4.1. Householder reflector in R2. Taken from [6].

However, there is another way to get the same vector, namely, reflecting across the hyperplane orthogonal to −‖x(k)‖2 e1 − x(k). This is illustrated in Figure 4.2, where


the (orthogonal) hyperplanes are the dashed lines, denoted by H+ and H−. Note that, as shown in the picture, x goes to a vector whose first entry is negative if we reflect across H−.

Fig. 4.2. Two possible Householder reflectors in R2. Taken from [6].

In both cases, Hk is defined as follows:

(4.3) $$H_k = I_{n-k+1} - 2\,\frac{y^{(k)}(y^{(k)})^\top}{(y^{(k)})^\top y^{(k)}},$$

where $y^{(k)} = \mathrm{sign}(x_1)\|x^{(k)}\|_2\, e_1 \pm x^{(k)}$. The right choice is the plus sign (+), in order to keep x(k) as far away as possible from its image and so avoid cancellation (see [6]).

Summarizing, by choosing Hk as in (4.3) we guarantee the following:

(P1) The matrix Hk is orthogonal.

(P2) The first k columns of (Qk · · · Q1)A have the form

$$\begin{bmatrix} \pm & * & \cdots & * \\ 0 & \pm & \cdots & * \\ 0 & 0 & \ddots & \vdots \\ \vdots & \vdots & & \pm \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix},$$

where ± are nonzero entries and the entries marked with ∗ are not relevant.

Note that (4.3), with the (+) sign for y(k), makes the signs (±) on the diagonal all (+). Essentially, what we do is first reflect onto −‖x‖2 e1 and then switch the (−) sign to (+).

Property (P2) is an immediate consequence of (4.2). The proof of property (P1) is left as an exercise.

Exercise 20. Prove property (P1) above.

Note that (P1) implies that Qk in (4.1) is orthogonal as well. As for (P2), it shows that the product (Qn · · · Q1)A is upper triangular with positive diagonal entries.

The following algorithm constructs the matrix R of the QR factorization using the previous ideas. As for the matrix U in the LU factorization, R is obtained by


replacing the corresponding entries in the matrix A, to save storage. Note that the algorithm also stores the vectors qk (for later use).

Algorithm 5: Householder QR factorization.

% INPUT: An n × n matrix A
% OUTPUT: The matrix R of the QR factorization of A

for k = 1 : n
    x = A(k : n, k)
    qk = sign(x1)‖x‖2 e1 + x
    qk = qk/‖qk‖2
    A(k : n, k : n) = A(k : n, k : n) − 2 qk (qk⊤ A(k : n, k : n))
end
R = A

Exercise 21. Let

$$A = \begin{bmatrix} 1 & 5 & 4 & 3 & 2 \\ 2 & 1 & 5 & 4 & 3 \\ 3 & 2 & 1 & 5 & 4 \\ 4 & 3 & 2 & 1 & 5 \\ 5 & 4 & 3 & 2 & 1 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ -1 \\ 2 \\ -1 \\ 1 \end{bmatrix}.$$

(a) Solve Ax = b using Algorithm 2. Provide not just the solution x, but also the LU factorization of A, with 4 decimal digits.
(b) Solve Ax = b using Algorithm 7. Provide not just the solution x, but also the matrix R in the QR factorization of A, together with the vectors qk, for k = 1, . . . , 5, with 4 decimal digits.

4.2.2. LU versus QR for solving linear systems. Algorithm 5 does not compute the matrix Q of the QR factorization of A. This is because, for solving an SLE, we do not need to compute it. To see this, note that if A = QR is the QR factorization of A, then (3.1) is equivalent to Rx = Q⊤b, so we just need to compute R and Q⊤b. It is possible to compute Q⊤b without explicitly computing Q. This is the reason for storing the vectors qk. In particular, from these vectors we can compute Q⊤b using the following algorithm. Again, Q⊤b overwrites b to save storage.

Algorithm 6: Implicit calculation of the product Q⊤b.

% INPUT: A vector b and the vectors qk, for k = 1 : n, from Algorithm 5
% OUTPUT: The vector Q⊤b, with Q being the Q factor in the QR decomposition of A

for k = 1 : n
    b(k : n) = b(k : n) − 2 qk (qk⊤ b(k : n))
end

Now, joining Algorithm 5 and Algorithm 6 we arrive at the following algorithm to solve (3.1) using the QR factorization.


Algorithm 7: Solves (3.1) using the QR factorization.

1. Factorize A = QR using Householder reflectors.
2. Compute y = Q⊤b.
3. Solve Rx = y by backward substitution.

Computational cost: The computational cost of Algorithm 7 is $\frac43 n^3 + O(n^2)$ flops (see, for instance, [6, Lecture 10]).
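Putting Algorithms 5, 6, and the backward substitution together, a minimal sketch of Algorithm 7 could look as follows (it reuses the illustrative `householder_qr` helper sketched in Section 4.2.1 and SciPy's triangular solver).

```python
import numpy as np
from scipy.linalg import solve_triangular

def apply_qt(qs, b):
    """Algorithm 6: compute Q^T b implicitly from the reflector vectors."""
    b = np.array(b, dtype=float, copy=True)
    for k, q in enumerate(qs):
        b[k:] -= 2.0 * q * (q @ b[k:])
    return b

def solve_qr(A, b):
    """Algorithm 7: solve Ax = b via Householder QR and backward substitution."""
    qs, R = householder_qr(A)          # sketch from Section 4.2.1
    y = apply_qt(qs, b)
    return solve_triangular(R, y, lower=False)

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)
print(np.allclose(solve_qr(A, b), np.linalg.solve(A, b)))   # True
```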

The relevant difference between Algorithm 2 and Algorithm 7 is that Algorithm 7 is backward stable. Moreover, we have a perfect backward stability result.

Theorem 4.4. (Solving an SLE using QR is backward stable). If Algorithm 7 produces a solution $\widehat x$ to (3.1) in a machine with unit roundoff u, then there is some matrix ∆A such that

(4.4) $$(A + \Delta A)\widehat x = b, \quad\text{with}\quad \|\Delta A\| \le O(u)\,\|A\|.$$

For the proof of Theorem 4.4, we refer to the proof of Theorem 16.2 in [6]. We just give some general indications to understand why Theorem 4.4 is true. It is a consequence of the fact that all three steps in Algorithm 7 are backward stable. We already know that the third step (solution of triangular systems by backward substitution) is backward stable (Theorem 3.3). It is not difficult to prove that the second step is also backward stable. As for the first one (computation of the QR factorization using Householder reflectors), the backward stability strongly relies on the property that Householder reflectors are orthogonal matrices.

Remark 10. Note that no special assumptions on the matrix norm ‖ · ‖ are made in the statement of Theorem 4.4. Among the standard norms introduced in Section 1, only the 2-norm satisfies ‖UAV‖ = ‖A‖ for any U, V unitary (in other words, only the 2-norm is unitarily invariant). This means that, if ‖ · ‖ is any of the other norms (the 1, ∞, or Frobenius norms), then it could happen that, when multiplying by the Householder reflectors, the norm of the resulting matrix increases substantially. However, it can only increase up to some point. In particular, it cannot go to infinity as u goes to zero, since all norms are equivalent in a finite-dimensional vector space. Note that inequality (4.4) is independent of the choice of the norm, as long as they are all equivalent.

However, Algorithm 7 is twice as expensive as Algorithm 2, and for this reason Algorithm 2 is the one used in practice.


HIGHLIGHTS:

� Any invertible matrix A ∈ Rn×n can be written as A = QR, with Q orthogonal and R upper triangular with positive diagonal entries. This is known as the QR factorization of A, and it is unique.

� The computational cost of computing the QR factorization using Householder reflections is $\frac43 n^3 + O(n^2)$, which is twice the cost of the LU factorization.

� The computation of the QR factorization using Householder reflections is backward stable.

� The QR factorization can be used to solve the SLE (3.1) through Algorithm 7, which is backward stable.

� Despite the backward stability of Algorithm 7, its computational cost, which is twice that of solving (3.1) using the LU factorization, makes the latter procedure preferable in practice.


5. Lesson 4: Least squares problems and the SVD.

5.1. The singular value decomposition (SVD). The singular value decomposition (SVD) of a matrix A ∈ Cm×n is another factorization of A that provides very useful information, including:

• The rank of A.
• The 2-norm of A.
• The robustness of rank A against small perturbations. This tells us how far A is from a matrix of lower rank. In particular, if A is invertible, the SVD will tell us how far A is from being singular.

Furthermore, there are many applications of the SVD. In this course, we will review one of its basic applications, to image compression and restoration (Section 5.4).

Note that in this section we consider matrices with complex entries, either square or rectangular. This is a more general setting than in the preceding sections, where only the real square case was considered.

In the proof of the SVD we will use the following results. We recall that a Hermitian matrix is a matrix H ∈ Cn×n such that H∗ = H, and that a matrix A is unitarily diagonalizable if there is a unitary matrix U such that U∗AU is diagonal.

Exercise 22. Prove that rank A = rank A∗A, for any A ∈ Cm×n.

Exercise 23. Prove that the eigenvalues of a Hermitian positive semidefinite matrix are real nonnegative numbers.

Theorem 5.1. Any Hermitian matrix is unitarily diagonalizable.

The SVD is stated in the following result, whose proof is taken from [5, p. 31].

Theorem 5.2. (SVD). Let A ∈ Cm×n. Then there are two unitary matrices U ∈ Cm×m, V ∈ Cn×n, and a diagonal matrix Σ = diag(σ1, . . . , σr, 0, . . . , 0) ∈ Rm×n, with σ1 ≥ · · · ≥ σr > 0, such that

(5.1) $$A = U\Sigma V^*.$$

The values σ1, . . . , σr are uniquely determined and they are known as the singular values of A. The columns of U (resp. V) are known as left (resp. right) singular vectors of A.

If A has real entries, then U, Σ, V in (5.1) can all be taken to be real.

Proof. Let r := rank A. By Exercise 22, rank A∗A = r. The matrix A∗A is Hermitian and positive semidefinite. By Exercise 23, the eigenvalues of A∗A are real and nonnegative. Let $\sigma_1^2 \ge \cdots \ge \sigma_r^2 > 0 = \sigma_{r+1}^2 = \cdots = \sigma_n^2$ be the eigenvalues of A∗A. Let $V = \begin{bmatrix} V_1 & V_2 \end{bmatrix}$ (with V1 ∈ Cn×r) be a unitary matrix whose columns are eigenvectors of A∗A, arranged in this order. This matrix exists by Theorem 5.1. Then

$$V^*(A^*A)V = \begin{bmatrix} \Sigma_+^2 & 0 \\ 0 & 0 \end{bmatrix},$$

where Σ+ = diag(σ1, . . . , σr). Then we have

(5.2) $$V_1^*A^*AV_1 = \Sigma_+^2, \qquad V_2^*A^*AV_2 = 0.$$


The second identity is equivalent to (AV2)∗(AV2) = 0. By Exercise 22, rank(AV2) = rank (AV2)∗(AV2) = 0, so we conclude that

(5.3) $$AV_2 = 0.$$

Now let

(5.4) $$U_1 = AV_1\Sigma_+^{-1}.$$

Then (5.2) implies that U1∗U1 = Ir. Let us choose U2 such that $U = \begin{bmatrix} U_1 & U_2 \end{bmatrix}$ is unitary. Then, (5.2), (5.3), and (5.4) together imply

$$U^*AV = \begin{bmatrix} U_1^*AV_1 & U_1^*AV_2 \\ U_2^*AV_1 & U_2^*AV_2 \end{bmatrix} = \begin{bmatrix} \Sigma_+ & 0 \\ 0 & 0 \end{bmatrix},$$

and from this it follows that A = UΣV∗, since U and V are unitary.

If A is real, note that all the arguments remain valid replacing ∗ by ⊤, and all matrices involved can be chosen to be real.

The decomposition (5.1) is known as a singular value decomposition (SVD) of A.

One may wonder about the uniqueness of the SVD of A. Theorem 5.2 claims uniqueness of the singular values, but nothing is said about the singular vectors. In the next result we provide an answer to this question, together with some other properties of the SVD.

Lemma 5.3. Let A be as in the statement of Theorem 5.2.

(i) The singular values are the nonnegative square roots of the eigenvalues of AA∗ or, equivalently, of A∗A (that is why they are uniquely determined).
(ii) The columns of U are eigenvectors of AA∗, and the columns of V are eigenvectors of A∗A, arranged in the same order as the corresponding eigenvalues σi².
(iii) If m ≤ n and AA∗ has distinct eigenvalues, then U is determined up to a right diagonal factor D = diag(e^{iθ1}, . . . , e^{iθm}), with all θi ∈ R; that is, if A = U1ΣV1∗ = U2ΣV2∗, then U2 = U1D.
(iv) If m < n, then V is never uniquely determined.
(v) If m = n and U is given, then V is uniquely determined.
(vi) If n ≤ m, analogous results hold by considering A∗.
(vii) ‖A‖2 = σ1.

Exercise 24. Prove Lemma 5.3.

Moreover, the following properties of U1 and V2 hold.

Lemma 5.4. If U1 and V2 are as above, then
(a) the columns of U1 are an orthonormal basis of the column space of A;
(b) the columns of V2 are an orthonormal basis of the null space of A.

Exercise 25. Prove Lemma 5.4.

We are not getting into the problem of computing the SVD of a given matrix. Instead, we will review some relevant applications of the SVD.


5.2. Low rank approximations. In many situations it may be convenient to approximate a given matrix A by some other matrix with smaller rank, which is called a low-rank approximation of A. It is desirable that this low-rank approximation is the closest to A in some sense. A natural notion of "closeness" is given by a distance, and in particular the distance defined by the 2-norm (we will see more on this in Section 5.3). Then the low-rank approximation problem is as follows:

Low-rank approximation problem: Given a matrix A, find a matrix Aρ such that

$$\|A - A_\rho\|_2 = \min\{\|A - B\|_2 : \ \mathrm{rank}\, B \le \rho\}.$$

The solution to this problem goes through the SVD of A. To state it, we need to recall that any matrix M ∈ Cm×n with rank M = ρ can be written as a sum of ρ rank-1 matrices,

$$M = x_1y_1^* + \cdots + x_\rho y_\rho^*,$$

for some vectors xi ∈ Cm and yi ∈ Cn, for i = 1, . . . , ρ. In particular, the SVD of A in (5.1) allows us to decompose A as a sum of r rank-1 matrices as follows:

(5.5) $$A = \sum_{i=1}^{r}\sigma_i u_i v_i^*,$$

where ui and vi, for i = 1, . . . , r, are the ith columns of U and V, respectively. With these considerations in mind, we can state the solution of the low-rank approximation problem.

Theorem 5.5. (Optimal low-rank approximation). Let A ∈ Cm×n be as in (5.5), with σ1 ≥ · · · ≥ σr > 0 being the nonzero singular values of A, and let ρ ≤ r. Set

(5.6) $$A_\rho := \sum_{i=1}^{\rho}\sigma_i u_i v_i^*.$$

Then

$$\|A - A_\rho\|_2 = \min\{\|A - B\|_2 : \ \mathrm{rank}\, B \le \rho\} = \sigma_{\rho+1},$$

where, if ρ = r, then σρ+1 = 0.

Proof. First note that, if A and Aρ are as in (5.5) and (5.6), respectively, then $\|A - A_\rho\|_2 = \|\sum_{i=\rho+1}^{r}\sigma_i u_i v_i^*\|_2 = \sigma_{\rho+1}$, by Lemma 5.3(vii). It remains to prove that, for any B with rank B ≤ ρ, it must be ‖A − B‖2 ≥ σρ+1. We omit this part and refer to [4, Fact 4.13] for a proof.

As a consequence of Theorem 5.5, the best rank-ρ approximation to A is obtained from the SVD of A by keeping the first ρ summands of (5.5), that is, by truncating the sum from the (ρ + 1)th singular value on.
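A minimal NumPy illustration of Theorem 5.5 (the test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
U, s, Vh = np.linalg.svd(A, full_matrices=False)       # A = U @ diag(s) @ Vh

rho = 2
A_rho = U[:, :rho] @ np.diag(s[:rho]) @ Vh[:rho, :]    # truncated sum (5.6)

print(np.linalg.norm(A - A_rho, 2))    # equals s[rho], i.e. sigma_{rho+1} ...
print(s[rho])                          # ... the first discarded singular value
```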

In particular, we can determine the distance of an invertible matrix to the set of singular (i.e., non-invertible) matrices.

Corollary 1. Let A ∈ Cn×n be invertible and σ1 ≥ · · · ≥ σn > 0 be the singular values of A. Then

$$\sigma_n = \min\{\|E\|_2 : \ A + E \ \text{is singular}\}.$$


Proof. If E is such that A + E is singular, then rank(A + E) ≤ n − 1, so by Theorem 5.5,

$$\|E\|_2 = \|A - (A + E)\|_2 \ge \sigma_n.$$

The construction below shows that this value is attained.

Corollary 1 tells us how far an invertible matrix is from being singular. In other words, it provides the norm of the smallest perturbation that makes an invertible matrix singular. Though it does not explicitly show us how to construct this perturbation, we can derive a procedure to get it using the SVD. It goes as follows:

1. Compute the SVD (5.1) of A.
2. Set E = −σn un vn∗.

The matrix E constructed above clearly satisfies ‖E‖2 = σn, and A + E is singular. Note, moreover, that E is a rank-1 perturbation.
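A quick numerical check of this construction (minimal sketch, with an arbitrary invertible test matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
U, s, Vh = np.linalg.svd(A)

E = -s[-1] * np.outer(U[:, -1], Vh[-1, :])   # E = -sigma_n * u_n * v_n^*
print(np.linalg.norm(E, 2), s[-1])           # ||E||_2 = sigma_n
print(np.linalg.matrix_rank(A + E))          # 4: A + E is singular
```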

Exercise 26. Let A be the matrix in Exercise 8.
(a) Compute the SVD of A for n = 8, 10, 12, 14.
(b) Indicate the distance to singularity of A for the previous values of n.
(c) Construct a perturbation E such that A + E is singular, for each of the previous values of n.

Exercise 27. Assume that A is invertible with SVD as in (5.1). Give an expression for an SVD of A−1.

The previous arguments are closely related to the condition number for inversion of A. This number is a measure of the sensitivity (conditioning) of the problem of inverting a matrix. It is expected to be big if the matrix is invertible (otherwise the problem of computing its inverse makes no sense) but close to being singular. In particular, from Lemma 5.3(vii) and Exercise 27, we see that

$$\kappa_2(A) = \|A^{-1}\|_2\|A\|_2 = \frac{\sigma_1}{\sigma_n}.$$

This shows, again, that if σn is small, then A is ill-conditioned for inversion.

5.3. Least squares problems. We are now interested in the case where the linear system (3.1) does not have a solution. This happens when the vector b does not belong to the column space of A ∈ Cm×n. The typical situation is when m > n, so the system has more equations than unknowns, because such systems generically have no solution. In this setting, we are interested in computing a vector x which is the "closest" to a solution. But, for this, we first need to define what we mean by "close". This closeness, as in the previous section, is measured in terms of the distance induced by the 2-norm.

Least squares problem (LSP): Given A ∈ Cm×n and b ∈ Cm, the least squares problem associated with Ax = b consists of finding a vector x ∈ Cn such that

(5.7) $$\|Ax - b\|_2 = \min\{\|Ay - b\|_2 : \ y \in \mathbb C^n\}.$$


The vector x in (5.7) is a least squares solution of the system (3.1), and Ax − b is called the least squares residual. Thus, an LSP aims to find those vectors x ∈ Cn which minimize the 2-norm of the residual.

Some observations are in order:

• For the sake of completeness, we have stated the LSP for complex rectangular matrices, instead of real (and square) ones.
• The least squares solution is not necessarily unique.

The main question related to the LSP is the following: does it always have a solution? In other words, given A ∈ Cm×n and b ∈ Cm arbitrary, does there necessarily exist a vector x satisfying (5.7)? The answer is YES. Now, we are going to review several different ways to compute it.

5.3.1. Solution of the LSP using the SVD. Let us write the SVD (5.1) of A as in the proof of Theorem 5.2, namely

$$A = U\begin{bmatrix} \Sigma_+ & 0 \\ 0 & 0 \end{bmatrix}V^*, \qquad U = \begin{bmatrix} U_1 & U_2 \end{bmatrix}, \qquad V = \begin{bmatrix} V_1 & V_2 \end{bmatrix},$$

where Σ+ = diag(σ1, . . . , σr) is r × r, U1 is m × r, U2 is m × (m − r), V1 is n × r, and V2 is n × (n − r).

The following result provides a formula for the solution of the LSP.

Theorem 5.6. (Solution of the LSP). Let A ∈ Cm×n and b ∈ Cm. Then the solutions of the LSP (5.7) are of the form

$$x = V_1\Sigma_+^{-1}U_1^*b + V_2z,$$

for any z ∈ Cn−r.

Proof. Let x ∈ Cn be arbitrary, and let us partition

$$V^*x = \begin{bmatrix} V_1^*x \\ V_2^*x \end{bmatrix} = \begin{bmatrix} y \\ z \end{bmatrix}.$$

Now, we replace the SVD of A into the residual:

$$Ax - b = U\begin{bmatrix} \Sigma_+ & 0 \\ 0 & 0 \end{bmatrix}V^*x - b = U\begin{bmatrix} \Sigma_+y - U_1^*b \\ -U_2^*b \end{bmatrix}.$$

Since the 2-norm is unitarily invariant (see Remark 10), taking the square of the 2-norm in the previous identity we get

$$\|Ax - b\|_2^2 = \|\Sigma_+y - U_1^*b\|_2^2 + \|U_2^*b\|_2^2.$$

Since the second summand in the right-hand side does not depend on y or z, the residual is minimized exactly when the first summand is zero, which happens when $y = \Sigma_+^{-1}U_1^*b$. Therefore, the solutions of the LSP (5.7) are given by

$$x = V\begin{bmatrix} y \\ z \end{bmatrix} = V_1y + V_2z = V_1\Sigma_+^{-1}U_1^*b + V_2z.$$

Note that, by (5.3), A(V2z) = 0 for any z ∈ Cn−r. Therefore, the vector V2z does not have any influence on the residual, so z can be any vector.


As a consequence of Theorem 5.6, if rankA = r < n, then the LSP (5.7) hasinfinitely many solutions, and the essential part of the solution contains the matrixV1Σ−1

+ U∗1 . This matrix is obtained from the SVD of A just inverting the singularvalues and exchanging the roles of U and V . Note that, when A is invertible, this isjust A−1 (as expected). Moreover, when A is not invertible, this matrix shares someproperties with the inverse of a matrix. In fact, it is a particular case of what is knownas a pseudoinverse (or generalized inverse).

Definition 5.7. Let A ∈ Cm×n with an SVD given by (5.1), and partition Σ = [ Σ+ 0 ; 0 0 ], with Σ+ invertible. Then the matrix

A† := V [ Σ+⁻¹ 0 ; 0 0 ] U∗

is called the Moore-Penrose inverse of A.

The following lemma supports the use of the word “inverse” for A†.

Lemma 5.8. Let A ∈ Cm×n.
• If A is invertible, then A† = A⁻¹.
• If rank A = n, then A† = (A∗A)⁻¹A∗. This implies A†A = In, so A† is a left inverse of A.
• If rank A = m, then A† = A∗(AA∗)⁻¹. This implies AA† = Im, so A† is a right inverse of A.

Exercise 28. Prove Lemma 5.8.
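The identities in Lemma 5.8 can also be checked numerically in MATLAB through the built-in command pinv, which computes the Moore-Penrose inverse. The lines below are just a sketch, with arbitrary full-rank random matrices:

    A = randn(5,3);                    % full column rank (rank 3) with probability 1
    norm(pinv(A) - (A'*A)\A')          % ~ 0: pinv(A) equals inv(A'*A)*A'
    norm(pinv(A)*A - eye(3))           % ~ 0: pinv(A) is a left inverse of A

    B = randn(3,5);                    % full row rank (rank 3) with probability 1
    norm(pinv(B) - B'/(B*B'))          % ~ 0: pinv(B) equals B'*inv(B*B')
    norm(B*pinv(B) - eye(3))           % ~ 0: pinv(B) is a right inverse of B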

Using the Moore-Penrose inverse, we can give a more condensed expression for the least squares solution.

Corollary 2. Let A ∈ Cm×n and b ∈ Cm. The solutions of the LSP (5.7) are of the form

x = A†b + y, where y ∈ Ker(A).
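As a small illustration of Corollary 2 (again, just a sketch with an arbitrary rank-deficient example), note that adding any element of Ker(A) to A†b does not change the residual:

    A = randn(8,4)*randn(4,6);          % rank-deficient 8x6 matrix
    b = randn(8,1);
    x0 = pinv(A)*b;                     % the particular solution pinv(A)*b
    N  = null(A);                       % columns form an orthonormal basis of Ker(A)
    z  = randn(size(N,2),1);
    norm(A*(x0 + N*z) - b) - norm(A*x0 - b)   % ~ 0: same least squares residual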

5.3.2. Solution of LSP using the QR factorization. Another way to solve the LSP (5.7) is by means of the so-called normal equations, as explained in the following result.

Theorem 5.9. Let A ∈ Cm×n and b ∈ Cm. Then x is a solution of the LSP (5.7) if and only if x is a solution of

(5.8)    A∗Ax = A∗b.

Exercise 29. Prove Theorem 5.9.

The equations of system (5.8) are known as the normal equations associated with the LSP (5.7). They relate the LSP with an SLE, which is nice because it allows us to address an LSP as an SLE. However, the solution of the LSP using the normal equations can be numerically unstable, since the conditioning of the problem can be as large as κ2(A)², instead of κ2(A) (see [4, p. 103]).
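The squaring of the condition number is easy to observe numerically. The following MATLAB lines are a sketch in which an ill-conditioned matrix is built from a prescribed set of singular values (all sizes and values are arbitrary choices):

    [U,~] = qr(randn(100,10),0);     % 100x10 matrix with orthonormal columns
    [V,~] = qr(randn(10));           % 10x10 orthogonal matrix
    s = logspace(0,-6,10);           % singular values from 1 down to 1e-6
    A = U*diag(s)*V';
    cond(A)                          % about 1e6
    cond(A'*A)                       % about 1e12, i.e., cond(A)^2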

In this course we have just defined the QR factorization for real square invertible matrices. However, the QR factorization can be extended to arbitrary complex matrices (square and rectangular). Though we are not getting into the details, the QR factorization for complex rectangular matrices is of the same form as for real invertible matrices. More precisely, given a matrix A ∈ Cm×n, with m ≥ n, there are two matrices Q ∈ Cm×n (with orthonormal columns, that is, satisfying Q∗Q = In) and R ∈ Cn×n (upper triangular) such that A = QR. Using this decomposition, we can find an expression for the solution of the normal equations (5.8). To be precise, (5.8) is equivalent to

R∗Rx = R∗Q∗b.

If A has full column rank (which means that m ≥ n and rank A = n), then R∗ is invertible, and this system is equivalent to

Rx = Q∗b.

This is an upper triangular system, which can be solved by backward substitution.
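In MATLAB, this QR-based solution of the LSP can be sketched as follows (the matrix and right-hand side are arbitrary placeholders; note that the product A∗A is never formed):

    m = 200; n = 5;
    A = randn(m,n); b = randn(m,1);   % full column rank with probability 1
    [Q,R] = qr(A,0);                  % economy-size QR: Q is m-by-n, R is n-by-n
    x = R\(Q'*b);                     % solves R*x = Q'*b by backward substitution
    norm(x - A\b)                     % ~ 0: MATLAB's backslash also returns a least squares solution

This is essentially the approach used in practice for full-column-rank problems, since it avoids the squared conditioning of the normal equations.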

5.4. Image processing. Another application of the SVD is in image processing. In particular, we are going to review the implications of the SVD in the process of compression and restoration of images.

For the sake of simplicity, let us focus on the simplest case of a black and white image. The information of this image is stored in a matrix, whose dimensions correspond to the resolution of the image. For instance, a resolution of 1368 × 1712 means that the image is decomposed into a grid of 1368 × 1712 small squares (pixels), each one having an independent intensity of black color. This corresponds to a matrix of size 1368 × 1712, where each entry encodes the intensity of black color in the corresponding pixel. In some cases, it may be preferable to reduce the amount of stored information, at the cost of losing some definition in the image. This leads to the question of how to get the best compromise between the amount of stored information and the quality of the image. The SVD is helpful in this question, based on Equation (5.5) and Theorem 5.5. More precisely, the decomposition (5.5) allows us to store the information of the matrix A ∈ Cm×n through the sum of r rank-1 matrices, together with r real values (the singular values σ1 ≥ · · · ≥ σr > 0). This is a total amount of r(m + n) + r numbers, since each rank-1 matrix of size m × n is stored in 2 vectors, one of size m × 1, and the other one of size 1 × n. Of course, if r = m = n (so the matrix is invertible), then this amounts to 2n² + n numbers, which is more information than storing directly the whole n × n matrix. However, when the matrix is rectangular, and m << n or n << m, or even when it is square n × n but r << n, this quantity can be much smaller than mn, which is the storage requirement of the whole matrix.

Then, what we can do is to replace the whole matrix A by a good low-rank approximation. The notion of “good” here depends on the particular case, but the idea is to find a low-rank approximation that allows us to decrease the storage cost without losing too much definition in the image. Theorem 5.5 tells us that the best rank-ρ approximation to a given matrix (in terms of the distance induced by the 2-norm) is obtained by truncating the sum (5.5), namely removing the last summands and keeping the first ρ summands, which correspond to the largest singular values.

In general, a rank-100 approximation (that is, using only the largest 100 singular values) is enough to get a good approximation to the original image, while the amount of information stored in this approximation is much less than the one stored in the original matrix. In particular, for an image with 1368 × 1712 resolution, the information required to store the rank-100 approximation in the form (5.5) with r = 100 is 100(1368 + 1712) + 100 = 308100 numbers. Compared to the size of the whole matrix, this gives a compression rate of

cr = 308100 / (1368 · 1712) = 0.1315,



which means that the information stored in the approximation is only 13.15% of the whole information: an important saving.
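The truncation described above is easy to carry out in MATLAB. The following lines are only a sketch of what a compress-type routine might do (the file name image.jpg is a placeholder, and the actual compress code used in Exercise 30 below may differ):

    A = double(imread('image.jpg'));      % load the image as a matrix
    if ndims(A) == 3, A = mean(A,3); end  % convert to grayscale if the image is RGB
    p = 100;                              % target rank of the approximation
    [U,S,V] = svd(A,'econ');
    Ap = U(:,1:p)*S(1:p,1:p)*V(:,1:p)';   % best rank-p approximation (Theorem 5.5)
    cr = (p*(size(A,1)+size(A,2)) + p)/numel(A)   % compression rate, as computed above
    imagesc(Ap), colormap(gray), axis image       % display the compressed image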

Exercise 30. (Mandatory). Choose your favorite picture (one not having a very high resolution!) and put it into your working path in MATLAB (for instance, as a jpg file). Load the picture using the command name = imread('image.jpg'). Then run the command compress (which should also be in your working path in MATLAB), choosing 4 different values for the rank, which is the variable p in the input of the code. Your choices should include: a small rank (around 1/10 of the rank of the whole matrix), a moderate one (around 100), and a large one (like 1/2 of the whole rank). Then display the following information:

• The original picture.
• The 4 pictures obtained after applying the compress code.
• The rate of compression of each picture (indicate the computations carried out to get it).

HIGHLIGHTS:

• The SVD is a matrix factorization which displays relevant information of an arbitrary matrix (like the rank and the singular values).

• The singular values measure how far a matrix of a certain rank is from the set of matrices with smaller rank.

• The SVD is a useful tool in many applied settings, including low-rank approximations, the solution of LSPs, and image processing.

• An LSP arises naturally from an SLE with no solution, and aims to compute the vector with smallest residual (that is, in some sense, the “closest” to a solution).



REFERENCES

[1] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[2] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 1996.
[3] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, 1998.
[4] I. C. F. Ipsen. Numerical Matrix Analysis. SIAM, Philadelphia, 2009.
[5] G. W. Stewart and J. G. Sun. Matrix Perturbation Theory. Academic Press, New York, 1990.
[6] L. N. Trefethen and D. Bau, III. Numerical Linear Algebra. SIAM, Philadelphia, 1997.