
Background Material

These notes contain a summary of background material from linear algebra and calculus. Much of the content should be familiar to some degree, and the purpose is to bring it back to attention. Important concepts are highlighted in the notes.

1 Asymptotic notation

An algorithm is a sequence of instructions carried out by a computer. Important features of an algorithm are computation time (measured as the number of operations or the number of iterations needed to reach a solution) and accuracy. One is mainly interested in the orders of magnitude of these quantities, and not so much in their exact values. A convenient notation for this purpose is the asymptotic O-notation.

Let f, g : R → R be two functions. Then:

• f(n) ∈ O(g(n)) as n → ∞ if there exists a constant C > 0 and an integer n0 such that |f(n)| ≤ C|g(n)| for n > n0;

• f(x) ∈ O(g(x)) as x → 0 if there exists a constant C > 0 and a real number ε > 0 such that |f(x)| ≤ C|g(x)| for |x| < ε.

We omit “n → ∞” or “x → 0” when it is clear from the context. One often finds statements such as f(n) = O(g(n)) or f(x) = 1 + x + O(x^2); the first is equivalent to f(n) ∈ O(g(n)), while the second should be read as f(x) = 1 + x + g(x) for a function g(x) ∈ O(x^2). The following examples illustrate the O-notation.

• √n + n^2 ∈ O(n^2) as n → ∞

• n^5 ∈ O(e^n) as n → ∞

• 10^100 ∈ O(1) as n → ∞

• x^3 ∈ O(x^2) as x → 0

• sin(x) ∈ O(x) as x → 0

• e^x = 1 + x + O(x^2) as x → 0

The notation f(n) ∈ Ω(g(n)) as n → ∞ means that g(n) ∈ O(f(n)) as n → ∞, and f(n) ∈ o(g(n)) as n → ∞ means that for all C > 0 there exists n0 such that |f(n)| < C|g(n)| for n > n0. If g(n) ≠ 0 for sufficiently large n, this is equivalent to

lim_{n→∞} f(n)/g(n) = 0.

One defines Ω(g(x)) and o(g(x)) as x → 0 analogously.
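As a quick numerical sanity check (not a proof), one can tabulate the ratio f(n)/g(n) and watch it stay bounded. The following minimal Python sketch does this for the first example above.

```python
import math

def f(n):
    return math.sqrt(n) + n**2

def g(n):
    return n**2

# The ratio f(n)/g(n) decreases towards 1, so e.g. C = 2 works for n > 1.
for n in [10, 100, 1000, 10_000, 100_000]:
    print(n, f(n) / g(n))
```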


2 Linear Algebra

We restrict to linear algebra over the field of real numbers R, as this is the setting that is of most interest in optimization. A vector in Rn and its transpose are written as

x = (x1, . . . , xn)>, x> = (x1, . . . , xn),

with coordinates xi ∈ R for 1 ≤ i ≤ n. The zero vector is denoted by 0, while e is the vector with every coordinate equal to 1. If λ1, λ2 ∈ R and x, y ∈ Rn, then λ1x + λ2y is the vector with coordinates λ1xi + λ2yi for 1 ≤ i ≤ n.

In Rn we have the Euclidean (or standard) inner product (or scalar product)

〈x, y〉 = ∑_{i=1}^n xiyi.

The Euclidean inner product is bilinear: for x1, x2, y, y1, y2 ∈ Rn and α, β ∈ R,

〈αx1 + βx2, y〉 = α〈x1, y〉 + β〈x2, y〉, 〈x, αy1 + βy2〉 = α〈x, y1〉 + β〈x, y2〉,

symmetric (〈x, y〉 = 〈y, x〉), and satisfies 〈x, x〉 ≥ 0, with equality if and only if x = 0. Two vectors x and y are called orthogonal if 〈x, y〉 = 0.

Example 2.1. The vectors (1, 1)> and (1,−1)> are orthogonal in R2, while (1, 1)> and (2,−1)> are not.

[Figure 2.1: Orthogonal vectors (1, 1)> and (1,−1)>. Figure 2.2: Non-orthogonal vectors (1, 1)> and (2,−1)>.]
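The orthogonality test of Example 2.1 amounts to a single inner product; a minimal sketch, assuming NumPy is available:

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, -1.0])
z = np.array([2.0, -1.0])

print(np.dot(x, y))  # 0.0 -> x and y are orthogonal
print(np.dot(x, z))  # 1.0 -> x and z are not orthogonal
```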

Linear subspaces

A linear subspace is a subset V ⊆ Rn such that for any x, y ∈ V and for all α, β ∈ R, αx + βy ∈ V. In particular, the sets {0} and Rn are linear subspaces.

Example 2.2. The linear subspaces of R2 are {0}, lines through the origin, and R2. The linear subspaces of R3 are {0}, lines and planes through the origin, and R3.

A linear combination of vectors x1, . . . , xk ∈ Rn is an expression of the form x = ∑_{i=1}^k λixi, where λi ∈ R for 1 ≤ i ≤ k. The set of linear combinations

V = span{x1, . . . , xk} := {∑_{i=1}^k λixi : λi ∈ R}


forms a linear subspace of Rn. It is the intersection of all linear subspaces that contain x1, . . . , xk. The vectors x1, . . . , xk are linearly independent if ∑_{i=1}^k λixi = 0 implies λ1 = · · · = λk = 0. A minimal set of vectors that spans a linear subspace V is called a basis of this subspace, and the number of elements in a basis is the dimension of the subspace. The elements of a basis are always linearly independent, and a maximal linearly independent set in a vector subspace V is a basis. If b1, . . . , bk is a basis of a subspace V, then every x ∈ V has a unique representation x = ∑_{i=1}^k λibi.

A basis is orthogonal if 〈bi, bj〉 = 0 for i ≠ j, and orthonormal if in addition 〈bi, bi〉 = 1 for 1 ≤ i ≤ k. The unique expression of x ∈ V as a linear combination of an orthonormal basis b1, . . . , bk of V is given by

x = ∑_{i=1}^k 〈x, bi〉bi.

The standard basis of Rn is the orthonormal basis e1, . . . , en, where ei has a 1 in the i-th coordinate and 0 elsewhere.

Example 2.3. The vectors v1 = (0, 1, 1)> and v2 = (1, 0, 1)> span a linear subspace V of R3 of dimension 2, but they are not orthogonal. The vectors

b1 = (1/√2) (0, 1, 1)>, b2 = (1/√3) (√2, −1/√2, 1/√2)>

form an orthonormal basis of V. The vector x = (1, 1, 2)> lives in V, and its representation in terms of b1, b2 is

x = (3/√2) b1 + √(3/2) b2.

[Figure: the vectors v1, v2 and the orthonormal basis b1, b2 in the subspace V.]
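The basis of Example 2.3 can be reproduced by Gram–Schmidt orthogonalisation, and the representation of x recovered from the inner products 〈x, bi〉. A small NumPy sketch:

```python
import numpy as np

v1 = np.array([0.0, 1.0, 1.0])
v2 = np.array([1.0, 0.0, 1.0])

# Gram-Schmidt: normalise v1, remove its component from v2, normalise the rest.
b1 = v1 / np.linalg.norm(v1)
w = v2 - np.dot(v2, b1) * b1
b2 = w / np.linalg.norm(w)           # reproduces b2 from Example 2.3

x = np.array([1.0, 1.0, 2.0])
c1, c2 = np.dot(x, b1), np.dot(x, b2)
print(c1, c2)                        # 3/sqrt(2) ~ 2.1213, sqrt(3/2) ~ 1.2247
print(np.allclose(c1 * b1 + c2 * b2, x))  # True: x = c1*b1 + c2*b2
```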

The direct sum of two vector subspaces V, W ⊂ Rn with V ∩ W = {0} is

V ⊕ W = {v + w : v ∈ V, w ∈ W}.


The orthogonal complement of a subspace V ⊆ Rn is the set

V⊥ = {x ∈ Rn : ∀y ∈ V : 〈x, y〉 = 0}.

The vector space Rn is the direct sum of V and its orthogonal complement,

Rn = V ⊕ V ⊥. (1)

If V = span{x1, . . . , xk}, then V⊥ = {y : 〈x1, y〉 = · · · = 〈xk, y〉 = 0}; to check whether a vector y is in the orthogonal complement of V we therefore only need to check whether y is orthogonal to a spanning set (for example, a basis) of V.

Example 2.4. The vector x = (−1,−1, 1)> is orthogonal to the basis b1, b2 from Example 2.3. It is therefore orthogonal to the whole plane V = span{b1, b2} spanned by these vectors. The orthogonal complement of V is the line {λx : λ ∈ R}.

The direct product of two vector subspaces V ⊆ Rn, W ⊆ Rm is defined as

V × W = {(v, w) ∈ Rn+m : v ∈ V, w ∈ W},

where (v, w) is the vector whose first n coordinates coincide with v, and the last m coordinates coincide with w. In particular, Rn × Rm = Rn+m.

Linear maps

An m × n matrix

A = ( a11 · · · a1n
       ⋮   ⋱   ⋮
      am1 · · · amn )

represents a linear map from Rn to Rm by means of

y = Ax, yi = ∑_{j=1}^n aijxj.

For example,

( 2 1 0
  1 0 2 ) (1, 1, 2)> = (3, 5)>.

The columns of a matrix are vectors, and we sometimes write

A = (a1, . . . , an)

for the matrix whose columns are given by the vectors ai. If A ∈ Rn×n and n = m + k, then A can be written as a block matrix,

A = ( A11 A12
      A21 A22 ),


with A11 ∈ Rm×m, A22 ∈ Rk×k, A12 ∈ Rm×k and A21 ∈ Rk×m. The sum and difference of matrices of the same size are defined component-wise.

The n × n matrix 1 is the identity matrix, with 1 on the diagonal and 0 elsewhere, while 0 is the matrix consisting of only zeros. A matrix is diagonal if all the off-diagonal elements are 0, lower-triangular if all the elements above the diagonal are 0, and upper-triangular if all the elements below the diagonal are 0. A block-diagonal matrix is a block matrix with all blocks outside the main diagonal consisting of zero-matrices 0.

The transpose A> is the matrix with entries a′ij := aji. It is the matrix A mirrored along the diagonal from top left to bottom right. A matrix A ∈ Rn×n is called symmetric if A> = A. The set of symmetric matrices in Rn×n is denoted by Sn.

The product of an m × p matrix A with a p × n matrix B,

C = AB,

is the m × n matrix C whose (i, j)-th entry is given by

cij = ∑_{k=1}^p aikbkj.

It represents a composition of maps Rn → Rp → Rm. The number of columns of A has to equal the number of rows of B for this definition to make sense. Products of block matrices, or of block matrices with vectors, can be carried out block-wise. If, for example, x = (x1, x2)> with x1 ∈ Rm and x2 ∈ Rk, then

( A11 A12 ) ( x1 )   ( A11x1 + A12x2 )
( A21 A22 ) ( x2 ) = ( A21x1 + A22x2 ).

The matrix 1 satisfies 1A = A and A1 = A whenever the dimensions are such that this is defined. In general, even if A, B ∈ Rn×n, AB ≠ BA.

Example 2.5. Let

A = ( 1 2        B = ( 2 3
      1 4 ),           3 2 ).

Then

AB = (  8  7        BA = ( 5 16
       14 11 ),            5 14 ).
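Example 2.5 is easy to verify numerically; the `@` operator performs matrix multiplication in NumPy:

```python
import numpy as np

A = np.array([[1, 2], [1, 4]])
B = np.array([[2, 3], [3, 2]])

print(A @ B)  # [[ 8  7], [14 11]]
print(B @ A)  # [[ 5 16], [ 5 14]]  -> AB != BA
```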

If we consider a vector x ∈ Rn as an n × 1 matrix and the transpose as a 1 × n matrix, then for x, y ∈ Rn ≅ Rn×1 we have

〈x, y〉 = x>y.

The transpose of a product satisfies (AB)> = B>A>. From this it follows that for any matrix A, the product A>A is symmetric. For any matrix A we have

〈x, Ax〉 = x>Ax = (A>x)>x = 〈A>x, x〉.


It follows from this that if a matrix is symmetric, then it is also self-adjoint, which means that 〈x, Ax〉 = 〈Ax, x〉.

The rank of a matrix A, rk(A), is the maximum number of linearly independent rows or columns of A. The kernel and image of A are the linear subspaces

ker A := {x ∈ Rn : Ax = 0}, im A = {Ax : x ∈ Rn}.

The dimensions are given by dim ker A = n − rk(A) and dim im A = rk(A). While A ∈ Rm×n represents a linear map from Rn to Rm, the transpose A> represents a map in the other direction, and the image of A> coincides with the orthogonal complement of the kernel of A, (ker A)⊥ = im A>. In particular, in view of (1) we have the direct sum decomposition

Rn = ker A ⊕ im A>.

A system of linear equations

a11x1 + · · · + a1nxn = b1
          ⋮
am1x1 + · · · + amnxn = bm

is written as a matrix-vector product

Ax = b, (2)

where the m × n matrix A is defined as above, and x ∈ Rn, b ∈ Rm. If the columns of A are linearly independent, then the system of equations has at most one solution; otherwise, if the system is solvable at all, it has infinitely many solutions (this is the case if n > m). If n = m, then the system (2) has a unique solution if and only if the matrix A is invertible or non-singular. This is the case if the rows of A (or equivalently, the columns of A) are linearly independent. If A is not invertible, it is called singular.

If A is invertible, there exists a matrix A−1 (the inverse) such that

AA−1 = A−1A = 1.

The solution of (2) is then given by x = A−1b. The following conditions on a matrix A ∈ Rn×n are equivalent:

1. A is invertible,

2. rk(A) = n,

3. ker A = {0},

4. imA = Rn,

5. the rows of A are linearly independent,


6. the columns of A are linearly independent,

7. det(A) ≠ 0,

where the determinant is

det(A) = ∑_{σ∈Sn} sgn(σ) a1σ(1) · · · anσ(n),

and Sn is the group of permutations of [n] = {1, . . . , n}, with sgn(σ) the sign of the permutation (the parity of the number of inversions).

Example 2.6. For two- and three-dimensional matrices

A = ( a11 a12        B = ( a11 a12 a13
      a21 a22 ),           a21 a22 a23
                           a31 a32 a33 ),

the determinants are

det(A) = a11a22 − a12a21,
det(B) = a11(a22a33 − a23a32) − a12(a21a33 − a23a31) + a13(a21a32 − a22a31).
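The cofactor expansion of Example 2.6 can be checked against a library routine; the 3 × 3 matrix below is an arbitrary choice for illustration:

```python
import numpy as np

B = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Expansion along the first row, exactly as in Example 2.6.
det_manual = (B[0, 0] * (B[1, 1] * B[2, 2] - B[1, 2] * B[2, 1])
              - B[0, 1] * (B[1, 0] * B[2, 2] - B[1, 2] * B[2, 0])
              + B[0, 2] * (B[1, 0] * B[2, 1] - B[1, 1] * B[2, 0]))
print(det_manual, np.linalg.det(B))  # both 8.0
```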

A matrix Q is orthogonal if Q = (q1, . . . , qn) with 〈qi, qj〉 = δij, where

δij = 0 if i ≠ j, and δij = 1 if i = j.

As the (i, j)-th entry of Q>Q is given by 〈qi, qj〉, the orthogonality of Q can succinctly be characterized by the requirement Q>Q = 1. In particular, Q> = Q−1, and the columns (and rows) of an orthogonal matrix form an orthonormal basis of Rn. Orthogonal matrices have the property that 〈Qx, Qy〉 = 〈x, y〉. From this it follows that orthogonality of vectors is preserved under orthogonal transformations. The determinant of an orthogonal matrix is det(Q) = ±1. As the product of orthogonal matrices is again orthogonal, the set of orthogonal n × n matrices forms a group, commonly denoted by O(n).

Example 2.7. Consider the three matrices

A = ( 1 2        B = ( 2 3        C = ( 1/√2 −1/√2
      2 4 ),           3 2 ),           1/√2  1/√2 ).

The matrices A and B are symmetric, while C is not. The matrices B and C are invertible, with inverses

B−1 = ( −0.4  0.6        C−1 = (  1/√2 1/√2
         0.6 −0.4 ),             −1/√2 1/√2 ).

The matrix A is not invertible, since the second column is a multiple of the first. The kernel of A is the linear span of (−2, 1)>. The matrix C is orthogonal (this can be seen by checking that the columns or rows are orthonormal, or by looking at the expression of the inverse above).
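A sketch verifying the claims of Example 2.7 numerically (assuming NumPy):

```python
import numpy as np

B = np.array([[2.0, 3.0], [3.0, 2.0]])
C = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2)

print(np.linalg.inv(B))                  # [[-0.4  0.6], [ 0.6 -0.4]]
print(np.allclose(C.T @ C, np.eye(2)))   # True: C is orthogonal, so C^-1 = C^T
```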


Eigenvalues

A vector u ≠ 0 is an eigenvector of A ∈ Rn×n if there exists a λ ∈ C such that

Au = λu.

Such a number λ is called an eigenvalue of A. Note that eigenvectors are only defined up to scaling: if u is an eigenvector, then so is µu for any non-zero µ ∈ R.

From the definition of the determinant, the function λ ↦ det(λ1 − A) is a polynomial of degree at most n, called the characteristic polynomial of A. The eigenvalues are the roots of this polynomial,

det(λ1 − A) = 0.

The eigenvalues can be complex numbers, and appear in complex conjugate pairs. If the matrix A is symmetric, then the eigenvalues are all real numbers. Two important quantities, the determinant and the trace of a matrix (corresponding, up to sign, to the highest and lowest coefficients of the characteristic polynomial) can be expressed in terms of the eigenvalues:

det(A) = λ1 · · · λn, trace(A) := a11 + · · · + ann = λ1 + · · · + λn.

A matrix has a zero eigenvalue if and only if it is singular. Eigenvalues may occur with multiplicity.
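The two identities relating eigenvalues to determinant and trace are easy to check numerically; the symmetric matrix below is an arbitrary example:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
eigvals = np.linalg.eigvals(A)

print(np.prod(eigvals), np.linalg.det(A))  # both 5.0
print(np.sum(eigvals), np.trace(A))        # both 5.0
```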

Norms

A norm in Rn is a function ‖·‖ that satisfies the following three properties:

1. ‖x‖ ≥ 0 for all x ∈ Rn, and ‖x‖ = 0 if and only if x = 0;

2. ‖λx‖ = |λ|‖x‖ for λ ∈ R and x ∈ Rn;

3. ‖x + y‖ ≤ ‖x‖+ ‖y‖ for x,y ∈ Rn.

Three important examples of norms are the following:

1. The 1-norm: ‖x‖1 = ∑_{i=1}^n |xi|;

2. The 2-norm: ‖x‖2 = √(∑_{i=1}^n xi^2);

3. The ∞-norm: ‖x‖∞ = max_{1≤i≤n} |xi|.

Example 2.8. Let x = (2,−3, 4)>. Then ‖x‖1 = 9, ‖x‖2 = √29, and ‖x‖∞ = 4.

Note that the 2-norm, also called the Euclidean norm, can be defined as

‖x‖2^2 = x>x = 〈x, x〉,


that is, it is the norm induced by the Euclidean inner product. From this it follows that the 2-norm does not change under orthogonal transformations: if Q ∈ O(n) and x ∈ Rn, then ‖Qx‖2 = ‖x‖2. Orthogonal transformations in R2 and R3 correspond to rotations and reflections, so it is intuitively clear that these don’t change distances.

The unit sphere with respect to a norm is the set {x ∈ Rn : ‖x‖ = 1}, and the (closed) unit ball is the set {x ∈ Rn : ‖x‖ ≤ 1}. The unit spheres with respect to the 2-norm, the 1-norm and the ∞-norm in R2 are shown in the following diagram.

[Figure: unit spheres of the 2-, 1- and ∞-norms in R2.]

The unit sphere with respect to the 2-norm in Rn is usually denoted by Sn−1. The 1-, 2- and ∞-norms are equivalent, in the sense that they can be bounded in terms of each other. In particular,

‖x‖∞ ≤ ‖x‖2 ≤ √n ‖x‖∞, ‖x‖∞ ≤ ‖x‖1 ≤ n‖x‖∞. (3)

The inner product and the 2-norm are related by the Cauchy–Schwarz inequality,

|〈x, y〉| ≤ ‖x‖2 ‖y‖2,

with equality if and only if x and y are linearly dependent. As a consequence of the Cauchy–Schwarz inequality we get

−1 ≤ 〈x, y〉 / (‖x‖2 ‖y‖2) ≤ 1.

The angle between vectors x and y is the number θ ∈ [0, 2π) such that

cos(θ) = 〈x, y〉 / (‖x‖2 ‖y‖2).

If x and y are orthogonal, then cos(θ) = 0 and θ ∈ {π/2, 3π/2}.

Norms are an important device to measure the size of vectors. In order to measure the amount by which a linear transformation (matrix) distorts vectors, we need the concept of matrix norms. A matrix norm is a function on the set of matrices that is a norm when considering a matrix as a vector, and in addition satisfies the condition

‖AB‖ ≤ ‖A‖‖B‖.

The most important examples are given by the operator norms. Given a vector norm


‖x‖, the associated matrix norm is defined as

‖A‖ = max_{x≠0} ‖Ax‖/‖x‖ = max_{‖x‖≤1} ‖Ax‖.

The matrix norms ‖A‖1, ‖A‖2, ‖A‖∞ are the operator norms that arise when using the 1-, 2- and ∞-norms. They can be conveniently characterized as follows:

• ‖A‖1 = maxj ∑_{i=1}^n |aij| (the maximum column sum);

• ‖A‖∞ = maxi ∑_{j=1}^n |aij| (the maximum row sum);

• ‖A‖2 = √(λmax(A>A)).

Here, λmax denotes the largest eigenvalue of the symmetric matrix A>A. If A is symmetric, then A>A = A^2, and the eigenvalues of A^2 are the squares of the eigenvalues of A. It follows that for symmetric A, ‖A‖2 = maxi |λi(A)|, the largest absolute value of an eigenvalue of A.

Example 2.9. Let

A = (  1 −2
      −2  1 ).

Then ‖A‖1 = ‖A‖∞ = 3 and ‖A‖2 = 3.
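NumPy's norm routine implements all four matrix norms discussed here, so Example 2.9 can be checked directly:

```python
import numpy as np

A = np.array([[1.0, -2.0], [-2.0, 1.0]])

print(np.linalg.norm(A, 1))       # 3.0, maximum column sum
print(np.linalg.norm(A, np.inf))  # 3.0, maximum row sum
print(np.linalg.norm(A, 2))       # 3.0, sqrt of largest eigenvalue of A^T A
print(np.linalg.norm(A, 'fro'))   # sqrt(10), Frobenius norm (defined below)
```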

A special case is the dual norm of a vector norm: the dual of a given norm ‖·‖ is defined as

‖x‖∗ = max_{‖y‖≤1} 〈x, y〉.

This is the operator norm of x>, considered as a 1 × n matrix. The dual of the dual norm is the norm itself. The dual norm of the 2-norm is again the 2-norm, while the 1-norm and the ∞-norm are dual to each other.

In addition to the operator norms, an important matrix norm is the Frobenius norm of a matrix A ∈ Rn×n,

‖A‖F = √(∑_{i,j=1}^n aij^2).

This is just the 2-norm of A interpreted as a vector with n^2 entries. The 2-norm and the Frobenius norm have the important property of being orthogonally invariant, which means that for any Q ∈ O(n),

‖QA‖2 = ‖AQ‖2 = ‖A‖2, ‖QA‖F = ‖AQ‖F = ‖A‖F.

Orthogonal invariance allows one to simplify a matrix without changing the norm.


Positive semidefinite matrices

If A ∈ Sn is a symmetric matrix and u ∈ Rn an eigenvector with ‖u‖2 = 1 and corresponding eigenvalue λ, then u>Au = λu>u = λ. In particular, the largest and smallest eigenvalues are given by

λ1 = max_{‖u‖2=1} u>Au, λn = min_{‖u‖2=1} u>Au.

A symmetric matrix A is called positive semidefinite, written A ⪰ 0, if x>Ax ≥ 0 for all x ∈ Rn, and positive definite, written A ≻ 0, if x>Ax > 0 for all x ≠ 0. Equivalently, a symmetric matrix is positive semidefinite if all its eigenvalues are non-negative, and positive definite if they are all positive. The set of positive semidefinite symmetric matrices in Rn×n is denoted by Sn+, while the set of positive definite matrices is Sn++.

An inner product (or scalar product) on Rn is a function

〈·, ·〉 : Rn × Rn → R, (x, y) ↦ 〈x, y〉

that is bilinear (linear in each of the two arguments), symmetric (〈x, y〉 = 〈y, x〉), and satisfies 〈x, x〉 ≥ 0, with 〈x, x〉 = 0 if and only if x = 0. The standard inner product 〈x, y〉 = x>y is an example, and the notation 〈·, ·〉 usually refers to this product. More generally, every matrix A ∈ Sn++ defines an inner product by

〈x, y〉A := 〈x, Ay〉 = x>Ay.

The associated norm is ‖x‖A = √〈x, x〉A. The unit sphere with respect to this norm,

E = {x ∈ Rn : x>Ax = 1},

is an ellipsoid, where the lengths of the largest and smallest semi-axes are the square roots of the largest and smallest eigenvalues of A−1.
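Positive definiteness can be tested via the eigenvalues, or by attempting a Cholesky factorisation (introduced below). A minimal sketch with an arbitrary test matrix:

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])

print(np.linalg.eigvalsh(A))    # [1. 3.]: all positive -> positive definite
L = np.linalg.cholesky(A)       # raises LinAlgError if A is not positive definite
print(np.allclose(L @ L.T, A))  # True
```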

Matrix decompositions

Matrices can be represented as products of simpler matrices. Important examples are:

1. QR decomposition. A matrix A ∈ Rm×n can be written as

A = QR,

where Q ∈ Rm×m is an orthogonal and R ∈ Rm×n an upper triangular matrix. Gram–Schmidt orthogonalisation produces such a decomposition.

2. LU decomposition. A square matrix A ∈ Rn×n can be written as

A = LU,

where L ∈ Rn×n is a lower triangular and U ∈ Rn×n an upper triangular matrix. Gaussian elimination produces such a decomposition.


3. Symmetric eigenvalue decomposition. A symmetric matrix A ∈ Sn can be written as

A = QΛQ>,

where Q is an orthogonal matrix with the eigenvectors as columns, and Λ a diagonal matrix with the eigenvalues on the diagonal.

4. Cholesky decomposition. A positive definite symmetric matrix A ∈ Sn++ can be factored as

A = LL>,

with L ∈ Rn×n lower-triangular and with strictly positive diagonal entries.

One of the most powerful matrix decompositions is the singular value decomposition (SVD). It states that any m × n matrix A can be written as

A = UΣV>,

where U is an m × m orthogonal matrix, V an n × n orthogonal matrix, and Σ a diagonal matrix with the singular values σ1 ≥ · · · ≥ σ_{min{m,n}} on the diagonal. The singular values are the square roots of the eigenvalues of A>A. The singular values are related to the matrix 2-norm and Frobenius norm of A ∈ Rn×n as follows:

‖A‖2 = σ1(A), ‖A‖F = √(∑_{i=1}^n σi^2(A)).

If A is symmetric, the singular values are the absolute values of the eigenvalues of A.

Matrix decompositions can help reduce a problem into one involving simpler (orthogonal, triangular) matrices. For example, to solve Ax = b, one can first compute A = QR, solve the simpler system of equations Qy = b by computing y = Q>b, and then solve the triangular system Rx = y by back-substitution.
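A sketch of this QR-based solution strategy (np.linalg.solve is used for the triangular system here; a dedicated triangular solver such as scipy.linalg.solve_triangular would exploit the structure):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)
y = Q.T @ b                # solve Qy = b using Q^-1 = Q^T
x = np.linalg.solve(R, y)  # solve the triangular system Rx = y
print(x, np.allclose(A @ x, b))
```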

Example 2.10. Consider the matrix

A = (  2 −1
      −1  2 ).

The QR decomposition is

Q = (1/√5) ( −2 1        R = (1/√5) ( −5 4
              1 2 ),                   0 3 ),

the LU decomposition is

L = (    1 0        U = ( 2  −1
      −1/2 1 ),           0 3/2 ),

the symmetric eigenvalue decomposition and the SVD are given by

Λ = Σ = ( 3 0        Q = U = V = (  cos(π/4) sin(π/4)
          0 1 ),                   −sin(π/4) cos(π/4) ),


and the Cholesky decomposition is given by

L = (1/√2) (  2 0
             −1 √3 ).

The eigenvalue decomposition shows how to visualize the ellipse {Ax : ‖x‖2 = 1}: applying the transformation Ax = QΛQ>x corresponds to rotating the vector x clockwise by an angle of π/4, then stretching by a factor of 3 in the x-direction, and then rotating back.
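The factorisations of Example 2.10 can be reproduced with NumPy; note that library routines may pick different sign conventions than the hand-computed factors, so the check is that the factors multiply back to A:

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])

Q, R = np.linalg.qr(A)       # QR decomposition
w, V = np.linalg.eigh(A)     # symmetric eigendecomposition (w ascending)
L = np.linalg.cholesky(A)    # Cholesky factor
U, s, Vt = np.linalg.svd(A)  # singular value decomposition

print(np.allclose(Q @ R, A))                 # True
print(np.allclose(V @ np.diag(w) @ V.T, A))  # True
print(np.allclose(L @ L.T, A))               # True
print(s)                                     # [3. 1.]
```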

3 Calculus

We write C([a, b]) = C0([a, b]) for the set of continuous functions on an interval [a, b], and for k ≥ 1 we write Ck([a, b]) for the set of functions continuous on [a, b] whose first k derivatives f′, . . . , f^(k) exist and are continuous on (a, b). In the above definition we allow a = −∞ or b = ∞. If a, b ∈ R, then any function f ∈ C([a, b]) is bounded. The infimum (greatest lower bound) and supremum (smallest upper bound) of a function f on an interval [a, b] are defined as

inf_{x∈[a,b]} f(x) = max{y ∈ R : ∀x ∈ [a, b], f(x) ≥ y},

sup_{x∈[a,b]} f(x) = min{y ∈ R : ∀x ∈ [a, b], f(x) ≤ y}.

Again, we allow the “values” −∞ and ∞. If the infimum is attained (i.e., there exists x∗ such that f(x∗) = inf_{x∈[a,b]} f(x)), then we write min_{x∈[a,b]} f(x), and similarly max if the supremum is attained. Any f ∈ C([a, b]) for a, b ∈ R attains its infimum and supremum on [a, b].

Three important concepts are the Intermediate Value Theorem, the Mean Value Theorem, and the Taylor expansion.

Theorem (Intermediate Value Theorem). If f ∈ C([a, b]) and if y satisfies

inf_{x∈[a,b]} f(x) ≤ y ≤ sup_{x∈[a,b]} f(x),

then there exists ξ ∈ [a, b] such that f(ξ) = y. In particular, the infimum and supremum are attained.

Theorem (Mean Value Theorem). Let f ∈ C1([a, b]) and let x, x0 ∈ (a, b) with x ≠ x0. Then there exists a number ξ ∈ (x0, x) (or (x, x0) if x < x0) such that

f(x) = f(x0) + f′(ξ)(x − x0).


This can also be written as

f′(ξ) = (f(x) − f(x0)) / (x − x0).

[Figure 3.1: MVT: there exists a point ξ at which the slope (derivative) is the same as that of the secant connecting (x0, f(x0)) and (x, f(x)).]

The Mean Value Theorem is a special case of the Taylor expansion.

Theorem (Taylor expansion). Let f ∈ C^(n+1)([a, b]) and let x, x0 ∈ (a, b) with x ≠ x0. Then there exists ξ ∈ (x0, x) (or (x, x0) if x < x0) such that

f(x) = f(x0) + f′(x0)(x − x0) + (1/2) f′′(x0)(x − x0)^2 + · · ·
     + (f^(n)(x0)/n!) (x − x0)^n + (f^(n+1)(ξ)/(n + 1)!) (x − x0)^(n+1).

The first (n + 1) terms of the above sum can be seen as an approximation to the function f that becomes more accurate as n increases. The last term is known as the truncation error in numerical approximation.

As an example, consider the Taylor expansion of the sine function at x0 = 0,

sin(x) = x − x^3/3! + x^5/5! − · · ·

The Taylor approximation to different orders is illustrated in the following figure.

[Figure 3.2: Taylor expansion of sin(x).]
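A short sketch of how the truncation error shrinks as more terms of the sine series are kept:

```python
import math

def sin_taylor(x, n_terms):
    # Partial sum x - x^3/3! + x^5/5! - ... with n_terms terms.
    return sum((-1)**k * x**(2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(n_terms))

x = 1.5
for n in [1, 2, 3, 5]:
    approx = sin_taylor(x, n)
    print(n, approx, abs(approx - math.sin(x)))  # error decreases with n
```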


A special case of the Mean Value Theorem is Rolle’s Theorem.

Theorem (Rolle’s Theorem). Let f ∈ C1([a, b]) with f(a) = f(b). Then there exists a number ξ ∈ (a, b) such that f′(ξ) = 0.

[Figure 3.3: Rolle’s Theorem: there exist points with “flat” slope (derivative zero).]

The intuition is that if you walk across mountains and arrive at a point with the same elevation as where you started, then there must be places on the way where the slope is 0, that is, a local maximum or minimum. Of importance is also the following variant of the Mean Value Theorem.

Theorem (Integral Mean Value Theorem). Let f, g ∈ C([a, b]) and assume that f(x) does not change sign on [a, b]. Then there exists a ξ ∈ (a, b) such that

∫_a^b f(x)g(x) dx = g(ξ) ∫_a^b f(x) dx.

Topology

The open ball of radius ε around p ∈ Rn is defined as

Bn(p, ε) = {x : ‖x − p‖2 < ε}.

We write Bn := Bn(0, 1) for the (open) unit ball. A subset U ⊆ Rn is called open if for every p ∈ U there exists an ε > 0 such that B(p, ε) ⊂ U. A set C is closed if its complement Rn \ C is open. The closure cl S of a set S ⊆ Rn is the intersection of all closed sets containing S, while the interior int S is the union of all open sets contained in S. The boundary of S is defined as bd S = cl S \ int S. For example, the boundary of the open unit ball is the unit sphere

bd Bn = Sn−1 = {x ∈ Rn : ‖x‖2 = 1}.

The superscript n − 1 refers to the fact that this set is a manifold of dimension n − 1. A neighbourhood N of a point x ∈ Rn is a set such that there exists an open set U with x ∈ U ⊆ N. An open neighbourhood is a neighbourhood that is open. Note


that if U1 ⊆ Rn and U2 ⊆ Rm are open sets, then the product U1 × U2 ⊆ Rn+m is also open.

Any subset S ⊆ Rn inherits a topological structure from Rn, where the open sets in S are the sets of the form U ∩ S, with U ⊆ Rn open. If a non-empty set S is contained in a lower-dimensional linear subspace V ⊂ Rn, say with dim V = k < n, then S is never open as a subset of Rn. However, it can be open relative to its linear span,

span(S) = {∑_{i=1}^k λixi : λ1, . . . , λk ∈ R, x1, . . . , xk ∈ S}.

We call a set S relatively open or relatively closed if it is open or closed in the induced topology on span(S). Another way of defining this notion goes as follows. Let b1, . . . , bk be an orthonormal basis of span(S), with B = (b1, . . . , bk), and consider the map

ϕB : Rk → span(S), ϕB(x) = ∑_{i=1}^k xibi.

Then a set S is relatively open or relatively closed if the preimage

ϕB^(−1)(S) = {x : ϕB(x) ∈ S}

is open or closed in Rk. Based on these notions, one defines the relative closure relcl S and relative interior relint S just as before.

[Figure 3.4: The disk without boundary on the xy-plane is relatively open, and is the relative interior of the disk with boundary.]

A subset S ⊆ Rn is bounded if there exists a number M > 0 such that ‖x‖2 < M for all x ∈ S. Invoking the equivalence of norms (3), one sees that this definition does not depend on the norm chosen. A set K ⊆ Rn is called compact if it is closed and bounded. Equivalently, every cover of K with open sets contains a finite subcover. A function f : Rn → Rm is continuous if for every open set U ⊂ Rm,

f^(−1)(U) := {x ∈ Rn : f(x) ∈ U}

is an open subset of Rn. A function defined on a subset S ⊆ Rn is said to be continuous if it is continuous with respect to the induced topology. The set of continuous functions


f : S → Rm is denoted by C(S, Rm) = C0(S, Rm). If f ∈ C(K, R), where K is compact, then f is bounded, and attains its infimum and supremum there: there exist x∗, x^∗ ∈ K such that

inf_{x∈K} f(x) = f(x∗), sup_{x∈K} f(x) = f(x^∗).

A stronger notion is that of a Lipschitz continuous function. A function f : S → Rm is called Lipschitz continuous with Lipschitz constant L > 0 if for all x, y ∈ S,

‖f(x) − f(y)‖2 ≤ L‖x − y‖2.

Notions about continuity of functions can be conveniently stated in terms of sequences. A sequence of points {xk}k∈N ⊂ Rn (for short, {xk}) converges to x ∈ Rn as k → ∞ with respect to a norm ‖·‖, written xk → x, if the sequence of numbers ‖xk − x‖ converges to 0,

lim_{k→∞} ‖xk − x‖ = 0.

Formally, this means that for every ε > 0 there exists an index k0 such that for all k > k0, ‖xk − x‖ < ε. From the equivalence of norms (3) it follows that if a sequence converges with respect to one norm, it also converges with respect to the other norms.

A subsequence of a sequence S = {xk} is an infinite subset of S (with the induced ordering). A limit point of a sequence S = {xk} is a point x that is the limit of a subsequence of S. Formally, for every ε > 0 and every k0 there exists a k > k0 such that ‖xk − x‖ < ε. A sequence {xk} is called a Cauchy sequence if for every ε > 0 there exists an index k0 > 0 such that for all k, ℓ > k0, ‖xk − xℓ‖2 < ε. The vector space Rn with the 2-norm (or any other norm) is a Banach space, which means that every Cauchy sequence converges.

All the topological notions discussed earlier have an interpretation in terms of sequences and limits:

1. A set C is closed if and only if for every sequence {xk} ⊂ C, all limit points are in C;

2. The closure of a set S is the set of all limit points of sequences in S;

3. A set K is compact if and only if every sequence of points {xk} in K has a limit point in K.

Given a function f : Ω → Rm, where Ω ⊆ Rn, and x∗ ∈ Ω, we say that f is continuous at x∗ if

lim_{x→x∗} f(x) = f(x∗).

Formally, for every ε > 0 there exists a δ > 0 such that whenever ‖x − x∗‖ < δ, ‖f(x) − f(x∗)‖ < ε. This means that for every sequence of points {xk} with lim_{k→∞} xk = x∗, the sequence f(xk) → f(x∗) as k → ∞ with respect to some norm on Rm.


Differentiable functions

A function f : Rn → Rm is called (Fréchet) differentiable at x0 ∈ Rn if there exists a linear map Jf(x0) : Rn → Rm such that

lim_{h→0} ‖f(x0 + h) − f(x0) − Jf(x0)h‖2 / ‖h‖2 = 0.

A function is differentiable on an open subset U ⊆ Rn if it is differentiable at every x ∈ U. If f(x) = (f1(x), . . . , fm(x))> is differentiable, then all the partial derivatives exist, and Jf(x0) is represented by the Jacobian matrix

Jf(x0) = ( ∂f1/∂x1 · · · ∂f1/∂xn
              ⋮     ⋱      ⋮
           ∂fm/∂x1 · · · ∂fm/∂xn ),

where the partial derivatives are evaluated at x0. If all the partial derivatives exist and are continuous in a neighbourhood of x0 (f is then called continuously differentiable), then f is differentiable at x0.

If m = 1, then Jf(x0) is the transpose of the gradient ∇f(x0) of f at x0,

∇f(x0) = (∂f/∂x1, . . . , ∂f/∂xn)>.

The gradient points in the direction in which f increases the most.

[Figure 3.5: A surface, level sets, and the gradient.]

A convenient way to visualise a function f : R2 → R is through level sets {x ∈ R2 : f(x) = c}. For each c ∈ R, such a level set defines a curve in R2, the curve on which the function value does not change. The gradient is always orthogonal to the level set, pointing in the direction in which f increases the most (see Figure 3.5).

If the gradient, considered as a map Rn → Rn, is itself differentiable at x0, then the Jacobian matrix of the gradient is called the Hessian matrix,

∇2f(x0) = ( ∂2f/∂x1∂x1 · · · ∂2f/∂x1∂xn
                 ⋮      ⋱       ⋮
            ∂2f/∂xn∂x1 · · · ∂2f/∂xn∂xn ).


Since

∂2f/∂xi∂xj = ∂2f/∂xj∂xi,

the Hessian is a symmetric matrix.

The directional derivative Dvf(x0) of a function f : Rn → Rm in direction v ∈ Rn at x0 is defined as

Dvf(x0) = lim_{h→0} (f(x0 + hv) − f(x0)) / h.

In the special case where v = ei, we obtain the partial derivative with respect to xi,

∂f/∂xi (x0) = Dei f(x0).

If f : Rn → R is differentiable with continuous derivative near x0, then

Dvf(x0) = ∇f(x0)>v = 〈∇f(x0), v〉.
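This identity is convenient for checking hand-computed gradients against finite differences. A minimal sketch for the (arbitrarily chosen) function f(x, y) = x^2 + 3xy:

```python
import numpy as np

def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x0 = np.array([1.0, 2.0])
v = np.array([1.0, 1.0])
h = 1e-6

dv_numeric = (f(x0 + h * v) - f(x0)) / h  # finite-difference directional derivative
print(dv_numeric, grad_f(x0) @ v)         # both approx. 11
```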

If f : Rn → Rm and g : Rm → Rp are differentiable in a neighbourhood of x0 ∈ Rn and of f(x0) ∈ Rm, respectively, then the composition h = g ∘ f : Rn → Rp is differentiable in a neighbourhood of x0, and the Jacobian matrix is given by the chain rule:

Jh(x0) = Jg(f(x0)) Jf(x0).

If n = 1, then f : R → Rn is called a curve, and we write

Jf = df/dt = ḟ = (ḟ1, . . . , ḟn)> ∈ Rn

for the derivative of the curve. If ḟ(t0) = v ∈ Rn and g : Rn → R, then by the chain rule, the derivative of g ∘ f : R → R is the directional derivative of g in the direction v,

d(g ∘ f)/dt (t0) = 〈∇g(f(t0)), v〉.

Before going on to deal with higher derivatives, we state the generalisation of the Mean Value Theorem to higher dimensions.

Theorem (Multivariate Mean Value Theorem). Let f ∈ C1(U) for an open set U with x0, x ∈ U, x ≠ x0. Then there exists t ∈ (0, 1) such that

f(x) − f(x0) = 〈∇f(tx + (1 − t)x0), x − x0〉.

Note that tx + (1 − t)x0 parametrises the line segment connecting x and x0.

For a tuple of natural numbers α = (α1, . . . , αn), set |α| = ∑_{i=1}^n αi, and define the higher order partial derivative

Dαf(x) = ∂^|α| f(x) / (∂x1^α1 · · · ∂xn^αn).


For a set S ⊆ Rn, denote by Ck(S, Rm) the set of functions f such that all partial derivatives Dαf with |α| ≤ k exist and are continuous on int S. If m = 1, we write Ck(S) := Ck(S, R).

Define, for a vector x and a multi-index α,

xα := x1^α1 · · · xn^αn, α! := α1! · · · αn!.

We then have the Taylor expansion around a point x0,

f(x) = ∑_{|α|≤k} (Dαf(x0)/α!) (x − x0)^α + ∑_{|α|=k} rα(x) (x − x0)^α,

with rα(x) → 0 as x → x0.

If a differentiable function f(x) has a local minimum or maximum at a point x, then this point satisfies ∇f(x) = 0, that is, it is a critical point. The Lagrange multiplier theorem says something about local extrema under certain constraints.

Theorem (Lagrange multipliers). Let x∗ be a maximum of f(x) under the constraint g(x) = c (that is, a maximum among all points x such that g(x) = c). Then there exists a λ ∈ R such that

∇f(x∗) = λ∇g(x∗).

The Lagrangian of a function f(x) with constraint g(x) = c is the function Λ : Rn × R → R defined by

Λ(x, λ) = f(x) − λ(g(x) − c).

The Lagrange multiplier theorem then says that if x∗ is a maximum point of f(x) under the constraint g(x) = c, then there exists λ ∈ R such that the pair (x∗, λ) is a critical point of the Lagrangian Λ(x, λ).
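A small numerical illustration (assuming SciPy is available): maximise f(x, y) = x + y on the unit circle g(x, y) = x^2 + y^2 = 1 by solving the critical-point equations of the Lagrangian with a root finder.

```python
import numpy as np
from scipy.optimize import fsolve

def lagrange_conditions(z):
    x, y, lam = z
    return [1 - 2 * lam * x,      # d/dx: grad f = lam * grad g
            1 - 2 * lam * y,      # d/dy
            x**2 + y**2 - 1]      # the constraint g(x, y) = 1

x, y, lam = fsolve(lagrange_conditions, [0.5, 0.5, 0.5])
print(x, y, lam)  # x = y = 1/sqrt(2) ~ 0.7071, lambda ~ 0.7071
```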

The Implicit Function Theorem is one of the most important results in analysis, and underlies much of differential geometry and physics. Let f : Rn × Rm → Rn be a function that is continuously differentiable in a neighbourhood of a point (x0, y0), with x0 ∈ Rn and y0 ∈ Rm. The Jacobian Jxf(x0, y0) with respect to the first set of n coordinates consists of the first n columns of the Jacobian matrix,

Jxf(x0, y0) = ( ∂f1/∂x1(x0, y0) · · · ∂f1/∂xn(x0, y0)
                       ⋮         ⋱          ⋮
                ∂fn/∂x1(x0, y0) · · · ∂fn/∂xn(x0, y0) ).

The interpretation is that we consider f as a function of only the first set of coordinates, with the remaining ones (denoted by y) considered as parameters.

Theorem (Implicit Function Theorem). Let f : Rn × Rm → Rn be k times continuously differentiable in an open neighbourhood of (x0, y0) ∈ Rn × Rm, and assume that f(x0, y0) = 0. Assume further that the Jacobian Jxf(x0, y0) ∈ Rn×n in the first n coordinates is non-singular at (x0, y0). Then there exist an open neighbourhood y0 ∈ Uy ⊆ Rm and a function h ∈ Ck(Uy, Rn) such that


• h(y0) = x0,

• f(h(y),y) = 0 for y ∈ Uy.

Moreover, the Jacobian of h is given by

Jh(y) = −(Jxf(h(y), y))−1 Jyf(h(y), y)

for all y ∈ Uy.

Example 3.1. Let f(x, y) = x^2 + y^2 − 1 and (x0, y0) = (1, 0). The Jacobian of f in the first coordinate is

∂f/∂x (1, 0) = 2 ≠ 0,

which is non-singular. Choosing the neighbourhood Uy = (−1, 1), the open interval between −1 and 1, we get the function h : Uy → R as

h(y) = √(1 − y^2).

This function is defined and differentiable on (−1, 1), and satisfies

f(h(y), y) = h(y)^2 + y^2 − 1 = (1 − y^2) + y^2 − 1 = 0, y ∈ Uy.

The derivative of h can be computed using the chain rule: differentiating the identity f(h(y), y) = 0 gives

df(h(y), y)/dy = ∂f/∂x (h(y), y) · dh/dy (y) + ∂f/∂y (h(y), y) = 0,

from which we get

dh/dy (y) = − (∂f/∂y (h(y), y)) (∂f/∂x (h(y), y))−1 = −y (1 − y^2)^(−1/2).

In this example, the implicit function theorem just gives the usual way of parametrising part of the circle.

[Figure 3.6: The blue arc is parametrised by h(y) = √(1 − y^2) for y ∈ (−1, 1).]
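A quick numerical check of the formula for dh/dy against a finite difference:

```python
import numpy as np

def h(y):
    return np.sqrt(1 - y**2)

y, eps = 0.5, 1e-7
dh_numeric = (h(y + eps) - h(y)) / eps
dh_formula = -y * (1 - y**2)**(-0.5)
print(dh_numeric, dh_formula)  # both approx. -0.5774
```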


4 Finite precision arithmetic

In practical applications, one often cannot simply plug numbers into formulae and get exact results. Most numerical data also requires an infinite amount of storage (just try to store π on a computer!), but a piece of paper or a computer only has limited space. These are some of the reasons that lead us to work with approximations.

Measuring errors

To measure the quality of approximations, we use the concept of relative error. Given a quantity x and a computed approximation x̂, the absolute error is given by

Eabs(x̂) = |x̂ − x|,

while the relative error is given as

Erel(x̂) = |x̂ − x| / |x|.

The benefit of working with relative errors is that they are scale invariant. Absolute errors can be meaningless at times: for example, an error of one hour is irrelevant when estimating the age of Stan the Tyrannosaurus Rex at Manchester Museum, but it is crucial when determining the time of a lecture. That is because in the former case one hour corresponds to a relative error of the order 10^−11, while in the latter it is of the order 1.
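The two error measures in code, with the dinosaur example (assuming an age of roughly 68 million years, converted to hours, purely for illustration):

```python
def abs_error(x_approx, x):
    return abs(x_approx - x)

def rel_error(x_approx, x):
    return abs(x_approx - x) / abs(x)

age_hours = 68e6 * 365.25 * 24              # ~ 6e11 hours
print(rel_error(age_hours + 1, age_hours))  # ~ 1.7e-12: one hour is negligible
print(rel_error(2.0, 1.0))                  # 1.0: one hour on a 1-hour lecture
```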

Floating point and significant figures

The established way of representing real numbers on computers is floating-point arithmetic. In the double precision version of the IEEE standard for floating-point arithmetic, a number is represented using 64 bits, where a bit is either 1 or 0. A number is written

x = ±f × 2^e,

where f is a fraction in [0, 1] represented using 52 bits, e is the exponent, using 11 bits, and one bit is for the sign. There is a largest representable number, and there are gaps between representable numbers. The largest and smallest numbers representable in this form are of the order of ±10^308, enough for most practical purposes. A bigger concern are the gaps, which mean that the results of many computations almost always have to be rounded to the closest floating-point number.

When going through calculations without using a computer, we usually use the terminology of significant figures (s.f.) and work with 4 significant figures in base 10. For example, in base 10, √3 equals 1.732 to 4 significant figures. To count the number of significant figures in a given number, start with the first non-zero digit from the left and, moving to the right, count all the digits thereafter, counting final zeros if they are to the right of the decimal point. For example, 1.2048, 12.040, 0.012048,


0.0012040 and 1204.0 all have 5 significant figures (s.f.). In rounding or truncation of a number to n s.f., the original is replaced by the closest number with n s.f. An approximation x̂ of a number x is said to be correct to n significant figures if both x̂ and x round to the same n s.f. number.

Remark 4.1. Note that final zeros to the left of the decimal point may or may not be significant: the number 1204000 has at least 4 significant figures, but without any more information there is no way of knowing whether or not any more figures are significant. When 1203970 is rounded to 5 significant figures to give 1204000, an explanation that this has 5 significant figures is required. This could be made clear by writing it in scientific notation: 1.2040 × 10^6. In some cases we also have to agree whether to round up or round down: for example, 1.25 could equal 1.2 or 1.3 to two significant figures. If we agree on rounding up, then to say that a = 1.2048 to 5 s.f. means that the exact value of a satisfies 1.20475 ≤ a < 1.20485.

Example 4.2. Suppose we want to find the solutions of the quadratic equation

ax^2 + bx + c = 0.

The two solutions to this problem are given by

x1 = (−b + √(b^2 − 4ac)) / (2a), x2 = (−b − √(b^2 − 4ac)) / (2a). (4.1)

In principle, to find x1 and x2 one only needs to evaluate the expressions for given a, b, c. Assume, however, that we are only allowed to compute to four significant figures, and consider the particular equation

x^2 + 39.7x + 0.13 = 0.

Using the formula (4.1), we have, always rounding to four significant figures,

a = 1, b = 39.7, c = 0.13,
b^2 = 1576.09 = 1576 (to 4 s.f.), 4ac = 0.52 (to 4 s.f.),
b^2 − 4ac = 1575.48 = 1575 (to 4 s.f.), √(b^2 − 4ac) = 39.69.

Hence, the computed solutions (to 4 significant figures) are given by

x̂1 = −0.005, x̂2 = −39.69.

The exact solutions, however, are

x1 = −0.0032748 . . . , x2 = −39.6907 . . .

The solution x̂1 is completely wrong, at least if we look at the relative error:

|x̂1 − x1| / |x1| = 0.5268.


While the accuracy can be increased by increasing the number of significant figures during the calculation, such effects happen all the time in scientific computing, and the possibility of such effects has to be taken into account when designing numerical algorithms.

By analysing what causes the error, it is sometimes possible to modify the method of calculation in order to improve the result. In the present example, the problems are caused by the fact that b ≈ √(b^2 − 4ac), and therefore

(−b + √(b^2 − 4ac)) / (2a) = (−39.7 + 39.69) / 2

causes what is called “catastrophic cancellation”. A way out is provided by the observation that the two solutions are related by

x1 · x2 = c/a. (4.2)

When b > 0, the calculation of x2 according to (4.1) shouldn’t cause any problems; in our case we get −39.69 to four significant figures. We can then use (4.2) to derive x1 = c/(a x2) = −0.003275.
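The same cancellation occurs in double precision once b is large enough. The sketch below contrasts the naive formula with the product-based rescue (4.2); for b = 10^9 the naive root x1 comes out as exactly 0, while the stable variant recovers it:

```python
import math

def roots_naive(a, b, c):
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):
    # For b > 0, the root x2 = (-b - d)/(2a) is computed without cancellation;
    # x1 is then recovered from the product relation x1 * x2 = c / a.
    d = math.sqrt(b * b - 4 * a * c)
    x2 = (-b - d) / (2 * a)
    return c / (a * x2), x2

print(roots_naive(1.0, 1e9, 1.0))   # (0.0, -1e9): x1 is completely lost
print(roots_stable(1.0, 1e9, 1.0))  # (-1e-09, -1e9)
```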

There are other potential sources of error besides those introduced by rounding operations.

1. Overflow

2. Errors in the model

3. Human or measurements errors

4. Truncation or approximation errors

The first is rarely an issue, as we can represent numbers of order 10^308 on a computer. The second and third are important factors that need to be addressed when working on real-world problems. The fourth has to do with the fact that many computations are done approximately rather than exactly. For computing the exponential, for example, we might use a method that gives the approximation

e^x ≈ 1 + x + x^2/2.

As it turns out, many optimization problems work with approximations of the functions of interest, and the solution found is only an approximation to the “true” solution of the problem. Quantifying the quality of such an approximation is an important aspect in the design and analysis of optimization algorithms.