NUMERICAL PROGRAMMING 1 (CSE) [MA3305]

BRIEF LECTURE NOTES

TECHNICAL UNIVERSITY OF MUNICH, WS13/14

PROF. DR. FALK HANTE

Contents

0. Introduction and Preliminaries
   0.A. Introduction
   0.B. Math Refresher
1. Estimating Accuracy in Numerical Calculations
   1.1. Floating Point Arithmetics
   1.2. Conditioning
   1.3. Stability
2. Nonlinear Equations (Zero-Finding)
   2.1. Condition of zero-finding
   2.2. Newton’s method in R
   2.3. Fixed-point iterations
   2.4. Newton’s method in Rn
   2.5. Termination of Newton’s method
3. Interpolation
   3.1. Polynomial interpolation
   3.2. Spline-Interpolation
   3.3. Trigonometric Interpolation
4. Quadrature (Numerical Integration)
   4.1. Condition of the problem
   4.2. Basic quadrature rules
   4.3. Gaussian quadrature
   4.4. Romberg quadrature
5. Linear Systems of Equations (LSE)
   5.1. Condition number of a matrix
   5.2. LU-Factorization
   5.3. Cholesky Decomposition
   5.4. Classical iteration methods
   5.5. Multigrid methods
6. Least Squares Problems
   6.1. Normal Equations
   6.2. Condition
   6.3. Solution of Normal Equations
References

Date: Last update: February 6, 2014. These lecture notes are based on material for MA3305 from WS10/11 (Prof. Dr. Caroline Lasser) and WS11/12 (Prof. Dr. Miriam Mehl).

0. Introduction and Preliminaries

0.A. Introduction. A numerical method provides an approximation for a mathematical problem using a finite number of simple operations. These limitations can lead to deficiencies and pitfalls.

Examples.

• Cancellation when solving a quadratic equation.
• Numerical instability for a recurrence relation.

Such deficiencies can have dramatic consequences.

Examples.

• Vancouver Stock Exchange, 1983.
• Patriot missile failure, 1991.
• Sleipner A offshore platform, 1991.

15.10.13

Literature.

• [DB, Section 1.2.1, 1.2.2, Examples 2.1.2, 2.2.2]
• http://www.ima.umn.edu/~arnold/disasters/sleipner.html [1]

0.B. Math Refresher.

0.B.1 Notation and Symbols. Quantifiers, binary operations, implications, sum, product, infinity, Kronecker delta.

0.B.2 Elementary Algebra. Sets (elements, union, intersection, difference, subsets), important sets (natural numbers, integers, rational, real and complex numbers, empty set), Cartesian product, functions (injective, surjective, bijective, composition, inverse function), relations (reflexive, symmetric, transitive, equivalence relation), sequences and convergence, algebraic structures (groups, rings, fields), exponents and roots, logarithm, complex numbers (real and imaginary part, conjugation, absolute value), polynomials (degree, roots, fundamental theorem of algebra, complex conjugate root theorem), topology (open and closed sets, boundedness, compactness, density), infimum/supremum and minimum/maximum, limit inferior/superior, trigonometry (sine, cosine, tangent, cotangent, trigonometric identities), Euler’s formulas, combinatorics (permutations and combinations), convex combinations, convexity of sets, convexity/concavity of functions.

0.B.3 General Comments. Definitions, theorems/propositions, lemma, corollary, proofs (forward, contradiction, counterexample, induction).

21.10.13

0.B.4 Linear Algebra. Matrices (square, rectangular, triangular, diagonal, identity, transposition, adjoint, symmetry, sum and difference, product, singularity, matrix inverse, matrix powers, matrix exponential), determinant (cofactor expansion), vector spaces (axioms, dimension), linear (in-)dependence, linear combination and basis, span, inner product, orthogonal and orthonormal, Euclidean norm, Cauchy-Schwarz, other norms (1-norm, max-norm, p-norm), equivalence of norms in Rn, matrix norms (submultiplicative, consistent, compatible, induced norms, matrix-1-norm, matrix-max-norm, Frobenius norm), characteristic equation, eigenvalue, eigenvalue equation, eigenvectors, spectrum, spectral radius, similarity transformation, diagonalization, linear maps (range and null space), column/row rank, rank of a matrix, complex matrices, Jordan normal form.

22.10.13

[1] Retrieved on 10/10/2013.

0.B.5 Calculus. Cauchy sequence, Cauchy criterion for convergence, subsequences and cluster points, limit of functions, continuity, intermediate value theorem, Landau symbols (big-O and little-o), derivative, higher order derivatives, notion of continuous differentiability, Taylor’s theorem, partial derivatives, Hessian, Laplacian, total derivative, directional derivative, Jacobian, divergence, curl, complex differentiability, Cauchy-Riemann equations, exponential function in the complex plane, series and partial sums, convergence and absolute convergence of series, ratio and root test, harmonic and geometric series, anti-derivative/indefinite integral, Riemann sum and Riemann integral, fundamental theorem of calculus, substitution, integration by parts, improper integrals, line integrals,

28.10.13

Riemann integration in Rn, Fubini’s theorem, Gauss’ theorem, periodic functions, Fourier integral, Fourier and Laplace transform, ordinary differential equations.

Literature.

• Fanchi: Math Refresher for Scientists and Engineers, 3rd Ed., Wiley, 2006.
• Anton: Elementary Linear Algebra, 9th Ed., Wiley, 2005.
• Anton: Calculus, 8th Ed., Wiley, 2005.

1. Estimating Accuracy in Numerical Calculations

1.1. Floating Point Arithmetics. Representing real/complex numbers on a digital machine presents the difficulties of a limited range and of gaps in the representation. In floating point arithmetics, $p$ digits $d_1, \dots, d_p$, an exponent $e$ and a base $b$ are used to represent real numbers as

\[ x = \pm b^{e} \left( \frac{d_1}{b} + \frac{d_2}{b^2} + \dots + \frac{d_p}{b^p} \right). \]

For $b = 10$, this means $x = \pm 10^{e} \cdot 0.d_1 d_2 \dots d_p$.

Definition (Floating point numbers).

\[ F = F(b, p, [e_{\min}, e_{\max}]) = \{ \pm m \cdot b^{e-p} : m \in \mathbb{N},\ b^{p-1} \le m < b^{p},\ e_{\min} \le e \le e_{\max} \}. \]

The range of $F$ is $x_{\min} = b^{e_{\min}-1} \le |x| \le b^{e_{\max}}(1 - b^{-p}) = x_{\max}$.

Example (Visualizing F(2, 3, [−1, 3]) in Matlab).

    function floatingpoint
    % Plot all elements of F(2,3,[-1,3]), together with 0, on the real line.
    b=2; p=3; emin=-1; emax=3; F = 0;
    for m = b^(p-1):b^p-1, F = [F, m*b.^([emin:emax]-p)]; end
    F = [-F,F]; plot(F, zeros(size(F)), 'r*')

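Provisionally, we define the machine precision as

\[ \varepsilon_{\text{machine}} = b^{1-p} \tag{1.1.2} \]

(the distance of 1 to the next larger $x \in F$).

Theorem 1.1.1. For all $x \in \mathbb{R}$ with $|x| \in [x_{\min}, x_{\max}]$, there exists $x' \in F$ such that

\[ |x - x'| \le \tfrac{1}{2}\, \varepsilon_{\text{machine}}\, |x|. \]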

Remark. Base b = 2 minimizes the mean square representation error.

Example (Distance of single in double precision, Matlab).

    function wobbling
    % Relative spacing of single precision numbers on [1,16].
    x=linspace(1,16,1e3);
    plot(x, eps(single(x))./x);

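For simplicity, we consider $F_\infty = F(b, p, (-\infty, \infty))$ and let $\mathrm{fl}\colon \mathbb{R} \to F_\infty$, $x \mapsto$ nearest $y \in F_\infty$ (in case of a tie, $\mathrm{fl}(x)$ is the $y$ with even $d_p$). For $x \in \mathbb{R}$, we say

\[ |\mathrm{fl}(x)| > x_{\max} \quad \text{overflow}, \qquad 0 < |\mathrm{fl}(x)| < x_{\min} \quad \text{underflow}. \]

By $\circledast$ we denote operations on $F$, where $*$ denotes the corresponding operation on $\mathbb{R}$, of which the classical set is $+, -, \times, \div$.

Definition (Fundamental axiom of floating point arithmetic). For all $x, y \in F_\infty$, there exists $\varepsilon$ with $|\varepsilon| < \tfrac{1}{2}\varepsilon_{\text{machine}}$ such that

\[ x \circledast y = \mathrm{fl}(x * y) = (x * y)(1 + \varepsilon). \tag{1.1.4} \]

A digital machine satisfying this axiom for all elementary operations is said to implement the standard model. However, some computers only satisfy a weaker model, where instead of (1.1.4) it only holds that

\[ x \circledast y = x(1 + \varepsilon_1) * y(1 + \varepsilon_2), \qquad |\varepsilon_{1,2}| \le \tfrac{1}{2}\varepsilon_{\text{machine}}. \]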

Example. The standard model implies that the evaluation of a polynomial by Horner’s scheme can be interpreted as the exact evaluation of a polynomial with perturbed coefficients.

Note that floating point arithmetic is in general not associative, i. e.,

\[ x \circledast (y \circledast z) \ne (x \circledast y) \circledast z. \]
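The following short MATLAB sketch illustrates the machine precision and the lack of associativity. It assumes IEEE double precision, i. e., b = 2 and p = 53; the variable names are chosen for illustration only.

```matlab
% Sketch, assuming IEEE double precision (b = 2, p = 53).
b = 2; p = 53;
eps_machine = b^(1-p);                   % provisional machine precision b^(1-p)
same_as_eps = (eps_machine == eps);      % MATLAB's eps is the spacing of doubles at 1
next_is_larger = (1 + eps_machine > 1);  % 1 + eps_machine is the next float after 1
half_is_lost = (1 + eps_machine/2 == 1); % a tie is rounded to the even neighbour 1

% Addition is not associative in floating point arithmetic.
lhs = (0.1 + 0.2) + 0.3;
rhs = 0.1 + (0.2 + 0.3);
not_assoc = (lhs ~= rhs);
fprintf('eps = %g, associativity violated: %d\n', eps_machine, not_assoc);
```

The two sums differ only in the last digit, which is exactly the size of perturbation that the factor (1 + ε) in the fundamental axiom allows for each operation.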

29.10.13

Sometimes, it is necessary to modify the definition of $\varepsilon_{\text{machine}}$ to be the smallest number such that the fundamental axiom of floating point arithmetic holds.

In complex floating point arithmetics, complex numbers are represented by pairs of real numbers and elementary operations are computed by real operations on such pairs. The fundamental axiom is still valid for complex numbers if $\varepsilon_{\text{machine}}$ is taken to be of order $2^{5/2}\, b^{1-p}$ on most machines.

Literature. [TB, Lecture 13], [DB, Section 2.2]

1.2. Conditioning. Let $f\colon X \to Y$ represent a problem with $X, Y$ normed vector spaces. The behavior of $f(x)$ with respect to small perturbations of the data $x$ is called conditioning of $f$.

Definition. The relative condition number $\kappa_f(x)$ of $f$ at $x$ is

\[ \kappa_f(x) = \lim_{\delta \to 0} \sup_{\|\Delta x\| < \delta} \frac{\|f(x + \Delta x) - f(x)\|}{\|f(x)\|} \cdot \frac{\|x\|}{\|\Delta x\|}. \]

We say that $f$ is well-conditioned or ill-conditioned if $\kappa_f$ is “small” or “large”, respectively. For $f \in C^1(X; Y)$, one obtains by Taylor’s theorem that

\[ \kappa_f(x) = \|J_f(x)\| \, \frac{\|x\|}{\|f(x)\|}, \tag{1.2.1} \]

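where $\|J_f(x)\|$ denotes the matrix norm of the Jacobian matrix of $f$ at $x$ induced by the norms on $X$ and $Y$. In most cases, we will have $X = \mathbb{R}^n$, $Y = \mathbb{R}^m$, with $\mathbb{R}^q$ equipped with one of the norms

\[ \|x\|_1 = \sum_{j=1}^{q} |x_j|, \qquad \|x\|_2 = \Big( \sum_{j=1}^{q} |x_j|^2 \Big)^{1/2}, \qquad \|x\|_\infty = \max\{ |x_j| : j = 1, \dots, q \}, \]

with $x = (x_1, \dots, x_q) \in \mathbb{R}^q$, $q = m, n$. The corresponding induced matrix norms are

\[ \|A\|_1 = \max\{ \|a_j\|_1 : j = 1, \dots, n \} \quad \text{(maximum column sum)}, \]
\[ \|A\|_2 = \sigma_{\max}(A) = \sqrt{\mu_1(A^\top A)} \quad \text{(largest singular value; } \mu_1(A^\top A) \text{ is the largest eigenvalue of } A^\top A\text{)}, \]
\[ \|A\|_\infty = \max\{ \|\tilde a_i\|_1 : i = 1, \dots, m \} \quad \text{(maximum row sum)}, \]

with $A \in \mathbb{R}^{m \times n}$, written in terms of its columns as $A = (a_1 | \dots | a_n)$ and in terms of its rows as $A = (\tilde a_1 | \dots | \tilde a_m)^\top$.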

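Example. The scalar multiplication $f\colon \mathbb{R} \to \mathbb{R}$, $x \mapsto \tfrac{1}{2}x$, is well-conditioned with $\kappa_f = 1$.

Examples.

• The square root $f\colon (0, \infty) \to \mathbb{R}$, $x \mapsto \sqrt{x}$, is well-conditioned with $\kappa_f = \tfrac{1}{2}$.
• Subtraction $f\colon \mathbb{R}^2 \to \mathbb{R}$, $x = (x_1, x_2) \mapsto x_1 - x_2$, satisfies

\[ \kappa_f(x) = \frac{2 \max\{|x_1|, |x_2|\}}{|x_1 - x_2|} \]

for all $x \in \mathbb{R}^2$ with $x_1 \ne x_2$, using (1.2.1) and the norm $\|\cdot\|_\infty$. Hence, subtraction is ill-conditioned when $x_1 \approx x_2$ (cancellation).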
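As a numerical illustration of the condition number of subtraction, the following MATLAB sketch perturbs two nearly equal numbers in the worst-case direction and compares the observed error amplification with the formula above. The particular values of x1, x2 and delta are assumptions chosen so that the rounding errors of the experiment itself stay negligible.

```matlab
% Sketch: observed amplification vs. kappa_f = 2*max(|x1|,|x2|)/|x1-x2|.
x1 = 1 + 2^-20; x2 = 1;          % nearly equal inputs (assumed test data)
delta = 2^-40;                   % relative size of the input perturbation

kappa = 2*max(abs(x1), abs(x2))/abs(x1 - x2);

% Worst-case perturbation in the max-norm: push x1 up and x2 down by delta*|x1|.
x1p = x1*(1 + delta);
x2p = x2 - delta*x1;

f0 = x1 - x2;                    % unperturbed result
f1 = x1p - x2p;                  % perturbed result
amp = (abs(f1 - f0)/abs(f0))/delta;   % relative output change / relative input change

fprintf('kappa = %.6g, observed amplification = %.6g\n', kappa, amp);
```

Both numbers are of order 2^21, so a relative input perturbation of size 10^-12 already destroys about six digits of the difference.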


Example. Polynomial rootfinding is typically ill-conditioned. This can be seen for the polynomial $p(x) = x^2 - 2x + a_0$ with the roots $r_{1,2} = 1 \pm \sqrt{1 - a_0}$, where (1.2.1) using the norm $\|\cdot\|_1$ yields

\[ \kappa_{p \to (r_1, r_2)}(a_0) = \frac{1}{\sqrt{1 - a_0}} \cdot \frac{|a_0|}{2} \to \infty \quad \text{for } a_0 \to 1^-. \]

In general, if $a_i$ is the $i$th coefficient of a polynomial $p \in P_n$ and $x_j$ is the $j$th root ($n \ge i, j$), one shows

\[ \kappa_{a_i \to x_j} = \frac{|a_i x_j^{\,i-1}|}{|p'(x_j)|}. \]

For the Wilkinson polynomial

\[ p(x) = \prod_{i=1}^{20} (x - i) \tag{1.2.2} \]

we have $\kappa_{p \to x_{15}}(a_{15}) \approx 5.1 \times 10^{13}$. A visualization of the ill-conditioning is obtained by the following MATLAB function.

    function wilkinson_polynomial
    a = poly(1:20);              % coefficients of the Wilkinson polynomial
    x = roots(a);                % roots computed from the unperturbed coefficients
    hold on
    for n=1:100
        r = randn(size(a));
        b = a.*(1 + 1e-10*r);    % random relative perturbation of size 1e-10
        z = roots(b);
        plot(real(z), imag(z), 'r.');
    end
    plot(real(x), imag(x), 'b*');

For $f \in C^1(X; Y)$, $f = h \circ g$ with $g \in C^1(X; Z)$, $h \in C^1(Z; Y)$, the chain rule of differentiation and the submultiplicativity of induced norms yield

\[ \kappa_f(x) \le \kappa_h(g(x))\, \kappa_g(x), \]

with equality for scalar problems, but we may have $\kappa_h(g(x)) \gg \kappa_f(x)$ or $\kappa_g(x) \gg \kappa_f(x)$.

Example. The eigenvalue computation of a non-symmetric matrix is often ill-conditioned. However, for $A = A^\top \in \mathbb{R}^{n \times n}$, one finds $\kappa = \|A\|_2 / |\lambda|$, where $\lambda$ is an eigenvalue. On the other hand, even if this is a small number, decomposing the problem into computing the characteristic polynomial and then performing root-finding is a bad idea, as seen above.

Remark. Besides the relative condition number we might sometimes consider the absolute condition number of $f\colon X \to Y$ at $x$,

\[ \kappa_f(x) = \lim_{\delta \to 0} \sup_{\|\Delta x\| < \delta} \frac{\|f(x + \Delta x) - f(x)\|}{\|\Delta x\|}, \]

which satisfies $\kappa_f(x) = \|J_f(x)\|$ for $f \in C^1(X; Y)$.

Literature. [TB, Lecture 12].

04.12.13

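1.3. Stability. Let $f\colon X \to Y$ be a problem with $X, Y$ normed vector spaces and $\tilde f(x)$ be a corresponding algorithm, e. g., a floating point implementation of $f$ depending on $\varepsilon_{\text{machine}}$. We consider the relative error

\[ r_{\tilde f, f}(x) = \frac{\|\tilde f(x) - f(x)\|}{\|f(x)\|}. \tag{1.3.1} \]

Definition. The algorithm $\tilde f$ for problem $f$ is called

(1) accurate if for all $x \in X$

\[ r_{\tilde f, f}(x) = O(\varepsilon_{\text{machine}}), \quad \varepsilon_{\text{machine}} \to 0, \tag{1.3.2} \]

(2) (forward) stable if for all $x \in X$

\[ \frac{\|\tilde f(x) - f(\tilde x)\|}{\|f(\tilde x)\|} = O(\varepsilon_{\text{machine}}) \quad \text{for some } \tilde x \text{ with } \frac{\|\tilde x - x\|}{\|x\|} = O(\varepsilon_{\text{machine}}), \quad \varepsilon_{\text{machine}} \to 0, \tag{1.3.3} \]

(3) backward stable if for all $x \in X$

\[ \tilde f(x) = f(\tilde x) \quad \text{for some } \tilde x \text{ with } \frac{\|\tilde x - x\|}{\|x\|} = O(\varepsilon_{\text{machine}}), \quad \varepsilon_{\text{machine}} \to 0. \tag{1.3.4} \]

In words, an accurate algorithm gives nearly the right answer, a (forward) stable algorithm gives nearly the right answer to nearly the right question, and a backward stable algorithm gives exactly the right answer to nearly the right question.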

Theorem 1.3.1. For problems and algorithms defined on finite-dimensional spaces $X, Y$, the properties of accuracy, (forward) stability and backward stability all hold or fail to hold independently of the choice of norms in $X, Y$.

Examples.

• Floating point operations $\oplus, \ominus, \otimes, \oslash$ satisfying the fundamental axiom (1.1.4) are backward stable.
• The floating point realization of the Euclidean inner product adding pairwise products with $\oplus$ and $\otimes$ is backward stable.
• The increment function using $\oplus$ to compute $x + 1$ is (forward) stable, but not backward stable.
• The computation of eigenvalues via root-finding for the characteristic polynomial is not only backward unstable but unstable.

Theorem 1.3.2. Suppose a backward stable algorithm $\tilde f$ is used to solve a problem $f\colon X \to Y$ with relative condition number $\kappa_f$ on a computer satisfying axiom (1.1.4). Then

\[ r_{\tilde f, f}(x) = O(\kappa_f(x)\, \varepsilon_{\text{machine}}), \quad \varepsilon_{\text{machine}} \to 0. \]

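Stability can be quantified using the following numbers.

Definition 1.3.1. Define

\[ \sigma_{\tilde f}(x) = \lim_{\delta \to 0} \sup_{\varepsilon < \delta} \frac{\|\tilde f(x) - f(x)\|}{\|f(x)\|\, (\kappa_f(x) + 1)\, \varepsilon} \quad \text{(forward stability indicator)}, \]

\[ \rho_{\tilde f}(x) = \lim_{\delta \to 0} \sup \Big\{ \frac{\|\Delta x\|}{\|x\|\, \varepsilon} : \tilde f(x) = f(x + \Delta x),\ \varepsilon < \delta \Big\} \quad \text{(backward stability indicator)}, \]

with $\varepsilon_{\text{machine}} = \varepsilon$.

It holds that $\sigma_{\tilde f}(x) \le \rho_{\tilde f}(x)$, and one proves estimates for composition.

Theorem 1.3.3. Let $f = h \circ g$, $f(x) = h(g(x)) = h(y)$ and $\tilde f = \tilde h \circ \tilde g$. Then

\[ \sigma_{\tilde f}(x)\, \kappa_f(x) \le \kappa_h(y) \big( \sigma_{\tilde h}(y) + \sigma_{\tilde g}(x)\, \kappa_g(x) \big), \]
\[ \rho_{\tilde f}(x) \le \rho_{\tilde g}(x) + \kappa_{g^{-1}}(y)\, \rho_{\tilde h}(y). \]

Hence, algorithms built by composing well-conditioned problems solved by forward stable algorithms are forward stable. For backward stability, the subproblems should have well-conditioned inverse mappings.

Literature. [TB, Lecture 14,15], [D, Section 2.3], [F. Bornemann, IMA J. Num. Anal. 27:219–231, 2007].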

05.11.13

2. Nonlinear Equations (Zero-Finding)

2.1. Condition of zero-finding. Consider a smooth function $f\colon \mathbb{R} \to \mathbb{R}$ and the problem of finding zeros of $f$, i. e., points $x^* \in \mathbb{R}$ such that $f(x^*) = 0$. Assuming existence and uniqueness for a moment, let $\Psi\colon f \mapsto x^*$ be the abstract problem. For $\|f\|_\infty = \sup_{x \in \mathbb{R}} |f(x)|$, Taylor’s theorem yields

\[ \kappa_\Psi(f) = \frac{\|f\|_\infty}{|x^*|\, |f'(x^*)|}. \]

Hence, zeros $x^*$ with $|f'(x^*)| \ll 1$ are ill-conditioned.

2.2. Newton’s method in R. To compute zeros $x^*$, we consider starting from $x_0$ and constructing a sequence $x_1, x_2, \dots$ with $\lim_{k \to \infty} x_k = x^*$ if $x_0$ is sufficiently near $x^*$. Newton’s method defines

\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}. \tag{2.2.1} \]

This can be motivated geometrically by trigonometric relations and analytically using Taylor’s theorem.

Examples. For any $a > 0$

• the reciprocal $a^{-1}$ is the unique zero of $f(x) = \frac{1}{x} - a$. Newton’s method yields the iteration

\[ x_{k+1} = 2x_k - a x_k^2, \]

converging for $0 < x_0 < \frac{1}{a}$;

• the square root $\sqrt{a}$ is the unique positive zero of $f(x) = x^2 - a$. Newton’s method yields the iteration

\[ x_{k+1} = \frac{1}{2} \Big( x_k + \frac{a}{x_k} \Big). \]

Theorem 2.2.1. Let $f \in C^2(\mathbb{R})$ and $x^* \in \mathbb{R}$ such that $f(x^*) = 0$ with $f'(x^*) \ne 0$. If $x_0$ is sufficiently close to $x^*$, the sequence $(x_k)_{k \in \mathbb{N}_0}$ generated by Newton’s method (2.2.1) satisfies

\[ \lim_{k \to \infty} x_k = x^* \quad \text{and} \quad \lim_{k \to \infty} \frac{x_{k+1} - x^*}{(x_k - x^*)^2} = \frac{f''(x^*)}{2 f'(x^*)}. \]

Proof. Analyzing the iteration function $\varphi(x) = x - \frac{f(x)}{f'(x)}$.

The theorem says that

\[ e_{k+1} \approx \frac{f''(x^*)}{2 f'(x^*)}\, e_k^2, \]

where $e_k = x_k - x^*$ is the error in the $k$th iterate. Newton’s method is therefore called quadratically convergent. However, note that this convergence rate only holds asymptotically (in the long run).
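The quadratic convergence can be observed numerically. The following MATLAB sketch applies the square root iteration from the examples above with the assumed values a = 2 and x0 = 1 and compares the error ratio with the asymptotic constant f''(x*)/(2f'(x*)) = 1/(2√a).

```matlab
% Sketch: Newton's method for f(x) = x^2 - a (assumed data a = 2, x0 = 1).
a = 2; x = 1;
e = zeros(1,5);                   % e(k) = |x_k - sqrt(a)|
for k = 1:5
    x = (x + a/x)/2;              % Newton step x_{k+1} = (x_k + a/x_k)/2
    e(k) = abs(x - sqrt(a));
end
C = 1/(2*sqrt(a));                % asymptotic error constant f''(x*)/(2 f'(x*))
fprintf('errors: %s\n', mat2str(e, 3));
fprintf('e_3/e_2^2 = %.4f, predicted constant = %.4f\n', e(3)/e(2)^2, C);
```

The number of correct digits roughly doubles in every step until the error reaches the level of the machine precision, after which the ratio is no longer meaningful.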

Literature. [S, Lectures 2,5]

11.11.2013

2.3. Fixed-point iterations. Newton’s method requires a derivative of $f$ at each iteration. To avoid this cost, we may instead iterate according to

\[ x_{k+1} = x_k - \frac{f(x_k)}{g_k}, \]

where $g_k$ is an easily computable approximation to $f'(x_k)$. The special case $g_k = g$ for all $k$, e. g., with $g = f'(x_0)$, is called constant slope method. Assuming

\[ \Big| 1 - \frac{f'(x^*)}{g} \Big| < 1, \tag{2.3.1} \]

one obtains by analyzing the iteration function $\varphi(x) = x - \frac{f(x)}{g}$ that this method converges again locally, but

\[ e_{k+1} \approx \varphi'(x^*)\, e_k \tag{2.3.2} \]

for the error $e_k = x_k - x^*$. The method is therefore called linearly convergent.
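A small MATLAB experiment makes the linear rate visible. The sketch below uses the constant slope method for the assumed test problem f(x) = x² − 2 with x0 = 1.5 and g = f'(x0), and compares the observed error ratios with φ'(x*) = 1 − f'(x*)/g.

```matlab
% Sketch: constant slope method for f(x) = x^2 - 2 (assumed test problem).
f = @(x) x.^2 - 2; xs = sqrt(2);
x = 1.5; g = 2*x;                 % fixed slope g = f'(x0)
e = zeros(1,6); e(1) = x - xs;    % e(k+1) holds the error of the k-th iterate
for k = 1:5
    x = x - f(x)/g;               % x_{k+1} = x_k - f(x_k)/g
    e(k+1) = x - xs;
end
rho = 1 - 2*xs/g;                 % predicted rate phi'(x*) = 1 - f'(x*)/g
ratios = e(2:end)./e(1:end-1);    % observed ratios e_{k+1}/e_k
fprintf('predicted rate %.5f, observed ratios: %s\n', rho, mat2str(ratios, 4));
```

In contrast to Newton's method, the error shrinks by a roughly constant factor per step, here about 0.057.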

Definition. Let $(x_k)_{k \in \mathbb{N}_0}$ be a sequence converging to a limit $x^*$. If there exists $\rho$, $0 < |\rho| < 1$, such that

\[ \lim_{k \to \infty} \frac{x_{k+1} - x^*}{x_k - x^*} = \rho, \tag{2.3.3} \]

$(x_k)$ is said to converge linearly with rate $\rho$. If the ratio in (2.3.3) converges to zero, the convergence is said to be superlinear. If there is a number $p > 1$ and a constant $C > 0$ such that

\[ \lim_{k \to \infty} \frac{|x_{k+1} - x^*|}{|x_k - x^*|^p} = C, \]

$(x_k)$ is said to converge with order $p$.

Newton’s method and the constant slope method are particular examples of fixed point iterations, where we consider a function $\varphi$ having a fixed point $x^*$, i. e., a point $x^*$ such that $\varphi(x^*) = x^*$, and study the iteration

\[ x_{k+1} = \varphi(x_k), \quad k = 0, 1, \dots \tag{2.3.4} \]

for convergence to $x^*$.

Theorem 2.3.1. Suppose that $\varphi \in C^{p+1}(\mathbb{R})$, $p \in \mathbb{N}$, $\varphi(x^*) = x^*$, and

\[ |\varphi'(x^*)| < 1. \]

Then there is an interval $I_\delta = [x^* - \delta, x^* + \delta]$ such that the iteration (2.3.4) converges to $x^*$ whenever $x_0 \in I_\delta$. If $\varphi'(x^*) \ne 0$, then the convergence is linear with rate $\varphi'(x^*)$. On the other hand, if

\[ 0 = \varphi'(x^*) = \dots = \varphi^{(p-1)}(x^*) \ne \varphi^{(p)}(x^*), \tag{2.3.5} \]

then the convergence is of order $p$.

Proof. Using Taylor’s theorem.

Fixed point methods (hence Newton’s method, etc.) extend to functions of several variables. Recall that a function $g\colon \mathbb{R}^n \supset \Omega \to \mathbb{R}^m$ is called Lipschitz continuous with constant $\gamma$ if

\[ \|g(x) - g(y)\| \le \gamma \|x - y\|, \quad x, y \in \Omega. \]

A function $\varphi\colon \mathbb{R}^n \supset \Omega \to \mathbb{R}^n$ is a contraction mapping if $\varphi$ is Lipschitz continuous with constant $\gamma < 1$.

Theorem (Contraction Mapping Theorem). Let $\Omega \subset \mathbb{R}^n$ be a closed set and $\varphi$ be a contraction mapping on $\Omega$ with constant $\gamma < 1$ such that $\varphi(x) \in \Omega$ for all $x \in \Omega$. Then there exists a unique fixed point of $\varphi$, $x^* \in \Omega$, and the iteration $x_{k+1} = \varphi(x_k)$ converges linearly to $x^*$ with rate $\gamma$ for all $x_0 \in \Omega$.

2.4. Newton’s method in Rn. Consider $f\colon \mathbb{R}^n \supset \Omega \to \mathbb{R}^n$ and the problem $f(x) = 0$ with solution $x^*$. Assume that the derivative $J_f\colon \Omega \to \mathbb{R}^{n \times n}$ is Lipschitz continuous with constant $\gamma$ and that $J_f(x^*)$ is nonsingular.

Theorem 2.4.1. Under the above assumptions, there exists a $\delta > 0$ such that if $\|x_0 - x^*\| < \delta$, the Newton iteration in $\mathbb{R}^n$

\[ x_{k+1} = x_k - (J_f(x_k))^{-1} f(x_k) \tag{2.4.1} \]

converges quadratically to $x^*$.

Equation (2.4.1) can be expressed as solving a linear system of equations

\[ J_f(x_k)(x_{k+1} - x_k) = -f(x_k) \]

for the unknown $(x_{k+1} - x_k)$.

Example 2.4.1.

    % Compute N=8 iterates (7 Newton steps) of Newton's method for f(x)=0 with
    % f(x1,x2) = (x1^2 + x2^2 - 1; sin(pi/2*x1) + x2^3);
    function Y=newton_system
    N=8; X=zeros(2,N);
    X(:,1) = [1;1];
    for k=1:N-1
        % x_{k+1} = x_k - J_f(x_k) \ f(x_k); the backslash solves the linear system
        X(:,k+1) = X(:,k) - [2*X(1,k), 2*X(2,k); ...
            pi/2*cos(pi/2*X(1,k)), 3*X(2,k)^2] \ ...
            [X(1,k)^2 + X(2,k)^2-1; sin(pi/2*X(1,k)) + X(2,k)^3];
    end
    Y=X(:,N)
    % test: residual of the last iterate
    abs_e=abs([Y(1)^2 + Y(2)^2 - 1; sin(pi/2*Y(1)) + Y(2)^3] - [0;0]);
    fprintf('Absolute Error: %f\n', abs_e);

2.5. Termination of Newton’s method.

Lemma 2.5.1. For $f$ and $\delta$ as in Theorem 2.4.1, it holds that

\[ \frac{\|x - x^*\|}{4 \|x_0 - x^*\|\, \kappa(J_f(x^*))} \le \frac{\|f(x)\|}{\|f(x_0)\|} \le \frac{4 \kappa(J_f(x^*))\, \|x - x^*\|}{\|x_0 - x^*\|}, \quad \|x - x^*\| < \delta, \]

where $\kappa(J_f(x^*)) = \|J_f(x^*)\|\, \|J_f(x^*)^{-1}\|$ is the condition number of $J_f(x^*)$.

Hence, the relative nonlinear residual $\frac{\|f(x)\|}{\|f(x_0)\|}$ is a good error measure. However, for $x_0$ very close to $x^*$ a balancing of a relative and an absolute tolerance $\tau_r$ and $\tau_a$ by stopping when

\[ \|f(x)\| \le \tau_r \|f(x_0)\| + \tau_a \]

is very common. Another way to terminate is when the step size $s_k = \|x_{k+1} - x_k\|$ is small, since by Theorem 2.4.1

\[ \|x_k - x^*\| = s_k + O(\|x_k - x^*\|^2). \]
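The residual-based stopping rule can be sketched in MATLAB as follows. The function f and its Jacobian are taken from Example 2.4.1; the tolerances tau_r, tau_a and the iteration limit kmax are assumed values for illustration, not recommendations from the lecture.

```matlab
% Sketch: Newton's method in R^2 with a mixed relative/absolute residual test.
f = @(x) [x(1)^2 + x(2)^2 - 1; sin(pi/2*x(1)) + x(2)^3];
J = @(x) [2*x(1), 2*x(2); pi/2*cos(pi/2*x(1)), 3*x(2)^2];

tau_r = 1e-10; tau_a = 1e-10;     % assumed tolerances
kmax = 50;                        % assumed safeguard against non-convergence
x = [1; 1]; r0 = norm(f(x)); k = 0;
while norm(f(x)) > tau_r*r0 + tau_a && k < kmax
    s = -J(x)\f(x);               % Newton step from J_f(x_k) s = -f(x_k)
    x = x + s;
    k = k + 1;
end
fprintf('stopped after %d steps, residual %.2e, last step size %.2e\n', ...
        k, norm(f(x)), norm(s));
```

The norm of the last step, norm(s), is the quantity used by the alternative step size criterion; near the solution it is a good estimate of the error of the previous iterate.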

Literature. [S, Lecture 3], [Kelley, Iterative methods for linear and nonlinear equations, SIAM, 1995, Chapter 5].

12.11.2013

3. Interpolation

3.1. Polynomial interpolation. Let $x_0, \dots, x_n$ be distinct points in $[-1, 1]$ and $y_0, \dots, y_n \in \mathbb{R}$ (or $\mathbb{C}$). Polynomial interpolation provides the polynomial $p \in P_n$ such that $p(x_j) = y_j$ for all $j = 0, \dots, n$. With $l_j \in P_n$ such that

\[ l_j(x_k) = \delta_{jk} \tag{3.1.1} \]

for $j, k = 0, \dots, n$, the polynomial $p(x) = y_0 l_0(x) + \dots + y_n l_n(x)$ satisfies $p(x_j) = y_j$ for all $j = 0, \dots, n$. Such $l_j$ are given by the Lagrange polynomials

\[ l_j(x) = \prod_{\substack{l=0 \\ l \ne j}}^{n} \frac{x - x_l}{x_j - x_l}, \quad j = 0, \dots, n. \tag{3.1.2} \]

Uniqueness of the interpolation polynomial follows from $0 = r = p - q \in P_n$, because $r$ vanishes at $n + 1$ points,

\[ r(x_j) = p(x_j) - q(x_j) = y_j - y_j = 0, \quad j = 0, \dots, n, \tag{3.1.3} \]

if $p$ and $q$ are interpolation polynomials. The interpolation polynomial in Lagrange form

\[ p(x) = \sum_{j=0}^{n} y_j\, l_j(x) \tag{3.1.4} \]

yields the following result.

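Theorem 3.1.1. The absolute condition number of the polynomial interpolation problem

\[ \phi[x_0, \dots, x_n](f) = \sum_{j=0}^{n} f(x_j)\, l_j(x) \colon C([-1, 1]) \to P_n \]

with respect to the $\infty$-norm is

\[ \kappa_\phi = \sup_{x \in [-1, 1]} \sum_{j=0}^{n} |l_j(x)| =: \Lambda_n. \]

$\Lambda_n$ is called the Lebesgue constant for the points $x_0, \dots, x_n$.

Proposition 3.1.1. It holds that

\[ \Lambda_n \ge \frac{2}{\pi} \log_e(n + 1) + 0.52125\ldots \]

For equispaced points,

\[ \Lambda_n > \frac{2^{n-2}}{n^2} \quad \text{and} \quad \Lambda_n \sim \frac{2^{n+1}}{e\, n \log_e n}, \quad n \to \infty. \]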

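Defining the nodal polynomial

\[ l(x) = \prod_{k=0}^{n} (x - x_k) \in P_{n+1} \]

and the weights

\[ \lambda_j = \frac{1}{\prod_{k=0,\, k \ne j}^{n} (x_j - x_k)} = \frac{1}{l'(x_j)}, \]

one obtains the first form of the barycentric interpolation formula

\[ p(x) = l(x) \sum_{j=0}^{n} \frac{\lambda_j}{x - x_j}\, y_j \tag{3.1.5} \]

and, with some further manipulation, the barycentric interpolation formula

\[ p(x) = \begin{cases} \displaystyle \sum_{j=0}^{n} \frac{\lambda_j y_j}{x - x_j} \Big/ \sum_{j=0}^{n} \frac{\lambda_j}{x - x_j}, & x \ne x_j, \\ y_j, & x = x_j. \end{cases} \tag{3.1.6} \]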

Theorem 3.1.2. Let $p$ be the polynomial interpolating $f \in C^{n+2}([-1, 1])$ at $x_0, \dots, x_n$. Then

\[ f(x) - p(x) = \frac{f^{(n+1)}(\xi)}{(n + 1)!}\, l(x) \]

for some $\xi \in [-1, 1]$.

With a bound $|f^{(n+1)}(x)| \le M$, $x \in [-1, 1]$, Theorem 3.1.2 yields the error estimate

\[ |f(x) - p(x)| \le \frac{M}{(n + 1)!} \max_{x \in [-1, 1]} |l(x)|. \]

It can be shown that, over all choices of the points $x_0, \dots, x_n$,

\[ \min_{l} \max_{x \in [-1, 1]} |l(x)| = 2^{-n} \]

and that the minimum is achieved for the Chebyshev points

\[ x_j = \cos\Big( \frac{2(n - j) + 1}{2n + 2}\, \pi \Big), \quad j = 0, \dots, n. \]

Example 3.1.1. (Runge)

    n = 8;
    %x = cos(pi/n*[0:n]);
    x = linspace(-1,1,n+1);
    y = 1./(1+25*x.^2);
    % barycentric weights lambda_j
    for j=1:n+1, w(j) = prod(1./(x(j) - x([1:j-1,j+1:end]))); end
    xx = linspace(-1,1); yy = 1./(1+25*xx.^2); l = 1; s = 0;
    % first barycentric form (3.1.5); p is NaN where xx hits a node exactly (here xx = -1, 1)
    for j=1:n+1, l = l.*(xx-x(j)); s = s + y(j)*w(j)./(xx-x(j)); end
    p = l.*s; plot(xx,yy,xx,p,x,y,'*')

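For the Chebyshev points of the second kind, $x_j = \cos(j\pi/n)$, $j = 0, \dots, n$, (3.1.6) becomes

\[ p(x) = \begin{cases} \displaystyle \sideset{}{'}\sum_{j=0}^{n} \frac{(-1)^j y_j}{x - x_j} \Big/ \sideset{}{'}\sum_{j=0}^{n} \frac{(-1)^j}{x - x_j}, & x \ne x_j, \\ y_j, & x = x_j, \end{cases} \tag{3.1.7} \]

where the prime on the sum signifies that the terms $j = 0$ and $j = n$ are multiplied by $\tfrac{1}{2}$.

All formulas (3.1.4), (3.1.5) and (3.1.6) are (forward) stable for interpolation, (3.1.4) is even backward stable. For extrapolation, (3.1.5) and (3.1.6) are unstable, but (3.1.4) is forward stable.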
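As an illustration of the barycentric formula with the simple weights (−1)^j (halved at both ends), the following MATLAB sketch interpolates the Runge function at the points x_j = cos(jπ/n). The degree n = 32 and the evaluation grid are assumptions; the grid is chosen so that it does not contain an interpolation node, which avoids the case distinction x = x_j.

```matlab
% Sketch: barycentric interpolation at x_j = cos(j*pi/n) for the Runge function.
n = 32; j = 0:n;
x = cos(pi*j/n);                         % interpolation points
y = 1./(1+25*x.^2);                      % data y_j = f(x_j)
w = (-1).^j; w([1 end]) = w([1 end])/2;  % weights (-1)^j, halved for j = 0 and j = n

xx = linspace(-0.99, 0.99, 200);         % evaluation grid (contains no node)
num = zeros(size(xx)); den = zeros(size(xx));
for k = 1:n+1
    t = w(k)./(xx - x(k));
    num = num + t*y(k);                  % numerator sum of the barycentric formula
    den = den + t;                       % denominator sum
end
p = num./den;
err = max(abs(p - 1./(1+25*xx.^2)));
fprintf('maximal interpolation error on the grid: %.2e\n', err);
```

In contrast to the equispaced points of Example 3.1.1, the error decreases as n grows, and evaluating p costs only O(n) operations per point.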

Literature. [Trefethen, Approximation theory and approximation practice, SIAM, 2013, Section 5, 15], [S, Lecture 20].

18.11.13

3.2. Spline-Interpolation. Let $x_0 < \dots < x_n$ be points (knots) and $f_0, \dots, f_n$ be values in $\mathbb{R}$.

Definition 3.2.1. A linear (interpolating) spline is a function $l(x)$ such that

(1) $l$ is linear in $[x_j, x_{j+1}]$, $j = 0, \dots, n - 1$
(2) $l$ is continuous in $[x_0, x_n]$
(3) $l(x_j) = f_j$, $j = 0, \dots, n$.

Defining $l_j(x)$ by

\[ l_j(x) = f_j + \frac{f_{j+1} - f_j}{x_{j+1} - x_j} (x - x_j), \tag{3.2.1} \]

the function

\[ l(x) = l_j(x), \quad x \in [x_j, x_{j+1}], \]

satisfies these conditions.

We let $L$ denote the space of all linear splines over the knots $x_0, \dots, x_n$ and $L(f)$ be the operator taking a function $f$ and producing a linear spline interpolating $f$ at $x_j$, $j = 0, \dots, n$. Then

\[ L[L(f)] = L(f), \tag{3.2.2} \]

and the error bounds from polynomial interpolation yield the following result.

Theorem 3.2.1. Let $f \in C^2([x_0, x_n])$, $|f''(x)| \le M$ for $x \in [x_0, x_n]$ and let $h_{\max} = \max\{ x_{j+1} - x_j : j = 0, \dots, n - 1 \}$. Then

\[ |f(x) - L(f)(x)| \le \frac{M}{8}\, h_{\max}^2. \tag{3.2.3} \]
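The bound (3.2.3) can be checked numerically. The sketch below uses MATLAB's interp1, whose default method is piecewise linear interpolation, for the assumed test function f = sin on [0, π] with M = 1.

```matlab
% Sketch: linear spline error vs. the bound M/8*hmax^2 (assumed test: sin on [0,pi]).
n = 10;
x = linspace(0, pi, n+1); fx = sin(x);   % knots and values
xx = linspace(0, pi, 1001);              % fine evaluation grid
Lf = interp1(x, fx, xx);                 % piecewise linear interpolant L(f)
err = max(abs(sin(xx) - Lf));
hmax = max(diff(x)); M = 1;              % |f''| = |sin| <= 1
bound = M/8*hmax^2;
fprintf('max error %.5f <= bound %.5f\n', err, bound);
```

For this example the observed error is close to the bound, which shows that the constant 1/8 cannot be improved in general.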

The hat functions

\[ c_k(x) = \begin{cases} (x - x_{k-1})/(x_k - x_{k-1}), & x \in [x_{k-1}, x_k], \\ (x_{k+1} - x)/(x_{k+1} - x_k), & x \in [x_k, x_{k+1}], \\ 0 & \text{elsewhere}, \end{cases} \tag{3.2.4} \]

$k = 0, \dots, n$ (where $c_0$ and $c_n$ omit the first or second case, respectively) define a basis of $L$. In terms of this basis, we get

\[ L(f)(x) = \sum_{k=0}^{n} f_k\, c_k(x), \tag{3.2.5} \]

because $c_k \in L$ and $\sum_{k=0}^{n} f_k c_k(x_j) = f_j$.

Definition 3.2.2. A function $g(x)$ is a cubic (interpolating) spline if it satisfies

(1) $g(x) = p_j(x)$ on $[x_j, x_{j+1}]$ with $p_j \in P_3$
(2) $p_j(x_j) = f_j$, $j = 0, \dots, n - 1$
(3) $p_j(x_{j+1}) = f_{j+1}$, $j = 0, \dots, n - 1$
(4) $p_j'(x_{j+1}) = p_{j+1}'(x_{j+1})$, $j = 0, \dots, n - 2$
(5) $p_j''(x_{j+1}) = p_{j+1}''(x_{j+1})$, $j = 0, \dots, n - 2$.

Theorem 3.2.2. In the interval $[x_j, x_{j+1}]$, $h_j = x_{j+1} - x_j$, the cubic spline $g(x) = p_j(x)$ has the form

\[ p_j(x) = a_j + b_j (x - x_j) + \frac{s_j}{6 h_j} (x_{j+1} - x)^3 + \frac{s_{j+1}}{6 h_j} (x - x_j)^3, \]

where

\[ a_j = f_j - \frac{s_j h_j^2}{6}, \qquad b_j = d_j - \frac{1}{6} (s_{j+1} - s_j) h_j, \]

with $d_j = \frac{f_{j+1} - f_j}{h_j}$ and the second derivatives $s_j = g''(x_j)$ satisfying

\[ \frac{h_j}{6}\, s_j + \frac{h_j + h_{j+1}}{3}\, s_{j+1} + \frac{h_{j+1}}{6}\, s_{j+2} = d_{j+1} - d_j, \quad j = 0, \dots, n - 2. \tag{3.2.9} \]

The equations (3.2.9) represent only $n - 1$ conditions for the $n + 1$ unknowns $s_j$. One may additionally impose two further conditions by

a) fixing $s_0$ and $s_n$ to some given values (taking $s_0 = s_n = 0$ results in a function called natural spline),
b) requiring $p_0 = p_1$ and $p_{n-2} = p_{n-1}$ (not-a-knot condition), or
c) specifying the derivatives at the endpoints (resulting in a function called complete spline).

All resulting linear systems of equations are tridiagonal. For the interpolation error with $f_j = f(x_j)$ one shows

\[ \|f - g\|_\infty \le \frac{5}{384} \|f^{(4)}\|_\infty h_{\max}^4, \]
\[ \|f' - g'\|_\infty \le \frac{1}{24} \|f^{(4)}\|_\infty h_{\max}^3, \]
\[ \|f'' - g''\|_\infty \le \frac{3}{8} \|f^{(4)}\|_\infty h_{\max}^2. \]
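The tridiagonal system for the second derivatives can be set up in a few lines of MATLAB. The sketch below does this for the natural spline (s_0 = s_n = 0), using the continuity conditions for g' at the interior knots, for the assumed data f = sin on 11 equidistant knots in [0, π]. A dense matrix is used for brevity; a sparse or banded solver would be the natural choice for large n.

```matlab
% Sketch: second derivatives s_j of the natural cubic spline (s_0 = s_n = 0).
x = linspace(0, pi, 11)'; f = sin(x);    % assumed test data
n = numel(x) - 1;
h = diff(x);                             % h(i) = h_{i-1}
d = diff(f)./h;                          % d(i) = d_{i-1}

% Row r couples s_{r-1}, s_r, s_{r+1} with weights h_{r-1}/6, (h_{r-1}+h_r)/3, h_r/6.
T = diag((h(1:n-1) + h(2:n))/3) ...
  + diag(h(2:n-1)/6, 1) + diag(h(2:n-1)/6, -1);
s = [0; T\diff(d); 0];                   % interior unknowns s_1..s_{n-1} plus end values

fprintf('max |s_j - f''''(x_j)| = %.4f\n', max(abs(s + sin(x))));
```

For this example the natural end conditions agree with f''(0) = f''(π) = 0, and the computed s_j approximate f''(x_j) = −sin(x_j) to within about 0.01.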

Literature. [S, Lectures 10, 11]

19.11.13

3.3. Trigonometric Interpolation. Let $T^N_{\mathbb{C}}$ denote the space of complex trigonometric polynomials of degree $N - 1$,

\[ \phi_N(t) = \sum_{j=0}^{N-1} c_j e^{ijt}, \quad \text{with } c_j \in \mathbb{C}, \tag{3.3.1} \]

and $T^N_{\mathbb{R}}$ denote the space of real trigonometric polynomials of degree $N - 1$,

\[ \psi_{2n+1}(t) = \frac{a_0}{2} + \sum_{j=1}^{n} (a_j \cos jt + b_j \sin jt), \tag{3.3.2} \]

for odd $N = 2n + 1$, respectively,

\[ \psi_{2n}(t) = \frac{a_0}{2} + \sum_{j=1}^{n-1} (a_j \cos jt + b_j \sin jt) + \frac{a_n}{2} \cos nt, \tag{3.3.3} \]

for even $N = 2n$, where $a_j, b_j \in \mathbb{R}$. We consider interpolating $2\pi$-periodic functions at $N$ distinct points $0 \le t_0 < t_1 < \dots < t_{N-1} < 2\pi$ on $[0, 2\pi)$. Given $N$ values $f_0, \dots, f_{N-1}$ in $\mathbb{C}$ (or in $\mathbb{R}$), we seek $\phi_N \in T^N_{\mathbb{C}}$ (or $\phi_N \in T^N_{\mathbb{R}}$) satisfying

\[ \phi_N(t_j) = f_j, \quad j = 0, \dots, N - 1. \tag{3.3.4} \]

The bijection $T^N_{\mathbb{C}} \to P_{N-1}$,

\[ \phi(t) = \sum_{j=0}^{N-1} c_j e^{ijt} \mapsto P(z) = \sum_{j=0}^{N-1} c_j z^j, \]

yields

(1) existence and uniqueness of $\phi_N \in T^N_{\mathbb{C}}$ satisfying (3.3.4)
(2) that the condition of the problem is again given by the Lebesgue constant $\Lambda_{N-1}$.

We will compute the interpolation polynomial for equidistant points

\[ t_j = \frac{2\pi j}{N}, \quad j = 0, \dots, N - 1, \tag{3.3.5} \]

and denote by $\omega_j$ the $N$th unit roots $\omega_j = e^{it_j} = e^{2\pi i j/N}$, satisfying

Lemma 3.3.1.

\[ \sum_{j=0}^{N-1} \omega_j^{k}\, \omega_j^{-l} = N \delta_{kl}. \]

Theorem 3.3.1. The coefficients $c_j$ of $\phi_N \in T^N_{\mathbb{C}}$ in (3.3.1) satisfying (3.3.4) for $t_j$ satisfying (3.3.5) are

\[ c_j = \frac{1}{N} \sum_{k=0}^{N-1} f_k\, \omega_k^{-j}, \quad j = 0, \dots, N - 1. \tag{3.3.6} \]

Proposition 3.3.1. Consider $\phi_N$ from Theorem 3.3.1 with data $f_j \in \mathbb{R}$. Then $\psi_N \in T^N_{\mathbb{R}}$ in (3.3.2), respectively (3.3.3), given by the coefficients

\[ a_j = 2\Re(c_j) = c_j + c_{N-j}, \qquad b_j = -2\Im(c_j) = i(c_j - c_{N-j}) \]

satisfies the interpolation condition $\psi_N(t_j) = f_j$, $j = 0, \dots, N - 1$.

Observe that $c_j$ in (3.3.6) is given by $\frac{1}{N} Y_j$, where

\[ Y_j = \sum_{k=0}^{N-1} f_k\, \omega_k^{-j} = \sum_{k=0}^{N-1} \omega^{jk} f_k \]

with $\omega = e^{-2\pi i/N}$. Hence, $c_j$ is essentially given by the Discrete Fourier Transform (DFT) of the vector $f = (f_0, \dots, f_{N-1})^\top$ resulting in a vector $Y = (Y_0, \dots, Y_{N-1})^\top$, which can be expressed as

\[ Y = W f, \]

where $W = (w_{jk})$ is a matrix with the entries

\[ w_{jk} = \omega^{jk} = e^{-2\pi i jk/N}. \]
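The matrix form of the DFT and its relation to the interpolation coefficients can be verified directly. The following MATLAB sketch builds W for the assumed size N = 8 and the assumed data f(t) = exp(sin t), compares W f with MATLAB's fft, and checks the interpolation conditions for c = Y/N.

```matlab
% Sketch: DFT as matrix-vector product and trigonometric interpolation (N = 8 assumed).
N = 8;
t = 2*pi*(0:N-1)'/N;                  % equidistant points t_j
f = exp(sin(t));                      % assumed sample data f_j
[Jm, Km] = meshgrid(0:N-1);           % index grids for j*k
W = exp(-2*pi*1i*Jm.*Km/N);           % w_jk = omega^(jk), omega = exp(-2*pi*i/N)

Y = W*f;                              % DFT, O(N^2) operations
c = Y/N;                              % interpolation coefficients c_j

phi = @(s) exp(1i*s(:)*(0:N-1))*c;    % phi_N(s) = sum_j c_j exp(i*j*s)
err_fft = norm(Y - fft(f));           % agreement with the built-in FFT
err_int = norm(phi(t) - f);           % interpolation conditions phi_N(t_j) = f_j
fprintf('|Wf - fft(f)| = %.2e, interpolation residual = %.2e\n', err_fft, err_int);
```

Both residuals are at the level of rounding errors, confirming that fft uses the same sign and scaling convention as the matrix W.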

The MATLAB function fftgui demonstrates some properties of this transformation. Direct application of this matrix-vector representation requires at least $O(N^2)$ flops. Modern algorithms implement the fast DFT (FFT) with a complexity of $O(N \log_2 N)$. The key to the FFT is that for $\omega = \omega_N = e^{-2\pi i/N}$ we have

\[ \omega_{2N}^2 = \omega_N, \]

so that for even $N$, $j \le \frac{N}{2} - 1$,

\[ Y_j = \sum_{\text{even } k} \omega^{jk} f_k + \sum_{\text{odd } k} \omega^{jk} f_k = \sum_{k=0}^{N/2-1} \omega^{2jk} f_{2k} + \omega^{j} \sum_{k=0}^{N/2-1} \omega^{2jk} f_{2k+1}. \]

These are two DFTs of length $\frac{N}{2}$.

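    % One splitting step of the FFT for a column vector f of even length N.
    omega=exp(-2*pi*i/N);
    j=(0:N/2-1)';
    w=omega.^j;              % twiddle factors omega^j
    u=fft(f(1:2:N-1));       % DFT of the entries f_0, f_2, ...
    v=w.*fft(f(2:2:N));      % DFT of the entries f_1, f_3, ..., times omega^j
    [u+v; u-v] %=fft(f)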

If $N = 2^p$ this process can be repeated recursively, giving a complexity of $O(N \log_2 N)$ flops. If $N \ne 2^p$ one can express the DFT as an FFT with another factorization. Even if $N$ is a prime, one can embed the problem into one that can be factored. Moreover, one can show that the algorithm is backward stable.

Remark 3.3.1. Aliasing

    n = 8; x = 2*pi/n*[0:n-1]; y = sin(x) + 3*sin(5*x);
    % interpft returns values on the equidistant grid 2*pi/100*[0:99]
    xx = 2*pi/100*[0:99]; yy = sin(xx) + 3*sin(5*xx);
    z = interpft(y,100); plot(xx,yy,'b-',xx,z,'r-',x,y,'ro')
    n = 16; x = 2*pi/n*[0:n-1]; y = sin(x) + 3*sin(5*x);
    z = interpft(y,100); hold on, plot(xx,z,'g-')

Literature. [D, Section 7.2], [Moler, Numerical Computing with Matlab (rev. reprint), SIAM, 2008, Chapter 8]

25.11.13

4. Quadrature (Numerical Integration)

Consider $f \in C([a, b]; \mathbb{R})$. Let $x_0, \dots, x_n \in [a, b]$. We seek approximations

\[ I(f) = \int_a^b f(x)\, dx \approx \sum_{j=0}^{n} A_j f(x_j) = Q(f) \]

for some weights $A_0, \dots, A_n$.

4.1. Condition of the problem. The linearity of $I\colon C([a, b]; \mathbb{R}) \to \mathbb{R}$ and $Q\colon C([a, b]; \mathbb{R}) \to \mathbb{R}$ yields

\[ \kappa_I(f) \le (b - a)\, \frac{\|f\|_\infty}{|I(f)|} \]

and

\[ \kappa_Q(f) \le \|A\|_1\, \frac{\|f\|_\infty}{|Q(f)|}, \]

and both estimates can be made sharp. Hence, integration and quadrature are ill-conditioned problems for oscillatory functions.

4.2. Basic quadrature rules. Without loss of generality, we may take $[a, b] = [0, 1]$, since by a change of variables $x = a + (b - a)y$ and $g(y) = f(a + (b - a)y)$,

\[ \int_a^b f(x)\, dx = (b - a) \int_0^1 g(y)\, dy. \]

Example 4.2.1. Simpson’s rule, stated as

\[ \int_0^1 f(x)\, dx \approx \frac{1}{6} f(0) + \frac{2}{3} f\Big(\frac{1}{2}\Big) + \frac{1}{6} f(1), \]

yields

\[ \int_a^b f(x)\, dx \approx \frac{b - a}{6} \Big( f(a) + 4 f\Big(\frac{a + b}{2}\Big) + f(b) \Big) =: S(f). \]

The simplest quadrature formula is the trapezoidal rule, for $[a, b] = [0, h]$,

\[ \int_0^h f(x)\, dx \approx \frac{h}{2} [f(0) + f(h)] =: T_h(f). \]

The rule can be motivated by interpolating $f$ at $0$ and $h$ by a linear polynomial $l(x)$ and taking

\[ I(f) \approx \int_0^h l(x)\, dx. \]

The error estimates from polynomial interpolation and the nonpositivity of $x \mapsto x(x - h)$ on $[0, h]$ yield

\[ \int_0^h f(x)\, dx - T_h(f) = \frac{f''(\eta)}{2} \int_0^h x(x - h)\, dx = -\frac{f''(\eta)}{12}\, h^3 \tag{4.2.1} \]

for some $\eta \in [0, h]$. Dividing $[a, b]$ into $n$ intervals $[x_{i-1}, x_i]$ with

\[ x_i = a + ih, \quad h = \frac{b - a}{n}, \quad i = 0, \dots, n, \]

yields the composite trapezoidal rule

\[ \int_a^b f(x)\, dx \approx \sum_{i=1}^{n} \frac{h}{2} [f(x_{i-1}) + f(x_i)] = h \Big( \frac{f(x_0) + f(x_n)}{2} + \sum_{i=1}^{n-1} f(x_i) \Big) =: CT_h(f), \]

and (4.2.1) implies

\[ \int_a^b f(x)\, dx - CT_h(f) = -\frac{(b - a) f''(\eta)}{12}\, h^2 \tag{4.2.2} \]

for some $\eta \in [a, b]$.

The observation that $T_h(f)$ integrates a linear polynomial exactly can be generalized to the following problem: Determine $A_0, \dots, A_n$ such that

\[ p \in P_n \implies \int_a^b p(x)\, dx = A_0 p(x_0) + \dots + A_n p(x_n). \tag{4.2.3} \]

Theorem 4.2.1. Let $l_i(x)$ be the $i$th Lagrange polynomial, $i = 0, \dots, n$. The

\[ A_i = \int_a^b l_i(x)\, dx \]

are the unique coefficients satisfying (4.2.3).

Alternatively, the coefficients $A_i$ can be computed by solving a linear system of equations, which is called the method of undetermined coefficients. For example, taking $[a, b] = [0, 1]$ and $n = 2$ with equidistant points yields $A_0 = A_2 = \frac{1}{6}$ and $A_1 = \frac{2}{3}$, which is Simpson’s rule.
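The method of undetermined coefficients can be sketched in MATLAB as follows. Requiring exactness for the monomials 1, x, ..., x^n on [0, 1] gives a linear system with a Vandermonde-type matrix; the use of the monomial basis is an assumption made here for simplicity.

```matlab
% Sketch: method of undetermined coefficients on [0,1] for the nodes 0, 1/2, 1.
x = [0 0.5 1]; n = numel(x) - 1;
V = zeros(n+1); m = zeros(n+1,1);
for k = 0:n
    V(k+1,:) = x.^k;          % row k: sum_i A_i x_i^k
    m(k+1) = 1/(k+1);         % exact moment: integral of x^k over [0,1]
end
A = V\m;                      % weights A_0, ..., A_n
disp(A')                      % weights of Simpson's rule: 1/6, 2/3, 1/6
```

Changing the vector x to other distinct nodes in [0, 1] gives the weights of the corresponding interpolatory rule; for many nodes the Vandermonde matrix becomes ill-conditioned, so this approach is best kept to small n.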

Dividing $[a, b]$ again into $n$ intervals $[x_{i-1}, x_i]$ with $n$ even and

\[ x_i = a + ih, \quad h = \frac{b - a}{n}, \quad f_i = f(x_i), \quad i = 0, \dots, n, \]

yields the composite Simpson’s rule

\[ \int_a^b f(x)\, dx \approx \frac{h}{3} (f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + \dots + 2f_{n-2} + 4f_{n-1} + f_n) =: CS_h(f). \]

For odd $n$, we may use $T_h(f)$ on the extra interval $[x_{n-1}, x_n]$, or introduce $x_{n-\frac{1}{2}} = \frac{x_{n-1} + x_n}{2}$ and use

\[ \int_{x_{n-1}}^{x_n} f(x)\, dx \approx \frac{h}{6} (f_{n-1} + 4 f_{n-\frac{1}{2}} + f_n), \]

or determine $A_0, A_1, A_2$ for

\[ \int_{x_{n-1}}^{x_n} f(x)\, dx \approx A_0 f_{n-2} + A_1 f_{n-1} + A_2 f_n, \]

called half-simp or semi-simp rule.

Theorem 4.2.2. For $f \in C^4([a, b]; \mathbb{R})$ it holds that

\[ \int_a^b f(x)\, dx - S(f) = -\frac{(b - a)^5}{2880}\, f^{(4)}(\xi) \]

for some $\xi \in [a, b]$ and

\[ \int_a^b f(x)\, dx - CS_h(f) = -\frac{(b - a)}{180}\, f^{(4)}(\xi)\, h^4 \]

for some $\xi \in [a, b]$.
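The orders h² and h⁴ of the composite trapezoidal and Simpson rules can be observed by halving h. The following MATLAB sketch does this for the assumed test integral of exp(x) over [0, 1] with n = 8 and n = 16 subintervals.

```matlab
% Sketch: convergence of composite trapezoidal and Simpson rules (test: exp on [0,1]).
f = @(x) exp(x); a = 0; b = 1; Iex = exp(1) - 1;
ns = [8 16]; errT = zeros(1,2); errS = zeros(1,2);
for m = 1:2
    n = ns(m); h = (b-a)/n;               % n must be even for Simpson's rule
    x = a + (0:n)*h; fx = f(x);           % fx(i+1) = f(x_i)
    CT = h*((fx(1) + fx(end))/2 + sum(fx(2:end-1)));
    CS = h/3*(fx(1) + 4*sum(fx(2:2:end-1)) + 2*sum(fx(3:2:end-2)) + fx(end));
    errT(m) = abs(Iex - CT); errS(m) = abs(Iex - CS);
end
fprintf('trapezoidal: error ratio %.2f (expected about 4)\n', errT(1)/errT(2));
fprintf('Simpson:     error ratio %.2f (expected about 16)\n', errS(1)/errS(2));
```

Halving h reduces the error by a factor of about 2² = 4 for the trapezoidal rule and about 2⁴ = 16 for Simpson's rule, in agreement with (4.2.2) and Theorem 4.2.2.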

The quadrature formulas based on exact integration of the interpolation polynomial with equidistant points on $[0, 1]$ are called Newton-Cotes formulas. In this case, the weights $A_i$ can be precomputed based on Theorem 4.2.1. For closed Newton-Cotes formulas, i. e., $x_i = \frac{i}{n}$, $i = 0, \dots, n$, one obtains

    n   A_0, ..., A_n                        Error, ξ ∈ [0, 1]           Name
    1   1/2   1/2                            (1/12) f''(ξ)               Trapezoidal rule
    2   1/6   2/3   1/6                      (1/2880) f^(4)(ξ)           Simpson's rule
    3   1/8   3/8   3/8   1/8                (1/6480) f^(4)(ξ)           Newton's 3/8-rule
    4   7/90  32/90  12/90  32/90  7/90      (2/(4^6·945)) f^(6)(ξ)      Milne's rule

Note that all weights for $n = 1, \dots, 4$ are positive.

26.11.13

4.3. Gaussian quadrature. The idea of Gaussian quadrature rules is that, besides the $n + 1$ weights $A_0, \dots, A_n$, the points $x_0, \dots, x_n$ represent another $n + 1$ degrees of freedom, which can be used to extend the exactness property of the rule to polynomials up to degree $2n + 1$. We also consider the more general problem

\[ \int f := \int_a^b f(x)\,\omega(x)\,dx \approx A_0 f(x_0) + \dots + A_n f(x_n) \tag{4.3.1} \]

where $\omega(x) > 0$ is a given weight function, and we will use the concept of orthogonal polynomials.

A sequence of polynomials $(p_i)_{i \in \mathbb{N}_0}$ with $\deg(p_i) = i$ is orthogonal if

\[ i \neq j \implies \int p_i p_j = 0, \]

where we may assume that $p_i$ is monic, i.e., the coefficient of $x^i$ is one.

Lemma 4.3.1. Let $(p_i)_{i \in \mathbb{N}_0}$ be a sequence of polynomials with $\deg(p_i) = i$. If

\[ q(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_0 \tag{4.3.2} \]

then $q$ can be written uniquely in the form

\[ q = b_n p_n + b_{n-1} p_{n-1} + \dots + b_0 p_0. \tag{4.3.3} \]

Corollary 4.3.1. Let $(p_i)_{i \in \mathbb{N}_0}$ be orthogonal. Then $p_{n+1}$ is orthogonal to any polynomial $q \in \mathcal{P}_n$.

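Theorem 4.3.1. The following three-term recurrence generates a sequence of orthogonal (monic) polynomials

\[ p_0 = 1, \quad p_1 = x - \alpha_1, \quad p_{n+1} = x p_n - \alpha_{n+1} p_n - \beta_{n+1} p_{n-1}, \quad n = 1, 2, \dots \]

where

\[ \alpha_{n+1} = \int x p_n^2 \Big/ \int p_n^2, \quad \text{and} \quad \beta_{n+1} = \int x p_n p_{n-1} \Big/ \int p_{n-1}^2. \]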
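The recurrence of Theorem 4.3.1 can be run exactly for a concrete case. The sketch below assumes $[a, b] = [-1, 1]$ and $\omega(x) = 1$ (the Legendre case) and stores a polynomial as a list of coefficients, lowest degree first; the helper names are hypothetical.

```python
from fractions import Fraction

def integrate(p):
    """Exact integral over [-1, 1] (weight 1) of sum_k p[k] * x^k."""
    return sum(c * Fraction(2, k + 1) for k, c in enumerate(p) if k % 2 == 0)

def mul(p, q):
    """Product of two coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def monic_orthogonal(n_max):
    """Return p_0, ..., p_{n_max} (n_max >= 1) from the three-term recurrence."""
    x = [Fraction(0), Fraction(1)]
    p_prev = [Fraction(1)]                                   # p_0 = 1
    alpha1 = integrate(mul(x, mul(p_prev, p_prev))) / integrate(mul(p_prev, p_prev))
    p_cur = [-alpha1, Fraction(1)]                           # p_1 = x - alpha_1
    polys = [p_prev, p_cur]
    for _ in range(1, n_max):
        alpha = integrate(mul(x, mul(p_cur, p_cur))) / integrate(mul(p_cur, p_cur))
        beta = integrate(mul(x, mul(p_cur, p_prev))) / integrate(mul(p_prev, p_prev))
        nxt = mul(x, p_cur)                                  # x * p_n
        for k, c in enumerate(p_cur):
            nxt[k] -= alpha * c                              # - alpha_{n+1} * p_n
        for k, c in enumerate(p_prev):
            nxt[k] -= beta * c                               # - beta_{n+1} * p_{n-1}
        polys.append(nxt)
        p_prev, p_cur = p_cur, nxt
    return polys

polys = monic_orthogonal(3)   # p_2 = x^2 - 1/3, p_3 = x^3 - (3/5) x
```

The zeros of $p_2$ and $p_3$ obtained this way, $\pm 1/\sqrt{3}$ and $0, \pm\sqrt{3/5}$, are the well-known Gauss-Legendre nodes.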

Lemma 4.3.2. Let $p_0, \dots, p_{n+1}$ be a sequence of orthogonal (monic) polynomials. Then the zeros of $p_{n+1}$ are real, simple and lie in $[a, b]$.

A Gaussian quadrature formula is obtained by constructing a quadrature formula that is exact for the class $\mathcal{P}_n$, using as $x_0, \dots, x_n$ the zeros of $p_{n+1}$ for a sequence of orthogonal (monic) polynomials $p_0, \dots, p_{n+1}$.

Theorem 4.3.2. Let $p_0, \dots, p_{n+1}$ be a sequence of orthogonal (monic) polynomials. Moreover, let $x_0, \dots, x_n$ be the zeros of $p_{n+1}$. Set $A_i = \int l_i$, $i = 0, \dots, n$, where $l_i(x)$ is the $i$th Lagrange polynomial over $x_0, \dots, x_n$. For any $f \in C([a, b]; \mathbb{R})$, let

\[ G_n f := A_0 f(x_0) + \dots + A_n f(x_n). \]

Then if $f \in \mathcal{P}_{2n+1}$ it holds $\int f = G_n f$.

Proof. By construction $G_n f$ is exact for $f \in \mathcal{P}_n$. Let $f \in \mathcal{P}_{2n+1}$. Divide $f$ by $p_{n+1}$ to get

\[ f = p_{n+1} q + r \]

for some $q \in \mathcal{P}_n$, $r \in \mathcal{P}_n$. Then

\[ \begin{aligned}
G_n f &= \sum_{i=0}^{n} A_i f(x_i) \\
&= \sum_{i=0}^{n} A_i \,[p_{n+1}(x_i) q(x_i) + r(x_i)] \\
&= \sum_{i=0}^{n} A_i r(x_i) && \text{because } p_{n+1}(x_i) = 0 \\
&= G_n r \\
&= \int r && \text{because } G_n \text{ is exact for } r \in \mathcal{P}_n \\
&= \int (p_{n+1} q + r) && \text{because } \textstyle\int p_{n+1} q = 0 \text{ for } q \in \mathcal{P}_n \\
&= \int f.
\end{aligned} \]

The $A_i$ in the above theorem are positive and one proves the following error estimate

\[ \int f - G_n f = \frac{f^{(2n+2)}(\xi)}{(2n+2)!} \int p_{n+1}^2 \]

for $\xi \in [a, b]$. Moreover, one can prove

\[ f \in C([a, b]; \mathbb{R}) \implies \lim_{n \to \infty} G_n f = \int f \]

using Weierstrass' approximation theorem.

Particular Gauss formulas arise for particular choices of $[a, b]$ and $\omega(x)$.

| $[a, b]$ | $\omega$ | Name |
|---|---|---|
| $[-1, 1]$ | $\omega(x) = 1$ | Gauss-Legendre quadrature |
| $[0, \infty)$ | $\omega(x) = e^{-x}$ | Gauss-Laguerre quadrature |
| $(-\infty, \infty)$ | $\omega(x) = e^{-x^2}$ | Gauss-Hermite quadrature |

The coefficients and abscissas can be found in most mathematical handbooks.

Example 4.3.1. (Gauss-Legendre)
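The notes do not spell this example out. As a hedged illustration (not necessarily the example given in the lecture), the sketch below uses the standard tabulated Gauss-Legendre nodes and weights for $n = 1$ and $n = 2$ on $[-1, 1]$ and checks the exactness degree $2n + 1$; the table name `GAUSS_LEGENDRE` is hypothetical.

```python
import math

# Nodes and weights on [-1, 1] with weight 1 (standard tabulated values).
# n = 1: zeros of p_2 = x^2 - 1/3;  n = 2: zeros of p_3 = x^3 - (3/5) x.
GAUSS_LEGENDRE = {
    1: ([-1 / math.sqrt(3), 1 / math.sqrt(3)], [1.0, 1.0]),
    2: ([-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5)], [5 / 9, 8 / 9, 5 / 9]),
}

def gauss_legendre(f, n):
    """G_n f = A_0 f(x_0) + ... + A_n f(x_n) on [-1, 1]."""
    nodes, weights = GAUSS_LEGENDRE[n]
    return sum(A * f(x) for x, A in zip(nodes, weights))

# Two nodes integrate cubics exactly, three nodes integrate quintics exactly.
cubic = gauss_legendre(lambda x: x**3 + x**2, 1)     # exact value 2/3
quartic_3pt = gauss_legendre(lambda x: x**4, 2)      # exact value 2/5
quartic_2pt = gauss_legendre(lambda x: x**4, 1)      # not exact: degree 4 > 3
```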

Literature. [S, Lectures 21, 23]

02.12.13

4.4. Romberg quadrature. The composite trapezoidal rule on a mesh $x_i = a + ih$, $i = 0, \dots, n$,

\[ T(n) := CT_h := CT_h(f) = h\left(\frac{1}{2}(f(a) + f(b)) + \sum_{i=1}^{n-1} f(a + ih)\right) \]

has the following asymptotic expansion in terms of $h^2$.

Theorem 4.4.1. Let $f \in C^{2m+1}([a, b]; \mathbb{R})$. Then

\[ CT_h = \int_a^b f(t)\,dt + \tau_2 h^2 + \tau_4 h^4 + \dots + \tau_{2m} h^{2m} + R_{2m+2}(h)\,h^{2m+2} \]

with

\[ \tau_{2k} = \frac{B_{2k}}{(2k)!}\left(f^{(2k-1)}(b) - f^{(2k-1)}(a)\right) \]

and

\[ R_{2m+2}(h) = -\int_a^b K_{2m+2}(t, h)\,f^{(2m)}(t)\,dt, \]

where $B_{2k} \sim (2k)!$ are the Bernoulli numbers and the functions $K_{2m+2}$ are closely related to the Bernoulli functions. Moreover, there exists a constant $C_{2m+2} \geq 0$ not depending on $h$ such that

\[ |R_{2m+2}(h)| \leq C_{2m+2}\,|b - a|. \]

Thus, $CT_h$ is an example of a method $Q(h)$ depending on a step size $h$, satisfying $\lim_{h \to 0} Q(h) = \tau_0$ for some $\tau_0$ and having an asymptotic expansion in $h^p$ up to an order $pm$

\[ Q(h) = \tau_0 + \tau_p h^p + \tau_{2p} h^{2p} + \dots + \tau_{mp} h^{mp} + O(h^{(m+1)p}), \quad h \to 0 \]

for constants $\tau_0, \tau_p, \tau_{2p}, \dots, \tau_{mp} \in \mathbb{R}$.

The idea of Romberg quadrature is that once we have computed $Q(h)$ for $k$ different step sizes

\[ h = h_{i-k+1}, \dots, h_i \]

we can determine the interpolation polynomial in $h^p$ as

\[ P_{ik}(h^p) = P(h^p; h_{i-k+1}^p, \dots, h_i^p) \in \mathcal{P}_{k-1} \]

with respect to the points

\[ \left(h_{i-k+1}^p, Q(h_{i-k+1})\right), \dots, \left(h_i^p, Q(h_i)\right) \]

and extrapolate the value of $Q(0)$ by evaluating $P_{ik}$ at zero

\[ Q_{ik} := P_{ik}(0), \quad 1 \leq k \leq i. \tag{4.4.1} \]

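Lemma 4.4.1. Let $P_f(x; x_0, \dots, x_n)$ denote the interpolation polynomial for the points $x_0, \dots, x_n$ and data $f(x_0), \dots, f(x_n)$. Then

\[ P_f(x; x_0, \dots, x_n) = \frac{(x_0 - x)\,P_f(x; x_1, \dots, x_n) - (x_n - x)\,P_f(x; x_0, \dots, x_{n-1})}{x_0 - x_n}. \]

Hence, setting $P_{ik} = P_f(x; x_{i-k}, \dots, x_i)$ for $i \geq k$, the value $P_{nn} = P_f(x; x_0, \dots, x_n)$ can be computed by the recursion

\[ \begin{aligned}
P_{i0} &= f_i, \quad i = 0, \dots, n \\
P_{ik} &= P_{i,k-1} + \frac{x - x_i}{x_i - x_{i-k}}\,(P_{i,k-1} - P_{i-1,k-1}), \quad i \geq k.
\end{aligned} \tag{4.4.2} \]

For the evaluation (4.4.1), where $Q_{ik}$ is based on the $k$ step sizes $h_{i-k+1}, \dots, h_i$, this yields

\[ \begin{aligned}
Q_{i1} &:= Q(h_i), \quad i = 1, 2, \dots \\
Q_{ik} &:= Q_{i,k-1} + \frac{Q_{i,k-1} - Q_{i-1,k-1}}{\left(\frac{h_{i-k+1}}{h_i}\right)^p - 1}, \quad 2 \leq k \leq i.
\end{aligned} \tag{4.4.3} \]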

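Theorem 4.4.2. The approximation error $\varepsilon_{ik} = |Q_{ik} - \tau_0|$ satisfies

\[ \varepsilon_{ik} = |\tau_{kp}|\,h_{i-k+1}^p \cdots h_i^p + \sum_{j=i-k+1}^{i} O\!\left(h_j^{(k+1)p}\right) \quad \text{for } h_j \leq h \to 0. \]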

Algorithm (Extrapolation method).

(1) For a basic step size $H$, choose a sequence of step sizes $h_1, h_2, \dots$ with $h_j = \frac{H}{n_j}$, $n_{j+1} > n_j$ and set $i = 1$.
(2) Determine $Q_{i1} = Q(h_i)$.
(3) Compute $Q_{ik}$ for $k = 2, \dots, i$ by (4.4.3).
(4) If $Q_{ii}$ is precise enough, or if $i$ is too large, then STOP. Else set $i = i + 1$ and go to (2).

For $Q(h) = CT_h$ this extrapolation method is known as classical Romberg quadrature. For the Romberg sequence $n_i = 2^{i-1}$ one obtains

\[ Q_{i1} = CT_{h_i}(f) = \frac{1}{2} Q_{i-1,1} + h_i \sum_{k=1}^{n_{i-1}} f(a + (2k - 1) h_i) \]

in step (2), and we may for example STOP if $|Q_{ii} - Q_{i-1,i-1}| < \mathrm{TOL}$ or $i > \mathrm{ITERMAX}$ for given parameters TOL and ITERMAX.
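The extrapolation algorithm for the Romberg sequence can be sketched as follows. This is a minimal illustration, assuming $H = b - a$, $p = 2$ and the stopping rule described above; the function name `romberg` and its default parameters are hypothetical.

```python
import math

def romberg(f, a, b, tol=1e-10, itermax=20):
    """Classical Romberg quadrature with h_i = (b - a) / 2**(i - 1), p = 2.

    Returns (Q_ii, i). Rows are stored 0-based: row[k - 1] holds Q_{ik}.
    """
    H = b - a
    prev = [0.5 * H * (f(a) + f(b))]              # Q_11 = CT_H(f)
    i = 1
    while i < itermax:
        i += 1
        h = H / 2 ** (i - 1)
        n_new = 2 ** (i - 2)                      # number of new midpoints
        new_points = sum(f(a + (2 * k - 1) * h) for k in range(1, n_new + 1))
        row = [0.5 * prev[0] + h * new_points]    # Q_i1 from Q_{i-1,1}
        for k in range(2, i + 1):
            # (h_{i-k+1} / h_i)^2 = 4^(k-1) for the Romberg sequence
            factor = 4 ** (k - 1)
            row.append(row[k - 2] + (row[k - 2] - prev[k - 2]) / (factor - 1))
        if abs(row[-1] - prev[-1]) < tol:         # |Q_ii - Q_{i-1,i-1}| < TOL
            return row[-1], i
        prev = row
    return prev[-1], i

value, steps = romberg(math.sin, 0.0, math.pi)    # exact integral is 2
```

Only the previous row of the tableau is kept, since (4.4.3) needs nothing older.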

Literature. [D, Section 9.4]

03.12.13

5. Linear Systems of Equations (LSE)

We consider the equation $Ax = b$ for a nonsingular matrix $A \in \mathbb{R}^{n \times n}$. Given $x \in \mathbb{R}^n$, the equation has a (unique) solution $b$ given by $Ax$. Given $b \in \mathbb{R}^n$, the equation has a unique solution $x$ given by $A^{-1} b$.

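5.1. Condition number of a matrix. We find

\[ \kappa_{x \mapsto Ax} = \|A\|\,\frac{\|x\|}{\|Ax\|} \leq \|A\|\,\|A^{-1}\| \tag{5.1.1} \]

and

\[ \kappa_{b \mapsto A^{-1}b} = \|A^{-1}\|\,\frac{\|b\|}{\|A^{-1}b\|} \leq \|A\|\,\|A^{-1}\| \tag{5.1.2} \]

and the estimates are sharp for certain $A$ and $x$ or $b$. Moreover, we find

\[ \kappa_{A \mapsto x = A^{-1}b} = \|A\|\,\|A^{-1}\|. \tag{5.1.3} \]

The product $\|A\|\,\|A^{-1}\|$ is called the condition number of $A$

\[ \kappa_A := \|A\|\,\|A^{-1}\| \tag{5.1.4} \]

(relative to the norm $\|\cdot\|$). It holds

\[ 1 \leq \|A A^{-1}\| \leq \|A\|\,\|A^{-1}\| = \kappa_A, \]

and as a rule of thumb one must always expect to "lose $\log_{10} \kappa_A$ digits" when solving an LSE.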

Example 5.1.1. Consider $\kappa$ for $Ax = b$, some $b$ and

\[ A = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} \]

and the same system with the first equation in the system multiplied with $10^{-4}$.
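The effect of this rescaling can be checked numerically. The sketch below assumes the $\infty$-norm (maximum absolute row sum), which the notes do not fix for this example, and uses the explicit inverse of a $2 \times 2$ matrix; all function names are hypothetical.

```python
def inf_norm(M):
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def inverse_2x2(M):
    """Explicit inverse of a nonsingular 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def cond_inf(M):
    """Condition number ||M|| * ||M^{-1}|| in the infinity norm."""
    return inf_norm(M) * inf_norm(inverse_2x2(M))

A = [[1.0, 1.0], [1.0, 2.0]]
A_scaled = [[1e-4, 1e-4], [1.0, 2.0]]   # first equation multiplied by 1e-4

kappa = cond_inf(A)                 # 3 * 3 = 9
kappa_scaled = cond_inf(A_scaled)   # 3 * 20001 = 60003
```

Under this assumption the condition number grows from 9 to about $6 \cdot 10^4$, although both systems have the same solution.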

The example shows that ill-conditioning may be due to bad scaling. Since

\[ Ax = b \iff BAx = Bb \]

for nonsingular matrices $B \in \mathbb{R}^{n \times n}$, a $B$ for which

\[ \kappa_{BA} \ll \kappa_A \]

is called a preconditioner of $A$.

Literature. [TB, Lecture 12]

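5.2. LU-Factorization. A triangular system

\[ Ux = b \tag{5.2.1} \]

where $U = (u_{ij})$ is an upper triangular matrix can be solved recursively by backward substitution

\[ \begin{aligned}
x_n &= b_n / u_{nn} \\
x_{n-1} &= (b_{n-1} - u_{n-1,n} x_n) / u_{n-1,n-1} \\
&\;\;\vdots \\
x_1 &= (b_1 - u_{12} x_2 - \dots - u_{1n} x_n) / u_{11}
\end{aligned} \]

iff $u_{ii} \neq 0$, $i = 1, \dots, n$, that is, iff $U$ is nonsingular. In total analogy, a triangular system

\[ Lx = b \tag{5.2.2} \]

with a lower triangular matrix $L$ can be solved by forward substitution.

A general LSE

\[ Ax = b \tag{5.2.3} \]

with $A = (a_{ij})$ can be transformed into a triangular system by a sequence of operations

\[ A = A^{(1)} \to A^{(2)} \to \dots \to A^{(n)}, \quad b = b^{(1)} \to b^{(2)} \to \dots \to b^{(n)} \]

with

\[ A^{(k)} = \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & \cdots & \cdots & a_{1n}^{(1)} \\
 & a_{22}^{(2)} & \cdots & \cdots & \cdots & a_{2n}^{(2)} \\
 & & \ddots & & & \vdots \\
 & & & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)} \\
 & & & \vdots & & \vdots \\
 & & & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)}
\end{pmatrix}, \quad
b^{(k)} = \begin{pmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_k^{(k)} \\ \vdots \\ b_n^{(k)} \end{pmatrix}, \]

where

\[ \begin{aligned}
l_{ik} &:= a_{ik}^{(k)} / a_{kk}^{(k)}, && \text{for } i = k+1, \dots, n \\
a_{ij}^{(k+1)} &:= a_{ij}^{(k)} - l_{ik}\,a_{kj}^{(k)}, && \text{for } i, j = k+1, \dots, n \\
b_i^{(k+1)} &:= b_i^{(k)} - l_{ik}\,b_k^{(k)}, && \text{for } i = k+1, \dots, n
\end{aligned} \]

if the pivot elements $a_{kk}^{(k)}$ do not vanish. The transformations $A^{(k)} \to A^{(k+1)}$ and $b^{(k)} \to b^{(k+1)}$ can be represented as

\[ A^{(k+1)} = L_k A^{(k)}, \quad b^{(k+1)} = L_k b^{(k)} \]

with

\[ L_k = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & -l_{k+1,k} & 1 & & \\
 & & \vdots & & \ddots & \\
 & & -l_{n,k} & & & 1
\end{pmatrix}. \]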

It holds

\[ L := L_1^{-1} \cdots L_{n-1}^{-1} = \begin{pmatrix}
1 & & & \\
l_{21} & 1 & & \\
\vdots & \ddots & \ddots & \\
l_{n1} & \cdots & l_{n,n-1} & 1
\end{pmatrix}, \quad L^{-1} = L_{n-1} \cdots L_1, \]

so that with $L$ lower triangular

\[ U := A^{(n)} = L^{-1} A \tag{5.2.4} \]

is upper triangular and we obtain the product representation

\[ A = LU \tag{5.2.5} \]

known as LU-factorization of $A$.

Equation (5.2.5) yields the Gaussian elimination method to solve $Ax = b$.

Algorithm 5.2.1.

a) Compute (5.2.5).
b) Solve $Lz = b$ by forward substitution.
c) Solve $Ux = z$ by backward substitution.

The complexity in terms of multiplications is $O(n^3)$ for a) and $O(n^2)$ for each of b) and c).

09.12.13

Algorithm 5.2.1 is only applicable if the pivot elements satisfy $a_{ii}^{(i)} \neq 0$ for all $i = 1, \dots, n$. However, this requirement can be avoided by suitably exchanging rows and/or columns of $A$. The partial pivoting or column pivoting strategy is to choose at each elimination step as pivot row the one having the largest element in absolute value in the current pivot column.

Algorithm 5.2.2.

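a) In step $A^{(k)} \to A^{(k+1)}$, choose $p \in \{k, \dots, n\}$ such that

\[ |a_{pk}^{(k)}| \geq |a_{jk}^{(k)}| \quad \text{for } j = k, \dots, n. \]

b) Interchange rows $p$ and $k$

\[ A^{(k)} \to \tilde{A}^{(k)} \quad \text{with} \quad \tilde{a}_{ij}^{(k)} = \begin{cases}
a_{kj}^{(k)}, & \text{if } i = p \\
a_{pj}^{(k)}, & \text{if } i = k \\
a_{ij}^{(k)}, & \text{otherwise}
\end{cases} \]

\[ b^{(k)} \to \tilde{b}^{(k)} = (b_1^{(k)}, \dots, b_{k-1}^{(k)}, b_p^{(k)}, b_{k+1}^{(k)}, \dots, b_k^{(k)}, \dots, b_n^{(k)})^\top, \]

where $b_k^{(k)}$ now sits in position $p$.

c) Perform the next elimination step for $\tilde{A}^{(k)}, \tilde{b}^{(k)}$, i.e.,

\[ \tilde{A}^{(k)} \to A^{(k+1)}, \quad \tilde{b}^{(k)} \to b^{(k+1)}. \]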

Let $P_\pi = [e_{\pi(1)} | \dots | e_{\pi(n)}] \in \mathbb{R}^{n \times n}$ with $e_j = (\delta_{1j}, \dots, \delta_{nj})$ be the permutation matrix corresponding to a permutation $\pi$ of the numbers $1, \dots, n$.

Theorem 5.2.1. For every nonsingular matrix $A \in \mathbb{R}^{n \times n}$ there exists a permutation matrix $P_\pi$ such that a triangular factorization of the form

\[ P_\pi A = LU \]

is possible. Here, $P_\pi$ can be chosen such that $L = (l_{ij})$ with $|l_{ij}| \leq 1$.

Instead of row interchange one can also perform row pivoting with column interchange. Both strategies require $O(n^2)$ additional flops. The total pivoting strategy, where one combines both methods, requires $O(n^3)$ additional flops and is therefore seldom employed.
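Gaussian elimination with partial pivoting, followed by the two triangular solves, can be sketched as below. This is a minimal illustration for small dense matrices stored as lists of rows, assuming a nonsingular input; the function names are hypothetical, and the permutation is kept as an index list instead of a matrix $P_\pi$.

```python
def lu_partial_pivoting(A):
    """Return (perm, L, U) with row perm[i] of A equal to row i of L*U."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n - 1):
        # a) pivot search in column k
        p = max(range(k, n), key=lambda r: abs(U[r][k]))
        if U[p][k] == 0.0:
            raise ValueError("matrix is singular")
        # b) interchange rows p and k (including already stored multipliers)
        if p != k:
            U[k], U[p] = U[p], U[k]
            L[k], L[p] = L[p], L[k]
            perm[k], perm[p] = perm[p], perm[k]
        # c) elimination step
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    for i in range(n):
        L[i][i] = 1.0
    return perm, L, U

def solve(A, b):
    """Solve A x = b via P A = L U, forward and backward substitution."""
    perm, L, U = lu_partial_pivoting(A)
    n = len(A)
    z = [0.0] * n
    for i in range(n):                       # L z = P b
        z[i] = b[perm[i]] - sum(L[i][j] * z[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):             # U x = z
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (z[i] - s) / U[i][i]
    return x

A = [[0.0, 2.0, 1.0], [1.0, 1.0, 1.0], [2.0, 1.0, 0.0]]   # a_11 = 0 forces a row swap
b = [7.0, 6.0, 4.0]                                        # chosen so that x = (1, 2, 3)
x = solve(A, b)
```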

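Remark 5.2.1. Note that Theorem 5.2.1 yields

\[ \det A = \det P_\pi \cdot \det(LU) = \operatorname{sgn}(\pi) \cdot u_{11} \cdots u_{nn} \]

with $U = (u_{ij})$ and $\operatorname{sgn}(\pi) = \prod_{1 \leq i < j \leq n} \frac{\pi(j) - \pi(i)}{j - i}$.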

Backward stability analysis yields the following result.

Theorem 5.2.2. The floating point realization of Gaussian elimination with column pivoting for the linear system $Ax = b$ computes a solution $\hat{x}$ such that $\hat{A}\hat{x} = b$ for a matrix $\hat{A}$ with

\[ \frac{\|A - \hat{A}\|_\infty}{\|A\|_\infty} \leq 2 n^3 \rho_n(A)\,\varepsilon_{\text{machine}} + o(\varepsilon_{\text{machine}}), \quad \varepsilon_{\text{machine}} \to 0, \]

where

\[ \rho_n(A) = \frac{\alpha_{\max}}{\max_{1 \leq i,j \leq n} |a_{ij}|} \]

and $\alpha_{\max}$ is the largest absolute value of an element of the remainder matrices $A^{(1)}$ through $A^{(n)}$ appearing during elimination.

It is known:

| Type of $A$ | $\rho \leq$ |
|---|---|
| nonsingular | $2^{n-1}$ |
| $A$ or $A^\top$ strictly diagonally dominant | $2$ |
| tridiagonal | $2$ |
| random | $n^{2/3}$ (average) |

Hence, supported by probabilistic considerations, Gaussian elimination can be considered as stable for matrices usually encountered in practice in the sense that the backward stability indicator

\[ \rho_{A,b} \leq 2 n^3 \rho_n(A) \]

is not "too large".

Literature. [D, Section 1.1–1.3]

5.3. Cholesky Decomposition. We will now apply Gaussian elimination to the special class of systems with symmetric positive definite (s.p.d.) matrices, i.e., $A = A^\top \in \mathbb{R}^{n \times n}$ with

\[ \langle x, Ax \rangle > 0, \quad \forall x \neq 0. \tag{5.3.1} \]

Theorem 5.3.1. For any s.p.d. $A = (a_{ij}) \in \mathbb{R}^{n \times n}$ we have

(i) $A$ is nonsingular
(ii) $a_{ii} > 0$ for $i = 1, \dots, n$
(iii) $\max_{i,j=1,\dots,n} |a_{ij}| = \max_{i=1,\dots,n} a_{ii}$
(iv) Each remainder matrix obtained during Gaussian elimination (without pivoting) is also s.p.d.

Theorem 5.3.2. For every s.p.d. matrix $A \in \mathbb{R}^{n \times n}$ there exists a uniquely determined factorization

\[ A = L D L^\top \]

where $L$ is a unit lower triangular matrix and $D$ is a positive diagonal matrix.

Corollary 5.3.1. For $A$, $L$ and $D = \operatorname{diag}(d_i)$ as in Theorem 5.3.2, $D^{\frac{1}{2}} = \operatorname{diag}(\sqrt{d_i})$ exists and with it the Cholesky factorization

\[ A = \bar{L} \bar{L}^\top, \]

where $\bar{L}$ is the lower triangular matrix $\bar{L} = L D^{\frac{1}{2}}$.

The matrix $\bar{L}$ can be computed as follows.

Example 5.3.1. (Cholesky decomposition in MATLAB)

    function L = cholesky(A, n)
    % Cholesky factor of an s.p.d. n-by-n matrix A, so that A = L*L'.
    L = zeros(n, n);
    for k = 1:n
        % diagonal entry of column k
        L(k,k) = sqrt(A(k,k) - sum(L(k,1:k-1).^2));
        for i = k+1:n
            % entries below the diagonal in column k
            L(i,k) = (A(i,k) - sum(L(i,1:k-1).*L(k,1:k-1))) / L(k,k);
        end
    end

For s.p.d. matrices $A \in \mathbb{R}^{n \times n}$, Theorem 5.2.2 holds with $\rho_n(A) \leq 1$. Hence, Cholesky factorization is backward stable.

10.12.13

5.4. Classical iteration methods. For many applications, problems $Ax = b$ arise with sparse $A \in \mathbb{R}^{n \times n}$, i.e., with most components being zero.

Example 5.4.1. The LSE to solve the cubic spline interpolation problem is sparse.

Besides sparse direct solvers implementing sparsity-preserving strategies for Gaussian elimination, another well-suited approach to solve sparse LSE are fixed-point iterations

\[ x_{k+1} = \varphi(x_k). \]

We find

\[ Ax = b \iff Q^{-1}(b - Ax) = 0 \iff \varphi(x) := \underbrace{(I - Q^{-1}A)}_{=:G}\,x + \underbrace{Q^{-1}b}_{=:c} = x \]

for any nonsingular $Q \in \mathbb{R}^{n \times n}$ and need to ensure that

\[ x_{k+1} = \varphi(x_k) = G x_k + c \tag{5.4.1} \]

converges.

Theorem 5.4.1. The fixed-point iteration (5.4.1) converges to some $x^* \in \mathbb{R}^n$ for all $x_0 \in \mathbb{R}^n$ if and only if

\[ \rho(G) < 1 \]

where $\rho(G) = \max\{|\lambda| : \lambda \in \sigma(G)\}$ denotes the spectral radius of $G$.

Lemma 5.4.1.
(i) $\rho(G) \leq \|G\|$ for any compatible matrix norm $\|\cdot\|$.
(ii) For all $\varepsilon > 0$ there exists an induced norm $\|\cdot\|$ such that

\[ \rho(G) \leq \|G\| \leq \rho(G) + \varepsilon. \]

Hence, $\|G\| < 1$ is sufficient for $\rho(G) < 1$ and

\[ x_k - x^* = \varphi(x_{k-1}) - \varphi(x^*) = G(x_{k-1} - x^*) = \dots = G^k (x_0 - x^*) \tag{5.4.2} \]

yields an error estimate

\[ \|x_k - x^*\| \leq \|G\|^k\,\|x_0 - x^*\|. \]

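Besides convergence, we want $Q$ to be easily invertible. The choice $Q = I$ yields $G = I - A$, i.e.,

\[ x_{k+1} = x_k - A x_k + b \tag{5.4.3} \]

known as the Richardson method. We obtain that

\[ 0 < \lambda < 2, \quad \lambda \in \sigma(A) \]

is a necessary condition for the convergence of (5.4.3). The choice $Q = D = \operatorname{diag}(a_{11}, \dots, a_{nn})$ and

\[ A = L + D + U \tag{5.4.4} \]

with

\[ L = \begin{pmatrix}
0 & \cdots & \cdots & 0 \\
a_{21} & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
a_{n1} & \cdots & a_{n,n-1} & 0
\end{pmatrix}, \quad
U = \begin{pmatrix}
0 & a_{12} & \cdots & a_{1n} \\
\vdots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & a_{n-1,n} \\
0 & \cdots & \cdots & 0
\end{pmatrix} \]

yields the iteration

\[ x_{k+1} = -D^{-1}(L + U)\,x_k + D^{-1} b \tag{5.4.5} \]

known as the Jacobi method.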

Theorem 5.4.2. The iteration (5.4.5) converges for all $x_0 \in \mathbb{R}^n$ if $A$ is strictly diagonally dominant, i.e.,

\[ |a_{ii}| > \sum_{j \neq i} |a_{ij}|, \quad i = 1, \dots, n. \]

The choice $Q = D + L$ with (5.4.4) yields the iteration

\[ x_{k+1} = -(D + L)^{-1} U x_k + (D + L)^{-1} b \tag{5.4.6} \]

known as the Gauss-Seidel method.

Theorem 5.4.3. The iteration (5.4.6) converges for all $x_0 \in \mathbb{R}^n$ and any s.p.d. matrix $A \in \mathbb{R}^{n \times n}$.

Remark 5.4.1. The inequality

\[ \|(L + D)^{-1} U\|_\infty \leq \|D^{-1}(L + U)\|_\infty \]

shows that iteration (5.4.6) also converges under the assumptions of Theorem 5.4.2.

The convergence speed of the Gauss-Seidel method is typically better than the speed of the Jacobi method.

Example 5.4.2.

    function JacobiAndGaussSeidel
    % Compare Jacobi (X) and Gauss-Seidel (Y) on a random strictly
    % diagonally dominant system.
    n = 100; A = rand(n);
    for i = 1:n, A(i,i) = sum(abs(A(i,:))); end   % enforce diagonal dominance
    b = rand(n,1);
    % Splitting A = D - L - U (signs of L, U opposite to (5.4.4))
    D = diag(diag(A)); L = -tril(A,-1); U = -triu(A,1);
    m = 15; X = zeros(n,m); Y = zeros(n,m);
    X(:,1) = rand(n,1); Y(:,1) = X(:,1);           % common initial guess
    for k = 1:m-1
        X(:,k+1) = D\b + D\(L+U)*X(:,k);           % Jacobi step
        Y(:,k+1) = (D-L)\b + (D-L)\U*Y(:,k);       % Gauss-Seidel step
    end
    z = A\b; EX = zeros(m,1); EY = zeros(m,1);     % reference solution
    for k = 1:m
        EX(k) = norm(X(:,k)-z); EY(k) = norm(Y(:,k)-z);
    end
    plot(EX,'r-'); hold on; plot(EY,'b-');         % Jacobi red, Gauss-Seidel blue

16.12.13

The convergence speed of iteration (5.4.1) can often be improved by considering the relaxed iteration

\[ x_{k+1} = \omega (G x_k + c) + (1 - \omega) x_k \tag{5.4.7} \]

where $\omega \in [0, 1]$ is a damping parameter. With $G_\omega := \omega G + (1 - \omega) I$ we have

\[ x_{k+1} = G_\omega x_k + \omega c. \tag{5.4.8} \]

Definition 5.4.1. An iteration method (5.4.1) is called symmetrizable if, for any s.p.d. matrix $A \in \mathbb{R}^{n \times n}$, the matrix $I - G$ is similar to an s.p.d. matrix, i.e., there exists an invertible $W \in \mathbb{R}^{n \times n}$ such that $W (I - G) W^{-1}$ is s.p.d.

Example 5.4.3. The Richardson and the Jacobi method are symmetrizable.

Theorem 5.4.4. For a symmetrizable iteration (5.4.1) with the extreme eigenvalues $\lambda_{\min/\max}(G)$ and optimal damping parameter

\[ \bar{\omega} = \frac{2}{2 - \lambda_{\max}(G) - \lambda_{\min}(G)} \]

we have

\[ \rho(G_{\bar{\omega}}) = \frac{\lambda_{\max}(G) - \lambda_{\min}(G)}{2 - \lambda_{\max}(G) - \lambda_{\min}(G)} < 1. \]

This says that for any symmetrizable iteration method, convergence of the relaxed iteration can be enforced by choosing $\omega = \bar{\omega}$ as above.

Example 5.4.4. For the Richardson method, where $G = I - A$, we obtain

\[ \bar{\omega} = \frac{2}{\lambda_{\max}(A) + \lambda_{\min}(A)}. \]
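The effect of the optimal damping parameter for the Richardson method can be illustrated on a small s.p.d. system. The matrix and right-hand side below are assumptions chosen for illustration (not from the notes); since one eigenvalue exceeds 2, the undamped iteration diverges, while the damped one converges with rate $(\lambda_{\max} - \lambda_{\min}) / (\lambda_{\max} + \lambda_{\min})$.

```python
import math

def richardson(A, b, omega, steps):
    """Relaxed Richardson iteration x <- x + omega * (b - A x), starting at 0."""
    x = [0.0] * len(b)
    for _ in range(steps):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]
        x = [xi + omega * ri for xi, ri in zip(x, r)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # s.p.d., eigenvalues (7 +- sqrt(5)) / 2
b = [1.0, 2.0]                 # exact solution (1/11, 7/11)

# Eigenvalues of a symmetric 2x2 matrix from trace and determinant.
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
lam_max = (tr + math.sqrt(tr * tr - 4 * det)) / 2
lam_min = (tr - math.sqrt(tr * tr - 4 * det)) / 2

omega_opt = 2 / (lam_max + lam_min)               # optimal damping parameter
rho = (lam_max - lam_min) / (lam_max + lam_min)   # resulting spectral radius

x_damped = richardson(A, b, omega_opt, 60)
x_plain = richardson(A, b, 1.0, 60)               # undamped: lam_max > 2, diverges
```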

Remark 5.4.2. The Gauss-Seidel method is not symmetrizable. Nevertheless, onecan show that the relaxed iteration (5.4.7) converges for any s.p.d. matrix A andany ω ∈]0, 2[. For 1 < ω < 2, the method is called successive overrelaxation (SOR-method). Typical values for ω are 1.6 or 1.7.

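5.5. Multigrid methods. We consider

\[ -y''(x) = f(x), \quad x \in \Omega := \,]0, \pi[\,, \quad y(0) = y(\pi) = 0 \tag{5.5.1} \]

as a model problem. For a grid

\[ \Omega_h = \{x_j = jh : j = 1, \dots, n-1\}, \quad h = \pi / n, \]

unknowns $u_{h,j} \approx y(x_j)$ and $y''(x_j) \approx \frac{1}{h^2}\,(y(x_{j+1}) - 2y(x_j) + y(x_{j-1}))$, this yields an LSE in $u_h = (u_{h,1}, \dots, u_{h,n-1})^\top$

\[ A_h u_h = f_h, \quad A_h = \frac{1}{h^2} \begin{pmatrix}
2 & -1 & & 0 \\
-1 & 2 & \ddots & \\
 & \ddots & \ddots & -1 \\
0 & & -1 & 2
\end{pmatrix}, \quad f_h = \begin{pmatrix} f(x_1) \\ \vdots \\ \vdots \\ f(x_{n-1}) \end{pmatrix}. \tag{5.5.2} \]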

The relaxed Jacobi method for (5.5.2) becomes

\[ u^{(i+1)} = G_\omega u^{(i)} + \frac{\omega}{2}\,h^2 f_h \tag{5.5.3} \]

with $G_\omega = G_{\omega,h} = (1 - \omega) I + \omega G$, and

\[ G = G_h = I - \frac{h^2}{2} A_h. \]

Writing the errors $e^{(i)} = u^{(i)} - u = \sum_{k=1}^{n-1} \rho_k z_h^{(k)}$ as a linear combination of the eigenvectors of $G_\omega$

\[ z_h^{(k)} = (\sin(kh), \sin(2kh), \dots, \sin((n-1)kh))^\top \]

with corresponding eigenvalues $\mu_\omega^{(k)} = 1 - 2\omega \sin^2(kh/2)$, $k = 1, \dots, n-1$, and regarding the components of $z_h^{(k)}$ as waves on $\Omega_h$, one can show that

\[ G_\omega e^{(i)} = \sum_{k=1}^{n-1} \rho_k\,\mu_\omega^{(k)} z_h^{(k)} \]

and that

\[ \max_{\frac{n}{2} \leq k \leq n-1} |\mu_\omega^{(k)}| \]

becomes minimal for $\omega = \frac{2}{3}$ with

\[ |\mu_\omega^{(k)}| \leq \frac{1}{3}, \quad \frac{n}{2} \leq k \leq n-1 \]

independently of $h$, while

\[ \max_{1 \leq k \leq n-1} |\mu_\omega^{(k)}| \to 1 \]

for $h \to 0$. Hence, with this $\omega$, the method acts as a "smoother" of high-frequency oscillations in the error. Moreover, the error

\[ e_h^{(i)} = u_h^{(i)} - u_h \tag{5.5.4} \]

is the exact solution of $A_h e_h^{(i)} = -r_h^{(i)}$, where $r_h^{(i)} = f_h - A_h u_h^{(i)}$ is the residual of $u_h^{(i)}$. Hence, $r_h^{(i)}$ essentially contains only contributions of lower frequencies.

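A key idea is now that a long-wave grid function $g_h$ on $\Omega_h$ can be approximated well by a grid function $g_{2h}$ on a coarser grid $\Omega_{2h} = \{j\,2h : j = 1, \dots, \frac{n}{2} - 1\}$, where the wave appears more oscillatory, e.g., by restriction

\[ g_{2h} = I_h^{2h} g_h, \quad I_h^{2h} = \begin{pmatrix}
0 & 1 & 0 & & & & \\
 & & 0 & 1 & 0 & & \\
 & & & & \ddots & & \\
 & & & & 0 & 1 & 0
\end{pmatrix}. \]

Conversely, interpolation can be used to extend a coarse grid function $g_{2h}$ to a grid function $g_h$, e.g., by

\[ g_h = I_{2h}^h g_{2h}, \quad I_{2h}^h = \frac{1}{2} \begin{pmatrix}
1 & & \\
2 & & \\
1 & 1 & \\
 & 2 & \\
 & 1 & \ddots \\
 & & \ddots & 1 \\
 & & & 2 \\
 & & & 1
\end{pmatrix}. \]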

An elementary form of a multigrid method is therefore the following.

Algorithm 5.5.1. (Two-Grid method)

i) Let $v^{(0)}$ be an initial guess, set $i = 0$.
ii) Perform $m$ steps of (5.5.3) with $\omega = \frac{2}{3}$ and $u^{(0)} = v^{(i)}$, which results in $\omega_h = u^{(m)}$ with the residual $r_h = f_h - A_h \omega_h$ (smoothing step).
iii) Compute $r_{2h} := I_h^{2h} r_h$ (projection step).
iv) Solve $A_{2h} e_{2h} = -r_{2h}$ (coarse-grid solution).
v) Set $v^{(i+1)} = \omega_h - I_{2h}^h e_{2h}$ (interpolation and correction step).
vi) Set $i = i + 1$ and go to ii).

Theorem 5.5.1. For $m = 2$ and $e_h^{(i)} = v^{(i)} - u_h$ it holds

\[ \|e_h^{(i+1)}\|_2 \leq 0.782\,\|e_h^{(i)}\|_2, \]

that is, the convergence rate of the Two-Grid method is linear and independent of $h$.

The idea of multigrid methods is to use the Two-Grid method recursively to solve the problem in step iv) on a still coarser grid $\Omega_{4h}$, etc., until the problem becomes trivial, and then successively correct to the finer grids. Commonly, V-cycling or W-cycling are used as grid-level schemes.

Literature. [SB, Section 8.9]

17.12.13

6. Least Squares Problems

6.1. Normal Equations. We consider

Ax = b (6.1.1)

with $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $m \geq n$. We want to solve such an overdetermined system in the sense that we seek $x \in \mathbb{R}^n$ such that the residual norm

\[ \|b - Ax\|_2 \tag{6.1.2} \]

is minimized.

Example 6.1.1. (Polynomial interpolation)

Geometrically speaking, we seek a point $z = Ax$ in $\operatorname{Im}(A)$ which has the smallest distance to $b$, and the distance $\|b - Ax\|$ is minimal when $b - Ax$ is perpendicular to the subspace $\operatorname{Im}(A)$.

Theorem 6.1.1. Let $U \subset \mathbb{R}^m$ be a subspace and

\[ U^\perp = \{v \in \mathbb{R}^m : \langle v, u \rangle = 0 \text{ for all } u \in U\} \]

be its orthogonal complement. Then for all $v \in \mathbb{R}^m$ and $u \in U$

\[ \|v - u\| = \min_{u' \in U} \|v - u'\| \iff v - u \in U^\perp, \]

where $\|v\| = \sqrt{\langle v, v \rangle}$.

Hence, the solution $u \in U$ of $\|v - u\| = \min$ is uniquely determined as $u = Pv$, where

\[ P : \mathbb{R}^m \to U, \quad v \mapsto Pv \quad \text{with} \quad \|v - Pv\| = \min_{u \in U} \|v - u\| \]

is the orthogonal projection of $\mathbb{R}^m$ onto $U$.

Theorem 6.1.2. A vector $x \in \mathbb{R}^n$ minimizes the residual norm $\|b - Ax\|_2$, thereby solving the least squares problem (6.1.1), iff

\[ A^\top A x = A^\top b. \tag{6.1.3} \]

In particular, the solution is unique iff $\operatorname{rank}(A) = n$.

Equation (6.1.3) is called the normal equations.

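6.2. Condition. Consider the orthogonal projection $P : \mathbb{R}^m \to V \subset \mathbb{R}^m$, $b \mapsto Pb$.

Lemma 6.2.1. Let $\vartheta$ be the angle between $b$ and $V$, i.e.,

\[ \sin\vartheta = \frac{\|b - Pb\|_2}{\|b\|_2}. \]

Then

\[ \kappa_P = \frac{1}{\cos\vartheta}\,\|P\|_2. \]

Moreover, for $A \in \mathbb{R}^{m \times n}$, $m \geq n$, $\operatorname{rank}(A) = n$, we have

\[ \kappa_2(A)^2 := \frac{\max_{\|x\|_2 = 1} \|Ax\|_2^2}{\min_{\|x\|_2 = 1} \|Ax\|_2^2} = \frac{\max_{\|x\|_2 = 1} \langle A^\top A x, x \rangle}{\min_{\|x\|_2 = 1} \langle A^\top A x, x \rangle} = \frac{\lambda_{\max}(A^\top A)}{\lambda_{\min}(A^\top A)} = \kappa_2(A^\top A), \]

cf. tutorials.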

Theorem 6.2.1. Let $A \in \mathbb{R}^{m \times n}$, $m \geq n$, $\operatorname{rank}(A) = n$, $b \in \mathbb{R}^m$, and $x$ be the (unique) solution of $\|b - Ax\|_2 = \min$. Assume that $x \neq 0$ and denote by $\vartheta$ the angle

\[ \sin\vartheta = \frac{\|b - Ax\|_2}{\|b\|_2}. \]

Then the relative condition satisfies

(a) corresponding to perturbations of $b$,

\[ \kappa \leq \frac{\kappa_2(A)}{\cos\vartheta} \]

(b) corresponding to perturbations of $A$,

\[ \kappa \leq \kappa_2(A) + \kappa_2(A)^2 \tan\vartheta. \]

Hence, only if $\|b - Ax\|_2 \ll \|b\|_2$ is the condition of the least squares problem similar to that of an LSE.

6.3. Solution of Normal Equations. Assuming that the least squares problem is uniquely solvable, the matrix $A^\top A \in \mathbb{R}^{n \times n}$ is s.p.d. and the solution is given by

\[ x = (A^\top A)^{-1} A^\top b. \]

The matrix $(A^\top A)^{-1} A^\top$ is known as the pseudoinverse of $A \in \mathbb{R}^{m \times n}$. The standard method to solve (6.1.1) is therefore Cholesky factorization, constructing a factorization

\[ A^\top A = L L^\top. \tag{6.3.1} \]

Algorithm 6.3.1.

1. Compute $A^\top A$ and $A^\top b$.
2. Compute (6.3.1).
3. Solve $Lw = A^\top b$ for $w$.
4. Solve $L^\top x = w$ for $x$.

The complexity is $O(n^2 m)$ flops to compute $A^\top A$ and $O(n^3)$ to compute (6.3.1).
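The four steps of the algorithm above can be sketched for a small line-fitting problem. The data set and the function names below are assumptions for illustration; the matrix $A$ is stored as a list of rows and is assumed to have full column rank.

```python
import math

def cholesky(M):
    """Lower triangular L with M = L L^T for an s.p.d. matrix M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        L[k][k] = math.sqrt(M[k][k] - sum(L[k][j] ** 2 for j in range(k)))
        for i in range(k + 1, n):
            L[i][k] = (M[i][k] - sum(L[i][j] * L[k][j] for j in range(k))) / L[k][k]
    return L

def least_squares(A, b):
    """Minimize ||b - A x||_2 via the normal equations and Cholesky."""
    m, n = len(A), len(A[0])
    # 1. Compute A^T A and A^T b.
    AtA = [[sum(A[r][i] * A[r][j] for r in range(m)) for j in range(n)] for i in range(n)]
    Atb = [sum(A[r][i] * b[r] for r in range(m)) for i in range(n)]
    # 2. Cholesky factorization A^T A = L L^T.
    L = cholesky(AtA)
    # 3. Forward substitution L w = A^T b.
    w = [0.0] * n
    for i in range(n):
        w[i] = (Atb[i] - sum(L[i][j] * w[j] for j in range(i))) / L[i][i]
    # 4. Backward substitution L^T x = w.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (w[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))) / L[i][i]
    return x

t = [0.0, 1.0, 2.0, 3.0]
A = [[1.0, ti] for ti in t]                 # model y = c0 + c1 * t
y_exact = [1.0, 3.0, 5.0, 7.0]              # lies exactly on y = 1 + 2 t
y_noisy = [1.1, 2.9, 5.2, 6.8]              # perturbed data (made up)

coeffs = least_squares(A, y_exact)
coeffs_noisy = least_squares(A, y_noisy)
residual = [yi - (coeffs_noisy[0] + coeffs_noisy[1] * ti) for ti, yi in zip(t, y_noisy)]
```

For the perturbed data the residual is not zero, but it is orthogonal to the columns of $A$, which is exactly the statement of the normal equations.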


07.01.14


References

[D] Deuflhard, Hohmann: Numerical Analysis in Modern Scientific Computing, 2nd Ed., Springer, 2003.

[DB] Dahlquist, Björck: Numerical Methods in Scientific Computing, SIAM, 2008.

[SB] Stoer, Bulirsch: Introduction to Numerical Analysis, 2nd Ed., Springer, 1992.

[S] Stewart: Afternotes on Numerical Analysis, SIAM, 1996.

[TB] Trefethen, Bau: Numerical Linear Algebra, SIAM, 1997.
