
Introduction to Numerical Analysis

Hector D. Ceniceros

© Draft date October 22, 2019


Contents

Preface

1 Introduction
  1.1 What is Numerical Analysis?
  1.2 An Illustrative Example
    1.2.1 An Approximation Principle
    1.2.2 Divide and Conquer
    1.2.3 Convergence and Rate of Convergence
    1.2.4 Error Correction
    1.2.5 Richardson Extrapolation
  1.3 Super-algebraic Convergence

2 Function Approximation
  2.1 Norms
  2.2 Uniform Polynomial Approximation
    2.2.1 Bernstein Polynomials and Bezier Curves
    2.2.2 Weierstrass Approximation Theorem
  2.3 Best Approximation
    2.3.1 Best Uniform Polynomial Approximation
  2.4 Chebyshev Polynomials

3 Interpolation
  3.1 Polynomial Interpolation
    3.1.1 Equispaced and Chebyshev Nodes
  3.2 Connection to Best Uniform Approximation
  3.3 Barycentric Formula
    3.3.1 Barycentric Weights for Chebyshev Nodes
    3.3.2 Barycentric Weights for Equispaced Nodes
    3.3.3 Barycentric Weights for General Sets of Nodes
  3.4 Newton's Form and Divided Differences
  3.5 Cauchy's Remainder
  3.6 Hermite Interpolation
  3.7 Convergence of Polynomial Interpolation
  3.8 Piece-wise Linear Interpolation
  3.9 Cubic Splines
    3.9.1 Solving the Tridiagonal System
    3.9.2 Complete Splines
    3.9.3 Parametric Curves

4 Trigonometric Approximation
  4.1 Approximating a Periodic Function
  4.2 Interpolating Fourier Polynomial
  4.3 The Fast Fourier Transform

5 Least Squares Approximation
  5.1 Continuous Least Squares Approximation
  5.2 Linear Independence and Gram-Schmidt Orthogonalization
  5.3 Orthogonal Polynomials
    5.3.1 Chebyshev Polynomials
  5.4 Discrete Least Squares Approximation
  5.5 High-dimensional Data Fitting

6 Computer Arithmetic
  6.1 Floating Point Numbers
  6.2 Rounding and Machine Precision
  6.3 Correctly Rounded Arithmetic
  6.4 Propagation of Errors and Cancellation of Digits

7 Numerical Differentiation
  7.1 Finite Differences
  7.2 The Effect of Round-off Errors
  7.3 Richardson's Extrapolation

8 Numerical Integration
  8.1 Elementary Simpson Quadrature
  8.2 Interpolatory Quadratures
  8.3 Gaussian Quadratures
    8.3.1 Convergence of Gaussian Quadratures
  8.4 Computing the Gaussian Nodes and Weights
  8.5 Clenshaw-Curtis Quadrature
  8.6 Composite Quadratures
  8.7 Modified Trapezoidal Rule
  8.8 The Euler-Maclaurin Formula
  8.9 Romberg Integration

9 Linear Algebra
  9.1 The Three Main Problems
  9.2 Notation
  9.3 Some Important Types of Matrices
  9.4 Schur Theorem
  9.5 Norms
  9.6 Condition Number of a Matrix
    9.6.1 What to Do When A is Ill-conditioned?

10 Linear Systems of Equations I
  10.1 Easy to Solve Systems
  10.2 Gaussian Elimination
    10.2.1 The Cost of Gaussian Elimination
  10.3 LU and Choleski Factorizations
  10.4 Tridiagonal Linear Systems
  10.5 A 1D BVP: Deformation of an Elastic Beam
  10.6 A 2D BVP: Dirichlet Problem for the Poisson's Equation
  10.7 Linear Iterative Methods for Ax = b
  10.8 Jacobi, Gauss-Seidel, and S.O.R.
  10.9 Convergence of Linear Iterative Methods

11 Linear Systems of Equations II
  11.1 Positive Definite Linear Systems as an Optimization Problem
  11.2 Line Search Methods
    11.2.1 Steepest Descent
  11.3 The Conjugate Gradient Method
    11.3.1 Generating the Conjugate Search Directions
  11.4 Krylov Subspaces
  11.5 Convergence of the Conjugate Gradient Method

12 Eigenvalue Problems
  12.1 The Power Method
  12.2 Methods Based on Similarity Transformations
    12.2.1 The QR method

13 Non-Linear Equations
  13.1 Introduction
  13.2 Bisection
    13.2.1 Convergence of the Bisection Method
  13.3 Rate of Convergence
  13.4 Interpolation-Based Methods
  13.5 Newton's Method
  13.6 The Secant Method
  13.7 Fixed Point Iteration
  13.8 Systems of Nonlinear Equations
    13.8.1 Newton's Method

14 Numerical Methods for ODEs
  14.1 Introduction
  14.2 A First Look at Numerical Methods
  14.3 One-Step and Multistep Methods
  14.4 Local and Global Error
  14.5 Order of a Method and Consistency
  14.6 Convergence
  14.7 Runge-Kutta Methods
  14.8 Adaptive Stepping
  14.9 Embedded Methods
  14.10 Multistep Methods
    14.10.1 Adams Methods
    14.10.2 Zero-Stability and Dahlquist Theorem
  14.11 A-Stability
  14.12 Stiff ODEs


List of Figures

2.1 The Bernstein weights b_{k,n}(x) for x = 0.25 (○) and x = 0.75 (•), n = 50 and k = 1, ..., n.
2.2 Quadratic Bezier curve.
2.3 If the error function e_n does not equioscillate at least twice we could lower ‖e_n‖_∞ by an amount c > 0.
4.1 S_8(x) for f(x) = sin x e^{cos x} on [0, 2π].
5.1 The function f(x) = e^x on [0, 1] and its Least Squares Approximation p_1(x) = 4e − 10 + (18 − 6e)x.
5.2 Geometric interpretation of the solution Xa of the Least Squares problem as the orthogonal projection of f on the approximating linear subspace W.


List of Tables

1.1 Composite Trapezoidal Rule for f(x) = e^x in [0, 1].
1.2 Composite Trapezoidal Rule for f(x) = 1/(2 + sin x) in [0, 2π].
14.1 Butcher tableau for a general RK method.
14.2 Improved Euler.
14.3 Midpoint RK.
14.4 Classical fourth order RK.
14.5 Backward Euler.
14.6 Implicit mid-point rule RK.
14.7 Hammer and Hollingworth DIRK.
14.8 Two-stage order 3 SDIRK (γ = (3 ± √3)/6).

Preface

These notes were prepared by the author for use in the upper division undergraduate course of Numerical Analysis (Math 104 ABC) at the University of California at Santa Barbara. They were written with the intent to emphasize the foundations of Numerical Analysis rather than to present a long list of numerical methods for different mathematical problems.

We begin with an introduction to Approximation Theory and then use the different ideas of function approximation in the derivation and analysis of many numerical methods.

These notes are intended for undergraduate students with a strong mathematics background. The prerequisites are Advanced Calculus, Linear Algebra, and introductory courses in Analysis, Differential Equations, and Complex Variables. The ability to write computer code to implement the numerical methods is also a necessary and essential part of learning Numerical Analysis.

These notes are not in finalized form and may contain errors, misprints, and other inaccuracies. They cannot be used or distributed without written consent from the author.


Chapter 1

Introduction

1.1 What is Numerical Analysis?

This is an introductory course in Numerical Analysis, which comprises the design, analysis, and implementation of constructive methods and algorithms for the solution of mathematical problems.

Numerical Analysis has vast applications both in Mathematics and in modern Science and Technology. In the areas of the Physical and Life Sciences, Numerical Analysis plays the role of a virtual laboratory by providing accurate solutions to the mathematical models representing a given physical or biological system, in which the system's parameters can be varied at will, in a controlled way. The applications of Numerical Analysis also extend to more modern areas such as data analysis, web search engines, social networks, and basically anything where computation is involved.

1.2 An Illustrative Example: Approximating a Definite Integral

The main principles and objectives of Numerical Analysis are better illustrated with concrete examples and this is the purpose of this chapter.

Consider the problem of calculating a definite integral

I[f] = ∫_a^b f(x) dx.    (1.1)


In most cases we cannot find an exact value of I[f] and very often we only know the integrand f at a finite number of points in [a, b]. The problem is then to produce an approximation to I[f] as accurate as we need and at a reasonable computational cost.

1.2.1 An Approximation Principle

One of the central ideas in Numerical Analysis is to approximate a given function or data by simpler functions which we can analytically evaluate, integrate, differentiate, etc. For example, we can approximate the integrand f in [a, b] by the segment of the straight line, a linear polynomial p_1(x), that passes through (a, f(a)) and (b, f(b)). That is,

f(x) ≈ p_1(x) = f(a) + [f(b) − f(a)]/(b − a) · (x − a),    (1.2)

and

∫_a^b f(x) dx ≈ ∫_a^b p_1(x) dx = f(a)(b − a) + (1/2)[f(b) − f(a)](b − a)
             = (1/2)[f(a) + f(b)](b − a).    (1.3)

That is,

∫_a^b f(x) dx ≈ [(b − a)/2] [f(a) + f(b)].    (1.4)

The right hand side is known as the simple Trapezoidal Rule Quadrature. A quadrature is a method to approximate an integral. How accurate is this approximation? Clearly, if f is a linear polynomial or a constant then the Trapezoidal Rule would give us the exact value of the integral, i.e. it would be exact. The underlying question is: how well does a linear polynomial p_1, satisfying

p_1(a) = f(a),    (1.5)
p_1(b) = f(b),    (1.6)

approximate f on the interval [a, b]? We can almost guess the answer. The approximation is exact at x = a and x = b because of (1.5)-(1.6) and it is exact for all polynomials of degree ≤ 1. This suggests that f(x) − p_1(x) = C f''(ξ)(x − a)(x − b), where C is a constant. But where is f'' evaluated? It cannot be at x, for if it were, f would be the solution of a second order ODE, but f is an arbitrary (though sufficiently smooth, C²[a, b]) function, so it has to be at some undetermined point ξ(x) in (a, b). Now, if we take the particular case f(x) = x² on [0, 1] then p_1(x) = x, f(x) − p_1(x) = x(x − 1), and f''(x) = 2, which implies that C would have to be 1/2. So our conjecture is

f(x) − p_1(x) = (1/2) f''(ξ(x))(x − a)(x − b).    (1.7)

There is a beautiful 19th century proof of this result by A. Cauchy. It goes as follows. If x = a or x = b, (1.7) holds trivially. So let us take x in (a, b) and define the following function of a new variable t:

φ(t) = f(t) − p_1(t) − [f(x) − p_1(x)] (t − a)(t − b) / [(x − a)(x − b)].    (1.8)

Then φ, as a function of t, is C²[a, b] and φ(a) = φ(b) = φ(x) = 0. Since φ(a) = φ(x) = 0, by Rolle's theorem there is ξ_1 ∈ (a, x) such that φ'(ξ_1) = 0, and similarly there is ξ_2 ∈ (x, b) such that φ'(ξ_2) = 0. Because φ is C²[a, b] we can apply Rolle's theorem one more time, observing that φ'(ξ_1) = φ'(ξ_2) = 0, to conclude that there is a point ξ(x) between ξ_1 and ξ_2 such that φ''(ξ(x)) = 0. Consequently,

0 = φ''(ξ(x)) = f''(ξ(x)) − [f(x) − p_1(x)] · 2/[(x − a)(x − b)]    (1.9)

and so

f(x) − p_1(x) = (1/2) f''(ξ(x))(x − a)(x − b),  ξ(x) ∈ (a, b).    (1.10)

We can use (1.10) to find the accuracy of the simple Trapezoidal Rule. Assuming the integrand f is C²[a, b],

∫_a^b f(x) dx = ∫_a^b p_1(x) dx + (1/2) ∫_a^b f''(ξ(x))(x − a)(x − b) dx.    (1.11)

Now, (x − a)(x − b) does not change sign in [a, b] and f'' is continuous, so by the Weighted Mean Value Theorem for Integrals there is η ∈ (a, b) such that

∫_a^b f''(ξ(x))(x − a)(x − b) dx = f''(η) ∫_a^b (x − a)(x − b) dx.    (1.12)

The last integral can be easily evaluated if we shift to the midpoint, i.e., changing variables to x = y + (a + b)/2. Then

∫_a^b (x − a)(x − b) dx = ∫_{−(b−a)/2}^{(b−a)/2} [ y² − ((b − a)/2)² ] dy = −(1/6)(b − a)³.    (1.13)

Collecting (1.11) and (1.13) we get

∫_a^b f(x) dx = [(b − a)/2][f(a) + f(b)] − (1/12) f''(η)(b − a)³,    (1.14)

where η is some point in (a, b). So in the approximation

∫_a^b f(x) dx ≈ [(b − a)/2][f(a) + f(b)]

we make the error

E[f] = −(1/12) f''(η)(b − a)³.    (1.15)
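To make this concrete, here is a minimal Python sketch (the function name is ours) of the simple Trapezoidal Rule (1.4), checked against the error formula (1.15) for f(x) = x² on [0, 1], where f'' = 2 and hence E[f] = −1/6:

```python
def trapezoid_simple(f, a, b):
    """Simple Trapezoidal Rule (1.4): (b - a)/2 * (f(a) + f(b))."""
    return 0.5 * (b - a) * (f(a) + f(b))

# f(x) = x^2 on [0, 1]: exact integral 1/3, approximation 1/2, so the
# error is 1/3 - 1/2 = -1/6 = -(1/12)*2*(1)^3, in agreement with (1.15).
approx = trapezoid_simple(lambda x: x**2, 0.0, 1.0)
print(1.0/3.0 - approx)   # -0.1666...
```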

1.2.2 Divide and Conquer

The error (1.15) of the simple Trapezoidal Rule grows cubically with the length of the interval of integration, so it is natural to divide [a, b] into smaller subintervals, apply the Trapezoidal Rule on each of them, and sum up the result.

Let us divide [a, b] in N subintervals of equal length h = (b − a)/N, determined by the points x_0 = a, x_1 = x_0 + h, x_2 = x_0 + 2h, ..., x_N = x_0 + Nh = b. Then

∫_a^b f(x) dx = ∫_{x_0}^{x_1} f(x) dx + ∫_{x_1}^{x_2} f(x) dx + ... + ∫_{x_{N−1}}^{x_N} f(x) dx
             = Σ_{j=0}^{N−1} ∫_{x_j}^{x_{j+1}} f(x) dx.    (1.16)


But we know

∫_{x_j}^{x_{j+1}} f(x) dx = (1/2)[f(x_j) + f(x_{j+1})] h − (1/12) f''(ξ_j) h³    (1.17)

for some ξ_j ∈ (x_j, x_{j+1}). Therefore, we get

∫_a^b f(x) dx = h [ (1/2) f(x_0) + f(x_1) + ... + f(x_{N−1}) + (1/2) f(x_N) ] − (1/12) h³ Σ_{j=0}^{N−1} f''(ξ_j).

The first term on the right hand side is called the Composite Trapezoidal Rule Quadrature (CTR):

T_h[f] := h [ (1/2) f(x_0) + f(x_1) + ... + f(x_{N−1}) + (1/2) f(x_N) ].    (1.18)
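In code, (1.18) is a direct sum over the grid values. A minimal Python sketch (the function name is ours; it is reused by the later snippets in this chapter):

```python
import numpy as np

def ctr(f, a, b, N):
    """Composite Trapezoidal Rule T_h[f] of Eq. (1.18), with h = (b - a)/N."""
    x = np.linspace(a, b, N + 1)          # x_0 = a, ..., x_N = b
    fx = f(x)
    h = (b - a) / N
    return h * (0.5 * fx[0] + fx[1:-1].sum() + 0.5 * fx[-1])
```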

The error term is

E_h[f] = −(1/12) h³ Σ_{j=0}^{N−1} f''(ξ_j) = −(1/12)(b − a) h² [ (1/N) Σ_{j=0}^{N−1} f''(ξ_j) ],    (1.19)

where we have used that h = (b − a)/N. The term in brackets is a mean value of f'' (it is easy to prove that it lies between the maximum and the minimum of f''). Since f'' is assumed continuous (f ∈ C²[a, b]), by the Intermediate Value Theorem there is a point ξ ∈ (a, b) such that f''(ξ) is equal to the quantity in the brackets, so we obtain

E_h[f] = −(1/12)(b − a) h² f''(ξ),    (1.20)

for some ξ ∈ (a, b).

1.2.3 Convergence and Rate of Convergence

We do not know what the point ξ is in (1.20). If we knew, the error could be evaluated and we would know the integral exactly, at least in principle, because

I[f] = T_h[f] + E_h[f].    (1.21)


But (1.20) gives us two important properties of the approximation method in question. First, (1.20) tells us that E_h[f] → 0 as h → 0.¹ That is, the quadrature rule T_h[f] converges to the exact value of the integral as h → 0. Recall h = (b − a)/N, so as we increase N our approximation to the integral gets better and better. Second, (1.20) tells us how fast the approximation converges, namely quadratically in h. This is the approximation's rate of convergence. If we double N (or equivalently halve h) then the error decreases by a factor of 4. We also say that the error is order h² and write E_h[f] = O(h²). The Big 'O' notation is used frequently in Numerical Analysis.

Definition 1.1. We say that g(h) is order h^α, and write g(h) = O(h^α), if there are a constant C and an h_0 > 0 such that |g(h)| ≤ C h^α for 0 ≤ h ≤ h_0, i.e. for sufficiently small h.

Example 1.1. Let's check the Trapezoidal Rule approximation for an integral we can compute exactly. Take f(x) = e^x in [0, 1]. The exact value of the integral is e − 1. Observe how the error |I[e^x] − T_{1/N}[e^x]| decreases by a factor of (approximately) 1/4 as N is doubled, in accordance with (1.20).

Table 1.1: Composite Trapezoidal Rule for f(x) = e^x in [0, 1].

  N     T_{1/N}[e^x]         |I[e^x] − T_{1/N}[e^x]|      Decrease factor
  16    1.718841128579994    5.593001209489579 × 10^{-4}
  32    1.718421660316327    1.398318572816137 × 10^{-4}  0.250012206406039
  64    1.718316786850094    3.495839104861176 × 10^{-5}  0.250003051723810
  128   1.718290568083478    8.739624432374526 × 10^{-6}  0.250000762913303
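The table can be reproduced with a few lines of Python (a sketch, assuming the ctr() helper defined above):

```python
import numpy as np

exact = np.e - 1.0
prev_err = None
for N in (16, 32, 64, 128):
    err = abs(exact - ctr(np.exp, 0.0, 1.0, N))
    print(N, err, "" if prev_err is None else err / prev_err)
    prev_err = err   # each error is ~1/4 of the previous one
```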

1.2.4 Error Correction

We can get an upper bound for the error using (1.20) and the fact that f'' is bounded in [a, b], i.e. |f''(x)| ≤ M_2 for all x ∈ [a, b] for some constant M_2. Then

|E_h[f]| ≤ (1/12)(b − a) h² M_2.    (1.22)

¹Neglecting round-off errors introduced by finite precision number representation and computer arithmetic.


However, this bound does not in general provide an accurate estimate of the error. It could grossly overestimate it. This can be seen from (1.19). As N → ∞ the term in brackets converges to a mean value of f'', i.e.

(1/N) Σ_{j=0}^{N−1} f''(ξ_j) → (1/(b − a)) ∫_a^b f''(x) dx = (1/(b − a)) [f'(b) − f'(a)],    (1.23)

as N → ∞, which could be significantly smaller than the maximum of |f''|. Take for example f(x) = e^{100x} on [0, 1]. Then max |f''| = 10000 e^{100}, whereas the mean value (1.23) is equal to 100(e^{100} − 1), so the error bound (1.22) overestimates the actual error by two orders of magnitude. Thus, (1.22) is of little practical use.

Equations (1.19) and (1.23) suggest that asymptotically, that is for sufficiently small h,

E_h[f] = C_2 h² + R(h),    (1.24)

where

C_2 = −(1/12)[f'(b) − f'(a)]    (1.25)

and R(h) goes to zero faster than h² as h → 0, i.e.

lim_{h→0} R(h)/h² = 0.    (1.26)

We say that R(h) = o(h²) (little 'o' of h²).

Definition 1.2. A function g(h) is little 'o' of h^α if

lim_{h→0} g(h)/h^α = 0

and we write g(h) = o(h^α).

We then have

I[f] = T_h[f] + C_2 h² + R(h)    (1.27)

and, for sufficiently small h, C_2 h² is an approximation of the error. If it is possible and computationally efficient to evaluate the first derivative of f at the end points of the interval, then we can compute C_2 h² directly and use this leading order approximation of the error to obtain the improved approximation

T̃_h[f] = T_h[f] − (1/12)[f'(b) − f'(a)] h².    (1.28)

This is called the (composite) Modified Trapezoidal Rule. It then follows from (1.27) that the error of this "corrected approximation" is R(h), which goes to zero faster than h². In fact, we will prove later that the error of the Modified Trapezoidal Rule is O(h⁴).
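As a quick sketch (assuming the ctr() helper from above; the names are ours), the correction (1.28) is one extra line:

```python
def modified_trapezoid(f, fprime, a, b, N):
    """Composite Modified Trapezoidal Rule (1.28): CTR plus the C_2 h^2 correction."""
    h = (b - a) / N
    return ctr(f, a, b, N) - (h**2 / 12.0) * (fprime(b) - fprime(a))

# For f = exp on [0, 1], pass fprime = exp; the error drops from O(h^2) to O(h^4).
```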

Often, we only have access to values of f and/or it is difficult to evaluate f'(a) and f'(b). Fortunately, we can compute a sufficiently good approximation of the leading order term of the error, C_2 h², so that we can use the same error correction idea that we did for the Modified Trapezoidal Rule. Roughly speaking, the error can be estimated by comparing two approximations obtained with different h.

Consider (1.27). If we halve h we get

I[f] = T_{h/2}[f] + (1/4) C_2 h² + R(h/2).    (1.29)

Subtracting (1.29) from (1.27) we get

C_2 h² = (4/3)(T_{h/2}[f] − T_h[f]) + (4/3)(R(h/2) − R(h)).    (1.30)

The last term on the right hand side is o(h²). Hence, for h sufficiently small, we have

C_2 h² ≈ (4/3)(T_{h/2}[f] − T_h[f])    (1.31)

and this could provide a good, computable estimate for the error, i.e.

E_h[f] ≈ (4/3)(T_{h/2}[f] − T_h[f]).    (1.32)

The key here is that h has to be sufficiently small to make the asymptotic approximation (1.31) valid. We can check this by working backwards. If h is sufficiently small, then evaluating (1.31) at h/2 we get

C_2 (h/2)² ≈ (4/3)(T_{h/4}[f] − T_{h/2}[f])    (1.33)


and consequently the ratio

q(h) = (T_{h/2}[f] − T_h[f]) / (T_{h/4}[f] − T_{h/2}[f])    (1.34)

should be approximately 4. Thus, q(h) offers a reliable, computable indicator of whether or not h is sufficiently small for (1.32) to be an accurate estimate of the error.
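A short sketch of this indicator (assuming ctr() from above; the helper name is ours):

```python
def q_indicator(f, a, b, N):
    """Ratio q(h) of Eq. (1.34); a value near 4 signals h is small enough."""
    Th  = ctr(f, a, b, N)
    Th2 = ctr(f, a, b, 2 * N)
    Th4 = ctr(f, a, b, 4 * N)
    return (Th2 - Th) / (Th4 - Th2)
```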

We can now use (1.31) and the idea of error correction to improve the accuracy of T_h[f] with the following approximation²:

S_h[f] := T_h[f] + (4/3)(T_{h/2}[f] − T_h[f]).    (1.35)

1.2.5 Richardson Extrapolation

We can view the error correction procedure as a way to eliminate the leading order (in h) contribution to the error. Multiplying (1.29) by 4 and subtracting (1.27) from the result we get

I[f] = (4 T_{h/2}[f] − T_h[f])/3 + (4 R(h/2) − R(h))/3.    (1.36)

Note that S_h[f] is exactly the first term on the right hand side of (1.36) and that the last term converges to zero faster than h². This very useful and general procedure, in which the leading order component of the asymptotic form of the error is eliminated by a combination of two computations performed with two different values of h, is called Richardson's Extrapolation.

Example 1.2. Consider again f(x) = e^x in [0, 1]. With h = 1/16 we get

q(1/16) = (T_{1/32}[e^x] − T_{1/16}[e^x]) / (T_{1/64}[e^x] − T_{1/32}[e^x]) ≈ 3.9998    (1.37)

and the improved approximation is

S_{1/16}[e^x] = T_{1/16}[e^x] + (4/3)(T_{1/32}[e^x] − T_{1/16}[e^x]) = 1.718281837561771,    (1.38)

which gives us nearly 8 digits of accuracy (error ≈ 9.1 × 10⁻⁹). S_{1/32} gives us an error ≈ 5.7 × 10⁻¹⁰. It decreased by approximately a factor of 1/16. This would correspond to a fourth order rate of convergence. We will see in Chapter 8 that indeed this is the case.

²The symbol := means equal by definition.
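Example 1.2 can be reproduced with the following sketch (assuming ctr() from above; the function name is ours):

```python
import numpy as np

def richardson_S(f, a, b, N):
    """S_h[f] of Eq. (1.35): Richardson extrapolation of the CTR."""
    Th, Th2 = ctr(f, a, b, N), ctr(f, a, b, 2 * N)
    return Th + (4.0 / 3.0) * (Th2 - Th)

print(richardson_S(np.exp, 0.0, 1.0, 16))   # 1.718281837561771, as in (1.38)
```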


It appears that S_h[f] gives us superior accuracy to that of T_h[f] but at roughly twice the computational cost. If we group together the common terms in T_h[f] and T_{h/2}[f] we can compute S_h[f] at about the same computational cost as that of T_{h/2}[f]:

4 T_{h/2}[f] − T_h[f] = 4 (h/2) [ (1/2) f(a) + Σ_{j=1}^{2N−1} f(a + j h/2) + (1/2) f(b) ]
                     − h [ (1/2) f(a) + Σ_{j=1}^{N−1} f(a + j h) + (1/2) f(b) ]
                   = (h/2) [ f(a) + f(b) + 2 Σ_{k=1}^{N−1} f(a + k h) + 4 Σ_{k=1}^{N} f(a + (k − 1/2) h) ].

Therefore

S_h[f] = (h/6) [ f(a) + 2 Σ_{k=1}^{N−1} f(a + k h) + 4 Σ_{k=1}^{N} f(a + (k − 1/2) h) + f(b) ].    (1.39)

The resulting quadrature formula S_h[f] is known as the Composite Simpson's Rule and, as we will see in Chapter 8, can be derived by approximating the integrand by quadratic polynomials. Thus, based on cost and accuracy, the Composite Simpson's Rule would be preferable to the Composite Trapezoidal Rule, with one important exception: periodic smooth integrands integrated over their period.
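A direct transcription of (1.39) (a sketch; the name is ours). Up to rounding, it returns exactly richardson_S(f, a, b, N) from above:

```python
import numpy as np

def simpson(f, a, b, N):
    """Composite Simpson's Rule, Eq. (1.39), with h = (b - a)/N."""
    h = (b - a) / N
    nodes = a + h * np.arange(1, N)              # interior nodes a + k h
    mids  = a + h * (np.arange(1, N + 1) - 0.5)  # midpoints a + (k - 1/2) h
    return (h / 6.0) * (f(a) + 2.0 * f(nodes).sum()
                        + 4.0 * f(mids).sum() + f(b))
```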

Example 1.3. Consider the integral

I[1/(2 + sin x)] = ∫_0^{2π} dx/(2 + sin x).    (1.40)

Using Complex Variables techniques (Residues) the exact integral can be computed: I[1/(2 + sin x)] = 2π/√3. Note that the integrand is smooth (it has an infinite number of continuous derivatives) and periodic in [0, 2π]. If we use the Composite Trapezoidal Rule to find approximations to this integral we obtain the results shown in Table 1.2.

The approximations converge amazingly fast. With N = 32, we have already reached machine precision (with double precision we get about 16 digits).


Table 1.2: Composite Trapezoidal Rule for f(x) = 1/(2 + sin x) in [0, 2π].

  N    T_{2π/N}[1/(2 + sin x)]    |I[1/(2 + sin x)] − T_{2π/N}[1/(2 + sin x)]|
  8    3.627791516645356          1.927881769203665 × 10^{-4}
  16   3.627598733591013          5.122577029226250 × 10^{-9}
  32   3.627598728468435          4.440892098500626 × 10^{-16}
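Table 1.2 can be reproduced with the sketch below (assuming ctr() from above); the errors collapse far faster than the O(h²) rate seen in Table 1.1:

```python
import numpy as np

exact = 2.0 * np.pi / np.sqrt(3.0)
f = lambda x: 1.0 / (2.0 + np.sin(x))
for N in (8, 16, 32):
    print(N, abs(exact - ctr(f, 0.0, 2.0 * np.pi, N)))
```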

1.3 Super-Algebraic Convergence of the CTR for Smooth Periodic Integrands

Integrals of periodic integrands appear in many applications, most notably in Fourier Analysis.

Consider the definite integral

I[f] = ∫_0^{2π} f(x) dx,

where the integrand f is periodic in [0, 2π] and has m > 1 continuous derivatives, i.e. f ∈ C^m[0, 2π] and f(x + 2π) = f(x) for all x. Due to periodicity we can work in any interval of length 2π, and if the function has a different period, with a simple change of variables, we can reduce the problem to one in [0, 2π].

Consider the equally spaced points in [0, 2π], x_j = jh for j = 0, 1, ..., N, with h = 2π/N. Because f is periodic, f(x_0 = 0) = f(x_N = 2π) and the CTR becomes

T_h[f] = h [ f(x_0)/2 + f(x_1) + ... + f(x_{N−1}) + f(x_N)/2 ] = h Σ_{j=0}^{N−1} f(x_j).    (1.41)

Since f is smooth and periodic in [0, 2π], it has a uniformly convergent Fourier Series:

f(x) = a_0/2 + Σ_{k=1}^{∞} (a_k cos kx + b_k sin kx),    (1.42)

where

a_k = (1/π) ∫_0^{2π} f(x) cos kx dx,  k = 0, 1, ...,    (1.43)
b_k = (1/π) ∫_0^{2π} f(x) sin kx dx,  k = 1, 2, ...    (1.44)


Using the Euler formula³

e^{ix} = cos x + i sin x,    (1.45)

we can write

cos x = (e^{ix} + e^{−ix})/2,    (1.46)
sin x = (e^{ix} − e^{−ix})/(2i),    (1.47)

and the Fourier series can be conveniently expressed in complex form in terms of the functions e^{ikx} for k = 0, ±1, ±2, ..., so that (1.42) becomes

f(x) = Σ_{k=−∞}^{∞} c_k e^{ikx},    (1.48)

where

c_k = (1/2π) ∫_0^{2π} f(x) e^{−ikx} dx.    (1.49)

We are assuming that f is real-valued, so the complex Fourier coefficients satisfy c̄_k = c_{−k}, where c̄_k is the complex conjugate of c_k. We have the relations 2c_0 = a_0 and 2c_k = a_k − i b_k for k = ±1, ±2, ... between the complex and real Fourier coefficients.

Using (1.48) in (1.41) we get

T_h[f] = h Σ_{j=0}^{N−1} ( Σ_{k=−∞}^{∞} c_k e^{ikx_j} ).    (1.50)

Justified by the uniform convergence of the series, we can exchange the finite and the infinite sums to get

T_h[f] = (2π/N) Σ_{k=−∞}^{∞} c_k Σ_{j=0}^{N−1} e^{ik(2π/N)j}.    (1.51)

³i² = −1 and if c = a + ib, with a, b ∈ R, then its complex conjugate is c̄ = a − ib.


But

Σ_{j=0}^{N−1} e^{ik(2π/N)j} = Σ_{j=0}^{N−1} ( e^{ik2π/N} )^j.    (1.52)

Note that e^{ik2π/N} = 1 precisely when k is an integer multiple of N, i.e. k = lN, l ∈ Z, and if so

Σ_{j=0}^{N−1} ( e^{ik2π/N} )^j = N,  for k = lN.    (1.53)

Otherwise, if k ≠ lN, then

Σ_{j=0}^{N−1} ( e^{ik2π/N} )^j = [ 1 − ( e^{ik2π/N} )^N ] / [ 1 − e^{ik2π/N} ] = 0,  for k ≠ lN.    (1.54)

Using (1.53) and (1.54) we thus get that

T_h[f] = 2π Σ_{l=−∞}^{∞} c_{lN}.    (1.55)
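A quick numerical check of the geometric sums (1.53)-(1.54) behind this aliasing identity (a minimal sketch; the chosen N and k values are ours):

```python
import numpy as np

N, j = 8, np.arange(8)
for k in (3, 8, 11, 16):
    s = np.exp(1j * 2 * np.pi * k * j / N).sum()
    print(k, np.round(s, 12))   # N for k = 8, 16 (multiples of N); 0 otherwise
```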

On the other hand,

c_0 = (1/2π) ∫_0^{2π} f(x) dx = (1/2π) I[f].    (1.56)

Therefore

T_h[f] = I[f] + 2π [ c_N + c_{−N} + c_{2N} + c_{−2N} + ... ],    (1.57)

that is,

|T_h[f] − I[f]| ≤ 2π [ |c_N| + |c_{−N}| + |c_{2N}| + |c_{−2N}| + ... ].    (1.58)

So now the relevant question is how fast the Fourier coefficients c_{lN} of f decay with N. The answer is tied to the smoothness of f. Doing integration by parts in the formula (1.49) for the Fourier coefficients of f we have

c_k = (1/2π)(1/(ik)) [ ∫_0^{2π} f'(x) e^{−ikx} dx − f(x) e^{−ikx} |_0^{2π} ],  k ≠ 0,    (1.59)


and the last term vanishes due to the periodicity of f(x) e^{−ikx}. Hence,

c_k = (1/2π)(1/(ik)) ∫_0^{2π} f'(x) e^{−ikx} dx,  k ≠ 0.    (1.60)

Integrating by parts m times we obtain

c_k = (1/2π)(1/(ik))^m ∫_0^{2π} f^{(m)}(x) e^{−ikx} dx,  k ≠ 0,    (1.61)

where f^{(m)} is the m-th derivative of f. Therefore, for f ∈ C^m[0, 2π] and periodic,

|c_k| ≤ A_m / |k|^m,    (1.62)

where A_m is a constant (depending only on m). Using this in (1.58) we get

|T_h[f] − I[f]| ≤ 2π A_m [ 2/N^m + 2/(2N)^m + 2/(3N)^m + ... ]
              = (4π A_m / N^m) [ 1 + 1/2^m + 1/3^m + ... ],    (1.63)

and so for m > 1 we can conclude that

|T_h[f] − I[f]| ≤ C_m / N^m.    (1.64)

Thus, in this particular case, the rate of convergence of the CTR at equally spaced points is not fixed (at 2). It depends on the number of derivatives of f, and we say that the accuracy and convergence of the approximation is spectral. Note that if f is smooth, i.e. f ∈ C^∞[0, 2π] and periodic, the CTR converges to the exact integral at a rate faster than any power of 1/N (or h)! This is called super-algebraic convergence.


Chapter 2

Function Approximation

We saw in the introductory chapter that one key step in the construction of a numerical method to approximate a definite integral is the approximation of the integrand by a simpler function, which we can integrate exactly.

The problem of function approximation is central to many numerical methods: given a continuous function f in an interval [a, b], we would like to find a good approximation to it by simpler functions, such as polynomials, trigonometric polynomials, wavelets, rational functions, etc. We are going to measure the accuracy of an approximation using norms and ask whether or not there is a best approximation out of functions from a given family of simpler functions. These are the main topics of this introductory chapter to Approximation Theory.

2.1 Norms

A norm on a vector space V over a field K = R (or C) is a mapping

‖·‖ : V → [0, ∞)

which satisfies the following properties:

(i) ‖x‖ ≥ 0 for all x ∈ V, and ‖x‖ = 0 iff x = 0.

(ii) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ V.

(iii) ‖λx‖ = |λ| ‖x‖ for all x ∈ V, λ ∈ K.


If we relax (i) to just ‖x‖ ≥ 0, we obtain a semi-norm. We recall first some of the most important examples of norms in the finite dimensional case V = R^n (or V = C^n):

‖x‖_1 = |x_1| + ... + |x_n|,    (2.1)
‖x‖_2 = √(|x_1|² + ... + |x_n|²),    (2.2)
‖x‖_∞ = max{|x_1|, ..., |x_n|}.    (2.3)

These are all special cases of the l_p norm:

‖x‖_p = (|x_1|^p + ... + |x_n|^p)^{1/p},  1 ≤ p ≤ ∞.    (2.4)

If we have weights w_i > 0 for i = 1, ..., n we can also define a weighted p norm by

‖x‖_{w,p} = (w_1|x_1|^p + ... + w_n|x_n|^p)^{1/p},  1 ≤ p ≤ ∞.    (2.5)
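These formulas translate directly into a few lines of numpy (a minimal sketch; the example vector and weights are ours):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
w = np.array([1.0, 0.5, 2.0])                 # positive weights, as in (2.5)
p = 3
norm_1   = np.abs(x).sum()                    # (2.1)
norm_2   = np.sqrt((np.abs(x)**2).sum())      # (2.2)
norm_inf = np.abs(x).max()                    # (2.3)
norm_p   = (np.abs(x)**p).sum()**(1.0/p)      # (2.4), for 1 <= p < inf
norm_wp  = (w * np.abs(x)**p).sum()**(1.0/p)  # (2.5)
```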

All norms in a finite dimensional space V are equivalent, in the sense that there are two constants c and C such that

‖x‖_α ≤ C ‖x‖_β,    (2.6)
‖x‖_β ≤ c ‖x‖_α,    (2.7)

for all x ∈ V and for any two norms ‖·‖_α and ‖·‖_β defined in V.

If V is a space of functions defined on an interval [a, b], for example C[a, b],

the corresponding norms to (2.1)-(2.4) are given by

‖u‖_1 = ∫_a^b |u(x)| dx,    (2.8)
‖u‖_2 = ( ∫_a^b |u(x)|² dx )^{1/2},    (2.9)
‖u‖_∞ = sup_{x∈[a,b]} |u(x)|,    (2.10)
‖u‖_p = ( ∫_a^b |u(x)|^p dx )^{1/p},  1 ≤ p ≤ ∞,    (2.11)

and are called the L¹, L², L^∞, and L^p norms, respectively. Similarly to (2.5) we can define a weighted L^p norm by

‖u‖_{w,p} = ( ∫_a^b w(x)|u(x)|^p dx )^{1/p},  1 ≤ p ≤ ∞,    (2.12)


where w is a given positive weight function defined in [a, b]. If we only require w(x) ≥ 0, we get a semi-norm.

Lemma 1. Let ‖·‖ be a norm on a vector space V. Then

| ‖x‖ − ‖y‖ | ≤ ‖x − y‖.    (2.13)

This lemma implies that a norm is a continuous function (from V to R).

Proof. ‖x‖ = ‖x − y + y‖ ≤ ‖x − y‖ + ‖y‖, which gives

‖x‖ − ‖y‖ ≤ ‖x − y‖.    (2.14)

By reversing the roles of x and y we also get

‖y‖ − ‖x‖ ≤ ‖x − y‖.    (2.15)

Together, (2.14) and (2.15) give (2.13).

2.2 Uniform Polynomial Approximation

There is a fundamental result in approximation theory which states that any continuous function can be approximated uniformly, i.e. using the norm ‖·‖_∞, with arbitrary accuracy by a polynomial. This is the celebrated Weierstrass Approximation Theorem. We are going to present a constructive proof due to Sergei Bernstein, which uses a class of polynomials that have found widespread applications in computer graphics and animation. Historically, the use of these so-called Bernstein polynomials in computer assisted design (CAD) was introduced by two engineers working in the French car industry: Pierre Bezier at Renault and Paul de Casteljau at Citroen.

2.2.1 Bernstein Polynomials and Bezier Curves

Given a function f on [0, 1], the Bernstein polynomial of degree n ≥ 1 is defined by

B_n f(x) = Σ_{k=0}^{n} f(k/n) \binom{n}{k} x^k (1 − x)^{n−k},    (2.16)


where

\binom{n}{k} = n! / [(n − k)! k!],  k = 0, ..., n,    (2.17)

are the binomial coefficients. Note that B_n f(0) = f(0) and B_n f(1) = f(1) for all n. The terms

b_{k,n}(x) = \binom{n}{k} x^k (1 − x)^{n−k},  k = 0, ..., n,    (2.18)

which are all nonnegative, are called the Bernstein basis polynomials and can be viewed as x-dependent weights that sum up to one:

Σ_{k=0}^{n} b_{k,n}(x) = Σ_{k=0}^{n} \binom{n}{k} x^k (1 − x)^{n−k} = [x + (1 − x)]^n = 1.    (2.19)

Thus, for each x ∈ [0, 1], B_n f(x) represents a weighted average of the values of f at 0, 1/n, 2/n, ..., 1. Moreover, as n increases the weights b_{k,n}(x) concentrate more and more around the points k/n close to x, as Fig. 2.1 indicates for b_{k,n}(0.25) and b_{k,n}(0.75).
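Definition (2.16) is straightforward to evaluate numerically. A minimal Python sketch (the function name is ours):

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    """Evaluate the Bernstein polynomial B_n f(x) of Eq. (2.16) on [0, 1]."""
    k = np.arange(n + 1)
    weights = np.array([comb(n, j) for j in k]) * x**k * (1 - x)**(n - k)
    return (f(k / n) * weights).sum()

print(bernstein(np.exp, 50, 0.3), np.exp(0.3))  # convergence is slow but visible
```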

For n = 1, the Bernstein polynomial is just the straight line connecting f(0) and f(1): B_1 f(x) = (1 − x) f(0) + x f(1). Given two points P_0 = (x_0, y_0) and P_1 = (x_1, y_1), the segment of the straight line connecting them can be written in parametric form as

B_1(t) = (1 − t) P_0 + t P_1,  t ∈ [0, 1].    (2.20)

With three points, P_0, P_1, P_2, we can employ the quadratic Bernstein basis polynomials to get a more useful parametric curve

B_2(t) = (1 − t)² P_0 + 2t(1 − t) P_1 + t² P_2,  t ∈ [0, 1].    (2.21)

This curve again connects P_0 and P_2, but P_1 can be used to control how the curve bends. More precisely, the tangents at the end points are B_2'(0) = 2(P_1 − P_0) and B_2'(1) = 2(P_2 − P_1), which intersect at P_1, as Fig. 2.2 illustrates. These parametric curves formed with the Bernstein basis polynomials are called Bezier curves and have been widely employed in computer graphics, especially in the design of vector fonts, and in computer animation.


[Figure 2.1: The Bernstein weights b_{k,n}(x) for x = 0.25 (○) and x = 0.75 (•), n = 50 and k = 1, ..., n.]

[Figure 2.2: Quadratic Bezier curve, with control points P_0, P_1, P_2.]

To allow the representation of complex shapes, quadratic or cubic Bezier curves are pieced together to form composite Bezier curves. To have some degree of smoothness (C¹), the common point for two pieces of a composite Bezier curve has to lie on the line connecting the two adjacent control points on either side. For example, the TrueType font used in most computers today is generated with composite, quadratic Bezier curves, while the Metafont used in these pages, via LaTeX, employs composite, cubic Bezier curves. For each character, many Bezier pieces are stitched together.
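A short sketch of (2.21) (the names and example points are ours):

```python
import numpy as np

def bezier2(P0, P1, P2, t):
    """Quadratic Bezier curve B_2(t) of Eq. (2.21); P's are 2D points, t in [0, 1]."""
    t = np.asarray(t)[:, None]
    return (1 - t)**2 * P0 + 2 * t * (1 - t) * P1 + t**2 * P2

# Control point P1 = (1, 2) pulls the curve upward between (0, 0) and (2, 0).
curve = bezier2(np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([2.0, 0.0]),
                np.linspace(0.0, 1.0, 5))
```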

Let us now do some algebra to prove some useful identities of the Bernstein polynomials. First, for f(x) = x we have

Σ_{k=0}^{n} (k/n) \binom{n}{k} x^k (1 − x)^{n−k} = Σ_{k=1}^{n} [k n! / (n (n − k)! k!)] x^k (1 − x)^{n−k}
  = x Σ_{k=1}^{n} \binom{n−1}{k−1} x^{k−1} (1 − x)^{n−k}
  = x Σ_{k=0}^{n−1} \binom{n−1}{k} x^k (1 − x)^{n−1−k}
  = x [x + (1 − x)]^{n−1} = x.    (2.22)

Now for f(x) = x², we get

Σ_{k=0}^{n} (k/n)² \binom{n}{k} x^k (1 − x)^{n−k} = Σ_{k=1}^{n} (k/n) \binom{n−1}{k−1} x^k (1 − x)^{n−k},    (2.23)

and writing

k/n = (k − 1)/n + 1/n = [(n − 1)/n] · [(k − 1)/(n − 1)] + 1/n,    (2.24)


we have

Σ_{k=0}^{n} (k/n)² \binom{n}{k} x^k (1 − x)^{n−k}
  = [(n − 1)/n] Σ_{k=2}^{n} [(k − 1)/(n − 1)] \binom{n−1}{k−1} x^k (1 − x)^{n−k}
    + (1/n) Σ_{k=1}^{n} \binom{n−1}{k−1} x^k (1 − x)^{n−k}
  = [(n − 1)/n] Σ_{k=2}^{n} \binom{n−2}{k−2} x^k (1 − x)^{n−k} + x/n
  = [(n − 1)/n] x² Σ_{k=0}^{n−2} \binom{n−2}{k} x^k (1 − x)^{n−2−k} + x/n.

Thus,

Σ_{k=0}^{n} (k/n)² \binom{n}{k} x^k (1 − x)^{n−k} = [(n − 1)/n] x² + x/n.    (2.25)

Now, expanding (k/n − x)² and using (2.19), (2.22), and (2.25), it follows that

Σ_{k=0}^{n} (k/n − x)² \binom{n}{k} x^k (1 − x)^{n−k} = (1/n) x(1 − x).    (2.26)
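Identity (2.26) is easy to verify numerically (a sketch; the chosen n and x are ours):

```python
import numpy as np
from math import comb

n, x = 20, 0.3
k = np.arange(n + 1)
b = np.array([comb(n, j) for j in k]) * x**k * (1 - x)**(n - k)
print(((k / n - x)**2 * b).sum(), x * (1 - x) / n)   # both ≈ 0.0105
```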

2.2.2 Weierstrass Approximation Theorem

Theorem 2.1. (Weierstrass Approximation Theorem) Let f be a continuous function in [a, b]. Given ε > 0 there is a polynomial p such that

max_{a≤x≤b} |f(x) − p(x)| < ε.

Proof. We are going to work on the interval [0, 1]. For a general interval [a, b], we consider the simple change of variables x = a + (b − a)t for t ∈ [0, 1], so that F(t) = f(a + (b − a)t) is continuous in [0, 1].

Using (2.19), we have

f(x) − B_n f(x) = Σ_{k=0}^{n} [f(x) − f(k/n)] \binom{n}{k} x^k (1 − x)^{n−k}.    (2.27)


Since f is continuous in [0, 1], it is also uniformly continuous. Thus, given ε > 0 there is δ(ε) > 0, independent of x, such that

|f(x) − f(k/n)| < ε/2  if |x − k/n| < δ.    (2.28)

Moreover,

|f(x) − f(k/n)| ≤ 2‖f‖_∞  for all x ∈ [0, 1], k = 0, 1, ..., n.    (2.29)

We now split the sum in (2.27) in two sums, one over the points such that |k/n − x| < δ and the other over the points such that |k/n − x| ≥ δ:

f(x) − B_n f(x) = Σ_{|k/n−x|<δ} [f(x) − f(k/n)] \binom{n}{k} x^k (1 − x)^{n−k}
                + Σ_{|k/n−x|≥δ} [f(x) − f(k/n)] \binom{n}{k} x^k (1 − x)^{n−k}.    (2.30)

Using (2.28) and (2.19) it follows immediately that the first sum is bounded by ε/2. For the second sum we have

Σ_{|k/n−x|≥δ} |f(x) − f(k/n)| \binom{n}{k} x^k (1 − x)^{n−k}
  ≤ 2‖f‖_∞ Σ_{|k/n−x|≥δ} \binom{n}{k} x^k (1 − x)^{n−k}
  ≤ (2‖f‖_∞/δ²) Σ_{|k/n−x|≥δ} (k/n − x)² \binom{n}{k} x^k (1 − x)^{n−k}
  ≤ (2‖f‖_∞/δ²) Σ_{k=0}^{n} (k/n − x)² \binom{n}{k} x^k (1 − x)^{n−k}
  = (2‖f‖_∞/(nδ²)) x(1 − x) ≤ ‖f‖_∞/(2nδ²).    (2.31)

Therefore, there is N such that for all n ≥ N the second sum in (2.30) is bounded by ε/2, and this completes the proof.


2.3 Best Approximation

We just saw that any continuous function f on a closed interval can be approximated uniformly with arbitrary accuracy by a polynomial. Ideally we would like to find the closest polynomial, say of degree at most n, to the function f when the distance is measured in the supremum (infinity) norm, or in any other norm we choose. There are three important elements in this general problem: the space of functions we want to approximate, the norm, and the family of approximating functions. The following definition makes this more precise.

Definition 2.1. Given a normed linear space V and a subspace W of V, p* ∈ W is called the best approximation of f ∈ V by elements in W if

‖f − p*‖ ≤ ‖f − p‖,  for all p ∈ W.    (2.32)

For example, the normed linear space V could be C[a, b] with the supremum norm (2.10), and W could be the set of all polynomials of degree at most n, which henceforth we will denote by P_n.

Theorem 2.2. Let W be a finite-dimensional subspace of a normed linear space V. Then, for every f ∈ V, there is at least one best approximation to f by elements in W.

Proof. Since W is a subspace, 0 ∈ W, and for any candidate p ∈ W for best approximation to f we must have

‖f − p‖ ≤ ‖f − 0‖ = ‖f‖.    (2.33)

Therefore we can restrict our search to the set

F = {p ∈ W : ‖f − p‖ ≤ ‖f‖}.    (2.34)

F is closed and bounded, and because W is finite-dimensional it follows that F is compact. Now, the function p ↦ ‖f − p‖ is continuous on this compact set and hence it attains its minimum in F.

If we remove the finite-dimensionality of W then we cannot guarantee that there is a best approximation, as the following example shows.


Example 2.1. Let V = C[0, 1/2] and W be the space of all polynomials (clearly a subspace of V). Take f(x) = 1/(1 − x). Then, given ε > 0 there is N such that

max_{x∈[0,1/2]} | 1/(1 − x) − (1 + x + x² + ... + x^N) | < ε.    (2.35)

So if there is a best approximation p* in the max norm, necessarily ‖f − p*‖_∞ = 0, which implies

p*(x) = 1/(1 − x),    (2.36)

which is impossible.

Theorem 2.2 does not guarantee uniqueness of the best approximation. Strict convexity of the norm gives us a sufficient condition for uniqueness.

Definition 2.2. A norm ‖·‖ on a vector space V is strictly convex if for all f ≠ g in V with ‖f‖ = ‖g‖ = 1,

‖θf + (1 − θ)g‖ < 1,  for all 0 < θ < 1.

In other words, a norm is strictly convex if its unit ball is strictly convex.

The p-norm is strictly convex for 1 < p <∞ but not for p = 1 or p =∞.

Theorem 2.3. Let V be a vector space with a strictly convex norm, W a subspace of V, and f ∈ V. If p* and q* are best approximations of f in W, then p* = q*.

Proof. Let M = ‖f − p*‖ = ‖f − q*‖. If p* ≠ q*, by the strict convexity of the norm,

‖θ(f − p*) + (1 − θ)(f − q*)‖ < M,  for all 0 < θ < 1.    (2.37)

Taking θ = 1/2 we get

‖f − (1/2)(p* + q*)‖ < M,    (2.38)

which is impossible because (1/2)(p* + q*) is in W and cannot be a better approximation.


2.3.1 Best Uniform Polynomial Approximation

Given a continuous function f on an interval [a, b], we know there is at least one best approximation p*_n to f, in any given norm, by polynomials of degree at most n, because the dimension of P_n is finite. The norm ‖·‖_∞ is not strictly convex, so Theorem 2.3 does not apply. However, due to a special property (called the Haar property) of the linear space P_n, namely that the only element of P_n that has more than n roots is the zero element, it is possible to prove that the best approximation is unique.

The crux of the matter is that the error function

e_n(x) = f(x) − p*_n(x),  x ∈ [a, b],    (2.39)

has to equioscillate at at least n + 2 points between +‖e_n‖_∞ and −‖e_n‖_∞. That is, there are k points, x_1, x_2, ..., x_k, with k ≥ n + 2, such that

e_n(x_1) = ±‖e_n‖_∞,
e_n(x_2) = −e_n(x_1),
e_n(x_3) = −e_n(x_2),
...
e_n(x_k) = −e_n(x_{k−1}),    (2.40)

for if not, it would be possible to find a polynomial of degree at most n with the same sign at the extrema of e_n (at most n sign changes), and use this polynomial to decrease the value of ‖e_n‖_∞. This would contradict the fact that p*_n is a best approximation. This is easy to see for n = 0, as it is impossible to find a polynomial of degree 0 (a constant) with one change of sign. This is the content of the next result.

Theorem 2.4. The error e_n = f − p*_n has at least two extrema x_1 and x_2 in [a, b] such that |e_n(x_1)| = |e_n(x_2)| = ‖e_n‖_∞ and e_n(x_1) = −e_n(x_2), for all n ≥ 0.

Proof. The continuous function |e_n(x)| attains its maximum ‖e_n‖_∞ at at least one point x_1 in [a, b]. Suppose ‖e_n‖_∞ = e_n(x_1) and that e_n(x) > −‖e_n‖_∞ for all x ∈ [a, b]. Then m = min_{x∈[a,b]} e_n(x) > −‖e_n‖_∞, and we have some room to decrease ‖e_n‖_∞ by shifting e_n down a suitable amount c. In particular, if we take c as one half the gap between the minimum m of e_n and −‖e_n‖_∞,

c = (1/2)(m + ‖e_n‖_∞) > 0,    (2.41)


[Figure 2.3: If the error function e_n does not equioscillate at least twice we could lower ‖e_n‖_∞ by an amount c > 0.]

and subtract it from e_n, as shown in Fig. 2.3, we have

−‖e_n‖_∞ + c ≤ e_n(x) − c ≤ ‖e_n‖_∞ − c.    (2.42)

Therefore, ‖e_n − c‖_∞ = ‖f − (p*_n + c)‖_∞ = ‖e_n‖_∞ − c < ‖e_n‖_∞, but p*_n + c ∈ P_n, so this is impossible since p*_n is a best approximation. A similar argument can be used when e_n(x_1) = −‖e_n‖_∞.

Before proceeding to the general case, let us look at the n = 1 situation. Suppose there are only two alternating extrema x_1 and x_2 for e_1, as described in (2.40). We are going to construct a linear polynomial that has the same sign as e_1 at x_1 and x_2 and which can be used to decrease ‖e_1‖_∞. Suppose e_1(x_1) = ‖e_1‖_∞ and e_1(x_2) = −‖e_1‖_∞. Since e_1 is continuous, we can find small closed intervals I_1 and I_2, containing x_1 and x_2, respectively, and such that

e_1(x) > ‖e_1‖_∞/2  for all x ∈ I_1,    (2.43)
e_1(x) < −‖e_1‖_∞/2  for all x ∈ I_2.    (2.44)

Clearly I_1 and I_2 are disjoint sets, so we can choose a point x_0 between the two intervals. Then, it is possible to find a linear polynomial q that vanishes at x_0 and that is positive in I_1 and negative in I_2. We are now going to find a suitable constant α > 0 such that ‖f − p*_1 − αq‖_∞ < ‖e_1‖_∞. Since p*_1 + αq ∈ P_1, this would be a contradiction to the fact that p*_1 is a best approximation.

Let R = [a, b] \ (I_1 ∪ I_2) and d = max_{x∈R} |e_1(x)|. Clearly d < ‖e_1‖_∞. Choose α such that

0 < α < (1/(2‖q‖_∞)) (‖e_1‖_∞ − d).    (2.45)

On I_1, we have

0 < αq(x) < (1/(2‖q‖_∞)) (‖e_1‖_∞ − d) q(x) ≤ (1/2)(‖e_1‖_∞ − d) < e_1(x).    (2.46)

Therefore

|e_1(x) − αq(x)| = e_1(x) − αq(x) < ‖e_1‖_∞,  for all x ∈ I_1.    (2.47)

Similarly, on I_2, we can show that |e_1(x) − αq(x)| < ‖e_1‖_∞. Finally, on R we have

|e_1(x) − αq(x)| ≤ |e_1(x)| + |αq(x)| ≤ d + (1/2)(‖e_1‖_∞ − d) < ‖e_1‖_∞.    (2.48)

Therefore, ‖e_1 − αq‖_∞ = ‖f − (p*_1 + αq)‖_∞ < ‖e_1‖_∞, which contradicts the best approximation assumption on p*_1.

Theorem 2.5. (Chebyshev Equioscillation Theorem) Let f ∈ C[a, b]. Then p*_n in P_n is a best uniform approximation of f if and only if there are at least n + 2 points in [a, b] where the error e_n = f − p*_n equioscillates between the values ±‖e_n‖_∞, as defined in (2.40).

Proof. We first prove that if the error e_n = f − p*_n, for some p*_n ∈ P_n, equioscillates at least n + 2 times, then p*_n is a best approximation. Suppose the contrary. Then there is q_n ∈ P_n such that

‖f − q_n‖_∞ < ‖f − p*_n‖_∞.    (2.49)

Let x_1, ..., x_k, with k ≥ n + 2, be the points where e_n equioscillates. Then

|f(x_j) − q_n(x_j)| < |f(x_j) − p*_n(x_j)|,  j = 1, ..., k,    (2.50)


and since

f(x_j) − p*_n(x_j) = −[f(x_{j+1}) − p*_n(x_{j+1})],  j = 1, ..., k − 1,    (2.51)

we have that

q_n(x_j) − p*_n(x_j) = f(x_j) − p*_n(x_j) − [f(x_j) − q_n(x_j)]    (2.52)

changes sign k − 1 times, i.e. at least n + 1 times, and hence has at least n + 1 zeros. But q_n − p*_n ∈ P_n. Therefore q_n = p*_n, which contradicts (2.49), and consequently p*_n has to be a best uniform approximation of f.

For the other half of the proof the idea is the same as for n = 1, but we need to do more bookkeeping. We are going to partition [a, b] into the union of sufficiently small subintervals so that we can guarantee that |e_n(t) − e_n(s)| ≤ ‖e_n‖_∞/2 for any two points t and s in each of the subintervals. Let us label by I_1, ..., I_k the subintervals on which |e_n(x)| achieves its maximum ‖e_n‖_∞. Then, on each of these subintervals, either e_n(x) > ‖e_n‖_∞/2 or e_n(x) < −‖e_n‖_∞/2. We need to prove that e_n changes sign at least n + 1 times.

Going from left to right, we can label the subintervals I_1, ..., I_k as a (+) or (−) subinterval depending on the sign of e_n. For definiteness, suppose I_1 is a (+) subinterval; then we have the groups

I_1, ..., I_{k_1},    (+)
I_{k_1+1}, ..., I_{k_2},    (−)
...
I_{k_m+1}, ..., I_k.    (−)^m

We have m changes of sign, so let us assume that m ≤ n. We already know m ≥ 1. Since the sets I_{k_j} and I_{k_j+1} are disjoint for j = 1, ..., m, we can select points t_1, ..., t_m such that t_j > x for all x ∈ I_{k_j} and t_j < x for all x ∈ I_{k_j+1}. Then, the polynomial

q(x) = (t_1 − x)(t_2 − x) ··· (t_m − x)    (2.53)

has the same sign as e_n in each of the extremal intervals I_1, ..., I_k, and q ∈ P_n. The rest of the proof is as in the n = 1 case, to show that p*_n + αq would be a better approximation to f than p*_n.

Theorem 2.6. Let f ∈ C[a, b]. The best uniform approximation p*_n to f by elements of P_n is unique.


Proof. Suppose q*_n is also a best approximation, i.e.

‖e_n‖_∞ = ‖f − p*_n‖_∞ = ‖f − q*_n‖_∞.

Then the midpoint r = (1/2)(p*_n + q*_n) is also a best approximation, for r ∈ P_n and

‖f − r‖_∞ = ‖(1/2)(f − p*_n) + (1/2)(f − q*_n)‖_∞ ≤ (1/2)‖f − p*_n‖_∞ + (1/2)‖f − q*_n‖_∞ = ‖e_n‖_∞.    (2.54)

Let x_1, ..., x_{n+2} be extremal points of f − r with the alternating property (2.40), i.e. f(x_j) − r(x_j) = (−1)^{m+j} ‖e_n‖_∞ for some integer m and j = 1, ..., n + 2. This implies that

[f(x_j) − p*_n(x_j)]/2 + [f(x_j) − q*_n(x_j)]/2 = (−1)^{m+j} ‖e_n‖_∞,  j = 1, ..., n + 2.    (2.55)

But |f(x_j) − p*_n(x_j)| ≤ ‖e_n‖_∞ and |f(x_j) − q*_n(x_j)| ≤ ‖e_n‖_∞. As a consequence,

f(x_j) − p*_n(x_j) = f(x_j) − q*_n(x_j) = (−1)^{m+j} ‖e_n‖_∞,  j = 1, ..., n + 2,    (2.56)

and it follows that

p*_n(x_j) = q*_n(x_j),  j = 1, ..., n + 2.    (2.57)

Therefore, since p*_n − q*_n ∈ P_n vanishes at n + 2 points, q*_n = p*_n.

2.4 Chebyshev Polynomials

The best uniform approximation of f(x) = x^{n+1} in [−1, 1] by polynomials of degree at most n can be found explicitly, and the solution introduces one of the most useful and remarkable families of polynomials, the Chebyshev polynomials.

Let p∗n ∈ Pn be the best uniform approximation to xn+1 in the interval[−1, 1] and as before define the error function as en(x) = xn+1− p∗n(x). Notethat since en is a monic polynomial (its leading coefficient is 1) of degree


n + 1, the problem of finding p∗n is equivalent to finding, among all monic polynomials of degree n + 1, the one with the smallest deviation (in absolute value) from zero.

According to Theorem 2.5, there exist n+ 2 distinct points,

−1 ≤ x1 < x2 < · · · < xn+2 ≤ 1, (2.58)

such that

$$e_n^2(x_j) = \|e_n\|_\infty^2, \quad j = 1,\dots,n+2. \tag{2.59}$$

Now consider the polynomial

$$q(x) = \|e_n\|_\infty^2 - e_n^2(x). \tag{2.60}$$

Then, q(xj) = 0 for j = 1, . . . , n + 2. Each of the points xj in the interior of [−1, 1] is also a local minimum of q, so necessarily q′(xj) = 0 for j = 2, . . . , n + 1. Thus, the n points x2, . . . , xn+1 are zeros of q of multiplicity at least two. But q is a nonzero polynomial of degree exactly 2n + 2. Therefore, x1 and xn+2 have to be simple zeros, and so x1 = −1 and xn+2 = 1. Note that the polynomial p(x) = (1 − x²)[e′n(x)]² ∈ P2n+2 has the same zeros as q, so p = cq for some constant c. Comparing the leading order coefficients of p and q, it follows that c = (n + 1)². Therefore, en satisfies the ordinary differential equation

$$(1-x^2)[e_n'(x)]^2 = (n+1)^2\left[\|e_n\|_\infty^2 - e_n^2(x)\right]. \tag{2.61}$$

We know e′n ∈ Pn and its n zeros are the interior points x2, . . . , xn+1. Therefore, e′n cannot change sign in [−1, x2]. Suppose it is nonnegative for x ∈ [−1, x2] (we reach the same conclusion if we assume e′n(x) ≤ 0); then, taking square roots in (2.61), we get

$$\frac{e_n'(x)}{\sqrt{\|e_n\|_\infty^2 - e_n^2(x)}} = \frac{n+1}{\sqrt{1-x^2}}, \quad x \in [-1,x_2]. \tag{2.62}$$

Using the trigonometric substitution x = cos θ, we can integrate to obtain

en(x) = ‖en‖∞ cos[(n+ 1)θ], (2.63)

for x = cos θ ∈ [−1, x2] with 0 < θ ≤ π, where we have chosen the constant of integration to be zero so that en(1) = ‖en‖∞. Recall that en is a polynomial of degree n + 1, and so is cos[(n + 1) cos⁻¹ x]. Since these two polynomials agree on [−1, x2], (2.63) must also hold for all x in [−1, 1].


Definition 2.3. The Chebyshev polynomial (of the first kind) of degree n, Tn, is defined by

$$T_n(x) = \cos n\theta, \quad x = \cos\theta, \quad 0 \le \theta \le \pi. \tag{2.64}$$

Note that (2.64) only defines Tn for x ∈ [−1, 1]. However, once the coefficients of this polynomial are determined we can define it for any real (or complex) x.

Using the trigonometric identity

cos[(n+ 1)θ] + cos[(n− 1)θ] = 2 cosnθ cos θ, (2.65)

we immediately get

Tn+1(cos θ) + Tn−1(cos θ) = 2Tn(cos θ) · cos θ (2.66)

and going back to the x variable we obtain the recursion formula

T0(x) = 1,

T1(x) = x,

Tn+1(x) = 2xTn(x)− Tn−1(x), n ≥ 1,

(2.67)

which makes it evident that the Tn for n = 0, 1, . . . are indeed polynomials of degree exactly n. Let us generate a few of them.

$$\begin{aligned}
T_0(x) &= 1,\\
T_1(x) &= x,\\
T_2(x) &= 2x\cdot x - 1 = 2x^2 - 1,\\
T_3(x) &= 2x(2x^2-1) - x = 4x^3 - 3x,\\
T_4(x) &= 2x(4x^3-3x) - (2x^2-1) = 8x^4 - 8x^2 + 1,\\
T_5(x) &= 2x(8x^4-8x^2+1) - (4x^3-3x) = 16x^5 - 20x^3 + 5x.
\end{aligned} \tag{2.68}$$

From these few Chebyshev polynomials, and from (2.67), we see that

$$T_n(x) = 2^{n-1}x^n + \text{lower order terms}, \tag{2.69}$$

and that Tn is an even (odd) function of x if n is even (odd), i.e.

$$T_n(-x) = (-1)^n T_n(x). \tag{2.70}$$
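The three-term recursion (2.67) is also a convenient way to generate Chebyshev polynomials numerically. The following is a minimal sketch in Python/NumPy (the function name is ours, not from the text) that builds the coefficients of T_0, . . . , T_5 and checks the defining property (2.64) at a few sample angles.

import numpy as np

def chebyshev_coeffs(n):
    """Coefficients of T_0,...,T_n in increasing powers of x, via (2.67)."""
    T = [np.array([1.0]), np.array([0.0, 1.0])]   # T_0 = 1, T_1 = x
    for k in range(1, n):
        # T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)
        new = np.zeros(k + 2)
        new[1:] = 2.0 * T[k]
        new[:k] -= T[k - 1]
        T.append(new)
    return T

T = chebyshev_coeffs(5)
t = np.linspace(0.0, np.pi, 7)
for n, c in enumerate(T):
    # check T_n(cos t) = cos(n t); polyval expects decreasing powers
    assert np.allclose(np.polyval(c[::-1], np.cos(t)), np.cos(n * t))
print(T[5])   # [0, 5, 0, -20, 0, 16], i.e. 16x^5 - 20x^3 + 5x as in (2.68)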


Going back to (2.63), since the leading order coefficient of en is 1 and that of Tn+1 is 2^n, it follows that ‖en‖∞ = 2^{−n}. Therefore

$$p_n^*(x) = x^{n+1} - \frac{1}{2^n}T_{n+1}(x) \tag{2.71}$$

is the best uniform approximation of x^{n+1} in [−1, 1] by polynomials of degree at most n. Equivalently, as noted in the beginning of this section, the monic polynomial of degree n with smallest infinity norm in [−1, 1] is

$$\widetilde T_n(x) = \frac{1}{2^{n-1}}T_n(x). \tag{2.72}$$

Hence, for any other monic polynomial p of degree n,

$$\max_{x\in[-1,1]}|p(x)| > \frac{1}{2^{n-1}}. \tag{2.73}$$

The zeros and extrema of Tn are easy to find. Because Tn(x) = cos nθ with 0 ≤ θ ≤ π, the zeros occur when nθ is an odd multiple of π/2. Therefore,

$$x_j = \cos\left(\frac{(2j+1)\pi}{2n}\right), \quad j = 0,\dots,n-1. \tag{2.74}$$

The extrema of Tn (the points x where Tn(x) = ±1) correspond to nθ = jπ for j = 0, 1, . . . , n, that is

$$x_j = \cos\left(\frac{j\pi}{n}\right), \quad j = 0,1,\dots,n. \tag{2.75}$$

These points are called Chebyshev or Gauss-Lobatto points and are extremely useful in applications. Note that xj for j = 1, . . . , n − 1 are local extrema. Therefore

T ′n(xj) = 0, for j = 1, . . . , n− 1. (2.76)

In other words, the Chebyshev points (2.75) are the n − 1 zeros of T ′n plusthe end points x0 = 1 and xn = −1.

Using the chain rule to differentiate Tn with respect to x, we get

$$T_n'(x) = -n\sin n\theta\,\frac{d\theta}{dx} = n\,\frac{\sin n\theta}{\sin\theta}, \quad (x = \cos\theta). \tag{2.77}$$


Therefore

$$\frac{T_{n+1}'(x)}{n+1} - \frac{T_{n-1}'(x)}{n-1} = \frac{1}{\sin\theta}\left[\sin(n+1)\theta - \sin(n-1)\theta\right] \tag{2.78}$$

and since sin(n + 1)θ − sin(n − 1)θ = 2 sin θ cos nθ, we get that

$$\frac{T_{n+1}'(x)}{n+1} - \frac{T_{n-1}'(x)}{n-1} = 2T_n(x). \tag{2.79}$$

The polynomial

$$U_n(x) = \frac{T_{n+1}'(x)}{n+1} = \frac{\sin(n+1)\theta}{\sin\theta}, \quad (x = \cos\theta), \tag{2.80}$$

of degree n is called the Chebyshev polynomial of the second kind. Thus, the Chebyshev nodes (2.75) are the zeros of the polynomial

$$q_{n+1}(x) = (1-x^2)U_{n-1}(x). \tag{2.81}$$


Chapter 3

Interpolation

3.1 Polynomial Interpolation

One of the basic tools for approximating a function or a given data set isinterpolation. In this chapter we focus on polynomial and piece-wise poly-nomial interpolation.

The polynomial interpolation problem can be stated as follows: given n + 1 data points (x0, f0), (x1, f1), . . . , (xn, fn), where x0, x1, . . . , xn are distinct, find a polynomial pn ∈ Pn which satisfies the interpolation property:

$$p_n(x_0) = f_0,\quad p_n(x_1) = f_1,\quad \dots,\quad p_n(x_n) = f_n.$$

The points x0, x1, . . . , xn are called interpolation nodes and the values f0, f1, . . . , fn are data supplied to us, or they can come from a function f we are trying to approximate, in which case fj = f(xj) for j = 0, 1, . . . , n.

Let us represent such a polynomial as pn(x) = a0 + a1x + · · · + anx^n. Then the interpolation property implies

$$\begin{aligned}
a_0 + a_1x_0 + \cdots + a_nx_0^n &= f_0,\\
a_0 + a_1x_1 + \cdots + a_nx_1^n &= f_1,\\
&\ \ \vdots\\
a_0 + a_1x_n + \cdots + a_nx_n^n &= f_n.
\end{aligned}$$

This is a linear system of n+ 1 equations in n+ 1 unknowns (the polynomialcoefficients a0, a1, . . . , an). In matrix form:

$$\begin{bmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^n\\
1 & x_1 & x_1^2 & \cdots & x_1^n\\
\vdots & & & & \vdots\\
1 & x_n & x_n^2 & \cdots & x_n^n
\end{bmatrix}
\begin{bmatrix} a_0\\ a_1\\ \vdots\\ a_n \end{bmatrix}
=
\begin{bmatrix} f_0\\ f_1\\ \vdots\\ f_n \end{bmatrix}. \tag{3.1}$$

Does this linear system have a solution? Is this solution unique? The answer is yes to both. Here is a simple proof. Take fj = 0 for j = 0, 1, . . . , n. Then pn(xj) = 0 for j = 0, 1, . . . , n, but pn is a polynomial of degree at most n, so it cannot have n + 1 zeros unless pn ≡ 0, which implies a0 = a1 = · · · = an = 0. That is, the homogeneous problem associated with (3.1) has only the trivial solution. Therefore, (3.1) has a unique solution.

Example 3.1. As an illustration let us consider interpolation by a linear polynomial, p1. Suppose we are given (x0, f0) and (x1, f1). We wrote p1 explicitly in the Introduction; we write it now in a different form:

$$p_1(x) = \frac{x-x_1}{x_0-x_1}\,f_0 + \frac{x-x_0}{x_1-x_0}\,f_1. \tag{3.2}$$

Clearly, this polynomial has degree at most 1 and satisfies the interpolationproperty:

p1(x0) = f0, (3.3)

p1(x1) = f1. (3.4)

Example 3.2. Given (x0, f0), (x1, f1), and (x2, f2), let us construct p2 ∈ P2 that interpolates these points. The way we have written p1 in (3.2) is suggestive of how to explicitly write p2:

$$p_2(x) = \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)}\,f_0 + \frac{(x-x_0)(x-x_2)}{(x_1-x_0)(x_1-x_2)}\,f_1 + \frac{(x-x_0)(x-x_1)}{(x_2-x_0)(x_2-x_1)}\,f_2.$$


If we define

$$l_0(x) = \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)}, \tag{3.5}$$
$$l_1(x) = \frac{(x-x_0)(x-x_2)}{(x_1-x_0)(x_1-x_2)}, \tag{3.6}$$
$$l_2(x) = \frac{(x-x_0)(x-x_1)}{(x_2-x_0)(x_2-x_1)}, \tag{3.7}$$

then we simply have

p2(x) = l0(x)f0 + l1(x)f1 + l2(x)f2. (3.8)

Note that each of the polynomials (3.5), (3.6), and (3.7) is exactly of degree 2 and that they satisfy lj(xk) = δjk.¹ Therefore, it follows that p2 given by (3.8) satisfies the interpolation property

p2(x0) = f0,

p2(x1) = f1,

p2(x2) = f2.

(3.9)

We can now write down the polynomial of degree at most n that interpolates n + 1 given values (x0, f0), . . . , (xn, fn), where the interpolation nodes x0, . . . , xn are assumed distinct. Define

$$l_j(x) = \frac{(x-x_0)\cdots(x-x_{j-1})(x-x_{j+1})\cdots(x-x_n)}{(x_j-x_0)\cdots(x_j-x_{j-1})(x_j-x_{j+1})\cdots(x_j-x_n)} = \prod_{\substack{k=0\\ k\ne j}}^{n}\frac{x-x_k}{x_j-x_k}, \quad j = 0,1,\dots,n. \tag{3.10}$$

These are called the elementary Lagrange polynomials of degree n. For simplicity, we are omitting in the notation their dependence on the n + 1 nodes. Since lj(xk) = δjk, we have that

$$p_n(x) = l_0(x)f_0 + l_1(x)f_1 + \cdots + l_n(x)f_n = \sum_{j=0}^{n} l_j(x)f_j \tag{3.11}$$

¹δjk is the Kronecker delta, i.e. δjk = 0 if k ≠ j and 1 if k = j.


interpolates the given data, i.e., it satisfies the interpolation property pn(xj) = fj for j = 0, 1, 2, . . . , n. Relation (3.11) is called the Lagrange form of the interpolating polynomial. The following result summarizes our discussion.

Theorem 3.1. Given n + 1 values (x0, f0), . . . , (xn, fn) with x0, x1, . . . , xn distinct, there is a unique polynomial pn of degree at most n such that pn(xj) = fj for j = 0, 1, . . . , n.

Proof. pn in (3.11) is of degree at most n and interpolates the data. Uniqueness follows as noted earlier: suppose there is another polynomial qn of degree at most n such that qn(xj) = fj for j = 0, 1, . . . , n, and consider r = pn − qn. This is a polynomial of degree at most n and r(xj) = pn(xj) − qn(xj) = fj − fj = 0 for j = 0, 1, 2, . . . , n, which is impossible unless r ≡ 0. This implies qn = pn.
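As a concrete illustration, here is a short Python/NumPy sketch (the function name is ours) that evaluates the Lagrange form (3.11) directly. It costs O(n²) operations per evaluation point, which motivates the more efficient representations developed in the following sections.

import numpy as np

def lagrange_eval(x, nodes, fvals):
    """Evaluate the interpolating polynomial (3.11) at x (scalar or array)."""
    x = np.asarray(x, dtype=float)
    p = np.zeros_like(x)
    n = len(nodes)
    for j in range(n):
        # l_j(x) = prod_{k != j} (x - x_k)/(x_j - x_k), see (3.10)
        lj = np.ones_like(x)
        for k in range(n):
            if k != j:
                lj *= (x - nodes[k]) / (nodes[j] - nodes[k])
        p += lj * fvals[j]
    return p

nodes = np.array([0.0, 1.0, 2.0])
fvals = 1.0 + nodes**2                      # data from f(x) = 1 + x^2
print(lagrange_eval(0.5, nodes, fvals))     # 1.25, since p_2 reproduces f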

3.1.1 Equispaced and Chebyshev Nodes

There are two special sets of nodes that are particularly important in applications. For convenience we are going to take the interval [−1, 1]. For a general interval [a, b], we can do the simple change of variables

$$x = \frac{1}{2}(a+b) + \frac{1}{2}(b-a)\,t, \quad t \in [-1,1]. \tag{3.12}$$

The uniform or equispaced nodes are given by

xj = −1 + jh, j = 0, 1, . . . , n and h = 2/n. (3.13)

These nodes yield very accurate and efficient trigonometric polynomial inter-polation but are generally not good for (algebraic) polynomial interpolationas we will see later.

One of the preferred sets of nodes for high order, accurate, and computationally efficient polynomial interpolation is the Chebyshev or Gauss-Lobatto set

$$x_j = \cos\left(\frac{j\pi}{n}\right), \quad j = 0,\dots,n, \tag{3.14}$$

which, as discussed in Section 2.4, are the extrema of the Chebyshev polynomial (2.64) of degree n. Note that these nodes are obtained from the


equispaced points θj = j(π/n), j = 0, 1, . . . , n, in [0, π] by the one-to-one relation x = cos θ for θ ∈ [0, π]. As defined in (3.14), the nodes go from 1 to −1, so the alternative definition xj = −cos(jπ/n) is often used. The Chebyshev nodes are not equally spaced and tend to cluster toward the end points of the interval.

3.2 Connection to Best Uniform Approximation

Given a continuous function f in [a, b], its best uniform approximation p∗n in Pn is characterized by an error, en = f − p∗n, which equioscillates, as defined in (2.40), at least n + 2 times. Therefore en has at least n + 1 zeros and, consequently, there exist x0, . . . , xn such that

p∗n(x0) = f(x0),

p∗n(x1) = f(x1),

...

p∗n(xn) = f(xn),

(3.15)

In other words, p∗n is the polynomial of degree at most n that interpolates the function f at n + 1 zeros of en. Of course, we do not construct p∗n by finding these particular n + 1 interpolation nodes. A more practical question is: given (x0, f(x0)), (x1, f(x1)), . . . , (xn, f(xn)), where x0, . . . , xn are distinct interpolation nodes in [a, b], how close is pn, the interpolating polynomial of degree at most n of f at the given nodes, to the best uniform approximation p∗n of f in Pn?

To obtain a bound for ‖pn − p∗n‖∞, we note that pn − p∗n is a polynomial of degree at most n which interpolates f − p∗n. Therefore, we can use the Lagrange formula to represent it:

$$p_n(x) - p_n^*(x) = \sum_{j=0}^{n} l_j(x)\left(f(x_j) - p_n^*(x_j)\right). \tag{3.16}$$

It then follows that

‖pn − p∗n‖∞ ≤ Λn‖f − p∗n‖∞, (3.17)

Page 51: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

42 CHAPTER 3. INTERPOLATION

where

$$\Lambda_n = \max_{a\le x\le b}\ \sum_{j=0}^{n} |l_j(x)| \tag{3.18}$$

is called the Lebesgue constant and depends only on the interpolation nodes, not on f. On the other hand, we have that

‖f − pn‖∞ = ‖f − p∗n − pn + p∗n‖∞ ≤ ‖f − p∗n‖∞ + ‖pn − p∗n‖∞. (3.19)

Using (3.17) we obtain

‖f − pn‖∞ ≤ (1 + Λn)‖f − p∗n‖∞. (3.20)

This inequality connects the interpolation error ‖f − pn‖∞ with the best approximation error ‖f − p∗n‖∞. What happens to these errors as we increase n? To make it more concrete, suppose we have a triangular array of nodes as follows:

$$\begin{array}{ccccc}
x_0^{(0)} & & & &\\
x_0^{(1)} & x_1^{(1)} & & &\\
x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & &\\
\vdots & & & &\\
x_0^{(n)} & x_1^{(n)} & \cdots & x_n^{(n)} &\\
\vdots & & & &
\end{array} \tag{3.21}$$

where a ≤ x_0^{(n)} < x_1^{(n)} < · · · < x_n^{(n)} ≤ b for n = 0, 1, . . .. Let pn be the interpolating polynomial of degree at most n of f at the nodes corresponding to the (n + 1)-st row of (3.21).

By the Weierstrass approximation theorem (p∗n is a better approximation than, or at least as good as, that provided by the Bernstein polynomial),

$$\|f - p_n^*\|_\infty \to 0 \quad \text{as } n \to \infty. \tag{3.22}$$

However, it can be proved that

$$\Lambda_n > \frac{2}{\pi^2}\log n - 1 \tag{3.23}$$


and hence the Lebesgue constant is not bounded in n. Therefore, we cannot conclude from (3.20) and (3.22) that ‖f − pn‖∞ → 0 as n → ∞, i.e. that the interpolating polynomial, as we add more and more nodes, converges uniformly to f. That depends on the regularity of f and on the distribution of the nodes. In fact, if we are given the triangular array of interpolation nodes (3.21) in advance, it is possible to construct a continuous function f such that pn will not converge uniformly to f as n → ∞.

3.3 Barycentric Formula

The Lagrange form of the interpolating polynomial is not convenient for computations. If we want to increase the degree of the polynomial we cannot reuse the work done in getting and evaluating a lower degree one. However, we can obtain a very efficient formula by rewriting the interpolating polynomial in the following way. Let

$$\omega(x) = (x-x_0)(x-x_1)\cdots(x-x_n). \tag{3.24}$$

Then, differentiating this polynomial of degree n + 1 and evaluating at x = xj, we get

$$\omega'(x_j) = \prod_{\substack{k=0\\ k\ne j}}^{n}(x_j - x_k), \quad j = 0,1,\dots,n. \tag{3.25}$$

Therefore, each of the elementary Lagrange polynomials may be written as

$$l_j(x) = \frac{\omega(x)}{(x-x_j)\,\omega'(x_j)}, \quad j = 0,1,\dots,n, \tag{3.26}$$

for x ≠ xj, while lj(xj) = 1 follows from L'Hôpital's rule. Defining

$$\lambda_j = \frac{1}{\omega'(x_j)}, \quad j = 0,1,\dots,n, \tag{3.27}$$

we can write the Lagrange formula for the interpolating polynomial of f at x0, x1, . . . , xn as

$$p_n(x) = \omega(x)\sum_{j=0}^{n}\frac{\lambda_j}{x-x_j}\,f_j. \tag{3.28}$$


Now, note that from (3.11) with f(x) ≡ 1 it follows that

$$1 = \sum_{j=0}^{n} l_j(x) = \omega(x)\sum_{j=0}^{n}\frac{\lambda_j}{x-x_j}. \tag{3.29}$$

Dividing (3.28) by (3.29), we get the so-called barycentric formula for interpolation:

$$p_n(x) = \frac{\displaystyle\sum_{j=0}^{n}\frac{\lambda_j}{x-x_j}\,f_j}{\displaystyle\sum_{j=0}^{n}\frac{\lambda_j}{x-x_j}}, \quad x \ne x_j,\ j = 0,1,\dots,n. \tag{3.30}$$

For x = xj, j = 0, 1, . . . , n, the interpolation property, pn(xj) = fj, shouldbe used.

The numbers λj depend only on the nodes x0, x1, . . . , xn and not on the given values f0, f1, . . . , fn. They can be obtained explicitly for both the Chebyshev nodes (3.14) and the equally spaced nodes (3.13), and can be precomputed efficiently for a general set of nodes.
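As an illustration, the following Python/NumPy sketch (function names are ours, not from the text) computes the λj directly from (3.25) and (3.27) and evaluates (3.30); the interpolation property pn(xj) = fj handles the case x = xj.

import numpy as np

def bary_weights(nodes):
    """lambda_j = 1/omega'(x_j), see (3.25) and (3.27)."""
    lam = np.empty(len(nodes))
    for j, xj in enumerate(nodes):
        lam[j] = 1.0 / np.prod(xj - np.delete(nodes, j))
    return lam

def bary_eval(x, nodes, lam, fvals):
    """Barycentric formula (3.30) at a scalar point x."""
    diff = x - nodes
    j = np.argmin(np.abs(diff))
    if diff[j] == 0.0:              # x is a node: use p_n(x_j) = f_j
        return fvals[j]
    w = lam / diff
    return np.dot(w, fvals) / np.sum(w)

nodes = np.linspace(-1.0, 1.0, 6)
fvals = np.sin(nodes)
print(bary_eval(0.25, nodes, bary_weights(nodes), fvals))  # close to sin(0.25)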

3.3.1 Barycentric Weights for Chebyshev Nodes

The Chebyshev nodes are the zeros of qn+1(x) = (1 − x²)Un−1(x), where Un−1(x) = sin nθ/sin θ, x = cos θ, is the Chebyshev polynomial of the second kind of degree n − 1, with leading order coefficient 2^{n−1} [see Section 2.4]. Since the λj's are defined up to a multiplicative constant (which cancels out in the barycentric formula), we can take λj proportional to 1/q′n+1(xj). Since

qn+1(x) = sin θ sinnθ, (3.31)

differentiating we get

q′n+1(x) = −n cosnθ − sinnθ cot θ. (3.32)

Thus,

$$q_{n+1}'(x_j) = \begin{cases} -2n, & j = 0,\\ -(-1)^j\,n, & j = 1,\dots,n-1,\\ -2n\,(-1)^n, & j = n. \end{cases} \tag{3.33}$$


Taking reciprocals in (3.33) and factoring out the common constant −1/n (which cancels out in the barycentric formula), we obtain the barycentric weights for the Chebyshev points

$$\lambda_j = \begin{cases} \tfrac12, & j = 0,\\ (-1)^j, & j = 1,\dots,n-1,\\ \tfrac12(-1)^n, & j = n. \end{cases} \tag{3.34}$$

Note that for a general interval [a, b], the term (a + b)/2 in the change of variables (3.12) cancels out in (3.25), but we gain an extra factor of [(b − a)/2]^n. However, this factor can be omitted as it does not alter the barycentric formula. Therefore, the same barycentric weights (3.34) can also be used for the Chebyshev nodes in an interval [a, b].

3.3.2 Barycentric Weights for Equispaced Nodes

For equispaced points, xj = x0 + jh, j = 0, 1, . . . , n, we have

$$\begin{aligned}
\lambda_j &= \frac{1}{(x_j-x_0)\cdots(x_j-x_{j-1})(x_j-x_{j+1})\cdots(x_j-x_n)}\\
&= \frac{1}{(jh)[(j-1)h]\cdots(h)\,(-h)(-2h)\cdots[(j-n)h]}\\
&= \frac{1}{(-1)^{n-j}h^n\,[j(j-1)\cdots1][1\cdot2\cdots(n-j)]}\\
&= \frac{(-1)^{j-n}}{h^n\,n!}\,\frac{n!}{j!\,(n-j)!}
= \frac{(-1)^{j-n}}{h^n\,n!}\binom{n}{j}
= \frac{(-1)^n}{h^n\,n!}\,(-1)^j\binom{n}{j}.
\end{aligned}$$

We can omit the factor (−1)^n/(h^n n!) because it cancels out in the barycentric formula. Thus, for equispaced nodes we can use

$$\lambda_j = (-1)^j\binom{n}{j}, \quad j = 0,1,\dots,n. \tag{3.35}$$

3.3.3 Barycentric Weights for General Sets of Nodes

For general arrays of nodes we can precompute the barycentric weights efficiently as follows.


λ_0^{(0)} = 1;
for m = 1 : n
    for j = 0 : m − 1
        λ_j^{(m)} = λ_j^{(m−1)}/(x_j − x_m);
    end
    λ_m^{(m)} = 1/∏_{k=0}^{m−1}(x_m − x_k);
end

If we want to add one more point (x_{n+1}, f_{n+1}) we just extend the m-loop to n + 1 to generate λ_0^{(n+1)}, λ_1^{(n+1)}, . . . , λ_{n+1}^{(n+1)}.
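A runnable Python/NumPy transcription of this update loop might look as follows (a sketch with our own names; the variable lam holds the weights of the nodes processed so far, so adding a node costs one more pass of the loop):

import numpy as np

def bary_weights_incremental(nodes):
    nodes = np.asarray(nodes, dtype=float)
    lam = np.array([1.0])                           # weights for {x_0}
    for m in range(1, len(nodes)):
        lam = lam / (nodes[:m] - nodes[m])          # update old weights
        lam = np.append(lam, 1.0 / np.prod(nodes[m] - nodes[:m]))
    return lam

x = np.array([0.0, 0.5, 1.0, 2.0])
print(bary_weights_incremental(x))   # matches 1/omega'(x_j) up to no factor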

3.4 Newton’s Form and Divided Differences

There is another representation of the interpolating polynomial which is both very efficient computationally and very convenient in the derivation of numerical methods based on interpolation. The idea of this representation, due to Newton, is to use successively lower order polynomials for constructing pn.

Suppose we have obtained pn−1 ∈ Pn−1, the interpolating polynomial of (x0, f0), (x1, f1), . . . , (xn−1, fn−1), and we would like to obtain pn ∈ Pn, the interpolating polynomial of (x0, f0), (x1, f1), . . . , (xn, fn), by reusing pn−1. The difference between these polynomials, r = pn − pn−1, is a polynomial of degree at most n. Moreover, for j = 0, . . . , n − 1,

r(xj) = pn(xj)− pn−1(xj) = fj − fj = 0. (3.36)

Therefore, r can be factored as

r(x) = cn(x− x0)(x− x1) · · · (x− xn−1). (3.37)

The constant cn is called the n-th divided difference of f with respect to x0, x1, . . . , xn, and is usually denoted by f[x0, . . . , xn]. Thus, we have

pn(x) = pn−1(x) + f [x0, . . . , xn](x− x0)(x− x1) · · · (x− xn−1). (3.38)

By the same argument, we have

pn−1(x) = pn−2(x) + f [x0, . . . , xn−1](x− x0)(x− x1) · · · (x− xn−2), (3.39)


etc. So we arrive at Newton's form of pn:

$$p_n(x) = f[x_0] + f[x_0,x_1](x-x_0) + \cdots + f[x_0,\dots,x_n](x-x_0)\cdots(x-x_{n-1}). \tag{3.40}$$

Note that for n = 1

p1(x) = f [x0] + f [x0, x1](x− x0),

p1(x0) = f [x0] = f0,

p1(x1) = f [x0] + f [x0, x1](x1 − x0) = f1.

Therefore

$$f[x_0] = f_0, \tag{3.41}$$
$$f[x_0,x_1] = \frac{f_1-f_0}{x_1-x_0}, \tag{3.42}$$

and

$$p_1(x) = f_0 + \frac{f_1-f_0}{x_1-x_0}(x-x_0). \tag{3.43}$$

Define f[xj] = fj for j = 0, 1, . . . , n. The following identity will allow us to compute all the required divided differences.

Theorem 3.2.

$$f[x_0,x_1,\dots,x_k] = \frac{f[x_1,x_2,\dots,x_k] - f[x_0,x_1,\dots,x_{k-1}]}{x_k-x_0}. \tag{3.44}$$

Proof. Let pk−1 be the interpolating polynomial of degree at most k − 1 of (x1, f1), . . . , (xk, fk) and qk−1 the interpolating polynomial of degree at most k − 1 of (x0, f0), . . . , (xk−1, fk−1). Then

$$p(x) = p_{k-1}(x) + \frac{x-x_k}{x_k-x_0}\left[p_{k-1}(x) - q_{k-1}(x)\right] \tag{3.45}$$

is a polynomial of degree at most k, and for j = 1, 2, . . . , k − 1

$$p(x_j) = f_j + \frac{x_j-x_k}{x_k-x_0}\left[f_j - f_j\right] = f_j.$$


Moreover, p(x0) = qk−1(x0) = f0 and p(xk) = pk−1(xk) = fk. Therefore, p is the interpolating polynomial of degree at most k of the points (x0, f0), (x1, f1), . . . , (xk, fk). Its leading order coefficient is f[x0, . . . , xk], and equating this with the leading order coefficient of the right hand side of (3.45),

$$\frac{f[x_1,\dots,x_k] - f[x_0,x_1,\dots,x_{k-1}]}{x_k-x_0},$$

gives (3.44).

To obtain the divided differences we construct a table using (3.44), column by column, as illustrated below for n = 3.

x_j   0th order   1st order     2nd order           3rd order
x_0   f_0
                  f[x_0,x_1]
x_1   f_1                       f[x_0,x_1,x_2]
                  f[x_1,x_2]                        f[x_0,x_1,x_2,x_3]
x_2   f_2                       f[x_1,x_2,x_3]
                  f[x_2,x_3]
x_3   f_3

Example 3.3. Let f(x) = 1 + x², xj = j, and fj = f(xj) for j = 0, . . . , 3. Then

x_j   0th order   1st order              2nd order            3rd order
0     1
                  (2−1)/(1−0) = 1
1     2                                  (3−1)/(2−0) = 1
                  (5−2)/(2−1) = 3                             (1−1)/(3−0) = 0
2     5                                  (5−3)/(3−1) = 1
                  (10−5)/(3−2) = 5
3     10

so

$$p_3(x) = 1 + 1\,(x-0) + 1\,(x-0)(x-1) + 0\,(x-0)(x-1)(x-2) = 1 + x^2.$$

After computing the divided differences, we need to evaluate pn at a given point x. This can be done efficiently by suitably factoring it. For example,


for n = 3 we have

$$\begin{aligned}
p_3(x) &= c_0 + c_1(x-x_0) + c_2(x-x_0)(x-x_1) + c_3(x-x_0)(x-x_1)(x-x_2)\\
&= c_0 + (x-x_0)\left\{c_1 + (x-x_1)\left[c_2 + (x-x_2)c_3\right]\right\}.
\end{aligned}$$

For general n we can use the following Horner-like scheme to get p = pn(x):

p = cn;
for k = n − 1 : 0
    p = ck + (x − xk) ∗ p;
end
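In Python/NumPy, the divided-difference table and the Horner-like evaluation might be sketched as follows (our own code; the in-place construction of the table is a standard variant, not taken verbatim from the text):

import numpy as np

def divided_differences(x, f):
    """Return c_k = f[x_0,...,x_k], k = 0,...,n, built column by column."""
    c = np.array(f, dtype=float)
    for k in range(1, len(x)):
        # after this pass, c[j] = f[x_{j-k},...,x_j] for j >= k
        c[k:] = (c[k:] - c[k-1:-1]) / (x[k:] - x[:-k])
    return c

def newton_eval(t, x, c):
    """Horner-like evaluation of Newton's form (3.40)."""
    p = c[-1]
    for k in range(len(c) - 2, -1, -1):
        p = c[k] + (t - x[k]) * p
    return p

x = np.array([0.0, 1.0, 2.0, 3.0])
f = 1.0 + x**2
c = divided_differences(x, f)    # [1, 1, 1, 0], as in Example 3.3
print(newton_eval(1.5, x, c))    # 3.25 = 1 + 1.5^2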

3.5 Cauchy’s Remainder

We now assume that the data fj = f(xj), j = 0, 1, . . . , n, come from a sufficiently smooth function f, which we are trying to approximate with an interpolating polynomial pn, and we focus on the error f − pn of such an approximation.

In the Introduction we proved that if x0, x1, and x are in [a, b] and f ∈ C²[a, b], then

$$f(x) - p_1(x) = \frac12 f''(\xi(x))(x-x_0)(x-x_1),$$

where p1 is the polynomial of degree at most 1 that interpolates (x0, f(x0)), (x1, f(x1)), and ξ(x) ∈ (a, b). The general result about the interpolation error is the following theorem:

Theorem 3.3. Let f ∈ C^{n+1}[a, b], let x0, x1, . . . , xn, x be contained in [a, b], and let pn be the interpolating polynomial of degree at most n of f at x0, . . . , xn. Then

$$f(x) - p_n(x) = \frac{1}{(n+1)!}\,f^{(n+1)}(\xi(x))\,(x-x_0)(x-x_1)\cdots(x-x_n), \tag{3.46}$$

where min{x0, . . . , xn, x} < ξ(x) < max{x0, . . . , xn, x}.

Proof. The right hand side of (3.46) is known as the Cauchy remainder, and the following proof is due to Cauchy.


For x equal to one of the nodes xj the result is trivially true. Take x fixed, not equal to any of the nodes, and define

$$\phi(t) = f(t) - p_n(t) - [f(x)-p_n(x)]\,\frac{(t-x_0)(t-x_1)\cdots(t-x_n)}{(x-x_0)(x-x_1)\cdots(x-x_n)}. \tag{3.47}$$

Clearly, φ ∈ C^{n+1}[a, b] and vanishes at t = x0, x1, . . . , xn, x. That is, φ has at least n + 2 zeros. Applying Rolle's theorem n + 1 times, we conclude that there exists a point ξ(x) ∈ (a, b) such that φ^{(n+1)}(ξ(x)) = 0. Therefore,

$$0 = \phi^{(n+1)}(\xi(x)) = f^{(n+1)}(\xi(x)) - [f(x)-p_n(x)]\,\frac{(n+1)!}{(x-x_0)(x-x_1)\cdots(x-x_n)},$$

from which (3.46) follows. Note that the repeated application of Rolle's theorem implies that ξ(x) is between min{x0, x1, . . . , xn, x} and max{x0, x1, . . . , xn, x}.

We are now going to find a beautiful connection between Chebyshev polynomials and the interpolation error as given by the Cauchy remainder (3.46). Let us consider the interval [−1, 1]. We have no control over the term f^{(n+1)}(ξ(x)). However, we can choose the interpolation nodes x0, . . . , xn so that the factor

$$w(x) = (x-x_0)(x-x_1)\cdots(x-x_n) \tag{3.48}$$

is as small as possible in the infinity norm. The function w is a monic polynomial of degree n + 1, and we proved in Section 2.4 that the Chebyshev polynomial T̃n+1, defined in (2.72), is the monic polynomial of degree n + 1 with smallest infinity norm. Hence, if the interpolation nodes are taken to be the zeros of Tn+1, namely

$$x_j = \cos\left(\frac{(2j+1)\pi}{2(n+1)}\right), \quad j = 0,1,\dots,n, \tag{3.49}$$

then ‖w‖∞ is minimized and ‖w‖∞ = 2^{−n}. The following theorem summarizes this observation.

Theorem 3.4. Let p_n^T(x) be the interpolating polynomial of degree at most n of f ∈ C^{n+1}[−1, 1] with respect to the nodes (3.49). Then

$$\|f - p_n^T\|_\infty \le \frac{1}{2^n(n+1)!}\,\|f^{(n+1)}\|_\infty. \tag{3.50}$$


The Gauss-Lobatto Chebyshev points,

$$x_j = \cos\left(\frac{j\pi}{n}\right), \quad j = 0,1,\dots,n, \tag{3.51}$$

which are the extrema and not the zeros of the corresponding Chebyshev polynomial, do not minimize max_{x∈[−1,1]} |w(x)|. However, they are nearly optimal. More precisely, since the Gauss-Lobatto nodes are the zeros of the (monic) polynomial [see (2.81) and (3.31)]

$$\frac{1}{2^{n-1}}(x^2-1)U_{n-1}(x) = -\frac{1}{2^{n-1}}\sin\theta\,\sin n\theta, \quad x = \cos\theta, \tag{3.52}$$

we have that

$$\|w\|_\infty = \max_{x\in[-1,1]}\left|\frac{1}{2^{n-1}}(1-x^2)U_{n-1}(x)\right| \le \frac{1}{2^{n-1}}. \tag{3.53}$$

So, the Gauss-Lobatto nodes yield a ‖w‖∞ no more than a factor of two from the optimal value.

We now relate divided differences to the derivatives of f using the Cauchy remainder. Take an arbitrary point t distinct from x0, . . . , xn. Let pn+1 be the interpolating polynomial of f at x0, . . . , xn, t and pn that at x0, . . . , xn. Then, Newton's formula (3.38) implies

pn+1(x) = pn(x) + f [x0, . . . , xn, t](x− x0)(x− x1) · · · (x− xn). (3.54)

Noting that pn+1(t) = f(t) we get

f(t) = pn(t) + f [x0, . . . , xn, t](t− x0)(t− x1) · · · (t− xn). (3.55)

Since t was arbitrary we can set t = x and obtain

f(x) = pn(x) + f [x0, . . . , xn, x](x− x0)(x− x1) · · · (x− xn), (3.56)

and upon comparing with the Cauchy remainder we get

$$f[x_0,\dots,x_n,x] = \frac{f^{(n+1)}(\xi(x))}{(n+1)!}. \tag{3.57}$$

If we set x = x_{n+1} and relabel n + 1 by k, we have

$$f[x_0,\dots,x_k] = \frac{1}{k!}\,f^{(k)}(\xi), \tag{3.58}$$


where min{x0, . . . , xk} < ξ < max{x0, . . . , xk}. Suppose that we now let x1, . . . , xk → x0. Then ξ → x0 and

$$\lim_{x_1,\dots,x_k\to x_0} f[x_0,\dots,x_k] = \frac{1}{k!}\,f^{(k)}(x_0). \tag{3.59}$$

We can use this relation to define a divided difference with coincident nodes. For example, f[x0, x1] with x0 = x1 is defined by f[x0, x0] = f′(x0), etc. This is going to be very useful for the following interpolation problem.

3.6 Hermite Interpolation

The Hermite interpolation problem is: given values of f and some of its derivatives at the nodes x0, x1, . . . , xn, find the polynomial of smallest degree interpolating those values. This polynomial is called the Hermite interpolation polynomial and can be obtained with a minor modification to Newton's form representation.

For example, suppose we look for a polynomial p of lowest degree which satisfies the interpolation conditions:

p(x0) = f(x0),

p′(x0) = f ′(x0),

p(x1) = f(x1),

p′(x1) = f ′(x1).

We can view this problem as a limiting case of polynomial interpolation of f at two pairs of coincident nodes, {x0, x0, x1, x1}, and we can use Newton's interpolation form to obtain p. The table of divided differences, in view of (3.59), is

x_0   f(x_0)
x_0   f(x_0)   f′(x_0)
x_1   f(x_1)   f[x_0,x_1]   f[x_0,x_0,x_1]
x_1   f(x_1)   f′(x_1)      f[x_0,x_1,x_1]   f[x_0,x_0,x_1,x_1]        (3.60)

and

$$p(x) = f(x_0) + f'(x_0)(x-x_0) + f[x_0,x_0,x_1](x-x_0)^2 + f[x_0,x_0,x_1,x_1](x-x_0)^2(x-x_1). \tag{3.61}$$


Example 3.4. Let f(0) = 1, f′(0) = 0 and f(1) = √2. Find the Hermite interpolation polynomial.

We construct the table of divided differences as follows:

0   1
0   1    0
1   √2   √2 − 1   √2 − 1        (3.62)

and therefore

$$p(x) = 1 + 0\,(x-0) + (\sqrt2-1)(x-0)^2 = 1 + (\sqrt2-1)x^2. \tag{3.63}$$

3.7 Convergence of Polynomial Interpolation

From the Cauchy remainder formula

$$f(x) - p_n(x) = \frac{1}{(n+1)!}\,f^{(n+1)}(\xi(x))\,(x-x_0)(x-x_1)\cdots(x-x_n) \tag{3.64}$$

it is clear that the accuracy and convergence of the interpolating polynomial pn of f depend on both the smoothness of f and the distribution of the nodes x0, x1, . . . , xn.

In the Runge example, the function

$$f(x) = \frac{1}{1+25x^2}, \quad x \in [-1,1],$$

is very smooth. It has an infinite number of continuous derivatives, i.e. f ∈ C^∞[−1, 1] (in fact f is real analytic on the whole real line, i.e. its Taylor series converges to f(x) for every x ∈ R). Nevertheless, for the equispaced nodes (3.13), pn does not converge uniformly to f as n → ∞. In fact it diverges quite dramatically toward the end points of the interval. On the other hand, there is fast and uniform convergence of pn to f when the Chebyshev nodes (3.14) are employed.
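The phenomenon is easy to reproduce. The following Python/NumPy sketch (our own code, reusing the barycentric formula of Section 3.3) compares the maximum interpolation error of Runge's function for equispaced and Chebyshev nodes at n = 20; the first printed number is large while the second is small.

import numpy as np

def interp_max_error(nodes, n_test=2001):
    f = lambda t: 1.0 / (1.0 + 25.0 * t**2)
    # direct barycentric weights lambda_j = 1/omega'(x_j)
    lam = np.array([1.0 / np.prod(xj - np.delete(nodes, j))
                    for j, xj in enumerate(nodes)])
    fv, err = f(nodes), 0.0
    for ti in np.linspace(-1.0, 1.0, n_test):
        d = ti - nodes
        if np.any(d == 0.0):        # skip exact node hits
            continue
        w = lam / d
        err = max(err, abs(np.dot(w, fv) / np.sum(w) - f(ti)))
    return err

n = 20
print(interp_max_error(np.linspace(-1, 1, n + 1)))             # large near the ends
print(interp_max_error(np.cos(np.pi * np.arange(n + 1) / n)))  # small, uniform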

It is then natural to ask: given any f ∈ C[a, b], can we guarantee that ‖f − pn‖∞ → 0 if we choose the Chebyshev nodes? The answer is no. Bernstein and Faber proved in 1914 that, given any distribution of points organized in


a triangular array (3.21), it is possible to construct a continuous function f for which its interpolating polynomial pn (corresponding to the nodes on the n-th row of (3.21)) will not converge uniformly to f as n → ∞. However, if f is slightly smoother, for example f ∈ C¹[a, b], then for the Chebyshev array of nodes ‖f − pn‖∞ → 0.

If f(x) = e^{−x²}, there is convergence of pn even with the equidistributed nodes. What is so special about this function? The function

$$f(z) = e^{-z^2}, \quad z = x + iy, \tag{3.65}$$

is analytic in the entire complex plane. Using complex variable analysis, it can be shown that if f is analytic in a sufficiently large region in the complex plane containing [a, b], then ‖f − pn‖∞ → 0. How large does the region of analyticity need to be? It depends on the asymptotic distribution of the nodes as n → ∞.

In the limit as n → ∞, we can think of the nodes as a continuum with a density ρ so that for sufficiently large n,

$$(n+1)\int_a^x \rho(t)\,dt \tag{3.66}$$

is the total number of nodes in [a, x]. Take for example [−1, 1]. For equispaced nodes ρ(x) = 1/2, and for the Chebyshev nodes ρ(x) = 1/(π√(1 − x²)).

It turns out that the relevant domain of analyticity is given in terms of the function

$$\phi(z) = -\int_a^b \rho(t)\ln|z-t|\,dt. \tag{3.67}$$

Let Γc be the level curve consisting of all the points z ∈ C such that φ(z) = c for c constant. For very large and negative c, Γc approximates a large circle. As c is increased, Γc shrinks. We take the "smallest" level curve, Γc0, which contains [a, b]. The relevant domain of analyticity is

$$R_{c_0} = \{z \in \mathbb{C} : \phi(z) \ge c_0\}. \tag{3.68}$$

Then, if f is analytic in Rc0, ‖f − pn‖∞ → 0, not only in [a, b] but for every point in Rc0. Moreover,

$$|f(x) - p_n(x)| \le C\,e^{-n(\phi(x)-c_0)}, \tag{3.69}$$

Page 64: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

3.8. PIECE-WISE LINEAR INTERPOLATION 55

for some constant C. That is, pn converges exponentially fast to f. For the Chebyshev nodes, Rc0 approximates [a, b], so if f is analytic in any region containing [a, b], however thin this region might be, pn will converge uniformly to f. For equidistributed nodes, Rc0 looks like a football, with [a, b] as its longest axis. In the Runge example, the function is singular at z = ±i/5, which happens to be inside this football-like domain, and this explains the observed lack of convergence for this particular function.

The moral of the story is that polynomial interpolation using Chebyshev nodes converges very rapidly for smooth functions and thus yields very accurate approximations.

3.8 Piece-wise Linear Interpolation

One way to reduce the error in linear interpolation is to divide [a, b] into small subintervals [x0, x1], . . . , [xn−1, xn]. In each of the subintervals [xj, xj+1] we approximate f by

$$p(x) = f(x_j) + \frac{f(x_{j+1})-f(x_j)}{x_{j+1}-x_j}(x-x_j), \quad x \in [x_j,x_{j+1}]. \tag{3.70}$$

We know that

$$f(x) - p(x) = \frac12 f''(\xi)(x-x_j)(x-x_{j+1}), \quad x \in [x_j,x_{j+1}], \tag{3.71}$$

where ξ is some point between xj and xj+1. Suppose that |f″(x)| ≤ M2 for all x ∈ [a, b]. Then

$$|f(x) - p(x)| \le \frac12 M_2 \max_{x_j\le x\le x_{j+1}} |(x-x_j)(x-x_{j+1})|. \tag{3.72}$$

Now the max on the right hand side is attained at the midpoint (xj + xj+1)/2 and

$$\max_{x_j\le x\le x_{j+1}} |(x-x_j)(x-x_{j+1})| = \left(\frac{x_{j+1}-x_j}{2}\right)^2 = \frac14 h_j^2, \tag{3.73}$$

where hj = xj+1 − xj. Therefore

$$\max_{x_j\le x\le x_{j+1}} |f(x)-p(x)| \le \frac18 M_2 h_j^2. \tag{3.74}$$


If we want this error to be smaller than a prescribed tolerance δ we can take sufficiently small subintervals. Namely, we can pick hj such that (1/8)M2hj² ≤ δ, which implies that

$$h_j \le \sqrt{\frac{8\delta}{M_2}}. \tag{3.75}$$

3.9 Cubic Splines

Several applications require a smoother curve than that provided by a piecewise linear approximation. Continuity of the first and second derivatives provides the required smoothness.

One of the most frequently used such approximations is the cubic spline, a piecewise cubic function s(x) which interpolates a set of points (x0, f0), (x1, f1), . . . , (xn, fn) and has two continuous derivatives. In each subinterval [xj, xj+1], s(x) is a cubic polynomial, which we may represent as

$$s_j(x) = A_j(x-x_j)^3 + B_j(x-x_j)^2 + C_j(x-x_j) + D_j. \tag{3.76}$$

Let

hj = xj+1 − xj. (3.77)

The spline s(x) interpolates the given data:

$$s_j(x_j) = f_j = D_j, \tag{3.78}$$
$$s_j(x_{j+1}) = A_jh_j^3 + B_jh_j^2 + C_jh_j + D_j = f_{j+1}. \tag{3.79}$$

Now s′j(x) = 3Aj(x − xj)² + 2Bj(x − xj) + Cj and s″j(x) = 6Aj(x − xj) + 2Bj. Therefore

$$s_j'(x_j) = C_j, \tag{3.80}$$
$$s_j'(x_{j+1}) = 3A_jh_j^2 + 2B_jh_j + C_j, \tag{3.81}$$
$$s_j''(x_j) = 2B_j, \tag{3.82}$$
$$s_j''(x_{j+1}) = 6A_jh_j + 2B_j. \tag{3.83}$$


We are going to write the spline coefficients Aj, Bj, Cj, and Dj in terms of fj, fj+1 and the unknown values zj = s″j(xj) and zj+1 = s″j(xj+1). We have

$$D_j = f_j, \qquad B_j = \frac12 z_j, \qquad 6A_jh_j + 2B_j = z_{j+1} \ \Rightarrow\ A_j = \frac{1}{6h_j}(z_{j+1}-z_j),$$

and substituting these values in (3.79) we get

$$C_j = \frac{1}{h_j}(f_{j+1}-f_j) - \frac{h_j}{6}(z_{j+1}+2z_j).$$

Let us collect all our formulas for the spline coefficients:

$$A_j = \frac{1}{6h_j}(z_{j+1}-z_j), \tag{3.84}$$
$$B_j = \frac12 z_j, \tag{3.85}$$
$$C_j = \frac{1}{h_j}(f_{j+1}-f_j) - \frac{h_j}{6}(z_{j+1}+2z_j), \tag{3.86}$$
$$D_j = f_j. \tag{3.87}$$

Note that the second derivative of s is continuous, s″j(xj+1) = zj+1 = s″j+1(xj+1), and by construction s interpolates the given data. We are now going to use the condition of continuity of the first derivative of s to determine equations for the unknown values zj, j = 1, 2, . . . , n − 1:

$$\begin{aligned}
s_j'(x_{j+1}) &= 3A_jh_j^2 + 2B_jh_j + C_j\\
&= \frac{h_j}{2}(z_{j+1}-z_j) + z_jh_j + \frac{1}{h_j}(f_{j+1}-f_j) - \frac{h_j}{6}(z_{j+1}+2z_j)\\
&= \frac{1}{h_j}(f_{j+1}-f_j) + \frac{h_j}{6}(2z_{j+1}+z_j).
\end{aligned}$$

Decreasing the index by 1 we get

$$s_{j-1}'(x_j) = \frac{1}{h_{j-1}}(f_j-f_{j-1}) + \frac{h_{j-1}}{6}(2z_j+z_{j-1}). \tag{3.88}$$


Continuity of the first derivative at an interior node means s′j−1(xj) = s′j(xj) for j = 1, 2, . . . , n − 1. Therefore

$$\frac{1}{h_{j-1}}(f_j-f_{j-1}) + \frac{h_{j-1}}{6}(2z_j+z_{j-1}) = C_j = \frac{1}{h_j}(f_{j+1}-f_j) - \frac{h_j}{6}(z_{j+1}+2z_j),$$

which can be written as

$$h_{j-1}z_{j-1} + 2(h_{j-1}+h_j)z_j + h_jz_{j+1} = -\frac{6}{h_{j-1}}(f_j-f_{j-1}) + \frac{6}{h_j}(f_{j+1}-f_j), \quad j = 1,\dots,n-1. \tag{3.89}$$

This is a linear system of n − 1 equations for the n − 1 unknowns z1, z2, . . . , zn−1. In matrix form,

$$\begin{bmatrix}
2(h_0+h_1) & h_1 & & \\
h_1 & 2(h_1+h_2) & h_2 & \\
 & \ddots & \ddots & h_{n-2}\\
 & & h_{n-2} & 2(h_{n-2}+h_{n-1})
\end{bmatrix}
\begin{bmatrix} z_1\\ z_2\\ \vdots\\ z_{n-1}\end{bmatrix}
=
\begin{bmatrix} d_1\\ d_2\\ \vdots\\ d_{n-1}\end{bmatrix}, \tag{3.90}$$

where

$$\begin{bmatrix} d_1\\ d_2\\ \vdots\\ d_{n-2}\\ d_{n-1}\end{bmatrix}
=
\begin{bmatrix}
-\frac{6}{h_0}(f_1-f_0) + \frac{6}{h_1}(f_2-f_1) - h_0z_0\\
-\frac{6}{h_1}(f_2-f_1) + \frac{6}{h_2}(f_3-f_2)\\
\vdots\\
-\frac{6}{h_{n-3}}(f_{n-2}-f_{n-3}) + \frac{6}{h_{n-2}}(f_{n-1}-f_{n-2})\\
-\frac{6}{h_{n-2}}(f_{n-1}-f_{n-2}) + \frac{6}{h_{n-1}}(f_n-f_{n-1}) - h_{n-1}z_n
\end{bmatrix}. \tag{3.91}$$

Note that z0 = f″0 and zn = f″n are unspecified. The values z0 = zn = 0 define what is called a natural spline. The matrix of the linear system (3.90) is strictly diagonally dominant, a concept we make precise in the definition below. A consequence of this property, as we will see shortly, is that the matrix is nonsingular, and therefore there is a unique solution for z1, z2, . . . , zn−1, the second derivative values of the spline at the interior nodes.

Definition 3.1. An n × n matrix A with entries aij, i, j = 1, . . . , n, is strictly diagonally dominant if

$$|a_{ii}| > \sum_{\substack{j=1\\ j\ne i}}^{n}|a_{ij}|, \quad i = 1,\dots,n. \tag{3.92}$$


Theorem 3.5. Let A be a strictly diagonally dominant matrix. Then A is nonsingular.

Proof. Suppose the contrary, that is, there is x ≠ 0 such that Ax = 0. Let k be an index such that |xk| = ‖x‖∞. Then, the k-th equation in Ax = 0 gives

$$a_{kk}x_k + \sum_{\substack{j=1\\ j\ne k}}^{n} a_{kj}x_j = 0 \tag{3.93}$$

and consequently

$$|a_{kk}||x_k| \le \sum_{\substack{j=1\\ j\ne k}}^{n}|a_{kj}||x_j|. \tag{3.94}$$

Dividing by |xk|, which by assumption is nonzero, and using that |xj|/|xk| ≤ 1 for all j = 1, . . . , n, we get

$$|a_{kk}| \le \sum_{\substack{j=1\\ j\ne k}}^{n}|a_{kj}|, \tag{3.95}$$

which contradicts the fact that A is strictly diagonally dominant.

If the nodes are equidistributed, xj = x0 + jh for j = 0, 1, . . . , n, then the linear system (3.90), after dividing by h, simplifies to

$$\begin{bmatrix}
4 & 1 & & \\
1 & 4 & 1 & \\
 & \ddots & \ddots & 1\\
 & & 1 & 4
\end{bmatrix}
\begin{bmatrix} z_1\\ z_2\\ \vdots\\ z_{n-1}\end{bmatrix}
=
\begin{bmatrix}
\frac{6}{h^2}(f_0-2f_1+f_2) - z_0\\
\frac{6}{h^2}(f_1-2f_2+f_3)\\
\vdots\\
\frac{6}{h^2}(f_{n-3}-2f_{n-2}+f_{n-1})\\
\frac{6}{h^2}(f_{n-2}-2f_{n-1}+f_n) - z_n
\end{bmatrix}. \tag{3.96}$$

Once the z1, z2, . . . , zn−1 are found the spline coefficients can be computedfrom (3.84)-(3.87).

Example 3.5. Find the natural cubic spline that interpolates (0, 0), (1, 1), (2, 16). We know z0 = 0 and z2 = 0, so we only need to find z1 (there is only one interior node). The system (3.89) degenerates to just one equation, with h = 1:

$$z_0 + 4z_1 + z_2 = 6\,[f_0 - 2f_1 + f_2] \quad\Rightarrow\quad z_1 = 21.$$


In [0, 1] we have

$$\begin{aligned}
A_0 &= \frac{1}{6h}(z_1-z_0) = \frac{1}{6}\cdot 21 = \frac72,\\
B_0 &= \frac12 z_0 = 0,\\
C_0 &= \frac1h(f_1-f_0) - \frac{h}{6}(z_1+2z_0) = 1 - \frac{21}{6} = -\frac52,\\
D_0 &= f_0 = 0.
\end{aligned}$$

Thus, s0(x) = A0(x − 0)³ + B0(x − 0)² + C0(x − 0) + D0 = (7/2)x³ − (5/2)x. Now in [1, 2],

$$\begin{aligned}
A_1 &= \frac{1}{6h}(z_2-z_1) = \frac{1}{6}(-21) = -\frac72,\\
B_1 &= \frac12 z_1 = \frac{21}{2},\\
C_1 &= \frac1h(f_2-f_1) - \frac{h}{6}(z_2+2z_1) = 16 - 1 - \frac{42}{6} = 8,\\
D_1 &= f_1 = 1,
\end{aligned}$$

and s1(x) = −(7/2)(x − 1)³ + (21/2)(x − 1)² + 8(x − 1) + 1. Therefore the spline is given by

$$s(x) = \begin{cases} \dfrac72 x^3 - \dfrac52 x, & x \in [0,1],\\[4pt] -\dfrac72(x-1)^3 + \dfrac{21}{2}(x-1)^2 + 8(x-1) + 1, & x \in [1,2]. \end{cases}$$
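The construction generalizes directly. Below is a Python/NumPy sketch (our own code; it uses a dense solver for clarity rather than the tridiagonal algorithm of the next subsection) that assembles the natural spline system and returns the coefficients (3.84)-(3.87); applied to the data of Example 3.5 it reproduces the values above.

import numpy as np

def natural_spline(x, f):
    n = len(x) - 1
    h = np.diff(x)
    z = np.zeros(n + 1)                         # z_0 = z_n = 0 (natural)
    if n > 1:
        A = (np.diag(2.0 * (h[:-1] + h[1:])) +
             np.diag(h[1:-1], 1) + np.diag(h[1:-1], -1))
        d = (-6.0 / h[:-1] * np.diff(f)[:-1]
             + 6.0 / h[1:] * np.diff(f)[1:])
        z[1:-1] = np.linalg.solve(A, d)         # system (3.90)
    A_ = (z[1:] - z[:-1]) / (6.0 * h)           # (3.84)
    B_ = z[:-1] / 2.0                           # (3.85)
    C_ = np.diff(f) / h - h * (z[1:] + 2.0 * z[:-1]) / 6.0   # (3.86)
    D_ = f[:-1]                                 # (3.87)
    return A_, B_, C_, D_

x = np.array([0.0, 1.0, 2.0]); f = np.array([0.0, 1.0, 16.0])
print(natural_spline(x, f))   # A = [7/2, -7/2], B = [0, 21/2], C = [-5/2, 8]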

3.9.1 Solving the Tridiagonal System

The matrix of coefficients of the linear system (3.90) has the tridiagonal form

$$A = \begin{bmatrix}
a_1 & b_1 & & & \\
c_1 & a_2 & b_2 & & \\
 & c_2 & \ddots & \ddots & \\
 & & \ddots & \ddots & b_{N-1}\\
 & & & c_{N-1} & a_N
\end{bmatrix} \tag{3.97}$$


where, for the (n − 1) × (n − 1) natural spline system (3.90), the nonzero tridiagonal entries are

$$a_j = 2(h_{j-1}+h_j), \quad j = 1,2,\dots,n-1, \tag{3.98}$$
$$b_j = h_j, \quad j = 1,2,\dots,n-2, \tag{3.99}$$
$$c_j = h_j, \quad j = 1,2,\dots,n-2. \tag{3.100}$$

We can solve the corresponding linear system of equations Ax = d by factoring the matrix A into the product of a lower triangular matrix L and an upper triangular matrix U. To illustrate the idea let us take N = 5. A 5 × 5 tridiagonal linear system has the form

$$\begin{bmatrix}
a_1 & b_1 & 0 & 0 & 0\\
c_1 & a_2 & b_2 & 0 & 0\\
0 & c_2 & a_3 & b_3 & 0\\
0 & 0 & c_3 & a_4 & b_4\\
0 & 0 & 0 & c_4 & a_5
\end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4\\ x_5\end{bmatrix}
=
\begin{bmatrix} d_1\\ d_2\\ d_3\\ d_4\\ d_5\end{bmatrix} \tag{3.101}$$

and we seek a factorization of the form

$$\begin{bmatrix}
a_1 & b_1 & 0 & 0 & 0\\
c_1 & a_2 & b_2 & 0 & 0\\
0 & c_2 & a_3 & b_3 & 0\\
0 & 0 & c_3 & a_4 & b_4\\
0 & 0 & 0 & c_4 & a_5
\end{bmatrix}
=
\begin{bmatrix}
1 & & & &\\
l_1 & 1 & & &\\
 & l_2 & 1 & &\\
 & & l_3 & 1 &\\
 & & & l_4 & 1
\end{bmatrix}
\begin{bmatrix}
m_1 & u_1 & & &\\
 & m_2 & u_2 & &\\
 & & m_3 & u_3 &\\
 & & & m_4 & u_4\\
 & & & & m_5
\end{bmatrix}.$$

Note that the first matrix on the right hand side is lower triangular and the second one is upper triangular. Performing the product of the matrices and comparing with the corresponding entries of the matrix on the left hand side, we have

1st row: a1 = m1, b1 = u1,
2nd row: c1 = m1l1, a2 = l1u1 + m2, b2 = u2,
3rd row: c2 = m2l2, a3 = l2u2 + m3, b3 = u3,
4th row: c3 = m3l3, a4 = l3u3 + m4, b4 = u4,
5th row: c4 = m4l4, a5 = l4u4 + m5.

So we can determine the unknowns in the following order

m1;u1, l1,m2;u2, l2,m3; . . . , u4, l4,m5.


Since uj = bj for all j we can write down the algorithm for general N as

% Determine the factorization coefficientsm1 = a1

for j = 1 : N − 1lj = cj/mj

mj+1 = aj+1 − lj ∗ bjend

% Forward substitution on Ly = dy1 = d1

for j = 2 : Nyj = dj − lj−1 ∗ yj−1

end

% Backward substitution to solve Ux = yxN = yN/mN

for j = N − 1 : 1xj = (yj − bj ∗ xj+1)/mj

end
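A direct Python/NumPy transcription of this algorithm (a sketch, with our own function name) is:

import numpy as np

def thomas_solve(a, b, c, d):
    """Solve Ax = d for tridiagonal A: a = diagonal (length N),
    b = super-diagonal, c = sub-diagonal (both length N-1)."""
    N = len(a)
    m = np.empty(N); l = np.empty(N - 1)
    m[0] = a[0]
    for j in range(N - 1):                 # LU factorization
        l[j] = c[j] / m[j]
        m[j + 1] = a[j + 1] - l[j] * b[j]
    y = np.empty(N)                        # forward substitution Ly = d
    y[0] = d[0]
    for j in range(1, N):
        y[j] = d[j] - l[j - 1] * y[j - 1]
    x = np.empty(N)                        # backward substitution Ux = y
    x[-1] = y[-1] / m[-1]
    for j in range(N - 2, -1, -1):
        x[j] = (y[j] - b[j] * x[j + 1]) / m[j]
    return x

a = np.full(4, 4.0); b = np.ones(3); c = np.ones(3)
print(thomas_solve(a, b, c, np.array([1.0, 2.0, 2.0, 1.0])))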

3.9.2 Complete Splines

Sometimes it is more appropriate to specify the first derivative at the end points instead of the second derivative. This is called a complete or clamped spline. In this case z0 = f″0 and zn = f″n become unknowns together with z1, z2, . . . , zn−1.

We need to add two more equations to have a complete system for all the n + 1 unknown values z0, z1, . . . , zn of a complete spline. Recall that

$$s_j(x) = A_j(x-x_j)^3 + B_j(x-x_j)^2 + C_j(x-x_j) + D_j$$

and so

$$s_j'(x) = 3A_j(x-x_j)^2 + 2B_j(x-x_j) + C_j.$$


Therefore

$$s_0'(x_0) = C_0 = f_0', \tag{3.102}$$
$$s_{n-1}'(x_n) = 3A_{n-1}h_{n-1}^2 + 2B_{n-1}h_{n-1} + C_{n-1} = f_n'. \tag{3.103}$$

Substituting C0, An−1, Bn−1, and Cn−1 from (3.84)-(3.86) we get

$$2h_0z_0 + h_0z_1 = \frac{6}{h_0}(f_1-f_0) - 6f_0', \tag{3.104}$$
$$h_{n-1}z_{n-1} + 2h_{n-1}z_n = -\frac{6}{h_{n-1}}(f_n-f_{n-1}) + 6f_n'. \tag{3.105}$$

These two equations together with (3.89) uniquely determine the second derivative values at all the nodes. The resulting (n + 1) × (n + 1) system is also tridiagonal and diagonally dominant (hence nonsingular). Once the values z0, z1, . . . , zn are found, the spline coefficients are obtained from (3.84)-(3.87).

3.9.3 Parametric Curves

In computer graphics and animation it is often required to construct smooth curves that are not necessarily the graph of a function but that have a parametric representation x = x(t) and y = y(t) for t ∈ [a, b]. Hence one needs to determine two splines interpolating (tj, xj) and (tj, yj) (j = 0, 1, . . . , n), respectively.

The arc length of the curve is a natural choice for the parameter t. However, it is not known a priori, and instead the nodes tj are usually chosen from the accumulated distances between consecutive, judiciously chosen points:

$$t_0 = 0, \quad t_j = t_{j-1} + \sqrt{(x_j-x_{j-1})^2 + (y_j-y_{j-1})^2}, \quad j = 1,2,\dots,n. \tag{3.106}$$


Chapter 4

Trigonometric Approximation

We will study approximations employing truncated Fourier series. This type of approximation finds multiple applications in digital signal and image processing, and in the construction of highly accurate approximations to the solution of some partial differential equations. We will look at the problem of best approximation in the convenient L² norm and then consider the more practical approximation using interpolation. The latter will put us in the discrete framework to introduce the leading star of this chapter, the Discrete Fourier Transform, and one of the top ten algorithms of all time, the Fast Fourier Transform, to compute it.

4.1 Approximating a Periodic Function

We begin with the problem of approximating a periodic function f at the continuum level. Without loss of generality, we can assume that f is of period 2π (if it is of period p, then the function F(y) = f(py/(2π)) has period 2π).

If f is a smooth periodic function we can approximate it with the first n terms of its Fourier series:

$$f(x) \approx \frac12 a_0 + \sum_{k=1}^{n}(a_k\cos kx + b_k\sin kx), \tag{4.1}$$


where

$$a_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\cos kx\,dx, \quad k = 0,1,\dots,n, \tag{4.2}$$
$$b_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\sin kx\,dx, \quad k = 1,2,\dots,n. \tag{4.3}$$

We will show that this is the best approximation to f, in the L² norm, by a trigonometric polynomial of degree n (the right hand side of (4.1)). For convenience, we write a trigonometric polynomial Sn (of degree n) in complex form (see (1.45)-(1.48)) as

$$S_n(x) = \sum_{k=-n}^{n} c_k e^{ikx}. \tag{4.4}$$

Consider the square of the error

$$J_n = \|f - S_n\|_2^2 = \int_0^{2\pi}\left[f(x) - S_n(x)\right]^2 dx. \tag{4.5}$$

Let us try to find the coefficients ck (k = 0, ±1, . . . , ±n) in (4.4) that minimize Jn. We have

$$\begin{aligned}
J_n &= \int_0^{2\pi}\Big[f(x) - \sum_{k=-n}^{n}c_ke^{ikx}\Big]^2 dx\\
&= \int_0^{2\pi}[f(x)]^2dx - 2\sum_{k=-n}^{n}c_k\int_0^{2\pi}f(x)e^{ikx}dx + \sum_{k=-n}^{n}\sum_{l=-n}^{n}c_kc_l\int_0^{2\pi}e^{ikx}e^{ilx}dx.
\end{aligned} \tag{4.6}$$

This problem simplifies if we use the orthogonality of the set {1, e^{ix}, e^{−ix}, . . . , e^{inx}, e^{−inx}}: for k ≠ −l,

$$\int_0^{2\pi}e^{ikx}e^{ilx}dx = \int_0^{2\pi}e^{i(k+l)x}dx = \frac{1}{i(k+l)}e^{i(k+l)x}\Big|_0^{2\pi} = 0, \tag{4.7}$$

and for k = −l,

$$\int_0^{2\pi}e^{ikx}e^{ilx}dx = \int_0^{2\pi}dx = 2\pi. \tag{4.8}$$


Thus, we get

$$J_n = \int_0^{2\pi}[f(x)]^2dx - 2\sum_{k=-n}^{n}c_k\int_0^{2\pi}f(x)e^{ikx}dx + 2\pi\sum_{k=-n}^{n}c_kc_{-k}. \tag{4.9}$$

Jn is a quadratic function of the coefficients ck, so to find its minimum we determine the critical point of Jn as a function of the ck's:

$$\frac{\partial J_n}{\partial c_m} = -2\int_0^{2\pi}f(x)e^{imx}dx + 2(2\pi)c_{-m} = 0, \quad m = 0,\pm1,\dots,\pm n. \tag{4.10}$$

Therefore, relabeling the coefficients with k again, we get

$$c_k = \frac{1}{2\pi}\int_0^{2\pi}f(x)e^{-ikx}dx, \quad k = 0,\pm1,\dots,\pm n, \tag{4.11}$$

which are the complex Fourier coefficients of f. The real Fourier coefficients (4.2)-(4.3) follow from Euler's formula e^{ikx} = cos kx + i sin kx, which produces the relations

$$c_0 = \frac12 a_0, \quad c_k = \frac12(a_k - ib_k), \quad c_{-k} = \frac12(a_k + ib_k), \quad k = 1,\dots,n. \tag{4.12}$$

This concludes the proof of the claim that the best L² approximation to f by trigonometric polynomials of degree n is furnished by the Fourier series of f truncated at wave number k = n.

Now, if we substitute the Fourier coefficients (4.11) in (4.9), we get

$$0 \le J_n = \int_0^{2\pi}[f(x)]^2dx - 2\pi\sum_{k=-n}^{n}|c_k|^2,$$

that is,

$$\sum_{k=-n}^{n}|c_k|^2 \le \frac{1}{2\pi}\int_0^{2\pi}[f(x)]^2dx. \tag{4.13}$$

This is known as Bessel's inequality. In terms of the real Fourier coefficients, Bessel's inequality becomes

$$\frac12 a_0^2 + \sum_{k=1}^{n}(a_k^2+b_k^2) \le \frac{1}{\pi}\int_0^{2\pi}[f(x)]^2dx. \tag{4.14}$$


If $\int_0^{2\pi}[f(x)]^2dx$ is finite, then the series

$$\frac12 a_0^2 + \sum_{k=1}^{\infty}(a_k^2+b_k^2)$$

converges, and consequently lim_{k→∞} ak = lim_{k→∞} bk = 0.

The convergence of a Fourier series is a delicate question. For a continuous function f, it does not follow that its Fourier series converges point-wise to it, only that it does so in the mean:

$$\lim_{n\to\infty}\int_0^{2\pi}\left[f(x)-S_n(x)\right]^2dx = 0. \tag{4.15}$$

This convergence in the mean for a continuous periodic function implies that Bessel's inequality becomes the equality

$$\frac12 a_0^2 + \sum_{k=1}^{\infty}(a_k^2+b_k^2) = \frac{1}{\pi}\int_0^{2\pi}[f(x)]^2dx, \tag{4.16}$$

which is known as Parseval's identity. We now state a convergence result without proof.

Theorem 4.1. Suppose that f is piecewise continuous and periodic in [0, 2π] with a piecewise continuous first derivative. Then

$$\frac12 a_0 + \sum_{k=1}^{\infty}(a_k\cos kx + b_k\sin kx) = \frac12\left[f^+(x) + f^-(x)\right]$$

for each x ∈ [0, 2π], where ak and bk are the Fourier coefficients of f. Here

$$\lim_{h\to0^+} f(x+h) = f^+(x), \tag{4.17}$$
$$\lim_{h\to0^+} f(x-h) = f^-(x). \tag{4.18}$$

In particular, if x is a point of continuity of f,

$$\frac12 a_0 + \sum_{k=1}^{\infty}(a_k\cos kx + b_k\sin kx) = f(x).$$


We have been working on the interval [0, 2π], but we can choose any other interval of length 2π. This is so because if g is 2π-periodic then we have the following result.

Lemma 2.

$$\int_0^{2\pi} g(x)\,dx = \int_t^{t+2\pi} g(x)\,dx \tag{4.19}$$

for any real t.

Proof. Define

$$G(t) = \int_t^{t+2\pi} g(x)\,dx. \tag{4.20}$$

Then

$$G(t) = \int_t^{0} g(x)\,dx + \int_0^{t+2\pi} g(x)\,dx = \int_0^{t+2\pi}g(x)\,dx - \int_0^{t} g(x)\,dx. \tag{4.21}$$

By the Fundamental Theorem of Calculus,

$$G'(t) = g(t+2\pi) - g(t) = 0$$

since g is 2π-periodic. Thus, G is independent of t.

Example 4.1. Let f(x) = |x| on [−π, π]. Then

$$a_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos kx\,dx = \frac{1}{\pi}\int_{-\pi}^{\pi}|x|\cos kx\,dx = \frac{2}{\pi}\int_0^{\pi} x\cos kx\,dx.$$

Integrating by parts with u = x, dv = cos kx dx (so du = dx, v = (1/k) sin kx),

$$a_k = \frac{2}{\pi}\left[\frac{x}{k}\sin kx\Big|_0^{\pi} - \frac1k\int_0^{\pi}\sin kx\,dx\right] = \frac{2}{\pi k^2}\cos kx\Big|_0^{\pi} = \frac{2}{\pi k^2}\left[(-1)^k-1\right], \quad k = 1,2,\dots,


and a0 = π, while bk = 0 for all k since the function is even. Therefore

$$S(x) = \lim_{n\to\infty}S_n(x) = \frac{\pi}{2} + \sum_{k=1}^{\infty}\frac{2}{\pi k^2}\left[(-1)^k-1\right]\cos kx = \frac{\pi}{2} - \frac{4}{\pi}\left[\frac{\cos x}{1^2} + \frac{\cos 3x}{3^2} + \frac{\cos 5x}{5^2} + \cdots\right].$$

How do we find accurate numerical approximations to the Fourier coefficients? We know that for periodic smooth integrands, integrated over one (or multiple) period(s), the composite trapezoidal rule using equidistributed nodes gives spectral accuracy. Let xj = jh, j = 0, 1, . . . , N, with h = 2π/N. Then we can approximate

$$c_k \approx \frac{h}{2\pi}\left[\frac12 f(x_0)e^{-ikx_0} + \sum_{j=1}^{N-1}f(x_j)e^{-ikx_j} + \frac12 f(x_N)e^{-ikx_N}\right].$$

But due to periodicity f(x0)e−ikx0 = f(xN)e−ikxN so we have

$$c_k \approx \frac{1}{N}\sum_{j=1}^{N}f(x_j)e^{-ikx_j} = \frac{1}{N}\sum_{j=0}^{N-1}f(x_j)e^{-ikx_j}, \tag{4.22}$$

and for the real Fourier coefficients we have

$$a_k \approx \frac{2}{N}\sum_{j=1}^{N}f(x_j)\cos kx_j = \frac{2}{N}\sum_{j=0}^{N-1}f(x_j)\cos kx_j, \tag{4.23}$$
$$b_k \approx \frac{2}{N}\sum_{j=1}^{N}f(x_j)\sin kx_j = \frac{2}{N}\sum_{j=0}^{N-1}f(x_j)\sin kx_j. \tag{4.24}$$

4.2 Interpolating Fourier Polynomial

Let f be a 2π-periodic function and let xj = j(2π/N), j = 0, 1, . . . , N, be equidistributed nodes in [0, 2π]. The interpolation problem is to find a trigonometric polynomial Sn of lowest order such that Sn(xj) = f(xj) for j = 0, 1, . . . , N. Because of periodicity, f(x0) = f(xN), so we only have N independent values.


If we take N = 2n, then Sn has 2n + 1 = N + 1 coefficients, so we need one more condition. At the equidistributed nodes, the sine term of highest frequency vanishes, since sin((N/2)xj) = sin(jπ) = 0, so the coefficient bN/2 is irrelevant for interpolation. We thus look for an interpolating trigonometric polynomial of the form

$$P_N(x) = \frac12 a_0 + \sum_{k=1}^{N/2-1}(a_k\cos kx + b_k\sin kx) + \frac12 a_{N/2}\cos\Big(\frac{N}{2}x\Big). \tag{4.25}$$

The convenience of the 1/2 factor in the last term will be seen in the formulas for the coefficients below. It is conceptually and computationally simpler to work with the corresponding polynomial in complex form

$$P_N(x) = \sideset{}{''}\sum_{k=-N/2}^{N/2} c_k e^{ikx}, \tag{4.26}$$

where the double prime on the sum means that the first and last terms (for k = −N/2 and k = N/2) carry a factor of 1/2. It is also understood that c−N/2 = cN/2, which is equivalent to the bN/2 = 0 condition in (4.25).

Theorem 4.2. The trigonometric polynomial

$$P_N(x) = \sideset{}{''}\sum_{k=-N/2}^{N/2} c_k e^{ikx}, \quad\text{with}\quad c_k = \frac{1}{N}\sum_{j=0}^{N-1}f(x_j)e^{-ikx_j}, \quad k = -\frac{N}{2},\dots,\frac{N}{2},$$

interpolates f at the equidistributed points xj = j(2π/N), j = 0, 1, . . . , N.

Proof. We have

$$P_N(x) = \sideset{}{''}\sum_{k=-N/2}^{N/2} c_k e^{ikx} = \sum_{j=0}^{N-1}f(x_j)\,\frac{1}{N}\sideset{}{''}\sum_{k=-N/2}^{N/2} e^{ik(x-x_j)}.$$

Defining

$$l_j(x) = \frac{1}{N}\sideset{}{''}\sum_{k=-N/2}^{N/2} e^{ik(x-x_j)} \tag{4.27}$$


we have

$$P_N(x) = \sum_{j=0}^{N-1}f(x_j)\,l_j(x). \tag{4.28}$$

Thus, we only need to prove that for j and m in the range 0, . . . , N − 1,

$$l_j(x_m) = \begin{cases} 1 & \text{for } m = j,\\ 0 & \text{for } m \ne j.\end{cases} \tag{4.29}$$

We have

$$l_j(x_m) = \frac{1}{N}\sideset{}{''}\sum_{k=-N/2}^{N/2} e^{ik(m-j)2\pi/N}.$$

But e^{i(±N/2)(m−j)2π/N} = e^{±i(m−j)π} = (−1)^{m−j}, so we can combine the first and the last terms and remove the prime from the sum:

$$\begin{aligned}
l_j(x_m) &= \frac{1}{N}\sum_{k=-N/2}^{N/2-1} e^{ik(m-j)2\pi/N}\\
&= \frac{1}{N}\sum_{k=-N/2}^{N/2-1} e^{i(k+N/2)(m-j)2\pi/N}\,e^{-i(N/2)(m-j)2\pi/N}\\
&= e^{-i(m-j)\pi}\,\frac{1}{N}\sum_{k=0}^{N-1} e^{ik(m-j)2\pi/N},
\end{aligned}$$

and, as we proved in the Introduction,

$$\frac{1}{N}\sum_{k=0}^{N-1} e^{-ik(j-m)2\pi/N} = \begin{cases} 0 & \text{if } j-m \text{ is not divisible by } N,\\ 1 & \text{otherwise.}\end{cases} \tag{4.30}$$

Then (4.29) follows and

$$P_N(x_m) = f(x_m), \quad m = 0,1,\dots,N-1.$$

Page 82: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

4.2. INTERPOLATING FOURIER POLYNOMIAL 73

Using the relations (4.12) between the ck and the ak and bk coefficientswe find that

PN(x) =1

2a0 +

N/2−1∑k=1

(ak cos kx+ bk sin kx) +1

2aN/2 cos

(N

2x

)interpolates f at the equidistributed nodes xj = j(2π/N), j = 0, 1, . . . , N ifand only if

ak =2

N

N∑j=1

f(xj) cos kxj, (4.31)

bk =2

N

N∑j=1

f(xj) sin kxj. (4.32)

A smooth periodic function f can be approximated very accurately by itsFourier interpolant PN . Note that derivatives of PN can be easily computed

P(p)N (x) =

N/2∑′′

k=−N/2

(ik)pckeikx (4.33)

The discrete Fourier coefficients of the p-th derivative of PN are (ik)pck.Thus, once these Fourier coefficients have been computed a very accurateapproximation of the derivatives of f is obtained, f (p)(x) ≈ P

(p)N (x). Fig-

ure 5.1 shows the approximation of f(x) = sinxecosx on [0, 2π] by P8. Thegraph of f and of the Fourier interpolant are almost indistinguishable.

Let us go back to the complex Fourier interpolant (4.26). Its coefficientsck are periodic of period N ,

ck+N =1

N

N−1∑j=0

fje−i(k+N)xj =

1

N

N−1∑j=0

fje−ikxje−ij2π = ck (4.34)

and in particular c−N/2 = cN/2. Using the interpolation property and settingfj = f(xj), we have

fj =

N/2∑′′

k=−N/2

ckeikxj =

N/2−1∑k=−N/2

ckeikxj (4.35)

Page 83: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

74 CHAPTER 4. TRIGONOMETRIC APPROXIMATION

0 1 2 3 4 5 6−1.5

−1

−0.5

0

0.5

1

1.5

Figure 4.1: S8(x) for f(x) = sin xecosx on [0, 2π].

and

N/2−1∑k=−N/2

ckeikxj =

−1∑k=−N/2

ckeikxj +

N/2−1∑k=0

ckeikxj

=N−1∑k=N/2

ckeikxj +

N/2−1∑k=0

ckeikxj =

N−1∑k=0

ckeikxj ,

(4.36)

where we have used that ck+N = ck. Combining this with the formula for theck’s we get Discrete Fourier Transform (DFT) pair

ck =1

N

N−1∑j=0

fje−ikxj , k = 0, . . . , N − 1, (4.37)

fj =N−1∑k=0

ckeikxj , j = 0, . . . , N − 1. (4.38)

The set of discrete coefficients (4.39) is known as DFT of the periodic arrayf0, f1, . . . , fN−1 and (4.40) is referred to as the Inverse DFT.

The direct evaluation of the DFT is computationally expensive, it re-quires order N2 operations. However, there is a remarkable algorithm which

Page 84: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

4.3. THE FAST FOURIER TRANSFORM 75

achieves this in merely order N logN operations. This is known as the FastFourier Transform.

4.3 The Fast Fourier Transform

The DFT was defined as

ck =1

N

N−1∑j=0

fje−ikxj , k = 0, . . . , N − 1, (4.39)

fj =N−1∑k=0

ckeikxj , j = 0, . . . , N − 1. (4.40)

The direct computation of either (4.39) or (4.40) is requires order N2 op-erations. As N increasing the cost quickly becomes prohibitive. In manyapplications N could easily be on the order of thousands, millions, etc.

One of the top algorithms of all times is the Fast Fourier Transform(FFT). It is usually attributed to Cooley and Tukey (1965) but its origin canbe tracked back to C. F. Gauss (1777-1855). We now look at the main ideasof this famous and widely used algorithm.

Let us define dk = Nck for k = 0, 1, . . . , N − 1. Then we can rewrite(4.39) as

dk =N−1∑j=0

fjωkjN , (4.41)

where ωN = e−i2π/N . In matrix formd0

d1...

dN−1

=

ω0N ω0

N · · · ω0N

ω0N ω1

N · · · ωN−1N

......

. . ....

ω0N ωN−1

N · · · ω(N−1)2

N

f0

f1...

fN−1

(4.42)

Let us call the matrix on the right FN . Then FN is N times the matrix ofthe DFT and the matrix of the Inverse DFT is simply the complex conjugate

Page 85: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

76 CHAPTER 4. TRIGONOMETRIC APPROXIMATION

of FN . This follows from the identities:

1 + ωN + ω2N + · · ·+ ωN−1

N = 0,

1 + ω2N + ω4

N + · · ·+ ω2(N−1)N = 0,

...

1 + ωN−1N + ω

2(N−1)N + · · ·+ ω

(N−1)2

N = 0,

1 + ωNN + ω2NN + · · ·+ ω

N(N−1)N = N.

We already proved the first of these identities. This geometric sum is equalto (1 − ωNN )/(1 − ωN), which in turn is equal to zero because ωNN = 1. Theother are proved similarly. We can summarize these identities as

N−1∑k=0

ωjkN =

0 j 6= 0 (mod N)

N j = 0 (mod N),(4.43)

where j = k (mod N) means j − k is an integer multiple of N . Then

1

N(FN FN)jl =

1

N

N−1∑k=0

ωjkN ω−lkN =

1

N

N−1∑k=0

ω(j−l)kN =

0 j 6= l (mod N)

1 j = l (mod N).

(4.44)

which shows that FN is the inverse of 1NFN .

Let N = 2n. Going back to (4.41), if we split the even-numbered and theodd-numbered points we have

dk =n−1∑j=0

f2jω2jkN +

n−1∑j=0

f2j+1ω(2j+1)kN (4.45)

But

ω2jkN = e−i2jk

2πN = e

−ijk 2πN2 = e−ijk

2πn = ωkjn , (4.46)

ω(2j+1)kN = e−i(2j+1)k 2π

N = e−ik2πN e−i2jk

2πN = ωkNω

kjn . (4.47)

So denoting f ej = f2j and f 0j = f2j+1, we get

dk =n−1∑j=0

f ej ωjkn + ωkN

n−1∑j=0

f 0j ω

jkn (4.48)

Page 86: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

4.3. THE FAST FOURIER TRANSFORM 77

We have reduced the problem to two DFT of size n = N2

plus N multiplica-tions (and N sums). The numbers ωkN , k = 0, 1, . . . , N − 1 only depend onN so they can be precomputed once and stored for other DFT of the samesize N .

If N = 2p, for p positive integer, we can repeat the process to reduce eachof the DFT’s of size n to a pair of DFT’s of size n/2 plus n multiplications(and n additions), etc. We can do this p times so that we end up with 1-pointDFT’s, which require no multiplications!

Let us count the number of operations in the FFT algorithm. For simplic-ity, let is count only the number of multiplications (the numbers of additionsis of the same order). Let mN be the number of multiplications to computethe DFT for a periodic array of size N and assume that N = 2p. Then

mN = 2mN2

+N

= 2m2p−1 + 2p

= 2(2m2p−2 + 2p−1) + 2p

= 22m2p−2 + 2 · 2p

= · · ·= 2pm20 + p · 2p = p · 2p

= N log2N,

where we have used that m20 = m1 = 0 ( no multiplication is needed forDFT of 1 point). To illustrate the savings, if N = 220, with the FFT we canobtain the DFT (or the Inverse DFT) in order 20× 220 operations, whereasthe direct methods requires order 240, i.e. a factor of 1

20220 ≈ 52429 more

operations.The FFT can also be implemented efficiently when N is the product of

small primes. A very efficient implementation of the FFT is the FFTW(“the Fastest Fourier Transform in the West”), which employs a variety ofcode generation and runtime optimization techniques and is a free software.

Page 87: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

78 CHAPTER 4. TRIGONOMETRIC APPROXIMATION

Page 88: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

Chapter 5

Least Squares Approximation

5.1 Continuous Least Squares Approximation

Let f be a continuous function on [a, b]. We would like to find the bestapproximation to f by a polynomial of degree at most n in the L2 norm. Wehave already studied this problem for the approximation of periodic functionsby trigonometric (complex exponential) polynomials. The problem is to finda polynomial pn of degree ≤ n such that∫ b

a

[f(x)− pn(x)]2dx = min (5.1)

Such polynomial, is also called the Least Squares approximation to f . As anillustration let us consider n = 1. We look for p1(x) = a0 +a1x, for x ∈ [a, b],which minimizes

J(a0, a1) =

∫ b

a

[f(x)− p1(x)]2dx

=

∫ b

a

f 2(x)dx− 2

∫ b

a

f(x)(a0 + a1x)dx+

∫ b

a

(a0 + a1x)2dx.

(5.2)

J(a0, a1) is a quadratic function of a0 and a1 and thus a necessary conditionfor the minimum is that it is a critical point:

∂J(a0, a1)

∂a0

= −2

∫ b

a

f(x)dx+ 2

∫ b

a

(a0 + a1x)dx = 0,

∂J(a0, a1)

∂a1

= −2

∫ b

a

xf(x)dx+ 2

∫ b

a

(a0 + a1x)x dx = 0,

79

Page 89: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

80 CHAPTER 5. LEAST SQUARES APPROXIMATION

which yields the following linear 2× 2 system for a0 and a1:(∫ b

a

1dx

)a0 +

(∫ b

a

xdx

)a1 =

∫ b

a

f(x)dx, (5.3)(∫ b

a

xdx

)a0 +

(∫ b

a

x2dx

)a1 =

∫ b

a

xf(x)dx. (5.4)

These two equations are known as the Normal Equations for n = 1.

Example 5.1. Let f(x) = ex for x ∈ [0, 1]. Then∫ 1

0

exdx = e− 1, (5.5)∫ 1

0

xexdx = 1, (5.6)

and the normal equations are

a0 +1

2a1 = e− 1, (5.7)

1

2a0 +

1

3a1 = 1, (5.8)

whose solution is a0 = 4e − 10, a1 = −6e + 18. Therefore the least squaresapproximation to f(x) = ex by a linear polynomial is

p1(x) = 4e− 10 + (18− 6e)x.

p1 and f are plotted in Fig. 5.1

The Least Squares Approximation to a function f on an interval [a, b] bya polynomial of degree at most n is the best approximation of f in the L2

norm by that class of polynomials. It is the polynomial

pn(x) = a0 + a1x+ · · ·+ anxn

such that ∫ b

a

[f(x)− pn(x)]2dx = min . (5.9)

Page 90: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.1. CONTINUOUS LEAST SQUARES APPROXIMATION 81

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10.8

1

1.2

1.4

1.6

1.8

2

2.2

2.4

2.6

2.8

P1(x)

f(x)=ex

Figure 5.1: The function f(x) = ex on [0, 1] and its Least Squares Approxi-mation p1(x) = 4e− 10 + (18− 6e)x.

Defining this squared L2 error as J(a0, a1, ..., an) we have

J(a0, a1, ..., an) =

∫ b

a

[f(x)− (a0 + a1x+ · · ·+ anxn)]2dx

=

∫ b

a

f 2(x)dx− 2n∑k=0

ak

∫ b

a

xkf(x)dx+n∑k=0

n∑l=0

akal

∫ b

a

xk+ldx.

J(a0, a1, ..., an) is a quadratic function of the parameters a0, ..., an. A neces-sary condition is that the set of a0, ..., an which minimizes J is the criticalpoint. That is

0 =∂J(a0, a1, . . . , an)

∂am

= −2

∫ b

a

xmf(x)dx+n∑k=0

ak

∫ b

a

xk+mdx+n∑l=0

al

∫ b

a

xl+mdx

= −2

∫ b

a

xmf(x)dx+ 2n∑k=0

ak

∫ b

a

xk+mdx, m = 0, 1, . . . , n

Page 91: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

82 CHAPTER 5. LEAST SQUARES APPROXIMATION

and we get the Normal equations

n∑k=0

ak

∫ b

a

xk+mdx =

∫ b

a

xmf(x)dx, m = 0, 1, ..., n. (5.10)

We can write the Normal Equations in matrix as

∫ b

a

1dx

∫ b

a

xdx · · ·∫ b

a

xndx

∫ b

a

xdx

∫ b

a

x2dx · · ·∫ b

a

xn+1dx

......

...∫ b

a

xndx

∫ b

a

xn+1dx · · ·∫ b

a

x2ndx

a0

a1

...

an

=

∫ b

a

f(x)dx

∫ b

a

xf(x)dx

...∫ b

a

xnf(x)dx

. (5.11)

The matrix in this system is clearly symmetric. Denoting this matrix by Hand a = [a0 a1 · · · an]T any n+ 1 row vector, we have

aTHa =n∑i=0

n∑j=0

aiajHij

=n∑i=0

n∑j=0

aiaj

∫ b

a

xi+jdx

=n∑i=0

n∑j=0

∫ b

a

aixiajx

jdx

=

∫ b

a

n∑i=0

n∑j=0

aixiajx

jdx

=

∫ b

a

(n∑j=0

ajxj)2dx ≥ 0.

Moreover, aTHa = 0 if and only if a = 0, i.e. H is positive definite. Thisimplies that H is nonsigular and hence there is a unique solution to (5.11).For if there is a 6= 0 such that Ha = 0 then aTHa = 0 contradicting thefact that H is positive definite. Furthermore, the Hessian, ∂2J

∂ai∂ajis equal to

2H so it is also positive definite and therefore the critical point is indeed aminimum.

Page 92: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.1. CONTINUOUS LEAST SQUARES APPROXIMATION 83

In the interval [0, 1],

H =

1

1

1

2· · · 1

n+ 1

1

2

1

3· · · 1

n+ 2...

.... . .

...1

n+ 1

1

n+ 2· · · 1

2n+ 1

, (5.12)

which is known as the [(n+ 1)× (n+ 1)] Hilbert matrix.

In principle, the direct process to obtain the Least Squares Approximationpn to f is to solve the normal equations (5.10) for the coefficients a0, a1, . . . , anand set pn(x) = a0 + a1x + . . . anx

n. There are however two problems withthis approach:

1. It is difficult to solve this linear system numerically for even moderate nbecause the matrix H is very sensitive to small perturbations and thissensitivity increases rapidly with n. For example, numerical solutionsin double precision (about 16 digits of accuracy) of a linear system withthe Hilbert matrix (5.12) will lose all accuracy for n ≥ 11.

2. If we want to increase the degree of the approximating polynomial weneed to start over again and solve a larger set of normal equations.That is, we cannot use the a0, a1, . . . , an we already found.

It is more efficient and easier to solve the Least Squares Approximationproblem using orthogonality, as we did with approximation by trigonometricpolynomials. Suppose that we have a set of polynomials defined on an interval[a, b],

φ0, φ1, ..., φn,

such that φk is a polynomial of degree k. Then, we can write any polyno-mial of degree at most n as a linear combination of these polynomials. Inparticular, the Least Square Approximating polynomial pn can be written as

pn(x) = a0φ0(x) + a1φ1(x) + ...+ anφn(x) =n∑k=0

akφk(x),

Page 93: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

84 CHAPTER 5. LEAST SQUARES APPROXIMATION

for some coefficients a0, . . . , an to be determined. Then

J(a0, ..., an) =

∫ b

a

[f(x)−n∑k=0

akφj(x)]2dx

=

∫ b

a

f 2(x)dx− 2n∑k=0

aj

∫ b

a

φk(x)f(x)dx

+n∑k=0

n∑l=0

akal

∫ b

a

φk(x)φl(x)dx

(5.13)

and

0 =∂J

∂am= −2

∫ b

a

φm(x)f(x)dx+ 2n∑k=0

ak

∫ b

a

φk(x)φm(x)dx.

for m = 0, 1, . . . , n, which gives the normal equations

n∑k=0

ak

∫ b

a

φk(x)φm(x)dx =

∫ b

a

φm(x)f(x)dx, m = 0, 1, ..., n. (5.14)

Now, if the set of approximating functions φ0, ....φn is orthogonal, i.e.∫ b

a

φk(x)φm(x)dx = 0 if k 6= m (5.15)

then the coefficients of the least squares approximation are explicitly givenby

am =1

αm

∫ b

a

φm(x)f(x)dx, αm =

∫ b

a

φ2m(x)dx, m = 0, 1, ..., n. (5.16)

andpn(x) = a0φ0(x) + a1φ1(x) + ...+ anφn(x).

Note that if the set φ0, ....φn is orthogonal, (5.16) and (5.13) imply theBessel inequality

n∑k=0

αka2k ≤

∫ b

a

f 2(x)dx. (5.17)

Page 94: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.2. LINEAR INDEPENDENCE ANDGRAM-SCHMIDT ORTHOGONALIZATION85

This inequality shows that if f is square integrable, i.e. if∫ b

a

f 2(x)dx <∞,

then the series∞∑k=0

αka2k converges.

We can consider the Least Squares approximation for a class of linearcombinations of orthogonal functions φ0, ..., φn not necessarily polynomials.We saw an example of this with Fourier approximations 1. It is convenientto define a weighted L2 norm associated with the Least Squares problem

‖f‖w,2 =

(∫ b

a

f 2(x)w(x)dx

) 12

, (5.18)

where w(x) ≥ 0 for all x ∈ (a, b) 2 . A corresponding inner product is definedby

〈f, g〉 =

∫ b

a

f(x)g(x)w(x)dx. (5.19)

Definition 5.1. A set of functions φ0, ..., φn is orthogonal, with respect tothe weighted inner product (5.19), if 〈φk, φl〉 = 0 for k 6= l.

5.2 Linear Independence and Gram-Schmidt

Orthogonalization

Definition 5.2. A set of functions φ0(x), ..., φn(x) defined on an interval[a, b] is said to be linearly independent if

a0φ0(x) + a1φ1(x) + . . . anφn(x) = 0, for all x ∈ [a, b] (5.20)

then a0 = a1 = . . . = an = 0. Otherwise, it is said to be linearly dependent.

1For complex-valued functions orthogonality means

∫ b

a

φk(x)φl(x)dx = 0 if k 6= l,

where the bar denotes the complex conjugate

2More precisely, we will assume w ≥ 0,

∫ b

a

w(x)dx > 0, and

∫ b

a

xkw(x)dx < +∞ for

k = 0, 1, . . .. We call such a w an admissible weight function.

Page 95: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

86 CHAPTER 5. LEAST SQUARES APPROXIMATION

Example 5.2. The set of functions φ0(x), ..., φn(x), where φk(x) is a poly-nomial of degree k for k = 0, 1, . . . , n is linearly independent on any interval[a, b]. For a0φ0(x) + a1φ1(x) + . . . anφn(x) is a polynomial of degree at mostn and hence a0φ0(x) + a1φ1(x) + . . . anφn(x) = 0 for all x in a given interval[a, b] implies a0 = a1 = . . . = an = 0.

Given a set of linearly independent functions φ0(x), ..., φn(x) we canproduce an orthogonal set ψ0(x), ..., ψn(x) by doing the Gram-Schmidtprocedure:

ψ0(x) = φ0(x)

ψ1(x) = φ1(x)− c0ψ0(x),

〈ψ1, ψ0〉 = 0⇒ c0 =〈φ1, ψ0〉〈ψ0, ψ0〉

ψ2(x) = φ2(x)− c0ψ0(x)− c1ψ1(x),

〈ψ2, ψ0〉 = 0⇒ c0 =〈φ2, ψ0〉〈ψ0, ψ0〉

〈ψ2, ψ1〉 = 0⇒ c1 =〈φ2, ψ1〉〈ψ1, ψ1〉

...

We can write this procedure recursively as

ψ0(x) = φ0(x),

ψk(x) = φk(x)−k−1∑j=0

cjψj(x), cj =〈φk, ψj〉〈ψj, ψj〉

.(5.21)

5.3 Orthogonal Polynomials

Let us take the set 1, x, . . . , xn on a interval [a, b]. We can use the Gram-Schmidt process to obtain an orthogonal set ψ0(x), ..., ψn(x) of polynomialswith respect to the inner product (5.19). Each of the ψk is a polynomial ofdegree k, determined up to a multiplicative constant (orthogonality is notchanged). Suppose we select the ψk(x), k = 0, 1, . . . , n to be monic, i.e.the coefficient of xk is 1. Then ψk+1(x) − xψk(x) = rk(x), where rk(x) is a

Page 96: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.3. ORTHOGONAL POLYNOMIALS 87

polynomial of degree at most k. So we can write

ψk+1(x)− xψk(x) = −αkψk(x)− βkψk−1(x) +k−2∑j=0

cjψj(x). (5.22)

Then taking the inner product of this expression with ψk and using orthog-onality we get

−〈xψk, ψk〉 = −αk〈ψk, ψk〉and

αk =〈xψk, ψk〉〈ψk, ψk〉

.

Similarly, taking the inner product with ψk−1 we obtain

−〈xψk, ψk−1〉 = −βk〈ψk−1, ψk−1〉

but 〈xψk, ψk−1〉 = 〈ψk, xψk−1〉 and xψk−1(x) = ψk(x)+qk−1(x), where qk−1(x)is a polynomial of degree at most k − 1 then

〈ψk, xψk−1〉 = 〈ψk, ψk〉+ 〈ψk, qk−1〉 = 〈ψk, ψk〉,

where we have used orthogonality in the last equation. Therefore

βk =〈ψk, ψk〉〈ψk−1, ψk−1〉

.

Finally, taking the inner product of (5.22) with ψm for m = 0, . . . , k − 2 weget

−〈ψk, xψm〉 = cm〈ψm, ψm〉 m = 0, . . . , k − 2

but the left hand side is zero because xψm(x) is a polynomial of degree atmost k − 1 and hence it is orthogonal to ψk(x). Collecting the results weobtain a three-term recursion formula

ψ0(x) = 1, (5.23)

ψ1(x) = x− α0, α0 =〈xψ0, ψ0〉〈ψ0, ψ0〉

(5.24)

for k = 1, . . . n

ψk+1(x) = (x− αk)ψk(x)− βkψk−1(x), (5.25)

αk =〈xψk, ψk〉〈ψk, ψk〉

, (5.26)

βk =〈ψk, ψk〉〈ψk−1, ψk−1〉

. (5.27)

Page 97: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

88 CHAPTER 5. LEAST SQUARES APPROXIMATION

Example 5.3. Let [a, b] = [−1, 1] and w(x) ≡ 1. The corresponding orthog-onal polynomials are known as the Legendre Polynomials and are widelyused in a variety of numerical methods. Because xψ2

k(x)w(x) is an odd func-tion it follows that αk = 0 for all k. We have ψ0(x) = 1 and ψ1(x) = x. Wecan now use the three-term recursion (5.25) to obtain

β1 =

∫ 1

−1

x2dx∫ 1

−1

dx

= 1/3

and ψ2(x) = x2 − 13. Now for k = 2 we get

β2 =

∫ 1

−1

(x2 − 1

3)2dx∫ 1

−1

x2dx

= 4/15

and ψ3(x) = x(x2− 13)− 4

15x = x3− 3

5x. We now collect Legendre polynomials

we found:

ψ0(x) = 1,

ψ1(x) = x,

ψ2(x) = x2 − 1

3,

ψ3(x) = x3 − 3

5x,

...

Theorem 5.1. The zeros of orthogonal polynomials are real, simple, andthey all lie in (a, b).

Proof. Indeed, ψk(x) is orthogonal to ψ0(x) = 1 for each k ≥ 1, thus∫ b

a

ψk(x)w(x)dx = 0 (5.28)

i.e. ψk has to change sign in [a, b] so it has a zero, say x1 ∈ (a, b). Suppose x1

is not a simple root, then q(x) = ψk(x)/(x − x1)2 is a polynomial of degree

Page 98: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.3. ORTHOGONAL POLYNOMIALS 89

k − 2 and so

0 = 〈ψk, q〉 =

∫ b

a

ψ2k(x)

(x− x1)2w(x)dx > 0,

which is of course impossible. Assume that ψk(x) has only l zeros in (a, b),x1, . . . , xl. Then ψk(x)(x − x1) · · · (x − xl) = qk−l(x)(x − x1)2 · · · (x − xl)2,where qk−l(x) is a polynomial of degree k − l which does not change sign in[a, b]. Then

〈ψk, (x− x1) · · · (x− xl)〉 =

∫ b

a

qk−j(x)(x− x1)2 · · · (x− xl)2w(x)dx 6= 0

but 〈ψk, (x− x1) · · · (x− xl)〉 = 0 for l < k. Therefore l = k.

5.3.1 Chebyshev Polynomials

We introduced in Section 2.4 the Chebyshev polynomials, which as we haveseen possess remarkable properties. We now add one more important prop-erty of this outstanding class of polynomials, namely orthogonality.

The Chebyshev polynomials are orthogonal with respect to the weightfunction

w(x) =1√

1− x2. (5.29)

Indeed. Recall that Tn(x) = cosnθ, (x = cos θ). Then,∫ 1

−1

Tn(x)Tm(x)√1− x2

dx = =

∫ π

0

cosnθ cosmθdθ (5.30)

and since 2 cosnθ cosmθ = cos(m+ n)θ + cos(m− n)θ, we get for m 6= n∫ 1

−1

Tn(x)Tm(x)√1− x2

dx =1

2

[1

n+msin(n+m)θ +

1

n−msin(n−m)θ

]∣∣∣∣π0

= 0.

Consequently,

∫ 1

−1

Tn(x)Tm(x)√1− x2

dx =

0 m 6= n,π2

m = n > 0,π m = n = 0.

(5.31)

Page 99: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

90 CHAPTER 5. LEAST SQUARES APPROXIMATION

5.4 Discrete Least Squares Approximation

Suppose that we are given the data (x1, f1), (x2, f2), · · · , (xN , fN) obtainedfrom an experiment. Can we find a simple function that appropriately fitsthese data? Suppose that empirically we determine that there is an approx-imate linear behavior between fj and xj, j = 1, . . . , N . What is the straightline y = a0 + a1x that best fits these data? The answer depends on howwe measure the error, the deviations fj − (a0 + a1xj), i.e. which norm weuse for the error. The most convenient measure is the squared of the 2-norm(Euclidean norm) because we will end up with a linear system of equationsto find a0 and a1. Other norms will yield a nonlinear system for the unknownparameters. So the problem is: Find a0, a1 which minimize

J(a0, a1) =N∑j=1

[fj − (a0 + a1xj)]2. (5.32)

We can repeat all the Least Squares Approximation theory that we haveseen at the continuum level except that integrals are replaced by sums. Theconditions for the minimum

∂J(a0, a1)

∂a0

= 2N∑j=1

[fj − (a0 + a1xj)](−1) = 0, (5.33)

∂J(a0, a1)

∂a1

= 2N∑j=1

[fj − (a0 + a1xj)](−xj) = 0, (5.34)

produce the Normal Equations:

a0

N∑j=1

1 + a1

N∑j=1

xj =N∑j=1

fj, (5.35)

a0

N∑j=1

xj + a1

N∑j=1

x2j =

N∑j=1

xjfj. (5.36)

Approximation by a higher order polynomial

pn(x) = a0 + a1x+ · · ·+ anxn, n < N − 1

Page 100: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.4. DISCRETE LEAST SQUARES APPROXIMATION 91

is similarly done. In practice we would like n << N to avoid overfitting. Wefind the a0, a1, ..., an which minimize

J(a0, a1, ..., an) =N∑j=1

[fj − (a0 + a1xj + · · ·+ anxnj )]2. (5.37)

Proceeding as in the continuum case, we get the Normal Equations

n∑k=0

ak

N∑j=1

xk+mj =

N∑j=1

xmj fj, m = 0, 1, ..., n. (5.38)

That is

a0

N∑j=1

x0j + a1

N∑j=1

x1j + · · ·+ an

N∑j=1

xnj =N∑j=1

x0jfj

a0

N∑j=1

x1j + a1

N∑j=1

x2j + · · ·+ an

N∑j=1

xn+1j =

N∑j=1

x1jfj

...

a0

N∑j=1

xnj + a1

N∑j=1

xn+1j + · · ·+ an

N∑j=1

x2nj =

N∑j=1

xnj fj.

The matrix of coefficients of this linear system is, as in the continuum case,symmetric, positive definite, and highly sensitive to small perturbations inthe data. And again, if we want to increase the degree of the approximatingpolynomial we cannot reuse the coefficients we already computed for the lowerorder polynomial. But now we know how to go around these two problems:we use orthogonality.

Let us define the (weighted) discrete inner product as

〈f, g〉N =N∑j=1

f(xj)g(xj)ωj, (5.39)

where ωj > 0, for j = 1, . . . , N are given weights. Let φ0, ..., φn be a setof polynomials such that φk is of degree exactly k. Then the solution to the

Page 101: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

92 CHAPTER 5. LEAST SQUARES APPROXIMATION

discrete Least Squares Approximation problem by a polynomial of degree≤ n can be written as

pn(x) = a0φ0(x) + a1φ1(x) + · · ·+ anφn(x) (5.40)

and the square of the error is given by

J(a0, · · · , an) =N∑j=1

[fj − (a0φ0(xj) + · · ·+ anφn(xj))]2ωj. (5.41)

Consequently, the normal equations are

n∑k=0

ak〈φk, φl〉N = 〈φl, f〉N , l = 0, 1, ..., n. (5.42)

If φ0, · · · , φn are orthogonal with respect to the inner product 〈·, ·〉N ,i.e. if 〈φk, φl〉N = 0 for k 6= l, then the coefficients of the Least SquaresApproximation are given by

ak =〈φk, f〉N〈φk, φk〉N

, k = 0, 1, ..., n (5.43)

and pn(x) = a0φ0(x) + a1φ1(x) + · · ·+ anφn(x).If the φ0, · · · , φn are not orthogonal we can produce an orthogonal set

ψ0, · · · , ψn using the 3-term recursion formula adapted to the discrete innerproduct. We have

ψ0(x) ≡ 1, (5.44)

ψ1(x) = x− α0, α0 =〈xψ0, ψ0〉N〈ψ0, ψ0〉N

, (5.45)

for k = 1, ..., n

ψk+1(x) = (x− αk)ψk(x)− βkψk−1(x), (5.46)

αk =〈xψk, ψk〉N〈ψk, ψk〉N

, βk =〈ψk, ψk〉N〈ψk−1, ψk−1〉N

. (5.47)

Then,

aj =〈ψj, f〉N〈ψj, ψj〉N

, j = 0, ..., n. (5.48)

and the Least Squares Approximation is pn(x) = a0ψ0(x) + a1ψ1(x) + · · · +anψn(x).

Page 102: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.4. DISCRETE LEAST SQUARES APPROXIMATION 93

Example 5.4. Suppose we are given the data: xj : 0, 1, 2, 3, fj = 1.1, 3.2, 5.1, 6.9and we would like to fit to a line. The normal equations are

a0

4∑j=1

1 + a1

4∑j=1

xj =4∑j=1

fj (5.49)

a0

4∑j=1

xj + a1

4∑j=1

x2j =

4∑j=1

xjfj (5.50)

and performing the sums we have

4a0 + 6a1 = 16.3, (5.51)

6a0 + 14a1 = 34.1. (5.52)

Solving this 2 × 2 linear system we get a0 = 1.18 and a1 = 1.93. Thus, theLeast Squares Approximation is

p1(x) = 1.18 + 1.93x

and the square of the error is

J(1.18, 193) =4∑j=1

[fj − (1.18 + 1.93xj)]2 = 0.023.

Example 5.5. Fitting to an exponential y = beaxk . Defining

J(a, b) =N∑j=1

[fj − beaxj ]2

we get the conditions for a and b

∂J

∂a= 2

N∑j=1

[fj − beaxj ](−bxjeaxj) = 0, (5.53)

∂J

∂b= 2

N∑j=1

[fj − beaxj ](−eaxj) = 0. (5.54)

Page 103: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

94 CHAPTER 5. LEAST SQUARES APPROXIMATION

which is a nonlinear system of equations. However, if we take the natural logof y = beaxk we have ln y = ln b+ ax. Defining B = ln b the problem becomeslinear in B and a. Tabulating (xj, ln fj) we can obtain the normal equations

BN∑j=1

1 + a

N∑j=1

xj =N∑j=1

ln fj, (5.55)

BN∑j=1

xj + a

N∑j=1

x2j =

N∑j=1

xj ln fj, (5.56)

and solve this linear system for B and a. Then b = eB and a = a.If a is given and we only need to determine b then the problem is linear.

From (5.54) we have

b

N∑j=1

e2axj =N∑j=1

fjeaxj ⇒ b =

N∑j=1

fjeaxj

N∑j=1

e2axj

Example 5.6. Discrete orthogonal polynomials. Let us construct the firstfew orthogonal polynomials with respect to the discrete inner product withω ≡ 1 and xj = j

N, j = 1, ..., N . Here N = 10 (the points are equidistributed

in [0, 1]). We have ψ0(x) = 1 and ψ1(x) = x− α0, where

α0 =〈xψ0, ψ0〉N〈ψ0, ψ0〉N

=

∑Nj=1 xj∑Nj=1 1

= 0.55.

and hence ψ1(x) = x− 0.55. Now

ψ2(x) = (x− α1)ψ1(x)− β1ψ0(x), (5.57)

α1 =〈xψ1, ψ1〉N〈ψ1, ψ1〉N

=

∑Nj=1 xj(xj − 0.55)2∑Nj=1(xj − 0.55)2

= 0.55, (5.58)

β1 =〈ψ1, ψ1〉N〈ψ0, ψ0〉N

= 0.0825. (5.59)

Therefore ψ2(x) = (x − 0.55)2 − 0.0825. We can now use these orthogonalpolynomials to find the Least Squares Approximation by polynomial of degree

Page 104: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.5. HIGH-DIMENSIONAL DATA FITTING 95

at most two of a given set of data. Let us take fj = x2j + 2xj + 3. Clearly, the

Least Squares Approximation should be p2(x) = x2 + 2x + 3. Let us confirmthis by using the orthogonal polynomials ψ0, ψ1 and ψ2. The Least SquaresApproximation coefficients are given by

a0 =〈f, ψ0〉N〈ψ0, ψ0〉N

= 4.485, (5.60)

a1 =〈f, ψ1〉N〈ψ1, ψ1〉N

= 3.1, (5.61)

a2 =〈f, ψ2〉N〈ψ2, ψ2〉N

= 1, (5.62)

which gives, p2(x) = (x−0.55)2−0.0825+(3.1)(x−0.55)+4.485 = x2+2x+3.

5.5 High-dimensional Data Fitting

In many applications each data point contains many variables. For example,a value for each pixel in an image, or clinical measurements of a patient, etc.We can put all these variables in a vector x ∈ Rd for d ≥ 1. Associated withx there is a scalar quantity f that can be measured or computed so thatour data set consists of the points (xj, fj), where xj ∈ Rd and fj ∈ R, forj = 1, . . . , N .

A central problem is that of predicting f from a given large, high-dimensionaldata set. The simplest and most commonly used approach is to postulate alinear relation

f(x) = a0 + aTx (5.63)

and determine the so-called bias coefficient a0 and the vector a = [a1, . . . , ad]T

as a least squares solution, i.e. such that it minimizes the deviations of thedata:

N∑j=1

[fj − (a0 + aTxj)]2.

We have studied in detail the case d = 1. Here we are interested in the cased >> 1.

If we add an extra component, equal to 1, to each data vector Xj so thatnow xj = [1, xj1, . . . , xjd]

T , for j = 1, . . . , N , then we can write (5.63) as

f(x) = aTx (5.64)

Page 105: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

96 CHAPTER 5. LEAST SQUARES APPROXIMATION

and the dimension d is increased by one. Then we are seeking a vector a ∈ Rd

that minimizes

J(a) =N∑j=1

[fj − aTxj]2. (5.65)

Putting the data xj as rows of an N × d (N ≥ d) matrix X and fj as thecomponents of a (column) vector f , i.e.

X =

x1

x2...

xN

and f =

f1

f2...fN

(5.66)

we can write (5.65) as

J(a) = (f −Xa)T (f −Xa) = ‖f −Xa‖2. (5.67)

The normal equations are given by the condition∇aJ(a) = 0. Since∇aJ(a) =−2XT f + 2XTXa, we get the linear system of equations

XTXa = XT f . (5.68)

Every solution of the least square problem is necessarily a solution of thenormal equations. We will prove that the converse is also true and that thesolutions have a geometric characterization.

Let W be the linear space spanned by the columns of X. Clearly, W ⊆RN . Then, the least square problem is equivalent to minimizing ‖f − w‖2

among all vectors w in W . There is always at least one solution, which canbe obtained by projecting f onto W , as Fig. 5.2 illustrates. First, note that ifa ∈ Rd is a solution of the normal equations (5.68) then the residual f −Xais orthogonal to W because

XT (f −Xa) = XT f −XTXa = 0 (5.69)

and a vector r ∈ RN is orthogonal to W if it is orthogonal to each columnof X, i.e. XT r = 0. Let a∗ be a solution of the normal equations, letr = f −Xa∗, and for arbitrary a ∈ Rd, let s = Xa−Xa∗. Then, we have

‖f −Xa‖2 = ‖f −Xa∗ − (Xa−Xa∗)‖2 = ‖r− s‖2. (5.70)

Page 106: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

5.5. HIGH-DIMENSIONAL DATA FITTING 97

f

Xa

f −Xa

W

Figure 5.2: Geometric interpretation of the solution Xa of the Least Squaresproblem as the orthogonal projection of f on the approximating linear sub-space W .

But r and s are orthogonal. Therefore,

‖r− s‖2 = ‖r‖2 + ‖s‖2 ≥ ‖r‖2 (5.71)

and so we have proved that

‖f −Xa‖2 ≥ ‖f −Xa∗‖2 (5.72)

for arbitrary a ∈ Rd, i.e. a∗ minimizes ‖f −Xa‖2.If the columns of X are linearly independent, i.e. if for every a 6= 0 we

have that Xa 6= 0, then the d× d matrix XTX is positive definite and hencenonsingular. Therefore, in this case, there is a unique solution to the leastsquares problem mina ‖f −Xa‖2 given by

a∗ = (XTX)−1XT f . (5.73)

The d×N matrix

X† = (XTX)−1XT (5.74)

is called the pseudoinverse of the N × d matrix X. Note that if X weresquare and nonsingular X† would coincide with the inverse, X−1.

As we have done in the other least squares problems we seen so far, ratherthan working with the normal equations, whose matrix XTX may be verysensitive to perturbations in the data, we use an orthogonal basis for the ap-proximating subspace (W in this case) to find a solution. While in principlethis can be done by applying the Gram-Schmidt process to the columns of

Page 107: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

98 CHAPTER 5. LEAST SQUARES APPROXIMATION

X, this is a numerically unstable procedure; when two columns are nearlylinearly dependent, errors introduced by the finite precision representationof computer numbers can be largely amplified during the the Gram-Schmidtprocess and render vectors which are not orthogonal. A re-orthogonalizationstep can be introduced in the Gram-Schmidt algorithm to remedy this prob-lem at the price of doubling the computational cost. A more efficient methodusing a sequence of orthogonal transformations, known as Householder reflec-tions, is usually preferred. Once this orthonormalization process is completedwe get a QR factorization of X

X = QR, (5.75)

where Q is an N ×N orthogonal matrix, i.e. QTQ = I, and R is an N × dupper triangular matrix

R =

∗ · · · ∗

. . ....∗

=

[R1

0

]. (5.76)

Here R1 is a d×d upper triangular matrix and the zero stands for a (N−d)×dzero matrix.

Using X = QR we have

‖f −Xa‖2 = ‖f −QRa‖2 = ‖QT f −Ra‖2. (5.77)

Therefore the least square solution is obtained by solving the system Ra =QT f . Writing R in blocks we have[

R1a0

]=

[(QT f)1

(QT f)2

]. (5.78)

so that the solution is found by solving upper triangular systemR1a = (QT f)1

(R1 is nonsingular if the columns of X are linearly independent). The lastN − d equations, (QT f)2 = 0, may be satisfied or not depending on f but wehave no control on them.

Page 108: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

Chapter 6

Computer Arithmetic

6.1 Floating Point Numbers

Floating point numbers are based on scientific notation in binary (base 2).For example

(1.0101)2 × 22 = (1 · 20 + 0 · 2−1 + 1 · 2−2 + 0 · 2−3 + 1 · 2−4)× 22

= (1 +1

4+

1

16)× 4 = 5.2510.

We can write any non-zero real number x in normalized, binary, scientificnotation as

x = ±S × 2E, 1 ≤ S < 2, (6.1)

where S is called the significant or mantissa and E is the exponent. Ingeneral S is an infinite expansion of the form

S = (1.b1b2 · · · )2. (6.2)

In a computer a real number is represented in scientific notation butusing a finite number of binary digits (bits). We call these floating pointnumbers. In single precision (SP), floating point numbers are stored in 32-bit words whereas in double precision (DP), used in most scientific computingapplications, a 64-bit word is employed: 1 bit is used for the sign, 52 bits forS, and 11 bits for E. This memory limits produce a large but finite set offloating point numbers which can be represented in a computer. Moreover,the floating points numbers are not uniformly distributed!

99

Page 109: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

100 CHAPTER 6. COMPUTER ARITHMETIC

The maximum exponent possible in DP would be 211 = 2048 but thisis shifted to allow representation of small and large numbers so that weactually have Emin = −1022, Emax = 1023. Consequently the min and maxDP floating point number which can be represented in DP are

Nmin = minx∈DP

|x| = 2−1022 ≈ 2.2× 10−308, (6.3)

Nmax = maxx∈DP

|x| = (1.1.....1)2 · 21023 = (2− 2−52) · 21023 ≈ 1.8× 10308. (6.4)

If in the course of a computation a number is produced which is bigger thanNmax we get an overflow error and the computation would halt. If the numberis less than Nmin (in absolute value) then an underflow error occurs.

6.2 Rounding and Machine Precision

To represent a real number x as a floating point number, rounding has to beperformed to retain only the numbers of binary bits allowed in the significant.Let x ∈ R and its binary expansion be x = ±(1.b1b2 · · · )2 × 2E.

One way to approximate x to a floating number with d bits in the signif-icant is to truncate or chop discarding all the bits after bd, i.e.

x∗ = chop(x) = ±(1.b1b2 · · · bd)2 × 2E. (6.5)

In double precision d = 52.

A better way to approximate to a floating point number is to do roundingup or down (to the nearest floating point number), just as we do when weround in base 10. In binary, rounding is simpler because bd+1 can only be 0(we round down) or 1 (we round up). We can write this type of rounding interms of the chopping described above as

x∗ = round(x) = chop(x+ 2−(d+1) × 2E). (6.6)

Definition 6.1. Given an approximation x∗ to x the absolute error isdefined by |x− x∗| and the relative error by |x−x∗

x|, x 6= 0.

The relative error is generally more meaningful than the absolute errorto measure a given approximation.

Page 110: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

6.3. CORRECTLY ROUNDED ARITHMETIC 101

The relative error in chopping and in rounding (called a round-off error)is ∣∣∣∣x− chop(x)

x

∣∣∣∣ ≤ 2−d2E

(1.b1b2 · · · )2E≤ 2−d, (6.7)∣∣∣∣x− round(x)

x

∣∣∣∣ ≤ 1

22−d. (6.8)

The number 2−d is called machine precision or epsilon (eps). In doubleprecision eps=2−52 ≈ 2.22 × 10−16. The smallest double precision numbergreater than 1 is 1+eps. As we will see below, it is more convenient to write(6.8) as

round(x) = x(1 + δ), |δ| ≤ eps. (6.9)

6.3 Correctly Rounded Arithmetic

Computers today follow the IEEE standard for floating point representationand arithmetic. This standard requires a consistent floating point represen-tation of numbers across computers and correctly rounded arithmetic.

In correctly rounded arithmetic, the computer operations of addition, sub-traction, multiplication, and division are the correctly rounded value of theexact result.

If x and y are floating point numbers and ⊕ is the machine addition, then

x⊕ y = round(x+ y) = (x+ y)(1 + δ+), |δ+| ≤ eps, (6.10)

and similarly for ,⊗,.One important interpretation of (6.10) is the the following. Assuming

x+ y 6= 0, write

δ+ =1

x+ y[δx + δy].

Then

x⊕ y = (x+ y)

[1 +

1

x+ y(δx + δy)

]= (x+ δx) + (y + δy). (6.11)

The computer ⊕ is giving the exact result but for a sightly perturbed data.This interpretation is the basis for Backward Error Analysis, which isused to study how round-off errors propagate in a numerical algorithm.

Page 111: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

102 CHAPTER 6. COMPUTER ARITHMETIC

6.4 Propagation of Errors and Cancellation

of Digits

Let fl(x) and fl(y) denote the floating point approximation of x and y,respectively, and assume that their product is computed exactly, i.e

fl(x) ·fl(y) = x(1+δx) ·y(1+δy) = x ·y(1+δx+δy+δxδy) ≈ x ·y(1+δx+δy),

where |δx|,|δy| ≤ eps. Therefore, for the relative error we get∣∣∣∣x · y − fl(x) · fl(y)

x · y

∣∣∣∣ ≈ |δx + δy|, (6.12)

which is acceptable.Let us now consider addition (or subtraction):

fl(x) + fl(y) = x(1 + δx) + y(1 + δy) = x+ y + xδx + yδy

= (x+ y)

(1 +

x

x+ yδx +

y

x+ yδy

).

The relative error is∣∣∣∣x+ y − (fl(x) + fl(y))

x+ y

∣∣∣∣ =

∣∣∣∣ x

x+ yδx +

y

x+ yδy

∣∣∣∣ . (6.13)

If x and y have the same sign then xx+y

, yx+y

are both positive and bounded

by 1. Therefore the relative error is less than |δx + δy|, which is fine. Butif x and y have different sign and are close in magnitude the error could belargely amplified because | x

x+y|, | y

x+y| can be very large.

Example 6.1. Suppose we have 10 bits of precision and

x = (1.01011100 ∗ ∗)2 × 2E,

y = (1.01011000 ∗ ∗)2 × 2E,

where the ∗ stands for inaccurate bits (i.e. garbage) that say were generated inprevious floating point computations. Then, in this 10 bit precision arithmetic

z = x− y = (1.00 ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗)2 × 2E−6. (6.14)

We end up with only 2 bits of accuracy in z. Any further computations usingz will result in an accuracy of 2 bits or lower!

Page 112: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

6.4. PROPAGATIONOF ERRORS AND CANCELLATIONOF DIGITS103

Example 6.2. Sometimes we can rewrite the difference of two very closenumbers to avoid digit cancellation. For example, suppose we would like tocompute

y =√

1 + x− 1

for x > 0 and very small. Clearly, we will have loss of digits if we proceeddirectly. However, if we rewrite y as

y = (√

1 + x+ 1)

√1 + x− 1√1 + x+ 1

=x√

1 + x+ 1

then the computation can be performed at nearly machine precision level.

Page 113: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

104 CHAPTER 6. COMPUTER ARITHMETIC

Page 114: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

Chapter 7

Numerical Differentiation

7.1 Finite Differences

Suppose f is a differentiable function and we’d like to approximate f ′(x0)given the value of f at x0 and at neighboring points x1, x2, ..., xn. We couldapproximate f by its interpolating polynomial pn at those points and usef ′(x0) ≈ p′n(x0). There are several other possibilities. For example, we canapproximate f ′(x0) by the derivative of the cubic spline of f evaluated at x0,or by the derivative of the Least Squares Chebyshev expansion of f :

f ′(x0) =n∑j=1

ajT′j(x0),

etc. We are going to focus here on simple, finite difference formulas obtainedby differentiating low order interpolating polynomials.

Assuming x, x0, . . . , xn ∈ [a, b] and f ∈ Cn+1[a, b], we have

f(x) = pn(x) +1

(n+ 1)!f (n+1)(ξ(x))ωn(x), (7.1)

for some ξ(x) ∈ (a, b) and

ωn(x) = (x− x0)(x− x1) · · · (x− xn). (7.2)

Thus,

f ′(x0) = p′n(x0) +1

(n+ 1)!

[d

dxf (n+1)(ξ(x))ωn(x) + f (n+1)(ξ(x))ω′n(x)

]∣∣∣∣x=x0

.

105

Page 115: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

106 CHAPTER 7. NUMERICAL DIFFERENTIATION

But ωn(x0) = 0 and ω′n(x0) = (x0 − x1) · · · (x0 − xn), thus

f ′(x0) = p′n(x0) +1

(n+ 1)!f (n+1)(ξ(x))(x0 − x1) · · · (x0 − xn) (7.3)

Example 7.1. Take n = 1 and x1 = x0 + h (h > 0). In Newton’s form

p1(x) = f(x0) +f(x0 + h)− f(x0)

h(x− x0), (7.4)

and p′1(x0) = 1h[f(x0+h)−f(x0)]. We obtain the so-called Forward Difference

Formula for approximating f ′(x0)

D+h f(x0) :=

f(x0 + h)− f(x0)

h. (7.5)

From (7.3) the error in this approximation is

f ′(x0)−D+h f(x0) =

1

2!f′′(ξ(x))(x0 − x1) = −1

2f ′′(ξ)h. (7.6)

Example 7.2. Take again n = 1 but now x1 = x0 − h. Then p′1(x0) =1h[f(x0) − f(x0 − h)] and we get the so-called Backward Difference Formula

for approximating f ′(x0)

D−h f(x0) :=f(x0)− f(x0 − h)

h. (7.7)

Its error is

f ′(x0)−D−h f(x0) =1

2f ′′(ξ)h. (7.8)

Example 7.3. Let n=2 and x1 = x0−h, x2 = x0 +h. Then, p2 in Newton’sform is

p2(x) = f [x1] + f [x1, x0](x− x1) + f [x1, x0, x2](x− x1)(x− x0).

Let us obtain the finite difference table:

x0 − h f(x0 − h)f(x0)−f(x0−h)

h

x0 f(x0) f(x0+h)−2f(x0)+f(x0+h)2h2

f(x0+h)−f(x0)h

x0 + h f(x0 + h)

Page 116: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

7.1. FINITE DIFFERENCES 107

Therefore,

p′2(x0) =f(x0)− f(x0 − h)

h+f(x0 + h)− 2f(x0) + f(x0 − h)

2h2h

and thus

p′2(x0) =f(x0 + h)− f(x0 − h)

2h. (7.9)

This defines the Centered Difference Formula to approximate f ′(x0)

D0hf(x0) :=

f(x0 + h)− f(x0 − h)

2h. (7.10)

Its error is

f ′(x0)−D0hf(x0) =

1

3!f ′′′(ξ)(x0 − x1)(x0 − x2) = −1

6f ′′′(ξ)h2. (7.11)

Example 7.4. Let n = 2 and x1 = x0 + h, x2 = x0 + 2h. The table of finitedifferences is

x0 f(x0)f(x0+h)−f(x0)

h

x0 + h f(x0 + h) f(x0+2h)−2f(x0+h)+f(x0)2h2

f(x0+2h)−f(x0+h)h

x0 + 2h f(x0 + 2h)

and

p′2(x0) =f(x0 + h)− f(x0)

h+f(x0 + 2h)− 2f(x0 + h) + f(x0)

2h2h

thus

p′2(x0) =−f(x0 + 2h) + 4f(x0 + h)− 3f(x0)

2h. (7.12)

If we use this sided difference to approximate f ′(x0), the error is

f ′(x0)− p′2(x0) =1

3!f ′′′(ξ)(x0 − x1)(x0 − x2) =

1

3h2f ′′′(ξ), (7.13)

which is twice as large as that in Centered Finite Difference Formula.

Page 117: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

108 CHAPTER 7. NUMERICAL DIFFERENTIATION

7.2 The Effect of Round-off Errors

In numerical differentiation we take differences of values, which for small h,could be very close to each other. As we know, this leads to loss of accuracybecause of finite precision floating point arithmetic. Consider for examplethe centered difference formula. For simplicity let us suppose that h has anexact floating point representation and that we make no rounding error whendoing the division by h. That is, suppose that the the only source of round-off error is in the computation of the difference f(x0 + h)− f(x0 − h). Thenf(x0+h) and f(x0−h) are replaced by f(x0+h)(1+δ+) and f(x0−h)(1+δ−),respectively with |δ+| ≤ eps and |δ−| ≤ eps. Then

f(x0 + h)(1 + δ+)− f(x0 − h)(1 + δ−)

2h=f(x0 + h)− f(x0 − h)

2h+ rh,

where

rh =f(x0 + h)δ+ − f(x0 − h)δ−

2h.

Clearly, |rh| ≤ (|f(x0 +h)|+ |f(x0−h)|) eps2h≈ |f(x0)| eps

h. The approximation

error or truncation error for the centered finite difference approximation is−1

6f ′′′(ξ)h2. Thus, the total error E(h) can be approximately bounded by

16h2M3 + |f(x0)| eps

h. The minimum error occurs at h0 such that E ′(h0) = 0,

i.e.

h0 =

(3 eps |f(x0)|

M3

) 13

≈ c eps13 (7.14)

and E(h0) = O(eps23 ). We do not get machine precision!

Higher order finite differences exacerbate the problem of digit cancella-tion. When f can be extended to an analytic function in the complex plane,Cauchy Integral Theorem can be used to evaluate the derivative:

f ′(z0) =1

2πi

∫C

f(z)

(z − z0)2dz, (7.15)

where C is a simple closed contour around z0 and f is analytic on and insideC. Parametrizing C as a circle of radius r we get

f ′(z0) =1

2πr

∫ 2π

0

f(x0 + reit)e−itdt (7.16)

Page 118: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

7.3. RICHARDSON’S EXTRAPOLATION 109

The integrand is periodic and smooth so it can be approximated with spectralaccuracy with the composite trapezoidal rule.

Another approach to obtain finite difference formulas to approximatederivatives is through Taylor expansions. For example,

f(x0 + h) = f(x0) + f ′(x0)h+1

2f ′′(x0)h2 +

1

3!f (3)(x0)h3 +

1

4!f (4)(ξ+)h4,

(7.17)

f(x0 − h) = f(x0)− f ′(x0)h+1

2f ′′(x0)h2 − 1

3!f (3)(x0)h3 +

1

4!f (4)(ξ−)h4,

(7.18)

where x0 < ξ+ < x0 + h and x0 − h < ξ− < x0. Then subtracting (7.17)from (7.18) we have f(x0 +h)− f(x0−h) = 2f ′(x0)h+ 2

3!f ′′′(x0)h3 + · · · and

therefore

f(x0 + h)− f(x0 − h)

2h= f ′(x0) + c2h

2 + c4h4 + · · · (7.19)

Similarly if we add (7.17) and (7.18) we obtain f(x0 + h) + f(x0 − h) =2f(x0) + f ′′(x0)h2 + ch4 + · · · and consequently

f ′′(x0) =f(x0 + h)− 2f(x0) + f(x0 − h)

h2+ ch2 + · · · (7.20)

The finite difference

D2hf(x0) =

f(x0 + h)− 2f(x0) + f(x0 − h)

h2(7.21)

is thus a second order approximation to f ′′(x0), i.e., f ′′(x0) − D2hf(x0) =

O(h2).

7.3 Richardson’s Extrapolation

From (7.19) we know that, asymptotically

D0hf(x0) = f ′(x0) + c2h

2 + c4h4 + · · · (7.22)

We can apply Richardson extrapolation once to obtain a fourth order ap-proximation. Evaluating (7.22) at h/2 we get

D0h/2f(x0) = f ′(x0) +

1

4c2h

2 +1

16c4h

4 + · · · (7.23)

Page 119: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

110 CHAPTER 7. NUMERICAL DIFFERENTIATION

and multiplying this equation by 4, subtracting (7.22) to the result anddividing by 3 we get

Dexth f(x0) :=

4D0h/2f(x0)−D0

hf(x0)

3= f ′(x0) + c4h

4 + · · · (7.24)

The method Dexth f(x0) has order of convergence 4 for about twice the amount

of work of that D0hf(x0). Round-off errors are still O(eps/h) and the min-

imum total error will be when O(h4) is O(eps/h), i.e. when h = eps1/5.The minimum error is thus O(eps4/5) for Dext

h f(x0), about 10−14 in doubleprecision with h = O(10−3).

Page 120: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

Chapter 8

Numerical Integration

We now revisit the problem of numerical integration that we used to intro-duced some principles of numerical analysis in Chapter 1.

The problem in question is to find accurate and efficient approximationsto ∫ b

a

f(x)dx

Numerical formulas to approximate a definite integral are called quadra-tures and, as we saw in Chapter 1, they can be elementary (simple) or com-posite.

We shall assume henceforth, unless otherwise noted, that the integrandis sufficiently smooth.

8.1 Elementary Simpson Quadrature

The elementary trapezoidal rule quadrature was derived by replacing theintegrand f by its linear interpolating polynomial p1 at a and b, that is

f(x) = p1(x) +1

2f ′′(ξ)(x− a)(x− b), (8.1)

for some ξ between a and b and thus∫ b

a

f(x)dx =

∫ b

a

p1(x)dx+1

2

∫ b

a

f ′′(ξ)(x− a)(x− b)dx

=1

2(b− a)[f(a) + f(b)]− 1

2f ′′(η)(b− a)3

(8.2)

111

Page 121: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

112 CHAPTER 8. NUMERICAL INTEGRATION

Thus, the approximation∫ b

a

f(x)dx ≈ 1

2(b− a)[f(a) + f(b)] (8.3)

has an error given by −12f ′′(η)(b− a)3.

We can add an intermediate point, say xm = (a + b)/2, and replace fby its quadratic interpolating polynomial p2 with respect to the nodes a, xmand b. For simplicity let’s take [a, b] = [−1, 1]. With the simple change ofvariables

x =1

2(a+ b) +

1

2(b− a)t, t ∈ [−1, 1] (8.4)

we can obtain a quadrature formula for a general interval [a, b].Let p2 be the interpolating polynomial of f at −1, 0, 1. The corresponding

divided difference table is:

−1 f(−1)f(0)− f(−1)

0 f(0) f(1)−2f(0)+f(−1)2

.f(1)− f(0)

1 f(1)

Thus

p2(x) = f(−1) + [f(0)− f(1)](x+ 1) +f(1)− 2f(0) + f(−1)

2(x+ 1)x.

(8.5)

Now using the interpolation formula with remainder expressed in terms of adivided difference (3.56) we have

f(x) = p2(x) + f [−1, 0, 1, x](x+ 1)x(x− 1)

= p2(x) + f [−1, 0, 1, x]x(x2 − 1).(8.6)

Therefore∫ 1

−1

f(x)dx =

∫ 1

−1

p2(x)dx+

∫ 1

−1

f [−1, 0, 1, x]x(x2 − 1)dx

= 2f(−1) + 2[f(0)− f(−1)] +1

3[f(1)− 2f(0) + f(−1)] + E[f ]

=1

3[f(−1) + 4f(0) + f(1)] + E[f ],

Page 122: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

8.1. ELEMENTARY SIMPSON QUADRATURE 113

where

E[f ] =

∫ 1

−1

f [−1, 0, 1, x]x(x2 − 1)dx (8.7)

is the error. Note that x(x2− 1) changes sign in [−1, 1] so we cannot use theMean Value Theorem for integrals. However, if we add another node, x4, wecan relate f [−1, 0, 1, x] to the fourth order divided difference f [−1, 0, 1, x4, x],which will make the integral in (8.7) easier to evaluate:

f [−1, 0, 1, x] = f [−1, 0, 1, x4] + f [−1, 0, 1, x4, x](x− x4). (8.8)

This identity is just an application of Theorem 3.2. Using (8.8)

E[f ] = f [−1, 0, 1, x4]

∫ 1

−1

x(x2 − 1)dx+

∫ 1

−1

f [−1, 0, 1, x4, x]x(x2 − 1)(x− x4)dx.

The first integral is zero, because the integrand is odd. Now we choose x4

symmetrically, x4 = 0, so that x(x2 − 1)(x − x4) does not change sign in[−1, 1] and

E[f ] =

∫ 1

−1

f [−1, 0, 1, 0, x]x2(x2 − 1)dx =

∫ 1

−1

f [−1, 0, 0, 1, x]x2(x2 − 1)dx.

(8.9)

Now, using (3.58), there is ξ(x) ∈ (−1, 1) such that

f [−1, 0, 0, 1, x] =f (4)(ξ(x))

4!, (8.10)

and assuming f ∈ C4[−1, 1], by the Mean Value Theorem for integrals, thereis η ∈ (−1, 1) such that

E[f ] =f (4)(η)

4!

∫ 1

−1

x2(x2 − 1)dx = − 4

15

f (4)(η)

4!= − 1

90f (4)(η). (8.11)

Summarizing, Simpson’s elementary quadrature for the interval [−1, 1] is∫ 1

−1

f(x)dx =1

3[f(−1) + 4f(0) + f(1)]− 1

90f (4)(η). (8.12)

Page 123: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

114 CHAPTER 8. NUMERICAL INTEGRATION

Note that Simpson’s elementary quadrature gives the exact value of theintegral when f is polynomial of degree 3 or less (the error is proportionalto the fourth derivative), even though we used a second order polynomial toapproximate the integrand. We gain extra precision because of the symme-try of the quadrature around 0. In fact, we could have derived Simpson’squadrature by using the Hermite (third order) interpolating polynomial of fat −1, 0, 0, 1.

To obtain the corresponding formula for a general interval [a, b] we usethe change of variables (8.4)∫ b

a

f(x)dx =1

2(b− a)

∫ 1

−1

F (t)dt,

where

F (t) = f

(1

2(a+ b) +

1

2(b− a)t

), (8.13)

and noting that F (k)(t) = ( b−a2

)kf (k)(x) we obtain Simpson’s elementary ruleon the interval [a, b]:∫ b

a

f(x)dx =1

6(b− a)

[f(a) + 4f

(a+ b

2

)+ f(b)

]− 1

90f (4)(η)

(b− a

2

)5

.

(8.14)

8.2 Interpolatory Quadratures

The elementary trapezoidal and Simpson rules are examples of interpolatoryquadratures. This class of quadratures is obtained by selecting a set of nodesx0, x1, . . . , xn in the interval of integration and by approximating the integralby that of the interpolating polynomial pn of the integrand at these nodes.By construction, such interpolatory quadrature is exact for polynomials ofdegree up to n, at least. We just saw that Simpson rule is exact for polynomialup to degree 3 and we used p2 in its construction. The “degree gain” wasdue to the symmetric choice of the interpolation nodes. This leads us to twoimportant questions:

Page 124: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

8.2. INTERPOLATORY QUADRATURES 115

1. For a given n, how do we choose the nodes x0, x1, . . . , xn so that thecorresponding interpolation quadrature is exact for polynomials of thehighest degree k possible?

2. What is that k?

Because orthogonal polynomials (Section 5.3) play a central role in the answer to these questions, we will consider the more general problem of approximating

I[f] = ∫_a^b f(x)w(x) dx,   (8.15)

where w is an admissible weight function (w ≥ 0, ∫_a^b w(x) dx > 0, and ∫_a^b x^k w(x) dx < +∞ for k = 0, 1, . . .), w ≡ 1 being a particular case. The interval of integration [a,b] can be either finite or infinite (e.g. [0,+∞), (−∞,+∞)).

Definition 8.1. We say that a quadrature Q[f] to approximate I[f] has degree of precision k if it is exact, i.e. I[P] = Q[P], for all polynomials P of degree up to k but not exact for polynomials of degree k+1. Equivalently, a quadrature Q[f] has degree of precision k if I[x^m] = Q[x^m] for m = 0, 1, . . . , k but I[x^{k+1}] ≠ Q[x^{k+1}].

Example 8.1. The trapezoidal rule quadrature has degree of precision 1 while the Simpson quadrature has degree of precision 3.

For a given set of nodes x0, x1, . . . , xn in [a,b], let pn be the interpolating polynomial of f at these nodes. In Lagrange form we can write pn as (see Section 3.1)

pn(x) = Σ_{j=0}^{n} f(xj) lj(x),   (8.16)

where

lj(x) = Π_{k=0, k≠j}^{n} (x−xk)/(xj−xk), for j = 0, 1, . . . , n,   (8.17)


are the elementary Lagrange polynomials. The corresponding interpolatory quadrature Qn[f] to approximate I[f] is then given by

Qn[f] = Σ_{j=0}^{n} Aj f(xj),   Aj = ∫_a^b lj(x)w(x) dx, for j = 0, 1, . . . , n.   (8.18)

Theorem 8.1. The degree of precision of the interpolatory quadrature (8.18) is less than 2n+2.

Proof. Suppose the degree of precision k of (8.18) is greater than or equal to 2n+2. Take f(x) = (x−x0)²(x−x1)² · · · (x−xn)². This is a polynomial of degree exactly 2n+2. Then

∫_a^b f(x)w(x) dx = Σ_{j=0}^{n} Aj f(xj) = 0,   (8.19)

since f vanishes at every node. On the other hand,

∫_a^b f(x)w(x) dx = ∫_a^b (x−x0)² · · · (x−xn)² w(x) dx > 0,   (8.20)

which is a contradiction. Therefore k < 2n+2.

8.3 Gaussian Quadratures

We will now show that there is a choice of nodes x0, x1, . . . , xn which yields the optimal degree of precision 2n+1 for an interpolatory quadrature. The corresponding quadratures are called Gaussian quadratures. To define them we recall that ψk is the k-th orthogonal polynomial with respect to the inner product

⟨f, g⟩ = ∫_a^b f(x)g(x)w(x) dx,   (8.21)

if ⟨ψk, q⟩ = 0 for all polynomials q of degree less than k. Recall also that the zeros of the orthogonal polynomials are real, simple, and contained in [a,b] (see Theorem 5.1).

Definition 8.2. Let ψn+1 be the (n+1)st orthogonal polynomial and let x0, x1, . . . , xn be its n+1 zeros. Then the interpolatory quadrature (8.18) with the nodes so chosen is called a Gaussian quadrature.


Theorem 8.2. The interpolatory quadrature (8.18) has degree of precision k = 2n+1 if and only if it is a Gaussian quadrature.

Proof. Let f be a polynomial of degree ≤ 2n+1. Then we can write

f(x) = q(x)ψn+1(x) + r(x), (8.22)

where q and r are polynomials of degree ≤ n. Now

∫_a^b f(x)w(x) dx = ∫_a^b q(x)ψn+1(x)w(x) dx + ∫_a^b r(x)w(x) dx.   (8.23)

The first integral on the right hand side is zero because of orthogonality. For the second integral the quadrature is exact (it is interpolatory and r has degree ≤ n). Therefore

∫_a^b f(x)w(x) dx = Σ_{j=0}^{n} Aj r(xj).   (8.24)

Moreover, r(xj) = f(xj) − q(xj)ψn+1(xj) = f(xj) for all j = 0, 1, . . . , n, because the nodes xj are the zeros of ψn+1. Thus,

∫_a^b f(x)w(x) dx = Σ_{j=0}^{n} Aj f(xj).   (8.25)

This proves that the Gaussian quadrature has degree of precision k = 2n+1. Now suppose that the interpolatory quadrature (8.18) has maximal degree of precision 2n+1. Take f(x) = p(x)(x−x0)(x−x1) · · · (x−xn), where p is a polynomial of degree ≤ n. Then f is a polynomial of degree ≤ 2n+1 and

∫_a^b f(x)w(x) dx = ∫_a^b p(x)(x−x0) · · · (x−xn)w(x) dx = Σ_{j=0}^{n} Aj f(xj) = 0.

Therefore, the polynomial (x−x0)(x−x1) · · · (x−xn) of degree n+1 is orthogonal to all polynomials of degree ≤ n. Thus, it is a multiple of ψn+1.

Example 8.2. Consider the interval [−1,1] and the weight function w ≡ 1. The corresponding orthogonal polynomials are the Legendre polynomials 1, x, x² − 1/3, x³ − (3/5)x, . . . . Take n = 1. The roots of ψ2 are x0 = −√(1/3) and x1 = √(1/3). Therefore, the corresponding Gaussian quadrature is

∫_{−1}^{1} f(x) dx ≈ A0 f(−√(1/3)) + A1 f(√(1/3)),   (8.26)

where

A0 = ∫_{−1}^{1} l0(x) dx,   (8.27)

A1 = ∫_{−1}^{1} l1(x) dx.   (8.28)

We can evaluate these integrals directly or use the method of undetermined coefficients to find A0 and A1. The latter is generally easier and we illustrate it now. Using that the quadrature is exact for 1 and x we have

2 = ∫_{−1}^{1} 1 dx = A0 + A1,   (8.29)

0 = ∫_{−1}^{1} x dx = −A0 √(1/3) + A1 √(1/3).   (8.30)

Solving this 2 × 2 linear system we get A0 = A1 = 1. So the Gaussian quadrature for n = 1 in [−1,1] is

Q1[f] = f(−√(1/3)) + f(√(1/3)).   (8.31)

Let us compare this quadrature to the elementary trapezoidal rule. Take f(x) = x². The trapezoidal rule, T[f], gives

T[x²] = (2/2)[f(−1) + f(1)] = 2,   (8.32)

whereas the Gaussian quadrature Q1[f] yields the exact result:

Q1[x²] = (−√(1/3))² + (√(1/3))² = 2/3.   (8.33)
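A short Python check of this comparison (a sketch, with my own function names):

```python
import numpy as np

# Two-point Gauss-Legendre quadrature (8.31) vs the elementary
# trapezoidal rule on [-1, 1].
def gauss2(f):
    x = np.sqrt(1.0 / 3.0)
    return f(-x) + f(x)

def trap(f):
    return f(-1.0) + f(1.0)

f = lambda x: x**2          # exact integral over [-1, 1] is 2/3
print(trap(f), gauss2(f))   # 2.0 vs 0.6666666666666666 (exact)
```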


Example 8.3. Let us take again the interval [−1,1] but now w(x) = 1/√(1−x²). As we know (see Section 2.4), ψn+1 = Tn+1, i.e. the Chebyshev polynomial of degree n+1. Its zeros are xj = cos((2j+1)π/(2(n+1))) for j = 0, . . . , n. For n = 1 we have

cos(π/4) = √(1/2),   cos(3π/4) = −√(1/2).   (8.34)

We can use again the method of undetermined coefficients to find A0 and A1:

π = ∫_{−1}^{1} (1/√(1−x²)) dx = A0 + A1,   (8.35)

0 = ∫_{−1}^{1} x (1/√(1−x²)) dx = −A0 √(1/2) + A1 √(1/2),   (8.36)

which give A0 = A1 = π/2. Thus, the corresponding Gaussian quadrature to approximate ∫_{−1}^{1} f(x)(1/√(1−x²)) dx is

Q1[f] = (π/2)[f(−√(1/2)) + f(√(1/2))].   (8.37)
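A quick numerical check of (8.37), as a hedged sketch (the test function and exact value π/2 for f(x) = x² are my choices):

```python
import numpy as np

# Two-point Gauss-Chebyshev rule (8.37) for the weight 1/sqrt(1-x^2).
# For f(x) = x^2 the exact weighted integral over [-1, 1] is pi/2.
def gauss_chebyshev2(f):
    x = np.sqrt(0.5)
    return 0.5 * np.pi * (f(-x) + f(x))

print(gauss_chebyshev2(lambda x: x**2), np.pi / 2)  # both ~1.5707963
```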

8.3.1 Convergence of Gaussian Quadratures

Let f ∈ C[a,b] and consider the interpolatory quadrature (8.18). Can we guarantee that the error converges to zero as n → ∞, i.e.,

∫_a^b f(x)w(x) dx − Σ_{j=0}^{n} Aj f(xj) → 0, as n → ∞ ?

The answer is no. As we know, convergence of the interpolating polynomial to f depends on the smoothness of f and the distribution of the interpolation nodes. However, if the interpolatory quadrature is Gaussian the answer is yes. This follows from the following special properties of the quadrature weights A0, A1, . . . , An in the Gaussian quadrature.

Theorem 8.3. For a Gaussian quadrature all the quadrature weights are positive and sum up to ‖w‖1, i.e.,

(a) Aj > 0 for all j = 0, 1, . . . , n.


(b) Σ_{j=0}^{n} Aj = ∫_a^b w(x) dx.

Proof. (a) Let pk = lk² for k = 0, 1, . . . , n. These are polynomials of degree exactly equal to 2n and pk(xj) = δkj. Thus,

0 < ∫_a^b lk(x)² w(x) dx = Σ_{j=0}^{n} Aj lk(xj)² = Ak   (8.38)

for k = 0, 1, . . . , n.

(b) Take f(x) ≡ 1. Then

∫_a^b w(x) dx = Σ_{j=0}^{n} Aj,   (8.39)

as the quadrature is exact for polynomials of degree zero.

We can now use these special properties of the Gaussian quadrature to prove its convergence for all f ∈ C[a,b]:

Theorem 8.4. Let

Qn[f] = Σ_{j=0}^{n} Aj f(xj)   (8.40)

be the Gaussian quadrature. Then

En[f] := ∫_a^b f(x)w(x) dx − Qn[f] → 0, as n → ∞.   (8.41)

Proof. Let p*2n+1 be the best uniform approximation to f (in the max norm, ‖f‖∞ = max_{x∈[a,b]} |f(x)|) by polynomials of degree ≤ 2n+1. Then,

En[f − p∗2n+1] = En[f ]− En[p∗2n+1] = En[f ] (8.42)

and therefore

En[f] = En[f − p*2n+1] = ∫_a^b [f(x) − p*2n+1(x)]w(x) dx − Σ_{j=0}^{n} Aj [f(xj) − p*2n+1(xj)].


Taking the absolute value, using the triangle inequality, and the fact that the quadrature weights are positive we obtain

|En[f]| ≤ ∫_a^b |f(x) − p*2n+1(x)| w(x) dx + Σ_{j=0}^{n} Aj |f(xj) − p*2n+1(xj)|
        ≤ ‖f − p*2n+1‖∞ ∫_a^b w(x) dx + ‖f − p*2n+1‖∞ Σ_{j=0}^{n} Aj
        = 2‖w‖1 ‖f − p*2n+1‖∞.

From the Weierstrass approximation theorem it follows that En[f] → 0 as n → ∞.

Moreover, one can prove (using one of the Jackson theorems) that if f ∈ C^m[a,b] then

|En[f]| ≤ C (2n)^{−m} ‖f^(m)‖∞.   (8.43)

That is, the rate of convergence is not fixed; it depends on the number of derivatives the integrand has. We say in this case that the approximation is spectral. In particular, if f ∈ C^∞[a,b] then the error decreases to zero faster than any power of 1/(2n).

8.4 Computing the Gaussian Nodes and Weights

Orthogonal polynomials satisfy a three-term recurrence relation:

ψk+1(x) = (x − αk)ψk(x) − βk ψk−1(x), for k = 0, 1, . . . , n,   (8.44)

where β0 is defined as ∫_a^b w(x) dx, ψ0(x) = 1, and ψ−1(x) = 0. Equivalently,

xψk(x) = βkψk−1(x) + αkψk(x) + ψk+1(x), for k = 0, 1, . . . , n. (8.45)

If we use the normalized orthogonal polynomials

ψ̃k(x) = ψk(x)/√⟨ψk, ψk⟩   (8.46)

and recalling that

βk = ⟨ψk, ψk⟩ / ⟨ψk−1, ψk−1⟩,


then (8.45) can be written as

x ψ̃k(x) = √βk ψ̃k−1(x) + αk ψ̃k(x) + √βk+1 ψ̃k+1(x), for k = 0, 1, . . . , n.   (8.47)

Now evaluating this at a root xj of ψn+1 we get the eigenvalue problem

xjvj = Jnvj, (8.48)

where

Jn = [ α0    √β1   0     · · ·   0
       √β1   α1    √β2   · · ·   0
        ⋮      ⋱     ⋱     ⋱    √βn
        0     0     0    √βn    αn ],   vj = [ψ̃0(xj), ψ̃1(xj), . . . , ψ̃n−1(xj), ψ̃n(xj)]ᵀ.   (8.49)

That is, the Gaussian nodes xj, j = 0, 1, . . . , n are the eigenvalues of the Jacobi matrix Jn. One can show that the Gaussian weights Aj are given in terms of the first component vj,0 of the normalized eigenvector vj (vjᵀvj = 1):

Aj = β0v2j,0. (8.50)

There are efficient numerical methods (e.g. the QR method) to solve the eigenvalue problem for a symmetric tridiagonal matrix, and this is one of the most popular approaches to compute the Gaussian nodes.
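A hedged Python sketch of this eigenvalue approach (often called the Golub-Welsch algorithm) for the Legendre case w ≡ 1 on [−1,1]. The recurrence coefficients αk = 0, βk = k²/(4k²−1), β0 = 2 are standard for Legendre polynomials but are an input assumption here, not derived in the text:

```python
import numpy as np

def gauss_legendre(n):
    """Nodes and weights of the (n+1)-point Gauss-Legendre rule on [-1, 1]
    via the eigenvalues/eigenvectors of the Jacobi matrix (8.49)-(8.50).
    Assumes the Legendre coefficients alpha_k = 0, beta_k = k^2/(4k^2 - 1),
    beta_0 = 2."""
    k = np.arange(1, n + 1)
    beta = k**2 / (4.0 * k**2 - 1.0)
    J = np.diag(np.sqrt(beta), 1) + np.diag(np.sqrt(beta), -1)
    nodes, V = np.linalg.eigh(J)        # J is symmetric tridiagonal
    weights = 2.0 * V[0, :]**2          # A_j = beta_0 * v_{j,0}^2
    return nodes, weights

x, A = gauss_legendre(1)
print(x)  # approx [-0.57735, 0.57735], i.e. the nodes of Example 8.2
print(A)  # approx [1.0, 1.0]
```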

8.5 Clenshaw-Curtis Quadrature

Gaussian quadratures are optimal in terms of the degree of precision and offer superalgebraic convergence for smooth integrands. However, the computation of the Gaussian weights and nodes carries a significant cost for large n. There is an ingenious interpolatory quadrature that is a close competitor to the Gaussian quadrature due to its efficiency and fast rate of convergence. This is the Clenshaw-Curtis quadrature.

Suppose f is a smooth function in [−1,1] and we are interested in an accurate approximation of the integral

∫_{−1}^{1} f(x) dx.


The idea is to use the extrema of the nth Chebyshev polynomial Tn, xj = cos(jπ/n), j = 0, 1, . . . , n, as the nodes of the corresponding interpolatory quadrature. The degree of precision is only n (not 2n+1!). However, as we know, for smooth functions the approximation by polynomial interpolation using the Chebyshev nodes converges very rapidly. Hence, for smooth integrands this particular interpolatory quadrature can be expected to converge fast to the exact value of the integral.

Let pn be the interpolating polynomial of f at xj = cos(jπ/n), j = 0, 1, . . . , n. We can write pn as

pn(x) = a0/2 + Σ_{k=1}^{n−1} ak Tk(x) + (an/2) Tn(x).   (8.51)

Under the change of variable x = cos θ, for θ ∈ [0, π], we get

pn(cos θ) = a0/2 + Σ_{k=1}^{n−1} ak cos kθ + (an/2) cos nθ.   (8.52)

Let Πn(θ) = pn(cos θ) and F(θ) = f(cos θ). By extending F evenly over [−π, 0] (or over [π, 2π]) and using Theorem 4.2, we conclude that Πn(θ) interpolates F(θ) = f(cos θ) at the equally spaced points θj = jπ/n, j = 0, 1, . . . , n if and only if

ak = (2/n) Σ″_{j=0}^{n} F(θj) cos kθj,   k = 0, 1, . . . , n.   (8.53)

These are the (Type I) Discrete Cosine Transform (DCT) coefficients of F and we can compute them efficiently in O(n log₂ n) operations with the FFT.

Now, using the change of variable x = cos θ, we have

∫_{−1}^{1} f(x) dx = ∫_0^π f(cos θ) sin θ dθ = ∫_0^π F(θ) sin θ dθ,   (8.54)

and approximating F(θ) by its interpolant Πn(θ) = pn(cos θ), we obtain the corresponding quadrature

∫_{−1}^{1} f(x) dx ≈ ∫_0^π Πn(θ) sin θ dθ.   (8.55)


Substituting (8.52) we have

∫_0^π Πn(θ) sin θ dθ = (a0/2) ∫_0^π sin θ dθ + Σ_{k=1}^{n−1} ak ∫_0^π cos kθ sin θ dθ + (an/2) ∫_0^π cos nθ sin θ dθ.   (8.56)

Assuming n even and using cos kθ sin θ = (1/2)[sin(1+k)θ + sin(1−k)θ] we get the Clenshaw-Curtis quadrature:

∫_{−1}^{1} f(x) dx ≈ a0 + Σ_{k=2, k even}^{n−2} 2ak/(1−k²) + an/(1−n²).   (8.57)

For a general interval [a, b] we simply use the change of variables

x = (a+b)/2 + ((b−a)/2) cos θ   (8.58)

for θ ∈ [0, π] and thus

∫_a^b f(x) dx = ((b−a)/2) ∫_0^π F(θ) sin θ dθ,   (8.59)

where F(θ) = f((a+b)/2 + ((b−a)/2) cos θ), and so the formula (8.57) gets an extra factor of (b−a)/2.
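A minimal Python sketch of the Clenshaw-Curtis quadrature on [−1,1] following (8.53) and (8.57). For clarity the DCT coefficients are computed naively in O(n²) rather than with the FFT; the function name is mine:

```python
import numpy as np

def clenshaw_curtis(f, n):
    """Clenshaw-Curtis approximation of the integral of f over [-1, 1],
    with n even, using (8.53) for the coefficients and (8.57) for the sum."""
    theta = np.pi * np.arange(n + 1) / n
    F = f(np.cos(theta))
    a = np.zeros(n + 1)
    for k in range(n + 1):
        terms = F * np.cos(k * theta)
        # double-prime sum: halve the first and last terms
        a[k] = (2.0 / n) * (terms.sum() - 0.5 * terms[0] - 0.5 * terms[-1])
    kk = np.arange(2, n - 1, 2)                 # k = 2, 4, ..., n-2
    return a[0] + np.sum(2.0 * a[kk] / (1.0 - kk**2)) + a[n] / (1.0 - n**2)

# Rapid convergence for a smooth integrand: exact value is e - 1/e.
print(clenshaw_curtis(np.exp, 8), np.exp(1) - np.exp(-1))
```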

8.6 Composite Quadratures

We saw in Section 1.2.2 that one strategy to improve the accuracy of a quadrature formula is to divide the interval of integration [a,b] into small subintervals, use the elementary quadrature in each of them, and sum up all the contributions.

For simplicity, let us divide [a,b] uniformly into N subintervals [xj, xj+1] of equal length h = (b−a)/N, where xj = a + jh for j = 0, 1, . . . , N−1. If we use the elementary trapezoidal rule in each subinterval (as done in Section 1.2.2) we arrive at the composite trapezoidal rule:

∫_a^b f(x) dx = h[(1/2)f(a) + Σ_{j=1}^{N−1} f(xj) + (1/2)f(b)] − (1/12)(b−a)h² f″(η),   (8.60)


where η is some point in (a,b).

To derive a corresponding composite Simpson quadrature we take N even and apply the elementary Simpson quadrature in each of the N/2 intervals [xj, xj+2], j = 0, 2, . . . , N−2. That is:

∫_a^b f(x) dx = ∫_{x0}^{x2} f(x) dx + ∫_{x2}^{x4} f(x) dx + · · · + ∫_{xN−2}^{xN} f(x) dx,   (8.61)

and since the elementary Simpson quadrature applied to [xj, xj+2] reads

∫_{xj}^{xj+2} f(x) dx = (h/3)[f(xj) + 4f(xj+1) + f(xj+2)] − (1/90) f^(4)(ηj) h⁵   (8.62)

for some ηj ∈ (xj, xj+2), summing up all the N/2 contributions we get the composite Simpson quadrature:

∫_a^b f(x) dx = (h/3)[f(a) + 2 Σ_{j=1}^{N/2−1} f(x2j) + 4 Σ_{j=1}^{N/2} f(x2j−1) + f(b)] − (1/180)(b−a)h⁴ f^(4)(η),

for some η ∈ (a, b).
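A minimal Python sketch of the composite Simpson quadrature (the O(h⁴) error is visible in the output); the function name is mine:

```python
import numpy as np

def composite_simpson(f, a, b, N):
    """Composite Simpson rule with N (even) subintervals of length h."""
    assert N % 2 == 0, "N must be even"
    x = np.linspace(a, b, N + 1)
    h = (b - a) / N
    return (h / 3.0) * (f(x[0]) + 4.0 * f(x[1:-1:2]).sum()
                        + 2.0 * f(x[2:-1:2]).sum() + f(x[-1]))

# O(h^4) error: doubling N should reduce the error by about 16.
for N in (8, 16, 32):
    print(N, abs(composite_simpson(np.sin, 0.0, np.pi, N) - 2.0))
```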

8.7 Modified Trapezoidal Rule

We are going to consider here a modification to the trapezoidal rule that yields a quadrature with an error of the same order as Simpson's rule. Moreover, this modified quadrature will give us some insight into the asymptotic form of the trapezoidal rule error.

To simplify the derivation let us consider the interval [0,1] and let p3 be the polynomial interpolating f(0), f′(0), f(1), f′(1). Newton's divided differences representation of p3 is

p3(x) = f(0) + f[0,0]x + f[0,0,1]x² + f[0,0,1,1]x²(x−1),   (8.63)

and thus

∫_0^1 p3(x) dx = f(0) + (1/2)f′(0) + (1/3)f[0,0,1] − (1/12)f[0,0,1,1].   (8.64)


The divided differences are obtained in the tableau:

0   f(0)
            f′(0)
0   f(0)                  f(1) − f(0) − f′(0)
            f(1) − f(0)                        f′(1) + f′(0) + 2(f(0) − f(1))
1   f(1)                  f′(1) − f(1) + f(0)
            f′(1)
1   f(1)

Thus,

∫_0^1 p3(x) dx = f(0) + (1/2)f′(0) + (1/3)[f(1) − f(0) − f′(0)] − (1/12)[f′(0) + f′(1) + 2(f(0) − f(1))]

and simplifying the right hand side we get

∫_0^1 p3(x) dx = (1/2)[f(0) + f(1)] + (1/12)[f′(0) − f′(1)],   (8.65)

which is the simple trapezoidal rule plus a correction involving the derivative of the integrand at the end points.

We can obtain an expression for the error of this quadrature formula by recalling that the Cauchy remainder of the interpolation is

f(x) − p3(x) = (1/4!) f^(4)(ξ(x)) x²(x−1)²   (8.66)

and since x²(x−1)² does not change sign in [0,1] we can use the Mean Value Theorem for integrals to get

E[f] = ∫_0^1 [f(x) − p3(x)] dx = (1/4!) f^(4)(η) ∫_0^1 x²(x−1)² dx = (1/720) f^(4)(η)   (8.67)

for some η ∈ (0,1).

To obtain the quadrature in a general finite interval [a,b] we use the change of variables x = a + (b−a)t, t ∈ [0,1]:

∫_a^b f(x) dx = (b−a) ∫_0^1 F(t) dt,   (8.68)


where F(t) = f(a + (b−a)t). Thus,

∫_a^b f(x) dx = ((b−a)/2)[f(a) + f(b)] + ((b−a)²/12)[f′(a) − f′(b)] + (1/720) f^(4)(η)(b−a)⁵,   (8.69)

for some η ∈ (a,b).

We can get a composite modified trapezoidal rule by subdividing [a,b] into N subintervals of equal length h = (b−a)/N, applying the simple rule in each subinterval and adding up all the contributions (the (b−a) factor in the last term comes from summing the N local errors (1/720)f^(4)(ηj)h⁵):

∫_a^b f(x) dx = h[(1/2)f(x0) + Σ_{j=1}^{N−1} f(xj) + (1/2)f(xN)] − (h²/12)[f′(b) − f′(a)] + (1/720)(b−a) f^(4)(η) h⁴.   (8.70)
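A hedged Python sketch of (8.70): the trapezoidal sum plus the endpoint-derivative correction (the error term is dropped); the function names are mine:

```python
import numpy as np

def modified_trapezoid(f, fprime, a, b, N):
    """Composite modified trapezoidal rule (8.70); fprime is f'."""
    x = np.linspace(a, b, N + 1)
    h = (b - a) / N
    T = h * (0.5 * f(x[0]) + f(x[1:-1]).sum() + 0.5 * f(x[-1]))
    return T - h**2 / 12.0 * (fprime(b) - fprime(a))

# The correction upgrades the O(h^2) trapezoidal error to O(h^4).
for N in (8, 16):
    print(N, abs(modified_trapezoid(np.exp, np.exp, 0.0, 1.0, N) - (np.e - 1.0)))
```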

8.8 The Euler-Maclaurin Formula

We are now going to obtain a more general formula for the asymptotic form of the error in the trapezoidal rule quadrature. The idea is to use integration by parts with the aid of suitable polynomials. Let us consider again the interval [0,1] and define B0(x) = 1, B1(x) = x − 1/2. Then

∫_0^1 f(x) dx = ∫_0^1 f(x)B0(x) dx = ∫_0^1 f(x)B′1(x) dx
             = f(x)B1(x)|_0^1 − ∫_0^1 f′(x)B1(x) dx
             = (1/2)[f(0) + f(1)] − ∫_0^1 f′(x)B1(x) dx.   (8.71)

We can continue the integration by parts using the Bernoulli polynomials, which satisfy

B′k+1(x) = (k+1)Bk(x), k = 1, 2, . . .   (8.72)

Since we start with B1(x) = x − 1/2, it is clear that Bk(x) is a polynomial of degree exactly k with leading order coefficient 1, i.e. monic. These polynomials are determined by the recurrence relation (8.72) up to a constant. The constant is fixed by requiring that

Bk(0) = Bk(1) = 0, k = 3, 5, 7, . . . (8.73)

Indeed,

B′′k+1(x) = (k + 1)B′k(x) = (k + 1)kBk−1(x) (8.74)

and Bk−1(x) has the form

Bk−1(x) = x^{k−1} + ak−2 x^{k−2} + · · · + a1x + a0.   (8.75)

Integrating (8.74) twice we get

Bk+1(x) = k(k+1)[ (1/(k(k+1))) x^{k+1} + (ak−2/((k−1)k)) x^k + · · · + (1/2)a0x² + bx + c ].   (8.76)

For k+1 odd, the two constants of integration b and c are determined by the condition (8.73). The Bk(x) for k even are then given by Bk(x) = B′k+1(x)/(k+1).

We are going to need a few properties of the Bernoulli polynomials. By construction, Bk(x) is an even (odd) polynomial in x − 1/2 if k is even (odd). Equivalently, they satisfy the identity

(−1)^k Bk(1−x) = Bk(x).   (8.77)

This follows because the polynomials Ak(x) = (−1)^k Bk(1−x) satisfy the same conditions that define the Bernoulli polynomials, i.e. A′k+1(x) = (k+1)Ak(x) and Ak(0) = Ak(1) = 0 for k = 3, 5, 7, . . ., and since A1(x) = B1(x) they are the same. From (8.77) and (8.73) we get that

Bk(0) = Bk(1), k = 2, 3, . . .   (8.78)

We define the Bernoulli numbers as Bk = Bk(0) = Bk(1). This together with the recurrence relation (8.72) implies that

∫_0^1 Bk(x) dx = (1/(k+1)) ∫_0^1 B′k+1(x) dx = (1/(k+1))[Bk+1(1) − Bk+1(0)] = 0   (8.79)

for k = 1, 2, . . ..


Lemma 3. The polynomials C2m(x) = B2m(x) − B2m, m = 1, 2, . . . do not change sign in [0,1].

Proof. We will prove it by contradiction. Suppose that C2m(x) changes sign. Then it has at least 3 zeros in [0,1] and, by Rolle's theorem, C′2m(x) = B′2m(x) has at least 2 zeros in (0,1). This implies that B2m−1(x) has 2 zeros in (0,1). Since B2m−1(0) = B2m−1(1) = 0, again by Rolle's theorem, B′2m−1(x) has at least 3 zeros in (0,1), which implies that B2m−2(x) has 3 zeros, etc. Descending in this fashion, we conclude that each B2l−1(x) has 2 zeros in (0,1) in addition to the two at the end points, B2l−1(0) = B2l−1(1) = 0, for all l = 1, 2, . . ., which is a contradiction (already for l = 1, 2, since B1 and B3 have degree 1 and 3).

Here are the first few Bernoulli polynomials

B0(x) = 1,   (8.80)

B1(x) = x − 1/2,   (8.81)

B2(x) = (x − 1/2)² − 1/12 = x² − x + 1/6,   (8.82)

B3(x) = (x − 1/2)³ − (1/4)(x − 1/2) = x³ − (3/2)x² + (1/2)x,   (8.83)

B4(x) = (x − 1/2)⁴ − (1/2)(x − 1/2)² + 7/240 = x⁴ − 2x³ + x² − 1/30.   (8.84)

Let us retake the idea of integration by parts that we started in (8.71):

−∫_0^1 f′(x)B1(x) dx = −(1/2) ∫_0^1 f′(x)B′2(x) dx
                     = (1/2)B2[f′(0) − f′(1)] + (1/2) ∫_0^1 f″(x)B2(x) dx   (8.85)


and

(1/2) ∫_0^1 f″(x)B2(x) dx = (1/(2·3)) ∫_0^1 f″(x)B′3(x) dx
   = (1/(2·3))[ f″(x)B3(x)|_0^1 − ∫_0^1 f‴(x)B3(x) dx ]
   = −(1/(2·3)) ∫_0^1 f‴(x)B3(x) dx = −(1/(2·3·4)) ∫_0^1 f‴(x)B′4(x) dx
   = (B4/4!)[f‴(0) − f‴(1)] + (1/4!) ∫_0^1 f^(4)(x)B4(x) dx.   (8.86)

Continuing this way we arrive at the Euler-Maclaurin formula for the simple trapezoidal rule in [0,1]:

Theorem 8.5.

∫_0^1 f(x) dx = (1/2)[f(0) + f(1)] + Σ_{k=1}^{m} (B2k/(2k)!)[f^(2k−1)(0) − f^(2k−1)(1)] + Rm,   (8.87)

where

Rm = (1/(2m+2)!) ∫_0^1 f^(2m+2)(x)[B2m+2(x) − B2m+2] dx   (8.88)

and, using (8.79), the Mean Value Theorem for integrals, and Lemma 3,

Rm = (f^(2m+2)(η)/(2m+2)!) ∫_0^1 [B2m+2(x) − B2m+2] dx = −(B2m+2/(2m+2)!) f^(2m+2)(η)   (8.89)

for some η ∈ (0, 1).

It is now straightforward to obtain the Euler-Maclaurin formula for the composite trapezoidal rule with equally spaced points:


Theorem 8.6. (The Euler-Maclaurin Summation Formula) Let m be a positive integer and f ∈ C^(2m+2)[a,b], h = (b−a)/N. Then

∫_a^b f(x) dx = h[(1/2)f(a) + (1/2)f(b) + Σ_{j=1}^{N−1} f(a+jh)]
             + Σ_{k=1}^{m} (B2k/(2k)!) h^{2k} [f^(2k−1)(a) − f^(2k−1)(b)]
             − (B2m+2/(2m+2)!) (b−a) h^{2m+2} f^(2m+2)(η),   η ∈ (a,b).   (8.90)

Remarks: The error is in even powers of h. The formula gives m corrections to the composite trapezoidal rule. For a smooth periodic function, if b−a is a multiple of its period, then the error of the composite trapezoidal rule with equally spaced points decreases faster than any power of h as h → 0.
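A small numerical illustration of this last remark; the test function and the reference value 2πI₀(1) (a standard Bessel-function identity) are my choices, not the text's:

```python
import numpy as np

# Composite trapezoidal rule for the smooth periodic integrand
# f(x) = exp(cos x) over one full period [0, 2*pi].
f = lambda x: np.exp(np.cos(x))
exact = 7.954926521012845          # 2*pi*I_0(1), quoted to ~16 digits
for N in (4, 8, 16, 32):
    h = 2.0 * np.pi / N
    T = h * f(h * np.arange(N)).sum()   # trapezoid sum; f(0) = f(2*pi)
    print(N, abs(T - exact))            # error drops faster than any power of h
```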

8.9 Romberg Integration

We are now going to apply Richardson extrapolation successively to the trapezoidal rule. Again, we consider equally spaced nodes, xj = a + jh, j = 0, 1, . . . , N, h = (b−a)/N, assume N is even, and define

Th[f] = h[(1/2)f(a) + (1/2)f(b) + Σ_{j=1}^{N−1} f(a+jh)] := h Σ″_{j=0}^{N} f(a+jh),   (8.91)

where Σ″ means that the first and last terms have a 1/2 factor.

We know from the Euler-Maclaurin formula that for a smooth integrand

∫_a^b f(x) dx = Th[f] + c2h² + c4h⁴ + · · ·   (8.92)

for some constants c2, c4, etc. We can do Richardson extrapolation to obtain a quadrature with a leading order error O(h⁴). If we have computed T2h[f] we can combine it with Th[f] to achieve this by noting that

∫_a^b f(x) dx = T2h[f] + c2(2h)² + c4(2h)⁴ + · · · .   (8.93)


Multiplying (8.92) by 4, subtracting (8.93), and dividing by 3, we have

∫_a^b f(x) dx = (4Th[f] − T2h[f])/3 + c̃4h⁴ + c̃6h⁶ + · · ·   (8.94)

for new constants c̃4, c̃6, etc.

We can continue the Richardson extrapolation process, and we can do this more efficiently if we reuse the work we have done to compute T2h[f] to evaluate Th[f]. To this end, we note that

Th[f] − (1/2)T2h[f] = h Σ″_{j=0}^{N} f(a+jh) − h Σ″_{j=0}^{N/2} f(a+2jh) = h Σ_{j=1}^{N/2} f(a+(2j−1)h).

If we let hl = (b−a)/2^l then

Thl[f] = (1/2)Thl−1[f] + hl Σ_{j=1}^{2^{l−1}} f(a+(2j−1)hl).   (8.95)

Beginning with the simple trapezoidal rule (two points) we can successively double the number of points in the quadrature by using (8.95) and immediately do extrapolation. Let

R(0,0) = Th0[f] = ((b−a)/2)[f(a) + f(b)]   (8.96)

and for l = 1, 2, ...,M define

R(l,0) = (1/2)R(l−1,0) + hl Σ_{j=1}^{2^{l−1}} f(a+(2j−1)hl).   (8.97)

From R(0,0) and R(1,0) we can extrapolate to obtain

R(1,1) = R(1,0) + (1/(4−1))[R(1,0) − R(0,0)].   (8.98)

We can generate a tableau of approximations like the following, for M = 4:

R(0,0)
R(1,0)  R(1,1)
R(2,0)  R(2,1)  R(2,2)
R(3,0)  R(3,1)  R(3,2)  R(3,3)
R(4,0)  R(4,1)  R(4,2)  R(4,3)  R(4,4)


Each of the R(l,m) is obtained by extrapolation:

R(l,m) = R(l,m−1) + (1/(4^m − 1))[R(l,m−1) − R(l−1,m−1)],   (8.99)

and R(4,4) would be the most accurate approximation (neglecting round-off errors). This is the Romberg algorithm, which can be written as:

h = b − a; R(0,0) = (1/2)(b−a)[f(a) + f(b)];
for l = 1 : M
    h = h/2;
    R(l,0) = (1/2)R(l−1,0) + h Σ_{j=1}^{2^{l−1}} f(a+(2j−1)h);
    for m = 1 : l
        R(l,m) = R(l,m−1) + (1/(4^m − 1))[R(l,m−1) − R(l−1,m−1)];
    end
end
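A direct Python transcription of this algorithm, as a sketch (the driver values are my own):

```python
import numpy as np

def romberg(f, a, b, M):
    """Romberg tableau R(l, m), following the pseudocode above."""
    R = np.zeros((M + 1, M + 1))
    h = b - a
    R[0, 0] = 0.5 * (b - a) * (f(a) + f(b))
    for l in range(1, M + 1):
        h *= 0.5
        j = np.arange(1, 2**(l - 1) + 1)
        R[l, 0] = 0.5 * R[l - 1, 0] + h * f(a + (2 * j - 1) * h).sum()
        for m in range(1, l + 1):
            R[l, m] = R[l, m - 1] + (R[l, m - 1] - R[l - 1, m - 1]) / (4**m - 1)
    return R

R = romberg(np.exp, 0.0, 1.0, 4)
print(abs(R[4, 4] - (np.e - 1.0)))  # R(4,4) is already extremely accurate
```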


Chapter 9

Linear Algebra

9.1 The Three Main Problems

There are three main problems in Numerical Linear Algebra:

1. Solving large linear systems of equations.

2. Finding eigenvalues and eigenvectors.

3. Computing the Singular Value Decomposition (SVD) of a large matrix.

The first problem appears in a wide variety of applications and is an indispensable tool in Scientific Computing.

Given a nonsingular n×n matrix A and a vector b ∈ Rⁿ, where n could be on the order of millions or billions, we would like to find the unique solution x satisfying

Ax = b   (9.1)

or an accurate approximation x̃ to x. Henceforth we will assume, unless otherwise stated, that the matrix A is real.

We will study Direct Methods (for example Gaussian Elimination), which compute the solution (up to roundoff errors) in a finite number of steps, and Iterative Methods, which starting from an initial approximation of the solution x^(0) produce subsequent approximations x^(1), x^(2), . . . from a given recipe

x^(k+1) = G(x^(k), A, b), k = 0, 1, . . .   (9.2)


where G is a continuous function of the first variable. Consequently, if the iterations converge, x^(k) → x as k → ∞, to the solution x of the linear system Ax = b, then

x = G(x, A, b).   (9.3)

That is, x is a fixed point of G.

One of the main strategies in the design of efficient numerical methods for linear systems is to transform the problem to one which is much easier to solve. Both direct and iterative methods use this strategy.

The eigenvalue problem for an n×n matrix A consists of finding each or some of the scalars (the eigenvalues) λ and the corresponding eigenvectors v ≠ 0 such that

Av = λv.   (9.4)

Equivalently, (A − λI)v = 0 and so the eigenvalues are the roots of the characteristic polynomial of A,

p(λ) = det(A − λI).   (9.5)

Clearly, we cannot solve this problem with a finite number of elementary operations (for n ≥ 5 it would be a contradiction to Abel's theorem) so iterative methods have to be employed. Also, λ and v could be complex even if A is real. The maximum of the absolute values of the eigenvalues of a matrix is a useful concept in numerical linear algebra.

Definition 9.1. Let A be an n×n matrix. The spectral radius ρ of A is defined as

ρ(A) = max{|λ1|, . . . , |λn|},   (9.6)

where λi, i = 1, . . . , n are the eigenvalues (not necessarily distinct) of A.

Large eigenvalue (or more appropriately eigenvector) problems arise in the study of the steady state behavior of time-discrete Markov processes, which are often used in a wide range of applications, such as finance, population dynamics, and data mining. The original Google PageRank search algorithm is a prominent example of the latter. The problem is to find an eigenvector v associated with the eigenvalue 1, i.e. v = Av. Such v is a probability vector, so all its entries are positive, add up to 1, and represent the probabilities of the system described by the Markov process to be in a given state in the limit as time goes to infinity. This eigenvector v is in effect a fixed point of the linear transformation represented by the Markov matrix A.
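One standard way to compute such a stationary vector is the power method (a technique not developed at this point in the text). A hedged sketch with a made-up 3-state Markov matrix:

```python
import numpy as np

# Power iteration for the eigenvector of eigenvalue 1 of a column-stochastic
# Markov matrix A (hypothetical example; each column sums to 1).
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])
v = np.ones(3) / 3.0                   # start from the uniform distribution
for _ in range(100):
    v = A @ v
    v /= v.sum()                       # keep it a probability vector
print(v, np.linalg.norm(A @ v - v))    # v is (numerically) a fixed point
```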

The third problem is related to the second one and finds applications in image compression, model reduction techniques, data analysis, and many other fields. Given an m×n matrix A, the idea is to consider the eigenvalues and eigenvectors of the square, n×n matrix AᵀA (or A*A, where A* is the conjugate transpose of A as defined below, if A is complex). As we will see, the eigenvalues are all real and nonnegative and AᵀA has a complete set of orthogonal eigenvectors. The singular values of a matrix A are the positive square roots of the eigenvalues of AᵀA. Using this, it follows that any real m×n matrix A has the singular value decomposition (SVD)

UᵀAV = Σ,   (9.7)

where U is an orthogonal m×m matrix, V is an orthogonal n×n matrix, and Σ is a "diagonal" matrix of the form

Σ = [ D  0
      0  0 ],   D = diag(σ1, σ2, . . . , σr),   (9.8)

where σ1 ≥ σ2 ≥ · · · ≥ σr > 0 are the nonzero singular values of A.

9.2 Notation

A matrix A with elements aij will be denoted A = (aij); this could be a square n×n matrix or an m×n matrix. Aᵀ denotes the transpose of A, i.e. Aᵀ = (aji).

A vector x ∈ Rⁿ will be represented as the n-tuple

x = [x1, x2, . . . , xn]ᵀ.   (9.9)

The canonical vectors, corresponding to the standard basis in Rⁿ, will be denoted by e1, e2, . . . , en, where ek is the n-vector with all entries equal to zero except the k-th one, which is equal to one.


The inner product of two real vectors x and y in Rⁿ is

⟨x, y⟩ = Σ_{i=1}^{n} xi yi = xᵀy.   (9.10)

If the vectors are complex, i.e. x and y in Cⁿ, we define their inner product as

⟨x, y⟩ = Σ_{i=1}^{n} x̄i yi,   (9.11)

where x̄i denotes the complex conjugate of xi.

With the inner product (9.10) in the real case or (9.11) in the complex case, we can define the Euclidean norm

‖x‖2 = √⟨x, x⟩.   (9.12)

Note that if A is an n× n real matrix and x, y ∈ Rn then

⟨x, Ay⟩ = Σ_{i=1}^{n} xi (Σ_{k=1}^{n} aik yk) = Σ_{i=1}^{n} Σ_{k=1}^{n} aik xi yk
        = Σ_{k=1}^{n} (Σ_{i=1}^{n} aik xi) yk = Σ_{k=1}^{n} (Σ_{i=1}^{n} aᵀki xi) yk,   (9.13)

that is

〈x,Ay〉 = 〈ATx, y〉. (9.14)

Similarly in the complex case we have

〈x,Ay〉 = 〈A∗x, y〉, (9.15)

where A* is the conjugate transpose of A, i.e. A* = (āji).

9.3 Some Important Types of Matrices

One useful type of linear transformation consists of those that preserve the Euclidean norm. That is, if y = Ax then ‖y‖2 = ‖x‖2, but this implies

〈Ax,Ax〉 = 〈ATAx, x〉 = 〈x, x〉 (9.16)

and consequently ATA = I.


Definition 9.2. An n×n real (complex) matrix A is called orthogonal (unitary) if AᵀA = I (A*A = I).

Two of the most important types of matrices in applications are symmetric (Hermitian) and positive definite matrices.

Definition 9.3. An n × n real matrix A is called symmetric if AT = A. Ifthe matrix A is complex it is called Hermitian if A∗ = A.

Symmetric (Hermitian) matrices have real eigenvalues, for if v is an eigenvector associated to an eigenvalue λ of A, we can assume it has been normalized so that ⟨v, v⟩ = 1, and

⟨v, Av⟩ = ⟨v, λv⟩ = λ⟨v, v⟩ = λ.   (9.17)

But if Aᵀ = A then

λ = ⟨v, Av⟩ = ⟨Av, v⟩ = ⟨λv, v⟩ = λ̄⟨v, v⟩ = λ̄,   (9.18)

and λ = λ̄ if and only if λ ∈ R.

Definition 9.4. An n×n matrix A is called positive definite if it is symmetric (Hermitian) and ⟨x, Ax⟩ > 0 for all x ∈ Rⁿ, x ≠ 0.

By the preceding argument the eigenvalues of a positive definite matrix A are real because Aᵀ = A. Moreover, if Av = λv with ‖v‖2 = 1 then 0 < ⟨v, Av⟩ = λ. Therefore, positive definite matrices have real, positive eigenvalues. Conversely, if all the eigenvalues of a symmetric matrix A are positive then A is positive definite. This follows from the fact that symmetric matrices are diagonalizable by an orthogonal matrix S, i.e. A = SDSᵀ, where D is a diagonal matrix with the eigenvalues λ1, . . . , λn (not necessarily distinct) of A. Then

⟨x, Ax⟩ = Σ_{i=1}^{n} λi yi²,   (9.19)

where y = Sᵀx. Thus a symmetric (Hermitian) matrix A is positive definite if and only if all its eigenvalues are positive. Moreover, since the determinant is the product of the eigenvalues, positive definite matrices have a positive determinant.

We now review another useful consequence of positive definiteness.


Definition 9.5. Let A = (aij) be an n×n matrix. Its leading principal submatrices are the square matrices

Ak = [ a11 · · · a1k
        ⋮         ⋮
       ak1 · · · akk ],   k = 1, . . . , n.   (9.20)

Theorem 9.1. All the leading principal submatrices of a positive definite matrix are positive definite.

Proof. Suppose A is an n×n positive definite matrix. Then all its leading principal submatrices are symmetric (Hermitian). Moreover, if we take a vector x ∈ Rⁿ of the form

x = [y1, . . . , yk, 0, . . . , 0]ᵀ,   (9.21)

where y = [y1, . . . , yk]ᵀ ∈ Rᵏ is an arbitrary nonzero vector, then

0 < ⟨x, Ax⟩ = ⟨y, Ak y⟩,

which shows that Ak for k = 1, . . . , n is positive definite.

The converse of Theorem 9.1 is also true but the proof is much more technical: A is positive definite if and only if det(Ak) > 0 for k = 1, . . . , n.

Note also that if A is positive definite then all its diagonal elements are positive, because 0 < ⟨ej, Aej⟩ = ajj for j = 1, . . . , n.


9.4 Schur Theorem

Theorem 9.2. (Schur) Let A be an n×n matrix. Then there exists a unitary matrix T (T*T = I) such that

T*AT = [ λ1  b12  b13  · · ·  b1n
              λ2  b23  · · ·  b2n
                    ⋱         ⋮
                           bn−1,n
                            λn ],   (9.22)

where λ1, . . . , λn are the eigenvalues of A and all the elements below the diagonal are zero.

Proof. We will do a proof by induction. Let A be a 2×2 matrix with eigenvalues λ1 and λ2. Let u be a normalized eigenvector (u*u = 1) corresponding to λ1. Then we can take T as the matrix whose first column is u and whose second column is a unit vector v orthogonal to u (u*v = 0). We have

T*AT = [ u*
         v* ] [ λ1u  Av ] = [ λ1  u*Av
                              0   v*Av ].   (9.23)

The scalar v*Av has to be equal to λ2, as similar matrices have the same eigenvalues. We now assume the result is true for all k×k (k ≥ 2) matrices and will show that it is also true for all (k+1)×(k+1) matrices. Let A be a (k+1)×(k+1) matrix and let u1 be a normalized eigenvector associated with eigenvalue λ1. Choose k unit vectors t1, . . . , tk so that the matrix T1 = [u1 t1 . . . tk] is unitary. Then,

T1*AT1 = [ λ1  c12  c13  · · ·  c1n
           0
           ⋮        Ak
           0 ],   (9.24)

where Ak is a k×k matrix. Now, the eigenvalues of the matrix on the right hand side of (9.24) are the roots of (λ1 − λ) det(Ak − λI) and, since this matrix is similar to A, it follows that the eigenvalues of Ak are the


remaining eigenvalues of A, λ2, . . . , λk+1. By the induction hypothesis there is a unitary matrix Tk such that Tk*AkTk is upper triangular with the eigenvalues λ2, . . . , λk+1 sitting on the diagonal. We can now use Tk to construct the (k+1)×(k+1) unitary matrix

Tk+1 = [ 1  0  · · ·  0
         0
         ⋮      Tk
         0 ]   (9.25)

and define T = T1Tk+1. Then

T ∗AT = T ∗k+1T∗1AT1Tk+1 = T ∗k+1(T ∗1AT1)Tk+1 (9.26)

and using (9.24) and (9.25) we get

T*AT = [ 1  0  · · ·  0     [ λ1  c12  · · ·  c1n     [ 1  0  · · ·  0
         0                    0                         0
         ⋮     Tk*       ×    ⋮        Ak          ×    ⋮      Tk
         0 ]                  0 ]                       0 ]

      = [ λ1  b12  b13  · · ·  b1n
               λ2  b23  · · ·  b2n
                     ⋱         ⋮
                            bn−1,n
                             λn ].

9.5 Norms

A norm on a vector space V (for example Rⁿ or Cⁿ) over K = R (or C) is a mapping ‖ · ‖ : V → [0,∞) which satisfies the following properties:

(i) ‖x‖ ≥ 0 ∀x ∈ V and ‖x‖ = 0 iff x = 0.

(ii) ‖x+ y‖ ≤ ‖x‖+ ‖y‖ ∀x, y ∈ V .


(iii) ‖λx‖ = |λ| ‖x‖ ∀x ∈ V, λ ∈ K.

Example 9.1.

‖x‖1 = |x1| + . . . + |xn|,   (9.27)

‖x‖2 = √⟨x, x⟩ = √(|x1|² + . . . + |xn|²),   (9.28)

‖x‖∞ = max{|x1|, . . . , |xn|}.   (9.29)

Lemma 4. Let ‖ · ‖ be a norm on a vector space V. Then

| ‖x‖ − ‖y‖ | ≤ ‖x − y‖.   (9.30)

This lemma implies that a norm is a continuous function (from V to R).

Proof. ‖x‖ = ‖x− y + y‖ ≤ ‖x− y‖+ ‖y‖ which gives that

‖x‖ − ‖y‖ ≤ ‖x− y‖. (9.31)

By reversing the roles of x and y we also get

‖y‖ − ‖x‖ ≤ ‖x− y‖. (9.32)

We will also need norms defined on matrices. Let A be an n×n matrix. We can view A as a vector in R^{n×n} and define its corresponding Euclidean norm

‖A‖ = √( Σ_{i=1}^{n} Σ_{j=1}^{n} |aij|² ).   (9.33)

This is called the Frobenius norm for matrices. A different matrix norm can be obtained by using a given vector norm and matrix-vector multiplication. Given a vector norm ‖ · ‖ in Rⁿ (or in Cⁿ), it is easy to show that

‖A‖ = max_{x≠0} ‖Ax‖/‖x‖   (9.34)

satisfies the properties (i), (ii), (iii) of a norm for all n×n matrices A. That is, the vector norm induces a matrix norm.


Definition 9.6. The matrix norm defined by (9.34) is called the subordinate or natural norm induced by the vector norm ‖ · ‖.

Example 9.2.

‖A‖1 = max_{x≠0} ‖Ax‖1/‖x‖1,   (9.35)

‖A‖∞ = max_{x≠0} ‖Ax‖∞/‖x‖∞,   (9.36)

‖A‖2 = max_{x≠0} ‖Ax‖2/‖x‖2.   (9.37)

Theorem 9.3. Let ‖ · ‖ be an induced matrix norm. Then

(a) ‖Ax‖ ≤ ‖A‖‖x‖,

(b) ‖AB‖ ≤ ‖A‖‖B‖.

Proof. (a) If x = 0 the result holds trivially. Take x ≠ 0; then the definition (9.34) implies

‖Ax‖/‖x‖ ≤ ‖A‖,   (9.38)

that is, ‖Ax‖ ≤ ‖A‖‖x‖.

(b) Take x ≠ 0. By (a), ‖ABx‖ ≤ ‖A‖‖Bx‖ ≤ ‖A‖‖B‖‖x‖ and thus

‖ABx‖/‖x‖ ≤ ‖A‖‖B‖.   (9.39)

Taking the max over x ≠ 0 we get ‖AB‖ ≤ ‖A‖‖B‖.

The following theorem offers a more concrete way to compute the matrix norms (9.35)-(9.37).

Theorem 9.4. Let A = (aij) be an n×n matrix. Then

(a) ‖A‖1 = max_j Σ_{i=1}^{n} |aij|.

(b) ‖A‖∞ = max_i Σ_{j=1}^{n} |aij|.


(c) ‖A‖2 = √ρ(AᵀA),

where ρ(AᵀA) is the spectral radius of AᵀA, as defined in (9.6).
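A quick numerical sanity check of this theorem (a sketch; the random test matrix is my choice):

```python
import numpy as np

# Induced 1, infinity, and 2 norms vs. max column sum, max row sum,
# and sqrt of the spectral radius of A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
print(np.linalg.norm(A, 1),      np.abs(A).sum(axis=0).max())   # (a)
print(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())   # (b)
print(np.linalg.norm(A, 2),      np.sqrt(np.linalg.eigvalsh(A.T @ A).max()))  # (c)
```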

Proof. (a)

‖Ax‖1 = Σ_{i=1}^{n} |Σ_{j=1}^{n} aij xj| ≤ Σ_{j=1}^{n} |xj| (Σ_{i=1}^{n} |aij|) ≤ (max_j Σ_{i=1}^{n} |aij|) ‖x‖1.

Thus, ‖A‖1 ≤ max_j Σ_{i=1}^{n} |aij|. We just need to show there is a vector x for which the equality holds. Let j* be the index such that

Σ_{i=1}^{n} |aij*| = max_j Σ_{i=1}^{n} |aij|   (9.40)

and take x to be given by xi = 0 for i ≠ j* and xj* = 1. Then ‖x‖1 = 1 and

‖Ax‖1 = Σ_{i=1}^{n} |Σ_{j=1}^{n} aij xj| = Σ_{i=1}^{n} |aij*| = max_j Σ_{i=1}^{n} |aij|.   (9.41)

(b) Analogously to (a) we have

‖Ax‖∞ = max_i |Σ_{j=1}^{n} aij xj| ≤ (max_i Σ_{j=1}^{n} |aij|) ‖x‖∞.   (9.42)

Let i* be the index such that

Σ_{j=1}^{n} |ai*j| = max_i Σ_{j=1}^{n} |aij|   (9.43)

and take x given by

xj = ai*j/|ai*j| if ai*j ≠ 0, and xj = 1 if ai*j = 0.   (9.44)


Then |xj| = 1 for all j and ‖x‖∞ = 1. Hence

‖Ax‖∞ = max_i |Σ_{j=1}^{n} aij xj| = Σ_{j=1}^{n} |ai*j| = max_i Σ_{j=1}^{n} |aij|.   (9.45)

(c) By definition

‖A‖2² = max_{x≠0} ‖Ax‖2²/‖x‖2² = max_{x≠0} (xᵀAᵀAx)/(xᵀx).   (9.46)

Note that the matrix AᵀA is symmetric and all its eigenvalues are nonnegative. Let us label them in increasing order, 0 ≤ λ1 ≤ λ2 ≤ · · · ≤ λn. Then λn = ρ(AᵀA). Now, since AᵀA is symmetric, there is an orthogonal matrix Q such that QᵀAᵀAQ = D = diag(λ1, . . . , λn). Therefore, changing variables, x = Qy, we have

(xᵀAᵀAx)/(xᵀx) = (yᵀDy)/(yᵀy) = (λ1y1² + · · · + λnyn²)/(y1² + · · · + yn²) ≤ λn.   (9.47)

Now take the vector y such that yj = 0 for j ≠ n and yn = 1 and the equality holds. Thus,

‖A‖2 = √( max_{x≠0} ‖Ax‖2²/‖x‖2² ) = √λn = √ρ(AᵀA).   (9.48)

Note that if AT = A then

‖A‖2 = √ρ(AᵀA) = √ρ(A²) = ρ(A).   (9.49)

Let λ be an eigenvalue of the matrix A with eigenvector x, normalized so that ‖x‖ = 1. Then,

|λ| = |λ|‖x‖ = ‖λx‖ = ‖Ax‖ ≤ ‖A‖‖x‖ = ‖A‖ (9.50)

for any matrix norm with the property ‖Ax‖ ≤ ‖A‖‖x‖. Thus,

ρ(A) ≤ ‖A‖ (9.51)

for any induced norm. However, given an n×n matrix A and ε > 0 there is at least one induced matrix norm such that ‖A‖ is within ε of the spectral radius of A.


Theorem 9.5. Let A be an n×n matrix. Given ε > 0 there is at least one induced matrix norm ‖ · ‖ such that

ρ(A) ≤ ‖A‖ ≤ ρ(A) + ε.   (9.52)

Proof. By Schur’s Theorem, there is a unitary matrix T such that

T*AT = [ λ1  b12  b13  · · ·  b1n
              λ2  b23  · · ·  b2n
                    ⋱         ⋮
                           bn−1,n
                            λn ] = U,   (9.53)

where λj, j = 1, . . . , n are the eigenvalues of A. Take 0 < δ < 1 and define the diagonal matrix Dδ = diag(δ, δ², . . . , δⁿ). Then

Dδ⁻¹UDδ = [ λ1  δb12  δ²b13  · · ·  δ^{n−1}b1n
                 λ2   δb23   · · ·  δ^{n−2}b2n
                        ⋱           ⋮
                               δbn−1,n
                                λn ].   (9.54)

Given ε > 0, we can find δ sufficiently small so that Dδ⁻¹UDδ is "within ε" of a diagonal matrix, in the sense that the sum of the absolute values of the off-diagonal entries is less than ε for each row:

Σ_{j=i+1}^{n} |δ^{j−i} bij| ≤ ε for i = 1, . . . , n.   (9.55)

Now,

Dδ⁻¹UDδ = Dδ⁻¹T*ATDδ = (TDδ)⁻¹A(TDδ).   (9.56)

Given a nonsingular matrix S and a matrix norm ‖ · ‖ then

‖A‖′ = ‖S−1AS‖ (9.57)


is also a norm. Taking S = TDδ and using the infinity norm we get

‖A‖′ = ‖(TDδ)⁻¹A(TDδ)‖∞ ≤ ‖diag(λ1, . . . , λn)‖∞ + ‖off-diagonal part of Dδ⁻¹UDδ‖∞ ≤ ρ(A) + ε.

9.6 Condition Number of a Matrix

Consider the 5×5 Hilbert matrix

H5 = [ 1    1/2  1/3  1/4  1/5
       1/2  1/3  1/4  1/5  1/6
       1/3  1/4  1/5  1/6  1/7
       1/4  1/5  1/6  1/7  1/8
       1/5  1/6  1/7  1/8  1/9 ]   (9.58)

and the linear system H5x = b, where

b = [137/60, 87/60, 153/140, 743/840, 1879/2520]ᵀ.   (9.59)


The exact solution of this linear system is x = [1, 1, 1, 1, 1]ᵀ. Note that b ≈ [2.28, 1.45, 1.09, 0.88, 0.74]ᵀ. Let us perturb b slightly (by about 1%):

b + δb = [2.28, 1.46, 1.10, 0.89, 0.75]ᵀ.   (9.60)

The solution of the perturbed system (up to rounding at 12 digits of accuracy) is

x + δx = [0.5, 7.2, −21.0, 30.8, −12.6]ᵀ.   (9.61)

A relative perturbation of ‖δb‖2/‖b‖2 = 0.0046 in the data produces a change in the solution equal to ‖δx‖2 ≈ 40. The perturbation gets amplified nearly four orders of magnitude!

This high sensitivity of the solution to small perturbations is inherent to the matrix of the linear system, H5 in this example.
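This experiment is easy to reproduce; here is a hedged Python sketch (the perturbation vector is my own, close to the one in the text but applied to the unrounded b, so the numbers differ slightly):

```python
import numpy as np

n = 5
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
b = H @ np.ones(n)                             # exact solution is all ones
db = np.array([0.0, 0.01, 0.01, 0.01, 0.01])   # ~1% perturbation of b
x = np.linalg.solve(H, b + db)
print(np.linalg.cond(H, 2))                    # ~4.77e5
print(x)                                       # far from [1, 1, 1, 1, 1]
```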

Consider the linear system Ax = b and the perturbed one A(x + δx) = b + δb. Then Ax + Aδx = b + δb implies δx = A⁻¹δb and so

‖δx‖ ≤ ‖A−1‖‖δb‖ (9.62)

for any induced norm. But also ‖b‖ = ‖Ax‖ ≤ ‖A‖‖x‖ or

1/‖x‖ ≤ ‖A‖ (1/‖b‖).   (9.63)

Combining (9.62) and (9.63) we obtain

‖δx‖/‖x‖ ≤ ‖A‖‖A⁻¹‖ ‖δb‖/‖b‖.   (9.64)

The right hand side of this inequality is actually a least upper bound; there are b and δb for which the equality holds.


Definition 9.7. Given a matrix norm ‖ · ‖, the condition number of a matrix A, denoted by κ(A), is defined by

κ(A) = ‖A‖‖A⁻¹‖.   (9.65)

Example 9.3. The condition number of the 5×5 Hilbert matrix H5, (9.58), in the 2-norm is approximately 4.7661 × 10⁵. For the particular b and δb we chose we actually got a variation in the solution of O(10⁴) times the relative perturbation, but now we know that the amplification factor could be as bad as κ(A).

Similarly, if we perturb the entries of a matrix A for a linear system Ax = b so that we have (A + δA)(x + δx) = b, we get

Ax+ Aδx+ δA(x+ δx) = b (9.66)

that is, Aδx = −δA(x+ δx), which implies that

‖δx‖ ≤ ‖A−1‖‖δA‖‖x+ δx‖ (9.67)

for any induced matrix norm and consequently

‖δx‖/‖x + δx‖ ≤ ‖A⁻¹‖‖A‖ ‖δA‖/‖A‖ = κ(A) ‖δA‖/‖A‖.   (9.68)

Because, for any induced norm, 1 = ‖I‖ = ‖AA⁻¹‖ ≤ ‖A‖‖A⁻¹‖, we get that κ(A) ≥ 1. We say that A is ill-conditioned if κ(A) is very large.

Example 9.4. The Hilbert matrix is ill-conditioned. We already saw that in the 2-norm κ(H5) = 4.7661 × 10⁵. The condition number increases very rapidly as the size of the Hilbert matrix increases; for example κ(H6) = 1.4951 × 10⁷ and κ(H10) = 1.6025 × 10¹³.

9.6.1 What to Do When A is Ill-conditioned?

There are two ways to deal with a linear system with an ill-conditioned matrix A. One approach is to work with extended precision (using as many digits as required to obtain the solution up to a given accuracy). Unfortunately, computations using extended precision can be computationally expensive, several times the cost of regular double precision operations.


A more practical approach is often to replace the ill-conditioned linear system Ax = b by an equivalent linear system with a much smaller condition number. This can be done, for example, by premultiplying by a matrix P⁻¹ so that we have P⁻¹Ax = P⁻¹b. Obviously, taking P = A gives us the smallest possible condition number, but this choice is not practical, so a compromise is made between P approximating A and keeping the cost of solving linear systems with the matrix P low. This very useful technique, also employed to accelerate the convergence of some iterative methods, is called preconditioning.


Chapter 10

Linear Systems of Equations I

In this chapter we focus on a problem which is central to many applications: finding the solution to a large linear system of n linear equations in n unknowns x1, x2, . . . , xn:

a11x1 + a12x2 + . . . + a1nxn = b1,
a21x1 + a22x2 + . . . + a2nxn = b2,
                                ⋮
an1x1 + an2x2 + . . . + annxn = bn,   (10.1)

or written in matrix form

Ax = b (10.2)

where A is the n× n matrix of coefficients

A = [ a11  a12  · · ·  a1n
      a21  a22  · · ·  a2n
       ⋮    ⋮    ⋱     ⋮
      an1  an2  · · ·  ann ],   (10.3)

x is a column vector whose components are the unknowns, and b is the given right hand side of the linear system:

x = [x1, x2, . . . , xn]ᵀ,   b = [b1, b2, . . . , bn]ᵀ.   (10.4)


We will assume, unless stated otherwise, that A is a nonsingular, real matrix. That is, the linear system (10.2) has a unique solution for each b. Equivalently, the determinant of A, det(A), is non-zero and A has an inverse.

While mathematically we can write the solution as x = A⁻¹b, this is not computationally efficient. Finding A⁻¹ is several (about four) times more costly than solving Ax = b for a given b.

In many applications n can be on the order of millions or much larger.

10.1 Easy to Solve Systems

When A is diagonal, i.e.

A = [ a11
           a22
                ⋱
                     ann ]   (10.5)

(all the entries outside the diagonal are zero and, since A is assumed nonsingular, aii ≠ 0 for all i), then each equation can be solved with just one division:

xi = bi/aii, for i = 1, 2, . . . , n. (10.6)

If A is lower triangular and nonsingular,

A = [ a11
      a21  a22
       ⋮    ⋮   ⋱
      an1  an2  · · ·  ann ],   (10.7)

the solution can also be obtained easily by the process of forward substitution:

x1 = b1/a11,
x2 = (b2 − a21x1)/a22,
x3 = (b3 − [a31x1 + a32x2])/a33,
 ⋮
xn = (bn − [an1x1 + an2x2 + . . . + an,n−1xn−1])/ann,   (10.8)


or in pseudo-code:

Algorithm 10.1 Forward Substitution

1: for i = 1, . . . , n do
2:     xi ← (bi − Σ_{j=1}^{i−1} aij xj)/aii
3: end for

Note that the assumption that A is nonsingular implies that aii ≠ 0 for all i = 1, 2, . . . , n since det(A) = a11a22 · · · ann. Also observe that (10.8) shows that xi is a linear combination of bi, bi−1, . . . , b1 and, since x = A⁻¹b, it follows that A⁻¹ is also lower triangular.

To compute xi we perform i−1 multiplications, i−1 additions/subtractions, and one division, so the total amount of computational work W(n) to do forward substitution is

W(n) = 2 Σ_{i=1}^{n} (i−1) + n = n²,   (10.9)

where we have used that

Σ_{i=1}^{n} (i−1) = n(n−1)/2.   (10.10)

That is, W (n) = O(n2) to solve a lower triangular linear system.

If A is nonsingular and upper triangular

A = [ a11  a12  · · ·  a1n
      0    a22  · · ·  a2n
      ⋮     ⋮    ⋱     ⋮
      0    0    · · ·  ann ]   (10.11)

we solve the linear system Ax = b starting from xn, then we solve for xn−1,


etc. This is called backward substitution:

xn = bn/ann,
xn−1 = (bn−1 − an−1,n xn)/an−1,n−1,
xn−2 = (bn−2 − [an−2,n−1 xn−1 + an−2,n xn])/an−2,n−2,
 ⋮
x1 = (b1 − [a12x2 + a13x3 + · · · + a1nxn])/a11.   (10.12)

From this we deduce that xi is a linear combination of bi, bi+1, . . . , bn and so A⁻¹ is an upper triangular matrix. In pseudo-code, we have

Algorithm 10.2 Backward Substitution

1: for i = n, n−1, . . . , 1 do
2:     xi ← (bi − Σ_{j=i+1}^{n} aij xj)/aii
3: end for

The operation count is the same as for forward substitution, W(n) = O(n²).
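A direct Python transcription of Algorithms 10.1 and 10.2, as a sketch (the 2×2 test systems are mine):

```python
import numpy as np

def forward_substitution(A, b):
    """Solve Ax = b for lower triangular A (Algorithm 10.1)."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
    return x

def backward_substitution(A, b):
    """Solve Ax = b for upper triangular A (Algorithm 10.2)."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

L = np.array([[2.0, 0.0], [1.0, 3.0]])
print(forward_substitution(L, np.array([2.0, 7.0])))      # [1. 2.]
print(backward_substitution(L.T, np.array([4.0, 6.0])))   # [1. 2.]
```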

10.2 Gaussian Elimination

The central idea of Gaussian elimination is to reduce the linear system Ax = b to an equivalent upper triangular system, which has the same solution and can readily be solved with backward substitution. Such a reduction is done with an elimination process employing linear combinations of rows. We first illustrate the method with a concrete example:

x1 + 2x2 − x3 + x4 = 0,
2x1 + 4x2 − x4 = −3,
3x1 + x2 − x3 + x4 = 3,
x1 − x2 + 2x3 + x4 = 3.   (10.13)


To do the elimination we form an augmented matrix Ab by appending one more column to the matrix of coefficients A, consisting of the right hand side b:

Ab = [ 1   2  −1   1   0
       2   4   0  −1  −3
       3   1  −1   1   3
       1  −1   2   1   3 ].   (10.14)

The first step is to eliminate the first unknown in the second to last equations, i.e. to produce a zero in the first column of Ab for rows 2, 3, and 4:

[ 1   2  −1   1   0      R2 ← R2 − 2R1     [ 1   2  −1   1   0
  2   4   0  −1  −3      R3 ← R3 − 3R1       0   0   2  −3  −3
  3   1  −1   1   3      R4 ← R4 − 1R1       0  −5   2  −2   3
  1  −1   2   1   3 ]    −−−−−−−−−−→         0  −3   3   0   3 ],   (10.15)

where R2 ← R2 − 2R1 means that the second row has been replaced by the second row minus two times the first row, etc. Since the coefficient of x1 in the first equation is 1, it is easy to figure out the numbers we need to multiply rows 2, 3, and 4 by to achieve the elimination of the first variable in each row, namely 2, 3, and 1. These numbers are called multipliers. In general, to obtain the multipliers we divide the coefficients of x1 in the rows below the first one by the nonzero coefficient a11 (2/1 = 2, 3/1 = 3, 1/1 = 1). The coefficient we divide by to obtain the multipliers is called a pivot (1 in this case).

Note that the (2,2) element of the last matrix in (10.15) is 0, so we cannot use it as a pivot for the second round of elimination. Instead, we proceed by exchanging the second and the third rows:

[ 1   2  −1   1   0                   [ 1   2  −1   1   0
  0   0   2  −3  −3      R2 ↔ R3        0  −5   2  −2   3
  0  −5   2  −2   3      −−−−−→         0   0   2  −3  −3
  0  −3   3   0   3 ]                   0  −3   3   0   3 ].   (10.16)

We can now use −5 as a pivot and do the second round of elimination:

[ 1   2  −1   1   0      R3 ← R3 − 0R2         [ 1   2  −1    1    0
  0  −5   2  −2   3      R4 ← R4 − (3/5)R2       0  −5   2   −2    3
  0   0   2  −3  −3      −−−−−−−−−−−−→           0   0   2   −3   −3
  0  −3   3   0   3 ]                            0   0  9/5  6/5  6/5 ].   (10.17)


Clearly, the elimination step R3 ← R3 − 0R2 is unnecessary as the coefficient to be eliminated is already zero, but we include it to illustrate the general procedure. The last round of the elimination is

[ 1   2  −1    1    0      R4 ← R4 − (9/10)R3     [ 1   2  −1    1       0
  0  −5   2   −2    3      −−−−−−−−−−−−−→           0  −5   2   −2       3
  0   0   2   −3   −3                               0   0   2   −3      −3
  0   0  9/5  6/5  6/5 ]                            0   0   0  39/10  39/10 ].   (10.18)

The last matrix, let us call it Ub, corresponds to the upper triangular system

[ 1   2  −1    1       [ x1       [ 0
  0  −5   2   −2         x2         3
  0   0   2   −3     ×   x3    =   −3
  0   0   0  39/10 ]     x4 ]      39/10 ],   (10.19)

which we can solve with backward substitution to obtain the solution

x = [x1, x2, x3, x4]ᵀ = [1, −1, 0, 1]ᵀ.   (10.20)

Each of the steps in the Gaussian elimination process are linear trans-formations and hence we can represent these transformations with matrices.Note, however, that these matrices are not constructed in practice, we onlyimplement their effect (row exchange or elimination). The first round of elim-ination (10.15) is equivalent to multiplying (from the left) Ab by the lowertriangular matrix

$$E_1 = \begin{bmatrix} 1 & 0 & 0 & 0\\ -2 & 1 & 0 & 0\\ -3 & 0 & 1 & 0\\ -1 & 0 & 0 & 1 \end{bmatrix}, \tag{10.21}$$

that is,
$$E_1 A_b = \begin{bmatrix} 1 & 2 & -1 & 1 & 0\\ 0 & 0 & 2 & -3 & -3\\ 0 & -5 & 2 & -2 & 3\\ 0 & -3 & 3 & 0 & 3 \end{bmatrix}. \tag{10.22}$$


The matrix $E_1$ is formed by taking the $4\times 4$ identity matrix and replacing the elements in the first column below the diagonal by the negatives of the multipliers, i.e. $-2, -3, -1$. We can exchange rows 2 and 3 with a permutation matrix

$$P = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 \end{bmatrix}, \tag{10.23}$$

which is obtained by exchanging the second and third rows of the $4\times 4$ identity matrix,

$$P E_1 A_b = \begin{bmatrix} 1 & 2 & -1 & 1 & 0\\ 0 & -5 & 2 & -2 & 3\\ 0 & 0 & 2 & -3 & -3\\ 0 & -3 & 3 & 0 & 3 \end{bmatrix}. \tag{10.24}$$

To construct the matrix associated with the second round of elimination we take the $4\times 4$ identity matrix and replace the elements in the second column below the diagonal by the negatives of the multipliers we got with the pivot equal to $-5$:

$$E_2 = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & -\frac{3}{5} & 0 & 1 \end{bmatrix}, \tag{10.25}$$

and we get
$$E_2 P E_1 A_b = \begin{bmatrix} 1 & 2 & -1 & 1 & 0\\ 0 & -5 & 2 & -2 & 3\\ 0 & 0 & 2 & -3 & -3\\ 0 & 0 & \frac{9}{5} & \frac{6}{5} & \frac{6}{5} \end{bmatrix}. \tag{10.26}$$

Finally, for the last elimination we have
$$E_3 = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & -\frac{9}{10} & 1 \end{bmatrix}, \tag{10.27}$$


and $E_3 E_2 P E_1 A_b = U_b$. Observe that $P E_1 A_b = E_1' P A_b$, where

$$E_1' = \begin{bmatrix} 1 & 0 & 0 & 0\\ -3 & 1 & 0 & 0\\ -2 & 0 & 1 & 0\\ -1 & 0 & 0 & 1 \end{bmatrix}, \tag{10.28}$$

i.e., we exchange rows in advance and then reorder the multipliers accordingly. If we focus on the matrix $A$, the first four columns of $A_b$, we have the matrix factorization

$$E_3 E_2 E_1' P A = U, \tag{10.29}$$

where $U$ is the upper triangular matrix
$$U = \begin{bmatrix} 1 & 2 & -1 & 1\\ 0 & -5 & 2 & -2\\ 0 & 0 & 2 & -3\\ 0 & 0 & 0 & \frac{39}{10} \end{bmatrix}. \tag{10.30}$$

Moreover, the product of upper (lower) triangular matrices is also an upper (lower) triangular matrix, and so is the inverse. Hence, we obtain the so-called LU factorization

PA = LU, (10.31)

where $L = (E_3 E_2 E_1')^{-1} = E_1'^{-1} E_2^{-1} E_3^{-1}$ is a lower triangular matrix. Now recall that the matrices $E_1', E_2, E_3$ perform the transformation of subtracting the row of the pivot times the multiplier from the rows below. Therefore, the inverse operation is to add the subtracted row back, i.e. we simply remove the negative sign in front of the multipliers:

$$E_1'^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 3 & 1 & 0 & 0\\ 2 & 0 & 1 & 0\\ 1 & 0 & 0 & 1 \end{bmatrix}, \quad
E_2^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & \frac{3}{5} & 0 & 1 \end{bmatrix}, \quad
E_3^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & \frac{9}{10} & 1 \end{bmatrix}.$$
It then follows that

$$L = \begin{bmatrix} 1 & 0 & 0 & 0\\ 3 & 1 & 0 & 0\\ 2 & 0 & 1 & 0\\ 1 & \frac{3}{5} & \frac{9}{10} & 1 \end{bmatrix}. \tag{10.32}$$


Note that $L$ has all the multipliers below the diagonal and $U$ has all the pivots on the diagonal. We will see that a factorization $PA = LU$ is always possible for any nonsingular $n\times n$ matrix $A$ and can be very useful.

We now consider the general linear system (10.1). The matrix of coefficients and the right hand side are

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \quad
b = \begin{bmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{bmatrix}, \tag{10.33}$$

respectively. We form the augmented matrix $A_b$ by appending $b$ to $A$ as the last column:

$$A_b = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} & b_1\\ a_{21} & a_{22} & \cdots & a_{2n} & b_2\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} & b_n \end{bmatrix}. \tag{10.34}$$

In principle, if $a_{11} \neq 0$ we can start the elimination. However, if $|a_{11}|$ is too small, dividing by it to compute the multipliers might lead to inaccurate results on the computer, i.e. using finite precision arithmetic. It is generally better to look for the coefficient of largest absolute value in the first column, exchange rows, and then do the elimination. This is called partial pivoting. It is possible to additionally search for the element of largest absolute value in the first row and switch columns accordingly. This is called complete pivoting and works well provided the matrix is properly scaled. Henceforth, we will consider Gaussian elimination only with partial pivoting, which is less costly to apply.

To perform the first round of Gaussian elimination we do three steps:

1. Find $\max_i |a_{i1}|$; let us say this corresponds to the $m$-th row, i.e. $|a_{m1}| = \max_i |a_{i1}|$. If $|a_{m1}| = 0$, the matrix is singular. Stop.

2. Exchange rows 1 and $m$.

3. Compute the multipliers and perform the elimination.


After these three steps, we have transformed $A_b$ into
$$A_b^{(1)} = \begin{bmatrix} a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} & b_1^{(1)}\\ 0 & a_{22}^{(1)} & \cdots & a_{2n}^{(1)} & b_2^{(1)}\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & a_{n2}^{(1)} & \cdots & a_{nn}^{(1)} & b_n^{(1)} \end{bmatrix}. \tag{10.35}$$

This corresponds to $A_b^{(1)} = E_1 P_1 A_b$, where $P_1$ is the permutation matrix that exchanges rows 1 and $m$ ($P_1 = I$ if no exchange is made) and $E_1$ is the matrix that performs the elimination of the entries below the first element in the first column. The same three steps can now be applied to the smaller $(n-1)\times n$ matrix

$$\begin{bmatrix} a_{22}^{(1)} & \cdots & a_{2n}^{(1)} & b_2^{(1)}\\ \vdots & \ddots & \vdots & \vdots\\ a_{n2}^{(1)} & \cdots & a_{nn}^{(1)} & b_n^{(1)} \end{bmatrix}, \tag{10.36}$$

and so on. Doing this process $n-1$ times, we obtain the reduced, upper triangular system, which can be solved with backward substitution.

In matrix terms, the linear transformations in the Gaussian elimination process correspond to $A_b^{(k)} = E_k P_k A_b^{(k-1)}$, for $k = 1, 2, \ldots, n-1$ (with $A_b^{(0)} = A_b$), where the $P_k$ and $E_k$ are permutation and elimination matrices, respectively. $P_k = I$ if no row exchange is made prior to the $k$-th elimination round (but recall that we do not construct the matrices $E_k$ and $P_k$ in practice). Hence, the Gaussian elimination process for a nonsingular linear system produces the matrix factorization
$$U_b \equiv A_b^{(n-1)} = E_{n-1} P_{n-1} E_{n-2} P_{n-2} \cdots E_1 P_1 A_b. \tag{10.37}$$

Arguing as in the introductory example, we can rearrange the rows of $A_b$ with the permutation matrix $P = P_{n-1}\cdots P_1$ and the corresponding multipliers, as if we knew in advance the row exchanges that would be needed, to get
$$U_b \equiv A_b^{(n-1)} = E_{n-1}' E_{n-2}' \cdots E_1' P A_b. \tag{10.38}$$


Since the inverse of $E_{n-1}' E_{n-2}' \cdots E_1'$ is the lower triangular matrix
$$L = \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0\\ l_{21} & 1 & 0 & \cdots & 0\\ l_{31} & l_{32} & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \ddots & \vdots\\ l_{n1} & l_{n2} & \cdots & l_{n,n-1} & 1 \end{bmatrix}, \tag{10.39}$$

where the $l_{ij}$, $j = 1,\ldots,n-1$, $i = j+1,\ldots,n$, are the multipliers (computed after all the rows have been rearranged), we arrive at the anticipated factorization $PA = LU$. Incidentally, up to sign, Gaussian elimination also produces the determinant of $A$, because

$$\det(PA) = \pm\det(A) = \det(LU) = \det(U) = a_{11}^{(1)} a_{22}^{(2)} \cdots a_{nn}^{(n)} \tag{10.40}$$

and so $\det(A)$ is plus or minus the product of all the pivots in the elimination process.

In the implementation of Gaussian elimination, the array storing the augmented matrix $A_b$ is overwritten to save memory. The pseudocode with partial pivoting (assuming $a_{i,n+1} = b_i$, $i = 1,\ldots,n$) is presented in Algorithm 10.3.

10.2.1 The Cost of Gaussian Elimination

We now do an operation count of Gaussian elimination to solve an $n\times n$ linear system $Ax = b$.

We focus on the elimination, as we already know that the work for the backward substitution step is $O(n^2)$. For each round of elimination, $j = 1,\ldots,n-1$, we need one division to compute each of the $n-j$ multipliers, and $(n-j)(n-j+1)$ multiplications and $(n-j)(n-j+1)$ additions (subtractions) to perform the eliminations. Thus, the total number of operations is

$$W(n) = \sum_{j=1}^{n-1}\left[2(n-j)(n-j+1) + (n-j)\right] = \sum_{j=1}^{n-1}\left[2(n-j)^2 + 3(n-j)\right] \tag{10.41}$$

and using (10.10) and
$$\sum_{i=1}^{m} i^2 = \frac{m(m+1)(2m+1)}{6}, \tag{10.42}$$


Algorithm 10.3 Gaussian Elimination with Partial Pivoting
1: for j = 1, ..., n-1 do
2:   Find m such that |a_{mj}| = max_{j<=i<=n} |a_{ij}|
3:   if |a_{mj}| = 0 then
4:     stop                                          ▷ Matrix is singular
5:   end if
6:   a_{jk} <-> a_{mk}, k = j, ..., n+1              ▷ Exchange rows
7:   for i = j+1, ..., n do
8:     m <- a_{ij}/a_{jj}                            ▷ Compute multiplier
9:     a_{ik} <- a_{ik} - m*a_{jk}, k = j+1, ..., n+1   ▷ Elimination
10:    a_{ij} <- m                                   ▷ Store multiplier
11:  end for
12: end for
13: for i = n, n-1, ..., 1 do                        ▷ Backward substitution
14:   x_i <- (a_{i,n+1} - sum_{j=i+1}^{n} a_{ij} x_j)/a_{ii}
15: end for
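To make Algorithm 10.3 concrete, here is a minimal Python/NumPy sketch of the same procedure (function and variable names are ours, not from the text); the augmented matrix is overwritten during the elimination, as described above:

```python
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting.
    A sketch of Algorithm 10.3; the augmented matrix is overwritten."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(b)
    Ab = np.hstack([A, b.reshape(-1, 1)])   # augmented matrix

    for j in range(n - 1):
        # Partial pivoting: largest entry (in modulus) in column j
        m = j + np.argmax(np.abs(Ab[j:, j]))
        if Ab[m, j] == 0.0:
            raise ValueError("matrix is singular")
        Ab[[j, m]] = Ab[[m, j]]              # exchange rows j and m
        for i in range(j + 1, n):
            mult = Ab[i, j] / Ab[j, j]       # multiplier
            Ab[i, j + 1:] -= mult * Ab[j, j + 1:]
            Ab[i, j] = mult                  # store multiplier (column of L)

    # Backward substitution on the upper triangular part
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (Ab[i, n] - Ab[i, i + 1:n] @ x[i + 1:]) / Ab[i, i]
    return x

A = [[1, 2, -1, 1], [2, 4, 0, -1], [3, 1, -1, 1], [1, -1, 2, 1]]
b = [0, -3, 3, 3]
print(gauss_solve(A, b))  # expected: [ 1. -1.  0.  1.], as in (10.20)
```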

we get
$$W(n) = \frac{2}{3}n^3 + O(n^2). \tag{10.43}$$

Thus, Gaussian elimination is computationally rather expensive for large systems of equations.

10.3 LU and Choleski Factorizations

If Gaussian elimination can be performed without row interchanges, then we obtain an LU factorization of $A$, i.e. $A = LU$. This factorization can be advantageous when solving many linear systems with the same $n\times n$ matrix $A$ but different right hand sides, because we can turn the problem $Ax = b$ into two triangular linear systems, which can be solved much more economically in $O(n^2)$ operations. Indeed, from $LUx = b$ and setting $y = Ux$ we have

Ly = b, (10.44)

Ux = y. (10.45)


Given $b$, we can solve the first system for $y$ with forward substitution and then solve the second system for $x$ with backward substitution. Thus, while the LU factorization of $A$ has an $O(n^3)$ cost, subsequent solutions of linear systems with the same matrix $A$ but different right hand sides can be obtained in $O(n^2)$ operations.
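As an illustration, here is a minimal Python sketch of the two triangular solves (names are ours); it assumes $L$ is unit lower triangular and $U$ is upper triangular, e.g. as produced by the elimination above:

```python
import numpy as np

def solve_lu(L, U, b):
    """Given A = LU (L unit lower triangular, U upper triangular),
    solve Ax = b with one forward and one backward substitution,
    O(n^2) work per right hand side."""
    n = len(b)
    # Forward substitution: Ly = b
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]          # L[i, i] = 1
    # Backward substitution: Ux = y
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```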

When can we obtain the factorization $A = LU$? The following result provides a useful sufficient condition.

Theorem 10.1. Let $A$ be an $n\times n$ matrix whose leading principal submatrices $A_1,\ldots,A_n$ are all nonsingular. Then, there exist an $n\times n$ lower triangular matrix $L$, with ones on its diagonal, and an $n\times n$ upper triangular matrix $U$ such that $A = LU$, and this factorization is unique.

Proof. Since $A_1$ is nonsingular, $a_{11} \neq 0$ and $P_1 = I$. Suppose now that we do not need to exchange rows in steps $2,\ldots,k-1$, so that $A^{(k-1)} = E_{k-1}\cdots E_2 E_1 A$, that is,

$$\begin{bmatrix}
a_{11} & \cdots & a_{1k} & \cdots & a_{1n}\\
 & \ddots & \vdots & & \vdots\\
 & & a_{kk}^{(k-1)} & \cdots & a_{kn}^{(k-1)}\\
 & & \vdots & & \vdots\\
 & & a_{nk}^{(k-1)} & \cdots & a_{nn}^{(k-1)}
\end{bmatrix}
=
\begin{bmatrix}
1 & & & &\\
-m_{21} & \ddots & & &\\
\vdots & & 1 & &\\
\vdots & & & \ddots &\\
-m_{n1} & \cdots & & & 1
\end{bmatrix}
\begin{bmatrix}
a_{11} & \cdots & a_{1k} & \cdots & a_{1n}\\
\vdots & \ddots & & & \vdots\\
a_{k1} & & a_{kk} & \cdots & a_{kn}\\
\vdots & & \vdots & & \vdots\\
a_{n1} & \cdots & a_{nk} & \cdots & a_{nn}
\end{bmatrix}.$$

The determinant of the $k\times k$ leading principal submatrix on the left is $a_{11} a_{22}^{(2)} \cdots a_{kk}^{(k-1)}$, and this is equal to the determinant of the product of the corresponding $k\times k$ leading blocks on the right hand side. Since the determinant of the first such block is one (it is a lower triangular matrix with ones on the diagonal), it follows that
$$a_{11} a_{22}^{(2)} \cdots a_{kk}^{(k-1)} = \det(A_k) \neq 0, \tag{10.46}$$

which implies that $a_{kk}^{(k-1)} \neq 0$, and so $P_k = I$; we conclude that $U = E_{n-1}\cdots E_1 A$ and therefore $A = LU$.

Let us now show that this decomposition is unique. Suppose $A = L_1 U_1 = L_2 U_2$; then
$$L_2^{-1} L_1 = U_2 U_1^{-1}. \tag{10.47}$$


But the matrix on the left hand side is lower triangular (with ones on its diagonal) whereas the one on the right hand side is upper triangular. Therefore $L_2^{-1} L_1 = I = U_2 U_1^{-1}$, which implies that $L_2 = L_1$ and $U_2 = U_1$. □

An immediate consequence of this result is that Gaussian elimination can be performed without row interchanges for an SDD matrix, as each of its leading principal submatrices is itself SDD, and for a positive definite matrix, as each of its leading principal submatrices is itself positive definite, and hence nonsingular in both cases.

Corollary 1. Let $A$ be an $n\times n$ matrix. Then $A = LU$, where $L$ is an $n\times n$ lower triangular matrix with ones on its diagonal and $U$ is an $n\times n$ upper triangular matrix, if either

(a) A is SDD or

(b) A is symmetric positive definite.

In the case of a positive definite matrix, the number of operations can be cut roughly in half by exploiting symmetry to obtain a symmetric factorization $A = BB^T$, where $B$ is a lower triangular matrix with positive entries on its diagonal. This representation is called the Choleski factorization of the symmetric positive definite matrix $A$.

Theorem 10.2. Let $A$ be a symmetric positive definite matrix. Then, there is a unique lower triangular matrix $B$ with positive entries on its diagonal such that $A = BB^T$.

Proof. By Corollary 1, $A$ has an LU factorization. Moreover, from (10.46) it follows that all the pivots are positive and thus $u_{ii} > 0$ for all $i = 1,\ldots,n$. We can split the pivots evenly between $L$ and $U$ by letting $D = \mathrm{diag}(\sqrt{u_{11}},\ldots,\sqrt{u_{nn}})$ and writing $A = LDD^{-1}U = (LD)(D^{-1}U)$. Let $B = LD$ and $C = D^{-1}U$. Both matrices have diagonal elements $\sqrt{u_{11}},\ldots,\sqrt{u_{nn}}$, but $B$ is lower triangular while $C$ is upper triangular. Moreover, $A = BC$ and, because $A^T = A$, we have $C^T B^T = BC$, which implies
$$B^{-1} C^T = C (B^T)^{-1}. \tag{10.48}$$

The matrix on the left hand side is lower triangular with ones on its diagonal while the matrix on the right hand side is upper triangular, also with ones on its diagonal. Therefore, $B^{-1} C^T = I = C(B^T)^{-1}$ and thus $C = B^T$


and $A = BB^T$. To prove that this Choleski factorization is unique we go back to the LU factorization, which we know is unique (if we choose $L$ to have ones on its diagonal). Given $A = BB^T$, where $B$ is lower triangular with positive diagonal elements $b_{11},\ldots,b_{nn}$, we can write $A = B D_B^{-1} D_B B^T$, where $D_B = \mathrm{diag}(b_{11},\ldots,b_{nn})$. Then $L = B D_B^{-1}$ and $U = D_B B^T$ yield the unique LU factorization of $A$. Now suppose there is another Choleski factorization $A = CC^T$. Then by the uniqueness of the LU factorization we have
$$L = B D_B^{-1} = C D_C^{-1}, \tag{10.49}$$
$$U = D_B B^T = D_C C^T, \tag{10.50}$$
where $D_C = \mathrm{diag}(c_{11},\ldots,c_{nn})$. Equation (10.50) implies that $b_{ii}^2 = c_{ii}^2$ for $i = 1,\ldots,n$ and, since $b_{ii} > 0$ and $c_{ii} > 0$ for all $i$, then $D_C = D_B$ and consequently $C = B$. □

The Choleski factorization is usually written as $A = LL^T$ and is obtained by exploiting the lower triangular structure of $L$ and symmetry as follows. First, since $L = (l_{ij})$ is lower triangular, $l_{ij} = 0$ for $1 \le i < j \le n$ and thus
$$a_{ij} = \sum_{k=1}^{n} l_{ik} l_{jk} = \sum_{k=1}^{\min(i,j)} l_{ik} l_{jk}. \tag{10.51}$$

Now, because $A^T = A$, we only need $a_{ij}$ for $i \le j$, that is,
$$a_{ij} = \sum_{k=1}^{i} l_{ik} l_{jk}, \quad 1 \le i \le j \le n. \tag{10.52}$$

We can solve equations (10.52) to determine $L$, one column at a time. If we set $i = 1$ we get
$$\begin{aligned}
a_{11} &= l_{11}^2 \;\Rightarrow\; l_{11} = \sqrt{a_{11}},\\
a_{12} &= l_{11} l_{21},\\
&\;\;\vdots\\
a_{1n} &= l_{11} l_{n1}
\end{aligned}$$


and this allows us to get the first column of $L$. The second column is now found by using (10.52) for $i = 2$:
$$\begin{aligned}
a_{22} &= l_{21}^2 + l_{22}^2 \;\Rightarrow\; l_{22} = \sqrt{a_{22} - l_{21}^2},\\
a_{23} &= l_{21} l_{31} + l_{22} l_{32},\\
&\;\;\vdots\\
a_{2n} &= l_{21} l_{n1} + l_{22} l_{n2},
\end{aligned}$$

etc. Algorithm 10.4 gives the pseudocode for the Choleski factorization.

Algorithm 10.4 Choleski factorization
1: for i = 1, ..., n do                  ▷ Compute column i of L
2:   l_{ii} <- sqrt(a_{ii} - sum_{k=1}^{i-1} l_{ik}^2)
3:   for j = i+1, ..., n do
4:     l_{ji} <- (a_{ij} - sum_{k=1}^{i-1} l_{ik} l_{jk})/l_{ii}
5:   end for
6: end for
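A minimal Python sketch of Algorithm 10.4 (names are ours; it assumes $A$ is symmetric positive definite and performs no checks beyond the square root):

```python
import numpy as np

def choleski(A):
    """Choleski factorization A = L L^T of a symmetric positive
    definite matrix, computed one column at a time (Algorithm 10.4)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for i in range(n):
        L[i, i] = np.sqrt(A[i, i] - L[i, :i] @ L[i, :i])
        for j in range(i + 1, n):
            L[j, i] = (A[i, j] - L[i, :i] @ L[j, :i]) / L[i, i]
    return L

A = np.array([[4.0, 2.0], [2.0, 3.0]])
L = choleski(A)
print(np.allclose(L @ L.T, A))  # True
```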

10.4 Tridiagonal Linear Systems

If the matrix of coefficients $A$ has a tridiagonal structure

$$A = \begin{bmatrix}
a_1 & b_1 & & &\\
c_1 & a_2 & b_2 & &\\
 & \ddots & \ddots & \ddots &\\
 & & & & b_{n-1}\\
 & & & c_{n-1} & a_n
\end{bmatrix} \tag{10.53}$$

its LU factorization can be computed at an $O(n)$ cost, and the corresponding linear system can thus be solved efficiently.

Theorem 10.3. If $A$ is tridiagonal and all of its leading principal submatrices are nonsingular, then


$$\begin{bmatrix}
a_1 & b_1 & & &\\
c_1 & a_2 & b_2 & &\\
 & \ddots & \ddots & \ddots &\\
 & & & & b_{n-1}\\
 & & & c_{n-1} & a_n
\end{bmatrix}
=
\begin{bmatrix}
1 & & &\\
l_1 & 1 & &\\
 & \ddots & \ddots &\\
 & & l_{n-1} & 1
\end{bmatrix}
\begin{bmatrix}
m_1 & b_1 & & &\\
 & m_2 & b_2 & &\\
 & & \ddots & \ddots &\\
 & & & & b_{n-1}\\
 & & & & m_n
\end{bmatrix}, \tag{10.54}$$

where
$$m_1 = a_1, \tag{10.55}$$
$$l_j = c_j/m_j, \quad m_{j+1} = a_{j+1} - l_j b_j, \quad \text{for } j = 1,\ldots,n-1, \tag{10.56}$$
and this factorization is unique.

Proof. By Theorem 10.1 we know that $A$ has a unique LU factorization, where $L$ is unit lower triangular and $U$ is upper triangular. We will show that we can solve uniquely for $l_1,\ldots,l_{n-1}$ and $m_1,\ldots,m_n$ so that (10.54) holds. Equating the matrix product on the right hand side of (10.54), row by row, we get
$$\begin{aligned}
&\text{1st row:} && a_1 = m_1, \quad b_1 = b_1,\\
&\text{2nd row:} && c_1 = m_1 l_1, \quad a_2 = l_1 b_1 + m_2, \quad b_2 = b_2,\\
&&&\;\;\vdots\\
&(n-1)\text{-st row:} && c_{n-2} = m_{n-2} l_{n-2}, \quad a_{n-1} = l_{n-2} b_{n-2} + m_{n-1}, \quad b_{n-1} = b_{n-1},\\
&n\text{-th row:} && c_{n-1} = m_{n-1} l_{n-1}, \quad a_n = l_{n-1} b_{n-1} + m_n,
\end{aligned}$$

from which (10.55)-(10.56) follow. Of course, we need the $m_j$'s to be nonzero to use (10.56). We now prove this is the case.

Note that $m_{j+1} = a_{j+1} - l_j b_j = a_{j+1} - \frac{c_j}{m_j} b_j$. Therefore
$$m_{j+1} m_j = a_{j+1} m_j - b_j c_j, \quad \text{for } j = 1,\ldots,n-1. \tag{10.57}$$

Thus,
$$\det(A_1) = a_1 = m_1, \tag{10.58}$$
$$\det(A_2) = a_2 a_1 - c_1 b_1 = a_2 m_1 - b_1 c_1 = m_1 m_2. \tag{10.59}$$


We now use induction to show that $\det(A_k) = m_1 m_2\cdots m_k$. Suppose $\det(A_j) = m_1 m_2 \cdots m_j$ for $j = 1,\ldots,k-1$. Expanding by the last row we get
$$\det(A_k) = a_k \det(A_{k-1}) - b_{k-1} c_{k-1} \det(A_{k-2}) \tag{10.60}$$
and using the induction hypothesis and (10.57) it follows that
$$\det(A_k) = m_1 m_2 \cdots m_{k-2}\left[a_k m_{k-1} - b_{k-1} c_{k-1}\right] = m_1 \cdots m_k, \tag{10.61}$$
for $k = 1,\ldots,n$. Since $\det(A_k) \neq 0$ for $k = 1,\ldots,n$, the $m_1, m_2, \ldots, m_n$ are all nonzero. □

Algorithm 10.5 Tridiagonal solver
1: m_1 <- a_1
2: for j = 1, ..., n-1 do               ▷ Compute L and U
3:   l_j <- c_j/m_j
4:   m_{j+1} <- a_{j+1} - l_j*b_j
5: end for
6: y_1 <- d_1                           ▷ Forward substitution on Ly = d
7: for j = 2, ..., n do
8:   y_j <- d_j - l_{j-1}*y_{j-1}
9: end for
10: x_n <- y_n/m_n                      ▷ Backward substitution on Ux = y
11: for j = n-1, n-2, ..., 1 do
12:   x_j <- (y_j - b_j*x_{j+1})/m_j
13: end for
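A minimal Python sketch of Algorithm 10.5 (names are ours): the diagonal is `a`, the superdiagonal `b`, the subdiagonal `c`, and the right hand side `d`; it assumes all leading principal submatrices are nonsingular, so no pivoting is done:

```python
import numpy as np

def tridiag_solve(a, b, c, d):
    """O(n) tridiagonal solver following Algorithm 10.5."""
    n = len(a)
    m = np.zeros(n)        # pivots of U
    l = np.zeros(n - 1)    # multipliers of L
    m[0] = a[0]
    for j in range(n - 1):
        l[j] = c[j] / m[j]
        m[j + 1] = a[j + 1] - l[j] * b[j]
    # Forward substitution: Ly = d
    y = np.zeros(n)
    y[0] = d[0]
    for j in range(1, n):
        y[j] = d[j] - l[j - 1] * y[j - 1]
    # Backward substitution: Ux = y
    x = np.zeros(n)
    x[-1] = y[-1] / m[-1]
    for j in range(n - 2, -1, -1):
        x[j] = (y[j] - b[j] * x[j + 1]) / m[j]
    return x
```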

10.5 A 1D BVP: Deformation of an Elastic Beam

We saw in Section 5.5 an example of a very large system of equations in connection with the least squares problem for fitting high dimensional data. We now consider another example which leads to a large linear system of equations.

Suppose we have a thin beam of unit length, stretched horizontally and occupying the interval $[0,1]$. The beam is subjected to a load density $f(x)$


at each point $x \in [0,1]$ and is pinned at the end points. Let $u(x)$ be the beam deformation from the horizontal position. Assuming that the deformations are small (linear elasticity regime), $u$ satisfies

−u′′(x) + c(x)u(x) = f(x), 0 < x < 1, (10.62)

where $c(x) \ge 0$ is related to the elastic, material properties of the beam. Because the beam is pinned at the end points, we have the boundary conditions

u(0) = u(1) = 0. (10.63)

The system (10.62)-(10.63) is called a boundary value problem (BVP). That is, we need to find a function $u$ that satisfies the ordinary differential equation (10.62) and the boundary conditions (10.63) for any given, continuous $f$ and $c$. The condition $c(x) \ge 0$ guarantees existence and uniqueness of the solution to this problem.

We will construct a discrete model whose solution gives an accurate approximation to the exact solution at a finite collection of selected points (called nodes) in $[0,1]$. We take the nodes to be equally spaced and to include the interval end points (boundary). So we choose a positive integer $N$ and define the nodes or grid points

x0 = 0, x1 = h, x2 = 2h, . . . , xN = Nh, xN+1 = 1, (10.64)

where $h = 1/(N+1)$ is the grid size or node spacing. The nodes $x_1,\ldots,x_N$ are called interior nodes, because they lie inside the interval $[0,1]$, and the nodes $x_0$ and $x_{N+1}$ are called boundary nodes.

We now construct a discrete approximation to the ordinary differential equation by replacing the second derivative with a second order finite difference approximation. As we know,
$$u''(x_j) = \frac{u(x_{j+1}) - 2u(x_j) + u(x_{j-1})}{h^2} + O(h^2). \tag{10.65}$$

Neglecting the $O(h^2)$ error and denoting the approximation of $u(x_j)$ by $v_j$ (i.e. $v_j \approx u(x_j)$), $f_j = f(x_j)$, and $c_j = c(x_j)$, for $j = 1,\ldots,N$, we have at each interior node
$$\frac{-v_{j-1} + 2v_j - v_{j+1}}{h^2} + c_j v_j = f_j, \quad j = 1,2,\ldots,N, \tag{10.66}$$


and at the boundary nodes, applying (10.63), we have

v0 = vN+1 = 0. (10.67)

Thus, (10.66) is a linear system of $N$ equations in the $N$ unknowns $v_1,\ldots,v_N$, which we can write in matrix form as
$$\frac{1}{h^2}\begin{bmatrix}
2 + c_1 h^2 & -1 & & &\\
-1 & 2 + c_2 h^2 & -1 & &\\
 & \ddots & \ddots & \ddots &\\
 & & -1 & 2 + c_{N-1} h^2 & -1\\
 & & & -1 & 2 + c_N h^2
\end{bmatrix}
\begin{bmatrix} v_1\\ v_2\\ \vdots\\ \vdots\\ v_N \end{bmatrix}
=
\begin{bmatrix} f_1\\ f_2\\ \vdots\\ \vdots\\ f_N \end{bmatrix}. \tag{10.68}$$

The matrix of this system, let us call it $A$, is tridiagonal and symmetric. A direct computation shows that for an arbitrary, nonzero column vector $v = [v_1,\ldots,v_N]^T$ (with the convention $v_0 = v_{N+1} = 0$),

$$v^T A v = \sum_{j=0}^{N}\left(\frac{v_{j+1} - v_j}{h}\right)^2 + \sum_{j=1}^{N} c_j v_j^2 > 0, \quad \forall\, v \neq 0, \tag{10.69}$$

and therefore, since $c_j \ge 0$ for all $j$, $A$ is positive definite. Thus, there is a unique solution to (10.68) and it can be efficiently found with our tridiagonal solver, Algorithm 10.5. Since the expected numerical error is $O(h^2) = O(1/(N+1)^2)$, even a modest accuracy of $O(10^{-4})$ requires $N \approx 100$.
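A short Python sketch tying this together: it assembles the tridiagonal system (10.68) and calls the `tridiag_solve` sketch above. The test problem (with $c \equiv 1$ and exact solution $u(x) = \sin(\pi x)$) is our own choice, not from the text:

```python
import numpy as np

def solve_beam_bvp(f, c, N):
    """Finite difference solution of -u'' + c(x) u = f(x),
    u(0) = u(1) = 0, on N interior nodes via system (10.68)."""
    h = 1.0 / (N + 1)
    x = np.linspace(h, 1.0 - h, N)      # interior nodes x_1, ..., x_N
    diag = 2.0 / h**2 + c(x)            # (2 + c_j h^2)/h^2
    off = -np.ones(N - 1) / h**2        # -1/h^2 on the off-diagonals
    return x, tridiag_solve(diag, off, off, f(x))

# If u = sin(pi x) and c = 1, then f = (pi^2 + 1) sin(pi x).
x, v = solve_beam_bvp(lambda x: (np.pi**2 + 1) * np.sin(np.pi * x),
                      lambda x: np.ones_like(x), N=100)
print(np.max(np.abs(v - np.sin(np.pi * x))))  # small, decreasing as O(h^2)
```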

10.6 A 2D BVP: Dirichlet Problem for Poisson's Equation

We now look at a simple 2D BVP for an equation that is central to many applications, namely Poisson's equation. For concreteness, we can think of the equation as a model for small deformations $u$ of a stretched, square membrane fixed to a wire at its boundary and subject to a force density


$f$. Denoting by $\Omega$ and $\partial\Omega$ the unit square $[0,1]\times[0,1]$ and its boundary, respectively, the BVP is to find $u$ such that

−∆u(x, y) = f(x, y), for (x, y) ∈ Ω (10.70)

and

$$u(x,y) = 0, \quad \text{for } (x,y) \in \partial\Omega. \tag{10.71}$$

In (10.70), $\Delta u$ is the Laplacian of $u$, also denoted $\nabla^2 u$, and is given by
$$\Delta u = \nabla^2 u = u_{xx} + u_{yy} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}. \tag{10.72}$$

Equation (10.70) is Poisson’s equation (in 2D) and together with (10.71)specify a (homogeneous) Dirichlet problem because the value of u is given atthe boundary.

To construct a numerical approximation to (10.70)-(10.71), we proceed as in the previous 1D BVP example by discretizing the domain. For simplicity, we use uniformly spaced grid points. We choose a positive integer $N$ and define the grid points of our domain $\Omega = [0,1]\times[0,1]$ as

$$(x_i, y_j) = (ih, jh), \quad \text{for } i,j = 0,\ldots,N+1, \tag{10.73}$$

where $h = 1/(N+1)$. The interior nodes correspond to $1 \le i,j \le N$, and the boundary nodes are those corresponding to the remaining values of the indices $i$ and $j$ ($i$ or $j$ equal to 0, and $i$ or $j$ equal to $N+1$).

At each of the interior nodes we replace the Laplacian by its second order finite difference approximation, called the five-point discrete Laplacian:
$$\nabla^2 u(x_i, y_j) = \frac{u(x_{i-1}, y_j) + u(x_{i+1}, y_j) + u(x_i, y_{j-1}) + u(x_i, y_{j+1}) - 4u(x_i, y_j)}{h^2} + O(h^2). \tag{10.74}$$

Neglecting the $O(h^2)$ discretization error and denoting by $v_{ij}$ the approximation to $u(x_i, y_j)$, we get
$$-\frac{v_{i-1,j} + v_{i+1,j} + v_{i,j-1} + v_{i,j+1} - 4v_{ij}}{h^2} = f_{ij}, \quad \text{for } 1 \le i,j \le N. \tag{10.75}$$


This is a linear system of $N^2$ equations in the $N^2$ unknowns $v_{ij}$, $1 \le i,j \le N$. We have the freedom to order or label the unknowns any way we wish, and that will affect the structure of the matrix of coefficients of the linear system, but remarkably the matrix will be symmetric positive definite regardless of the ordering of the unknowns!

The most common labeling is the so-called lexicographical order, which proceeds from the bottom row to the top one, left to right: $v_{11}, v_{12}, \ldots, v_{1N}, v_{21}, \ldots$, etc. Denoting $v_1 = [v_{11}, v_{12}, \ldots, v_{1N}]^T$, $v_2 = [v_{21}, v_{22}, \ldots, v_{2N}]^T$, etc., and similarly for the right hand side $f$, the linear system (10.75) can be written in matrix form as
$$\begin{bmatrix}
T & -I & & &\\
-I & T & -I & &\\
 & \ddots & \ddots & \ddots &\\
 & & -I & T & -I\\
 & & & -I & T
\end{bmatrix}
\begin{bmatrix} v_1\\ v_2\\ \vdots\\ \vdots\\ v_N \end{bmatrix}
= h^2 \begin{bmatrix} f_1\\ f_2\\ \vdots\\ \vdots\\ f_N \end{bmatrix}. \tag{10.76}$$

Here, $I$ is the $N\times N$ identity matrix and $T$ is the $N\times N$ tridiagonal matrix
$$T = \begin{bmatrix}
4 & -1 & & &\\
-1 & 4 & -1 & &\\
 & \ddots & \ddots & \ddots &\\
 & & -1 & 4 & -1\\
 & & & -1 & 4
\end{bmatrix}. \tag{10.77}$$

Thus, the matrix of coefficients in (10.76) is sparse, i.e. the vast majority of


its entries are zeros. For example, for $N = 3$ this matrix is
$$\begin{bmatrix}
4 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0\\
-1 & 4 & -1 & 0 & -1 & 0 & 0 & 0 & 0\\
0 & -1 & 4 & 0 & 0 & -1 & 0 & 0 & 0\\
-1 & 0 & 0 & 4 & -1 & 0 & -1 & 0 & 0\\
0 & -1 & 0 & -1 & 4 & -1 & 0 & -1 & 0\\
0 & 0 & -1 & 0 & -1 & 4 & 0 & 0 & -1\\
0 & 0 & 0 & -1 & 0 & 0 & 4 & -1 & 0\\
0 & 0 & 0 & 0 & -1 & 0 & -1 & 4 & -1\\
0 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 4
\end{bmatrix}.$$

Gaussian elimination is hugely inefficient for a large system ($n > 100$) with a sparse matrix, as in this example. This is because the intermediate matrices in the elimination are generally dense, due to the fill-in introduced by the elimination process. To illustrate the high cost of Gaussian elimination: if we merely use $N = 100$ (this corresponds to a modest discretization error of $O(10^{-4})$), we end up with $n = N^2 = 10^4$ unknowns, and the cost of Gaussian elimination would be $O(10^{12})$ operations.
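As an aside, the matrix in (10.76) can be assembled in sparse format with the standard Kronecker-product identity $A = I \otimes K + K \otimes I$, where $K = \mathrm{tridiag}(-1, 2, -1)$; this identity is not discussed in the text, but it is a convenient way to reproduce the block structure above. A minimal Python/SciPy sketch (names and the constant load are our own choices):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def poisson2d(N):
    """Sparse N^2 x N^2 matrix of (10.76): blocktridiag(-I, T, -I)
    with T = tridiag(-1, 4, -1), built via Kronecker products."""
    K = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N), format="csr")
    I = sp.identity(N, format="csr")
    return sp.kron(I, K) + sp.kron(K, I)

N = 100
h = 1.0 / (N + 1)
A = poisson2d(N)                  # 10^4 unknowns, about 5 nonzeros per row
f = np.ones(N * N)                # constant load, as an illustration
v = spsolve(A.tocsc(), h**2 * f)  # sparse direct solve
print(A.nnz, v.max())
```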

10.7 Linear Iterative Methods for Ax = b

As we have seen, Gaussian elimination is an expensive procedure for large linear systems of equations. An alternative is to seek not an exact (up to roundoff error) solution in a finite number of steps, but an approximation to the solution obtained from an iterative procedure applied to an initial guess $x^{(0)}$.

We are going to consider first a class of iterative methods where the central idea is to write the matrix $A$ as the sum of a nonsingular matrix $M$, whose corresponding system is easy to solve, and a remainder $-N = A - M$, so that the system $Ax = b$ is transformed into the equivalent system

Mx = Nx+ b. (10.78)

Starting with an initial guess $x^{(0)}$, (10.78) defines a sequence of approximations generated by

Mx(k+1) = Nx(k) + b, k = 0, 1, . . . (10.79)

The main questions are


1. When does this iteration converge?

2. What determines its rate of convergence?

3. What is the computational cost?

But first we look at three concrete iterative methods of the form (10.79). Unless otherwise stated, $A$ is assumed to be a nonsingular $n\times n$ matrix and $b$ a given $n$-column vector.

10.8 Jacobi, Gauss-Seidel, and S.O.R.

If all the diagonal elements of $A$ are nonzero, we can take $M = \mathrm{diag}(A)$, and then at each iteration (i.e. for each $k$) the linear system (10.79) can be easily solved to obtain the next iterate $x^{(k+1)}$. Note that we do not need to compute $M^{-1}$, nor do we need to form the matrix product $M^{-1}N$ (due to its cost, this should be avoided). We just need to solve the linear system with the matrix $M$, which in this case is trivial to do: we solve the first equation for the first unknown, the second equation for the second unknown, etc., and we obtain the so-called Jacobi iteration:

$$x_i^{(k+1)} = \frac{b_i - \displaystyle\sum_{\substack{j=1\\ j\neq i}}^{n} a_{ij} x_j^{(k)}}{a_{ii}}, \quad i = 1,2,\ldots,n, \quad k = 0,1,\ldots \tag{10.80}$$

The iteration can be stopped when
$$\frac{\|x^{(k+1)} - x^{(k)}\|_\infty}{\|x^{(k+1)}\|_\infty} \le \text{Tolerance}. \tag{10.81}$$
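A minimal Python sketch of the Jacobi iteration (10.80) with the stopping criterion (10.81) (names are ours; $M = \mathrm{diag}(A)$ is never inverted as a matrix, we just divide componentwise):

```python
import numpy as np

def jacobi(A, b, x0, tol=1e-8, maxit=1000):
    """Jacobi iteration (10.80); stops via criterion (10.81)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    d = np.diag(A)                  # diagonal of A (assumed nonzero)
    R = A - np.diag(d)              # off-diagonal part
    x = np.asarray(x0, dtype=float)
    for k in range(maxit):
        x_new = (b - R @ x) / d
        if np.max(np.abs(x_new - x)) <= tol * np.max(np.abs(x_new)):
            return x_new, k + 1
        x = x_new
    return x, maxit
```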

Example 10.1. Consider the $4\times 4$ linear system
$$\begin{aligned}
10x_1 - x_2 + 2x_3 &= 6,\\
-x_1 + 11x_2 - x_3 + 3x_4 &= 25,\\
2x_1 - x_2 + 10x_3 - x_4 &= -11,\\
3x_2 - x_3 + 8x_4 &= 15.
\end{aligned}\tag{10.82}$$


It has the unique solution (1,2,-1,1). Jacobi’s iteration for this system is

$$\begin{aligned}
x_1^{(k+1)} &= \tfrac{1}{10} x_2^{(k)} - \tfrac{1}{5} x_3^{(k)} + \tfrac{3}{5},\\
x_2^{(k+1)} &= \tfrac{1}{11} x_1^{(k)} + \tfrac{1}{11} x_3^{(k)} - \tfrac{3}{11} x_4^{(k)} + \tfrac{25}{11},\\
x_3^{(k+1)} &= -\tfrac{1}{5} x_1^{(k)} + \tfrac{1}{10} x_2^{(k)} + \tfrac{1}{10} x_4^{(k)} - \tfrac{11}{10},\\
x_4^{(k+1)} &= -\tfrac{3}{8} x_2^{(k)} + \tfrac{1}{8} x_3^{(k)} + \tfrac{15}{8}.
\end{aligned}\tag{10.83}$$

Starting with $x^{(0)} = [0, 0, 0, 0]^T$ we obtain
$$x^{(1)} = \begin{bmatrix} 0.60000000\\ 2.27272727\\ -1.10000000\\ 1.87500000 \end{bmatrix}, \quad
x^{(2)} = \begin{bmatrix} 1.04727273\\ 1.71590909\\ -0.80522727\\ 0.88522727 \end{bmatrix}, \quad
x^{(3)} = \begin{bmatrix} 0.93263636\\ 2.05330579\\ -1.04934091\\ 1.13088068 \end{bmatrix}. \tag{10.84}$$

In the Jacobi iteration, when we evaluate $x_2^{(k+1)}$ we already have $x_1^{(k+1)}$ available. When we evaluate $x_3^{(k+1)}$ we already have $x_1^{(k+1)}$ and $x_2^{(k+1)}$ available, and so on. If we update the Jacobi iteration with the already computed components of $x^{(k+1)}$, we obtain the Gauss-Seidel iteration:

$$x_i^{(k+1)} = \frac{b_i - \displaystyle\sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \displaystyle\sum_{j=i+1}^{n} a_{ij} x_j^{(k)}}{a_{ii}}, \quad i = 1,2,\ldots,n, \quad k = 0,1,\ldots \tag{10.85}$$

The Gauss-Seidel iteration is equivalent to the iteration obtained by taking $M$ to be the lower triangular part of the matrix $A$, including its diagonal.

Example 10.2. For the system (10.82), starting again with the initial guess $[0,0,0,0]^T$, Gauss-Seidel produces the following approximations:
$$x^{(1)} = \begin{bmatrix} 0.60000000\\ 2.32727273\\ -0.98727273\\ 0.87886364 \end{bmatrix}, \quad
x^{(2)} = \begin{bmatrix} 1.03018182\\ 2.03693802\\ -1.0144562\\ 0.98434122 \end{bmatrix}, \quad
x^{(3)} = \begin{bmatrix} 1.00658504\\ 2.00355502\\ -1.00252738\\ 0.99835095 \end{bmatrix}. \tag{10.86}$$


In an attempt to accelerate the convergence of the Gauss-Seidel iteration, one can also put some weight on the diagonal part of $A$ and split it between the matrices $M$ and $N$ of the iterative method (10.79). Specifically, we can write
$$\mathrm{diag}(A) = \frac{1}{\omega}\,\mathrm{diag}(A) - \frac{1-\omega}{\omega}\,\mathrm{diag}(A), \tag{10.87}$$

where the first term on the right hand side goes into $M$ and the last into $N$. The Gauss-Seidel method then becomes
$$x_i^{(k+1)} = \frac{a_{ii} x_i^{(k)} - \omega\left[\displaystyle\sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} + \displaystyle\sum_{j=i}^{n} a_{ij} x_j^{(k)} - b_i\right]}{a_{ii}}, \quad i = 1,2,\ldots,n, \quad k = 0,1,\ldots \tag{10.88}$$

Note that $\omega = 1$ corresponds to Gauss-Seidel. This iteration is generically called S.O.R. (successive over-relaxation), even though we speak of over-relaxation only when $\omega > 1$ and of under-relaxation when $\omega < 1$. It can be proved that a necessary condition for convergence is $0 < \omega < 2$.
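A minimal Python sketch of the S.O.R. sweep (10.88), with $\omega = 1$ recovering Gauss-Seidel (names are ours); updating the components of `x` in place means the newest values $x_j^{(k+1)}$, $j < i$, are used automatically:

```python
import numpy as np

def sor(A, b, x0, omega=1.0, tol=1e-8, maxit=1000):
    """S.O.R. iteration (10.88); omega = 1.0 gives Gauss-Seidel."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(b)
    x = np.array(x0, dtype=float)
    for k in range(maxit):
        x_old = x.copy()
        for i in range(n):
            s = A[i, :] @ x - A[i, i] * x[i]   # sum_{j != i} a_ij x_j
            x[i] = (1 - omega) * x[i] + omega * (b[i] - s) / A[i, i]
        if np.max(np.abs(x - x_old)) <= tol * np.max(np.abs(x)):
            return x, k + 1
    return x, maxit

# Example 10.1/10.2: omega = 1.0 reproduces the Gauss-Seidel iterates (10.86).
A = np.array([[10., -1., 2., 0.], [-1., 11., -1., 3.],
              [2., -1., 10., -1.], [0., 3., -1., 8.]])
b = np.array([6., 25., -11., 15.])
x, its = sor(A, b, np.zeros(4), omega=1.0)
print(x, its)  # approximately [1, 2, -1, 1]
```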

10.9 Convergence of Linear Iterative Methods

To study the convergence of iterative methods of the form $Mx^{(k+1)} = Nx^{(k)} + b$, $k = 0,1,\ldots$, we use the equivalent iteration

x(k+1) = Tx(k) + c, k = 0, 1, . . . (10.89)

where

T = M−1N = I −M−1A (10.90)

is called the iteration matrix, and $c = M^{-1}b$.

The issue of convergence is that of the existence of a fixed point for the map $F(x) = Tx + c$, defined for all $x \in \mathbb{R}^n$. That is, whether or not there is an $x \in \mathbb{R}^n$ such that $F(x) = x$. For if the sequence defined in (10.89) converges to a vector $x$, then, by continuity of $F$, we would have $x = Tx + c = F(x)$. For any $x, y \in \mathbb{R}^n$ and for any induced matrix norm we have
$$\|F(x) - F(y)\| = \|Tx - Ty\| \le \|T\|\,\|x - y\|. \tag{10.91}$$

‖F (x)− F (y)‖ = ‖Tx− Ty‖ ≤ ‖T‖ ‖x− y‖. (10.91)


If $\|T\| < 1$ for some induced norm, $F$ is a contracting map, or contraction, and we will show that this guarantees the existence of a unique fixed point. We will also show that the rate of convergence of the sequence generated by iterative methods of the form (10.89) is governed by the spectral radius $\rho(T)$ of the iteration matrix $T$. These conclusions will follow from the following result.

Theorem 10.4. Let $T$ be an $n\times n$ matrix. Then the following statements are equivalent:

(a) $\lim_{k\to\infty} T^k = 0$.

(b) $\lim_{k\to\infty} T^k x = 0$ for all $x \in \mathbb{R}^n$.

(c) $\rho(T) < 1$.

(d) $\|T\| < 1$ for at least one induced norm.

Proof. (a) ⇒ (b): For any induced norm we have
$$\|T^k x\| \le \|T^k\|\,\|x\| \tag{10.92}$$
and so if $T^k \to 0$ as $k \to \infty$, then $\|T^k x\| \to 0$, that is, $T^k x \to 0$ for all $x \in \mathbb{R}^n$.

(b) ⇒ (c): Suppose that $\lim_{k\to\infty} T^k x = 0$ for all $x \in \mathbb{R}^n$ but that $\rho(T) \ge 1$. Then there is an eigenvector $v$ such that $Tv = \lambda v$ with $|\lambda| \ge 1$, and the sequence $T^k v = \lambda^k v$ does not converge to zero, which is a contradiction.

(c) ⇒ (d): By Theorem 9.5, for each $\varepsilon > 0$ there is at least one induced norm $\|\cdot\|$ such that $\|T\| \le \rho(T) + \varepsilon$, from which the statement follows.

(d) ⇒ (a): This follows immediately from $\|T^k\| \le \|T\|^k$. □

Theorem 10.5. The iterative method (10.89) converges for any initial guess $x^{(0)}$ if and only if $\rho(T) < 1$, or equivalently, if and only if $\|T\| < 1$ for at least one induced norm.

Proof. Let x be the exact solution of Ax = b. Then

$$x - x^{(1)} = Tx + c - (Tx^{(0)} + c) = T(x - x^{(0)}), \tag{10.93}$$


from which it follows that the error of the $k$-th iterate, $e_k = x^{(k)} - x$, satisfies
$$e_k = T^k e_0, \tag{10.94}$$
for $k = 1, 2, \ldots$, where $e_0 = x^{(0)} - x$ is the error of the initial guess. The conclusion now follows immediately from Theorem 10.4. □

The spectral radius $\rho(T)$ of the iteration matrix $T$ measures the rate of convergence of the method. For if $T$ is normal, then $\|T\|_2 = \rho(T)$ and from (10.94) we get
$$\|e_k\|_2 \le \rho(T)^k \|e_0\|_2. \tag{10.95}$$
But for each $k$ we can find a vector $e_0$ for which equality holds, so $\rho(T)^k\|e_0\|_2$ is a least upper bound for the error $\|e_k\|_2$. If $T$ is not normal, the following result shows that, asymptotically, $\|T^k\| \approx \rho(T)^k$ for any matrix norm.

Theorem 10.6. Let $T$ be any $n\times n$ matrix. Then, for any matrix norm $\|\cdot\|$,
$$\lim_{k\to\infty}\|T^k\|^{1/k} = \rho(T). \tag{10.96}$$

Proof. We know that $\rho(T^k) = \rho(T)^k$ and that $\rho(T) \le \|T\|$. Therefore
$$\rho(T) \le \|T^k\|^{1/k}. \tag{10.97}$$

Now, for any given $\varepsilon > 0$, construct the matrix $T_\varepsilon = T/(\rho(T) + \varepsilon)$. Then $\lim_{k\to\infty} T_\varepsilon^k = 0$, as $\rho(T_\varepsilon) < 1$. Therefore, there is an integer $K_\varepsilon$ such that
$$\|T_\varepsilon^k\| = \frac{\|T^k\|}{(\rho(T) + \varepsilon)^k} \le 1, \quad \text{for all } k \ge K_\varepsilon. \tag{10.98}$$

Thus, for all $k \ge K_\varepsilon$ we have
$$\rho(T) \le \|T^k\|^{1/k} \le \rho(T) + \varepsilon, \tag{10.99}$$
from which the result follows. □

Theorem 10.7. Let $A$ be an $n\times n$ strictly diagonally dominant matrix. Then, for any initial guess $x^{(0)} \in \mathbb{R}^n$:

(a) The Jacobi iteration converges to the exact solution of Ax = b.


(b) The Gauss-Seidel iteration converges to the exact solution of Ax = b.

Proof. (a) The Jacobi iteration matrix $T$ has entries $T_{ii} = 0$ and $T_{ij} = -a_{ij}/a_{ii}$ for $i \neq j$. Therefore,
$$\|T\|_\infty = \max_{1\le i\le n}\sum_{\substack{j=1\\ j\neq i}}^{n}\left|\frac{a_{ij}}{a_{ii}}\right| = \max_{1\le i\le n}\frac{1}{|a_{ii}|}\sum_{j=1, j\neq i}^{n}|a_{ij}| < 1. \tag{10.100}$$

(b) We will prove that $\rho(T) < 1$ for the Gauss-Seidel iteration. Let $x$ be an eigenvector of $T$ with eigenvalue $\lambda$, normalized to have $\|x\|_\infty = 1$. Recall that $T = I - M^{-1}A$. Then $Tx = \lambda x$ implies $Mx - Ax = \lambda Mx$, from which we get

$$-\sum_{j=i+1}^{n} a_{ij} x_j = \lambda\sum_{j=1}^{i} a_{ij} x_j = \lambda a_{ii} x_i + \lambda\sum_{j=1}^{i-1} a_{ij} x_j. \tag{10.101}$$

Now choose $i$ such that $\|x\|_\infty = |x_i| = 1$; then
$$|\lambda|\,|a_{ii}| \le |\lambda|\sum_{j=1}^{i-1}|a_{ij}| + \sum_{j=i+1}^{n}|a_{ij}|,$$
$$|\lambda| \le \frac{\displaystyle\sum_{j=i+1}^{n}|a_{ij}|}{|a_{ii}| - \displaystyle\sum_{j=1}^{i-1}|a_{ij}|} < \frac{\displaystyle\sum_{j=i+1}^{n}|a_{ij}|}{\displaystyle\sum_{j=i+1}^{n}|a_{ij}|} = 1,$$
where the last inequality follows because $A$ is SDD. Thus, $|\lambda| < 1$ and so $\rho(T) < 1$. □

Theorem 10.8. A necessary condition for convergence of the S.O.R. iteration is $0 < \omega < 2$.

Proof. We will show that $\det(T) = (1-\omega)^n$. Because $\det(T)$ is equal, up to a sign, to the product of the eigenvalues of $T$, we have $|\det(T)| \le \rho^n(T)$, and this implies that
$$\rho(T) \ge |1 - \omega|. \tag{10.102}$$


Since $\rho(T) < 1$ is required for convergence, the conclusion follows. Now, $T = M^{-1}[M - A]$ and $\det(T) = \det(M^{-1})\det(M - A)$. From the definition of the S.O.R. iteration (10.88) we get

$$a_{ii} x_i^{(k+1)} + \omega\sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} = a_{ii} x_i^{(k)} - \omega\sum_{j=i}^{n} a_{ij} x_j^{(k)} + \omega b_i. \tag{10.103}$$

Therefore, $M$ is lower triangular with a diagonal equal to that of $A$. Consequently, $\det(M^{-1}) = \det(\mathrm{diag}(A)^{-1})$. Similarly, $\det(M - A) = \det((1-\omega)\,\mathrm{diag}(A))$. Thus,
$$\det(T) = \det(M^{-1})\det(M - A) = \det(\mathrm{diag}(A)^{-1})\det((1-\omega)\,\mathrm{diag}(A)) = \det((1-\omega)I) = (1-\omega)^n. \tag{10.104}$$ □

If $A$ is positive definite, S.O.R. converges for any initial guess. However, as we will see, there are more efficient iterative methods for positive definite linear systems.


Chapter 11

Linear Systems of Equations II

In this chapter we focus on some numerical methods for the solution of large linear systems $Ax = b$, where $A$ is a sparse, symmetric positive definite matrix. We also look briefly at the non-symmetric case.

11.1 Positive Definite Linear Systems as an Optimization Problem

Suppose that $A$ is an $n\times n$ symmetric, positive definite matrix and we are interested in solving $Ax = b$. Let $\bar{x}$ be the unique, exact solution of $Ax = b$. Since $A$ is positive definite, we can define the norm

$$\|x\|_A = \sqrt{x^T A x}. \tag{11.1}$$

Henceforth we denote the inner product of two vectors $x, y \in \mathbb{R}^n$ by $\langle x, y\rangle$, i.e.
$$\langle x, y\rangle = x^T y = \sum_{i=1}^{n} x_i y_i. \tag{11.2}$$

Consider now the quadratic function of $x \in \mathbb{R}^n$ defined by
$$J(x) = \frac{1}{2}\|x - \bar{x}\|_A^2. \tag{11.3}$$

Note that $J(x) \ge 0$, and $J(x) = 0$ if and only if $x = \bar{x}$, because $A$ is positive definite. Therefore, $x$ minimizes $J$ if and only if $x = \bar{x}$. In optimization, the


function to be minimized (maximized), $J$ in our case, is called the objective function.

For several optimization methods it is useful to consider the one-dimensional problem of minimizing $J$ along a fixed direction. For given $x, v \in \mathbb{R}^n$ we consider the so-called line minimization problem, consisting of minimizing $J$ along the line that passes through $x$ in the direction of $v$, i.e.

$$\min_{t\in\mathbb{R}} J(x + tv). \tag{11.4}$$

Denoting $g(t) = J(x + tv)$ and using the definition (11.3) of $J$, we get
$$\begin{aligned}
g(t) &= \frac{1}{2}\langle x - \bar{x} + tv,\, A(x - \bar{x} + tv)\rangle\\
&= J(x) + \langle x - \bar{x}, Av\rangle\, t + \frac{1}{2}\langle v, Av\rangle\, t^2\\
&= J(x) + \langle Ax - b, v\rangle\, t + \frac{1}{2}\langle v, Av\rangle\, t^2.
\end{aligned}\tag{11.5}$$

This is a parabola opening upward, because $\langle v, Av\rangle > 0$ for all $v \neq 0$. Thus, its minimum is attained at the critical point $t^*$:
$$0 = g'(t^*) = -\langle v, b - Ax\rangle + t^*\langle v, Av\rangle, \tag{11.6}$$

that is,
$$t^* = \frac{\langle v, b - Ax\rangle}{\langle v, Av\rangle}, \tag{11.7}$$

and the minimum of $J$ along the line $x + tv$, $t \in \mathbb{R}$, is
$$g(t^*) = J(x) - \frac{1}{2}\frac{\langle v, b - Ax\rangle^2}{\langle v, Av\rangle}. \tag{11.8}$$

Finally, using the definition of $\|\cdot\|_A$ and $A\bar{x} = b$, we have
$$\frac{1}{2}\|x - \bar{x}\|_A^2 = \frac{1}{2}\|x\|_A^2 - \langle b, x\rangle + \frac{1}{2}\|\bar{x}\|_A^2 \tag{11.9}$$
and so it follows that
$$\nabla J(x) = Ax - b. \tag{11.10}$$


11.2 Line Search Methods

We just saw in the previous section that the problem of solving $Ax = b$, when $A$ is a symmetric positive definite matrix, is equivalent to the convex minimization problem for the quadratic objective function $J(x) = \frac{1}{2}\|x - \bar{x}\|_A^2$. An important class of methods for this type of optimization problem is called line search methods.

Line search methods produce a sequence of approximations to the minimizer of the form

x(k+1) = x(k) + tkv(k), k = 0, 1, . . . , (11.11)

where the vector $v^{(k)}$ and the scalar $t_k$ are called the search direction and the step length at the $k$-th iteration, respectively. The question then is how to select the search directions and the step lengths so as to converge to the minimizer. Most line search methods are of descent type, because they require that the value of $J$ decrease with each iteration. Going back to (11.5), this means that descent line search methods impose the condition $\langle v^{(k)}, \nabla J(x^{(k)})\rangle < 0$.

Starting with an initial guess $x^{(0)}$, line search methods generate
$$x^{(1)} = x^{(0)} + t_0 v^{(0)}, \tag{11.12}$$
$$x^{(2)} = x^{(1)} + t_1 v^{(1)} = x^{(0)} + t_0 v^{(0)} + t_1 v^{(1)}, \tag{11.13}$$
etc., so that the $k$-th element of the sequence is $x^{(0)}$ plus a linear combination of $v^{(0)}, v^{(1)}, \ldots, v^{(k-1)}$:
$$x^{(k)} = x^{(0)} + t_0 v^{(0)} + t_1 v^{(1)} + \cdots + t_{k-1} v^{(k-1)}. \tag{11.14}$$

That is,
$$\left(x^{(k)} - x^{(0)}\right) \in \mathrm{span}\{v^{(0)}, v^{(1)}, \ldots, v^{(k-1)}\}. \tag{11.15}$$

Unless otherwise noted, we will take the step length $t_k$ to be given by the one-dimensional minimizer (11.7) evaluated at the $k$-th step, i.e.

$$t_k = \frac{\langle v^{(k)}, r^{(k)}\rangle}{\langle v^{(k)}, Av^{(k)}\rangle}, \tag{11.16}$$

where
$$r^{(k)} = b - Ax^{(k)} \tag{11.17}$$
is the residual of the linear equation $Ax = b$ associated with the approximation $x^{(k)}$.


11.2.1 Steepest Descent

One way to guarantee a decrease of $J(x) = \frac{1}{2}\|x - \bar{x}\|_A^2$ at every step of a line search method is to choose $v^{(k)} = -\nabla J(x^{(k)})$, which is locally the direction of fastest decrease of $J$. Recalling that $\nabla J(x^{(k)}) = -r^{(k)}$, we take $v^{(k)} = r^{(k)}$. The optimal step length is selected according to (11.16), so that we choose the line minimizer of $J$ in the direction of $-\nabla J(x^{(k)})$. The resulting method is called steepest descent and, starting from an initial guess $x^{(0)}$, is given by

$$t_k = \frac{\langle r^{(k)}, r^{(k)}\rangle}{\langle r^{(k)}, Ar^{(k)}\rangle}, \tag{11.18}$$
$$x^{(k+1)} = x^{(k)} + t_k r^{(k)}, \tag{11.19}$$
$$r^{(k+1)} = r^{(k)} - t_k Ar^{(k)}, \tag{11.20}$$

for $k = 0, 1, \ldots$. Formula (11.20), which comes from subtracting $A$ times (11.19) from $b$, is preferable to using the definition of the residual, i.e. $r^{(k+1)} = b - Ax^{(k+1)}$, due to round-off errors.

If $A$ is an $n\times n$ diagonal, positive definite matrix, the steepest descent method finds the minimizer in at most $n$ steps. This is easy to visualize for $n = 2$, as the level sets of $J$ are ellipses with their principal axes aligned with the coordinate axes. For a general, non-diagonal positive definite matrix $A$, convergence of the steepest descent sequence to the minimizer of $J$, and hence to the solution of $Ax = b$, is guaranteed, but it may not be reached in a finite number of steps.
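A minimal Python sketch of steepest descent, (11.18)-(11.20), for a symmetric positive definite matrix (names are ours); note the residual is updated recursively, so there is one matrix-vector product per iteration:

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, maxit=10000):
    """Steepest descent for SPD A, following (11.18)-(11.20)."""
    x = np.array(x0, dtype=float)
    r = b - A @ x
    for k in range(maxit):
        if np.linalg.norm(r) <= tol:
            return x, k
        Ar = A @ r
        t = (r @ r) / (r @ Ar)    # optimal step length (11.18)
        x = x + t * r             # (11.19)
        r = r - t * Ar            # (11.20)
    return x, maxit
```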

11.3 The Conjugate Gradient Method

The steepest descent method uses an optimal search direction locally but not globally and, as a result, it converges in general very slowly to the minimizer.

A key strategy to accelerate convergence in line search methods is to widen our search space by considering the previous search directions, not just the current one. Obviously, we would like the $v^{(k)}$'s to be linearly independent.

Recall that $\left(x^{(k)} - x^{(0)}\right) \in \mathrm{span}\{v^{(0)}, v^{(1)}, \ldots, v^{(k-1)}\}$. We are going to denote
$$V_k = \mathrm{span}\{v^{(0)}, v^{(1)}, \ldots, v^{(k-1)}\} \tag{11.21}$$
and write $x \in x^{(0)} + V_k$ to mean that $x = x^{(0)} + v$ with $v \in V_k$.


The idea is to select $v^{(0)}, v^{(1)}, \ldots, v^{(k-1)}$ such that
$$x^{(k)} = \underset{x \in x^{(0)} + V_k}{\operatorname{arg\,min}} \|x - \bar{x}\|_A^2. \tag{11.22}$$

If the search directions are linearly independent, our search space grows as $k$ increases, so the minimizer will be found in at most $n$ steps, when $V_n = \mathbb{R}^n$.

Let us derive a condition for the minimizer of $J(x) = \frac{1}{2}\|x - \bar{x}\|_A^2$ in $x^{(0)} + V_k$. Suppose $x^{(k)} \in x^{(0)} + V_k$. Then, there are scalars $c_0, c_1, \ldots, c_{k-1}$ such that
$$x^{(k)} = x^{(0)} + c_0 v^{(0)} + c_1 v^{(1)} + \cdots + c_{k-1} v^{(k-1)}. \tag{11.23}$$

For fixed $v^{(0)}, v^{(1)}, \ldots, v^{(k-1)}$, define the following function of $c_0, c_1, \ldots, c_{k-1}$:
$$G(c_0, c_1, \ldots, c_{k-1}) := J\left(x^{(0)} + c_0 v^{(0)} + c_1 v^{(1)} + \cdots + c_{k-1} v^{(k-1)}\right). \tag{11.24}$$

Because $J$ is a quadratic function, the minimizer of $G$ is the critical point $c_0^*, c_1^*, \ldots, c_{k-1}^*$:
$$\frac{\partial G}{\partial c_j}(c_0^*, c_1^*, \ldots, c_{k-1}^*) = 0, \quad j = 0, \ldots, k-1. \tag{11.25}$$

But by the chain rule,
$$0 = \frac{\partial G}{\partial c_j} = \nabla J \cdot v^{(j)} = -\langle r^{(k)}, v^{(j)}\rangle, \quad j = 0, 1, \ldots, k-1. \tag{11.26}$$

We have proved the following theorem.

Theorem 11.1. The vector $x^{(k)} \in x^{(0)} + V_k$ minimizes $\|x - \bar{x}\|_A^2$ over $x^{(0)} + V_k$, for $k = 0, 1, \ldots$, if and only if
$$\langle r^{(k)}, v^{(j)}\rangle = 0, \quad j = 0, 1, \ldots, k-1, \tag{11.27}$$
that is, the residual $r^{(k)} = b - Ax^{(k)}$ is orthogonal to all the search directions $v^{(0)}, \ldots, v^{(k-1)}$.

Let us go back to one step of a line search method, $x^{(k+1)} = x^{(k)} + t_k v^{(k)}$, where $t_k$ is given by the one-dimensional minimizer (11.16). As we did in the steepest descent method, we find that the corresponding residual


satisfies $r^{(k+1)} = r^{(k)} - t_k Av^{(k)}$. Starting with an initial guess $x^{(0)}$, we compute $r^{(0)} = b - Ax^{(0)}$ and take $v^{(0)} = r^{(0)}$. Then,
$$x^{(1)} = x^{(0)} + t_0 v^{(0)}, \tag{11.28}$$
$$r^{(1)} = r^{(0)} - t_0 Av^{(0)} \tag{11.29}$$

and
$$\langle r^{(1)}, v^{(0)}\rangle = \langle r^{(0)}, v^{(0)}\rangle - t_0\langle v^{(0)}, Av^{(0)}\rangle = 0, \tag{11.30}$$

where the last equality follows from the definition (11.16) of $t_0$. Now,
$$r^{(2)} = r^{(1)} - t_1 Av^{(1)} \tag{11.31}$$

and consequently
$$\langle r^{(2)}, v^{(0)}\rangle = \langle r^{(1)}, v^{(0)}\rangle - t_1\langle v^{(0)}, Av^{(1)}\rangle = -t_1\langle v^{(0)}, Av^{(1)}\rangle. \tag{11.32}$$

Thus, if
$$\langle v^{(0)}, Av^{(1)}\rangle = 0 \tag{11.33}$$

then $\langle r^{(2)}, v^{(0)}\rangle = 0$. Moreover, $r^{(2)} = r^{(1)} - t_1 Av^{(1)}$, from which it follows that
$$\langle r^{(2)}, v^{(1)}\rangle = \langle r^{(1)}, v^{(1)}\rangle - t_1\langle v^{(1)}, Av^{(1)}\rangle = 0, \tag{11.34}$$

where in the last equality we have used the definition (11.16) of $t_1$. Thus, if condition (11.33) holds we can guarantee that $\langle r^{(1)}, v^{(0)}\rangle = 0$ and $\langle r^{(2)}, v^{(j)}\rangle = 0$, $j = 0, 1$, i.e. we satisfy the conditions of Theorem 11.1 for $k = 1, 2$.

Definition 11.1. Let $A$ be an $n\times n$ matrix. We say that two vectors $x, y \in \mathbb{R}^n$ are conjugate with respect to $A$ if
$$\langle x, Ay\rangle = 0. \tag{11.35}$$

We can now proceed by induction to prove the following theorem.

Theorem 11.2. Suppose $v^{(0)}, \ldots, v^{(k-1)}$ are conjugate with respect to $A$. Then, for $k = 1, 2, \ldots$,
$$\langle r^{(k)}, v^{(j)}\rangle = 0, \quad j = 0, 1, \ldots, k-1.$$


Proof. We proceed by induction. We know the statement is true for $k = 1$. Suppose
$$\langle r^{(k-1)}, v^{(j)}\rangle = 0, \quad j = 0, 1, \ldots, k-2. \tag{11.36}$$

Recall that $r^{(k)} = r^{(k-1)} - t_{k-1} Av^{(k-1)}$ and so
$$\langle r^{(k)}, v^{(k-1)}\rangle = \langle r^{(k-1)}, v^{(k-1)}\rangle - t_{k-1}\langle v^{(k-1)}, Av^{(k-1)}\rangle = 0 \tag{11.37}$$

because of the choice (11.16) of $t_{k-1}$. Now, for $j = 0, 1, \ldots, k-2$,
$$\langle r^{(k)}, v^{(j)}\rangle = \langle r^{(k-1)}, v^{(j)}\rangle - t_{k-1}\langle v^{(j)}, Av^{(k-1)}\rangle = 0, \tag{11.38}$$

where the first term is zero because of the induction hypothesis and the second term is zero because the search directions are conjugate. □

Combining Theorems 11.1 and 11.2 we get the following important conclusion.

Theorem 11.3. If the search directions $v^{(0)}, v^{(1)}, \ldots, v^{(k-1)}$ are conjugate (with respect to $A$), then $x^{(k)} = x^{(k-1)} + t_{k-1} v^{(k-1)}$ is the minimizer of $\|x - \bar{x}\|_A^2$ over $x^{(0)} + V_k$.

11.3.1 Generating the Conjugate Search Directions

The conjugate gradient method, due to Hestenes and Stiefel, is an ingenious approach to generating the set of conjugate search directions efficiently. The idea is to modify the negative gradient direction, $r^{(k)}$, by adding information about the previous search direction, $v^{(k-1)}$. Specifically, we start with
$$v^{(k)} = r^{(k)} + s_k v^{(k-1)}, \tag{11.39}$$

where the scalar $s_k$ is chosen so that $v^{(k)}$ is conjugate to $v^{(k-1)}$ with respect to $A$, i.e.
$$0 = \langle v^{(k)}, Av^{(k-1)}\rangle = \langle r^{(k)}, Av^{(k-1)}\rangle + s_k\langle v^{(k-1)}, Av^{(k-1)}\rangle, \tag{11.40}$$

which gives
$$s_k = -\frac{\langle r^{(k)}, Av^{(k-1)}\rangle}{\langle v^{(k-1)}, Av^{(k-1)}\rangle}. \tag{11.41}$$

Magically, this simple construction renders all the search directions conjugate and the residuals orthogonal!


Theorem 11.4.

(a) $\langle r^{(i)}, r^{(j)}\rangle = 0$, $i \neq j$.

(b) $\langle v^{(i)}, Av^{(j)}\rangle = 0$, $i \neq j$.

Proof. By the choice of $t_k$ and $s_k$ it follows that
$$\langle r^{(k+1)}, r^{(k)}\rangle = 0, \tag{11.42}$$
$$\langle v^{(k+1)}, Av^{(k)}\rangle = 0, \tag{11.43}$$

for $k = 0, 1, \ldots$. Let us now proceed by induction. We know $\langle r^{(1)}, r^{(0)}\rangle = 0$ and $\langle v^{(1)}, Av^{(0)}\rangle = 0$. Suppose $\langle r^{(i)}, r^{(j)}\rangle = 0$ and $\langle v^{(i)}, Av^{(j)}\rangle = 0$ hold for $0 \le j < i \le k$. We need to prove that this also holds for $0 \le j < i \le k+1$. In view of (11.42) and (11.43) we can assume $j < k$. Now,
$$\begin{aligned}
\langle r^{(k+1)}, r^{(j)}\rangle &= \langle r^{(k)} - t_k Av^{(k)}, r^{(j)}\rangle\\
&= \langle r^{(k)}, r^{(j)}\rangle - t_k\langle r^{(j)}, Av^{(k)}\rangle\\
&= -t_k\langle r^{(j)}, Av^{(k)}\rangle,
\end{aligned}\tag{11.44}$$

where we have used the induction hypothesis on the orthogonality of the residuals for the last equality. By (11.39), $v^{(j)} = r^{(j)} + s_j v^{(j-1)}$ and so $r^{(j)} = v^{(j)} - s_j v^{(j-1)}$. Thus,
$$\begin{aligned}
\langle r^{(k+1)}, r^{(j)}\rangle &= -t_k\langle v^{(j)} - s_j v^{(j-1)}, Av^{(k)}\rangle\\
&= -t_k\langle v^{(j)}, Av^{(k)}\rangle + t_k s_j\langle v^{(j-1)}, Av^{(k)}\rangle = 0.
\end{aligned}\tag{11.45}$$

Also, for $j < k$, using $\langle v^{(k)}, Av^{(j)}\rangle = 0$ (induction hypothesis) and $Av^{(j)} = \frac{1}{t_j}(r^{(j)} - r^{(j+1)})$,
$$\begin{aligned}
\langle v^{(k+1)}, Av^{(j)}\rangle &= \langle r^{(k+1)} + s_{k+1} v^{(k)}, Av^{(j)}\rangle\\
&= \langle r^{(k+1)}, Av^{(j)}\rangle + s_{k+1}\langle v^{(k)}, Av^{(j)}\rangle\\
&= \left\langle r^{(k+1)}, \tfrac{1}{t_j}\left(r^{(j)} - r^{(j+1)}\right)\right\rangle\\
&= \tfrac{1}{t_j}\langle r^{(k+1)}, r^{(j)}\rangle - \tfrac{1}{t_j}\langle r^{(k+1)}, r^{(j+1)}\rangle = 0.
\end{aligned}\tag{11.46}$$ □


The conjugate gradient method is completely specified by (11.11), (11.16), (11.39), and (11.41). We now do some algebra to obtain computationally better formulas for $t_k$ and $s_k$.

Recall that
$$t_k = \frac{\langle v^{(k)}, r^{(k)}\rangle}{\langle v^{(k)}, Av^{(k)}\rangle}.$$

Now,
$$\langle v^{(k)}, r^{(k)}\rangle = \langle r^{(k)} + s_k v^{(k-1)}, r^{(k)}\rangle = \langle r^{(k)}, r^{(k)}\rangle + s_k\langle v^{(k-1)}, r^{(k)}\rangle = \langle r^{(k)}, r^{(k)}\rangle. \tag{11.47}$$

Therefore,
$$t_k = \frac{\langle r^{(k)}, r^{(k)}\rangle}{\langle v^{(k)}, Av^{(k)}\rangle}. \tag{11.48}$$

Let us now work with the numerator of $s_{k+1}$, the inner product $\langle r^{(k+1)}, Av^{(k)}\rangle$. First recall that $r^{(k+1)} = r^{(k)} - t_k Av^{(k)}$ and so $t_k Av^{(k)} = r^{(k)} - r^{(k+1)}$. Therefore,
$$-\langle r^{(k+1)}, Av^{(k)}\rangle = \frac{1}{t_k}\langle r^{(k+1)}, r^{(k+1)} - r^{(k)}\rangle = \frac{1}{t_k}\langle r^{(k+1)}, r^{(k+1)}\rangle. \tag{11.49}$$

And for the denominator, we have
$$\begin{aligned}
\langle v^{(k)}, Av^{(k)}\rangle &= \frac{1}{t_k}\langle v^{(k)}, r^{(k)} - r^{(k+1)}\rangle\\
&= \frac{1}{t_k}\langle v^{(k)}, r^{(k)}\rangle - \frac{1}{t_k}\langle v^{(k)}, r^{(k+1)}\rangle\\
&= \frac{1}{t_k}\langle r^{(k)} + s_k v^{(k-1)}, r^{(k)}\rangle\\
&= \frac{1}{t_k}\langle r^{(k)}, r^{(k)}\rangle + \frac{s_k}{t_k}\langle v^{(k-1)}, r^{(k)}\rangle = \frac{1}{t_k}\langle r^{(k)}, r^{(k)}\rangle.
\end{aligned}\tag{11.50}$$

Thus, we can write
$$s_{k+1} = \frac{\langle r^{(k+1)}, r^{(k+1)}\rangle}{\langle r^{(k)}, r^{(k)}\rangle}. \tag{11.51}$$

A pseudocode for the conjugate gradient method is given in Algorithm 11.1. The main cost per iteration of the conjugate gradient method is the evaluation of $Av^{(k)}$. If $A$ is sparse, then this product of a matrix and a vector can be done


cheaply by avoiding operations with the zeros of $A$. For example, for the matrix in the solution of Poisson's equation in 2D (10.76), the cost of computing $Av^{(k)}$ is just $O(n)$, where $n = N^2$ is the total number of unknowns.

Algorithm 11.1 The Conjugate Gradient Method
1: Given x^{(0)} and TOL, set r^{(0)} = b - Ax^{(0)}, v^{(0)} = r^{(0)}, and k = 0.
2: while ||r^{(k)}||_2 > TOL do
3:   t_k <- <r^{(k)}, r^{(k)}> / <v^{(k)}, Av^{(k)}>
4:   x^{(k+1)} <- x^{(k)} + t_k v^{(k)}
5:   r^{(k+1)} <- r^{(k)} - t_k Av^{(k)}
6:   s_{k+1} <- <r^{(k+1)}, r^{(k+1)}> / <r^{(k)}, r^{(k)}>
7:   v^{(k+1)} <- r^{(k+1)} + s_{k+1} v^{(k)}
8:   k <- k + 1
9: end while
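A minimal Python sketch of Algorithm 11.1 (names are ours); `A` can be any object supporting `A @ v` (a dense array, a SciPy sparse matrix, etc.), so the single matrix-vector product per iteration stays cheap for sparse matrices:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, maxit=None):
    """Conjugate gradient method (Algorithm 11.1) for SPD A."""
    n = len(b)
    maxit = maxit or n
    x = np.array(x0, dtype=float)
    r = b - A @ x
    v = r.copy()
    rr = r @ r
    for k in range(maxit):
        if np.sqrt(rr) <= tol:
            return x, k
        Av = A @ v
        t = rr / (v @ Av)             # step length (11.48)
        x += t * v
        r -= t * Av
        rr_new = r @ r
        v = r + (rr_new / rr) * v     # s_{k+1} = rr_new / rr, (11.51)
        rr = rr_new
    return x, maxit
```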

Theorem 11.5. Let $A$ be an $n\times n$ symmetric positive definite matrix. Then the conjugate gradient method converges to the exact solution of $Ax = b$ (assuming no round-off errors) in at most $n$ steps.

Proof. By Theorem 11.4, the residuals are orthogonal, hence linearly independent while nonzero. After $n$ steps, $r^{(n)}$ is orthogonal to $r^{(0)}, r^{(1)}, \ldots, r^{(n-1)}$. Since the dimension of the space is $n$, $r^{(n)}$ has to be the zero vector. □

11.4 Krylov Subspaces

In the conjugate gradient method we start with an initial guess $x^{(0)}$, compute the residual $r^{(0)} = b - Ax^{(0)}$, and set $v^{(0)} = r^{(0)}$. We then get $x^{(1)} = x^{(0)} + t_0 r^{(0)}$


and evaluate the residual $r^{(1)}$, etc. If we use the definition of the residual we have
$$r^{(1)} = b - Ax^{(1)} = b - Ax^{(0)} - t_0 Ar^{(0)} = r^{(0)} - t_0 Ar^{(0)}, \tag{11.52}$$

so that $r^{(1)}$ is a linear combination of $r^{(0)}$ and $Ar^{(0)}$. Similarly,
$$\begin{aligned}
x^{(2)} &= x^{(1)} + t_1 v^{(1)}\\
&= x^{(0)} + t_0 r^{(0)} + t_1 r^{(1)} + t_1 s_1 r^{(0)}\\
&= x^{(0)} + (t_0 + t_1 s_1) r^{(0)} + t_1 r^{(1)},
\end{aligned}\tag{11.53}$$

so that $r^{(2)} = b - Ax^{(2)}$ is a linear combination of $r^{(0)}$, $Ar^{(0)}$, and $A^2 r^{(0)}$, and so on.

Definition 11.2. The set $\mathcal{K}_k(r^{(0)}, A) = \mathrm{span}\{r^{(0)}, Ar^{(0)}, \ldots, A^{k-1} r^{(0)}\}$ is called the Krylov subspace of degree $k$ for $r^{(0)}$.

Krylov subspaces are central to an important class of numerical methods, like the conjugate gradient method, that build approximations through matrix-vector multiplications.

The following theorem provides a reinterpretation of the conjugate gradient method: the approximation $x^{(k)}$ is the minimizer of $\|x - \bar{x}\|_A^2$ over $x^{(0)} + \mathcal{K}_k(r^{(0)}, A)$.

Theorem 11.6. $\mathcal{K}_k(r^{(0)}, A) = \mathrm{span}\{r^{(0)}, \ldots, r^{(k-1)}\} = \mathrm{span}\{v^{(0)}, \ldots, v^{(k-1)}\}$.

Proof. We prove it by induction. The case $k = 1$ holds by construction. Let us now assume the statement holds for $k$; we will prove that it also holds for $k+1$.

By the induction hypothesis, $r^{(k-1)}, v^{(k-1)} \in \mathcal{K}_k(r^{(0)}, A)$; then

$$Av^{(k-1)} \in \mathrm{span}\{Ar^{(0)}, \ldots, A^k r^{(0)}\},$$
but $r^{(k)} = r^{(k-1)} - t_{k-1} Av^{(k-1)}$ and so
$$r^{(k)} \in \mathcal{K}_{k+1}(r^{(0)}, A).$$
Consequently,
$$\mathrm{span}\{r^{(0)}, \ldots, r^{(k)}\} \subseteq \mathcal{K}_{k+1}(r^{(0)}, A).$$
We now prove the reverse inclusion,
$$\mathrm{span}\{r^{(0)}, \ldots, r^{(k)}\} \supseteq \mathcal{K}_{k+1}(r^{(0)}, A).$$


Note that $A^k r^{(0)} = A(A^{k-1} r^{(0)})$. But by the induction hypothesis,
$$\mathrm{span}\{r^{(0)}, Ar^{(0)}, \ldots, A^{k-1} r^{(0)}\} = \mathrm{span}\{v^{(0)}, \ldots, v^{(k-1)}\}.$$
Given that
$$A^k r^{(0)} = A(A^{k-1} r^{(0)}) \in \mathrm{span}\{Av^{(0)}, \ldots, Av^{(k-1)}\}$$
and since
$$Av^{(j)} = \frac{1}{t_j}\left(r^{(j)} - r^{(j+1)}\right),$$
it follows that
$$A^k r^{(0)} \in \mathrm{span}\{r^{(0)}, r^{(1)}, \ldots, r^{(k)}\}.$$
Thus,
$$\mathrm{span}\{r^{(0)}, \ldots, r^{(k)}\} = \mathcal{K}_{k+1}(r^{(0)}, A).$$

For the last equality, we observe that $\mathrm{span}\{v^{(0)}, \ldots, v^{(k)}\} = \mathrm{span}\{v^{(0)}, \ldots, v^{(k-1)}, r^{(k)}\}$, because $v^{(k)} = r^{(k)} + s_k v^{(k-1)}$, and by the induction hypothesis
$$\begin{aligned}
\mathrm{span}\{v^{(0)}, \ldots, v^{(k-1)}, r^{(k)}\} &= \mathrm{span}\{r^{(0)}, Ar^{(0)}, \ldots, A^{k-1} r^{(0)}, r^{(k)}\}\\
&= \mathrm{span}\{r^{(0)}, r^{(1)}, \ldots, r^{(k)}\}\\
&= \mathcal{K}_{k+1}(r^{(0)}, A).
\end{aligned}\tag{11.54}$$ □

11.5 Convergence of the Conjugate Gradient Method

Let us define the initial error as $e^{(0)} = x^{(0)} - \bar{x}$. Then $Ae^{(0)} = Ax^{(0)} - A\bar{x}$ implies that
$$r^{(0)} = -Ae^{(0)}. \tag{11.55}$$

For the conjugate gradient method, $x^{(k)} \in x^{(0)} + \mathcal{K}_k(r^{(0)}, A)$, and in view of (11.55) we have that
$$x^{(k)} - \bar{x} = e^{(0)} + c_1 Ae^{(0)} + c_2 A^2 e^{(0)} + \cdots + c_k A^k e^{(0)}, \tag{11.56}$$


for some real constants $c_1, \ldots, c_k$. In fact,
$$\|x^{(k)} - \bar{x}\|_A = \min_{p\in\mathbb{P}_k}\|p(A)e^{(0)}\|_A, \tag{11.57}$$

where $\mathbb{P}_k$ is the set of all polynomials of degree $\le k$ that are equal to one at 0. Since $A$ is symmetric positive definite, all its eigenvalues are real and positive. Let us order them as $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, with associated orthonormal eigenvectors $v_1, v_2, \ldots, v_n$. Then we can write $e^{(0)} = \alpha_1 v_1 + \cdots + \alpha_n v_n$ for some scalars $\alpha_1, \ldots, \alpha_n$, and

p(A)e(0) =n∑j=1

p(λj)αjvj. (11.58)

Therefore,

‖p(A)e(0)‖2A = 〈p(A)e(0), Ap(A)e(0)〉 =

n∑j=1

p2(λj)λjα2j

≤(

maxjp2(λj)

) n∑j=1

λjα2j

(11.59)

and since

‖e(0)‖2A =

n∑j=1

λjα2j (11.60)

we get

‖e(k)‖A ≤ minp∈Pk

maxj|p(λj)| ‖e(0)‖A. (11.61)

The min max term can be estimated using the Chebyshev polynomial Tkwith the change of variables

f(λ) =2λ− λ1 − λnλn − λ1

(11.62)

to map [λ1, λn] to [−1, 1]. The polynomial

p(λ) =1

Tk(f(0))Tk(f(λ)) (11.63)

Page 205: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

196 CHAPTER 11. LINEAR SYSTEMS OF EQUATIONS II

is in Pk and since |Tk(f(λ))| ≤ 1

maxj|p(λj)| =

1

|Tk(f(0))|. (11.64)

Now

|Tk(f(0))| =∣∣∣∣Tk (λ1 + λn

λn − λ1

)∣∣∣∣ =

∣∣∣∣Tk (λn/λ1 + 1

λn/λ1 − 1

)∣∣∣∣ =

∣∣∣∣Tk (κ2(A) + 1

κ2(A)− 1

)∣∣∣∣(11.65)

because κ2(A) = λn/λ1 is the condition number of A in the 2-norm. Now weuse an identity of Chebyshev polynomials, namely if x = (z + 1/z)/2 thenTk(x) = (zk + 1/zk)/2. Noting that

κ2(A) + 1

κ2(A)− 1=

1

2(z + 1/z) (11.66)

for

z = (√κ2(A) + 1)/(

√κ2(A)− 1) (11.67)

we obtain ∣∣∣∣Tk (κ2(A) + 1

κ2(A)− 1

)∣∣∣∣−1

≤ 2

(√κ2(A)− 1√κ2(A) + 1

)k

. (11.68)

Thus we get the upper bound error estimate for the error in the conjugategradient method:

‖e(k)‖A ≤ 2

(√κ2(A)− 1√κ2(A) + 1

)k

‖e(0)‖A. (11.69)

Page 206: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

Chapter 12

Eigenvalue Problems

In this chapter we take a brief look at some numerical methods for the stan-dard eigenvalue problem, i.e. of finding eigenvalues λ and eigenvectors v ofan n× n matrix A.

12.1 The Power Method

Suppose that A has a dominant eigenvalue:

|λ1| > |λ2| ≥≥ · · · ≥ |λn| (12.1)

and a complete set of eigenvector v1, . . . , vn associated to λ1, . . . , λn, respec-tively. Then, each vector v ∈ Rn can be written in terms of the eigenvectorsas

v = c1v1 + · · ·+ cnvn (12.2)

and

Akv = c1λk1v1 + · · ·+ cnλ

k1vn = c1λ

k1

[v1 +

n∑j=2

cjc1

(λjλ1

)kvj

]. (12.3)

Therefore Akv/c1λk1 → v1 and we get a method to determine v1 and λ1.

To avoid overflow we normalize the approximating vector at each iterationas Algorithm 12.1 shows. The rate of convergence of the power method isdetermined by the ratio λ2/λ1. From (12.3), |λ(k) − λ| = O(|λ2/λ1|k), whereλ(k) is the approximation to λ1 at the k-th iteration.

197

Page 207: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

198 CHAPTER 12. EIGENVALUE PROBLEMS

Algorithm 12.1 The Power Method

1: Set k = 0, ‖r‖2 >> 1.2: while ‖r‖2 > TOL do3: v ← Av4: v ← v/‖v‖2

5: λ← vTAv6: r = Av − λv7: k ← k + 18: end while

The power method is useful and efficient for computing the dominanteigenpair λ1, v1 when A is sparse, so that the evaluation of Av is economical,and when |λ2/λ1| << 1.

One can use shifts in the matrix A to decrease |λ2/λ1| and improve con-vergence. We apply the power method with the shifted matrix A−sI, wherethe shift s is chosen to accelerated convergence. For example, suppose A issymmetric and has eigenvalues 100, 90, 50, 40, 30, 30 the matrix A− 60I haseigenvalues 40, 30,−10,−20,−30,−30 and the power method would convergeat a rate of 30/40 = 0.75 instead of a rate of 90/100 = 0.9.

A variant of the shift power method is the inverse power method, whichapplies the iteration to the matrix (A − sI)−1. The inverse is not actuallycomputed; instead the linear system (A− sI)v(k) = v(k−1) is solved at everyiteration. The method will converge to the eigenvalue λj for which |λj − s isthe smallest and so with an appropriate choice for s it is possible to convergeto each of the eigenpairs of A.

12.2 Methods Based on Similarity Transfor-

mations

For a general n× n matrix, numerical methods for eigenvalues are typicallybased on a sequence of similarity transformations

Ak = P−1k APk, k = 1, 2, . . . (12.4)

to attempt to converge to either a diagonal matrix or to an upper triangularmatrix.

Page 208: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

12.2. METHODS BASED ON SIMILARITY TRANSFORMATIONS 199

12.2.1 The QR method

The most successful numerical method for the eigenvalue problem of a generalsquare matrix A is the QR method. It is based on the QR factorization ofa matrix. Here Q is a unitary (orthogonal in the real case) matrix and R isupper triangular.

Given an n× n matrix A, we set A1 = A, obtain its QR factorization

A1 = Q1R1 (12.5)

and define A2 = R1Q1 so that

A2 = R1Q1 = Q∗1AQ1, (12.6)

etc. The k + 1-st similar matrix is generated by

Ak+1 = RkQk = Q∗kAkQk = (Q1 · · ·Qk)∗A(Q1 · · ·Qk). (12.7)

It can be proved that if A is diagonalizable and with distinct eigenvaluesin modulus then the sequence of matrices Ak, k = 1, 2, . . . produced by theQR method will converge to a diagonal matrix with the eigenvalues of A onthe diagonal. There is no convergence proof for a general matrix A but themethod is remarkably robust and fast to converge.

The QR factorization is expensive, O(n3) operations, so the QR methodis usually applied only after the original matrix A has been reduced to atridiagonal matrix if A is symmetric or to an upper Hessenberg form if Ais not symmetric. This reduction is done with a sequence of orthogonaltransformations known as Householder reflections.

Example 12.1. Consider the matrix the 5× 5 matrix

A =

12 13 10 7 713 18 9 8 1510 9 10 4 127 8 4 4 67 15 12 6 18

(12.8)

the A20 = R19Q19 produced by the QR method gives the eigenvalues of A

Page 209: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

200 CHAPTER 12. EIGENVALUE PROBLEMS

within 4 digits of accuracy

A20 =

51.7281 0.0000 0.0000 0.0000 0.00000.0000 8.2771 0.0000 0.0000 0.00000.0000 0.0000 4.6405 0.0000 0.00000.0000 0.0000 0.0000 −2.8486 0.00000.0000 0.0000 0.0000 0.0000 0.2028

. (12.9)

Page 210: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

Chapter 13

Non-Linear Equations

13.1 Introduction

In this chapter we consider the problem of finding zeros of a continuousfunction f , i.e. solving f(x) = 0 for example ex − x = 0 or a system ofnonlinear equations:

f1(x1, x2, · · · , xn) = 0,

f2(x1, x2, · · · , xn) = 0,

...

fn(x1, x2, · · · , xn) = 0.

(13.1)

We are going to write this generic system in vector form as

f(x) = 0, (13.2)

where f : U ⊆ Rn → Rn. Unless otherwise noted the function f is assumedto be smooth in its domain U .

We are going to start with the scalar case, n = 1 and look a very simplebut robust method that relies only on the continuity of the function and theexistence of a zero.

13.2 Bisection

Suppose we are interested in solving a nonlinear equation in one unknown

f(x) = 0, (13.3)

201

Page 211: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

202 CHAPTER 13. NON-LINEAR EQUATIONS

where f is a continuous function on an interval [a, b] and has at least onezero there.

Suppose that f has values of different sign at the end points of the interval,i.e.

f(a)f(b) < 0. (13.4)

By the Intermediate Value Theorem, f has at least one zero in (a, b). Tolocate a zero we bisect the interval and check on which subinterval f changessign. We repeat the process until we bracket a zero within a desired accuracy.The Bisection algorithm to find a zero x∗ is shown below.

Algorithm 13.1 The Bisection Method

1: Given f , a and b (a < b), TOL, and Nmax, set k = 1 and do:2: while (b− a) > TOL and k ≤ Nmax do3: c = (a+ b)/24: if f(c) == 0 then5: x∗ = c . This is the solution6: stop7: end if8: if sign(f(c)) == sign(f(a)) then9: a← c

10: else11: b← c12: end if13: k ← k + 114: end while15: x∗ ← (a+ b)/2

13.2.1 Convergence of the Bisection Method

With the bisection method we generate a sequence

ck =ak + bk

2, k = 1, 2, . . . (13.5)

where each ak and bk are the endpoints of the subinterval we select at eachbisection step (because f changes sign there). Since

bk − ak =b− a2k−1

, k = 1, 2, . . . (13.6)

Page 212: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

13.3. RATE OF CONVERGENCE 203

and ck = ak+bk2

is the midpoint of the interval then

|ck − x∗| ≤1

2(bk − ak) =

b− a2k

(13.7)

and consequently ck → x∗, a zero of f in [a, b].

13.3 Rate of Convergence

We now define in precise terms the rate of convergence of a sequence ofapproximations to a value x∗.

Definition 13.1. Suppose a sequence xn∞n=1 converges to x∗ as n → ∞.We say that xn → x∗ of order p (p ≥ 1) if there is a positive integer N anda constant C such that

|xn+1 − x∗| ≤ C |xn − x∗|p , for all n ≥ N. (13.8)

or equivalently

limn→∞

|xn+1 − x∗||xn − x∗|p

= C. (13.9)

C < 1 for p = 1.

Example 13.1. The sequence generated by the bisection method convergeslinearly to x∗ because

|cn+1 − x∗||cn − x∗|

≤b−a2n+1

b−a2n

=1

2.

Let’s examine the significance of the rate of convergence. Consider first,p = 1, linear convergence. Suppose

|xn+1 − x∗| ≈ C|xn − x∗|, n ≥ N. (13.10)

Then

|xN+1 − x∗| ≈ C|xN − x∗|,|xN+2 − x∗| ≈ C|xN+1 − x∗| ≈ C(C|xN − x∗|) = C2|xN − x∗|.

Page 213: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

204 CHAPTER 13. NON-LINEAR EQUATIONS

Continuing this way we get

|xN+k − x∗| ≈ Ck|xN − x∗|, k = 0, 1, . . . (13.11)

and this is the reason of the requirement C < 1 for p = 1. If the error at theN step, |xN − x∗|, is small enough it will be reduced by a factor of Ck afterk more steps. Setting Ck = 10−dk , then the error |xN − x∗| will be reducedapproximately

dk =

(log10

1

C

)k (13.12)

digits.Let us now do a similar analysis for p = 2, quadratic convergence. We

have

|xN+1 − x∗| ≈ C|xN − x∗|2,|xN+2 − x∗| ≈ C|xN+1 − x∗|2 ≈ C(C|xN − x∗|2)2 = C3|xN − x∗|4,|xN+3 − x∗| ≈ C|xN+2 − x∗|2 ≈ C(C3|xN − x∗|4)2 = C7|xN − x∗|8.

It is easy to prove by induction that

|xN+k − x∗| ≈ C2k−1|xN − x∗|2k

, k = 0, 1, . . . (13.13)

To see how many digits of accuracy we gain in k steps beginning from xN ,we write C2k−1|xN − x∗|2

k= 10−dk |xN − x∗|, and solving for dk we get

dk =

(log10

1

C+ log10

1

|xN − x∗|

)(2k − 1). (13.14)

It is not difficult to prove that for the general p > 1 and as k → ∞ we getdk = αpp

k, where αp = 1p−1

log101C

+ log101

|xN−x∗|.

13.4 Interpolation-Based Methods

Assuming again that f is a continuous function in [a, b] and f(a)f(b) < 0we can proceed as in the bisection method but instead of using the midpointc = a+b

2to subdivide the interval in question we could use the root of linear

polynomial interpolating (a, f(a)) and (b, f(b)). This is called the method

Page 214: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

13.5. NEWTON’S METHOD 205

of false position. Unfortunately, this method only converges linearly andunder stronger assumptions than the Bisection Method.

An alternative approach to use interpolation to obtain numerical methodsfor f(x) = 0 is to proceed as follows: Given m+1 approximations to the zeroof f , x0, . . . , xm, construct the interpolating polynomial of f , pm, at thosepoints, and set the root of pm closest to xm as the new approximation to thezero of f . In practice, only m = 1, 2 are used. The method for m = 1 iscalled the Secant method and we will look at it in some detail later. Themethod for m = 2 is called Muller’s Method.

13.5 Newton’s Method

If the function f is smooth, say at least C2[a, b], and we have already agood approximation x0 to a zero x∗ of f then the tangent line of f at x0,y = f(x0) + f ′(x0)(x − x0) provides a good approximation to f in a smallneighborhood of x0, i.e.

f(x) ≈ f(x0) + f ′(x0)(x− x0). (13.15)

Then we can define the next approximation as the zero of that tangent line,i.e.

x1 = x0 −f(x0)

f ′(x0), (13.16)

etc. At the k step or iteration we get the new approximation xk+1 accordingto:

xk+1 = xk −f(xk)

f ′(xk), k = 0, 1, . . . (13.17)

This iteration is called Newton’s method or Newton-Raphson’s method. Thereare some conditions for this method to work and converge. But when it doesconverge it does it at least quadratically. Indeed, a Taylor expansion of faround xk gives

f(x) = f(xk) + f ′(xk)(x− xk) +1

2f ′′(ξk)(x− xk)2, (13.18)

Page 215: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

206 CHAPTER 13. NON-LINEAR EQUATIONS

where ξk is a point between x and xk. Evaluating at x = x∗ and using thatf(x∗) = 0 we get

0 = f(xk) + f ′(xk)(x∗ − xk) +

1

2f ′′(ξk)(x

∗ − xk)2, (13.19)

which we can recast as

x∗ = xk −f(xk)

f ′(xk)−

12f ′′(ξk)

f ′(xk)(x∗ − xk)2 = xk+1 −

12f ′′(ξk)

f ′(xk)(x∗ − xk)2.

(13.20)

Thus,

|xk+1 − x∗| =12|f ′′(ξk)||f ′(xk)|

|xk − x∗|2. (13.21)

So if the sequence xk∞k=0 generated by Newton’s method converges then itdoes so at least quadratically.

Theorem 13.1. Let x∗ be a simple zero of f (i.e. f(x∗) = 0 and f ′(x∗) 6= 0)and suppose f ∈ c2. Then there’s a neighborthood Iε of x∗ such that Newton’smethod converges to x∗ for any initial guess in Iε.

Proof. Since f ′ is continuous and f ′(x∗) 6= 0 we can choose ε > 0, sufficientlysmall so that f ′(x) 6= 0 for all x such that |x − x∗| ≤ ε (this is Iε) and thatεM(ε) < 1 where

M(ε) =12

maxx∈Iε |f ′′(x)|minx∈Iε |f ′(x)|

.

This is possible because limε→0M(ε) =12|f ′′(x∗)||f ′(x∗)| < +∞.

The condition εM(ε) < 1 allows us to guarantee that x∗ is the only zeroof f in Iε, as we show now. A Taylor expansion of f around x∗ gives

f(x) = f(x∗) + f ′(x∗)(x− x∗) +1

2f ′′(ξ)(x− x∗)3

= f ′(x∗)(x− x∗)(

1 + (x− x∗)12f ′′(ξ)

f ′(x∗)

),

(13.22)

and since ∣∣∣∣(x− x∗) 12f ′′(ξ)

f ′(x∗)

∣∣∣∣ = |x− x∗|12|f ′′(ξ)||f ′(x∗)|

≤ εM(ε) < 1 (13.23)

Page 216: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

13.6. THE SECANT METHOD 207

f(x) 6= 0 for all x ∈ Iε unless x = x∗. We will now show that Newton’siteration is well defined starting from any initial guess x0 ∈ Iε. We prove thisby induction. From (13.21) with k = 0 it follow that x1 ∈ Iε as

|x1 − x∗| = |x0 − x|2∣∣∣∣ 1

2f ′′(ξ0)

f ′(x0)

∣∣∣∣ ≤ ε2M(ε) ≤ ε. (13.24)

Now assume that xk ∈ Iε then again from (13.21)

|xk+1 − x∗| = |xk − x|2∣∣∣∣ 1

2f ′′(ξk)

f ′(xk)

∣∣∣∣ ≤ ε2M(ε) < ε (13.25)

so xk+1 ∈ Iε. Now,

|xk+1 − x∗| ≤ |xk − x∗|2M(ε) = |xk − x∗|εM(ε)

≤ |xk−1 − x∗|(εM(ε))2

...

≤ |x0 − x∗|(εM(ε))k+1

and since εM(ε) < 1 it follows that xk → x∗ as k →∞.

The need for a good initial guess x0 for Newton’s method should beemphasized. In practice, this is obtained with another method, like bisection.

13.6 The Secant Method

Sometimes it could be computationally expensive or not possible to evaluatethe derivative of f . The following method, known as the secant method,replaces the derivative by the secant:

xk+1 = xk −f(xk)

f(xk)−f(xk−1)

xk−xk−1

, k = 1, 2, . . . (13.26)

Page 217: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

208 CHAPTER 13. NON-LINEAR EQUATIONS

Note that since f(x∗) = 0

xk+1 − x∗ = xk − x∗ −f(xk)− f(x∗)f(xk)−f(xk−1)

xk−xk−1

,

= xk − x∗ −f(xk)− f(x∗)

f [xk, xk−1]

= (xk − x∗)

(1−

f(xk)−f(x∗)xk−x∗

f [xk, xk−1]

)

= (xk − x∗)(

1− f [xk, x∗]

f [xk, xk−1]

)= (xk − x∗)

(f [xk, xk−1]− f [xk, x

∗]

f [xk, xk−1]

)

= (xk − x∗)(xk−1 − x∗)

f [xk,xk−1]−f [xk,x∗]

xk−1−x∗

f [xk, xk−1]

= (xk − x∗)(xk−1 − x∗)

f [xk−1, xk, x∗]

f [xk, xk−1]

If xk → x∗, then f [xk−1,xk,x∗]

f [xk,xk−1]→

12f ′′(x∗)

f ′(x∗)and limk→∞

xk+1−x∗xk−x∗

= 0, i.e. the

sequence generated by the secant method would converge faster than linear.

Defining ek = |xk − x∗|, the calculation above suggests

ek+1 ≈ cekek−1. (13.27)

Let’s try to determine the rate of convergence of the secant method. Starting

with the ansatz ek ≈ Aepk−1 or equivalently ek−1 =(

1Aek)1/p

we have

ek+1 ≈ cekek−1 ≈ cek

(1

Aek

) 1p

,

which implies

A1+ 1p

c≈ e

1−p+ 1p

k . (13.28)

Page 218: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

13.7. FIXED POINT ITERATION 209

Since the left hand side is a constant we must have 1− p+ 1p

= 0 which gives

p = 1±√

52

, thus

p =1 +√

5

2=≈ 1.61803 (13.29)

gives the rate of convergence of the secant method. It is better than linear,but worse than quadratic. Sufficient conditions for local convergence are asin Newton’s method.

13.7 Fixed Point Iteration

Newton’s method is a particular example of a functional iteration of the form

xk+1 = g(xk), k = 0, 1, . . .

with the particular choice of g(x) = x− f(x)f ′(x)

. Clearly, if x∗ is a zero of f then

x∗ is a fixed point of g, i.e. g(x∗) = x∗. We will look at fixed point iterationsas a tool for solving f(x) = 0.

Example 13.2. Suppose we want to solve x− e−x = 0 in [0, 1]. Then if wetake g(x) = e−x, a fixed point of g corresponds to a zero of f .

Definition 13.2. Let g is defined in an interval [a, b]. We say that g is acontraction or a contractive map if there is a constant L with 0 ≤ L < 1such that

|g(x)− g(y)| ≤ L|x− y|, for all x, y ∈ [a, b]. (13.30)

If x∗ is a fixed point of g in [a, b] then

|xk − x∗|= |g(xk−1)− g(x∗)|≤ L|xk−1 − x∗|≤ L2|xk−2 − x∗|≤ · · ·≤ Lk|x0 − x∗| → 0, as k →∞.

Page 219: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

210 CHAPTER 13. NON-LINEAR EQUATIONS

Theorem 13.2. If g is contraction on [a, b] and maps [a, b] into [a, b] theng has a unique fixed point x∗ in [a, b] and the fixed point iteration convergesto it for any [a, b]. Moreover(a)

|xk − x∗| ≤L∗

1− L|x1 − x0|

(b)|xk − x∗| ≤ Lk|x0 − x∗|

Proof. With proved (b) already. Since g : [a, b] → [a, b], the fixed pointiteration xk+1 = g(xk), k = 0, 1, ... is well-defined and

|xk+1 − xk| = |g(xk)− g(xk−1)|≤ L|xk − xk−1|≤ · · ·≤ Lk|x1 − x0|.

Now, for n ≥ m

xn − xm = xn − xn−1 + xn−1 − xn−2 + . . .+ xm+1 − xm (13.31)

and so

|xn − xm| ≤ |xn − xn−1|+ |xn−1 − xn−2|+ . . .+ |xm+1 − xm|≤ Ln−1|x1 − x0|+ Ln−2|x1 − x0|+ . . .+ Lm|x1 − x0|≤ Lm|x1 − x0|

(1 + L+ L2 + . . . Ln−1−m)

≤ Lm|x1 − x0|∞∑j=0

Lj =Lm

1− L|x1 − x0|.

Thus given ε > 0 there is N such

LN

1− L|x1 − x0| ≤ ε (13.32)

and thus for n ≥ m ≥ N , |xn − xm| ≤ ε, i.e. xn∞n=0 is a Cauchy sequencein [a, b] so it converges to a point x∗ ∈ [a.b]. But

|xk − g(x∗)| = |g(xk−1)− g(x∗)| ≤ L|xk−1 − x∗|, (13.33)

Page 220: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

13.8. SYSTEMS OF NONLINEAR EQUATIONS 211

and so xk → g(x∗) as k →∞ i.e. x∗ is a fixed point of g.Suppose that there are two fixed points, x1, x2 ∈ [a, b]. |x1 − x2| =

|g(x1) − g(x2)| ≤ L|x1 − x2| ⇒ (1 − L)|x1 − x2) ≤ 0 but 0 ≤ L < 1 ⇒|x1 − x2| = 0⇒ x1 = x2 i.e the fixed point is unique.

If g is differentiable in (a, b), then by the mean value theorem

g(x)− g(y) = g′(ξ)(x− y), for some ξ ∈ [a, b]

and if the derivative is bounded by a constant L less than 1, i.e. |g′(x)| ≤ Lfor all x ∈ (a, b), then |g(x) − g(y)| ≤ L|x − y| with 0 ≤ L < 1, i.e. g iscontractive in [a, b].

Example 13.3. Let g(x) = 14(x2 + 3) for x ∈ [0, 1]. Then 0 ≤ g(x) ≤ 1 and

|g′(x)| ≤ 12

for all x ∈ [0, 1]. So g is contractive in [0, 1] and the fixed pointiteration will converge to the unique fixed point of g in [0, 1].

Note that

xk+1 − x∗ = g(xk)− g(x∗)

= g′(ξk)(xk − x∗), for some ξk ∈ [xk, x∗].

Thus,

xk+1 − x∗

xk − x∗= g′(ξk) (13.34)

and unless g′(x∗) = 0, the fixed point iteration converges linearly, when itdoes converge.

13.8 Systems of Nonlinear Equations

We now look at the problem of finding numerical approximation to the solu-tion(s) of a nonlinear system of equations f(x) = 0, where f : U ⊆ Rn → Rn.

The main approach to solve a nonlinear system is fixed point iteration

xk+1 = G(xk), k = 0, 1, . . . (13.35)

where we assume that G is defined on a closed set B ⊆ Rn and G : B → B.

Page 221: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

212 CHAPTER 13. NON-LINEAR EQUATIONS

The map G is a contraction (with respect to some norm,‖ · ‖) if there isa constant L with 0 ≤ L < 1 and

‖G(x)−G(y)‖ ≤ L‖x− y‖, for all x,y ∈ B. (13.36)

Then, as we know, by the contraction map principle, G has a unique fixedpoint and the sequence generated by the fixed point iteration (13.35) con-verges to it.

Suppose that G is C1 on some convex set B ⊆ Rn, for example a ball.Consider the linear segment x + t(y − x) for t ∈ [0, 1] with x,y fixed in B.Define the one-variable function

h(t) = G(x + t(y − x)). (13.37)

Then, by the Chain Rule, h′(t) = DG(x + t(y − x))(y − x), where DGstands for the derivative matrix of G. Then, using the definition of h andthe Fundamental Theorem of Calculus we have

G(y)−G(x) = h(1)− h(0) =

∫ 1

0

h′(t)dt

=

[∫ 1

0

DG(x + t(y − x))dt

](y − x).

(13.38)

Thus if there is 0 ≤ L < 1 such that

‖DG(x)‖ ≤ L, for all x ∈ B, (13.39)

for some subordinate norm ‖ · ‖. Then

‖G(y)−G(x)‖ ≤ L‖y − x‖ (13.40)

and G is a contraction (in that norm). The spectral radius of DG, ρ(DG)willdetermine the rate of convergence of the corresponding fixed point itera-tion.

13.8.1 Newton’s Method

By Taylor theorem

f(x) ≈ f(x0) + Df(x0)(x− x0) (13.41)

Page 222: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

13.8. SYSTEMS OF NONLINEAR EQUATIONS 213

so if we take x1 as the zero of the right ahnd side of (13.41) we get

x1 = x0 − [Df(x0)]−1f(x0). (13.42)

Continuing this way, Newton’s method for the system of equations f(x) = 0can be written as

xk+1 = xk − [Df(xk)]−1f(xk). (13.43)

In the implementation of Newton’s method for a system of equations wesolve the linear system Df(xk)w = −f(xk) at each iteration and updatexk+1 = xk + w.

Page 223: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

214 CHAPTER 13. NON-LINEAR EQUATIONS

Page 224: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

Chapter 14

Numerical Methods for ODEs

14.1 Introduction

In this chapter we will be concerned with numerical methods for the initialvalue problem:

dy(t)

dt= f(t, y(t)), t0 < t ≤ T, (14.1)

y(t0) = α. (14.2)

The independent variable t often represents time but does not have to. With-out loss of generality we will take t0 = 0. The time derivative is also fre-quently denoted with a dot (especially in physics) or an apostrophe

dy

dt= y = y′. (14.3)

In (14.1)-(14.2), y and f may be vector-valued in which case we havean initial value for a system of ordinary differential equations (ODEs). Theconstant α in (14.2) is the given initial condition, which is a constant vectorin the case of a system.

Example 14.1.

y′(t) = t sin y(t), 0 < t < 2π (14.4)

y(0) = α (14.5)

215

Page 225: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

216 CHAPTER 14. NUMERICAL METHODS FOR ODES

Example 14.2.

y′1(t) = y1(t)y2(t)− y21, 0 < t ≤ T

y′2(t) = −y2(t) + t2 cos y1(t), 0 < t ≤ T(14.6)

y1(0) = α1,

y2(0) = α2.(14.7)

These two are examples of first order ODEs, which is the type of initialvalue problems we will focus on. Higher order ODEs can be written as firstorder systems by introducing new variables for the derivatives from the firstup to one order less.

Example 14.3. The Harmonic Oscillator.

y′′(t) + k2y(t) = 0. (14.8)

If we define y1 = y and y2 = y′ we get

y′1(t) = y2(t),

y′2(t) = −k2y1(t).(14.9)

Example 14.4.

y′′′(t) + 2y(t)y′′(t) + cos y′(t) + et = 0. (14.10)

Introducing the variables y1 = y, y2 = y′, and y3 = y′′ we obtain the firstorder system:

y′1(t) = y2(t),

y′2(t) = y3(t),

y′3(t) = −2y1(t)y3(t)− cos y2(t)− et.(14.11)

If f does not depend explicitly on t we call the ODE (or the systemof ODEs) autonomous. We can turn a non-autonomous system into anautonomous one by introducing t as a new variable.

Example 14.5. Consider the ODE

y′(t) = sin t− y2(t) (14.12)

If we define y1 = y and y2 = t we can write this ODE as the autonomoussystem

y′1(t) = sin y2(t)− y21(t),

y′2(t) = 1.(14.13)

Page 226: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.1. INTRODUCTION 217

We now state a fundamental theorem of existence and uniqueness of so-lutions to the initial value problem (14.1)-(14.2).

Theorem 14.1. Existence and Uniqueness.Let

D = (t, y) : 0 ≤ t ≤ T, y ∈ Rn . (14.14)

If f is continuous in D and uniformly Lipschitz in y, i.e. there is a constantL ≥ 0 such that

‖f(t, y2)− f(t, y1)‖ ≤ L‖y2 − y1‖ (14.15)

for all t ∈ [0, T ] and all y1, y2 ∈ Rn, then the initial value problem (14.1)-(14.2) has a unique solution for each α ∈ Rn.

Note that if there is L′ ≥ 0 such that∣∣∣∣ ∂fi∂yn(t, y)

∣∣∣∣ ≤ L′ (14.16)

for all y, t ∈ [0, T ], and i, j = 1, . . . , n then f is uniformly Lipschitz in y(equivalently, if given norm of the derivative matrix of f is bounded L, seeSection 13.8).

Example 14.6.

y′ = y2/3, 0 < t

y(0) = 0.(14.17)

The partial derivative

∂f

∂y=

2

3y−1/3 (14.18)

is not continuous around 0. Clearly, y ≡ 0 is a solution of this initial valueproblem but so is y(t) = 1

27t3. There is no uniqueness of solution for for

initial value problem.

Example 14.7.

y′ = y2, 1 < t ≤ 3

y(1) = 3.(14.19)

Page 227: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

218 CHAPTER 14. NUMERICAL METHODS FOR ODES

We can integrate to obtain

y(t) =1

2− t, (14.20)

which becomes unbounded as→ 2. Consequently, there is no solution in [1, 3].Note that

∂f

∂y= 2y (14.21)

is unbounded (because y is so) in [1, 3]. The function f is not uniformlyLipschitz in y for t ∈ [1, 3].

The initial value problem (14.1)-(14.2) can be reformulated as

y(t) = y(0) +

∫ T

0

f(s, y(s))ds. (14.22)

This is an integral equation for the unknown function y. In particular, if fdoes not depend on y then the initial value problem (14.1)-(14.2) is reducedto the approximation of the definite integral∫ T

0

f(s)ds. (14.23)

for which a numerical quadrature can be applied.The numerical methods that we will study next deal with the more general

and important case when f depends on the unknown y. They will produce anapproximation to the exact solution of the initial value problem (assuminguniqueness) at a set of discrete points 0 = t0 < t1 < . . . < tN ≤ T . Forsimplicity in the presentation of the ideas we will assume that these pointsare equispaced

tn = n∆t, n = 0, 1, . . . , N and ∆t = T/N. (14.24)

∆t is called the time step or simply the step size. A numerical method for aninitial value problem will be written as an algorithm to go from one discretetime tn to the next tn+1. With that in mind, it is useful to transform theODE (or ODE system) into a integral equation by integrating from tn totn+1:

y(tn+1) = y(tn) +

∫ tn+1

tn

f(t, y(t))dt. (14.25)

This equation provides a useful framework for the construction of numericalmethods with the aid of quadratures.

Page 228: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.2. A FIRST LOOK AT NUMERICAL METHODS 219

14.2 A First Look at Numerical Methods

Let us denote by yn the approximation to the exact solution at tn, i.e.

yn ≈ y(tn). (14.26)

Starting from the integral formulation of the problem, eq. (14.25), if weapproximate the integral using f evaluated at the lower integration limit∫ tn+1

tn

f(t, y(t))dt ≈ f(tn, y(tn)) ∆t (14.27)

and replace the exact derivative of y at tn by f(tn, yn), we obtain the so calledforward Euler’s method:

y0 = α (14.28)

yn+1 = yn + ∆tf(tn, yn), n = 0, 1, . . . N − 1. (14.29)

This provides an explicit formula to advance from one time step to the next.The approximation at the future step, yn+1, only depends on the approxima-tion at the current step, yn. The forward Euler method is an example of anexplict one-step method.

If on the other hand we approximate the integral in (14.25) using onlythe upper limit of integration and replace f(tn+1, y(tn+1)) by f(tn+1, yn+1)we get the backward Euler method:

y0 = α (14.30)

yn+1 = yn + ∆tf(tn+1, yn+1), n = 0, 1, . . . N − 1. (14.31)

Note that now, yn+1 is defined implicitly in (14.31). Thus, to update theapproximation, i.e. to obtain yn+1 for each n = 0, 1, . . . , N − 1, we need tosolve a nonlinear equation (or a system of nonlinear equations if the initialvalue problem is for a system of ODEs). Since we expect the approximatesolution not to change much in one time step (for sufficiently small ∆t), yncan be a good initial guess for a Newton iteration to find yn+1. The backwardEuler method is an implicit one-step method.

We can start we more accurate quadratures as the basis for our numericalmethods. For example if we use the trapezoidal rule∫ tn+1

tn

f(t, y(t))dt ≈ ∆t

2[f(tn, y(tn)) + f(tn+1, y(tn+1))] (14.32)

Page 229: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

220 CHAPTER 14. NUMERICAL METHODS FOR ODES

and proceed as before we get the trapezoidal rule method:

y0 = α (14.33)

yn+1 = yn +∆t

2[f(tn, yn) + f(tn+1, yn+1)] , n = 0, 1, . . . N − 1. (14.34)

Like the backward Euler method, this is an implicit one-step method. We willsee later an important class of one-step methods, known as Runge-Kutta(RK) methods, that use intermediate approximations to the derivative (i.e.approximations to f) and a corresponding quadrature. For example, we canuse the midpoint rule quadrature and the approximation

f(tn+1/2, y(tn+1/2)

)≈ f

(tn+1/2, yn +

∆t

2f(tn, yn)

)(14.35)

to obtain the explicit midpoint Runge -Kutta method

yn+1 = yn + ∆tf

(tn+1/2, yn +

∆t

2f(tn, yn)

). (14.36)

Another possibility is to approximate the integrand f in (14.25) by aninterpolating polynomial using f evaluated at previous approximations yn,yn−1, . . ., yn−m. To simplify the notation, let us write

fn = f(tn, yn). (14.37)

For example, if we replace f in [tn, tn+1] by the linear polynomial p1 interpo-lating (tn, fn) and (tn−1, fn−1), i.e

p1(t) =(t− tn−1)

∆tfn −

(t− tn)

∆tfn−1 (14.38)

we get ∫ tn+1

tn

f(t, y(t))dt ≈∫ tn+1

tn

p1(t)dt =∆t

2[3fn − fn−1] (14.39)

and the corresponding numerical method is

yn+1 = yn +∆t

2[3fn − fn−1] , n = 1, 2, . . . N − 1. (14.40)

Page 230: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.3. ONE-STEP AND MULTISTEP METHODS 221

This is an example of a multistep method. To obtain the approximationat the future step we need the approximations of more than one step; in thisparticular case we need yn and yj−1 so this is a two-step method. Note thatto start using (14.40), i.e. j = 1, we need y0 and y1. For y0 we can use theinitial condition, y0 = α, and we can get y1 by using a one-step method. Allmultistep methods require this initialization process when approximations toy1, . . . , ym−1 have to be generated with one-step methods to begin using themultistep formula.

Numerical methods can also be constructed by approximating the deriva-tive y′ using finite differences or interpolation. For example, the centraldifference approximation

y′(t) ≈ y(t+ ∆t)− y(t−∆t)

2∆t(14.41)

produces the two-step method

yn+1 = yn−1 + ∆tfn. (14.42)

If we approximate y′(tn+1) by the derivative of the polynomial interpo-lating yn+1 and some previous approximations we obtain a class of methodknown as backward differentiation formula (BDF) methods. For exam-ple, let p2 be the polynomial of degree at most 2 that interpolates (tn−1, yn−1),(tn, yn), and (tn+1, yn+1). Then

y′(tn+1) ≈ p′2(tn+1) =3yn+1 − 4yn + yn−1

2∆t, (14.43)

which gives the BDF method

3yn+1 − 4yn + yn−1

2∆t= fn+1, n = 1, 2, . . . N − 1. (14.44)

Note that this is an implicit, multistep method.

14.3 One-Step and Multistep Methods

As we have seen, there are two broad classes of methods for the initial valueproblem (14.1)-(14.2): one-step methods and multistep methods.

Page 231: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

222 CHAPTER 14. NUMERICAL METHODS FOR ODES

Explicit one-step methods can be written in the general form

yn+1 = yn + ∆tΦ(tn, yn,∆t) (14.45)

for some continuous function Φ. For example Φ(t, y,∆t) = f(t, y) for theforward Euler method and Φ(t, y,∆t) = f

(t+ ∆t

2, y + ∆t

2f(t, y)

)for the mid-

point RK method.A general, m-step multistep method can be cast as

amyn+1 + am−1yn + . . .+ a0yn−m+1 =

∆t [bmfn+1 + bm−1fn + . . .+ b0fn−m+1] ,(14.46)

for some coefficients a0, a1, . . . , am, with am 6= 0, and b0, b1, . . . , bm. If bm 6= 0the multistep is implicit otherwise it is explicit. Without loss of generalitywe will assume am = 1. Shifting the index by m− 1, we can write an m-step(m ≥ 2) method as

m∑j=0

ajyn+j = ∆tm∑j=0

bjfn+j. (14.47)

14.4 Local and Global Error

When computing a numerical approximation of an initial value problem, ateach time step there is an error associated to evolving the solution fromtn to tn+1 with the numerical method instead of using the ODE (or theintegral equation) and there is also an error due to the use of yn as startingpoint instead of the exact value y(tn). After several time steps, these localerrors accumulate in the global error of the approximation to the initial valueproblem. Let us make the definition of these errors more precise.

Definition 14.1. The global error en at a given discrete time tn is given by

en = y(tn)− yn, (14.48)

where y(tn) and yn are the exact solution of the initial value problem and thenumerical approximation at tn, respectively.

Definition 14.2. The local truncation error τn, also called local discretiza-tion error is given by

τn = y(tn)− yn, (14.49)

Page 232: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.4. LOCAL AND GLOBAL ERROR 223

where yn is computed with the numerical method using as starting valuey(tn−1) for a one-step method and y(tn−1), y(tn−2), . . . , y(tn−m) for an m-stepmethod.

For an explicit one-step method the local truncation error is simply

τn+1 = y(tn+1)− [y(tn) + ∆tΦ(tn, y(tn),∆t)] (14.50)

and for an explicit multistep method (bm = 0) it follows immediately that(am = 1)

τn+m =m∑j=0

ajy(tn+j)−∆tm∑j=0

bjf(tn+j, y(tn+j)) (14.51)

or equivalently

τn+m =m∑j=0

ajy(tn+j)−∆tm∑j=0

bjy′(tn+j), (14.52)

where we have used y′ = f(t, y).For implicit methods we can use also use (14.51) because it gives the local

error up to a multiplicative factor. Indeed, let

Tn+m =m∑j=0

ajy(tn+j)−∆tm∑j=0

bjf(tn+j, y(tn+j)). (14.53)

Thenm∑j=0

ajy(tn+j) = ∆tm∑j=0

bjf(tn+j, y(tn+j)) + Tn+m. (14.54)

On the other hand yn+m in the definition of the local error is computed using

amyn+m +m−1∑j=0

ajy(tn+j) = ∆t

[bmf(tn+m, yn+m) +

m−1∑j=0

bjf(tn+j, y(tn+j))

].

(14.55)

Subtracting (14.54) to (14.55) and using am = 1 we get

y(tn+m)− yn+m = ∆t bm [f(tn+m, y(tn+m))− f(tn+m, yn+m)] + Tn+k.(14.56)

Page 233: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

224 CHAPTER 14. NUMERICAL METHODS FOR ODES

Assuming f is a scalar C1 function, from the mean value theorem we have

f(tn+m, y(tn+m))− f(tn+m, yn+m) =∂f

∂y(tn+m, ηn+m) [y(tn+m)− yn+m] ,

(14.57)

for some ηn+m between y(tn+m) and yn+m. Substituting this into (14.56) andsolving for τn+m = y(tn+m)− yn+m we get

τn+m =

(1−∆t bm

∂f

∂y(tn+m, ηn+m)

)−1

Tn+m. (14.58)

If f is a vector valued function (a system of ODEs) then the partial derivativein (14.58) is a derivative matrix. A similar argument can be made for animplicit one-step method if the increment function Φ is Lipschitz in y and weuse absolute values in the errors. Thus, we can also view the local truncationerror as a measure of how well the exact solution of the initial value problemsatisfies the numerical method formula.

Example 14.8. The local truncation error for the forward Euler method is

τn+1 = y(tn+1)− [y(tn) + ∆t f(tn, y(tn))] . (14.59)

Taylor expanding the exact solution around tn we have

y(tn+1) = y(tn) + ∆t y′(tn) +1

2(∆t)2y′′(ηn) (14.60)

for some ηn between tn and tn+1. Using y′ = f and substituting (14.59) into(14.60) we get

τn+1 =1

2(∆t)2y′′(ηn). (14.61)

Thus, assuming the exact solution is C2, the local truncation error of theforward Euler method is O(∆t)2 1.

To simplify notation we will henceforth write O(∆t)k instead of O((∆t)k).

1To simplify notation we write O(∆t)2 to mean O((∆t)2) and similarly for other ordersof ∆t.

Page 234: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.4. LOCAL AND GLOBAL ERROR 225

Example 14.9. For the explicit midpoint Runge -Kutta method we have

τn+1 = y(tn+1)− y(tn)−∆tf

(tn+1/2, y(tn) +

∆t

2f(tn, y(tn))

). (14.62)

Taylor expanding f around (tn, y(tn)) we obtain

f

(tn+1/2, y(tn) +

∆t

2f(tn, y(tn))

)= f(tn, y(tn))

+∆t

2

∂f

∂t(tn, y(tn))

+∆t

2f(tn, y(tn))

∂f

∂y(tn, y(tn)) +O(∆t)2.

(14.63)

But y′ = f , y′′ = f ′ and

f ′ =∂f

∂t+∂f

∂yy′ =

∂f

∂t+∂f

∂yf. (14.64)

Therefore

f

(tn+1/2, y(tn) +

∆t

2f(tn, y(tn))

)= y′(tn) +

1

2∆t y′′(tn) +O(∆t)2. (14.65)

On the other hand

y(tn+1) = y(tn) + ∆t y′(tn) +1

2(∆t)2y′′(tn) +O(∆t)3. (14.66)

Substituting (14.65) and (14.66) into (14.62) we get

τn+1 = O(∆t)3. (14.67)

In the previous two examples the methods are one-step. We know obtainthe local truncation error for a multistep method.

Example 14.10. Let us consider the 2-step Adams-Bashforth method (14.40).We have

τn+2 = y(tn+2)− y(tn+1)− ∆t

2[3f(tn+1, y(tn+1))− f(tn, y(tn))] (14.68)

Page 235: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

226 CHAPTER 14. NUMERICAL METHODS FOR ODES

and using y′ = f

τn+2 = y(tn+2)− y(tn+1)− ∆t

2[3y′(tn+1)− y′(tn)] . (14.69)

Taylor expanding y(tn+2) and y′(tn) around tn+1 we have

y(tn+2) = y(tn+1) + ∆ty′(tn+1) +1

2(∆t)2y′′(tn+1) +O(∆t)3, (14.70)

y′(tn) = y′(tn+1)−∆ty′′(tn+1) +O(∆t)2. (14.71)

Substituting these expressions into (14.69) we get

τn+2 = ∆ty′(tn+1) +1

2(∆t)2y′′(tn+1)

− ∆t

2[2y′(tn+1)−∆ty′′(tn+1)] +O(∆t)3

= O(∆t)3.

(14.72)

14.5 Order of a Method and Consistency

We saw in the previous section that assuming sufficient smoothness of theexact solution of the initial value problem, the local truncation error canbe expressed as O(∆t)k, for some positive integer k. If the local truncationerror accumulates no worse than linearly as we march up N steps to getthe approximate solution at T = N∆t, the global error would be orderN(∆t)k = T

∆t(∆t)k, i.e. O(∆t)k−1. So we need k ≥ 2 for all methods and

to prevent uncontrolled accumulation of the local errors as n →∞ we needthe methods to be stable in a sense we will make more precise later. Thismotivates the following definitions.

Definition 14.3. A numerical method for the initial value problem (14.1)-(14.2) is said to be of order p if its local truncation error is O(∆t)p+1.

Euler’s method is order 1 or first order. The midpoint Runge-Kuttamethod and the 2-step Adams-Bashforth method are order 2 or second order.

Definition 14.4. We say that a numerical method is consistent (with theODE of the initial value problem) if the method is at least of order 1.

Page 236: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.6. CONVERGENCE 227

For the case of one-step methods we have

τn+1 = y(tn+1)− [y(tn) + ∆tΦ(tn, y(tn),∆t)]

= ∆t y′(tn)−∆tΦ(tn, y(tn),∆t) +O(∆t)2

= ∆t [f(tn, y(tn))− Φ(tn, y(tn),∆t)] +O(∆t)2.

(14.73)

Thus, assuming the increment function Φ is continuous, a one-step methodis consistent with the ODE y′ = f(t, y) if and only if

Φ(t, y, 0) = f(t, y). (14.74)

To find a consistency condition for a multistep method, we expand y(tn+j)and y′(tn+j) around tn

y(tn+j) = y(tn) + (j∆t)y′(tn) +1

2!(j∆t)2y′′(tn) + . . . (14.75)

y′(tn+j) = y′(tn) + (j∆t)y′′(tn) +1

2!(j∆t)2y′′′(tn) + . . . (14.76)

and substituting in the definition of the local error (14.51) we get that for amultistep method is consistent if and only if

a0 + a1 + . . .+ am = 0, (14.77)

a1 + 2a2 + . . .mam = b0 + b1 + . . .+ bm. (14.78)

All the methods that we have seen so far are consistent (with y′ = f(t, y)).

14.6 Convergence

A basic requirement of the approximations generated by a numerical methodis that they get better and better as we take smaller step sizes. That is, wewant the approximations to approach the exact solution as ∆t→ 0.

Definition 14.5. A numerical method for the initial value (14.1)-(14.2) isconvergent if the global error at a given t = n∆t converges to zero as ∆t→ 0and t = n∆t i.e.

lim∆t→0n∆t=t

[y(n∆t)− yn] = 0. (14.79)

Note that for a multistep method the initialization values y1, . . . , ym−1 mustconverge to y(0) = α as ∆t→ 0.

Page 237: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

228 CHAPTER 14. NUMERICAL METHODS FOR ODES

If we consider a one-step method and the definition (14.50) of the localtruncation error τ , the exact solution satisfies

y(tn+1) = y(tn) + ∆tΦ(tn, y(tn),∆t) + τn+1 (14.80)

while the approximation is given by

yn+1 = yn + ∆tΦ(tn, yn,∆t). (14.81)

Subtracting (14.81) from (14.80) we get a difference equation for the globalerror

en+1 = en + ∆t [Φ(tn, y(tn),∆t)− Φ(tn, yn,∆t)] + τn+1. (14.82)

This the growth of the global error as we take more and more time steps islinked not only to the local error but also to the increment function Φ. Letus suppose that Φ is Lipschitz in y, i.e. there is L ≥ 0 such that

|Φ(t, y1,∆t)− Φ(t, y2,∆t)| ≤ L|y1 − y2| (14.83)

for all t ∈ [0, T ] and y1 and y2 in the relevant domain of existence of thesolution. Recall that for the Euler method Φ(t, y,∆t) = f(t, y) and weassume f(t, y) is Lipschitz in y to guarantee existence and uniqueness of theinitial value problem so the Lipschitz assumption on Φ is somewhat natural.Then, taking absolute values (or norms in the vector case), using the triangleinequality and (14.83) we obtain

|en+1| ≤ (1 + ∆tL)|en|+ |τn+1|. (14.84)

For a method of order p, |τn+1| ≤ C(∆t)p+1, for sufficiently small ∆t. There-fore,

|en+1| ≤ (1 + ∆tL)|en|+ C(∆t)p+1

≤ (1 + ∆tL)[(1 + ∆tL)|en−1|+ C(∆t)p+1

]+ C(∆t)p+1

≤ . . .

≤ (1 + ∆tL)n+1|e0|+ C(∆t)p+1

n∑j=0

(1 + ∆tL)j

(14.85)

and summing up the geometry sum we get

|en+1| ≤ (1 + ∆tL)n+1|e0|+[

(1 + ∆tL)n+1 − 1

∆tL

]C(∆t)p+1. (14.86)

Page 238: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.6. CONVERGENCE 229

Now 1 + t ≤ et for all real t and consequently (1 + ∆tL)n ≤ en∆tL. Thus,since e0 = 0,

|en| ≤[en∆tL − 1

∆tL

]C(∆t)p+1 =

C

Len∆tL(∆t)p. (14.87)

Thus, the global error goes to zero as ∆t→ 0, keeping t = n∆t fixed and weobtain the following important result.

Theorem 14.2. A consistent (p ≥ 1) one-step method with a Lipschitz in yincrement function Φ(t, y,∆t) is convergent.

The Lipschitz condition on Φ gives stability to the method in the sensethat this condition allows us to control the growth of the global error in thelimit as ∆t→ 0 and n→∞.

Example 14.11. As we have seen the forward Euler method is oder 1 andhence consistent. Since Φ = f and we are assuming that f is Lipschitz in ythen by the previous theorem the forward Euler method is convergent.

Example 14.12. Prove that the midpoint Runge-Kutta method is convergent(assuming f is Lipschitz in y).

The increment function in this case is

Φ(t, y,∆t) = f

(t+

∆t

2, y +

∆t

2f(t, y)

). (14.88)

Therefore

|Φ(t, y1,∆t)− Φ(t, y2,∆t)| =∣∣∣∣f (t+

∆t

2, y1 +

∆t

2f(t, y1)

)− f

(t+

∆t

2, y2 +

∆t

2f(t, y2)

)∣∣∣∣≤ L

∣∣∣∣y1 +∆t

2f(t, y1)− y2 −

∆t

2f(t, y2)

∣∣∣∣≤ L |y1 − y2|+

∆t

2|f(t, y1)− f(t, y2)|

≤(

1 +∆t

2

)L |y1 − y2| ≤ L |y1 − y2| .

(14.89)

where L = (1 + ∆t02

)L and ∆t ≤ ∆t0, i.e. for sufficiently small ∆t. Thisproves that Φ is Lipschitz in y and since the midpoint Runge-Kutta methodis of order 2 and hence consistent, it is convergent.

Page 239: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

230 CHAPTER 14. NUMERICAL METHODS FOR ODES

The exact solution of the initial value problem at tn+1 is determineduniquely from its value at tn. In contrast, multistep methods use not only ynbut also yn−1, . . . , yn−m−1 to produce yn+1. The use of more than one stepintroduces some peculiarities to the theory of stability and convergence ofmultistep methods. We will cover these topics separately after we look atthe most important class of one-step methods: the Runge-Kutta methods.

14.7 Runge-Kutta Methods

As seen earlier Runge-Kutta (RK) methods are based on replacing the inte-gral in

y(tn+1) = y(tn) +

∫ tn+1

tn

f(t, y(t))dt (14.90)

with a quadrature formula and using accurate enough intermediate approx-imations for integrand f (the derivative of y). For example if we use thetrapezoidal rule quadrature∫ tn+1

tn

f(t, y(t))dt ≈ ∆t

2[f(tn, y(tn)) + f(tn+1, y(tn+1)] (14.91)

and the approximations y(tn) ≈ yn and yn+1 ≈ yn + ∆tf(tn, yn), we obtain atwo-stage RK method known as improved Euler method

K1 = f(tn, yn), (14.92)

K2 = f(tn + ∆t, yj + ∆tK1), (14.93)

yn+1 = yn + ∆t

[1

2K1 +

1

2K2

]. (14.94)

Note that K1 and K2 are approximations to the derivative of y.

Example 14.13. The midpoint RK method (14.36) is also a two-stage RKmethod and can be written as

K1 = f(tn, yn), (14.95)

K2 = f

(tn +

∆t

2, yj +

∆t

2K1

), (14.96)

yn+1 = yn + ∆tK2. (14.97)

Page 240: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.7. RUNGE-KUTTA METHODS 231

We know from Example 14.9 that the midpoint RK is order 2. Theimproved order method is also order 2. Obtaining the order of a RK methodusing Taylor expansions becomes a long, tedious process because numberof terms in the derivatives of f rapidly grow (y′ = f, y′′ = ft + fyf, y

′′′ =ftt + 2ftyf + fyyf

2 + fyft + f 2y f , etc.).

The most popular RK method is the following 4-stage (and fourth order)explicit RK, known as the classical fourth order RK

K1 = f(tn, yn),

K2 = f(tn +1

2∆t, yj +

1

2∆tK1),

K3 = f(tn +1

2∆t, yj +

1

2∆tK2),

K4 = f(tn + ∆t, yj + ∆tK3),

yn+1 = yn +∆t

6[K1 + 2K2 + 2K3 +K4] .

(14.98)

A general s-stage RK method can be written as

K1 = f

(tn + c1∆t, yn +

s∑j=1

a1jKj

),

K2 = f

(tn + c2∆t, yn +

s∑j=1

a2jKj

),

...

Ks = f

(tn + cs∆t, yn +

s∑j=1

asjKj

),

yn+1 = yn + ∆ts∑j=1

bjKj.

(14.99)

RK methods are determined by the constants c1, . . . , cs that specify thequadrature points, the coefficients a1j, . . . , asj for j = 1, . . . , s used to ob-tain approximations of the solution at the intermediate quadrature points,and the quadrature coefficients b1, . . . , bs. Consistent RK methods need to

Page 241: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

232 CHAPTER 14. NUMERICAL METHODS FOR ODES

satisfy the conditions

ci =s∑j=1

aij, (14.100)

s∑j=1

bj = 1. (14.101)

To define a RK method it is enough to specify the coefficients cj, aij andbj for i, j = 1, . . . , s. There coefficients are often display in a table, calledthe Butcher tableau (after J.C. Butcher) of the RK method as shown inTable14.1. For an explicit RK method, the matrix of coefficients A = (aij) is

Table 14.1: Butcher tableau for a general RK method.c1 a11 . . . a1s...

......

...cs as1 . . . ass

b1 . . . bs

lower triangular with zeros on the diagonal, i.e. aij = 0 for i ≤ j. The zerosof A are usually not displayed in the tableau.

Example 14.14. The tables 14.2-14.4 show the Butcher tableau of someexplicit RK methods.

Table 14.2: Improved Euler.01 1

12

12

Implicit RK methods are useful for some initial values problems with dis-parate time scales as we will see later. To reduce the computational workneeded to solve for the unknown K1, . . . , Ks (each K is vector-valued for asystem of ODEs) in an implicit RK method two particular types of implicitRK methods are usually employed. The first type is the diagonally implicit

Page 242: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.7. RUNGE-KUTTA METHODS 233

Table 14.3: Midpoint RK.012

12

0 1

Table 14.4: Classical fourth order RK.012

12

12

0 12

1 0 0 116

13

13

16

RK method or DIRK which has aij = 0 for i < j and at least one aii isnonzero. The second type has also aij = 0 for i < j but with the addi-tional condition that aii = γ for all i = 1, . . . , s and γ is a constant. Thecorresponding methods are called singly diagonally implicit RK method orSDIRK.

Example 14.15. Tables 14.5-14.8 show some examples of DIRK and SDIRKmethods.

Table 14.5: Backward Euler.1 1

1

Table 14.6: Implicit mid-point rule RK.12

12

1

Page 243: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

234 CHAPTER 14. NUMERICAL METHODS FOR ODES

Table 14.7: Hammer and Hollingworth DIRK.0 0 023

13

13

14

34

Table 14.8: Two-stage order 3 SDIRK (γ = 3±√

36

).

γ γ 01− γ 1− 2γ γ

12

12

14.8 Adaptive Stepping

So far we have considered a fixed ∆t throughout the entire computation ofan approximation to the initial value problem of an ODE or of a systems ofODEs. We can vary ∆t as we march up in t to maintain the approximationwithin a given error bound. The idea is to obtain an estimate of the errorusing two different methods, one of order p and one of order p + 1, and usethis estimate to decide whether the size of ∆t is appropriate or not at thegiven time step.

Let yn+1 and wn+1 be the numerical approximations updated from ynusing the method of order p, and p+ 1, respectively. Then, we estimate theerror at tn+1 by

en+1 ≈ wn+1 − yn+1. (14.102)

If |wn+1 − yn+1| ≤ δ, where δ is a prescribed tolerance, then we maintainthe same ∆t and use wn+1 as initial condition for the next time step. If|yn+1 − wn+1| > δ, we decrease ∆t (e.g. we set it to ∆t/2), recompute yn+1,obtain the new estimate of the error (14.102), etc.

One-step methods allow for straightforward use of variable ∆t. Variablestep, multistep methods can also be derived but are not used much in practicedue to limited stability properties.

Page 244: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.9. EMBEDDED METHODS 235

14.9 Embedded Methods

For computational efficiency, adaptive stepping as described above is imple-mented reusing as much as possible evaluations of f , the right hand side ofy′ = f(t, y) because evaluating f is the most expensive part of Runge-Kuttamethods. So the idea is to embed, with minimal additional f evaluations, aRunge-Kutta method inside another. The following example illustrates thisidea.

Consider the explicit trapezoidal method (second order) and the Eulermethod (first order). We can embed them as follows

K1 = f(tn, yn), (14.103)

K2 = f (tn + ∆t, yn + ∆tK1) , (14.104)

wn+1 = yn + ∆t

[1

2K1 +

1

2K2

], (14.105)

yn+1 = yn + ∆tK1. (14.106)

Note that the approximation of the derivative K1 is used for both methods.The computation of the higher order method (14.105) only costs an additionalevaluation of f .

14.10 Multistep Methods

As we have seen, multistep methods use approximations from more than onestep to update the new approximation. They can be written in the generalform

m∑j=0

ajyn+j = ∆tm∑j=0

bjfn+j. (14.107)

where m ≥ 2 is the number of previous steps the method employs.Multistep methods only require one evaluation of f per step because the

other, previously computed, values of f are stored. Thus, multistep methodshave generally lower computational cost than one-step methods of the sameorder. The trade-off is reduced stability as we will see later.

We saw in Section 14.2 the approaches of interpolation and finite differ-ences to construct multistep methods. It is possible to build also multistep

Page 245: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

236 CHAPTER 14. NUMERICAL METHODS FOR ODES

methods by choosing the coefficients a0, . . . , am and b0, . . . , bm so as to achievea desired maximal order for a given m ≥ 2 and/or to have certain stabilityproperties.

Two classes of multistep methods, derived both from interpolation, arethe most commonly used multistep methods. These are the (explicit) Adams-Bashforth methods and the (implicit) Adams-Moulton methods.

14.10.1 Adams Methods

We constructed in Section 14.2 the two-step Adams-Barshforth method

yn+1 = yn +∆t

2[3fn − fn−1] , n = 1, 2, . . . N − 1.

An m-step explicit Adams method, also called Adams-Bashforth, can bederived by starting with the integral formulation of the initial value problem

y(tn+1) = y(tn) +

∫ tn+1

tn

f(t, y(t))dt. (14.108)

and replacing the integral with that of the interpolating polynomial p of(tj, fj) for j = n−m+ 1, . . . , n. Recall that fj = f(tj, yj). If we represent pin Lagrange form we have

p(t) =n∑

j=n−m+1

lj(t)fj, (14.109)

where

lj(t) =n∏

k=n−m+1k 6=j

(t− tk)(tj − tk)

, for j = n−m+ 1, . . . , n. (14.110)

Thus, the m-step explicit Adams method has the form

yn+1 = yn+1 + ∆t [bm−1fn + bm−2fn−1 + . . . b0fn−m+1] , (14.111)

where

bj−(n−m+1) =1

∆t

∫ tn+1

tn

lj(t)dt, for j = n−m+ 1, . . . , n. (14.112)

Page 246: Introduction to Numerical Analysisweb.math.ucsb.edu/.../Introduction_Numerical_Analysis.pdf1.1 What is Numerical Analysis? This is an introductory course of Numerical Analysis, which

14.11. A-STABILITY 237

Here are the first three explicit Adams methods, 2-step, 3-step, and 4-step,respectively:

yn+1 = yn +∆t

2[3fn − fn−1] , (14.113)

yn+1 = yn +∆t

12[23fn − 16fn−1 + 5fn−2] , (14.114)

yn+1 = yn +∆t

24[55fn − 59fn−1 + 37fn−2 − 9fn−3] . (14.115)

The implicit Adams methods, also called Adams-Moulton methods, arederived by including also (tn+1, fn+1) in the interpolation. That is, p is nowthe polynomial interpolating (tj, fj) for j = n −m + 1, . . . , n + 1. Here arethe first three implicit Adams methods:

yn+1 = yn +∆t

12[5fn+1 + 8fn − fn−1] , (14.116)

yn+1 = yn +∆t

24[9fn+1 + 19fn − 5fn−1 + fn−2] , (14.117)

yn+1 = yn +∆t

720[251fn+1

+ 646fn − 264fn−1 + 106fn−2 − 19fn−3] .(14.118)

14.10.2 Zero-Stability and Dahlquist Theorem

14.11 A-Stability

14.12 Stiff ODEs