
Numerics

W. Schreiber

April 1999

Page 2: numath

7/27/2019 numath

http://slidepdf.com/reader/full/numath 2/71

Contents

1 Error Analysis
  1.1 Number representation
    1.1.1 Floating-point-numbers
    1.1.2 Rounding
  1.2 Error Propagation
    1.2.1 Error propagation in arithmetic operations
    1.2.2 Condition numbers
  1.3 Errors in Algorithms

2 Nonlinear Equations and Systems
  2.1 1-dimensional equations
    2.1.1 Bisection
    2.1.2 Fixed Point Iteration
    2.1.3 Newton's Method
    2.1.4 Polynomial Equations
  2.2 Systems of equations
    2.2.1 Norms
    2.2.2 Contraction Mapping Principle
    2.2.3 Newton's Method

3 Linear Systems of Equations
  3.1 Gauss Elimination
    3.1.1 Gauss elimination and triangular decomposition
    3.1.2 Pivoting strategies
    3.1.3 Direct triangular decomposition
    3.1.4 Cholesky Decomposition
  3.2 Orthogonal Decompositions
    3.2.1 Gram-Schmidt-process
    3.2.2 Jacobi-Rotations
    3.2.3 Householder-Matrices
  3.3 Data fitting; Least square problems
  3.4 Iterative solutions
    3.4.1 Jacobi Iteration (total-step-method)
    3.4.2 Gauss-Seidel-Iteration (single-step-method)
    3.4.3 Convergence Analysis

4 Eigenvalues
  4.1 Introduction
  4.2 Localizing Eigenvalues
  4.3 The Power method
  4.4 The QR-method and Hessenberg matrices

5 Ordinary Differential Equations
  5.1 Introduction
  5.2 Basic analytic methods
    5.2.1 Separation of variables
    5.2.2 Linear differential equations
  5.3 One-step-methods
    5.3.1 Euler's Method
    5.3.2 Midpoint Euler
    5.3.3 Two-stage-models
    5.3.4 Runge-Kutta-methods

6 Polynomial Interpolation
  6.1 Lagrange form of Interpolation Polynomial
  6.2 Interpolation Error
  6.3 Divided Differences
  6.4 Spline Interpolation

7 Numerical Integration
  7.1 Integration via polynomial interpolation
    7.1.1 The trapezoid rule
    7.1.2 Simpson's rule
  7.2 Gaussian Quadrature


Chapter 1

Error Analysis

In practice one cannot compute results exactly. One has to be aware of different sources of errors. There are

• errors in the input data

• errors in the formula (modeling errors)

• roundoff errors

• human and machine errors.

Most problems involve small errors in the input data. This is due to the fact that the input data stem from physical measurements and are therefore subject to errors. Besides that, input data can be the results

of a former calculation which are infected by roundoff errors.

Once the solution process starts, errors in the input data and roundoff errors can accumulate. It is very important to estimate how such small errors in the input data affect the output.

Example 1. The solution of the linear system

1.985x − 1.358y = 2.212

0.953x − 0.652y = 1.062

is x = 0.6087, y = −0.7391.

A small perturbation of the system yields

1.985x − 1.358y = 2.212

0.953x − 0.652y = 1.061

This system has the solution x = 30.13, y = 42.413.

The linear system is therefore very sensitive to perturbations of the input data.


1.1 Number representation

In everyday life real numbers are represented as decimal numbers, e.g. 1/4 = (0.25)_10. So they are written with the help of the digits 0, 1, ..., 9; their positions indicate which power of 10 they represent, and a decimal point marks the position of the power of order 0:

(12.734)_10 = 1·10^1 + 2·10^0 + 7·10^{-1} + 3·10^{-2} + 4·10^{-3}.

Quite often numbers cannot be represented by a finite decimal number, e.g.:

324/990 = (0.32727272727...)_10.

For technical reasons we use instead of the decimal system the binary number system (the digits are then bits) to represent a real number x ∈ R in a computer, e.g.:

(10011.01)_2 = 1·2^4 + 0·2^3 + 0·2^2 + 1·2^1 + 1·2^0 + 0·2^{-1} + 1·2^{-2} = 19 + 0.25 = (19.25)_10.

Again many numbers don’t have a finite binary representation — even if their decimal representation is finite:

(0.2)_10 = (0.00110011001100110011...)_2.

Instead of 10 we can use any number p ∈ N_{>1} as the base of a number system.

Theorem 1.1.1. Fix a base p ∈ N_{>1}. Then any x ∈ R_{>0} has a unique representation

x = Σ_{k=1}^∞ b_k·p^{e−k}

where

1. e ∈ Z is the exponent,

2. b_k ∈ {0, 1, ..., p − 1} are the digits,

3. b_1 ≠ 0 is the so-called normalization condition which fixes e,

4. there is no k_0 ∈ N such that b_k = p − 1 for all k ≥ k_0.

Just note that the last condition excludes some ambiguity: For example we prefer 1/10 = (0.1)_10 to the representation 1/10 = (0.0999...)_10.

The proof of the theorem is constructive: First the exponent is defined by the condition x ∈ [p^{e−1}, p^e[. Then the digits are defined and computed inductively by requiring for n ∈ N:

Σ_{k=1}^n b_k·p^{e−k} ≤ x < p^{e−n} + Σ_{k=1}^n b_k·p^{e−k}.   (1.1)

1.1.1 Floating-point-numbers

In a computer not all real numbers can be represented; for each number there is only a certain amount of storage space available. Numbers are stored as floating-point-numbers:


Definition 1.1.2. Fix p ∈ N_{>1}, t ∈ N and L, U ∈ Z with L < U. Then the set of floating-point-numbers is defined by

x ∈ M(p, t, [L, U]) :⇔ x = ±m·p^e,  m = Σ_{k=1}^t b_k·p^{−k},
b_k ∈ {0, 1, ..., p − 1},  b_1 ≠ 0,  L ≤ e ≤ U.

Additionally: 0 ∈ M(p, t, [L, U]).

In this context m ∈ [p^{−1}, 1[ is called mantissa, e exponent. So the concept of floating-point numbers is to store the leading t digits irrespective of the size of the number and to store the information about the power of the first significant digit b_1 in the exponent. This is opposed to the idea of fixed-point-numbers

x = Σ_{k=1}^t b_k·p^{n−k}

where n is fixed, only the digits b_1, ..., b_t have to be stored, and the k-th digit refers to the fixed power n − k.

Example 1. Let p = 2, t = 3, L = −1, U = 1. Then there are 4 mantissas:

0.100, 0.101, 0.110, 0.111.

The possible exponents are e = −1, e = 0, e = 1. Therefore there exist 4·3 = 12 positive floating-point-numbers, namely:

m:       0.100   0.101   0.110   0.111
e = −1:  0.2500  0.3125  0.3750  0.4375
e = 0:   0.5000  0.6250  0.7500  0.8750
e = 1:   1.0000  1.2500  1.5000  1.7500

[Figure: the distribution of the positive machine numbers of the above example on the real axis]

We remark that the floating-point-numbers are not equidistant on the real axis. Indeed: If x = m·p^e belongs to M(p, t, [L, U]) then the next machine number is x + p^{e−t}.
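For illustration, these twelve numbers can be enumerated with a few lines of Python (our own sketch, not part of the original notes):

# Enumerate the positive numbers of M(2, 3, [-1, 1]) from Example 1.
p, t, L, U = 2, 3, -1, 1

# normalized mantissas 0.b1 b2 b3 with b1 = 1:
mantissas = [sum(b * p**-(k + 1) for k, b in enumerate(bits))
             for bits in [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]]

numbers = sorted(m * p**e for m in mantissas for e in range(L, U + 1))
print(numbers)
# [0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75]

The doubling of the spacing p^{e−t} from one exponent range to the next is clearly visible in the output.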

1.1.2 Rounding

Because the number of floating-point-numbers is finite, we have to approximate a real number x ∈ R by a floating-point-number rd_t(x), which we can represent in a computer. Such a machine number can be obtained by rounding. For example:

Example 2. Let p = 10, x ∈ R. Suppose the decimal representation of |x| is given by

|x| = 0.x1x2...xt x_{t+1}... · 10^e,

x1 ≠ 0, 0 ≤ xi ≤ 9. Then one forms

x̃ = 0.x1x2...xt             if 0 ≤ x_{t+1} ≤ 4
x̃ = 0.x1x2...xt + 10^{−t}   if 5 ≤ x_{t+1} ≤ 9


The rounded number is then rd_t(x) = ±x̃·10^e.

Example 3. t = 4. Then

rd4(0.14285) = 0.1429·10^0
rd4(3.14159) = rd4(0.314159)·10^1 = 0.3142·10^1
rd4(0.142842·10^2) = 0.1428·10^2

We can generalize this idea to any base p ≥ 2. For the following definition we ignore the restrictions on the exponent:

Definition 1.1.3. Define a rounding function rd_t : R → M(p, t, ]−∞, ∞[): Let

x = Σ_{k=1}^∞ x_k·p^{e−k} ∈ R_{>0}

be given. Then define:

rd_t(x) := Σ_{k=1}^t x_k·p^{e−k}             if x_{t+1} < p/2,
rd_t(x) := Σ_{k=1}^t x_k·p^{e−k} + p^{e−t}   if x_{t+1} ≥ p/2.

Furthermore: For x ∈ R_{<0}: rd_t(x) := −rd_t(|x|), and rd_t(0) := 0.
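The following Python sketch (ours) mimics this rounding function for any base p; note that the computation itself runs in binary floating point, so it is illustrative only:

import math

def rd(x: float, t: int, p: int = 10) -> float:
    """Round x to t significant base-p digits (round half away from zero),
    as in Definition 1.1.3. Illustrative: the arithmetic runs in binary
    floats, so tiny representation errors can occur near digit boundaries."""
    if x == 0:
        return 0.0
    sign = math.copysign(1.0, x)
    x = abs(x)
    e = math.floor(math.log(x, p)) + 1       # exponent: x in [p^(e-1), p^e)
    m = x / p**e                             # mantissa: m in [1/p, 1)
    return sign * math.floor(m * p**t + 0.5) * p**(e - t)

print(rd(0.14285, 4))   # 0.1429
print(rd(3.14159, 4))   # 3.142
print(rd(14.2842, 4))   # 14.28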

On a computer there is additionally the problem that the rounded number could exceed the restrictions for the exponent: underflow or overflow, which has to be caught and indicated by the system.

Rounding causes some error which may be estimated in the following way:

Theorem 1.1.4. For even p the following estimates hold for x ≠ 0:

|rd_t(x) − x| ≤ (1/2)·p^{e−t}

|rd_t(x) − x|/|x| ≤ (1/2)·p^{−t+1}

|rd_t(x) − x|/|rd_t(x)| ≤ (1/2)·p^{−t+1}

Proof: Let x ∈ R_{>0} be given as in Definition 1.1.3 and assume x_{t+1} < p/2. If p is even this implies: x_{t+1} ≤ p/2 − 1. So it follows with the help of (1.1):

rd_t(x) ≤ x < Σ_{k=1}^{t+1} x_k·p^{e−k} + p^{e−t−1} ≤ rd_t(x) + (p/2 − 1 + 1)·p^{e−t−1} = rd_t(x) + (1/2)·p^{e−t}.

In the other case we estimate:

x < Σ_{k=1}^t x_k·p^{e−k} + p^{e−t} = rd_t(x) = Σ_{k=1}^{t+1} x_k·p^{e−k} + (p − x_{t+1})·p^{e−t−1} ≤ x + (1/2)·p^{e−t}.

The first relative error estimate follows from the fact that x ≥ p^{e−1}, whence:

|rd_t(x) − x|/|x| ≤ (1/2)·p^{e−t}/p^{e−1} = (1/2)·p^{1−t}.

The last estimate follows in the same way. eop


Definition 1.1.5. For a computer using floating-point-numbers in M(p, t, [L, U]) one defines the machine precision by

eps := (1/2)·p^{1−t}.

By 1.1.4 we have: rd_t(x) = (1 + ε)·x, with |ε| ≤ eps.

1.2 Error Propagation

A problem has generally an input and an output. We collect the input data x1, x2, . . . , xn in a vector

x = (x1, x2, . . . , xn).

From these input data we compute a result y by using a formula (an algorithm)

y = f (x1, x2, . . . , xn)

that is we can think of the problem as a map f  : Rn → R.

Example 1. Calculate a zero y of the quadratic polynomial

p(t) = t² + x1·t + x2.

Here the coefficients x1, x2 are the input data and one of the possible solutions can be computed as y = f(x1, x2) where

f : D → R,  f(x1, x2) := (1/2)·(−x1 + √(x1² − 4·x2)),  (x1, x2) ∈ D := {(x1, x2) ∈ R² | x1² − 4·x2 ≥ 0}.

Usually the input data are infected by errors. We denote by x̃k the perturbed data, that is the data with errors. We are interested in the sensitivity of the map f to the absolute errors

∆k := x̃k − xk

respectively the relative errors

εk := (x̃k − xk)/xk,  1 ≤ k ≤ n  (xk ≠ 0).

1.2.1 Error propagation in arithmetic operations

It is quite natural first to analyze the errors in the elementary arithmetic operations.

Multiplication

The perturbed input data are

x̃ = x(1 + εx) and ỹ = y(1 + εy).

Then

x̃·ỹ = x(1 + εx)·y(1 + εy) = xy + xy(εx + εy) + xy·εx·εy ≈ xy(1 + εx + εy).

Here we assume that εx, εy are very small, so that powers of εx and εy like εx², εx·εy, ... can be neglected. Then the above equation shows that the relative errors in the input data are added to produce the relative error in the result.

εxy = εx + εy (first order)

This is an acceptable error.


Division

Assume that y ≠ 0. Then we have similarly

x̃/ỹ = x(1 + εx)/(y(1 + εy)) = (x/y)·(1 + εx)·(1 − εy + εy² ∓ ···) ≈ (x/y)·(1 + εx − εy)

that is

ε_{x/y} = εx − εy (first order).

Again the error is acceptable.

Addition and Subtraction

We have

x(1 + εx) + y(1 + εy) = x + y + x·εx + y·εy = (x + y)·(1 + (x/(x + y))·εx + (y/(x + y))·εy).

Therefore

ε_{x+y} = (x/(x + y))·εx + (y/(x + y))·εy.

If x and y have the same sign, then

|ε_{x+y}| ≤ |εx| + |εy|.

In this case there is no problem. But if x and y have opposite signs, there can be a magnification of the relative errors εx and εy. This magnification is known as cancellation error.

Cancellation errors are the Achilles heel of every numeric calculation and they should be avoided whenever possible.

Example 2. Let x = 1.36, x̃ = 1.41, that is

εx = (1.41 − 1.36)/1.36 = 0.0367...

and y = −1.35, ỹ = −1.39, that is

εy = −0.04/−1.35 = 0.0296....

Then

x + y = 0.01,  x̃ + ỹ = 0.02

and the relative error in the sum is

ε_{x+y} = (x̃ + ỹ − (x + y))/(x + y) = 0.01/0.01 = 1.

Example 3. The algebraic identity

(a − b)² = a² − 2ab + b²

is no longer valid in floating-point arithmetic. On a 2-decimal-digit computer with a = 1.8 and b = 1.7 we get:

a² − 2ab + b² = 3.2 − 2·3.1 + 2.9 = 3.2 − 6.2 + 2.9 = −0.10

instead of the correct result 0.01, which we obtain if we compute (a − b)². In this example even the sign of the result is false.


Example 4. Consider the quadratic equation

x² − 56x + 1 = 0.

If we use 5 digits the usual formula gives

x1 = 28 + √783 = 28 + 27.982 = 55.982
x2 = 28 − √783 = 28 − 27.982 = 0.01800.

The exact roots are 55.98213716... and 0.01786284.... x2 = 0.0180 is obtained to only 2 correct digits due to a cancellation error, which is not acceptable. This error is inherent to the formula used for computing x2. If we want to avoid the error we have to use another formula.

A better method to compute x2 is by Vieta’s formula

x2 = 1/x1.

Division does not magnify the error and we get the much better result

x2 = 1/55.982 = 0.017863.

Addition is an example of a potentially ill-conditioned function of 2 variables.
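The whole example can be replayed by simulating 5-digit decimal arithmetic; the helper rd5 below is our own illustration:

import math

def rd5(x: float) -> float:
    """Simulate arithmetic with 5 significant decimal digits
    (round half away from zero, as in section 1.1.2)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1
    return math.copysign(math.floor(abs(x) * 10**(5 - e) + 0.5) * 10**(e - 5), x)

r = rd5(math.sqrt(783))       # 27.982
x1 = rd5(28 + r)              # 55.982
x2_bad = rd5(28 - r)          # 0.018     <- cancellation: only 2 correct digits
x2_vieta = rd5(1 / x1)        # 0.017863  <- Vieta's formula: division is harmless
print(x1, x2_bad, x2_vieta)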

1.2.2 Condition numbers

We now study the condition of a more general problem f : R^n → R. We start with the simplest case n = 1: So we want to evaluate y = f(x) and assume that f is sufficiently differentiable. From Calculus we know:

Theorem 1.2.1. Let I ⊂ R be an interval and f : I → R be twice continuously differentiable. Then

f(x + h) = f(x) + f'(x)·h + (1/2)·f''(x + th)·h²

for any x, x + h ∈ I with some t ∈ ]0, 1[ which depends on x and h.

We apply this theorem to our case of error propagation. So we evaluate y = f(x) with some perturbed input x̃ = (1 + εx)·x and obtain:

ỹ = f(x + εx·x) = f(x) + f'(x)·εx·x + (1/2)·f''(x + t·εx·x)·(εx·x)².

As before, we assume that relative errors are small and hence the second power of such a relative error is very much smaller. So we neglect it — and all possibly higher powers of it — and just keep error terms of first order:

εy = (ỹ − y)/y ≈ (f'(x)·x/f(x))·εx.

We denote by

τ := x·f'(x)/f(x)

the condition number of f. It indicates how much a relative error of the input affects the relative error of the output.


Now we look at the case of several variables: Let D ⊂ R^n and f : D → R be continuously differentiable. Then the n-dimensional analogue of the previous theorem tells us

f(x1 + h1, ..., xn + hn) = f(x1, ..., xn) + Σ_{i=1}^n ∂_i f(x1 + t·h1, ..., xn + t·hn)·h_i.

Again we assume that instead of the true values x_i we have distorted values x̃_i = (1 + ε_i)·x_i with small relative errors. We neglect terms involving other than first order powers of these errors and obtain:

ỹ − y = f((1 + ε1)·x1, ..., (1 + εn)·xn) − f(x1, ..., xn) ≈ Σ_{i=1}^n ∂_i f(x1, ..., xn)·ε_i·x_i.

Turning to the relative error we define

Definition 1.2.2. Let D ⊂ R^n and f : D → R be continuously differentiable. For an element x ∈ D, x = (x1, ..., xn), we define the condition numbers:

τ_i := x_i·∂_i f(x)/f(x),  i = 1, ..., n.

With this notation we can compute the relative error for an evaluation of f as:

εy = Σ_{i=1}^n τ_i·ε_i (first order).

Again the condition numbers are an estimate how much a relative error in the i-th data element will contribute to the result.
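The condition numbers can also be estimated numerically; the following sketch (ours, using central differences) reproduces the large condition numbers of the subtraction from Example 2 above:

def condition_numbers(f, x, h=1e-7):
    """Estimate tau_i = x_i * d_i f(x) / f(x) by central finite
    differences (illustrative only; assumes x_i != 0 and f(x) != 0)."""
    y = f(*x)
    taus = []
    for i, xi in enumerate(x):
        xp, xm = list(x), list(x)
        xp[i] = xi * (1 + h)
        xm[i] = xi * (1 - h)
        dfi = (f(*xp) - f(*xm)) / (2 * h * xi)   # ~ partial derivative d_i f
        taus.append(xi * dfi / y)
    return taus

# Subtraction of nearly equal numbers is ill-conditioned:
print(condition_numbers(lambda x, y: x - y, (1.36, 1.35)))
# ~ [136.0, -135.0]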

Example 5. Compute the integral

I_n := ∫_0^1 t^n/(t + 5) dt.

One approach is to compute I_n recursively. From

I_n + 5·I_{n−1} = ∫_0^1 (t^n + 5t^{n−1})/(t + 5) dt = ∫_0^1 t^{n−1} dt = 1/n

it follows

I_n = 1/n − 5·I_{n−1}.

If we use 3 significant digits, we get the starting value

I_0 = ∫_0^1 1/(t + 5) dt = ln(t + 5)|_0^1 = ln(6/5) = 0.182

and recursively

I_0 = 0.182,  I_1 = 0.090,  I_2 = 0.050,  I_3 = 0.083,  I_4 = −0.165.

Since the integral is always positive, I_4 must be wrong.

We now look at the condition number. The recursion defines a function

I_n = f_n(I_0).

From

I_1 = f_1(I_0) = 1 − 5·I_0


it follows

I_2 = f_2(I_0) = 1/2 − 5·I_1 = 1/2 − 5·(1 − 5·I_0) = (−5)²·I_0 − 5 + 1/2

and in general by induction

I_n = f_n(I_0) = (−5)^n·I_0 + p_n

where p_n is some number independent of I_0. The condition number of f_n is

τ(f_n) = |I_0·f_n'(I_0)/f_n(I_0)| = |I_0·(−5)^n/I_n| > I_0·5^n/I_0 = 5^n.

f_n is therefore severely ill-conditioned. The algorithm for computing the integral is numerically unstable. In order to compute the integral we have to look for another algorithm.
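The blow-up is easy to reproduce by simulating 3 significant decimal digits (the helper rd3 is our own):

import math

def rd3(x: float) -> float:
    """Simulate 3 significant decimal digits (round half away from zero)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1
    return math.copysign(math.floor(abs(x) * 10**(3 - e) + 0.5) * 10**(e - 3), x)

I = rd3(math.log(6 / 5))              # I_0 = 0.182
for n in range(1, 5):
    I = rd3(rd3(1 / n) - rd3(5 * I))  # I_n = 1/n - 5 I_{n-1}, rounded
    print(n, I)
# 1 0.09   2 0.05   3 0.083   4 -0.165  (the error is multiplied by 5 per step)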

1.3 Errors in Algorithms

A numerical algorithm computes successively values xi from some initial data x1, . . . , xk by

xi := f i(x1, . . . , xi−1), i = k + 1, . . . , n

Here f_i denotes an elementary operation or function evaluation which depends on some of the previously computed values x_j.

When we use this algorithm on a computer we have to work in the set M of machine numbers instead of R. Each result is affected by rounding errors. So really we begin with rounded data y1, . . . , yk and compute machine numbers

yi := gi(y1, . . . , yi−1) ∈ M, i = k + 1, . . . , n

Here gi denotes the computer program which evaluates (approximately) f_i.

The basic assumption is: For machine numbers y_j ∈ M:

y_i = g_i(y1, ..., y_{i−1}) = (1 + ε_i)·f_i(y1, ..., y_{i−1}), with |ε_i| ≤ eps

So we assume that the computed result is the exact result except for a final rounding error. If f_i denotes an elementary operation (e.g. addition) this may be guaranteed by using additional digits and rounding the result. If f_i involves the evaluation of a mathematical function this is more difficult.

Under this assumption the relative errors

r_j := (y_j − x_j)/x_j

can be computed recursively (up to 1st order) by

r_i = ε_i + Σ_{j=1}^{i−1} τ_{ij}·r_j,  i = k + 1, ..., n

where we use the condition numbers

τ_{ij} = (x_j/x_i)·∂_j f_i.

So the relative errors are composed of the errors inherited from previous steps and an additional error ε_i caused by g_i.


Chapter 2

Nonlinear Equations and Systems

In this chapter we discuss the problem of solving a nonlinear equation or a system of nonlinear equations. We use the following term:

Definition 2.0.1. A zero of a map f  : D → Rd ( D ⊂ Rd) is a point  z ∈ D with  f (z) = 0.

Since even the simplest nonlinear equations do not admit solutions expressed in rational form of the data, one needs an iterative method to compute the solution. An iterative method is a procedure that generates a sequence of approximations (xn) which converges towards z.

To compare several methods we define a criterion to characterize convergence speed — in the 1-dimensional case:

Definition 2.0.2. The sequence (xn) in R converges to z

• linearly, if there is a sequence (εn) in R_{>0} such that:

|xn − z| ≤ εn and lim_{n→∞} ε_{n+1}/ε_n = c, 0 < c < 1.

• with order p ≥ 1, if there is a sequence (εn) in R_{>0} such that:

|xn − z| ≤ εn and lim_{n→∞} ε_{n+1}/(ε_n)^p = c, c > 0.

c is called the asymptotic error constant. Roughly speaking, linear convergence means that the number of correct digits grows by a fixed amount per iteration step; if p > 1 the number of correct digits is roughly multiplied by p per step.

If each iteration step requires m units of work one defines the efficiency index as p^{1/m}. This index is used to compare different iterative methods to solve a problem.

We now discuss some methods to solve the equation f(x) = 0, x ∈ R, and compare their advantages and disadvantages.

2.1 1-dimensional equations

2.1.1 Bisection

This method is based on an important theorem of Calculus:


Theorem 2.1.1 (Intermediate value theorem). Let f : [a, b] → R be a continuous map with f(a) < 0 and f(b) > 0. Then there exists a zero z ∈ ]a, b[ of f.

There may be several zeros! If the conditions of this theorem are satisfied for some a, b then one may compute the midpoint x of this interval, check the sign of f(x) and repeat with a new smaller interval. This leads to the following bisection algorithm:

a1 := a, b1 := b
for n = 1, 2, ... do:
    xn := 0.5·(an + bn)
    if f(xn) < 0
        then: a_{n+1} := xn, b_{n+1} := bn
        else: a_{n+1} := an, b_{n+1} := xn

By the algorithm it is always guaranteed that f(an) < 0 and f(bn) ≥ 0. Hence the interval [an, bn] contains a zero of f. Obviously there is an analogous version if f(a) > 0 and f(b) < 0, or one may just replace f by −f.

To determine the convergence speed we note that:

bn − an = (1/2^{n−1})·(b − a),  n = 1, 2, 3, ....

So both sequences (an) and (bn) converge to a zero z of f and we have the estimate:

|xn − z| ≤ (1/2)·(bn − an) ≤ (1/2^n)·(b − a) =: εn

where

ε_{n+1}/εn = 1/2.

Hence the bisection method converges linearly.

When we program this method we need a termination criterion. We require a certain accuracy δ ∈ R_{>0} and stop the iteration when:

bn − an ≤ δ.

From the given error estimate we can derive an upper bound for the number of necessary steps:

εn ≤ δ  ⇔  (1/2^n)·(b − a) ≤ δ  ⇔  n ≥ ln((b − a)/δ)/ln(2).
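A direct transcription of the bisection algorithm into Python (our own sketch) looks as follows:

def bisect(f, a, b, delta=1e-10):
    """Bisection as in the algorithm above; assumes f(a) < 0 < f(b)."""
    while b - a > delta:
        x = 0.5 * (a + b)
        if f(x) < 0:
            a = x
        else:
            b = x
    return 0.5 * (a + b)

# the unique positive zero of x^2 - 3 in [1, 2]:
print(bisect(lambda x: x * x - 3, 1.0, 2.0))   # ~1.7320508...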

Bisection is a very simple method with a simple error estimate. But the convergence is linear only. There is a modification which is also based on the intermediate value theorem, the "regula falsi". Here the test-point xn is determined as the zero of the secant passing through the points (an, f(an)) and (bn, f(bn)). This secant is the graph of the function:

S(x) := f(an) + ((f(bn) − f(an))/(bn − an))·(x − an)

hence one defines xn by:

S(xn) = 0  ⇒  xn := an − ((bn − an)/(f(bn) − f(an)))·f(an).


2.1.2 Fixed Point Iteration

Often a nonlinear equation is given as a fixed point problem. And many methods for solving equations can be viewed as fixed point iterations. We use the following terms and notations:

Definition 2.1.2 (Fixed point). Let h : D ⊂ R → R be a function. z ∈ D is called a fixed point if z = h(z).

Nonlinear equations can be treated as zero-finding problems or as fixed point problems by the following simple relations:

f (z) = 0 ⇒ z = h(z) with h(x) := x − g(x)f (x) with some arbitrary function g

z = h(z) ⇒ f (z) = 0 with f (x) := x − h(x).

To compute a fixed point one may try the corresponding fixed point iteration:

Lemma 2.1.3 (Fixed point iteration). Let h : D ⊂ R → R be a continuous function. If the recursively defined sequence (xn), x_{n+1} = h(xn) with x0 ∈ D, remains in D and converges towards a z ∈ D, then z is a fixed point of h.

Example 1. We want to solve the equation

x = h(x) := cos(x).

Executing the corresponding fixed point iteration we compute the following values:

x0 := 0.

x1 = h(x0) = 1.

x2 = h(x1) = 0.5403023059

x3 = h(x2) = 0.8575532158

x4 = h(x3) = 0.6542897905

[Figure: the fixed point iteration shown graphically]

We have the following qualitative statement about convergence of the fixed point iteration:

Theorem 2.1.4 (Convergence of fixed point iteration). Let  h : D ⊂ R → R be a function with a fixed point  z ∈ D.

• If there exists ε ∈ R_{>0} and L ∈ [0, 1[ such that:

I_ε := {x ∈ R | |x − z| < ε} ⊂ D and ∀x ∈ I_ε : |h(x) − h(z)| ≤ L·|x − z|

then the fixed point iteration converges towards z for any initial value x0 ∈ I_ε.

• If furthermore h is p-times continuously differentiable and

h'(z) = ··· = h^{(p−1)}(z) = 0 and h^{(p)}(z) ≠ 0

then the convergence order is p.


Proof: For any x ∈ I ε we have

|h(x) − h(z)| ≤ L·|x − z| < ε  ⇒  h(x) ∈ I_ε.

As a consequence the fixed point iteration sequence (xn) is well defined and remains in I ε ⊂ D. Furthermore:

|x_{n+1} − z| = |h(xn) − h(z)| ≤ L·|xn − z| ≤ ··· ≤ L^{n+1}·|x0 − z| → 0

since L < 1.

The statement about the convergence order follows with the help of Taylor’s formula:

h(x) = h(z) + (x − z)·h'(z) + ··· + (1/(p−1)!)·(x − z)^{p−1}·h^{(p−1)}(z) + (1/p!)·(x − z)^p·h^{(p)}(z + s·(x − z)).

So under the conditions of the theorem we conclude:

x_{n+1} − z = h(xn) − h(z) = (1/p!)·(xn − z)^p·h^{(p)}(z + s_n·(xn − z))  ⇒  lim_{n→∞} (x_{n+1} − z)/(xn − z)^p = (1/p!)·h^{(p)}(z).

eop

Clearly, this theorem cannot be applied directly to a specific problem since it requires knowledge of the existence and position of the fixed point. But it contains the following qualitative statement: If a continuously differentiable function h has a fixed point z with |h'(z)| < 1 then there is a — possibly small — neighbourhood where fixed point iteration will work. Indeed by continuity there exists an ε such that

sup{|h'(x)| | x ∈ I_ε} ≤ L := 0.5·(1 + |h'(z)|) < 1

which by the mean value theorem of Calculus implies the condition of the theorem.

The following theorem gives verifiable conditions to guarantee convergence of the fixed point iteration. We need the following definition:

Definition 2.1.5 (Contraction). A function h : D ⊂ R → R is called Lipschitz-continuous if there is a constant L ≥ 0 such that

∀x, y ∈ D : |h(y) − h(x)| ≤ L·|y − x|.

It is called a contraction if L ∈ [0, 1[.

Practically this condition will be checked with the help of the derivative of  h — if available:

|h(y) − h(x)| = |h'(x + s·(y − x))·(y − x)| ≤ sup{|h'(x)| | x ∈ D}·|y − x|

Theorem 2.1.6 (Contraction mapping principle). Let D ⊂ R be closed, h : D → R a function which satisfies:

1. h : D → D.

2. h is a contraction.

Then: h has exactly one fixed point z in D. The fixed point iteration will converge for any starting value x0 ∈ D. We have the following error estimates:

|xn − z| ≤ (L/(1 − L))·|xn − x_{n−1}| ≤ (L^n/(1 − L))·|x1 − x0|.


This theorem is the 1-dimensional version of the contraction mapping principle discussed and proved in the next section.

Example 2. Solve the equation

f(x) := 2 − (2/3)·x − exp(x) = 0.

The function f : R → R is decreasing. We have f(0) > 0 and f(1) < 0. Hence this equation has a unique solution which lies in [0, 1]. One may try to apply fixed point iteration to the function

g : R → R,  g(x) := 3 − (3/2)·exp(x).

But there is no chance to apply the theorem since

∀x ∈ [0, 1] : |g'(x)| = |−(3/2)·exp(x)| ≥ 3/2 > 1.

So one may try to use

h : [0, 1] → R,  h(x) := ln(2 − (2/3)·x).

We compute

h'(x) = (1/(2 − (2/3)·x))·(−2/3) = 1/(x − 3).

So h' is negative, hence h is decreasing, and we conclude that the first condition of the theorem is satisfied:

x ∈ [0, 1] ⇒ 1 ≥ 0.693 = h(0) ≥ h(x) ≥ h(1) = 0.288 ≥ 0 ⇒ h(x) ∈ [0, 1].

We use the fact that |h'| is increasing to verify the contraction condition:

∀x ∈ [0, 1] : |h'(x)| = 1/(3 − x) ≤ 1/(3 − 1) = 0.5 =: L

Fixed point iteration and error estimates L/(1 − L)·|xn − x_{n−1}|:

 n   xn             error estimate
 1   0.6931471806   6.93147e-01
 2   0.4304190719   2.62728e-01
 3   0.5382777145   1.07859e-01
 4   0.4953961127   4.28816e-02
 5   0.5126654852   1.72694e-02
 6   0.5057465531   6.91893e-03
 7   0.5085243565   2.77780e-03
 8   0.5074100545   1.11430e-03
 9   0.5078572004   4.47146e-04
10   0.5076777943   1.79406e-04

[Figure: the graphs of g and h]
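The table can be reproduced with the following Python sketch (ours; L = 0.5 as derived above, the printed bound being the a-posteriori estimate):

import math

def fixed_point(h, x0, n, L):
    """Fixed point iteration, printing the bound L/(1-L) * |xn - x_{n-1}|."""
    x = x0
    for k in range(1, n + 1):
        x_new = h(x)
        bound = L / (1 - L) * abs(x_new - x)
        print(k, x_new, bound)
        x = x_new

fixed_point(lambda x: math.log(2 - 2 * x / 3), 0.0, 10, L=0.5)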

2.1.3 Newton’s Method

Consider a continuously differentiable function f : D → R, D ⊂ R. In order to compute a zero of f define a sequence (xn) recursively from a starting value x0 ∈ D: When xn has been computed, approximate the graph of f by its tangent,

T(x) := f(xn) + (x − xn)·f'(xn)


and take the zero of this tangent as the next iterate:

x_{n+1} := xn − f(xn)/f'(xn)

Example 3. Compute the square root √a, a > 0, as a zero of the function f : R → R, f(x) := x² − a. Then the iterates of Newton’s method are defined by the recursion formula:

x_{n+1} = xn − (xn² − a)/(2·xn) = (1/2)·(xn + a/xn).

This method is also known as “Babylonian square rooting”. It converges to the positive square root for each x0 > 0. For a := 3 and x0 := 3 one computes the following values:

 n    0            1            2            3            4
 xn   3.000000000  2.000000000  1.750000000  1.732142857  1.732050810

Example 4. Consider the function f := arctan : R → R which has the only zero z = 0. If we try to “compute” it by Newton’s method we apply the recursion:

x_{n+1} := xn − arctan(xn)·(1 + xn²)

and obtain the following values for two different starting points:

 n        0            1             2             3              4
 x0 = 1:  1.000000000  −0.570796327  0.1168599042  −0.0010610221  0.000000000
 x0 = 2:  2.000000000  −3.535743590  13.95095909   −279.3440667   122016.9990

As we see, for x0 = 1 the method converges and for x0 = 2 it diverges. So in general the behaviour of Newton’s method may depend on the initial value.

Example 5. Solve the equation

f(x) = x^20 − 1 = 0,  x > 0.

The equation has exactly one positive root at x = 1. Newton’s method yields the iteration

x_{n+1} = xn − (xn^20 − 1)/(20·xn^19) = (19/20)·xn + 1/(20·xn^19).

If we start with x0 = 1/2 then x1 = 26214.875, a huge number, which is much farther away from the root x* = 1 than our starting value x0. It takes over 200 steps to get back into the vicinity of x* = 1. This example again shows that Newton’s method converges to a zero only if we start in the vicinity of that zero.

There is a theorem which guarantees local convergence.

Theorem 2.1.7. Let f : D ⊂ R → R be a twice continuously differentiable function, z be a simple (this means: f'(z) ≠ 0) zero of f. Assume that there is an ε > 0 such that

I_ε := {x ∈ R | |x − z| ≤ ε} ⊂ D and ε·M(ε) < 1

where

M(ε) := max{ |f''(s)/(2·f'(t))|  |  s, t ∈ I_ε }.

Then for every x0 ∈ I_ε Newton’s method is well defined and converges quadratically to the unique root z ∈ I_ε.


This theorem does not allow one to check convergence for a given x0 but it confirms that Newton’s method will converge if x0 is sufficiently close to the zero z — “local convergence”. Indeed by continuity there exists a δ ∈ R_{>0} such that:

∀s, t ∈ I_δ : |f''(s)/(2·f'(t))| ≤ 2·|f''(z)/(2·f'(z))| =: c.

If we take

ε := min{δ, 1/(2c)}

then the conditions of the theorem are satisfied.

Proof of the theorem: The Newton iteration fits into the framework of the previous section by

h(x) := x − f(x)/f'(x).

So we check the first condition of the convergence theorem for x ∈ I_ε:

h(x) − h(z) = x − f(x)/f'(x) − z = (x − z) − (f(x) − f(z))/f'(x)
            = (x − z) − (f'(x)·(x − z) − 0.5·f''(t)·(x − z)²)/f'(x) = (f''(t)·(x − z))/(2·f'(x)) · (x − z)

Here we have used Taylor’s theorem with an intermediate point t between x and z. With the notation of the theorem we obtain:

|h(x) − h(z)| ≤ ε·M(ε)·|x − z|.

The statement about the order follows from the following computation:

h(x) = x − f(x)/f'(x)

h'(x) = 1 − (f'(x)² − f(x)·f''(x))/(f'(x))² = f(x)·f''(x)/(f'(x))²

h''(x) = f(x)·( f''(x)/(f'(x))² )' + f''(x)/f'(x)

Hence

h'(z) = 0,  h''(z) = f''(z)/f'(z) ≠ 0

unless f''(z) = 0. Newton’s method converges therefore quadratically, and with order 3 when f''(z) = 0.

eop

2.1.4 Polynomial Equations

Assume that we want to compute zeros of a polynomial by Newton’s method. There is a particular tool to compute the values of the polynomial and its derivative. It is known as Horner’s scheme. It is more efficient than direct evaluation.

So let P be a polynomial of degree m:

P(x) = Σ_{k=0}^m a_k·x^k = a_m·x^m + a_{m−1}·x^{m−1} + ··· + a_1·x + a_0,  a_m ≠ 0.


If we want to evaluate it at a point t then we may represent P as

P(x) = (x − t)·Q(x) + b_0, with some polynomial Q(x) = Σ_{k=0}^{m−1} b_{k+1}·x^k.   (2.1)

This represents a division of P by x − t with remainder b_0. By multiplying the right hand side and comparing the coefficients one obtains the following recursion to compute the coefficients b_k:

b_m = a_m and b_k = a_k + t·b_{k+1},  k = m − 1, m − 2, ..., 1, 0.

By equation (2.1) we have P(t) = b_0, and by differentiating we obtain P'(t) = Q(t), so P'(t) can be computed by applying Horner’s scheme to Q.
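Both passes can be combined in one loop; the following Python sketch (ours) returns P(t) and P'(t) simultaneously:

def horner(coeffs, t):
    """Evaluate P(t) and P'(t) by two nested Horner passes.
    coeffs = [a_m, ..., a_1, a_0], highest power first."""
    b = 0.0   # running value of P, i.e. the coefficients b_k
    q = 0.0   # running value of Q, which gives P'(t) = Q(t)
    for a in coeffs:
        q = q * t + b
        b = b * t + a
    return b, q   # P(t), P'(t)

print(horner([1, 0, -10, 12, -2], 2))   # (-2, 4), cf. Example 6 below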

Example 6. To evaluate

P(x) = x^4 − 10x² + 12x − 2

at the point x = 2 we form the following table

        x^4   x^3   x^2   x^1   x^0
         1     0   −10    12    −2
x = 2:         2     4   −12     0
         1     2    −6     0    −2
        b4    b3    b2    b1    b0 = P(2)

and deduce that P(2) = −2. To evaluate the derivative of P at the point x = 2 we again use the Horner scheme

        x^3   x^2   x^1   x^0
         1     2    −6     0
x = 2:         2     8     4
         1     4     2     4

and deduce that P'(2) = 4.

Now suppose t = z, where z is a zero of P. Then it follows from (2.1) that b_0 = 0, that is

P(x)/(x − z) = Q(x).

The polynomial Q is then called the deflated polynomial. Once Newton’s method has converged to a zero z, Horner’s scheme gives the coefficients of the deflated polynomial. Then we can apply Newton’s method again to compute a zero of the deflated polynomial, and in this way we obtain all zeros. In practice this process can be quite unsatisfactory. Since we can compute only an approximation z̃ of the correct zero z, the coefficients of the deflated polynomial will not be exact. The cumulative effect of the errors introduced at each deflation step can be disastrous.

2.2 Systems of equations

2.2.1 Norms

In many problems of Numerical Analysis it is necessary to determine or estimate the length of vectors. Quite often it is useful to generalize the concept of “length of a vector” and not just apply the Euclidean length derived from geometry.

So let V = R^n and K = R, or V = C^n and K = C. (The following definition applies also to more general vector spaces V over a field K.)

A norm is a map ‖·‖ : V → R_{≥0} with the following properties:


• ∀x ∈ V : ‖x‖ = 0 ⇔ x = 0.

• ∀s ∈ K : ∀x ∈ V : ‖s·x‖ = |s|·‖x‖.

• ∀x, y ∈ V : ‖x + y‖ ≤ ‖x‖ + ‖y‖.

The last inequality is called “triangle inequality” .

The properties of a norm are, of course, satisfied by the standard Euclidean length which may be defined by the Euclidean scalar product:

‖x‖_2 := √⟨x, x⟩, where ⟨x, y⟩ := Σ_{j=1}^n x_j·ȳ_j.

There are two more norms frequently appearing in Numerical Analysis:

‖x‖_1 := Σ_{j=1}^n |x_j|

and

‖x‖_∞ := max{|x_1|, ..., |x_n|}.

Convergence in (V, ‖·‖) is defined by

x^(k) → y :⇔ ‖x^(k) − y‖ → 0.

So in general normed spaces this definition depends on the chosen norm. But in the case V = R^n and V = C^n it is independent and equivalent to componentwise convergence:

x^(k) → y  ⇔  x^(k)_i → y_i for i = 1, ..., n.

2.2.2 Contraction Mapping Principle

In this section we examine the convergence of fixed point iteration, now for a map h : D ⊂ R^d → R^d. Similar to the 1-dimensional case we define:

Definition 2.2.1 (Contraction). A map h : D ⊂ R^d → R^d is called Lipschitz-continuous with respect to a norm ‖·‖ if there exists a constant L ∈ R_{≥0} such that:

∀x, y ∈ D : ‖h(y) − h(x)‖ ≤ L·‖y − x‖.

It is called a contraction if L ∈ [0, 1[.

Theorem 2.2.2 (Contraction mapping principle). Let D ⊂ R^d be a closed subset and h : D → R^d be a map which satisfies:

1. h : D → D.

2. h is a contraction with respect to a norm ‖·‖.

Then the recursively defined sequence (fixed point iteration) (xn) — x_{n+1} = h(xn) with arbitrary x0 ∈ D — converges to the unique fixed point z ∈ D of h. The following estimates are valid for n ∈ N:

‖z − xn‖ ≤ (L/(1 − L))·‖xn − x_{n−1}‖ ≤ (L^n/(1 − L))·‖x1 − x0‖.


Proof: Due to the first condition x0 ∈ D implies x1 = h(x0) ∈ D, and this again x2 = h(x1) ∈ D, and so on. So inductively the sequence (xn) is well defined and remains in D.

For the convergence proof we use Cauchy’s criterion. So we have to verify that ‖x_{n+p} − xn‖ → 0 for each p ∈ N. The basic estimate is:

‖x_{n+1} − xn‖ = ‖h(xn) − h(x_{n−1})‖ ≤ L·‖xn − x_{n−1}‖ ≤ ··· ≤ L^n·‖x1 − x0‖.

This implies with the help of the geometric series formula:

‖x_{n+p} − xn‖ = ‖Σ_{i=1}^p (x_{n+i} − x_{n+i−1})‖ ≤ Σ_{i=1}^p ‖x_{n+i} − x_{n+i−1}‖
              ≤ Σ_{i=1}^p L^{i−1}·‖x_{n+1} − xn‖ ≤ (L/(1 − L))·‖xn − x_{n−1}‖ ≤ (L^n/(1 − L))·‖x1 − x0‖

Since L^n → 0, Cauchy’s criterion is satisfied; hence xn → z ∈ D. The error estimates follow from the above inequalities by letting p → ∞.

Finally the fixed point is unique: If there were a second one, say z̃ ∈ D, then:

‖z − z̃‖ = ‖h(z) − h(z̃)‖ ≤ L·‖z − z̃‖ < ‖z − z̃‖

which is a contradiction. eop

2.2.3 Newton’s Method

Here we consider a nonlinear system of equations defined in the following way: Given a map f : D ⊂ R^d → R^d, find z ∈ D such that f(z) = 0.

As in the 1-dimensional case we define an approximating sequence (xn) by the following principle: Start with some x0 ∈ D. When some xn ∈ D has been computed consider a linear approximation of f by:

T(x) = f(xn) + f'(xn)·(x − xn)

and define the next iterate as the zero of  T :

x_{n+1} = xn − [f'(xn)]^{−1}·f(xn).

In this context f'(x) is the Jacobian of f. That means: If f is represented by its component functions:

f(x) = (f_i(x_1, x_2, ..., x_d))_{i=1,2,...,d}, for x = (x_1, x_2, ..., x_d),

then f'(x) is a d × d-matrix with entries:

(f'(x))_{i,j} = ∂_j f_i(x),  i, j = 1, 2, ..., d.

In practice one does not use the inverse of f'(xn) explicitly but rather computes x_{n+1} from the linear equation T(x) = 0.

Similar to the 1-dimensional case there are some convergence statements for functions f with special properties. But in general only local convergence can be expected, by starting sufficiently close to the solution. A code sketch is given below.
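A minimal Python sketch of the method (ours; the example system is our own, and numpy.linalg.solve replaces the explicit inverse):

import numpy as np

def newton_system(f, jacobian, x0, steps=20, tol=1e-12):
    """Newton's method for f(x) = 0 in R^d; solves the linear system
    f'(xn) s = -f(xn) for the step s instead of inverting the Jacobian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        s = np.linalg.solve(jacobian(x), -f(x))
        x = x + s
        if np.linalg.norm(s) < tol:
            break
    return x

# Example (ours): intersection of the circle x^2 + y^2 = 4 with y = x^2
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4, x[1] - x[0]**2])
J = lambda x: np.array([[2*x[0], 2*x[1]], [-2*x[0], 1.0]])
print(newton_system(f, J, [1.0, 1.0]))   # ~[1.2496, 1.5616]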


Chapter 3

Linear Systems of Equations

In this chapter we discuss methods for solving systems of linear equations

a11·x1 + ··· + a1n·xn = b1
  ...
am1·x1 + ··· + amn·xn = bm

This system may be expressed more compactly as

Σ_{j=1}^n a_{ij}·x_j = b_i,  1 ≤ i ≤ m,

or making use of matrix multiplication as

Ax = b

where A is an (m × n)-matrix containing the coefficients

A := ( a11 ... a1n )
     ( ...     ... )
     ( am1 ... amn )

and x and b are vectors.

From Linear Algebra it is known that such a system may or may not have solutions. Possibly solutions are not unique. In the particular case that m = n and A is regular — this means the inverse A^{−1} exists which satisfies AA^{−1} = I and A^{−1}A = I — there exists a unique solution.

There are two different classes of methods for the solution of linear systems: direct methods and iterative methods. We begin with direct methods.

3.1 Gauss Elimination

“Gauss elimination” is the standard method known from Linear Algebra to solve a linear system of equations

Ax = b, where A ∈ Rm×n and x ∈ Rn, b ∈ Rm.


Here A and b are given and x is to be computed. This system is transformed into an equivalent system Rx = c with an upper-triangular matrix R,

R = ( r11  ...  r1m  ...  r1n )
    (      ...  ...       ... )
    ( 0    ...  rmm  ...  rmn )     (if m ≤ n).

Such a system is easily solved. We start with any solution of the last equation

r_{mm}·x_m + ··· + r_{mn}·x_n = c_m

and obtain the remaining values by back-substitution (if r_{ii} ≠ 0):

x_i = (1/r_{ii})·(c_i − Σ_{k=i+1}^n r_{ik}·x_k),  i = m − 1, ..., 1

Instead of working with the equations themselves the operations are carried out on the extended matrix

(A, b) := ( a11 ... a1n  b1 )
          ( ...     ...  .. )
          ( am1 ... amn  bm )

We explain the algorithm more explicitly. Given a matrix A, a row operation on A is one of the following two operations:

• adding a multiple of a row onto another row;

• exchanging two rows.

In Linear Algebra also multiplication of a row with a constant and interchanging of columns are considered. Row operations are closely connected to linear equations. The key property is the following:

Theorem 3.1.1. If the matrix (Ã, b̃) is obtained from (A, b) by a sequence of row operations, then the solution-sets of Ax = b and of Ãx = b̃ coincide.

Now, Gaussian elimination transforms a given system into an equivalent system with upper-triangular matrix. It proceeds inductively: In the first step the elements of the first column below the diagonal are eliminated, then the elements of the second column below the diagonal, and so on.

Let B = (A, b) be the extended system. For the first step assume that b11 ≠ 0, define factors

l_{j1} := b_{j1}/b_{11},  j = 2, ..., m

and obtain the transformed matrix B' with

B' := ( b11  b12      ...  b_{1,n+1}  )
      ( 0    b'_{22}  ...  b'_{2,n+1} )
      ( ...  ...           ...        )
      ( 0    b'_{m,2} ...  b'_{m,n+1} )

So the first row is unchanged and the other entries are computed by

b'_{jk} = b_{jk} − l_{j1}·b_{1k}   (j = 2, ..., m; k = 2, ..., n + 1)


Figure 3.1: Gauss algorithm

for i from 1 to m-1 do
    find pivot b(k,i) (k >= i) or die
    swap rows k and i of B and e
    for j from i+1 to m do
        l := b(j,i)/b(i,i)
        for k from i+1 to n+1 do
            b(j,k) := b(j,k) - l*b(i,k)
        next k
        b(j,i) := l
    next j
next i

Given A ∈ R^{m×n} and b ∈ R^m define the extended matrix B = (A, b). Use a vector e, initially e = (1, ..., m), to record the information about row interchanges. The entries which are transformed into 0 are used to store the factors l for further use.

When i − 1 steps have been executed we have obtained a matrix with the following shape:

B = ( b11  ...                      b_{1,n+1} )
    ( 0    b22  ...                 b_{2,n+1} )
    ( ...       ...  ...            ...       )
    ( 0    ...  0    b_{ii}   ...   b_{i,n+1} )
    ( ...       ...  ...            ...       )
    ( 0    ...  0    b_{m,i}  ...   b_{m,n+1} )

For the next step we assume b_{ii} ≠ 0 and obtain:

B' = ( b11  ...                                    b_{1,n+1}    )
     ( 0    b22  ...                               b_{2,n+1}    )
     ( ...       ...  ...                          ...          )
     ( 0    ...  0    b_{ii}  ...                  b_{i,n+1}    )
     ( 0    ...  0    0       b'_{i+1,i+1}  ...    b'_{i+1,n+1} )
     ( ...       ...  ...     ...                  ...          )
     ( 0    ...  0    0       b'_{m,i+1}    ...    b'_{m,n+1}   )

where the first i rows remain unchanged and the following rows are computed as:

b'_{jk} = b_{jk} − l_{ji}·b_{ik}   (j = i + 1, ..., m; k = i + 1, ..., n + 1),   l_{ji} := b_{ji}/b_{ii}

This algorithm requires at each step that b_{ii} ≠ 0. If that is not the case one tries to find some b_{ki} ≠ 0 with k > i and swaps the rows i and k. This element is called pivot element.

A formal description of this algorithm is given in figure 3.1.

From Linear Algebra we take the information that in the case of a regular matrix A (so especially m = n) there is a pivot element available at each step and the algorithm will succeed to transform A into an upper-triangular matrix R.
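For the regular square case the complete procedure — elimination with column pivoting followed by back-substitution — can be written compactly in Python (our own sketch of figure 3.1, without the bookkeeping of e and the stored factors l):

import numpy as np

def gauss_solve(A, b):
    """Gauss elimination with column pivoting and back-substitution."""
    B = np.hstack([np.asarray(A, float), np.asarray(b, float).reshape(-1, 1)])
    m = B.shape[0]
    for i in range(m - 1):
        k = i + np.argmax(np.abs(B[i:, i]))   # pivot: largest entry in column i
        B[[i, k]] = B[[k, i]]                 # swap rows i and k
        for j in range(i + 1, m):
            B[j, i:] -= B[j, i] / B[i, i] * B[i, i:]
    x = np.zeros(m)
    for i in range(m - 1, -1, -1):            # back-substitution
        x[i] = (B[i, -1] - B[i, i+1:m] @ x[i+1:]) / B[i, i]
    return x

print(gauss_solve([[2, 3, -6], [1, -6, 8], [3, -2, 1]], [1, 0, 1]))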


3.1.1 Gauss elimination and triangular decomposition

We will now show that the row-operations of the Gauss algorithm can be described by matrix operations.

Definition 3.1.2 (Permutation matrix). By a permutation matrix P = P[i, k] we understand an m × m-matrix where all entries are 0 except:

P_{i,k} := 1, P_{k,i} := 1, P_{j,j} := 1 for j ≠ i, j ≠ k.

Definition 3.1.3 (Frobenius matrix). A Frobenius matrix is an m × m-matrix L_i = I − M_i where the matrix M_i has zero entries everywhere except:

(M_i)_{j,i} = l_{j,i},  j = i + 1, ..., m

The following properties can easily be proved by plain matrix multiplication.

Lemma 3.1.4. The matrices  P [i, k] and  Li = I − M i have the following properties:

1. P [i, k]P [i, k] = I .

2. (I ± M i)(I ± M j ) = I ± M i ± M j for  j ≥ i

3. (I − M i)(I + M i) = I 

4. M iP [ j, k] = M i if  i < j < k.

Lemma 3.1.5. For any matrix  B ∈ Rm×n

1. P [i, k]B reproduces  B with rows  i and  k interchanged.

2. BP [i, k] reproduces  B with columns  i and  k interchanged if  m = n.

3. LiB subtracts multiples of row i from rows j > i.

4. BL−1i = B(I + M i) adds multiples of columns  j > i to column  i if  m = n.

Now let P_i be the appropriate permutation matrix in the i-th step of the Gauss elimination and L_i the matrix built from the coefficients l_{ji}. Then the algorithm produces an upper-triangular matrix R by:

R := L_{n−1}·P_{n−1} ··· L_1·P_1·A

We denote

Q_i := P_{n−1}·P_{n−2} ··· P_i

so that Q := Q_1 denotes the complete permutation of rows during the algorithm. Hence:

QA = Q·P_1·L_1^{−1}·P_2·L_2^{−1} ··· P_{n−2}·L_{n−2}^{−1}·P_{n−1}·L_{n−1}^{−1}·R
   = Q_2·(I + M_1)·P_2·L_2^{−1} ··· P_{n−2}·L_{n−2}^{−1}·P_{n−1}·L_{n−1}^{−1}·R
   = (I + Q_2·M_1)·Q_3·L_2^{−1} ··· P_{n−2}·L_{n−2}^{−1}·P_{n−1}·L_{n−1}^{−1}·R
   = (I + Q_2·M_1)·(I + Q_3·M_2)·Q_4 ··· P_{n−2}·L_{n−2}^{−1}·P_{n−1}·L_{n−1}^{−1}·R
   ...
   = (I + Q_2·M_1)·(I + Q_3·M_2) ··· (I + Q_{n−1}·M_{n−2})·(I + M_{n−1})·R
   = (I + Q_2·M_1 + Q_3·M_2 + ··· + Q_{n−1}·M_{n−2} + M_{n−1})·R =: L·R

We observe that the matrix L defined above is computed by the Gauss elimination algorithm: The factors l_{ij} are stored on the free places of the matrix B, and the interchanging of rows given by the matrices Q_i is also included in the algorithm. The information about Q can be read off from the list e.


Theorem 3.1.6. If the matrix A ∈ R^{n×n} is regular, the Gauss elimination algorithm — cf. figure 3.1 — produces a triangular decomposition QA = LR where

l_{ij} = b_{ij} if j < i,  1 if j = i,  0 if j > i;
r_{ij} = b_{ij} if j ≥ i,  0 if j < i.

3.1.2 Pivoting strategies

To execute the Gaussian elimination algorithm it is necessary to have a non-zero pivot element. In computing it makes no sense to test whether a floating point number is zero or not. So it appears reasonable to take the largest available element as pivot element. Furthermore numerical reasons influence the choice of this element:

Example 1. Solve the following system with 2 significant digits:

0.50·10^{-3}·x + y = 0.50
             x + y = 1

Gauss elimination yields:

( 0.50·10^{-3}  0.10·10^1  0.50·10^0 )       ( 0.50·10^{-3}  0.10·10^1  0.50·10^0 )
( 0.10·10^1     0.10·10^1  0.10·10^1 )  -->  ( 0             0.20·10^4  0.10·10^4 )

From the last equation

0.20·10^4·y = 0.10·10^4

we conclude y = 0.10/0.20 = 0.50·10^0 and from the first

0.50·10^{-3}·x + 0.10·10^1·y = 0.50·10^0

that

x = (0.50·10^0 − 0.50·10^0)/(0.50·10^{-3}) = 0.

This result is rather disastrous because the exact solution is

x = 1000/1999 ≈ 0.50025,  y = 999/1999 ≈ 0.49975.

The reason for the poor result is that the pivot element 0.50·10^{-3} is very small compared to the other elements of the matrix.

Interchanging the first and second row we have to solve:

( 0.10·10^1     0.10·10^1  0.10·10^1 )       ( 0.10·10^1  0.10·10^1  0.10·10^1 )
( 0.50·10^{-3}  0.10·10^1  0.50·10^0 )  -->  ( 0.00·10^0  0.10·10^1  0.50·10^0 )

From the last equation it follows

y = 0.50·10^0 / 0.10·10^1 = 0.50·10^0

and from the first

x = 0.50·10^0.

To avoid such disastrous results we have to be careful in selecting the pivot element. A simple guideline is: Choose the absolutely largest element in the column. So determine the pivot element b(k, i) in Gaussian elimination by

|b(k, i)| = max{ |b(j, i)|  |  j = i, ..., m }.


Of course, any row of a linear system of equations may be multiplied with a constant without changing the solution. So the mentioned guideline may not be significant. Practice shows that the matrix entries should be of comparable size. But instead of scaling the matrix beforehand, only the choice of the pivot element is based on the scaled entries. This is done by the so-called scaled partial pivot selection.

In the first step we compute the scale of each row:

si := max{|ai1|, |ai2|, . . . , |ain|}.

As the pivot we select the element b(k, i) determined by

|b(k, i)|/s_k = max{ |b(j, i)|/s_j  |  j = i, ..., m }.

Example 2. Let

A := ( 2   3  −6 )
     ( 1  −6   8 )
     ( 3  −2   1 ).

Then s = (6, 8, 3). Because the maximum of {2/6, 1/8, 3/3} is 3/3, we have to interchange rows 1 and 3. This is accomplished by multiplying from the left by the permutation matrix

P_1 := ( 0 0 1 )
       ( 0 1 0 )
       ( 1 0 0 ).

We obtain

P_1·A = ( 3  −2   1 )
        ( 1  −6   8 )
        ( 2   3  −6 ).

Now

L_1 := ( 1     0  0 )
       ( −1/3  1  0 )
       ( −2/3  0  1 )

and from that it follows

L_1·P_1·A = ( 3  −2      1     )
            ( 0  −16/3   23/3  )
            ( 0   13/3  −20/3  ).

Now we have to determine the next pivot row. Because we have interchanged rows 1 and 3 our scale vector is s = (3, 8, 6). The maximum of 16/(3·8) and 13/(3·6) is 13/(3·6). Therefore row 3 is the new pivot row and we have to interchange rows 2 and 3. Thus

P_2 := ( 1 0 0 )           L_2 := ( 1  0      0 )
       ( 0 0 1 )    and           ( 0  1      0 )
       ( 0 1 0 )                  ( 0  16/13  1 ).

Therefore

R = L_2·P_2·L_1·P_1·A = ( 3  −2     1    )
                        ( 0  13/3  −20/3 )
                        ( 0  0     −7/13 ).

We have

P = ( 0 0 1 )
    ( 1 0 0 )
    ( 0 1 0 ).


and

L = P·(L_2·P_2·L_1·P_1)^{−1} = ( 1    0       0 )
                               ( 2/3  1       0 )
                               ( 1/3  −16/13  1 ).

It is easy to check that

L·R = P·A = ( 3  −2   1 )
            ( 2   3  −6 )
            ( 1  −6   8 ).

3.1.3 Direct triangular decomposition

Now suppose that A ∈ R^{n×n} can be factored into the product of a lower triangular matrix L and an upper triangular matrix R — both regular (diagonal elements different from 0) — such that A = LR. Then the system Ax = b may be written as

LRx = b ⇔ Ly = b and Rx = y.

So y can be computed by forward-substitution:

y_i = (1/l_{ii})·(b_i − Σ_{k=1}^{i−1} l_{ik}·y_k),  i = 1, 2, ..., n,

and x by backward-substitution:

x_i = (1/r_{ii})·(y_i − Σ_{k=i+1}^n r_{ik}·x_k),  i = n, n − 1, ..., 1.

Here a sum with an empty index set is defined to be 0.
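Both substitutions translate directly into Python (our own sketch):

import numpy as np

def forward_substitution(L, b):
    """Solve Ly = b for a regular lower triangular L."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_substitution(R, y):
    """Solve Rx = y for a regular upper triangular R."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
    return x

# Given A = LR, the system Ax = b is solved by:
#   y = forward_substitution(L, b); x = backward_substitution(R, y)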

Therefore we can solve a linear system if we can factor A. So we try to compute matrices

L = ( l11  0    ...  0    )        R = ( r11  r12  ...  r1n  )
    ( l21  l22  ...  0    )            ( 0    r22  ...  r2n  )
    ( ...       ...  ...  )            ( ...       ...  ...  )
    ( ln1  ...       ln,n )            ( 0    ...  0    rn,n )

aij =

nk=1

likrkj =

min(i,j)k=1

likrkj , i, j = 1, . . . , n .

These n² equations can be solved recursively. In the Crout method we set l_{ii} = 1 and partition the n × n-matrix A as follows (the numbers indicate the order in which the parts are processed, here for n = 5):

( 1 1 1 1 1 )
( 2 3 3 3 3 )
( 2 4 5 5 5 )
( 2 4 6 7 7 )
( 2 4 6 8 9 )


Then we solve equations for L and R in the order indicated by the this partitioning:

m = 1, 2, . . . n : i = m , . . . n : ami =m

k=1 lmkrki

i = m + 1, . . . n : aim =m

k=1 likrkm

These equation can be solved in this order for the unknowns:

(m = 1) i = 1, . . . n : r1i := a1i

i = 2, . . . n : li1 := ai1r11

m = 2, . . . n :

i = m , . . . n : rmi := ami −m−1k=1

lmkrki

i = m + 1, . . . n : lim :=1

rmm

(aim

m−1

k=1

likrkm)

Check that in the sums only values computed in previous steps are used.
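The recursion translates directly into Python (our own sketch; no pivoting, so all r_{mm} must turn out nonzero):

import numpy as np

def crout(A):
    """Triangular decomposition A = LR with l_ii = 1, as derived above."""
    A = np.asarray(A, float)
    n = A.shape[0]
    L, R = np.eye(n), np.zeros((n, n))
    for m in range(n):
        for i in range(m, n):                    # row m of R
            R[m, i] = A[m, i] - L[m, :m] @ R[:m, i]
        for i in range(m + 1, n):                # column m of L
            L[i, m] = (A[i, m] - L[i, :m] @ R[:m, m]) / R[m, m]
    return L, R

L, R = crout([[3, -2, 1], [2, 3, -6], [1, -6, 8]])
print(np.allclose(L @ R, [[3, -2, 1], [2, 3, -6], [1, -6, 8]]))  # True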

3.1.4 Cholesky Decomposition

We recall the following definitions and statements from Linear Algebra: In C^n we use the standard scalar product:

⟨x, y⟩ := Σ_{j=1}^n x_j·ȳ_j,  x = (x_j), y = (y_j) ∈ C^n.

For a matrix A = (a_{jk}) ∈ C^{m×n} we put

A^H = (a^H_{jk}) ∈ C^{n×m} with a^H_{jk} := ā_{kj}.

An important property of this matrix is:

∀x ∈ C^n, y ∈ C^m : ⟨Ax, y⟩ = ⟨x, A^H·y⟩.

In the case of the real space R^n the complex conjugates are irrelevant, and instead of A^H we write A^T and call it the transposed matrix.

Definition 3.1.7. A matrix A ∈ C^{n×n} is said to be positive definite if it satisfies:

a) A = A^H, that is A is a Hermitian — symmetric in the real case — matrix.

b) ∀x ∈ C^n \ {0} : ⟨Ax, x⟩ > 0.

Theorem 3.1.8. Let A be positive definite. Then there exists a unique n×n lower triangular matrix L (l_{ij} = 0 for j > i) with l_{ii} > 0, i = 1, ..., n, satisfying A = LL^H. If A is real, so is L.

Proof. See Stoer/Bulirsch[1980, Th. 4.3.3]

The decomposition A = LL^H can be determined in a manner similar to the general triangular decomposition. From

( l11  0    ...  0   ) ( l̄11  l̄21  ...  l̄n1 )   ( a11  ...  a1n )
( l21  l22  ...  0   ) ( 0    l̄22  ...  l̄n2 ) = ( ...       ... )
( ...       ...      ) ( ...       ...  ... )   ( an1  ...  ann )
( ln1  ln2  ...  lnn ) ( 0    0    ...  l̄nn )


it follows

a_{kk} = Σ_{ν=1}^n l_{kν}·l̄_{kν} = Σ_{ν=1}^k |l_{kν}|²

and for j > k:

a_{jk} = Σ_{ν=1}^n l_{jν}·l̄_{kν} = Σ_{ν=1}^k l_{jν}·l̄_{kν}.

So these equations can be utilized to compute l_{kk} — this requires the computation of a square root — and l_{jk} for k = 1, 2, ..., n.
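In the real case these formulas give the following Python sketch (ours):

import numpy as np

def cholesky(A):
    """Cholesky decomposition A = L L^T for a positive definite real A."""
    A = np.asarray(A, float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for k in range(n):
        L[k, k] = np.sqrt(A[k, k] - L[k, :k] @ L[k, :k])
        for j in range(k + 1, n):
            L[j, k] = (A[j, k] - L[j, :k] @ L[k, :k]) / L[k, k]
    return L

print(cholesky([[3, 1], [1, 6]]))
# [[1.7320508  0.       ]
#  [0.5773503  2.3804761]]  i.e. sqrt(3), 1/sqrt(3), sqrt(17/3) as in Example 3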

Example 3. Decompose

A = ( 3 1 )
    ( 1 6 ).

We have

( l11  0   ) ( l11  l21 )   ( 3 1 )
( l21  l22 ) ( 0    l22 ) = ( 1 6 )

such that

l11² = 3,  l11·l21 = 1,  l21² + l22² = 6,

that is

l11 = √3,  l21 = 1/√3,  l22 = √(6 − 1/3) = √(17/3).

Therefore

L = ( √3    0       )
    ( 1/√3  √(17/3) ).

3.2 Orthogonal Decompositions

In this section we want to decompose A as a product A = QR, where Q is an orthogonal and R is an upper triangular matrix. We recall the definition of the scalar product:

⟨x, y⟩ := Σ_{j=1}^n x_j·ȳ_j,  x, y ∈ C^n.

Two vectors x, y ∈ C^n are called orthogonal if ⟨x, y⟩ = 0.

Definition 3.2.1. A matrix  Q is called  orthogonal (unitary) if it preserves the inner product, that is 

∀x, y ∈ R^n (C^n) : ⟨Qx, Qy⟩ = ⟨x, y⟩.

Consequently an orthogonal matrix preserves angles and lengths. Hence a transformation with such a matrix will not deteriorate the condition of a problem. In R^n they represent rotations and reflections.

As is shown in Linear Algebra the following holds:

Theorem 3.2.2. The three following conditions are equivalent:

1. Q is orthogonal (unitary);

2. The column vectors of the matrix Q form an orthonormal basis;


3. If Q is square it is regular and its inverse is its transpose; i.e. Q^T = Q^{−1} (Q^H = Q^{−1}).

Example 1. The orthogonal matrix giving rotation in R² by angle θ is

Q(θ) = ( cos(θ)  −sin(θ) )
       ( sin(θ)   cos(θ) )

The inverse of Q(θ) is rotation by −θ, and we see that

Q^{−1}(θ) = Q^T(θ) = ( cos(θ)   sin(θ) )
                     ( −sin(θ)  cos(θ) )

Example 2. The following matrix defines a reflection in R²:

( 1   0 )
( 0  −1 ).

3.2.1 Gram-Schmidt-process

At first we explain how we can transform a basis into an equivalent orthonormal basis:

Theorem 3.2.3 (Schmidt orthogonalization). Let  V  be a vector space with an inner product and {u1, . . . , un} be a set of linearly independent elements of  V . Then the following algorithm constructs an orthonormal set  {w1, . . . , wn} having the same span.

Proof. The new elements w_i are defined inductively. At first:

v_1 := u_1 and w_1 := (1/‖v_1‖) v_1.

When w_1, ..., w_k (k < n) have been computed, define:

v_{k+1} := u_{k+1} − Σ_{i=1}^{k} ⟨u_{k+1}, w_i⟩ w_i  and  w_{k+1} := (1/‖v_{k+1}‖) v_{k+1}.

So each w_k is a linear combination of the original elements u_1, ..., u_k. Vice versa, the elements u_k are linear combinations of w_1, ..., w_k since

u_k = ‖v_k‖ w_k + Σ_{i=1}^{k−1} ⟨u_k, w_i⟩ w_i.   (3.1)

Hence both sets span the same space.

Inductively it is verified that by construction the set {w_1, ..., w_n} is orthonormal — ⟨w_i, w_j⟩ = 0 if i ≠ j and ⟨w_i, w_i⟩ = 1.

The result of the Gram-Schmidt orthogonalization process can be expressed in terms of orthogonal matrices:

Theorem 3.2.4. Any matrix A ∈ R^{m×n} with linearly independent columns can be decomposed as a product A = QR, where Q ∈ R^{m×n} has orthonormal columns and R ∈ R^{n×n} is an upper triangular matrix.


Proof. Let u_1, ..., u_n be the columns of A. By the Schmidt process we can orthogonalize these vectors. Let Q be the matrix built with the columns w_i and observe equation (3.1):

                    ( ‖v_1‖  ⟨u_2, w_1⟩  ⟨u_3, w_1⟩  ...  ⟨u_n, w_1⟩ )
( w_1  ...  w_n )   ( 0      ‖v_2‖       ⟨u_3, w_2⟩  ...  ⟨u_n, w_2⟩ )  =  ( u_1  ...  u_n ).
                    ( ...                 ...              ...        )
                    ( 0      0            ...             ‖v_n‖      )

So we have indeed: QR = A.
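As a concrete illustration, here is a short Python sketch of the Schmidt process in matrix form (the function name and NumPy usage are our own); it produces Q with orthonormal columns and the upper triangular R of the proof:

    import numpy as np

    def gram_schmidt_qr(A):
        """QR by the Schmidt process: columns of Q are the w_i, R[i, j] = <u_j, w_i>."""
        A = np.asarray(A, dtype=float)
        m, n = A.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for j in range(n):
            v = A[:, j].copy()
            for i in range(j):
                R[i, j] = np.dot(A[:, j], Q[:, i])   # entry <u_{j+1}, w_{i+1}> of R
                v -= R[i, j] * Q[:, i]               # subtract the projections
            R[j, j] = np.linalg.norm(v)              # ||v_j|| on the diagonal
            Q[:, j] = v / R[j, j]
        return Q, R

    A = np.random.rand(5, 3)
    Q, R = gram_schmidt_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))   # True True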

3.2.2 Jacobi-Rotations

In this section we use Jacobi rotations to transform a matrix A into an upper triangular matrix.

Definition 3.2.5. A Jacobi rotation matrix is the matrix Q_ij(θ) ∈ R^{m×m} (i < j) with entries

q_kl := 0 except q_kk := 1 for k ≠ i, k ≠ j and q_ii := cos θ, q_ij := −sin θ, q_ji := sin θ, q_jj := cos θ.

Q_ij(θ) is a rotation in the plane containing the i-th and j-th standard basis vectors. Figure 3.2 shows such a matrix.

If we multiply A from the left by Q_ij(θ) we obtain a new matrix A′ := Q_ij(θ)A which differs from A only in rows i and j. For rows i and j we have (c := cos θ, s := sin θ)

a′_ik = c a_ik − s a_jk   (3.2)
a′_jk = s a_ik + c a_jk   (3.3)

for k = 1, ..., n. To delete a_ji (j > i) we have to solve the equation

s a_ii + c a_ji = 0 with c² + s² = 1.

A solution is

s = −a_ji / √(a_ii² + a_ji²),  c = a_ii / √(a_ii² + a_ji²).   (3.4)

In this way we can successively delete all the elements of a given m × n matrix A beneath the diagonal by multiplying by an appropriate rotation matrix. So we use a sequence of rotations to transform columns 1, 2, ..., k (e.g. k = n − 1 if m = n, or k = n if m > n):

Q_{k,m} ⋯ Q_{3,4} Q_{2,m} ⋯ Q_{2,4} Q_{2,3} Q_{1,m} ⋯ Q_{1,3} Q_{1,2} A = R.

Figure 3.2: A Jacobi rotation matrix, i = 3, j = 5, m = 8

( 1  0  0       0  0       0  0  0 )
( 0  1  0       0  0       0  0  0 )
( 0  0  cos θ   0  −sin θ  0  0  0 )
( 0  0  0       1  0       0  0  0 )
( 0  0  sin θ   0  cos θ   0  0  0 )
( 0  0  0       0  0       1  0  0 )
( 0  0  0       0  0       0  1  0 )
( 0  0  0       0  0       0  0  1 )


Because the product of orthogonal matrices is again orthogonal we obtain in this way the decomposition

A = QR,  Q := Q_{1,2}^T Q_{1,3}^T ⋯ Q_{k,m}^T.   (3.5)

Example 3. Solve the system

( 3  2   0 )       ( 1 )
( 4  1   3 ) x  =  ( 0 )
( 0  1  −2 )       ( 1 ).

Eliminate a_21 = 4. According to (3.4) we set

s = −a_21/√(a_11² + a_21²) = −4/5 = −0.8,  c = a_11/√(a_11² + a_21²) = 3/5 = 0.6.

Multiplying the augmented matrix from the left we get

(  0.6  0.8  0 ) ( 3  2   0  1 )   ( 5   2  2.4   0.6 )
( −0.8  0.6  0 ) ( 4  1   3  0 ) = ( 0  −1  1.8  −0.8 )
(  0    0    1 ) ( 0  1  −2  1 )   ( 0   1  −2    1   ).

Because a_31 = 0 no elimination is necessary.

Eliminate a_32 = 1. We have

s = −a_32/√(a_22² + a_32²) = −1/√2,  c = a_22/√(a_22² + a_32²) = −1/√2.

Multiplying from the left we obtain

( 1  0      0     ) ( 5   2  2.4   0.6 )   ( 5  2       2.4      0.6     )
( 0  −1/√2  1/√2  ) ( 0  −1  1.8  −0.8 ) = ( 0  1.4142  −2.6870  1.2728  )
( 0  −1/√2  −1/√2 ) ( 0   1  −2    1   )   ( 0  0       0.1414   −0.1414 ).

By back-substitution we obtain the result

x_3 = −1,  x_2 = −1,  x_1 = 1.
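The same elimination can be scripted. Below is a small Python sketch (our own naming, NumPy assumed) that applies Jacobi rotations to the augmented matrix of the example and then back-substitutes:

    import numpy as np

    def qr_solve_givens(A, b):
        """Solve Ax = b by reducing [A | b] with Jacobi (Givens) rotations."""
        M = np.hstack([np.asarray(A, float), np.asarray(b, float).reshape(-1, 1)])
        n = M.shape[0]
        for i in range(n - 1):
            for j in range(i + 1, n):
                if M[j, i] == 0.0:
                    continue                      # nothing to eliminate
                r = np.hypot(M[i, i], M[j, i])
                c, s = M[i, i] / r, -M[j, i] / r  # choice (3.4)
                Mi, Mj = M[i].copy(), M[j].copy()
                M[i] = c * Mi - s * Mj            # row update (3.2)
                M[j] = s * Mi + c * Mj            # row update (3.3)
        # back-substitution on the triangular system
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (M[i, -1] - M[i, i+1:n] @ x[i+1:]) / M[i, i]
        return x

    A = [[3, 2, 0], [4, 1, 3], [0, 1, -2]]
    print(qr_solve_givens(A, [1, 0, 1]))   # [ 1. -1. -1.]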

3.2.3 Householder-Matrices

We derive Householder's method for the real case. This method uses reflections to transform a vector (a column of a matrix) into another one.

Let a vector v ∈ R^n with ‖v‖ = 1 be given. This vector defines a hyperplane

H := {x ∈ R^n | ⟨x, v⟩ = 0}.

Now for a given vector x ∈ R^n the reflected image x′ ∈ R^n with respect to H is computed from an orthogonal decomposition of x into a part parallel to H and a part orthogonal to H:

x = ⟨x, v⟩v + (x − ⟨x, v⟩v)  ⇒  x′ = x − 2⟨x, v⟩v.

In the context of matrix computations we view v as an n × 1 matrix and v^T as a 1 × n matrix; hence vv^T defines an n × n matrix. Furthermore we have: vv^T x = ⟨x, v⟩v.

Lemma 3.2.6. For any v ∈ R^n with ‖v‖ = 1 the matrix U := I − 2vv^T is symmetric and orthogonal.


Figure 3.3: Reflection at the hyperplane H

Proof. The symmetry is obvious. Just check the condition for orthogonality:

U U^T = (I − 2vv^T)(I − 2vv^T)^T = I − 2vv^T − 2vv^T + 4vv^T vv^T = I

(recall: v^T v = ‖v‖² = 1).

Lemma 3.2.7. For two different vectors x, y ∈ R^n with ‖x‖ = ‖y‖ the Householder matrix

U := I − 2vv^T,  with v := (x − y)/‖x − y‖,

satisfies: Ux = y.

Proof. The condition ‖x‖ = ‖y‖ and the choice of v are clearly suggested by the reflection principle. We just check the statement by computation:

Ux = x − 2vv^T x = x − 2⟨x, v⟩v = x − (2⟨x, x − y⟩/‖x − y‖²)(x − y)
   = x − (2(‖x‖² − ⟨x, y⟩)/(‖x‖² + ‖y‖² − 2⟨x, y⟩))(x − y) = x − (x − y) = y,

since ‖x‖ = ‖y‖ makes the fraction equal to 1.

Now we apply Householder transformations to transform a matrix A ∈ R^{m×n} into an upper triangular matrix. Again we proceed inductively, column by column.

Let a_1 be the first column of A. Apply Lemma 3.2.7 to

x := a_1,  y := βe_1 := (β, 0, ..., 0)^T  with  β := −(a_11/|a_11|)‖a_1‖.

The factor β guarantees that x and y have the same length; the particular choice of its sign prevents rounding errors. So in the first step we use a Householder matrix

U_1 := I − 2vv^T, where v := (a_1 − βe_1)/‖a_1 − βe_1‖,


and the resulting matrix A′ := U_1 A has zeros in its first column except (possibly) a′_11.

When k − 1 steps of this method have been executed the result is a matrix A ∈ R^{m×n} with a_ij = 0 for i > j and j = 1, ..., k − 1:

    ( ∗  ⋯  ∗  ∗  ⋯  ∗ )
    ( 0  ⋱  ⋮  ⋮      ⋮ )
A = ( ⋮   ⋱  ∗  ⋮      ⋮ )
    ( 0  ⋯  0  ∗  ⋯  ∗ )
    ( ⋮      ⋮  ⋮      ⋮ )
    ( 0  ⋯  0  ∗  ⋯  ∗ )

Now we apply Lemma 3.2.7 to

x := (0, ..., 0, a_kk, ..., a_mk)^T,  y := βe_k,  β := −(a_kk/|a_kk|)‖x‖.

Hence multiplying A with the resulting Householder matrix U_k produces a matrix A′ = U_k A with the property: a′_ij = 0 for i > j and j = 1, ..., k — in particular the zeros in columns 1, ..., k − 1 are preserved.

After k = m − 1 steps if m ≤ n, or k = n steps if m > n, the result is an upper triangular matrix R ∈ R^{m×n} satisfying

R = U_k ⋯ U_2 U_1 A  or  A = U_1 U_2 ⋯ U_k R.

In practical computations one does not compute the U_j explicitly followed by a standard matrix multiplication. It is faster and easier to make use of the particular structure of Householder matrices: compute w^T := v^T A ∈ R^{1×n} first and then

UA = A − 2vv^T A = A − 2vw^T.
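A compact Python sketch of this procedure (our own function name; NumPy assumed) applies the update A − 2v(v^T A) column by column:

    import numpy as np

    def householder_qr(A):
        """Return R (upper triangular) with R = U_k ... U_1 A, plus the vectors v."""
        R = np.asarray(A, dtype=float).copy()
        m, n = R.shape
        vs = []
        for k in range(min(m - 1, n)):
            x = R[k:, k]
            # beta = -(a_kk / |a_kk|) * ||x||; fall back to -||x|| when a_kk = 0
            beta = -np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else -np.linalg.norm(x)
            v = x.copy()
            v[0] -= beta                  # v = x - beta*e_k, restricted to rows k..m
            nv = np.linalg.norm(v)
            if nv == 0.0:
                continue                  # column already has the right shape
            v /= nv
            R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])   # A - 2 v (v^T A)
            vs.append((k, v))
        return R, vs

    A = np.array([[3., 2., 0.], [4., 1., 3.], [0., 1., -2.]])
    R, _ = householder_qr(A)
    print(np.round(R, 4))   # upper triangular; agrees with the Givens example up to signs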

3.3 Data fitting; Least square problems

Often one has to perform an experiment which attempts to verify a linear relationship between two quantities x, y. This is done in the following way. Having taken a number of test data (x_i, y_i), 1 ≤ i ≤ m, these are plotted and a straight line is drawn which is considered to fit the data best.

Example 1.

x_i | 0.1  1.0  1.9  2.2  3.9
y_i | 6.3  5.2  4.6  4.0  3.0

To draw a straight line

y = αx + β

is a problem of best approximation. One solution is to minimize the expression

Σ_{i=1}^{m} (y_i − (αx_i + β))².   (3.6)

If we define

     ( x_1  1 )
A := (  ⋮   ⋮ )
     ( x_m  1 )


we can rewrite (3.6) as

‖ A (α, β)^T − (y_1, ..., y_m)^T ‖_2².   (3.7)

More generally we have to compute the minimum

min_{x ∈ R^n} ‖Ax − y‖_2   (3.8)

where A is an m × n matrix and y ∈ R^m. To minimize (3.8) is a so-called least squares problem.

A necessary condition for x to minimize (3.8) is

∂/∂x_i (‖Ax − y‖_2²) = 0   (3.9)

for 1 ≤ i ≤ n. From

∂/∂x_i (‖Ax − y‖_2²) = ∂/∂x_i Σ_{j=1}^{m} ( Σ_{k=1}^{n} a_jk x_k − y_j )²
                     = Σ_{j=1}^{m} 2 ( Σ_{k=1}^{n} a_jk x_k − y_j ) a_ji
                     = 2 Σ_{j=1}^{m} Σ_{k=1}^{n} a_jk x_k a_ji − 2 Σ_{j=1}^{m} y_j a_ji
                     = 2 (A^T A x − A^T y)_i

it follows that (3.9) is equivalent to

∇(‖Ax − y‖²) = 2A^T A x − 2A^T y = 0,

respectively

A^T A x = A^T y.   (3.10)

Equations (3.10) are called normal equations. Conditions (3.10) are also sufficient.

Theorem 3.3.1. The least squares problem

min_{x ∈ R^n} ‖Ax − y‖   (3.11)

has at least one minimum point x_0. If x_1 is another minimum point, then Ax_0 = Ax_1. Every minimum point is a solution of the normal equations (3.10), and conversely.

Proof. Let

L = {Ax | x ∈ R^n} ⊂ R^m

be the image of A and

L^⊥ = {y ∈ R^m | y^T A = 0}

the orthogonal complement of L. The vector y decomposes uniquely as

y = y_1 + y_2,  y_1 ∈ L,  y_2 ∈ L^⊥,

and since L is the image of A there exists at least one x_0 ∈ R^n with Ax_0 = y_1. Since A^T y_2 = 0 it follows furthermore

A^T y = A^T A x_0 + A^T y_2 = A^T A x_0,


that is, x_0 is a solution of the normal equations.

Conversely let x_0 be a solution of the normal equations (3.10) and x ∈ R^n arbitrary. Then

⟨Ax − Ax_0, Ax_0 − y⟩ = (x^T − x_0^T) A^T (Ax_0 − y) = (x^T − x_0^T)(A^T A x_0 − A^T y) = 0

and from that it follows with the Pythagorean theorem

‖Ax − y‖² = ‖Ax − Ax_0 + Ax_0 − y‖² = ‖Ax − Ax_0‖² + ‖Ax_0 − y‖² ≥ ‖Ax_0 − y‖².

If the columns of the matrix A are linearly independent, then the matrix A^T A is positive definite and the normal equations

A^T A x = A^T y

have a unique solution

x = (A^T A)^{−1} A^T y,

which can be computed with the Cholesky method.
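A short Python sketch (our own code, with NumPy) that fits the line of Example 1 by forming the normal equations and solving them with a Cholesky factorization:

    import numpy as np

    def least_squares_line(xs, ys):
        """Fit y = alpha*x + beta by solving A^T A z = A^T y via Cholesky."""
        A = np.column_stack([xs, np.ones(len(xs))])
        G = A.T @ A                       # positive definite if columns independent
        rhs = A.T @ np.asarray(ys, float)
        L = np.linalg.cholesky(G)         # G = L L^T
        w = np.linalg.solve(L, rhs)       # forward substitution: L w = A^T y
        z = np.linalg.solve(L.T, w)       # back substitution:    L^T z = w
        return z                          # (alpha, beta)

    xs = [0.1, 1.0, 1.9, 2.2, 3.9]
    ys = [6.3, 5.2, 4.6, 4.0, 3.0]
    alpha, beta = least_squares_line(xs, ys)
    print(round(alpha, 4), round(beta, 4))   # approximately -0.86 and 6.19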

Example 2. Continuation of Example 1.

(α, β)^T is a solution of the least squares problem if and only if

A^T A (α, β)^T = A^T (y_1, ..., y_m)^T,

that is, iff

( x_1 ... x_m ) ( x_1  1 ) ( α )   ( x_1 ... x_m ) ( y_1 )
( 1   ... 1   ) (  ⋮   ⋮ ) ( β ) = ( 1   ... 1   ) (  ⋮  )
                ( x_m  1 )                         ( y_m ).

3.4 Iterative solutions

We consider again the system of equations

Ax = b   (3.12)

and assume a_ii ≠ 0, i = 1, ..., n. Then the coefficient matrix A may be split as

A = L + D + R   (3.13)

where L (R) is a strictly lower (upper) triangular matrix and D is a diagonal matrix:

    ( 0     0    ...        0 )      ( a_11  0     ...  0    )      ( 0  a_12  ...  a_1n     )
L = ( a_21  0               0 ), D = ( 0     a_22  ...  0    ), R = ( 0  0     ...  a_2n     )
    (  ⋮     ⋱              ⋮ )      (  ⋮           ⋱   ⋮    )      ( ⋮             a_{n−1,n} )
    ( a_n1  ... a_{n,n−1}   0 )      ( 0     ...   ...  a_nn )      ( 0  0     ...  0        ).

The original system (3.12) may be expressed as

(L + D + R) x = b.   (3.14)

Now (3.14) may be rewritten in a number of ways to provide different iterative methods.


3.4.1 Jacobi Iteration (total-step-method)

We rewrite (3.14) as

Dx = −(L + R)x + b   (3.15)

and define the iteration scheme

x_{n+1} = −D^{−1}(L + R) x_n + D^{−1} b.   (3.16)

In components equation (3.16) reads (with x_k = (x_1^{(k)}, ..., x_n^{(k)}))

x_i^{(n+1)} = (1/a_ii) ( b_i − Σ_{j=1, j≠i}^{n} a_ij x_j^{(n)} ),   (3.17)

that is, we use the old approximations x_n to compute the new approximations x_{n+1}.

Example 1. Solve the linear system

( 5  1  1 ) ( x_1 )   ( 1 )
( 1  6  0 ) ( x_2 ) = ( 2 )
( 1  0  7 ) ( x_3 )   ( 0 ).

From (3.17) we obtain the following iteration scheme:

x_1^{(n+1)} = (1/5)(1 − x_2^{(n)} − x_3^{(n)})
x_2^{(n+1)} = (1/6)(2 − x_1^{(n)})
x_3^{(n+1)} = (1/7)(−x_1^{(n)})

The first 3 iterates (start vector x_0 = 0) are given by

−a_ij/a_ii        | b_i/a_ii | x_0 | x_1 | x_2   | x_3
 0    −1/5  −1/5  | 1/5      | 0   | 1/5 | 2/15  | 51/350
−1/6   0     0    | 2/6      | 0   | 2/6 | 3/10  | 14/45
−1/7   0     0    | 0        | 0   | 0   | −1/35 | −2/105
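The scheme (3.17) in Python (a minimal sketch with NumPy; the names are ours):

    import numpy as np

    def jacobi(A, b, x0, steps):
        """Jacobi iteration x_{n+1} = D^{-1}(b - (L+R) x_n)."""
        A, b = np.asarray(A, float), np.asarray(b, float)
        d = np.diag(A)
        x = np.asarray(x0, float)
        for _ in range(steps):
            x = (b - (A @ x - d * x)) / d   # A@x - d*x equals (L+R)x
        return x

    A = [[5, 1, 1], [1, 6, 0], [1, 0, 7]]
    b = [1, 2, 0]
    print(jacobi(A, b, [0, 0, 0], 3))   # approximately [51/350, 14/45, -2/105]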

3.4.2 Gauss-Seidel-Iteration (single-step-method)

Now we rewrite (3.14) as

(L + D)x = −Rx + b   (3.18)

and define the iteration scheme

x_{n+1} = −(L + D)^{−1} R x_n + (L + D)^{−1} b.   (3.19)

To obtain the iterates in components we start with (3.18)

Σ_{j=1}^{i} a_ij x_j^{(n+1)} = −Σ_{j=i+1}^{n} a_ij x_j^{(n)} + b_i

and solve for x_i^{(n+1)}:

x_i^{(n+1)} = (1/a_ii) ( −Σ_{j=1}^{i−1} a_ij x_j^{(n+1)} − Σ_{j=i+1}^{n} a_ij x_j^{(n)} + b_i ).   (3.20)

The idea of Gauss-Seidel iteration is therefore to make use of the new iterates x_i^{(n+1)} as soon as possible.

Example 2. Last example. We have

−a_ij/a_ii        | b_i/a_ii | x_0 | x_1   | x_2
 0    −1/5  −1/5  | 1/5      | 0   | 1/5   | 51/350
−1/6   0     0    | 2/6      | 0   | 3/10  | 649/2100
−1/7   0     0    | 0        | 0   | −1/35 | −51/2450

and

x_3 = (0.1424, 0.3096, −0.0203),  x_4 = (0.1421, 0.3096, −0.0203).

Example 3. Let

     (  4  1  −1 )           ( 12.0 )
A := ( −1  3   1 )  and b := (  6.0 )
     (  2  2   5 )           (  5.0 ).

We list the iterations in the following table:

k  | Jacobi                                    | Gauss-Seidel
1  | [3.000000000, 2.000000000, 1.000000000]   | [3.000000000, 3.000000000, −1.400000000]
2  | [2.750000000, 2.666666667, −1.000000000]  | [1.900000000, 3.100000000, −1.000000000]
3  | [2.083333333, 3.250000000, −1.166666667]  | [1.975000000, 2.991666667, −.9866666668]
4  | [1.895833333, 3.083333333, −1.133333333]  | [2.005416667, 2.997361111, −1.001111111]
5  | [1.945833333, 3.009722222, −.991666666]   | [2.000381944, 3.000497685, −1.000351852]
6  | [1.999652778, 2.979166666, −.982222222]   | [1.999787616, 3.000046489, −.9999336421]
7  | [2.009652778, 2.993958333, −.991527777]   | [2.000004967, 2.999979536, −.9999938016]
8  | [2.003628472, 3.000393518, −1.001444444]  | [2.000006666, 3.000000156, −1.000002729]
9  | [1.999540509, 3.001690972, −1.001608796]  | [1.999999279, 3.000000669, −.9999999793]
10 | [1.999175058, 3.000383102, −1.000492593]  | [1.999999838, 2.999999939, −.9999999109]
11 | [1.999781076, 2.999889217, −.999823264]   | [2.000000038, 2.999999983, −1.000000008]
12 | [2.000071880, 2.999868113, −.999868117]   | [2.000000002, 3.000000003, −1.000000002]
13 | [2.000065942, 2.999979999, −.999975997]   | [1.999999999, 3.000000000, −.9999999997]
14 | [2.000011001, 3.000013980, −1.000018377]  | [2.000000000, 3.000000000, −1.000000000]
15 | [1.999991911, 3.000009793, −1.000009992]  | [2.000000000, 3.000000000, −1.000000000]

The exact solution is (2, 3, −1)T .
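For comparison, a Gauss-Seidel sketch in the same style (again our own code); note how each new component is used immediately, per (3.20):

    import numpy as np

    def gauss_seidel(A, b, x0, steps):
        """Gauss-Seidel iteration, componentwise form (3.20)."""
        A, b = np.asarray(A, float), np.asarray(b, float)
        x = np.asarray(x0, float).copy()
        n = len(b)
        for _ in range(steps):
            for i in range(n):
                s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]  # uses updated x[:i]
                x[i] = (b[i] - s) / A[i, i]
        return x

    A = [[4, 1, -1], [-1, 3, 1], [2, 2, 5]]
    b = [12.0, 6.0, 5.0]
    print(gauss_seidel(A, b, [0, 0, 0], 14))   # approximately [2, 3, -1]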

3.4.3 Convergence Analysis

We call the method convergent if for all start vectors x_0 the sequence x_n converges towards the solution of Ax = b.

In both schemes we have split the matrix A into two matrices

A = N − P

where

a) N = D, P = −(L + R) (Jacobi iteration);

b) N = L + D, P = −R (Gauss-Seidel iteration).

Hence the equation Ax = b is equivalent to

Nx = Px + b,

respectively

x = N^{−1}(Px + b).

x is therefore a solution of the system if and only if x is a fixpoint of the map φ : R^n → R^n,

φ(x) = N^{−1}(Px + b) = Tx + c,

where T := N^{−1}P, c := N^{−1}b. Since

‖φ(x) − φ(y)‖ = ‖Tx − Ty‖ ≤ ‖T‖ ‖x − y‖,

it follows from the contraction mapping principle (Section 2.2.2) that the iteration

x_{n+1} = T x_n + c   (3.21)

converges if ‖T‖ < 1, where

‖T‖ := sup_{x ∈ R^n, x ≠ 0} ‖Tx‖ / ‖x‖

is the so-called operator norm. To obtain a convergence criterion we take the uniform norm in R^n, that is,

‖x‖_∞ := max_{1≤i≤n} |x_i|.

It is easy to show that the operator norm is then given by

‖A‖_∞ = max_{1≤i≤n} Σ_{j=1}^{n} |a_ij|.

Therefore the Jacobi iteration converges if

‖T‖_∞ = ‖D^{−1}(L + R)‖_∞ = max_{1≤i≤n} (1/|a_ii|) Σ_{j=1, j≠i}^{n} |a_ij| < 1.

With more mathematics one can prove the same result for Gauss-Seidel iteration.

Theorem 3.4.1. a) Strong row sum criterion: The Jacobi method is convergent for all matrices A with

|a_ii| > Σ_{j=1, j≠i}^{n} |a_ij|.   (3.22)

b) Strong column sum criterion: The Jacobi method is convergent for all matrices A with

|a_ii| > Σ_{j=1, j≠i}^{n} |a_ji|.   (3.23)

Proof. See Stoer/Bulirsch[1980, Th. 8.2.6]


Remarks. The conditions of the theorem are also sufficient for the convergence of the Gauss-Seidel method. For a proof see Stoer/Bulirsch [1980, Th. 8.2.12]. Matrices which satisfy the condition (3.22) are also called diagonally dominant.

Example 4. The matrix

( 5  1  1 )
( 1  6  0 )
( 1  0  7 )

is diagonally dominant, since

5 > 1 + 1,  6 > 1 + 0,  7 > 1 + 0.

The strong row sum criterion (3.22) is sufficient for convergence but not necessary. To give a sufficient and necessary criterion we first define the spectral radius.

Definition 3.4.2.

ρ(A) := max{ |λ| : λ eigenvalue of A }

is called the spectral radius of A.

Then one can prove:

Theorem 3.4.3. The Jacobi and Gauss-Seidel methods are convergent if and only if ρ(T) < 1.

Proof. See Stoer/Bulirsch[1980, Th. 8.2.1]


Chapter 4

Eigenvalues

4.1 Introduction

We recall the following definition from Linear Algebra:

Definition 4.1.1. Let A ∈ Cn×n be given. If there exist  λ ∈ C and  v ∈ Cn \ {0} such that:

Av = λv ⇔ (A − λI )v = 0

then λ is called an eigenvalue of A with corresponding eigenvector v.

The second equation emphasizes that in this case the matrix A − λI  is singular. Therefore:

Lemma 4.1.2. The eigenvalues of a matrix A ∈ C^{n×n} are the zeros of the characteristic polynomial of degree n:

p(t) := det(A − tI).

Example 1. Determine the eigenvalues of the matrix

( 1  2  1 )
( 0  1  3 )
( 2  1  1 )

The characteristic equation is

det ( 1−λ  2    1   )
    ( 0    1−λ  3   )  =  −λ³ + 3λ² + 2λ + 8 = 0.
    ( 2    1    1−λ )

The roots of the characteristic equation are

λ_1 = 4,  λ_{2,3} = −1/2 ± (√7/2) i.

This illustrates the fact that the eigenvalues of a real matrix are not necessarily real numbers.

We need some further ideas and a simple fact from Linear Algebra:

Definition 4.1.3. Two matrices  A, B ∈ Cn×n are called  similar if there exists a regular matrix T  ∈ Cn×n

such that  A = T −1BT .

Lemma 4.1.4. Similar matrices have the same eigenvalues.

Computing the roots of the characteristic polynomial is impractical as soon as the matrix A is large. One reason for this is that the roots of a polynomial may be very sensitive functions of the coefficients of the polynomial.


4.2 Localizing Eigenvalues

The most famous theorem describing where the eigenvalues are situated in the complex plane is Gershgorin's theorem.

Theorem 4.2.1 (Gershgorin). The set of eigenvalues of a matrix A ∈ C^{n×n} is contained in the union of the following discs D_i in the complex plane:

D_i = { z ∈ C : |z − a_ii| ≤ Σ_{j=1, j≠i}^{n} |a_ij| },  1 ≤ i ≤ n.

Proof: Let λ be an eigenvalue and v a corresponding eigenvector with ‖v‖_∞ = 1. Let i be an index with |v_i| = 1. We conclude:

(Av)_i = λ v_i  ⇒  λ v_i = Σ_{j=1}^{n} a_ij v_j  ⇒  (λ − a_ii) v_i = Σ_{j=1, j≠i}^{n} a_ij v_j.

Taking absolute values we obtain

|λ − a_ii| ≤ Σ_{j=1, j≠i}^{n} |a_ij| |v_j| ≤ Σ_{j=1, j≠i}^{n} |a_ij|.  EOP

Example 1. Let

A = ( −1+i  0  1/4 )
    ( 1/4   1  1/4 )
    ( 1     1  3   )

Then the Gershgorin discs are

|λ − (−1+i)| ≤ 1/4,  |λ − 1| ≤ 1/2,  |λ − 3| ≤ 2.

4.3 The Power method

This procedure is designed to compute the dominant eigenvalue and an eigenvector corresponding to it. We assume that A has the following properties:

1. there is a single eigenvalue of maximum modulus;

2. there is a basis of C^n of eigenvectors.

According to the first assumption, the eigenvalues λ_1, λ_2, ..., λ_n can be labeled so that

|λ_1| > |λ_2| ≥ |λ_3| ≥ ⋯ ≥ |λ_n|,

and according to the second condition there exists a basis {u_1, ..., u_n} of eigenvectors: Au_i = λ_i u_i.

Now let

x^{(0)} = c_1 u_1 + ⋯ + c_n u_n,  c_1 ≠ 0.   (4.1)

We define a sequence (x^{(k)}) recursively by

x^{(k)} = A x^{(k−1)} = ⋯ = A^k x^{(0)},  k ∈ N.


From equation (4.1) we obtain

x^{(k)} = c_1 λ_1^k u_1 + ⋯ + c_n λ_n^k u_n
        = λ_1^k ( c_1 u_1 + c_2 (λ_2/λ_1)^k u_2 + ⋯ + c_n (λ_n/λ_1)^k u_n )
        = λ_1^k ( c_1 u_1 + e^{(k)} )

where e^{(k)} → 0 as k → ∞. Now let φ : C^n → C be any linear functional, which means:

∀v, w ∈ C^n : ∀α, β ∈ C : φ(αv + βw) = αφ(v) + βφ(w).

Then

φ(x^{(k)}) = λ_1^k ( c_1 φ(u_1) + φ(e^{(k)}) ),

so that

φ(x^{(k+1)}) / φ(x^{(k)}) = λ_1 ( c_1 φ(u_1) + φ(e^{(k+1)}) ) / ( c_1 φ(u_1) + φ(e^{(k)}) ) → λ_1 as k → ∞.

A typical example for a possible functional φ is

φ(x) = x_j,  x = (x_1, x_2, ..., x_n),

with some fixed j.

In practice one normalizes the vectors x^{(k)}, since otherwise they may converge to zero or become unbounded.

Scaled power method

Choose some linear functional φ and some norm ‖·‖. Choose x ∈ C^n. Iterate:

y := Ax,  λ := φ(y)/φ(x),  x := (1/‖y‖) y.
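A direct Python transcription of this loop (a sketch under our own naming; φ picks a fixed component, and we normalize with the maximum norm):

    import numpy as np

    def scaled_power_method(A, x, steps, j=0):
        """Scaled power method: phi(x) = x[j], normalization by the max norm."""
        A = np.asarray(A, float)
        x = np.asarray(x, float)
        lam = None
        for _ in range(steps):
            y = A @ x
            lam = y[j] / x[j]          # phi(y)/phi(x) approximates lambda_1
            x = y / np.abs(y).max()    # rescale to avoid over-/underflow
        return lam, x

    A = [[41, 26, 10], [26, 38, 16], [10, 16, 20]]
    lam, v = scaled_power_method(A, [1.0, 1.0, 1.0], 20)
    print(round(lam, 6), np.round(v, 6))   # approx. 72 and a multiple of (2, 2, 1)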

Example 1. Let

A = ( 41  26  10 )
    ( 26  38  16 ),   x_0 = (1, 1, 1)^T.
    ( 10  16  20 )

The linear functional φ was taken to be φ(x) = x_j with j = 1 and j = 3:

k | y                       | x (scaled)               | λ (j = 1) | λ (j = 3)
1 | 77.     80.     46.     | 0.9625   1.0  0.57500    | 77.0      | 46.0
2 | 71.213  72.225  37.125  | 0.98598  1.0  0.51402    | 73.987    | 64.565
3 | 71.565  71.860  36.140  | 0.99590  1.0  0.50293    | 72.583    | 70.309
4 | 71.861  71.940  36.018  | 0.99890  1.0  0.50066    | 72.157    | 71.616
5 | 71.962  71.982  36.002  | 0.99972  1.0  0.50016    | 72.041    | 71.910

From this table one may guess that the eigenvalue is 72 and also that v = (2, 2, 1)^T may be an eigenvector. Indeed:

( 41  26  10 ) ( 2 )        ( 2 )
( 26  38  16 ) ( 2 )  = 72  ( 2 ).
( 10  16  20 ) ( 1 )        ( 1 )

The remaining eigenvalues are 18 and 9.


The inverse power method: The following lemma should be known.

Lemma 4.3.1. If λ is an eigenvalue of A and A is nonsingular, then λ^{−1} is an eigenvalue of A^{−1}.

Proof: Let Ax = λx, x ≠ 0. Then

x = A^{−1}(λx) = λ A^{−1} x

and therefore

A^{−1} x = λ^{−1} x,

that is, λ^{−1} is an eigenvalue of A^{−1}. EOP

Now suppose that there is a smallest eigenvalue:

|λ_1| ≥ |λ_2| ≥ ⋯ ≥ |λ_{n−1}| > |λ_n| > 0.

Then A is nonsingular and λ_n^{−1} is the largest eigenvalue of A^{−1}, which can be computed by the power method x^{(k+1)} = A^{−1} x^{(k)}.

It is definitely not a good idea to compute A^{−1}. Rather, x^{(k+1)} is obtained by solving the linear system

A x^{(k+1)} = x^{(k)}.

This can be done by the Gauss elimination process or the QR-method.

Remarks. By considering the shifted matrix A − µI we can compute with the power method the eigenvalues farthest from and closest to µ.

4.4 The QR-method and Hessenberg matrices

In this section we report on a method to compute all eigenvalues of a given n × n matrix A. It involves similarity transformations with the help of orthogonal decompositions:

Set A_0 := A and compute recursively

A_k := R_k Q_k, where A_{k−1} = Q_k R_k with orthogonal Q_k and upper triangular R_k.

For the decomposition one may use Jacobi rotations or Householder reflections. Note that the computed matrices are all similar, hence in particular similar to A:

A_k = R_k Q_k = Q_k^{−1} A_{k−1} Q_k.

Theorem 4.4.1. Suppose  A has  n real eigenvalues with distinct absolute values. Then the matrices  Ak

converge as  k → ∞ to an upper triangular matrix with the eigenvalues of A on the diagonal.

Remarks. In practice one computes the matrices A_k until they are sufficiently upper triangular. This means that the entries below the diagonal are below a preset tolerance (like 0.001).

There are matrices for which the QR-method does not work. One possibility is the case of repeated eigenvalues; another possibility is a matrix with distinct eigenvalues that have the same absolute value.

One can check that each iteration step of the QR-method needs O(n³) operations. With a trick one can reduce this number to O(n²): one transforms the matrix A in a preparatory step into a matrix with special structure:


Definition 4.4.2 (Hessenberg matrix). A Hessenberg matrix is a matrix that has nonzero entries only on or above the subdiagonal, the diagonal beneath the main diagonal:

    ( b_11  ...               b_1n )
    ( b_21  ...               b_2n )
H = ( 0     ⋱                  ⋮   )
    ( ⋮      ⋱   ⋱             ⋮   )
    ( 0     ...  0  b_{n,n−1} b_nn )

A Hessenberg matrix can be reduced to an upper triangular matrix R very efficiently. One can remove the subdiagonal elements by rotations. Due to the special structure one needs only n − 1 rotations

Q := Q_{n−1,n} Q_{n−2,n−1} ⋯ Q_{2,3} Q_{1,2}

(see Definition 3.2.5).

Theorem 4.4.3. A Hessenberg matrix H can be decomposed — H = QR — by n − 1 Jacobi rotations with O(n²) operations. The corresponding similar matrix

H′ = RQ = Q^{−1} H Q = Q^T H Q

is also a Hessenberg matrix.

Finally we have to discuss how a matrix A can be transformed into a similar Hessenberg matrix. In principle all methods for linear systems can be utilized. We just consider the transformation of Gauss type using permutation matrices and Frobenius matrices:

One proceeds inductively, column by column. So let H be a matrix with the right shape in the first k − 1 columns (k ≤ n − 2) — initially k = 1 and H = A —:

H = (h_ij) with h_ij = 0 for i > j + 1, j ≤ k − 1.

If column k is 0 below the subdiagonal it already has the right shape. Otherwise we look for a pivot element and swap rows to guarantee that h_{k+1,k} ≠ 0 and to avoid rounding errors. Let P_{k+1} = P(k+1, j) be the corresponding permutation matrix. Then define the appropriate Frobenius matrix L_{k+1} to delete the entries below the subdiagonal. By similarity transformation we obtain:

H′ = L_{k+1} P_{k+1} H P_{k+1} L_{k+1}^{−1}.

Observe that the multiplications on the left of H produce the desired zeros in column k. The operations on the right — which are necessary for similarity — affect only columns with index j > k.
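To make the overall procedure concrete, here is a bare-bones Python sketch of the QR iteration itself (our own code, using NumPy's qr for the decomposition and stopping once the subdiagonal entries fall under a tolerance, as in the remarks above). This is a didactic sketch, not the shifted, Hessenberg-based production algorithm:

    import numpy as np

    def qr_eigenvalues(A, tol=1e-10, max_iter=500):
        """Plain QR iteration: A_k = R_k Q_k; returns approximate eigenvalues."""
        Ak = np.asarray(A, float).copy()
        for _ in range(max_iter):
            Q, R = np.linalg.qr(Ak)
            Ak = R @ Q                                  # similar to the previous A_k
            if np.max(np.abs(np.tril(Ak, -1))) < tol:
                break                                   # sufficiently upper triangular
        return np.diag(Ak)

    A = [[41, 26, 10], [26, 38, 16], [10, 16, 20]]
    print(np.round(qr_eigenvalues(A), 6))               # approx. [72, 18, 9]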


Chapter 5

Ordinary Differential Equations

5.1 Introduction

There are many situations in physical and life sciences which can be modeled through an ordinary differential equation. This is an equation involving derivatives, and the order of the equation is the highest order of derivative that appears in it. "Ordinary" is opposed to "partial", which is used for equations involving partial derivatives of several independent variables. Here we discuss only ordinary differential equations.

The simplest equation is a first order ordinary differential equation

x′(t) = f(t, x(t)),

where the function f of two variables is given. If such an equation has a solution then in most cases it has a set of solutions depending on a parameter. In practical problems one of these solutions is singled out by imposing an initial condition:

Definition 5.1.1 (Initial value problem). An initial value problem is defined by a continuous function f : I × D → R, where I = [a, b] is an interval and D ⊂ R, and by a pair (t_0, x_0) ∈ I × D. A solution is a differentiable function x : J → D where t_0 ∈ J ⊂ I and:

∀t ∈ J : x′(t) = f(t, x(t)) and x(t_0) = x_0.

So a solution may not be defined on the entire interval I. If we replace R by R^n in the definition we obtain an initial value system. This can be written more explicitly as:

x_1′(t) = f_1(t, x_1(t), ..., x_n(t)),  x_1(t_0) = x_{0,1}
   ⋮
x_n′(t) = f_n(t, x_1(t), ..., x_n(t)),  x_n(t_0) = x_{0,n}

An initial value problem of second order

x″(t) = f(t, x(t), x′(t)),  x(t_0) = x_0,  x′(t_0) = x_0′

can be transformed into a first order system by defining

x_1 := x,  x_2 := x′.

This leads to the system:

x_1′(t) = x_2(t),                x_1(t_0) = x_0
x_2′(t) = f(t, x_1(t), x_2(t)),  x_2(t_0) = x_0′

So, in principle, it suffices to examine first order systems. Before we turn to numerical methods we consider two simple types of initial value problems which may be solved explicitly:


5.2 Basic analytic methods

5.2.1 Separation of variables

The method of separation of variables applies to the case

x′(t) = f(t, x(t)) = g(t) h(x(t)).

So here the function f is a product of a function g depending on t only and a function h depending on x only.

Theorem 5.2.1 (Variables separable). The initial value problem

x′(t) = g(t) h(x(t)),  x(t_0) = x_0,

with continuous functions g, h and h(x_0) ≠ 0 has a unique solution, which can be obtained from the equation:

∫_{x_0}^{x(t)} 1/h(s) ds = ∫_{t_0}^{t} g(s) ds.

The proof is just straightforward computation:

x′(t) = g(t) h(x(t))  ⇔  x′(t)/h(x(t)) = g(t)  ⇔  ∫_{t_0}^{t} g(s) ds = ∫_{t_0}^{t} x′(τ)/h(x(τ)) dτ = ∫_{x_0}^{x(t)} 1/h(s) ds,

where in the last step one uses the substitution s = x(τ). Observe that due to the condition h(x_0) ≠ 0 this integral is well defined in a neighbourhood of x_0.

Remark: If h(x_0) = 0 then the constant function x(t) := x_0 solves the initial value problem.

Example: Solve the initial value problem:

x′(t) = −2t (x(t)² − x(t)),  x(1) = 0.5.

According to the theorem we consider:

∫_{0.5}^{x(t)} 1/(s² − s) ds = ∫_{1}^{t} (−2s) ds = 1 − t².

The integral on the left hand side yields

∫_{0.5}^{x(t)} ( 1/(s − 1) − 1/s ) ds = ln( (x(t) − 1)/(−0.5) ) − ln( x(t)/0.5 ) = ln( (1 − x(t))/x(t) ).

Finally we solve for x(t):

ln( (1 − x(t))/x(t) ) = 1 − t²  ⇔  (1 − x(t))/x(t) = exp(1 − t²)  ⇔  x(t) = 1/(1 + exp(1 − t²)).

5.2.2 Linear differential equations

If p(t) and q(t), a < t < b, are continuous functions, then

x′(t) = p(t) x(t) + q(t)


is called a linear differential equation of first order. It is called homogeneous if q(t) = 0 for all t, otherwise inhomogeneous.

The associated homogeneous equation

x′(t) = p(t) x(t)

is a differential equation with separated variables and therefore can be solved. The solution with u_h(t_0) = x_0 is

u_h(t) = x_0 exp( ∫_{t_0}^{t} p(s) ds ).   (5.1)

We can use this solution to solve the inhomogeneous case.

Theorem 5.2.2 (Linear differential equation of first order). The solution to

x′(t) = p(t) x(t) + q(t),  x(t_0) = x_0

is

u(t) = u_h(t) + u_p(t),

where u_h is the solution of the homogeneous equation as given by equation (5.1) and u_p is the solution of the inhomogeneous equation with u_p(t_0) = 0.

All we need to solve a linear equation is therefore to find u_p. This can be done by a method, an "Ansatz", which is known as variation of parameters. Take the solution of the homogeneous equation and replace the initial condition by a function:

u(t) = v(t) P(t)  with  P(t) := exp( ∫_{t_0}^{t} p(s) ds ).

Substitute this in the original equation to determine v:

q(t) = u′(t) − p(t) u(t) = v′(t) P(t) + v(t) P′(t) − p(t) v(t) P(t) = v′(t) P(t).

It follows:

v′(t) = q(t) exp( −∫_{t_0}^{t} p(s) ds ).

By integration one obtains v with v(t_0) = 0.

Example: Solve the initial value problem

x′(t) = (2/t) x(t) + t/2,  x(1) = 1.

So we compute:

p(t) = 2/t  ⇒  P(t) = exp( ∫_{1}^{t} (2/s) ds ) = t².

To solve the inhomogeneous equation we try u(t) = v(t) t²:

t/2 = u′(t) − (2/t) u(t) = v′(t) t² + 2t v(t) − (2/t) t² v(t) = v′(t) t².

From this equation we obtain v with v(1) = 0:

v′(t) = 1/(2t)  ⇒  v(t) = ∫_{1}^{t} 1/(2s) ds = (1/2) ln t.

Combining these results the solution is

x(t) = t² + (1/2) t² ln t.


5.3 One-step-methods

In general we cannot expect to obtain the solution in terms of elementary functions. Instead one usually constructs a table of function values of the form

t_0  t_1  ...  t_m
x_0  x_1  ...  x_m

Here x_i is the computed approximation to the exact value u(t_i).

Given a generic point (t, x) in the rectangle [a, b] × [c, d], we define a single step of the one-step method by

x_next := x + h Φ(t, x, h),  h > 0.   (5.2)

Along with (5.2) we consider the exact solution u(t) of the initial value problem

x′ = f(t, x),  u(t) = x.

u(t) is called the reference solution. x_next is intended to approximate u(t + h).

Definition 5.3.1. The truncation error of the method Φ at the point (t, x) is defined by

T(t, x, h) = (1/h)(x_next − u(t + h)).   (5.3)

Using (5.2) we can rewrite (5.3) as

T(t, x, h) = Φ(t, x, h) − (1/h)(u(t + h) − u(t)),   (5.4)

that is, T is the difference between the approximate and exact increment per step of size h.

Definition 5.3.2. The method Φ is called consistent if

T(t, x, h) → 0 as h → 0,

uniformly for (t, x) ∈ [a, b] × [c, d].

From the definition it follows immediately that the method is consistent iff

Φ(t, x, 0) = f(t, x)  ∀(t, x) ∈ [a, b] × [c, d].

Definition 5.3.3. The method Φ is said to have order p if

|T(t, x, h)| ≤ const · h^p   (5.5)

uniformly on [a, b] × [c, d], with a constant not depending on t, x and h.

We express (5.5) as usual as

T(t, x, h) = O(h^p),  h → 0.

We now proceed to different methods.


5.3.1 Euler’s Method

Suppose we are at some point (t, x) representing the initial conditions. At this point the ODE specifies some slope f(t, x). Then we follow the tangent line in the direction of the slope for h units. We therefore define:

Φ(t, x, h) := f(t, x),

that is,

x_next = x + h f(t, x).

Remarks. This method is very old. It is generally attributed to Euler (1768), but it was used by Newton in the very first book on differential equations.

The method is evidently consistent. For the truncation error we have:

T(t, x, h) = Φ(t, x, h) − (1/h)(u(t + h) − u(t)) = f(t, x) − (1/h)(u(t + h) − u(t)).

Since

u′(t) = f(t, u(t)) = f(t, x),

it follows with Taylor's formula

T(t, x, h) = u′(t) − (1/h)(u(t + h) − u(t))
           = u′(t) − (1/h)( u(t) + h u′(t) + (h²/2) u″(ξ) − u(t) )   (5.6)
           = −(h/2) u″(ξ),  t < ξ < t + h.   (5.7)

Differentiating u′(t) = f(t, u(t)) totally with respect to t and setting t = ξ gives

u″(ξ) = f_t(ξ, u(ξ)) + f_x(ξ, u(ξ)) f(ξ, u(ξ)).

Substituting this in (5.7) gives the truncation error

T(t, x, h) = −(h/2)(f_t + f_x f)(ξ, u(ξ)).

Assuming that f, f_t and f_x are uniformly bounded in [a, b] × [c, d] we obtain the estimate

|T(t, x, h)| ≤ ch.

Therefore Euler's method is of order p = 1. In practice the method proceeds as follows:

Start with the initial values (t_0, x_0) ∈ [a, b] × [c, d] and choose a stepsize h.

For n = 1, 2, 3, ... do

t_n = t_{n−1} + h = t_0 + nh
x_n = x_{n−1} + h f(t_{n−1}, x_{n−1})

as long as (t_n, x_n) ∈ [a, b] × [c, d].

The approximate solution through (t_0, x_0) is the piecewise linear function joining all the points (t_n, x_n).
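In Python this loop might look as follows (a minimal sketch, our own naming); it reproduces the tables of the next two examples:

    import numpy as np

    def euler(f, t0, x0, h, n_steps):
        """Explicit Euler: x_n = x_{n-1} + h * f(t_{n-1}, x_{n-1})."""
        ts, xs = [t0], [x0]
        for _ in range(n_steps):
            xs.append(xs[-1] + h * f(ts[-1], xs[-1]))
            ts.append(ts[-1] + h)
        return ts, xs

    f = lambda t, x: np.sin(t * x)
    ts, xs = euler(f, 0.0, 3.0, 0.1, 4)
    for t, x in zip(ts, xs):
        print(f"{t:4.2f}  {x:.8f}")    # matches the table for h = 0.1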


Example 1. Let x′ = sin(tx), (t_0, x_0) = (0, 3) and h = 0.1. Then we obtain the following table:

t_n   x_n
0.00  3.00000000
0.10  3.00000000
0.20  3.02955202
0.30  3.08650308
0.40  3.16642235

With a smaller stepsize the numerical results are better.

Example 2. Same example with stepsize h = 0.05.

t_n   x_n
0.00  3.00000000
0.05  3.00000000
0.10  3.00747190
0.15  3.02228360
0.20  3.04418224
0.25  3.07277791
0.30  3.10751981
0.35  3.14766814
0.40  3.19226663

5.3.2 Midpoint Euler

In the Euler method one follows the initial slope f(t, x) over the whole interval of length h. A good idea is to look ahead. We take

Φ(t, x, h) := f( t + h/2, x + (h/2) f(t, x) ).

In practice one computes first

k_1(t, x) = f(t, x)
k_2(t, x) = f( t + h/2, x + (h/2) k_1(t, x) )

and then

x_next = x + h k_2.

As is shown in the next section the order of midpoint Euler is p = 2.

5.3.3 Two-stage methods

Midpoint Euler is a special case of a so-called two-stage method. More generally we take

Φ(t, x, h) = α_1 k_1 + α_2 k_2

where

k_1(t, x) = f(t, x)
k_2(t, x) = f( t + µh, x + µh k_1(t, x) ).

For midpoint Euler we have α_1 = 0, α_2 = 1, µ = 1/2.


Now we have three parameters α_1, α_2 and µ at our disposal, and we can try to choose them to maximize p. Since

T(t, x, h) = Φ(t, x, h) − (1/h)(u(t + h) − u(t)),

we expand Φ and (1/h)(u(t + h) − u(t)) in powers of h. Now

(1/h)(u(t + h) − u(t)) = (1/h)( h u′(t) + (h²/2) u″(t) + (h³/6) u‴(t) + O(h⁴) )
                       = u′(t) + (h/2) u″(t) + (h²/6) u‴(t) + O(h³)

with

u′(t) = f(t, u(t))
u″(t) = (f_t + f_x f)(t, u(t))
u‴(t) = (f_tt + f_xt f + f_tx f + f_xx f² + f_x f_t + f_x f_x f)(t, u(t))
      = (f_tt + 2 f_tx f + f_xx f² + f_x (f_t + f_x f))(t, u(t)).

We now compute the derivatives of Φ with respect to h. The following holds:

Φ′(t, x, h) = µ α_2 (f_t + f_x f)
Φ″(t, x, h) = µ² α_2 (f_tt + 2 f_tx f + f_xx f²)

(the arguments of f taken at (t + µh, x + µh k_1)). Therefore Taylor expanding Φ(t, x, h) gives

Φ(t, x, h) = Φ(t, x, 0) + h Φ′(t, x, 0) + (h²/2) Φ″(t, x, 0) + O(h³)
           = (α_1 + α_2) f + α_2 µ (f_t + f_x f) h + α_2 µ² (f_tt + 2 f_tx f + f_xx f²) (h²/2) + O(h³).

Therefore the truncation error is

T(t, x, h) = Φ(t, x, h) − (1/h)(u(t + h) − u(t))
           = (α_1 + α_2 − 1) f + ( α_2 µ − 1/2 )(f_t + f_x f) h
             + ( ((1/2) α_2 µ² − 1/6)(f_tt + 2 f_tx f + f_xx f²) − (1/6) f_x (f_t + f_x f) ) h² + O(h³).

We now see that we cannot make the coefficient of h² equal to zero. Thus the maximum order of a two-stage method is p = 2, and this is achieved for

α_1 + α_2 − 1 = 0
α_2 µ − 1/2 = 0.

Midpoint Euler (α_1 = 0, α_2 = 1, µ = 1/2) satisfies these equations and is therefore of order p = 2.


5.3.4 Runge-Kutta-methods

We now extend the two-stage method straightforwardly to an r-stage method:

Φ(t, x, h) = Σ_{s=1}^{r} α_s k_s

k_1(t, x) = f(t, x)
k_s(t, x) = f( t + µ_s h, x + h Σ_{j=1}^{s−1} λ_sj k_j ),  s = 2, 3, ..., r.

The classical Runge-Kutta method is:

Φ(x, y, h) = (1/6)(k_1 + 2k_2 + 2k_3 + k_4),

k_1(x, y)    = f(x, y)
k_2(x, y, h) = f(x + h/2, y + (h/2) k_1)
k_3(x, y, h) = f(x + h/2, y + (h/2) k_2)
k_4(x, y, h) = f(x + h,   y + h k_3).

This method is of order p = 4.
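A compact Python sketch of the classical Runge-Kutta step (our own naming), applied to the example x′ = sin(tx) from above:

    import numpy as np

    def rk4(f, t0, x0, h, n_steps):
        """Classical fourth-order Runge-Kutta method."""
        ts, xs = [t0], [x0]
        for _ in range(n_steps):
            t, x = ts[-1], xs[-1]
            k1 = f(t, x)
            k2 = f(t + h/2, x + h/2 * k1)
            k3 = f(t + h/2, x + h/2 * k2)
            k4 = f(t + h,   x + h * k3)
            xs.append(x + h/6 * (k1 + 2*k2 + 2*k3 + k4))
            ts.append(t + h)
        return ts, xs

    f = lambda t, x: np.sin(t * x)
    ts, xs = rk4(f, 0.0, 3.0, 0.1, 4)
    print(round(xs[-1], 8))   # fourth-order accurate approximation of x(0.4)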


Chapter 6

Polynomial Interpolation

In this chapter we consider the following problem: Given n + 1 distinct points x_i and values f_i = f(x_i) of some function at these points, find a polynomial p of degree ≤ n such that

p(x_i) = f_i,  i = 0, 1, 2, ..., n.

Such a polynomial is said to interpolate the data (x_i, f_i).

Example. The simplest case is linear interpolation. We have

p(x) = f_0 + ((f_1 − f_0)/(x_1 − x_0)) (x − x_0)   (6.1)

respectively

p(x) = f_0 (x − x_1)/(x_0 − x_1) + f_1 (x − x_0)/(x_1 − x_0).   (6.2)

(6.1) is the prototype of Newton's form of the interpolation polynomial and (6.2) that of Lagrange's form.

6.1 Lagrange form of the Interpolation Polynomial

First we prove the existence of the interpolation polynomial by writing it down. Let

l_i(x) = Π_{j=0, j≠i}^{n} (x − x_j)/(x_i − x_j),  i = 0, 1, ..., n.   (6.3)

l_i is a polynomial of degree n and

l_i(x_k) = { 1 if k = i
           { 0 if k ≠ i.   (6.4)


If we define

p(x) := Σ_{i=0}^{n} f_i l_i(x),   (6.5)

it follows from (6.4) immediately that p(x_k) = f_k, that is, p is an interpolating polynomial.

Example 1. Four values of a function are given in the following table:

x_i | 1  2  3   4
f_i | 6  5  2  −9

Then

p_3(x) = 6 l_0(x) + 5 l_1(x) + 2 l_2(x) − 9 l_3(x)
       = 6 (x−2)(x−3)(x−4)/((1−2)(1−3)(1−4)) + 5 (x−1)(x−3)(x−4)/((2−1)(2−3)(2−4))
         + 2 (x−1)(x−2)(x−4)/((3−1)(3−2)(3−4)) − 9 (x−1)(x−2)(x−3)/((4−1)(4−2)(4−3))
       = −x³ + 5x² − 9x + 11.
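A small Python sketch of formula (6.5) (our own code) reproduces this polynomial numerically:

    def lagrange_eval(xs, fs, x):
        """Evaluate the Lagrange interpolation polynomial sum of f_i * l_i(x)."""
        total = 0.0
        for i, (xi, fi) in enumerate(zip(xs, fs)):
            li = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    li *= (x - xj) / (xi - xj)   # factor of l_i(x), see (6.3)
            total += fi * li
        return total

    xs, fs = [1, 2, 3, 4], [6, 5, 2, -9]
    print([lagrange_eval(xs, fs, x) for x in [1, 2, 3, 4]])   # [6.0, 5.0, 2.0, -9.0]
    print(lagrange_eval(xs, fs, 2.5))   # 4.125 = -2.5**3 + 5*2.5**2 - 9*2.5 + 11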

Polynomial interpolation can lead to very poor approximation.

Example 2. Interpolate the function f(x) = e^{−x} sin(4x) at the points 0.1, 0.6 and 2.

Now we prove the uniqueness of the interpolating polynomial.

Theorem 6.1.1. Let p(x) be a polynomial of degree ≤ n which passes through the data points (x_i, f_i), 0 ≤ i ≤ n. Then p(x) is unique.

Proof. Let p*(x) be a second interpolating polynomial. Then d(x) = p(x) − p*(x) is a polynomial of degree ≤ n that satisfies

d(x_i) = 0,  0 ≤ i ≤ n,

in other words, d has n + 1 zeros x_i. There is only one polynomial of degree ≤ n with that many zeros, namely d(x) = 0 ∀x. Therefore p*(x) = p(x).

6.2 Interpolation Error

Theorem 6.2.1. Let f be a function in C^{n+1}[a, b], and let p be a polynomial of degree ≤ n that interpolates the function f at n + 1 distinct points x_0, x_1, ..., x_n in the interval [a, b]. To each x in [a, b] there corresponds a point ξ_x in (a, b) such that

f(x) − p(x) = (1/(n + 1)!) f^{(n+1)}(ξ_x) Π_{i=0}^{n} (x − x_i).

Example 1. If the function f(x) = sin x is approximated by a polynomial of degree 9 that interpolates f at 10 points in the interval [0, 1], how large is the error on this interval?

Obviously

|f^{(10)}(ξ_x)| ≤ 1


and

Π_{i=0}^{9} |x − x_i| ≤ 1.

So for all x ∈ [0, 1]

|sin x − p(x)| ≤ 1/10! < 2.8 · 10^{−7}.

Example 2. Linear interpolation (n = 1). Let [a, b] = [x_0, x_1] and h = x_1 − x_0. Then

f(x) − p(x) = (1/2!) f″(ξ_x)(x − x_0)(x − x_1),  x_0 < ξ_x < x_1,

and an easy computation gives

‖f − p‖_∞ ≤ (M_2/8) h²,  M_2 = ‖f″‖_∞.

Example 3. Quadratic interpolation (n = 2) on equally spaced points a = x_0, x_1 = x_0 + h, x_2 = x_0 + 2h = b. Now we have for x ∈ [x_0, x_2]

f(x) − p(x) = (1/3!) f‴(ξ_x)(x − x_0)(x − x_1)(x − x_2),  x_0 < ξ_x < x_2,

and

‖f − p‖_∞ ≤ (M_3/(9√3)) h³,  M_3 = ‖f‴‖_∞.

If interpolating polynomials p_n of higher and higher degree are constructed, we expect that these polynomials will converge to f uniformly, that is,

‖f − p_n‖_∞ → 0 as n → ∞.

For most continuous functions this is not true.

Example 4. [Runge, 1901] Let

f(x) = 1/(1 + x²),  −5 ≤ x ≤ 5,

and

x_k^{(n)} = −5 + k (10/n),  k = 0, 1, 2, ..., n,

equally spaced nodes. Then one can prove that

lim_{n→∞} |f(x) − p_n(x)| = { 0 if |x| < 3.633...
                            { ∞ if |x| > 3.633...

Therefore

‖f − p_n‖_∞ → ∞ as n → ∞.

One may wonder if the equidistribution of the nodes can be blamed for the non-convergence. One can prove the following result:

Theorem 6.2.2. If f is a continuous function on [a, b], then there is a system of nodes such that the interpolation polynomials p_n satisfy

‖f − p_n‖_∞ → 0 as n → ∞.


6.3 Divided Differences

Lagrange's formula (6.5) is not so attractive for computational work. We now use the Newton form of the interpolating polynomial.

We start with the basic polynomials

q_0(x) = 1
q_1(x) = (x − x_0)
q_2(x) = (x − x_0)(x − x_1)
   ⋮
q_n(x) = (x − x_0)(x − x_1) ⋯ (x − x_{n−1}).   (6.6)

This gives rise to a new form of the interpolation polynomial

p_n(x) = Σ_{i=0}^{n} a_i q_i(x).   (6.8)

The coefficients a_i can be determined by solving the linear system

f_0 = a_0
f_1 = a_0 + a_1 (x_1 − x_0)
f_2 = a_0 + a_1 (x_2 − x_0) + a_2 (x_2 − x_0)(x_2 − x_1)
   ⋮
f_n = a_0 + a_1 (x_n − x_0) + ⋯ + a_n (x_n − x_0)(x_n − x_1) ⋯ (x_n − x_{n−1})

For n = 2, for instance, we have to solve the system

( 1  0            0                      ) ( a_0 )   ( f_0 )
( 1  (x_1 − x_0)  0                      ) ( a_1 ) = ( f_1 )
( 1  (x_2 − x_0)  (x_2 − x_0)(x_2 − x_1) ) ( a_2 )   ( f_2 ).   (6.9)

If we solve the system from the top (forward substitution), we obtain

a_0 = f_0
a_1 = (f_1 − f_0)/(x_1 − x_0)
a_2 = ( f_2 − a_0 − a_1(x_2 − x_0) ) / ( (x_2 − x_0)(x_2 − x_1) ).

Evidently, a_n is a linear combination of f_0, ..., f_n with coefficients that depend on x_0, ..., x_n. We use the notation

a_n = f[x_0, x_1, ..., x_n].   (6.10)

The expressions f[x_0, x_1, ..., x_n] are called divided differences of f relative to the nodes x_0, x_1, ..., x_n. The name "divided differences" comes from the following useful property.

Theorem 6.3.1. Divided differences satisfy the equation

f[x_0, x_1, ..., x_n] = ( f[x_1, ..., x_n] − f[x_0, x_1, ..., x_{n−1}] ) / (x_n − x_0).   (6.11)


Proof. Let

r(x) = p_{k−1}(x_1, ..., x_k; x)

and

s(x) = p_{k−1}(x_0, ..., x_{k−1}; x)

be interpolation polynomials. Then

p_k(x_0, ..., x_k; x) = r(x) + ((x − x_k)/(x_k − x_0)) (r(x) − s(x)).   (6.12)

Indeed, the polynomial on the right has degree ≤ k. If i ≠ 0, i ≠ k we have

p_k(x_0, ..., x_k; x_i) = r(x_i) + ((x_i − x_k)/(x_k − x_0)) (r(x_i) − s(x_i)) = f_i + ((x_i − x_k)/(x_k − x_0)) (f_i − f_i) = f_i,

and similarly for i = 0 or i = k

p_k(x_0, ..., x_k; x_0) = f_0,  p_k(x_0, ..., x_k; x_k) = f_k.

By the uniqueness of the interpolation polynomial the assertion (6.12) follows.

Now equating the leading coefficients on both sides of (6.12) gives

f[x_0, x_1, ..., x_k] = ( f[x_1, ..., x_k] − f[x_0, x_1, ..., x_{k−1}] ) / (x_k − x_0).

If a table of function values (x_i, f(x_i)) is given, we can construct from it a table of divided differences:

x_0  f_0  f[x_0, x_1]  f[x_0, x_1, x_2]  f[x_0, x_1, x_2, x_3]
x_1  f_1  f[x_1, x_2]  f[x_1, x_2, x_3]
x_2  f_2  f[x_2, x_3]
x_3  f_3

Example 1. Compute a divided difference table for these function values:

x    | 3   1  5  6
f(x) | 1  −3  2  4

We have

x_i  f_i
3     1   2     −3/8   7/40
1    −3   5/4    3/20
5     2   2
6     4

The corresponding Newton polynomial is therefore

p(x) = 1 + 2(x − 3) − (3/8)(x − 3)(x − 1) + (7/40)(x − 3)(x − 1)(x − 5).
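The recurrence (6.11) leads to a short Python sketch (our own naming) that computes the Newton coefficients in place and evaluates p by nested multiplication:

    def divided_differences(xs, fs):
        """Return [f[x0], f[x0,x1], ..., f[x0..xn]] via the recurrence (6.11)."""
        coef = list(fs)
        n = len(xs)
        for k in range(1, n):
            for i in range(n - 1, k - 1, -1):
                coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - k])
        return coef

    def newton_eval(xs, coef, x):
        """Evaluate the Newton form with nested multiplication."""
        result = coef[-1]
        for i in range(len(coef) - 2, -1, -1):
            result = result * (x - xs[i]) + coef[i]
        return result

    xs, fs = [3, 1, 5, 6], [1, -3, 2, 4]
    coef = divided_differences(xs, fs)
    print(coef)                      # [1, 2.0, -0.375, 0.175], i.e. 1, 2, -3/8, 7/40
    print(newton_eval(xs, coef, 5))  # 2.0, reproducing the data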


6.4 Spline Interpolation

A spline function consists of polynomial pieces on subintervals, joined together with certain continuity conditions.

Let

∆ : a = x_0 < x_1 < x_2 < ⋯ < x_{n−1} < x_n = b

be a subdivision of the interval [a, b]. The points x_i are called knots.

Definition 6.4.1. A function S : [a, b] → R is called a spline function of degree k with respect to a subdivision ∆ if

a) S | [x_{i−1}, x_i[ (i = 1, ..., n) is a polynomial of degree k;

b) S ∈ C^{(k−1)}([a, b]).

Hence a spline function is a piecewise polynomial of degree k having continuous derivatives of all orders up to k − 1. A spline of degree 0 is a piecewise constant function. A spline of degree 1 is a piecewise linear function.

Figure 6.1: Spline of degree 0 Figure 6.2: Spline of degree 1

The most widely used splines in practice are cubic splines, that is, k = 3. This means that each

p_i := S | [x_{i−1}, x_i],  1 ≤ i ≤ n,

is a polynomial of degree 3.

We now show how to compute an interpolating cubic spline. We assume that a subdivision ∆ and values f_i (i = 0, ..., n) are given. So we look for a cubic spline function satisfying the interpolation condition:

S(x_i) = f_i,  i = 0, ..., n.

First we define numbers z_i = S″(x_i), 0 ≤ i ≤ n. These numbers are well defined because S is twice continuously differentiable. Since each p_i is a polynomial of degree 3, p_i″ is a linear function. Hence we have

p_i″(x) = z_{i−1} + ((z_i − z_{i−1})/h_i)(x − x_{i−1})  with  h_i := x_i − x_{i−1}.

If this is integrated twice we obtain

p_i(x) = a_i (x − x_{i−1})³ + b_i (x − x_{i−1})² + c_i (x − x_{i−1}) + d_i.


The coefficients a_i and b_i are derived from p_i″. The interpolation condition implies: d_i = f_{i−1}. Finally we have to satisfy:

f_i = S(x_i) = p_i(x_i) = a_i h_i³ + b_i h_i² + c_i h_i + d_i.

Solving for c_i we obtain:

c_i = (f_i − f_{i−1})/h_i − a_i h_i² − b_i h_i.

Collecting this information we get:

a_i = (z_i − z_{i−1})/(6 h_i)
b_i = z_{i−1}/2
c_i = (f_i − f_{i−1})/h_i − (h_i/6)(2 z_{i−1} + z_i)
d_i = f_{i−1}

Finally we have to ensure that at the inner knots the left and right values of the first derivative coincide. Hence we have to satisfy:

p_i′(x_i) = p_{i+1}′(x_i)  ⇔  3 a_i h_i² + 2 b_i h_i + c_i = c_{i+1},  i = 1, ..., n − 1.

By inserting the above expressions and sorting for the unknowns z_i:

h_i z_{i−1} + 2(h_i + h_{i+1}) z_i + h_{i+1} z_{i+1} = 6 ( (f_{i+1} − f_i)/h_{i+1} − (f_i − f_{i−1})/h_i ),  i = 1, ..., n − 1.

These are n − 1 equations in the n + 1 unknowns z_0, ..., z_n. We can select the values of z_0 and z_n arbitrarily and solve the remaining equations. There are several systematic ways to determine z_0 and z_n by additional conditions:

1. One simply requires: z_0 = 0 and z_n = 0. The resulting spline is called a natural spline.

2. One prescribes values S′(x_0) and S′(x_n).

3. One prescribes additional function values S(0.5(x_0 + x_1)) and S(0.5(x_{n−1} + x_n)).

4. One requires that S can be continued periodically.

In the case of a natural spline the linear system is of the form

( u_1  h_2                            ) ( z_1     )   ( v_1     )
( h_2  u_2  h_3                       ) ( z_2     )   ( v_2     )
(      h_3  u_3  h_4                  ) ( z_3     ) = ( v_3     )
(        ⋱    ⋱    ⋱                  ) (  ⋮      )   (  ⋮      )
(      h_{n−2}  u_{n−2}  h_{n−1}      ) ( z_{n−2} )   ( v_{n−2} )
(               h_{n−1}  u_{n−1}      ) ( z_{n−1} )   ( v_{n−1} )

where

h_i = x_i − x_{i−1}
u_i = 2(h_i + h_{i+1})
b_i = (6/h_i)(f_i − f_{i−1})
v_i = b_{i+1} − b_i

Example: Determine the interpolating cubic natural spline for the following data:


Figure 6.3: The computed interpolating spline

x | 0  1  2  5   8
f | 6  0  6  24  6

To set up the system of equations to determine the z_i we may collect the relevant values in the following table:

i  x_i  h_i  f_i  b_i/6  v_i/6
0  0         6
1  1    1    0    −6     12
2  2    1    6     6      0
3  5    3    24    6     −12
4  8    3    6    −6

From this we obtain the system of equations:

( 4  1  0  ) ( z_1 )   (  72 )
( 1  8  3  ) ( z_2 ) = (   0 )
( 0  3  12 ) ( z_3 )   ( −72 )

with the solution (extended by z_0 = 0 and z_4 = 0): z = (0, 18, 0, −6, 0). From this one computes the following representation of the polynomial pieces:

p_1(x) = 3x³ − 9x + 6                    in [0, 1]   (6.13)
p_2(x) = −3(x − 1)³ + 9(x − 1)²          in [1, 2]   (6.14)
p_3(x) = −(1/3)(x − 2)³ + 9(x − 2) + 6   in [2, 5]   (6.15)
p_4(x) = (1/3)(x − 5)³ − 3(x − 5)² + 24  in [5, 8]   (6.16)
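The whole construction fits in a few lines of Python (a sketch with NumPy; the names are ours). It assembles the tridiagonal system for the natural spline and returns the piece coefficients (a_i, b_i, c_i, d_i):

    import numpy as np

    def natural_cubic_spline(xs, fs):
        """Return per-interval coefficients (a, b, c, d) of the natural cubic spline."""
        xs, fs = np.asarray(xs, float), np.asarray(fs, float)
        n = len(xs) - 1
        h = np.diff(xs)
        bb = 6.0 * np.diff(fs) / h                 # b_i = 6 (f_i - f_{i-1}) / h_i
        # tridiagonal system for z_1, ..., z_{n-1}; z_0 = z_n = 0 (natural spline)
        M = np.zeros((n - 1, n - 1))
        for i in range(n - 1):
            M[i, i] = 2.0 * (h[i] + h[i + 1])
            if i > 0:
                M[i, i - 1] = h[i]
            if i < n - 2:
                M[i, i + 1] = h[i + 1]
        z = np.zeros(n + 1)
        z[1:n] = np.linalg.solve(M, bb[1:] - bb[:-1])
        a = np.diff(z) / (6.0 * h)
        b = z[:-1] / 2.0
        c = np.diff(fs) / h - h / 6.0 * (2.0 * z[:-1] + z[1:])
        d = fs[:-1]
        return a, b, c, d

    print(natural_cubic_spline([0, 1, 2, 5, 8], [6, 0, 6, 24, 6]))
    # a = [3, -3, -1/3, 1/3], b = [0, 9, 0, -3], c = [-9, 0, 9, 0], d = [6, 0, 6, 24]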

A natural spline has the following minimality property:

Theorem 6.4.2. Let f ∈ C²([a, b]). If s is the natural cubic spline that interpolates f, then

∫_a^b (s″(x))² dx ≤ ∫_a^b (f″(x))² dx.


Chapter 7

Numerical Integration

Some integration problems cannot be solved by elementary techniques. Here are some examples:

∫_0^2 e^{−x²} dx,  ∫_0^π cos(4 cos θ) dθ,  ∫_0^1 tan(sin x²) dx.

Elementary techniques depend on antidifferentiation. Thus to evaluate the integral

∫_a^b f(x) dx   (7.1)

we first find a function F with the property F′ = f. Then we have

∫_a^b f(x) dx = F(b) − F(a).

Many elementary functions do not have simple antiderivatives, so we cannot rely on that method.

One method for computing the integral (7.1) numerically is to replace the function f by another function g which is easily integrated. Then if f ≈ g it should be

∫_a^b f(x) dx ≈ ∫_a^b g(x) dx.

It should now be plausible that polynomials are good candidates for the function g.

7.1 Integration via polynomial interpolation

Suppose that we want to evaluate the integral (7.1). We can select nodes x_0, x_1, ..., x_n ∈ [a, b] and set up the interpolation polynomial

p(x) = Σ_{i=0}^{n} f(x_i) l_i(x),   (7.2)

where

l_i(x) = Π_{j=0, j≠i}^{n} (x − x_j)/(x_i − x_j)   (7.3)


is the Lagrange polynomial. Then we write

∫_a^b f(x) dx ≈ ∫_a^b p(x) dx = Σ_{i=0}^{n} f(x_i) ∫_a^b l_i(x) dx = Σ_{i=0}^{n} f(x_i) A_i   (7.4)

where

A_i := ∫_a^b l_i(x) dx.   (7.5)

A formula of the form

∫_a^b f(x) dx ≈ Σ_{i=0}^{n} f(x_i) A_i   (7.6)

is called a Newton-Cotes formula if the nodes are equally spaced.

7.1.1 The trapezoid rule

In the simplest case we set n = 1 and x_0 = a, x_1 = b. Then

l_0(x) = (x − b)/(a − b),  l_1(x) = (x − a)/(b − a).

Consequently

A_0 = ∫_a^b l_0(x) dx = (1/2)(1/(a − b)) (x − b)² |_a^b = (1/2)(b − a),
A_1 = ∫_a^b l_1(x) dx = (1/2)(b − a).

The corresponding quadrature formula is

∫_a^b f(x) dx ≈ ((b − a)/2)(f(a) + f(b)).   (7.7)

This formula is known as the trapezoid rule.

Formula (7.7) is exact for all polynomials of degree at most 1 (because the interpolation formula in that case is exact). The error term is, according to Theorem 6.2.1,

−(1/12)(b − a)³ f″(ξ),  a < ξ < b.

If the interval [a, b] is partitioned like this

a = x_0 < x_1 < ⋯ < x_n = b,

then we can apply the trapezoid rule (7.7) to each subinterval. Thus we obtain the formula

∫_a^b f(x) dx = Σ_{i=1}^{n} ∫_{x_{i−1}}^{x_i} f(x) dx ≈ (1/2) Σ_{i=1}^{n} (x_i − x_{i−1}) (f(x_{i−1}) + f(x_i)).   (7.8)


With uniform spacing h = (b − a)/n, x_i = a + ih, formula (7.8) takes the form

∫_a^b f(x) dx ≈ (h/2) ( f(a) + 2 Σ_{i=1}^{n−1} f(a + ih) + f(b) ).   (7.9)
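Formula (7.9) in Python (a small sketch, our own naming), tried on the integral of e^{−x²} over [0, 2] from the introduction:

    import numpy as np

    def trapezoid(f, a, b, n):
        """Composite trapezoid rule (7.9) with n subintervals."""
        x = np.linspace(a, b, n + 1)
        y = f(x)
        h = (b - a) / n
        return h / 2 * (y[0] + 2 * y[1:-1].sum() + y[-1])

    f = lambda x: np.exp(-x**2)
    print(trapezoid(f, 0.0, 2.0, 100))   # approximately 0.8821; the error is O(h^2)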

7.1.2 Simpson’s rule

Example 1. n = 2, [a, b] = [0, 1]. We take the nodes 0, 1/2, 1. Then

l_0(x) = 2(x − 1/2)(x − 1),  l_1(x) = −4x(x − 1),  l_2(x) = 2x(x − 1/2).

It follows

A_0 = ∫_0^1 l_0(x) dx = 1/6,  A_1 = ∫_0^1 l_1(x) dx = 2/3,  A_2 = ∫_0^1 l_2(x) dx = 1/6.

We obtain the formula

∫_0^1 f(x) dx ≈ (1/6) f(0) + (2/3) f(1/2) + (1/6) f(1).   (7.10)

Formula (7.6) is exact for all polynomials of degree ≤ n. Now suppose that a formula (7.6) is given to us and that we know only that it is exact for all polynomials of degree ≤ n. Then we have

∫_a^b l_j(x) dx = Σ_{i=0}^{n} A_i l_j(x_i) = Σ_{i=0}^{n} A_i δ_ij = A_j.

This makes it possible to compute the weights A_i by the method of undetermined coefficients.

Example 2. n = 2, [a, b] = [0, 1]. We look for a formula

∫_0^1 f(x) dx ≈ A_0 f(0) + A_1 f(1/2) + A_2 f(1)

that will be exact for all polynomials of degree ≤ 2. By using the functions 1, x, x² we obtain

1   = ∫_0^1 dx    = A_0 + A_1 + A_2   (7.11)
1/2 = ∫_0^1 x dx  = (1/2) A_1 + A_2   (7.12)
1/3 = ∫_0^1 x² dx = (1/4) A_1 + A_2   (7.13)

The solution of this system is

A_0 = 1/6,  A_1 = 2/3,  A_2 = 1/6,

so that we get the same formula as in (7.10).

Similar calculations for an arbitrary interval [a, b] lead to Simpson's rule:

∫_a^b f(x) dx ≈ ((b − a)/6) ( f(a) + 4 f((a + b)/2) + f(b) ).   (7.14)

By construction Simpson's rule is exact for all polynomials of degree ≤ 2, but one can prove that it is exact for all polynomials of degree ≤ 3.
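A quick Python check of (7.14) and of the degree-3 exactness claim (our own sketch):

    def simpson(f, a, b):
        """Simpson's rule (7.14) on a single interval."""
        return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

    # exact for cubics: the integral of x^3 on [0, 1] is 1/4
    print(simpson(lambda x: x**3, 0.0, 1.0))   # 0.25
    # not exact for x^4: the integral is 1/5 = 0.2
    print(simpson(lambda x: x**4, 0.0, 1.0))   # 0.2083...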


Change of intervals

From a formula of numerical integration for one interval we can derive a formula for any other interval. Suppose that a formula for numerical integration is given:

∫_c^d f(x) dx ≈ Σ_{i=0}^{n} A_i f(x_i).   (7.15)

We assume that this formula is exact for all polynomials of degree ≤ m. If we need a formula for another interval [a, b], then we map the interval [c, d] onto the interval [a, b] by a linear function λ, explicitly

λ(t) = ((b − a)/(d − c)) t + (ad − bc)/(d − c).   (7.16)

Then it follows

∫_a^b f(x) dx = ∫_c^d f(λ(t)) λ′(t) dt = ((b − a)/(d − c)) ∫_c^d f(λ(t)) dt ≈ ((b − a)/(d − c)) Σ_{i=0}^{n} A_i f(λ(t_i)),

that is,

∫_a^b f(x) dx ≈ ((b − a)/(d − c)) Σ_{i=0}^{n} A_i f( ((b − a)/(d − c)) t_i + (ad − bc)/(d − c) ).   (7.17)

If f is a polynomial of degree ≤ m then f(λ(t)) is a polynomial of the same degree because the function λ is linear. Hence the formula (7.17) is exact for all polynomials of degree ≤ m.

We now estimate the error involved in the numerical integration formula (7.6). Recall from Theorem 6.2.1 that if p is a polynomial of degree ≤ n that interpolates f at x_0, x_1, ..., x_n, then

f(x) − p(x) = (1/(n + 1)!) f^{(n+1)}(ξ_x) Π_{i=0}^{n} (x − x_i).

It follows

∫_a^b f(x) dx − Σ_{i=0}^{n} A_i f(x_i) = (1/(n + 1)!) ∫_a^b f^{(n+1)}(ξ_x) Π_{i=0}^{n} (x − x_i) dx.

Because f^{(n+1)} is continuous we have |f^{(n+1)}| ≤ M and therefore

| ∫_a^b f(x) dx − Σ_{i=0}^{n} A_i f(x_i) | ≤ (M/(n + 1)!) ∫_a^b | Π_{i=0}^{n} (x − x_i) | dx.

The choice of nodes that makes the right side of this inequality as small as possible is known to be

x_i = (a + b)/2 + ((b − a)/2) cos( (i + 1)π/(n + 2) ),  0 ≤ i ≤ n.   (7.18)

If the interval is [−1, 1] these nodes have the simpler form

x_i = cos( (i + 1)π/(n + 2) ),  0 ≤ i ≤ n.


If we set x = cos θ, then the x_i are precisely the zeros of the function

U_{n+1}(x) = sin((n + 2)θ) / sin θ.

The function U_{n+1}(x) is known as the Chebyshev polynomial of the second kind. For the nodes (7.18) and the corresponding A_i one can prove that

| ∫_{−1}^{1} f(x) dx − Σ_{i=0}^{n} A_i f(x_i) | ≤ M / ((n + 1)! 2^n).

7.2 Gaussian Quadrature

We look again at the quadrature formula

∫_a^b f(x) dx ≈ Σ_{i=0}^{n} A_i f(x_i),

which is exact for all polynomials of degree ≤ n. In the last section the choice of the nodes x_0, x_1, ..., x_n was made a priori. We now ask ourselves if there is some choice of nodes which is better than other choices.

We consider the slightly more general formula

∫_a^b f(x) w(x) dx ≈ Σ_{i=0}^{n} A_i f(x_i),   (7.19)

where w is a fixed positive weight function. The idea of Gauss is to use the variability of the nodes to force the formula (7.19) to be exact for all polynomials of degree ≤ 2n + 1. The next theorem shows where the nodes should be placed.

Theorem 7.2.1. Let w be a positive weight function and let q be a nonzero polynomial of degree n + 1 that is orthogonal to all polynomials p of degree ≤ n, that is,

∫_a^b q(x) p(x) w(x) dx = 0   (7.20)

for all polynomials p of degree ≤ n. If x_0, x_1, ..., x_n are the zeros of q then the formula (7.19) will be exact for all polynomials of degree ≤ 2n + 1.

Proof. Let f be a polynomial of degree ≤ 2n + 1. Divide f by q, obtaining a quotient p and remainder r,

f = qp + r,

where p, r are polynomials of degree ≤ n. It follows f(x_i) = r(x_i). Then by (7.20) and (7.19)

∫_a^b f(x) w(x) dx = ∫_a^b (qp + r) w dx = ∫_a^b r w dx = Σ_{i=0}^{n} A_i r(x_i) = Σ_{i=0}^{n} A_i f(x_i).

It turns out that the roots of q are simple roots and lie in the interior of the interval [a, b].

Example 1. Let [a, b] = [−1, 1] and w(x) = 1. This is the case originally considered by Gauss.


We consider first the case n = 1. Then

q(x) = ax² + bx + c.

According to (7.20) q(x) must be orthogonal to 1 and x, thus

∫_{−1}^{1} (ax² + bx + c) dx = 0,  ∫_{−1}^{1} x(ax² + bx + c) dx = 0.

It follows

(2/3)a + 2c = 0,  (2/3)b = 0.

Solving the system yields the polynomial

q(x) = a (x² − 1/3).

Usually one normalizes q(x) in such a way that q(1) = 1. If we normalize we get

q(x) = (3/2)x² − 1/2.   (7.21)

The zeros of this polynomial are x_{1,2} = ±1/√3 and therefore the formula is

∫_{−1}^{1} f(x) dx ≈ f(−1/√3) + f(1/√3).

The polynomials which satisfy the orthogonality relation (7.20) are known as Legendre polynomials and are denoted by P_n(x). We have just shown that

P_2(x) = (3/2)x² − 1/2.

The Legendre polynomials satisfy the following recurrence relation:

P_{n+1}(x) = ((2n + 1)/(n + 1)) x P_n(x) − (n/(n + 1)) P_{n−1}(x),  n ≥ 1.   (7.22)

We now consider the case n = 4. First we have to compute the Legendre polynomial P_5(x). This can be done either by using the recurrence relation (7.22) or by setting

q(x) = ax⁵ + bx⁴ + cx³ + dx² + ex + f

and solving the system

∫_{−1}^{1} q(x) dx    = (2/5)b + (2/3)d + 2f = 0,        (7.23)
∫_{−1}^{1} x q(x) dx  = (2/7)a + (2/5)c + (2/3)e = 0,    (7.24)
∫_{−1}^{1} x² q(x) dx = (2/7)b + (2/5)d + (2/3)f = 0,    (7.25)
∫_{−1}^{1} x³ q(x) dx = (2/9)a + (2/7)c + (2/5)e = 0,    (7.26)
∫_{−1}^{1} x⁴ q(x) dx = (2/9)b + (2/7)d + (2/5)f = 0.    (7.27)


We obtain the solution (observe that P_5(1) = 1)

P_5(x) = q(x) = (63/8)x⁵ − (35/4)x³ + (15/8)x.   (7.28)

The zeros of P_5(x) are

−x_0 = x_4 = (1/21)√(245 + 14√70) = 0.9061798459
−x_1 = x_3 = (1/21)√(245 − 14√70) = 0.5384693101
 x_2 = 0.

According to (7.5) the weights A_i are given by the formula

A_j = ∫_{−1}^{1} Π_{i=0, i≠j}^{4} (x − x_i)/(x_j − x_i) dx,

that is,

A_0 = A_4 = 0.2369268851,  A_1 = A_3 = 0.4786286705,  A_2 = 0.5688888889.

We obtain the integration formula

∫_{−1}^{1} f(x) dx ≈ A_0 f(x_0) + A_1 f(x_1) + A_2 f(x_2) + A_3 f(x_3) + A_4 f(x_4).
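These nodes and weights give a very accurate rule with only five function evaluations. A short Python check (our own sketch), using the test integral of e^x over [−1, 1], whose exact value is e − 1/e:

    import math

    # 5-point Gauss-Legendre nodes and weights on [-1, 1] (zeros of P_5)
    nodes = [-0.9061798459, -0.5384693101, 0.0, 0.5384693101, 0.9061798459]
    weights = [0.2369268851, 0.4786286705, 0.5688888889, 0.4786286705, 0.2369268851]

    def gauss5(f):
        """Apply the 5-point Gauss-Legendre formula on [-1, 1]."""
        return sum(A * f(x) for A, x in zip(weights, nodes))

    exact = math.e - 1.0 / math.e          # integral of e^x over [-1, 1]
    approx = gauss5(math.exp)
    print(approx, abs(approx - exact))     # tiny error: the rule is exact up to degree 9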


Literature

Gautschi, W. [1997]. Numerical Analysis, Birkhäuser, Boston.

Hämmerlin and Hoffmann [1994]. Numerische Mathematik, Springer.

Householder, A.S. [1953]. Principles of Numerical Analysis, Dover Publications, New York.

Kincaid, D. and Cheney, W. [1991]. Numerical Analysis, Brooks/Cole Publishing Company.

Locher [1992]. Numerische Mathematik für Informatiker, Springer.

Phillips, C. and Cornelius, B. [1986]. Computational Numerical Methods, Ellis Horwood.

Stoer, J. and Bulirsch, R. [1980]. Introduction to Numerical Analysis, Springer.