Numerical Methods Fall 2010

Lecturer: conf. dr. Viorel Bostan
Office: 6-417
Telephone: 50-99-38
E-mail address: viorel [email protected]
Course web page: moodle.fcim.utm.md
Office hours: TBA. I will also be available at other times. Just drop by my office, talk to me after the class or send me an e-mail to make an appointment.

Prerequisites: A basic course on mathematical analysis (single and multivariable calculus), ordinary differential equations and some knowledge of computer programming.

Course outline: This is a fast-paced course. This course gives an in-depth introduction to the basic areas of numerical analysis. The main objective will be to have a clear understanding of the ideas and techniques underlying the numerical methods, results, and algorithms that will be presented, where error analysis plays an important role. You will then be able to use this knowledge to analyze the numerical methods and algorithms that you will encounter, and also to program them effectively on a computer. This knowledge will be useful in your future not only to solve problems with a numerical component, but also to develop numerical algorithms of your own.

Topics to be covered:

1. Computer representation of numbers. Errors: types, sources, propagation.
2. Solution of nonlinear equations. Rootfinding.
3. Interpolation by polynomials and spline functions.
4. Approximation of functions.
5. Numerical integration. Automatic differentiation.
6. Matrix computations and systems of linear equations.
7. Numerical methods for ODEs.

This course plan may be modified during the semester. Such modifications will be announced in advance during class period. The student is responsible for keeping abreast of such changes.

Class procedure: The majority of each class period will be lecture oriented. Some material will be handed out during lectures, some material will be sent by e-mail. I strongly advise you to attend lectures, do your homework, work consistently, and ask questions. Lecture time is at a premium; you cannot be taught everything in class. It is your responsibility to learn the material; the instructor's job is to guide you in your learning.

During the semester, 10 homeworks and 4 programming projects will be assigned. As a general rule, you will find it necessary to spend approximately 2-3 hours of study for each lecture/lab meeting, and additional time will be needed for exam preparation. It is strongly advised that you start working on this course from the very beginning. The importance of doing the assigned homeworks and projects cannot be overemphasized.

Programming projects: The predominant programming languages used in numerical analysis are Fortran and MATLAB. We will focus on MATLAB. Programs in other languages are also sometimes acceptable, but no programming assistance will be given in the use of such languages (i.e. C, C++, Java, Pascal). For students unacquainted with MATLAB, the following e-readings are suggested:

1. Ian Cavers, An Introductory Guide to MATLAB, 2nd Edition, Dept. of Computer Science, University of British Columbia, December 1998,
www.cs.ubc.ca/spider/cavers/MatlabGuide/guide.html

2. Paul Fackler, A MATLAB Primer, North Carolina State University,
www4.ncsu.edu/unity/users/p/pfackler/www/MPRIMER.htm

3. MATLAB Tutorials, Dept. of Mathematics, Southern Illinois University at Carbondale,
www.math.siu.edu/matlab/tutorials.html

4. Christian Roessler, MATLAB Basics, University of Melbourne, June 2004,
www.econphd.net/downloads/matlab.pdf

5. Kermit Sigmon, MATLAB Primer, 3rd edition, Dept. of Mathematics, University of Florida,
www.wiwi.uni-frankfurt.de/professoren/krueger/teaching/ws0506/macromh/matlabprimer.pdf

In your project report you should include:

1. The routines you have developed;
2. The results for your test cases in the form of tables, graphs, etc.;
3. Answers to all questions contained in the assignment;
4. Comments.

You should report your results in a way that is easy to read, communicates the problem and the results effectively, and can be reproduced by someone else who has not seen the problem before, but is technically knowledgeable. You should also give any justification or other reasons to believe the correctness of your results and code. Also, give conclusions on how effective your methods and routines appear to be, and report and comment on any "unusual behavior" of your results. Team working is allowed, but you should specify this in your report, as well as the tasks executed by each member of your team.

Grading policy: The final grade will be based on tests and hw/projects, as follows:

1. There will be one 3-hour written exam given after 8 weeks of classes at a time arranged later (presumably at the end of October). This midterm exam will count 25% of the course grade.

2. The final comprehensive exam will be given during the scheduled examination time at the end of the semester, it will cover all material, and it will count 35% of your final grade.

3. HW and lab projects will count 20% of the grade each. Late homeworks and projects are not allowed!

4. You will need a scientific calculator during exams. Sharing of calculators will not be allowed. Make sure you have one.

The exams will be open notes, i.e. you will be allowed to use your class notes and class slides (no other material will be allowed).

Grading for homeworks and lab projects

The HW will be graded on a scale from 0 to 4, with the possibility of getting an extra bonus point at each homework. Grades will be given according to the following guidelines:

0 – no homework turned in;
1 – poor, lousy job;
2 – incomplete job;
3 – good job;
4 – very good job;
+1 – for optional problems and/or an excellent/outstanding solution to one of the problems.

It is very important that you take the examinations at the scheduled times. Alternate exams will be scheduled only for those who have compelling and convincing enough reasons.

Academic misconduct: Any kind of academic misconduct will not be tolerated. If a situation arises where you and your instructor disagree on some matter and cannot resolve the issue, you should see the Dean. However, any problems concerning the course should first be discussed with your instructor.

Readings:

1. Kendall Atkinson, An Introduction to Numerical Analysis, 2nd edition.

2. Cleve Moler, Numerical Computing with MATLAB, http://www.mathworks.com/moler/

3. Bjoerck A., Dahlquist G., Numerical Mathematics and Scientific Computation.

4. Steven E. Pav, Numerical Methods Course Notes, University of California at San Diego, 2005.

5. Mathews J.H., Fink D.K., Numerical Methods Using MATLAB, 1999.

6. Kincaid D., Cheney W., Numerical Analysis, 1991.

7. Goldberg D., What Every Computer Scientist Should Know About Floating-Point Arithmetic, 1991.

8. Hoffman J.D., Numerical Methods for Engineers and Scientists, 2001.

9. Johnston R.L., Numerical Methods: A Software Approach, 1982.

10. Carothers N.L., A Short Course on Approximation Theory, Course notes, Bowling Green State University.

11. George W. Collins, Fundamental Numerical Methods and Data Analysis.

12. Shampine L.F., Allen R.C., Pruess S., Fundamentals of Numerical Computing, 1997.

Also, you should check the university library for available books.


Useful web-sites with on-line literature:

www.math.gatech.edu/~cain/textbooks/onlinebooks.html

www.econphd.net/notes.htm

Definition of Numerical Analysis by Kendall Atkinson, Prof., University of Iowa

Numerical analysis is the area of mathematics and com-puter science that creates, analyzes, and implements al-gorithms for solving numerically the problems of contin-uous mathematics.

Such problems originate generally from real-world appli-cations of algebra, geometry and calculus, and they in-volve variables which vary continuously; these problemsoccur throughout the natural sciences, social sciences,engineering, medicine, and business.

During the past half-century, the growth in power andavailability of digital computers has led to an increas-ing use of realistic mathematical models in science andengineering, and numerical analysis of increasing sophis-tication has been needed to solve these more detailedmathematical models of the world.

With the growth in importance of using computers to carry out numerical procedures in solving mathematical models of the world, an area known as scientific computing or computational science has taken shape during the 1980s and 1990s. This area looks at the use of numerical analysis from a computer science perspective. It is concerned with using the most powerful tools of numerical analysis, computer graphics, symbolic mathematical computations, and graphical user interfaces to make it easier for a user to set up, solve, and interpret complicated mathematical models of the real world.

Definition of Numerical Analysis by Lloyd N. Trefethen, Prof., Cornell University

Here is the wrong answer: Numerical analysis is the study of rounding errors.

Some other wrong or incomplete answers:

Webster's New Collegiate Dictionary: The study of quantitative approximations to the solutions of mathematical problems including consideration of the errors and bounds to the errors involved.

Chambers 20th Century Dictionary: The study of methods of approximation and their accuracy, etc.

The American Heritage Dictionary: The study of approximate solutions to mathematical problems taking into account the extent of possible errors.

The correct answer is: Numerical analysis is the study of algorithms for the problems of continuous mathematics.


NUMERICAL ANALYSIS: This refers to the analysis of mathematical problems by numerical means, especially mathematical problems arising from models based on calculus.

Effective numerical analysis requires several things:

• An understanding of the computational tool being used, be it a calculator or a computer.

• An understanding of the problem to be solved.

• Construction of an algorithm which will solve the given mathematical problem to a given desired accuracy and within the limits of the resources (time, memory, etc.) that are available.

This is a complex undertaking. Numerous people make this their life's work, usually working on only a limited variety of mathematical problems.

Within this course, we attempt to show the spirit of the subject. Most of our time will be taken up with looking at algorithms for solving basic problems such as rootfinding and numerical integration; but we will also look at the structure of computers and the implications of using them in numerical calculations.

We begin by looking at the relationship of numerical analysis to the larger world of science and engineering.

SCIENCE

Traditionally, engineering and science had a two-sided approach to understanding a subject: the theoretical and the experimental. More recently, a third approach has become equally important: the computational.

Traditionally we would build an understanding by building theoretical mathematical models, and we would solve these for special cases. For example, we would study the flow of an incompressible irrotational fluid past a sphere, obtaining some idea of the nature of fluid flow. But more practical situations could seldom be handled by direct means, because the needed equations were too difficult to solve. Thus we also used the experimental approach to obtain better information about the flow of practical fluids. The theory would suggest ideas to be tried in the laboratory, and the experimental results would often suggest directions for a further development of theory.

[Diagram: the three-sided approach to science: Theoretical Science, Experimental Science, Computational Science.]

With the rapid advance in powerful computers, we now can augment the study of fluid flow by directly solving the theoretical models of fluid flow as applied to more practical situations; this area is often referred to as "computational fluid dynamics". At the heart of computational science is numerical analysis; and to effectively carry out a computational science approach to studying a physical problem, we must understand the numerical analysis being used, especially if improvements are to be made to the computational techniques being used.

MATHEMATICAL MODELS

A mathematical model is a mathematical description of a physical situation. By means of studying the model, we hope to understand more about the physical situation. Such a model might be very simple. For example,

$$A = 4\pi R_e^2, \qquad R_e \doteq 6371 \text{ km}$$

is a formula for the surface area of the earth. How accurate is it? First, it assumes the earth is a sphere, which is only an approximation. At the equator, the radius is approximately 6378 km; and at the poles, the radius is approximately 6357 km. Next, there is experimental error in determining the radius; and in addition, the earth is not perfectly smooth. Therefore, there are limits on the accuracy of this model for the surface area of the earth.

AN INFECTIOUS DISEASE MODEL

For rubella measles, we have the following model for the spread of the infection in a population (subject to certain assumptions):

$$\frac{ds}{dt} = -a\,s\,i, \qquad \frac{di}{dt} = a\,s\,i - b\,i, \qquad \frac{dr}{dt} = b\,i$$

In this, s, i, and r refer, respectively, to the proportions of a total population that are susceptible, infectious, and removed (from the susceptible and infectious pool of people). All variables are functions of time t. The constants can be taken as

$$a = \frac{6.8}{11}, \qquad b = \frac{1}{11}$$

The same model works for some other diseases (e.g. flu), with a suitable change of the constants a and b. Again, this is an approximation of reality (and a useful one).
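As an aside (an added illustration, not part of the original slide), this system is easy to explore numerically in MATLAB. The constants are those given above; the time span and the initial proportions s(0) = 0.99, i(0) = 0.01, r(0) = 0 are assumptions for the example:

    % Minimal sketch: integrate the rubella model with ode45.
    a = 6.8/11;  b = 1/11;
    rhs = @(t, y) [ -a*y(1)*y(2);             % ds/dt = -a s i
                     a*y(1)*y(2) - b*y(2);    % di/dt =  a s i - b i
                     b*y(2) ];                % dr/dt =  b i
    [t, y] = ode45(rhs, [0 100], [0.99; 0.01; 0]);
    plot(t, y); legend('s', 'i', 'r'); xlabel('t');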

But it has its limits. Solving a bad model will not give good results, no matter how accurately it is solved; and the person solving this model and using the results must know enough about the formation of the model to be able to correctly interpret the numerical results.

THE LOGISTIC EQUATION

This is the simplest model for population growth. Let N(t) denote the number of individuals in a population (rabbits, people, bacteria, etc.). Then we model its growth by

$$N'(t) = cN(t), \quad t \geq 0, \qquad N(t_0) = N_0$$

The constant c is the growth constant, and it usually must be determined empirically. Over short periods of time, this is often an accurate model for population growth. For example, it accurately models the growth of the US population over the period 1790 to 1860, with c = 0.2975.
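For reference (an added note, not on the original slide), this model has the exact solution, obtained by separation of variables:

$$N(t) = N_0\,e^{c(t - t_0)}$$

so any numerical solution of this equation can be checked directly against the exponential.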

THE PREDATOR-PREY MODEL

Let F(t) denote the number of foxes at time t; and let R(t) denote the number of rabbits at time t. A simple model for these populations is called the Lotka-Volterra predator-prey model:

$$\frac{dR}{dt} = a\,[1 - b\,F(t)]\,R(t), \qquad \frac{dF}{dt} = c\,[-1 + d\,R(t)]\,F(t)$$

with a, b, c, d positive constants. If one looks carefully at this, then one can see how it is built from the logistic equation. In some cases, this is a very useful model and agrees with physical experiments. Of course, we can substitute other interpretations, replacing foxes and rabbits with other predator and prey. The model will fail, however, when there are other populations that affect the first two populations in a significant way.

NEWTON'S SECOND LAW

Newton's second law states that the force acting on an object is directly proportional to the product of its mass and acceleration,

$$F \propto ma$$

With a suitable choice of physical units, we usually write this in its scalar form as

$$F = ma$$

Newton's law of gravitation for a two-body situation, say the earth and an object moving about the earth, is then

$$m\,\frac{d^2 r(t)}{dt^2} = -\frac{G\,m\,m_e}{|r(t)|^2}\cdot\frac{r(t)}{|r(t)|}$$

with r(t) the vector from the center of the earth to the center of the object moving about the earth. The constant G is the gravitational constant, not dependent on the earth; and m and m_e are the masses, respectively, of the object and the earth.

This is an accurate model for many purposes. But what are some physical situations under which it will fail?

When the object is very close to the surface of the earth and does not move far from one spot, we take |r(t)| to be the radius of the earth. We obtain the new model

$$m\,\frac{d^2 r(t)}{dt^2} = -mg\,k$$

with k the unit vector directly upward from the earth's surface at the location of the object. The gravitational constant

$$g \doteq 9.8 \text{ meters/second}^2$$

Again this is a model; it is not physical reality.

The Patriot Missile Failure

On February 25, 1991, during the Gulf War, an American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks and killed 28 soldiers.

A report of the General Accounting Office, GAO/IMTEC-92-26, entitled Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia, reported on the cause of the failure.

It turns out that the cause was an inaccurate calculation of the time since boot due to computer arithmetic errors.

Specifically, the time in tenths of a second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24 bit fixed point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds.

The number 1/10 equals

$$\frac{1}{10} = \frac{1}{2^4} + \frac{1}{2^5} + \frac{1}{2^8} + \frac{1}{2^9} + \frac{1}{2^{12}} + \frac{1}{2^{13}} + \cdots = (0.0001100110011001100110011001100\ldots)_2$$

Now the 24 bit register in the Patriot stored instead

$$(0.00011001100110011001100)_2$$

introducing an error of

$$(0.0000000000000000000000011001100\ldots)_2$$

which, converted to decimal, is

$$(0.000000095)_{10}$$

Multiplying by the number of tenths of a second in 100 hours gives:

$$0.000000095 \times 100 \times 60 \times 60 \times 10 = 0.34$$

A Scud travels at about 1676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.
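A quick way to check this arithmetic (an added illustration, not part of the GAO report) is to chop 1/10 to 24 fractional bits in MATLAB and propagate the error over 100 hours:

    % Minimal sketch: reproduce the Patriot clock error.
    chopped = floor(0.1 * 2^24) / 2^24;   % 1/10 chopped at 24 bits
    err     = 0.1 - chopped;              % chopping error, about 9.5e-8
    tenths  = 100 * 60 * 60 * 10;         % tenths of a second in 100 hours
    drift   = err * tenths                % about 0.34 seconds
    range   = drift * 1676                % meters a Scud covers in that time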

The following paragraph is excerpted from the GAO report.

The range gate's prediction of where the Scud will next appear is a function of the Scud's known velocity and the time of the last radar detection. Velocity is a real number that can be expressed as a whole number and a decimal (e.g., 3750.2563...miles per hour). Time is kept continuously by the system's internal clock in tenths of seconds but is expressed as an integer or whole number (e.g., 32, 33, 34...). The longer the system has been running, the larger the number representing time. To predict where the Scud will next appear, both time and velocity must be expressed as real numbers. Because of the way the Patriot computer performs its calculations and the fact that its registers are only 24 bits long, the conversion of time from an integer to a real number cannot be any more precise than 24 bits. This conversion results in a loss of precision causing a less accurate time calculation. The effect of this inaccuracy on the range gate's calculation is directly proportional to the target's velocity and the length of time the system has been running. Consequently, performing the conversion after the Patriot has been running continuously for extended periods causes the range gate to shift away from the center of the target, making it less likely that the target, in this case a Scud, will be successfully intercepted.

CALCULATION OF FUNCTIONS

Using hand calculations, a hand calculator, or a computer, what are the basic operations of which we are capable? In essence, they are addition, subtraction, multiplication, and division (and even this will usually require a truncation of the quotient at some point). In addition, we can make logical decisions, such as deciding which of the following are true for two real numbers a and b:

$$a > b, \qquad a = b, \qquad a < b$$

Furthermore, we can carry out only a finite number of such operations. If we limit ourselves to just addition, subtraction, and multiplication, then in evaluating functions f(x) we are limited to the evaluation of polynomials:

$$p(x) = a_0 + a_1 x + \cdots + a_n x^n$$

In this, n is the degree (provided $a_n \neq 0$) and $\{a_0, \ldots, a_n\}$ are the coefficients of the polynomial. Later we will discuss the efficient evaluation of polynomials; but for now, we ask how we are to evaluate other functions such as $e^x$, $\cos x$, $\log x$, and others.

TAYLOR POLYNOMIAL APPROXIMATIONS

We begin with an example, that of $f(x) = e^x$ from the text. Consider evaluating it for x near to 0. We look for a polynomial p(x) whose values will be the same as those of $e^x$ to within acceptable accuracy.

Begin with a linear polynomial $p(x) = a_0 + a_1 x$. Then to make its graph look like that of $e^x$, we ask that the graph of y = p(x) be tangent to that of $y = e^x$ at x = 0. Doing so leads to the formula

$$p(x) = 1 + x$$

Continue in this manner, looking next for a quadratic polynomial

$$p(x) = a_0 + a_1 x + a_2 x^2$$

We again make it tangent; and to determine $a_2$, we also ask that p(x) and $e^x$ have the same "curvature" at the origin. Combining these requirements, we have for $f(x) = e^x$ that

$$p(0) = f(0), \quad p'(0) = f'(0), \quad p''(0) = f''(0)$$

This yields the approximation

$$p(x) = 1 + x + \tfrac{1}{2}x^2$$
Page 55: Analiza Numerica [Utm, Bostan v.]
We continue this pattern, looking for a polynomial

$$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$$

We now require that

$$p(0) = f(0), \quad p'(0) = f'(0), \quad \ldots, \quad p^{(n)}(0) = f^{(n)}(0)$$

This leads to the formula

$$p(x) = 1 + x + \tfrac{1}{2}x^2 + \cdots + \frac{1}{n!}x^n$$

What are the problems when evaluating points x that are far from 0?

TAYLOR'S APPROXIMATION FORMULA

Let f(x) be a given function, and assume it has derivatives around some point x = a (with as many derivatives as we find necessary). We seek a polynomial p(x) of degree at most n, for some non-negative integer n, which will approximate f(x) by satisfying the following conditions:

$$p(a) = f(a), \quad p'(a) = f'(a), \quad p''(a) = f''(a), \quad \ldots, \quad p^{(n)}(a) = f^{(n)}(a)$$

The general formula for this polynomial is

$$p_n(x) = f(a) + (x-a)f'(a) + \frac{1}{2!}(x-a)^2 f''(a) + \cdots + \frac{1}{n!}(x-a)^n f^{(n)}(a)$$
Then f(x) ≈ pn(x) for x close to a.

TAYLOR POLYNOMIALS FOR f(x) = log x

In this case, we expand about the point x = 1, making the polynomial tangent to the graph of f(x) = log x at the point x = 1. For a general degree n ≥ 1, this results in the polynomial

$$p_n(x) = (x-1) - \tfrac{1}{2}(x-1)^2 + \tfrac{1}{3}(x-1)^3 - \cdots + (-1)^{n-1}\tfrac{1}{n}(x-1)^n$$

Note the graphs of these polynomials for varying n.

THE TAYLOR POLYNOMIAL ERROR FORMULA

Let f(x) be a given function, and assume it has derivatives around some point x = a (with as many derivatives as we find necessary). For the error in the Taylor polynomial $p_n(x)$, we have the formulas

$$f(x) - p_n(x) = \frac{1}{(n+1)!}(x-a)^{n+1} f^{(n+1)}(c_x) = \frac{1}{n!}\int_a^x (x-t)^n f^{(n+1)}(t)\,dt$$

The point $c_x$ is restricted to the interval bounded by x and a, and otherwise $c_x$ is unknown. We will use the first form of this error formula, although the second is more precise in that you do not need to deal with the unknown point $c_x$.

Consider the special case of n = 0. Then the Taylor polynomial is the constant function:

$$f(x) \approx p_0(x) = f(a)$$

The first form of the error formula becomes

$$f(x) - p_0(x) = f(x) - f(a) = (x-a)\,f'(c_x)$$

with $c_x$ between a and x. You have seen this in your beginning calculus course, and it is called the mean-value theorem. The error formula

$$f(x) - p_n(x) = \frac{1}{(n+1)!}(x-a)^{n+1} f^{(n+1)}(c_x)$$

can be considered a generalization of the mean-value theorem.

EXAMPLE: $f(x) = e^x$

For general n ≥ 0, and expanding $e^x$ about x = 0, we have that the degree n Taylor polynomial approximation is given by

$$p_n(x) = 1 + x + \frac{1}{2!}x^2 + \frac{1}{3!}x^3 + \cdots + \frac{1}{n!}x^n$$

For the derivatives of $f(x) = e^x$, we have

$$f^{(k)}(x) = e^x, \qquad f^{(k)}(0) = 1, \qquad k = 0, 1, 2, \ldots$$

For the error,

$$e^x - p_n(x) = \frac{1}{(n+1)!}x^{n+1}e^{c_x}$$

with $c_x$ located between 0 and x. Note that for $x \approx 0$, we must have $c_x \approx 0$ and

$$e^x - p_n(x) \approx \frac{1}{(n+1)!}x^{n+1}$$

This last term is also the final term in $p_{n+1}(x)$, and thus

$$e^x - p_n(x) \approx p_{n+1}(x) - p_n(x)$$

Consider calculating an approximation to e. Then let x = 1 in the earlier formulas to get

$$p_n(1) = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \cdots + \frac{1}{n!}$$

For the error,

$$e - p_n(1) = \frac{1}{(n+1)!}e^{c_x}, \qquad 0 \leq c_x \leq 1$$

To bound the error, we have $e^0 \leq e^{c_x} \leq e^1$, so

$$\frac{1}{(n+1)!} \leq e - p_n(1) \leq \frac{e}{(n+1)!}$$

To have an approximation accurate to within $10^{-5}$, we choose n large enough to have

$$\frac{e}{(n+1)!} \leq 10^{-5}$$

which is true if n ≥ 8. In fact,

$$e - p_8(1) \leq \frac{e}{9!} \doteq 7.5 \times 10^{-6}$$

Then calculate $p_8(1) \doteq 2.71827877$, and $e - p_8(1) \doteq 3.06 \times 10^{-6}$.
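A one-line MATLAB check of this calculation (added here as an illustration):

    % Minimal sketch: p_8(1) = sum of 1/k! for k = 0..8, compared with exp(1).
    p8  = sum(1 ./ factorial(0:8));   % 2.71827877...
    err = exp(1) - p8                 % about 3.06e-6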

FORMULAS OF STANDARD FUNCTIONS

$$\frac{1}{1-x} = 1 + x + x^2 + \cdots + x^n + \frac{x^{n+1}}{1-x}$$

$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + (-1)^m\frac{x^{2m}}{(2m)!} + (-1)^{m+1}\frac{x^{2m+2}}{(2m+2)!}\cos c_x$$

$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots + (-1)^{m-1}\frac{x^{2m-1}}{(2m-1)!} + (-1)^m\frac{x^{2m+1}}{(2m+1)!}\cos c_x$$

with $c_x$ between 0 and x.

OBTAINING TAYLOR FORMULAS

Most Taylor polynomials have been found by means other than using the formula

$$p_n(x) = f(a) + (x-a)f'(a) + \frac{1}{2!}(x-a)^2 f''(a) + \cdots + \frac{1}{n!}(x-a)^n f^{(n)}(a)$$

because of the difficulty of obtaining the derivatives $f^{(k)}(x)$ for larger values of k. Actually, this is now much easier, as we can use Maple or Mathematica. Nonetheless, most formulas have been obtained by manipulating standard formulas; and examples of this are given in the text.
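(As an added aside, MATLAB can also produce such expansions directly if the Symbolic Math Toolbox is available; the name-value syntax below is that of newer releases:)

    % Minimal sketch: Taylor expansions via the Symbolic Math Toolbox.
    syms x
    taylor(exp(x), x, 'Order', 5)                       % 1 + x + x^2/2 + x^3/6 + x^4/24
    taylor(log(x), x, 'ExpansionPoint', 1, 'Order', 4)  % (x-1) - (x-1)^2/2 + (x-1)^3/3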

For example, use

$$e^t = 1 + t + \frac{1}{2!}t^2 + \frac{1}{3!}t^3 + \cdots + \frac{1}{n!}t^n + \frac{1}{(n+1)!}t^{n+1}e^{c_t}$$

in which $c_t$ is between 0 and t. Let $t = -x^2$ to obtain

$$e^{-x^2} = 1 - x^2 + \frac{1}{2!}x^4 - \frac{1}{3!}x^6 + \cdots + \frac{(-1)^n}{n!}x^{2n} + \frac{(-1)^{n+1}}{(n+1)!}x^{2n+2}e^{-\xi_x}$$

Because $c_t$ must be between 0 and $-x^2$, it must be negative. Thus we let $c_t = -\xi_x$ in the error term, with $0 \leq \xi_x \leq x^2$.

EVALUATING A POLYNOMIAL

Consider having a polynomial

$$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$$

which you need to evaluate for many values of x. How do you evaluate it? This may seem a strange question, but the answer is not as obvious as you might think.

The standard way, written in a loose algorithmic format:

    poly = a0
    for j = 1:n
        poly = poly + aj * x^j
    end

To compare the costs of different numerical methods, we do an operations count, and then we compare these for the competing methods. Above, the counts are as follows:

additions: $n$
multiplications: $1 + 2 + 3 + \cdots + n = \dfrac{n(n+1)}{2}$

This assumes each term $a_j x^j$ is computed independently of the remaining terms in the polynomial.

Next, compute the terms $x^j$ recursively:

$$x^j = x \cdot x^{j-1}$$

Then computing $x^2, x^3, \ldots, x^n$ will cost $n-1$ multiplications. Our algorithm becomes

    poly = a0 + a1*x
    power = x
    for j = 2:n
        power = x * power
        poly = poly + aj * power
    end

The total operations cost is

additions: $n$
multiplications: $n + (n-1) = 2n - 1$

When n is even moderately large, this is much less than for the first method of evaluating p(x). For example, with n = 20, the first method has 210 multiplications, whereas the second has 39 multiplications.

We now consider nested multiplication. As examples of particular degrees, write

n = 2:  $p(x) = a_0 + x(a_1 + a_2 x)$
n = 3:  $p(x) = a_0 + x(a_1 + x(a_2 + a_3 x))$
n = 4:  $p(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + a_4 x)))$

These contain, respectively, 2, 3, and 4 multiplications. This is less than the preceding method, which would have needed 3, 5, and 7 multiplications, respectively.

For the general case, write

$$p(x) = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + a_n x)\cdots))$$

This requires n multiplications, which is only about half that for the preceding method. For an algorithm, write

    poly = an
    for j = n-1:-1:0
        poly = aj + x * poly
    end

With all three methods, the number of additions is n; but the number of multiplications can be dramatically different for large values of n.
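As a concrete MATLAB version of this last algorithm (an added sketch, saved as horner.m; MATLAB vectors are 1-based, so a(j+1) holds the coefficient $a_j$):

    % Minimal sketch: Horner's rule for p(x) = a0 + a1*x + ... + an*x^n,
    % with coefficient vector a = [a0 a1 ... an].
    function p = horner(a, x)
        n = length(a) - 1;       % polynomial degree
        p = a(n+1);              % start from the leading coefficient
        for j = n-1:-1:0
            p = a(j+1) + x .* p; % one multiply and one add per coefficient
        end
    end

For example, horner([1 1 1/2], 0.1) evaluates $1 + x + \tfrac{1}{2}x^2$ at x = 0.1, giving 1.105.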

NESTED MULTIPLICATION

Imagine we are evaluating the polynomial

$$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$$

at a point x = z. Thus with nested multiplication

$$p(z) = a_0 + z(a_1 + z(a_2 + \cdots + z(a_{n-1} + a_n z)\cdots))$$

We can write this as the following sequence of operations:

$$b_n = a_n, \quad b_{n-1} = a_{n-1} + z b_n, \quad b_{n-2} = a_{n-2} + z b_{n-1}, \quad \ldots, \quad b_0 = a_0 + z b_1$$

The quantities $b_{n-1}, \ldots, b_0$ are simply the quantities in parentheses, starting from the innermost and working outward.

Introduce

$$q(x) = b_1 + b_2 x + b_3 x^2 + \cdots + b_n x^{n-1}$$

Claim:

$$p(x) = b_0 + (x - z)\,q(x) \qquad (*)$$

Proof: Simply expand

$$b_0 + (x - z)\left(b_1 + b_2 x + b_3 x^2 + \cdots + b_n x^{n-1}\right)$$

and use the fact that

$$z b_j = b_{j-1} - a_{j-1}, \qquad j = 1, \ldots, n$$

With this result (*), we have

$$\frac{p(x)}{x - z} = \frac{b_0}{x - z} + q(x)$$

Thus q(x) is the quotient when dividing p(x) by $x - z$, and $b_0$ is the remainder.

If z is a zero of p(x), then $b_0 = 0$; and then

$$p(x) = (x - z)\,q(x)$$

For the remaining roots of p(x), we can concentrate on finding those of q(x). In rootfinding for polynomials, this process of reducing the size of the problem is called deflation.

Another consequence of (*) is the following. Form the derivative of (*) with respect to x, obtaining

$$p'(x) = (x - z)\,q'(x) + q(x), \qquad p'(z) = q(z)$$

Thus to evaluate p(x) and p'(x) simultaneously at x = z, we can use nested multiplication for p(z) and we can use the intermediate steps of this to also evaluate p'(z). This is useful when doing rootfinding problems for polynomials by means of Newton's method.
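A sketch of this double evaluation in MATLAB (an added illustration, saved as horner2.m; the coefficient vector is again assumed to be a = [a0 a1 ... an]):

    % Minimal sketch: evaluate p(z) and p'(z) together by nested multiplication.
    % The b-sequence ends at b0 = p(z); running Horner on the b's gives q(z) = p'(z).
    function [pz, dpz] = horner2(a, z)
        n   = length(a) - 1;
        pz  = a(n+1);    % accumulates b_j, ends at b_0 = p(z)
        dpz = 0;         % accumulates q(z) = p'(z)
        for j = n-1:-1:0
            dpz = pz + z * dpz;     % Horner step on the b-coefficients
            pz  = a(j+1) + z * pz;  % b_j = a_j + z*b_{j+1}
        end
    end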

APPROXIMATING SF(x)

Define

$$SF(x) = \frac{1}{x}\int_0^x \frac{\sin t}{t}\,dt, \qquad x \neq 0$$

We use Taylor polynomials to approximate this function, to obtain a way to compute it with accuracy and simplicity.

[Figure: graph of y = SF(x) for -8 ≤ x ≤ 8; the y-axis ticks shown are 0.5 and 1.0.]

As an example, begin with the degree 3 Taylor approximation to $\sin t$, expanded about t = 0:

$$\sin t = t - \frac{1}{6}t^3 + \frac{1}{120}t^5\cos c_t$$

with $c_t$ between 0 and t. Then

$$\frac{\sin t}{t} = 1 - \frac{1}{6}t^2 + \frac{1}{120}t^4\cos c_t$$

$$\int_0^x \frac{\sin t}{t}\,dt = \int_0^x\left[1 - \frac{1}{6}t^2 + \frac{1}{120}t^4\cos c_t\right]dt = x - \frac{1}{18}x^3 + \frac{1}{120}\int_0^x t^4\cos c_t\,dt$$

$$\frac{1}{x}\int_0^x \frac{\sin t}{t}\,dt = 1 - \frac{1}{18}x^2 + R_2(x), \qquad R_2(x) = \frac{1}{120}\cdot\frac{1}{x}\int_0^x t^4\cos c_t\,dt$$

How large is the error in the approximation

$$SF(x) \approx 1 - \frac{1}{18}x^2$$

on the interval [-1, 1]? Since $|\cos c_t| \leq 1$, we have for x > 0 that

$$0 \leq R_2(x) \leq \frac{1}{120}\cdot\frac{1}{x}\int_0^x t^4\,dt = \frac{1}{600}x^4$$

and the same result can be shown for x < 0. Then for $|x| \leq 1$, we have

$$0 \leq R_2(x) \leq \frac{1}{600}$$

To obtain a more accurate approximation, we can proceed exactly as above, but simply use a higher degree approximation to $\sin t$.
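To see the bound in action (an added sketch; integral is the modern MATLAB quadrature routine, quadgk in older releases):

    % Minimal sketch: check SF(x) ~ 1 - x^2/18 against numerical quadrature.
    SF = @(x) integral(@(t) sin(t)./t, 0, x) / x;   % reference value
    x  = 1;
    err = SF(x) - (1 - x^2/18)   % positive and below 1/600 = 0.00167, as predicted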

BINARY INTEGERS

A binary integer x is a finite sequence of the digits 0 and 1, which we write symbolically as

$$x = (a_m a_{m-1}\cdots a_2 a_1 a_0)_2$$

where I insert the parentheses with subscript $(\;)_2$ in order to make clear that the number is binary. The above has the decimal equivalent

$$x = a_m 2^m + a_{m-1}2^{m-1} + \cdots + a_1 2^1 + a_0$$

For example, the binary integer $x = (110101)_2$ has the decimal value

$$x = 2^5 + 2^4 + 2^2 + 2^0 = 53$$

The binary integer $x = (111\cdots1)_2$ with m ones has the decimal value

$$x = 2^{m-1} + \cdots + 2^1 + 1 = 2^m - 1$$

DECIMAL TO BINARY INTEGER CONVERSION

Given a decimal integer x, we write

$$x = (a_m a_{m-1}\cdots a_2 a_1 a_0)_2 = a_m 2^m + a_{m-1}2^{m-1} + \cdots + a_1 2^1 + a_0$$

Divide x by 2, calling the quotient $x_1$. The remainder is $a_0$, and

$$x_1 = a_m 2^{m-1} + a_{m-1}2^{m-2} + \cdots + a_1 2^0$$

Continue the process. Divide $x_1$ by 2, calling the quotient $x_2$. The remainder is $a_1$, and

$$x_2 = a_m 2^{m-2} + a_{m-1}2^{m-3} + \cdots + a_2 2^0$$

After a finite number of such steps, we will obtain all of the coefficients $a_i$, and the final quotient will be zero.

Try this with a few decimal integers.

EXAMPLE

The following shortened form of the above method is convenient for hand computation. Convert $(11)_{10}$ to binary.

$$\lfloor 11/2 \rfloor = 5 = x_1, \quad a_0 = 1$$
$$\lfloor 5/2 \rfloor = 2 = x_2, \quad a_1 = 1$$
$$\lfloor 2/2 \rfloor = 1 = x_3, \quad a_2 = 0$$
$$\lfloor 1/2 \rfloor = 0 = x_4, \quad a_3 = 1$$

In this, the notation $\lfloor b \rfloor$ denotes the largest integer $\leq b$; so $\lfloor n/2 \rfloor$ is the quotient resulting from dividing n by 2. From the above calculation, $(11)_{10} = (1011)_2$.
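The same repeated-division loop in MATLAB (an added sketch):

    % Minimal sketch: decimal integer -> binary digit vector by repeated division.
    x = 11; bits = [];
    while x > 0
        bits = [rem(x, 2), bits];  % next remainder is the next binary digit
        x = floor(x / 2);          % integer quotient
    end
    bits   % [1 0 1 1], i.e. (1011)_2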

BINARY FRACTIONS

A binary fraction x is a sequence (possibly infinite) of the digits 0 and 1:

$$x = (.a_1 a_2 a_3 \cdots a_m \cdots)_2 = a_1 2^{-1} + a_2 2^{-2} + a_3 2^{-3} + \cdots$$

For example, $x = (.1101)_2$ has the decimal value

$$x = 2^{-1} + 2^{-2} + 2^{-4} = .5 + .25 + .0625 = 0.8125$$

Recall the formula for the geometric series

$$\sum_{i=0}^{n} r^i = \frac{1 - r^{n+1}}{1 - r}, \qquad r \neq 1$$

Letting $n \to \infty$ with $|r| < 1$, we obtain the formula

$$\sum_{i=0}^{\infty} r^i = \frac{1}{1 - r}, \qquad |r| < 1$$

Using this,

$$(.0101010101\cdots)_2 = 2^{-2} + 2^{-4} + 2^{-6} + \cdots = 2^{-2}\left(1 + 2^{-2} + 2^{-4} + \cdots\right)$$

which sums to the fraction 1/3.

Also,

$$(.110011001100\cdots)_2 = 2^{-1} + 2^{-2} + 2^{-5} + 2^{-6} + \cdots$$

and this sums to the decimal fraction $0.8 = \frac{8}{10}$.

DECIMAL TO BINARY FRACTION CONVERSION

In

$$x_1 = (.a_1 a_2 a_3 \cdots a_m \cdots)_2 = a_1 2^{-1} + a_2 2^{-2} + a_3 2^{-3} + \cdots$$

we multiply by 2. The integer part will be $a_1$; and after it is removed we have the binary fraction

$$x_2 = (.a_2 a_3 \cdots a_m \cdots)_2 = a_2 2^{-1} + a_3 2^{-2} + a_4 2^{-3} + \cdots$$

Again multiply by 2, obtaining $a_2$ as the integer part of $2x_2$. After removing $a_2$, let $x_3$ denote the remaining number. Continue this process as far as needed.

For example, with $x = 1/5$, we have

$$x_1 = .2; \quad 2x_1 = .4; \quad x_2 = .4 \text{ and } a_1 = 0$$
$$2x_2 = .8; \quad x_3 = .8 \text{ and } a_2 = 0$$
$$2x_3 = 1.6; \quad x_4 = .6 \text{ and } a_3 = 1$$

Continue this to get the pattern

$$(.2)_{10} = (.00110011001100\cdots)_2$$
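The corresponding doubling loop in MATLAB (an added sketch):

    % Minimal sketch: decimal fraction -> first k binary digits by repeated doubling.
    x = 0.2; k = 12; bits = zeros(1, k);
    for j = 1:k
        x = 2 * x;
        bits(j) = floor(x);   % integer part is the next digit
        x = x - bits(j);      % keep the fractional part
    end
    bits   % 0 0 1 1 0 0 1 1 0 0 1 1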

DECIMAL FLOATING-POINT NUMBERS

Floating point notation is akin to what is called scientific notation. For a nonzero number x, we can write it in the form

$$x = \sigma \cdot \bar{x} \cdot 10^e$$

where $\sigma = \pm 1$, e is an integer, and $1 \leq \bar{x} < 10$. The number $\sigma$ is called the sign, e is the exponent, and $\bar{x}$ is the significand or mantissa.

For example,

$$345.78 = 3.4578 \times 10^2$$

where $\sigma = +1$, $e = 2$, $\bar{x} = 3.4578$.

On a decimal computer or calculator, we store x by instead storing $\sigma$, e and $\bar{x}$. We must restrict the number of digits in $\bar{x}$ and the size of the exponent e. The number of digits in $\bar{x}$ is called the precision.

For example, on an HP-15C calculator, the precision is 10, and the exponent is restricted to $-99 \leq e \leq 99$.

BINARY FLOATING-POINT NUMBERS

We now do something similar with the binary representation of a number x. Write

$$x = \sigma \cdot \bar{x} \cdot 2^e$$

with $1 \leq \bar{x} < (10)_2 = (2)_{10}$ and e an integer.

For example,

$$(0.1)_{10} = (.000110011001100\ldots)_2 = \underbrace{+}_{\sigma=+1}\,\underbrace{(1.10011001100\ldots)_2}_{\bar{x}} \times 2^{\overbrace{-4}^{e}}$$

The number x is stored in the computer by storing the $\sigma$, $\bar{x}$, and e. On all computers, there are restrictions on the number of digits in $\bar{x}$ and the size of e.

FLOATING POINT NUMBERS

When a number x outside a computer or calculator is converted into a machine number, we denote it by fl(x). On an HP calculator,

$$fl(.3333\ldots) = (3.333333333)_{10} \times 10^{-1}$$

The decimal fraction of infinite length will not fit in the registers of the calculator, but the latter 10-digit number will fit. Some calculators actually carry more digits internally than they allow to be displayed. On a binary computer, we use a similar notation.

We will concentrate on a particular form of computer floating point number, that is called the IEEE floating point standard.

Example 1. Consider a binary floating point representation with precision 3, and $e_{min} = -2 \leq e \leq 2 = e_{max}$. All the numbers admitted by this representation are presented in the table:

    x̄ \ e       -2           -1          0          1         2
    (1.00)_2    (0.25)_10    (0.5)_10    (1)_10     (2)_10    (4)_10
    (1.01)_2    (0.3125)_10  (0.625)_10  (1.25)_10  (2.5)_10  (5)_10
    (1.10)_2    (0.375)_10   (0.75)_10   (1.5)_10   (3)_10    (6)_10
    (1.11)_2    (0.4375)_10  (0.875)_10  (1.75)_10  (3.5)_10  (7)_10

[Figure: these values marked on the number line from 0 to 7.]

This representation can be extended to include smaller numbers, called denormalized numbers. These numbers are obtained if $e = e_{min}$ and the first digit of the significand is 0.

Example 2. Previous example plus denormalized numbers:

$$(0.01)_2 \times 2^{-2} = \frac{1}{16} = (0.0625)_{10}$$
$$(0.10)_2 \times 2^{-2} = \frac{2}{16} = (0.125)_{10}$$
$$(0.11)_2 \times 2^{-2} = \frac{3}{16} = (0.1875)_{10}$$

[Figure: the number line from 0 to 7 with the denormalized numbers added near 0.]
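A small MATLAB sketch (added here) that enumerates this toy system, normalized and denormalized numbers together:

    % Minimal sketch: enumerate a toy binary floating point system
    % with precision 3 and exponent range -2..2.
    vals = [];
    for e = -2:2
        for f = 0:3                          % two fraction bits a1 a2
            vals(end+1) = (1 + f/4) * 2^e;   % normalized: (1.a1a2)_2 * 2^e
        end
    end
    denorm = (1:3)/4 * 2^(-2);               % denormalized: (0.a1a2)_2 * 2^emin
    sort([denorm, vals])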

IEEE SINGLE PRECISION STANDARD

In IEEE single precision, 32 bits are used to store numbers. A number is written as

$$x = \sigma \cdot (1.a_1 a_2 \ldots a_{23})_2 \cdot 2^e$$

The significand $\bar{x} = (1.a_1 a_2 \cdots a_{23})_2$ immediately satisfies $1 \leq \bar{x} < 2$.

What are the limits on e? To understand the limits on e and the number of binary digits chosen for $\bar{x}$, we must look roughly at how the number x will be stored in the computer.

Basically, we store $\sigma$ as a single bit, the significand $\bar{x}$ as 24 bits (only 23 need be stored), and the exponent fills out 8 bits, including both negative and positive integers.

Roughly speaking, we have that e must satisfy

$$-(\underbrace{1111111}_{7})_2 \leq e \leq (\underbrace{1111111}_{7})_2$$

or in decimal

$$-127 \leq e \leq 127$$

In actuality, the limits are

$$-126 \leq e \leq 127$$

for reasons related to the storage of 0 and other numbers such as $\pm\infty$. In order to avoid a sign for the exponent, denote $E = e + 127$. Obviously, $1 \leq E \leq 254$, with two additional values: 0 and 255.

The storage layout is

    σ     E            x̄
    b1    b2 ... b9    b10 ... b32

The number x = 0 is stored in the following way: E = 0, $\sigma = 0$, and $b_{10}b_{11}\ldots b_{32} = (00\ldots0)_2$.

The stored exponent field E and the corresponding number x:

    E = (b2...b9)_2         e      x
    (00000000)_2 = 0       -127   ±(0.b10...b32)_2 × 2^(-126)
    (00000001)_2 = 1       -126   ±(1.b10...b32)_2 × 2^(-126)
    (00000010)_2 = 2       -125   ±(1.b10...b32)_2 × 2^(-125)
       ...                  ...    ...
    (01111111)_2 = 127       0    ±(1.b10...b32)_2 × 2^0
    (10000000)_2 = 128       1    ±(1.b10...b32)_2 × 2^1
       ...                  ...    ...
    (11111101)_2 = 253      126   ±(1.b10...b32)_2 × 2^126
    (11111110)_2 = 254      127   ±(1.b10...b32)_2 × 2^127
    (11111111)_2 = 255      128   ±∞ if all b_i = 0, NaN otherwise

IEEE DOUBLE PRECISION STANDARD

$$x = \sigma \cdot (1.a_1 a_2 \ldots a_{52})_2 \cdot 2^e$$

The storage layout is

    σ     E             x̄
    b1    b2 ... b12    b13 ... b64

where E = e + 1023.

    E = (b2...b12)_2           e        x
    (00000000000)_2 = 0       -1023    ±(0.b13...b64)_2 × 2^(-1022)
    (00000000001)_2 = 1       -1022    ±(1.b13...b64)_2 × 2^(-1022)
    (00000000010)_2 = 2       -1021    ±(1.b13...b64)_2 × 2^(-1021)
       ...                     ...      ...
    (01111111111)_2 = 1023      0      ±(1.b13...b64)_2 × 2^0
    (10000000000)_2 = 1024      1      ±(1.b13...b64)_2 × 2^1
       ...                     ...      ...
    (11111111101)_2 = 2045     1022    ±(1.b13...b64)_2 × 2^1022
    (11111111110)_2 = 2046     1023    ±(1.b13...b64)_2 × 2^1023
    (11111111111)_2 = 2047     1024    ±∞ if all b_i = 0, NaN otherwise

What is the connection of the 24 bits in the significand $\bar{x}$ to the number of decimal digits in the storage of a number x in floating point form? One way of answering this is to find the integer M for which

1. $0 < x \leq M$ and x an integer implies $fl(x) = x$; and

2. $fl(M+1) \neq M+1$.

This integer M is at least as big as

$$(\underbrace{11\ldots1}_{24\text{ ones}})_2 = (1.\underbrace{11\ldots1}_{23\text{ ones}})_2 \times 2^{23} = 2^{23} + 2^{22} + \cdots + 2^0 = 2^{24} - 1$$

Also, $2^{24} = (1.00\ldots0)_2 \times 2^{24}$ will be stored exactly. The next integer, $2^{24} + 1$, cannot be stored exactly, since its significand would contain 24 + 1 binary digits:

$$2^{24} + 1 = (1.\underbrace{00\ldots0}_{23\text{ zeros}}1)_2 \times 2^{24}$$

Therefore for single precision $M = 2^{24}$. Any integer less than or equal to M will be stored exactly. So

$$M = 2^{24} = 16777216$$

For the IEEE double precision standard we have

$$M = 2^{53} \approx 9.0 \times 10^{15}$$
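This is easy to see in MATLAB (an added illustration; flintmax requires a newer MATLAB release):

    % Minimal sketch: the largest "safe" integers in single and double precision.
    single(16777216) + single(1) == single(16777216)   % true: 2^24 + 1 rounds back
    2^53 + 1 == 2^53                                   % true in double as well
    flintmax('single')   % 16777216 = 2^24
    flintmax             % 9007199254740992 = 2^53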

THE MACHINE EPSILON

Let y be the smallest number representable in the machine arithmetic that is greater than 1 in the machine. The machine epsilon is $\eta = y - 1$. It is a widely used measure of the accuracy possible in representing numbers in the machine.

The number 1 has the simple floating point representation

$$1 = (1.00\cdots0)_2 \cdot 2^0$$

What is the smallest number that is greater than 1? It is

$$1 + 2^{-23} = (1.0\cdots01)_2 \cdot 2^0 > 1$$

and the machine epsilon in IEEE single precision floating point format is $\eta = 2^{-23} \doteq 1.19 \times 10^{-7}$.
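In MATLAB (an added note), eps returns exactly this quantity:

    % Minimal sketch: machine epsilon in double and single precision.
    eps            % 2^-52, about 2.22e-16 (double precision)
    eps('single')  % 2^-23, about 1.19e-7  (single precision)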

THE UNIT ROUND

Consider the smallest number $\delta > 0$ that is representable in the machine and for which

$$1 + \delta > 1$$

in the arithmetic of the machine.

For any number $0 < \alpha < \delta$, the result of $1 + \alpha$ is exactly 1 in the machine's arithmetic. Thus $\alpha$ 'drops off the end' of the floating point representation in the machine. The size of $\delta$ is another way of describing the accuracy attainable in the floating point representation of the machine. The machine epsilon has been replacing it in recent years.

It is not too difficult to derive $\delta$. The number 1 has the simple floating point representation

$$1 = (1.00\cdots0)_2 \cdot 2^0$$

What is the smallest number which can be added to this without disappearing? Certainly we can write

$$1 + 2^{-23} = (1.0\cdots01)_2 \cdot 2^0 > 1$$

Past this point, we need to know whether we are using chopped arithmetic or rounded arithmetic. We will shortly look at both of these. With chopped arithmetic, $\delta = 2^{-23}$; and with rounded arithmetic, $\delta = 2^{-24}$.

ROUNDING AND CHOPPING

Let us first consider these concepts with decimal arithmetic. We write a computer floating point number z as

$$z = \sigma \cdot \zeta \cdot 10^e \equiv \sigma \cdot (a_1.a_2\cdots a_n)_{10} \cdot 10^e$$

with $a_1 \neq 0$, so that there are n decimal digits in the significand $(a_1.a_2\cdots a_n)_{10}$.

Given a general number

$$x = \sigma \cdot (a_1.a_2\cdots a_n \cdots)_{10} \cdot 10^e, \qquad a_1 \neq 0$$

we must shorten it to fit within the computer. This is done by either chopping or rounding. The floating point chopped version of x is given by

$$fl(x) = \sigma \cdot (a_1.a_2\cdots a_n)_{10} \cdot 10^e$$

where we assume that e fits within the bounds required by the computer or calculator.

For the rounded version, we must decide whether to round up or round down. A simplified formula is

$$fl(x) = \begin{cases} \sigma \cdot (a_1.a_2\cdots a_n)_{10} \cdot 10^e, & a_{n+1} < 5 \\ \sigma \cdot \left[(a_1.a_2\cdots a_n)_{10} + (0.0\cdots1)_{10}\right] \cdot 10^e, & a_{n+1} \geq 5 \end{cases}$$

The term $(0.0\cdots1)_{10}$ denotes $10^{-n+1}$, giving the ordinary sense of rounding with which you are familiar. In the single case

$$(0.0\cdots0\,a_{n+1}a_{n+2}\cdots)_{10} = (0.0\cdots0500\cdots)_{10}$$

a more elaborate procedure is used so as to assure an unbiased rounding.

CHOPPING/ROUNDING IN BINARY

Let

$$x = \sigma \cdot (1.a_2\cdots a_n \cdots)_2 \cdot 2^e$$

with all $a_i$ equal to 0 or 1. Then for a chopped floating point representation, we have

$$fl(x) = \sigma \cdot (1.a_2\cdots a_n)_2 \cdot 2^e$$

For a rounded floating point representation, we have

$$fl(x) = \begin{cases} \sigma \cdot (1.a_2\cdots a_n)_2 \cdot 2^e, & a_{n+1} = 0 \\ \sigma \cdot \left[(1.a_2\cdots a_n)_2 + (0.0\cdots1)_2\right] \cdot 2^e, & a_{n+1} = 1 \end{cases}$$

ERRORS

The error $x - fl(x) = 0$ when x needs no change to be put into the computer or calculator. Of more interest is the case when the error is nonzero. Consider first the case x > 0 (meaning $\sigma = +1$). The case with x < 0 is the same, except for the sign being opposite.

With $x \neq fl(x)$, and using chopping, we have

$$fl(x) < x$$

and the error $x - fl(x)$ is always positive. This later has major consequences in extended numerical computations. With $x \neq fl(x)$ and rounding, the error $x - fl(x)$ is negative for half the values of x, and it is positive for the other half of possible values of x.

We often write the relative error as

$$\frac{x - fl(x)}{x} = -\varepsilon$$

This can be expanded to obtain

$$fl(x) = (1 + \varepsilon)\,x$$

Thus fl(x) can be considered as a perturbed value of x. This is used in many analyses of the effects of chopping and rounding errors in numerical computations.

For bounds on $\varepsilon$, we have

$$-2^{-n} \leq \varepsilon \leq 2^{-n} \quad \text{(rounding)}, \qquad -2^{-n+1} \leq \varepsilon \leq 0 \quad \text{(chopping)}$$

IEEE ARITHMETIC

We are only giving the minimal characteristics of IEEE arithmetic. There are many options available on the types of arithmetic and the chopping/rounding. The default arithmetic uses rounding.

Single precision arithmetic:

$$n = 24, \qquad -126 \leq e \leq 127$$

This results in

$$M = 2^{24} = 16777216, \qquad \eta = 2^{-23} \doteq 1.19 \times 10^{-7}$$

Double precision arithmetic:

$$n = 53, \qquad -1022 \leq e \leq 1023$$

What are M and $\eta$?

There is also an extended representation, having n = 69 digits in its significand.

MATLAB can be used to generate the binary floating point representation of a number.

Execute in MATLAB the command:

    format hex

This will cause all subsequent numerical output to the screen to be given in hexadecimal format (base 16). For example, listing the number 7.125 results in an output of

    401c800000000000

The 16 hexadecimal digits are

    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}

To obtain the binary representation, convert each hexadecimal digit to a four digit binary number according to the table below.

    hex  binary      hex  binary
    0    0000        8    1000
    1    0001        9    1001
    2    0010        a    1010
    3    0011        b    1011
    4    0100        c    1100
    5    0101        d    1101
    6    0110        e    1110
    7    0111        f    1111

For the above number, we obtain the binary expansion

    4    0    1    c    8    0    ...  0
    0100 0000 0001 1100 1000 0000 ... 0000

Grouping these bits as sign, exponent, and significand:

$$\underbrace{0}_{\sigma}\ \underbrace{10000000001}_{E}\ \underbrace{1100100000000000\ldots0000}_{1.b_{13}b_{14}\ldots b_{64}\,=\,\bar{x}}$$

which provides us with the IEEE double precision representation of 7.125.
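An alternative way to get the bits directly (an added illustration; typecast and dec2bin are standard MATLAB functions, though exact uint64 support in dec2bin requires a recent release):

    % Minimal sketch: show the 64-bit pattern of 7.125 directly.
    bits = dec2bin(typecast(7.125, 'uint64'), 64)
    % sign = bits(1), exponent = bits(2:12), significand = bits(13:64)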

SOME DEFINITIONS

Let $x_T$ denote the true value of some number, usually unknown in practice; and let $x_A$ denote an approximation of $x_T$.

The error in $x_A$ is

$$error(x_A) = x_T - x_A$$

The relative error in $x_A$ is

$$rel(x_A) = \frac{error(x_A)}{x_T} = \frac{x_T - x_A}{x_T}$$

Example: $x_T = e$, $x_A = \frac{19}{7}$. Then

$$error(x_A) = e - \frac{19}{7} = 0.003996, \qquad rel(x_A) = \frac{0.003996}{e} = 0.00147$$
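Checking in MATLAB (added):

    % Minimal sketch: error and relative error of 19/7 as an approximation to e.
    xT  = exp(1);  xA = 19/7;
    err = xT - xA      % about 0.003996
    rel = err / xT     % about 0.00147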

Relative error is more exact in representing the difference between the true value and the approximated one.

Example: Suppose the distance between two cities is $D_T = 100$ km and let this distance be approximated with $D_A = 99$ km. In this case,

$$Err(D_A) = D_T - D_A = 1 \text{ km}, \qquad Rel(D_A) = \frac{Err(D_A)}{D_T} = 0.01 = 1\%$$

Now, suppose that distance is $d_T = 2$ km and estimate it with $d_A = 1$ km. Then

$$Err(d_A) = d_T - d_A = 1 \text{ km}, \qquad Rel(d_A) = \frac{Err(d_A)}{d_T} = 0.5 = 50\%$$

In both cases the error is the same. But, obviously, $D_A$ is a better approximation of $D_T$ than $d_A$ is of $d_T$.

Numerical Analysis
conf. dr. Bostan Viorel
Fall 2010, Lecture 3

Sources of Error

The sources of error in the computation of the solution of a mathematical model for some physical situation can be roughly characterised as follows:

1. Modelling Error.
Consider the example of a projectile of mass m that is travelling through the earth's atmosphere. A simple and often used description of projectile motion is given by

$$m\,\frac{d^2\vec{r}}{dt^2}(t) = -mg\,\vec{k} - b\,\frac{d\vec{r}}{dt}$$

with $b \geq 0$. In this, $\vec{r}(t)$ is the vector position of the projectile; and the final term in the equation represents the friction force in air. If there is an error in this model of a physical situation, then the numerical solution of this equation is not going to improve the results.

2. Physical / Observational / Measurement Error.
The radius of an electron is given by

$$(2.81777 + \varepsilon) \times 10^{-13} \text{ cm}, \qquad |\varepsilon| \leq 0.00011$$

This error cannot be removed, and it must affect the accuracy of any computation in which it is used. We need to be aware of these effects and to arrange the computation so as to minimize them.

3. Approximation Error.
This is also called "discretization error" and "truncation error"; and it is the main source of error with which we deal in this course. Such errors generally occur when we replace a computationally unsolvable problem with a nearby problem that is more tractable computationally.

For example, the Taylor polynomial approximation

$$e^x \approx 1 + x + \tfrac{1}{2}x^2$$

contains an "approximation error". The numerical integration

$$\int_0^1 f(x)\,dx \approx \frac{1}{N}\sum_{j=1}^{N} f\!\left(\frac{j}{N}\right)$$

contains an approximation error.

4. Finiteness of Algorithm Error.
This is an error due to stopping an algorithm after a finite number of iterations. Even if theoretically an algorithm can run for an indefinite time, after a finite (usually specified) number of iterations the algorithm will be stopped.

Page 121: Analiza Numerica [Utm, Bostan v.]

Sources of Error

5. Blunders. In the pre-computer era, blunders were mostly arithmetic errors. In the early years of the computer era, the typical blunder was a programming bug. Present-day "blunders" are still often programming errors, but now they are much more difficult to find, as they are often embedded in very large codes which may mask their effect.

Some simple rules to decrease the risk of having a bug in the code:

Break programs into small, testable subprograms;

Run test cases for which you know the outcome;

When running the full code, keep a skeptical eye on the output, checking whether it is reasonable or not.


Sources of Error

6. Rounding/Chopping Error. This is the main source of many problems, especially in solving systems of linear equations. We look at the effects of such errors later.


Sources of Error

7. Finiteness of Precision Errors. All numbers stored in computer memory are subject to the finiteness of the space allocated for their storage.


Pendulum Example

Original problem in engineering or in science to be solved:

[Diagram: a pendulum of length l at angle θ, with string tension T and weight mg.]

Model this physical problem mathematically. Newton's second law provides us with

θ'' = -(g/l) sin θ,

or, written as a first-order system,

θ' = ω,
ω' = -(g/l) sin θ.


Pendulum Example

Problem of continuous mathematics:

θ' = ω,
ω' = -(g/l) sin θ.

Errors introduced at this stage: Modeling Errors, Physical Errors.


Pendulum Example

Mathematical algorithm (discretize with step size h):

θ_{n+1} = θ_n + h ω_{n+1},
ω_{n+1} = ω_n - h (g/l) sin(θ_n).

Errors introduced at this stage: Discretization Errors, Finiteness of Algorithm Errors.


Pendulum Example

Computer implementation (in MATLAB):

    for i = 1:Nmax
        Omega = Omega - H*g/L*sin(Theta);
        Theta = Theta + H*Omega;
    end

Errors introduced at this stage: Rounding/Chopping Errors, Bugs in the Code, Finite Precision Errors.


Loss of significance errors

This can be considered a source of error or a consequence of the finiteness of calculator and computer arithmetic.

Example. Define

f(x) = x(√(x+1) - √x)

and consider evaluating it on a 6-digit decimal calculator which uses rounded arithmetic.

x        Computed f(x)   True f(x)   Error
1        0.414221        0.414214     7.0000e-006
10       1.54340         1.54347     -7.0000e-005
100      4.99000         4.98756      0.0024
1000     15.8000         15.8074     -0.0074
10000    50.0000         49.9988      0.0012
100000   100.000         158.113    -58.1130


Loss of significance errors

Example. Define

g(x) = (1 - cos x)/x^2

and consider evaluating it on a 10-digit decimal calculator which uses rounded arithmetic.

x         Computed g(x)   True g(x)      Error
0.1       0.4995834700    0.4995834722   -2.2000e-009
0.01      0.4999960000    0.4999958333    1.6670e-007
0.001     0.5000000000    0.4999999583    4.1700e-008
0.0001    0.5000000000    0.4999999996    4.0000e-010
0.00001   0.0             0.5000000000   -0.5


Loss of significance errors

Consider one case, that of x = 0.001. Then on the calculator:

cos(0.001) = 0.9999994999
1 - cos(0.001) = 5.001 × 10^(-7)
(1 - cos(0.001))/(0.001)^2 = 0.5001000000

The true answer is

g(0.001) = 0.4999999583.

The relative error in our answer is

(0.4999999583 - 0.5001)/0.4999999583 = -0.0001000417/0.4999999583 ≈ -0.0002.

There are only about 3 significant digits in the answer. How can such a straightforward and short calculation lead to such a large error (relative to the accuracy of the calculator)?


Loss of significance errors

When two numbers are nearly equal and we subtract them, we suffer a "loss of significance error" in the calculation. In some cases, these errors can be quite subtle and difficult to detect. And even after they are detected, they may be difficult to fix.

The last example, fortunately, can be fixed in a number of ways. Easiest is to use the trigonometric identity

cos(2θ) = 2cos^2(θ) - 1 = 1 - 2sin^2(θ).

Let x = 2θ. Then

g(x) = (1 - cos x)/x^2 = 2sin^2(x/2)/x^2 = (1/2)(sin(x/2)/(x/2))^2.

This latter formula, with x = 0.001, yields a computed value of 0.4999999584, nearly the true answer. We could also have used a Taylor polynomial for cos(x) around x = 0 to obtain a better approximation to g(x) for small values of x.
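A sketch comparing the two mathematically equivalent forms in MATLAB; single precision is used here (an assumption, instead of the slide's 10-digit calculator) so that the cancellation is visible at a larger x:

    x = single(1e-4);
    g_naive = (1 - cos(x)) / x^2;            % subtracts nearly equal numbers
    g_safe  = 0.5 * (sin(x/2) / (x/2))^2;    % rewritten via the identity
    fprintf('naive: %.7f   rewritten: %.7f\n', g_naive, g_safe);

In single precision cos(1e-4) rounds to exactly 1, so the naive form returns 0 while the rewritten form returns approximately 0.5.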


Another example

Evaluate e^(-5) using a Taylor polynomial approximation:

e^(-5) = 1 + (-5)/1! + (-5)^2/2! + (-5)^3/3! + (-5)^4/4! + (-5)^5/5! + (-5)^6/6! + ...

With n = 25 terms, the remainder satisfies

|(-5)^26/26! · e^c| ≤ 10^(-8)   (for some c between -5 and 0).

Imagine calculating this polynomial using a computer with 4-digit decimal arithmetic and rounding. To make the point about cancellation more strongly, imagine that each of the terms in the above polynomial is calculated exactly and then rounded to the arithmetic of the computer. We add the terms exactly and then round to four digits.


Another example

Degree   Term        Sum        Degree   Term          Sum
0         1.000       1.000     13      -0.1960       -0.04230
1        -5.000      -4.000     14       0.7001e-1     0.02771
2        12.50        8.500     15      -0.2334e-1     0.004370
3       -20.83      -12.33      16       0.7293e-2     0.01166
4        26.04       13.71      17      -0.2145e-2     0.009518
5       -26.04      -12.33      18       0.5958e-3     0.01011
6        21.70        9.370     19      -0.1568e-3     0.009957
7       -15.50       -6.130     20       0.3920e-4     0.009996
8         9.688       3.558     21      -0.9333e-5     0.009987
9        -5.382      -1.824     22       0.2121e-5     0.009989
10        2.691       0.8670    23      -0.4611e-6     0.009989
11       -1.223      -0.3560    24       0.9670e-7     0.009989
12        0.5097      0.1537    25      -0.1921e-7     0.009989

True answer is 0.006738.


Another example

To understand more fully the source of the error, look at the numbers being added and their accuracy. For example,

(-5)^3/3! = -125/6 = -20.83

in the 4-digit decimal calculation, with an error of magnitude 0.00333. Note that this error in an intermediate step is of the same magnitude as the true answer 0.006738 being sought. Other similar errors are present in calculating the other terms, and thus they cause a major error in the final answer being calculated.

General principle

Whenever a sum is being formed in which the final answer is much smaller than some of the terms being combined, a loss of significance error is occurring.
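A MATLAB sketch reproducing the experiment behind the table: each term of the Taylor series for e^(-5) is computed exactly, rounded to 4 significant decimal digits, and the rounded terms are summed. (The built-in round(X, N, 'significant') is assumed available; it exists in MATLAB R2014b and later.)

    s = 0;
    for k = 0:25
        term = round((-5)^k / factorial(k), 4, 'significant');
        s = s + term;
    end
    fprintf('4-digit sum = %.6f   true exp(-5) = %.6f\n', s, exp(-5));

The sum lands near 0.009989, far from the true value 0.006738, even though each individual term is correct to 4 significant digits.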


Noise in function evaluation

Consider plotting the function

f(x) = (x - 1)^3 = x^3 - 3x^2 + 3x - 1 = -1 + x(3 + x(-3 + x)).

[Plot of y = (x - 1)^3 for 0 ≤ x ≤ 2; at this scale the curve looks perfectly smooth.]


[Zoomed plot for x between 0.99998 and 1.000002: the computed values of the expanded form oscillate randomly between about -8 × 10^(-15) and 8 × 10^(-15).]


Whenever a function f(x) is evaluated, arithmetic operations are carried out which involve rounding or chopping errors. This means that what the computer eventually returns as an answer contains noise. This noise is generally "random" and small, but it can affect the accuracy of other calculations which depend on f(x).


Underflow errors

Consider evaluating

f(x) = x^10

for x near 0. When using IEEE single precision arithmetic, the smallest nonzero positive number expressible in normalized floating-point format is

m = 2^(-126) ≈ 1.18 × 10^(-38).

Thus f(x) will be set to zero if

x^10 < m
|x| < m^(1/10)
|x| < 1.61 × 10^(-4)
-0.000161 < x < 0.000161.


Overflow errors

Attempts to use numbers that are too large for the floating-point format lead to overflow errors. These are generally fatal errors on most computers. With the IEEE floating-point format, overflow errors can be carried along as having a value of ±∞ or NaN, depending on the context. Usually an overflow error is an indication of a more significant problem or error in the program, and the user needs to be aware of such errors.

When using IEEE single precision arithmetic, the largest positive number expressible in normalized floating-point format is

M = 2^128 (1 - 2^(-24)) ≈ 3.40 × 10^38.

Thus, f(x) = x^10 will overflow if

x^10 > M
|x| > M^(1/10)
|x| > 7131.6.
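A small MATLAB sketch of both failure modes. One caveat: the thresholds above assume results below the smallest normalized number are flushed to zero; IEEE hardware with gradual underflow first produces subnormal numbers, and the hard zero occurs only below the smallest subnormal (about 1.4 × 10^(-45) in single precision), which is why a smaller x is used here:

    x_small = single(1e-5);  disp(x_small^10)   % 1e-50: underflows to 0
    x_large = single(8000);  disp(x_large^10)   % ~1.07e39 > 3.40e38: Inf
    disp(realmin('single'))                     % 2^(-126) = 1.1755e-38
    disp(realmax('single'))                     % 3.4028e+38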


Numerical Analysis
conf. dr. Viorel Bostan
Fall 2010, Lecture 5


Loss of significance errors

This can be considered a source of error or a consequence of the finiteness of calculator and computer arithmetic.

Example (recalled from the previous lecture). Define

f(x) = x(√(x+1) - √x)

and consider evaluating it on a 6-digit decimal calculator which uses rounded arithmetic.

x        Computed f(x)   True f(x)   Error
1        0.414221        0.414214     7.0000e-006
10       1.54340         1.54347     -7.0000e-005
100      4.99000         4.98756      0.0024
1000     15.8000         15.8074     -0.0074
10000    50.0000         49.9988      0.0012
100000   100.000         158.113    -58.1130


Loss of significance errors

To localise the error, consider the case x = 100. The calculator with 6 decimal digits provides the following values:

√100 = 10,   √101 = 10.0499.

Then

√(x+1) - √x = √101 - √100 = 0.0499000,

while the exact value is 0.0498756. Three significant digits of √(x+1) = √101 have been cancelled against √x = √100.

The loss of precision is due to the form of the function f(x) and the finiteness of the precision of the 6-digit calculator.


Loss of significance errors

In this particular case, we can avoid the loss of precision by rewriting the function as follows:

f(x) = x(√(x+1) - √x) · (√(x+1) + √x)/(√(x+1) + √x) = x/(√(x+1) + √x).

Thus we avoid the subtraction of nearly equal quantities. Doing so gives us

f(100) = 4.98756,

a value with 6 significant digits.
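A sketch contrasting the two forms in MATLAB; single precision is used (an assumption, in place of the 6-digit calculator) so the cancellation is easy to see at x = 100000:

    x = single(100000);
    f_naive = x * (sqrt(x + 1) - sqrt(x));   % subtraction of nearby numbers
    f_safe  = x / (sqrt(x + 1) + sqrt(x));   % rewritten form, no cancellation
    fprintf('naive: %.4f   rewritten: %.4f\n', f_naive, f_safe);

The rewritten form agrees with the true value 158.113... to full single precision, while the naive form loses several digits.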


Propagation of errors: propagation in arithmetic operations

Let ω denote an arithmetic operation such as +, -, ×, or /. Let ω* denote the same arithmetic operation as it is actually carried out in the computer, including rounding or chopping error.

Let xA ≈ xT and yA ≈ yT. We want to obtain xT ω yT, but we actually obtain xA ω* yA. The error in xA ω* yA is given by

(xT ω yT) - (xA ω* yA).


Propagation of errors: propagation in arithmetic operations

The error in xA ω* yA is rewritten as

(xT ω yT) - (xA ω* yA) = (xT ω yT - xA ω yA) + (xA ω yA - xA ω* yA).

The final term is the error introduced by the inexactness of the machine arithmetic. For it, we usually assume

xA ω* yA = fl(xA ω yA).

This means that the quantity xA ω yA is computed exactly and is then rounded or chopped to fit the answer into the floating-point representation of the machine.


Propagation of errors: propagation in arithmetic operations

The formula

xA ω* yA = fl(xA ω yA)

implies

xA ω* yA = (xA ω yA)(1 + ε),

since fl(x) = x(1 + ε), where the limits for ε were given earlier. Then

Rel(xA ω* yA) = [(xA ω yA) - (xA ω* yA)] / (xA ω yA)
             = [(xA ω yA) - (xA ω yA)(1 + ε)] / (xA ω yA)
             = -ε.


Propagation of errors: propagation in arithmetic operations

With rounded binary arithmetic having n digits in the mantissa,

-2^(-n) ≤ ε ≤ 2^(-n).

Coming back to the error formula,

(xT ω yT) - (xA ω* yA) = (xT ω yT - xA ω yA) + (xA ω yA - xA ω* yA),

the second bracket has relative error -ε, as shown above. The remaining term,

xT ω yT - xA ω yA,

is the propagated error. In what follows we examine it for particular cases.


Propagation of errors: propagation in multiplication

Let ω = ×. Write

xT = xA + ξ,   yT = yA + η.

Then for the relative error in xA·yA:

Rel(xA·yA) = (xT·yT - xA·yA)/(xT·yT)
           = (xT·yT - (xT - ξ)(yT - η))/(xT·yT)
           = (xT·yT - xT·yT + xT·η + yT·ξ - ξη)/(xT·yT)
           = (xT·η + yT·ξ - ξη)/(xT·yT)
           = ξ/xT + η/yT - (ξ/xT)(η/yT)
           = Rel(xA) + Rel(yA) - Rel(xA)·Rel(yA).


Propagation of errors: propagation in multiplication

Usually we have

|Rel(xA)| ≪ 1,   |Rel(yA)| ≪ 1;

therefore we can drop the last term Rel(xA)·Rel(yA), since it is much smaller than the other two:

Rel(xA·yA) = Rel(xA) + Rel(yA) - Rel(xA)·Rel(yA) ≈ Rel(xA) + Rel(yA).

Thus small relative errors in the arguments xA and yA lead to a small relative error in the product xA·yA. Also, note that there is some cancellation if these relative errors are of opposite sign.
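A quick numerical sanity check of this rule in MATLAB (the values π and e with 5-digit approximations are illustrative choices, not from the slides):

    xT = pi;     xA = 3.1416;     % approximate values with known errors
    yT = exp(1); yA = 2.7183;
    RelX = (xT - xA)/xT;  RelY = (yT - yA)/yT;
    RelProd = (xT*yT - xA*yA)/(xT*yT);
    fprintf('Rel(x)+Rel(y) = %.3e   Rel(x*y) = %.3e\n', RelX + RelY, RelProd);

The two printed values agree to many digits, since the dropped product term Rel(xA)·Rel(yA) is of order 10^(-11) here.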


Propagation of errors: propagation in division

There is a similar result for division:

Rel(xA/yA) ≈ Rel(xA) - Rel(yA),

provided |Rel(yA)| ≪ 1.


Propagation of errors: propagation in addition and subtraction

For ω equal to + or -, we have

[xT ± yT] - [xA ± yA] = [xT - xA] ± [yT - yA].

Thus the error in a sum is the sum of the errors in the original arguments, and similarly for subtraction. However, there is a more subtle error occurring here.


Propagation of errors: example

Suppose you are solving

x^2 - 26x + 1 = 0.

Using the quadratic formula, we have the true answers

r1,T = 13 + √168,   r2,T = 13 - √168.

From a table of square roots, we take √168 ≈ 12.961. Since this is correctly rounded to 5 digits, we have

|√168 - 12.961| ≤ 0.0005.

Then define

r1,A = 13 + 12.961 = 25.961,   r2,A = 13 - 12.961 = 0.039.


Then for both roots,

|r_T − r_A| ≤ 0.0005

For the relative errors, however,

Rel(r_{1,A}) = (r_{1,T} − r_{1,A}) / r_{1,T},   |Rel(r_{1,A})| ≤ 0.0005 / 25.9605 ≈ 1.93 × 10^{-5}

Rel(r_{2,A}) = (r_{2,T} − r_{2,A}) / r_{2,T},   |Rel(r_{2,A})| ≤ 0.0005 / 0.0385 ≈ 0.0130

Why does r_{2,A} have such poor accuracy in comparison to r_{1,A}?


The answer is the loss-of-significance error involved in the formula used for calculating r_{2,A}. Instead, use the mathematically equivalent formula

r_{2,A} = 1 / (13 + √168) ≈ 1 / 25.961 ≈ 0.038519

This results in a much more accurate answer, at the expense of an additional division.
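The effect is easy to see in MATLAB; here is a minimal sketch that mimics the 5-digit table value of √168 and compares the two formulas (the variable names are ours):

% Compare the subtractive and the rearranged formula for r2.
s_table = 12.961;              % sqrt(168) correctly rounded to 5 digits
r2_true = 13 - sqrt(168);      % accurate reference answer
r2_sub  = 13 - s_table;        % subtraction: leading digits cancel
r2_div  = 1/(13 + s_table);    % equivalent formula: no cancellation
fprintf('subtraction: Rel = %.1e\n', (r2_true - r2_sub)/r2_true);
fprintf('division:    Rel = %.1e\n', (r2_true - r2_div)/r2_true);

The first relative error is of size 10^{-2}, the second of size 10^{-5}, matching the numbers above.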


Propagation of errors: Errors in function evaluation

Suppose we are evaluating a function f(x) in the machine. Then the result is generally not f(x), but rather an approximation of it, which we denote by f̃(x). Now suppose that we have a number x_A ≈ x_T. We want to calculate f(x_T), but instead we evaluate f̃(x_A). What can we say about the error in this latter computed quantity,

f(x_T) − f̃(x_A) ?


We have

f(x_T) − f̃(x_A) = [f(x_T) − f(x_A)] + [f(x_A) − f̃(x_A)]

The quantity f(x_A) − f̃(x_A) is the "noise" in the evaluation of f(x_A) in the computer, and we will return later to some discussion of it. The quantity f(x_T) − f(x_A) is called the propagated error. It is the error that results from using perfect arithmetic in the evaluation of the function. If the function f(x) is differentiable, then we can use the mean-value theorem to write

f(x_T) − f(x_A) = f′(ξ)(x_T − x_A)

for some ξ between x_T and x_A.


Since usually x_T and x_A are close together, we can say ξ is close to either of them, and

f(x_T) − f(x_A) = f′(ξ)(x_T − x_A) ≈ f′(x_T)(x_T − x_A) ≈ f′(x_A)(x_T − x_A)


Propagation of errors: Example

Define f(x) = b^x, where b is a positive real number. Then the last formula yields

b^{x_T} − b^{x_A} ≈ (ln b) b^{x_T} (x_T − x_A)

so that

Rel(b^{x_A}) ≈ (ln b) b^{x_T} (x_T − x_A) / b^{x_T} = (ln b)(x_T − x_A) = (x_T ln b) · (x_T − x_A)/x_T = K · Rel(x_A),   K = x_T ln b

Note that if K = 10^4 and Rel(x_A) = 10^{-7}, then Rel(b^{x_A}) ≈ 10^{-3}. This is a large decrease in accuracy, and it is independent of how we actually calculate b^x. The number K is called a condition number for the computation.
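A small MATLAB check of this bound; the numbers b = 10 and x_T = 100 are our own illustrative choices, giving K ≈ 230:

% Plant a known relative error in the argument and compare the
% resulting relative error in b^x with K*Rel(xA).
b = 10;  xT = 100;  K = xT*log(b);     % condition number, about 230
relx = 1e-12;                          % relative error in the argument
xA = xT*(1 + relx);
relf = abs(b^xT - b^xA)/b^xT;          % relative error in the result
fprintf('K*Rel(xA) = %.2e, actual = %.2e\n', K*relx, relf);

Both printed values come out near 2.3 × 10^{-10}, as the analysis predicts.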


Summation

Let S be a sum with a relatively large number of terms,

S = a_1 + a_2 + … + a_n     (1)

where the a_j, j = 1, …, n, are floating point numbers. The summation process consists of n − 1 consecutive additions,

S = (…((a_1 + a_2) + a_3) + … + a_{n−1}) + a_n

Define

S_2 = fl(a_1 + a_2)
S_3 = fl(S_2 + a_3)
S_4 = fl(S_3 + a_4)
  ⋮
S_n = fl(S_{n−1} + a_n)

Recall the formula fl(x) = x(1 + ε).


S_2 = (a_1 + a_2)(1 + ε_2)
S_3 = (S_2 + a_3)(1 + ε_3)
S_4 = (S_3 + a_4)(1 + ε_4)
  ⋮
S_n = (S_{n−1} + a_n)(1 + ε_n)

Then

S_3 = (S_2 + a_3)(1 + ε_3) = ((a_1 + a_2)(1 + ε_2) + a_3)(1 + ε_3)
    ≈ (a_1 + a_2 + a_3) + a_1(ε_2 + ε_3) + a_2(ε_2 + ε_3) + a_3 ε_3


Similarly,

S_4 ≈ (a_1 + a_2 + a_3 + a_4) + a_1(ε_2 + ε_3 + ε_4) + a_2(ε_2 + ε_3 + ε_4) + a_3(ε_3 + ε_4) + a_4 ε_4

Finally,

S_n ≈ (a_1 + a_2 + … + a_n) + a_1(ε_2 + … + ε_n) + a_2(ε_2 + … + ε_n) + a_3(ε_3 + … + ε_n) + a_4(ε_4 + … + ε_n) + … + a_n ε_n


We are interested in the error S − S_n:

S − S_n ≈ −a_1(ε_2 + … + ε_n) − a_2(ε_2 + … + ε_n) − a_3(ε_3 + … + ε_n) − a_4(ε_4 + … + ε_n) − … − a_n ε_n

From the last relation we can establish a strategy for summation that minimizes the error S − S_n: first rearrange the terms in increasing order of magnitude,

|a_1| ≤ |a_2| ≤ |a_3| ≤ … ≤ |a_n|

In this case the smaller numbers a_1 and a_2 are multiplied by the larger factors ε_2 + … + ε_n, while the largest number a_n is multiplied only by the single factor ε_n.


Summation with chopping

Here the test sum is the harmonic sum S = 1 + 1/2 + … + 1/n (the exact values below are its partial sums), computed in four-digit arithmetic with chopping; SL = terms added from smallest to largest, LS = from largest to smallest.

  n     Exact   SL      Error   LS      Error
  10    2.929   2.928   0.001   2.927   0.002
  25    3.816   3.813   0.003   3.806   0.010
  50    4.499   4.491   0.008   4.479   0.020
  100   5.187   5.170   0.017   5.142   0.045
  200   5.878   5.841   0.037   5.786   0.092
  500   6.793   6.692   0.101   6.569   0.224
  1000  7.486   7.284   0.202   7.069   0.417


Summation with rounding

  n     Exact   SL      Error    LS      Error
  10    2.929   2.929    0       2.929    0
  25    3.816   3.816    0       3.817   −0.001
  50    4.499   4.500   −0.001   4.498    0.001
  100   5.187   5.187    0       5.187    0
  200   5.878   5.878    0       5.876    0.002
  500   6.793   6.794   −0.001   6.783    0.010
  1000  7.486   7.486    0       7.449    0.037
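The ordering effect is easy to reproduce in MATLAB by summing the same series in single precision in both orders (a rough sketch; n and the working precision are our own choices):

% Harmonic sum in single precision: smallest-to-largest (SL)
% versus largest-to-smallest (LS).
n = 1e6;
sl = single(0);  for j = n:-1:1, sl = sl + 1/single(j); end
ls = single(0);  for j = 1:n,    ls = ls + 1/single(j); end
exact = sum(1./(n:-1:1));          % double precision reference
fprintf('SL error %.2e, LS error %.2e\n', ...
        exact - double(sl), exact - double(ls));

The SL error comes out noticeably smaller, just as in the tables above.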


Numerical Analysis
conf. dr. Viorel Bostan
Fall 2010, Lecture 6


Rootfinding

We want to find the numbers x for which

f(x) = 0

with f : [a, b] → R a given real-valued function. Here we denote such roots, or zeroes, by the Greek letter α, so

f(α) = 0

Rootfinding problems occur in many contexts. Sometimes they are a direct formulation of some physical situation, but more often they are an intermediate step in solving a much larger problem.


Bisection method

Most methods for solving f(x) = 0 are iterative: given an initial guess x_0, such a method produces a sequence of successively computed approximations x_1, x_2, x_3, …, x_n, … with x_n → α. We begin with the simplest of such methods, one which most people use at some time.

Suppose we are given a function f(x), and assume we have an interval [a, b] containing the root, on which the function is continuous. We also assume we are given an error tolerance ε > 0, and we want an approximate root α̃ ∈ [a, b] for which

|α − α̃| < ε


The bisection method is based on the following theorem:

Theorem. If f : [a, b] → R is a continuous function on the closed and bounded interval [a, b] and

f(a) · f(b) < 0,

then there exists α ∈ [a, b] such that f(α) = 0.

Therefore, we further assume that the function f(x) changes sign on [a, b].


Bisection Algorithm: Bisect(f, a, b, ε)

Step 1: Define c = (a + b)/2.

Step 2: If b − c ≤ ε, accept c as our root, and stop.

Step 3: If b − c > ε, compare the sign of f(c) with those of f(a) and f(b): if sign(f(b)) · sign(f(c)) ≤ 0, replace a with c; otherwise, replace b with c. Return to Step 1.

Note that we prefer checking the sign using the condition sign(f(b)) · sign(f(c)) ≤ 0 instead of sign(f(b) · f(c)) ≤ 0, since the product f(b) · f(c) itself can underflow or overflow.
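A minimal MATLAB sketch of this algorithm, written directly from the steps above:

function c = bisect(f, a, b, tol)
% Bisection: f continuous on [a,b] with a sign change;
% returns c with |alpha - c| <= tol.
while true
    c = (a + b)/2;                    % Step 1
    if b - c <= tol                   % Step 2: accept c
        return
    end
    if sign(f(b))*sign(f(c)) <= 0     % Step 3: root lies in [c,b]
        a = c;
    else                              %         root lies in [a,c]
        b = c;
    end
end
end

For the example below, bisect(@(x) x^6 - x - 1, 1, 2, 0.001) stops at the value 1.13379 reached in row 10 of the table.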


[Figure: one step of the bisection method. The root α lies between a_1 and b_1; after the step, a_2 = c_1 and b_2 = b_1, with new midpoint c_2.]


Example. Consider the function

f(x) = x^6 − x − 1

We want to find the largest root with an accuracy of ε = 0.001. It can be seen from the graph of the function that this root is located in [1, 2]; also, the function is continuous there. Let a = 1 and b = 2; then f(a) = −1 and f(b) = 61, so the function changes sign, and all the conditions are satisfied.


  n    a_n       b_n       c_n        f(c_n)       b_n − c_n
  1    1.00000   2.00000   1.50000    8.891e+00    5.000e−01
  2    1.00000   1.50000   1.25000    1.565e+00    2.500e−01
  3    1.00000   1.25000   1.12500   −9.771e−02    1.250e−01
  4    1.12500   1.25000   1.18750    6.167e−01    6.250e−02
  5    1.12500   1.18750   1.15625    2.333e−01    3.125e−02
  6    1.12500   1.15625   1.14063    6.158e−02    1.563e−02
  7    1.12500   1.14063   1.13281   −1.958e−02    7.813e−03
  8    1.13281   1.14063   1.13672    2.062e−02    3.906e−03
  9    1.13281   1.13672   1.13477    4.268e−04    1.953e−03
  10   1.13281   1.13477   1.13379   −9.598e−03    9.766e−04


Error analysis for the bisection method

Let a_n, b_n and c_n be the values produced by the bisection method at iteration n. Evidently,

b_{n+1} − a_{n+1} = (1/2)(b_n − a_n)

and therefore

b_n − a_n = (1/2)(b_{n−1} − a_{n−1}) = (1/2^2)(b_{n−2} − a_{n−2}) = … = (1/2^{n−1})(b − a)

Since either α ∈ [a_n, c_n] or α ∈ [c_n, b_n], we have

|α − c_n| ≤ c_n − a_n = b_n − c_n = (1/2)(b_n − a_n) = (1/2^n)(b − a)


|α − c_n| ≤ (1/2^n)(b − a)

This relation provides us with a stopping criterion for the bisection method. Moreover, it follows that c_n → α as n → ∞.

Suppose we want to estimate the number of iterations necessary to find the root with an error tolerance ε. From |α − c_n| ≤ ε it suffices that

(1/2^n)(b − a) ≤ ε,   i.e.   n ≥ ln((b − a)/ε) / ln 2

For the previous example we get

n ≥ ln(1/0.001) / ln 2 ≈ 9.97
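In MATLAB this bound is a one-liner (values from the example above):

% Bisection steps guaranteed to reach tolerance 0.001 on [1,2].
n = ceil(log((2 - 1)/0.001)/log(2))   % prints n = 10

so ten iterations suffice, which is exactly where the table above stopped.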


Advantages and disadvantages of the bisection method

Advantages:

1. It always converges.
2. You have a guaranteed error bound, and it decreases with each successive iteration.
3. You have a guaranteed rate of convergence: the error bound decreases by 1/2 with each iteration.

Disadvantages:

1. It is relatively slow when compared with the other rootfinding methods we will study, especially when the function f(x) has several continuous derivatives about the root α.
2. The algorithm has no check of whether ε is too small for the computer arithmetic being used.

We also assume the function f(x) is continuous on the given interval [a, b], but there is no way for the computer to confirm this.


Rootfinding

We want to find the root α of a given function f(x); that is, the point x at which the graph of y = f(x) intersects the x-axis. One of the principles of numerical analysis is the following.

Numerical Analysis Principle: If you cannot solve the given problem, then solve a "nearby problem".

How do we obtain a nearby problem for f(x) = 0? Begin by asking what types of problems we can solve easily. At the top of the list should be finding where a straight line intersects the x-axis. Thus we seek to replace f(x) = 0 by the problem of solving p(x) = 0 for some linear polynomial p(x) that approximates f(x) in the vicinity of the root α.


[Figure: the tangent line to y = f(x) at (x_0, f(x_0)) crosses the x-axis at x_1, which lies closer to the root α than x_0.]


Newton's method

Let x_0 be an initial guess, sufficiently close to the root α. Consider the tangent line to the graph of f(x) at (x_0, f(x_0)). The tangent intersects the x-axis at a point x_1 closer to α, and has the equation

p_1(x) = f(x_0) + f′(x_0)(x − x_0)

Since p_1(x_1) = 0, we get

f(x_0) + f′(x_0)(x_1 − x_0) = 0,   i.e.   x_1 = x_0 − f(x_0)/f′(x_0)

Similarly, we get x_2:

x_2 = x_1 − f(x_1)/f′(x_1)


Repeat this process to obtain the sequence x_1, x_2, x_3, … that, hopefully, will converge to α. The general scheme of Newton's method: starting with an initial guess x_0, compute iteratively

x_{n+1} = x_n − f(x_n)/f′(x_n),   n = 0, 1, 2, …


Example. Apply Newton's method to

f(x) = x^6 − x − 1,   f′(x) = 6x^5 − 1

to get

x_{n+1} = x_n − (x_n^6 − x_n − 1)/(6x_n^5 − 1),   n ≥ 0

Use the initial guess x_0 = 1.5.


  n   x_n          f(x_n)      x_n − x_{n−1}   α − x_{n−1}
  0   1.50000000   8.89e+00
  1   1.30049088   2.54e+00    −2.00e−01       −3.65e−01
  2   1.18148042   5.38e−01    −1.19e−01       −1.66e−01
  3   1.13945559   4.92e−02    −4.20e−02       −4.68e−02
  4   1.13477763   5.50e−04    −4.68e−03       −4.73e−03
  5   1.13472415   7.11e−08    −5.35e−05       −5.35e−05
  6   1.13472414   1.55e−15    −6.91e−09       −6.91e−09

The true solution is α = 1.134724138.
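A minimal MATLAB transcription of this run:

% Newton's method for f(x) = x^6 - x - 1, starting from x0 = 1.5.
f  = @(x) x^6 - x - 1;
df = @(x) 6*x^5 - 1;
x = 1.5;
for n = 1:6
    x = x - f(x)/df(x);                % Newton step
    fprintf('%d  %.8f  %.2e\n', n, x, f(x));
end

The printed iterates reproduce the x_n column of the table above.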


Newton's method. Division example

Here we consider a division algorithm (based on Newton's method) implemented in some computers in the past. Say we are interested in computing a/b = a · (1/b), where 1/b is computed using Newton's method. Take

f(x) ≡ b − 1/x = 0,

with b positive. The root of this equation is α = 1/b. Since

f′(x) = 1/x^2,

Newton's method for this problem becomes

x_{n+1} = x_n − (b − 1/x_n) / (1/x_n^2)

Simplifying,

x_{n+1} = x_n(2 − b x_n),   n ≥ 0
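A short MATLAB sketch of this division-free reciprocal iteration; b = 7 and the starting guess are our own choices, with 0 < x_0 < 2/b as required below:

% Approximate 1/7 using x_{n+1} = x_n*(2 - b*x_n): no division needed.
b = 7;  x = 0.1;                       % initial guess in (0, 2/b)
for n = 1:5
    x = x*(2 - b*x);                   % one Newton step
    fprintf('%d  %.16f\n', n, x);
end

The iterates converge to 1/7 = 0.142857…, roughly doubling the number of correct digits at each step, in line with the error analysis that follows.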


The initial guess x_0 must be close enough to the true solution, and of course x_0 > 0. Consider the error:

α − x_{n+1} = 1/b − x_{n+1} = (1 − b x_{n+1})/b = (1 − b x_n(2 − b x_n))/b = (1 − b x_n)^2 / b

On the other hand,

Rel(x_{n+1}) = (α − x_{n+1})/α = 1 − b x_{n+1}


It can be shown (try it!) that

Rel(x_{n+1}) = (Rel(x_n))^2

In order to guarantee convergence x_n → α, we need

|Rel(x_0)| < 1,   or equivalently   0 < x_0 < 2/b

For example, suppose that |Rel(x_0)| = 0.1. Then

Rel(x_1) = 10^{-2},  Rel(x_2) = 10^{-4},  Rel(x_3) = 10^{-8},  Rel(x_4) = 10^{-16}

75 / 94


[Figure: Newton's method for y = b − 1/x. The tangent at (x_0, f(x_0)) crosses the x-axis at x_1; the root is 1/b, and convergence requires 0 < x_0 < 2/b.]

Error analysis for Newton's method

Let f ∈ C^2[a, b] and α ∈ [a, b], with f'(α) ≠ 0. Consider the Taylor formula for f(x) about x_n:

f(x) = f(x_n) + (x − x_n) f'(x_n) + ((x − x_n)^2 / 2) f''(ξ_n),

where ξ_n is between x and x_n. Take x = α to get

f(α) = f(x_n) + (α − x_n) f'(x_n) + ((α − x_n)^2 / 2) f''(ξ_n),

with ξ_n between α and x_n. Since f(α) = 0, dividing through by f'(x_n) gives

0 = f(x_n)/f'(x_n) + (α − x_n) + (α − x_n)^2 · f''(ξ_n)/(2 f'(x_n))

α − x_{n+1} = (α − x_n)^2 · [−f''(ξ_n)/(2 f'(x_n))]


For the previous example f(x) = x^6 − x − 1 we have f''(x) = 30 x^4, so

−f''(ξ_n)/(2 f'(x_n)) ≈ −f''(α)/(2 f'(α)) = −30 α^4 / (2 (6 α^5 − 1)) ≈ −2.42

Therefore

α − x_{n+1} ≈ −2.42 (α − x_n)^2

For example, if n = 3 we get α − x_3 ≈ −4.73e−03 and

α − x_4 ≈ −2.42 (α − x_3)^2 ≈ −5.42e−05,

in accordance with the value presented in the table: α − x_4 ≈ −5.35e−05.
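For reference, a minimal MATLAB sketch of Newton's method for this example (the starting guess x_0 = 1.5 is an assumption, consistent with the errors quoted above):

    % Newton's method for f(x) = x^6 - x - 1
    f  = @(x) x.^6 - x - 1;
    df = @(x) 6*x.^5 - 1;
    x = 1.5;                   % assumed initial guess
    for n = 1:6
        x = x - f(x)/df(x);    % Newton step
    end
    disp(x)                    % approx. alpha = 1.134724138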


If the iterate x_n is close to α, we have

−f''(ξ_n)/(2 f'(x_n)) ≈ −f''(α)/(2 f'(α)) ≡ M

α − x_{n+1} ≈ M (α − x_n)^2

M (α − x_{n+1}) ≈ (M (α − x_n))^2

Inductively,

M (α − x_n) ≈ (M (α − x_0))^{2^n},   n ≥ 0

In other words, in order to guarantee the convergence of Newton's method we should have

|M (α − x_0)| < 1,   i.e.   |α − x_0| < 1/|M| = |2 f'(α)/f''(α)|


Page 386: Analiza Numerica [Utm, Bostan v.]

For xn close to α, and therefore cn also close to α,

we have

α− xn+1 ≈ −f 00(α)2f 0(α)

(α− xn)2

Thus Newton’s method is quadratically convergent,

provided f 0(α) 6= 0 and f(x) is twice differentiable inthe vicinity of the root α.

We can also use this to explore the ‘interval of con-

vergence’ of Newton’s method. Write the above as

α− xn+1 ≈M (α− xn)2 , M = − f 00(α)

2f 0(α)Multiply both sides by M to get

M (α− xn+1) ≈ [M (α− xn)]2

Page 387: Analiza Numerica [Utm, Bostan v.]

M (α− xn+1) ≈ [M (α− xn)]2

Then we want these quantities to decrease; and this

suggests choosing x0 so that

|M (α− x0)| < 1

|α− x0| <1

|M | =¯̄̄̄¯2f 0(α)f 00(α)

¯̄̄̄¯

If |M | is very large, then we may need to have a verygood initial guess in order to have the iterates xn

converge to α.

ADVANTAGES & DISADVANTAGES

Advantages:

1. It is rapidly convergent in most cases.

2. It is simple in its formulation, and therefore relatively easy to apply and program.

3. It is intuitive in its construction, which makes it easier to understand when it is likely to behave well and when it may behave poorly.

Disadvantages:

1. It may not converge.

2. It is likely to have difficulty if f'(α) = 0. This condition means the x-axis is tangent to the graph of y = f(x) at x = α.

3. It needs to know both f(x) and f'(x). Contrast this with the bisection method, which requires only f(x).

THE SECANT METHOD

Newton's method was based on using the line tangent to the curve of y = f(x), with the point of tangency (x_0, f(x_0)). When x_0 ≈ α, the graph of the tangent line is approximately the same as the graph of y = f(x) around x = α. We then used the root of the tangent line to approximate α.

Consider instead using an approximating line based on interpolation. We assume we have two estimates of the root α, say x_0 and x_1. Then we produce a linear function

q(x) = a_0 + a_1 x

with

q(x_0) = f(x_0),   q(x_1) = f(x_1)     (*)

This line is sometimes called a secant line. Its equation is given by

q(x) = [(x_1 − x) f(x_0) + (x − x_0) f(x_1)] / (x_1 − x_0)

[Figure: two configurations of the secant method for y = f(x). The secant line through (x_0, f(x_0)) and (x_1, f(x_1)) crosses the x-axis at x_2, near the root α.]

q(x) = [(x_1 − x) f(x_0) + (x − x_0) f(x_1)] / (x_1 − x_0)

This is linear in x; and by direct evaluation, it satisfies the interpolation conditions (*). We now solve the equation q(x) = 0, denoting the root by x_2. This yields

x_2 = x_1 − f(x_1) · (x_1 − x_0) / (f(x_1) − f(x_0))

We can now repeat the process. Use x_1 and x_2 to produce another secant line, and then use its root to approximate α. This yields the general iteration formula

x_{n+1} = x_n − f(x_n) · (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})),   n = 1, 2, 3, ...

This is called the secant method for solving f(x) = 0.

Example. We solve the equation

f(x) ≡ x^6 − x − 1 = 0,

which was used previously as an example for both the bisection and Newton methods. The quantity x_n − x_{n−1} is used as an estimate of α − x_{n−1}. The iterate x_8 equals α rounded to nine significant digits. As with Newton's method for this equation, the initial iterates do not converge rapidly. But as the iterates become closer to α, the speed of convergence increases.

n   x_n          f(x_n)     x_n − x_{n−1}   α − x_{n−1}
0   2.0           61.0
1   1.0          −1.0       −1.0
2   1.01612903   −9.15E−1    1.61E−2         1.35E−1
3   1.19057777    6.57E−1    1.74E−1         1.19E−1
4   1.11765583   −1.68E−1   −7.29E−2        −5.59E−2
5   1.13253155   −2.24E−2    1.49E−2         1.71E−2
6   1.13481681    9.54E−4    2.29E−3         2.19E−3
7   1.13472365   −5.07E−6   −9.32E−5        −9.27E−5
8   1.13472414   −1.13E−9    4.92E−7         4.92E−7
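A compact MATLAB sketch of the secant iteration for this example (initial estimates taken from the table; the stopping tolerance is our own choice):

    % Secant method for f(x) = x^6 - x - 1, starting from x0 = 2, x1 = 1
    f = @(x) x.^6 - x - 1;
    x0 = 2;  x1 = 1;
    for n = 1:12
        x2 = x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0));   % secant step
        x0 = x1;  x1 = x2;
        if abs(x1 - x0) < 1e-10, break, end          % |x_{n+1} - x_n| as error estimate
    end
    disp(x1)                                         % approx. 1.13472414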

It is clear from the numerical results that the secant method requires more iterates than the Newton method. But note that the secant method does not require knowledge of f'(x), whereas Newton's method requires both f(x) and f'(x).

Note also that the secant method can be considered an approximation of the Newton method

x_{n+1} = x_n − f(x_n)/f'(x_n)

by using the approximation

f'(x_n) ≈ (f(x_n) − f(x_{n−1})) / (x_n − x_{n−1})

CONVERGENCE ANALYSIS

With a combination of algebraic manipulation and the mean-value theorem from calculus, we can show

α − x_{n+1} = (α − x_n)(α − x_{n−1}) · [−f''(ξ_n)/(2 f'(ζ_n))],     (**)

with ξ_n and ζ_n unknown points. The point ξ_n is located between the minimum and maximum of x_{n−1}, x_n, and α; and ζ_n is located between the minimum and maximum of x_{n−1} and x_n. Recall that the Newton iterates satisfied

α − x_{n+1} = (α − x_n)^2 · [−f''(ξ_n)/(2 f'(x_n))],

which closely resembles (**) above.

Using (**), it can be shown that x_n converges to α, and moreover,

lim_{n→∞} |α − x_{n+1}| / |α − x_n|^r = |f''(α)/(2 f'(α))|^{r−1} ≡ c,

where r = (1 + √5)/2 ≈ 1.62. This assumes that x_0 and x_1 are chosen sufficiently close to α; how close will vary with the function f. In addition, the above result assumes f(x) has two continuous derivatives for all x in some interval about α.

The above says that when we are close to α,

|α − x_{n+1}| ≈ c |α − x_n|^r

This looks very much like the Newton result

α − x_{n+1} ≈ M (α − x_n)^2,   M = −f''(α)/(2 f'(α)),

and c = |M|^{r−1}. Both the secant and Newton methods converge at a faster than linear rate, and they are called superlinear methods.

The secant method converges more slowly than Newton's method, but it is still quite rapid. It is rapid enough that we can prove

lim_{n→∞} |x_{n+1} − x_n| / |α − x_n| = 1,

and therefore

|α − x_n| ≈ |x_{n+1} − x_n|

is a good error estimator.

A note of warning: do not combine the secant formula and write it in the form

x_{n+1} = (f(x_n) x_{n−1} − f(x_{n−1}) x_n) / (f(x_n) − f(x_{n−1}))

This has enormous loss-of-significance errors as compared with the earlier formulation.

COSTS OF SECANT & NEWTON METHODS

The Newton method

x_{n+1} = x_n − f(x_n)/f'(x_n),   n = 0, 1, 2, ...

requires two function evaluations per iteration, that of f(x_n) and f'(x_n). The secant method

x_{n+1} = x_n − f(x_n) · (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})),   n = 1, 2, 3, ...

requires one function evaluation per iteration, following the initial step.

For this reason, the secant method is often faster in time, even though more iterates are needed with it than with Newton's method to attain a similar accuracy.

ADVANTAGES & DISADVANTAGES

Advantages of the secant method:

1. It converges at a faster than linear rate, so that it is more rapidly convergent than the bisection method.

2. It does not require use of the derivative of the function, something that is not available in a number of applications.

3. It requires only one function evaluation per iteration, as compared with Newton's method which requires two.

Disadvantages of the secant method:

1. It may not converge.

2. There is no guaranteed error bound for the computed iterates.

3. It is likely to have difficulty if f'(α) = 0. This means the x-axis is tangent to the graph of y = f(x) at x = α.

4. Newton's method generalizes more easily to new methods for solving simultaneous systems of nonlinear equations.

BRENT'S METHOD

Richard Brent devised a method combining the advantages of the bisection method and the secant method.

1. It is guaranteed to converge.

2. It has an error bound which will converge to zero in practice.

3. For most problems f(x) = 0, with f(x) differentiable about the root α, the method behaves like the secant method.

4. In the worst case, it is not too much worse in its convergence than the bisection method.

In MATLAB, it is implemented as fzero; and it is present in most Fortran numerical analysis libraries.
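For instance, the root of the earlier example can be computed with fzero by supplying a bracketing interval:

    % fzero combines bisection, secant steps and inverse quadratic interpolation
    alpha = fzero(@(x) x.^6 - x - 1, [1, 2])    % approx. 1.13472414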

FIXED POINT ITERATION

We begin with a computational example. Consider solving the two equations

E1: x = 1 + 0.5 sin x
E2: x = 3 + 2 sin x

Graphs of these two equations are shown in the accompanying figures, with the solutions being

E1: α = 1.49870113351785
E2: α = 3.09438341304928

We are going to use a numerical scheme called 'fixed point iteration'. It amounts to making an initial guess x_0 and substituting this into the right side of the equation. The resulting value is denoted by x_1; then the process is repeated, this time substituting x_1 into the right side. This is repeated until convergence occurs or until the iteration is terminated.

In the above cases, we show the results of the first 10 iterations in the accompanying table. Clearly convergence is occurring with E1, but not with E2. Why?

[Figure: graphs of y = x against y = 1 + 0.5 sin x and against y = 3 + 2 sin x; in each case the intersection is the fixed point α.]

E1: x = 1 + 0.5 sin x        E2: x = 3 + 2 sin x

n    E1: x_n               E2: x_n
0    0.00000000000000      3.00000000000000
1    1.00000000000000      3.28224001611973
2    1.42073549240395      2.71963177181556
3    1.49438099256432      3.81910025488514
4    1.49854088439917      1.74629389651652
5    1.49869535552190      4.96927957214762
6    1.49870092540704      1.06563065299216
7    1.49870112602244      4.75018861639465
8    1.49870113324789      1.00142864236516
9    1.49870113350813      4.68448404916097
10   1.49870113351750      1.00077863465869
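The E1 column can be reproduced with a few lines of MATLAB:

    % Fixed point iteration for E1: x = 1 + 0.5*sin(x), starting at x0 = 0
    x = 0;
    for n = 1:10
        x = 1 + 0.5*sin(x);
    end
    disp(x)        % approx. 1.49870113, in agreement with the E1 column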

The above iterations can be written symbolically as

E1: x_{n+1} = 1 + 0.5 sin x_n
E2: x_{n+1} = 3 + 2 sin x_n

for n = 0, 1, 2, ... Why does one of these iterations converge, but not the other? The graphs show similar behaviour, so why the difference? Consider one more example.

Suppose we are solving the equation

x^2 − 5 = 0

with exact root α = √5 ≈ 2.2361, using iterates of the form

x_{n+1} = g(x_n).

Consider four different iterations:

I1: x_{n+1} = 5 + x_n − x_n^2
I2: x_{n+1} = 5/x_n
I3: x_{n+1} = 1 + x_n − x_n^2/5
I4: x_{n+1} = (x_n + 5/x_n)/2

All of them, in case they are convergent, will converge to α = √5 (just take the limit as n → ∞ of each relation).

n   I1: x_n        I2: x_n   I3: x_n   I4: x_n
0    1.0e+00       1.0       1.0       1.0
1    5.0000e+00    5.0       1.8000    3.0000
2   −1.5000e+01    1.0       2.1520    2.3333
3   −2.3500e+02    5.0       2.2258    2.2381
4   −5.5455e+04    1.0       2.2350    2.2361
5   −3.0753e+09    5.0       2.2360    2.2361
6   −9.4575e+18    1.0       2.2361    2.2361
7   −8.9445e+37    5.0       2.2361    2.2361
8   −8.0004e+75    1.0       2.2361    2.2361
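As a sketch, iteration I4 (which is in fact Newton's method applied to x^2 − 5 = 0) can be run as:

    % Fixed point iteration I4: x_{n+1} = (x_n + 5/x_n)/2, starting at x0 = 1
    g = @(x) (x + 5./x)/2;
    x = 1;
    for n = 1:8
        x = g(x);
    end
    disp(x)        % approx. 2.2361 = sqrt(5), as in the I4 column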

As another example, note that the Newton method

x_{n+1} = x_n − f(x_n)/f'(x_n)

is also a fixed point iteration, for the equation

x = x − f(x)/f'(x)

In general, we are interested in solving equations

x = g(x)

by means of fixed point iteration:

x_{n+1} = g(x_n),   n = 0, 1, 2, ...

It is called 'fixed point iteration' because the root α is a fixed point of the function g(x), meaning that α is a number for which

g(α) = α

EXISTENCE THEOREM

We begin by asking whether the equation

x = g(x)

has a solution. For this to occur, the graphs of y = x and y = g(x) must intersect, as seen in the earlier figures. There are several lemmas and theorems that give conditions under which we are guaranteed there is a fixed point α.

Lemma 1. Let g(x) be a continuous function on the interval [a, b], and suppose it satisfies the property

a ≤ x ≤ b  ⇒  a ≤ g(x) ≤ b     (#)

Then the equation x = g(x) has at least one solution α in the interval [a, b].

The proof of this is fairly intuitive. Look at the function f(x) = x − g(x) on a ≤ x ≤ b. Evaluating at the endpoints, f(a) ≤ 0 and f(b) ≥ 0. The function f(x) is continuous on [a, b], and therefore it has a zero in the interval.

Theorem. Assume g(x) and g'(x) exist and are continuous on the interval [a, b]; and further, assume

a ≤ x ≤ b  ⇒  a ≤ g(x) ≤ b

λ ≡ max_{a≤x≤b} |g'(x)| < 1

Then:

S1. The equation x = g(x) has a unique solution α in [a, b].

S2. For any initial guess x_0 in [a, b], the iteration

x_{n+1} = g(x_n),   n = 0, 1, 2, ...

will converge to α.

S3.

|α − x_n| ≤ (λ^n / (1 − λ)) |x_1 − x_0|,   n ≥ 0

S4.

lim_{n→∞} (α − x_{n+1})/(α − x_n) = g'(α)

Thus for x_n close to α,

α − x_{n+1} ≈ g'(α)(α − x_n)

The proof is given in the text, and I go over only a portion of it here. For S2, note that from (#), if x_0 is in [a, b], then

x_1 = g(x_0)

is also in [a, b]. Repeat the argument to show that

x_2 = g(x_1)

belongs to [a, b]. This can be continued by induction to show that every x_n belongs to [a, b].

We need the following general result. For any two points w and z in [a, b],

g(w) − g(z) = g'(c)(w − z)

for some unknown point c between w and z. Therefore,

|g(w) − g(z)| ≤ λ |w − z|

for any a ≤ w, z ≤ b.

For S3, subtract x_{n+1} = g(x_n) from α = g(α) to get

α − x_{n+1} = g(α) − g(x_n) = g'(c_n)(α − x_n)     ($)

|α − x_{n+1}| ≤ λ |α − x_n|     (*)

with c_n between α and x_n. From (*), we have that the error is guaranteed to decrease by a factor of λ with each iteration. This leads to

|α − x_n| ≤ λ^n |α − x_0|,   n ≥ 0

With some extra manipulation, we can obtain the error bound in S3.

For S4, use ($) to write

(α − x_{n+1})/(α − x_n) = g'(c_n)

Since x_n → α and c_n is between α and x_n, we have g'(c_n) → g'(α).

The statement

α − x_{n+1} ≈ g'(α)(α − x_n)

tells us that when near to the root α, the errors will decrease by a constant factor of g'(α). If this factor is negative, then the errors will oscillate between positive and negative, and the iterates will be approaching from both sides. When g'(α) is positive, the iterates will approach α from only one side.

The statements

α − x_{n+1} = g'(c_n)(α − x_n)
α − x_{n+1} ≈ g'(α)(α − x_n)

also tell us a bit more of what happens when

|g'(α)| > 1

Then the errors will increase rather than decrease in size, and the iterates move away from the root.

Look at the earlier examples:

E1: x = 1 + 0.5 sin x
E2: x = 3 + 2 sin x

In the first case E1,

g(x) = 1 + 0.5 sin x,   g'(x) = 0.5 cos x,   |g'(α)| ≤ 1/2

Therefore the fixed point iteration

x_{n+1} = 1 + 0.5 sin x_n

will converge for E1.

For the second case E2,

g(x) = 3 + 2 sin x,   g'(x) = 2 cos x
g'(α) = 2 cos(3.09438341304928) ≈ −1.998

Therefore the fixed point iteration

x_{n+1} = 3 + 2 sin x_n

will diverge for E2.

Consider the example x^2 − 5 = 0.

(I1) g(x) = 5 + x − x^2, g'(x) = 1 − 2x, g'(α) = 1 − 2√5 < −1. Thus x_n = g(x_{n−1}) does not converge to √5.

(I2) g(x) = 5/x, g'(x) = −5/x^2, g'(α) = −1. Therefore x_n = g(x_{n−1}) can be either convergent or divergent; the numerical results show divergence (the iterates simply oscillate between 1 and 5).

(I3) g(x) = 1 + x − x^2/5, g'(x) = 1 − 2x/5, g'(α) = 1 − 2√5/5 ≈ 0.106. Thus x_n = g(x_{n−1}) converges to √5. Moreover, we have

|α − x_{n+1}| ≈ 0.106 |α − x_n|

if x_n is sufficiently close to α. The errors are decreasing with a linear rate of 0.106.

(I4) g(x) = (x + 5/x)/2, g'(x) = (1 − 5/x^2)/2, g'(α) = 0. The sequence x_n = g(x_{n−1}) will converge to √5, with an order of convergence bigger than 1.

Sometimes it is difficult to express the equation f(x) = 0 in the form x = g(x) such that the resulting iterates will converge. Such a process is presented in the following examples.

Example 1. Let x^4 − x − 1 = 0, rewritten as

x = (1 + x)^{1/4},

which provides us with the iteration

x_0 = 1,   x_{n+1} = (1 + x_n)^{1/4},   n ≥ 0

This sequence will converge to α ≈ 1.2207.

Example 2. Let x^3 + x − 1 = 0, rewritten as

x = 1/(1 + x^2)

and its fixed point iteration

x_0 = 1,   x_{n+1} = 1/(1 + x_n^2),   n ≥ 0,

which will converge to α ≈ 0.6823. The iterations are represented graphically in the following figure.

[Figure: cobweb diagram of x_{n+1} = 1/(1 + x_n^2) converging to α = 0.6823, followed by the four generic cases of x_{n+1} = g(x_n) against y = x: 0 < g'(α) < 1 (monotone convergence), −1 < g'(α) < 0 (oscillating convergence), g'(α) > 1 (monotone divergence), and g'(α) < −1 (oscillating divergence).]

Besides convergence, we would like to know how fast the sequence x_n = g(x_{n−1}) converges to the solution; in other words, how fast the error α − x_n is decreasing. We say that the sequence {x_n} converges to α with order of convergence p ≥ 1 if

|α − x_{n+1}| ≤ c |α − x_n|^p,   n ≥ 0,

where c ≥ 0 is a constant. The cases p = 1, p = 2, and p = 3 are called linear, quadratic, and cubic convergence. In the case of linear convergence, the constant c is called the rate of linear convergence, and we additionally require c < 1; otherwise the sequence of errors α − x_n can fail to converge to zero. Also, for linear convergence we can use the relation

|α − x_n| ≤ c^n |α − x_0|,   n ≥ 0.

Thus the bisection method is linearly convergent with rate 1/2, Newton's method is quadratically convergent, and the secant method has order of convergence p = (1 + √5)/2.

If |g'(α)| < 1, from the last theorem we have that the iterates x_n are at least linearly convergent. If, in addition, g'(α) ≠ 0, then we have exactly linear convergence with rate g'(α). In practice, the last theorem is rarely used, since it is quite difficult to find an interval [a, b] such that g([a, b]) ⊆ [a, b]. To simplify the usage of the theorem we consider the following corollary.

Corollary. Assume x = g(x) has a solution α, and further assume that both g(x) and g'(x) are continuous for all x in some interval about α. In addition, assume

|g'(α)| < 1     (**)

Then for any sufficiently small number ε > 0, the interval [a, b] = [α − ε, α + ε] will satisfy the hypotheses of the preceding theorem.

This means that if (**) is true, and if we choose x_0 sufficiently close to α, then the fixed point iteration x_{n+1} = g(x_n) will converge and the earlier results S1–S4 will all hold. The corollary does not tell us how close we need to be to α in order to have convergence.

NEWTON'S METHOD

Newton's method

x_{n+1} = x_n − f(x_n)/f'(x_n)

is a fixed point iteration with

g(x) = x − f(x)/f'(x)

Check its convergence by checking the condition (**):

g'(x) = 1 − f'(x)/f'(x) + f(x) f''(x)/[f'(x)]^2 = f(x) f''(x)/[f'(x)]^2

g'(α) = 0

Therefore Newton's method will converge if x_0 is chosen sufficiently close to α.

HIGHER ORDER METHODS

What happens when g'(α) = 0? We use Taylor's theorem to answer this question.

Begin by writing

g(x) = g(α) + g'(α)(x − α) + (1/2) g''(c)(x − α)^2

with c between x and α. Substitute x = x_n, and recall that g(x_n) = x_{n+1} and g(α) = α. Also assume g'(α) = 0. Then

x_{n+1} = α + (1/2) g''(c_n)(x_n − α)^2

α − x_{n+1} = −(1/2) g''(c_n)(x_n − α)^2

with c_n between α and x_n. Thus if g'(α) = 0, the fixed point iteration is quadratically convergent or better. In fact, if g''(α) ≠ 0, then the iteration is exactly quadratically convergent.

ANOTHER RAPID ITERATION

Newton's method is rapid, but requires use of the derivative f'(x). Can we get by without it? The answer is yes! Consider the method

D_n = [f(x_n + f(x_n)) − f(x_n)] / f(x_n)

x_{n+1} = x_n − f(x_n)/D_n

This is an approximation to Newton's method, with f'(x_n) ≈ D_n. To analyze its convergence, regard it as a fixed point iteration with

D(x) = [f(x + f(x)) − f(x)] / f(x)

g(x) = x − f(x)/D(x)

Then we can, with some difficulty, show g'(α) = 0 and g''(α) ≠ 0. This proves the new iteration is quadratically convergent.
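A minimal MATLAB sketch of this derivative-free iteration (the test equation x^2 − 5 = 0, the starting guess, and the tolerance are our own choices, for illustration):

    % Derivative-free Newton-like iteration: f'(x_n) is replaced by D_n
    f = @(x) x.^2 - 5;
    x = 2.5;                                  % assumed starting guess
    for n = 1:20
        D = (f(x + f(x)) - f(x)) / f(x);      % difference approximation to f'(x)
        x = x - f(x)/D;
        if abs(f(x)) < 1e-12, break, end
    end
    disp(x)                                   % approx. sqrt(5) = 2.23607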

FIXED POINT ITERATION: ERROR

Recall the result

lim_{n→∞} (α − x_n)/(α − x_{n−1}) = g'(α)

for the iteration

x_n = g(x_{n−1}),   n = 1, 2, ...

Thus

α − x_n ≈ λ (α − x_{n−1})     (***)

with λ = g'(α) and |λ| < 1.

If we were to know λ, then we could solve (***) for α:

α ≈ (x_n − λ x_{n−1}) / (1 − λ)

Usually, we write this as a modification of the currently computed iterate x_n:

α ≈ (x_n − λ x_{n−1}) / (1 − λ)
  = (x_n − λ x_n)/(1 − λ) + (λ x_n − λ x_{n−1})/(1 − λ)
  = x_n + (λ/(1 − λ)) [x_n − x_{n−1}]

The formula

x_n + (λ/(1 − λ)) [x_n − x_{n−1}]

is said to be an extrapolation of the numbers x_{n−1} and x_n. But what is λ?

From

lim_{n→∞} (α − x_n)/(α − x_{n−1}) = g'(α)

we have

λ ≈ (α − x_n)/(α − x_{n−1})

Unfortunately this also involves the unknown root α which we seek, and we must find some other way of estimating λ.

To calculate λ, consider the ratio

λ_n = (x_n − x_{n−1}) / (x_{n−1} − x_{n−2})

To see that this is approximately λ as x_n approaches α, write

(x_n − x_{n−1}) / (x_{n−1} − x_{n−2}) = (g(x_{n−1}) − g(x_{n−2})) / (x_{n−1} − x_{n−2}) = g'(c_n)

with c_n between x_{n−1} and x_{n−2}. As the iterates approach α, the number c_n must also approach α. Thus λ_n approaches λ as x_n → α.

We combine these results to obtain the estimate

x̂_n = x_n + (λ_n/(1 − λ_n)) [x_n − x_{n−1}],   λ_n = (x_n − x_{n−1})/(x_{n−1} − x_{n−2})

We call x̂_n the Aitken extrapolate of {x_{n−2}, x_{n−1}, x_n}; and α ≈ x̂_n.

We can also rewrite this as

α − x_n ≈ x̂_n − x_n = (λ_n/(1 − λ_n)) [x_n − x_{n−1}]

This is called Aitken's error estimation formula.

The accuracy of these procedures is tied directly to the accuracy of the formulas

α − x_n ≈ λ (α − x_{n−1}),   α − x_{n−1} ≈ λ (α − x_{n−2})

If these are accurate, then so are the above extrapolation and error estimation formulas.

EXAMPLE

Consider the iteration

x_{n+1} = 6.28 + sin(x_n),   n = 0, 1, 2, ...

for solving

x = 6.28 + sin x

Iterates are shown on the accompanying sheet, including calculations of λ_n and the error estimate

α − x_n ≈ x̂_n − x_n = (λ_n/(1 − λ_n)) [x_n − x_{n−1}]     (Estimate)

The latter is called "Estimate" in the table. In this instance,

g'(α) ≈ 0.9644,

and therefore the convergence is very slow. This is apparent in the table.

AITKEN'S ALGORITHM

Step 1: Select x_0.

Step 2: Calculate

x_1 = g(x_0),   x_2 = g(x_1)

Step 3: Calculate

x_3 = x_2 + (λ_2/(1 − λ_2)) [x_2 − x_1],   λ_2 = (x_2 − x_1)/(x_1 − x_0)

Step 4: Calculate

x_4 = g(x_3),   x_5 = g(x_4)

and calculate x_6 as the extrapolate of {x_3, x_4, x_5}. Continue this procedure, ad infinitum.

Of course in practice we will have some kind of error test to stop this procedure when we believe we have sufficient accuracy.
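A short MATLAB sketch of this algorithm for the iteration x_{n+1} = 6.28 + sin(x_n) used in the examples (the starting guess is assumed):

    % Aitken's algorithm for x = g(x) with g(x) = 6.28 + sin(x)
    g = @(x) 6.28 + sin(x);
    x = 6;                                    % assumed starting guess x0
    for k = 1:3                               % each pass produces x3, x6, ...
        x1 = g(x);  x2 = g(x1);               % two fixed point steps
        lam = (x2 - x1)/(x1 - x);             % estimate of g'(alpha)
        x = x2 + lam/(1 - lam)*(x2 - x1);     % Aitken extrapolate of {x, x1, x2}
    end
    disp(x)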

EXAMPLE

Consider again the iteration

x_{n+1} = 6.28 + sin(x_n),   n = 0, 1, 2, ...

for solving x = 6.28 + sin x. Now we use the Aitken method, and the results are shown in the accompanying table. With this we have

α − x_3 = 7.98 × 10^{−4},   α − x_6 = 2.27 × 10^{−6}

In comparison, the original iteration had

α − x_6 = 1.23 × 10^{−2}

GENERAL COMMENTS

Aitken extrapolation can greatly accelerate the convergence of a linearly convergent iteration

x_{n+1} = g(x_n)

This shows the power of understanding the behaviour of the error in a numerical process. From that understanding, we can often improve the accuracy, through extrapolation or some other procedure.

This is a justification for using mathematical analysis to understand numerical methods. We will see this repeated at later points in the course, and it holds with many different types of problems and numerical methods for their solution.

MULTIPLE ROOTS

We study two classes of functions for which there is additional difficulty in calculating their roots. The first of these are functions in which the desired root has a multiplicity greater than 1. What does this mean?

Let α be a root of the function f(x), and imagine writing it in the factored form

f(x) = (x − α)^m h(x)

with some integer m ≥ 1 and some continuous function h(x) for which h(α) ≠ 0. Then we say that α is a root of f(x) of multiplicity m. For example, the function

f(x) = e^{x^2} − 1

has x = 0 as a root of multiplicity m = 2. In particular, define

h(x) = (e^{x^2} − 1)/x^2

for x ≠ 0.

Using Taylor polynomial approximations, we can show for x ≠ 0 that

h(x) ≈ 1 + x^2/2 + x^4/6

lim_{x→0} h(x) = 1

This leads us to extend the definition of h(x) to

h(x) = (e^{x^2} − 1)/x^2,   x ≠ 0;   h(0) = 1

Thus

f(x) = x^2 h(x)

as asserted, and x = 0 is a root of f(x) of multiplicity m = 2.

Roots for which m = 1 are called simple roots, and the methods studied to this point were intended for such roots. We now consider the case m > 1.

If the function f(x) is m-times differentiable around α, then we can differentiate

f(x) = (x − α)^m h(x)

m times to obtain an equivalent formulation of what it means for the root to have multiplicity m.

As an example, consider the case

f(x) = (x − α)^3 h(x)

Then

f'(x) = 3(x − α)^2 h(x) + (x − α)^3 h'(x) ≡ (x − α)^2 h_2(x),
h_2(x) = 3h(x) + (x − α) h'(x),   h_2(α) = 3h(α) ≠ 0

This shows α is a root of f'(x) of multiplicity 2. Differentiating a second time, we can show

f''(x) = (x − α) h_3(x)

for a suitably defined h_3(x) with h_3(α) ≠ 0, and α is a simple root of f''(x).

Differentiating a third time, we have

f'''(α) = h_3(α) ≠ 0

We can use this as part of a proof of the following: α is a root of f(x) of multiplicity m = 3 if and only if

f(α) = f'(α) = f''(α) = 0,   f'''(α) ≠ 0

In general, α is a root of f(x) of multiplicity m if and only if

f(α) = · · · = f^{(m−1)}(α) = 0,   f^{(m)}(α) ≠ 0

DIFFICULTIES OF MULTIPLE ROOTS

There are two main difficulties with the numerical calculation of multiple roots (by which we mean m > 1 in the definition).

1. Methods such as Newton's method and the secant method converge more slowly than for the case of a simple root.

2. There is a large interval of uncertainty in the precise location of a multiple root on a computer or calculator.

The second of these is the more difficult to deal with, but we begin with the first, for the case of Newton's method.

Recall that we can regard Newton's method as a fixed point method:

x_{n+1} = g(x_n),   g(x) = x − f(x)/f'(x)

Then we substitute

f(x) = (x − α)^m h(x)

to obtain

g(x) = x − (x − α)^m h(x) / [m (x − α)^{m−1} h(x) + (x − α)^m h'(x)]
     = x − (x − α) h(x) / [m h(x) + (x − α) h'(x)]

Then we can use this to show

g'(α) = 1 − 1/m = (m − 1)/m

For m > 1, this is nonzero, and therefore Newton's method is only linearly convergent:

α − x_{n+1} ≈ λ (α − x_n),   λ = (m − 1)/m

Similar results hold for the secant method.

There are ways of improving the speed of convergence of Newton's method, creating a modified method that is again quadratically convergent. In particular, consider the fixed point iteration formula

x_{n+1} = g(x_n),   g(x) = x − m f(x)/f'(x),

in which we assume the multiplicity m of the root α being sought to be known. Then, modifying the above argument on the convergence of Newton's method, we obtain

g'(α) = 1 − m · (1/m) = 0

and the iteration method will be quadratically convergent.

But this is not the fundamental problem posed by multiple roots.
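A sketch of the modified iteration, applied to the example considered below, f(x) = (x − 1.1)^3 (x − 2.1) with m = 3 (the starting guess is our own choice):

    % Modified Newton for a root of multiplicity m = 3 at alpha = 1.1
    f  = @(x) (x - 1.1).^3 .* (x - 2.1);
    df = @(x) 3*(x - 1.1).^2 .* (x - 2.1) + (x - 1.1).^3;
    m = 3;  x = 0.8;
    for n = 1:6
        if f(x) == 0, break, end      % guard: we may land on the root exactly
        x = x - m*f(x)/df(x);         % multiplicity-corrected Newton step
    end
    disp(x)                           % converges quadratically to 1.1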

NOISE IN FUNCTION EVALUATION

Recall the discussion of noise in evaluating a function f(x), and in our case consider the evaluation for values of x near to α. In the accompanying figures, the noise as measured by vertical distance is the same in both graphs.

[Figure: the same band of vertical noise around y = f(x), shown near a simple root and near a double root.]

Noise was discussed earlier; as an example we used the function

f(x) = x^3 − 3x^2 + 3x − 1 ≡ (x − 1)^3

Because of the noise in evaluating f(x), it appears from the graph that f(x) has many zeros around x = 1, whereas the exact function outside of the computer has only the root α = 1, of multiplicity 3. Any rootfinding method to find a multiple root α that uses evaluation of f(x) is doomed to having a large interval of uncertainty as to the location of the root. If high accuracy is desired, then the only satisfactory solution is to reformulate the problem as a new problem F(x) = 0 in which α is a simple root of F. Then use a standard rootfinding method to calculate α. It is important that the evaluation of F(x) not involve f(x) directly, as that is the source of the noise and the uncertainty.

EXAMPLE

Consider finding the roots of

f(x) = (x − 1.1)^3 (x − 2.1) = 2.7951 − 8.954x + 10.56x^2 − 5.4x^3 + x^4

This has a root of multiplicity 3 at α = 1.1. Newton's method produces the iterates:

n   x_n        f(x_n)    α − x_n    Rate
0   0.800000   0.03510   0.300000
1   0.892857   0.01073   0.207143   0.690
2   0.958176   0.00325   0.141824   0.685
3   1.00344    0.00099   0.09656    0.681
4   1.03486    0.00029   0.06514    0.675
5   1.05581    0.00009   0.04419    0.678
6   1.07028    0.00003   0.02972    0.673
7   1.08092    0.0       0.01908    0.642

From an examination of the rate of linear convergence of Newton's method applied to this function, one can guess with high probability that the multiplicity is m = 3. Then form exactly the second derivative

f''(x) = 21.12 − 32.4x + 12x^2

Applying Newton's method to f''(x) = 0 with a guess of x_0 = 1 will lead to rapid convergence to α = 1.1.

In general, if we know the root α has multiplicity m > 1, then we can replace the problem by that of solving

f^{(m−1)}(x) = 0,

since α is a simple root of this equation.

STABILITY

Generally we expect the world to be stable. By this, we mean that if we make a small change in something, then we expect this to lead to other correspondingly small changes. In fact, if we think about this carefully, we know this need not be true. We now illustrate this for the case of rootfinding.

Consider the polynomial

f(x) = x^7 − 28x^6 + 322x^5 − 1960x^4 + 6769x^3 − 13132x^2 + 13068x − 5040

This has the exact roots {1, 2, 3, 4, 5, 6, 7}. Now consider the perturbed polynomial

F(x) = x^7 − 28.002x^6 + 322x^5 − 1960x^4 + 6769x^3 − 13132x^2 + 13068x − 5040

This is a relatively small change in one coefficient, of relative error

−.002/(−28) = 7.14 × 10^{−5}

What are the roots of F(x)?

Root of f(x)   Root of F(x)               Error
1              1.0000028                  −2.8E−6
2              1.9989382                   1.1E−3
3              3.0331253                  −0.033
4              3.8195692                   0.180
5              5.4586758 + .54012578i     −.46 − .54i
6              5.4586758 − .54012578i     −.46 + .54i
7              7.2330128                  −0.233
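These values can be reproduced in MATLAB with roots, which accepts the coefficient vector of a polynomial:

    % Compare the roots of f and of the perturbed polynomial F
    c  = [1 -28 322 -1960 6769 -13132 13068 -5040];  % coefficients of f
    cp = c;  cp(2) = -28.002;                        % perturb the x^6 coefficient
    [roots(c), roots(cp)]                            % two columns of 7 roots each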

Why have some of the roots departed so radically from the original values? This phenomenon goes under a variety of names. We sometimes say this is an example of an unstable or ill-conditioned rootfinding problem. These words are often used in a casual manner, but they also have a very precise meaning in many areas of numerical analysis (and more generally, in all of mathematics).

A PERTURBATION ANALYSIS

We want to study what happens to the root of a function f(x) when it is perturbed by a small amount. For some function g(x) and for all small ε, define a perturbed function

F_ε(x) = f(x) + ε g(x)

The polynomial example would fit this if we use

g(x) = x^6,   ε = −.002

Let α_0 be a simple root of f(x). It can be shown (using the implicit function theorem from calculus) that if f(x) and g(x) are differentiable for x ≈ α_0, and if f'(α_0) ≠ 0, then F_ε(x) has a unique simple root α(ε) near to α_0 = α(0) for all small values of ε. Moreover, α(ε) will be a differentiable function of ε. We use this to estimate α(ε).

The linear Taylor polynomial approximation of α(ε) is given by

α(ε) ≈ α(0) + ε α'(0)

We need to find a formula for α'(0). Recall that

F_ε(α(ε)) = 0

for all small values of ε. Differentiate this as a function of ε, using the chain rule. Then we obtain

f'(α(ε)) α'(ε) + g(α(ε)) + ε g'(α(ε)) α'(ε) = 0

for all small ε. Substitute ε = 0, recall α(0) = α_0, and solve for α'(0) to obtain

f'(α_0) α'(0) + g(α_0) = 0
α'(0) = −g(α_0)/f'(α_0)

This then leads to

α(ε) ≈ α(0) + ε α'(0) = α_0 − ε g(α_0)/f'(α_0)     (*)

Example. In our earlier polynomial example, consider the simple root α_0 = 3. Then g(α_0) = 3^6 = 729 and f'(3) = 48, so

α(ε) ≈ 3 − ε · 729/48 ≈ 3 − 15.2 ε

With ε = −.002, we obtain

α(−.002) ≈ 3 − 15.2 × (−.002) ≈ 3.0304

This is close to the actual root 3.0331253 of F(x).

However, the approximation (*) is not good at estimating the change in the roots 5 and 6. By observation, the perturbation in those roots is a complex number, whereas the formula (*) predicts only a real perturbation. The value of ε is too large for (*) to be accurate for the roots 5 and 6.

DISCUSSION

Looking again at the formula

α(ε) ≈ α_0 − ε g(α_0)/f'(α_0),

we have that the size of

ε g(α_0)/f'(α_0)

is an indication of the stability of the solution α_0. If this quantity is large, then potentially we will have difficulty. Of course, not all functions g(x) are equally likely, and we need to look only at functions g(x) that will possibly occur in practice.

One quantity of interest is the size of f'(α_0). If it is very small relative to ε g(α_0), then we are likely to have difficulty in finding α_0 accurately.

INTERPOLATION

Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).

As an example, consider defining

x_0 = 0,   x_1 = π/4,   x_2 = π/2

and

y_i = cos x_i,   i = 0, 1, 2

This gives us the three points

(0, 1),   (π/4, 1/√2),   (π/2, 0)

Now find a quadratic polynomial

p(x) = a_0 + a_1 x + a_2 x^2

for which

p(x_i) = y_i,   i = 0, 1, 2

The graph of this polynomial is shown in the accompanying figure. We later give an explicit formula.

[Figure: quadratic interpolation of cos(x), showing y = cos(x) and y = p_2(x) on [0, π/2] with nodes at 0, π/4, π/2.]

PURPOSES OF INTERPOLATION

1. Replace a set of data points {(x_i, y_i)} with a function given analytically.

2. Approximate functions with simpler ones, usually polynomials or 'piecewise polynomials'.

Purpose #1 has several aspects.

• The data may be from a known class of functions. Interpolation is then used to find the member of this class of functions that agrees with the given data. For example, data may be generated from functions of the form

p(x) = a_0 + a_1 e^x + a_2 e^{2x} + · · · + a_n e^{nx}

Then we need to find the coefficients {a_j} based on the given data values.

• We may want to take function values f(x) given in a table for selected values of x, often equally spaced, and extend the function to values of x not in the table. For example, given numbers from a table of logarithms, estimate the logarithm of a number x not in the table.

• Given a set of data points {(x_i, y_i)}, find a curve passing through these points that is "pleasing to the eye". In fact, this is what is done continually with computer graphics. How do we connect a set of points to make a smooth curve? Connecting them with straight line segments will often give a curve with many corners, whereas what was intended was a smooth curve.

Purpose #2 for interpolation is to approximate functions f(x) by simpler functions p(x), perhaps to make it easier to integrate or differentiate f(x). That will be the primary reason for studying interpolation in this course.

As an example of why this is important, consider the problem of evaluating

I = ∫_0^1 dx/(1 + x^{10})

This is very difficult to do analytically. But we will look at producing polynomial interpolants of the integrand; and polynomials are easily integrated exactly.

We begin by using polynomials as our means of doing interpolation. Later in the chapter, we consider more complex 'piecewise polynomial' functions, often called 'spline functions'.

LINEAR INTERPOLATION

The simplest form of interpolation is probably the straight line, connecting two points by a straight line.

Let two data points (x_0, y_0) and (x_1, y_1) be given. There is a unique straight line passing through these points. We can write the formula for a straight line as

P_1(x) = a_0 + a_1 x

In fact, there are other more convenient ways to write it, and we give several of them below:

P_1(x) = ((x − x_1)/(x_0 − x_1)) y_0 + ((x − x_0)/(x_1 − x_0)) y_1
       = [(x_1 − x) y_0 + (x − x_0) y_1] / (x_1 − x_0)
       = y_0 + ((x − x_0)/(x_1 − x_0)) [y_1 − y_0]
       = y_0 + ((y_1 − y_0)/(x_1 − x_0)) (x − x_0)

Check each of these by evaluating them at x = x_0 and x_1 to see if the respective values are y_0 and y_1.

Example. Following is a table of values of f(x) = tan x for a few values of x:

x       1        1.1      1.2      1.3
tan x   1.5574   1.9648   2.5722   3.6021

Use linear interpolation to estimate tan(1.15). We use

x_0 = 1.1,   x_1 = 1.2

with the corresponding values for y_0 and y_1. Then

tan x ≈ y_0 + ((x − x_0)/(x_1 − x_0)) [y_1 − y_0]

tan(1.15) ≈ 1.9648 + ((1.15 − 1.1)/(1.2 − 1.1)) [2.5722 − 1.9648] = 2.2685

The true value is tan 1.15 = 2.2345. We will want to examine formulas for the error in interpolation, to know when we have sufficient accuracy in our interpolant.
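The same estimate can be obtained with MATLAB's interp1, whose default method is piecewise linear interpolation:

    % Linear interpolation of the tan table at x = 1.15
    x = [1 1.1 1.2 1.3];
    y = [1.5574 1.9648 2.5722 3.6021];
    interp1(x, y, 1.15)      % returns 2.2685; the true value is tan(1.15) = 2.2345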

[Figure: y = tan(x) on [1, 1.3], and the linear interpolant p_1(x) through the nodes 1.1 and 1.2.]

QUADRATIC INTERPOLATION

We want to find a polynomial

P_2(x) = a_0 + a_1 x + a_2 x^2

which satisfies

P_2(x_i) = y_i,   i = 0, 1, 2

for given data points (x_0, y_0), (x_1, y_1), (x_2, y_2). One formula for such a polynomial is

P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x)     (∗∗)

with

L_0(x) = (x − x_1)(x − x_2) / [(x_0 − x_1)(x_0 − x_2)]
L_1(x) = (x − x_0)(x − x_2) / [(x_1 − x_0)(x_1 − x_2)]
L_2(x) = (x − x_0)(x − x_1) / [(x_2 − x_0)(x_2 − x_1)]

The formula (∗∗) is called Lagrange's form of the interpolation polynomial.

LAGRANGE BASIS FUNCTIONS

The functions

L_0(x) = (x − x_1)(x − x_2) / [(x_0 − x_1)(x_0 − x_2)]
L_1(x) = (x − x_0)(x − x_2) / [(x_1 − x_0)(x_1 − x_2)]
L_2(x) = (x − x_0)(x − x_1) / [(x_2 − x_0)(x_2 − x_1)]

are called 'Lagrange basis functions' for quadratic interpolation. They have the properties

L_i(x_j) = 1 if i = j,  and 0 if i ≠ j,

for i, j = 0, 1, 2. Also, they all have degree 2. Their graphs are on an accompanying page.

As a consequence of each L_i(x) being of degree 2, we have that the interpolant

P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x)

must have degree ≤ 2.

UNIQUENESS

Can there be another polynomial, call it Q(x), for which

deg(Q) ≤ 2,   Q(x_i) = y_i,   i = 0, 1, 2?

That is, is the Lagrange formula P_2(x) unique?

Introduce

R(x) = P_2(x) − Q(x)

From the properties of P_2 and Q, we have deg(R) ≤ 2. Moreover,

R(x_i) = P_2(x_i) − Q(x_i) = y_i − y_i = 0

for all three node points x_0, x_1, and x_2. How many polynomials R(x) are there of degree at most 2 having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore

R(x) = 0 for all x,   i.e.   Q(x) = P_2(x) for all x

SPECIAL CASES

Consider the data points

(x_0, 1), (x_1, 1), (x_2, 1)

What is the polynomial P_2(x) in this case?

Answer: the polynomial interpolant is

P_2(x) ≡ 1,

meaning that P_2(x) is the constant function. Why? First, the constant function satisfies the property of being of degree ≤ 2. Next, it clearly interpolates the given data. Therefore, by the uniqueness of quadratic interpolation, P_2(x) must be the constant function 1.

Consider now the data points

(x_0, m x_0), (x_1, m x_1), (x_2, m x_2)

for some constant m. What is P_2(x) in this case? By an argument similar to that above,

P_2(x) = m x for all x

Thus the degree of P_2(x) can be less than 2.

HIGHER DEGREE INTERPOLATION

We consider now the case of interpolation by polynomials of a general degree n. We want to find a polynomial P_n(x) for which

deg(P_n) ≤ n,   P_n(x_i) = y_i,   i = 0, 1, ..., n     (∗∗)

with given data points

(x_0, y_0), (x_1, y_1), ..., (x_n, y_n)

The solution is given by Lagrange's formula

P_n(x) = y_0 L_0(x) + y_1 L_1(x) + · · · + y_n L_n(x)

The Lagrange basis functions are given by

L_k(x) = [(x − x_0) ··· (x − x_{k−1})(x − x_{k+1}) ··· (x − x_n)] / [(x_k − x_0) ··· (x_k − x_{k−1})(x_k − x_{k+1}) ··· (x_k − x_n)]

for k = 0, 1, 2, ..., n. The quadratic case was covered earlier.

In a manner analogous to the quadratic case, we can show that the above P_n(x) is the only solution to the problem (∗∗).

In the formula

L_k(x) = [(x − x_0) ··· (x − x_{k−1})(x − x_{k+1}) ··· (x − x_n)] / [(x_k − x_0) ··· (x_k − x_{k−1})(x_k − x_{k+1}) ··· (x_k − x_n)]

we can see that each such function is a polynomial of degree n. In addition,

L_k(x_i) = 1 if k = i,  and 0 if k ≠ i

Using these properties, it follows that the formula

P_n(x) = y_0 L_0(x) + y_1 L_1(x) + · · · + y_n L_n(x)

satisfies the interpolation problem of finding a solution to

deg(P_n) ≤ n,   P_n(x_i) = y_i,   i = 0, 1, ..., n

EXAMPLE

Recall the table

x       1        1.1      1.2      1.3
tan x   1.5574   1.9648   2.5722   3.6021

We now interpolate this table with the nodes

x_0 = 1,   x_1 = 1.1,   x_2 = 1.2,   x_3 = 1.3

Without giving the details of the evaluation process, we have the following results for interpolation with degrees n = 1, 2, 3:

n            1        2        3
P_n(1.15)    2.2685   2.2435   2.2296
Error       −.0340   −.0090    .0049

The result improves with increasing degree n, but not at a very rapid rate. In fact, the error becomes worse when n is increased further. Later we will see that interpolation of a much higher degree, say n ≥ 10, is often poorly behaved when the node points {x_i} are evenly spaced.
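One way to reproduce the n = 3 entry in MATLAB: polyfit through n + 1 points with degree n gives the interpolating polynomial.

    % Degree-3 polynomial interpolation of the tan table, evaluated at 1.15
    x = [1 1.1 1.2 1.3];
    y = [1.5574 1.9648 2.5722 3.6021];
    p = polyfit(x, y, 3);    % 4 points, degree 3: interpolates the data exactly
    polyval(p, 1.15)         % approx. 2.2296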

A FIRST ORDER DIVIDED DIFFERENCE

For a given function f(x) and two distinct points x_0 and x_1, define

f[x_0, x_1] = (f(x_1) − f(x_0)) / (x_1 − x_0)

This is called a first order divided difference of f(x). By the mean-value theorem,

f(x_1) − f(x_0) = f'(c)(x_1 − x_0)

for some c between x_0 and x_1. Thus

f[x_0, x_1] = f'(c),

and the divided difference is very much like the derivative, especially if x_0 and x_1 are quite close together. In fact,

f'((x_1 + x_0)/2) ≈ f[x_0, x_1]

is quite an accurate approximation of the derivative.

SECOND ORDER DIVIDED DIFFERENCES

Given three distinct points x_0, x_1, and x_2, define

f[x_0, x_1, x_2] = (f[x_1, x_2] − f[x_0, x_1]) / (x_2 − x_0)

This is called the second order divided difference of f(x). By a fairly complicated argument, we can show

f[x_0, x_1, x_2] = (1/2) f''(c)

for some c intermediate to x_0, x_1, and x_2. In fact, as we will investigate,

f''(x_1) ≈ 2 f[x_0, x_1, x_2]

in the case the nodes are evenly spaced, x_1 − x_0 = x_2 − x_1.

EXAMPLE

Consider the table

x       1        1.1      1.2      1.3      1.4
cos x   .54030   .45360   .36236   .26750   .16997

Let x_0 = 1, x_1 = 1.1, and x_2 = 1.2. Then

f[x_0, x_1] = (.45360 − .54030)/(1.1 − 1) = −.86700
f[x_1, x_2] = (.36236 − .45360)/(1.2 − 1.1) = −.91240

f[x_0, x_1, x_2] = (f[x_1, x_2] − f[x_0, x_1]) / (x_2 − x_0)
                 = (−.91240 − (−.86700))/(1.2 − 1.0) = −.22700

For comparison,

f'((x_1 + x_0)/2) = −sin(1.05) = −.86742

(1/2) f''(x_1) = −(1/2) cos(1.1) = −.22680

GENERAL DIVIDED DIFFERENCES

Given n + 1 distinct points x_0, ..., x_n, with n ≥ 2, define

f[x_0, ..., x_n] = (f[x_1, ..., x_n] − f[x_0, ..., x_{n−1}]) / (x_n − x_0)

This is a recursive definition of the nth-order divided difference of f(x), using divided differences of order n − 1. Its relation to the derivative is as follows:

f[x_0, ..., x_n] = (1/n!) f^{(n)}(c)

for some c intermediate to the points {x_0, ..., x_n}. Let I denote the interval

I = [min{x_0, ..., x_n}, max{x_0, ..., x_n}]

Then c ∈ I, and the above result is based on the assumption that f(x) is n-times continuously differentiable on the interval I.

EXAMPLE

The following table gives divided differences for the data in

x       1        1.1      1.2      1.3      1.4
cos x   .54030   .45360   .36236   .26750   .16997

For the column headings, we use

D^k f(x_i) = f[x_i, ..., x_{i+k}]

i   x_i   f(x_i)   Df(x_i)   D^2 f(x_i)   D^3 f(x_i)   D^4 f(x_i)
0   1.0   .54030   −.8670    −.2270        .1533        .0125
1   1.1   .45360   −.9124    −.1810        .1583
2   1.2   .36236   −.9486    −.1335
3   1.3   .26750   −.9753
4   1.4   .16997

These were computed using the recursive definition

f[x_0, ..., x_n] = (f[x_1, ..., x_n] − f[x_0, ..., x_{n−1}]) / (x_n − x_0)
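A sketch of how such a table can be built column by column in MATLAB (the array name D is ours; row 1 ends up holding f[x_0], f[x_0, x_1], ...):

    % Divided difference table for f(x) = cos(x) at the five nodes
    x = [1.0 1.1 1.2 1.3 1.4];
    n = numel(x);
    D = zeros(n);  D(:,1) = cos(x)';      % column 1: the values f(x_i)
    for k = 2:n                           % column k: differences of order k-1
        for i = 1:n-k+1
            D(i,k) = (D(i+1,k-1) - D(i,k-1)) / (x(i+k-1) - x(i));
        end
    end
    D(1,:)    % .54030  -.8670  -.2270  .1533  .0125, as in the table above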

ORDER OF THE NODES

Looking at f[x_0, x_1], we have

f[x_0, x_1] = (f(x_1) − f(x_0))/(x_1 − x_0) = (f(x_0) − f(x_1))/(x_0 − x_1) = f[x_1, x_0]

The order of x_0 and x_1 does not matter. Looking at

f[x_0, x_1, x_2] = (f[x_1, x_2] − f[x_0, x_1])/(x_2 − x_0),

we can expand it to get

f[x_0, x_1, x_2] = f(x_0)/[(x_0 − x_1)(x_0 − x_2)] + f(x_1)/[(x_1 − x_0)(x_1 − x_2)] + f(x_2)/[(x_2 − x_0)(x_2 − x_1)]

With this formula, we can show that the order of the arguments x_0, x_1, x_2 does not matter in the final value of f[x_0, x_1, x_2] we obtain. Mathematically,

f[x_0, x_1, x_2] = f[x_{i_0}, x_{i_1}, x_{i_2}]

for any permutation (i_0, i_1, i_2) of (0, 1, 2).

We can show in general that the value of f[x_0, ..., x_n] is independent of the order of the arguments {x_0, ..., x_n}, even though the intermediate steps in its calculation using

f[x_0, ..., x_n] = (f[x_1, ..., x_n] − f[x_0, ..., x_{n−1}]) / (x_n − x_0)

are order dependent.

We can show

f[x_0, ..., x_n] = f[x_{i_0}, ..., x_{i_n}]

for any permutation (i_0, i_1, ..., i_n) of (0, 1, ..., n).

COINCIDENT NODES

What happens when some of the nodes {x_0, ..., x_n} are not distinct? Begin by investigating what happens when they all come together as a single point x_0.

For first order divided differences, we have

lim_{x_1→x_0} f[x_0, x_1] = lim_{x_1→x_0} (f(x_1) − f(x_0))/(x_1 − x_0) = f'(x_0)

We extend the definition of f[x_0, x_1] to coincident nodes using

f[x_0, x_0] = f'(x_0)

For second order divided differences, recall

f[x_0, x_1, x_2] = (1/2) f''(c)

with c intermediate to x_0, x_1, and x_2. Then as x_1 → x_0 and x_2 → x_0, we must also have c → x_0. Therefore,

lim_{x_1,x_2→x_0} f[x_0, x_1, x_2] = (1/2) f''(x_0)

We therefore define

f[x_0, x_0, x_0] = (1/2) f''(x_0)

For the case of general f[x_0, ..., x_n], recall that

f[x_0, ..., x_n] = (1/n!) f^{(n)}(c)

for some c intermediate to {x_0, ..., x_n}. Then

lim_{{x_1,...,x_n}→x_0} f[x_0, ..., x_n] = (1/n!) f^{(n)}(x_0),

and we define

f[x_0, ..., x_0]  (n + 1 arguments)  = (1/n!) f^{(n)}(x_0)

What do we do when only some of the nodes are coincident? This too can be dealt with, although we do so here only by example:

f[x_0, x_1, x_1] = (f[x_1, x_1] − f[x_0, x_1]) / (x_1 − x_0) = (f'(x_1) − f[x_0, x_1]) / (x_1 − x_0)

The recursion formula can be used in general in this way to allow all possible combinations of possibly coincident nodes.

LAGRANGE'S FORMULA FOR THE INTERPOLATION POLYNOMIAL

Recall the general interpolation problem: find a polynomial P_n(x) for which

deg(P_n) ≤ n,   P_n(x_i) = y_i,   i = 0, 1, ..., n

with given data points

(x_0, y_0), (x_1, y_1), ..., (x_n, y_n)

and with {x_0, ..., x_n} distinct points. The solution to this problem is given by Lagrange's formula

P_n(x) = y_0 L_0(x) + y_1 L_1(x) + · · · + y_n L_n(x)

with {L_0(x), ..., L_n(x)} the Lagrange basis polynomials. Each L_j is of degree n and satisfies

L_j(x_i) = 1 if j = i,  and 0 if j ≠ i,

for i = 0, 1, ..., n.

THE NEWTON DIVIDED DIFFERENCE FORM OF THE INTERPOLATION POLYNOMIAL

Let the data values for the problem

deg(P_n) ≤ n,   P_n(x_i) = y_i,   i = 0, 1, ..., n

be generated from a function f(x):

y_i = f(x_i),   i = 0, 1, ..., n

Using the divided differences

f[x_0, x_1], f[x_0, x_1, x_2], ..., f[x_0, ..., x_n]

we can write the interpolation polynomials

P_1(x), P_2(x), ..., P_n(x)

in a way that is simple to compute:

P_1(x) = f(x_0) + f[x_0, x_1](x − x_0)
P_2(x) = f(x_0) + f[x_0, x_1](x − x_0) + f[x_0, x_1, x_2](x − x_0)(x − x_1)
       = P_1(x) + f[x_0, x_1, x_2](x − x_0)(x − x_1)

For the case of the general problem, we have

P_n(x) = f(x_0) + f[x_0, x_1](x − x_0)
       + f[x_0, x_1, x_2](x − x_0)(x − x_1)
       + f[x_0, x_1, x_2, x_3](x − x_0)(x − x_1)(x − x_2)
       + · · ·
       + f[x_0, ..., x_n](x − x_0) · · · (x − x_{n−1})

From this we have the recursion relation

P_n(x) = P_{n−1}(x) + f[x_0, ..., x_n](x − x_0) · · · (x − x_{n−1}),

in which P_{n−1}(x) interpolates f(x) at the points in {x_0, ..., x_{n−1}}.

Example. Recall the table

i   x_i   f(x_i)   Df(x_i)   D^2 f(x_i)   D^3 f(x_i)   D^4 f(x_i)
0   1.0   .54030   −.8670    −.2270        .1533        .0125
1   1.1   .45360   −.9124    −.1810        .1583
2   1.2   .36236   −.9486    −.1335
3   1.3   .26750   −.9753
4   1.4   .16997

with D^k f(x_i) = f[x_i, ..., x_{i+k}], k = 1, 2, 3, 4. Then

P_1(x) = .5403 − .8670(x − 1)
P_2(x) = P_1(x) − .2270(x − 1)(x − 1.1)
P_3(x) = P_2(x) + .1533(x − 1)(x − 1.1)(x − 1.2)
P_4(x) = P_3(x) + .0125(x − 1)(x − 1.1)(x − 1.2)(x − 1.3)

Using this table and these formulas, we have the following table of interpolants for the value x = 1.05. The true value is cos(1.05) = .49757105.

n            1         2         3          4
P_n(1.05)    .49695    .49752    .49758     .49757
Error        6.20E−4   5.00E−5   −1.00E−5   0.0

EVALUATION OF THE DIVIDED DIFFERENCE INTERPOLATION POLYNOMIAL

Let

d_1 = f[x_0, x_1]
d_2 = f[x_0, x_1, x_2]
...
d_n = f[x_0, ..., x_n]

Then the formula

P_n(x) = f(x_0) + f[x_0, x_1](x − x_0) + f[x_0, x_1, x_2](x − x_0)(x − x_1) + · · · + f[x_0, ..., x_n](x − x_0) · · · (x − x_{n−1})

can be written as

P_n(x) = f(x_0) + (x − x_0)(d_1 + (x − x_1)(d_2 + · · · + (x − x_{n−2})(d_{n−1} + (x − x_{n−1}) d_n) · · · ))

Thus we have a nested polynomial evaluation, and this is quite efficient in computational cost.
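Continuing the sketch above (x, D and n as computed in the divided difference example), the nested form is evaluated at a point t with a single loop:

    % Nested evaluation of the Newton form of P_4 at t = 1.05
    t = 1.05;
    p = D(1,n);                       % innermost coefficient d_n
    for k = n-1:-1:1
        p = D(1,k) + (t - x(k))*p;    % work outward through the brackets
    end
    disp(p)                           % approx. cos(1.05) = .49757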

ERROR IN LINEAR INTERPOLATION

Let P_1(x) denote the linear polynomial interpolating f(x) at x_0 and x_1, with f(x) a given function (e.g. f(x) = cos x). What is the error f(x) − P_1(x)?

Let f(x) be twice continuously differentiable on an interval [a, b] which contains the points {x_0, x_1}. Then for a ≤ x ≤ b,

f(x) − P_1(x) = ((x − x_0)(x − x_1)/2) f''(c_x)

for some c_x between the minimum and maximum of x_0, x_1, and x.

If x_1 and x are 'close to x_0', then

f(x) − P_1(x) ≈ ((x − x_0)(x − x_1)/2) f''(x_0)

Thus the error acts like a quadratic polynomial, with zeros at x_0 and x_1.

Page 479: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

Let f(x) = log10 x; and in line with typical tables of

log10 x, we take 1 ≤ x, x0, x1 ≤ 10. For definiteness, let x0 < x1 with h = x1 − x0. Then

f''(x) = −(log10 e)/x²

log10 x − P1(x) = [(x − x0)(x − x1)/2]·[−(log10 e)/cx²]
                = (x − x0)(x1 − x)·[(log10 e)/(2cx²)]

We usually are interpolating with x0 ≤ x ≤ x1; and in that case, we have

(x − x0)(x1 − x) ≥ 0,  x0 ≤ cx ≤ x1

Page 480: Analiza Numerica [Utm, Bostan v.]

(x− x0) (x1 − x) ≥ 0, x0 ≤ cx ≤ x1

and therefore

(x − x0)(x1 − x)·[(log10 e)/(2x1²)] ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x)·[(log10 e)/(2x0²)]

For h = x1 − x0 small, we have for x0 ≤ x ≤ x1

log10 x − P1(x) ≈ (x − x0)(x1 − x)·[(log10 e)/(2x0²)]

Typical high school algebra textbooks contain tables

of log10 x with a spacing of h = .01. What is the

error in this case? To look at this, we use

0 ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x)·[(log10 e)/(2x0²)]

Page 481: Analiza Numerica [Utm, Bostan v.]

By simple geometry or calculus,

max over x0 ≤ x ≤ x1 of (x − x0)(x1 − x) = h²/4

Therefore,

0 ≤ log10 x − P1(x) ≤ (h²/4)·[(log10 e)/(2x0²)] ≐ .0543 h²/x0²

If we want a uniform bound for all points 1 ≤ x0 ≤ 10, we have

0 ≤ log10 x − P1(x) ≤ h²(log10 e)/8 ≐ .0543 h²

For h = .01, as is typical of the high school textbook tables of log10 x,

0 ≤ log10 x − P1(x) ≤ 5.43 × 10⁻⁶
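A quick numerical check of this bound (my own illustration): interpolate log10 linearly on one table subinterval near x0 = 1, where the error is largest, and measure the worst error.

h  = 0.01;  x0 = 1;  x1 = x0 + h;
x  = linspace(x0, x1, 1001);
P1 = log10(x0) + (log10(x1) - log10(x0)) * (x - x0) / h;   % linear interpolant
max(log10(x) - P1)        % about 5.4e-6, matching the bound above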

Page 482: Analiza Numerica [Utm, Bostan v.]

If you look at most tables, a typical entry is given to

only four decimal places to the right of the decimal

point, e.g.

log10 5.41 ≐ .7332

Therefore the entries are in error by as much as .00005.

Comparing this with the interpolation error, we see the

latter is less important than the rounding errors in the

table entries.

From the bound

0 ≤ log10 x − P1(x) ≤ h²(log10 e)/(8x0²) ≐ .0543 h²/x0²

we see the error decreases as x0 increases, and it is

about 100 times smaller for points near 10 than for

points near 1.

Page 483: Analiza Numerica [Utm, Bostan v.]

AN ERROR FORMULA:

THE GENERAL CASE

Recall the general interpolation problem: find a polynomial Pn(x) for which

deg(Pn) ≤ n,  Pn(xi) = f(xi),  i = 0, 1, ..., n

with distinct node points {x0, ..., xn} and a given function f(x). Let [a, b] be a given interval on which

f(x) is (n+ 1)-times continuously differentiable; and

assume the points x0, ..., xn, and x are contained in

[a, b]. Then

f(x) − Pn(x) = [(x − x0)(x − x1) ··· (x − xn)/(n + 1)!] f^(n+1)(cx)

with cx some point between the minimum and maxi-

mum of the points in {x, x0, ..., xn}.

Page 484: Analiza Numerica [Utm, Bostan v.]

f(x) − Pn(x) = [(x − x0)(x − x1) ··· (x − xn)/(n + 1)!] f^(n+1)(cx)

As shorthand, introduce

Ψn(x) = (x − x0)(x − x1) ··· (x − xn)

a polynomial of degree n + 1 with roots {x0, ..., xn}. Then

f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(cx)

Page 485: Analiza Numerica [Utm, Bostan v.]

THE QUADRATIC CASE

For n = 2, we have

f(x) − P2(x) = [(x − x0)(x − x1)(x − x2)/3!] f'''(cx)   (*)

with cx some point between the minimum and maxi-

mum of the points in {x, x0, x1, x2}.

To illustrate the use of this formula, consider the case

of evenly spaced nodes:

x1 = x0 + h, x2 = x1 + h

Further suppose we have x0 ≤ x ≤ x2, as we would

usually have when interpolating in a table of given

function values (e.g. log10 x). The quantity

Ψ2(x) = (x− x0) (x− x1) (x− x2)

can be evaluated directly for a particular x.

Page 486: Analiza Numerica [Utm, Bostan v.]

Graph of Ψ2(x) = (x + h)x(x − h), using (x0, x1, x2) = (−h, 0, h):

[Figure: Ψ2(x), a cubic with zeros at −h, 0, h]

Page 487: Analiza Numerica [Utm, Bostan v.]

In the formula (*), however, we do not know cx, and therefore we replace |f'''(cx)| with a maximum of |f'''(x)| as x varies over x0 ≤ x ≤ x2. This yields

|f(x) − P2(x)| ≤ [|Ψ2(x)|/3!] · max over x0 ≤ x ≤ x2 of |f'''(x)|   (**)

If we want a uniform bound for x0 ≤ x ≤ x2, we must compute

max over x0 ≤ x ≤ x2 of |Ψ2(x)| = max over x0 ≤ x ≤ x2 of |(x − x0)(x − x1)(x − x2)|

Using calculus,

max over x0 ≤ x ≤ x2 of |Ψ2(x)| = 2h³/(3√3),  attained at x = x1 ± h/√3

Combined with (**), this yields

|f(x) − P2(x)| ≤ [h³/(9√3)] · max over x0 ≤ x ≤ x2 of |f'''(x)|

for x0 ≤ x ≤ x2.

Page 488: Analiza Numerica [Utm, Bostan v.]

For f(x) = log10 x, with 1 ≤ x0 ≤ x ≤ x2 ≤ 10, this leads to

|log10 x − P2(x)| ≤ [h³/(9√3)] · max over x0 ≤ x ≤ x2 of [2(log10 e)/x³] = .05572 h³/x0³

For the case of h = .01, we have

|log10 x − P2(x)| ≤ 5.57 × 10⁻⁸/x0³ ≤ 5.57 × 10⁻⁸

Page 489: Analiza Numerica [Utm, Bostan v.]

Question: How much larger could we make h so that

quadratic interpolation would have an error compa-

rable to that of linear interpolation of log10 x with

h = .01? The error bound for the linear interpolation

was 5.43 × 10⁻⁶, and therefore we want the same to be true of quadratic interpolation. Using a simpler

bound, we want to find h so that

|log10 x − P2(x)| ≤ .05572 h³ ≤ 5 × 10⁻⁶

This is true if h = .04477. Therefore a spacing of

h = .04 would be sufficient. A table with this spac-

ing and quadratic interpolation would have an error

comparable to a table with h = .01 and linear inter-

polation.

Page 490: Analiza Numerica [Utm, Bostan v.]

For the case of general n,

f(x) − Pn(x) = [(x − x0) ··· (x − xn)/(n + 1)!] f^(n+1)(cx) = [Ψn(x)/(n + 1)!] f^(n+1)(cx)

Ψn(x) = (x − x0)(x − x1) ··· (x − xn)

with cx some point between the minimum and maximum of the points in {x, x0, ..., xn}. When bounding the error we replace f^(n+1)(cx) with its maximum over the interval containing {x, x0, ..., xn}, as we have illustrated earlier in the linear and quadratic cases.

Consider now the function

Ψn(x)/(n + 1)!

over the interval determined by the minimum and maximum of the points in {x, x0, ..., xn}. For evenly spaced node points on [0, 1], with x0 = 0 and xn = 1, we give graphs for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9 on accompanying pages.

Page 491: Analiza Numerica [Utm, Bostan v.]

DISCUSSION OF ERROR

Consider the error

f(x) − Pn(x) = [(x − x0) ··· (x − xn)/(n + 1)!] f^(n+1)(cx) = [Ψn(x)/(n + 1)!] f^(n+1)(cx)

Ψn(x) = (x − x0)(x − x1) ··· (x − xn)

as n increases and as x varies. As noted previously, we cannot do much with f^(n+1)(cx) except to replace it with a maximum value of |f^(n+1)(x)| over a suitable interval. Thus we concentrate on understanding the size of

Ψn(x)/(n + 1)!

Page 492: Analiza Numerica [Utm, Bostan v.]

ERROR FOR EVENLY SPACED NODES

We consider first the case in which the node points

are evenly spaced, as this seems the ‘natural’ way to

define the points at which interpolation is carried out.

Moreover, using evenly spaced nodes is the case to

consider for table interpolation. What can we learn

from the given graphs?

The interpolation nodes are determined by using

h = 1/n,  x0 = 0, x1 = h, x2 = 2h, ..., xn = nh = 1

For this case,

Ψn(x) = x(x − h)(x − 2h) ··· (x − 1)

Our graphs are the cases of n = 2, ..., 9.

Page 493: Analiza Numerica [Utm, Bostan v.]

[Figure: graphs of Ψn(x) on [0, 1] for n = 2, 3, 4, 5]

Page 494: Analiza Numerica [Utm, Bostan v.]

[Figure: graphs of Ψn(x) on [0, 1] for n = 6, 7, 8, 9]

Page 495: Analiza Numerica [Utm, Bostan v.]

Graph of Ψ6(x) = (x − x0)(x − x1) ··· (x − x6) with evenly spaced nodes:

[Figure: Ψ6(x) passing through its zeros x0, x1, ..., x6]

Page 496: Analiza Numerica [Utm, Bostan v.]

Using the following table

n   Mn        n    Mn
1   1.25E−1   6    4.76E−7
2   2.41E−2   7    2.20E−8
3   2.06E−3   8    9.11E−10
4   1.48E−4   9    3.39E−11
5   9.01E−6   10   1.15E−12

we can observe that the maximum

Mn ≡ max over x0 ≤ x ≤ xn of |Ψn(x)|/(n + 1)!

becomes smaller with increasing n.

Page 497: Analiza Numerica [Utm, Bostan v.]

From the graphs, there is enormous variation in the

size of Ψn(x) as x varies over [0, 1]; and thus there

is also enormous variation in the error as x so varies.

For example, in the n = 9 case,

max over x0 ≤ x ≤ x1 of |Ψn(x)|/(n + 1)! = 3.39 × 10⁻¹¹
max over x4 ≤ x ≤ x5 of |Ψn(x)|/(n + 1)! = 6.89 × 10⁻¹³

and the ratio of these two errors is approximately 49.

Thus the interpolation error is likely to be around 49

times larger when x0 ≤ x ≤ x1 as compared to the

case when x4 ≤ x ≤ x5. When doing table inter-

polation, the point x at which you are interpolating

should be centrally located with respect to the inter-

polation nodes {x0, ..., xn} being used to define the interpolation, if possible.

Page 498: Analiza Numerica [Utm, Bostan v.]

AN APPROXIMATION PROBLEM

Consider now the problem of using an interpolation

polynomial to approximate a given function f(x) on

a given interval [a, b]. In particular, take interpolation

nodes

a ≤ x0 < x1 < · · · < xn−1 < xn ≤ b

and produce the interpolation polynomial Pn(x) that

interpolates f(x) at the given node points. We would

like to have

max over a ≤ x ≤ b of |f(x) − Pn(x)| → 0  as n → ∞

Does it happen?

Recall the error bound

max over a ≤ x ≤ b of |f(x) − Pn(x)| ≤ max over a ≤ x ≤ b of [|Ψn(x)|/(n + 1)!] · max over a ≤ x ≤ b of |f^(n+1)(x)|

We begin with an example using evenly spaced node points.

Page 499: Analiza Numerica [Utm, Bostan v.]

RUNGE'S EXAMPLE

Use evenly spaced node points:

h = (b − a)/n,  xi = a + ih  for i = 0, ..., n

For some functions, such as f(x) = e^x, the maximum error goes to zero quite rapidly. But the size of the derivative term f^(n+1)(x) in

max over a ≤ x ≤ b of |f(x) − Pn(x)| ≤ [1/(n + 1)!] · max over a ≤ x ≤ b of |Ψn(x)| · max over a ≤ x ≤ b of |f^(n+1)(x)|

can badly hurt or destroy the convergence in other cases. In particular, we show the graph of

f(x) = 1/(1 + x²)

and Pn(x) on [−5, 5] for the case n = 10. It can be proven that for this function, the maximum error on [−5, 5] does not converge to zero as n increases. Thus the use of evenly spaced nodes is not necessarily a good approach to approximating a function f(x) by interpolation.

Page 500: Analiza Numerica [Utm, Bostan v.]

Runge’s example with n = 10:

[Figure: y = 1/(1 + x²) and the interpolating polynomial y = P10(x) on [−5, 5]]
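The figure is easy to reproduce (my own sketch; polyfit and polyval are used only for convenience in constructing and evaluating the interpolant):

n  = 10;
xi = linspace(-5, 5, n+1);                 % evenly spaced nodes
c  = polyfit(xi, 1 ./ (1 + xi.^2), n);     % degree-10 interpolating polynomial
x  = linspace(-5, 5, 1001);
plot(x, 1 ./ (1 + x.^2), x, polyval(c, x), xi, 1 ./ (1 + xi.^2), 'o')

The oscillations near the endpoints x = ±5 grow worse as n increases.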

Page 501: Analiza Numerica [Utm, Bostan v.]

OTHER CHOICES OF NODES

Recall the general error bound

max over a ≤ x ≤ b of |f(x) − Pn(x)| ≤ max over a ≤ x ≤ b of [|Ψn(x)|/(n + 1)!] · max over a ≤ x ≤ b of |f^(n+1)(x)|

There is nothing we really can do with the derivative term for f; but we can examine the way of defining the nodes {x0, ..., xn} within the interval [a, b]. We ask how these nodes can be chosen so that the maximum of |Ψn(x)| over [a, b] is made as small as possible.

Page 502: Analiza Numerica [Utm, Bostan v.]

This problem has quite an elegant solution, and it will be considered in the next lecture. The node points {x0, ..., xn} turn out to be the zeros of a particular polynomial Tn+1(x) of degree n + 1, called a Chebyshev polynomial. These zeros are known explicitly, and with them

max over a ≤ x ≤ b of |Ψn(x)| = [(b − a)/2]^(n+1) · 2^(−n)

This turns out to be smaller than for the evenly spaced case; and although this polynomial interpolation does not work for all functions f(x), it works for all differentiable functions and more.

Page 503: Analiza Numerica [Utm, Bostan v.]

ANOTHER ERROR FORMULA

Recall the error formula

f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(c)

Ψn(x) = (x − x0)(x − x1) ··· (x − xn)

with c between the minimum and maximum of {x0, ..., xn, x}. A second formula is given by

f(x)− Pn(x) = Ψn(x) f [x0, ..., xn, x]

Showing this is a simple, but somewhat subtle, argument.

Let Pn+1(x) denote the polynomial of degree ≤ n+1

which interpolates f(x) at the points {x0, ..., xn, xn+1}. Then

Pn+1(x) = Pn(x)

+f [x0, ..., xn, xn+1] (x− x0) · · · (x− xn)

Page 504: Analiza Numerica [Utm, Bostan v.]

Substituting x = xn+1, and using the fact that Pn+1(x)

interpolates f(x) at xn+1, we have

f(xn+1) = Pn(xn+1) + f[x0, ..., xn, xn+1](xn+1 − x0) ··· (xn+1 − xn)

In this formula, the number xn+1 is completely ar-

bitrary, other than being distinct from the points in

{x0, ..., xn}. To emphasize this fact, replace xn+1 by x throughout the formula, obtaining

f(x) = Pn(x) + f[x0, ..., xn, x](x − x0) ··· (x − xn)
     = Pn(x) + Ψn(x) f[x0, ..., xn, x]

provided x ≠ x0, ..., xn.

Page 505: Analiza Numerica [Utm, Bostan v.]

The formula

f(x) = Pn(x) + f[x0, ..., xn, x](x − x0) ··· (x − xn)
     = Pn(x) + Ψn(x) f[x0, ..., xn, x]

was shown above for x distinct from the node points. Provided f(x) is differentiable, the formula is also true when x is a node point.

This shows

f(x)− Pn(x) = Ψn(x) f [x0, ..., xn, x]

Compare the two error formulas

f(x)− Pn(x) = Ψn(x) f [x0, ..., xn, x]

f(x) − Pn(x) = [Ψn(x)/(n + 1)!] f^(n+1)(c)

Page 506: Analiza Numerica [Utm, Bostan v.]

Then

Ψn(x) f[x0, ..., xn, x] = [Ψn(x)/(n + 1)!] f^(n+1)(c)

f[x0, ..., xn, x] = f^(n+1)(c)/(n + 1)!

for some c between the smallest and largest of the

numbers in {x0, ..., xn, x}.

To make this somewhat symmetric in its arguments,

let m = n+ 1, x = xn+1. Then

f[x0, ..., xm−1, xm] = f^(m)(c)/m!

with c an unknown number between the smallest and

largest of the numbers in {x0, ..., xm}. This was given in an earlier lecture where divided differences were introduced.

Page 507: Analiza Numerica [Utm, Bostan v.]

PIECEWISE POLYNOMIAL INTERPOLATION

Recall the examples of higher degree polynomial interpolation of the function f(x) = 1/(1 + x²) on [−5, 5]. The interpolants Pn(x) oscillated a great deal, whereas the function f(x) was nonoscillatory.

To obtain interpolants that are better behaved, we

look at other forms of interpolating functions.

Consider the data

x   0     1     2     2.5   3     3.5    4
y   2.5   0.5   0.5   1.5   1.5   1.125  0

What are methods of interpolating this data, other than using a degree 6 polynomial? Shown in the text

are the graphs of the degree 6 polynomial interpolant,

along with those of piecewise linear and a piecewise

quadratic interpolating functions.

Since we only have the data to consider, we would gen-

erally want to use an interpolant that had somewhat

the shape of that of the piecewise linear interpolant.

Page 508: Analiza Numerica [Utm, Bostan v.]

[Figure: the data points]

[Figure: piecewise linear interpolation]

Page 509: Analiza Numerica [Utm, Bostan v.]

[Figure: polynomial interpolation of degree 6]

[Figure: piecewise quadratic interpolation]

Page 510: Analiza Numerica [Utm, Bostan v.]

PIECEWISE POLYNOMIAL FUNCTIONS

Consider being given a set of data points (x1, y1), ...,

(xn, yn), with

x1 < x2 < · · · < xn

Then the simplest way to connect the points (xj, yj)

is by straight line segments. This is called a piecewise linear interpolant of the data {(xj, yj)}. This graph

has “corners”, and often we expect the interpolant to

have a smooth graph.

To obtain a somewhat smoother graph, consider using

piecewise quadratic interpolation. Begin by construct-

ing the quadratic polynomial that interpolates

{(x1, y1), (x2, y2), (x3, y3)}

Then construct the quadratic polynomial that inter-

polates

{(x3, y3), (x4, y4), (x5, y5)}

Page 511: Analiza Numerica [Utm, Bostan v.]

Continue this process of constructing quadratic inter-

polants on the subintervals

[x1, x3], [x3, x5], [x5, x7], ...

If the number of subintervals is even (and therefore

n is odd), then this process comes out fine, with the

last interval being [xn−2, xn]. This was illustrated

on the graph for the preceding data. If, however, n is

even, then the approximation on the last interval must

be handled by some modification of this procedure.

Suggest such a modification!

With piecewise quadratic interpolants, however, there

are “corners” on the graph of the interpolating func-

tion. With our preceding example, they are at x3 and

x5. How do we avoid this?

Piecewise polynomial interpolants are used in many

applications. We will consider them later, to obtain

numerical integration formulas.

Page 512: Analiza Numerica [Utm, Bostan v.]

SMOOTH NON-OSCILLATORY

INTERPOLATION

Let data points (x1, y1), ..., (xn, yn) be given, and let

x1 < x2 < · · · < xn

Consider finding functions s(x) for which the follow-

ing properties hold:

(1) s(xi) = yi, i = 1, ..., n

(2) s(x), s'(x), s''(x) are continuous on [x1, xn].

Then among such functions s(x) satisfying these properties, find the one which minimizes the integral

∫ from x1 to xn of |s''(x)|² dx

The idea of minimizing the integral is to obtain an in-

terpolating function for which the first derivative does

not change rapidly. It turns out there is a unique so-

lution to this problem, and it is called a natural cubic

spline function.

Page 513: Analiza Numerica [Utm, Bostan v.]

SPLINE FUNCTIONS

Let a set of node points {xi} be given, satisfying

a ≤ x1 < x2 < ··· < xn ≤ b

for some numbers a and b. Often we use [a, b] =

[x1, xn]. A cubic spline function s(x) on [a, b] with

“breakpoints” or “knots” {xi} has the following prop-erties:

1. On each of the intervals

[a, x1], [x1, x2], ..., [xn−1, xn], [xn, b]

s(x) is a polynomial of degree ≤ 3.

2. s(x), s'(x), s''(x) are continuous on [a, b].

In the case that we have given data points (x1, y1),...,

(xn, yn), we say s(x) is a cubic interpolating spline

function for this data if

3. s(xi) = yi, i = 1, ..., n.

Page 514: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

Define

(x − α)_+^3 = (x − α)³ for x ≥ α,  0 for x ≤ α

This is a cubic spline function on (−∞, ∞) with the single breakpoint x1 = α.

Combinations of these form more complicated cubic spline functions. For example,

s(x) = 3(x − 1)_+^3 − 2(x − 3)_+^3

is a cubic spline function on (−∞, ∞) with the breakpoints x1 = 1, x2 = 3.

Define

s(x) = p3(x) + Σ from j = 1 to n of aj (x − xj)_+^3

with p3(x) some cubic polynomial. Then s(x) is a cubic spline function on (−∞, ∞) with breakpoints {x1, ..., xn}.
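Truncated powers are easy to evaluate numerically; a one-line MATLAB helper (my own illustration):

tp3 = @(x, a) max(x - a, 0).^3;           % (x - a)_+^3
s   = @(x) 3*tp3(x, 1) - 2*tp3(x, 3);     % the example spline above
fplot(s, [0, 4])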

Page 515: Analiza Numerica [Utm, Bostan v.]

Return to the earlier problem of choosing an interpolating function s(x) to minimize the integral

∫ from x1 to xn of |s''(x)|² dx

There is a unique solution to this problem. The solution s(x) is a cubic interpolating spline function, and moreover, it satisfies

s''(x1) = s''(xn) = 0

Spline functions satisfying these boundary conditions

are called “natural” cubic spline functions, and the so-

lution to our minimization problem is a “natural cubic

interpolatory spline function”. We will show a method

to construct this function from the interpolation data.

Motivation for these boundary conditions can be given

by looking at the physics of bending thin beams of

flexible materials to pass thru the given data. To the

left of x1 and to the right of xn, the beam is straight

and therefore the second derivatives are zero at the

transition points x1 and xn.

Page 516: Analiza Numerica [Utm, Bostan v.]

CONSTRUCTION OF THE

INTERPOLATING SPLINE FUNCTION

To make the presentation more specific, suppose we

have data

(x1, y1) , (x2, y2) , (x3, y3) , (x4, y4)

with x1 < x2 < x3 < x4. Then on each of the

intervals

[x1, x2] , [x2, x3] , [x3, x4]

s(x) is a cubic polynomial. Taking the first interval,

s(x) is a cubic polynomial and s''(x) is a linear polynomial. Let

Mi = s''(xi),  i = 1, 2, 3, 4

Then on [x1, x2],

s''(x) = [(x2 − x)M1 + (x − x1)M2] / (x2 − x1),  x1 ≤ x ≤ x2

Page 517: Analiza Numerica [Utm, Bostan v.]

We can find s(x) by integrating twice:

s(x) = [(x2 − x)³M1 + (x − x1)³M2] / [6(x2 − x1)] + c1 x + c2

We determine the constants of integration by using

s(x1) = y1, s(x2) = y2 (*)

Then

s(x) = [(x2 − x)³M1 + (x − x1)³M2] / [6(x2 − x1)]
     + [(x2 − x)y1 + (x − x1)y2] / (x2 − x1)
     − [(x2 − x1)/6]·[(x2 − x)M1 + (x − x1)M2]

for x1 ≤ x ≤ x2.

Check that this formula satisfies the given interpola-

tion condition (*)!

Page 518: Analiza Numerica [Utm, Bostan v.]

We can repeat this on the intervals [x2, x3] and [x3, x4],

obtaining similar formulas.

For x2 ≤ x ≤ x3,

s(x) = [(x3 − x)³M2 + (x − x2)³M3] / [6(x3 − x2)]
     + [(x3 − x)y2 + (x − x2)y3] / (x3 − x2)
     − [(x3 − x2)/6]·[(x3 − x)M2 + (x − x2)M3]

For x3 ≤ x ≤ x4,

s(x) = [(x4 − x)³M3 + (x − x3)³M4] / [6(x4 − x3)]
     + [(x4 − x)y3 + (x − x3)y4] / (x4 − x3)
     − [(x4 − x3)/6]·[(x4 − x)M3 + (x − x3)M4]

Page 519: Analiza Numerica [Utm, Bostan v.]

We still do not know the values of the second derivatives {M1, M2, M3, M4}. The above formulas guarantee that s(x) and s''(x) are continuous for x1 ≤ x ≤ x4. For example, the formula on [x1, x2] yields

s(x2) = y2,  s''(x2) = M2

The formula on [x2, x3] also yields

s(x2) = y2,  s''(x2) = M2

All that is lacking is to make s'(x) continuous at x2 and x3. Thus we require

s'(x2 + 0) = s'(x2 − 0)
s'(x3 + 0) = s'(x3 − 0)   (**)

This means

lim as x decreases to x2 of s'(x) = lim as x increases to x2 of s'(x)

and similarly for x3.

Page 520: Analiza Numerica [Utm, Bostan v.]

To simplify the presentation somewhat, I assume in

the following that our node points are evenly spaced:

x2 = x1 + h, x3 = x1 + 2h, x4 = x1 + 3h

Then our earlier formulas simplify to

s(x) = [(x2 − x)³M1 + (x − x1)³M2] / (6h)
     + [(x2 − x)y1 + (x − x1)y2] / h
     − (h/6)·[(x2 − x)M1 + (x − x1)M2]

for x1 ≤ x ≤ x2, with similar formulas on [x2, x3] and

[x3, x4].

Page 521: Analiza Numerica [Utm, Bostan v.]

Without going through all of the algebra, the conditions (**) lead to the following pair of equations:

(h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h

This gives us two equations in four unknowns. The earlier boundary conditions on s''(x) give us immediately

M1 = M4 = 0

Then we can solve the linear system for M2 and M3.

Page 522: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

Consider the interpolation data points

x   1   2    3    4
y   1   1/2  1/3  1/4

In this case, h = 1, and the linear system becomes

(2/3)M2 + (1/6)M3 = y3 − 2y2 + y1 = 1/3
(1/6)M2 + (2/3)M3 = y4 − 2y3 + y2 = 1/12

This has the solution

M2 = 1/2,  M3 = 0

This leads to the spline function formula on each

subinterval.
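A quick MATLAB check of this computation (my own illustration):

y = [1 1/2 1/3 1/4];
A = [2/3 1/6; 1/6 2/3];
r = [y(3) - 2*y(2) + y(1); y(4) - 2*y(3) + y(2)];   % right-hand sides
M = A \ r                                            % returns M2 = 0.5, M3 = 0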

Page 523: Analiza Numerica [Utm, Bostan v.]

On [1, 2],

s(x) = [(2 − x)³·0 + (x − 1)³·(1/2)]/6
     + [(2 − x)·1 + (x − 1)·(1/2)]/1
     − (1/6)·[(2 − x)·0 + (x − 1)·(1/2)]
     = (1/12)(x − 1)³ − (7/12)(x − 1) + 1

Similarly, for 2 ≤ x ≤ 3,

s(x) = −(1/12)(x − 2)³ + (1/4)(x − 2)² − (1/3)(x − 2) + 1/2

and for 3 ≤ x ≤ 4,

s(x) = −(1/12)(x − 4) + 1/4

Page 524: Analiza Numerica [Utm, Bostan v.]

x   1   2    3    4
y   1   1/2  1/3  1/4

[Figure: graph of y = 1/x and the interpolating natural cubic spline y = s(x) on [1, 4]]

Page 525: Analiza Numerica [Utm, Bostan v.]

x   0     1     2     2.5   3     3.5    4
y   2.5   0.5   0.5   1.5   1.5   1.125  0

[Figure: interpolating natural cubic spline function for this data]

Page 526: Analiza Numerica [Utm, Bostan v.]

ALTERNATIVE BOUNDARY CONDITIONS

Return to the equations

(h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h

Sometimes other boundary conditions are imposed on s(x) to help in determining the values of M1 and M4. For example, the data in our numerical example were generated from the function f(x) = 1/x. With it, f''(x) = 2/x³, and thus we could use

M1 = 2,  M4 = 1/32

With this we are led to a new formula for s(x), one that approximates f(x) = 1/x more closely.

Page 527: Analiza Numerica [Utm, Bostan v.]

THE CLAMPED SPLINE

In this case, we augment the interpolation conditions

s(xi) = yi, i = 1, 2, 3, 4

with the boundary conditions

s'(x1) = y'1,  s'(x4) = y'4   (#)

The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are

(h/3)M1 + (h/6)M2 = (y2 − y1)/h − y'1
(h/6)M1 + (2h/3)M2 + (h/6)M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6)M2 + (2h/3)M3 + (h/6)M4 = (y4 − y3)/h − (y3 − y2)/h
(h/6)M3 + (h/3)M4 = y'4 − (y4 − y3)/h

Page 528: Analiza Numerica [Utm, Bostan v.]

For our numerical example, it is natural to obtain these derivative values from f'(x) = −1/x²:

y'1 = −1,  y'4 = −1/16

When combined with the earlier equations, we have the system

(1/3)M1 + (1/6)M2 = 1/2
(1/6)M1 + (2/3)M2 + (1/6)M3 = 1/3
(1/6)M2 + (2/3)M3 + (1/6)M4 = 1/12
(1/6)M3 + (1/3)M4 = 1/48

This has the solution

[M1, M2, M3, M4] = [173/120, 7/60, 11/120, 1/60]
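This 4 × 4 system is easy to verify in MATLAB (my own check):

A = [1/3 1/6 0   0
     1/6 2/3 1/6 0
     0   1/6 2/3 1/6
     0   0   1/6 1/3];
r = [1/2; 1/3; 1/12; 1/48];
M = A \ r            % the decimal form of [173/120; 7/60; 11/120; 1/60]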

Page 529: Analiza Numerica [Utm, Bostan v.]

We can now write the functions s(x) for each of the

subintervals [x1, x2], [x2, x3], and [x3, x4]. Recall for

x1 ≤ x ≤ x2,

s(x) = [(x2 − x)³M1 + (x − x1)³M2] / (6h)
     + [(x2 − x)y1 + (x − x1)y2] / h
     − (h/6)·[(x2 − x)M1 + (x − x1)M2]

We can substitute in from the data

x   1   2    3    4
y   1   1/2  1/3  1/4

and the solutions {Mi}. Doing so, consider the error f(x) − s(x). As an example,

f(x) = 1/x,  f(3/2) = 2/3,  s(3/2) = .65260

This is quite a decent approximation.

Page 530: Analiza Numerica [Utm, Bostan v.]

THE GENERAL PROBLEM

Consider the spline interpolation problem with n nodes

(x1, y1) , (x2, y2) , ..., (xn, yn)

and assume the node points {xi} are evenly spaced,

xj = x1 + (j − 1)h,  j = 1, ..., n

We have that the interpolating spline s(x) on xj ≤ x ≤ xj+1 is given by

s(x) = [(xj+1 − x)³Mj + (x − xj)³Mj+1] / (6h)
     + [(xj+1 − x)yj + (x − xj)yj+1] / h
     − (h/6)·[(xj+1 − x)Mj + (x − xj)Mj+1]

for j = 1, ..., n − 1.

Page 531: Analiza Numerica [Utm, Bostan v.]

To enforce continuity of s'(x) at the interior node points x2, ..., xn−1, the second derivatives {Mj} must satisfy the linear equations

(h/6)Mj−1 + (2h/3)Mj + (h/6)Mj+1 = (yj−1 − 2yj + yj+1)/h

for j = 2, ..., n − 1. Writing them out,

(h/6)M1 + (2h/3)M2 + (h/6)M3 = (y1 − 2y2 + y3)/h
(h/6)M2 + (2h/3)M3 + (h/6)M4 = (y2 − 2y3 + y4)/h
   ...
(h/6)Mn−2 + (2h/3)Mn−1 + (h/6)Mn = (yn−2 − 2yn−1 + yn)/h

This is a system of n − 2 equations in the n unknowns {M1, ..., Mn}. Two more conditions must be imposed on s(x) in order to have the number of equations equal the number of unknowns, namely n. With the added boundary conditions, this form of linear system can be solved very efficiently.
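For the natural boundary conditions M1 = Mn = 0, the resulting tridiagonal system can be set up and solved in a few lines of MATLAB (my own sketch under the slide's assumptions: equally spaced nodes with spacing h and a row vector y of data values):

n = length(y);
A = (2*h/3)*eye(n-2) + (h/6)*(diag(ones(n-3,1), 1) + diag(ones(n-3,1), -1));
r = (y(1:n-2) - 2*y(2:n-1) + y(3:n))' / h;   % right-hand sides, j = 2,...,n-1
M = [0; A \ r; 0];                            % natural conditions M1 = Mn = 0

For large n one would store A in sparse form (e.g. with spdiags) rather than as a full matrix.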

Page 532: Analiza Numerica [Utm, Bostan v.]

BOUNDARY CONDITIONS

“Natural” boundary conditions

s''(x1) = s''(xn) = 0

Spline functions satisfying these conditions are called “natural cubic splines”. They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.

“Clamped” boundary conditions. We add the conditions

s'(x1) = y'1,  s'(xn) = y'n

with y'1, y'n given slopes for the endpoints of s(x) on [x1, xn]. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.

“Not a knot” boundary conditions. This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in MATLAB.

Page 533: Analiza Numerica [Utm, Bostan v.]

THE “NOT A KNOT” CONDITIONS

As before, let the interpolation nodes be

(x1, y1) , (x2, y2) , ..., (xn, yn)

We separate these points into two categories. For

constructing the interpolating cubic spline function,

we use the points

(x1, y1), (x3, y3), ..., (xn−2, yn−2), (xn, yn)

thus deleting two of the points. We now have n − 2 points, and the interpolating spline s(x) can be determined on the intervals

[x1, x3], [x3, x4], ..., [xn−3, xn−2], [xn−2, xn]

This leads to n − 4 equations in the n − 2 unknowns M1, M3, ..., Mn−2, Mn. The two additional boundary conditions are

s(x2) = y2,  s(xn−1) = yn−1

These translate into two additional equations, and we obtain a system of n − 2 linear simultaneous equations in the n − 2 unknowns M1, M3, ..., Mn−2, Mn.

Page 534: Analiza Numerica [Utm, Bostan v.]

x   0     1     2     2.5   3     3.5    4
y   2.5   0.5   0.5   1.5   1.5   1.125  0

[Figure: interpolating cubic spline function with “not-a-knot” boundary conditions]

Page 535: Analiza Numerica [Utm, Bostan v.]

MATLAB SPLINE FUNCTION LIBRARY

Given data points

(x1, y1) , (x2, y2) , ..., (xn, yn)

type arrays containing the x and y coordinates:

x = [x1 x2 ... xn]
y = [y1 y2 ... yn]
plot(x, y, 'o')

The last statement will draw a plot of the data points,

marking them with the letter ‘oh’. To find the inter-

polating cubic spline function and evaluate it at the

points of another array xx, say

h = (xn − x1) / (10 ∗ n) ; xx = x1 : h : xn;

use

yy = spline(x, y, xx)
plot(x, y, 'o', xx, yy)

The last statement will plot the data points, as be-

fore, and it will plot the interpolating spline s(x) as a

continuous curve.
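For instance, with the seven data points used earlier, a complete runnable version of this fragment is:

x  = [0 1 2 2.5 3 3.5 4];
y  = [2.5 0.5 0.5 1.5 1.5 1.125 0];
xx = linspace(x(1), x(end), 200);
yy = spline(x, y, xx);          % "not a knot" cubic spline interpolant
plot(x, y, 'o', xx, yy)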

Page 536: Analiza Numerica [Utm, Bostan v.]

ERROR IN CUBIC SPLINE INTERPOLATION

Let an interval [a, b] be given, and then define

h = (b − a)/(n − 1),  xj = a + (j − 1)h,  j = 1, ..., n

Suppose we want to approximate a given function f(x) on the interval [a, b] using cubic spline interpolation. Define

yj = f(xj),  j = 1, ..., n

Let sn(x) denote the cubic spline interpolating this

data and satisfying the “not a knot” boundary con-

ditions. Then it can be shown that for a suitable

constant c,

En ≡ max over a ≤ x ≤ b of |f(x) − sn(x)| ≤ c h⁴

The corresponding bound for natural cubic spline in-

terpolation contains only a term of h2 rather than h4;

it does not converge to zero as rapidly.

Page 537: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

Take f(x) = arctanx on [0, 5]. The following ta-

ble gives values of the maximum error En for various

values of n. The values of h are being successively

halved.

n    En        Ratio
7    7.09E−3
13   3.24E−4   21.9
25   3.06E−5   10.6
49   1.48E−6   20.7
97   9.04E−8   16.4

Here Ratio is the ratio of successive errors; for O(h⁴) convergence it should approach 16 as h is halved.
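The table can be reproduced with MATLAB's spline function (my own sketch):

f = @atan;  a = 0;  b = 5;
for n = [7 13 25 49 97]
    x  = linspace(a, b, n);
    xx = linspace(a, b, 5001);
    En = max(abs(f(xx) - spline(x, f(x), xx)));
    fprintf('n = %2d   En = %.2e\n', n, En)
end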

Page 538: Analiza Numerica [Utm, Bostan v.]

BEST APPROXIMATION

Given a function f(x) that is continuous on a given interval [a, b], consider approximating it by some polynomial p(x). To measure the error in p(x) as an approximation, introduce

E(p) = max over a ≤ x ≤ b of |f(x) − p(x)|

This is called the maximum error or uniform error of approximation of f(x) by p(x) on [a, b].

With an eye towards efficiency, we want to find the ‘best’ possible approximation of a given degree n. With this in mind, introduce the following:

ρn(f) = min over deg(p) ≤ n of E(p) = min over deg(p) ≤ n of [ max over a ≤ x ≤ b of |f(x) − p(x)| ]

The number ρn(f) will be the smallest possible uniform error, or minimax error, when approximating f(x) by polynomials of degree at most n. If there is a polynomial giving this smallest error, we denote it by mn(x); thus E(mn) = ρn(f).

Page 539: Analiza Numerica [Utm, Bostan v.]

Example. Let f(x) = e^x on [−1, 1]. In the following table, we give the values of E(tn), tn(x) the Taylor polynomial of degree n for e^x about x = 0, and E(mn).

          Maximum Error in:
n    tn(x)     mn(x)
1    7.18E−1   2.79E−1
2    2.18E−1   4.50E−2
3    5.16E−2   5.53E−3
4    9.95E−3   5.47E−4
5    1.62E−3   4.52E−5
6    2.26E−4   3.21E−6
7    2.79E−5   2.00E−7
8    3.06E−6   1.11E−8
9    3.01E−7   5.52E−10

Page 540: Analiza Numerica [Utm, Bostan v.]

Consider graphically how we can improve on the Tay-

lor polynomial

t1(x) = 1 + x

as a uniform approximation to e^x on the interval [−1, 1].

The linear minimax approximation is

m1(x) = 1.2643 + 1.1752x

[Figure: linear Taylor and minimax approximations to e^x on [−1, 1], showing y = e^x, y = t1(x), and y = m1(x)]

Page 541: Analiza Numerica [Utm, Bostan v.]

[Figure: error in cubic Taylor approximation to e^x on [−1, 1]; the error reaches about 0.0516 at x = 1]

[Figure: error in cubic minimax approximation to e^x on [−1, 1]; the error oscillates between −0.00553 and 0.00553]

Page 542: Analiza Numerica [Utm, Bostan v.]

Accuracy of the minimax approximation.

ρn(f) ≤ [((b − a)/2)^(n+1) / ((n + 1)! 2^n)] · max over a ≤ x ≤ b of |f^(n+1)(x)|

This error bound does not always become smaller with increasing n, but it will give a fairly accurate bound for many common functions f(x).

Example. Let f(x) = e^x for −1 ≤ x ≤ 1. Then

ρn(e^x) ≤ e / ((n + 1)! 2^n)   (*)

n    Bound (*)   ρn(f)
1    6.80E−1     2.79E−1
2    1.13E−1     4.50E−2
3    1.42E−2     5.53E−3
4    1.42E−3     5.47E−4
5    1.18E−4     4.52E−5
6    8.43E−6     3.21E−6
7    5.27E−7     2.00E−7

Page 543: Analiza Numerica [Utm, Bostan v.]

CHEBYSHEV POLYNOMIALS

Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer n ≥ 0, define the function

Tn(x) = cos(n cos⁻¹ x),  −1 ≤ x ≤ 1   (1)

This may not appear to be a polynomial, but we will show it is a polynomial of degree n. To simplify the manipulation of (1), we introduce

θ = cos⁻¹(x)  or  x = cos(θ),  0 ≤ θ ≤ π   (2)

Then

Tn(x) = cos(nθ)   (3)

Example. For n = 0,

T0(x) = cos(0·θ) = 1

For n = 1,

T1(x) = cos(θ) = x

For n = 2,

T2(x) = cos(2θ) = 2cos²(θ) − 1 = 2x² − 1

Page 544: Analiza Numerica [Utm, Bostan v.]

[Figure: T0(x), T1(x), T2(x) on [−1, 1]]

[Figure: T3(x), T4(x) on [−1, 1]]

Page 545: Analiza Numerica [Utm, Bostan v.]

The triple recursion relation. Recall the trigonometric addition formulas,

cos(α ± β) = cos(α)cos(β) ∓ sin(α)sin(β)

Let n ≥ 1, and apply these identities to get

Tn+1(x) = cos[(n + 1)θ] = cos(nθ + θ) = cos(nθ)cos(θ) − sin(nθ)sin(θ)
Tn−1(x) = cos[(n − 1)θ] = cos(nθ − θ) = cos(nθ)cos(θ) + sin(nθ)sin(θ)

Add these two equations, and then use (1) and (3) to obtain

Tn+1(x) + Tn−1(x) = 2cos(nθ)cos(θ) = 2xTn(x)

Tn+1(x) = 2xTn(x) − Tn−1(x),  n ≥ 1   (4)

This is called the triple recursion relation for the Cheby-

shev polynomials. It is often used in evaluating them,

rather than using the explicit formula (1).
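A short MATLAB sketch of such an evaluation (my own illustration; the name cheb_eval is not standard):

function T = cheb_eval(n, x)
% Evaluate T_n at the points x using the triple recursion (4).
Tprev = ones(size(x));                      % T_0(x) = 1
if n == 0, T = Tprev; return, end
T = x;                                      % T_1(x) = x
for k = 1:n-1
    [T, Tprev] = deal(2*x.*T - Tprev, T);   % T_{k+1} = 2x*T_k - T_{k-1}
end
end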

Page 546: Analiza Numerica [Utm, Bostan v.]

Example. Recall

T0(x) = 1,  T1(x) = x
Tn+1(x) = 2xTn(x) − Tn−1(x),  n ≥ 1

Let n = 2. Then

T3(x) = 2xT2(x) − T1(x) = 2x(2x² − 1) − x = 4x³ − 3x

Let n = 3. Then

T4(x) = 2xT3(x) − T2(x) = 2x(4x³ − 3x) − (2x² − 1) = 8x⁴ − 8x² + 1

Page 547: Analiza Numerica [Utm, Bostan v.]

The minimum size property. Note that

|Tn(x)| ≤ 1,  −1 ≤ x ≤ 1   (5)

for all n ≥ 0. Also, note that

Tn(x) = 2^(n−1) x^n + lower degree terms,  n ≥ 1   (6)

This can be proven using the triple recursion relation and mathematical induction.

Introduce a modified version of Tn(x),

T̃n(x) = (1/2^(n−1)) Tn(x) = x^n + lower degree terms   (7)

From (5) and (6),

|T̃n(x)| ≤ 1/2^(n−1),  −1 ≤ x ≤ 1,  n ≥ 1   (8)

Example.

T̃4(x) = (1/8)(8x⁴ − 8x² + 1) = x⁴ − x² + 1/8

Page 548: Analiza Numerica [Utm, Bostan v.]

A polynomial whose highest degree term has a coefficient of 1 is called a monic polynomial. Formula (8) says the monic polynomial T̃n(x) has size 1/2^(n−1) on −1 ≤ x ≤ 1, and this becomes smaller as the degree n increases. In comparison,

max over −1 ≤ x ≤ 1 of |x^n| = 1

Thus x^n is a monic polynomial whose size does not change with increasing n.

Theorem. Let n ≥ 1 be an integer, and consider all possible monic polynomials of degree n. Then the degree n monic polynomial with the smallest maximum on [−1, 1] is the modified Chebyshev polynomial T̃n(x), and its maximum value on [−1, 1] is 1/2^(n−1).

This result is used in devising applications of Cheby-

shev polynomials. We apply it to obtain an improved

interpolation scheme.

Page 549: Analiza Numerica [Utm, Bostan v.]

A NEAR-MINIMAX APPROXIMATION METHOD

Let f(x) be continuous on [a, b] = [−1, 1]. Consider approximating f by an interpolatory polynomial of degree at most n = 3. Let x0, x1, x2, x3 be interpolation node points in [−1, 1]; let c3(x) be of degree ≤ 3 and interpolate f(x) at {x0, x1, x2, x3}. The interpolation error is

f(x) − c3(x) = [ω(x)/4!] f^(4)(ξx),  −1 ≤ x ≤ 1   (1)

ω(x) = (x − x0)(x − x1)(x − x2)(x − x3)   (2)

with ξx in [−1, 1]. We want to choose the nodes {x0, x1, x2, x3} so as to minimize the maximum value of |f(x) − c3(x)| on [−1, 1].

From (1), the only general quantity, independent of f, is ω(x). Thus we choose {x0, x1, x2, x3} to minimize

max over −1 ≤ x ≤ 1 of |ω(x)|   (3)

Page 550: Analiza Numerica [Utm, Bostan v.]

Expand to get

ω(x) = x⁴ + lower degree terms

This is a monic polynomial of degree 4. From the theorem in the preceding section, the smallest possible value for (3) is obtained with

ω(x) = T̃4(x) = T4(x)/2³ = (1/8)(8x⁴ − 8x² + 1)   (4)

and the smallest value of (3) is 1/2³ in this case. The equation (4) defines implicitly the nodes {x0, x1, x2, x3}: they are the roots of T4(x).

In our case this means solving

T4(x) = cos(4θ) = 0,  x = cos(θ)

4θ = ±π/2, ±3π/2, ±5π/2, ±7π/2, ...
θ = ±π/8, ±3π/8, ±5π/8, ±7π/8, ...
x = cos(π/8), cos(3π/8), cos(5π/8), ...   (5)

using cos(−θ) = cos(θ).

Page 551: Analiza Numerica [Utm, Bostan v.]

x = cos(π/8), cos(3π/8), cos(5π/8), cos(7π/8), ...

The first four values are distinct; the following ones are repetitive. For example,

cos(9π/8) = cos(7π/8)

The first four values are

{x0, x1, x2, x3} = {±0.382683, ±0.923880}   (6)

Example. Let f(x) = e^x on [−1, 1]. Use these nodes to produce the interpolating polynomial c3(x) of degree 3. From the interpolation error formula and the bound of 1/2³ for |ω(x)| on [−1, 1], we have

max over −1 ≤ x ≤ 1 of |f(x) − c3(x)| ≤ [(1/2³)/4!] · max over −1 ≤ x ≤ 1 of e^(ξx) ≤ e/192 ≐ 0.014158

By direct calculation,

max over −1 ≤ x ≤ 1 of |e^x − c3(x)| ≐ 0.00666

Page 552: Analiza Numerica [Utm, Bostan v.]

Interpolation Data: f(x) = e^x

i    xi          f(xi)       f[x0, ..., xi]
0    0.923880    2.5190442   2.5190442
1    0.382683    1.4662138   1.9453769
2    −0.382683   0.6820288   0.7047420
3    −0.923880   0.3969760   0.1751757

[Figure: the error e^x − c3(x) on [−1, 1], oscillating between about −0.00624 and 0.00666]

For comparison, E(t3) ≐ 0.0142 and ρ3(e^x) ≐ 0.00553.
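The calculation can be repeated in MATLAB (my own sketch):

f  = @exp;  n = 3;
j  = 0:n;
xi = cos((2*j + 1)*pi/(2*n + 2));      % the four zeros of T_4
c  = polyfit(xi, f(xi), n);            % interpolating polynomial c_3(x)
x  = linspace(-1, 1, 2001);
max(abs(f(x) - polyval(c, x)))         % about 0.00666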

Page 553: Analiza Numerica [Utm, Bostan v.]

THE GENERAL CASE

Consider interpolating f(x) on [−1, 1] by a polynomial of degree ≤ n, with the interpolation nodes {x0, ..., xn} in [−1, 1]. Denote the interpolation polynomial by cn(x). The interpolation error on [−1, 1] is given by

f(x) − cn(x) = [ω(x)/(n + 1)!] f^(n+1)(ξx)   (7)

ω(x) = (x − x0) ··· (x − xn)

with ξx an unknown point in [−1, 1]. In order to minimize the interpolation error, we seek to minimize

max over −1 ≤ x ≤ 1 of |ω(x)|   (8)

Page 554: Analiza Numerica [Utm, Bostan v.]

The polynomial being minimized is monic of degree

n+ 1,

ω(x) = xn+1 + lower degree terms

From the theorem of the preceding section, this min-

imum is attained by the monic polynomial

T̃n+1(x) = (1/2^n) Tn+1(x)

Thus the interpolation nodes are the zeros of Tn+1(x); and by the procedure that led to (5), they are given by

xj = cos[(2j + 1)π/(2n + 2)],  j = 0, 1, ..., n   (9)

The near-minimax approximation cn(x) of degree n is obtained by interpolating to f(x) at these n + 1 nodes on [−1, 1].

The polynomial cn(x) is sometimes called a Cheby-

shev approximation.

Page 555: Analiza Numerica [Utm, Bostan v.]

Example. Let f(x) = e^x. The following table contains the maximum errors in cn(x) on [−1, 1] for varying n. For comparison, we also include the corresponding minimax errors. These figures illustrate that for practical purposes, cn(x) is a satisfactory replacement for the minimax approximation mn(x).

n    max |e^x − cn(x)|   ρn(e^x)
1    3.72E−1             2.79E−1
2    5.65E−2             4.50E−2
3    6.66E−3             5.53E−3
4    6.40E−4             5.47E−4
5    5.18E−5             4.52E−5
6    3.80E−6             3.21E−6

Page 556: Analiza Numerica [Utm, Bostan v.]

THEORETICAL INTERPOLATION ERROR

For the error

f(x) − cn(x) = [ω(x)/(n + 1)!] f^(n+1)(ξx)

we have

max over −1 ≤ x ≤ 1 of |f(x) − cn(x)| ≤ [max over −1 ≤ x ≤ 1 of |ω(x)| / (n + 1)!] · max over −1 ≤ ξ ≤ 1 of |f^(n+1)(ξ)|

From the theorem of the preceding section,

max over −1 ≤ x ≤ 1 of |T̃n+1(x)| = max over −1 ≤ x ≤ 1 of |ω(x)| = 1/2^n

in this case. Thus

max over −1 ≤ x ≤ 1 of |f(x) − cn(x)| ≤ [1/((n + 1)! 2^n)] · max over −1 ≤ ξ ≤ 1 of |f^(n+1)(ξ)|

Page 557: Analiza Numerica [Utm, Bostan v.]

OTHER INTERVALS

Consider approximating f(x) on the finite interval

[a, b]. Introduce the linear change of variables

x =1

2[(1− t) a+ (1 + t) b] (10)

t =2

b− a

·x− b+ a

2

¸(11)

Introduce

F (t) = fµ1

2[(1− t) a+ (1 + t) b]

¶, −1 ≤ t ≤ 1

The function F (t) on [−1, 1] is equivalent to f(x) on[a, b], and we can move between them via (10)-(11).

We can now proceed to approximate f(x) on [a, b] by

instead approximating F (t) on [−1, 1].

Example. Approximating f(x) = cosx on [0, π/2] is

equivalent to approximating

F (t) = cosµ1 + t

4π¶, −1 ≤ t ≤ 1
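Combining the node formula (9) with the change of variables (10) gives Chebyshev interpolation nodes on a general [a, b]; a small MATLAB sketch (my own illustration):

a = 0;  b = pi/2;  n = 5;
t = cos((2*(0:n) + 1)*pi/(2*n + 2));   % Chebyshev nodes in [-1, 1], from (9)
x = ((1 - t)*a + (1 + t)*b)/2;         % mapped to [a, b] by (10)
c = polyfit(x, cos(x), n);             % near-minimax interpolant of cos x on [a, b]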

Page 558: Analiza Numerica [Utm, Bostan v.]

NUMERICAL DIFFERENTIATION

There are two major reasons for considering numerical approximations of the differentiation process.

1. Approximation of derivatives in ordinary differen-

tial equations and partial differential equations.

This is done in order to reduce the differential

equation to a form that can be solved more easily

than the original differential equation.

2. Forming the derivative of a function f(x) which is

known only as empirical data {(xi, yi) | i = 1, ..., m}. The data generally is known only approximately,

so that yi ≈ f(xi), i = 1, . . . ,m.

Page 559: Analiza Numerica [Utm, Bostan v.]

Recall the definition

f'(x) = lim as h → 0 of [f(x + h) − f(x)]/h

This justifies using

f'(x) ≈ [f(x + h) − f(x)]/h ≡ Dh f(x)   (1)

for small values of h. The approximation Dhf(x) is

called a numerical derivative of f(x) with stepsize h.

Example. Use Dhf(x) to approximate the derivative

of f(x) = cos(x) at x = π/6. In the table, the error

is almost halved when h is halved.

h          Dh f       Error     Ratio
0.1        −0.54243   0.04243
0.05       −0.52144   0.02144   1.98
0.025      −0.51077   0.01077   1.99
0.0125     −0.50540   0.00540   1.99
0.00625    −0.50270   0.00270   2.00
0.003125   −0.50135   0.00135   2.00
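The table is generated by a few lines of MATLAB (my own sketch):

f = @cos;  x = pi/6;
h = 0.1 ./ 2.^(0:5);
Dh  = (f(x + h) - f(x)) ./ h;
err = -sin(x) - Dh              % f'(x) - Dh f(x); roughly halves when h is halved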

Page 560: Analiza Numerica [Utm, Bostan v.]

Error behaviour. Using Taylor's theorem,

f(x + h) = f(x) + h f'(x) + (1/2)h² f''(c)

with c between x and x + h. Evaluating (1),

Dh f(x) = (1/h){[f(x) + h f'(x) + (1/2)h² f''(c)] − f(x)} = f'(x) + (1/2)h f''(c)

f'(x) − Dh f(x) = −(1/2)h f''(c)   (2)

Using a higher order Taylor expansion,

f'(x) − Dh f(x) = −(1/2)h f''(x) − (1/6)h² f'''(c)

f'(x) − Dh f(x) ≈ −(1/2)h f''(x)   (3)

for small values of h.

For f(x) = cos x,

f'(x) − Dh f(x) = (1/2)h cos(c),  c ∈ [π/6, π/6 + h]

In the preceding table, check the accuracy of the approximation (3) with x = π/6.

Page 561: Analiza Numerica [Utm, Bostan v.]

The formula (1),

f'(x) ≈ [f(x + h) − f(x)]/h ≡ Dh f(x)

is called a forward difference formula for approximating f'(x). In contrast, the approximation

f'(x) ≈ [f(x) − f(x − h)]/h,  h > 0   (4)

is called a backward difference formula for approximating f'(x). A similar derivation leads to

f'(x) − [f(x) − f(x − h)]/h = (h/2) f''(c)   (5)

for some c between x and x − h. The accuracy of

the backward difference formula (4) is essentially the

same as that of the forward difference formula (1).

The motivation for this formula is in applications to

solving differential equations.

Page 562: Analiza Numerica [Utm, Bostan v.]

DIFFERENTIATION USING INTERPOLATION

Let Pn(x) be the degree n polynomial that interpolates f(x) at n + 1 node points x0, x1, ..., xn. To calculate f'(x) at some point x = t, use

f'(t) ≈ P'n(t)   (6)

Many different formulas can be obtained by varying n and by varying the placement of the nodes x0, ..., xn relative to the point t of interest.

Example. Take n = 2, and use evenly spaced nodes x0, x1 = x0 + h, x2 = x1 + h. Then

P2(x) = f(x0)L0(x) + f(x1)L1(x) + f(x2)L2(x)
P'2(x) = f(x0)L'0(x) + f(x1)L'1(x) + f(x2)L'2(x)

with

L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]

Page 563: Analiza Numerica [Utm, Bostan v.]

Forming the derivatives of these Lagrange basis functions and evaluating them at x = x1 gives

f'(x1) ≈ P'2(x1) = [f(x1 + h) − f(x1 − h)]/(2h) ≡ Dh f(x1)   (7)

For the error,

f'(x1) − [f(x1 + h) − f(x1 − h)]/(2h) = −(h²/6) f'''(c2)   (8)

with x1 − h ≤ c2 ≤ x1 + h.

A proof of this begins with the interpolation error formula

f(x) − P2(x) = Ψ2(x) f[x0, x1, x2, x]
Ψ2(x) = (x − x0)(x − x1)(x − x2)

Differentiate to get

f'(x) − P'2(x) = Ψ2(x)·(d/dx) f[x0, x1, x2, x] + Ψ'2(x)·f[x0, x1, x2, x]

Page 564: Analiza Numerica [Utm, Bostan v.]

f'(x) − P'2(x) = Ψ2(x)·(d/dx) f[x0, x1, x2, x] + Ψ'2(x)·f[x0, x1, x2, x]

With properties of the divided difference, we can show

f'(x) − P'2(x) = (1/24)Ψ2(x) f^(4)(c1,x) + (1/6)Ψ'2(x) f^(3)(c2,x)

with c1,x and c2,x between the smallest and largest of the values {x0, x1, x2, x}. Letting x = x1 and noting that Ψ2(x1) = 0, we obtain (8).

Example. Take f(x) = cos(x) and x1 = π/6. Then (7) is illustrated as follows.

h         Dh f           Error          Ratio
0.1       −0.49916708    −0.0008329
0.05      −0.49979169    −0.0002083     4.00
0.025     −0.49994792    −0.00005208    4.00
0.0125    −0.49998698    −0.00001302    4.00
0.00625   −0.49999674    −0.000003255   4.00

Note the smaller errors and faster convergence as com-

pared to the forward difference formula (1).

Page 565: Analiza Numerica [Utm, Bostan v.]

UNDETERMINED COEFFICIENTS

Derive an approximation for f''(x) at x = t. Write

f''(t) ≈ D_h^(2) f(t) ≡ A f(t + h) + B f(t) + C f(t − h)   (9)

with A, B, and C unspecified constants. Use Taylor polynomial approximations

f(t − h) ≈ f(t) − h f'(t) + (h²/2) f''(t) − (h³/6) f'''(t) + (h⁴/24) f^(4)(t)

f(t + h) ≈ f(t) + h f'(t) + (h²/2) f''(t) + (h³/6) f'''(t) + (h⁴/24) f^(4)(t)   (10)

Page 566: Analiza Numerica [Utm, Bostan v.]

Substitute into (9) and rearrange:

D_h^(2) f(t) ≈ (A + B + C) f(t) + h(A − C) f'(t) + (h²/2)(A + C) f''(t)
             + (h³/6)(A − C) f'''(t) + (h⁴/24)(A + C) f^(4)(t)   (11)

To have

D_h^(2) f(t) ≈ f''(t)   (12)

for arbitrary functions f(x), require

A + B + C = 0: coefficient of f(t)
h(A − C) = 0: coefficient of f'(t)
(h²/2)(A + C) = 1: coefficient of f''(t)

Solution:

A = C = 1/h²,  B = −2/h²   (13)

Page 567: Analiza Numerica [Utm, Bostan v.]

This determines

D_h^(2) f(t) = [f(t + h) − 2f(t) + f(t − h)] / h²   (14)

For the error, substitute (13) into (11):

D_h^(2) f(t) ≈ f''(t) + (h²/12) f^(4)(t)

Thus

f''(t) − [f(t + h) − 2f(t) + f(t − h)]/h² ≈ −(h²/12) f^(4)(t)   (15)

Example. Let f(x) = cos(x), t = π/6; use (14) to calculate f''(t) = −cos(π/6).

h         D_h^(2) f      Error       Ratio
0.5       −0.84813289    −1.789E−2
0.25      −0.86152424    −4.501E−3   3.97
0.125     −0.86489835    −1.127E−3   3.99
0.0625    −0.86574353    −2.819E−4   4.00
0.03125   −0.86595493    −7.048E−5   4.00
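A MATLAB check of formula (14) (my own sketch):

f = @cos;  t = pi/6;
h = 0.5 ./ 2.^(0:4);
D2  = (f(t + h) - 2*f(t) + f(t - h)) ./ h.^2;
err = -cos(t) - D2              % errors shrink by about 4 each time h is halved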

Page 568: Analiza Numerica [Utm, Bostan v.]

EFFECTS OF ERROR IN FUNCTION VALUES

Recall

D_h^(2) f(x1) = [f(x2) − 2f(x1) + f(x0)] / h² ≈ f''(x1)

with x2 = x1 + h, x0 = x1 − h. Assume the actual function values used in the computation contain data error, and denote these values by f̂0, f̂1, and f̂2. Introduce the data errors:

εi = f(xi) − f̂i,  i = 0, 1, 2   (16)

The actual quantity calculated is

D̂_h^(2) f(x1) = [f̂2 − 2f̂1 + f̂0] / h²   (17)

For the error in this quantity, replace f̂j by f(xj) − εj, j = 0, 1, 2, to obtain the following:

Page 569: Analiza Numerica [Utm, Bostan v.]

f''(x1) − D̂_h^(2) f(x1)
  = f''(x1) − {[f(x2) − ε2] − 2[f(x1) − ε1] + [f(x0) − ε0]} / h²
  = {f''(x1) − [f(x2) − 2f(x1) + f(x0)]/h²} + (ε2 − 2ε1 + ε0)/h²
  ≈ −(1/12)h² f^(4)(x1) + (ε2 − 2ε1 + ε0)/h²   (18)

The last line uses (15).

The errors {ε0, ε1, ε2} are generally random in some interval [−δ, δ]. If {f̂0, f̂1, f̂2} are experimental data, then δ is a bound on the experimental error. If {f̂j} are obtained from computing f(x) in a computer, then the errors εj are the combination of rounding or chopping errors and δ is a bound on these errors.

Page 570: Analiza Numerica [Utm, Bostan v.]

In either case, (18) yields the approximate inequality

|f''(x1) − D̂_h^(2) f(x1)| ≤ (h²/12)|f^(4)(x1)| + 4δ/h²   (19)

This suggests that as h → 0, the error will eventually increase, because of the final term 4δ/h².

Example. Calculate D̂_h^(2) f(x1) for f(x) = cos(x) at x1 = π/6. To show the effect of rounding errors, the values f̂i are obtained by rounding f(xi) to six significant digits; and the errors satisfy

|εi| ≤ 5.0 × 10⁻⁷ = δ,  i = 0, 1, 2

Other than these rounding errors, the formula D̂_h^(2) f(x1) is calculated exactly. In this example, the bound (19) becomes

|f''(x1) − D̂_h^(2) f(x1)| ≤ (h²/12)cos(π/6) + (4/h²)(5 × 10⁻⁷) ≐ 0.0722h² + 2 × 10⁻⁶/h² ≡ E(h)

Page 571: Analiza Numerica [Utm, Bostan v.]

For h = 0.125, the bound E(h) ≐ 0.00126, which is not too far off from the actual error given in the table.

h            D̂_h^(2) f(x1)   Error
0.5          −0.848128       −0.017897
0.25         −0.861504       −0.004521
0.125        −0.864832       −0.001193
0.0625       −0.865536       −0.000489
0.03125      −0.865280       −0.000745
0.015625     −0.860160       −0.005865
0.0078125    −0.851968       −0.014057
0.00390625   −0.786432       −0.079593

The bound E(h) indicates that there is a smallest value of h, call it h∗, below which the error bound will begin to increase. To find it, solve E'(h) = 0 for its root h∗. This leads to h∗ ≐ 0.0726, which is consistent with the behavior of the errors in the table.
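The trade-off is easy to observe numerically (my own sketch; round(...,'significant') requires a reasonably recent MATLAB release):

r6 = @(v) round(v, 6, 'significant');   % simulate six-digit rounding of f
t  = pi/6;
h  = 0.5 ./ 2.^(0:8);
D2 = (r6(cos(t + h)) - 2*r6(cos(t)) + r6(cos(t - h))) ./ h.^2;
err = -cos(t) - D2      % decreases, then grows again once h drops below about 0.07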

Page 572: Analiza Numerica [Utm, Bostan v.]

LINEAR SYSTEMS

Consider the following example of a linear system:

x1 + 2x2 + 3x3 = −5
−x1 + x3 = −3
3x1 + x2 + 3x3 = −3

Its unique solution is

x1 = 1,  x2 = 0,  x3 = −2

In general we want to solve n equations in n un-

knowns. For this, we need some simplifying nota-

tion. In particular we introduce arrays. We can think

of these as means for storing information about the

linear system in a computer. In the above case, we

introduce

A =
  1  2  3
 −1  0  1
  3  1  3

b =
 −5
 −3
 −3

x =
  1
  0
 −2

Page 573: Analiza Numerica [Utm, Bostan v.]

These arrays completely specify the linear system and

its solution. We also know that we can give mean-

ing to multiplication and addition of these quantities,

calling them matrices and vectors. The linear system

is then written as

Ax = b

with Ax denoting a matrix-vector multiplication.

The general system is written as

a1,1x1 + ··· + a1,nxn = b1
   ...
an,1x1 + ··· + an,nxn = bn

This is a system of n linear equations in the n un-

knowns x1, ..., xn. This can be written in matrix-

vector notation as

Ax = b

A =
 a1,1  ···  a1,n
  ...  ...   ...
 an,1  ···  an,n

b =
 b1
 ...
 bn

x =
 x1
 ...
 xn

Page 574: Analiza Numerica [Utm, Bostan v.]

A TRIDIAGONAL SYSTEM

Consider the tridiagonal linear system

3x1 − x2 = 2
−x1 + 3x2 − x3 = 1
   ...
−xn−2 + 3xn−1 − xn = 1
−xn−1 + 3xn = 2

The solution is

x1 = ··· = xn = 1

This has the associated arrays

A =
  3 −1  0  ···  0
 −1  3 −1       ...
  ...  ...  ...
  ...  −1  3 −1
  0  ···  −1  3

b =
 2
 1
 ...
 1
 2

x =
 1
 1
 ...
 1
 1

Page 575: Analiza Numerica [Utm, Bostan v.]

SOLVING LINEAR SYSTEMS

Linear systems Ax = b occur widely in applied mathematics. They occur as direct formulations of “real world” problems; but more often, they occur as a part of the numerical analysis of some other problem. As examples of the latter, we have the construction of spline functions, the numerical solution of systems of nonlinear equations, ordinary and partial differential equations, integral equations, and the solution of optimization problems.

There are many ways of classifying linear systems.

Size: Small, moderate, and large. This of course varies with the machine you are using.

Page 576: Analiza Numerica [Utm, Bostan v.]

For a matrix A of order n × n, it will take 8n² bytes

to store it in double precision. Thus a matrix of order

8000 will need around 512 MB of storage. The latter

would be too large for most present day PCs, if the

matrix was to be stored in the computer’s memory,

although one can easily expand a PC to contain much

more memory than this.

Sparse vs. Dense. Many linear systems have a matrix A in which almost all the elements are zero. These

matrices are said to be sparse. For example, it is quite

common to work with tridiagonal matrices

A =
 a1  c1  0   ···  0
 b2  a2  c2       ...
 0   b3  a3  c3
 ...     ...  ...
 0   ···     bn  an

in which the order is 104 or much more. For such

matrices, it does not make sense to store the zero ele-

ments; and the sparsity should be taken into account

when solving the linear system Ax = b. Also, the

sparsity need not be as regular as in this example.

Page 577: Analiza Numerica [Utm, Bostan v.]

BASIC DEFINITIONS AND THEORY

A homogeneous linear system Ax = b is one for which the right hand constants are all zero. Using vector notation, we say b is the zero vector for a homogeneous system. Otherwise the linear system is called non-homogeneous.

Theorem. The following are equivalent statements.

(1) For each b, there is exactly one solution x.

(2) For each b, there is a solution x.

(3) The homogeneous system Ax = 0 has only the solution x = 0.

(4) det(A) ≠ 0.

(5) The inverse matrix A⁻¹ exists.

Page 578: Analiza Numerica [Utm, Bostan v.]

EXAMPLE. Consider again the tridiagonal system

3x1 − x2 = 2
−x1 + 3x2 − x3 = 1
   ...
−xn−2 + 3xn−1 − xn = 1
−xn−1 + 3xn = 2

The homogeneous version is simply

3x1 − x2 = 0
−x1 + 3x2 − x3 = 0
   ...
−xn−2 + 3xn−1 − xn = 0
−xn−1 + 3xn = 0

Assume x ≠ 0, and therefore that x has nonzero components. Let xk denote a component of maximum size:

|xk| = max over 1 ≤ j ≤ n of |xj|

Page 579: Analiza Numerica [Utm, Bostan v.]

Consider now equation k, and assume 1 < k < n.

Then

−xk−1 + 3xk − xk+1 = 0

xk = (1/3)(xk−1 + xk+1)

|xk| ≤ (1/3)(|xk−1| + |xk+1|) ≤ (1/3)(|xk| + |xk|) = (2/3)|xk|

This implies xk = 0, and therefore x = 0. A similar proof is valid if k = 1 or k = n, using the first or the last equation, respectively.

Thus the original tridiagonal linear system Ax = b has a unique solution x for each right side b.

Page 580: Analiza Numerica [Utm, Bostan v.]

METHODS OF SOLUTION

There are two general categories of numerical methods

for solving Ax = b.

Direct Methods: These are methods with a finite

number of steps; and they end with the exact solution

x, provided that all arithmetic operations are exact.

The most used of these methods is Gaussian elimi-

nation, which we begin with. There are other direct

methods, but we do not study them here.

Iteration Methods: These are used in solving all types

of linear systems, but they are most commonly used

with large sparse systems, especially those produced

by discretizing partial differential equations. This is

an extremely active area of research.

Page 581: Analiza Numerica [Utm, Bostan v.]

MATRICES in MATLAB

Consider the matrices

A =
 1 2 3
 2 2 3
 3 3 3

b =
 1
 1
 1

In MATLAB, A can be created as follows.

A = [1 2 3; 2 2 3; 3 3 3];
A = [1, 2, 3; 2, 2, 3; 3, 3, 3];
A = [1 2 3
     2 2 3
     3 3 3];

Commas can be used to replace the spaces. The vec-

tor b can be created by

b = ones(3, 1);

Page 582: Analiza Numerica [Utm, Bostan v.]

Consider setting up the matrices for the system

Ax = b with

Ai,j = max {i, j} , bi = 1, 1 ≤ i, j ≤ n

One way to set up the matrix A is as follows:

A = zeros(n, n);
for i = 1 : n
    A(i, 1 : i) = i;
    A(i, i + 1 : n) = i + 1 : n;
end

and set up the vector b by

b = ones(n, 1);
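The same matrix can also be built without a loop (my own note), since max can be applied to index grids:

[I, J] = ndgrid(1:n, 1:n);
A = max(I, J);                 % A(i,j) = max(i,j)
b = ones(n, 1);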

Page 583: Analiza Numerica [Utm, Bostan v.]

MATRIX ADDITION

Let A = [ai,j] and B = [bi,j] be matrices of order m × n. Then

C = A + B

is another matrix of order m × n, with

ci,j = ai,j + bi,j

EXAMPLE.

[1 2; 3 4; 5 6] + [1 −1; −1 1; 1 −1] = [2 1; 2 5; 6 5]

Page 584: Analiza Numerica [Utm, Bostan v.]

MULTIPLICATION BY A CONSTANT

c [a1,1 ··· a1,n; ... ; am,1 ··· am,n] = [c·a1,1 ··· c·a1,n; ... ; c·am,1 ··· c·am,n]

EXAMPLE.

5 [1 2; 3 4; 5 6] = [5 10; 15 20; 25 30]

(−1) [a b; c d] = [−a −b; −c −d]

Page 585: Analiza Numerica [Utm, Bostan v.]

THE ZERO MATRIX 0

Define the zero matrix of order m× n as the matrix

of that order having all zero entries. It is sometimes

written as 0m×n, but more commonly as simply 0. Then for any matrix A of order m × n,

A + 0 = 0 + A = A

The zero matrix 0m×n acts in the same role as does the number zero when doing arithmetic with real and complex numbers.

EXAMPLE.

[1 2; 3 4] + [0 0; 0 0] = [1 2; 3 4]

Page 586: Analiza Numerica [Utm, Bostan v.]

We denote by −A the solution of the equation

A+B = 0

It is the matrix obtained by taking the negative of all

of the entries in A. For example,

[a b; c d] + [−a −b; −c −d] = [0 0; 0 0]

⇒ −[a b; c d] = [−a −b; −c −d] = (−1)[a b; c d]

−[a1,1 a1,2; a2,1 a2,2] = [−a1,1 −a1,2; −a2,1 −a2,2]

Page 587: Analiza Numerica [Utm, Bostan v.]

MATRIX MULTIPLICATION

Let A = [ai,j] have order m × n and B = [bi,j] have order n × p. Then

C = AB

is a matrix of order m × p and

ci,j = Ai,∗ B∗,j = ai,1 b1,j + ai,2 b2,j + ··· + ai,n bn,j

or equivalently

ci,j = [ai,1 ai,2 ··· ai,n] [b1,j; b2,j; ... ; bn,j] = ai,1 b1,j + ai,2 b2,j + ··· + ai,n bn,j

Page 588: Analiza Numerica [Utm, Bostan v.]

EXAMPLES

"1 2 34 5 6

# 1 23 45 6

= "22 2849 64

#

1 23 45 6

" 1 2 34 5 6

#=

9 12 1519 26 3329 40 51

a1,1 · · · a1,n

... . . . ...an,1 · · · an,n

x1...xn

= a1,1x1 + · · ·+ a1,nxn

...an,1x1 + · · ·+ an,nxn

Thus we write the linear system

a1,1x1 + · · ·+ a1,nxn = b1...

an,1x1 + · · ·+ an,nxn = bn

as

Ax = b

Page 589: Analiza Numerica [Utm, Bostan v.]

THE IDENTITY MATRIX I

For a given integer n ≥ 1, define In to be the matrix of order n × n with 1's in all diagonal positions and zeros elsewhere:

In =
 1  0  ···  0
 0  1       0
 ...  ...  ...
 0  ···     1

More commonly it is denoted by simply I.

Let A be a matrix of order m × n. Then

A In = A,  Im A = A

The identity matrix I acts in the same role as does

the number 1 when doing arithmetic with real and

complex numbers.

Page 590: Analiza Numerica [Utm, Bostan v.]

THE MATRIX INVERSE

Let A be a matrix of order n × n for some n ≥ 1. We say a matrix B is an inverse for A if

AB = BA = I

It can be shown that if an inverse exists for A, then it is unique.

EXAMPLES. If ad − bc ≠ 0, then

[a b; c d]⁻¹ = (1/(ad − bc)) [d −b; −c a]

[1 2; 2 2]⁻¹ = [−1 1; 1 −1/2]

[1 1/2 1/3; 1/2 1/3 1/4; 1/3 1/4 1/5]⁻¹ = [9 −36 30; −36 192 −180; 30 −180 180]

Page 591: Analiza Numerica [Utm, Bostan v.]

Recall the earlier theorem on the solution of linear

systems Ax = b with A a square matrix.

Theorem. The following are equivalent statements.

1. For each b, there is exactly one solution x.

2. For each b, there is a solution x.

3. The homogeneous system Ax = 0 has only the

solution x = 0.

4. det(A) ≠ 0.

5. A−1 exists.

Page 592: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

det [1 2 3; 4 5 6; 7 8 9] = 0

Therefore, the linear system

[1 2 3; 4 5 6; 7 8 9] [x1; x2; x3] = [b1; b2; b3]

is not always solvable, the coefficient matrix does not have an inverse, and the homogeneous system Ax = 0 has a solution other than the zero vector, namely

[1 2 3; 4 5 6; 7 8 9] [1; −2; 1] = [0; 0; 0]

Page 593: Analiza Numerica [Utm, Bostan v.]

PARTITIONED MATRICES

Matrices can be built up from smaller matrices; or conversely, we can decompose a large matrix into a matrix of smaller matrices. For example, consider

A = [1 2 0; 2 1 1; 0 −1 5] = [B c; d e]

B = [1 2; 2 1],  c = [0; 1],  d = [0 −1],  e = 5

Matlab allows you to build up larger matrices out of smaller matrices in exactly this manner; and smaller matrices can be defined as portions of larger matrices.

We will often write an n × n square matrix in terms of its columns:

A = [A∗,1, ..., A∗,n]

For the n × n identity matrix I, we write

I = [e1, ..., en]

with ej denoting a column vector with a 1 in position j and zeros elsewhere.

Page 594: Analiza Numerica [Utm, Bostan v.]

ARITHMETIC OF PARTITIONED MATRICES

As with matrices, we can do addition and multiplication with partitioned matrices, provided the individual constituent parts have the proper orders.

For example, let A, B, C, D be n × n matrices. Then

  [ I  A ] [ I  C ]   [ I + AD   C + A  ]
  [ B  I ] [ D  I ] = [ B + D    I + BC ]

Let A be n × n and x be a column vector of length n. Then

                                 [ x_1 ]
  Ax = [ A_{*,1}, ..., A_{*,n} ] [ ... ] = x_1 A_{*,1} + ... + x_n A_{*,n}
                                 [ x_n ]

Compare this to

  [ a_{1,1} ... a_{1,n} ] [ x_1 ]   [ a_{1,1} x_1 + ... + a_{1,n} x_n ]
  [   ...   ...   ...   ] [ ... ] = [               ...               ]
  [ a_{n,1} ... a_{n,n} ] [ x_n ]   [ a_{n,1} x_1 + ... + a_{n,n} x_n ]

Page 595: Analiza Numerica [Utm, Bostan v.]

PARTITIONED MATRICES IN MATLAB

In MATLAB, matrices can be constructed using smaller matrices. For example, let

  A = [1, 2; 3, 4]; x = [5, 6]; y = [7, 8]';

Then

  B = [A, y; x, 9];

forms the matrix

      [ 1  2  7 ]
  B = [ 3  4  8 ]
      [ 5  6  9 ]
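Conversely, portions of B can be pulled out as smaller matrices. A minimal sketch (the variable names here are ours, for illustration only):

  B = [1 2 7; 3 4 8; 5 6 9];
  A2 = B(1:2, 1:2);    % upper-left 2 x 2 block: [1 2; 3 4]
  y2 = B(1:2, 3);      % column 3 of rows 1-2: [7; 8]
  x2 = B(3, 1:2);      % row 3, columns 1-2: [5 6]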

Page 596: Analiza Numerica [Utm, Bostan v.]

SOLVING LINEAR SYSTEMS

We want to solve the linear system

  a_{1,1} x_1 + ... + a_{1,n} x_n = b_1
  ...
  a_{n,1} x_1 + ... + a_{n,n} x_n = b_n

This will be done by the method used in beginning

algebra, by successively eliminating unknowns from

equations, until eventually we have only one equation

in one unknown. This process is known as Gaussian

elimination. To put it onto a computer, however, we

must be more precise than is generally the case in high

school algebra.

We begin with the linear system

   3x_1 - 2x_2 -  x_3 =  0   (E1)
   6x_1 - 2x_2 + 2x_3 =  6   (E2)
  -9x_1 + 7x_2 +  x_3 = -1   (E3)

Page 597: Analiza Numerica [Utm, Bostan v.]

   3x_1 - 2x_2 -  x_3 =  0   (E1)
   6x_1 - 2x_2 + 2x_3 =  6   (E2)
  -9x_1 + 7x_2 +  x_3 = -1   (E3)

[1] Eliminate x_1 from equations (E2) and (E3). Subtract 2 times (E1) from (E2); and subtract -3 times (E1) from (E3). This yields

   3x_1 - 2x_2 -  x_3 =  0   (E1)
          2x_2 + 4x_3 =  6   (E2)
           x_2 - 2x_3 = -1   (E3)

[2] Eliminate x_2 from equation (E3). Subtract 1/2 times (E2) from (E3). This yields

   3x_1 - 2x_2 -  x_3 =  0   (E1)
          2x_2 + 4x_3 =  6   (E2)
               - 4x_3 = -4   (E3)

Using back substitution, solve for x_3, x_2, and x_1, obtaining

  x_3 = x_2 = x_1 = 1

Page 598: Analiza Numerica [Utm, Bostan v.]

In the computer, we work on the arrays rather than

on the equations. To illustrate this, we repeat the

preceding example using array notation.

The original system is Ax = b, with

      [  3  -2  -1 ]       [  0 ]
  A = [  6  -2   2 ],  b = [  6 ]
      [ -9   7   1 ]       [ -1 ]

We often write these in combined form as an augmented matrix:

            [  3  -2  -1 |  0 ]
  [A | b] = [  6  -2   2 |  6 ]
            [ -9   7   1 | -1 ]

In step 1, we eliminate x_1 from equations 2 and 3. We multiply row 1 by 2 and subtract it from row 2; and we multiply row 1 by -3 and subtract it from row 3. This yields

  [ 3  -2  -1 |  0 ]
  [ 0   2   4 |  6 ]
  [ 0   1  -2 | -1 ]

Page 599: Analiza Numerica [Utm, Bostan v.]

  [ 3  -2  -1 |  0 ]
  [ 0   2   4 |  6 ]
  [ 0   1  -2 | -1 ]

In step 2, we eliminate x_2 from equation 3. We multiply row 2 by 1/2 and subtract from row 3. This yields

  [ 3  -2  -1 |  0 ]
  [ 0   2   4 |  6 ]
  [ 0   0  -4 | -4 ]

Then we proceed with back substitution as previously.

Page 600: Analiza Numerica [Utm, Bostan v.]

For the general case, we reduce

            [ a^{(1)}_{1,1} ... a^{(1)}_{1,n} | b^{(1)}_1 ]
  [A | b] = [      ...      ...      ...      |    ...    ]
            [ a^{(1)}_{n,1} ... a^{(1)}_{n,n} | b^{(1)}_n ]

in n - 1 steps to the form

  [ a^{(1)}_{1,1}  ...     a^{(1)}_{1,n} | b^{(1)}_1 ]
  [       0        ...          ...      |    ...    ]
  [       0   ...  0     a^{(n)}_{n,n}   | b^{(n)}_n ]

More simply, and introducing new notation, this is equivalent to the matrix-vector equation Ux = g:

  [ u_{1,1}  ...    u_{1,n} ] [ x_1 ]   [ g_1 ]
  [    0     ...      ...   ] [ ... ] = [ ... ]
  [    0   ...  0  u_{n,n}  ] [ x_n ]   [ g_n ]

Page 601: Analiza Numerica [Utm, Bostan v.]

This is the linear system

  u_{1,1} x_1 + u_{1,2} x_2 + ... + u_{1,n-1} x_{n-1} + u_{1,n} x_n = g_1
  ...
  u_{n-1,n-1} x_{n-1} + u_{n-1,n} x_n = g_{n-1}
  u_{n,n} x_n = g_n

We solve for x_n, then x_{n-1}, and backwards to x_1. This process is called back substitution:

  x_n = g_n / u_{n,n}

  x_k = ( g_k - [ u_{k,k+1} x_{k+1} + ... + u_{k,n} x_n ] ) / u_{k,k}

for k = n-1, ..., 1. What we have done here is simply a more carefully defined and methodical version of what you have done in high school algebra.
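As a concrete illustration, here is a minimal MATLAB sketch of back substitution for Ux = g (the function name backsub is ours; it assumes every diagonal entry u_{k,k} is nonzero):

  function x = backsub(U, g)
  % Solve U*x = g for upper triangular U by back substitution.
  n = length(g);
  x = zeros(n, 1);
  x(n) = g(n) / U(n, n);
  for k = n-1:-1:1
      x(k) = (g(k) - U(k, k+1:n) * x(k+1:n)) / U(k, k);
  end
  end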

Page 602: Analiza Numerica [Utm, Bostan v.]

How do we carry out the conversion of

  [ a^{(1)}_{1,1} ... a^{(1)}_{1,n} | b^{(1)}_1 ]
  [      ...      ...      ...      |    ...    ]
  [ a^{(1)}_{n,1} ... a^{(1)}_{n,n} | b^{(1)}_n ]

to

  [ a^{(1)}_{1,1}  ...     a^{(1)}_{1,n} | b^{(1)}_1 ]
  [       0        ...          ...      |    ...    ]
  [       0   ...  0     a^{(n)}_{n,n}   | b^{(n)}_n ]

To help us keep track of the steps of this process, we will denote the initial system by

  [A^{(1)} | b^{(1)}] = [ a^{(1)}_{1,1} ... a^{(1)}_{1,n} | b^{(1)}_1 ]
                        [      ...      ...      ...      |    ...    ]
                        [ a^{(1)}_{n,1} ... a^{(1)}_{n,n} | b^{(1)}_n ]

Initially we will make the assumption that every pivot

element will be nonzero; and later we remove this

assumption.

Page 603: Analiza Numerica [Utm, Bostan v.]

Step 1. We will eliminate x_1 from equations 2 thru n. Begin by defining the multipliers

  m_{i,1} = a^{(1)}_{i,1} / a^{(1)}_{1,1},   i = 2, ..., n

Here we are assuming the pivot element a^{(1)}_{1,1} ≠ 0. Then in succession, multiply m_{i,1} times row 1 (called the pivot row) and subtract the result from row i. This yields new matrix elements

  a^{(2)}_{i,j} = a^{(1)}_{i,j} - m_{i,1} a^{(1)}_{1,j},   j = 2, ..., n

  b^{(2)}_i = b^{(1)}_i - m_{i,1} b^{(1)}_1

for i = 2, ..., n.

Note that the index j does not include j = 1. The reason is that with the definition of the multiplier m_{i,1}, it is automatic that

  a^{(2)}_{i,1} = a^{(1)}_{i,1} - m_{i,1} a^{(1)}_{1,1} = 0,   i = 2, ..., n

Page 604: Analiza Numerica [Utm, Bostan v.]

The augmented matrix now is

  [A^{(2)} | b^{(2)}] = [ a^{(1)}_{1,1}  a^{(1)}_{1,2} ... a^{(1)}_{1,n} | b^{(1)}_1 ]
                        [       0        a^{(2)}_{2,2} ... a^{(2)}_{2,n} | b^{(2)}_2 ]
                        [      ...            ...      ...      ...      |    ...    ]
                        [       0        a^{(2)}_{n,2} ... a^{(2)}_{n,n} | b^{(2)}_n ]

Step k: Assume that for i = 1, ..., k-1 the unknown x_i has been eliminated from equations i+1 thru n. We have the augmented matrix

  [A^{(k)} | b^{(k)}] = [ a^{(1)}_{1,1}  a^{(1)}_{1,2}  ...           a^{(1)}_{1,n} | b^{(1)}_1 ]
                        [       0        a^{(2)}_{2,2}  ...           a^{(2)}_{2,n} | b^{(2)}_2 ]
                        [                     ...       ...                ...      |    ...    ]
                        [      ...       0  a^{(k)}_{k,k}  ...        a^{(k)}_{k,n} | b^{(k)}_k ]
                        [      ...               ...                       ...      |    ...    ]
                        [       0   ...  0  a^{(k)}_{n,k}  ...        a^{(k)}_{n,n} | b^{(k)}_n ]

Page 605: Analiza Numerica [Utm, Bostan v.]

We want to eliminate unknown x_k from equations k+1 thru n. Begin by defining the multipliers

  m_{i,k} = a^{(k)}_{i,k} / a^{(k)}_{k,k},   i = k+1, ..., n

The pivot element is a^{(k)}_{k,k}, and we assume it is nonzero. Using these multipliers, we eliminate x_k from equations k+1 thru n. Multiply m_{i,k} times row k (the pivot row) and subtract from row i, for i = k+1 thru n:

  a^{(k+1)}_{i,j} = a^{(k)}_{i,j} - m_{i,k} a^{(k)}_{k,j},   j = k+1, ..., n

  b^{(k+1)}_i = b^{(k)}_i - m_{i,k} b^{(k)}_k

for i = k+1, ..., n. This yields the augmented matrix

Page 606: Analiza Numerica [Utm, Bostan v.]

[A^{(k+1)} | b^{(k+1)}]:

  [ a^{(1)}_{1,1}   ...                    ...              a^{(1)}_{1,n}   | b^{(1)}_1       ]
  [       0         ...                                          ...        |    ...          ]
  [       ...   a^{(k)}_{k,k}  a^{(k)}_{k,k+1}  ...          a^{(k)}_{k,n}  | b^{(k)}_k       ]
  [       ...       0      a^{(k+1)}_{k+1,k+1}  ...      a^{(k+1)}_{k+1,n}  | b^{(k+1)}_{k+1} ]
  [       ...                    ...                             ...        |    ...          ]
  [       0    ...  0       a^{(k+1)}_{n,k+1}   ...        a^{(k+1)}_{n,n}  | b^{(k+1)}_n     ]

Doing this for k = 1, 2, ..., n-1 leads to the upper triangular system with the augmented matrix

  [ a^{(1)}_{1,1}  ...     a^{(1)}_{1,n} | b^{(1)}_1 ]
  [       0        ...          ...      |    ...    ]
  [       0   ...  0     a^{(n)}_{n,n}   | b^{(n)}_n ]

We later remove the assumption

  a^{(k)}_{k,k} ≠ 0,   k = 1, 2, ..., n
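The elimination loop translates directly into MATLAB. A minimal sketch without pivoting (an illustrative function of our own, assuming every pivot a^{(k)}_{k,k} is nonzero):

  function [U, g] = gauss_elim(A, b)
  % Reduce A*x = b to an upper triangular system U*x = g (no pivoting).
  n = length(b);
  for k = 1:n-1
      for i = k+1:n
          m = A(i, k) / A(k, k);                   % multiplier m_{i,k}
          A(i, k:n) = A(i, k:n) - m * A(k, k:n);   % row i minus m times pivot row
          b(i) = b(i) - m * b(k);
      end
  end
  U = A; g = b;
  end

Together with the earlier backsub sketch, [U, g] = gauss_elim(A, b); x = backsub(U, g); solves the system.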

Page 607: Analiza Numerica [Utm, Bostan v.]

QUESTIONS

• How do we remove the assumption on the pivot elements?

• How many operations are involved in this procedure?

• How much error is there in the computed solution due to rounding errors in the calculations?

• How does the machine architecture affect the implementation of this algorithm?

Page 608: Analiza Numerica [Utm, Bostan v.]

PARTIAL PIVOTING

Recall the reduction of

  [A^{(1)} | b^{(1)}] = [ a^{(1)}_{1,1} ... a^{(1)}_{1,n} | b^{(1)}_1 ]
                        [      ...      ...      ...      |    ...    ]
                        [ a^{(1)}_{n,1} ... a^{(1)}_{n,n} | b^{(1)}_n ]

to

  [A^{(2)} | b^{(2)}] = [ a^{(1)}_{1,1}  a^{(1)}_{1,2} ... a^{(1)}_{1,n} | b^{(1)}_1 ]
                        [       0        a^{(2)}_{2,2} ... a^{(2)}_{2,n} | b^{(2)}_2 ]
                        [      ...            ...      ...      ...      |    ...    ]
                        [       0        a^{(2)}_{n,2} ... a^{(2)}_{n,n} | b^{(2)}_n ]

What if a^{(1)}_{1,1} = 0? In that case we look for an equation in which x_1 is present. To do this in such a way as to avoid zero pivots to the maximum extent possible, we do the following.

Page 609: Analiza Numerica [Utm, Bostan v.]

Look at all the elements in the first column,

  a^{(1)}_{1,1}, a^{(1)}_{2,1}, ..., a^{(1)}_{n,1}

and pick the largest in size. Say it is

  | a^{(1)}_{k,1} | = max_{j=1,...,n} | a^{(1)}_{j,1} |

Then interchange equations 1 and k, which means interchanging rows 1 and k in the augmented matrix [A^{(1)} | b^{(1)}]. Then proceed with the elimination of x_1 from equations 2 thru n as before.

Having obtained

  [A^{(2)} | b^{(2)}] = [ a^{(1)}_{1,1}  a^{(1)}_{1,2} ... a^{(1)}_{1,n} | b^{(1)}_1 ]
                        [       0        a^{(2)}_{2,2} ... a^{(2)}_{2,n} | b^{(2)}_2 ]
                        [      ...            ...      ...      ...      |    ...    ]
                        [       0        a^{(2)}_{n,2} ... a^{(2)}_{n,n} | b^{(2)}_n ]

what if a^{(2)}_{2,2} = 0? Then we proceed as before.

Page 610: Analiza Numerica [Utm, Bostan v.]

Among the elements

  a^{(2)}_{2,2}, a^{(2)}_{3,2}, ..., a^{(2)}_{n,2}

pick the one of largest size:

  | a^{(2)}_{k,2} | = max_{j=2,...,n} | a^{(2)}_{j,2} |

Interchange rows 2 and k. Then proceed as before to eliminate x_2 from equations 3 thru n, thus obtaining

  [A^{(3)} | b^{(3)}] = [ a^{(1)}_{1,1}  a^{(1)}_{1,2}  a^{(1)}_{1,3} ... a^{(1)}_{1,n} | b^{(1)}_1 ]
                        [       0        a^{(2)}_{2,2}  a^{(2)}_{2,3} ... a^{(2)}_{2,n} | b^{(2)}_2 ]
                        [       0             0         a^{(3)}_{3,3} ... a^{(3)}_{3,n} | b^{(3)}_3 ]
                        [      ...           ...             ...      ...      ...      |    ...    ]
                        [       0             0         a^{(3)}_{n,3} ... a^{(3)}_{n,n} | b^{(3)}_n ]

This is done at every stage of the elimination process. This technique is called partial pivoting, and it is a part of most Gaussian elimination programs (including the one in the text).
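A minimal modification of the gauss_elim sketch above adds the row interchange (again our own illustrative code, not the program in the text):

  function [U, g] = gauss_elim_pp(A, b)
  % Gaussian elimination with partial pivoting.
  n = length(b);
  for k = 1:n-1
      [~, p] = max(abs(A(k:n, k)));      % index of largest pivot candidate
      p = p + k - 1;
      A([k p], :) = A([p k], :);         % interchange rows k and p
      b([k p]) = b([p k]);
      for i = k+1:n
          m = A(i, k) / A(k, k);
          A(i, k:n) = A(i, k:n) - m * A(k, k:n);
          b(i) = b(i) - m * b(k);
      end
  end
  U = A; g = b;
  end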

Page 611: Analiza Numerica [Utm, Bostan v.]

Consequences of partial pivoting. Recall the definition of the elements obtained in the process of eliminating x_1 from equations 2 thru n:

  m_{i,1} = a^{(1)}_{i,1} / a^{(1)}_{1,1},   i = 2, ..., n

  a^{(2)}_{i,j} = a^{(1)}_{i,j} - m_{i,1} a^{(1)}_{1,j},   j = 2, ..., n

  b^{(2)}_i = b^{(1)}_i - m_{i,1} b^{(1)}_1

for i = 2, ..., n. By our definition of the pivot element a^{(1)}_{1,1}, we have

  | m_{i,1} | ≤ 1,   i = 2, ..., n

Thus in the calculation of a^{(2)}_{i,j} and b^{(2)}_i, the elements do not grow rapidly in size. This is in comparison to what might happen otherwise, in which the multipliers m_{i,1} might have been very large. This property is true of the multipliers at every step of the elimination process:

  | m_{i,k} | ≤ 1,   i = k+1, ..., n,   k = 1, ..., n-1

Page 612: Analiza Numerica [Utm, Bostan v.]

The property

  | m_{i,k} | ≤ 1,   i = k+1, ..., n

leads to good error propagation properties in Gaussian elimination with partial pivoting. The only error in Gaussian elimination is that derived from the rounding errors in the arithmetic operations. For example, at the first elimination step (eliminating x_1 from equations 2 thru n),

  a^{(2)}_{i,j} = a^{(1)}_{i,j} - m_{i,1} a^{(1)}_{1,j},   j = 2, ..., n

  b^{(2)}_i = b^{(1)}_i - m_{i,1} b^{(1)}_1

The above property on the size of the multipliers prevents these numbers and the errors in their calculation from growing as rapidly as they might if no partial pivoting was used.

As an example of the improvement in accuracy obtained with partial pivoting, see the example on pages 262-263.

Page 613: Analiza Numerica [Utm, Bostan v.]

OPERATION COUNTS

One of the major ways in which we compare the efficiency of different numerical methods is to count the number of needed arithmetic operations. For solving the linear system

  a_{1,1} x_1 + ... + a_{1,n} x_n = b_1
  ...
  a_{n,1} x_1 + ... + a_{n,n} x_n = b_n

using Gaussian elimination, we have the following operation counts.

1. A → U, where we are converting Ax = b to Ux = g:

  Divisions:        n(n-1)/2
  Additions:        n(n-1)(2n-1)/6
  Multiplications:  n(n-1)(2n-1)/6

Page 614: Analiza Numerica [Utm, Bostan v.]

2. b → g:

  Additions:        n(n-1)/2
  Multiplications:  n(n-1)/2

3. Solving Ux = g:

  Divisions:        n
  Additions:        n(n-1)/2
  Multiplications:  n(n-1)/2

On some machines, the cost of a division is much more than that of a multiplication; whereas on others there is not any important difference. We assume the latter; and then the operation costs are as follows.

  MD(A → U)  = n(n^2 - 1)/3
  MD(b → g)  = n(n-1)/2
  MD(Find x) = n(n+1)/2

Page 615: Analiza Numerica [Utm, Bostan v.]

  AS(A → U)  = n(n-1)(2n-1)/6
  AS(b → g)  = n(n-1)/2
  AS(Find x) = n(n-1)/2

Thus the total number of operations is

  Additions:                     (2n^3 + 3n^2 - 5n)/6
  Multiplications and Divisions: (n^3 + 3n^2 - n)/3

Both are around n^3/3, and thus the total operation count is approximately

  (2/3) n^3

What happens to the cost when n is doubled?

Page 616: Analiza Numerica [Utm, Bostan v.]

Solving Ax = b and Ax = c. What is the cost? Only the modification of the right side is different in these two cases. Thus the additional cost is

  MD(b → g) + MD(Find x) = n^2
  AS(b → g) + AS(Find x) = n(n-1)

The total is around 2n^2 operations, which is quite a bit smaller than (2/3)n^3 when n is even moderately large, say n = 100.

Thus one can solve the linear system Ax = c at little additional cost to that for solving Ax = b. This has important consequences when it comes to estimation of the error in computed solutions.

Page 617: Analiza Numerica [Utm, Bostan v.]

CALCULATING THE MATRIX INVERSE

Consider finding the inverse of a 3 × 3 matrix

      [ a_{1,1}  a_{1,2}  a_{1,3} ]
  A = [ a_{2,1}  a_{2,2}  a_{2,3} ] = [ A_{*,1}, A_{*,2}, A_{*,3} ]
      [ a_{3,1}  a_{3,2}  a_{3,3} ]

We want to find a matrix

  X = [ X_{*,1}, X_{*,2}, X_{*,3} ]

for which

  AX = I
  A [ X_{*,1}, X_{*,2}, X_{*,3} ] = [ e_1, e_2, e_3 ]
  [ AX_{*,1}, AX_{*,2}, AX_{*,3} ] = [ e_1, e_2, e_3 ]

This means we want to solve

  AX_{*,1} = e_1,   AX_{*,2} = e_2,   AX_{*,3} = e_3

We want to solve three linear systems, all with the same matrix of coefficients A.

Page 618: Analiza Numerica [Utm, Bostan v.]

MATRIX INVERSE EXAMPLE

      [ 1   1  -2 ]
  A = [ 1   1   1 ]
      [ 1  -1   0 ]

  [ 1   1  -2 | 1  0  0 ]
  [ 1   1   1 | 0  1  0 ]
  [ 1  -1   0 | 0  0  1 ]

With m_{2,1} = 1 and m_{3,1} = 1, this becomes

  [ 1   1  -2 |  1  0  0 ]
  [ 0   0   3 | -1  1  0 ]
  [ 0  -2   2 | -1  0  1 ]

Interchanging rows 2 and 3 (the pivot position holds a zero) gives

  [ 1   1  -2 |  1  0  0 ]
  [ 0  -2   2 | -1  0  1 ]
  [ 0   0   3 | -1  1  0 ]

Page 619: Analiza Numerica [Utm, Bostan v.]

  [ 1   1  -2 |  1  0  0 ]
  [ 0  -2   2 | -1  0  1 ]
  [ 0   0   3 | -1  1  0 ]

Then by using back substitution to solve for each column of the inverse, we obtain

           [  1/6   1/3   1/2 ]
  A^{-1} = [  1/6   1/3  -1/2 ]
           [ -1/3   1/3    0  ]

Page 620: Analiza Numerica [Utm, Bostan v.]

COST OF MATRIX INVERSION

In calculating A^{-1}, we are solving for the matrix X = [ X_{*,1}, X_{*,2}, ..., X_{*,n} ] where

  A [ X_{*,1}, X_{*,2}, ..., X_{*,n} ] = [ e_1, e_2, ..., e_n ]

and e_j is column j of the identity matrix. Thus we are solving n linear systems

  AX_{*,1} = e_1,  AX_{*,2} = e_2,  ...,  AX_{*,n} = e_n     (1)

all with the same coefficient matrix. Returning to the earlier operation counts for solving a single linear system, we have the following.

  Cost of triangulating A:  approx. (2/3)n^3 operations
  Cost of solving Ax = b:   2n^2 operations

Thus solving the n linear systems in (1) costs approximately

  (2/3)n^3 + n (2n^2) = (8/3)n^3 operations

It costs approximately four times as many operations to invert A as to solve a single system. With attention to the form of the right-hand sides in (1) this can be reduced to 2n^3 operations.

Page 621: Analiza Numerica [Utm, Bostan v.]

MATLAB MATRIX OPERATIONS

To solve the linear system Ax = b in Matlab, use

  x = A \ b

In Matlab, the command

  inv(A)

will calculate the inverse of A.

There are many matrix operations built into Matlab, both for general matrices and for special classes of matrices. We do not discuss those here, but recommend the student to investigate these thru the Matlab help options.
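For instance, the inverse computed by hand in the earlier example can be checked with these commands (a sketch; the comments show the expected results):

  A = [1 1 -2; 1 1 1; 1 -1 0];
  X = inv(A);        % reproduces [1/6 1/3 1/2; 1/6 1/3 -1/2; -1/3 1/3 0]
  b = [1; 2; 3];
  x = A \ b;         % preferred over inv(A)*b: one solve, no full inverse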

Page 622: Analiza Numerica [Utm, Bostan v.]

GAUSSIAN ELIMINATION - REVISITED

Consider solving the linear system

   2x_1 +   x_2 -  x_3 + 2x_4 = 5
   4x_1 +  5x_2 - 3x_3 + 6x_4 = 9
  -2x_1 +  5x_2 - 2x_3 + 6x_4 = 4
   4x_1 + 11x_2 - 4x_3 + 8x_4 = 2

by Gaussian elimination without pivoting. We denote this linear system by Ax = b. The augmented matrix for this system is

            [  2   1  -1  2 | 5 ]
  [A | b] = [  4   5  -3  6 | 9 ]
            [ -2   5  -2  6 | 4 ]
            [  4  11  -4  8 | 2 ]

To eliminate x_1 from equations 2, 3, and 4, use multipliers

  m_{2,1} = 2,  m_{3,1} = -1,  m_{4,1} = 2

Page 623: Analiza Numerica [Utm, Bostan v.]

To eliminate x_1 from equations 2, 3, and 4, use multipliers

  m_{2,1} = 2,  m_{3,1} = -1,  m_{4,1} = 2

This will introduce zeros into the positions below the diagonal in column 1, yielding

  [ 2  1  -1  2 |  5 ]
  [ 0  3  -1  2 | -1 ]
  [ 0  6  -3  8 |  9 ]
  [ 0  9  -2  4 | -8 ]

To eliminate x_2 from equations 3 and 4, use multipliers

  m_{3,2} = 2,  m_{4,2} = 3

This reduces the augmented matrix to

  [ 2  1  -1   2 |  5 ]
  [ 0  3  -1   2 | -1 ]
  [ 0  0  -1   4 | 11 ]
  [ 0  0   1  -2 | -5 ]

Page 624: Analiza Numerica [Utm, Bostan v.]

To eliminate x_3 from equation 4, use the multiplier

  m_{4,3} = -1

This reduces the augmented matrix to

  [ 2  1  -1  2 |  5 ]
  [ 0  3  -1  2 | -1 ]
  [ 0  0  -1  4 | 11 ]
  [ 0  0   0  2 |  6 ]

Return this to the familiar linear system

  2x_1 + x_2 -  x_3 + 2x_4 =  5
        3x_2 -  x_3 + 2x_4 = -1
             -  x_3 + 4x_4 = 11
                      2x_4 =  6

Solving by back substitution, we obtain

  x_4 = 3,  x_3 = 1,  x_2 = -2,  x_1 = 1

Page 625: Analiza Numerica [Utm, Bostan v.]

There is a surprising result involving matrices associated with this elimination process. Introduce the upper triangular matrix

      [ 2  1  -1  2 ]
  U = [ 0  3  -1  2 ]
      [ 0  0  -1  4 ]
      [ 0  0   0  2 ]

which resulted from the elimination process. Then introduce the lower triangular matrix

      [    1       0       0      0 ]   [  1  0   0  0 ]
  L = [ m_{2,1}    1       0      0 ] = [  2  1   0  0 ]
      [ m_{3,1}  m_{3,2}   1      0 ]   [ -1  2   1  0 ]
      [ m_{4,1}  m_{4,2}  m_{4,3} 1 ]   [  2  3  -1  1 ]

This uses the multipliers introduced in the elimination process. Then A = LU:

  [  2   1  -1  2 ]   [  1  0   0  0 ] [ 2  1  -1  2 ]
  [  4   5  -3  6 ] = [  2  1   0  0 ] [ 0  3  -1  2 ]
  [ -2   5  -2  6 ]   [ -1  2   1  0 ] [ 0  0  -1  4 ]
  [  4  11  -4  8 ]   [  2  3  -1  1 ] [ 0  0   0  2 ]

Page 626: Analiza Numerica [Utm, Bostan v.]

In general, when the process of Gaussian elimination

without pivoting is applied to solving a linear system

Ax = b, we obtain A = LU with L and U constructed

as above.

For the case in which partial pivoting is used, we ob-

tain the slightly modified result

LU = PA

where L and U are constructed as before and P is a

permutation matrix. For example, consider

      [ 0  0  1  0 ]
  P = [ 1  0  0  0 ]
      [ 0  0  0  1 ]
      [ 0  1  0  0 ]

Then

       [ 0  0  1  0 ] [ a_{1,1}  a_{1,2}  a_{1,3}  a_{1,4} ]   [ A_{3,*} ]
  PA = [ 1  0  0  0 ] [ a_{2,1}  a_{2,2}  a_{2,3}  a_{2,4} ] = [ A_{1,*} ]
       [ 0  0  0  1 ] [ a_{3,1}  a_{3,2}  a_{3,3}  a_{3,4} ]   [ A_{4,*} ]
       [ 0  1  0  0 ] [ a_{4,1}  a_{4,2}  a_{4,3}  a_{4,4} ]   [ A_{2,*} ]

Page 627: Analiza Numerica [Utm, Bostan v.]

       [ 0  0  1  0 ] [ a_{1,1}  a_{1,2}  a_{1,3}  a_{1,4} ]   [ A_{3,*} ]
  PA = [ 1  0  0  0 ] [ a_{2,1}  a_{2,2}  a_{2,3}  a_{2,4} ] = [ A_{1,*} ]
       [ 0  0  0  1 ] [ a_{3,1}  a_{3,2}  a_{3,3}  a_{3,4} ]   [ A_{4,*} ]
       [ 0  1  0  0 ] [ a_{4,1}  a_{4,2}  a_{4,3}  a_{4,4} ]   [ A_{2,*} ]

The matrix PA is obtained from A by switching around rows of A. The result LU = PA means that the LU factorization is valid for the matrix A with its rows suitably permuted.

Page 628: Analiza Numerica [Utm, Bostan v.]

Consequences: If we have a factorization

  A = LU

with L lower triangular and U upper triangular, then we can solve the linear system Ax = b in a relatively straightforward way.

The linear system can be written as

  LUx = b

Write this as a two stage process:

  Lg = b,   Ux = g

The system Lg = b is a lower triangular system:

  g_1 = b_1
  l_{2,1} g_1 + g_2 = b_2
  l_{3,1} g_1 + l_{3,2} g_2 + g_3 = b_3
  ...
  l_{n,1} g_1 + ... + l_{n,n-1} g_{n-1} + g_n = b_n

We solve it by "forward substitution". Then we solve the upper triangular system Ux = g by back substitution.
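Forward substitution mirrors the earlier back substitution sketch (again an illustrative function of our own; L is assumed to have a unit diagonal, as produced by Gaussian elimination):

  function g = forwardsub(L, b)
  % Solve L*g = b for unit lower triangular L by forward substitution.
  n = length(b);
  g = zeros(n, 1);
  g(1) = b(1);
  for i = 2:n
      g(i) = b(i) - L(i, 1:i-1) * g(1:i-1);
  end
  end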

Page 629: Analiza Numerica [Utm, Bostan v.]

VARIANTS OF GAUSSIAN ELIMINATION

If no partial pivoting is needed, then we can look for a factorization

  A = LU

without going thru the Gaussian elimination process. For example, suppose A is 4 × 4. We write

  [ a_{1,1}  a_{1,2}  a_{1,3}  a_{1,4} ]   [    1       0       0      0 ] [ u_{1,1}  u_{1,2}  u_{1,3}  u_{1,4} ]
  [ a_{2,1}  a_{2,2}  a_{2,3}  a_{2,4} ] = [ l_{2,1}    1       0      0 ] [    0     u_{2,2}  u_{2,3}  u_{2,4} ]
  [ a_{3,1}  a_{3,2}  a_{3,3}  a_{3,4} ]   [ l_{3,1}  l_{3,2}   1      0 ] [    0        0     u_{3,3}  u_{3,4} ]
  [ a_{4,1}  a_{4,2}  a_{4,3}  a_{4,4} ]   [ l_{4,1}  l_{4,2}  l_{4,3} 1 ] [    0        0        0     u_{4,4} ]

To find the elements {l_{i,j}} and {u_{i,j}}, we multiply the right side matrices L and U and match the results with the corresponding elements in A.

Page 630: Analiza Numerica [Utm, Bostan v.]

Multiplying the first row of L times all of the columns of U leads to

  u_{1,j} = a_{1,j},   j = 1, 2, 3, 4

Then multiplying rows 2, 3, 4 times the first column of U yields

  l_{i,1} u_{1,1} = a_{i,1},   i = 2, 3, 4

and we can solve for { l_{2,1}, l_{3,1}, l_{4,1} }. We can continue this process, finding the second row of U and then the second column of L, and so on. For example, to solve for l_{4,3}, we need to solve for it in

  l_{4,1} u_{1,3} + l_{4,2} u_{2,3} + l_{4,3} u_{3,3} = a_{4,3}

Why do this? A hint of an answer is given by this last equation. If we had an n × n matrix A, then we would find l_{n,n-1} by solving for it in the equation

  l_{n,1} u_{1,n-1} + l_{n,2} u_{2,n-1} + ... + l_{n,n-1} u_{n-1,n-1} = a_{n,n-1}

  l_{n,n-1} = ( a_{n,n-1} - [ l_{n,1} u_{1,n-1} + ... + l_{n,n-2} u_{n-2,n-1} ] ) / u_{n-1,n-1}

Page 631: Analiza Numerica [Utm, Bostan v.]

Embedded in this formula we have a dot product. This is in fact typical of this process, with the length of the inner products varying from one position to another.

Recalling the discussion of dot products, we can evaluate this last formula by using a higher precision arithmetic and thus avoid many rounding errors.

This leads to a variant of Gaussian elimination in which there are far fewer rounding errors.

With ordinary Gaussian elimination, the number of rounding errors is proportional to n^3. This variant reduces the number of rounding errors, with the number now being proportional to only n^2. This can lead to major increases in accuracy, especially for matrices which are very sensitive to small changes.

Page 632: Analiza Numerica [Utm, Bostan v.]

TRIDIAGONAL MATRICES

      [ b_1  c_1   0    0   ...     0     ]
      [ a_2  b_2  c_2   0                 ]
  A = [  0   a_3  b_3  c_3                ]
      [          ...   ...   ...          ]
      [       a_{n-1}  b_{n-1}  c_{n-1}   ]
      [  0   ...        a_n      b_n      ]

These occur very commonly in the numerical solution of partial differential equations, as well as in other applications (e.g. computing interpolating cubic spline functions).

We factor A = LU, as before. But now L and U take very simple forms. Before proceeding, we note with an example that the same may not be true of the matrix inverse.

Page 633: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

Define an n × n tridiagonal matrix

      [ -1   1    0    0   ...      0      ]
      [  1  -2    1    0                   ]
  A = [  0   1   -2    1                   ]
      [          ...  ...   ...            ]
      [               1   -2       1       ]
      [  0   ...          1   -(n-1)/n     ]

Then A^{-1} is given by

  ( A^{-1} )_{i,j} = max {i, j}

Thus the sparse matrix A can (and usually does) have a dense inverse.

Page 634: Analiza Numerica [Utm, Bostan v.]

We factor A = LU, with

      [  1    0    0   ...   0 ]
      [ α_2   1    0          ]
  L = [  0   α_3   1          ]
      [       ...  ...        ]
      [  0   ...  α_n     1   ]

      [ β_1  c_1   0   ...   0 ]
      [  0   β_2  c_2         ]
  U = [  0    0   β_3  c_3    ]
      [          ...   ...    ]
      [  0   ...   0    β_n   ]

Multiply these and match coefficients with A to find {α_i, β_i}.

Page 635: Analiza Numerica [Utm, Bostan v.]

To solve the linear system

  Ax = f

or

  LUx = f

instead solve the two triangular systems

  Lg = f,   Ux = g

Solving Lg = f:

  g_1 = f_1
  g_j = f_j - α_j g_{j-1},   j = 2, ..., n

Solving Ux = g:

  x_n = g_n / β_n
  x_j = ( g_j - c_j x_{j+1} ) / β_j,   j = n-1, ..., 1

Page 636: Analiza Numerica [Utm, Bostan v.]

By doing a few multiplications of rows of L times columns of U, we obtain the general pattern as follows:

  β_1 = b_1                                    : row 1 of LU
  α_2 β_1 = a_2,  α_2 c_1 + β_2 = b_2          : row 2 of LU
  ...
  α_n β_{n-1} = a_n,  α_n c_{n-1} + β_n = b_n  : row n of LU

These are straightforward to solve:

  β_1 = b_1
  α_j = a_j / β_{j-1},   β_j = b_j - α_j c_{j-1},   j = 2, ..., n
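The whole factor-and-solve procedure fits in a few lines of MATLAB. A sketch under the stated assumption that no pivoting is needed (the vectors a, b, c hold the sub-, main, and super-diagonals; the function name is ours):

  function x = trisolve(a, b, c, f)
  % Solve a tridiagonal system A*x = f, with subdiagonal a(2:n),
  % diagonal b(1:n), superdiagonal c(1:n-1). No pivoting.
  n = length(b);
  beta = zeros(n,1); g = zeros(n,1); x = zeros(n,1);
  beta(1) = b(1); g(1) = f(1);
  for j = 2:n                      % factor and forward substitute together
      alpha = a(j) / beta(j-1);
      beta(j) = b(j) - alpha * c(j-1);
      g(j) = f(j) - alpha * g(j-1);
  end
  x(n) = g(n) / beta(n);
  for j = n-1:-1:1                 % back substitution
      x(j) = (g(j) - c(j) * x(j+1)) / beta(j);
  end
  end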

Page 637: Analiza Numerica [Utm, Bostan v.]

OPERATIONS COUNT

Factoring A = LU:

  Additions:        n - 1
  Multiplications:  n - 1
  Divisions:        n - 1

Solving Lg = f and Ux = g:

  Additions:        2n - 2
  Multiplications:  2n - 2
  Divisions:        n

Thus the total number of arithmetic operations is approximately 3n to factor A; and it takes about 5n to solve the linear system using the factorization of A.

If we had A^{-1} at no cost, what would it cost to compute x = A^{-1} f?

  x_i = Σ_{j=1}^{n} ( A^{-1} )_{i,j} f_j,   i = 1, ..., n

Page 638: Analiza Numerica [Utm, Bostan v.]

MATLAB MATRIX OPERATIONS

To obtain the LU factorization of a matrix, including the use of partial pivoting, use the Matlab command lu. In particular,

  [L, U, P] = lu(X)

returns the lower triangular matrix L, upper triangular matrix U, and permutation matrix P so that

  PX = LU
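Combined with the two-stage solve described earlier, this gives (a sketch, using the 4 × 4 example from the preceding slides):

  A = [2 1 -1 2; 4 5 -3 6; -2 5 -2 6; 4 11 -4 8];
  b = [5; 9; 4; 2];
  [L, U, P] = lu(A);
  g = L \ (P*b);     % forward substitution on the permuted right side
  x = U \ g;         % back substitution; x should be [1; -2; 1; 3]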

Page 639: Analiza Numerica [Utm, Bostan v.]

NUMERICAL INTEGRATION

How do you evaluate

  I = ∫_a^b f(x) dx

From calculus, if F(x) is an antiderivative of f(x), then

  I = ∫_a^b f(x) dx = F(b) - F(a)

However, in practice most integrals cannot be evaluated by this means. And even when this can work, an approximate numerical method may be much simpler and easier to use. For example, the integrand in

  ∫_0^1 dx / (1 + x^5)

has an extremely complicated antiderivative; and it is easier to evaluate the integral by approximate means. Try evaluating this integral with Maple or Mathematica.

Page 640: Analiza Numerica [Utm, Bostan v.]

NUMERICAL INTEGRATION: A GENERAL FRAMEWORK

Returning to a lesson used earlier with rootfinding: if you cannot solve a problem, then replace it with a "near-by" problem that you can solve. In our case, we want to evaluate

  I = ∫_a^b f(x) dx

To do so, many of the numerical schemes are based on choosing approximates of f(x). Calling one such approximation f̃(x), use

  I ≈ ∫_a^b f̃(x) dx ≡ Ĩ

What is the error?

  E = I - Ĩ = ∫_a^b [ f(x) - f̃(x) ] dx

  |E| ≤ ∫_a^b | f(x) - f̃(x) | dx ≤ (b - a) ||f - f̃||_∞

where

  ||f - f̃||_∞ ≡ max_{a≤x≤b} | f(x) - f̃(x) |

Page 641: Analiza Numerica [Utm, Bostan v.]

We also want to choose the approximates f̃(x) of a form we can integrate directly and easily. Examples are polynomials, trig functions, piecewise polynomials, and others.

If we use polynomial approximations, then how do we choose them? At this point, we have two choices:

1. Taylor polynomials approximating f(x)

2. Interpolatory polynomials approximating f(x)

Page 642: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

Consider evaluating

  I = ∫_0^1 e^{x^2} dx

Use

  e^t = 1 + t + t^2/2! + ... + t^n/n! + t^{n+1} e^{c_t} / (n+1)!

  e^{x^2} = 1 + x^2 + x^4/2! + ... + x^{2n}/n! + x^{2n+2} e^{d_x} / (n+1)!

with 0 ≤ d_x ≤ x^2. Then

  I = ∫_0^1 [ 1 + x^2 + x^4/2! + ... + x^{2n}/n! ] dx
      + 1/(n+1)! ∫_0^1 x^{2n+2} e^{d_x} dx

Taking n = 3, we have

  I = 1 + 1/3 + 1/10 + 1/42 + E = 1.4571 + E

  0 < E ≤ (e/24) ∫_0^1 x^8 dx = e/216 = .0126

Page 643: Analiza Numerica [Utm, Bostan v.]

USING INTERPOLATORY POLYNOMIALS

In spite of the simplicity of the above example, it is generally more difficult to do numerical integration by constructing Taylor polynomial approximations than by constructing polynomial interpolates. We therefore construct the function f̃ in

  ∫_a^b f(x) dx ≈ ∫_a^b f̃(x) dx

by means of interpolation.

Initially, we consider only the case in which the interpolation is based on interpolation at evenly spaced node points.

Page 644: Analiza Numerica [Utm, Bostan v.]

LINEAR INTERPOLATION

The linear interpolant to f(x), interpolating at a and b, is given by

  P_1(x) = [ (b - x) f(a) + (x - a) f(b) ] / (b - a)

Using this linear interpolant, we obtain the approximation

  ∫_a^b f(x) dx ≈ ∫_a^b P_1(x) dx = (1/2)(b - a) [ f(a) + f(b) ] ≡ T_1(f)

The rule

  ∫_a^b f(x) dx ≈ T_1(f)

is called the trapezoidal rule.

Page 645: Analiza Numerica [Utm, Bostan v.]

[Figure: illustrating I ≈ T_1(f); the area under y = f(x) on [a, b] is approximated by the area under the chord y = P_1(x).]

Example.

  ∫_0^{π/2} sin x dx ≈ (π/4) [ sin 0 + sin(π/2) ] = π/4 ≐ .785398

  Error = .215

Page 646: Analiza Numerica [Utm, Bostan v.]

HOW TO OBTAIN GREATER ACCURACY?

How do we improve our estimate of the integral

  I = ∫_a^b f(x) dx

One direction is to increase the degree of the approximation, moving next to a quadratic interpolating polynomial for f(x). We first look at an alternative.

Instead of using the trapezoidal rule on the original interval [a, b], apply it to integrals of f(x) over smaller subintervals. For example:

  I = ∫_a^c f(x) dx + ∫_c^b f(x) dx,   c = (a + b)/2

    ≈ [(c - a)/2] [ f(a) + f(c) ] + [(b - c)/2] [ f(c) + f(b) ]

    = (h/2) [ f(a) + 2f(c) + f(b) ] ≡ T_2(f),   h = (b - a)/2

Example.

  ∫_0^{π/2} sin x dx ≈ (π/8) [ sin 0 + 2 sin(π/4) + sin(π/2) ] ≐ .948059

  Error = .0519

Page 647: Analiza Numerica [Utm, Bostan v.]

[Figure: illustrating I ≈ T_3(f); the area under y = f(x) is approximated by three trapezoids over the nodes a = x_0, x_1, x_2, b = x_3.]

Page 648: Analiza Numerica [Utm, Bostan v.]

THE TRAPEZOIDAL RULE

We can continue as above by dividing [a, b] into even smaller subintervals and applying

  ∫_α^β f(x) dx ≈ [(β - α)/2] [ f(α) + f(β) ]     (*)

on each of the smaller subintervals. Begin by introducing a positive integer n ≥ 1,

  h = (b - a)/n,   x_j = a + j h,   j = 0, 1, ..., n

Then

  I = ∫_{x_0}^{x_n} f(x) dx
    = ∫_{x_0}^{x_1} f(x) dx + ∫_{x_1}^{x_2} f(x) dx + ... + ∫_{x_{n-1}}^{x_n} f(x) dx

Use [α, β] = [x_0, x_1], [x_1, x_2], ..., [x_{n-1}, x_n], for each of which the subinterval has length h.

Page 649: Analiza Numerica [Utm, Bostan v.]

Then applying

  ∫_α^β f(x) dx ≈ [(β - α)/2] [ f(α) + f(β) ]

we have

  I ≈ (h/2) [ f(x_0) + f(x_1) ] + (h/2) [ f(x_1) + f(x_2) ]
      + ... + (h/2) [ f(x_{n-2}) + f(x_{n-1}) ] + (h/2) [ f(x_{n-1}) + f(x_n) ]

Simplifying,

  I ≈ h [ (1/2) f(a) + f(x_1) + ... + f(x_{n-1}) + (1/2) f(b) ] ≡ T_n(f)

This is called the "composite trapezoidal rule", or more simply, the trapezoidal rule.
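A minimal MATLAB sketch of T_n(f) (an illustrative function of our own; f is a function handle that accepts vectors):

  function T = trap(f, a, b, n)
  % Composite trapezoidal rule with n subintervals.
  h = (b - a) / n;
  x = a + h * (0:n);                         % nodes x_0, ..., x_n
  y = f(x);
  T = h * (sum(y) - 0.5*y(1) - 0.5*y(end));  % half weights at the endpoints
  end

For example, trap(@sin, 0, pi/2, 16) should reproduce the n = 16 entry in the table that follows.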

Page 650: Analiza Numerica [Utm, Bostan v.]

Example. Again integrate sin x over [0, π/2]. Then we have

  n     T_n(f)        Error     Ratio
  1     .785398163    2.15E-1
  2     .948059449    5.19E-2   4.13
  4     .987115801    1.29E-2   4.03
  8     .996785172    3.21E-3   4.01
  16    .999196680    8.03E-4   4.00
  32    .999799194    2.01E-4   4.00
  64    .999949800    5.02E-5   4.00
  128   .999987450    1.26E-5   4.00
  256   .999996863    3.14E-6   4.00

Note that the errors are decreasing by a constant factor of 4. Why do we always double n?

Page 651: Analiza Numerica [Utm, Bostan v.]

USING QUADRATIC INTERPOLATION

We want to approximate I = ∫_a^b f(x) dx using quadratic interpolation of f(x). Interpolate f(x) at the points {a, c, b}, with c = (a + b)/2. Also let h = (b - a)/2. The quadratic interpolating polynomial is given by

  P_2(x) = [(x - c)(x - b)/(2h^2)] f(a) + [(x - a)(x - b)/(-h^2)] f(c)
           + [(x - a)(x - c)/(2h^2)] f(b)

Replacing f(x) by P_2(x), we obtain the approximation

  ∫_a^b f(x) dx ≈ ∫_a^b P_2(x) dx = (h/3) [ f(a) + 4f(c) + f(b) ] ≡ S_2(f)

This is called Simpson's rule.

Page 652: Analiza Numerica [Utm, Bostan v.]

[Figure: illustration of I ≈ S_2(f); y = f(x) on [a, b] is approximated by the parabola through the points at a, (a+b)/2, and b.]

Example. With h = (b - a)/2 = π/4,

  ∫_0^{π/2} sin x dx ≈ [(π/4)/3] [ sin 0 + 4 sin(π/4) + sin(π/2) ] ≐ 1.00227987749221

  Error = -0.00228

Page 653: Analiza Numerica [Utm, Bostan v.]

SIMPSON'S RULE

As with the trapezoidal rule, we can apply Simpson's rule on smaller subdivisions in order to obtain better accuracy in approximating

  I = ∫_a^b f(x) dx

Again, Simpson's rule is given by

  ∫_α^β f(x) dx ≈ (δ/3) [ f(α) + 4f(γ) + f(β) ],   γ = (α + β)/2

and δ = (β - α)/2.

Let n be a positive even integer, and

  h = (b - a)/n,   x_j = a + j h,   j = 0, 1, ..., n

Then write

  I = ∫_{x_0}^{x_n} f(x) dx
    = ∫_{x_0}^{x_2} f(x) dx + ∫_{x_2}^{x_4} f(x) dx + ... + ∫_{x_{n-2}}^{x_n} f(x) dx

Page 654: Analiza Numerica [Utm, Bostan v.]

Apply

  ∫_α^β f(x) dx ≈ (δ/3) [ f(α) + 4f(γ) + f(β) ],   γ = (α + β)/2

to each of these subintegrals, with

  [α, β] = [x_0, x_2], [x_2, x_4], ..., [x_{n-2}, x_n]

In all cases, (β - α)/2 = h. Then

  I ≈ (h/3) [ f(x_0) + 4f(x_1) + f(x_2) ]
      + (h/3) [ f(x_2) + 4f(x_3) + f(x_4) ]
      + ... + (h/3) [ f(x_{n-4}) + 4f(x_{n-3}) + f(x_{n-2}) ]
      + (h/3) [ f(x_{n-2}) + 4f(x_{n-1}) + f(x_n) ]

This can be simplified to

  ∫_a^b f(x) dx ≈ S_n(f) ≡ (h/3) [ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4)
      + ... + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n) ]

This is called the "composite Simpson's rule" or, more simply, Simpson's rule.
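A matching MATLAB sketch of S_n(f) (again our own illustrative function; n must be even):

  function S = simpson(f, a, b, n)
  % Composite Simpson's rule with n (even) subintervals.
  h = (b - a) / n;
  x = a + h * (0:n);
  y = f(x);
  w = 2 * ones(1, n+1);          % interior even-index nodes keep weight 2
  w(2:2:n) = 4;                  % odd-index nodes x_1, x_3, ... get weight 4
  w([1 n+1]) = 1;                % endpoints get weight 1
  S = (h/3) * (w * y.');
  end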

Page 655: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

Approximate ∫_0^{π/2} sin x dx. The Simpson rule results are as follows.

  n     S_n(f)              Error       Ratio
  2     1.00227987749221    -2.28E-3
  4     1.00013458497419    -1.35E-4    16.94
  8     1.00000829552397    -8.30E-6    16.22
  16    1.00000051668471    -5.17E-7    16.06
  32    1.00000003226500    -3.23E-8    16.01
  64    1.00000000201613    -2.02E-9    16.00
  128   1.00000000012600    -1.26E-10   16.00
  256   1.00000000000788    -7.88E-12   16.00
  512   1.00000000000049    -4.92E-13   15.99

Note that the ratios of successive errors have converged to 16. Why? Also compare this table with that for the trapezoidal rule. For example,

  I - T_4 =  1.29E-2
  I - S_4 = -1.35E-4

Page 656: Analiza Numerica [Utm, Bostan v.]

Example 1

  I^{(1)} = ∫_0^1 e^{-x^2} dx ≐ 0.746824132812427

  I^{(2)} = ∫_0^4 dx/(1 + x^2) = arctan(4) ≐ 1.32581766366803

  I^{(3)} = ∫_0^{2π} dx/(2 + cos x) = 2π/√3 ≐ 3.62759872846844

Table 1. Trapezoidal rule applied to Example 1.

        I^{(1)}           I^{(2)}           I^{(3)}
  n     Error     R       Error     R       Error     R
  2     1.6E-2            -1.3E-1           -5.6E-1
  4     3.8E-3    4.02    -3.6E-3   37.0    -3.8E-2   14.9
  8     9.6E-4    4.01     5.6E-4   -6.4    -1.9E-4   195.0
  16    2.4E-4    4.00     1.4E-4    3.9    -5.2E-9   37600
  32    6.0E-5    4.00     3.6E-5    4.00
  64    1.5E-5    4.00     9.0E-6    4.00
  128   3.7E-6    4.00     2.3E-6    4.00

Page 657: Analiza Numerica [Utm, Bostan v.]

Table 2. Simpson rule applied to Example 1.

        I^{(1)}            I^{(2)}           I^{(3)}
  n     Error      R       Error     R       Error     R
  2     -3.6E-4            8.7E-2            -1.26
  4     -3.1E-5    11.4    3.9E-2    2.2      1.4E-1   -9.2
  8     -2.0E-6    15.7    2.0E-3    20       1.2E-2   11.2
  16    -1.3E-7    15.9    4.0E-6    485      6.4E-5   191
  32    -7.8E-9    16.0    2.3E-8    172      1.7E-9   37600
  64    -4.9E-10   16.0    1.5E-9    16
  128   -3.0E-11   16.0    9.2E-11   16

Page 658: Analiza Numerica [Utm, Bostan v.]

TRAPEZOIDAL METHOD ERROR FORMULA

Theorem. Let f(x) have two continuous derivatives on the interval a ≤ x ≤ b. Then

  E_n^T(f) ≡ ∫_a^b f(x) dx - T_n(f) = - [ h^2 (b - a) / 12 ] f''(c_n)

for some c_n in the interval [a, b].

Later I will say something about the proof of this result, as it leads to some other useful formulas for the error.

The above formula says that the error decreases in a manner that is roughly proportional to h^2. Thus doubling n (and halving h) should cause the error to decrease by a factor of approximately 4. This is what we observed with a past example from the preceding section.

Page 659: Analiza Numerica [Utm, Bostan v.]

Example. Consider evaluating

  I = ∫_0^2 dx/(1 + x^2)

using the trapezoidal method T_n(f). How large should n be chosen in order to ensure that

  | E_n^T(f) | ≤ 5 × 10^{-6}

We begin by calculating the derivatives:

  f'(x) = -2x / (1 + x^2)^2,   f''(x) = (-2 + 6x^2) / (1 + x^2)^3

From a graph of f''(x),

  max_{0≤x≤2} | f''(x) | = 2

Recall that b - a = 2. Therefore,

  E_n^T(f) = - [ h^2 (b - a) / 12 ] f''(c_n)

  | E_n^T(f) | ≤ [ h^2 (2) / 12 ] · 2 = h^2/3

Page 660: Analiza Numerica [Utm, Bostan v.]

  E_n^T(f) = - [ h^2 (b - a) / 12 ] f''(c_n)

  | E_n^T(f) | ≤ [ 2h^2 / 12 ] · 2 = h^2/3

We bound | f''(c_n) | since we do not know c_n, and therefore we must assume the worst possible case, that which makes the error formula largest. That is what has been done above.

When do we have

  | E_n^T(f) | ≤ 5 × 10^{-6}     (1)

To ensure this, we choose h so small that

  h^2/3 ≤ 5 × 10^{-6}

This is equivalent to choosing h and n to satisfy

  h ≤ .003873,   n = 2/h ≥ 516.4

Thus n ≥ 517 will imply (1).

Page 661: Analiza Numerica [Utm, Bostan v.]

DERIVING THE ERROR FORMULA

There are two stages in deriving the error:

(1) Obtain the error formula for the case of a single subinterval (n = 1);

(2) Use this to obtain the general error formula given earlier.

For the trapezoidal method with only a single subinterval, we have

  ∫_α^{α+h} f(x) dx - (h/2) [ f(α) + f(α + h) ] = - (h^3/12) f''(c)

for some c in the interval [α, α + h].

A sketch of the derivation of this error formula is given in the problems.

Page 662: Analiza Numerica [Utm, Bostan v.]

Recall that the general trapezoidal rule T_n(f) was obtained by applying the simple trapezoidal rule to a subdivision of the original interval of integration. Recall defining and writing

  h = (b - a)/n,   x_j = a + j h,   j = 0, 1, ..., n

  I = ∫_{x_0}^{x_n} f(x) dx
    = ∫_{x_0}^{x_1} f(x) dx + ∫_{x_1}^{x_2} f(x) dx + ... + ∫_{x_{n-1}}^{x_n} f(x) dx

  I ≈ (h/2) [ f(x_0) + f(x_1) ] + (h/2) [ f(x_1) + f(x_2) ]
      + ... + (h/2) [ f(x_{n-2}) + f(x_{n-1}) ] + (h/2) [ f(x_{n-1}) + f(x_n) ]

Page 663: Analiza Numerica [Utm, Bostan v.]

Then the error

  E_n^T(f) ≡ ∫_a^b f(x) dx - T_n(f)

can be analyzed by adding together the errors over the subintervals [x_0, x_1], [x_1, x_2], ..., [x_{n-1}, x_n]. Recall

  ∫_α^{α+h} f(x) dx - (h/2) [ f(α) + f(α + h) ] = - (h^3/12) f''(c)

Then on [x_{j-1}, x_j],

  ∫_{x_{j-1}}^{x_j} f(x) dx - (h/2) [ f(x_{j-1}) + f(x_j) ] = - (h^3/12) f''(γ_j)

with x_{j-1} ≤ γ_j ≤ x_j, but otherwise γ_j unknown. Then combining these errors, we obtain

  E_n^T(f) = - (h^3/12) f''(γ_1) - ... - (h^3/12) f''(γ_n)

This formula can be further simplified, and we will do so in two ways.

Page 664: Analiza Numerica [Utm, Bostan v.]

Rewrite this error as

  E_n^T(f) = - (h^3 n / 12) [ ( f''(γ_1) + ... + f''(γ_n) ) / n ]

Denote the quantity inside the brackets by ζ_n. This number satisfies

  min_{a≤x≤b} f''(x) ≤ ζ_n ≤ max_{a≤x≤b} f''(x)

Since f''(x) is a continuous function (by original assumption), there must be some number c_n in [a, b] for which

  f''(c_n) = ζ_n

Recall also that hn = b - a. Then

  E_n^T(f) = - (h^3 n / 12) [ ( f''(γ_1) + ... + f''(γ_n) ) / n ]
           = - [ h^2 (b - a) / 12 ] f''(c_n)

This is the error formula given on the first slide.

Page 665: Analiza Numerica [Utm, Bostan v.]

AN ERROR ESTIMATE

We now obtain a way to estimate the error E_n^T(f). Return to the formula

  E_n^T(f) = - (h^3/12) f''(γ_1) - ... - (h^3/12) f''(γ_n)

and rewrite it as

  E_n^T(f) = - (h^2/12) [ f''(γ_1) h + ... + f''(γ_n) h ]

The quantity

  f''(γ_1) h + ... + f''(γ_n) h

is a Riemann sum for the integral

  ∫_a^b f''(x) dx = f'(b) - f'(a)

By this we mean

  lim_{n→∞} [ f''(γ_1) h + ... + f''(γ_n) h ] = ∫_a^b f''(x) dx

Page 666: Analiza Numerica [Utm, Bostan v.]

Thus

  f''(γ_1) h + ... + f''(γ_n) h ≈ f'(b) - f'(a)

for larger values of n. Combining this with the earlier error formula

  E_n^T(f) = - (h^2/12) [ f''(γ_1) h + ... + f''(γ_n) h ]

we have

  E_n^T(f) ≈ - (h^2/12) [ f'(b) - f'(a) ] ≡ Ẽ_n^T(f)

This is a computable estimate of the error in the numerical integration. It is called an asymptotic error estimate.

Page 667: Analiza Numerica [Utm, Bostan v.]

Example. Consider evaluating

  I(f) = ∫_0^π e^x cos x dx = -(e^π + 1)/2 ≐ -12.070346

In this case,

  f'(x) = e^x [ cos x - sin x ],   f''(x) = -2 e^x sin x

  max_{0≤x≤π} | f''(x) | = | f''(.75π) | = 14.921

Then

  E_n^T(f) = - [ h^2 (b - a) / 12 ] f''(c_n)

  | E_n^T(f) | ≤ (h^2 π / 12) · 14.921 = 3.906 h^2

Also

  Ẽ_n^T(f) = - (h^2/12) [ f'(π) - f'(0) ] = (h^2/12) [ e^π + 1 ] ≐ 2.012 h^2

Page 668: Analiza Numerica [Utm, Bostan v.]

  I(f) - T_n(f) ≈ - (h^2/12) [ f'(b) - f'(a) ]

  I(f) ≈ T_n(f) - (h^2/12) [ f'(b) - f'(a) ]

  CT_n(f) ≡ T_n(f) - (h^2/12) [ f'(b) - f'(a) ]

This is the corrected trapezoidal rule. It is easy to obtain from the trapezoidal rule, and in most cases, it converges more rapidly than the trapezoidal rule.

Table 3. Asymptotic error estimate and corrected trapezoidal rule applied to integral I^{(1)} from Example 1.

  n    I - T_n(f)   R   Ẽ_n(f)    I - CT_n(f)   R
  2    1.6E-2           1.5E-2    1.3E-4
  4    3.8E-3       4   3.8E-3    7.9E-6        15.8
  8    9.6E-4       4   9.6E-4    4.9E-7        16
  16   2.4E-4       4   2.4E-4    3.1E-8        16
  32   5.9E-5       4   5.9E-5    2.0E-9        16
  64   1.5E-5       4   1.5E-5    2.2E-10       16

Page 669: Analiza Numerica [Utm, Bostan v.]

SIMPSON'S RULE ERROR FORMULA

Recall the general Simpson's rule

  ∫_a^b f(x) dx ≈ S_n(f) ≡ (h/3) [ f(x_0) + 4f(x_1) + 2f(x_2)
      + 4f(x_3) + 2f(x_4) + ... + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n) ]

For its error, we have

  E_n^S(f) ≡ ∫_a^b f(x) dx - S_n(f) = - [ h^4 (b - a) / 180 ] f^{(4)}(c_n)

for some a ≤ c_n ≤ b, with c_n otherwise unknown. For an asymptotic error estimate,

  ∫_a^b f(x) dx - S_n(f) ≈ Ẽ_n^S(f) ≡ - (h^4/180) [ f'''(b) - f'''(a) ]

Page 670: Analiza Numerica [Utm, Bostan v.]

DISCUSSION

For Simpson's error formula, both formulas assume that the integrand f(x) has four continuous derivatives on the interval [a, b]. What happens when this is not valid? We return later to this question.

Both formulas also say the error should decrease by a factor of around 16 when n is doubled.

Compare these results with those for the trapezoidal rule error formulas:

  E_n^T(f) ≡ ∫_a^b f(x) dx - T_n(f) = - [ h^2 (b - a) / 12 ] f''(c_n)

  E_n^T(f) ≈ - (h^2/12) [ f'(b) - f'(a) ] ≡ Ẽ_n^T(f)

Page 671: Analiza Numerica [Utm, Bostan v.]

EXAMPLE

Consider evaluating

  I = ∫_0^2 dx/(1 + x^2)

using Simpson's rule S_n(f). How large should n be chosen in order to ensure that

  | E_n^S(f) | ≤ 5 × 10^{-6}

Begin by noting that

  f^{(4)}(x) = 24 (5x^4 - 10x^2 + 1) / (1 + x^2)^5

  max_{0≤x≤2} | f^{(4)}(x) | = f^{(4)}(0) = 24

Then

  E_n^S(f) = - [ h^4 (b - a) / 180 ] f^{(4)}(c_n)

  | E_n^S(f) | ≤ (h^4 · 2 / 180) · 24 = 4h^4/15

Page 672: Analiza Numerica [Utm, Bostan v.]

Then | E_n^S(f) | ≤ 5 × 10^{-6} is true if

  4h^4/15 ≤ 5 × 10^{-6}

  h ≤ .0658,   n ≥ 30.39

Therefore, choosing n ≥ 32 will give the desired error bound. Compare this with the earlier trapezoidal example in which n ≥ 517 was needed.

For the asymptotic error estimate, we have

  f'''(x) = -24x (x^2 - 1) / (1 + x^2)^4

  Ẽ_n^S(f) ≡ - (h^4/180) [ f'''(2) - f'''(0) ] = (h^4/180) · (144/625) = (4/3125) h^4

Page 673: Analiza Numerica [Utm, Bostan v.]

INTEGRATING √x

Consider the numerical approximation of

  ∫_0^1 √x dx = 2/3

In the following table, we give the errors when using both the trapezoidal and Simpson rules.

  n     E_n^T       Ratio   E_n^S       Ratio
  2     6.311E-2            2.860E-2
  4     2.338E-2    2.70    1.012E-2    2.82
  8     8.536E-3    2.74    3.587E-3    2.83
  16    3.085E-3    2.77    1.268E-3    2.83
  32    1.108E-3    2.78    4.485E-4    2.83
  64    3.959E-4    2.80    1.586E-4    2.83
  128   1.410E-4    2.81    5.606E-5    2.83

The rate of convergence is slower because the function f(x) = √x is not sufficiently differentiable on [0, 1]. Both methods converge with a rate proportional to h^{1.5}.

Page 674: Analiza Numerica [Utm, Bostan v.]

ASYMPTOTIC ERROR FORMULAS

If we have a numerical integration formula,

  ∫_a^b f(x) dx ≈ Σ_{j=0}^{n} w_j f(x_j)

let E_n(f) denote its error,

  E_n(f) = ∫_a^b f(x) dx - Σ_{j=0}^{n} w_j f(x_j)

We say another formula Ẽ_n(f) is an asymptotic error formula for this numerical integration if it satisfies

  lim_{n→∞} Ẽ_n(f) / E_n(f) = 1

Equivalently,

  lim_{n→∞} [ E_n(f) - Ẽ_n(f) ] / E_n(f) = 0

These conditions say that Ẽ_n(f) looks increasingly like E_n(f) as n increases, and thus

  E_n(f) ≈ Ẽ_n(f)

Page 675: Analiza Numerica [Utm, Bostan v.]

Example. For the trapezoidal rule,

  E_n^T(f) ≈ Ẽ_n^T(f) ≡ - (h^2/12) [ f'(b) - f'(a) ]

This assumes f(x) has two continuous derivatives on the interval [a, b].

Example. For Simpson's rule,

  E_n^S(f) ≈ Ẽ_n^S(f) ≡ - (h^4/180) [ f'''(b) - f'''(a) ]

This assumes f(x) has four continuous derivatives on the interval [a, b].

Note that both of these formulas can be written in an equivalent form as

  Ẽ_n(f) = c / n^p

for appropriate constant c and exponent p. With the trapezoidal rule, p = 2 and

  c = - [ (b - a)^2 / 12 ] [ f'(b) - f'(a) ]

and for Simpson's rule, p = 4 with a suitable c.

Page 676: Analiza Numerica [Utm, Bostan v.]

The formula

  Ẽ_n(f) = c / n^p     (2)

occurs for many other numerical integration formulas that we have not yet defined or studied. In addition, if we use the trapezoidal or Simpson rules with an integrand f(x) which is not sufficiently differentiable, then (2) may hold with an exponent p that is less than the ideal.

Example. Consider

  I = ∫_0^1 x^β dx

in which -1 < β < 1, β ≠ 0. Then the convergence of the trapezoidal rule can be shown to have an asymptotic error formula

  E_n ≈ Ẽ_n = c / n^{β+1}     (3)

for some constant c dependent on β. A similar result holds for Simpson's rule, with -1 < β < 3, β not an integer. We can actually specify a formula for c; but the formula is often less important than knowing that (2) is valid for some c.

Page 677: Analiza Numerica [Utm, Bostan v.]

APPLICATION OF ASYMPTOTIC ERROR FORMULAS

Assume we know that an asymptotic error formula

  I - I_n ≈ c / n^p

is valid for some numerical integration rule denoted by I_n. Initially, assume we know the exponent p. Then imagine calculating both I_n and I_{2n}. With I_{2n}, we have

  I - I_{2n} ≈ c / (2^p n^p)

This leads to

  I - I_n ≈ 2^p [ I - I_{2n} ]

  I ≈ ( 2^p I_{2n} - I_n ) / ( 2^p - 1 ) = I_{2n} + ( I_{2n} - I_n ) / ( 2^p - 1 )

The formula

  I ≈ I_{2n} + ( I_{2n} - I_n ) / ( 2^p - 1 )     (4)

is called Richardson's extrapolation formula.

Page 678: Analiza Numerica [Utm, Bostan v.]

Example. With the trapezoidal rule and with the integrand f(x) having two continuous derivatives,

  I ≈ T_{2n} + (1/3) [ T_{2n} - T_n ]

Example. With Simpson's rule and with the integrand f(x) having four continuous derivatives,

  I ≈ S_{2n} + (1/15) [ S_{2n} - S_n ]

We can also use the formula (2) to obtain error estimation formulas:

  I - I_{2n} ≈ ( I_{2n} - I_n ) / ( 2^p - 1 )     (5)

This is called Richardson's error estimate. For example, with the trapezoidal rule,

  I - T_{2n} ≈ (1/3) [ T_{2n} - T_n ]

These formulas are illustrated for the trapezoidal rule in an accompanying table, for

  ∫_0^π e^x cos x dx = -(e^π + 1)/2 ≐ -12.07034632
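As an illustration, a short MATLAB sketch combining the trap function from the earlier sketch with Richardson extrapolation (p = 2 for the trapezoidal rule; the variable names are ours):

  f = @(x) exp(x) .* cos(x);
  Tn  = trap(f, 0, pi, 64);
  T2n = trap(f, 0, pi, 128);
  I_extrap = T2n + (T2n - Tn) / 3;   % Richardson extrapolation, p = 2
  err_est  = (T2n - Tn) / 3;         % Richardson's error estimate for T2n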

Page 679: Analiza Numerica [Utm, Bostan v.]

AITKEN EXTRAPOLATION

In this case, we again assume

  I - I_n ≈ c / n^p

But in contrast to previously, we do not know either c or p. Imagine computing I_n, I_{2n}, and I_{4n}. Then

  I - I_n ≈ c / n^p
  I - I_{2n} ≈ c / (2^p n^p)
  I - I_{4n} ≈ c / (4^p n^p)

We can directly try to estimate I. Dividing,

  ( I - I_n ) / ( I - I_{2n} ) ≈ 2^p ≈ ( I - I_{2n} ) / ( I - I_{4n} )

Solving for I, we obtain

  ( I - I_{2n} )^2 ≈ ( I - I_n )( I - I_{4n} )

  I ( I_n + I_{4n} - 2 I_{2n} ) ≈ I_n I_{4n} - I_{2n}^2

  I ≈ ( I_n I_{4n} - I_{2n}^2 ) / ( I_n + I_{4n} - 2 I_{2n} )

Page 680: Analiza Numerica [Utm, Bostan v.]

This can be improved computationally, to avoid loss of significance errors:

  I ≈ I_{4n} + [ ( I_n I_{4n} - I_{2n}^2 ) / ( I_n + I_{4n} - 2 I_{2n} ) - I_{4n} ]

    = I_{4n} - ( I_{4n} - I_{2n} )^2 / [ ( I_{4n} - I_{2n} ) - ( I_{2n} - I_n ) ]

This is called Aitken's extrapolation formula.

To estimate p, we use

  ( I_{2n} - I_n ) / ( I_{4n} - I_{2n} ) ≈ 2^p

To see this, write

  ( I_{2n} - I_n ) / ( I_{4n} - I_{2n} ) = [ ( I - I_n ) - ( I - I_{2n} ) ] / [ ( I - I_{2n} ) - ( I - I_{4n} ) ]

Then substitute from the following and simplify:

  I - I_n ≈ c / n^p
  I - I_{2n} ≈ c / (2^p n^p)
  I - I_{4n} ≈ c / (4^p n^p)

Page 681: Analiza Numerica [Utm, Bostan v.]

Example. Consider the following table of numerical integrals. What is its order of convergence?

  n    I_n            I_n - I_{n/2}   Ratio
  2    .28451779686
  4    .28559254576   1.075E-3
  8    .28570248748   1.099E-4        9.78
  16   .28571317731   1.069E-5        10.28
  32   .28571418363   1.006E-6        10.62
  64   .28571427643   9.280E-8        10.84

It appears

  2^p ≐ 10.84,   p ≐ log_2 10.84 = 3.44

We could now combine this with Richardson's error formula to estimate the error:

  I - I_n ≈ [ 1 / (2^p - 1) ] [ I_n - I_{n/2} ]

For example,

  I - I_64 ≈ (1/9.84) [ 9.280E-8 ] = 9.43E-9

Page 682: Analiza Numerica [Utm, Bostan v.]

PERIODIC FUNCTIONS

A function f(x) is periodic if the following condition

is satisfied. There is a smallest real number τ > 0 for

which

  f(x + τ) = f(x),   -∞ < x < ∞     (6)

The number τ is called the period of the function

f(x). The constant function f(x) ≡ 1 is also consid-ered periodic, but it satisfies this condition with any

τ > 0. Basically, a periodic function is one which

repeats itself over intervals of length τ .

The condition (6) implies

  f^{(m)}(x + τ) = f^{(m)}(x),   -∞ < x < ∞     (7)

for the m-th derivative of f(x), provided there is such a derivative. Thus the derivatives are also periodic.

Periodic functions occur very frequently in applica-

tions of mathematics, reflecting the periodicity of many

phenomena in the physical world.

Page 683: Analiza Numerica [Utm, Bostan v.]

PERIODIC INTEGRANDS

Consider the special class of integrals

  I(f) = ∫_a^b f(x) dx

in which f(x) is periodic, with b - a an integer multiple of the period τ for f(x). In this case, the performance of the trapezoidal rule and other numerical integration rules is much better than that predicted by earlier error formulas.

To hint at this improved performance, recall

  ∫_a^b f(x) dx - T_n(f) ≈ Ẽ_n(f) ≡ - (h^2/12) [ f'(b) - f'(a) ]

With our assumption on the periodicity of f(x), we have

  f(a) = f(b),   f'(a) = f'(b)

Therefore, Ẽ_n(f) = 0

Page 684: Analiza Numerica [Utm, Bostan v.]

and we should expect improved performance in the convergence behaviour of the trapezoidal sums T_n(f).

If in addition to being periodic on [a, b], the integrand f(x) also has m continuous derivatives, then it can be shown that

  I(f) - T_n(f) = c / n^m + smaller terms

By "smaller terms", we mean terms which decrease to zero more rapidly than n^{-m}.

Thus if f(x) is periodic with b - a an integer multiple of the period τ for f(x), and if f(x) is infinitely differentiable, then the error I - T_n decreases to zero more rapidly than n^{-m} for any m > 0. For periodic integrands, the trapezoidal rule is an optimal numerical integration method.

Page 685: Analiza Numerica [Utm, Bostan v.]

Example. Consider evaluating

  I = ∫_0^{2π} sin x dx / (1 + e^{sin x})

Using the trapezoidal rule, we have the results in the following table. In this case, the formulas based on Richardson extrapolation are no longer valid.

  n    T_n                  T_n - T_{n/2}
  2     0.0
  4    -0.72589193317292    -7.259E-1
  8    -0.74006131211583    -1.417E-2
  16   -0.74006942337672    -8.111E-6
  32   -0.74006942337946    -2.746E-12
  64   -0.74006942337946     0.0

Page 686: Analiza Numerica [Utm, Bostan v.]

NUMERICAL INTEGRATION: ANOTHER APPROACH

We look for numerical integration formulas

  ∫_{-1}^{1} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

which are to be exact for polynomials of as large a degree as possible. There are no restrictions placed on the nodes {x_j} or the weights {w_j} in working towards that goal. The motivation is that if it is exact for high degree polynomials, then perhaps it will be very accurate when integrating functions that are well approximated by polynomials.

There is no guarantee that such an approach will work. In fact, it turns out to be a bad idea when the node points {x_j} are required to be evenly spaced over the interval of integration. But without this restriction on {x_j} we are able to develop a very accurate set of quadrature formulas.

Page 687: Analiza Numerica [Utm, Bostan v.]

The case n = 1. We want a formula

  w_1 f(x_1) ≈ ∫_{-1}^{1} f(x) dx

The weight w_1 and the node x_1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. To do this we substitute f(x) = 1 and f(x) = x. The first choice leads to

  w_1 · 1 = ∫_{-1}^{1} 1 dx,   so   w_1 = 2

The choice f(x) = x leads to

  w_1 x_1 = ∫_{-1}^{1} x dx = 0,   so   x_1 = 0

The desired formula is

  ∫_{-1}^{1} f(x) dx ≈ 2 f(0)

It is called the midpoint rule.

Page 688: Analiza Numerica [Utm, Bostan v.]

The case n = 2. We want a formula

  w_1 f(x_1) + w_2 f(x_2) ≈ ∫_{-1}^{1} f(x) dx

The weights w_1, w_2 and the nodes x_1, x_2 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. We substitute and force equality for

  f(x) = 1, x, x^2, x^3

This leads to the system

  w_1 + w_2 = ∫_{-1}^{1} 1 dx = 2

  w_1 x_1 + w_2 x_2 = ∫_{-1}^{1} x dx = 0

  w_1 x_1^2 + w_2 x_2^2 = ∫_{-1}^{1} x^2 dx = 2/3

  w_1 x_1^3 + w_2 x_2^3 = ∫_{-1}^{1} x^3 dx = 0

The solution is given by

  w_1 = w_2 = 1,   x_1 = -1/√3,   x_2 = 1/√3

Page 689: Analiza Numerica [Utm, Bostan v.]

This yields the formula

  ∫_{-1}^{1} f(x) dx ≈ f(-1/√3) + f(1/√3)     (1)

We say it has degree of precision equal to 3 since it integrates exactly all polynomials of degree ≤ 3. We can verify directly that it does not integrate exactly f(x) = x^4:

  ∫_{-1}^{1} x^4 dx = 2/5,   f(-1/√3) + f(1/√3) = 2/9

Thus (1) has degree of precision exactly 3.

EXAMPLE. Integrate

  ∫_{-1}^{1} dx/(3 + x) = log 2 ≐ 0.69314718

The formula (1) yields

  1/(3 + x_1) + 1/(3 + x_2) = 0.69230769

  Error = .000839
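A sketch of this two-point Gauss rule in MATLAB, applied to the example integral:

  f = @(x) 1 ./ (3 + x);
  x = [-1, 1] / sqrt(3);     % Gauss-Legendre nodes for n = 2
  w = [1, 1];                % corresponding weights
  I2 = w * f(x).';           % 0.69230769...
  err = log(2) - I2;         % about 8.4E-4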

Page 690: Analiza Numerica [Utm, Bostan v.]

THE GENERAL CASE

We want to find the weights {w_i} and nodes {x_i} so as to have

  ∫_{-1}^{1} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

be exact for polynomials f(x) of as large a degree as possible. As unknowns, there are n weights w_i and n nodes x_i. Thus it makes sense to initially impose 2n conditions so as to obtain 2n equations for the 2n unknowns. We require the quadrature formula to be exact for the cases

  f(x) = x^i,   i = 0, 1, 2, ..., 2n-1

Then we obtain the system of equations

  w_1 x_1^i + w_2 x_2^i + ... + w_n x_n^i = ∫_{-1}^{1} x^i dx

for i = 0, 1, 2, ..., 2n-1. For the right sides,

  ∫_{-1}^{1} x^i dx = 2/(i + 1)  for i = 0, 2, ..., 2n-2;   = 0  for i = 1, 3, ..., 2n-1

Page 691: Analiza Numerica [Utm, Bostan v.]

The system of equations

  w_1 x_1^i + ... + w_n x_n^i = ∫_{-1}^{1} x^i dx,   i = 0, ..., 2n-1

has a solution, and the solution is unique except for re-ordering the unknowns. The resulting numerical integration rule is called Gaussian quadrature.

In fact, the nodes and weights are not found by solving this system. Rather, the nodes and weights have other properties which enable them to be found more easily by other methods. There are programs to produce them; and most subroutine libraries have either a program to produce them or tables of them for commonly used cases.

Page 692: Analiza Numerica [Utm, Bostan v.]

CHANGE OF INTERVAL OF INTEGRATION

Integrals on other finite intervals [a, b] can be converted to integrals over [-1, 1], as follows:

  ∫_a^b F(x) dx = [(b - a)/2] ∫_{-1}^{1} F( (b + a + t(b - a))/2 ) dt

based on the change of integration variables

  x = (b + a + t(b - a))/2,   -1 ≤ t ≤ 1

EXAMPLE. Over the interval [0, π], use

  x = (1 + t) π/2

Then

  ∫_0^π F(x) dx = (π/2) ∫_{-1}^{1} F( (1 + t) π/2 ) dt

Page 693: Analiza Numerica [Utm, Bostan v.]

AN ERROR FORMULA

The usual error formula for the Gaussian quadrature formula,

  E_n(f) = ∫_{-1}^{1} f(x) dx - Σ_{j=1}^{n} w_j f(x_j)

is not particularly intuitive. It is given by

  E_n(f) = e_n f^{(2n)}(c_n) / (2n)!

  e_n = 2^{2n+1} (n!)^4 / [ (2n + 1) ((2n)!)^2 ] ≈ π / 4^n

for some a ≤ c_n ≤ b.

To help in understanding the implications of this error formula, introduce

  M_k = max_{-1≤x≤1} | f^{(k)}(x) | / k!

Page 694: Analiza Numerica [Utm, Bostan v.]

With many integrands f(x), this sequence {M_k} is bounded or even decreases to zero. For example,

  f(x) = cos x  ⇒  M_k ≤ 1/k!;      f(x) = 1/(2 + x)  ⇒  M_k ≤ 1

Then for our error formula,

  E_n(f) = e_n f^{(2n)}(c_n) / (2n)!

  | E_n(f) | ≤ e_n M_{2n}     (2)

By other methods, we can show

  e_n ≈ π / 4^n

When combined with (2) and an assumption of uniform boundedness for {M_k}, we have that the error decreases by a factor of at least 4 with each increase of n to n + 1. Compare this to the convergence of the trapezoidal and Simpson rules for such functions, to help explain the very rapid convergence of Gaussian quadrature.

Page 695: Analiza Numerica [Utm, Bostan v.]

A SECOND ERROR FORMULA

Let f(x) be continuous for a ≤ x ≤ b; let n ≥ 1. Then, for the Gaussian numerical integration formula

  I ≡ ∫_a^b f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j) ≡ I_n

on [a, b], the error in I_n satisfies

  | I(f) - I_n(f) | ≤ 2 (b - a) ρ_{2n-1}(f)     (3)

Here ρ_{2n-1}(f) is the minimax error of degree 2n-1 for f(x) on [a, b]:

  ρ_m(f) = min_{deg(p)≤m} [ max_{a≤x≤b} | f(x) - p(x) | ],   m ≥ 0

Page 696: Analiza Numerica [Utm, Bostan v.]

EXAMPLE. Let f(x) = e^{-x^2}. Then the minimax errors ρ_m(f) are given in the following table.

  m    ρ_m(f)     m     ρ_m(f)
  1    5.30E-2    6     7.82E-6
  2    1.79E-2    7     4.62E-7
  3    6.63E-4    8     9.64E-8
  4    4.63E-4    9     8.05E-9
  5    1.62E-5    10    9.16E-10

Using this table, apply (3) to

  I = ∫_0^1 e^{-x^2} dx

For n = 3, (3) implies

  | I - I_3 | ≤ 2 ρ_5( e^{-x^2} ) ≐ 3.24 × 10^{-5}

The actual error is 9.55E-6.

Page 697: Analiza Numerica [Utm, Bostan v.]

INTEGRATING A NON-SMOOTH INTEGRAND

Consider using Gaussian quadrature to evaluate

  I = ∫_0^1 √x dx = 2/3

  n     I - I_n     Ratio
  2     -7.22E-3
  4     -1.16E-3    6.2
  8     -1.69E-4    6.9
  16    -2.30E-5    7.4
  32    -3.00E-6    7.6
  64    -3.84E-7    7.8

The column labeled Ratio is defined by

  ( I - I_{n/2} ) / ( I - I_n )

It is consistent with I - I_n ≈ c/n^3, which can be proven theoretically. In comparison, for the trapezoidal and Simpson rules, I - I_n ≈ c/n^{1.5}.

Page 698: Analiza Numerica [Utm, Bostan v.]

WEIGHTED GAUSSIAN QUADRATURE

Consider needing to evaluate integrals such as

  ∫_0^1 f(x) log x dx,   ∫_0^1 x^{1/3} f(x) dx

How do we proceed? Consider numerical integration formulas

  ∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

in which f(x) is considered a "nice" function (one with several continuous derivatives). The function w(x) is allowed to be singular, but must be integrable. We assume here that [a, b] is a finite interval. The function w(x) is called a "weight function", and it is implicitly absorbed into the definition of the quadrature weights {w_j}. We again determine the nodes {x_j} and weights {w_j} so as to make the integration formula exact for f(x) a polynomial of as large a degree as possible.

Page 699: Analiza Numerica [Utm, Bostan v.]

The resulting numerical integration formula

  ∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

is called a Gaussian quadrature formula with weight function w(x). We determine the nodes {x_j} and weights {w_j} by requiring exactness in the above formula for

  f(x) = x^i,   i = 0, 1, 2, ..., 2n-1

To make the derivation more understandable, we consider the particular case

  ∫_0^1 x^{1/3} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

We follow the same pattern as used earlier.

Page 700: Analiza Numerica [Utm, Bostan v.]

The case n = 1. We want a formula

  w_1 f(x_1) ≈ ∫_0^1 x^{1/3} f(x) dx

The weight w_1 and the node x_1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. Choosing f(x) = 1, we have

  w_1 = ∫_0^1 x^{1/3} dx = 3/4

Choosing f(x) = x, we have

  w_1 x_1 = ∫_0^1 x^{1/3} · x dx = 3/7,   so   x_1 = 4/7

Thus

  ∫_0^1 x^{1/3} f(x) dx ≈ (3/4) f(4/7)

has degree of precision 1.

Page 701: Analiza Numerica [Utm, Bostan v.]

The case n = 2. We want a formula

  w_1 f(x_1) + w_2 f(x_2) ≈ ∫_0^1 x^{1/3} f(x) dx

The weights w_1, w_2 and the nodes x_1, x_2 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. We determine them by requiring equality for

  f(x) = 1, x, x^2, x^3

This leads to the system

  w_1 + w_2 = ∫_0^1 x^{1/3} dx = 3/4

  w_1 x_1 + w_2 x_2 = ∫_0^1 x · x^{1/3} dx = 3/7

  w_1 x_1^2 + w_2 x_2^2 = ∫_0^1 x^2 x^{1/3} dx = 3/10

  w_1 x_1^3 + w_2 x_2^3 = ∫_0^1 x^3 x^{1/3} dx = 3/13

Page 702: Analiza Numerica [Utm, Bostan v.]

The solution is

  x_1 = 7/13 - (3/65)√35,   x_2 = 7/13 + (3/65)√35

  w_1 = 3/8 - (3/392)√35,   w_2 = 3/8 + (3/392)√35

Numerically,

  x_1 = .2654117024,   x_2 = .8115113746
  w_1 = .3297238792,   w_2 = .4202761208

The formula

  ∫_0^1 x^{1/3} f(x) dx ≈ w_1 f(x_1) + w_2 f(x_2)     (4)

has degree of precision 3.

Page 703: Analiza Numerica [Utm, Bostan v.]

EXAMPLE. Consider evaluating the integral

  ∫_0^1 x^{1/3} cos x dx     (5)

In applying (4), we take f(x) = cos x. Then

  w_1 f(x_1) + w_2 f(x_2) = 0.6074977951

The true answer is

  ∫_0^1 x^{1/3} cos x dx ≐ 0.6076257393

and our numerical answer is in error by E_2 ≐ .000128. This is quite a good answer involving very little computational effort (once the formula has been determined). In contrast, the trapezoidal and Simpson rules applied to (5) would converge very slowly because the first derivative of the integrand is singular at the origin.

Page 704: Analiza Numerica [Utm, Bostan v.]

CHANGE OF VARIABLES

As a side note to the preceding example, we observe that the change of variables x = t^3 transforms the integral (5) to

  3 ∫_0^1 t^3 cos(t^3) dt

and both the trapezoidal and Simpson rules will perform better with this formula, although still not as good as our weighted Gaussian quadrature.

A change of the integration variable can often improve the performance of a standard method, usually by increasing the differentiability of the integrand.

EXAMPLE. Using x = t^r for some r > 1, we have

  ∫_0^1 g(x) log x dx = r^2 ∫_0^1 t^{r-1} g(t^r) log t dt

The new integrand is generally smoother than the original one.


INTERPOLATION

Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).

As an example, consider defining

x0 = 0,   x1 = π/4,   x2 = π/2

and

yi = cos xi,   i = 0, 1, 2

This gives us the three points

(0, 1),   (π/4, 1/√2),   (π/2, 0)

Now find a quadratic polynomial

p(x) = a0 + a1 x + a2 x^2

for which

p(xi) = yi,   i = 0, 1, 2

The graph of this polynomial is shown in the accompanying graph. We later give an explicit formula.
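A short MATLAB sketch of this example, using the built-in polyfit to produce the interpolating quadratic (a degree-2 fit through exactly three points is the interpolant):

xi = [0  pi/4  pi/2];
yi = cos(xi);
a  = polyfit(xi, yi, 2)     % coefficients of p(x), highest power first
polyval(a, pi/4)            % reproduces cos(pi/4) = 1/sqrt(2)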

[Figure: quadratic interpolation of cos(x) on [0, π/2], showing y = cos(x) and y = p2(x).]

PURPOSES OF INTERPOLATION

1. Replace a set of data points {(xi, yi)} with a function given analytically.

2. Approximate functions with simpler ones, usually polynomials or "piecewise polynomials".

Purpose #1 has several aspects.

• The data may be from a known class of functions. Interpolation is then used to find the member of this class of functions that agrees with the given data. For example, data may be generated from functions of the form

p(x) = a0 + a1 e^x + a2 e^(2x) + · · · + an e^(nx)

Then we need to find the coefficients {aj} based on the given data values.

• We may want to take function values f(x) given in a table for selected values of x, often equally spaced, and extend the function to values of x not in the table. For example, given numbers from a table of logarithms, estimate the logarithm of a number x not in the table.

• Given a set of data points {(xi, yi)}, find a curve passing through these points that is "pleasing to the eye". In fact, this is what is done continually with computer graphics. How do we connect a set of points to make a smooth curve? Connecting them with straight line segments will often give a curve with many corners, whereas what was intended was a smooth curve.

Purpose #2 for interpolation is to approximate functions f(x) by simpler functions p(x), perhaps to make it easier to integrate or differentiate f(x). That will be the primary reason for studying interpolation in this course.

As an example of why this is important, consider the problem of evaluating

I = ∫_0^1 dx / (1 + x^10)

This is very difficult to do analytically. But we will look at producing polynomial interpolants of the integrand; and polynomials are easily integrated exactly.

We begin by using polynomials as our means of doing interpolation. Later in the chapter, we consider more complex "piecewise polynomial" functions, often called "spline functions".

LINEAR INTERPOLATION

The simplest form of interpolation is probably the straight line, connecting two points by a straight line.

Let two data points (x0, y0) and (x1, y1) be given. There is a unique straight line passing through these points. We can write the formula for a straight line as

P1(x) = a0 + a1 x

In fact, there are other more convenient ways to write it, and we give several of them below.

P1(x) = [(x − x1)/(x0 − x1)] y0 + [(x − x0)/(x1 − x0)] y1
      = [(x1 − x) y0 + (x − x0) y1] / (x1 − x0)
      = y0 + [(x − x0)/(x1 − x0)] [y1 − y0]
      = y0 + [(y1 − y0)/(x1 − x0)] (x − x0)

Check each of these by evaluating them at x = x0 and x1 to see if the respective values are y0 and y1.

Example. Following is a table of values of f(x) = tan x for a few values of x.

x      1        1.1      1.2      1.3
tan x  1.5574   1.9648   2.5722   3.6021

Use linear interpolation to estimate tan(1.15). We use

x0 = 1.1,   x1 = 1.2

with the corresponding values for y0 and y1. Then

tan x ≈ y0 + [(x − x0)/(x1 − x0)] [y1 − y0]

tan(1.15) ≈ 1.9648 + [(1.15 − 1.1)/(1.2 − 1.1)] [2.5722 − 1.9648] = 2.2685

The true value is tan 1.15 = 2.2345. We will want to examine formulas for the error in interpolation, to know when we have sufficient accuracy in our interpolant.
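The same estimate can be produced with MATLAB's interp1, whose default method is exactly this piecewise linear interpolation:

x = [1 1.1 1.2 1.3];
y = [1.5574 1.9648 2.5722 3.6021];
interp1(x, y, 1.15)    % 2.2685
tan(1.15)              % true value, 2.2345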

[Figure: y = tan(x) on [1, 1.3], and the linear interpolant y = p1(x) on [1.1, 1.2].]

QUADRATIC INTERPOLATION

We want to find a polynomial

P2(x) = a0 + a1 x + a2 x^2

which satisfies

P2(xi) = yi,   i = 0, 1, 2

for given data points (x0, y0), (x1, y1), (x2, y2). One formula for such a polynomial follows:

P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)   (∗∗)

with

L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]

The formula (∗∗) is called Lagrange's form of the interpolation polynomial.

LAGRANGE BASIS FUNCTIONS

The functions

L0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)]
L1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)]
L2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)]

are called "Lagrange basis functions" for quadratic interpolation. They have the properties

Li(xj) = 1 if i = j,   0 if i ≠ j

for i, j = 0, 1, 2. Also, they all have degree 2. Their graphs are on an accompanying page.

As a consequence of each Li(x) being of degree 2, we have that the interpolant

P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)

must have degree ≤ 2.

UNIQUENESS

Can there be another polynomial, call it Q(x), for which

deg(Q) ≤ 2,   Q(xi) = yi,   i = 0, 1, 2

That is, is the Lagrange formula P2(x) unique?

Introduce

R(x) = P2(x) − Q(x)

From the properties of P2 and Q, we have deg(R) ≤ 2. Moreover,

R(xi) = P2(xi) − Q(xi) = yi − yi = 0

for all three node points x0, x1, and x2. How many polynomials R(x) are there of degree at most 2 and having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore

R(x) = 0 for all x
Q(x) = P2(x) for all x

SPECIAL CASES

Consider the data points

(x0, 1), (x1, 1), (x2, 1)

What is the polynomial P2(x) in this case?

Answer: We must have

P2(x) ≡ 1

meaning that P2(x) is the constant function. Why? First, the constant function satisfies the property of being of degree ≤ 2. Next, it clearly interpolates the given data. Therefore, by the uniqueness of quadratic interpolation, P2(x) must be the constant function 1.

Consider now the data points

(x0, m x0), (x1, m x1), (x2, m x2)

for some constant m. What is P2(x) in this case? By an argument similar to that above,

P2(x) = m x for all x

Thus the degree of P2(x) can be less than 2.

HIGHER DEGREE INTERPOLATION

We consider now the case of interpolation by polynomials of a general degree n. We want to find a polynomial Pn(x) for which

deg(Pn) ≤ n,   Pn(xi) = yi,   i = 0, 1, ..., n   (∗∗)

with given data points

(x0, y0), (x1, y1), ..., (xn, yn)

The solution is given by Lagrange's formula

Pn(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x)

The Lagrange basis functions are given by

Lk(x) = [(x − x0) · · · (x − xk−1)(x − xk+1) · · · (x − xn)] / [(xk − x0) · · · (xk − xk−1)(xk − xk+1) · · · (xk − xn)]

for k = 0, 1, 2, ..., n. The quadratic case was covered earlier.

In a manner analogous to the quadratic case, we can show that the above Pn(x) is the only solution to the problem (∗∗).

In the formula

Lk(x) = [(x − x0) · · · (x − xk−1)(x − xk+1) · · · (x − xn)] / [(xk − x0) · · · (xk − xk−1)(xk − xk+1) · · · (xk − xn)]

we can see that each such function is a polynomial of degree n. In addition,

Lk(xi) = 1 if k = i,   0 if k ≠ i

Using these properties, it follows that the formula

Pn(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x)

satisfies the interpolation problem of finding a solution to

deg(Pn) ≤ n,   Pn(xi) = yi,   i = 0, 1, ..., n
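A minimal MATLAB sketch of this formula, evaluating Pn at the points t directly from the Lagrange form (the nodes x are assumed distinct; the function name is ours, saved as lagrange_eval.m):

function p = lagrange_eval(x, y, t)
% Evaluate the Lagrange interpolation polynomial with nodes x
% and data values y at the points t.
n = length(x);
p = zeros(size(t));
for k = 1:n
    Lk = ones(size(t));                  % build L_k(t)
    for i = [1:k-1, k+1:n]
        Lk = Lk .* (t - x(i)) / (x(k) - x(i));
    end
    p = p + y(k) * Lk;
end
end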

EXAMPLE

Recall the table

x      1        1.1      1.2      1.3
tan x  1.5574   1.9648   2.5722   3.6021

We now interpolate this table with the nodes

x0 = 1,   x1 = 1.1,   x2 = 1.2,   x3 = 1.3

Without giving the details of the evaluation process, we have the following results for interpolation with degrees n = 1, 2, 3.

n          1        2        3
Pn(1.15)   2.2685   2.2435   2.2296
Error      −.0340   −.0090   .0049

It improves with increasing degree n, but not at a very rapid rate. In fact, the error becomes worse when n is increased further. Later we will see that interpolation of a much higher degree, say n ≥ 10, is often poorly behaved when the node points {xi} are evenly spaced.

A FIRST ORDER DIVIDED DIFFERENCE

For a given function f(x) and two distinct points x0 and x1, define

f[x0, x1] = (f(x1) − f(x0)) / (x1 − x0)

This is called a first order divided difference of f(x). By the Mean Value Theorem,

f(x1) − f(x0) = f′(c) (x1 − x0)

for some c between x0 and x1. Thus

f[x0, x1] = f′(c)

and the divided difference is very much like the derivative, especially if x0 and x1 are quite close together. In fact,

f′((x1 + x0)/2) ≈ f[x0, x1]

is quite an accurate approximation of the derivative (see §5.4).

SECOND ORDER DIVIDED DIFFERENCES

Given three distinct points x0, x1, and x2, define

f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)

This is called the second order divided difference of f(x).

By a fairly complicated argument, we can show

f[x0, x1, x2] = (1/2) f″(c)

for some c intermediate to x0, x1, and x2. In fact, as we investigate in §5.4,

f″(x1) ≈ 2 f[x0, x1, x2]

in the case the nodes are evenly spaced,

x1 − x0 = x2 − x1

EXAMPLE

Consider the table

x      1        1.1      1.2      1.3      1.4
cos x  .54030   .45360   .36236   .26750   .16997

Let x0 = 1, x1 = 1.1, and x2 = 1.2. Then

f[x0, x1] = (.45360 − .54030) / (1.1 − 1) = −.86700
f[x1, x2] = (.36236 − .45360) / (1.2 − 1.1) = −.91240

f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)
             = (−.91240 − (−.86700)) / (1.2 − 1.0) = −.22700

For comparison,

f′((x1 + x0)/2) = −sin(1.05) = −.86742
(1/2) f″(x1) = −(1/2) cos(1.1) = −.22680

GENERAL DIVIDED DIFFERENCES

Given n + 1 distinct points x0, ..., xn, with n ≥ 2, define

f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)

This is a recursive definition of the nth-order divided difference of f(x), using divided differences of order n − 1. Its relation to the derivative is as follows:

f[x0, ..., xn] = (1/n!) f^(n)(c)

for some c intermediate to the points {x0, ..., xn}. Let I denote the interval

I = [min{x0, ..., xn}, max{x0, ..., xn}]

Then c ∈ I, and the above result is based on the assumption that f(x) is n-times continuously differentiable on the interval I.

EXAMPLE

The following table gives divided differences for the data in

x      1        1.1      1.2      1.3      1.4
cos x  .54030   .45360   .36236   .26750   .16997

For the column headings, we use

Dkf(xi) = f[xi, ..., xi+k]

i   xi    f(xi)    Df(xi)   D2f(xi)   D3f(xi)   D4f(xi)
0   1.0   .54030   −.8670   −.2270    .1533     .0125
1   1.1   .45360   −.9124   −.1810    .1583
2   1.2   .36236   −.9486   −.1335
3   1.3   .26750   −.9753
4   1.4   .16997

These were computed using the recursive definition

f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)
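A minimal MATLAB sketch that builds this table column by column from the recursive definition (with MATLAB's 1-based indexing, D(i, k+1) holds f[x(i), ..., x(i+k)]):

x = [1 1.1 1.2 1.3 1.4];
f = [.54030 .45360 .36236 .26750 .16997];
n = length(x);
D = zeros(n, n);
D(:, 1) = f(:);                  % zeroth-order column: f(xi)
for k = 2:n
    for i = 1:n-k+1
        D(i, k) = (D(i+1, k-1) - D(i, k-1)) / (x(i+k-1) - x(i));
    end
end
D(1, :)    % f(x0), f[x0,x1], f[x0,x1,x2], ... : the top row of the table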

ORDER OF THE NODES

Looking at f[x0, x1], we have

f[x0, x1] = (f(x1) − f(x0)) / (x1 − x0) = (f(x0) − f(x1)) / (x0 − x1) = f[x1, x0]

The order of x0 and x1 does not matter. Looking at

f[x0, x1, x2] = (f[x1, x2] − f[x0, x1]) / (x2 − x0)

we can expand it to get

f[x0, x1, x2] = f(x0) / [(x0 − x1)(x0 − x2)] + f(x1) / [(x1 − x0)(x1 − x2)] + f(x2) / [(x2 − x0)(x2 − x1)]

With this formula, we can show that the order of the arguments x0, x1, x2 does not matter in the final value of f[x0, x1, x2] we obtain. Mathematically,

f[x0, x1, x2] = f[xi0, xi1, xi2]

for any permutation (i0, i1, i2) of (0, 1, 2).

We can show in general that the value of f[x0, ..., xn] is independent of the order of the arguments {x0, ..., xn}, even though the intermediate steps in its calculation using

f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., xn−1]) / (xn − x0)

are order dependent. We can show

f[x0, ..., xn] = f[xi0, ..., xin]

for any permutation (i0, i1, ..., in) of (0, 1, ..., n).

COINCIDENT NODES

What happens when some of the nodes {x0, ..., xn} are not distinct? Begin by investigating what happens when they all come together as a single point x0.

For first order divided differences, we have

lim_{x1→x0} f[x0, x1] = lim_{x1→x0} (f(x1) − f(x0)) / (x1 − x0) = f′(x0)

We extend the definition of f[x0, x1] to coincident nodes using

f[x0, x0] = f′(x0)

For second order divided differences, recall

f[x0, x1, x2] = (1/2) f″(c)

with c intermediate to x0, x1, and x2. Then as x1 → x0 and x2 → x0, we must also have that c → x0. Therefore,

lim_{x1→x0, x2→x0} f[x0, x1, x2] = (1/2) f″(x0)

We therefore define

f[x0, x0, x0] = (1/2) f″(x0)

For the case of general f[x0, ..., xn], recall that

f[x0, ..., xn] = (1/n!) f^(n)(c)

for some c intermediate to {x0, ..., xn}. Then

lim_{x1,...,xn→x0} f[x0, ..., xn] = (1/n!) f^(n)(x0)

and we define

f[x0, ..., x0]   (n + 1 times)   = (1/n!) f^(n)(x0)

What do we do when only some of the nodes are coincident? This too can be dealt with, although we do so here only by example:

f[x0, x1, x1] = (f[x1, x1] − f[x0, x1]) / (x1 − x0) = (f′(x1) − f[x0, x1]) / (x1 − x0)

The recursion formula can be used in general in this way to allow all possible combinations of possibly coincident nodes.

LAGRANGE'S FORMULA FOR THE INTERPOLATION POLYNOMIAL

Recall the general interpolation problem: find a polynomial Pn(x) for which

deg(Pn) ≤ n,   Pn(xi) = yi,   i = 0, 1, ..., n

with given data points

(x0, y0), (x1, y1), ..., (xn, yn)

and with {x0, ..., xn} distinct points.

In §5.1, we gave the solution as Lagrange's formula

Pn(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x)

with {L0(x), ..., Ln(x)} the Lagrange basis polynomials. Each Lj is of degree n and it satisfies

Lj(xi) = 1 if j = i,   0 if j ≠ i

for i = 0, 1, ..., n.

THE NEWTON DIVIDED DIFFERENCE FORM OF THE INTERPOLATION POLYNOMIAL

Let the data values for the problem

deg(Pn) ≤ n,   Pn(xi) = yi,   i = 0, 1, ..., n

be generated from a function f(x):

yi = f(xi),   i = 0, 1, ..., n

Using the divided differences

f[x0, x1], f[x0, x1, x2], ..., f[x0, ..., xn]

we can write the interpolation polynomials

P1(x), P2(x), ..., Pn(x)

in a way that is simple to compute:

P1(x) = f(x0) + f[x0, x1] (x − x0)
P2(x) = f(x0) + f[x0, x1] (x − x0) + f[x0, x1, x2] (x − x0)(x − x1)
      = P1(x) + f[x0, x1, x2] (x − x0)(x − x1)

For the case of the general problem

deg(Pn) ≤ n,   Pn(xi) = yi,   i = 0, 1, ..., n

we have

Pn(x) = f(x0) + f[x0, x1] (x − x0)
        + f[x0, x1, x2] (x − x0)(x − x1)
        + f[x0, x1, x2, x3] (x − x0)(x − x1)(x − x2)
        + · · ·
        + f[x0, ..., xn] (x − x0) · · · (x − xn−1)

From this we have the recursion relation

Pn(x) = Pn−1(x) + f[x0, ..., xn] (x − x0) · · · (x − xn−1)

in which Pn−1(x) interpolates f(x) at the points in {x0, ..., xn−1}.

Example: Recall the table

i   xi    f(xi)    Df(xi)   D2f(xi)   D3f(xi)   D4f(xi)
0   1.0   .54030   −.8670   −.2270    .1533     .0125
1   1.1   .45360   −.9124   −.1810    .1583
2   1.2   .36236   −.9486   −.1335
3   1.3   .26750   −.9753
4   1.4   .16997

with Dkf(xi) = f[xi, ..., xi+k], k = 1, 2, 3, 4. Then

P1(x) = .5403 − .8670 (x − 1)
P2(x) = P1(x) − .2270 (x − 1)(x − 1.1)
P3(x) = P2(x) + .1533 (x − 1)(x − 1.1)(x − 1.2)
P4(x) = P3(x) + .0125 (x − 1)(x − 1.1)(x − 1.2)(x − 1.3)

Using this table and these formulas, we have the following table of interpolants for the value x = 1.05. The true value is cos(1.05) = .49757105.

n          1         2         3          4
Pn(1.05)   .49695    .49752    .49758     .49757
Error      6.20E−4   5.00E−5   −1.00E−5   0.0

EVALUATION OF THE DIVIDED DIFFERENCE INTERPOLATION POLYNOMIAL

Let

d1 = f[x0, x1]
d2 = f[x0, x1, x2]
...
dn = f[x0, ..., xn]

Then the formula

Pn(x) = f(x0) + f[x0, x1] (x − x0) + f[x0, x1, x2] (x − x0)(x − x1) + · · · + f[x0, ..., xn] (x − x0) · · · (x − xn−1)

can be written as

Pn(x) = f(x0) + (x − x0) (d1 + (x − x1) (d2 + · · · + (x − xn−2) (dn−1 + (x − xn−1) dn) · · · ))

Thus we have a nested polynomial evaluation, and this is quite efficient in computational cost.
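A minimal MATLAB sketch of this nested evaluation, given the nodes and the top row of the divided-difference table, with d(k) = f[x(1), ..., x(k)] (the function name is ours, saved as newton_eval.m):

function p = newton_eval(x, d, t)
% Nested (Horner-like) evaluation of the Newton divided
% difference form at the points t.
m = length(d);
p = d(m) * ones(size(t));
for k = m-1:-1:1
    p = d(k) + (t - x(k)) .* p;    % peel off one nesting level
end
end

With the table above, newton_eval([1 1.1 1.2 1.3 1.4], [.54030 -.8670 -.2270 .1533 .0125], 1.05) should return approximately .49757.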

ERROR IN LINEAR INTERPOLATION

Let P1(x) denote the linear polynomial interpolating f(x) at x0 and x1, with f(x) a given function (e.g. f(x) = cos x). What is the error f(x) − P1(x)?

Let f(x) be twice continuously differentiable on an interval [a, b] which contains the points {x0, x1}. Then for a ≤ x ≤ b,

f(x) − P1(x) = [(x − x0)(x − x1) / 2] f″(cx)

for some cx between the minimum and maximum of x0, x1, and x.

If x1 and x are 'close to x0', then

f(x) − P1(x) ≈ [(x − x0)(x − x1) / 2] f″(x0)

Thus the error acts like a quadratic polynomial, with zeros at x0 and x1.

EXAMPLE

Let f(x) = log10 x; and in line with typical tables of log10 x, we take 1 ≤ x, x0, x1 ≤ 10. For definiteness, let x0 < x1 with h = x1 − x0. Then

f″(x) = −(log10 e) / x^2

log10 x − P1(x) = [(x − x0)(x − x1) / 2] [−(log10 e) / cx^2]
                = (x − x0)(x1 − x) [(log10 e) / (2 cx^2)]

We usually are interpolating with x0 ≤ x ≤ x1; and in that case, we have

(x − x0)(x1 − x) ≥ 0,   x0 ≤ cx ≤ x1

Since (x − x0)(x1 − x) ≥ 0 and x0 ≤ cx ≤ x1, we therefore have

(x − x0)(x1 − x) [(log10 e) / (2 x1^2)] ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x) [(log10 e) / (2 x0^2)]

For h = x1 − x0 small, we have for x0 ≤ x ≤ x1

log10 x − P1(x) ≈ (x − x0)(x1 − x) [(log10 e) / (2 x0^2)]

Typical high school algebra textbooks contain tables of log10 x with a spacing of h = .01. What is the error in this case? To look at this, we use

0 ≤ log10 x − P1(x) ≤ (x − x0)(x1 − x) [(log10 e) / (2 x0^2)]

By simple geometry or calculus,

max_{x0 ≤ x ≤ x1} (x − x0)(x1 − x) ≤ h^2 / 4

Therefore,

0 ≤ log10 x − P1(x) ≤ (h^2/4) [(log10 e) / (2 x0^2)] ≐ .0543 h^2 / x0^2

If we want a uniform bound for all points 1 ≤ x0 ≤ 10, we have

0 ≤ log10 x − P1(x) ≤ h^2 (log10 e) / 8 ≐ .0543 h^2

For h = .01, as is typical of the high school textbook tables of log10 x,

0 ≤ log10 x − P1(x) ≤ 5.43 × 10^−6

If you look at most tables, a typical entry is given to only four decimal places to the right of the decimal point, e.g.

log10 5.41 ≐ .7332

Therefore the entries are in error by as much as .00005. Comparing this with the interpolation error, we see the latter is less important than the rounding errors in the table entries.

From the bound

0 ≤ log10 x − P1(x) ≤ h^2 (log10 e) / (8 x0^2) ≐ .0543 h^2 / x0^2

we see the error decreases as x0 increases, and it is about 100 times smaller for points near 10 than for points near 1.

AN ERROR FORMULA: THE GENERAL CASE

Recall the general interpolation problem: find a polynomial Pn(x) for which

deg(Pn) ≤ n,   Pn(xi) = f(xi),   i = 0, 1, ..., n

with distinct node points {x0, ..., xn} and a given function f(x). Let [a, b] be a given interval on which f(x) is (n + 1)-times continuously differentiable; and assume the points x0, ..., xn, and x are contained in [a, b]. Then

f(x) − Pn(x) = [(x − x0)(x − x1) · · · (x − xn) / (n + 1)!] f^(n+1)(cx)

with cx some point between the minimum and maximum of the points in {x, x0, ..., xn}.

As shorthand, introduce

Ψn(x) = (x − x0)(x − x1) · · · (x − xn)

a polynomial of degree n + 1 with roots {x0, ..., xn}. Then

f(x) − Pn(x) = [Ψn(x) / (n + 1)!] f^(n+1)(cx)

THE QUADRATIC CASE

For n = 2, we have

f(x) − P2(x) = [(x − x0)(x − x1)(x − x2) / 3!] f^(3)(cx)   (∗)

with cx some point between the minimum and maximum of the points in {x, x0, x1, x2}.

To illustrate the use of this formula, consider the case of evenly spaced nodes:

x1 = x0 + h,   x2 = x1 + h

Further suppose we have x0 ≤ x ≤ x2, as we would usually have when interpolating in a table of given function values (e.g. log10 x). The quantity

Ψ2(x) = (x − x0)(x − x1)(x − x2)

can be evaluated directly for a particular x.

[Figure: graph of Ψ2(x) = (x + h) x (x − h), using (x0, x1, x2) = (−h, 0, h).]

In the formula (∗), however, we do not know cx, and therefore we replace |f^(3)(cx)| with the maximum of |f^(3)(x)| as x varies over x0 ≤ x ≤ x2. This yields

|f(x) − P2(x)| ≤ [|Ψ2(x)| / 3!] max_{x0 ≤ x ≤ x2} |f^(3)(x)|   (∗∗)

If we want a uniform bound for x0 ≤ x ≤ x2, we must compute

max_{x0 ≤ x ≤ x2} |Ψ2(x)| = max_{x0 ≤ x ≤ x2} |(x − x0)(x − x1)(x − x2)|

Using calculus,

max_{x0 ≤ x ≤ x2} |Ψ2(x)| = 2h^3 / (3√3),   attained at x = x1 ± h/√3

Combined with (∗∗), this yields

|f(x) − P2(x)| ≤ [h^3 / (9√3)] max_{x0 ≤ x ≤ x2} |f^(3)(x)|

for x0 ≤ x ≤ x2.

For f(x) = log10 x, with 1 ≤ x0 ≤ x ≤ x2 ≤ 10, this leads to

|log10 x − P2(x)| ≤ [h^3 / (9√3)] max_{x0 ≤ x ≤ x2} [2 (log10 e) / x^3] = .05572 h^3 / x0^3

For the case of h = .01, we have

|log10 x − P2(x)| ≤ 5.57 × 10^−8 / x0^3 ≤ 5.57 × 10^−8

Question: How much larger could we make h so that quadratic interpolation would have an error comparable to that of linear interpolation of log10 x with h = .01? The error bound for the linear interpolation was 5.43 × 10^−6, and therefore we want the same to be true of quadratic interpolation. Using a simpler bound, we want to find h so that

|log10 x − P2(x)| ≤ .05572 h^3 ≤ 5 × 10^−6

This is true if h = .04477. Therefore a spacing of h = .04 would be sufficient. A table with this spacing and quadratic interpolation would have an error comparable to a table with h = .01 and linear interpolation.

For the case of general n,

f(x) − Pn(x) = [(x − x0) · · · (x − xn) / (n + 1)!] f^(n+1)(cx) = [Ψn(x) / (n + 1)!] f^(n+1)(cx)

Ψn(x) = (x − x0)(x − x1) · · · (x − xn)

with cx some point between the minimum and maximum of the points in {x, x0, ..., xn}. When bounding the error we replace f^(n+1)(cx) with its maximum over the interval containing {x, x0, ..., xn}, as we have illustrated earlier in the linear and quadratic cases.

Consider now the function

Ψn(x) / (n + 1)!

over the interval determined by the minimum and maximum of the points in {x, x0, ..., xn}. For evenly spaced node points on [0, 1], with x0 = 0 and xn = 1, we give graphs for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9 on accompanying pages.

DISCUSSION OF ERROR

Consider the error

f(x) − Pn(x) = [(x − x0) · · · (x − xn) / (n + 1)!] f^(n+1)(cx) = [Ψn(x) / (n + 1)!] f^(n+1)(cx)

Ψn(x) = (x − x0)(x − x1) · · · (x − xn)

as n increases and as x varies. As noted previously, we cannot do much with f^(n+1)(cx) except to replace it with a maximum value of |f^(n+1)(x)| over a suitable interval. Thus we concentrate on understanding the size of

Ψn(x) / (n + 1)!

ERROR FOR EVENLY SPACED NODES

We consider first the case in which the node points are evenly spaced, as this seems the 'natural' way to define the points at which interpolation is carried out. Moreover, using evenly spaced nodes is the case to consider for table interpolation. What can we learn from the given graphs?

The interpolation nodes are determined by using

h = 1/n,   x0 = 0, x1 = h, x2 = 2h, ..., xn = nh = 1

For this case,

Ψn(x) = x (x − h)(x − 2h) · · · (x − 1)

Our graphs are the cases of n = 2, ..., 9.

[Figure: graphs of Ψn(x) on [0, 1] for n = 2, 3, 4, 5.]

[Figure: graphs of Ψn(x) on [0, 1] for n = 6, 7, 8, 9.]

[Figure: graph of Ψ6(x) = (x − x0)(x − x1) · · · (x − x6) with evenly spaced nodes.]

Using the following table,

n   Mn         n    Mn
1   1.25E−1    6    4.76E−7
2   2.41E−2    7    2.20E−8
3   2.06E−3    8    9.11E−10
4   1.48E−4    9    3.39E−11
5   9.01E−6    10   1.15E−12

we can observe that the maximum

Mn ≡ max_{x0 ≤ x ≤ xn} |Ψn(x)| / (n + 1)!

becomes smaller with increasing n.

From the graphs, there is enormous variation in the size of Ψn(x) as x varies over [0, 1]; and thus there is also enormous variation in the error as x so varies. For example, in the n = 9 case,

max_{x0 ≤ x ≤ x1} |Ψn(x)| / (n + 1)! = 3.39 × 10^−11
max_{x4 ≤ x ≤ x5} |Ψn(x)| / (n + 1)! = 6.89 × 10^−13

and the ratio of these two errors is approximately 49. Thus the interpolation error is likely to be around 49 times larger when x0 ≤ x ≤ x1 as compared to the case when x4 ≤ x ≤ x5. When doing table interpolation, the point x at which you are interpolating should be centrally located with respect to the interpolation nodes {x0, ..., xn} being used to define the interpolation, if possible.

AN APPROXIMATION PROBLEM

Consider now the problem of using an interpolation polynomial to approximate a given function f(x) on a given interval [a, b]. In particular, take interpolation nodes

a ≤ x0 < x1 < · · · < xn−1 < xn ≤ b

and produce the interpolation polynomial Pn(x) that interpolates f(x) at the given node points. We would like to have

max_{a ≤ x ≤ b} |f(x) − Pn(x)| → 0 as n → ∞

Does it happen?

Recall the error bound

max_{a ≤ x ≤ b} |f(x) − Pn(x)| ≤ max_{a ≤ x ≤ b} [|Ψn(x)| / (n + 1)!] · max_{a ≤ x ≤ b} |f^(n+1)(x)|

We begin with an example using evenly spaced node points.

RUNGE'S EXAMPLE

Use evenly spaced node points:

h = (b − a)/n,   xi = a + ih for i = 0, ..., n

For some functions, such as f(x) = e^x, the maximum error goes to zero quite rapidly. But the size of the derivative term f^(n+1)(x) in

max_{a ≤ x ≤ b} |f(x) − Pn(x)| ≤ max_{a ≤ x ≤ b} [|Ψn(x)| / (n + 1)!] · max_{a ≤ x ≤ b} |f^(n+1)(x)|

can badly hurt or destroy the convergence in other cases.

In particular, we show the graph of f(x) = 1/(1 + x^2) and Pn(x) on [−5, 5] for the cases n = 8 and n = 12. The case n = 10 is in the text on page 127. It can be proven that for this function, the maximum error on [−5, 5] does not converge to zero. Thus the use of evenly spaced nodes is not necessarily a good approach to approximating a function f(x) by interpolation.

[Figure: Runge's example with n = 10, showing y = 1/(1 + x^2) and y = P10(x) on [−5, 5].]
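A short MATLAB sketch reproducing this behavior (polyfit through n + 1 evenly spaced nodes with degree n gives the interpolant; MATLAB may warn about conditioning, which is itself part of the story):

f  = @(x) 1./(1 + x.^2);
n  = 10;
xi = linspace(-5, 5, n+1);        % evenly spaced nodes
a  = polyfit(xi, f(xi), n);       % interpolating polynomial P10
xx = linspace(-5, 5, 501);
plot(xx, f(xx), xx, polyval(a, xx), xi, f(xi), 'o')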

OTHER CHOICES OF NODES

Recall the general error bound

max_{a ≤ x ≤ b} |f(x) − Pn(x)| ≤ max_{a ≤ x ≤ b} [|Ψn(x)| / (n + 1)!] · max_{a ≤ x ≤ b} |f^(n+1)(x)|

There is nothing we can really do with the derivative term for f; but we can examine the way of defining the nodes {x0, ..., xn} within the interval [a, b]. We ask how these nodes can be chosen so that the maximum of |Ψn(x)| over [a, b] is made as small as possible.

This problem has quite an elegant solution, and it is taken up in §4.6. The node points {x0, ..., xn} turn out to be the zeros of a particular polynomial Tn+1(x) of degree n + 1, called a Chebyshev polynomial. These zeros are known explicitly, and with them

max_{a ≤ x ≤ b} |Ψn(x)| = ((b − a)/2)^(n+1) · 2^−n

This turns out to be smaller than for the evenly spaced case; and although this polynomial interpolation does not work for all functions f(x), it works for all differentiable functions and more.
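For reference, the zeros of Tn+1 on [−1, 1] are cos((2k + 1)π / (2(n + 1))), k = 0, ..., n; a small MATLAB sketch mapping them to a general interval [a, b]:

a = -5;  b = 5;  n = 10;
k  = 0:n;
t  = cos((2*k + 1)*pi ./ (2*(n + 1)));   % zeros of T_{n+1} on [-1, 1]
xi = (a + b)/2 + (b - a)/2 * t;          % Chebyshev nodes on [a, b]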

ANOTHER ERROR FORMULA

Recall the error formula

f(x) − Pn(x) = [Ψn(x) / (n + 1)!] f^(n+1)(c),   Ψn(x) = (x − x0)(x − x1) · · · (x − xn)

with c between the minimum and maximum of {x0, ..., xn, x}. A second formula is given by

f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]

To show this is a simple, but somewhat subtle argument.

Let Pn+1(x) denote the polynomial of degree ≤ n + 1 which interpolates f(x) at the points {x0, ..., xn, xn+1}. Then

Pn+1(x) = Pn(x) + f[x0, ..., xn, xn+1] (x − x0) · · · (x − xn)

Substituting x = xn+1, and using the fact that Pn+1(x) interpolates f(x) at xn+1, we have

f(xn+1) = Pn(xn+1) + f[x0, ..., xn, xn+1] (xn+1 − x0) · · · (xn+1 − xn)

In this formula, the number xn+1 is completely arbitrary, other than being distinct from the points in {x0, ..., xn}. To emphasize this fact, replace xn+1 by x throughout the formula, obtaining

f(x) = Pn(x) + f[x0, ..., xn, x] (x − x0) · · · (x − xn) = Pn(x) + Ψn(x) f[x0, ..., xn, x]

provided x ≠ x0, ..., xn.

The formula

f(x) = Pn(x) + f[x0, ..., xn, x] (x − x0) · · · (x − xn) = Pn(x) + Ψn(x) f[x0, ..., xn, x]

is easily seen to hold for x a node point as well, since both sides then equal f(x). (Provided f(x) is differentiable, the divided difference f[x0, ..., xn, x] remains defined when x coincides with a node.)

This shows

f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]

Compare the two error formulas:

f(x) − Pn(x) = Ψn(x) f[x0, ..., xn, x]
f(x) − Pn(x) = [Ψn(x) / (n + 1)!] f^(n+1)(c)

Then

Ψn(x) f[x0, ..., xn, x] = [Ψn(x) / (n + 1)!] f^(n+1)(c)

f[x0, ..., xn, x] = f^(n+1)(c) / (n + 1)!

for some c between the smallest and largest of the numbers in {x0, ..., xn, x}.

To make this somewhat symmetric in its arguments, let m = n + 1, x = xm. Then

f[x0, ..., xm−1, xm] = f^(m)(c) / m!

with c an unknown number between the smallest and largest of the numbers in {x0, ..., xm}. This was given in an earlier lecture where divided differences were introduced.

PIECEWISE POLYNOMIAL INTERPOLATION

Recall the examples of higher degree polynomial interpolation of the function f(x) = (1 + x^2)^−1 on [−5, 5]. The interpolants Pn(x) oscillated a great deal, whereas the function f(x) was nonoscillatory. To obtain interpolants that are better behaved, we look at other forms of interpolating functions.

Consider the data

x   0     1     2     2.5   3     3.5    4
y   2.5   0.5   0.5   1.5   1.5   1.125  0

What are methods of interpolating this data, other than using a degree 6 polynomial? Shown in the text are the graphs of the degree 6 polynomial interpolant, along with those of piecewise linear and piecewise quadratic interpolating functions.

Since we only have the data to consider, we would generally want to use an interpolant that had somewhat the shape of that of the piecewise linear interpolant.

[Figure: the data points, and their piecewise linear interpolation on [0, 4].]

[Figure: the degree 6 polynomial interpolation, and the piecewise quadratic interpolation of the same data.]

PIECEWISE POLYNOMIAL FUNCTIONS

Consider being given a set of data points (x1, y1), ..., (xn, yn), with

x1 < x2 < · · · < xn

Then the simplest way to connect the points (xj, yj) is by straight line segments. This is called a piecewise linear interpolant of the data {(xj, yj)}. This graph has "corners", and often we expect the interpolant to have a smooth graph.

To obtain a somewhat smoother graph, consider using piecewise quadratic interpolation. Begin by constructing the quadratic polynomial that interpolates

{(x1, y1), (x2, y2), (x3, y3)}

Then construct the quadratic polynomial that interpolates

{(x3, y3), (x4, y4), (x5, y5)}

Continue this process of constructing quadratic interpolants on the subintervals

[x1, x3], [x3, x5], [x5, x7], ...

If the number of subintervals is even (and therefore n is odd), then this process comes out fine, with the last interval being [xn−2, xn]. This was illustrated on the graph for the preceding data. If, however, n is even, then the approximation on the last interval must be handled by some modification of this procedure. Suggest such!

With piecewise quadratic interpolants, however, there are "corners" on the graph of the interpolating function. With our preceding example, they are at x3 and x5. How do we avoid this?

Piecewise polynomial interpolants are used in many applications. We will consider them later, to obtain numerical integration formulas.

SMOOTH NON-OSCILLATORY INTERPOLATION

Let data points (x1, y1), ..., (xn, yn) be given, and let

x1 < x2 < · · · < xn

Consider finding functions s(x) for which the following properties hold:

(1) s(xi) = yi,   i = 1, ..., n
(2) s(x), s′(x), s″(x) are continuous on [x1, xn].

Then among such functions s(x) satisfying these properties, find the one which minimizes the integral

∫_{x1}^{xn} |s″(x)|^2 dx

The idea of minimizing the integral is to obtain an interpolating function for which the first derivative does not change rapidly. It turns out there is a unique solution to this problem, and it is called a natural cubic spline function.

SPLINE FUNCTIONS

Let a set of node points {xi} be given, satisfying

a ≤ x1 < x2 < · · · < xn ≤ b

for some numbers a and b. Often we use [a, b] = [x1, xn]. A cubic spline function s(x) on [a, b] with "breakpoints" or "knots" {xi} has the following properties:

1. On each of the intervals

[a, x1], [x1, x2], ..., [xn−1, xn], [xn, b]

s(x) is a polynomial of degree ≤ 3.

2. s(x), s′(x), s″(x) are continuous on [a, b].

In the case that we have given data points (x1, y1), ..., (xn, yn), we say s(x) is a cubic interpolating spline function for this data if

3. s(xi) = yi,   i = 1, ..., n.

EXAMPLE

Define

(x − α)_+^3 = (x − α)^3 for x ≥ α,   and 0 for x ≤ α

This is a cubic spline function on (−∞, ∞) with the single breakpoint x1 = α.

Combinations of these form more complicated cubic spline functions. For example,

s(x) = 3 (x − 1)_+^3 − 2 (x − 3)_+^3

is a cubic spline function on (−∞, ∞) with the breakpoints x1 = 1, x2 = 3.

Define

s(x) = p3(x) + Σ_{j=1}^n aj (x − xj)_+^3

with p3(x) some cubic polynomial. Then s(x) is a cubic spline function on (−∞, ∞) with breakpoints {x1, ..., xn}.

Return to the earlier problem of choosing an interpolating function s(x) to minimize the integral

∫_{x1}^{xn} |s″(x)|^2 dx

There is a unique solution to this problem. The solution s(x) is a cubic interpolating spline function, and moreover, it satisfies

s″(x1) = s″(xn) = 0

Spline functions satisfying these boundary conditions are called "natural" cubic spline functions, and the solution to our minimization problem is a "natural cubic interpolatory spline function". We will show a method to construct this function from the interpolation data.

Motivation for these boundary conditions can be given by looking at the physics of bending thin beams of flexible materials to pass through the given data. To the left of x1 and to the right of xn, the beam is straight and therefore the second derivatives are zero at the transition points x1 and xn.

CONSTRUCTION OF THE INTERPOLATING SPLINE FUNCTION

To make the presentation more specific, suppose we have data

(x1, y1), (x2, y2), (x3, y3), (x4, y4)

with x1 < x2 < x3 < x4. Then on each of the intervals

[x1, x2], [x2, x3], [x3, x4]

s(x) is a cubic polynomial. Taking the first interval, s(x) is a cubic polynomial and s″(x) is a linear polynomial. Let

Mi = s″(xi),   i = 1, 2, 3, 4

Then on [x1, x2],

s″(x) = [(x2 − x) M1 + (x − x1) M2] / (x2 − x1),   x1 ≤ x ≤ x2

We can find s(x) by integrating twice:

s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / [6 (x2 − x1)] + c1 x + c2

We determine the constants of integration by using

s(x1) = y1,   s(x2) = y2   (∗)

Then

s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / [6 (x2 − x1)]
       + [(x2 − x) y1 + (x − x1) y2] / (x2 − x1)
       − [(x2 − x1)/6] [(x2 − x) M1 + (x − x1) M2]

for x1 ≤ x ≤ x2. Check that this formula satisfies the given interpolation condition (∗)!

We can repeat this on the intervals [x2, x3] and [x3, x4], obtaining similar formulas.

For x2 ≤ x ≤ x3,

s(x) = [(x3 − x)^3 M2 + (x − x2)^3 M3] / [6 (x3 − x2)]
       + [(x3 − x) y2 + (x − x2) y3] / (x3 − x2)
       − [(x3 − x2)/6] [(x3 − x) M2 + (x − x2) M3]

For x3 ≤ x ≤ x4,

s(x) = [(x4 − x)^3 M3 + (x − x3)^3 M4] / [6 (x4 − x3)]
       + [(x4 − x) y3 + (x − x3) y4] / (x4 − x3)
       − [(x4 − x3)/6] [(x4 − x) M3 + (x − x3) M4]

We still do not know the values of the second derivatives {M1, M2, M3, M4}. The above formulas guarantee that s(x) and s″(x) are continuous for x1 ≤ x ≤ x4. For example, the formula on [x1, x2] yields

s(x2) = y2,   s″(x2) = M2

The formula on [x2, x3] also yields

s(x2) = y2,   s″(x2) = M2

All that is lacking is to make s′(x) continuous at x2 and x3. Thus we require

s′(x2 + 0) = s′(x2 − 0),   s′(x3 + 0) = s′(x3 − 0)   (∗∗)

This means

lim_{x↘x2} s′(x) = lim_{x↗x2} s′(x)

and similarly for x3.

To simplify the presentation somewhat, I assume in the following that our node points are evenly spaced:

x2 = x1 + h,   x3 = x1 + 2h,   x4 = x1 + 3h

Then our earlier formulas simplify to

s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / (6h)
       + [(x2 − x) y1 + (x − x1) y2] / h
       − (h/6) [(x2 − x) M1 + (x − x1) M2]

for x1 ≤ x ≤ x2, with similar formulas on [x2, x3] and [x3, x4].

Without going through all of the algebra, the conditions (∗∗) lead to the following pair of equations:

(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h

This gives us two equations in four unknowns. The earlier boundary conditions on s″(x) give us immediately

M1 = M4 = 0

Then we can solve the linear system for M2 and M3.

EXAMPLE

Consider the interpolation data points

x   1   2     3     4
y   1   1/2   1/3   1/4

In this case, h = 1, and the linear system becomes

(2/3) M2 + (1/6) M3 = y3 − 2y2 + y1 = 1/3
(1/6) M2 + (2/3) M3 = y4 − 2y3 + y2 = 1/12

This has the solution

M2 = 1/2,   M3 = 0

This leads to the spline function formula on each subinterval.

On [1, 2],

s(x) = [(2 − x)^3 · 0 + (x − 1)^3 (1/2)] / 6
       + [(2 − x) · 1 + (x − 1)(1/2)] / 1
       − (1/6) [(2 − x) · 0 + (x − 1)(1/2)]
     = (1/12)(x − 1)^3 − (7/12)(x − 1) + 1

Similarly, for 2 ≤ x ≤ 3,

s(x) = −(1/12)(x − 2)^3 + (1/4)(x − 2)^2 − (1/3)(x − 2) + 1/2

and for 3 ≤ x ≤ 4,

s(x) = −(1/12)(x − 4) + 1/4

x   1   2     3     4
y   1   1/2   1/3   1/4

[Figure: graph of this example of natural cubic spline interpolation, showing y = 1/x and y = s(x) on [1, 4].]

x   0     1     2     2.5   3     3.5    4
y   2.5   0.5   0.5   1.5   1.5   1.125  0

[Figure: interpolating natural cubic spline function for this data.]

ALTERNATIVE BOUNDARY CONDITIONS

Return to the equations

(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h

Sometimes other boundary conditions are imposed on s(x) to help in determining the values of M1 and M4. For example, the data in our numerical example were generated from the function f(x) = 1/x. With it, f″(x) = 2/x^3, and thus we could use

M1 = 2,   M4 = 1/32

With this we are led to a new formula for s(x), one that approximates f(x) = 1/x more closely.

THE CLAMPED SPLINE

In this case, we augment the interpolation conditions

s(xi) = yi,   i = 1, 2, 3, 4

with the boundary conditions

s′(x1) = y′1,   s′(x4) = y′4   (#)

The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are

(h/3) M1 + (h/6) M2 = (y2 − y1)/h − y′1
(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y3 − y2)/h − (y2 − y1)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y4 − y3)/h − (y3 − y2)/h
(h/6) M3 + (h/3) M4 = y′4 − (y4 − y3)/h

For our numerical example, it is natural to obtain these derivative values from f′(x) = −1/x^2:

y′1 = −1,   y′4 = −1/16

When combined with the earlier equations, we have the system

(1/3) M1 + (1/6) M2 = 1/2
(1/6) M1 + (2/3) M2 + (1/6) M3 = 1/3
(1/6) M2 + (2/3) M3 + (1/6) M4 = 1/12
(1/6) M3 + (1/3) M4 = 1/48

This has the solution

[M1, M2, M3, M4] = [173/120, 7/60, 11/120, 1/60]
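A quick MATLAB check of this 4 × 4 system (the backslash operator solves the linear system):

A = [1/3 1/6 0 0; 1/6 2/3 1/6 0; 0 1/6 2/3 1/6; 0 0 1/6 1/3];
b = [1/2; 1/3; 1/12; 1/48];
M = A \ b      % returns [173/120; 7/60; 11/120; 1/60]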

We can now write the function s(x) on each of the subintervals [x1, x2], [x2, x3], and [x3, x4]. Recall for x1 ≤ x ≤ x2,

s(x) = [(x2 − x)^3 M1 + (x − x1)^3 M2] / (6h)
       + [(x2 − x) y1 + (x − x1) y2] / h
       − (h/6) [(x2 − x) M1 + (x − x1) M2]

We can substitute in from the data

x   1   2     3     4
y   1   1/2   1/3   1/4

and the solutions {Mi}. Doing so, consider the error f(x) − s(x). As an example,

f(x) = 1/x,   f(3/2) = 2/3,   s(3/2) = .65260

This is quite a decent approximation.

THE GENERAL PROBLEM

Consider the spline interpolation problem with n nodes

(x1, y1), (x2, y2), ..., (xn, yn)

and assume the node points {xi} are evenly spaced,

xj = x1 + (j − 1) h,   j = 1, ..., n

We have that the interpolating spline s(x) on xj ≤ x ≤ xj+1 is given by

s(x) = [(xj+1 − x)^3 Mj + (x − xj)^3 Mj+1] / (6h)
       + [(xj+1 − x) yj + (x − xj) yj+1] / h
       − (h/6) [(xj+1 − x) Mj + (x − xj) Mj+1]

for j = 1, ..., n − 1.

To enforce continuity of s′(x) at the interior node points x2, ..., xn−1, the second derivatives {Mj} must satisfy the linear equations

(h/6) Mj−1 + (2h/3) Mj + (h/6) Mj+1 = (yj−1 − 2yj + yj+1)/h

for j = 2, ..., n − 1. Writing them out,

(h/6) M1 + (2h/3) M2 + (h/6) M3 = (y1 − 2y2 + y3)/h
(h/6) M2 + (2h/3) M3 + (h/6) M4 = (y2 − 2y3 + y4)/h
...
(h/6) Mn−2 + (2h/3) Mn−1 + (h/6) Mn = (yn−2 − 2yn−1 + yn)/h

This is a system of n − 2 equations in the n unknowns {M1, ..., Mn}. Two more conditions must be imposed on s(x) in order to have the number of equations equal the number of unknowns, namely n. With the added boundary conditions, this form of linear system can be solved very efficiently.
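A minimal MATLAB sketch assembling and solving this tridiagonal system with the natural boundary conditions M1 = Mn = 0 (shown for the data of the earlier example, where it reproduces M2 = 1/2, M3 = 0):

x = 1:4;  y = 1 ./ x;
n = length(x);  h = x(2) - x(1);
e = ones(n-2, 1);
A = (h/6) * (diag(e(1:end-1), -1) + 4*diag(e) + diag(e(1:end-1), 1));
r = (y(1:n-2) - 2*y(2:n-1) + y(3:n)).' / h;   % right-hand sides
M = [0; A \ r; 0]                             % M1, ..., Mn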

BOUNDARY CONDITIONS

"Natural" boundary conditions:

s″(x1) = s″(xn) = 0

Spline functions satisfying these conditions are called "natural cubic splines". They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.

"Clamped" boundary conditions: We add the conditions

s′(x1) = y′1,   s′(xn) = y′n

with y′1, y′n given slopes for the endpoints of s(x) on [x1, xn]. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.

"Not a knot" boundary conditions: This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in MATLAB.

THE "NOT A KNOT" CONDITIONS

As before, let the interpolation nodes be

(x1, y1), (x2, y2), ..., (xn, yn)

We separate these points into two categories. For constructing the interpolating cubic spline function, we use the points

(x1, y1), (x3, y3), ..., (xn−2, yn−2), (xn, yn)

thus deleting two of the points. We now have n − 2 points, and the interpolating spline s(x) can be determined on the intervals

[x1, x3], [x3, x4], ..., [xn−3, xn−2], [xn−2, xn]

This leads to n − 4 equations in the n − 2 unknowns M1, M3, ..., Mn−2, Mn. The two additional boundary conditions are

s(x2) = y2,   s(xn−1) = yn−1

These translate into two additional equations, and we obtain a system of n − 2 linear simultaneous equations in the n − 2 unknowns M1, M3, ..., Mn−2, Mn.

x   0     1     2     2.5   3     3.5    4
y   2.5   0.5   0.5   1.5   1.5   1.125  0

[Figure: interpolating cubic spline function with "not a knot" boundary conditions.]

MATLAB SPLINE FUNCTION LIBRARY

Given data points

(x1, y1), (x2, y2), ..., (xn, yn)

type arrays containing the x and y coordinates:

x = [x1 x2 ... xn]
y = [y1 y2 ... yn]
plot(x, y, 'o')

The last statement will draw a plot of the data points, marking them with the letter 'oh'. To find the interpolating cubic spline function and evaluate it at the points of another array xx, say

h = (xn - x1)/(10*n);  xx = x1 : h : xn;

use

yy = spline(x, y, xx)
plot(x, y, 'o', xx, yy)

The last statement will plot the data points, as before, and it will plot the interpolating spline s(x) as a continuous curve.
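For example, with the seven data points used earlier in this chapter (recall that spline uses the "not a knot" boundary conditions):

x  = [0 1 2 2.5 3 3.5 4];
y  = [2.5 0.5 0.5 1.5 1.5 1.125 0];
xx = 0 : 4/70 : 4;              % h = (4 - 0)/(10*7)
yy = spline(x, y, xx);
plot(x, y, 'o', xx, yy)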

ERROR IN CUBIC SPLINE INTERPOLATION

Let an interval [a, b] be given, and then define

h = (b − a)/(n − 1),   xj = a + (j − 1) h,   j = 1, ..., n

Suppose we want to approximate a given function f(x) on the interval [a, b] using cubic spline interpolation. Define

yj = f(xj),   j = 1, ..., n

Let sn(x) denote the cubic spline interpolating this data and satisfying the "not a knot" boundary conditions. Then it can be shown that for a suitable constant c,

En ≡ max_{a ≤ x ≤ b} |f(x) − sn(x)| ≤ c h^4

The corresponding bound for natural cubic spline interpolation contains only a term of h^2 rather than h^4; it does not converge to zero as rapidly.

EXAMPLE

Take f(x) = arctan x on [0, 5]. The following table gives values of the maximum error En for various values of n. The values of h are being successively halved; for an O(h^4) method, the ratios of successive errors should approach 2^4 = 16.

n    En        E_{n/2}/En
7    7.09E−3
13   3.24E−4   21.9
25   3.06E−5   10.6
49   1.48E−6   20.7
97   9.04E−8   16.4
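This experiment is easy to reproduce with a MATLAB sketch (spline uses the "not a knot" conditions; the maximum is estimated on a fine grid):

f  = @(x) atan(x);
xx = linspace(0, 5, 5001);       % fine grid for estimating the max
for n = [7 13 25 49 97]
    xi = linspace(0, 5, n);
    En = max(abs(f(xx) - spline(xi, f(xi), xx)));
    fprintf('n = %2d   E_n = %.2e\n', n, En)
end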

BEST APPROXIMATION

Given a function f(x) that is continuous on a given interval [a, b], consider approximating it by some polynomial p(x). To measure the error in p(x) as an approximation, introduce

E(p) = max_{a ≤ x ≤ b} |f(x) − p(x)|

This is called the maximum error or uniform error of approximation of f(x) by p(x) on [a, b].

With an eye towards efficiency, we want to find the 'best' possible approximation of a given degree n. With this in mind, introduce the following:

ρn(f) = min_{deg(p) ≤ n} E(p) = min_{deg(p) ≤ n} [max_{a ≤ x ≤ b} |f(x) − p(x)|]

The number ρn(f) will be the smallest possible uniform error, or minimax error, when approximating f(x) by polynomials of degree at most n. If there is a polynomial giving this smallest error, we denote it by mn(x); thus E(mn) = ρn(f).

Example. Let f(x) = e^x on [−1, 1]. In the following table, we give the values of E(tn), with tn(x) the Taylor polynomial of degree n for e^x about x = 0, and E(mn).

     Maximum error in:
n    tn(x)     mn(x)
1    7.18E−1   2.79E−1
2    2.18E−1   4.50E−2
3    5.16E−2   5.53E−3
4    9.95E−3   5.47E−4
5    1.62E−3   4.52E−5
6    2.26E−4   3.21E−6
7    2.79E−5   2.00E−7
8    3.06E−6   1.11E−8
9    3.01E−7   5.52E−10

Consider graphically how we can improve on the Taylor polynomial

t1(x) = 1 + x

as a uniform approximation to e^x on the interval [−1, 1]. The linear minimax approximation is

m1(x) = 1.2643 + 1.1752 x

[Figure: linear Taylor and minimax approximations to e^x on [−1, 1], showing y = e^x, y = t1(x), and y = m1(x).]

[Figure: error in the cubic Taylor approximation to e^x on [−1, 1]; the maximum error is 0.0516.]

[Figure: error in the cubic minimax approximation to e^x on [−1, 1]; the error oscillates between −0.00553 and 0.00553.]

Accuracy of the minimax approximation.

ρn(f) ≤ [((b − a)/2)^(n+1) / ((n + 1)! 2^n)] max_{a ≤ x ≤ b} |f^(n+1)(x)|

This error bound does not always become smaller with increasing n, but it will give a fairly accurate bound for many common functions f(x).

Example. Let f(x) = e^x for −1 ≤ x ≤ 1. Then

ρn(e^x) ≤ e / ((n + 1)! 2^n)   (∗)

n    Bound (∗)   ρn(f)
1    6.80E−1     2.79E−1
2    1.13E−1     4.50E−2
3    1.42E−2     5.53E−3
4    1.42E−3     5.47E−4
5    1.18E−4     4.52E−5
6    8.43E−6     3.21E−6
7    5.27E−7     2.00E−7
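The bound (∗) is a one-liner in MATLAB:

n = (1:7).';
bound = exp(1) ./ (factorial(n+1) .* 2.^n)   % 6.80e-1, 1.13e-1, ...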

CHEBYSHEV POLYNOMIALS

Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer n ≥ 0, define the function

Tn(x) = cos(n cos^−1 x),   −1 ≤ x ≤ 1   (1)

This may not appear to be a polynomial, but we will show it is a polynomial of degree n. To simplify the manipulation of (1), we introduce

θ = cos^−1(x)   or   x = cos(θ),   0 ≤ θ ≤ π   (2)

Then

Tn(x) = cos(nθ)   (3)

Example. For n = 0,

T0(x) = cos(0 · θ) = 1

For n = 1,

T1(x) = cos(θ) = x

For n = 2,

T2(x) = cos(2θ) = 2 cos^2(θ) − 1 = 2x^2 − 1

[Figure: graphs of T0(x), T1(x), T2(x), and of T3(x), T4(x), on [−1, 1].]

The triple recursion relation. Recall the trigonometric addition formulas,

cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β)

Let n ≥ 1, and apply these identities to get

Tn+1(x) = cos[(n + 1)θ] = cos(nθ + θ) = cos(nθ) cos(θ) − sin(nθ) sin(θ)
Tn−1(x) = cos[(n − 1)θ] = cos(nθ − θ) = cos(nθ) cos(θ) + sin(nθ) sin(θ)

Add these two equations, and then use (1) and (3) to obtain

Tn+1(x) + Tn−1(x) = 2 cos(nθ) cos(θ) = 2x Tn(x)
Tn+1(x) = 2x Tn(x) − Tn−1(x),   n ≥ 1   (4)

This is called the triple recursion relation for the Chebyshev polynomials. It is often used in evaluating them, rather than using the explicit formula (1).
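A minimal MATLAB sketch of such an evaluation by the triple recursion (4) (the function name is ours, saved as cheb_eval.m):

function T = cheb_eval(n, x)
% Evaluate T_n at the points x using T_{k+1} = 2x T_k - T_{k-1}.
Tprev = ones(size(x));            % T_0
if n == 0, T = Tprev; return, end
T = x;                            % T_1
for k = 1:n-1
    Tnext = 2*x.*T - Tprev;
    Tprev = T;
    T = Tnext;
end
end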

Example. Recall

T0(x) = 1,   T1(x) = x
Tn+1(x) = 2x Tn(x) − Tn−1(x),   n ≥ 1

Let n = 2. Then

T3(x) = 2x T2(x) − T1(x) = 2x(2x^2 − 1) − x = 4x^3 − 3x

Let n = 3. Then

T4(x) = 2x T3(x) − T2(x) = 2x(4x^3 − 3x) − (2x^2 − 1) = 8x^4 − 8x^2 + 1

The minimum size property. Note that

|Tn(x)| ≤ 1,   −1 ≤ x ≤ 1   (5)

for all n ≥ 0. Also, note that

Tn(x) = 2^(n−1) x^n + lower degree terms,   n ≥ 1   (6)

This can be proven using the triple recursion relation and mathematical induction.

Introduce a modified version of Tn(x),

T̃n(x) = (1/2^(n−1)) Tn(x) = x^n + lower degree terms   (7)

From (5) and (6),

|T̃n(x)| ≤ 1/2^(n−1),   −1 ≤ x ≤ 1,   n ≥ 1   (8)

Example.

T̃4(x) = (1/8)(8x^4 − 8x^2 + 1) = x^4 − x^2 + 1/8

A polynomial whose highest degree term has a coefficient of 1 is called a monic polynomial. Formula (8) says the monic polynomial T̃n(x) has size 1/2^(n−1) on −1 ≤ x ≤ 1, and this becomes smaller as the degree n increases. In comparison,

max_{−1 ≤ x ≤ 1} |x^n| = 1

Thus x^n is a monic polynomial whose size does not change with increasing n.

Theorem. Let n ≥ 1 be an integer, and consider all possible monic polynomials of degree n. Then the degree n monic polynomial with the smallest maximum on [−1, 1] is the modified Chebyshev polynomial T̃n(x), and its maximum value on [−1, 1] is 1/2^(n−1).

This result is used in devising applications of Chebyshev polynomials. We apply it to obtain an improved interpolation scheme.