
An Introduction to

the Conjugate Gradient Method that Even an Idiot Can Understand

Jonathan Richard Shewchuk
March 7, 1994

CMU-CS-94-125

School of Computer Science
Carnegie Mellon University

Pittsburgh, PA 15213

Abstract

The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written so that even their own authors would be mystified, if they bothered to read their own writing. For this reason, an understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas that even stupid people can understand. Of course, a reader as intelligent as yourself will learn them almost effortlessly.

The idea of quadratic forms is introduced and used to derive the methods of Steepest Descent, Conjugate Directions, and Conjugate Gradients. Eigenvectors are explained and used to examine the convergence of the Jacobi Method, Steepest Descent, and Conjugate Gradients. Other topics include preconditioning and the nonlinear Conjugate Gradient Method. I have taken pains to make this article easy to read. Sixty-two illustrations are provided. Dense prose is avoided. Concepts are explained in several different ways. Most equations are coupled with an intuitive interpretation.


Supported in part by the Natural Sciences and Engineering Research Council of Canada under a 1967 Science and Engineering Scholarship and by the National Science Foundation under Grant ASC-9318163. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either express or implied, of NSERC, NSF, or the U.S. Government.


Keywords: conjugate gradient method, preconditioning, convergence analysis, idiot


Contents

1. Introduction
2. Notation
3. The Quadratic Form
4. The Method of Steepest Descent
5. Thinking with Eigenvectors and Eigenvalues
   5.1. Eigen do it if I try
   5.2. Jacobi iterations
   5.3. A Concrete Example
6. Convergence Analysis of Steepest Descent
   6.1. Instant Results
   6.2. General Convergence
7. The Method of Conjugate Directions
   7.1. Conjugacy
   7.2. Gram-Schmidt Conjugation
   7.3. Properties of the Residual
8. The Method of Conjugate Gradients
9. Convergence Analysis of Conjugate Gradients
   9.1. Optimality of the Error Term
   9.2. Chebyshev Polynomials
10. Complexity
11. Starting and Stopping
    11.1. Starting
    11.2. Stopping
12. Preconditioning
13. Conjugate Gradients on the Normal Equations
14. The Nonlinear Conjugate Gradient Method
    14.1. Outline of the Nonlinear Conjugate Gradient Method
    14.2. General Line Search
    14.3. Preconditioning
A. Notes
B. Canned Algorithms
   B1. Steepest Descent
   B2. Conjugate Gradients
   B3. Preconditioned Conjugate Gradients
   B4. Nonlinear Conjugate Gradients with Newton-Raphson and Fletcher-Reeves
   B5. Preconditioned Nonlinear Conjugate Gradients with Secant and Polak-Ribière
C. Ugly Proofs
   C1. The Solution to Ax = b Minimizes the Quadratic Form
   C2. A Symmetric Matrix Has n Orthogonal Eigenvectors
   C3. Convergence of Steepest Descent
   C4. Optimality of Chebyshev Polynomials


List of Figures

1. Sample two-dimensional linear system and its solution
2. Graph of a quadratic form
3. Contours of a quadratic form
4. Gradient of a quadratic form
5. Quadratic forms for positive-definite, negative-definite, singular, and indefinite matrices (four illustrations)
6. The method of Steepest Descent (four illustrations)
7. On the search line, f is minimized where the gradient is orthogonal to the search line
8. Convergence of Steepest Descent
9. Converging eigenvector
10. Diverging eigenvector
11. A vector can be expressed as a linear combination of eigenvectors
12. The eigenvectors are directed along the axes of the paraboloid defined by the quadratic form
13. Convergence of the Jacobi Method (six illustrations)
14. Steepest Descent converges to the exact solution on the first iteration if the error term is an eigenvector
15. Steepest Descent converges to the exact solution on the first iteration if the eigenvalues are all equal
16. The energy norm
17. Convergence of Steepest Descent as a function of the slope and the condition number
18. These four examples represent points near the corresponding four corners of the graph of ω (four illustrations)
19. Starting points that give the worst convergence for Steepest Descent
20. Convergence of Steepest Descent worsens as the condition number of the matrix increases
21. Method of Orthogonal Directions
22. A-orthogonal vectors (two illustrations)
23. The method of Conjugate Directions converges in n steps (two illustrations)
24. Gram-Schmidt conjugation
25. The method of Conjugate Directions using the axial unit vectors, also known as Gaussian elimination
26. The search directions span the same subspace as the vectors from which they were constructed
27. Conjugate Gradient search directions span the same subspace as the residuals
28. The method of Conjugate Gradients
29. CG minimizes the energy norm of the error at each step
30. The convergence of CG depends on how close a polynomial can be to zero on each eigenvalue (four illustrations)
31. Chebyshev polynomials of degree 2, 5, 10, and 49
32. Optimal polynomial of degree 2
33. Convergence of Conjugate Gradients as a function of condition number
34. Number of iterations of Steepest Descent required to match one iteration of CG
35. Contour lines of the quadratic form of the diagonally preconditioned sample problem
36. Convergence of the nonlinear Conjugate Gradient Method (four illustrations)
37. Nonlinear CG can be more effective with periodic restarts
38. The Newton-Raphson method
39. The Secant method
40. The preconditioned nonlinear Conjugate Gradient Method


About this Article

An electronic copy of this article is available by anonymous FTP to REPORTS.ADM.CS.CMU.EDU (IP address 128.2.218.42) under the filename 1994/CMU-CS-94-125.ps. A PostScript file containing full-page copies of the figures herein, suitable for transparencies, is available electronically on request from the author (jrs@cs.cmu.edu). Most of the illustrations were created using Mathematica.

© 1994 by Jonathan Richard Shewchuk. This article may be freely duplicated and distributed so long as no consideration is received in return, and this copyright notice remains intact.

This guide was created to help students learn Conjugate Gradient Methods as easily as possible. Please mail me (jrs@cs.cmu.edu) comments, corrections, and any intuitions I might have missed; some of these will be incorporated into a second edition. I am particularly interested in hearing about use of this guide for classroom teaching.

For those who wish to learn more about iterative methods, I recommend William L. Briggs' "A Multigrid Tutorial" [2], one of the best-written mathematical books I have read.

Special thanks to Omar Ghattas, who taught me much of what I know about numerical methods, and provided me with extensive comments on the first draft of this article. Thanks also to David O'Hallaron, James Stichnoth, and Daniel Tunkelang for their comments.

To help you skip chapters, here's a dependence graph of the sections:

[Figure: dependence graph of the sections, indicating which of Sections 1 through 14 (Introduction, Notation, the Quadratic Form, Steepest Descent, Eigenvectors, SD Convergence, Conjugate Directions, Conjugate Gradients, CG Convergence, Complexity, Starting and Stopping, Preconditioning, the Normal Equations, and Nonlinear CG) depend on which.]

This article is dedicated to every mathematician who uses figures as abundantly as I have herein.


1. Introduction

When I decided to learn the Conjugate Gradient Method (henceforth, CG), I read four different descriptions, which I shall politely not identify. I understood none of them. By the end of the last, I swore in my rage that if ever I unlocked the secrets of CG, I should guard them as jealously as my intellectual ancestors. Foolishly, I wrote this article instead.

CG is the most popular iterative method for solving large systems of linear equations. CG is effective for systems of the form

\[ Ax = b \tag{1} \]

where x is an unknown vector, b is a known vector, and A is a known, square, symmetric, positive-definite (or positive-indefinite) matrix. (Don't worry if you've forgotten what "positive-definite" means; we shall review it.) These systems arise in many important settings, such as finite difference and finite element methods for solving partial differential equations, structural analysis, circuit analysis, and math homework.

Iterative methods like CG are suited for use with sparse matrices. If A is dense, your best course of action is probably to factor A and solve the equation by backsubstitution. The time spent factoring a dense A is roughly equivalent to the time spent solving the system iteratively; and once A is factored, the system can be backsolved quickly for multiple values of b. Compare this dense matrix with a sparse matrix of larger size that fills the same amount of memory. The triangular factors of a sparse A usually have many more nonzero elements than A itself. Factoring may be impossible due to limited memory, and will be time-consuming as well; even the backsolving step may be slower than iterative solution. On the other hand, most iterative methods are memory-efficient and run quickly with sparse matrices.

I assume that you have taken a first course in linear algebra, and that you have a solid understanding of matrix multiplication and linear independence, although you probably don't remember what those eigenthingies were all about. From this foundation, I shall build the edifice of CG as clearly as I can.

2. Notation

Before we begin, a few definitions and notes on notation are in order.

With a few exceptions, I shall use capital letters to denote matrices, lower case letters to denote vectors, and Greek letters to denote scalars. A is an n × n matrix, and x and b are vectors - that is, n × 1 matrices. Equation 1, written out fully, looks like this:

\[
\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1n} \\
A_{21} & A_{22} & \cdots & A_{2n} \\
\vdots &        & \ddots & \vdots \\
A_{n1} & A_{n2} & \cdots & A_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.
\]

The inner product of two vectors is written x^T y, and represents the scalar sum \(\sum_{i=1}^{n} x_i y_i\). Note that x^T y = y^T x. If x and y are orthogonal, then x^T y = 0. In general, expressions that reduce to 1 × 1 matrices, such as x^T y and x^T A x, are treated as scalar values.


Figure 1: Sample two-dimensional linear system. The solution lies at the intersection of the lines.

A matrix A is positive-definite if, for every nonzero vector x,

\[ x^T A x > 0. \tag{2} \]

This may mean little to you, but don't feel bad; it's not a very intuitive idea, and it's hard to imagine how a matrix that is positive-definite might look different from one that isn't. We will get a feeling for what positive-definiteness is about when we see how it affects the shape of quadratic forms.

Finally, don't forget the important basic identities (AB)^T = B^T A^T and (AB)^{-1} = B^{-1} A^{-1}.

3. The Quadratic Form

A quadratic form is simply a scalar, quadratic function of a vector with the form

\[ f(x) = \tfrac{1}{2} x^T A x - b^T x + c \tag{3} \]

where A is a matrix, x and b are vectors, and c is a scalar constant. I shall show shortly that if A is symmetric and positive-definite, f(x) is minimized by the solution to Ax = b.

Throughout this paper, I will demonstrate ideas with the simple sample problem

\[ A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}, \qquad b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}, \qquad c = 0. \tag{4} \]

The system Ax = b is illustrated in Figure 1. In general, the solution x lies at the intersection point of n hyperplanes, each having dimension n - 1. The solution in this case is x = [2, -2]^T. The corresponding quadratic form f(x) appears in Figure 2. A contour plot of f(x) is illustrated in Figure 3. Because A is positive-definite, the surface defined by f(x) is shaped like a paraboloid bowl. (I'll have more to say about this in a moment.)


Figure 2: Graph of a quadratic form f(x). The minimum point of this surface is the solution to Ax = b.


Figure 3: Contours of the quadratic form. Each ellipsoidal curve has constant f(x).


Figure 4: Gradient f'(x) of the quadratic form. For every x, the gradient points in the direction of steepest increase of f(x), and is orthogonal to the contour lines.


The gradient of a quadratic form is defined to be

\[
f'(x) = \begin{bmatrix} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \\ \vdots \\ \frac{\partial}{\partial x_n} f(x) \end{bmatrix}. \tag{5}
\]

The gradient is a vector field that, for a given point x, points in the direction of greatest increase of f(x). Figure 4 illustrates the gradient vectors for Equation 3 with the constants given in Equation 4. At the bottom of the paraboloid bowl, the gradient is zero. One can minimize f(x) by setting f'(x) equal to zero.

With a little bit of tedious math, one can apply Equation 5 to Equation 3, and derive

\[ f'(x) = \tfrac{1}{2} A^T x + \tfrac{1}{2} A x - b. \tag{6} \]

If A is symmetric, this equation reduces to

\[ f'(x) = Ax - b. \tag{7} \]

Setting the gradient to zero, we obtain Equation 1, the linear system we wish to solve. Therefore, the solution to Ax = b is a critical point of f(x). If A is positive-definite as well as symmetric, then this


Figure 5: (a) Quadratic form for a positive-definite matrix. (b) For a negative-definite matrix. (c) For a singular (and positive-indefinite) matrix. A line that runs through the bottom of the valley is the set of solutions. (d) For an indefinite matrix. Because the solution is a saddle point, Steepest Descent and CG will not work. In three dimensions or higher, a singular matrix can also have a saddle.

solution is a minimum of f(x), so Ax = b can be solved by finding an x that minimizes f(x). (If A is not symmetric, then Equation 6 hints that CG will find a solution to the system ½(A^T + A)x = b. Note that ½(A^T + A) is symmetric.)
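To make Equations 3 and 7 concrete, here is a small numerical check on the sample problem of Equation 4 (a minimal sketch assuming NumPy; the helper names f and grad_f are mine, not the paper's): the gradient vanishes at x = [2, -2]^T, and f is larger at nearby points, as a minimum demands.

```python
import numpy as np

# Sample problem from Equation 4.
A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])
c = 0.0

def f(x):
    """Quadratic form of Equation 3: f(x) = 1/2 x^T A x - b^T x + c."""
    return 0.5 * x @ A @ x - b @ x + c

def grad_f(x):
    """Gradient of Equation 7 (valid because A is symmetric): f'(x) = Ax - b."""
    return A @ x - b

x_star = np.linalg.solve(A, b)       # expected [2, -2]
print(x_star, grad_f(x_star))        # gradient ~ [0, 0]: the solution is a critical point
for p in [x_star + [0.1, 0.0], x_star + [0.0, -0.1], np.array([-2.0, -2.0])]:
    assert f(p) > f(x_star)          # and, A being positive-definite, a minimum
```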

Why do symmetric positive-definite matrices have this nice property? Consider the relationship between f at some arbitrary point p and at the solution x = A^{-1}b. From Equation 3 one can show (Appendix C1) that if A is symmetric (be it positive-definite or not),

\[ f(p) = f(x) + \tfrac{1}{2}(p - x)^T A (p - x). \tag{8} \]

If A is positive-definite as well, then by Property 2, the latter term is positive for all p ≠ x. It follows that x is a global minimum of f.

The fact that f(x) is a paraboloid is our best intuition of what it means for a matrix to be positive-definite. If A is not positive-definite, there are several other possibilities. A could be negative-definite - the result of negating a positive-definite matrix (see Figure 2, but hold it upside-down). A might be singular, in which case no solution is unique; the set of solutions is a line or hyperplane having a uniform value for f. If A is none of the above, then x is a saddle point, and techniques like Steepest Descent and CG will likely fail. Figure 5 demonstrates the possibilities. The value of c determines where the minimum point of the paraboloid lies, but does not affect the paraboloid's shape.

Why go to the trouble of converting the linear system into a tougher-looking problem? The methods under study - Steepest Descent and CG - were developed and are intuitively understood in terms of minimization problems like Figure 2, not in terms of intersecting hyperplanes such as Figure 1.


4. The Method of Steepest Descent

In the method of Steepest Descent, we start at an arbitrary point x(0) and slide down to the bottom of the paraboloid. We take a series of steps x(1), x(2), ... until we are satisfied that we are close enough to the solution x.

When we take a step, we choose the direction in which f decreases most quickly, which is the direction opposite of f'(x(i)). According to Equation 7, this direction is -f'(x(i)) = b - Ax(i).

Allow me to introduce a few definitions, which you should memorize. The error e(i) = x(i) - x is a vector that indicates how far we are from the solution. The residual r(i) = b - Ax(i) indicates how far we are from the correct value of b. It is easy to see that r(i) = -Ae(i), and you should think of the residual as being the error transformed by A into the same space as b. More importantly, r(i) = -f'(x(i)), and you should also think of the residual as the direction of steepest descent. For nonlinear problems, discussed in Section 14, only the latter definition applies. So remember, whenever you read "residual", think "direction of steepest descent."

Suppose we start at x(0) = [-2, -2]^T. Our first step, along the direction of steepest descent, will fall somewhere on the solid line in Figure 6(a). In other words, we will choose a point

\[ x_{(1)} = x_{(0)} + \alpha r_{(0)}. \tag{9} \]

The question is, how big a step should we take?

A line search is a procedure that chooses α to minimize f along a line. Figure 6(b) illustrates this task: we are restricted to choosing a point on the intersection of the vertical plane and the paraboloid. Figure 6(c) is the parabola defined by the intersection of these surfaces. What is the value of α at the base of the parabola?

α minimizes f when the directional derivative \(\frac{d}{d\alpha} f(x_{(1)})\) is equal to zero. By the chain rule,

\[ \frac{d}{d\alpha} f(x_{(1)}) = f'(x_{(1)})^T \frac{d}{d\alpha} x_{(1)} = f'(x_{(1)})^T r_{(0)}. \]

Setting this expression to zero, we find that α should be chosen so that r(0) and f'(x(1)) are orthogonal (see Figure 6(d)).

There is an intuitive reason why we should expect these vectors to be orthogonal at the minimum. Figure 7 shows the gradient vectors at various points along the search line. The slope of the parabola (Figure 6(c)) at any point is equal to the magnitude of the projection of the gradient onto the line (Figure 7, dotted arrows). These projections represent the rate of increase of f as one traverses the search line. f is minimized where the projection is zero - where the gradient is orthogonal to the search line.

To determine α, note that f'(x(1)) = -r(1), and we have

\[
\begin{aligned}
r_{(1)}^T r_{(0)} &= 0 \\
(b - A x_{(1)})^T r_{(0)} &= 0 \\
(b - A(x_{(0)} + \alpha r_{(0)}))^T r_{(0)} &= 0 \\
(b - A x_{(0)})^T r_{(0)} - \alpha (A r_{(0)})^T r_{(0)} &= 0 \\
(b - A x_{(0)})^T r_{(0)} &= \alpha (A r_{(0)})^T r_{(0)} \\
r_{(0)}^T r_{(0)} &= \alpha\, r_{(0)}^T (A r_{(0)}) \\
\alpha &= \frac{r_{(0)}^T r_{(0)}}{r_{(0)}^T A r_{(0)}}.
\end{aligned}
\]
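As a quick sanity check of this line search (a sketch assuming NumPy; the variable names are mine), one step from x(0) = [-2, -2]^T on the sample problem lands at a point whose gradient is orthogonal to r(0), exactly as Figure 6(d) illustrates.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])   # sample problem, Equation 4
b = np.array([2.0, -8.0])

x0 = np.array([-2.0, -2.0])
r0 = b - A @ x0                           # residual = direction of steepest descent
alpha = (r0 @ r0) / (r0 @ (A @ r0))       # the step length just derived
x1 = x0 + alpha * r0

grad1 = A @ x1 - b                        # f'(x(1)), by Equation 7
print(alpha, x1)
print(grad1 @ r0)                         # ~0: the new gradient is orthogonal to r(0)
```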


Figure 6: The method of Steepest Descent. (a) Starting at [-2, -2]^T, take a step in the direction of steepest descent of f. (b) Find the point on the intersection of these two surfaces that minimizes f. (c) This parabola is the intersection of surfaces. The bottommost point is our target. (d) The gradient at the bottommost point is orthogonal to the gradient of the previous step.


Figure 7: The gradient f' is shown at several locations along the search line (solid arrows). Each gradient's projection onto the line is also shown (dotted arrows). The gradient vectors represent the direction of steepest increase of f, and the projections represent the rate of increase as one traverses the search line. On the search line, f is minimized where the gradient is orthogonal to the search line.


Figure 8: Here, the method of Steepest Descent starts at [-2, -2]^T and converges at [2, -2]^T.

Putting it all together, the method of Steepest Descent is:

\[
\begin{aligned}
r_{(i)} &= b - A x_{(i)}, & (10) \\
\alpha_{(i)} &= \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}, & (11) \\
x_{(i+1)} &= x_{(i)} + \alpha_{(i)} r_{(i)}. & (12)
\end{aligned}
\]

The example is run until it converges in Figure 8. Note that the gradient is always orthogonal to the gradient of the previous step.

The algorithm, as written above, requires two matrix-vector multiplications per iteration. In general, the computational cost of iterative algorithms is dominated by matrix-vector products; fortunately, one can be eliminated. By premultiplying both sides of Equation 12 by -A and adding b, we have

\[ r_{(i+1)} = r_{(i)} - \alpha_{(i)} A r_{(i)}. \tag{13} \]

Although Equation 10 is needed to compute r(0), Equation 13 can be used for every iteration thereafter. The product Ar, which occurs in both Equations 11 and 13, need only be computed once. The disadvantage of using this recurrence is that the sequence defined by Equation 13 is generated without any feedback from the value of x(i), so that an accumulation of floating point roundoff error may cause x(i) to converge to some point near x. This effect can be avoided by periodically using Equation 10 to recompute the correct residual.
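Putting Equations 10 through 13 together, here is a sketch of the full iteration (assuming NumPy; the tolerance, iteration cap, and the choice to recompute the exact residual every 50 steps are my own illustrative choices, not prescriptions from the text).

```python
import numpy as np

def steepest_descent(A, b, x, tol=1e-10, max_iter=1000, recompute_every=50):
    """Method of Steepest Descent (Equations 10-13) with one matrix-vector
    product per iteration; the exact residual is recomputed periodically to
    flush accumulated floating-point error."""
    r = b - A @ x                       # Equation 10
    for i in range(max_iter):
        Ar = A @ r                      # the only matrix-vector product per iteration
        alpha = (r @ r) / (r @ Ar)      # Equation 11
        x = x + alpha * r               # Equation 12
        if (i + 1) % recompute_every == 0:
            r = b - A @ x               # exact residual (Equation 10)
        else:
            r = r - alpha * Ar          # recurrence (Equation 13)
        if np.linalg.norm(r) < tol:
            break
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(steepest_descent(A, b, np.array([-2.0, -2.0])))   # -> approximately [2, -2]
```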

Before analyzing the convergence of Steepest Descent, I must digress to ensure that you have a solid understanding of eigenvectors.


5. Thinking with Eigenvectors and Eigenvalues

After my one course in linear algebra, I knew eigenvectors and eigenvalues like the back of my head. If your instructor was anything like mine, you recall solving problems involving eigendoohickeys, but you never really understood them. Unfortunately, without an intuitive grasp of them, CG won't make sense either. If you're already eigentalented, feel free to skip this section.

Eigenvectors are used primarily as an analysis tool; Steepest Descent and CG do not calculate the value of any eigenvectors as part of the algorithm.¹

5.1. Eigen do it if I try

An eigenvector v of a matrix B is a nonzero vector that does not rotate when B is applied to it (except perhaps to point in precisely the opposite direction). v may change length or reverse its direction, but it won't turn sideways. In other words, there is some scalar constant λ such that Bv = λv. The value λ is an eigenvalue of B. For any constant a, the vector av is also an eigenvector with eigenvalue λ, because B(av) = aBv = λav. In other words, if you scale an eigenvector, it's still an eigenvector.

Why should you care? Iterative methods often depend on applying B to a vector over and over again. When B is repeatedly applied to an eigenvector v, one of two things can happen. If |λ| < 1, then B^i v = λ^i v will vanish as i approaches infinity (Figure 9). If |λ| > 1, then B^i v will grow to infinity (Figure 10). Each time B is applied, the vector grows or shrinks according to the value of |λ|.


Figure 9: v is an eigenvector of B with a corresponding eigenvalue of -0.5. As i increases, B^i v converges to zero.


Figure 10: Here, v has a corresponding eigenvalue of 2. As i increases, B^i v diverges to infinity.

¹However, there are practical applications for eigenvectors. The eigenvectors of the stiffness matrix associated with a discretized structure of uniform density represent the natural modes of vibration of the structure being studied. For instance, the eigenvectors of the stiffness matrix associated with a one-dimensional uniformly-spaced mesh are sine waves, and to express vibrations as a linear combination of these eigenvectors is equivalent to performing a discrete Fourier transform.


If B is nonsingular, then there exists a set of n linearly independent eigenvectors of B, denoted v_1, v_2, ..., v_n. This set is not unique, because each eigenvector can be scaled by an arbitrary nonzero constant. Each eigenvector has a corresponding eigenvalue, denoted λ_1, λ_2, ..., λ_n. These are uniquely defined for a given matrix. The eigenvalues may or may not be equal to each other; for instance, the eigenvalues of the identity matrix I are all one, and every nonzero vector is an eigenvector of I.

What if B is applied to a vector that is not an eigenvector? A very important skill in understanding linear algebra - the skill this section is written to teach - is to think of a vector as a sum of other vectors whose behavior is understood. Consider that the set of eigenvectors {v_i} forms a basis for R^n (because a nonsingular B has n eigenvectors that are linearly independent). Any n-dimensional vector can be expressed as a linear combination of eigenvectors, and because matrix multiplication is distributive, one can examine the effect of B on each eigenvector separately.

In Figure 11, a vector x is illustrated as a sum of two eigenvectors v_1 and v_2. Applying B to x is equivalent to applying B to the eigenvectors, and summing the result. Upon repeated application, we have B^i x = B^i v_1 + B^i v_2 = λ_1^i v_1 + λ_2^i v_2. If the magnitudes of all the eigenvalues are smaller than one, B^i x will converge to zero, because the eigenvectors which compose x converge to zero when B is repeatedly applied. If one of the eigenvalues has magnitude greater than one, x will diverge to infinity. This is why numerical analysts attach importance to the spectral radius of a matrix:

numerical analysts attach importance to the spectral radius of a matrix:

\[ \rho(B) = \max_i |\lambda_i|, \qquad \lambda_i \text{ is an eigenvalue of } B. \]

If we want x to converge to zero quickly, ρ(B) should be less than one, and preferably as small as possible.


Figure 11: The vector x (solid arrow) can be expressed as a linear combination of eigenvectors (dashed arrows), whose associated eigenvalues are λ_1 = 0.7 and λ_2 = -2. The effect of repeatedly applying B to x is best understood by examining the effect of B on each eigenvector. When B is repeatedly applied, one eigenvector converges to zero while the other diverges; hence, B^i x also diverges.
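To make Figure 11 concrete, the sketch below (NumPy; the particular matrix B and vector x are ones I constructed so that B has the eigenvalues 0.7 and -2 quoted in the caption, so they are illustrative only) expresses a vector in the eigenvector basis and applies B repeatedly: the component with |λ| < 1 dies out while the other grows.

```python
import numpy as np

# Build an illustrative B with eigenvalues 0.7 and -2 (the values in Figure 11).
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])              # columns are the chosen eigenvectors
B = V @ np.diag([0.7, -2.0]) @ np.linalg.inv(V)

x = np.array([2.0, 3.0])
coeffs = np.linalg.solve(V, x)          # x as a linear combination of the eigenvectors
print(coeffs)

for i in range(1, 6):
    x = B @ x
    # B^i x = 0.7^i * c1 * v1 + (-2)^i * c2 * v2
    predicted = V @ (np.array([0.7, -2.0]) ** i * coeffs)
    assert np.allclose(x, predicted)
print(x)   # dominated by the eigenvector whose eigenvalue has magnitude 2
```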

Here's a useful fact: the eigenvalues of a positive-definite matrix are all positive. This fact can be proven from the definition of eigenvalue:

\[
\begin{aligned}
Bv &= \lambda v \\
v^T B v &= \lambda\, v^T v.
\end{aligned}
\]

By the definition of positive-definite, the left-hand term is positive (for nonzero v). Hence, λ must be positive also.

5.2. Jacobi iterations

Of course, a procedure that always converges to zero isn't going to help you attract friends. Consider a more useful procedure: the Jacobi Method for solving Ax = b. The matrix A is split into two parts: D, whose diagonal elements are identical to those of A, and whose off-diagonal elements are zero; and E, whose diagonal elements are zero, and whose off-diagonal elements are identical to those of A. Thus, A = D + E. We derive the Jacobi Method:

\[
\begin{aligned}
Ax &= b \\
Dx &= -Ex + b \\
x &= -D^{-1}Ex + D^{-1}b \\
x &= Bx + z, \quad \text{where } B = -D^{-1}E, \quad z = D^{-1}b. \qquad (14)
\end{aligned}
\]

Note that because D is diagonal, it is easy to invert. This identity can be converted into an iterative method by forming the recurrence

\[ x_{(i+1)} = B x_{(i)} + z. \tag{15} \]

Given a starting vector x(0), this formula generates a sequence of vectors. Our hope is that each successive vector will be closer to the solution x than the last. x is called a stationary point of Equation 15, because if x(i) = x, then x(i+1) will also equal x.

Now, this derivation may seem quite arbitrary to you, and you're right. We could have formed any number of identities for x instead of Equation 14. In fact, simply by splitting A differently - that is, by choosing a different D and E - we could have derived the Gauss-Seidel method, or the method of Successive Over-Relaxation (SOR). Our hope is that we have chosen a splitting for which B has a small spectral radius. I chose the Jacobi splitting arbitrarily for simplicity.

Suppose we start with some arbitrary vector x(0). For each iteration, we apply B to this vector, then add z to the result. What does each iteration do?

Again, apply the principle of thinking of a vector as a sum of other, well-understood vectors. Express each iterate x(i) as the sum of the exact solution x and the error term e(i). Then, Equation 15 becomes

\[
\begin{aligned}
x_{(i+1)} &= B x_{(i)} + z \\
&= B(x + e_{(i)}) + z \\
&= Bx + z + B e_{(i)} \\
&= x + B e_{(i)} \qquad \text{(by Equation 14),} \\
\therefore\; e_{(i+1)} &= B e_{(i)}. \qquad (16)
\end{aligned}
\]

Each iteration does not affect the "correct part" of x(i) (because x is a stationary point); but each iteration does affect the error term. It is apparent from Equation 16 that if ρ(B) < 1, then the error term e(i) will converge to zero as i approaches infinity. Hence, the initial vector x(0) has no effect on the inevitable outcome!

Of course, the choice of x(0) does affect the number of iterations required to converge to x within a given tolerance. However, its effect is less important than that of the spectral radius ρ(B), which determines the speed of convergence. Suppose that v_j is the eigenvector of B with the largest eigenvalue (so that ρ(B) = λ_j). If the initial error e(0), expressed as a linear combination of eigenvectors, includes a component in the direction of v_j, this component will be the slowest to converge. The convergence of the Jacobi Method can be described as follows:

\[ \|e_{(i)}\| \le \rho(B)^i\, \|e_{(0)}\|, \]

where ‖e‖ = (e^T e)^{1/2} is the Euclidean norm (length) of e. (In fact, the inequality holds for any norm.)


Figure 12: The eigenvectors of A are directed along the axes of the paraboloid defined by the quadratic form f(x). Each eigenvector is labeled with its associated eigenvalue. Each eigenvalue is proportional to the steepness of the corresponding slope.

The convergence of the Jacobi Method depends on ρ(B), which depends on A. Unfortunately, Jacobi does not converge for every A, or even for every positive-definite A.

5.3. A Concrete Example

To demonstrate these ideas, I shall solve the system specified by Equation 4. First, we need a method of finding eigenvalues and eigenvectors. By definition, for any eigenvector v with eigenvalue λ,

\[
\begin{aligned}
Av &= \lambda v = \lambda I v \\
(\lambda I - A)v &= 0.
\end{aligned}
\]

Eigenvectors are nonzero, so λI - A must be singular. Then,

\[ \det(\lambda I - A) = 0. \]

The determinant of λI - A is called the characteristic polynomial. It is an n-degree polynomial in λ whose roots are the set of eigenvalues. The characteristic polynomial of A (from Equation 4) is

\[ \det\begin{bmatrix} \lambda - 3 & -2 \\ -2 & \lambda - 6 \end{bmatrix} = \lambda^2 - 9\lambda + 14 = (\lambda - 7)(\lambda - 2), \]

and the eigenvalues are 7 and 2. To find the eigenvector associated with λ = 7,

\[
(\lambda I - A)v = \begin{bmatrix} 4 & -2 \\ -2 & 1 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = 0
\quad\Longrightarrow\quad 4v_1 - 2v_2 = 0.
\]


Any solution to this equation is an eigenvector; say, v = [1, 2]^T. By the same method, we find that [-2, 1]^T is an eigenvector corresponding to the eigenvalue 2. In Figure 12, we see that these eigenvectors coincide with the axes of the familiar ellipsoid, and that a larger eigenvalue corresponds to a steeper slope. (Negative eigenvalues indicate that f decreases along the axis, as in Figures 5(b) and 5(d).)

Now, let's see the Jacobi Method in action. Using the constants specified by Equation 4, we have

\[
x_{(i+1)} = -\begin{bmatrix} 0 & \tfrac{2}{3} \\ \tfrac{1}{3} & 0 \end{bmatrix} x_{(i)} + \begin{bmatrix} \tfrac{2}{3} \\ -\tfrac{4}{3} \end{bmatrix}.
\]

The eigenvectors of B are [√2, 1]^T with eigenvalue -√2/3, and [-√2, 1]^T with eigenvalue √2/3. These are graphed in Figure 13(a); note that they do not coincide with the eigenvectors of A, and are not related to the axes of the paraboloid.

Figure 13(b) shows the convergence of the Jacobi Method. The mysterious path the algorithm takes can be understood by watching the eigenvector components of each successive error term (Figures 13(c), (d), and (e)). Figure 13(f) plots the eigenvector components as arrowheads. These are converging normally at the rate defined by their eigenvalues, as in Figure 11.
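Here is a sketch of the Jacobi iteration just described, applied to the sample problem (assuming NumPy; the iteration count is an arbitrary illustrative choice). It also checks that the eigenvalues of A are 7 and 2, and that the spectral radius of B is √2/3, so the error shrinks by roughly that factor per iteration.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

print(np.linalg.eigvalsh(A))            # [2, 7], as found from the characteristic polynomial

D = np.diag(np.diag(A))                 # diagonal part of A
E = A - D                               # off-diagonal part, so A = D + E
B = -np.linalg.solve(D, E)              # B = -D^{-1} E      (Equation 14)
z = np.linalg.solve(D, b)               # z = D^{-1} b

print(np.max(np.abs(np.linalg.eigvals(B))))   # spectral radius: sqrt(2)/3 ~ 0.471

x = np.array([-2.0, -2.0])
for i in range(40):
    x = B @ x + z                       # Equation 15
print(x)                                 # converges toward [2, -2]
```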

I hope that this section has convinced you that eigenvectors are useful tools, and not just bizarre torture devices inflicted upon you by your professors for the pleasure of watching you suffer (although the latter is a nice fringe benefit).

6. Convergence Analysis of Steepest Descent

6.1. Instant Results

To understand the convergence of Steepest Descent, let's first consider the case where e(i) is an eigenvector with eigenvalue λ_e. Then, the residual r(i) = -Ae(i) = -λ_e e(i) is also an eigenvector. Equation 12 gives

\[
\begin{aligned}
e_{(i+1)} &= e_{(i)} + \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}\, r_{(i)} \\
&= e_{(i)} + \frac{\lambda_e^2\, e_{(i)}^T e_{(i)}}{\lambda_e^3\, e_{(i)}^T e_{(i)}}\, (-\lambda_e e_{(i)}) \\
&= 0.
\end{aligned}
\]

Figure 14 demonstrates why it takes only one step to converge to the exact solution. The point x(i) lies on one of the axes of the ellipsoid, and so the residual points directly to the center of the ellipsoid. Choosing α(i) = λ_e^{-1} gives us instant convergence.
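A quick numerical confirmation (assuming NumPy; the starting point is one I chose so that the initial error is the eigenvector [1, 2]^T of A found in Section 5.3): one Steepest Descent step, with α equal to the reciprocal eigenvalue, lands exactly on the solution.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x_exact = np.array([2.0, -2.0])

v = np.array([1.0, 2.0])                # eigenvector of A with eigenvalue 7
x = x_exact + v                         # so the error e(0) = x - x_exact is an eigenvector

r = b - A @ x                           # residual, also an eigenvector (r = -7 e)
alpha = (r @ r) / (r @ (A @ r))         # equals 1/7, the reciprocal eigenvalue
x = x + alpha * r
print(alpha, x)                          # 0.142857..., [2, -2]: converged in one step
```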

For a more general analysis, we must express e(i) as a linear combination of eigenvectors, and we shall furthermore require these eigenvectors to be orthonormal. It is proven in Appendix C2 that if A is nonsingular and symmetric, there exists a set of n orthogonal eigenvectors of A. As we can scale


Figure 13: Convergence of the Jacobi Method. (a) The eigenvectors of B are shown with their corresponding eigenvalues. Unlike the eigenvectors of A, these eigenvectors are not the axes of the paraboloid. (b) The Jacobi Method starts at [-2, -2]^T and converges at [2, -2]^T. (c, d, e) The error vectors e(0), e(1), e(2) (solid arrows) and their eigenvector components (dashed arrows). (f) Arrowheads represent the eigenvector components of the first four error vectors. Each eigenvector component of the error is converging to zero at the expected rate based upon its eigenvalue.


Figure 14: Steepest Descent converges to the exact solution on the first iteration if the error term is an eigenvector.

eigenvectors arbitrarily, let us choose so that each eigenvector is of unit length. This choice gives us the useful property that

\[
v_j^T v_k = \begin{cases} 1, & j = k, \\ 0, & j \ne k. \end{cases} \tag{17}
\]

Express the error term as a linear combination of eigenvectors:

\[
e_{(i)} = \sum_{j=1}^{n} \xi_j v_j, \tag{18}
\]

where ξ_j is the length of each component of e(i). From Equations 17 and 18 we have the following identities:

\[
\begin{aligned}
r_{(i)} &= -A e_{(i)} = -\sum_j \xi_j \lambda_j v_j, & (19) \\
\|e_{(i)}\|^2 = e_{(i)}^T e_{(i)} &= \sum_j \xi_j^2, & (20) \\
e_{(i)}^T A e_{(i)} &= \Bigl(\sum_j \xi_j v_j^T\Bigr)\Bigl(\sum_j \xi_j \lambda_j v_j\Bigr) = \sum_j \xi_j^2 \lambda_j, & (21) \\
\|r_{(i)}\|^2 = r_{(i)}^T r_{(i)} &= \sum_j \xi_j^2 \lambda_j^2, & (22) \\
r_{(i)}^T A r_{(i)} &= \sum_j \xi_j^2 \lambda_j^3. & (23)
\end{aligned}
\]

Equation 19 shows that r(i) too can be expressed as a sum of eigenvector components, and the length of these components are -ξ_j λ_j. Equations 20 and 22 are just Pythagoras' Law.


Figure 15: Steepest Descent converges to the exact solution on the first iteration if the eigenvalues are all equal.

Now we can proceed with the analysis. Equation 12 gives

\[
\begin{aligned}
e_{(i+1)} &= e_{(i)} + \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}\, r_{(i)} \\
&= e_{(i)} + \frac{\sum_j \xi_j^2 \lambda_j^2}{\sum_j \xi_j^2 \lambda_j^3}\, r_{(i)}. \qquad (24)
\end{aligned}
\]

We saw in the last example that, if e(i) has only one eigenvector component, then convergence is achieved in one step by choosing α(i) = λ_e^{-1}. Now let's examine the case where e(i) is arbitrary, but all the eigenvectors have a common eigenvalue λ. Equation 24 becomes

\[
\begin{aligned}
e_{(i+1)} &= e_{(i)} + \frac{\lambda^2 \sum_j \xi_j^2}{\lambda^3 \sum_j \xi_j^2}\, (-\lambda e_{(i)}) \\
&= 0.
\end{aligned}
\]

Figure 15 demonstrates why, once again, there is instant convergence. Because all the eigenvalues are equal, the ellipsoid is spherical; hence, no matter what point we start at, the residual must point to the center of the sphere. As before, choose α(i) = λ^{-1}.

However, if there are several unequal, nonzero eigenvalues, then no choice of α(i) will eliminate all the eigenvector components, and our choice becomes a sort of compromise. In fact, the fraction in Equation 24 is best thought of as a weighted average of the values of λ_j^{-1}. The weights ξ_j^2 ensure that longer components


Figure 16: The energy norm of these two vectors is equal.

of e(i) are given precedence. As a result, on any given iteration, some of the shorter components of e(i) might actually increase in length (though never for long). For this reason, the methods of Steepest Descent and Conjugate Gradients are called roughers. By contrast, the Jacobi Method is a smoother, because every eigenvector component is reduced on every iteration. Steepest Descent and Conjugate Gradients are not smoothers, although they are often erroneously identified as such in the mathematical literature.

6.2. General Convergence

To bound the convergence of Steepest Descent in the general case, we shall define the energy norm ‖e‖_A = (e^T A e)^{1/2} (see Figure 16). This norm is easier to work with than the Euclidean norm we used to analyze the Jacobi Method. It is in some sense a more natural norm, because a bound on the energy norm of the error term corresponds to a bound on the optimality of f(x). With this norm, we have

\[
\begin{aligned}
\|e_{(i+1)}\|_A^2 &= e_{(i+1)}^T A e_{(i+1)} \\
&= (e_{(i)} + \alpha_{(i)} r_{(i)})^T A (e_{(i)} + \alpha_{(i)} r_{(i)}) \qquad \text{(by Equation 12)} \\
&= e_{(i)}^T A e_{(i)} + 2\alpha_{(i)}\, r_{(i)}^T A e_{(i)} + \alpha_{(i)}^2\, r_{(i)}^T A r_{(i)} \qquad \text{(by symmetry of } A\text{)} \\
&= \|e_{(i)}\|_A^2 + 2\,\frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}\bigl(-r_{(i)}^T r_{(i)}\bigr) + \left(\frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}\right)^2 r_{(i)}^T A r_{(i)} \\
&= \|e_{(i)}\|_A^2 - \frac{(r_{(i)}^T r_{(i)})^2}{r_{(i)}^T A r_{(i)}} \\
&= \|e_{(i)}\|_A^2 \left(1 - \frac{(r_{(i)}^T r_{(i)})^2}{(r_{(i)}^T A r_{(i)})(e_{(i)}^T A e_{(i)})}\right) \\
&= \|e_{(i)}\|_A^2 \left(1 - \frac{(\sum_j \xi_j^2 \lambda_j^2)^2}{(\sum_j \xi_j^2 \lambda_j^3)(\sum_j \xi_j^2 \lambda_j)}\right) \qquad \text{(by Identities 21, 22, 23)} \\
&= \|e_{(i)}\|_A^2\, \omega^2, \qquad \omega^2 = 1 - \frac{(\sum_j \xi_j^2 \lambda_j^2)^2}{(\sum_j \xi_j^2 \lambda_j^3)(\sum_j \xi_j^2 \lambda_j)}. \qquad (25)
\end{aligned}
\]

The analysis depends upon finding an upper bound for ω. To demonstrate how the weights and eigenvalues affect convergence, I shall derive a result for n = 2. Assume that λ_1 ≥ λ_2. The spectral condition number of A is defined to be κ = λ_1/λ_2 ≥ 1. The slope of e(i) (relative to the coordinate system defined by the eigenvectors) is denoted μ = ξ_2/ξ_1. We have

\[
\omega^2 = 1 - \frac{(\xi_1^2\lambda_1^2 + \xi_2^2\lambda_2^2)^2}{(\xi_1^2\lambda_1 + \xi_2^2\lambda_2)(\xi_1^2\lambda_1^3 + \xi_2^2\lambda_2^3)}
= 1 - \frac{(\kappa^2 + \mu^2)^2}{(\kappa + \mu^2)(\kappa^3 + \mu^2)}. \tag{26}
\]

The value of ω, which determines the rate of convergence of Steepest Descent, is graphed as a function of μ and κ in Figure 17. The graph confirms my two examples. If e(0) is an eigenvector, then the slope μ is zero (or infinite); we see from the graph that ω is zero, so convergence is instant. If the eigenvalues are equal, then the condition number κ is one; again, we see that ω is zero.

Figure 18 illustrates examples from near each of the four corners of Figure 17. These quadratic forms are graphed in the coordinate system defined by their eigenvectors. Figures 18(a) and 18(b) are examples with a large condition number. Steepest Descent can converge quickly if a fortunate starting point is chosen (Figure 18(a)), but is usually at its worst when κ is large (Figure 18(b)). The latter figure gives us our best intuition for why a large condition number can be bad: f(x) forms a trough, and Steepest Descent bounces back and forth between the sides of the trough while making little progress along its length. In Figures 18(c) and 18(d), the condition number is small, so the quadratic form is nearly spherical, and convergence is quick regardless of the starting point.

Holding κ constant (because A is fixed), a little basic calculus reveals that Equation 26 is maximized when μ = ±κ. In Figure 17, one can see a faint ridge defined by this line. Figure 19 plots worst-case starting points for our sample matrix A. These starting points fall on the lines defined by ξ_2/ξ_1 = ±κ. An


Figure 17: Convergence ω of Steepest Descent as a function of μ (the slope of e(0)) and κ (the condition number of A). Convergence is fast when μ or κ are small. For a fixed matrix, convergence is worst when μ = ±κ.


Figure 18: These four examples represent points near the corresponding four corners of the graph in Figure 17. (a) Large κ, small μ. (b) An example of poor convergence. κ and μ are both large. (c) Small κ and μ. (d) Small κ, large μ.


Figure 19: Solid lines represent the starting points that give the worst convergence for Steepest Descent. Dashed lines represent steps toward convergence. Note that if the first iteration starts from a worst-case point, so do all succeeding iterations. Each step taken intersects the paraboloid axes (grey arrows) at precisely a 45° angle. Here, κ = 3.5.

upper bound for ω (corresponding to the worst-case starting points) is found by setting μ² = κ²:

\[
\begin{aligned}
\omega^2 &\le 1 - \frac{4\kappa^4}{\kappa^5 + 2\kappa^4 + \kappa^3} \\
&= \frac{\kappa^5 - 2\kappa^4 + \kappa^3}{\kappa^5 + 2\kappa^4 + \kappa^3} \\
&= \frac{(\kappa - 1)^2}{(\kappa + 1)^2}, \\
\omega &\le \frac{\kappa - 1}{\kappa + 1}. \qquad (27)
\end{aligned}
\]

Inequality 27 is plotted in Figure 20. The convergence becomes slow as κ grows large; hence, matrices with large condition numbers are called ill-conditioned. In Appendix C3, it is proven that Equation 27 is also valid for n > 2, if the condition number of a symmetric, positive-definite matrix is defined to be

\[ \kappa = \lambda_{\max}/\lambda_{\min}, \]

the ratio of the largest to smallest eigenvalue. Our convergence results for Steepest Descent are

\[
\|e_{(i)}\|_A \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^i \|e_{(0)}\|_A, \tag{28}
\]


Figure 20: Convergence of Steepest Descent (per iteration) worsens as the condition number of the matrix increases.

\[
\begin{aligned}
\frac{f(x_{(i)}) - f(x)}{f(x_{(0)}) - f(x)} &= \frac{e_{(i)}^T A e_{(i)}}{e_{(0)}^T A e_{(0)}} \qquad \text{(by Equation 8)} \\
&\le \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2i}.
\end{aligned}
\]
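As a sanity check of Inequality 28 on the sample matrix (a sketch assuming NumPy; the starting point and iteration count are arbitrary choices of mine), the code below computes κ = λmax/λmin = 3.5, the per-iteration bound (κ - 1)/(κ + 1) ≈ 0.56, and the measured energy-norm reduction of each Steepest Descent step, which never exceeds the bound.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x_exact = np.array([2.0, -2.0])

lam = np.linalg.eigvalsh(A)
kappa = lam.max() / lam.min()            # condition number: 7/2 = 3.5
bound = (kappa - 1) / (kappa + 1)        # per-iteration bound of Inequality 28
print(kappa, bound)                      # 3.5, 0.555...

def energy_norm(e):
    return np.sqrt(e @ A @ e)

x = np.array([-2.0, -2.0])
for i in range(10):
    r = b - A @ x
    alpha = (r @ r) / (r @ (A @ r))
    x_new = x + alpha * r
    ratio = energy_norm(x_new - x_exact) / energy_norm(x - x_exact)
    assert ratio <= bound + 1e-12        # each step satisfies the bound
    x = x_new
```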

7. The Method of Conjugate Directions

7.1. Conjugacy

Steepest Descent sometimes finds itself taking steps in the same direction as earlier steps (see Figure 8). Wouldn't it be better if, every time we took a step, we got it right the first time? Here's an idea: let's pick a set of orthogonal search directions d(0), d(1), ..., d(n-1). In each search direction, we'll take exactly one step, and that step will be just the right length to line up evenly with x. After n steps, we'll be done.

Figure 21 illustrates this idea, using the coordinate axes as search directions. The first (horizontal) step leads to the correct x_1-coordinate; the second (vertical) step will hit home. Notice that e(1) is orthogonal to d(0). In general, for each step we choose a point

\[ x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)}. \tag{29} \]

To find the value of α(i), use the fact that e(i+1) should be orthogonal to d(i), so that we need never step in


Figure 21: The Method of Orthogonal Directions. Unfortunately, this method only works if you already know the answer.

the direction of d(i) again. Using this condition, we have

\[
\begin{aligned}
d_{(i)}^T e_{(i+1)} &= 0 \\
d_{(i)}^T (e_{(i)} + \alpha_{(i)} d_{(i)}) &= 0 \qquad \text{(by Equation 29)} \\
\alpha_{(i)} &= -\frac{d_{(i)}^T e_{(i)}}{d_{(i)}^T d_{(i)}}. \qquad (30)
\end{aligned}
\]

Unfortunately, we haven't accomplished anything, because we can't compute α(i) without knowing e(i); and if we knew e(i), the problem would already be solved.

The solution is to make the search vectors A-orthogonal instead of orthogonal. Two vectors d(i) and d(j) are A-orthogonal, or conjugate, if

\[ d_{(i)}^T A d_{(j)} = 0. \]

Figure 22(a) shows what A-orthogonal vectors look like. Imagine if this article were printed on bubble gum, and you grabbed Figure 22(a) by the ends and stretched it until the ellipses appeared circular. The vectors would then appear orthogonal, as in Figure 22(b).

Our new requirement is that e(i+1) be A-orthogonal to d(i) (see Figure 23(a)). Not coincidentally, this orthogonality condition is equivalent to finding the minimum point along the search direction d(i), as in


Figure 22: These pairs of vectors are A-orthogonal... because these pairs of vectors are orthogonal.

Steepest Descent. To see this, set the directional derivative to zero:

\[
\begin{aligned}
\frac{d}{d\alpha} f(x_{(i+1)}) &= 0 \\
f'(x_{(i+1)})^T \frac{d}{d\alpha} x_{(i+1)} &= 0 \\
-r_{(i+1)}^T d_{(i)} &= 0 \\
d_{(i)}^T A e_{(i+1)} &= 0.
\end{aligned}
\]

Following the derivation of Equation 30, here is the expression for α(i) when the search directions are A-orthogonal:

\[
\begin{aligned}
\alpha_{(i)} &= -\frac{d_{(i)}^T A e_{(i)}}{d_{(i)}^T A d_{(i)}} \qquad (31) \\
&= \frac{d_{(i)}^T r_{(i)}}{d_{(i)}^T A d_{(i)}}. \qquad (32)
\end{aligned}
\]

Unlike Equation 30, we can calculate this expression. Note that if the search vector were the residual, this formula would be identical to the formula used by Steepest Descent.


Figure 23: The method of Conjugate Directions converges in n steps. (a) The first step is taken along some direction d(0). The minimum point x(1) is chosen by the constraint that e(1) must be A-orthogonal to d(0). (b) The initial error e(0) can be expressed as a sum of A-orthogonal components (grey arrows). Each step of Conjugate Directions eliminates one of these components.

To prove that this procedure really does compute x in n steps, express the error term as a linear combination of search directions; namely,

\[
e_{(0)} = \sum_{j=0}^{n-1} \delta_j d_{(j)}. \tag{33}
\]

The values of δ_j can be found by a mathematical trick. Because the search directions are A-orthogonal, it is possible to eliminate all the δ_j values but one from Expression 33 by premultiplying the expression by d_(k)^T A:

\[
\begin{aligned}
d_{(k)}^T A e_{(0)} &= \sum_j \delta_j\, d_{(k)}^T A d_{(j)} \\
d_{(k)}^T A e_{(0)} &= \delta_k\, d_{(k)}^T A d_{(k)} \qquad \text{(by A-orthogonality of the $d$ vectors)} \\
\delta_k &= \frac{d_{(k)}^T A e_{(0)}}{d_{(k)}^T A d_{(k)}} \\
&= \frac{d_{(k)}^T A \bigl(e_{(0)} + \sum_{i=0}^{k-1} \alpha_{(i)} d_{(i)}\bigr)}{d_{(k)}^T A d_{(k)}} \qquad \text{(by A-orthogonality of the $d$ vectors)} \\
&= \frac{d_{(k)}^T A e_{(k)}}{d_{(k)}^T A d_{(k)}} \qquad \text{(by Equation 29)}. \qquad (34)
\end{aligned}
\]

By Equations 31 and 34, we find that α(i) = -δ_i. This fact gives us a new way to look at the error term.


\[
\begin{aligned}
e_{(i)} &= e_{(0)} + \sum_{j=0}^{i-1} \alpha_{(j)} d_{(j)} \\
&= \sum_{j=0}^{n-1} \delta_j d_{(j)} - \sum_{j=0}^{i-1} \delta_j d_{(j)} \\
&= \sum_{j=i}^{n-1} \delta_j d_{(j)}. \qquad (35)
\end{aligned}
\]

Not surprisingly, Equation 35 shows that the process of building up x component by component can also be viewed as a process of cutting down the error term component by component (see Figure 23(b)). After n iterations, every component is cut away, and e(n) = 0; the proof is complete.

7.2. Gram-Schmidt Conjugation

All that is needed now is a set of A-orthogonal search directions {d(i)}. Fortunately, there is a simple way to generate them, called a conjugate Gram-Schmidt process.

Figure 24: Gram-Schmidt conjugation of two vectors. Begin with two linearly independent vectors u0 and u1. Set d(0) = u0. The vector u1 is composed of two components: u*, which is A-orthogonal (or conjugate) to d(0), and u+, which is parallel to d(0). After conjugation, only the A-orthogonal portion remains, and d(1) = u*.

Suppose we have a set of n linearly independent vectors u_0, u_1, ..., u_{n-1}. The coordinate axes will do in a pinch, although more intelligent choices are possible. To construct d(i), take u_i and subtract out any components which are not A-orthogonal to the previous d vectors (see Figure 24). In other words, set d(0) = u_0, and for i > 0, set

\[
d_{(i)} = u_i + \sum_{k=0}^{i-1} \beta_{ik}\, d_{(k)}, \tag{36}
\]

where the β_ik are defined for i > k. To find their values, use the same trick used to find δ_j:

\[
\begin{aligned}
d_{(i)}^T A d_{(j)} &= u_i^T A d_{(j)} + \sum_{k=0}^{i-1} \beta_{ik}\, d_{(k)}^T A d_{(j)} \\
0 &= u_i^T A d_{(j)} + \beta_{ij}\, d_{(j)}^T A d_{(j)}, \quad i > j \qquad \text{(by A-orthogonality of the $d$ vectors)} \\
\beta_{ij} &= -\frac{u_i^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}. \qquad (37)
\end{aligned}
\]


Figure 25: The method of Conjugate Directions using the axial unit vectors, also known as Gaussian elimination.

The difficulty with using Gram-Schmidt conjugation in the method of Conjugate Directions is that all the old search vectors must be kept in memory to construct each new one, and furthermore O(n³) operations are required to generate the full set. In fact, if the search vectors are constructed by conjugation of the axial unit vectors, Conjugate Directions becomes equivalent to performing Gaussian elimination (see Figure 25). As a result, the method of Conjugate Directions enjoyed little use until the discovery of CG - which is a method of Conjugate Directions - cured these disadvantages.
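Here is a sketch of conjugate Gram-Schmidt (Equations 36 and 37) on the sample matrix, starting from the coordinate axes (assuming NumPy; the function and variable names are mine). It verifies that the resulting directions are A-orthogonal.

```python
import numpy as np

def conjugate_gram_schmidt(A, U):
    """Build A-orthogonal directions d_0..d_{n-1} from the linearly independent
    columns of U (Equations 36-37).  Every previous direction is needed, which
    is exactly the O(n^3) drawback discussed above."""
    n = U.shape[1]
    D = np.zeros_like(U)
    D[:, 0] = U[:, 0]
    for i in range(1, n):
        d = U[:, i].copy()
        for k in range(i):
            beta = -(U[:, i] @ A @ D[:, k]) / (D[:, k] @ A @ D[:, k])   # Equation 37
            d += beta * D[:, k]
        D[:, i] = d
    return D

A = np.array([[3.0, 2.0], [2.0, 6.0]])
D = conjugate_gram_schmidt(A, np.eye(2))      # axial unit vectors as the u's
print(D[:, 0] @ A @ D[:, 1])                  # ~0: the directions are A-orthogonal
```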

An important key to understanding the method of Conjugate Directions (and also CG) is to notice that Figure 25 is just a stretched copy of Figure 21! Remember that when one is performing the method of Conjugate Directions (including CG), one is simultaneously performing the method of Orthogonal Directions in a stretched (scaled) space.

7.3. Properties of the Residual

This section derives several properties needed to derive CG.

• As with the method of Steepest Descent, the number of matrix-vector products per iteration can be reduced to one by using a recurrence to find the residual:

\[
\begin{aligned}
r_{(i+1)} &= -A e_{(i+1)} \\
&= -A (e_{(i)} + \alpha_{(i)} d_{(i)}) \\
&= r_{(i)} - \alpha_{(i)} A d_{(i)}. \qquad (38)
\end{aligned}
\]

Figure 26: Because the search directions d_{(0)}, d_{(1)} are constructed from the vectors u_0, u_1, they span the same subspace (the grey-colored plane). The error term e_{(2)} is A-orthogonal to this subspace, the residual r_{(2)} is orthogonal to this subspace, and a new search direction d_{(2)} is constructed (from u_2) to be A-orthogonal to this subspace. The endpoints of u_2 and d_{(2)} lie on a plane parallel to the shaded subspace, because d_{(2)} is constructed from u_2 by Gram-Schmidt conjugation.

" This recurrence suggests that, at the same time as the error term is being cut down component bycomponent, these components being parallel to the search directions d(1 ), the residual is also beingcut down component by component, these components being parallel to a different set of search

directions Ad(i) This suggestion is confirmed by premultiplying Equation 35 by -A:

n--

r 1) = - E 6(k)Ad(k).k=j

"* The inner product of d(i) and this expression is

d(i)r(j) = 0, i < j (by A-orthogonality of d-vectors). (39)

We could have derived this identity by another tack. Recall that once we take a step in a search direction, we need never step in that direction again; the error term is evermore A-orthogonal to all the old search directions. Because r_{(i)} = -A e_{(i)}, the residual is evermore orthogonal to all the old search directions.

"• Taking the inner product of Equation 36 and r(j), we have

i-I

dT T = uTr(j) + EIAkd (40)k=O

0 = uTr?), i < j (by Identity 39). (41)

Each residual is orthogonal to all the previous u vectors. This isn't surprising, because the search vectors are built from the u vectors; therefore the search vectors d_{(0)}, ..., d_{(i)} span the same subspace as u_0, ..., u_i, and r_{(j)} (for j > i) is orthogonal to this subspace (see Figure 26).

* There is one more identity we will use later. From Equation 40,

d_{(i)}^T r_{(i)} = u_i^T r_{(i)}.   (42)


Figure 27: In the method of Conjugate Gradients, each new residual is orthogonal to all the previous residuals and search directions; and each new search direction is constructed (from the residual) to be A-orthogonal to all the previous residuals and search directions. The endpoints of r_{(2)} and d_{(2)} lie on a plane parallel to the shaded subspace. In CG, d_{(2)} is a linear combination of r_{(2)} and d_{(1)}.

8. The Method of Conjugate Gradients

It may seem odd that an article about CG doesn't describe CG until page 28, but all the machinery is now in place. In fact, CG is simply the method of Conjugate Directions where the search directions are constructed by conjugation of the residuals (that is, by setting u_i = r_{(i)}). There, we're done.

Following the residuals worked well for Steepest Descent, so it makes sense to try using the residuals to construct search directions for CG. Let's consider the implications. Equation 41 becomes

r_{(i)}^T r_{(j)} = 0,   i \ne j.   (43)

Because each residual is orthogonal to the subspace spanned by the previous search vectors, all the residuals must be mutually orthogonal (see Figure 27).

This relation allows us to simplify the \beta_{ij} terms. Let us calculate the value of r_{(i)}^T A d_{(j)} for Equation 37. We take the inner product of r_{(i)} and Equation 38:

r_{(i)}^T r_{(j+1)} = r_{(i)}^T r_{(j)} - \alpha_{(j)} r_{(i)}^T A d_{(j)}

\alpha_{(j)} r_{(i)}^T A d_{(j)} = r_{(i)}^T r_{(j)} - r_{(i)}^T r_{(j+1)}

r_{(i)}^T A d_{(j)} = \begin{cases} \frac{1}{\alpha_{(i)}} r_{(i)}^T r_{(i)}, & i = j, \\ -\frac{1}{\alpha_{(i-1)}} r_{(i)}^T r_{(i)}, & i = j + 1, \\ 0, & \text{otherwise} \end{cases}   (By Equation 43.)

Therefore, by Equation 37,

\beta_{ij} = \begin{cases} \frac{1}{\alpha_{(i-1)}} \frac{r_{(i)}^T r_{(i)}}{d_{(i-1)}^T A d_{(i-1)}}, & i = j + 1, \\ 0, & i > j + 1. \end{cases}

As if by magic, most of the \beta_{ij} terms have disappeared! It is no longer necessary to store old search vectors to ensure the A-orthogonality of new search vectors. This major advance is what makes CG as important an algorithm as it is, because both the space complexity and time complexity per iteration are reduced from O(n^2) to O(m), where m is the number of nonzero entries of A.

Figure 28: The method of Conjugate Gradients.

Henceforth, we shall use the abbreviation \beta_{(i)} = \beta_{i,i-1}. We simplify further:

\beta_{(i)} = \frac{r_{(i)}^T r_{(i)}}{d_{(i-1)}^T r_{(i-1)}}   (by Equation 32)

            = \frac{r_{(i)}^T r_{(i)}}{r_{(i-1)}^T r_{(i-1)}}   (by Equation 42).

Let's put it all together into one piece now. The method of Conjugate Gradients is:

d_{(0)} = r_{(0)} = b - A x_{(0)},   (44)

\alpha_{(i)} = \frac{r_{(i)}^T r_{(i)}}{d_{(i)}^T A d_{(i)}}   (by Equations 32 and 42),   (45)

x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)},

r_{(i+1)} = r_{(i)} - \alpha_{(i)} A d_{(i)},   (46)

\beta_{(i+1)} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}},   (47)

d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)} d_{(i)}.   (48)

The performance of CG on our sample problem is demonstrated in Figure 28. The name "Conjugate Gradients" is a bit of a misnomer, because the gradients are not conjugate, and the conjugate directions are not all gradients. "Conjugated Gradients" would be more accurate.
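The five formulas above translate almost line for line into code. The following Python sketch is my own rendering of Equations 44 through 48 (the canned routine in Appendix B2 adds periodic recomputation of the exact residual); the 2-by-2 system is the article's sample problem, whose eigenvalues are 2 and 7.

import numpy as np

def conjugate_gradients(A, b, x0, imax=1000, eps=1e-10):
    """Linear CG following Equations 44-48; A must be symmetric positive-definite."""
    x = x0.copy()
    r = b - A @ x                # Equation 44: r_(0) = b - A x_(0)
    d = r.copy()                 # d_(0) = r_(0)
    delta = r @ r
    delta0 = delta
    for i in range(imax):
        if delta <= eps**2 * delta0:
            break
        q = A @ d                # the single matrix-vector product per iteration
        alpha = delta / (d @ q)  # Equation 45
        x = x + alpha * d
        r = r - alpha * q        # Equation 46 (recurrence for the residual)
        delta_old = delta
        delta = r @ r
        beta = delta / delta_old # Equation 47
        d = r + beta * d         # Equation 48
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradients(A, b, np.array([-2.0, -2.0])))   # approximately [2, -2]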


Figure 29: In this figure, the shaded area is \mathcal{E} = e_{(0)} + span{d_{(0)}, d_{(1)}}. The ellipsoid is a contour on which the energy norm is constant. After two steps, CG finds e_{(2)}, the point on \mathcal{E} that minimizes ||e||_A.

9. Convergence Analysis of Conjugate Gradients

CG is complete after n iterations, so why should we care about convergence analysis? In practice, accumulated floating point roundoff error causes the residual to gradually lose accuracy, and cancellation error causes the search vectors to lose A-orthogonality. The former problem could be dealt with as it was for Steepest Descent, but the latter problem is not curable. Because of this loss of conjugacy, the mathematical community discarded CG during the 1960s, and interest only resurged when evidence for its effectiveness as an iterative procedure was published in the seventies.

A second concern is that for large enough problems, it may not be feasible to run even n iterations.

The first iteration of CG is identical to the first iteration of Steepest Descent, so without changes, Section 6.1 describes the conditions under which CG converges on the first iteration.

9.1. Optimality of the Error Term

The first step of the convergence analysis is to show that CG finds at every step the best solution within the bounds of where it's been allowed to explore. Where has it been allowed to explore? The value e_{(i)} is chosen from the subspace \mathcal{E} = e_{(0)} + span{d_{(0)}, d_{(1)}, ..., d_{(i-1)}}. What do I mean by "best solution"? I mean that CG chooses the value from \mathcal{E} that minimizes ||e_{(i)}||_A (see Figure 29). In fact, some authors derive CG by trying to minimize ||e_{(i)}||_A within \mathcal{E}.

In the same way that the error term can be expressed as a linear combination of search directions (Equation 35), its energy norm can be expressed as a summation by taking advantage of the A-orthogonality of the search vectors:

||e_{(i)}||_A^2 = \sum_{j=i}^{n-1} \delta_{(j)}^2 d_{(j)}^T A d_{(j)}   (by Equation 35).

Each term in this summation is associated with a search direction that has not yet been traversed. Any other vector e chosen from \mathcal{E} must have these same terms in its expansion, which proves that e_{(i)} must have the minimum energy norm.

A close examination of Equations 44, 46, and 48 reveals that the residual r_{(i)} is chosen from the subspace r_{(0)} + span{A r_{(0)}, A^2 r_{(0)}, ..., A^i r_{(0)}}. This subspace is called a Krylov subspace, a subspace created by repeatedly applying a matrix to a vector. Because r_{(i)} = -A e_{(i)}, it follows that e_{(i)} is chosen from the subspace \mathcal{E} = e_{(0)} + span{A e_{(0)}, A^2 e_{(0)}, ..., A^i e_{(0)}}, and \mathcal{E} is also a Krylov subspace.

In other words, for a fixed i, the error term has the form

e_{(i)} = \left( I + \sum_{j=1}^{i} \psi_j A^j \right) e_{(0)}.

The coefficients \psi_j are related to the values \alpha_{(i)} and \beta_{(i)}, but the precise relation is not important here. What is important is the proof at the beginning of this section that CG chooses these \psi_j coefficients to minimize ||e_{(i)}||_A.

The term in parentheses above can be expressed as a polynomial. Let P_i(\lambda) be a polynomial of degree i. P_i can take either a scalar or a matrix as its argument, and will evaluate to the same; for instance, if P_2(\lambda) = 2\lambda^2 + 1, then P_2(A) = 2A^2 + I. This flexible notation comes in handy for eigenvectors, for which P_i(A)v = P_i(\lambda)v.
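This claim is easy to verify numerically. The following short Python check (mine, not part of the article) applies P_2(\lambda) = 2\lambda^2 + 1 both ways to an eigenvector of the sample matrix:

import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
lam, V = np.linalg.eigh(A)            # eigenvalues 2 and 7, orthonormal eigenvectors
v = V[:, 0]                           # eigenvector with eigenvalue lam[0]

P_of_A = 2 * A @ A + np.eye(2)        # P2(A) = 2 A^2 + I
print(P_of_A @ v)                     # matrix argument ...
print((2 * lam[0]**2 + 1) * v)        # ... equals scalar argument times v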

Now we can express the error term as

e(i) = Pi(A)e(o),

if we require that P_i(0) = 1. CG chooses this polynomial by choosing the \psi_j coefficients. Let's examine the effect of applying this polynomial to e_{(0)}. As in the analysis of Steepest Descent, express e_{(0)} as a linear combination of orthogonal unit eigenvectors

e_{(0)} = \sum_{j=1}^{n} \xi_j v_j,

and we find that

e_{(i)} = \sum_j \xi_j P_i(\lambda_j) v_j

A e_{(i)} = \sum_j \xi_j P_i(\lambda_j) \lambda_j v_j

||e_{(i)}||_A^2 = \sum_j \xi_j^2 [P_i(\lambda_j)]^2 \lambda_j.

CG finds the polynomial that minimizes this expression, but convergence is only as good as the convergence of the worst eigenvector, so

||e_{(i)}||_A^2 \le \min_{P_i} \max_{\lambda} [P_i(\lambda)]^2 \sum_j \xi_j^2 \lambda_j

                = \min_{P_i} \max_{\lambda} [P_i(\lambda)]^2 \, ||e_{(0)}||_A^2,   (49)

where the maximum is taken over the eigenvalues of A.

Figure 30 illustrates, for several values of i, the P_i that minimizes this expression for our sample problem with eigenvalues 2 and 7. There is only one polynomial of degree zero that satisfies P_0(0) = 1, and that is P_0(\lambda) = 1, graphed in Figure 30(a). The optimal polynomial of degree one is P_1(\lambda) = 1 - 2\lambda/9, graphed in Figure 30(b). Note that P_1(2) = 5/9 and P_1(7) = -5/9, and so the energy norm of the error term after one iteration of CG is no greater than 5/9 its initial value. Figure 30(c) shows that, after two iterations, Equation 49 evaluates to zero. This is because a polynomial of degree two can be fit to three points (P_2(0) = 1, P_2(2) = 0, and P_2(7) = 0). In general, a polynomial of degree n can fit n + 1 points, and thereby accommodate n separate eigenvalues.

Figure 30: The convergence of CG after i iterations depends on how close a polynomial P_i of degree i can be to zero on each eigenvalue, given the constraint that P_i(0) = 1.

The foregoing discussion reinforces our understanding that CG yields the exact result after n iterations, and furthermore proves that CG is quicker if there are duplicated eigenvalues. Given infinite floating point precision, the number of iterations required to compute an exact solution is at most the number of distinct eigenvalues.
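The following Python sketch (my own, self-contained illustration) makes the point concrete: a 6-by-6 system whose matrix has only the two distinct eigenvalues 2 and 7 is solved by CG in two iterations, up to roundoff.

import numpy as np

# A 6x6 symmetric positive-definite matrix with only two distinct eigenvalues (2 and 7).
A = np.diag([2.0, 2.0, 2.0, 7.0, 7.0, 7.0])
b = np.ones(6)

x = np.zeros(6)
r = b - A @ x
d = r.copy()
delta = r @ r
i = 0
while delta > 1e-24 * (b @ b) and i < 6:
    q = A @ d
    alpha = delta / (d @ q)
    x += alpha * d
    r -= alpha * q
    delta_old, delta = delta, r @ r
    d = r + (delta / delta_old) * d
    i += 1
    print("iteration", i, "residual norm", np.sqrt(delta))
# The residual is essentially zero after iteration 2, matching the two
# distinct eigenvalues, even though the system has six unknowns.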

We also find that CG converges more quickly when eigenvalues are clustered together (Figure 30(d)) than when they are evenly distributed between \lambda_{min} and \lambda_{max}, because it is easier for CG to choose a polynomial that makes Equation 49 small.

If we know something about the characteristics of the eigenvalues of A, it is sometimes possible to suggest a polynomial which leads to a proof of a fast convergence. For the remainder of this analysis, however, I shall assume the most general case: the eigenvalues are evenly distributed between \lambda_{min} and \lambda_{max}, the number of distinct eigenvalues is large, and floating point roundoff occurs.

Figure 31: Chebyshev polynomials of degree 2, 5, 10, and 49.

9.2. Chebyshev Polynomials

A useful approach is to minimize Equation 49 over the range [\lambda_{min}, \lambda_{max}] rather than at a finite number of points. The polynomials that accomplish this are based on Chebyshev polynomials.

The Chebyshev polynomial of degree i is

T_i(\omega) = \frac{1}{2} \left[ \left( \omega + \sqrt{\omega^2 - 1} \right)^i + \left( \omega - \sqrt{\omega^2 - 1} \right)^i \right].

(If this expression doesn't look like a polynomial to you, try working it out for i equal to 1 or 2.) Several Chebyshev polynomials are illustrated in Figure 31. The Chebyshev polynomials have the property that |T_i(\omega)| \le 1 (in fact, they oscillate between 1 and -1) on the domain \omega \in [-1, 1], and furthermore that |T_i(\omega)| is maximum on the domain \omega \notin [-1, 1] among all such polynomials. In other words, |T_i(\omega)| increases as quickly as possible outside the boxes in the illustration.
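As a quick numerical sanity check (mine, not part of the article), the closed form above can be compared with the trigonometric form T_i(\omega) = cos(i arccos \omega) used in Appendix C4; complex arithmetic absorbs the imaginary intermediate values that arise when |\omega| < 1.

import numpy as np

def chebyshev(i, w):
    """Closed form T_i(w) = ((w + sqrt(w^2-1))^i + (w - sqrt(w^2-1))^i) / 2."""
    w = w.astype(complex)              # the square root is imaginary inside [-1, 1]
    s = np.sqrt(w * w - 1.0)
    return (((w + s) ** i + (w - s) ** i) / 2.0).real

w = np.linspace(-1.0, 1.0, 5)
print(chebyshev(5, w))                 # matches the trigonometric form below
print(np.cos(5 * np.arccos(w)))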

It is shown in Appendix C4 that Equation 49 is minimized by choosing

P_i(\lambda) = \frac{T_i\left( \frac{\lambda_{max} + \lambda_{min} - 2\lambda}{\lambda_{max} - \lambda_{min}} \right)}{T_i\left( \frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}} \right)}.

Figure 32: The polynomial P_2(\lambda) that minimizes Equation 49 for \lambda_{min} = 2 and \lambda_{max} = 7 in the general case. This curve is a scaled version of the Chebyshev polynomial of degree 2. The energy norm of the error term after two iterations is no greater than 0.183 times its initial value. Compare with Figure 30(c), where it is known that there are only two eigenvalues.

This polynomial has the oscillating properties of Chebyshev polynomials on the domain \lambda_{min} \le \lambda \le \lambda_{max} (see Figure 32). The denominator enforces our requirement that P_i(0) = 1. The numerator has a maximum value of one on the interval between \lambda_{min} and \lambda_{max}, so from Equation 49 we have

||e_{(i)}||_A \le T_i\left( \frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}} \right)^{-1} ||e_{(0)}||_A

              = T_i\left( \frac{\kappa + 1}{\kappa - 1} \right)^{-1} ||e_{(0)}||_A

              = 2 \left[ \left( \frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1} \right)^i + \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^i \right]^{-1} ||e_{(0)}||_A.

The second addend inside the square brackets converges to zero as i grows, so it is more common to express the convergence of CG with the weaker inequality

||e_{(i)}||_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^i ||e_{(0)}||_A.   (50)

Figure 33 charts the convergence per iteration of CG, ignoring the lost factor of 2 (which will generally be amortized over many iterations). Of course, CG often converges faster than Equation 50 would suggest, because of good eigenvalue distributions or good starting points. Comparing Equations 50 and 28, it is clear that the convergence of CG is much quicker than that of Steepest Descent (see Figure 34). However, it is not necessarily the case that every iteration of CG enjoys faster convergence; for example, the first iteration of CG is always identical to the first iteration of Steepest Descent. The factor of 2 in Equation 50 allows CG a little slack for these poor iterations.
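For instance, a small sketch (mine) that tabulates the right-hand side of Equation 50 for the sample problem's condition number \kappa = 3.5:

import numpy as np

def cg_bound(kappa, i):
    """Upper bound of Equation 50 on ||e_(i)||_A / ||e_(0)||_A."""
    s = np.sqrt(kappa)
    return 2.0 * ((s - 1.0) / (s + 1.0)) ** i

for i in range(5):
    print(i, cg_bound(3.5, i))
# The per-iteration factor (sqrt(kappa)-1)/(sqrt(kappa)+1) is about 0.30 here,
# compared with (kappa-1)/(kappa+1) = 0.56 for Steepest Descent (Equation 28).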

Figure 33: Convergence of Conjugate Gradients (per iteration) as a function of condition number. Compare with Figure 20.

Figure 34: Number of iterations of Steepest Descent required to match one iteration of CG.

10. Complexity

The dominating operations during an iteration of either Steepest Descent or CG are matrix-vector products. In general, matrix-vector multiplication requires O(m) operations, where m is the number of nonzero entries in the matrix. For many problems, including those listed in the introduction, A is sparse and m \in O(n).

Suppose we wish to perform enough iterations to reduce the norm of the error by a factor of \varepsilon; that is, ||e_{(i)}|| \le \varepsilon ||e_{(0)}||. Equation 28 can be used to show that the maximum number of iterations required to achieve this bound using Steepest Descent is

i \le \left\lceil \frac{1}{2} \kappa \ln\left( \frac{1}{\varepsilon} \right) \right\rceil,

whereas Equation 50 suggests that the maximum number of iterations CG requires is

i \le \left\lceil \frac{1}{2} \sqrt{\kappa} \ln\left( \frac{2}{\varepsilon} \right) \right\rceil.

I conclude that Steepest Descent has a time complexity of O(m\kappa), whereas CG has a time complexity of O(m\sqrt{\kappa}). Both algorithms have a space complexity of O(m).
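As a rough arithmetic illustration (my own sketch; the CG estimate is the bound just stated, derived from Equation 50), the two iteration counts can be tabulated for a few condition numbers and a tolerance of \varepsilon = 10^{-6}:

import numpy as np

def iters_steepest_descent(kappa, eps):
    return int(np.ceil(0.5 * kappa * np.log(1.0 / eps)))

def iters_cg(kappa, eps):
    return int(np.ceil(0.5 * np.sqrt(kappa) * np.log(2.0 / eps)))

for kappa in [10.0, 100.0, 10000.0]:
    print(kappa, iters_steepest_descent(kappa, 1e-6), iters_cg(kappa, 1e-6))
# The gap widens as kappa grows, reflecting O(m*kappa) versus O(m*sqrt(kappa)) time.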

For finite difference and finite element approximations of second-order elliptic boundary value problems posed on d-dimensional domains, \kappa \in O(n^{2/d}). Thus, Steepest Descent has a time complexity of O(n^2) for two-dimensional problems, versus O(n^{3/2}) for CG; and Steepest Descent has a time complexity of O(n^{5/3}) for three-dimensional problems, versus O(n^{4/3}) for CG.

11. Starting and Stopping

In the preceding presentation of the Steepest Descent and Conjugate Gradient algorithms, several details have been omitted; particularly, how to choose a starting point, and when to stop.

11.1. Starting

There's not much to say about starting. If you have a rough estimate of the value of x, use it as the starting value x_{(0)}. If not, set x_{(0)} = 0; either algorithm will eventually converge when used to solve linear systems. Nonlinear systems (coming up in Section 14) are trickier, though, because there may be several local minima, and the choice of starting point will determine which minimum the procedure converges to, or whether it will converge at all.

11.2. Stopping

When Steepest Descent or CG reaches the minimum point, the residual becomes zero, and if Equation 11 or 47 is evaluated an iteration later, a division by zero will result. It seems, then, that we must stop immediately when the residual is zero. To complicate things, though, accumulated roundoff error in the recursive formulation of the residual (Equation 46) may yield a false zero residual; this problem could be resolved by restarting with Equation 44.

Usually, however, one wishes to stop before convergence is complete. Because the error term is not available, it is customary to stop when the norm of the residual falls below a specified value; often, this value is some small fraction of the initial residual (||r_{(i)}|| < \varepsilon ||r_{(0)}||). See Appendix B for sample code.

12. Preconditioning

Preconditioning is a technique for improving the condition number of a matrix. Suppose that M is a symmetric, positive-definite matrix that approximates A, but is easier to invert. We can solve Ax = b indirectly by solving

M^{-1} A x = M^{-1} b.   (51)

If \kappa(M^{-1}A) \ll \kappa(A), or if the eigenvalues of M^{-1}A are better clustered than those of A, we can iteratively solve Equation 51 more quickly than the original problem. The catch is that M^{-1}A is not generally symmetric nor definite, even if M and A are.

We can circumvent this difficulty, because for every symmetric, positive-definite M there is a (not necessarily unique) matrix E that has the property that E E^T = M. (Such an E can be obtained, for instance, by Cholesky factorization.) The matrices M^{-1}A and E^{-1}AE^{-T} have the same eigenvalues (and therefore the same condition number). This is true because if v is an eigenvector of M^{-1}A with eigenvalue \lambda, then E^T v is an eigenvector of E^{-1}AE^{-T} with eigenvalue \lambda:

(E^{-1}AE^{-T})(E^T v) = (E^T E^{-T}) E^{-1} A v = E^T M^{-1} A v = \lambda E^T v.

The system Ax = b can be transformed into the problem

E^{-1}AE^{-T} \hat{x} = E^{-1} b,   \hat{x} = E^T x,

which we solve first for \hat{x}, then for x. Because E^{-1}AE^{-T} is symmetric and positive-definite, \hat{x} can be found by Steepest Descent or CG. The process of using CG to solve this system is called the Transformed Preconditioned Conjugate Gradient Method:

\hat{d}_{(0)} = \hat{r}_{(0)} = E^{-1} b - E^{-1}AE^{-T} \hat{x}_{(0)},

\alpha_{(i)} = \frac{\hat{r}_{(i)}^T \hat{r}_{(i)}}{\hat{d}_{(i)}^T E^{-1}AE^{-T} \hat{d}_{(i)}},

\hat{x}_{(i+1)} = \hat{x}_{(i)} + \alpha_{(i)} \hat{d}_{(i)},

\hat{r}_{(i+1)} = \hat{r}_{(i)} - \alpha_{(i)} E^{-1}AE^{-T} \hat{d}_{(i)},

\beta_{(i+1)} = \frac{\hat{r}_{(i+1)}^T \hat{r}_{(i+1)}}{\hat{r}_{(i)}^T \hat{r}_{(i)}},

\hat{d}_{(i+1)} = \hat{r}_{(i+1)} + \beta_{(i+1)} \hat{d}_{(i)}.

This method has the undesirable characteristic that E must be computed. However, a few careful variable substitutions can eliminate E. Setting \hat{r}_{(i)} = E^{-1} r_{(i)} and \hat{d}_{(i)} = E^T d_{(i)}, and using the identities


Figure 35: Contour lines of the quadratic form of the diagonally preconditioned sample problem.

\hat{x}_{(i)} = E^T x_{(i)} and E^{-T} E^{-1} = M^{-1}, we derive the Untransformed Preconditioned Conjugate Gradient Method:

r_{(0)} = b - A x_{(0)},

d_{(0)} = M^{-1} r_{(0)},

\alpha_{(i)} = \frac{r_{(i)}^T M^{-1} r_{(i)}}{d_{(i)}^T A d_{(i)}},

x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)},

r_{(i+1)} = r_{(i)} - \alpha_{(i)} A d_{(i)},

\beta_{(i+1)} = \frac{r_{(i+1)}^T M^{-1} r_{(i+1)}}{r_{(i)}^T M^{-1} r_{(i)}},

d_{(i+1)} = M^{-1} r_{(i+1)} + \beta_{(i+1)} d_{(i)}.

The matrix E does not appear in these equations; only M^{-1} is needed. By the same means, it is possible to derive a Preconditioned Steepest Descent Method that does not use E.

The effectiveness of a preconditioner M is determined by the condition number of M^{-1}A, and occasionally by its clustering of eigenvalues. The problem remains of finding a preconditioner that approximates A well enough to improve convergence enough to make up for the cost of computing the product M^{-1} r_{(i)} once per iteration. (Note that it is not necessary to explicitly compute M or M^{-1}; it is only necessary to be able to compute the effect of applying M^{-1} to a vector.) Within this constraint, there is a surprisingly rich supply of possibilities, and I can only scratch the surface here.
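As an illustration (mine, not the article's Appendix B3 listing), the untransformed iteration above can be coded directly; the example uses the simplest preconditioner discussed below, the diagonal of A, so that applying M^{-1} is an element-wise division and no matrix is ever inverted explicitly.

import numpy as np

def preconditioned_cg(A, b, x0, imax=1000, eps=1e-10):
    """Untransformed preconditioned CG with a Jacobi (diagonal) preconditioner."""
    Minv = 1.0 / np.diag(A)           # M = diag(A); applying M^{-1} is a division
    x = x0.copy()
    r = b - A @ x
    d = Minv * r                      # d_(0) = M^{-1} r_(0)
    delta = r @ d                     # r^T M^{-1} r
    delta0 = delta
    for i in range(imax):
        if delta <= eps**2 * delta0:
            break
        q = A @ d
        alpha = delta / (d @ q)
        x = x + alpha * d
        r = r - alpha * q
        s = Minv * r                  # apply the preconditioner
        delta_old = delta
        delta = r @ s
        beta = delta / delta_old
        d = s + beta * d
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(preconditioned_cg(A, b, np.zeros(2)))    # approximately [2, -2]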

Intuitively, preconditioning is an attempt to stretch the quadratic form to make it appear more spherical, so that the eigenvalues are close to each other. A perfect preconditioner is M = A; for this preconditioner, M^{-1}A has a condition number of one, and the quadratic form is perfectly spherical, so solution takes only one iteration. Unfortunately, the preconditioning step is solving the system Mx = b, so this isn't a useful preconditioner at all.

The simplest preconditioner is a diagonal matrix whose diagonal entries are identical to those of A. The process of applying this preconditioner, known as diagonal preconditioning or Jacobi preconditioning, is equivalent to scaling the quadratic form along the coordinate axes. (By comparison, the perfect preconditioner M = A scales the quadratic form along its eigenvector axes.) A diagonal matrix is trivial to invert, but is often only a mediocre preconditioner. The contour lines of our sample problem are shown, after diagonal preconditioning, in Figure 35. Comparing with Figure 3, it is clear that some improvement has occurred. The condition number has improved from 3.5 to roughly 2.8. Of course, this improvement is much more beneficial for systems where n > 2.

A more elaborate preconditioner is incomplete Cholesky preconditioning. Cholesky factorization is a technique for factoring a matrix A into the form L L^T, where L is a lower triangular matrix. Incomplete Cholesky factorization is a variant in which little or no fill is allowed; A is approximated by the product \hat{L}\hat{L}^T, where \hat{L} might be restricted to have the same pattern of nonzero elements as A; other elements of L are thrown away. To use \hat{L}\hat{L}^T as a preconditioner, the solution to \hat{L}\hat{L}^T w = z is computed by backsubstitution (the inverse of \hat{L}\hat{L}^T is never explicitly computed). Unfortunately, incomplete Cholesky preconditioning is not always stable.

Many preconditioners, some quite sophisticated, have been developed. Whatever your choice, it is generally accepted that for large-scale applications, CG should nearly always be used with a preconditioner if possible.

13. Conjugate Gradients on the Normal Equations

CG can be used to solve systems where A is not symmetric, not positive-definite, and even not square. A solution to the least squares problem

\min_x ||Ax - b||^2   (52)

can be found by setting the derivative of Expression 52 to zero:

A^T A x = A^T b.   (53)

If A is square and nonsingular, the solution to Equation 53 is the solution to Ax = b. If A is not square, and Ax = b is overconstrained - that is, has more linearly independent equations than variables - then there may or may not be a solution to Ax = b, but it is always possible to find a value of x that minimizes Expression 52, the sum of the squares of the errors of each linear equation.

A^T A is symmetric and positive (for any x, x^T A^T A x = ||Ax||^2 \ge 0). If Ax = b is not underconstrained, then A^T A is nonsingular, and methods like Steepest Descent and CG can be used to solve Equation 53. The only nuisance in doing so is that the condition number of A^T A is the square of that of A, so convergence is significantly slower.

An important technical point is that the matrix A^T A is never formed explicitly, because it is less sparse than A. Instead, A^T A is multiplied by d by first finding the product Ad, and then A^T(Ad). Also, numerical stability is improved if the value d^T A^T A d (in Equation 45) is computed by taking the inner product of Ad with itself.
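A sketch (mine) that follows both implementation notes: A^T A is never formed, the product is computed as A^T(Ad), and d^T A^T A d is obtained as the inner product of Ad with itself. The small least-squares example is made up for illustration.

import numpy as np

def cg_normal_equations(A, b, imax=1000, eps=1e-12):
    """Solve A^T A x = A^T b by CG without ever forming A^T A."""
    n = A.shape[1]
    x = np.zeros(n)
    r = A.T @ (b - A @ x)             # residual of the normal equations
    d = r.copy()
    delta = r @ r
    delta0 = delta
    for i in range(imax):
        if delta <= eps**2 * delta0:
            break
        Ad = A @ d                    # first A d ...
        alpha = delta / (Ad @ Ad)     # ... and d^T A^T A d = (Ad)^T (Ad)
        x = x + alpha * d
        r = r - alpha * (A.T @ Ad)    # ... then A^T (A d)
        delta_old = delta
        delta = r @ r
        d = r + (delta / delta_old) * d
    return x

# Overdetermined example: 3 equations, 2 unknowns (a least-squares line fit).
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])
print(cg_normal_equations(A, b))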

14. The Nonlinear Conjugate Gradient Method

CG can be used not only to find the minimum point of a quadratic form, but to minimize any continuous function f(x) for which the gradient f' can be computed. Applications include a variety of optimization problems, such as engineering design, neural net training, and nonlinear regression.

14.1. Outline of the Nonlinear Conjugate Gradient Method

To derive nonlinear CG, there are three changes to the linear algorithm: the recursive formula for the residual cannot be used, it becomes more complicated to compute the step size \alpha, and there are several different choices for \beta.

In nonlinear CG, the residual is always set to the negation of the gradient; r_{(i)} = -f'(x_{(i)}). The search directions are computed by Gram-Schmidt conjugation of the residuals as with linear CG. Performing a line search along this search direction is much more difficult than in the linear case, and a variety of procedures can be used. As with the linear CG, a value of \alpha_{(i)} that minimizes f(x_{(i)} + \alpha_{(i)} d_{(i)}) is found by ensuring that the gradient is orthogonal to the search direction. We can use any algorithm that finds the zeros of the expression [f'(x_{(i)} + \alpha_{(i)} d_{(i)})]^T d_{(i)}.

In linear CG, there are several equivalent expressions for the value of \beta. In nonlinear CG, these different expressions are no longer equivalent; researchers are still investigating the best choice. Two choices are the Fletcher-Reeves formula, which we used in linear CG for its ease of computation, and the Polak-Ribière formula:

\beta^{FR}_{(i+1)} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}},   \qquad   \beta^{PR}_{(i+1)} = \frac{r_{(i+1)}^T (r_{(i+1)} - r_{(i)})}{r_{(i)}^T r_{(i)}}.

The Fletcher-Reeves method converges if the starting point is sufficiently close to the desired minimum, whereas the Polak-Ribière method can, in rare cases, cycle infinitely without converging. However, Polak-Ribière often converges much more quickly.

Fortunately, convergence of the Polak-Ribière method can be guaranteed by choosing \beta = \max\{\beta^{PR}, 0\}. Using this value is equivalent to restarting CG if \beta^{PR} < 0. To restart CG is to forget the past search directions, and start CG anew in the direction of steepest descent.

Here is an outline of the nonlinear CG method:

d_{(0)} = r_{(0)} = -f'(x_{(0)}),

Find \alpha_{(i)} that minimizes f(x_{(i)} + \alpha_{(i)} d_{(i)}),

x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)},

r_{(i+1)} = -f'(x_{(i+1)}),

\beta_{(i+1)} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}}   or   \beta_{(i+1)} = \max\left\{ \frac{r_{(i+1)}^T (r_{(i+1)} - r_{(i)})}{r_{(i)}^T r_{(i)}}, 0 \right\},

d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)} d_{(i)}.

Nonlinear CG comes with few of the convergence guarantees of linear CG. The less similar f is to a quadratic function, the more quickly the search directions lose conjugacy. (It will become clear shortly that the term "conjugacy" still has some meaning in nonlinear CG.) Another problem is that a general function f may have many local minima. CG is not guaranteed to converge to the global minimum, and may not even find a local minimum if f has no lower bound.

Figure 36 illustrates nonlinear CG. Figure 36(a) is a function with many local minima. Figure 36(b) demonstrates the convergence of nonlinear CG with the Fletcher-Reeves formula. In this example, CG is not nearly as effective as in the linear case; this function is deceptively difficult to minimize. Figure 36(c) shows a cross-section of the surface, corresponding to the first line search in Figure 36(b). Notice that there are several minima; the line search finds a value of \alpha corresponding to a nearby minimum. Figure 36(d) shows the superior convergence of Polak-Ribière CG.

Because CG can only generate n conjugate vectors in an n-dimensional space, it makes sense to restart CG every n iterations, especially if n is small. Figure 37 shows the effect when nonlinear CG is restarted every second iteration. (For this particular example, both the Fletcher-Reeves method and the Polak-Ribière method appear the same.)

14.2. General Line Search

Depending on the value of f', it might be possible to use a fast algorithm to find the zeros of f'^T d. For instance, if f' is polynomial in \alpha, then an efficient algorithm for polynomial zero-finding can be used. However, we will only consider general-purpose algorithms.

Two iterative methods for zero-finding are the Newton-Raphson method and the Secant method. Both methods require that f be twice continuously differentiable. Newton-Raphson also requires that it be possible to compute the second derivative of f(x + \alpha d) with respect to \alpha.

The Newton-Raphson method relies on the Taylor series approximation

f(x + \alpha d) \approx f(x) + \alpha \left[ \frac{d}{d\alpha} f(x + \alpha d) \right]_{\alpha=0} + \frac{\alpha^2}{2} \left[ \frac{d^2}{d\alpha^2} f(x + \alpha d) \right]_{\alpha=0}

                = f(x) + \alpha [f'(x)]^T d + \frac{\alpha^2}{2} d^T f''(x) d,   (54)

\frac{d}{d\alpha} f(x + \alpha d) \approx [f'(x)]^T d + \alpha \, d^T f''(x) d,   (55)

where f''(x) is the Hessian matrix

f''(x) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n}
\end{bmatrix}.

The function f(x + \alpha d) is approximately minimized by setting Expression 55 to zero, giving

\alpha = -\frac{[f'(x)]^T d}{d^T f''(x) d}.


Figure 36: Convergence of the nonlinear Conjugate Gradient Method. (a) A complicated function with many local minima and maxima. (b) Convergence path of Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribière CG.

Figure 37: Nonlinear CG can be more effective with periodic restarts.

Figure 38: The Newton-Raphson method for minimizing a one-dimensional function (solid curve). Starting from the point x, calculate the first and second derivatives, and use them to construct a quadratic approximation to the function (dashed curve). A new point z is chosen at the base of the parabola. This procedure is iterated until convergence is reached.

The truncated Taylor series approximates f(x + \alpha d) with a parabola; we step to the bottom of the parabola (see Figure 38). In fact, if f is a quadratic form, then this parabolic approximation is exact, because f'' is just the familiar matrix A. In general, the search directions are conjugate if they are f''-orthogonal. The meaning of "conjugate" keeps changing, though, because f'' varies with x. The more quickly f'' varies with x, the more quickly the search directions lose conjugacy. On the other hand, the closer x_{(i)} is to the solution, the less f'' varies from iteration to iteration. The closer the starting point is to the solution, the more similar the convergence of nonlinear CG is to that of linear CG.

To perform an exact line search of a non-quadratic function, repeated steps must be taken along the line until f'^T d is zero; hence, one CG iteration may include many Newton-Raphson iterations. The values of f'^T d and d^T f'' d must be evaluated at each step. These evaluations may be inexpensive if d^T f'' d can be analytically simplified, but if the full matrix f'' must be evaluated repeatedly, the algorithm is prohibitively slow. For some applications, it is possible to circumvent this problem by performing an approximate line search that uses only the diagonal elements of f''. Of course, there are functions for which it is not possible to evaluate f'' at all.

To perform an exact line search without computing f'', the Secant method approximates the second derivative of f(x + \alpha d) by evaluating the first derivative at the distinct points \alpha = 0 and \alpha = \sigma, where \sigma is an arbitrary small nonzero number:

\frac{d^2}{d\alpha^2} f(x + \alpha d) \approx \frac{\left[ \frac{d}{d\alpha} f(x + \alpha d) \right]_{\alpha=\sigma} - \left[ \frac{d}{d\alpha} f(x + \alpha d) \right]_{\alpha=0}}{\sigma}

                                      = \frac{[f'(x + \sigma d)]^T d - [f'(x)]^T d}{\sigma},   (56)

which becomes a better approximation to the second derivative as \alpha and \sigma approach zero. If we substitute Expression 56 for the third term of the Taylor expansion (Equation 54), we have

\frac{d}{d\alpha} f(x + \alpha d) \approx [f'(x)]^T d + \frac{\alpha}{\sigma} \left\{ [f'(x + \sigma d)]^T d - [f'(x)]^T d \right\}.

Minimize f(x + \alpha d) by setting its derivative to zero:

\alpha = -\sigma \frac{[f'(x)]^T d}{[f'(x + \sigma d)]^T d - [f'(x)]^T d}.   (57)

Like Newton-Raphson, the Secant method also approximates f(x + \alpha d) with a parabola, but instead of choosing the parabola by finding the first and second derivative at a point, it finds the first derivative at two different points (see Figure 39). Typically, we will choose an arbitrary \sigma on the first Secant method iteration; on subsequent iterations we will choose x + \sigma d to be the value of x from the previous Secant method iteration. In other words, if we let \alpha_{[i]} denote the value of \alpha calculated during Secant iteration i, then \sigma_{[i+1]} = -\alpha_{[i]}.
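In isolation, one secant-based line search looks like the following Python sketch (mine, mirroring the loop of Appendix B5); the test function and its gradient are made up for illustration, and sigma0 plays the role of \sigma in Equation 57.

import numpy as np

def secant_line_search(grad, x, d, sigma0=0.1, jmax=10, tol=1e-8):
    """Repeated secant steps along x + alpha*d until f'(x)^T d is (nearly) zero."""
    alpha = -sigma0
    eta_prev = grad(x + sigma0 * d) @ d          # slope at alpha = sigma0
    for _ in range(jmax):
        eta = grad(x) @ d                        # slope at the current point
        if eta_prev == eta:                      # avoid division by zero
            break
        alpha = alpha * eta / (eta_prev - eta)   # Equation 57
        x = x + alpha * d
        eta_prev = eta
        if abs(alpha) * np.linalg.norm(d) < tol:
            break
    return x

# Example: minimize f(x) = x1^4 + x2^2 along the direction of steepest descent.
grad = lambda x: np.array([4.0 * x[0]**3, 2.0 * x[1]])
x0 = np.array([1.0, 1.0])
print(secant_line_search(grad, x0, -grad(x0)))   # approximate minimizer on the line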

Both the Newton-Raphson and Secant methods should be terminated when x is reasonably close to the exact solution. Demanding too little precision could cause a failure of convergence, but demanding too fine precision makes the computation unnecessarily slow and gains nothing, because conjugacy will break down quickly anyway if f''(x) varies much with x. Therefore, a quick but inexact line search is often the better policy (for instance, use only a fixed number of Newton-Raphson or Secant method iterations).

Unfortunately, inexact line search may lead to the construction of a search direction that is not a descent direction. A common solution is to test for this eventuality (is r^T d nonpositive?), and restart CG if necessary by setting d = r.

Figure 39: The Secant method for minimizing a one-dimensional function (solid curve). Starting from the point x, calculate the first derivatives at two different points (here, \alpha = 0 and \alpha = 2), and use them to construct a quadratic approximation to the function (dashed curve). Notice that both curves have the same slope at \alpha = 0, and also at \alpha = 2. As in the Newton-Raphson method, a new point z is chosen at the base of the parabola, and the procedure is iterated until convergence.

A bigger problem with both methods is that they cannot distinguish minima from maxima. The result of nonlinear CG generally depends strongly on the starting point, and if CG with the Newton-Raphson or Secant method starts near a local maximum, it is likely to converge to that point.

Each method has its own advantages. The Newton-Raphson method has a better convergence rate, and is to be preferred if d^T f'' d can be calculated (or well approximated) quickly (i.e., in O(n) time). The Secant method only requires first derivatives of f, but its success may depend on a good choice of the parameter \sigma. It is easy to derive a variety of other methods as well. For instance, by sampling f at three different points, it is possible to generate a parabola that approximates f(x + \alpha d) without the need to calculate even a first derivative of f.

14.3. Preconditioning

Nonlinear CG can be preconditioned by choosing a preconditioner M that approximates f'' and has the property that M^{-1} r is easy to compute. In linear CG, the preconditioner attempts to transform the quadratic form so that it is similar to a sphere; a nonlinear CG preconditioner performs this transformation for a region near x_{(i)}.

Even when it is too expensive to compute the full Hessian f'', it is often quite reasonable to compute its diagonal for use as a preconditioner. However, be forewarned that if x is sufficiently far from a local minimum, the diagonal elements of the Hessian may not all be positive. A preconditioner should be positive-definite, so nonpositive diagonal elements cannot be allowed. A conservative solution is to not precondition (set M = I) when the Hessian cannot be guaranteed to be positive-definite. Figure 40 demonstrates the convergence of diagonally preconditioned nonlinear CG, with the Polak-Ribière formula, on the same function illustrated in Figure 36.

Figure 40: The preconditioned nonlinear Conjugate Gradient Method, using the Polak-Ribière formula and a diagonal preconditioner. The space has been "stretched" to show the improvement in circularity of the contour lines around the minimum.

Here, I have cheated by using the diagonal of f'' at the solution point x to precondition every iteration.

A Notes

Conjugate Direction methods were probably first presented by Schmidt [14] in 1908, and were independently reinvented by Fox, Huskey, and Wilkinson [7] in 1948. In the early fifties, the method of Conjugate Gradients was discovered independently by Hestenes [10] and Stiefel [15]; shortly thereafter, they jointly published what is considered the seminal reference on CG [11]. Convergence bounds for CG in terms of Chebyshev polynomials were developed by Kaniel [12]. A more thorough analysis of CG convergence is provided by van der Sluis and van der Vorst [16]. CG was popularized as an iterative method for large, sparse matrices by Reid [13] in 1971.

CG was generalized to nonlinear problems in 1964 by Fletcher and Reeves [6], based on work by Davidon [4] and Fletcher and Powell [5]. Convergence of nonlinear CG with inexact line searches was analyzed by Daniel [3]. The choice of \beta for nonlinear CG is discussed by Gilbert and Nocedal [8].

A history and extensive annotated bibliography of CG to the mid-seventies is provided by Golub and O'Leary [9]. Most research since that time has focused on nonsymmetric systems. A survey of iterative methods for the solution of linear systems is offered by Barrett et al. [1].

B Canned Algorithms

The code given in this section represents efficient implementations of the algorithms discussed in this article.

B1. Steepest Descent

Given the inputs A, b, a starting value x, a maximum number of iterations i_max, and an error tolerance ε < 1:

i ⇐ 0
r ⇐ b - Ax
δ ⇐ r^T r
δ_0 ⇐ δ
While i < i_max and δ > ε^2 δ_0 do
    q ⇐ Ar
    α ⇐ δ / (r^T q)
    x ⇐ x + α r
    If i is divisible by 50
        r ⇐ b - Ax
    else
        r ⇐ r - α q
    δ ⇐ r^T r
    i ⇐ i + 1

This algorithm terminates when the maximum number of iterations i_max has been exceeded, or when ||r_{(i)}|| ≤ ε ||r_{(0)}||.

The fast recursive formula for the residual is usually used, but once every 50 iterations, the exact residual is recalculated to remove accumulated floating point error. Of course, the number 50 is arbitrary; for large n, √n might be appropriate. If the tolerance is large, the residual need not be corrected at all. If the tolerance is close to the limits of the floating point precision of the machine, a test should be added after δ is evaluated to check if δ ≤ ε^2 δ_0, and if this test holds true, the exact residual should also be recomputed and δ reevaluated. This prevents the procedure from terminating too quickly due to floating point roundoff error.

B2. Conjugate Gradients

Given the inputs A, b, a starting value x, a maximum number of iterations i_max, and an error tolerance ε < 1:

i ⇐ 0
r ⇐ b - Ax
d ⇐ r
δ_new ⇐ r^T r
δ_0 ⇐ δ_new
While i < i_max and δ_new > ε^2 δ_0 do
    q ⇐ Ad
    α ⇐ δ_new / (d^T q)
    x ⇐ x + α d
    If i is divisible by 50
        r ⇐ b - Ax
    else
        r ⇐ r - α q
    δ_old ⇐ δ_new
    δ_new ⇐ r^T r
    β ⇐ δ_new / δ_old
    d ⇐ r + β d
    i ⇐ i + 1

See the comments at the end of Section B1.

B3. Preconditioned Conjugate Gradients

Given the inputs A, b, a starting value x, a (perhaps implicitly defined) preconditioner M, a maximum number of iterations i_max, and an error tolerance ε < 1:

i ⇐ 0
r ⇐ b - Ax
d ⇐ M^{-1} r
δ_new ⇐ r^T d
δ_0 ⇐ δ_new
While i < i_max and δ_new > ε^2 δ_0 do
    q ⇐ Ad
    α ⇐ δ_new / (d^T q)
    x ⇐ x + α d
    If i is divisible by 50
        r ⇐ b - Ax
    else
        r ⇐ r - α q
    s ⇐ M^{-1} r
    δ_old ⇐ δ_new
    δ_new ⇐ r^T s
    β ⇐ δ_new / δ_old
    d ⇐ s + β d
    i ⇐ i + 1

The statement "s 4= M-r" implies that one should apply the preconditioner, which may not actuallybe in the form of a matrix.

See also the comments at the end of Section B 1.

B4. Nonlinear Conjugate Gradients with Newton-Raphson and Fletcher-Reeves

Given a function f, a starting value x, a maximum number of CG iterations i_max, a CG error tolerance ε < 1, a maximum number of Newton-Raphson iterations j_max, and a Newton-Raphson error tolerance ϵ < 1:

i ⇐ 0
k ⇐ 0
r ⇐ -f'(x)
d ⇐ r
δ_new ⇐ r^T r
δ_0 ⇐ δ_new
While i < i_max and δ_new > ε^2 δ_0 do
    j ⇐ 0
    δ_d ⇐ d^T d
    Do
        α ⇐ -([f'(x)]^T d) / (d^T f''(x) d)
        x ⇐ x + α d
        j ⇐ j + 1
    while j < j_max and α^2 δ_d > ϵ^2
    r ⇐ -f'(x)
    δ_old ⇐ δ_new
    δ_new ⇐ r^T r
    β ⇐ δ_new / δ_old
    d ⇐ r + β d
    k ⇐ k + 1
    If k = n or r^T d ≤ 0
        d ⇐ r
        k ⇐ 0
    i ⇐ i + 1

This algorithm terminates when the maximum number of iterations i_max has been exceeded, or when ||r_{(i)}|| ≤ ε ||r_{(0)}||.

Each Newton-Raphson iteration adds αd to x; the iterations are terminated when each update αd falls below a given tolerance (||αd|| ≤ ϵ), or when the number of iterations exceeds j_max. A fast inexact line search can be accomplished by using a small j_max and/or by approximating the Hessian f''(x) with its diagonal.

Nonlinear CG is restarted (by setting d ⇐ r) whenever a search direction is computed that is not a descent direction. It is also restarted once every n iterations, to improve convergence for small n.

The calculation of α may result in a divide-by-zero error. This may occur because the starting point x_{(0)} is not sufficiently close to the desired minimum, or because f is not twice continuously differentiable. In the former case, the solution is to choose a better starting point or a more sophisticated line search. In the latter case, CG might not be the most appropriate minimization algorithm.

B5. Preconditioned Nonlinear Conjugate Gradients with Secant and Polak-Ribière

Given a function f, a starting value x, a maximum number of CG iterations i_max, a CG error tolerance ε < 1, a Secant method step parameter σ_0, a maximum number of Secant method iterations j_max, and a Secant method error tolerance ϵ < 1:

i ⇐ 0
k ⇐ 0
r ⇐ -f'(x)
Calculate a preconditioner M ≈ f''(x)
s ⇐ M^{-1} r
d ⇐ s
δ_new ⇐ r^T d
δ_0 ⇐ δ_new
While i < i_max and δ_new > ε^2 δ_0 do
    j ⇐ 0
    δ_d ⇐ d^T d
    α ⇐ -σ_0
    η_prev ⇐ [f'(x + σ_0 d)]^T d
    Do
        η ⇐ [f'(x)]^T d
        α ⇐ α η / (η_prev - η)
        x ⇐ x + α d
        η_prev ⇐ η
        j ⇐ j + 1
    while j < j_max and α^2 δ_d > ϵ^2
    r ⇐ -f'(x)
    δ_old ⇐ δ_new
    δ_mid ⇐ r^T s
    Calculate a preconditioner M ≈ f''(x)
    s ⇐ M^{-1} r
    δ_new ⇐ r^T s
    β ⇐ (δ_new - δ_mid) / δ_old
    k ⇐ k + 1
    If k = n or β < 0
        d ⇐ s
        k ⇐ 0
    else
        d ⇐ s + β d
    i ⇐ i + 1

This algorithm terminates when the maximum number of iterations i_max has been exceeded, or when ||r_{(i)}|| ≤ ε ||r_{(0)}||.

Each Secant method iteration adds αd to x; the iterations are terminated when each update αd falls below a given tolerance (||αd|| ≤ ϵ), or when the number of iterations exceeds j_max. A fast inexact line search can be accomplished by using a small j_max. The parameter σ_0 determines the value of σ in Equation 57 for the first step of each Secant method minimization. Unfortunately, it may be necessary to adjust this parameter to achieve convergence.

The Polak-Ribière β parameter is

\beta_{(i+1)} = \frac{r_{(i+1)}^T s_{(i+1)} - r_{(i+1)}^T s_{(i)}}{r_{(i)}^T s_{(i)}}.

Care must be taken that the preconditioner M is always positive-definite. The preconditioner is not necessarily in the form of a matrix.

Nonlinear CG is restarted (by setting d ⇐ r) whenever the Polak-Ribière β parameter is negative. It is also restarted once every n iterations, to improve convergence for small n.

Nonlinear CG presents several choices: Preconditioned or not, Newton-Raphson method or Secant method or another method, Fletcher-Reeves or Polak-Ribière. It should be possible to construct any of the variations from the versions presented above. (Note that Polak-Ribière is almost always to be preferred.)

C Ugly Proofs

C1. The Solution to Ax = b Minimizes the Quadratic Form

Suppose A is symmetric. Let x be a point that satisfies Ax = b and minimizes the quadratic form (Equation 3), and let e be an error term. Then

f(x + e) = \frac{1}{2}(x + e)^T A (x + e) - b^T (x + e) + c   (by Equation 3)

         = \frac{1}{2} x^T A x + e^T A x + \frac{1}{2} e^T A e - b^T x - b^T e + c   (by symmetry of A)

         = \frac{1}{2} x^T A x - b^T x + c + e^T b + \frac{1}{2} e^T A e - b^T e

         = f(x) + \frac{1}{2} e^T A e.

If A is positive-definite, then the latter term is positive for all e \ne 0; therefore x minimizes f.

C2. A Symmetric Matrix Has n Orthogonal Eigenvectors.

Suppose A is nonsingular, so that there is a set of n linearly independent eigenvectors v_1, v_2, ..., v_n. Suppose furthermore that A is symmetric. For any two eigenvectors, we have

v_i^T v_j = v_i^T A^{-1} A v_j

          = (A^{-1} v_i)^T A v_j   (by symmetry of A)

          = \lambda_i^{-1} \lambda_j v_i^T v_j.

This equation can hold only if v_i^T v_j = 0 or \lambda_i = \lambda_j. In the former case, v_i and v_j are orthogonal. In the latter case, both v_i and v_j have the same eigenvalue \lambda, so any linear combination of v_i and v_j is also an eigenvector with eigenvalue \lambda. Replace v_j with a vector that is a linear combination of v_i and v_j, and is orthogonal to v_i. (In other words, apply Gram-Schmidt orthogonalization to v_i and v_j, and any other vectors with the same eigenvalue.)

By applying this reasoning to every pair of eigenvectors, one can generate an orthogonal set of eigenvectors.


C3. Convergence of Steepest Descent

Consider the expression \omega^2 from Equation 25:

\omega^2 = 1 - \frac{\left( \sum_j \xi_j^2 \lambda_j^2 \right)^2}{\left( \sum_j \xi_j^2 \lambda_j \right) \left( \sum_j \xi_j^2 \lambda_j^3 \right)}.

Several simplifications can be made:

\sum_j \xi_j^2 \lambda_j + \frac{1}{\lambda_{min}\lambda_{max}} \sum_j \xi_j^2 \lambda_j^3 = \sum_j \xi_j^2 \lambda_j^2 \left( \frac{1}{\lambda_j} + \frac{\lambda_j}{\lambda_{min}\lambda_{max}} \right)

    \le \frac{\lambda_{min} + \lambda_{max}}{\lambda_{min}\lambda_{max}} \sum_j \xi_j^2 \lambda_j^2   (because \lambda_{min} \le \lambda_j \le \lambda_{max}).

If we define S = \sum_j \xi_j^2 \lambda_j, we have

\left( \sum_j \xi_j^2 \lambda_j \right) \left( \sum_j \xi_j^2 \lambda_j^3 \right) \le S \left( (\lambda_{min} + \lambda_{max}) \sum_j \xi_j^2 \lambda_j^2 - \lambda_{min}\lambda_{max} S \right).

Basic calculus shows that this expression is maximized when S = \frac{\lambda_{min} + \lambda_{max}}{2 \lambda_{min}\lambda_{max}} \sum_j \xi_j^2 \lambda_j^2, so we have

\left( \sum_j \xi_j^2 \lambda_j \right) \left( \sum_j \xi_j^2 \lambda_j^3 \right) \le \frac{(\lambda_{min} + \lambda_{max})^2}{4 \lambda_{min}\lambda_{max}} \left( \sum_j \xi_j^2 \lambda_j^2 \right)^2.

It follows that

\omega^2 \le 1 - \frac{4 \lambda_{min}\lambda_{max}}{(\lambda_{min} + \lambda_{max})^2} = \frac{(\lambda_{max} - \lambda_{min})^2}{(\lambda_{max} + \lambda_{min})^2},

\omega \le \frac{\kappa - 1}{\kappa + 1},

where \kappa = \lambda_{max}/\lambda_{min}. Thus, Equation 27 is true for any n > 1.

C4. Optimality of Chebyshev Polynomials

Chebyshev polynomials are optimal for minimization of expressions like Expression 49 because they increase in magnitude more quickly outside the range [-1, 1] than any other polynomial that is restricted to have magnitude no greater than one inside the range [-1, 1].

The Chebyshev polynomial of degree i,

T_i(\omega) = \frac{1}{2} \left[ \left( \omega + \sqrt{\omega^2 - 1} \right)^i + \left( \omega - \sqrt{\omega^2 - 1} \right)^i \right],

can also be expressed on the region [-1, 1] as

T_i(\omega) = \cos(i \cos^{-1} \omega),   -1 \le \omega \le 1.

From this expression (and from Figure 31), one can deduce that the Chebyshev polynomials have the property

|T_i(\omega)| \le 1,   -1 \le \omega \le 1,

and oscillate rapidly between -1 and 1:

T_i\left( \cos \frac{k\pi}{i} \right) = (-1)^k,   k = 0, 1, ..., i.

Notice that the i zeros of T_i must fall between the i + 1 extrema of T_i in the range [-1, 1]. For an example, see the five zeros of T_5(\omega) in Figure 31.

Similarly, the function

P_i(\lambda) = \frac{T_i\left( \frac{\lambda_{max} + \lambda_{min} - 2\lambda}{\lambda_{max} - \lambda_{min}} \right)}{T_i\left( \frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}} \right)}

oscillates in the range \pm T_i\left( \frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}} \right)^{-1} on the domain [\lambda_{min}, \lambda_{max}]. P_i(\lambda) also satisfies the requirement that P_i(0) = 1.

The proof relies on a cute trick. There is no polynomial Q_i(\lambda) of degree i such that Q_i(0) = 1 and Q_i is better than P_i on the range [\lambda_{min}, \lambda_{max}]. To prove this, suppose that there is such a polynomial; then, |Q_i(\lambda)| < T_i\left( \frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}} \right)^{-1} on the range [\lambda_{min}, \lambda_{max}]. It follows that the polynomial P_i - Q_i has a zero at \lambda = 0, and also has i zeros on the range [\lambda_{min}, \lambda_{max}]. Hence, P_i - Q_i is a polynomial of degree i with at least i + 1 zeros, which is impossible. By contradiction, I conclude that the Chebyshev polynomial of degree i optimizes Expression 49.

References

[1] Richard Barrett, Michael Berry, Tony Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst, Templates for the solution of linear systems: Building blocks for iterative methods, SIAM, Philadelphia, Pennsylvania, 1993.

[2] William L. Briggs, A multigrid tutorial, SIAM, Philadelphia, Pennsylvania, 1987.

[3] James W. Daniel, Convergence of the conjugate gradient method with computationally convenient modifications, Numerische Mathematik 10 (1967), 125-131.

[4] W. C. Davidon, Variable metric method for minimization, Technical Report ANL-5990, Argonne National Laboratory, Argonne, Illinois, 1959.

[5] R. Fletcher and M. J. D. Powell, A rapidly convergent descent method for minimization, Computer Journal 6 (1963), 163-168.

[6] R. Fletcher and C. M. Reeves, Function minimization by conjugate gradients, Computer Journal 7 (1964), 149-154.

[7] L. Fox, H. D. Huskey, and J. H. Wilkinson, Notes on the solution of algebraic linear simultaneous equations, Quarterly Journal of Mechanics and Applied Mathematics 1 (1948), 149-173.

[8] Jean Charles Gilbert and Jorge Nocedal, Global convergence properties of conjugate gradient methods for optimization, SIAM Journal of Optimization 2 (1992), no. 1, 21-42.

[9] Gene H. Golub and Dianne P. O'Leary, Some history of the conjugate gradient and Lanczos algorithms: 1948-1976, SIAM Review 31 (1989), no. 1, 50-102.

[10] Magnus R. Hestenes, Iterative methods for solving linear equations, Journal of Optimization Theory and Applications 11 (1973), no. 4, 323-334. Originally published in 1951 as NAML Report No. 52-9, National Bureau of Standards, Washington, D.C.

[11] Magnus R. Hestenes and Eduard Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards 49 (1952), 409-436.

[12] Shmuel Kaniel, Estimates for some computational techniques in linear algebra, Mathematics of Computation 20 (1966), 369-378.

[13] John Ker Reid, On the method of conjugate gradients for the solution of large sparse systems of linear equations, Large Sparse Sets of Linear Equations (John Ker Reid, ed.), Academic Press, London and New York, 1971, pp. 231-254.

[14] E. Schmidt, Rendiconti del Circolo Matematico di Palermo 25 (1908), 53-77.

[15] Eduard Stiefel, Über einige Methoden der Relaxationsrechnung, Zeitschrift für Angewandte Mathematik und Physik 3 (1952), no. 1, 1-33.

[16] A. van der Sluis and H. A. van der Vorst, The rate of convergence of conjugate gradients, Numerische Mathematik 48 (1986), 543-560.
