
IOE 519: NLP, Winter 2012 © Marina A. Epelman

5 Overview of algorithms for unconstrained optimization

5.1 General optimization algorithm

Recall: we are attempting to solve the problem

$$\text{(P)} \qquad \min\ f(x) \quad \text{s.t.}\ x \in X,$$

where $f(x)$ is differentiable and $X \subset \mathbb{R}^n$ is an open set.

Solutions to optimization problems are almost always impossible to obtain directly (or "in closed form"), with a few exceptions. Hence, for the most part, we will solve these problems with iterative algorithms. These algorithms typically require the user to supply a starting point $x^0 \in X$. Beginning at $x^0$, an iterative algorithm will generate a sequence of points $\{x^k\}_{k=0}^{\infty}$ called iterates. In deciding how to generate the next iterate, $x^{k+1}$, the algorithms use information about the function $f$ at the current iterate, $x^k$, and sometimes past iterates $x^0, \ldots, x^{k-1}$. In practice, rather than constructing an infinite sequence of iterates, algorithms stop when an appropriate termination criterion is satisfied, indicating either that the problem has been solved to within a desired accuracy, or that no further progress can be made.

Most algorithms for unconstrained optimization we will discuss fall into the category of directional search algorithms:

General directional search optimization algorithm

Initialization: Specify an initial guess of the solution $x^0$.

Iteration: For $k = 0, 1, \ldots$:
If $x^k$ is optimal, stop. Otherwise,

• Determine $d^k$, a search direction

• Determine $\alpha^k > 0$, a step size

• Determine $x^{k+1} = x^k + \alpha^k d^k$, a new estimate of the solution.
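As a sketch, the loop above might be implemented as follows; `choose_direction` and `choose_stepsize` are placeholders for the particular rules discussed in the remainder of this section, and the gradient-norm test stands in for "$x^k$ is optimal":

```python
import numpy as np

def directional_search(f, grad, x0, choose_direction, choose_stepsize,
                       tol=1e-8, max_iter=1000):
    """Generic directional search: x_{k+1} = x_k + alpha_k * d_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:      # termination test in place of "x_k is optimal"
            break
        d = choose_direction(x, g)        # search direction d_k
        alpha = choose_stepsize(f, x, d)  # step size alpha_k > 0
        x = x + alpha * d                 # new estimate of the solution
    return x

# Illustration: minimize f(x) = ||x||^2 with d_k = -grad f(x_k) and a fixed step.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
xstar = directional_search(f, grad, [3.0, -4.0],
                           choose_direction=lambda x, g: -g,
                           choose_stepsize=lambda f, x, d: 0.25)
```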

5.1.1 Choosing the direction

Typically, we require that $d^k$ is a descent direction of $f$ at $x^k$, that is,

$$f(x^k + \alpha d^k) < f(x^k) \quad \forall \alpha \in (0, \epsilon]$$

for some $\epsilon > 0$. For the case when $f$ is differentiable, we have shown in Theorem 4.1 that any $d^k$ such that $\nabla f(x^k)^T d^k < 0$ is a descent direction whenever $\nabla f(x^k) \neq 0$.

Often, the direction is chosen to be of the form

$$d^k = -D^k \nabla f(x^k),$$

where $D^k$ is a positive definite symmetric matrix. (Why is it important that $D^k$ is positive definite?)
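A quick numerical illustration of the answer: if $D$ is positive definite (here built, purely for illustration, as $M^TM + I$) and $g \neq 0$ stands in for the gradient, then $d = -Dg$ satisfies $\nabla f(x^k)^T d^k = -g^T D g < 0$, so $d$ is a descent direction by Theorem 4.1:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
D = M.T @ M + np.eye(5)        # symmetric positive definite by construction
g = rng.standard_normal(5)     # stands in for a nonzero gradient grad f(x_k)
d = -D @ g                     # candidate direction d_k = -D_k grad f(x_k)
slope = g @ d                  # equals -g^T D g, negative since D is positive definite
```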


The following are the two basic methods for choosing the matrix $D^k$ at each iteration; they give rise to two classic algorithms for unconstrained optimization we are going to discuss in class:

Steepest descent: $D^k = I$, $k = 0, 1, 2, \ldots$

Newton's method: $D^k = H(x^k)^{-1}$ (provided $H(x^k)$ is positive definite).

5.1.2 Choosing the stepsize

After $d^k$ is fixed, $\alpha^k$ ideally would solve the one-dimensional optimization problem

$$\min_{\alpha \geq 0} f(x^k + \alpha d^k).$$

This optimization problem is usually also impossible to solve exactly. Instead, $\alpha^k$ is computed (via an iterative procedure referred to as line search) either to approximately solve the above optimization problem, or to ensure a "sufficient" decrease in the value of $f$.

5.1.3 Testing for optimality

Based on the optimality conditions, $x^k$ is locally optimal if $\nabla f(x^k) = 0$ and $H(x^k)$ is positive definite. However, such a point is unlikely to be found. In fact, most of the analysis of the algorithms in the above form deals with their limiting behavior, i.e., analyzes the limit points of the infinite sequence of iterates generated by the algorithm. Thus, to implement the algorithm in practice, more realistic termination criteria need to be implemented. They often hinge, at least in part, on approximately satisfying, to a certain tolerance, the first order necessary condition for optimality discussed in the previous section.

5.2 Steepest descent algorithm for minimization

The steepest descent algorithm is a version of the general optimization algorithm that chooses $d^k = -\nabla f(x^k)$ at the $k$th iteration. As a source of motivation, note that $f(x)$ can be approximated by its linear expansion $f(\bar{x} + d) \approx f(\bar{x}) + \nabla f(\bar{x})^T d$. It is not hard to see that so long as $\nabla f(\bar{x}) \neq 0$, the direction

$$\bar{d} = \frac{-\nabla f(\bar{x})}{\|\nabla f(\bar{x})\|} = \frac{-\nabla f(\bar{x})}{\sqrt{\nabla f(\bar{x})^T \nabla f(\bar{x})}}$$

minimizes the above approximation over all directions of unit length. Indeed, for any direction $d$ with $\|d\| = 1$, the Cauchy–Schwarz inequality yields

$$\nabla f(\bar{x})^T d \geq -\|\nabla f(\bar{x})\| \cdot \|d\| = -\|\nabla f(\bar{x})\| = \nabla f(\bar{x})^T \bar{d}.$$

Of course, if $\nabla f(\bar{x}) = 0$, then $\bar{x}$ is a candidate for a local minimizer, i.e., $\bar{x}$ satisfies the first order necessary optimality condition. The direction $\bar{d} = -\nabla f(\bar{x})$ is called the direction of steepest descent at the point $\bar{x}$.

Note that $\bar{d} = -\nabla f(\bar{x})$ is a descent direction as long as $\nabla f(\bar{x}) \neq 0$. To see this, simply observe that $\bar{d}^T \nabla f(\bar{x}) = -\nabla f(\bar{x})^T \nabla f(\bar{x}) < 0$ so long as $\nabla f(\bar{x}) \neq 0$.

A natural consequence of this is the following algorithm, called the steepest descent algorithm.


Steepest Descent Algorithm:

Step 0: Given $x^0$, set $k \leftarrow 0$.

Step 1: $d^k = -\nabla f(x^k)$. If $d^k = 0$, then stop.

Step 2: Choose stepsize $\alpha^k$ by performing an exact (or inexact) line search.

Step 3: Set $x^{k+1} \leftarrow x^k + \alpha^k d^k$, $k \leftarrow k + 1$. Go to Step 1.

Note that from Step 2 and the fact that $d^k = -\nabla f(x^k)$ is a descent direction, it follows that $f(x^{k+1}) < f(x^k)$.

The following theorem establishes that under certain assumptions on $f$, the steepest descent algorithm converges regardless of the initial starting point $x^0$ (i.e., it exhibits global convergence).

Theorem 5.1 (Convergence Theorem; Steepest Descent with exact line search) Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable on the set $S = \{x \in \mathbb{R}^n : f(x) \leq f(x^0)\}$, and that $S$ is a closed and bounded set. Suppose further that the sequence $\{x^k\}$ is generated by the steepest descent algorithm with stepsizes $\alpha^k$ chosen by an exact line search. Then every point $\bar{x}$ that is a limit point of the sequence $\{x^k\}$ satisfies $\nabla f(\bar{x}) = 0$.

Proof: The proof of this theorem is by contradiction. By Weierstrass' Theorem, at least one limit point of the sequence $\{x^k\}$ must exist. Let $\bar{x}$ be any such limit point. Without loss of generality, assume that $\lim_{k \to \infty} x^k = \bar{x}$, but that $\nabla f(\bar{x}) \neq 0$. This being the case, there is a value of $\bar{\alpha} > 0$ such that $\delta := f(\bar{x}) - f(\bar{x} + \bar{\alpha}\bar{d}) > 0$, where $\bar{d} = -\nabla f(\bar{x})$. Then also $(\bar{x} + \bar{\alpha}\bar{d}) \in \operatorname{int} S$, because $f(\bar{x} + \bar{\alpha}\bar{d}) < f(\bar{x}) \leq f(x^0)$.

Let $\{d^k\}$ be the sequence of directions generated by the algorithm, i.e., $d^k = -\nabla f(x^k)$. Since $f$ is continuously differentiable, $\lim_{k \to \infty} d^k = \bar{d}$. Then since $(\bar{x} + \bar{\alpha}\bar{d}) \in \operatorname{int} S$, and $(x^k + \bar{\alpha}d^k) \to (\bar{x} + \bar{\alpha}\bar{d})$, for $k$ sufficiently large we have $x^k + \bar{\alpha}d^k \in S$ and

$$f(x^k + \bar{\alpha}d^k) \leq f(\bar{x} + \bar{\alpha}\bar{d}) + \frac{\delta}{2} = f(\bar{x}) - \delta + \frac{\delta}{2} = f(\bar{x}) - \frac{\delta}{2}.$$

However,

$$f(\bar{x}) \leq f(x^k + \alpha^k d^k) \leq f(x^k + \bar{\alpha}d^k) \leq f(\bar{x}) - \frac{\delta}{2},$$

which is, of course, a contradiction. Thus $\bar{d} = -\nabla f(\bar{x}) = 0$.

An example. Suppose $f(x)$ is a simple quadratic function of the form

$$f(x) = \frac{1}{2} x^T Q x + q^T x,$$

where $Q$ is a positive definite symmetric matrix. The optimal solution of (P) is easily computed as

$$x^\star = -Q^{-1} q$$

(since $Q$ is positive definite, it is non-singular) and direct substitution shows that the optimal objective function value is

$$f(x^\star) = -\frac{1}{2} q^T Q^{-1} q.$$


For convenience, let $x$ denote the current point in the steepest descent algorithm. We have

$$f(x) = \frac{1}{2} x^T Q x + q^T x,$$

and let $d$ denote the current direction, which is the negative of the gradient, i.e.,

$$d = -\nabla f(x) = -Qx - q.$$

Now let us compute the next iterate of the steepest descent algorithm. If $\alpha$ is the generic stepsize, then

$$\begin{aligned} f(x + \alpha d) &= \frac{1}{2}(x + \alpha d)^T Q (x + \alpha d) + q^T(x + \alpha d) \\ &= \frac{1}{2}x^T Q x + \alpha d^T Q x + \frac{1}{2}\alpha^2 d^T Q d + q^T x + \alpha q^T d \\ &= f(x) - \alpha d^T d + \frac{1}{2}\alpha^2 d^T Q d. \end{aligned}$$

Optimizing the value of $\alpha$ in this last expression yields

$$\alpha = \frac{d^T d}{d^T Q d},$$

and the next iterate of the algorithm then is

$$x' = x + \alpha d = x + \frac{d^T d}{d^T Q d}\, d, \quad \text{where } d = -Qx - q,$$

and

$$f(x') = f(x + \alpha d) = f(x) - \alpha d^T d + \frac{1}{2}\alpha^2 d^T Q d = f(x) - \frac{1}{2}\frac{(d^T d)^2}{d^T Q d}.$$

Suppose that

$$Q = \begin{pmatrix} 4 & -2 \\ -2 & 2 \end{pmatrix} \quad \text{and} \quad q = \begin{pmatrix} 2 \\ -2 \end{pmatrix}.$$

Then

$$\nabla f(x) = \begin{pmatrix} 4 & -2 \\ -2 & 2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} 2 \\ -2 \end{pmatrix},$$

and so

$$x^\star = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \quad \text{and} \quad f(x^\star) = -1.$$

Suppose that $x^0 = (0, 0)$. Then we have

$$x^1 = (-0.4, 0.4), \quad x^2 = (0, 0.8), \quad \text{etc.},$$

and the even numbered iterates satisfy

$$x^{2n} = (0,\ 1 - 0.2^n) \quad \text{and} \quad f(x^{2n}) = (1 - 0.2^n)^2 - 2 + 2(0.2)^n.$$
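These iterates can be reproduced with a few lines of code implementing the exact stepsize $\alpha = d^Td/(d^TQd)$ derived above:

```python
import numpy as np

Q = np.array([[4.0, -2.0], [-2.0, 2.0]])
q = np.array([2.0, -2.0])

def sd_step(x):
    d = -(Q @ x + q)               # steepest descent direction d = -grad f(x)
    alpha = (d @ d) / (d @ Q @ d)  # exact line search stepsize for a quadratic
    return x + alpha * d

x = np.array([0.0, 0.0])
iterates = [x]
for _ in range(10):
    x = sd_step(x)
    iterates.append(x)
# iterates[1] = (-0.4, 0.4), iterates[2] = (0, 0.8), iterates[10] = (0, 1 - 0.2^5)
```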


and so

$$\|x^{2n} - x^\star\| = 0.2^n, \qquad f(x^{2n}) - f(x^\star) = (0.2)^{2n}.$$

Therefore, starting from the point $x^0 = (0, 0)$, the distance from the current iterate to the optimal solution goes down by a factor of 0.2 after every two iterations of the algorithm (a similar observation can be made about the progress of the objective function values). The graph below plots the progress of the sequence $\|x^k - x^\star\|$ as a function of iteration number; notice that the y-axis is drawn on a logarithmic scale, which allows us to visualize the progress of the algorithm better as the values of $\|x^k - x^\star\|$ approach zero.

Although it is easy to find the optimal solution of the quadratic optimization problem in closed form, the above example is relevant in that it demonstrates the typical performance of the steepest descent algorithm. Additionally, most functions behave as near-quadratic functions in a neighborhood of the optimal solution, making the example even more relevant.

Termination criteria. Ideally, the algorithm will terminate at a point $x^k$ such that $\nabla f(x^k) = 0$. However, the algorithm is not guaranteed to find such a point in a finite amount of time. Moreover, due to rounding errors in computer calculations, the calculated value of the gradient will have some imprecision in it.

Therefore, in practical algorithms the termination criterion is designed to test if the above condition is satisfied approximately, so that the resulting output of the algorithm is an approximately optimal solution. A natural termination criterion for steepest descent could be $\|\nabla f(x^k)\| \leq \epsilon$, where $\epsilon > 0$ is a pre-specified tolerance. However, depending on the scaling of the function, this requirement can be either unnecessarily stringent, or too loose to ensure near-optimality (consider a problem concerned with minimizing distance, where the objective function can be expressed in inches, feet, or miles). Another alternative, which might alleviate the above consideration, is to terminate when $\|\nabla f(x^k)\| \leq \epsilon|f(x^k)|$; this, however, may lead to problems when the objective function at the optimum is zero. A combined approach is then to terminate when

$$\|\nabla f(x^k)\| \leq \epsilon(1 + |f(x^k)|).$$

The value of $\epsilon$ is typically taken to be at most the square root of the machine tolerance (e.g., $\epsilon = 10^{-8}$ if 16-digit computing is used), due to the error incurred in estimating derivatives.
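As a sketch, the combined criterion is a one-liner (the helper name is illustrative):

```python
import numpy as np

def should_terminate(grad_fx, fx, eps=1e-8):
    """Combined criterion: ||grad f(x_k)|| <= eps * (1 + |f(x_k)|)."""
    return np.linalg.norm(grad_fx) <= eps * (1.0 + abs(fx))
```

The added 1 keeps the test meaningful when $f(x^k)$ is near zero, while the $|f(x^k)|$ term adapts the tolerance to the scaling of the objective.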


5.3 Stepsize selection

In the analysis in the above subsection we assumed that the one-dimensional optimization problem invoked in the line search at each iteration of the steepest descent algorithm was solved exactly and with perfect precision, which is usually not possible. In this subsection we discuss one of the many practical ways of solving this problem approximately, to determine the stepsize at each iteration of the general directional search optimization algorithm (including steepest descent).

5.3.1 Stepsize selection basics

Suppose that $f(x)$ is a continuously differentiable function, and that we seek to (approximately) solve

$$\bar{\alpha} = \arg\min_{\alpha > 0} f(\bar{x} + \alpha\bar{d}),$$

where $\bar{x}$ is our current iterate, and $\bar{d}$ is the current direction generated by an algorithm that seeks to minimize $f(x)$. We assume that $\bar{d}$ is a descent direction, i.e., $\nabla f(\bar{x})^T \bar{d} < 0$. Let

$$F(\alpha) = f(\bar{x} + \alpha\bar{d}),$$

whereby $F(\alpha)$ is a function of the scalar variable $\alpha$, and our problem is to solve for

$$\bar{\alpha} = \arg\min_{\alpha > 0} F(\alpha).$$

Using the chain rule for differentiation, we can show that

$$F'(\alpha) = \nabla f(\bar{x} + \alpha\bar{d})^T \bar{d}.$$

Therefore, applying the necessary optimality conditions to the one-dimensional optimization problem above, we want to find a value $\bar{\alpha}$ for which $F'(\bar{\alpha}) = 0$. Furthermore, since $\bar{d}$ is a descent direction, $F'(0) < 0$.

5.3.2 Armijo rule, or backtracking

Although there are iterative algorithms developed to solve the problem $\min_{\alpha} F(\alpha)$ (or $F'(\alpha) = 0$) "exactly," i.e., with a high degree of precision (such as, for instance, the bisection search algorithm), they are typically too expensive computationally. (Recall that we need to perform a line search at every iteration of our steepest descent algorithm!) On the other hand, if we sacrifice accuracy of the line search, this can cause inferior performance of the overall algorithm.

The Armijo rule, or the backtracking method, is one of several inexact line search methods which guarantee a sufficient degree of improvement in the objective function to ensure the algorithm's convergence.

The Armijo rule requires two parameters: $0 < \mu < 0.5$ and $0 < \beta < 1$. Suppose we are minimizing a function $F(\alpha)$ such that $F'(0) < 0$ (which is indeed the case for the line search problems arising in descent algorithms). Then the first order approximation of $F(\alpha)$ at $\alpha = 0$ is given by $F(0) + \alpha F'(0)$. Define $\hat{F}(\alpha) = F(0) + \mu\alpha F'(0)$ (see figure). A stepsize $\bar{\alpha}$ is considered acceptable by the Armijo rule only if $F(\bar{\alpha}) \leq \hat{F}(\bar{\alpha})$, that is, if taking a step of size $\bar{\alpha}$ guarantees sufficient decrease of the function:

$$f(\bar{x} + \bar{\alpha}\bar{d}) - f(\bar{x}) \leq \mu\bar{\alpha}\nabla f(\bar{x})^T \bar{d}.$$


[Figure: plot of $F(\alpha)$ together with the lines $F(0) + \alpha F'(0)$ and $\hat{F}(\alpha) = F(0) + \mu\alpha F'(0)$.]

Note that the sufficient decrease condition will hold for any small value of $\alpha$. On the other hand, we would like to prevent the step size from being too small, for otherwise our overall optimization algorithm would not be making much progress. To combine these two considerations, we will implement the following iterative backtracking procedure (here we use $\beta = \frac{1}{2}$):

Backtracking line search

Step 0: Set $k = 0$, $\alpha_0 = 1$.

Step k: If $F(\alpha_k) \leq \hat{F}(\alpha_k)$, choose $\alpha_k$ as the step size; stop. If $F(\alpha_k) > \hat{F}(\alpha_k)$, let $\alpha_{k+1} \leftarrow \frac{1}{2}\alpha_k$, $k \leftarrow k + 1$.

Note that as a result of the above iterative scheme, the chosen stepsize is $\alpha = \frac{1}{2^t}$, where $t \geq 0$ is the smallest integer such that $F(1/2^t) \leq \hat{F}(1/2^t)$ (or, for general $\beta$, $F(\beta^t) \leq \hat{F}(\beta^t)$).

Typically, $\mu$ is chosen in the range between 0.01 and 0.3, and $\beta$ between 0.1 and 0.8.
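A minimal implementation of the backtracking procedure for generic $\mu$ and $\beta$, assuming the caller supplies the one-dimensional function $F$ and the slope $F'(0) = \nabla f(\bar{x})^T\bar{d}$:

```python
def backtracking(F, Fprime0, mu=0.2, beta=0.5, alpha0=1.0):
    """Armijo backtracking: shrink alpha until F(alpha) <= F(0) + mu * alpha * F'(0)."""
    assert Fprime0 < 0, "the direction must be a descent direction"
    alpha = alpha0
    while F(alpha) > F(0) + mu * alpha * Fprime0:
        alpha *= beta      # backtrack: alpha <- beta * alpha
    return alpha

# Example: f(x) = x^2 at x = 1 with d = -f'(1) = -2, so F(a) = (1 - 2a)^2, F'(0) = -4.
alpha = backtracking(lambda a: (1.0 - 2.0 * a) ** 2, -4.0)  # rejects alpha = 1, accepts 0.5
```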

Note that if $x^k$ and $x^{k+1}$ are the consecutive iterates of the general optimization algorithm with $d^k$ a descent direction, and the stepsizes chosen by backtracking, then $f(x^{k+1}) < f(x^k)$; that is, the algorithm is guaranteed to produce an improvement in the function value at every iteration. Under additional assumptions on $f$, it can also be shown that the steepest descent algorithm will demonstrate global convergence properties under the Armijo line search rule, as stated in the following theorem.

Theorem 5.2 (Convergence Theorem; Steepest Descent with backtracking line search) Suppose that the set $S = \{x \in \mathbb{R}^n : f(x) \leq f(x^0)\}$ is closed and bounded, and suppose that the gradient of $f$ is Lipschitz continuous on the set $S$, i.e., there exists a constant $G > 0$ such that

$$\|\nabla f(x) - \nabla f(y)\| \leq G\|x - y\| \quad \forall x, y \in S.$$

Suppose further that the sequence $\{x^k\}$ is generated by the steepest descent algorithm with stepsizes $\alpha^k$ chosen by a backtracking line search. Then every point $\bar{x}$ that is a limit point of the sequence $\{x^k\}$ satisfies $\nabla f(\bar{x}) = 0$.

The additional assumption, basically, ensures that the gradient of $f$ does not change too rapidly. In the proof of the theorem, this allows one to establish a lower bound on the stepsize at each iteration. (See any of the reference textbooks for details.)

Remark: Our discussion so far implicitly assumed that the domain of the optimization problem was the entire $\mathbb{R}^n$. If our optimization problem is

$$\text{(P)} \qquad \min f(x) \quad \text{s.t.}\ x \in X,$$


where $X$ is an open set, then the line-search problem is

$$\min f(\bar{x} + \alpha\bar{d}) \quad \text{s.t.}\ \bar{x} + \alpha\bar{d} \in X.$$

In this case, we must ensure that all iterate values of $\alpha$ in the backtracking algorithm satisfy $\bar{x} + \alpha\bar{d} \in X$. As an example, consider the following problem:

$$\text{(P)} \qquad \min\ f(x) := -\sum_{i=1}^m \ln(b_i - a_i^T x) \quad \text{s.t.}\ b - Ax > 0.$$

Here the domain of $f(x)$ is $X = \{x \in \mathbb{R}^n : b - Ax > 0\}$. Given a point $\bar{x} \in X$ and a direction $\bar{d}$, the line-search problem is:

$$\text{(LS)} \qquad \min\ h(\alpha) := f(\bar{x} + \alpha\bar{d}) = -\sum_{i=1}^m \ln(b_i - a_i^T(\bar{x} + \alpha\bar{d})) \quad \text{s.t.}\ b - A(\bar{x} + \alpha\bar{d}) > 0.$$

Standard arithmetic manipulation can be used to establish that

$$b - A(\bar{x} + \alpha\bar{d}) > 0 \quad \text{if and only if} \quad \check{\alpha} < \alpha < \hat{\alpha},$$

where

$$\check{\alpha} := -\min_{i : a_i^T\bar{d} < 0} \frac{b_i - a_i^T\bar{x}}{-a_i^T\bar{d}} \qquad \text{and} \qquad \hat{\alpha} := \min_{i : a_i^T\bar{d} > 0} \frac{b_i - a_i^T\bar{x}}{a_i^T\bar{d}},$$

and the line-search problem then is:

$$\text{(LS)} \qquad \min\ h(\alpha) := -\sum_{i=1}^m \ln(b_i - a_i^T(\bar{x} + \alpha\bar{d})) \quad \text{s.t.}\ 0 < \alpha < \hat{\alpha}.$$

The implementation of the backtracking rule for this problem would have to be modified: starting with $\alpha = 1$, we will backtrack, if necessary, until $\alpha < \hat{\alpha}$, and only then start checking the sufficient decrease conditions.
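A sketch of this modified rule for the log-barrier example above; the function name and the parameter choices $\mu = 0.2$, $\beta = \frac{1}{2}$ are illustrative, not from the notes:

```python
import numpy as np

def feasible_backtracking(A, b, xbar, dbar, mu=0.2, beta=0.5):
    """Backtracking for f(x) = -sum_i ln(b_i - a_i^T x) on the open set {x : b - Ax > 0}."""
    def f(x):
        s = b - A @ x
        return -np.sum(np.log(s)) if np.all(s > 0) else np.inf
    g = A.T @ (1.0 / (b - A @ xbar))  # grad f(xbar)
    slope = g @ dbar                  # F'(0); negative for a descent direction
    Ad = A @ dbar
    pos = Ad > 0
    # alpha_hat = min over {i : a_i^T d > 0} of (b_i - a_i^T xbar) / (a_i^T dbar)
    alpha_hat = np.min((b - A @ xbar)[pos] / Ad[pos]) if np.any(pos) else np.inf
    alpha = 1.0
    while alpha >= alpha_hat:         # first backtrack into the domain
        alpha *= beta
    while f(xbar + alpha * dbar) > f(xbar) + mu * alpha * slope:
        alpha *= beta                 # then backtrack for sufficient decrease
    return alpha

# Illustration in one variable: rows x <= 1 and -x <= 1, i.e. the domain -1 < x < 1.
alpha = feasible_backtracking(np.array([[1.0], [-1.0]]), np.array([1.0, 1.0]),
                              xbar=np.array([0.5]), dbar=np.array([-4.0 / 3.0]))
```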

5.4 Newton’s method for minimization

Again, we want to solve

$$\text{(P)} \qquad \min f(x), \quad x \in \mathbb{R}^n.$$

Newton's method can also be interpreted in the framework of the general optimization algorithm, but it truly stems from Newton's method for solving systems of nonlinear equations. Recall that if $\phi : \mathbb{R}^n \to \mathbb{R}^n$, to solve the system of equations

$$\phi(x) = 0,$$

one can apply an iterative method. Starting at a point $\bar{x}$, approximate the function $\phi$ by $\phi(\bar{x} + d) \approx \phi(\bar{x}) + \nabla\phi(\bar{x})^T d$, where $\nabla\phi(\bar{x})^T \in \mathbb{R}^{n \times n}$ is the Jacobian of $\phi$ at $\bar{x}$, and, provided that $\nabla\phi(\bar{x})$ is non-singular, solve the system of linear equations

$$\nabla\phi(\bar{x})^T d = -\phi(\bar{x})$$


to obtain $d$. Set the next iterate $x = \bar{x} + d$, and continue. This method is well-studied, and is well-known for its good performance when the starting point $\bar{x}$ is chosen appropriately. Newton's method for minimization is precisely an application of this equation-solving method to the (system of) first-order optimality conditions $\nabla f(x) = 0$.

Here is another view of the motivation behind Newton's method for optimization. At $x = \bar{x}$, $f(x)$ can be approximated by

$$f(x) \approx q(x) := f(\bar{x}) + \nabla f(\bar{x})^T(x - \bar{x}) + \frac{1}{2}(x - \bar{x})^T H(\bar{x})(x - \bar{x}),$$

which is the quadratic Taylor expansion of $f(x)$ at $x = \bar{x}$.

$q(x)$ is a quadratic function which is minimized by solving $\nabla q(x) = 0$, i.e., $\nabla f(\bar{x}) + H(\bar{x})(x - \bar{x}) = 0$, which yields

$$x - \bar{x} = -H(\bar{x})^{-1}\nabla f(\bar{x}).$$

The direction $-H(\bar{x})^{-1}\nabla f(\bar{x})$ is called the Newton direction, or the Newton step.

This leads to the following algorithm for solving (P):

Newton’s Method:

Step 0: Given $x^0$, set $k \leftarrow 0$.

Step 1: $d^k = -H(x^k)^{-1}\nabla f(x^k)$. If $d^k = 0$, then stop.

Step 2: Choose stepsize $\alpha^k = 1$.

Step 3: Set $x^{k+1} \leftarrow x^k + \alpha^k d^k$, $k \leftarrow k + 1$. Go to Step 1.

Proposition 5.3 If $H(x)$ is positive definite, then $d = -H(x)^{-1}\nabla f(x)$ is a descent direction.

Proof: It is sufficient to show that $\nabla f(x)^T d = -\nabla f(x)^T H(x)^{-1}\nabla f(x) < 0$. Since $H(x)$ is positive definite, if $v \neq 0$,

$$0 < (H(x)^{-1}v)^T H(x)(H(x)^{-1}v) = v^T H(x)^{-1}v,$$

so $H(x)^{-1}$ is also positive definite; taking $v = \nabla f(x) \neq 0$ completes the proof.

Note that:

• Work per iteration: $O(n^3)$.

• The iterates of Newton's method are, in general, equally attracted to local minima and local maxima. Indeed, the method is just trying to solve the system of equations $\nabla f(x) = 0$.

• The method assumes $H(x^k)$ is nonsingular at each iteration. Moreover, unless $H(x^k)$ is positive definite, $d^k$ is not guaranteed to be a descent direction.

• There is no guarantee that $f(x^{k+1}) \leq f(x^k)$.

• Step 2 could be augmented by a line search of $f(x^k + \alpha d^k)$ over the value of $\alpha$; then the previous consideration would not be an issue.

• What if $H(x^k)$ becomes increasingly singular (or not positive definite)? Use $H(x^k) + \epsilon I$.

• In general, points generated by Newton's method as it is described above may not converge. For example, $H(x^k)^{-1}$ may not exist. Even if $H(x)$ is always non-singular, the method may not converge, unless started "close enough" to the right point.
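A common safeguard combining several of these remarks, shifting the Hessian by $\epsilon I$ and adding a backtracking line search so that descent is guaranteed, might look like the following sketch (the parameter values are illustrative choices, not from the notes):

```python
import numpy as np

def damped_newton(f, grad, hess, x0, eps=1e-6, mu=0.2, tol=1e-10, max_iter=100):
    """Newton's method with Hessian shift H + eps*I and Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        H = hess(x) + eps * np.eye(x.size)  # shift toward positive definiteness
        d = np.linalg.solve(H, -g)          # (modified) Newton direction
        alpha = 1.0
        while f(x + alpha * d) > f(x) + mu * alpha * (g @ d):
            alpha *= 0.5                    # backtrack to guarantee descent
        x = x + alpha * d
    return x

# Sanity check on the convex quadratic f(x) = ||x||^2 / 2 (gradient x, Hessian I):
x = damped_newton(lambda x: 0.5 * (x @ x), lambda x: x,
                  lambda x: np.eye(2), [3.0, -4.0])
```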


Example 1: Let $f(x) = 7x - \ln(x)$. Then $\nabla f(x) = f'(x) = 7 - \frac{1}{x}$ and $H(x) = f''(x) = \frac{1}{x^2}$. It is not hard to check that $x^\star = \frac{1}{7} = 0.142857143$ is the unique global minimizer. The Newton direction at $x$ is

$$d = -H(x)^{-1}\nabla f(x) = -\frac{f'(x)}{f''(x)} = -x^2\left(7 - \frac{1}{x}\right) = x - 7x^2,$$

and is defined so long as $x > 0$. So, Newton's method will generate the sequence of iterates $\{x^k\}$ with $x^{k+1} = x^k + (x^k - 7(x^k)^2) = 2x^k - 7(x^k)^2$. Below are some examples of the sequences generated by this method for different starting points:

k     x^k           x^k           x^k
0     1             0.1           0.01
1     -5            0.13          0.0193
2                   0.1417        0.03599257
3                   0.14284777    0.062916884
4                   0.142857142   0.098124028
5                   0.142857143   0.128849782
6                                 0.14148377
7                                 0.142843938
8                                 0.142857142
9                                 0.142857143
10                                0.142857143

(Note that the iterate in the first column is not in the domain of the objective function, so the algorithm has to terminate with an error.) Below is a plot of the progress of the algorithm as a function of iteration number (for the two sequences that did converge):
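The recursion $x^{k+1} = 2x^k - 7(x^k)^2$ is easy to reproduce:

```python
def newton_f1(x0, iters=10):
    """Pure Newton's method for f(x) = 7x - ln(x): x_{k+1} = 2*x_k - 7*x_k**2."""
    xs = [x0]
    for _ in range(iters):
        xs.append(2 * xs[-1] - 7 * xs[-1] ** 2)
    return xs

good = newton_f1(0.1)          # 0.1, 0.13, 0.1417, ... -> 1/7
bad = newton_f1(1.0, iters=1)  # 1 -> -5, which leaves the domain x > 0
```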

Example 2: $f(x) = -\ln(1 - x_1 - x_2) - \ln x_1 - \ln x_2$.

$$\nabla f(x) = \begin{pmatrix} \dfrac{1}{1 - x_1 - x_2} - \dfrac{1}{x_1} \\[8pt] \dfrac{1}{1 - x_1 - x_2} - \dfrac{1}{x_2} \end{pmatrix},$$


$$H(x) = \begin{bmatrix} \left(\dfrac{1}{1 - x_1 - x_2}\right)^2 + \left(\dfrac{1}{x_1}\right)^2 & \left(\dfrac{1}{1 - x_1 - x_2}\right)^2 \\[10pt] \left(\dfrac{1}{1 - x_1 - x_2}\right)^2 & \left(\dfrac{1}{1 - x_1 - x_2}\right)^2 + \left(\dfrac{1}{x_2}\right)^2 \end{bmatrix}.$$

$$x^\star = \left(\tfrac{1}{3},\ \tfrac{1}{3}\right), \qquad f(x^\star) = 3.295836866.$$

k    x_1^k               x_2^k               ||x^k - x̄||
0    0.85                0.05                0.58925565098879
1    0.717006802721088   0.096598639455782   0.450831061926011
2    0.512975199133209   0.176479706723556   0.238483249157462
3    0.352478577567272   0.273248784105084   0.063061029429745
4    0.338449016006352   0.326238070059960   0.008747169263797
5    0.333337722134802   0.333259330511655   7.41328482837195e-5
6    0.333333343617612   0.333333327241280   1.19532211855443e-8
7    0.333333333333333   0.333333333333333   1.57009245868378e-16
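Using the gradient and Hessian above, a pure Newton iteration (with stepsize 1, solving $H(x^k)d^k = -\nabla f(x^k)$ rather than forming the inverse) reproduces the table:

```python
import numpy as np

def grad(x):
    t = 1.0 / (1.0 - x[0] - x[1])
    return np.array([t - 1.0 / x[0], t - 1.0 / x[1]])

def hess(x):
    t2 = (1.0 / (1.0 - x[0] - x[1])) ** 2
    return np.array([[t2 + 1.0 / x[0] ** 2, t2],
                     [t2, t2 + 1.0 / x[1] ** 2]])

xs = [np.array([0.85, 0.05])]
for _ in range(7):
    x = xs[-1]
    xs.append(x - np.linalg.solve(hess(x), grad(x)))  # full Newton step, alpha = 1
```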

Termination criteria. Since Newton's method is working with the Hessian as well as the gradient, it would be natural to augment the termination criterion we used in the steepest descent algorithm with the requirement that $H(x^k)$ is positive semi-definite, or, taking into account the potential for computational errors, that $H(x^k) + \epsilon I$ is positive semi-definite for some $\epsilon > 0$ (this parameter may be different from the one used in the condition on the gradient).


5.5 Comparing performance of the steepest descent and Newton algorithms

5.5.1 Rate of convergence

Suppose we have a converging sequence $\lim_{k\to\infty} s^k = \bar{s}$, and we would like to characterize the speed, or rate, at which the iterates $s^k$ approach the limit $\bar{s}$.

A converging sequence of numbers $\{s^k\}$ exhibits linear convergence if for some $0 \leq C < 1$,

$$\lim_{k\to\infty} \frac{|s^{k+1} - \bar{s}|}{|s^k - \bar{s}|} = C.$$

$C$ in the above expression is referred to as the rate constant; if $C = 0$, the sequence exhibits superlinear convergence.

A converging sequence of numbers $\{s^k\}$ exhibits quadratic convergence if

$$\lim_{k\to\infty} \frac{|s^{k+1} - \bar{s}|}{|s^k - \bar{s}|^2} = \delta < \infty.$$

Examples:

Linear convergence: $s^k = \left(\frac{1}{10}\right)^k$: $0.1, 0.01, 0.001$, etc.; $\bar{s} = 0$.

$$\frac{|s^{k+1} - \bar{s}|}{|s^k - \bar{s}|} = 0.1.$$

Superlinear convergence: $s^k = 0.1 \cdot \frac{1}{k!}$: $\frac{1}{10}, \frac{1}{20}, \frac{1}{60}, \frac{1}{240}, \frac{1}{1200}$, etc.; $\bar{s} = 0$.

$$\frac{|s^{k+1} - \bar{s}|}{|s^k - \bar{s}|} = \frac{k!}{(k+1)!} = \frac{1}{k+1} \to 0 \ \text{as}\ k \to \infty.$$

Quadratic convergence: $s^k = \left(\frac{1}{10}\right)^{(2^{k-1})}$: $0.1, 0.01, 0.0001, 0.00000001$, etc.; $\bar{s} = 0$.

$$\frac{|s^{k+1} - \bar{s}|}{|s^k - \bar{s}|^2} = \frac{(10^{2^{k-1}})^2}{10^{2^k}} = 1.$$
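All three behaviors are easy to verify numerically:

```python
import math

linear = [0.1 ** k for k in range(1, 8)]                      # s_k = (1/10)^k
superlinear = [0.1 / math.factorial(k) for k in range(1, 8)]  # s_k = 0.1 / k!
quadratic = [0.1 ** (2 ** (k - 1)) for k in range(1, 6)]      # s_k = (1/10)^(2^(k-1))

lin_ratios = [b / a for a, b in zip(linear, linear[1:])]            # constant 0.1
sup_ratios = [b / a for a, b in zip(superlinear, superlinear[1:])]  # 1/(k+1), tends to 0
quad_ratios = [b / a ** 2 for a, b in zip(quadratic, quadratic[1:])]  # constant 1
```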

This illustration compares the rates of convergence of the above sequences (note that the y-axis isdisplayed on the logarithmic scale):


We will use the notion of rate of convergence to analyze one aspect of the performance of optimization algorithms. Indeed, since an algorithm for nonlinear optimization problems, in its abstract form, generates an infinite sequence of points $\{x^k\}$ converging to a solution $\bar{x}$ only in the limit, it makes sense to discuss the rate of convergence of the sequences $\|e^k\| = \|x^k - \bar{x}\|$ or $E^k = |f(x^k) - f(\bar{x})|$, which both have limit 0.

5.5.2 Rate of convergence of the steepest descent algorithm for the case of a quadratic function

In this section we explore answers to the question of how fast the steepest descent algorithm converges. Recall that in the earlier example we observed linear convergence of both the sequences $\{E^k\}$ and $\{e^k\}$.

We will show now that the steepest descent algorithm with stepsizes selected by exact line search in general exhibits linear convergence, but that the rate constant depends very much on the ratio of the largest to the smallest eigenvalue of the Hessian matrix $H(x)$ at the optimal solution $x = x^\star$. In order to see how this dependence arises, we will examine the case where the objective function $f(x)$ is itself a simple quadratic function of the form

$$f(x) = \frac{1}{2}x^T Q x + q^T x,$$

where $Q$ is a positive definite symmetric matrix. We will suppose that the eigenvalues of $Q$ are

$$A = a_1 \geq a_2 \geq \ldots \geq a_n = a > 0,$$

i.e., $A$ and $a$ are the largest and smallest eigenvalues of $Q$.

We already derived that the optimal solution of (P) is

$$x^\star = -Q^{-1}q,$$

with the optimal objective function value

$$f(x^\star) = -\frac{1}{2}q^T Q^{-1} q.$$


Moreover, if $x$ is the current point in the steepest descent algorithm, then

$$f(x) = \frac{1}{2}x^T Q x + q^T x,$$

and the next iterate of the steepest descent algorithm with exact line search is

$$x' = x + \alpha d = x + \frac{d^T d}{d^T Q d}\, d,$$

where $d = -\nabla f(x)$ and

$$f(x') = f(x) - \alpha d^T d + \frac{1}{2}\alpha^2 d^T Q d = f(x) - \frac{1}{2}\frac{(d^T d)^2}{d^T Q d}.$$

Therefore,

$$\begin{aligned} \frac{f(x') - f(x^\star)}{f(x) - f(x^\star)} &= \frac{f(x) - \frac{1}{2}\frac{(d^Td)^2}{d^TQd} - f(x^\star)}{f(x) - f(x^\star)} \\ &= 1 - \frac{\frac{1}{2}\frac{(d^Td)^2}{d^TQd}}{\frac{1}{2}x^TQx + q^Tx + \frac{1}{2}q^TQ^{-1}q} \\ &= 1 - \frac{\frac{1}{2}\frac{(d^Td)^2}{d^TQd}}{\frac{1}{2}(Qx + q)^TQ^{-1}(Qx + q)} \\ &= 1 - \frac{(d^Td)^2}{(d^TQd)(d^TQ^{-1}d)} = 1 - \frac{1}{\beta}, \end{aligned}$$

where

$$\beta = \frac{(d^TQd)(d^TQ^{-1}d)}{(d^Td)^2}.$$

In order for the convergence constant to be good, which will translate to fast linear convergence, we would like the quantity $\beta$ to be small. The following result provides an upper bound on the value of $\beta$.

Kantorovich Inequality: Let $A$ and $a$ be the largest and the smallest eigenvalues of $Q$, respectively. Then

$$\beta \leq \frac{(A + a)^2}{4Aa}.$$

We will skip the proof of this inequality. Let us apply this inequality to the above analysis. Continuing, we have

$$\frac{f(x') - f(x^\star)}{f(x) - f(x^\star)} = 1 - \frac{1}{\beta} \leq 1 - \frac{4Aa}{(A + a)^2} = \frac{(A - a)^2}{(A + a)^2} = \left(\frac{A/a - 1}{A/a + 1}\right)^2.$$

Note by definition that $A/a$ is always at least 1. If $A/a$ is small (not much bigger than 1), then the convergence constant will be much smaller than 1. However, if $A/a$ is large, then the convergence constant will be only slightly smaller than 1. The following table shows some sample values:


A       a     1 − 1/β   Upper bound on number of iterations to reduce the optimality gap by 0.10
1.1     1.0   0.0023    1
3.0     1.0   0.25      2
10.0    1.0   0.67      6
100.0   1.0   0.96      58
200.0   1.0   0.98      116
400.0   1.0   0.99      231
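The entries follow from the bound just derived: the rate constant is $\left(\frac{A/a - 1}{A/a + 1}\right)^2$, and the iteration count is the smallest integer $n$ with $(\text{rate constant})^n \leq 0.10$:

```python
import math

def rate_constant(A, a):
    k = A / a                     # ratio of largest to smallest eigenvalue
    return ((k - 1.0) / (k + 1.0)) ** 2

def iters_to_reduce_gap(A, a, factor=0.10):
    """Smallest n with rate**n <= factor, i.e. n = ceil(log(factor) / log(rate))."""
    return math.ceil(math.log(factor) / math.log(rate_constant(A, a)))

bounds = [iters_to_reduce_gap(A, 1.0) for A in [1.1, 3.0, 10.0, 100.0, 200.0, 400.0]]
```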

Note that the number of iterations needed to reduce the optimality gap by 0.10 grows linearly in the ratio $A/a$.

Two pictures of possible iterations of the steepest descent algorithm are as follows:

[Two figures omitted: plots of steepest descent iterates over the region $[-2, 2] \times [-2, 2]$.]


Some remarks:

• We analyzed the convergence of the function values; the convergence of the algorithm iterates can be easily shown to be linear with the same rate constant.

• The bound on the rate of convergence is attained in practice quite often, which is unfortunate. The ratio of the largest to the smallest eigenvalue of a matrix is called the condition number of the matrix.

• What about non-quadratic functions? If the Hessian at the locally optimal solution is positive definite, the function behaves as a near-quadratic function in a neighborhood of that solution. The convergence exhibited by the iterates of the steepest descent algorithm will also be linear. The analysis of the non-quadratic case gets very involved; fortunately, the key intuition is obtained by analyzing the quadratic case.

• What about backtracking line search? Also linear convergence! (The rate constant depends in part on the backtracking parameters.)
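To see the linear rate concretely, here is a small Python sketch (a hypothetical setup, not part of the original notes): steepest descent with exact line search on a two-dimensional quadratic whose Hessian has eigenvalues a and A, started from a worst-case point where the per-iteration reduction ratio equals the rate constant exactly:

```python
# f(x) = (a*x1^2 + A*x2^2)/2; the Hessian diag(a, A) has eigenvalues a and A
a, A = 1.0, 10.0
delta = ((A - a) / (A + a)) ** 2          # guaranteed rate constant

def f(x):
    return 0.5 * (a * x[0] ** 2 + A * x[1] ** 2)

x = [1.0 / a, 1.0 / A]                    # a worst-case starting point
ratios = []
for k in range(20):
    g = [a * x[0], A * x[1]]              # gradient Q x
    # exact line search along -g: alpha = g^T g / g^T Q g
    alpha = (g[0] ** 2 + g[1] ** 2) / (a * g[0] ** 2 + A * g[1] ** 2)
    x_new = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
    ratios.append(f(x_new) / f(x))        # observed per-iteration reduction
    x = x_new
# from this starting point, every ratio equals delta = ((A-a)/(A+a))^2
```

Each iterate is a scaled, reflected copy of the previous one, which is why the worst-case bound is attained at every step.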

5.5.3 Rate of convergence of the pure Newton’s method

We have seen from our examples that, even for convex functions, Newton's method in its pure form (i.e., with stepsize of 1 at every iteration) does not guarantee descent at each iteration, and may produce a diverging sequence of iterates. Moreover, each iteration of Newton's method is much more computationally intensive than that of steepest descent. However, under certain conditions, the method exhibits a quadratic rate of convergence, making it the "ideal" method for solving convex optimization problems. Recall that a method exhibits quadratic convergence when ‖e_k‖ = ‖x^k − x̄‖ → 0 and

lim_{k→∞} ‖e_{k+1}‖ / ‖e_k‖² = C.

Roughly speaking, if the iterates converge quadratically, the accuracy (i.e., the number of correct digits) of the solution doubles in a fixed number of iterations.
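As an illustration (a hypothetical one-dimensional example): for f(x) = x − ln(x), with minimizer x̄ = 1, f′(x) = 1 − 1/x and f″(x) = 1/x², the pure Newton step maps the error e = x − x̄ exactly to −e², so the number of correct digits doubles every iteration:

```python
# minimize f(x) = x - ln(x); pure Newton: x+ = x - f'(x)/f''(x)
x = 1.4
errors = []
for _ in range(4):
    x = x - (1 - 1 / x) * x * x     # pure Newton step (stepsize 1)
    errors.append(abs(x - 1))
# errors come out as 0.16, 0.0256, 0.00065536, ~4.3e-07:
# each error is the square of the previous one
```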

There are many ways to state and prove results regarding the convergence of Newton's method. We provide one that offers particular insight into the circumstances under which pure Newton's method demonstrates quadratic convergence.

Let ‖v‖ denote the usual Euclidean norm of a vector, namely ‖v‖ := √(vᵀv). Recall that the operator norm of a matrix M is defined as follows:

‖M‖ := max_x {‖Mx‖ : ‖x‖ = 1}.

As a consequence of this definition, for any x, ‖Mx‖ ≤ ‖M‖ · ‖x‖.
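As a quick sanity check (hypothetical example, not from the notes): for a symmetric matrix the operator norm equals the largest absolute eigenvalue, and the maximum in the definition can be approximated by sampling unit vectors:

```python
import math

# M = [[2, 1], [1, 2]] is symmetric with eigenvalues 1 and 3, so ||M|| = 3
M = [[2.0, 1.0], [1.0, 2.0]]

def matvec(x):
    return [M[0][0] * x[0] + M[0][1] * x[1],
            M[1][0] * x[0] + M[1][1] * x[1]]

def norm(v):
    return math.sqrt(v[0] ** 2 + v[1] ** 2)

# approximate max { ||Mx|| : ||x|| = 1 } by sampling unit vectors
op_norm = max(norm(matvec([math.cos(t), math.sin(t)]))
              for t in (2 * math.pi * k / 1000 for k in range(1000)))
# op_norm is approximately 3, and ||Mx|| <= ||M|| * ||x|| for any x
```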

Theorem 5.4 (Quadratic convergence) Suppose f(x) is twice continuously differentiable and x* is a point for which ∇f(x*) = 0. Suppose H(x) satisfies the following conditions:

• there exists a scalar h > 0 for which ‖[H(x*)]⁻¹‖ ≤ 1/h;

• there exist scalars β > 0 and L > 0 for which ‖H(x) − H(y)‖ ≤ L‖x − y‖ for all x and y satisfying ‖x − x*‖ ≤ β and ‖y − x*‖ ≤ β.



Let x satisfy ‖x − x*‖ ≤ γδ, where 0 < γ < 1 and δ := min{β, 2h/(3L)}, and let x_N := x − H(x)⁻¹∇f(x). Then:

(i) ‖x_N − x*‖ ≤ ‖x − x*‖² · L / (2(h − L‖x − x*‖)),

(ii) ‖x_N − x*‖ < γ‖x − x*‖, and hence the iterates converge to x*,

(iii) ‖x_N − x*‖ ≤ ‖x − x*‖² · 3L/(2h).

The proof relies on the following two “elementary” facts.

Proposition 5.5 Suppose that M is a symmetric matrix. Then the following are equivalent:

1. h > 0 satisfies ‖M⁻¹‖ ≤ 1/h.

2. h > 0 satisfies ‖Mv‖ ≥ h · ‖v‖ for any vector v.

Proposition 5.6 Suppose that f(x) is twice differentiable. Then

∇f(z) − ∇f(x) = ∫₀¹ [H(x + t(z − x))] (z − x) dt.

Proof: Let Φ(t) := ∇f(x + t(z − x)). Then Φ(0) = ∇f(x) and Φ(1) = ∇f(z), and Φ′(t) = [H(x + t(z − x))] (z − x). From the fundamental theorem of calculus, we have:

∇f(z) − ∇f(x) = Φ(1) − Φ(0)
             = ∫₀¹ Φ′(t) dt
             = ∫₀¹ [H(x + t(z − x))] (z − x) dt.

Proof of Theorem 5.4

We have:

x_N − x* = x − H(x)⁻¹∇f(x) − x*
         = x − x* + H(x)⁻¹ (∇f(x*) − ∇f(x))      (since ∇f(x*) = 0)
         = x − x* + H(x)⁻¹ ∫₀¹ [H(x + t(x* − x))] (x* − x) dt      (from Proposition 5.6)
         = H(x)⁻¹ ∫₀¹ [H(x + t(x* − x)) − H(x)] (x* − x) dt.

Therefore

‖x_N − x*‖ ≤ ‖H(x)⁻¹‖ ∫₀¹ ‖H(x + t(x* − x)) − H(x)‖ · ‖x* − x‖ dt
           ≤ ‖x* − x‖ · ‖H(x)⁻¹‖ ∫₀¹ L · t · ‖x* − x‖ dt
           = ‖x* − x‖² ‖H(x)⁻¹‖ L ∫₀¹ t dt
           = ‖x* − x‖² ‖H(x)⁻¹‖ L / 2.



We now bound ‖H(x)⁻¹‖. Let v be any vector. Then

‖H(x)v‖ = ‖H(x*)v + (H(x) − H(x*))v‖
        ≥ ‖H(x*)v‖ − ‖(H(x) − H(x*))v‖
        ≥ h · ‖v‖ − ‖H(x) − H(x*)‖ ‖v‖      (from Proposition 5.5)
        ≥ h · ‖v‖ − L‖x* − x‖ · ‖v‖
        = (h − L‖x* − x‖) · ‖v‖.

Invoking Proposition 5.5 again, we see that this implies that

‖H(x)⁻¹‖ ≤ 1 / (h − L‖x* − x‖).

Combining this with the above yields

‖x_N − x*‖ ≤ ‖x* − x‖² · L / (2(h − L‖x* − x‖)),

which is (i) of the theorem. Because L‖x* − x‖ ≤ γ · 2h/3 < 2h/3, we have:

‖x_N − x*‖ ≤ ‖x* − x‖ · L‖x* − x‖ / (2(h − L‖x* − x‖))
           ≤ ‖x* − x‖ · (γ · 2h/3) / (2(h − 2h/3)) = γ‖x* − x‖,

which establishes (ii) of the theorem. Finally, we have

‖x_N − x*‖ ≤ ‖x* − x‖² · L / (2(h − L‖x* − x‖))
           ≤ ‖x* − x‖² · L / (2(h − 2h/3)) = ‖x* − x‖² · 3L/(2h),

which establishes (iii) of the theorem.
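The conclusion of the theorem can be checked numerically on a simple example (hypothetical, one-dimensional, not from the notes): for f(x) = x²/2 + x⁴/4 we have ∇f(x) = x + x³ and H(x) = 1 + 3x², with x* = 0. Since H(0) = 1 we may take h = 1, and since |H(x) − H(y)| = 3|x + y||x − y| ≤ 6|x − y| on [−1, 1] we may take L = 6:

```python
# check the bound |x_N - x*| <= (3L/(2h)) |x - x*|^2 for f(x) = x^2/2 + x^4/4,
# where grad f(x) = x + x^3, H(x) = 1 + 3x^2, x* = 0, h = 1, L = 6
h, L = 1.0, 6.0
for x in [0.1, 0.05, 0.01]:
    x_newton = x - (x + x ** 3) / (1 + 3 * x ** 2)   # pure Newton step
    # bound (iii) of Theorem 5.4
    assert abs(x_newton) <= (3 * L / (2 * h)) * x ** 2
```

The new error is bounded by a constant times the square of the old one, as the theorem predicts.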

Notice that the results regarding the convergence and rate of convergence in the above theorem are local, i.e., they apply only if the algorithm is initialized at certain starting points (the ones "sufficiently close" to the desired limit). In practice, it is not known how to pick such starting points, or how to check whether a proposed starting point is adequate. (With the very important exception of self-concordant functions.)

5.6 Further discussion and modifications of Newton's method

5.6.1 Global convergence for strongly convex functions with a two-phase Newton's method

We have noted that, to ensure descent at each iteration, Newton's method can be augmented by a line search. This idea can be formalized, and the efficiency of the resulting algorithm can be analyzed (see, for example, "Convex Optimization" by Stephen Boyd and Lieven Vandenberghe, available at http://www.stanford.edu/~boyd/cvxbook.html, for a fairly simple presentation of the analysis).

Suppose that f(x) is strongly convex on its domain, i.e., assume there exists µ > 0 such that the smallest eigenvalue of H(x) is greater than or equal to µ for all x, and that the Hessian is Lipschitz continuous everywhere on the domain of f. Suppose we apply Newton's method with the



stepsize at each iteration determined by the backtracking procedure of Section 5.3.2. That is, at each iteration of the algorithm we first attempt to take a full Newton step, but reduce the stepsize if the decrease in the function value is not sufficient. Then there exist positive numbers η and δ such that

• if ‖∇f(x^k)‖ ≥ η, then f(x^{k+1}) − f(x^k) ≤ −δ, and

• if ‖∇f(x^k)‖ < η, then the stepsize α_k = 1 will be selected, the next iterate will satisfy ‖∇f(x^{k+1})‖ < η, and so will all further iterates. Moreover, quadratic convergence will be observed in this phase.

As hinted above, the algorithm will proceed in two phases: while the iterates are far from the minimizer, a "dampening" of the Newton step will be required, but there will be a guaranteed decrease in the objective function values. This phase (referred to as the "dampened Newton phase") cannot take more than (f(x⁰) − f(x*))/δ iterations. Once the norm of the gradient becomes sufficiently small, no dampening of the Newton step will be required in the rest of the algorithm, and quadratic convergence will be observed, thus making it the "quadratically convergent phase."

Note that it is not necessary to know the values of η and δ to apply this version of the algorithm! The two-phase Newton's method is globally convergent; however, to ensure global convergence, the function being minimized needs to possess particularly nice global properties.
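A minimal sketch of such a two-phase (damped) Newton's method, on the strongly convex f(x) = x − ln(x) for x > 0 (the test function, starting point, and backtracking parameters σ and β are illustrative, not from the notes):

```python
import math

# f(x) = x - ln(x), f'(x) = 1 - 1/x, f''(x) = 1/x^2, minimizer x̄ = 1
def f(x):
    return x - math.log(x)

def damped_newton(x0, tol=1e-10, sigma=0.25, beta=0.5):
    x = x0
    full_steps = []                    # records whether the full step was taken
    for _ in range(100):
        g = 1 - 1 / x                  # f'(x)
        if abs(g) < tol:
            break
        d = -g * x * x                 # Newton direction: -f'(x)/f''(x)
        alpha = 1.0                    # first try the full Newton step
        # backtrack while the step leaves the domain or decrease is insufficient
        while x + alpha * d <= 0 or f(x + alpha * d) > f(x) + sigma * alpha * g * d:
            alpha *= beta
        full_steps.append(alpha == 1.0)
        x += alpha * d
    return x, full_steps

x_final, full_steps = damped_newton(5.0)
# early iterations dampen the step; near x̄ = 1 the full step is always accepted
```

Starting far from the minimizer, the first iterations backtrack (the dampened Newton phase); once close enough, α = 1 is accepted at every iteration and convergence is quadratic.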

5.6.2 Other modifications of Newton's method

We have seen that if Newton's method is initialized sufficiently close to the point x̄ such that ∇f(x̄) = 0 and H(x̄) is positive definite (i.e., x̄ is a local minimizer), then it will converge quadratically, using stepsizes of α = 1. There are three issues in the above statement that we should be concerned with:

• What if H(x̄) is singular, or nearly-singular?

• How do we know if we are “close enough,” and what to do if we are not?

• Can we modify Newton’s method to guarantee global convergence?

In the previous subsection we "assumed away" the first issue, and, under an additional assumption, showed how to address the other two. What if the function f is not strongly convex, and H(x) may approach singularity?

There are two popular approaches (which are actually closely related) to address these issues. The first approach ensures that the method always uses a descent direction. For example, instead of the direction −H(x^k)⁻¹∇f(x^k), use the direction −(H(x^k) + ε_k I)⁻¹∇f(x^k), where ε_k ≥ 0 is chosen so that the smallest eigenvalue of H(x^k) + ε_k I is bounded below by a fixed number δ > 0. It is important to choose the value of δ appropriately: if it is chosen to be too small, the matrix employed in computing the direction can become ill-conditioned if H(x̄) is nearly singular; if it is chosen to be too large, the direction becomes nearly that of the steepest descent algorithm, and hence only linear convergence can be guaranteed. Hence, the value of ε_k is often chosen dynamically.
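A sketch of this first approach for a 2×2 symmetric Hessian (the threshold δ, the closed-form eigenvalue formula, and the test matrix are illustrative):

```python
import math

# direction -(H + eps*I)^{-1} g for a 2x2 symmetric Hessian H and gradient g
def modified_newton_direction(H, g, delta=1e-2):
    # smallest eigenvalue of [[p, q], [q, r]]
    p, q, r = H[0][0], H[0][1], H[1][1]
    eig_min = (p + r) / 2 - math.sqrt(((p - r) / 2) ** 2 + q ** 2)
    eps = max(0.0, delta - eig_min)    # shift so eigmin(H + eps*I) >= delta
    a, b, c = p + eps, q, r + eps
    det = a * c - b * b
    # solve (H + eps*I) d = -g by Cramer's rule
    d = [(-g[0] * c + g[1] * b) / det,
         (-g[1] * a + g[0] * b) / det]
    return d, eps

# nearly singular Hessian: eigenvalues 2 and approximately 0
H = [[1.0, 1.0], [1.0, 1.0001]]
g = [1.0, 0.0]
d, eps = modified_newton_direction(H, g)
# since H + eps*I is positive definite, d is a descent direction: g^T d < 0
```

When the Hessian is already well-conditioned, ε_k = 0 and the pure Newton direction is recovered.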

The second approach is the so-called trust region method. Note that the main idea behind Newton's method is to represent the function f(x) by its quadratic approximation q_k(x) = f(x^k) + ∇f(x^k)ᵀ(x − x^k) + ½(x − x^k)ᵀH(x^k)(x − x^k) around the current iterate, and then minimize that approximation. While locally the approximation works quite well, this may no longer be the case when a large step is taken. The trust region methods hence find the next iterate by solving the following constrained optimization problem:

min q_k(x) s.t. ‖x − x^k‖ ≤ Δ_k,

i.e., not allowing the next iterate to be outside the neighborhood of x^k where the quadratic approximation is close to the original function f(x) (as it turns out, this problem is not much harder to solve than the unconstrained minimization of q_k(x)).

The value of Δ_k is set to represent the size of the region in which we can "trust" q_k(x) to provide a good approximation of f(x). Smaller values of Δ_k ensure that we are working with an accurate representation of f(x), but result in conservative steps. Larger values of Δ_k allow for larger steps, but may lead to inaccurate estimation of the objective function. To account for this, the value of Δ_k is updated dynamically throughout the algorithm: namely, it is increased if it is observed that q_k(x) provided an exceptionally good approximation of f(x) at the previous iteration, and decreased if the approximation was exceptionally bad.
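The update logic for Δ_k described above might be sketched as follows (the acceptance thresholds and the expansion/shrinkage factors are illustrative; implementations vary). The quality of the model is measured by the ratio ρ of the actual reduction in f to the reduction predicted by q_k:

```python
# update the trust region radius based on how well the quadratic model q_k
# predicted the actual decrease in f over the last step
def update_radius(delta_k, actual_reduction, predicted_reduction,
                  eta_good=0.75, eta_bad=0.25):
    rho = actual_reduction / predicted_reduction
    if rho > eta_good:
        return 2.0 * delta_k      # model was very accurate: expand the region
    if rho < eta_bad:
        return 0.5 * delta_k      # model was poor: shrink the region
    return delta_k                # otherwise keep the radius unchanged
```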