Line-Search Methods for Smooth Unconstrained Optimization
Daniel P. Robinson
Department of Applied Mathematics and Statistics
Johns Hopkins University
September 17, 2020
1 / 106
Outline
1 Generic Linesearch Framework
2 Computing a descent direction p_k (search direction)
  - Steepest descent direction
  - Modified Newton direction
  - Quasi-Newton directions for medium-scale problems
  - Limited-memory quasi-Newton directions for large-scale problems
  - Linear CG method for large-scale problems
3 Choosing the step length α_k (linesearch)
  - Backtracking-Armijo linesearch
  - Wolfe conditions
  - Strong Wolfe conditions
4 Complete Algorithms
  - Steepest-descent backtracking-Armijo linesearch method
  - Modified Newton backtracking-Armijo linesearch method
  - Modified Newton linesearch method based on the Wolfe conditions
  - Quasi-Newton linesearch method based on the Wolfe conditions
2 / 106
The problem

      minimize_{x ∈ R^n}  f(x)

where the objective function f : R^n → R
- assume that f ∈ C^1 (sometimes C^2) and is Lipschitz continuous
- in practice this assumption may be violated, but the algorithms we develop may still work
- in practice it is very rare to be able to provide an explicit minimizer
- we consider iterative methods: given starting guess x_0, generate a sequence {x_k} for k = 1, 2, . . .
- AIM: ensure that (a subsequence) has some favorable limiting properties
  - satisfies first-order necessary conditions
  - satisfies second-order necessary conditions

Notation: f_k = f(x_k), g_k = g(x_k), H_k = H(x_k)
4 / 106
The basic idea
- consider descent methods, i.e., f(x_{k+1}) < f(x_k)
- calculate a search direction p_k from x_k
- ensure that this direction is a descent direction, i.e.,

      g_k^T p_k < 0  if  g_k ≠ 0

  so that, for small steps along p_k, the objective function f will be reduced
- calculate a suitable steplength α_k > 0 so that

      f(x_k + α_k p_k) < f_k

- the computation of α_k is the linesearch, and may itself require an iterative procedure
- the generic update for linesearch methods is given by

      x_{k+1} = x_k + α_k p_k
5 / 106
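The generic update above can be sketched in a few lines. This is a minimal illustration, not the deck's algorithm: the direction rule (steepest descent) and the sufficient-decrease acceptance test (which previews the Armijo condition developed later in the deck) are stand-in choices, and all names are mine.

```python
def generic_linesearch_method(f, grad, x0, eta=0.1, tau=0.5, tol=1e-8, max_iter=1000):
    """Generic linesearch framework: x_{k+1} = x_k + alpha_k * p_k."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if max(abs(gi) for gi in g) <= tol:
            break                                   # approximate first-order point
        p = [-gi for gi in g]                       # steepest-descent direction
        slope = sum(gi * pi for gi, pi in zip(g, p))  # g^T p < 0 (descent)
        alpha = 1.0
        # shrink alpha until a sufficient-decrease test holds (previews Armijo)
        while (f([xi + alpha * pi for xi, pi in zip(x, p)])
               > f(x) + eta * alpha * slope) and alpha > 1e-16:
            alpha *= tau
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x

# Example: minimize f(x) = x1^2 + 4*x2^2 starting from (2, 1)
f = lambda x: x[0] ** 2 + 4 * x[1] ** 2
grad = lambda x: [2 * x[0], 8 * x[1]]
xstar = generic_linesearch_method(f, grad, [2.0, 1.0])
```

Swapping in better direction and steplength rules (modified Newton, quasi-Newton, Wolfe linesearches) is exactly what the rest of the deck develops.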
Issue 1: steps might be too long
Figure: The objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = (−1)^{k+1} and steps α_k = 2 + 3/2^{k+1} from x_0 = 2.

decrease in f is not proportional to the size of the directional derivative!
6 / 106
Issue 2: steps might be too short
Figure: The objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = −1 and steps α_k = 1/2^{k+1} from x_0 = 2.

decrease in f is not proportional to the size of the directional derivative!
7 / 106
What is a practical linesearch?
- in the "early" days exact linesearches were used, i.e., pick α_k to minimize f(x_k + α p_k)
  - an exact linesearch requires univariate minimization
  - cheap if f is simple, e.g., a quadratic function
  - generally very expensive and not cost effective
  - an exact linesearch may not be much better than an approximate linesearch
- modern methods use inexact linesearches
  - ensure steps are neither too long nor too short
  - make sure that the decrease in f is proportional to the directional derivative
  - try to pick an "appropriate" initial stepsize for fast convergence
    - related to how the search direction p_k is computed
- the descent direction (search direction) is also important
  - "bad" directions may not converge at all
  - more typically, "bad" directions may converge very slowly
8 / 106
Definition 2.1 (Steepest descent direction)
For a differentiable function f, the search direction

      p_k := −∇f(x_k) ≡ −g_k

is called the steepest-descent direction.

- p_k is a descent direction provided g_k ≠ 0
  - g_k^T p_k = −g_k^T g_k = −‖g_k‖_2^2 < 0
- p_k solves the problem

      minimize_{p ∈ R^n}  m_k^L(x_k + p)   subject to   ‖p‖_2 = ‖g_k‖_2

  - m_k^L(x_k + p) := f_k + g_k^T p
  - m_k^L(x_k + p) ≈ f(x_k + p)

Any method that uses the steepest-descent direction is a method of steepest descent.
11 / 106
Observation: the steepest-descent direction is also the unique solution to

      minimize_{p ∈ R^n}  f_k + g_k^T p + (1/2) p^T p

- approximates the Hessian H_k by the identity matrix I
- how often is this a good idea?
- is it a surprise that convergence is typically very slow?

Idea: choose a positive-definite B_k and compute the search direction as the unique minimizer

      p_k = argmin_{p ∈ R^n}  m_k^Q(p) := f_k + g_k^T p + (1/2) p^T B_k p

- p_k satisfies B_k p_k = −g_k
- why must B_k be positive definite?
  - B_k ≻ 0 =⇒ m_k^Q is strictly convex =⇒ unique solution
  - if g_k ≠ 0, then p_k ≠ 0 and

        p_k^T g_k = −p_k^T B_k p_k < 0  =⇒  p_k is a descent direction

- pick an "intelligent" B_k that has "useful" curvature information
- if H_k ≻ 0 and we choose B_k = H_k, then p_k is the Newton direction. Awesome!
13 / 106
Question: How do we choose the positive-definite matrix B_k?

Ideally, B_k is chosen such that
- ‖B_k − H_k‖ is "small"
- B_k = H_k when H_k is "sufficiently" positive definite.

Comments:
- for the remainder of this section, we omit the suffix k and write H = H_k, B = B_k, and g = g_k.
- for a symmetric matrix A ∈ R^{n×n}, we use the matrix norm

      ‖A‖_2 = max_{1≤j≤n} |λ_j|

  with {λ_j} the eigenvalues of A.
- the spectral decomposition of H is given by H = VΛV^T, where

      V = (v_1 v_2 · · · v_n)   and   Λ = diag(λ_1, λ_2, . . . , λ_n)

  with Hv_j = λ_j v_j and λ_1 ≥ λ_2 ≥ · · · ≥ λ_n.
- H is positive definite if and only if λ_j > 0 for all j.
- computing the spectral decomposition is, generally, very expensive!
14 / 106
Method 1: small scale problems (n < 1000)
Algorithm 1 Compute modified Newton matrix B from H
1: input H
2: Choose β > 1, the desired bound on the condition number of B.
3: Compute the spectral decomposition H = VΛV^T.
4: if H = 0 then
5:   Set ε = 1
6: else
7:   Set ε = ‖H‖_2/β > 0
8: end if
9: Compute Λ̄ = diag(λ̄_1, λ̄_2, . . . , λ̄_n) with

      λ̄_j = { λ_j   if λ_j ≥ ε
            { ε     otherwise

10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

- replaces the eigenvalues of H that are "not positive enough" with ε
- Λ̄ = Λ + D, where D = diag(max(0, ε − λ_1), max(0, ε − λ_2), . . . , max(0, ε − λ_n)) ⪰ 0
- B = H + E, where E = VDV^T ⪰ 0
15 / 106
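Algorithm 1 is short to implement with an eigendecomposition. The following NumPy sketch is mine (the function name and default β are illustrative, not from the slides):

```python
import numpy as np

def modified_newton_matrix_v1(H, beta=1e3):
    """Method 1 (Algorithm 1 sketch): floor eigenvalues of H below eps at eps."""
    lam, V = np.linalg.eigh(H)            # spectral decomposition H = V diag(lam) V^T
    eps = 1.0 if np.all(lam == 0) else np.linalg.norm(H, 2) / beta
    lam_bar = np.where(lam >= eps, lam, eps)   # "not positive enough" eigenvalues -> eps
    return V @ np.diag(lam_bar) @ V.T

# Indefinite Hessian from Example 2.2: H = diag(1, -2), so eps = 2/beta = 0.002
H = np.diag([1.0, -2.0])
B = modified_newton_matrix_v1(H, beta=1e3)   # B = diag(1, 0.002), cond(B) = 500 <= beta
```

Note that when H is already sufficiently positive definite (all λ_j ≥ ε), the function returns H unchanged, matching the design goal B = H in that case.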
Question: What are the properties of the resulting search direction?

      p = −B^{−1}g = −VΛ̄^{−1}V^T g = −Σ_{j=1}^{n} (v_j^T g / λ̄_j) v_j

Taking norms and using the orthogonality of V gives

      ‖p‖_2^2 = ‖B^{−1}g‖_2^2 = Σ_{j=1}^{n} (v_j^T g)^2/λ̄_j^2
              = Σ_{λ_j ≥ ε} (v_j^T g)^2/λ_j^2 + Σ_{λ_j < ε} (v_j^T g)^2/ε^2

Thus, we conclude that

      ‖p‖_2 = O(1/ε)   provided v_j^T g ≠ 0 for at least one λ_j < ε

- the step p becomes unbounded as ε → 0 provided v_j^T g ≠ 0 for at least one λ_j < ε
- any indefinite matrix will generally satisfy this property
- the next method that we discuss is better!
16 / 106
Method 2: small scale problems (n < 1000)
Algorithm 2 Compute modified Newton matrix B from H
1: input H
2: Choose β > 1, the desired bound on the condition number of B.
3: Compute the spectral decomposition H = VΛV^T.
4: if H = 0 then
5:   Set ε = 1
6: else
7:   Set ε = ‖H‖_2/β > 0
8: end if
9: Compute Λ̄ = diag(λ̄_1, λ̄_2, . . . , λ̄_n) with

      λ̄_j = { λ_j   if λ_j ≥ ε
            { −λ_j  if λ_j ≤ −ε
            { ε     otherwise

10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

- replaces small eigenvalues λ_j of H with ε
- replaces "sufficiently negative" eigenvalues λ_j of H with −λ_j
- Λ̄ = Λ + D, where D = diag(max(0, −2λ_1, ε − λ_1), . . . , max(0, −2λ_n, ε − λ_n)) ⪰ 0
- B = H + E, where E = VDV^T ⪰ 0
17 / 106
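Method 2 differs from Method 1 only in the eigenvalue map: sufficiently negative eigenvalues are flipped in sign rather than floored at ε. A sketch under the same assumptions as before (names mine):

```python
import numpy as np

def modified_newton_matrix_v2(H, beta=1e3):
    """Method 2 (Algorithm 2 sketch): flip sufficiently negative eigenvalues,
    floor the small ones at eps."""
    lam, V = np.linalg.eigh(H)
    eps = 1.0 if np.all(lam == 0) else np.linalg.norm(H, 2) / beta
    lam_bar = np.where(lam >= eps, lam, np.where(lam <= -eps, -lam, eps))
    return V @ np.diag(lam_bar) @ V.T

# On Example 2.2's Hessian, the eigenvalue -2 is flipped to 2 rather than
# replaced by a tiny eps, so the resulting step stays a moderate length:
H = np.diag([1.0, -2.0])
B = modified_newton_matrix_v2(H)      # B = diag(1, 2)
```

This is why the slide's bound ‖E‖_2 ≤ 2 max(ε, |λ_n|) holds: the modification never moves an eigenvalue farther than twice its magnitude (or ε).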
Suppose that B = H + E is computed from Algorithm 2, so that

      E = B − H = V(Λ̄ − Λ)V^T

and, since V is orthogonal,

      ‖E‖_2 = ‖V(Λ̄ − Λ)V^T‖_2 = ‖Λ̄ − Λ‖_2 = max_j |λ̄_j − λ_j|

By definition

      λ̄_j − λ_j = { 0        if λ_j ≥ ε
                  { ε − λ_j   if −ε < λ_j < ε
                  { −2λ_j     if λ_j ≤ −ε

which implies that

      ‖E‖_2 = max_{1≤j≤n} { 0, ε − λ_j, −2λ_j }.

However, since λ_1 ≥ λ_2 ≥ · · · ≥ λ_n, we know that

      ‖E‖_2 = max { 0, ε − λ_n, −2λ_n }

- if λ_n ≥ ε, i.e., H is sufficiently positive definite, then E = 0 and B = H
- if λ_n < ε, then B ≠ H and it can be shown that ‖E‖_2 ≤ 2 max(ε, |λ_n|)
- regardless, B is well-conditioned by construction
18 / 106
Example 2.2
Consider

      g = (2, 4)^T   and   H = [ 1  0 ]
                               [ 0 −2 ]

The Newton direction is

      p^N = −H^{−1}g = (−2, 2)^T

so that g^T p^N = 4 > 0 (ascent direction), and p^N is a saddle point of the quadratic
model g^T p + (1/2) p^T Hp since H is indefinite.

Algorithm 1 (Method 1) generates

      B = [ 1 0 ]   so that   p = −B^{−1}g = (−2, −4/ε)^T
          [ 0 ε ]

and g^T p = −4 − 16/ε < 0 (descent direction)

Algorithm 2 (Method 2) generates

      B = [ 1 0 ]   so that   p = −B^{−1}g = (−2, −2)^T
          [ 0 2 ]

and g^T p = −12 < 0 (descent direction)
19 / 106
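The numbers in Example 2.2 are easy to verify numerically. The sketch below reproduces the three directions; the value ε = 0.01 is an illustrative choice of mine, not from the slides:

```python
import numpy as np

# Numerical check of Example 2.2: the Newton direction is an ascent direction,
# while both modified-Newton directions are descent directions.
g = np.array([2.0, 4.0])
H = np.array([[1.0, 0.0], [0.0, -2.0]])
eps = 0.01                                   # illustrative value for Method 1

pN = -np.linalg.solve(H, g)                  # Newton direction (-2, 2)
B1 = np.diag([1.0, eps])                     # Method 1 modification of H
B2 = np.diag([1.0, 2.0])                     # Method 2 modification of H
p1 = -np.linalg.solve(B1, g)                 # (-2, -4/eps): huge for small eps
p2 = -np.linalg.solve(B2, g)                 # (-2, -2): moderate length

print(g @ pN)   # > 0: ascent
print(g @ p1)   # = -4 - 16/eps < 0: descent, but nearly unbounded as eps -> 0
print(g @ p2)   # = -12 < 0: descent
```

This makes concrete the earlier ‖p‖_2 = O(1/ε) complaint about Method 1 and why Method 2 is preferred.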
Figure: the Newton direction p^N versus the modified-Newton directions p produced by Method 1 and by Method 2.
Question: What is the geometric interpretation?
Answer: Let the spectral decomposition of H be H = VΛV^T and assume that λ_j ≠ 0 for
all j. Partition the eigenvector and eigenvalue matrices as

      Λ = [ Λ_+     ]   and   V = [ V_+  V_− ]
          [     Λ_− ]

where Λ_+ collects the eigenvalues λ_j > 0 (with eigenvectors V_+) and Λ_− collects the
eigenvalues λ_j < 0 (with eigenvectors V_−), so that

      H = [ V_+  V_− ] [ Λ_+     ] [ V_+^T ]
                       [     Λ_− ] [ V_−^T ]

and the Newton direction is

      p^N = −VΛ^{−1}V^T g = −[ V_+  V_− ] [ Λ_+^{−1}         ] [ V_+^T g ]
                                          [         Λ_−^{−1} ] [ V_−^T g ]
          = −V_+ Λ_+^{−1} V_+^T g − V_− Λ_−^{−1} V_−^T g
          := p^N_+ + p^N_−

where

      p^N_+ = −V_+ Λ_+^{−1} V_+^T g   and   p^N_− = −V_− Λ_−^{−1} V_−^T g
21 / 106
Result
Given m_k^Q(p) = g^T p + (1/2) p^T Hp with H nonsingular, the Newton direction

      p^N = p^N_+ + p^N_− = −V_+ Λ_+^{−1} V_+^T g − V_− Λ_−^{−1} V_−^T g

is such that
- p^N_+ minimizes m_k^Q(p) on the subspace spanned by V_+.
- p^N_− maximizes m_k^Q(p) on the subspace spanned by V_−.
22 / 106
Proof:
The function ψ(y) := m_k^Q(V_+ y) may be written as

      ψ(y) = g^T V_+ y + (1/2) y^T V_+^T H V_+ y

and has a stationary point y* given by

      y* = −(V_+^T H V_+)^{−1} V_+^T g

Since

      ∇²ψ(y) = V_+^T H V_+ = V_+^T VΛV^T V_+ = Λ_+ ≻ 0

we know that y* is, in fact, the unique minimizer of ψ(y) and

      V_+ y* = −V_+ Λ_+^{−1} V_+^T g ≡ p^N_+

For the second part,

      y* = −(V_−^T H V_−)^{−1} V_−^T g = −Λ_−^{−1} V_−^T g

so that

      V_− y* = −V_− Λ_−^{−1} V_−^T g ≡ p^N_−.

Since Λ_− ≺ 0, we know that p^N_− maximizes m_k^Q(p) on the space spanned by the
columns of V_−.
23 / 106
Some observations
- p^N_+ is a descent direction for f:

      g^T p^N_+ = −g^T V_+ Λ_+^{−1} V_+^T g = −y^T Λ_+^{−1} y ≤ 0

  where y := V_+^T g. The inequality is strict if y ≠ 0, i.e., g is not orthogonal to the
  column space of V_+.
- p^N_− is an ascent direction for f:

      g^T p^N_− = −g^T V_− Λ_−^{−1} V_−^T g = −y^T Λ_−^{−1} y ≥ 0

  where y := V_−^T g. The inequality is strict if y ≠ 0, i.e., g is not orthogonal to the
  column space of V_−.
- p^N = p^N_+ + p^N_− may or may not be a descent direction!

This is the geometry of the Newton direction. What about the modified Newton direction?
24 / 106
Question: what is the geometry of the modified-Newton step?
Answer: Partition the matrices in the spectral decomposition H = VΛV^T such that

      V = [V_+  V_ε  V_−]   and   Λ = diag(Λ_+, Λ_ε, Λ_−)

where

      Λ_+ = {λ_i : λ_i ≥ ε}   Λ_ε = {λ_i : |λ_i| < ε}   Λ_− = {λ_i : λ_i ≤ −ε}

Theorem 2.3 (Properties of the modified-Newton direction)
Suppose that the positive-definite matrix B_k is computed from Algorithm 2 with input H_k
and value ε > 0. If g_k ≠ 0 and p_k is the unique solution to B_k p = −g_k, then p_k may be
written as p_k = p_+ + p_ε + p_−, where
(i) p_+ is a direction of positive curvature that minimizes m_k^Q(p) in the space spanned
by the columns of V_+;
(ii) −(p_−) is a direction of negative curvature that maximizes m_k^Q(p) in the space
spanned by the columns of V_−; and
(iii) p_ε is a multiple of the steepest-descent direction in the space spanned by the
columns of V_ε with ‖p_ε‖_2 = O(1/ε).

Comments:
- implies that singular Hessians H_k may easily cause numerical difficulties
- the direction p_− is the negative of the p^N_− associated with the Newton step
25 / 106
Main goals of quasi-Newton methods
- Maintain a positive-definite "approximation" B_k to H_k
- Instead of computing B_{k+1} from scratch, e.g., from a spectral decomposition of
  H_{k+1}, compute B_{k+1} as an update to B_k using information at x_{k+1}
- inject "real" curvature information from H(x_{k+1}) into B_{k+1}
- B_k should be cheap to form and compute with
- be applicable for medium-scale problems (n ≈ 1,000–10,000)
- Hope that fast (superlinear) convergence of the iterates {x_k} computed using B_k
  can be recovered
27 / 106
The optimization problem solved in the (k+1)st iteration will be

      minimize_{p ∈ R^n}  m_{k+1}^Q(p) := f_{k+1} + g_{k+1}^T p + (1/2) p^T B_{k+1} p

How do we choose B_{k+1}? How about to satisfy the following:
- the model should agree with f at x_{k+1}
- the gradient of the model should agree with the gradient of f at x_{k+1}
- the gradient of the model should agree with the gradient of f at x_k (previous point)

These are equivalent to
1. m_{k+1}^Q(0) = f_{k+1}   (already satisfied)
2. ∇m_{k+1}^Q(0) = g_{k+1}   (already satisfied)
3. ∇m_{k+1}^Q(−α_k p_k) = g_k, i.e.,

      B_{k+1}(−α_k p_k) + g_{k+1} = g_k  ⇐⇒  B_{k+1}(x_{k+1} − x_k) = g_{k+1} − g_k

Thus, we aim to cheaply compute a symmetric positive-definite matrix B_{k+1} that
satisfies the so-called secant equation

      B_{k+1} s_k = y_k   where   s_k := x_{k+1} − x_k   and   y_k := g_{k+1} − g_k
28 / 106
Observation: If the matrix B_{k+1} is positive definite, then the curvature condition

      s_k^T y_k > 0                                                        (1)

is satisfied.
- using the fact that B_{k+1} s_k = y_k, we see that

      s_k^T y_k = s_k^T B_{k+1} s_k

  which must be positive if B_{k+1} ≻ 0
- the previous relation explains why (1) is called a curvature condition
- (1) will hold for any two points x_k and x_{k+1} if f is strictly convex
- (1) does not always hold for any two points x_k and x_{k+1} if f is nonconvex
- for nonconvex functions, we need to be careful how we choose the α_k that defines
  x_{k+1} = x_k + α_k p_k (see Wolfe conditions for linesearch strategies)

Note: for the remainder, I will drop the subscript k from all terms.
29 / 106
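The claim that (1) always holds for strictly convex f is easy to see for a quadratic: if f(x) = (1/2)x^T Ax with A ≻ 0, then y = g(x₁) − g(x₀) = As, so s^T y = s^T As > 0 for any s ≠ 0. A quick numerical check (the matrix A is an arbitrary SPD example of mine):

```python
import numpy as np

# For a strictly convex quadratic f(x) = 1/2 x^T A x with A symmetric positive
# definite, the curvature condition s^T y > 0 holds for ANY pair of points,
# since y = g(x1) - g(x0) = A s.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])       # symmetric positive definite
for _ in range(100):
    x0, x1 = rng.standard_normal(2), rng.standard_normal(2)
    s = x1 - x0
    y = A @ x1 - A @ x0                      # gradient difference
    if s @ s > 0:
        assert s @ y > 0                     # s^T y = s^T A s > 0
```

For nonconvex f (e.g., the indefinite H of Example 2.2) the same computation can produce s^T y < 0, which is precisely why the Wolfe conditions matter later.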
The problem
Given a symmetric positive-definite matrix B, and vectors s and y, find a symmetric
positive-definite matrix B̄ that satisfies the secant equation

      B̄s = y

Want a cheap update, so let us first consider a rank-one update of the form

      B̄ = B + uv^T

for some vectors u and v that we seek.

Derivation:
Case 1: Bs = y
Choose u = v = 0, since then

      B̄s = (B + uv^T)s = Bs + u(v^T s) = y

Case 2: Bs ≠ y
If u = 0 or v = 0 then

      B̄s = (B + uv^T)s = Bs + u(v^T s) = Bs ≠ y

So search for u ≠ 0 and v ≠ 0. Symmetry of B and the desired symmetry of B̄ imply that

      B̄ = B̄^T  ⇐⇒  B + uv^T = B^T + vu^T  ⇐⇒  uv^T = vu^T

But uv^T = vu^T, u ≠ 0, and v ≠ 0 imply that v = αu for some α ≠ 0 (exercise). Thus,
we should search for an α ≠ 0 and u ≠ 0 such that the secant condition is satisfied by

      B̄ = B + αuu^T
30 / 106
From the previous slide, we hope to find an α ≠ 0 and u ≠ 0 so that

      B̄ = B + αuu^T

satisfies

      y (want) = B̄s = Bs + αuu^T s = Bs + α(u^T s)u

which implies

      α(u^T s)u = y − Bs ≠ 0.

This implies that

      u = β(y − Bs)   for some β ≠ 0.

Plugging back in, we are searching for

      B̄ = B + αβ²(y − Bs)(y − Bs)^T = B + γ(y − Bs)(y − Bs)^T   for some γ ≠ 0

that satisfies the secant equation

      y (want) = B̄s = Bs + γ[(y − Bs)^T s](y − Bs)

which implies

      y − Bs = γ[(y − Bs)^T s](y − Bs)

which means we can choose

      γ = 1/[(y − Bs)^T s]   provided (y − Bs)^T s ≠ 0.
31 / 106
Definition 2.4 (SR1 quasi-Newton formula)
If (y − Bs)^T s ≠ 0 for some symmetric matrix B and vectors s and y, then the symmetric
rank-1 (SR1) quasi-Newton formula for updating B is

      B̄ = B + (y − Bs)(y − Bs)^T / [(y − Bs)^T s]

and satisfies
- B̄ is symmetric
- B̄s = y (secant condition)

In practice, we choose some κ ∈ (0, 1) and then use

      B̄ = { B + (y − Bs)(y − Bs)^T/[(y − Bs)^T s]   if |s^T(y − Bs)| > κ‖s‖_2‖y − Bs‖_2
          { B                                        otherwise

to avoid numerical error.

Question: Is B̄ positive definite if B is positive definite?
Answer: Sometimes . . . but sometimes not!
32 / 106
Example 2.5 (SR1 update is not necessarily positive definite)
Consider the SR1 update with

      B = [ 2 1 ]   s = (−1, −1)^T   and   y = (−3, 2)^T
          [ 1 1 ]

The matrix B is positive definite with eigenvalues ≈ {0.38, 2.6}. Moreover,

      y^T s = (−3)(−1) + (2)(−1) = 1 > 0

so that the necessary condition for B̄ to be positive definite holds. The SR1 update
gives

      y − Bs = (−3, 2)^T − (−3, −2)^T = (0, 4)^T

      (y − Bs)^T s = (0)(−1) + (4)(−1) = −4

      B̄ = [ 2 1 ] + (1/(−4)) [ 0 0  ] = [ 2  1 ]
          [ 1 1 ]            [ 0 16 ]   [ 1 −3 ]

The matrix B̄ is indefinite with eigenvalues ≈ {2.19, −3.19}.
33 / 106
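The SR1 formula with the slide's skipping safeguard fits in a few lines, and Example 2.5's data makes a good test case. This is a sketch; the function name and default κ are mine:

```python
import numpy as np

def sr1_update(B, s, y, kappa=1e-8):
    """SR1 update with the skipping safeguard from Definition 2.4."""
    r = y - B @ s
    denom = r @ s
    if abs(denom) <= kappa * np.linalg.norm(s) * np.linalg.norm(r):
        return B.copy()                       # skip the update to avoid numerical error
    return B + np.outer(r, r) / denom

# Example 2.5: a positive-definite B can produce an indefinite SR1 update.
B = np.array([[2.0, 1.0], [1.0, 1.0]])
s = np.array([-1.0, -1.0])
y = np.array([-3.0, 2.0])
B_new = sr1_update(B, s, y)
print(B_new)                                  # [[2, 1], [1, -3]]
print(np.linalg.eigvalsh(B_new))              # approx [-3.19, 2.19]: indefinite
```

Note that the secant condition B̄s = y holds even though positive definiteness is lost, which is exactly the behavior the slides warn about.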
Theorem 2.6 (convergence of SR1 update)
Suppose that
- f is C² on R^n
- lim_{k→∞} x_k = x* for some vector x*
- ∇²f is bounded and Lipschitz continuous in a neighborhood of x*
- the steps s_k are uniformly linearly independent
- |s_k^T(y_k − B_k s_k)| > κ‖s_k‖_2‖y_k − B_k s_k‖_2 for some κ ∈ (0, 1)

Then, the matrices generated by the SR1 formula satisfy

      lim_{k→∞} B_k = ∇²f(x*)

Comment: the SR1 updates may converge to an indefinite matrix
34 / 106
Some comments
- if B is symmetric, then the SR1 update B̄ (when it exists) is symmetric
- if B is positive definite, then the SR1 update B̄ (when it exists) is not necessarily
  positive definite
- linesearch methods require positive-definite matrices
- SR1 updating is not appropriate (in general) for linesearch methods
- SR1 updating is an attractive option for optimization methods that effectively utilize
  indefinite matrices (e.g., trust-region methods)
- to derive an updating strategy that produces positive-definite matrices, we have to
  look beyond rank-one updates. Would a rank-two update work?
35 / 106
For hereditary positive definiteness we must consider updates of at least rank two.

Result
U is a symmetric matrix of rank two if and only if

      U = βuu^T + δww^T,

for some nonzero β and δ, with u and w linearly independent.

If

      B̄ = B + U

for some rank-two matrix U and B̄s = y for some vectors s and y, then

      B̄s = (B + U)s = y,

and it follows that

      y − Bs = Us = β(u^T s)u + δ(w^T s)w
36 / 106
Let us make the "obvious" choices of

      u = Bs   and   w = y

which gives the rank-two update

      U = βBs(Bs)^T + δyy^T

and we now search for β and δ so that B̄s = y is satisfied. Substituting, we have

      y (want) = B̄s = (B + βBs(Bs)^T + δyy^T)s
               = Bs + β(s^T Bs)Bs + δ(y^T s)y

Making the "simple" choice of

      β = −1/(s^T Bs)   and   δ = 1/(y^T s)

gives

      B̄s = Bs − (s^T Bs)Bs/(s^T Bs) + (y^T s)y/(y^T s) = Bs − Bs + y = y

as desired. This gives the Broyden, Fletcher, Goldfarb, Shanno (BFGS) quasi-Newton
update formula.
37 / 106
Broyden, Fletcher, Goldfarb, Shanno (BFGS) update
Given a symmetric positive-definite matrix B and vectors s and y that satisfy y^T s > 0,
the quasi-Newton BFGS update formula is given by

      B̄ = B − Bs(Bs)^T/(s^T Bs) + yy^T/(y^T s)                            (2)

Result
If B is positive definite and y^T s > 0, then

      B̄ = B − Bs(Bs)^T/(s^T Bs) + yy^T/(y^T s)

is positive definite and satisfies B̄s = y.

Proof: Homework assignment. (hint: see the next slide)
38 / 106
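Formula (2) is a one-liner in NumPy. The sketch below checks the two advertised properties (secant condition, hereditary positive definiteness) on random data; the way y is generated (y = As for an SPD "model Hessian" A, guaranteeing y^T s > 0) is an assumption of mine for testing purposes:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update (2); requires y^T s > 0 and B symmetric positive definite."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)                  # symmetric positive definite
A = M @ M.T + np.eye(4)                      # model Hessian, also SPD
s = rng.standard_normal(4)
y = A @ s                                    # then y^T s = s^T A s > 0
B_new = bfgs_update(B, s, y)
print(np.allclose(B_new @ s, y))             # True: secant condition
print(np.all(np.linalg.eigvalsh(B_new) > 0)) # True: positive definiteness preserved
```

The secant check also retraces the algebra on the previous slide: the β and δ choices make the Bs terms cancel exactly.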
A second derivation of the BFGS formula

Let M = B^{−1} for some given symmetric positive-definite matrix B. Solve

      minimize_{M̄ ∈ R^{n×n}}  ‖M̄ − M‖_W   subject to   M̄ = M̄^T,  M̄y = s

for a "closest" symmetric matrix M̄, where

      ‖A‖_W = ‖W^{1/2} A W^{1/2}‖_F   and   ‖C‖_F² = Σ_{i=1}^{n} Σ_{j=1}^{n} c_{ij}²

for any weighting matrix W that satisfies Wy = s. One such choice is W = G^{−1} with

      G = ∫_0^1 ∇²f(x + τs) dτ

(the average Hessian along the step, which satisfies Gs = y). Then the solution is

      M̄ = (I − sy^T/(y^T s)) M (I − ys^T/(y^T s)) + ss^T/(y^T s).

Computing the inverse B̄ = M̄^{−1} using the Sherman-Morrison-Woodbury formula gives

      B̄ = B − Bs(Bs)^T/(s^T Bs) + yy^T/(y^T s)   (the BFGS formula again)

so that B̄ ≻ 0 if and only if M̄ ≻ 0.
39 / 106
Recall: given the positive-definite matrix B and vectors s and y satisfying s^T y > 0, the
BFGS formula (2) may be written as

      B̄ = B − aa^T + bb^T

where

      a = Bs/(s^T Bs)^{1/2}   and   b = y/(y^T s)^{1/2}

Limited-memory BFGS update
Suppose at iteration k we have m previous pairs (s_i, y_i) for i = k−m, . . . , k−1. Then
the limited-memory BFGS (L-BFGS) update with memory m and positive-definite seed
matrix B_k^0 may be written in the form

      B_k = B_k^0 + Σ_{i=k−m}^{k−1} [ b_i b_i^T − a_i a_i^T ]

for some vectors a_i and b_i that may be recovered via an unrolling formula (next slide).
- the positive-definite seed matrix B_k^0 is often chosen to be diagonal
- essentially performs m BFGS updates to the seed matrix B_k^0
- a different seed matrix may be used for each iteration k
  - the added flexibility allows "better" seed matrices to be used as iterations proceed
41 / 106
Algorithm 3 Unrolling the L-BFGS formula
1: input m ≥ 1, {(s_i, y_i)} for i = k−m, . . . , k−1, and B_k^0 ≻ 0
2: for i = k−m, . . . , k−1 do
3:   b_i ← y_i/(y_i^T s_i)^{1/2}
4:   a_i ← B_k^0 s_i + Σ_{j=k−m}^{i−1} [ (b_j^T s_i)b_j − (a_j^T s_i)a_j ]
5:   a_i ← a_i/(s_i^T a_i)^{1/2}
6: end for

- typically 3 ≤ m ≤ 30
- b_i and b_j^T s_i should be saved and reused from previous iterations (the exceptions
  are b_{k−1} and b_j^T s_{k−1} because they depend on the most recent data obtained)
- with the previous savings in computation and assuming B_k^0 = I, the total cost to
  get the a_i, b_i vectors is approximately (3/2)m²n
- a matrix-vector product with B_k then requires 4mn multiplications
- other so-called compact representations are slightly more efficient
42 / 106
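A direct way to sanity-check Algorithm 3 is to verify that unrolling reproduces m sequential dense BFGS updates of the seed matrix. The sketch below does exactly that (function names mine; y = As with A SPD is an assumption used only to guarantee y^T s > 0):

```python
import numpy as np

def lbfgs_unrolled(B0, pairs):
    """Algorithm 3: recover a_i, b_i by unrolling, then B = B0 + sum(b b^T - a a^T)."""
    a_list, b_list = [], []
    for s, y in pairs:
        b = y / np.sqrt(y @ s)
        a = B0 @ s
        for aj, bj in zip(a_list, b_list):
            a += (bj @ s) * bj - (aj @ s) * aj
        a = a / np.sqrt(s @ a)
        a_list.append(a)
        b_list.append(b)
    B = B0.copy()
    for a, b in zip(a_list, b_list):
        B += np.outer(b, b) - np.outer(a, a)
    return B

def bfgs_update(B, s, y):
    """Dense BFGS update written as B_bar = B - aa^T + bb^T."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(2)
n, m = 5, 3
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # model Hessian (SPD)
pairs = [(s, A @ s) for s in rng.standard_normal((m, n))]      # y = A s, so y^T s > 0
B0 = np.eye(n)
B_seq = B0.copy()
for s, y in pairs:
    B_seq = bfgs_update(B_seq, s, y)
print(np.allclose(lbfgs_unrolled(B0, pairs), B_seq))           # True
```

The inner sum in `lbfgs_unrolled` rebuilds B_i s_i from the seed matrix and the earlier (a_j, b_j) pairs, which is where the approximately (3/2)m²n cost quoted on the slide comes from.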
The problem of interest
Given a symmetric positive-definite matrix A, solve the linear system

      Ax = b

which is equivalent to finding the unique minimizer of

      minimize_{x ∈ R^n}  q(x) := (1/2) x^T Ax − b^T x                     (3)

because ∇q(x) = Ax − b and ∇²q(x) = A ≻ 0 imply that

      ∇q(x) = Ax − b = 0  ⇐⇒  Ax = b

- We seek a method that cheaply computes approximate solutions to the system
  Ax = b, or equivalently approximately minimizes (3).
- We do not want to compute a factorization of A, because it is assumed to be too
  expensive. We are interested in very large-scale problems.
- We allow ourselves to compute matrix-vector products, i.e., Av for any vector v.
- Why do we care about this? We will see soon!

Notation: r(x) := Ax − b and r_k := Ax_k − b.
44 / 106
q(x) = (1/2) x^T Ax − b^T x

Algorithm 4 Coordinate descent algorithm for minimizing q
1: input initial guess x_0 ∈ R^n and a symmetric positive-definite matrix A ∈ R^{n×n}
2: loop
3:   for i = 0, 1, . . . , n−1 do
4:     Compute α_i as the solution to

           minimize_{α ∈ R}  q(x_i + α e_{i+1})                            (4)

5:     x_{i+1} ← x_i + α_i e_{i+1}
6:   end for
7:   x_0 ← x_n
8: end loop

- e_j represents the jth coordinate vector, i.e., e_j = (0, . . . , 0, 1, 0, . . . , 0)^T ∈ R^n
  with the 1 in the jth position
- Is it easy to solve (4)? Yes! (exercise)
- Does this work?
45 / 106
Figure: Coordinate descent converges in n steps when the contour axes are aligned with the coordinate directions e_i. Very good!
Figure: Coordinate descent does not converge in n steps when the contour axes are not aligned with the coordinate directions e_i. Looks like it will be very slow!

Fact: the axes are determined by the eigenvectors of A.
Observation: coordinate descent converges in no more than n steps if the eigenvectors
align with the coordinate directions, i.e., V = I where V is the eigenvector matrix of A.
Implication: Using the spectral decomposition of A,

      A = VΛV^T = IΛI^T = Λ   (a diagonal matrix)

Conclusion: Coordinate descent converges in ≤ n steps when A is diagonal.
46 / 106
First thought: This is only good for diagonal matrices. That sucks!
Second thought: Can we use a transformation of variables so that the transformed
Hessian is diagonal?
- Let S be a square nonsingular matrix and define the transformed variables
  y = S^{−1}x ⇐⇒ Sy = x.
- Consider the transformed quadratic function q̄ defined as

      q̄(y) := q(Sy) ≡ (1/2) y^T (S^T AS) y − (S^T b)^T y = (1/2) y^T Āy − b̄^T y

  with Ā := S^T AS and b̄ := S^T b, so that a minimizer x* of q satisfies x* = Sy*
  where y* is a minimizer of q̄.
- If Ā = S^T AS is diagonal, then applying the coordinate descent Algorithm 4 to q̄
  would converge to y* in ≤ n steps.
- Ā is diagonal if and only if

      s_i^T As_j = 0 for all i ≠ j,   where   S = (s_0 s_1 . . . s_{n−1})

- Line 5 of Algorithm 4 (in the transformed variables) is y_{i+1} = y_i + α_i e_{i+1}.
  Multiplying by S and using Sy = x leads to

      x_{i+1} = Sy_{i+1} = S(y_i + α_i e_{i+1}) = Sy_i + α_i Se_{i+1} = x_i + α_i s_i
47 / 106
Definition 2.7 (A-conjugate directions)
A set of nonzero vectors {s_0, s_1, . . . , s_l} is said to be conjugate with respect to the
symmetric positive-definite matrix A (A-conjugate for short) if

      s_i^T As_j = 0 for all i ≠ j.

- We want to find A-conjugate vectors.
- The eigenvectors of A are A-conjugate, since the spectral decomposition gives

      A = VΛV^T  ⇐⇒  V^T AV = Λ   (diagonal matrix of eigenvalues)

- Computing the spectral decomposition is too expensive for large-scale problems.
- We want to find A-conjugate vectors cheaply!
- Let us look at a general algorithm.
48 / 106
Algorithm 5 General A-conjugate direction algorithm
1: input symmetric positive-definite matrix A ∈ R^{n×n}, and vectors x_0 and b ∈ R^n.
2: Set r_0 ← Ax_0 − b and k ← 0.
3: Choose direction s_0.
4: while r_k ≠ 0 do
5:   Compute α_k ← argmin_{α ∈ R} q(x_k + αs_k)
6:   Set x_{k+1} ← x_k + α_k s_k.
7:   Set r_{k+1} ← Ax_{k+1} − b.
8:   Pick new A-conjugate direction s_{k+1}.
9:   Set k ← k + 1.
10: end while

- How do we compute α_k?
- How do we compute the A-conjugate directions efficiently?

Theorem 2.8 (general A-conjugate algorithm converges)
For any x_0 ∈ R^n, the general A-conjugate direction Algorithm 5 computes the solution
x* of the system Ax = b in at most n steps.

Proof: Theorem 5.1 in Nocedal and Wright.
49 / 106
Exercise: show that the α_k that solves the minimization problem on line 5 of
Algorithm 5 satisfies

      α_k = −(s_k^T r_k)/(s_k^T As_k).

Before discussing how we compute the A-conjugate directions {s_i}, let us look at some
of their properties, assuming we already have them.

Theorem 2.9 (Expanding subspace minimization)
Let x_0 ∈ R^n be any starting point and suppose that {x_k} is the sequence of iterates
generated by the A-conjugate direction Algorithm 5 for a set of A-conjugate directions
{s_k}. Then,

      r_k^T s_i = 0 for all i = 0, . . . , k−1

and x_k is the unique minimizer of q(x) = (1/2) x^T Ax − b^T x over the set

      {x : x = x_0 + span(s_0, s_1, . . . , s_{k−1})}

Proof: See Theorem 5.2 in Nocedal and Wright.
50 / 106
Question: If {s_i}_{i=0}^{k−1} are A-conjugate, how do we compute s_k to be A-conjugate?
Answer: Theorem 2.9 shows that r_k is orthogonal to the previously explored space. To
keep it simple and hopefully cheap, let us try

      s_k = −r_k + β s_{k−1}   for some β.                                  (5)

If this is going to work, then the A-conjugacy condition and (5) require that

      0 = s_{k−1}^T As_k
        = s_{k−1}^T A(−r_k + βs_{k−1})
        = −s_{k−1}^T Ar_k + βs_{k−1}^T As_{k−1}

and solving for β gives

      β = (s_{k−1}^T Ar_k)/(s_{k−1}^T As_{k−1}).

With this choice of β, the vector s_k is A-conjugate! . . . but how do we get started?
Choosing s_0 = −r_0, i.e., a steepest-descent direction, makes sense.

This gives us a complete algorithm: the linear CG algorithm!
51 / 106
Algorithm 6 Preliminary linear CG algorithm
1: input symmetric positive-definite matrix A ∈ R^{n×n}, and vectors x_0 and b ∈ R^n.
2: Set r_0 ← Ax_0 − b, s_0 ← −r_0, and k ← 0.
3: while r_k ≠ 0 do
4:   Set α_k ← −(s_k^T r_k)/(s_k^T As_k).
5:   Set x_{k+1} ← x_k + α_k s_k.
6:   Set r_{k+1} ← Ax_{k+1} − b.
7:   Set β_{k+1} ← (s_k^T Ar_{k+1})/(s_k^T As_k).
8:   Set s_{k+1} ← −r_{k+1} + β_{k+1} s_k.
9:   Set k ← k + 1.
10: end while

- This preliminary version requires two matrix-vector products per iteration.
- By using some more "tricks" and algebra, we can reduce this to a single
  matrix-vector multiplication per iteration.
52 / 106
Algorithm 7 Linear CG algorithm
1: input symmetric positive-definite matrix A ∈ R^{n×n}, and vectors x_0 and b ∈ R^n.
2: Set r_0 ← Ax_0 − b, s_0 ← −r_0, and k ← 0.
3: while r_k ≠ 0 do
4:   Set α_k ← (r_k^T r_k)/(s_k^T As_k).
5:   Set x_{k+1} ← x_k + α_k s_k.
6:   Set r_{k+1} ← r_k + α_k As_k.
7:   Set β_{k+1} ← (r_{k+1}^T r_{k+1})/(r_k^T r_k).
8:   Set s_{k+1} ← −r_{k+1} + β_{k+1} s_k.
9:   Set k ← k + 1.
10: end while

- Only requires the single matrix-vector multiplication As_k.
- More practical to use a stopping condition like ‖r_k‖ ≤ 10^{−8} max(1, ‖r_0‖_2)
- Required computation per iteration:
  - 2 inner products
  - 3 vector sums
  - 1 matrix-vector multiplication
- Ideal for large (sparse) matrices A
- Converges quickly if cond(A) is small or the eigenvalues are "clustered"
  - convergence can be accelerated by using preconditioning
53 / 106
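Algorithm 7 translates almost line for line into NumPy. This sketch adds the relative stopping rule from the slide and, as a floating-point precaution of mine, allows a few more than n iterations (exact arithmetic terminates in at most n steps):

```python
import numpy as np

def linear_cg(A, b, x0, tol=1e-10):
    """Algorithm 7 sketch: linear CG with one A-product per iteration."""
    x = x0.astype(float).copy()
    r = A @ x - b
    s = -r
    stop = tol * max(1.0, np.linalg.norm(r))    # relative stopping rule
    k = 0
    while np.linalg.norm(r) > stop and k < 3 * len(b):
        As = A @ s                              # the single matrix-vector product
        alpha = (r @ r) / (s @ As)
        x = x + alpha * s
        r_new = r + alpha * As
        beta = (r_new @ r_new) / (r @ r)
        s = -r_new + beta * s
        r = r_new
        k += 1
    return x, k

rng = np.random.default_rng(3)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                     # symmetric positive definite
b = rng.standard_normal(n)
x, iters = linear_cg(A, b, np.zeros(n))
print(np.allclose(A @ x, b))                    # True
```

Note only `A @ s` touches the matrix; everything else is vector work, which is what makes the method attractive for large sparse systems.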
We want to use CG to compute an approximate solution p to

      minimize_{p ∈ R^n}  f + g^T p + (1/2) p^T Hp

where H may be indefinite (even singular), and we must ensure that p is a descent direction.

Algorithm 8 Newton CG algorithm for computing descent directions
1: input symmetric matrix H ∈ R^{n×n} and vector g
2: Set p_0 = 0, r_0 ← g, s_0 ← −g, and k ← 0.
3: while r_k ≠ 0 do
4:   if s_k^T Hs_k > 0 then
5:     Set α_k ← (r_k^T r_k)/(s_k^T Hs_k).
6:   else
7:     if k = 0 then return p_k = −g else return p_k end if
8:   end if
9:   Set p_{k+1} ← p_k + α_k s_k.
10:  Set r_{k+1} ← r_k + α_k Hs_k.
11:  Set β_{k+1} ← (r_{k+1}^T r_{k+1})/(r_k^T r_k).
12:  Set s_{k+1} ← −r_{k+1} + β_{k+1} s_k.
13:  Set k ← k + 1.
14: end while
15: return p_k

Made the replacements x ← p, A ← H, and b ← −g in Algorithm 7.
54 / 106
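A sketch of Algorithm 8, exercised on the indefinite Hessian of Example 2.2 (function name and tolerance mine):

```python
import numpy as np

def newton_cg_direction(H, g, tol=1e-10):
    """Algorithm 8 sketch: CG on the quadratic model, bailing out on
    nonpositive curvature so that the returned p is always a descent direction."""
    n = len(g)
    p = np.zeros(n)
    r = g.copy()
    s = -g.copy()
    k = 0
    while np.linalg.norm(r) > tol and k < n:
        Hs = H @ s
        if s @ Hs <= 0:                       # nonpositive curvature detected
            return -g if k == 0 else p        # steepest descent if it happens at k = 0
        alpha = (r @ r) / (s @ Hs)
        p = p + alpha * s
        r = r + alpha * Hs
        beta = (r @ r) / ((r - alpha * Hs) @ (r - alpha * Hs))
        s = -r + beta * s
        k += 1
    return p

# Even with an indefinite Hessian, the returned p is a descent direction (Theorem 2.10):
H = np.array([[1.0, 0.0], [0.0, -2.0]])       # indefinite H from Example 2.2
g = np.array([2.0, 4.0])
p = newton_cg_direction(H, g)
print(g @ p < 0)                              # True
```

On this example negative curvature is hit immediately (s_0^T Hs_0 = −28), so the method falls back to the steepest-descent direction −g.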
Theorem 2.10 (Newton CG algorithm computes descent directions)
If g ≠ 0 and Algorithm 8 terminates on iteration K, then every direction p_k produced by
the algorithm satisfies p_k^T g < 0 for 1 ≤ k ≤ K.

Proof:
Case 1: Algorithm 8 terminates on iteration K = 0.
The returned direction is −g, and it is immediate that g^T(−g) = −‖g‖_2² < 0 since g ≠ 0.

Case 2: Algorithm 8 terminates on iteration K ≥ 1.
First observe from p_0 = 0, s_0 = −g, r_0 = g, g ≠ 0, and lines 9 and 5 of Algorithm 8
that

      g^T p_1 = g^T(p_0 + α_0 s_0) = α_0 g^T s_0 = [(r_0^T r_0)/(s_0^T Hs_0)] g^T s_0
              = −‖g‖_2⁴/(g^T Hg) < 0.                                      (6)

By unrolling the "p" update, using the A-conjugacy of {s_i}, and the definition of r_k, we
get

      s_k^T r_k = s_k^T(Hp_k + g) = s_k^T g + s_k^T H Σ_{j=0}^{k−1} α_j s_j = s_k^T g.

The previous line, lines 9 and 5 of Algorithm 8, and the fact that s_k^T r_k = −‖r_k‖_2²
give

      g^T p_{k+1} = g^T p_k + α_k g^T s_k = g^T p_k + α_k s_k^T r_k
                  = g^T p_k + [(r_k^T r_k)/(s_k^T Hs_k)] s_k^T r_k
                  = g^T p_k − ‖r_k‖_2⁴/(s_k^T Hs_k) < g^T p_k.

Combining the previous inequality and (6) results in

      g^T p_{k+1} < g^T p_k < · · · < g^T p_1 = −‖g‖_2⁴/(g^T Hg) < 0.
55 / 106
Algorithm 9 Backtracking
1: input x_k, p_k
2: Choose α_init > 0 and τ ∈ (0, 1)
3: Set α^(0) = α_init and l = 0
4: until f(x_k + α^(l) p_k) "<" f(x_k)
5:   Set α^(l+1) = τα^(l)
6:   Set l ← l + 1
7: end until
8: return α_k = α^(l)

- typical choices (not always): α_init = 1 and τ = 1/2
- this prevents the step from getting too small
- it does not prevent the step from being too long
- the decrease in f may not be proportional to the directional derivative
- we need to tighten the requirement that f(x_k + α^(l) p_k) "<" f(x_k)
58 / 106
The Armijo condition
Given a point x_k and search direction p_k, we say that α_k satisfies the Armijo condition if

      f(x_k + α_k p_k) ≤ f(x_k) + ηα_k g_k^T p_k                           (7)

for some η ∈ (0, 1), e.g., η = 0.1 or even η = 0.0001

Purpose: ensure the decrease in f is substantial relative to the directional derivative.
Figure: the curves f(x_k + αp_k), f(x_k) + αg_k^T p_k, and f(x_k) + αηg_k^T p_k as functions of α, illustrating the Armijo condition.
59 / 106
Algorithm 10 A backtracking-Armijo linesearch
1: input x_k, p_k
2: Choose α_init > 0, η ∈ (0, 1), and τ ∈ (0, 1)
3: Set α^(0) = α_init and l = 0
4: until f(x_k + α^(l) p_k) ≤ f(x_k) + ηα^(l) g_k^T p_k
5:   Set α^(l+1) = τα^(l)
6:   Set l ← l + 1
7: end until
8: return α_k = α^(l)

- Does this always terminate?
- Does it give us what we want?
60 / 106
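Algorithm 10 is a few lines of code. In the sketch below the caller passes the directional derivative g_k^T p_k as `slope` (a design choice of mine; the slides leave the interface unspecified):

```python
def backtracking_armijo(f, x, p, slope, alpha_init=1.0, eta=0.1, tau=0.5):
    """Algorithm 10 sketch: shrink alpha until the Armijo condition (7) holds.
    `slope` is the directional derivative g_k^T p_k and must be negative."""
    alpha = alpha_init
    fx = f(x)
    while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + eta * alpha * slope:
        alpha *= tau
    return alpha

# On f(x) = x^2 at x = 2 with p = -g = -4 (so slope = g*p = -16):
f = lambda x: x[0] ** 2
x, p, slope = [2.0], [-4.0], -16.0
alpha = backtracking_armijo(f, x, p, slope)
print(alpha)   # 0.5: alpha = 1 fails (7), alpha = 0.5 lands at the minimizer x = 0
```

This run also illustrates the bound of the next theorem: here γ = 2 and α_max = 2(η − 1)g^T p/(γ‖p‖_2²) = 0.9, and the returned α = 0.5 indeed exceeds τα_max = 0.45.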
Theorem 3.1 (Satisfying the Armijo condition)
Suppose that
- f ∈ C¹ and g(x) is Lipschitz continuous with Lipschitz constant γ(x)
- p is a descent direction at x

Then for any given η ∈ (0, 1), the Armijo condition

      f(x + αp) ≤ f(x) + ηα g(x)^T p

is satisfied for all α ∈ [0, α_max(x)], where

      α_max(x) := 2(η − 1)g(x)^T p / (γ(x)‖p‖_2²) > 0

Proof:
A Taylor approximation and

      α ≤ 2(η − 1)g(x)^T p / (γ(x)‖p‖_2²)

imply that

      f(x + αp) ≤ f(x) + αg(x)^T p + (1/2)γ(x)α²‖p‖_2²
                ≤ f(x) + αg(x)^T p + α(η − 1)g(x)^T p
                = f(x) + αη g(x)^T p
61 / 106
Theorem 3.2 (Bound on α_k when using an Armijo backtracking linesearch)
Suppose that
- f ∈ C¹ and g(x) is Lipschitz continuous with Lipschitz constant γ_k at x_k
- p_k is a descent direction at x_k

Then for any chosen η ∈ (0, 1), the stepsize α_k generated by the backtracking-Armijo
linesearch terminates with

      α_k ≥ min( α_init, 2τ(η − 1)g_k^T p_k / (γ_k‖p_k‖_2²) ) ≡ min( α_init, τα_max(x_k) )

where τ ∈ (0, 1) is the backtracking parameter.

Proof:
Theorem 3.1 implies that the backtracking will terminate as soon as α^(l) ≤ α_max(x_k).
Consider the following two cases:
- α_init satisfies the Armijo condition: it is then clear that α_k = α_init.
- α_init does not satisfy the Armijo condition: thus, there must be a last linesearch
  iterate (the l-th) for which

      α^(l) > α_max(x_k)  =⇒  α_k = α^(l+1) = τα^(l) > τα_max(x_k)

Combining the two cases gives the required result.
62 / 106
Algorithm 11 Generic backtracking-Armijo linesearch method
1: input initial guess x_0
2: Set k = 0
3: loop
4:   Find a descent direction p_k at x_k.
5:   Compute the stepsize α_k using the backtracking-Armijo linesearch Algorithm 10.
6:   Set x_{k+1} = x_k + α_k p_k
7:   Set k ← k + 1
8: end loop
9: return approximate solution x_k

- should include a sensible relative stopping tolerance such as
  ‖∇f(x_k)‖ ≤ 10^{−8} max(1, ‖∇f(x_0)‖)
- should also terminate if a maximum number of allowed iterations is reached
- should also terminate if a maximum time limit is reached
63 / 106
Theorem 3.3 (Global convergence of generic backtracking-Armijo linesearch)
Suppose that
- f ∈ C¹
- g is Lipschitz continuous on R^n with Lipschitz constant γ

Then, the iterates generated by the generic backtracking-Armijo linesearch method
Algorithm 11 must satisfy one of the following conditions:
1. finite termination, i.e.,

      g_k = 0 for some k ≥ 0

2. the objective is unbounded below, i.e.,

      lim_{k→∞} f_k = −∞

3. an angle-type condition between g_k and p_k holds in the limit, i.e.,

      lim_{k→∞} min( |p_k^T g_k|, |p_k^T g_k|²/‖p_k‖_2² ) = 0
64 / 106
Proof:
Suppose that outcomes 1 and 2 do not happen, i.e., g_k ≠ 0 for all k ≥ 0 and

      lim_{k→∞} f_k > −∞.                                                  (8)

The Armijo condition implies that

      f_{k+1} − f_k ≤ ηα_k p_k^T g_k

for all k. Summing over the first j iterations yields

      f_{j+1} − f_0 ≤ η Σ_{k=0}^{j} α_k p_k^T g_k                          (9)

Taking limits of the previous inequality as j → ∞ and using (8) implies that the LHS is
bounded below and, therefore, the RHS is also bounded below. Since the sum is
composed of all negative terms, we deduce that the sum

      Σ_{k=0}^{∞} α_k |p_k^T g_k|                                          (10)

is bounded.
65 / 106
From Theorem 3.2, αk ≥ min{ αinit, 2τ(η − 1)gkᵀpk / (γ‖pk‖₂²) }. Thus,

Σ_{k=0}^{∞} αk|pkᵀgk| ≥ Σ_{k=0}^{∞} min{ αinit, 2τ(η − 1)gkᵀpk / (γ‖pk‖₂²) } |pkᵀgk|
= Σ_{k=0}^{∞} min{ αinit|pkᵀgk|, 2τ(1 − η)|pkᵀgk|² / (γ‖pk‖₂²) }
≥ Σ_{k=0}^{∞} min{ αinit, 2τ(1 − η)/γ } min{ |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² }

Using the fact that the series (10) converges, we obtain that

lim_{k→∞} min( |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² ) = 0
66 / 106
Some observations

the Armijo condition prevents steps that are too "long"

typically, the Armijo condition is satisfied on very "large" intervals

the Armijo condition can be satisfied by steps that are not even close to a minimizer of φ(α) := f(xk + αpk)

Figure: the graph of φ(α) = f(xk + αpk) against the line l(α) = fk + c1αgkᵀpk; the "acceptable" step lengths are those α for which φ(α) lies below l(α).
68 / 106
Wolfe conditions
Given the current iterate xk, search direction pk, and constants 0 < c1 < c2 < 1, we say that the step length αk satisfies the Wolfe conditions if

f(xk + αkpk) ≤ f(xk) + c1αk∇f(xk)ᵀpk (11a)
∇f(xk + αkpk)ᵀpk ≥ c2∇f(xk)ᵀpk (11b)

condition (11a) is the Armijo condition (ensures "sufficient" decrease)
condition (11b) is a directional derivative condition (prevents steps that are too short)
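Conditions (11a)-(11b) are easy to check numerically. A small sketch (the quadratic test function is an illustrative choice, not taken from the slides):

```python
import numpy as np

def satisfies_wolfe(f, g, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions (11a) and (11b) at step length alpha."""
    gxp = g(x) @ p                                            # g_k^T p_k
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * gxp      # (11a)
    curvature = g(x + alpha * p) @ p >= c2 * gxp              # (11b)
    return armijo and curvature

# f(x) = 0.5 x^2: a tiny step passes Armijo but fails the curvature condition
f = lambda x: 0.5 * x @ x
g = lambda x: x
x, p = np.array([2.0]), np.array([-2.0])
full_step_ok = satisfies_wolfe(f, g, x, p, 1.0)     # accepted
tiny_step_ok = satisfies_wolfe(f, g, x, p, 1e-6)    # rejected by (11b)
```

This illustrates the remark above: (11b) is precisely what screens out steps that are too short.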
Figure: the graph of φ(α) = f(xk + αpk), the line l(α) = fk + c1αgkᵀpk, and the desired slope c2gkᵀpk; the "acceptable" step lengths are those α satisfying both Wolfe conditions.
69 / 106
Algorithm 12 Generic Wolfe linesearch method
1: input initial guess x0
2: Set k = 0
3: loop
4: Find a descent direction pk at xk.
5: Compute a stepsize αk that satisfies the Wolfe conditions (11).
6: Set xk+1 = xk + αkpk
7: Set k ← k + 1
8: end loop
9: return approximate solution xk
should include a sensible relative stopping tolerance such as
‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
70 / 106
Theorem 3.4 (Convergence of generic Wolfe linesearch (Zoutendijk))
Assume that

f : Rn → R is in C1 and bounded below on Rn

∇f(x) is Lipschitz continuous on Rn with constant L

It follows that the iterates generated from the generic Wolfe linesearch method satisfy

Σ_{k≥0} cos²(θk)‖gk‖₂² < ∞ (12)

where

cos(θk) := −gkᵀpk / (‖gk‖₂‖pk‖₂) (13)

Comments
definition (13) measures the angle between the gradient gk and search direction pk
I cos(θk) ≈ 0 implies gk and pk are nearly orthogonal
I cos(θk) ≈ 1 implies gk and pk are nearly parallel
condition (12) ensures that the gradient converges to zero provided gk and pk stay away from being arbitrarily close to orthogonal
Proof: See Theorem 3.2 in Nocedal and Wright.
71 / 106
Some comments
The Armijo linesearch
I only requires function evaluations
I may easily accept steps that are far from a univariate minimizer of φ(α) = f(xk + αpk)

The Wolfe linesearch
I requires function and gradient evaluations
I typically produces steps that are closer to the univariate minimizer of φ
I may still allow the acceptance of steps that are far from a univariate minimizer of φ
I "ideal" when search directions are computed from linear systems based on the BFGS quasi-Newton updating formula

Note: we have not yet said how to compute steps that satisfy the Wolfe conditions. We have not even shown that such steps exist! (coming soon)
72 / 106
Strong Wolfe conditions
Given the current iterate xk, search direction pk, and constants 0 < c1 < c2 < 1, we say that the step length αk satisfies the strong Wolfe conditions if

f(xk + αkpk) ≤ f(xk) + c1αk∇f(xk)ᵀpk (14a)
|∇f(xk + αkpk)ᵀpk| ≤ c2|∇f(xk)ᵀpk| (14b)
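Compared with (11b), the absolute values in (14b) also rule out steps that overshoot so far that the slope of φ becomes strongly positive. A sketch of a checker (the quadratic test function is an illustrative choice):

```python
import numpy as np

def satisfies_strong_wolfe(f, g, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions (14a) and (14b) at step length alpha."""
    gxp = g(x) @ p
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * gxp      # (14a)
    curvature = abs(g(x + alpha * p) @ p) <= c2 * abs(gxp)    # (14b)
    return armijo and curvature

# f(x) = 0.5 x^2 from x = 2 with p = -2: the exact minimizer of phi is alpha = 1
f = lambda x: 0.5 * x @ x
g = lambda x: x
x, p = np.array([2.0]), np.array([-2.0])
```

With these values, α = 1 is accepted, while α = 1e-6 (too short) and α = 1.95 (overshoots, nearly reversing the slope) are both rejected by (14b) even though the latter still satisfies (14a).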
Figure: the graph of φ(α) = f(xk + αpk), the line l(α) = fk + c1αgkᵀpk, and the desired slope; the "acceptable" intervals are smaller than for the (weak) Wolfe conditions.
74 / 106
Theorem 3.5
Suppose that f : Rn → R is in C1 and pk is a descent direction at xk, and assume that f is bounded below along the ray {xk + αpk : α > 0}. It follows that if 0 < c1 < c2 < 1, then there exist nontrivial intervals of step lengths that satisfy the strong Wolfe conditions (14).
Proof: The function φ(α) := f(xk + αpk) is bounded below over α > 0 by assumption, while the line l(α) := f(xk) + αc1∇f(xk)ᵀpk is unbounded below since c1 ∈ (0, 1); hence φ must intersect the graph of l at least once. Let αmin > 0 be the smallest such intersecting value, so that

f(xk + αminpk) = f(xk) + αminc1∇f(xk)ᵀpk. (15)

It is clear that (11a)/(14a) hold for all α ∈ (0, αmin]. The Mean Value Theorem now implies

f(xk + αminpk) − f(xk) = αmin∇f(xk + αMpk)ᵀpk for some αM ∈ (0, αmin).

Combining the last equality with (15) and the facts that c1 < c2 and ∇f(xk)ᵀpk < 0 yields

∇f(xk + αMpk)ᵀpk = c1∇f(xk)ᵀpk > c2∇f(xk)ᵀpk. (16)

Thus, αM satisfies (11a)/(14a) and (11b) with a strict inequality, so we may use the smoothness assumption to deduce that there is an interval around αM for which (11) and (14a) hold. Finally, since the left-hand side of (16) is negative, we also know that (14b) holds, which completes the proof.
75 / 106
Algorithm 13 A bisection type (weak) Wolfe linesearch
1: input xk, pk, 0 < c1 < c2 < 1
2: Choose α = 0, t = 1, and β = ∞
3: while ( f(xk + tpk) > f(xk) + c1·t·(gkᵀpk) OR ∇f(xk + tpk)ᵀpk < c2(gkᵀpk) ) do
4:   if f(xk + tpk) > f(xk) + c1·t·(gkᵀpk) then
5:     Set β := t
6:     Set t := (α + β)/2
7:   else
8:     α := t
9:     if β = ∞ then
10:      t := 2α
11:    else
12:      t := (α + β)/2
13:    end if
14:  end if
15: end while
16:
17: return t
See Section 3.5 in Nocedal-Wright for more discussion; in particular, a procedure for the strong Wolfe conditions.
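A direct Python transcription of Algorithm 13 (a sketch; the `max_iter` cap is an added safeguard not present in the slide):

```python
import numpy as np

def wolfe_bisection(f, g, xk, pk, c1=1e-4, c2=0.9, max_iter=100):
    """Bisection-type (weak) Wolfe linesearch following Algorithm 13."""
    alpha, t, beta = 0.0, 1.0, np.inf
    fk, slope = f(xk), g(xk) @ pk
    for _ in range(max_iter):
        if f(xk + t * pk) > fk + c1 * t * slope:     # Armijo fails: step too long
            beta = t
            t = 0.5 * (alpha + beta)
        elif g(xk + t * pk) @ pk < c2 * slope:       # curvature fails: step too short
            alpha = t
            t = 2 * alpha if beta == np.inf else 0.5 * (alpha + beta)
        else:
            return t                                 # both Wolfe conditions hold
    return t

# usage: f(x) = 5 x^2 from x = 1 with the steepest descent direction p = -10
f = lambda x: 5.0 * x @ x
g = lambda x: 10.0 * x
t = wolfe_bisection(f, g, np.array([1.0]), np.array([-10.0]))
```

The loop brackets an acceptable interval [α, β] and halves it, doubling t while no upper bound β has been found.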
76 / 106
Some comments
The Armijo linesearch
I only requires function evaluations
I computationally cheapest to implement
I may easily accept steps that are far from a univariate minimizer of φ(α) = f(xk + αpk)

The Wolfe linesearch
I requires function and gradient evaluations
I typically produces steps that are closer to the univariate minimizer of φ
I may still allow the acceptance of steps that are far from a univariate minimizer of φ
I computationally more expensive to implement compared to Armijo

The strong Wolfe linesearch
I requires function and gradient evaluations
I typically produces steps that are closer to the univariate minimizer of φ
I only approximate stationary points of the univariate function φ satisfy these conditions
I computationally most expensive to implement compared to Armijo

Why use (strong) Wolfe conditions?

sometimes a more accurate linesearch leads to fewer outer (major) iterations of the linesearch method

beneficial in the context of quasi-Newton methods for maintaining positive-definite matrix approximations
77 / 106
Theorem 3.6 (Wolfe conditions guarantee the BFGS update is well-defined)
Suppose that Bk ≻ 0 and that pk is a descent direction for f at xk. If a (strong) Wolfe linesearch is used to compute αk such that xk+1 = xk + αkpk, then

ykᵀsk > 0

where
sk = xk+1 − xk = αkpk and yk = gk+1 − gk.

Proof: It follows from (11b) that

∇f(xk + αkpk)ᵀpk ≥ c2∇f(xk)ᵀpk.

Multiplying both sides by αk, using xk+1 = xk + αkpk and the notation gk = ∇f(xk), yields

gk+1ᵀsk ≥ c2gkᵀsk.

Subtracting gkᵀsk from both sides gives

gk+1ᵀsk − gkᵀsk ≥ c2gkᵀsk − gkᵀsk

so that
ykᵀsk ≥ (c2 − 1)gkᵀsk = αk(c2 − 1)gkᵀpk > 0

since c2 ∈ (0, 1) and pk is a descent direction for f at xk.
78 / 106
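A quick numerical illustration of the theorem (a sketch: the quadratic objective and the doubling search for a step satisfying (11b) are illustrative choices; as in the proof, only the curvature condition is needed to force ykᵀsk > 0):

```python
import numpy as np

# f(x) = 0.5 x^T A x with A positive definite, so g(x) = A x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
g = lambda x: A @ x

xk = np.array([1.0, -1.0])
pk = -g(xk)                       # a descent direction
c2 = 0.9
slope = g(xk) @ pk                # g_k^T p_k < 0

# find a step satisfying the curvature condition (11b) by simple doubling
alpha = 1e-3
while g(xk + alpha * pk) @ pk < c2 * slope:
    alpha *= 2.0

sk = alpha * pk                   # s_k = x_{k+1} - x_k
yk = g(xk + sk) - g(xk)           # y_k = g_{k+1} - g_k
curvature_product = yk @ sk       # positive, as Theorem 3.6 guarantees
```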
Algorithm 14 Steepest descent backtracking-Armijo linesearch method
1: Input: initial guess x0
2: Set k = 0
3: until ‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
4: Set pk = −∇f(xk) as the search direction.
5: Compute the stepsize αk using the backtracking-Armijo linesearch Algorithm 10.
6: Set xk+1 = xk + αkpk
7: Set k ← k + 1
8: end until
9: return approximate solution xk
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
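A compact sketch of Algorithm 14 in Python (the backtracking loop is inlined; the values for η, τ, and the relative tolerance follow the slides' suggestions, and the test function is an illustrative choice):

```python
import numpy as np

def steepest_descent_armijo(f, g, x0, max_iter=10000, eta=1e-4, tau=0.5):
    """Steepest-descent with backtracking-Armijo (sketch of Algorithm 14)."""
    x = x0.copy()
    tol = 1e-8 * max(1.0, np.linalg.norm(g(x0)))   # relative stopping tolerance
    for _ in range(max_iter):
        if np.linalg.norm(g(x)) <= tol:
            break
        p = -g(x)                                  # steepest descent direction
        alpha, fx, slope = 1.0, f(x), g(x) @ p
        while f(x + alpha * p) > fx + eta * alpha * slope:
            alpha *= tau                           # backtrack
        x = x + alpha * p
    return x

# usage: a strictly convex quadratic whose minimizer is the origin
f = lambda x: 0.5 * (3.0 * x[0]**2 + x[1]**2)
g = lambda x: np.array([3.0 * x[0], x[1]])
xstar = steepest_descent_armijo(f, g, np.array([1.0, 1.0]))
```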
81 / 106
Theorem 4.1 (Global convergence for steepest descent)
If f ∈ C1 and g is Lipschitz continuous on Rn, then the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14 satisfy one of the following conditions:

1 finite termination, i.e., gk = 0 for some finite k ≥ 0

2 objective is unbounded below, i.e., lim_{k→∞} fk = −∞

3 convergence of gradients, i.e., lim_{k→∞} gk = 0

Proof: Since pk = −gk, if outcome 1 above does not happen, then we may conclude that

‖gk‖ = ‖pk‖ ≠ 0 for all k ≥ 0

Additionally, if outcome 2 above does not occur, then Theorem 3.3 ensures that

0 = lim_{k→∞} min( |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² ) = lim_{k→∞} ‖gk‖₂²

(both arguments of the min equal ‖gk‖₂² when pk = −gk), and thus lim_{k→∞} gk = 0.
82 / 106
Some general comments on the steepest descent backtracking-Armijo algorithm

archetypal globally convergent method

many other methods resort to steepest descent when things "go wrong"

not scale invariant

convergence is usually slow! (linear local convergence, possibly with constant close to 1; for global convergence only a sublinear rate O((1/ε)²) to get to a "stationary" point, i.e., ‖gk‖ ≤ ε)

in practice, often does not converge at all

slow convergence results because the computation of pk does not use any curvature information

we call any method that uses the steepest descent direction a "method of steepest descent"

each iteration is relatively cheap, but you may need many of them!
83 / 106
Rate of convergence
Theorem 4.2 (Global convergence rate for steepest-descent)
Suppose f ∈ C1 and ∇f is Lipschitz continuous on Rn. Further assume that f(x) is bounded from below on Rn. Let {xk} be the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14. Then there exists a constant M such that for all T ≥ 1

min_{k=0,...,T} ‖∇f(xk)‖ ≤ M/√(T + 1).

Consequently, for any ε > 0, within ⌈(M/ε)²⌉ steps, we will see an iterate where the gradient has norm at most ε. In other words, we reach an "ε-stationary" point in O((1/ε)²) steps.

REMARK: The same result holds for a constant step size, for an appropriately chosen constant. See HW.

Proof: Recall (9) in the proof of Theorem 3.3:

Σ_{k=0}^{T} αk|gkᵀpk| ≤ (f0 − fT+1)/η ≤ M′ since f is bounded from below.
84 / 106
Rate of convergence

Thus,

Σ_{k: ταmax < αinit} 2τ(1 − η)|gkᵀpk|² / (γ‖pk‖²) + Σ_{k: ταmax ≥ αinit} αinit|gkᵀpk| ≤ M′.

Thus,

Σ_{k: ταmax < αinit} |gkᵀpk|²/‖pk‖² + Σ_{k: ταmax ≥ αinit} |gkᵀpk| ≤ M′/C, (17)

where C = min{ 2τ(1 − η)/γ, αinit }.

Setting pk = −gk in the inequality above, we obtain that

(T + 1) min_{k=0,...,T} ‖gk‖² ≤ Σ_{k=0,...,T} ‖gk‖² ≤ M′/C.

Therefore,

min_{k=0,...,T} ‖gk‖ ≤ M/√(T + 1)

for the constant M = √(M′/C).
85 / 106
Steepest descent example
Figure: Contours for the objective function f(x, y) = 10(y − x²)² + (x − 1)², and the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14.
86 / 106
Rate of convergence
Theorem 4.3 (Local convergence rate for steepest-descent)
Suppose the following all hold:

f is twice continuously differentiable and ∇²f is Lipschitz continuous on Rn with Lipschitz constant M.

x∗ is a local minimum of f with positive definite Hessian ∇²f(x∗), with eigenvalues lower bounded by ℓ and upper bounded by L.

The initial starting iterate x0 is close enough to x∗; more precisely, r0 := ‖x0 − x∗‖ < r := 2ℓ/M.

Then steepest-descent with constant step size α = 2/(ℓ + L) converges as follows:

‖xk − x∗‖ ≤ (r·r0/(r − r0)) (1 − 2ℓ/(L + 3ℓ))^k.

In other words, in O(log(1/ε)) iterations, we are within a distance of ε from x∗.
Proof: See Theorem 1.2.4 in Nesterov’s book.
See also Theorem 3.4 in Nocedal-Wright for a similar result with exact line-search.
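The constant-stepsize behavior can be seen on a diagonal quadratic, where ℓ and L are exactly the extreme Hessian eigenvalues (a sketch; the specific values ℓ = 1, L = 10 are illustrative):

```python
import numpy as np

# f(x) = 0.5 (ell*x1^2 + L*x2^2): Hessian eigenvalues are ell and L
ell, L = 1.0, 10.0
grad = lambda x: np.array([ell * x[0], L * x[1]])

alpha = 2.0 / (ell + L)              # constant step from Theorem 4.3
x = np.array([1.0, 1.0])
errors = []
for _ in range(50):
    x = x - alpha * grad(x)
    errors.append(np.linalg.norm(x)) # distance to the minimizer x* = 0

ratio = errors[-1] / errors[-2]      # geometric contraction factor per step
```

On this quadratic the observed factor is (L − ℓ)/(L + ℓ) = 9/11, at least as good as the guaranteed factor 1 − 2ℓ/(L + 3ℓ) = 11/13 in the theorem.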
87 / 106
Algorithm 15 Modified-Newton backtracking-Armijo linesearch method
1: Input: initial guess x0
2: Set k = 0
3: until ‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
4: Compute positive-definite matrix Bk from Hk using Algorithm 2.
5: Compute the search direction pk as the solution to Bkp = −gk.
6: Compute the stepsize αk using the backtracking-Armijo linesearch Algorithm 10.
7: Set xk+1 = xk + αkpk
8: Set k ← k + 1
9: end until
10: return approximate solution xk
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
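Algorithm 2 (the eigenvalue modification) is not reproduced in this section; one common strategy it could implement is clipping small or negative eigenvalues after a spectral decomposition, sketched below (the threshold `delta` is an illustrative choice):

```python
import numpy as np

def modified_newton_direction(H, g, delta=1e-8):
    """Solve B p = -g where B is H with eigenvalues clipped up to delta."""
    lam, Q = np.linalg.eigh(H)                        # spectral decomposition of H
    B = Q @ np.diag(np.maximum(lam, delta)) @ Q.T     # positive definite B
    return np.linalg.solve(B, -g)

# an indefinite Hessian: the raw Newton step can be an ascent direction
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([0.0, 1.0])
p_modified = modified_newton_direction(H, g)
p_newton = np.linalg.solve(H, -g)                     # raw Newton step, no modification
```

Here the raw Newton step satisfies gᵀp > 0 (an ascent direction), while the modified direction satisfies gᵀp < 0, as the linesearch requires.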
89 / 106
Theorem 4.4 (Global convergence of modified-Newton backtracking-Armijo)

Suppose that

f ∈ C1

g is Lipschitz continuous on Rn

the eigenvalues of Bk are uniformly bounded away from zero and infinity

Then, the iterates generated by the modified-Newton backtracking-Armijo Algorithm 15 satisfy one of the following:

1 finite termination, i.e., gk = 0 for some finite k

2 objective is unbounded below, i.e., lim_{k→∞} fk = −∞

3 convergence of gradients, i.e., lim_{k→∞} gk = 0

Proof: Let λmin(Bk) and λmax(Bk) be the smallest and largest eigenvalues of Bk. By assumption, there are bounds 0 < λmin ≤ λmax such that

λmin ≤ λmin(Bk) ≤ sᵀBks/‖s‖₂² ≤ λmax(Bk) ≤ λmax for any nonzero s.
90 / 106
From the previous slide, it follows that

λmax⁻¹ ≤ λmax(Bk)⁻¹ = λmin(Bk⁻¹) ≤ sᵀBk⁻¹s/‖s‖₂² ≤ λmax(Bk⁻¹) = λmin(Bk)⁻¹ ≤ λmin⁻¹ (18)

for any nonzero vector s. Using Bkpk = −gk and (18) we see that

|pkᵀgk| = |gkᵀBk⁻¹gk| ≥ λmax⁻¹‖gk‖₂² (19)

In addition, we may use (18) to see that

‖pk‖₂² = gkᵀBk⁻²gk ≤ λmin⁻²‖gk‖₂²,

which implies that
‖pk‖₂ ≤ λmin⁻¹‖gk‖₂.

It now follows from the previous inequality and (19) that

|pkᵀgk|/‖pk‖₂ ≥ (λmin/λmax)‖gk‖₂. (20)

Combining the previous inequality and (19) yields

min( |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² ) ≥ (‖gk‖₂²/λmax) min( 1, λmin²/λmax ). (21)

From Theorem 3.3 we know that lim_{k→∞} min( |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² ) = 0 and, combined with (21), we conclude that

lim_{k→∞} gk = 0.
91 / 106
Rate of convergence
Theorem 4.5 (Global convergence rate for modified-Newton)
Suppose f ∈ C1 and ∇f is Lipschitz continuous on Rn. Further assume that f(x) is bounded from below on Rn. Let {xk} be the iterates generated by the modified-Newton backtracking-Armijo linesearch Algorithm 15 such that Bk has eigenvalues uniformly bounded away from 0 and infinity. Then there exists a constant M such that for all T ≥ 1

min_{k=0,...,T} ‖∇f(xk)‖ ≤ M/√(T + 1).

Consequently, for any ε > 0, within ⌈(M/ε)²⌉ steps, we will see an iterate where the gradient has norm at most ε. In other words, we reach an "ε-stationary" point in O((1/ε)²) steps.

Proof: Almost identical to the steepest descent proof (Theorem 4.2); use (19) and (20) in (17).

REMARK: The same result holds for a constant step size, for an appropriately chosen constant.
92 / 106
Rate of convergence
Theorem 4.6 (Local quadratic convergence for modified-Newton)
Suppose that

f ∈ C2

H is Lipschitz continuous on Rn with constant γ

Suppose that iterates are generated from the modified-Newton backtracking-Armijo Algorithm 15 such that

αinit = 1

η ∈ (0, 1/2)

Bk = Hk whenever Hk is positive definite

If the sequence of iterates {xk} has a limit point x∗ at which H(x∗) is positive definite, then it follows that

(i) αk = 1 for all sufficiently large k,

(ii) the entire sequence {xk} converges to x∗, and

(iii) the rate of convergence is q-quadratic, i.e., there is a constant κ ≥ 0 and a natural number k0 such that for all k ≥ k0

‖xk+1 − x∗‖₂ ≤ κ‖xk − x∗‖₂²
93 / 106
Proof: By assumption we know that there is some subsequence K such that

lim_{k∈K} xk = x∗ and H(x∗) ≻ 0

Continuity implies that Hk is positive definite for all k ∈ K sufficiently large, so that

pkᵀHkpk ≥ (1/2)λmin(H(x∗))‖pk‖₂² for all k0 ≤ k ∈ K (22)

for some k0, where λmin(H(x∗)) is the smallest eigenvalue of H(x∗). This fact and Hkpk = −gk for k0 ≤ k ∈ K imply that

|pkᵀgk| = −pkᵀgk = pkᵀHkpk ≥ (1/2)λmin(H(x∗))‖pk‖₂² for all k0 ≤ k ∈ K. (23)

This also implies

|pkᵀgk|/‖pk‖₂ ≥ (1/2)λmin(H(x∗))‖pk‖₂. (24)

We now deduce from Theorem 3.3, (23), and (24) that

lim_{k∈K, k→∞} pk = 0.
94 / 106
The Mean Value Theorem implies that there exists zk between xk and xk + pk such that

f(xk + pk) = fk + pkᵀgk + (1/2)pkᵀH(zk)pk.

Lipschitz continuity of H(x) and the fact that Hkpk = −gk imply

f(xk + pk) − fk − (1/2)pkᵀgk = (1/2)(pkᵀgk + pkᵀH(zk)pk)
= (1/2)pkᵀ(H(zk) − Hk)pk
≤ (1/2)‖H(zk) − H(xk)‖₂‖pk‖₂² ≤ (1/2)γ‖pk‖₂³. (25)

Since η ∈ (0, 1/2) and pk → 0, we may pick k sufficiently large so that

2γ‖pk‖₂ ≤ λmin(H(x∗))(1 − 2η). (26)

Combining (25), (26), and (23) shows that

f(xk + pk) − fk ≤ (1/2)pkᵀgk + (1/2)γ‖pk‖₂³
≤ (1/2)pkᵀgk + (1/4)λmin(H(x∗))(1 − 2η)‖pk‖₂²
≤ (1/2)pkᵀgk − (1/2)(1 − 2η)pkᵀgk
= ηpkᵀgk

so that the unit step xk + pk satisfies the Armijo condition for all k ∈ K sufficiently large.
95 / 106
Note that (22) implies that

‖Hk⁻¹‖₂ ≤ 2/λmin(H(x∗)) for all k0 ≤ k ∈ K. (27)

The iteration gives

xk+1 − x∗ = xk − x∗ − Hk⁻¹gk = xk − x∗ − Hk⁻¹(gk − g(x∗))
= Hk⁻¹[ g(x∗) − gk − Hk(x∗ − xk) ].

But a Taylor approximation provides the estimate

‖g(x∗) − gk − Hk(x∗ − xk)‖₂ ≤ (γ/2)‖xk − x∗‖₂²

which implies

‖xk+1 − x∗‖₂ ≤ (γ/2)‖Hk⁻¹‖₂‖xk − x∗‖₂² ≤ (γ/λmin(H(x∗)))‖xk − x∗‖₂² (28)

Thus, we know that for k ∈ K sufficiently large, the unit step will be accepted and

‖xk+1 − x∗‖₂ ≤ (1/2)‖xk − x∗‖₂

which proves parts (i) and (ii) since the entire argument may be repeated. Finally, part (iii) then follows from part (ii) and (28) with the choice

κ = γ/λmin(H(x∗))

which completes the proof.
96 / 106
Figure: Contours for the objective function f(x, y) = 10(y − x²)² + (x − 1)², and the iterates generated by the modified-Newton backtracking-Armijo Algorithm 15.
97 / 106
Algorithm 16 Modified-Newton Wolfe linesearch method
1: Input: initial guess x0
2: Set k = 0
3: until ‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
4: Compute positive-definite matrix Bk from Hk using Algorithm 2.
5: Compute a search direction pk as the solution to Bkp = −gk.
6: Compute a stepsize αk that satisfies the Wolfe conditions (11a) and (11b).
7: Set xk+1 ← xk + αkpk
8: Set k ← k + 1
9: end until
10: return approximate solution xk
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
99 / 106
Algorithm 17 Quasi-Newton Wolfe linesearch method
1: Input: initial guess x0
2: Set k = 0
3: until ‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
4: Compute positive-definite matrix Bk from Bk−1 using the BFGS rule, or using Algorithm 3 with a seed matrix B0k.
5: Compute a search direction pk as the solution to Bkp = −gk.
6: Compute a stepsize αk that satisfies the Wolfe conditions (11a) and (11b).
7: Set xk+1 ← xk + αkpk
8: Set k ← k + 1
9: end until
10: return approximate solution xk
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
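The BFGS rule in step 4 can be sketched as follows (a minimal version; the quadratic example is an illustrative choice, for which the gradient difference y is available in closed form):

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of B given s = x_{k+1} - x_k and y = g_{k+1} - g_k.
    Requires the curvature condition y^T s > 0 (Theorem 3.6)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# one update starting from the identity seed, on f(x) = 0.5 x^T A x
A = np.array([[2.0, 0.0], [0.0, 4.0]])
x0, x1 = np.array([1.0, 1.0]), np.array([0.5, 0.25])
s = x1 - x0
y = A @ x1 - A @ x0          # y = g_{k+1} - g_k for this quadratic
B1 = bfgs_update(np.eye(2), s, y)
```

The update satisfies the secant equation B1 s = y exactly, and B1 stays symmetric positive definite whenever yᵀs > 0, which the Wolfe linesearch in step 6 guarantees.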
101 / 106
Theorem 4.7 (Global convergence of modified-Newton/quasi-Newton Wolfe)

Suppose that

f ∈ C1

g is Lipschitz continuous on Rn

the condition number of Bk is uniformly bounded by β

Then, the iterates generated by the modified/quasi-Newton Wolfe Algorithm 16 or 17 satisfy one of the following:

1 finite termination, i.e., gk = 0 for some finite k

2 objective is unbounded below, i.e., lim_{k→∞} fk = −∞

3 convergence of gradients, i.e., lim_{k→∞} gk = 0

Proof: Homework assignment.
102 / 106
Rate of convergence
Theorem 4.8 (Global convergence rate for modified-Newton/quasi-Newton Wolfe)

Suppose f ∈ C1 and ∇f is Lipschitz continuous on Rn. Further assume that f(x) is bounded from below on Rn. Let {xk} be the iterates generated by the modified/quasi-Newton Wolfe linesearch Algorithm 16 or 17, such that Bk has uniformly bounded condition number. Then there exists a constant M such that for all T ≥ 1

min_{k=0,...,T} ‖∇f(xk)‖ ≤ M/√(T + 1).

Consequently, for any ε > 0, within ⌈(M/ε)²⌉ steps, we will see an iterate where the gradient has norm at most ε. In other words, we reach an "ε-stationary" point in O((1/ε)²) steps.
103 / 106
Rate of convergence
Theorem 4.9 (Local superlinear convergence for quasi-Newton)
Suppose that

f ∈ C2

H is Lipschitz continuous on Rn

Suppose that iterates are generated from the quasi-Newton Wolfe Algorithm 17 such that 0 < c1 < c2 ≤ 1/2.

If the sequence of iterates {xk} has a limit point x∗ at which H(x∗) is positive definite and g(x∗) = 0, then it follows that

(i) the step length αk = 1 is valid for all sufficiently large k,

(ii) the entire sequence {xk} converges to x∗ superlinearly.
See also Theorems 3.6 and 3.7 from Nocedal-Wright.
104 / 106
A Summary
The problem
minimize_{x∈Rn} f(x)
where the objective function f : Rn → R
Solve instead a quadratic approximation at the current iterate xk, after choosing positive definite Bk, to get a search direction that is a descent direction:

pk = argmin_{p∈Rn} mkQ(p) := fk + gkᵀp + (1/2)pᵀBkp

Bk = identity matrix: Steepest Descent Direction.
Bk obtained from Hessian Hk by adjusting eigenvalues: Modified-Newton.
Bk obtained by updating at every iteration using current and previous gradient information: Quasi-Newton.
Decide on a step size strategy: Armijo backtracking linesearch, Wolfe backtracking linesearch, constant step size (HW assignment), O(1/√k) in iteration k ... many others.
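For positive definite Bk, the minimizer of the quadratic model above is the solution of Bkp = −gk, and it is automatically a descent direction; a small numeric check (the particular g and B below are illustrative choices):

```python
import numpy as np

# quadratic model m(p) = g^T p + 0.5 p^T B p (the constant f_k is dropped)
g = np.array([1.0, -2.0])
B = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite

p = np.linalg.solve(B, -g)                # stationarity: B p + g = 0
descent = g @ p                           # negative, so p is a descent direction
```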
105 / 106
A Summary
Steepest Descent | Modified/Quasi-Newton

Requires only function and gradient evaluations | May need Hessian evaluations as well (with spectral decompositions)

No overhead of solving a linear system in each iteration | Solves a linear system Bp = −g in each iteration

Global convergence to an ε-stationary point in O((1/ε)²) steps | Global convergence to an ε-stationary point in O((1/ε)²) steps

Linear local convergence: converges to within ε of a local minimum in O(log(1/ε)) steps | Superlinear (quadratic) local convergence: converges to within ε of a local minimum in o(log(1/ε)) (resp. O(log log(1/ε))) steps
106 / 106