Line-Search Methods for Smooth Unconstrained Optimization
Daniel P. Robinson
Department of Applied Mathematics and Statistics
Johns Hopkins University
September 17, 2020
1 / 106
Outline
1 Generic Linesearch Framework
2 Computing a descent direction p_k (search direction)
  - Steepest descent direction
  - Modified Newton direction
  - Quasi-Newton directions for medium-scale problems
  - Limited-memory quasi-Newton directions for large-scale problems
  - Linear CG method for large-scale problems
3 Choosing the step length α_k (linesearch)
  - Backtracking-Armijo linesearch
  - Wolfe conditions
  - Strong Wolfe conditions
4 Complete Algorithms
  - Steepest-descent backtracking-Armijo linesearch method
  - Modified Newton backtracking-Armijo linesearch method
  - Modified Newton linesearch method based on the Wolfe conditions
  - Quasi-Newton linesearch method based on the Wolfe conditions
2 / 106
The problem

      minimize_{x ∈ R^n}  f(x)

where the objective function f : R^n → R
- assume that f ∈ C^1 (sometimes C^2) and is Lipschitz continuous
- in practice this assumption may be violated, but the algorithms we develop may still work
- in practice it is very rare to be able to provide an explicit minimizer
- we consider iterative methods: given starting guess x_0, generate a sequence {x_k} for k = 1, 2, . . .
- AIM: ensure that (a subsequence) has some favorable limiting properties
  - satisfies first-order necessary conditions
  - satisfies second-order necessary conditions

Notation: f_k = f(x_k), g_k = g(x_k), H_k = H(x_k)
4 / 106
The basic idea
- consider descent methods, i.e., f(x_{k+1}) < f(x_k)
- calculate a search direction p_k from x_k
- ensure that this direction is a descent direction, i.e.,

      g_k^T p_k < 0  if  g_k ≠ 0

  so that, for small steps along p_k, the objective function f will be reduced
- calculate a suitable steplength α_k > 0 so that

      f(x_k + α_k p_k) < f_k

- the computation of α_k is the linesearch, and may itself require an iterative procedure
- the generic update for linesearch methods is given by

      x_{k+1} = x_k + α_k p_k
5 / 106
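The generic update above can be sketched in a few lines. This is a minimal illustration, not the deck's algorithm: the direction rule (steepest descent) and the sufficient-decrease acceptance test (which previews the Armijo condition developed later in the deck) are stand-in choices, and all names are mine.

```python
def generic_linesearch_method(f, grad, x0, eta=0.1, tau=0.5, tol=1e-8, max_iter=1000):
    """Generic linesearch framework: x_{k+1} = x_k + alpha_k * p_k."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if max(abs(gi) for gi in g) <= tol:
            break                                   # approximate first-order point
        p = [-gi for gi in g]                       # steepest-descent direction
        slope = sum(gi * pi for gi, pi in zip(g, p))  # g^T p < 0 (descent)
        alpha = 1.0
        # shrink alpha until a sufficient-decrease test holds (previews Armijo)
        while (f([xi + alpha * pi for xi, pi in zip(x, p)])
               > f(x) + eta * alpha * slope) and alpha > 1e-16:
            alpha *= tau
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x

# Example: minimize f(x) = x1^2 + 4*x2^2 starting from (2, 1)
f = lambda x: x[0] ** 2 + 4 * x[1] ** 2
grad = lambda x: [2 * x[0], 8 * x[1]]
xstar = generic_linesearch_method(f, grad, [2.0, 1.0])
```

Swapping in better direction and steplength rules (modified Newton, quasi-Newton, Wolfe linesearches) is exactly what the rest of the deck develops.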
Issue 1: steps might be too long
Figure: The objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = (−1)^{k+1} and steps α_k = 2 + 3/2^{k+1} from x_0 = 2.

decrease in f is not proportional to the size of the directional derivative!
6 / 106
Issue 2: steps might be too short
Figure: The objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = −1 and steps α_k = 1/2^{k+1} from x_0 = 2.

decrease in f is not proportional to the size of the directional derivative!
7 / 106
What is a practical linesearch?
- in the "early" days exact linesearches were used, i.e., pick α_k to minimize f(x_k + α p_k)
  - an exact linesearch requires univariate minimization
  - cheap if f is simple, e.g., a quadratic function
  - generally very expensive and not cost effective
  - an exact linesearch may not be much better than an approximate linesearch
- modern methods use inexact linesearches
  - ensure steps are neither too long nor too short
  - make sure that the decrease in f is proportional to the directional derivative
  - try to pick an "appropriate" initial stepsize for fast convergence
    - related to how the search direction p_k is computed
- the descent direction (search direction) is also important
  - "bad" directions may not converge at all
  - more typically, "bad" directions may converge very slowly
8 / 106
Definition 2.1 (Steepest descent direction)
For a differentiable function f, the search direction

      p_k := −∇f(x_k) ≡ −g_k

is called the steepest-descent direction.

- p_k is a descent direction provided g_k ≠ 0
  - g_k^T p_k = −g_k^T g_k = −‖g_k‖_2^2 < 0
- p_k solves the problem

      minimize_{p ∈ R^n}  m_k^L(x_k + p)   subject to   ‖p‖_2 = ‖g_k‖_2

  - m_k^L(x_k + p) := f_k + g_k^T p
  - m_k^L(x_k + p) ≈ f(x_k + p)

Any method that uses the steepest-descent direction is a method of steepest descent.
11 / 106
Observation: the steepest-descent direction is also the unique solution to

      minimize_{p ∈ R^n}  f_k + g_k^T p + (1/2) p^T p

- approximates the Hessian H_k by the identity matrix I
- how often is this a good idea?
- is it a surprise that convergence is typically very slow?

Idea: choose a positive-definite B_k and compute the search direction as the unique minimizer

      p_k = argmin_{p ∈ R^n}  m_k^Q(p) := f_k + g_k^T p + (1/2) p^T B_k p

- p_k satisfies B_k p_k = −g_k
- why must B_k be positive definite?
  - B_k ≻ 0 =⇒ m_k^Q is strictly convex =⇒ unique solution
  - if g_k ≠ 0, then p_k ≠ 0 and

        p_k^T g_k = −p_k^T B_k p_k < 0  =⇒  p_k is a descent direction

- pick an "intelligent" B_k that has "useful" curvature information
- if H_k ≻ 0 and we choose B_k = H_k, then p_k is the Newton direction. Awesome!
13 / 106
Question: How do we choose the positive-definite matrix B_k?

Ideally, B_k is chosen such that
- ‖B_k − H_k‖ is "small"
- B_k = H_k when H_k is "sufficiently" positive definite.

Comments:
- for the remainder of this section, we omit the suffix k and write H = H_k, B = B_k, and g = g_k.
- for a symmetric matrix A ∈ R^{n×n}, we use the matrix norm

      ‖A‖_2 = max_{1≤j≤n} |λ_j|

  with {λ_j} the eigenvalues of A.
- the spectral decomposition of H is given by H = VΛV^T, where

      V = (v_1 v_2 · · · v_n)   and   Λ = diag(λ_1, λ_2, . . . , λ_n)

  with Hv_j = λ_j v_j and λ_1 ≥ λ_2 ≥ · · · ≥ λ_n.
- H is positive definite if and only if λ_j > 0 for all j.
- computing the spectral decomposition is, generally, very expensive!
14 / 106
Method 1: small scale problems (n < 1000)
Algorithm 1 Compute modified Newton matrix B from H
1: input H
2: Choose β > 1, the desired bound on the condition number of B.
3: Compute the spectral decomposition H = VΛV^T.
4: if H = 0 then
5:   Set ε = 1
6: else
7:   Set ε = ‖H‖_2/β > 0
8: end if
9: Compute Λ̄ = diag(λ̄_1, λ̄_2, . . . , λ̄_n) with

      λ̄_j = { λ_j   if λ_j ≥ ε
            { ε     otherwise

10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

- replaces the eigenvalues of H that are "not positive enough" with ε
- Λ̄ = Λ + D, where D = diag(max(0, ε − λ_1), max(0, ε − λ_2), . . . , max(0, ε − λ_n)) ⪰ 0
- B = H + E, where E = VDV^T ⪰ 0
15 / 106
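Algorithm 1 is short to implement with an eigendecomposition. The following NumPy sketch is mine (the function name and default β are illustrative, not from the slides):

```python
import numpy as np

def modified_newton_matrix_v1(H, beta=1e3):
    """Method 1 (Algorithm 1 sketch): floor eigenvalues of H below eps at eps."""
    lam, V = np.linalg.eigh(H)            # spectral decomposition H = V diag(lam) V^T
    eps = 1.0 if np.all(lam == 0) else np.linalg.norm(H, 2) / beta
    lam_bar = np.where(lam >= eps, lam, eps)   # "not positive enough" eigenvalues -> eps
    return V @ np.diag(lam_bar) @ V.T

# Indefinite Hessian from Example 2.2: H = diag(1, -2), so eps = 2/beta = 0.002
H = np.diag([1.0, -2.0])
B = modified_newton_matrix_v1(H, beta=1e3)   # B = diag(1, 0.002), cond(B) = 500 <= beta
```

Note that when H is already sufficiently positive definite (all λ_j ≥ ε), the function returns H unchanged, matching the design goal B = H in that case.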
Question: What are the properties of the resulting search direction?

      p = −B^{−1}g = −VΛ̄^{−1}V^T g = −Σ_{j=1}^{n} (v_j^T g / λ̄_j) v_j

Taking norms and using the orthogonality of V gives

      ‖p‖_2^2 = ‖B^{−1}g‖_2^2 = Σ_{j=1}^{n} (v_j^T g)^2/λ̄_j^2
              = Σ_{λ_j ≥ ε} (v_j^T g)^2/λ_j^2 + Σ_{λ_j < ε} (v_j^T g)^2/ε^2

Thus, we conclude that

      ‖p‖_2 = O(1/ε)   provided v_j^T g ≠ 0 for at least one λ_j < ε

- the step p becomes unbounded as ε → 0 provided v_j^T g ≠ 0 for at least one λ_j < ε
- any indefinite matrix will generally satisfy this property
- the next method that we discuss is better!
16 / 106
Method 2: small scale problems (n < 1000)
Algorithm 2 Compute modified Newton matrix B from H
1: input H
2: Choose β > 1, the desired bound on the condition number of B.
3: Compute the spectral decomposition H = VΛV^T.
4: if H = 0 then
5:   Set ε = 1
6: else
7:   Set ε = ‖H‖_2/β > 0
8: end if
9: Compute Λ̄ = diag(λ̄_1, λ̄_2, . . . , λ̄_n) with

      λ̄_j = { λ_j   if λ_j ≥ ε
            { −λ_j  if λ_j ≤ −ε
            { ε     otherwise

10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

- replaces small eigenvalues λ_j of H with ε
- replaces "sufficiently negative" eigenvalues λ_j of H with −λ_j
- Λ̄ = Λ + D, where D = diag(max(0, −2λ_1, ε − λ_1), . . . , max(0, −2λ_n, ε − λ_n)) ⪰ 0
- B = H + E, where E = VDV^T ⪰ 0
17 / 106
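Method 2 differs from Method 1 only in the eigenvalue map: sufficiently negative eigenvalues are flipped in sign rather than floored at ε. A sketch under the same assumptions as before (names mine):

```python
import numpy as np

def modified_newton_matrix_v2(H, beta=1e3):
    """Method 2 (Algorithm 2 sketch): flip sufficiently negative eigenvalues,
    floor the small ones at eps."""
    lam, V = np.linalg.eigh(H)
    eps = 1.0 if np.all(lam == 0) else np.linalg.norm(H, 2) / beta
    lam_bar = np.where(lam >= eps, lam, np.where(lam <= -eps, -lam, eps))
    return V @ np.diag(lam_bar) @ V.T

# On Example 2.2's Hessian, the eigenvalue -2 is flipped to 2 rather than
# replaced by a tiny eps, so the resulting step stays a moderate length:
H = np.diag([1.0, -2.0])
B = modified_newton_matrix_v2(H)      # B = diag(1, 2)
```

This is why the slide's bound ‖E‖_2 ≤ 2 max(ε, |λ_n|) holds: the modification never moves an eigenvalue farther than twice its magnitude (or ε).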
Suppose that B = H + E is computed from Algorithm 2, so that

      E = B − H = V(Λ̄ − Λ)V^T

and, since V is orthogonal,

      ‖E‖_2 = ‖V(Λ̄ − Λ)V^T‖_2 = ‖Λ̄ − Λ‖_2 = max_j |λ̄_j − λ_j|

By definition

      λ̄_j − λ_j = { 0        if λ_j ≥ ε
                  { ε − λ_j   if −ε < λ_j < ε
                  { −2λ_j     if λ_j ≤ −ε

which implies that

      ‖E‖_2 = max_{1≤j≤n} { 0, ε − λ_j, −2λ_j }.

However, since λ_1 ≥ λ_2 ≥ · · · ≥ λ_n, we know that

      ‖E‖_2 = max { 0, ε − λ_n, −2λ_n }

- if λ_n ≥ ε, i.e., H is sufficiently positive definite, then E = 0 and B = H
- if λ_n < ε, then B ≠ H and it can be shown that ‖E‖_2 ≤ 2 max(ε, |λ_n|)
- regardless, B is well-conditioned by construction
18 / 106
Example 2.2
Consider

      g = (2, 4)^T   and   H = [ 1  0 ]
                               [ 0 −2 ]

The Newton direction is

      p^N = −H^{−1}g = (−2, 2)^T

so that g^T p^N = 4 > 0 (ascent direction), and p^N is a saddle point of the quadratic
model g^T p + (1/2) p^T Hp since H is indefinite.

Algorithm 1 (Method 1) generates

      B = [ 1 0 ]   so that   p = −B^{−1}g = (−2, −4/ε)^T
          [ 0 ε ]

and g^T p = −4 − 16/ε < 0 (descent direction)

Algorithm 2 (Method 2) generates

      B = [ 1 0 ]   so that   p = −B^{−1}g = (−2, −2)^T
          [ 0 2 ]

and g^T p = −12 < 0 (descent direction)
19 / 106
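The numbers in Example 2.2 are easy to verify numerically. The sketch below reproduces the three directions; the value ε = 0.01 is an illustrative choice of mine, not from the slides:

```python
import numpy as np

# Numerical check of Example 2.2: the Newton direction is an ascent direction,
# while both modified-Newton directions are descent directions.
g = np.array([2.0, 4.0])
H = np.array([[1.0, 0.0], [0.0, -2.0]])
eps = 0.01                                   # illustrative value for Method 1

pN = -np.linalg.solve(H, g)                  # Newton direction (-2, 2)
B1 = np.diag([1.0, eps])                     # Method 1 modification of H
B2 = np.diag([1.0, 2.0])                     # Method 2 modification of H
p1 = -np.linalg.solve(B1, g)                 # (-2, -4/eps): huge for small eps
p2 = -np.linalg.solve(B2, g)                 # (-2, -2): moderate length

print(g @ pN)   # > 0: ascent
print(g @ p1)   # = -4 - 16/eps < 0: descent, but nearly unbounded as eps -> 0
print(g @ p2)   # = -12 < 0: descent
```

This makes concrete the earlier ‖p‖_2 = O(1/ε) complaint about Method 1 and why Method 2 is preferred.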
Figure: the Newton direction p^N versus the modified-Newton directions p produced by Method 1 and by Method 2.
Question: What is the geometric interpretation?
Answer: Let the spectral decomposition of H be H = VΛV^T and assume that λ_j ≠ 0 for
all j. Partition the eigenvector and eigenvalue matrices as

      Λ = [ Λ_+     ]   and   V = [ V_+  V_− ]
          [     Λ_− ]

where Λ_+ collects the eigenvalues λ_j > 0 (with eigenvectors V_+) and Λ_− collects the
eigenvalues λ_j < 0 (with eigenvectors V_−), so that

      H = [ V_+  V_− ] [ Λ_+     ] [ V_+^T ]
                       [     Λ_− ] [ V_−^T ]

and the Newton direction is

      p^N = −VΛ^{−1}V^T g = −[ V_+  V_− ] [ Λ_+^{−1}         ] [ V_+^T g ]
                                          [         Λ_−^{−1} ] [ V_−^T g ]
          = −V_+ Λ_+^{−1} V_+^T g − V_− Λ_−^{−1} V_−^T g
          := p^N_+ + p^N_−

where

      p^N_+ = −V_+ Λ_+^{−1} V_+^T g   and   p^N_− = −V_− Λ_−^{−1} V_−^T g
21 / 106
Result
Given m_k^Q(p) = g^T p + (1/2) p^T Hp with H nonsingular, the Newton direction

      p^N = p^N_+ + p^N_− = −V_+ Λ_+^{−1} V_+^T g − V_− Λ_−^{−1} V_−^T g

is such that
- p^N_+ minimizes m_k^Q(p) on the subspace spanned by V_+.
- p^N_− maximizes m_k^Q(p) on the subspace spanned by V_−.
22 / 106
Proof:
The function ψ(y) := m_k^Q(V_+ y) may be written as

      ψ(y) = g^T V_+ y + (1/2) y^T V_+^T H V_+ y

and has a stationary point y* given by

      y* = −(V_+^T H V_+)^{−1} V_+^T g

Since

      ∇²ψ(y) = V_+^T H V_+ = V_+^T VΛV^T V_+ = Λ_+ ≻ 0

we know that y* is, in fact, the unique minimizer of ψ(y) and

      V_+ y* = −V_+ Λ_+^{−1} V_+^T g ≡ p^N_+

For the second part,

      y* = −(V_−^T H V_−)^{−1} V_−^T g = −Λ_−^{−1} V_−^T g

so that

      V_− y* = −V_− Λ_−^{−1} V_−^T g ≡ p^N_−.

Since Λ_− ≺ 0, we know that p^N_− maximizes m_k^Q(p) on the space spanned by the
columns of V_−.
23 / 106
Some observations
- p^N_+ is a descent direction for f:

      g^T p^N_+ = −g^T V_+ Λ_+^{−1} V_+^T g = −y^T Λ_+^{−1} y ≤ 0

  where y := V_+^T g. The inequality is strict if y ≠ 0, i.e., g is not orthogonal to the
  column space of V_+.
- p^N_− is an ascent direction for f:

      g^T p^N_− = −g^T V_− Λ_−^{−1} V_−^T g = −y^T Λ_−^{−1} y ≥ 0

  where y := V_−^T g. The inequality is strict if y ≠ 0, i.e., g is not orthogonal to the
  column space of V_−.
- p^N = p^N_+ + p^N_− may or may not be a descent direction!

This is the geometry of the Newton direction. What about the modified Newton direction?
24 / 106
Question: what is the geometry of the modified-Newton step?
Answer: Partition the matrices in the spectral decomposition H = VΛV^T such that

      V = [V_+  V_ε  V_−]   and   Λ = diag(Λ_+, Λ_ε, Λ_−)

where

      Λ_+ = {λ_i : λ_i ≥ ε}   Λ_ε = {λ_i : |λ_i| < ε}   Λ_− = {λ_i : λ_i ≤ −ε}

Theorem 2.3 (Properties of the modified-Newton direction)
Suppose that the positive-definite matrix B_k is computed from Algorithm 2 with input H_k
and value ε > 0. If g_k ≠ 0 and p_k is the unique solution to B_k p = −g_k, then p_k may be
written as p_k = p_+ + p_ε + p_−, where
(i) p_+ is a direction of positive curvature that minimizes m_k^Q(p) in the space spanned
by the columns of V_+;
(ii) −(p_−) is a direction of negative curvature that maximizes m_k^Q(p) in the space
spanned by the columns of V_−; and
(iii) p_ε is a multiple of the steepest-descent direction in the space spanned by the
columns of V_ε with ‖p_ε‖_2 = O(1/ε).

Comments:
- implies that singular Hessians H_k may easily cause numerical difficulties
- the direction p_− is the negative of the p^N_− associated with the Newton step
25 / 106
Main goals of quasi-Newton methods
- Maintain a positive-definite "approximation" B_k to H_k
- Instead of computing B_{k+1} from scratch, e.g., from a spectral decomposition of
  H_{k+1}, compute B_{k+1} as an update to B_k using information at x_{k+1}
- inject "real" curvature information from H(x_{k+1}) into B_{k+1}
- B_k should be cheap to form and compute with
- be applicable for medium-scale problems (n ≈ 1,000–10,000)
- Hope that fast (superlinear) convergence of the iterates {x_k} computed using B_k
  can be recovered
27 / 106
The optimization problem solved in the (k+1)st iteration will be

      minimize_{p ∈ R^n}  m_{k+1}^Q(p) := f_{k+1} + g_{k+1}^T p + (1/2) p^T B_{k+1} p

How do we choose B_{k+1}? How about to satisfy the following:
- the model should agree with f at x_{k+1}
- the gradient of the model should agree with the gradient of f at x_{k+1}
- the gradient of the model should agree with the gradient of f at x_k (previous point)

These are equivalent to
1. m_{k+1}^Q(0) = f_{k+1}   (already satisfied)
2. ∇m_{k+1}^Q(0) = g_{k+1}   (already satisfied)
3. ∇m_{k+1}^Q(−α_k p_k) = g_k, i.e.,

      B_{k+1}(−α_k p_k) + g_{k+1} = g_k  ⇐⇒  B_{k+1}(x_{k+1} − x_k) = g_{k+1} − g_k

Thus, we aim to cheaply compute a symmetric positive-definite matrix B_{k+1} that
satisfies the so-called secant equation

      B_{k+1} s_k = y_k   where   s_k := x_{k+1} − x_k   and   y_k := g_{k+1} − g_k
28 / 106
Observation: If the matrix B_{k+1} is positive definite, then the curvature condition

      s_k^T y_k > 0                                                        (1)

is satisfied.
- using the fact that B_{k+1} s_k = y_k, we see that

      s_k^T y_k = s_k^T B_{k+1} s_k

  which must be positive if B_{k+1} ≻ 0
- the previous relation explains why (1) is called a curvature condition
- (1) will hold for any two points x_k and x_{k+1} if f is strictly convex
- (1) does not always hold for any two points x_k and x_{k+1} if f is nonconvex
- for nonconvex functions, we need to be careful how we choose the α_k that defines
  x_{k+1} = x_k + α_k p_k (see Wolfe conditions for linesearch strategies)

Note: for the remainder, I will drop the subscript k from all terms.
29 / 106
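The claim that (1) always holds for strictly convex f is easy to see for a quadratic: if f(x) = (1/2)x^T Ax with A ≻ 0, then y = g(x₁) − g(x₀) = As, so s^T y = s^T As > 0 for any s ≠ 0. A quick numerical check (the matrix A is an arbitrary SPD example of mine):

```python
import numpy as np

# For a strictly convex quadratic f(x) = 1/2 x^T A x with A symmetric positive
# definite, the curvature condition s^T y > 0 holds for ANY pair of points,
# since y = g(x1) - g(x0) = A s.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])       # symmetric positive definite
for _ in range(100):
    x0, x1 = rng.standard_normal(2), rng.standard_normal(2)
    s = x1 - x0
    y = A @ x1 - A @ x0                      # gradient difference
    if s @ s > 0:
        assert s @ y > 0                     # s^T y = s^T A s > 0
```

For nonconvex f (e.g., the indefinite H of Example 2.2) the same computation can produce s^T y < 0, which is precisely why the Wolfe conditions matter later.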
The problem
Given a symmetric positive-definite matrix B, and vectors s and y, find a symmetric
positive-definite matrix B̄ that satisfies the secant equation

      B̄s = y

Want a cheap update, so let us first consider a rank-one update of the form

      B̄ = B + uv^T

for some vectors u and v that we seek.

Derivation:
Case 1: Bs = y
Choose u = v = 0, since then

      B̄s = (B + uv^T)s = Bs + u(v^T s) = y

Case 2: Bs ≠ y
If u = 0 or v = 0 then

      B̄s = (B + uv^T)s = Bs + u(v^T s) = Bs ≠ y

So search for u ≠ 0 and v ≠ 0. Symmetry of B and the desired symmetry of B̄ imply that

      B̄ = B̄^T  ⇐⇒  B + uv^T = B^T + vu^T  ⇐⇒  uv^T = vu^T

But uv^T = vu^T, u ≠ 0, and v ≠ 0 imply that v = αu for some α ≠ 0 (exercise). Thus,
we should search for an α ≠ 0 and u ≠ 0 such that the secant condition is satisfied by

      B̄ = B + αuu^T
30 / 106
From the previous slide, we hope to find an α ≠ 0 and u ≠ 0 so that

      B̄ = B + αuu^T

satisfies

      y (want) = B̄s = Bs + αuu^T s = Bs + α(u^T s)u

which implies

      α(u^T s)u = y − Bs ≠ 0.

This implies that

      u = β(y − Bs)   for some β ≠ 0.

Plugging back in, we are searching for

      B̄ = B + αβ²(y − Bs)(y − Bs)^T = B + γ(y − Bs)(y − Bs)^T   for some γ ≠ 0

that satisfies the secant equation

      y (want) = B̄s = Bs + γ[(y − Bs)^T s](y − Bs)

which implies

      y − Bs = γ[(y − Bs)^T s](y − Bs)

which means we can choose

      γ = 1/[(y − Bs)^T s]   provided (y − Bs)^T s ≠ 0.
31 / 106
Definition 2.4 (SR1 quasi-Newton formula)
If (y − Bs)^T s ≠ 0 for some symmetric matrix B and vectors s and y, then the symmetric
rank-1 (SR1) quasi-Newton formula for updating B is

      B̄ = B + (y − Bs)(y − Bs)^T / [(y − Bs)^T s]

and satisfies
- B̄ is symmetric
- B̄s = y (secant condition)

In practice, we choose some κ ∈ (0, 1) and then use

      B̄ = { B + (y − Bs)(y − Bs)^T/[(y − Bs)^T s]   if |s^T(y − Bs)| > κ‖s‖_2‖y − Bs‖_2
          { B                                        otherwise

to avoid numerical error.

Question: Is B̄ positive definite if B is positive definite?
Answer: Sometimes . . . but sometimes not!
32 / 106
Example 2.5 (SR1 update is not necessarily positive definite)
Consider the SR1 update with

      B = [ 2 1 ]   s = (−1, −1)^T   and   y = (−3, 2)^T
          [ 1 1 ]

The matrix B is positive definite with eigenvalues ≈ {0.38, 2.6}. Moreover,

      y^T s = (−3)(−1) + (2)(−1) = 1 > 0

so that the necessary condition for B̄ to be positive definite holds. The SR1 update
gives

      y − Bs = (−3, 2)^T − (−3, −2)^T = (0, 4)^T

      (y − Bs)^T s = (0)(−1) + (4)(−1) = −4

      B̄ = [ 2 1 ] + (1/(−4)) [ 0 0  ] = [ 2  1 ]
          [ 1 1 ]            [ 0 16 ]   [ 1 −3 ]

The matrix B̄ is indefinite with eigenvalues ≈ {2.19, −3.19}.
33 / 106
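The SR1 formula with the slide's skipping safeguard fits in a few lines, and Example 2.5's data makes a good test case. This is a sketch; the function name and default κ are mine:

```python
import numpy as np

def sr1_update(B, s, y, kappa=1e-8):
    """SR1 update with the skipping safeguard from Definition 2.4."""
    r = y - B @ s
    denom = r @ s
    if abs(denom) <= kappa * np.linalg.norm(s) * np.linalg.norm(r):
        return B.copy()                       # skip the update to avoid numerical error
    return B + np.outer(r, r) / denom

# Example 2.5: a positive-definite B can produce an indefinite SR1 update.
B = np.array([[2.0, 1.0], [1.0, 1.0]])
s = np.array([-1.0, -1.0])
y = np.array([-3.0, 2.0])
B_new = sr1_update(B, s, y)
print(B_new)                                  # [[2, 1], [1, -3]]
print(np.linalg.eigvalsh(B_new))              # approx [-3.19, 2.19]: indefinite
```

Note that the secant condition B̄s = y holds even though positive definiteness is lost, which is exactly the behavior the slides warn about.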
Theorem 2.6 (convergence of SR1 update)
Suppose that
- f is C² on R^n
- lim_{k→∞} x_k = x* for some vector x*
- ∇²f is bounded and Lipschitz continuous in a neighborhood of x*
- the steps s_k are uniformly linearly independent
- |s_k^T(y_k − B_k s_k)| > κ‖s_k‖_2‖y_k − B_k s_k‖_2 for some κ ∈ (0, 1)

Then, the matrices generated by the SR1 formula satisfy

      lim_{k→∞} B_k = ∇²f(x*)

Comment: the SR1 updates may converge to an indefinite matrix
34 / 106
Some comments
- if B is symmetric, then the SR1 update B̄ (when it exists) is symmetric
- if B is positive definite, then the SR1 update B̄ (when it exists) is not necessarily
  positive definite
- linesearch methods require positive-definite matrices
- SR1 updating is not appropriate (in general) for linesearch methods
- SR1 updating is an attractive option for optimization methods that effectively utilize
  indefinite matrices (e.g., trust-region methods)
- to derive an updating strategy that produces positive-definite matrices, we have to
  look beyond rank-one updates. Would a rank-two update work?
35 / 106
For hereditary positive definiteness we must consider updates of at least rank two.

Result
U is a symmetric matrix of rank two if and only if

      U = βuu^T + δww^T,

for some nonzero β and δ, with u and w linearly independent.

If

      B̄ = B + U

for some rank-two matrix U and B̄s = y for some vectors s and y, then

      B̄s = (B + U)s = y,

and it follows that

      y − Bs = Us = β(u^T s)u + δ(w^T s)w
36 / 106
Let us make the "obvious" choices of

      u = Bs   and   w = y

which gives the rank-two update

      U = βBs(Bs)^T + δyy^T

and we now search for β and δ so that B̄s = y is satisfied. Substituting, we have

      y (want) = B̄s = (B + βBs(Bs)^T + δyy^T)s
               = Bs + β(s^T Bs)Bs + δ(y^T s)y

Making the "simple" choice of

      β = −1/(s^T Bs)   and   δ = 1/(y^T s)

gives

      B̄s = Bs − (s^T Bs)Bs/(s^T Bs) + (y^T s)y/(y^T s) = Bs − Bs + y = y

as desired. This gives the Broyden, Fletcher, Goldfarb, Shanno (BFGS) quasi-Newton
update formula.
37 / 106
Broyden, Fletcher, Goldfarb, Shanno (BFGS) update
Given a symmetric positive-definite matrix B and vectors s and y that satisfy y^T s > 0,
the quasi-Newton BFGS update formula is given by

      B̄ = B − Bs(Bs)^T/(s^T Bs) + yy^T/(y^T s)                            (2)

Result
If B is positive definite and y^T s > 0, then

      B̄ = B − Bs(Bs)^T/(s^T Bs) + yy^T/(y^T s)

is positive definite and satisfies B̄s = y.

Proof: Homework assignment. (hint: see the next slide)
38 / 106
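Formula (2) is a one-liner in NumPy. The sketch below checks the two advertised properties (secant condition, hereditary positive definiteness) on random data; the way y is generated (y = As for an SPD "model Hessian" A, guaranteeing y^T s > 0) is an assumption of mine for testing purposes:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update (2); requires y^T s > 0 and B symmetric positive definite."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)                  # symmetric positive definite
A = M @ M.T + np.eye(4)                      # model Hessian, also SPD
s = rng.standard_normal(4)
y = A @ s                                    # then y^T s = s^T A s > 0
B_new = bfgs_update(B, s, y)
print(np.allclose(B_new @ s, y))             # True: secant condition
print(np.all(np.linalg.eigvalsh(B_new) > 0)) # True: positive definiteness preserved
```

The secant check also retraces the algebra on the previous slide: the β and δ choices make the Bs terms cancel exactly.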
A second derivation of the BFGS formula

Let M = B^{−1} for some given symmetric positive-definite matrix B. Solve

      minimize_{M̄ ∈ R^{n×n}}  ‖M̄ − M‖_W   subject to   M̄ = M̄^T,  M̄y = s

for a "closest" symmetric matrix M̄, where

      ‖A‖_W = ‖W^{1/2} A W^{1/2}‖_F   and   ‖C‖_F² = Σ_{i=1}^{n} Σ_{j=1}^{n} c_{ij}²

for any weighting matrix W that satisfies Wy = s. One such choice is W = G^{−1} with

      G = ∫_0^1 ∇²f(x + τs) dτ

(the average Hessian along the step, which satisfies Gs = y). Then the solution is

      M̄ = (I − sy^T/(y^T s)) M (I − ys^T/(y^T s)) + ss^T/(y^T s).

Computing the inverse B̄ = M̄^{−1} using the Sherman-Morrison-Woodbury formula gives

      B̄ = B − Bs(Bs)^T/(s^T Bs) + yy^T/(y^T s)   (the BFGS formula again)

so that B̄ ≻ 0 if and only if M̄ ≻ 0.
39 / 106
Recall: given the positive-definite matrix B and vectors s and y satisfying s^T y > 0, the
BFGS formula (2) may be written as

      B̄ = B − aa^T + bb^T

where

      a = Bs/(s^T Bs)^{1/2}   and   b = y/(y^T s)^{1/2}

Limited-memory BFGS update
Suppose at iteration k we have m previous pairs (s_i, y_i) for i = k−m, . . . , k−1. Then
the limited-memory BFGS (L-BFGS) update with memory m and positive-definite seed
matrix B_k^0 may be written in the form

      B_k = B_k^0 + Σ_{i=k−m}^{k−1} [ b_i b_i^T − a_i a_i^T ]

for some vectors a_i and b_i that may be recovered via an unrolling formula (next slide).
- the positive-definite seed matrix B_k^0 is often chosen to be diagonal
- essentially performs m BFGS updates to the seed matrix B_k^0
- a different seed matrix may be used for each iteration k
  - the added flexibility allows "better" seed matrices to be used as iterations proceed
41 / 106
Algorithm 3 Unrolling the L-BFGS formula
1: input m ≥ 1, {(s_i, y_i)} for i = k−m, . . . , k−1, and B_k^0 ≻ 0
2: for i = k−m, . . . , k−1 do
3:   b_i ← y_i/(y_i^T s_i)^{1/2}
4:   a_i ← B_k^0 s_i + Σ_{j=k−m}^{i−1} [ (b_j^T s_i)b_j − (a_j^T s_i)a_j ]
5:   a_i ← a_i/(s_i^T a_i)^{1/2}
6: end for

- typically 3 ≤ m ≤ 30
- b_i and b_j^T s_i should be saved and reused from previous iterations (the exceptions
  are b_{k−1} and b_j^T s_{k−1} because they depend on the most recent data obtained)
- with the previous savings in computation and assuming B_k^0 = I, the total cost to
  get the a_i, b_i vectors is approximately (3/2)m²n
- a matrix-vector product with B_k then requires 4mn multiplications
- other so-called compact representations are slightly more efficient
42 / 106
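A direct way to sanity-check Algorithm 3 is to verify that unrolling reproduces m sequential dense BFGS updates of the seed matrix. The sketch below does exactly that (function names mine; y = As with A SPD is an assumption used only to guarantee y^T s > 0):

```python
import numpy as np

def lbfgs_unrolled(B0, pairs):
    """Algorithm 3: recover a_i, b_i by unrolling, then B = B0 + sum(b b^T - a a^T)."""
    a_list, b_list = [], []
    for s, y in pairs:
        b = y / np.sqrt(y @ s)
        a = B0 @ s
        for aj, bj in zip(a_list, b_list):
            a += (bj @ s) * bj - (aj @ s) * aj
        a = a / np.sqrt(s @ a)
        a_list.append(a)
        b_list.append(b)
    B = B0.copy()
    for a, b in zip(a_list, b_list):
        B += np.outer(b, b) - np.outer(a, a)
    return B

def bfgs_update(B, s, y):
    """Dense BFGS update written as B_bar = B - aa^T + bb^T."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(2)
n, m = 5, 3
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # model Hessian (SPD)
pairs = [(s, A @ s) for s in rng.standard_normal((m, n))]      # y = A s, so y^T s > 0
B0 = np.eye(n)
B_seq = B0.copy()
for s, y in pairs:
    B_seq = bfgs_update(B_seq, s, y)
print(np.allclose(lbfgs_unrolled(B0, pairs), B_seq))           # True
```

The inner sum in `lbfgs_unrolled` rebuilds B_i s_i from the seed matrix and the earlier (a_j, b_j) pairs, which is where the approximately (3/2)m²n cost quoted on the slide comes from.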
The problem of interest
Given a symmetric positive-definite matrix A, solve the linear system

      Ax = b

which is equivalent to finding the unique minimizer of

      minimize_{x ∈ R^n}  q(x) := (1/2) x^T Ax − b^T x                     (3)

because ∇q(x) = Ax − b and ∇²q(x) = A ≻ 0 imply that

      ∇q(x) = Ax − b = 0  ⇐⇒  Ax = b

- We seek a method that cheaply computes approximate solutions to the system
  Ax = b, or equivalently approximately minimizes (3).
- We do not want to compute a factorization of A, because it is assumed to be too
  expensive. We are interested in very large-scale problems.
- We allow ourselves to compute matrix-vector products, i.e., Av for any vector v.
- Why do we care about this? We will see soon!

Notation: r(x) := Ax − b and r_k := Ax_k − b.
44 / 106
q(x) = (1/2) x^T Ax − b^T x

Algorithm 4 Coordinate descent algorithm for minimizing q
1: input initial guess x_0 ∈ R^n and a symmetric positive-definite matrix A ∈ R^{n×n}
2: loop
3:   for i = 0, 1, . . . , n−1 do
4:     Compute α_i as the solution to

           minimize_{α ∈ R}  q(x_i + α e_{i+1})                            (4)

5:     x_{i+1} ← x_i + α_i e_{i+1}
6:   end for
7:   x_0 ← x_n
8: end loop

- e_j represents the jth coordinate vector, i.e., e_j = (0, . . . , 0, 1, 0, . . . , 0)^T ∈ R^n
  with the 1 in the jth position
- Is it easy to solve (4)? Yes! (exercise)
- Does this work?
45 / 106
Figure: Coordinate descent converges in n steps when the contour axes are aligned with the coordinate directions e_i. Very good!
Figure: Coordinate descent does not converge in n steps when the contour axes are not aligned with the coordinate directions e_i. Looks like it will be very slow!

Fact: the axes are determined by the eigenvectors of A.
Observation: coordinate descent converges in no more than n steps if the eigenvectors
align with the coordinate directions, i.e., V = I where V is the eigenvector matrix of A.
Implication: Using the spectral decomposition of A,

      A = VΛV^T = IΛI^T = Λ   (a diagonal matrix)

Conclusion: Coordinate descent converges in ≤ n steps when A is diagonal.
46 / 106
First thought: This is only good for diagonal matrices. That sucks!
Second thought: Can we use a transformation of variables so that the transformed
Hessian is diagonal?
- Let S be a square nonsingular matrix and define the transformed variables
  y = S^{−1}x ⇐⇒ Sy = x.
- Consider the transformed quadratic function q̄ defined as

      q̄(y) := q(Sy) ≡ (1/2) y^T (S^T AS) y − (S^T b)^T y = (1/2) y^T Āy − b̄^T y

  with Ā := S^T AS and b̄ := S^T b, so that a minimizer x* of q satisfies x* = Sy*
  where y* is a minimizer of q̄.
- If Ā = S^T AS is diagonal, then applying the coordinate descent Algorithm 4 to q̄
  would converge to y* in ≤ n steps.
- Ā is diagonal if and only if

      s_i^T As_j = 0 for all i ≠ j,   where   S = (s_0 s_1 . . . s_{n−1})

- Line 5 of Algorithm 4 (in the transformed variables) is y_{i+1} = y_i + α_i e_{i+1}.
  Multiplying by S and using Sy = x leads to

      x_{i+1} = Sy_{i+1} = S(y_i + α_i e_{i+1}) = Sy_i + α_i Se_{i+1} = x_i + α_i s_i
47 / 106
Definition 2.7 (A-conjugate directions)
A set of nonzero vectors {s_0, s_1, . . . , s_l} is said to be conjugate with respect to the
symmetric positive-definite matrix A (A-conjugate for short) if

      s_i^T As_j = 0 for all i ≠ j.

- We want to find A-conjugate vectors.
- The eigenvectors of A are A-conjugate, since the spectral decomposition gives

      A = VΛV^T  ⇐⇒  V^T AV = Λ   (diagonal matrix of eigenvalues)

- Computing the spectral decomposition is too expensive for large-scale problems.
- We want to find A-conjugate vectors cheaply!
- Let us look at a general algorithm.
48 / 106
Algorithm 5 General A-conjugate direction algorithm
1: input symmetric positive-definite matrix A ∈ R^{n×n}, and vectors x_0 and b ∈ R^n.
2: Set r_0 ← Ax_0 − b and k ← 0.
3: Choose direction s_0.
4: while r_k ≠ 0 do
5:   Compute α_k ← argmin_{α ∈ R} q(x_k + αs_k)
6:   Set x_{k+1} ← x_k + α_k s_k.
7:   Set r_{k+1} ← Ax_{k+1} − b.
8:   Pick new A-conjugate direction s_{k+1}.
9:   Set k ← k + 1.
10: end while

- How do we compute α_k?
- How do we compute the A-conjugate directions efficiently?

Theorem 2.8 (general A-conjugate algorithm converges)
For any x_0 ∈ R^n, the general A-conjugate direction Algorithm 5 computes the solution
x* of the system Ax = b in at most n steps.

Proof: Theorem 5.1 in Nocedal and Wright.
49 / 106
Exercise: show that the α_k that solves the minimization problem on line 5 of
Algorithm 5 satisfies

      α_k = −(s_k^T r_k)/(s_k^T As_k).

Before discussing how we compute the A-conjugate directions {s_i}, let us look at some
of their properties, assuming we already have them.

Theorem 2.9 (Expanding subspace minimization)
Let x_0 ∈ R^n be any starting point and suppose that {x_k} is the sequence of iterates
generated by the A-conjugate direction Algorithm 5 for a set of A-conjugate directions
{s_k}. Then,

      r_k^T s_i = 0 for all i = 0, . . . , k−1

and x_k is the unique minimizer of q(x) = (1/2) x^T Ax − b^T x over the set

      {x : x = x_0 + span(s_0, s_1, . . . , s_{k−1})}

Proof: See Theorem 5.2 in Nocedal and Wright.
50 / 106
Question: If {s_i}_{i=0}^{k−1} are A-conjugate, how do we compute s_k to be A-conjugate?
Answer: Theorem 2.9 shows that r_k is orthogonal to the previously explored space. To
keep it simple and hopefully cheap, let us try

      s_k = −r_k + β s_{k−1}   for some β.                                  (5)

If this is going to work, then the A-conjugacy condition and (5) require that

      0 = s_{k−1}^T As_k
        = s_{k−1}^T A(−r_k + βs_{k−1})
        = −s_{k−1}^T Ar_k + βs_{k−1}^T As_{k−1}

and solving for β gives

      β = (s_{k−1}^T Ar_k)/(s_{k−1}^T As_{k−1}).

With this choice of β, the vector s_k is A-conjugate! . . . but how do we get started?
Choosing s_0 = −r_0, i.e., a steepest-descent direction, makes sense.

This gives us a complete algorithm: the linear CG algorithm!
51 / 106
Algorithm 6 Preliminary linear CG algorithm
1: input symmetric positive-definite matrix A ∈ R^{n×n}, and vectors x_0 and b ∈ R^n.
2: Set r_0 ← Ax_0 − b, s_0 ← −r_0, and k ← 0.
3: while r_k ≠ 0 do
4:   Set α_k ← −(s_k^T r_k)/(s_k^T As_k).
5:   Set x_{k+1} ← x_k + α_k s_k.
6:   Set r_{k+1} ← Ax_{k+1} − b.
7:   Set β_{k+1} ← (s_k^T Ar_{k+1})/(s_k^T As_k).
8:   Set s_{k+1} ← −r_{k+1} + β_{k+1} s_k.
9:   Set k ← k + 1.
10: end while

- This preliminary version requires two matrix-vector products per iteration.
- By using some more "tricks" and algebra, we can reduce this to a single
  matrix-vector multiplication per iteration.
52 / 106
Algorithm 7 Linear CG algorithm
1: input symmetric positive-definite matrix A ∈ R^{n×n}, and vectors x_0 and b ∈ R^n.
2: Set r_0 ← Ax_0 − b, s_0 ← −r_0, and k ← 0.
3: while r_k ≠ 0 do
4:   Set α_k ← (r_k^T r_k)/(s_k^T As_k).
5:   Set x_{k+1} ← x_k + α_k s_k.
6:   Set r_{k+1} ← r_k + α_k As_k.
7:   Set β_{k+1} ← (r_{k+1}^T r_{k+1})/(r_k^T r_k).
8:   Set s_{k+1} ← −r_{k+1} + β_{k+1} s_k.
9:   Set k ← k + 1.
10: end while

- Only requires the single matrix-vector multiplication As_k.
- More practical to use a stopping condition like ‖r_k‖ ≤ 10^{−8} max(1, ‖r_0‖_2)
- Required computation per iteration:
  - 2 inner products
  - 3 vector sums
  - 1 matrix-vector multiplication
- Ideal for large (sparse) matrices A
- Converges quickly if cond(A) is small or the eigenvalues are "clustered"
  - convergence can be accelerated by using preconditioning
53 / 106
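Algorithm 7 translates almost line for line into NumPy. This sketch adds the relative stopping rule from the slide and, as a floating-point precaution of mine, allows a few more than n iterations (exact arithmetic terminates in at most n steps):

```python
import numpy as np

def linear_cg(A, b, x0, tol=1e-10):
    """Algorithm 7 sketch: linear CG with one A-product per iteration."""
    x = x0.astype(float).copy()
    r = A @ x - b
    s = -r
    stop = tol * max(1.0, np.linalg.norm(r))    # relative stopping rule
    k = 0
    while np.linalg.norm(r) > stop and k < 3 * len(b):
        As = A @ s                              # the single matrix-vector product
        alpha = (r @ r) / (s @ As)
        x = x + alpha * s
        r_new = r + alpha * As
        beta = (r_new @ r_new) / (r @ r)
        s = -r_new + beta * s
        r = r_new
        k += 1
    return x, k

rng = np.random.default_rng(3)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                     # symmetric positive definite
b = rng.standard_normal(n)
x, iters = linear_cg(A, b, np.zeros(n))
print(np.allclose(A @ x, b))                    # True
```

Note only `A @ s` touches the matrix; everything else is vector work, which is what makes the method attractive for large sparse systems.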
We want to use CG to compute an approximate solution p to

      minimize_{p ∈ R^n}  f + g^T p + (1/2) p^T Hp

where H may be indefinite (even singular), and we must ensure that p is a descent direction.

Algorithm 8 Newton CG algorithm for computing descent directions
1: input symmetric matrix H ∈ R^{n×n} and vector g
2: Set p_0 = 0, r_0 ← g, s_0 ← −g, and k ← 0.
3: while r_k ≠ 0 do
4:   if s_k^T Hs_k > 0 then
5:     Set α_k ← (r_k^T r_k)/(s_k^T Hs_k).
6:   else
7:     if k = 0 then return p_k = −g else return p_k end if
8:   end if
9:   Set p_{k+1} ← p_k + α_k s_k.
10:  Set r_{k+1} ← r_k + α_k Hs_k.
11:  Set β_{k+1} ← (r_{k+1}^T r_{k+1})/(r_k^T r_k).
12:  Set s_{k+1} ← −r_{k+1} + β_{k+1} s_k.
13:  Set k ← k + 1.
14: end while
15: return p_k

Made the replacements x ← p, A ← H, and b ← −g in Algorithm 7.
54 / 106
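A sketch of Algorithm 8, exercised on the indefinite Hessian of Example 2.2 (function name and tolerance mine):

```python
import numpy as np

def newton_cg_direction(H, g, tol=1e-10):
    """Algorithm 8 sketch: CG on the quadratic model, bailing out on
    nonpositive curvature so that the returned p is always a descent direction."""
    n = len(g)
    p = np.zeros(n)
    r = g.copy()
    s = -g.copy()
    k = 0
    while np.linalg.norm(r) > tol and k < n:
        Hs = H @ s
        if s @ Hs <= 0:                       # nonpositive curvature detected
            return -g if k == 0 else p        # steepest descent if it happens at k = 0
        alpha = (r @ r) / (s @ Hs)
        p = p + alpha * s
        r = r + alpha * Hs
        beta = (r @ r) / ((r - alpha * Hs) @ (r - alpha * Hs))
        s = -r + beta * s
        k += 1
    return p

# Even with an indefinite Hessian, the returned p is a descent direction (Theorem 2.10):
H = np.array([[1.0, 0.0], [0.0, -2.0]])       # indefinite H from Example 2.2
g = np.array([2.0, 4.0])
p = newton_cg_direction(H, g)
print(g @ p < 0)                              # True
```

On this example negative curvature is hit immediately (s_0^T Hs_0 = −28), so the method falls back to the steepest-descent direction −g.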
Theorem 2.10 (Newton CG algorithm computes descent directions)
If g ≠ 0 and Algorithm 8 terminates on iteration K, then every direction p_k produced by
the algorithm satisfies p_k^T g < 0 for 1 ≤ k ≤ K.

Proof:
Case 1: Algorithm 8 terminates on iteration K = 0.
The returned direction is −g, and it is immediate that g^T(−g) = −‖g‖_2² < 0 since g ≠ 0.

Case 2: Algorithm 8 terminates on iteration K ≥ 1.
First observe from p_0 = 0, s_0 = −g, r_0 = g, g ≠ 0, and lines 9 and 5 of Algorithm 8
that

      g^T p_1 = g^T(p_0 + α_0 s_0) = α_0 g^T s_0 = [(r_0^T r_0)/(s_0^T Hs_0)] g^T s_0
              = −‖g‖_2⁴/(g^T Hg) < 0.                                      (6)

By unrolling the "p" update, using the A-conjugacy of {s_i}, and the definition of r_k, we
get

      s_k^T r_k = s_k^T(Hp_k + g) = s_k^T g + s_k^T H Σ_{j=0}^{k−1} α_j s_j = s_k^T g.

The previous line, lines 9 and 5 of Algorithm 8, and the fact that s_k^T r_k = −‖r_k‖_2²
give

      g^T p_{k+1} = g^T p_k + α_k g^T s_k = g^T p_k + α_k s_k^T r_k
                  = g^T p_k + [(r_k^T r_k)/(s_k^T Hs_k)] s_k^T r_k
                  = g^T p_k − ‖r_k‖_2⁴/(s_k^T Hs_k) < g^T p_k.

Combining the previous inequality and (6) results in

      g^T p_{k+1} < g^T p_k < · · · < g^T p_1 = −‖g‖_2⁴/(g^T Hg) < 0.
55 / 106
Algorithm 9 Backtracking
1: input x_k, p_k
2: Choose α_init > 0 and τ ∈ (0, 1)
3: Set α^(0) = α_init and l = 0
4: until f(x_k + α^(l) p_k) "<" f(x_k)
5:   Set α^(l+1) = τα^(l)
6:   Set l ← l + 1
7: end until
8: return α_k = α^(l)

- typical choices (not always): α_init = 1 and τ = 1/2
- this prevents the step from getting too small
- it does not prevent the step from being too long
- the decrease in f may not be proportional to the directional derivative
- we need to tighten the requirement that f(x_k + α^(l) p_k) "<" f(x_k)
58 / 106
The Armijo condition
Given a point x_k and search direction p_k, we say that α_k satisfies the Armijo condition if

      f(x_k + α_k p_k) ≤ f(x_k) + ηα_k g_k^T p_k                           (7)

for some η ∈ (0, 1), e.g., η = 0.1 or even η = 0.0001

Purpose: ensure the decrease in f is substantial relative to the directional derivative.
Figure: the curves f(x_k + αp_k), f(x_k) + αg_k^T p_k, and f(x_k) + αηg_k^T p_k as functions of α, illustrating the Armijo condition.
59 / 106
Algorithm 10 A backtracking-Armijo linesearch
1: input x_k, p_k
2: Choose α_init > 0, η ∈ (0, 1), and τ ∈ (0, 1)
3: Set α^(0) = α_init and l = 0
4: until f(x_k + α^(l) p_k) ≤ f(x_k) + ηα^(l) g_k^T p_k
5:   Set α^(l+1) = τα^(l)
6:   Set l ← l + 1
7: end until
8: return α_k = α^(l)

- Does this always terminate?
- Does it give us what we want?
60 / 106
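Algorithm 10 is a few lines of code. In the sketch below the caller passes the directional derivative g_k^T p_k as `slope` (a design choice of mine; the slides leave the interface unspecified):

```python
def backtracking_armijo(f, x, p, slope, alpha_init=1.0, eta=0.1, tau=0.5):
    """Algorithm 10 sketch: shrink alpha until the Armijo condition (7) holds.
    `slope` is the directional derivative g_k^T p_k and must be negative."""
    alpha = alpha_init
    fx = f(x)
    while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + eta * alpha * slope:
        alpha *= tau
    return alpha

# On f(x) = x^2 at x = 2 with p = -g = -4 (so slope = g*p = -16):
f = lambda x: x[0] ** 2
x, p, slope = [2.0], [-4.0], -16.0
alpha = backtracking_armijo(f, x, p, slope)
print(alpha)   # 0.5: alpha = 1 fails (7), alpha = 0.5 lands at the minimizer x = 0
```

This run also illustrates the bound of the next theorem: here γ = 2 and α_max = 2(η − 1)g^T p/(γ‖p‖_2²) = 0.9, and the returned α = 0.5 indeed exceeds τα_max = 0.45.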
Theorem 3.1 (Satisfying the Armijo condition)
Suppose that
- f ∈ C¹ and g(x) is Lipschitz continuous with Lipschitz constant γ(x)
- p is a descent direction at x

Then for any given η ∈ (0, 1), the Armijo condition

      f(x + αp) ≤ f(x) + ηα g(x)^T p

is satisfied for all α ∈ [0, α_max(x)], where

      α_max(x) := 2(η − 1)g(x)^T p / (γ(x)‖p‖_2²) > 0

Proof:
A Taylor approximation and

      α ≤ 2(η − 1)g(x)^T p / (γ(x)‖p‖_2²)

imply that

      f(x + αp) ≤ f(x) + αg(x)^T p + (1/2)γ(x)α²‖p‖_2²
                ≤ f(x) + αg(x)^T p + α(η − 1)g(x)^T p
                = f(x) + αη g(x)^T p
61 / 106
Theorem 3.2 (Bound on α_k when using an Armijo backtracking linesearch)
Suppose that
- f ∈ C¹ and g(x) is Lipschitz continuous with Lipschitz constant γ_k at x_k
- p_k is a descent direction at x_k

Then for any chosen η ∈ (0, 1), the stepsize α_k generated by the backtracking-Armijo
linesearch terminates with

      α_k ≥ min( α_init, 2τ(η − 1)g_k^T p_k / (γ_k‖p_k‖_2²) ) ≡ min( α_init, τα_max(x_k) )

where τ ∈ (0, 1) is the backtracking parameter.

Proof:
Theorem 3.1 implies that the backtracking will terminate as soon as α^(l) ≤ α_max(x_k).
Consider the following two cases:
- α_init satisfies the Armijo condition: it is then clear that α_k = α_init.
- α_init does not satisfy the Armijo condition: thus, there must be a last linesearch
  iterate (the l-th) for which

      α^(l) > α_max(x_k)  =⇒  α_k = α^(l+1) = τα^(l) > τα_max(x_k)

Combining the two cases gives the required result.
62 / 106
Algorithm 11 Generic backtracking-Armijo linesearch method
1: input initial guess x_0
2: Set k = 0
3: loop
4:   Find a descent direction p_k at x_k.
5:   Compute the stepsize α_k using the backtracking-Armijo linesearch Algorithm 10.
6:   Set x_{k+1} = x_k + α_k p_k
7:   Set k ← k + 1
8: end loop
9: return approximate solution x_k

- should include a sensible relative stopping tolerance such as
  ‖∇f(x_k)‖ ≤ 10^{−8} max(1, ‖∇f(x_0)‖)
- should also terminate if a maximum number of allowed iterations is reached
- should also terminate if a maximum time limit is reached
63 / 106
Theorem 3.3 (Global convergence of generic backtracking-Armijo linesearch)
Suppose that
- f ∈ C¹
- g is Lipschitz continuous on R^n with Lipschitz constant γ

Then, the iterates generated by the generic backtracking-Armijo linesearch method
Algorithm 11 must satisfy one of the following conditions:
1. finite termination, i.e.,

      g_k = 0 for some k ≥ 0

2. the objective is unbounded below, i.e.,

      lim_{k→∞} f_k = −∞

3. an angle-type condition between g_k and p_k holds in the limit, i.e.,

      lim_{k→∞} min( |p_k^T g_k|, |p_k^T g_k|²/‖p_k‖_2² ) = 0
64 / 106
Proof:
Suppose that outcomes 1 and 2 do not happen, i.e., g_k ≠ 0 for all k ≥ 0 and

      lim_{k→∞} f_k > −∞.                                                  (8)

The Armijo condition implies that

      f_{k+1} − f_k ≤ ηα_k p_k^T g_k

for all k. Summing over the first j iterations yields

      f_{j+1} − f_0 ≤ η Σ_{k=0}^{j} α_k p_k^T g_k                          (9)

Taking limits of the previous inequality as j → ∞ and using (8) implies that the LHS is
bounded below and, therefore, the RHS is also bounded below. Since the sum is
composed of all negative terms, we deduce that the sum

      Σ_{k=0}^{∞} α_k |p_k^T g_k|                                          (10)

is bounded.
65 / 106
From Theorem 3.2, αk ≥ min{ αinit, 2τ(η − 1)gkᵀpk / (γ‖pk‖₂²) }. Thus,

Σ_{k=0}^{∞} αk|pkᵀgk| ≥ Σ_{k=0}^{∞} min{ αinit, 2τ(η − 1)gkᵀpk / (γ‖pk‖₂²) } |pkᵀgk|
= Σ_{k=0}^{∞} min{ αinit|pkᵀgk|, 2τ(1 − η)|pkᵀgk|² / (γ‖pk‖₂²) }
≥ Σ_{k=0}^{∞} min{ αinit, 2τ(1 − η)/γ } min{ |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² }

Using the fact that the series (10) converges, we obtain that

lim_{k→∞} min( |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² ) = 0
66 / 106
Some observations

the Armijo condition prevents steps that are too "long"

typically, the Armijo condition is satisfied on very "large" intervals

the Armijo condition can be satisfied by steps that are not even close to a minimizer of φ(α) := f(xk + αpk)

Figure: the graph of φ(α) = f(xk + αpk) against the line l(α) = fk + c1αgkᵀpk; the "acceptable" step lengths are those α for which φ(α) lies below l(α).
68 / 106
Wolfe conditions
Given the current iterate xk, search direction pk, and constants 0 < c1 < c2 < 1, we say that the step length αk satisfies the Wolfe conditions if

f(xk + αkpk) ≤ f(xk) + c1αk∇f(xk)ᵀpk (11a)
∇f(xk + αkpk)ᵀpk ≥ c2∇f(xk)ᵀpk (11b)

condition (11a) is the Armijo condition (ensures "sufficient" decrease)
condition (11b) is a directional derivative condition (prevents steps that are too short)
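Conditions (11a)-(11b) are easy to check numerically. A small sketch (the quadratic test function is an illustrative choice, not taken from the slides):

```python
import numpy as np

def satisfies_wolfe(f, g, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions (11a) and (11b) at step length alpha."""
    gxp = g(x) @ p                                            # g_k^T p_k
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * gxp      # (11a)
    curvature = g(x + alpha * p) @ p >= c2 * gxp              # (11b)
    return armijo and curvature

# f(x) = 0.5 x^2: a tiny step passes Armijo but fails the curvature condition
f = lambda x: 0.5 * x @ x
g = lambda x: x
x, p = np.array([2.0]), np.array([-2.0])
full_step_ok = satisfies_wolfe(f, g, x, p, 1.0)     # accepted
tiny_step_ok = satisfies_wolfe(f, g, x, p, 1e-6)    # rejected by (11b)
```

This illustrates the remark above: (11b) is precisely what screens out steps that are too short.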
Figure: the graph of φ(α) = f(xk + αpk), the line l(α) = fk + c1αgkᵀpk, and the desired slope c2gkᵀpk; the "acceptable" step lengths are those α satisfying both Wolfe conditions.
69 / 106
Algorithm 12 Generic Wolfe linesearch method
1: input initial guess x0
2: Set k = 0
3: loop
4: Find a descent direction pk at xk.
5: Compute a stepsize αk that satisfies the Wolfe conditions (11).
6: Set xk+1 = xk + αkpk
7: Set k ← k + 1
8: end loop
9: return approximate solution xk
should include a sensible relative stopping tolerance such as
‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
70 / 106
Theorem 3.4 (Convergence of generic Wolfe linesearch (Zoutendijk))
Assume that

f : Rn → R is in C1 and bounded below on Rn

∇f(x) is Lipschitz continuous on Rn with constant L

It follows that the iterates generated from the generic Wolfe linesearch method satisfy

Σ_{k≥0} cos²(θk)‖gk‖₂² < ∞ (12)

where

cos(θk) := −gkᵀpk / (‖gk‖₂‖pk‖₂) (13)

Comments
definition (13) measures the angle between the gradient gk and search direction pk
I cos(θk) ≈ 0 implies gk and pk are nearly orthogonal
I cos(θk) ≈ 1 implies gk and pk are nearly parallel
condition (12) ensures that the gradient converges to zero provided gk and pk stay away from being arbitrarily close to orthogonal
Proof: See Theorem 3.2 in Nocedal and Wright.
71 / 106
Some comments
The Armijo linesearch
I only requires function evaluations
I may easily accept steps that are far from a univariate minimizer of φ(α) = f(xk + αpk)

The Wolfe linesearch
I requires function and gradient evaluations
I typically produces steps that are closer to the univariate minimizer of φ
I may still allow the acceptance of steps that are far from a univariate minimizer of φ
I "ideal" when search directions are computed from linear systems based on the BFGS quasi-Newton updating formula

Note: we have not yet said how to compute steps that satisfy the Wolfe conditions. We have not even shown that such steps exist! (coming soon)
72 / 106
Strong Wolfe conditions
Given the current iterate xk, search direction pk, and constants 0 < c1 < c2 < 1, we say that the step length αk satisfies the strong Wolfe conditions if

f(xk + αkpk) ≤ f(xk) + c1αk∇f(xk)ᵀpk (14a)
|∇f(xk + αkpk)ᵀpk| ≤ c2|∇f(xk)ᵀpk| (14b)
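Compared with (11b), the absolute values in (14b) also rule out steps that overshoot so far that the slope of φ becomes strongly positive. A sketch of a checker (the quadratic test function is an illustrative choice):

```python
import numpy as np

def satisfies_strong_wolfe(f, g, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions (14a) and (14b) at step length alpha."""
    gxp = g(x) @ p
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * gxp      # (14a)
    curvature = abs(g(x + alpha * p) @ p) <= c2 * abs(gxp)    # (14b)
    return armijo and curvature

# f(x) = 0.5 x^2 from x = 2 with p = -2: the exact minimizer of phi is alpha = 1
f = lambda x: 0.5 * x @ x
g = lambda x: x
x, p = np.array([2.0]), np.array([-2.0])
```

With these values, α = 1 is accepted, while α = 1e-6 (too short) and α = 1.95 (overshoots, nearly reversing the slope) are both rejected by (14b) even though the latter still satisfies (14a).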
Figure: the graph of φ(α) = f(xk + αpk), the line l(α) = fk + c1αgkᵀpk, and the desired slope; the "acceptable" intervals are smaller than for the (weak) Wolfe conditions.
74 / 106
Theorem 3.5
Suppose that f : Rn → R is in C1 and pk is a descent direction at xk, and assume that f is bounded below along the ray {xk + αpk : α > 0}. It follows that if 0 < c1 < c2 < 1, then there exist nontrivial intervals of step lengths that satisfy the strong Wolfe conditions (14).
Proof: The function φ(α) := f(xk + αpk) is bounded below over α > 0 by assumption, while the line l(α) := f(xk) + αc1∇f(xk)ᵀpk is unbounded below since c1 ∈ (0, 1); hence φ must intersect the graph of l at least once. Let αmin > 0 be the smallest such intersecting value, so that

f(xk + αminpk) = f(xk) + αminc1∇f(xk)ᵀpk. (15)

It is clear that (11a)/(14a) hold for all α ∈ (0, αmin]. The Mean Value Theorem now implies

f(xk + αminpk) − f(xk) = αmin∇f(xk + αMpk)ᵀpk for some αM ∈ (0, αmin).

Combining the last equality with (15) and the facts that c1 < c2 and ∇f(xk)ᵀpk < 0 yields

∇f(xk + αMpk)ᵀpk = c1∇f(xk)ᵀpk > c2∇f(xk)ᵀpk. (16)

Thus, αM satisfies (11a)/(14a) and (11b) with a strict inequality, so we may use the smoothness assumption to deduce that there is an interval around αM for which (11) and (14a) hold. Finally, since the left-hand side of (16) is negative, we also know that (14b) holds, which completes the proof.
75 / 106
Algorithm 13 A bisection type (weak) Wolfe linesearch
1: input xk, pk, 0 < c1 < c2 < 1
2: Choose α = 0, t = 1, and β = ∞
3: while ( f(xk + tpk) > f(xk) + c1·t·(gkᵀpk) OR ∇f(xk + tpk)ᵀpk < c2(gkᵀpk) ) do
4:   if f(xk + tpk) > f(xk) + c1·t·(gkᵀpk) then
5:     Set β := t
6:     Set t := (α + β)/2
7:   else
8:     α := t
9:     if β = ∞ then
10:      t := 2α
11:    else
12:      t := (α + β)/2
13:    end if
14:  end if
15: end while
16:
17: return t
See Section 3.5 in Nocedal-Wright for more discussion; in particular, a procedure for the strong Wolfe conditions.
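A direct Python transcription of Algorithm 13 (a sketch; the `max_iter` cap is an added safeguard not present in the slide):

```python
import numpy as np

def wolfe_bisection(f, g, xk, pk, c1=1e-4, c2=0.9, max_iter=100):
    """Bisection-type (weak) Wolfe linesearch following Algorithm 13."""
    alpha, t, beta = 0.0, 1.0, np.inf
    fk, slope = f(xk), g(xk) @ pk
    for _ in range(max_iter):
        if f(xk + t * pk) > fk + c1 * t * slope:     # Armijo fails: step too long
            beta = t
            t = 0.5 * (alpha + beta)
        elif g(xk + t * pk) @ pk < c2 * slope:       # curvature fails: step too short
            alpha = t
            t = 2 * alpha if beta == np.inf else 0.5 * (alpha + beta)
        else:
            return t                                 # both Wolfe conditions hold
    return t

# usage: f(x) = 5 x^2 from x = 1 with the steepest descent direction p = -10
f = lambda x: 5.0 * x @ x
g = lambda x: 10.0 * x
t = wolfe_bisection(f, g, np.array([1.0]), np.array([-10.0]))
```

The loop brackets an acceptable interval [α, β] and halves it, doubling t while no upper bound β has been found.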
76 / 106
Some comments
The Armijo linesearch
I only requires function evaluations
I computationally cheapest to implement
I may easily accept steps that are far from a univariate minimizer of φ(α) = f(xk + αpk)

The Wolfe linesearch
I requires function and gradient evaluations
I typically produces steps that are closer to the univariate minimizer of φ
I may still allow the acceptance of steps that are far from a univariate minimizer of φ
I computationally more expensive to implement compared to Armijo

The strong Wolfe linesearch
I requires function and gradient evaluations
I typically produces steps that are closer to the univariate minimizer of φ
I only approximate stationary points of the univariate function φ satisfy these conditions
I computationally most expensive to implement compared to Armijo

Why use (strong) Wolfe conditions?

sometimes a more accurate linesearch leads to fewer outer (major) iterations of the linesearch method

beneficial in the context of quasi-Newton methods for maintaining positive-definite matrix approximations
77 / 106
Theorem 3.6 (Wolfe conditions guarantee the BFGS update is well-defined)
Suppose that Bk ≻ 0 and that pk is a descent direction for f at xk. If a (strong) Wolfe linesearch is used to compute αk such that xk+1 = xk + αkpk, then

ykᵀsk > 0

where
sk = xk+1 − xk = αkpk and yk = gk+1 − gk.

Proof: It follows from (11b) that

∇f(xk + αkpk)ᵀpk ≥ c2∇f(xk)ᵀpk.

Multiplying both sides by αk, using xk+1 = xk + αkpk and the notation gk = ∇f(xk), yields

gk+1ᵀsk ≥ c2gkᵀsk.

Subtracting gkᵀsk from both sides gives

gk+1ᵀsk − gkᵀsk ≥ c2gkᵀsk − gkᵀsk

so that
ykᵀsk ≥ (c2 − 1)gkᵀsk = αk(c2 − 1)gkᵀpk > 0

since c2 ∈ (0, 1) and pk is a descent direction for f at xk.
78 / 106
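A quick numerical illustration of the theorem (a sketch: the quadratic objective and the doubling search for a step satisfying (11b) are illustrative choices; as in the proof, only the curvature condition is needed to force ykᵀsk > 0):

```python
import numpy as np

# f(x) = 0.5 x^T A x with A positive definite, so g(x) = A x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
g = lambda x: A @ x

xk = np.array([1.0, -1.0])
pk = -g(xk)                       # a descent direction
c2 = 0.9
slope = g(xk) @ pk                # g_k^T p_k < 0

# find a step satisfying the curvature condition (11b) by simple doubling
alpha = 1e-3
while g(xk + alpha * pk) @ pk < c2 * slope:
    alpha *= 2.0

sk = alpha * pk                   # s_k = x_{k+1} - x_k
yk = g(xk + sk) - g(xk)           # y_k = g_{k+1} - g_k
curvature_product = yk @ sk       # positive, as Theorem 3.6 guarantees
```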
Algorithm 14 Steepest descent backtracking-Armijo linesearch method
1: Input: initial guess x0
2: Set k = 0
3: until ‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
4: Set pk = −∇f(xk) as the search direction.
5: Compute the stepsize αk using the backtracking-Armijo linesearch Algorithm 10.
6: Set xk+1 = xk + αkpk
7: Set k ← k + 1
8: end until
9: return approximate solution xk
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
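A compact sketch of Algorithm 14 in Python (the backtracking loop is inlined; the values for η, τ, and the relative tolerance follow the slides' suggestions, and the test function is an illustrative choice):

```python
import numpy as np

def steepest_descent_armijo(f, g, x0, max_iter=10000, eta=1e-4, tau=0.5):
    """Steepest-descent with backtracking-Armijo (sketch of Algorithm 14)."""
    x = x0.copy()
    tol = 1e-8 * max(1.0, np.linalg.norm(g(x0)))   # relative stopping tolerance
    for _ in range(max_iter):
        if np.linalg.norm(g(x)) <= tol:
            break
        p = -g(x)                                  # steepest descent direction
        alpha, fx, slope = 1.0, f(x), g(x) @ p
        while f(x + alpha * p) > fx + eta * alpha * slope:
            alpha *= tau                           # backtrack
        x = x + alpha * p
    return x

# usage: a strictly convex quadratic whose minimizer is the origin
f = lambda x: 0.5 * (3.0 * x[0]**2 + x[1]**2)
g = lambda x: np.array([3.0 * x[0], x[1]])
xstar = steepest_descent_armijo(f, g, np.array([1.0, 1.0]))
```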
81 / 106
Theorem 4.1 (Global convergence for steepest descent)
If f ∈ C1 and g is Lipschitz continuous on Rn, then the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14 satisfy one of the following conditions:

1 finite termination, i.e., gk = 0 for some finite k ≥ 0

2 objective is unbounded below, i.e., lim_{k→∞} fk = −∞

3 convergence of gradients, i.e., lim_{k→∞} gk = 0

Proof: Since pk = −gk, if outcome 1 above does not happen, then we may conclude that

‖gk‖ = ‖pk‖ ≠ 0 for all k ≥ 0

Additionally, if outcome 2 above does not occur, then Theorem 3.3 ensures that

0 = lim_{k→∞} min( |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² ) = lim_{k→∞} ‖gk‖₂²

(both arguments of the min equal ‖gk‖₂² when pk = −gk), and thus lim_{k→∞} gk = 0.
82 / 106
Some general comments on the steepest descent backtracking-Armijo algorithm

archetypal globally convergent method

many other methods resort to steepest descent when things "go wrong"

not scale invariant

convergence is usually slow! (linear local convergence, possibly with constant close to 1; for global convergence only a sublinear rate O((1/ε)²) to get to a "stationary" point, i.e., ‖gk‖ ≤ ε)

in practice, often does not converge at all

slow convergence results because the computation of pk does not use any curvature information

we call any method that uses the steepest descent direction a "method of steepest descent"

each iteration is relatively cheap, but you may need many of them!
83 / 106
Rate of convergence
Theorem 4.2 (Global convergence rate for steepest-descent)
Suppose f ∈ C1 and ∇f is Lipschitz continuous on Rn. Further assume that f(x) is bounded from below on Rn. Let {xk} be the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14. Then there exists a constant M such that for all T ≥ 1

min_{k=0,...,T} ‖∇f(xk)‖ ≤ M/√(T + 1).

Consequently, for any ε > 0, within ⌈(M/ε)²⌉ steps, we will see an iterate where the gradient has norm at most ε. In other words, we reach an "ε-stationary" point in O((1/ε)²) steps.

REMARK: The same result holds for a constant step size, for an appropriately chosen constant. See HW.

Proof: Recall (9) in the proof of Theorem 3.3:

Σ_{k=0}^{T} αk|gkᵀpk| ≤ (f0 − fT+1)/η ≤ M′ since f is bounded from below.
84 / 106
Rate of convergence

Thus,

Σ_{k: ταmax < αinit} 2τ(1 − η)|gkᵀpk|² / (γ‖pk‖²) + Σ_{k: ταmax ≥ αinit} αinit|gkᵀpk| ≤ M′.

Thus,

Σ_{k: ταmax < αinit} |gkᵀpk|²/‖pk‖² + Σ_{k: ταmax ≥ αinit} |gkᵀpk| ≤ M′/C, (17)

where C = min{ 2τ(1 − η)/γ, αinit }.

Setting pk = −gk in the inequality above, we obtain that

(T + 1) min_{k=0,...,T} ‖gk‖² ≤ Σ_{k=0,...,T} ‖gk‖² ≤ M′/C.

Therefore,

min_{k=0,...,T} ‖gk‖ ≤ M/√(T + 1)

for the constant M = √(M′/C).
85 / 106
Steepest descent example
Figure: Contours for the objective function f(x, y) = 10(y − x²)² + (x − 1)², and the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14.
86 / 106
Rate of convergence
Theorem 4.3 (Local convergence rate for steepest-descent)
Suppose the following all hold:

f is twice continuously differentiable and ∇²f is Lipschitz continuous on Rn with Lipschitz constant M.

x∗ is a local minimum of f with positive definite Hessian ∇²f(x∗), with eigenvalues lower bounded by ℓ and upper bounded by L.

The initial starting iterate x0 is close enough to x∗; more precisely, r0 := ‖x0 − x∗‖ < r := 2ℓ/M.

Then steepest-descent with constant step size α = 2/(ℓ + L) converges as follows:

‖xk − x∗‖ ≤ (r·r0/(r − r0)) (1 − 2ℓ/(L + 3ℓ))^k.

In other words, in O(log(1/ε)) iterations, we are within a distance of ε from x∗.
Proof: See Theorem 1.2.4 in Nesterov’s book.
See also Theorem 3.4 in Nocedal-Wright for a similar result with exact line-search.
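The constant-stepsize behavior can be seen on a diagonal quadratic, where ℓ and L are exactly the extreme Hessian eigenvalues (a sketch; the specific values ℓ = 1, L = 10 are illustrative):

```python
import numpy as np

# f(x) = 0.5 (ell*x1^2 + L*x2^2): Hessian eigenvalues are ell and L
ell, L = 1.0, 10.0
grad = lambda x: np.array([ell * x[0], L * x[1]])

alpha = 2.0 / (ell + L)              # constant step from Theorem 4.3
x = np.array([1.0, 1.0])
errors = []
for _ in range(50):
    x = x - alpha * grad(x)
    errors.append(np.linalg.norm(x)) # distance to the minimizer x* = 0

ratio = errors[-1] / errors[-2]      # geometric contraction factor per step
```

On this quadratic the observed factor is (L − ℓ)/(L + ℓ) = 9/11, at least as good as the guaranteed factor 1 − 2ℓ/(L + 3ℓ) = 11/13 in the theorem.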
87 / 106
Algorithm 15 Modified-Newton backtracking-Armijo linesearch method
1: Input: initial guess x0
2: Set k = 0
3: until ‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
4: Compute positive-definite matrix Bk from Hk using Algorithm 2.
5: Compute the search direction pk as the solution to Bkp = −gk.
6: Compute the stepsize αk using the backtracking-Armijo linesearch Algorithm 10.
7: Set xk+1 = xk + αkpk
8: Set k ← k + 1
9: end until
10: return approximate solution xk
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
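Algorithm 2 (the eigenvalue modification) is not reproduced in this section; one common strategy it could implement is clipping small or negative eigenvalues after a spectral decomposition, sketched below (the threshold `delta` is an illustrative choice):

```python
import numpy as np

def modified_newton_direction(H, g, delta=1e-8):
    """Solve B p = -g where B is H with eigenvalues clipped up to delta."""
    lam, Q = np.linalg.eigh(H)                        # spectral decomposition of H
    B = Q @ np.diag(np.maximum(lam, delta)) @ Q.T     # positive definite B
    return np.linalg.solve(B, -g)

# an indefinite Hessian: the raw Newton step can be an ascent direction
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([0.0, 1.0])
p_modified = modified_newton_direction(H, g)
p_newton = np.linalg.solve(H, -g)                     # raw Newton step, no modification
```

Here the raw Newton step satisfies gᵀp > 0 (an ascent direction), while the modified direction satisfies gᵀp < 0, as the linesearch requires.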
89 / 106
Theorem 4.4 (Global convergence of modified-Newton backtracking-Armijo)

Suppose that

f ∈ C1

g is Lipschitz continuous on Rn

the eigenvalues of Bk are uniformly bounded away from zero and infinity

Then, the iterates generated by the modified-Newton backtracking-Armijo Algorithm 15 satisfy one of the following:

1 finite termination, i.e., gk = 0 for some finite k

2 objective is unbounded below, i.e., lim_{k→∞} fk = −∞

3 convergence of gradients, i.e., lim_{k→∞} gk = 0

Proof: Let λmin(Bk) and λmax(Bk) be the smallest and largest eigenvalues of Bk. By assumption, there are bounds 0 < λmin ≤ λmax such that

λmin ≤ λmin(Bk) ≤ sᵀBks/‖s‖₂² ≤ λmax(Bk) ≤ λmax for any nonzero s.
90 / 106
From the previous slide, it follows that

λmax⁻¹ ≤ λmax(Bk)⁻¹ = λmin(Bk⁻¹) ≤ sᵀBk⁻¹s/‖s‖₂² ≤ λmax(Bk⁻¹) = λmin(Bk)⁻¹ ≤ λmin⁻¹ (18)

for any nonzero vector s. Using Bkpk = −gk and (18) we see that

|pkᵀgk| = |gkᵀBk⁻¹gk| ≥ λmax⁻¹‖gk‖₂² (19)

In addition, we may use (18) to see that

‖pk‖₂² = gkᵀBk⁻²gk ≤ λmin⁻²‖gk‖₂²,

which implies that
‖pk‖₂ ≤ λmin⁻¹‖gk‖₂.

It now follows from the previous inequality and (19) that

|pkᵀgk|/‖pk‖₂ ≥ (λmin/λmax)‖gk‖₂. (20)

Combining the previous inequality and (19) yields

min( |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² ) ≥ (‖gk‖₂²/λmax) min( 1, λmin²/λmax ). (21)

From Theorem 3.3 we know that lim_{k→∞} min( |pkᵀgk|, |pkᵀgk|²/‖pk‖₂² ) = 0 and, combined with (21), we conclude that

lim_{k→∞} gk = 0.
91 / 106
Rate of convergence
Theorem 4.5 (Global convergence rate for modified-Newton)
Suppose f ∈ C1 and ∇f is Lipschitz continuous on Rn. Further assume that f(x) is bounded from below on Rn. Let {xk} be the iterates generated by the modified-Newton backtracking-Armijo linesearch Algorithm 15 such that Bk has eigenvalues uniformly bounded away from 0 and infinity. Then there exists a constant M such that for all T ≥ 1

min_{k=0,...,T} ‖∇f(xk)‖ ≤ M/√(T + 1).

Consequently, for any ε > 0, within ⌈(M/ε)²⌉ steps, we will see an iterate where the gradient has norm at most ε. In other words, we reach an "ε-stationary" point in O((1/ε)²) steps.

Proof: Almost identical to the steepest descent proof (Theorem 4.2); use (19) and (20) in (17).

REMARK: The same result holds for a constant step size, for an appropriately chosen constant.
92 / 106
Rate of convergence
Theorem 4.6 (Local quadratic convergence for modified-Newton)
Suppose that

f ∈ C2

H is Lipschitz continuous on Rn with constant γ

Suppose that iterates are generated from the modified-Newton backtracking-Armijo Algorithm 15 such that

αinit = 1

η ∈ (0, 1/2)

Bk = Hk whenever Hk is positive definite

If the sequence of iterates {xk} has a limit point x∗ at which H(x∗) is positive definite, then it follows that

(i) αk = 1 for all sufficiently large k,

(ii) the entire sequence {xk} converges to x∗, and

(iii) the rate of convergence is q-quadratic, i.e., there is a constant κ ≥ 0 and a natural number k0 such that for all k ≥ k0

‖xk+1 − x∗‖₂ ≤ κ‖xk − x∗‖₂²
93 / 106
Proof: By assumption we know that there is some subsequence K such that

lim_{k∈K} xk = x∗ and H(x∗) ≻ 0

Continuity implies that Hk is positive definite for all k ∈ K sufficiently large, so that

pkᵀHkpk ≥ (1/2)λmin(H(x∗))‖pk‖₂² for all k0 ≤ k ∈ K (22)

for some k0, where λmin(H(x∗)) is the smallest eigenvalue of H(x∗). This fact and Hkpk = −gk for k0 ≤ k ∈ K imply that

|pkᵀgk| = −pkᵀgk = pkᵀHkpk ≥ (1/2)λmin(H(x∗))‖pk‖₂² for all k0 ≤ k ∈ K. (23)

This also implies

|pkᵀgk|/‖pk‖₂ ≥ (1/2)λmin(H(x∗))‖pk‖₂. (24)

We now deduce from Theorem 3.3, (23), and (24) that

lim_{k∈K, k→∞} pk = 0.
94 / 106
The Mean Value Theorem implies that there exists zk between xk and xk + pk such that

f(xk + pk) = fk + pkᵀgk + (1/2)pkᵀH(zk)pk.

Lipschitz continuity of H(x) and the fact that Hkpk = −gk imply

f(xk + pk) − fk − (1/2)pkᵀgk = (1/2)(pkᵀgk + pkᵀH(zk)pk)
= (1/2)pkᵀ(H(zk) − Hk)pk
≤ (1/2)‖H(zk) − H(xk)‖₂‖pk‖₂² ≤ (1/2)γ‖pk‖₂³. (25)

Since η ∈ (0, 1/2) and pk → 0, we may pick k sufficiently large so that

2γ‖pk‖₂ ≤ λmin(H(x∗))(1 − 2η). (26)

Combining (25), (26), and (23) shows that

f(xk + pk) − fk ≤ (1/2)pkᵀgk + (1/2)γ‖pk‖₂³
≤ (1/2)pkᵀgk + (1/4)λmin(H(x∗))(1 − 2η)‖pk‖₂²
≤ (1/2)pkᵀgk − (1/2)(1 − 2η)pkᵀgk
= ηpkᵀgk

so that the unit step xk + pk satisfies the Armijo condition for all k ∈ K sufficiently large.
95 / 106
Note that (22) implies that

‖Hk⁻¹‖₂ ≤ 2/λmin(H(x∗)) for all k0 ≤ k ∈ K. (27)

The iteration gives

xk+1 − x∗ = xk − x∗ − Hk⁻¹gk = xk − x∗ − Hk⁻¹(gk − g(x∗))
= Hk⁻¹[ g(x∗) − gk − Hk(x∗ − xk) ].

But a Taylor approximation provides the estimate

‖g(x∗) − gk − Hk(x∗ − xk)‖₂ ≤ (γ/2)‖xk − x∗‖₂²

which implies

‖xk+1 − x∗‖₂ ≤ (γ/2)‖Hk⁻¹‖₂‖xk − x∗‖₂² ≤ (γ/λmin(H(x∗)))‖xk − x∗‖₂² (28)

Thus, we know that for k ∈ K sufficiently large, the unit step will be accepted and

‖xk+1 − x∗‖₂ ≤ (1/2)‖xk − x∗‖₂

which proves parts (i) and (ii) since the entire argument may be repeated. Finally, part (iii) then follows from part (ii) and (28) with the choice

κ = γ/λmin(H(x∗))

which completes the proof.
96 / 106
Figure: Contours for the objective function f(x, y) = 10(y − x²)² + (x − 1)², and the iterates generated by the modified-Newton backtracking-Armijo Algorithm 15.
97 / 106
Algorithm 16 Modified-Newton Wolfe linesearch method
1: Input: initial guess x0
2: Set k = 0
3: until ‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
4: Compute positive-definite matrix Bk from Hk using Algorithm 2.
5: Compute a search direction pk as the solution to Bkp = −gk.
6: Compute a stepsize αk that satisfies the Wolfe conditions (11a) and (11b).
7: Set xk+1 ← xk + αkpk
8: Set k ← k + 1
9: end until
10: return approximate solution xk
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
99 / 106
Algorithm 17 Quasi-Newton Wolfe linesearch method
1: Input: initial guess x0
2: Set k = 0
3: until ‖∇f(xk)‖ ≤ 10−8 max(1, ‖∇f(x0)‖)
4: Compute positive-definite matrix Bk from Bk−1 using the BFGS rule, or using Algorithm 3 with a seed matrix B0k.
5: Compute a search direction pk as the solution to Bkp = −gk.
6: Compute a stepsize αk that satisfies the Wolfe conditions (11a) and (11b).
7: Set xk+1 ← xk + αkpk
8: Set k ← k + 1
9: end until
10: return approximate solution xk
should also terminate if a maximum number of allowed iterations is reached
should also terminate if a maximum time limit is reached
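The BFGS rule in step 4 can be sketched as follows (a minimal version; the quadratic example is an illustrative choice, for which the gradient difference y is available in closed form):

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of B given s = x_{k+1} - x_k and y = g_{k+1} - g_k.
    Requires the curvature condition y^T s > 0 (Theorem 3.6)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# one update starting from the identity seed, on f(x) = 0.5 x^T A x
A = np.array([[2.0, 0.0], [0.0, 4.0]])
x0, x1 = np.array([1.0, 1.0]), np.array([0.5, 0.25])
s = x1 - x0
y = A @ x1 - A @ x0          # y = g_{k+1} - g_k for this quadratic
B1 = bfgs_update(np.eye(2), s, y)
```

The update satisfies the secant equation B1 s = y exactly, and B1 stays symmetric positive definite whenever yᵀs > 0, which the Wolfe linesearch in step 6 guarantees.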
101 / 106
Theorem 4.7 (Global convergence of modified-Newton/quasi-Newton Wolfe)

Suppose that

f ∈ C1

g is Lipschitz continuous on Rn

the condition number of Bk is uniformly bounded by β

Then, the iterates generated by the modified/quasi-Newton Wolfe Algorithm 16 or 17 satisfy one of the following:

1 finite termination, i.e., gk = 0 for some finite k

2 objective is unbounded below, i.e., lim_{k→∞} fk = −∞

3 convergence of gradients, i.e., lim_{k→∞} gk = 0

Proof: Homework assignment.
102 / 106
Rate of convergence
Theorem 4.8 (Global convergence rate for modified-Newton/quasi-Newton Wolfe)

Suppose f ∈ C1 and ∇f is Lipschitz continuous on Rn. Further assume that f(x) is bounded from below on Rn. Let {xk} be the iterates generated by the modified/quasi-Newton Wolfe linesearch Algorithm 16 or 17, such that Bk has uniformly bounded condition number. Then there exists a constant M such that for all T ≥ 1

min_{k=0,...,T} ‖∇f(xk)‖ ≤ M/√(T + 1).

Consequently, for any ε > 0, within ⌈(M/ε)²⌉ steps, we will see an iterate where the gradient has norm at most ε. In other words, we reach an "ε-stationary" point in O((1/ε)²) steps.
103 / 106
Rate of convergence
Theorem 4.9 (Local superlinear convergence for quasi-Newton)
Suppose that

f ∈ C2

H is Lipschitz continuous on Rn

Suppose that iterates are generated from the quasi-Newton Wolfe Algorithm 17 such that 0 < c1 < c2 ≤ 1/2.

If the sequence of iterates {xk} has a limit point x∗ at which H(x∗) is positive definite and g(x∗) = 0, then it follows that

(i) the step length αk = 1 is valid for all sufficiently large k,

(ii) the entire sequence {xk} converges to x∗ superlinearly.
See also Theorems 3.6 and 3.7 from Nocedal-Wright.
104 / 106
A Summary
The problem
minimize_{x∈Rn} f(x)
where the objective function f : Rn → R
Solve instead a quadratic approximation at the current iterate xk, after choosing positive definite Bk, to get a search direction that is a descent direction:

pk = argmin_{p∈Rn} mkQ(p) := fk + gkᵀp + (1/2)pᵀBkp

Bk = identity matrix: Steepest Descent Direction.
Bk obtained from Hessian Hk by adjusting eigenvalues: Modified-Newton.
Bk obtained by updating at every iteration using current and previous gradient information: Quasi-Newton.
Decide on a step size strategy: Armijo backtracking linesearch, Wolfe backtracking linesearch, constant step size (HW assignment), O(1/√k) in iteration k ... many others.
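For positive definite Bk, the minimizer of the quadratic model above is the solution of Bkp = −gk, and it is automatically a descent direction; a small numeric check (the particular g and B below are illustrative choices):

```python
import numpy as np

# quadratic model m(p) = g^T p + 0.5 p^T B p (the constant f_k is dropped)
g = np.array([1.0, -2.0])
B = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite

p = np.linalg.solve(B, -g)                # stationarity: B p + g = 0
descent = g @ p                           # negative, so p is a descent direction
```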
105 / 106
A Summary
Steepest Descent | Modified/Quasi-Newton

Requires only function and gradient evaluations | May need Hessian evaluations as well (with spectral decompositions)

No overhead of solving a linear system in each iteration | Solves a linear system Bp = −g in each iteration

Global convergence to an ε-stationary point in O((1/ε)²) steps | Global convergence to an ε-stationary point in O((1/ε)²) steps

Linear local convergence: converges to within ε of a local minimum in O(log(1/ε)) steps | Superlinear (quadratic) local convergence: converges to within ε of a local minimum in o(log(1/ε)) (resp. O(log log(1/ε))) steps
106 / 106